

1 The GRIM test: A simple technique detects numerous anomalies in the reporting of results in

2 psychology

3
4 Nicholas J. L. Brown (*)

5 University Medical Center, University of Groningen, The Netherlands

6
7 James A. J. Heathers

8 Division of Cardiology and Intensive Therapy, Poznań University of Medical Sciences

9 University of Sydney

10
11 (*) Corresponding author. E-mail: nick.brown@free.fr

12
13 Nick Brown is a PhD candidate at the University Medical Center, University of Groningen, The

14 Netherlands.

15 James Heathers conducted the bulk of the work described in the attached document while a

16 postdoctoral fellow at the Poznań University of Medical Sciences in Poland. He is currently a

17 postdoctoral fellow at Northeastern University.

18

19 Acknowledgements

20 The authors wish to thank Tim Bates and Chris Chambers for their helpful comments on an

earlier draft of this article, as well as those authors of the articles that we examined who kindly
provided their data sets and helped with our reanalysis of them.

24 Abstract

25 We present a simple mathematical technique that we call GRIM (Granularity-Related

26 Inconsistency of Means) for verifying the summary statistics of research reports in psychology.

27 This technique evaluates whether the reported means of integer data such as Likert-type scales

28 are consistent with the given sample size and number of items. We tested this technique with a

29 sample of 260 recent empirical articles in leading journals. Of the articles that we could test with

30 the GRIM technique (N=71), around half (N=36) appeared to contain at least one inconsistent

31 mean, and more than 20% (N=16) contained multiple such inconsistencies. We requested the

32 data sets corresponding to 21 of these articles, receiving positive responses in nine cases. We

33 confirmed the presence of at least one reporting error in all cases, with three articles requiring

34 extensive corrections. The implications for the reliability and replicability of empirical

35 psychology are discussed.

36 Consider the following (fictional) extract from a recent article in the Journal of Porcine Aviation

37 Potential:

38 Participants (N=55) were randomly assigned to drink 200ml of water that either contained

39 (experimental condition, N=28) or did not contain (control condition, N=27) 17g of

40 cherry flavor Kool-Aid® powder. Fifteen minutes after consuming the beverage,

41 participants responded to the question, “To what extent do you believe that pigs can fly?”

42 on a seven-point scale from 1 (Not at all) to 7 (Definitely). Participants in the “drank the

43 Kool-Aid” condition reported a significantly stronger belief in the ability of pigs to fly

44 (M=5.19, SD=1.34) than those in the control condition (M=3.86, SD=1.41), t(53)=3.59,

45 p<.001.

These results seem superficially reasonable, but are actually mathematically impossible. The
reported means represent either errors of transcription, some version of misreporting, or the
deliberate manipulation of results. Specifically, the mean of the 28 participants in the
experimental condition, reported as 5.19, cannot be correct. Since all responses were integers
between 1 and 7, the total of the response scores across all participants must fall in the range
28–196. The two integers that give a result closest to the reported mean of 5.19 are 145 and 146.
However, 145 divided by 28 is 5.17857142…, which conventional rounding returns as 5.18.
Likewise, 146 divided by 28 is 5.21428571…, which rounds to 5.21. That is, there is no
combination of responses that can give a mean of 5.19 when correctly rounded. Similar
considerations apply to the reported mean of 3.86 in the control condition: Multiplying this value
by the sample size (27) gives 104.22, suggesting that the total score across participants must
have been either 104 or 105. But 104 divided by 27 is 3.851851…, which rounds to 3.85, and 105
divided by 27 is 3.8888…, which rounds to 3.89.
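The arithmetic above can be verified by brute force. The following short Python sketch (illustrative only, not taken from the article) enumerates every mean that each cell could actually produce and confirms that neither reported value is among them.

# Enumerate the means attainable in each cell of the fictional study
# (integer responses from 1 to 7, one item per measure) and check whether
# the reported means appear among them.
for reported_mean, n in [(5.19, 28), (3.86, 27)]:
    attainable = sorted({round(total / n, 2) for total in range(1 * n, 7 * n + 1)})
    nearest = min(attainable, key=lambda m: abs(m - reported_mean))
    print(f"M = {reported_mean}, N = {n}: "
          f"consistent = {reported_mean in attainable}, nearest attainable = {nearest}")
# Output: both reported means are inconsistent; the nearest attainable
# values are 5.18 and 3.85, respectively, as described above.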

59 In this article, we first introduce the general background to and calculation of what we

term the Granularity-Related Inconsistency of Means (GRIM) test. Next, we report on the results of

61 an analysis using the GRIM test of a number of published articles from leading psychological

62 journals. Finally, we discuss the implications of these results for the published literature in

63 empirical psychology.

64

65 General description of the GRIM technique for reanalyzing published data

66 Participant response data collected in psychology are typically ordinal in nature—that is, the

recorded values have meaning in terms of their rank order, but the numbers representing them are

68 arbitrary, such that the value corresponding to any item has no significance beyond its ability to

69 establish a position on a continuum relative to the other numbers. For example, the seven-point

70 scale cited in our opening example, running from 1 to 7, could equally well have been coded

71 from 0 to 6, or from 6 to 0, or from 10 to 70 in steps of 10. However, while the limits of ordinal

72 data in measurement have been extensively discussed for many years (e.g., Carifio & Perla,

73 2007; Coombs, 1960; Jamieson, 2004; Thurstone, 1927), it remains common practice to treat

74 ordinal data composed of small integers as if they were measured on an interval scale, calculate

75 their means and standard deviations, and apply inferential statistics to those values. Other

76 common measures used in psychological research produce genuine interval-level data in the

77 form of integers; for example, one might count the number of anagrams unscrambled, or the

78 number of errors made on the Stroop test, within a given time interval. Thus, psychological data

79 often consist of integer totals, divided by the sample size.

80 One often-overlooked property of data derived from such non-continuous measures,

81 whether ordinal or interval, is their granularity—that is, the numerical separation between

82 possible values of the summary statistics. Here, we consider the example of the mean. With

83 typical Likert-type data, the smallest amount by which two means can differ is the reciprocal of

84 the product of the number of participants and the number of items (questions) that make up the

scale. For example, if we administer a three-item Likert-type measure to 10 people, the smallest
amount by which two mean scores can differ (the granularity of the mean) is 1/(10 × 3) ≈ 0.033.

87 If means are reported to two decimal places, then—although there are 100 possible numbers with

88 two decimal places in the range 1 ≤ X < 2 (1.00, 1.01, 1.02, etc., up to 1.99)—the possible

89 values of the (rounded) mean are considerably fewer (1.00, 1.03, 1.07, 1.10, etc., up to 1.97). If

the number of participants (N) is less than 100 and the measured quantity is an integer, then not
all of the possible sequences of two digits can occur after the decimal point in correctly rounded
fractions. We use the term inconsistent to refer to reported means of integer data whose value,
appropriately rounded, cannot be reconciled with the stated sample size. (More generally, if the
number of decimal places reported is D, then some combinations of digits will not be consistent
if N is less than 10^D.)
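For illustration, the sequence of attainable rounded means just described can be generated directly; the short Python sketch below (an illustrative sketch, not from the article) reproduces it for ten participants and a three-item measure.

# List the two-decimal means attainable in the interval [1, 2) when a
# three-item measure is given to 10 participants (granularity 1/30).
n, items = 10, 3
grains = n * items
attainable = sorted({round(1 + k / grains, 2) for k in range(grains)})
print(attainable)       # [1.0, 1.03, 1.07, 1.1, 1.13, ..., 1.93, 1.97]
print(len(attainable))  # 30 of the 100 possible two-decimal endings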

96 This relation is always true for integer data that are recorded as single items, such as

97 participants’ ages in whole years, or a one-item Likert-type measure, as frequently used as a

98 manipulation check. In particular, the number of possible responses to each item is irrelevant;

99 that is, it makes no difference whether responses can range from 0 to 3, or from 1 to 100. When

100 a composite measure is used, such as one with three Likert-type items where the mean of the

101 item scores is taken as the value of the measure, this mean value will not necessarily be an

102 integer; instead, it will be some multiple of (1/L), where L is the number of items in the measure.

103 Similar considerations would apply to a hypothetical one-item measure where the possible

104 responses are simple fractions instead of integers. For example, a scale with possible responses

105 of 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, and 3.0 would be equivalent to a two-item measure with integer

106 responses in the range 0–3. Alternatively, in a money game where participants play with

107 quarters, and the final amount won or lost is expressed in dollars, only values ending in .00, .25,

108 .50 or .75 are possible. However, the range of possible values that such means can take is still

constrained (in the example of the three-item Likert-type scale, assuming item scores starting at
1, this range will be 1.00, 1.333…, 1.666…, 2.00, 2.333…, etc.), and so for any given sample size, the
range of possible values for the mean of all participants is also constrained. For example, with a
sample size of 20 and L=3, possible values for the mean are 1.00, 1.02 [rounded from 1.0166…],
1.03 [rounded from 1.0333…], 1.05, 1.07, etc. More generally, the range of means for a measure

114 with L items (or an interval scale with an implicit granularity of (1/L), where L is a small integer,

115 such as 4 in the example of the game played with quarters) and a sample size of N is identical to

116 the range of means for a measure with one item and a sample size of L × N . Thus, by

117 multiplying the sample size by the number of items in the scale, composite measures can be

118 analyzed using the GRIM technique in the same way as single items, although as the number of

119 scale items increases, the maximum sample size for which this analysis is possible is

120 correspondingly reduced as the granularity decreases towards 0.01. We use the term GRIM-

121 testable to refer to variables whose granularity (typically, one divided by the product of the

122 number of scale items and the number of participants) is sufficiently large that they can be tested

123 for inconsistencies with the GRIM technique. For example, a five-item measure with 25

124 participants has the same granularity (0.008) as a one-item measure with 125 participants, and

125 hence scores on this measure are not typically GRIM-testable.

126 Figure 1. Plot of consistent (white dots) and inconsistent (black dots) means, reported to 2

127 decimal places.

128

129 Notes:

130 1. As the sample size increases towards 100, the number of means that are consistent with that

131 sample size also increases, as shown by the greater number of white (versus black) dots. Thus,

132 GRIM works better with smaller sample sizes, as the chance of any individual incorrectly-

133 reported mean being consistent by chance is lower.

134 2. The Y axis represents only the fractional portion of the mean (i.e., the part after the decimal

135 point), because the integer portion of the mean plays no role. That is, for any given sample size,

136 if a mean of 2.49 is consistent with the sample size, then means of 0.49 or 8.49 are also

137 consistent.

138 3. This figure assumes that means ending in 5 at the third decimal place (e.g., 10/80=0.125) are

139 always rounded up; if such means are allowed to be rounded up or down, a few extra white dots

140 will appear at sample sizes that are multiples of 8.

141 Figure 1 shows the distribution of consistent (shown in white) and inconsistent (shown in

142 black) means as a function of the sample size. Note that only the two-digit fractional portion of

143 each mean is linked to consistency; the integer portion plays no role. The overall pattern is clear:

144 As the sample size increases, the number of means that are consistent with that sample size also

145 increases, and so the chance that any single incorrectly-reported mean will be detected as

146 inconsistent is reduced. However, even with quite large sample sizes, it is still possible to detect

147 inconsistent means if an article contains multiple inconsistencies. For example, consider a study

148 with N=75 and six reported “means” whose values have, in fact, been chosen at random: There is

a 75% chance that any one random “mean” will be consistent, but only a 17.8% (0.75^6) chance
that all six will be.
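These probabilities are easy to confirm; the brief Python sketch below (illustrative only, using ordinary rounding rather than the conservative rule described in the next paragraph) counts the attainable two-decimal endings for N=75.

# With N = 75 and one item, count how many of the 100 possible two-decimal
# endings are attainable, then compute the chance that six randomly chosen
# two-decimal values would all happen to be attainable.
n = 75
attainable_endings = {round(k / n, 2) for k in range(n)}
p_single = len(attainable_endings) / 100
print(p_single)                 # 0.75
print(round(p_single ** 6, 3))  # 0.178, i.e. about 17.8%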

151 Our general formula, then, is that when the number of participants (N) is multiplied by

152 the number of items composing a measured quantity (L, commonly equal to 1), and the means

that are based on N are reported to D decimal places, then if (L × N) < 10^D, there exists some

154 number of decimal fractions of length D that cannot occur if the means are reported correctly.

The number of inconsistent values is generally equal to (10^D − N); however, in the analyses

156 reported in the present article, we conservatively allowed numbers ending in exactly 5 at the

157 third decimal place to be rounded either up or down without treating the resulting means as

158 inconsistent, so that some values of N have fewer possible inconsistent means than this formula

159 indicates.
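This procedure can be expressed as a short function. The Python sketch below is an illustrative implementation of the check as described here (the spreadsheet at https://osf.io/3fcbr is the authors' own tool); the function name, parameter names, and the small floating-point tolerance are not from the article.

def grim_consistent(reported_mean, n, items=1, decimals=2):
    """Return True if `reported_mean` could arise from integer item scores.

    Attainable means are multiples of 1/(n * items); the report is treated as
    consistent if some attainable mean lies within half of one reporting unit
    of it. The inclusive bound reproduces the conservative treatment of values
    ending in exactly 5, which are allowed to round either up or down.
    """
    grains = n * items
    if grains >= 10 ** decimals:
        return None  # not GRIM-testable at this reporting precision
    nearest_attainable = round(reported_mean * grains) / grains
    half_unit = 0.5 / 10 ** decimals
    return abs(nearest_attainable - reported_mean) <= half_unit + 1e-12


print(grim_consistent(5.19, 28))          # False (the opening example)
print(grim_consistent(5.18, 28))          # True
print(grim_consistent(3.77, 21, items=3)) # False (a three-item composite)

Because attainable means are spaced more than one reporting unit apart whenever the test applies, at most one of them can fall within half a unit of the reported value, so checking the single nearest candidate suffices.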

160 Using the GRIM technique, it is possible to examine published reports of empirical

research to see whether the means have been reported correctly.¹ Psychological journals

¹ We have provided a simple spreadsheet at https://osf.io/3fcbr that automates the steps of this procedure.

162 typically require the reporting of means to two decimal places, in which case the sample size

163 corresponding to each mean must be less than 100 in order for its consistency to be checked.

164 However, since the means of interest in experimental psychology are often those for subgroups

165 of the overall sample (for example, the numbers in each experimental condition), it can still be

166 possible to apply the GRIM technique to studies with overall sample sizes substantially above

167 100. (Note that percentages reported to only one decimal place can typically be tested for

168 consistency with a sample size of up to 1000, as they are, in effect, fractions reported to three

169 decimal places.)
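As an illustration of the percentage case, the sketch below applies the same logic with a reporting unit of 0.1 percentage points; the sample size (481) and the percentages used are invented for the example.

# A percentage reported to one decimal place is a count divided by N,
# reported (in effect) to three decimal places, so the same granularity
# argument applies for N < 1000. The figures below are invented.
def percentage_consistent(reported_pct, n):
    nearest_count = round(reported_pct / 100 * n)
    half_unit = 0.05  # half of one reporting unit, in percentage points
    return abs(nearest_count / n * 100 - reported_pct) <= half_unit + 1e-9

print(percentage_consistent(34.6, 481))  # False: no count out of 481 yields 34.6%
print(percentage_consistent(34.5, 481))  # True: 166/481 = 34.51...%, which rounds to 34.5%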

170 We now turn to our pilot trial of the GRIM test.

171

172 Method

173 We searched recently published (2011–2015) issues of Psychological Science (PS), Journal of

174 Experimental Psychology: General (JEP:G), and Journal of Personality and Social Psychology

175 (JPSP) for articles containing the word “Likert” anywhere in the text. This strategy was chosen

176 because we expected to find Likert-type data reported in most of the articles containing that word

177 (although we also checked the consistency of the means of other integer data where possible).

178 We sorted the results with the most recent first and downloaded at most the first 100 matching

179 articles from each journal. Thus, our sample consisted of 100 articles from PS published

180 between January 2011 and December 2015, 60 articles from JEP:G published between January

181 2011 and December 2015, and 100 articles from JPSP published between October 2012 and

182 December 2015.

183 We examined the Method section of each study reported in these articles to see whether

184 GRIM-testable measures were used, and to determine the sample sizes for the study and, where

185 appropriate, each condition. A preliminary check was performed by the first author; if he did not

186 see evidence of either GRIM-testable measures, or any (sub)sample sizes less than 100, the

187 article was discarded. Subsequently, each author worked independently on the retained articles.

188 We examined the table of descriptives (if present), other result tables, and the text of the Results

189 section, looking for means or percentages that we could check using the GRIM technique. On

190 the basis of our tests, we assigned each article a subjective “inconsistency level” rating. A rating

191 of 0 (no problems) meant that all the means we were able to check were consistent, even if those

192 means represented only a small percentage of the reported data in the article. We assigned a

193 rating of 1 (minor problems) to articles that contained only one or two inconsistent numbers,

194 where we believed that these were most parsimoniously explained by typographical or

195 transcription errors, and where an incorrect value would have little effect on the main

196 conclusions of the article. Articles that had a small number of inconsistencies that might impact

197 the principal results were rated at level 2 (moderate problems); we also gave this rating to

198 articles in which the results seemed to be uninterpretable as described. Finally, we applied a

199 rating of 3 (substantial problems) to articles with a larger number of inconsistencies, especially if

these appeared at multiple points in the article. The ratings were then compared between the
authors, and any differences were resolved by discussion.

202

203 Results

204 The total number of articles examined from each journal, the number retained for GRIM

205 analysis, and the number to which we assigned each rating, are shown in Table 1. A total of 260

206 articles were initially examined. Of these, 189 (72.7%) were discarded, principally because

207 either they reported no GRIM-testable data or their sample sizes were all sufficiently large that

208 no inconsistent means were likely to be detected. Of the remaining 71 articles, 35 (49.3%)

209 reported all GRIM-testable data consistently and were assigned an inconsistency level rating of

0. That left us with 36 articles that appeared to contain one or more inconsistencies. Of these, we

211 assigned a rating of 1 to 15 articles (21.1% of the 71 in total for which we performed a GRIM

212 analysis), a rating of 2 to five articles (7.0%), and a rating of 3 to 16 articles (22.5%). In some of

213 these “level 3” articles, over half of the GRIM-testable values were inconsistent with the stated

214 sample size.


Table 1. Journals and Articles Consulted

                                              PS             JEP:G          JPSP           Total
Number of articles                            100            60             100            260
Earliest article date                         January 2011   January 2011   October 2012
Articles with GRIM-testable data              29             15             27             71
Level 0 articles (no problems detected)       16             8              11             35
Level 1 articles (minor problems)             5              3              7              15
Level 2 articles (moderate problems)          1              1              3              5
Level 3 articles (substantial problems)       7              3              6              16

Notes: PS = Psychological Science. JEP:G = Journal of Experimental Psychology: General.
JPSP = Journal of Personality and Social Psychology.

220

221

Next, we e-mailed² the corresponding authors of the articles that were rated at level 2 or 3

223 asking for their data. In response to our 21 initial requests, we received 11 replies within two

224 weeks. At the end of that period, we sent follow-up requests to the 10 authors who had not

225 replied to our initial e-mail. In response to either the first or second e-mail, we obtained the

226 requested data from eight authors, while a ninth provided us with sufficient information about

227 the data in question to enable us to check the consistency of the means. Four authors promised

228 to send the requested data, but have not done so to date. Five authors either directly or

² The text of our e-mails is available in the supplementary information for this article.

229 effectively refused to share their data, even after we explained the nature of our study;

230 interestingly, two of these refusals were identically worded. In another case, the corresponding

231 author’s personal e-mail address had been deleted; another author informed us that the

232 corresponding author had left academia, and that the location of the data was unknown. Finally,

233 two of our requests went completely unanswered after the second e-mail.

234 Our examination of the data that we received showed that the GRIM technique identified

235 one or more genuine problem in each case. We report the results of each analysis briefly here, in

236 the order in which the data were received.

237 Data set 1. Our GRIM analysis had detected two inconsistent means in a table of descriptives, as

well as eight inconsistent standard deviations.³ Examining the data, we found that the two

239 inconsistent means and one of the inconsistent SDs were caused by the sample size for that cell

240 not corresponding to the sample size for the column of data in question; five SDs had been

241 incorrectly rounded because the default (3 decimal places) setting of SPSS had caused a value of

242 1.2849 to be rounded to 1.285, which the authors had subsequently rounded manually to 1.29;

243 and two further SDs appeared to have been incorrectly transcribed, with values of 0.79 and 0.89

244 being reported as 0.76 and 0.86, respectively. All of these errors were minor and had no

245 substantive effect on the published results of the article.

246 Data set 2. Our reading of the article in this case had detected several inconsistent means, as

well as several inconsistently reported degrees of freedom and apparent errors in the reporting of

248 some other statistics. Examination of the data confirmed most of these problems, and indeed

³ SDs exhibit granularity in an analogous way to means, but the determination of (in)consistency for SDs is considerably more complicated. We hope to cover the topic of inconsistent SDs in a future article.

249 revealed a number of additional errors in the authors’ analysis. We subsequently discovered that

250 the article in question had already been the subject of a correction in the journal, although that

251 had not addressed most of the problems that we found. We will write to the authors to suggest a

252 number of points that require (further) correction.

253 Data set 3. In this case, our GRIM analysis had shown a large number of inconsistent means in

254 two tables of descriptives. The corresponding author provided us with an extensive version of

255 the data set, including some intermediate analysis steps. We identified that most of the entries in

256 the descriptives had been calculated using a Microsoft Excel formula that included an incorrect

257 selection of cells; for example, this resulted in the mean and SD of the first experimental

258 condition being included as data points in the calculation of the mean and SD of the second. The

259 author has assured us that a correction will be issued.

260 Data set 4. In the e-mail accompanying their data, the authors of this article spontaneously

261 apologized in advance (even though we had not yet told them exactly why we were asking for

262 their data) for possible discrepancies between the sample sizes in the data and those reported in

263 the article. They stated that, due to computer-related issues, they had only been able to retrieve

264 an earlier version of the data set, rather than the final version on which the article was based. We

265 adjusted the published sample sizes using the notes that the authors provided, and found that this

266 adequately resolved the GRIM inconsistencies that we had identified.

267 Data set 5. The GRIM analyses in this case found some inconsistent means in the reporting of

268 the data that were used as the input to a number of t tests, as well as in the descriptives for one of

269 the conditions in the study. Analysis revealed that the former problems were the result of the

270 authors having reported the Ns from the output of a repeated-measures ANOVA in which some

271 cases were missing, so that these Ns were smaller than those reported in the method section. The

272 problems in the descriptives were caused by incorrect reporting of the number of participants

273 who were excluded from analyses. We were unable to determine to what extent this difference

274 affected the results of the study.

275 Data set 6. Here, the inconsistencies that we detected were mostly due to the misreporting by

276 the authors of their sample size. This was not easy to explain as a typographical error, as the

277 number was reported as a word at the start of a sentence (e.g., “Sixty undergraduates took part”).

278 Additionally, one inconsistent standard deviation turned out to have been incorrectly copied

279 during the drafting process.

280 Data set 7. This data set confirmed numerous inconsistencies, including large errors in the

281 reported degrees of freedom for several F tests, from which we had inferred the per-cell sample

282 sizes. Furthermore, a number that was meant to be the result of subtracting one Likert-type item

283 score from another (thus giving an integer result) had the impossible value of 1.5. We reported

284 these inconsistencies to the corresponding author, but received no acknowledgement.

285 Data set 8. The corresponding author indicated that providing the full data set could be

286 complicated, as the data were taken from a much larger longitudinal study. Instead, we provided

287 a detailed explanation of the specific inconsistencies we had found. The author checked these

288 and confirmed that the sample size of the study in question had been reported incorrectly, as

289 several participants had been excluded from the analyses but not from the reported count of

290 participants. The author thanked us for finding this minor (to us) inconsistency and described the

291 exercise as “a good lesson.”

292 Data set 9. In this case, we asked for data for three studies from a multiple-study article. In the

293 first two studies, we found some reporting problems with standard deviations in the descriptives

294 and some other minor problems to do with the handling of missing values for some variables.

295 For the third study, however, the corresponding author reported that, during the process of

296 preparing the data set to send to us, an error in the analyses had been discovered that was

297 sufficiently serious as to warrant a correction to the published article.

298 For completeness, we should also mention that in one of the cases above, the data that we

299 received showed that we had failed to completely understand the original article; what we had

300 thought were inconsistencies in the means on a Likert-type measure were due to that measure

301 being a multiple-item composite, and we had overlooked that it was correctly reported as such.

302 While our analysis also discovered separate problems with the article in question, this

303 underscores how careful reading is always necessary when using the GRIM technique.

304

305 Discussion

306 We identified a simple method for detecting discrepancies in the reporting of statistics derived

307 from integer-based data, and applied it to a sample of empirical articles published in leading

308 journals of psychology. Of the articles that we were able to test, around half appeared to contain

309 one or more errors in the summary statistics. (We have no way of knowing how many

310 inconsistencies might have been discovered in the articles with larger samples, had it been

311 standard practice to report means to three decimal places.) Nine datasets were examined in more

312 detail, and we confirmed the existence of reporting problems in all nine, with three articles

313 requiring formal corrections.

314 We anticipate that the GRIM technique could be a useful tool for reviewers and editors.

315 A GRIM check of the reported means of an article submitted for review ought to take only a few

316 minutes. (Indeed, we found that even when no inconsistencies were uncovered, simply

317 performing this check enhanced our understanding of the methods used in the articles that we

318 read.) When GRIM errors are discovered, depending on their extent and how the reviewer feels

319 they impact the article, actions could range from asking the authors to check a particular

320 calculation, to informing the action editor confidentially that there appear to be severe problems

321 with the manuscript.

322 When an inconsistent mean is uncovered by this method, we of course have no

323 information about the true mean value that was obtained; that can only be determined by a

324 reanalysis of the original data. But such an inconsistency does indicate, at a minimum, that a

325 mistake has been made. When multiple inconsistencies are demonstrated in the same article, we

326 feel that the reader is entitled to question what else might not have been reported accurately.

327 Note also that not all incorrectly reported means will be detected using the GRIM technique,

328 because such a mean can still be consistent by chance. With reporting to two decimal places, for

329 a sample size N<100, a random “mean” value will be consistent in approximately N% of cases.

330 Thus, the number of GRIM errors detected in an article is likely to be a conservative estimate of

331 the true number of such errors.

332 A limitation of the GRIM technique is that, with the standard reporting of means to two

333 decimal places, it cannot reveal inconsistencies with per-cell sample sizes of 100 or more, and its

334 ability to detect such inconsistencies decreases as the sample size (or the number of items in a

335 composite measure) increases. However, this still leaves a substantial percentage of the

336 literature that can be tested. Recall that we selected our articles from some of the highest-impact

337 journals in the field; it might be that other journals have a higher proportion of smaller studies.

338 Additionally, it might be the case that smaller studies are more prone to reporting errors (for

339 example, because they are run by laboratories that have fewer resources for professional data

340 management).

341 A further potential source of false positives is the case where one or more participants are

342 missing values for individual items in a composite measure, thus making the denominator for the

343 mean of that measure smaller than the overall sample size. However, in our admittedly modest

344 sample of articles, this issue only caused inconsistencies in one case. We believe that this

345 limitation is unlikely to be a major problem in practice because the GRIM test is typically not

346 applicable to measures with a large number of items, due to the requirement for the product of

347 the per-cell sample size and the number of items to be less than 100.

348
349 Concluding remarks

350 On its own, the discovery of one or more inconsistent means in a published article need not be a

351 cause for alarm; indeed, we discovered from our reanalysis of data sets that in many cases where

352 such inconsistencies were present, there was a straightforward explanation, such as a minor error

353 in the reported sample sizes, or a failure to report the exclusion of a participant. Sometimes, too,

354 the reader performing the GRIM analysis may make errors, such as not noticing that what looks

355 like a single Likert-type item is in fact a composite measure.

356 It might also be that psychologists are simply sometimes rather careless in retyping

357 numbers from statistical software packages into their articles. However, in such cases, we think

358 it is legitimate to ask how many other elementary mistakes might have been made in the analysis

359 of the data, and with what effects on the reported results. It is interesting to compare our

360 experiences with those of Wolins (1962), who asked 37 authors for their data, obtained these in

361 usable form from seven authors, and found “gross errors” in three cases. While the numbers of

362 studies in both Wolins’ and our cases are small, the percentage of severe problems is, at an

363 anecdotal level, worrying. Indeed, we wonder whether some proportion of the failures to

364 replicate published research in psychology (Open Science Collaboration, 2015) might simply be

365 due to the initial (or, conceivably, the replication) results being the products of erroneous

366 analyses.

367 Beyond inattention and poorly-designed analyses, however, we cannot exclude that in

368 some cases, a plausible explanation for GRIM inconsistencies is that some form of data

369 manipulation has taken place. For example, in the fictional extract at the start of this article, here

370 is what should have been written in the last sentence:

371 Participants in the “drank the Kool-Aid” condition did not report a significantly stronger

372 belief in the ability of pigs to fly (M=4.79, SD=1.34) than those in the control condition

373 (M=4.26, SD=1.41), t(53)=1.43, p=.16.

374 In the “published” extract, compared to the above version, the first mean was “adjusted” by

375 adding 0.40, and the second by subtracting 0.40. This transformed a non-significant p value into

376 a significant one, thus making the results considerably easier to publish (cf. Kühberger, Fritz, &

377 Scherndl, 2014).

378 We are particularly concerned about the eight data sets (out of the 21 we requested) that

379 we believe we may never see (five due to refusals to share the data, two due to repeated non-

380 response to our requests, and one due to the apparent disappearance of the corresponding author).

381 Refusing to share one’s data for reanalysis without giving a clear and relevant reason is, we feel,

382 professionally disrespectful at best, especially after authors have assented to such sharing as a

383 condition of publication, as is the case in (for example) APA journals such as JPSP and JEP:G.

384 We support the principle, currently being adopted by several journals, that sharing of data ought

385 to be the default situation, with authors having to provide strong arguments why their data cannot

386 be shared in any given case. When accompanied by numerical evidence that the results of a

387 published article may be unreliable, a refusal to share data will inevitably cause speculation

388 about what those data might reveal. However, throughout the present article, we have refrained

389 from mentioning the titles, authors, or any other identifying features of the articles in which the

390 GRIM analysis identified apparent inconsistencies. There are three reasons for this. First, the

391 GRIM technique was exploratory when we started to examine the published articles, rather than

392 an established method. Second, there may be an innocent explanation for any or all of the

393 inconsistencies that we identified in any given article. Third, it is not our purpose here to

394 “expose” anything or anyone; we offer our results in the hope that they will stimulate discussion

395 within the field. It would appear, as a minimum, that we have identified an issue worthy of

396 further investigation, and produced a tool that might assist reviewers of future work, as well as

397 those who wish to check certain results in the existing literature.

398 References

399 Carifio, J., & Perla, R. J. (2007). Ten common misunderstandings, misconceptions, persistent

400 myths and urban legends about Likert scales and Likert response formats and their

401 antidotes. Journal of Social Sciences, 3, 106–116.

402 http://dx.doi.org/10.3844/jssp.2007.106.116

403 Coombs, C. H. (1960). A theory of data. Psychological Review, 67, 143–159.

404 http://dx.doi.org/10.1037/h0047773

405 Jamieson, S. (2004). Likert scales: How to (ab)use them. Medical Education, 38, 1212–1218.

406 http://dx.doi.org/10.1111/j.1365-2929.2004.02012.x

407 Kühberger, A., Fritz, A., & Scherndl, T. (2014). Publication bias in psychology: A diagnosis

408 based on the correlation between effect size and sample size. PLoS ONE, 9(9), e105825.

409 http://dx.doi.org/10.1371/journal.pone.0105825

410 Open Science Collaboration (2015). Estimating the reproducibility of psychological science.

411 Science, 349, aac4716. http://dx.doi.org/10.1126/science.aac4716

412 Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34, 273–286.

413 http://dx.doi.org/10.1037/h0070288

414 Wolins, L. (1962). Responsibility for raw data. American Psychologist, 17, 657–658.

415 http://dx.doi.org/10.1037/h0038819

416 Supplemental Information
417
418 1. A numerical demonstration of the GRIM technique

419 For readers who prefer to follow a worked example, we present here a simple method for

420 performing the GRIM test to check the consistency of a mean. We assume that some quantity

421 has been measured as an integer across a sample of participants and reported as a mean to two

422 decimal places. For example:

423 Participants (N = 52) responded to the manipulation check question, “To what extent did

424 you believe the research assistant’s story that the dog had eaten his homework?” on a 1–7

425 Likert-type scale. Results showed that they found the story convincing (M = 6.28,

426 SD = 1.22).

427 In terms of the formulae given earlier, N (the number of participants) is 52, L (the number of

428 Likert-type items) is 1, and D (the number of decimal places reported) is 2. Thus, 10 D is 100,

429 which is greater than L × N (52), and so the means here are GRIM-testable. The first step is to

430 multiply the sample size (52) by the number of items (1), giving 52. Then, take that product and

431 multiply it by the reported mean. In this example, that gives (6.28 × 52) = 326.56. Next, round

432 that product to the nearest integer (here, we round up to 327). Now, divide that integer by the

433 sample size, rounding the result to two decimal places, giving (327 / 52) = 6.29. Finally,

434 compare this result with the original mean. If they are identical, then the mean is consistent with

435 the sample size and integer data; if they are different, as in this case (6.28 versus 6.29), the mean

436 is inconsistent.

437 When the quantity being measured is a composite Likert-type measure, or some other

438 simple fraction, it may still be GRIM-testable. For example:

439 Participants (N = 21) responded to three Likert-type items (0 = not at all, 4 = extremely)

440 asking them how rich, famous, and successful they felt. These items were averaged into

441 a single measure of fabulousness (M = 3.77, SD = 0.63).

In this case, the measured quantity (the mean score for fabulousness) can take on the values 1.00,
1.333…, 1.666…, 2.00, 2.333…, 2.666…, 3.00, etc. The granularity of this quantity is thus finer than if it

444 had been reported as an integer (e.g., if the mean of the total scores for the three components,

445 rather than the mean of the means of the three components, had been reported). However, the

446 sample size is sufficiently small that we can still perform a GRIM test, by multiplying the sample

447 size by the number of items that were averaged to make the composite measure (i.e., three)

448 before performing the steps just indicated. In terms of our formulae, N is 21, L is 3, and D is

again 2. 10^D is 100, which is greater than L × N (63), so the means here are again GRIM-
testable. Thus, in this case, we multiply the sample size (21) by the number of items (3) to get
63; multiply this result by the reported mean (3.77), giving 237.51; round 237.51 to the nearest
integer (238); divide 238 by 63 to get 3.7778, which rounds to 3.78; and observe that, once again,

453 this mean is inconsistent with the reported sample size.
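Both worked examples can be reproduced with a few lines of Python. The sketch below (illustrative; the function name is not from the article) follows the multiply-round-divide steps literally and uses ordinary rounding, without the lenient treatment of exact 5s described in the main text.

# Literal translation of the steps above: multiply the reported mean by
# (sample size x number of items), round to the nearest integer, divide
# back, round again, and compare with the reported mean.
def grim_check(reported_mean, n, items=1, decimals=2):
    grains = n * items
    assert grains < 10 ** decimals, "not GRIM-testable at this precision"
    nearest_total = round(reported_mean * grains)
    recovered = round(nearest_total / grains, decimals)
    return recovered == round(reported_mean, decimals)

print(grim_check(6.28, 52))           # False: the nearest attainable mean is 6.29
print(grim_check(3.77, 21, items=3))  # False: the nearest attainable mean is 3.78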

454 Professor Irving Herman (personal communication, August 11, 2016) has pointed out that

455 another (equivalent) method to test for inconsistency is to multiply the minimum and maximum

456 possible unrounded mean values by the sample size, and see if the integer parts of the results are

457 the same; if so, the mean is inconsistent, as there is no integer that it could represent. For

458 example, a reported mean of 5.19 could correspond to any value between 5.185 and 5.195. With

a sample size of 28, the range of possible values runs from (5.185 × 28) = 145.18 to
(5.195 × 28) = 145.46; thus, there is no possible integer that, when divided by the sample size,

461 could give a mean that would round to 5.19.
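In code, this boundary version amounts to checking whether any integer total lies between the two products, which is equivalent to the integer-part comparison when neither bound is itself an integer. A Python sketch (illustrative only; the function name is not from the article):

import math

# Herman's equivalent check: multiply the lowest and highest unrounded means
# that would print as the reported value by the sample size, and see whether
# an integer total falls between the two products.
def grim_consistent_bounds(reported_mean, n, items=1, decimals=2):
    grains = n * items
    half_unit = 0.5 / 10 ** decimals
    low = (reported_mean - half_unit) * grains
    high = (reported_mean + half_unit) * grains
    return math.floor(high + 1e-9) >= math.ceil(low - 1e-9)

print(grim_consistent_bounds(5.19, 28))  # False: 145.18 and 145.46 bracket no integer
print(grim_consistent_bounds(5.18, 28))  # True: 145 lies between 144.90 and 145.18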

462 2. E-mails sent to authors to request sharing of data

463 We show here the e-mails that were sent to the authors of the articles in which we found apparent

464 problems, to request that they share their data. In some cases there were minor variations in

465 wording or punctuation.

466

467 The first e-mail, sent in late January 2016:

468 Dear Dr. <name>,

469 We have read with interest your article “<title>”, published in <year> in <Journal>.

470 We are interested in reproducing the results from this article as part of an ongoing project
471 concerning the nature of published data.

472 Accordingly, we request you to provide us with a copy of the dataset for your article, in
473 order to allow us to verify the substantive claims of your article through reanalysis. We
474 can read files in SPSS, XLS[x], RTF, TXT, and most proprietary file types (e.g., .MAT).

475 Thank you for your time.

476 Sincerely,

477 Nicholas J. L. Brown


478 PhD candidate, University Medical Center, Groningen
479 James Heathers
480 Postdoctoral fellow, Poznań University of Medical Sciences
481

482 We took the words “verify the substantive claims [of your article] through reanalysis” directly
483 from article 8.14, “Sharing Research Data for Verification”, of the American Psychological
484 Association’s ethics code (http://memforms.apa.org/apa/cli/interest/ethics1.cfm). In the case of
485 articles published in JEP:G and JPSP, we knew that the corresponding author had explicitly
486 agreed to these conditions by signing a copy of a document entitled “Certification of Compliance
487 With APA Ethical Principles” prior to publication.

488
489 The second e-mail, sent about 10 days after the first if we had received no reply to the first:

490 Dear Dr. <name>,

491 Not having received a reply to our first e-mail (see below), we are writing to you again.
492 We apologise if our first message was a little cryptic.

493 We are working on a technique that we hope will become part of the armoury of peer
494 reviewers when checking empirical papers, which we hope will allow certain kinds of
495 problems with the reporting of statistics to be detected from the text. Specifically, we
496 look at means that do not appear to be consistent with the reported sample size. From a
497 selection of articles that we have analysed, yours appears to be a case where our
498 technique might be helpful (if we have understood your method section correctly).

499 However, we are still refining our technique, which is why we are asking 20 or so authors
500 to provide us with data so that we can check that we have fully understood their methods,
501 and see how we should refine the description of our technique to make it as specific and
502 selective as possible. Comparing the results of its application with the numbers in the
503 dataset(s) corresponding to the articles that we have identified will hopefully enable us to
504 understand this process better. So if you could provide us with your data from this article,
505 that would be very helpful.

506 Kind regards,

507 Nick Brown


508 James Heathers
509
510
