
How to Upset the Statistical Referee 

Martin Bland,
Dept of Health Sciences,
University of York

Talk first presented to the London Hypertension Society.

Introduction 
I was first asked to speak on this topic by Donald Singer for a meeting organised by the
London Hypertension Society. The aim, of course, is not to upset the statistical referee,
but this way round is more fun.

Researchers come to me with comments from statistical referees quite often. I usually
agree with the referee. This is not as bad as it sounds, because I can often show the
frustrated authors how to do what the referee suggests and so get their papers accepted.
We must accept, however, that referees, statistical or otherwise, are fallible people just
like you and me and, like us, they get it wrong. After all, the authors might have spent
months working on their paper and the referee is unlikely to spend more than half a day
on it. Sometimes I'll disagree with the referee and help the author to fight, but this is
definitely the minority of cases. And, of course, those referees (of whom there are far too
many) who recommend changing or rejecting my work are fools and charlatans.

When I am the referee, on the other hand, I find it is the authors who are unfit to be let
out alone. I find myself gasping at the folly of my fellow men and women and racing
down the corridor to show my colleagues the latest jaw-dropper. I could not resist, for
example, the grant applicant who asked under computing for money to buy 'soft wear'; a
nice cashmere sweater for cold computer rooms, perhaps. I have often thought it a pity
that such things do not get to a wider audience. Accordingly, when I was given this
tempting title I decided to make it a personal account and use some of my referee's
reports. I based it on my experience as a statistical referee for the Lancet, as summer
relief in 1994 and 1995. I doubt very much that things have improved dramatically since
then, but if you know different, let me know.

In what follows, I shall use quotes from some of my reports to the Lancet. As all the
papers were confidential, I have changed a few details to protect the ignorant. As a rule, I
have no qualms about publicly pointing out the mistakes of others once they have been
published. If we do not do this, the conclusions that follow from these mistakes will be
quoted by others, usually without any criticism, and become generally accepted. Also, if
you publish your work, you must be prepared to defend your position, or amend it. But
when work has not been published, but rather is at some point on the twisting road to
publication, I think that it would be unfair to criticise it publicly. On the other hand, such
work can often be very illuminating. I'm not the only statistician who has thought that I
would really like this paper that I am reviewing to be published, because it would make a
wonderful teaching example of what not to do. I have therefore done my best in what
follows to describe real papers but at the same time to preserve the confidentiality of the
reviewing process. I have disguised the nature of the research, sometimes calling the
variables 'X' rather than giving them their proper name. I have even changed some of
the numbers. However, I do not think that I have changed or exaggerated the nature of the
statistical mistake that I was pointing out. These quotes come from reports on just 15
papers, so if something comes up several times it may be pretty common.

The Allstat Sample 
After I had given the talk a couple of times, I wanted to generalise a bit and incorporate the
views of other statistical referees. I used a completely non-statistical approach: a
convenience sample with a low response rate. I used Allstat, an email list that keeps
statisticians in touch with one another. I broadcast the following message:

Subject: Statistical referees for medical journals

To allstaters who act as statistical referees for medical journals.

I am preparing a talk entitled "How to upset the statistical referee". This is based on my
own (rather limited) adventures with the Lancet. I wondered what are the pet hates of
other referees, the things which really irritate them? If there is something which authors
do which really upsets you, could you tell me what it is? I shall, of course, post a
summary of replies.

Allstat responded beyond all expectations. I received 35 replies, many of which were
very extensive and wide-ranging. I found this rather overwhelming and I never did
produce that summary of replies which I had promised. Apologies to Allstat for that.

Eventually, I managed to sort and classify these replies. I have added the Allstat
comments wherever they fit in with my own and added them separately where they do not.

I think of this as my only purely qualitative research project, using two convenience samples (of reviews and of respondents to my Allstat message), one of which was self-selected, to triangulate the theory generated.

You may be surprised by some of the things which my colleagues and I object to, as you
will see many of them appearing frequently in journals. Some of them are what might be
termed 'parastatistics', statistics as practiced by users of statistics but not by statisticians.
Not all statisticians would agree with me or with my respondents, either, and we should
not forget that the collective noun for statisticians is a 'variance'. Given these cautions, I
hope that what follows will give a good introduction to what might be going through the
mind of the statistical referee for your paper. Here we go.
Effects which are not significant 
My most frequent and severe complaints concern significance tests and confidence
intervals. I think that one of the greatest statistical crimes is to carry out a significance
test, get a large P value, and then interpret this as meaning that there is no difference. This
happens again and again. My comments included:

`This is a small trial of two similar regimes. They interpret "no significant difference" as
meaning "no difference". I do not think that there was any chance of a significant
difference anyway. They should present confidence intervals as in the Lancet's
guidelines.'

`Not significant should NOT be interpreted as "no change".'

`The conclusion interprets "not significant" as meaning "no difference", which it does
not. It means that a difference has not been shown to exist.'

`The habit of reporting non-significant differences as no differences gives me no confidence in the report of no change here. I suggest that some data be included.'

A couple of my Allstat respondents mentioned this, too:

'Interpreting P>0.5 as "evidence" of no difference, without reference to sample size or confidence intervals'

'Interpreting non-significance as "no difference" to such an extent that the Discussion focuses around why this should be also grates high on the pet hates scale.'

Lack of confidence intervals 
Wherever possible, authors should report confidence intervals for differences, not just
significance tests. For years statisticians have been trying to persuade researchers of this
(e.g. Gardner and Altman 1986). This is the usual guideline of most journals anyway,
including the Lancet. The current guideline, from the Lancet website, include: 'When
possible, quantify findings and present them with appropriate indicators of measurement
error or uncertainty (such as confidence intervals). Avoid relying solely on statistical
hypothesis testing, such as the use of P values, which fails to convey important
quantitative information.' Authors continually ignore this, and my papers were no
exception. My comments included:

`The results should be presented as confidence intervals, not significance tests. For
example, the non-significant 19% adverse reactions on the test treatment compared to
12% on the standard treatment is a relative risk of adverse reaction 1.5, 95% confidence
interval 0.5 to 4.6. Thus the data are compatible with more than four times as many
adverse reactions on the new than on the standard treatment. For the presence of X, one
in each group, the relative risk is 1.3; the 95% confidence interval is 0.08 to 20. Thus the
data are compatible with more than twenty times as many Xs on the new than on the
standard treatment!'
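
As a minimal sketch of the arithmetic behind such an interval (the counts below are made up for illustration, not the trial's actual data), a relative risk and its 95% confidence interval can be computed on the log scale:

```python
import math

def relative_risk_ci(a, n1, b, n2, z=1.96):
    """Relative risk of an event, (a/n1) / (b/n2), with a Wald confidence interval on the log scale."""
    rr = (a / n1) / (b / n2)
    se_log_rr = math.sqrt(1 / a - 1 / n1 + 1 / b - 1 / n2)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Hypothetical counts: 6/31 (about 19%) adverse reactions on the test treatment,
# 4/33 (about 12%) on the standard treatment.
rr, lo, hi = relative_risk_ci(6, 31, 4, 33)
print(f"RR {rr:.1f}, 95% CI {lo:.1f} to {hi:.1f}")   # the interval stretches well above 1
```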

`A confidence interval for the mean difference would be much better than significance
tests. A non-significant difference in 10 subjects cannot be interpreted.'

`A finding of "not significant" is meaningless in 4 or 5 subjects. Confidence intervals should be used.'

`This non-significant difference, reported as "unchanged", is proportionately greater than many significant differences in this paper. A confidence interval for the mean difference would be much better.'

My Allstat sample agreed with me, four respondents mentioning the point. Typical
comments were:

'Papers where the only statistics are p-values.'

'Insisting on giving the test statistics, and refusing to give estimated effects.'

Presenting P values 
Computers now print out the exact P values for most test statistics. These should be
given, rather than changed to "not significant" or P>0.05. Similarly, if we have
P=0.0072, we are wasting information if we report this as P<0.01. This method of
presentation arises from the pre-computer era, when calculations were done by hand and
P values had to be found from tables. Personally, I would quote this to one significant
figure, as P=0.007, as figures after the first do not add much, but the first figure can be
quite informative. Two of my 15 Lancet papers had these problems:

`A report of "p=NS" is not very informative. If significance tests must be used, the exact
P value is preferable.'

`These are P=0.01, P=0.006, P=0.05, not P<0.01, P<0.006, P<0.05. In fact, the first is
actually P=0.012, so what they have written is incorrect.'

Several Allstat respondents raised this issue. Their comments included:

'Using 'NS' for any p>0.05, including p=0.0501 (three replies made this point)'

'Showing a table of p-values to huge numbers of decimal places when they're significant,
but not even to one place when not: 'NS' should be banished!'
'Also, statistical methods sections which say "all results were regarded as significant at
the 5% level", followed by results where p<0.05 or p=NS.'

'The term "failed to achieve statistical significance"'.

'I mainly derive irritation from little things, such as "P<0.013"'

So if you want to avoid irritating the statistical referee (and you may not) you should
quote your P values correctly to one significant figure.
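
A trivial sketch of that rounding rule (the function and the floor of 0.0001 are mine, not any journal's convention):

```python
def format_p(p: float) -> str:
    """Quote an exact P value to one significant figure, e.g. 0.0072 -> 'P=0.007'."""
    return "P<0.0001" if p < 0.0001 else f"P={p:.1g}"

print(format_p(0.0072))   # P=0.007
print(format_p(0.012))    # P=0.01
print(format_p(0.43))     # P=0.4
```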

More on P values 
P values greatly exercised my Allstat respondents. Three complained about multiple
testing:

'Carrying out hundreds of significance tests, instead of either addressing specified hypotheses, or admitting that the study is descriptive.'

'Massed p-values, like firing a blunderbuss into a fishpond.'

'Skipping of a non-significant finding on the principal outcome to concentrate on a significant result in a side issue, whether this is the infamous sub-group or some minor outcome measure.'

If we carry out many tests of significance, even if the null hypotheses are all true, we
expect that 5% of them will be significant. If we then concentrate on these significant
tests in our report we can give a very misleading impression. One of my favourite
examples is due to Newnham et al. (1993), who randomized pregnant women to receive
a series of Doppler ultrasound blood flow measurements or to control. They found a
significantly higher proportion of birthweights below the 10th and 3rd centiles in the
Doppler group compared to the controls (P=0.006 and P=0.02). These were only two of
many comparisons and at least 35 were reported in the paper. Only these two were
reported in the abstract. (Birthweight was not the intended outcome variable for the trial).
This trial was widely reported and the finding that Doppler ultrasound reduced
birthweight was reported in the national news.
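
A minimal simulation of the point (everything here is made up: the groups are drawn from identical distributions, so every null hypothesis is true):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_tests, n_per_group = 1000, 30

# Two-sample t tests on groups drawn from the same population.
p_values = np.array([
    stats.ttest_ind(rng.normal(size=n_per_group), rng.normal(size=n_per_group)).pvalue
    for _ in range(n_tests)
])
print(f"{np.mean(p_values < 0.05):.1%} of tests 'significant' at the 5% level")
# Roughly 5% come out significant even though no real differences exist.
```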

Another Allstat respondent raised an interesting point:

'If you aren't already a fan you should be watching ER. Last night there was talk of a chi-
squared analysis showing significance at the 0.06 level, "so we only need one more
positive result"'.

I wonder how many of the television audience understood that one. Most statistical
analyses assume that the observations are independent of one another. If we do not have
independent observations, an analysis which requires this will be wrong. If we test each
time an observation is added, the observations cannot be independent, because an
observation will only be made if the previous ones did not show a significant difference.
We would be doing multiple testing, and the probability of a test reaching the nominal P
value of 5% if the null hypothesis were true would be much more than 5%. I doubt that
people who do this would actually mention it in their paper. The final test would be
presented as if it were the only one carried out. Doing this could be the result of
ignorance, researchers genuinely thinking that this is a valid procedure. If the researcher
knows that the procedure is not valid, it is fraud. In either case, we would end with a
potentially false and misleading conclusion.
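
A rough simulation of what such 'test as you go' behaviour does to the type I error (the stopping rule and sample sizes are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def test_as_you_go(start_n=5, max_n=50, alpha=0.05):
    """Add one observation per group at a time, testing after each addition;
    return True if any interim test reaches P < alpha (the null hypothesis is true throughout)."""
    a = list(rng.normal(size=start_n))
    b = list(rng.normal(size=start_n))
    while len(a) <= max_n:
        if stats.ttest_ind(a, b).pvalue < alpha:
            return True
        a.append(rng.normal())
        b.append(rng.normal())
    return False

runs = 2000
hits = sum(test_as_you_go() for _ in range(runs))
print(f"Proportion of runs ever 'significant': {hits / runs:.1%}")   # well above the nominal 5%
```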

I don't know quite what an Allstat respondent meant by this complaint:

'Authors who use p-value cut-offs other than <0.05, <0.01 or <0.001 and then don't
attempt to justify the levels they use (I find this especially in the papers concerning large
animals where there are only 3 cows and insufficient data for any conventional statistical
significance at all).'

I suspect that he or she was referring to authors who regard differences as significant if
P<0.10 or even higher probabilities. This can be justifiable in some circumstances. An
example might be in the screening of novel chemicals for pharmaceutical activity. We put
all chemicals through an initial screen intended to select some for a further more
intensive screen. It is more important to detect any which have biological activity than to
avoid further testing any which do not. A high P value cut-off is therefore appropriate: we accept a high type I error in order to get a low type II error. If authors wish to do this in published
papers, however, they must justify it to the reader, and to the referee.

One of my respondents complained about the use of:

'"Significant" when they mean important.'

This is a difficult one. According to the Shorter Oxford Dictionary, the second meaning
of 'significant' is 'important, notable', and has been since 1761. Its statistical meaning
relates more to its first definition: 'full of meaning or import'. Thus, if a difference is
significant in a sample this difference has meaning, because there is evidence that it exists
in the population. I do not think that statisticians can really appropriate 'significant' and
deny its other uses, but it's unlikely that I am going to be the statistical referee for your
paper, because I do it as rarely as possible. Other statisticians may be more jealous of
'significant' than I and, in the interests of publication, I recommend avoiding its non-
statistical applications. The Lancet supports this line, instructing authors to 'Avoid
nontechnical uses of technical terms in statistics, such as . . . "significant" . . .'

Another respondent mentioned:

'Direct comparison of p-values.'


I think that what this person had in mind was concluding that one difference is larger or
more important than another because it has a smaller P value. This is sometimes done, for
example, when a change is tested in two separate groups of subjects and a difference
between the P values is interpreted as evidence of a difference between the groups. This
is one of my own particular bêtes noires. An example came up in one of my Lancet
reviews:

`It is not correct to compare two groups by testing changes in each one separately.
Significance does not depend only on magnitude, but on variability and sample size. A
two sample t method should be used to compare the log ratios in the two groups.'

One of my respondents made the same point:

'People who carry out controlled clinical trials but do not carry out a controlled analysis.
Instead of quoting the estimated treatment effect (active - placebo) with its standard error,
they quote the "effect" in the group given active treatment (usually difference from
baseline).'

In general we need only note that the P value measures the strength of the evidence that
an effect exists in the population; it doesn't convey much about the magnitude of that
difference, and a large P value does not, in itself, mean that there is no population
difference or that the difference is small.

We must compare effect sizes, not P values. A special case of this was mentioned by
another Allstat respondent:

'Sub-group analyses unsupported by interaction tests.'

Sometimes authors will carry out significance tests of the same difference or relationships
in different subgroups of their subjects, for example in young and old, male and female.
They will then conclude that the difference exists only or mainly in the subgroups where
a significant difference was found. As explained above, this conclusion does not follow
from the analysis and the correct approach is to test the difference between the
magnitudes of the effects in the subgroups (Altman and Matthews 1996; Matthews and
Altman 1996a, 1996b; Altman and Bland 2003). This is known as a test of interaction.
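
A minimal sketch of such an interaction test, along the lines of Altman and Bland (2003): compare the two subgroup estimates directly rather than their separate P values (the effect sizes and standard errors below are invented):

```python
import math
from scipy import stats

def interaction_test(effect1, se1, effect2, se2, z_crit=1.96):
    """Test the difference between two independent effect estimates."""
    diff = effect1 - effect2
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2)
    p = 2 * stats.norm.sf(abs(diff / se_diff))
    ci = (diff - z_crit * se_diff, diff + z_crit * se_diff)
    return diff, ci, p

# Hypothetical: a 'significant' effect in one subgroup, 'not significant' in the other.
print(interaction_test(5.0, 2.0, 2.0, 2.5))   # the difference between subgroups is far from significant
```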

Design problems 
Referees' criticisms of the study design are the most difficult to deal with. Criticisms of
the presentation, analysis, and interpretation of the data can be remedied fairly easily,
because all these things can be changed. Once the study has been carried out and the data
collected, it cannot be redesigned. It is therefore essential that the design be correct to
begin with. Statisticians are forever saying that they should be consulted before the
project begins, although, as we are elusive beasts, this is often pretty difficult to achieve.
In my Lancet reports there were only two design issues. The first was a treatment
comparison using observational data:

`From a statistical viewpoint, this is pretty awful. I don't think we should have non-
randomised clinical trials in the Lancet.'

I think that we have now got past the argument about whether randomised trials are
effective or ethical and want to know what the randomised trial evidence for a treatment
is. I do not think that randomized trials are the only source of useful information, but
authors must be aware of the principles of randomization and have a pretty clear idea of
why they are using data from non-randomized subjects and what the limitations of such
data are. Two of my Allstat respondents mentioned randomisation. One complained
about:

'The adamant refusal of medical investigators to use randomization and random sampling.'

I found this surprising, as in my experience medical investigators are usually very ready
to use randomization and there are vast numbers of randomized trials in the literature.
However, experience can vary greatly and this informant may have been working in an
area of application where trials are few. Sometimes the perspective of others can be
startling. In their textbook Using and understanding medical statistics (Matthews and
Farewell 1988) the authors wait until chapter 8 before mentioning the Normal
distribution, saying that continuous data are rarely encountered in medical research! They
devote three chapters to survival analysis. Their experience in cancer research had certainly given them an entirely different perspective from mine, as I cut my statistical teeth on peak expiratory flow and forced expiratory volume. When I read that for the first time, my thought was: 'Ever heard of blood pressure?' (Despite this, it's a good book.)
However, I entirely agreed with my respondent about random sampling. This is almost
unknown in medicine, though usually there are good reasons for this.

Another respondent mentioned:

'Claims that a study is randomised or blinded when in fact allocation has been by hospital
number, date of birth, day of week etc, and blinding has been patently superficial and
ineffective.'

This is spot on. People who use systematic allocation of this type (hospital number, etc.)
sometimes argue that this is random, because the hospital number is not going to be
related to the patients' prognosis. But when Bradford Hill first advocated randomisation
in clinical trials, it was firstly to avoid such allocation schemes (Chalmers 1999). If
clinicians admitting patients to a trial know what treatment the patient will receive, as
they will in these systematic systems, this may bias the decision to admit the patient or
not. Schulz et al. (1995) have shown that when the admitting clinician is aware of the
treatment patients will receive, the treatment effect is larger, on average, than when
treatment is concealed. This implies that such open allocation tends to be biased. This
might arise, for example, because clinicians might judge a potential trial recruit to be too
frail for the trial treatment, but not for the control treatment. They might then decide to
recruit the patient to the trial if the patient would receive the control treatment, but not if
the patient would receive the trial treatment. Thus a bias in favour of the trial treatment
would be built in. Schulz et al. (1995) also showed that trials where the investigators
were not blinded to treatment had larger average treatment effects than trials where
investigators were blinded. Sometimes blinding is impossible, sometimes it is difficult,
but we must always be aware of its importance and the potential for bias when it is not
used. I think the referee wants to see that the authors understand this and are suitably
cautious in their interpretation as a result. A good point for the discussion.

The other design issue which came up was sample size:

`This is a small trial of two similar regimes. How was the sample size decided? Was there
a power calculation? What difference were the authors hoping to detect?'

I have had experience of sample size calculations being removed from papers to shorten
them, at the request of the journal. I think we should resist such shortsighted editing, but I
think that in this case no sample size calculations, other than feasibility, had been done. I
doubted that even had there been the modest treatment effect which they might have
hoped for, the chance of getting a significant difference in such a small trial would have
been much above 5%. (It is 5% even if there is no difference at all.)
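
A back-of-the-envelope version of that kind of power calculation (normal approximation, with a hypothetical standardised difference of 0.3 SD and 10 subjects per arm):

```python
import math
from scipy import stats

def approx_power(delta_sd, n_per_arm, alpha=0.05):
    """Approximate power of a two-sample comparison of means for a true
    difference of delta_sd standard deviations (normal approximation)."""
    se = math.sqrt(2 / n_per_arm)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    ncp = delta_sd / se
    return stats.norm.sf(z_crit - ncp) + stats.norm.cdf(-z_crit - ncp)

print(f"{approx_power(0.3, 10):.0%}")   # about 10%: not much above the 5% obtained with no effect at all
```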

On the subject of sample size, I had no example in my Lancet series, but another thing I
would pounce on would be a sample size calculation for a cluster randomized trial which
ignored the clustering. I would treat analysis which ignored the clustering in the same
way. See other talks: Cluster designs: a personal view, Sample size in guidelines
trials, and Cluster randomised trials in the medical literature.
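
As a minimal sketch of the usual correction (the cluster size, intracluster correlation and individual-level sample size below are all hypothetical), the sample size from an individually randomised calculation is inflated by the design effect 1 + (m - 1) * ICC:

```python
def design_effect(cluster_size, icc):
    """Inflation factor for a cluster randomised design: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc

n_individual = 200                                   # from an individual-level calculation
deff = design_effect(cluster_size=20, icc=0.05)
print(deff, round(n_individual * deff))              # 1.95 -> about 390 subjects needed
```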

Standard deviation and standard error 
Standard deviations and standard errors are the basic currency of statistics, familiar to
most researchers, yet they seem to cause a lot of difficulty. One problem is that authors
often quote them without specifying what they are quoting. I had two examples in my 15
papers:

`I presume the numbers in brackets are standard deviations. The authors should say so.'

`Are these ± numbers standard deviations, standard errors or confidence intervals?'

One of my respondents also mentioned this:

'± notation without any interpretation of whether it refers to se, sd, or CIs.'
Actually, I find the use of the '±' symbol itself is rather misleading. If we quote 'mean ± SD', as researchers often do, what does this mean? We are not saying that the observations all lie between mean - SD and mean + SD. In fact, we expect about one third of them to be outside these limits.
Similarly, if we quote 'mean ± SE' we do not actually wish to imply that the population
mean lies between mean - SE and mean + SE. This would only be true for 2/3 of samples.
I think that standard deviations and standard errors are best placed in parentheses: mean
(SD). In one of my papers this ± notation seems to have gone rather haywire:

`There is something wrong with the presentation of X. We have "mean X ... was 51.9 ±
7.9 (range)". Is 7.9 the standard deviation? Have the authors omitted the range by
mistake?'

Or did they perhaps mean that the minimum value was 51.9 - 7.9 and the maximum 51.9
+ 7.9? This seems most unlikely.

Sometimes the main comparison in a paper is for the same subjects under different
conditions, e.g. before and after an intervention. A paired t test might be used. This test
uses the mean, standard deviation and standard error of the mean for the differences.
Authors often quote the P value from a paired test, but quote the standard deviation or
standard error for each condition separately, instead of for differences within the subject.
I had a sample of this:

`Most of the standard errors given are irrelevant, as it is the change within subjects which
is important, and the standard error of the mean difference is the relevant figure.'

One of my respondents complained about the same thing:

'Confidence intervals (or SE's) on group means, rather than on comparisons.'
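
A minimal sketch of the point with made-up paired data: the standard error that matters is that of the within-subject differences, not the separate standard errors of the two conditions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
before = rng.normal(100, 15, size=12)               # hypothetical measurements before intervention
after = before + rng.normal(5, 4, size=12)          # within-subject change of roughly 5 units

diff = after - before
se_diff = diff.std(ddof=1) / np.sqrt(len(diff))
t_crit = stats.t.ppf(0.975, len(diff) - 1)
result = stats.ttest_rel(after, before)

print(f"mean change {diff.mean():.1f} (SE {se_diff:.1f}), "
      f"95% CI {diff.mean() - t_crit * se_diff:.1f} to {diff.mean() + t_crit * se_diff:.1f}, "
      f"P={result.pvalue:.1g}")
# The separate SEs of 'before' and 'after' (about 15/sqrt(12) each) say little about the change.
```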

If the correct standard deviations and standard errors are given, it is much easier for other workers to incorporate your results in meta-analysis, to compare them with their own data, and so on.

Presentation 

I had very few comments specifically on presentation in my Lancet reviews, although my Allstat respondents had quite a lot to say. I made a suggestion that the zero should be included on the y-axis of a graph, and I made this point about a graph:

`I think a scatter plot, showing the actual data, would be much more informative. Are the
thin lines standard errors?'

On similar lines, one of my Allstat respondents complained about:

'Dynamite pushers, skyscrapers with TV-aerials'.


What he had in mind, and on which I had been commenting, was a graph like Figure 1:

Figure 1. Bar graph showing capillary density (per mm2) in the feet of ulcerated patients
and a healthy control group (data, but not graph, supplied by Marc Lamah).

You see graphs like this frequently in journals and it may come as a surprise to
researchers that many statisticians dislike them intensely. There are several reasons for
this. My Allstat respondents complained about:

'Summary graphs with less information than the original data.'

Compare Figure 2:

Figure 2. Scatter graph of the capillary density data.

Figure 2 shows the same data as Figure 1 in the form of a scatter diagram or dot plot.
This shows not only the relative magnitudes and the variability of the measurement in the
two groups, but also the distribution of the measurement. We can add the means and
standard deviations to the scatter diagram, as shown in Figure 3:

Figure 3. Scatter graph of the capillary density data with mean and standard deviation
added.
This now shows all the information in Figure 1 and Figure 2. If there are a large number
of points, the scatter diagram will become a mass of indistinguishable points. In this case
we can use box and whisker plots (see Bland 2000a), as in Figure 4.

Figure 4. Box and whisker graph of the capillary density data.

These do not give all the information in a scatter diagram, but they do show central
tendency, spread and the shape of the distribution. We can see from Figure 4 that the
distributions are roughly symmetrical, apart from one rather extreme point, that the
control group tend to have higher capillary density than the ulcer group, and that the data
are suitable for the t distribution to be applied.
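
A matplotlib sketch of these alternatives to the bar chart (the capillary-density values are simulated for illustration, not Marc Lamah's data):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
ulcer = rng.normal(20, 6, size=25)        # hypothetical capillary densities (per mm2)
control = rng.normal(28, 6, size=25)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))

# Dot plot with mean and SD marked, instead of a bar with an error line on top.
for i, data in enumerate([ulcer, control]):
    jitter = rng.uniform(-0.08, 0.08, size=len(data))
    ax1.plot(np.full(len(data), i) + jitter, data, "o", alpha=0.6)
    ax1.errorbar(i + 0.25, data.mean(), yerr=data.std(ddof=1),
                 fmt="_", color="black", capsize=4)
ax1.set_xticks([0, 1])
ax1.set_xticklabels(["Ulcer", "Control"])
ax1.set_ylabel("Capillary density (per mm2)")

# Box and whisker plot, better when there are too many points for a scatter.
ax2.boxplot([ulcer, control])
ax2.set_xticks([1, 2])
ax2.set_xticklabels(["Ulcer", "Control"])

plt.tight_layout()
plt.show()
```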

My Allstat respondents had quite a lot to say. A common complaint about graphs such as
Figure 1, which I had made in my review, is that authors do not always make clear what
the vertical lines represent, standard deviations, standard errors, or confidence intervals,
an irritation which I mentioned above concerning '±' notation. A third objection to the bar
graph shown in Figure 1 is that it has only four numbers in it, which could be reported
much more efficiently in the text. Two of my respondents made similar points:

'Using bar charts to show that the proportion of women in the study was 55% and men
45%, and similar low information ways of using ink and space.' (Two similar replies.)

On the other hand, one respondent complained about:

'Tables of data with (literally) hundreds of figures when the information content is
minimal and a graph would be more useful.'
The Lancet instructs its authors to 'Use graphs as an alternative to tables with many
entries'. Personally, I am usually inclined to tables rather than graphs. I think that this bias
(yes, I have them!) arises because I do not have a strong visual imagination or ability to
think pictorially. However, I also think that the argument that other researchers can make
use of your findings more easily if they are presented numerically rather than graphically
is a forceful one, and this should lead us to choose numbers when in doubt.

I have no problems with the view of my respondents who were irritated by authors:

'Giving far too many decimal places.' (3 replies).

The week before writing this, I reviewed a paper which gave all P values, F statistics, and
even degrees of freedom to four decimal places, e.g. 'F=1.9367 with 34.3452 and 45.3298
degrees of freedom, P=0.0189'. This used an approximation to the F distribution which
involved changing the degrees of freedom, making them fractional. Now I doubt that the
F statistic conveys much useful information anyway, but all those decimal places do not.
There is no point in reporting F, t, or chi-squared statistics to more than two decimal
places. I do not think that anything would be lost by reducing the decimal places to two
here: 'F=1.94 with 34.35 and 45.33 degrees of freedom, P=0.019'. Indeed, I would render
the P value to one significant figure: 'P=0.02'. Only the first non-zero number and the
number of zeros preceding it are important. The reason for this profligate and
unconsidered reporting of many decimal places must be that computer programs deliver
them. Programmers try to give the users everything they could possibly want and if the
program calculates the F statistic to seven significant figures, why not print them out?
But this is no reason for the researcher to burden his readers with them. They often make
text and tables much more difficult to read. Correlation coefficients are a frequent example. Programs often print them to four decimal places, but is there really any important difference between 'r=0.3421' and 'r=0.3379'? I think that 'r=0.34' would do very nicely for both and make the meaning of text and tables easier to grasp.

One respondent complained about something which I also dislike:

'Using multiple crosshatched three-dimensional bars' (2 replies).

I find that three-dimensional effects seldom make a graph clearer. The effect is usually to
make it more difficult to read.

Assumptions 
Many statistical methods require the data to meet some assumptions, such as that data
follow a Normal distribution with uniform variance. Such assumptions are often not
checked, particularly for t methods. The statistical referee can often detect skewness from
the data and graphs given in the paper (Altman and Bland 1996). One giveaway is a
standard deviation which is greater than half the mean, which implies that two standard
deviations below the mean would be a negative number. For most measurements negative
values are impossible we could not have any observations less than mean minus two
standard deviations, and 2.5% of observations from a Normal distribution would be found
there. Such data cannot therefore be from a Normal distribution. Another is to give mean
or median and quartiles or extreme values. If the mean or median is not close to the
centre of the interval determined by the limits, we should suspect that the distribution is
skew. Yet another betrayer of non-Normal distributions can arise when the mean and
standard deviation or standard error are calculated separately for several different groups,
then given in a table or graph. The standard deviation should not be related to the mean.
Often we see that groups with large means also have large standard deviations. A scatter
diagram of the data, while highly desirable, can also reveal deviations from the
assumptions of statistical methods. I had three examples of obvious deviations from
assumptions in my 15 papers:

`Are the thin lines standard errors? If so, they suggest that the data are not Normal, which
casts doubt on the F test.'

`I would be surprised if these measurements followed Normal distributions. Figure 2 suggests that this is not the case, as the distribution of X looks positively skew. The authors should check the distributions of their variables, and use a logarithmic transformation where appropriate.'

`The data are very skewed, positively for X (mean 17.6, range 16.0-21.7) and negatively
for Y (mean 8.6, range 4.9-9.4). This is produced by the selection criteria for the trial,
which accepts subjects with X > 16.0 and Y < 9.5. No attempt is made to allow for this in
the analyses, which assume that data follow Normal distributions.'

To my surprise, only one of my respondents mentioned this:

'Authors who don't attempt to check the normality of their data and use normal theory
with clearly non-normal data.'
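
A small sketch of the giveaway and of the usual remedy, using simulated positively skewed data (the lognormal parameters are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.lognormal(mean=1.0, sigma=0.8, size=40)     # hypothetical positive, skewed measurements

mean, sd = x.mean(), x.std(ddof=1)
print(f"mean {mean:.1f}, SD {sd:.1f}")
if sd > mean / 2:
    print("SD is more than half the mean: for a positive-only variable this suggests skewness")

# A logarithmic transformation often brings such data much closer to a Normal distribution.
print("skewness before:", round(float(stats.skew(x)), 2),
      " after log:", round(float(stats.skew(np.log(x))), 2))
```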

Incorrect descriptions of statistical methods 
The Lancet specifies that authors should: 'Put a general description of methods in the
Methods section. When data are summarized in the Results section, specify the statistical
methods used to analyze them.' This is good advice. It is certainly annoying when authors
do not tell the reader what statistical method is being used, and I had an instance of this in my 15 reviews, in which I complained that:

`The statistical test used should be stated.'

My Allstat respondents thought this was an important problem, complaining about:

'Authors who assume that the description of the statistics is so unimportant that they don't
actually give any information at all' (5 similar replies).
One had a specific complaint about authors:

'Stating only that "statistical analysis was done using x computer package"'.

Telling us which package was used is important, as they are not all the same and many
statistical methods can be implemented in different ways which may give different
answers. Indeed, the Lancet asks for it: 'Specify any general-use computer programs
used'. But it is not enough to tell us what is being done. In mathematical language, we
would say that it is necessary but not sufficient. This reported statistical methods section
deserves to become a classic of pointless minimalism:

'The analysis was performed on an IBM486, under MSDOS'

A less frequent, but also irritating, practice is not using the methods stated in the method
section of the paper. It is easy to do this, as papers often go through many drafts, with
parts being cut out and new ones inserted, but it is annoying when an obscure method is referenced and the referee spends time looking it up only to find that the time had been wasted. I had an example of this in my 15 papers:

`I do not think Hotelling's t test is actually used anywhere.'

An Allstat respondent made the same point:

'Reference in the methods section to analyses undertaken but with no results appearing
anywhere in the report.'

This comment from my reviews combined a method reported in the method section
which was not used with not saying what was done in the analyses which were reported:

`I think that tests other than paired t tests were done. I can't actually find any data suitable
for a paired t test. ... the appropriate method would be Fisher's exact test, which gives
P=0.2 ... this should be a rank correlation. I get tau=0.37, P=0.08 . . . The appropriate
method would be Fisher's exact test, which gives P=0.09.'

I have no idea what they had actually done, but I was pretty confident that whatever it
was, was wrong. Sometimes I had to pinch myself to reassure myself that this was not a
ghastly nightmare, and that people had really submitted this stuff to the world's most prestigious medical journal.

Baseline characteristics in randomised trials 
Baseline characteristics deserve special mention because two common parastatistical
practices relate to them. Baseline characteristics are those which we record after subjects
have been recruited to the trial but before treatment begins. There are several good
reasons for making and reporting baseline measurements. The first of these is obvious:
we want to describe the population which our trial subjects represent. The second is that
we want to check and demonstrate that the randomization process has worked. This is not
always the case. I was asked to advise on a trial where a programming error had resulted
in almost all the older subjects being allocated to one arm of the trial and almost all the younger subjects to the other. My advice had to be 'Do it again' (MacArthur 2001). The
third is that we may want to adjust the treatment difference for prognostic variables. If a
variable measured at baseline is a strong predictor of the outcome of treatment, adjusting
for it statistically may reveal treatment effects which were masked. Altman
(1991) gives a good example.

The first common parastatistical mistake is to carry out tests of significance on the
baseline variables between the randomized treatment groups. Randomization produces
treatment groups which are random samples from the same population. Therefore, any
null hypothesis that states that there is no difference between the populations from which
the groups come is true. Any significant differences between the treatment groups have
arisen by chance; they are type I errors. I had two examples of this in my 15 reviews:

`The tests of significance at baseline should not be done. If the subjects are randomized,
they come from the same population and the null hypothesis is true. There is no reason to
test it.'

`There is no need to test the difference between the groups before the withdrawal of
treatment. Because they are randomised, they are from the same population until
treatment is changed, and hence the null hypotheses are true.'

One of my Allstat respondents mentioned this, too, complaining about:

'Significance testing of baseline variables in RCTs.'

The second parastatistical error is that, having tested for differences between baseline
characteristics, adjustment of the difference in the outcome measurement between
treatments is done for those variables which are significant on the baseline
measurements but not for any others. It is not the chance relationship of baseline
variables to treatment which is important, but their relationship to the outcome variable.
Even when the treatment groups are exactly balanced for the prognostic variable,
adjusting for it statistically should remove a lot of variability from the error term and so
make confidence intervals narrower and possibly make P values smaller. I had a good
example of this approach in one of my reviews:

`The statement that adjustment for baseline characteristics is not needed because baseline
differences are not significant is quite wrong. Such adjustments may reduce the
variability and so improve the power.'

An Allstat respondent made the same point, complaining about authors:

'Not reporting analyses adjusted for baseline values of prognostic covariates.'
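
A minimal sketch of the point with simulated trial data (the variable names and effect sizes are invented): adjusting for a strong prognostic baseline variable shrinks the standard error of the treatment effect, whether or not the groups happen to be 'significantly' unbalanced at baseline:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 100
baseline = rng.normal(140, 15, size=n)                # hypothetical prognostic measurement
treat = rng.integers(0, 2, size=n)                    # randomised 0/1 allocation
outcome = 0.6 * baseline - 5 * treat + rng.normal(0, 10, size=n)

df = pd.DataFrame({"outcome": outcome, "treat": treat, "baseline": baseline})
unadjusted = smf.ols("outcome ~ treat", data=df).fit()
adjusted = smf.ols("outcome ~ treat + baseline", data=df).fit()

print("SE of treatment effect, unadjusted:", round(unadjusted.bse["treat"], 2))
print("SE of treatment effect, adjusted:  ", round(adjusted.bse["treat"], 2))
```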


A miscellany 
A lot of other issues came up once or twice, either in my own reviews or from my
correspondents. I think that this represents the tip of a very large iceberg of possible
mistakes on the part of researchers. I present them in the hope that my readers will in
future avoid these particular ones at any rate.

An occasional mistake is to include repeated measurements on the same subject as if they were different subjects. The data are then analysed using methods which assume that the observations are independent. This can have the effect of making P values too small and confidence intervals too narrow. I had a couple of examples in my reviews:

`It is wrong to mix multiple observations from different subjects in this way (Bland and
Altman 1994). An appropriate method is described by Bland and Altman (1995).'

`It is not clear why two subjects were measured twice. Inspection of Table 1 suggests that
the intention was to measure at 18 hours but that subject 3 was tested additionally at 2
hours and subject 5 at 48 hours. This should be clarified. Repeat observations on the same
subject and observations on different subjects cannot be mixed as if they were all
independent. I suggest that the first observation on subject 3 and the second on subject 5
should be omitted from the statistical analysis, as they are at very different times.'

The same problem can occur on a larger scale:

`However, they ignore the fact that these 21 groups of subjects are from 9 different trials,
and analyse the data as if they are all from the same population.'

Again, this would have the effect of making the P values too small and the confidence intervals too narrow. There are well-established methods of meta-analysis (see, for
example, Bland 2000b) for carrying out the combination of data from different trials and
authors should use them.

Significance test methods based on rank order, such as the Mann Whitney and Wilcoxon
tests and those associated with the Spearman and Kendall rank correlation coefficients,
are inappropriate when samples are very small. One cannot have a significant two-sided
test at the 5% level when samples are smaller than two groups of four for the Mann
Whitney U test or less than six for the Wilcoxon paired test or the rank correlation
coefficients. Each possible rank ordering has probability greater than 0.05. Hence rank
methods on very small samples are inevitably not significant and there is no point in
using them. I made this point in one of my reviews:

`Rank methods are inappropriate for such small samples as they cannot detect any
differences, no matter how large the difference is.'
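
A quick check of the arithmetic with scipy: with two groups of three, even complete separation of the groups cannot reach two-sided significance at the 5% level in an exact Mann-Whitney test.

```python
from scipy import stats

a, b = [1, 2, 3], [10, 11, 12]     # the most extreme ranking possible with two groups of three
result = stats.mannwhitneyu(a, b, alternative="two-sided", method="exact")
print(result.pvalue)               # 0.1: the smallest two-sided P value obtainable here
```
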
Curiously, I have been asked by publishers to review at least three proposals for
introductory statistics text-books (not written by statisticians) which contained the
statement that when we have fewer than six observations we should use non-parametric
methods, because parametric methods such as t tests are inappropriate, it being
impossible to verify the Normal distribution assumptions. The opposite is the case,
because parametric methods can produce significant differences for very small samples
although rank-based methods cannot. I wish I knew the source of this often-repeated idea.
As for checking the Normal assumption, we often have a good idea from other data
whether this is reasonable.

Correlation coefficients can cause a problem because there is an assumption that the sample is a representative (i.e. random) sample of its population and that both variables are
random variables. They should not be used when the values of one variable are set by the
experimenter. I had two instances of this in my reviews:

` . . . Correlation is inappropriate when one of the variables is fixed by the investigator (dose and time) . . . One and two sample t methods and regression should be used.'

`The statement that there is no significant correlation between time of measurement and
X is meaningless. The times are almost equal except for the duplicate measurements. The
ratio is much higher for the early measurement and much lower for the late measurement,
suggesting that there is a possibility of a strong relationship with time.'

One of my respondents, somewhat enigmatically, cited:

'Spurious use of correlation and regression (oh dear not again!)'

Statisticians mostly have a background in mathematics, as do I, and have been trained for
many years to think logically. Indeed, a colleague, Shirley Beresford, once remarked that
she thought that the main contribution of statisticians in medical research was not to carry
out statistical analyses but 'to inject a bit of logic into the situation'. So imbued with logic
are we that we can forget that this is not the only way of thinking and is not the main
method of thinking for most people, nor is it always the most useful. Thus to us this one
is jaw-dropping:

`The comparisons of X means between the low X and high X groups are not useful. If we
divide subjects according to X and then compare the mean X between the two groups, of
course it will be significant. We could do the same thing with their telephone numbers.'

Of course, the null hypothesis that the mean X will be the same in a group chosen to have X below a cut-off and a group chosen to have X above the cut-off is inevitably false. As we
know this, there is no point in testing it. I presume the authors simply split the subjects
into two groups then tested everything between them. One of my Allstat respondents
made a similar point about:
'Dichotomising continuous variables especially if they identify 'responders' and 'non-
responders' using these variables.'

Splitting the subjects into two groups using a continuous variable reduces the amount of
information which we have. P values may become larger and we may miss important
relationships. Some researchers might be tempted to split the sample not at an arbitrary
cut-off, such as the overall mean, but to choose a cut-off to minimise a P value and make
a relationship significant. This is a real misuse of statistics and will produce misleading
results.

The authors of one of the Lancet papers were particularly unlucky (or lucky, depending
how you look at it) because they were applying my own work on agreement between
methods of measurement and received this comment:

'I suggest replacing the term "95% confidence intervals of agreement" by "95% limits of
agreement". The "95% limits of agreement" of Bland and Altman are not a confidence
interval, but two point estimates.'
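
For reference, a minimal sketch of the 95% limits of agreement themselves, computed from hypothetical paired measurements by two methods:

```python
import numpy as np

rng = np.random.default_rng(7)
method_a = rng.normal(100, 20, size=30)            # hypothetical measurements by method A
method_b = method_a + rng.normal(2, 5, size=30)    # method B differs by a bias plus random error

diff = method_a - method_b
mean_diff, sd_diff = diff.mean(), diff.std(ddof=1)

# Two point estimates (mean difference +/- 1.96 SD of the differences), not a confidence interval.
lower, upper = mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff
print(f"95% limits of agreement: {lower:.1f} to {upper:.1f}")
```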

My Allstat respondents came up with a lot more. One mentioned:

'Chi-square test analyses of ordered categorical data.'

What was meant is that we often have categorical data where the categories are ordered
in some way, such as physical condition being classified as 'poor', 'fair', 'good' or
'excellent'. The usual chi-squared test for a contingency table ignores this ordering and
tests the null hypothesis of no relationship of any sort between the variables. This is usually a mistake, but an understandable one. Many
textbooks use examples with ordered categories to illustrate chi-squared tests.
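
As an illustration with an invented 2 x 4 table (hypothetical counts, not data from any paper), here is the ordinary chi-squared test alongside a linear-by-linear (chi-squared for trend) test that uses the ordering:

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical table: two groups (rows) by an ordered outcome
# (columns: poor / fair / good / excellent).
table = np.array([[10, 20, 25, 15],
                  [ 5, 15, 30, 30]])

# Ordinary chi-squared test: ignores the ordering of the outcome categories.
chi2_stat, p_ordinary, dof, expected = chi2_contingency(table)
print("ordinary chi-squared P:", p_ordinary)

# Linear-by-linear association (chi-squared test for trend) with equally spaced scores.
row_scores, col_scores = np.array([0, 1]), np.array([1, 2, 3, 4])
n = table.sum()
r = np.repeat(np.repeat(row_scores, table.shape[1]), table.ravel())
c = np.repeat(np.tile(col_scores, table.shape[0]), table.ravel())
m2 = (n - 1) * np.corrcoef(r, c)[0, 1] ** 2     # Mantel-Haenszel statistic, 1 df
print("trend test P:", chi2.sf(m2, df=1))
```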

Another gave the example of

'Rate per 1000 person-years = 3 (95% CI -3 to 9).'

The rate of something per year cannot be negative, so the calculation of the confidence
interval has produced an impossible lower limit. This happens because researchers apply methods designed for the analysis of large samples or large numbers of events to small
samples or small numbers of events. They calculate standard errors and then calculate the
confidence interval using the Normal distribution, as the observed value ± 1.96 standard
errors. But if the number of events or the sample size is not large enough for this Normal
approximation we can get negative lower limits. The same thing can happen with
proportions close to the top of their range of possible values, such as sensitivities and
specificities, which are sometimes given confidence intervals with upper limits above
100%. There are better approximations and exact methods which can be used in these
cases to give confidence intervals which do not include impossible values. Even zero
would be an impossible lower limit for the rate in the example, for if in the sample we
had observed a case, as we must to get a rate of 3 per 1000 person-years, then the rate in
the population cannot be zero. We sometimes see confidence intervals like the one given
presented as '3 (95% CI 0 to 9).' This happens because researchers calculate the interval
as -3 to 9, recognise that -3 is impossible, and replace it with zero.
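
A minimal sketch of an exact interval for the quoted example of 3 events in 1000 person-years, using the standard exact limits based on the chi-squared distribution (often called the Garwood interval); the function is mine, and the limits can never be negative or zero when an event has been observed:

```python
from scipy.stats import chi2

def exact_poisson_rate_ci(events, person_time, conf=0.95):
    """Exact confidence interval for a rate based on a Poisson count."""
    alpha = 1 - conf
    lower = chi2.ppf(alpha / 2, 2 * events) / 2 if events > 0 else 0.0
    upper = chi2.ppf(1 - alpha / 2, 2 * (events + 1)) / 2
    return lower / person_time, upper / person_time

lo, hi = exact_poisson_rate_ci(3, 1000)            # 3 events in 1000 person-years
print(f"rate 3.0 per 1000 person-years, 95% CI {lo * 1000:.1f} to {hi * 1000:.1f}")
# Roughly 0.6 to 8.8 per 1000 person-years: wide, but never negative and never zero.
```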

My respondents made a couple of general points about the way statistics is carried out in
medical research. One complained about:

'Papers where the statistical methods are copied from a previous paper in the field, which
was in turn copied from a previous paper, which was in turn . . .'

This undoubtedly happens, and most statisticians have had the experience of researchers
who say that a published paper had used a particular method of which the statistician
disapproves, and was published, so why shouldn't they? Another respondent complained
about:

'Doctors who don't realise that statistics is an advancing science; and the best methods of
20 years ago are not always the best methods of today.'

Well, I think that there are plenty of statisticians in this category, too, and I have no doubt
that I am guilty of this from time to time. I do not think we can expect researchers to keep
up with what is happening in statistics as well as in their own field. Perhaps, though, we
can expect them to embrace a new and better technique when the referee has pointed it
out.

One despondent respondent commented:

'There is no hope, at times.'

Not taking us seriously 
Some of my respondents complained about authors' attitude to statisticians. These
included:

'Papers which show no sign of having had input from a statistician.'

I can sympathise with this, but statisticians can be hard to find for many researchers. The
trouble is, you don't know what you don't know, so it is hard to spot your own mistakes or
to realise that you need help. I think that it should be much easier for researchers to get
not just statistical advice but also collaboration. Trying to teach doctors how to analyse
their own data is very inefficient. It requires a different way of thinking from medicine,
and few people can do both. It is much better to train statisticians to collaborate with
them. An additional advantage, unfortunately, is that we do not pay the statistician as
much as the doctor, so it makes economic sense too. Another respondent felt that
statisticians did not get the prominence they deserved:
'Acknowledgements to a statistician who clearly did all the analysis and should be on the
paper.'

Researchers sometimes ask me whether I would like to be acknowledged for my help. I usually paraphrase Oscar Wilde and tell them that there is only one thing worse than
being acknowledged, and that is not being acknowledged. I think that the role of the
statistician in research is often worthy of authorship, but when I think I am entitled to be
an author I am usually welcomed. I think that statisticians have to make clear to
researchers who consult them that they have to have something to show for the time they
spend in advisory work and that if they make a real contribution, they should be included
in the author list. On the other hand, I often refuse authorship because I feel that I have
not done enough or could not defend the paper.

Two respondents commented on the attitude of authors to statistical referees:

'People who ignore referees comments and send [the] paper to another journal.'

Sometimes this is all an author can do, but I agree that usually authors should take note of
what referees say. If, as can happen, the referee has missed the point of the paper entirely,
the author should ask why and see how the point can be clarified. Another respondent
mentioned:

'The view of many doctors that any comment made by a statistician regarding the quality
of the design must by definition be niggling and unimportant.'

I have been accused of being an academic who does not understand the real world of life
and death in which doctors operate. This may be true, but so what? I understand
something about the world of research and its interpretation. On the whole, though, I get on very well with the medical profession and have found them warmly welcoming.

The author bites back 
Some respondents did not answer my question about what researchers did to annoy
referees, but got a few things off their chests about what reviewers did to annoy authors.
One complained about:

'Making comments which you know are a matter of opinion and not fact without
declaring them as such.'

This is fair enough. If a referee knows that something is only a matter of opinion, they
should not condemn others for disagreeing. Another complained about referees:

'Suggesting extensions to analyses which you know will involve far more work than is
justified by any likely improvement to the analysis.'
If a referee did really know this then complaints would be justified. Another respondent
did not like referees:

'Taking far more time to review a manuscript than is reasonable.'

Mea culpa to that. Refereeing is a difficult task for which one gets little or no reward and which competes for time with the work for which the statistician is paid. Some journals do pay a small fee, but it could not possibly compensate for the time spent in
understanding a paper and finding the holes in it. However, I will try to do better.

'Using the anonymity usually afforded to pursue your own interests.'

My own experience as a statistical referee is that I am not remotely interested in the papers which I am sent, and I am not clear how I could pursue my own interests by impeding
their publication. This is more likely to be a complaint about specialist referees who are
working in the same area.

'I am giving a pet hate of my own about statistical referees. It is the apparently absolute
conviction that their own method of dealing with a data set, whether it be by confidence
intervals for differences between groups, their favourite (and usually obscure) measure of
agreement, or idiosyncratic ways of normalising data before analysis, is the only right
and proper one. In fact, as we all know, a collection of statisticians represents a variance
of at least two standard deviations, and they agree to an even lesser extent than
psychiatrists. So let's have a bit more humility, please.'

I wondered if the comment about the measure of agreement was a dig at myself. I am
quite keen on confidence intervals for differences, too. However, it is certainly true that
there is often more than one acceptable way to analyse data. I am irritated by referees
who always insist on nonparametric methods because they do not believe that any data
follow a Normal distribution, and by those who always insist that nonparametric methods
are replaced by parametric ones.

What really upsets me 
When I first gave this talk, without the Allstat sample, one of my audience said that he
did not think that any of the things I had mentioned really upset me. He thought that what
really annoyed me was statistics not being taken seriously by researchers.

I did not think this was the case. I think that what really upset me about this refereeing
experience was that there were so many errors in so few papers, and in papers submitted
to one of the world's most prestigious medical journals. The journal's own guidelines
were ignored. Nothing about most of these papers suggested that the authors had read
them.
This suggests a lack of care about research, regarding it as an unimportant activity which
does not merit the effort which one hopes these medical researchers put into other aspects
of their work. This matters. Incorrect analysis may lead to incorrect conclusions.
Incorrect conclusions may lead to incorrect treatments and advice to patients. People can
die.

How to avoid upsetting the statistical referee 
We can draw a few tentative conclusions from this study. The points to bear in mind above all are:

1. Read the journal's instructions to authors. If they do not cover statistics, use those
of one of the major general medical journals.
2. Never, ever, conclude that there is no difference or relationship because it is not
significant.
3. Give confidence intervals where you can.
4. Give exact P values where possible, not P<0.05 or P=NS, though only one
significant figure is necessary.
5. Be clear what your main hypothesis and outcome variable are. Avoid multiple
testing.
6. Get the design right, be clear about blinding and randomisation, do a sample size
calculation if you can.
7. Be clear whether you are quoting standard deviations or standard errors, avoid '±'
notation.
8. Avoid bar charts with error bars.
9. Check the assumptions of your statistical methods.
10. Give clear descriptions of your statistical methods.
11. Decide for which baseline characteristics you should adjust in advance, then do it.

A good aid to writing up clinical trials, and worth reading anyway, is the CONSORT
statement (Moher et al., 2001), a template for doing this developed by a group of
statisticians and trialists. If you follow this you should sail through the refereeing process.

And finally 
I'll finish this talk with three comments from my Lancet reviews:

`The statistics are all wrong but it should be fairly easy to put them right. What a huge
number of authors and none of them understand statistics!'

`Why do they do a totally statistical project without a statistician? I suggest they get one!'

And just to show that not all my 15 reviews were negative:


`My comments are very minor, not enough to make me rate any part of the paper as
inadequate. I like it.'

Acknowledgements 
I thank Donald Singer for first suggesting the topic, the editors of the Lancet for
providing such rich source material, and my Allstat respondents, including Colin
Chalmers, Rick Chappell, Tim Cole, Margaret Corbett, Carole Cull, Keith Dear, Michael
Dewey, Simon Dunkley, the late Nicola Dollimore, Clarke Harris, Dan Heitjan, Jim
Hodges, Alan Kelly, Peter Lewis, Russell Localio, Alison Macfarlane, Sarah MacFarlane,
David Mauger, Richard Morris, Ian Plewis, Mike Procter, Paul Seed, Stephen Senn, Jim
Slattery, Anthony Staines, Graham Upton, Andy Vail, Ian White, Sheila Williams, Ian
Wilson, and a few whose names did not come through with the email.

References 
Altman, D.G. (1991) Practical Statistics for Medical Research. Chapman and Hall, London, pp. 389-391.

Altman, D.G. and Bland, J.M. (1996) Detecting skewness from summary information. British Medical Journal 313, 1200.

Altman, D.G. and Bland, J.M. (2003) Interaction revisited: the difference between two estimates. British Medical Journal 326, 219.

Altman, D.G. and Matthews, J.N.S. (1996) Interaction 1: heterogeneity of effects. British Medical Journal 313, 486.

Bland, J.M. and Altman, D.G. (1994) Correlation, regression and repeated data. British Medical Journal 308, 896.

Bland, J.M. and Altman, D.G. (1995) Calculating correlation coefficients with repeated observations: Part 1, correlation within subjects. British Medical Journal 310, 446.

Bland, M. (2000a) An Introduction to Medical Statistics, 3rd edition. Oxford University Press, Oxford. Section 4.5, Medians and quantiles.

Bland, M. (2000b) An Introduction to Medical Statistics, 3rd edition. Oxford University Press, Oxford. Section 17.11, Meta-analysis: data from several studies.

Chalmers, I. (1999) Why transition from alternation to randomisation in clinical trials was made. British Medical Journal 319, 1372.

Gardner, M.J. and Altman, D.G. (1986) Confidence intervals rather than P values: estimation rather than hypothesis testing. British Medical Journal 292, 746-750.

MacArthur, C., Shennan, A.H., May, A., Whyte, J., Hickman, N., Cooper, G., Bick, D., Crewe, L., Garston, H., Gold, L., Lancashire, R., Lewis, M., Moore, P., Wilson, M., Bharmal, S., Elton, C., Halligan, A., Hussain, W., Patterson, M., Squire, P. and de Swiet, M. (2001) Effect of low-dose mobile versus traditional epidural techniques on mode of delivery: a randomised controlled trial. Lancet 358, 19-23.

Matthews, D.E. and Farewell, V. (1988) Using and Understanding Medical Statistics, second edition. Karger, Basel.

Matthews, J.N.S. and Altman, D.G. (1996a) Interaction 2: compare effect sizes not P values. British Medical Journal 313, 808.

Matthews, J.N.S. and Altman, D.G. (1996b) Interaction 3: how to examine heterogeneity. British Medical Journal 313, 862.

Moher, D., Schulz, K.F. and Altman, D.G. (2001) The CONSORT statement: revised recommendations for improving the quality of reports of parallel group randomized trials. Lancet 357, 1191-1194.

Newnham, J.P., Evans, S.F., Con, A.M., Stanley, F.J. and Landau, L.I. (1993) Effects of frequent ultrasound during pregnancy: a randomized controlled trial. Lancet 342, 887-891.

Schulz, K.F., Chalmers, I., Hayes, R.J. and Altman, D.G. (1995) Bias due to non-concealment of randomization and non-double-blinding. Journal of the American Medical Association 273, 408-412.

Appendix 
From the Lancet's instructions to authors:

Statistics 

Describe statistical methods with enough detail to enable a knowledgeable reader with
access to the original data to verify the reported results. When possible, quantify findings
and present them with appropriate indicators of measurement error or uncertainty (such
as confidence intervals). Avoid relying solely on statistical hypothesis testing, such as the
use of P values, which fails to convey important quantitative information. Discuss the
eligibility of experimental subjects. Give details about randomization. Describe the
methods for and success of any blinding of observations. Report complications of
treatment. Give numbers of observations. Report losses to observation (such as dropouts
from a clinical trial). References for the design of the study and statistical methods should
be to standard works when possible (with pages stated) rather than to papers in which the
designs or methods were originally reported. Specify any general-use computer programs
used.

Put a general description of methods in the Methods section. When data are summarized
in the Results section, specify the statistical methods used to analyze them. Restrict tables
and figures to those needed to explain the argument of the paper and to assess its support.
Use graphs as an alternative to tables with many entries; do not duplicate data in graphs
and tables. Avoid nontechnical uses of technical terms in statistics, such as "random"
(which implies a randomizing device), "normal," "significant," "correlations," and
"sample." Define statistical terms, abbreviations, and most symbols.

The Lancet's full instructions to authors are well worth reading.

