Understanding Statistics - Hypothesis Testing & Confidence Intervals

STATISTICS UPDATE
Statistics: more than pictures

M. Scott, D. Flaherty* and J. Currall
School of Mathematics and Statistics, University of Glasgow, Glasgow G12 8QW
*School of Veterinary Medicine, University of Glasgow, Glasgow G61 1QH
IT Services, University of Glasgow, Glasgow G12 8QQ
This, the third of our series of articles on statistics in veterinary medicine, moves onto the more complex concepts of hypothesis testing and confidence intervals. As these two areas are widely discussed in many clinical research publications, an awareness of the underlying methodology behind their use is essential to appreciate the information they convey.
Journal of Small Animal Practice (2012) 53, 12-18 DOI: 10.1111/j.1748-5827.2011.01168.x Accepted: 25 November 2011
INTRODUCTION
In the previous article (Scott and others 2011), we focussed on exploration and the more subjective but still important aspects of visualising and summarising data, using material from a study by Bell and others (2011) (Table 1). In this article, we will highlight some more formal statistical methods, namely, hypothesis tests and confidence intervals, and introduce them for two simple experimental situations. Underpinning these methods is the idea of a probability model, and, in the last article, we introduced the normal or Gaussian distribution and some of its properties, which will serve for our model. We also showed an example of a probability plot to verify the particular distributional assumption we make. To highlight the clinical use of some of these statistical methods, we will introduce a further study in this article, an investigation of the anaesthetic sparing effect of brachial plexus block (BPB) in cats (Mosing and others 2010; Table 1). Although various quantities were measured in their experiment, we will consider only the cardiorespiratory variables (heart rate, respiratory
rate and non-invasive arterial blood pressure), and, for simplicity, we will simply look at one timepoint, although measurements were made at five timepoints (including all five timepoints in our analysis will become the topic of a later article).

Table 1. Basic details of the two studies being discussed in this article
Bell and others (2011) performed a blinded investigation assessing the sedative, cardiorespiratory and propofol-sparing effects in dogs of two doses of dexmedetomidine in combination with buprenorphine, compared with acepromazine in combination with buprenorphine. Sixty dogs (20 in each group) were recruited to the study, although 1 dog was subsequently excluded. Heart rate, arterial blood pressure, respiratory rate and quality of sedation were recorded by the authors, as well as propofol dose requirements.
Mosing and others (2010) performed a blinded study evaluating the isoflurane sparing effect and the postsurgical analgesia provided by a brachial plexus nerve block in cats undergoing distal thoracic limb surgery. Twenty cats were recruited to the study, with 10 undergoing conventional anaesthesia and another 10 undergoing anaesthesia combined with a BPB. Two cats were subsequently excluded from the study (one from each group). The investigators recorded a number of cardiovascular and respiratory variables, in addition to isoflurane requirements and postoperative pain scores.

Hypothesis testing and construction of confidence intervals are part of the process of statistical inference, and Fig 1 tries to explain what inference is about. Statistical inference is the process whereby, on the basis of the experiment we performed, the data observed and a statistical model, we draw conclusions and make statements about the real world, not restricted to the experiment we carried out. Before proceeding to introduce hypothesis testing and confidence intervals, it is necessary to define a basic vocabulary that is commonly used.
variable: a single aspect or characteristic of interest, e.g. heart rate
population: a large group of individuals, e.g. in regard to the Mosing and others (2010) study, this would be all cats undergoing distal thoracic limb surgery
sample: a subset of the population, e.g. from the Mosing and others (2010) paper, this would be the 18 such cats they included in their study
parameter: a single number summarising the values of a variable in the population, e.g. population mean heart rate or the population variance
statistic: a single number summarising the values of a variable in a sample, e.g. the sample mean heart rate, sometimes called the estimate of the population parameter. A statistic will be subject to sampling variability, and this is quantified in the estimated standard error.
FIG 1. Diagram illustrating the concept of statistical inference (elements shown: real world, problem and population; model, data and statistical model; analysis; inference; interpretation and conclusions)
Journal of Small Animal Practice
Vol 53
January 2012
2011 British Small Animal Veterinary Association
Statistics review
A parameter is generally unknown (usually because it is not possible to measure the variable on the entire population); the corresponding statistic and the variability of that statistic are then used as a tool to make inferences about the unknown parameter. This means we make a statement about a specific sample and then try to generalise it to the entire population. We should be particularly concerned about how the sample is to be drawn from the population and how representative of the population the sample we have collected actually is.

We have seen in the previous article (Scott and others 2011) how data from the paper of Bell and others (2011) demonstrated that heart rate varied amongst study subjects even before they were assigned to any treatment, so that handling variability is an important aspect of any statistical analysis. If we repeated the experiment by Bell and others (2011), and so identified a further sample of 60 subjects, they also would vary in terms of heart rate; this is sometimes referred to as sampling variability.

The objectives of our designed experiment are concerned not with the properties of just the individuals that we have been working with, but with the properties of the wider population. In specifying the objectives, we might use statements such as "is the effect of drug A on blood pressure significant?", "is there a difference between the mean heart rate of dogs on treatment A compared with treatment B?" and so on. The analysis of the results from suitably designed experiments will be used to answer such questions, but we only have a sample from the population and we cannot, therefore, be 100% certain about the true value of the population attribute (e.g. arterial blood pressure). The results of our experiment are subject to uncertainty, and in seeking to answer the experimental question, we need to take account of both sampling variability and any intrinsic uncertainty (e.g. if the equipment used was only calibrated to measure arterial blood pressure to within the nearest 10 mmHg, there will be an inherent inaccuracy in the answer we obtain).

There is a variety of different hypothesis tests and confidence intervals depending on the experimental design and the questions of interest, but all have certain principles in common. This article will first describe those principles and then take two simple examples to show those principles in action.
THE PRINCIPLES OF HYPOTHESIS TESTS AND CONFIDENCE INTERVALS

Hypothesis tests
The hypothesis testing approach requires that we formulate two competing hypotheses. These are called the null hypothesis (nothing of interest happens) and the alternative hypothesis (what we really expect to happen). The alternative hypothesis may be thought of as being the study hypothesis, whilst the null hypothesis is its opposite. These hypotheses concern one or more population parameters which the experiment is designed to examine. So carrying out a hypothesis test is about making a choice between two possible descriptions of the world. Usually, we begin by defining the study (alternative) hypothesis and take the null as its converse. So, for the experiment by Mosing and others (2010), the study hypothesis would be that BPB reduces isoflurane requirements and postoperative pain in cats following distal thoracic limb surgery, while the null hypothesis would be that BPB does not alter the anaesthetic requirements or the degree of postoperative pain. In some ways, the null hypothesis is dull: it is the scientific status quo, that is, no effect.

For simplicity, we will focus on what are known as two-sided hypothesis tests, where the null specifies that the population parameter is equal to a specified constant or that two parameters are the same, and the alternative hypothesis says simply that the parameter is not equal to the specified constant or that the two parameters are not equal. In some situations (as in the Mosing and others study), there is information which would suggest the direction of the difference (e.g. pain with BPB is less than pain without BPB), which leads to what are called one-sided tests, but we will not consider those here.

Next, we define a test statistic which summarises the evidence based on our experiment. This will be a numerical value based on the results from our experiment. The actual form of the test statistic will vary from one hypothesis test to another.

Finally, we need to define a decision rule (usually formulated to include a rejection region). The decision rule says simply that if the observed value of the test statistic falls within the rejection region, then we should reject the null in favour of the alternative; otherwise, we do not reject the null. For many of the common situations, we evaluate the test statistic and reject the null if the test statistic is larger than a critical value (read from statistical tables), or more typically, by looking at the P value generated by the computer software we are using (we would reject the null hypothesis for small P values). This means that in reality with modern statistical software, most veterinarians can ignore the actual value of the test statistic and the rejection region and simply focus on the P value.

So what is a P value? Before answering this question, we need to introduce some further ideas. There are two types of mistake we could make when we reach a decision in the testing framework: rejecting the null hypothesis when we should not (sometimes called type 1 error), i.e. concluding there is a genuine difference or effect when there is no difference, and not rejecting the null hypothesis when we should (sometimes called type 2 error), i.e. concluding there is no difference or effect when there actually is. In an ideal world, we would not make any mistakes, but in reality, the best we can do is attempt to control the chance of making such errors while at the same time recognising that we cannot reduce the chance to zero. In statistical jargon, the probability of making a type 1 error (rejecting the null hypothesis when we should not do so) that we are prepared to accept is called the significance level of the test, and a result is declared statistically significant when the P value falls below that level. You may often see in scientific papers notation such as: *P<0.05; **P<0.01; ***P<0.001. In scientific research, P<0.05 is considered to correspond to a reasonable risk of this specific error, but the appropriate P value for significance depends on the study in question. If you have only your reputation to lose, P<0.05 is probably reasonable, but if animals' lives are at stake, you might want a rather smaller value. In general, it
is best to state the calculated P value and your conclusions and then let your readers decide if they agree with your assessment. The P value of the test is in fact the probability of obtaining a value for the test statistic as extreme or more extreme than the actual observed value when the null hypothesis is true. So it is a measure of how easily the result you have observed could have occurred by chance alone; the smaller the P value, the harder it is to explain the observed effect or difference as purely a fluke.

The principles of confidence intervals
A confidence interval is a range of credible values for a population parameter, or a difference between two population parameters. It is based on an estimate of the population parameter of interest (calculated from the sample of results obtained in our experiment) and its estimated standard error (dependent on how variable the individuals in the sample are and on the sample size). The estimated standard error is a statistic calculated from the sample standard deviation but is not the same as the sample standard deviation. While being related to the sampling variability, it also captures the precision with which we are able to estimate the population mean (so it is sometimes called the standard error of the mean) or the difference between two population means. The most commonly used method of interval estimation is to produce 95% confidence intervals. The justification for such intervals involves a probabilistic argument, ensuring that in the long run, 95% of such intervals will contain the true but unknown parameter value. A common approximate form of a 95% confidence interval is

estimate of the parameter of interest ± 2 × the estimated standard error.

Note that this form is appropriate for examples where we are interested in the population mean; it is approximate but works well for situations where we have sample sizes of 10 or more.
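The approximate interval just described can be sketched in a few lines of Python. This is only an illustration: the function names are our own, and the sample values (mean 25.0 kg, SD 3.0 kg, n = 16) are invented for the example and do not come from either study.

```python
import math

def standard_error_of_mean(sd, n):
    """Estimated standard error of a sample mean: SD divided by sqrt(n)."""
    return sd / math.sqrt(n)

def approx_ci95(estimate, standard_error):
    """Approximate 95% interval: estimate +/- 2 x estimated standard error."""
    half_width = 2 * standard_error
    return estimate - half_width, estimate + half_width

# Hypothetical sample (invented values): mean 25.0 kg, SD 3.0 kg, n = 16
sem = standard_error_of_mean(3.0, 16)
low, high = approx_ci95(25.0, sem)
print(f"SEM = {sem:.2f}, approximate 95% CI = ({low:.1f}, {high:.1f})")
```

The multiplier 2 is the rounded rule of thumb from the text; statistical software would normally use the exact critical value of the t distribution for the relevant degrees of freedom.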
We will discuss in more detail how the estimated standard error is arrived at when we carry out some calculations. Two-sided hypothesis tests and confidence intervals rest on the same statistical and probabilistic basis. Figure 2 shows 100 confidence intervals, each one created by repeating a hypothetical experiment. Imagine that we take random samples of 15 healthy adult Labradors and measure
their weight. Each time we take a sample, the exact individuals chosen vary and so, therefore, do their mean weight and the variability of the sample. We have done this using a computer to generate the results, which is much quicker and easier than weighing Labradors. In the computer, we repeatedly generated samples of 15 values from a probability distribution with a known true mean weight. From each sample, we can calculate a 95% confidence interval for the population mean weight of healthy adult Labradors. We repeated this 100 times to simulate taking 100 samples of 15 healthy adult Labradors and calculating the 95% confidence interval for the population mean weight. We have drawn all 100 intervals side-by-side in Fig 2.

So what can we see from this graph? First, the intervals represented by the vertical lines have different lengths, and they start and finish at different points. But secondly (and more interestingly), as this is a computer-generated experiment, we know what the true population mean weight really is. In the figure, this is identified by the horizontal line at 25 kg. This horizontal line does not pass through every interval; indeed, it misses four intervals, which means that those confidence intervals do not contain the true value, so in fact, 4% (4 of 100) of the intervals do not contain the true value. A 95% confidence interval should contain the true value 95% of the time, and in our case, it contains the true value 100 - 4 = 96% of the time. This computer simulation of an experiment illustrates that if we were to repeat the same experiment many times (whereas, in reality, we usually only do it once), in the long run, 95% of our confidence intervals will contain the true but unknown parameter value. The confidence coefficient is the percentage of times that the method will in the long run capture the true population parameter.
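A simulation of this kind can be sketched as follows. This is not the authors' program: the population standard deviation of 3 kg and the random seed are assumptions made here purely for illustration (the article states only the true mean of 25 kg and the sample size of 15), so the number of intervals that miss the true mean need not be exactly four.

```python
import random
import statistics

TRUE_MEAN = 25.0    # kg, the horizontal line in Fig 2
ASSUMED_SD = 3.0    # kg, an assumption: the article does not state the SD used
SAMPLE_SIZE = 15
T_CRIT = 2.145      # two-sided 95% point of Student's t with 15 - 1 = 14 DF

def simulate_intervals(n_intervals=100, seed=2012):
    """Draw repeated normal samples and return a 95% CI (low, high) for each."""
    rng = random.Random(seed)
    intervals = []
    for _ in range(n_intervals):
        sample = [rng.gauss(TRUE_MEAN, ASSUMED_SD) for _ in range(SAMPLE_SIZE)]
        mean = statistics.mean(sample)
        sem = statistics.stdev(sample) / SAMPLE_SIZE ** 0.5
        intervals.append((mean - T_CRIT * sem, mean + T_CRIT * sem))
    return intervals

intervals = simulate_intervals()
covered = sum(low <= TRUE_MEAN <= high for low, high in intervals)
print(f"{covered} of {len(intervals)} intervals contain the true mean of 25 kg")
```

Changing the seed changes which intervals miss the line at 25 kg, but over many runs the proportion that contain it should settle close to 95%.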
Convention tends to use 95%, but it is possible to use 99% intervals (much more conservative and the resulting interval will be wider).

What assumptions do these tests and confidence intervals require to be valid? Strictly speaking, both the test and confidence interval above assume that the data are drawn from normally distributed populations and, in the specific example of the Mosing and others study, that the cats in the two groups are independent. Deviations from normality can, however, be tolerated, but if the deviation is extreme, then it might be advisable to consider one of a number of possible alternatives. These include using a type of test that does not make such a distributional assumption. A non-parametric test such as the Mann-Whitney (two sample) test does not assume normality but is generally not as powerful as one that does. In addition, there is another assumption made about the variability of the populations (measured formally by calculating their variance). It is assumed that the two population variances are approximately equal. There is quite a degree of latitude in this: as a rule of thumb, if there is less than a factor of 3 between the two sample standard deviations, then this assumption will be reasonable. Note also that the two-sided hypothesis test and the corresponding confidence interval will lead to the same statistical inference or conclusion.
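The agreement between a two-sided test and the matching confidence interval can be checked numerically. The sketch below is illustrative only: the function name is our own, the estimates and standard errors are the summary values reported for the paired examples later in this article (Tables 6 to 8), and the critical values 2.093 (19 DF) and 2.101 (18 DF) are standard t-table entries rather than figures quoted by the authors.

```python
def two_sided_decision(estimate, standard_error, t_crit):
    """Return (reject_by_test, reject_by_interval) for the null value 0."""
    t_value = estimate / standard_error
    low = estimate - t_crit * standard_error
    high = estimate + t_crit * standard_error
    reject_by_test = abs(t_value) > t_crit
    reject_by_interval = not (low <= 0 <= high)
    return reject_by_test, reject_by_interval

# (estimate, SEM, 95% t critical value) taken from the later paired examples
cases = {
    "heart rate difference, n = 20": (1.70, 2.77, 2.093),
    "BP difference, n = 20": (10.05, 4.72, 2.093),
    "BP difference, n = 19": (6.79, 3.59, 2.101),
}
for label, (estimate, sem, t_crit) in cases.items():
    by_test, by_interval = two_sided_decision(estimate, sem, t_crit)
    print(f"{label}: reject by test = {by_test}, reject by interval = {by_interval}")
```

In each case the two decisions coincide, because the interval excludes 0 exactly when the absolute t value exceeds the critical value used to build the interval.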
FIG 2. Graph of confidence intervals generated from a hypothetical normal population with mean weight 25 kg (vertical axis: weight in kg; horizontal axis: simulation)
EXAMPLES
Example 1: two independent samples, comparison of the population means
A two-sample t test: Using the study of Mosing and others (2010), we first consider a specific type of hypothesis test, namely, a two-sample t test, used in this case since we have two independent groups of nine cats. Although the study authors compared a number of variables between the two groups at various timepoints in the experiment, for simplicity, we will just look at the baseline heart rates immediately before the start of surgery. At this stage, all of the cats were anaesthetised; nine had received brachial plexus nerve blocks (BPB), while the other nine had not. A t test on the data from this timepoint is used to determine if there are differences between the groups even before the surgery started, as this would have implications for subsequent differences observed in heart rates between the two groups.
Null hypothesis: There is no difference in baseline mean heart rate between cats having BPB and cats not receiving BPB.
Alternative hypothesis: There is a difference in baseline mean heart rate between cats having BPB and cats not receiving BPB.
In this case, the test statistic uses (quite naturally) the observed difference between the two sample means (133 beats/min from the cats with no block and 127 beats/min from the cats that received BPB), and their standard deviations (19 and 17, respectively). Simply put, the question is whether a difference of 6 (133 - 127) is sufficiently different from 0 that we can say that the result is unlikely to have occurred by chance if the mean heart rates in the populations of cats were really the same.
We can only answer this question by taking account of the variability in the samples, because this experiment looked at only one sample of cats with no block and one sample of cats that received the block: just one of many possible sets of 18 cats that we could have observed. Table 2 shows the statistical output for a two-sample t test for our example. In the first section, it shows the summary statistics: mean, standard deviation and the standard error of the mean (SEM) [the SEM is the standard deviation divided by the square root of n, where n is the number of observations in each group (so SEM in this case would be 19/3 and 17/3)]. The second section shows the difference between the means of the two treatments in our samples, namely, 6. It also shows the test statistic (t value), which is 0.71, and the P value for that test statistic, which is 0.49.
With a P value of 0.49, the chance of observing a difference at least this large when there is in fact no real difference between the two treatments is about 50% (or 49% to be precise). We should note that the variability is quite large (SEM of 6.3 and 5.7) in comparison to the difference in mean heart rate (6.0), and so a difference of this size could easily have occurred by chance, even if the treatment had no effect. The authors can then conclude that there is no statistically significant difference between the mean heart rates for the two treatment groups.

What about the other pieces of information: DF = 16 and the fact that a pooled standard deviation was used? Are these important in our discussion? DF stands for degrees of freedom, which is a way of expressing how many independent observations there are in arriving at a calculated statistic; it is in this case 9 + 9 - 2, or the total number of observations less 2 (since we are estimating the two population mean values), but this is rather a technical point and one which we can ignore. The fact that the test has used the pooled standard deviation is more important, as it relates to an assumption that has been made for this particular version of the two-sample t test. The two-sample t test makes the assumption that heart rate is normally distributed; the normal distribution is specified in terms of the mean or average value and the variance or variability of the individuals, and for this case, we are making the assumption that the variance or variability is the same in the two treatment groups. This is then estimated by calculating a pooled variance, and the form of pooling is a weighted average of the two separate sample variances. Too much detail? Perhaps, but it is always good to be aware of the assumptions made in your statistical analysis.
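The pooled standard deviation and the t value in Table 2 can be reproduced from the reported summary statistics. The sketch below is illustrative only: the function names are our own, and the weights (n - 1 for each group) are the usual textbook choice for the weighted average mentioned above rather than something the article spells out. The P value is not computed here because it needs the t distribution, which the Python standard library does not provide.

```python
import math

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled SD: square root of the (n - 1)-weighted average of the variances."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var)

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (difference, standard error of the difference, t value, DF)."""
    sp = pooled_sd(sd1, n1, sd2, n2)
    se_diff = sp * math.sqrt(1 / n1 + 1 / n2)
    diff = mean1 - mean2
    return diff, se_diff, diff / se_diff, n1 + n2 - 2

# Summary statistics for the two groups of nine cats (Table 2)
sp = pooled_sd(19.0, 9, 17.0, 9)
diff, se_diff, t_value, df = two_sample_t(133.0, 19.0, 9, 127.0, 17.0, 9)
print(f"pooled SD = {sp:.4f}")
print(f"difference = {diff:.1f}, SE = {se_diff:.3f}, t = {t_value:.2f}, DF = {df}")
```

With equal group sizes the weights are equal, so the pooled variance is simply the average of 19 squared and 17 squared, that is 325.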
A confidence interval for the difference between two population means: Continuing our analysis, an alternative approach would be to construct a confidence interval for the difference between the two population means. We will use the same information as we did for the t test, but in a slightly different way. Remember that we explained that a common approximate form of a 95% confidence interval is

estimate of the parameter of interest ± 2 × the estimated standard error.

The parameter of interest here is the difference in mean heart rate between the two treatments, so our estimate of the population mean difference in heart rate is 6 and, next, we need its estimated standard error. There are two approaches to the estimated standard error, the most common one being to assume that the variability in the two groups is the same (as we did in the hypothesis test above where we used the pooled standard deviation), as only in this way are the probability assumptions satisfied. The estimated standard error of the difference is calculated from the standard error of the mean for each group: the two standard errors are squared and added, and the square root of the sum is taken. Numerically, this results in an estimated standard error of the difference of 8.498. Tables 3 and 4 explain this in more detail. So the difference in population mean heart rate could plausibly lie somewhere between -11 and 23. The important thing is that this interval includes the value 0, so the two population mean heart rates (for cats with BPB and cats without BPB) are plausibly the same.
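The interval arithmetic can be sketched directly from the pooled standard deviation. This is an illustration under stated assumptions: the variable names are our own, and the exact critical value 2.120 (Student's t with 16 degrees of freedom) is a standard table entry added here for comparison with the "± 2" rule of thumb; it is not a figure quoted in the article.

```python
import math

POOLED_SD = 18.0278   # from the two-sample t test output
N_PER_GROUP = 9
T_CRIT_DF16 = 2.120   # two-sided 95% point of Student's t with 16 DF (table value)

# SEM for each group based on the pooled SD, then combine by squaring and adding
sem_each = POOLED_SD / math.sqrt(N_PER_GROUP)
se_diff = math.sqrt(sem_each ** 2 + sem_each ** 2)

diff = 133.0 - 127.0
approx_ci = (diff - 2 * se_diff, diff + 2 * se_diff)
exact_ci = (diff - T_CRIT_DF16 * se_diff, diff + T_CRIT_DF16 * se_diff)

print(f"SE of difference = {se_diff:.3f}")
print(f"approximate 95% CI = ({approx_ci[0]:.0f}, {approx_ci[1]:.0f})")
print(f"95% CI using t critical value = ({exact_ci[0]:.0f}, {exact_ci[1]:.0f})")
```

The interval based on the t critical value is slightly wider than the rule-of-thumb interval, but both comfortably include 0, so the conclusion is unchanged.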
Table 2. Typical output from the two-sample t test

Section 1
Group  n  Mean   SD    SEM
1      9  133.0  19.0  6.3
2      9  127.0  17.0  5.7

Section 2
Difference = mu (1) - mu (2)
Estimate for difference: 6.00
t test of difference = 0 (versus not =): t value = 0.71, P value = 0.490, DF = 16
Both use pooled SD = 18.0278

SD Standard deviation, SEM Standard error of the mean
Table 3. Calculation of the estimated standard error of the difference between two population means
Group  n  Mean   SD    SEM
1      9  133.0  19.0  6.3
2      9  127.0  17.0  5.7
First we must use the pooled standard deviation which, from Table 2, is 18.028. Then, we calculate the SEM for each group, so for group 1, 18.028/3 = 6.009 and for group 2 also 6.009. Squaring and then adding these two figures together gives 72.216.

Table 4. Calculation of the confidence interval for the difference between two population means
Mean difference = 133 - 127 = 6.0
Standard error of the difference = 8.498
6 ± 2 × 8.498 or 6 ± 16.996, which gives an interval from 6 - 16.996 to 6 + 16.996, or -11 to 23

FIG 3. Scatterplot of heart rate at timepoints 1 and 2 with the line of equality (vertical axis: heart rate timepoint 1; horizontal axis: heart rate timepoint 2; both axes run from 50 to 150)
The conclusions from both the hypothesis test and the confidence interval are the same, but if given a choice, we would recommend that you use confidence intervals. Why? Suppose you are in a hypothetical situation similar to the one above, but the hypothesis test leads you to reject the null hypothesis, so your conclusion would be that there is a statistically significant difference in the population mean heart rate. Would the next obvious question not be "how much of a difference?" or "is the treatment 1 mean heart rate higher or lower than that of treatment 2?" To answer these follow-on questions, we need a confidence interval, so why not use it immediately? Let us extend this section to consider another common type of question.

Example 2: paired data
The next experimental situation we could consider is that of paired data. In the first instance, consider a situation where the same individual has the same attribute (e.g. heart rate) measured on two separate occasions (e.g. before and after a treatment). Here, the questions of interest typically concern the change, if any, which has occurred as a result of the treatment. In this case, we assume that the paired observations are related, so we cannot treat the problem as though we had two independent samples as we did when we considered the two individual groups in the study by Mosing and others (2010). The question of interest concerns evidence for a difference in the population mean values before and after. The simple and elegant solution is to calculate the difference between the two measurements of heart rate for each individual and analyse those differences. This leads to a one-sample t test. So let us return to the study by Bell and others (2011) and consider one group of dogs and the measurement of heart rate at two different timepoints in the experiment (Fig 3). How has heart rate changed from timepoint 1 to timepoint 2?
Each dot represents an individual dog, and we can compare heart rate measured at the two timepoints on the same dog (only 18 dots
are visible as there are three dogs with a heart rate of 80/min before and after treatment, and these dots are superimposed on the graph). The line represents the line of equality. Points that lie on this line represent dogs for which the heart rates at the two timepoints were the same. This type of scatterplot, where the scales on the horizontal and vertical axes are the same and the line of equality has been included, is very good for paired data experiments.

Has heart rate changed between the two timepoints? In Fig 3, it looks as though there are more points above the line than below, so that would suggest that heart rate at timepoint 1 was higher than that at timepoint 2 for the majority of dogs, which might suggest that heart rate has decreased. Let us calculate the differences for each individual dog and see what they show. Table 5 and Fig 4 summarise the differences. But we need to know how the difference was calculated before we can discuss its interpretation. In this case, the difference was calculated as timepoint 1 heart rate minus timepoint 2 heart rate. The mean difference is 1.70, so on average, timepoint 1 heart rates were 1.7 bpm higher (that is quite small), and more importantly, the standard error was 2.77 (which is greater than the mean difference). If the variability is relatively large, different samples taken from the population will have quite different means. This is not a problem if the difference between treatments is large, but if the difference is small, there is a high risk that a single sample from each treatment will produce a difference that is not representative of the real difference produced by the treatment. So, informally, there is no evidence of a change in mean heart rate. However, as statisticians, we need to investigate formally. We can again use a hypothesis test approach or a confidence interval approach.
For the hypothesis test, the null hypothesis would stipulate that the two population means are the same, while the alternative would stipulate they are different. Table 6 presents both the test and confidence interval results.
FIG 4. Dotplot of difference in heart rate between the two timepoints (horizontal axis: difference in heart rate, from -21 to 21)

Table 5. Summary statistics for difference in heart rate between the two timepoints
Descriptive statistics: difference
Variable               n   n*  Mean  SEM   SD     Minimum  Q1     Median  Q3     Maximum
Difference in mean HR  20  0   1.70  2.77  12.39  -24.00   -6.00  1.50    10.50  26.00
HR Heart rate, SEM Standard error of the mean, SD Standard deviation

Table 6. Summary of hypothesis test and confidence interval
Variable               n   Mean  SD     SEM   95% CI         T     P
Difference in mean HR  20  1.70  12.39  2.77  (-4.10, 7.50)  0.61  0.547
HR Heart rate, SEM Standard error of the mean, SD Standard deviation
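The paired analysis in Table 6 is a one-sample t test on the differences (timepoint 1 minus timepoint 2), and its entries can be reproduced from the mean, SD and n of those differences. The sketch below is illustrative: the function name is our own, and the critical value 2.093 (Student's t with 19 degrees of freedom) is a standard table entry, not a figure quoted in the article.

```python
import math

def one_sample_t(mean_diff, sd_diff, n, t_crit):
    """Return (SEM, t value, (low, high)) for the null hypothesis of mean difference 0."""
    sem = sd_diff / math.sqrt(n)
    t_value = mean_diff / sem
    interval = (mean_diff - t_crit * sem, mean_diff + t_crit * sem)
    return sem, t_value, interval

T_CRIT_DF19 = 2.093  # two-sided 95% point of Student's t with 20 - 1 = 19 DF

# Mean and SD of the 20 heart rate differences (Table 5)
sem, t_value, (low, high) = one_sample_t(1.70, 12.39, 20, T_CRIT_DF19)
print(f"SEM = {sem:.2f}, t = {t_value:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Because the interval is built with the t critical value rather than the rounded multiplier 2, it matches the software output rather than the rule-of-thumb form.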
So first let us focus on the test: the test statistic is 0.61 and the P value is 0.547, so if the null hypothesis were true, we would see a test statistic at least as extreme as this nearly 55% of the time purely by chance. We therefore conclude that we cannot reject the null hypothesis (because the P value is far greater than the nominal 0.05 significance level that is frequently used). From this, we conclude that it is plausible that the mean heart rates are the same at the two different timepoints. The 95% confidence interval for the mean difference is -4.1 to 7.5, and as this interval includes 0, it is entirely plausible that there is no difference in the population mean heart rates.

We now consider another example, this time to compare blood pressure (BP). This is tackled in the same way as above, but we can explore how the results and our interpretation differ when we find a statistically significant difference. In Fig 5, we can see that there seem to be more dots above the line of equality than below the line, suggesting that in more cases, the mean BP at timepoint 2 is higher than that at timepoint 1. We can also see one observation where the mean BP at timepoint 2 is recorded at 150, which is substantially higher than any other value. From Fig 6, we can see that the differences are predominantly positive, with most lying between 0 and approximately 28, and that there is one observation at over 70 (which is rather odd, being much larger than the next largest difference).
FIG 6. Dotplot of difference in mean BP between the two timepoints (horizontal axis: difference in mean BP, from -28 to 70)
Table 7 summarises the differences and also reports the 95% confidence interval for the difference in mean BP as 0.18 to 19.92 mmHg, with the P value reported as 0.046. How would we report the findings? First, as the 95% confidence interval for the difference does not include 0, we can conclude that there is a statistically significant difference in mean BP between timepoint 1 and timepoint 2, with the mean BP at timepoint 2 highly likely to be somewhere between 0.18 and 19.92 mmHg higher than at timepoint 1. Additionally (although not necessary), we could also say that the P value is less than 0.05, so at the 5% level, we can reject the null hypothesis that the mean BP is the same at timepoints 1 and 2 in favour of the alternative hypothesis that the mean BP differs between timepoints 1 and 2.

A small warning note: the confidence interval only just excludes 0, the P value is only just less than 0.05 and we have a worryingly large observed difference of about 70 mmHg. How much of an influence has this observation had on the result? Perhaps more than we might imagine. If we remove this unusual observation and repeat the last step in the analysis, as reported in Table 8, the confidence interval now includes 0 and the P value (while still small) is greater than 0.05. So now we would conclude that there is no statistically significant difference. Of course, we are not recommending that this observation should be removed; we simply wanted to highlight that single observations do matter and that we should remain alert to the possibility that our findings depend critically on (or are sensitive to) a small number of potentially unusual or influential observations.

Significance versus importance
One final point: statistical significance does not always translate into practical importance; a very small effect may be statistically significant but of no practical significance whatsoever. Significant results can only be given appropriate importance by reference back to the actual problem that the study is addressing. Statistical analysis only sorts out the shades of grey; it does not tell you the answer to your problem.
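
To make the distinction between significance and importance concrete, the sketch below shows how a fixed, clinically trivial difference can become statistically significant once the sample is large enough. The numbers are hypothetical: a mean difference of 0.5 mmHg is assumed purely for illustration, the SD of 21.09 is borrowed from Table 7, and 1.96 (the large-sample normal approximation to the two-sided 5% critical value) stands in for the exact t critical value, which is slightly larger for small samples.

```python
import math

def t_statistic(mean_diff, sd, n):
    """One-sample t statistic for testing a mean difference of zero."""
    return mean_diff / (sd / math.sqrt(n))

SD = 21.09         # SD of the differences, borrowed from Table 7
TINY_EFFECT = 0.5  # hypothetical mean difference in mmHg (illustration only)
CRITICAL = 1.96    # large-sample approximation to the two-sided 5% critical value

results = {}
for n in (20, 1000, 10000):
    t = t_statistic(TINY_EFFECT, SD, n)
    results[n] = abs(t) > CRITICAL
    print(f"n = {n:>5}: t = {t:.2f}, significant at 5%: {results[n]}")
```

The effect is the same size in every row; only the sample size changes. Whether a difference of that size matters is a clinical question that the test cannot answer.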
FIG 5. Scatterplot of mean BP at timepoints 1 and 2 with the line of equality (x-axis: mean BP at timepoint 1; y-axis: mean BP at timepoint 2)

Vol 53
January 2012
Table 7. Summary descriptive statistics and confidence interval for difference in mean BP

Descriptive statistics: difference in mean BP
n 20
Mean 10.05
SEM 4.72
SD 21.09
Minimum -31.00
Q1 0.25
Median 9.50
Q3 22.50
Maximum 72.00

One-sample t test: difference in mean BP
Test of mu = 0 versus not = 0
n 20
Mean 10.05
SD 21.09
SEM 4.72
95% CI (0.18, 19.92)
T 2.13
P 0.046

BP Blood pressure
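
As a rough check, the one-sample t test in Table 7 can be reproduced from its summary statistics alone. The sketch below is illustrative rather than the authors' own procedure: it takes n = 20, mean 10.05 and SD 21.09 from Table 7, and the helper names (`t_pdf`, `two_sided_p`, `t_critical`) are hypothetical. The t distribution is evaluated by simple numerical integration using only the Python standard library. Because the inputs are rounded to two decimal places, the results should agree with Table 7 only to within rounding.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return const * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided P value: 1 minus the area between -|t| and |t| (Simpson's rule)."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = 2 * b / steps  # steps must be even for Simpson's rule
    area = t_pdf(-b, df) + t_pdf(b, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(-b + i * h, df)
    return 1 - area * h / 3

def t_critical(df, alpha=0.05):
    """Value c for which two_sided_p(c, df) equals alpha, found by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_sided_p(mid, df) > alpha:
            lo = mid  # P still too large, so the critical value lies further out
        else:
            hi = mid
    return (lo + hi) / 2

# Summary values taken from Table 7 (rounded to two decimals)
n, mean, sd = 20, 10.05, 21.09
df = n - 1

sem = sd / math.sqrt(n)            # standard error of the mean
t_stat = mean / sem                # test of mu = 0
p_value = two_sided_p(t_stat, df)
half_width = t_critical(df) * sem  # half-width of the 95% confidence interval
ci = (mean - half_width, mean + half_width)

print(f"SEM = {sem:.2f}, T = {t_stat:.2f}, P = {p_value:.3f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

In practice a statistics package would supply the t distribution directly; the integration here only keeps the example free of external dependencies.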
Table 8. Confidence interval for difference in mean BP

One-sample t test: difference in mean BP (excluding one data value)
Test of mu = 0 versus not = 0
n 19
Mean 6.79
SD 15.66
SEM 3.59
95% CI (-0.76, 14.34)
T 1.89
P 0.075

BP Blood pressure
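
The sensitivity check behind Table 8 can be sketched from the summary statistics, without the raw data. The example below is an illustration under stated assumptions, not the authors' own calculation: it assumes that the excluded observation is the maximum difference in Table 7 (72.00), and it takes the two-sided 5% critical value of the t distribution with 18 degrees of freedom as 2.101 from standard tables. The function name `drop_one` is hypothetical.

```python
import math

def drop_one(n, mean, sd, x):
    """Return (n, mean, sd) after removing a single value x from a sample
    summarised by its size, mean and sample standard deviation."""
    total = n * mean
    sum_sq = (n - 1) * sd ** 2 + n * mean ** 2  # sum of squared observations
    new_n = n - 1
    new_mean = (total - x) / new_n
    new_var = (sum_sq - x ** 2 - new_n * new_mean ** 2) / (new_n - 1)
    return new_n, new_mean, math.sqrt(new_var)

# Table 7 summary values; 72.00 is ASSUMED to be the excluded observation
n, mean, sd = drop_one(20, 10.05, 21.09, 72.00)

sem = sd / math.sqrt(n)
t_stat = mean / sem

T_CRIT_18 = 2.101  # two-sided 5% critical value for 18 df, from standard tables
half_width = T_CRIT_18 * sem
ci = (mean - half_width, mean + half_width)
includes_zero = ci[0] < 0 < ci[1]

print(f"n = {n}, mean = {mean:.2f}, SD = {sd:.2f}, SEM = {sem:.2f}, T = {t_stat:.2f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), includes 0: {includes_zero}")
```

With these inputs the recomputed mean, SD and t statistic agree with Table 8 to within rounding, which is consistent with the assumption about which value was excluded. It also shows how a single observation can move the interval across 0.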
Remember, the fact that a statistical test gives a significant result does not necessarily indicate that the result is important.

Conclusions
Simply put, hypothesis tests and confidence intervals are powerful inferential tools, taking you beyond the results of your experiment to the bigger picture. You need to be careful (as always), because assumptions will be made that should be checked and pictures and diagrams have an important role to inform your analysis. There are many different hypothesis tests and confidence intervals for different experimental situations, but they are all built on the same principles as those introduced here. We have avoided using mathematical formulae, but of course, the tests and intervals have a mathematical foundation (based on the theory of estimation and probability). Interested readers who wish to see a fuller, more in-depth development should refer to the studies of Altman and Bland (2005) and Bland and Altman (1986).

Conflicts of interest
None of the authors of this article has a financial or personal relationship with other people or organisations that could inappropriately influence or bias the content of the paper.

References
ALTMAN, D. G. & BLAND, M. (2005) Standard deviations and standard errors. British Medical Journal 331, p 993
BELL, A. M., AUCKBURALLY, A., PAWSON, P., SCOTT, E. M. & FLAHERTY, D. (2011) Two doses of dexmedetomidine in combination with buprenorphine for premedication in dogs; a comparison with acepromazine and buprenorphine. Veterinary Anaesthesia and Analgesia 38, 15-23
BLAND, M. & ALTMAN, D. G. (1986) Confidence intervals rather than p-values: estimation rather than hypothesis testing. British Medical Journal 292, 746-750
MOSING, M., REICH, H. & MOENS, Y. (2010) Clinical evaluation of the anaesthetic sparing effect of brachial plexus block in cats. Veterinary Anaesthesia and Analgesia 37, 154-161
SCOTT, M., FLAHERTY, D. & CURRALL, J. (2011) Statistics: making sense of what you see. Journal of Small Animal Practice 52, 560-565