Contin Educ Anaesth


Statistics: Central tendency and spread of data

Anthony McCluskey BSc MB ChB FRCA

Abdul Ghaaliq Lalkhen MB ChB FRCA

Central tendency

Examining the raw data is an essential first step

before proceeding to statistical analysis.

Thereafter, two key sample statistics that may

be calculated from a dataset are a measure of

the central tendency of the sample distribution

and of the spread of the data about this central

tendency. Inferential statistical analysis is

dependent on a knowledge of these descriptive

statistics. In the first article of this series, types

of data and correlations were discussed [1].

Different measures of central tendency

attempt to determine what might variously be

termed the typical, normal, expected or average

value of a dataset. Three of them are in general

use for most types of data: the mode, median,

and mean.

The mode

Strictly, the mode is a measure of the

most popular (frequent) value in a dataset and

is often not a particularly good indicator of

central tendency. Despite its limitations, the

mode is the only means of measuring central

tendency in a dataset containing nominal categorical

values. For example, in a survey of 10

senior house officers (SHOs) asked which form

of continual professional development (CPD)

activity they preferred in preparation for a

forthcoming examination, the following

responses were obtained: viva practice 5, tutor-

ial 3, in-theatre teaching 2, lecture 0. The mode

of this dataset is viva practice as it is the

largest (most popular) category. We might say

that a typical SHO prefers viva practice and

plan the CPD time accordingly.

The mode may also be used for ordinal cate-

gorical data and for interval data, although the

median or mean are more useful in these circum-

stances. For example, suppose a pilot study is

undertaken to determine the severity of pain on

injection of propofol in 10 patients and an ordinal

verbal pain score system from 0 to 3 is used.

The pain scores observed are: 0, 1, 1, 2, 2, 2, 2, 3,

3, 3. The mode of this dataset is a pain score of 2.

The median

The median is defined as the central datum

when all of the data are arranged (ranked) in

numerical order. As such, it is a literal measure

of central tendency. When there are an even

number of data, the mean (see below) of the

two central data points is taken as the median.

For the distribution of pain scores described

above, the median pain score is again 2.

The median may be used for ordinal categ-

orical data and for interval data. When analysing

interval data, the median is preferred to the

mean when the data are not normally (symmetri-

cally) distributed, as it is less sensitive to the

influence of outliers.

The mean

The mean is used to summarize interval data.

As the mean may be influenced by outlying

data points, it is best used as a measure of

central tendency when the data is normally

(symmetrically) distributed. Although several

different means are defined, the arithmetic

mean is most commonly used. The arithmetic

mean is calculated by adding all the individual

datum values in a dataset ($x_1 + x_2 + \ldots + x_n$)

and dividing by the number of values (n) in the

dataset.

The mean of ordinal categorical data is

often reported in the literature (together with its

associated measure of data spread, SD). For

example, in the sample of verbal pain scores

above, the mean score is 1.9. The use of mean

and SD for ordinal data is controversial. For

example, what does a pain score of 1.9 actually

mean when using a categorical scale?
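The three measures described above can be checked with a short Python sketch. It uses only the standard library and the two example datasets from the text (the CPD survey and the propofol pain scores); the variable names are illustrative assumptions.

```python
from collections import Counter
from statistics import mean, median, mode

# Verbal pain scores (0 to 3) from the propofol pilot study in the text.
pain_scores = [0, 1, 1, 2, 2, 2, 2, 3, 3, 3]

# CPD preferences of the 10 SHOs in the text.
# These are nominal data, so the mode is the only applicable measure.
cpd_preferences = Counter(
    {"viva practice": 5, "tutorial": 3, "in-theatre teaching": 2, "lecture": 0}
)

pain_mode = mode(pain_scores)      # most frequent value
pain_median = median(pain_scores)  # mean of the two central values (n is even)
pain_mean = mean(pain_scores)      # sum of values divided by n
cpd_mode = cpd_preferences.most_common(1)[0][0]  # largest category

print(pain_mode, pain_median, pain_mean, cpd_mode)
```

The mode and median are both 2 and the mean is 1.9, in agreement with the values quoted above.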

Key points

Two key sample statistics are measures of central tendency and the spread of data about this central tendency.

The mode, median, and mean are used to describe, respectively, the central tendency of categorical data, interval data that is not normally distributed, and normally distributed interval data.

The most important distribution in statistical analysis is the normal (Gaussian) distribution, defined by its mean and SD.

The normal distribution is characterized by its unimodal, symmetrical, bell-shaped frequency curve.

Anthony McCluskey BSc MB ChB FRCA
Consultant
Department of Anaesthesia
Stockport NHS Foundation Trust
Stepping Hill Hospital
Stockport SK2 7JE
UK
Tel 44: 0161 419 5869,
Fax: 0161 419 5045,
E-mail: a.mccluskey4@ntlworld.com
(for correspondence)

Abdul Ghaaliq Lalkhen MB ChB FRCA
Specialist Registrar
Department of Anaesthesia
Royal Lancaster Infirmary
Ashton Road
Lancaster LA1 4RP
UK

doi:10.1093/bjaceaccp/mkm020

Continuing Education in Anaesthesia, Critical Care & Pain | Volume 7 Number 4 2007

© The Board of Management and Trustees of the British Journal of Anaesthesia [2007].

All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Measurement of spread of data (variability)

Once again, the first step in assessing spread of data is to examine it in either a table or an appropriate graphical form. A graph often makes clear any symmetry (or lack of it) in the spread of data, whether there are obvious atypical values (outliers) and whether the data is skewed in one direction or the other (a tendency for more values to fall in the upper or lower tail of the distribution).

Range

The simplest measure of data variability is the range, defined

simply as the interval between the highest and lowest values in a

distribution. It is of limited practical use in statistical analysis as it

is profoundly influenced by extreme outliers.

Percentiles

When a dataset is arranged in order of magnitude, it may be

divided into 100 separate cut-off points (percentiles). The xth percentile

is defined as a cut-off point such that x% of the sample has

a value equal to or less than the cut-off point. For example, the

35th percentile splits the data up into two groups containing,

respectively, 35 and 65% of the data.

Quartiles are used most commonly, i.e. lower (25th percentile),

middle (50th percentile or median), and upper (75th percentile).

They split the data into four equal groups. The interquartile range

(IQR) is often quoted when referring to interval data that is not

normally distributed. If the 25th percentile value (lower quartile)

of a dataset is 10 and the 75th percentile value (upper quartile) is

40, the IQR may be expressed as either 10–40 or simply as

30. Percentiles and quartiles may be estimated from a cumulative

frequency curve.
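As a minimal sketch of these definitions, the range, quartiles and IQR can be computed with Python's standard library. The dataset below is hypothetical and was chosen so that the quartiles reproduce the 10 and 40 of the example above. Several conventions exist for interpolating percentiles; the `inclusive` method used here is only one of them, and other software may return slightly different cut-off points.

```python
from statistics import quantiles

# Hypothetical sample, arranged in order of magnitude (not data from the article).
data = [2, 5, 10, 18, 25, 33, 40, 52, 70]

# n=4 asks for the three cut-off points that split the data into quartiles.
# method="inclusive" treats the sample as including its own minimum and maximum.
lower_quartile, middle_quartile, upper_quartile = quantiles(
    data, n=4, method="inclusive"
)

iqr = upper_quartile - lower_quartile  # interquartile range as a single number
data_range = max(data) - min(data)     # range: highest minus lowest value

print(f"IQR {lower_quartile:g}-{upper_quartile:g} (width {iqr:g}), "
      f"median {middle_quartile:g}, range {data_range}")
```

Replacing the largest value (70) with an extreme outlier changes the range but leaves the quartiles and IQR untouched, which is why the IQR is preferred for data that are not normally distributed.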

A useful graphical representation of the distribution of interval

data is the box and whisker plot. For example, the box and whisker

plot in Figure 1 has been produced from the marks obtained by a

cohort of candidates taking an examination. The upper and lower

limits of the box (hinges) represent the upper and lower quartiles,

respectively. The horizontal line inside the box is the median and

the whiskers represent extreme values, in this case the 10th and

90th percentiles. Any further outliers are represented by asterisks.

The normal distribution

The most important and useful distribution of data in statistical

analysis is the normal or Gaussian distribution (Fig. 2). It is also

often referred to as a parametric distribution because two key parameters

which fully describe its shape can be defined (the mean

and SD). A normal distribution is characterized by a unimodal,

symmetrical, bell-shaped curve when interval data are represented

by a histogram or line graph.

Much biological data such as height and mean arterial blood

pressure in healthy adults are normally distributed. In clinical

trials, sample data drawn from such a population will also

follow a normal distribution provided the sample size is reasonably

large (e.g. >100 in each group). However, it is not actually

necessary for sample data to follow a normal distribution

in order to subject the data to parametric statistical analysis

(which is often the case with the smaller sample sizes described

in clinical studies). Rather, it is necessary for the sample data

to be compatible with having been drawn from a population

that is normally distributed. Thus, although visual inspection

of the frequency curve is useful and should always be under-

taken, with smaller sample sizes it may not be obvious that

the sample data is compatible with normally distributed popu-

lation data. Data can be subjected to formal statistical analysis

for evidence of normality using a variety of tests (e.g. Shapiro–Wilk

test, D'Agostino–Pearson omnibus test). However, these

tests should be used with caution for very small samples as

they may then fail to detect a departure from normality. If in doubt as to

whether data is normally distributed or not, it is safer to use

non-parametric inferential statistical analysis which is not based

on any assumptions about the shape of the frequency distribution

curve.

The normal distribution shown in Figure 2 has a mean of 20

and a SD of 4. The mean positions the frequency curve on the

x-axis. The spread (width) of the curve around the mean is deter-

mined by its SD. The shape of any normal distribution frequency

curve is entirely described by these two parameters. The centre of

the distribution occurs at the zenith and all three measures of

central tendency (mode, median, and mean) are equal and

described by the zenith.

Fig. 1 Box and whisker plot of the examination marks. The data do not follow a normal distribution.

Fig. 2 The normal distribution [mean (SD) 20 (4)].

Statistics: Central tendency and spread of data

For a population, the SD is calculated by summing the squares

of all the individual differences of each datum value from the

mean, then by calculating the mean of this value, and finally by

calculating the square root. The process of squaring the differences

followed by taking the square root results in all of the differences

being converted into positive values. Otherwise, the SD would

always be zero.

$$\mathrm{SD} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}},$$

where $\mu$ = population mean and $n$ = population size.

As clinical research rarely deals with populations but with

(random) samples drawn from the population, the SD of a sample

is:

$$\mathrm{SD} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}},$$

where $\bar{x}$ = sample mean and $n$ = sample size.

It is expressed in the same units as those of the mean from

which it is derived.
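The two formulas differ only in the divisor ($n$ for a population, $n - 1$ for a sample). The Python sketch below works through both by hand on a small hypothetical dataset and compares the results with the standard library's `pstdev` and `stdev`; the data values are assumptions made purely for illustration.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

# Hypothetical interval data (not taken from the article).
values = [16, 18, 20, 22, 24]
n = len(values)

centre = mean(values)
# Sum of the squared differences of each datum value from the mean.
sum_of_squares = sum((x - centre) ** 2 for x in values)

population_sd = sqrt(sum_of_squares / n)    # divisor n: data are the whole population
sample_sd = sqrt(sum_of_squares / (n - 1))  # divisor n - 1: data are a sample

# The hand calculation should agree with the library functions.
print(population_sd, pstdev(values))
print(sample_sd, stdev(values))
```

The sample SD is always slightly larger than the population SD calculated from the same values, and the difference shrinks as $n$ grows.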

For parametric (normally distributed, symmetrical) data, the

mean and SD are the appropriate measures of central tendency and

variability of the data. For non-parametric data, the median is the

appropriate central tendency measure and the IQR is the appropri-

ate measure of the variability of the data.

In a normal distribution, approximately two-thirds of the data

(68%) lie within ±1 SD of the mean, approximately 95% of the

data lie within ±2 SD of the mean (sometimes quoted more accurately

as ±1.96 SD) and approximately 99.7% of the data lie

within ±3 SD of the mean.
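These proportions can be confirmed numerically. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`, and uses the mean of 20 and SD of 4 from Figure 2; the helper name `fraction_within` is an illustrative assumption.

```python
from statistics import NormalDist

# The normal distribution of Figure 2: mean 20, SD 4.
dist = NormalDist(mu=20, sigma=4)

def fraction_within(k):
    """Proportion of the distribution lying within k SDs either side of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower

for k in (1, 1.96, 2, 3):
    print(f"within ±{k} SD: {100 * fraction_within(k):.2f}%")
```

The same proportions are obtained for any mean and SD, because the shape of a normal distribution is fully described by those two parameters.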

Several other sample statistics related to the sample SD are often

used in statistical analysis, e.g. sample variance is defined as $\mathrm{SD}^2$,

standard error of the mean (SEM) $= \mathrm{SD}/\sqrt{n}$.

The standard normal distribution

The standard normal distribution is defined as having a mean = 0

and SD = 1. Any normal distribution may be converted into the

standard normal distribution according to the formula: $z = (x_i - \mu)/\sigma$,

where $x_i$ is a datum value from the original distribution, $\mu$ is

the mean of the original normal distribution, and $\sigma$ is the SD of

the original normal distribution. The standard normal distribution is

therefore sometimes referred to as the z-distribution. A z-value

indicates the number of SDs a datum value is above or below the

mean.

The standard normal distribution may be useful in comparing

different normal distributions. For example, suppose we wish to

consider two candidates for a job by comparing their performance

in a qualifying examination set by different examination boards.

They might both have scored 60% but it would not be surprising if

the mean and SD of the marks obtained in each of the examinations

were different. The raw marks obtained by candidates may be converted

into z-values, equivalent to the number of SDs that the

scores are from the mean of zero, according to the equation:

z-value $= (x_i - \mu)/\sigma$, where $x_i$ = a candidate's raw score, $\mu$ =

mean mark of the population of all the candidates in each particular

examination, and $\sigma$ = SD of the population of all the marks

obtained in each particular examination.

If candidate 1 scored 60% in an examination with a mean

mark of 70% and SD of 10%, the corresponding z-value mark is

−1. His performance was 1 SD below the mean for that examination,

i.e. his mark was at the level of the 16th percentile [50 − (68/2)]

for his cohort. If candidate 2 scored 60% in an examination

with a mean mark of 40% and SD 10%, the corresponding z-value

mark is +2. His performance was 2 SD above the mean for that

examination, i.e. his mark was at the level of the 97.5th percentile

[50 + (95/2)] for his cohort.

Even with the same raw mark, candidate 2 is seen to have performed

better. In this context, the z-values are equivalent to the

candidates' standard normal scores in their examinations. Strictly

speaking, candidate 2 performed better in relation to his cohort,

not to an absolute standard. If the two cohorts were of widely dif-

fering general ability then our conclusion that candidate 2 was

more able based on his better standard normal score might be

invalid.
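The comparison of the two candidates can be reproduced in a few lines of Python. This is a sketch that assumes the marks in each examination are normally distributed and uses `statistics.NormalDist` (Python 3.8 or later); the function name `z_value` is an illustrative assumption.

```python
from statistics import NormalDist

def z_value(raw_score, mean_mark, sd_mark):
    """Number of SDs that a raw score lies above (+) or below (-) the mean."""
    return (raw_score - mean_mark) / sd_mark

# Both candidates scored 60%, but in examinations with different mean marks.
z_candidate_1 = z_value(60, mean_mark=70, sd_mark=10)
z_candidate_2 = z_value(60, mean_mark=40, sd_mark=10)

# Percentile of each z-value on the standard normal distribution (mean 0, SD 1).
standard_normal = NormalDist()
percentile_1 = 100 * standard_normal.cdf(z_candidate_1)
percentile_2 = 100 * standard_normal.cdf(z_candidate_2)

print(z_candidate_1, round(percentile_1, 1))
print(z_candidate_2, round(percentile_2, 1))
```

The computed percentiles (about 15.9 and 97.7) differ slightly from the 16th and 97.5th quoted above because the text uses the rounded figures of 68% and 95% for ±1 and ±2 SD.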

Standard deviation vs SEM

When the results of statistical analyses are reported, the SD and

SEM are sometimes used inappropriately. For example, authors

may quote the range of their sample data as mean ± SEM rather

than mean ± SD. The temptation to do this follows from the fact

that, by definition, the SEM decreases as sample size increases

($\mathrm{SEM} = \mathrm{SD}/\sqrt{n}$). When statistics comparing two or more groups

are quoted in this way, any differences between them appear more

significant than when the SD is quoted. The SD should always be

used when describing the variability of the actual sample data. The

SEM is used specifically to describe the precision of the sample

mean, i.e. how far the sample mean is likely to be from the population mean.

The 95% confidence interval (see future article) for the population

mean = sample mean ± 2 SEM.
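A short Python sketch shows why quoting the SEM makes data look less variable than quoting the SD. The sample mean of 80, SD of 10 and the two sample sizes are hypothetical values chosen for easy arithmetic, and the function names are illustrative assumptions.

```python
from math import sqrt

def sem(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / sqrt(n)

def approx_ci95(sample_mean, sd, n):
    """Approximate 95% confidence interval: sample mean ± 2 SEM."""
    half_width = 2 * sem(sd, n)
    return (sample_mean - half_width, sample_mean + half_width)

# Hypothetical samples with the same mean and SD but different sizes.
sem_small = sem(10, 25)             # n = 25
sem_large = sem(10, 100)            # n = 100
ci_small = approx_ci95(80, 10, 25)
ci_large = approx_ci95(80, 10, 100)

print(sem_small, ci_small)
print(sem_large, ci_large)
```

Quadrupling the sample size halves the SEM and narrows the confidence interval, although the SD (the variability of the observations themselves) is unchanged.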

Skewness and kurtosis

Two terms may be defined with reference to the shape of a frequency

curve, kurtosis and skewness. Kurtosis describes the peakedness

of the curve whereas skewness describes the symmetry of

the curve. A positive kurtosis indicates a frequency distribution

with a sharper peak and a longer tail than a normal distribution; a

negative kurtosis indicates a wide, flattened distribution. The standard

normal distribution has a kurtosis of zero. If a frequency

curve has a longer upper tail, the data is positively skewed; if the

data has a longer lower tail, it is negatively skewed.
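The article does not give formulas for these two quantities. One common choice, assumed here, is the moment-based definition: skewness is the third central moment divided by the second raised to the power 3/2, and kurtosis is the fourth central moment divided by the square of the second, minus 3 so that a normal distribution scores zero. Statistical packages often apply additional small-sample corrections, so their output may differ. The datasets are hypothetical.

```python
from statistics import fmean

def central_moment(values, order):
    """Mean of the differences from the mean, each raised to the given power."""
    centre = fmean(values)
    return fmean([(x - centre) ** order for x in values])

def skewness(values):
    """Moment-based skewness: zero for symmetrical data, positive for a longer upper tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so that a normal distribution scores zero."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetrical = [1, 2, 3, 4, 5]        # hypothetical symmetrical data
long_upper_tail = [1, 1, 1, 2, 10]   # hypothetical data with one high outlier

print(skewness(symmetrical), skewness(long_upper_tail))
print(excess_kurtosis(symmetrical))
```

The symmetrical dataset has a skewness of zero, whereas the single high value in the second dataset produces a positive skew.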


The distribution shown in Figure 3 is positively skewed. It is

evident from the frequency curve that the mean has been shifted to

the right by the skewed data; although the median has also shifted,

it has been influenced less by the outliers on the upper tail and is

the appropriate measure of central tendency for skewed distri-

butions. Biological data not infrequently follows such a distri-

bution, examples being adult weight, white cell count of healthy

individuals, and serum triglyceride concentrations. Negatively

skewed data is relatively uncommon.

When faced with a sample that comprises non-normally

distributed (skewed) data, there are two choices: to accept the dis-

tribution as it is and use non-parametric inferential statistical

analysis or to attempt to transform the data into a normal distri-

bution. Common methods of transforming skewed data into a

normal distribution are logarithmic, square root, and reciprocal

transformations.
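As a sketch of the logarithmic option, the Python below applies a base-10 log transformation to a small hypothetical dataset with a long upper tail. The values were deliberately chosen as powers of 10 so that the effect is easy to see; real data will rarely become perfectly symmetrical.

```python
from math import log10
from statistics import mean, median

# Hypothetical positively skewed data (all values must be > 0 to take logarithms).
raw = [1, 10, 100, 1000, 10000]

# Logarithmic transformation compresses the long upper tail.
transformed = [log10(x) for x in raw]

# Before transformation the mean is pulled far above the median by the upper tail.
print(mean(raw), median(raw))
# After transformation the mean and median coincide, as in a symmetrical distribution.
print(mean(transformed), median(transformed))
```

Results calculated on the transformed scale (for example a mean) must be back-transformed before being reported in the original units.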

Acknowledgements

The authors are grateful to Professor Rose Baker, Department

of Statistics, Salford University for her valuable contribution in

providing helpful comments and advice on this manuscript.

Bibliography

1. McCluskey A, Lalkhen AG. Statistics I: Data and correlations. CEACCP

2007; 7: 95–99

2. Bland M. An Introduction to Medical Statistics, 3rd Edn. Oxford: Oxford

University Press, 2000

3. Altman DG. Practical Statistics for Medical Research. London: Chapman &

Hall/CRC, 1991

4. Rumsey D. Statistics for Dummies. New Jersey: Wiley Publishing Inc.,

2003

5. Swinscow TDV. Statistics at Square One. http://www.bmj.com/statsbk

(accessed 14 June 2007)

6. Lane DM. Hyperstat Online Statistics Textbook. http://davidmlane.com/

hyperstat/ (accessed 14 June 2007)

7. SurfStat Australia. http://www.anu.edu.au/nceph/surfstat/surfstat-home/

surfstat.html (accessed 14 June 2007)

8. Elwood M. Critical Appraisal of Epidemiological Studies and Clinical

Trials, 2nd Edn. Oxford: Oxford University Press, 1998

Please see multiple choice questions 14–17

Fig. 3 Positively skewed data.
