
Introduction to Biostatistics, University of Damascus Dental School

Introduction to
Biostatistics

( )PhD Glasgow
- Sheffield
- Glasgow

The Statistician's Image
Dull, dry, humorless
Speaks in technical jargon that no one understands
Wears thick glasses and carries a calculator in the pocket
Inflexible (always says "You can't do that!")
Dr Mohammad Y Hajeer, DDS, MSc, PhD


The Statistician's Image (cont.)
Spends Thursday nights in the library
Favorite movie: Revenge of the Nerds
Doesn't play golf
A necessary evil
SMART!

A statistician is a person who is good with numbers but who lacks the personality to become an accountant.


The opposite sex ignores us because we are boring.


It was God that made me so beautiful. If I weren't, then I'd be a teacher.
Supermodel Linda Evangelista


Bio-Sadistics
Instead of
Bio-statistics


In God we trust.
All others must bring data.


Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.
H.G. Wells


Challenges

Statistical ideas can be intimidating and difficult
Thus:
Statistical results are often skipped over when reading scientific literature
Data are often misinterpreted

Mis-Interpretation of Data
On average, my class is doing well.
Half of my students think that
2+2=3, the other half thinks that
2+2=5.


You may think that:


A Bar Chart is a map of the locations of
the nearest taverns
A p-value is the result of a urinalysis
Martingale residuals are the droppings of
a rare bird
A t-test is a blinded taste test between
black and green tea

Data

Pieces of information
Scales of Measurement
Nominal: unordered categories
Ordinal: ordered categories
Discrete: only whole numbers are possible; order and magnitude matter
Continuous: any value is conceivable

Data
Many errors in research arise from poor planning (e.g., of data collection)
Fancy statistical methods cannot
rescue garbage data
Careful planning is prudent

Data
Collect exact values whenever possible
Standardize data collection
Consistency
Training on test administration and data collection

Central labs
Central reading of imaging, etc.


Statistics

The science of collecting,
monitoring, analyzing, summarizing,
and interpreting data
This includes study design


Biostatistics

Statistics applied to biological (life)
problems, including:
Public health
Medicine
Ecological and environmental

Much more statistics than biology; however, biostatisticians must learn the biology as well

Biostatistician Roles

Identify and develop treatments for
disease and estimate their effects.
Identify risk factors for diseases.
Design, monitor, analyze, interpret, and
report results of clinical studies.

Develop statistical methodologies to address questions arising from medical/public health data.

Why Can It Be Interesting?

Combines the rigors of mathematics with the uncertainties of the real world.
Can contribute to the advancement of science, statistics, medicine, and public health.
Can study diseases/health problems in which you may have an interest (cancers, HIV, reproductive health, etc.).

Challenge

Much of life is composed of a systematic component (i.e., signal) and a random component (i.e., error or noise)
Example:
Smoking is associated with lung cancer.
Yet not everyone who smokes gets lung cancer, and not everyone who gets lung cancer smokes
Still, we know that there is an association (a systematic component)

A Challenge
Our challenge
Identify the systematic component
(separate it from the random
component), estimate it, and perhaps
make inferences with it


The Big Picture

Populations and Samples

Sample / Statistics: x̄, s, s²
Population / Parameters: μ, σ, σ²


Populations and Parameters


Population
A group of individuals that we would like to
know something about

Parameter
A characteristic of the population in which we
have a particular interest
Often denoted with Greek letters (e.g., μ, σ)
Examples:
The proportion of the population that would respond to a certain drug
The association between a risk factor and a disease in a population

Samples and Statistics


Sample
A subset of a population (hopefully
representative)

Statistic
A characteristic of the sample
Examples:

The observed proportion of the sample that responds to treatment
The observed association between a risk factor and a disease in this sample

Populations and Samples


Studying populations is too expensive and
time-consuming, and thus impractical
If a sample is representative of the
population, then by observing the sample
we can learn something about the
population
And thus by looking at the characteristics of
the sample (statistics), we may learn
something about the characteristics of the
population (parameters)

Statistical Analyses

Two steps
Descriptive Statistics
Describe the sample

Inference
Make inferences about the population using
what is observed in the sample
Primarily performed in two ways:
Hypothesis testing
Estimation

Issues

Samples are random
If we had chosen a different sample,
then we would obtain different
statistics (sampling variation or random
variation)
However, note that we are trying to
estimate the same (constant) population
parameters
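
Sampling variation can be illustrated with a small simulation. The sketch below is not from the lecture: the population of 10,000 ages, the seed, and the sample size of 50 are all assumptions chosen for illustration. It shows that repeated samples give different sample means while the population mean stays fixed.

```python
import random
import statistics

# Hypothetical population (an assumption for illustration): 10,000 ages
# generated once with a fixed seed so the run is repeatable.
rng = random.Random(42)
population = [rng.gauss(40, 10) for _ in range(10_000)]

# The population mean is a parameter: one constant value.
mu = statistics.fmean(population)


def sample_mean(n):
    """Mean of one random sample of size n drawn without replacement."""
    return statistics.fmean(rng.sample(population, n))


# Five different samples give five different statistics.
means = [sample_mean(50) for _ in range(5)]

print(f"population mean: {mu:.2f}")
for i, m in enumerate(means, start=1):
    print(f"sample {i} mean:   {m:.2f}")
```

Each sample mean lands near the population mean but not exactly on it, which is the random variation the slide describes.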


Step I: Descriptive Statistics


Describe the Sample
Begin one variable at a time
Describe important variables in your
analyses (e.g., endpoints,
demographics, confounders, etc.)


Types of Data
Several types of data

Nominal
Ordinal
Discrete
Continuous
Time-to-event with censoring

The type of data influences the analysis methods to be employed

Nominal Data

Mutually exclusive unordered categories
Examples
Sex (male, female)
Race/ethnicity (White, Black, Latino, Asian, Native American, etc.)
Site

Can summarize in:
Tables using counts and percentages
Bar chart/graph

Ordinal Data

Ordered Categories
Examples
Adverse events
Mild, moderate, severe, life-threatening,
death

Income
Low, medium, high

Discrete Data

Often only integer values are possible
If there are many different discrete values, then discrete data are often treated as continuous
Examples: CD4 count, HIV viral load
If there are very few discrete values, then discrete data are often treated as ordinal

Continuous Data

Any value on the continuum is possible
(even fractions or decimals)
Examples:
Height
Weight
Many discrete variables are often treated as
continuous
Examples: CD4 count, viral load

Survival Data
Time to an event (continuous variable)
The event does not have to be survival

Concept of Censoring

If we follow a person until the event, then the survival time is clear
If we follow someone for a length of time but the event does not occur, then the time is censored (but we still have partial information, namely that the event did not occur during the follow-up period)

Examples: time to progression (cancer), time to response, time to relapse, time to death

EXAMPLE DATASET: Evans SR, et al., Journal of Clinical Oncology, 2002

Obs  AGE  SEX  RACE                            CAUSE OF DEATH
  1   52   M   Black Non-Hispanic              MAI/MAC Disease
  2   43   M   Black Non-Hispanic              .
  3   41   M   Black Non-Hispanic              HIV Progression-Other
  4   35   M   White Non-Hispanic              .
  5   30   M   Black Non-Hispanic              .
  6   41   M   Black Non-Hispanic              .
  7   36   M   White Non-Hispanic              .
  8   33   M   Hispanic (Regardless of Race)   .
  9   38   M   Hispanic (Regardless of Race)   HIV Progression-Other
 10   22   F   Black Non-Hispanic              .
 11   37   M   White Non-Hispanic              .
 12   31   M   White Non-Hispanic              .
 13   51   M   White Non-Hispanic              HIV Progression-Other
 14   42   M   White Non-Hispanic              .
 15   47   M   White Non-Hispanic              .
 16   40   M   White Non-Hispanic              .
 17   47   M   Hispanic (Regardless of Race)   .
 18   32   M   White Non-Hispanic              Other
 19   27   M   White Non-Hispanic              .
 20   36   M   Hispanic (Regardless of Race)   .
 21   48   M   Hispanic (Regardless of Race)   .
 22   27   M   Hispanic (Regardless of Race)   .
 23   32   M   Hispanic (Regardless of Race)   .
 24   29   M   Hispanic (Regardless of Race)   HIV Progression-Other
 25   33   M   White Non-Hispanic              .
 26   35   M   White Non-Hispanic              Kaposi's Sarcoma
 27   31   M   White Non-Hispanic              CMV Disease
 28   52   M   White Non-Hispanic              Kaposi's Sarcoma
 29   30   M   White Non-Hispanic              Suicide
 30   34   M   White Non-Hispanic              Other Clinical-Non-HIV
 31   33   M   White Non-Hispanic              Other Clinical-Non-HIV
 32   57   M   White Non-Hispanic              PCP
 33   31   M   White Non-Hispanic              .
 34   39   M   White Non-Hispanic              Other
 35   27   M   White Non-Hispanic              PCP
 36   44   M   White Non-Hispanic              .
Scott Evans, Ph.D., Lynne Peeples, M.S.


Data Summaries

It is ALWAYS a good idea to summarize
your data (at least for important
variables)
You become familiar with the data and the
characteristics of the sample that you are
studying
You can also identify problems with data
collection or errors in the data (data
management issues)
Range checks for illogical values
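
A range check can be as simple as listing every value that falls outside plausible limits. The sketch below is illustrative only: the function name `range_check`, the limits of 18 to 100 years, and the deliberately wrong value 430 are assumptions, while the twelve ages are the first twelve AGE values of the example dataset.

```python
def range_check(values, low, high):
    """Return (position, value) pairs for values outside [low, high].

    Positions are 1-based so they match observation numbers.
    """
    return [(i, v) for i, v in enumerate(values, start=1)
            if not (low <= v <= high)]


# First 12 ages from the example dataset.
ages = [52, 43, 41, 35, 30, 41, 36, 33, 38, 22, 37, 31]

# Hypothetical data-entry error appended for illustration (43 typed as 430).
ages_with_typo = ages + [430]

print(range_check(ages, 18, 100))            # no illogical values
print(range_check(ages_with_typo, 18, 100))  # flags observation 13
```

Flagged values are not automatically wrong; they are candidates to verify against the source records.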

Visual Data Summaries



Some visual ways to summarize data
(one variable at a time):
Tables
Graphs
Bar charts
Histograms
Box plots


Frequency Tables

Summarizes a variable with counts
and percentages
The variable is categorical (e.g.,
nominal or ordinal)


Frequency Table: Cause of Death

Cause of Death    Frequency   Percent
Motor Vehicle          48        48
Drowning               14        14
House Fire             12        12
Homicide                7         7
Other                  19        19
Total                 100       100
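
A frequency table of counts and percentages can be built directly from the raw categorical values. The following is a minimal sketch, not the lecture's own method; the helper name `frequency_table` is hypothetical, and the input is the SEX column of the 36-patient example dataset (one F at observation 10, the rest M).

```python
from collections import Counter


def frequency_table(values):
    """Return (category, count, percent) rows, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(category, n, round(100 * n / total, 1))
            for category, n in counts.most_common()]


# SEX column of the example dataset: observation 10 is F, all others M.
sex = ["M"] * 9 + ["F"] + ["M"] * 26

for category, n, pct in frequency_table(sex):
    print(f"{category:<4}{n:>4}{pct:>7.1f}%")
```

The same function works for any nominal or ordinal variable, such as race or cause of death.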


Frequency Tables
Note that you can take a continuous
variable and create categories with
it
How do you create categories for a
continuous variable?
Choose cutoffs that are biologically
meaningful
Natural breaks in the data
Precedent from prior research

Example: Serum Cholesterol Levels

How to choose the categories?


Talk to physician about risk categories
May look at National Cholesterol
Education Program (NCEP) guidelines
and categories


Frequencies of serum cholesterol levels

Cholesterol level               Cumulative   Relative        Cumulative Relative
(mg/100 ml)        Frequency    Frequency    Frequency (%)   Frequency (%)
_______________________________________________________________________________
80-119                 13           13           1.2              1.2
120-159               150          163          14.1             15.3
160-199               442          605          41.4             56.7
200-239               299          904          28.0             84.7
240-279               115         1019          10.8             95.5
280-319                34         1053           3.2             98.7
320-359                 9         1062           0.8             99.5
360-399                 5         1067           0.5            100.0
_______________________________________________________________________________
Total                1067                      100.0

Note. The choice of intervals (and cut-off values) in a frequency table is very important. However, there are no established rules for determining them.
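
The cumulative and relative columns follow mechanically from the interval counts. As a rough sketch (variable names are illustrative, and the last-but-one interval is labelled 320-359 on the assumption that the intervals do not overlap), the columns can be recomputed like this:

```python
from itertools import accumulate

# Interval labels and counts from the serum cholesterol table.
labels = ["80-119", "120-159", "160-199", "200-239",
          "240-279", "280-319", "320-359", "360-399"]
counts = [13, 150, 442, 299, 115, 34, 9, 5]

total = sum(counts)

# Running total of the counts.
cum_counts = list(accumulate(counts))

# Each count, and each running total, as a percentage of all observations.
rel = [round(100 * c / total, 1) for c in counts]
cum_rel = [round(100 * c / total, 1) for c in cum_counts]

for row in zip(labels, counts, cum_counts, rel, cum_rel):
    print("{:<8} {:>5} {:>5} {:>6.1f} {:>6.1f}".format(*row))
```

Computing the cumulative percentage from the cumulative count, rather than by adding rounded percentages, avoids accumulating rounding error.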


Graphical Summaries

Bar Graphs
Nominal data

No order to horizontal axis

Histograms

Continuous or ordinal data on horizontal axis

Box Plots
Continuous data


Bar Chart: Cause of Death

[Figure: bar chart of frequency (0 to 60) by cause of death: Motor Vehicle, Drowning, House Fire, Homicide, Other]

Histogram: Cigarette Consumption (1900-1990)

[Figure: cigarette consumption (vertical axis, 1000 to 4000) by year (horizontal axis, 1900 to 1990)]

Cigarette consumption between 1900 and 1990



The Box Plot

Follow these steps in order to produce a box plot:

1. Calculate the median m
2. Calculate the first and third quartiles Q1 and Q3
3. Compute the inter-quartile range IQR = Q3 - Q1
4. Find the lower fence LF = Q1 - 1.5*IQR
5. Find the upper fence UF = Q3 + 1.5*IQR
6. Find the lower adjacent value LAV = smallest value in the data that is greater than or equal to the lower fence
7. Find the upper adjacent value UAV = largest value in the data that is smaller than or equal to the upper fence
8. Any value below the LAV or above the UAV (that is, beyond the fences) is called an outlier and should receive extra attention
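
The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the lecture's implementation: the function name `box_plot_summary` and the eleven scores are hypothetical, and quartiles are computed with the "inclusive" interpolation method of Python's `statistics.quantiles`. Other software uses other quartile conventions and may give slightly different Q1 and Q3.

```python
import statistics


def box_plot_summary(data):
    """Compute the quantities needed to draw a box plot (steps 1 to 8)."""
    xs = sorted(data)
    # Steps 1-2: quartiles and median (inclusive interpolation, an assumption).
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    # Step 3: inter-quartile range.
    iqr = q3 - q1
    # Steps 4-5: fences.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Steps 6-7: adjacent values are the most extreme data points inside the fences.
    lav = min(x for x in xs if x >= lower_fence)
    uav = max(x for x in xs if x <= upper_fence)
    # Step 8: anything beyond the fences is an outlier.
    outliers = [x for x in xs if x < lower_fence or x > upper_fence]
    return {"median": median, "q1": q1, "q3": q3, "iqr": iqr,
            "lower_fence": lower_fence, "upper_fence": upper_fence,
            "lav": lav, "uav": uav, "outliers": outliers}


# Hypothetical scores for illustration only.
scores = [10, 12, 13, 14, 15, 16, 16, 17, 18, 19, 30]
summary = box_plot_summary(scores)
for name, value in summary.items():
    print(f"{name}: {value}")
```

For these scores the whiskers would end at 10 and 19, and the value 30 would be drawn as a separate point.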

Box Plot: Depression Scores

[Figure: box plot of depscore (vertical axis, 10 to 20)]

'Box plot of Koopmans depression scores'



Box Plot
The width of the plot has no meaning

25% of the data: <13
25% of the data: 13-16
25% of the data: 16-18
25% of the data: >18



myhajeer@gmail.com

Dr Mohammad Younis Hajeer, DDS, MSc, PhD


Introduction to Biostatistics
Part II


Biostatistics
(a portmanteau word made from biology and
statistics)
The application of statistics to a wide range of topics
in biology.

Biostatistics
It is the science which deals with development and
application of the most appropriate methods for
the:
Collection of data.

Presentation of the collected data.


Analysis and interpretation of the results.

Making decisions on the basis of such analysis

Other definitions for Statistics

Frequently used to refer to recorded data
Denotes characteristics calculated for a set of data, e.g., the sample mean

Role of statisticians

To guide the design of an experiment or survey prior to data collection
To analyze data using proper statistical procedures and techniques
To present and interpret the results to researchers and other decision makers

Sources of data

Records
Surveys (comprehensive or sample)
Experiments

Types of data
Constant

Variables

Types of variables
Quantitative variables: continuous, discrete
Qualitative variables: nominal, ordinal

Methods of presentation of data

Numerical presentation
Graphical presentation
Mathematical presentation

1- Numerical presentation
Tabular presentation (simple or complex)
Simple frequency distribution table (S.F.D.T.)

Title
Name of variable (units of variable)   Frequency   %
- Categories
Total

Table (I): Distribution of 50 patients at the surgical department of National Hospital of Hamah in May 2008 according to their ABO blood groups

Blood group   Frequency   %
A             12          24
B             18          36
AB             5          10
O             15          30
Total         50          100

Table (II): Distribution of 50 patients at the surgical department of National Hospital of Hamah in May 2008 according to their age

Age (years)   Frequency   %
20-<30        12          24
30-           18          36
40-            5          10
50+           15          30
Total         50          100

Complex frequency distribution table

Table (III): Distribution of 20 lung cancer patients at the chest department of National Hospital of Hamah and 40 controls in May 2008 according to smoking

                     Lung cancer
               Cases          Controls        Total
Smoking       No.    %       No.    %       No.    %
Smoker         15   75%        8   20%       23   38.33
Non-smoker      5   25%       32   80%       37   61.67
Total          20   100       40   100       60   100

Complex frequency distribution table

Table (IV): Distribution of 60 patients at the chest department of National Hospital of Hamah in May 2008 according to smoking & lung cancer

                     Lung cancer
              Positive        Negative        Total
Smoking       No.    %       No.    %       No.    %
Smoker         15   65.2       8   34.8      23   100
Non-smoker      5   13.5      32   86.5      37   100
Total          20   33.3      40   66.7      60   100
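
Tables (III) and (IV) hold the same counts but use different denominators: Table (III) gives column percentages (within cases and within controls), while Table (IV) gives row percentages (within smokers and within non-smokers). The sketch below makes that explicit. The helper names are hypothetical, and the cell counts 8 (control smokers) and 5 (non-smoking cases) are assumed from the stated totals and percentages.

```python
# 2 x 2 table of counts: rows are smoking status, columns are lung cancer status.
table = {
    "Smoker":     {"cases": 15, "controls": 8},
    "Non-smoker": {"cases": 5,  "controls": 32},
}


def column_percent(tab):
    """Percentage of each column total, as in Table (III)."""
    col_totals = {}
    for row in tab.values():
        for col, n in row.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {r: {c: round(100 * n / col_totals[c], 1) for c, n in row.items()}
            for r, row in tab.items()}


def row_percent(tab):
    """Percentage of each row total, as in Table (IV)."""
    return {r: {c: round(100 * n / sum(row.values()), 1) for c, n in row.items()}
            for r, row in tab.items()}


col_pct = column_percent(table)
row_pct = row_percent(table)
print("column %:", col_pct)
print("row %:   ", row_pct)
```

Which percentage is appropriate depends on how the groups were formed: when cases and controls are sampled separately, percentages within each of those groups are the natural summary.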

2- Graphical presentation
Graphs drawn using Cartesian coordinates

Line graph
Frequency polygon
Frequency curve
Histogram
Bar graph
Scatter plot

Pie chart
Statistical maps

rules

Line graph

Year   MMR/1000
1960   50
1970   45
1980   26
1990   15
2000   12

[Figure: line graph of MMR/1000 (vertical axis, 0 to 60) by year (1960 to 2000)]

Figure (1): Maternal mortality rate of (country), 1960-2000

Frequency polygon

Age (years)   Males        Females      Mid-point of interval
20-           3 (12%)      2 (10%)      (20+30) / 2 = 25
30-           9 (36%)      6 (30%)      (30+40) / 2 = 35
40-           7 (28%)      5 (25%)      (40+50) / 2 = 45
50-           4 (16%)      3 (15%)      (50+60) / 2 = 55
60-70         2 (8%)       4 (20%)      (60+70) / 2 = 65
Total         25 (100%)    20 (100%)

Frequency polygon

[Figure: percentage (vertical axis, 10 to 40) plotted against the age mid-points 25, 35, 45, 55, 65, with one line for males and one for females]

Figure (2): Distribution of 45 patients at (place), in (time) by age and sex

Frequency curve

[Figure: frequency (vertical axis, 0 to 9) against age in years (20-, 30-, 40-, 50-, 60-69), with one smoothed curve for males and one for females]

Histogram
Distribution of a group of cholera patients by age

Age (years)   Frequency   %
25-            3          14.3
30-            5          23.8
40-            7          33.3
45-            4          19.0
60-65          2           9.5
Total         21          100

[Figure: histogram of percentage (vertical axis, 10 to 35) by age in years, with class limits at 25, 30, 40, 45, 60 and 65]

Figure (2): Distribution of 100 cholera patients at (place), in (time) by age

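Bar chart

[Figure: bar chart of percentage (0 to 50) by marital status: Single, Married, Divorced, Widowed]

Bar chart

[Figure: grouped bar chart of percentage (0 to 50) by marital status (Single, Married, Divorced, Widowed), with separate bars for males and females]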

Pie chart

[Figure: pie chart with three slices: Translocation 79%, Inversion 18%, Deletion 3%]

3-Mathematical presentation

Summary statistics

Measures of location
1- Measures of central tendency
2- Measures of non central locations
(Quartiles, Percentiles )
Measures of dispersion

Summary statistics

1- Measures of central tendency (averages)

Midrange = (smallest observation + largest observation) / 2

Mode
the value which occurs with the greatest
frequency i.e. the most common value

Summary statistics

1- Measures of central tendency (cont.)

Median
the observation which lies in the middle of
the ordered observations.

Arithmetic mean (mean) = sum of all observations / number of observations

Measures of dispersion

Range
Variance
Standard deviation
Semi-interquartile range
Coefficient of variation
Standard error

Standard deviation (SD)
7, 7, 7, 7, 7, 7: Mean = 7, SD = 0
7, 7, 7, 6, 7, 8: Mean = 7, SD = 0.63
A wider spread of values (including 9 and 13): Mean = 7, SD = 4.04
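
The link between spread and SD can be checked numerically. In this sketch the first two data sets are read from the slide (six identical values, and six values between 6 and 8), while the third, `wide`, is a hypothetical set invented here because the slide's own third set is not fully legible. All three have a mean of 7, and the sample SD (divisor n - 1) is used.

```python
import statistics

identical = [7, 7, 7, 7, 7, 7]   # no spread at all
close = [7, 7, 7, 6, 7, 8]       # values within 1 of the mean
wide = [1, 4, 7, 7, 10, 13]      # hypothetical: values far from the mean

for name, data in [("identical", identical), ("close", close), ("wide", wide)]:
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    print(f"{name:<10} mean = {mean:.0f}  SD = {sd:.2f}")
```

The mean alone cannot distinguish the three data sets; the SD can.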


Measures of
Central Tendency & Spread


Common statistical terms


Data
Measurements or observations of a variable

Variable
A characteristic that is observed or
manipulated
Can take on different values

Statistical terms (cont.)


Independent variables
Precede dependent variables in time
Are often manipulated by the researcher
The treatment or intervention that is used in a
study

Dependent variables
What is measured as an outcome in a study
Values depend on the independent variable

Statistical terms (cont.)


Parameters
Summary data from a population

Statistics
Summary data from a sample

Population
A population is the group from which a
sample is drawn
e.g., headache patients in a chiropractic
office; automobile crash victims in an
emergency room

In research, it is often not practical to include all members of a population
Thus, a sample (a subset of a population) is taken

Random samples

Subjects are selected from a population so
that each individual has an equal chance
of being selected
Random samples tend to be representative of the source population
Non-random samples may not be representative
May be biased regarding age, severity of the condition, socioeconomic status, etc.

Random samples

In experimental studies, patients are
randomly assigned to treatment and
control groups
Each person has an equal chance of being
assigned to either of the groups

Random assignment is also known as randomization

Descriptive statistics (DSs)



A way to summarize data from a sample or a population
DSs illustrate the shape, central tendency, and variability of a set of data
The shape of data has to do with the frequencies of the values of observations
Central tendency describes the location of the middle of the data
Variability is the extent to which values are spread above and below the middle values
a.k.a. dispersion

DSs can be distinguished from inferential statistics
DSs are not capable of testing hypotheses

Hypothetical study data (partial, from book)

Case #   Visits
1        7
2        2
3        2
4        3
5        4
6        3
7        5
8        3
9        4
10       6
11       2
12       3
13       7
14       4

Distribution provides a summary of:
Frequencies of each of the values (value: frequency)
2: 3
3: 4
4: 3
5: 1
6: 1
7: 2
etc.
Ranges of values
Lowest = 2
Highest = 7

Frequency distribution table

Visits   Frequency   Percent   Cumulative %
2        3           21.4       21.4
3        4           28.6       50.0
4        3           21.4       71.4
5        1            7.1       78.6
6        1            7.1       85.7
7        2           14.3      100.0

Frequency distributions are often depicted by a histogram

Histograms
A histogram is a type of bar chart, but
there are no spaces between the bars
Histograms are used to visually depict
frequency distributions of continuous data
Bar charts are used to depict categorical
information
e.g., Male/Female, Mild/Moderate/Severe, etc.

Measures of central tendency


Mean (a.k.a., average)
The most commonly used DS

To calculate the mean
Add all values of a series of numbers and then divide by the total number of elements

Formula to calculate the mean

Mean of a sample: $\bar{X} = \frac{\sum X}{n}$
Mean of a population: $\mu = \frac{\sum X}{N}$

$\bar{X}$ (X bar) refers to the mean of a sample and $\mu$ refers to the mean of a population
$\sum X$ is a command that adds all of the X values
n is the total number of values in the series of a sample and N is the same for a population

Measures of central tendency (cont.)

Mode
The most frequently occurring value in a series
The modal value is the highest bar in a histogram

Measures of central
tendency (cont.)

Median
The value that divides a series of values in
half when they are all listed in order
When there are an odd number of values
The median is the middle value

When there is an even number of values
Count from each end of the series toward the middle and then average the 2 middle values
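
The three measures can be compared on the 14 visit counts from the hypothetical study data. This is only a sketch using Python's standard `statistics` module; the midrange is added for completeness using the definition given in Part II.

```python
import statistics

# Visits for the 14 cases in the hypothetical study data.
visits = [7, 2, 2, 3, 4, 3, 5, 3, 4, 6, 2, 3, 7, 4]

mean = statistics.fmean(visits)           # sum of values / number of values
median = statistics.median(visits)        # average of the 2 middle values (n is even)
mode = statistics.mode(visits)            # most frequently occurring value
midrange = (min(visits) + max(visits)) / 2

print(f"mean     = {mean:.2f}")
print(f"median   = {median}")
print(f"mode     = {mode}")
print(f"midrange = {midrange}")
```

With 14 values the median falls between the 7th and 8th ordered values (3 and 4), which is why it is not itself one of the observed counts.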



Each of the three methods of measuring
central tendency has certain advantages
and disadvantages
Which method should be used?
It depends on the type of data that is being
analyzed
e.g., categorical, continuous, and the level of
measurement that is involved

There are 4 levels of measurement


Nominal, ordinal, interval, and ratio

1. Nominal
Data are coded by a number, name, or letter
that is assigned to a category or group
Examples

Gender (e.g., male, female)
Treatment preference (e.g., manipulation, mobilization, massage)

Levels of measurement (cont.)


2. Ordinal
Is similar to nominal because the
measurements involve categories
However, the categories are ordered by rank
Examples

Pain level (e.g., mild, moderate, severe)
Military rank (e.g., lieutenant, captain, major, colonel, general)

Levels of measurement (cont.)


Ordinal values only describe order, not
quantity
Thus, severe pain is not the same as 2 times
mild pain

The only mathematical operation allowed for nominal and ordinal data is the counting of categories
e.g., 25 males and 30 females

Levels of measurement (cont.)


3. Interval
Measurements are ordered (like ordinal
data)
Have equal intervals
Does not have a true zero
Examples

The Fahrenheit scale, where 0 does not correspond to an absence of heat (no true zero)
In contrast to Kelvin, which does have a true zero

Levels of measurement (cont.)


4. Ratio
Measurements have equal intervals
There is a true zero
Ratio is the most advanced level of
measurement, which can handle most types
of mathematical operations


Levels of measurement (cont.)


Ratio examples
Range of motion
No movement corresponds to zero degrees
The interval between 10 and 20 degrees is the
same as between 40 and 50 degrees

Lifting capacity
A person who is unable to lift scores zero
A person who lifts 30 kg can lift twice as much as
one who lifts 15 kg

Levels of measurement (cont.)


NOIR is a mnemonic to help remember
the names and order of the levels of
measurement
Nominal
Ordinal
Interval
Ratio


Levels of measurement (cont.)

Measurement scale   Permissible mathematical operations                   Best measure of central tendency
Nominal             Counting                                              Mode
Ordinal             Greater or less than operations                       Median
Interval            Addition and subtraction                              Symmetrical: Mean; Skewed: Median
Ratio               Addition, subtraction, multiplication and division   Symmetrical: Mean; Skewed: Median

The shape of data


Histograms of frequency distributions have
shape
Distributions are often symmetrical with
most scores falling in the middle and fewer
toward the extremes
Many biological measurements are approximately symmetrically distributed and approximate a normal curve (a.k.a. bell-shaped curve)

The shape of data (cont.)

[Figure: histogram with a line depicting the shape of the data]

The normal distribution



Data whose histogram follows a normal curve are said to have a normal distribution (a.k.a. Gaussian distribution)
Properties of a normal distribution
It is symmetric about its mean
The highest point is at its mean
The height of the curve decreases as one moves away from the mean in either direction, approaching, but never reaching, zero

The normal distribution

[Figure: histogram with an overlying normal curve and the mean marked]
The highest point of the overlying normal curve is at the mean
As one moves away from the mean in either direction, the height of the curve decreases, approaching, but never reaching, zero
A normal distribution is symmetric about its mean

The normal distribution



Mean = Median = Mode


Skewed distributions

The data are not distributed symmetrically
in skewed distributions
Consequently, the mean, median, and mode
are not equal and are in different positions
Scores are clustered at one end of the
distribution
A small number of extreme values are located
in the limits of the opposite end

Skewed distributions

Skew is always toward the direction of the
longer tail
Positive if skewed to the right
Negative if to the left
The mean is shifted
the most


Skewed distributions

Because the mean is shifted so much, it is
not the best estimate of the average score
for skewed distributions
The median is a better estimate of the
center of skewed distributions
It will be the central point of any distribution
50% of the values are above and 50% below
the median



About 68.3% of the area under a normal
curve is within one standard deviation
(SD) of the mean
About 95.4% is within two SDs
About 99.7% is within three SDs
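
These areas can be reproduced from the standard normal cumulative distribution function. The snippet below is a small check using `statistics.NormalDist` from the Python standard library; the helper name `area_within` is an illustrative choice.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, SD 1.
standard_normal = NormalDist()


def area_within(k):
    """Proportion of the area under a normal curve within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {100 * area_within(k):.2f}%")
```

The result does not depend on the particular mean or SD, because any normal distribution can be rescaled to the standard one.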


Standard deviation (SD)


SD is a measure of the variability of a set
of data
The mean represents the average of a
group of scores, with some of the scores
being above the mean and some below
This range of scores is referred to as
variability or spread

Variance (s²) is another measure of spread

SD (cont.)
In effect, SD is the average amount of
spread in a distribution of scores
The next slide is a group of 10 patients
whose mean age is 40 years
Some are older than 40 and some younger



Ages are spread out along an X axis
The amount by which the ages are spread out is known as dispersion or spread

Adding deviations always equals zero

Calculating s²
To find the average deviation, one would normally total the deviations above and below the mean and then divide by the number of values
However, the total always equals zero
The deviations must first be squared, which removes the negative signs

Calculating s² (cont.)

s² is not in the same units as the data (age), but SD is
Symbol for the SD of a sample: s
Symbol for the SD of a population: σ
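
The calculation can be followed step by step on a small example. The ten ages below are hypothetical values chosen so that the mean is 40, as in the slides; they are not the lecture's data. The sketch assumes the usual sample convention of dividing the sum of squared deviations by n - 1, which may differ from the formula shown on the original slide.

```python
import math
import statistics

# Hypothetical ages of 10 patients (mean = 40), for illustration only.
ages = [25, 30, 35, 38, 40, 40, 42, 45, 50, 55]

n = len(ages)
mean = sum(ages) / n

# Deviations from the mean always add up to zero.
deviations = [x - mean for x in ages]

# Squaring removes the negative signs before averaging.
sum_sq = sum(d ** 2 for d in deviations)

s2 = sum_sq / (n - 1)   # sample variance, in years squared
sd = math.sqrt(s2)      # sample SD, back in years

print(f"mean = {mean}, sum of deviations = {sum(deviations)}")
print(f"s2 = {s2:.2f} years^2, SD = {sd:.2f} years")
```

Taking the square root at the end is what returns the measure of spread to the original units.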


Calculating SD with Excel

1. Enter values in a column
2. Click Data Analysis on the Tools menu
3. Select Descriptive Statistics and click OK
4. Click the Input Range icon
5. Highlight all the values in the column
6. Check the labels option if labels are in the first row
7. Check Summary Statistics
8. Click OK

SD is calculated precisely, plus several other DSs



It is more difficult to see a clear distinction between groups in the upper example because the spread is wider, even though the means are the same

z-scores
The number of SDs that a specific score is above or below the mean in a distribution
Raw scores can be converted to z-scores by subtracting the mean from the raw score and then dividing the difference by the SD

$z = \frac{X - \mu}{\sigma}$

z-scores
Standardization
The process of converting raw scores to z-scores
The resulting distribution of z-scores will
always have a mean of zero, an SD of one,
and an area under the curve equal to one
The proportion of scores that are higher or
lower than a specific z-score can be
determined by referring to a z-table
(assuming the scores are normally distributed)
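The claim that standardized scores have a mean of zero and an SD of one can be verified on any data set. The sketch below uses hypothetical scores and the sample SD; both choices are assumptions for illustration.

```python
import statistics

# Hypothetical raw scores (illustrative only)
scores = [25, 30, 35, 38, 40, 40, 42, 45, 50, 55]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)

# Standardization: convert every raw score to a z-score
z_scores = [(x - mean) / sd for x in scores]

z_mean = statistics.mean(z_scores)
z_sd = statistics.stdev(z_scores)
print(round(z_mean, 10), round(z_sd, 10))
```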

z-scores (cont.)
Refer to a z-table to find the proportion under the curve

Partial z-table (to z = 1.5) showing proportions of the area under a normal curve for different values of z. Each entry corresponds to the area under the curve shown in black.

z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.5000  0.5040  0.5080  0.5120  0.5160  0.5199  0.5239  0.5279  0.5319  0.5359
0.1   0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
0.2   0.5793  0.5832  0.5871  0.5910  0.5948  0.5987  0.6026  0.6064  0.6103  0.6141
0.3   0.6179  0.6217  0.6255  0.6293  0.6331  0.6368  0.6406  0.6443  0.6480  0.6517
0.4   0.6554  0.6591  0.6628  0.6664  0.6700  0.6736  0.6772  0.6808  0.6844  0.6879
0.5   0.6915  0.6950  0.6985  0.7019  0.7054  0.7088  0.7123  0.7157  0.7190  0.7224
0.6   0.7257  0.7291  0.7324  0.7357  0.7389  0.7422  0.7454  0.7486  0.7517  0.7549
0.7   0.7580  0.7611  0.7642  0.7673  0.7704  0.7734  0.7764  0.7794  0.7823  0.7852
0.8   0.7881  0.7910  0.7939  0.7967  0.7995  0.8023  0.8051  0.8078  0.8106  0.8133
0.9   0.8159  0.8186  0.8212  0.8238  0.8264  0.8289  0.8315  0.8340  0.8365  0.8389
1.0   0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
1.1   0.8643  0.8665  0.8686  0.8708  0.8729  0.8749  0.8770  0.8790  0.8810  0.8830
1.2   0.8849  0.8869  0.8888  0.8907  0.8925  0.8944  0.8962  0.8980  0.8997  0.9015
1.3   0.9032  0.9049  0.9066  0.9082  0.9099  0.9115  0.9131  0.9147  0.9162  0.9177
1.4   0.9192  0.9207  0.9222  0.9236  0.9251  0.9265  0.9279  0.9292  0.9306  0.9319
1.5   0.9332  0.9345  0.9357  0.9370  0.9382  0.9394  0.9406  0.9418  0.9429  0.9441
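The table entries are values of the cumulative standard normal distribution, so they can also be computed rather than looked up. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the choice of z = 1.5 is just an example.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

z = 1.5
proportion_below = std_normal.cdf(z)   # area under the curve to the left of z
proportion_above = 1 - proportion_below

print(round(proportion_below, 4))  # matches the table entry for z = 1.50
print(round(proportion_above, 4))
```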


myhajeer@gmail.com

Descriptive Statistics
Part II




Descriptive Biostatistics
The best way to work with data is to
summarize and organize them.
Numbers that have not been
summarized and organized are called
raw data.

Descriptive measures

A descriptive measure is a single
number that is used to describe a set
of data.
Descriptive measures include
measures of central tendency and
measures of dispersion.

Measures of Central Tendency



Central tendency is a property of the
data that they tend to be clustered about
a center point.
Measures of central tendency include:
mean (generally not part of the data set)
median (may be part of the data set)
mode (always part of the data set)
4

Measures of Dispersion

Dispersion is a property of the data
that they tend to be spread out.
Measures of dispersion include:
range
variance
standard deviation

Arithmetic mean

The mean or arithmetic mean is
the "average", which is obtained by
adding all the values in a sample or
population and dividing the sum by the
number of values.

General formula--population mean

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

General formula--sample mean

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$


1. Uniqueness -- For a given set of
data there is one and only one mean.
2. Simplicity -- The mean is easy to
calculate.
3. Affected by extreme values -- The
mean is influenced by each value.
Therefore, extreme values can distort
the mean.

Median

The median is the value that
divides the set of data into two equal
parts. It is the midpoint of the data
set.
The number of values equal to or
greater than the median equals the
number of values less than or equal
to the median.

Finding the median

1. Arrange (sort) the data in order of
increasing value in a sorted list.
2. Find the median.
a. Odd number of values (n is odd)
median = the middle value, i.e. the value
at position (n + 1)/2 in the sorted list
b. Even number of values
(n is even)
median = average of the two
values in the middle
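The two cases for finding the median can be sketched in a few lines. The function name `find_median` and the sample lists are hypothetical; Python's built-in `statistics.median` gives the same results.

```python
def find_median(values):
    """Median by sorting, then taking the middle value(s)."""
    ordered = sorted(values)          # step 1: arrange in increasing order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    middle = n // 2
    if n % 2 == 1:
        # n is odd: the single middle value, at position (n + 1)/2 (1-based)
        return ordered[middle]
    # n is even: average of the two values in the middle
    return (ordered[middle - 1] + ordered[middle]) / 2

print(find_median([7, 1, 5]))      # odd number of values
print(find_median([4, 1, 3, 2]))   # even number of values
```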


1. Uniqueness -- There is only one
median for each set of data.
2. Simplicity -- It is easy to calculate.
3. Effect of extreme values -- The
median is not as drastically affected by
extreme values as is the mean.
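The different sensitivity of the mean and the median to extreme values is easy to demonstrate. The ages below are hypothetical and were chosen only to make the effect visible.

```python
import statistics

# Hypothetical ages, tightly clustered around 40
ages = [38, 39, 40, 41, 42]
# The same data with one extreme value added
ages_with_outlier = ages + [95]

mean_before = statistics.mean(ages)
median_before = statistics.median(ages)
mean_after = statistics.mean(ages_with_outlier)
median_after = statistics.median(ages_with_outlier)

print("mean:  ", mean_before, "->", round(mean_after, 2))
print("median:", median_before, "->", median_after)
```

In this example the single extreme value moves the mean by more than nine years, while the median shifts by half a year.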

14

15

Mode

The mode is the value that occurs
most often in a set of data.
It is possible to have more than one
mode or no mode.
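A short sketch of how several modes, or none, can arise. The helper `find_modes` is hypothetical, and it assumes one common convention: when every value occurs exactly once, the data set is treated as having no mode.

```python
from collections import Counter

def find_modes(values):
    """Return a sorted list of modes; an empty list means 'no mode'."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Assumed convention: no value repeats, so there is no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(find_modes([1, 2, 2, 3]))        # one mode
print(find_modes([1, 1, 2, 2, 3]))     # two modes
print(find_modes([1, 2, 3]))           # no mode
```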

16

Variability of data

Dispersion refers to the variety
exhibited by the values of the data. The
amount of dispersion is small when the
values are close together and large when
they are widely scattered.

17

Range
The range is the difference between
the largest and smallest values in the set
of observations.
These values are often called the
maximum and the minimum.

18

Variance
Variance is used to measure the
dispersion of values relative to the
mean.
When values are close to their mean
(narrow range) the dispersion is less
than when there is scattering over a wide
range.

19

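Calculation of the sample variance

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

$s^2$ = sample variance
$x_i$ = individual value
$\bar{x}$ = sample mean
n = number of values

Variance of a population

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

$\sigma^2$ = population variance
N = population size
$\mu$ = population mean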

Degrees of freedom
In computing the variance there are
n - 1 degrees of freedom because if
n - 1 of the deviations are known, the nth
one is determined automatically.
This is because all of the values of
$(x_i - \bar{x})$ must add to zero.

22

Differences in calculations
Values of $s^2$ and $\sigma^2$ are different
because $s^2$ divides by n - 1
whereas $\sigma^2$ divides by N.
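The effect of the two divisors can be seen by applying both formulas to the same numbers. This sketch uses the standard `statistics` module, where `variance` divides by n - 1 and `pvariance` divides by N; the data values are hypothetical.

```python
import statistics

# Hypothetical values (illustrative only); their mean is 5
values = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(values) - min(values)             # largest minus smallest
sample_variance = statistics.variance(values)       # divides by n - 1
population_variance = statistics.pvariance(values)  # divides by N
sample_sd = statistics.stdev(values)
population_sd = statistics.pstdev(values)

print("range:", value_range)
print("sample variance:", round(sample_variance, 4))
print("population variance:", population_variance)
print("sample SD:", round(sample_sd, 4))
print("population SD:", population_sd)
```

Dividing by n - 1 always gives the larger of the two variances, and the gap narrows as n grows.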

23

Sample standard deviation


The standard deviation is the square root of
the variance. The standard deviation
expresses the dispersion in terms of the
original units. Since the variance of a sample
is $s^2$, we take the square root:

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

24

Population Standard Deviation


For a population, the standard deviation
is $\sigma$, which is the square root of the
population variance $\sigma^2$.

25

26

Coefficient of variation

Coefficient of variation is a measure of
the relative amount of variation as
opposed to the absolute variation.
C.V. is independent of the units of
measure. It can be useful for comparing
different results from people investigating
the same variable.
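The formula itself did not survive on these slides, so the sketch below assumes the usual definition, C.V. = (SD / mean) x 100. The function name and the height data are hypothetical; the point is that the same measurements expressed in centimetres and in metres give the same C.V.

```python
import statistics

def coefficient_of_variation(values):
    """C.V. as a percentage, assuming the definition (sample SD / mean) * 100."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("C.V. is undefined when the mean is zero")
    return statistics.stdev(values) / mean * 100

# Hypothetical heights recorded in two different units
heights_cm = [150, 160, 170, 180, 190]
heights_m = [h / 100 for h in heights_cm]

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)
print(round(cv_cm, 2), round(cv_m, 2))
```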
27

28


Hypothesis Testing & p-values



p value
Sampling, population, Normal distribution, SD
1000 10000
Prove vs. disprove
Null Hypothesis


Griffiths et al (2000)
184
211
123
Intervention vs. Control
population
effective (Intervention vs. Control)
effectiveness
Int = Con



d = 88
SD(d), SE(d)
SD1, SD2, SD pooled
d = Int - Con
SEpooled(d)
z = d / SEpooled(d) = 88 / 16.1 = 5.46
95%: z = 1.96
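The figures that survive on this slide (88, 16.1, 5.46 and the 1.96 cut-off) fit a z statistic formed by dividing the difference d by its pooled standard error. The sketch below rests on that reading, which is an assumption because the accompanying text was lost; the variable names are illustrative.

```python
from statistics import NormalDist

# Figures taken from the slide; their interpretation is assumed
d = 88.0           # difference between the groups (Int - Con)
se_pooled = 16.1   # pooled standard error of the difference

z = d / se_pooled

std_normal = NormalDist()
critical_z = std_normal.inv_cdf(0.975)             # two-sided 5% cut-off, about 1.96
p_two_sided = 2 * (1 - std_normal.cdf(abs(z)))     # two-sided p-value

reject_null = abs(z) > critical_z
print("z =", round(z, 2))
print("critical z =", round(critical_z, 2))
print("reject null hypothesis:", reject_null)
```

Computed this way z is about 5.47; the slide shows 5.46, which presumably reflects truncation or unrounded inputs.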


reject
p-value
95%
z = 1.96 corresponds to 5%

Glazener et al 2001
z
p-value: 0.0324
Statistical Significance Test; p-value
5%


p
1:
2: Statistically significant difference; 5%
3: p-value
4: p
Type I Decision Error (Type I Error)
p-value

Statistical Power
Type I Error: False-Positive
Statistical Power (Study Power)
Type II Error: False-Negative
Power of the study
p; Statistically significant
p-value




Tests for comparing two groups


1. Objectives and Aims
2. Hypothesis
3. Type of Data
4. Distribution
5. Summary Measure

Paired Observations: Cross-over Trials; Matched pairs of subjects
Exploratory Data Analysis



Baseline Sample Characteristics
Morrell et al (1998)
233: Intervention Group (120), Control Group (113)
12
Health related Quality of Life (HRQoL)
SF-36 (100)
Cross over trials; Matched pairs
Case-control Studies
HRQoL
H0: HRQoL, Baseline
HA: HRQoL

Paired t test
Assumptions
t distribution with n - 1 degrees of freedom

n = 36
Mean difference = 7.3
SD = 16.5, SE = 2.8
df = n - 1 = 35
t = 2.66
p-value between 0.01 and 0.02 (0.012)
95% CI for the mean HRQoL difference (7.3): 1.7 to 12.9
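The surviving numbers on this slide (36, 7.3, 16.5, 2.8, df = 35, t = 2.66, and 1.7 to 12.9) fit a paired t test computed from summary statistics. The sketch below reproduces that arithmetic under the assumption that 7.3 is the mean of the paired differences and 16.5 is their SD. The critical value 2.030 for 35 degrees of freedom is hard-coded from a standard t-table because the Python standard library has no t distribution.

```python
import math

# Summary figures from the slide; the labels are an assumed reading
n = 36             # number of pairs
mean_diff = 7.3    # mean of the paired differences
sd_diff = 16.5     # SD of the paired differences

se_diff = sd_diff / math.sqrt(n)   # standard error of the mean difference
t_stat = mean_diff / se_diff       # paired t statistic
df = n - 1                         # degrees of freedom

# Two-sided 5% critical value of t for df = 35 (taken from a t-table)
t_critical = 2.030
ci_low = mean_diff - t_critical * se_diff
ci_high = mean_diff + t_critical * se_diff

print("SE =", round(se_diff, 2))
print("t =", round(t_stat, 2), "on", df, "df")
print("95% CI:", round(ci_low, 1), "to", round(ci_high, 1))
```

With these rounded inputs t comes out at about 2.65 rather than the 2.66 on the slide, which was presumably computed from unrounded data; the confidence interval matches to one decimal place.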


Wilcoxon Signed rank matched pairs test
Wilcoxon Signed Rank Test
t
0.012
5%
Two-sample t test




Two-sample t-test for comparing means
Degrees of freedom: n1 + n2 - 2
p-value
95% CI: 1.2 to 10.5
Mann-Whitney U test
Log Rank Test: Survival Data



Mann-Whitney U test
Wilcoxon Mann Whitney
Wilcoxon: W; Mann-Whitney: U
z
p-value
Discrete Or Count Data
One-Way ANOVA (Analysis of Variance)
Kruskal-Wallis



2
Tests for comparing two groups


1. Objectives and Aims
2. Hypothesis
3. Type of Data
4. Distribution
5. Summary Measure


Cross-tabulated 2 x 2 table (4 cells)
Chi-squared test
r x c table
Yates
0.247
0.620
Fisher's Exact Test: 2 x 2
Yates
2 x 2
Factor
Ordinal: 1 ... 5
Chi-square test for trend
McNemar
Non-Normal distributions
Degrees of Freedom

