Principles of Research
Methodology
A Guide for Clinical Investigators
Jeffrey S. Borer
Professor and Chair, Department of Medicine
Chief, Division of Cardiovascular Medicine
Director of The Howard Gilman Institute for Heart Valve Disease
Director of the Cardiovascular Translational Research Institute
SUNY Downstate Medical Center
Brooklyn, NY, USA
Foreword
This superb book on research philosophy and methodology that Drs. Phyllis
Supino and Jeffrey Borer have written and edited came out of an experience
common to most of us involved in training investigators beginning their
research careers. How do you teach these investigators the mostly unwritten
ways of an area as complex as medical research? How do you help the
research neophyte develop into a creative and reliable researcher? For me and
my associates in the Cardiology Branch of the NIH (of which Dr. Borer was
one) in the 1970s and 1980s, the teaching process was mostly based on an
apprenticeship model, with learning coming in the actual doing of the
research. This time-honored approach led to the development, in many
research centers, of a cadre of superb researchers, but it was hard to master
and the results were necessarily inconsistent, with many young investigators
going down wrong paths.
Drs. Supino and Borer's book represents a unique collaboration between
an accomplished educator specializing in research methodology and a promi-
nent physician-scientist. Drs. Supino and Borer began their collaboration
more than 20 years ago at Cornell University Medical College, continuing
their work together in what became the Howard Gilman Institute for Valvular
Heart Diseases. The Institute, of which Dr. Borer is the Director, now is
located at the State University of New York Downstate Medical Center.
Working within the context of a research institute housed within a medical
school, Dr. Borer soon discovered that most of the fellows coming into his
program had no formal research training and scant knowledge of research
methodology. Prior to joining the Institute, Dr. Supino had been conducting
continuing education in research methodology for scientists and health
professionals since the late 1970s. When Dr. Supino joined the Institute in 1990, she
applied her accumulated expertise in this field to develop a curriculum and
lead a comprehensive course providing formal training in research
methodology for Dr. Borer's fellows and others at the institution. This curriculum and
course, developed in partnership with Dr. Borer, turned out to be our good
fortune. During the ensuing 20+ years Drs. Supino and Borer gradually devel-
oped the pedagogical framework for writing what is one of the best books in
the field.
This book provides in-depth chapters containing information critical to
creating good research, from the kind of mind-set that generates valuable
research questions to study design, to exploring a variety of online data
bases, to the elements making for compelling research grants and papers,
and to the wonderfully informing chapter on the history of the application
of ethics to medical research. There also is a valuable chapter on statistical
considerations and a fascinating discussion on the origins and elements of
hypothesis generation.
It's also important to emphasize that this superb text is not only for the
new investigator, but for experienced investigators as well. This results from
the fact that Drs. Supino, Borer, and their coauthors write their chapters in
ways that are not only easily accessible to the new investigator, but at the
same time are sufficiently sophisticated so that the seasoned investigator will
profit.
As an example, I particularly enjoyed the first chapter, written by
Dr. Supino, which provides some down-to-earth examples of, in essence, why
there should be a clearly defined primary endpoint in clinical investigations.
As I was reading her chapter, I realized I had forgotten the "why" of this
requirement, and that I was just taking the requirement for granted, a situation
that could make investigators vulnerable to dismissing its importance. In this
regard, over the years I've found it not uncommon for investigators, who find
that the intervention they're studying significantly improves
one or another secondary endpoint but not the primary endpoint, to freely
attack this requirement and argue they've proven the efficacy of their
intervention. But Dr. Supino reminds us what good science is by providing an
elegantly simple example of the marksman who boasts of his skills after
interpreting the results of his shooting a gun at a piece of paper hung on the side
of a barn. The marksman, it turns out, does not prospectively define the bull's
eye. Rather, after multiple bullets are fired at the piece of paper, he inspects
the bullet hole-riddled paper, sees the random bullet hole patterns, and then
draws a circle (bull's eye) around a group of holes that by chance have fallen
into a tight cluster. The post hoc definition of the bull's eye (i.e., now the
primary endpoint) speaks (unjustifiably) to the marksman's skill. By this
simple anecdote, Dr. Supino makes the critical importance of prospectively
defining the primary endpoint exquisitely clear.
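The marksman's error can also be put in numbers. What follows is a minimal
Python sketch, not taken from the book, that simulates trials of an
intervention with no true effect and compares two rules for declaring success:
judging only a prespecified primary endpoint, or promoting whichever endpoint
looks best after the fact. The function name false_positive_rate, the choice of
five endpoints, the 0.05 threshold, and the assumption that the endpoints are
independent are all illustrative assumptions.

```python
import random


def false_positive_rate(n_endpoints, post_hoc, n_trials=20000, alpha=0.05, seed=0):
    """Fraction of null trials that are declared 'positive'.

    Under the null hypothesis each endpoint's p-value is uniform on [0, 1],
    so p-values are drawn directly instead of simulating patients.
    Assumption: endpoints are independent (a simplification).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        p_values = [rng.random() for _ in range(n_endpoints)]
        # Prospective rule: only the pre-declared endpoint (index 0) counts.
        # Post hoc rule: the best-looking endpoint is promoted afterwards.
        p = min(p_values) if post_hoc else p_values[0]
        if p < alpha:
            hits += 1
    return hits / n_trials


prospective = false_positive_rate(n_endpoints=5, post_hoc=False)
post_hoc = false_positive_rate(n_endpoints=5, post_hoc=True)
# Analytic value for the post hoc rule under independence.
expected_post_hoc = 1 - (1 - 0.05) ** 5

print(f"prespecified endpoint: {prospective:.3f}")
print(f"best endpoint chosen post hoc: {post_hoc:.3f}")
print(f"analytic post hoc rate: {expected_post_hoc:.3f}")
```

Under these assumptions, the prospective rule should declare success in about
5% of null trials, while the post hoc rule should approach 1 - 0.95^5, or
roughly 23%. Correlated endpoints, as in real trials, would typically inflate
the error somewhat less, but the direction of the effect is the same.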
A foreword is no place to provide extensive details of what a book
contains. I'll therefore limit myself and just enthusiastically say this first chapter
I read is representative of the high quality of the chapters to come. Drs.
Supino and Borer have used the many years they have developed their course
extraordinarily well: they and their outstanding coauthors have produced a
book that is well written, beautifully edited, and contains wisdom and insight.
It is a book, whether reading it in its entirety or perusing individual chapters,
that presents the reader with a superb learning experience. The authors have
certainly hit the bull's eye.
Preface
This book has been written to aid medical students, physicians, and other
health professionals as they probe the increasingly complex and varied
medical/scientific literature for knowledge to improve patient care and search for
guidance in the conduct of their own research. It also is intended for basic
scientists involved in translational research who wish to better understand the
unique challenges and demands of clinical research and, thus, become more
successful members of interdisciplinary medical research teams.
The book is based largely on a lecture series on research methodology,
with particular emphasis on issues affecting clinical research, that the editors
designed and have offered for 21 years to more than 1,000 members of the
academic medical communities of Weill Cornell Medical College and the
State University of New York (SUNY) Downstate Medical Center, both
located in New York City. The book spans the entire research process, from
the conception of the research problem to publication of findings.
The need for such a book has become increasingly clear to us during many
years of conducting a program of training and research in cardiovascular dis-
eases and in our general teaching of research methodology to students, train-
ees, and postgraduate clinical physicians and researchers. Though agreement
on the fundamental principles of scientific research has existed for more than
a century, the application of these principles has changed over time. The
precision required in defining study populations and in detailing methodologies
(and their deficiencies) is continually increasing. In addition, a bewildering
arsenal of statistical tools has developed (and continues to grow) to identify
and define the magnitude and consistency of relationships. Simultaneously,
acceptable formats for communicating scientific data have changed in
response to parallel changes in the world at large, and under the pressure of
an information explosion which mandates succinctness and clarity.
Despite these demands, there are few books, if any, that comprehensively and
concisely present these concepts in a manner that is relevant and comprehensible
to a broad professional medical community. This text is designed to resolve this
deficiency by combining theory and practical application to familiarize the
reader with the logic of research design and hypothesis construction, the impor-
tance of research planning, the ethical basis of human subjects research, the
basics of writing a clinical protocol, the logic and techniques of data generation
and management, and the fundamentals and implications of various sampling
techniques and alternative statistical methodologies. This book also aims to offer
guidance for assembling and interpreting results, writing scientific papers, and
publishing studies.
The book's 13 chapters emphasize the role and structure of the scientific
hypothesis (reinforced throughout the various chapters) in informing meth-
ods and in guiding data interpretation. Chapter 1 describes the general
characteristics of research and differentiates among various types of research;
it also summarizes the steps typically utilized in the hypothesis-testing
(hypothetico-deductive) method and underscores the importance of proper
planning. Chapter 2 reviews the origins of clinical research problems and the
types of questions that are commonly asked in clinical investigations; it also
identifies the characteristics of well-conceived research problems and explains
the role of the literature search in research problem development. Chapter 3
introduces the reader to various modes of logical inference utilized for
hypothesis generation, describes the characteristics of well-designed research
hypotheses, distinguishes among various types of hypotheses, and provides
guidelines for constructing them. Chapter 4 takes the reader through classic
epidemiological (observational) methods, including cohort, case-control,
and cross-sectional designs, and describes their respective advantages and
limitations. Chapter 5 discusses the meaning of internal and external validity
in the context of studies that aim to examine the effects of purposively applied
interventions, identifies the most important sources of bias in these types of
studies, and presents a variety of alternative study designs that can be used to
evaluate interventions, together with their respective strengths and
weaknesses for controlling each of the identified biases. Chapter 6 defines and
describes the purpose of the clinical trial and provides in-depth guidelines for
writing the clinical protocol that governs its conduct. Chapter 7 describes
methodologies used for data capture and management in clinical trials and
reviews associated regulatory requirements. Chapter 8 explains the steps
involved in designing, implementing, and evaluating questionnaires and
interviews that seek to obtain self-reported information. Chapter 9 reviews
the pros and cons of systematic reviews and meta-analyses for generating
secondary data by synthesizing evidence from previously conducted studies,
and discusses methods for locating, evaluating, and writing them. Chapter 10
describes the various methods by which subjects can be sampled and the
implications of these methods for drawing conclusions from clinical research
findings. Chapter 11 introduces the reader to fundamental statistical
principles used in biomedical research and describes the basis of determination of
sample size and definition of statistical power. Chapter 12 describes the
ethical basis of human subjects research, identifies areas of greatest concern to
institutional review boards, and outlines the basic responsibilities of investi-
gators towards their subjects. Finally, Chapter 13 provides practical guidance
on how to write a publishable scientific paper.
The authors of this book include prominent medical scientists and meth-
odologists who have extensive personal experience in biomedical investiga-
P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology: A Guide for Clinical Investigators, 1
DOI 10.1007/978-1-4614-3360-6_1, Phyllis G. Supino and Jeffrey S. Borer 2012
2 P.G. Supino
conduct a research project and to expect to design, execute, and complete it
in that time frame.
There is general consensus that information gathering, including reviewing
and synthesizing the literature, is a critically important activity to be
undertaken by an investigator. However, in and of itself, it is not research.
The same can be said for data gathering activities aimed at personal
edification or those undertaken to resolve organization-specific issues. So
what, then, characterizes research?
Tuckman [3] has argued that in order for an activity to qualify as research,
it should possess a minimum of five characteristics:
1. It should be systematic.
While some important research findings have occurred serendipitously (e.g.,
Fleming's accidental discovery of penicillin, Pasteur's chance finding of
microbial antibiosis), most arise out of purposeful, structured activity.
Structure is engendered by a series of rules for defining variables,
constructing hypotheses, and developing research designs. Rules also exist
for collecting, recording, and analyzing data, as well as for relating
results to the problem statement or hypotheses. These rules are used to
generate formal plans (or protocols) which guide the research effort, thereby
optimizing the likelihood of achieving valid results.
2. It should be logical.
Research employs logic that may be inductive, deductive, or abductive in
nature. Inductive logic is employed to develop generalizations from repeated
observations, abductive logic is used to form generalizations that serve as
explanations for anomalous events, and deductive logic is used to generate
specific assertions from known scientific principles or generalizations.
Further elaboration of these distinctions is covered in Chap. 3. Logic is
used both in the development of the research design and selection of
statistics to ensure that valid inferences may be drawn from data (internal
validity). Logic also is used to generalize from the results of the
particular study to a broader context (external validity or extrapolability).
3. It should be empirical.
Despite the deductive processes that may precede data collection, the
findings of research must always be based on observation or experience and,
thus, must relate to reality. It is the empirical quality of research that
sets it apart from other logical disciplines, such as philosophy, which also
attempts to explain reality. Recognition of this fact may pose a problem for
physicians who, according to some researchers [4, 5], have a cognitive style
that tends to be more deterministic than probabilistic, causing personal
experience to be valued more than data. Under these circumstances, the
importance of subordinating the hypothesis to data may not be fully
appreciated. As part of the education of the physician scientist, he or she
must learn that when confronted with data that do not support the study
hypothesis, it is the hypothesis and not the data that must be discarded,
unless it is abundantly clear that something untoward occurred during the
performance of the study.
4. It should be reductive.
As Tuckman [3] has noted, a fundamental purpose of research is to reduce the
confusion of individual events and objects to more understandable categories
of concepts (p. 11). One heuristic tool used by scientists for this purpose
is the creation of abstractive constructs such as intervening variables
(e.g., resistance and solubility in the physical sciences, conditioning or
reflex reserve in the behavioral sciences) to explain how phenomena cause or
otherwise interact with each other [6]. Another powerful tool available to
the researcher for this purpose is a constellation of techniques for
numerical and graphical data analysis (the specific methodology employed
depending on the objectives and design of the study as well as the number of
observations generated by the study). As Tuckman observes, whenever data are
subjected to analysis, some information is lost, specifically the uniqueness
of the individual observation. However, such losses are offset by gains in
the capacity to
1 Overview of the Research Process 3
practical problems (indeed, it can progress for decades before leading to
breakthroughs and paradigm shifts in practice), though it can yield
unexpected applications (e.g., the discovery of the laser and its value for
fiber-optic communications [10]), and it often provides the theoretical
underpinnings of applied research. Applied research, in contrast, is
conducted specifically to find solutions to practical problems in as rapid a
time frame as possible. In medicine, applied research searches for explicit
knowledge to improve the treatment of a specific disease or its sequelae.
Examples of applied research include clinical trials of new drugs and devices
in human subjects or evaluation of new uses for existing therapeutic
interventions.
In recent years, translational or translative research has emerged as a
paradigm alternative to the dichotomy between basic and applied research.
Currently practiced in the natural, behavioral, and social sciences, and
heavily reliant on multidisciplinary collaboration, translational research is
a method of conceptualizing and conducting basic research to render its
findings directly and more immediately applicable to the population under
study. In medicine, this iterative approach is used to translate results of
laboratory research more rapidly into clinical practice and vice versa
("bench to bedside and back" or "T1 translation") and from clinical practice
to the population at large ("to the community and beyond and back" or "T2
translation") to enhance public knowledge. This is one of the major
initiatives of the US National Institutes of Health (NIH) Roadmap for Medical
Research. Examples of T1 translation include the development of a technique
for evaluating endothelium-dependent vasodilator responses as a diagnostic
test in patients with atherosclerosis and the elucidation of the role of the
p53 tumor suppressor gene in the regulation of apoptosis in the treatment of
patients with cancer [11]. Examples of T2 translation would include the
implementation, evaluation, and ultimate adoption of interventions that have
been shown to be effective in clinical research for primary or secondary
prevention in heart disease, stroke, and other disorders. (For an in-depth
discussion of purpose, challenges, and techniques of translational research
in clinical medicine and associated career opportunities, the reader is
referred to the collective works of Schuster and Powers [12], Woolf [13],
Robertson and Williams [14], and Goldblatt and Lee [15].)

Hypothesis-Generating Versus Hypothesis-Testing Research

Although some studies are undertaken to describe a phenomenon (e.g.,
incidence of a new disease or prevalence of an existing disorder in a new
population), most research is performed to generate a hypothesis or to test a
hypothesis. In hypothesis-generating research, the investigator begins with
an observation (e.g., a newly discovered pattern, a rare event) and
constructs an argument to explain it. Hypothesis-generating research
typically is conducted when existing theory or knowledge is insufficient to
explain particular phenomena. Popular tools for hypothesis generation in
preclinical research include gene expression microarray studies; hypotheses
for clinical or epidemiological research may be generated secondary to a
project's initial purpose by mining existing datasets. In contrast, in
hypothesis-testing research (sometimes called the hypothetico-deductive
approach), the investigator begins with a general conjecture or hunch put
forth to explain a prior observation or to clarify a gap in the existing
knowledge base.
It is vitally important that the investigator keep these differences in mind
when designing and drawing inferences from a study. To underscore what can
happen when these distinctions are blurred, it is instructive to step back
from scientific inquiry and mull over the following scenario:
A Texas cowboy fires his gun randomly at the side of a barn. Figure 1.1 (left
panel) shows his results. He pores over his efforts, paints a target centered
around his largest number of hits (Fig. 1.1, right panel), and claims to be a
sharpshooter.
Do you agree that the Texan is a sharpshooter? Do you think that if he
repeated his so-called
target practice, he would again be able to get that many bullets in the
circle? Note: the Texan defined his target only after he saw his results. He
also ignored the bullets that were not in the cluster! This parable
illustrates what epidemiologists call the Texas Sharpshooter Fallacy [16] to
underscore the dangers of forming causal conclusions about cases of disease
that happen to cluster in a population due to chance alone or to reasons
other than the chosen cause. As per Atul Gawande, in his classic article in
The New Yorker, of the myriad of cancer clusters studied by scientists in the
United States, not one has convincingly identified an underlying
environmental cause [17]. In a more general sense (and particularly germane
to the activities of some biomedical researchers), the Texas Sharpshooter
Fallacy is related to the clustering illusion, which refers to the tendency
of individuals to interpret patterns in randomness when none actually exists,
often due to an underlying cognitive bias.
Consider a more clinical example: A resident inherits a dataset that contains
information about 95 patients with chronic coronary artery disease. Figure
1.2 depicts the variables in that dataset. He believes that he could satisfy
his research elective if he could draw inferences about this study group,
though he has no a priori idea about what relationships would be most
reasonable to explore. He recruits a friend who happens to have a statistical
package installed on his computer, enters all of the variables in the dataset
into a
multiple regression model, and comes up with some statistically significant
findings, as noted below:
Ischemia severity and benefit of coronary artery bypass grafting (CABG): p < 0.001
Hair color and severity of myocardial infarction (MI): p < 0.03
Zip code and height: p < 0.04
He concludes that he has confirmed the hypothesis that there is a strong
association between preoperative ischemia severity and benefit of coronary
artery bypass grafting because not only was the obtained probability (p)
value low, his hypothesis also makes clinical sense. He also decides that he
would not report the other findings because, while also statistically
significant, he cannot explain them. What methodological error has the
resident made in drawing his conclusion?
The answer is that, analogous to the rifleman who defined his target only
after the fact, the resident confirmed a hypothesis that did not exist before
he examined patterns in his data. The fallacy would not have occurred if the
resident had, in mind, a prior expectation of a particular association. It
also would not have occurred had the resident used the data to generate a
hypothesis and validated it, as he should have, with an independent group of
observations if he wanted to draw such a definitive conclusion. This is an
important distinction because the identification of an association between
two or more variables may be the result of a chance difference in the
distribution of these variables, and hypotheses identified this way are
suggestive at best, not proven. What one cannot do is to use the same data to
generate and test a hypothesis.
Moreover, the resident compounded his error by capitalizing on only one
association that he found, ignoring all of the others. Working with
hypotheses is like playing a game of cards. You cannot make up rules after
seeing your hand, or change the rules midstream if you do not like the hand
that you have been dealt. Similarly, if you gather your data first and draw
conclusions based only on those you believe to be true, you have, in the
words of the famed behavioral scientist, Fred Kerlinger, violated the rules
of the scientific game [18]. The most important take-home point is that a
hypothesis, if you wish to test it, always should be generated before data
collection begins.
Hypothesis-testing studies (especially randomized clinical trials [RCTs]) are
highly regarded in medicine because, when based on correct premises, properly
designed, and adequately powered, they are likely to yield accurate
conclusions [19]; in contrast, conclusions drawn from hypothesis-generating
studies, even when well designed, are more tentative than those of
hypothesis-testing studies due to the myriad of explanations (hypotheses) one
can infer from the observation of a phenomenon.
For these reasons, hypothesis-generating studies are appropriately regarded
as exploratory in nature. These differences notwithstanding, there is general
consensus that hypothesis-testing and hypothesis-generating activities both
are vital aspects of the research process. Indeed, the latter are the crucial
initial steps for making discoveries in medicine. As Andersen [20] has
correctly argued, without hypothesis-generating activities, there would be no
hypotheses to test and the body of theory and knowledge would stagnate. The
critical role of the hypothesis in the research process and the logical
issues entailed in formulating and testing them are further discussed in
Chap. 3.

Retrospective Versus Prospective Research

Research often is classified as retrospective or prospective. However, as
pointed out by Catherine DeAngelis, former editor-in-chief of the Journal of
the American Medical Association (JAMA), these terms are among the most
frequently misunderstood in research [21] in part because they are used in
different ways by different workers in the field and because some forms of
research do not neatly fall within this dichotomy. Many methodologists [22,
23] consider research to be retrospective when data (typically recorded for
purposes other than research) are generated prior to initiation of the study
and to be prospective when data are collected starting with or subsequent to
initiation of the study. Others, including
DeAngelis, prefer to distinguish retrospective from prospective research
according to the investigator's and subjects' orientation in the data
acquisition process. According to the latter view, a study is retrospective
if subjects are initially identified and classified on the basis of an
outcome (e.g., a disease, mortality, or other event) and are followed
backward in time to determine the relation of the outcome to exposure to one
or more risk factors (genetic, biological, environmental, or behavioral);
conversely, the study is prospective if it begins by identifying and
classifying subjects on the basis of the exposure (even if the exposure
preceded the investigation), with outcome(s) observed at a later point in
time [21].
There are various types of retrospective studies. The simplest (and least
credible from the standpoint of scientific evidence) is the case study (or
case report), which typically provides instructive, albeit anecdotal,
information about unusual symptoms not previously observed in a medical
condition or new combinations of conditions within a single individual [24].
The case series (or clinical series) is an uncontrolled study that furnishes
information about exposures, outcomes, and other variables of interest among
multiple similar cases. Though lack of control precludes evaluation of cause
and effect, this type of study can provide useful information about unusual
presentations or infrequently occurring diseases and can be used to generate
hypotheses for testing, using more rigorous studies [24]. The most common
type of retrospective research used to draw inferences about the relation of
prior exposures to diseases (and the most rigorous of the various
retrospective designs) is the case-control study. In this type of
investigation, a group of individuals who are positive for a disease state
(e.g., lung cancer) is compared with a group comprised of those who are
negative for that disease state (e.g., free of lung cancer). By looking back
at the medical record, we attempt to determine differences in risk factors
(e.g., prior exposure to cigarette smoke or asbestos) that may account for
the disease. Because of the temporal sequence and interval between the factor
and the outcome variable and the availability of a comparison group (e.g.,
nondiseased subjects), the case-control study can be used to infer cause and
effect associations, though various biases (discussed in depth in Chap. 4)
may limit its value for this purpose.
The two most typical examples of prospective research in clinical medicine
are observational cohort and experimental studies. In an observational cohort
study, subjects within a defined group who share a common attribute of
interest (e.g., newly diagnosed cardiac patients, new dialysis patients) who
are free of some outcome of interest are identified on the basis of exposure
to risk factors whose presence or absence is outside the control of the
investigator. These individuals are followed over time until the occurrence
of an outcome (or outcomes) that usually (but not always) is measured at a
later date. In an experimental study, outcomes also are assessed at a later
date, but subjects initially are differentiated according to their exposure
to one or more interventions which have been purposively applied. (Further
distinctions between observational and experimental studies are discussed
below.)
Prospective research is less prevalent in the literature than retrospective
research principally due to its relatively greater cost. In most prospective
studies, the investigator must invest the time and resources to follow
subjects and sometimes even apply an intervention if the study is
experimental. Moreover, prospective studies usually require larger sample
sizes. Why, then, would anyone choose a prospective design over a
retrospective approach? One reason is that prospective studies (particularly
RCTs and concurrent cohort studies, described below) potentially have more
control over temporal sequence and extraneous factors, including selection
and recall bias, although loss to follow-up can be problematic. Second,
prospective designs are more appropriate than retrospective designs for rare
exposures and relatively more common outcomes. Finally, if it is desired that
the exposure be manipulated by the investigator, as in an experimental study,
the relation between exposure and outcome can be evaluated only with a
prospective design.
In many prospective studies (all RCTs, many cohort studies), the exposure
takes place coincident with or following the initiation of the study.
Fig. 1.3 Concurrent versus nonconcurrent prospective research (Reprinted with permission from [21])
This type of prospective research has been termed concurrent [25, 26] because
the investigator moves along in parallel with the research process (i.e.,
from application or assessment of the exposure to ascertainment of the
outcomes associated with the exposure). In other instances, the exposure and
even the outcomes will have taken place in the past, i.e., before the
investigator's involvement in the study. If the logic of the study is to
follow subjects from exposure to outcome, the research may be termed a
nonconcurrent prospective study [25, 26], a historical cohort study, or a
retrospective cohort study (departing from the view of prospective research
held by DeAngelis and others). These distinctions are shown in Fig. 1.3.

Longitudinal Versus Cross-Sectional Research

As noted above, prospective studies sample members of a defined group at a
common starting point (e.g., exposure to a putative risk factor or
intervention) and follow them forward in time until the occurrence of a
specified outcome (e.g., a disease state or event), whereas retrospective
studies begin with existing cases and look back in time at the history of the
subject to identify relevant exposures or other instructive trends. Both are
examples of longitudinal research because subjects are examined on multiple
occasions that are separated in time.
Not all studies have defined temporal windows between putative risk factors
and outcomes. One that does not is the cross-sectional (or prevalence) study.
With this approach, several variables are measured at the same point in time
to determine their frequency and/or possible association within a group of
individuals who are selected without regard to exposure or disease status.
Such studies are usually based on data collected in the past for other
purposes but can be based on information acquired de novo. When used with
large representative samples (to permit valid generalizations),
cross-sectional studies can
provide useful information about the prevalence of risk factors, disease states, and
health-related knowledge, attitudes, and behaviors in a specified population.
Cross-sectional studies are prevalent in the literature principally because they are
relatively quick, easy to conduct, and can be used to evaluate multiple associations.
However, unlike the case-control study, where temporality between risk factor and outcome
variables can be established (or at least inferred) in order to buttress a cause and effect
relationship, cross-sectional studies are best suited for generating, rather than testing,
such hypotheses [23].

Descriptive Versus Analytic Research

Research can be further subdivided into descriptive and analytical subtypes. In descriptive
studies, the presence and distribution of characteristics (e.g., health events or problems)
of a single group of subjects are examined and summarized (but are not intervened upon or
otherwise modified) to determine who, how, and when they were affected and the magnitude of
these effects. Descriptive studies can involve a single case or a large population. Though
they are considered to be among the simplest types of investigation, they can yield
fundamental information about an individual or group that is of importance when little is
known about the subject in question. Modes of data collection for descriptive studies are
primarily observational and include survey methods, objective assessments of physiological
measures, and review of historical records. Methods of analysis include computation of
descriptive statistics such as measures of central tendency and dispersion (quantitative
studies) and verbal descriptions and content analysis (qualitative studies) [27]. Because
descriptive studies contain no reference groups, they cannot be used to test hypotheses
about cause and effect; however, they can be useful for hypothesis generation, thus
providing the foundation for future analytic studies. Descriptive studies may be either
retrospective or prospective. Retrospective descriptive studies include the single case
study and case series formats. Prospective descriptive studies include natural history
investigations that follow individual subjects or groups over time to determine changes in
parameters of interest.

While descriptive studies attempt to examine what types of problems exist in a population,
analytic studies attempt to determine how or why these problems came to be. Thus, the
ultimate goal of analytic studies is to test prestated hypotheses about risk factors or
interventions versus outcomes to elucidate causality. Analytic studies can be performed with
two or more equivalent or matched comparison groups, in which case inferences are drawn on
the basis of analysis of intergroup differences (comparative research) or by comparisons
within a single group in which assessments are made over time before and after imposition
of an intervention or a naturally occurring event. Analytic research can be retrospective
(e.g., case-control studies) or prospective (e.g., observational cohort or experimental
studies). Correlational analysis of cross-sectional data is classified as analytic by some
[28] but not all [22] workers in the field.

Observational Versus Experimental Research

In this dichotomy, research is differentiated by the amount of control that the
investigator has over the factors in the study by which the outcome variables are compared.
In observational studies, the investigator is passive with respect to the factors of
interest as these usually are naturally occurring risk factors or exposures outside of the
investigator's control. He or she can identify them and measure them but cannot allocate
subjects to treatment groups or deliberately manipulate a treatment to systematically study
its effect. The investigator's sole responsibility is to select a design which can validly
assess the impact of the risk factor on the outcome variable. In contrast, in experimental
studies, the input of interest not only is measured or observed but is purposively applied
by the investigator, who manipulates events by arranging for the intervention to occur
or, at the very least, arranges for random allocation of subjects to alternative treatment
or control groups. As a consequence, most of the inherent differences that exist between
comparison groups are minimized, if not eliminated, thereby providing greater capacity to
determine cause and effect relationships between the intervention and the outcome. Unlike
observational studies, which can either be prospective or retrospective, experimental
studies, as noted earlier, always are prospective. Midway between observational and
experimental studies is a methodology known as quasi-experimental research. With this
approach, the investigator evaluates the impact of an intervention (e.g., a therapeutic
agent, policy, program, etc.) which has been applied either to an entire population or to
one or more subgroups on a nonrandom basis. Although he or she may have been directly
involved in arranging the intervention, control is nonetheless suboptimal due to
limitations in the quality of reference data; as such, inferences drawn from
quasi-experimental studies, while stronger than those generated with purely observational
data, are less robust than those drawn from true experimental investigations.
Characteristics of the true experimental and quasi-experimental approaches are detailed
more fully in Chap. 5.

Quantitative Versus Qualitative Research

Finally, research also can be differentiated according to whether the information sought is
collected quantitatively or qualitatively. Quantitative research involves measurement of
parameters (e.g., demographic, functional, geometric, or physiological characteristics;
mortality, morbidity, and other outcome data; attitudes, knowledge, and behaviors) that
have been obtained under standardized conditions by structured or semi-structured
instrumentation and that may be subjected to formal statistical analysis. Typically,
numerous subjects are studied and the investigator's contact with them is relatively brief
and minimally interactive to avoid introduction of bias. In contrast, qualitative research
gathers information about how phenomena are experienced by individuals or groups of
individuals (and the context of these experiences) based on open-ended (unstructured)
interviews, questionnaires, observation, and focus group methodology. Fewer subjects are
studied than with quantitative research, but the investigator's contact with them is longer
and more interactive. As Portney and Watkins [29] have noted, quantitative methods can be
used across the continuum of research approaches to describe, generate, and test
hypotheses, whereas qualitative methods typically are used for descriptive or exploratory
(hypothesis-generating) research. Quantitative and qualitative research each subsumes many
different methodologies.

Steps in the Research Process

As mentioned earlier, research is structured by a series of methodological rules which
govern the nature and order of procedures used in the investigation. It is, therefore,
necessary that a plan be developed prior to the study which incorporates these procedures.
This is true, irrespective of the type of research involved. The following is a brief
listing of the steps, identified by DeAngelis [21], which comprise the research process in
general and the hypothetico-deductive approach in particular:

In the first stages of the project, the investigator will:
1. Identify the problem area or question.
2. Optimally restate the question as a hypothesis.
3. Review the published literature and other information resources, including meeting
   abstracts and databases of funded resource summaries or blogs, to determine whether the
   hypothesis has been adequately evaluated or is in need of further study.
Prior to developing the research design, he/she will:
4. Identify all relevant study variables, knowledge of whose presence, absence, change, or
   interrelationship is the objective of the study.
In order to bring precision to the research, he/she will:
5. Construct operational definitions of all variables.
6. Develop a research design and analytic plan to test the hypothesis. The design will
   identify the nature and number of subjects from whom data will be obtained, the timing
   and sequence of measurements, and the presence or absence of comparison groups or other
   procedures for controlling bias. The analytic plan will define the statistical
   procedures to be performed on the data and must be prespecified to minimize the
   likelihood of reaching spurious conclusions.
7. If data collection instruments are available, they must be specified. If not, they must
   be constructed. (Data collection instruments include all tools used to collect relevant
   observations in the study such as physiological measurements, questionnaires,
   interviews, and case report forms, to name a few.)
8. A data collection plan, containing provisions for accrual of subjects and for recording
   and management of data, must be designed.
Only after these important preparatory steps have been taken should the investigator
proceed to:
9. Collect and process the data.
10. Conduct statistical analysis to describe the dataset and test hypotheses.
11. After the data are analyzed, conclusions are drawn and these are related to the
    problem statement and/or hypotheses.
12. Finally, the research report is written and, if accepted after peer review, is
    presented and/or published as a journal article.

The importance of following a research plan was addressed by Marks [30], who described a
number of typical planning errors and their negative consequences. To cite one example,
Marks detailed the experience of an investigator who failed to receive renewal of his
multiyear research grant because he could not report the results of the data analysis to
the granting agency. This occurred because he failed to develop a mechanism for the
storage, handling, and analysis of data. Due to staffing changes and other factors, some of
the data were lost, and what was located had not been recorded uniformly. As a result,
years of hard work were wasted. In a second example, addressing scheduling problems, Marks
describes the failure of an investigator, studying the effects of a drug developed for
patients undergoing elective coronary artery bypass grafting, to complete his research
project within his specified time frame. Though the investigator had the foresight to
calculate his required sample size and to estimate patient accrual rates, he made the
mistake of allowing only 4 months to study 30 patients. Much to his chagrin, a poorly
worded consent form submitted to his institutional review board (IRB) delayed him
approximately 6 weeks and, by then, the number of nonemergency operations had dropped
dramatically due to the winter holidays. After 4 months, only a quarter of his sample had
been accrued, and no data analysis had been performed.

Other common problems associated with poor planning include inability to implement or
complete a study (due to disregard of organizational, political, or ethical factors), loss
of statistical power to confirm hypotheses (due to inadequate attention to patient accrual
factors, attrition of subjects, or excess variability in the study population), ambiguity
of findings (due to lack of operational definitions or nonuniformity of data collection
procedures), and unsound conclusions brought about by weak research designs, among others.

Marks' vignettes about the adverse consequences of poor research planning depict errors
that unfortunately are not uncommon. A number of years ago, in this author's first position
as a research director (at an institution that I shall decline to name), I was asked to
implement a research project, previously designed by a principal investigator (PI) who was
senior to me at the time. The purpose of the project was to evaluate the impact of an
in-hospital patient education program after a first myocardial infarction. Four hospitals
were involved in the study: two intervention sites and two controls (business as usual). In
the first phase, patients at Hospital A received the new educational program and patients
in Hospital B did not. In the replication
phase, patients at Hospital C received the new intervention and patients at Hospital D did
not. The instrument chosen to evaluate depression was the Beck Depression Scale and the
instrument chosen to evaluate anxiety was the State-Trait Anxiety Scale. The study design
compared responses before and after the educational program by site. Being schooled in
psychometrics, I was concerned about the reliability and validity of these instruments for
this population but was told that these had been extensively used and previously validated
in other patient populations. I also had concerns about the quality of the experiences that
patients were receiving at the control hospitals but was told that for political reasons,
we could not ask too many questions. Additionally, I had concerns about the implementation
of the educational intervention but was told that this was firmly under the control of the
nurse coordinator. I next argued for a pilot before launching this very costly and lengthy
research project but was told that there was no time and that the PI did not wish to waste
patients.

And so the intervention proceeded according to protocol for well over 2 years. No interim
analysis ever was performed because the PI thought that would be too expensive and waste
time. When the primary data finally were analyzed, there were no detectable differences
whatsoever between the outcomes obtained in the experimental versus control hospitals. The
PI was horrified and did not understand how this could have happened. When the process data
were analyzed post hoc, we learned that, due to staffing problems at the experimental
sites, many nurses who were entrusted to implement the educational intervention had
attended few, if any, in-service sessions about the intervention. Moreover, even though the
new intervention had a beautifully designed curriculum that had been packaged in a glossy
binder, it became known only after the fact that quality patient education also had taken
place at Control Hospital B, and we never knew what was done at Control Hospital D, again,
for political reasons.

A final problem concerned the instrumentation. Though, in fact, both the Beck Depression
and State-Trait Anxiety Scales had been validated, the validation had not been performed on
patients shortly after an acute myocardial infarction. An analysis of baseline scores
revealed that most patients were neither depressed nor anxious, apparently due to the
unanticipated effects of sedation or denial. Thus, low scores on these primary measures
(which clearly were administered too soon after the index event) could not possibly improve
due to what are called floor effects. Needless to say, the private foundation that funded
this study was less than thrilled, and none of you have ever seen it in published form.
Examples like these abound in research but usually are not reflected in the literature
because aborted or incomplete research investigations are never published, and those
failing to demonstrate statistically significant differences (or associations) are
published far less often than those that do, a phenomenon known as publication bias [31],
further discussed in Chap. 9.

A number of years ago, a pediatric emergency fellow at another area hospital approached me
for assistance with a dataset that she had compiled over a 4-month period. The data
profiled the presenting complaints, diagnoses, and disposition of a series of children who
had presented to an emergency room after having complained of largely nonserious illnesses
during school. I asked her for a copy of her protocol, but she told me that she did not
have one because her study was a chart review, based on de-identified anonymous data and,
therefore, was IRB Exempt. I next asked her for a written copy of her research plan to
which she responded, "I never developed one because my clinical mentor told me that it
wasn't necessary, and I didn't know that I needed one." I asked her what schools the
children had come from and who had made the decision to bring them to the emergency room,
but she couldn't answer these questions because that information was not routinely included
in the medical chart, which was the source of all of her data. I asked
her why she had selected a retrospective chart review as her study design, and she answered
that the charts were readily available and that she hadn't thought about any other
approach. I asked her why she thought the research study was worth doing, to which she
responded, "I'm not sure, but maybe the data will encourage emergency physicians to better
counsel parents and school officials who refer relatively healthy children to the emergency
room and, thus, cut down on inappropriate visits."

Feeling sorry for her, I helped her to sort out whatever data that she had, and to write an
abstract and manuscript that appeared to be respectable, at least superficially. The
abstract was accepted at an international meeting (which had somewhat less stringent
standards than domestic meetings in her specialty), but when she submitted her manuscript
for publication in an academic journal, it was rejected. The reviewers correctly argued
that without knowing who made the decision to bring the child to the emergency room, the
study had failed its primary objective, which was to furnish information that potentially
could alter decision-making patterns for this patient population. Had the fellow developed
a proper research plan in the first place, she would have better conceptualized her study
and saved months of her time on what was essentially a fruitless undertaking.

The moral posed by these stories is that adequate planning is vital for achieving research
objectives and for minimizing the risk of wasting time and resources. As Marks correctly
argues, "The success of a research project depends on how well thought out a project is and
how potential problems have been identified and resolved before data collection begins"
[30].

In subsequent chapters, we will consider many of the fundamental concepts, principles, and
issues involved in planning and implementing a well-designed study. It is hoped that
awareness of these factors will help you to achieve your research objectives, minimize your
risk of wasting time and resources, and result in a more rewarding research experience.
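The scheduling vignette recounted earlier in this chapter lends itself to a simple
back-of-the-envelope check of the kind a written plan might include. The sketch below is
offered only as an illustration and is not a method described by Marks. The sample size
(30 patients), the 4-month window, and the 6-week IRB delay are taken from the vignette;
the function name, the assumption of a constant weekly accrual rate, and the conversion of
months to weeks are hypothetical choices made for this example.

```python
# Illustrative accrual timeline check (hypothetical helper, not from the source).
WEEKS_PER_MONTH = 52 / 12  # assumption: average month length expressed in weeks


def weeks_to_accrue(target_n, weekly_rate, startup_delay_weeks=0.0):
    """Estimate calendar weeks needed to enroll target_n subjects.

    Assumes a constant accrual rate (subjects per week) once enrollment starts,
    and that no subject can be enrolled during the start-up delay.
    """
    if target_n < 0:
        raise ValueError("target_n must be non-negative")
    if weekly_rate <= 0:
        raise ValueError("weekly_rate must be positive")
    return startup_delay_weeks + target_n / weekly_rate


# Figures taken from the vignette.
target_n = 30
allowed_weeks = 4 * WEEKS_PER_MONTH
irb_delay_weeks = 6

# Assumption: the accrual rate that a 4-month plan for 30 patients implies.
implied_rate = target_n / allowed_weeks

planned_weeks = weeks_to_accrue(target_n, implied_rate)
delayed_weeks = weeks_to_accrue(target_n, implied_rate, irb_delay_weeks)
overrun_weeks = delayed_weeks - allowed_weeks

print(f"Planned duration: {planned_weeks:.1f} weeks")
print(f"With IRB delay:   {delayed_weeks:.1f} weeks")
print(f"Overrun:          {overrun_weeks:.1f} weeks")
```

Under these assumptions the plan contains no slack: any start-up delay produces an overrun
of equal length, even before a seasonal drop in accrual is taken into account.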
Take-Home Points
P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology: A Guide for Clinical Investigators,
DOI 10.1007/978-1-4614-3360-6_2, © Phyllis G. Supino and Jeffrey S. Borer 2012
16 P.G. Supino and H.A.B. Epstein
algorithms and diagnostic modalities for differentiating symptoms of myocardial ischemia
from symptoms that mimic ischemia? When should such patients be medically managed and when
should they undergo invasive therapeutic procedures? What is the risk-benefit ratio of
percutaneous coronary angioplasty vs. coronary artery bypass grafting? How often and how
should patients undergoing these procedures be evaluated after intervention? What
patient-level, societal, and economic factors influence these decisions? Issues such as
these have enormous public health implications and have spawned hundreds of research
studies.

Research problems also can be generated from observations collected in conjunction with
medical procedures [2]. A radiologist might have a set of interesting data collected in
conjunction with a new imaging modality (e.g., full-field digital mammography) and might
wish to know how much more sensitive and specific this new modality is vs. older technology
for breast cancer screening. Alternatively, he might be interested in a new application of
an existing modality. A thoracic surgeon may have outcomes data available from two
competing surgical techniques. The process of critically thinking about these data, sharing
them with colleagues, and obtaining their feedback can lead to interesting questions for
analysis and stimulate additional research.

Another source of research problems is the published scientific literature, where an
observed exception to the findings of past research or accepted theory, unresolved
discrepancies between studies, or a general paucity of quality data on a clinically
significant topic can motivate thinking and point to an opportunity for future study. In
addition, most well-crafted manuscripts typically document limitations in the investigation
(e.g., potential selection bias, inadequate sample size, low number of endpoint events,
loss to follow-up) and may suggest areas for future research. Thus, thoughtful review of
published research can point to gaps in knowledge that potentially could be filled by new
investigations designed to refine or extend previous research.

Research problems also can be suggested by governmental and private funding agencies which
publish requests for proposals (RFPs) or applications (RFAs) to address understudied areas
affecting the public health. These publications will explicitly identify a problem that the
agency would like an investigator to address, provide a background and context for the
problem, stipulate a study population (and, on occasion, specify the approach to be taken),
and indicate the level of support offered to the potential investigator.

Finally, research problems can be fostered by environments that stimulate an open
interchange of ideas. These environments include scientific sessions conducted by
professional societies and organizations, grand rounds given at hospitals and medical
schools, and other conferences and seminars. In recent years, methodological approaches
such as brainstorming, Delphi methods, and nominal group techniques [3-5] have been
developed and sometimes are utilized to facilitate the rapid generation (and
prioritization) of research problems by individuals and groups.

Characteristics of Well-Conceived Research Problems

Although the genesis of a research problem is a complex, variable, and inherently
unpredictable process, fortunately, there are generally agreed-upon criteria, described
below, for evaluating the merits of the problem once it has been generated [6-8]. Attention
to these at the outset will ensure a solid footing for the remainder of the investigation.

The Problem Should Be Important

The most significant characteristic of a good research problem is importance. A clinical
research problem is considered important if its resolution has the potential to clarify a
significant issue affecting the public health and, ultimately, cause the clinician (or
health-care policy maker) to make a decision or undertake an action that he or she would
not have made or undertaken had the problem not been addressed. The greater the
2 Developing a Research Problem 17
need for clarification and the larger the number of individuals potentially impacted (i.e.,
the greater the disease burden), the more important the problem. For this reason, when
research proposals are submitted to a funding agency or when research manuscripts are
submitted to a journal for publication, perceived importance of the problem is heavily
weighted during the peer-review process. Indeed, importance of the problem typically
overshadows other criticisms such as incomplete consideration of the literature, suboptimal
methodology, and poor writing style, as these flaws often can be remedied. Studies that
merely replicate other studies, with no significant alteration in methods, content, or
population (or that reflect only a minor incremental advance over previous information) are
considered unimportant and tend to fare poorly in the peer-review process. This is true
even if the study is well designed. This point is illustrated below by the divergent
comments actually made by a reviewer in response to two different manuscripts submitted for
publication to a cardiology journal:

Manuscript #1: "This is a superb contribution which adds importantly to our knowledge about
the pathophysiology of heart failure. The results of this well-focused study are of great
clinical importance." (Recommendation: Accept)

Manuscript #2: Comment: "Despite a great deal of very precise and laborious effort and the
generation of an extraordinary mass of numbers ... little forethought was given to the
focus or importance of the questions to be asked ... . The finding is not unexpected,
having been suggested by several earlier studies which have evaluated the issue of regional
performance in different ways ... (Thus,) the authors' observations add little that is
important or useful to the currently available literature." (Recommendation: Reject)

Evaluating the importance of a research problem requires considerable knowledge of and
experience in the discipline. For this reason, the new investigator should seek the
assistance of mentors and other experts early on to maximize the likelihood that the
proposed research will be fruitful.

The Problem Should Be Interesting

As Hulley and Cummings have noted, a good research problem, especially if suggested by
someone else, must be interesting to the investigator to provide the intensity of effort
needed for overcoming the many hurdles and frustrations of the research process [7]. It
also should be interesting to:
• The investigator's peers and associates to attract collaborators
• Senior scientists at the investigator's institution who can provide necessary mentorial
  support to guide the study (if the investigator is relatively junior)
• Potential sponsors to motivate them to fund the study (if outside funding is sought)
• Fellow researchers within the larger scientific community who, ultimately, will read and
  judge its findings
• Individuals outside the scientific community (e.g., clinicians in private practice,
  policy makers, the popular media, and consumers) who, optimally, will consider,
  disseminate, and/or utilize the eventual products of the research (if the problem is
  applied or translational in nature)

Gauging the potential interest of a research problem is difficult because, as Shugan has
noted, "no research findings are innately interesting. Instead, they are interesting only
relative to a particular audience within some context that they define" [9]. While research
can be interesting simply because it is new, in general, a research problem will tend to be
viewed as noteworthy if it impacts a wide audience, has the potential to cause significant
change in what members of that audience will do [9] (i.e., has importance), and is clearly
framed within the context of a current hot-button issue (or an older but nonetheless viable
issue). Before investing substantial time pursuing a research problem, it is advisable that
new researchers check with their mentors and/or other experienced investigators with broad
insights into the general area of inquiry to confirm that the problem satisfies these
criteria and, thus, is likely to be interesting to others [10].
similar vein, questions soliciting opinions (e.g., "what should be done to improve the
health of a specific population?") and value-laden questions such as "should terminally ill
comatose patients be disconnected from life support?" certainly are important and make
excellent subjects for argument. However, they (like any question including the word
"should") are not always assessable empirically and may require special methods for data
gathering (e.g., qualitative techniques).

The problem also should be feasible on a practical level [16]. An investigator must decide,
early on, if he or she has the resources to address it within a realistic time frame and at
a reasonable cost. A primary determinant of feasibility is the scope of the proposed
problem. In planning a research study, it is important to avoid selecting a problem that is
too broad because a single investigation cannot possibly provide all relevant information
about a problem. The process of identifying the problem can raise ancillary questions that
may be of interest to the investigator, but it is important to prioritize these and reserve
some for future research so that the time and resources of the investigator are not
strained. An axiom in research planning is that it is better to provide quality answers to
a small number of questions than to provide inferior information in volume. For example,
should an investigator wish to study the effect of drug therapy on patients with heart
disease, the question "What is the effect of drug therapy on patients with poor heart
function?", while conceptually interesting and clinically important, is much too broad for
one study and, in fact, would require hundreds of investigations to answer adequately. The
investigator would do well to narrow the problem to include a given class of drugs (e.g.,
adrenal steroids), a specific index of heart function (e.g., left ventricular performance),
and a specific population (e.g., patients with chronic severe aortic regurgitation). On the
other hand, the problem should not be too narrowly defined. A question such as "what are
the effects of Inderal on the change in ejection fraction from rest to exercise in
75-year-old Queens residents?" probably would result in a criticism of the study as trivial.

The scope of a study can be gauged by the number of subproblems (discrete areas of inquiry
within the investigation) needed to express the main problem. If the number of subproblems
exceeds six, there is high likelihood that the problem is too broad. In contrast, if an
investigator is unable to define a minimum of two subproblems, it may be too narrow [17].

The issue of scope of the problem has direct practical implications for the researcher.
Even if the problem is important and empirically testable, the investigator must balance
these factors against the cost of doing the research. Long before data are collected, the
researcher must decide whether he or she has the time or resources to collect and analyze
the data.

Factors affecting time include:
• The interval needed for subject accrual
• The time involved in administering the intervention (if the research is experimental)
• The time involved in collecting data on inputs such as risk factors (if the research is
  observational)
• The time involved in assessing outcome

Factors potentially affecting resources include:
• Costs of accruing and managing subjects (purchasing and housing of animals for a
  preclinical study, reimbursing human subjects for participation in a clinical research
  study)
• Cost of the intervention (if any)
• Costs of measurement procedures
• Cost of data collection, processing, and analysis
• Costs of equipment, supplies, and travel
• Technical expertise (the investigator's own research skills or access to skilled
  collaborators or consultants)

One way an investigator can determine feasibility is by conducting a pilot study. A pilot
study (sometimes called a feasibility study) typically attempts to determine whether it is
possible to address the research problem (or subproblems) under conditions approximating
those of the larger, proposed study but with a smaller number of subjects over an
abbreviated period of time. The pilot can provide information about the complexities of
patient recruitment and the appropriateness of data
20 P.G. Supino and H.A.B. Epstein
• Does application of dental sealants actually prevent the development of tooth decay?
• Have current local and global interventions and services reduced the transmission and acquisition of HIV infection?
Questions of most interest to clinicians, however, typically center on issues related to the clinical management of patients with known or suspected diseases. Borrowing from an evidence-based practice framework, these can be subcategorized as questions about screening/diagnosis, treatment, prognosis, etiology, or harm (from treatment) [21]. Examples are given below:
• What is the most cost-effective way to differentiate children who are at risk for developmental delays from those who are not? (screening)
• What are the sensitivity, specificity, and positive and negative predictive values of positron emission tomography [PET] among women with suspected coronary artery disease? What is the diagnostic accuracy of PET vs. other available tests such as thallium scintigraphy? (diagnosis)
• What is the best (most effective, tolerable, cost-effective) currently available chemotherapy regimen for acute myeloid leukemia? (treatment)
• Is combination therapy better than single agent therapy for benign prostatic hypertrophy? (treatment)
• What is the probable clinical course of patients with aortic stenosis? (prognosis)
• Which patients with chronic, severe aortic regurgitation progress most rapidly to surgical indications? (prognosis)
• Is autoimmunity causally related to the development of Crohn's disease? Is it also implicated in the development of lupus and rheumatic arthritis? (etiology)
• Do enzymes involved in the synthesis of the extracellular matrix play a role in the development of fibrotic diseases and cancer? (etiology)
• What is the magnitude of risk for adverse outcome of carotid endarterectomy among the elderly? (harm)
• What is the in-hospital mortality associated with valvular replacement? Is it greater with concomitant coronary artery bypass grafting? (harm)

Role of the Literature Search

Even if the research problem was sparked by previously published research, once its basic elements have been defined, it is necessary to conduct a comprehensive search of the literature to acquire a thorough knowledge of relevant earlier findings, ongoing research, or new theories. Although there is no set rule governing the optimal time frame for a literature search or the number of publications to be included, there is general consensus that the search should be of sufficient length and breadth to include existing pertinent seminal and landmark studies [22] as well as current studies in the field (i.e., those conducted within the past 10 years). A proper literature search will help the investigator to determine answers to the following questions:
• Has the problem been previously addressed? If so, was it adequately studied?
• Are the proposed hypotheses, if any, supported by current theory or knowledge?
• Does the methodology cited in the literature provide guidance on available instrumentation for measuring variables?
• Are the results of prior studies informative for calculation of sample size and power?
• Did previous investigators describe the limitations of their research or suggest areas for future study?
Seeking answers to these questions early in the planning process will enable the investigator to determine whether performance of the present study is feasible, whether it is likely to significantly contribute to the existing knowledge base (thus supporting the need for the study), and also whether it may provide guidance on the construction of hypotheses and choice of study design. In addition, creating an automatic search profile early in the planning process will keep the investigator informed about the latest research related to his or her problem. The search profile will
generate updated lists of new literature and provide alerts to these updates via e-mail or RSS feed on a daily, weekly, or monthly basis, as desired. The updates also can be used to alert the investigator to research performed by other investigators and provide an opportunity for collaboration.
Like other aspects of a research project, the performance of a proper literature search requires a significant investment of time and effort. This is true in part because the results of most scientific investigations (particularly those reflecting recent work or primary literature) are dispersed over a myriad of e-mail communications, meeting abstracts, web documents, and periodicals, rather than organized collectively in books or other single sources of research. Traditionally, if an investigator needed to learn more about earlier related work, he or she would begin by examining key references cited in known relevant published studies. Today, continuing this principle of "it only takes one good article to get you going," online systems like PubMed from the National Library of Medicine, ISI Web of Knowledge from Thomson Reuters, the EBSCOhost family of databases from EBSCO Publishing, and the databases of Ovid Technologies, Wolters Kluwer Health, and Google Scholar generate a list of possible important citations and invite you to click on the "related articles" link or "times cited" link to find similarly indexed papers or cited references from these papers to locate additional relevant citations. A summary of selected core online resources is provided in Table 2.1.
Most investigators will choose to search MEDLINE, the premier bibliographic database from the National Library of Medicine. It is available by searching PubMed, ISI Web of Knowledge, EBSCOhost, and Ovid plus many other free or fee-based searching systems. The database covers the life sciences with a concentration in biomedicine. Bibliographic citations with author abstracts and linking to full text of many articles come from more than 5,400 biomedical journals published in the USA and around the world. Most citations are written in English with English abstracts. MEDLINE contains over 21 million citations dating back to the mid 1940s. For more information about PubMed, see www.pubmed.gov. Many of the MEDLINE citations in PubMed link to the Gene, Nucleotide, and Protein databases from the National Center for Biotechnology Information (NCBI) for coverage of molecular biology. Google Scholar pulls in freely available scholarly literature from PubMed and other sources, with some linking to the full text of the articles.
MEDLINE may not provide adequate information about a research problem. Thus, many investigators consider searching EMBASE in lieu of or in addition to MEDLINE (which now is included within EMBASE). EMBASE is created by Excerpta Medica and produced by Elsevier. One can subscribe to it individually from Elsevier or through Ovid from Wolters Kluwer Health in three separate databases: EMBASE, EMBASE Drugs and Pharmacology, and EMBASE Psychiatry. There are over 24 million indexed records from more than 7,500 current, mostly peer-reviewed journals covering biomedical and pharmacological literature. In addition, there is extensive coverage of meeting abstracts. Like MeSH from MEDLINE, EMBASE uses a hierarchical classification of subject headings called EMTREE that can be expanded. EMBASE can be searched with significant words, significant phrases, and EMTREE terms. Links to full text of the journal articles are available from many medical libraries.
An investigator may also consider searching BIOSIS Previews, Biological Abstracts, and Zoological Record together as a package from ISI Web of Knowledge, a product of Thomson Scientific. This resource represents a comprehensive index to the life sciences and biomedical research, including meeting abstracts, journals, books and patents, and contains more than 18 million records taken from more than 5,000 international resources from 90 countries (1926 to present). BIOSIS Previews is available by searching the Ovid suite of databases and ISI Web of Knowledge.
Web of Science's Science Citation Index Expanded, part of ISI Web of Knowledge from Thomson Reuters, covers scientific literature from 1900 to present. An investigator can search
this resource by subject topics and keywords. The citation display features a summary abstract, a bibliography, and publications that have cited that paper. As with many systems today, full text of the paper as well as related article citations also may be linked. A citation map can be generated to visually display for two generations the references in the bibliography and cited papers.
If the investigator is interested in behavioral science research, the American Psychological Association offers a suite of databases: PsycINFO, PsycARTICLES, PsycBOOKS, PsycCritiques, and PsycEXTRA. Information can be found on psychology and related disciplines (e.g., psychiatry, nursing, neuroscience, law, education, sociology, social work). Available in a variety of formats (e.g., journal articles, books or book chapters, dissertations, technical and annual reports, government reports, conference presentations, consumer brochures, magazines, among others), PsycINFO can be searched with words, phrases, and terms from the Psyc thesaurus. Like MeSH, the terms are arranged in alphabetical and hierarchical order.
Web of Science's Social Science Citation Index can be explored by those interested in social sciences research. Almost 2,500 journals are indexed, representing 50 social science and related disciplines, including anthropology, urban studies, industrial relations, law, linguistics, substance abuse, public health, and information and library sciences, among others. Like Science Citation Index, the citation display features a summary abstract, bibliography, and publications that have cited the paper; full text of the paper and related article citations also may be linked. This database also can be searched with words and phrases.
The EBSCOhost family of databases covers the humanities and social sciences. It also includes CINAHL (Cumulative Index to Nursing and Allied Health Literature). This database provides indexing for nearly 3,000 journals from the fields of nursing and allied health, including librarianship, and contains more than 2.2 million records dating back to 1981. Like MEDLINE, EMBASE, and PsycINFO, one searches CINAHL with significant words and phrases as well as CINAHL descriptors that can be expanded. Searchers can add citations to a folder, permitting them to be printed, e-mailed, or saved. Also, like other databases, CINAHL links to cited references.
Finally, for those seeking the latest information on evidence-based health care, the Cochrane Library is an excellent source of systematic reviews (discussed in depth in Chap. 9), RCTs, and health technology and economic assessments. It is produced by the Cochrane Collaboration, a worldwide effort dedicated to systematically reviewing the effectiveness of health-care interventions, and is available from Wiley and Wolters Kluwer Health via Ovid. Though the Cochrane Library can be searched with words, phrases, and MeSH descriptors, its central database of randomized trials is extensive (mandating a more precise searching strategy), whereas its database of systematic reviews contains fewer than 5,000 elements (requiring a broader search strategy). If the searcher is able to identify a systematic review that contains a reasonable number of trials from which valid and consistent inferences have been drawn, it may provide most of the literature needed to support a research project.
Although web-based bibliographic programs have become increasingly user-friendly by encouraging the searcher to place significant words, phrases, and database subject terms in a search box, the search process itself remains a combination of science and art which requires practice and patience. In view of this, some investigators may opt to complete an online tutorial, sign on to a web-based training session, attend an in-person course at their local library, or consult with a librarian for training and search planning. Some investigators will team up with a searching professional to run the search together or, after a rigorous interview (in which the goals of the study are carefully discussed), will have the searching professional perform the search. For those without access to such instructional resources, we offer the following recommendations:
• Frame your search topic in the form of a specific question or statement.
• Depending on your choice of search system(s), plan your search strategy accordingly with significant words, phrases, and database subject headings or descriptors.
• Decide whether empirical and/or theoretical literature is to be included:
  - Empirical literature comprises primary research reports (e.g., observational studies, controlled trials) and systematic reviews of research.
  - Theoretical literature includes descriptions of concepts, models, and theoretical frameworks.
• Identify preferred literature sources, for example, articles, book chapters, and dissertations.
• Determine the amount of information needed and the temporal period of interest.
• Evaluate the likelihood of finding specific information about your topic. If you think the topic is voluminous, use a more narrow approach to search the literature. If you think the topic will yield a small amount of literature, use a broader approach.
• Display and review all citations with as much text, searching terms, and related links as possible. Many articles will be available in full text directly from the searching system.
• If you determine that your retrieval is inadequate for your needs, consider modifying your search strategy and running your search again.
• Obtain and organize all source documents.
Once the key references have been compiled, these should be carefully reviewed to identify the methodologies employed, conclusions drawn, and limitations of the selected studies. It is of paramount importance that the investigator carefully read the entire published study and any accompanying editorials, comments, and letters, rather than rely on information given in an abstract or in published reviews of the literature written by others. This is because abstracts and review articles provide only incomplete information; in addition, the perspective of the reviewing author may bias the interpretation of primary findings contained in the review articles.
The information contained within each reference should be related to the problem statement to form a nexus between the earlier studies and the current research project. If the investigator determines that the literature supports the need to study the proposed problem, he or she can proceed with confidence, knowing that pursuit of the research project (if properly designed and implemented) is likely to modify or extend the existing body of knowledge. Moreover, information gained from the literature review (including successes or failures of previous published work) can, as indicated earlier, prove invaluable for refining the problem (if necessary), buttressing or revising hypotheses, and validating or modifying the approach taken.

Crafting the Problem and Purpose Statements

Once the problem has been conceptualized and the literature search completed, the investigator is in a position to communicate to interested parties (e.g., mentors, colleagues, potential sponsors) the nature, context, and significance of the problem, including, typically, the type and size of the affected population, what is known and not yet known, and the consequences of the lack of knowledge (i.e., the implied or directly stated), thus elucidating the active challenge to be addressed and justifying the logical argument underlying the study. These elements are incorporated collectively into a problem statement, a declarative set of assertions, interwoven with literature support, which customarily appears in the Introduction of the research report or in the Background and Significance section of a research proposal (though, as Polit et al. [12] have observed, the problem statement rarely is labeled as such and must be ferreted out). As a general rule, a well-constructed problem statement should be written as concisely as possible for optimal clarity yet contain sufficient information to make a viable argument in support of the study and elicit interest [13]. Abbreviated problem statements, condensed into a sentence or two with minimal supporting argumentation, commonly are provided in the beginning of the abstract accompanying the main body of the research report or research proposal. (Ellis and Levy [13] refer to these reductions as "statements of the problem" to differentiate them from fully developed problem statements with appropriate argumentation.)
If the study is broad, it is recommended that the investigator divide the main problem into subproblems, each of which addresses a single issue. It is important that the sum of the content
reflected in the subproblems equates to no more or no less than the content reflected in the main problem. Like the main problem, the subproblems should be stated clearly and be related to each other in a meaningful way so that the research will maintain coherence.
Two examples of well-defined problem statements are given in Table 2.2. The first (Problem Statement #1) is drawn from a quantitative study by Fleming et al. [23] about the impact of milrinone on risk for atrial fibrillation after cardiac surgery. The second (Problem Statement #2) is drawn from a qualitative study by Walker et al. [24] addressing reasons for prescription of antibiotic therapy among the asymptomatic institutionalized elderly with bacteriuria. Note, in each case, the problem statement makes the argument that there is an important unresolved issue that should be addressed, and sets the stage for what the investigator intends to do to facilitate a solution.

Table 2.2 Examples of well-defined problem statements from two research reports

PROBLEM STATEMENT #1: Fleming et al., Circulation, 2008 [23]
Atrial fibrillation (AF), the most common complication after cardiac surgery, is associated with significant morbidity, increased mortality, longer hospital stay, and higher hospital costs … Because ventricular dysfunction is common following cardiac surgery, inotropic drugs are often necessary to improve hemodynamic status; however, the effect of inotropic drugs on postoperative AF has not been extensively studied … Milrinone has been reported to be associated with a lower risk of postoperative AF compared to dobutamine use, but milrinone increases the risk of atrial arrhythmias in patients with acute exacerbation of chronic heart failure

PROBLEM STATEMENT #2: Walker et al., CMAJ 2000 [24]
Asymptomatic bacteriuria is common in institutionalized elderly people. The prevalence increases with age, occurring in up to 50% of elderly women and 35% of elderly men who reside in long-term facilities … Despite lack of benefit, institutionalized older adults with asymptomatic bacteriuria are frequently treated with antibiotics. This practice is of particular concern given the deleterious effects of antibiotics, including the potential for the development of antibiotic resistance and adverse reactions seen in this population. Why antibiotics continue to be prescribed for asymptomatic bacteriuria is unclear

The problem statement typically is followed by a statement of purpose (usually the last sentence or two in the Introduction of the research report or given as a list in the Specific Aims of the research proposal), which succinctly identifies what the investigator intends to do (the type of inquiry) to resolve the unknowns explicated in the problem statement. Although, like the problem statement, the statement of purpose typically is not labeled as such, it is easily identifiable as it includes the words "purpose" ("the purpose of the study was/is …"), "goal" ("the goal of the study was/is …"), or, alternatively, "intent," "aim," or "objective" [12]. In a quantitative study, the statement of purpose also identifies the key variables to be examined and/or interrelated (parameters to be estimated, hypotheses to be tested), the nature of the study population (who is included), and, occasionally, the nature of the study design; in a qualitative investigation, the purpose statement commonly will include the phenomenon or phenomena under study (rather than hypotheses), as well as the study group, community, or setting [12]. Shown in Table 2.3 are the purpose statements from the Fleming and Walker studies. In both cases, the reader will note that the statements of purpose flow directly from the problem statements.

Table 2.3 Examples of well-defined statements of purpose from two published research studies

PURPOSE STATEMENT #1: Fleming et al., Circulation, 2008 [23]
The aim of this analysis was to test the hypothesis that the use of inotropic drugs is associated with an increased risk of postoperative AF in cardiac surgery patients participating in an ongoing randomized, double blinded, placebo controlled trial

PURPOSE STATEMENT #2: Walker et al., CMAJ 2000 [24]
The aim of our study was to explore the perceptions, attitudes, and opinions of physicians and nurses involved in the process of prescribing antibiotics for asymptomatic bacteriuria in institutionalized elderly people

As Polit et al. have noted (and as illustrated above), the use of verbs in a purpose statement is key to determining the thrust of the inquiry and also helps to differentiate quantitative from qualitative studies [12]. The former typically include terms such as "compare," "contrast," "correlate," "estimate," and "test," whereas the
latter include terms such as "describe," "explore," "understand," "discover," and "develop." Verbs such as "prove" or "show" should be avoided in purpose statements of research studies as these can be construed as indicative of investigator bias [12].
As noted above, a statement of purpose can be expressed in declarative form. However, some investigators instead will frame the purpose of their study interrogatively as one or more research questions (each addressing a single concept) that are directed at the unknowns in the problem statement. Alternatively, these questions can be added to a global statement of purpose to improve clarity and specificity. As Polit et al. contend, research questions "invite an answer" and help focus attention on the kinds of data that would have to be collected to provide that answer [12]. Listed in Table 2.4 are research questions that could have been framed by Fleming et al. and Walker et al. to address the targets of inquiry in their studies.

Table 2.4 Examples of research questions restated from two statements of purpose

PURPOSE STATEMENT #1: RESTATED AS A RESEARCH QUESTION
Fleming et al., Circulation, 2008 [23]
Does the use of inotropic drugs increase risk of postoperative AF in cardiac surgery patients?

PURPOSE STATEMENT #2: RESTATED AS A RESEARCH QUESTION
Walker et al., CMAJ 2000 [24]
What are the perceptions, attitudes, and opinions of physicians and nurses involved in the process of prescribing antibiotics for asymptomatic bacteriuria in institutionalized elderly people?

However written, both the problem and purpose of the study (or the research questions) should be apparent to the reader in the Introduction of the research report (or in the Background, Significance, and Specific Aims of the research proposal) and should possess sufficient clarity for the reader to understand them without the presence of the author. Unfortunately, this is not always the case. Consider the statements articulated by Houck and Hampson in the introduction to their study about carbon monoxide poisoning following a winter storm during the 1990s, when charcoal briquettes commonly were used for heating in certain areas of the USA:

"A major epidemic of carbon monoxide poisoning occurred after a severe winter storm struck western Washington State during the morning of 20 January 1993. Charcoal briquettes and gasoline-powered generators were principal sources of CO. Although previous reports have described CO poisoning following winter storms in the Eastern United States, the large number and wide distribution of cases following this storm are unique. Unintentional carbon monoxide (CO) poisoning is a substantial health problem in the US, causing an estimated 11,547 deaths from 1979 through 1988. The US Consumer Product Safety Commission estimates that there was an average of about 28 charcoal-related deaths per year from 1986 through 1992. Charcoal briquettes are not an uncommon source of CO poisoning in Washington State: 16% of the 509 unintentional poisoning cases that required hyperbaric oxygen treatment between October 1982 and October 1993 involved charcoal. Our investigation suggests that CO poisoning following severe winter storms should be anticipated. It also suggests that preventive messages are important public health messages, but that they should be understandable to those in the community who neither read nor speak English." [25]

Does the Introduction contain a clear statement of the problem so that it is evident why the investigation was important? Is there a statement of purpose (or a set of questions) that explains what the investigators did to address the problem? Do the authors' introductory statements prepare the reader to follow the rest of the paper? After all, that is the principal role of the Introduction in a research manuscript. (For further details about the role and proper construction of the Introduction of the scientific paper, the reader is referred to Chap. 13.) Note, the authors have provided the reader with a general background statement and also have presented their conclusions in their Introduction, repeating information already given in their Abstract. However, other than suggesting that their data were unique, the rationale and aims of their study have not been articulated, and their research questions remain undefined even after reading their comments. The moral illustrated by this example is that for the published paper to engage and edify the reader, the research problem, purpose, and/or research questions must be unambiguously stated early in the research report.
When there is poor definition of problem and purpose, not only may the reader become
confused, but these deficiencies may adversely impact the study methodology because all subsequent steps in the research process (e.g., construction of the research questions or hypotheses, development of the research design, collection and analysis of data) are guided by the problem and purpose statements. Houck and Hampson were fortunate. When their article was written, there were relatively few experienced peer reviewers in their discipline (emergency medicine). This may well have helped the authors' efforts to gain publication.
More commonly, deficiencies in the wording of these statements and their connection to the remainder of the paper can be a primary cause of a manuscript being rejected for publication, or being sent back to the author for revision, following the peer-review process. The following criticisms, made by a reviewer in response to two different submissions to a cardiology journal, are illustrative of this point:
• Submission #1: Comment: The focus of the study is not clearly apparent, even from the last paragraph which specifically describes the goals. The first page does not point directly to the study hypothesis. (Recommendation: Consider after revision)
• In its current form, the manuscript resembles a mystery story with a good outcome more than a scientific study. Thus, while indicating the general aim of the authors, the Introduction misstates the specific goals required by the apparent design of the reported work, thus misfocusing the reader. (Recommendation: Consider after revision)
In sum, all research (whether basic or applied, quantitative or qualitative, hypothesis generating or hypothesis testing, retrospective or prospective, observational or experimental) may be considered as a response to a problem (an ambiguity, gap in knowledge, or other perplexing state) that requires resolution. In thinking through the problem and communicating it to others, the investigator must provide a clear and convincing argument that indicates why the problem must be addressed (the problem statement), articulate a solution to the problem to clarify the ambiguity or fill the gap in knowledge (the purpose statement or research questions), and tie these statements to the methods used. The challenge to the investigator is to define and interrelate these elements well enough to justify the research study and maximize the likelihood that the findings will be understood, appreciated, and utilized.
Take-Home Points
Wrong hypotheses, rightly worked from, have produced more results than unguided
observation
Augustus De Morgan, 1872[1]
P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology: A Guide for Clinical Investigators,
DOI 10.1007/978-1-4614-3360-6_3, © Phyllis G. Supino and Jeffrey S. Borer 2012
32 P.G. Supino
Thus, the investigator does not set out to test it. Examples of assumptions include:
• Radionuclide cineangiography measures ventricular performance.
• Chest x-rays measure the extent of lung infiltrates.
• The SF-36 measures general health-related quality of life.
• Medical education improves knowledge of clinical medicine.
• An apple a day keeps the doctor away (the most famous [albeit untested] assumption of them all).
In contrast, the hypothesis is an expectation that an investigator will attempt to confirm through observation or experiment. Examples in clinical medicine include:
• Among patients with chronic nonischemic mitral regurgitation (insufficiency), survival will be better among those whose valves have been repaired or replaced than among those who have been maintained on medical therapy.
• Among patients hospitalized with community-acquired pneumonia, posthospital course will be better among those with a low-risk profile than among those with a high-risk profile before hospitalization.
• Life expectancy will be greater among individuals consuming low-calorie diets than among those consuming high-calorie diets.
• Health-related quality of life is better among those whose mitral valves have been repaired than among those whose mitral valves have been replaced.

Hypothesis Generation: Modes of Inference

There is a paucity of empirical data regarding the way (or ways) in which hypotheses are formulated by scientists and even less information about whether these methods vary across disciplines. Nonetheless, philosophers and research methodologists have suggested three fundamentally different modes of inference: deduction, induction, and abduction [5]. These differ primarily according to (1) whether the origin of the hypothesis is a body of knowledge or theory (the rationalist perspective), an empirical event (the inductivist perspective), or some combination of the two (the abductivist perspective); (2) the logical structure of the argument; and (3) the probability of a correct conclusion.

Hypothesis by Deduction

Deduction (from the Latin de ["out of"] and ducere ["to draw or lead"]) is one of the oldest forms of logical argument. It was introduced by the ancient Greeks who believed that acquisition of scientific knowledge (insight into the principles and character of natural substances and their causes) could be achieved largely by the same logical processes used to prove the validity of mathematical propositions [6]. Today, deduction remains the predominant mode of formal inference in research in mathematics and in the "fundamental" sciences, but it also plays an important role in the empirical sciences. A deductively derived hypothesis arises directly from logical analysis of a theoretical framework, previously developed to provide an explanation of events or phenomena. It is considered to be nonampliative because, while it helps to provide proof of principle, it adds nothing new beyond the theory. The validity of a theory can never be directly examined. Therefore, scientists wishing to evaluate it, or to test its utility within a given (perhaps new) context, will formulate a conjecture (hypothesis) that can be subjected to empirical appraisal. In forming a hypothesis by deduction, the investigator typically moves from a general proposition to a more specific case that is thought to be subsumed by the generalization (i.e., from theory to a "conceptual" hypothesis or from a conceptual hypothesis to a precise prediction based on the hypothesis). Deductive arguments can be conditional or syllogistic (e.g., categorical [all, some, or none], disjunctive [or], or linear [including a quantitative or qualitative
3 The Research Hypothesis: Role and Construction 33
(drawing conclusions about the future cases from a current sample), causal inference (concluding that association implies causality), and Bayesian inference (given new evidence [data], using probability theory [Bayes' theorem] to alter belief in a hypothesis).

All inductive arguments contain multiple premises that provide grounds for a conclusion but do not necessitate it (in contrast to a deductive argument where the premises, if true, entail the conclusion). In other words, a conclusion drawn from an inductive argument is probable (at best), even if its premises are correct. For this reason, all inductive arguments, while ampliative, are considered to be logically invalid and are judged, instead, according to their strength (i.e., whether they are inductively strong or inductively weak). The strength of an inductive generalization is determined by the number of observations supporting it and the extent to which the observations reflect all observations that could be made. The more (consistent) observations that exist, the more likely the conclusion is correct (inconsistent observations, of course, reduce the argument's inductive strength). The typical form of an inductive generalization is given below:
A1 is a B
A2 is a B
(All As I have observed are Bs)
∴ All As are Bs

Like deductive arguments, inductive generalizations can be categorical, that is, represent conclusions about "all" (as above), "no," or "some" members of a class, or they may involve quantitative arguments, for example, "50% of all coins I have sampled are quarters; therefore, 50% of all coins coming from the same lot that I have sampled probably are quarters" (or, as a clinical example, "30% of the patients I have examined are obese; therefore, 30% of patients sampled from the same population as those who I have examined probably are obese").

Not all inductive hypotheses used by scientists have been formulated by scientists; some, in fact, owe their origin to folklore. For example, by the late eighteenth century, it was common knowledge among English farm workers that when humans were exposed to cows infected with cowpox (vaccinia), they became immune to its more severe human analogue, smallpox. The English surgeon, Edward Jenner (1749–1823), used this hypothesis as the basis of a series of scientific experiments, using exudates from an infected milkmaid, to develop and formally test a vaccine against this disease [11]. He became famous for using vaccination as a method for preventing infection, though there is growing recognition that the first successful inoculations against smallpox actually were performed by a farmer, Benjamin Jesty, some 20 years earlier, who vaccinated his family using cowpox taken directly from a local cow [12]. It also has been claimed that Charles Darwin used inductive reasoning when generalizing about the shapes of the beaks from finches from the various Galapagos Islands [13] and when forming conjectures from observations based on the breeding of dogs, pigeons, and farm animals at home (inferences that formed underpinnings of his theory of evolution) and that Gregor Mendel used the same form of reasoning to conceptualize his law of hybridization [14]. Even if these claims are true (and there is far from universal agreement on this matter), inductive generalizations typically are regarded as inferior to hypothesis-generating methods that involve more theoretical reasoning, that consider variations in circumstances (i.e., possible confounding factors) that may account for spurious patterns, and that provide possible causal explanation for observed phenomena. Moreover, recent research in cognition and the relatively new field of neural modeling suggest that simple induction across a limited set of observations may have a far smaller role in scientific reasoning than previously realized [15].

Hypothesis by Abduction

Of the three primary methods of reasoning, the one that has been most implicated in the creation of novel ideas, including scientific discoveries, is the logical process of abduction (from the Latin "ab" [meaning away from] and "ducere" [to draw or to lead]). It also is the most common mode of reasoning employed by clinicians when making
diagnostic inferences. Abduction was introduced into modern logic by American philosopher and mathematician, Charles Sanders Peirce (1839–1914) [16], and remains an important, albeit controversial, topic of research among philosophers of science and students of artificial intelligence. It refers to the process of formulation and acceptance on probation of a hypothesis to explain a surprising observation. Thus, hypotheses formed by abduction (unlike those formed by induction) are always explanatory. (The reader should note that other synonyms for, and definitions of, abduction exist, e.g., retroduction, reduction, inference to the best explanation, etc., the latter reflecting the evaluative and selective functions that also have been associated with this term.)

Abductive reasoning entails moving from a consequent (the observation or current fact) to its antecedent (presumed cause or precondition) through a general rule. It is considered "backward" because the inference about the antecedent is drawn from the consequent.

Peirce devoted his earliest work (before 1900), as did Aristotle long before him, to furthering the development of syllogistic theory to express logical relations. During this early period, abduction (then termed by him as "hypothesis") was taken to mean the use of a known rule to explain an observation ("result"); accordingly, his initial efforts were devoted to demonstrating how the hypothesis relates to the premises of the argument and how it differs from the logical structure of other forms of reasoning (i.e., deduction or induction). In his essay, "Deduction, Induction, Hypothesis," Peirce presents an abductive syllogism:
Rule: "All the beans from this bag are white."
Result: "These beans are white."
Case: "These beans are from this bag." [16]

In this argument, the "rule" and "result" represent the premises (background knowledge and observation, respectively [the order is arbitrary]) and the "case" represents the conclusion (here, the hypothesis). Had this argument been expressed deductively, the "case" would have been the second premise, and the "result," the conclusion (i.e., all the beans from this bag are white, these beans are from this bag; therefore, these beans are white). It should be obvious to the reader that the abductive argument is logically less secure than a deductive argument (or even an inductive argument). It represents a possible conclusion only (after all, the beans might come from some other bag, or from no bag at all). Therefore, like an inductive argument, it is ampliative though logically invalid. Its strength is based on how well the argument accounts for all available evidence, including that which is seemingly contradictory.

As Peirce's work evolved, he shifted his efforts to developing a theory of inferential reasoning in which abduction was taken to mean the generation of new rules to explain new observations. In so doing, he focused on, what some have termed, the "creative character of abduction" [17]. Peirce argued that abduction had a major role in the process of scientific inquiry and, indeed, was the only inferential process by which new knowledge was created, a view that was, and continues to be, hotly debated by the philosophical community. In his later work, Peirce described the logical structure of abduction as follows:
The surprising fact, C, is observed.
But if A were true, C would be a matter of course.
Hence, there is reason to suspect that A is true. [18]

The "surprise" (the stimulus to the abductive inference) arises because the observation is viewed, at that moment in time, as an anomaly, given the observer's preexisting corpus of knowledge (theory base) which cannot account for it. The lack of compatibility between the observation and expectation introduces a type of cognitive dissonance that seeks resolution through the adoption of a coherent explanation. In Peirce's opinion, the explanation might be nothing more than a guess (Peirce believed that humans were hardwired with the ability for guessing correctly) that, unlike an inductive generalization, enters the mind "like a flash" [18] or, what is commonly termed, as a "eureka moment" or an "ah ha!" experience. Because a guess (insightful or not), by its very nature, is speculative (and, as noted above, is a relatively insecure form of reasoning), Peirce recognized that an abductive hypothesis must be rigorously tested before it could be admitted into scientific theory. This, he
reasoned, is accomplished by using deduction to explicate the consequences of the hypothesis (i.e., the predictions) and induction to form a conclusion about the likelihood of their truthfulness, based on experimental verification. According to Peirce, these are the primary roles of deduction and induction in the scientific process. Figure 3.1 illustrates the Peircian view of the relation between abduction, deduction, and induction as interpreted by Flach and Kakas [19].

Countless abductively derived hypotheses, principles, theories, and laws have been put forward in science. Many, if not most, owe their origin to the serendipitous consequences of an unexpected observation made while looking for something else [20]. Well-known examples of such "happy accidents" include:
Archimedes' principles of density and buoyancy
Hans Christian Oersted's theory of electromagnetism
Luigi Galvani's principle of bioelectricity
Claude Bernard's neuroregulatory principle of circulation
Paul Gross's protease-antiprotease hypothesis of pulmonary emphysema

Although, as Peirce points out, all three modes of inference (abduction, deduction, and induction) are used in the process of scientific inquiry, each requires different skills. As scholars have noted, deduction requires the capacity to reason logically and inductive reasoning requires understanding of the statistical implications of drawing conclusions from samples to populations. In contrast, as Danmark et al. have noted, abduction requires the "discernment of new relations and connections not immediately obvious" [21]; in other words, to think outside the box. For this reason, the best abductive hypotheses in science have been made by those who not only are observant, wise, and well grounded in their disciplines but who also are imaginative and receptive to new ideas. This view was, perhaps, best expressed by Louis Pasteur (1822–1895) when he argued, "In the fields of observation, chance favors only prepared minds" [22]. Accordingly, developing the "prepared mind," in general, and enhancing the capacity to reason abductively, deductively, and inductively, in particular, should be among the most important goals of those seeking to effectively engage in the process of scientific discovery.
association? Third, to what is right ventricular performance compared? Is the contrast between right ventricular performance and clinical descriptors, anatomic descriptors, other functional descriptors, or between all of these? Fourth, what type of valvular heart disease is being studied? Is it regurgitant, stenotic, or both? Does it involve the mitral, aortic, or some other heart valve? Finally, what is meant by "less important"? Who (or what) are the "others"? As is true for the research problem, the clearer and less complex the statement of the hypothesis, the more straightforward the study and the more useful the findings.

4. It should provide an adequate answer to the research problem.
For a hypothesis to be adequate, it must address, in a satisfactory manner, both the content and scope of the central question; that is, whether the problem is narrow or broad, simple or complex, evaluation of the hypothesis(es) should result in the full resolution of the research problem. For this reason, it is recommended that the investigator formulate at least one hypothesis for every subproblem articulated in the study. Equally important, a hypothesis must be plausible; for this condition to be satisfied, the hypothesis should be based on prior relevant observation and experience, buttressed by consideration of existing theory, and should reflect sound reasoning and knowledge of the problem at hand. In contrast, speculations which have either no empirical support or legitimate theoretical basis, even if interesting, constitute poor hypotheses and typically yield weak or uninterpretable study outcomes. Finally, if the hypothesis is explanatory in nature (rather than an inductive generalization), all else being equal, it should represent the simplest of all possible competing explanations for the phenomenon or data at hand [23], a principle known as Occam's razor or "entia non sunt multiplicanda praeter necessitatem" (Latin for "entities must not be multiplied beyond necessity").

5. It should be testable.
A hypothesis must be stated in such a way as to allow for its examination which, in the biomedical and other empirical sciences, is achieved through the acts of observation or experimentation, analysis, and judicious interpretation. If one or more of the elements comprising the hypothesis is not present in the population or sample, or if a phenomenon or characteristic contained within the hypothesis is highly subjective or otherwise difficult to measure, the hypothesis cannot be properly evaluated. For example, the statement "female patients cope better with stress than male patients" would be a poor hypothesis if the investigator did not have access to both male and female patients or was unable to generate acceptable definitions and measures to evaluate "coping" and "stress." An even more egregious example is the hypothesis "prognosis following diagnosis of ovarian cancer is related to the patient's survival instinct," as it would be extremely difficult to develop empirical data in support of a "survival instinct," assuming it did exist.

For many years, philosophers of science have argued about what constitutes evidence in science or support for a scientific hypothesis. By the mid-twentieth century, the tenets of "logical positivism" (or "logical empiricism") dominated the philosophy of science in the United States as well as throughout the English-speaking world [24], replacing the Cartesian emphasis on rationalism as a primary epistemological tool. Strongly eschewing metaphysical and theological explanations of reality, the logical positivists argued that a proposition held meaning only if it could be "verified" (i.e., if its truth could be determined conclusively by observable facts). Early critics of logical positivism, most notable among them Karl Popper, believed that verifiability was too stringent a criterion for scientific discovery. This, he argued, was due to the logical limitations inherent in inductive reasoning, namely, the deductive invalidity of forming a generalization based on the observation of particulars, and the attendant uncertainty of such an inference. Thus, while both positive existential claims (e.g., "there is at least one white swan") and negative universal claims
Fig. 3.2 The hypothetico-deductive model: Popper's view of the role of falsification in scientific reasoning
(e.g., "not all swans are white") could be confirmed by finding, respectively, at least one white swan or one black swan, it would be impossible to verify a positive universal claim (e.g., "all swans are white"). To accomplish that, one would have to observe every swan in existence, at all times and in all places, or risk being wrong.

According to Popper, the hallmark of a testable claim is its capacity to be falsified [25]. In his view, falsification (not verification) is the criterion for demarcation between those hypotheses, theories, and other propositions that are scientific versus those that are not scientific. This, of course, did not mean that a scientific hypothesis or theory must be false; rather, if it were false, it could be shown to be so. Returning to our earlier example, all that would be required to disprove the claim "all swans are white" is to find a swan that is not white. Indeed, this inductive inference, based on the observation of millions of white swans in Europe, was shown to be false when black swans were discovered in Western Australia in the eighteenth century [26], an event that was not unnoticed by Popper. It provided clear support for his assertion that no matter how many observations are made that appear to confirm a proposition, there is always the possibility that an event not yet seen could refute it. Similarly, any scientific hypothesis, theory, or law could be falsified by finding a single counterexample.

Popper's greatest contribution to science was his characterization of scientific inquiry, based on a cyclical system of conjectures and refutations (a form of critical rationalism) widely known as the hypothetico-deductive method [27]. A schematic of Popper's view of this method is shown in Fig. 3.2. Consistent with Popper's writing on the subject, the terms "hypothesis" and "theory" are used interchangeably as both are viewed as tentative, though most workers in the field currently reserve the latter term for hypotheses (or related systems of hypotheses) that have received consistent and long-standing empirical support.

The reader will note that the hypothetico-deductive method begins with an early postulation of a hypothesis. The investigator then uses deductive logic to form predictions from the hypothesis that should be true if the hypothesis is, in fact, correct. The nature of the predictions can vary from study to study, but they share the common attribute of being unknown before data collection. The predictions are then evaluated by formal experimentation or observation. Assuming a properly designed study, those predictions that are discordant with data falsify the hypothesis, which is then discarded or revised, leading to additional study. Although a hypothesis can never
be shown to be true via collection of compatible information (as Popper noted, a subsequent demonstration of counterfactual data can overturn any hypothesis), the extent to which it survives repeated attempts at falsification provides support (corroboration) for its validity. As a result, testing of a hypothesis serves to advance the existing theory base and body of knowledge. Popper argued that the hypothetico-deductive method was the only sound approach to scientific reasoning; moreover, in his opinion, it was the only method by which science made any progress.

Although Popper did not originate the hypothetico-deductive method, he was the first to explicate the central role of falsification versus confirmation of a hypothesis in the developing science. While his arguments have been criticized by other philosophers of science who assert that scientists do not necessarily reason that way [28], his views remain prominent in modern philosophy and continue to appeal to many modern scientists [29]. Today, the Popperian view of the hypothetico-deductive method, with its emphasis on testing to falsify a proposed hypothesis, generally is taken to represent an ideal (if not universal) approach to curbing excessive inductive speculation and ensuring scientific objectivity, and is considered to be the primary methodology by which biological knowledge is acquired and disseminated [30].

Types of Hypotheses

Hypotheses can be classified in several ways, as shown below.

1. Conceptual Versus Operational Hypotheses
Hypotheses can vary according to their degree of specificity or precision and theoretical relatedness. Hypotheses can be written as broad or general statements, in which case they are termed conceptual hypotheses. For example, an investigator may hypothesize that "a high-fat diet is related to severity of coronary artery disease" or another may conjecture that "depression is associated with a relatively high incidence of morbid events." Although these may be important hypotheses, these statements cannot be directly tested as they are fundamentally abstract. What do the investigators mean by "high fat," "depression," "severity of coronary artery disease," "relatively high," or "morbid events"? How will these terms be evaluated?

To render conceptual hypotheses testable, they must be recast as more specific statements with elements (variables) that are precisely defined according to explicit observable or measurable criteria. Hypotheses of this type are referred to as operational hypotheses or, alternatively, specific hypotheses or predictions and represent the specific (observable) manifestation of the conceptual hypothesis that the study is designed to test. Once the study is designed, data will be collected and analyzed to determine whether they are concordant or discordant with the operational hypothesis which, ultimately, will be reinterpreted in terms of its broader meaning as a conceptual hypothesis. Figure 3.3 below illustrates a simplified version of the hypothetico-deductive method, as conceptualized by Kleinbaum, Kupper, and Morgenstern [31] depicting the relation of conceptual and operational hypotheses to the design and interpretation of the study.

Construction of operational hypotheses represents an important preliminary step in the development of the research design, data collection strategy, and statistical analysis plan and is described in greater detail in subsequent sections of this chapter.

2. Single Variable Versus Multiple Variable Hypotheses
Some investigations are undertaken to determine whether a mean, proportion, or other parameter from a sample varies from a specified value. For example, a group of obstetricians may have read a report that concludes that, throughout the nation, the average length of stay following uncomplicated caesarian section is 5 days. They may have reason to believe that the length of stay for similar patients at their institution differs from the national average and would like to know if
Fig. 3.3 Interrelation of conceptual hypotheses, operational hypotheses, and the hypothetico-deductive method (Reprinted with permission: Kleinbaum DG, Kupper LL, Morgenstern H. Epidemiologic Research: Principles and Quantitative Methods, Fig. 2.2: An Idealized Conceptualization of the Scientific Method (New York: Van Nostrand Reinhold 1982), p. 35)
their belief is correct. To study the question, they must first recast their question as a hypothesis including the stipulated variable, select a representative sample of patients from their institution, and compare data from their sample with the national average (stipulated value) using an appropriate one-sample statistical test. (The reader should note that the only variable being tested within this hypothesis is length of stay. In this case, caesarian section is only a descriptor of the target population because all data to be examined are from patients undergoing this procedure.)

However, the objective of most hypotheses is not to draw inferences about population parameters but to facilitate evaluation of a proposition that two or more variables are systematically related in some manner [32].
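
The one-sample comparison described in the obstetricians' example can be sketched in a few lines of Python. This is a minimal illustration only, not a method prescribed by the chapter: the lengths of stay are invented, the helper name one_sample_t is hypothetical, and a one-sample t-test is assumed to be the "appropriate one-sample statistical test" (which presumes roughly normally distributed data).

```python
import math
import statistics

# Hypothetical lengths of stay (days) for a sample of patients at one
# institution; these numbers are invented purely for illustration.
lengths_of_stay = [4, 5, 6, 4, 3, 5, 4, 4, 6, 3]

# Stipulated value: the national average quoted in the example (5 days).
NATIONAL_MEAN_DAYS = 5.0


def one_sample_t(sample, stipulated_value):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)        # sample SD (n - 1 denominator)
    standard_error = sd / math.sqrt(n)
    t = (mean - stipulated_value) / standard_error
    return t, n - 1


t_stat, df = one_sample_t(lengths_of_stay, NATIONAL_MEAN_DAYS)
print(f"sample mean = {statistics.mean(lengths_of_stay):.2f} days, "
      f"t = {t_stat:.3f}, df = {df}")
```

The t statistic would then be compared with a critical value from a t table for the stated degrees of freedom (about 2.262 for a two-sided test at the 5% level with 9 degrees of freedom). With these invented data, the absolute value of t falls below that threshold, so the sample would not contradict the stipulated national average.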
Indeed, some methodologists recognize only the latter form of argument as a legitimate hypothesis [7, 33–35]. The simplest hypotheses about intervariable association contain two variables (bivariable hypotheses), for example:
Caffeine consumption is more frequent among smokers than nonsmokers.
Women have a higher fat-to-muscle ratio than men.
Heart attacks are more common in winter than in other seasons.

If the objective of the study is to compare the relative association of several characteristics, it usually will be necessary to construct a single hypothesis which relates three or more variables (multivariable hypotheses), for example:
Ischemia severity is a stronger predictor of cardiac events than symptom status and risk factor score.
Response to physical training is affected more by age than gender.
Improvement in health-related quality of life after cardiac surgery is influenced more by preoperative symptoms than by ventricular performance or geometry.

The number and type of variables contained within the hypothesis (as well as the nature of the proposed association) will dictate the study design, measurement procedures, and statistical analysis of the results. These concepts are addressed in Chaps. 5 and 11.

3. Hypotheses of Causality Versus Association or Difference
The relation posited between variables may be cast as one of cause-and-effect, in which case the researcher hypothesizes that one variable affects or influences the other(s) in some manner. For example:
Estrogen produces an increase in coronary flow.
Smoking promotes lung cancer.
Patient education improves compliance.
Coronary artery bypass grafting causes a reduction in the number of subsequent cardiac events.

However, hypotheses often are not written this way because support for a cause-and-effect relation requires not only biological plausibility and a strong statistical result but also an appropriate (and usually rigorous) study design. If the investigator believes that the variables are related, but prefers not to speculate on the influence of one variable on another, the hypothesis may be cast to propose an association only, without explicit reference to causality. For example:
Surgical benefit is related to preoperative ischemia severity.
Exercise tolerance is correlated with chronological age.
Consumption of low-calorie beverages is associated with body weight.

Finally, hypotheses also can be written to assert that there will be a difference between levels of a variable among two or more groups of individuals or within a single group of individuals at different points in time, as shown by the following examples:
Patients enrolled in a health maintenance organization (HMO) will have a different number of hospitalizations than those enrolled in preferred provider organizations (PPOs) or traditional fee-for-service insurance plans.
Among patients undergoing mitral valve repair or replacement, left ventricular performance will be dissimilar at 1 versus 3 years after operation.

The hypothesis also can be framed so that the nature of the association (e.g., linear, curvilinear, positive, inverse, etc.) or difference ("larger" or "smaller," "better" or "poorer," etc.) will be specified (see below, Alternative hypotheses [directional]).

4. Mechanistic Versus Nonmechanistic Hypotheses
Hypotheses can be written so as to provide a mechanism (i.e., an explanation) for an asserted relationship or prediction, or they can be written without defining an underlying mechanism. Mechanistic hypotheses are common in preclinical research which typically
attempts to define biochemical and physiological causes of disease or dysfunction and pathways amenable to therapeutic intervention. Shown below are two examples of mechanistic hypotheses that were evaluated in two different preclinical investigations: (Note the use of the phrase "as a result of" in the first hypothesis evaluating the impact of endothelial nitric oxide synthase [eNOS] and "due to" in the second hypothesis evaluating antagonism of endothelin [ET]-induced inotropy. Italics have been added for emphasis.)
Gender-specific protection against myocardial infarction occurs in adult female as compared to male rabbits as a result of eNOS upregulation [36].
ET-induced direct positive inotropy is antagonized in vivo by an indirect cardiodepressant effect due to a mainly ETA-mediated and ET-induced coronary constriction with consequent myocardial ischemia [37].

In clinical research, hypotheses more commonly are nonmechanistic (i.e., framed without including an explicit explanation). Shown below are two published literature examples:
Patients with medically unexplained symptoms attending the clinic of a general adult neurologist will have delayed earliest and continuous memories compared with patients whose symptoms were explained by neurological disease [38].
Patients with acute mental changes will be scanned more frequently than other elder patients [39].

The reader will note that these hypotheses do not include the mechanism for memory variations in these patient populations (first example) or the reasons why elderly patients with acute mental changes should be scanned more frequently than comparable patients without such changes (second example). In situations like this, it is critical that the justification be clear from the introductory section of the research paper or protocol.

5. Alternative Versus Null Hypotheses
The requirement that a hypothesis should be capable of corroboration or unsupportability (falsification) reflects the fact that two outcomes always can arise out of a study of any single research problem. Thus, prior to collecting and evaluating empirical evidence to resolve a problem, the investigator will posit two opposing assertions. The first assertion will indicate the supposition for which support actually is sought (e.g., that there is a difference between a population parameter and an expected value or, more commonly, that there is some form of relation between variables within a particular population); the other will indicate that there is no support for this supposition. This first type of assertion is termed the alternative hypothesis and is generally denoted HA or H1. The alternative hypothesis can be differentiated further according to its quantitative attributes. As an example, in a study evaluating the impact of beta-adrenergic antagonist treatment (β-blockade) on the incidence of recurrent myocardial infarctions (MIs), an investigator could frame three contrasting alternative hypotheses:
1. The proportion of recurrent MIs among comparable patients treated with versus without β-blockade is different.
2. The proportion of recurrent MIs among patients treated with β-blockade is less than that among comparable patients treated without β-blockade.
3. The proportion of recurrent MIs among patients treated with β-blockade is greater than that among comparable patients treated without β-blockade.

The first of these statements is termed a nondirectional hypothesis because the nature of the expected relation (i.e., the direction of the intergroup difference in the proportion of recurrent infarctions) is not specified. The second and third statements are termed directional hypotheses since, in addition to positing a difference between groups, the nature of the expected difference (positive or negative) is predefined. Generally, the decision to state an alternative hypothesis in a directional versus nondirectional manner is based on theoretical considerations and/or the availability of prior empirical information. (In statistics, a
connote absence of a property (in this case, absence of kinetic energy). When analyzing interval data, one can add or subtract but not multiply or divide. Most statistical operations are permissible, including calculation of measures of central tendency (e.g., mean, median, or mode), measures of dispersion (e.g., standard deviation, standard error of the mean, range), and performance of many statistical tests of hypotheses including correlation, regression, t-tests, and analysis of variance. However, due to the absence of a true zero point, ratios between values on an interval scale are not meaningful (though ratios of differences can be computed).

4. The Ratio Variable
Like interval variables, the distances between successive values on a ratio scale are equal. However, ratio variables reflect the highest level of measurement because they contain a true, nonarbitrary zero point that reflects complete absence of a property. Examples of ratio variables include temperature on a Kelvin scale (where zero reflects absence of kinetic energy), mass, length, volume, weight, and income. When ratio data are analyzed, all arithmetic operations are available (i.e., addition, subtraction, multiplication, and division). The same statistical operations that can be performed with interval variables can be performed with ratio variables. However, ratio variables also permit meaningful calculation of absolute and relative (or ratio) changes in a variable and computation of geometric and harmonic means, coefficients of variation, and logarithms.

Quantitative variables (interval or ratio) can be either continuous or discrete. Continuous variables (e.g., weight, height, temperature) differ from discrete variables in that the former may take on any conceivable value within a given range, including fractional values or decimal values. For example, within the range 150-151 lbs, an individual theoretically can weigh 150 lbs, 150.5 lbs or 150.95 lbs, though the capacity to distinguish between these values clearly is limited by the precision of the measurement device. In contrast, discrete variables (e.g., number of dental caries, number of white cells per cubic centimeter of blood, number of readers of medical journals, or other count-based data) can take on only whole numbers. Nominal and ordinal variables are intrinsically discrete, though in some disciplines (e.g., behavioral sciences), ordinally scaled data often are treated as continuous variables. This practice is considered reasonable when ordinal data intuitively represent equivalent intervals (e.g., visual analogue scales), when they contain numerous (e.g., 10 or more) possible scale values or orderings [43], or when shorter individual measurement scales are combined to yield summary scores. The reader should note, however, that in other disciplines and settings, treating all data as continuous data is controversial and generally is not recommended [44].

Role in the Research Hypothesis

Another method of classifying variables is based on the specific role (function) that the variable plays in the hypothesis. Accordingly, a variable can represent (1) the putative cause (or be associated with a causal factor) that initiates a subsequent response or event, (2) the response or event itself, (3) a mediator between the causal factor and its effect, (4) a potential confounder whose influence must be neutralized, or (5) an explanation for the underlying association between the hypothesized cause and effect. Viewed this way, variables may be independent, dependent, or may serve as moderator, control, or intervening variables. Understanding these distinctions is crucial for constructing a research design, executing a statistical program, or communicating effectively with a statistician.

1. The Independent Variable
The independent variable is that attribute within an individual, object, or event which affects some outcome. The independent variable is conceptualized as an input in the study that may be manipulated by the investigator (such as a treatment in an experimental study) or reflect a naturally occurring risk factor. In either case, the independent variable is viewed as antecedent to some outcome and is presumed
For example, suppose a psychiatrist wishes to study the effects of a new amphetamine-type drug on task persistence in patients with attention deficit hyperactivity disorder (ADHD) who have not responded well to current medical therapy. She believes that the drug may have efficacy but suspects that its effect may be diminished by the comorbidity of chronic anxiety. Rather than give the new drug to patients with ADHD who do not also have anxiety and placebo to patients with ADHD plus anxiety, to avoid confounding, she enrolls both types of patients, randomly administers drug or placebo to members of each subgroup, and measures task persistence among all subjects at a fixed interval after onset of therapy. In this hypothetical study, the independent variable would be type of therapy (factor levels: new drug, placebo), the dependent variable would be task persistence, and chronic anxiety (presence, absence) would be the moderator. Figure 3.4 illustrates the importance of a moderator variable. If none had been used in the study, the data would have led the investigator to conclude that the new drug was ineffective as no overall treatment effect would have been observed for the ADHD group (left panel, diagonal patterned bar), with change in task persistence for the entire treated group similar to subjects on placebo (right panel). However, as noted, the new drug was not ineffective but instead was differentially effective, promoting greater task persistence among patients without associated anxiety but decreasing task persistence among those with anxiety, as hypothesized.

A cautionary note is in order. Although moderator variables can increase the yield or accuracy of information from a study, an investigator needs to be very selective in using them as each additional factor introduced into the study design increases the sample size needed to enable the impact of these secondary factors to be satisfactorily evaluated. During the study planning process, the investigator must determine the likelihood of a potential interaction, the theoretical or practical knowledge to be gained by discovery of an interaction, and decide whether sufficient resources exist for such evaluation.

4. The Control Variable
In this last example, the investigator chose to evaluate the interactive effects of a secondary variable on the relation of the independent and dependent variables. Others in similar situations might choose not to study a secondary independent variable, particularly if it is viewed as extraneous to the primary hypothesis or focus of the study. Additionally, it is impractical to examine the effects of every ancillary variable. However, extraneous variables cannot be ignored because they can confound study results and render the data uninterpretable. Variables such as these usually are treated as control variables.
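The way a moderator can hide a treatment effect, as in the ADHD example above, can be sketched numerically. The cell means below are invented for illustration (they are not the data behind Figure 3.4), equal subgroup sizes are assumed, and the function names are hypothetical.

```python
# Hypothetical mean change in task persistence (arbitrary units) per cell.
# All values are invented; equal numbers of subjects per cell are assumed.
cell_means = {
    ("new drug", "no anxiety"): 6.0,
    ("new drug", "anxiety"): -2.0,
    ("placebo", "no anxiety"): 2.0,
    ("placebo", "anxiety"): 2.0,
}
STRATA = ("no anxiety", "anxiety")


def treatment_effect(stratum):
    """Drug-minus-placebo difference within one level of the moderator."""
    return cell_means[("new drug", stratum)] - cell_means[("placebo", stratum)]


def overall_effect():
    """Drug-minus-placebo difference when the moderator is ignored."""
    drug = sum(cell_means[("new drug", s)] for s in STRATA) / len(STRATA)
    placebo = sum(cell_means[("placebo", s)] for s in STRATA) / len(STRATA)
    return drug - placebo


effect_no_anxiety = treatment_effect("no anxiety")
effect_anxiety = treatment_effect("anxiety")
# A nonzero difference between the subgroup effects indicates an interaction.
interaction = effect_no_anxiety - effect_anxiety

print("overall:", overall_effect())
print("without anxiety:", effect_no_anxiety)
print("with anxiety:", effect_anxiety)
print("interaction contrast:", interaction)
```

With these invented values, the subgroup effects are equal in size and opposite in sign, so they cancel in the pooled comparison. This is the situation in which omitting the moderator would suggest an ineffective drug.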
coronary vessels and another, working in the same study, defined it as 70% luminal diameter narrowing; or if one investigator studying new onset angina used 1 week as the criterion for "new" and another used 1 month. Operational definitions can describe the manipulations that the investigator performs (e.g., the intervention), or they can describe behaviors or responses. Still others describe the observable characteristics of objects or individuals. Once the investigator has selected appropriate operational definitions (this choice is entirely study dependent), all hypotheses in the study can be operationalized.

A hypothesis is rendered operational when its broadly (conceptually) stated variables are replaced by operational definitions of those variables. Hypotheses stated in this manner are called operational hypotheses, specific hypotheses, or predictions.

Let us consider two hypotheses previously given in this chapter:

"Patients with heart failure who are treated with adrenal corticosteroids will have better systolic performance than those who are not" is sufficiently general to be considered a conceptual hypothesis and, as such, is not directly testable. To render this hypothesis testable, the investigator could operationally define its constituent elements as follows:
Heart failure = secondary hypodynamic cardiomyopathy
Adrenal corticosteroids = cortisol
Better systolic performance = higher left ventricular ejection fractions at rest
The hypothesis, in its operational form, would state: Patients with secondary hypodynamic cardiomyopathy who have received cortisol will have higher ventricular ejection fractions at rest than those who have not received cortisol treatment.

Similarly, the hypothesis that patients with angina who are treated with β-blockers will have a greater improvement in their capacity for physical activity than those not treated with β-blockers, and that this improvement will vary as a function of initial symptoms, while complex, is still general enough to be considered conceptual. To render this hypothesis testable, its constituent elements could be defined as follows:
β-blockers = propranolol (assuming that the investigator was specifically interested in this drug)
Capacity for physical activity = New York Heart Association functional class
Severity of symptoms = angina class 1-2 versus angina class 3-4
This hypothesis, in its operational form, would be stated: Patients with angina who are treated with propranolol will have greater improvement in New York Heart Association functional class than those not treated with propranolol, and this improvement will vary as a function of initial angina class (1-2 vs. 3-4). In this form, the hypothesis could be directly tested, although the investigator would still need to specify measurement criteria and develop an appropriate design.

Any element of a hypothesis can have more than one operational definition and, as noted, it is the investigator's responsibility to select the one that is most suitable for his or her study. This is an important judgment because the remaining research procedures (i.e., specification of subject inclusion/exclusion criteria, the nature of the intervention and outcome measures, and data analysis methodology) are derived from operational hypotheses. Investigators must be careful to use a sufficient number of operational definitions so that reviewers will have a basis upon which to judge the appropriateness of the methodology outlined in submitted grant proposals and manuscripts, so that other investigators will be able to replicate their work, and so that the general readership can understand precisely what was done and have sufficient information to properly interpret findings.

Once operational definitions have been developed and the hypothesis has been restated in operational form, the investigator can conduct the study. The next step will be to select a research design that can yield data to support optimal statistical hypothesis testing. The strengths, weaknesses, and requirements of various study designs will be discussed in Chaps. 4 and 5.
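Looking ahead to statistical testing, the earlier distinction between nondirectional and directional alternative hypotheses (the β-blockade and recurrent MI example) determines how a p-value is computed. The sketch below is one common approach, a pooled two-proportion z-test with a normal approximation; the chapter does not prescribe this test, and the patient counts are invented.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def two_proportion_z(events_a, n_a, events_b, n_b):
    """Pooled z statistic for the difference in proportions (group A minus B)."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    return (p_a - p_b) / se


# Invented counts: recurrent MIs among patients treated with beta-blockade
# (group A) and among comparable untreated patients (group B).
z = two_proportion_z(30, 200, 45, 200)

p_nondirectional = 2.0 * normal_cdf(-abs(z))  # hypothesis 1: proportions differ
p_less = normal_cdf(z)                        # hypothesis 2: treated proportion is lower
p_greater = 1.0 - normal_cdf(z)               # hypothesis 3: treated proportion is higher

print(f"z = {z:.3f}")
print(f"nondirectional p = {p_nondirectional:.4f}")
print(f"directional (less) p = {p_less:.4f}")
print(f"directional (greater) p = {p_greater:.4f}")
```

When the observed difference lies in the hypothesized direction, the directional p-value is half the nondirectional one. This is consistent with the text's point that the choice should rest on theory or prior empirical information rather than on the data in hand.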
Take-Home Points
A hypothesis is a logical construct, interposed between a problem and its solution, which represents a proposed answer to a research question. It gives direction to the investigator's thinking about the problem and, therefore, facilitates a solution.
There are three primary modes of inference by which hypotheses are developed: deduction (reasoning from a general proposition to specific instances), induction (reasoning from specific instances to a general proposition), and abduction (formulation/acceptance on probation of a hypothesis to explain a surprising observation).
A research hypothesis should reflect an inference about variables; be stated as a grammatically complete, declarative sentence; be expressed simply and unambiguously; provide an adequate answer to the research problem; and be testable.
Hypotheses can be classified as conceptual versus operational, single versus bi- or multivariable, causal or not causal, mechanistic versus nonmechanistic, and null or alternative.
Hypotheses most commonly entail statements about variables which, in turn, can be classified according to their level of measurement (scaling characteristics) or according to their role in the hypothesis (independent, dependent, moderator, control, or intervening).
A hypothesis is rendered operational when its broadly (conceptually) stated variables are replaced by operational definitions of those variables. Hypotheses stated in this manner are called operational hypotheses, specific hypotheses, or predictions and facilitate testing.
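The point about level of measurement can be checked with a short numeric sketch. It assumes the standard Celsius-to-Fahrenheit formula and an approximate kilogram-to-pound factor; the sample values are arbitrary. Ratios of values change when an interval scale is re-expressed in other units, whereas ratios of differences, and ratios of values on a ratio scale, do not.

```python
def celsius_to_fahrenheit(c):
    """Convert between two interval scales; each has an arbitrary zero point."""
    return c * 9.0 / 5.0 + 32.0


def kilograms_to_pounds(kg):
    """Convert between two ratio scales; both share a true zero point."""
    return kg * 2.20462  # approximate conversion factor


# Interval scale: three arbitrary temperatures.
temps_c = (10.0, 20.0, 40.0)
temps_f = tuple(celsius_to_fahrenheit(t) for t in temps_c)

# Ratio of values depends on the unit chosen, so it carries no meaning.
ratio_of_values_c = temps_c[1] / temps_c[0]
ratio_of_values_f = temps_f[1] / temps_f[0]

# Ratio of differences is the same in either unit.
ratio_of_diffs_c = (temps_c[2] - temps_c[1]) / (temps_c[1] - temps_c[0])
ratio_of_diffs_f = (temps_f[2] - temps_f[1]) / (temps_f[1] - temps_f[0])

# Ratio scale: two arbitrary masses; the ratio of values survives a unit change.
mass_kg = (50.0, 100.0)
mass_lb = tuple(kilograms_to_pounds(m) for m in mass_kg)
ratio_of_values_kg = mass_kg[1] / mass_kg[0]
ratio_of_values_lb = mass_lb[1] / mass_lb[0]

print(ratio_of_values_c, ratio_of_values_f)
print(ratio_of_diffs_c, ratio_of_diffs_f)
print(ratio_of_values_kg, ratio_of_values_lb)
```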
19. Flach PA, Kakas AC. Abductive and inductive reasoning: background issues. In: Flach PA, Kakas AC, editors. Abduction and induction. Essays on their relation and integration. The Netherlands: Kluwer; 2000. Chapter 1.
20. Murray JF. Voltaire, Walpole and Pasteur: variations on the theme of discovery. Am J Respir Crit Care Med. 2005;172:423-6.
21. Danemark B, Ekstrom M, Jakobsen L, Karlsson JC. Methodological implications, generalization, scientific inference, models (Part II). In: Explaining society. Critical realism in the social sciences. New York: Routledge; 2002.
22. Pasteur L. Inaugural lecture as professor and dean of the faculty of sciences. In: Peterson H, editor. A treasury of the world's greatest speeches. Douai, France: University of Lille 7 Dec 1954.
23. Swineburne R. Simplicity as evidence for truth. Milwaukee: Marquette University Press; 1997.
24. Sakar S, editor. Logical empiricism at its peak: Schlick, Carnap and Neurath. New York: Garland; 1996.
25. Popper K. The logic of scientific discovery. New York: Basic Books; 1959. 1934, trans. 1959.
26. Caws P. The philosophy of science. Princeton: D. Van Nostrand Company; 1965.
27. Popper K. Conjectures and refutations. The growth of scientific knowledge. 4th ed. London: Routledge and Kegan Paul; 1972.
28. Feyerabend PK. Against method, outline of an anarchistic theory of knowledge. London, UK: Verso; 1978.
29. Smith PG. Popper: conjectures and refutations (Chapter IV). In: Theory and reality: an introduction to the philosophy of science. Chicago: University of Chicago Press; 2003.
30. Blystone RV, Blodgett K. WWW: the scientific method. CBE Life Sci Educ. 2006;5:7-11.
31. Kleinbaum DG, Kupper LL, Morgenstern H. Epidemiological research. Principles and quantitative methods. New York: Van Nostrand Reinhold; 1982.
32. Fortune AE, Reid WJ. Research in social work. 3rd ed. New York: Columbia University Press; 1999.
33. Kerlinger FN. Foundations of behavioral research. 1st ed. New York: Holt, Rinehart and Winston; 1970.
34. Hoskins CN, Mariano C. Research in nursing and health. Understanding and using quantitative and qualitative methods. New York: Springer; 2004.
35. Tuckman BW. Conducting educational research. New York: Harcourt, Brace, Jovanovich; 1972.
36. Wang C, Chiari PC, Weihrauch D, Krolikowski JG, Warltier DC, Kersten JR, Pratt Jr PF, Pagel PS. Gender-specificity of delayed preconditioning by isoflurane in rabbits: potential role of endothelial nitric oxide synthase. Anesth Analg. 2006;103:274-80.
37. Beyer ME, Slesak G, Nerz S, Kazmaier S, Hoffmeister HM. Effects of endothelin-1 and IRL 1620 on myocardial contractility and myocardial energy metabolism. J Cardiovasc Pharmacol. 1995;26(Suppl 3):S150-2.
38. Stone J, Sharpe M. Amnesia for childhood in patients with unexplained neurological symptoms. J Neurol Neurosurg Psychiatry. 2002;72:416-7.
39. Naughton BJ, Moran M, Ghaly Y, Michalakes C. Computer tomography scanning and delirium in elder patients. Acad Emerg Med. 1997;4:1107-10.
40. Easterbrook PJ, Berlin JA, Gopalan R, Matthews DR. Publication bias in clinical research. Lancet. 1991;337:867-72.
41. Stern JM, Simes RJ. Publication bias: evidence of delayed publication in a cohort study of clinical research projects. BMJ. 1997;315:640-5.
42. Stevens SS. On the theory of scales and measurement. Science. 1946;103:677-80.
43. Knapp TR. Treating ordinal scales as interval scales: an attempt to resolve the controversy. Nurs Res. 1990;39:121-3.
44. The Cochrane Collaboration. Open Learning Material. www.cochrane-net.org/openlearning/html/mod14-3.htm. Accessed 12 Oct 2009.
45. MacCorquodale K, Meehl PE. On a distinction between hypothetical constructs and intervening variables. Psychol Rev. 1948;55:95-107.
46. Baron RM, Kenny DA. The moderator-mediator variable distinction in social psychological research: conceptual, strategic and statistical considerations. J Pers Soc Psychol. 1986;51:1173-82.
47. Williamson GM, Schultz R. Activity restriction mediates the association between pain and depressed affect: a study of younger and older adult cancer patients. Psychol Aging. 1995;10:369-78.
48. Song M, Lee EO. Development of a functional capacity model for the elderly. Res Nurs Health. 1998;21:189-98.
49. MacKinnon DP. Introduction to statistical mediation analysis. New York: Routledge; 2008.
4 Design and Interpretation of Observational Studies: Cohort, Case-Control, and Cross-Sectional Designs
Martin L. Lesser
P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology: A Guide for Clinical Investigators, DOI 10.1007/978-1-4614-3360-6_4, © Phyllis G. Supino and Jeffrey S. Borer 2012
56 M.L. Lesser
In both of these study designs, the timing of the suspected risk factor exposure in relation to the development or diagnosis of the disease is important. Both study designs consider the situation where exposure to the risk factor precedes the disease. While such designs cannot prove causality (as will be discussed below), this ordering of exposure and disease is a necessary condition for causality.

A third type of commonly used observational design is the cross-sectional study. As will be discussed below, this design does not specifically examine the timing of exposure and disease.

It should be pointed out that case-control and cohort study designs are not necessarily restricted to the study of risk factors for a disease, per se. For example, if we wanted to conduct a study to determine risk factors for a patient dropping out of a clinical trial, we could select "cases" to be those who dropped out of a clinical trial and "controls" would be those who did not drop out of the clinical trial. Of course, dropping out of a clinical trial is not a "disease" (we might refer to it as an "outcome"), yet it can be studied in the context of a case-control study design.

The case-control, cohort, and cross-sectional studies are considered observational study designs, which means that no particular therapeutic or other interventions are being purposively applied to the subjects of the study. The subjects of the study simply are being observed in their natural settings to determine, in this example, how many developed lung cancer or how many were smokers. A study design where an intervention is purposively applied to subjects to determine, for example, whether one treatment modality is better than another would be called an experimental design or, more specific to biomedical research, a clinical trial in which the intervention (e.g., drug, device, etc.) is assigned to the subject as per protocol. (For detailed discussions of studies of interventions and how to prepare for them, the reader is referred to Chaps. 5 and 6.)

The important issue of whether to choose a case-control or cohort study design for a particular research study will be discussed later in this chapter. Each has relative advantages and disadvantages that need to be weighed when such a choice is being considered.

Cohort Studies

Basic Notation

In the most general setting, we will hypothesize that exposure (E) to a particular agent, environmental factor, gene, life event, or some other specific factor increases the risk of developing a particular disease (D) or condition. Perhaps, a better way to state the hypothesis would be that "exposure is associated with the disease."

More formally, we might use the following hypothesis testing notation:
H0: Exposure to the factor is not associated with an increased risk of developing the disease.
HA: Exposure to the factor increases the risk of developing the disease.
In statistical terms, H0 and HA are the null and alternative hypotheses, respectively. (A discussion of hypothesis specification and testing can be found in Chaps. 3 and 11 in this text.) As in most hypothesis testing problems, the objective is to refute the null hypothesis and demonstrate support for the alternative hypothesis.

It is important to note that the hypotheses relating E and D do not use the word "cause" because in observational studies, we cannot prove causality; we can only hope to show that an association exists between E and D which may not necessarily be causal. We will have more to say about establishing causality from observational studies later in this chapter.

Selection of Exposed Subjects

In order to conduct a cohort study, one must first select subjects who have been exposed to the hypothesized risk factor. It is not the purpose of this chapter to provide detailed guidance on alternative sampling methodologies, which are discussed in greater detail in Chap. 10. Here, our goal is to provide general guidance as to how to sample subjects and from where they might be sampled, with the specific details left to the reader in consultation, perhaps, with a statistician or epidemiologist.

Definition of the Exposure

To select exposed subjects, there must be a clear definition of what it means to be, or have been, exposed to the risk factor under study. Suppose, for example, a study was conducted to determine the effect of exposure to heavy metals (e.g., gold, silver, etc.) on semen and sperm quality in men during their peak reproductive years. We might enlist the support of a company that works with heavy metals in a factory setting and then obtain seminal fluid samples from men working in that factory. However, we would still need to know what it means to be exposed. Exposure can be defined in many ways. For example, just working in that factory environment for at least 6 months might be one definition of exposure; another definition might involve the direct measurement of heavy metal particles in the factory or on a detector worn by each factory worker from which a determination of exposure might be made based on some minimum threshold exposure level indicated on the detector. If one were to study the effect of cigarette smoking in pregnant women on the birth weight of newborns, once again, one would need to have a definition of what it means to be a smoker during pregnancy: is having smoked one cigarette during pregnancy enough to define the smoking status or does it need to be a more consistent and higher frequency of cigarettes during the pregnancy? As for measurability, it is desirable but not always possible to define exposure based on some directly measurable quantity.

Sources of Exposed Subjects
Where might exposed subjects be found? Certainly, in the prior example of occupational exposure, one might look to identify potentially exposed subjects from the roster of companies in certain lines of manufacturing or other work, labor unions, or other organizations or groups of individuals that would be associated with a particular occupation and, potentially, with such an exposure. One also might enroll persons living near environmental hazards or persons with certain lifestyles, such as those who regularly attend an exercise gym. In an epidemiologic study of long-term effects of prescription drugs, one might utilize a roster or list of individuals who have been prescribed a certain type of drug. When selecting cohorts of exposed subjects, an attempt should be made to select these cohorts for their ability to facilitate the collection of relevant data, possibly over a long period of time. For example, there are several large-scale prospective cohort studies that involve physicians [1, 2].

Sources of Exposure Information
To determine whether or not a subject has been exposed to a particular risk factor, the investigator has several sources of information that might be used for making this determination (Table 4.1). First, preexisting records (medical charts, school records, etc.) might be used for determining whether a particular exposure occurred. While preexisting records may be easy and inexpensive to retrieve, they may be inaccurate with respect to the information that an investigator needs in his or her research investigation because data in the chart were not collected with this research study in mind; rather, the data were collected for clinical reasons only.

Table 4.1 Sources of exposure information
Preexisting records
Interviews, questionnaires
Direct physical examination or tests
Direct measurement of the environment
Daily logs

A second source of exposure information, which represents an improvement upon preexisting records, is self-reported information (e.g., interviews or questionnaires that may be administered to prospective participants in the cohort study). This approach allows the investigator some flexibility about which questions should be asked and how they should be asked, which might not be available in preexisting records. Of course, conducting interviews or administering questionnaires has associated costs that may be substantially greater than retrieving preexisting records or charts.
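Whatever source supplies the data, the exposure definition discussed earlier has to be applied to it as an explicit rule. The sketch below encodes the two alternative definitions from the heavy metal example: at least 6 months of factory work, or a detector reading at or above a minimum threshold. The `Worker` record, the threshold value, and the sample workers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    worker_id: str
    months_in_factory: float
    detector_reading: float  # hypothetical cumulative reading, arbitrary units


MIN_MONTHS = 6.0     # tenure-based definition taken from the text's example
MIN_READING = 50.0   # invented threshold for the detector-based definition


def exposed_by_tenure(worker):
    """Exposed if the worker spent at least MIN_MONTHS in the factory."""
    return worker.months_in_factory >= MIN_MONTHS


def exposed_by_detector(worker):
    """Exposed if the detector reading reaches the minimum threshold."""
    return worker.detector_reading >= MIN_READING


workers = [
    Worker("A", months_in_factory=24, detector_reading=80.0),
    Worker("B", months_in_factory=8, detector_reading=12.0),
    Worker("C", months_in_factory=3, detector_reading=65.0),
]

# Workers whose exposure status depends on which definition is adopted.
discordant = [
    w.worker_id
    for w in workers
    if exposed_by_tenure(w) != exposed_by_detector(w)
]
print(discordant)
```

Listing the discordant subjects is one way to see how much the composition of the exposed cohort depends on the definition chosen before recruitment begins.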
Beyond direct interviews and questionnaires, the investigator also can perform physical examinations or tests on individual subjects to determine certain exposures. Direct measurement of environmental variables (e.g., in an occupational exposure type of cohort study) also would be reasonable. Of course, these approaches to determining exposure status generally have higher associated costs and logistical difficulties than do interviews, questionnaires, or use of preexisting records. Finally, the investigator might ask subjects to maintain daily logs of certain activities, environmental exposures, foods, etc., in order to determine levels of exposure over time. Daily logs have the advantage of providing information on a detailed and regular basis but have the shortcoming of being inaccurate due to the self-report nature of a daily log.

In summary, there are many sources of exposure information available to the investigator. The use of a particular source depends on its relative advantages and disadvantages with respect to accuracy, feasibility, and cost.

Table 4.2 Sources of outcome information
Death certificates
Physician and hospital records
Disease registries
Self-report
Direct physical examination or tests

need to be considered. For example, in our hypothetical study on heavy metal exposure and male fertility, it might be convenient to select controls from the business offices of the same company which might be located at some distance from the factory. However, if one were to select office workers as potential unexposed controls, the investigator would have to be careful that those potential controls are not regularly exposed to the heavy metal factory. This could happen if, for example, the vice president for quality control, who worked in the business office, made daily tours of the factory and, therefore, was exposed (albeit a small amount of exposure) to the heavy metals.
may not have access to that information based on his or her immediate hospital records.

Disease registries can be useful sources of information, but, once again, they are very similar to physician and hospital records in that disease registries are often specific to a particular hospital or large regional health area. Also, confidentiality issues may preclude the ability to access records in disease registries for subjects.

Self-report (described in detail in Chap. 8) is a relatively inexpensive and logistically simple method for determining outcome but can be inaccurate because patients may not be cognizant of the subtleties of various diseases or outcomes that have been diagnosed. However, written permission from the patient sometimes can be obtained for the investigator to contact the patient's physicians and hospital records in order to make definite ascertainment of whether or not an outcome occurred.

Finally, direct physical examination or tests conducted on the subject might reveal whether an outcome has occurred, of course, depending on the nature of the outcome being studied. Once again, this type of information might be very accurate but could be costly or logistically difficult to obtain in all subjects.

In sum, different sources of outcome information have their advantages and disadvantages relative to accuracy, logistics, and cost and should be weighed carefully by the investigator in designing a cohort study.

Confounding in Cohort Studies

Nature of the Problem
While the identification of a potential unexposed group might seem rather straightforward in many study designs, there is always an underlying problem in the choice of these unexposed controls, i.e., confounding. Essentially, confounding can be described in two ways. It is the phenomenon that occurs when an exposure and a disease are not associated but a third variable (known as the confounding variable) makes it appear that there is an association between exposure and disease. Conversely, confounding can also occur when a third variable makes it appear that there is no association between an exposure and a disease when, in fact, there is.

Before providing concrete examples of confounding, it is important to formally define the concept. Let E denote the exposure and D denote the disease being studied. A third factor, F, is called a confounding variable if it meets two criteria: (1) F is associated with exposure, E; and (2) independent of exposure, F is associated with the risk of developing the disease, D. It should be emphasized that a factor, F, must meet both of these conditions in order to be a confounder. Often, in error, research investigators treat variables as confounders when they only meet one of those criteria (Table 4.3).

Table 4.3 Criteria for confounding
1. The presumed confounder (F) is associated with the exposure (E)
2. Independent of exposure, F must be associated with the risk of disease (D)

As an example of confounding, suppose that an investigator wished to determine whether smoking during pregnancy was a risk factor for an adverse outcome (defined as spontaneous abortion or low birth weight). The investigator would recruit two cohorts of pregnant women, one whose members smoke while pregnant and the other whose members do not. (The finer details of how to identify and recruit these cohorts are not within the scope of this chapter.) The two cohorts are then followed through their pregnancies, and the rates of adverse outcomes are compared (using a measure known as relative risk, which will be described later). Further, suppose that the investigator does find an increased risk of adverse outcomes in the smoking group. He submits his results to a peer-reviewed journal but is unsuccessful in gaining publication because one of the reviewers notes that the explanation for the increased risk may not be due to smoking, but, rather, to the effect of a confounding variable, namely, educational status. Why might educational status be a confounder? First, individuals with low educational levels are more likely to be smokers. (This satisfies criterion #1 of the definition of confounding.) Second, irrespective
60 M.L. Lesser
of smoking, women with low educational levels are at greater risk for adverse maternal-fetal outcomes. (This satisfies criterion #2.) Thus, it is unclear whether the increased risk is attributable to smoking, educational level, or both. How does one eliminate the effect of a confounding variable?
Minimizing Confounding by Matching
One solution to the confounding problem in cohort studies is to match the exposed and unexposed cohorts on the confounding variables. (This approach will be discussed in greater detail later on in the section on case-control designs.) For example, a smoker who did not achieve a high school education would be paired (or matched) with a nonsmoker who was also a non-high school graduate. By matching in this way, the representation of education level will wind up being identical in both cohorts; thus, the effect of the confounding variable is eliminated. Of course, matching could be carried out for multiple confounders, but usually, only two or three are considered for practical reasons.
Although matching exposed and unexposed subjects on confounding variables is theoretically desirable, such matching often is not carried out in cohort studies due to sample size, expense, and logistics. Many cohort studies are rather large, and to perform matching can be practically difficult. Matching in small cohort studies also may be limited by the sample size in that it may be difficult to find appropriate matches for the exposed subjects.
Typically, in cohort studies, confounding variables are dealt with in the statistical analysis phase where adjustments can be made for these variables as covariates in a statistical regression model. Also, it should be pointed out that in cohort studies, which often are conducted over a long period of time, a subject's confounding variable may change over time, and a more complicated accounting for that change would need to be dealt with in the analysis phase. Matching is more common in case-control studies and will be discussed in greater detail below.
Table 4.4 Bias and related problems in cohort studies
1. Exposure misclassification bias
2. Change in exposure level over time
3. Loss to follow-up
4. Nonparticipation bias
5. Reporting bias
Sources of Bias in Cohort Studies
As in any type of study design, there are potential flaws (or biases) that may creep into the study design and affect interpretation of the results. As also noted in Chaps. 5 and 8, bias refers to an error in the design or execution of a study that produces results that are distorted in one direction or another due to systematic factors. In other words, bias causes us to draw (incorrect) inferences based on faulty assumptions about the nature of the data.
There are many types of bias that can occur in research designs. Given in Table 4.4 are some of the more common types that would be encountered in cohort studies. (See Hennekens and Buring 1987 [3] for a more complete description.)
1. Exposure Misclassification Bias. This type of bias occurs when there is a tendency for exposed subjects to be misclassified as unexposed or vice versa. The example cited above in selection of controls is an example of misclassification bias. In that example, the quality control personnel who work in the white-collar business office might be classified as unexposed when, in fact, they are routinely exposed to the heavy metals because they tour the factory twice a day (even though they do not work in the factory). Typically, exposure misclassification bias occurs in the direction of erroneously classifying an individual as unexposed when, in fact, he or she is exposed. This would have the effect of reducing the degree of association between the exposure and the disease. In other words, if, in fact, exposure did increase the risk of disease, it is possible that we would declare little or no association. If the bias went in the other direction (i.e., unexposed subjects are misclassified as exposed), then we run the risk of finding an
4 Design and Interpretation of Observational Studies 61
association when, in fact, none exists. A solution to the misclassification problem is to have strict, measurable criteria for exposure. Of course, the ability to accurately measure or determine exposure may be limited by available resources.
2. Change in Exposure Level over Time. Bias may occur when a subject's exposure status changes with time. For example, a subject in the smoking cohort may quit smoking 10 years after high school. Is that subject in the smoking or nonsmoking cohort? In cases like this, it is common to classify the subject's time periods with respect to smoking or nonsmoking and to use the person-years method (see Kleinbaum et al. 1982 [4]) to analyze the data. Using this method, the subject is not classified as exposed or unexposed; only his follow-up time periods are. Nevertheless, if crossover from one cohort to the other occurs, particularly in one direction only (e.g., smokers become nonsmokers, but nonsmokers do not start to smoke after high school), this may impart a bias that confounds interpretation of the study. For example, if many quitters develop lung cancer (presumably because they were exposed for several years), this occurrence might reduce the observed association between smoking and lung cancer.
3. Loss to Follow-up Bias. Bias can occur when members of one of the groups are differentially lost to follow-up compared to the other, and the reason for their loss is related, in part, to their level of exposure. Consider the following hypothetical observational study that evaluates newly diagnosed heterosexual AIDS patients. The two cohorts in this example are those patients who were IV drug users (IVDUs) and those who were not. Both cohorts are started on the same antiretroviral therapy at diagnosis. The research question is whether there is a difference between the two groups in viral load at the end of one year.
As the study progresses, some patients die. To illustrate this bias using an exaggerated scenario, suppose that there are 50 IVDUs (the exposed cohort) and 50 non-IVDUs (the unexposed cohort), and, of the 50 IVDUs, 20 have died before the end of the 1-year follow-up period, leaving only 30 with measured viral load levels at follow-up (as there is no follow-up viral load recorded on the 20 IVDUs who died). The effect of this might be that the 30 IVDUs who completed the 1-year follow-up might have been, in general, healthier than the IVDUs who died, leading to a biased comparison.
4. Nonparticipation Bias. Nonparticipation bias is somewhat similar to loss to follow-up bias except that the bias occurs at the time of enrollment into the study. Suppose we were conducting a cohort study to determine whether child abuse is a risk factor for psychiatric disorders in teenage years. Although this might be a problematic study to conduct, due to the sensitive nature of the risk factor (i.e., child abuse), one might consider contacting families who were seen at a psychiatric facility once child abuse was discovered and asking them to participate in the study to follow their children through their teenage years to determine their psychiatric status. Controls would be families or subjects without histories of abuse who would be followed in the same way. In a situation such as this, it is likely that many families with histories of child abuse would decline to participate and that those who would participate might be psychologically healthier, rendering them unrepresentative of the general group of families with child abuse. Furthermore, if this group were, indeed, psychologically healthier, then the incidence of teen psychological disorders might be lower, thus attenuating the true association between child abuse and psychological disorders.
5. Reporting Accuracy Bias. Reporting accuracy bias in cohort studies is similar to that in case-control studies. It refers to a situation where either the exposed or unexposed subjects deliberately misreport either their exposure or their outcome status, usually due to the sensitive nature of the variables being studied. (See the section on case-control studies for examples.)
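The person-years approach mentioned under item 2 (change in exposure level over time) can be illustrated with a short sketch. It is a minimal illustration rather than the procedure given by Kleinbaum et al.: the function name, the layout of the follow-up records, and all of the data are hypothetical. Each subject contributes follow-up time to the exposed or unexposed category period by period, and each event is counted in the category in which it occurs.

```python
from collections import defaultdict


def person_year_rates(subjects):
    """Return events per person-year for each exposure category.

    `subjects` is a list of follow-up histories. Each history is a list of
    (exposure_label, years, event_occurred) periods. All data are hypothetical.
    """
    years = defaultdict(float)
    events = defaultdict(int)
    for history in subjects:
        for label, duration, event in history:
            years[label] += duration      # time at risk in this category
            if event:
                events[label] += 1        # event counted where it occurred
    return {label: events[label] / years[label] for label in years}


# The first subject smokes for 10 years and then quits, so the follow-up
# time is split between the two categories instead of classifying the person.
subjects = [
    [("smoking", 10, False), ("nonsmoking", 15, False)],
    [("smoking", 20, True)],
    [("nonsmoking", 25, False)],
    [("nonsmoking", 10, True)],
]
rates = person_year_rates(subjects)
```

In this made-up data set the smoking category accumulates 30 person-years and the nonsmoking category 50. An event that occurs after a subject has quit would be counted against nonsmoking time, which is the mechanism by which one-directional crossover can reduce the observed association.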
time to see who developed an MI. Likewise, 1,000 OC nonusers were followed in a similar way. The incidence rates of MI were 0.03 and 0.003, respectively, yielding a RR = 10, which means that women who used OC had 10 times the risk of MI of nonusers. For determining whether a RR is significantly different from 1, the reader is referred to Kleinbaum et al. 1982 [4].
Prospective Versus Retrospective Cohort Designs
One usually thinks of a cohort study as prospective because it looks forward from an exposure to the subsequent development of disease. However, a cohort study can be classified as retrospective or prospective, depending on when it is being conducted with respect to the outcome. If, at the time the investigator initiates the study, the outcome (e.g., disease) has not yet occurred in the study subjects, then the study is prospective because the investigator must follow the subjects in real time in order to ascertain outcome status. On the other hand, if the study is conducted after the exposures and outcomes have already occurred, this type of design often is classified as a retrospective cohort study.
For example, referring back to the section on confounding, there is general consensus that the study of smoking during pregnancy as a risk factor for adverse maternal-fetal outcomes is of the prospective type because, as described, the investigator must wait from the time of exposure to observe the outcome of the pregnancy. However, suppose that the study were to be conducted by reviewing patient charts from 2 years prior to the initiation of the study and identifying women who smoked and did not smoke during pregnancy at that time. Then, the investigator would determine the pregnancy outcome from the chart data (i.e., the outcomes are already known and documented in the charts). This is an example of what many term a retrospective cohort study. (As noted in Chap. 1, DeAngelis [6] and others would refer to this as a historical or nonconcurrent cohort study.)
To the reader, the distinction between retrospective and prospective cohort studies may not seem important since the logic of the two approaches is essentially the same. However, in a prospective cohort study, the investigator typically has more quality control of the conduct of the study and how data are to be collected than in a retrospective study because the former is being conducted in real time. In a retrospective cohort study, the investigator is limited by the nature and quality of data already available, which most likely were collected for routine clinical purposes using criteria and standards that are different from those of the current research investigation.
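The relative risk calculation described before the discussion of prospective and retrospective designs (incidence rates of 0.03 and 0.003, RR = 10) can be reproduced in a few lines of Python. This is only a sketch: the function name is hypothetical, and the cohort size of 1,000 OC users is an assumption, since the text states the size of the nonuser cohort only. The event counts of 30 and 3 follow from the stated rates under that assumption.

```python
def relative_risk(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Relative risk = incidence in the exposed / incidence in the unexposed."""
    if events_unexposed == 0:
        raise ValueError("relative risk is undefined when the unexposed incidence is 0")
    # Cross-multiplying keeps the arithmetic in integers until the final division.
    return (events_exposed * n_unexposed) / (events_unexposed * n_exposed)


# Assumed counts: 30 MIs among 1,000 OC users (0.03) and
# 3 MIs among 1,000 nonusers (0.003).
rr = relative_risk(30, 1000, 3, 1000)
print(rr)  # 10.0
```

Whether an RR of this size differs significantly from 1 is a separate question that this sketch does not address.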
First, case-control studies often involve the recall of information about past exposures. This type of information often is obtained by interviewing the subject him or herself or by interviewing family members or friends who might have such information. Of course, some exposure information may also be gleaned from patient charts or other documents that exist independent of an interview with a subject. It stands to reason that if the interval of time between diagnosis of the disease and the interview for exposure information is lengthy, then the ability to properly recall exposures will be reduced. Certain exposures such as smoking are not likely to be forgotten, but, for example, if we were studying more complex and/or rare exposures, the ability to accurately recall such exposures and associated details would decrease over time. Thus, the shorter the interval between diagnosis and gathering of exposure information, the more likely the recall of information will be accurate.
A second reason for selecting incident cases is illustrated by the following example. Suppose we were studying the association between smoking and lung cancer. We might go to the tumor registry of our hospital and find 1,000 lung cancer cases that were diagnosed over the past 10 years. The next step in our research design would be to contact these subjects and ask them whether or not they were smokers prior to their development of lung cancer. One of the problems associated with this approach is that out of those 1,000 lung cancer cases diagnosed over the past 10 years, many will have expired before we would be able to contact them. Cases that are still alive probably would fall into two broad groups: (a) those who have been recently diagnosed and have not yet had the disease long enough to die from it and (b) those who were diagnosed in the more distant past but who have survived. The latter group (b) is likely made up of those with lower grade disease or those who have been more successful in combating their disease with therapy. That group may be very different from those who were diagnosed in the more distant past who already have died of their disease. In fact, it is conceivable that smoking may not just be associated with developing lung cancer but may, instead, be associated with its lethality. Thus, it is possible that the smokers are those who died early in the group that was diagnosed in the more distant past whereas nonsmokers are the ones who have survived despite their disease. In this case, when comparing this biased group of cases to non-cancer controls, we would observe an attenuated association between smoking and lung cancer. This bias would provide potentially misleading results.
On the other hand, if one were simply to sample recently diagnosed cases, and assuming that the disease is not rapidly fatal (even small cell lung cancer patients would survive to be interviewed), almost all of the available lung cancer cases would be included in the study since, at that point, no one would be lost to follow-up or death. Therefore, the sample would not be biased as it might have been had the sampling methodology been based on prevalent case selection.
Selection of Controls
Perhaps the most difficult aspect of conducting a case-control study is the selection of controls. In principle, controls should be a group of individuals who are free of the disease or outcome in question (i.e., non-cases) and are as similar as possible in all other respects to the case group.
Definition of Controls
Controls should be free of the disease in question. One of the difficulties in selecting controls is determining how far we should go to ensure that someone is free of the disease or outcome. For example, if we were to select as a control for our lung cancer cases an individual who has never had a diagnosis of lung cancer, do we need to perform a bronchoscopy on that patient for certainty of that fact, or do we simply take his self-report as the truth that he has never had lung cancer? Of course, there are subtleties that arise when subclinical disease exists at the time an individual is being selected as a control. These are fine points that would need to be dealt with in a very careful manner, in consultation with a statistician or an epidemiologist.
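The argument for preferring incident cases, made above with the tumor registry example, can be checked with simple arithmetic. The sketch below uses entirely hypothetical numbers: the smoking prevalences and survival fractions are assumptions chosen for illustration, not data from any study. It shows how sampling only the surviving (prevalent) cases can attenuate the odds ratio when smokers with lung cancer die sooner than nonsmokers with lung cancer.

```python
from fractions import Fraction


def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio from a 2x2 table, kept exact with Fraction."""
    return Fraction(exposed_cases * unexposed_controls,
                    unexposed_cases * exposed_controls)


# Hypothetical registry: 1,000 diagnosed cases, 800 smokers and 200 nonsmokers.
smokers, nonsmokers = 800, 200
# Hypothetical controls: 30 smokers and 70 nonsmokers per 100.
ctrl_smokers, ctrl_nonsmokers = 30, 70

# All diagnosed cases, as an incident-case design would approximate.
or_incident = odds_ratio(smokers, nonsmokers, ctrl_smokers, ctrl_nonsmokers)

# Assumed survival: 10% of smoking cases and 40% of nonsmoking cases are alive.
alive_smokers = smokers * 10 // 100        # 80 surviving smokers
alive_nonsmokers = nonsmokers * 40 // 100  # 80 surviving nonsmokers
or_prevalent = odds_ratio(alive_smokers, alive_nonsmokers,
                          ctrl_smokers, ctrl_nonsmokers)

print(float(or_incident), float(or_prevalent))
```

Under these assumptions the odds ratio falls from 28/3 (about 9.3) among all diagnosed cases to 7/3 (about 2.3) among survivors, which is the attenuation described in the text.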
At this point, it is instructive to provide an example of where verification of non-disease status might be problematic and require some additional thought about the design of the study. Suppose we were conducting a case-control study to determine whether there is an association between a high fat diet and colon cancer. Specifically, our hypothesis is that colon cancer cases will report a higher frequency of high fat diets than non-cancer controls. To test our hypothesis, we would select our colon cancer cases in some way consistent with the guidelines already stated above and then select controls. One possible source of controls would be adults visiting a large shopping mall. (We might choose to select individuals over 50 years old if our case-control study was designed to answer the question in this population.) Next, we could set up a colon cancer information booth in the mall and invite the passersby to answer a question or two about history of colon cancer and, if they wished, to pick up a fecal occult blood test kit so that they can screen themselves for colon cancer. Those who self-reported that they had never had a diagnosis of colon cancer could be invited to participate as controls for our case-control study. We might use as an exclusion criterion a positive test result on the fecal occult blood test (even though that finding obviously does not equate to a diagnosis of colon cancer).
A member of our investigative team might object to this approach since self-report and fecal occult blood testing, in and of themselves, would not completely verify the disease-free status of someone passing through the shopping mall. Thus, we might be more rigorous in our selection of controls. This might be done by enlisting the collaboration of a gastroenterologist who performs colonoscopies and selecting from his or her colonoscopy practice those subjects who have colonoscopies with a benign or negative outcome. Such outcomes might include diverticulosis, inflammatory bowel disease, a benign polyp, other benign tumors of the colon, etc. If we were to view colonoscopy as a close to foolproof way of determining an individual's colon cancer status, then this would be a better way of selecting controls for such a study than selecting them from visitors to a shopping mall (even though colonoscopy, itself, is not infallible). Of course, subjects who have a diagnosis of colon cancer based on the colonoscopy would be excluded from the control group.
The selection of controls from among those undergoing colonoscopy, nonetheless, could introduce a different problem, namely, selection bias. Generally speaking, there are two broad groups of individuals who undergo colonoscopy: (a) those who are symptomatic and who are referred by their physician to a gastroenterologist to determine the cause of their rectal bleeding, abdominal pains, cramping, diarrhea, etc., and (b) those who are asymptomatic who undergo colonoscopy for screening purposes only. However, these two groups differ in ways that can influence the results of the investigation. For example, a high fat diet may not be specific to the risk of colon cancer but may be associated with other intestinal problems (e.g., some of the benign conditions cited above). If this association was not appreciated during the study design stage, and individuals from the symptomatic group were selected as controls, their rate of high fat diets would be spuriously inflated, thus reducing the observed degree of association between fatty diets and colon cancer. On the other hand, the asymptomatic individuals who undergo cancer screening are more likely to be health-conscious individuals since they are voluntarily attending a screening program. Because these individuals are more health conscious, they may have an artificially lower level of fat intake than a standard population of individuals without colon cancer. Accordingly, when we compare the fat intake for this control group against the colon cancer group, we may observe an exaggerated association because of the artificially reduced levels of fat intake in our control group.
There are several ways to address this problem, none of which constitutes a perfect resolution of the issue. In this example, some investigators might employ only one of the control groups with the understanding that the bias would need to be considered when interpreting the results. Thus, for example, if the benign disease group were used as the control and only a small
association was observed (i.e., odds ratio [OR] is close to 1), the association would be inconclusive because of the directionality of the bias. However, if a large and statistically significant association (i.e., OR > 1) were found, then, because the bias is working against the hypothesis of positive association, this larger OR would provide evidence in favor of the association. Another approach might be to include both groups as separate controls and, knowing the opposite directions of the bias, compare cases to each control group and draw inferences accordingly.
Sources of Controls
Recall that in a case-control study, cases of disease are most conveniently selected from a medical practice or facility, but controls need not be selected from such sources even though it might also be convenient to do so. Controls also can be selected from the community at-large using sophisticated sampling techniques or by simply placing advertisements in community media to recruit individuals who meet the control criteria. Very often, investigators will collaborate with various work places that will permit access to their employees as potential controls for a particular study. Over the years, departments of motor vehicles often have served as a source of controls for many research studies. Occasionally, close friends, relatives, or neighbors of an individual case will serve as controls. Choosing such individuals can solve a myriad of problems because this type of control sometimes will share the same environmental conditions as the case or have a similar genetic disposition. The approach also facilitates cooperation because, very often, friends, relatives, or neighbors will cooperate with an investigator who is also working with that individual's relative. However, selecting friends and relatives as controls may have adverse consequences because it often forces the cases and controls to be similar on the very risk factors being investigated, thus reducing the association between the risk factor and disease.
In summary, the selection of controls requires careful thought and knowledge of the underlying subject matter.
Confounding in Case-Control Studies
The Nature of the Problem
The impact of confounding on interpretation of findings from cohort studies has previously been addressed. The reader should note that its adverse effects are not limited to cohort studies but represent a potentially serious problem in case-control designs as well. Schlesselman [7] provides interesting examples of such confounding, which we now describe.
Consider a hypothetical case-control study designed to test the hypothesis of association between alcohol use (E) and lung cancer (D). Cases of lung cancer are selected for study, and a group of controls without lung cancer is identified. Suppose that the rate of alcohol use in the lung cancer cases is found to be significantly greater than that of the controls. The conclusion would be that alcohol use increases the risk of lung cancer. However, one might criticize the study because smoking should have been considered a confounding variable.
Why is smoking a confounding variable? One needs to refer back to the definition. Certainly, smoking is associated with lung cancer (criterion #2), independent of any other factors. However, smoking's association with lung cancer does not, in itself, make it a confounding variable. Smoking must also be associated with alcohol use (criterion #1). How is smoking associated with alcohol use? The answer lies in the fact that individuals who drink alcohol tend to have a higher rate of smoking than individuals who do not drink alcohol. Therefore, smoking is related both to alcohol use (E) and lung cancer (D) and is, therefore, a confounding variable.
As another example of a confounding variable that may obscure an association between a putative risk factor and disease, consider a case-control study to determine whether there is an association between oral contraceptive (OC) use and MI in women. Once again, one would pick cases of women who had suffered a recent MI and determine whether or not they had used OC in, say, the past 5 years. A possible result of this study would be that the level of OC use was not
substantially greater in the MI cases than in the non-MI controls, thereby resulting in the conclusion that there is little or no association between OC use and MI. However, once again, smoking could be considered a confounding factor because it meets the two criteria of a confounder: first, smoking is associated with MI. Second, smoking is associated with OC use. Why is this so? The reason is that women who are smokers are less likely to be prescribed an OC than women who are nonsmokers because of the risk of thrombophlebitis and other cardiovascular disorders. In this example, the OC users were underrepresented in the MI case group because there were many smokers in the MI group, many of whom were never prescribed OC. Thus, the confounding effect of smoking potentially masks a relationship (i.e., reduces the association) between OC use and MI.
Although it is important to identify confounders, it is just as important to recognize factors that may appear to be confounders but, in fact, are not. Once again, two examples from Schlesselman [7] are instructive. Consider a case-control study designed to investigate whether a sedentary lifestyle is a risk factor for MI. Cases are those with a recent history of MI and controls are individuals without MI (appropriately chosen). The exposure variable is (for simplicity) sedentary lifestyle (coded as "no" or "yes"), as derived from some validated measure of physical activity. One might consider levels of fluid intake (F) as a possible confounding variable because physically active, non-sedentary subjects might have higher levels of fluid intake than sedentary subjects; in other words, F is associated with E. Accordingly, we would consider matching cases to controls on fluid intake. However, fluid intake is not a true confounder because there is no known or presumed association between fluid intake and MI (D). Thus, matching on fluid intake is not necessary.
Reducing Confounding by Matching
If confounding is an important problem in epidemiologic studies, how do we deal with it? A common solution is matching. Matching is a technique whereby cases and controls are made to appear similar with respect to one or more confounding variables. When cases and controls are properly matched, the representation of the confounding variables is similar in both groups and, therefore, should have no appreciable effect on the results and interpretation of the case-control study.
Most students in the medical sciences are familiar with the idea of matching since they probably have read many studies where matching was employed. However, it is our objective in this chapter to describe the logistics of matching in somewhat more detail. The first step in matching cases to controls is to identify the confounding variables. The next step is to determine the desired method of matching. Typically, one should not match on more than a few variables (i.e., two or three), but this also depends on the sample size in the case-control study and on the distribution of the confounding variables in the samples being studied. Let us consider a simple example where we have determined that age and sex are important confounders. (It is important to emphasize that, while age, sex, race, and socioeconomic status are four of the most commonly encountered confounders, it is not always necessary to match on any of these variables. The reader should be reminded again that in order for a variable to be a confounder, it must meet the two criteria given in the definition above.)
1. Group Versus Calipers Matching. When age and sex are potential confounders, one way to match cases and controls is to classify male and female subjects into age groupings (a common method of classification for age is by decades, i.e., age 20-29, 30-39, 40-49, 50-59, or 60 and above). This approach would yield up to 10 different age/sex combinations corresponding to each of the 5 age categories cross-classified with sex (male, female). Therefore, if a case were to be chosen and that particular subject was a 30-year-old male, we would choose a control who was a male in the 30- to 39-year age group; these two individuals (the case and the control) would be naturally matched and paired.
The reader should note, however, that there is a disadvantage to creating groups on a measured variable such as age. Suppose, in the
above example, we required a match for a 30-year-old male, and, based on the pool of potential controls, a 29-year-old male and a 39-year-old male were both available. Using the grouping criteria defined above, the 30-year-old male would have to be matched with the 39-year-old male because they were in the same age category. However, it would make more sense to match a 30-year-old male with a 29-year-old male because the two are closer in age.
A solution to this problem is to use what is known as calipers matching whereby, on a measured variable, a control would be matched to a case based on being within a certain number of units away from that case's measurement (hence the use of the term "calipers"). For example, we might define a rule to match age to within (±) three years. In this case, the 29-year-old male is within three years of the 30-year-old male and would be matched to the 30-year-old male, whereas the 39-year-old male would be outside the defined three-year limit. A compromise between broad grouping and calipers would be to arrange the potentially confounding variable (in this case, age) into narrow categories (e.g., 30-33, 34-37, 38-41, etc.). This would reduce the effect of the disparity that occurred in the example given above involving grouping by decades. When using this method for age matching, the investigator must take care to consider the nature of the study population. For example, if one were matching on age using three-year calipers in a case-control study evaluating utilization of health-care services, a 64-year-old case could be matched to any control ranging from 61 to 67 years old. However, in this example, matching a 64-year-old to, say, a 66-year-old in a health services utilization study might result in matching a non-Medicare subject with a Medicare subject. As these two types of patients might have very different utilization patterns, a bias could be introduced into the study design. Similarly, when conducting research with pediatric patients, it is important to match as closely and precisely to actual age as possible (which is equivalent to making the calipers extremely narrow). For example, one would not match children to within three years (e.g., matching a 10-year-old girl to a seven- or 13-year-old girl) since individuals at these ages could have very different outcomes due to variations in socialization, sexual maturity, body size, and other developmental variables. Effective matching, under these circumstances, requires that there be a large pool of available controls to pair with cases.
2. Individual Versus Frequency Matching. Another consideration in matching is whether the investigator wishes to use individual versus frequency matching. Typically, with individual matching, one case and one control are matched to one another (1:1 matching). Occasionally, the statistician or epidemiologist will recommend many-to-one matching, which might involve matching two or three controls to each case. It is uncommon to match more than three controls to a case because it can be shown that the statistical power benefits do not substantially increase beyond two or three controls per case. The reader should keep in mind that if he or she conducts a case-control study with 1:1 matching, it is necessary that there be an equal number of cases and controls. A common misstatement that is seen in many research proposals employing case-control studies is, for example, "there will be 50 cases with disease and they will be matched to 20 controls without disease." If the investigator was thinking of performing individual matching, then this statement makes no sense as it would require a constant ratio of controls to cases. Usually, what the investigator intends is that they will select cases and controls so that, for example, the average age (or sex distribution) of both groups is approximately the same. However, this approach is not matching; it is simply determining how comparable the two groups are after they have been selected. Unless one prospectively selects controls in a deliberate way so as to match them directly to a given case, the term "matching" is not appropriate.
When an investigator does not perform individual matching but instead wants to
70 M.L. Lesser
ensure that the confounding variables have the same joint distributions among both cases and controls, the method of choice is frequency matching. Frequency matching refers to the deliberate and prospective selection of controls so that the joint distribution of the confounding variables is approximately the same in both the case and control groups. As an example, suppose we were performing a case–control study to determine whether maternal smoking during pregnancy was a risk factor for premature birth. Our cases might be 100 premature infants delivered during the past year, and our controls would be drawn from the hundreds of normal term births delivered during the same time period. Further, we have determined that parity (i.e., nulliparous vs. parous) and age (grouped in 3-year intervals) are confounding variables for which matching will be performed. Suppose we have decided, based on statistical power and resources available to conduct the study, that the number of controls will be 250. Further, suppose that in the case group, 10% of the cases were born to nulliparous 30- to 33-year-old women. We would then identify from our vast pool of term-delivery controls all women who are nulliparous 30- to 33-year-olds. From this pool of candidates, we would randomly select 25 nulliparous 30- to 33-year-old women. Selecting 25 at random would assure that 10% of the control group (10% of 250 = 25) would be nulliparous 30- to 33-year-olds. Likewise, suppose that 16% of the cases are parous 25- to 28-year-old women; then, in a similar way, we would identify all parous 25- to 28-year-old women who had full-term deliveries and, from that group, randomly select 40 matching controls, as 40 would constitute 16% of the control group. If we continued in this fashion, we would obtain a control group that had either precisely or approximately the same joint distribution of parity and age as the case group. It is important to note that to use frequency matching, one would need to know the distribution of the confounding variables in the case group prior to selecting the matched controls. This certainly would be workable in a study such as this where ascertainment of smoking status (the risk factor) could be made by chart review so that one could first constitute the case group and then return to select the control group. Frequency matching may be logistically more difficult to conduct in other types of case–control studies, but the concept is still the same.

3. Propensity Matching. A recently developed method for matching cases and controls (which also may be used for matching exposed and unexposed subjects in a cohort study) is known as "propensity scoring" (Rosenbaum and Rubin [8, 9]). Briefly, this method involves predicting whether a subject is a case or a control based on observed predictor covariates. Thus, one subject may be a case and the other a control, but their covariate profiles are similar as reflected by their predicted probability of being in, say, the case group. Specifically, the probability of being a case (i.e., the propensity score) is computed for each subject in the study (both cases and controls) using a statistical method known as multiple logistic regression (see Chap. 11). Then, cases are matched to controls on the propensity score. So, for example, suppose that in a particular study, the score is being computed as a function of age, sex, smoking status, family history, and socioeconomic status. If a particular case has a score of, for example, 0.75, we would try to match this case to a control that also has a score of 0.75. In this way, cases and controls are matched based on a measure of their similarity. An advantage of the propensity score method is that it allows the investigator to match cases and controls on a single criterion (the score) that is a function of multiple confounding variables, rather than having to match on each of the individual confounders.

Sources of Bias in Case–Control Studies

As in cohort studies, case–control studies are subject to a variety of biases. Given below
4 Design and Interpretation of Observational Studies 71
are some of the more common types that may be encountered.

Recall Bias
Recall bias occurs when one of the groups recalls exposure to the risk factor more accurately than the other group. It is not uncommon for recall bias to manifest itself as cases remembering exposures better than controls. As an example, suppose one were conducting a case–control study to examine risk factors for early childhood leukemia. The cases in such a study might be parents of children with leukemia who were diagnosed before their fourth birthday, and the controls might be parents of children who did not have a diagnosis of leukemia. The investigator interviews both groups of parents with respect to exposure to a variety of potential risk factors. It would not be unlikely that the mother of a young child with leukemia would remember many household exposures better than a mother whose child was healthy, since it is human nature to recall antecedent events potentially leading up to a serious disease or traumatic event better than someone who has no reason to remember those events or exposures. Another example of recall bias might be found in a study examining antecedents of lower back pain. Subjects who experience lower back pain probably would have better recall of events related to lifting of heavy objects that may have preceded the diagnosis of the back pain versus those without back pain who may not have any particular reason to remember such events.

Reporting Accuracy Bias
This term refers to lying or deception in the response to questions concerning exposure, as frequently occurs in the setting of case–control studies where sensitive questions are being asked of the subject. A classic example of reporting accuracy bias might be as follows: Suppose one were to conduct a case–control study among women to determine whether the number of sexual partners during the past year is a risk factor for contracting venereal disease (VD). One might conduct this study at a women's health center and select as cases women with newly diagnosed VD. Controls could be women from the same clinic who do not have a diagnosis of VD. The important question in the epidemiologic interview would be "how many sexual partners have you had in the past year?" The responses in the case group (those with VD) might look as follows: 1, 1, 2, 2, 2, 3, 4, 5, 5, 6, 6, 6, 8, 9, and 10. (The responses have been ordered from smallest to largest in order to better visualize the data.) When the control group is asked to respond to the same question, the results might be 1, 1, 1, 1, 1, 1, 1, 2, 2, and 2. Based on these responses, the average number of sexual partners in the case group would be 4.7 versus 1.3 in the control group, thus suggesting (subject to a formal statistical test) that increased number of sexual partners is a risk factor for venereal disease.

Although, at face value, the interpretation of the results might be as just stated, there is a potential reporting accuracy bias. The bias might occur because women who have VD may be more likely to be truthful about the number of sexual partners they have had, whereas women who are controls may not be, thus causing the average number of sexual partners to be artifactually greater in the case group than in the control group. Why might such a bias exist? One hypothesis is that individuals with a particular disease (in this case, VD) tend to be more candid with their physicians about past medical history and behaviors [10]. In fact, many patients (rightly or wrongly) believe that if they are truthful, then their physicians may be able to better treat their disease than if they are not truthful. Assuming that this women's health center serves women who are married and those with boyfriends, male partners, etc., women in the control group might be less likely to be truthful about the number of sexual partners because they would perceive that they have something to lose and nothing to gain by admitting multiple sexual partners. Of course, the ethical conduct of such a study would require an assurance of confidentiality with respect to responses to the epidemiologic questions, but such an assurance does not guarantee that subjects will cooperate when confronted with a highly personal and sensitive question.
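
As a quick check of the arithmetic in the reporting accuracy example, the two group averages can be recomputed from the responses listed in the text. This is only a minimal sketch: the variable names are illustrative, and the data are the hypothetical responses quoted above, not results from a real study.

```python
from statistics import mean

# Hypothetical interview responses quoted in the text
# (number of sexual partners reported for the past year).
case_responses = [1, 1, 2, 2, 2, 3, 4, 5, 5, 6, 6, 6, 8, 9, 10]
control_responses = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2]

case_mean = mean(case_responses)        # 70 / 15
control_mean = mean(control_responses)  # 13 / 10

print(f"cases:    n={len(case_responses)}, mean={case_mean:.1f}")
print(f"controls: n={len(control_responses)}, mean={control_mean:.1f}")
```

The computation reproduces the averages of 4.7 and 1.3, but it cannot tell whether the difference reflects behavior or candor; that distinction depends on the study design, not on the arithmetic.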
during which various calculations are carried out to quantify the relationship between the presumed risk factor and the disease under investigation. The most common measure used for drawing inferences in a case–control study is the odds ratio (OR). The calculation and interpretation of the OR can be illustrated by reference to Fig. 4.3. Here, a and c, respectively, represent the number of cases who were exposed and not exposed to the risk factor. Likewise, b and d, respectively, represent the number of controls who were exposed and not exposed. In a case–control study, one usually selects cases so that the column total of cases (a + c) is fixed at some predetermined sample size; likewise for the control column (b + d). Frequently, the cases and controls are sampled in equal numbers (so that a + c = b + d), but there are circumstances where equality may not hold, as pointed out in the section on matching.

In the case group, the fraction of subjects who were exposed to the candidate risk factor is a/(a + c); the corresponding proportion exposed in the control group is b/(b + d). Typically, one might compare the two proportions to determine whether they are different, since if the proportions are the same, that effectively tells us that the risk factor is not associated with the disease; on the other hand, if the proportion of exposed cases is much larger than that of the controls, that would suggest that the risk factor is associated with the disease.

For various mathematical reasons, it is more convenient to express the risk, not as a difference between proportions but as a ratio of odds. To the unfamiliar reader, the odds of an event occurring is defined as the probability that the event will occur divided by the probability that it will not occur. For example, if the probability of an event is 25%, the odds of the event occurring is 25/75 (or, as some would prefer to express it, 1:3 odds). Thus, the odds of exposure among cases is [a/(a + c)]/[c/(a + c)] = a/c, whereas the odds of exposure among controls is [b/(b + d)]/[d/(b + d)] = b/d. If we denote these quantities by O1 and O2, respectively, then OR = O1/O2 = (ad)/(bc). Computation of the OR in this fashion always will result in a positive number unless one or more of the cells in the above 2 × 2 table contains a zero; in the latter instance, it is common to compute the OR by adding 0.5 to a, b, c, and d and using the same formula [5] employed for computation of the relative risk (RR) in a cohort study. Just as in the interpretation of the RR, if OR > 1, this is taken to mean that the exposure to the risk factor increases the risk of disease by that many times or by that fold increase. Thus, for example, if OR = 1.5, this means that individuals with the risk factor are 1.5 times more likely to get the disease than those without the risk factor. Conversely, if OR < 1, exposure to the risk factor is protective. Thus, if OR = 0.5, that means that those with the risk factor are half as likely to get the disease as
those without the risk factor. An OR that is close to 1.0 means the factor is not associated with risk of disease. Figure 4.4 illustrates computation of the OR for a hypothetical case–control study investigating family history of coronary artery disease (CAD) as a risk factor for myocardial infarction (MI) in men. In this example, OR = 1.56, which means that men with a family history of CAD have a 1.56 times greater risk of MI than those without such a family history.

Case–Control and Cohort Designs: Advantages Versus Disadvantages

As with any scientific study design, there are distinct advantages and disadvantages to their uses. Below, we provide a concise listing of some of the important pros and cons of case–control and cohort designs, as identified by Schlesselman [7].

Cohort Studies

Advantages
• Allow complete information on the subject's exposure, including quality control of data, and experience thereafter.
• Provide a clear temporal sequence of exposure and disease.
• Afford an opportunity to study multiple outcomes related to a specific exposure.
• Permit calculation of incidence rates (absolute risk) as well as relative risk.
• Enable the study of relatively rare exposures.
• Methodology and results are easily understood by non-epidemiologists.

Disadvantages
• Not suited for the study of rare diseases because a large number of subjects is required.
• Not suitable when the time between exposure and disease manifestation is very long, although this can be overcome in historical cohort studies.
• Exposure patterns, for example, the composition of oral contraceptives, may change during the course of the study and make the results irrelevant.
• Maintaining high rates of follow-up can be difficult.
• Expensive to carry out because a large number of subjects usually is required.
• Baseline data may be sparse as the large number of subjects often required for these studies does not allow for long interviews.

Case–Control Studies

Advantages
• Permit the study of rare diseases.
• Permit the study of diseases with long latency between exposure and manifestation.
• Can be launched and conducted over relatively short time periods.
• Relatively inexpensive as compared to cohort studies.
• Can study multiple potential causes of disease.

Disadvantages
• Information on exposure and past history primarily is based on interview and may be subject to recall bias.
• Validation of information on exposure is difficult, or incomplete, or even impossible.
• By definition, concerned with one disease only.
• Cannot usually provide information on incidence rates of disease.
• Generally incomplete control of extraneous variables.
• Choice of appropriate control group may be difficult.
• Methodology may be hard to comprehend for non-epidemiologists, and correct interpretation of results may be difficult.

Cross-Sectional Studies

via this study design would not shed any light on this question because (given the way the study was conducted) it would not be known whether the sweetener exposure came before or after the diagnosis of diabetes. Obviously, to be implicated in a causal process, the exposure would have had to occur prior to the disease. (This would be a necessary but not sufficient condition for causality [see below].)

Thus, one of the disadvantages of a cross-sectional study is that a causal (or suggested causal) association cannot be determined. Another disadvantage is that rare diseases are difficult to study since a very large number of subjects would be needed to yield a sufficient number of diseased individuals (likewise, if the prevalence of the risk factor was rare). Despite these important drawbacks, cross-sectional designs usually are quicker and less expensive to conduct than case–control or cohort studies since no follow-up is needed. Another advantage of the cross-sectional study is that it can provide some evidence suggesting an association between exposure and disease and, thus, help in designing a more formalized cohort or case–control study.
in order to establish causality, all five of the following criteria must be satisfied:

1. Temporal association. If causation is to hold, then exposure must precede the disease. Sometimes, the time sequence of exposure (E) and disease (D) may be difficult to determine, but this criterion of temporal association is certainly a necessary condition.
2. Consistency of association. Loosely translated, this means that different studies of the same risk factor–disease question result in similar, or consistent, results. If results among several similar studies were discordant, this would weaken the causality hypothesis.
3. Strength of association. The greater the value of the relative risk or odds ratio, the less likely the association is spurious, lending evidence toward the causality hypothesis.
4. Dose–response relationship. If it can be shown that the risk of disease increases as the "dose" of the risk factor increases, this makes causality more plausible.
5. Biological plausibility. While satisfaction of the above criteria is important, causality ultimately will be more believable if there is some acceptable biological explanation as to why such a causal association might exist.

In summary, it is not possible to directly prove a causal hypothesis using case–control or cohort study designs. However, the causal hypothesis becomes much more tenable if the above five criteria can be established for the problem at hand.
Take-Home Points
• The use of a proper study design is essential to the investigation of risk factors for disease or other outcomes.
• Observational studies are useful in studying risk factors for disease or clinical outcomes.
• Cohort and case–control study designs are the most common strategies used in observational research, with cross-sectional studies playing a less important role.
• The choice between utilizing a cohort or case–control design depends upon several factors including disease prevalence and/or incidence, data availability and quality, and time required for follow-up.
• Confounding is a potentially serious problem that can affect the interpretation of either a cohort or a case–control study.
• Matching is a method used to reduce the effects of confounding.
• The degree of risk is quantified by the relative risk for cohort studies and the odds ratio for case–control studies.
• There are numerous sources of bias that can affect the interpretation of observational studies.
• In general, causality cannot be directly proven in observational studies, but certain criteria can suggest a causal hypothesis.
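
Two of the quantitative ideas summarized above, caliper matching and the odds ratio, can be sketched in a few lines of code. The following is a minimal illustration rather than a definitive implementation: the function names (`odds`, `odds_ratio`, `caliper_match`), the greedy nearest-age pairing, and the 2 × 2 counts in the example are all assumptions made for the sketch, and the 0.5 added to every cell when a zero is present is the commonly used continuity correction, assumed here.

```python
import math


def odds(p):
    """Odds of an event with probability p, for 0 <= p < 1."""
    return p / (1 - p)


def odds_ratio(a, b, c, d):
    """OR = (a*d)/(b*c) for a 2 x 2 table.

    a, c: exposed and unexposed cases; b, d: exposed and unexposed controls.
    If any cell is zero, 0.5 is added to every cell first (an assumed,
    commonly used correction) so that the ratio stays finite and positive.
    """
    if min(a, b, c, d) == 0:
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return (a * d) / (b * c)


def caliper_match(cases, controls, caliper=3):
    """Greedy 1:1 age matching: pair each case with the closest unused
    control whose age lies within +/- caliper years, if one exists."""
    available = list(controls)
    pairs = []
    for case_age in cases:
        candidates = [x for x in available if abs(x - case_age) <= caliper]
        if candidates:
            best = min(candidates, key=lambda x: abs(x - case_age))
            available.remove(best)  # each control is used at most once
            pairs.append((case_age, best))
    return pairs


# Hypothetical counts: 40 of 100 cases and 20 of 100 controls were exposed.
example_or = odds_ratio(a=40, b=20, c=60, d=80)
same_or = odds(40 / 100) / odds(20 / 100)  # ratio of the two exposure odds
print(round(example_or, 2), math.isclose(example_or, same_or))

# Ages from the matching example: a 30-year-old case, controls aged 39 and 29.
print(caliper_match(cases=[30], controls=[39, 29]))
```

With a three-year caliper, the 30-year-old case is paired with the 29-year-old control and the 39-year-old is left unmatched, as in the matching discussion. A real study would match on several variables at once and would usually rely on an established statistical package rather than a greedy loop.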
Phyllis G. Supino
P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology: A Guide for Clinical Investigators, 79
DOI 10.1007/978-1-4614-3360-6_5, Phyllis G. Supino and Jeffrey S. Borer 2012
80 P.G. Supino
the clinician, this would be equivalent to the logic underlying the protocols for ruling out myocardial infarction in the setting of chest pain. Campbell and Stanley identified eight factors that may threaten the internal validity of an interventional study. They referred to these as "internal validity threats" because they can provide competing explanations for observed outcomes and, thus, obscure true causal linkages. It is incumbent on a good investigator to use study designs devoid of these potential internal validity threats insofar as is possible.

1. Selection Bias. Selection bias is the improper assignment (allocation) of subjects for comparison. It is one of the most commonly recognized threats to the internal validity of an interventional study. An investigator may inadvertently contribute to this bias by nonrigorous matching (or failed randomization) techniques, or by choosing subjects for the experimental treatment who are believed to be most likely to benefit from it (a form of "referral bias"). For example, in a trial comparing surgery with medical treatment, those with the most favorable clinical profile might be assigned (referred) to the surgical group (based on presumed benefit), while the less robust patients might be assigned to the medically treated group. This approach is almost always optimistically biased in favor of the surgical group, which is why it is so difficult to form confident conclusions from trials conducted in this manner. It is equally incorrect to allow subjects to self-select their treatment assignments because volunteers for experimental treatments have been shown in various studies [3–5] to be different from the total ambient population in terms of personality (e.g., risk tolerance, decisiveness, action orientation), severity of disease or symptoms, and race, among other variables. These characteristics could skew associated outcomes in any direction (though it is generally thought that the direction of the bias induced by self-selection, like referral bias, is in favor of the experimental treatment).

When groups to be compared are not equivalent initially for these or for any other reason, observed differences on outcome measures among the groups may be due to (or at least strongly influenced by) these baseline differences rather than to the intervention. Selection bias sometimes can be neutralized after data collection through statistical processes. However, the best strategy is to preclude the problem by using an appropriate study design to maximize the comparability of the compared groups prior to intervention.

2. History Effects. History effects are caused by events not related to, or anticipated by, the research protocol that occur during the study and influence outcomes. History effects potentially threaten internal validity when a study is performed in a less than isolated setting, particularly when effects on the dependent variable are assessed before and after the intervention and the temporal interval separating these assessments is relatively long. When history effects occur, measured outcomes may partially or completely reflect the outside event and not the intervention. History effects can be caused by factors such as unintended procedural or environmental changes in the experimental setting, changes in the social climate that can influence attitudes, media campaigns that can increase general knowledge, or newsworthy events relevant to the altered health concerns of subjects in the study, etc. As an example of the latter, if an investigator was evaluating the impact of a breast cancer awareness program to promote increased use of mammography and a well-known public figure was diagnosed with breast cancer, it would be difficult to determine whether the ensuing increased use of mammography was due to the program or to the media attention surrounding the public figure's diagnosis. In the clinical setting, history effects can be induced by changes in routine care (e.g., introduction of a new medication or other treatment, alterations in patient management, variations in patient reimbursement rules) that could impact study outcomes. The effects of history are best minimized by closely monitoring
5 Fundamental Issues in Evaluating the Impact of Interventions: Sources and Control of Bias 81
to ensure that ancillary factors not directly integral to the intervention remain equivalent for all compared groups for the duration of the study. History effects also can be minimized by using contemporaneous (parallel) control group designs, where comparators would have equal likelihood of exposure to significant external events extraneous to the experimental setting.

3. Maturation Effects. Maturation effects are due to dynamic processes within subjects that may change with time and are independent of the intervention (e.g., growing older, progression or regression of illness). Like history, maturation may threaten internal validity when analysis of outcome depends on comparison of pre- and post-intervention measures. It is a particular concern when studies extend over long periods of time (longitudinal studies) during which biological alterations naturally can be expected and, thus, may affect outcomes. The effects of maturation, like selection bias and history effects, are minimized in parallel designs by selecting comparison groups likely to have similar developmental patterns.

4. Testing Effects. Testing effects are the influences of taking a test, being measured, or otherwise being observed, on the results of subsequent testing, measurement, or observation. Testing effects may occur whenever the testing process is itself a stimulus to change, even in the absence of a treatment. Examples are the act of being weighed during a weight-reduction program, or requiring patients receiving nicotine substitutes to document and periodically report the number of cigarettes they have smoked. In these cases, assuming the subjects are aware of the results of testing, the process of being measured may cause subjects to undertake lifestyle changes that will affect outcome independently of the intervention. Testing effects are potential concerns when measurement assesses knowledge, attitudes, behaviors, and (especially) skills, because the testing itself can provide an opportunity for altering subsequent results through practice or learning. The threat to internal validity can be minimized by using alternate forms of measurement for testing before and after intervention, or by eliminating pre- and post-intervention comparisons from the data analysis plan. Of course, as is true in virtually all interventional research, the latter approach requires demonstration of equivalence of the compared groups before the intervention is applied (i.e., at baseline, the pre-intervention period, or control condition).

5. Instrumentation Effects. Instrumentation effects (also known as "instrument decay" or "instrument drift") are caused by changing measurement instruments or observers during the course of a study, or by intra-study changes in the original instruments or observers, that may cause systematic error (bias) in measuring the outcome variable. If the error entails consistent overprediction versus baseline, the bias is said to be positive; consistent underprediction is a negative bias [6]. For example, if alternate versions of a test instrument are used before and after an intervention to reduce testing effects, any observed changes may be due to differences in difficulty level (e.g., easier posttests in studies assessing educational impact) or other systematic variations in the alternative instruments, rather than to the intervention. To avoid instrument effects when alternate forms of measurement are employed, they should be previously evaluated to assure equivalence. Parallel problems can occur when observers are changed during the course of study since new observers may use different criteria for scoring and interpreting data than the original observers. Instrumentation effects also can occur when the same instrument (or observer) is used throughout the study since instrument calibration may change with time (or observer attitudes/assessment criteria may change with experience).

Like history and maturation, instrumentation effects are a potential threat to internal validity in any longitudinal study involving serial measurements. They are of particular
concern when subjective measures (e.g., interviews or questionnaires) are used; in this situation, care must be taken to assure that instruments have demonstrated high reliability (internal consistency) to ensure stability. However, whether objective or subjective measures are used, observers may alter their interpretation of data as they grow more proficient or fatigued. Thus, instrumentation effects also can be minimized through development of standardized data collection protocols so that any fluctuations in measurement will occur randomly rather than systematically (or when comparing treatments by using the same observers across treatment conditions [counterbalancing] to avoid confounding).

6. Statistical Regression. Statistical regression is the tendency of individuals who scored extremely high or low on initial testing to score closer to the previously established population mean on subsequent retesting, independent of the intervention. This is one of the most often overlooked threats to internal validity, even among investigators who are well trained in statistics. Statistical regression results from measurement error, as extreme or highly deviant scores may arise due to chance. Such deviant scores are less likely to reappear on reevaluation. Regression effects can be minimized by avoiding the selection of a subject pool based on extreme scores, for example, very high blood pressure or low IQ scores. Another useful strategy to avoid regression effects is to obtain multiple measurements on each patient at several different appropriate times prior to intervention, or several measurements at the protocol-mandated baseline and time after intervention, which may then be averaged to optimize reliability of the estimate.

7. Experimental Mortality. Experimental mortality (or "attrition bias") is caused by the loss of subjects from a study who were originally included at baseline. Because subjects who withdraw may have different attributes than those who remain, their withdrawal may bias pre- to post-intervention comparisons, especially if these attributes are related to the outcome. Experimental mortality can bias outcome even for post-interventional comparisons if dropout is due to some characteristic of an intervention that is not related to the mechanism underlying its presumed efficacy. When comparison groups are used in an experimental design, a mortality bias also is introduced if the subjects lost to follow-up differ diagnostically among these groups. For example, a psychiatrist might wish to follow two groups of psychotic patients, one of which had been given an innovative treatment (the experimental group) while the other had been managed traditionally (the control group) to determine whether the intervention decreased return visits to his/her practice. If more severely ill patients were lost to follow-up in the intervention group than in the control group, the investigator might falsely conclude that reductions in return visits among the intervention group were attributable to the innovative treatment when, in fact, they may have occurred merely as a result of differences in attrition rates due to differences in illness severity. Experimental mortality is best minimized by using large groups of subjects who are geographically stable, accessible to investigators (i.e., have working telephone numbers and valid postal or e-mail addresses), and who are interested in participating in the study, and by developing strategies to facilitate follow-up. When subjects are lost, it is prudent to compare their baseline characteristics with those who remain in study to identify potential bias, and to utilize external vital statistics databases (e.g., the National Death Index) to identify and confirm deaths that may not be known to investigators.

8. Interaction of Factors. Sometimes two or more threats to validity can exist concurrently. These may combine to further restrict validity. Two factors that might be expected to combine are selection and maturation. For example, if two groups of patients were not initially equivalent in severity of illness (a selection bias), their illnesses might
5 Fundamental Issues in Evaluating the Impact of Interventions: Sources and Control of Bias 83
progress at different rates (a maturation bias). Thus, one of the two groups might end up
sicker, or healthier, than the other, irrespective of any intervention. This threat is
best controlled by procedures to minimize individual biases (e.g., randomized allocation
to treatment groups).
9. Experimenter Bias. In a perfect world, an investigator involved in a quantitative study
would be detached and objective, maintaining a highly circumscribed relationship with
the subject. In an interventional study, his or her responsibility is to administer or
allocate subjects to a treatment and to impartially measure outcomes and other variables
of interest. Experimenter bias, not identified by Campbell and Stanley, occurs when the
expectations of the investigator (usually unknowingly and unintentionally) influence
the outcome of the study, thereby confounding the results. The profound impact of
experimenter bias on internal validity was demonstrated by Rosenthal (1964) in his
seminal studies of expectancy on experimenter judgment and learning outcomes
conducted during the mid-1960s [7]. The experimenter's expectations typically arise
from deeply seated views about his or her study hypothesis and can impact the study in
a number of ways. For example, the investigator could subtly communicate expectations
(cues) to participants about anticipated outcomes and influence them through the power
of suggestion. The investigator could provide extra attention or care to subjects that is
outside of the intervention (the latter is also termed performance bias when
systematically done for members of only one of the comparison groups or compensatory
treatment bias when specifically applied to controls). The investigator also can bias the
study through improper ascertainment or verification of outcomes, for example, by
searching more diligently for adverse events in patients with versus without hypothesized
risk factors (detection bias) or by assigning a more favorable rating on a subjective
scale to subjects in the experimental versus control arm (a form of instrumentation bias).
Experimenter bias is best controlled by techniques that blind both the investigator and
the subject to the latter's treatment assignment, by the use of observers from whom the
purpose of the study is withheld, and by standardization of the methodology of outcome
assessment to ensure that subjects in the control group are evaluated as thoroughly
and as frequently as those receiving the intervention.
10. Subject Expectancy Effects. The subject expectancy effect (also termed nonspecific
effects), also not identified by Campbell and Stanley, is a cognitive bias that arises when
a subject anticipates an outcome (positive or negative) from an intervention, and reports
a response to the intervention that is premised on this belief. This is the basis of the
placebo effect, long recognized in clinical medicine. It occurs when a patient responds
positively to an inactive intervention (e.g., a pharmacologically inert pill) and appears to
improve subjectively and even, occasionally, objectively. This effect on outcome is due to
the patient's belief that the intervention is curative. It may be stimulated or reinforced
by suggestion of therapeutic benefit by an authority figure (e.g., physician or other
investigator, as noted above under Experimenter Bias) and/or by the subject's inherent
desire to please him or her. Indeed, the term placebo is derived from the Latin, "I will
please." An opposite phenomenon is the nocebo (Latin for, "I will harm") effect, which
occurs when a subject reports negative responses to administration of an inert
intervention due to his/her pessimistic expectation that it would produce harmful or
unpleasant consequences. Although the magnitude of these subject expectancy effects is
variable and somewhat controversial, there is general consensus that they can impact the
validity of any study in which the subject is aware of receiving a treatment for which the
outcome is subjective (e.g., studies involving pain control or symptom relief). As with
experimenter bias, subject expectancy is best
84 P.G. Supino
controlled by utilizing study designs that blind the subject to his/her treatment
assignment. For some types of interventions, such as those involving lifestyle changes
(e.g., dietary alterations, smoking cessation) or surgical studies, subject blinding may be
difficult, if not impossible. (This is also true for those conducting these interventions and
other members of the investigational team.) In these instances, blinded assessment of
outcomes by external adjudicators could reduce, if not eliminate, expectancy biases.
However, in many biomedical studies (e.g., those evaluating the effects of pharmacological
agents), subjects (and investigators) can be blinded to treatment assignments through
the use of placebos. The incorporation of placebos enables determination of treatment
effects above and beyond those arising from subject (or investigator) expectancy.
Obviously, placebos work best when they closely approximate the physical characteristics
of the active intervention. (This problem is avoided in early phase I clinical trials of
therapeutics where both placebo and active drug may be administered intravenously, or
when the investigational intervention does not cause characteristic physiological effects
that might unmask the treatment assignment.) When the treatment assignment is
known to both subject and investigator, the study is said to be unblinded (or open); when
only the subject or the investigator (but not both) is unaware of the treatment assignment,
the study is said to be single blinded; when treatment assignment is unknown both to
subject and investigator, the study is said to be double blinded; and when it is unknown
to the subject, investigator, and others analyzing or monitoring the data, the study is
said to be triple blinded.

Threats to External Validity
External validity refers to generalizability, that is, can the study findings be extrapolated
to subjects, contexts, and times other than those for which the findings were obtained?
Internal validity is a prerequisite for external validity. However, external validity is not
assured even when internal validity has been established. In fact, the rigorous
controls required to establish internal validity may inadvertently compromise a study's
generalizability. The investigator must use a variety of strategies to strike a delicate
balance between both concerns, if the study is to be both accurate (internally valid) and
have practical utility (be externally valid). The four most common threats to external
validity, identified in the seminal works of Campbell and Stanley, are given below.
1. Reactive Effects of Testing. The reactive effects of testing involve sensitization (or
desensitization) of study subjects to interventions caused by the pre-intervention testing
that might not be undertaken in the general, nonstudy population. This threat to external
validity is most often encountered when pretests are obtrusive and/or outside of the
normal experience of the subject. For example, to study the effects of a new nutrition
program, an investigator might assess baseline knowledge of food groups and portion
control, for the purpose of comparing pre- to post-intervention changes. If the pretest had
focused attention on the intervention, any treatment effects that were observed might not
be replicable if the pretest was not given. To diminish this bias, the investigator should
minimize or, ideally, dispense with the use of pretests. However, as with its internal
validity analog (testing effects), this approach is valid only when there is reasonable
certainty that the comparison groups are equivalent at baseline. Alternatively, the
investigator could opt to use the least obtrusive pre-intervention assessments to minimize
reactivity. Special research designs (e.g., the Solomon four-group design), in which
pretests are given to some but not all study subjects, can be used to determine the
reactive effects of testing on study outcomes.
2. Interactive Effects of Selection and Treatment. Sometimes two investigators will run
similar studies and obtain different findings. One possible cause of this outcome is the
interactive effects of selection and treatment (or selection-treatment interaction). The
interactive effects of selection and treatment are the
presumed basis of the failure of results found in an intervention study to be generalizable
to other subjects to whom that intervention is applied. This failure occurs because the
study was conducted on a sample that was not representative of the larger population to
which results should be extrapolated. The selection-treatment interaction frequently is
seen in clinical research when research subjects are scarce (a common situation) and the
investigator is limited to those who present themselves and are willing to participate. In
these situations, study subjects typically are selected by convenience, rather than by
population-based sampling. A convenience sample includes all, or a portion, of patients
who are being seen in a practice, hospital, or clinic, provided they meet the inclusion
criteria of the study, and consent to participate. If the subjects selected for the study are,
for example, healthier, wealthier, or wiser than the general population, or if they come
from a unique geographic area, they may benefit more or less from a treatment, and it may
not be possible to replicate the study, or to extrapolate its results to the larger
population of interest. In theory, the interactive effects of selection and treatment are
best controlled by random selection of subjects from the target population. Because
this seldom is possible in clinical research (especially in randomized clinical trials
[RCTs] in which strict inclusion/exclusion criteria and possibility of a subject's receiving
a placebo sharply narrow the pool of study-eligible patients), the investigator should
endeavor to select subjects who have characteristics similar to those to which he or she
wishes to extrapolate results. Multicenter studies, drawing from diverse demographic
populations, tend to suffer less than single-center studies from this external validity
threat. Nonetheless, even small, single-center studies have value provided the investigator
identifies and reports potential biases in his or her selection plan and is also careful to
limit generalizations to appropriate populations.
3. Reactive Effects of Experimental Arrangements. This validity threat is defined
as aberrant behavior exhibited by subjects that results solely as a consequence of their
participation in an experiment, and that may not occur outside the experimental setting.
The reactive effects of experimental arrangements are often confused with the placebo
effect. Although there are cognitive components inherent in both validity threats, the
primary difference is that with the reactive effects of experimental arrangements, the
subject's bias is based on the idiosyncrasies of the research environment, whereas with
the placebo effect, the subject's bias is based on expectations about the treatment (that
may or may not be part of a research study). The reactive effects of experimental
arrangements were serendipitously discovered in a series of trials evaluating the impact
of the work environment on employee productivity, conducted by Harvard University
researchers between 1924 and 1932 at the Hawthorne Works, a factory plant of the Western
Electric Company in Cicero, Illinois. The initial studies (illumination experiments) varied
the level of light intensity to which employees were exposed. When the light intensity
increased, worker output (and positive affect) improved but, much to the investigators'
surprise, worker performance also improved when lighting intensity was diminished. The
same pattern emerged when other environmental factors were manipulated. These
unintended outcomes (also known as the Hawthorne effect) [8] led the researchers to
conclude that the mere act of being studied changed the participants' behavior (i.e.,
brought about a pseudo-treatment effect), confounding inferences about effects of the
various interventions imposed upon them. Underlying mechanisms proposed to explain
these findings include unintended special attention and benefits that may have been
given to subjects by observers, uncontrolled novelty due to the artificiality of the
experimental arrangements, and inadvertent responses to subjects from observers leading
to learning effects that positively impacted performance. While there is no consensus as
to the cause, the reactive effects of experiments
currently are recognized as a potential threat both to external and internal validity in
research from various disciplines (e.g., medicine, education, psychology, and management
science). Their impact is potentially problematic in any situation in which there is human
awareness of participation in a study and in which study outcomes can be motivated by
that knowledge. A related threat to validity that is caused by experimental arrangements is
known as the John Henry effect [9]. This may occur when subjects in the control group,
being aware of their treatment assignment, view themselves as competing with subjects
in the intervention group and change their behavior (i.e., try harder) in an attempt to
outperform them.
Whenever possible, the investigator should take steps to reduce the reactive effects of
experimental arrangements to increase the likelihood that the findings from a study will
be replicated beyond the experimental context. Methodological options for achieving
this objective include (1) minimizing the obtrusiveness of experimental manipulations
and measurements, (2) blinding subjects to their treatment assignment (to control for
John Henry effects), and (3) providing equivalent attention to intervention and control
groups, especially in studies involving psychological, behavioral, and educational
outcomes. To accomplish this, investigators may include a Hawthorne control group that
receives an irrelevant intervention to equalize subject contact with project staff.
4. Multiple Treatment Interference. A fourth threat to the external validity of an
intervention study is multiple treatment interference, defined as the influence of one
treatment on another, which may produce results that would not be found if either were
applied alone. Multiple treatment interference is a potential problem in any study in which
more than one treatment (or treatment level) is given to, and formally evaluated in, the
same subject. The threat applies even when the treatments are given in sequence because
treatment effects may carry over and it may not be possible to eliminate the effects of the
prior exposure. Under these conditions, it will be difficult to determine how much of the
ultimate treatment outcome was attributable to the first treatment and how much was due
to the second, thus limiting the applicability of the study findings to the real world, in
which patterns of treatment availability may not mirror those of the study. Multiple
treatment interference is very difficult to eradicate. It is best controlled by avoiding the
use of within-subject designs. Where this is not possible, the investigator must carefully
counterbalance or randomly order treatments across subjects and provide appropriate
washout periods.

Elements of the Research Design

In analyzing the anatomy of a study to evaluate the impact of an intervention, it can be
very helpful to employ shorthand that displays the major elements of the design, the
sequence of events, and certain of the constraints within the design. This shorthand, based
largely on the notation developed by Campbell and Stanley, will be used in the remainder
of this chapter to examine the strengths and weaknesses of ten alternative study designs.
The symbol X denotes the intervention (primary treatment or independent variable) that
is applied to the subjects in the study. When more than one level of a treatment is included
in a design, they are labeled X0 (control), X1, X2, and so on; XP indicates that a placebo
has been given to control subjects (in designs incorporating parallel treatment arms) or
during the control condition (in time-series or crossover designs) to control for expectancy.
Y indicates that a secondary treatment has been coadministered, concomitant with the
primary treatment. Variations in levels of the secondary treatment, if any, may be
distinguished by subscripts in a similar manner as for X. Absence of Y indicates absence
of co-treatment.
O is the observation (or measurement of the dependent variable) in the study. O may
represent a test result, a record, or other data; when
more than one observation is involved over time, they are variously labeled as O1, O2,
etc., to distinguish them.
An arrow represents the experimental order (sequence of events during the study period).
A dashed line indicates that intact groups (e.g., hospitals, clinics, or wards) have been
compared (in other words, that subjects have not been allocated to treatment on a random
basis).
R indicates that study subjects have been allocated to treatment groups on a random basis.
(Thus, a dashed line and R generally will not appear in the same design as these represent
alternative methods of subject allocation to treatment.)

Alternative Research Designs

Several alternative research designs have been used to evaluate the effects of an
intervention on some specified outcome. Each of these differs according to its adequacy in
ensuring that valid inferences are made about the effects and generalizability of an
intervention.

Pre-experimental Research Designs

The literature regrettably includes many studies that use designs which fail to control for
most threats to internal validity. These are most properly termed pre-experimental designs
because they contain only a few of the essential structural elements needed to draw
unambiguous inferences about the impact of an intervention. They are presented below to
heighten the reader's awareness of their glaring deficiencies. The three most common are
the following:
1. The one-shot case study
2. The pretest-posttest only design
3. The static-group comparison

Pre-Experimental Research Design # 1:
The One-Shot Case Study
X → O

Some studies in medicine utilize a design in which a single patient (or series of patients)
is studied only once, following the administration of an intervention. No pre- to
post-intervention comparisons are made, and no concurrent control groups are used.
Instead, inferences about causality are predicated on expectations of what would have
been observed in the absence of the intervention, usually based on implicit comparison
with past information. This most rudimentary pre-experimental design is termed the
one-shot case study and is diagrammed as follows: X for the intervention, followed by an
arrow, and O for the observation. Consider an example from the literature by R.F. Visser,
published in the journal Clinical Cardiology [10] (summary and design structure are given
in Fig. 5.1).
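
The shorthand described above can be made concrete with a short sketch. The Python below is purely illustrative and is not part of Campbell and Stanley's notation: the names Arm and render_design, the use of "->" for the arrow, and the text rendering of the dashed line are all assumptions made here. It renders the one-shot case study and, for contrast, two hypothetical two-group layouts, one comparing intact groups and one using random allocation.

```python
from dataclasses import dataclass
from typing import List

ARROW = " -> "             # experimental order (sequence of events)
DASHED_LINE = "- - - - -"  # intact groups, i.e., no random allocation


@dataclass
class Arm:
    """One comparison group: its ordered events (e.g., O1, X, O2)."""
    events: List[str]
    randomized: bool = False  # True prefixes the row with R

    def render(self) -> str:
        row = ARROW.join(self.events)
        return "R  " + row if self.randomized else row


def render_design(arms: List[Arm]) -> str:
    """Render a design, one row per arm.

    Rows of non-randomized (intact) groups are separated by a dashed
    line; randomized rows carry R instead.
    """
    all_randomized = all(arm.randomized for arm in arms)
    separator = "\n" if all_randomized else "\n" + DASHED_LINE + "\n"
    return separator.join(arm.render() for arm in arms)


# The one-shot case study: intervention followed by a single observation.
one_shot = render_design([Arm(["X", "O"])])

# Hypothetical layouts showing the dashed line and the R symbol.
intact_groups = render_design([Arm(["X", "O1"]), Arm(["O2"])])
randomized = render_design([
    Arm(["O1", "X", "O2"], randomized=True),
    Arm(["O3", "O4"], randomized=True),
])

print(one_shot)
print(intact_groups)
print(randomized)
```

Keeping each arm as an ordered list of events mirrors how the notation is read, left to right in time, and makes it easy to see at a glance which structural elements (pretest, control arm, random allocation) a given design lacks.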
have been well standardized. (Indeed, the authors are silent about the test-retest
reliability of their instruments.) Statistical regression poses another possible threat, if
the study subjects had been chosen on the basis of extremely poor scores on the initial
test. In the final analysis, because so many potential individual biases are uncontrolled
in this study, there is also the strong likelihood that interaction of these factors could
undermine its internal validity and the conclusions drawn from it. Indeed, Campbell and
Stanley argued that this type of design should be used only when nothing else can be done.
The study also suffers from several threats to external validity, namely, the potential for
selection-treatment interaction. First of all, very few subjects were studied, and it is
highly unlikely that they were representative of all patients being treated for ADHD
(selection-treatment interaction). Second, the subjects (as well as their doctors) were
unblinded, and subjects may have improved due to the effects of their participation in the
study (reactive effects of experimental arrangements). These issues are noted only for
completeness. As noted above, this study fails to meet criteria for internal validity; thus,
its generalizability is unimportant.

Pre-Experimental Research Design # 3
The Static-Group Comparison

A third pre-experimental design also found in the literature is the static-group comparison.
This design incorporates two groups: one that receives an intervention (again denoted as X)
and a second that does not receive an intervention and which serves as a control (denoted
by the absence of X). Groups one and two typically are observed concurrently after the
intervention has been applied in one of the groups, and the observations made in these
groups are denoted by the Os. This design includes no pretesting or baseline measurements.
Note that both intervention and control groups are separated, schematically, by a dashed
line to indicate that study subjects were assigned to treatment as intact groups, that is,
they were not randomly allocated to treatment. A study, published by Bolland et al. in the
Journal of the American Dietetic Association [12], employed a variant of this design which
tested for effects extended over time (summary and design structure are given in Fig. 5.3).
Are these conclusions credible? A review of the structure of this design will be revealing.
In this study, X represents the food quantity estimation intervention, and the O represents
the post-intervention assessments of knowledge of food quantities in the experimental
(trained) and control (untrained) groups, assessed at three different times among trained
subjects. (The reader should note that the use of deferred assessments is not typical of the
static-group comparison design but was used in this study in an attempt to define
persistence of treatment effects.) The broken line
between the experimental and control groups indicates the intact nature of the comparison
groups, signifying that subject assignment to the intervention or control comparison group
was not random.
The static-group comparison design represents an improvement over the one-shot case
study because the inclusion of a contemporaneous control group permits comparison of the
results of the trained study subjects with the other, untrained study subjects, evaluated
approximately in parallel, thereby avoiding the obvious biases inherent in the use of
external or historical controls (or, in the worst-case scenario, no controls). Moreover, the
fact that study subjects in both groups are being evaluated in the same way during a
relatively short interval decreases the potential for maturation and instrumentation
effects (assuming uniform data collection). Finally, this design also represents an
improvement over the one-group pretest-posttest only design because the absence of
pretesting and subject selection based on extreme pretest scores obviates the threat of
testing effects and statistical regression.
Nonetheless, there are two potential threats to internal validity for which this design
affords absolutely no protection. The first threat is selection (or allocation) bias. The
authors do not tell us how the study subjects were divided into treatment groups. Was it by
instructor preference or self-selection by the study subjects? Either of these scenarios
would be equally flawed because without baseline (pre-intervention) assessments, there is
no way to determine whether the observed outcomes were due to the training or to
pre-intervention differences in the subjects' knowledge about estimating food quantities.
Even if the investigators had attempted to match the groups on other variables, such
matching would be ineffective in achieving true baseline parity among trained versus
untrained subjects, especially if subjects had, indeed, self-selected participation in the
intervention. In addition, even though the study was relatively short in duration, the
validity of the conclusions, nonetheless, is threatened by the potential for experimental
mortality (attrition bias) as no information is given about whether all subjects who began
this study actually completed it or whether attrition (if it did occur) differed
systematically between the two groups. Thus, even if subjects were comparable on average
before training, the apparent superiority of the
trained group (relative to the untrained group) on the outcome measure possibly could have
been due to several of the less knowledgeable students dropping from the former group (or,
conversely, due to some of the more knowledgeable students dropping from the latter group)
prior to testing.
The primary threat to external validity is the interaction of selection and treatment. (After
all, how representative is one class of introductory nutrition students of the larger
relevant population?) However, since the internal validity of the study is severely
compromised, this threat to external validity has little if any importance.

True-Experimental Research Designs

The most prominent characteristic of true-experimental designs is random allocation of
study subjects, drawn from a common population, to alternative treatment conditions. When
this approach is employed, participants' baseline characteristics can be expected to be
equally distributed across the various comparisons according to the laws of probability,
especially when sample size is large. Even when randomization does not result in perfect
equivalence, most workers in the field believe that this form of treatment allocation is the
best way to reduce the threat of selection bias. The theoretical underpinnings of
randomized designs can be traced to Fisher and Mackenzie's agricultural experiments in the
1920s [13]; however, it was not until the late 1940s that they made their appearance in the
medical literature, when the RCT was first used to demonstrate the efficacy of streptomycin
in the treatment of tuberculosis [14]. Since that time, the RCT has been considered the
standard to be met for clinical research, even though investigations of this type comprise
only a minority of the clinical research ever conducted or published. Randomization also
is important in many preclinical/basic science research protocols, though other
considerations may minimize application of this approach in the nonclinical setting.
Most commonly, randomization is fixed; less commonly, it is adaptive. With fixed random
allocation, each subject has an equal probability of assignment to the alternative study
arms, and that probability remains constant throughout the study. The randomization
process can be performed using a coin toss, a table of random numbers, or special computer
software. This type of randomization is known as simple randomization and works best when
sample size is relatively large. However, when sample size is small, simple randomization
may result in statistically unequal groupings. Under these circumstances, restrictive
randomization methods (e.g., blocked randomized designs or stratified random allocation)
can be employed. With blocked randomization, subjects are assigned to treatment in groups
(blocks) that are similar to one another with regard to a source (or several important
sources) of variability that is (are) not of primary interest to the experimenter (e.g., a
potential confounding variable such as gender or geographic area). Stratified randomization
is performed by conducting separate randomization procedures within each of two or more
subgroups of subjects that are defined according to prespecified patient characteristics
(usually important prognostic risk factors) and increases the likelihood that allocation to
treatment is well balanced within each stratum. With adaptive methods (a Bayesian approach
increasingly used in contemporary clinical trials) [15], the probability of allocation
changes in response to accumulating information during the study about the composition of,
or outcomes associated with, the alternative treatment arms. (For a comprehensive
discussion of the theory and techniques of adaptive randomization, the reader is referred
to Hu and Rosenberger, 2006 [16].)
As noted, the purpose of randomization is to render the comparison groups as similar as
possible at study entry to permit valid inferences to be drawn about the effects of an
intervention. However, during the course of the trial, some patients may not initially
receive the intended intervention or, during the course of the study, may drop out or cross
over to the alternate treatment for a variety of reasons. One widely used solution to
circumvent these problems is intention-to-treat analysis (ITT), which defines the
comparison groups according to initial assigned treatment
rather than to the treatment actually received or completed (i.e., "once randomized, always
analyzed"). Many workers in the field consider ITT analysis to be the gold standard method
of analysis for clinical trials [17], describing it as the least biased for drawing
inferences about trial results [17, 18], and it is considered the pivotal analysis by major
regulatory bodies in Europe and in the USA for approval of new therapeutics. However,
the reader should note that ITT analysis provides only a pragmatic estimate of the benefit
of a new treatment policy rather than an estimate of potential benefit in patients who
receive treatment exactly as planned; moreover, full application of this method is possible
only when complete outcome data are available for all randomized subjects [19]. Thus, the
ITT approach is not without its critics [20]. Some clinical trialists argue that efficacy
is best demonstrated when analysis focuses on subjects who actually received the
treatment of interest (sometimes termed efficacy subset analysis), arguing that ITT
approaches provide an overly conservative estimate of the magnitude of treatment effects
principally due to dilution of effects by nonadherence. In addition, ITT analysis creates
difficulty in interpretation of findings if numerous participants cross over to opposite
treatment arms. Finally, it is suboptimal for studies of equivalence, generally increasing
the likelihood of erroneously concluding that no difference exists between two test
articles [21]. A common solution is to employ both methods of analysis in the same study,
using ITT and on-treatment approaches as primary and secondary analysis, respectively.
Four of the most common true-experimental designs found in the biomedical literature are
the following:
1. The pretest-posttest control group design
2. The posttest only control group design
3. The true-experimental 2 × 2 factorial design
4. The crossover study (two-period design)
The first two designs can be used to evaluate the impact of a single intervention (vs.
control or an alternate intervention), and the third and fourth permit the investigator to
examine the separate effects of two interventions (again, vs. control or

study. All provide much better protection than do pre-experimental designs against most
threats to internal validity.

True Experimental Design # 1
The Pretest-Posttest Control Group Design

In the most common form of the pretest-posttest control group design, study subjects are
randomly allocated to two comparison groups or treatment arms. One group receives the
experimental intervention and the second, no intervention, a placebo, or an alternate
intervention. Both groups are observed, in parallel, before and after the intervention on
the same outcome measure(s) to determine whether change varied as a function of the
treatment. The structure of this design is represented symbolically above: R denotes that
subjects have been randomly allocated to the comparison groups; X denotes that a treatment
has been given to the first group; absence of X in the second group indicates that this is a
control group (the control group also could have been denoted by X0 [or XP if a placebo had
been given]). O and its positioning indicate the observations made in both groups before
and after the intervention. An example of a study incorporating this design was published
by Gorbach et al. in the Journal of the American Dietetic Association [22] (summary and
design structure are given in Fig. 5.4).
The structural representation of this study is a clue to the strength of its internal
validity. Here, X represents the fat reduction dietary intervention; the absence of X
represents no dietary intervention, the control group; O1 and O3 represent baseline fat
intake in the experimental and control groups; O2 and O4 represent post-intervention fat
intake in both groups; R signifies that the study is randomized.
Because study subjects have been randomly allocated to comparison groups from a common
subject pool, selection bias has been removed as a serious threat to internal validity,
assuming that the randomization was effective. Having baseline
an alternate intervention) applied within the same measures of the dependent variable (and other
key variables that potentially could influence it) and comparing them between groups permits us to confirm or reject this assumption; these comparisons typically are expressed in tabular form in most published RCTs. History effects are controlled because if a potentially confounding general external event had occurred, it should have affected the comparison groups equally since they are studied in parallel; nonetheless, as noted earlier in this chapter, the investigator must be vigilant and attempt to control for differences between comparison groups that might occur on a more "micro" level (i.e., within-group variations in temperature, time of day, season, etc.). For similar reasons, the use of a parallel design also protects against the threats of maturation, testing, and instrumentation effects because natural variations in these factors should impact comparison groups equally; instrumentation effects also are minimized here because all data were collected using standardized techniques. In this study, subjects were selected on the basis of high risk for breast cancer, not on the basis of extremes in pre-intervention fat and energy intake. However, even if they had been chosen according to the latter criterion, average regression effects would not confound interpretation of the results because if they had occurred, they should have been equivalent in the comparison groups, given that the subjects were randomly allocated from a common subject pool. Thus, this design also protects against statistical regression. Finally, while treatment assignment could not be fully blinded (as noted earlier, a common characteristic of studies evaluating impact of lifestyle interventions) to entirely eliminate the threat of expectancy effects, the investigators endeavored to reduce them by standardizing their methodology for outcome ascertainment and by blind-coding data to ensure that subjects in the control group and those receiving the intervention were evaluated uniformly and impartially. The one error made in this study was the use of an incorrect test of statistical significance (i.e., computing two sets of t-tests, one for the experimental group and one for the control group, rather than conducting direct statistical comparisons of the changes between the groups). With this single exception (which Campbell identified as a "wrong statistic in common use" among investigators employing
these designs [1]), the use of random allocation to parallel treatment groups afforded by the application of the pretest-posttest parallel group design, coupled with standardized data collection methodology, protected this study very well from most factors that could have undermined its internal validity, thus maximizing the likelihood that the intervention, rather than other factors, was responsible for the observed outcomes.
However, the external validity of this study is open to question. The reason is that randomized designs, including this model, may lead to conclusions that, while internally valid for the study, may not generalize to the reference population for the following three reasons.
First of all, in this study, pretests were used to assess relative change in fat and energy intake in the comparison groups. Their use may have sensitized study subjects to the intervention, with the possibility that results might not generalize when the intervention is applied without pretesting. This threat to external validity, known as the interactive effect of testing and treatment and described earlier, is a potential problem for any pretest-posttest comparison design, randomized or not, unless the testing itself is considered a component of the intervention being studied.
Another potential threat to external validity is the interaction of selection and treatment. Since the purpose of hypothesis testing is to make inferences about the reference population from which study subjects are drawn, the representativeness of the study group must be ascertained for the general population of women at high risk for breast cancer. As noted, the majority of subjects in this study were well educated, and a quarter had annual household incomes that were relatively high for the time (1990). It is also relevant that patients were excluded from the study for a number of reasons including, but not limited to, their unwillingness to sign an informed consent form, or because they were judged by the study nutritionist to be potentially unreliable in complying with the study protocol. Unfortunately, as is the case for many published RCTs, the authors fail to state how many patients were excluded for these reasons, making it difficult to evaluate the potential adverse impact of the selection-treatment interaction, which must be considered, even though hundreds of subjects were enrolled in the trial.
A third potential threat to the external validity is the reactive effects of the experimental arrangements. Because the intervention was not part of the routine care of this population and informed consent was required, subjects certainly were aware of their participation in an experiment. All subjects would have been exposed to the novelty associated with random allocation techniques and new ways of keeping food records. Subjects in the intervention group would have been exposed to new health-care providers (in this study, the nutritionists) and, as a part of such intervention, may well have received more attention from project personnel than those told to follow customary diets (i.e., the control group), unless a Hawthorne control had been built into the study (which it had not). Any of these factors might have led to changes that were due to reactivity to the experiment (a possibility that is supported by changes in fat and energy consumption, albeit of a lesser magnitude, among control group participants), raising the concern that the effects of the intervention might not be replicated when applied nonexperimentally.
True-Experimental Design # 2
The Posttest Only Control Group Design
R X O1
R   O2
The next approach, called a posttest only control group design, again utilizes two groups: each has been randomly allocated to treatment; as before, one group receives the intervention, represented by X, and the second group either receives no intervention, an alternate intervention, or, if it is a drug study, sometimes a placebo (designated as XP). Both are observed after the intervention only, as shown by the positioning of O. The major distinction between this design and the preceding one is that, here, study subjects are not assessed on the dependent (outcome) variable at baseline. Instead, they are compared only after the intervention. Unless knowledge of relative change on an
outcome is required, baseline assessments of the dependent variable are not necessary for establishing comparability of the comparison groups in true-experimental designs, since random allocation to treatment should eliminate the threat of selection bias. As noted earlier, this is especially true if the number of study subjects is large and the randomization strategy is properly executed. Nevertheless, baseline data on relevant demographic and clinical variables other than study outcomes typically are collected to permit examination of this assumption.
The posttest only control group design is especially appropriate in situations where within-subject outcomes logically cannot be defined before application of the intervention (e.g., in studies relating impact of the intervention on survival). A classic example was published by the β-Blocker Heart Attack Research Group in the Journal of the American Medical Association [23] (summary and design structure are given in Fig. 5.5).
In this study design, X represents the experimental drug, in this case propranolol, and XP is the placebo. O1 and O2, respectively, represent the percent mortality for the propranolol and placebo groups. As before, the symbol R denotes the use of randomized allocation to treatment group.
How well does this study design protect against threats to internal validity? The answer is very well. Again, as for the pretest-posttest parallel control group design, the use of random allocation of almost 4,000 patients to treatment assignment controls for selection bias (the comparability of the distributions of baseline clinical variables, electrocardiographic abnormalities, age, gender, and other descriptors between the propranolol and placebo groups noted in their manuscript illustrates this point). In addition, the use of parallel comparison group post-intervention comparisons, rather than sole reliance on within-group changes without controls, effectively rules out history, maturation, testing, mortality, regression, and instrumentation effects and their interactions as competing explanations for the outcomes. In addition, because the study was double blinded, both subject expectancy and experimenter bias also are eliminated as potential threats to validity.
The study also is superior to that of Gorbach et al. with regard to external validity. The reason is that the posttest only comparison group design does not require pre-intervention assessments as a benchmark against which to establish intervention effects. Thus, by definition, it controls for the reactive effects of testing. Indeed, this is the primary
advantage of this design versus the pretest-posttest parallel group design. In this study, the outcomes of the intervention were all "hard" events rather than behavioral or educational outcomes, and the intervention, itself, involved medication rather than promotion of lifestyle change. Therefore, the reactive effects of experimental arrangements, if any, should be minimal, provided that the investigators took care to minimize the obtrusiveness of the experimental manipulations and measurements. Nonetheless, while the study was large and multicentered, the authors reported that 77% of those patients invited to participate did not do so. Therefore, despite the many thousands of patients enrolled, there is still a question of how representative the sample was of the general population after a recent MI. Consequently, the external validity of this study potentially is threatened by the selection-treatment interaction which, as noted earlier, is a common problem in many RCTs.
True-Experimental Design # 3
The 2 × 2 Factorial Study
The first two true-experimental designs permitted the investigator to evaluate the impact of a primary treatment versus an alternative primary treatment or control. True-experimental factorial designs are modifications that include a secondary treatment administered concurrently with the primary treatment to permit examination of the main and interactive effects of each. They can be designed with and without pretests (as above) and with or without blinding, if the latter is not practical or possible.
An example of these designs is diagramed above. This exemplar is termed a 2 × 2 factorial true-experimental design and includes four concurrent parallel groups: the first two groups receive a primary treatment, denoted by X, and the second two receive no primary treatment, denoted by the absence of X (or, alternatively, X0) or Xp if placebos are given to the nontreatment controls. In a variation of this design (for evaluation of comparative effectiveness), the second group might receive an alternative primary treatment (in this case, these treatments would be designated X1 and X2 to differentiate them). One group receiving the primary treatment and one receiving an alternate treatment, or no primary treatment, also receive a secondary treatment, denoted here as Y. The remaining two groups do not, or may receive a placebo. The groups are observed in parallel after application of the intervention, as denoted by O. A 2 × 2 true-experimental design, published by the International Study Group in The Lancet [24], was employed to evaluate the relative effectiveness and safety of two thrombolytic drugs administered with or without heparin (summary and design structure are given in Fig. 5.6).
In this study, X1 represents streptokinase, and X2 represents alteplase. Y indicates concomitant administration of heparin; the absence of Y indicates that no heparin was given. O1-O4 denote the percentages of in-hospital deaths in each of the comparison groups (Fig. 5.6).
Because this study (like those using true-experimental designs #1 and #2) employed a design that randomly allocated subjects to four large parallel treatment arms, selection bias is controlled as are history effects, maturation, instrumentation, testing, experimental mortality, and regression. Unfortunately, neither patients nor investigators were blinded to the former's treatment assignment. Thus, the study did not control for the potential effects of expectancy. This omission is important because even though the dependent variable clearly was an objective outcome (i.e., death) and randomization led to groups that appeared to be well balanced at study entry, knowledge of the treatment assignment still could have resulted in unintended differences between the treatment arms in the use of nonprotocol-mandated co-interventions (e.g., percutaneous coronary angioplasty or coronary bypass grafting) that, themselves, could have influenced study outcomes. This design flaw, of course, is not a limitation of the true-experimental factorial design (which, otherwise, controls very well for major threats to internal validity) but, as noted earlier, is a problem associated with any open (unblinded) study. Had the study been blinded,
this true-experimental factorial design, like the two preceding true-experimental designs, would, in theory, have afforded full protection against most, if not all, serious threats to internal validity.
The chief advantage of this study design for external validity (vs. the crossover study, discussed below) lies in the fact that its structure permits a purposive and systematic evaluation of the separate and combined (i.e., interactive) effects of concomitant investigational therapies, thereby avoiding unplanned carryover effects and precluding the threat of multiple treatment interference. Though this design can increase the efficiency of interventional trials by permitting simultaneous tests of several hypotheses, the reader should be aware that if interactions are severe, loss of statistical power is possible [25]. A limitation to the external validity of this particular study (but not to factorial designs in general) is the coadministration of noninvestigational drugs (i.e., β-blockade and aspirin) among all patients without contraindications to these therapies, which prevents us from generalizing the mortality findings to similar patients in whom these therapies are not given.
True-Experimental Design # 4
The Two-Period Crossover (Change-Over) Design
[Period A] [Period B]
In the previous example, the main and interactive effects of two treatments were evaluated. To accomplish this, a factorial parallel (between-subjects) design was used that required allocation of large numbers of subjects into four different treatment arms, resulting in one protocol-mandated exposure per subject during the course of the study. In contrast, if the study objectives were to determine only the main (isolated) effects of two treatments, rather than their interactions, this objective could be accomplished more
efficiently (i.e., with fewer subjects producing equivalent statistical power or precision) using the true-experimental crossover (or changeover) design. A crossover design is a type of repeated measures design in which each subject is exposed to different treatments during the study (i.e., they cross or change over from one treatment to another). The order of treatment administration (determined a priori via randomization) is termed a "sequence," and the time of the treatment administration is called a "period." The statistical efficiency of the design results from the fact that each subject acts as his or her own control, thereby minimizing error due to (and sample size needed to overcome) the effects of between-subject variability. Crossover designs have enjoyed popularity in many disciplines including medicine, psychology, and agriculture. They are commonly used in the early stages of clinical trials to assess the efficacy and safety of pharmacological agents and constitute the preferred methodological approach for establishing bioequivalence. A variant that can be used for these purposes is the n-of-1 study, a "mini-RCT" in which a single patient is observed during exposure to randomly ordered sequences of treatment (frequently given in varying doses) and placebo. Both the patient and clinician are blinded as to treatment allocation, and the codes are broken after the trial. Responses, such as reported side effects, are graphed or analyzed through a variety of parametric and nonparametric statistical techniques. When performed in series, the n-of-1 study can provide valuable information for subsequent parallel group trials.
A crossover study has utility for clinical research only when three conditions are satisfied: (1) subjects must have a chronic stable disease that is not likely to progress during the study; (2) study endpoints must be transitory, that is, must reflect temporary physiological changes (e.g., blood pressure) or relief of pain, rather than cure (or death); and (3) the investigational treatments must be able to deliver relatively rapid effects that are quickly reversible after their withdrawal. The latter point is especially critical. If the effects of the investigational interventions are permanent or more long lasting than anticipated, their carryover effects could compromise the validity of data obtained after the initial period (e.g., cause under- or overestimation of the efficacy of the second treatment) and undermine the efficiency of the study.
Although crossover studies can involve multiple periods and sequences, the most common is true-experimental design #4, the two-period crossover design, illustrated symbolically above. When this approach is used to test the efficacy and safety of different investigational drugs, subjects normally will undergo a run-in period during which noninvestigational medications are discontinued and a suitably long washout interval between the two active treatment periods, A and B (the latter guided by the bioavailability of the drugs), so as to minimize the likelihood of carryover effects. Typically, half of the sample initially receives the first drug, denoted by X1, and the other half initially receives the second drug, denoted by X2. Following the washout, study subjects who received the first drug are given the second drug, and vice versa, resulting in a fully counterbalanced design. Observations are recorded pre- and postdrug administration in the two treatment periods, denoted by O. The symbol R to the left of the diagram indicates that the order of initial treatment assignment is allocated at random to counter possible order effects. An example of a study employing a crossover design was conducted by Seabra-Gomes et al. [26] who evaluated the relative effects of two antianginal drugs on exercise performance in men with stable angina (summary and design structure are given in Fig. 5.7).
In this study, X1 denotes isosorbide-5-mononitrate and X2 stands for isosorbide dinitrate. O1-O3 are the outcome variables measured among patients receiving X1 during period A; O4-O6 are the same variables measured during period B. O7-O12 are the outcome variables measured among patients initially receiving X2. R indicates that the order of the initial drug assignments was randomly allocated.
As with all other true-experimental models, internal validity is very well controlled with this design. Selection bias is eliminated because study subjects are their own controls and comparisons of outcomes are made within rather than between
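The within-subject logic of the two-period crossover design can be illustrated with a short Python sketch. It is offered only as an illustration under stated assumptions: the function names (`allocate_sequences`, `paired_t`) and the exercise-duration values are hypothetical and are not taken from Seabra-Gomes et al. [26], and the calculation deliberately ignores period and carryover effects, which a full crossover analysis would need to examine.

```python
import math
import random
import statistics


def allocate_sequences(subject_ids, seed=0):
    """Randomly split subjects between the two treatment orders.

    Half of the subjects receive X1 in period A and X2 in period B;
    the other half receive the reverse order (counterbalancing).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    ids = list(subject_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {
        sid: ("X1", "X2") if position < half else ("X2", "X1")
        for position, sid in enumerate(ids)
    }


def paired_t(outcomes_x1, outcomes_x2):
    """Return (mean difference, t statistic, degrees of freedom).

    Each subject serves as his or her own control, so the statistic is
    computed on within-subject differences (X1 minus X2).
    """
    diffs = [a - b for a, b in zip(outcomes_x1, outcomes_x2)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff, mean_diff / std_err, n - 1


# Hypothetical exercise durations (seconds) for six subjects under each drug.
exercise_x1 = [420, 380, 510, 460, 390, 445]
exercise_x2 = [400, 370, 480, 450, 385, 420]

sequences = allocate_sequences(range(1, 7))
mean_diff, t_stat, df = paired_t(exercise_x1, exercise_x2)
print(sequences)
print(f"mean difference = {mean_diff:.1f} s, t = {t_stat:.2f}, df = {df}")
```

Because each subject contributes a single difference score, between-subject variability drops out of the comparison, which is the source of the statistical efficiency attributed to this design.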
clinical and other health-related interventions on group outcomes are the following:
1. The nonequivalent control group design
2. The time-series design
3. The multiple time-series design
The first design can be used to evaluate the impact of an intervention using a single before and after assessment of the dependent variables in two or more comparison groups. The second uses multiple assessments, conducted over time, of the dependent variable in a single group of subjects. The third (a combination of quasi-experimental designs #1 and #2) includes multiple assessments, again over time, but in two or more parallel groups. Because the observations in designs #2 and #3 are "broken up" by the imposition of the intervention, both also are termed interrupted time-series designs. (The reader is referred to Kazdin [27] or to Janosky et al. [28], for a detailed discussion of other quasi-experimental designs used for research with single or small groups of subjects, and to Stanley and Campbell [1], Cook and Campbell [2], and Shadish, Cook, and Campbell [29], for additional quasi-experimental designs used with larger groups or populations).
Quasi-Experimental Design # 1
The Nonequivalent Control Group Design
O1 X O2
------------------
O3   O4
The nonequivalent control group design (also termed the nonequivalent comparison design) compares outcomes among two or more intact groups, at least one of which receives the intervention; another serves as the control. This design is most useful when concurrent comparison groups are available, when random allocation to treatment condition is not possible, and when pretesting of the dependent variable can be performed so that baseline similarity of the comparison groups can be evaluated. It is commonly used when comparison groups are spontaneously or previously assembled entities (e.g., different clinics, wards, schools, or geographic areas) or when logistic difficulties preclude random allocation to treatment within the same entity.
The basic structure of this design is symbolized above. It is almost identical to the pretest-posttest true-experimental control group design except that study subjects are not randomly assigned to treatment groups; therefore, the groups cannot be assumed to be equivalent before the intervention. As before, X symbolizes the intervention, O denotes the pre- and post-intervention assessments in each of the comparison groups, and the dashed line (and absence of R) indicates that intervention was applied to an intact group (i.e., allocation was not random). Steyn et al. [30] used a nonequivalent control group design to examine the intervention effects of a community-based hypertension control program (the Coronary Risk Factor Study [CORIS]) that was introduced for 4 years among white hypertensive residents in two rural South African towns (summary and design structure are given in Fig. 5.8).
In this study, O1, O3, and O5 represent baseline systolic blood pressure and diastolic blood pressure in the intervention and control towns; O2, O4, and O6 represent post-intervention blood pressures in these towns. X1 represents the low-intensity hypertension reduction intervention, X2 represents the high-intensity intervention, and the absence of X denotes the lack of intervention (the control). The dashed line indicates intact (nonrandom) treatment assignment.
Because allocation to the intervention was not performed randomly, confounding variables may have influenced the observed outcomes. Therefore, internal validity is not as well protected as it is with true-experimental design #1 (the pretest-posttest control group design), which has a similar structure but includes random allocation. The greatest potential threat to internal validity with this design is differential selection, which could cause the comparison groups to vary on key factors related to the dependent variable; if present, selection bias could interact with other potential biases such as maturation (e.g., a sicker group could have disease that might progress more rapidly) or regression (if one of the two groups were chosen on the basis of extreme values). Selection bias can occur if the investigator evaluates the intervention in two intrinsically
dissimilar populations or uses a nonuniform subject recruitment approach (e.g., permits subjects to self-select their treatment assignment). However, if care is taken to avoid these practices, the availability of baseline measures of the dependent variable, a critical component of the nonequivalent control group design, permits the investigator to evaluate the extent and direction of a potential selection bias and to minimize it, as appropriate, through covariance analysis. Therefore, this design affords much greater control for selection bias than pre-experimental design #3 (the static-group comparison) which also contrasts outcomes across intact groups, but which lacks critical baseline data needed to establish initial comparability. Where pre-intervention data show relative comparability between groups on relevant variables, the nonequivalent control group design generally is appropriate; when pre-intervention comparability is not present, an alternative design should be used. In the CORIS study, the authors state that the groups had similar blood pressures prior to the intervention. Thus, it is not likely (though, certainly, it is not impossible) that the differences found after the intervention were attributable to selection bias. The inclusion of baseline measures also permits the investigator to evaluate the potential threat of experimental mortality (attrition bias). If there were losses to follow-up among the comparison groups, their potential impact could be evaluated by comparing baseline characteristics of those who withdrew with those who completed the study. The authors of CORIS, who performed this analysis, found that those who withdrew were similar to those who remained with regard to age, gender, initial cholesterol levels, blood pressure, body mass index, and smoking behavior. Thus, the potential threat of experimental mortality was effectively ruled out.
In the absence of differential selection and a hypothesized interaction between selection and the day-to-day experiences of the subjects, history effects are not plausible as an alternative (rival)
explanation for the observed outcomes and, thus, also can be ruled out as a major potential threat to internal validity when using the nonequivalent control group design. The reason is that, barring evidence to the contrary, external events occurring in one comparison group should be just as likely to occur in the other when subjects are evaluated in parallel. However, as with true-experimental designs, the burden remains with the investigator to ascertain the degree to which other relevant events may be occurring in the intact group settings that might also affect outcomes; this is especially important when comparators are geographically separated, as in this study. Also, because groups are studied in parallel, internal validity threats such as maturation, testing, instrumentation, and regression effects are fairly well controlled (again, assuming the groups share common baseline characteristics). Finally, any potential biases associated with expectancy are not inherently greater than those found with true-experimental designs and may be reduced, at least in part, by uniform standards for data collection and analysis (as was done in CORIS).
As with true-experimental design #1, the use of pre-intervention testing (essential with this design for establishing baseline comparability of the comparison groups) may pose a threat to external validity unless the testing itself were deemed to be part of the intervention, as it would appear to be in the CORIS study. Additionally, as with any design, a selection-treatment interaction can occur if the study subjects are not representative of all subjects who potentially could be studied. Indeed, the authors of CORIS recognized that their findings did not necessarily apply to individuals of ethnic backgrounds and socioeconomic statuses not included in CORIS. In general, however, the nonequivalent control group design places far fewer restrictions on sampling and, therefore, tends to be more generalizable than the typical randomized parallel group trial. Lastly, the reactive effects of experimental arrangements potentially could limit the external validity of studies using this design, but because they entail comparisons of interventions applied to naturally occurring groupings, they tend to be less reactive and, thus, have better external validity than most true experiments.
Quasi-Experimental Design # 2
The Time-Series Design
O1 O2 O3 O4 X O5 O6 O7 O8
The previous example compared the impact of an intervention on outcomes using several intact groups. Occasionally, an investigator planning to evaluate an intervention may be unable to identify a suitable (or any) comparison group. This might occur when patients are candidates for a treatment, the effectiveness of which is to be tested, but an alternate treatment is not available, or if available, is viewed as unacceptable by the patients or their physicians; a similar problem frequently occurs when a specific treatment cannot be withheld for what are considered ethical reasons. Thus, sometimes, interventions must be presented to entire groups, for example, all patients potentially at risk. In these cases, an investigator might opt for a pre-experimental design without a control group (e.g., the pretest-posttest only design), in which a single group of study subjects is observed on just one occasion before and after the intervention, or might compare results obtained in study subjects with external or historical controls. The literature reflects many such examples. Unfortunately, as noted earlier, pre-experimental designs provide very poor control against important threats to internal validity, and comparing results from a current treatment group with those obtained among historical controls is almost always biased in favor of the former, principally due to improvement in the general health of the population over time.
The time-series design (sometimes called an interrupted time-series) represents an improvement over both of these pre-experimental approaches. In its simplest form, multiple observations (the number depending on the stability of the data) are generated for a single group of subjects both before and after application of an intervention. The objective of any study using such a design is to provide evidence that observations made before (and sometimes after) imposition of the intervention differ in a consistent manner from
104 P.G. Supino
hospitalizations is based on data patterns that conform to the inverse of those shown in Fig. 5.9, line B (i.e., changes on the dependent variable contemporaneous with the intervention that return to baseline after termination).

In both of these studies, the threats of selection bias and experimental mortality are controlled, provided that the same subjects participate in each of the pre- and post-intervention assessments. Since this is rarely the case in community-based studies, the investigators must take steps to evaluate natural migratory patterns within the community to ensure that these do not confound their results. Dynamic changes within subjects or populations (i.e., maturation effects), if any, usually are well controlled with time-series designs because they (like regression effects) are unlikely to cause variations that occur only when the intervention is applied. For similar reasons, the time-series design controls for testing effects even in cases in which the measurement process is more obtrusive than that used in the Delate and Reding studies.

The chief potential threat to internal validity of studies using time-series designs is history. Because human subjects rarely are studied in a
vacuum, the investigator must be on the alert for outside influences (e.g., programs, policy changes, or even seasonal fluctuations) occurring coincident with the intervention that also might affect study outcomes. For example, to accept Delate's conclusions, one would have to believe that there were no other factors (e.g., changes in physician prescribing patterns, advertising campaigns) to which the subjects were exposed that would have caused them to use fewer PPIs during the post-program period. Similarly, the Reding conclusions are tenable only if one accepts that nothing else (such as another psychiatric intervention or availability of new treatments, etc.) occurred in Kalamazoo County specifically during the tenure of the mobile psychiatrist that also might have reduced admissions to state hospitals. If careful documentation by the investigator rules this out, then history effects become a less plausible alternative hypothesis for the observed changes. A second internal validity threat is instrumentation. If the calibration of an objective measure (or the instrument itself) changes during the study, and if this change occurs when the intervention is applied, then it is difficult to know whether the observations made after the intervention are due to it or to changes in the instrument. The same problem may occur when measurement criteria or outcome adjudicators change in parallel with the intervention, especially when the latter are aware of the study hypothesis. With administrative data, there is always a chance that the methodology used for record keeping might spuriously influence outcomes. For example, a change in the coding of diagnosis-related groups (DRGs) during an intervention might lead the investigator to conclude incorrectly that there were more (or fewer) hospitalizations for a given disease during this interval. To minimize these potential effects, the investigator should endeavor to standardize measures and educate research personnel about such issues. Finally, whenever possible, steps should be taken to blind those interpreting outcomes to knowledge of the treatment period to reduce the influence of expectancy on these assessments.

As with all designs that evaluate change over time, the use of multiple observations, if obtrusive, could compromise external validity by sensitizing subjects to their treatments. The potential for a testing-treatment interaction (or testing reactivity) is heightened with a time-series design because multiple pre-intervention assessments are required to establish the stable pre-intervention pattern against which changes in slope and/or intercept of the post-intervention assessments are compared. For this reason, studies using these designs generalize best when performed in settings in which data are collected as part of routine practice. Additionally, when based on natural experiments, like those reported by Delate and Reding, they cause few, if any, reactive effects because the interventions are experienced as part of the subjects' normal environment. As with any design, however, the ability to generalize outcomes depends on the similarity of the study group to the reference population.

Readers with clinical experience may recognize a variant of the time-series design in which an intervention is reintroduced after one or more intervals of withdrawal. In behavioral research with single subjects or with series of subjects (e.g., studies designed to extinguish inappropriate actions among children with autism or adult schizophrenics or to improve task performance in the setting of attention deficit hyperactivity disorder), this approach is termed an ABAB Design, where A and B respectively denote alternating control and intervention periods. (It is called a BABA Design when the sequence begins with the intervention, followed by its withdrawal and reintroduction, etc.) In other specialties, it is more commonly termed an equivalent time samples design or a repeated treatment design. This general approach has greater control of history and instrumentation effects than the classic time-series design because the probability of some external event or unintentional instrument or observer change tracking with (and accounting for) the effects of intermittent applications of the intervention is arguably lower than it would be when only one application of the intervention is involved. It can be particularly useful as the basis for relatively rigorous determination of the effects of pharmacological therapies (particularly adverse outcomes of chronically employed drugs), when
5 Fundamental Issues in Evaluating the Impact of Interventions: Sources and Control of Bias 107
such effects are predictably transient or reversible in nature. For example, with age, individuals tend to perceive arthralgias and myalgias with relative frequency. Hypercholesterolemia is fairly widespread according to current epidemiological definitions, and the prescription of HMG-CoA reductase inhibitors (statins) to control cholesterol is quite common. The drugs have been well demonstrated in RCTs to reduce coronary disease events and, specifically, mortality, among patients so treated. In some patients (the minority), statins also can cause myalgias and, in fewer still, polyserositis with arthralgias. Most patients are aware of these potential problems from constant reference to them in the news media and often ascribe their symptoms to the statins because of expectancy. Thus, when patients complain of myalgias and/or arthralgias while taking statins, it is incumbent upon the physician to determine whether the association truly is cause and effect. The best approach is to employ an equivalent time samples design, beginning with a careful history of current symptoms on drug (O) followed by withdrawal of sufficient duration to allow drug effects to dissipate, another careful history, and then reinstitution (rechallenge) with the drug, with another O after some period of use. If the result is unclear, the series can be repeated. Unfortunately, in the real world, patients tend to confound outcome by interposing anti-inflammatory drug use concomitantly with cessation of the statin and often refuse the rechallenge. Nonetheless, this example illustrates the importance of understanding and applying the principles of study design in the course of clinical practice. (For further details about the pros and cons of this design as a tool for research and methods for implementing it in clinical populations, the reader again is referred to the works of Campbell and Stanley [1], Cook and Campbell [2], Kazdin [27], Janosky et al. [28], and to Haukoos et al. [33].)

Quasi-Experimental Design #3
The Multiple Time-Series Design

The multiple time-series design combines the unique features of nonequivalent control group and time-series designs to maximize internal validity. It evaluates relative change over time on one or more dependent variables in two or more intact comparison groups (again, usually preexisting groups assembled for other purposes) at least one of which receives an intervention and one of which does not (the control). Thus, this design creates two experiments, one in which the intervention is compared against a no-intervention control and the second in which pre-intervention time-series data are compared with those obtained after the intervention, thereby increasing the amount of available evidence to buttress a claim of an intervention effect. In its most general design structure, shown above, X symbolizes the intervention (applied within one of the groups), O is the pre- and post-intervention assessment of the dependent variable(s), and the dashed line denotes the intact nature of the comparators. The design is most appropriate when it is not possible to randomly allocate subjects to an intervention, when a concurrent no-intervention group is available for comparison, and when serial data can be (or have been) generated for both groups during the pre- and post-intervention periods. As for the nonequivalent control group design, the availability of baseline data is necessary to evaluate initial comparability of the intervention and control groups. The multiple time-series design was used by Holder et al. [34] to evaluate the effects of a community-based intervention on high-risk drinking and alcohol-related injuries (summary and design structure are given in Fig. 5.12).

In this study, X represents the community-based alcohol deterrence intervention; O (made approximately monthly over a 5-year interval) denotes average (1) frequency of drinking, (2) number of alcoholic drinks consumed per drinking occasion, (3) instances of driving while intoxicated, (4) motor vehicle crashes (daytime, DUI-related, nighttime injury-associated), and proportion of (5) emergency room and (6) hospital admissions for violent assault among the
Take-Home Points
• The ability to draw valid inferences from data is the cornerstone of research and provides the basis for understanding the new knowledge that research results represent.
• Internal validity reflects the extent to which a manipulated variable can be shown to account for changes in a dependent variable. It is indispensable for interpreting the experiment.
• Ten common threats to internal validity include selection bias, history effects, maturation effects, testing effects, instrumentation effects, statistical regression, experimental mortality, interaction of these factors, experimenter bias, and subject expectancy effects.
• Four threats to external validity (generalizability) are reactive effects of testing, interactive effects of selection and treatment, reactive effects of experimental arrangements, and multiple treatment interference.
• A variety of research designs can be used to evaluate interventions. Each differs in its adequacy for ensuring that valid inferences are made about effects and generalizability.
  • The poorest for controlling threats to internal validity are termed pre-experimental designs. These lack adequate control groups.
  • The strongest are termed true-experimental designs. They incorporate control groups to which subjects have been randomly allocated but may suffer from lack of generalizability.
  • Quasi-experimental designs represent a good compromise when randomization is not possible.
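
As a supplement to the chapter's discussion of the time-series design, the comparison of post-intervention observations against the pre-intervention pattern in terms of "changes in slope and/or intercept" can be sketched as a segmented regression. The following is a minimal illustration only, not a method taken from the studies cited in this chapter. The function names, the data, and the intervention time are all hypothetical, and the fit is ordinary least squares, which ignores the serial correlation and seasonal effects that a real time-series analysis would need to address.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_segmented(y, start):
    """Least-squares fit of y[t] = b0 + b1*t + b2*post + b3*(t - start)*post,
    where post is 1 once the intervention is in effect (t >= start) and 0 before.
    b2 is the change in level (intercept); b3 is the change in slope."""
    rows = []
    for t in range(len(y)):
        post = 1.0 if t >= start else 0.0
        rows.append([1.0, float(t), post, (t - start) * post])
    k = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * obs for r, obs in zip(rows, y)) for i in range(k)]
    b0, b1, b2, b3 = solve_linear(xtx, xty)
    return {
        "baseline_level": b0,
        "baseline_slope": b1,
        "level_change": b2,
        "slope_change": b3,
    }


# Hypothetical, noise-free series: 8 pre-intervention observations rising by
# 1 unit per period from 50, then a drop of 10 units at the intervention
# (period 8) followed by a decline of 0.5 units per period.
series = [50.0 + t for t in range(8)] + [48.0 - 0.5 * (t - 8) for t in range(8, 16)]
fit = fit_segmented(series, start=8)
for name, value in fit.items():
    print(f"{name}: {value:.2f}")
```

Because the invented series contains no noise, the fit should return values very close to a baseline level of 50, a baseline slope of 1, a level change of -10, and a slope change of -1.5. With real data, a history or instrumentation effect coinciding with the intervention would produce the same pattern, which is why the design-level safeguards described in the chapter remain necessary.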
P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology: A Guide for Clinical Investigators, 111
DOI 10.1007/978-1-4614-3360-6_6, Phyllis G. Supino and Jeffrey S. Borer 2012
112 J.A. Franciosa
important enough role in a disease such that it might be a therapeutic target.

In addition to stating the broad programmatic goal of the proposed research, the statement of the hypothesis also presents a more specific broad objective of the research followed by some more detailed specific aims of the research. For example, a broad objective might be to test the hypothesis that a new drug improves symptoms in patients with the disease of interest to the overall research program. The specific aims might be to determine whether certain of those symptoms improve by a specified amount over a specified period of time without producing major side effects. The specific aims typically include major outcomes (primary endpoint[s]) that essentially drive the study design and other outcomes of lesser importance (secondary endpoints) that provide supportive information, as will be discussed in greater detail below.

The statement of hypothesis should be succinctly phrased and should provide a basis for the overall study design being employed to test it, i.e., to determine whether the hypothesis is supported by the study results. As noted in Chap. 3, the operational restatement of the hypothesis should, at minimum, clearly identify the patient population, intervention (if any), primary endpoint, key methods, duration, and anticipated outcomes.

Significance of the Research

The Introduction should conclude with some discussion, even if largely speculative, about the significance of the proposed research and its possible outcomes. If the hypothesis is confirmed, what does that mean in terms of the initial objectives? Is it conclusive or does it indicate a direction for future research? Results which are not confirmatory may lead to outright rejection of the hypothesis or may imply a need for modification of the research approach. Finally, some findings of the study may generate new hypotheses to be addressed by future research.

Table 6.2 Components of the study design summary
Statement of study type (e.g., controlled clinical trial)
Overview of study design
  Parallel-group, crossover
  Level of blinding (e.g., open-label, single-blind, double-blind)
  Method of treatment assignment (e.g., randomization, stratification)
Statement of treatment/intervention to be used
  Investigative drug or device
  Dosage of drugs or usage of devices
  Type of control (e.g., placebo, active drug, no treatment)
Description of study population
  Planned sample size
  Source of patients
  Number of centers
  Note any unique patient characteristics (age, race, sex) required
Description of the disease or condition being studied and any characteristics of that disease/condition that might affect patient eligibility or study outcomes
  Duration
  Etiology
  Severity
  Treatment
Sequence and duration of study visits
Description of study endpoints

Overview of Study Design Summary

It is common practice and helpful to include an overall summary or synopsis of the study design before embarking on the detailed discussion of the various protocol components that will ensue. This summary is especially useful to certain reviewers, e.g., research administrators, funding agency officials, or institutional review board (IRB) members, who may not be scientists or may not require the level of detail of the full protocol in order to perform their specific review or critique functions. Thus, this section is typically very brief and to the point, as details of everything addressed here will be provided in the sections that follow. Table 6.2 shows the key components of this summary.

The summary should include a statement of the nature of the study design (e.g., whether it is controlled or uncontrolled, parallel or crossover,
blinded or unblinded, and the number and nature of treatment arms). A brief description of any randomization methods should be provided (the details of which should be given in the Statistical Considerations section). It also should indicate the number of centers involved (single or multicenter), total number of patients to be enrolled, and the geographical area included, e.g., United States, North America, Europe, China, or a region of a country. The study population should be characterized, especially any unique demographic characteristics, e.g., women only, African-Americans only, or a certain age group. In addition to patient demographics, a brief description of their underlying disease condition being studied should be mentioned along with any important information about the current status, duration, severity, and treatment of the condition that might affect patient eligibility as well as outcomes. The active intervention being tested, along with any control interventions, should be briefly described. In addition, the frequency and duration of the intervention should be stated along with the total study duration, which may be longer than the intervention period. Finally, the primary study endpoint should be described along with a statement about how it will be assessed, when it will be assessed, and how often it will be assessed. Key secondary endpoints may be simply listed.

Endpoints

It is desirable to present the study endpoints early in the protocol, as these tend to drive the rest of the study design, which is developed to measure an effect on those same endpoints. Thus, the sample size, methodology, duration of study, and analytical methods are all influenced by the choice of endpoints.

The endpoints are defined as primary and secondary. The primary endpoint is usually a single one, though it may include two endpoints, or may consist of a single composite endpoint made of two or more components. It is important to strictly limit the number of primary endpoints, as attempts to address multiple primary endpoints almost invariably lead to methodological inconsistencies and difficulties, resulting in a trial that fails to achieve any meaningful result in terms of primary endpoints. The primary endpoint(s) should be specifically defined, along with an explanation of how and when it will be measured. The secondary endpoints may be more numerous than the primary ones. They may represent additional measures of efficacy or safety but also may be included for other reasons such as exploration of mechanisms, particular safety concerns, and development of data for future research. The secondary endpoints also should be specifically defined, and the timing and methodology of their measurements should be briefly stated.

Factors considered in the selection of endpoints (especially the primary endpoints), such as relevance, practicality, acceptability, validation, and experience should be discussed. Clearly, it is necessary to establish that the endpoint chosen is relevant to the patients and conditions being studied; that is, it addresses real and significant needs such as improving symptoms, survival, diagnosis, or other outcomes. In addition, the endpoints should be practical, not only by addressing real needs but by utilizing readily applied methods of objective measurement. Furthermore, the methods used must be acceptable to both investigators and patients in terms of ease of application, safety, comfort, and cost. Optimally, they should be standard methods that are appropriate for the group under study to avoid the necessity of validating them, which usually must be done in separate preliminary studies [3]. Validation involves establishing (via the literature or the investigators' own work) that the proposed methods perform as intended in both the patients and conditions being studied. The investigators must indicate that they have sufficient experience with the successful use of the proposed methods. Finally, it is critical that there be a consensus regarding study endpoints among all investigators, study administrators, and committees before the study starts in order to avoid disputes when the final results become available [4]. Table 6.3 lists guidelines for describing the key components
6 Protocol Development and Preparation for a Clinical Trial 115
of primary study endpoints; secondary endpoints should follow this same sequence, though with less detail.

Table 6.3 Primary study endpoints
State the primary study endpoint(s)
Briefly mention the appropriateness and relevance of the endpoint
Describe the methods, timing, and frequency for assessing the endpoint
As needed, describe any special personnel performing the assessment (e.g., an unblinded assessor in a double-blind study)
Additional details about collecting endpoint data may need to include:
  Details about the use of subjects' diaries
  Any instructions on timing/conditions of assessment
  Details about unusual collection, storage, or analysis of laboratory samples
Provide information about the standardization and validation of the methods to be used for endpoint measurement
Describe the investigators' experience with the methods to be used
As needed, describe any training that might be required in using the methods for endpoint measurements

It should be noted that endpoints, as discussed above, refer primarily to clinical trials. Other kinds of studies, such as nonprospective observational studies that evaluate associations or distributional characteristics (e.g., prevalences) rather than intervention effects may not employ endpoints as described above for their study objectives. Observational studies are discussed in greater detail in Chap. 4.

Study Population

This section is a detailed description of the patients/subjects to be included in the study and should provide a broad description of the study population, the source of patients, and a comprehensive listing of the inclusion (eligibility) and exclusion criteria for study participation.

Although the terms "patients" and/or "subjects" often are used interchangeably or may be established according to convention of the sponsoring group, we prefer to use the term "patients" for those individuals with a medical diagnosis or condition that is the target of the proposed research. We reserve the term "subjects" for normal healthy individuals that typically are included in some studies as the control population but who also may represent the primary population, e.g., in studies of the clinical pharmacological properties of a new drug before it is given to patients.

General Description of the Study Population

The study population should be described in terms of its general demographics, as well as the characteristics of the disease or condition being studied that the patients should have, along with the number of such patients that will be recruited and enrolled. The demographic characteristics typically describe the sex and age group of patients and, if appropriate, their race. If any of these characteristics are particularly restrictive, the reason for that restriction should also be given. For example, if one is studying only Asian females in their 20s, the reason for focusing on that population should be presented. In many instances, this may have been addressed in the introductory sections and need not be gone into in great detail in this section. The selection of these demographic characteristics (especially age) should not be taken lightly, as they may have important effects on adverse events and study outcomes [5]. In fact, it has been suggested that these kinds of patient characteristics may impact study results more than other features of the study design itself [6]. These characteristics will be expanded upon in greater detail as needed in the list of inclusion/exclusion criteria, as discussed below. The medical condition these patients must have in order to participate in the study also should be described in terms of its diagnostic criteria, duration, etiology
(if appropriate), treatment, present status, and severity. If normal subjects are included, then operational criteria for defining the normal subject also must be presented. Subjects may be required to be completely normal, with no significant past or current medical conditions, especially if these subjects constitute the primary study population. If normal subjects are included as a control group, they may only be required to be "relatively normal," i.e., they should not have the same disease as the other patients in the study. These disease characteristics will be expanded upon in greater detail in the list of inclusion/exclusion criteria. This section also should include a description of the number of patients to be studied. Whereas a sample size estimate typically is included in the statistical analysis section (see below), that estimate usually refers to the number of patients needed to complete the study. Since, typically, some patients fail to complete a trial for several different reasons, it is necessary to try to estimate the total number of patients that will be recruited in order to achieve the number needed to complete the trial. Depending on the disease, study population, and treatment, patients may drop out of the trial for many reasons, including death and side effects of the treatment. In addition to these reasons, which will vary, some patients withdraw consent, move, or just never return for follow-up. The investigator must make every attempt to estimate the number of expected dropouts and decide what to do about them, i.e., to replace them or not in the study. It is critical to estimate the number of patients that need to be recruited not only in order to achieve the desired number of study completers but also to properly estimate resource needs, e.g., study medications, case report forms, and laboratory supplies.

Patient Sources

The techniques to be used for recruiting patients for the trial should be discussed in detail in this section. One should describe the number and location of investigative sites that will provide patients and/or participate in the trial. Not all sites may actually have study investigators; some may serve only as sources that will identify and refer patients to an investigator's site. Methods to be used for finding patients should be described. These may include various ways of publicizing the study, ranging from notices within the local institution to advertising in various media. These techniques and the individuals responsible for implementing them should be described. It also is necessary to describe how patients, once identified, will be further screened and by whom. A detailed description of the screening process to determine eligibility should be included, listing the specific initial parameters that will be used preliminarily to identify potentially eligible patients. It is common practice to identify patients who meet initial screening criteria by history, then follow them for a brief interval to determine whether they subsequently meet all study eligibility criteria. For example, in a study of treatment of hypertension, patients initially may be screened on the basis of having a history of hypertension or of having a single reading of elevated blood pressure. Typically, such patients would be followed for a limited period to see if they, in fact, do currently have hypertension.

The location of screening procedures should be specified. This could involve screening of clinic records, emergency room logs, diagnostic laboratory reports, etc., depending on the population being sought. For example, in a study of patients with documented coronary artery disease, one might screen the cardiac catheterization and intensive care unit logs. The protocol should describe who will do this, when it will be done, and how it will be done. Unlike some sections of the protocol (e.g., endpoint definitions, patient inclusion/exclusion criteria, and analytic methods to be used), the screening procedures are not carved in stone and may be modified as needed.

For a more detailed description of recruiting techniques and the many issues that may become involved, the reader should consult standard references and the medical literature [1, 7–11].
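
The recruitment estimate called for under General Description of the Study Population (the number of patients to recruit so that the required number complete the study) can be sketched with simple arithmetic. This is a minimal sketch under a stated assumption: dropouts are not replaced and occur at a single expected overall rate. The function name and the example figures are hypothetical, and the chapter itself does not prescribe a formula.

```python
def patients_to_recruit(completers_needed: int, expected_dropout_percent: int) -> int:
    """Smallest enrollment N such that N * (1 - dropout rate) >= completers_needed.

    Assumes one overall expected dropout rate, given as a whole-number
    percentage, and that patients who drop out are not replaced.
    """
    if completers_needed < 0:
        raise ValueError("completers_needed must be non-negative")
    if not 0 <= expected_dropout_percent < 100:
        raise ValueError("expected_dropout_percent must be in the range 0-99")
    remaining_percent = 100 - expected_dropout_percent
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-completers_needed * 100 // remaining_percent)


# Hypothetical example: the sample size estimate calls for 200 completers
# and the investigator expects 20% of enrolled patients to drop out.
enrollment_target = patients_to_recruit(200, 20)
print(enrollment_target)
```

With these invented figures the enrollment target is 250 patients, since 250 × 0.80 = 200. The same target can then be used when estimating resource needs such as study medications, case report forms, and laboratory supplies.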
actually has the medical condition required for study participation. A run-in period also may be used to demonstrate that a patient has the required status of the condition being studied. For example, it may be required that a patient have stable symptoms while taking all standard treatment for the condition in order to minimize difficulty in interpreting changes in the patient's condition after starting active treatment. If the patient was not stable or if other treatments were started after the study intervention, it would be extremely difficult to assess the cause of a change in the patient's condition. Another common reason for using a run-in is to assess the tolerability of the study intervention. A patient may have difficulty complying with an intervention if it produces significant side effects or is difficult to administer. Furthermore, patient compliance may be influenced by other patient conditions or behaviors, e.g., substance abuse or alcoholism. A run-in period may be useful to assess the patient's likelihood of complying with and completing all study requirements.

Treatment during run-in periods may vary. If the purpose is only to acquire final inclusion/exclusion information, no treatment may be needed. Obviously, if the purpose is to assess stability and/or compliance with an intervention such as a study drug, it would be necessary that it be given according to the same regimen that would be used in the active phase of the study. This phase usually involves either active study intervention in all patients if its purpose is primarily to assess tolerability or placebo in all patients to assess patient compliance for reasons other than tolerability of the intervention. Clearly, the patient is kept blinded to treatment if the active phase is to be double-blinded.

Finally, the duration of the run-in periods should be as short as possible, typically not more than 2–3 weeks. In general, less time is needed to obtain laboratory tests, and more time would be needed to assess tolerability or compliance. The problem with excessively long run-in periods is that patients may change during this time. In cases where a run-in period has had to be extended, it is common practice to terminate that patient from the study at that point and restart him/her as a new patient in the screening phase. Another potential risk and criticism of run-in periods is that they may introduce bias by selecting the better responders to the active study intervention [12].

Start of Study Treatment/Intervention

Once all inclusion criteria are satisfied and no exclusion criteria are met, whether at the end of screening or after a run-in period, the patient is ready to initiate study-mandated activities. At this time, the patient will be assigned his/her study treatment or intervention. If the study is not controlled, the patient will be started on the study intervention. If the study is controlled, the patient is randomized to his/her study treatment. The method of randomization, e.g., consulting a list, opening an envelope, or contacting a central randomization center, should be briefly described here. If the intervention being evaluated in the trial includes pharmacological therapy, the study drug may also be dispensed at this time or arrangements may be made for procuring it. The patient should be given any applicable instructions at this time and scheduled for the next clinic visit. Typically, the details of the randomization technique, and the administration and management of the intervention, respectively, are provided in the statistical and administrative sections of the protocol.

Schedule of Visits and Observations

The protocol must provide a schedule of patient visits with details about when these will be conducted and what information will be collected at each visit. This section is used and closely adhered to by study personnel, much as a recipe is followed by a cook. Study visits typically consist of a baseline or study initiation visit, follow-up interim visits, a final on-treatment study visit, and a post-study follow-up visit. It is important to specify the timing of these visits, with a window of plus or minus a small number of days, if possible, to allow the patient some
flexibility in scheduling appointments. Typically, the time is set relative to randomization or baseline, i.e., at some time ± a time window following the date of randomization or the baseline visit. The observations recorded at each visit often are variable, with fewer items observed at interim visits.

Baseline Visit

The baseline visit is performed at or very close to the time when the patients are randomized to study treatment/intervention, whether or not that treatment/intervention has actually been instituted. This is a critical visit as all observations recorded at this time will be the basis for comparison with all observations made while on study treatment. Thus, a complete medical history and physical examination usually are performed, along with laboratory tests. All concomitant medications are recorded with details about dose and duration of administration. In addition to this general medical examination, there is usually information collected that is specific to the status of the medical condition being studied, e.g., its duration, severity, history of complications, current symptoms and status, and current treatment. Any special tests, assessments, or procedures relating to study endpoints are carried out at this visit or are scheduled to be obtained very soon after this visit, if not yet already done. One cannot overemphasize the importance of all baseline determinations. They must be thorough and comprehensive, as any medical and/or laboratory findings that appear later must be ascribed in some way to study participation if they were not present at baseline. In trials evaluating experimental medications, the baseline visit is concluded by dispensing any study drugs or other required materials to the patient, scheduling the next visit, and arranging for any procedures or tests needed for the next visit.

Interim Visits

Following the baseline visit, the patient is seen at intervals specified in the protocol to occur at some set time, e.g., every 3 months ± 1 week from the date of the baseline visit. These interim visits primarily are intended to monitor the patient's progress and his/her tolerability of the study intervention. A brief medical history and physical examination are carried out, with the emphasis on looking for any adverse events or findings. Information on one or more study endpoints may be collected, but not necessarily the primary endpoint, especially if that involves a special procedure, e.g., cardiac catheterization, which might be done only at the end of the study or once during an interim visit. In trials evaluating medications, patient compliance usually is assessed, typically by having the patient bring any unused study medications with him/her and calculating the percentage of pills taken relative to those prescribed. The interim visit also is concluded by dispensing any study drugs or other required materials to the patient, scheduling the next visit, and arranging for any procedures or tests needed for the next visit.

Of course, patients may develop complications and may need to be seen between scheduled visits. All clinical trials must include provisions for patients to be seen by physicians who may be associated with the study in order to deal with clinical necessities whether or not a visit is specifically related to a protocol-based assessment. The reasons for, and findings obtained during, any unscheduled visits must be recorded as study data on appropriate forms.

Final Visit

The final visit is the last one during which the patient is still receiving the study intervention. Its observations include essentially the same as those obtained at the baseline visit and are just as critical since they represent the study results and outcomes that will be compared to those from the baseline visit. In addition, the same kind of information collected at the interim visits is obtained to cover the interval since that preceding interim visit.

Whereas a final visit is obtained routinely in all patients at the end of the study, it may be necessary to perform a final visit if a patient terminates his/her study participation prematurely, as might happen for intolerable side effects or other
reasons. In such cases, every attempt must be made to have the patient return and perform all the procedures and observations required at a regularly scheduled final visit. Without this, that patient's entire dataset may be useless and exclude the patient from the study analysis. In most instances, final visit data obtained even prematurely may still be analyzable and allow the patient to be included in the results.

At the end of the final visit, study drug/intervention is terminated, and the patient is scheduled for a study follow-up visit.

Post-study Follow-Up Visit

Often by regulatory requirement, but more in the interests of good clinical practice, patients should be seen at least once after completing their study participation to ensure that they are not experiencing any sequelae that might be attributed to their study involvement. Such visits usually are scheduled at 1 week to 1 month after the final on-treatment study visit, depending on the possible duration of effects of the study intervention. (As used here, the term "on treatment" means the patient is still receiving a study-mandated intervention, regardless of whether he/she is receiving active therapy or an inactive control substance [or other control condition].) In some instances, especially by regulatory requirements, it may be mandatory to attribute any side effects or complications occurring during this period to the study intervention. These post-study visits also are of value in helping to document patient status and to protect all study personnel and institutions in the event of any future allegations stemming from the patient's study involvement. It is strongly suggested that a flow chart of all scheduled visits and related procedures be included, a template of which is shown in Table 6.5.

Data Management

A clinical trial, along with its data generation and acquisition, is driven by the thoroughness and objectivity of the research protocol. The research data to be generated, collected, processed, and stored in the clinical database must support the objectives of the study, as specified in the protocol. This, in turn, relies on designing data management processes that correctly capture the required research data. All data generated by the trial must be captured and managed to ultimately yield the results of the trial. Data management has been enhanced dramatically in recent years as a result of technological advancements including computerization of databases, bioinformatics, and Internet applications to facilitate
acquisition and processing of data [13-15]. As a consequence, modern data management processes involve specialized personnel and methods which are discussed in detail in Chap. 7. For all these processes to be properly carried out, it is necessary that a detailed, comprehensive, and unambiguous protocol be developed, as the protocol drives the data management processes which tend to follow the protocol in a chronological fashion. Obviously, the tools used for data collection will be developed in accordance with protocol specifications. Ideally, data management processes should be developed in advance of data collection because post hoc changes potentially introduce a risk of bias, threatening the validity and credibility of the results, as noted above.

The data management plan closely follows the structure and sequence of the protocol. A well-written data management section will provide detailed descriptions of each data item to be collected, how it will be collected, and when it will be collected. The data management group must work very closely with the team that is preparing the actual protocol to help ensure that all the data described are readily obtainable, complete, unambiguous, objective, and easily programmable and quantifiable. Furthermore, it must be ascertained that all of the data generation methods are generally accepted and that the research team is adequately experienced in using these methods so as to help ensure reliability and validity of the data obtained.

Whereas trials generally try to limit the amount of information collected to that which is necessary to obtain valid results, it is common to collect additional information, especially at baseline, because this is the last time one can make observations before the effects of the trial interventions come into play. Just being in a clinical trial may affect patient outcomes because of the level and frequency of care provided (see also Chap. 5). It is critical that every attempt be made to capture all the required data at the times specified by the protocol, as incomplete, inaccurate, and/or missing data can undermine the reliability and credibility of results. The completeness, accuracy, and timeliness of data collection are key indicators of the quality of conduct of the study; as such, they are commonly audited after study completion to help ascertain the validity and reliability of the study conclusions.

Safety Monitoring Procedures

A complete protocol should describe all procedures that will be in place to ensure and assess the safety of study participants. Whereas much of this information already is included in different parts of the protocol, e.g., on the schedule of visits and procedures, it is recommended that a specific section be devoted to summarizing all safety monitoring procedures. It should summarize how often patients will be seen, that an interim history and physical examination will be performed, and that laboratory tests will be obtained. It is important to point out any special visits, examinations, tests, or procedures that will be conducted specifically to look for known side effects of the treatment. For example, liver function tests would be obtained in a trial of a new drug suspected of possibly producing liver toxicity, or the eyes would be examined often in a trial of an intervention that could potentially be associated with cataract formation.

In addition to describing what will be done and how often, it is important to specify who is responsible for carrying out these procedures and what will be done with the information in case something is found, i.e., instructing the investigators whom to contact, how to establish contact, and the timeframe for making contact. It is important that all study personnel know what constitutes an adverse event or serious adverse event. These are not simply clinical impressions but are specifically defined by regulations. These regulations also establish what information about the adverse event must be collected (start date, duration, severity, drug dose, concomitant drugs, action taken, outcomes, etc.) and who must be notified within the specified time frame (other investigators, IRBs, study administrators, regulatory agencies, etc.). Instruction also should be provided to the investigators regarding possible discontinuation of the study drug, premature termination of
the patient's study participation, unblinding of any study medication, etc.

It is critical that all study personnel understand that an adverse event is any undesirable sign, symptom, or medical condition that occurs after starting study participation regardless of its relationship to the study intervention, i.e., even if a cause other than the study intervention is present. Any condition that was present before starting study participation must be considered an adverse event if it worsened. Furthermore, the seriousness of an adverse event is not synonymous with its severity or potential outcomes. An adverse event is considered serious if it (1) is fatal or life-threatening, (2) requires or prolongs hospitalization, (3) is significantly or permanently disabling or incapacitating, (4) constitutes a congenital anomaly or birth defect, or (5) requires medical/surgical intervention to prevent any one of the preceding. There is no mention of severity or potential seriousness. Thus, a severe symptom or abnormal laboratory finding that does not meet one of these criteria is not considered a serious adverse event.

Above all, it is critical that adverse events be looked for, recognized, recorded, and reported as quickly as possible to the appropriate study governing personnel to allow any necessary actions to be taken to safeguard all other study participants.

Ethical Considerations
(See Also Chap. 12)

The protocol must state that all patients will provide informed consent prior to being enrolled in the study. The consent form must be written in language the patient can fully understand and must contain certain elements. These include a description of the study; what is expected of the patient; what risks are involved with any tests, procedures, and treatments; what alternative treatments are available; and assurance that the patient will be given the best available treatment for his/her condition whether or not he/she chooses to participate initially or to terminate prematurely. The patient should also be informed that his/her study records may be reviewed by the sponsor, auditors, or other regulatory authorities and that his/her study information may be used in publications. In any of these instances, the patient must be assured that his/her identity will be kept strictly confidential. The process of obtaining informed consent offers an excellent opportunity to establish good communications and rapport between the patient and the investigators and, as such, may impact the study outcome [21-23]. It is important to recognize that consent for study participation contains important elements that distinguish it from consent to a procedure, be it a routine clinical procedure or one required as part of the study; thus, consent to participate in a research study should be obtained separately from other permissions obtained in caring for a patient [24]. The informed consent form itself is considered a part of the protocol. The protocol also should contain a statement that IRB approval will be obtained and that the investigators and all study personnel will obtain all periodic re-approvals and comply with all other requirements of that review board.

In addition, the protocol often includes a description of the investigators' responsibilities regarding patient safety. This description typically points out the research policies, regulations, and requirements of governmental, international, institutional, and sponsoring bodies. The investigators are required to comply with all of these. In addition, the investigators agree to accept full responsibility for protecting the rights, safety, and welfare of patients under their care during the study. The principles of good clinical practice mandate that the investigators provide the best available care, themselves or by appropriate referral, for any medically related problems that arise during the study, regardless of their relationship to the study itself.

Statistical Considerations

All protocols should contain a section that describes trial-specific statistical evaluation plans. For randomized controlled clinical trials, such considerations typically include (but are not limited to): the specific nature of the study design
and related issues, the specifics of the randomization procedure and rationale employed, justification of sample size and associated power (see also Chap. 11), the statistical analysis planned for assessing primary and secondary outcome measures, and a statement of the null hypothesis for primary efficacy comparison. When appropriate (e.g., a randomized controlled trial evaluating high-risk patients), this section also may articulate statistically-based stopping rules for premature termination of the study (e.g., early evidence of efficacy in the absence of safety problems).

Protocol Implementation and Study Conduct

Recent observations suggest that the conduct of certain types of clinical trials has decreased, raising concerns about adequacy of planning and implementation. For example, late phase clinical trials represented about 20% of all clinical trials in 1994 whereas in 2008, they accounted for only 4.4% of all clinical trials [16]. Possible reasons that may contribute to this apparent decline include inadequate organization and infrastructure, lack of coordinated research team effort, and insufficient training [16-18]. No matter how well a protocol is written, it is of little value if it cannot be implemented and carried out to completion.

Study Organization, Structure, and Administration

In addition to describing how the study will be done, protocols typically address issues which help safeguard the well-being of patients during their study participation, while ensuring the integrity and proper conduct of the study. Many of the topics discussed in this section are addressed at great length in other publications and reference materials which the reader should consult [1, 9]. We will focus here on some of these topics, especially those that are typically required for inclusion in a protocol by sponsoring institutions, funding agencies, and regulatory authorities.

Studies typically encounter unforeseen problems and questions during their conduct. In addition, some potential issues can be foreseen prior to study initiation; these need to be prospectively addressed so that solutions can be decided quickly according to plan should they, indeed, arise during the course of the study. Examples of such issues include endpoint criteria, rules for early termination of the study, need for protocol changes, etc. It is important for any study, and is mandatory for multicenter studies, that the protocol identify those individuals responsible for making decisions about the study's conduct. Thus, the protocol should specify the individuals and committees who are responsible for study leadership and charged with making the kinds of decisions mentioned above.

Multicenter studies should have a chairperson who is empowered to make and/or delegate day-to-day decisions regarding such things as deciding if a patient satisfies all inclusion/exclusion criteria or if a patient or center has violated protocol requirements, etc. In addition, there may be a steering or executive committee to address broader issues, e.g., protocol changes, and to address recommendations of any subcommittees. The subcommittees may typically include an independent data safety and monitoring board (DSMB) that periodically reviews study data to assess the need for possible premature termination of the study if a clear benefit or risk appears that makes it unethical to continue the study. Another subcommittee might analyze study endpoint outcomes, e.g., cause of death or reason for hospital admission. It is mandatory that subcommittees and committees prospectively define the rules and criteria to be used in arriving at any decisions they make and that information required to satisfy these rules be included as a part of the protocol. Subcommittees and other committees generally make recommendations to the steering or executive committee, which has responsibility for making final decisions based on those recommendations.

In summary, the leadership of the study is responsible for the general satisfactory conduct of the study in all of its aspects. This includes resource recruitment and allocation, providing
any training required, ensuring timeliness of patient recruitment, overseeing data management, and reporting of the results.

Resource Allocation and Management

Key resources include funds, manpower, and supplies. Funding may be available prior to study initiation in some settings with predetermined budgets, e.g., industry. In other settings, funding must be applied for, and its procurement often depends heavily on the quality of the research proposal and/or protocol. Once funds are secured, the study leadership must oversee their allocation, accountability, and continuing availability, as well as identify the individuals who will be responsible for these matters.

The success of the study also will depend on the availability of sufficient and qualified personnel to carry out all the required functions. For certain functions, especially those that might only be required from time to time to address specific issues that might arise, it may be preferable to use consultants. For example, if patient recruitment lags, the advice of persons specialized in recruitment techniques might be sought. It is critical that all personnel be qualified to carry out whatever responsibilities they are assigned and that the study leadership provides the proper training needed to ensure their qualifications.

Availability of all supplies needed to carry out the study is critical and may be a rate-limiting factor in starting and completing the study in a timely fashion. Obviously, the study cannot start without materials for gathering and reporting data, e.g., case report forms (see also Chap. 7). Similarly, study drugs and/or devices must be available and ready for use, i.e., properly coded and allocated for a randomized trial. Any supplies for laboratory tests and study procedures also must be available. Not only is it important that all supplies be available to start the study, but it also is necessary to assure that they will continue to be available throughout the study until its conclusion. A key responsibility of study leadership is to oversee the individuals responsible for ensuring timely availability of supplies. In addition, study leaders must be readily available to these same individuals to try to resolve any supply problems that might arise.

The protocol should contain information about study materials the patient will need, including study drugs, laboratory kits, questionnaires, diaries, etc. Information should be provided on who is responsible for procuring and dispensing these materials, how and where they will be procured, how they will be supplied (kits, bottles, etc.), how they will be labeled to correctly identify content and the study patient, and instructions for their use. There also should be a description of how the supplies will be stored. Finally, there must be an accurate inventory of all materials, with dates of receipt, dispensing, names of recipients, etc. There also must be a procedure for returning study material and recording their receipt. All of these records are mandatory for accountability of supplies and are subject to strict regulations, especially when any controlled substances are involved. This section is critical to the study sponsor who generally provides the materials and must be able to show that adequate instructions for their correct handling were provided to investigators.

Recruitment of Study Participants

The recruitment of eligible patients/subjects into the study in a timely fashion is one of the key rate-limiting processes that has a major impact on study results. Failure to recruit patients in a timely manner may have serious consequences by precipitating retrospective protocol changes, such as relaxing eligibility/exclusion requirements or modifying procedures and observations. Any such changes can significantly affect the study and potentially undermine its original intent and capacity to properly test the study hypothesis, thus yielding results that may not be valid and conclusive relative to the original intent. Failure to recruit patients quickly enough in sufficient numbers can lead to early termination of the study itself as well as discontinuation of its funding, thereby jeopardizing the power of the trial to
achieve its projected sample size needed to achieve statistically conclusive results.

Techniques for recruiting study subjects vary considerably and represent a specialized topic in and of itself [1, 19, 20] that is beyond the scope of this chapter. The study leadership must identify the individuals responsible for recruitment and provide them with adequate resources and training for whatever recruitment techniques are employed. The specific techniques to be used should be spelled out in detail in the protocol.

Numerous recruitment techniques are available and include screening subjects from (1) the local research site (office, clinic, hospital, etc.), (2) collaborating local sites, and (3) collaborating regional, national, and/or international sites. Within each of these sites, local areas of interest must be identified, e.g., office, laboratory, and emergency room. Screening-type trials seeking large or broad populations of subjects may establish recruitment centers in churches, schools, supermarkets, shopping centers, commercial establishments, etc., to identify appropriate patients. In addition, advertising through various media should be utilized to reach potentially eligible participants. Other sources are colleagues, bulletin board notices, direct mailings, and telephone screening [1]. The final decision regarding recruiting methods will depend on the overall number and kinds of patients/subjects needed.

Importantly, the duration of active recruiting efforts commonly is specified in a protocol. These timelines should be closely monitored and adjusted as needed by the study leadership.

is important to describe the procedures that these individuals will follow to ensure (1) adherence to the protocol, (2) provision of complete and accurate data, (3) response to queries, and (4) compliance with auditing. Instructions on record keeping and record retention should also be provided.

Monitoring techniques vary and may include simple periodic telephone or e-mail contact with mailing or electronic submission of study documents between investigator sites and the monitors. Monitors may visit sites on a periodic basis to retrieve and deliver study materials as well as directly observe the sites' performance. For a more detailed description of monitoring methods and procedures, the reader should consult standard references on the subject [1].

Data Acquisition and Processing

The principles of data acquisition and management are described in detail in Chap. 7. From the study conduct perspective, it is important that adequate numbers of qualified personnel are available for data processing and management. Furthermore, these individuals must have expertise or be trained in the required methods to be used for acquiring and processing data. Similarly, study leadership must ensure that all appropriate materials, especially equipment, hardware and software, are available to properly process the data.

End of Study Procedures
regulatory agencies, etc. Most importantly, it is strongly recommended that all final results be published. Only in this manner can the study be critically analyzed by all those with a stake in its outcome as well as be replicated if deemed desirable.

Overview of the Interventional Clinical Trial

Most of what is discussed above has derived from, and has been best defined by, interventional clinical trials which represent the culmination of clinical research and merit special consideration because of their impact on clinical research methodology. Interventional clinical trials are designed and conducted for the primary purpose of testing a treatment or management strategy in patients with a specific disease. Such trials typically are sponsored by large research organizations, such as the United States National Institutes of Health (NIH), or by private organizations such as pharmaceutical companies or medical device manufacturers.

An interventional clinical trial is a formal experiment designed to elucidate and evaluate the relative efficacy and safety of different treatments or management strategies for patients with a specific medical condition [25]. Healthy volunteers often are used in the early phases of assessment of a new therapy primarily to assure sufficient safety of an intervention before applying it to patients with the disease targeted by the intervention. Such studies typically involve establishing the proper dosing and/or administration of the intervention along with demonstrating that the intervention is tolerated well enough to permit further studies in patients. However, healthy human volunteers provide only indirect evidence of effects on patients. Therefore, ultimately, clinical trials of putative interventions must be conducted among individuals with disease. The results obtained from this limited sample then are used to make inferences about how treatment can be applied in the diseased population in the future [25]. Most commonly, a clinical trial takes the form of a prospective study comparing the effect of an intervention, usually a new drug or device, with a comparator or control (i.e., a placebo or a treatment already available) [26]. The fundamental design of the clinical trial can be widely applied to many different disciplines or areas of clinical research. (For a comprehensive discussion of contemporary clinical trial methodology, the reader is referred to the seminal writings of Spilker [1].) Clinical trials can be employed to evaluate many forms of therapy, including surgical interventions and radiation therapy. In addition, clinical trials can be used to test other nontherapeutic approaches to patient care, such as diagnostic tests or procedures [27]. Thus, the NIH classifies clinical trials into five categories according to their purpose, i.e., treatment trials, prevention trials, diagnostic trials, screening trials, and health-related quality of life trials. These categories reflect the way in which clinical trials fit within the entirety of the clinical research spectrum, as they can be instrumental in assisting clinical efforts to improve not only the treatment of a particular disease (as is most often the case) but also its prevention and detection [27].

The clinical trial is the most widespread application of experimental study design in humans [26]. Indeed, it is the adherence of the trial to the principles of scientific experimentation, perhaps more so than a reliance on therapeutic comparison, that most aptly validates the results of the trial. In this vein, a number of general characteristics of the scientific method play a substantial role in the modern conduct of clinical trials including, most notably, the control of extraneous factors that might influence outcome variability, selection bias, or interpretation of results [28]. For example, an important feature of the randomized controlled trial, which is widely accepted as the primary standard of evidence when interventions are evaluated, is the requirement to randomly allocate patients to alternative interventions, strengthening the internal validity of the study (see also Chap. 5).

In any clinical trial, regardless of which interventions or tests are administered, investigators
128 J.A. Franciosa
Take-Home Points
P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology: A Guide for Clinical Investigators, 131
DOI 10.1007/978-1-4614-3360-6_7, Phyllis G. Supino and Jeffrey S. Borer 2012
132 M. Guralnik
[CRO] or a site management organization [SMO]) or by the investigator or site staff, and may include study case report forms (CRFs) or electronic case report forms (eCRFs) if used as the first point of data capture. A source document could even be a cafeteria napkin containing laboratory results or other observations, although a more formal data collection source document would be much preferred.

Use of the "original ink" concept can help to differentiate a source document from subsequent documentation. Original ink is a term that may be used to define the first-ever written documentation of an event or observation pertaining to the study subject. Thus, documents containing original ink are considered source documents for research. The US Food and Drug Administration (FDA) as well as other regulatory agencies also recognize a CRF as source documentation when it has captured the original ink of an event or observation in a clinical trial. In contrast, transcriptions or reproductions are considered subsequent documentation based on the source original ink document. With today's use of advanced computer technology, ranging from digital photography to voice dictation, we must consider other forms of original ink or, more appropriately termed, "original electronic chronicles." These include voice, electronic, magnetic, photo-optical, and other source documentation and records. For further information on the FDA's position on source documentation, the reader is referred to Guidance for Industry: Electronic Source Documentation in Clinical Investigations (2010) [8].

Confusing these issues can lead to misrepresentation of clinical trial data. For example, after site staff has collected a subject's history directly on sponsor-designated CRFs, the study monitor might remind the investigator's staff that preprinted sponsor source documents exist and that they are designed to assist the site in capturing all necessary data elements. The site staff might then proceed to transcribe data from the CRF onto the sponsor's source documents. To further confuse the matter, subsequent monitoring or query resolution activities by the sponsor would then rely on the transcribed sponsor's source documents to be the accurate and overriding data points for resolution. Simply stated, erroneous data could be considered the factual representation of an event or observation. A simple but effective tool for avoiding such situations is to define in advance on a site-by-site, as well as a form-by-form, basis what is and what is not source documentation. When clarifying the definition of source documentation, an important point to keep in mind is that the study staff may habitually record original ink data in certain places. For example, a patient's temperature and pulse may be routinely taken at the bedside by the study coordinator and recorded on a copy of the CRF. If the patient's blood pressure is then taken from the physician's notes and recorded on the copy, then that copy becomes the source documentation for the first two measurements, but not for the third. Interviewing the staff prior to source document verification is an effective time-saving tool. When done early in the study initiation process, this method can very effectively clarify potential discrepancies.

Research-Independent Data Sources

A wealth of medical information is generated every day for nonresearch purposes. A significant source of such data, accessible for research purposes, is the set of patient medical records maintained by hospitals, clinics, and doctors' offices. Even the simplest medical records could contain important information for research purposes, such as sociodemographic data, clinical data, administrative data, economic data, and behavioral data.

Additional potential research-independent primary data sources are (a) claims data (such as those from managed care databases), (b) encounter data (such as those from a staff/group model of health maintenance organizations), (c) expert opinions, (d) results of published literature, (e) patient registries, and (f) national survey databases. Since these data sources contain historical as well as current data that are updated on an ongoing basis, these sources provide data that
are potentially useful in both retrospective studies (designed to investigate past events) and prospective studies (designed to investigate events occurring after patients have been enrolled in a study).

Research-Dependent Data Sources

Controlled evaluation of investigational products or interventions requires prospective data collection which typically involves identifying one or more patient groups, collecting baseline data, delivering one or more products or interventions, collecting follow-up data, and comparing the changes from baseline among the different patient groups. Although there may be some research-independent sources collected in these controlled evaluations (e.g., demographic characteristics, medical history), most of the baseline data and, obviously, the follow-up data must be collected from research-dependent sources. Well-designed investigations of this nature specify, prior to the initiation of the study, the data to be collected and the collection methods to be used.

Data Collection Methods

The study design and the study data to be collected dictate the methods by which the data are to be collected. Laboratory data (e.g., hematology, urinalysis, serology) and vital signs (e.g., height, weight, blood pressure) may be required in a clinical trial to evaluate efficacy and, often, to evaluate patient safety. These data typically would be collected using standard methods for these data types and recorded in the patient's medical records, often designed specifically for the research study. Other data collected to address the research question(s) may require clinical information (e.g., events experienced by the patient, nonstudy medications used by the patient), tracking information (e.g., timing and amount of study medications received, alcohol consumption, sexual activity), or subjective information (e.g., personal opinions of medical condition or ease of treatment). These data must be obtained directly from the patient, most often through the use of a questionnaire or survey.

Questionnaires and surveys consist of a predetermined set of questions administered verbally, as a part of a structured interview, or nonverbally on paper or an electronic device. The responses to the questions may be discrete bits of data or may be grouped as measures of study outcomes (e.g., psychological scales). If the questionnaire is intended to measure study outcomes, establishing its reliability and validity and minimizing bias are essential. Administering a published questionnaire for which reliability and validity have been previously determined is recommended when possible. However, the use of some published questionnaires requires permission of their authors and may have a cost associated with their use. When the use of published questionnaires is not feasible, new questionnaires will need to be developed. Such questionnaires should be pretested systematically (i.e., piloted) with a small subgroup of the patient population in order to identify and correct ambiguities or biases in the way the questions are stated. Training interviewers who verbally administer a questionnaire will also increase the quality of the data generated from both published and newly developed data collection instruments. (See Chap. 8 for a detailed description of various item formats used in questionnaires and general rules to consider when constructing questionnaire items.)

Data Capture

Paper-Based Methods

Efficient analysis, summarization, and reporting of biomedical research data require that data be available in an electronic database, such as a spreadsheet or one of several available databases, some of which have been designed specifically for clinical research data. The manner in which the data are entered into these databases has been evolving. Historically, most data in biomedical research, particularly in RCTs, were entered from a set of paper CRFs specifically designed for the study. Figure 7.1 shows an example of a typical paper CRF used to collect data obtained from physical examination.
7 Data Collection and Management in Clinical Research 137
Fig. 7.1 Example of a paper CRF used to collect research data from a physical examination. Downloaded from the National Cancer Institute at the National Institutes of Health, Division of Cancer Prevention. http://dcp.cancer.gov/Files/clinical-trials/FINAL_DCP_CRF_Templates_Version_3.doc (Accessed 10 Nov 2011)
Table 7.1 Features that may be available for electronic CRFs depending on the clinical trial data management software used (Reproduced with permission from Brandt et al. [2])

Feature | Function
Primary electronic data entry | Data entered into CRF by interviewer or subject (rather than into a paper form first)
Context-sensitive help | Help is given in the context of the problem (immediately)
Default values set | Based upon predefined criteria, or previously entered date, values of fields may be set
Skip patterns | Disabling of questions that become inapplicable based on response to a previous question
Computed (derived) values | Certain questions may be based on values of other questions (such as body mass index (BMI) that is derived from height and weight). Computed values may also control skip patterns on a CRF. If BMI exceeds a preset threshold, questions related to high BMI may be enabled
Interactive validation | Immediate checking of the values entered into the CRF based upon predefined criteria such as ranges, other values in the CRF or study, etc.
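
Three of the features in Table 7.1 (interactive validation, computed values, and skip patterns) can be illustrated with a minimal sketch. It is not taken from any particular data management package: the field names, the plausibility ranges in RANGES, and the BMI_THRESHOLD cut-off are all assumptions chosen for illustration, and a real study would take such values from its protocol.

```python
# Hypothetical plausibility ranges (assumed for illustration only).
RANGES = {"height_cm": (50.0, 250.0), "weight_kg": (2.0, 400.0)}

# Assumed cut-off above which the follow-up questions on high BMI are enabled.
BMI_THRESHOLD = 30.0


def validate_entry(field_name, value):
    """Interactive validation: return an error message, or None if the value passes."""
    if value is None:
        return f"{field_name}: value is missing"
    low, high = RANGES[field_name]
    if not (low <= value <= high):
        return f"{field_name}: {value} is outside the allowed range {low} to {high}"
    return None


def compute_bmi(height_cm, weight_kg):
    """Computed (derived) value: BMI = weight in kg divided by height in m squared."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def process_form(height_cm, weight_kg):
    """Validate both entries, derive BMI, and decide whether the high-BMI questions apply."""
    checks = (
        validate_entry("height_cm", height_cm),
        validate_entry("weight_kg", weight_kg),
    )
    errors = [message for message in checks if message is not None]
    result = {"errors": errors, "bmi": None, "high_bmi_questions_enabled": False}
    if not errors:
        bmi = compute_bmi(height_cm, weight_kg)
        result["bmi"] = round(bmi, 1)
        # Skip pattern: the follow-up questions stay disabled unless BMI exceeds the threshold.
        result["high_bmi_questions_enabled"] = bmi > BMI_THRESHOLD
    return result
```

Calling process_form at the moment of entry, rather than after the form is submitted, is what makes the check "interactive": an out-of-range height is reported before a BMI is ever derived from it.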
to the traditional outsourcing of this task. Desktop-publishing systems and precollated no-carbon-required paper (NCR) allow printing, collating, and binding of CRFs, with multicolored two- or three-part sets [11]. Over the course of a longitudinal study, CRFs often are improved or refined, including the addition of new entries and modification or deletion of entries on previous versions [2]. Some newly requested data (such as information about the patient's history) may be obtainable later, whereas time-dependent observations (such as measurements taken at a certain clinic visit) will not. Data for new or modified questions that cannot be obtained must be treated as missing. Conversely, when a question is deleted, data for patients evaluated under the older CRF version must be archived or purged or both [2]. Regardless of the types of changes made, the FDA requires that the sponsor preserve all electronic versions for agency review and copying [12].

Electronic systems are designed to support data entry where data are entered directly from source documents with most data validations executed in real time as the data are entered and errors promptly resolved, typically by study site staff. As will be noted below, EDC systems also support the monitoring, cleaning, storage, retrieval, and analysis of research data [2], as well as promote the uniform collection of data, which can then be more easily analyzed and shared across a variety of platforms and databases [13].

EDC systems, however, are not without their own constraints. To be useful in multicenter trials, EDC systems must allow electronic submission of data from different sites to a central data center, be easy to implement and use, and minimize disruption at the clinical sites [9]. Timing is essential to the successful implementation of an EDC system. Considerable information technology (IT) support is needed to build the eCRFs, and considerable time must be dedicated to educating the trial site staff on the proper use of the new systems. To be successful and reap the benefits of EDC systems, this effort should be undertaken prior to the initiation of any research study [14].

Although EDC systems are most often used by formally organized research centers with data management staff, many clinical investigators in private practice or in academia conduct studies without the support of qualified biomedical informatics consultants and sophisticated EDC systems [15]. Nevertheless, EDC systems are available that can be implemented without specialized software for investigators with small budgets or limited access to data management staff.

Data collection has naturally evolved alongside computer and information technology. Major milestones in this evolution include personal computers, relational databases, user-friendly interfaces for software once reserved for engineering and systems design staff, and broadened connectivity options such as computer to computer, internet networking, wireless to Ethernet, and cellular data connectivity. These advances along with the availability now of mobile computing and electronics devices, like the iPad, have a potentially huge impact on how we gather data, as well as where data capture is heading.

The iPad is a major step forward for clinical data management. These truly remarkable devices, resting in the hands of all members of the research team, would allow quick access to tools for capturing data, real time or otherwise. They also offer two-way connectivity along with the portability and functionality of the hardware, thereby lending them the exact adaptability needed for clinical medicine and research roles.

Newer generation iPads allow data to migrate from text-based field entry, or PDF form data entry, through to server-based relational databases. Using methods from e-mail as a carrier to internet-connected applications, the data stream can be instantaneous, allowing for immediate two-way data efforts, relaying back from sponsor to investigator. Third-party communications further enhance the iPad platform. All of this has begun to evolve because the iPad platform has simplified the process of data capture and transfer via its accessible hardware and novel data management applications.
For each category, a proportion of the CRFs is sampled by a random-sample-generating program, and entered data are compared with the source documents for discrepancies. For very important categories (i.e., data that are central to the study objective and must be correct), as many as 100% of CRFs may be sampled [2, 6]. Noncritical data, which should be correct but would not affect the study outcome if incorrect, would require a lower proportion of CRFs to be checked [6]. After sampling, the number of discrepancies is reported and corrective action taken. The proportion of audited CRFs for any category may be modified for a given site in light of site-specific discrepancy rates [2].

Electronic data validation identifies entry errors by their deviation from allowable and expected values or answers. These include laboratory measurements, answers that contradict answers to other questions entered elsewhere on the CRF, spelling errors, and missing values [2]. Because of their concrete nature, these errors can easily be identified.

Data Queries

To support the full process of study monitoring and auditing, the data management system should have querying tools in place [2]. After the data entry/verification process discovers an entry that requires clarification and determines that the data were accurately entered into the database, the data coordinator sends the participating institution a paper or electronic query. Examples of entries that warrant queries include missing data values, values out of range, values that fail logic checks, or data that appear to be inconsistent [20]. The query should include protocol and patient identifiers, specific descriptions of the form/data item in question and the clarification needed, and instructions on how and when to send a response. In turn, the coordinating center should have a mechanism for recording the issue and response to each query [20].

EDC systems have a proven superiority to paper-based systems with respect to the querying process. The use of eCRFs in combination with manual ad hoc queries by study monitors has been able to reduce data discrepancies and the consequent need for clarifications by more than 50%. The enhanced ability to clean and analyze data has resulted in the generation of more accurate data [21]. Moreover, compared with a paper-based system, EDC systems with built-in error checking for data quality have been shown to reduce the total number of queries and decrease the cost of each query resolution from $60 to $10 [14].

Document Retention, Security, and Storage

Retention

All clinical investigators should ensure that relevant forms such as CRFs are always accessible in an organized fashion. Informed-consent forms, CRFs, laboratory forms, medical records, and correspondence should be retained by the investigator until the end of the study and, thereafter, by the sponsor for at least 2 years after clinical development of the investigational product has been formally discontinued or 6 years after the trial has ended. Even after the completion of the study, side effects or benefits of the intervention may be present and the relevant forms may need to be retrieved. Factors to be considered are the availability of storage space and the possibility of off-site storage if there is insufficient storage space [22].

Security and Privileging

Both during and after completion of a study, investigators and their staff must prevent unauthorized access, preserve patient confidentiality, and prevent retrospective tampering/falsification of data. Under the FDA's Title 21 Code of Federal Regulations [23], access must be restricted to authorized personnel, the system must prevent malicious changes to research data through selective data locking, and an audit trail must exist [2].
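
The three requirements named here (restricted access, data locking, and an audit trail) can be sketched in a few lines. This is a toy in-memory illustration only and makes no claim to satisfy Title 21 CFR; the class StudyDatabase, the exception LockedRecordError, and every field name are hypothetical and do not come from the chapter or from any real EDC product.

```python
from datetime import datetime, timezone


class LockedRecordError(Exception):
    """Raised when someone tries to change data for a locked subject."""


class StudyDatabase:
    """Toy in-memory store (hypothetical) showing access control, locking, and auditing."""

    def __init__(self, authorized_users):
        self.authorized_users = set(authorized_users)
        self.values = {}              # (subject_id, field) -> current value
        self.locked_subjects = set()  # subjects whose data may no longer change
        self.audit_trail = []         # entries are only ever appended, never edited

    def _log(self, user, action, subject_id, field=None, old=None, new=None):
        # Every attempt, successful or not, leaves a timestamped entry.
        self.audit_trail.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action,
            "subject_id": subject_id,
            "field": field,
            "old": old,
            "new": new,
        })

    def _require_authorized(self, user, subject_id, field=None):
        if user not in self.authorized_users:
            self._log(user, "denied", subject_id, field)
            raise PermissionError(f"{user} is not authorized for this study")

    def set_value(self, user, subject_id, field, value):
        self._require_authorized(user, subject_id, field)
        if subject_id in self.locked_subjects:
            self._log(user, "rejected_locked", subject_id, field, new=value)
            raise LockedRecordError(f"data for {subject_id} are locked")
        key = (subject_id, field)
        old = self.values.get(key)
        self.values[key] = value
        # The old value is kept in the trail so that no change is silent.
        self._log(user, "update", subject_id, field, old, value)

    def lock_subject(self, user, subject_id):
        self._require_authorized(user, subject_id)
        self.locked_subjects.add(subject_id)
        self._log(user, "lock", subject_id)
```

In this sketch, rejected attempts are logged as well as successful ones, so the trail shows who tried to change locked data and when. Locking here is at the subject level; locking by study or by CRF, as discussed below, would follow the same pattern with a different key.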
Consideration should be given to software that provides:

• Privileging: Study-specific role-based privileges should be assigned, with roles requiring adequate training and documentation of such training prior to system use. In the case of multisite studies, it is especially important to be able to assure investigators from each site that other sites can be restricted from altering their data or, in some cases, even seeing their data while the study is in progress. Also, different users should have different data access and editing privileges. Software should allow site restriction of data and the assignment of both role-based and functional privileges. The software should allow the level of restriction to be changed as appropriate.
• Storing of De-identified Data: For studies where breach of patient confidentiality could have serious repercussions, the software should support storing of de-identified data. It is important to note that the Health Insurance Portability and Accountability Act (HIPAA) does not prohibit the storing of patient-identifiable information: it requires only that it be secure, be made accessible strictly on a need-to-know basis, and that accesses to such information be audited. The drawback of not storing patient-identifiable information in every study is that many of a system's useful workflow-automation features, such as generation of reminders to be mailed to patients periodically, cannot function seamlessly and personalization of reminders requires manual processes. Also, in prospective clinical studies for life-threatening conditions such as cancer, where decisions such as dose escalation are based on values of patient parameters, the storage and selective echoing of protected health information (PHI) provides an added safeguard to ensure that data are being entered, or the appropriate intervention is being performed, for the correct patient.
• Generation of De-identified Data: The software should be able to de-identify the data when required in order to share data and should utilize information about user role-based privileges as well. For example, an investigator may have privileges to view patient identifying information, but other personnel, such as biostatisticians performing analyses, may view only de-identified data.
• Data Locking: The software should allow a study coordinator to lock all the data in the system by study, subject, or CRF level when required.

All investigators, particularly those involved in any type of human subjects research, must be sure to take adequate steps to preserve the confidentiality of the data they collect. Investigators must specify who will have access to the data, how and at what point in the research personal information will be separated from other data, and how the data will be retained at the conclusion of the study. The following guidelines for preserving patient confidentiality should be followed [24, 25]:

• In general, all information collected as part of a study is confidential: data must be stored in a secure manner and must not be shared inappropriately.
• Information should not be disclosed without the subject's consent.
• The protocol must clearly state who is entitled to see records with identifiers, both within and outside the project.
• Wherever possible, potentially eligible subjects should be contacted either by the person to whom they originally gave the information or by another person with whom they have a trust relationship.
• Information provided to prospective subjects should include descriptions of the kind of data that will be collected, the identity of the persons who will have access to the data, the safeguards that will be used to protect the data from inappropriate disclosure, and the risks that could result from disclosure of the data.
• Academic and research organizations should establish patient privacy guidelines for nonemployee researchers.

Other Responsibilities and Issues

GCP guidelines mandated through the Code of Federal Regulations require that institutions (or when appropriate, an IRB) maintain records of all research proposals reviewed (including any
scientific evaluations that accompany the proposals), approved sample consent documents, progress reports submitted by investigators, and reports of injuries to subjects [25]. Institutions also must maintain adequate records on the shipment of the drug product to the trial site and its receipt there, the inventory at the site, use of the product by study participants, and the return to the sponsor of unused product and its disposition [26–28]. Because drug-accountability records must be accurate and clear, especially for an audit of the study site [29], electronically based inventory management systems have been devised. In addition to describing current inventory [20], some of these systems have "look ahead" capabilities to assess and fulfill future inventory needs [30].

Oversight of Data Management: Role of Institutional Review Boards

As will be noted in Chap. 12, IRBs have a wide range of responsibilities in the design, conduct, and oversight of clinical trials, and it is important that clinical researchers be familiar with them. IRB functions that are particularly germane to those managing data include oversight of protection of the privacy and confidentiality of human subjects (identifiers and other data), monitoring of collected data to optimize subjects' safety, and continuing review of findings during the duration of the research project [31].

Confidentiality and Privacy of Research Data

Information obtained by researchers about their subjects must not be improperly divulged. It is essential that researchers be able to offer subjects assurance of confidentiality and privacy and make explicit provisions for preventing breaches. For most clinical research studies, assuring confidentiality typically requires adherence to the following routine practices: substituting codes for patient identifiers, removing face sheets (containing such items as names and addresses) from survey instruments that contain data, properly disposing of computer sheets and other documents, limiting access to data, and storing research records in locked cabinets. Although most researchers are familiar with the routine precautions that should be taken to maintain the confidentiality of data, more elaborate precautions may be needed in studies involving sensitive matters such as sexual behavior or criminal activities to give subjects the confidence they need to participate and answer questions. When information linked to individuals will be recorded as part of the research design, IRBs require that data managers ensure that adequate precautions are in place to safeguard the confidentiality of the information; thus, numerous specialized security methods have been developed for this purpose and IRBs typically have at least one member (or consultant) who is familiar with the strengths and weaknesses of the different systems available. Researchers should also be aware that federal officials have the right to inspect research records, including consent forms and individual medical records, to ensure compliance with the rules and standards of their programs. In the USA, FDA rules require that information regarding this authority be included in the consent forms for all research regulated by that agency.

Monitoring and Observation

One of the areas typically reviewed by the IRB is the researcher's plan for collection, storage, and analysis of data. Regular monitoring of research findings is important because preliminary data may signal the need to change the research design or the information that is presented to subjects or even to terminate the study early if deemed necessary. Thus, for an IRB to approve proposed research, the protocol must, as appropriate, include plans for monitoring the data collected to ensure the safety of subjects. Investigators sometimes misinterpret this requirement as a call for annual reports to the IRB. Instead, US Federal regulations require that, when appropriate, researchers provide the IRB with a description of
their plans for analyzing the data during the collection process. Concurrent collection and analysis enables the researcher to identify flaws in the study design early in the project. The level of monitoring in the research plan should be related to the degree of risk posed by the research. Furthermore, when the research will be performed at foreign sites, the IRB at a US institution may require different monitoring and/or more frequent reporting than that required by the foreign institution. Under normal circumstances, however, the IRB itself does not undertake data monitoring. Rather, other independent persons (e.g., members of a data safety monitoring board [DSMB]) typically are responsible for monitoring trials and for decisions about modification or discontinuation of trials. It is the IRB's responsibility, though, to ensure that these functions are carried out by an appropriate group. The review group should be required to report its findings to the IRB on an appropriate schedule.

Continuing Review

At the time of its initial review, the IRB determines how often it should reevaluate the research project and will set a date for its next review. Some IRBs set up a complaint procedure that allows subjects to indicate whether they believe that they were treated unfairly or that they were placed at greater risk than was agreed upon at the beginning of the study. A report form available to all researchers and staff may be helpful for informing the IRB of unforeseen problems or accidents. US Federal policy requires that investigators inform subjects of any important new information that might affect their willingness to continue participating in the trial. Typically, the IRB will make a determination as to whether any new findings, new knowledge, or adverse effects should be communicated to subjects, and it should receive copies of any such information conveyed to subjects. Any necessary changes to the consent document(s) and any variations in the manner of data collection must be reviewed and approved by the IRB. The IRB has the authority to observe, or have a third party observe, the consent process and the research itself. The researcher is required to keep the IRB informed of unexpected findings involving risks and to report any occurrence of serious harm to subjects. Reports of preliminary data analysis may be helpful both to the researcher and the IRB in monitoring the need to continue the study. An open and cooperative effort between the researcher and the IRB protects all concerned parties.

Summary and Conclusions

Clearly defined study endpoints combined with well-designed source documents, CRFs, and systems for capturing, monitoring, cleaning, and securely storing data are essential to the integrity of findings from clinical biomedical research trials. Because IRBs have a wide range of responsibilities in the design, conduct, and oversight of clinical trials, it is also essential that clinical investigators be familiar with their requirements.

The inexorable shift from paper-based to EDC systems in large trials promotes the efficient and uniform collection of data that can be analyzed and shared across a variety of platforms and databases. EDC systems can build quality control into the data collection process from its inception, a more productive approach than building checks onto the end [19]. Although modern software tools unquestionably improve the potential for data collection and management, systems alone are worthless without pro-active study coordinators and investigators who create and enforce policies and procedures to ensure quality [2]. Therefore, a trial's data collection system and its findings are only as sound as the commitment by individuals who formulate and carry out document design, study procedures, training, and data management plans.
Take-Home Points

• Well-designed trials and data management methods are essential to the integrity of the findings from clinical trials, and the completeness, accuracy, and timeliness of data collection are key indicators of the quality of conduct of the study.
• The research data provide the information to be analyzed in addressing the study objectives, and addressing the primary objectives is the critical driver of the study.
• Since the data management plan closely follows the structure and sequence of the protocol, the data management group and protocol development team must work closely together.
• Accurate, thorough, detailed, and complete collection of data is critical, especially at baseline as this is the last time observations can be recorded before the effects of the trial interventions come into play.
• The shift from paper-based to electronic systems promotes efficient and uniform collection of data and can build quality control into the data collection process.
P.L. Flom, PhD
Peter Flom Consulting, LLC,
515 West End Ave, New York, NY 10024, USA
e-mail: Peterflomconsulting@mindspring.com

P.G. Supino, EdD
Department of Medicine, College of Medicine,
SUNY Downstate Medical Center,
450 Clarkson Avenue, Box 1199,
Brooklyn, NY 11203, USA
e-mail: phyllissupino@aol.com

N.P. Ross, BS, MS, PhD Statistics
SUNY Downstate Medical Center,
9006 Kirkdale Road, Bethesda, MD 29817, USA
e-mail: ross@statlogic.net

A self-report measure, as the name implies, is a measure where the respondent supplies information about him or herself. Such information may include self-reports of behaviors, physical states or emotional states, attitudes, beliefs, personality constructs, and self-judged ability among others. A self-report may be obtained via questionnaire, interview, or related methods. Questionnaires typically are written documents that are administered without the involvement of an interviewer, whereas interviews usually (but not always) are administered orally [1]; both are sometimes termed "surveys."

Self-reports are important in medical research because while some variables can be evaluated through physiological measures, chart review, physical exam, direct observation of the respondent, or by reports by others, other variables only can be assessed from information directly furnished by the patient or other subject. Indeed, the subject often can provide valuable information about social, demographic, economic, psychological, and other factors related to the risk of disease or to adverse outcomes of disease. The choice between self-report, observational, and biophysiological measures will depend on the data that are available and the nature of the research questions and hypotheses. It is important to note that while the range of biophysiological measures is constantly increasing, and while these measures may permit objective evaluation of clinically relevant attributes, they are not perfectly reliable (i.e., free from measurement error). Even more importantly, they may fail to capture the specific quality that the investigator wishes to evaluate. For example, if an investigator is interested in blood pressure, this may be evaluated biophysiologically. However, if the aim of the investigation is to examine the effects of mood on blood pressure, mood can be evaluated only by self-report as there are no biophysiological measures of mood (though there may be biophysiological correlates, and even causes and consequences of biophysical factors). Observational data also may provide useful information, but their use has its own perils as individuals do not always accurately observe the actions of others. For these reasons, information directly reported by patients and other subjects commonly is collected by clinicians, clinical investigators, and other health-care professionals, and can be used as a tool for patient management or for research.

Topics commonly examined by self-report include physical or mental symptoms, level of
P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology: A Guide for Clinical Investigators, 147
DOI 10.1007/978-1-4614-3360-6_8, Phyllis G. Supino and Jeffrey S. Borer 2012
148 P.L. Flom et al.
pain or stress, activities of daily living, health-related quality of life, availability of social support, use and perceived effectiveness of strategies used to cope with ill-health, satisfaction with the doctor-patient interaction, and adherence to medication schedules (though the latter might, at least in theory, also be evaluated through objective testing).

Although self-report instruments are relatively easy to use, their construction and validation can be difficult. This chapter will cover fundamental aspects of, and distinctions among, questionnaires, interviews, and other methods of self-report and will indicate the circumstances under which a new self-report measure may be needed. It also will describe methods of generating and structuring responses; discuss approaches to asking about sensitive information; describe the rationale for, and processes involved in, pilot testing, evaluating, and revising a measure; review related ethical and legal aspects; and provide a general guide to the entire process.

What Is a Questionnaire?

A questionnaire is a type of self-report instrument that is designed to elicit specific information from a population of interest. Questionnaires may be standardized but often are designed (or adapted) specifically for a particular study. Depending on the objective of the study and resources, the questionnaire, like other self-report measures, may be administered to all subjects in the available sample or to a defined subsample. As noted below, the most common method of administration is direct mailing to subjects, though other methods exist. Deciding upon the sampling strategy is a complex process. It can range from a simple random sample to a very complex hierarchical design involving multiple strata and sampling procedures, as reviewed in Chap. 10. For additional information on this subject, the reader is referred to Kish (1995) [2], Groves et al. (2004) [3], and Cochran (1977) [4].

The questionnaire usually is in the form of a written document, though sometimes it may be administered by audio or with pictorial methods. Questionnaires, like tests, can produce a total score or subscores, but also can yield different types of information that can be separately analyzed. Questionnaires are almost always a necessity when direct contact with the subject is not possible. Under these circumstances, questionnaires typically are administered by mail to the respondent who, in turn, completes and returns them to the sender. In other circumstances, questionnaires may be read to the respondent over the telephone or in-person as part of a structured interview, or they may be administered via the Internet in a variety of ways. A questionnaire can cover virtually any topic, although here we will emphasize those that capture information related to medical issues or health-related topics including, but not limited to, diseases, symptoms, and a patient's experiences with doctors and other health professionals. Some well-known questionnaires used in medical research are the Brief Symptom Inventory (a 53-item questionnaire covering nine dimensions of psychological health [5]); the SF-36 (a 36-item patient-centered questionnaire about general physical and mental health-related quality of life [6]); the 26-item World Health Organization Quality of Life Questionnaire (WHOQOL) [7] assessing general, physical, emotional, social, and environmental health quality; the Minnesota Living with Heart Failure Questionnaire (MLHFQ) (comprising 21 questions that measure the patient's perceived limitations due to heart failure [8]); and the Morisky Scale (a series of six questions about medication adherence [9]).

Interviews and Related Methods

There are a large variety of interview and related methods that also can be used to collect self-report data. These can be categorized along several dimensions: level of structure of the interview, number of respondents involved (one vs. two or more), and use of subject narrative (historical or anecdotal methods). In addition, these types of measures are usually qualitative (i.e., focus groups, in-depth/unstructured interviews, ethnographic interviews) as opposed to quantitative (e.g., structured interviews and questionnaires) in
8 Constructing and Evaluating Self-Report Measures 149
the field of study is well developed. A lesser degree of structure is more appropriate earlier in the development of a field of knowledge or when the particular research is highly exploratory.

Number of Respondents

While the traditional interview typically entails a one-on-one interaction between interviewer and an interviewee (respondent), the joint interview involves two (or sometimes several) individuals who know each other, commonly a couple or a family [16]. Joint interviews differ from focus group methods (described below) where those being interviewed may be strangers. They have value in survey research because different individuals may have very different perspectives that may be illuminated by the interaction between or among them. These different perspectives, in turn, may provide the researcher with greater insight into the problem at hand; however, to accomplish this objective, the interviewer must be able to prevent one respondent from dominating the discussion. Joint interviews have been used to study family reactions to youth suicide [17] and to study reliability of reports of pediatric adherence to HIV medication by interviewing both patients and their caregivers [18]. Note that the term "joint interview" sometimes is used when there are two interviewers, rather than two subjects. This approach can be used as a vehicle for interviewer training and for determination of inter-rater reliability, but it also can be used to provide better answers to health-care questions, as when a psychiatrist and an internist jointly interview a patient to obtain information from varying perspectives [19].

In a focus group, typically four or more individuals (usually a fairly homogenous group) collectively discuss an issue, guided by a moderator. Focus groups are useful for exploring a particular issue in depth. However, to provide useful information, members of the focus group must be properly selected. In addition, moderators must be matched well to the subjects, they must know the subject matter very well, they must be able to elicit information from those who do not offer it spontaneously [20], and (as in the case of the joint interview) they must have sufficient skill to ensure that one member of the group does not dominate the discussion. Focus groups have been used in medical research to uncover attitudes about a particular illness or difficulty. For example, Quatromoni and colleagues used focus groups to explore the attitudes toward, and knowledge about, diabetes among Caribbean-American women [21], whereas Hicks et al. used focus groups to explore ethical problems faced by medical students [22].

Narrative Methods

Life Histories, Oral Histories, and Critical Incidents: Life histories are narrative self-disclosures about personal life experiences, typically recounted orally or in writing in chronological sequence [1]. They commonly are used as an ethnographic tool for identifying and elucidating cultural patterns, but the technique also can be of value for eliciting the experience of patterns and meanings of health care in populations of interest. Oral histories are similar to life histories, but they focus on personal recollections of thematic events rather than on individual life stories. The critical incident technique, pioneered by Flanagan [23] in the mid-1950s, is widely used in many areas of health sciences and health sciences education. More focused than life or oral history methods, the critical incident technique requires respondents to identify and judge past behaviors and related factors that have contributed to their success or failure in accomplishing some outcome of interest. The critical incident method has been used to explore such wide-ranging topics as adverse reactions to sedation among children [24], attitudes of third-year medical students toward becoming physicians [25], and reasons why physicians changed their areas of clinical practice [26].

Diaries: A diary is not technically an interview, as no one is asking questions. Nonetheless, because diaries have some similarities with interview methods, sometimes they are classified with them. A diary is a written
record kept by the respondent, usually over a fairly lengthy period of time. Diaries may have any degree of structure or content; for example, in a study of diet, a diary might include only what the respondent ate each day. On the other hand, in a study of reactions to medication, the diary might include any reactions that a patient may have experienced after taking the medication. If subjects are not literate, diaries may need to be orally recorded. Diaries have been used in clinical research to describe somnolence syndrome in patients after undergoing cranial radiotherapy [27], to measure morbidity of children experienced at home [28], and for improving heart failure recognition after intervention [29]; the methodology has been particularly useful for monitoring symptoms in individual patients in the setting of "N of 1" randomized clinical trials [30] (see Chap. 5).

Think-Aloud Methods: With think-aloud methods, respondents are asked to dictate their thoughts into a recorder while they are trying to solve a problem or make a decision. These methods produce inventories of decisions as they occur in context [1]. One fundamental aspect of think-aloud methods that differentiates them from other approaches is that they are concurrent with the process involved; that is, information is gathered while active reasoning is taking place. Think-aloud methods have been used to examine nurses' reasoning and decision-making processes [31] and have been shown to produce useful information in hospital settings [32]. For further information about this approach, the reader is referred to the seminal writings of Ericsson and Simon (1993) [33].

Making the Choice: Questionnaires Versus Interviews

This choice is, in some ways, a false one. Similar questions may be asked in interviews and questionnaires, and as noted above, interviews may be guided by written questionnaires. Either approach may be relatively structured or unstructured. There are even questionnaires that may be completed by couples or groups. Nevertheless, these methods differ in certain important respects. As noted, questionnaires tend to be more structured; some forms of interview, such as those conducted with focus groups, cannot be conducted as a questionnaire and require a trained moderator. In addition, some individuals (e.g., young children, stroke patients, nonnative speakers) may be more comfortable with spoken than with written English and may have a diminished ability to read, which would limit their ability to complete a paper and pencil questionnaire. These factors notwithstanding, some types of questions, particularly those that are relatively complex, are better suited to questionnaires, particularly when skip patterns are clear. (The "skip pattern" refers to the idea that some questions will be passed over appropriately depending on answers to earlier questions or when the questions do not apply to the respondent.) For example, in a questionnaire about general health, women might answer questions on topics such as menstruation and pregnancy, whereas men would not answer these questions. In addition, because it takes less time to read a question than to speak it, questionnaires can contain more items, yet be completed within the same amount of time as an interview covering fewer items. Finally, self-completed questionnaires may be viewed as less intrusive than face-to-face interviews. Thus, the choice is a complex process, and a variety of factors must be weighed.

When Is a New Self-Report Measure Needed?

Creating a new self-report measure entails considerable time and effort for item construction and for pilot testing, refinement, and validation. Before undertaking such a project, it makes sense to be sure it is necessary to do so. As noted above, answers to some questions can be obtained through biophysiological methods or through direct observation and some cannot. Should the investigator decide that answers to a research question can be obtained only through use of a self-report measure, he or she should first
determine whether a suitable measure already exists. (The Internet site http://www.med.yale.edu/library/reference/publications/tests.html provides directories of tests and measures in medicine, psychology, and other fields; other good sources are Tests in Print [34], the Mental Measurements Yearbook [35], and the Directory of Unpublished Experimental Mental Measures [36].) Should an existing measure be selected (even if widely used and psychometrically sound in other populations), the investigator should ensure that it has been successfully employed and, optimally, validated in the population under study. If an appropriate preexisting measure cannot be identified, it may be possible to identify two (or more) measures that together may serve the needs of the study, though the investigator should be aware that combining multiple measures (or rewording items) can impact the psychometric properties of their constituent parts.

Sources of Items

The first source of items for a self-report measure is the existing literature, which, as noted, includes existing tests and measures. In some cases, there may be a strong conceptual basis for a set of questions in which case the theoretical or discursive literature may be helpful for item generation.

An additional source of items is observation and interview. One profitable long-term research strategy is to begin with relatively qualitative methods (such as unstructured interviews or observation), administered among relatively small samples, and use the findings obtained with these methods to develop more structured forms that can be administered to significantly larger samples. On the other hand, unexpected responses to a highly structured method may provide the impetus to developing less structured surveys that can further explore those areas.

Structuring Questions: Key Points

The Respondent's Reading Level: When developing a questionnaire, the potential respondents' reading level and related characteristics must be kept in mind. How educated will they be? In which languages will they be fluent? If subjects are excluded who are not fluent in the language used in the questionnaire, how will lack of fluency bias the sample? Answers to all of these questions will vary by sample and by location. If, for example, an investigator is surveying a group of professionals (e.g., doctors or nurses) in the United States [USA], England, or in another country in which the native language is English, it probably is safe to assume that the respondents will have a reasonable command of English as well as a high level of education. On the other hand, if patients are being surveyed from among a heterogeneous population where geographic variations in language exist, it must be assumed that the patients' language proficiency in the country's primary language (and their use of alternative languages) will vary by location and that at least some may have little formal education. These assumptions can be examined by administering various tests of reading level. If reading level is low, alternative formats can be used including auditory or pictorial methods. For example, pain scales exist that use faces representing different levels of pain [37]. These can be particularly useful with young children or with illiterate respondents. (Issues regarding need for and methods of translating questionnaires are discussed below.)

Clarity: Not only must questions be readable by the target population, they also must be clearly framed to render the survey process as simple as possible for the respondent. It is very common to assume that a question that is clear to the investigator will be clear to others. However, this often is not the case. The best route to assess clarity is thorough pilot testing. Questions that are unclear may be skipped by the respondent or, worse, may be answered in unexpected ways. Unlike readability, lack of clarity affects respondents at all levels of education and language proficiency, although it may be more problematic at lower levels. Ironically, sometimes it can be more
birthday and his or her age. However, it is important to be selective, as asking all questions in multiple ways not only will make for a very long survey, it will invariably irritate the respondents. Therefore, it is best to include intentionally redundant items only for key areas and under conditions where ambiguity is difficult to avoid.

Structuring Potential Responses

There are two broad types of questions that can be included in a self-report measure: open-ended (also known as "open") questions and closed-ended (also known as "closed") questions. These differ according to who (the developer of the survey or the respondent) is responsible for defining possible answers to the questions.

Open-Ended Questions: Open-ended questions are those for which the respondent supplies the answer. These are subcategorized into (1) numeric open-ended questions that may ask for responses expressed as quantities (e.g., "How much out-of-pocket money did you spend on medications during the past week?" "How much weight did you gain during the last year?" "How old were you when you had your first heart attack?") versus (2) free text questions (sometimes called "verbatims"). The latter, often seen at the end of surveys, ask about experiences or satisfaction with services (e.g., "Do you have any other comments you'd like to share?"). Open-ended questions are the question-level equivalent of unstructured surveys and share some of the same problems (in particular, they may be difficult to code). The chief advantage of open-ended questions is that they do not constrain the range of possible responses. Indeed, they permit respondents to freely respond to the question, allowing them to describe their feelings about, attitudes toward, and understanding of the topic at hand. As such, they potentially can generate more information about the topic than other formats. Open-ended responses also tend to reduce the response error associated with answers supplied by others (i.e., the survey developer). But this approach has its perils. If a survey includes a question such as "When did you move to New York?" then, given an open-ended format, respondents may name a year, a date, or may refer to a time in their lives (e.g., "right after I got married") or to the history of the area (e.g., "just before the big blackout"). For a question such as this, it is better to ask for a specific type of response (e.g., either "How old were you when you moved to New York?" or "In what year did you move to New York?") because, under these circumstances, it is unlikely that any response given would be unduly constrained.

Closed-Ended Questions: Closed-ended questions are those in which the respondent is asked to choose from a preexisting set of response options that have been generated by the individuals developing the survey. Closed-ended questions, therefore, limit the answers that the respondent can provide. Their primary advantages are that they are easier to code and analyze, provide more specific and uniform information for a given question, and generally take less time to answer than open-ended questions. Closed-ended questions can be subclassified into those calling for dichotomous responses versus polychotomous (multiple choice) responses. Dichotomous responses are those that have only two possible values, most commonly "yes" or "no." Examples of questions that may generate such responses are legion ("Did the patient die?" "Do you have a physician?" "Have you ever had surgery?"). When items are framed as statements rather than as questions, typical dichotomous responses include "true/false" or "agree/disagree" response options. Items calling for dichotomous responses sometimes are combined into scales that can yield an aggregate score. One well-known example is Thurstone scaling. Thurstone scaling refers not to a method of soliciting responses to single unrelated items, but to a method of constructing and scaling several related items. The essential idea is to construct several dichotomous statements about a respondent's attitudes, each of which may be answered "Agree" or "Disagree." This method of scaling can be used to classify respondents with different levels of an attribute [40].
For example, if the area of inquiry entailed nurses' attitudes about doctors' orders, the following series of items might be presented:

(a) A nurse must always follow every order that a doctor gives, even if he/she thinks it is wrong.
    Agree    Disagree
(b) A nurse should almost always follow a doctor's orders, but may raise questions on rare occasions.
    Agree    Disagree
(c) A nurse should generally follow a doctor's orders, but should also voice his/her opinions about those orders.
    Agree    Disagree
(d) Nurses should be equal partners in all decisions about patient care and should regard doctors' orders as advice.
    Agree    Disagree

In contrast to questions soliciting dichotomous responses, multiple choice questions include three or more response options. These, in turn, can be differentiated into questions calling for nominal-level responses and those that call for ordinal responses.

As noted in Chap. 3, nominal variables are simply names; they have no order. There are two primary types of questions that call for nominal responses. The first includes items for which the respondent can provide only one answer, as the available response options are mutually exclusive. Examples include questions about demographic characteristics (e.g., religion, gender), other characteristics such as hair color and blood type, and so on. The second type includes questions where the respondent can select more than one response (i.e., "choose all that apply" questions). The latter may provide very useful information but pose data entry and analytic challenges that need to be considered when designing the survey instrument. To counter these, special techniques are needed. For example, if one is interested in learning about why patients have gone to the hospital, it is advisable to divide the main question into two subquestions: the first asking the respondent whether he or she has been to the hospital and (if answered in the affirmative) a follow-up question asking about reasons for the hospitalization, with responses entered into separate columns of a spreadsheet.

Ordinal responses are those that have a meaningful sequence, but no fixed distances between the levels of the sequence. Questions about subjective responses are often ordinal. For example, responses to a question such as "How much pain are you in?" could range from "none," to "a little," to "some," to "a lot," to "excruciating." They are considered to be ordinal rather than interval because while they arguably proceed from least to most pain, it is not at all clear whether the difference between, for example, "none" and "a little" is larger, smaller, or the same as the difference between, for example, "a lot" and "excruciating."

As noted, ordinal response scales typically include a number of possible answers. Usually, an odd number of responses (typically five or seven) is chosen to allow the respondent a neutral or midrange option, though there is no consensus about how many choices to include. There are a variety of different ordinal response scales. The most common are given below:

Traditional Ordinal Rating Scales: These rating scales ask the respondent to evaluate an attribute such as performance by checking or circling one of several ordered choices. Rating scales often are used to measure the direction and intensity of attitude toward the target attribute. An example of a traditional rating scale is given below:

    Excellent    Good    Fair    Poor    Very Poor

Likert Scales represent another traditional type of rating scale that asks the respondent to indicate his or her level of agreement with a given statement, with the center of the scale typically representing a neutral point [40]. Likert scales are most frequently used for items that measure opinion and take the general form shown below:

    Strongly Disagree    Disagree    Neither Agree Nor Disagree    Agree    Strongly Agree
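The closed-ended response formats described above translate directly into simple data encodings. The following Python sketch is illustrative only: the function names, the list of hospitalization reasons, and the 1-to-5 integer coding of Likert responses are assumptions introduced here and are not taken from any instrument cited in this chapter. It shows a "choose all that apply" item split into a yes/no screening question plus one indicator column per reason (the separate-columns approach suggested above), and Likert responses mapped to ordered codes.

```python
# Hypothetical response options; not drawn from any published instrument.
LIKERT_LEVELS = [
    "Strongly Disagree",
    "Disagree",
    "Neither Agree Nor Disagree",
    "Agree",
    "Strongly Agree",
]
HOSPITAL_REASONS = ["surgery", "injury", "childbirth", "other"]


def code_likert(response):
    """Return an ordinal code from 1 to 5, or None if the item was skipped."""
    if response is None:
        return None
    # list.index raises ValueError for a response outside the fixed options.
    return LIKERT_LEVELS.index(response) + 1


def code_hospital_item(been_to_hospital, reasons=()):
    """Encode a screening question plus a choose-all-that-apply follow-up.

    Each reason becomes its own 0/1 column. The follow-up columns are set
    to None (not applicable) when the screening answer is "no".
    """
    row = {"hospitalized": 1 if been_to_hospital else 0}
    for reason in HOSPITAL_REASONS:
        if been_to_hospital:
            row["reason_" + reason] = 1 if reason in reasons else 0
        else:
            row["reason_" + reason] = None
    return row


if __name__ == "__main__":
    print(code_likert("Agree"))
    print(code_hospital_item(True, reasons=["surgery", "other"]))
    print(code_hospital_item(False))
```

The integer codes produced for the Likert item are labels for ordered categories. Because the distances between adjacent levels of an ordinal scale are not fixed, they are better treated as ranks than as interval measurements.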
may not mean the same thing to all respondents. VAS have been used commonly for the clinical measurement of chronic and acute postoperative pain. In one study designed to formally assess its psychometric performance in the latter setting, DeLoach and coworkers [45] administered the VAS to 60 patients in the immediate postoperative period, using the scale anchors "no pain" and "worst imaginable pain." The authors found good correlations between the VAS and a traditional numeric measure though individual VAS estimates tended to be relatively imprecise.

Rank Order Scales: With this form of measure, respondents are asked to rank alternatives in order, rather than rate them on a scale. For example, if members of a medical school class all had the same professors in one semester, they could be asked to grade them in relation to one another, as shown below:

Please rank each of your professors from best to worst, where 1 = best and 5 = worst:
    Adams _____    Bassett _____    Cochran _____    Davis _____    Edwards _____

Advantages and Disadvantages of Categorizing Responses

Many times, responses that are fundamentally continuous in nature are transformed into categorical responses by the design of the questionnaire. Instead of asking "How old are you?" a respondent can be asked "Are you: (a) under 18, (b) 19-24, (c) 25-34, (d) 35-44, (e) 45-54, (f) 55-64 or (g) over 65?" This approach, however, has several important drawbacks. First, categorical responses cannot be reconverted into continuous responses. Second, it can limit comparisons with other questionnaires that utilize different breaks between categories. Third, breaks must be meaningful, with variations occurring only between those that have been included. Sometimes the survey developer may choose breaks that are inappropriate. For example, if, after data collection, it is determined that most respondents are over age 65, it is not possible to reverse course and redo the survey adding additional breaks for 65-74 and 75-84. Nonetheless, there can be advantages to categorical scaling. The primary advantage is that some respondents may be more willing or able to answer some questions in categorical form than in numerical form. This is particularly true of income questions, where respondents may not know their precise income, but they will know it approximately. (Ironically, self-reported age follows an opposite pattern as individuals appear to be better able and more willing to give their birthdates than their ages.)

Asking About Sensitive Information

What is sensitive information? The answer to this question depends on the respondent, because what is sensitive to one person is not to another. In general, questions about stigmatized or illegal behaviors, or unusual beliefs and opinions will be judged to be more sensitive by those who engage in those behaviors or hold those beliefs than by those who do not [39]. Highly personal questions (e.g., income, weight, some health conditions) or questions about traumatic events (e.g., rape or child abuse, or other forms of abuse) also may be viewed as sensitive. When asking about sensitive information, warm up questions often are used to set the respondent at ease, thereby increasing the likelihood that the sensitive questions will be answered. It also may be useful to include a cool-down or cool-off phase that can reduce the possible stress induced by the sensitive questions. Typical warm-up questions include those about nonsensitive demographics (e.g., county of residence, birth order); cool-down questions often are quite trivial (e.g., pet ownership, taste in music, food preferences, and similar items).

Sensitive questions can be uncomfortable to the respondent and may raise ethical concerns. When included within a research protocol, the investigator may need to demonstrate to his or her institutional review board (IRB) the need for such questions and provide assurances that the respondent will not be compelled to answer them. When asking highly sensitive questions, interviewer training is essential, and interviewers may need to be aware of referral services that can be
offered if the respondent reveals high-risk behavior, for example, being involved in an abusive relationship, being suicidal, or using illicit drugs. In addition, becoming aware of certain types of behavior via self-report may impose ethical responsibilities on certain classes of professionals. For example, clinical psychologists have a duty to report certain behaviors. Clinical researchers typically are obligated to report nonadherence to (or adverse outcomes associated with) treatment. More generally, anyone who is a member of a group that has licensure will need to investigate his or her own specific requirements for such disclosure.

Modes of Administration

Self-reported information can be obtained via a variety of methods. These include face-to-face interviews, mailed questionnaires, e-mail and web-based surveys, telephone surveys, computer-assisted response systems, and randomized response methods.

The chief advantages of face-to-face administration are that response rates are optimized and that it provides an opportunity for the interviewer to clarify confusing items. Disadvantages include expense (both time and money), the possibility that interviewer behaviors may influence (bias) responses, and the fact that some individuals may be reluctant to answer some questions in the presence of an interviewer due to embarrassment (especially sensitive items) or concerns about revealing illegal behavior.

any particular population, precluding generalizability of conclusions. These limitations apply even to mail surveys that have been published in medical journals, where average response rates have been shown to be approximately 60% [47].

E-mail and Web-Based Surveys

E-mail and web-based surveys are less costly to administer than traditional postal mail surveys, but have several limitations. Anonymity can be difficult to ensure, response rates may be low, and responses may not be random (often, there is no way of knowing exactly who is answering the questions). Response rates with Internet surveys have been found to differ from those obtained by postal methods, depending on the group surveyed. Younger individuals tend to respond more frequently than older individuals to e-mail, whereas older individuals respond more to traditional mail [48]; in one study, medical doctors were found to respond more frequently to traditional mail than to Internet-based methods [49].

Telephone surveys are less costly than face-to-face interviews, but the telephone-based approach may lead to significant nonresponse. Assuming that the subject can be reached, the lack of personal contact between the interviewer and respondent may increase the likelihood that the latter will decline the interview. In addition, in the current era, many potential respondents lack landline telephones, and some have multiple telephones, creating difficulties in achieving a random sample. A recent study using telephone survey methodology found response rates of only 39% [50].
including questions that are relevant to the respondent and keeping the survey short and simple (KISS). Strategies specific to mail surveys include the use of personalized questionnaires and/or cover letters that orient the respondent to the purpose and importance of the study and invite their participation. Additional strategies include the use of colored ink, first class mail and recorded delivery, stamped return envelopes (or permitting use of facsimile), contacting participants before sending surveys, maintaining follow-up contact with participants, and providing nonrespondents with replacement questionnaires when the initial questionnaires were not readily accessible [59]. In one study, the combined use of replacement questionnaires and chocolate (the inducement) was found to significantly increase response rates versus either method alone [60]. Strategies specific to telephone surveys include allowing the respondent to return the call using a toll-free number and sending alerts prior to initiation of the survey. (For more possibilities, the reader is referred to the website www.guidestarco.com/Increasing-survey-response-rates.htm.)

Evaluating Psychometric Properties of a Self-Report Measure

Before a self-report measure can be used with confidence, it must be rigorously evaluated to determine whether it is psychometrically sound; that is, that it measures the construct of interest (e.g., quality of life, satisfaction, emotional state of health) accurately in the population of interest. Such an assessment not only is essential for all newly developed instruments, it also is important for instruments that have been validated for other populations. By "accuracy," we mean that the quantitative or qualitative assessment provided by the instrument should provide as true a measure of the underlying construct as possible. Unfortunately, all measurement is accompanied by the possibility of error which is either systematic or random as no data collection technique is perfect. Whenever we measure a patient characteristic, be it by objective testing or by more subjective methods, the measurement instrument provides only an estimate of the quantity of interest. By an "estimate," we mean that the recorded value is not a direct measure of the underlying quantity of interest or the "true value." For example, if we are measuring the blood pressure of an individual, the observed value for the systolic pressure may be 124 mmHg. However, the true value cannot be observed and is equal to the 124 plus or minus some value reflecting measurement error as well as other sources of error.

Two fundamental components of accuracy, both inversely related to the error of an observation, are validity and reliability. Physicians and others using self-report measures for research should have a fundamental understanding of these concepts if they are to form judgments about the quality of outcomes based on these measures or develop their own measures. In the setting of tests and measures, validity relates to how well the instrument measures what it purports to measure and reliability relates to how consistently the instrument measures whatever it is that it measures. These qualities exist on a continuum rather than as absolutes, that is, inferences drawn from an instrument are neither "valid" nor "invalid" nor are they "reliable" or "unreliable"; rather, they are valid to a certain degree and reliable to a certain degree for a given population and setting (i.e., are sample dependent). Together, validity and reliability reflect the ability of the instrument to provide an accurate quantitative estimate of the characteristic of interest to the researcher.

Validity

Validity has been defined as the degree to which conclusions drawn from the results of any assessment are "well-grounded or justifiable, being at once relevant and meaningful" [61]. When the term validity is applied to measurement, it refers to the extent to which the instrument measures the actual parameter of interest [62]. Thus, a well-built scale should, on average, produce readings that permit a meaningful conclusion about a
person's actual weight; a well-constructed measure of clinical depression should yield data that are useful for drawing meaningful conclusions about the presence and severity of depressive symptoms; and a properly designed measure of health-related quality of life should provide responses that are of value for drawing meaningful conclusions about health status or health utility from the perspective of the patient. In each of these cases, the quality of the instrument is judged according to the soundness of the conclusions that can be drawn from the responses that it provides. Therefore, though the term "valid" is commonly used as a descriptor for various tests and measures, validity, as Cook and Brown have noted, represents a property of the inference rather than the instrument itself [63]. Because these inferences are influenced by the circumstances under which the instrument is administered, there is no such entity as a generically valid instrument. Indeed, all instruments should be validated for each interpretation, including the specific populations and contexts in which they will be used. For example, a test that measured knowledge of basic addition and subtraction might be used to draw valid inferences about mathematics proficiency among first-grade students but would not be useful for drawing similar inferences about college mathematics majors. Similarly, a scale that has been validated for one disorder (e.g., depression) would need to be re-evaluated to establish its validity in the setting of another (e.g., anxiety). Moreover, an instrument that has been shown to permit valid inferences under research conditions or in highly selected patients may need further evaluation before use in a general clinical population [63].

Validation of a measurement instrument is a complex process, in part, because validity encompasses various dimensions. The most common of these are summarized below:

Face Validity: Face validity (validity at "face value"), also known as "representation validity," is concerned with how a measurement instrument or procedure appears to be relevant to a construct, as judged by a potential respondent. It is the simplest type of validity to gauge and, typically, is assessed early in the validation process. "Does the assessment seem like a reasonable way to gain the information the investigator is attempting to obtain? Does it look as though it will measure what it is supposed to measure? Does it seem well designed?" [64] For example, the Beck Depression Inventory, which is widely used in clinical medicine, asks questions about depression; more specifically, it asks about such attributes as sadness, suicide, and loss of pleasure [65]. It has face validity because these (and other) items are what most people think of as depression.

Content Validity: Content validity reflects how well the items comprising a measure cover (sample) the subject of interest or domain. When a domain is well defined, content validity is relatively easy to ascertain. If the domain is less well defined, ascertainment of content validity may require having experts in the field review the measure [40]. The content validity of a test of knowledge of women's health was called into question by comparing the domains it covered with those covered by a set of curriculum guides [66], and the content validity of the SF-36 health questionnaire was affirmed by comparing it with the longer instrument from which its items were drawn [67].

Construct Validity: Construct validity is the degree to which a measure is related to other measures or attributes, as dictated by theory. It reflects the extent to which the construct under study (e.g., depression), even if it cannot directly be assessed, has been properly labeled (operationalized) by the items comprising the measure. In other words, does the instrument measure what it was designed to measure? Thus, construct validity is a key part of validity; no instrument has any value unless it satisfies this criterion. Inferences about construct validity can be evaluated by a variety of methods. A common approach to construct validation entails assessment of the convergent and divergent (or discriminant) validities of a measure. Convergent validity indicates that the measure correlates highly with other measures of similar constructs, whereas
Two forms of responsiveness are recognized: internal and external [80]. Internal responsiveness represents the instrument's capacity to detect change from before to after exposure to an intervention of acknowledged efficacy [81]. Typically, it is evaluated in the setting of repeated measures designs that incorporate assessments before and after the intervention in the same individual. These designs can involve a single group of subjects followed over time (i.e., a treated cohort, where intrasubject change is expected) or include two groups (including an untreated control where change is unexpected). External responsiveness refers to the degree to which changes in a measurement correlate with changes in other putatively related changes in health status [81]. Both forms of responsiveness are influenced by reliability and scale characteristics. Scales that are unreliable will produce too much noise to allow for determination of meaningful change over time. Scales with too few response categories may fail to detect all but very large changes. Scales producing "ceiling" effects (due to restriction at the upper level of the range of possible values) may leave little room for improvement on subsequent testing just as those producing "floor" effects (where data cannot take on lower values) will be insensitive to clinical decline even when there is a worsening of status or functioning. When instruments with varying scaling characteristics (type, length, directionality, etc.) are compared to determine their relative responsiveness, unit-free statistical approaches including standardized scores and comparisons (e.g., effect sizes or standardized response means) must be used. (For an excellent discussion of these techniques and their interpretation, the reader is referred to Liang et al. [82] and Angst et al. [83].)

As noted throughout this volume, the validity of any study can be threatened by bias, which broadly is defined as known or unknown systematic error in the design, sampling, measurement, or other critical aspects in the conduct of an investigation that can produce distortions of findings. Unlike a random error, described below, a systematic error consistently affects the measurement of the variable in the same way each time that the measurement is done. It provides an incorrect measure of the variable, and the error will be the same for every subject.

There are several types of bias that specifically affect responses obtained in self-report measures; some of the most common are listed below. (For a fuller list, the reader is referred to Aiken and Mardegan [44] and Choi and Pak [38].) Although adequate quantitative data are not available for purposes of comparison, there is general agreement that the extent and impact of these biases vary greatly from discipline to discipline and from one population to another.

Social Desirability Bias: Social desirability bias (sometimes termed "faking good" bias) refers to the tendency of respondents to answer questions in ways that make them look good, rather than honestly [40]. This positive response bias may be of two types: some respondents may deliberately tell falsehoods in order to appear acceptable to those conducting the survey, whereas others may have internalized the dishonest response. (The latter occurs more commonly than generally recognized [84].) The social desirability bias can compromise most forms of self-report, but its potential impact should be anticipated when asking about stigmatized behaviors or attitudes (e.g., when questions involve issues of criminality, violence, or sexual orientation), or when the respondent has reason to believe that a socially nondesirable response could cause him or her to lose something of critical value (e.g., a belief by a patient that nonadherence to a health-care provider's instructions could negatively impact future interactions with that provider). Although it may not be possible to eradicate this form of bias, the extent of its potential influence can be examined by embedding, in the self-report measure, an item or two that ask the respondent to answer a question such as "I have never intentionally told a lie" or "I always know the difference between right and wrong" or through formal testing. A common test of social desirability is the Marlowe-Crowne scale [85]; a shorter version
of this scale has been created by Strahan and Gerbasi [86].

Agreement Bias: Agreement bias (also known as acquiescence bias) is the tendency to say "yes" or "I agree" to every item regardless of content. It is subtly different from social desirability bias as agreement bias includes admission to possessing socially undesirable traits. For example, respondents manifesting agreement bias might respond affirmatively to the question, "Have you ever used illicit drugs?" whereas those exhibiting social desirability bias would likely provide the opposite response. The phenomenon is thought to have multiple causes. First, it has been argued that most respondents desire to be polite and respectful and, thus, do not wish to disagree with the questioner [87, 88]. Second, respondents may feel that they have lower standing than the questioner and agree with questions based on this perceived status differential [89]. Third, respondents may select an agreeable (but not necessarily truthful) answer to complete the survey as rapidly as possible [90]. Whatever the cause, agreement bias can be detected (and sometimes resolved) by including a balance of positively and negatively worded items [91], though care must be taken to minimize confusion to the respondents.

Faking Bad Bias: In contrast to social desirability (or "faking good") bias, the "faking bad" bias occurs where failure (in the usual sense) is a goal. In the context of self-reported information, faking bad is a negative response bias that is caused by the respondent's desire to appear worse (e.g., manifest symptom amplification) than he or she really is either to avoid duty or responsibility (i.e., malinger) or to qualify for goods or services [38]. If faking bad bias is suspected, methods exist to detect it. (For a comprehensive discussion of one such method [the "Fake Bad Scale"], the reader is referred to Nelson et al. [92].)

Halo Effect: The halo effect is a systematic bias that occurs when respondents fail to rate individual attributes of a person, object, event, or service in isolation but instead let overall impressions guide their ratings. It is suspected whenever respondents assign similar ratings to each dimension measured in a survey (e.g., rate all aspects of performance as "excellent" or all components of a course or program as "very good"). The phenomenon, empirically confirmed by Thorndike in 1920 [93], is thought to result from a cognitive bias, whereby one particular trait, especially a positive characteristic, influences or extends to perception of other traits. A commonly cited example is judging an attractive person as more intelligent. Its logical opposite is sometimes termed the "devil," "horns," or "reverse-halo" effect whereby individuals judged to have a single undesirable trait (e.g., unattractiveness) are subsequently judged to have other undesirable traits (e.g., lack of intelligence) based on the evaluator's tendency to allow a single weakness to influence the totality of impressions [94]. In the setting of a survey, a respondent's prejudices, recollections of previous observations, and even answers to previous questions also may influence responses. Thus, the halo (and reverse-halo) effects collectively represent an important bias that must be recognized and, if possible, minimized to improve the accuracy of individual ratings. Several approaches have been recommended including proper introduction of the purpose of the survey (to emphasize the importance of the respondent's ratings), increasing the number of attributes to be rated (bearing in mind that an excessive number of questions may cause the respondent to abandon the survey), and/or physically arranging scales so that their favorable and unfavorable ends alternate.

Reliability

Reliability is related to the question "how consistent or reproducible are the scores that an instrument produces?" Like validity, reliability technically is considered to be a property of the measurement rather than of the instrument itself because the same instrument administered in
different settings and to different subjects under varying conditions can yield widely varying reliability estimates [63]. Reliability is considered to be a necessary, but insufficient, element of validity [95, 96]. This is because valid conclusions cannot be drawn from an instrument that yields inconsistent observations [63]. At the same time, reliability does not imply validity because an instrument can produce consistent errors.

The concept of reliability can be illustrated using the metaphor of a bathroom scale. For example, if you are like many people, you probably will step on your bathroom scale in the morning, check your weight, step off, and step back on the scale to recheck the reading. You have learned through experience that the measurement displayed by a bathroom scale the first time you weigh yourself is not always the same as the second time you try, but usually it is very close. A good scale might vary by half a pound or so, but if measured weight differs significantly (e.g., more than 5 lb) at 7:00 a.m., 7:01 a.m., and 7:02 a.m., the readings that the scale produces would have very limited reliability. Similarly, if an instrument is designed to measure a patient's self-confidence, then it should yield approximately the same result each time it is administered to the same subject.

Whereas validity is diminished by systematic error, reliability is reduced by random (chance) error. There are many sources of random error in research measurement. The most common are those caused by factors related to the subject, researcher, environment, and instrumentation. For example, a subject who is tired, sick, hungry, angry, irritable, or confused may produce measurements that are different than they would be if the subject were not so afflicted. Indeed, any changing physical, emotional, or psychological state of the subject, including the subject's awareness of the researcher's presence, can introduce error into the measurement process. The researcher can introduce random error in measurement simply by his or her physical appearance, demeanor, or other personal attributes or by becoming fatigued, impatient, bored, ill, or distracted. Many factors that cause random error in measurement can arise from perturbations of the research setting (e.g., unintended variations in temperature, lighting, noise, or interruptions). Finally, many factors causing random error have their source in the instrument. For example, unclear questions or directions, inadequate item sampling, suboptimal format, or even the order in which the questions are posed are potential sources of random error. Random error (like systematic error) must be considered in interpreting the results of studies; the greater the error, the less we can rely on the results of the measurement process for decision-making. In designing or selecting among instruments, we are constantly striving to create or identify those that not only measure the attribute of interest but also measure that attribute reliably.

Like validity, reliability can be classified according to several dimensions. These include the stability of the measurement over time, the congruence of a measurement when defined by different assessors (or determined by different methods), the consistency (homogeneity) of items within a measure or scale, and the correspondence of parallel measures. These dimensions, typically expressed as reliability coefficients, are evaluated using various methodological approaches, as described below:

Test-Retest Reliability (Temporal Stability): Test-retest reliability is the most commonly recognized form of reliability. It is evaluated by administering the same item, scale, or instrument to a sample of individuals twice over a relatively short period (the period depending on the intrinsic stability of the variable under study) and comparing the results using Pearson's product moment correlation for interval data or Spearman's rank order correlation for ordinal data. Test-retest correlation coefficients ranging from 0.70 to 0.80 generally are considered to be satisfactory to good (though criteria for acceptability vary according to discipline). This measure of reliability is most appropriate for assessing relatively enduring characteristics such as personality traits, aptitude, and chronic health status in stable populations where subjects are willing to undergo multiple administrations of the same measure. It is less appropriate for
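
As a rough illustration of how two of the coefficients named in this section might be computed, the following Python sketch calculates a test-retest correlation (Pearson's product moment and Spearman's rank order) and Cronbach's alpha, the latter using its standard variance-based form: alpha = k/(k - 1) x (1 - sum of item variances / variance of total scores), where k is the number of items. The function names and all scores are invented for illustration and do not come from any study cited in this chapter; an actual analysis would normally rely on an established statistical package.

```python
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank-order correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def cronbach_alpha(rows):
    """Cronbach's alpha from a respondents-by-items table of scores."""
    k = len(rows[0])  # number of items on the scale
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Invented scores for five hypothetical subjects measured twice.
first_administration = [12, 15, 9, 20, 17]
second_administration = [13, 14, 10, 19, 18]

# Invented responses of five hypothetical subjects to a three-item scale.
item_scores = [[3, 4, 3], [4, 4, 5], [2, 2, 1], [5, 5, 4], [4, 5, 4]]

if __name__ == "__main__":
    print("Test-retest (Pearson):", round(pearson_r(first_administration, second_administration), 2))
    print("Test-retest (Spearman):", round(spearman_rho(first_administration, second_administration), 2))
    print("Cronbach's alpha:", round(cronbach_alpha(item_scores), 2))
```

With so few invented subjects the resulting coefficients have no interpretive value; the sketch is meant only to make the arithmetic behind the coefficients concrete.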
can range from 0.00 to 1.00 (sometimes expressed as whole numbers, 1–100). A high KR-20 coefficient (i.e., >0.90) indicates a homogeneous measure or scale. A variant, the KR-21, is computationally simpler (it is based only on the assessment mean, variance, and number of items on the scale), but tends to produce lower reliability estimates.

Cronbach's Alpha [101] is the best known, and most commonly used, measure of internal consistency. Like the KR-20, Cronbach's alpha conceptually represents the mean of all split-half reliability estimates for a scale [76] and is computed by calculating pair-wise correlations between items in a scale; however, Cronbach's alpha can be used with scales that include several ordinal response options (e.g., 1 = strongly agree through 5 = strongly disagree or 0 = not limited by heart failure symptoms through 3 = severely limited by heart failure symptoms) as well as those that include binary response options, making it more flexible than the KR-20. Values of 0.70 or above are widely viewed as acceptable, and values of approximately 0.90 are considered to be excellent [102]; however, extremely high reliability estimates (i.e., 0.95) suggest that some of the items may be redundant, contributing no information beyond that furnished by other items on the scale. "Alpha if item is deleted" is a widely used index that can be useful for deleting nonhomogeneous or redundant items during the process of scale development. Nonetheless, when using standardized scales, all items (including those that reduce alpha) should be retained to permit meaningful comparison with previous as well as future assessments employing the same instrument.

Alternate (Equivalent, Parallel) Form Reliability. An investigator may be concerned that repeated measurement using the same instrument might threaten the internal validity of an intervention study because (as noted in Chap. 5) exposure to the first assessment can influence the results of subsequent assessment by providing an opportunity for practice or learning independent of the intervention. This threat to internal validity (testing effects) can be minimized (though not entirely eliminated) by using alternate forms of measurement of the same construct or content domain before and after the intervention. One commonly used approach to creating these alternate forms is to generate a large pool of items, each of which addresses the construct being studied, and randomly divide the items to create two functionally equivalent instruments of similar difficulty and length. Other methods include changing the wording or order of the questions in the two instruments. (The same approach is used to discourage cheating on high-stakes achievement or aptitude tests.) After the alternate forms are created, they are administered to the same sample, and the results are correlated. If they produce similar results for the same subjects (i.e., they yield correlation coefficients >0.80), they are considered to be equivalent forms and can be used interchangeably [62]. (The reader will note that the methodology for establishing alternate form reliability, when based on division of a related item pool, is analogous to that used for estimating split-half reliability. The primary difference is that with split-half reliability, items within a single scale or measure are divided solely for the purpose of determining internal consistency, whereas with the alternate form approach, the objective is to construct two equivalent instruments that can be used independently of one another.)

Ethical and Legal Aspects of Survey Methods

Given below is a brief précis of some ethical and legal issues involved in survey research. Any investigator should carefully review the policies of his or her institution to ensure compliance.
If the investigator has a professional license, that licensing body may also have relevant rules and regulations governing survey research.

1. General participation. In all cases, respondents must know that they are free to not participate, to skip questions, and to stop the survey at any time.

2. Sensitive questions. If sensitive questions are asked, provision should be made for debriefing, and respondents should be provided with information about relevant services, as appropriate. For example, if an investigator asks a subject about illicit drug use, information may need to be provided about available treatment facilities.

3. Privacy. Especially when sensitive information is discussed, substantial efforts should be made to keep identifying information private. One solution is to use code numbers rather than names and, if necessary, to store a link of code numbers to names in a separate and secure location.

4. Snowball (chain-referral) sampling. Sometimes, when a sampling characteristic is relatively rare within a population, or when a population is "concealed" from society at large, an investigator may have difficulty locating an adequate number of subjects for a survey. This can occur when the population of interest comprises individuals who exhibit illegal or otherwise stigmatized behaviors (e.g., illicit drug use or prostitution). One approach that sometimes is used to increase sample size under these conditions is to recruit a relatively small number of subjects who possess the desired sampling attribute and ask each subject to bring in additional subjects from among their acquaintances (social network) who possess the same attribute. These, in turn, may be called upon to recruit similar additional subjects for the study. Thus, the sample grows metaphorically like a "snowball." Though snowball sampling can reduce subject search costs and provide access to subjects who would otherwise be inaccessible, the investigator must take great care to adequately protect the potentially sensitive and damaging information given by respondents during the chain referral process, as disclosures from the investigator could compromise privacy of the subject and confidentiality of their data, destroy the relationships within the chain, and militate against further recruitment [103].

5. Focus groups. Focus groups pose special ethical problems, because members of the focus group share information that can, potentially, be used by one participant against another. As a hypothetical example, suppose a focus group of medical students were convened to evaluate specific academic programs and one member of the focus group identified a certain faculty member as incompetent. If another focus group member knew the identity of the participant expressing this view, he or she could be threatened or even blackmailed. As another example, if a focus group member acknowledged having HIV or some other stigmatized condition or admitted to engaging in illicit behavior (such as abuse of prescription or nonprescription drugs), similar problems could ensue.

6. Children and other special populations. Additional rules apply when conducting self-report surveys involving children and other special populations (e.g., prisoners, individuals with mental disabilities). These populations may have limited ability to supply informed consent, either due to lack of comprehension (e.g., young children and individuals with mental disabilities) or because of feelings of duress (e.g., prisoners). (A listing of these rules can be found in the Code of Federal Regulations, Title 45 Public Welfare, Department of Health and Human Services [104].)

Summary: A General Guide to Constructing a Measure

This chapter has highlighted the complexities of constructing a self-report measure. If the investigator believes that the need for a new measure outweighs the effort required to develop it, the following provides an outline of the essential steps involved, adapted from those suggested by
170 P.L. Flom et al.
DeVellis [40] and Fowler [39]. (Further details of narrowed later in the process. It is not uncom-
these steps can be found in their writings.) mon for the initial pool to contain four times
1. Determine precisely what must be mea- as many items as the number of items com-
sured. It is not sufcient to have a vague idea prising the nal scale.
of what it is to be measuredone needs to be 5. Determine the measurement format. As
fairly precise. If the study is analytic, how previously indicated, questions and responses
well does the new measure facilitate testing can be framed in numerous ways. The pre-
of the research hypothesis? If the study is ferred format should be considered at the
performed to generate a hypothesis, how well same time that the item pool is generated to
will the anticipated responses achieve this maintain consistency. For example, will the
objective? Will the measure assess knowl- survey be unstructured, semistructured, or
edge, attitudes, behaviors, or a combination structured? If the questions call for closed-
of these areas? What areas must be covered? ended responses, how many response catego-
How will the new measure differ from exist- ries will there be? What type of scaling will
ing measures? What theory will guide the be used? Will the time frame to which the
development of the new measure? How questions refer be specied or implied, etc.?
specic versus general should the measure 6. Develop validation items. Validation
be? As is the case for all forms of research, items are of two types: (a) those that do not
time spent clarifying objectives at the outset directly measure the construct under study,
will save a great deal of time later on. but which may be useful for detecting aws
2. Define the population of interest. State, as (biases) in the measurement process, and (b)
precisely as possible, whom you wish to those which assist in assessing the construct
study. Often, the choice will be a compro- validity of the new measure. Including a
mise between optimal versus available sub- social desirability scale can help to determine
jects. An investigator may be interested in all which items tend to be inuenced by this
humans with a disease, but it is never possi- positive bias and serve as a basis for elimi-
ble to study all such individuals. It also is nating them. The inclusion of items from a
very difcult, if not impossible, to obtain a putatively related measure can be used to
random sample of such individuals from buttress a claim of construct validity or iden-
around the world. Early in the design of the tify poorly performing items [40].
study, the investigator should identify the 7. Pretest. Once a large pool of items has been
age group and gender(s) of interest, the geo- dened, it can be reduced to a manageable
graphic location of potential respondents, number and screened for omissions, errors,
their racial or ethnic characteristics, etc. and related problems. Independent review by
3. Select the type of self-report to be used. content-matter experts, colleagues, and key
Decide whether the information being sought decision makers can be helpful for establish-
is best obtained via a mailed self-completed ing both the face and content validity of the
questionnaire, an in-person or telephone preliminary instrument and for obtaining
interview, or a computer-based method. Each feedback regarding specic items. Reviewers
approach has advantages and disadvantages, can be asked:
as noted above. How relevant each item is to the construct
4. Generate the item pool. Initially, a large being measured
pool of items should be generated, covering How clear the items are
as many different parts of the construct of If there are ways to make the items more
interest as possible from different perspec- concise
tives. Brainstorm. At this stage, the creator of If key items are missing (there should be
the survey instrument should not fear redun- at least one question for every variable of
dancy or a long list of itemsthese can be interest)
• If items are superfluous or redundant
• If items are difficult to read or answer (e.g., are ambiguous or otherwise unclear)

It also is helpful to solicit review of the drafted items from individuals who are similar to the intended respondents. This can be done within a focus group or as a series of one-on-one cognitive interviews conducted among a small number of individual respondents. Both approaches allow exploration of how well the items are understood and are particularly useful when the intended respondents differ greatly from the individuals writing the survey instrument. Specific questions should be asked about how respondents interpreted the questions, how they thought the various questions differed from each other, how readable they were, and what their responses meant. At this stage, questions can be open-ended, as one of the goals of pretesting is to identify response options that may have been overlooked (a prespecified list of response options will, by force, constrain the respondent to think like the survey developer). Feedback from the pretest can be used to add, delete, and otherwise refine questions to be included in the preliminary instrument and to frame appropriate response options.

8. Pilot test. Pilot testing is crucial to development of a valid and useful scale. No matter what care is taken in developing and screening items, some will be misinterpreted by respondents. Pilot testing involves administering the preliminary questionnaire (including the cover letter and directions) to respondents who, again, are as similar as possible to members of the target population. The pilot should be performed, to the extent possible, under conditions that mirror the conditions under which the final survey will be conducted. It should ask respondents to find flaws in the survey (e.g., Were directions and skip patterns (if any) clear? Was the survey too long? Was the format appropriate? Were any of the questions confusing or otherwise unclear? Did any not apply? Were any overly intrusive? Were any redundant? Did they flow well?). Statistical methods (e.g., evaluation of distributional characteristics, examination of missing answers, item-to-item and item-to-scale correlations) can be applied to responses obtained in the pilot to detect poorly performing or redundant items and to evaluate their impact on internal consistency when retained or deleted. It is difficult to find guidance regarding the minimal number of participants to be included in a pilot. Some workers in the field have suggested 300 [105]; others [40] have recommended that for single scales comprising relatively few (e.g., 20) items, a smaller number may suffice. A cautionary note is in order. If too few respondents are chosen, it may not be possible to evaluate the items properly; if the sample is not representative, items may have different meanings to the pilot sample versus the target population, and the relationships among the items may be different as well [40].

9. Edit. Invariably, once a measure is pilot tested, revision will be required. Directions may need to be clarified. Confusing, overly intrusive, or unanswered questions will need to be deleted or reworded (though reworded items may need to be retested). If revisions are extensive, a second round of pilot testing may be required. Once poorly performing items are eliminated, the length of the instrument should be evaluated. Too short a measure will not fully explore the construct of interest. However, one that is too long may bore or frustrate the respondents.

10. Assess reliability and validity. Before an instrument can be used for formal research purposes, its reliability and validity must be assessed in the population of interest. As noted above, the most common test for reliability is Cronbach's alpha; for validity, the appropriate method depends on the degree of development of substantive knowledge and the existence of (a) other measures of the same construct, (b) measures of similar but different constructs, and (c) a gold standard.
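The internal-consistency checks mentioned in steps 8 and 10 can be sketched in a few lines of Python. This is an illustration only, not a procedure given in the chapter: the function names and the pilot data are hypothetical, sample variances are used, and the usual formula for Cronbach's alpha, k/(k - 1) × (1 - sum of item variances / variance of total scores), is assumed. The second function recomputes alpha with each item removed in turn, which is one way to see the effect of a poorly performing item on internal consistency.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("at least two items are required")
    # Sample variance of each item across respondents.
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def alpha_if_item_deleted(rows):
    """Alpha recomputed with each item left out in turn (needs k >= 3)."""
    k = len(rows[0])
    return [
        cronbach_alpha([[v for j, v in enumerate(r) if j != i] for r in rows])
        for i in range(k)
    ]


# Hypothetical pilot data: 5 respondents x 4 items.
# Items 1-3 move together; item 4 is deliberately unrelated to the others.
pilot = [
    [1, 1, 1, 3],
    [2, 2, 2, 1],
    [3, 3, 3, 2],
    [4, 4, 4, 4],
    [5, 5, 5, 1],
]

if __name__ == "__main__":
    print("alpha, all items:", round(cronbach_alpha(pilot), 3))
    for i, a in enumerate(alpha_if_item_deleted(pilot), start=1):
        print(f"alpha if item {i} deleted:", round(a, 3))
```

In this made-up example, alpha rises when the unrelated fourth item is dropped and falls when any of the first three is dropped, which is the pattern an investigator would look for when deciding which pilot items to revise or remove. A real pilot would use far more respondents, as the text cautions.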
Take-Home Points

• A self-report (a.k.a. survey) is a measure where the respondent supplies information about him or herself.
• Self-reports are important in medical research because some variables (e.g., attitudes, beliefs, self-judged ability) only can be assessed from information directly furnished by the patient or other subject.
• A self-report is obtained by questionnaire, interview, or related methods.
• Questionnaires are written documents that can be self-completed without interviewer involvement or read aloud as part of an interview; interviews usually (but not always) are administered orally; both can be structured (comprise closed-ended questions), unstructured (comprise open-ended questions), or semistructured (comprise a mix of both question types).
• If answers to a research question can be obtained only via self-report, the investigator should first determine whether an instrument already exists that is reliable, valid, and otherwise suitable for the population of interest.
• In situations where a new instrument must be developed, the investigator must clearly define the question(s) of interest; identify the population to be surveyed; select the preferred type of self-report/format of measurement; consider inclusion of validation questions; pretest, pilot test and edit the measure; and test the final battery of questions for reliability and validity.
• When developing or implementing a survey, the investigator must be certain to observe all ethical and legal aspects of survey methodology.
16. Allan G. A note on interviewing spouses together. J Marriage Fam. 1980;42:205–210.
17. Kalischuk RG, Davies B. A theory of healing in the aftermath of youth suicide. J Holist Nurs. 2001;19:163–186.
18. Dolezal C, Mellins C, Brackis-Cott E, Abrams EJ. The reliability of reports of medical adherence from children with HIV and their adult care givers. J Pediatr Psychol. 2003;28:355–361.
19. Dym B, Berman S. The primary health care team: family physician and family therapist in joint practice. Fam Syst Med. 1986;4:9–21.
20. Morrison-Beedy D, Côté-Arsenault D, Feinstein NF. Maximizing results with focus groups: moderator and analysis issues. Appl Nurs Res. 2001;14:48–53.
21. Quatromoni PA, Milbauer M, Posner BM, Carballeira NP, Brunt M, Chipkin SR. Use of focus groups to explore nutrition practices and health beliefs of urban Caribbean Latinos with diabetes. Diabetes Care. 1994;17:869–873.
22. Hicks LK, Lin Y, Robertson DW, Robinson DL, Woodrow SI. Understanding the clinical dilemmas that shape medical students' ethical development: questionnaire survey and focus group study. BMJ. 2001;322:709–710.
23. Flanagan JC. The critical incident technique. Psychol Bull. 1954;51:327–358.
24. Coté CJ, Notterman DA, Karl HW, Weinberg JA, McClosky C. Adverse sedation events in pediatrics: a critical incident analysis of contributing factors. Pediatrics. 2000;105:805–14.
25. Branch W, Pels RJ, Arky R. Becoming a doctor. Critical-incident reports from third-year medical students. N Engl J Med. 1993;329:1130–1132.
26. Allery LA, Owen PA, Robling MR. Why general practitioners and consultants change their clinical practice: a critical incident study. BMJ. 1997;314:870–874.
27. Faithfull S. The diary method for nursing research. Eur J Cancer Care. 2007;1:13–18.
28. Bruijnzeels NA, Foets M, van der Wooden JC, Prins A, van den Houvel WJ. Measuring morbidity of children in the community: a comparison of interview and diary data. Int J Epidemiol. 1998;27:96–100.
29. White MM, Howie-Esquivel J, Caldwell MA. Improving heart failure symptom recognition: a diary analysis. Cardiovasc Nurs. 2010;25:7–12.
30. Woodfield R, Goodyear-Smith F, Arroll B. N-of-1 trials of quinine efficacy in skeletal muscle cramps of the leg. Br J Gen Pract. 2005;55(512):181–185.
31. Aitken L, Mardegan KJ. Thinking aloud: data collection in the natural setting. Western J Nurs Res. 2000;22:841–853.
32. Fonetyn M, Fisher A. Use of think aloud method to study nurses' reasoning and decision making in clinical practice settings. J Neurosci Nurs. 1995;27:124–128.
33. Ericsson K, Simon H. Protocol analysis: verbal reports as data. London: MIT Press; 1993.
34. Murphy LL, Spies RA, Plake BS, editors. Tests in print VII. Lincoln: Buros Institute of Mental Measurements; 2006.
35. Geisinger KF, Spies RA, Carlson JF, Plake BS, editors. The seventeenth mental measurements yearbook. Lincoln: Buros Institute of Mental Measurements; 2007.
36. Goldman BA, Mitchell DF, Egelson PE, editors. Directory of unpublished experimental mental measures. Washington, DC: American Psychological Association; 2007.
37. Bieri D, Reeve R, Champion GD, Addicoat L, Ziegler JB. The Faces Pain Scale for the self-assessment of the severity of pain experienced by children: development, initial validation and preliminary investigation for ratio scale properties. Pain. 1990;41:139–150.
38. Choi BCK, Pak AWP. A catalog of biases in questionnaires. Prev Chronic Dis. 2005;2:1–13.
39. Fowler FJ. Improving survey questions. Thousand Oaks: Sage; 1995.
40. DeVellis RF. Scale development: theory and applications. Newbury Park: Sage; 1991.
41. Chang AM, Chau JPC, Holroyd E. Translation of questionnaires and issues of equivalence. J Adv Nurs. 2010;29:316–322.
42. Healey B, Gendall P. Asking the age question in mail and online surveys. Austral and New Zeal Marketing Acad (ANZMAC) Conference 2007. Dunedin; 2007.
43. Heise DR. The semantic differential and attitude research. In: Summers GF, editor. Attitude measurement. Chicago: Rand McNally; 1970.
44. Aiken LR. Rating scales and checklists. New York: Wiley; 1996.
45. DeLoach LJ, Higgins MS, Caplan AB, Stiff JL. The visual analog scale in the immediate postoperative period: intrasubject variability and correlation with a numeric scale. Anesth Analg. 1998;86:102–106.
46. Brealey SD, Atwell C, Bryan S, Coulton S, Cox H, Cross B, Fylan F, Garratt A, Gilbert FG, Gillan MGC, Hendry M, Hood K, Houston H, King D, Morton V, Orchard J, Robling M, Russell IT, Torgerson D, Wadsworth V, Wilkinson C. Improving response rates using a monetary incentive for patient completion of questionnaires: an observational study. BMC Med Res Methodol. 2007;7:12–16.
47. Asch D, Jedrziewski MK, Christakis N. Response rates to mail surveys published in medical journals. J Clin Epidemiol. 1997;50:1129–1136.
48. Diment K, Garrett-Jones S. How demographic characteristics affect mode preference in a postal/web mixed survey of Australian researchers. Soc Sci Comput Rev. 2007;25:410–417.
49. Shih TH. Comparing response rates from web and mail surveys: a meta-analysis. Field Methods. 2008;20:249–271.
50. O'Toole J, Sinclair M, Leder K. Maximising response rates in household telephone surveys. BMC Med Res Methodol. 2008;8:71.
51. Tourangeau R, Smith TW. Asking sensitive questions: the impact of data collection mode, question format and question context. Public Opin Q. 1996;60:275–304.
52. Turner CF, Al-Tayyib AA, Rogers SM, Eggleston MA, Villarroel MA, Roman AM, Chromy JR, Cooley PC. Improving epidemiological surveys of sexual behavior conducted by telephone. Int J Epidemiol. 2009;38:1118–1127.
53. Couper MP, Nicholls II WL. The history and development of computer assisted survey information collection methods. In: Couper MP et al., editors. Computer assisted survey information collection. New York: Wiley; 1998.
54. Vataja R, Pohjasvaara T, Leppävuori A, Mäntylä R, Aronen HJ, Salonen O, Kaste M, Erkinjuntti T. Magnetic resonance imaging correlates of depression after ischemic stroke. Arch Gen Psychiatry. 2001;58:925–31.
55. Schackman BR, Dastur Z, Rubin DS, Berger J, Camhi E, Netherland J, Ni Q, Finkelstein R. Feasibility of using audio computer-assisted self-interview (ACASI) screening in routine HIV care. AIDS Care. 2009;21:992–999.
56. Oetting ER, Beauvais F. Adolescent drug use. J Consult Clin Psychol. 1990;58:385–394.
57. Fidler DS, Kleinknec RE. Randomized response versus direct questioning: two data-collection methods for sensitive information. Psychol Bull. 1977;84:1045–1049.
58. Lensvelt-Mulders GJLM, Hox JJ, van der Heijden PGM, Maas CJM. Meta-analysis of randomized response research. Sociol Method Res. 2005;33:319–348.
59. Edwards P, Roberts I, Clarke M, DiGuisseppi C, Pratap S, Wentz R, Kwan I. Increasing response rates to postal questionnaires. BMJ. 2002;324:1183–91.
60. Brennan M, Charbonnau J. Improving mail survey response rate using chocolate and replacement questionnaires. Public Opin Q. 2009;73:368–378.
61. Merriam-Webster Online. Available at http://www.m-w.com/. Accessed 27 July 2010.
62. Waltz CF, Strickland OL, Lenz ER. Measurement in nursing and research. New York: Springer Publishing Inc; 2005.
63. Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments: theory and application. Am J Med. 2006;119(2):166.e7–166.e16.
64. Litwin MS. How to measure survey reliability and validity. Thousand Oaks: Sage; 1995.
65. Beck AT, Steer R, Brown GK. Manual for the Beck Depression Inventory-II. San Antonio: Psychological Corporation; 1996.
66. Williams RA. Women's health content validity of the family medicine in-training exam. Fam Med. 2007;39:572–577.
67. Ware JE, Sherbourne CD. The MOS 36 item short form health survey. Med Care. 1992;30:473–483.
68. Feldman AB, Haley SM, Coryell J. Concurrent and construct validity of the pediatric evaluation of disability inventory. Phys Ther. 1990;70:602–610.
69. Lin JM, Brimmer DJ, Maloney EM, Nyarko E, BeLue R, Reeves WC. Further validation of the Multidimensional Fatigue Inventory in a US adult population sample. Popul Health Metr. 2009;7:18. doi:10.1186/1478-7954-7-18.
70. McHorney CA, Ware Jr JE, Raczek AE. The MOS 36-item Short-Form Health Survey (SF-36): II. Psychometric and clinical tests of validity in measuring physical and mental health constructs. Med Care. 1993;31:247–263.
71. Management Sciences for Health. Creating a climate that motivates staff and improves performance. The Manager. 2003;11:1–22.
72. Tennant R, Hiller L, Fishwick R, Platt S, Joseph S, Parkinson J, Secker J, Stewart-Brown S. The Warwick-Edinburgh Mental Well-Being Scale (WEMWBS): development and UK validation. Health and Quality of Life Outcomes. 2007;5:63. doi:10.1186/1477-7525-5-63.
73. Cooper SM, Baker JS, Tong RJ, Roberts E, Hanford M. The repeatability and criterion related validity of the 20 m Multistage Fitness Test as a predictor of maximal oxygen uptake in active young men. Br J Sports Med. 2005;39:e19.
74. Reuben DB, Siu AL, Kimpau S. The predictive validity of self-report and performance-based measures of function and health. J Gerontol. 1991;47:M106–M110.
75. Heather N, Rollnick S, Bell A. Predictive validity of the readiness to change questionnaire. Addiction. 1993;88:1667–1677.
76. Portney LG, Watkins MP. Foundations of clinical research. Applications to practice. Upper Saddle River: Prentice Hall Health; 2000.
77. Guyatt G, Walter S, Norman G. Measuring change over time: assessing the usefulness of evaluative instruments. J Chronic Dis. 1987;40:171–178.
78. Hays RD, Hadorn D. Responsiveness to change: an aspect of validity, not a separate dimension. Qual Life Res. 1992;1:73–75.
79. Beaton DE, Bombadier C, Katz JN, Wright JG. A taxonomy for responsiveness. J Clin Epidemiol. 2001;54:1204–1217.
80. Husted JA, Cook RJ, Farewell VT, Gladman DD. Methods for assessing responsiveness: a critical review and recommendations. J Clin Epidemiol. 2000;53:459–468.
81. Roach KE. Measurement of health outcomes: reliability, validity and responsiveness. JPO. 2006;18:8–12.
82. Liang MH, Fossel AH, Larson MG. Comparison of five health status instruments for orthopedic evaluation. Med Care. 1990;28:632–642.
83. Angst F, Verra ML, Lehmann S, Aeschlimann A. Responsiveness of five condition-specific and generic outcome assessment instruments for chronic pain. BMC Med Res Methodol. 2008;8:26 (published online 2008 April 25. doi:10.1186/1471-2288-8-26).
84. Tavris C, Aronson E. Mistakes were made, but not by me. Orlando: Harcourt Books; 2008.
85. Crowne DP, Marlowe D. A new scale of social desirability independent of psychopathology. J Consult Psychol. 1960;24:349–354.
86. Strahan R, Kerbasi K. Short homogenous version of the Marlowe-Crowne Social Desirability Scale. J Clin Psychol. 1972;28:191–193.
87. Furnham A, Henderson M. The good, the bad and the mad: Response bias in self-report measures. Pers Indiv Differ. 1982;3:311–320.
88. Leary MR, Kowalski RM. Impression management: a literature review and two-component model. Psychol Bull. 1990;107:34–47.
89. Lenski GE, Leggett JC. Caste, class, and deference in the research interview. Am J Sociol. 1960;65:463–467.
90. Krosnick JA, Alwin DF. An evaluation of cognitive theory of response order effects in survey measurement. Public Opin Q. 1987;51:201–219.
91. Toner B. Impact of agreement bias on the rating of questionnaire response. J Soc Psychol. 1987;127:221–222.
92. Nelson NW, Parsons TD, Grote CL, Smith CA, Sisung II JR. The MMPI-2 Fake Bad Scale: concordance and specificity of true and estimate scores. J Clin Exp Neuropsychol. 2006;28:1–12.
93. Thorndike EL. A constant error in psychological rating. J Appl Psychol. 1920;4:25–29.
94. Roeckelein J. Elsevier's dictionary of psychological theories. Amsterdam: Elsevier BV; 2006.
95. Feldt LS, Brennan RL. Reliability. In: Linn RL, editor. Educational measurement. 3rd ed. New York: American Council on Education and Macmillan; 1989.
96. Downing SM. Validity: on the meaningful interpretation of assessment data. Med Educ. 2003;37:830–837.
97. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174.
98. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420–428.
99. McDowell I, Newell C. Measuring health. A guide to rating scales and questionnaires. 2nd ed. New York: Oxford University Press; 1996.
100. Kuder GF, Richardson MW. The theory of the estimation of test reliability. Psychometrika. 1937;2:151–60.
101. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16:297–334.
102. George D, Mallery P. SPSS for Windows step by step. Boston: Allyn & Bacon; 2003.
103. Faugier J, Sargeant M. Sampling hard to reach populations. J Adv Nurs. 1997;26:790–797.
104. Code of Federal Regulations, Title 45 Public Welfare, Department of Health and Human Services. Revised 15 Jan 2009 (effective 14 July 2009).
105. Nunnally JC. Psychometric theory. New York: McGraw-Hill; 1978.
9 Selecting and Evaluating Secondary Data: The Role of Systematic Reviews and Meta-analysis

Sorting through the body of available literature is a daunting task. MEDLINE, only one of many databases, indexed 902,346 articles in 2010. This number reflects a continuing increase over 2009 (854,506) and 2008 (821,834). How can clinicians have any chance of keeping up with the literature or use it for guiding research or for formulating clinical practice decisions if their primary sources are restricted to individual studies? The answer is that it is difficult, if not increasingly impractical. Reliance on individual studies is further complicated when current beliefs and standards of practice are challenged by new studies. For clinicians to make informed decisions, they must analyze multiple studies for both their quality and relevance to the patient population of interest. This is a principal reason for the long lag time before clinical research is incorporated into standard practice. A representative example is the 20-year delay between initial reports suggesting the utility of thrombolytic therapy for myocardial infarctions in the late 1970s and its adoption in the 1990s [1]. For these reasons, secondary sources such as narrative reviews, systematic reviews, and meta-analyses are an important means for physicians to translate clinical research into standard practice and help reconcile conflicting studies in the literature.

Difference Between a Narrative Review, Systematic Review, and Meta-analysis

A narrative review (sometimes termed a traditional literature review) is a summary of primary published studies in which conclusions are drawn by the reviewer, guided by his or her own interpretations of the studies, rather than by external criteria. Narrative reviews are well suited for general topics or broad coverage of a field as they usually cover a wide range of issues within a given topic [2], e.g., "Update on Multiple Sclerosis." Typically, they are written by experts in the specific field of study rather than by experts on research methodology. As such, narrative reviews do not necessarily explicitly state or follow the rules of evidence-based search strategies (including selection criteria for articles and abstracts found) or assess the quality or validity of the included studies. This deficit leads to lack of transparency and reproducibility and is likely to reflect a biased selection of the total evidence available (selection bias). A common bias in narrative reviews is failure to include research that conflicts with the beliefs or opinions of the expert. Nonetheless, the majority of published reviews are narrative rather than systematic.

L. Paladino, MD • R.H. Sinert, DO
Department of Emergency Medicine, SUNY Downstate Medical Center, 450 Clarkson Avenue, 1228, Brooklyn, NY 11203, USA
e-mail: Lorenzopaladino@yahoo.com; Richard.sinert@downstate.edu

P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology: A Guide for Clinical Investigators, DOI 10.1007/978-1-4614-3360-6_9, © Phyllis G. Supino and Jeffrey S. Borer 2012
In contrast, systematic reviews (in medicine, written most commonly about treatment or diagnostic research) focus on a specific question within a topic (e.g., "Are steroids effective in controlling flares of multiple sclerosis?" "Does positron emission tomography have strong positive predictive value for breast cancer?"), rendering them amenable to an explicit search strategy. This characteristic makes them excellent tools to explore clinically relevant topics. Systematic reviews identify the databases searched and, thus, present clear and reproducible search strategies. A comprehensive literature search is conducted, and all identified studies are assessed for relevance and methodology. Selection is based on predefined inclusion and exclusion criteria, quality is assessed, and data are abstracted in a standardized format. By explicitly stating how the evidence was found, how it was appraised or validated, and which studies were excluded (and why), systematic reviews eliminate many of the biases inherent in narrative reviews.

A meta-analysis (sometimes termed a quantitative review) often, but not always, is included as a component of a systematic review. First used for medicine in 1904 by renowned statistician Karl Pearson to examine the preventive effect of serum inoculations against enteric fever [3] and later formalized by contemporary statistician and educational researcher, Gene V. Glass (who coined the term in 1976) [4], meta-analysis currently is employed in many disciplines as a statistical methodology to combine the results of several studies about a topic as if they were from one large study. In studies of treatment (the most common focus of meta-analysis in clinical medicine), its principal purposes are to enable detection of overall and subgroup effects (as statistical power may be suboptimal due to limitations in sample size of individual trials), to improve estimates of the magnitude of these effects, and to aid in the resolution of uncertainty due to inconsistent findings (i.e., interstudy differences) [5]. The studies included in a meta-analysis should be found using the same rigorous search methodology as that used for systematic reviews.

Well-constructed systematic reviews and meta-analyses have many of the characteristics of effective research described by Tuckman [6] and reviewed in Chap. 1. They are systematic because information gathering is done in a structured and rigorous way and the data contained within them are interpreted. They are logical in that their methodologies employ tools for assessing the studies' bias (internal validity) and procedures to discern the effects of varying populations on study results (external validity). They are replicable both because they demonstrate whether the results of individual studies are congruent and also because the methodology employed in the review, if properly performed and reported, is sufficiently explicit to permit reproduction. They are transmittable because, by digesting available information and coming to a conclusion, they effectively summarize what is currently known on a specific topic and, when published, enable clinicians to learn about the conclusions of research. In addition, meta-analyses, specifically, gather, compare, and pool the empirical products (data) of the studies collected and are reductive to a clinical conclusion. As noted above, meta-analyses increase sample size by pooling the subjects of smaller studies when appropriate. This larger N increases the generalizability of the results. When the results cannot be pooled, they often shed light on reasons why results may not be generalizable.

Searching for a Systematic Review or Meta-analysis

Almost all of the databases described in Chap. 2 can be used to search for meta-analyses and systematic reviews. The "Clinical Queries" link on the PubMed interface for MEDLINE can be used to apply search filters to focus on systematic reviews [7]. A variety of databases also are available that specialize in systematic reviews and meta-analyses. The Cochrane Library (www.thecochranelibrary.com), developed under the auspices of the Cochrane Collaboration (an international network dedicated to promoting well-informed health-care decision-making), maintains an online collection of systematic reviews on intervention and treatment. The Database of
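Returning briefly to the pooling step of a meta-analysis described earlier in this chapter, the idea of combining several studies "as if they were from one large study" can be illustrated with a minimal sketch. The chapter does not specify a pooling model at this point, so the code below assumes one common choice, a fixed-effect inverse-variance average, in which each study's effect estimate is weighted by the reciprocal of its squared standard error. The function names and the three study results (log odds ratios with standard errors) are hypothetical and are not taken from any cited trial.

```python
import math


def pool_fixed_effect(effects, std_errors):
    """Inverse-variance (fixed-effect) pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]  # precise studies weigh more
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


def confidence_interval(estimate, se, z=1.96):
    """Approximate 95% confidence interval under a normal assumption."""
    return estimate - z * se, estimate + z * se


# Hypothetical inputs: log odds ratios and standard errors from three small trials.
log_odds_ratios = [-0.40, -0.10, -0.25]
standard_errors = [0.30, 0.20, 0.25]

if __name__ == "__main__":
    pooled, pooled_se = pool_fixed_effect(log_odds_ratios, standard_errors)
    low, high = confidence_interval(pooled, pooled_se)
    print("pooled log odds ratio:", round(pooled, 3))
    print("pooled standard error:", round(pooled_se, 3))
    print("approx. 95% CI (odds ratio scale):",
          round(math.exp(low), 2), "to", round(math.exp(high), 2))
```

The pooled standard error is smaller than that of any single study, which is the arithmetic behind the gain in statistical power noted above. A fixed-effect model assumes the studies estimate one common effect; when interstudy differences are substantial, a different model or no pooling at all may be more appropriate.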
concept of mortality reduction: what period of time is clinically meaningful? 30 days? 6 weeks? 6 months? 1 year?).

Defining the Literature Search Strategy: Keywords, MeSH, and Boolean Operators

An organized literature search will increase the likelihood of finding answers to the question of interest. The PICO question described above can be subdivided into its four components for entry into the database's search engine. We recommend that the reviewer search broadly at first and then search more narrowly ("cone down"). The more limited the initial search, the more likely it will miss relevant articles. Each component of the question should be searched by keywords, probable synonyms, and, if using PubMed, its MeSH (medical subject headings) terms (also called descriptors). MeSH is the US National Library of Medicine's (NLM) controlled vocabulary thesaurus that is used for indexing articles; it is hierarchically arrayed to facilitate searching at varying levels of specificity [9]. Use of all of these tools invariably will yield a more inclusive search.

Consider the example: "Does drawing blood cultures (intervention) change mortality (outcome) in adult patients with pneumonia (population)?" (The comparison implied by the question is not drawing blood cultures.) In some literature, blood culture may be classified as "microbiological culture," "microbial culture," or "microbial testing"; pneumonia as "lung infection" or "respiratory infection"; and mortality as "death" or "survival." MeSH terms can help expand the search by including many or all of these synonyms under one umbrella (Fig. 9.1). However, they should not be solely relied upon because inclusion or exclusion of an item under a specific MeSH is determined subjectively by those performing the NLM indexing.

During the search, the selected terms are connected by the Boolean operators "AND," "OR," and "NOT" (see Venn Diagram,
Fig. 9.2). The meanings of these operators are self-explanatory; however, the implications of adding them to a search deserve outlining. The "OR" operator expands the search to include any of the selected terms, whereas "AND" limits it to those that contain all selected terms.

Having formed the search question, the next step in constructing the systematic review is consideration of the types of literature available to answer the question. Selection should be based on several key factors, the most important of which are listed below.
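
The effect of the two operators on a search string can be sketched in a few lines of Python. This is only an illustration built on the blood culture example: the helper names (`or_group`, `build_query`) and the synonym lists are assumptions made here, and the result is a plain Boolean string rather than a validated PubMed search strategy.

```python
def or_group(terms):
    """Join the synonyms for one PICO component with OR (broadens the search)."""
    # Multi-word terms are quoted so they are read as phrases.
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(*components):
    """Join the OR-groups of several components with AND (narrows the search)."""
    return " AND ".join(or_group(c) for c in components)


# Hypothetical, non-exhaustive synonym lists taken from the worked example.
intervention = ["blood culture", "microbiological culture", "microbial culture"]
population = ["pneumonia", "lung infection", "respiratory infection"]
outcome = ["mortality", "death", "survival"]

broad = or_group(outcome)                                 # any outcome synonym
narrow = build_query(intervention, population, outcome)   # all three components

print(broad)
print(narrow)
```

Grouping synonyms with OR before combining groups with AND mirrors the advice to search broadly first and then "cone down."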
To start a search broadly, the keywords in the query should be connected by the "OR" operator (e.g., mortality OR survival). This strategy provides the sum of all words as if they were searched individually. By adding "AND pneumonia," the search will yield articles only about both mortality (OR survival) and pneumonia. This concept is illustrated by the Venn diagram given in Fig. 9.3.

Levels of Evidence
Medline and other databases contain literature that is very heterogeneous with regard to the strength of evidence provided. The varying types of studies contained within the literature are represented here as a pyramid (Fig. 9.4), with the weakest evidence for answering clinical
182 L. Paladino and R.H. Sinert
questions shown at the bottom and the strongest evidence shown at the top. Bias decreases as we move up the pyramid, in direct contrast to the amount of literature available on a given topic.
In vitro and animal studies, although important for hypothesis generation, cannot be applied directly for clinical care or provide a direct answer to a clinical research question, as can case reports, series, case-control, cohorts, and randomized controlled clinical trials (RCTs). As noted in previous chapters, a case report describes the presentation and/or treatment of an individual patient, whereas a case series consists of a collection of reports on several individual patients. Because they do not have control groups with which to compare outcomes, neither has validity for drawing conclusions about associations or cause and effect. Case-control studies are always retrospective studies in which subjects who already have a specific condition are compared with those who do not. These studies are well suited to test associations between risks or toxic exposures and diseases, especially when the latter are relatively rare. Data collection typically is based on the medical record and/or patient recall. Though case-control studies provide stronger evidence for association than case reports or case series, caution must be exercised in interpretation of results because demonstration of a statistical relationship does not provide proof of causality. Cohort studies follow individuals with specific risk factors or exposures over time and compare them with comparable individuals who do not have the risk factor or exposure being studied to evaluate differences in outcomes. Though cohort studies (particularly those that are prospective in nature) provide better evidence of association than case-control studies, they (like case-control studies) are observational and, as such, are subject to more bias than studies in which an intervention has been purposely applied; their greatest utility in clinical epidemiology is for defining prognosis of a disease. Quasi-experimental studies contain some of the elements of true experiments (parallel control groups and/or repeated assessments) but (as noted in Chap. 5), due to lack of random allocation to treatment group, are not fully protected from all threats to internal validity. In contrast, randomized controlled clinical trials (RCTs) study the effects of a
Table 9.1 Criteria for calculating the Jadad score (Reprinted with permission from Jadad et al. [12])
Criteria Yes (1 point) No (0 points)
1. Was the study described as randomized?
2. Was the randomization process described, and was it appropriate?
3. Was the study described as double blind?
4. Was the method for double blinding appropriate?
5. Were the withdrawals and dropouts from the study enumerated?
Interpretation
Score 0–2 Low-quality study
Score 3–5 High-quality study
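
The scoring rule in Table 9.1 can be written out as a short Python sketch. It follows the table exactly as printed here (one point per "yes," summed over five criteria, with 0–2 read as low quality and 3–5 as high quality); the dictionary keys and function names are hypothetical labels chosen for this example, and the sample answers do not describe any real trial.

```python
# Hypothetical short labels for the five criteria listed in Table 9.1.
JADAD_ITEMS = (
    "described_as_randomized",
    "randomization_appropriate",
    "described_as_double_blind",
    "blinding_appropriate",
    "withdrawals_enumerated",
)


def jadad_score(answers):
    """Return one point for every criterion answered 'yes' (True)."""
    return sum(1 for item in JADAD_ITEMS if answers.get(item, False))


def jadad_quality(score):
    """Apply the interpretation rows of the table: 0-2 low, 3-5 high."""
    if not 0 <= score <= 5:
        raise ValueError("score must lie between 0 and 5")
    return "low" if score <= 2 else "high"


# Invented answers for an imaginary trial report.
example_trial = {
    "described_as_randomized": True,
    "randomization_appropriate": True,
    "described_as_double_blind": False,
    "blinding_appropriate": False,
    "withdrawals_enumerated": True,
}

score = jadad_score(example_trial)
print(score, jadad_quality(score))
```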
purposively applied therapy by comparing an intervention group and control group to which subjects have been randomly allocated. They also incorporate additional methodologies such as blinding (masking) and analysis by intention-to-treat that reduce the potential for a variety of threats to internal validity, though they may suffer from limitations in generalizability (external validity). In theory, as syntheses of prior research, systematic reviews and meta-analyses, though relatively few in number, are at the top of the pyramid, providing the strongest evidence for associations or cause-and-effect relationships. However, for this to be true, both must meet stringent methodological quality criteria (described below) and the elements of the meta-analysis (i.e., the included studies), specifically, must have sufficiently similar study design characteristics to permit pooling of results, a criterion that is not always met in practice. When it is not, a meta-analysis, if performed, will be more useful for hypothesis generation than for hypothesis testing [10].

Standardizing Selection of Articles
The list of abstracts generated from the PICO search query is next screened for selection of relevant articles. Although inclusion criteria (e.g., nature of the patient population, specific outcomes and summary measures) optimally are predefined, the process is not immune to subjectivity and bias. The list should be screened independently by two reviewers to minimize subjectivity. Any discrepancies should be compared and discussed to reach a consensus. The reviewers' interrater reliability should be measured and reported. The predefined inclusion and exclusion criteria should be reported in the methods section and the search strategy in the appendix, to facilitate replication of results.

Assessing the Quality of Primary Studies
Assessment of bias in the methodology of the individual studies is a core component of a systematic review; therefore, tools for appraising the quality of the individual studies should be integrated. Unfortunately, no gold standard exists to evaluate the methodology of therapeutic trials or assessments of diagnostic test performance even though their quality and methods for synthesis are thought by some to be superior to those of other forms of clinical research (e.g., prognostic studies) [11]. Consensus and working groups continually reevaluate and improve upon assessment tools; thus, the preferred methods or systems change over time. Below is a listing and brief discussion of a cross section of tools for detecting bias in these types of studies. We present these to introduce the topic rather than to advocate a specific scoring system. (For the author of a primary study, they can be used as a checklist to ensure a sufficiently comprehensive methods section.)

Therapeutic Testing Articles Appraisal
A variety of assessment tools for therapeutic articles exist such as the Jadad scale [12], shown in Table 9.1. Common to all is evaluation of key areas prone to bias. Inclusion and exclusion criteria should be reviewed to decide whether the patients included in the identified study meet the requirements of the "P" of the PICO. As indicated earlier, the highest quality studies optimally
will use randomized treatment assignment with concealment of allocation, double blinding, and intention-to-treat analyses. Follow-up should be complete and transparent. In addition, readers should look for an explanation as to why participants may have dropped out of an investigation, as differential attrition from a study may impact conclusions regarding the effectiveness of the investigational treatment (e.g., if the sickest patients dropped out of the treatment arm receiving an investigational new drug, the drug might appear to be more effective than it is).
Studies about treatment, optimally, will express the impact of therapy quantitatively as the number needed to treat (NNT) or the number needed to harm (NNH). The NNT is the number of patients that need to be given the intervention for one patient to benefit, thus expressing the effectiveness of an intervention in a clinically meaningful manner. It is calculated as the reciprocal of the difference in outcomes of the intervention and control groups (absolute risk reduction) derived from a therapeutic trial. The closer the NNT is to 1, the greater the efficacy of the intervention; the further from 1, the lesser its efficacy. As an example, in the landmark study ISIS-2 [13], the efficacy of (1) 1 h of IV infusion of 1.5 MU streptokinase (SK), (2) 1 month of 160 mg of enteric-coated aspirin (ASA) taken daily, and (3) both active agents versus placebo was evaluated through 35 days after a suspected acute myocardial infarction (AMI) among 17,187 patients. Analysis revealed that the absolute reductions in risk of vascular mortality associated with SK and ASA and their combination versus placebo, respectively, were 2.8%, 2.4%, and 5.2%, yielding NNTs of 36 (SK), 42 (ASA), and 19 (SK + ASA). These NNTs (not calculated in the original study) indicated that 36 patients would need to be treated with SK and 42 patients with ASA to prevent one vascular death, whereas the same result could be achieved with combination therapy in 19 patients.
A closely related parameter is the number needed to harm (NNH), calculated as the inverse of the absolute risk increase (again expressed as a proportion) and interpreted as the number of patients one would need to treat to expect an adverse outcome. The NNT must be weighed with the baseline risk, NNH, benefit magnitude and/or cost to have comprehensive meaning to the clinician. It may be more acceptable in clinical practice to apply a treatment that is inexpensive, easy to use, and of almost no adverse risk but has a higher NNT than one that has a lower NNT but is expensive, dangerous, and has only a marginal clinical benefit. For example, while the NNT was relatively higher with aspirin than with SK in ISIS-2, there was no reported bleeding requiring transfusion or confirmed cerebral hemorrhage associated with aspirin (a very low cost, easy-to-manage intervention), whereas there was a very small (though statistically significant) excess occurrence of these events with SK (0.5% vs. 0.2% with placebo [major bleeds], equivalent to an NNH = 333; 0.1% (SK) vs. 0.0% with placebo [cerebrovascular hemorrhage], equivalent to an NNH = 1,000).

Diagnostic Testing Articles Appraisal
Diagnostic accuracy studies investigate how well the results from an index test (test being evaluated) agree with the results of the reference standard. (As noted above, the "reference standard" or "gold standard" is considered the best available method to determine the presence or absence of a condition.) Diagnostic studies have unique design features which differ from therapeutic testing; therefore, different methods exist for detecting bias and variability.
The Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool [14] is one such method. The tool comprises 14 items, defined by expert consensus, that examine a variety of important biases and other methodological concerns specific to the evaluation of diagnostic tests (Table 9.2), though it does not address the issue of intra- or interobserver reliability. Responses are framed as binary "yes"/"no" questions, or if not enough information is supplied, "unclear." The Cochrane Collaboration offers similar tools for assessing diagnostic studies [15].
In the past, calculations of the sensitivity, specificity, and predictive values of a diagnostic test were considered sufficient for evaluation of its utility. In this era, a high-quality diagnostic
Table 9.2 The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included
in systematic reviews (Reproduced with permission from Whiting et al. [14])
Item (each item is answered Yes, No, or Unclear)
1. Was the spectrum of patients representative of the patients who will receive the test in practice?
2. Were selection criteria clearly described?
3. Is the reference standard likely to correctly classify the target condition?
4. Is the time period between reference standard and index test short enough to be reasonably sure that the target condition did not change between the two tests?
5. Did the whole sample or a random selection of the sample, receive verification using a reference standard of diagnosis?
6. Did patients receive the same reference standard regardless of the index test result?
7. Was the reference standard independent of the index test (i.e. the index test did not form part of the reference standard)?
8. Was the execution of the index test described in sufficient detail to permit replication of the test?
9. Was the execution of the reference standard described in sufficient detail to permit its replication?
10. Were the index test results interpreted without knowledge of the results of the reference standard?
11. Were the reference standard results interpreted without knowledge of the results of the index test?
12. Were the same clinical data available when test results were interpreted as would be available when the test is used in practice?
13. Were uninterpretable/intermediate test results reported?
14. Were withdrawals from the study explained?
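
For diagnostic accuracy studies of the kind QUADAS is used to appraise, sensitivity and specificity are commonly combined into likelihood ratios, and a likelihood ratio converts a pretest probability into a posttest probability through the odds form of Bayes' theorem, which is the calculation a Fagan nomogram performs graphically. The Python sketch below illustrates the arithmetic; the function names and the sensitivity, specificity, and pretest values are hypothetical and are not taken from any study cited in this chapter.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) computed from sensitivity and specificity."""
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


def posttest_probability(pretest_probability, likelihood_ratio):
    """Convert probability to odds, multiply by the LR, convert back."""
    pretest_odds = pretest_probability / (1.0 - pretest_probability)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1.0 + posttest_odds)


# Hypothetical test characteristics and pretest probability.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.90, specificity=0.80)
after_positive = posttest_probability(0.20, lr_pos)
after_negative = posttest_probability(0.20, lr_neg)

print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}")
print(f"posttest probability after a positive result: {after_positive:.3f}")
print(f"posttest probability after a negative result: {after_negative:.3f}")
```

With these invented inputs, a positive result raises the probability of disease from 20% to roughly 53%, and a negative result lowers it to about 3%.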
study also will define threshold values for their diagnostic test using receiver operator characteristic (ROC) curves, which are plots of the true positive rate (sensitivity) versus the false positive rate (1 − specificity) (Fig. 9.5). The area under the curve reflects the relationship between sensitivity and specificity for a given test. As a curve asymptotically approaches the upper left-hand corner, the area under the curve approaches 1 (100% sensitivity and specificity). A random guess would generate a point along the diagonal bisecting the graph, also called the "line of no discrimination." Points above the diagonal represent better results (greater diagnostic accuracy), while points below the line are poor (lower diagnostic accuracy). (For further discussion of the use of ROC curves for determination of diagnostic accuracy, the reader is referred to Chap. 11.)
Once thresholds for a positive and negative diagnostic test are defined by ROC curves, then an evidence-based operating characteristic of the test can be defined by its likelihood ratios (LR). The LR is the probability that a given test result would be expected in a patient with the target disorder divided by the probability (P) that that same result would be expected in a patient without the target disorder. LRs can be calculated both for positive (LR+) and negative (LR−) test results, as shown below.

$$LR+ = \frac{\text{sensitivity}}{1 - \text{specificity}} = \frac{P(\text{Test}+ \mid \text{Disease}+)}{P(\text{Test}+ \mid \text{Disease}-)}$$

$$LR- = \frac{1 - \text{sensitivity}}{\text{specificity}} = \frac{P(\text{Test}- \mid \text{Disease}+)}{P(\text{Test}- \mid \text{Disease}-)}$$

High LR+ values (LR+ > 10) significantly increase the probability of disease and low LR− values (LR− < 0.1) significantly decrease the probability of disease. The extent to which the result of a diagnostic test changes the probability that the patient has a disease (posttest probability) can be estimated using a graphical tool known as the Fagan nomogram [16] by
using a straight edge to draw a line from the pretest probability through the calculated LR (Fig. 9.6).

Summarizing the Results: The Role of Meta-analysis

As noted earlier, sometimes the size of an individual clinical trial may be too small to detect a treatment effect or to estimate its magnitude reliably. Meta-analysis is a method to increase the power of statistical analyses and precision of estimates by pooling the results of related trials (i.e., those that address a similar hypothesis) to obtain a quantified synthesis. Not all systematic reviews lead to a meta-analysis. The trials may be so varied in their methodology, end points, or results that combining them may not be appropriate.
In a conventional meta-analysis (sometimes known as aggregate-level meta-analysis), a summary statistic (e.g., a risk ratio, a difference between outcome means) for the observed effect is abstracted or recalculated from each included study. (A less common approach, not reviewed in this chapter, combines original or patient-level data from prior studies; for an excellent discussion of the pros and cons of this method, known as Individual Patient Data [IPD] meta-analysis, the reader is referred to Stewart and Tierney (2002) [17].) Next, a pooled effect estimate is calculated as a weighted average by sample size of the intervention effects reported in the individual studies. By pooling results, the standard error of the weighted average effect size of the included studies and its associated confidence interval are reduced, typically affording greater statistical power to detect an effect than would be possible from any one constituent study. Reduction of the confidence intervals also increases precision of the estimated population effect size [18]. In assigning weights for generating the pooled
Fig. 9.7 The forest plot (Reproduced with permission from: Brandler et al. [20])
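
The pooled estimate described above, a weighted average of the study effects with sample size as the weight, can be illustrated with a minimal Python sketch. The three effect estimates and sample sizes are invented for illustration, and `pooled_effect` is a hypothetical helper name. Inverse-variance weighting, which is common in practice, is a different scheme and is not shown.

```python
def pooled_effect(effects, sample_sizes):
    """Sample-size-weighted average of per-study effect estimates."""
    if len(effects) != len(sample_sizes) or not effects:
        raise ValueError("need one positive sample size per effect estimate")
    if any(n <= 0 for n in sample_sizes):
        raise ValueError("sample sizes must be positive")
    total_n = sum(sample_sizes)
    return sum(e * n for e, n in zip(effects, sample_sizes)) / total_n


# Invented absolute risk reductions from three imaginary trials.
effects = [0.02, 0.05, 0.03]
sample_sizes = [100, 300, 600]

pooled = pooled_effect(effects, sample_sizes)
print(f"pooled effect = {pooled:.3f}")
```

Because the largest trial contributes 60% of the total weight, the pooled value sits closest to that trial's estimate.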
or outcomes measured. Methodological variability occurs when there are differences in study design. Not surprisingly, clinical or methodological differences will cause variations in the effect measured. Heterogeneity refers to this difference in effect size (or direction) between studies. Of course, like all statistical tests, the heterogeneity of the effect size in pooled studies may occur by chance.
Assessment of clinical and methodological heterogeneity includes both qualitative and quantitative elements. One begins by comparing the study populations. Are the studies similar in age, sex, or even type of disease? If not, is it appropriate to pool them together? Are the interventions the same? Some studies may include co-interventions which may be a source of confounding. Studies also may exhibit variability in terms of the timing of the intervention; thus, imposition of an intervention at different stages during the disease process may cause differences in degree of efficacy. For example, a study on the impact of oncologic surgery would likely exhibit differences in efficacy if conducted early after cancer detection as opposed to after metastases had developed. The question of timing overlaps the issue of population differences as patients may be sicker at one stage of the disease than another. This can magnify the effects or negate them. An ill population may exaggerate the beneficial effects of an intervention or may be too far along the disease process to show any efficacy. Sometimes, the interventions themselves may be dissimilar. For example, a review of antibiotics in sepsis may include studies that used different classes of antibiotics. Dosing size may have an impact on heterogeneity as well. The effects, beneficial or harmful, may increase with increased dose and with the duration or frequency of the intervention.
Clearly, outcome measures also must be similar to permit appropriate comparison. Thus, 6-month mortality after cardiac intervention in one study should not be compared to left ventricular ejection fraction at 6 months in another. Length of follow-up of a trial may have an influence on the estimate of treatment effect. Like applying the intervention at disparate times, follow-up at different stages of the disease likely will impact outcomes. This issue should have been resolved during the study selection stage of a review so that studies lacking the desired outcome measure were excluded. One should also be critical of surrogate marker use as an outcome measure, especially when being compared to a direct outcome. Different study methods will have different degrees of bias. Those conducting meta-analyses should consider whether it is appropriate to compare RCTs with blinding and concealment to unblinded cohort studies.
changes the magnitude or direction of the pooled effect size or its statistical significance. This analysis helps to determine whether the pooled result is influenced heavily by a particular trial. Other permutations include using only blinded, higher quality trials (or excluding lower quality trials) or performing the analysis under fixed and random effects assumptions. If the results are consistent, the sensitivity analysis provides stronger evidence of an effect and of generalizability.

Pooling Results for Meta-analysis: Considerations

Heterogeneity (whether defined graphically or statistically) should be considered alongside a qualitative assessment of the combinability of studies. When significant methodological differences and heterogeneity are detected, a meta-analysis probably should not be performed as it may be misleading. Under these circumstances, the systematic review should report the results descriptively using text and tables and not pool the data. However, if effect sizes are similar despite variability of clinical and methodological differences, the results probably are robust and generalizable. A cost-free program for producing the tables and graphs and performing the statistics for a meta-analysis is available from the Cochrane group, RevMan 5 (Review Manager, Version 5.0, The Cochrane Collaboration, Copenhagen, Denmark).

Detecting Publication Bias

The literature tends to be biased toward positive findings, a phenomenon known as "publication bias" [22]. Studies with large sample sizes have a greater probability of achieving statistical significance and, therefore, achieving publication. This holds true for studies demonstrating large treatment effects as well, even if the sample size is small. Indeed, many smaller or negative trials are never published. Publication bias produces a positive relationship between sample or effect size and publication, with potentially adverse consequences (i.e., type I error or inappropriate rejection of the null hypothesis in favor of the alternative hypothesis, further discussed in Chaps. 10, 11). Fortunately, a variety of graphical and statistical methods are available to help detect it. The most widely used of these are described below:
Funnel plots. The funnel plot [23] is a graphic display of the sample size or precision (1/standard error) on the y-axis versus the effect estimate (x-axis) used to detect publication bias. Ideally, the results from small studies will scatter widely at the bottom of the graph forming the base of the triangle or funnel because they have less precision, with the spread narrowing around the summary effects line at the apex for larger studies. This pattern occurs when publication bias is absent or unlikely. Asymmetry indicates systematic differences, errors of measurement, or publication bias; as noted, small studies with positive results are more likely to be published, whereas negative studies of similar size are not and, therefore, not found during execution of the search strategy. The absence of these balancing studies is made visually obvious in the asymmetry of the plot (Fig. 9.9). Although funnel plots usually are employed to test for publication bias, there are other causes of asymmetry such as systematic differences and errors of measurement. When found, the causes of the asymmetry should be investigated and explained to justify the continued grouping of these studies for meta-analysis.
Fail-safe N. The inability to locate every unpublished study about a subject might be unnerving to authors of a meta-analysis. As a method of compensation for what may be unknown, Rosenthal [24] developed formulae based on the desired level of significance (p value), later named the fail-safe N by Cooper [25]. Orwin [26] adapted the fail-safe N to adjust for small (d = 0.2), medium (d = 0.5), or large (d = 0.8) effect sizes [27]. The formula calculates the number of studies that would be needed to confirm the null hypothesis and, thereby, reverse a conclusion that a significant
relationship exists. The formula for Orwin's fail-safe N [26] is given below:

$$N_{fs} = \frac{N\,(d - d_c)}{d_c}$$

where N = the number of studies in the meta-analysis, d = the average effect size for the studies synthesized, and d_c = the criterion value selected that d would equal when some knowable number of hypothetical studies (N_fs) were added to the meta-analysis. If the fail-safe N is sufficiently high, it may provide reassurance that a few missing studies would not alter the conclusion.

quality of reporting of meta-analyses of clinical randomized controlled trials. Since that time, many additions, updates, and expansions of this statement for broader applicability have led to the development of the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-analyses) statement, which provides guidelines designed to reduce the risk of flawed reporting of systematic reviews and improve the clarity and transparency in how reviews are conducted [31]. Included are a 27-item checklist (Table 9.4) and 4-phase flowchart (Fig. 9.10) [32].
Though not part of current checklists, conflicts of interest such as financial funding of individual trials should be reported in the systematic review or meta-analysis.
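
Orwin's fail-safe N, given earlier on this page, can be evaluated directly. The short Python sketch below does so for invented inputs; the function name `orwin_fail_safe_n` and the example values (10 studies, mean effect d = 0.5, criterion d_c = 0.2) are assumptions made for illustration only.

```python
def orwin_fail_safe_n(n_studies, mean_effect, criterion_effect):
    """Orwin's fail-safe N: N * (d - d_c) / d_c."""
    if criterion_effect <= 0:
        raise ValueError("criterion effect size must be positive")
    return n_studies * (mean_effect - criterion_effect) / criterion_effect


# Invented meta-analysis: 10 studies with a medium average effect (d = 0.5),
# asking how many null studies would pull the mean down to a small effect
# (d_c = 0.2).
n_fs = orwin_fail_safe_n(n_studies=10, mean_effect=0.5, criterion_effect=0.2)
print(f"fail-safe N is approximately {n_fs:.0f}")
```

Under these assumptions about 15 additional studies averaging a null effect would be needed, which can be checked by noting that 10 × 0.5 spread over 25 studies gives a mean of 0.2.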
Fig. 9.10 PRISMA four-phase ow diagram (Reproduced with permission from Moher et al. [32])
nonadherence to proper searching strategies, lack of statistical rigor, and introduction of bias (intentional or unintentional) in which studies were "cherry picked" to suit the personal agenda of the reviewer/analyst. Unfortunately, not all of the limitations can be minimized by strict methodology. A fundamental limitation of meta-analysis, specifically, is that it is comprised of studies performed under different protocols and at different times; for purposes of the analysis, it is assumed that the differences in protocol and study design of the elements are obviated by the large number of observations ultimately available. This assumption is highly questionable. As noted above, if clinical and methodological diversity across studies is such that substantial heterogeneity is determined, it may be better not to combine them in a meta-analysis (if a meta-analysis is performed under these circumstances, as noted earlier, it should be considered for hypothesis generation only). In addition, the increased power gained by pooling the results of individual studies that is advantageous for decreasing type II errors also may allow small biases to be interpreted erroneously as an effect, increasing type I errors. (Again, see Chaps. 10 and 11 for further elaboration of these fundamental concepts.) On occasion, the same dataset may be published multiple times, making the results not independent. If this is not recognized, the dataset will be weighted more than once in the analysis, artificially inflating the results. Finally, the results and conclusions of a systematic review or meta-analysis are only as reliable as the methods used in each of the primary studies. The methodology used for their qualitative or quantitative synthesis does not compensate for flaws or errors in the individual primary studies.
Take-Home Points
For clinicians to make informed decisions for patient management and research, they must analyze multiple studies for quality and relevance to the population of interest.
Secondary sources of information (especially systematic reviews and meta-analyses) help to summarize and reconcile conflicting studies in the literature.
By explicitly stating how evidence was found, selected, and evaluated, systematic reviews eliminate many of the biases inherent in narrative reviews.
Meta-analysis uses statistical methodology to combine results of several related studies, which affords greater statistical power versus that of individual studies.
Though retrievable via traditional online literature search engines, a variety of databases are available that specialize in systematic reviews and meta-analyses.
To construct a quality systematic review, one should formulate a clear question, define a comprehensive yet efficient literature searching strategy, include all appropriate studies, summarize results, assess heterogeneity, and consider appropriateness of pooling results of individual studies for meta-analysis.
Caution must be exercised when conducting/interpreting a systematic review or meta-analysis to ensure inclusiveness of literature searching, optimization of statistical rigor, minimization of bias, and avoidance of inclusion of multiple publications of the same dataset.
The results and conclusions of a systematic review or meta-analysis are only as reliable as the methods used in each of the primary studies; their synthesis does not compensate for errors of methodology in the individual primary studies.
Meta-analyses, constructed as they are of multiple nonidentical studies, must be viewed as a hypothesis-generating rather than a hypothesis-testing tool, especially if major methodological differences or heterogeneity among their components are detected.
Richard C. Zink
P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology: A Guide for Clinical Investigators, 197
DOI 10.1007/978-1-4614-3360-6_10, Phyllis G. Supino and Jeffrey S. Borer 2012
198 R.C. Zink
determine the levels of household and village clustering [7]. The goal of the research was to
identify the risk factors and understand the patterns of disease transmission. Careful study of
factors at the household and village level ultimately could lead to an optimal strategy for
intervention. Within each survey, villages were randomly selected for inclusion into the study,
and all households within each village that had at least one child within the appropriate age
range were included.

Populations and Samples

As noted in Chap. 2, all research begins with a question. For example, in developing a new
antiviral for chronic hepatitis B infection, we could ask whether entecavir is more efficacious
than lamivudine. Here, we might assume that our population of interest is all individuals with
chronic HBV infection. However, most clinical trial protocols are written with a number of
inclusion or exclusion criteria in order for subjects to participate. For example, subjects
generally need to be of a certain age with a well-defined and specific disease severity, and we
may wish to focus on a particular subtype of the disease, such as those positive or negative for
HBeAg [5, 6]. Further, subjects with other coexisting diseases or medications that may interfere
or complicate interpretation of the results, or that could pose an unreasonable safety risk,
would be excluded from participation in the trial. Therefore, it is more appropriate to define
our population as all individuals with chronic hepatitis B infection meeting the inclusion and
exclusion criteria of the study. We can refer to the larger population and those eligible for the
study as the "population with the condition" and the "study population," respectively [8].

For reasons of time and money, it is generally impractical to consider the entire study
population to address the research hypothesis. Money is an obvious limitation, but time can be an
important factor as well, as the disease under study may naturally change over time. For example,
antiviral treatment can lead to mutations that enable resistance to therapy [2, 3]. Minimizing
heterogeneity of the disease is important for careful study and can be accomplished using the
study inclusion/exclusion criteria and designing studies of appropriate sample size and duration.
Due to the above limitations, a sample of individuals is selected from the study population. Data
are collected from this sample; summary statistics, confidence intervals, and statistical tests
are computed, and conclusions are generated. Inferences about the study population are made from
the sample findings, and the quality of this inference is related to how representative the
sample is to the study population.

Suppose, for example, that there was interest in estimating the average viral load for subjects
chronically infected with HBV meeting study entry criteria (our study population). Typically,
viral load is measured on the log10 scale since values often are skewed to the right (i.e., there
is a long tail of large viral load counts). The log10 transformation is applied to make the viral
loads appear more normally distributed. For the study population, the average log10(viral load)
is denoted by $\mu$, and the spread of log10(viral load) from this average value is represented
by $\sigma$, so that roughly 95% of the values are within $\mu \pm 2\sigma$. The unknown
parameters $\mu$ and $\sigma$ are referred to as the mean and standard deviation of log10(viral
load) in the study population, and if normally distributed, we can describe the distribution of
log10(viral load) values as $N(\mu, \sigma^2)$.

If we select a sample of size n from the study population, we can use the sample mean

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

and sample variance

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

as estimates for $\mu$ and $\sigma^2$, respectively. The sample mean $\bar{x}$ will be
distributed $N(\mu, \sigma^2/n)$, and we can use this fact to compute confidence intervals and
hypothesis tests to generate inference for the population mean $\mu$. Figure 10.1 plots several
normal distributions for the sample mean of log10(viral load) for varying sample sizes with
$\mu = 9.6$ and $\sigma = 2$ (similar to summary statistics from Lai et al. [6]). Note that as
the sample size n increases, the
10 Sampling Methodology: Implications for Drawing Conclusions from Clinical Research Findings 199
with no mention of this in the protocol, the results would be an overestimate of the average
viral load for the study population.

Despite the benefits of randomness, random sampling provides no guarantee of correct inference
from the sample to the study population. It is entirely possible to generate a sample from the
study population consisting of extreme values that are not a reflection of the typical response.
As noted in Chap. 11, two types of errors in inference can occur in computing a confidence
interval or performing a hypothesis test on a sample of data. In testing the null hypothesis of
no difference in mean log10(viral load) between the HBV study treatments, $H_0: \mu_E = \mu_L$,
versus the alternative hypothesis that a treatment difference exists, $H_A: \mu_E \neq \mu_L$, a
type I error occurs if we reject the null hypothesis based on the sample data when
$\mu_E = \mu_L$ is true for the population. A type II error occurs when we fail to reject the
null hypothesis when the null hypothesis is false. In the context of our clinical trial example,
a type I error could lead one to conclude that entecavir had better efficacy than lamivudine,
when in actuality there is no difference between the two antivirals. A type II error would have
the sponsor conclude that the two antivirals have similar efficacy, when entecavir is the more
potent drug.

Sample Size

As further discussed in Chap. 11, the probability of making a type I or type II error is referred
to as $\alpha$ and $\beta$, respectively. Typically, the sample size for a clinical trial is
chosen to minimize the probability of these errors occurring, subject to available resources.
Appropriate values for $\alpha$ and $\beta$ depend on the scientific question at hand, but
typical practice in clinical trials has $\alpha$ fixed at 0.05, with $\beta$ chosen between 0.1
and 0.2.

Alternatively, we could choose sample size to maximize the probability $1 - \beta$, which is
called power. Power is the probability of rejecting the null hypothesis, given the null
hypothesis is false, and powering a study means allocating sufficient sample size to have a high
likelihood of rejecting the null hypothesis in favor of the specified alternative. Formulae exist
for many types of sample size calculations, and though we do not present any here, entire books
have been devoted to the subject [9].

Maximizing power is one way of choosing the size of a sample, but it is by no means the only
method. Sample size can be chosen to achieve a certain level of precision in the parameter
estimates. This particular type of sample size calculation is often used in oversampling, when we
purposefully select a higher proportion of a particular kind of subject in the sample than exists
in the population. For example, the two phase III trials discussed in this chapter are
predominately male (approximately 75%). Suppose these gender rates are reflective of the true
population of subjects in the study population. If we wanted to estimate a treatment effect
between these two antivirals with a particular precision for females, we could include a higher
proportion of women in our study sample. In this scenario, our overall treatment effect could be
biased if gender has an important impact on the characteristics of the disease. To obtain an
unbiased estimate, we could employ survey weights to downplay the contribution of females to
obtain overall estimates for the various endpoints that are reflective of the study population.

A final comment on sample size is worth mentioning in the conduct of clinical trials. While it is
important to have sufficient sample size to have a representative sample and achieve high levels
of power for testing the null hypothesis, the trial designer should realize that every subject
enrolled in the trial potentially is exposed to some unknown safety risk attributable to the
medications under investigation. Therefore, it is of paramount importance that the trial
designers study enough subjects to achieve their goals, without exposing unnecessary additional
individuals to an experimental treatment with an unknown or limited safety profile.

Probability Sampling

As alluded to above, probability sampling identifies the individuals within the study population
and assigns every subject a chance of being selected into the sample. The easiest method of
selecting a sample assigns every individual the
same chance of being selected into the study. This method of sampling subjects is referred to as
simple random sampling, and within clinical research, it generally is performed without
replacement. Without-replacement sampling implies that once a particular subject is selected for
inclusion into the study, the subject is not returned to the pool for further sampling. The
practical implication of this approach is that each subject is counted exactly once within a
single clinical investigation. In contrast, with-replacement sampling returns the sampled
observation to the study population for further sampling.

While simple random sampling is straightforward to apply, it does have some disadvantages. First,
sampling from particularly large populations can be cumbersome since it requires enumeration of
all possible subjects to define the sampling frame. Such data may not exist or could be expensive
to generate. Second, while we expect the average sample to be representative of the population,
it is possible to generate a sample where important characteristics related to the study outcome
are under- or overrepresented by random chance. These deficiencies can be addressed using methods
described below.

In stratified random sampling, mutually exclusive subcategories (strata) of the study population
are defined prior to sampling. Then, within each stratum, a separate random sample is selected.
By defining the sampling scheme in this manner, it is possible to maintain the appropriate
proportions of important disease characteristics within the study sample. Suppose, in lieu of the
two separate clinical trials for HBeAg-positive and HBeAg-negative subjects described above,
sufficient funding was available for only a single study to obtain an overall estimate of
log10(viral load) for the study population of HBV subjects. Further, suppose that HBeAg-negative
disease accounts for roughly one-third of all HBV infection [6]. If we applied simple random
sampling to select subjects from the study population, we could by random chance obtain a sample
where the proportion of HBeAg-negative subjects differs substantially from 33%. Since the
log10(viral load) of HBeAg-negative subjects tends to be lower than HBeAg-positive subjects [6],
the overall estimate of viral load would be biased for the study population, and the magnitude of
this bias would depend on how far the proportion of HBeAg-negative subjects in the sample is from
one-third of the total sample. However, employing a stratified random sampling scheme, we can
select separate samples from each stratum such that 67% and 33% of the total sample size come
from HBeAg-positive and HBeAg-negative subjects, respectively. Thus, we maintain the appropriate
proportions of this important disease characteristic within our sample.

Stratified random sampling has a number of additional benefits. First, stratified sampling can
lead to more efficient statistical testing through a reduction in the variability of the sample
estimates. Second, distinct methods for sampling can be employed within each of the strata. For
example, individuals located in more populated areas may be sampled at the individual level,
while subjects in more remote areas might be sampled as part of a cluster (described below) [10].
While the aforementioned example could have financial benefits in sampling distant individuals,
it actually may be necessary due to the nature of the information available to define the
sampling frame within each stratum.

Though stratified random sampling is advantageous, there are a number of difficulties associated
with its use. First, it is possible to stratify only for characteristics known to influence the
disease in question, and the ability to identify these characteristics quickly and easily is
important for generating the sample. Second, if there are multiple endpoints under investigation,
it may be difficult to select strata that are beneficial for every endpoint. Stratification can
result in efficient statistical testing when the strata are correlated with the outcome of
interest (such as HBeAg status and viral load). However, strata that do not have this property
may contribute to additional complexity and cost in the study design.

If it is possible to order a sampling frame, a systematic random sample can be generated by
selecting every kth value in the list after randomly selecting a starting observation. Sampling
proceeds in this manner until the required sample size is obtained. One benefit of systematic
random sampling is that it can naturally account for the presence of strata, by sorting the frame
by the stratification variables. However, an important drawback of systematic sampling can occur
if the
sampling frame has periodicity present. For example, suppose we attempt to replace our two phase
III trials of HBeAg-positive and HBeAg-negative subjects with a single study. Further, suppose
that the sampling frame is ordered such that positive and negative subjects alternate within the
list. Choosing an even value for k would result in a sample that was either entirely positive or
entirely negative in terms of HBeAg status. Though this is an extreme example, it illustrates the
importance of understanding how the sampling frame is ordered prior to sampling.

The methods described above assume that a sampling frame exists for the selection of individual
subjects. However, it is often difficult or expensive to generate such lists, or such information
may not readily be available. One alternative to selecting at the subject level is to randomly
select groups or clusters of observations for study (cluster random sampling). For example, as
described in a study of diarrheal disease in Africa and Asia [7], villages were randomly selected
for inclusion into the samples of four separate population surveys. Once a village was selected,
all households within the village that met the study criteria were included in the sample. A
benefit of sampling clusters of observations is that it can simplify the data collection process.
For example, in a situation where hundreds of villages may exist, randomly selecting villages
reduces the number of villages to which it may be necessary to travel. A simple random sample of
subjects across all villages may require traveling to a majority of the villages to collect the
necessary information. However, one downside of cluster sampling is that it typically requires a
larger sample size than a simple random sample to obtain the same power or precision of sample
estimates. This is because individuals within clusters tend to be more alike than individuals
across clusters, and this often leads to an increase in the variability of the estimated
parameters. In the example above, the reduced travel costs may more than make up for the
additional subjects needed for study.

By employing the selection of clusters of observations, it is possible to refine the above design
for diarrheal disease into a multistage sampling design. Suppose that many of the villages or
townships were large and that sufficient information was available to describe each household
within each village. In the first stage, we could randomly select villages. In the second stage,
we could select a random sample of households from within each sampled village and include every
individual within the chosen households meeting study criteria. Another option would be to apply
a simple random sample of individuals within each of the randomly selected villages, but this
approach would rely on each village having a list of all its citizens. To further complicate the
design, stratification could be applied to allow for different sampling schemes within each
stratum (the four countries described in the manuscript could be considered strata). Ultimately,
there is no one-size-fits-all solution to define an appropriate sampling scheme. Based on the
available information, study design is a careful balance of costs, statistical efficiency, and
operational complexity.

An important note about selecting clusters: Cluster sizes may vary greatly, and as noted above,
it is quite reasonable to expect that individuals are more similar within the cluster than
between clusters. Because of this, it may be more appropriate to select clusters with probability
proportional to the size of the cluster. For example, suppose that our population consisted of
five villages with 100, 150, 200, 250, and 300 inhabitants and that we select one village to
generate an estimate for the subjects of all villages. If we sample the smaller village of 100
inhabitants, the estimate of our endpoint may not fully reflect the individuals within the larger
villages. Rather than give each village an equal likelihood of being in the sample (in this case,
1/5 or 20%), we can define the selection probability for each cluster as equal to the total size
of the village divided by the total population of all villages combined. For our example, the
villages would be selected with probabilities 100/1,000, 150/1,000, 200/1,000, 250/1,000, and
300/1,000 or 10%, 15%, 20%, 25%, and 30%, respectively. By sampling in this manner, we give
larger clusters a greater chance of being selected into the sample, though this choice also
increases the expected sample size of the study.
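
The selection probabilities in the five-village example can be reproduced with a short sketch. The village sizes come from the text; the function names, the random seed, and the number of simulated draws are illustrative assumptions, not part of the chapter.

```python
import random

# Village sizes from the example in the text (1,000 inhabitants in total).
village_sizes = [100, 150, 200, 250, 300]


def pps_probabilities(sizes):
    """Selection probability of each cluster: its size divided by the total size."""
    total = sum(sizes)
    return [size / total for size in sizes]


def draw_one_cluster(sizes, rng):
    """Draw the index of one cluster with probability proportional to its size."""
    return rng.choices(range(len(sizes)), weights=sizes, k=1)[0]


probabilities = pps_probabilities(village_sizes)

# Expected number of subjects when a single village is selected:
# equal selection probabilities versus probability proportional to size.
expected_size_equal = sum(village_sizes) / len(village_sizes)
expected_size_pps = sum(p * size for p, size in zip(probabilities, village_sizes))

# Simulated check; the seed is arbitrary and only makes the sketch reproducible.
rng = random.Random(2012)
draws = [draw_one_cluster(village_sizes, rng) for _ in range(10_000)]
observed = [draws.count(i) / len(draws) for i in range(len(village_sizes))]

print("selection probabilities:", probabilities)
print("expected sample size (equal probabilities):", expected_size_equal)
print("expected sample size (proportional to size):", expected_size_pps)
print("simulated selection frequencies:", observed)
```

Under these assumptions, comparing `expected_size_equal` with `expected_size_pps` illustrates the closing remark above: favoring larger villages raises the expected number of subjects in the sample.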
trial (e.g., which tests to perform, what endpoints to measure, and the appropriate disease
characteristics for inclusion and exclusion criteria) but also those influential persons whose
participation may entice other physicians to become involved (and eventually write
prescriptions).

The clinician may have a number of patients who regularly attend his or her practice for disease
management who could be included in the study, and in the course of his or her day-to-day job,
s(he) may gauge these individuals' interest in participating in a trial for a new medical
therapy. Additionally, the clinician may choose to advertise the clinical trial to attract
additional patients from medical practices not participating in the trial, or those individuals
who may not, for one reason or another, have routine exams with their doctor. Should the patients
meet eligibility criteria and consent to study procedures, they would be randomized to one of the
available treatment arms of the study.

However, as the statistician Senn points out, clinical trials are concerned with the comparative
inference of the drugs under study among the subjects under study; rarely are they concerned with
being representative of the study population [11]. In other words, the primary goal of the trial
is to illustrate the effectiveness of a new medication against concurrent and comparable
controls. Subjects are randomized to treatments to minimize bias in measuring the treatment
effect since on average over all randomizations, the treatment groups would be considered equal
at baseline. This is not to say that representative samples are never used within clinical
research, but how a researcher samples an individual from a population should be tied to the
ultimate goals of the study. This raises important questions: How can one safely apply the
results of a clinical investigation to other subjects? To what study population do the subjects
ultimately belong?

Perhaps the most straightforward way of identifying the individuals to whom these results may
apply is to review the table of summary statistics for baseline demographic and disease
characteristics and the eligibility criteria from the study manuscript or drug label.

Subjects who are quite characteristically different than described should have the study results
applied cautiously. Additionally, in reviewing the study materials, consider these additional
questions: Are important geographical considerations overlooked? How are subjects who did not
consent to study procedures different from those who did? Are subjects who do not seek routine
care sicker than those who do? What important disease features differ in subjects who cannot stop
taking medications that are prohibited within the study? How are subjects who are not local to
participating clinicians different than those who are? How might subjects who took the drug at
the time of the trial be different from those who will take it when it becomes approved? These
questions may never be answered satisfactorily, and ultimately a leap of faith may be needed to
apply sample findings to the population with the condition [8].

Conclusions

We are often in such a hurry to collect and analyze data that we neglect the importance of
careful study design, and how we select individuals for study is a critical feature. Through many
of the examples described above, we have learned that whom we select, where and how we select
them, and even at what time they are selected may have serious implications on the study findings
and how they may be interpreted or applied. This is true for samples of convenience as well as
for any complex probabilistic survey sample. Without knowing the characteristics of the subjects
under study and how they were chosen, the researcher has an incomplete grasp of the conclusions
of the study. In fact, there may be a number of shortcomings that become obvious only once the
sampling scheme is understood. Finally, it is important to remember that not all studies are
designed to comprehensively reflect the population with the condition. Particularly for clinical
researchers and physicians prescribing new medications, it is important to first understand the
subjects under investigation and how they may differ from other populations available for
treatment. Understanding these key points ensures a more successful application of new knowledge.
Take-Home Points
Generating a random sample from a population is important to minimize the bias of sample
estimates in describing the population parameters.
Despite the benefits of random sampling for generating appropriate inference, clinical
research often relies instead on samples of convenience.
Baseline characteristics and study inclusion and exclusion criteria can help identify the
study population from which the sample was drawn.
It is important to understand the factors that differ between the study sample and the larger
population and the potential impact these differences may have on the conclusions of the
study and how appropriate it is to apply study results to the larger population.
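
As a recap of two probability sampling schemes discussed in this chapter, the following minimal sketch contrasts simple random sampling with proportionally allocated stratified sampling, and shows the periodicity problem that can affect systematic sampling. The frame size of 3,000 subjects, the sample size of 300, the seed, and all function names are hypothetical choices made only for illustration; the one-third share of HBeAg-negative subjects follows the chapter's example.

```python
import random


def proportional_stratified_sample(frame, n, rng):
    """Draw a separate simple random sample (without replacement) from each stratum,
    sized in proportion to that stratum's share of the sampling frame."""
    sample = []
    for label in sorted(set(frame)):
        stratum = [status for status in frame if status == label]
        # Rounding can shift the total slightly when the shares are not exact.
        n_stratum = round(n * len(stratum) / len(frame))
        sample.extend(rng.sample(stratum, n_stratum))
    return sample


def systematic_sample(frame, k, start):
    """Take every k-th entry of an ordered frame, beginning at index `start`."""
    return frame[start::k]


rng = random.Random(10)  # arbitrary seed, for reproducibility only

# Hypothetical sampling frame: one HBeAg status per eligible subject,
# with one-third of subjects HBeAg-negative.
frame = ["negative"] * 1000 + ["positive"] * 2000

# Simple random sampling: the share of negative subjects varies by chance.
srs = rng.sample(frame, 300)
srs_negative_share = srs.count("negative") / len(srs)

# Stratified sampling: the share of negative subjects is held at one-third.
stratified = proportional_stratified_sample(frame, 300, rng)
stratified_negative_share = stratified.count("negative") / len(stratified)

# Periodicity: if positive and negative subjects alternate in the frame,
# an even k returns a sample drawn from a single group.
alternating = ["positive", "negative"] * 150
start = rng.randrange(2)  # random starting observation among the first k = 2
periodic = systematic_sample(alternating, k=2, start=start)

print("negative share, simple random sample:", srs_negative_share)
print("negative share, stratified sample:", stratified_negative_share)
print("statuses present in the systematic sample:", set(periodic))
```

Re-running the sketch with different seeds changes `srs_negative_share` but not `stratified_negative_share`, which is the property that motivates stratification when a characteristic such as HBeAg status is related to the outcome.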
P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology: A Guide for Clinical Investigators, 207
DOI 10.1007/978-1-4614-3360-6_11, Phyllis G. Supino and Jeffrey S. Borer 2012
208 T.A. Durham et al.
observations (e.g., age values of participants in a medical study), such as the relative
frequency of each value, the typical value in the group, and the extent to which individual
observations vary from subject to subject. When statistical techniques are used simply to
summarize random variables from the sample, the results obtained from them are said to be
descriptive statistics.

The number of observations which comprise a sample is referred to as the sample size, denoted by
the symbol n. One way to describe a sample of observations with respect to a quantitative random
variable is to display the frequency of values of the random variable graphically. One type of
graphical display is the frequency histogram. A histogram is constructed by:
• Defining 3 to 10 mutually exclusive (nonoverlapping) categories of equal width for the variable
  of interest.
• Tabulating the number of observations that fall into each category.
• Calculating the relative frequency of observations in each category as the count of
  observations in each category divided by the sample size.
• Displaying a bar for each category contiguously on the x-axis with a bar height equal to the
  relative frequency of each category on the y-axis. Bars are typically centered about the
  midpoint of the interval.

A sample histogram of 100 ages from a clinical trial is provided in Fig. 11.1. By examining a
frequency histogram, one can see which values of the variable are more or less common. A
graphical representation or mathematical expression of the relative frequency of values of a
random variable is referred to as a distribution. For the construction of a histogram, the width
of each category should be the same for all categories. However, it is important to note that the
number of categories used can affect the shape of the distribution. Care should be taken so that
valuable information is not lost through the use of too few categories. If the overall sample
size is large, more than 10 categories may be used.

Histograms are useful since they enable one to inspect the shape of the distribution.
Distributions which have more values in the middle and fewer values on the extremes are said to
be unimodal, and they are symmetric when the extremes have similar representation. Distributions
which have more values at one extreme than at the middle or the opposite extreme are said to be
asymmetric or skewed. As evident in Fig. 11.1, the most common age values are in the category of
60 to 69 (midpoint of 64.5). There were very few
11 Introductory Statistics in Medical Research 209
observations with ages less than 50 or more than 79. For Fig. 11.1, the shape of the distribution
of age values in the sample is reasonably symmetric.

Identification of a typical value or a measure of central tendency from the sample is frequently
of interest. There are a number of measures of central tendency, some of which include the
arithmetic mean, the median, and the mode. The arithmetic mean, typically referred to simply as
the mean, is calculated as the sum of the individual values (indexed by the subscript i below)
divided by the sample size:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

The mean is always defined for a sample of numeric values. If the relative frequency of values or
the distribution is at least somewhat symmetric, the sample mean is a reasonable choice as a
measure of central tendency. One disadvantage of the mean is sensitivity to extreme values. For
example, heart rate values of 60, 61, 63, 58, and 98 beats per minute have a mean of 68, which
poorly represents a typical value.

When there are a few extreme values or a skewed distribution, the median can be a more
appropriate measure of central tendency. The median is the middle value after all observations
have been sorted from lowest to highest. If the sample size is odd, the median is the
((n + 1)/2)th value after sorting (e.g., the third largest value from a sample of 5). If the
sample size is even, the median is calculated as the mean of the two middle values, the (n/2)th
value and the ((n/2) + 1)th value. For example, if there are 20 observations in a sample, the
median is calculated as the mean of the 10th and 11th values. The median is always defined for a
sample of numeric values.

Another measure of central tendency is the mode, defined as the most common value. The mode is 1
among the following rating scores: 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, and 3. If all values of the
random variable are unique, the mode is not defined. However, if there are multiple values which
are equally common, there is not one mode, but multiple modes. The modes are 43 and 81 for the
following ages of participants in a clinical trial: 37, 39, 43, 43, 56, 57, 63, 81, 81, and 85.

For quantitative variables, the mean is the preferred measure of central tendency if the
distribution is relatively symmetric. If the distribution is asymmetric, the median and mode are
appropriate measures. For qualitative variables (e.g., gender), the mode is the most appropriate
measure of central tendency.

In addition to measures of central tendency, the extent to which values of a characteristic vary
from observation to observation, i.e., the dispersion or variety of values, is also of interest.
If two groups from a study have similar mean numbers of lesions, but one group has more variation
in the number of lesions across subjects, one may suspect that the two groups are different in
some way. There are a number of measures of dispersion, and the appropriate choice among them
depends on what one would like to say about the variation and, to some extent, on the shape of
the distribution (i.e., symmetric vs. skewed). All measures of dispersion are nonnegative, and
dispersion of zero indicates no variation in the random variable from observation to observation.

The simplest measure of dispersion is the range, defined as the difference between the maximum
and minimum values. Quartiles can also be used to describe dispersion. Just as the median
represents the middle value, that is, the value below and above which approximately 50% of the
values lie, the 25th percentile (or first quartile) is the value below which approximately 25% of
the values lie. Similarly, the 75th percentile (or third quartile) is the value below which
approximately 75% of the values lie. The interquartile range, another measure of dispersion, is
defined as the difference between the third and first quartiles. It also represents the
dispersion of values that encompasses the middle 50% of values.

A graphical display which features a number of measures of central tendency and dispersion is a
box plot. There are a number of different types of box plots, but typically box plots are used to
display the values of the mean, median, 25th percentile, and 75th percentile. Extreme values may
also be plotted and often include the
minimum and maximum values. Box plots can be displayed side by side for the comparison of
characteristics of distributions among levels of the experimental factor (e.g., cases and
controls, treatments in a clinical trial, or time points in an observational study). Box plots of
age values for males and females in a cohort study are displayed in Fig. 11.2. In this figure,
the 25th percentile and 75th percentile are represented by the box, the median by the line
bisecting the box, the mean by the crosses, and the minimum and maximum value by the lines
extending from the box.

A measure of dispersion of values about the mean is the variance. The sample variance, denoted by
the symbol $s^2$, is calculated by summing squared differences between each value and the mean
(to obtain a positive value), and dividing the result by n − 1:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

It is difficult to interpret the variance since it is expressed in squared units (e.g., age²). As
a result, the square root of the variance is taken to obtain the standard deviation, which is
often denoted by the symbol s. A small value of the standard deviation indicates that most values
are close to the sample mean. A large value of the standard deviation indicates many values are
far from the sample mean. Other words which are used to convey the concept of the standard
deviation are spread and scale.

The coefficient of variation is a unit-less measure of dispersion, defined as the standard
deviation divided by the mean:

$$CV = \frac{s}{\bar{x}}$$

The coefficient of variation is helpful when used to compare two or more random variables with
regard to their dispersion. It is used as a measure of precision in assay development, but may
also be used to compare dispersion between two unrelated random variables each with different
scales (e.g., dispersion of heart rate vs. systolic blood pressure).

Descriptive analyses such as those described above may be used as part of a preplanned analysis
(e.g., as prescribed in a study protocol or analysis plan) or for exploratory purposes. Results
from exploratory analyses often generate new hypotheses to test in future research. Descriptive
statistical analyses provide insight into the nature of the data, as well as provide a rationale
for the statistical methods used to make inferences about the population from which the sample
arose.

Estimation, Confidence Limits, and Hypothesis Testing

One important goal of statistics is to use data from a sample (e.g., a limited number of
participants in a clinical trial) to draw a conclusion about a larger set of subjects, a
population.
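
Before moving from description to inference, the descriptive measures defined earlier in this chapter can be checked numerically on the chapter's own small examples (the heart rates, rating scores, and ages quoted in the text). The sketch below uses Python's standard `statistics` module; the variable names are illustrative, and the quartiles depend on the interpolation convention used by the software, so they may differ slightly from hand calculations.

```python
import statistics

# Small data sets quoted in the text.
heart_rates = [60, 61, 63, 58, 98]                 # beats per minute
ratings = [0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3]        # rating scores
ages = [37, 39, 43, 43, 56, 57, 63, 81, 81, 85]    # ages of trial participants

# Central tendency.
mean_hr = statistics.mean(heart_rates)       # pulled upward by the extreme value 98
median_hr = statistics.median(heart_rates)   # middle value after sorting
rating_modes = statistics.multimode(ratings)  # a single mode
age_modes = statistics.multimode(ages)        # two equally common values

# Dispersion.
range_hr = max(heart_rates) - min(heart_rates)
variance_hr = statistics.variance(heart_rates)  # sample variance, divisor n - 1
sd_hr = statistics.stdev(heart_rates)           # square root of the sample variance
cv_hr = sd_hr / mean_hr                         # unit-less coefficient of variation

# Quartiles and interquartile range (convention-dependent).
q1, q2, q3 = statistics.quantiles(ages, n=4)
iqr_ages = q3 - q1

print("mean heart rate:", mean_hr)
print("median heart rate:", median_hr)
print("mode(s) of ratings:", rating_modes)
print("mode(s) of ages:", age_modes)
print("range, variance, SD, CV of heart rate:", range_hr, variance_hr, sd_hr, cv_hr)
print("interquartile range of ages:", iqr_ages)
```

The gap between `mean_hr` and `median_hr` reflects the sensitivity of the mean to the single extreme heart rate, which is the reason the median is suggested for skewed data.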
Statistical procedures for which the aim is to make inferences about a relevant population are called inferential statistical methods. A population of interest in a clinical trial may be all patients who will ever be diagnosed with a particular viral infection. A population of interest in a cohort study may be all Americans exposed to a carcinogen in the environment. A population from a case control study may be adults who have and have not been diagnosed with coronary heart disease. Statistical inferences about these populations are necessary since they can be used to justify important policy decisions, such as making a new medical therapy available for use or revising educational material about lifestyle changes that reduce the risk of adverse health events.

However, as noted in Chap. 10, it is not feasible to study every person who may be a member of the population. As a result, research is conducted on a small number of them, a sample, and statistical methods are used to make inferences about the population of interest. One note of caution is that the validity of the statistical inference depends not only on the appropriate use of statistics but also on the selection of an appropriate sample on which the inference will be based. A general conceptual description of inferential statistics is provided in this section.

A parameter is a quantitative characteristic from a population, the value of which is considered fixed but unknown. For a case control study, one may be interested in the value of the population odds ratio, an estimate of the relative risk of an event. An example of a relevant population parameter from a randomized clinical trial is the difference in population mean response. Summary statistics (e.g., proportions of subjects exposed to some risk factor among cases and controls or the difference in sample means between the treated and control groups in a clinical trial) are calculated from the sample as estimates of the unknown population parameter of interest. The purpose of statistical inference is to evaluate how well a sample statistic estimates an unknown population parameter. The general process of making statistical inferences is as follows:
• Select a sample from a population of interest.
• Collect data from the sample.
• Calculate appropriate sample statistics as estimates of the population parameter.
• Make a statistical inference about the population parameter.
• Make a conclusion about the population itself.

The value of the summary statistic from a sample is called a point estimate, and it represents the estimate of the population parameter that is reasonably well supported by the sample data. If one were to repeat an experiment or study with a new sample of the same size from the same population, a different point estimate would be obtained. When each sample has an equal chance of being selected from the population, the sample is called a simple random sample.

The extent to which point estimates vary from sample to sample (of the same size) represents sampling variability and can be quantified. If one were to select a sample of size $n$ from the population of interest, calculate a sample statistic or point estimate (e.g., the sample mean), record it, and repeat the process a large number of times, the relative frequency of values of the sample statistic over all samples of size $n$ would constitute the sampling distribution of the sample statistic.

In the previous section, the term standard deviation was defined and represented the typical spread of observations about the sample mean. If one were to calculate the standard deviation of values of the sample statistics (i.e., the standard deviation of the sampling distribution), the result represents the typical spread of sample statistics about the population parameter. This quantity is known as the standard error. The standard error of an estimate is a measure of how precisely the sample statistic has estimated the population parameter or, stated another way, the extent to which use of the sample has misestimated the true population parameter. The larger the sample is, the smaller the standard error will be, indicative of less uncertainty about the population parameter. It is important to note that there is not just one standard error. For every estimator, or mathematical rule used to calculate a sample statistic, there is a standard error. The remainder of this section will
define the standard error for one estimator, the sample mean. Later sections mention standard errors for other estimators, but details of their derivation will not be included. As will be seen, the standard error of an estimate can be calculated from the sample.

Since a single point estimate from a sample will likely vary from sample to sample, a more useful way of estimating the population parameter of interest is an interval estimate, with a lower limit (LL) and an upper limit (UL). The general conceptual approach with interval estimation is to define an interval so that the proportion of random samples that enclose the parameter $\theta$ within the lower and upper limits is $(1 - \alpha)$. Using some notational shorthand, we would like to estimate values, LL and UL, such that $P(LL < \theta < UL) = 1 - \alpha$, where $P$ expresses the proportion of random samples. The lower and upper limits are random variables, the values of which depend on the point estimate, the standard error of the estimate (a measure of the error attributed to sampling), and a precision coefficient.

A precision coefficient is a measure of how consistently a sample statistic estimates the population parameter, and it is obtained from well-defined distributions of standardized random variables. To illustrate, consider a random variable that has a particular distribution known as the normal distribution. Normal distributions are symmetric about their means with a bell shape, the downward slope determined by the standard deviation. For any random variable, $X$, that has a normal distribution (mean $\mu$ and standard deviation $\sigma$), the following can be said about the probability of observing certain values of the random variable:

$$P(\mu - 1.04\sigma < X < \mu + 1.04\sigma) = 0.70$$
$$P(\mu - 1.96\sigma < X < \mu + 1.96\sigma) = 0.95$$
$$P(\mu - 2.58\sigma < X < \mu + 2.58\sigma) = 0.99$$

In other words, 70% of values are within 1.04 standard deviations of the mean; 95% of values are within 1.96 standard deviations of the mean; and 99% of values are within 2.58 standard deviations of the mean. It is possible to standardize any normally distributed random variable by subtracting the population mean from each value and dividing by the standard deviation:

$$Z = \frac{X - \mu}{\sigma}$$

The resulting random variable has a standard normal distribution with mean 0 and standard deviation 1. A random variable that follows the standard normal distribution is often denoted by $Z$ and called a Z score. Using the expressions above:

$$P(-1.04 < Z < 1.04) = 0.70$$
$$P(-1.96 < Z < 1.96) = 0.95$$
$$P(-2.58 < Z < 2.58) = 0.99$$

The precision coefficient is specific to the parameter being estimated. Precision coefficients can be obtained from tabled values or from statistical software. If a random variable follows a standard normal distribution, one can use the known distribution of Z scores to state that 95% of all Z scores lie between $-1.96$ and $1.96$. The value 1.96 is the precision coefficient needed for an interval estimate with 95% confidence. Stated simply, a precision coefficient is the number of standard deviations within which $100(1 - \alpha)\%$ of the values of the random variable fall from the population parameter. The symbol $\alpha$ represents one's willingness to estimate the underlying population parameter incorrectly. In most fields of research, an $\alpha$ level of 0.05 is considered reasonable, but there may be times when a higher or lower $\alpha$ level is acceptable.

In general, the standard error gets smaller as the sample size increases. The standard error for the sample mean is defined as the standard deviation divided by the square root of the sample size:

$$SE(\bar{x}) = \frac{s}{\sqrt{n}}$$

Greater confidence for an interval estimate requires a larger precision coefficient. These observations hold for standard errors of other estimators and for other distributions used in the construction of intervals. The construction of a confidence interval follows a general form of:

Point estimate ± (precision factor) × (standard error of the estimate).
Given this general form, the following observations are worth noting. All other things being equal, confidence intervals are:
• Narrower with larger sample sizes than smaller sample sizes
• Wider when more confidence is required than when less confidence is required

Although somewhat of a simplification, confidence intervals represent a plausible range of values of the population parameter given the sample estimate and uncertainty attributed to the sampling process.

Since population parameters (e.g., the population standard deviation) are not known, there are related standardized scores which utilize only data from the sample.

The t-statistic is perhaps the best known of these, and it will be used as an example of a confidence interval for a population mean. When the sample size is small (particularly <30) and the population standard deviation is unknown, the difference between the sample mean and the population mean, divided by the estimated standard error of the mean, has a t distribution for which the shape is determined by the number of degrees of freedom $(n - 1)$. The statistic,

$$T = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

follows a t distribution (Student's t). The t distribution is symmetric about its mean (zero) and looks like a normal distribution with, in cases of sample sizes less than 200, heavier tails. As was the case with the normal distribution, the shape of the t distribution can be used to find two values which define a central area under the density curve of size $(1 - \alpha)$. It can be shown that once a value of $T$ associated with an area of interest (translated as a probability) is determined, the sample mean $\bar{x}$ is within $T(s/\sqrt{n})$ of the population mean, $\mu$. This enables one to calculate a confidence interval for the population mean when the sample size is small and the population variance is unknown.

The interval estimate of the population mean, the two-sided $100(1 - \alpha)\%$ confidence interval for the population mean, is

$$\bar{x} \pm t_{1-\alpha/2,\,n-1}\,(s/\sqrt{n})$$

Note that the precision coefficient, $t_{1-\alpha/2,\,n-1}$, is the $100(1 - \alpha/2)$th percentile of the t distribution with $n - 1$ degrees of freedom. Since the t distribution is symmetric about zero, $t_{1-\alpha/2}$ is the precision coefficient that defines a central area of $(1 - \alpha)$.

As an example, consider an observational study of 25 patients with primary biliary cirrhosis (PBC). Among these 25 patients, the mean alkaline phosphatase value was 1,983 U/L and the standard deviation was 2,140 U/L. Researchers are interested in a 95% confidence interval for the population mean alkaline phosphatase. In this case, the precision factor is the value from the t distribution with 24 degrees of freedom that defines a central area of 95%. From a table of values, one finds this value to be 2.06. The 95% confidence interval is then

$$1983 \pm 2.06\,(2140/5) = 1983 \pm 881.7 = (1101.3,\ 2864.7)$$

The statistical interpretation of this result is that we are 95% confident that the interval (1,101.3, 2,864.7 U/L) includes the population mean alkaline phosphatase value among patients with PBC.

A statistical concept that is closely related to the construction of confidence intervals is hypothesis testing. Hypothesis testing involves the following steps:
• Posing a null hypothesis about the value of the population parameter of interest
• Stating the alternative hypothesis about the value of the population parameter
• Identifying an appropriate test statistic against which the null hypothesis will be evaluated
• Describing the distribution of the test statistic when the null hypothesis is true; identifying values of the test statistic that occur less than $100\alpha\%$ of the time under the null hypothesis (the rejection or critical region)
• Calculating the test statistic
• Making a conclusion about the null and alternative hypotheses on the basis of the test statistic compared to the rejection region

Since the inference is being made from a sample, the hypothesis test can result in two types of
errors: rejecting the null hypothesis when it should not have been rejected (a type I error) or failing to reject the null hypothesis when it should have been rejected (a type II error). Making an erroneous conclusion at the end of a study is undesirable. Hence, studies are designed to limit the probability of either of these errors occurring. The probability of a type I error is denoted by alpha ($\alpha$), previously referred to as the significance level, and that for a type II error is denoted by beta ($\beta$). The complement of $\beta$, $(1 - \beta)$, is called the power of a test and is the probability of correctly rejecting the null hypothesis. For clinical trials, study design considerations include specification of $\alpha$ and $(1 - \beta)$ since these affect the ability of a study sponsor to address study objectives (e.g., to claim an effect of an investigational drug).

The process of hypothesis testing can be illustrated with data from the previous example. Suppose researchers would like to know if, as they suspect, the mean alkaline phosphatase value among patients with primary biliary cirrhosis is different from otherwise normal subjects. The mean among normal volunteers is around 80 U/L. The null hypothesis is that the mean alkaline phosphatase among patients with PBC is 80 U/L. If there is sufficient evidence to reject the null hypothesis, the following alternative hypothesis will be concluded: the mean alkaline phosphatase among patients with PBC is not 80 U/L. Using statistical notation:

$$H_0: \mu = 80 \quad \text{versus} \quad H_A: \mu \neq 80$$

Note that rejection of the null hypothesis could occur because the mean PBC level was less than or greater than the hypothesized value. Since there are two sides to the alternative hypothesis, the test is considered two-sided. In advance, the researchers will have decided upon a value of $\alpha$, the size of the test, which represents the probability that they will reject the null hypothesis erroneously. The choice of the size of the test depends on one's willingness to commit a type I error. For example, if the implication of committing a type I error is not very important as in early studies in drug development, a researcher may be satisfied with an $\alpha$ level of 0.10 or 0.20. A common value for $\alpha$ in confirmatory research settings is 0.05, but there may be instances in which smaller values are desirable. The test statistic is calculated as the difference between the sample mean and the hypothesized value divided by the standard error of the mean:

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$

If this value is close to zero, there will be insufficient evidence to reject the null hypothesis. If the value is far from zero, the evidence is considered sufficient to reject the null hypothesis. The rejection region is represented by those values of the test statistic that occur with probability $\alpha$ or less when the null hypothesis is true. If the null hypothesis is rejected, either the population mean alkaline phosphatase is not 80 U/L or a type I error has occurred.

For the results obtained, the calculated value of the test statistic is

$$t = \frac{1983 - 80}{2140/\sqrt{25}} = \frac{1903}{428} = 4.45$$

Using tabled values of the t distribution with 24 degrees of freedom, one obtains a rejection region of $t < -2.06$ or $t > 2.06$. Since the test statistic is in the rejection region, the null hypothesis is rejected at the $\alpha = 0.05$ level. The conclusion from the hypothesis test is that the difference between the sample estimate and the hypothesized value is greater than would be expected by chance (due to sampling) alone. The population mean alkaline phosphatase value for patients with PBC is different from 80 U/L.

It is possible that two people would not agree on the appropriate value of $\alpha$, so another probability, the p value, is often used to reflect the extremeness of the value of the test statistic. A p value is the probability of observing the actual value of the test statistic or one more extreme (i.e., favoring the alternative hypothesis) when the null hypothesis is true. If a p value is less than or equal to $\alpha$, one rejects the null hypothesis. A p value of 0.02 means that the value of the observed test statistic and all other values more extreme (i.e., contradictory of the null hypothesis) occur with probability of 0.02 under the null hypothesis. One major drawback of p values as a measure of evidence is that they are highly dependent on the sample size
as it relates to the standard error of the estimator and, thereby, the power to contradict the null hypothesis. The sample size for a study typically is estimated in advance to ensure there is adequate power to detect an effect of interest.

The sample size required to provide power of $(1 - \beta)$ to reject the null hypothesis that the mean $\mu$ is not different from a specified value $\mu_0$ while maintaining a type I error of $\alpha$ is

Another algebraic manipulation of the sample size expression yields the following:

$$n = \frac{(Z_{\alpha} + Z_{\beta})^2}{(\Delta/\sigma)^2}$$

The expression $(\Delta/\sigma)$ is called the effect size. In the case where $\Delta$ is defined as the difference between two means with a common standard deviation, $\sigma$, Cohen (1992) [7] has characterized effect sizes around 0.2 as small, around 0.5 as moder-
three hypothetical confidence intervals (with the lower and upper limits indicated by the brackets) are displayed with the corresponding statistical conclusion regarding the null hypothesis. The first interval lies entirely to the left of the hypothesized value of the population parameter, indicating that the plausible values of $\theta$ are less than $\theta_0$. Therefore, the null hypothesis is rejected. The second interval encloses the hypothesized value of the population parameter, indicating that $\theta_0$ is among the plausible values of $\theta$. Therefore, the null hypothesis is not rejected. The third interval lies entirely to the right of the hypothesized value of the population parameter, indicating plausible values of $\theta$ are greater than $\theta_0$. Hence, the third confidence interval is also consistent with rejection of the null hypothesis.

The formulation of the confidence interval depends on the population parameter being estimated, which depends on the null hypothesis, which in turn depends on the research question of interest. Medical research includes observational studies (prospective and retrospective) and clinical trials which are intended to evaluate the effects of a medical intervention, such as a pharmaceutical agent, a surgical procedure, use of a device, or implementation of an educational or counseling program. Pharmaceutical agents are evaluated for their usefulness, among other things, on the basis of their efficacy and safety [3]. In the context of pharmaceutical development, the objective of a clinical trial can be to demonstrate that a test treatment is:
• Superior to an inactive or active control
• Not unacceptably worse than (not inferior to) an active control
• Equivalent to an active control

The following statistical hypotheses correspond to a clinical trial intended to demonstrate the superiority of a test treatment compared to a control with respect to a continuous outcome:

$$H_0: \mu_{\text{test}} - \mu_{\text{control}} = 0 \quad \text{versus} \quad H_A: \mu_{\text{test}} - \mu_{\text{control}} \neq 0$$

The null hypothesis of interest could be rejected if the difference in mean response is far from zero in the negative direction (i.e., mean for the control group is much larger than the mean for the test group) or the positive direction (i.e., mean for the test group is much larger than the mean for the control group). In many instances, sponsors of clinical trials are only interested in one direction of the alternative hypothesis, namely, the direction that corresponds to a benefit of the test treatment. However, the null hypothesis is tested using a two-sided test of size $\alpha$. Hence, if it is rejected, the probability of erroneously claiming a benefit of the treatment is $\alpha/2$ and the probability of erroneously detecting a harm of the treatment is $\alpha/2$.

Test products which are intended to be similar to an existing product in terms of the clinical response are evaluated in equivalence trials. The objective of an equivalence trial is to demonstrate that the difference in response between the test treatment and the active control does not exceed an acceptable margin. New pharmaceutical products which are shown to be equivalent to an active control may have other advantages to justify their use such as better safety, more convenient dosing, or lower cost. Bioequivalence studies are intended to demonstrate that the pharmacokinetic properties of two formulations of a treatment are equivalent.

The following statistical hypotheses correspond to a clinical trial intended to demonstrate the equivalence of a test treatment to an active control with respect to the difference in population means of a continuous outcome:

$$H_0: |\mu_{\text{test}} - \mu_{\text{control}}| \geq \delta_{\text{equivalence}} \quad \text{versus} \quad H_A: |\mu_{\text{test}} - \mu_{\text{control}}| < \delta_{\text{equivalence}}$$

The quantity $\delta_{\text{equivalence}}$ is called the equivalence margin, and it must be specifically defined in advance of the study analysis. In the case of pharmaceutical studies, the equivalence margin must be agreed upon by regulatory authorities if the study is to be used for registration purposes. The null hypothesis in equivalence trials is typically tested using a confidence interval about the difference in population parameters (e.g., means or proportions). If the confidence interval lies entirely within the equivalence margins (i.e., between $-\delta_{\text{equivalence}}$ and $+\delta_{\text{equivalence}}$), the null hypothesis is rejected.
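The decision rule for an equivalence trial can be sketched in Python as follows. The function names, the large-sample normal interval, and the numbers (differences of 0.5 and 1.5, a standard error of 1.0, and a margin of 3.0) are illustrative assumptions rather than values from the chapter.

```python
from statistics import NormalDist


def difference_interval(diff, se, confidence=0.95):
    """Large-sample interval for a difference: estimate +/- z x (standard error)."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return diff - z * se, diff + z * se


def equivalence_shown(lower, upper, margin):
    """Reject the null hypothesis of non-equivalence only when the whole
    interval lies strictly between -margin and +margin."""
    return -margin < lower and upper < margin


margin = 3.0  # hypothetical equivalence margin, fixed before the analysis

# Hypothetical trial A: observed difference 0.5, standard error 1.0.
low_a, high_a = difference_interval(0.5, 1.0)
print(round(low_a, 2), round(high_a, 2), equivalence_shown(low_a, high_a, margin))

# Hypothetical trial B: observed difference 1.5, same standard error.
low_b, high_b = difference_interval(1.5, 1.0)
print(round(low_b, 2), round(high_b, 2), equivalence_shown(low_b, high_b, margin))
```

In trial B the upper limit crosses the margin, so equivalence is not demonstrated even though the interval also contains zero. This is the sense in which failing to find a difference is not the same as showing equivalence.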
An important consideration in equivalence trials is that rejection of the null hypothesis can be interpreted as meaning that the test and control are both efficacious or neither is. The credibility of such a result depends on the ability to demonstrate that the active control would have been efficacious if an inactive control were used in the study. The ability of a study to differentiate an efficacious treatment from an inefficacious treatment is called assay sensitivity. One way assay sensitivity can be established is by the use of historical data for the inactive control to demonstrate that the active control would have been superior to the inactive control if it had been studied. Another way to establish assay sensitivity is to include an inactive control group in addition to the active control, although such a design may not be ethical. Interested readers may refer to Chow and Liu (2004) [8] for further information on equivalence and noninferiority clinical trials.

Another objective of some clinical trials is to demonstrate that a test treatment is not unacceptably inferior to the control. Studies with such an objective are called noninferiority studies, and they may be used when it is unethical or logistically difficult to use an inactive control. If the test treatment is considered not unacceptably worse than the active control, it may have other advantages such as better safety or greater convenience. The following statistical hypotheses correspond to a clinical trial intended to demonstrate the noninferiority of a test treatment to an active control with respect to the difference in population means of a continuous outcome. In this formulation of the hypotheses, a larger value of the mean is favorable:

$$H_0: \mu_{\text{test}} - \mu_{\text{control}} \leq \delta_{\text{non-inferiority}} \quad \text{versus} \quad H_A: \mu_{\text{test}} - \mu_{\text{control}} > \delta_{\text{non-inferiority}}$$

As with equivalence trials, the noninferiority margin must be specified in advance. The null hypothesis is tested using a confidence interval. If the noninferiority margin is enclosed within the confidence interval, the null hypothesis is not rejected. If the noninferiority margin is below the lower limit of the confidence interval, the null hypothesis is rejected, and the conclusion is that the test treatment is not inferior to the active control. Similar to equivalence trials, interpreting this statistical conclusion also depends on the ability of the study to establish assay sensitivity.

As seen in this section, both hypothesis tests and confidence intervals are used to draw conclusions about a quantitative characteristic of a population. In the remaining sections, specific statistical methods are described.

Differences Between Means and Proportions

A common statistical analysis involves making an inference about the equality of two means when the observations are independent, meaning the value of one observation does not depend on another. In many medical studies, observations can be considered independent because the observations are single values from different study subjects. However, medical studies frequently involve repeated tests for the same individual (e.g., heart rate taken at a number of times for the same individual) or related tests within the same individual (e.g., presence of a characteristic in more than one skin location within an individual study subject). Such observations are considered dependent.

In the case of independent observations, the hypothesis tested for the equality of two population means is

$$H_0: \mu_1 - \mu_2 = 0$$

If this null hypothesis is rejected, the following alternative hypothesis will be favored:

$$H_A: \mu_1 - \mu_2 \neq 0$$

The test statistic to test the null hypothesis is

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

where the numerator is the difference in sample means, an estimate of the difference in
population means, and the denominator is the standard error of the difference in sample means. The quantity

$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

is the pooled standard deviation, the square root of the weighted average of the two sample variances, where the samples have sizes $n_1$ and $n_2$. This test is called Student's t test or the independent groups t test because the test statistic follows a t distribution under the null hypothesis.

The assumptions required for the use of the two-sample t test are that the distribution of the random variable is approximately normal, the two groups represent simple random samples from the two populations of interest, and the population variances are equal (although likely unknown). Under the null hypothesis (i.e., assuming the two population means are equal), the test statistic follows a t distribution with $n_1 + n_2 - 2$ degrees of freedom. For a two-sided hypothesis test of size $\alpha$, the rejection region is defined as any value of the test statistic $t > t_{1-\alpha/2,\,n_1+n_2-2}$ or $t < t_{\alpha/2,\,n_1+n_2-2}$, i.e., the values from the t distribution with $n_1 + n_2 - 2$ degrees of freedom that lie outside of a central area of $(1 - \alpha)$. Note that since the t distribution is symmetric, $t_{\alpha/2,\,n_1+n_2-2} = -t_{1-\alpha/2,\,n_1+n_2-2}$.

Consider the following example. One may be interested in whether or not there is a difference in the mean LDL cholesterol between adults with coronary heart disease (CHD) and adults without CHD. To answer such a research question, two samples corresponding to the populations of interest (i.e., adults with a diagnosis of CHD and adults with no diagnosis of CHD) would be studied. LDL cholesterol levels were ascertained for 25 subjects from each group. The sample means were 134 mg/dL and 118 mg/dL for the CHD and non-CHD subjects, respectively. The sample standard deviations were 14 mg/dL and 12 mg/dL, respectively. The statistical hypotheses are

$$H_0: \mu_{\text{CHD}} - \mu_{\text{non-CHD}} = 0 \quad \text{versus} \quad H_A: \mu_{\text{CHD}} - \mu_{\text{non-CHD}} \neq 0$$

The size of the test will be $\alpha = 0.05$. Given the sample sizes in each group, the rejection region is any value of the test statistic $t < -2.01$ or $t > 2.01$, which corresponds to values of the t distribution with 48 degrees of freedom which define areas in the left-hand tail of 0.025 and in the right-hand tail of 0.025, respectively. The values that define the rejection region can be obtained from tables of the distribution or from statistical software.

The pooled standard deviation is calculated as

$$s_p = \sqrt{\frac{(25 - 1)14^2 + (25 - 1)12^2}{25 + 25 - 2}} = 13$$

The test statistic is calculated as

$$t = \frac{134 - 118}{13\sqrt{\dfrac{1}{25} + \dfrac{1}{25}}} \approx 4.35$$

Since the value of the test statistic, 4.35, is in the rejection region ($t > 2.01$), the null hypothesis is rejected. The evidence from the study suggests that the population mean LDL is different between adults with CHD and those without CHD. Since the sample statistic differs from the hypothesized value of the population parameter by much more than what would be expected by chance alone, such a difference is often called statistically significant. A corresponding confidence interval for the difference between two group means can be written as

$$(\bar{x}_1 - \bar{x}_2) \pm t_{1-\alpha/2,\,n_1+n_2-2}\; s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

Note that this confidence interval follows the general form described previously. In this case, the quantity $s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$ is the standard error of the difference in sample means.

For this particular example, the corresponding 95% confidence interval is
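The two-sample calculation above, including the interval for this example, can be reproduced from the summary statistics with a short Python sketch. The function names are hypothetical, and the critical value 2.01 is simply the tabled value quoted in the text for 48 degrees of freedom rather than something the code derives. The sketch carries the pooled standard deviation forward unrounded, so its results may differ slightly from hand calculations that round it to 13.

```python
from math import sqrt


def pooled_sd(s1, n1, s2, n2):
    """Square root of the weighted average of the two sample variances."""
    return sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))


def two_sample_t(mean1, s1, n1, mean2, s2, n2):
    """Return the t statistic and the standard error of the difference in means."""
    se_diff = pooled_sd(s1, n1, s2, n2) * sqrt(1.0 / n1 + 1.0 / n2)
    return (mean1 - mean2) / se_diff, se_diff


# Summary statistics from the LDL cholesterol example (mg/dL).
mean_chd, sd_chd, n_chd = 134.0, 14.0, 25
mean_non, sd_non, n_non = 118.0, 12.0, 25

t_crit = 2.01  # tabled value for 48 degrees of freedom, two-sided alpha = 0.05

t_stat, se_diff = two_sample_t(mean_chd, sd_chd, n_chd, mean_non, sd_non, n_non)
reject_null = abs(t_stat) > t_crit

# Interval: difference +/- (precision coefficient) x (standard error).
diff = mean_chd - mean_non
lower, upper = diff - t_crit * se_diff, diff + t_crit * se_diff

print(round(pooled_sd(sd_chd, n_chd, sd_non, n_non), 2))  # about 13.04
print(round(t_stat, 2), reject_null)                      # about 4.34, True
print(round(lower, 1), round(upper, 1))                   # about 8.6 and 23.4
```

The interval excludes zero, which agrees with the rejection of the null hypothesis at the same value of alpha.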
Table 11.1 ANOVA for mean VAS pain score from three dose groups

Source   Sum of squares   df   Mean square   F
Drug     99.89459          2   49.947295     6.28
Error    238.67896        30   7.955965
Total    338.57355        32

responses is partitioned into within-group variability (the inherent variability within each sample) and among-group variability (the variability of the sample means relative to the overall mean). The test statistic $F$ is calculated as the ratio of the variability among samples (e.g., treatment groups) to the variability within samples:

$$F = \frac{V_{\text{Among}}}{V_{\text{Within}}}$$

That is, if the sample means vary more than the inherent variability, the ratio will be greater than one, and the evidence will suggest that the sample means did not arise from populations with a common population mean. Results from an analysis of variance are often displayed in an ANOVA table, such as the results displayed in Table 11.1 from an analysis of a clinical trial of three doses of an analgesic. The response of interest is the mean pain score recorded using a visual analog scale (VAS). The mean square for drug (49.95) represents the average variability of means relative to the grand mean response. The mean square error (7.96) represents the variability of responses within each treatment group. The ratio of these two is the test statistic and can be interpreted as the extent to which the variability in mean responses across groups exceeds the inherent variability in response.

The test statistic calculated in such a manner follows an F distribution, for which the shape (and therefore the critical region) is determined by two parameters: the numerator degrees of freedom, defined as the number of degrees of freedom required to estimate the variability among sample means $(k - 1)$, and the denominator degrees of freedom, defined as the number of degrees of freedom for estimating variability within samples $(N - k)$, where $N$ represents the total sample size across the $k$ samples:

$$N = \sum_{i=1}^{k} n_i$$

If the null hypothesis of equal means is rejected, one would like to know which pairs of the population means are unequal.

Following a significant test result from the F test, one can compare the population means among samples (e.g., treatment groups in a clinical trial) using numerous methods that appropriately control the overall type I error rate. This is important since one could test each of $c = k(k - 1)/2$ pairs of population means using an independent groups t test with $\alpha = 0.05$, but the type I error rate is only controlled at $\alpha = 0.05$ with this method when $k = 3$. When $k > 3$, the probability of incorrectly rejecting at least one hypothesis increases with the number of individual hypotheses tested. In general, if $c$ null hypotheses each have independent tests at the $\alpha$ level, the probability of rejecting at least one by chance alone is

$$P(\text{rejecting at least one of } c \text{ hypotheses}) = 1 - (1 - \alpha)^c$$

For example, if five such independent comparisons of treatment groups are tested at $\alpha = 0.05$, the probability of rejecting at least one by chance alone could be as large as 0.226.

One appropriate method for controlling the experimentwise error rate is the Bonferroni test which involves testing each of the $c$ pairs of means using a t test with $\alpha_B = \alpha/c$. For example, if a study with four groups was conducted and the F test was rejected ($\alpha = 0.05$), then the six comparisons of means could subsequently be tested using $\alpha_B = 0.05/6 = 0.0083$. This method controls the experimentwise error rate since the probability of incorrectly rejecting at least one null hypothesis is bounded by $6(0.05/6) = 0.05$.

Another method which can be used to compare pairs of means is Tukey's Honestly Significant Difference test. This test requires the use of additional tabled values to determine the minimum absolute difference in means that would lead to rejection of the null hypothesis of the equality of two means. However, Tukey's
method is more powerful than the Bonferroni test, meaning the absolute difference in means leading to rejection is smaller than that required for the Bonferroni test. Another method which may be used when comparing a number of group means to a common control is Dunnett's test. Additional details about methods used to control the experimentwise error rate in the setting of multiple tests can be found in Schork and Remington (2000) [5] and, on a more advanced level, in Westfall et al. (1999) [10].

If the shape of the underlying distribution cannot be assumed to be normal, a nonparametric approach may be used. The Kruskal-Wallis test is to the ANOVA as the Wilcoxon rank sum test is to the independent groups t test. That is, for the Kruskal-Wallis test, the original random variable is ranked across all $k$ groups. The test statistic is the ratio of the variability in ranks among groups to the variability in ranks within groups. The null hypothesis for the Kruskal-Wallis test is that the groups have the same population distribution. If the null hypothesis is rejected, one would conclude in favor of the alternative hypothesis that the population distributions are different, particularly in their location. The assumptions required for the use of the Kruskal-Wallis test are that the observations are independent, the samples are simple random samples from the populations of interest, and the variance is equal among the populations under the null hypothesis. The test statistic is evaluated using a chi-squared distribution, for which the shape (and therefore the critical region) is determined by the number of degrees of freedom ($k - 1$).

Many medical studies examine the proportion of subjects with a particular response, such as deaths, myocardial infarctions, or some risk factor for disease, as the outcome of interest. The comparison of proportions between two groups (e.g., represented by cases or controls in an observational study or treatment groups in a clinical trial) can be expressed as one proportion minus another, $p_1 - p_2$, or as a ratio of the two, $p_1 / p_2$, a quantity called the relative risk. Data from studies with these kinds of outcomes are usually presented in the form of a table displaying the counts of subjects with and without the outcome of interest for each group.

Data from a hypothetical cohort study are displayed in Table 11.2. In this study, patients with a confirmed diagnosis of a particular neurological condition ("exposed") and age- and sex-matched controls ("not exposed") were followed for a period of 1 year to ascertain the occurrence of workplace injuries.

Table 11.2 Number of events for subjects exposed and not exposed

                          Exposed   Not exposed   Total
Workplace injury?  Yes       30           8          38
                   No        70         132         202
Total                       100         140         240

Note that if the proportions of subjects with the outcome were equal in the two groups, the observed counts of subjects with each outcome (yes or no) would be distributed in equal proportion among the groups (exposed or not exposed). One method that can be used to test the hypothesis of equal proportions between two populations is the chi-squared test of homogeneity. The chi-squared test is an example of a goodness-of-fit test, for which the observed counts of subjects with and without the event are compared to the expected number of subjects with and without the event when no difference exists (that is, under the null hypothesis). For goodness-of-fit tests, the expected counts are obtained on the basis of an assumed model. In the case of the test of equal proportions, the expected counts would be obtained by applying the overall (across groups) proportion with response to each group's sample size. The test statistic for a chi-squared test is expressed as the ratio of the squared difference of the observed and expected counts (denoted by $O$ and $E$, respectively) to the expected count for each cell, summed over all four cells (indexed below by $j$) of the table:

$$\chi^2 = \sum_{j=1}^{4} \frac{(O_j - E_j)^2}{E_j}$$

Squaring the deviations of observed counts from the expected ensures that each term is nonnegative, which is required for a random variable from a chi-squared distribution. An alternative, mathematically equivalent, form of the test statistic is also available.
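As a hedged illustration of the chi-squared statistic defined above, the following Python sketch computes the expected count for each cell from the table margins and sums the $(O - E)^2 / E$ terms for the counts in Table 11.2. The function name `chi_squared_statistic` and the nested-list layout are illustrative choices made here, not notation from the chapter.

```python
# Observed counts from Table 11.2.
# Rows: workplace injury yes / no. Columns: exposed / not exposed.
observed = [[30, 8],
            [70, 132]]


def chi_squared_statistic(table):
    """Pearson chi-squared statistic: sum over all cells of (O - E)^2 / E."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            # Expected count under the null hypothesis of equal proportions.
            e = row_totals[i] * col_totals[j] / n
            statistic += (o - e) ** 2 / e
    return statistic


chi2 = chi_squared_statistic(observed)
print(round(chi2, 3))  # 25.817
```

The result can be compared with the critical value of a chi-squared distribution with 1 degree of freedom (3.84 at the 0.05 level).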
The hypotheses tested are

$$H_0: p_1 - p_2 = 0, \quad H_A: p_1 - p_2 \neq 0$$

where the population proportions for each of two independent groups are represented by $p_1$ and $p_2$. Under the null hypothesis, the test statistic has a chi-squared distribution with 1 degree of freedom. Therefore, the null hypothesis will be rejected if the test statistic is in the rejection region defined by $\chi^2 > \chi^2_{1-\alpha,1}$. Note that only large values of the test statistic contradict the null hypothesis. Therefore, the rejection region for the chi-squared test is represented by only the upper tail of the distribution. The chi-squared test is appropriate when the groups are independent, the outcomes are mutually exclusive, and most of the expected cell counts are at least five. The use of the chi-squared test is illustrated with the data from Table 11.2.

The null and alternative hypotheses concerning the proportion of subjects exposed and unexposed with workplace injuries are of the form given above, with $p_1$ and $p_2$ denoting the proportions of exposed and unexposed subjects with a workplace injury. Using the expected counts implied by the margins of Table 11.2, the test statistic is

$$\chi^2 = \frac{(30 - 15.83)^2}{15.83} + \frac{(8 - 22.17)^2}{22.17} + \frac{(70 - 84.17)^2}{84.17} + \frac{(132 - 117.83)^2}{117.83} = 25.817$$

For a test with $\alpha = 0.05$, the rejection region is defined as any value of the test statistic greater than 3.84 (chi-squared distribution with 1 degree of freedom). Therefore, the null hypothesis is rejected with a conclusion that the proportion of exposed subjects with workplace injuries (30/100) is greater than the proportion of age- and sex-matched unexposed subjects (8/140).

A confidence interval can also be constructed for the difference in two proportions. A two-sided $100(1 - \alpha)\%$ confidence interval for the difference in proportions, based on the sample difference $\hat{p}_1 - \hat{p}_2$, is given by

$$(\hat{p}_1 - \hat{p}_2) \pm z_{1-\alpha/2} \, SE(\hat{p}_1 - \hat{p}_2), \quad \text{where} \quad SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2 (1 - \hat{p}_2)}{n_2}}$$

For this particular example, the corresponding 95% confidence interval is calculated as

$$(0.300 - 0.057) \pm 1.96 \sqrt{\frac{(0.300)(0.700)}{100} + \frac{(0.057)(0.943)}{140}} = 0.243 \pm 0.098 = (0.145, 0.341)$$
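The interval arithmetic for the difference in two proportions can be checked with a short Python sketch. It is an illustration only: it applies the standard error formula given in the text to the counts in Table 11.2 (30 of 100 exposed and 8 of 140 unexposed subjects with a workplace injury). The function name `diff_proportion_ci` is hypothetical, and the value 1.96 is the usual standard normal quantile for a two-sided 95% interval.

```python
import math


def diff_proportion_ci(x1, n1, x2, n2, z=1.96):
    """Two-sided confidence interval for p1 - p2 from independent samples.

    x1, x2: counts of subjects with the event; n1, n2: group sizes;
    z: standard normal quantile z_{1 - alpha/2} (1.96 for 95%).
    """
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se


diff, lower, upper = diff_proportion_ci(30, 100, 8, 140)
print(f"{diff:.3f} ({lower:.3f}, {upper:.3f})")  # 0.243 (0.145, 0.341)
```

Because the interval excludes zero, it leads to the same conclusion as the chi-squared test for these data.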
which the proportions are compared. The test statistic is computed in the same manner as for the two groups case, except that it is computed by summing over all $2k$ cells. In the more general case, under the null hypothesis of equal proportions across the $k$ groups, the test statistic has a chi-squared distribution with $k - 1$ degrees of freedom.

When the sample size requirements for the chi-squared test cannot be met due to small expected cell counts, an exact test is more appropriate. The fundamental concept of Fisher's exact test is that the margins of the table are considered fixed (e.g., count of subjects with and without events over all groups and the count of subjects in each group). Given the fixed margins, it is possible to specify all possible patterns of event counts. Then the exact probability of each pattern of counts of events is calculated using the hypergeometric distribution. The p value corresponding to the test is calculated exactly by summing the probabilities associated with all tables which have probabilities as small as, or smaller than, that for the observed table.

Medical studies involving assessment of the presence or absence of a characteristic in the same subjects before and after an intervention (e.g., negative or positive for a symptom before and after treatment) yield counts of paired observations, as shown in Table 11.3.

Table 11.3 Number of subjects with and without symptom pre- and postintervention

                    Post
                Yes      No       Total
Prior   Yes      a        b       a + b
        No       c        d       c + d
Total          a + c    b + d       n

For assessment of whether the intervention had an effect on the response, McNemar's test can be used. The null and alternative hypotheses from such a study are

$$H_0: p_{Pre} - p_{Post} = 0, \quad H_A: p_{Pre} - p_{Post} \neq 0$$

If the intervention had no effect, the proportion with response would be the same prior to and postintervention, and therefore the marginal counts, $(a + b)$ and $(a + c)$, should be about equal. Because these two counts differ by $b - c$, the test statistic is calculated as

$$\chi^2 = \frac{(b - c)^2}{b + c}$$

and has a chi-squared distribution with 1 degree of freedom under the null hypothesis. A useful general reference that includes additional details about this test is Stokes, Davis, and Koch (2000) [11].

Statistical Issues in Diagnostic Testing and Screening

Tests which are used as an aid to diagnosing a disease are called diagnostic tests. An ideal diagnostic test would not identify a patient as positive for disease if she or he did not have it. Nor would an ideal diagnostic test identify a patient as negative for disease if she or he did have it. The diagnostic accuracy of a new test is often compared to an existing "gold standard" test. For such studies, two samples of patients are selected: those who test positive for the disease using the gold standard test and those who test negative for the disease using the gold standard. All participants from both groups are subjected to the new test and the outcome, either test positive or test negative, is noted.

Two measures of diagnostic accuracy are sensitivity and specificity. Sensitivity is the probability that a subject who has the disease will test positive. Specificity is the probability that a subject who does not have the disease will test negative. If a diagnostic test does not have high sensitivity or specificity, it will be of limited use, as important diagnoses will be missed in the former case and unnecessary medical follow-up may result in the latter.

Many assays produce a quantitative result which must be interpreted as either negative or positive. Each cutoff value chosen for the result yields its own sensitivity and specificity. Consider the use of the prostate-specific antigen (PSA) test as a diagnostic for prostate cancer. Higher values of PSA level (ng/mL) are more indicative of cancer. One may be interested in what specific value of PSA should be used to
indicate a positive test result for cancer. To address this question, sensitivity and specificity are calculated for all possible cut points or thresholds. For example, a PSA $\geq 2$ can be interpreted as a positive test, and PSA $< 2$ can be interpreted as a negative test. Using this criterion yields an estimate of sensitivity and specificity. When this is repeated for all possible cutoff values for PSA, it becomes evident that there is a tradeoff between sensitivity and specificity, as shown in Table 11.4.

Table 11.4 Sensitivity and specificity of PSA as a diagnostic for prostate cancer

PSA (ng/mL)   Sensitivity   Specificity
  1.0            1.0           0.46
  2.0            1.0           0.72
  3.0            0.98          0.82
  4.0            0.95          0.88
  5.0            0.81          0.92
  6.0            0.54          0.95
  7.0            0.35          0.96
  8.0            0.22          0.97
  9.0            0.13          0.98
 10.0            0.09          0.98
 11.0            0.06          0.98
 12.0            0.03          0.99
 13.0            0.01          0.99
 14.0            0.01          0.99
 15.0            0.01          0.99

As seen in Table 11.4, nearly all patients with cancer have PSA $\geq 2$ (sensitivity of 1), but only about three-fourths of patients without cancer have PSA $< 2$ (specificity of 0.72).

The results obtained for multiple cutoff values can be plotted in a receiver operating characteristic (ROC) curve. For each cutoff, the value of sensitivity is plotted on the y-axis and the value of (1 - specificity) is plotted on the x-axis. The cutoff whose point lies closest to the upper left corner of the plot (sensitivity of 1 and specificity of 1) is the cutoff that provides the greatest diagnostic accuracy. A ROC curve is displayed in Fig. 11.4 for the PSA data.

For this data set, a PSA value of 4.0 ng/mL is the cutoff that best balances sensitivity and specificity. Sensitivity and specificity can be interpreted as sample proportions for which confidence intervals can be constructed to estimate the precision of the sample estimate relative to the underlying population proportion. A two-sided $100(1 - \alpha)\%$ confidence interval based on a sample proportion $\hat{p}$ is

$$\hat{p} \pm z_{1-\alpha/2} \, SE(\hat{p}), \quad \text{where} \quad SE(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

For example, an estimate of sensitivity of 0.95 among 100 study subjects with the disease would result in a 90% confidence interval for the population sensitivity of

$$0.95 \pm 1.64 \sqrt{\frac{(0.95)(0.05)}{100}} = 0.95 \pm 0.036 = (0.91, 0.99)$$

Apart from a test's accuracy relative to a gold standard diagnostic protocol, its ability to accurately screen for disease is of interest. The probability that a patient who tests positive for disease actually has the disease is called the positive predictive value. Similarly, the probability that a patient who tests negative for disease does not have disease is called the negative predictive value. Through the use of a mathematical expression called Bayes' theorem, it can be shown that the positive predictive value is a function of the sensitivity and specificity of the test and the underlying prevalence (expressed as a proportion) of disease in the population of interest:

$$\text{positive predictive value} = \frac{(\text{sensitivity})(\text{prevalence})}{(\text{sensitivity})(\text{prevalence}) + (1 - \text{specificity})(1 - \text{prevalence})}$$

The negative predictive value is also a function of these quantities. It is important to note that it is usually not appropriate to estimate the prevalence of disease from the same study used to define the diagnostic accuracy of the test since the two groups sampled (those who test positive and those who test negative) are typically not chosen at random from the population of interest. Estimates of
the prevalence of disease are more appropriately obtained from epidemiologic studies.

As an example, consider a test with sensitivity and specificity of 0.95 each. If the prevalence of the disease is 0.1 (a common disease), the positive predictive value is 0.68. However, if the prevalence of the disease is 0.05, 0.01, or 0.001, the positive predictive value is 0.5, 0.16, and 0.02, respectively. These results suggest that, despite high diagnostic accuracy, a positive result from a diagnostic test may not be informative when the disease is rare. This example also illustrates how Bayes' theorem is used to combine prior information (in this case, the prevalence of disease) with newly collected data (a result from a diagnostic test with certain accuracy) to estimate the probability of disease given the test result. Confidence intervals about the positive predictive value and negative predictive value also can be calculated to assess the precision of the sample estimates.

Some methodological issues are worth mentioning for diagnostic studies. As described by Ransohoff and Feinstein (1978) [12], there are a number of biases that may be introduced that affect the results of assessments for sensitivity and specificity. When carrying out a study to assess the diagnostic accuracy of a test, it is important to select participants carefully, so they are similar to a population of patients for whom the test may ultimately be used. Failure to include an appropriately broad group of participants may lead to the so-called spectrum bias. Further, it is important to use the same gold standard diagnostic test among all participants. Lastly, carrying out the gold standard and test diagnoses separately can eliminate the possibility that one influences the other, which would artificially inflate the estimates of diagnostic accuracy.

When the standard diagnostic test cannot be considered a gold standard (i.e., results from it
cannot be considered the truth), sensitivity and specificity are not meaningful quantities. In this case, one would be more interested in the extent to which results from the new test were in agreement with the standard test. A new test could be helpful if its diagnostic accuracy was similar to the standard, but was advantageous for some other reason (e.g., less expensive, easier to administer, or safer than the standard test). One measure of agreement is the kappa statistic, which typically ranges from 0 (indicative of agreement likely due to chance alone) to 1 (indicative of perfect agreement); negative values are possible when agreement is worse than expected by chance. Interested readers are referred to Woolson and Clarke (2002) [6] and Landis and Koch (1977) [13] for additional details on this statistic.

Correlation and Regression

Describing the relationship between two random variables can lend insight into how they are associated with each other. A measure of the extent to which one variable is linearly related to (or associated with) another is a correlation coefficient. Correlation coefficients can range from -1 to 1. Negative correlation coefficients imply that as values of one variable (e.g., displayed on the x-axis) increase, values of the second variable (displayed on the y-axis) decrease. Similarly, positive correlation coefficients mean that as values of one variable increase, values of the second variable also increase. Correlations of -1 or 1 imply perfectly linear relationships. A correlation of 0 implies that there is no linear relationship between the two random variables. One significant limitation of correlation coefficients is that one random variable may be related mathematically to another, but have a small correlation coefficient because the relationship is not linear (e.g., a quadratic function).

The Pearson correlation coefficient, for which the sample estimate is denoted by the symbol $r$, is appropriate when the random variables are continuous and approximately normally distributed. The Pearson correlation coefficient is a function of the extent to which the two random variables vary jointly (the covariance) as well as the variance of each random variable. The coefficient is defined as

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

It is possible to test the hypothesis that there is a significant linear relationship between the two random variables by testing the value of the population correlation coefficient, $\rho$. An assumption for this test is that the random variables are normally distributed and they have a joint distribution called the bivariate normal distribution. The null and alternative hypotheses are

$$H_0: \rho = 0, \quad H_A: \rho \neq 0$$

The test statistic is

$$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$$

which has a t distribution with $n - 2$ degrees of freedom when the null hypothesis is true. If the null hypothesis is rejected, it is in favor of the alternative hypothesis that the correlation coefficient is not equal to zero, meaning there is a significant linear relationship between the two random variables. Confidence intervals for $\rho$ are useful and can be obtained from statistical software. A note of caution is that cause and effect cannot be established solely on the basis of a statistical association.

When at least one of the random variables is not intervally scaled, but at least ordered (e.g., a rank or count variable), a nonparametric correlation coefficient is more appropriate. The Spearman rank correlation is computed by ranking both of the random variables and calculating the correlation coefficient on the ranks. For large sample sizes ($n > 30$), the hypothesis test of the Spearman rank correlation is based on a test statistic similar to that for the Pearson correlation coefficient.

A statistical method used to describe the relationship between an outcome (or dependent
variable) and one or more independent or explanatory variables (considered fixed) is called regression. Regression techniques use observed data to estimate model coefficients for the explanatory variables that account for the variability in the response. The simplest example is linear regression, for which the dependent variable, often denoted for regression as $Y$, is expressed as a linear function of one or more explanatory variables, denoted as $X$ or $X_1$, $X_2$, etc.

A linear regression model with a single explanatory variable, called simple linear regression, is $y = \beta_0 + \beta_1 x + \varepsilon$. One can obtain estimates of the model parameters for the y-intercept ($\beta_0$) and the slope ($\beta_1$) by fitting a line to a set of observed data points (paired values of $x$ and $y$ for all subjects in the study). The assumptions required for the use of linear regression are that for fixed values of $X$, the distribution of $Y$ is normal (with potentially different means across $X$) and the variance of $Y$ is equal for all values of $X$. The estimates of the model parameters, $\hat{\beta}_0$ and $\hat{\beta}_1$, are used to predict values of $y$ for given values of $x$. The resulting prediction equation is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$. The interpretation of the slope coefficient is that for every unit change in $x$, the change in $y$ is $\hat{\beta}_1$. For values of $x$, the difference between the actual and predicted values, $y - \hat{y}$, is called the residual because this difference represents the variability in the response that is remaining after fitting the model. The best fitting line is the one with the smallest sum of squared deviations between the observed and predicted values (i.e., smallest sum of squared residuals). Hence, the usual method to obtain the model estimates is called the method of least squares.

A hypothesis test may be used to test whether the value of the slope coefficient is different from zero. The corresponding hypotheses are

$$H_0: \beta_1 = 0, \quad H_A: \beta_1 \neq 0$$

If the null hypothesis is rejected, the appropriate conclusion is that there is a significant linear relationship between the independent variable and the dependent variable. The test statistic (and a corresponding confidence interval) for the slope coefficient uses the standard error of the sample estimate, $se(\hat{\beta}_1)$. For the sake of brevity, the exact formulation of the standard error is not included in this chapter. The test statistic is defined as

$$t = \frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$$

which follows a t distribution with $n - 2$ degrees of freedom under the null hypothesis. Other references for this topic note that this t-statistic is identical to that used for testing the null hypothesis, $H_0: \rho = 0$. Likewise, a $100(1 - \alpha)\%$ confidence interval can be constructed as

$$\hat{\beta}_1 \pm t_{1-\alpha/2,\, n-2} \, se(\hat{\beta}_1)$$

Interested readers can find additional details in Schork and Remington (2000) [5].

In a prospective observational study of 202 adults between the ages of 20 and 60, triglycerides and other lipoproteins were tested over a period of several weeks. A linear regression model was fitted to the triglycerides levels as a function of age. The least squares estimates of the y-intercept and the slope yielded the following prediction equation:

$$\text{triglycerides} = 411.2 - 1.80 \times \text{age}$$

So for every year increase in age, triglycerides were lower on average by 1.80 mg/dL. Likewise, for every 10-year increase in age, triglycerides were lower on average by 18 mg/dL. In a test of the slope coefficient for age based on the t distribution, the null hypothesis is rejected at the $\alpha = 0.05$ level, indicating a significant negative linear relationship between triglycerides and age. Kleinbaum et al. (1998) [14] have written a helpful reference for linear regression.

Survival Analysis and Logistic Regression

In many studies, subjects do not participate for the planned length of observation. When researchers are interested in the occurrence of a particular event or not (e.g., death, occurrence of a disease or condition, or onset of a symptom), the outcome
may or may not occur during the period of observation. It is often desirable to utilize the experience of subjects for the time they were under investigation, whether or not they had the outcome of interest. Consider a study of subjects who were newly diagnosed with a fatal disease. One may be interested in the death rate for the 5 years following diagnosis. Some subjects who enter such a study will die while under observation, some will survive the 5-year observation period, and some will withdraw from the study during the middle of the observation period with a last known status of alive. Among subjects who will eventually have the outcome of interest, the occurrence may not be during the period of observation. These subjects are said to be censored at the last known observation time.

Survival analyses are used when the outcome of interest is a binary outcome (event or not), and it is desirable to account for the time subjects are at risk for the event. The survival function, denoted $S(t)$, describes the probability that a subject in the study will survive without having the event past a time, $t$. For example, $S(1 \text{ year})$ is the probability subjects will survive past year 1 without the event. There are a number of statistical techniques used to describe and make inferences about the survival distribution.

One common method is Kaplan-Meier estimation of the survival function. The Kaplan-Meier estimate is constructed by calculating the conditional probability of subjects surviving a time interval (e.g., year 1 to 2) conditional on surviving all previous time intervals (e.g., year 0 to 1). Subjects who have the event or drop out prior to the time interval are not included in the risk set (i.e., they are no longer "at risk") for that time interval and subsequent time intervals. The probability of surviving past a given time is calculated as the product of the probability of surviving the interval among those at risk and the probability of surviving all previous time intervals. The survival function from the Kaplan-Meier estimate is often depicted graphically as shown in Fig. 11.5 for an unfavorable outcome.

The Kaplan-Meier estimate is a step function that drops at each observed event time. One can read off values of the survival function for a given time as follows. In this figure, the survival distribution is plotted against time (days since start of treatment in a clinical trial). In the placebo group on Day 1, the estimate is 0.9, and then it drops down to 0.8 on Day 2. An important property of the step function defined using discrete event times is that it is discontinuous at each event time: the estimate is constant between event times and drops abruptly at the next one.
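The product of conditional survival probabilities described above can be sketched in a few lines of Python. This is a minimal illustration, not the computation behind Fig. 11.5: the follow-up times and event indicators below are made up for the example, and the function name `kaplan_meier` is hypothetical. Subjects censored at a given time are assumed to still be at risk at that time, which is the usual convention.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times:  follow-up time for each subject
    events: 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0  # events plus censored subjects at time t
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            # Conditional probability of surviving time t among those at risk,
            # multiplied by the probability of surviving all earlier times.
            survival *= (at_risk - n_events) / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving  # these subjects leave the risk set
    return curve


# Hypothetical data: 10 subjects, time in days, 0 marks a censored subject.
times = [1, 2, 2, 3, 4, 5, 5, 6, 8, 9]
events = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)

# Median survival time: earliest time at which the estimate is 0.5 or lower.
median = next(t for t, s in curve if s <= 0.5)

for t, s in curve:
    print(t, round(s, 3))
print("median:", median)
```

With these invented data the estimate is 0.9 after Day 1 and 0.8 after Day 2, and it stays constant until the next event time, which is the step behavior described in the text.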
For example, the survival distribution function for the placebo group equals 0.46 on Days 6, 7, and 8 (no events occurred), and then at Day 9, the estimate is 0.35. Looking at the Kaplan-Meier curve for the placebo group, one could read Day 9 as having an estimate of 0.35 or 0.46, but it is appropriate to remember that the outside edge of the step (right at Day 9) is discontinuous, and thus the estimated probability of survival for Day 9 or later is 0.35.

A commonly cited measure of central tendency from the Kaplan-Meier estimate is the median survival time. The median survival time is the value of $t$ beyond which approximately 50% of subjects survive without the event, i.e., $S(\text{median time } t) = 0.5$. Using this guideline, one can read off the median survival times by drawing a reference line across the figure at $S(t) = 0.50$ and finding the earliest value of time on the curve below the reference line. The median times are 6 and 16 days for the placebo and active groups, respectively.

Cohort studies or clinical trials may have the comparison of survival distributions between two or more groups as an objective. This can be accomplished through the use of the logrank test. The logrank test is carried out by treating each distinct event time as a stratum, calculating contributions to a chi-squared test statistic within each stratum, and combining over the strata. The null hypothesis is that the survival distributions are the same. Under the null hypothesis, the observed counts of events would be expected to be similar to the expected counts in each of the groups being compared. Therefore, large deviations between the observed and expected counts at a number of event times will lead to a large value of the logrank test statistic. When two groups are compared, the resulting test statistic from the logrank test is distributed as a chi-squared statistic with 1 degree of freedom. If the null hypothesis is rejected, the conclusion is that the survival distributions did not arise from the same population.

A second analysis approach used with censored data is called Cox regression. For Cox regression, the outcome is not the probability of survival but the hazard, defined loosely as the risk of the event in a small interval of time. The hazard is modeled as a function of one or more explanatory variables (e.g., age, treatment in a clinical trial, baseline severity status). The contrast between the simple linear regression model and the Cox regression model is important to understand, as the model coefficients are interpreted differently in the two cases. The Cox regression models the hazard ($y$) as a function of a single explanatory variable $x$ and is given by

$$y = \beta_0 e^{\beta_1 x}$$

The term $\beta_0$ can be thought of as the baseline hazard for a reference group represented by $X = 0$. If the explanatory variable $X$ is dichotomous (e.g., 1 = hypertensive vs. 0 = normotensive), the baseline hazard represents the hazard for normotensive patients. That is, $X = 0$ implies $y = \beta_0$. Note that when $X = 1$, $y = \beta_0 e^{\beta_1}$. The ratio of these two, $e^{\beta_1}$, is called the hazard ratio, which can be thought of as the relative risk of the event for hypertensive patients compared to normotensive patients. When the explanatory variable, $X$, is continuous, the hazard ratio corresponds to the multiplicative increase in hazard associated with a one-unit change in $X$. Since the hazard for many events changes little with small increments in the explanatory variable, it is often helpful to recode or rescale the explanatory variable. The Cox regression model can be extended to include multiple explanatory variables. An important assumption for the model is that the contribution of the explanatory variable(s) has a constant multiplicative effect on the hazard over time. This is often referred to as the proportional hazards assumption.

As with simple linear regression, fitting the model results in estimates of each of the coefficients and corresponding standard errors. Exponentiation of the coefficient estimate for an explanatory variable results in a hazard ratio expressing the increase in risk for one value of the explanatory variable compared to another while adjusting for other explanatory variables in the model. Confidence intervals and tests for the coefficients can be constructed, which can be transformed to confidence intervals and tests for
the hazard ratio. Care must be taken to code the odds ratio, e b1 , which is an estimate of the rela-
model correctly so the interpretation can be made tive risk of the event for subjects with x = 1 com-
with respect to a meaningful reference group (or pared to those with x = 0. Logistic regression
baseline hazard group) and not an arbitrary one. models can be extended to multiple explanatory
Cox models can be particularly helpful in variables (either categorical or continuous). When
observational studies since the primary interest is an explanatory variable is continuous, the odds
typically in one experimental factor (exposed or ratio is interpreted relative to a unit change in x.
not) while controlling for other potential explana- For example, in a logistic model of coronary heart
tory effects for the response. An introduction to disease as a function of LDL cholesterol, an odds
this topic can be found in a text by Woolson and ratio of 1.02 means that a patient with LDL of 130
Clarke (2002) [6]. A reference at a more advanced has greater risk of developing CHD in terms of a
level has been written by Lee (1992) [15]. $1.02^{10}$ times greater odds and thereby greater risk
A technique called logistic regression is help- of developing CHD than a patient with LDL of
ful when the outcome of interest is dichotomous 120. Standard errors for the estimates may be used
(e.g., death, seroconversion), and the research in the construction of condence intervals for the
objective is to describe how the probability of the odds ratio. Thus, the precision of the sample esti-
outcome is related to one or more explanatory mate can be evaluated, and tests of the hypotheses
variables without accounting for the time at risk. can be carried out to determine if an explanatory
Instead of modeling the probability of outcome variable is signicantly associated with increased
as a linear function of explanatory variables, the risk of the outcome or event. Excellent references
log odds of the outcome is the dependent vari- for logistic regression include Kleinbaum and
p Klein (2002) [16], Stokes et al. (2000) [11], and
able, where the odds is dened as , the Hosmer and Lemeshow (2000) [17].
1 p
probability of outcome divided by the probability
of no outcome. The reason for this choice is that Summary
a probability is bounded by 0 and 1, whereas the
p This chapter has served as an introduction to sta-
log odds or logit, ln
1 p
, is continuous on the tistical methods in medical research. Descriptive
statistics were discussed and are commonly used
scale from negative to positive innity. The logis- to characterize the experience of study subjects
tic model with a single independent variable is and their background characteristics. Inferential
specied as statistical methods, such as condence intervals
p and hypothesis testing, are frequently used to
ln y = b 0 + b 1x
1 p
evaluate observed associations relative to chance
variation in the sampling process. The research
Model estimates can be interpreted in a man- process begins with a research question that moti-
ner similar to that described for the Cox regres- vates a study designed to answer the question for
sion model. If X is a dichotomous variable (e.g., which relevant data are collected. The involve-
gender), the predicted value y = b 0 is the log odds ment of statistics ideally begins at the start of the
of the event for a reference group with x = 0. When research process and concludes with the nal
x = 1, the predicted value y = b 0 + b 1 is the log interpretation of the analyses. Further study of
odds of the event for the group with x = 1. The dif- these topics is encouraged so that readers may
ference of these two is the log odds ratio, b1 . enhance their abilities to interpret results of pub-
Exponentiation of the log odds ratio results in the lished medical literature.
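The arithmetic behind the LDL cholesterol example in the logistic regression discussion above can be checked with a short Python sketch. It takes only the per-unit odds ratio of 1.02 from the text; the standard error (0.005) and the baseline probability (0.10) are hypothetical values chosen for illustration and are not estimates from any study.

```python
import math


def odds_ratio_for_change(b1, delta_x):
    """Odds ratio implied by logistic coefficient b1 for a change of delta_x units in x."""
    return math.exp(b1 * delta_x)


def odds_ratio_ci(b1, se, z=1.96):
    """Approximate 95% confidence interval for exp(b1), built on the log-odds scale."""
    return math.exp(b1 - z * se), math.exp(b1 + z * se)


def probability_from_odds(odds):
    """Convert odds p / (1 - p) back to a probability p."""
    return odds / (1.0 + odds)


# Per-unit odds ratio quoted in the text; b1 is its natural logarithm.
or_per_unit = 1.02
b1 = math.log(or_per_unit)

# LDL 130 versus LDL 120 is a 10-unit change, so the odds ratio is 1.02 ** 10.
or_10_units = odds_ratio_for_change(b1, 10)

# Hypothetical standard error for b1 (an assumption, for illustration only).
ci_low, ci_high = odds_ratio_ci(b1, se=0.005)

# Hypothetical baseline probability of CHD at LDL 120 (an assumption).
p_120 = 0.10
odds_120 = p_120 / (1.0 - p_120)
p_130 = probability_from_odds(odds_120 * or_10_units)

print(f"Odds ratio for a 10-unit increase: {or_10_units:.3f}")
print(f"95% CI for the per-unit odds ratio: ({ci_low:.3f}, {ci_high:.3f})")
print(f"Risk at LDL 130 given 10% risk at LDL 120: {p_130:.3f}")
print(f"Risk ratio {p_130 / p_120:.3f} versus odds ratio {or_10_units:.3f}")
```

Under these assumed inputs the risk ratio comes out somewhat smaller than the odds ratio, which illustrates why the odds ratio is a close stand-in for the relative risk mainly when the outcome is uncommon.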
Take-Home Points
Descriptive statistics are used to summarize individual observations from a study and estimate a typical value (measures of central tendency) and the spread of values (measures of dispersion). Measures of central tendency include the mean and median. Measures of dispersion include the standard deviation and the range.
Hypothesis tests and confidence intervals are two general forms of inferential statistical methods, for which the aim is to make an inference from a sample of subjects to a relevant population.
Confidence intervals represent a plausible range of values for a population parameter, such as the difference in mean response, the difference in proportions, or the relative risk.
p values are reported from hypothesis tests. Small p values (e.g., <0.05) suggest that the observed result was unlikely to have occurred by chance alone.
There are many statistical methods which may be appropriate for any given research study. The most appropriate statistical approaches must consider the research question and the study design.
P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology: A Guide for Clinical Investigators, 233
DOI 10.1007/978-1-4614-3360-6_12, Phyllis G. Supino and Jeffrey S. Borer 2012
234 E.A. Friedman
treatment decisions on the basis of perceived ethical obligation to never cause the death of any patient. Contrary to this view is the belief that a physician holds an ethical obligation to relieve pain even if the patient dies as a consequence of the advocated treatment. (One of the most widely known modern examples includes the views of Dr. Jack Kevorkian [8–11]).
It was Celsus, a Roman encyclopedist, who is thought to have been the first to consider the rights of subjects under experimentation [12]. He spoke strongly against procedures such as vivisection on condemned criminals in Egypt, calling physicians who performed them "assassinating medical practitioners" [13]. Though it certainly is the case that both the ethics regarding human subjects research and regulations for such research have evolved substantially since the time of Celsus, his belief that medical practice should be a work of mercy as opposed to one of dire cruelty laid the ethical foundation for human subjects research long ago, eventually becoming the moral standard by which such research is judged today.
One would be hard-pressed to challenge the influence of Confucius and Hippocrates on modern-day medicine. More than 2,000 years have passed since their deaths, dynasties have risen and fallen, and religious figures, revolutions, and explorations have led to vast changes in virtually every aspect of civilization. Yet, as noted above, to this day, most graduating medical students in the United States of America (USA), Canada, and in certain other parts of the world recite some form of the Hippocratic Oath, and current US federal legislation incorporates principles identified by Confucianism as central to the practice of medicine. Thus, society continues to acknowledge the importance and relevance of the ancient Greek and ancient Chinese teachings on medical ethics, both of which championed one particular concept above all others: the veneration of human life, today termed benevolence. From Hippocrates forward, all guides to the ethical practice of medicine included this concept [14]. Although benevolence in medicine implies that physicians should do everything in their power to ensure no harm is done to the patient, hundreds of incidents, as sampled in Table 12.2 [15–21], reflect efforts to exploit availability of prisoners, slaves, impoverished adults, and even children in sometimes
Performing heart catheterizations in patients who believed that they were to have bronchoscopy
Assigning patients with life-threatening diseases to placebo control groups, where effective treatments were known to be available
Randomizing US soldiers suffering from streptococcal pharyngitis to penicillin versus treatments known to be ineffective.
While Beecher's article drew attention in its own right, his crusade gained remarkable steam through the publicity generated by a 1972 New York Times article. Whistleblower Peter Buxtun revealed the shocking truths behind the Tuskegee study to the paper, which subsequently published "Syphilis Victims in US Study Went Untreated for 40 Years" as its front-page headline on July 26, 1972 [32]. When the study was terminated in 1972, congressional hearings were held to address the matter of ethical conduct in human investigation.
The National Research Act of 1974 was passed in the USA as a direct response to these above-mentioned ethical abuses (especially the revelation of the Tuskegee experiment) [33]. Through the act, Congress called for the establishment of the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research [34], which was charged with the tasks of identifying key ethical issues to be addressed by researchers and injecting clear ethical practices into human subjects research that would help assure the public of the safety of medical research and avoid future atrocities. Following Beecher's disturbing portrayal of extreme overriding of patient rights in medical investigation by US investigators and the rules established by the 1974 Research Act, additional reports were published that recounted instances of exposing subjects, without their consent, to radiation, infectious agents, or injection of cancer cells. Of the responses generated, perhaps the single most important resource used as a basis for governing both the practice of medicine and conduct of research involving human subjects was The Belmont Report [35], released by the commission in 1979, which established:
Boundaries between clinical practice and otherwise unneeded research
Basic ethical principles to be preserved during all research studies (respect for persons, beneficence, and justice)
Fundamental applications (guidelines for informed consent, assessment of risk and benefits, and selection of subjects).
Notably, we find that The Belmont Report [35], a document created late in the twentieth century in a highly developed Western nation, presents morality-driven guidelines similar to those of ancient Confucian ideology and Hippocrates. In Part B: Basic Ethical Principles of the report, "respect for persons" asserts the importance of respecting an individual's autonomy and protecting those persons with diminished autonomy, "beneficence" requires that actions do not cause harm and that treatments aim to maximize potential benefit while minimizing risks, and "justice" entails considering various factors in determining the fairness in distribution with regard to the benefits and risks of human subjects research.
In the decades both leading up to and following the release of the Belmont Report, the USA undertook a substantial review and overhaul of federal regulations in human subjects research. A chronology of key events is provided in Table 12.4.
The Genesis of Institutional Review Boards in the USA and Their Regulatory Role
With the guidance of The Belmont Report, the US Department of Health, Education, and Welfare (now the Department of Health and Human Services [HHS]) established requirements for the development of Institutional Review Boards or IRBs [36]. (IRB is a generic term used by governmental agencies, but each institution that establishes an IRB may maintain any name to describe such a board.) As a general rule, the role of the IRB is to regulate human subjects research by advocating, upholding, and maintaining the
12 Ethical Issues in Clinical Research 239
Table 12.4 Post-World War II developments aimed at protecting human subjects in research
1947: Nuremberg Code defines subject-centered principles for ethical human subjects research in response to
unethical medical experimentation by the Nazis during WWII [25].
1964: World Medical Association adopts the Declaration of Helsinki, defining new guidelines for human subjects
research (last revised in October 2008) [28].
1965: A speech addressing problems in clinical research is given by Henry Beecher, M.D., to journalists
assembled by the Upjohn Pharmaceutical Company and draws attention nationwide through prominent
media outlets [29].
1966: Henry Beecher, M.D., publishes "Ethics and clinical research" in The New England Journal of Medicine,
expressing concern over the potentially vast impact of unethical procedures in clinical research, and
referencing 22 studies without explicitly identifying the studies or investigators [30].
1972: Tuskegee whistleblower Peter Buxtun contacts the Associated Press with information on the study, leading
to The New York Times July 26, 1972, article, "Syphilis Victims in U.S. Study Went Untreated for 40 Years;
Syphilis Victims Got No Therapy"; the study is terminated that same year [32].
1973: Congressional hearings are held to address human experimentation primarily in response to the Tuskegee
revelations [33].
1974: The National Research Act is created, establishing the National Commission for the Protection of Human
Subjects of Biomedical and Behavioral Research [34].
1979: The Commission releases The Belmont Report, identifying relevant ethical principles and guidelines for
human subjects research [35].
1981: Human subject regulations are amended to provide a common framework within which Institutional
Review Boards (IRBs) can review human subjects research [36].
1991: Regulations for the protection of human subjects are codified under Title 45, Part 46 of the Code of Federal
Regulations; Subpart A is accepted by 17 US Federal Agencies as the Common Rule [38].
rights and welfare of humans participating in research. IRBs are universally engaged in all health and social science studies funded by the National Institutes of Health (NIH) and HHS. Such studies include, but are not limited to, clinical trials of new, novel, or repurposed devices or drugs regulated by the Food and Drug Administration (FDA); investigations of behavior, opinions, and attitudes; or studies on health-care management.
In 1991, the US Federal Policy for the Protection of Human Subjects was published in the Federal Register (56 FR 28003) and incorporated into the regulating codes of 17 Federal departments [37]. The policy, known as the Common Rule, provides specific direction for the operations and regulation of IRBs, outlines requirements for obtaining informed consent, and requires written assurance of institutional compliance with federal research regulations. The policy was codified by HHS as Title 45 Code of Federal Regulations [CFR] Part 46 Subpart A, Basic HHS Policy for Protection of Human Research Subjects [38]. It later was codified by the FDA in 21 CFR 56.107, which covers FDA oversight of drugs and medical devices.
The Common Rule requires that IRBs approve and oversee all human research supported directly or indirectly by what is today known as HHS. It is within the purview of the Office for Human Research Protections (OHRP) within HHS to regulate all IRBs, but today, all IRBs also are subject to additional governmental organization (e.g., FDA) regulations. Similar regulatory boards have been in place for animal research since the enactment of the Laboratory Animal Welfare Act of 1966, like the Institutional Animal Care and Use Committees, which may be considered IRBs for nonhuman research subjects. For a broader discussion of ethical issues considered in preclinical research, which is beyond the scope of this chapter, the reader is referred to Animal Experimentation: The Moral Issues (Baird and Rosenbaum, 1991) [39] or Animal Experimentation: A Guide to the Issues (Monamy, 2000) [40].
Historically, academic institutions and medical facilities created their own IRBs to oversee
human subjects research, specifically to avoid or limit ethical problems in such research. In this era, there are additional for-profit independent or commercial IRBs that institutions may choose to contract out to monitor their research; their role, accountability, and composition are no different than that of traditional IRBs. In brief, all IRBs must contain at least five members, chosen in a nondiscriminatory fashion, with sufficient expertise to judge the scientific merit of each proposed protocol and to assess whether the rights of the subjects are properly safeguarded. In its early days, concerns were raised over the relatively homogeneous composition of IRB membership. In response, the HHS Common Rule provided regulations in 45 CFR 46.107 designed to ensure satisfactory and unbiased review of clinical research projects by requiring diversity of IRB members with regard to their field of expertise, affiliations, experience, gender, race, and cultural background [38]. A majority of the members must be present for voting to take place, at least one of whom is a nonscientist, and IRB members may not vote on their own projects [38, 41]. As necessary, IRBs can invite nonvoting content experts to assist in the review process [38].
IRBs must review all research protocols and related materials (e.g., informed consent documents and promotional fliers) to ensure that proposed investigations are ethically conducted. For example, they must determine that patients are properly selected, that the proposed protocol is designed so that valid inferences can be drawn, that subjects are fully informed about the risks and benefits of the study, and that their participation is entirely voluntary (or, for special patient populations [e.g., those with dementia, mental retardation, severe neuropsychiatric disorders], that informed permission is appropriately obtained by proxy). Their role is to maximize safety in the delicate balance between risk and benefit for subjects once they are enrolled in research.
Each IRB must advocate and uphold the interests of all research subjects. Such advocacy includes protection of the future interests of subjects, especially in situations involving tissue storage. Clearly, future technologies may arise that can yield potentially valuable new data. However, these findings may pose considerable unexpected risk to subjects, especially if that information was later revealed and linked back to the subject.
The underlying concern for governing bodies regulating clinical research is the level of risk posed to human subjects. As such, the cornerstone for virtually all IRB operations is the evaluation of risks to study subjects. Beginning at the earliest stages of application for study approval, the nature of identified risks to human subjects in a research study directs the procedures for IRB review and approval. For example, the level of risk (a concept to be further described momentarily) is a significant factor in determining whether a research study qualifies for exempt status or expedited review or requires full-committee review. It should be noted that certain types of research protocols (as mentioned below) may qualify for exempt status with regard to IRB review. Clinical research studies considered exempt by the IRB are additionally absolved from standard informed consent requirements unless the research involves protected health information, in which case patient authorization or IRB waivers of authorization must be obtained for each subject [42].
For the purposes of IRB review, there are three levels of risk to which subjects can be exposed in any given research study: less than minimal risk, minimal risk, and greater than minimal risk [43]. Studies that involve less than minimal risk include those that pose no known physical, emotional, psychological, or economic risk to subjects. Such studies may be deemed exempt from IRB review and, therefore, would not require review by an IRB committee member.
As stipulated in 45 CFR 46.101(b), a proposed investigation may be classified as exempt (unless otherwise mandated by a department or agency head) if it limits involvement of human subjects to one or more of the following categories: (1) educational practices and assessments (e.g., comparing two or more teaching methods), (2) interviews or observations of public behavior, and (3) studies of public data or specimens without accompanying information that might permit subject identification [38]. Also exempt is
recommend that the PI submit a formal application to his or her local governing body for review by an IRB chair or other senior IRB administrator in order to permit their assessment. (Research studies that have been determined to be exempt from IRB review will retain this status unless the conditions of the study have changed, at which time the study should be resubmitted to determine whether such changes affect risk to subjects and level of required review.) If the research study does not qualify as exempt, the IRB, based on information furnished by the PI, must determine whether the study poses minimal risk or whether it poses more than minimal risk. Minimal risk studies may qualify for expedited review if they fall under one of the categories previously described [38]. If the study does not qualify for expedited review or if it is determined to pose greater than minimal risk to potential subjects, the protocol must undergo full-committee review. Though the PI may participate in the process, the final decision on category of risk and level of review ultimately is governed by the IRB chair or his or her designee(s).
The IRB considers a number of complex issues in its review of proposed protocols. The impact of the study design on human subjects is evaluated with careful attention paid to any protocol implementing deception or withholding of information. Deception is a particularly complex issue in human subjects research due to the extensive federal regulations regarding informed consent and disclosure of information. The IRB conducts an extensive assessment of risks and benefits and may require additional safeguards to be implemented. It also may examine the selection of subjects, evaluating both inclusion and exclusion criteria and ensuring that the process is free of coercion. The IRB considers the planned methods for identification of research participants and associated procedures in place to protect the privacy of study subjects. Its members evaluate the process for obtaining informed consent and thoroughly review the informed consent forms as well as any other documents or devices that will be introduced to study subjects or will be used in recruitment. The qualifications of all investigators on the protocol are to be considered, as are potential conflicts of interest. Finally, the IRB will determine if the study warrants additional reviews during a one-year period. The protocols for all ongoing research studies are considered to be undergoing continuing review and are required to be reviewed at least annually [45].
In addition to previously cited requirements, all IRBs must review any amendments, including updates to any research-related forms, along with any other documents that the IRB deems necessary to protect potential human study subjects. (There may be local variation in the order in which an IRB verifies the propriety of proposed and/or ongoing research of human subjects.)
In 1996, the International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use Guideline for Good Clinical Practice (ICH-GCP) [46] established additional guidelines for IRB oversight of clinical trials, which later were adopted by the FDA [47]. In this context, clinical trials are defined as studies that involve investigational products. In addition to the general procedures discussed previously for human subjects research studies, the requirements set forth in the Guideline for Good Clinical Practice mandate that ongoing IRB reviews of clinical trials must include proposed drug and device safety documentation.
The role of the IRB comprises far more than extensive document review. The Belmont Report cited above states that research on human subjects must ethically address beneficence, respect for persons, and justice [35]. This edict can be fulfilled only when the IRB approves research that fully informs subjects (or when necessary, their proxies) about the risks of the study before they provide consent for participation in the research.
IRBs are required to provide special attention to proposed studies of persons with diminished comprehension, pregnant women, prisoners, the elderly, or children. In his well-focused Lancet review of David Wendler's book on The Ethics of Pediatric Research [48] (provoked by intended use of children as the subjects of investigation), Peter Singer poses the daunting question: Is it
ever ethical to do research on human subjects without their consent? [49]. Arguing that research with children is justifiable with parental consent, Singer bases this inference on his belief in the subject's inherent altruistic desire to benefit others, an objective that, in the broadest interpretation, implies that contributing to a significant research project is an accomplishment of ultimate value to the contributor. While the question remains hotly debated, Singer suggests that parents should be able to give consent for their child to enroll in a well-designed study of an important question despite the fact that doing so involves momentary pain and, in good medical practice, a risk that is greater than zero, but still extremely small [49]. In sum, current ethical standards in the USA permit parent-approved research on children when the risk of harm to the child is minor and potential benefit to others is likely or in situations when no alternative mechanism exists to attain those benefits.
IRBs have not been free of criticism, even in this new millennium. As late as 2010, Hall, Friedman, King et al. noted that academic medical center IRBs and conflict of interest committees usually are not involved in reviewing research budgets to determine whether per capita payments are excessive [50] (italics added); in certain circumstances (to be described later), excessive payments may be seen as undue inducement for participation in the study. In addition, due to what is perceived to be misunderstanding of specific social science research methods (e.g., ethnography, oral histories) by many IRB members, some social scientists have argued that current regulation of social science research is insufficiently flexible; they believe that current regulatory requirements (e.g., lengthy and/or complicated consent forms) are overly burdensome in light of the fact that social science studies generally pose only limited risk to subjects. In an attempt to address these concerns, the OHRP, in conjunction with the Oral History Association (OHA) and the American Historical Association (AHA), stated in 2003 that investigative procedures (e.g., oral histories, collection of anecdotes, unstructured interviews, and other related methods) often do not constitute human subjects research as commonly defined [51]. A year later, problems persisted, causing the representatives from the OHA and AHA to issue a reaffirmation of their 2003 statement [52].
HIPAA, the Privacy Rule, and Preparatory to Research Activities
In the US, IRBs took on additional tasks following the enactment of the Health Insurance Portability and Accountability Act (HIPAA) of 1996 [53], passed by Congress to improve portability and continuity of health insurance coverage in the group and individual markets, to combat waste, fraud, and abuse in health insurance and health care delivery, to promote the use of medical savings accounts, to improve access to long-term care services and coverage, to simplify the administration of health insurance, and for other purposes [53]. To accomplish these tasks, the act called for a vast overhaul of the methods used to transmit medical information, including a shift toward standardized electronic transmissions. HIPAA has been modified a number of times since its enactment in 1996 [54–56], and though initially the act most evidently applied to health-care providers and health-care plan providers, extensions of the act have had a significant impact on clinical research. Most notably, a provision was made requiring compliance with the HHS-issued Standards for Privacy of Individually Identifiable Health Information, known as the HIPAA Privacy Rule, by most covered entities as of April 2003. HHS provides the following statement on covered entities with regard to research:
Covered entities are health plans, health care clearinghouses, and health care providers that transmit health information electronically in connection with certain defined HIPAA transactions, such as claims or eligibility inquiries. Researchers are not themselves covered entities, unless they are also health care providers and engage in any of the covered electronic transactions. If, however, researchers are employees or other workforce members of a covered entity (e.g., a hospital or health insurer), they may have to comply with that entity's HIPAA privacy policies and procedures. [57]
The purpose of the HIPAA Privacy Rule is to regulate the use and disclosure of certain individually identifiable health information, termed protected health information (PHI). An individual's PHI includes information pertaining to (1) his or her past, present, or future physical or mental health or condition, (2) the provision of health care to the individual, and (3) the past, present, or future payment for the provision of health care to the individual [58]. An individual's genetic information also is considered to be PHI. PHI is protected under the Privacy Rule when it contains information that possibly could be used to determine the identity of the individual. It is possible to deidentify PHI by removing certain information pertaining to the individual. The information that must be removed in order to deidentify an individual's PHI is listed in Table 12.5.
Just as medical professionals must maintain the security and privacy of their patients' PHI, so too must clinical researchers who are covered entities, work for covered entities, or who obtain data from covered entities (as defined above). The legislation established to regulate the transmission of PHI significantly impacts the clinical researcher in two specific ways: (1) a subject's PHI must be obtained and used in a manner deemed permissible by the Privacy Rule, and (2) activities that are considered to be preparatory to research and that involve the review of PHI must be carried out in accordance with specific guidelines [57]. The ways by which a covered entity may use or disclose an individual's PHI for research purposes are outlined in Table 12.6.
Table 12.6 Conditions permitting the use or disclosure of PHI for research by covered entities
If the subject of the PHI has granted specific written permission through an Authorization that satisfies section 164.508
For reviews preparatory to research with representations obtained from the researcher that satisfy section 164.512(i)(1)(ii) of the Privacy Rule
For research solely on decedents' information with certain representations and, if requested, documentation obtained from the researcher that satisfies section 164.512(i)(1)(iii) of the Privacy Rule
If the covered entity receives appropriate documentation that an IRB or Privacy Board has granted a waiver of the Authorization requirement that satisfies section 164.512(i)
If the covered entity obtains documentation of an IRB or Privacy Board's alteration of the Authorization requirement as well as the altered Authorization from the individual
If the PHI has been de-identified in accordance with the standards set by the Privacy Rule at section 164.514(a)–(c) (in which case, the health information is no longer PHI)
If the information is released in the form of a limited data set, with certain identifiers removed and with a data use agreement between the researcher and the covered entity, as specified under section 164.514(e)
Under a "grandfathered" informed consent of the individual to participate in the research, an IRB waiver of such informed consent, or Authorization or other express legal permission to use or disclose the information for research as specified under the transition provisions of the Privacy Rule at section 164.532(c) [57]
As mentioned, the Privacy Rule has had a substantial impact on activities preparatory to research. Activities preparatory to research include reviews of data that enable researchers to
determine whether or not it would be purposeful or reasonable to pursue a particular research study.
This may include reviewing medical records to determine whether or not there are enough potential
subjects to be able to carry out the study. Such activities also may be used to allow the researcher
to identify potential research participants for recruitment purposes and to contact potential study
participants. Each of these activities must be carried out in accordance with particular
requirements. For example, a covered entity may allow a researcher to review PHI, but they may not
permit the researcher to remove any PHI from the covered entity. Additionally, the researcher would
not be permitted to contact a potential study participant based on the PHI reviewed without the
researcher being a workforce member of the covered entity or without the researcher securing proper
documentation of a waiver of authorization from the IRB or Privacy Board [59].
The regulations previously discussed specifically address the researcher's ability to obtain and
utilize a subject's PHI; however, there are additional directives under the Privacy Rule that
stipulate the handling of PHI beyond the permissibility of transmission. The rules for maintaining
privacy and security include written privacy procedures in which a privacy officer, who is
responsible for upholding such procedures, is designated. It must be clearly stated who has access
to specific private health information and how to modify levels of accessibility. Appropriate
training must occur on a scheduled and ongoing basis for all persons with access to PHI. Research
information must be securely backed up in case the original information is lost or corrupted in an
emergency.
A key guideline for ensuring the privacy of PHI is to transmit only the minimal amount of
information necessary. Any equipment used for research or patient management that contains PHI must
be monitored and protected from unauthorized access. With the growth of digital information
systems, any PHI that is sent over an open network must have adequate encryption, but there is some
leeway with regard to PHI sent via closed networks. In the case of closed networks, encryption is
optional and the existing network access controls are considered sufficient. Safeguards must be in
place for any third parties to uphold the same level of security and privacy with regard to PHI.
Plans for audits of these procedures to make sure problems are clearly identified and rectified are
required.
Additionally, plans must be in place for responding to breaches of private information. Breaches of
PHI generally are defined as the unauthorized acquisition, access, use, or disclosure of protected
health information which compromises the security or privacy of such information [56]. In the event
that a breach occurs, all individuals whose PHI may have been inappropriately disclosed (or their
next of kin, if the individual is deceased) must be informed. Furthermore, a notice of breach is to
be listed on the affiliated institution's website or disseminated through a major media outlet.
Cases in which a large number of individuals (500 or more) have been affected require that the
secretary of HHS also be notified. Affected parties should be informed as to what they can do to
further protect themselves after a breach occurs [56].
Human Research Requires Informed Consent
As mentioned previously, the informed consent process for human subjects is a cornerstone of
ethical standards in human research. It is important to note the distinction between authorization,
as discussed in relation to the HIPAA Privacy Rule, and informed consent. Authorization is written
permission from an individual permitting the disclosure and/or use of his or her PHI for research.
Informed consent is an individual's permission to participate in research.
To the extent possible, one must receive clearly stated information explaining the study's purpose,
methods, risks, benefits, and alternatives to research in order to be considered an informed
subject [60]. However, violations of this precept occasionally occur even in developed societies.
A particularly horrific example was given in a 2005 paper [61] that described how a mother learned,
after the death of her baby, that the child
had been buried without its heart by the staff of the Bristol Royal Infirmary in the United Kingdom
(UK). This was done without her knowledge and consent so that tissue samples could be used for
future research by investigators.
After the subject has been adequately informed, it is up to him or her to decide whether to
participate in the study. If the research undertaken is to be considered ethical, it is imperative
that this decision be completely voluntary. The subject must be able to freely decide not only
whether to begin participation at the outset but also whether to continue participation after the
study has commenced. An important point often overlooked during this process is that the subject
must fully understand the conveyed information. Otherwise, the decision made may not reflect the
true wishes or interests of the individual. However, in cases where the potential subject is a
child, an unconscious adult, or an individual of otherwise limited mental capacity, informed
consent from the individual is not required. In these instances, consent is obtained instead
through a proxy (a decision maker who is empowered to ensure that the subject's involvement in the
study is consistent with his or her values, beliefs, and interests). In this way, the decision that
is ultimately made will most closely represent what the subject would have willfully done if he or
she had been able to render a decision [60].
Fundamental to the process of informed consent is the concept of respect for potential and enrolled
subjects. It is important that enrolled subjects be treated with respect from the time they are
approached to be in the study to the time their participation has ended. Likewise, individuals who
decline to participate nevertheless should be treated with respect throughout the entire
recruitment process. Respect for subjects entails not only respecting their decisions and keeping
private information confidential but also disclosing new information (e.g., novel risks and
benefits that might emerge during the course of the study and affect their willingness to
participate), monitoring their well-being to prevent and treat adverse effects, and informing them
about what was learned from the research [60].
The responsibility to maintain the integrity of the processes of communicating details of a study
to participants rests with the PI and all associate investigators who personally interact with the
subject. This is to ensure that the subject (or his or her proxy) understands what is being
proposed and comprehends any and all known potential adverse consequences that could arise from his
or her participation. In other words, responsibility for obtaining consent should not be delegated
to subordinates.
Consent must be obtained in a noncoercive and fully voluntary manner, avoiding the fraud of
Tuskegee (cited previously) and the horrors of Nazi experimentation as a prelude to murder. As it
is always the ultimate responsibility of the investigators to ensure that their research is
properly conducted, they must remain alert (even if, as noted above, IRBs are not) to the reality
that excessive payment to research subjects might be coercive. While compensation to subjects is
generally viewed as an acceptable way of covering their expenses and rewarding them for their time
and effort related to the study, the use of relatively large incentives to facilitate recruitment
may comprise, in certain circumstances, a form of undue influence by inducing the individual to
accept seemingly irresistible offers against his or her better judgment [62].
A striking example is the series of experiments conducted at the Willowbrook State School, in which
parents were asked to enlist their retarded children in a research project that required them to be
infected with hepatitis [62]. As incentive, the child was offered a place in a residential
treatment facility that otherwise would have been difficult to secure. It is not hard to see that
such an incentive, as an attempt to induce parents to overcome their hesitation about the study by
appealing to their concern for their child's treatment, is ethically unsound.
Those in favor of subject compensation argue that compensating subjects for participating in
research is no different than paying people for working. As McNeill has noted, however, unlike
work, experimentation on human subjects inherently exposes people to unnecessary risks of harm,
risks that cannot be known in advance [63]. Therefore, while a completion bonus for a relatively
harmless research study usually poses no ethical problems and is, in fact,
a commonly employed method for emphasizing the importance of full commitment to the study, caution
should be exercised when the research might be painful or distressing for the subject; in cases
such as these, compensation may be seen as undue influence, seductively pressuring the subject to
accept conditions they would otherwise deem unreasonable or aversive.
Investigators should always bear in mind that inequalities in authority between investigator and
subject persist even after informed consent is given, creating potential threats to autonomy [64].
Certain strategies customarily are employed to minimize the impact of such potential vulnerability.
For example, while consent for participation in a clinical research study may include agreement to
certain pre- or postintervention procedures, subjects still retain the right to discontinue their
participation at any time, even when their treating physician or a consulting physician for the
study believes it may be life threatening for the subject to withdraw from the study [65].
Self-Experimentation Guidelines
Defined as the special case of single-subject scientific experimentation in which the experimenter
conducts experiments on himself or herself, self-experimentation usually means that the designer,
operator, subject, analyst, and ultimate user of resulting information are all the same person.
Lawrence K. Altman has catalogued numerous instances of physician investigators who opted to first
expose themselves to the risks of a new technique or therapy [66]. Included is Karl Landsteiner,
whose pursuit of what would be named the ABO blood groups repeatedly depended on blood samples
drawn from himself and five members of his staff. Similarly, invasive cardiology was pioneered in
Germany by Werner Forssmann, who would eventually receive the Nobel Prize in Physiology or Medicine
following years of self-experimentation he performed by catheterizing his heart numerous times
[66]. Another significant example of self-experimentation was an experiment conducted by Barry J.
Marshall [67]. In order to confirm that Helicobacter pylori (H. pylori) caused gastritis and
predisposed to peptic ulceration even in patients with a healthy mucus lining, Marshall volunteered
to ingest a sample of H. pylori. After he developed the characteristic symptoms of gastritis, it
was shown that ingested H. pylori is able to colonize completely normal gastric mucosa and lead to
the acute inflammatory changes collectively referred to as acute H. pylori gastritis [67].
Current federal regulations, however, do not distinguish between self-experimentation and
experimentation on subjects recruited for a specific project. Clinicians may feel that if they are
experimenting with their own bodies, then as doctors, they are cognizant of all the risks and may
consider circumventing the IRB approval process altogether. However, as a general rule, IRBs
require prior submission and approval of an application detailing all aspects of any study
incorporating self-experimentation before it starts. The rationale for IRB approval is the concern
that overly zealous investigators may subject themselves to inappropriate, unnecessary, and
unforeseen risk without the IRB's oversight. As an example, proper IRB oversight would protect an
investigator, with early signs of Huntington's disease, from self-experimenting with a promising
drug undergoing early animal trials for safety and efficacy that ultimately may cause more deaths
than standard-of-care treatment. Control of self-experimentation is a delicate issue since respect
for each individual's right of autonomy is a key feature of federal governance via IRBs.
Scientific research is, of course, not the only context in which people are likely to expose
themselves to potentially harmful situations. In a free society, individuals can daily engage in a
wide range of risky behaviors at their own discretion. For example, individuals may willingly have
unprotected sex, maintain an unhealthy diet, consume alcohol in excessive amounts, or ride a
motorcycle without wearing a helmet for protection. However, if a research study requires the
individual to engage in a risky activity due to the research, it obligates the investigators (with
IRB oversight) to, truthfully and without restriction, fully inform each potential research subject
of all aspects of an intended study, including risks, which the candidate would not have assumed
had the research not been performed.
Following more than a decade of contentious debate over the validity of the study (during which
countless parents considered the much discussed link between the MMR vaccine and autism when
deciding whether or not to vaccinate their children), the paper was retracted in February 2010.
Revelations of ethical misconduct in Wakefield's study included (1) nine of the children were
reported as having regressive autism, but a third lacked any autism diagnosis, and only one child
actually showed clear signs of the condition; (2) five of the 12 children described as being
previously normal actually had documented preexisting developmental concerns; (3) the immediacy of
the onset of symptoms following MMR vaccination was greatly exaggerated in some instances; (4)
following a medical school research review, the diagnosis for nine of the children was changed from
"unremarkable" to "nonspecific colitis"; (5) while 11 families actually alleged the MMR vaccine
caused their children's symptoms, three late cases were intentionally omitted in order to create
the false impression of a 14-day window between vaccine exposure and symptom onset; and (6)
recruitment and funding aspects of the study correlated closely to anti-MMR programs, accounting
for substantial grounds for conflict of interest claims [71]. It also was revealed that Wakefield
profited from a future lawsuit against the patent holders of current vaccines. Wakefield and John
Walker-Smith, the senior clinician involved in the study, were subjected to the UK's longest
General Medical Council Fitness to Practice Hearing and were eventually struck off the medical
register [71].
In 2009, Scott Reuben, M.D., previously a renowned anesthesiologist and pain management
investigator, published flagrantly fraudulent findings from studies that he performed without the
approval of his own institution's IRB, going so far as to fabricate patient data and to forge the
name of a colleague in order to list him as a coauthor on a publication [72-74]. In the aftermath
of that scandal, Dr. Reuben lost all credibility in his field, has served jail time for health-care
fraud, and a large fine was levied against him by a US federal court [75].
In his article published in the Cleveland Clinic Journal of Medicine, James G. Sheehan cited four
additional cases of ethical misconduct in research throughout the past decade [76]. Included was
the case of Dr. Eric Poehlman who was sentenced to one year in prison in 2006 for falsifying and
fabricating research data for a study on menopause and metabolism. Also in 2006, Elizabeth Goodwin,
a University of Wisconsin professor, resigned following the revelation that she made false
statements in her genetics research. Dr. Gary Kammer resigned from Wake Forest University in 2005
when it was discovered that he had fabricated families in his NIH grant application, this a year
after Harvard professor Ali Sultan resigned due to false information in his own grant application
[76].
The previous examples are just a few selected cases of misconduct, with many more cases reported in
the literature about falsification of data, plagiarism, research conducted without proper consent,
undisclosed conflicts of interest, and much more [77]. It is difficult to calculate how much
research funding has been squandered and how much harm has been caused to the public health by
generating and advancing fraudulent findings.
Final Thoughts and Closing Unanswered Moral Research Dilemmas
The vast regulations, protocols, and governing bodies developed over the course of history to
protect the ethical integrity of clinical research are evidence that the issue is a cornerstone of
human subjects investigation. While current legislation provides answers to many of the questions
that may be posed today regarding the ethicality of research activities, it is important to keep
two considerations in mind: (1) It is often the case that as societies evolve, so too do the
standards of appropriateness governing the nature of principles, and (2) the passage of time will
inevitably force workers in the field of clinical investigation to take into consideration issues
or concerns that simply could not be projected as possibilities at an earlier time. Mentioned below
are some questions that can, and should, be asked by clinical researchers in this era. Is
investigation of one's self ethically appropriate? Is any age too old for subjects in an invasive
biopsy study such as kidney, lung, or major vessel transplantation or replaceable device? For an
organ transplant study, should young candidates be selected before geriatric candidates? In studies
allocating an expensive and limited therapy (e.g., bone marrow, heart, or kidney transplant) should
individuals in advantaged positions be accepted into a research protocol ahead of people not so
advantaged? Must undocumented noncitizens be excluded from innovative, experimental, or potentially
life-sustaining therapy that may be scarce or expensive? Are women to be approached for research on
an equal basis with men? Is it reasonable to include race and religion as inclusion/exclusion
criteria for study candidates? Is HIV infection a reasonable exclusion criterion for a study of an
experimental surgical procedure? Should absence of insurance coverage or being impoverished (and
thus, in both cases, inability to pay for standard care that may not be covered by a research
grant) dictate exclusion from a research protocol that might provide beneficial therapy? Should
prisoners be excluded from recruitment? In an experimental life-sustaining device (e.g., aortic
balloon pump or a hypothermia catheter) study of coma patients after resuscitated cardiac arrest,
if the study subject fails to respond to the experimental device, who decides to discontinue use of
the device (e.g., the patient, family/proxy decision maker) and when should that decision be made?
How should a subject's nonadherence to a protocol [78], hostility to staff, or criminality [79] be
managed? (e.g., is it ethical to withdraw therapy or to consult with psychiatry, social services,
administration, lawyers, clergy, family members or friends, or members of the Ethics Committee
under these circumstances?) Sensitivity to the need for respect, autonomy, and dignity of
individuals subjected to investigation in these types of situations allows researchers to detect
and correct deviations from appropriate conduct in modern human research.
Take-Home Points
From the earliest prebiblical writings to modern day, concern for and debate on the
appropriate conduct by caregivers toward patients has been a central theme of appropriate
(ethical) medical practice.
Resulting from awareness of World War II German atrocities performed on prisoners, the
mentally deficient, and defenseless civilians, the Nuremberg Code and Belmont Report were
devised to protect patients and society from inappropriate assault on their body and psyche,
later to be followed by regulations regarding the importance of patient privacy.
Central to acceptable ethical behavior in human research are three main principles: respect
for persons, beneficence, and justice.
When possible, a fully informed written consent based on protocol comprehension must be
obtained and preserved from each subject.
With reservation and caution, parental consent may be sufficient for child participation in a
study of low risk but potential importance to society.
Currently, international guidelines for ethical human research require prior approval of
research protocols by an Institutional Review Board (IRB). The IRB must document its
views in writing, clearly identifying the trial being assessed, which documents were
reviewed, and the dates of its reaching decisions for approval, disapproval, or need for
restructuring.
The US National Institutes of Health (NIH) names the principles governing acceptable human
research: social and clinical value, scientific validity, fair subject selection, favorable
risk-benefit ratio, independent review, informed consent, and respect for potential and enrolled
subjects.
69. Definition of Research Misconduct. HHS, Office of Research Integrity.
http://ori.hhs.gov/misconduct/definition_misconduct.shtml. Accessed 15 Sept 2011.
70. Wakefield AJ, Murch SH, Anthony A, Linnell J, Casson DM, Malik M, Berelowitz M, Dhillon AP,
Thomson MA, Harvey P, Valentine A, Davies SE, Walker-Smith JA. Ileal lymphoid nodular hyperplasia,
non-specific colitis, and pervasive developmental disorder in children [retracted]. Lancet.
1998;351:637-41.
71. Deer B. How the case against the MMR vaccine was fixed. BMJ. 2011;342(C5347):77-82.
72. Borrell BA. Medical Madoff: anesthesiologist faked data in 21 studies. Scientific American,
10 Mar 2009.
73. Harris G. Doctor admits pain studies were frauds, hospital says. New York Times, 11 Mar 2009.
74. Marret E, Elia N, Dahl JB, McQuay HJ, Møiniche S, Moore RA, Straube S, Tramèr MR.
Susceptibility to fraud in systematic reviews: lessons from the Reuben case. Anesthesiology.
2009;111:1279-89.
75. Johnson P. Scott Reuben, a former Baystate doctor who faked research, sentenced to 6 months for
health care fraud. MassLive.Com, 24 June 2010.
76. Sheehan JG. Fraud, conflict of interest, and other enforcement issues in clinical research.
Cleve Clin J Med. 2007;74(Suppl 2):S63-7. discussion S68-9.
77. Wells JA. Final report: observing and reporting suspected misconduct in biomedical research.
2008. http://ori.dhhs.gov/research/intra/documents/gallup_finalreport.pdf. Accessed 13 May 2011.
78. Stewart DO, DeMarco JP. Rational noncompliance with prescribed medical treatment. Kennedy Inst
Ethics J. 2010;20:277-90.
79. Cleaveland C. We are not criminals: social work advocacy and unauthorized migrants. Soc Work.
2010;55:74-81.
How to Prepare a Scientific Paper
13
Jeffrey S. Borer
P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology: A Guide for Clinical Investigators, 255
DOI 10.1007/978-1-4614-3360-6_13, Phyllis G. Supino and Jeffrey S. Borer 2012
can range from a report of a well-studied single clinical experience (case report) to a highly
complex, controlled, and carefully blinded study of the impact of a transfected gene on myocardial
protein degradation in tissue culture. The term scientific paper may seem relatively nonspecific.
However, given the explosion of biomedical literature during the past generation, the concomitant
recruitment of highly talented and experienced journal editors, and the relative paucity of costly
journal publication space, it is not surprising that a fairly rigorous definition for the term can
be found.
The definition of a scientific paper is comprehensively developed and discussed by Robert A. Day,
professor emeritus of English at the University of Delaware and past president of the Society for
Scholarly Publishing and of the Council of Biology Editors, in his definitive book, How to Write
and Publish a Scientific Paper [2]. As stated by Professor Day, a scientific paper is a written and
published report describing original research results. However, it must be written and published
as defined by [three centuries of developing] tradition, editorial practice, scientific ethics, and
the interplay of printing and publishing procedures. Professor Day quotes the definition of an
acceptable primary scientific publication developed by the Council of Biology Editors: it must be
the first disclosure containing sufficient information to enable peers (1) to assess observations,
(2) to repeat experiments and (3) to evaluate intellectual processes; moreover, it must be
susceptible to sensory perception, essentially permanent, available to the scientific community
without restriction, and available for regular screening by one or more of the major recognized
services (e.g., currently, Biological Abstracts, Chemical Abstracts, Index Medicus, Excerpta
Medica, Bibliography of Agriculture, etc., in the United States and similar services in other
countries) [2].
Today, considerable publication is performed via electronic media and the Internet, and may never
appear in an edition printed on paper. The definition of the scientific paper is not altered by the
use of electronic media. As indicated by the American Association for the Advancement of Science
(AAAS), the essential elements of the electronically published scientific paper are that the final
published version of an article after peer review (or any future peer-review equivalent), which
AAAS denotes as the Definitive Publication, needs to be clearly identified as such and must be
publicly available, the relevant community must be made aware of its existence, a system for
long-term access and retrieval must be in place ... it must not be changed (technical protection
and/or certification are desirable), it must not be removed (unless legally unavoidable), it must
be unambiguously identified ... it must have a bibliographic record ... containing certain minimal
information, [and] archiving and long-term preservation must be provided for [3].
As indicated by the AAAS criteria, the definition of the scientific paper, either printed on paper
or in electronic media, encompasses the concept of prepublication peer review. Peer review is the
process by which other professionals, understood on the basis of their own publications or other
credentials to have expertise in the paper's area of focus, evaluate the paper and grade it as to
priority for publication. Most journals employ a system of peer review to select manuscripts to be
published from within the larger pool of those submitted. The number of peer reviewers for most
publications usually is two, though more or fewer may be employed in any instance. The criteria for
judgment generally include the intrinsic importance of the subject about which the paper is written
(hypothesis to be tested, research problem, etc.), the adequacy of the methodology for the stated
purpose, the credibility of the results and the adequacy of the data analysis, the reasonableness
and fairness of the conclusions/interpretations, the adequacy of the bibliography, and the adequacy
of the formal presentation (i.e., is the reader likely to be able to understand the material as it
is presented). In addition, it is hoped that peer reviewers will help to identify submissions that
already are in review by more than one venue or that present data already published (both findings
indicate transgression of copyright laws and general standards for publication). Peer reviewers
also are expected to have some sense of the likelihood that the data are real and not
fraudulent, though the latter is largely impossible for a reviewer to verify. It is essential for
authors to recognize the characteristics by which peer reviewers will judge a manuscript (and,
subsequently, to respond courteously and appropriately to suggestions for additions, clarification,
or other alterations to the manuscript) if the work is to be accepted for publication.
The definition of the scientific paper also implies a certain amount of detail in reporting
methodology and results; the degree of such detail ultimately is the product of a complex interplay
of intellectual and moral/ethical considerations and may vary with the mores of the era and the
context within which the publication is conceived and written. Centuries ago, a scientific treatise
did not necessarily conform to the rigorous research standards that prevail today, with the
necessity for substantiating data. The concept was paramount; data reporting was less rigorous and
often relatively inaccurate. Scientific thought was evolving, but scientists did not have the
luxury of the technological resources available today that mark the often exquisite details of
current research.
The degree of detail that is required depends in part on the familiarity of the intended audience
with the methods employed. In many instances, techniques that are widely used and generally
accepted as standard (e.g., electrocardiography) require no more than recitation, with no
supporting bibliographic reference. On the other hand, other aspects of methodology, and
particularly elements of study design, may be so critical to interpretation of the results by the
reader that considerable descriptive detail may be necessary.
The need for a scientific paper to enable the reader to evaluate the intellectual process requires
either direct discussion of that process in the manuscript or, more commonly, organization of the
manuscript such that relevant inferences can be drawn. As noted by Feinstein, the latter has
resulted in the complaint by Peter Medawar in the Saturday Review that scientific writing is often
intellectually fraudulent because the careful organization given to the published material do[es]
not reflect the way things happened. After conquering his ignorance, the scientist presenting his
new ideas in print may be reluctant to discuss how much ignorance he had to overcome [4].
Fortunately, however, given the limited journal publication space available, the capacity to
evaluate the logic underlying a given piece of research far outweighs the need to scrutinize the
specific and often circuitous path by which that logic was revealed, Dr. Medawar and the Saturday
Review notwithstanding!
Since a scientific paper must communicate several aspects of a research project, a logical,
standardized reporting format is preferred. Currently, the most commonly used format is known by
the acronym, IMRAD: introduction, methods, results, and discussion. This probably should be changed
to AIMRAD to reflect the almost universal placement of an abstract at the head of the scientific
paper, a relatively recent development. The abstract is important since it may alter the
information content required of the introduction. The IMRAD format (or AIMRAD, or TAAIMRAD, if the
title and authors are considered, since they, too, can convey important information) indicates
sequentially what problem was studied, why it was studied, what was found, and how these findings
should be interpreted, particularly within the context of related work in the field.
The best aid to crafting a useful scientific paper is a well-organized, well-planned, and clearly
written research proposal or protocol. The well-crafted proposal will (1) clearly state the
specific aims of the research, including hypotheses to be tested (if any); (2) provide a context
and justification for the study with reference to the literature; and (3) define precisely the
methods to be employed, including the research design, measurement techniques and approach to
statistical analysis, the principal results expected, and the conclusions that might be suggested
by them. In other words, the protocol provides the basis for the introduction and methods sections
of the paper. However, since the best laid plans often go somewhat astray, the proposal must be
supplemented by consideration of the procedures actually employed and data truly collected before
the scientific paper can be written. As Turato et al. have noted, Investigative studies without
explicit
hypotheses give rise to the supposition that these enterprises have a merely mechanical
course. That is, they uncritically repeat the dominant group's methodological models in
the world of academic medicine. Failure to present hypotheses, before enumerating the
objectives, usually represented a failure to respect the logical sequence of stages,
which are understood as occurring naturally in the mind of the thinker [5]. As Knottnerus
also has observed, "We should not forget that mathematical indices are just ways to
summarise collected research data. For the quality of research, defining the research
question, and methodological challenges in study design, are far more important" [6].
A summary of some specific characteristics of the components of the scientific paper
follows, organized as per the TAAIMRAD format. This summary owes a considerable debt to
the published comments of Professor Day, as well as to personal experiences in applying
the generally accepted precepts.
The Title
The title is the first and, often, only contact of the reader with the paper. Therefore,
it must convey considerable information with an economy of words. The primary
consideration in crafting a title is clarity. Jargon should be avoided, and the relevant
rules of grammar should be followed.
Equally importantly, a title should be specific and focused. Thus, the title must refer
specifically to the subject of the research, rather than merely to the field within which
the research is undertaken. (Of course, the operating definition of "subject" and "field"
can vary with the research.) For example, in a prospective study employing radionuclide
cineangiography and echocardiography to develop prognostic indices for survival in
patients with mitral regurgitation who had not undergone valve replacement or repair, the
title "Prediction of Survival in Patients with Mitral Regurgitation by Use of
Noninvasively Defined Indices of Left and Right Ventricular Performance" would be
preferable to, for example, "Prediction of Survival in Mitral Regurgitation." While the
latter indicates the general subject of the study, the former also indicates the
methodological approach, including the variables measured. Although more verbose, the
longer title helps to define the scope of the study and to distinguish it from others in
the field. If no study of prognostication in mitral regurgitation had been performed
previously, the lengthier title would be less essential. However, since other studies
have been performed, the additional verbiage is useful, providing the knowledgeable
reader with some indication of the uniqueness of the paper and its relevance for his or
her work.
Other important considerations, suggested above, include the desirability of conveying
more of the IMRAD information than merely the subject of the study and the desirability
of brevity. The criterion for acceptable brevity varies with fashion (e.g., Darwin's
title for his account of his voyages on the Beagle, On The Origin of Species by Means of
Natural Selection, or The Preservation of Favoured Races in the Struggle for Life [7],
acceptable in 1859 but not in a medical journal in 2010!).
In summary, in crafting a title, effort is well spent attempting to minimize words while
maximizing clarity, focus, and information content.
Authorship
Different criteria exist for inclusion in an authors' list and for the order of listing.
When this author worked at the National Institutes of Health (NIH), a simple rule of
thumb was proposed: listed authors should have made an important contribution to the
research and should be able to present and defend the paper at a scientific meeting. This
definition implies that an author has acquired a body of knowledge which can serve as a
context for the reported research and that he or she is intimately familiar with the
intricacies of the methodology employed in the research as well as with the results.
However, with the rapid increase in technological and biological information in recent
years, it has become increasingly necessary for projects to be carried out by teams
comprising collaborators with different, and
13 How to Prepare a Scientific Paper 259
often widely disparate, areas of expertise. For example, in the randomized Collaborative
Study of Coronary Artery Surgery (CASS) [8], a trial designed to assess the effects of
coronary artery bypass grafting plus standard (ad hoc) pharmacological/dietary therapy
compared with standard (ad hoc) pharmacological/dietary approaches alone on natural
history of patients with coronary artery occlusive disease, public health
specialists/epidemiologists and statisticians were critical to the study design and
analysis. In fact, an epidemiologist and a statistician were the first and second authors
of one of the most important papers resulting from the trial.
However, surgeons and cardiologists participated in the trial, and the cardiologists
included those who performed catheterizations and those who did not. It is likely that
representatives of all these groups, and more, participated in the conceptualization of
the study, that all but the statistician participated in primary data collection, and
that many participated in interpretation of the results. However, it would be excessive
to expect the epidemiologist or statistician to understand the methodological pitfalls of
the catheterization (much less to identify them when they occurred), or to expect the
cardiologist, the epidemiologist, or the statistician to fully understand and identify
methodological problems associated with surgical procedures, or for the cardiologist or
the surgeon to understand fully or to be able to defend the procedures employed by the
statistician.
Therefore, Day's definition is now most appropriate: "an author of a paper should be
defined as one who takes intellectual responsibility for the research results being
reported" [2]. Thus, authors should include those who actively or substantially
contributed to the conceptualization, design, and performance of the research. It is
sometimes true that individuals intimately involved in conceptualization and design of
research and in analysis and/or interpretation of results have little or no
responsibility for primary data collection and that individuals involved in primary data
collection have little or no involvement in the other processes. The latter is
particularly true of technicians or research assistants, who in most circumstances
do not provide intellectual input into the process (though many exceptions exist).
Problems can also arise regarding the inclusion of senior scientists in whose area of
responsibility the research occurred but who may have had little direct input into the
specific project. Clearly, the distinction between those whose intellectual
responsibility is sufficient to warrant authorship and those whose responsibility is not
is difficult to make with precision. Ultimately, this determination probably depends on a
consensus of the involved investigators. However, those who allow their names to be
listed as authors incur another responsibility, specifically for the veracity of the
reported data. In several celebrated cases of research fraud three decades ago, some
renowned senior scientists, not associated with collecting or analyzing data but involved
(sometimes distantly) in project conceptualization, were listed among the authors of
papers found to be fraudulent; though none of them was aware of the fraudulence of the
reported data, they were perceived as irresponsible in allowing their names to be used
without adequately assessing the reported projects.
Regarding the order of authorship, again per Day, "authors should normally be listed in
order of importance to the experiments" [2]. Sometimes a senior investigator or group
leader chooses to move out of such ordering into the last position on the list, from
which his or her senior status can be inferred and which provides added recognition to
junior authors by moving them up the list. In some cultures, authors are listed
alphabetically. No universally accepted rules exist for ordering the authors' list; the
ultimate test of the appropriateness of the list is consensus of the individuals
involved.
The Abstract
The Abstract represents a brief summary of the paper. As such, it should contain a
concise statement of the research problem, sufficient methodological information to
orient the reader, a summary of the results of primary importance, and the authors'
principal conclusions.
The length of the Abstract and, often, its format are governed by the policy of the
publication to which the paper is submitted.
Important considerations in Abstract writing include (1) avoidance of abbreviations
whenever possible and, when they are needed, limitation to those which are generally
recognized; (2) minimization of words without disregard for grammar and syntax; and
(3) avoidance of reference to data or methods not reported in the paper. The last of
these requires careful final editing since substudies or subanalyses sometimes are
eliminated from the final edition of a paper because of considerations of relevance or
space, but still may appear in the previously written Abstract.
Introduction
The Introduction is a tool for communication and is critical to the success of the paper.
It serves several functions. These include, but are not limited to, engaging the reader's
interest sufficiently to justify proceeding into the details of methodology and results,
suggesting the logic of the methods, and providing a framework for assimilating and
interpreting the results.
To serve these purposes, the Introduction must (1) clearly state the problem or problems
(hypotheses, research questions, specific aims) under study; if more than one problem has
been studied, the relation of the problems, and the reason for studying them together,
should be elucidated; (2) provide a basis, usually from the literature, for choosing to
study the problem(s); (3) outline the approach to the problem indicating, when
appropriate, why this approach, rather than others, was chosen; and (4) indicate the
importance or uniqueness of the paper, i.e., justify the performance of this particular
study. Though some writers choose to briefly describe results and conclusions in the
Introduction, most do not. These are available in the Abstract and are redundant when
more complete exposition of these aspects of the research will follow. In stating the
problem, the writer should avoid distracting irrelevancies. For example, if one has
studied the effect of alcohol consumption on left ventricular ejection
fraction, it would be inappropriate to include in the Introduction a paean to the value
of ejection fraction as an index of prognosis in heart disease. Though ejection fraction
is a useful prognostic index, the problem under study has nothing to do with the use of
ejection fraction for prognosis. The mention of this property of ejection fraction may
suggest to the reader that prognostication strategies have been studied. The resulting
confusion may preclude clear assimilation of the data actually presented. If the reader
is performing prepublication peer review for a journal, this confusion may be translated
into rejection for an otherwise worthy effort.
As we have emphasized in Chap. 2 of this book, the statement of the problem must be
sharply focused. Many authors have documented a relation between alcohol consumption,
acute or chronic, and deterioration of left ventricular performance. Few have defined the
quantitative relation between alcohol consumption and ejection fraction change. If the
study in question was designed to provide such information, and the relevant data were
collected, then the statement of the problem should focus on the effort to quantify the
relation between the intervention and the parameter employed.
The author should not promise something, directly or by implication, that he or she does
not deliver. Thus, for example, in justifying the study of the effect of alcohol
consumption on ejection fraction, it would be best to avoid suggesting that the study was
performed because it might help to guide therapy unless (a) the results include data on
the effects of therapy in this condition and (b) the relationship of the effects of
therapy to ejection fraction is described. (In certain situations, this speculation
might be appropriate in the Discussion.)
It is important to inform the reader if multiple problems have been assessed. All but the
most compulsive readers generally will remember no more than one fact or concept after
reading a paper. If multiple concepts or types of results have been generated in a study,
a well-constructed Introduction may improve the likelihood of their recognition and
retention. A negative example may illustrate the point. In 1979, this author and
colleagues assessed response of left ventricular volume and function to exercise in
patients with aortic regurgitation [9]. In a brief, two-paragraph Introduction, only the
study of function (measured as ejection fraction) was mentioned. The assessment of volume
change received no comment. In the many subsequent references to this frequently cited
paper, the citation invariably has been to the effect of exercise on ejection fraction.
To this author's knowledge, no one ever has mentioned our finding of marked reduction in
left ventricular end-diastolic filling during exercise, which was reported in this paper.
Other authors subsequently reported studies of volume changes during exercise in aortic
regurgitation, without reference to these data. This oversight is likely related in large
part to an incomplete Introduction to the paper. As a result, other investigators could
not benefit from these findings in designing their studies.
A brief description of the methodological approach employed in the study will permit the
knowledgeable reader to place the study in an appropriate context for interpretation
while other sections of the paper are being read. If methodology somehow was unique, this
should be indicated, together with the reason for use of the new method. For example, in
1977, this author and colleagues reported a study of the effect of exercise on regional
and global left ventricular function/performance in 11 patients with coronary disease who
had normal performance descriptors at rest [10]. In this instance, the method employed to
study performance during exercise was of greater interest than the effect of exercise
itself. Application of radionuclide cineangiography during exercise had not been
previously reported in a scientific paper. Therefore, the Introduction included a
paragraph explaining the theoretical importance of studying the effect of exercise in
coronary disease and another paragraph describing the relevance of radionuclide
cineangiography in permitting such study.
The Introduction should be organized according to journalistic precepts: the most
important concept should be presented first, and subsidiary concepts should be presented
thereafter. Neither the Introduction nor the scientific paper as a
whole should be treated as a guessing game or as a finely wrought mystery-drama. The busy
reader should be engaged early by references to material which the author considers most
important.
Finally, the Introduction should be brief. Detailed review of collateral or supporting
literature is appropriate for the Discussion, but not for the Introduction. Generally,
the Introduction should be limited to one double-spaced typed page (approximately 250
words). If the Introduction substantially exceeds this limit, the author must consider
the possibility that he or she has not clearly identified the key concepts in his or her
own mind.
Methods
As Day has noted, the primary purpose of this section is to describe and (if necessary)
defend the experimental design and then provide enough detail that a competent worker can
repeat the experiments [2].
Clear and accurate description of methods is critically important. The careful reader
cannot properly interpret the results or evaluate the conclusions without a fundamental
understanding of the methods employed in making the observations. As a corollary, the
limitations of the methods should be understood. This may require a specific statement by
the author if he or she believes that the interpretation or generalizability of results
is importantly mitigated by some aspect of the methodology or, conversely, if the author
believes that an apparent methodological limitation can be explained in a manner that
minimizes circumscription of conclusions.
In general, the Methods section should begin with a detailed statement of the subjects
employed (physical models or devices, cells, tissues, or animals if the study is
nonclinical) or humans studied (if the study is clinical). This statement should include
generally accepted group descriptors (i.e., demographic data in clinical studies),
criteria for acceptance and/or exclusion of subjects from the study population, and a
description of any special procedures employed to determine fitness for acceptance. If
rabbit hearts have been
homogenized for analysis of protein content, the weight, age, and breed of rabbit should
be noted, as well as the total number of rabbits instrumented for study and reasons for
any discrepancy between this number and the number whose hearts actually were homogenized
and analyzed. This information helps the reader to evaluate possible interactive effects
of selection bias, albeit unintentional, that might alter extrapolability of results.
Similarly, in the previously noted example of the study to develop prognostic strategies
in mitral regurgitation, in addition to age, sex, and, perhaps, other demographic
descriptors if deemed relevant to interpretation of results, the author should define the
basis for determining the diagnosis of mitral regurgitation and its severity (physical
examination, echocardiography, catheterization, etc.), including the specific criteria
employed for classification with the method[s] chosen. If the study were designed to
develop prognostic strategies in systemic arterial hypertension, rather than in mitral
regurgitation, then, in addition to age and sex, race, weight, and height might be
important demographic descriptors since the pathophysiology of hypertension is known to
vary with race and, to a lesser extent, with obesity.
Special note should be made of sample size estimates (see detailed discussion in
Chap. 11). Sample size should be planned in the study protocol. It may be appropriate to
relate the protocol-mandated plan in the Methods and the reasoning on which the plan was
based. This is particularly true when the primary results, or some important secondary
results, are negative, i.e., the expected relationships are not found. Lack of
statistical significance is not equivalent to true lack of relationships. Sample size
estimates are based on the expected outcome, the expected variability of the measurement
methods, the maximum acceptable likelihood that the result is due to chance alone (the
alpha level, selected before the study by the investigators), and the likelihood of
finding the expected outcome if it really exists (also chosen before the study by the
investigators). The latter is known as the power to find the expected results and is
expressed as a percentage. The hazards involved in not reporting the basis of sample
size selection are best illustrated with reference to studies of therapeutic
interventions, usually evaluated by comparing a new treatment modality with an
established therapy. For such studies, the expected outcome event rate with the
established therapy may be estimated from earlier studies; the difference to be sought
between the new therapy and the comparator may be selected by the investigators based on
their judgment of the magnitude of difference that may be clinically useful. However, if
the event rate with established therapy is found to differ importantly from historical
standards (particularly if it is lower), the calculated sample size may provide far less
than the anticipated power to detect superiority of the new therapy, even if it exists.
Presentation of the basis for selection of the sample size in the methods may help the
reader to avoid erroneous (negative) interpretation of the data.
As these examples suggest, the specific parameters described in a methods section will
vary from study to study. Nonetheless, each aspect of the methodology must be defined
rigorously and precisely. On the other hand, excessive detail which does not affect data
interpretation (e.g., hair color, shoe size, and telephone numbers of the patients with
mitral regurgitation) can be confusing, misleading, and inappropriate. One caveat: some
journals may have specific requirements regarding identification of materials or methods.
These will be related in Instructions to Authors in the journal and must be followed.
After describing the subjects/items on which studies were performed and, if appropriate,
explaining the basis of sample size selection, the author should detail the materials
employed in processing and testing the subjects, as well as the procedures used to make
observations. Again, detail should be sufficient to permit interpretation and/or
replication of results. If procedures have been well described in the literature and were
performed without substantial change from those published, a general statement with a
literature citation may suffice. As a hypothetical example, assume that the authors of a
study state that equilibrium radionuclide cineangiography was performed at rest and
during symptom-limited supine bicycle ergometry according to methods
analogous to those we have previously described and employing in vivo labeling of red
cells with Tc99m. If substantial changes have occurred since the previous publication of
the method, these should be described and, if necessary, justified or defended (e.g.,
radionuclide cineangiography was performed using a recently developed image rendering
method to precisely define left ventricular borders. This method involves ... It was
employed because cardiac function indices were significantly better correlated with
independent standards than were older methods). Appropriate references also should be
supplied.
The research design also should be specified. If interventions are employed in some study
subjects but not in others, the basis for allocation of subjects to treatment groups
should be defined (e.g., randomization, stratification) as should other design elements
that reduce bias (e.g., blinding in processing/evaluating primary data). The temporal
sequencing of the observations relative to the intervention should be described.
Statistical methods employed to analyze data must be presented, including criteria for
accepting or rejecting the null hypothesis (i.e., the p value below which a result will
be declared statistically significant). Most physicians are relatively unfamiliar with
the details of statistics and with the criteria for selecting specific tests of
significance in certain situations. However, the ready availability of statistical
computer packages has led to widespread performance of statistical tests by
nonstatisticians. While many of these procedures undoubtedly are correctly selected and
performed, some probably are not. The best remedy for this problem is to consult a
statistician in the design of the research protocol and in statistical analysis of
results and to ask the statistician to write the appropriate portion of the methods
section, explaining it conceptually to the other authors. However, if this is not done,
the statistical methods employed should be carefully cited so that the statistically
literate reader (and the peer reviewers) can evaluate the appropriateness of the analysis
and resulting conclusions. During a study performed some years ago by this author's
group, one nonstatistician spent considerable time familiarizing himself with statistical
methodology and performed a multiple logistic regression analysis with some of the data.
An astute peer reviewer noted that, given the size of the patient population, an
excessive number of parameters had been tested for independent significance in the
regression model. The description of the statistical methodology permitted detection and
correction of this error.
In summary, description of methods requires judgment as to the appropriate degree of
detail. When in doubt, it is usually better to include more rather than less, though much
detail may be removed by editorial suggestion after peer review and before publication.
The guiding principle should be that sufficient information is transmitted so that, in
the view of the authors and journal editor, the results can be accurately interpreted.
Results
In the Results section, the author presents the observations which will permit assessment
of his or her original hypotheses and specific aims. In a sense, the results represent
the new knowledge which has been created by the research.
Narrative
In general, and particularly when complex mathematical analyses and subanalyses have been
performed, it is useful to present the results in narrative form, supplemented by tables
and figures. The narrative should indicate as clearly as possible the flow and thrust of
the data, i.e., the overall sense of the findings. Interpolation of numbers into this
narrative should be done with care and caution, preferably when they do not impede the
flow. However, the narrative may be strengthened by judicious interpolation of evidence
of the statistical significance of the findings (p values). This latter approach
necessitates clarity and comprehensiveness in the design of tables and figures in which
the data are presented quantitatively, since the narrative must be consistent with the
numbers. Moreover, the narrative should present only the results and not the
conclusions. A well-designed narrative plus graphics may lead obviously to certain
conclusions, but statement of these should await the next section.
As with the methods, some judgment must be employed in deciding which results require
presentation. Intensive analysis may reveal many relationships unsuspected in the
planning of the study. Concern about the chance finding of a statistically significant
relationship on the basis of overanalysis of data probably is well-founded. Therefore,
unexpected relationships, particularly those derived from post hoc analyses, should be
evaluated with caution. Nonetheless, some of these may be important in drawing
conclusions from the research and certainly can be hypothesis-generating for future
studies. Some may be irrelevant. The latter generally do not require presentation.
Negative results often are important though these, too, must not be overinterpreted. A
negative finding may have resulted from measurement error or from sample size that is
inadequate to properly assess the relationship under study. In these instances, a
positive, i.e., statistically significant, result would have been unlikely even if, in
fact, the sought-after relationship actually exists. Such limitations in the
extrapolability of the data generally should be noted in the discussion section.
Tables and figures need not be limited to the Results, but this is the section in which
they are generally most appropriate and useful. Tables and figures can be employed in the
Introduction or Discussion to summarize work done by others into which context the newly
reported results must be integrated, or to diagram relationships (often
pathophysiological relationships) believed to underlie the results that are being
reported. In general, these strategies are best reserved for review articles and should
be avoided in scientific papers (original research reports) because the use of space for
this purpose is seldom justified by any gain in comprehension by the reader. Indeed, in
order to make such tables comprehensible, an expanded explanatory text often is required
(frequently drawing upon data not generated within the report being presented by the
author), potentially increasing the size of the printed article beyond the limit allowed
by the journal and
diminishing space needed to present the new knowledge. In the current era in which
Internet publication, with supplements and appendices, often is undertaken or accompanies
printed versions of scientific papers, the space limitation may be overcome by adding
tables (and figures) in electronic appendices, surmounting the proscription on such
additions. However, the author must always remember that the primary purpose of
publication is communication and that the accretion of additional material may obscure
rather than clarify the focus and conclusions of the research. In the Results, however,
tables and figures are invaluable and space-saving devices that often help to clarify
complex results by removing them from the narrative, enabling comprehensible summary
presentations supplemented by the data from which they are derived.
In the following summary of considerations in the use and configuration of tables and
figures, much has been gained from review of the chapters on these subjects in the
monograph by Edward J. Huth (How to Write and Publish Papers in the Medical Sciences) to
which the reader is referred for greater detail [11].
Tables
Multiple well-focused tables are preferable to one massive compendium of all relevant
data. However, the number of tables that can be employed often is defined or limited by
the editorial policy of the individual journal and must be known when planning use of
these devices. More importantly, the author must consider which tables would best further
the communication at which the paper is aimed. Information involving few data that might
be effectively displayed in tabular form for an oral presentation probably can be
communicated more appropriately by narrative summary in a paper. If tables are employed,
however, they must be cited and sequenced within the text so that their relation to the
narrative results is easily discernible [11]. In any table, the title should clearly
define the focus and nature of the data or relationships to be presented, and column and
row headings should be simple and easily understood. If abbreviations or technical
terms are employed for the sake of the esthetic/clarity of the layout, these should be
precisely defined in a legend. The legend also may include a summary statement amplifying
or totally replacing the table title to clarify the specific purpose of the table. It is
critically important to define the units of measurement for any numerical data in the
table [11]. In some tables, absolute numerical results are followed by parenthetical
presentations of percentages of the data set represented by these absolute values. Unless
the antecedent data set is precisely defined and obviously visible, such formats can lead
to reader confusion and deterioration of communication. If statistical comparisons among
elements of the table are presented, it must be made absolutely clear which elements are
being compared and what type of comparison has been performed. For example, a p value for
noninferiority between two data sets may indicate the high likelihood that one set is
noninferior to the other, but unless the type of comparison has been explicitly stated
and there is a numerical difference between the sets, the reader may assume the p value
refers to superiority, an erroneous conclusion that could preclude comprehension and
subsequent application of the results.
Figures
To be optimally effective, figures should be relatively uncluttered. In general, one fact
or relationship should be illustrated by each figure, though many observations in the
narrative may be supported by figures. It can be very confusing to decipher
three-dimensional plots, or single figures with two or three different ordinate or
abscissa scales, each referring to a different line identifiable with reference to black
or white polygons, all within the same coordinate axes.
Examples of figures that can be very useful in clarifying or amplifying (or replacing)
text
and primary dependent (outcome) variables (particularly when the relation follows a clear
pattern), etc. However, the latter figures should be employed only when they provide
clear support for an author's subsequent conclusions [11]. It is not necessary, and, in
my view, it is inappropriate to provide examples of individual data (e.g., a photograph
of a histological sample of a degenerated myocyte from an organism with heart failure)
unless some unique characteristic of the photograph supports the existence of a
previously unsuspected process. It is not necessary to present illustrations to prove
that certain analyses were performed: there is general agreement among researchers that
statements of fact presented in the Results are true; it is the interpretations that may
differ. Figures are most useful when they support interpretations.
It should be intuitively obvious that any figure employed in a publication must be clean,
technically well reproduced, and easy to read. In addition, however, considerable
attention should be paid to labeling. In displays of coordinate axes, the ordinate and
abscissa must be clearly labeled with units of measurement, amplified if necessary by
statements in the legend. Similarly, interventions, time intervals, etc., must be
precisely laid out in flow charts and study design diagrams. Idiosyncratic abbreviations
in labels should be avoided when possible. Ultimately, as for tables, the use of figures
should be undertaken only when they are clearly useful in potentiating comprehension of
results and conclusions presented in the Discussion. It is an error, likely to be cited
and extirpated by peer reviewers and editors, to present the same data both in tabular
and graphic format; if the data require amplification beyond the narrative, select one
format or the other, not both. Remember that the goal of the presentation is clear
communication.
Discussion
include graphic presentations of complex study
designs, ow charts indicating reductions in pop- The purpose of the Discussion is to present con-
ulation size as exclusions or other factors impact clusions based on the results of the research.
on the population studied, quantitative relations Thus, the Discussion is the authors opportunity
between important independent (input) variables to interpret and identify the importance of their
work and, as Day has noted, to present the principles, relationships, and
generalizations shown by the Results [2]. Certain principles should be
observed in writing a Discussion. If they are not, most editors and many
reviewers will call the author to task and may even reject an otherwise
laudable report.
Less generally is more. Lengthy discussions, extrapolating from every
conceivable aspect of the data, often are not well received. Moreover, they
can detract from the importance and originality of the primary observations
by overwhelming and distracting the reader. As a corollary, summarization of
the results is redundant and inappropriate in the Discussion and usually is
not tolerated by editors jealous of their limited publication space.
Conclusions should be clearly and closely related to the data obtained in
the study. Far-reaching speculations generally should be avoided. Fairness
and balance are necessary in interpreting results. Excessive emphasis on a
pet theory should be avoided, particularly if alternatives exist that may be
credible. Therefore, the relation of the results to those of other parallel
or similar studies should be discussed. If possible, some explanation should
be provided for apparent differences. Often, these may be ascribable to
differences in methodology, so that careful review of the methodology of
collateral references can be very helpful. Claims of priority are
appropriate if correct (e.g., "This study represents the first demonstration
of parthenogenesis in the Syrian hamster"), but check the literature
carefully to be certain of the claim (see below).
Support for or refutation of conclusions should be cited from the published
literature and may require additional discussion. It is the responsibility
of the authors to undertake a reasonable literature search to find
appropriate references. As discussed in Chaps. 2 and 9, the explosion of
scientific literature has made this a difficult and time-consuming
undertaking. However, several computer-based literature search services can
be helpful, including those readily available via the National Library of
Medicine. It is true that the scientific paper reports the findings of the
author's project and that lack of placement of these findings in the
appropriate literary context does not alter their intrinsic validity or
value; nonetheless, lack of adequate literary references may lead a reader
to under- or overvalue or otherwise misunderstand the importance and
implications of the reported research. Also, lack of appropriate referencing
is unfair to the work and workers thus disregarded. Even if the intrinsic
moral issue here is uninteresting to an author, its practical consequences
often are not. It is almost a truism that the author of a study you neglect
will be a prepublication peer reviewer and may resent what is perceived as
an inappropriate claim of priority.
Limitations of the work, in terms of methodology employed, inconsistencies
in results, etc., should be discussed. Interpretation in light of these
limitations should be defended when necessary. Readers and reviewers will be
aware of these limitations, and failure to deal with them in the Discussion
may detract from the credibility of otherwise excellent work.
Theoretical or abstract conclusions are appropriate when logically drawn
from data, circumscribed in their scope, and supported by appropriate
references to parallel work in the field. As stated by Howard Haggard in The
Doctor in History, "a theory affords an explanation for known facts.
Theories, when correct serve as guides in the search for new facts. But when
incorrect, they obscure the truth" [12]. Whether or not truth is obscured,
wide-ranging theorizing, only tenuously related to the data, often raises
the ire of peer reviewers, with unfortunate consequences for the scientific
paper.
Finally, as in the Introduction, the journalistic approach is useful:
discuss primary conclusions first and secondary or subsidiary extrapolations
later. Thus, in the study of prognostic strategies in mitral regurgitation,
suppose that right ventricular ejection fraction less than 30% at study
entry was associated with poor two-year survival and that, as an unexpected
ancillary finding, an association exists between prior rheumatic fever and
left ventricular ejection fraction less than 50% at rest. The Discussion
might begin,
Take-Home Points
• The scientific paper is the vehicle that reports what research problem was
  studied, why it was studied, what was found, and how these findings should
  be interpreted, particularly within the context of related work in the
  field. Its publication, making the data available to the scientific
  community, is the final step in the research process.
• The scientific paper is a communications tool. Clarity and precision of
  expression are critically important.
• The best aid to crafting a useful scientific paper is a well-organized,
  well-planned, and clearly written research proposal or protocol.
• The results (not the discussion or authors' interpretation) are the new
  knowledge; their evaluation by the reader requires clear exposition of the
  methods. The discussion is not a mystery novel: state the conclusions in
  order of their importance. Remember that less usually is more.
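
The caution given earlier in this chapter about labeling p values in tables
(noninferiority versus superiority) can be illustrated with a small
numerical sketch. The Python below is illustrative only: the group means,
standard deviations, sample sizes, and noninferiority margin are invented
numbers, the function names are hypothetical, and a simple normal
approximation is assumed in place of whatever test a real study would
prespecify. It also assumes that higher values of the outcome are better.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p(diff, se, null_value):
    """One-sided p value for H0: true difference <= null_value
    against H1: true difference > null_value (normal approximation)."""
    z = (diff - null_value) / se
    return 1.0 - NormalDist().cdf(z)


def compare(mean_new, mean_ref, sd_new, sd_ref, n_new, n_ref, margin):
    """Return the observed difference and both p values for two groups.

    Assumes higher outcome values are better and `margin` is the largest
    deficit still regarded as clinically acceptable (an assumption here).
    """
    diff = mean_new - mean_ref
    se = sqrt(sd_new**2 / n_new + sd_ref**2 / n_ref)
    return {
        "difference": diff,
        # Noninferiority: is the new group no worse than ref by more than margin?
        "p_noninferiority": one_sided_p(diff, se, -margin),
        # Superiority: is the new group better than ref at all?
        "p_superiority": one_sided_p(diff, se, 0.0),
    }


# Invented example data: the new treatment scores slightly lower on average.
result = compare(mean_new=54.0, mean_ref=55.0,
                 sd_new=10.0, sd_ref=10.0,
                 n_new=200, n_ref=200, margin=5.0)

for name, value in result.items():
    print(f"{name}: {value:.5f}")
```

With these made-up inputs the new treatment is numerically slightly worse,
the noninferiority p value is very small, and the superiority p value is
large. A table that reported only the small p value without naming the
comparison would invite exactly the misreading described above.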
P.G. Supino and J.S. Borer (eds.), Principles of Research Methodology: A Guide for Clinical Investigators,
DOI 10.1007/978-1-4614-3360-6, Phyllis G. Supino and Jeffrey S. Borer 2012
About the Editors
Jeffrey S. Borer, MD
Jeffrey S. Borer, MD is Professor of Medicine, Cell Biology, Radiology, and
Surgery at SUNY Downstate Medical Center and College of Medicine in
New York City. He is Chairman, Department of Medicine and Chief, Division
of Cardiovascular Medicine, and Director of the Howard Gilman Institute for
Heart Valve Disease and of the Cardiovascular Translational Research
Institute at SUNY Downstate. Dr. Borer received his BA from Harvard, his
MD from Cornell and trained at the Massachusetts General Hospital. He
spent 7 years in the Cardiology Branch of the NHLBI at the NIH and a year
at Guy's Hospital in London as a Senior Fulbright-Hays Scholar, where he
completed the first clinical demonstration of the utility of nitroglycerin in
acute myocardial infarction. Upon returning to the NIH, he developed stress
radionuclide cineangiography, for the first time allowing non-invasive
assessment of cardiac function with exercise. He then returned to Cornell for
30 years, where he was the Gladys and Roland Harriman Professor of
Cardiovascular Medicine and Chief of the Division of Cardiovascular
Pathophysiology. At Cornell, his primary research involved developing prog-
nostic standards for regurgitant valve diseases and exploring the cellular and
molecular biology of myocardial dysfunction in valve diseases, now contin-
ued at SUNY Downstate. He has been an Advisor to the USFDA for 33 years,
chairing the CardioRenal Advisory Committee for three terms and the
Cardiovascular Devices Advisory Committee for one, and Advisor to NASA
for 24 years. He has served as President of the American College of Cardiology
(ACC), New York State Chapter, and member of the Board of Governors of
the national ACC, as well as on the Boards of Governors or Trustees of mul-
tiple other national professional societies. Currently, he is President of the
Heart Valve Society of America and a member of the ISO US Valve Experts
Committee. Dr. Borer has published 400 scientific papers and four books,
edits the journal, Cardiology, and has received several awards and other
recognitions, including the Public Service Medal of NASA. He has been
extensively involved in the training of medical students, residents, fellows,
and translational scientists. Since 1990, he has closely collaborated with
Dr. Supino on a variety of didactic teaching programs on research methodology
for clinicians and other members of the academic communities of Weill
Medical College and SUNY Downstate Medical Center.
Index
A
Abstract, scientific paper, 256, 259–260
ACP Journal Club, 179
Alternate form reliability, 168
American Association for the Advancement of Science (AAAS), 256
Analysis of variance (ANOVA), 47, 219–221
Analytic research, 9
Association consistency, 76
Audio computer-assisted self-interview (ACASI), 160
Authors list, scientific paper, 258, 259

B
Bacon, Sir Francis, 33
Bartholow, Roberts, 235
Basic research, definition, 34
Bayes' theorem, 34, 224, 225
Beck Depression Inventory, 162
Beecher, Henry K., 237–239
Behaviorally anchored rating scale (BARS), 157
Belmont report, 3, 238, 239, 242, 251
Beneficence, 238, 242, 251
Bernard, Claude, 31, 36
BestBETs, 179
Bias
  accuracy, 61, 71
  agreement bias, 165
  allocation bias, 88, 90
  definition, 95, 100, 107, 116, 153, 249
  detection bias, 72, 83
  devil bias, 165
  expectancy bias, 84
  experimental mortality (attrition), 82, 88, 90, 96, 102, 105, 109
  experimenter bias, 83, 88, 95, 109
  exposure misclassification, 60–61
  faking bad bias, 165
  history bias, 80–81, 88, 93, 95, 96, 99, 102, 105, 106, 109
  horns bias, 165
  instrumentation bias, 83
  loss to follow-up bias, 7, 16, 60, 61
  maturation bias, 83
  nonparticipation bias, 60, 61
  publication bias, 12, 44, 181, 190–192
  recall bias, 7, 71, 75
  referral bias, 72, 80
  sampling bias, 154
  selection bias, 16, 66, 72, 80–82, 91, 92, 95, 96, 98, 100–102, 105, 109, 127, 177, 262
  social desirability bias, 164–165
  sources of, 60–62, 70–71, 76
  testing bias, 81, 84, 91, 105, 109, 168
Biological plausibility, 42, 76
BIOSIS Previews, 23, 24
Bonferroni test, 220, 221
Boolean operators, 180–181
Box plots, 209, 210
Brief Symptom Inventory (BSI), 148
Buxton, Peter, 238

C
Case-control study
  advantages and disadvantages, 74–75
  case
    definition, 64
    selection, 64
  vs. cohort study, 56–62, 74
  controls
    definition, 65–67
    selection, 65
  odds ratio calculation, 67, 73, 74, 76
  prevalent vs. incident case, 64–65
Case report form (CRF), 132, 135–142, 144
Case series, 7, 9, 182
Case study, 7, 9, 87–88, 90
Categorical responses, 158
Ceiling effect, 164
Central tendency, measures of, 209, 229
Chi-squared/Chi-square test, 221–223, 229