
European Journal of Personality, Eur. J. Pers. 27: 108-119 (2013)
Published online in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/per.1919

Recommendations for Increasing Replicability in Psychology

JENS B. ASENDORPF1*, MARK CONNER2, FILIP DE FRUYT3, JAN DE HOUWER4, JAAP J. A. DENISSEN5,
KLAUS FIEDLER6, SUSANN FIEDLER7, DAVID C. FUNDER8, REINHOLD KLIEGL9, BRIAN A. NOSEK10,
MARCO PERUGINI11, BRENT W. ROBERTS12, MANFRED SCHMITT13, MARCEL A. G. VAN AKEN14,
HANNELORE WEBER15 and JELTE M. WICHERTS5
1 Department of Psychology, Humboldt University Berlin, Berlin, Germany
2 Institute of Psychological Sciences, University of Leeds, Leeds, UK
3 Department of Developmental, Personality and Social Psychology, Ghent University, Ghent, Belgium
4 Department of Experimental Clinical and Health Psychology, Ghent University, Ghent, Belgium
5 School of Social and Behavioral Sciences, Tilburg University, Tilburg, The Netherlands
6 Department of Psychology, University of Heidelberg, Heidelberg, Germany
7 Max Planck Institute for Research on Collective Goods, Bonn, Germany
8 Department of Psychology, University of California at Riverside, Riverside, CA, USA
9 Department of Psychology, University of Potsdam, Potsdam, Germany
10 Department of Psychology, University of Virginia, Charlottesville, VA, USA
11 Department of Psychology, University of Milano-Bicocca, Milan, Italy
12 Department of Psychology, University of Illinois, Chicago, IL, USA
13 Department of Psychology, University of Koblenz-Landau, Landau, Germany
14 Department of Psychology, Utrecht University, Utrecht, The Netherlands
15 Department of Psychology, University of Greifswald, Greifswald, Germany

Abstract: Replicability of findings is at the heart of any empirical science. The aim of this article is to move the current replicability debate in psychology towards concrete recommendations for improvement. We focus on research practices but also offer guidelines for reviewers, editors, journal management, teachers, granting institutions, and university promotion committees, highlighting some of the emerging and existing practical solutions that can facilitate implementation of these recommendations. The challenges for improving replicability in psychological science are systemic. Improvement can occur only if changes are made at many levels of practice, evaluation, and reward. Copyright 2013 John Wiley & Sons, Ltd.

Key words: replicability; confirmation bias; publication bias; generalizability; research transparency

PREAMBLE

The purpose of this article is to recommend sensible improvements that can be implemented in future research without dwelling on suboptimal practices in the past. We believe the suggested changes in documentation, publication, evaluation, and funding of research are timely, sensible, and easy to implement. Because we are aware that science is pluralistic in nature and scientists pursue diverse research goals with myriad methods, we do not intend the recommendations as dogma to be applied rigidly and uniformly to every single study, but as ideals to be recognized and used as criteria for evaluating the quality of empirical science.

*Correspondence to: Jens B. Asendorpf, Department of Psychology, Humboldt University, Unter den Linden 6, 10099 Berlin, Germany. E-mail: jens.asendorpf@online.de

This target paper is the result of an Expert Meeting on reducing non-replicable findings in personality research in Trieste, Italy, July 14-16, 2012, financed by the European Association of Personality Psychology (EAPP) in the recognition of the current debate on insufficient replicability in psychology and medicine. The participants of this Expert Meeting served as authors of the current article (the organizer of the meeting as the first author) or as its editor.

MOVING BEYOND THE CURRENT REPLICABILITY DEBATE

In recent years, the replicability of research findings in psychology (but also psychiatry and medicine at large) has been increasingly questioned (Ioannidis, 2005; Lehrer, 2010; Yong, 2012). Whereas current debates in psychology about unreplicable findings often focus on individual misconduct or even outright frauds that occasionally occur in all sciences, the more important questions are which specific factors and which incentives in the system of academic psychology might contribute to the problem (Nosek, Spies, & Motyl, 2012). Discussed are, among others, an underdeveloped culture of making data transparent to others, an overdeveloped culture of encouraging brief, eye-catching research publications that appeal to the media, the absence of incentives to publish high-quality null results, failures to replicate earlier research even when based on stronger data or methodology, and contradictory findings within studies.




Whatever the importance of each such factor might be, current psychological publications are characterized by strong orientation towards confirming hypotheses. In a comparison of publications in 18 empirical research areas, Fanelli (2010) found rates of confirmed hypotheses ranging from 70% (space science) to 92% (psychology and psychiatry), and in a study of historic trends across sciences, Fanelli (2012) reported a particularly sharp increase of the rate for psychology and psychiatry between 1990 and 2007. The current confirmation rate of 92% seems to be far above rates that should be expected, given typical effect sizes and statistical power of psychological studies (see section on Increase Sample Sizes). The rate seems to be inflated by selective nonreporting of nonconfirmations as well as post hoc invention of hypotheses and study designs that do not subject hypotheses to the possibility of refutation. In contrast to the rosy picture presented by publications, in a recent worldwide poll of more than 1000 psychologists, the mean subjectively estimated replication rate of an established research finding was 53% (Fuchs, Jenny, & Fiedler, 2012).

Among many other factors, two widespread habits seem to contribute substantially to the current publication bias: excessive flexibility in data collection and in data analysis. In a poll of more than 2000 psychologists, prevalences of 'deciding whether to collect more data after looking to see whether the results were significant' and 'stopping data collection earlier than planned because one found the result that one had been looking for' were subjectively estimated at 61% and 39%, respectively (John, Loewenstein, & Prelec, 2012). And it is all too easy to apply multiple methods and then selectively pick those generating hypothesis confirmation or interesting findings (e.g. selection of variables and inclusion of covariates, transformation of variables, and details of structural equation models; Simmons, Nelson, & Simonsohn, 2011).

The question of whether there might be something fundamentally wrong with the mainstream statistical null-hypothesis testing approach is more difficult. This has perhaps been best highlighted by publication of the highly implausible precognition results in volume 100 of JPSP (Bem, 2011) that, according to the editor, could not be rejected because this study was conducted according to current methodological standards. In response to this publication, some critics called for Bayesian statistics relying on a priori probabilities (Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011). This is not the only solution, however; treating stimuli as random factors (sampled from a class of possible stimuli, just as participants are sampled from a population) also leaves Bem's findings nonsignificant (refer to Judd, Westfall, & Kenny, 2012, and the later section on a Brunswikian approach to generalizability).

We do not seek here to add to the developing literature on identifying problems in current psychological research practice. Because replicability of findings is at the heart of any empirical science and because nonreplicability is the common thread that runs through most of the current debate, we address the following more constructive question: How can we increase the replicability of research findings in psychology now?

First, we define replicability and distinguish it from data reproducibility and generalizability. Second, we address the replicability concept from a more detailed methodological and statistical point of view. Third, we offer recommendations for increasing replicability at various levels of academic psychology: How can authors, reviewers, editors, journal policies, departments, and granting agencies contribute to improving replicability, what incentives would encourage achieving this goal, what are the implications for teaching psychological science, and how can our recommendations be implemented in everyday practice?

DATA REPRODUCIBILITY, REPLICABILITY, AND GENERALIZABILITY

Given that replicability is not precisely defined in psychology, we propose a definition based on Brunswik's notion of a representative design (Brunswik, 1955) and distinguish the replicability of a research finding from its reproducibility from the same data set as well as from its generalizability.

Reproducibility of a research finding from the same data set is a necessary requirement for replicability. Data reproducibility means that Researcher B (e.g. the reviewer of a paper) obtains exactly the same results (e.g. statistics and parameter estimates) that were originally reported by Researcher A (e.g. the author of that paper) from A's data when following the same methodology.¹ To check reproducibility, Researcher B must have the following: (a) the raw data; (b) the code book (variable names and labels, value labels, and codes for missing data); and (c) knowledge of the analyses that were performed by Researcher A (e.g. the syntax of a statistics program). Whereas (c) can be described to some extent in the method section of a paper, (a), (b), and more details on (c) should either be available on request or, preferably, deposited in an open repository (an open-access online data bank; see www.opendoar.org for an overview of quality-controlled repositories).

¹ Our use of the term reproducibility is aligned with the use in computational sciences but not in some other sciences such as biological science applications where reproducibility is more akin to the concept of replicability used in psychology. Nevertheless, we use the term reproducibility to distinguish it from replicability.

Replicability means that the finding can be obtained with other random samples drawn from a multidimensional space that captures the most important facets of the research design. In psychology, the facets typically include the following: (a) individuals (or dyads or groups); (b) situations (natural or experimental); (c) operationalizations (experimental manipulations, methods, and measures); and (d) time points. Which dimensions are relevant depends on the relevant theory: What constructs are involved, how are they operationalized within the theory underlying the research, and what design is best suited to test for the hypothesized effects? Replication is obtained if differences between the finding in the original Study A and analogous findings in replication Studies B are insubstantial and due to unsystematic error, particularly sampling error, but not to systematic error, particularly differences in the facets of the design.

The key point here is that studies do not sample only participants; they also often sample situations, operationalizations, and time points that can also be affected by sampling error that should be taken into account. By analogy with analysis of variance, all design facets might be considered for treatment as random factors. Although there are sometimes good reasons to assume that a facet is a fixed factor, the alternative of treating it as a random factor is often not even considered (see Judd et al., 2012, for a recent discussion concerning experimental stimuli).
Brunswikian replicability requires that researchers define not only the population of participants but also the universe of situations, operationalizations, and time points relevant to their designs. Although such specification is difficult for situations and operationalizations, specification of any facet of the design is helpful for achieving replicability; the less clear researchers are about the facets of their designs, the more doors are left open for nonreplication.

Generalizability of a research finding means that it does not depend on an originally unmeasured variable that has a systematic effect. In psychology, generalizability is often demonstrated by showing that a potential moderator variable has no effect on a group difference or correlation. For example, student samples often contain a high proportion of women, leaving it unclear to what extent results can be generalized to a population sample of men and women. Generalizability requires replicability but extends the conditions to which the effect applies.

To summarize, data reproducibility is necessary but not sufficient for replicability, and replicability is necessary but not sufficient for generalizability. Thus, if I am claiming a particular finding, it is necessary for reproducibility that this finding can be recovered from my own data by a critical reviewer, but this reviewer may not replicate the finding in another sample. Even if this reviewer can replicate the finding in another sample from the same population, attaining replication, this does not imply that the finding can be easily generalized to other operationalizations of the involved constructs, other situations, or other populations.

Sometimes, replicability is dismissed as an unattainable goal because strict replication is not possible (e.g. any study is performed in a specific historic context that is always changing). This argument is often used to defend business as usual and avoid the problem of nonreplication in current research. But replication, as we define it, is generalization in its most narrow sense (e.g. the findings can be generalized to another sample from the same population). If not even replicability can be shown, generalizability is impossible, and the finding is so specific to one particular circumstance as to be of no practical use. Nevertheless, it is useful to distinguish between exact replicability and broader generalizability because the latter grand perspective requires many studies and ultimately meta-analyses, whereas replicability can be studied much more easily as a first step towards generalizability. In the following, we focus on the concept of exact replicability.

RECOMMENDATIONS FOR STUDY DESIGN AND DATA ANALYSIS

Increasing replicability by decreasing sources of error

Scientists ideally would like to make no errors of inference, that is, they would like to infer from a study a result that is true in the population. If the result is true in the population, a well-powered replication attempt (as discussed later) will likely confirm it. The issue of replicability can thus be approached by focusing on the status of the inference in the initial study, the logic being that correct inferences are likely to be replicated in subsequent studies.

Within a null-hypothesis significance testing approach that is only concerned with whether an effect can be attributed to chance or not, there are two types of errors: rejecting the null hypothesis when it is true (false positive, α) and failing to reject it when it is false (false negative, β). These two types of errors can be best understood from the perspective of power (Cohen, 1988). The power of a statistical test is the probability of rejecting the null hypothesis when it is false, or the complement of the false-negative error (1 − β). Its value depends on sample size, effect size, and α level. Within this framework, there is a negative relation between the two types of error: Given effect and sample sizes, reducing one type of error comes at the cost of increasing the other type of error. This may give the misleading impression that one has to choose between the two types of errors when planning a study. Instead, it is possible to minimize both types of errors simultaneously by increasing statistical power (Maxwell, Kelley, & Rausch, 2008). Replicable results are more likely when power is high, so the key question becomes identifying the factors that increase statistical power. The answer is simple: For any chosen α level, statistical power goes up as effect sizes and sample sizes increase.

Instead of the null-hypothesis significance testing, one can adopt a statistical approach emphasizing parameter estimation. Within this alternative approach, there is a third type of error: inaccuracy of parameter estimation (Kelley & Maxwell, 2003; Maxwell et al., 2008). The larger the confidence interval (CI) around a parameter estimate, the less certain one can be that the estimate approximates the corresponding true population parameter. Replicable effects are more likely with smaller CIs around the parameter estimates in the initial study, so the key question becomes identifying the factors that decrease CIs. Again the answer is simple: The width of a CI increases with the standard deviation of the parameter estimate and decreases with sample size (Cumming & Finch, 2005).

Increase sample size

These considerations have one clear implication for attempts to increase replicability. All else equal, statistical power goes up and CI width goes down with larger sample size. Therefore, results obtained with larger samples are more likely to be replicable than those obtained with smaller ones. This has been said many times before (e.g. Cohen, 1962; Tversky & Kahneman, 1971), but reviews have shown little improvement in the typical sample sizes used in psychological studies. Median sample sizes in representative journals are around 40, and average effect sizes found in meta-analyses in psychology are around d = 0.50, which means that the typical power in the field is around .35 (Bakker, Van Dijk, & Wicherts, 2012). These estimates vary, of course, with the subdiscipline. For example, Fraley and Marks (2007) did a meta-analysis of correlational personality studies and found the median effect size to be r = .21 (d = 0.43) for a median of 120 participants, resulting in a power of .65, a little better, but still far from ideal.

Consequently, if all effects reported in published studies were true, only 35% would be replicable in similarly underpowered studies. However, the rate of confirmed hypotheses in current psychological publications is above 90% (Fanelli, 2010). Among other factors, publishing many low-powered studies contributes to this excessive false-positive bias. It cannot be stressed enough that researchers should collect bigger sample sizes, and editors, reviewers, and readers should insist on them.
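
To make these power and precision figures concrete, here is a small illustrative sketch (ours, not part of the original article) using a normal approximation for a two-group comparison; the function names are invented for this example. With the field-typical 20 participants per group and d = 0.50 it reproduces a power of roughly .35, shows the sample size needed to reach .80, and illustrates how the CI for a correlation narrows as n grows.

```python
import numpy as np
from scipy.stats import norm

def power_two_group(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sample test of effect size d (normal approximation)."""
    ncp = d * np.sqrt(n_per_group / 2)        # noncentrality of the test statistic
    z_crit = norm.ppf(1 - alpha / 2)          # two-tailed critical value
    return 1 - norm.cdf(z_crit - ncp)

def n_per_group_for_power(d, power=0.80, alpha=0.05):
    """Smallest n per group reaching the requested power, found by simple search."""
    n = 2
    while power_two_group(d, n, alpha) < power:
        n += 1
    return n

def r_ci_halfwidth(r, n, conf=0.95):
    """Upper half-width of the CI for a correlation, via the Fisher z transformation."""
    z_half = norm.ppf(0.5 + conf / 2) / np.sqrt(n - 3)
    return np.tanh(np.arctanh(r) + z_half) - r

print(round(power_two_group(0.50, 20), 2))      # ~0.35: the typical power cited above
print(n_per_group_for_power(0.50))              # ~63 per group (an exact t test gives about 64)
print(round(r_ci_halfwidth(0.21, 120), 2))      # CI half-width at the median n of 120
print(round(r_ci_halfwidth(0.21, 480), 2))      # the same CI shrinks as n quadruples
```

An exact t-test calculation gives essentially the same numbers; the point is only that both power and CI width are direct functions of sample size.
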


Planning a study by focusing on its power is not equivalent to focusing on its accuracy and can lead to different results and decisions (Kelley & Rausch, 2006). For example, for regression coefficients, precision of a parameter estimate depends on sample size, but it is mostly unaffected by effect size, whereas power is affected by both (Kelley & Maxwell, 2003; Figure 2). Therefore, a focus on power suggests larger sample sizes for small effects and smaller ones for large effects compared with a focus on accuracy. The two approaches emphasize different questions (Can the parameter estimate be confidently tested against the null hypothesis? Is the parameter estimate sufficiently accurate?). Both have merits, and systematic use would be an important step in increasing replicability of results. An optimal approach could be to consider them together to achieve both good statistical power and CIs that are sufficiently narrow.

Last but not least, this emphasis on sample size should not hinder exploratory research. Exploratory studies can be based on relatively small samples. This is the whole point, for example, of pilot studies, although studies labelled as such are not generally publishable. However, once an effect is found, it should be replicated in a larger sample to provide empirical evidence that it is unlikely to be a false positive and to estimate the involved parameters more accurately.

Increase reliability of the measures

Larger sample size is not the only factor that decreases error. The two most common estimators of effect size (Cohen's d and Pearson's r) both have standard deviations in their denominators; hence, all else equal, effect sizes go up and CIs and standard errors down with decreasing standard deviations. Because standard deviation is the square root of variance, the question becomes how can measure variance be reduced without restricting true variation? The answer is that measure variance that can be attributed to error should be reduced. This can be accomplished by increasing measure reliability, which is defined as the proportion of measure variation attributable to true variation. All else equal, more reliable measures have less measurement error and thus increase replicability.
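
As a concrete illustration of how measurement error shrinks observed effects, the following sketch applies the classical Spearman attenuation formula, a textbook psychometric result rather than a procedure proposed in this article; the reliability values are invented for illustration.

```python
import numpy as np

def attenuated_r(r_true, rel_x, rel_y):
    """Correlation expected to be observed when both measures contain random error
    (classical attenuation formula: r_obs = r_true * sqrt(rel_x * rel_y))."""
    return r_true * np.sqrt(rel_x * rel_y)

# A true correlation of .30 measured with two scales of reliability .60
# is expected to appear as roughly .18, with correspondingly lower power.
print(round(attenuated_r(0.30, 0.60, 0.60), 2))   # 0.18
# With reliabilities of .90, the same true effect appears as about .27.
print(round(attenuated_r(0.30, 0.90, 0.90), 2))   # 0.27
```

Because the attenuated correlation is what a study actually observes, lower reliability translates directly into lower power and wider CIs for a fixed sample size.
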
Increase study design sensitivity

Another way of decreasing error variance without restricting true variation is better control over methodological sources of errors (study design sensitivity, Lipsey & Hurley, 2009). This means distinguishing between systematic and random errors. Random errors have no explanation, so it is difficult to act upon them. Systematic errors have an identifiable source, so their effects can potentially be eliminated and/or quantified. It is possible to reduce systematic errors using clear and standardized instructions, paying attention to questionnaire administration conditions, and using stronger manipulations in experimental designs. These techniques do, however, potentially limit generalizability.

Increase adequacy of statistical analyses

Error can also be decreased by using statistical analyses better suited to the study design. This includes testing appropriateness of method-required assumptions, treating stimuli as random rather than fixed factors (Judd et al., 2012), respecting dependence within the data (e.g. in analyses of dyads, Kenny, Kashy, & Cook, 2006, or hierarchically nested data, Hox, 2010), and removing the influences of covariates, given appropriate theoretical rationale (Lee, 2012).

Avoid multiple underpowered studies

It is commonly believed that one way to increase replicability is to present multiple studies. If an effect can be shown in different studies, even though each one may be underpowered, many readers, reviewers, and editors conclude that it is robust and replicable. Schimmack (2012), however, has noted that the opposite can be true. A study with low power is, by definition, unlikely to obtain a significant result with a given effect size. Unlikely events sometimes happen, and underpowered studies may occasionally obtain significant results. But a series of such results begins to strain credulity. In fact, a series of underpowered studies with the same result is so unlikely that the whole pattern of results becomes literally incredible. It suggests the existence of unreported studies showing no effect. Even more, however, it suggests sampling and design biases. Such problems are very common in many recently published studies.

Consider error introduced by multiple testing

When a study involves many variables and their interrelations, following the aforementioned recommendations becomes more complicated. As shown by Maxwell (2004), the likelihood that some among multiple variables will show significant relations with another variable is higher with underpowered studies, although the likelihood that any specific variable will show a significant relation with another specific variable is smaller. Consequently, the literature is scattered with inconsistent results because underpowered studies produce different sets of significant (or nonsignificant) relations between variables. Even worse, it is polluted by single studies reporting overestimated effect sizes, a problem aggravated by the confirmation bias in publication and a tendency to reframe studies post hoc to feature whatever results came out significant (Bem, 2000). The result is 'a waste of effort and resources in trying and failing to replicate a certain result' (Maxwell, 2004, p. 160), not to mention the problems created by reliance on misinformation.

Contrary to commonly held beliefs, corrections for multiple testing such as (stepwise) Bonferroni procedures do not solve the problem and may actually make things worse because they diminish statistical power (Nakagawa, 2004). Better procedures exist and have gained substantial popularity in several scientific fields, although they are still very rarely used in psychology. At an overall level, random permutation tests (Sherman & Funder, 2009) provide a means to determine whether a set of correlations is unlikely to be due to chance. At the level of specific variables, false discovery rate procedures (Benjamini & Hochberg, 1995) strike better compromises between false positives and false negatives than Bonferroni procedures. We recommend that these modern variants also be adopted in psychology. But even these procedures do not completely solve the problem of multiple testing. Nonstatistical solutions are required, such as the explicit separation of a priori hypotheses preregistered in a repository from exploratory post hoc hypotheses (section on Implementation).
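
As an illustration of the false discovery rate approach recommended here, the following sketch implements the Benjamini and Hochberg (1995) step-up procedure for a set of p values (our code; the p values are invented, and ready-made implementations exist in standard statistics packages).

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Boolean mask of hypotheses rejected while controlling the false discovery
    rate at level q (Benjamini & Hochberg, 1995, step-up procedure)."""
    p = np.asarray(p_values, float)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m      # critical value for the i-th smallest p
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()            # largest rank still under its threshold
        rejected[order[:k + 1]] = True            # reject that p value and all smaller ones
    return rejected

# Ten invented p values, e.g. from scanning a correlation matrix.
p_vals = [0.001, 0.008, 0.012, 0.030, 0.041, 0.049, 0.100, 0.220, 0.540, 0.810]
print(benjamini_hochberg(p_vals))
# A Bonferroni cut-off of 0.05/10 = 0.005 would keep only the smallest p value;
# the FDR procedure here retains the three smallest while still controlling error.
```
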


Is a result replicated?

Establishing whether a finding is quantitatively replicated is more complex than it might appear (Valentine et al., 2011). A simple way to examine replicability is to tabulate whether the key parameters are statistically significant in original and replication studies (vote counting). This narrow definition has the advantage of simplicity but can lead to misleading conclusions. It is based on a coarse dichotomy that does not acknowledge situations such as p = .049 (initial study) and p = .051 (second study). It can also be misleading if replication studies are underpowered, making nonreplication of an initial finding more likely. A series of underpowered or otherwise faulty studies that do not replicate an initial finding do not allow the conclusion that the initial finding was not replicable. Moreover, statistical significance is not the only property involved. The size of the effect matters too. When two studies both show significant effects, but effect sizes are very different, has the effect been replicated?

More useful from a replicability perspective is a quantitative comparison of the CIs of the key parameters. If the key parameter (e.g. a correlation) of the replication study falls within the CI of the initial study (or if the two CIs overlap substantially, Cumming & Finch, 2005), one can argue more strongly that the result is replicated. But again, the usefulness of this method depends on study power, including that of the initial study. For instance, suppose that an initial study with 70 participants has found a correlation between two measures of r = .25 [0.02, 0.46], which is significant at p = .037. A high-powered replication study of 1000 participants finds a correlation of r = .05 [-0.01, 0.11], which besides being trivial is not significant (p = .114). A formal comparison of the two results would show that the correlation in the second study falls within the CI of the first study (Z = 1.63, p = .104). One might therefore conclude that the initial result has been replicated. However, this has only occurred because the CI of the initial study was so large. In this specific case, a vote counting approach would be better.
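
The comparison in this example can be reproduced with a few lines of code. The sketch below (ours) uses the standard Fisher r-to-z approach to compute the CI of each study and a z test of the difference between the two correlations; it returns the values quoted above.

```python
import numpy as np
from scipy.stats import norm

def r_confidence_interval(r, n, conf=0.95):
    """CI for a correlation via the Fisher z transformation."""
    z, se = np.arctanh(r), 1 / np.sqrt(n - 3)
    crit = norm.ppf(0.5 + conf / 2)
    return np.tanh(z - crit * se), np.tanh(z + crit * se)

def compare_correlations(r1, n1, r2, n2):
    """z test of the difference between two independent correlations."""
    diff = np.arctanh(r1) - np.arctanh(r2)
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = diff / se
    return z, 2 * (1 - norm.cdf(abs(z)))

print(np.round(r_confidence_interval(0.25, 70), 2))    # [0.02, 0.46] for the initial study
print(np.round(r_confidence_interval(0.05, 1000), 2))  # [-0.01, 0.11] for the replication
z, p = compare_correlations(0.25, 70, 0.05, 1000)
print(round(z, 2), round(p, 3))                        # Z = 1.63, p = .104
```
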
The logic of quantitative comparison can be pushed further if effect sizes from more than two studies are compared (Valentine et al., 2011, p. 109). This basically means running a small meta-analysis in which the weighted average effect size is calculated and study heterogeneity is examined; if heterogeneity is minimal, one can conclude that the subsequent studies have replicated the initial study. However, the statistical power of heterogeneity tests is quite low for small samples, so the heterogeneity test result should be interpreted cautiously. Nonetheless, we recommend the meta-analytic approach for evaluation of replicability even when not many replication studies exist because it helps to focus attention on the size of an effect and the (un)certainty associated with its estimate.

In the long run, psychology will benefit if the emphasis is gradually shifted from whether an effect exists (an initial stage of research) to the size of the effect (a hallmark of a cumulative science). Given that no single approach to establish replicability is without limits, however, the use of multiple inferential strategies along the lines suggested by Valentine et al. (2011, especially Table 1) is a better approach. In practice, this means summarizing results by answering four questions:

(a) Do the studies agree about direction of effect?
(b) What is the pattern of statistical significance?
(c) Is the effect size from the subsequent studies within the CI of the first study?
(d) Which facets of the design should be considered fixed factors, and which random factors?
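
For the multi-study case, the small meta-analysis described above can be sketched as follows (our illustration with invented study results): a fixed-effect, inverse-variance weighted average of Fisher-z effect sizes plus Cochran's Q as a rough heterogeneity check, keeping in mind the caveat that Q has little power with few small studies.

```python
import numpy as np
from scipy.stats import chi2

def mini_meta(rs, ns):
    """Fixed-effect summary of several correlations plus Cochran's Q for heterogeneity."""
    zs = np.arctanh(np.asarray(rs, float))     # Fisher z effect sizes
    w = np.asarray(ns, float) - 3.0            # inverse-variance weights (var = 1/(n - 3))
    z_mean = np.sum(w * zs) / np.sum(w)        # weighted average effect
    se = 1.0 / np.sqrt(np.sum(w))
    ci = np.tanh([z_mean - 1.96 * se, z_mean + 1.96 * se])
    q = np.sum(w * (zs - z_mean) ** 2)         # heterogeneity statistic, df = k - 1
    p_q = 1 - chi2.cdf(q, df=len(rs) - 1)
    return np.tanh(z_mean), ci, q, p_q

# Invented example: an initial study plus two replications.
r_pooled, ci, q, p_q = mini_meta([0.25, 0.12, 0.18], [70, 200, 150])
print(round(r_pooled, 2), np.round(ci, 2))     # pooled correlation and its CI
print(round(q, 2), round(p_q, 2))              # Q near its df suggests little heterogeneity
```
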


RECOMMENDATIONS FOR THE PUBLICATION PROCESS

Authors

Authors of scientific publications often receive considerable credit for their work but also take responsibility for the veracity of what is reported. Authors should also, in our view, take responsibility for assessing the replicability of the research they publish. We propose that an increase in replicability of research can be achieved if, in their role as prospective authors of a scientific article, psychologists address the following two main questions: (1) How does our treatment of this research contribute to increasing the transparency of psychological research? (2) How does this research contribute to an acceleration of scientific progress in psychology? We propose that answering these questions for oneself become an integral part of one's research and of authoring a scientific article. We briefly elaborate on each question and propose steps that could be taken in answering them. Implementing some of these steps will require some cooperation with journals and other publication outlets.

Increasing research transparency

(a) Provide a comprehensive (literature) review. We encourage researchers to report details of the replication status of key prior studies underlying their research. Details of exact replication studies should be reported whether they did or did not support the original study. Ideally, this should include information on pilot studies where available.
(b) Report sample size decisions. So that the research procedure can be made transparent, it is important that researchers provide a priori justification for sample sizes used. Examples of relevant criteria are the use of power analysis or minimum sample size based on accepted good practice (see for further discussion Tressoldi, 2012). The practice of gradually accumulating additional participants until a statistically significant effect is obtained is unacceptable given its known tendency to generate false-positive results.
(c) Preregister research predictions. Where researchers have strong predictions, these and the analysis plan for testing them should be registered prior to commencing the research (section on Implementation). Such preregistered predictions should be labelled as such in the research reports and might be considered additional markers of quality. Preregistration is, for example, a precondition for publication of randomized controlled trials in major medical journals.
(d) Publish materials, data, and analysis scripts. Most of all, we recommend that researchers think of publication as requiring more than a PDF of the final text of an article. Rather, a publication includes all written materials, data, and analysis scripts used to generate tables, figures, and statistical inferences. A simple first step in improving trust in research findings would be for all authors to indicate that they had seen the data. If practically possible, the materials, data, and analysis scripts should be made available in addition to the final article so that other researchers can reproduce the reported findings or test alternative explanations (Buckheit & Donoho, 1995). The information can be made available through open-access sources on the internet. There is a broad range of options: repositories housed at the author's institution or personal website, a website serving a group of scientists with a shared interest, or a journal website (section on Implementation). Options are likely to vary in degree of technical sophistication.

Accelerate scientific progress

(a) Publish working papers. We recommend that authors make working papers describing their research publicly available along with their research materials. To increase scientific debate and transparency of the empirical body of results, prepublications can be posted in online repositories (section on Implementation). The most prominent preprint archive related to psychology is the Social Science Research Network (http://ssrn.com/).
(b) Conduct replications. Where feasible, researchers should attempt to replicate their own findings prior to first publication. Exact replication in distinct samples is of great value in helping others to build upon solid findings and avoiding dead ends. Replicated findings are the stuff of cumulative scientific progress. Conducting generalizability studies is also strongly encouraged to establish theoretical understanding. Replication by independent groups of researchers is particularly encouraged and can be aided by increasing transparency (see the earlier recommendations).
(c) Engage in scientific debate in online discussion forums. To increase exchange among individual researchers and research units, we advocate open discussion of study results both prior to and after publication. Learning about each other's results without the publication time lag and receiving early feedback on studies create an environment that makes replications easy to conduct and especially valuable for the original researchers. After study publication, such forums could be places to make additional details of study design publicly available. This proposal could be realized in the same context as recommendation 1(d).

Reviewers, editors, and journals

Researchers do not operate in isolation but in research environments that can either help or hinder application of good practices. Whether they will adopt the recommendations in the previous section will depend on whether the research environments in which they operate reinforce or punish these practices. Important aspects of the research landscape are the peer reviewers and editors that evaluate research reports and the journals that disseminate them. So that replicability can be increased, reviewers, editors, and journals should allow for and encourage the implementation of good research practices.

Do not discourage maintenance of good practices

Reviewers and editors should accept not only papers with positive results that perfectly confirm the hypotheses stated in the introduction. Holding the perfectly confirmatory paper as the gold standard impedes transparency regarding nonreplications and encourages use of data analytic and other techniques that contort the actual data, as well as study designs that cannot actually refute hypotheses. Reviewers and editors should publish robustly executed studies that include null findings or results that run counter to the hypotheses stated in their introductions.

Importantly, such tolerance for imperfection can augment rather than detract from the scientific quality of a journal. Seemingly perfectly consistent studies are often less informative than papers with occasional unexpected results if they are underpowered. When a paper contains only one perfect but underpowered demonstration of an effect, high-powered replication studies are needed before much credibility can be given to the observed effect. The fact that a paper contains many underpowered studies that all perfectly confirm the hypotheses can be an indication that something is wrong (Schimmack, 2012).

For example, if an article reports 10 successful confirmations of an (actually true) finding in studies, each with a power of .60, the probability that all of the studies could have achieved statistical significance is less than 1%. This probability is itself a significant result that, in a more conventional context, would be used to reject the hypothesis that the result is plausible (Schimmack, 2012).
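
The arithmetic behind this example is simple: .60 raised to the 10th power is about .006. The sketch below (ours, a simplified stand-in for Schimmack's incredibility analysis) generalizes the check with a binomial model of how many significant results a set of studies with a given power should plausibly produce.

```python
from scipy.stats import binom

def prob_all_significant(power, n_studies):
    """Probability that every one of n_studies independent studies reaches significance."""
    return power ** n_studies

def prob_at_least(power, n_studies, n_significant):
    """Probability of at least n_significant significant results among n_studies
    studies that each have the stated power (binomial model)."""
    return binom.sf(n_significant - 1, n_studies, power)

print(round(prob_all_significant(0.60, 10), 4))   # 0.006: the 'less than 1%' noted above
print(round(prob_at_least(0.60, 10, 10), 4))      # the same figure via the binomial model
```
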
We do not mean to imply that reviewers and editors should consistently prefer papers with result inconsistencies. When effects are strong and uniform, results tend to be consistent. But most psychological effects are not strong or uniform. Studies with result inconsistencies help to identify the conditions under which effects vary. Low publication tolerance for them impedes scientific progress, discourages researchers from adopting good research practices, and ultimately reduces a journal's scientific merits.

There are several other subtle ways in which actions of reviewers, editors, and journals can discourage researchers from maintaining good practices. For instance, because of copyright considerations, some journals might prevent authors from making working papers freely available. Such policies hinder transparency.

Proactively encourage maintenance of good practices

Journals could allow reviewers to discuss a paper openly with its authors (including access to raw data). Reviewers who do so could be given credit (e.g. by mentioning the reviewer's name in the publication). Journals could also give explicit credit (e.g. via badges or special journal sections) to authors who engaged in good practices (e.g. preregistration of hypotheses). Also, they could allow authors to share their reviews with editors from other journals (and vice versa). This encourages openness and debate. It is likely to improve the review process by giving editors immediate access to
prior reviews, helping them to decide on the merits of the work or guiding collection of additional reviews.

As part of the submission process, journals could require authors to confirm that the raw data are available for inspection (or to stipulate why data are not available). Likewise, co-authors could be asked to confirm that they have seen the raw data and reviewed the submitted version of the paper. Such policies are likely to encourage transparency and prevent cases of data fabrication by one of the authors. Related to this, reviewers and editors can make sure that enough information is provided to allow tests of reproducibility and replicability. To facilitate communication of information and minimize journal space requirements, authors can be allowed to refer to supplementary online materials.

Journals could also explicitly reserve space for reports of failures to replicate existing findings. At minimum, editors should revoke any explicit policies that discourage or prohibit publication of replication studies. Editors should also recognize a responsibility to publish important replication studies, especially when they involve studies that were originally published in their journals. Editors and journals can go even further by launching calls to replicate important but controversial findings. To encourage researchers to respond to such calls, editors can offer guarantees of publication (i.e. regardless of results) provided that there is agreement on method before the study is conducted (e.g. sufficient statistical power).

Recommendations for teachers of research methods and statistics

A solid methodological education provides the basis for a reliable and valid science. At the moment, (under)graduate teaching of research methods and statistics in psychology is overly focused on the analysis and interpretation of single studies, and relatively little attention is given to the issue of replicability. Specifically, the main goals in many statistical and methodological textbooks are to teach assessing the validity of and analysing the data from individual studies using null-hypothesis significance testing. Undergraduate and even graduate statistical education are based almost exclusively on rote methods for carrying out this framework. Almost no conceptual background is offered, and rarely is it mentioned that null-hypothesis testing is controversial and has a chequered history and that other approaches are available (Gigerenzer et al., 1989).

We propose that an increase in research replicability can be achieved if, in their role as teachers, psychologists pursue the following goals (in order of increasing generality): (1) introduce and consolidate statistical constructs necessary to understand the concept of replicable science; (2) encourage critical thinking and exposing hypotheses to refutation rather than seeking evidence to confirm them; and (3) establish a scientific culture of getting it right instead of getting it published. This will create a basis for transparent and replicable research in the future. In the following, we describe each of these goals in more detail and propose exemplary steps that could be taken.

Establish a scientific culture of getting it right in the classroom

The most important thing that a supervisor/teacher can do is establish a standard of good practice that values soundness of research over publishability. This creates a research environment in which reproducible and replicable findings can be created (Nosek et al., 2012).

Teach concepts necessary to understand replicable science

(a) Teach and practice rigorous methodology by focusing on multiple experiments. This entails stressing the importance of a priori power estimates and sizes of effects in relation to standard errors (i.e. CIs) rather than outcomes of significance testing. Students should also learn to appreciate the value of nonsignificant findings in sufficiently powerful and rigorously conducted studies. Finally, students need to realize that multiple studies of the same effect, under the same or highly similar designs and with highly similar samples, may have divergent outcomes simply as a result of chance but also because of substantively or methodologically important differences.
(b) Encourage transparency. To stimulate accurate documentation and reproducibility, students should be introduced to online systems to archive data and analysis scripts (section on Implementation) and taught best practices in research (Recommendations for Authors section). So that the value of replication of statistical analyses can be taught, students should reanalyse raw data from published studies.
(c) Conduct replication studies in experimental methods classes. One practical way to increase awareness of the importance of transparent science and the value of replications is to make replication studies essential parts of classes. By conducting their own replication studies, students have the chance to see which information is necessary to conduct a replication and experience the importance of accuracy in setting up, analysing, and reporting experiments (see Frank & Saxe, 2012, for further discussion of the advantages that accompany implementation of replication studies in class). Any failures to replicate the experience will reinforce its importance.

Critical thinking

(a) Critical reading. Learning to see the advantages and also flaws of a design, analysis, or interpretation of data is an essential step in the education of young researchers. Teachers should lead their students to ask critical questions when reading scientific papers (i.e. Do I find all the necessary information to replicate that finding? Is the research well embedded in relevant theories and previous results? Are methods used that allow a direct investigation of the hypothesis? Did the researchers interpret the results appropriately?). To develop skills to assess research outcomes of multiple studies critically, students should be taught to review well-known results from the literature that were later replicated successfully and unsuccessfully.
(b) Critical evaluation of evidence (single-study level). Students should become more aware of the degree to which sampling error affects study outcomes by learning
how to interpret effect sizes and CIs correctly by means of examples. A didactical approach focused on multiple studies is well suited to explaining relevant issues of generalizability, statistical power, sampling theory, and replicability even at the undergraduate level. It is important to make clear that a single study generally represents only preliminary evidence in favour of or against a hypothesized effect.

Students should also become aware that statistical tools are not robust to the following: (1) optional stopping (adding more cases depending on the outcome of preliminary analyses); (2) data fishing; (3) deletion of cases or outliers for arbitrary reasons; and (4) other common tricks to reach significance (Simmons et al., 2011).

(c) Critical evaluation of evidence (multistudy level). At the graduate level, students should be taught the importance of meta-analysis as a source for effect size estimates and a tool to shed light on moderation of effects across studies and study homogeneity. Problems associated with these estimates (e.g. publication biases that inflate outcomes reported) must also be discussed to promote critical evaluation of reported results.

RECOMMENDATIONS FOR INSTITUTIONAL INCENTIVES

The recommended changes described earlier would go a long way to changing the culture of psychological science if implemented voluntarily by psychological scientists as researchers, editors, and teachers. If researchers adopt good research practices such as being more transparent in approach, submitting and tolerating more null findings, focusing more on calibrating estimation of effects rather than null-hypothesis significance testing, and communicating the need for doing so to students, the culture will naturally accommodate the new values. That said, we are sceptical that these changes will be adopted under the current incentive structures. Therefore, we also call upon the key institutions involved in the creation, funding, and dissemination of psychological research to reform structural incentives that presently support problematic research approaches.

Focus on quality instead of quantity of publications

Currently, the incentive structure primarily rewards publication of a large number of papers in prestigious journals. The sheer number of publications and journal impact factors often seem more important to institutional decisions than their content or relevance. Hiring decisions are often made on this basis. Grant awards are, in part, based on the same criteria. Promotion decisions are often predicated on publications and the awarding of grants. Some might argue that research innovation, creativity, and novelty are figured into these incentives, but if judgment of innovativeness, creativity, and novelty is based on publications in journals that accept questionable research practices, then publication quantity is the underlying indirect incentive. Given its current bias against producing null findings and emphasis on flashy and nonreplicable research, this does not serve our science well.

Therefore, we believe that the desirable changes on the parts of researchers, reviewers/editors/journals, and teachers that we described earlier need to be supplemented by changes in the incentive structures of supporting institutions. We consider incentives at three institutional levels: granting agencies, tenure committees, and the guild of psychologists itself.

Use funding decisions to support good research practices

Granting agencies could carry out the first, most effective change. They could insist upon direct replication of research funded by taxpayer money. Given the missions of granting agencies, which are often to support genuine (and thus reliable) scientific discoveries and creation of knowledge, we believe that granting agencies should not only desire but also promote replication of funded research.

One possibility is to follow an example set in medical research, where a private organization has been created with the sole purpose of directly replicating clinically relevant findings (Zimmer, 2012). Researchers in medicine who discover a possible treatment pay a small percentage of their original costs for another group to replicate the original study. Given the limited resources dedicated to social science research, a private endeavour may not be feasible. However, granting agencies could do two things to facilitate direct replication. First, they could mandate replication, either by requiring that a certain percentage of the budget of any given grant be set aside to pay a third party to replicate key studies in the programme of research or by funding their own consortium of researchers contracted to carry out direct replications. Second, granting agency decisions should be based on quality-based rather than quantity-based assessment of the scientific achievements of applicants. Junior researchers would particularly benefit from a policy that focuses on the quality of an applicant's research and the soundness of a research proposal. The national German funding agency recently changed its rules to allow not more than five papers to be cited as reference for evaluation of an applicant's ability to do research.

Additionally, attention should be paid to the publication traditions in various subdisciplines. Some subdisciplines are characterized by a greater number of smaller papers, which may inflate the apparent track records of researchers in those areas relative to those in subdisciplines with traditions of larger and more theoretically elaborated publications.

Revise tenure standards

We recommend that tenure and promotion policies at universities and colleges be changed to reward researchers who emphasize both reproducibility and replication (King, 1995). Some may argue that tenure committees do weigh quality of research in addition to overall productivity. Unfortunately, quality is often equated with journal reputation. Given that many of the most highly esteemed journals in our field openly disdain direct replication, discourage publication of
null findings, tolerate underpowered research, and/or rely on short reports, one can question whether journal reputation is a sound quality criterion. Because number of publications weighted by journal reputation is also used in evaluating grants, it also promotes another widely accepted criterion for promotion: acquisition of external funding.

King (1995) argued that researchers should also get credit for creating and disseminating data sets in ways that the results can be replicated and extended by other researchers (also King, 2006). To the extent that research becomes more replicable and replication is rewarded, tenure committees could also consider the extent to which researchers' work is replicated by others (Hartshorne & Schachner, 2012).

Conversely, tenure and promotion committees should not punish assistant professors for failing to replicate irreproducible research. If a young assistant professor is inspired by a recent publication to pursue a new line of research only to find that the original result cannot be replicated because the study was unsound, most evaluation committees will see this as a waste of time and effort. The assistant professor will look less productive than others, who, ironically, may be pursuing questionable research strategies to produce the number of publications necessary for tenure. The tragedy of the current system is that years of human capital and knowledge are spent on studies that produce null findings simply because they are based on studies that should not have been published in the first place. The problem here lies not with the replication efforts. On the contrary, creatively disconfirming existing theoretical ideas based on nonreplicable findings is at least as important as producing new ideas, and universities and colleges could acknowledge this by rewarding publication of null findings as much as those of significance.

One consequence of these proposed incentives for promotion and tenure would be to change the way tenure committees go about their work. Rather than relying on cursory reviews by overworked letter writers or arbitrary criteria, such as numbers of publications in the top journals, tenure committees may have to spend more time reading a candidate's actual publications to determine their quality. For example, Wachtel (1980) recommended that researchers be evaluated on a few of their best papers, rather than CV length. This type of evaluation would, of course, demand that the members of tenure committees be sufficiently knowledgeable about the topic to discuss the nature and approach of the research described.

Change informal incentives

Finally, informal incentives within our guilds need to change for our scientific practices to change. When we discuss problematic research, we are not referring to abstract cases, but rather to the research of colleagues and friends. Few researchers want to produce research that contradicts the work of their peers. For that matter, few of us want to see failures to replicate our own research. The situation is even worse for assistant professors or graduate students. Should they even attempt to publish a study that fails to replicate an eminent scientist's finding? The scientist who one day will most likely weigh in on their tenure prospects? In the current research environment, that could indeed hamper their careers. Unless our entire guild becomes more comfortable with nonreplicated findings as an integral part of improving future replicability, the disincentives to change will outweigh the incentives. We hope that one effect of this document is to increase the value of identifying replicable research.

IMPLEMENTATION

Recommendations aim for implementation. However, even when awareness of importance is high and practical improvements identified, changing behaviour is hard. This is particularly true if implementing improvements adds time, effort, and resources to existing workflow. Researchers are already busy, and incentive structures for how to spend one's time are well defined. They are unlikely to invest in additional work unless that work is essential for desired rewards. However, strong incentives for good research practices can be implemented. For example, funders have strong leverage. If they require publishing data in repositories as a condition of funding, then researchers will follow through because earning grants is a strong incentive for researchers. Likewise, journals and editors can impose improvements. They may not be able to do so singlehandedly though. If the resource costs imposed exceed the perceived value of publishing in a journal, authors may abandon that journal and publish elsewhere.

Practical improvements cannot rely solely on appealing to scientists' values or pressures imposed by institutions. A researcher might agree that sharing data and study materials is a good thing, but if sharing is difficult to achieve, then it is not in the researcher's self-interest to do it. Practicalities affect success in implementing individual behavioural change. Ensuring success thus requires attention to the infrastructure and procedures required to implement the improvements.

The Internet is a mechanism for sharing of materials and data that addresses some of the practical barriers. But its existence is not sufficient. A system is needed that does the following: (a) makes it extremely simple to archive and document research projects and data; (b) provides a shared environment so that people know where to go to deposit and retrieve the materials; (c) integrates with the researcher's own documentation, archiving, and collaboration practices; and (d) offers flexibility to cover variation in research applications and sensitivity to ethical requirements. This might include options of no sharing, sharing only with collaborators, sharing by permission only, and sharing publicly without restriction.

Ways to accomplish this are emerging rapidly. They differ in scope, degree of organization, technical sophistication, long-term perspective, and whether they are commercial or nonprofit ventures. We present a few of them at different levels of scope, without any claim of comprehensive or representative coverage. They illustrate the various levels of engagement already possible.

In Europe, there are two large projects with the mission to enable and support digital research across all of the humanities and social sciences: Common Language
Resources and Technology Infrastructure (http://www. CONCLUSION


clarin.eu/), nanced by the European Seventh Framework
programme, and Digital Research Infrastructure for the Arts A well-known adage of psychometrics is that measures must
and the Humanities (http://www.dariah.eu/). These aim to be reliable to be valid. This is true for the overall scientic
provide resources to enhance and support digitally enabled enterprise as well; only, the reliability of results is termed
research, in elds including psychology. The goal of these replicability. If results are not replicable, subsequent studies
programmes is to secure long-term archiving and access to addressing the same research question with similar methods
research materials and results. will produce diverging results supporting different conclu-
Unconstrained topically and geographically, the com- sions. Replicability is a prerequisite for valid conclusions.
mercial venture Figshare (http://gshare.com/) offers an This is what we meant by our opening statement that repli-
easy user interface for posting, sharing, and nding cability of ndings is at the heart of any empirical science.
research materials of any kind. Likewise, public ventures We have presented various proposals to improve the replica-
such as Dataverse (http://thedata.org/) address parts of bility of psychology studies. One cluster of these proposals
the infrastructure challenges by making it easy to upload could be called technical: improve the replicability of our
and share data. And the for-prot Social Science Research ndings through larger samples and more reliable measures,
Network (http://www.ssrn.com/) is devoted to the rapid so that CIs become smaller and estimates more precise. A
dissemination of social science research manuscripts. second cluster of proposals pertains more to the culture
There are study registries, such as http://clinicaltrials. within academia: Researchers should avoid temptation to
gov/, but they are mostly available for clinical trial misuse the inevitable noise in data to cherry-pick results
research in medicine thus far. The fMRI Data Center that seem easily publishable, for example because they
(http://www.fmridc.org/f/fmridc) in neurosciences and appear sexy or unexpected. Instead, research should be
CHILDES (http://childes.psy.cmu.edu/) for child-language about interpretation of broad and robust patterns of data
development provide data sharing and aggregation solutions and about deriving explanations that have meaning within
for particular subdisciplines. There are also groups orga- networks of existing theories.
nized around specic topics (e.g. on cognitive modelling, Some might say that the scientic process (and any other
http://www.cmr.osu.edu/). Finally, many researchers pursue creative process) has Darwinian features because it consists of
open access for papers and research materials by posting two steps (Campbell, 1960; Simonton, 2003): blind variation
them on their own institutional websites. and selective retention. Like genetic mutations, this means that
We highlight a project that aspires to offer most of the many research results are simply not very useful, even if they
aforementioned options within a single framework: the are uncovered using perfect measures. No single study speaks
Open Science Framework (http://openscienceframework. for itself: Findings have to be related to underlying ideas,
org/). The Open Science Framework is an open solution and their merits discussed by other scientists. Only the best
developed by psychological scientists for documenting, (intellectually ttest) ideas survive this process. Why then
archiving, sharing, and registering research materials bother with scrutiny of the replicability of single ndings, one
and data. Researchers create projects and drag-and-drop may ask?
materials from their workstations into the projects. Wikis The answer is pragmatic: Publishing misleading ndings
and le management offer easy means of documenting wastes time and money because scientists as well as the
the research; version control software logs changes to les larger public take seriously ideas that should not have
and content. Researchers add contributors to their projects, merited additional consideration, based on the way they were
and then the projects show up in the contributors own derived. Not realizing that results basically reect statistical
accounts for viewing and editing. Projects remain private for noise, other researchers may jump on a bandwagon and
their collaborative teams until they decide that some or all of incorporate them in planning follow-up studies and setting
their content should be made public. Researchers can register up new research projects. Instead of this, we urge greater
a project or part of a project at any time to create a read-only, continuity within broad research programmes designed to
archived version. For example, researchers can register a address falsiable theoretical propositions. Such propositions
description of a hypothesis, the research design, and analysis are plausibly strengthened when supportive evidence is
plan prior to conducting data collection or analysis. The replicated and should be reconsidered when replications
registered copy is time stamped and has a unique, permanent fail. Strong conceptual foundations therefore increase the
universal resource locator that can be used in reporting information value of failures to replicate, provided the
results to verify prior registration.2 original results were obtained with reliable methods. This is
Many emerging infrastructure options offer opportunities the direction that psychology as a eld needs to take.
for implementing the improvements we have discussed. We argue that aspects of the culture within psychological
The ones that will survive consider the daily workow of science have gradually become dysfunctional and have of-
the scientist and are nding ways to make it more efcient fered a hierarchy of systematic measures to repair them. This
while simultaneously offering opportunities, or nudges, is part of a self-correcting movement in science: After long
towards improving scientic rigour. emphasizing large numbers of sexy and surprising papers,
2 the emphasis now seems to be shifting towards getting it
Neither this nor any other system prevents a researcher from registering a
hypothesis after having performed the study and conducted the analysis. right. This shift has been caused by systemic shocks, such
However, doing this is active fraud. as the recent fraud scandals and the publication of papers

Copyright 2013 John Wiley & Sons, Ltd. Eur. J. Pers. 27: 108119 (2013)
DOI: 10.1002/per
118 J. B. Asendorpf et al.

deemed lacking in seriousness. We hope that this movement Hartshorne, J. K., & Schachner, A. (2012). Tracking replicability as
will be sustained and lead to an improvement in the way our a method of post-publication open evaluation. Frontiers in
Computational Science, 6, 114.
science is conducted. Hox, J. J. (2010). Multilevel analysis (2nd ed.). New York, NY:
Ultimately, every scientist is responsible for the choices Routledge.
that he or she makes. In addition to the external measures that Ioannidis, J. P. A. (2005). Why most published research ndings are
we propose in this article, we appeal to scientists intrinsic false. PLoS Medicine, 2, e124.
motivation. Desire for precise measurements and curiosity John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the
prevalence of questionable research practices with incentives
to make deeper sense of incoherent ndings (instead of
for truth-telling. Psychological Science, 23, 524532.
cherry-picking those that seem easy to sell) are the reasons Judd, C. M., Westfall, J., & Kenny, D. A. (2012). Treating
many of us have chosen a scholarly career. We hope that stimuli as a random factor in social psychology: A new and
future developments will create external circumstances that comprehensive solution to a pervasive but largely ignored
are better aligned with these intrinsic inclinations and help problem. Journal of Personality and Social Psychology, 103,
the scientic process to become more accurate, transparent, 5469.
Kelley, K., & Maxwell, S. E. (2003). Sample size for multiple
and efcient. regression: Obtaining regression coefcients that are accu-
rate, not simply signicant. Psychological Methods, 8,
305321.
Kelley, K., & Rausch, J. R. (2006). Sample size planning for the
standardized mean difference: Accuracy in parameter estimation
REFERENCES via narrow condence intervals. Psychological Methods, 11,
363385.
Bakker, M., Van Dijk, A., & Wicherts, J. M. (2012). The rules Kenny, D. A., Kashy, D. A., & Cook, W. L. (2006). Dyadic data
of the game called psychological science. Perspectives on analysis. New York, NY: Guilford Press.
Psychological Science, 7, 543554. King, G. (1995). Replication, replication. PS: Political Science and
Bem, D. J. (2000). Writing an empirical article. In R. J. Sternberg Politics, 28, 443499.
(Ed.), Guide to publishing in psychology journals (pp. 316). King, G. (2006). Publication, publication. PS: Political Science and
Cambridge: Cambridge University Press. Politics, 34, 119125.
Bem, D. J. (2011). Feeling the future: Experimental evidence for Lee, J. J. (2012). Correlation and causation in the study of personality.
anomalous retroactive inuences on cognition and affect. Journal European Journal of Personality, 26, 372390.
of Personality and Social Psychology, 100, 407426. Lehrer, J. (2010). The truth wears off: Is there something wrong
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false with the scientic method? The New Yorker, December 13.
discovery rate: A practical and powerful approach to multiple Lipsey, M. W., & Hurley, S. M. (2009). Design sensitivity:
testing. Journal of the Royal Statistical Society, Series B (Methodo- Statistical power for applied experimental research. In L.
logical) 57, 289300. Bickman, & D. J. Rog (Eds.), The SAGE handbook of applied
Brunswik, E. (1955). Representative design and probabilistic theory social research methods (pp. 4476). Los Angeles, CA: SAGE
in a functional psychology. Psychological Review, 62, 193217. Publications.
Buckheit, J., & Donoho, D. L. (1995). Wavelab and reproducible Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size
research. In A. Antoniadis (ed.). Wavelets and statistics planning for statistical power and accuracy in parameter estimation.
(pp. 5581). New York, NY: Springer-Verlag. Annual Review of Psychology, 59, 537563.
Campbell, D. T. (1960). Blind variation and selective retention in Nakagawa, S. (2004). A farewell to Bonferroni: The problems of
creative thought as in other knowledge processes. Psychological low statistical power and publication bias. Behavioral Ecology,
Review, 67, 380400. 15, 10441045.
Cohen, J. (1962). The statistical power of abnormalsocial psychological Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientic utopia:
research: A review. Journal of Abnormal and Social Psychology, II. Restructuring incentives and practices to promote truth
65, 145153. over publishability. Perspectives on Psychological Science
Cohen, J. (1988). Statistical power analysis for the behavioral 7, 615631.
sciences (2nd ed.). Hillsdale, NJ: Erlbaum. Schimmack, U. (2012). The ironic effect of signicant results on
Cumming, G., & Finch, S. (2005). Inference by eye: Condence the credibility of multiple-study articles. Psychological Meth-
intervals, and how to read pictures of data. The American ods 17, 551566.
Psychologist, 60, 170180. Sherman, R. A., & Funder, D. C. (2009). Evaluating correlations
Fanelli, D. (2010). Positive results increase down the hierarchy of in studies of personality and behavior: Beyond the number of
the sciences. PLoS One, 5, e10068. signicant ndings to be expected by chance. Journal of
Fanelli, D. (2012). Negative results are disappearing from most Research in Personality, 43, 10531061.
disciplines and countries. Scientometrics, 90, 891904. Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive
Fraley, R. C., & Marks, M. J. (2007). The null hypothesis signicance- psychology: Undisclosed exibility in data collection and analysis
testing debate and its implications for personality research. In R. allows presenting anything as signicant. Psychological Science,
W. Robins, R. C. Fraley, & R. F. Krueger (Eds.), Handbook of 22, 13591366.
research methods in personality psychology (pp. 149169). Simonton, D. K. (2003). Scientic creativity as constrained stochastic
New York: Guilford. behavior: The integration of product, person, and process perspec-
Frank, M. C., & Saxe, R. (2012). Teaching replication. Perspectives tives. Psychological Bulletin, 129, 475494.
on Psychological Science, 7, 600-604. Tressoldi, P. E. (2012). Replication unreliability in psychology:
Fuchs, H., Jenny, M., & Fiedler, S. (2012). Psychologists are Elusive phenomena or elusive statistical power? Frontiers
open to change, yet wary of rules. Perspectives on Psycho- in Psychology, 3. doi: 10.3389/fpsyg.2012.00218
logical Science, 7, 639-642. Tversky, A., & Kahneman, D. (1971). Belief in the law of small
Gigerenzer, G., Swijink, Z., Porter, T., Daston, L., Beatty, J., & numbers. Psychological Bulletin, 76, 105110.
Krger, L. (1989). The empire of chance: How probability Valentine, J. C., Biglan, A., Boruch, R. F., Castro, F. G., Collins, L.
changed science and everyday life. Cambridge: Cambridge M., Flay, B. R., . . . Schinke, S. P. (2011). Replication in prevention
University Press. science. Prevention Science, 12, 103117.

Copyright 2013 John Wiley & Sons, Ltd. Eur. J. Pers. 27: 108119 (2013)
DOI: 10.1002/per
Recommendations for increasing replicability 119

Wachtel, P. L. (1980). Investigation and its discontents: Some Yong, E. (2012). Bad copy: In the wake of high-prole controversies,
constraints on progress in psychological research. The American psychologists are facing up to problems with replication. Nature,
Psychologist, 5, 399408. 485, 298300.
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, Zimmer, C. (2012). Good scientist! You get a badge. Slate, August
H. L. J. (2011). Why psychologists must change the way they 14 (on-line). http://www.slate.com/articles/health_and_science/
analyze their data: The case of psi: Comment on Bem (2011). science/2012/08/reproducing_scientic_studies_a_good_house-
Journal of Personality and Social Psychology, 100, 426432. keeping_seal_of_approval_.html

Copyright 2013 John Wiley & Sons, Ltd. Eur. J. Pers. 27: 108119 (2013)
DOI: 10.1002/per
European Journal of Personality, Eur. J. Pers. 27: 120144 (2013)
Published online in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/per.1920

OPEN PEER COMMENTARY

Dwelling on the Past


MARJAN BAKKER1, ANGLIQUE O. J. CRAMER1, DORA MATZKE1, ROGIER A. KIEVIT2, HAN L. J. VAN DER
MAAS1, ERIC-JAN WAGENMAKERS1, DENNY BORSBOOM1
1
University of Amsterdam
2
Medical Research Council, Cognition and Brain Sciences Unit, Cambridge, UK
M.Bakker1@uva.nl

Abstract: We welcome the recommendations suggested by Asendorpf et al. Their proposed changes will undoubtedly
improve psychology as an academic discipline. However, our current knowledge is based on past research. We
therefore have an obligation to dwell on the past; that is, to investigate the veracity of previously published
ndingsparticularly those featured in course materials and popular science books. We discuss some examples of
staple facts in psychology that are actually no more than hypotheses with rather weak empirical support and
suggest various ways to remedy this situation. Copyright 2013 John Wiley & Sons, Ltd.

We support most of the proposed changes of Asendorpf et al. described in various basic textbooks on (social) psychology,
in the modus operandi of psychological research, and, unsurpris- where it often has the status of fact (Augoustinos, Walker, &
ingly perhaps, we are particularly enthusiastic about the idea to Donaghue, 2006; Bless, Fiedler, & Strack, 2004; Hewstone,
separate conrmatory from exploratory research (Wagenmakers, Stroebe, & Jonas, 2012). However, only two relatively direct
Wetzels, Borsboom, Van der Maas, & Kievit, 2012). Neverthe- (but underpowered) replications had been performed, producing
less, perhaps we disagree with Asendorpf et al. on one point. inconclusive results (Cesario, Plaks, & Higgins, 2006; Hull,
Asendorpf et al. urge readers not to dwell . . .on suboptimal Slone, Meteyer, & Matthews, 2002). Hull et al. (2002) found
practices in the past. Instead, they advise us to look ahead: the effect in two studies, but only for highly self-conscious indi-
We do not seek here to add to the developing literature on iden- viduals. Cesario et al. (2006) established a partial replication in
tifying problems in current psychological research practice. [. . .] that some but not all of the experimental conditions showed
we address the more constructive question: How can we increase the expected effects. Two more recent, direct, and well-powered
the replicability of research ndings in psychology now? replications failed to nd the effect (Doyen, Klein, Pichon, &
Although we do not want to diminish the importance of Cleeremans, 2012; Pashler, Harris, & Coburn, 2011).
adopting the measures that Asendorpf et al. proposed, we think As another example, imitation of tongue gestures by young
that, as a eld, we have the responsibility to look back. Our infants is mentioned in many recent books on developmental psy-
knowledge is based on ndings from work conducted in the past, chology (e.g., Berk, 2013; Leman, Bremner, Parke, & Gauvain,
ndings that textbooks often tout as indisputable fact. Recent 2012; Shaffer & Kipp, 2009; Siegler, DeLoache, & Eisenberg,
expositions on the methodology of psychological research reveal 2011), and the original study by Meltzoff and Moore (1977) is
that these ndings are based at least in part on questionable cited over 2000 times. However, the only two direct replications
research practices (e.g. optional stopping, selective reporting, (Hayes and Watson, 1981; Koepke, Hamm, Legerstee, & Rusell,
etc.). Hence, we cannot avoid the question of how to interpret 1983) failed to obtain the original ndings, and a review by
past ndings: Are they fact, or are they ction? Anisfeld (1991) showed inconclusive results.
Even when some (approximately) direct replication studies
Replications of the past
are summarized in meta-analysis, we cannot be sure about the
How can we evaluate past work? As Asendorpf et al. presence of the effect, as the meta-analysis may be contaminated
proposed, direct replication, possibly summarized in a meta-analy- by publication bias (Rosenthal, 1979) or the use of questionable
sis, is one of the best ways to test whether an empirical nding is research practices (John, Loewenstein, & Prelec, 2012;
fact rather than ction. Unfortunately, direct replication of ndings Simmons, Nelson, & Simonsohn, 2011). For example, many
is still uncommon in the psychological literature (Makel, Plucker, recent textbooks in developmental psychology state that infant
& Hegarty, 2012), even when it comes to textbook-level facts. habituation is a good predictor of later IQ (e.g., Berk, 2013;
For example, one area in psychology that has recently come Leman, Bremner, Parke, & Gauvain, 2012; Shaffer & Kipp,
under scrutiny is that of behavioural priming research (Yong, 2009; Siegler, DeLoache, & Eisenberg, 2011), often referring
2012). In one of the classic behavioural priming studies, Bargh, to the meta-analysis of McCall and Carriger (1993). However,
Chen, and Burrows (1996) showed that participants who were this meta-analysis suffers from publication bias (Bakker, van
primed with words that supposedly activated elderly stereotypes Dijk, & Wicherts, 2012). At best, these results point to a
walked more slowly than participants in the control condition. weak relation between habituation and IQ, and possibly to no
The Bargh et al. study is now cited over 2000 times and is relation at all.

Copyright 2013 John Wiley & Sons, Ltd.


Discussion and Response 121

Using replications to distinguish fact from ction is conrmation and may very well be ctional. To resolve this
important beyond the realms of scientic research and educa- situation, we need to dwell on the past, and several courses
tion. For instance, the (in)famous Mozart effect (Rauscher, of action present themselves. First, psychology requires
Shaw, & Ky, 1993) suggested a possible 89 IQ point im- thorough examination, for example by an American
provement in spatial intelligence after listening to classical Psychological Association taskforce, to propose a list of
music. Yet despite increasingly denite null replications psychological ndings that feature at the textbook level
dating back to 1995 (e.g., Newman et al., 1995; Pietschnig, but in fact are still in need of direct replication. In a second
Voracek, & Formann, 2010), the Mozart effect persists in step, those ndings that are in need of replication can be
the popular imagination. Moreover, the Mozart effect was reinvestigated in research that implements the proposals of
the basis of a statewide funding scheme in Georgia (Cromie, Asendorpf et al. The work initiated by the Open Science
1999), trademark applications (Campbell, 1997), and chil- Framework (http://openscienceframework.org/) has gone a
drens products; for instance, Amazon.co.uk lists hundreds long way in constructing a methodology to guide massive
of products that use the name The Mozart Effect, many replication efforts and can be taken as a blueprint for
touting the benecial effects on the babies brain. Clearly, this kind of work.
in addition to the scientic resources spent establishing Psychology needs to improve its research methodology,
whether the original claim was true, false-positive ndings and the procedures proposed by Asendorpf et al. will
can have a long-lasting inuence far outside science even undoubtedly contribute to that goal. However, psychology
when the scientic controversy has largely died down. also cannot avoid the obligation to look back and to nd
out which studies are textbook-proof and which are not.
Textbook-proof By implementing sensible procedures to further the veracity
of our empirical work, psychologists have the opportunity
The studies discussed earlier highlight that at least to lead by example, an opportunity that we cannot afford
some established ndings from the past are still awaiting to miss.

Minimal Replicability, Generalizability, and Scientic Advances in


Psychological Science
JOHN T. CACIOPPO AND STEPHANIE CACIOPPO
University of Chicago
Cacioppo@uchicago.edu

Abstract: With the growing number of fraudulent and nonreplicable reports in psychological science, many question the
replicability of experiments performed in laboratories worldwide. The focus of Asendorpf and colleagues is on research
practices that investigators may be using to increase the likelihood of publication while unknowingly undermining
replicability. We laud them for thoughtful intentions and extend their recommendations by focusing on two additional
domains: the structure of psychological science and the need to distinguish between minimal replicability and
generalizability. The former represents a methodological/statistical problem, whereas the latter represents a theoretical
opportunity. Copyright 2013 John Wiley & Sons, Ltd.

Although cases of outright fraud are rare and not unique situations, and time points in an independent sample of
to psychology, psychological science has been rocked in participants, is the currency of science.
the past few years by a few cases of failed replications and Asendorpf et al. distinguish among reproducibility (du-
fraudulent science. Among practices suggested by Asendorpf plication by an independent investigator analysing the same
et al. as contributing to these outcomes are data selection and dataset), replicability (observation with other random sam-
formulating decisions about sample size on the basis of sta- ples), and generalizability (absence of dependence on an
tistical signicance rather than statistical power. We laud originally unmeasured variable). Issues of replicability and
Asendorpf et al. for their thoughtful and timely recommenda- generalizability have been addressed before in psychology.
tions and hope their paper becomes required reading. We fo- Basic psychological research, with its emphasis on experi-
cus here on two domains they did not address: the structure mental control, was once criticized for yielding statistically
of psychological science and the need to distinguish between reliable but trivial effects (e.g., Appley, 1990; Staats,
minimal replicability and generalizability. 1989). Allport (1968) decades ago noted that scientic gains
Publication of a new scientic nding should be viewed result from this hard-nosed approach, but he lamented the
more as a promissory note than a nal accounting. Science lack of generalizing power of many neat and elegant experi-
is not a solitary pursuit; it is a social process. If a scientic ments: It is for this reason that some current investigations
nding cannot be independently veried, then it cannot seem to end up in elegantly polished trivialitysnippits of
be regarded as an empirical fact. Minimal replicability, empiricism, but nothing more (p. 68).
dened as an empirical nding that can be repeated by an Many psychological phenomena, ranging from
independent investigator using the same operationalizations, attention to racism, are multiply determined (Schachter,

Copyright 2013 John Wiley & Sons, Ltd. Eur. J. Pers. 27: 120144 (2013)
DOI: 10.1002/per
122 Discussion and Response

Christenfeld, Ravina, & Bilous, 1991). This multiply because other factors can also affect the outcome,
determined nature of many psychological phenomena Pc=et > 0. Only when other sufcient causes of c have
calls for the parsing of big research questions into smaller, been controlled in a particular experimental paradigm or
tractable series of research questions that ultimately consti- assessment context will P(c/t) approach 1. Pc=et will
tute systematic and meticulous programmes of research. approach 0 in a given experimental context by virtue
Where to parse a phenomenon may not be obvious without of experimental controlbecause all other determinants
empirical evidence, however. Therefore, the generalizability of to c have been eliminated or controlled in the
problem, as Allport referred to it, may represent a theoretical experimental setting. Because c is multiply determined,
rather than methodological problem when investigating however, Pc=et > 0 and may be much greater than 0
phenomena that are multiply determined. For instance, four when aspects of the design, sample, operationalizations,
decades ago, concerns that experimental research on attitude or context are changed. This generalizing problem
change was not replicable or generalizable existed because need not reect a methodological quagmire but rather
the same experimental factors (e.g., source credibility) were can represent a theoretical challenge; it can lead to
found to produce different outcomes in different studies. new insights into and research on the boundary condi-
Rather than treat this as a statistical or methodological prob- tions for theories, the operation of additional antece-
lem, we identied two distinct mechanisms (routes) through dents, and the specication of new theoretical
which attitude change could occur, and we specied the organizations (Cacioppo & Berntson,1992).
theoretical conditions in which a given factor would In sum, attention to study details, from conceptuali-
trigger each route. The resulting elaboration likelihood zation, statistical power, and execution to analysis and
model (Petty & Cacioppo, 1981, 1986) made sense of interpretation, increases the likelihood that the empirical
what had appeared to be conicting results, generated results constitute replicable scientic facts upon which
predictions of new patterns of data that have since been one can solidly build. Asendorpf et al. argue that the
veried, and remains a staple in the eld. facets of a research design relevant for replicability
Multiple determinism includes parallel determinism include individuals, situations, operationalizations, and
(more than one antecedent condition can alone be time points. If psychological phenomena in principle
sufcient to produce the outcome of interest) and conver- had singular antecedents, this would be sufcient. This
gent determinism (two or more antecedent conditions are is not the only possible denition of replicability,
necessary to produce an outcome). A lack of generalizing however. In a complex science such as psychology, in
power in studies of the role of single factors is a which phenomena of interest can be multiply deter-
predictable property of multiply determined phenomena. mined, minimal replicability refers to the same observa-
Because it is rare for a single factor or determinant to tion by an independent investigator using the same
be a necessary and sufcient cause for a psychological operationalizations, situations, and time points in an in-
phenomenon, the failure to nd generalizability raises dependent sample from the same population. Such min-
the theoretical question of whether multiple parallel or imal replications suggest that an empirical fact has been
convergent determinism exist and, if so, under what established, and failures to replicate the nding using
conditions each antecedent may be operating and what different operationalizations, situations, time points, or
other factors may also be operating. populations suggest the operation of potentially
To be specic, let c represent a psychological important moderator variables (and, thus, generate theo-
phenomenon of interest, let t represent a factor or retical questions) rather than methodological problems.
treatment whose effect on c is of interest, and let et To the extent that psychological phenomena are
(not t) represent all other antecedents of c, known multiply determined, therefore, a failure to replicate a
or unknown. Carefully conceived, statistically powered, phenomenon across these facets of a research
and controlled experimentation on the role of t in design may more productively be viewed as a failure
producing c can be denoted as P(c/t). When multiple to generalize and may trigger a search for the
factors are sufcient to produce the psychological out- operation of, for instance, a previously unrecognized
come (i.e. parallel determinism), then P(c/t) > 0, but determinant.

From Replication Studies to Studies of Replicability


MICHAEL EID

Freie University Berlin


eid@zedat.fu-berlin.de

Abstract: Some consequences of the Asendorpf et al. concept of replicability as a combination of content validity and
reliability are discussed. It is argued that theories about relevant facets have to be developed, that their approach
requires multiple sampling within the same study, and that the proposed aggregation strategies should not be applied
blindly. Studies on the conditions of replicability should complement pure replication studies. Copyright 2013 John
Wiley & Sons, Ltd.

Copyright 2013 John Wiley & Sons, Ltd. Eur. J. Pers. 27: 120144 (2013)
DOI: 10.1002/per
Discussion and Response 123

Asendorpf et al. assume that research designs are charac- single example of a facet has been considered and in an-
terized by multifaceted structures. Facets are individuals, other study only a different example, replicability is not
situations, operationalizations, and time points. These facets ensured. Although Asendorpf et al. focus on the increase
are considered random factors. To ensure replicability, of sample size for individuals and items, this concerns the
random samples must be drawn from these populations. Un- other facets as well. This is in contrast to research practice
like ad hoc samples of individuals, test items, situations, and in many areas of psychology where mono-method, mono-
time points, random sampling will usually result in greater situation, and cross-sectional studies are predominant.
variance of the traits considered. Although this variance is However, it is unrealistic that in each and every study,
necessary for representativeness, the authors consider it error random samples of individuals, items, stimuli, methods,
variance that should be decreased in a next step through and so on be drawn. Moreover, it might not be necessary
aggregation. Aggregations across individuals, items, and so if the variance due to facets is low. Planning replication
on reduce standard errors andincrease reliability. Thus, they studies requires available knowledge about which facets
propose a two-step procedure to ensure replicability charac- are relevant and which facets are random and not xed.
terized by random sampling (to ensure content validity) in In many research areas, there is no knowledge about the
combination with aggregation (to increase precision and importance of different facets. Systematic replicability
reliability). In contrast to generalizability, the concept of studies that focus on different facets and the conditions
replicability does not focus on the variances of the facets of replicability are necessary. Examples are generaliz-
per se, as the variances have to be considered to obtain ability studies using the denition from generalizability
unbiased aggregated scores. Consequently, replicability theory (Cronbach, Gleser, Nanda & Rajaratnam, 1972).
depends on content validity and reliability. 3. According to Asendorpf et al., aggregation is an impor-
This conceptualization of replicability has some important tant further step. Aggregation is an appropriate method
consequences: for reducing variability due to random facets. If items,
stimuli, individuals, and so on are considered inter-
1. The facet populations have to be known. This might be changeable, aggregation is an efcient way to get rid of
the case for the population of individuals. But it might the resulting error variance. However, if the elements
not be the case for the other facets such as items, methods, of a facet are not interchangeable, aggregation might
situations, and so on. In many areas of psychological re- reduce relevant information and might not be appropriate.
search, theories are missing about the universes of stim- For example, if raters are interchangeable (e.g., students
uli, items, and methods. Taking methods as an example, rating the quality of a course), aggregation can be used
with respect to Campbell and Fiskes (1959) seminal pa- to obtain more precise estimates (e.g., mean quality of
per on convergent validity, Sechrest, Davis, Stickle, and the course). However, if raters are not interchangeable
McKnight (2000) noted that method variance was never (e.g., children and their parents rating the suicide risk of
dened in any exact way (p. 63), but they added, and we the child), aggregation might not be appropriate (the
will not do so here (p. 63). It seems to be difcult to de- mean suicide risk score across raters might not be a valid
ne the population of methods. This is also true for the indicator of the childs true suicide risk; Eid, Geiser, &
other facets. Psychological theories often are not clear Nussbeck, 2009). The recommendations of Asendorpf
about these methodological aspects. The denition of et al. for increasing replicability by decreasing sources
Asendorpf et al. shows that we should focus much of error are closely linked to their concept of facets as
more on the development of theories about the different random factors. These factors may not be random in all
facets (e.g., methods and situations) that might play roles applications. However, their approach claries that
in our research designs. We must understand the var- researchers have to think more closely about what are
iances of the facets not only to get rid of them by aggre- sources of error variance that should be eliminated and
gation but also to understand the phenomenon we are what are sources of substantive variance that should not
studying and to guide the sampling process. Decreasing be eliminated. This again requires theories about the
the standard error by sampling more individuals might nature of the facets, whether they are random or xed.
not be appropriate for increasing replicability if they are Aggregating across structurally different subpopulations
sampled from the wrong population. Increasing reliabil- (of methods, individuals, items, situations, etc.) might
ity by adding items that are reformulations of other items not be appropriate to enhance replicability even if this
in the scale might not ensure replicability. All the statisti- might increase reliability and power (Eid et al., 2008).
cal recommendations of the authors have their basis in the Aggregation might be too often used blindly. The recom-
appropriate theoretical underpinning of the facets that are mendation of Asendorpf et al. is linked to random facets
considered. These theoretical ideas have to be communi- that are linked to replicability. Fixed factors are related
cated to plan replication studies. In many research areas, to generalizability.
however, they have to be developed rst because theories
often do not integrate theories about situations, methods, It is the merit of the Asendorpf et al. concept of
and so on. replicability that it makes clear that it is the combination
2. Their two-step approach of replicability requires that of content validity (representativeness) and reliability
there is a sampling process not only across different (reduction of error variances) that should guide research.
studies but also within a study. If in one study only a Moreover, their distinction between replicability (random

Copyright 2013 John Wiley & Sons, Ltd. Eur. J. Pers. 27: 120144 (2013)
DOI: 10.1002/per
124 Discussion and Response

facets) and generalizability (xed facets) is important for interest in negative results. Surprisingly, however, negative
choosing appropriate research strategies and methods of results in psychology and other disciplines are cited just as
data analyses (Eid et al., 2008). Their ideas require research much as positives, suggesting that the source of bias
programmes built on theories of facets. Consequently, re- might have less to do with le drawers and more with
search should move from pure replication studies to studies conrmation biases of psychologists themselves (Fanelli,
on replicability. 2012a). Another example is the recommendation to

Only Reporting Guidelines Can Save (Soft) Science


DANIELE FANELLI
University of Edinburgh
dfanelli@exseed.ed.ac.uk

Abstract: Some of the unreplicable and biased ndings in psychological research and other disciplines may not be
caused by poor methods or editorial biases, but by low consensus on theories and methods. This is an epistemological
obstacle, which only time and scientic debate can overcome. To foster consensus growth, guidelines for best research
practices should be combined with accurate, method-specic reporting guidelines. Recommending greater transparency
will not sufce. A scientic system that enforced reporting requirements would be easily implemented and would save
scientists from their unconscious and conscious biases. Copyright 2013 John Wiley & Sons, Ltd.

Asendorpf et al. propose an excellent list of preregister experimental hypotheses, in analogy to what is
recommendations that may increase the likelihood that attempted with clinical trials. This suggestion fails to take
ndings are true, by improving their replicability. How- into account the low predictive ability of psychological
ever, these recommendations might still be too generic theories and low truth value of published research ndings.
and fail to account for peculiarities and limitations that Psychology is not astrophysics. Most of its predictions may
psychology and other social and biological disciplines rest on shaky grounds, and the same study could both
face in their quest for truth. Some of the initiatives they support and falsify them depending on subtle changes in
suggested could be impractical or even counterproductive design and interpretation (LeBel & Peters, 2011; Weisburd
in psychology, which would make greater and easier & Piquero, 2008; Yong, 2012). Forcing psychologists to
progress if it shifted attention from what researchers do predeclare what they intend to test will push them, I fear, to
to what they report. either formulate more generic and less falsiable hypotheses
Psychology, like many other social and biological or massage their ndings even more.
sciences, appears to be soft (Fanelli, 2010; Simonton, In sum, although I support most of the recommenda-
2004). It deals with extremely complex phenomena, tions of Asendorpf et al., I believe that they do not fully
struggling against an enormous amount of diversity, accommodate the fact that psychology has lower theoret-
variables and noise, in conditions where practical and ethical ical and methodological consensus than much biomedical
considerations impede optimal study design. These research, let alone most physical sciences. Scientic
characteristicsno doubt, with great variability among consensus will hopefully grow over time, but only if we
individual eldsprobably render data in psychology allow it to harden through an extended, free, and fair
relatively unable to speak for themselves, hampering war of ideas, approaches, and schools of thought. Good
scholars ability to reach consensus on the validity of any research practices are the essential weapons that scientists
theory, method, or nding and therefore to build upon them. need, but fairness and freedom in battle are only guaranteed
In such conditions, scientists inevitably have many degrees by complete transparency and clarity of reporting.
of freedom to decide how to conduct and interpret studies, What makes some of human knowledge scientic is
which increases their likelihood to nd what they expect. not the superior honesty or skills of those who produced
Bias and false positives, in other words, are to some extent it, but their willingness to share all relevant information,
physiological to the subject matter, and no amount of which allowed truth to emerge by a collective process
statistical power, quantitative training, and reduced pressures of competition, criticism, and selection. There is nothing
to publish will remove them completely. wrong in conducting exploratory analyses, trying out
Although publication bias in the psychological literature several statistical approaches, rethinking ones hypothesis
is supercially similar to that observed in biomedical after the data have been collected, dropping outliers, and
research, its causes might be partly different and thus require increasing ones sample size half-way through a study
different solutions. Existing guidelines for best research as long as this is made known when results are presented.
practices tend to overlook this, and so do Asendorpf et al. These behaviours might increase false-positive ratios but
For example, they recommend that editors and reviewers will also increase the likelihood of discovering true
learn to accept negative results for publication, which buys patterns and new methods, and psychology seems to be
into the standard (biomedical) explanation that publication still in a phase where it can benet from all discovery
bias is caused by a le drawer problem, created by lack of attempts.

Copyright 2013 John Wiley & Sons, Ltd. Eur. J. Pers. 27: 120144 (2013)
DOI: 10.1002/per
Discussion and Response 125

Good research practices notwithstanding, therefore, the could decide on initial acceptance on the basis of the accu-
keys to good science are good reporting practices, which, in- racy of methodsthat is, blindly to outcomeand only later
terestingly, are much easier to ensure. Indeed, reporting ask active researchers to assess the results and discussion.
guidelines are rapidly being adopted in biomedical and clin- Strictness of reporting requirements could become a measure
ical research, where initiatives such as the EQUATOR Net- of a journals quality, quite independent of impact factor.
work (http://www.equator-network.org) and Minimum Moreover, reporting guidelines would provide clear and
Information about a Biomedical or Biological Investigation objective standards for teachers, students, and ofcers faced
(http://mibbi.sourceforge.net) publish updated lists of details with allegations of misconduct (Fanelli, 2012b).
that authors need to provide, depending on what methodol- In conclusion, Asendorpf et al. make important recom-
ogy they used. Major journals have adopted these guidelines mendations. I highlight those of funding replication studies,
spontaneously because doing so improves their reputation. If emphasizing effect sizes, and rewarding replicated results.
authors do not comply, their papers are rejected. But the key to saving psychologists (and all other scientists)
This approach could easily be exported to all disciplines from themselves is ensuring the transparency of their work.
and, if it became central to the way we do science, it would Daryl Bems evidence of precognition is problematic mainly
bring many collateral advantages. Peer reviewers, for exam- because we lack information on all tests and analyses that
ple, instead of spending more of their precious time checking were carried out before and after his experiments (LeBel &
results as Asendorpf et al. suggest, could specialize in asses- Peters, 2011). Diederik Stapels fraudulent career might have
sing papers compliance with objective reporting guidelines. never taken off if he had been forced to specify where and
Indeed, peer reviewing could nally become a career option how he had collected his data (Levelt Committee, Noort
in itself, separate from teaching and doing research. Journals Committee, & Drenth Committee, 2012).

We Dont Need Replication, but We Do Need More Data


GREGORY FRANCIS
Purdue University
gfrancis@purdue.edu

Abstract: Although the many valuable recommendations Asendorpf et al. are presented as a way of increasing and
improving replication, this is not their main contribution. Replication is irrelevant to most empirical investigations in
psychological science, because what is really needed is an accumulation of data to reduce uncertainty. Whatever
criterion is used to dene success or failure of a replication is either meaningless or encourages a form of bias that
undermines the integrity of the accumulation process. Even though it is rarely practised, the xation on replication
actively hurts the eld. Copyright 2013 John Wiley & Sons, Ltd.

Asendorpf et al. present many recommendations that will Even when we try to measure a xed effect, replication
likely improve scientic practice in psychology. Despite the is irrelevant. Suppose scientist A rejects the null hypothesis
good advice, many of the recommendations are based on for an experimental nding. Scientist B decides to repeat the
fundamental misunderstandings about the role of replication experiment with the same methods. There are two possible
in science. As Asendorpf et al. emphasize, replication is outcomes for scientist Bs experiment.
commonly viewed as a foundation of every empirical sci-
1. Successful replication: the replication experiment
ence. Experimental results that successfully replicate are
rejects the null hypothesis.
interpreted to be valid, whereas results that fail to replicate
2. Failure to replicate: the replication experiment does not
are considered invalid. Although replication has worked
reject the null hypothesis.
wonderfully for elds such as physics and chemistry, the
concept of replication is inappropriate for a eld like exper- How should the scientists interpret the pair of
imental psychology. ndings? For Case 1, it seems clear that a good scientic
The problem for psychology is that almost all experimental strategy is to use meta-analytic methods to pool the
conclusions are based on statistical analyses. When statistical ndings across the experiments and thereby produce a
noise is large relative to the magnitude of the effect being more precise estimate of the effect.
investigated, then the conclusion is uncertain. This For Case 2, it may be tempting to argue that the
uncertainty is often a characteristic of what is being measured. failure to replicate invalidates the original nding; but
The call to increase replicability is a strange request because it such a claim requires a statistical argument that is best
asks for certainty where it cannot exist: No one would made with meta-analysis. These methods appropriately
complain that coin ips are unreliable because they do not weight the experimental ndings by the sample sizes
always land on the same side. In a similar way, uncertainty is and variability. Scientist Bs nding will dominate the
often part of what is being investigated in psychological meta-analysis if it is based on a much larger sample size
experiments. than scientist As nding.

Copyright 2013 John Wiley & Sons, Ltd. Eur. J. Pers. 27: 120144 (2013)
DOI: 10.1002/per
126 Discussion and Response

Importantly, the recommended scientic strategy for provides a way to interpret measurements and to predict
both successful and unsuccessful replication outcomes experimental outcomes. In contrast, a verbal theory claiming
is to use meta-analysis to pool the experimental ndings. that one condition should have a bigger mean than another
Indeed, if the experimental methods are equivalent, then condition is only useful for exploratory work. Contrary to
pooling the data with meta-analysis is always the the claim made in many experimental papers, such verbal the-
recommended action. Scientists should not focus on an ories cannot predict the outcome of a hypothesis test because
outcome that makes no difference. Rather than being a they do not provide a predicted effect size, which is necessary
foundation of the scientic method, the concept of to estimate power. Discussions and debates about quantitative
replication is irrelevant. theories will identify where resources should be applied to
This claim is not just semantics. A xation on precisely measure important experimental effects.
replication success and failure, combined with misunder- None of this is easy. When we determine whether experi-
standings about statistical uncertainty, likely promotes mental results should be pooled together or kept separate,
some of the problems described by Asendorpf et al., such equivalent methods trump measurement differences (even
as post hoc theorizing. A researcher who expects almost statistically signicant ones); but such methodological
every experiment to successfully demonstrate a true effect equivalence often depends on a theoretical interpretation.
can easily justify generating a theoretical difference Likewise, modifying a theory so that it better reects
between two closely related experiments that give experimental ndings requires consideration of the uncertainty
different outcomes. The researcher can always point to in the measurements, so data and theory go back and forth in a
some methodological or sampling difference as an struggle for coherence and meaning. Researchers will have to
explanation for the differing outcomes (e.g., time of day chase down details of experimental methods to determine
or subject motivation). Statisticians call this tting the whether reported differences are meaningful or due to random
noise, and it undermines efforts to build coherent and gen- sampling. Proper application of the scientic method to
eralizable theories. It is no wonder that journal editors, psychology takes a tremendous amount of work, and it cannot
reviewers, and authors do not encourage replications: be reduced to the outcome of a statistical analysis.
Replications rarely resolve anything. Replication is often touted as the heart of the scientic
This all sounds very bleak. If replication is not a method, but experimental psychologists trying to put it
useful concept for psychological science, then how to practise quickly discover its inadequacies. Perhaps many
should the eld progress? First, researchers must have researchers have intuitively recognized replications irrele-
an appropriate recognition of the uncertainty that is vance, and this is why the eld praises replication but does
inherent in their experimental studies. There is nothing not practise it. When combined with unrealistic interpretations
fundamentally wrong with drawing a conclusion from about the certainty of conclusions and a lack of quantitative
a hypothesis test that just barely satises p < .05, but models, confusion about replication likely contributes to the
the conclusion should be tentative rather than denitive. current crisis in psychological sciences. It is a positive sign
Condence in the conclusion increases by gathering that, despite these difculties, Asendorpf et al. were able to
additional data that promote meta-analysis. We need generate many valuable recommendations on how to improve
more data, not more replication. the eld. Most of their recommendations will be even better by
Second, although exploratory work is ne, scientic shifting the emphasis from the concept of replication and
progress often requires testing and rening quantitative towards gathering additional data to reduce uncertainty and
theories. A quantitative theory is necessary because it promote development of quantitative theories.

Calls for Replicability Must Go Beyond Motherhood and Apple Pie


EARL HUNT
University of Washington
ehunt@u.washington.edu

Abstract: I expand on the excellent ideas of Asendorpf et al. for improving the transparency of scientic publications.
Although their suggestions for changing the behaviour of authors, editors, reviewers, and promotion committees seem
useful, in isolation, some are simply rallying cries to be good. Other suggestions might attack the replicability
problem but would create serious side effects in both the publication process and the academic endeavour. Moreover,
we should be concerned primarily for replication of key, highly cited results, not for replication of the entire body of
literature.

Asendorpf et al. wish to increase both transparency and and apple pie, MAPPLE for short. The proverbial warning
replicability of scientic studies. Who would object? How- Be careful of what you wish for, you might get it is also
ever, calls for a desirable goal, without proposing practical relevant. Present practices used to evaluate scientic contri-
means of achieving it, amount to support for motherhood butions evolved for reasons. Changing these practices to

Copyright 2013 John Wiley & Sons, Ltd. Eur. J. Pers. 27: 120144 (2013)
DOI: 10.1002/per
Discussion and Response 127

Changing these practices to achieve one goal may have unintended consequences that influence other goals.

Transparency: The solvable problem

Asendorpf et al. say that studies are transparent when data and procedures are accessible and limits on conclusions are stated. Accessibility requires data archiving at reasonable cost. Just saying "Keep good lab notes" is MAPPLE. There must be standards for record keeping. The issue is not simple. Psychological studies range from laboratory experiments to analyses of government records. Confidentiality and the proprietary nature of some data must be considered. In some cases, there are issues of security. Recall the debate over whether or not the genomes for pandemic influenzas should be reported. Psychology has similar, less dramatic, cases.

Improving record keeping would do more than improve replicability. Asendorpf et al. and others are concerned about pressures on authors to rush towards publication. Clear records aid an investigator in thinking about just how strongly a claim can be made, especially when the investigator realizes that the data will be available for examination. Similarly, record keeping will not prevent fraud, but it will make it somewhat more difficult. Good record keeping is also one of the best defences against unjustified charges of fraud. A lack of transparent records has been a factor in several such allegations, including the famous Cyril Burt case.

The professional societies, such as the Association for Psychological Science, are the logical agencies to be responsible for both establishing standards and maintaining the archives. The project is substantial but doable. Creating archives before record-keeping standards are established puts the cart before the horse.

Modifying behaviour

Asendorpf et al. recommend changes in the behaviour of researchers and in reviewing practices, both for manuscripts and for professional advancement. The recommendations for researchers to accelerate scientific progress and to engage in scientific debate are pure MAPPLE. The changes in reviewing and personnel evaluation practices, although eminently reasonable (almost to the point of MAPPLE), may have unintended consequences. The devil is once again in the details.

The current reviewing system is already overwhelmed. There is an inevitable conflict between the desire for rapid reviewing and careful reviewing. As for rewards, it is highly likely that reviewing will remain a virtuous activity. The rewards for virtue are well known.

Of course, evaluation committees should look at quality rather than quantity. MAPPLE! But quantity is objective, whereas judgments of quality are often subjective. This does not make evaluation of quality by learned judges invalid. It does make decisions difficult to defend when review systems are held accountable for fairness, including unconscious biases, and productivity.

Statistical solutions

Asendorpf et al. propose changes in statistical practice and training that are good in themselves but that suffer from two problems: unintended consequences and conceptual limitation.

The statistical training curriculum in psychology is already overcrowded. A call to add something to the curriculum, however good that something is in isolation, is MAPPLE, unless the call is accompanied by suggestions for dropping topics currently taught.

The conceptual limitation is more subtle. Many discussions, including those of Asendorpf et al., seem to assume that a psychological study can be regarded as a point sampled from a space of possible studies. For example, they suggest that independent variables be analysed as random effects. This model works for laboratory studies but does not fit many studies outside the laboratory. Longitudinal studies take place at a particular place and time. And what about studies of major social trends, such as increases in intelligence test scores throughout the 20th century or the social and psychological effects of, say, the increase in urbanization throughout the world? Such studies can be extremely important to the social sciences, and issues of transparency are highly relevant, but the relevance of models of statistical inference based on sampling is questionable.
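One concrete form of the modelling approach referred to here is a mixed model in which a sampled facet of a design (a lab, a stimulus set) enters as a random effect. The sketch below is not from Hunt or the target article: the data are simulated, and the variable names and the choice of a random intercept for "lab" are illustrative assumptions, meant only to show the kind of laboratory-style sampling model whose general applicability Hunt questions.

# Minimal sketch, assuming simulated data: one sampled design facet (lab)
# treated as a random effect rather than as a fixed factor.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
labs = np.repeat(np.arange(10), 40)            # 10 labs, 40 observations each
condition = np.tile([0, 1], 200)               # a within-lab manipulation
lab_baseline = rng.normal(0, 0.3, 10)[labs]    # labs differ in baseline level
score = 0.4 * condition + lab_baseline + rng.normal(0, 1, labs.size)
df = pd.DataFrame({"lab": labs, "condition": condition, "score": score})

# Random intercept for lab: the lab facet is modelled as sampled from a population
fit = smf.mixedlm("score ~ condition", df, groups=df["lab"]).fit()
print(fit.summary())

As Hunt notes, such a sampling model is natural when labs or stimuli can be treated as exchangeable draws, but it has no obvious analogue for a one-time longitudinal study or a historical trend.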
In such cases, statistical models and significance tests provide useful descriptive statements because they contrast the results to ones that might have been observed in (unobtainable) ideal conditions. The statistics of a unique study cannot be used to support inferences about parameters in some nonexistent population of possible studies. Generalization should be based on a careful analysis of the study and the nature of the generalization desired. Statistical analysis is part of this process but often not the most important part.

So what to do?

Costs must be weighed against benefits. Increasing transparency is a low-cost, high-benefit action. It should be taken now.

The replicability problem is more complicated because costs are often indirect and because the remedies conflict with other legitimate goals of the scientific and academic systems. However, there is an unfortunate characteristic of the social and behavioural scientific literature that makes the issue more manageable.

Eighty-eight per cent of the 1745 articles in the 2005 Social Science Citation Index received fewer than two citations (Bensman, 2008). Only four had more than 10. Highly cited studies should be replicated. The ever-more frequent publication of meta-analyses, including tests for file drawer issues, shows that in fact this is being performed. Otherwise, meta-analysis would not be possible. Why bother to replicate the virtually uncited studies?


Rejoice! In Replication
HANS IJZERMAN, MARK J. BRANDT, AND JOB VAN WOLFEREN
Tilburg University
h.ijzerman@tilburguniversity.edu

Abstract:

solid science
theoretical advances
+ teaching opportunities
= Rejoice!

Copyright 2013 John Wiley & Sons, Ltd.

We found the target article one of the most enlightening contributions to the replicability debate through many cogent and nuanced recommendations for improving research standards. Contributions such as this quickly aid in remedying sloppy science (slodderwetenschap) and enabling solid science (KNAW, 2012). The primary contribution of our commentary is the following equation:

solid science
theoretical advances
+ teaching opportunities
= Rejoice!

The case for replication as a part of solid science was made in the target article. We thus focus on the latter two pieces of our equation.

Replications are theoretically consequential

Generally, we appreciate the recent contributions to the discussions of verifiability of research findings (e.g., Ferguson & Heene, 2012; Fiedler, Kutzner, & Krueger, 2012; Simmons et al., 2011). Conducting replications is a dirty business, and to date, few researchers have been motivated to do it. This may be mostly because, as the target article points out, researchers believe success of direct replications to be unlikely (Fuchs, Jenny, and Fiedler, 2012).

This latter point is important because it shows one way that replications can help advance theory. High-powered failures to replicate, in our eyes, have two (or more) potential reasons (OSC, 2012), assuming that the replication study has adequate statistical power and the researcher the ability to replicate the study. First, failures to replicate can be interpreted as indications that the original effect is context dependent. Psychological findings are often influenced by many environmental factors, from culture (Henrich, Heine, and Norenzayan, 2010) to specific subpopulations in a culture (Henry, 2008), and even minor variations in the same laboratory (e.g., research on priming and social tuning; Sinclair, Lowery, Hardin, and Colangelo, 2005). Replications, thus, involve reproducing a variety of factors that are rarely recorded (and of which the original researchers may not be aware) and that may be as trivial as temperature or lighting (IJzerman & Koole, 2011).

This suggests that the ideal of direct replication may be harder to achieve than expected and that any replication is a conceptual replication, with some being more or less direct than others. But fear not! Rejoice! Variations in replications can be to our theoretical advantage as they may illustrate which factors facilitated an effect in the original study and which factors prevented the effect from being observed in a replication attempt. More direct replications of a study's methods provide us with information regarding the stability of the effect and its contextual moderators. As suggested by the target article, when the effect size across replications is heterogeneous, moderators of the variation can provide valuable theoretical insights.

A second reason an effect may fail to replicate is that the effect size is small (and potentially zero) and thus more difficult to uncover than expected. In our experience, this is typically the assumed cause of a failure to replicate. Researchers thus consider, rightfully so, the possibility that initial findings result from type I errors. However, a failure to replicate is as convincing as the initial study (assuming similar power), and failures to replicate may actually increase one's confidence in an effect because they suggest there is not a vast hidden file drawer (Bakker et al., 2012; Schimmack, 2012). Presuming that an effect is due to a type I error after a single replication attempt is as problematic as committing that initial type I error (Doyen et al., 2012). However, multiple replication attempt effect sizes that are homogeneous around zero (without reasons for the original effect to differ) suggest that the original effect was a fluke.
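The distinction drawn here, between replication effect sizes that are homogeneous around zero and heterogeneous effect sizes that call for moderators, maps onto standard meta-analytic statistics. The sketch below is not part of the commentary: the effect sizes and standard errors are hypothetical, and it simply illustrates how a pooled estimate and a heterogeneity test (Cochran's Q, I-squared) would be computed across a set of replication attempts.

# Minimal sketch, assuming hypothetical replication results: pool effect sizes
# with inverse-variance weights and test their heterogeneity.
import numpy as np
from scipy import stats

d = np.array([0.05, -0.08, 0.02, 0.10, -0.04])   # effect sizes from five replications
se = np.array([0.12, 0.10, 0.15, 0.11, 0.13])    # their standard errors

w = 1.0 / se**2                                  # inverse-variance weights
d_pooled = np.sum(w * d) / np.sum(w)             # fixed-effect pooled estimate
se_pooled = np.sqrt(1.0 / np.sum(w))
p_pooled = 2 * stats.norm.sf(abs(d_pooled / se_pooled))   # pooled effect vs. zero

Q = np.sum(w * (d - d_pooled) ** 2)              # Cochran's Q heterogeneity statistic
p_het = stats.chi2.sf(Q, df=d.size - 1)
I2 = max(0.0, (Q - (d.size - 1)) / Q) * 100 if Q > 0 else 0.0

print(f"pooled d = {d_pooled:.2f} (p = {p_pooled:.2f}), Q = {Q:.2f} (p = {p_het:.2f}), I2 = {I2:.0f}%")

A pooled estimate near zero with an unremarkable Q supports the "fluke" reading; substantial heterogeneity instead points to contextual moderators worth theorizing about.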
Ferguson & Heene, 2012; Fiedler, Kutzner, & Krueger, One direct implication is that replications require
2012; Simmons et al., 2011). Conducting replications is a many attempts across multiple contexts to provide
dirty business, and to date, few researchers have been valid inferences. Only after systematic replications can we
motivated to do it. This may be mostly because, as the target infer how robust and how veracious an effect is. Despite
article points out, researchers believe success of direct additional efforts, we believe that we should rejoice in
replications to be unlikely (Fuchs, Jenny, and Fiedler, 2012). replications as they lend credibility to research and help us
This latter point is important because it shows one way make theoretical progress. Replications can thus help solve
that replications can help advance theory. High-powered not only the replicability crisis but also the theory crisis
failures to replicate, in our eyes, have two (or more) potential (cf. Kruglanski, 2001). The true size of the effect, predictors
reasons (OSC, 2012), assuming that the replication study has of effect size variation, and knowledge of whether an
adequate statistical power and the researcher the ability to effect is true or not all advance understanding of
replicate the study. First, failures to replicate can be human psychology.
interpreted as indications that the original effect is context
dependent. Psychological ndings are often inuenced by Facilitating solid science: Walking the talk
many environmental factors, from culture (Henrich, Heine,
and Norenzayan, 2010) to specic subpopulations in a Systematic replication attempts can be more easily
culture (Henry, 2008), and even minor variations in the same achieved by facilitating transparency of published
laboratory (e.g., research on priming and social tuning; research and by systematically contributing to replication
Sinclair, Lowery, Hardin, and Colangelo, 2005). Replica- studies. To facilitate replications (and solid science more
tions, thus, involve reproducing a variety of factors that are generally), we rst examined our research practices. We
rarely recorded (and of which the original researchers may determined that for other researchers to effectively


We determined that for other researchers to effectively replicate our work, it is essential to trace all steps from raw data (participants' behaviour) to its final form (the published research paper). We upload all files to a national server, interface with Dataverse to provide a digital object identifier, and link them to the published paper. This should make it feasible for others to replicate the crucial aspects of our work. Our own detailed document can be found online (Tilburg Data Sharing Committee, 2012), including exceptions for researchers with sensitive data.

Provided that raw materials of research are easily available, replication becomes astonishingly easy to integrate into researchers' scholarly habits and teaching (Frank & Saxe, 2012). Recently, we have implemented replication projects with our bachelor students. With the current sample (N = 3), we can attest to how fun it is, how well students pick up on power analyses, and how easy it is to use this to let students first learn how to crawl before they walk in research land. Thus, we should rejoice in replications because they solidify our science, facilitate theoretical advancement, and serve as valuable teaching tools.

Finally, although many researchers (including us) have pointed to the necessity of replication, without innovation, there is no replication. A research culture of pure replication is just as harmful for the future of the study of human psychology as a research culture of pure innovation and exploration. Taking seriously the idea of systematic replication attempts, in our eyes, forces us to go beyond weird samples and odd research methods (Rai & Fiske, 2011). As psychologists work through the current crisis, we urge researchers to both rejoice in replication and be enlightened in exploration.

Let Us Slow Down


LAURA A. KING
University of Missouri, Columbia
kingla@missouri.edu

Abstract: Although I largely agree with the proposals in the target article, I note that many of these suggestions echo past calls for changes in the field that have fallen on deaf ears. I present two examples that suggest some modest changes are underway. Finally, I suggest that one reason for the alarmingly high productivity in the field is that the research questions we are addressing are not particularly significant. Copyright 2013 John Wiley & Sons, Ltd.

Although, generally speaking, I applaud the suggestions made by the esteemed authors of the target article, I cannot help but note that their voices join a chorus of similar calls that have been made not only in response to recent events but also historically. We have been lectured for decades about the problems inherent in null-hypothesis significance testing; the wonders of confidence intervals, effect sizes, and Bayesian analyses; the value of replication; and the importance of large samples. The necessities of reliable measurement and critical thinking are de rigueur in introductory psychology. Certainly, with regard to practice, the authors add some good and useful new ideas based on innovations in the field and the world, but the spirit of this call is not qualitatively different from calls we have been ignoring for years. The field has continued to rely on problematic practices and, if anything, has exacerbated them with increasing pressure to publish more and more (and shorter and shorter) articles, and to do so as quickly as possible. As a result, criticizing our research practices has become its own cottage industry. Will anything ever change? Here, I offer two bits of anecdotal evidence that the times might be a-changing. The first involves my own editorial consciousness raising and the second an inspiring tenure review.

I do not believe that top journals will (or could, or even should) begin to publish replications as stand-alone contributions. However, as an editor who reads these critiques, I have tried, in admittedly small ways, to institute greater respect (and occasional demand) for replications. For example, in its initial submission, one paper, currently in press in the Journal of Personality and Social Psychology, presented several studies. The final study, the one that truly tested the main predictions of the package, was underpowered. The results were barely significant and looked weird, as results from small sample studies often do. With the echoes of critiques of the field ringing in my ears, I drew a line in the sand and requested a direct replication with a much larger sample. I took this step with misgivings: Was it unfair? Should this work be held to such a standard? Was this "revise and resubmit", in fact, its death knell in disguise? When the revision arrived, I fully expected a cover letter explaining why the replication was unnecessary. To my surprise, the direct replication was presented, and the predictions were strongly supported. Good news all around, but the experience gave me pause. True confession: I felt like I was demanding that those authors hit the lottery. Twice.

In our world of p-values, it is sometimes hard to remember that producing good science is not about hitting the lottery. Nor is it about taking whatever the data show and declaring that one has won the lottery. Importantly, within the editorial process, the preregistration of predictions (that the authors of the target article suggest) often happens, inevitably. When new analyses or new studies are requested, as in the earlier case, authors' hands are tied. I realize that, typically, JPSP is considered a slow and stodgy animal in the publication world. Such lingering conversations over papers would seem to be rather exceptional. If speed is the utmost value, journals are less likely to request new data than to simply reject a paper. If we could all slow down just a bit, it might help. A thoughtful and sometimes slow process has its advantages.


The nuance and breadth of the target article's treatment of institutional factors (especially in terms of tenure) warrant high praise. I believe that the problem of determining the quality of scholarship is just as complex as the authors suggest. Recently, I wrote a tenure letter for an apparently wise institution, with the following criterion noted:

"If you were to compile a list of the most significant books or articles to appear recently in this field, would any of the candidate's publications be on your list? Which ones? Why?"

Such a criterion represents the kind of principles that ought to motivate our science, more generally. If we had this criterion in mind, what sorts of science might we produce?

Aside from data-faking scoundrels, we work very hard, as is evidenced in the astronomical number of articles published in our field. The target article suggests that at least some of this work is probably neither replicable nor particularly sound. Surely, changing our practices would improve all of this science. But the gist of this critique, as well as others, is that in some ways, the sheer amount of research itself is problematic. And doing this science even very, very well would not necessarily reduce this enormous corpus of research. And here I come to the thought that suggested itself to me as I read and ruminated over the target article and that I hesitate to share. I do not mean to sound overly harsh or dismissive of the hard work we do. But, is it possible that we are able to produce so much because what we are producing is potentially trivial, relatively easy, and preoccupied with effects rather than phenomena? It seems to me that our energies are disproportionately dedicated to developing amazingly clever research designs rather than amazingly important research questions. Perhaps not only the practices but also the content of our scholarship requires rethinking. Yes, let us slow down and do our science right. But let us also slow down and remember that ours is the science of human behaviour. Too often, we limit ourselves to problems we have invented in the lab, to topics and variables that implicate very little in terms of human behaviour. Consider a paraphrase of the aforementioned criterion:

"If you were to compile a list of the most significant articles to appear in this field, would any recent publications be on your list?"

With this criterion in mind, what sorts of research questions might we ask?

Stop Me If You Think You Have Heard This One Before: The Challenges of
Implementing Methodological Reforms
RICHARD E. LUCAS AND M. BRENT DONNELLAN
Michigan State University
lucasri@msu.edu

Abstract: Asendorpf et al. offer important recommendations that will improve the replicability of psychological research.
Many of these recommendations echo previous calls for methodological reform dating back decades. In this comment, we
identify practical steps that can be taken to persuade researchers to share data, to follow appropriate research practices,
or to conduct replications. Copyright 2013 John Wiley & Sons, Ltd.

The target article offers several recommendations that will improve psychological research. These suggestions are based on sound methodological arguments and a common sense approach for building a cumulative science. Nothing the authors recommend strikes us as unreasonable or overly burdensome. Yet, their recommendations echo previous calls for more replication studies (e.g., Smith, 1970), greater statistical power (Cohen, 1962; Rossi, 1990), and increased transparency and data sharing (Johnson, 1964; Lykken, 1991; Wollins, 1962). These prior calls have gone unheeded, and thus, if there is to be any hope of lasting methodological reforms, the field must confront the obstacles that have prevented such reforms from being implemented in the past.

Although many psychologists will agree in principle with the suggestions made in the target article, we suspect there will also be vocal opposition to specific recommendations. Bakker et al. (2012) showed that the most successful strategy for finding a statistically significant result is to go against the recommendations of the target article and to run a large number of flexibly analysed studies with very small samples. Thus, in the current system, questionable research practices can produce the raw materials for a multistudy article that will be publishable in the most prestigious journals in our field. It will be difficult to convince those for whom this approach has been successful to change their behaviours.

In terms of implementation, the target article mainly focuses on making desirable research practices less burdensome. As one example, the authors highlight available resources for archiving data. However, it will be important to acknowledge that some researchers will object to specific policy changes for reasons that go beyond researcher burden. For instance, existing research shows that few authors are currently willing to provide data even to specific requests from other individual researchers (Wicherts, Borsboom, Kats, & Molenaar, 2006); we suspect that ease of sharing data is not the primary reason for refusal.

Currently, there are few consequences for researchers who fail to adhere to optimal research practices, such as sharing their data. Highlighting the problems with such policies to funding representatives may be fruitful, especially given the emphasis on accountability that accompanies funding.


It is also disconcerting that journals published by the American Psychological Association and the Association for Psychological Science – journals that are typically quite prestigious and could therefore afford a slight drop in submissions – have no stated penalties for researchers who go against guidelines and refuse to share data. One option is to make it standard journal policy that papers are retracted when authors refuse to share data from recently published papers unless there are compelling mitigating circumstances that prevent sharing. Any inability to share data with interested psychologists should be disclosed to the editor at the time of submission (Kashy, Donnellan, Ackerman, & Russell, 2009).

What are other ways that data sharing can be encouraged? One possibility is simply to make data sharing more normative. If you are interested in someone's data, you should request it and make sure you can replicate their results. In fact, it is probably not a bad idea to ask our close colleagues for their data, just to make the process more commonplace and less adversarial. As anyone who has been asked to share data knows, it only takes 1- or 2-day-long scrambles to compile and annotate existing messy data before you develop better procedures to prevent future occurrences.

In addition to targeting recommendations to those who have leverage, it is also worthwhile considering which recommendations have the largest bang for the buck. It should be clear that many (if not most) studies in psychology are underpowered. The small sample sizes that plague our field have serious consequences in terms of imprecise parameter estimates and reduced credibility. Fortunately, this problem is easy to fix by demanding larger sample sizes. Editors and reviewers should simply require that authors start with the assumption that their effects will be no larger than what is typical for the field unless there is solid evidence that the specific effect under investigation will be larger. Thus, we suggest that power and precision be used as explicit grounds for a desk rejection.
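To make the arithmetic behind this suggestion concrete, the sketch below (not from the commentary) computes the sample size implied by assuming a field-typical effect. The assumed typical effect of d = 0.40, the 80% power target, and the comparison sample of 20 per group are illustrative choices, not figures from the authors.

# Minimal sketch, assuming a field-typical effect of d = 0.40 and a
# two-sample t-test at alpha = .05: the per-group n this assumption implies.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

n_required = analysis.solve_power(effect_size=0.40, alpha=0.05, power=0.80)
power_small = analysis.solve_power(effect_size=0.40, nobs1=20, alpha=0.05)

print(f"n per group implied by a typical effect: {n_required:.0f}")          # roughly 100
print(f"power of a 20-per-group study for that effect: {power_small:.2f}")   # roughly .24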
Similarly, replication studies are easy to conduct and will have great benefit for the field. It is less important whether such replications are conducted by students or senior researchers or whether they are published in online repositories or special sections of existing journals. The real issue is making sure that the results are made available and that those who conduct independent replications are given credit for their efforts. Any reader who agrees with the recommendations provided in the target article can make an immediate contribution to the field by committing to conduct regular replications of their own and others' work and to make sure that the results are made accessible. In addition, concerned researchers should consider refusing to support journals that do not publish replications as a matter of policy.

The fact that so much has been written about methodological reform in the last 2 years is both encouraging and depressing. It is encouraging because these articles could be a harbinger of major changes in how psychological science is conducted. Such articles can also be depressing because the current discussions have an eerie similarity to those from the past decades. As it stands, many of the discussions about methodological reform operate on the assumption that there is basic agreement about the ultimate point of psychological research, which is to gain a clearer understanding of reality. However, it might be worth questioning this basic assumption. What if some researchers believe that the point of psychological science is simply to amass evidence for a particular theoretical proposition? Those with such a worldview might find the recommendations provided by the target article to be unnecessary roadblocks that limit their productivity. If so, then methodological reform needs to confront the reality that improving psychological research must involve changing hearts and minds as well as encouraging more concrete changes in behaviours.

Put Your Money Where Your Mouth Is: Incentivizing the Truth by Making
Nonreplicability Costly
CORY A. RIETH1, STEVEN T. PIANTADOSI2, KEVIN A. SMITH1, EDWARD VUL1
1
University of California, San Diego
2
University of Rochester
edwardvul@gmail.com

Abstract: We argue that every published result should be backed by an author-issued nonreplication bounty: an amount of money the author is willing to pay if their result fails to replicate. We contrast the virtuous incentives and signals that arise in such a system with the confluence of factors that provide incentives to streamline publication of the low-confidence results that have precipitated the current replicability crisis in psychology. © 2013 The Authors. European Journal of Personality

A major part of the replicability crisis in psychology is that commonly reported statistics often do not reflect the authors' confidence in their findings. Moreover, there is little incentive to attempt direct replications, as they are difficult, if not impossible, to publish. We propose a solution to both problems: For each result, authors must name a one-time nonreplication bounty specifying the amount they would be willing to pay if the result did not replicate (e.g. t(30) = 2.40, p < .05, nonreplication bounty: $1000). Thus, when you report a finding, you are effectively making a one-sided bet: if it replicates, you gain nothing, but if it fails to replicate, you pay the bounty using personal income. The bounty should be proportional to your confidence – if you are unsure, it could be $1; if you know the results replicate, it could be a huge sum. This bounty measures the author's subjective confidence on a scale that is universally interpretable, penalizes authors for overconfidence, and provides direct incentives for replication.

Tabling the implementation details, consider the benefits:


(1) Author confidence is clearly reported

Ultimately, only the authors know exactly how their study was conducted and how they analysed their results. Their confidence is the best available signal of the robustness of their results, and a nonreplication bounty offers a clear signal of this confidence. This clear signal offers naïve readers an effortless assessment of the soundness of a result, as well as a quantitative metric to evaluate authors and journals. Thus, instead of rewarding raw publication and citation counts and encouraging the frequent publication of surprising, low-confidence results – one systemic problem contributing to the replicability crisis (Ledgerwood & Sherman, 2012) – sound results could be rewarded for both authors and journals.

(2) Authors have incentive to provide an accurate signal

The nonreplication bounty is not only a clear signal of confidence but also costly to fake. A low-confidence result offers authors two choices: overestimate their own confidence and suffer a considerable risk, or publish a result with low confidence, which readers will know should be ignored. Neither of these will be appealing, so authors will be altogether less eager to publish low-confidence results. If authors systematically overstate their own confidence, intentionally or not, they will face high costs and will either calibrate or leave the field.

(3) Replications are directly encouraged

Replication attempts receive direct incentives: Nonreplications pay a bounty. Moreover, replication attempts would be targeted towards the same results that naïve readers of the literature would have most confidence in: The higher the bounty, the more seriously the result will be taken, and the greater is the incentive for replications. Furthermore, such a system necessitates publication of replication successes and failures, adding further replication incentives.

We believe that many of the other proposed solutions to the replicability crisis ultimately will not work because they fail to provide appropriate incentive to authors (Nosek, Spies, & Motyl, 2012). For instance, the literature has suggested a number of metrics offering more reliable objective signals of result soundness: use of confidence intervals (Cumming & Finch, 2005), effect sizes (Cohen, 1994), Bayesian posterior intervals (Burton, Gurrin, & Campbell, 1998; Kruschke, Aguinis, & Joo, 2012), Bayes factors (Wagenmakers, Wetzels, Borsboom, & Van der Maas, 2011), and various disclaimers pertaining to the analysis procedures (Simmons, Nelson, & Simonsohn, 2012). Although these are useful statistical tools and policies, none is so sound as to avoid the possibility of being gamed, as they do not make errors costly to the authors. Running many low-powered studies, post hoc selection of independent or dependent variables, and other p-hackery (Simmons et al., 2011) would all yield nice results under these metrics. We believe that a remedy to these ailments must provide incentives to authors to offer clear, unbiased estimates of the soundness of their results, in place of the current incentives for authors to directly or indirectly overstate their confidence in and the reliability of their data.

Similarly, many proposals for remedying the replicability crisis (such as the target article) have focused on rules that publication gatekeepers (reviewers and editors) should enforce so as to increase the soundness of results. In contrast, nonreplication bounties would provide a clear and reliable signal that would alleviate some of the burden on volunteer reviewers and editors, rather than increase it. Authors would no longer receive incentives to sneak low-confidence results past reviewers, and reviewers could take on more thoughtful roles in trying to assess the validity of the measures and manipulations: Does the empirical result really have the theoretical and practical implications that the authors claim? Furthermore, as long as we have a reliable confidence signal associated with each result, there need not be an argument about whether type I or type II errors are more worrisome (Fiedler et al., 2012): Journal editors can choose to publish exciting, but speculative, findings or to publish only high-confidence results.

As proposed (Asendorpf et al., this issue; Koole & Lakens, 2012), encouraging replication attempts and the publicity of their outcomes is certainly beneficial. However, without quantitative metrics of result soundness, there is little incentive for journals to publish replications as impact factor only rewards short-term citations, which largely reflect the novelty and noteworthiness of a result.

The status quo indirectly provides incentives for rapid publication of low-confidence outcomes and their misrepresentation as high-confidence results: a practice that appears to be undermining the legitimacy of our science. We believe that local changes that do not restructure authors' incentives are only stopgaps for a deep-seated problem. Under our scheme, authors would have incentives to offer the most calibrated, precise estimates of the soundness of their available results.

Our position is best summarized by Alex Tabarrok (2012): "I am for betting because I am against bullshit. Bullshit is polluting our discourse and drowning the facts. A bet costs the bullshitter more than the non-bullshitter so the willingness to bet signals honest belief. A bet is a tax on bullshit; and it is a just tax, tribute paid by the bullshitters to those with genuine knowledge."
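As a footnote to the betting logic that runs through this proposal, the toy calculation below (not from the commentary; all numbers are hypothetical) shows why the bounty punishes overconfidence: the expected payout grows with the gap between an author's stated confidence and the result's actual replication rate.

# Minimal sketch, assuming hypothetical numbers: expected payout of an
# author-issued nonreplication bounty.
def expected_payout(bounty: float, true_replication_rate: float) -> float:
    """Expected amount the author pays, given how often the result actually replicates."""
    return (1.0 - true_replication_rate) * bounty

# Calibrated author: a $1000 bounty on a result that replicates 95% of the time
print(expected_payout(1000, 0.95))   # 50.0

# Overconfident author: the same bounty on a result that replicates only half the time
print(expected_payout(1000, 0.50))   # 500.0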

Increasing Replicability Requires Reallocating Research Resources


ULRICH SCHIMMACK AND GIUSEPPINA DINOLFO
University of Toronto Mississauga
uli.schimmack@utoronto.ca

Abstract: We strongly support the recommendation to increase sample sizes. We recommend that researchers, editors,
and granting agencies take statistical power more seriously. Researchers need to realize that multiple studies, including
exact replication studies, increase the chances of type II errors and reduce total power. As a result, they have to either
publish inconclusive null results or use questionable research methods to report false-positive results. Given limited
resources, researchers should use their resources to conduct fewer original studies with high power rather than use
precious resources for exact replication studies. Copyright 2013 John Wiley & Sons, Ltd.


Psychology has had a replicability problem for decades (Sterling, 1959), but it seems as if the time has finally come to improve scientific practices in psychology (Schimmack, 2012). Asendorpf et al. (this issue) make numerous recommendations that deserve careful consideration, and we can only comment on a few of them. We fully concur with the recommendation to increase sample sizes, but past attempts to raise sample sizes have failed (Cohen, 1990; Maxwell, 2004; Schimmack, 2012). One potential reason for the persistent status quo is that larger samples reduce research output (number of significant p-values). This is a disadvantage in a game (reinforcement schedule) that rewards quantity of publications. If sample size is ignored in evaluations of manuscripts, it is rational for researchers to conduct many studies with small samples. Thus, it is paramount to reward costly studies with adequate power in the review process. As most manuscripts contain more than one statistical test, it is also important to take the number of statistical tests into account (Maxwell, 2004; Schimmack, 2012). Even if a single statistical test has adequate power, total power (i.e. the power to obtain significant results for several tests) decreases exponentially with the number of statistical tests (Schimmack, 2012). As a result, holding other criteria constant, a manuscript with one study, one hypothesis, and a large sample is likely to produce more replicable results than a manuscript with many studies, multiple hypotheses, and small samples. We can only hope that editors will no longer reject manuscripts because they report only a single study, given that a greater number of studies is actually a negative predictor of replicability. Instead, editors should focus on total power and reward manuscripts that report studies with high statistical power because statistical power is essential for avoiding type I and II errors (Asendorpf et al., in press; Maxwell, 2004).

We disagree with the recommendation that researchers should conduct (exact) replication studies because this recommendation is antithetical to the recommendation to increase sample sizes. Demanding a replication study is tantamount to asking researchers to split their original sample into two random halves and to demonstrate the effect twice. For example, if the original study and the replication study have 80% power, total power is only 64%, meaning every third set of studies produces at least one type II error. In contrast, combining the two samples produces one study with 98% power.
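The arithmetic in this example is easy to verify; the short sketch below (not part of the commentary) does so with statsmodels, assuming for illustration a two-sample t-test on a medium effect of d = 0.5 at alpha = .05.

# Minimal sketch, assuming d = 0.5 and alpha = .05: two studies at 80% power
# versus one study on the combined sample.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

n80 = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)      # ~64 per group
total_power_two_studies = 0.80 * 0.80                                    # 0.64
power_combined = analysis.solve_power(effect_size=0.5, nobs1=2 * n80, alpha=0.05)

print(f"n per group for 80% power: {n80:.0f}")
print(f"total power of two 80%-power studies: {total_power_two_studies:.2f}")
print(f"power of one study on the combined sample: {power_combined:.2f}")   # ~0.98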
We agree with the recommendation to focus on research quality over quantity. We believe the main reason for the focus on quantity is that it is an objective indicator and easier to measure than subjective indicators of research quality. We think it is useful to complement number of publications with other objective indicators of quality such as number of citations and the h-index. Another useful indicator could be the incredibility index (Schimmack, 2012). A low incredibility index suggests that a researcher conducted studies with adequate power and was willing to publish null findings. In contrast, a high incredibility index suggests that a researcher used questionable research practices to publish results with a lower chance of replication.

We agree that funding agencies have the most power to change current research practices, but we do not agree that funding agencies should allocate resources to exact replication studies. It would be more beneficial for funding agencies to enforce good research practices so that original studies produce replicable results. Funding agencies already request power analyses in grant applications, but there is no indication that this requirement has increased the power of published studies. A simple way to increase replicability would be to instruct review panels to pay more attention to total power and to fund research programmes that have a high probability of producing replicable results.

Finally, we agree with the recommendation to change the informal incentives in the field. Ideally, psychologists have a common goal of working together to obtain a better understanding of human nature. However, limited resources create conflict among psychologists. One way to decrease conflict would be to encourage collaboration. For example, granting agencies could reward applications by teams of researchers that pool resources to conduct studies that cannot be examined by a single researcher in one lab. It would also be helpful if researchers would be less attached to their theories or prior findings. Science is a process, and to see one's work as a contribution to a process makes it easier to accept that future work will improve and qualify earlier findings and conclusions. In this spirit, we hope that the authors of the target article see our comments as an attempt to contribute to a common goal to improve psychological science.

In Defence of Short and Sexy


DANIEL J. SIMONS
Department of Psychology, University of Illinois
dsimons@illinois.edu

Abstract: Proposals to remedy bad practices in psychology invariably highlight the problem of brief empirical reports of sexy findings. Even if such papers are disproportionately represented among the disputed findings in our field, discouraging brevity or interestingness is the wrong way to cure what ails us. We should encourage publication of reliable research and discourage underpowered or unreliable studies, regardless of their length or sexiness. Improved guidelines may help, but researchers will police themselves only if we change the incentive structure by systematically publishing direct replications. © 2013 The Authors. European Journal of Personality


Revelations of statistical bad practices, methodological shortcuts, and outright fraud in psychology have almost invariably led to criticism of novel, sexy findings that seem disproportionately well represented among the contested claims in our field. Here I use the term "sexy" in the same manner that the target article does, to refer to a finding or claim that is provocative and exciting. True, many of the problematic findings in the literature take the form of sexy short reports. But they are problematic not by virtue of being brief or interesting, but by being wrong. Sexy findings need not be wrong, though, and dampening enthusiasm because a finding happens to be interesting or a paper brief will not cure what ails the field. We should encourage publication of highly powered, replicable, and interesting findings that people will want to read. We should dampen enthusiasm for (and, ideally, publication of) underpowered and unreliable studies, regardless of their length or the appeal of the topic.

Many of the changes proposed in the target paper will improve the collective quality of our publications. Journal articles are the currency of our field, and improved reporting requirements would increase their value. I applaud the new statistical and method standards adopted by the Psychonomic Society for all of its journals, the initiatives under consideration at the Association for Psychological Science to improve the state of our science, and the call in the target article for funding agencies to expect more rigorous statistical and methodological practices. But these changes are not enough because they will not address the real problem afflicting our field: the lack of incentives for individual researchers to publish replicable results.

Bad practices in psychology are prevalent in part because there is little public cost to being wrong and little direct benefit for being right. At present, the impact of a finding both for its author and for the field is largely unrelated to its correctness. Imagine conducting an underpowered study and finding a sexy result at p < .05. There are many incentives to publish that sexy result in a top journal immediately, including that the journal likely would want to publish it. Journal editors justifiably want to publish important, sexy findings that people will want to read. What incentive do you have to make sure your own work will replicate with a larger sample before publishing it? If you conducted the larger study, the extra effort might incrementally increase the chances of publication success, but probably not enough to justify the costs. The finding would garner the same visibility with or without the larger replication. A larger-scale replication would allow you to take pride in yourself as a good scientist who verifies before publishing, but most of the p-hackers among us already think of themselves as good scientists.

Once published, your sexy result will become the standard by which future studies are judged, you likely will serve as a reviewer for any findings that challenge yours (or build on it), and the literature will forever give more attention to your paper than any challenges to it (e.g., Bangerter & Heath, 2004). There is little danger that it will be corrected even if it is false (e.g., Ioannidis, 2012). Consequently, there are no incentives working against the immediate publication of sexy but underpowered studies.

The only incentives that would induce consistent changes in publishing practices are those that work for or against the interests of the individual researcher. We must provide incentives for publication of replicable findings and introduce consequences for publishing iffy ones. One initiative would have that effect: encouraging systematic publication of replications. Specifically, journals should encourage direct replications, conducted by multiple labs using the original protocol, and published regardless of outcome. The primary goal would be a cumulative estimate of the true effect size, but a secondary benefit would be a change to the publication incentives.

Imagine you have a new, sexy finding that just barely reached p < .05 with a small sample. Would you publish it right away if there were a sizable chance that multiple other labs would try to replicate it and their replication attempts would be published? The risk of embarrassment for being publicly wrong and the accompanying hit to your scientific credibility would provide a large incentive to make sure you are right before publishing, particularly if the result is sexy. To the extent that sexy findings challenge well-established evidence, they merit greater scrutiny: Extraordinary claims require extraordinary evidence. The sexier the claim, the more likely that other labs would want to replicate it, and the greater the incentive for the original researcher to make sure the result is solid before publishing. The end result might be fewer sexy findings in our top journals, but that outcome would emerge not by discouraging interesting results but by providing incentives for publication of reliable ones.

Better design, analysis, and reporting standards of the sort proposed in the target article are essential if we hope to improve the reliability and replicability of published psychology research, but only by changing the incentives for individual researchers can the field move away from publishing underpowered sexy findings and towards publication of well-powered, robust, and reliable sexy findings. With the right incentives in place, researchers will verify before publishing, and some initially promising results will vanish as a result. But sexy findings that withstand replication are the ones we want in our journals and the ones our journals should want to publish.

Replicability Without Stifling Redundancy


JEFFRY A. SIMPSON
University of Minnesota
simps108@umn.edu

Abstract: Many papers in psychology are written in a hypothesis-confirming mode, perhaps because authors believe that if some of their a priori predictions fail, their papers will be rejected. This practice must change. However, we must achieve replicability without stifling redundancy, which cannot occur unless scholars use (or develop) strong theoretical principles to derive, frame, and test their predictions. We also need to establish reasonable expectations about data sharing and data providing that are sensitive to the investment required to generate and maintain different kinds of datasets. © 2013 The Authors. European Journal of Personality


It is hard to disagree with most of what is said and recommended in this well-written and well-argued target article. Psychology has clearly reached a crossroads, and the time has come to focus much more attention on getting findings right. As the authors note, portions of the blueprint for change already exist in other fields (e.g., medicine), but larger institutional values, priorities, and practices need to shift within psychology at the level of authors, editors and reviewers, departments and universities, and granting agencies.

Most papers in psychology are written in a hypothesis-confirming mode, which may partially explain why the current confirmation rate in psychology is 92% and has increased sharply during the last 20 years. Many authors implicitly believe that if even a few of their a priori predictions fail to work as planned, their papers will suffer in the review process. Some scholars (e.g., Bem, 1987) have actually advocated writing introductions so that they provide a coherent story that funnels readers towards a priori predictions that are predominately supported by the reported data. This practice has been harshly criticized by Kerr (1998) and others, and it needs to change. As the authors note, editors and reviewers can both play important roles in facilitating this change. However, we need to achieve replicability without stifling redundancy, which cannot occur unless scholars use strong theories to derive and test their predictions and to guide them when prior results consistently fail to replicate.

There is little if any argument that we, as a field, need to increase the size of our samples, improve the reliability (and validity) of our measures, ensure that our studies have sensitive designs, conduct proper statistical analyses, avoid reporting underpowered studies, and think more carefully about the error introduced when multiple statistical tests are performed. There is also little if any argument that authors should provide comprehensive literature reviews in their introductions, report their sample size decision making within papers, be much clearer about what their strong a priori predictions actually are, and archive their research materials (and data, when realistic) so other investigators can evaluate what they have carried out. Authors also need to communicate more frequently, directly, and openly with colleagues who are conducting similar research, and not only individual investigators but also different teams of researchers located in different labs should routinely replicate each other's work when feasible.

Editors and reviewers also need to alter some of their expectations and practices. From my vantage point as the current editor of the Journal of Personality and Social Psychology: Interpersonal Relations and Group Processes (JPSP: IRGP), I believe that we cannot view the perfectly confirmatory paper as the gold standard for acceptance and that editors should be willing to publish well-conducted, sufficiently powered studies that fail to replicate important predictions and hypotheses. This is occurring at JPSP. The Journal of Personality and Social Psychology: Attitudes and Social Cognition section, for example, just published a set of studies that failed to replicate Bem's (2011) retroactive facilitation of recall effects (Galak, LeBoeuf, Nelson, & Simmons, 2012). JPSP: IRGP recently accepted a paper showing that the findings of several previous candidate gene studies do not replicate in the National Institute of Child Health and Human Development Study of Early Child Care and Youth Development dataset (Fraley, Roisman, Booth-LaForce, Owen, & Holland, in press).

Although the target article is exemplary in many ways, it does not address two sets of considerations relevant to the successful implementation of the recommendations offered. First, the article says relatively little about the essential roles that good theory and careful theorizing need to assume to make future findings in our field more replicable. The authors are correct in emphasizing that facets of studies vary in terms of individuals/dyads/groups (the observed units), situations (natural or experimental), operationalizations (manipulations, methods, and measures), and time points. They also acknowledge that "Which [facet] dimensions are relevant depends on the relevant theory" (pp. XX of the target article). However, many researchers do not derive, frame, or test their hypotheses from the foundation of strong theories that make specific predictions about the following: (i) which individuals should (and should not) show a specific effect; (ii) the situations or contexts in which the effect should (and should not) emerge; (iii) the manipulations, methods, or measures that should (and should not) produce the effect in certain people exposed to certain situations; and (iv) when the effect should be stronger and weaker (i.e. its time course). Some theories do offer reasonably good precision on some of these dimensions (e.g. certain diathesis–stress models; Simpson & Rholes, 2012), but more careful and detailed theorizing must be performed upfront if future investigators are going to have a chance to replicate certain effects. Cast another way, we must do a better job of thinking theoretically to pin down how the most critical facets associated with different research designs should operate.

Second, the target article does not address the complications that may arise when data sharing extends beyond easier-to-collect cross-sectional experiments or self-report studies. Some research projects are extremely intensive in terms of time, effort, and cost, such as large N social interaction studies that may require years of behavioural coding, and major longitudinal projects that follow the same people over many years while collecting hundreds or sometimes thousands of measures. Scholars who work on these projects often devote most of their careers to these highly intensive data collection efforts, which can produce exactly what the authors call for – very high-quality data that can generate reliable, valuable, and very difficult-to-obtain findings. Unless data-sharing expectations and rules are carefully crafted, future investigators who might be interested in devoting their careers to collecting these high-investment datasets may be disinclined to do so, which would have a very negative impact on our field. Thus, there must be clear and reasonable expectations about both data sharing and data providing that are sensitive to the amount of investment required to generate and maintain different types of datasets.


There Is No Such Thing as Replication, but We Should Do It Anyway


BARBARA A. SPELLMAN
University of Virginia
spellman@virginia.edu

Abstract: Despite the fact that exact replications are impossible to perform, it is important for a science to make close
attempts and to systematically collect such attempts. It is also important, however, not to treat data as only relevant to the
hypothesis being tested. Good data should be respected; they can tell us about things we have not yet thought of. Plans for
publishing systematic attempts at replication (regardless of outcome) are being developed. Copyright 2013 John Wiley
& Sons, Ltd.

Claiming there is no such thing as replication may sound odd coming from a scientist, but the authors of the target article (Asendorpf et al., this issue) also make that point. More accurately, the statement should be, "There is no such thing as exact replication." Each study is different – different subjects, materials, time of day, time in history, and so on. The fact that each one is different is true not only in psychological science but also in other sciences. It is different atoms, bacteria, fossils, plants, and stars. The success of a replication depends on, among other things, the variability of the relevant features within the population studied. Often a scientist does not know the variability; and often a scientist does not even know the relevant features.

Neither one failure to replicate, nor one successful replication, tells us much. But a pattern of failures and successes does. One study is an existence proof – such a thing can happen. But having multiple studies, each slightly different, each varying on one or more dimensions, gives us information about robustness, about generalizability, about boundary conditions, about the features that matter, and, therefore, about our scientific theory.

In this comment, I discuss three points for psychological scientists. First, we should do more to respect our data. Second, we should do more to recognize that our data speak to more than one theory. Third, we should do more to amass our data to help us understand the robustness and generalizability of what we (think we) know. Finally, I report on a project in progress involving Perspectives on Psychological Science that should help with that last goal.

The data are . . .

A wonderful graduate school professor of mine, the late Tom Wickens, would often say, "The data are . . .". What he was doing was correcting our grammar, making sure that we knew that "data" is the plural of "datum" (and thereby reminding us that we were reporting more than one observation). But when I imagine Tom's long-ago admonition, I often think of an additional interpretation: We should give more respect to our data because the data are. The data exist, and they are trying to tell us something. We should listen closely.

Scientists work hard to collect data, but sometimes we carelessly toss them away. That is fine if we discovered something truly bad about the data – such as typos that changed the meaning of the stimuli or the measurement scales, a glitch in a randomization procedure, a manipulation check that reveals that subjects did not understand the instructions, the fact that your graduate student says to you "I just finished running condition 1; I will start condition 2 tomorrow" (true story).

But that we collect good data and then toss them away or bury them in a (virtual) file drawer because the study "did not work" – that is, because it did not confirm our hypothesis or replicate a previous research finding (and that therefore we have no use for it and no place to publish it) – well, that is sad for science. Good data are good data, and we should respect them.

Bull's eye

There is a story that is told in many guises but one version I like is this: A woman is driving through the countryside and sees a barn on which many huge targets are painted. Smack in the middle of the bull's eye in each target is a bullet hole. The woman stops and talks to the farmer. "You are such an amazing shot," she says, "a bull's eye every time." "Oh no," he says, "first I shoot at the barn, then I paint the targets around the holes."

This feat is analogous to HARKing in science (hypothesizing after the results are known; Kerr, 1998). We have a hypothesis, we design a (good) study, we collect some data, but the study doesn't "work" to confirm our hypothesis. However, after many analyses have been carried out, something in the study turns out to be significant, and we write an article as if those data answered the hypothesis we were asking all along; that is, we paint the target around the data rather than where we were aiming in the first place. Many current calls for reforming how we do science, including the target article, suggest that researchers register their hypotheses before testing them to avoid HARKing and to avoid the antics so nicely illustrated by Simmons et al. (2011) that caused subjects to age before their eyes.

Registering hypotheses is a fine idea but it seems equally important that we tell both what we were aiming for and what we hit. Listen to the data; it is informative about more than one hypothesis (Fiedler et al., 2012). As Nelson Goodman (1955, in his discussion of "grue") said (more or less): Every piece of data is consistent with an infinite number of hypotheses; it is also inconsistent with an infinite number of hypotheses.


Or, as I like to say: One scientist's .25 is another scientist's .01.

What is to be done?

The target article makes some recommendations to increase not simply replicability as their title says, but really our knowledge of which findings are robust and generalizable. Again, I agree. We need to not only save and publish more of our data but also better amass our results. We need better ways to connect our findings: not just knowing who has cited whom but what they have cited each other for (Spellman, 2012). We do not need to publish single replications or single failures to replicate; rather, we need systematic attempts at replication and meta-analysis that do not suffer from massive file drawer problems.

Among the suggestions of the target article is that journals be willing to go even further by launching calls to replicate important but controversial findings with a guarantee of publication provided that there is agreement on method before the study is conducted. In fact, Perspectives on Psychological Science has plans to do that in the works. I do not know if it will be in place by the time this comment is published, but readers can check for updates and instructions at http://morepops.wordpress.com.

The Significance Test Controversy Reloaded?


HANS WESTMEYER
Free University of Berlin
hans.westmeyer@fu-berlin.de

Abstract: Asendorpf et al. addressed the currently much-discussed problem of poor replicability of scientific findings in
many areas of psychological research and recommended several reasonable measures to improve the situation. The
current debate rekindles issues that have a long history in psychology and other social and behavioural sciences. In
this comment, I will focus on some precursors of the current debate. Copyright 2013 John Wiley & Sons, Ltd.

The target paper is a very important and highly welcome contribution to our current research practice. The replication of scientific findings is a neglected topic in many areas of psychology, and the recommendations for increasing replicability are well founded and worthy of adoption by researchers, editors, reviewers, teachers, employers, and granting agencies. The topic of replicability has a long history in our discipline and, at least in certain areas of psychology, has been with us all the time.

The target paper reminds me of a book entitled The significance test controversy edited by Morrison and Henkel (1970a). Their book is a reader representing "the major issues in the continuing debate about the problems, pitfalls, and ultimate value of one of the most important tools of contemporary research" (text on the front cover). In one of the reprinted papers in this book, Sterling (1970, originally published in 1959) discussed publication decisions and their possible effects on inferences drawn from tests of significance, or vice versa. He presented a table (p. 296) of the significance test outcomes performed in all contributions to four renowned psychology research journals published in 1955 (three journals) and in 1956 (one journal). The total number of published research reports was 362; 294 of these used significance tests; in 286 contributions, the null hypothesis had been rejected (alpha .05); only in 8 out of 294 research reports (2.72%) had the null hypothesis not been rejected; not a single study was a replication of a previously published experiment. The target paper shows that the situation has not changed much within the last 50 years.

In a comment on the Sterling paper, Tullock (1970, originally published in 1959) drew the following conclusion: "The tradition of independent repetition of experiments should be transferred from physics and chemistry to the areas where it is now a rarity. It should be realized that repeating an experiment, although not necessarily showing great originality of mind, is nevertheless an important function. Journals should make space for brief reports of such repetitions, and foundations should undertake their support. Academics in the social sciences should learn to feel no more embarrassment in repeating someone else's experiment than their colleagues in the physics and chemistry departments do now" (p. 302). That is not far from an admittedly very brief version of the recommendations given in the target paper.

There is one important disagreement between the editors of the aforementioned book, Morrison and Henkel, and the authors of the target article. Morrison and Henkel (1970b) came to very sceptical conclusions concerning the significance of significance tests in scientific research and briefly addressed the question, "What do we do without significance tests?" (p. 310f), whereas the authors of the target article do not explicitly question the application of significance tests. At least they mention, as an alternative approach, parameter estimation and the computation of confidence intervals. This approach had also been addressed in the Morrison and Henkel book by a contribution by Rozeboom (1970, originally published in 1960).

One reason for drawing sceptical conclusions concerning the significance of significance tests in psychological research is the requirement of random samples drawn from specified populations (cf. Morrison & Henkel, 1970b, p. 305f). The authors of the target article emphasize this point: "Brunswikian replicability requires that researchers define not only the population of participants, but also the universe of situations, operationalizations, and time points relevant to their designs." This reminds me of the structuralist or nonstatement view of scientific theories that requires determination of the set of intended applications as an indispensable part of any proper formulation of a scientific theory (or hypothesis; cf. Westmeyer, 1989, 1992).


Let us remain more modest and be satisfied with studies conducted on random samples drawn from defined populations of participants. But are most psychological studies of this kind? I doubt it. Many of our studies are conducted on groups of students, quite often from our own department, without any previous population specification. These groups are not random samples; even the term "convenience sample" is hardly appropriate. For something to be a sample, there has to be a targeted population. What would the population for a group of students be? The population of all persons worldwide? The population of all students worldwide? The population of all students in a certain country? The population of students from a certain university? The population of students from a certain department? And what about the time points? Is a specification of the time points necessary, or do the respective populations also comprise future (and former) students? If we take the requirement of (random) sampling from prespecified populations seriously, a remarkable change in our research practice and the way we formulate our hypotheses is the consequence. That change would greatly facilitate the replicability of our findings. Differential psychology and psychological assessment are among the few areas of psychology in which many studies already satisfy the discussed requirement (e.g. when properly constructing tests).

It is regrettable that the target article does not refer to previous explications of the terms replication and replicability. These explications have been with us for a long time. Take, for example, the differentiation of replication into direct and systematic replications by Sidman (1960) and the further differentiation of direct replication into intergroup or intersubject and intragroup or intrasubject replications, not to mention still-further differentiations of systematic replication. For Sidman, replicability is one of the most important evaluation criteria for scientific findings, although there is no place for significance tests in his methodology. And take, for example, Lykken (1970, originally published in 1968), who introduced three kinds of replication: literal replication, operational replication, and constructive replication. Sidman's differentiations, in particular, would enrich the terminology proposed in the target article, which does not even mention experimental single-case studies as a possible alternative to the study of large samples (cf. Kazdin, 2010).

These omissions in no way decrease the importance and merits of the recommendations made in the target article. I really hope that the new debate will have long-lasting consequences.

AUTHORS' RESPONSE
Replication is More than Hitting the Lottery Twice
JENS B. ASENDORPF1*, MARK CONNER2, FILIP DE FRUYT3, JAN DE HOUWER4, JAAP J. A. DENISSEN5,
KLAUS FIEDLER6, SUSANN FIEDLER7, DAVID C. FUNDER8, REINHOLD KLIEGL9, BRIAN A. NOSEK10,
MARCO PERUGINI11, BRENT W. ROBERTS12, MANFRED SCHMITT13, MARCEL A. G. VAN AKEN14,
HANNELORE WEBER15, JELTE M. WICHERTS5
1
Department of Psychology, Humboldt University Berlin, Germany
2
Institute of Psychological Sciences, University of Leeds, UK
3
Department of Developmental, Personality and Social Psychology, Ghent University, Belgium
4
Department of Experimental Clinical and Health Psychology, Ghent University, Belgium
5
School of Social and Behavioral Sciences, Tilburg University, The Netherlands
6
Department of Psychology, University of Heidelberg, Germany
7
Max Planck Institute for Research on Collective Goods, Bonn, Germany
8
Department of Psychology, University of California at Riverside, USA
9
Department of Psychology, University of Potsdam, Germany
10
Department of Psychology, University of Virginia, USA
11
Department of Psychology, University of Milano-Bicocca, Italy
12
Department of Psychology, University of Illinois, USA
13
Department of Psychology, University of Koblenz-Landau, Germany
14
Department of Psychology, Utrecht University, The Netherlands
15
Department of Psychology, University of Greifswald, Germany
jens.asendorpf@online.de

Abstract: The main goal of our target article was to provide concrete recommendations for improving the replicability of
research findings. Most of the comments focus on this point. In addition, a few comments were concerned with the
distinction between replicability and generalizability and the role of theory in replication. We address all comments
within the conceptual structure of the target article and hope to convince readers that replication in psychological
science amounts to much more than hitting the lottery twice. Copyright 2013 John Wiley & Sons, Ltd.


We thank the commentators for their thoughtful, and sometimes amusing, remarks, constructive criticisms, and suggestions. We are delighted that most comments focused on concrete recommendations for improving the replicability of research findings, even describing concrete actions in line with some of our recommendations (e.g., Simpson and Spellman). Thereby, the peer commentary section and, we hope, our response contribute to the current debate in psychology about the poor replicability of research findings and how to improve it. To us, the most important and commonly expressed mindset to address was stated best by King: that replication is akin to hitting the lottery. Twice. In this response, we hope to convince readers that empirical research is more than a game of luck and to keep in mind that the goal of any empirical study is to learn something. The role of chance in research is to provide an indication of confidence in the result, not to determine whether we won the game.

WHAT IS HISTORICALLY DIFFERENT THIS TIME?

Commenters noted the historical cycles of recognizing challenges in replicability and failing to take action or find correctives (see particularly Westmeyer and King). The current intense discussion could wither as well. However, we believe that it is different this time. First, prior cycles of this debate were somewhat isolated to specific areas of psychology and other disciplines. This time, the discussion is an explicit, intense, and widespread debate about the extent and the causes of nonreplication. The issue is dominating discussion across the sciences and includes all major stakeholders: societies, journals, funders, and scientists themselves. This gives the debate a stronger impetus than ever before, which, if wisely channelled towards getting it right, increases the chances for a truly self-correcting movement in our science.

Second, contributors to the debate recognize that the issue is systemic: not isolated to a particular practice, discipline, or part of the research process. Our target article acknowledges this by recommending actions at multiple levels. Third, there exists an infrastructure, the Internet, that can enable solutions such as data sharing on a scale that was simply not conceivable in previous epochs. Now, the barriers are not technical, they are social. Therefore, we are more optimistic than some of the commentators that the current debate offers opportunity for real reform and improvement.

NEED FOR REPLICATION

Two commentators questioned the need for conducting replication studies. Francis questioned replicability as a core requirement for psychological findings by drawing a distinction between physics and chemistry on the one hand and psychology on the other because psychological findings are more uncertain. But, as quantum physics teaches us, uncertainty is inherent in many physical phenomena, and the role of statistics is to solve problems of probabilistic relations, whether in physics, chemistry, or psychology. Francis recommended meta-analysis as a solution for reducing uncertainty, and here we agree. But his arguments drew a false distinction between replication and meta-analysis. Replication is the stuff that makes meta-analysis possible (see also our section in the target article on small meta-analyses for evaluating the replicability of an effect size).

Schimmack and Dinolfo did not question the importance of replicability, but they did question the usefulness of replication studies, with the argument that such studies are not needed if the original study was sufficiently powered. Although we certainly agree with the implied call for greater power, it is not realistic to imagine that all studies will be sufficiently powered. The central challenge is resource allocation. Researchers pushing the boundaries of knowledge take risks and venture into the unknown. In these cases, it is easy to justify placing a small bet to see if an idea has any merit. It is very difficult to justify placing a large bet at the outset of a research programme. We agree that this research strategy can lead to false positives resulting from many small bets, but it is also a means of reducing false negatives. If we can only place large bets, then we will take very few risks and miss perhaps the most important opportunities to learn something. So, what is the solution? Replication. When one finds some initial evidence, then a larger bet is justifiable. Our suggestion is that it is not only justifiable; it is essential. We believe that this strategy recognizes the conflicting challenges facing the pursuit of innovation and confirmation in knowledge accumulation.

Although it is true that one well-powered study is better than two, each with half the sample size (see also our section in the target article on the dangers of multiple underpowered studies), the argument ignores the point, reiterated by many other commentators, that exact replication is never possible; even studies designed as direct replications will inevitably vary some more or less subtle features of the original study. Thus, replication studies have merits even in an ideal Schimmack and Dinolfo world where only well-powered studies are conducted, by making sure that the design described by the original authors and copied by the replicators sufficiently describes all causally relevant features. In many areas of current psychology, well-powered replication attempts of equally well-powered original studies will sometimes fail, turning the replication studies into assessments of the limits of generalizability.

FROM REPLICABILITY TO GENERALIZABILITY

We view direct replicability as one extreme pole of a continuous dimension extending to broad generalizability at the other pole, ranging across multiple, theoretically relevant facets of study design. Cacioppo and Cacioppo called direct replication "minimal replication" and linked inability to generalize to fruitful theoretical challenges. We fully endorse this view (see also IJzerman et al.). When replication fails, it can provide an opportunity for condition seeking (what are the boundary conditions for the effect?) that can stimulate theory advancement. We also like the argument by Cacioppo and Cacioppo that the multiple determination of virtually all psychological phenomena requires generalization rather than replication studies to appreciate a phenomenon fully.

Nevertheless, we insist that replicability is a necessary condition for further generalization and thus indispensable for building solid starting points for theoretical development. Without such starting points, research may become lost in endless fluctuation between alternative generalization studies that add numerous boundary conditions but fail to advance theory about why these boundary conditions exist.

ROLE OF THEORY

We agree that our recommendations could have done more to emphasize the role of theory. As Simpson correctly noted, we only briefly cited theory as a means of guiding the selection or construction of relevant design facets. The main reason is that our focus was on replication, not on generalization. In any case, we fully endorse Simpson's and Eid's views on the importance of theory for determining the relevant facets of an experimental design, for operationalizing them such that they fit the underlying theory, and for generating a design that is best suited to study the expected effects. Also, we like Eid's discussion of the importance of deciding what should be considered measurement error and what should be considered substantive variation on theoretical grounds and his reminder that in many areas of psychology theories for important facets are underdeveloped or completely missing (e.g., a theory of stimuli as a prerequisite of a contextualized theory of perception or a theory of situations as a prerequisite of a contextualized theory of personality). We only insist that replication studies have their own virtue by providing solid starting points for generalization (see also the preceding section).

STUDY DESIGN AND DATA ANALYSIS

Only two comments focused directly on study design and data analysis. Eid noted that facets should not exclusively be considered random; whether they should be considered random or fixed is a theoretical issue. Actually, we did not propose in the target article that all facets should be considered random; instead, we proposed that researchers should at least consider that a facet might be better considered random rather than fixed. Whereas individuals are routinely treated as random factors, stimuli or situations are routinely considered fixed in most studies even though there are often good reasons for treating them as random. Related was Westmeyer's remark that we discussed only designs including samples of individuals, ignoring single-case studies. We agree that we should have noted that our facet approach does include single-case studies as designs with no variation in the facet of individuals, just as many cross-sectional studies are designs with no variation in the facet of developmental time.
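To make the stakes of this design decision concrete, the following is a minimal simulation sketch (our illustration, not an analysis from the target article or any commentary; the design, sample sizes, and variance values are assumptions chosen only for demonstration). It mimics a study in which each of two conditions uses its own small sample of stimuli; when the analysis treats stimuli as fixed and simply aggregates over them, the nominal Type I error rate is badly exceeded even though there is no true condition effect.

```python
# Illustrative sketch only: why treating a sampled facet (here, stimuli) as fixed
# can inflate false positives. All numeric values are arbitrary assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2013)
n_sims, n_subjects, n_stimuli = 2000, 40, 8
alpha = .05
false_positives = 0

for _ in range(n_sims):
    # Each condition gets its own random sample of stimuli; stimulus effects vary.
    stimulus_effects = rng.normal(0.0, 1.0, size=(2, n_stimuli))
    subject_means = []
    for condition in range(2):
        # Trial-level noise around the stimulus effects; the true condition effect is zero.
        trials = stimulus_effects[condition] + rng.normal(0.0, 1.0, size=(n_subjects, n_stimuli))
        subject_means.append(trials.mean(axis=1))  # aggregating over stimuli treats them as fixed
    _, p = stats.ttest_ind(subject_means[0], subject_means[1])
    false_positives += p < alpha

print(f"Nominal alpha = {alpha}, observed false-positive rate = {false_positives / n_sims:.2f}")
```

In this sketch the particular stimulus sample acts like a hidden condition effect, so the observed false-positive rate far exceeds the nominal .05; treating stimuli as a random facet in the analysis (e.g., with crossed random effects) or replicating with new stimulus samples exposes the problem.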
PUBLICATION PROCESS

Many comments concerned our recommendations for reforming the publication process on the part of reviewers, editors, and journals. We were most curious to read the comments by Fanelli because of his bird's-eye view on psychological publications in the context of publications in other areas of science and by the editors of flagship journals, King, Simpson, and Spellman, because we were quite critical about the current policies of many such journals that discourage direct replications and encourage sequences of underpowered studies.

Fanelli's remark about an equal citation rate of negative and positive results in psychological publications took us by surprise, because in the target article, we discussed confirmation bias of authors and publication bias of journal policies but not citation bias. Also, it seems to us that Fanelli underestimated the ability to predict study outcomes in at least some areas of psychology. To cite examples from personality psychology, the effect size of certain gender differences, the agreement between self and others on reliable measures of the Big Five factors of personality, and the longitudinal stability of such measures across a specified retest interval starting at a particular age can be predicted quite well. Psychology is not astrophysics, to be sure, but it offers much better predictions than astrology.

Therefore, we disagree with Fanelli's negative view of the preregistration of hypotheses, based as it appears to be on his assumption of low predictability. Instead, we consider preregistration to be one of the most promising means for confirmatory testing. When the researcher has a strong a priori hypothesis, the best way to affirm the p-value's uncertainty estimation is to register the analysis plan in advance. Without it, flexibility in analysis strategies and motivated reasoning can lead to inflation of false positives and reduction of replicability in the process (see also the section on multiple hypothesis testing in the target article and King's remarks on preregistration during longer review processes).

We fully agree with Fanelli's view on the merits of purely exploratory research, but if and only if the research process and the results are fully and transparently reported. Such transparency requires standards for reporting, and we consider Fanelli's suggestions for more specific reporting guidelines to be adopted by major journals a welcome addition to our own recommendations.

King's call for slowing down, by pressing authors for additional work invested in conducting additional studies or ruling out alternative explanations, is well taken in the current mad rush for quick-and-many publications. We would only add that instead of responding to a low-powered study by desk rejection as recommended by Lucas and Donnellan, a more constructive slowing-down response might be to ask for additional data to achieve sufficient power. An even better approach would be to take Cohen's call for sufficiently powered research seriously, just as many journals finally are beginning to take his call for reporting effect sizes seriously. Why do journals not adopt explicit rules that only studies with sufficient power to address their main research questions should be submitted?

For example, in line with conventional rules, we may define as acceptable thresholds power at .80 with alpha at .05. Given that recent meta-analyses converge in indicating that the average effect size in published psychological research is around d = 0.50, an approximate power calculation would result in n = 100 for a one-tail hypothesis for a simple between-participants design (two groups) or a correlation coefficient (one group). Of course, there are many exceptions; within-participants designs have much more power, several effects are greater than d = 0.50, and so on. Therefore, this guideline should be flexible and adjustable to the conditions of specific studies.
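For readers who want to check the arithmetic, here is a rough sketch of that calculation (our illustration using the standard normal-approximation formulas, not code from the target article; the effect size, alpha, and power values are simply those assumed in the preceding paragraph, and the correlation of r = .24 is our approximate equivalent of d = .50):

```python
# Back-of-the-envelope power calculation behind the proposed default of n > 100.
import numpy as np
from scipy.stats import norm

alpha, power, d = .05, .80, .50
z_alpha = norm.ppf(1 - alpha)   # one-tailed criterion, approx. 1.645
z_beta = norm.ppf(power)        # approx. 0.84

# Independent-groups design: normal approximation to the required n per group.
n_per_group = 2 * ((z_alpha + z_beta) / d) ** 2
print(f"Two-group design: ~{n_per_group:.0f} per group, ~{2 * n_per_group:.0f} in total")

# One-group correlation case: r of about .24 corresponds roughly to d = .50;
# the Fisher-z approximation lands near the same total sample size.
r = .24
n_correlation = ((z_alpha + z_beta) / np.arctanh(r)) ** 2 + 3
print(f"Correlation design: ~{n_correlation:.0f} participants")
```

Both approximations come out near N = 100, consistent with the default rule of n > 100 discussed in the next paragraph.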
The adoption of such a simple but flexible guideline would provide a clear incentive to authors to make a case, if needed, why in their specific study a different effect size should be expected given previous relevant studies and reasonable arguments. Thus, the authors should be able to justify why their specific sample size should give reliable results given the expected or investigated effect, without considering the results they obtained. If they did not do this, then the default rule of n > 100 would apply automatically, regardless of whether there were significant effects.

Adoption of such rules would reduce the number of false positives and slow down the rate of publication. Slow publication in this sense may eventually become an indicator of quality similar to slow food.

For reasons spelled out in detail in the target article, we strongly disagree with Journal of Personality and Social Psychology: Personality Processes and Individual Differences editor King's statement that replication studies should not be published in top journals. Interestingly, Journal of Personality and Social Psychology: Interpersonal Relations and Group Processes editor Simpson seems more favourable towards replication studies, at least if they present solid evidence that a seemingly established finding is not valid. We applaud Simpson's view and would only ask that it should particularly be applied to failures to replicate findings published earlier in the same journal. After a decade of nonreplications of single-gene and functional magnetic resonance imaging results published in top biomedical journals, we are confident that such a policy would increase rather than decrease the reputation of any psychology journal that followed it.

We also share Simpson's view that transparency, data archiving, and data sharing are particularly important for costly longitudinal and behavioural observation studies. Many funding agencies now require these for large projects, and journals could join the bandwagon by requiring them too, as long as confidentiality concerns or legal rights are not violated. In fact, the American Psychological Association publication guideline 8.14 requires data sharing on request of competent peers provided that the confidentiality of the participants can be protected and unless legal rights concerning proprietary data preclude their release, but it seems that this guideline is not taken seriously by authors and editors (Wicherts, Bakker & Molenaar, 2011). Retraction of an article because of violation of this guideline (as suggested by Lucas and Donnellan) should be a last resort, but a letter from the editor reminding an author of the commitment he or she has already signed may help to increase willingness to share data with peers.

We were particularly pleased by Spellman's announcement that Perspectives on Psychological Science (PPS) will soon take up our suggestion of launching calls to replicate important but controversial findings with a guarantee of publication, provided that there is agreement on method before the study is conducted. In this action, PPS converges with the European Journal of Personality, which encourages such activities as well as articles concerned with replication issues. Spreading similar proactive encouragement of replication elsewhere would benefit much of our efforts. It would undoubtedly increase researchers' awareness of the importance of replicable findings and dampen the increasing unhealthy tendency over the past decade to look for "sexy" findings that appeal to the mass media but later prove unreliable.

In his comment on this issue, Simons correctly pointed out that the sexiness of a publication should not be a criterion for its quality, and we do not consider sexiness as necessarily bad either. However, Simons' conclusion that ". . . sexy findings that withstand replication are the ones that we want in our journals" could be interpreted as "sexy replicable findings are better than non-sexy replicable findings", which would run against the independence of sexiness and scientific quality.

In a similar vein, we are sceptical about King's call for slowing down by concentrating on significant research questions. Although there are surely many nonsignificant questions around, what is viewed as significant may depend on what issues are currently mainstream and the flux and flow of fashions. Trying to steer science by significant questions may be as short-sighted as steering science by application questions. The history of science is full of examples where answers to questions that seemed awkward or trivial at the time later became critically important in a different and unforeseen context.

TEACHING

The enthusiastic comment by IJzerman et al. on the joys of teaching the importance of replication somewhat compensates for the fact that these joys were based on N = 3 students. Hunt's perception that we are recommending more teaching of methodology and statistics, probably the most unpopular subjects for most psychology students at most departments, is a misinterpretation. We do not recommend more methodology and statistics; we recommend certain shifts of focus within the teaching of methodology and statistics (e.g., from null hypothesis testing in single studies to replication of effect sizes in multiple studies).

INSTITUTIONAL INCENTIVES

After many of us used Google to learn about Hunt's usage of "motherhood and apple pie" (it is always enchanting to learn new phrases of local dialect), we were additionally curious to learn what concrete recommendations he might offer that would differ from our own. We found two but disagree with both. First, we disagree with "Creating archives before record-keeping standards are established puts the cart before the horse." Standardization for documentation (within limits) is certainly a worthwhile goal, but waiting for standards is a good way to guarantee that archives will never happen. As the Internet age has demonstrated (e.g., formatting standards on Wikipedia), standards for communication are more productively pursued as an emergent quality with existing data rather than developed in the abstract and then applied en masse.

Waiting until professional societies agree on standards would be counterproductive, both for increasing sharing and for developing the standards.

Second, we disagree with Hunt's suggestion that impact should be the sole criterion for launching replication studies. Relevance to scientific theory and opportunities to resolve controversy seem more important to us, and these are not always the same as impact. But we do agree with Bakker et al. that highly cited textbook findings need to be shown to be replicable; "textbook-proof" is not sufficient, and we are pleased to see initiatives such as Open Science Framework (http://openscienceframework.org/) and PsychFileDrawer (http://psychfiledrawer.org/) providing environments for uploading and discussing the results of such replication studies.

Rieth et al.'s call for clearer signals of authors' confidence is not without merits, but we are more than sceptical about the specific suggestion of a "nonreplication bounty". Assuming that the suggestion is serious and not satirical, such a measure would be misguided for two reasons. First, it would contribute to unhealthy tendencies to focus only on scientists' extrinsic motivation. As motivational psychology tells us, intrinsic motivations such as striving for discovery and truth can be corrupted by monetary reward and punishment. Second, if one wants to use money as an incentive, rewarding successful replications would seem much more productive (e.g., by reserving a percentage of grant money for replication) than punishing inability to replicate. The best way of "changing hearts and minds" (Lucas and Donnellan) seems to us to be to use incentives that enhance intrinsic scientific motivation ("getting it better") and concern with peer reputation, as spelled out in some detail in the target article.

Conclusion

Taken as a package, we hope that our and the commentators' recommendations will counteract beliefs of some colleagues that successful replication amounts to hitting the lottery twice. We are convinced that psychological science can do much better than that now, and better still in the near future.

REFERENCES

Allport, G. W. (1968). The historical background of modern social psychology. In G. Lindzey, & E. Aronson (Eds), The handbook of social psychology (Vol. 1, pp. 1–80). Reading, MA: Addison-Wesley.
Anisfeld, M. (1991). Neonatal imitation. Developmental Review, 11, 60–97. doi: 10.1016/0273-2297.
Appley, M. H. (1990). Time for reintegration? Science Agenda, 3, 12–13.
Augoustinos, M., Walker, I., & Donaghue, N. (2006). Social cognition: An integrated introduction (2nd ed). London, UK: Sage Publications Ltd.
Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psychological science. Perspectives on Psychological Science, 7, 543–554.
Bangerter, A., & Heath, C. (2004). The Mozart effect: Tracking the evolution of a scientific legend. British Journal of Social Psychology, 43, 605–623.
Bargh, J. A., Chen, M., & Burrows, L. (1996). Automaticity of social behavior: Direct effects of trait construct and stereotype activation on action. Journal of Personality and Social Psychology, 71, 230–244. doi: 10.1037/0022-3514.71.2.230.
Bem, D. J. (1987). Writing the empirical journal article. In M. Zanna, & J. Darley (Eds), The compleat academic: A practical guide for the beginning social scientist (pp. 171–201). New York: Random House.
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425. doi: 10.1037/a0021524.
Bensman, S. J. (2008). Distributional differences of the impact factor in the sciences versus the social sciences: An analysis of the probabilistic structure of the 2005 Journal Citation Reports. Journal of the American Society for Information Science and Technology, 59, 1366–1382.
Berk, L. E. (2013). Child development (9th ed). Boston: Pearson.
Bless, H., Fiedler, K., & Strack, F. (2004). Social cognition: How individuals construct reality. East Sussex, UK: Psychology Press.
Burton, P., Gurrin, L., & Campbell, M. (1998). Clinical significance not statistical significance: A simple Bayesian alternative to p values. Journal of Epidemiology and Community Health, 52(5), 318–323. doi: 10.1136/jech.52.5.318.
Cacioppo, J. T., & Berntson, G. G. (1992). The principles of multiple, nonadditive, and reciprocal determinism: Implications for social psychological research and levels of analysis. In D. Ruble, P. Costanzo, & M. Oliveri (Eds), The social psychology of mental health: Basic mechanisms and applications (pp. 328–349). New York: Guilford Press.
Campbell, D. (1997). In United States Patent and Trademark Office (Ed.), The Mozart effect. US Patent 75094728.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait–multimethod matrix. Psychological Bulletin, 56, 81–105.
Cesario, J., Plaks, J. E., & Higgins, E. T. (2006). Automatic social behavior as motivated preparation to interact. Journal of Personality and Social Psychology, 90, 893. doi: 10.1037/0022-3514.90.6.893.
Cohen, J. (1962). Statistical power of abnormal–social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304–1312. doi: 10.1037/0003-066X.45.12.1304.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003. doi: 10.1037/0003-066X.49.12.997.
Cromie, W. J. (1999). Mozart effect hits sour notes. Retrieved 12/10, 2012, from http://news.harvard.edu/gazette/1999/09.16/mozart.html
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Cumming, G., & Finch, S. (2005). Inference by eye: Confidence intervals and how to read pictures of data. American Psychologist, 60(2), 170–180. doi: 10.1037/0003-066X.60.2.170.
Doyen, S., Klein, O., Pichon, C.-L., & Cleeremans, A. (2012). Behavioral priming: It's all in the mind, but whose mind? PLoS ONE, 7, e29081.
Eid, M., Geiser, C., & Nussbeck, F. W. (2009). Multitrait–multimethod analysis in psychotherapy research: New methodological approaches. Psychotherapy Research, 19, 390–396.
Eid, M., Nussbeck, F., Geiser, C., Cole, D., Gollwitzer, M., & Lischetzke, T. (2008). Structural equation modeling of multitrait–multimethod data: Different models for different types of methods. Psychological Methods, 13, 230–253.
Fanelli, D. (2010). "Positive" results increase down the hierarchy of the sciences. PLoS One, 5(3). doi: 10.1371/journal.pone.0010068.

Fanelli, D. (2012a). Positive results receive more citations, but only in some disciplines. Scientometrics, 19. doi: 10.1007/s11192-012-0757-y.
Fanelli, D. (2012b). Project for a Scientific System Based on Transparency. Paper presented at the EQUATOR Network Scientific Symposium, Freiburg, Germany. http://www.equator-network.org/index.aspx?o=5605.
Fiedler, K., Kutzner, F., & Krueger, J. I. (2012). The long way from α-error control to validity proper: Problems with a short-sighted false-positive debate. Perspectives on Psychological Science, 7, 661–669.
Fraley, R. C., Roisman, G. I., LaForce, C., Owen, M. T., & Holland, A. S. (in press). Interpersonal and genetic origins of adult attachment styles: A longitudinal study from infancy to early adulthood. Journal of Personality and Social Psychology.
Frank, M. C., & Saxe, R. (2012). Teaching replication. Perspectives on Psychological Science, 7, 600–604.
Fuchs, H., Jenny, M., & Fiedler, S. (2012). Psychologists are open to change, yet wary of rules. Perspectives on Psychological Science, 7, 639–642.
Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (2012). Correcting the past: Failures to replicate Psi. Journal of Personality and Social Psychology, 103, 933–948. doi: 10.1037/a0029709.
Goodman, N. (1955). Fact, fiction, and forecast. Cambridge: Harvard University Press.
Hayes, L. A., & Watson, J. S. (1981). Neonatal imitation: Fact or artifact? Developmental Psychology, 17, 655–660. doi: 10.1037/0012-1649.17.5.655.
Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33, 61–135.
Henry, P. J. (2008). College sophomores in the laboratory redux: Influences of a narrow data base on social psychology's view of the nature of prejudice. Psychological Inquiry, 19, 49–71.
Hewstone, M., Stroebe, W., & Jonas, K. (2012). An introduction to social psychology. West Sussex, UK: Wiley-Blackwell.
Hull, J. G., Slone, L. B., Meteyer, K. B., & Matthews, A. R. (2002). The nonconsciousness of self-consciousness. Journal of Personality and Social Psychology, 83, 406. doi: 10.1037/0022-3514.83.2.406.
IJzerman, H., & Koole, S. L. (2011). From perceptual rags to metaphoric riches: Bodily, social, and cultural constraints on socio-cognitive metaphors. Psychological Bulletin, 137, 355–361.
Ioannidis, J. P. A. (2012). Why science is not necessarily self-correcting. Perspectives on Psychological Science, 7, 645–654.
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524–532. doi: 10.1177/0956797611430953.
Johnson, R. W. (1964). Retain the original data! American Psychologist, 19, 350–351.
Kashy, D. A., Donnellan, M. B., Ackerman, R. A., & Russell, D. W. (2009). Reporting and interpreting research in PSPB: Practices, principles, and pragmatics. Personality and Social Psychology Bulletin, 35, 1131–1142.
Kazdin, A. E. (2010). Single case research designs: Methods for clinical and applied settings (2nd ed). New York: Oxford University Press.
Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2, 196–217.
Koepke, J. E., Hamm, M., Legerstee, M., & Russell, M. (1983). Neonatal imitation: Two failures to replicate. Infant Behavior and Development, 6, 97–102. doi: 10.1016/S0163-6383(83)80012-5.
Koninklijke Nederlandse Academie voor de Wetenschappen (KNAW, 2012). Zorgvuldig en integer omgaan met wetenschappelijke onderzoeksgegevens [Handling scientific data with care and integrity]. Retrieved December 2012 from http://www.knaw.nl/Content/Internet_KNAW/publicaties/pdf/20121004.pdf.
Koole, S. L., & Lakens, D. (2012). Rewarding replications: A sure and simple way to improve psychological science. Perspectives on Psychological Science, 7(6), 608–614. doi: 10.1177/1745691612462586.
Kruglanski, A. W. (2001). That vision thing: The state of theory in social and personality psychology at the edge of the new millennium. Journal of Personality and Social Psychology, 80, 871–875.
Kruschke, J. K., Aguinis, H., & Joo, H. (2012). The time has come: Bayesian methods for data analysis in the organizational sciences. Organizational Research Methods, 15(4), 722–752. doi: 10.1177/1094428112457829.
LeBel, E. P., & Peters, K. R. (2011). Fearing the future of empirical psychology: Bem's (2011) evidence of psi as a case study of deficiencies in modal research practice. Review of General Psychology, 15(4), 371–379. doi: 10.1037/a0025172.
Ledgerwood, A., & Sherman, J. (2012). Short, sweet, and problematic? The rise of the short report in psychological science. Perspectives on Psychological Science, 7(1), 60–66. doi: 10.1177/1745691611427304.
Leman, P., Bremner, A., Parke, R. D., & Gauvain, M. (2012). Developmental psychology. London: McGraw Hill.
Levelt Committee, Noort Committee, & Drenth Committee. (2012). Flawed science: The fraudulent research practices of social psychologist Diederik Stapel.
Lykken, D. T. (1968). Statistical significance in psychological research. Psychological Bulletin, 70, 151–159. Reprinted in Morrison & Henkel (1970, pp. 267–279).
Lykken, D. T. (1991). What's wrong with psychology anyway? In D. Cicchetti, & W. M. Grove (Eds), Thinking clearly about psychology. Volume I: Matters of public interest (pp. 3–39). Minneapolis, MN: University of Minnesota Press.
Makel, M. C., Plucker, J. A., & Hegarty, B. (2012). Replications in psychology research: How often do they really occur? Perspectives on Psychological Science, 7, 537–542. doi: 10.1177/1745691612460688.
Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9(2), 147–163. doi: 10.1037/1082-989X.9.2.147.
McCall, R. B., & Carriger, M. S. (1993). A meta-analysis of infant habituation and recognition memory performance as predictors of later IQ. Child Development, 64, 57–79. doi: 10.1111/j.1467-8624.1993.tb02895.x.
Meltzoff, A. N., & Moore, M. K. (1977). Imitation of facial and manual gestures by human neonates. Science, 198(4312), 75–78. Retrieved from http://www.jstor.org/stable/1744187
Morrison, D. E., & Henkel, R. E. (Eds). (1970a). The significance test controversy – A reader. Chicago: Aldine.
Morrison, D. E., & Henkel, R. E. (1970b). Significance tests in behavioral research: Skeptical conclusions and beyond. In D. E. Morrison & R. E. Henkel (Eds), The significance test controversy – A reader (pp. 305–311). Chicago: Aldine.
Newman, J., Rosenbach, J. H., Burns, K. L., Latimer, B. C., Matocha, H. R., & Rosenthal Vogt, E. (1995). An experimental test of the Mozart effect: Does listening to his music improve spatial ability? Perceptual and Motor Skills, 81, 1379–1387. doi: 10.2466/pms.1995.81.3f.1379.
Nicholson, J. M., & Ioannidis, J. P. A. (2012). Research grants: Conform and be funded. Nature, 492, 34–36.
Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7(6), 615–631. doi: 10.1177/1745691612459058.
Open Science Collaboration. (2012). An open, large-scale, collaborative effort to estimate the reproducibility of psychological science. Perspectives on Psychological Science, 7, 652–655.
Pashler, H., Harris, C., & Coburn, N. (2011, September 15). Elderly-related words prime slow walking. Retrieved 06:15, December 12, 2012 from http://www.PsychFileDrawer.org/replication.php?attempt=MTU%3D.

Petty, R. E., & Cacioppo, J. T. (1981). Attitudes and persuasion: Classic and contemporary approaches. Dubuque, Iowa: Wm. C. Brown.
Petty, R. E., & Cacioppo, J. T. (1986). Communication and persuasion: Central and peripheral routes to attitude change. New York: Springer-Verlag.
Pietschnig, J., Voracek, M., & Formann, A. K. (2010). Mozart effect–Shmozart effect: A meta-analysis. Intelligence, 38, 314–323. doi: 10.1016/j.intell.2010.03.001.
Rai, T. S., & Fiske, A. P. (2010). Psychological research methods are ODD (observation and description deprived). Brain and Behavioral Science, 33, 106–107.
Rauscher, F. H., Shaw, G. L., & Ky, C. N. (1993). Music and spatial task performance. Nature, 365, 611. doi: 10.1038/365611a0.
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86, 638–641. doi: 10.1037/0033-2909.86.3.638.
Rossi, J. S. (1990). Statistical power of psychological research: What have we gained in 20 years? Journal of Consulting and Clinical Psychology, 58, 646–656.
Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416–428. Reprinted in Morrison & Henkel (1970, pp. 216–230).
Schachter, S., Christenfeld, N., Ravina, B., & Bilous, F. (1991). Speech disfluency and the structure of knowledge. Journal of Personality and Social Psychology, 60, 362–367.
Schimmack, U. (2012, August 27). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods. Advance online publication. doi: 10.1037/a0029487.
Sechrest, L., Davis, M., Stickle, T., & McKnight, P. (2000). Understanding method variance. In L. Bickman (Ed.), Research design: Donald Campbell's legacy (pp. 63–87). Thousand Oaks, CA: Sage.
Sidman, M. (1960). Tactics of scientific research. Evaluating experimental data in psychology. New York: Basic Books.
Shaffer, D. R., & Kipp, K. (2009). Developmental psychology: Childhood and adolescence (8th ed.). Belmont, CA: Wadsworth.
Siegler, R. S., DeLoache, J. S., & Eisenberg, N. (2011). How children develop (3rd ed.). New York: Worth Publishers.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366.
Simmons, J., Nelson, L., & Simonsohn, U. (2012). A 21 word solution. Available at SSRN: http://ssrn.com/abstract=2160588.
Simonton, D. K. (2004). Psychology's status as a scientific discipline: Its empirical placement within an implicit hierarchy of the sciences. Review of General Psychology, 8(1), 59–67. doi: 10.1037/1089-2680.8.1.59.
Simpson, J. A., & Rholes, W. S. (2012). Adult attachment orientations, stress, and romantic relationships. In P. G. Devine, A. Plant, J. Olson, & M. Zanna (Eds.), Advances in Experimental Social Psychology, 45, 279–328. doi: 10.1016/B978-0-12-394286-9.00006-8.
Sinclair, S., Lowery, B. S., Hardin, C. D., & Colangelo, A. (2005). Social tuning of automatic racial attitudes: The role of affiliative motivation. Journal of Personality and Social Psychology, 89, 583–592.
Smith, N. C., Jr. (1970). Replication studies: A neglected aspect of psychological research. American Psychologist, 25, 970–975.
Spellman, B. A. (2012). Scientific utopia. . . or too much information? Comment on Nosek and Bar-Anan. Psychological Inquiry, 23, 303–304.
Staats, A. W. (1989). Unificationism: Philosophy for the modern disunified science of psychology. Philosophical Psychology, 2, 143–164.
Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance: Or vice versa. Journal of the American Statistical Association, 54, 30–34. doi: 10.2307/2282137.
Tabarrok, A. (2012, Nov 2). A bet is a tax on bullshit. Marginal Revolution. Retrieved from http://marginalrevolution.com/marginalrevolution/2012/11/a-bet-is-a-tax-on-bullshit.html.
Tilburg Data Sharing Committee (2012). Manual for data-sharing. Retrieved December 2012 from http://www.academia.edu/2233260/Manual_for_Data_Sharing_Tilburg_University.
Tullock, G. (1959). Publication decisions and tests of significance: A comment. Journal of the American Statistical Association, 54, 593. Reprinted in Morrison & Henkel (1970, pp. 301–302).
Wagenmakers, E., Wetzels, R., Borsboom, D., & Van der Maas, H. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100(3), 426–432. doi: 10.1037/a0022790.
Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L. J., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. doi: 10.1177/1745691612463078.
Weisburd, D., & Piquero, A. R. (2008). How well do criminologists explain crime? Statistical modeling in published studies. Crime and Justice: A Review of Research (Vol. 37, pp. 453–502). Chicago: Univ Chicago Press.
Westmeyer, H. (Ed.) (1989). Psychological theories from a structuralist point of view. New York: Springer-Verlag.
Westmeyer, H. (Ed.) (1992). The structuralist program in psychology: Foundations and applications. Toronto: Hogrefe & Huber Publishers.
Wicherts, J. M., Borsboom, D., Kats, J., & Molenaar, D. (2006). The poor availability of psychological research data for reanalysis. American Psychologist, 61, 726–728.
Wicherts, J. M., Bakker, M., & Molenaar, D. (2011). Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results. PLoS One, 6, e26828.
Wollins, L. (1962). Responsibility for raw data. American Psychologist, 17, 657–658.
Yong, E. (2012). Nobel laureate challenges psychologists to clean up their act: Social-priming research needs daisy chain of replication. Nature, 485(7398), 298–300.