

Running head: THE NAÏVE INTUITIVE STATISTICIAN

The Naïve Intuitive Statistician:

A Naïve Sampling Model of Intuitive Confidence Intervals

Peter Juslin and Anders Winman

Uppsala University, Uppsala, Sweden

Patrik Hansson

Umeå University, Umeå, Sweden



Abstract

The perspective of the naïve intuitive statistician is outlined and applied to explain overconfidence

when people produce intuitive confidence intervals and why this format leads to more

overconfidence than other formally equivalent formats. The naïve sampling model implies that

people accurately describe the sample information they have but are naïve in the sense that they

uncritically take sample properties as estimates of population properties. A review demonstrates that

the naïve sampling model accounts for the robust and important findings in previous research, as well

as provides novel predictions that are confirmed, including a way to minimize the overconfidence

with interval production. The authors discuss the NSM as a representative of models inspired by the

naïve intuitive statistician.



The Naïve Intuitive Statistician:

A Naïve Sampling Model of Intuitive Confidence Intervals

The study of judgment is often portrayed as a sequence of research programs, each guided by

a different and, at least on the face of it, inconsistent metaphor (Fiedler & Juslin, 2006b). In the

sixties the mind was likened to an intuitive statistician producing judgments that by and large are

responsive to the variables implied by statistics and probability theory (Peterson & Beach, 1967).

Less than a decade later the conclusion was that the mind operates according to principles other than

those prescribed by probability theory and statistics (Kahneman, Slovic, & Tversky, 1982). Because

of limited ability to process information, people are condemned to the use of heuristics that, although

useful as rules of thumb, produce serious and persistent cognitive biases (Gilovich, Griffin, &

Kahneman, 2002).

In this article we will draw on yet another developing metaphor: that of the naïve intuitive

statistician (Fiedler & Juslin, 2006b). Re-evoking the intuitive statistician, this metaphor emphasizes that, although the cognitive processes accurately describe the available information, the intuitive statistician is naïve with respect to the origins and estimator properties of the samples given. Two key

ingredients of this naivety are: a) that people take for granted that samples are representative of the

populations of interest, as when the frequency of violent death as channeled by the media affects the

judged risk of violent death (Lichtenstein, Slovic, Fischhoff, Layman, & Combs, 1978; Winman &

Juslin, 1993). b) People tend to assume that sample properties can be directly used to estimate the

corresponding population properties, as when the variance in a sample is correctly assessed but not

corrected by n/(n-1) to become an unbiased estimate of population variance (Kareev, Arnon, &

Horwitz-Zeliger, 2002).

Specifically, we will apply the metaphor of a naïve intuitive statistician to explain



overconfidence in the production of intuitive confidence intervals for unknown quantities, and for

understanding why this method produces much more overconfidence than other formally equivalent

methods (Juslin & Persson, 2002; Juslin, Wennerholm, & Olsson, 1999; Klayman, Soll, Gonzalez-

Vallejo, & Barlas, 1999; Soll & Klayman, 2004; Winman, Hansson, & Juslin, 2004). If, for

example, people produce intuitive confidence intervals around their best guess for the interest rate next year that should include the true value with probability .9, the proportion of intervals that include the true value tends to be far below .9, often closer to .4 or .5 (Block & Harper, 1991; Lichtenstein,

Fischhoff, & Phillips, 1982; Russo & Schoemaker, 1992). Because confidence intervals are often

made by experts (Clemen, 2001; Russo & Schoemaker, 1992), and the difference in expected

return given a probability .9 or .4 can be enormous, the phenomenon is not only theoretically puzzling

but of profound applied relevance.

Intriguingly, however, when participants are provided with the same intervals, and assess the

probability that the value falls within the interval, the overconfidence bias tends to diminish or even

disappear (Winman et al., 2004). This phenomenon is referred to as format dependence (Juslin et

al., 1999). The key idea explored in this article is the following: Even if the samples from the

environment are unbiased and the cognitive processes portray these samples accurately, if the sample

properties are uncritically taken as estimators of population properties, the implication is indeed

overconfidence with intuitive confidence intervals, yet relatively accurate judgment with probability

judgment. As demonstrated below, when people are confined to reliance on small samples these mechanisms have non-trivial effects.

In this article we first briefly discuss the framework provided by the naïve intuitive statistician,

with the intent of arguing that it complements the previous approaches to intuitive judgment in useful

ways. Thereafter we illustrate this framework by providing an in-depth application to the puzzling

phenomena of extreme overconfidence with interval production and the format-dependence effect.

After reviewing the relevant literature, the naïve sampling model (NSM) is introduced and we

illustrate how it explains the results, including how it can be used to reduce the overconfidence with

intuitive confidence intervals. Finally, we discuss the NSM as one representative of models inspired

by the naïve intuitive statistician.

The Naïve Intuitive Statistician

Many of the enlightenment philosophers involved in the early development of probability

theory appeared to take for granted that judgments of probability are informed by the frequencies

encountered in one's experience (Gigerenzer et al., 1989; Hacking, 1975). Early research comparing

judgment to normative principles from probability theory and statistics indeed suggested that the mind

operates in the manner of an intuitive statistician (Peterson & Beach, 1967). People were adept at

learning probabilities (proportions) and other distributional properties from trial-by-trial experience.

Although this early research already documented some discrepancies from probability theory, most famously that people are conservative updaters as compared to Bayes' theorem (Edwards, 1982), in

most studies they appeared responsive to the factors implied by normative models (e.g., sample

size). Based on people's remarkable ability to learn frequencies in controlled laboratory settings it

was also proposed that frequency is encoded and stored automatically (Hasher & Zacks, 1979;

Zacks & Hasher, 2002). The perspective of the intuitive statistician thus emphasizes our cognitive

ability to learn frequencies in well-defined and well-structured learning tasks and suggests that the

cognitive algorithms of the mind are in basic agreement with normative principles.

The heuristics and biases perspective emphasizes that, because of limited time, knowledge,

and computational ability, we rely on judgmental heuristics that provide useful guidance but also

produce characteristic biases (Gilovich et al., 2002; Kahneman et al., 1982). It is thus proposed that

the target attribute of probability is often substituted by heuristic variables, as when the probability

that an instance belongs to a category is assessed by its similarity to the category prototype, or the

probability of an event is assessed by the ease with which examples are brought to mind (Kahneman

& Frederick, 2002). For example, because violent deaths are more frequently reported in the media it is

easier to retrieve examples of violent deaths than more mundane causes of death and people

therefore tend to overestimate the risk of dying a violent death (Lichtenstein et al., 1978; Winman &

Juslin, 1993).

As with the intuitive statistician, the heuristics and biases perspective highlights performance,

but here the limitations of intuitive judgment are brought to the forefront. The heuristics and biases

program has been extremely influential, inspired much new research, and organized large amounts of

empirical data (Gilovich et al., 2002). Yet, there remains an aching tension between this view and the

extensive body of data supporting the view of the intuitive statistician (Peterson & Beach, 1967;

Sedlmeier & Betsch, 2002).

The three assumptions that define the naïve intuitive statistician (Fiedler & Juslin, 2006b)

integrate several aspects of these previous two research programs:

a) People have a remarkable ability to store frequencies in the form of natural and relative

frequencies and their judgments are accurate expressions of these frequencies. This assumption is

supported by an extensive body of data (Estes, 1976; Gigerenzer & Murray, 1987; Peterson &

Beach, 1967; Zacks & Hasher, 2002). The crucial working assumption is that the processes

operating on the information in general provide an accurate description of the samples and, as such, are not based on heuristic processing of the sample.¹

b) People are naïve with respect to the effects of external sampling biases in the information

from the environment and more sophisticated sampling constraints (Einhorn & Hogarth, 1978;

Fiedler, 2000). In essence, people tend spontaneously to assume that the samples they encounter

are representative of the relevant populations. People's confidence in their answers to general

knowledge items may in part derive from accurate assessment of sampling probabilities in the

environment, but they fail to correct for the selection strategies used when general knowledge items

are created (Gigerenzer, Hoffrage, & Kleinbölting, 1991; Juslin, 1994; Juslin, Winman, & Olsson,

2000). Confidence may accurately reflect experience but fail to correct for the effects of actions

taken on the basis of the confidence judgment that constrains the feedback received (Einhorn &

Hogarth, 1978; Elwin et al., in press). Likewise, many biases in social psychology may derive not

primarily from biased processing per se, but from inabilities to correct for the effects of sampling

strategies (Fiedler, 2000).

In fact, most of the traditional availability biases can be explained by accurate description of

biased samples, rather than by the reliance on heuristics that are inherently incompatible with

normative principles. Even if the available evidence is correctly assessed in terms of probability

(proportion), if the external media coverage (Lichtenstein et al., 1978) or the processes of search or

encoding in memory (Kahneman et al., 1982) yield biased samples the judgment becomes biased. In

these situations it is debatable whether the substitution of probability with a heuristic variable is where

the explanatory action lies.²

c) People are also naïve with respect to the more sophisticated properties of statistical

estimators, such as whether an estimator as such is biased or unbiased (Kareev et al., 2002; Winman

et al., 2004). People tend spontaneously to assume that the properties of samples can be used to

describe the populations. People, for example, accurately assess the variance in a sample but fail to

understand that sample variance needs to be corrected by n/(n-1) to be an unbiased estimate of

population variance (Kareev et al., 2002). They may also underestimate the probability of rare

events in part because small samples seldom include the rare events (Hertwig, Barron, Weber, & Erev, 2004, p. 537). Arguably, this naivety is an equally compelling account of the belief in the law of small numbers, the tendency to expect small samples to be representative of their populations, as the traditional explanation in terms of the representativeness heuristic (Tversky &

Kahneman, 1971). Although the biases discussed under b (biased input) and c (biased estimator) are

conceptually distinct the process is the same; the direct use of sample properties to estimate

population properties.

The naïve intuitive statistician highlights both the cognitive mechanisms and the organism-

environment relations that support the judgments. Biases are traced not primarily or exclusively to the

cognitive mechanisms that describe the samples experienced, but to biases in the input and the

naivety with which samples are used to describe the populations (Fiedler & Juslin, 2006a). On this

view, the cognitive algorithms are not inherently irrational, but knowledge of the more sophisticated properties of samples is properly regarded as a hard-earned and fairly recent cultural achievement

(Gigerenzer et al., 1989; Hacking, 1975).

The relationship between the research programs is summarized in Figure 1. The problem

involves three components, the environment, the sample of information available to the judge, and

the judgment that derives from this sample (Fiedler, 2000). These components, in turn, highlight two

interrelationships, the degree to which the sample is a veridical description of the environment and the

degree to which the judgment is a veridical description of the sample. When biased judgment occurs

in real environments or with task contents that involve general knowledge acquired outside of the

laboratory, both of these interrelations are unknowns in the equation. The heuristics and biases

program accounts for biases in terms of heuristic description of the samples (Figure 1B), but only

rarely is this attribution of the bias validated empirically, although there are exceptions (e.g., Schwarz

& Wänke, 2002).

Benefiting from the research inspired by the intuitive statistician (Figure 1A) in the sixties that,

in effect, ascertained one of the unknowns by demonstrating that judgments are fairly accurate

descriptions of experimentally controlled samples, the naïve intuitive statistician emphasizes deviations

between sample and population properties as the cause of biased judgments (Figure 1C). This view

seems consistent both with the results supporting the intuitive statistician and the many

demonstrations of judgment biases. In the following we illustrate how the naïve intuitive statistician

complements the other perspectives by addressing one of the more intriguing unresolved issues in

research on confidence.

Overconfidence with Interval Production

With the interval production or fractile format the participant produces an .xx confidence

interval around his or her best guess about a continuous quantity. As an example, the participants

may be asked to produce an interval within which they are .5 confident to find the population of

Thailand (Figure 2A). The fractiles in the subjective probability distribution define the upper and

lower boundaries for intervals; for example, the .25 and .75 fractiles in the distribution define a .5

confidence interval within which the person is .5 confident to find the population of Thailand. To be

realistic, xx% of the .xx intervals should include the correct values, but in general the intervals are

much too tight, a phenomenon commonly interpreted as overconfidence (Block & Harper, 1991;

Juslin & Persson, 2002; Juslin et al., 1999; Juslin, Winman, & Olsson, 2003; Klayman et al., 1999;

Lichtenstein et al., 1982; Russo & Schoemaker, 1992; Soll & Klayman, 2004; Winman et al.,

2004). The overconfidence bias is robust, but its exact magnitude differs depending on the target

quantity that is assessed. This is true, despite identical methods and procedures (Klayman et al.,

1999). Overconfidence also tends to be lower for more familiar quantities (Block & Harper, 1991;

Pitz, 1974).

Consider the two other ways illustrated in Figure 2A to elicit the same probability distribution.

With the half-range format the participant decides whether a statement is true or false and the

subjective probability that this choice is correct is assessed on a half-range scale from .5 (Guessing)

to 1.0 (Certain). With the full-range format a proposition is presented and the probability that the

statement is true is assessed on a full-range scale from 0 (Certainly false) to 1.0 (Certainly true).

Overconfidence occurs if the participants are too confident in their ability to identify the true state of

affairs. To be perfectly consistent, if you respond "Yes" with .9 confidence in the half-range task in

Figure 2A, in the full-range task you should assess a probability of .9 that the population of Thailand

exceeds 25 million. If you are asked to produce a .8 confidence interval within which the population

of Thailand falls you should provide a lower boundary of 25 million, because you are .9 confident

that the population of Thailand exceeds 25 million (i.e., the 90th and the 10th percentiles are the

upper and lower limits of a .8 interval). The items in Figure 2A are merely different ways to elicit the

same subjective probability distribution and they should produce the same conclusions.

Studies where these formats are applied to the same events emphasize format dependence,

namely that the realism of confidence in a knowledge domain varies profoundly depending on the assessment

format (Juslin & Persson, 2002; Juslin et al., 1999; Juslin et al., 2003; Klayman et al., 1999) (see

Figure 2B). The interpretation of the overconfidence observed with probability judgment has been

the subject of controversy. The traditional interpretation has been that the overconfidence derives

from genuine cognitive processing biases (Kahneman & Tversky, 1996). Other studies suggest that

when you control for biased task selection (Gigerenzer et al., 1991; Juslin, 1994) and other artifacts

often little or no overconfidence is observed with the half-range format (Juslin, Winman, & Olsson,

2000; Klayman et al., 1999). Regardless of the interpretation of overconfidence with probability

judgment it is clear that the overconfidence with interval production is larger, and often of an

enormous magnitude. This holds even if we control for the way items are sampled and regression

effects from random error in the judgment process (Juslin et al., 1999; Soll & Klayman, 2004).

Also in contrast to probability judgment, where experts are better calibrated than novices

(Keren, 1987) and sometimes do assess impressively realistic probability judgments (Lichtenstein et

al., 1982), there are few signs that experts produce more realistic confidence intervals than novices.

Russo and Schoemaker (1992), for example, investigated more than 2000 experts from a large

number of professions using questions tailored to the experts' fields of expertise (e.g., advertising,

management, petroleum industry) and 99% of them were overconfident. As for the known exceptions, where experts are not severely overconfident with interval production (Murphy &

Winkler, 1977; Tomassini, Solomon, Romney, & Krogstad, 1982), the extent to which these effects

are mediated by technical support and extensive experience of expressing similar distributions is

unclear. These factors may allow the judge to rely on pre-computed knowledge rather than

assessments computed on-line. We return to this issue in a later section, where we also review the

data from laboratory experiments demonstrating that overconfidence with interval production

appears relatively immune to substantive expertise as such (Hansson, Juslin, & Winman, 2006).

In sum, there are several robust findings that should be addressed by a model of production of

intuitive confidence intervals: 1) there is a strong overconfidence bias with interval production that is

not accounted for by biased sampling of items or random error in judgment. 2) The magnitude of

bias differs depending on the target variable in ways that cannot be accounted for by methodological

or procedural differences. 3) When the same probability distribution is assessed by other formally

equivalent elicitation methods involving predefined events overconfidence is diminished or

disappears. 4) The overconfidence with interval production is affected little or not at all by expertise.

We conclude that this pattern of results defines one of the more intriguing unresolved issues in

research on confidence. In the next section we illustrate how the metaphor of the naïve intuitive statistician

can enlighten our understanding of the phenomena and offer us ways to reduce overconfidence.

A Naïve Sampling Model (NSM)

In this section we first introduce the assumptions behind the NSM. Thereafter, we explain why

the NSM explains the pattern of results documented in the previous section and we evaluate the

predictions by the NSM against empirical data in the literature. In the final part we review more

recent research that has been directly motivated by the NSM.

Assumptions of the Naïve Sampling Model

Three main assumptions define the NSM:

1. Each judgment elicits retrieval of a small sample of similar observations from

long term memory that become active in short term memory.

The sample of similar observations creates an impression of the likely value for the unknown quantity

and of the variability expected among such quantities. In the General Discussion we return to a

discussion of how literally we need to interpret this sampling metaphor.

2. The sample size is constrained by short term memory capacity.

Assumption 2 implies that the sample size is restricted by architectural limitations of the mind. Similar

capacity limitations (Baddeley, 1998) have proven important to understand many other processes of

controlled thought (e.g., Johnson-Laird, 1983; Just & Carpenter, 1992; Newell & Simon, 1972)

and if overconfidence is a function of sample size, unaided experience (more observations), as such,

may not be sufficient to cure it.

3. People directly use sample properties to estimate population properties.



Assessment of probability in essence involves estimation of a proportion. Sample proportion is an

unbiased estimator of population proportion. For example, if you sample from an urn with a

proportion p of red balls, the long-run average sample proportion P of red balls equals p. If people's estimates of probabilities are based literally on, or are computationally equivalent to, sample proportions they should show little or no overconfidence (Juslin et al., 1997; Soll, 1996).³ In more

concrete terms, to assess the probability that Thailand has more than 25 million inhabitants you may,

for example, retrieve a sample of Asian countries with a known number of inhabitants and assess the

proportion of those that satisfy the event.

By contrast, interval production involves estimation of a dispersion of plausible or possible

values. The idea is that people take the central coverage of the sample distribution as a direct proxy

for the central coverage of the population distribution. If, for example, 90% of the values in the

sample fall in between two limits this is taken as evidence that 90% of the population values fall

within these limits. As detailed below, not only is sample dispersion a biased estimate of the

population dispersion, but a sample is in general dislocated relative to the population distribution,

further decreasing the proportion of population values covered by the central portion of the sample

distribution. For example, a 90% confidence interval for the population of Thailand may be formed

by retrieving a sample of Asian countries with a known number of inhabitants and reporting limits that

include 90% of the values in this sample.

To grasp the difference between estimating the proportion and the coverage of a distribution it

is instructive to consider that when estimating a proportion every observation is equally informative,

but when estimating the coverage of a distribution the observation of rare but extreme values of the

distribution becomes crucial. For example, in the extreme case of repeated sampling of only two

observations from a distribution on a continuous dimension, all samples will under-estimate the

100% coverage of the distribution, except one (i.e., the one actually containing the population

maximum and minimum values). The claim is that people use sample proportion and sample coverage

as if they are both unbiased estimators.
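The contrast is easy to verify with a minimal simulation. The Python sketch below (our illustration; the population, the event, and the sample size are arbitrary choices, not part of the model) draws repeated two-observation samples from a finite population: the long-run average sample proportion matches the population proportion, while the average share of the population actually covered by the sample range falls far short of the 100% it is naively taken to represent.

```python
import random

random.seed(1)
population = list(range(1, 1001))   # a finite population of magnitudes
n, trials = 2, 10_000

prop_sum, cover_sum = 0.0, 0.0
for _ in range(trials):
    sample = random.sample(population, n)
    # Proportion: fraction of sampled values satisfying an event (unbiased);
    # the event here is "value <= 250" (population proportion .25).
    prop_sum += sum(v <= 250 for v in sample) / n
    # Coverage: share of the population inside the sample range, naively
    # reported as a "100% interval" (a downward-biased estimator).
    lo, hi = min(sample), max(sample)
    cover_sum += sum(lo <= v <= hi for v in population) / len(population)

print(f"mean sample proportion:        {prop_sum / trials:.3f}  (true: .250)")
print(f"mean coverage of sample range: {cover_sum / trials:.3f}  (claimed: 1.000)")
```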

Computational Steps of the Naïve Sampling Model

A simple implementation of these three assumptions is illustrated in Figure 3 (a Monte Carlo

simulation is presented in greater detail in Appendix A).

1. Retrieval of cues. One or several cues (or facts) relevant to estimate the target quantity are

retrieved from long term memory. Keeping with the example of estimating the population of Thailand,

you may retrieve that Thailand is situated in Asia. The cue(s), in turn, define a corresponding

objective environmental distribution (OED) of similar observations in the persons natural

environment (Brunswik, 1955). In regard to the cue located in Asia we have an OED defined by

the distribution of population figures of Asian countries.

2. Sampling the target values of similar objects. In long term memory a subset of the target

values in the OED is stored and these observations define the subjective environmental

distribution (SED). The target values of a sample of n observations from the SED are retrieved to

produce a sample distribution (SSD). In the spirit of the naïve intuitive statistician, we assume that

the SSD is a random sample from the SED. The SED may sometimes be a random sample of the

OED, as, for example, if the populations of world countries that you know about are a random

sample of all world country populations. In many circumstances, however, it may deviate

systematically from the OED because of biases in the external input, for example from media

coverage, or the strategies used to collect the information (Fiedler, 2000). In our example the person

may retrieve the populations of n Asian countries, which provides a sample of population figures for

countries similar to Thailand (i.e., in this example the similarity refers to the common property of

being an Asian country).

3. Naïve estimation. The properties of the SSD are directly taken as estimates of the

corresponding properties of the population distributions (i.e., the OEDs):

a. Probability judgment: Given a target event, the subjective probability judgment for the

event is the proportion of observations in the SSD that satisfy the event. If, for example,

the estimate concerns the population of Thailand and the event is having a population

larger than 25 million, the person may retrieve a sample of n known population figures of

Asian countries. If m out of these n observations have a population larger than 25 million

the probability judgment is m/n.

b. Interval production. The coverage of the SSD is used directly to estimate the coverage

of the OED. In our example, to produce a .5 probability interval for the population of

Thailand a person may retrieve a sample of Asian countries and report the 25th and 75th

fractiles within this sample as the estimated interval.

With both assessment formats a sample of observations is retrieved and directly expressed as

required by the format, either as a probability (proportion) for interval evaluation or as fractiles of a

distribution for interval production. There is no bias in the description of the sample, only naivety in

that the sample property is treated as an unbiased estimator of the population property. Although the

assumptions behind the model in Figure 3 appear innocent, in the following we demonstrate that they

predict several non-trivial phenomena.
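Before turning to the data, a minimal Python sketch of Steps 1 to 3 for the running example may be helpful. The country figures, the sample size, and the helper names are illustrative assumptions of ours; only the logic (retrieve a small sample of similar observations, then read a proportion or two fractiles directly off it) follows Figure 3.

```python
import random
import numpy as np

# Hypothetical SED: remembered populations (millions) of some Asian countries.
SED_ASIA = [128, 48, 23, 1280, 1050, 62, 80, 22, 5, 47, 16, 76]
N = 4  # small sample size, constrained by short term memory (Assumption 2)

def retrieve_ssd(sed, n=N):
    """Step 2: retrieve a small random sample (the SSD) from the SED."""
    return random.sample(sed, min(n, len(sed)))

def probability_judgment(ssd, event):
    """Step 3a: the judged probability is the sample proportion m/n."""
    return sum(event(v) for v in ssd) / len(ssd)

def interval_production(ssd, p):
    """Step 3b: report the central-p sample fractiles as the interval."""
    return tuple(np.percentile(ssd, [50 * (1 - p), 50 * (1 + p)]))

random.seed(2)
ssd = retrieve_ssd(SED_ASIA)
print(probability_judgment(ssd, lambda v: v > 25))  # P(population > 25 million)
print(interval_production(ssd, 0.5))                # a naive .5 interval
```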

Application to Previous Data

Why is there Overconfidence with Interval Production?

For infinite sample size the computations in Figure 3 for interval evaluation and interval

production produce identical results: at small sample sizes they do not! Naive use of the SSD to

assess the probability of an event (Step 3a) provides a relatively unbiased estimator of probability

(but see the qualification in Footnote 3); naive use of the SSD to produce intervals (Step 3b)

generates much too narrow intervals for small samples.

To see this, consider the schematic illustration of an OED in Figure 4A. In our example this

would correspond to the OED for populations of Asian countries, although this distribution is not

normal like the one in Figure 4A. The xxth fractile of the OED is the target value such that xx% of

the OED is equal to or lower than this value. The limits of the interval in Figure 4A are the 25th and

the 75th fractiles of the OED, thus defining an interval around the median within which 50% of the

OED falls. Again, in terms of our example, 50% of the Asian countries have a population between

3.5 and 48.8 million (i.e., based on the country populations listed in the United Nations database in

the year 2002).

For the moment, let us make the following simplifying assumptions about the knowledge state

of the person (these assumptions are later relaxed): the SED, the target values from the OED that are

known to the person, comprise a perfectly random and representative sub-sample of the OED. The

person forms his or her uncertain belief about the target quantity by retrieving a SSD of n similar

observations from the SED. Let us consider two ways of eliciting this uncertain belief. With full-

range interval evaluation the person is given interval limits and is asked to assess the probability

that the target quantity falls within the interval. With interval production the person is given a

probability and is asked to produce the limits of a central confidence interval around his or her best

guess for the quantity.

With interval evaluation there is one factor, sampling error, which contributes to the

probability that the quantity falls outside of the interval (to the error rate) and this contributor is

explicit in the sample. To the extent that the pre-defined interval does not include the entire OED,

sampling error implies that some observations in the SSD will fall within, and some outside, the interval when

observations are sampled from the SED. Knowing that the target quantity belongs to an OED

therefore suggests that the target quantity falls in the interval with a certain probability. For example,

knowing that a country is Asian suggests that its population is between 3.5 and 48.8 million with

probability .5. For random sampling with replacement sample proportion P is an unbiased estimate

of population proportion p (the expected value of P is p). As such, relying on the sample proportion

yields accurate judgment.

In regard to interval production with the NSM, there are three contributors to the error rate,

only one of which is explicitly manifested in the SSD: a) sampling error: The observations in the

SSD define a dispersion and, as with interval evaluation, this variability is explicit in the SSD and for

all but the 1.0 interval there will be manifest observations within the SSD that fall outside of the

produced interval. With correctly estimated fractiles, the quantity falls outside the interval with

probability 1-.xx for an .xx interval. In the example in Figure 4A, the proportion of values falling

inside the interval is .5 and the error rate is .5.

b) Error from underestimation of the population dispersion: Sample dispersion is a biased

estimator, under-estimating the population dispersion. If you sample from a population with

dispersion d the average sample dispersion D is lower than d and this is why the variance in a sample

needs to be multiplied by n/(n-1) to become an unbiased estimator of population variance. The

reason why the sample dispersion underestimates the population dispersion is that to appreciate the

variability in the population you need to observe also the extreme but unlikely values and the

probability that these are represented in a small sample is low. For this reason alone, assuming a

normal distribution, random sampling, and sample size 4, the .5 sample central interval from a SSD

will include at most 39% of the OED (i.e., the error rate is .61 rather than .5; see Figure 4B).

Several experiments in Kareev et al. (2002) verify that people fail to correct their estimates of the

population variance for this effect.

c) Error from misjudged location of the population distribution. The above under-

estimation of the dispersion occurs even if the central tendency of the sample coincides with the

central tendency of the population. In addition, at small sample size, the sample is likely to be

randomly dislocated relative to the population distribution by sampling error (i.e., as measured by the

standard error for the sample mean). For example, at sample size 4 the .5 sample coverage interval

includes 39% of the population values only if the sample is not dislocated by sampling error relative

to the population distribution. The Monte Carlo simulations presented below suggest that if we take

this sampling error into account, the sample coverage interval only includes 34% of the population

values (error rate 66%; Figure 4C). The NSM implies that because only the first variability is explicit

in the sample, only the first contributor to the error-rate is taken into account when the intervals are

produced. At small sample sizes this produces extreme overconfidence and format dependence.
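Numbers of this kind are easy to check with a short Monte Carlo under the stated assumptions (standard normal OED, random sampling, n = 4). The exact value depends on how the sample fractiles are interpolated; the sketch below uses numpy's default rule and should land in the vicinity of the figures cited above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, trials = 4, 50_000
coverage = 0.0
for _ in range(trials):
    ssd = rng.standard_normal(n)             # a small sample from the OED
    lo, hi = np.percentile(ssd, [25, 75])    # naive central .5 interval
    # True share of the standard normal OED inside the produced interval;
    # this reflects both the shrunken dispersion and the dislocated placement.
    coverage += norm.cdf(hi) - norm.cdf(lo)

print(f"mean OED coverage of naive .5 intervals: {coverage / trials:.2f}")
# Well below the intended .5, near the .34 (error rate 66%) reported in the
# text; the text's own interpolation procedure differs slightly from numpy's.
```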

A parametric statistical model. The NSM assumes the estimation of limits of a population

distribution, the OED, on the basis of n randomly sampled observations from this distribution. In

terms of statistical theory, these intervals correspond to coverage intervals, covering a specified

proportion of a population distribution (e.g., a 90% coverage interval covers 90% of the population

distribution). The probability that another observation from the OED falls within certain limits is

based on this coverage interval for the OED.

In a standard statistical model, assuming normally and independently distributed observations,

the width W of a coverage interval is (Poulsen, Holst, & Christensen, 1997),

W = 2 \sqrt{ \frac{n}{n-1} \, sd^{2} \left( 1 + \frac{1}{n} \right) } \; t_{n,p} ,   (1)

where sd is the standard deviation in the sample and t_{n,p} is the t-statistic at sample size n (df = n-1) covering the central proportion p of the t-distribution (e.g., at n = 4 the .95 interval is defined by t_{4,.95} = 3.18). At infinite sample size n, Eq. 1 converges on 2 sd z_p, where z_p is the z-score delimiting the central proportion p of the normal distribution (e.g., for p = .95, z_{.95} = 1.96; for large n, 95% of the normal distribution falls within ±1.96 sd). In other words, the size of the interval that covers the central .xx proportion of the population distribution (e.g., the OED) is estimated from the standard deviation in the sample, which is corrected for being a biased estimator (the n/(n-1) factor in the first right-hand component of Eq. 1) and for the random sampling error in the interval placement (the second right-hand component of Eq. 1).

Framed in this way, the NSM claims that the subjective coverage intervals w produced by human participants are better approximated by

w = 2 \, sd \, z_p ,   (2)

that is, as if the sample properties directly describe the population properties, regardless of sample size and without the appropriate small-sample corrections. Figure 5A, which illustrates the predicted relative interval sizes (w/W), shows that the NSM predicts consistently too tight intervals, and profoundly so for small sample sizes. Figure 5B illustrates how the predicted hit rates depend on the

sample size, given a SED that is representative of the OED. (In this article, hit-rate will refer to the

proportion of values that fall within the intervals.) The hit-rates are too low, and more so with smaller

sample size. Figure 5C, finally, shows how a bias in the SED affects the hit-rates (the mean of the

SED is assumed to be 0, .5 or 1 standard deviations higher than the OED, where both distributions

have unit standard deviation). In sum: the predicted overconfidence increases with smaller samples

and more SED-bias.
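The relative interval size w/W in Figure 5A follows directly from Eqs. 1 and 2 and can be reproduced in a few lines (scipy supplies the t and normal quantiles; the sample sizes and the .95 level below are arbitrary):

```python
from math import sqrt
from scipy.stats import norm, t

def width_ratio(n, p):
    """w/W: naive width (Eq. 2) over the corrected width (Eq. 1)."""
    z_p = norm.ppf((1 + p) / 2)            # z delimiting central proportion p
    t_np = t.ppf((1 + p) / 2, df=n - 1)    # t-statistic with df = n - 1
    return z_p / (sqrt(n / (n - 1) * (1 + 1 / n)) * t_np)

for n in (2, 4, 8, 16, 100):
    print(n, round(width_ratio(n, 0.95), 2))
# Ratios far below 1 at small n: the naive intervals are much too tight,
# converging on 1 only as the sample grows large.
```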

Although the parametric model in Figure 5 has the advantage of portraying the key-mechanism

of the NSM in terms of a statistical model that is familiar to many readers, it has drawbacks as a

psychological model. First, this parametric model assumes an infinite population. In many inference

problems (and in the ones of concern here) the population is finite and observations are drawn

without replacement. Second, the distribution is normal with infinite upper and lower limits for the

quantity, in contrast to applications where the distributions may vary and there are bounds on the

values that the quantity can take. Third, the assumption that people compute standard deviations,

specifically, may not appear plausible. In the following we therefore explore a model that is non-

parametric in the sense that it does not make assumptions about distributions,⁴ and which is applied

to a real-life distribution relevant to a judgment task in regard to which we have collected extensive

empirical data.

A non-parametric simulation model. A Monte Carlo simulation was applied to estimation of

world country population (Juslin et al., 1999; Juslin et al., 2003; Winman et al., 2004). The

populations of the 188 countries listed by the United Nations in the year 2002 defined the database

and continent served as the cue. Simulations were executed as detailed by Steps 1, 2 and 3 in Figure

3 with the following amendments (see Appendix A for details). Sample size was treated as a random

variable, n + e, where e is a normally distributed error with zero expectation and a standard

deviation of 1 that is rounded to an integer.

It was assumed that, rather than computing standard deviations and relying on the assumption

of a normal distribution, people report fractiles within the SSD. These fractiles are either

observations within the SSD, whenever the fractiles implied by the interval are explicitly represented

in the sample, or values generated by a standard procedure for interpolating fractiles from small

samples, when the requested fractiles fall in between values explicitly represented in the sample (e.g.,

for the 90th fractile at n=4). The procedure relies on plotting the empirical cumulative distribution in

the sample, which in a small sample increases by discrete steps, one for each observation, and on interpolating in between the steps.
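A minimal rendering of this non-parametric step is given below. The noisy sample size n + e and the fractile reporting follow the text; np.percentile stands in for the interpolation procedure (its rule may differ in detail), and the small skewed database is an invented stand-in for the 188-country UN data.

```python
import numpy as np

rng = np.random.default_rng(3)
# Invented, skewed OED of "population figures" in millions.
OED = np.array([0.3, 0.7, 1, 2, 3.5, 5, 6, 8, 13, 16, 21, 23,
                34, 48, 62, 76, 80, 128, 1050, 1280])
p, trials, hits = 0.75, 20_000, 0

for _ in range(trials):
    n = int(max(2, round(4 + rng.standard_normal())))  # sample size n + e
    idx = rng.choice(len(OED), size=n + 1, replace=False)
    target, ssd = OED[idx[0]], OED[idx[1:]]             # target is not in the SSD
    lo, hi = np.percentile(ssd, [50 * (1 - p), 50 * (1 + p)])
    hits += lo <= target <= hi

print(f"hit rate of naive .75 intervals: {hits / trials:.2f}")
# Typically well below .75, reproducing the overconfidence in Figure 6A.
```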

The predictions in Figure 6A, based on the assumption of no SED bias, reproduce the

overconfidence bias reported for interval production. The main difference from the predictions by the

parametric model in Figure 5B is that the hit-rate for the 1.0 interval is lower. This decrease is

explained by the non-parametric version capturing the end effects that arise near the limits of the

distribution, which produce small and asymmetric intervals. For illustration, Figure 6A also reports

typical empirical data produced by experimental participants (from Winman et al., 2004). For

example, with sample size 3 only half of the target values are predicted to be included by the 100%

intervals. When the SEDs are biased relative to the OEDs, as will often be the case, sample sizes larger than 3 also produce overconfidence of the magnitude observed in data. In view of estimates

of short term memory capacity we expect the sample size to fall between approximately 3 and 8 observations

(Broadbent, 1975; Cowan, 2001; Miller, 1956).⁵ The connection implied between overconfidence

and short term memory capacity is empirically validated by data from a study reviewed below

(Hansson et al., 2006).

It has been noted that in empirical data the hit-rate is sometimes quite similar for intervals of

varying subjective probability (Block & Harper, 1991; Yaniv & Foster, 1997). It is therefore

important to consider the alternative hypothesis that the participants pay no attention at all to the

probability and just produce an interval that appears informative, or that they are unable to express

their knowledge as a probability distribution. By contrast, the NSM assumes the use of a distribution

and naturally predicts that both the size of the intervals and the hit-rates (the rates of values falling within the intervals) should increase with subjective probability (Figure 6A).

We therefore reanalyzed the data in Winman et al.'s (2004) Experiment 1 where participants

made .5, .75, and 1.0 confidence intervals for the population of world countries in a blocked within-

subjects design. As illustrated in Figure 6A, in the analysis of the entire data set the hit-rates increase

significantly with subjective probability (see also Juslin et al., 1999). This could be a side effect of the

within-subjects comparison that highlights the different probabilities, thereby triggering the analytic

insight (see Kahneman & Frederick, 2002) that high-confidence intervals should be wider than low-

confidence intervals. However, half of the participants started by making .5 intervals in the first

block, the other half by making 1.0 intervals, so this comparison is strictly between-subjects and a

difference cannot be explained by a within-subjects design. In the first block the group with 1.0 confidence intervals had a substantially higher hit rate than those who made .5 intervals (.64 vs. .32, t(18) = 2.52, p = .01, one-tailed), with 66 percent larger interval sizes (t(18) = 2.52, p = .01, one-tailed).

Why is there a Format Dependence Effect?

What if the NSM is given the intervals produced in the simulation for Figure 6A and is now asked to assess the probability that the quantity falls within the interval? For example, if the initial

simulation for interval production suggested a 75% interval between X and Y for a country, the

simulation for interval evaluation assessed the probability that the population of the same country falls

in between X and Y. Figure 6B illustrates that now the predicted probability converges on the hit-rate

of countries falling within the interval, yielding close to zero overconfidence. (The overconfidence

score in Figure 6B is the probability for the interval minus the hit rate collapsed across the .5, .75,

and 1.0 intervals in Figure 6A.) If the 75% confidence intervals on average yield a hit-rate of .4, re-

entering the same intervals into the NSM for interval evaluation produces an average probability judgment close to .4.
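The cross-validation logic is easy to check in code: produce an interval from one small sample, then let an independent sample supply the probability judgment. In the sketch below (a toy lognormal population stands in for an OED; all numbers are illustrative) the mean judged probability tracks the actual hit rate rather than the nominal .75:

```python
import numpy as np

rng = np.random.default_rng(4)
pop = rng.lognormal(mean=2, sigma=1.3, size=200)   # stand-in for a skewed OED
trials, hits, judged = 20_000, 0, 0.0

for _ in range(trials):
    idx = rng.permutation(len(pop))
    target = pop[idx[0]]
    ssd_prod, ssd_eval = pop[idx[1:5]], pop[idx[5:9]]  # two independent SSDs
    lo, hi = np.percentile(ssd_prod, [12.5, 87.5])     # produce a .75 interval
    hits += lo <= target <= hi
    # Interval evaluation: proportion of a fresh sample inside the interval.
    judged += np.mean((ssd_eval >= lo) & (ssd_eval <= hi))

print(f"hit rate: {hits / trials:.2f}  mean probability judgment: {judged / trials:.2f}")
# The judged probability converges on the hit rate, not on .75, so
# overconfidence nearly vanishes with interval evaluation.
```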

Overconfidence is virtually abolished for the same events because proportion is not a biased

estimator (as sample coverage is) and the error from too small and displaced intervals is explicit in a

new random sample (compare with cross-validation in statistical modeling). Note, however, that the

simulation model predicts slight overconfidence also with interval evaluation, because by necessity

the target quantity (e.g., Thailand) is never in the SSD (a sample of Asian countries other than

Thailand), so strictly speaking the SSD is not a random sample from the OED (all Asian countries)

(see also Footnote 4). Below we review data demonstrating that people exhibit format dependence

not only when assessing independently defined events or intervals produced by others, but also when they assess

their own intervals.

Figure 6B also illustrates typical data on format dependence from Juslin et al. (2003), which

again fall numerically close to the predictions by the NSM. Figure 7 provides a more detailed

comparison with the data from Juslin et al., where the participants assessed the probability that

statements about the world populations are true (i.e., "Thailand has more than 10 million inhabitants.") or produced confidence intervals for the populations of the world countries. The full-

range probability judgment task was implemented in the NSM simulation by a random selection of a

country and a cut-off value (the 10th, 50th, or 90th percentile of the OED for all countries). The NSM

made probability assessments for these propositions according to Steps 1, 2, and 3a, assuming that

the SED is representative of the OED. The predictions by the NSM in Figure 7B illustrate that already by setting n = 4 ± 1 and assuming a SED that coincides with the OED the predictions by the

NSM qualitatively capture the data in Figure 7A nicely, with full-range proportions close to the line

for perfect realism, yet simultaneously extreme overconfidence for the confidence intervals. In sum:

while interval production is plagued both by overconfidence due to a biased estimator and biased

input, interval evaluation (or probability judgment, more generally) is only affected by bias in the

input, and to the extent that this bias is small, overconfidence can approximate zero.

As a different illustration of the effect of replacing coverage as an estimator with proportion

consider the results in Soll and Klayman (2004). After producing confidence intervals, Soll and

Klayman asked the participants for post ratings of the number of intervals that included the true

values. These post ratings were less overconfident than the interval productions. One crucial aspect

that is affected by changing the task from interval production into post rating of number of intervals

including the true value is that the estimator changes from a coverage to a proportion ("for how many of the 50 questions do you think that the correct answer will turn out to be within the interval you gave?", p. 304). The participants can now consider the proportion of intervals that include the

true value and, according to the NSM, this change of estimator variable explains the reduced

overconfidence.

Why is Overconfidence Different with Different Target Variables?

The other source of bias in Figure 1C, biased input samples (SEDs that deviate systematically

from the OEDs), also has an effect on overconfidence. Such deviations, which affect both interval

production and interval evaluation, may arise from, for example, a correlation between the target

magnitude and memory storage probability, as when we know more about larger countries (e.g.,

Goldstein & Gigerenzer, 2002) or when the extremes of a distribution, small and large magnitudes,

are more salient and likely to be encoded in the SED. Most deviations between SEDs and OEDs

contribute to overconfidence and for sample sizes between 4 and 8 and moderate discrepancies

between the SEDs and the OEDs we expect overconfidence of the magnitude observed in data.

Because, on average, the SEDs should be better calibrated to the OEDs for familiar quantities there

should also be less overconfidence for familiar than for unfamiliar quantities (Block & Harper, 1991;

Pitz, 1974).

The overconfidence OU observed in an interval production task can therefore be partitioned

into two separate components:

OU_i = o(n_i) + ou(SED_ik) .   (3)

OU_i is the observed over-/underconfidence bias for individual i and o(n_i) is the overconfidence bias

added by the naive interpretation of sample dispersion, the magnitude of which is determined by the

sample size n_i available to individual i. ou(SED_ik) is the over- or underconfidence added by a bias

between the SED and the OED, the magnitude of which is idiosyncratic to the individual i and the

target variable k. (ou signifies the possibility of both over- and underconfidence bias, in contrast to o

that signifies overconfidence.)

The NSM also highlights the conditions where the overconfidence in probability assessment

should be minimized. When the assessment involves probability judgments for pair comparisons

(e.g., "Which city has the larger population: a) London, or b) Paris?"), the effect of the bias between

the SED and the OED is, in effect, minimized because the judgment now involves a rank order

within distributions (compare with the elimination of bias in psychophysical discrimination by pair-

comparison, see Macmillan & Creelman, 1991). To the extent that the SED preserves the rank

order in the OED, probability assessment with pair-comparisons should be especially conducive to

zero overconfidence, as they appear to be (Juslin & Persson, 2002; Juslin et al., 1999; Juslin et al.,

2000). Moreover, because the factors that determine the overconfidence bias are different, target

variables that produce more overconfidence with interval production need not produce more

overconfidence with assessment for pair comparisons (Klayman et al., 1999). In sum: the deviations

between the SEDs and the OEDs are likely to be highly idiosyncratic, contributing to differences in

the overconfidence observed for different target variables. If the SED-bias is small or the task

involves pair comparisons, probability judgments converge on zero overconfidence.

Why is Overconfidence with Interval Production not Cured by Expertise?

In regard to the NSM it is important to distinguish between procedural expertise referring to

experience with expressing judgments as probability distributions and domain expertise implying

experience with the content domain of judgment. Procedural expertise, especially if judgments are

repeatedly performed in the same domain and with the same format, is likely to promote a shift from

the sort of on-line processing captured by the NSM to retrieval of pre-computed sufficient statistics

that describe the OED (e.g., retrieval of its mean and standard deviation). There are examples of

domain experts that make well calibrated probability judgments (Keren, 1987; Murphy & Brown,

1985), but few examples of domain experts producing well calibrated confidence intervals (Russo &

Schoemaker, 1992).

For the exceptions, like the meteorologist in Murphy and Winkler (1977), domain expertise

appears to be confounded with unknown degrees of technical support and extensive procedural

expertise, suggesting the possibility of relying in part on pre-computed knowledge rather than

intuitive on-line computation of intervals captured by the NSM. Importantly, however, this difference

between the formats is verified also by the laboratory experiments reviewed below where procedural

expertise is controlled for (Hansson et al., 2006).

The explanation by the NSM is straightforward. With probability judgment, over-

/underconfidence biases mainly derive from systematic deviations between SEDs and OEDs. Given

appropriate feedback from the environment, domain expertise should provide ample opportunity to

correct these deviations to attain realistic probability judgments. With interval production expertise

can likewise eliminate the overconfidence that derives from a biased SED, and thus diminish the overconfidence, but, because the sample size is architecturally constrained by short term memory

capacity, the overconfidence that derives from a naïve use of sample dispersion is not cured by more experience.⁶ These implications are consistent with the literature on probability judgment and interval

production by novices and experts (Russo & Schoemaker, 1992). Below we review data (Hansson

et al., 2006) that directly address the role of experience and short term memory for overconfidence

in judgment.

Summary

The NSM provides straightforward accounts of the overconfidence with interval production

and the format dependence effect. With interval production there are two contributors to the

overconfidence corresponding to the two origins of judgment bias in the general scheme for the naïve

intuitive statistician in Figure 1C. The first origin is that people use small-sample coverage as a direct

proxy for a population property. Only the sampling variability that is explicit in the sample is

considered and the error rate added by the underestimated population dispersion and misplaced

interval is ignored. This contributes a bias towards overconfidence regardless of what target variable

is estimated. The second origin is that whenever the input itself is biased the SEDs deviate

systematically from the OEDs. The SED-bias is idiosyncratic to the target variable and affects

probability judgment too.

We have demonstrated that the NSM is consistent with much of the data previously collected

in regard to overconfidence with interval production and format dependence. The true test of a

theory, however, lies in its ability to control the phenomena. In the following, we review research

directly motivated by the NSM illustrating how it allows us to control the overconfidence and the

format dependence effects, including a method to minimize the overconfidence in interval production

tasks. We also provide the first evidence in regard to the role of short term memory capacity for the

overconfidence with interval production.

Data Collection Motivated by the Naïve Sampling Model

Can we Control the Magnitude of Format Dependence?

Studies of format dependence have most often compared formats that involve the assessment

of one fractile of the subjective probability distribution ("What is the probability that the population of Thailand exceeds 25 million?") to production of intervals that involve two fractiles of the subjective

probability distribution (Juslin et al., 1999). Winman et al. (2004) therefore compared two formats

that both consist of an interval and two fractiles, viz. interval production and interval evaluation as

discussed above. To precisely equate the events, participants first produced intervals that defined

events, and later the probability of these events (intervals) was assessed. For example, a participant

may assess a .5 interval of 10 to 30 million inhabitants for the population of Thailand. On a later

occasion the same or another participant assesses the probability that the population of Thailand falls

between 10 and 30 million inhabitants. The NSM predicts that the overconfidence should be

substantially reduced when participants make probability assessments for the intervals.

In regard to format dependence an important aspect is whether the event in the probability

judgment is already correlated with the SSD used to assess it. The event is typically uncorrelated if it

is defined a priori, as when the assessment concerns unknown future events (Keren, 1987; Murphy

& Brown, 1985) or general knowledge questions that are randomly selected from natural

environments (Gigerenzer et al., 1991; Juslin, 1994; Juslin et al., 2000). When the event is

uncorrelated with the SSD used to assess the event and the SED approximates the OED, especially

when the task involves pair-comparisons, the NSM suggests a potential for fairly well calibrated

probability judgments (Figure 6B).

Another possibility is that the event is correlated with the SSD used to make the probability

assessment. One reason for such a correlation is SSD-overlap. If different people retrieve the same

or almost the same SSD because of a small and biased environmental input (an SED bias) a

correlation arises when the event is defined by another person's best guess. If you, for example,

believe that Thailand has a large population because all Asian countries you know have large

populations and I know the same Asian countries as you do, I will share your belief. Similar

correlations obtain when a person who selects general knowledge items for a confidence study can

second-guess the modal answer from data or their own intuitions, and over-represent surprising

items (Gigerenzer et al., 1991; Juslin, 1994; Juslin et al., 2000).

One extreme case is accordingly that of no correlation, which predicts the format dependence in

Figure 6B with close to zero overconfidence for probability judgment and large overconfidence with

interval production. The other extreme is that of perfect correlation, where exactly the same SSD is used to both produce and probability-assess an interval, which yields no format dependence⁷ and large

overconfidence with both formats. In between these extreme cases, the predicted format

dependence is of an intermediate magnitude.

In the between-subjects design in Winman et al. (2004) participants evaluated the probability

that the target quantity falls within the intervals produced by another participant. There was

significantly and substantially more overconfidence in the interval production condition than in the

interval evaluation condition (.34 vs. .15), although the events were matched on an item-by-item

basis. As predicted by the NSM, however, interval evaluation of an event that represents the best

guess by a peer yields more overconfidence than observed when the events are predefined and

independent of the judge's own knowledge state (see, e.g., Figures 6B & 7). The explanation is that

the SSDs retrieved by two peers are likely to overlap due to their experience with a similar small and

biased input from the environment.

The most remarkable demonstration is that participants are susceptible to format dependence

also for the intervals they produce themselves. In the within-subjects design, the participants first

produced confidence intervals for 40 world country populations. One week later they assessed the

probability that the populations fall within the intervals produced on the first occasion. Finally, they

again produced intervals for the same 40 country populations. The intervals produced on the first

occasion revealed extreme overconfidence (.34), which was significantly reduced to .26 when they,

one week later, made interval evaluations of those intervals. When they again produced intervals for

the same 40 country populations, they returned to their initial overconfidence (.34; see Winman et

al., 2004).

Format dependence is thus robust enough to obtain also when participants assess their own

intervals. In sum: the format dependence effect is profound when the event is uncorrelated with the

assessor's knowledge state (Figure 6B). As also predicted from consideration of the SSD-overlap in

the production and evaluation of an interval, the format dependence is smaller when a person

evaluates another person's best guess, and especially when evaluating his or her own best guess,

where the SSD-overlap is likely to be large.

Can we Cure (or Reduce) Overconfidence with Interval Production?

In Winman et al. (2004), Experiment 2, the NSM was used to design a method to produce

intuitive confidence intervals, but in a way that minimizes the overconfidence with interval production.

The ADaptive INterval Adjustment (ADINA) procedure proposes a sequence of intervals, each of

which changes in size in response to the probability assessment for the previous interval, with the

effect that the intervals home in on a target probability. For example, assume that the target interval

probability is .9 and the target value is the population of Thailand. The ADINA proposes an interval

to the participant, for example, between 20 and 40 million. If the participant assesses a probability

below .9, the next interval proposed by ADINA is wider. If the assessed probability is above .9,

ADINA proposes a smaller interval. This procedure continues until the assessed probability is .9.
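In outline, ADINA is a simple feedback loop. The sketch below is a minimal illustration of that loop; the multiplicative step rule, the stopping tolerance, and the toy normal belief distribution standing in for the participant are our assumptions, not the exact implementation of Winman et al. (2004).

```python
from math import erf, sqrt

def adina(center, assess, target_p=0.9, width=20.0, tol=0.02, max_steps=50):
    """Propose intervals around `center`, widening or shrinking them until the
    judged probability assess(lo, hi) is close to target_p."""
    lo = hi = center
    for _ in range(max_steps):
        lo, hi = center - width / 2, center + width / 2
        p = assess(lo, hi)                      # the probability judgment
        if abs(p - target_p) <= tol:            # close enough: stop adjusting
            break
        width *= 1.5 if p < target_p else 0.75  # widen if judged too improbable
    return lo, hi

# Toy "participant" whose judgments derive from a normal belief distribution
# (a stand-in for sampling from the SED); values in millions of inhabitants.
cdf = lambda x: 0.5 * (1.0 + erf((x - 30.0) / (15.0 * sqrt(2.0))))
judge = lambda lo, hi: cdf(hi) - cdf(lo)
print(adina(center=30.0, assess=judge))         # homes in on a ~.9 interval
```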

As with usual interval production the end-product is an interval, but the procedure requires

estimates of proportion rather than coverage. Experiment 2 had three conditions: a) a control

condition with ordinary interval production; b) an ADINA(O)-condition where ADINA proposes

intervals centered on the participant's Own point estimate. Because the interval (event) that is

assessed is itself presumably affected by the same sampling error as the SSD used for the probability

evaluation this is likely to involve high SSD-overlap. c) An ADINA(R)-condition where ADINA

locates intervals centered on a Random population value. Because the interval is randomly placed,

this implies a decoupling of the SSD-overlap8.

In the ADINA(O)-condition, the NSM predicts less overconfidence than in the control

condition because proportion rather than coverage is the estimator, but not zero overconfidence.

Roughly speaking, ADINA(O) controls for the underestimated dispersion of the population

distribution in Figure 4B, but not for the misplaced interval in Figure 4C, because the event and the

SSD are correlated. In the ADINA(R)-condition we expected a larger decrease in overconfidence

because the SSD-overlap is now removed. In effect, the ADINA(R)-condition transforms the task

into a standard full-range probability judgment task and the NSM therefore predicts close to zero

overconfidence (see Figure 6B).

Figure 8 shows the mean subjective probability and hit rate with 95 percent (real) confidence

intervals in the three conditions. Figure 8A confirms that overconfidence for the initial interval follows

the predicted order with extreme overconfidence for the control condition and close to zero

overconfidence for the ADINA(R)-condition. The hit-rate is similar in all three conditions, but

whereas the subjective probability assigned to the interval is too high in the control condition, it is

quite accurate with ADINA(R). However, these initial intervals are not yet intervals associated with

the pre-specified target probability.

Figure 8B, which presents the mean subjective probability and hit rate for the final intervals, confirms the

predicted order with extreme overconfidence for the control condition and close to zero

overconfidence for the ADINA(R)-condition. In this case the subjective probability is almost the

same9 (at the levels defined by the desired probabilities), but overconfidence is diminished by a



manipulation that, in effect, increases the too-low hit rates in the control condition so that they

approximate the stated probabilities. In sum: we can reduce overconfidence substantially both by

decreasing the subjective probability attached to the intervals and by increasing the hit rates

associated with the intervals. Both of the ADINA procedures produce confidence intervals with

significantly less overconfidence.

All of the studies reviewed thus far involve naturalistic general knowledge content, in regard to

which the participants have acquired the knowledge expressed in the estimates from their pre-

laboratory experience. This experimental paradigm obviously limits our ability to control the learning

history and the processes used by the participants. In the final sections, we therefore turn to data

collected in more controlled laboratory training experiments.

Do the Intervals Derive from Experience with Distributions?

The NSM implies that people possess subjective counterparts (the SEDs) of the ecological

distributions (the OEDs), and that the intuitive confidence intervals derive from retrieval operations

(sampling) on these representations. The aim of an experiment presented in detail in Appendix B was

to manipulate the OEDs in order to establish that a) people have the cognitive capacities to generate

SEDs that represent the OEDs, and b) that, when encountering an unknown object, the uncertainty

expressed by an interval is determined by the SED.

The participants were exposed to an OED that was either U-shaped or inversely U-shaped

(range = 1-1000). The U-shaped OED condition had a high frequency of values close to 1 and to

1000 and the inversely U-shaped OED condition consisted predominantly of target values close to

The task involved estimation of the revenue of fictitious business companies (see Appendix B). To

directly elicit the SEDs the participants also estimated the relative frequency of observations in ten

intervals evenly distributed over the OED range.



In the learning phase the participants guessed the revenue of 156 companies with outcome

feedback about the correct target value. In a test phase, they produced confidence intervals for 60

companies they had previously seen in training and for 60 new companies. As predicted if people are

able to represent the OEDs, the assessed SEDs were bimodal in the U-shaped condition and

unimodal in the inversely U-shaped condition. The standard deviation of the SED was 497 in the U-

shaped condition (OED = 574.3) vs. 177.6 (OED = 183.6) in the inversely U-shaped condition

(F(1, 28) = 410.9, p < .001; in both conditions the 95% real statistical confidence intervals for the

SEDs include the OED). As implied if the intuitive confidence intervals derive from the SEDs, the

interval size for new objects was larger (750) in the U-shaped condition than in the inversely U-

shaped condition (550) (F(1, 28) = 12.0, p = .002). There was also a strong positive correlation

between an individual participant's SED standard deviation and his or her average interval size (r(30)

= .54, p < .001).

Thus, when encountering unknown objects the participants appeared to rely on previously

experienced distributions in expressing intuitive confidence intervals. The study demonstrated that

people are sensitive to the OEDs with SEDs that largely mirror the OEDs, interval sizes are affected

by the OED manipulation in the way predicted by the NSM, and a direct measure of the SED verifies

that the SED is strongly linked to interval size.

Sample Size: Constrained by Experience or Short Term Memory?

In Hansson et al. (2006) we investigated the role of short term memory capacity (referred to

as n by STM), the total number of observations stored in long term memory (n by LTM), and the

bias between the SED and the OED in a laboratory learning task of the sort described in Appendix

B. Nominally the task involved estimates of the revenue of companies, but unbeknownst to the

participants the distributions were from a real environment (the world country populations), which

they had to learn from scratch in the laboratory.

A bold assumption of the NSM is that the sample size that can be applied to an estimation

task is constrained by the judge's short term memory capacity. The NSM therefore predicts that

overconfidence in interval production tasks should correlate negatively and strongly with n by STM

and, although experience might diminish it, overconfidence should be extreme also after extensive

experience (i.e., experience can eliminate ou(SED) but not o(n) in Eq. 3). In a probability judgment

task overconfidence should be rapidly reduced by experience because experience can eliminate

ou(SED), the main source of overconfidence, and overconfidence should not be as strongly

correlated with n by STM.

An alternative hypothesis is that the sample size instead is constrained by long term memory (n

by LTM). This is plausible to the extent that the representations derive from crystallized knowledge

that is accumulated and updated gradually from experience. In that case, extensive experience should

reduce overconfidence with both assessment formats, and overconfidence should be more strongly

correlated with n by LTM than n by STM.

Experiments 1 and 2 from Hansson et al. (2006) varied the extent of training in conditions with

immediate, complete, and accurate outcome feedback from very modest (68 trials), through intermediate

(272 trials), to extensive (544 trials). After 68 training trials with the same training stimuli,

overconfidence was .28 with interval production, but .07 with interval evaluation. The predictions by

the NSM assuming sample size 4 and no SED-bias are .28 and .02, respectively. The format

dependence effect was thus replicated, and the overconfidence with probability judgment approached

0 already after 68 training trials. Figures 9A and B summarize the results with interval production as a

function of training, after 68, 272, and 544 trials. As illustrated in Figure 9A, the proportions of

correctly recalled target values (and thus n by LTM) increased significantly with training, from less

than 10% of all 136 target values in each training block after 68 training trials, to more than 30%

after 544 training trials.

The marginal and statistically non-significant decrease in overconfidence bias in Figure 9B is

consistent with the NSM. The overconfidence contributed by SED-bias early in training is corrected

by further training as the SED converges on the OED. More crucially, as predicted there is persistent

extreme overconfidence with interval production even after 544 trials with feedback, a bias more than

four times larger than with interval evaluation after only 68 trials. This is remarkable because

conditions with immediate, unambiguous, and representative feedback have traditionally been

presumed to be optimal for attaining well calibrated judgments (Keren, 1991; Lichtenstein et al.,

1982). Consistent with the research on experts (Russo & Schoemaker, 1992) and as predicted by

the NSM, however, experience appears to have a minimal effect on the overconfidence with interval

production.

In addition to n by LTM, as estimated by the proportion of correctly recalled target values, in

Experiments 1, 2 and 3 in Hansson et al. (2006) participants were measured with a standard digit-

span test to estimate n by STM. Individual SED bias was estimated by the absolute deviation

between the mean OED and the mean correctly recalled target value. These data were entered into a

multiple linear regression model to predict overconfidence. As illustrated in Table 1, after 68 training

trials no independent variable was significantly related to overconfidence, but after 272 training trials

overconfidence was significantly related to n by STM (β = -.72) and SED-bias (β = .39). After 544

training trials there was a strong β-weight for n by STM (-.67), but now the relation with SED-bias

had disappeared (β = -.26). In none of the experiments is n by LTM significantly related to

overconfidence, and nominally the β-weight is often of the wrong sign (larger n by LTM suggests

more overconfidence).
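For concreteness, the sketch below shows the kind of standardized multiple regression that yields such β-weights. The data-generating model is purely hypothetical and chosen only to make the code self-contained; the real individual data are reported in Hansson et al. (2006).

```python
import numpy as np

rng = np.random.default_rng(1)
m = 40                                       # hypothetical participants
stm = rng.normal(7.0, 1.5, m)                # digit-span scores (n by STM)
ltm = rng.uniform(0.05, 0.35, m)             # proportion recalled (n by LTM)
sed_bias = np.abs(rng.normal(0.0, 1.0, m))   # |mean SED - mean OED|
# Hypothetical generating model: overconfidence falls with STM capacity and
# rises with SED-bias (plus noise); LTM plays no role.
overconf = 0.45 - 0.03 * stm + 0.05 * sed_bias + rng.normal(0.0, 0.05, m)

z = lambda v: (v - v.mean()) / v.std()       # z-scoring yields beta-weights
X = np.column_stack([z(stm), z(ltm), z(sed_bias)])
betas, *_ = np.linalg.lstsq(X, z(overconf), rcond=None)
for name, b in zip(("n by STM", "n by LTM", "SED-bias"), betas):
    print(f"beta({name}) = {b:+.2f}")
```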

Our interpretation was that after 68 training trials the effects of short term memory capacity

and SED-bias were overshadowed by large idiosyncratic differences in the variability in the small

samples of target values encoded in long term memory. After 272 training trials the participants had

formed stable, although still somewhat idiosyncratic, SEDs, and now the correlations with short term

memory and SED-bias appear. After 544 trials the SEDs have converged on the OEDs, rendering

SED-bias a less useful predictor, and the model is now dominated by the correlation with short term

memory capacity. By contrast, the overconfidence with interval production was not related to the total

sample size in the participants' long term memory (n by LTM), neither when it was experimentally

manipulated (see Figure 9B), nor in the analysis of the individual differences in Table 1.

In Experiment 3, a larger participant sample was measured in fluid intelligence (RAPM; Raven,

1965) and episodic memory (a version of Wechsler's paired-associates). As illustrated in Table 1, in

the interval production condition n by STM significantly predicts overconfidence, but not in the

interval evaluation condition. The more specific purposes of Experiment 3 were to examine whether the

correlation between short term memory capacity and overconfidence obtains also when we control

for individual differences in intelligence and episodic memory capacity. We also wanted to test the

interaction predicted by the NSM: there is more overconfidence with interval production, but this

difference should be especially pronounced for people with low short term memory capacity. To test

this interaction, in each condition the participants were divided into low or high short term memory

capacity based on a median split.

The interaction in Figure 9C confirms the prediction by the NSM. In the interval evaluation

condition (probability judgment) there is no significant difference between the participants with low

and high short term memory capacity, but in the interval production condition there is significantly

more overconfidence bias for participants with low short term memory capacity. Entering RAPM

and episodic memory as covariates in the analysis has no effect on the significant interaction in Figure

9C. In sum: as predicted by the NSM, n by STM but not n by LTM predicts the overconfidence

observed with interval production, and this correlation is higher with interval production than with

probability judgment.

Interval Size and Prediction Error

Yaniv and Foster (1995; 1997) proposed that confidence intervals may not be expressions of

subjective probability distributions but formed by a trade off between two countervailing objectives:

to provide intervals that are as accurate as possible (which pulls towards wide intervals) and as

informative as possible (which pulls towards tight intervals). A realistic high probability confidence

interval is often too wide to be of any practical use and therefore the judges may implicitly, or by

habit, trade a lower accuracy, contributing to overconfidence, for the attainment of a tighter and

more informative interval.

Yaniv and Foster (1997) compared three ways to produce intervals: a) self-selected grain size

of one's estimate of an unknown quantity (e.g., whether one prefers to indicate the birth year of

Mozart by an interval in terms of centuries, decades, or years); b) 95 percent confidence intervals; c)

expected plus-or-minus error. They found that with all three methods the mean absolute error

between the midpoint of the interval and the correct target value was a constant fraction of the

interval size. Yaniv and Foster therefore concluded that "One might be better off interpreting

intuitive interval judgments as predictors of absolute error rather than ranges that include the truth

with some high probability" (p. 31).

This conclusion conflicts with the NSM, which presumes that confidence intervals express

knowledge of an underlying distribution (albeit accessed by a small sample). The finding that high-

confidence assessments are associated with wider intervals and higher hit rates (Figure 6A) already

demonstrates that the intervals express more than a plausible margin of error, or an accuracy-

informativeness trade off. Yet, it is worth asking how the NSM can address the constant normalized

error observed by Yaniv and Foster (1997). The three rightmost columns of Figure 10, which present

the data from Hansson et al. (2006) after 68, 272, or 544 training trials, with absolute error

plotted against interval size and best-fitting regression lines, illustrate this pattern. There is a positive

slope between absolute error and interval size, but there are also two separate clusters of absolute

errors at the low interval sizes.

Critically, as in the data reported by Yaniv and Foster (1997) the slope is virtually constant

regardless of the probability of the interval. By contrast, with real confidence intervals the slopes

naturally decrease with higher confidence levels and overconfidence is zero. The predictions by the

NSM (n = 4, no SED bias) in the leftmost column of Figure 10 reproduce both the exact bivariate

distribution and the relationship between absolute error and interval size. The explanation is that

unlike parametric coverage intervals (Eq. 1) or classical confidence intervals for a parameter (e.g.,

the population mean), the non-parametric NSM does not compute means and standard deviations

within the SSD, nor does it rely on distributional assumptions. The interval size is determined solely

by the upper and lower limits corresponding to two sample fractiles and the midpoint of the interval is

not the sample mean, but the midpoint between these two limits. The variability of the boundary

values directly affects the precision of the midpoint of the interval, in terms of absolute error

regardless of whether the interval covers 50%, 75% or 100% of the SSD.
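This explanation is easy to check numerically. The sketch below, a toy version under assumptions of our own (a lognormal proxy environment and n = 4), regresses absolute midpoint error on interval size for 50%, 75%, and 100% fractile intervals; the slopes should come out roughly similar across probability levels.

```python
import numpy as np

rng = np.random.default_rng(2)
env = rng.lognormal(2.0, 1.5, 10_000)        # proxy environment (assumption)
n, trials = 4, 20_000

for p in (0.50, 0.75, 1.00):
    q = [(1 - p) / 2 * 100, (1 + p) / 2 * 100]     # fractiles for coverage p
    errors, sizes = [], []
    for _ in range(trials):
        truth = rng.choice(env)
        ssd = rng.choice(env, size=n)
        lo, hi = np.percentile(ssd, q)
        errors.append(abs((lo + hi) / 2 - truth))  # midpoint of the two limits
        sizes.append(hi - lo)
    slope = np.polyfit(sizes, errors, 1)[0]        # |error| regressed on size
    print(f"p = {p:.2f}: slope of |error| on interval size = {slope:.2f}")
# A displaced boundary shifts the midpoint by about half the added width, so
# the slope is of similar magnitude at all three probability levels.
```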

The similarity between the predicted and observed patterns in Figure 10 therefore suggests

that the non-parametric version of the NSM better captures the cognitive processes than the

parametric version. Finally, both the interval size and the overconfidence bias increase with the

probability of the interval. In sum: Figure 10 demonstrates that the constant fraction between

absolute error and interval size observed empirically is consistent with the NSM, even though the

intervals express knowledge of probability distributions.

General Discussion

Jean Piaget (1896-1980) is the psychologist most renowned for the notion of egocentrism,

conceived of as the young child's inability to transcend his or her own local perspective of the

situation (Piaget & Inhelder, 1956). The arguments in this and several other publications (e.g.,

Fiedler, 2000; Fiedler & Juslin, 2006) propose that one important cause of judgment error lies in the

adult's continued struggle to transcend the locally available samples of information. In this article we

thus present two interrelated arguments: First, that the perspective of the naïve intuitive statistician

(Fiedler & Juslin, 2006b) complements previous approaches to intuitive judgment in useful ways.

Second, to illustrate the approach, we have provided its in-depth application to two perplexing

phenomena in research on confidence: the extreme overconfidence with interval production and the

format dependence effect.

The Naïve Intuitive Statistician

As suggested by "the intuitive statistician" (Peterson & Beach, 1967), the naïve intuitive

statistician entails that, by and large, people accurately describe the samples of information they encounter.

On this view, heuristics that substitute probability with intensional properties like similarity or

availability (Kahneman & Frederick, 2002) are often not the appropriate locus of explanation for

cognitive biases. Rather, they arise from sample properties that are inappropriate estimators, either

because of a biased input (Einhorn & Hogarth, 1978; Fiedler, 2000), or because the property is a

biased estimator (Kareev et al., 2002; Winman et al., 2004). This is not to deny that judgments are

affected by similarity and fluency, but only to propose that, more often than is perhaps appreciated, the

most important cause of the bias lies not in the processing of the sample. The explanatory burden is

thereby relocated from heuristic processing of the sample to the relationship between sample and

environment.

This program highlights at least three lines of research: first, to address judgment phenomena in

terms of accurate but naïve description of experience (see Fiedler, 2000; Fiedler & Juslin, 2006a).

As noted initially, people, for example, have biased conceptions of the frequencies of various causes of death.

The traditional account of this phenomenon is that they rely on the availability heuristic (Lichtenstein

et al., 1978). Detailed investigations (Hertwig, Pachur, & Kurzenhäuser, 2005), however, suggest

that it is not the ease of retrieval, an intensional property substituting for proportion, that explains

the bias, but accurate assessment of biased samples ("availability-by-recall," in the terms of Hertwig

et al., 2005). The naivety implied by the naïve intuitive statistician is, moreover, an equally compelling

account of the belief in the "law of small numbers" (the tendency to expect small samples to be

representative of their populations) as the explanation in terms of the representativeness heuristic

(Tversky & Kahneman, 1971). In many situations it may be unclear what the notion of

representativeness adds over and above the assumption that people take small sample properties as

direct proxies for population properties. In many of these instantiations, but certainly not all, the locus

of explanation seems better captured by accurate but naïve description of experience.

Although not cast in terms of the metaphor of the naïve intuitive statistician, several lines of

recent research illustrate its fertility by incorporating aspects of it. Denrell (2005), for example,

showed that biases in impression formation traditionally explained by cognitive and motivational

factors may arise from accurate description of samples that are biased as a side-effect of sampling

strategies. Because you are inclined to terminate the contact (the sampling) with people of whom you

get an initial negative impression, you are unlikely to correct an initially false negative impression. By

contrast, a false positive impression encourages additional sampling that serves to correct the false

impression. The net effect is a bias, even if the samples are correctly described. Hertwig et al. (2004)

demonstrated that if people accurately express the proportions in small samples they will under-

weight low-probability events when they make decisions from trial-by-trial experience. A recent

theory also emphasizing the sampling metaphor is decision-by-sampling theory (Stewart, Chater, &

Brown, 2006), which shows how ordinal comparisons and frequency counts based on retrieval of small

samples explain the value function and probability weighting assumed by prospect theory

(Kahneman & Tversky, 1979), such as the concave utility function and losses looming larger than

gains. A crucial part of the explanation is the relationship between small proximal samples and the

distributions of values and probabilities in real environments.

These recent approaches illustrate the fertility of more carefully analyzing the relationships

between the available proximal samples and distal environmental properties (Fiedler & Juslin,

2006a). A perhaps more far-reaching question concerns the extent to which differences of opinion in

societal and political controversies may sometimes derive from selective and different samples of

experience, together with an inability to appreciate this fact, rather than from differences in value

orientation or biased information processing.

Second, surprisingly little is known about how people represent knowledge of distributions. Is

this knowledge shaped by a priori assumptions about distributional shape (akin to a parametric

statistician) or driven by bottom-up experience? Analogously to the debate in categorization

learning as to whether people develop abstract representations, like prototypes and rules, or rely on

exemplars (Nosofsky & Johansen, 2000), one can ask: do people spontaneously compute summary

representations of distributions, perhaps of central tendency and dispersion, or are observations

stored in their raw form? What is the metric by which dispersion and skew are represented?

Although research on frequency learning (Sedlmeier & Betsch, 2002) and judgment (Gilovich et al.,

2002) provide useful hints, the cognitive machinery of the intuitive statistician needs to be better

understood.

Third, we need to determine the scope of the naivety or "meta-cognitive myopia" (Fiedler,

2000). On the one hand, large amounts of research suggest considerable naivety in a variety of

situations (e.g., Denrell, 2005; Einhorn & Hogarth, 1978; Fiedler, 2000; Fiedler & Juslin, 2006a;

Hertwig et al., 2004; Kareev et al., 2002). On the other hand, it also seems evident that we are not

completely encapsulated in our personal samples. Working as a professor does not produce the

conviction that most people are PhDs, even though this might be true of the sample of people that is

encountered in a working day. The empirical research that exists on people's ability to learn from

selective feedback suggests that often they are less affected than one might initially suspect (Elwin et

al., in press; Grosskopf, Erev, & Yechiam, 2001; Yechiam & Busemeyer, 2006, but see Fischer &

Budescu, 2005). What are the limits of our naivety, and by what cognitive processes do we

transcend our personal samples?

In sum: in order to counteract biased judgments we need to understand a) the relationship

between the proximal samples and the distal environmental properties; b) the way in which

knowledge of distributions is represented; and c) the limits of, and our ability to transcend, our

naivety. Although the naïve intuitive statistician holds some promise to enlighten our understanding of

judgment phenomena, it should, of course, also be emphasized that there are phenomena that are still

beyond the scope of the approach and which might be better addressed by the variable-substitution

approach of the heuristics and biases program, such as, for example, the conjunction fallacy (Tversky

& Kahneman, 1983).

The Naïve Sampling Model of Intuitive Confidence Intervals



The metaphor of the naïve intuitive statistician was applied to the extreme overconfidence with

interval production and the format dependence effect. The NSM thus assumes that people accurately

but naively describe small samples to make a judgment, and the overconfidence is determined by the

two factors that are emphasized by the naïve intuitive statistician (Figure 1C): the extent to which the

samples are representative of the environment and the extent to which the sample property described

is an unbiased estimator.

The NSM assumes that people take sample proportion as an estimator of probability.

Because proportion is an unbiased estimator, there is potential for good calibration, provided that the

SEDs approximate the OEDs, or that the task involves pair comparisons that minimize the impact of

SED-biases. With interval production people take sample coverage as an estimate of the probable

range of the unknown target quantity. Because, at small sample sizes, sample coverage is a biased

estimator, there is extreme overconfidence. In line with the constructive and malleable nature of

judgment in other domains (e.g., Kahneman & Tversky, 1979; Slovic, Griffin, & Tversky, 2002), the

implication is that people have the potential to assess both well calibrated and extremely overconfident

probability distributions for the same events.

We first verified that even the crude version of the NSM in Figure 3 is able to reproduce

most of the characteristic properties of previous data. In research motivated by the NSM (Winman

et al., 2004) it was demonstrated that the degree of sample overlap predicts the size of the format

dependence. We also devised a method whose end result is confidence intervals for

unknown quantities, as with traditional interval production, but where the participants rely on

proportion rather than coverage as the estimator variable, thereby minimizing the overconfidence in

the produced intervals. This result is notable not only because the overconfidence is extreme and

therefore of considerable applied concern, but also because cognitive biases are notoriously resistant

to debiasing (Fischhoff, 1982).

In learning experiments we corroborated that confidence intervals derive from distributions of

observed similar target values and that the overconfidence with interval production is minimally

affected by the sample size in long term memory, but strongly correlated with short term memory

capacity (Hansson et al., 2006). As predicted, in the same circumstances the overconfidence with

probability judgment is rapidly reduced already by modest training and the correlation with short

term memory capacity is low.

The NSM outlined in Figure 3 is an obvious simplification, ignoring many of the potential

contributors to overconfidence, such as the regression effects that may arise from stochastic

components other than sampling error (Erev et al., 1994), and the effects of attention and

memory captured by support theory (Tversky & Koehler, 1994) and of biased selection of task

items (Gigerenzer et al., 1991; Juslin, 1994). Although the simulation model of the NSM predicts

minor amounts of overconfidence (see Figure 6B & Footnote 4), some of the overconfidence

observed for probability judgment (e.g., with ADINA(R) in Figure 8) may derive from these and

other unknown origins of overconfidence. Despite its simplicity we conclude that it is plausible that

the NSM captures one important aspect of the processes that underlie overconfidence with interval

production and the format dependence effect.

Limitations of the NSM

One unsettling aspect is that there is no consensus on how people represent the variability in a

sample, except that it appears to be normalized against the sample mean and therefore is better

predicted by the coefficient of variation than by the standard deviation (Weber, Shafir, & Blais, 2004).

That sample coverage is biased, however, is firmly rooted in the relationship between sample and

population and surfaces robustly regardless of whether the NSM is framed as a parametric model

implying the computation of a sample standard deviation or as a simulation model that presumes the

direct use of sample fractiles.

To highlight its statistical logic the NSM was formulated in terms of frequency, something that

may suggest two limitations of the model: First, that it ignores that probability is often affected by

similarity (Tversky & Kahneman, 1983); second, that for many judgment tasks the applicability of a

sampling metaphor may be less obvious than for the country population tasks used here and, indeed,

often there may be no obvious reference classes from which to sample (Kahneman & Tversky,

1996). For example, it is not obvious what class of similar observations to consider when you

estimate the birth date of (the Swedish author and playwright) August Strindberg. We propose that

these concerns can be mitigated by implementing the NSM as an exemplar model where the

probability judgment is driven both by similarity and frequency (see, e.g., Dougherty, 2001;

Dougherty, Gettys, & Ogden, 1999; Juslin & Persson, 2002; Nilsson, Olsson, & Juslin, 2005; Sieck

& Yates, 2001).

The algorithm in Figure 3 is actually the special case of an exemplar model where similarity

takes only one of two discrete values: either an exemplar (e.g., a country) satisfies the cue and is

entered into the computations or it does not and is not processed. It would be straightforward to

relax these assumptions by allowing exemplars to enter the SSD as a continuous function of the

feature overlap (similarity) between probe and exemplars (see Juslin & Persson, 2002, for such a

relaxation in regard to probability judgment). An exemplar model does not require any pre-specified

reference class; only that similar exemplars are sampled into the SSD as a continuous function of

their similarity to the probe. If people naively report the properties of small proximal samples, the

inherent difference between proportion and coverage as statistical estimators nonetheless remains

relevant. Indeed, the statistical logic captured by the NSM is not premised on how the variability that

underlies the uncertainty is generated, and it holds also when the alternative possibilities are generated

by, for example, imagination or mental simulation (Kahneman & Tversky, 1982).
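A minimal sketch of such an exemplar implementation is given below. The feature coding, the exponential similarity function, and n = 4 are illustrative assumptions, not the specification of any of the cited models (e.g., PROBEX; Juslin & Persson, 2002).

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stored exemplars: a cue (feature) vector and a target value.
features = rng.normal(0.0, 1.0, size=(200, 5))  # e.g., encoded country cues
values = rng.lognormal(2.0, 1.5, size=200)      # e.g., their populations

def retrieve_ssd(probe, n=4):
    """Sample n exemplar values, weighted by their similarity to the probe,
    so that no pre-specified reference class is required."""
    dist = np.linalg.norm(features - probe, axis=1)
    sim = np.exp(-dist)                          # exponential similarity decay
    idx = rng.choice(len(values), size=n, replace=False, p=sim / sim.sum())
    return values[idx]

probe = rng.normal(0.0, 1.0, 5)                  # cues of an unknown object
lo, hi = np.percentile(retrieve_ssd(probe), [10, 90])
print(f"interval from a similarity-weighted SSD: [{lo:.1f}, {hi:.1f}]")
```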

Another question concerns the interpretation of the NSM, which could be interpreted either as

a computational-level theory or as an algorithm-level theory (Marr, 1982). A computational-level

theory specifies what is computed, regardless of the algorithm that executes the computation. An

algorithm-level theory specifies the exact algorithm for the computation. The NSM has largely been

described as an (albeit crude) algorithm-level theory of the processes and representations involved in

probability judgment. An alternative interpretation is as a computational-level theory where the

crucial claim by the NSM is that what the mind computes, however it is computed, corresponds not

to the function implied by normative statistics but to that implied by a statistician that is naïve in the specified sense.

We conclude that, by and large, the data seem consistent with the computational-level theory implied by

the NSM and note that data are accumulating (Dougherty, 2001; Dougherty et al., 1999; Juslin &

Persson, 2002; Nilsson et al., 2005; Sieck & Yates, 2001) suggesting that in an exemplar

implementation the NSM may also have validity as an algorithm-level theory.

What appears to be the most immediate empirical challenge to the NSM is the observation of

response order effects (as we shall see, it provides an even larger challenge to prominent previous

accounts of overconfidence with interval production). In seven experimental conditions, Block and

Harper (1991) documented pervasive effects of asking the participants to provide point estimates

before the intervals, with wider intervals and decreased overconfidence. The effect did not generalize

when participants were provided with point estimates by peers, suggesting that the process of

generating the point estimate and not the existence of an anchor is crucial for the effect. Soll and

Klayman (2004) further demonstrated that if participants make separate assessments of the upper

and lower limits of the interval, immediately following each other, rather than producing the interval as

a single judgment, the intervals became even wider with further reduction in the overconfidence bias.

That overconfidence is reduced by the point estimate suggests that, in effect, it temporarily

enlarges the sample applied to estimation, analogous to a priming effect (Juslin et al., 1999). If

consecutive estimates produce residual activation in short term memory, the effective sample size

may temporarily exceed the sample size retrieved for a single judgment, enlarging interval size and hit

rate. For example, the NSM implies that extending a sample size of 4 observations by 2 additional

observations increases the hit rates for 80% intervals by more than 10 percentage units, roughly

approximating the effects observed in data.
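The figure of roughly ten percentage points is straightforward to check by Monte Carlo; in the sketch below, the lognormal proxy environment and the 10th/90th fractile intervals are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
env = rng.lognormal(2.0, 1.5, 10_000)        # proxy environment (assumption)
trials = 20_000

for n in (4, 6):                             # sample size without/with priming
    hits = 0
    for _ in range(trials):
        truth = rng.choice(env)
        ssd = rng.choice(env, size=n)
        lo, hi = np.percentile(ssd, [10, 90])
        hits += lo <= truth <= hi
    print(f"n = {n}: hit rate of 80% intervals = {hits / trials:.2f}")
# In this toy environment the hit rate rises by roughly ten percentage points
# from n = 4 to n = 6, in the ballpark of the observed order effects.
```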

The priming account predicts that the order effect should occur only if the short term memory

content activated by the point estimate is still active when the interval is produced. In Juslin et al.

(1999), where the point estimates were assessed immediately before each interval production, the

median normalized interval size with point estimates first was significantly higher than with point

estimates after interval production (2.37 vs. 1.78; Wilcoxon, Z = 3.40, p < .001). When point

estimates were blocked before or after the block of interval productions (Winman et al., 2004),

implying at least ten minutes between the point estimates and the interval productions, there was no

priming effect (median normalized interval sizes of 1.22 vs. 1.23; Kruskal-Wallis, χ2 = 0, p = 1).

Although this result is consistent with the priming account, these data were not collected with this

question in mind and we therefore have to await further research to test it and compare it with

alternative accounts.

Alternative Explanations and Models

The naivety of the NSM is similar to the "law of small numbers" (Tversky & Kahneman,

1971), people's inclination to expect also small samples to accurately portray the population

properties. The traditional account of the law of small numbers is the representativeness heuristic,

which is insensitive to notions of sample size. Although the NSM implies that people should behave

as if they believe in the law of small numbers, clearly the explanation is different and emphasizes

accurate description of small samples rather than heuristic processing that replaces probability with

similarity.

The NSM also has resemblance to probabilistic mental models (PMM) theory (Gigerenzer et

al., 1991). As with the NSM, PMM-theory emphasizes that people represent knowledge of the

distributions of objects in natural environments, although more emphasis is placed on pre-computed

abstractions in the form of cue validities (see Juslin & Persson, 2002). The explanatory focus of the

PMM-theory is different, however, and complementary. In regard to overconfidence, PMM-theory

emphasizes the relationship between the items presented to participants in the laboratory and the

natural environments, capturing how biased item selection may produce overconfidence (see also

Juslin et al., 2000). The NSM, as discussed in this article, assumes representative task selection and

concentrates on biased sample properties and external biases in the information stored in memory.

The NSM further incorporates limited aspects of the error models (Erev et al., 1994; Juslin et

al., 1997; Soll, 1996). The error models show how unbiased stochastic noise in the judgment

process can produce overconfidence in the calibration analysis. The NSM addresses the sampling

error involved in retrieving the SSD from the SED, but not the other origins of noise, such as

response error in the use of the probability scale (see Juslin et al., 1997). The previous studies that

have applied error models to intuitive confidence intervals (Juslin et al., 1999; Soll & Klayman,

2004) clearly suggest that the regression effect implied by random error, at various stages of the

process, is insufficient to explain the observed bias.

The most well-known account of overconfidence with interval production is the anchoring-

and-adjustment heuristic (Tversky & Kahneman, 1974). According to this account, people

produce intervals by anchoring on their best guess and then adjusting insufficiently from this anchor.

Although this account has face validity and often appears to be accepted as the explanation of

overconfidence with interval production, it has limitations. First, there are concerns as to whether the

heuristic truly explains as opposed to describes the phenomenon, although there have been attempts

to explicate its cognitive bases (Chapman & Johnson, 2002). Second, and more seriously, it is

directly refuted by the finding that emphasizing the anchor by asking for a point estimate prior to the

interval increases the interval size and yields less rather than more overconfidence. In one of the few

other attempts to directly test the anchoring-and-adjustment explanation Juslin et al. (1999) showed

that the contribution was minor and insufficient to account for the magnitude of the overconfidence

bias.

Soll and Klayman (2004) proposed that the overconfidence bias is explained by selective and

confirmatory search of memory, and the observation of wider intervals when the upper and lower

limits were assessed separately was interpreted in these terms. When people assess the upper limit

they selectively activate knowledge consistent with a high target value (e.g., reasons why Thailand

may have a large population); when assessing the lower limit they selectively activate knowledge

consistent with a low target value (reasons why Thailand may have a small population). As a result,

the intervals become wider. This account again appears inconsistent with intervals that become wider

when preceded by a point estimate (Block & Harper, 1991; Clemen, 2001; Juslin et al., 1999; Soll

& Klayman, 2004): first concentrating on (or committing to) a best guess should, if anything,

instigate more priming of information consistent with or confirming the point estimate and thus shrink

the intervals.

Nonetheless, the assumption of perfectly unbiased retrieval in the NSM (the SSD is assumed

to be a random sample of the SED) is a strong assumption, but, as emphasized in the introduction, it

is a working assumption that may have to be rejected if assumptions of biased memory retrieval

prove to yield significantly improved prediction or explanation of the data. At present, however, we

conclude that the NSM accounts for most of the data already without recourse to confirmatory

search of memory. It should also be noted that the mechanism emphasized by the NSM, the naïve

interpretation of unbiased or biased sample properties, is not rendered irrelevant by the addition of

an assumption of biased retrieval.

More generally, our ability to compare the NSM to these competitors is at present hampered

by the fact that both the anchoring and the confirmatory search explanations remain too vaguely

specified. To apply these accounts to the range of phenomena discussed in this article, for example,

the format dependence effect, and the differential effect of experience and short term memory with

interval production and probability assessment, would require their quantitative specification and

numerous arbitrary decisions. By contrast, the NSM in its "plain vanilla" version requires a minimum

of such auxiliary assumptions to account for these phenomena and offers a practical method to

minimize the overconfidence.

Conclusions

Viewed from a distance, the issue of whether cognitive biases arise from heuristic processing

of the evidence or from naïve but accurate description of available samples may seem like splitting

hairs: the result is a judgment bias just the same. Data are, however, accumulating that, in regard to

many biases, the relation between samples and populations is a more important factor than heuristic

processing (e.g., Denrell, 2005; Fiedler, 2000; Fiedler & Juslin, 2006a; Hertwig et al., 2004). The

results in this article suggest that the NSM captures at least one important aspect of the

overconfidence and format dependence in previous studies and illustrate that these insights may be

crucial for reducing the biases.



References

Baddeley, A. (1998). Recent developments in working memory. Current Opinion in

Neurobiology, 8, 234-238.

Block, R. A., & Harper, D. R. (1991). Overconfidence in estimation: Testing the anchoring and

adjustment hypothesis. Organizational Behavior and Human Decision Processes, 49,

188-207.

Broadbent, D. E. (1975). The magic number seven after fifteen years. In A. Kennedy & A. Wilkes

(Eds.), Studies in Long Term Memory (pp. 3-18). London: Wiley.

Brunswik, E. (1955). Representative design and probabilistic theory in a functional psychology.

Psychological Review, 62, 193-217.

Chapman, G. B., & Johnson, E. J. (2002). Incorporating the irrelevant: Anchors in judgments of

belief and value. In T. Gilovich, D. W. Griffin, & D. Kahneman, (Eds.) Heuristics and

Biases: The Psychology of Intuitive Judgment (pp. 120-138). New York: Cambridge

University Press.

Clemen, R. T. (2001). Assessing 10-50-90s: A surprise. Decision Analysis Newsletter, 20, 2, 15.

Cowan, N. (2001). The magical number 4 in short-term memory: A reconsideration of mental

storage capacity. Behavioral & Brain Sciences, 24, 87-114; discussion 114-185.

Denrell, J. (2005). Why most people disapprove of me: Experience sampling in impression

formation. Psychological Review, 112, 951-978.

Dougherty, M. R. (2001). Integration of the ecological and error models of overconfidence using a

multiple-trace memory model. Journal of Experimental Psychology: General, 130, 579-

599.

Dougherty, M. R., Gettys, C. F., & Ogden, E. E. (1999). MINERVA-DM: A memory process

model for judgments of likelihood. Psychological Review, 106, 180-209.

Edwards, W. (1982). Conservatism in human information processing. In D. Kahneman, P. Slovic, &

A. Tversky (Eds.), Judgment under uncertainty: Heuristics and biases (pp. 359-369).

New York: Cambridge University Press.

Einhorn, H. J., & Hogarth, R. M. (1978). Confidence in judgment: Persistence of the illusion of

validity. Psychological Review, 85, 395-416.

Elwin, E., Juslin, P., Olsson, H., & Enkvist, T. (in press). Constructivist coding of selective

feedback. Psychological Science.

Erev, I., Wallsten, T. S., & Budescu, D. V. (1994). Simultaneous over- and underconfidence: The

role of random error in judgment processes. Psychological Review, 101, 519-527.

Estes, W. K. (1976). The cognitive side of probability learning. Psychological Review, 83, 37-64.

Fischer, I., & Budescu, D. V. (2005). When do those who know more also know more about how

much they know? The development of confidence and performance in categorical decision

tasks. Organizational Behavior and Human Decision Processes, 98, 39-53.

Fischhoff, B. (1982). Debiasing. In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment under

uncertainty: Heuristics and biases (pp. 422-444). New York: Cambridge University

Press.

Fiedler, K. (2000). Beware of samples! A cognitive-ecological sampling approach to judgment

biases. Psychological Review, 107, 659-676.

Fiedler, K., & Juslin, P. (2006a). Information sampling and adaptive cognition. New York:

Cambridge University Press.

Fiedler, K., & Juslin, P. (2006b). Taking the interface between mind and environment seriously. In

K. Fiedler, & P.Juslin (Ed.), Information sampling and adaptive cognition. New York:

Cambridge University Press.

Gigerenzer, G., Hoffrage, U., & Kleinbölting, H. (1991). Probabilistic mental models: A Brunswikian

theory of confidence. Psychological Review, 98, 506-528.

Gigerenzer, G., & Murray, D. J. (1987). Cognition as intuitive statistics. Hillsdale, NJ: Erlbaum.

Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Krüger, L. (1989). The empire of

chance. Cambridge: Cambridge University Press.

Gilovich, T., Griffin, D., & Kahneman, D. (2002). Heuristics and biases: The psychology of

intuitive judgment. Cambridge: Cambridge University Press.

Goldstein, D. G., & Gigerenzer, G. (2002). Models of ecological rationality: The recognition

heuristic. Psychological Review, 109, 75-90.

Grosskopf, B., Erev, I., & Yechiam, E. (2001). Foregone with the Wind. Retrieved April 7, 2006,

from http://www2.gsb.columbia.edu/faculty/ierev/foregone4.pdf

Hacking, I. (1975). The emergence of probability. London: Cambridge University Press.

Hansson, P., Juslin, P., & Winman, A. (2006). The role of short term memory and task

experience for overconfidence in judgment under uncertainty. Manuscript submitted for

publication, Department of Psychology, Umeå University, Sweden.

Hasher, L., & Zacks, R. T. (1979). Automatic and effortful processes in memory. Journal of

Experimental Psychology: General, 108, 356-388.

Hertwig, R., Pachur, T., & Kurzenhäuser, S. (2005). Judgments of risk frequencies: Tests of possible

cognitive mechanisms. Journal of Experimental Psychology: Learning, Memory and

Cognition, 31, 621-642.

Hertwig, R., Barron, G., Weber, E. U., & Erev, I. (2004). Decisions from experience and the effect

of rare events in risky choice. Psychological Science, 15, 534-539.



Johnson-Laird, P. N. (1983). Mental models. Cambridge, MA: Harvard University Press.

Juslin, P. (1994). The overconfidence phenomenon as a consequence of informal experimenter-

guided selection of almanac items. Organizational Behavior and Human Decision

Processes, 57, 226-246.

Juslin, P., Olsson, H., & Björkman, M. (1997). Brunswikian and Thurstonian origins of bias in

probability assessment: On the interpretation of stochastic components of judgment. Journal

of Behavioral Decision Making, 10, 189-209.

Juslin, P., & Persson, M. (2002). PROBabilities from EXemplars (PROBEX): A "lazy" algorithm for

probabilistic inference from generic knowledge. Cognitive Science, 26, 563-607.

Juslin, P., Wennerholm, P., & Olsson, H. (1999). Format dependence in subjective probability

calibration. Journal of Experimental Psychology: Learning, Memory, and Cognition,

25, 1038-1052.

Juslin, P., Winman, A., & Olsson, H. (2000). Naive empiricism and dogmatism in confidence

research: A critical examination of the hard-easy effect. Psychological Review, 107, 384-

396.

Juslin, P., Winman, A., & Olsson, H. (2003). Calibration, additivity, and source independence of

probability judgments in general knowledge and sensory discrimination tasks.

Organizational Behavior and Human Decision Processes, 92, 34-51.

Just, M. A., & Carpenter, P. A. (1992). A capacity theory of comprehension: Individual differences

in working memory. Psychological Review, 99, 122-149.

Kahneman, D., & Frederick, S. (2002). Representativeness revisited: Attribute substitution in

intuitive judgment. In T. Gilovich, D. W. Griffin, & D. Kahneman (Eds.), Heuristics and

biases: The psychology of intuitive judgment (pp. 49-81). New York: Cambridge

University Press.

Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of decision under risk.

Econometrica, 47, 263-291.

Kahneman, D., & Tversky, A. (1982). The simulation heuristic. In D. Kahneman, P. Slovic, & A.

Tversky (Eds.), Judgment under uncertainty: Heuristics and biases (pp. 201-208). New

York: Cambridge University Press.

Kahneman, D., & Tversky, A. (1996). On the reality of cognitive illusions. Psychological Review,

103, 582-591.

Kahneman, D., Slovic, P., & Tversky, A. (1982). Judgment under uncertainty: Heuristics and

biases. Cambridge: Cambridge University Press.

Kareev, Y., Arnon, S., & Horwitz-Zeliger, R. (2002). On the misperception of variability. Journal

of Experimental Psychology: General, 131, 287-297.

Keren, G. (1987). Facing uncertainty in the game of bridge. Organizational Behavior and Human

Decision Processes, 39, 98-114.

Keren, G. (1991). Calibration and probability judgments. Conceptual and methodological issues.

Acta Psychologica, 77, 217-273.

Klayman, J., Soll, J. B., Gonzalez-Vallejo, C., & Barlas, S. (1999). Overconfidence: It depends on

how, what, and whom you ask. Organizational Behavior and Human Decision

Processes, 79, 216-247.

Lichtenstein, S., Fischhoff, B., & Phillips, L. D. (1982). Calibration of subjective probabilities: The

state of the art up to 1980. In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment

under uncertainty: Heuristics and biases (pp. 306-334). New York: Cambridge

University Press.

Lichtenstein, S., Slovic, P., Fischhoff, B., Layman, M., & Combs, B. (1978). Judged frequency of

lethal events. Journal of Experimental Psychology: Human Learning and Memory, 4,

551-578.

Macmillan, N. A., & Creelman, C. D. (1991). Detection theory: A user's guide. New York:

Cambridge University Press.

Marr, D. (1982). Vision. San Francisco: W. H. Freeman.

Miller, G. A. (1956). The magical number seven plus or minus two: Some limits on our capacity for

processing information. Psychological Review, 63, 81-97.

Murphy, A. H., & Brown, B. G. (1985). A comparative evaluation of objective and subjective

weather forecasts in the United States. In G. Wright (Ed.), Behavioral decision making

(pp. 329-359). New York: Plenum Press.

Murphy, A. H., & Winkler, R. L. (1977). Reliability of subjective probability forecasts of

precipitation and temperature. Applied Statistics, 26, 41-47.

Newell, A., & Simon, H. (1972). Human problem solving. Englewood Cliffs, NJ: Prentice-Hall.

Nilsson, H., Olsson, H., & Juslin, P. (2005). The cognitive substrate of subjective probability.

Journal of Experimental Psychology: Learning, Memory, and Cognition, 31, 600-620.

Nosofsky, R. M., & Johansen, M. K. (2000). Exemplar-based accounts of multiple-system

phenomena in perceptual categorization. Psychonomic Bulletin & Review, 7, 375-402.

Peterson, C. R., & Beach, L. R. (1967). Man as an intuitive statistician. Psychological Bulletin, 68,

29-46.

Piaget, J. & Inhelder, B. (1956). The child's conception of space. London: Routledge and Kegan

Paul.

Pitz, G. F. (1974). Subjective probability distributions for imperfectly known quantities. In L. W.



Gregg (Ed.), Knowledge and cognition. (pp. 29-41). Potomac, MD: Erlbaum.

Poulsen, O. M., Holst, E., & Christensen, J. M. (1997). Calculation and application of coverage

intervals for biological reference values. Pure and Applied Chemistry, 69, 1601-1611.

Raven, J. C. (1965). Advanced Progressive Matrices, Set I and II. London: H. K. Lewis.

Russo, J. E., & Schoemaker, P. J. (1992). Managing overconfidence. Sloan Management Review,

33, 7-17.

Schwarz, N., & Wänke, M. (2002). Experiential and contextual heuristics in frequency judgment:

ease of recall and response scales. In P. Sedlmeier & T. Betsch (Eds.), Etc. Frequency

processing and cognition (pp. 89-108). Oxford, UK: Oxford University Press.

Sedlmeier, P., & Betsch, T. (2002). Etc. Frequency processing and cognition. Oxford, UK:

Oxford University Press.

Sieck, W. R., & Yates, J. F. (2001). Overconfidence effects in category learning: A comparison of

connectionist and exemplar memory models. Journal of Experimental Psychology:

Learning, Memory, and Cognition, 27, 1003-1021.

Slovic, P., Griffin, D., & Tversky, A. (2002). Compatibility effects in judgment and choice. In T.

Gilovich, D. W. Griffin, & D. Kahneman, (Eds.) Heuristics and biases: The psychology of

intuitive judgment (pp. 217-229). New York: Cambridge University Press.

Soll, J. B. (1996). Determinants of overconfidence and miscalibration: The roles of random error and

ecological structure. Organizational Behavior and Human Decision Processes, 65, 117-

137.

Soll, J. B., & Klayman, J. (2004). Overconfidence in interval estimates. Journal of Experimental

Psychology: Learning, Memory, and Cognition, 30, 299-314.

Stewart, N., Chater, N., & Brown, G. D. A. (2006). Decision by sampling. Cognitive Psychology,
The Nave Intuitive 59

53, 1-26.

Tomassini, L. A., Solomon, I., Romney, M. B., & Krogstad, J. L. (1982). Calibration of auditors' probabilistic judgments: Some empirical evidence. Organizational Behavior and Human Performance, 30, 391-406.

Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105-110.

Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124-1131.

Tversky, A., & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review, 90, 293-315.

Tversky, A., & Koehler, D. J. (1994). Support theory: A nonextensional representation of subjective probability. Psychological Review, 101, 547-567.

Weber, E. U., Shafir, S., & Blais, A.-R. (2004). Predicting risk sensitivity in humans and lower animals: Risk as variance or coefficient of variation. Psychological Review, 111, 430-445.

Winman, A., Hansson, P., & Juslin, P. (2004). Subjective probability intervals: How to reduce overconfidence by interval evaluation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30, 1167-1175.

Winman, A., & Juslin, P. (1993). Calibration of sensory and cognitive judgments: Two different accounts. Scandinavian Journal of Psychology, 34, 135-148.

Yaniv, I., & Foster, D. P. (1995). Graininess of judgment under uncertainty: An accuracy-informativeness trade-off. Journal of Experimental Psychology: General, 124, 424-432.

Yaniv, I., & Foster, D. P. (1997). Precision and accuracy of judgmental estimation. Journal of Behavioral Decision Making, 10, 21-32.

Yechiam, E., & Busemeyer, J. R. (2006). The effect of foregone payoffs on underweighting small probability events. Journal of Behavioral Decision Making, 19, 1-16.

Zacks, R. T., & Hasher, L. (2002). Frequency processing: A twenty-five year perspective. In P. Sedlmeier & T. Betsch (Eds.), Etc. Frequency processing and cognition (pp. 21-36). Oxford, UK: Oxford University Press.

Appendix A:

Details of the Monte Carlo Simulations

Interval production. In each iteration a target country T with population y_T was randomly sampled among the 188 world nations. The continent of the country was used as the cue C. A sample of n exemplars X_i was drawn randomly without replacement from the reference class R_C of exemplars from the same continent as T (excluding T). For the 100%, 75%, and 50% intervals, the fractiles of the sample distributions were used to generate the appropriate interval for the value of y_T, as detailed by Steps 1, 2, and 3b in Figure 3. For finite and small samples, not all fractiles are explicitly represented in the sample as an observed value; in this case the missing fractiles have to be derived by interpolating between the observed values.

There exist several methods for interpolating fractiles from finite samples (see http://www.maths.murdoch.edu.au/units/c503/unitnotes/boxhisto/quartilesmore.html for a discussion of the two most common methods). The simulations rely on a standard procedure, commonly referred to as the EXCEL method, which is the only customarily used procedure directly applicable to the present simulations with small sample sizes. The fractiles are calculated as follows. Let n be the number of cases and let p be the percentile value divided by 100. Express (n - 1)p as (n - 1)p = j + g, where j is the integer part and g is the fractional part of (n - 1)p. The fractile value is x_{j+1} when g = 0 and x_{j+1} + g(x_{j+2} - x_{j+1}) when g > 0. The proportion of intervals that included the true value was recorded. Realism or calibration requires that the proportions are 1.0, .75, and .50, respectively, for the 100%, 75%, and 50% confidence intervals. The results are presented in Figure 6A.

Interval evaluation: independent samples. For the evaluation of independently defined intervals, the intervals produced in the first simulation were taken to define the events (the event being that the population of T falls within the interval). For each such interval (now pre-stated and fed to the algorithm rather than produced by it), the algorithm produces a probability judgment from a new and independent random sample according to Steps 1, 2, and 3a of the NSM. Because the event (interval) is defined independently of the specific sample retrieved to make the interval evaluation, and because the judgment in effect uses a sample proportion to estimate a population proportion, it yields an almost unbiased estimate (see Figure 6B).

Interval evaluation: dependent samples. The same basic procedure was used to simulate the case where the samples retrieved on the two occasions are dependent and overlap more than expected by chance. When retrieving the observations on the second occasion (interval evaluation), each observation was sampled from the observations in the sample from the first occasion (interval production) with resampling probability r. With probability 1 - r, each observation was instead sampled randomly among the observations within R_C that were not part of the first sample. All samplings were made without replacement. The simulations verified that the difference between the overconfidence with interval production and with interval evaluation diminishes as a function of the resampling probability (sample overlap).

Robustness. To further ascertain the robustness of the predictions in Figure 6A, we applied the algorithm to several idealized target variables for which the effects of distributional shape and of discrepancies between the SEDs and OEDs are more easily determined. Simulations based on uniform, positively skewed, negatively skewed, U-shaped, and inversely U-shaped Beta distributions indicate that, as long as the SED is representative of the OED, overconfidence is of similar magnitude across a variety of distributional shapes. The simulations also reveal that the difference between the overconfidence with interval production and with full-range probability assessment (the format dependence) is constant across different distribution shapes and deviations between the SEDs and OEDs.

The simulations moreover show that systematic deviations between the SEDs and OEDs do not always produce more overconfidence; the effect may actually be negative and pull towards reduced overconfidence or even underconfidence. In these simulations three different OEDs (a uniform, a U-shaped, and an inversely U-shaped distribution) are crossed factorially with the corresponding three distribution shapes of the SEDs. For example, a uniform OED and a U-shaped SED may correspond to a target variable where all values are equally likely in the environment but, because extreme values are more memorable, the SED over-represents low and high values. Generally speaking, the results showed that a U-shaped SED produces less overconfidence because most observations are either small or large, leading to wide intervals. Inversely U-shaped SEDs contain mostly intermediate target values and therefore tend to produce narrow intervals and strong overconfidence.

The simulations further assume that people have exact knowledge of a number of target values, for example, about the populations of some countries. To control for imperfect knowledge, the simulations in Figure 6A were repeated with the population figures in the database perturbed by a random error representing imperfect knowledge, with zero expectation and a standard deviation of either 10%, 20%, 30%, 40%, or 50% of the true population of the country. The predicted overconfidence was robust and virtually unaffected up to a standard error of about 40% of the population figure, after which it diminished slightly. The empirically obtained point estimates for country populations observed in Winman et al. (2004) suggest an average absolute error in the region of 8-9% of the population figures. Similar conclusions about the robustness of the predictions hold when we allow for the possibility that people confuse the cues (e.g., attribute a country to the wrong continent in some proportion of trials). The predictions are thus not dependent on the idealization of accurate knowledge but hold for much less accurate knowledge than that actually observed.
The Nave Intuitive 64

Appendix B:

Details of the Empirical Study

The aim of the experiment was to see whether participants are able to generate SEDs that represent the OEDs and whether intuitive confidence intervals are determined by the SED. We used two distinct distributions: a U-shaped and an inversely U-shaped one. The predictions that follow from the NSM for interval production based on these two distributions are straightforward: if the intervals are generated from a U-shaped SED, the average interval should be considerably wider than intervals generated from an inversely U-shaped SED. Another way to verify that SEDs mirror OEDs is to look at the distributions of the point estimates and their standard deviations: the standard deviation of a U-shaped SED should be numerically larger than that of an inversely U-shaped SED.

Method

Participants. Thirty undergraduate students (15 women and 15 men), 15 in each condition, with an average age of 25 years, participated in the experiment. The participants received 150 SEK (approximately 20 US dollars) for taking part.

Materials and apparatus. The experiment was carried out on a PC. The judgment task involved estimating the revenue of fictitious business companies. Sixty target values corresponding to the revenues of these companies (range = 1-1000), distributed either as U-shaped or as inversely U-shaped, defined the OED. The U-shaped condition accordingly had a high frequency of values close to 1 and to 1000, whereas the inversely U-shaped condition consisted predominantly of target values close to 500. The target values were randomly attached to names in a database containing 156 real company names, with a new and independent random assignment for each participant.



Design and procedure. The experiment had a between-subjects design and the participants were randomly assigned to one of the two conditions (U-shaped or inversely U-shaped). In the learning phase participants guessed the revenue of 60 companies, receiving outcome feedback about the correct target value. The learning phase ended when a criterion of 40% (24 of the 60 revenue figures) correctly reproduced had been reached. In a test phase, participants produced an initial point estimate followed by intuitive confidence intervals for the 60 companies they had previously seen in training and for 60 new ones. After the test phase each participant assessed how many of the 60 companies encountered during the training phase fell within each of ten pre-defined intervals (i.e., 0-100, 101-200, 201-300, 301-400, 401-500, 501-600, 601-700, 701-800, 801-900, 901-1000). The dependent measures were the interval size, the standard deviation of the point estimates in the test phase, and the reproduced SED.



Author Note

Peter Juslin, Department of Psychology, Uppsala University; Anders Winman, Department of Psychology, Uppsala University; Patrik Hansson, Department of Psychology, Umeå University.

This research was supported by the Swedish Research Council and the Bank of Sweden Tercentenary Foundation. We are indebted to Nick Chater, Tommy Enquist, Josh Klayman, Ben Newell, Håkan Nilsson, Henrik Olsson, and Jack Soll for useful discussions and comments on the issues addressed in this article.

Correspondence concerning this paper should be addressed to: Peter Juslin, Department of

Psychology, Uppsala University, SE 751 42 Uppsala, Sweden, or to the e-mail address:

Peter.Juslin@psyk.uu.se

Footnotes

1. Working assumption is used to emphasize that we do not deny that the cognitive processes applied to the available information may at times be biased and violate normative principles. Traditionally, when judgment bias (defined in one way or another) is encountered, the most common explanatory scheme is to take the information input as given while, in comparison, often hastening to postulate cognitive algorithms that account for the deviations from normative responses. The strategy implied by the metaphor of the naïve intuitive statistician is to start the analysis the other way around: to assume, as a working assumption, that the cognitive algorithms are consistent with normative principles and to place the explanatory burden at the other end, on the information samples to which the cognitive algorithms are applied. In the end, of course, in many situations we may have to reject this working assumption.

2. One finds at least two different interpretations of availability in the literature (see Schwarz & Wänke, 2002). The stronger version, which implements the variable substitution approach advocated in the heuristics and biases literature (see Kahneman & Frederick, 2002), implies that assessments of frequency are substituted by assessments of how easy it is to recall instances. Another interpretation is that the judgments do derive from judgments of frequency, but the samples are themselves biased because of external biases in the information or the effects of biased recall. In the latter case, of course, the processes are better described by the metaphor of the naïve intuitive statistician, because the proximal samples are described in terms of relative frequency and no intensional heuristics are involved.

3. Note that this holds when plotting the mean sample proportion P as a function of the population proportion p. If the expected population proportion is instead plotted as a function of the observed sample proportion for a given sample size, there is a regression effect that may sometimes contribute to overconfidence bias (see Soll, 1996; Juslin et al., 1997).

4. The main differences between the parametric and the non-parametric model are the following. If the SED is a finite set, an additional source adds marginally to the error rate: by necessity, in many estimation problems the target quantity (e.g., Thailand) cannot itself be an observation in the retrieved SSD (a sample of Asian countries other than Thailand). Of course, if the target quantity can be retrieved there is no need to estimate it. That the SSD always excludes the target quantity lowers the hit rates somewhat. Further, the assumption that the quantity can take on infinitely low or high values increases the hit rates for high-confidence intervals as compared to a quantity with defined lower and upper bounds. By contrast, our simulations suggest that the exact way in which sample variability is represented (e.g., standard deviation, interquartile range) has minor effects: as long as the measure is applied in the same way regardless of sample size, the qualitative effect is overconfidence.

5. The literature on working memory and short-term memory is complex, with no consensus in regard to the use of these terms (see Cowan, 2001, for a discussion). In this article we use the term short-term memory for the capacity to passively maintain information retrieved from long-term memory for the processing implied by the NSM.

6. Does this mean that experts produce confidence intervals that are no better than those produced by novices? Clearly not: Although there are architectural constraints that determine overconfidence, in general experts know better predictive cues of relevance to the estimation task. Better cues allow the experts to access more homogeneous OEDs, where the target values of the observations are similar, allowing higher accuracy and tighter intervals.

7. This only holds as an approximation in the non-parametric implementation, because even if all fractiles can be generated from a small sample by interpolation, when the sample is retrieved again it still contains only a limited number of observations.

8. With ADINA(R) the independence strictly holds only for the first probability judgment; already at the second interval, its size is a function of the participant's beliefs. The results presented in Figure 8A are for the first interval, where the assumption of independence is best approximated and the experimental manipulation is most extreme.

9. The reason why the mean subjective probabilities are not identical in the three conditions in Figure 8B is missing data in the ADINA conditions: the ADINA procedure terminated the sequence if the desired probability had not been produced within 24 trials.

Table 1

Linear Multiple Regression Models with Overconfidence as the Dependent Variable and Short-Term Memory Capacity (n by STM), Long-Term Memory Content (n by LTM), and SED Bias as the Independent Variables, for Experiments 1, 2, and 3 in Hansson et al. (2006). Adapted from Hansson, P., Juslin, P., & Winman, A. (2006). The Role of Short Term Memory and Task Experience for Overconfidence in Judgment under Uncertainty. (Submitted for publication.)

                                          Model                 n by STM           n by LTM           SED bias
Trials of training                  R      N      p         β        p          β       p         β       p
Interval production, 68 trials     .21    20     .86       .04     .44a       -.24     .41      -.11    .35a
  (Experiment 1)
Interval production, 272 trials    .82*   15     .006     -.72*    .001a       .29     .13       .39*   .029a
  (Experiment 2)
Interval production, 544 trials    .68    15     .072     -.67*    .012a       .38     .21      -.26    .34
  (Experiment 2)
Interval production, 232 trials    .42*   60     .02      -.33*    .00a        .00     .96       .12    .17a
  (Experiment 3)
Interval evaluation, 232 trials    .16    61     .67      -.15     .25        -.04     .78       .00    .94
  (Experiment 3)

Note: R = multiple correlation; N = number of independent observations; p = p-value associated with the model or weight; β = beta weight for the predictor.

* Correlations statistically significant beyond alpha = .05.

a These p-values correspond to one-tailed tests of statistical significance.

Figure Captions

Figure 1. Schematic summary of three research programs on intuitive judgment. Research

guided by the perspective of the intuitive statistician (Panel A) has often found judgments that are in

approximate agreement with normative models, presumably because of the concentration on well-structured tasks in the laboratory. Research on heuristics and biases (Panel B) has emphasized biases in intuitive judgment, attributing them to heuristic processing of the evidence available to the judge. The naïve intuitive statistician (Panel C) emphasizes that the judgments are accurate descriptions of the available samples and that performance depends on whether the samples available to the judge are biased or unbiased, and on whether the estimator variable affords unbiased estimation directly from the sample properties.

Figure 2. Format dependence. Panel A: Application of the half-range, the full-range, and the interval production formats to elicit a participant's belief about the population of Thailand. Panel B: With the half-range format there is close to zero over/underconfidence, with the full-range format there is marginal overconfidence, and with the interval production format there is extreme overconfidence. Panel B is reproduced from Juslin, P., & Persson, M. (2002). PROBabilities from EXemplars (PROBEX): A "lazy" algorithm for probabilistic inference from generic knowledge. Cognitive Science, 26, 563-607.

Figure 3. The Naïve Sampling Model (NSM) of format dependence.

Figure 4. Panel A: Probability density function for the distribution of target values in the reference class defined by a cue (the OED). The values on the target dimension have been standardized to have mean 0 and standard deviation 1. The interval between the .75th and the .25th fractiles of the population distribution includes 50% of the population values. Panel B: Schematic illustration of the expectation for a sample of 4 exemplars with the sample mean at the population mean. The values on the target dimension are expressed in units standardized against the population variance. The interval between the .75th and the .25th fractiles of the sample distribution includes 39% of the population values. Panel C: Schematic illustration of the expectation for a sample of 4 exemplars with the sample mean displaced relative to the population mean. The values on the target dimension are expressed in units standardized against the population variance. The interval between the .75th and the .25th fractiles of the sample distribution on average includes 34% of the population values.

Figure 5. Predictions by an implementation of the NSM as a parametric statistical model. Panel A: The relative interval width, that is, the ratio between the interval width w predicted by the NSM and the interval width W required for calibration, as a function of sample size. Panel B: Hit rates as a function of sample size and the subjective probability of the interval, assuming that the SED is perfectly representative of the OED. Panel C: Hit rates as a function of the bias in the SED (the mean SED is 0, .5, or 1 standard deviation larger than the mean OED) and the subjective probability of the interval, assuming sample size 4.

Figure 6. Predictions by the NSM as a non-parametric simulation model applied to the database of world-country populations. Panel A: Proportion of correct target values included in the intervals for the three different probabilities and three different sample sizes (n). The identity line shows the proportions required for perfect calibration. The graph also illustrates typical empirical data from interval production for world country populations, adapted from Winman, A., Hansson, P., & Juslin, P. (2004). Subjective probability intervals: How to reduce overconfidence by interval evaluation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30, 1167-1175. Panel B: The resulting predicted format dependence. On the y-axis is the overconfidence score, defined as the mean probability of the interval minus the proportion of values falling within the interval. The graph also illustrates typical empirical data in regard to judgments of world country populations, adapted from Juslin, P., Winman, A., & Olsson, H. (2003). Calibration, additivity, and source independence of probability judgments in general knowledge and sensory discrimination tasks. Organizational Behavior and Human Decision Processes, 92, 34-51. (This line for empirical data is difficult to discern in the graph because it overlaps with the prediction for n=3.)

Figure 7. Panel A: Empirical data for full-range probability assessment and interval production

in regard to world country populations from Juslin, P., Winman, A., & Olsson, H. (2003).

Calibration, additivity, and source independence of probability judgments in general knowledge and

sensory discrimination tasks. Organizational Behavior and Human Decision Processes, 92, 34-

51. Panel B: The corresponding predictions by the NSM.

Figure 8. Panel A: Mean confidence and proportion of intervals that include the population value, as computed from the response to the first a priori interval for each of the three assessment methods (with 95% CI, n=15). Panel B: Mean confidence and proportion of intervals that include the population value, as computed for the final, homed-in confidence intervals for each of the three assessment methods (with 95% CI, n=15). Overconfidence is the difference between the mean confidence and the proportion. Adapted from Winman, A., Hansson, P., & Juslin, P. (2004). Subjective probability intervals: How to reduce overconfidence by interval evaluation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30, 1167-1175.

Figure 9. Panel A: Proportion of correctly recalled target values as a function of the number of training blocks in a laboratory learning task. The main effect of the number of training blocks (1/2 block is 68 training trials, 2 blocks is 272 training trials, and 4 blocks is 544 training trials) is sizeable and statistically significant. Panel B: Overconfidence in interval production as a function of the number of training blocks in the laboratory learning task. Overconfidence is the difference between the probability of the interval and the hit rate of target values included by the interval. The main effect of the number of training blocks is small and not statistically significant. The dotted line denoted NSM (n=4) represents the prediction by the NSM assuming sample size 4 and no SED bias. The lower dotted line denoted probability judgment refers to the observed overconfidence in the interval evaluation task after .5 training blocks. Panel C: Overconfidence and the statistically significant interaction effect between short-term memory capacity and assessment format. Adapted from Hansson, P., Juslin, P., & Winman, A. (2006). The Role of Short Term Memory and Task Experience for Overconfidence in Judgment under Uncertainty. (Submitted for publication.) Department of Psychology, Umeå University, Sweden.

Figure 10. First column (from the left): Predictions by the NSM (n=4) of the relationship between interval sizes, mean absolute error, and overconfidence, separately for .5 (Panel A), .80 (Panel B), and 1.0 (Panel C) intervals. Second column: The observed relationship between interval sizes, mean absolute error, and overconfidence in interval production with half a block of training (68 trials), for the same three interval probabilities. Third column: The observed relationship with two blocks of training (272 trials). Fourth column: The observed relationship with four blocks of training (544 trials). Adapted from Hansson, P., Juslin, P., & Winman, A. (2006). The Role of Short Term Memory and Task Experience for Overconfidence in Judgment under Uncertainty. (Submitted for publication.) Department of Psychology, Umeå University, Sweden.



Figure 1

Panel A. THE INTUITIVE STATISTICIAN: Environment -> Task sample -> Judgment. Veridicality is granted by training in well-structured laboratory tasks; the outcome is close to normative judgment.

Panel B. HEURISTICS AND BIASES: Environment -> Task sample -> Judgment. Biases are attributed to heuristic processing of the available sample; the outcome is biased judgment.

Panel C. THE NAIVE INTUITIVE STATISTICIAN: Environment -> Task sample -> Judgment. The key question is the appropriateness of using sample properties to estimate population properties: (a) sampling biases? (b) biased estimators? The outcome is normative or biased judgment.

Figure 2

Interval production format:
Give the smallest interval which you are .xx % certain to include the population of Thailand. Between _____ and _____ million inhabitants.

Half-range format:
Does the population of Thailand exceed 25 million? Yes/No. How confident are you that your answer is correct? 50% (Guessing), 60%, 70%, 80%, 90%, 100% (Certain).

Full-range format:
The population of Thailand exceeds 25 million. What is the probability that this proposition is correct? 0% (Certainly false), 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100% (Certainly true).

[Panel B: Bar graph of over/underconfidence (y-axis from -0.1 to 0.4) for the half-range, full-range, and interval production response formats.]

Figure 3

Unknown quantity.
1. Retrieve cue: Retrieve a cue that covaries with the quantity; the cue defines an objective environmental distribution (OED).
2. Retrieve sample: Retrieve n observations from the corresponding subjective environmental distribution (SED).
3. Naïve estimation: Estimate population properties directly from the sample distribution (SSD) and translate them into the required response format.
3a. Interval evaluation: Use the proportion in the SSD to estimate the proportion in the population; this yields little or no overconfidence.
3b. Interval production: Use the coverage in the SSD to estimate the coverage in the population; this yields extreme overconfidence.

Figure 4

[Panel A: OED (or Large Sample SD); probability density over the standardized target dimension, with the central interval including 50% of the population values. Panel B: SD, Small Sample and Centered Interval; the corresponding interval includes 39% of the population values. Panel C: SD, Small Sample and Dislocated Interval; the corresponding interval includes 34% of the population values. X-axes: standardized OED variance (-3.50 to 3.50); y-axes: probability density (0.00 to 0.60).]

Figure 5

[Panel A: Interval Size; relative interval size (predicted/calibrated, 0.0-1.0) as a function of sample size n (0-22), for .5, .75, and .99 intervals. Panel B: Hit-rates (no SED bias); hit rate as a function of interval probability (.50, .75, .95) for calibrated intervals and n = 3, 4, 5. Panel C: Hit-rates (n=4, SED bias); hit rate as a function of interval probability for calibrated intervals and SED biases of 0, .5, and 1 sd.]

Figure 6

[Panel A: NSM (no SED bias); proportion of target values included in the intervals as a function of subjective probability (0.50, 0.75, 1.00), with curves for calibration, empirical data, and n = 3, 4, 5. Panel B: NSM (no SED bias); overconfidence (0.0-0.5) for interval production versus probability judgment, with curves for calibration, empirical data, and n = 3, 4, 5.]

Figure 7

[Panel A: Empirical data; proportion as a function of subjective probability (0.0-1.0) for probability judgment, interval production, and calibration. Panel B: NSM (n=4, no SED bias); the corresponding predictions.]

Figure 8

[Panel A: First Interval; proportion correct and subjective probability (0.0-1.0) for the Control, ADINA(O), and ADINA(R) conditions. Panel B: Final Interval; proportion correct and subjective probability for the same three conditions.]

Figure 9
[Panel A: Recalled Target Values; proportion recalled (0.0-0.5) as a function of training blocks (0-4). Panel B: Overconfidence for Estimates; overconfidence (0.0-0.5) as a function of training blocks, with reference lines for NSM (n=4) and probability judgment. Panel C: Format Dependence and STM Capacity; overconfidence for the probability versus interval assessment formats, for low and high STM capacity.]

Figure 10
