Abstract
This is the write-up of a NIKHEF topical lecture series on Bayesian inference.
The topics covered are the definition of probability, elementary probability calculus and assignment, selection of least informative probabilities by the maximum
entropy principle, parameter estimation, systematic error propagation, model selection and the stopping problem in counting experiments.
Contents
1 Introduction
2 Bayesian Probability
1 Introduction
The Frequentist and Bayesian approaches to statistics differ in the definition of probability. For a Frequentist, probability is the relative frequency of the occurrence of an
event in a large set of repetitions of the experiment (or in a large ensemble of identical
systems) and is, as such, a property of a so-called random variable. In Bayesian statistics, on the other hand, probability is not defined as a frequency of occurrence but as
the plausibility that a proposition is true, given the available information. Probabilities
are then, in the Bayesian view, not properties of random variables but a quantitative
encoding of our state of knowledge about these variables. This view has far-reaching
consequences when it comes to data analysis since Bayesians can assign probabilities to
propositions, or hypotheses, while Frequentists cannot.
In these lectures we present the basic principles and techniques underlying Bayesian
statistics or, rather, Bayesian inference. Such inference is the process of determining
the plausibility of a conclusion, or a set of conclusions, which we draw from the available
data and prior information.
Since we derive in this write-up (almost) everything from scratch, little reference is made
to the literature. So let us start by giving below our own small library on the subject.
Perhaps one of the best review articles (from an astronomer's perspective) is T. Loredo, 'From Laplace to Supernova SN 1987A: Bayesian Inference in Astrophysics' [1]. Very illuminating case studies are presented in 'An Introduction to Parameter Estimation using Bayesian Probability Theory' [2] and 'An Introduction to Model Selection using Probability Theory as Logic' [3] by G.L. Bretthorst. A nice summary of Bayesian statistics from a particle physics perspective can be found in the article 'Bayesian Inference in Processing Experimental Data' by G. D'Agostini [4].
A good introduction to Bayesian methods is given in the book by Sivia, 'Data Analysis: A Bayesian Tutorial' [5]. More extensive, with many worked-out examples in Mathematica, is the book by P. Gregory (also an astronomer), 'Bayesian Logical Data Analysis for the Physical Sciences' [6]. The ultimate reference, but certainly not for the fainthearted, is the monumental work by Jaynes, 'Probability Theory: The Logic of Science' [7]. Unfortunately Jaynes died before the book was finished so that it is incomplete. It is available in print (Cambridge University Press) but a free copy can still be found on the website given in [7]. For those who want to refresh their memory on Frequentist methods we recommend 'Statistical Data Analysis' by G. Cowan [8].
A rich collection of very interesting articles (including those cited above) can be found
on the web site of T. Loredo http://www.astro.cornell.edu/staff/loredo/bayes.
Finally there are of course these lecture notes which can be found, together with the
lectures themselves, on
http://www.nikhef.nl/user/h24/bayes.
Exercise 1.1: The literature list above suggests that Bayesian methods are more popular
in astronomy than in our field of particle physics. Can you give reasons why astronomers
would be more inclined to turn Bayesian?
2 Bayesian Probability

2.1 Plausible inference
A  B  |  tautology  contradiction  $A \wedge B$  $A + B$  $A \Rightarrow B$
0  0  |      1            0             0           0             1
0  1  |      1            0             0           1             1
1  0  |      1            0             0           1             0
1  1  |      1            0             1           1             1
(2.1)
Note that the tautology is always true and the contradiction always false, independent
of the value of the input propositions.
Exercise 2.1: Show that $A \wedge \overline{A}$ is a contradiction and $A + \overline{A}$ a tautology.
Two important relations between the logical operations 'and' and 'or' are given by de Morgan's laws
$\overline{A \wedge B} = \overline{A} + \overline{B}$ and $\overline{A + B} = \overline{A} \wedge \overline{B}$. (2.2)
We note here a remarkable duality possessed by logical equations in that they can be transformed into other valid equations by interchanging the operations $\wedge$ and $+$.
Exercise 2.2: Prove (2.2). Hint: this is easiest done by verifying that the truth tables of
the left and right-hand sides of the equations are the same. Once it is shown that the first
equation in (2.2) is valid, then duality guarantees that the second equation is also valid.
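The truth-table comparison suggested in the hint can also be carried out mechanically. The following is a minimal Python sketch and not part of the original lectures: the helper name `truth_table` is an illustrative choice, and Python's `and`, `or` and `not` stand in for the conjunction, the logical sum and the negation.

```python
from itertools import product

def truth_table(expr):
    """Outputs of a two-input Boolean function for the four input states
    (A, B) = (0,0), (0,1), (1,0), (1,1)."""
    return [expr(a, b) for a, b in product([False, True], repeat=2)]

# First law in (2.2): not (A and B)  versus  (not A) or (not B)
lhs1 = truth_table(lambda a, b: not (a and b))
rhs1 = truth_table(lambda a, b: (not a) or (not b))

# Second law in (2.2): not (A or B)  versus  (not A) and (not B)
lhs2 = truth_table(lambda a, b: not (a or b))
rhs2 = truth_table(lambda a, b: (not a) and (not b))

# The laws hold if left- and right-hand sides agree for every input state
de_morgan_holds = (lhs1 == rhs1) and (lhs2 == rhs2)
print(de_morgan_holds)
```

Each list is one of the 16 possible 4-bit output words mentioned in connection with the truth table (2.1).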
The reasoning process by which conclusions are drawn from a set of input propositions is
called inference. If there is enough input information we apply deductive inference
which allows us to draw firm conclusions, that is, the conclusion can be shown to be
either true or false. Mathematical proofs, for instance, are based on deductive inferences.
2 The two input propositions define four possible input states, each of which gives rise to two possible output states (true or false). These output states can thus be encoded as bits in a 4-bit word. Five out of the 16 possible output words are listed in the truth table (2.1).
3 The conjunction $A \wedge B$ will often be written as the juxtaposition AB since it looks neat in long expressions or as (A, B) since we are accustomed to that in mathematical notation.
If there is not enough input information we apply inductive inference which does not
allow us to draw a firm conclusion. The difference between deductive and inductive
reasoning can be illustrated by the following simple example:
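Deduction:
P1: Roses are red
P2: This flower is a rose
Conclusion: This flower is red
Induction:
P1: Roses are red
P2: This flower is red
Conclusion: This flower is perhaps a rose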
Induction thus leaves us in a state of uncertainty about our conclusion. However, the
statement that the flower is red increases the probability that we are dealing with a rose
as can easily be seen from the fact that, provided all roses are red, the fraction of roses
in the population of red flowers must be larger than that in the population of all flowers.
The process of deductive and inductive reasoning can formally be summarized in terms of the following two syllogisms⁴

                Deduction                      Induction
Proposition 1:  If A is true then B is true    If A is true then B is true
Proposition 2:  A is true                      B is true
Conclusion:     B is true                      A is more probable

The first proposition in both the syllogisms can be recognized as the implication $A \Rightarrow B$ for which the truth table is given in (2.1). It is straightforward to check from this truth table the validity of the conclusions derived above; in particular it is seen that A can be true or false in the second (inductive) syllogism.
Exercise 2.3: (i) Show that it follows from $A \Rightarrow B$ that $\overline{B} \Rightarrow \overline{A}$; (ii) What are the conclusions in the two syllogisms above if we replace the second proposition by, respectively, 'A is false' and 'B is false'?
In inductive reasoning then, we are in a state of uncertainty about the validity (true or
false) of the conclusion we wish to draw. However, it makes sense to define a measure
P (A|I) of the plausibility (or degree of belief) that a proposition A is true, given the
information I. It may seem quite an arbitrary business to attempt to quantify something
like a degree of belief but this is not so.
Cox (1946) has, in a landmark paper [9], formulated the rules of plausible inference
and plausibility calculus by basing them on several desiderata. These desiderata are so
few that we can summarize them here: (i) boundedness: the degree of plausibility of a proposition A is a bounded quantity; (ii) transitivity: if A, B and C are propositions and the plausibility of C is larger than that of B which, in turn, is larger than that of A then the plausibility of C must be larger than that of A; (iii) consistency: the plausibility of a proposition A depends only on the relevant information on A and not on the path of reasoning followed to arrive at A.
4 A syllogism is a triplet of related propositions consisting of a major premise, a minor premise and a conclusion.
It turns out that these desiderata are so restrictive that they completely determine the algebra of plausibilities. To the surprise of many, this algebra appeared to be identical to that of classical probability as defined by the axioms of Kolmogorov (see below for these axioms). Plausibility is thus, for a Bayesian, identical to probability.⁵

2.2 Probability calculus
We will now derive several useful formulae starting from the fundamental axioms of probability calculus, taking the viewpoint of a Homo Bayesiensis when appropriate. First, as already mentioned above, P(A|I) is a bounded quantity; by convention P(A|I) = 1 (0) when we are certain that the proposition A is true (false). Next, the two Kolmogorov axioms are the sum rule
$P(A + B|I) = P(A|I) + P(B|I) - P(AB|I)$ (2.3)
and the product rule
$P(AB|I) = P(A|BI)\,P(B|I)$. (2.4)
Let us, at this point, spell out the difference between AB (A and B) and A|B (A given
B): In AB, B can be true or false while in A|B, B is assumed to be true and cannot
be false. The following terminology is often used for the probabilities occurring in the
product rule (2.4): P (AB|I) is called the joint probability, P (A|BI) the conditional
probability and P (B|I) the marginal probability.
Probabilities can be represented in a Venn diagram by the (normalized) areas of the
sub-sets A and B of a given set I. The sum rule is then trivially understood from the
following diagram
[Venn diagram: the sets A and B, with overlap AB, inside the set I]
while the product rule can be seen to give the relation between different normalizations of the area AB: P(AB|I) corresponds to the area AB normalized to I, P(A|BI) corresponds to AB normalized to B and P(B|I) corresponds to B normalized to I.
Because $A + \overline{A}$ is a tautology (always true) and $A\overline{A}$ a contradiction (always false) we find from (2.3)
$P(A|I) + P(\overline{A}|I) = 1$ (2.5)
which sometimes is taken as an axiom instead of (2.3).
Exercise 2.4: Derive the sum rule (2.3) from the axioms (2.4) and (2.5).
5 The intimate connection between probability and logic is reflected in the title of Jaynes' book: 'Probability Theory: The Logic of Science'.
If A and B are mutually exclusive propositions (they cannot both be true) then, because AB is a contradiction, P(AB|I) = 0 and (2.3) becomes
$P(A + B|I) = P(A|I) + P(B|I)$ (A and B exclusive). (2.6)
Likewise, if A and B are independent,⁶ so that P(A|BI) = P(A|I), the product rule (2.4) becomes
$P(AB|I) = P(A|I)\,P(B|I)$ (A and B independent). (2.7)
Because AB = BA we see from (2.4) that P(A|BI)P(B|I) = P(B|AI)P(A|I). From this we obtain the rule for conditional probability inversion, also known as Bayes' theorem:
$P(H|DI) = \frac{P(D|HI)\,P(H|I)}{P(D|I)}$. (2.8)
In (2.8) we replaced A and B by D and H to indicate that in the following these propositions will refer to 'data' and 'hypothesis', respectively. Ignoring for the moment the normalization constant P(D|I) in the denominator, Bayes' theorem tells us the following: The probability P(H|DI) of a hypothesis, given the data, is equal to the probability P(H|I) of the hypothesis, given the background information alone (that is, without considering the data) multiplied by the probability P(D|HI) that the hypothesis, when true, just yields that data. In Bayesian parlance P(H|DI) is called the posterior probability, P(D|HI) the likelihood, P(H|I) the prior probability and P(D|I) the evidence.
Bayes' theorem describes a learning process in the sense that it specifies how to update the knowledge on H when new information D becomes available.
We remark that Bayes' theorem is valid in both the Bayesian and Frequentist worlds because it follows directly from axiom (2.4) of probability calculus. What differs is the interpretation of probability: for a Bayesian, probability is a measure of plausibility so that it makes perfect sense to convert P(D|HI) into P(H|DI) for data D and hypothesis H. For a Frequentist, however, probabilities are properties of random variables and, although it makes sense to talk about P(D|HI), it does not make sense to talk about P(H|DI) because a hypothesis H is a proposition and not a random variable. See
Section 2.5 for more on Bayesian versus Frequentist.
Many people are not always fully aware of the consequences of probability inversion. To
see this, consider the case of Mr. White who goes to a doctor for an AIDS test. This
test is known to be 100% efficient (the test never fails to detect AIDS). A few days later
poor Mr. White learns that he is positive. Does this mean that he has AIDS?
Most people (including, perhaps, Mr. White himself and his doctor) would say yes
because they fail to realize that
$P(\text{AIDS}|\text{positive}) \neq P(\text{positive}|\text{AIDS})$.
6 We are talking here about a logical dependence which could be defined as follows: A and B are logically dependent when learning about A implies that we also will learn something about B. Note that logical dependence does not necessarily imply causal dependence. Causal dependence does, on the other hand, always imply logical dependence.
That inverted probabilities are, in general, not equal should be even more clear from the following example:
$P(\text{rain}|\text{clouds}) \neq P(\text{clouds}|\text{rain})$.
Right?
In the next section we will learn how to deal with Mr. White's test (and with the opinion of his doctor).
2.3
Let us now consider the important case that H can be decomposed into an exhaustive set of mutually exclusive hypotheses $\{H_i\}$, that is, into a set of which one and only one hypothesis is true.⁷ Notice that this implies, by definition, that H itself is a tautology. Trivial properties of such a complete set of hypotheses are⁸
$P(H_i, H_j|I) = P(H_i|I)\,\delta_{ij}$ (2.9)
and
$\sum_i P(H_i|I) = P(\sum_i H_i|I) = 1$ (normalization) (2.10)
where we used the sum rule (2.6) in the first equality and the fact that the logical sum of the $H_i$ is a tautology in the second equality. Eq. (2.10) is the extension of the sum-rule axiom (2.5).
Similarly it is straightforward to show that
$\sum_i P(D, H_i|I) = P(D, \sum_i H_i|I) = P(D|I)$. (2.11)
This operation is called marginalization⁹ and plays a very important role in Bayesian analysis since it allows us to eliminate sets of hypotheses which are necessary in the formulation of a problem but are otherwise of no interest to us ('nuisance parameters').
The inverse of marginalization is the decomposition of a probability: Using the product rule we can re-write (2.11) as
$P(D|I) = \sum_i P(D, H_i|I) = \sum_i P(D|H_i, I)\,P(H_i|I)$ (2.12)
which states that the probability of D can be written as the weighted sum of the probabilities of a complete set of hypotheses $\{H_i\}$. The weights are just given by the probability that $H_i$, when true, gives D. In this way we have expanded P(D|I) on a basis of probabilities $P(H_i|I)$.¹⁰ Decomposition is often used in probability assignment because it allows us to express a compound probability in terms of known elementary probabilities.
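Marginalization and decomposition amount to simple bookkeeping once the numbers are written down. The Python sketch below uses three hypotheses with made-up priors and likelihoods (all numbers are assumptions chosen for illustration, not values from these lectures) and evaluates the sums in (2.11) and (2.12).

```python
# Hypothetical complete set of three hypotheses H_1, H_2, H_3
prior = [0.5, 0.3, 0.2]        # P(H_i|I), assumed values that sum to one, see (2.10)
likelihood = [0.9, 0.5, 0.1]   # P(D|H_i, I), assumed values

# Product rule (2.4): joint probabilities P(D, H_i|I) = P(D|H_i, I) P(H_i|I)
joint = [l * p for l, p in zip(likelihood, prior)]

# Marginalization (2.11), or equivalently decomposition (2.12):
# summing the joint probabilities over the complete set gives P(D|I)
evidence = sum(joint)
print(evidence)
```

The hypothesis index plays the role of a nuisance parameter here: it is needed to write down the joint probabilities but is summed out of the final result.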
7 A trivial example is the complete set $H_1: x < a$ and $H_2: x \geq a$ with x and a real numbers.
8 Here and in the following we write the conjunction AB as A, B.
9 A projection of a two-dimensional distribution f(x, y) on the x or y axis is called a marginal distribution. Because (2.11) is projecting out P(D|I) it is called marginalization.
10 Note that (2.12) is similar to the closure relation in quantum mechanics $\langle D|I \rangle = \sum_i \langle D|H_i \rangle \langle H_i|I \rangle$.
Using (2.12), Bayes' theorem (2.8) can, for a complete set of hypotheses, be written as
$P(H_i|D, I) = \frac{P(D|H_i, I)\,P(H_i|I)}{\sum_i P(D|H_i, I)\,P(H_i|I)}$, (2.13)
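As an illustration of (2.13), the Python sketch below applies it to a two-hypothesis diagnostic test of the kind Mr. White is facing. Only P(positive|infected) = 1 is taken from the text; the prevalence of 0.1% and the false-positive probability of 1% are invented numbers, and the function name `posterior` is an illustrative choice.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem (2.13) for a complete set of hypotheses.

    priors[i]      = P(H_i|I)
    likelihoods[i] = P(D|H_i, I)
    returns the list of P(H_i|D, I)
    """
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)              # denominator of (2.13), see (2.12)
    return [j / evidence for j in joint]

# H_1 = "infected", H_2 = "not infected"; D = "test is positive"
priors = [0.001, 0.999]       # assumed prevalence (hypothetical)
likelihoods = [1.0, 0.01]     # 100% efficient test; assumed 1% false positives

post = posterior(priors, likelihoods)
print(post[0])                # P(infected|positive) under these assumptions
```

With these assumed inputs the posterior probability of infection stays below 10% although the test never misses an infection, which is the difference between P(AIDS|positive) and P(positive|AIDS).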
2.4 Continuous variables
The formalism presented above describes the probability calculus of propositions or,
equivalently, of discrete variables (which can be thought of as an index labeling a set
of propositions). To extend this discrete algebra to continuous variables, consider the
propositions
$A: r < a, \quad B: r < b, \quad C: a \leq r < b$
for a real variable r and two fixed real numbers a and b with a < b. Because we have the Boolean relation B = A + C and because A and C are mutually exclusive we find from the sum rule (2.6)
$P(a \leq r < b|I) = P(r < b|I) - P(r < a|I) \equiv G(b) - G(a)$. (2.14)
In (2.14) we have introduced the cumulant $G(x) \equiv P(r < x|I)$ which obviously is a monotonically increasing function of x. The probability density p is defined by
$p(x|I) = \lim_{\delta \to 0} \frac{P(x \leq r < x + \delta|I)}{\delta} = \frac{dG(x)}{dx}$ (2.15)
(2.16)
In terms of probability densities, the product rule (2.4) can now be written as
$p(x, y|I) = p(x|y, I)\,p(y|I)$. (2.17)
(2.18)
(2.19)
$p(y|x, I) = \frac{p(x|y, I)\,p(y|I)}{\int p(x|y, I)\,p(y|I)\,dy}$. (2.20)
We make four remarks: (1) Probabilities are dimensionless numbers so that the dimension of a density is the reciprocal of the dimension of the variable. This implies that p(x) transforms when we make a change of variable $x \to f(x)$. The size of the infinitesimal element dx corresponding to df is given by $dx = |dx/df|\,df$; because the probability content of this element must be invariant we have
$p(f|I)\,df = p(x|I)\,dx = p(x|I)\,\left|\frac{dx}{df}\right|\,df$ and thus $p(f|I) = p(x|I)\,\left|\frac{dx}{df}\right|$. (2.21)
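The transformation rule (2.21) can be checked numerically. The Python sketch below assumes, purely for illustration, a variable x that is uniform on [0, 1) and the change of variable f = x²; it compares the probability content of an interval in f predicted by (2.21) with the fraction of simulated values that fall into it. The interval, the sample size and the seed are arbitrary choices.

```python
import math
import random

def p_x(x):
    """Assumed density of x: uniform on [0, 1)."""
    return 1.0 if 0.0 <= x < 1.0 else 0.0

def p_f(f):
    """Density of f = x**2 from (2.21): p(f) = p(x) |dx/df| with x = sqrt(f)."""
    x = math.sqrt(f)
    return p_x(x) * abs(1.0 / (2.0 * x))

# Probability content of the interval [0.25, 0.36) in f, midpoint rule
lo, hi, steps = 0.25, 0.36, 1000
width = (hi - lo) / steps
predicted = sum(p_f(lo + (k + 0.5) * width) for k in range(steps)) * width

# Simulation: draw x, transform to f, count the entries in the interval
random.seed(1)
n_samples = 200_000
hits = sum(1 for _ in range(n_samples) if lo <= random.random() ** 2 < hi)
simulated = hits / n_samples

print(predicted, simulated)
```

The interval corresponds to 0.5 ≤ x < 0.6, so both numbers should be close to 0.1: the probability content is invariant while the density is not.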
Exercise 2.8: A lighthouse at sea is positioned a distance d from the coast.
[Figure: lighthouse geometry with axes x and y and the position x0 marked]
This lighthouse emits collimated light pulses at random times in random directions, that is, the distribution of pulses is uniform in the emission angle. Derive an expression for the probability to observe a light pulse as a function of the position x along the coast. (From Sivia [5].)
(3) The equations (2.19) and (2.20) are, together with their discrete equivalents, all
you need in Bayesian inference. Indeed, apart from the approximations and transformations described in Section 3 and the maximum entropy principle described in Section 5,
the remainder of these lectures will be nothing more than repeated applications of decomposition, probability inversion and marginalization.
(4) Plausible inference is, strictly speaking, always conducted in terms of probabilities
instead of probability densities. A density p(x|I) is turned into a probability by multiplying it with the infinitesimal element dx. For conditional probabilities dx should refer
to the random variable (in front of the vertical bar) and not to the condition (behind
the vertical bar); thus p(x|y, I)dx is a probability but p(x|y, I)dy is not.
2.5
In the previous sections some mention was made of the differences between the Bayesian
and Frequentist approaches. Although it is not the intention of these lectures to make
a detailed comparison of the two methods we think it is appropriate to make a few
remarks. For this we will follow the quasi-historical account of T. Loredo [1] which
helps to put matters in a proper perspective.
First, it is important to realize that the calculus presented in the previous sections
gives us the rules for manipulating probabilities but not how to assign them. Sampling
distributions (or likelihoods) are not much of a problem since they can, at least in
principle, be assigned by deductive reasoning within a suitable model (a Monte Carlo
simulation, for instance). However, the situation is less clear for the assignment of priors.
Bernoulli (1713) was the first to formulate a rule for a probability assignment which he
called the principle of insufficient reason:
If for a set of N exclusive and exhaustive propositions there is no evidence
to prefer any proposition over the others then each proposition must be
assigned an equal probability 1/N.
11 The rate of convergence can be much affected by the initial choice of prior, see Section 5.1 for a nice example taken from the book by Sivia.
While this can be considered as a solid basis for dealing with discrete variables, problems
arise when we try to extend the principle to infinitely large sets, that is, to continuous
variables. This is because a uniform distribution (uninformative) can be turned into a
non-uniform distribution (informative) by a simple coordinate transformation. A solution is provided by not using uniformity but, instead, maximum entropy as a criterion
to select uninformative probability densities, see Section 5.
People were also concerned by the lack of rationale to use the axioms (2.3) and (2.4)
for the calculus of a degree of belief which was, indeed, the concept of probability
for Bernoulli (1713), Bayes (1763) and, above all, Laplace (1812) who was the first to
formulate Bayesian inference as we know it today. In spite of the considerable success
of Laplace in applying Bayesian methods to astronomy, medicine and other fields of
investigation, the concept of plausibility was considered to be too vague, and also too
subjective, for a proper mathematical formulation. This nail in the coffin of Bayesianism
has of course been removed by the work of Cox (1946), see Section 2.1.
An apparent masterstroke to eliminate all subjectiveness was to define probability as the
limiting frequency of the occurrence of an event in an infinite sample of repetitions or,
equivalently, in an infinite ensemble of identical copies of the process under study. Such
a definition of probability is consistent with the rules (2.3) and (2.4) but inconsistent
with the notion of a probability assignment to a hypothesis since in the repetitions of
an experiment such a hypothesis can only be true or false and nothing else.¹² This then
invalidates the inversion of P(D|HI) to P(H|DI) with Bayes' theorem which is quite
convenient since it removes, together with the theorem itself, the need to specify this
disturbing prior probability P(H|I).
However, the problem is now how to judge the validity of a hypothesis without allowing
access via Bayes' theorem. The Frequentist answer is the construction of a so-called
statistic. Such a statistic is a function of the data and thus a random variable with
a distribution which can be derived directly from the sampling distribution. The idea
is now to construct a statistic which is sensitive to the hypothesis being tested and to
compare the data with the long-term behavior of this statistic. Well known examples
of a statistic are the sample mean, variance, $\chi^2$ and so on. A disadvantage is that it
is far from obvious how to construct a statistic when we have to deal with complicated
problems. As a guidance many criteria have been invented like unbiasedness, efficiency,
consistency and sufficiency but this still does not lead to a unique definition.
Exercise 2.10: The number n of radioactive decays observed during a time interval t is Poisson distributed (Section 4.4)
$P(n|Rt) = \frac{(Rt)^n}{n!}\,e^{-Rt}$,
where R is the decay rate. Motivate the use of n/t as a statistic for R (consult a statistics textbook if necessary).
Another unattractive feature of the Frequentist approach is that conclusions are based on hypothetical repetitions of the experiment which never took place. This is in stark contrast to Bayesian methods where the evidence is provided by the available data and prior information. Many repetitions (events) actually do take place in particle physics experiments but this is not so in observational sciences like astronomy, for instance. Also, it turns out that the construction of a sampling distribution is not as unambiguous as one might think. Sampling distributions may depend, in fact, not only on the relevant information carried by the data but also on how the data were actually obtained. A well known example of this is the so-called stopping problem which we will discuss in Section 7.¹³
12 A model parameter thus has for a Frequentist a fixed, but unknown, value. This is also the view of a Bayesian since the probability distribution that he assigns to the parameter does not describe how it fluctuates but how uncertain we are of its value.
Guidance in the construction of priors, or of any probability, is provided by sampling
theory (the study of random processes like those occurring in games of chance), group
theory (the study of symmetries and invariants) and, above all, the principle of maximizing information entropy (see Section 5.3). However, one may take as a rule of thumb
that when the prior is wide compared to the likelihood, it does not matter very much
what distribution we choose. On the other hand, when the likelihood is so wide that it
competes with any reasonable prior then this simply means that the experiment does
not carry much information on the problem at hand. In such a case it should not come
as a surprise that answers become dominated by prior knowledge or assumptions. Of
course there is nothing wrong with that, as long as these prior assumptions are clearly
stated. The prior also plays an important role when the likelihood peaks near a physical boundary or, as very well may happen, resides in an unphysical region (likelihoods
related to neutrino mass measurements are a famous example; for another example see
Section 6.3 in this write-up). Note that in case the prior is important, Bayesian and
Frequentist methods might yield different answers to your problem.
With these remarks we leave the Bayesian-Frequentist debate for what it is and refer to
the abundant literature on the subject, see e.g. [10] for recent discussions.
In the previous sections we have explicitly kept the probability densities conditional to
I in all expressions as a reminder that Bayesian probabilities are always defined in
relation to some background information and also that this information must be the
same for all probabilities in a given expression; if this is not the case, calculations may
lead to paradoxical results.¹⁴ This background information should not be regarded as
encompassing all that is known but simply as a list of what is needed to unambiguously
define all the probabilities in an expression. In the following we will be a bit liberal and
sometimes omit I when it clutters the notation.
3 Posterior Representation

The full result of Bayesian inference is the posterior distribution. However, instead of publishing this distribution in the form of a parameterization, table, plot or computer program it is often more convenient to summarize the posterior in terms of a few parameters. In the following subsection we present the mean, variance and covariance as a measure of position and width. The remaining two subsections are devoted to transformations of random variables and to the properties of the covariance matrix.
13 Many people consider the stopping problem the nail in the coffin of Frequentism.
14 This happens, for instance, when we calculate a posterior $P_1(H|DI) \propto P(D|HI)\,P(H|I)$ and feed this posterior back into Bayes' theorem as an improved prior to calculate $P_2(H|DI) \propto P(D|HI)\,P_1(H|DI)$ using the same data D. It is obvious already from the inconsistent notation that this kind of bootstrapping is not allowed.
3.1
(3.1)
where the integration domain is understood to be the definition range of the distribution
p(x|I). The k-th moment of a distribution is the expectation value $\langle x^k \rangle$. From (2.18) it immediately follows that the zeroth moment $\langle x^0 \rangle = 1$. The first moment is called the mean of the distribution and is a location measure
$\mu = \bar{x} = \langle x \rangle = \int x\,p(x|I)\,dx$. (3.2)
The variance $\sigma^2$ is the second moment about the mean
$\sigma^2 = \langle \Delta x^2 \rangle = \langle (x - \mu)^2 \rangle = \int (x - \mu)^2\,p(x|I)\,dx$. (3.3)
The square root of the variance is called the standard deviation and is a measure of
the width of the distribution.
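The moment integrals are easily evaluated numerically. The Python sketch below assumes, for illustration only, the density p(x) = 2x on [0, 1] and uses a midpoint rule; the function name `moment` and its arguments are hypothetical choices, not notation from these lectures.

```python
def moment(density, k, lo, hi, about=0.0, steps=100_000):
    """k-th moment of `density` about the point `about`, midpoint rule on [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for j in range(steps):
        x = lo + (j + 0.5) * width
        total += (x - about) ** k * density(x)
    return total * width

def density(x):
    """Assumed example density p(x) = 2x on [0, 1]."""
    return 2.0 * x

norm = moment(density, 0, 0.0, 1.0)                  # zeroth moment, should be 1
mean = moment(density, 1, 0.0, 1.0)                  # first moment (3.2)
variance = moment(density, 2, 0.0, 1.0, about=mean)  # second moment about the mean (3.3)
second = moment(density, 2, 0.0, 1.0)                # second moment about zero

print(norm, mean, variance, variance ** 0.5)
```

For this density the exact values are a mean of 2/3 and a variance of 1/18, and the variance also equals the second moment minus the square of the first.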
Exercise 3.1: Show that the variance is related to the first and second moments by
$\langle \Delta x^2 \rangle = \langle x^2 \rangle - \langle x \rangle^2$.
A correlation between the variables is better judged from the matrix of correlation coefficients which is defined by
$\rho_{ij} = \frac{V_{ij}}{\sigma_i \sigma_j} = \frac{V_{ij}}{\sqrt{V_{ii} V_{jj}}}$. (3.5)
15 We discuss here only continuous variables; the expressions for discrete variables are obtained by replacing the integrals with sums.
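The position of the maximum of a probability density function is called the mode, which often is taken as a location parameter (provided the distribution has a single maximum). For the general case of an n-dimensional distribution one finds the mode by minimizing the function $L(x) = -\ln p(x|I)$. Expanding L around the point $\hat{x}$ (to be specified below) we obtain
$L(x) = L(\hat{x}) + \sum_{i=1}^{n} \frac{\partial L(\hat{x})}{\partial x_i}\,\Delta x_i + \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\partial^2 L(\hat{x})}{\partial x_i \partial x_j}\,\Delta x_i \Delta x_j + \cdots$ (3.6)
with $\Delta x_i = x_i - \hat{x}_i$. We now take $\hat{x}$ to be the point where L is minimal, that is, a solution of
$\frac{\partial L(\hat{x})}{\partial x_i} = 0$ (3.7)
so that the second term in (3.6) vanishes. Up to second order, the expansion can now be written in matrix notation as
$L(x) = L(\hat{x}) + \frac{1}{2}\,(x - \hat{x})\,H\,(x - \hat{x}) + \cdots$, (3.8)
where H is the Hessian matrix of second derivatives
$H_{ij} = \frac{\partial^2 L(\hat{x})}{\partial x_i \partial x_j}$. (3.9)
Exponentiating (3.8) we find for our approximation of the probability density in the neighborhood of the mode:
$p(x|I) \approx C \exp\left[-\frac{1}{2}\,(x - \hat{x})\,H\,(x - \hat{x})\right]$ (3.10)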
where C is a normalization constant. Now if we identify the inverse of the Hessian with a covariance matrix V then the approximation (3.10) is just a multivariate Gaussian in x-space
$p(x|I) \approx \frac{1}{\sqrt{(2\pi)^n |V|}} \exp\left[-\frac{1}{2}\,(x - \hat{x})\,V^{-1}\,(x - \hat{x})\right]$, (3.11)
where |V| denotes the determinant of V.
Sometimes the posterior is such that the mean and covariance matrix (or Hessian) can be calculated analytically. In most cases, however, minimization programs like minuit are used to determine numerically $\hat{x}$ and V from L(x) (the function L is calculated in the subroutine fcn provided by the user).
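To make the numerical procedure concrete, here is a one-dimensional Python sketch. It does not use minuit: a hand-written Newton iteration with finite-difference derivatives stands in for the minimizer, and the density p(x) ∝ x⁹ e⁻ˣ, the starting point and the step size h are all assumptions chosen for illustration.

```python
import math

def neg_log_density(x, a=9.0):
    """L(x) = -ln p(x) up to a constant, for the assumed density p(x) ~ x**a * exp(-x)."""
    return x - a * math.log(x)

def derivatives(func, x, h=1e-3):
    """Central finite-difference estimates of the first and second derivative."""
    first = (func(x + h) - func(x - h)) / (2.0 * h)
    second = (func(x + h) - 2.0 * func(x) + func(x - h)) / (h * h)
    return first, second

# Newton iteration towards the minimum of L, i.e. a solution of dL/dx = 0
x_hat = 5.0                      # arbitrary starting point
for _ in range(50):
    first, second = derivatives(neg_log_density, x_hat)
    step = first / second
    x_hat -= step
    if abs(step) < 1e-8:
        break

# Hessian at the mode and its inverse, the variance of the Gaussian approximation
_, hessian = derivatives(neg_log_density, x_hat)
variance = 1.0 / hessian

print(x_hat, variance)
```

For this density the mode and the Hessian are known analytically (mode at x = 9, second derivative 1/9 there), so the sketch should return a mode close to 9 and a variance close to 9. A production minimizer adds safeguards that this bare Newton iteration lacks.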
The approximation (3.11) is easily marginalized. It can be shown that integrating a multivariate Gaussian over one variable $x_i$ is equivalent to deleting the corresponding row and column i in the covariance matrix V. This defines a new covariance matrix $V'$ and, by inversion, a new Hessian $H'$. Replacing V by $V'$ and n by (n - 1) in (3.11) then gives the integrated Gaussian. It is now easy to see that integration over all but one $x_i$ gives
$p(x_i|I) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left[-\frac{1}{2} \left(\frac{x_i - \hat{x}_i}{\sigma_i}\right)^2\right]$ (3.12)
where $\sigma_i^2$ is the diagonal element $V_{ii}$ of the covariance matrix V.¹⁹
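The statement that the marginal width is given by the diagonal element of V (and not by the inverse of the diagonal element of the Hessian) can be verified numerically in two dimensions. The Python sketch below assumes a covariance matrix with σ₁ = 2, σ₂ = 1 and correlation 0.9; these numbers, the integration range and the helper names are illustrative choices.

```python
import math

def invert_2x2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Assumed covariance matrix: sigma_1 = 2, sigma_2 = 1, correlation 0.9
V = [[4.0, 1.8], [1.8, 1.0]]
H = invert_2x2(V)                       # Hessian = inverse covariance matrix
det_V = V[0][0] * V[1][1] - V[0][1] * V[1][0]

def density(x1, x2):
    """Bivariate Gaussian (3.11) centred on the origin."""
    q = H[0][0] * x1 * x1 + 2.0 * H[0][1] * x1 * x2 + H[1][1] * x2 * x2
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det_V))

def marginal(x1, lo=-10.0, hi=10.0, steps=4000):
    """Integrate the bivariate density over x2 with the midpoint rule."""
    width = (hi - lo) / steps
    return sum(density(x1, lo + (k + 0.5) * width) for k in range(steps)) * width

def gauss(x, variance):
    """One-dimensional Gaussian (3.12) centred on zero."""
    return math.exp(-0.5 * x * x / variance) / math.sqrt(2.0 * math.pi * variance)

marginal_variance = V[0][0]             # row and column 2 deleted from V
conditional_variance = 1.0 / H[0][0]    # row and column 2 deleted from H instead

print(marginal(0.0), gauss(0.0, marginal_variance), gauss(0.0, conditional_variance))
```

The numerically integrated density agrees with a Gaussian of variance V₁₁. The quantity 1/H₁₁ equals σ₁²(1 - ρ²) and describes the width of x₁ at a fixed value of x₂, which is narrower when the variables are correlated.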
Let us close this section by making the remark that communicating a posterior by only giving the mean (or mode) and covariance is inappropriate, and perhaps even misleading, when the distribution is multi-modal, strongly asymmetric or exhibits long tails.²⁰ Note also that there are distributions for which mean and variance are ill defined as is, for instance, the case for a uniform distribution on $[-\infty, +\infty]$. Mean and variance may not even exist because the integrals (3.2) or (3.3) are divergent. An example of this is the Cauchy distribution
$p(x|I) = \frac{1}{\pi}\,\frac{1}{1 + x^2}$.
Exercise 3.3: The Cauchy distribution is often called the Breit-Wigner distribution which usually is parameterized as
$p(x|x_0, \Gamma) = \frac{1}{\pi}\,\frac{\Gamma/2}{(\Gamma/2)^2 + (x - x_0)^2}$.
Use (3.6), in one dimension, to characterize the position and width of the Breit-Wigner distribution. Show also that $\Gamma$ is the FWHM (full width at half maximum).
3.2 Transformations

19 The error on a fitted parameter given by minuit is the square root of the corresponding diagonal element of the covariance matrix and is thus the width of the marginal distribution of this parameter.
20 Without additional information people will assume that the posterior is unimodal and that a one standard deviation interval contains roughly 68.3% probability as is the case for a Gauss distribution.
where we have made the trivial assignment $p(z|x, I) = \delta[z - f(x)]$. This assignment
guarantees that the integral only receives contributions from the hyperplane f (x) = z.
As an example consider two independent variables x and y distributed according to
p(x, y|I) = f (x)g(y). Using (3.13) we find that the distribution of the sum z = x + y is
given by the Fourier convolution of f and g
$p(z|I) = \int f(x)\,g(z - x)\,dx = \int f(z - y)\,g(y)\,dy$. (3.14)
Likewise we find that the product z = xy is distributed according to the Mellin convolution of f and g
$p(z|I) = \int f(x)\,g(z/x)\,\frac{dx}{|x|} = \int f(z/y)\,g(y)\,\frac{dy}{|y|}$, (3.15)
provided that the definition ranges do not include x = 0 and y = 0.
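A quick numerical check of the convolution (3.14) is possible with a few lines of Python. The sketch below assumes, for illustration, that f and g are both uniform on [0, 1), in which case the sum z = x + y has the well-known triangular density on [0, 2]; the function names and the integration settings are arbitrary choices.

```python
def uniform(x):
    """Assumed example density: uniform on [0, 1)."""
    return 1.0 if 0.0 <= x < 1.0 else 0.0

def convolve(f, g, z, lo=0.0, hi=1.0, steps=10_000):
    """Midpoint-rule estimate of the convolution integral (3.14) at the point z.

    The integration range [lo, hi] should cover the definition range of f.
    """
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * width
        total += f(x) * g(z - x)
    return total * width

# Density of the sum z = x + y at three test points
density_of_sum = {z: convolve(uniform, uniform, z) for z in (0.5, 1.0, 1.5)}
print(density_of_sum)
```

The triangular density is z for 0 ≤ z ≤ 1 and 2 - z for 1 ≤ z ≤ 2, so the three values should be close to 0.5, 1.0 and 0.5.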
Exercise 3.4: Use (3.13) to derive Eqs. (3.14) and (3.15).
(3.16)
xi
.
zk
(3.17)
Exercise 3.5: Let x and y be two independent variables distributed according to p(x, y|I) =
f(x)g(y). Let u = x + y and v = x - y. Use (3.16) to obtain an expression for q(u, v|I)
in terms of f and g and show, by integrating over v, that the marginal distribution of
u is given by (3.14). Likewise define u = xy and v = x/y and show that the marginal
distribution of u is given by (3.15).
The above, although it formally settles the issue of how to deal with functions of random
variables, often gives rise to tedious algebra as can be seen from the following exercise:21
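Exercise 3.6: Two variables $x_1$ and $x_2$ are independently Gaussian distributed:
$$p(x_i|I) = \frac{1}{\sigma_i \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x_i - \mu_i}{\sigma_i} \right)^2 \right], \qquad i = 1, 2.$$
Show, by carrying out the integral in (3.14), that the variable $z = x_1 + x_2$ is Gaussian
distributed with mean $\mu = \mu_1 + \mu_2$ and variance $\sigma^2 = \sigma_1^2 + \sigma_2^2$.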
21
Later on we will use Fourier transforms (characteristic functions) to make life much easier.
An easy way to deal with any function of any distribution is to generate p(x|I) by
Monte Carlo, calculate F (x) at each generation and histogram the result. However,
if we are content with summarizing the distributions by a location parameter and a
covariance matrix, then there is a very simple transformation rule, known as linear
error propagation.
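The Monte Carlo recipe above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `monte_carlo_histogram`, the choice of two uniform variables on [0, 1], and the binning are not taken from the lectures. For the sum $z = x + y$ the histogram should approach the triangular density that follows from the convolution (3.14).

```python
import random

def monte_carlo_histogram(func, sampler, n_samples=100_000,
                          bins=20, lo=0.0, hi=2.0, seed=1):
    """Sample x from p(x|I), evaluate F(x) and fill a normalized histogram."""
    rng = random.Random(seed)
    counts = [0] * bins
    width = (hi - lo) / bins
    total = 0.0
    for _ in range(n_samples):
        value = func(sampler(rng))
        total += value
        index = int((value - lo) / width)
        if 0 <= index < bins:
            counts[index] += 1
    # Divide by the number of samples and the bin width to get a density.
    density = [c / (n_samples * width) for c in counts]
    return density, total / n_samples

# Illustrative choice: z = x + y with x and y uniform on [0, 1].
density, mean = monte_carlo_histogram(
    func=lambda xy: xy[0] + xy[1],
    sampler=lambda rng: (rng.random(), rng.random()),
)
print("mean of z:", mean)
print("density in the bin [1.0, 1.1):", density[10])
```

Any other function or sampling distribution can be substituted by changing the two lambda arguments.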
Let F (x) be one of a set of m functions of x. Linear approximation gives
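$$\Delta F_\lambda \equiv F_\lambda(x) - F_\lambda(\hat{x}) = \sum_{i=1}^{n} \frac{\partial F_\lambda(\hat{x})}{\partial x_i}\, \Delta x_i \qquad (3.18)$$
where $\Delta x_i = x_i - \hat{x}_i$. Multiplying two such expressions and averaging gives
$$\langle \Delta F_\lambda \Delta F_\mu \rangle = \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\partial F_\lambda}{\partial x_i} \frac{\partial F_\mu}{\partial x_j} \langle \Delta x_i \Delta x_j \rangle. \qquad (3.19)$$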
(3.20)
and the quadratic addition of relative errors for a product of independent variables
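$$\left( \frac{\sigma_z}{z} \right)^2 = \left( \frac{\sigma_1}{x_1} \right)^2 + \left( \frac{\sigma_2}{x_2} \right)^2 + \cdots + \left( \frac{\sigma_n}{x_n} \right)^2 \quad \text{for } z = x_1 x_2 \cdots x_n. \qquad (3.22)$$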
Exercise 3.7: Use (3.19) to derive the two propagation rules (3.21) and (3.22).
Exercise 3.8: A counter is traversed by N particles and fires n times. Since $n \le N$ these
counts are not independent, but n and $m = N - n$ are. Assume Poisson errors (Section 4.4)
$\sigma_n = \sqrt{n}$ and $\sigma_m = \sqrt{m}$ and use (3.19) to show that the error on the efficiency $\varepsilon = n/N$
is given by
$$\sigma_\varepsilon = \sqrt{\frac{\varepsilon (1 - \varepsilon)}{N}}.$$
This is known as the binomial error, see Section 4.2.
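As a numerical cross-check of this exercise, the sketch below applies linear error propagation (3.19) to independent inputs, using finite differences for the derivatives. The helper `propagate`, the step size and the counts n = 80, m = 20 are illustrative assumptions, not part of the lectures.

```python
import math

def propagate(func, values, sigmas, step=1e-6):
    """Linear error propagation (3.19) for independent inputs."""
    variance = 0.0
    for i, (value, sigma) in enumerate(zip(values, sigmas)):
        up = list(values)
        dn = list(values)
        up[i] = value + step
        dn[i] = value - step
        # Central finite difference for the partial derivative dF/dx_i.
        derivative = (func(up) - func(dn)) / (2 * step)
        variance += (derivative * sigma) ** 2
    return math.sqrt(variance)

def efficiency(counts):
    n, m = counts
    return n / (n + m)

n, m = 80.0, 20.0  # hypothetical: fired and not fired, N = 100
sigma_eff = propagate(efficiency, [n, m], [math.sqrt(n), math.sqrt(m)])
eps = efficiency([n, m])
binomial_error = math.sqrt(eps * (1 - eps) / (n + m))
print(sigma_eff, binomial_error)
```

The two printed numbers should agree up to the accuracy of the finite-difference derivative.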
3.3
In this section we investigate in some more detail the properties of the covariance matrix
which, together with the mean, fully characterizes the multivariate Gaussian (3.11).
In Section 3.1 we have already remarked that V is symmetric but not every symmetric
matrix can serve as a covariance matrix. To see this, consider a function f (x) of a set
of Gaussian random variables x. For the variance of f we have according to (3.20)
$$\sigma^2 = \langle \Delta f^2 \rangle = d\, V\, d,$$
where d is the vector of derivatives $\partial f / \partial x_i$. But since $\sigma^2$ is positive for any function f
it follows that the following inequality must hold:
$$d\, V\, d > 0 \quad \text{for any vector } d. \qquad (3.23)$$
Figure 1: The one standard deviation contour of a two dimensional Gaussian for (a) correlated
variables $x_1$ and $x_2$ and (b) uncorrelated variables $y_1$ and $y_2$. The marginal distributions of $x_1$ and $x_2$
have a standard deviation of $\sigma_1$ and $\sigma_2$, respectively.
Gaussian variables (x1 , x2 ) and two uncorrelated variables (y1 , y2 ). It is clear from these
plots that the two error ellipses are related by a simple rotation. The rotation matrix is
unique, provided that the variables x are linearly independent. A pure rotation is not
the only way to diagonalize the covariance matrix since the rotation can be combined
with a scale transformation along y1 or y2 .
The rotation U which diagonalizes V must, according to the transformation rule (3.19),
satisfy the relation
$$U V U^T = L \quad \Rightarrow \quad V U^T = U^T L \qquad (3.24)$$
(3.25)
It is then easy to see that (3.24) corresponds to the set of eigenvalue equations
$$V u_i = \lambda_i u_i. \qquad (3.26)$$
Thus, the rotation matrix U and the vector of diagonal elements $\lambda_i$ are determined by
the complete set of eigenvectors and eigenvalues of the covariance matrix V.
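For a two dimensional covariance matrix the eigenvalue problem can be solved in closed form. The sketch below is an illustration only: the function name `diagonalize_2x2` and the numerical matrix are assumptions. It returns the two eigenvalues and the rotation angle of the major axis of the error ellipse.

```python
import math

def diagonalize_2x2(v11, v12, v22):
    """Eigenvalues and rotation angle of a symmetric 2x2 covariance matrix."""
    mean = 0.5 * (v11 + v22)
    radius = math.hypot(0.5 * (v11 - v22), v12)
    lam1, lam2 = mean + radius, mean - radius
    # The major axis satisfies tan(2 theta) = 2 v12 / (v11 - v22).
    theta = 0.5 * math.atan2(2 * v12, v11 - v22)
    return lam1, lam2, theta

# Hypothetical covariance: sigma1 = 2, sigma2 = 1, correlation 0.5.
v11, v12, v22 = 4.0, 1.0, 1.0
lam1, lam2, theta = diagonalize_2x2(v11, v12, v22)
u1 = (math.cos(theta), math.sin(theta))  # eigenvector belonging to lam1
print(lam1, lam2, theta)
```

Both eigenvalues come out positive here, as they must for a valid covariance matrix.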
Exercise 3.9: Show that (3.24) is equivalent to (3.26).
Exercise 3.10: (i) Show that for a symmetric matrix V and two arbitrary vectors x and
y the following relation holds: $y\, V\, x = x\, V\, y$; (ii) Show that the eigenvectors $u_i$ and $u_j$
of a symmetric matrix V are orthogonal, that is, $u_i \cdot u_j = 0$ for $i \neq j$; (iii) Show that the
eigenvalues of a positive definite symmetric matrix V are all positive.
where we have used the fact that |U | = 1 and that all the eigenvalues of V are positive.
4.1
Bernoulli's urn
Consider an experiment where balls are drawn from an urn. Let the urn contain N
balls and let the balls be labeled i = 1, . . . , N. We can now define the exhaustive and
exclusive set of hypotheses
$H_i$ = "this ball has label i", $i = 1, \ldots, N$.
Since we have no information on which ball we will draw we use the principle of insufficient reason to assign the probability to get ball i at the first draw:
$$P(H_i|N, I) = \frac{1}{N}. \qquad (4.1)$$
Next, we consider the case that R balls are colored red and $W = N - R$ are colored
white. We define the exhaustive and exclusive set of hypotheses
$H_R$ = "this ball is red"
$H_W$ = "this ball is white".
We now want to assign the probability that the first ball we draw will be red. To solve
this problem we decompose this probability into the hypothesis space {Hi } which gives
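$$P(H_R|I) = \sum_{i=1}^{N} P(H_R, H_i|I) = \sum_{i=1}^{N} P(H_R|H_i, I)\, P(H_i|I) = \frac{1}{N} \sum_{i=1}^{N} P(H_R|H_i, I) = \frac{R}{N} \qquad (4.2)$$
where, in the last step, we have made the trivial probability assignment
$$P(H_R|H_i, I) = \begin{cases} 1 & \text{if ball } i \text{ is red} \\ 0 & \text{otherwise.} \end{cases}$$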
Next, we assign the probability that the second ball will be red. This probability depends
on how we draw the balls:
1. We draw the first ball, put it back in the urn and shake the urn. The latter action
may be called randomization but from a Bayesian point of view the purpose
of shaking the urn is, in fact, to destroy all information we might have on the
whereabouts of this ball after it was put back in the urn (it would most likely
end-up in the top layer of balls). Since this drawing with replacement does
not change the contents of the urn and since the shaking destroys all previously
accumulated information, the probability of drawing a red ball a second time is
equal to that of drawing a red ball the first time:
$$P(R_2|I) = \frac{R}{N}.$$
Exercise 4.1: Draw a first ball and put it aside without recording its color. Show that
the probability for the second draw to be red is now P (R2 |I) = R/N .
In the above we have shown that the probability of the second draw may depend on the
outcome of the first draw. We will now show that the probability of the first draw may
depend on the outcome of the second draw! Consider the following situation: A first
ball is drawn and put aside without recording its color. A second ball is drawn and it
turns out to be red. What is the probability that the first ball was red? Bayes' theorem
immediately shows that it is not R/N:
$$P(R_1|R_2, I) = \frac{P(R_2|R_1, I)\, P(R_1|I)}{P(R_2|R_1, I)\, P(R_1|I) + P(R_2|W_1, I)\, P(W_1|I)} = \frac{R - 1}{N - 1}.$$
If this argument fails to convince you, take the extreme case of an urn containing one
red and one white ball. The probability of a red ball at the first draw is 1/2. Lay the
ball aside and take the second ball. If it is red, then the probability that the first ball
was red is zero and not 1/2. The fact that the second draw influences the probability of
the first draw has of course nothing to do with a causal relationship but, instead, with
a logical relationship.
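The argument can be checked with exact rational arithmetic. The following sketch assumes drawing without replacement and uses the hypothetical helper name `prob_first_red_given_second_red`; it evaluates Bayes' theorem for the first draw given that the second draw was red.

```python
from fractions import Fraction

def prob_first_red_given_second_red(n_red, n_total):
    """P(R1|R2,I) for drawing without replacement, via Bayes' theorem."""
    p_r1 = Fraction(n_red, n_total)             # prior: first ball red
    p_w1 = 1 - p_r1                             # prior: first ball white
    p_r2_given_r1 = Fraction(n_red - 1, n_total - 1)
    p_r2_given_w1 = Fraction(n_red, n_total - 1)
    evidence = p_r2_given_r1 * p_r1 + p_r2_given_w1 * p_w1
    return p_r2_given_r1 * p_r1 / evidence

print(prob_first_red_given_second_red(1, 2))    # one red and one white ball
print(prob_first_red_given_second_red(3, 10))   # hypothetical urn
```

For the urn with one red and one white ball the function returns zero, in line with the extreme case discussed above.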
Exercise 4.2: Draw a first ball and put it back in the urn without recording its color.
The color of a second draw is red. What is the probability that the first draw was red?
4.2
Binomial distribution
We now draw n balls from the urn, putting the ball back after each draw and shaking the
urn. In this way the probability that a draw is red is the same for all draws: p = R/N.
What is the probability that we find r red balls in our sample of n draws? Again, we
seek to decompose this probability into a combination of elementary ones which are easy
to assign. Let us start with the hypothesis
$S_j$ = "the n balls are drawn in the sequence labeled j"
where $j = 1, \ldots, 2^n$ is the index in a list of all possible sequences (of length n) of
white and red draws. The set of hypotheses {Sj } is obviously exclusive and exhaustive.
The draws are independent, that is, the probability of the k-th draw does not depend
on the outcome of the other draws (remember that this is only true for drawing with
replacement). Thus we find from the product rule
$$P(S_j|I) = P(C_1, \ldots, C_n|I) = \prod_{k=1}^{n} P(C_k|I) = p^{r_j} (1 - p)^{n - r_j} \qquad (4.3)$$
where $C_k$ stands for red or white at the k-th draw and where $r_j$ is the number of red
draws in the sequence j. Having assigned the probability of each element in the set
$\{S_j\}$, we now decompose our probability of r red balls into this set:
$$P(r|I) = \sum_{j=1}^{2^n} P(r, S_j|I) = \sum_{j=1}^{2^n} P(r|S_j, I)\, P(S_j|I) = \left[ \sum_{j=1}^{2^n} \delta(r - r_j) \right] p^r (1 - p)^{n - r} \qquad (4.4)$$
The sum inside the square brackets in (4.4) counts the number of sequences in the set
{Sj } which have just r red draws. It is an exercise in combinatorics to show that this
number is given by the binomial coefficient. Thus we obtain
$$P(r|p, n) = \frac{n!}{r!\,(n - r)!}\, p^r (1 - p)^{n - r} = \binom{n}{r} p^r (1 - p)^{n - r}. \qquad (4.5)$$
This is called the binomial distribution which applies to all processes where the
outcome is binary (red or white, head or tail, yes or no, absent or present etc.), provided
that the probability p of the outcome of a single draw is the same for all draws. In
Fig. 2 we show the distribution of red draws for n = (10, 20, 40) trials for an urn with
p = 0.25.
Figure 2: The binomial distribution to observe r red balls in n = (10, 20, 40) draws from an urn
containing a fraction p = 0.25 of red balls.
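The distributions of Fig. 2 can be tabulated directly from (4.5). The sketch below is a minimal illustration using only the Python standard library; the function name `binomial_pmf` is an assumption.

```python
import math

def binomial_pmf(r, n, p):
    """Binomial probability (4.5) of r red draws in n trials."""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

for n in (10, 20, 40):
    pmf = [binomial_pmf(r, n, 0.25) for r in range(n + 1)]
    mode = max(range(n + 1), key=lambda r: pmf[r])
    print(f"n = {n}: most probable r = {mode}, sum of probabilities = {sum(pmf):.12f}")
```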
The binomial probabilities are just the terms of the binomial expansion
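$$(a + b)^n = \sum_{r=0}^{n} \binom{n}{r} a^r b^{n - r} \qquad (4.6)$$
Setting a = p and b = 1 - p shows that the binomial distribution is normalized:
$$\sum_{r=0}^{n} P(r|p, n) = 1.$$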
The condition of independence of the trials is important and may not be fulfilled: for
instance, suppose we scoop a handful of balls out of the urn and count the number r
of red balls in this sample. Does r follow the binomial distribution? The answer is no
since we did not perform draws with replacement, as required. This can also be seen
from the extreme situation where we take all balls out of the urn. Then r would not be
distributed at all: it would just be R.
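The first and second moments and the variance of the binomial distribution are
$$\langle r \rangle = \sum_{r=0}^{n} r\, P(r|p, n) = np$$
$$\langle r^2 \rangle = \sum_{r=0}^{n} r^2 P(r|p, n) = np(1 - p) + n^2 p^2$$
$$\langle \Delta r^2 \rangle = \langle r^2 \rangle - \langle r \rangle^2 = np(1 - p) \qquad (4.7)$$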
If we now define the ratio $\varepsilon = r/n$ then it follows immediately from (4.7) that
$$\langle \varepsilon \rangle = p, \qquad \langle \Delta \varepsilon^2 \rangle = \frac{p(1 - p)}{n}. \qquad (4.8)$$
The square root of this variance is called the binomial error which we have already
encountered in Exercise 3.8. It is seen that the variance vanishes in the limit of large
n and thus that $\varepsilon$ converges to p in that limit. This fundamental relation between a
probability and a limiting relative frequency was first discovered by Bernoulli and is
called the law of large numbers. This law is, of course, the basis for the Frequentist
definition of probability.
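The moments (4.7) and the shrinking width of the ratio r/n can be verified by summing over the distribution. This is a sketch under the assumption p = 0.25 and with the hypothetical helper `binomial_moments`; it prints the mean and the standard deviation of r/n for increasing n.

```python
import math

def binomial_moments(n, p):
    """First moment, second moment and variance of the binomial distribution."""
    pmf = [math.comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)]
    first = sum(r * w for r, w in enumerate(pmf))
    second = sum(r * r * w for r, w in enumerate(pmf))
    return first, second, second - first**2

p = 0.25
for n in (10, 100, 1000):
    first, second, variance = binomial_moments(n, p)
    # Mean and width of the ratio r/n, to be compared with (4.8).
    print(n, first / n, math.sqrt(variance) / n)
```

The width of r/n decreases like one over the square root of n while the mean stays at p.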
4.3
Multinomial distribution
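$$P(n_1, \ldots, n_k|p_1, \ldots, p_k, N) = \frac{N!}{n_1! \cdots n_k!}\, p_1^{n_1} \cdots p_k^{n_k} \qquad (4.9)$$
where
$$\sum_{i=1}^{k} n_i = N \quad \text{and} \quad \sum_{i=1}^{k} p_i = 1. \qquad (4.10)$$
(4.11)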
Marginalization is achieved by adding in (4.9) two or more variables ni and their corresponding probabilities pi .
Exercise 4.3: Use the addition rule above to show that the marginal distribution of each
ni in (4.9) is given by the binomial distribution P (ni |pi , N ) as defined in (4.5).
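$$\frac{M!}{n_1! \cdots n_{k-1}!}\, q_1^{n_1} \cdots q_{k-1}^{n_{k-1}}$$
where
$$m = (n_1, \ldots, n_{k-1}), \quad q = \frac{1}{s}\, (p_1, \ldots, p_{k-1}), \quad s = \sum_{i=1}^{k-1} p_i \quad \text{and} \quad M = N - n_k.$$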
Exercise 4.4: Derive the expression for the conditional probability by dividing the joint
probability (4.9) by the marginal (binomial) probability P (nk |pk , N ).
4.4
Poisson distribution
Here we consider events or counts which occur randomly in time (or space). The
counting rate R is supposed to be given, that is, we know the average number of counts
$\mu = R \Delta t$ in a given time interval $\Delta t$. There are several ways to derive an expression for
the probability $P(n|\mu)$ to observe n events in a time interval which contains, on average,
$\mu$ events. Our derivation is based on the fact that this probability distribution is a
limiting case of the binomial distribution.
Assume we divide the interval $\Delta t$ in N sub-intervals $\delta t$. The probability to observe an
event in such a sub-interval is then $p = \mu/N$, see (4.2). Now we can always make N
so large and $\delta t$ so small that the number of events in each sub-interval is either one or
zero. The probability to find n events in N sub-intervals is then equal to the (binomial)
probability to find n successes in N trials:
$$P(n|N) = \frac{N!}{n!\,(N - n)!} \left( \frac{\mu}{N} \right)^n \left( 1 - \frac{\mu}{N} \right)^{N - n}.$$
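Taking the limit $N \to \infty$ then yields the desired result
$$P(n|\mu) = \lim_{N \to \infty} \frac{N!}{(N - n)!\,(N - \mu)^n}\, \frac{\mu^n}{n!} \left( 1 - \frac{\mu}{N} \right)^N = \frac{\mu^n}{n!}\, e^{-\mu}. \qquad (4.12)$$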
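The limiting procedure can be followed numerically by comparing the binomial probability for an increasing number of sub-intervals with the Poisson result (4.12). The value mu = 3 and the function names in this sketch are illustrative assumptions.

```python
import math

def binomial_pmf(n, trials, p):
    """Probability of n successes in the given number of trials."""
    return math.comb(trials, n) * p**n * (1 - p)**(trials - n)

def poisson_pmf(n, mu):
    """Poisson probability (4.12) of n counts for an average of mu."""
    return mu**n * math.exp(-mu) / math.factorial(n)

mu = 3.0
for trials in (10, 100, 10_000):
    worst = max(abs(binomial_pmf(n, trials, mu / trials) - poisson_pmf(n, mu))
                for n in range(11))
    print(f"N = {trials}: largest difference for n = 0..10 is {worst:.2e}")
```

The largest difference should decrease as the number of sub-intervals grows.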
(4.13)
n=0
4.5
Gauss distribution
The sum of many small random fluctuations leads to a Gauss distribution, irrespective
of the distribution of each of the terms contributing to the sum. This fact, known as
the central limit theorem, is responsible for the dominant presence of the Gauss
distribution in statistical data analysis. To prove the central limit theorem we first
have to introduce the characteristic function which is nothing else than the Fourier
transform of a probability density.
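The Fourier transform of a distribution p(x), and its inverse, are defined by
$$\phi(k) = \int_{-\infty}^{\infty} e^{ikx}\, p(x)\, dx, \qquad p(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-ikx}\, \phi(k)\, dk \qquad (4.14)$$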
This transformation plays an important role in proving many theorems related to sums
of random variables and moments of probability distributions. This is because a Fourier
transform turns a Fourier convolution in x-space, see (3.14), into a product in k-space.22
To see this, consider a joint distribution of n independent variables
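$$p(x|I) = f_1(x_1) \cdots f_n(x_n).$$
Using (3.13) we write for the transform of the distribution of the sum $z = \sum_{i=1}^{n} x_i$
$$\phi(k) = \int dz\, e^{ikz} \int dx_1 \cdots \int dx_n\, f_1(x_1) \cdots f_n(x_n)\, \delta\Big( z - \sum_{i=1}^{n} x_i \Big)$$
$$= \int dx_1 \cdots \int dx_n\, f_1(x_1) \cdots f_n(x_n)\, \exp\Big( ik \sum_{i=1}^{n} x_i \Big) = \phi_1(k) \cdots \phi_n(k). \qquad (4.15)$$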
The transform of a sum of independent random variables is thus the product of the
transforms of each variable.
The moments of a distribution are related to the derivatives of the transform at k = 0:
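$$\frac{d^n \phi(k)}{dk^n} = \int_{-\infty}^{\infty} (ix)^n e^{ikx}\, p(x)\, dx \quad \Rightarrow \quad \frac{d^n \phi(0)}{dk^n} = i^n \langle x^n \rangle. \qquad (4.16)$$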
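The characteristic functions (Fourier transforms) of many distributions can be found in,
for instance, the particle data book [11]. Of importance to us is the Gauss distribution
and its transform
$$p(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right] \quad \Leftrightarrow \quad \phi(k) = \exp\left( i\mu k - \tfrac{1}{2} \sigma^2 k^2 \right) \qquad (4.17)$$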
22
A Mellin transform turns a Mellin convolution (3.15) in x-space into a product in k-space. We will
not discuss Mellin transforms in these notes.
To prove the central limit theorem we consider the sum of a large set of n random
variables $s = \sum x_j$. Each $x_j$ is distributed independently according to $f_j(x_j)$ with mean
$\mu_j$ and a standard deviation $\sigma$ which we take, for the moment, to be the same for all $f_j$.
To simplify the algebra, we do not consider the sum itself but rather
$$z = \sum_{j=1}^{n} y_j = \sum_{j=1}^{n} \frac{x_j - \mu_j}{\sqrt{n}} = \frac{s - \mu}{\sqrt{n}} \qquad (4.18)$$
where we have set $\mu = \sum \mu_i$. Now take the Fourier transform $\phi_j(k)$ of the distribution
of $y_j$ and make a Taylor expansion around k = 0. Using (4.16) we find
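$$\phi_j(k) = \sum_{m=0}^{\infty} \frac{k^m}{m!} \frac{d^m \phi_j(0)}{dk^m} = \sum_{m=0}^{\infty} \frac{(ik)^m \langle y_j^m \rangle}{m!}$$
$$= 1 + \sum_{m=2}^{\infty} \frac{(ik)^m \langle (x_j - \mu_j)^m \rangle}{m!\; n^{m/2}} = 1 - \frac{k^2 \sigma^2}{2n} + O(n^{-3/2}) \qquad (4.19)$$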
Taking only the first two terms of this expansion we find from (4.15) for the characteristic
function of z
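$$\phi(k) = \left[ 1 - \frac{k^2 \sigma^2}{2n} \right]^n \rightarrow \exp\left( -\tfrac{1}{2} \sigma^2 k^2 \right) \quad \text{for } n \rightarrow \infty. \qquad (4.20)$$
But this is just the characteristic function of a Gaussian with mean zero and width $\sigma$.
Transforming back to the sum s we find
$$p(s) = \frac{1}{\sigma \sqrt{2\pi n}} \exp\left[ -\frac{(s - \mu)^2}{2 n \sigma^2} \right] \qquad (4.21)$$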
It can be shown that the central limit theorem also applies when the widths of the
individual distributions are different in which case the variance of the Gauss is $\sigma^2 = \sum \sigma_i^2$
instead of $n \sigma^2$ as in (4.21). However, the theorem breaks down when one or more
individual widths are much larger than the others, allowing for one or more variables $x_i$
to occasionally dominate the sum. It is also required that all $\mu_i$ and $\sigma_i$ exist so that the
theorem does not apply to, for instance, a sum of Cauchy distributed variables.
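The central limit theorem is easy to observe in a small simulation. The sketch below is an illustration under stated assumptions: the terms are uniform on [0, 1], twelve terms are summed, the seed is arbitrary, and the sum is scaled to unit width (an extra division by sigma compared to (4.18)). For a Gaussian about 68.3% of the samples should fall within one standard deviation.

```python
import math
import random

def standardized_sum(n_terms, rng):
    """(s - mu) / (sigma * sqrt(n)) for a sum s of n uniform variables."""
    mean, sigma = 0.5, math.sqrt(1.0 / 12.0)   # mean and width of one term
    s = sum(rng.random() for _ in range(n_terms))
    return (s - n_terms * mean) / (sigma * math.sqrt(n_terms))

rng = random.Random(12345)
samples = [standardized_sum(12, rng) for _ in range(50_000)]
inside = sum(abs(z) < 1.0 for z in samples) / len(samples)
variance = sum(z * z for z in samples) / len(samples)
print("fraction within one standard deviation:", inside)
print("sample variance:", variance)
```

Replacing the uniform terms by Cauchy distributed ones would spoil the convergence, in line with the remark above.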
Exercise 4.6: Apply (4.16) to the characteristic function (4.17) to show that the mean and
variance of a Gauss distribution are $\mu$ and $\sigma^2$, respectively. Show also that all cumulants (the coefficients of $(ik)^m/m!$ in the expansion of $\ln \phi(k)$)
beyond the second vanish (the Gauss distribution is the only one that has this property).
Exercise 4.7: In Exercise 3.6 we have derived the distribution of the sum of two Gaussian
distributed variables by explicitly calculating the convolution integral (3.14). Derive the
same result by using the characteristic function (4.17). Convince yourself that the central
limit theorem always applies to sums of Gaussian distributed variables even for a finite
number of terms or large differences in width.
5
5.1
The impact of the prior on the outcome of plausible inference is nicely illustrated by
a very instructive example, taken from Sivia [5], where the bias of a coin is determined
from the observation of the number of heads in N throws.
Let us first recapitulate what constitutes a well posed problem so that we can apply
Bayesian inference.
First, we need to define a complete set of hypotheses. For our coin flipping experiment this will be the value of the probability h to obtain a head in a single
throw. The definition range is $0 \le h \le 1$.
Second, we need a model which relates the set of hypotheses to the data. In other
words, we need to construct the likelihood P (D|H, I) for all the hypotheses in the
set. In our case this is the binomial probability to observe n heads in N throws
of a coin with bias h
$$P(n|h, N, I) = \frac{N!}{n!\,(N - n)!}\, h^n (1 - h)^{N - n}. \qquad (5.1)$$
Figure 3: The posterior density p(h|n, N) for n heads in N flips of a coin with bias h = 0.25. In
the top row of plots the prior is uniform, in the middle row it is strongly peaked around h = 0.5 while
in the bottom row the region h < 0.5 has been excluded. The posterior densities are scaled to unit
maximum for ease of comparison. Panels are labeled by heads/throws: 0/0, 3/10, 21/100 and 255/1000; the axes are p(h) versus h.
flat prior (top row of plots), a strong prior preference for h = 0.5 (middle row) and a prior
which excludes the possibility that h < 0.5 (bottom row). It is seen that the flat prior
converges nicely to the correct answer h = 0.25 when the number of throws increases.
The second prior does this too, but more slowly. This is not surprising because we have
encoded, in this case, a quite strong prior belief that the coin is unbiased and it takes
a lot of evidence from the data to change that belief. In the last choice of prior we see
that the posterior cannot go below h = 0.5 since we have excluded this region by setting
the prior to zero. This is an illustration of the fact that no amount of data can change
certainties encoded by the prior as we have already remarked in Exercise 2.6. This can
of course be turned into an advantage since it allows us to exclude physically forbidden
regions from the posterior, like a negative mass for instance.
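The behaviour of the three priors can be reproduced on a grid in h. The sketch below is a rough illustration, not the calculation behind Fig. 3: the width 0.05 of the peaked prior, the grid size and the function name `posterior_on_grid` are assumptions, and 255 heads in 1000 throws is used as example data.

```python
import math

def posterior_on_grid(n_heads, n_throws, prior, grid_size=1001):
    """Posterior p(h|n,N) on a grid in h, scaled to unit maximum."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    # Prior times binomial likelihood (5.1); constant factors drop out.
    unnorm = [prior(h) * h**n_heads * (1 - h)**(n_throws - n_heads) for h in grid]
    peak = max(unnorm)
    return grid, [u / peak for u in unnorm]

def flat(h):
    return 1.0

def peaked(h):
    return math.exp(-0.5 * ((h - 0.5) / 0.05) ** 2)   # assumed width 0.05

def excluded(h):
    return 1.0 if h >= 0.5 else 0.0

for name, prior in (("flat", flat), ("peaked", peaked), ("excluded", excluded)):
    grid, post = posterior_on_grid(255, 1000, prior)
    print(name, "posterior mode at h =", grid[post.index(max(post))])
```

The flat prior puts the mode at the observed fraction, the peaked prior pulls it towards h = 0.5, and the third prior cannot give a mode below h = 0.5.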
From this example it is clear that unsupported information should not enter into the
prior because it may need a lot of data to converge to the correct result in case this
information turns out to be wrong. The maximum entropy principle provides a means to
construct priors which have the property that they are consistent with given boundary
conditions but are otherwise maximally un-informative. Before we proceed with the
maximum entropy principle let us first, in the next section, make a few remarks on
symmetry considerations in the assignment of probabilities.
5.2
Symmetry considerations
In Section 2.5 we have introduced Bernoulli's principle of insufficient reason which states
that, in absence of relevant information, equal probability should be assigned to members of an enumerable set of hypotheses. While this assignment strikes us as being
very reasonable it is worthwhile to investigate if we can find some deeper, or at least
alternative, reason behind this principle.
Suppose we would plot the probabilities assigned to each hypothesis in a bar chart.
If we are in a state of complete ignorance about these hypotheses then it obviously
should not matter how they would be ordered in such a chart. Stated differently, in
absence of additional information the set of hypotheses is invariant under permutations.
But our bar chart of probabilities can only be invariant under permutations if all the
probabilities are the same. Hence the statement that Bernoulli's principle is due to
our complete ignorance is equivalent to the statement that it is due to permutation
invariance.
Similarly, translation invariance implies that the least informative probability distribution of a so-called location parameter is uniform. If, for instance, we are completely
ignorant of the actual whereabouts of the train from Utrecht to Amsterdam then any
location on the track should be, for us, equally probable. In other words, this probability
should obey the relation
p(x|I) dx = p(x + a|I) d(x + a) = p(x + a|I) dx,
(5.3)
(5.4)
But this is only possible when $p(r|I) \propto 1/r$. This probability assignment, which applies
to positive definite scale parameters, is called a Jeffreys prior. Note that a Jeffreys
prior is uniform in ln(r), which means that it assigns equal probability per decade
instead of per unit interval as does a uniform prior.
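The statement about equal probability per decade can be made concrete for a Jeffreys prior normalized on a finite interval [a, b]. The interval used below and the helper name `jeffreys_probability` are illustrative assumptions.

```python
import math

def jeffreys_probability(lo, hi, a, b):
    """Probability of lo <= r <= hi for p(r|I) = 1 / (r ln(b/a)) on [a, b]."""
    return math.log(hi / lo) / math.log(b / a)

a, b = 1e-3, 1e3   # hypothetical finite normalization range (six decades)
p_low_decade = jeffreys_probability(1e-3, 1e-2, a, b)
p_high_decade = jeffreys_probability(1e2, 1e3, a, b)
print(p_low_decade, p_high_decade)
```

Both decades receive the same probability of one sixth, whereas a uniform prior on the same range would put almost all probability in the highest decade.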
Both the uniform and the Jeffreys prior cannot be normalized when the variable ranges
are $x \in [-\infty, \infty]$ or $r \in [0, \infty]$. Such un-normalizable distributions are called improper.
The way to deal with improper distributions is to normalize them on a finite interval
and take the limits to infinity (or zero) at the end of the calculation. This is, by the way,
good practice in any mathematical calculation involving limits. The posterior should,
of course, always remain finite. If not, it is very likely that your problem is either ill
posed or that your data do not carry enough information.
Exercise 5.1: We make an inference on a counting rate R by observing the number of
counts n in a time interval $\Delta t$. Assume that the likelihood $P(n|R \Delta t)$ is Poisson distributed
as defined by (4.12). Assume further a Jeffreys prior for R, defined on the positive interval
$R \in [a, b]$. Show that (i) for $\Delta t = 0$ the posterior is equal to the prior and that we cannot
take the limits $a \to 0$ or $b \to \infty$; (ii) when $\Delta t > 0$ but still n = 0 we can take the limit
$b \to \infty$ but not $a \to 0$; (iii) that the latter limit can be taken once n > 0. (From Gull [12].)
Finally, let us remark that we have only touched here upon very simple cases where no
information, equal probability and invariance are more or less synonymous so that
the above may seem quite trivial. However, in Jaynes [7] you can find several examples
which are far from trivial.
5.3
In the assignments we have made up to now (mainly in Section 4), there was always
enough information to unambiguously determine the probability distribution. For instance if we apply the principle of insufficient reason to a fair die then this leads to a
probability assignment of 1/6 for each face $i = 1, \ldots, 6$. Note that this corresponds to
an expectation value of $\langle i \rangle = 3.5$. But what probability should we assign to each of
the six faces when no information is given about the die except that, say, $\langle i \rangle = 4.5$?
There are obviously an infinite number of probability distributions which satisfy this
constraint so that we have to look elsewhere for a criterion to select one of these.
Jaynes (1957) has proposed to take the distribution which is the least informative by
maximizing the entropy, a concept first introduced by Shannon (1948) in his pioneering
paper on information theory [13]. The entropy carried by a probability distribution is
defined by
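$$S(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \ln\left( \frac{p_i}{m_i} \right) \quad \text{(discrete case)}$$
$$S(p) = -\int p(x) \ln\left( \frac{p(x)}{m(x)} \right) dx \quad \text{(continuous case)} \qquad (5.5)$$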
where $m_i$ or $m(x)$ is the so-called Lebesgue measure (see below) which satisfies
$$\sum_{i=1}^{n} m_i = 1 \quad \text{or} \quad \int m(x)\, dx = 1. \qquad (5.6)$$
The definition (5.5) is such that larger entropies correspond to smaller information
content of the probability distribution. Note that the Lebesgue measure makes the
entropy invariant under coordinate transformations since both p and m transform in
the same way.
To get some feeling for the meaning of this Lebesgue measure imagine a set of swimming
pools on a rainy day. The amount of water collected by each pool will depend on
the distribution of the falling raindrops and on the surface size of each pool. These
surface sizes then play the role of the Lebesgue measure. Some formal insight can be
gained by maximizing (5.5) imposing only the normalization constraint and nothing
else. Restricting ourselves to the discrete case we find, using the method of Lagrange
multipliers, that the following equation has to be satisfied
$$\delta \left[ \sum_{i=1}^{n} p_i \ln\left( \frac{p_i}{m_i} \right) + \lambda \left( \sum_{i=1}^{n} p_i - 1 \right) \right] = 0. \qquad (5.7)$$
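Differentiation of (5.7) with respect to $p_i$ leads to the equation
$$\ln\left( \frac{p_i}{m_i} \right) + 1 + \lambda = 0 \quad \Rightarrow \quad p_i = m_i\, e^{-(\lambda + 1)}.$$
Imposing the normalization constraint $\sum p_i = 1$ we find, using (5.6)
$$\sum_{i=1}^{n} p_i = \left( \sum_{i=1}^{n} m_i \right) e^{-(\lambda + 1)} = e^{-(\lambda + 1)} = 1 \quad \Rightarrow \quad p_i = m_i. \qquad (5.8)$$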
The Lebesgue measure mi or m(x) is thus the least informative probability distribution
in complete absence of information. To specify this Ur-prior we have, again, to look
elsewhere and use symmetry arguments (see Section 5.2) or just common sense. In
practice a uniform or very simple Lebesgue measure is often adequate to describe the
structure of the sampling space.
Let us now suppose that the Lebesgue measure is known (we will simply take it to be
uniform in the following) and proceed by imposing further constraints in the form of so-called testable information. Such testable information is nothing else than constraints
on the probability distributions themselves like specifying moments, expectation values,
etc. Generically, we can write a set of m such constraints as
$$\sum_{i=1}^{n} f_{ki}\, p_i = \beta_k \quad \text{or} \quad \int f_k(x)\, p(x)\, dx = \beta_k, \qquad k = 1, \ldots, m. \qquad (5.9)$$
Using Lagrange multipliers we maximize the entropy by solving the equation, in the
discrete case,
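$$\delta \left[ \sum_{i=1}^{n} p_i \ln\left( \frac{p_i}{m_i} \right) + \lambda_0 \left( \sum_{i=1}^{n} p_i - 1 \right) + \sum_{k=1}^{m} \lambda_k \left( \sum_{i=1}^{n} f_{ki}\, p_i - \beta_k \right) \right] = 0. \qquad (5.10)$$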
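$$\ln\left( \frac{p_i}{m_i} \right) + 1 + \lambda_0 + \sum_{k=1}^{m} \lambda_k f_{ki} = 0 \quad \Rightarrow \quad p_i = m_i \exp(-1 - \lambda_0) \exp\left( -\sum_{k=1}^{m} \lambda_k f_{ki} \right). \qquad (5.11)$$
$$Z(\lambda_1, \ldots, \lambda_m) = \sum_{i=1}^{n} m_i \exp\left( -\sum_{k=1}^{m} \lambda_k f_{ki} \right). \qquad (5.12)$$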
Such partition functions play a very important role in statistical mechanics and thermodynamics since they contain all the information on the system under consideration,
somewhat like the Lagrangian in dynamics. Indeed, our constraints (5.9) are encoded
in Z through
$$-\frac{\partial \ln Z}{\partial \lambda_k} = \beta_k. \qquad (5.13)$$
Exercise 5.2: Prove (5.13) by differentiating the logarithm of (5.12).
The formal solution (5.11) guarantees that the normalization condition is obeyed but is
still void of content since we have not yet determined the unknown Lagrange multipliers
$\lambda_1, \ldots, \lambda_m$ from the equations (5.9) or, equivalently, from (5.13). These equations often
have to be solved numerically; a few simple cases are presented below.
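As an example of such a numerical solution, the sketch below treats a die with a prescribed mean of 4.5 and a uniform Lebesgue measure. The single Lagrange multiplier is found by bisection; the function name `maxent_die`, the bracketing interval and the number of iterations are illustrative assumptions.

```python
import math

def maxent_die(target_mean, faces=(1, 2, 3, 4, 5, 6)):
    """Maximum entropy probabilities for a die with a prescribed mean."""
    def mean_for(lam):
        weights = [math.exp(-lam * i) for i in faces]
        z = sum(weights)                       # partition function
        return sum(i * w for i, w in zip(faces, weights)) / z

    lo, hi = -10.0, 10.0       # the mean decreases monotonically with lam
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_for(mid) > target_mean:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    weights = [math.exp(-lam * i) for i in faces]
    z = sum(weights)
    return lam, [w / z for w in weights]

lam, probs = maxent_die(4.5)
print("lambda =", lam)
print("probabilities =", [round(p, 4) for p in probs])
```

For a target mean of 3.5 the multiplier vanishes and the uniform assignment of 1/6 per face is recovered.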
Assuming a uniform Lebesgue measure, it immediately follows from (5.8) that in absence
of any information we have
pi = p(xi |I) = constant,
in accordance with Bernoulli's principle of insufficient reason. In the continuum limit
this goes over into p(x|I) = constant.
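Let us now consider a continuous distribution defined on $[0, \infty]$ and impose a constraint
on the mean
$$\langle x \rangle = \int_0^\infty x\, p(x|I)\, dx = \mu.$$
Taking the continuum limit of (5.11) we have, assuming a uniform Lebesgue measure,
$$p(x|I) = e^{-\lambda x} \left[ \int_0^\infty e^{-\lambda x}\, dx \right]^{-1} = \lambda\, e^{-\lambda x}.$$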
(5.14)
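$$p(x|I) \propto \exp\left[ -\lambda (x - \mu)^2 \right].$$
$$p(x|\mu, \sigma, I) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right]. \qquad (5.15)$$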
This is then the third time we encounter the Gaussian: in Section 3.1 as a convenient
approximation of the posterior in the neighborhood of the mode, in Section 4.5 as
the limiting distribution of a sum of random variables and in this section as the least
informative distribution consistent with a constraint on the variance.
As an example of a non-uniform Lebesgue measure we will now derive the Poisson
distribution from the maximum entropy principle. We want to know the distribution
$P(n|I)$ of n counts in a time interval $\Delta t$ when there is nothing given but the average
$$\sum_{n=0}^{\infty} n\, P(n|I) = \mu. \qquad (5.16)$$
To find the Lebesgue measure of the time interval $\Delta t$ we divide it into a very large
number (M) of intervals $\delta t$. A particular distribution of counts over these infinitesimal
boxes is called a micro-state. If the micro-states are independent and equally probable
then it follows from the sum rule that the probability of observing n counts is proportional to the number of micro-states which have n boxes occupied. It is easy to see that
for $M \gg n$ this number is given by $M^n/n!$. Upon normalization we then have for the
Lebesgue measure
$$m(n) = \frac{M^n}{n!}\, e^{-M}. \qquad (5.17)$$
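so that
$$\frac{1}{C} = e^{-M} \sum_{n=0}^{\infty} \frac{(M e^{-\lambda})^n}{n!} = e^{-M} \exp\left( M e^{-\lambda} \right),$$
$$P(n|I) = \frac{(M e^{-\lambda})^n}{\exp\left( M e^{-\lambda} \right) n!}. \qquad (5.18)$$
For the average we have
$$\sum_{n=0}^{\infty} n\, \frac{(M e^{-\lambda})^n}{n!} = -\frac{\partial}{\partial \lambda} \sum_{n=0}^{\infty} \frac{(M e^{-\lambda})^n}{n!} = -\frac{\partial}{\partial \lambda} \exp\left( M e^{-\lambda} \right) = M e^{-\lambda} \exp\left( M e^{-\lambda} \right).$$
Combining this with (5.18) we find from the constraint (5.16)
$$M e^{-\lambda} = \mu \quad \Rightarrow \quad P(n|\mu) = \frac{\mu^n}{n!}\, e^{-\mu} \qquad (5.19)$$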
Parameter Estimation
There may also be parameters in the model which have known values. These are, if not
explicitly listed, included in the background information I.
Given a prior distribution for the parameters a and s, Bayes' theorem gives for the joint
posterior distribution
$$p(a, s|d, I)\, da\, ds = \frac{p(d|a, s, I)\, p(a, s|I)\, da\, ds}{\int p(d|a, s, I)\, p(a, s|I)\, da\, ds}. \qquad (6.1)$$
6.1
Gaussian sampling
of the temperature?
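Since the measurements are independent, we can use the product rule (2.7) and write
for the likelihood
$$p(d|\mu, \sigma) = \prod_{i=1}^{n} p(d_i|\mu, \sigma) = \frac{1}{(2\pi \sigma^2)^{n/2}} \exp\left[ -\frac{1}{2} \sum_{i=1}^{n} \left( \frac{d_i - \mu}{\sigma} \right)^2 \right].$$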
(6.3)
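where
$$\bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i \quad \text{and} \quad V = \sum_{i=1}^{n} (d_i - \bar{d})^2.$$
$$p(\mu|\bar{d}, \sigma, n) = \sqrt{\frac{n}{2\pi \sigma^2}} \exp\left[ -\frac{n}{2} \left( \frac{\bar{d} - \mu}{\sigma} \right)^2 \right]. \qquad (6.4)$$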
But this is just a Gaussian with mean $\bar{d}$ and width $\sigma/\sqrt{n}$. Thus we have the well known
result
$$\hat{\mu} = \bar{d} \pm \frac{\sigma}{\sqrt{n}}. \qquad (6.5)$$
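In code the result (6.5) amounts to an average and a division by the square root of n. The readings and the known resolution in this sketch are made-up numbers for illustration, and the function name `estimate_mean` is an assumption.

```python
import math

def estimate_mean(data, sigma):
    """Posterior mean and width for mu when sigma is known, cf. (6.5)."""
    n = len(data)
    d_bar = sum(data) / n
    return d_bar, sigma / math.sqrt(n)

# Hypothetical temperature readings with known resolution sigma = 0.5.
readings = [20.1, 19.6, 20.4, 19.9]
mu_hat, width = estimate_mean(readings, sigma=0.5)
print(f"mu = {mu_hat:.3f} +/- {width:.3f}")
```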
Exercise 6.1: Derive (6.5) directly from (6.3) by expanding $L = -\ln p$ using equations
(3.6), (3.7) and (3.9) in Section 3.1. Calculate the width as the inverse of the Hessian.
$$p(\mu, \sigma|\bar{d}, V, n) = \frac{C}{\sigma^n} \exp\left[ -\frac{V + n(\bar{d} - \mu)^2}{2 \sigma^2} \right] \qquad (6.6)$$
$$p(\mu|\bar{d}, V, n) = \int_0^\infty p(\mu, \sigma|\bar{d}, V, n)\, d\sigma = C \left[ \frac{1}{V + n(\bar{d} - \mu)^2} \right]^{(n-1)/2} \qquad (6.7)$$
This integral converges only for $n \ge 2$ which is not surprising since one measurement
cannot carry information on $\sigma$. When n = 2, the distribution (6.7) is improper (cannot
be normalized on $[-\infty, \infty]$). Calculating the normalization constant C for $n \ge 3$ we
find the Student-t distribution for $\nu = n - 2$ degrees of freedom:
$$p(\mu|\bar{d}, V, n) = \frac{\Gamma[(n-1)/2]}{\Gamma[(n-2)/2]} \sqrt{\frac{n}{\pi V}} \left[ \frac{V}{V + n(\bar{d} - \mu)^2} \right]^{(n-1)/2}. \qquad (6.8)$$
This distribution can be written in a form which has the number of degrees of freedom
as the only parameter.
[Figure 4: panels (a), (b) and (c).]
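$$p(\mu|\bar{d}, V, n)\, d\mu \rightarrow p(t|\nu)\, dt = \frac{\Gamma[(\nu + 1)/2]}{\sqrt{\nu \pi}\; \Gamma(\nu/2)} \left( 1 + \frac{t^2}{\nu} \right)^{-(\nu + 1)/2} dt.$$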
This is the expression for the Student-t distribution usually found in the literature.
Exercise 6.3: Show by expanding the negative logarithm of (6.7) that the maximum of
the posterior is located at $\hat{\mu} = \bar{d}$ and that the Hessian is given by $H = n(n - 1)/V$. Note
that we do not need to know the normalization constant C for this calculation.
Characterizing the posterior by mode and width we find a result similar to (6.5)
$$\hat{\mu} = \bar{d} \pm \frac{S}{\sqrt{n}} \quad \text{with} \quad S^2 = \frac{V}{n - 1}. \qquad (6.9)$$
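When the width of the sampling distribution is not known, (6.9) replaces it by an estimate from the scatter of the data. The sketch below uses made-up readings; the function name `estimate_mean_unknown_sigma` is an assumption.

```python
import math

def estimate_mean_unknown_sigma(data):
    """Mode and width of the marginal posterior of mu, cf. (6.9)."""
    n = len(data)
    d_bar = sum(data) / n
    v = sum((d - d_bar) ** 2 for d in data)   # V = sum of squared residuals
    s = math.sqrt(v / (n - 1))                # S^2 = V / (n - 1)
    return d_bar, s / math.sqrt(n)

readings = [20.1, 19.6, 20.4, 19.9]           # hypothetical data
mu_hat, width = estimate_mean_unknown_sigma(readings)
print(f"mu = {mu_hat:.3f} +/- {width:.3f}")
```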
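$$p(\sigma|\bar{d}, V, n) = \int_{-\infty}^{\infty} p(\mu, \sigma|\bar{d}, V, n)\, d\mu = \frac{C}{\sigma^{n-1}} \exp\left( -\frac{V}{2 \sigma^2} \right). \qquad (6.10)$$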
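Integrating this equation over $\sigma$ to calculate the normalization constant C we find the
$\chi^2$ distribution for $\nu = n - 2$ degrees of freedom
$$p(\sigma|V, n) = \frac{2}{\Gamma(\nu/2)} \left( \frac{V}{2} \right)^{\nu/2} \frac{1}{\sigma^{\nu + 1}} \exp\left( -\frac{V}{2 \sigma^2} \right). \qquad (6.11)$$
This distribution is shown for V = 4 and n = 4 in Fig. 4c.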
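Exercise 6.4: Transform $\chi^2 = V/\sigma^2$ and show that (6.11) can be written in the more
familiar form with $\nu$ as the only parameter
$$p(\sigma|V, n)\, d\sigma \rightarrow p(\chi^2|\nu)\, d\chi^2 = \frac{\left( \tfrac{1}{2} \chi^2 \right)^{\alpha - 1} \exp\left( -\tfrac{1}{2} \chi^2 \right)}{2\, \Gamma(\alpha)}\, d\chi^2$$
with $\alpha = \nu/2$. Use the definition of the Gamma function
$$\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\, dt$$
and the property $\Gamma(z + 1) = z\, \Gamma(z)$ to show that the mean and variance of the $\chi^2$ distribution are given by $\nu$ and $2\nu$, respectively.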
.
=
n1 n1 2
(6.12)
More familiar measures of the $\chi^2$ distribution for $\nu$ degrees of freedom are the average
and the variance (see Exercise 6.4)
$$\langle \chi^2 \rangle = \nu \quad \text{and} \quad \langle (\Delta \chi^2)^2 \rangle = 2\nu.$$
6.2
Least squares
In this section we consider the case that the data can be described by a function f (x; a)
of a variable x depending on a set of parameters a. For simplicity we will consider only
functions of one variable; the extension to more dimensions is trivial. Suppose that
we have made a series of measurements {di } at the sample points {xi } and that each
measurement is distributed according to some sampling distribution $p_i(d_i|\mu_i, \sigma_i)$. Here
$\mu_i$ and $\sigma_i$ characterize the position and the width of the sampling distribution $p_i$ of data
point $d_i$. We parameterize the positions $\mu_i$ by the function f:
$$\mu_i(a) = f(x_i; a).$$
If the measurements are independent we can write for the likelihood
$$p(d|a, I) = \prod_{i=1}^{n} p_i[d_i|\mu_i(a), \sigma_i].$$
Introducing the somewhat more compact notation $p_i(d_i|a,I)$ for the sampling distributions, we write for the posterior distribution
\[ p(a|d,I) = C \left( \prod_{i=1}^{n} p_i(d_i|a,I) \right) p(a|I) \tag{6.13} \]
where C is a normalization constant and p(a|I) is the joint prior distribution of the parameters a. The position and width of the posterior can be found by minimizing
\[ L(a) = -\ln[p(a|d,I)] = -\ln(C) - \ln[p(a|I)] - \sum_{i=1}^{n} \ln[p_i(d_i|a,I)] \tag{6.14} \]
as described in Section 3.1, equations (3.6)–(3.9). In practice this is often done numerically by presenting L(a) to a minimization program like minuit. Note that this procedure can be carried out for any sampling distribution $p_i$, be it Binomial, Poisson, Cauchy, Gauss or whatever. In case the prior p(a|I) is chosen to be uniform, the second term in (6.14) is constant and the procedure is called a maximum likelihood fit.
The most common case encountered in data analysis is when the sampling distributions are Gaussian. For a uniform prior, (6.14) reduces to
\[ L(a) = \text{constant} + \tfrac{1}{2}\chi^2 = \text{constant} + \frac{1}{2} \sum_{i=1}^{n} \left( \frac{d_i - \mu_i(a)}{\sigma_i} \right)^2. \tag{6.15} \]
We then speak of $\chi^2$ minimization or least squares minimization. When the function f(x; a) is linear in the parameters, the minimization can be reduced to a single matrix inversion, as we will now show.
A function which is linear in the parameters can generically be expressed by
\[ f(x; a) = \sum_{\lambda=1}^{m} a_\lambda f_\lambda(x) \tag{6.16} \]
where the $f_\lambda$ are a set of functions of x and the $a_\lambda$ are the coefficients to be determined from the data. We denote by $w_i \equiv 1/\sigma_i^2$ the weight of each data point and write for the log posterior
\[ L(a) = \frac{1}{2} \sum_{i=1}^{n} w_i [d_i - f(x_i; a)]^2 = \frac{1}{2} \sum_{i=1}^{n} w_i \left[ d_i - \sum_{\lambda=1}^{m} a_\lambda f_\lambda(x_i) \right]^2. \tag{6.17} \]
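The mode $\hat{a}$ is found by setting the derivatives of L with respect to the parameters to zero
\[ \frac{\partial L(a)}{\partial a_\lambda} = -\sum_{i=1}^{n} w_i \left[ d_i - \sum_{\mu=1}^{m} a_\mu f_\mu(x_i) \right] f_\lambda(x_i) = 0. \tag{6.18} \]
In matrix notation this reads
\[ b = W a \quad\text{so that}\quad \hat{a} = W^{-1} b \tag{6.19} \]
where
\[ W_{\lambda\mu} = \sum_{i=1}^{n} w_i f_\lambda(x_i) f_\mu(x_i) \quad\text{and}\quad b_\lambda = \sum_{i=1}^{n} w_i d_i f_\lambda(x_i). \tag{6.20} \]
Differentiating once more gives the Hessian
\[ \frac{\partial^2 L(\hat{a})}{\partial a_\lambda \partial a_\mu} = W_{\lambda\mu}. \tag{6.21} \]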
Higher derivatives vanish so that the quadratic expansion (3.8) in Section 3.1 is exact.
To summarize, when the function to be fitted is linear in the parameters we can build a vector b and a matrix W as defined by (6.20). The posterior (assuming uniform priors) is then a multivariate Gaussian with mean $\hat{a} = W^{-1} b$ and covariance matrix $V \equiv H^{-1} = W^{-1}$. In this way, a fit to the data is reduced to one matrix inversion and does not need starting values for the parameters, nor iterations, nor convergence criteria.
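The recipe summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `solve_linear` and `linear_fit`, the straight-line basis and the toy data are hypothetical and not part of the lectures, and a small Gauss-Jordan elimination stands in for whatever matrix inversion routine one would normally use.

```python
def solve_linear(W, b):
    """Solve W a = b by Gauss-Jordan elimination; return (a, inverse of W)."""
    m = len(b)
    # Augment W with b and the identity: [W | b | 1] -> [1 | a | W^-1].
    aug = [list(W[r]) + [b[r]] + [1.0 if r == c else 0.0 for c in range(m)]
           for r in range(m)]
    for col in range(m):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(m):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * pv for v, pv in zip(aug[r], aug[col])]
    a_hat = [row[m] for row in aug]
    covariance = [row[m + 1:] for row in aug]
    return a_hat, covariance


def linear_fit(x, d, sigma, basis):
    """Build W and b as in (6.20) and return the mode and covariance W^-1."""
    w = [1.0 / s ** 2 for s in sigma]  # weights w_i = 1 / sigma_i^2
    m = len(basis)
    W = [[sum(wi * basis[l](xi) * basis[k](xi) for wi, xi in zip(w, x))
          for k in range(m)] for l in range(m)]
    b = [sum(wi * di * basis[l](xi) for wi, di, xi in zip(w, d, x))
         for l in range(m)]
    return solve_linear(W, b)


# Toy data lying exactly on d = 1 + 2x, fitted with f(x; a) = a1 + a2 x.
x = [0.0, 1.0, 2.0, 3.0]
d = [1.0, 3.0, 5.0, 7.0]
sigma = [0.5, 0.5, 0.5, 0.5]
basis = [lambda t: 1.0, lambda t: t]
a_hat, cov = linear_fit(x, d, sigma, basis)
print("best values:", a_hat)
print("errors:", [cov[k][k] ** 0.5 for k in range(len(a_hat))])
```

No starting values or iterations appear anywhere in the sketch, which is the point made in the text; the square roots of the diagonal elements of the returned matrix are the parameter errors.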
Exercise 6.6: Calculate the matrix W and the vector b of (6.20) for a polynomial parameterization of the data
\[ f(x; a) = a_1 + a_2 x + a_3 x^2 + a_4 x^3 + \cdots. \]
Exercise 6.7: Show that a fit to a constant results in the weighted average of the data
\[ \hat{a}_1 = \frac{\sum_i w_i d_i}{\sum_i w_i} \pm \frac{1}{\sqrt{\sum_i w_i}}. \]
6.3 Example: no signal
In this section we describe a typical case where the likelihood prefers a negative value
for a positive definite quantity. Defining confidence limits in such a case is a notoriously
difficult problem in Frequentist statistics. But in our Bayesian approach the solution is
trivial as is illustrated by the following example where a negative counting rate is found
after background subtraction.
A search was made by the NA49 experiment at the CERN SPS for $D^0$ production in a large sample of $4 \times 10^6$ Pb-Pb collisions at a beam energy of 158 GeV per nucleon [18]. Since NA49 does not have secondary vertex reconstruction capabilities, all pairs of positive and negative tracks in the event were accumulated in invariant mass spectra assuming that the tracks originate from the decays $D^0 \to K^-\pi^+$ or $\bar{D}^0 \to K^+\pi^-$. In the left-hand side plot of Fig. 5 we show the invariant mass spectrum of the $D^0$ candidates. The vertical lines indicate a 90 MeV window around the nominal $D^0$ mass. The large combinatorial background is due to a multiplicity of approximately 1400 charged tracks per event giving, for each event, about $5 \times 10^5$ entries in the histogram. In the right-hand side plot we show the invariant mass spectrum after background subtraction. Clearly no signal is observed.
A least squares fit to the data of a Cauchy line shape on top of a polynomial background yielded a negative value $N(D^0 + \bar{D}^0) = -0.36 \pm 0.74$ per event as shown by the full curve in the right-hand side plot of Fig. 5. As already mentioned above, this is a typical example of a case where the likelihood favors an outcome which is unphysical.
[Figure 5 not reproduced; axes: dN/dm (1/GeV) versus m(π, K) (GeV).]
Figure 5: Left: The invariant mass distribution of $D^0$ candidates in 158A GeV Pb-Pb collisions at the CERN SPS. The open (shaded) histograms are before (after) applying cuts to improve the significance. The vertical lines indicate a 90 MeV window around the nominal $D^0$ mass. Right: The $D^0 + \bar{D}^0$ invariant mass spectrum after background subtraction. The full curve is a fit to the data assuming a fixed signal shape. The other curves correspond to model predictions of the $D^0$ yield.

To calculate an upper limit on the $D^0$ yield, Bayesian inference is used as follows. First, the likelihood of the data d is written as a multivariate Gaussian in the parameters a which describe the $D^0$ yield and the background shape
\[ p(d|a,I) = \frac{1}{\sqrt{(2\pi)^n |V|}} \exp\left[ -\tfrac{1}{2} (a - \hat{a})\, V^{-1} (a - \hat{a}) \right] \tag{6.22} \]
where n is the number of parameters and $\hat{a}$ and V are the best values and covariance matrix as obtained from minuit, see also Section 3.1. Taking flat priors for the background parameters leads to the posterior distribution
\[ p(a|d,I) = \frac{C}{\sqrt{(2\pi)^n |V|}} \exp\left[ -\tfrac{1}{2} (a - \hat{a})\, V^{-1} (a - \hat{a}) \right] p(N|I) \tag{6.23} \]
where C is a normalization constant and p(N|I) is the prior for the $D^0$ yield N. The posterior for N is now obtained by integrating (6.23) over the background parameters. As explained in Section 3.1 this yields a one-dimensional Gauss with a variance given by the diagonal element $\sigma^2 = V_{NN}$ of the covariance matrix. Thus we have
\[ p(N|d,I) = \frac{C}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{N - \hat{N}}{\sigma} \right)^2 \right] p(N|I) = C\, g(N; \hat{N}, \sigma)\, p(N|I) \tag{6.24} \]
where we have introduced the short-hand notation g() for a one-dimensional Gaussian density. In (6.24), $\hat{N} = -0.36$ and $\sigma = 0.74$ as obtained from the fit to the data.
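As a last step we encode in the prior p(N|I) our knowledge that N is positive definite
\[ p(N|I) \propto \theta(N) \quad\text{with}\quad \theta(N) = \begin{cases} 0 & \text{for } N < 0 \\ 1 & \text{for } N \geq 0. \end{cases} \tag{6.25} \]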
Inserting (6.25) in (6.24) and integrating over N to calculate the constant C we find
\[ p(N|d,I) = g(N; \hat{N}, \sigma)\, \theta(N) \left[ \int_0^\infty g(N; \hat{N}, \sigma)\, dN \right]^{-1}. \tag{6.26} \]
The posterior distribution is thus a Gaussian with mean and variance as obtained from the fit. This Gaussian is set to zero for N < 0 and re-normalized to unity for $N \geq 0$.
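The upper limit ($N_{\max}$) corresponding to a given confidence level (CL) is then calculated from
\[ \text{CL} = \int_0^{N_{\max}} p(N|d,I)\, dN = \int_0^{N_{\max}} g(N; \hat{N}, \sigma)\, dN \left[ \int_0^\infty g(N; \hat{N}, \sigma)\, dN \right]^{-1}. \tag{6.27} \]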
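Equation (6.27) is easy to evaluate numerically. The sketch below is illustrative only: the function names are hypothetical, the 95% confidence level is an assumed example value rather than one quoted in the text, and a plain bisection is used to invert the relation between $N_{\max}$ and CL.

```python
import math


def gauss_cdf(x, mean, sigma):
    """Cumulative distribution of a one-dimensional Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))


def confidence_level(n_max, n_hat, sigma):
    """CL of (6.27): Gaussian area in [0, n_max] over the area in [0, inf)."""
    below_zero = gauss_cdf(0.0, n_hat, sigma)
    return (gauss_cdf(n_max, n_hat, sigma) - below_zero) / (1.0 - below_zero)


def upper_limit(cl, n_hat, sigma):
    """Find n_max such that confidence_level(n_max) = cl by bisection."""
    lo, hi = 0.0, abs(n_hat) + 10.0 * sigma  # the CL is essentially 1 at hi
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if confidence_level(mid, n_hat, sigma) < cl:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Fit result quoted in the text; the 95% level is an assumed example.
n_hat, sigma = -0.36, 0.74
n_max = upper_limit(0.95, n_hat, sigma)
print("upper limit on N at 95% CL:", n_max)
```

Because the truncated Gaussian is renormalized on $N \geq 0$, the limit is always positive, even though the fitted value is negative.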
6.4 Systematic error propagation
Data are often subject to sources of uncertainty which cause a simultaneous fluctuation of more than one data point. We will call these correlated uncertainties systematic errors, in contrast to statistical errors which are, by definition, point-to-point uncorrelated.
To propagate the systematic uncertainties to the parameters a of interest one often offsets the data by each systematic error in turn, redoes the analysis, and then adds the deviations from the optimal values $\hat{a}$ in quadrature. Such an intuitive ad hoc procedure (offset method) has no sound theoretical foundation and may even spoil your result by assigning errors which are far too large, see [14] for an illustrative example and also Exercise 6.10 below.
To take systematic errors into account we will include them in the data model. This can
of course be done in many ways, depending on the experiment being analyzed. Here we
restrict ourselves to a linear parameterization which has the advantage that it is easily
incorporated in any least squares minimization procedure. This model, as it stands,
does not handle asymmetric errors. However, in case we deal with several systematic
sources these asymmetries tend to vanish by virtue of the central limit theorem.
In Fig. 6 we show a systematic distortion of a set of data points
\[ d_i \to d_i + s\,\Delta_i. \]
Here $\Delta_i$ is a list of systematic deviations and s is an interpolation parameter which controls the amount of systematic shift applied to the data.

[Figure 6 not reproduced; axes: d(x) versus x.]
Figure 6: Systematic distortion (black symbols) of a set of data points (gray symbols) for two values of the interpolation parameter s = +1 (left) and s = −1 (right).

Usually there are several sources of systematic error, so that the data model becomes
\[ d_i = t_i(a) + r_i + \sum_{\lambda=1}^{m} s_\lambda \Delta_{i\lambda}, \tag{6.28} \]
where $t_i(a) = f(x_i; a)$ is the theory prediction containing the parameters a of interest and $\Delta_{i\lambda}$ is the correlated error on point i stemming from source $\lambda$. In (6.28), the uncorrelated statistical fluctuations of the data are described by the independent Gaussian random variables $r_i$ of zero mean and variance $\sigma_i^2$. The $s_\lambda$ are independent Gaussian random variables of zero mean and unit variance which account for the systematic fluctuations. The joint distribution of r and s is thus given by
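\[ p(r,s|I) = \prod_{i=1}^{n} \frac{1}{\sigma_i\sqrt{2\pi}} \exp\left( -\frac{r_i^2}{2\sigma_i^2} \right) \prod_{\lambda=1}^{m} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{s_\lambda^2}{2} \right) \]
with second moments
\[ \langle r_i r_j \rangle = \sigma_i^2 \delta_{ij}, \qquad \langle s_\lambda s_\mu \rangle = \delta_{\lambda\mu}, \qquad \langle r_i s_\lambda \rangle = 0. \tag{6.29} \]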
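Because the data are a linear combination of the Gaussian random variables r and s it follows that d is also Gaussian distributed
\[ p(d|I) = \frac{1}{\sqrt{(2\pi)^n |V|}} \exp\left[ -\tfrac{1}{2} (d - \bar{d})\, V^{-1} (d - \bar{d}) \right]. \tag{6.30} \]
The mean $\bar{d}$ is found by taking the average of (6.28)
\[ \bar{d}_i = \langle d_i \rangle = t_i(a) + \langle r_i \rangle + \sum_{\lambda=1}^{m} \langle s_\lambda \rangle \Delta_{i\lambda} = t_i(a). \tag{6.31} \]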
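A transformation of (6.29) by linear error propagation (see Section 3.2) gives for the covariance matrix
\[ V = \begin{pmatrix} \sigma_1^2 + S_{11} & S_{12} & \cdots & S_{1n} \\ S_{21} & \sigma_2^2 + S_{22} & \cdots & S_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ S_{n1} & S_{n2} & \cdots & \sigma_n^2 + S_{nn} \end{pmatrix} \quad\text{with}\quad S_{ij} = \sum_{\lambda=1}^{m} \Delta_{i\lambda} \Delta_{j\lambda}. \tag{6.32} \]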
Exercise 6.8: Use the propagation formula (3.19) in Section 3.2 to derive (6.32) from
(6.28) and (6.29).
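It is also easy to calculate this covariance matrix by directly averaging the product $\Delta d_i \Delta d_j$. Because all the cross terms vanish by virtue of (6.29) we immediately obtain
\[ \begin{aligned} V_{ij} = \langle \Delta d_i \Delta d_j \rangle &= \Big\langle \Big( r_i + \sum_\lambda s_\lambda \Delta_{i\lambda} \Big) \Big( r_j + \sum_\mu s_\mu \Delta_{j\mu} \Big) \Big\rangle \\ &= \langle r_i r_j \rangle + \sum_\lambda \sum_\mu \Delta_{i\lambda} \Delta_{j\mu} \langle s_\lambda s_\mu \rangle = \sigma_i^2 \delta_{ij} + \sum_\lambda \Delta_{i\lambda} \Delta_{j\lambda}. \end{aligned} \]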
Inserting (6.31) in (6.30) and assuming a uniform prior, the log posterior of the parameters a can be written as
\[ L(a) = -\ln[p(a|d)] = \text{Constant} + \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} [d_i - t_i(a)]\, V^{-1}_{ij}\, [d_j - t_j(a)]. \tag{6.33} \]
Minimizing L defined by (6.33) is impractical because we need the inverse of the covariance matrix (6.32) which can become very large. Furthermore, when the systematic errors dominate, the matrix might, numerically, be uncomfortably close to a matrix with the simple structure $V_{ij} = \Delta_i \Delta_j$, which is singular.
Fortunately (6.33) can be cast into an alternative form which avoids the inversion of
large matrices [15]. Our derivation of this result is based on the standard steps taken in
a Bayesian inference: (i) use the data model to write an expression for the likelihood;
(ii) define prior probabilities; (iii) calculate posterior probabilities with Bayes theorem
and (iv) integrate over the nuisance parameters.
The likelihood p(d|a, s) is calculated from the decomposition in the variables r
\[ p(d|a,s) = \int dr\, p(d,r|a,s) = \int dr\, p(d|r,a,s)\, p(r|a,s). \tag{6.34} \]
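The data model (6.28) is incorporated through the trivial assignment
\[ p(d|r,a,s) = \prod_{i=1}^{n} \delta\Big[ r_i + t_i(a) + \sum_{\lambda=1}^{m} s_\lambda \Delta_{i\lambda} - d_i \Big]. \tag{6.35} \]
For the distribution of r we take the Gaussian of the data model
\[ p(r|a,s) = \prod_{i=1}^{n} \frac{1}{\sigma_i\sqrt{2\pi}} \exp\left( -\frac{r_i^2}{2\sigma_i^2} \right) \tag{6.36} \]
so that the integration over r in (6.34) gives
\[ p(d|a,s) = \prod_{i=1}^{n} \frac{1}{\sigma_i\sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{d_i - t_i(a) - \sum_\lambda s_\lambda \Delta_{i\lambda}}{\sigma_i} \right)^2 \right]. \tag{6.37} \]
Assuming for s the Gaussian prior
\[ p(s|I) \propto \prod_{\lambda=1}^{m} \exp\left( -\tfrac{1}{2} s_\lambda^2 \right) \]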
and a uniform prior for a, the joint posterior distribution can be written as
\[ p(a,s|d) = C \exp\left[ -\frac{1}{2} \sum_{i=1}^{n} w_i \Big( d_i - t_i(a) - \sum_{\lambda=1}^{m} s_\lambda \Delta_{i\lambda} \Big)^2 - \frac{1}{2} \sum_{\lambda=1}^{m} s_\lambda^2 \right] \tag{6.38} \]
where $w_i = 1/\sigma_i^2$. The log posterior $L = -\ln p$ can now numerically be minimized (for instance by minuit) with respect to the parameters a and s. Marginalization of the nuisance parameters s, as described in Section 3.1, then yields the desired result. Clearly we have now got rid of our large covariance matrix (6.32) at the expense of extending the parameter space from a to the combined set of a and s. In global data analysis where many experiments are combined the number of systematic sources s can become quite large so that minimizing L of (6.38) may still not be very attractive.
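However, the fact that L is quadratic in s (the data model is linear in s) allows us to carry out the minimization and marginalization with respect to s analytically. For this, we expand L like in (3.6) but only in s and not in a (it is easy to show that this expansion is exact, i.e. that higher derivatives in s vanish):
\[ L(a,s) = L(a,\hat{s}) + \sum_\lambda \frac{\partial L(a,\hat{s})}{\partial s_\lambda} \Delta s_\lambda + \frac{1}{2} \sum_\lambda \sum_\mu \frac{\partial^2 L(a,\hat{s})}{\partial s_\lambda \partial s_\mu} \Delta s_\lambda \Delta s_\mu. \tag{6.39} \]
The mode $\hat{s}$, where the first derivatives vanish, is given by
\[ \hat{s} = S^{-1} b \quad\text{with}\quad S_{\lambda\mu} = \delta_{\lambda\mu} + \sum_{i=1}^{n} w_i \Delta_{i\lambda} \Delta_{i\mu}, \qquad b_\lambda(a) = \sum_{i=1}^{n} w_i [d_i - t_i(a)] \Delta_{i\lambda} \tag{6.40} \]
and
\[ L(a,\hat{s}) = \frac{1}{2} \sum_{i=1}^{n} w_i \Big( d_i - t_i(a) - \sum_{\lambda=1}^{m} \hat{s}_\lambda \Delta_{i\lambda} \Big)^2 + \frac{1}{2} \sum_{\lambda=1}^{m} \hat{s}_\lambda^2. \tag{6.41} \]
Because the Hessian S does not depend on a, integrating over s only changes the normalization constant, so that $L(a,\hat{s})$ is, up to a constant, the log posterior of a.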
The log posterior (6.41) can now numerically be minimized with respect to the parameters a. Instead of the $n \times n$ covariance matrix V of (6.32) only an $m \times m$ matrix S has to be inverted, with m the number of systematic sources.
The solution (6.40) for s can be substituted back into (6.41). This leads after straightforward algebra to the following very compact and elegant representation of the posterior [15]
\[ p(a|d) = C \exp\left\{ -\frac{1}{2} \left[ \sum_{i=1}^{n} w_i (d_i - t_i)^2 - b\, S^{-1} b \right] \right\}. \tag{6.43} \]
The first term in the exponent is the usual $\chi^2$ in absence of correlated errors while the second term takes into account the systematic correlations. Note that S does not depend on a so that $S^{-1}$ can be computed once and for all. The vector b, on the other hand, does depend on a so that it has to be recalculated at each step in the minimization loop.
Although the posteriors defined by (6.33) and (6.43) look very different, it can be shown (tedious algebra) that they are mathematically identical [16]. In other words, minimizing (6.33) leads to the same result for $\hat{a}$ as minimizing the negative logarithm of (6.43).
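The stated equivalence can be checked numerically for a small example. The sketch below is an illustration under assumptions: the residuals $d_i - t_i(a)$, the errors and the systematic shifts are made-up numbers, and the helper names are hypothetical. It evaluates the quadratic form of (6.33) with the full $n \times n$ covariance matrix, and the exponent of (6.43) with only the $m \times m$ matrix S.

```python
def invert(M):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(M)
    aug = [list(M[r]) + [1.0 if r == c else 0.0 for c in range(n)]
           for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def chi2_covariance(resid, sigma, delta):
    """Quadratic form of (6.33): resid V^-1 resid with V built as in (6.32)."""
    n, m = len(resid), len(delta[0])
    V = [[(sigma[i] ** 2 if i == j else 0.0)
          + sum(delta[i][l] * delta[j][l] for l in range(m))
          for j in range(n)] for i in range(n)]
    Vinv = invert(V)
    return sum(resid[i] * Vinv[i][j] * resid[j]
               for i in range(n) for j in range(n))


def chi2_nuisance(resid, sigma, delta):
    """Bracket in the exponent of (6.43): sum_i w_i resid_i^2 - b S^-1 b."""
    n, m = len(resid), len(delta[0])
    w = [1.0 / s ** 2 for s in sigma]
    S = [[(1.0 if l == k else 0.0)
          + sum(w[i] * delta[i][l] * delta[i][k] for i in range(n))
          for k in range(m)] for l in range(m)]
    b = [sum(w[i] * resid[i] * delta[i][l] for i in range(n))
         for l in range(m)]
    Sinv = invert(S)
    b_s_b = sum(b[l] * Sinv[l][k] * b[k] for l in range(m) for k in range(m))
    return sum(w[i] * resid[i] ** 2 for i in range(n)) - b_s_b


# Made-up example: n = 4 data points, m = 2 systematic sources.
resid = [0.3, -0.1, 0.4, 0.2]          # d_i - t_i(a)
sigma = [0.2, 0.3, 0.25, 0.2]          # statistical errors
delta = [[0.1, 0.05], [0.1, -0.05], [0.1, 0.02], [0.1, 0.0]]  # delta[i][lambda]
full = chi2_covariance(resid, sigma, delta)
compact = chi2_nuisance(resid, sigma, delta)
print(full, compact)
```

The two numbers agree to rounding precision, while the second route inverts a 2 × 2 instead of a 4 × 4 matrix; for many data points and few systematic sources the saving becomes substantial.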
It is clear that the uncertainty on parameters derived from a given data sample should decrease when new data are added to the sample. This is because additional data will always increase the available information even when these data are very uncertain. From this it follows immediately that the error obtained from an analysis of the total sample can never be larger than the error obtained from an analysis of any sub-sample of the data. It turns out that error estimates based on equations (6.40)–(6.43) do meet this requirement but that the offset method, mentioned in the beginning of this section, does not. This issue is investigated further in the following exercise.
Exercise 6.10: We make n measurements $d_i \pm \sigma_i$ of the temperature in a room. The measurements have a common systematic offset error $\Delta$. Calculate the best estimate of the temperature and the total error (statistical $\oplus$ systematic) by: (i) using the offset method mentioned at the beginning of this section and (ii) using (6.43). To simplify the algebra assume that all data and errors have the same value: $d_i = d$, $\sigma_i = \sigma$.
A second set of n measurements is added using another thermometer which has the same resolution but no offset uncertainty. Calculate the best estimate and the total error from both data sets using either the offset method or (6.43). Again, assume that $d_i = d$ and $\sigma_i = \sigma$ to simplify the algebra.
Now let $n \to \infty$. Which error estimate makes sense in this limit and which does not?
The systematic errors described in this section were treated as offsets. Another important class are scale or normalization errors. Scale uncertainties are non-linear and their treatment is beyond the scope of this write-up although it is similar to that described above for offset errors: include scale parameters s in the data model, calculate the joint posterior p(a, s|d) assuming a normal, or perhaps a lognormal (see footnote 23), prior distribution centered around unity and, finally, integrate over s. Considerable care should be taken in correctly formulating the log likelihood (or log posterior) since scale uncertainties not only affect the position but also the widths of the sampling distributions. Therefore normalization terms, which usually are ignored, suddenly become dependent on the (scale) parameters so that they should be taken into account in the minimization. For more on the treatment of normalization uncertainties and possible large biases in the fit results we refer to [17].
Footnote 23: When x is normally (i.e. Gaussian) distributed then $y = e^x$ follows the lognormal distribution.

6.5 Model selection
the hypothesis space to allow for several competing models and ask the question which
model should be preferred in light of the evidence provided by the data. The Bayesian
procedure which provides, at least in principle, an answer to this question is called
model selection. As we will see, this model selection is not only based on the quality of the data description (goodness of fit) but also on a factor which penalizes models which have a larger number of parameters. Bayesian model selection thus automatically applies Occam's razor in preferring, to a certain degree, the simpler models.
As already mentioned several times before, Bayesian inference can only assess the plausibility of a hypothesis when this hypothesis is a member of an exclusive and exhaustive set. One can of course always complement a given hypothesis H by its negation $\bar{H}$ but this does not bring us very far since $\bar{H}$ is usually too vague a condition for a meaningful probability assignment (see footnote 24). Thus, in general, we have to include our model H into a finite set of mutually exclusive and exhaustive alternatives $\{H_k\}$. This obviously restricts the outcome of our selection procedure to one of these alternatives but allows us, on the other hand, to use Bayes' theorem (2.13) to assign posterior probabilities to each of the $H_k$
\[ P(H_k|D,I) = \frac{P(D|H_k,I)\, P(H_k|I)}{\sum_i P(D|H_i,I)\, P(H_i|I)}. \tag{6.44} \]
To avoid calculating the denominator, one often works with the so-called odds ratio (i.e. a ratio of probabilities)
\[ O_{kj} = \underbrace{\frac{P(H_k|D,I)}{P(H_j|D,I)}}_{\text{Posterior odds}} = \underbrace{\frac{P(H_k|I)}{P(H_j|I)}}_{\text{Prior odds}} \; \underbrace{\frac{P(D|H_k,I)}{P(D|H_j,I)}}_{\text{Bayes factor}}. \tag{6.45} \]
The first term on the right-hand side is called the prior odds and the second term the
Bayes factor.
The selection problem can thus be solved by calculating the odds with (6.45) and accepting hypothesis k if $O_{kj}$ is much larger than one, declaring the data to be inconclusive if the ratio is about unity and rejecting k in favor of one of the alternatives if $O_{kj}$ turns out to be much smaller than one. The prior odds are usually set to unity unless there is strong prior information in favor of one of the hypotheses. However, when the hypotheses in (6.45) are composite, then not only the prior odds depend on prior information but also the Bayes factor.
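To see this, we follow Sivia [5] in working out an illustrative example where the choice is between a parameter-free hypothesis $H_0$ and an alternative $H_1$ with one free parameter $\lambda$. Let us denote a set of data points by d and decompose the probability density $p(d|H_1)$ into
\[ p(d|H_1) = \int p(d,\lambda|H_1)\, d\lambda = \int p(d|\lambda,H_1)\, p(\lambda|H_1)\, d\lambda. \tag{6.46} \]
To evaluate (6.46) we assume a uniform prior for $\lambda$ in a finite range $\Delta\lambda$ and write
\[ p(\lambda|H_1) = \frac{1}{\Delta\lambda}. \tag{6.47} \]
Footnote 24: We could, for instance, describe a detector response to pions by a probability $P(d|\pi)$. However, it would be very hard to assign something like a not-pion probability $P(d|\bar{\pi})$ without specifying the detector response to members of an alternative set of particles like electrons, kaons, protons etc.

Expanding the negative log likelihood around its minimum at $\hat{\lambda}$ gives
\[ L(\lambda) = -\ln[p(d|\lambda,H_1)] = L(\hat{\lambda}) + \frac{1}{2} \left[ \frac{\lambda - \hat{\lambda}}{\sigma} \right]^2 + \cdots \]
so that
\[ p(d|\lambda,H_1) \approx p(d|\hat{\lambda},H_1) \exp\left[ -\frac{1}{2} \left( \frac{\lambda - \hat{\lambda}}{\sigma} \right)^2 \right]. \tag{6.48} \]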
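Finally, inserting (6.47) and (6.48) in (6.46) gives, upon integration over $\lambda$
\[ p(d|H_1) \approx p(d|\hat{\lambda},H_1)\, \frac{\sigma\sqrt{2\pi}}{\Delta\lambda}. \tag{6.49} \]
Exercise 6.11: Derive an expression for the posterior of $\lambda$ by inserting Eqs. (6.47), (6.48) and (6.49) in Bayes' theorem
\[ p(\lambda|d,H_1) = \frac{p(d|\lambda,H_1)\, p(\lambda|H_1)}{p(d|H_1)}. \]
The odds ratio of $H_0$ to $H_1$ then becomes
\[ \underbrace{\frac{P(H_0|d,I)}{P(H_1|d,I)}}_{\text{Posterior odds}} = \underbrace{\frac{P(H_0|I)}{P(H_1|I)}}_{\text{Prior odds}} \times \underbrace{\frac{p(d|H_0)}{p(d|H_1,\hat{\lambda})}}_{\text{Likelihood ratio}} \times \underbrace{\frac{\Delta\lambda}{\sigma\sqrt{2\pi}}}_{\text{Occam factor}}. \tag{6.50} \]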
As already mentioned above, the prior odds can be set to unity unless there is information which gives us prior preference for one model over another. The likelihood ratio will, in general, be smaller than unity and therefore favor the model $H_1$. This is because the additional flexibility of an adjustable parameter usually yields a better description of the data. This preference for models with more parameters leads to the well known phenomenon that one can fit an elephant with enough free parameters (see footnote 25). This clearly illustrates the inadequacy of using the fit quality as the only criterion in model selection. Indeed, such a criterion alone could never favor a simpler model.
Intuitively we would prefer a model that gives a good description of the data in a wide range of parameter values over one with many fine-tuned parameters, unless the latter would provide a significantly better fit. Such an application of Occam's razor is encoded by the so-called Occam factor in (6.50). This factor tends to favor $H_0$ since it penalizes $H_1$ for reducing a wide parameter range to a smaller range allowed by the fit. Here we immediately face the problem that $H_0$ would always be favored in case $\Delta\lambda$ is set to infinity. As far as we know, there is no other way but setting prior parameter ranges as honestly as possible when dealing with model selection problems.
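The interplay between likelihood ratio and Occam factor in (6.50) can be made concrete with a small sketch. It is illustrative only: the function name and the numbers in the example are assumptions, and the Gaussian approximation (6.48) is taken for granted, which requires the likelihood peak to lie well inside the prior range.

```python
import math


def odds_h0_over_h1(like_h0, like_h1_best, sigma, prior_range, prior_odds=1.0):
    """Posterior odds of (6.50) for a parameter-free H0 versus a one-parameter H1.

    like_h0      : p(d|H0)
    like_h1_best : likelihood of H1 at the best-fit parameter value
    sigma        : width of the H1 likelihood in the parameter
    prior_range  : width of the uniform prior on the parameter
    """
    likelihood_ratio = like_h0 / like_h1_best
    occam_factor = prior_range / (sigma * math.sqrt(2.0 * math.pi))
    return prior_odds * likelihood_ratio * occam_factor


# Assumed example: H1 fits ten times better, but its parameter was only
# known a priori to lie in a range 100 times wider than the fitted width.
odds = odds_h0_over_h1(like_h0=0.1, like_h1_best=1.0, sigma=0.1, prior_range=10.0)
print("posterior odds H0 : H1 =", odds)
```

In this example the Occam factor outweighs the better fit of $H_1$, so the simpler model comes out ahead; shrinking the prior range reverses the verdict, which is why the range has to be set honestly.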
In case $H_i$ and $H_j$ both have one free parameter ($\lambda$ and $\mu$) the odds ratio becomes
\[ \frac{P(H_i|d,I)}{P(H_j|d,I)} = \frac{P(H_i|I)}{P(H_j|I)}\, \frac{p(d|H_i,\hat{\lambda})}{p(d|H_j,\hat{\mu})}\, \frac{\Delta\mu\, \sigma_\lambda}{\Delta\lambda\, \sigma_\mu}. \tag{6.51} \]
Footnote 25: Including as many parameters as data points will cause any model to perfectly describe the data.

For similar prior ranges $\Delta\lambda$ and $\Delta\mu$ the likelihood ratio has to overcome the penalty factor $\sigma_\lambda/\sigma_\mu$. This factor favors the model for which the likelihood has the largest width. It may seem a bit strange that the less discriminating model is favored but inspection of (6.49) shows that the evidence P(D|H) carried by the data tends to be larger for models with a larger ratio $\sigma/\Delta\lambda$, that is, for models which cause a smaller collapse of the hypothesis space when confronted with the data. Note that the choice of prior range is less critical in (6.51) than in (6.50) so that this poses not much of a problem when we use Bayesian model selection to choose between, say, a Breit-Wigner or a Gaussian peak shape in an invariant mass spectrum.
Finally, let us remark that the Occam factor varies like the power of the number of free
parameters so that models with many parameters may get very strongly disfavored by
this factor.
Exercise 6.12: Generalize (6.51) to the case where $H_i$ has n free parameters and $H_j$ has m free parameters (with $n \neq m$).
This presentation of model selection had to remain very sketchy since practical applications depend strongly on the details of the selection problem at hand. For many
worked-out examples we refer to Sivia [5], Gregory [6], Bretthorst [3] and Loredo [1]
which also contains a discussion on goodness-of-fit tests used in Frequentist model selection.
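7 Counting

7.1 Binomial counting

\[ P(n|N,h,I) = \frac{N!}{n!(N-n)!}\, h^n (1-h)^{N-n}. \tag{7.1} \]
If we define as our statistic for h the ratio R = n/N, the average and variance of R are given by (4.8)
\[ \langle R \rangle = \frac{\langle n \rangle}{N} = h, \qquad \langle \Delta R^2 \rangle = \frac{h(1-h)}{N}. \tag{7.2} \]
From Bayes' theorem we obtain for the posterior of h
\[ p(h|n,N,I)\, dh = C\, P(n|N,h,I)\, p(h|I)\, dh \tag{7.3} \]
which, for a uniform prior, becomes after normalization
\[ p(h|n,N,I)\, dh = \frac{(N+1)!}{n!(N-n)!}\, h^n (1-h)^{N-n}\, dh. \tag{7.4} \]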
[Figure 7 not reproduced; three panels showing p(h) versus h.]
Figure 7: The posterior density p(h|n, N) for n heads in the first three flips of a coin with bias h = 0.25. A uniform prior distribution of h has been assumed. The densities are scaled to unit maximum for ease of comparison.

In Fig. 7 we plot the evolution of the posterior for the first three throws (for more throws see Fig. 3 in Section 5). From this plot, and also from (7.4), it is seen that the distribution vanishes at the edges of the interval only when at least one head and one tail has been observed. Thus for $1 \leq n \leq N-1$ the distribution has its peak inside the interval so that it makes sense to calculate the position and width from $L = -\ln p$. This gives
\[ \hat{h} = \frac{n}{N} \pm \sqrt{\frac{\hat{h}(1-\hat{h})}{N}} \qquad (1 \leq n \leq N-1). \tag{7.5} \]
Note, as a curiosity, that the expressions given above are similar to those for $\langle R \rangle$ and $\langle \Delta R^2 \rangle$ given in (7.2).
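A quick numerical cross-check of (7.5) is sketched below. The function names and the example counts (n = 3 heads in N = 12 throws) are assumptions made for illustration; the second derivative of $L = -\ln p$ is estimated by a central finite difference rather than derived analytically, so the sketch does not replace the derivation asked for in the exercise that follows.

```python
import math


def binomial_posterior_summary(n, N):
    """Mode and width of the posterior (7.4) as given by (7.5)."""
    if not 1 <= n <= N - 1:
        raise ValueError("(7.5) only holds for 1 <= n <= N - 1")
    h_hat = n / N
    width = math.sqrt(h_hat * (1.0 - h_hat) / N)
    return h_hat, width


def neg_log_posterior(h, n, N):
    """L = -ln p for the posterior (7.4), up to an h-independent constant."""
    return -n * math.log(h) - (N - n) * math.log(1.0 - h)


# Assumed example: 3 heads in 12 throws.
n, N = 3, 12
h_hat, width = binomial_posterior_summary(n, N)

# Finite-difference estimate of the second derivative of L at the mode.
eps = 1e-4
second_derivative = (neg_log_posterior(h_hat + eps, n, N)
                     - 2.0 * neg_log_posterior(h_hat, n, N)
                     + neg_log_posterior(h_hat - eps, n, N)) / eps ** 2
numerical_width = 1.0 / math.sqrt(second_derivative)
print(h_hat, width, numerical_width)
```

The width from the curvature of L agrees with the closed form to the accuracy of the finite difference.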
Exercise 7.1: Derive (7.5)
Exercise 7.2: A counter is traversed by N particles and fires N times. Calculating the efficiency and error from (7.2) gives $\varepsilon = 1 \pm 0$ which is an unacceptable estimate for the error. Derive from (7.4) an expression for the lower limit $\varepsilon_\alpha$ corresponding to the $\alpha$ confidence interval defined by the equation
\[ \int_{\varepsilon_\alpha}^{1} p(\varepsilon|N,N)\, d\varepsilon = \alpha. \]
Show that for N = 4 and $\alpha = 0.65$ the result on the efficiency can be reported as
\[ \varepsilon = 1\,^{+0}_{-0.19} \qquad (65\%\ \text{CL}) \]

7.2
Instead of throwing the coin N times we may decide to throw the coin as many times as is necessary to observe n heads. Because the last throw must by definition be a head, and because the probability of this throw does not depend on the previous throws, the probability of N throws is given by
\[ \begin{aligned} P(N|n,h) &= P(n-1 \text{ heads in } N-1 \text{ throws}) \times P(\text{one head in one throw}) \\ &= \frac{(N-1)!}{(n-1)!\,(N-n)!}\, h^n (1-h)^{N-n}, \qquad n \geq 1,\; N \geq n. \end{aligned} \tag{7.6} \]
This distribution is known as the negative binomial. In Fig. 8 we show this distribution for a fair coin (h = 0.5) and n = (3, 9, 20) heads.

[Figure 8 not reproduced; three panels showing P(N|3, 0.5), P(N|9, 0.5) and P(N|20, 0.5) versus N.]
Figure 8: The negative binomial P(N|n, h) distribution of the number of trials N needed to observe n = (3, 9, 20) heads in flips of a fair coin (h = 0.5).
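It can be shown that P(N|n, h) is properly normalized
\[ \sum_{N=n}^{\infty} P(N|n,h) = 1. \]
The first and second moments and the variance of this distribution are
\[ \begin{aligned} \langle N \rangle &= \sum_{N=n}^{\infty} N\, P(N|n,h) = \frac{n}{h} \\ \langle N^2 \rangle &= \sum_{N=n}^{\infty} N^2 P(N|n,h) = \frac{n(1-h)}{h^2} + \frac{n^2}{h^2} \\ \langle \Delta N^2 \rangle &= \frac{n(1-h)}{h^2}. \end{aligned} \tag{7.7} \]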
If we define the ratio Q = N/n as our statistic for $z \equiv 1/h$ it follows directly from (7.7) that the average and variance of Q are given by
\[ \langle Q \rangle = z, \qquad \langle \Delta Q^2 \rangle = \frac{z(z-1)}{n}. \tag{7.8} \]
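The normalization and the moments (7.7) can be verified by summing the series numerically. The sketch below is illustrative: the function name is hypothetical, the values n = 3 and h = 0.5 are taken from the first panel of Fig. 8, and the infinite sum is truncated at N = 400, where the remaining terms are negligible for this choice of h.

```python
import math


def neg_binomial(N, n, h):
    """Negative binomial (7.6): probability that the n-th head occurs at throw N."""
    return math.comb(N - 1, n - 1) * h ** n * (1.0 - h) ** (N - n)


n, h = 3, 0.5
# Truncate the infinite sum; the tail beyond N = 400 is negligible for h = 0.5.
terms = [(N, neg_binomial(N, n, h)) for N in range(n, 400)]
norm = sum(p for _, p in terms)
mean = sum(N * p for N, p in terms)
variance = sum(N * N * p for N, p in terms) - mean ** 2

print("normalization:", norm)
print("mean:", mean, "expected n/h =", n / h)
print("variance:", variance, "expected n(1-h)/h^2 =", n * (1 - h) / h ** 2)
```

Dividing the mean by n and the variance by n squared reproduces the average and variance of Q in (7.8).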
In the previous section we took R = n/N as a statistic for h because it had the property that $\langle R \rangle = h$, when averaged over the binomial distribution. But if we average R over the negative binomial (7.6) then $\langle R \rangle \neq h$. This can easily be seen, without any explicit calculation, from the fact that N and not n is the random variable and that the reciprocal of an average is not the average of the reciprocal:
\[ \langle R \rangle = \left\langle \frac{n}{N} \right\rangle = n \left\langle \frac{1}{N} \right\rangle \neq \frac{n}{\langle N \rangle} = h. \]
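To calculate the posterior we again assume a flat prior for h so that we can write
\[ p(h|N,n)\, dh = C\, \frac{(N-1)!}{(n-1)!\,(N-n)!}\, h^n (1-h)^{N-n}\, dh. \]
Normalization to unity gives $C = N(N+1)/n$ so that
\[ p(h|N,n)\, dh = \frac{(N+1)!}{n!(N-n)!}\, h^n (1-h)^{N-n}\, dh. \tag{7.9} \]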
This posterior is the same as that given in (7.4) as it should be since the relevant information on h is carried by the observation of how many heads there are in a certain number of throws and not by how the experiment was halted.
It is worthwhile to have a closer look at the likelihoods (7.1) and (7.6) to understand why the posteriors come out to be the same. It is seen that the dependence of both likelihoods on h is given by the term
\[ h^n (1-h)^{N-n}. \]
The terms in front are different because of the different stopping rules but these terms do not enter in the inference on h since they do not depend on h. They can thus be absorbed in the normalization constant of the posterior which can in both cases be written as
\[ p(h|n,N)\, dh = C\, h^n (1-h)^{N-n}\, dh. \]
Here we have encountered a very important property of Bayesian inference, namely its ability to discard information which is irrelevant. This is in accordance with Cox's desideratum of consistency which states that conclusions should depend on relevant information only. Frequentist analysis does not possess this property since the stopping rule must be specified in order to construct the sampling distribution and a meaningful statistic. Such inference therefore violates at least one of Cox's desiderata.
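That the stopping rule drops out can also be seen numerically. The following sketch, with hypothetical function names, an assumed example of n = 3 heads in N = 12 throws and a simple trapezoidal normalization on a grid in h, computes the posterior once from the binomial likelihood (7.1) and once from the negative binomial likelihood (7.6), assuming a flat prior in both cases.

```python
import math


def binomial_likelihood(h, n, N):
    """Likelihood (7.1): the number of throws N is fixed in advance."""
    return math.comb(N, n) * h ** n * (1.0 - h) ** (N - n)


def negative_binomial_likelihood(h, n, N):
    """Likelihood (7.6): throwing stops when the n-th head appears."""
    return math.comb(N - 1, n - 1) * h ** n * (1.0 - h) ** (N - n)


def normalized_posterior(likelihood, n, N, grid_points=2001):
    """Posterior on a grid in h for a flat prior, normalized by the trapezoid rule."""
    hs = [i / (grid_points - 1) for i in range(grid_points)]
    values = [likelihood(h, n, N) for h in hs]
    step = hs[1] - hs[0]
    area = step * (sum(values) - 0.5 * (values[0] + values[-1]))
    return [v / area for v in values]


post_fixed_N = normalized_posterior(binomial_likelihood, 3, 12)
post_fixed_n = normalized_posterior(negative_binomial_likelihood, 3, 12)
max_diff = max(abs(a - b) for a, b in zip(post_fixed_N, post_fixed_n))
print("largest difference between the two posteriors:", max_diff)
```

The two likelihoods differ only by an h-independent factor, which the normalization removes, so the posteriors coincide up to rounding errors.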
Exercise 1.1
This is because astronomers often have to draw conclusions from the observation of rare events. Bayesian
inference is well suited for this since it is based solely on the evidence carried by the data (and prior
information) instead of being based on hypothetical repetitions of the experiment.
Exercise 2.3
and A B are true if and only if A is false. But
(i) From the truth table (2.1) it is seen that both B
A.
                 Induction                     Deduction
Proposition 1:   If A is true then B is true   If A is true then B is true
Proposition 2:   A is false                    B is false
Conclusion:      B is less probable            A is false
Exercise 2.4
From de Morgan's law and repeated application of the product and sum rules (2.4) and (2.5) we find
\[ \begin{aligned} P(A+B) &= 1 - P(\overline{A+B}) = 1 - P(\bar{A}\bar{B}) \\ &= 1 - P(\bar{B}|\bar{A})P(\bar{A}) = 1 - P(\bar{A})[1 - P(B|\bar{A})] \\ &= P(A) + P(\bar{A}|B)P(B) = P(A) + P(B)[1 - P(A|B)] \\ &= P(A) + P(B) - P(A|B)P(B) = P(A) + P(B) - P(AB). \end{aligned} \]
Exercise 2.5
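(i) The probability for Mr. White to have AIDS is
\[ P(A|T) = \frac{P(T|A)P(A)}{P(T|A)P(A) + P(T|\bar{A})P(\bar{A})} = \frac{0.98 \times 0.01}{0.98 \times 0.01 + 0.03 \times 0.99} = 0.25. \]
(ii) For $P(T|A) = 1$ we find
\[ P(A|T) = \frac{1 \times 0.01}{1 \times 0.01 + 0.03 \times 0.99} = 0.25. \]
(iii) For zero contamination $P(T|\bar{A}) = 0$ so that
\[ P(A|T) = \frac{0.98 \times 0.01}{0.98 \times 0.01 + 0 \times 0.99} = 1. \]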
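The three cases above are instances of the same two-hypothesis form of Bayes' theorem, which can be wrapped in a small function. The sketch is illustrative; the function and argument names are hypothetical, and the numbers are those used in the solution.

```python
def posterior_given_positive(p_pos_given_a, p_pos_given_not_a, prior_a):
    """P(A|T) from Bayes' theorem for the exhaustive pair (A, not A)."""
    numerator = p_pos_given_a * prior_a
    return numerator / (numerator + p_pos_given_not_a * (1.0 - prior_a))


case_i = posterior_given_positive(0.98, 0.03, 0.01)    # efficiency 98%, contamination 3%
case_ii = posterior_given_positive(1.00, 0.03, 0.01)   # full efficiency
case_iii = posterior_given_positive(0.98, 0.00, 0.01)  # zero contamination
print(round(case_i, 2), round(case_ii, 2), round(case_iii, 2))
```

Raising the efficiency to 100% hardly changes the posterior, whereas removing the contamination drives it to one: with a 1% prior it is the false positive rate that controls the result.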
Exercise 2.6
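(i) If nobody has AIDS then P(A) = 0 and thus
\[ P(A|T) = \frac{P(T|A) \times 0}{P(T|A) \times 0 + P(T|\bar{A}) \times 1} = 0. \]
(ii) If everybody has AIDS then P(A) = 1 and thus
\[ P(A|T) = \frac{P(T|A) \times 1}{P(T|A) \times 1 + P(T|\bar{A}) \times 0} = 1. \]
In both these cases the posterior is thus equal to the prior independent of the likelihood P(T|A).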
Exercise 2.7
(i) When is known we can write for the posterior distribution
P (|S, ) =
P (S|)P ()
=
.
P (S|)P () + P (S|
)P (
)
+ (1 )
Assuming a uniform prior p() = 1 gives for the probability that the signal S corresponds to a pion
Z 1
[ + ln(/)]
d
,
P (|S) =
=
+ (1 )
( )2
0
where we have used the Mathematica program to evaluate the integral.
Exercise 2.8
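The quantities x, $x_0$, d and $\theta$ are related by $x = x_0 + d \tan\theta$. With $p(\theta|I) = 1/\pi$ it follows that p(x|I) is Cauchy distributed
\[ p(x|I) = p(\theta|I) \left| \frac{d\theta}{dx} \right| = \frac{1}{\pi} \frac{\cos^2\theta}{d} = \frac{1}{\pi} \frac{d}{(x-x_0)^2 + d^2}. \]
For $L = -\ln p$ the second derivative is
\[ \frac{d^2L}{dx^2} = -\frac{4(x-x_0)^2}{[(x-x_0)^2 + d^2]^2} + \frac{2}{(x-x_0)^2 + d^2} \]
so that at the maximum $\hat{x} = x_0$
\[ \frac{1}{\sigma^2} = \frac{d^2L(\hat{x})}{dx^2} = \frac{2}{d^2} \quad\Rightarrow\quad \sigma = \frac{d}{\sqrt{2}}. \]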
Exercise 2.9
(i) The posterior distribution of the first measurement is
P (|S1 ) =
P (S1 |)P ()
=
.
P (S1 |)P () + P (S1 |
)P (
)
+ (1 )
P (S2 |, S1 )P (|S1 )
2
= = 2
.
P (S2 |, S1 )P (|S1 ) + P (S2 |
, S1 )P (
|S1 )
+ 2 (1 )
Here we have assumed that the two measurements are independent, that is,
P (S2 |, S1 ) = P (S2 |) =
P (S2 |
, S1 ) = P (S2 |
) = .
2
P (S1 , S2 |)P ()
= 2
.
P (S1 , S2 |)P () + P (S1 , S2 |
)P (
)
+ 2 (1 )
Here we have again assumed that the two measurements are independent, that is,
P (S1 , S2 |) = P (S1 |)P (S2 |) = 2
P (S1 , S2 |
) = P (S1 |
)P (S2 |
) = 2.
Both results are thus the same when we assume that the two measurements are independent.
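Exercise 3.1
Because averaging is a linear operation we have
\[ \langle \Delta x^2 \rangle = \langle (x - \langle x \rangle)^2 \rangle = \langle x^2 \rangle - 2\langle x \rangle^2 + \langle x \rangle^2 = \langle x^2 \rangle - \langle x \rangle^2. \]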
Exercise 3.2
The covariance matrix can be written as
\[ V_{ij} = \langle (x_i - \bar{x}_i)(x_j - \bar{x}_j) \rangle = \langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle. \]
For independent variables the joint probability factorizes $p(x_i, x_j|I) = p(x_i|I)\, p(x_j|I)$ so that
\[ \langle x_i x_j \rangle = \int dx_i\, x_i\, p(x_i|I) \int dx_j\, x_j\, p(x_j|I) = \langle x_i \rangle \langle x_j \rangle. \]
This implies that the off-diagonal elements of $V_{ij}$ vanish.
Exercise 3.3
(i) It is easiest to make the transformation $y = 2(x - x_0)/\Gamma$ so that the Breit-Wigner transforms into
\[ p(x|x_0,\Gamma,I) \to p(y|I) = p(x|I) \left| \frac{dx}{dy} \right| = \frac{1}{\pi} \frac{1}{1+y^2}. \]
For this distribution $L = \ln\pi + \ln(1+y^2)$. The first and second derivatives of L are given by
\[ \frac{dL}{dy} = \frac{2y}{1+y^2}, \qquad \frac{d^2L}{dy^2} = -\frac{4y^2}{(1+y^2)^2} + \frac{2}{1+y^2}. \]
From dL/dy = 0 we find $\hat{y} = 0$. It follows that the second derivative of L at $\hat{y}$ is 2 so that the width of the distribution is $\sigma_y = 1/\sqrt{2}$. Transforming back to the variable $x = x_0 + \Gamma y/2$ gives
\[ \hat{x} = x_0 \quad\text{and}\quad \sigma_x = \left| \frac{dx}{dy} \right| \sigma_y = \frac{\Gamma}{2\sqrt{2}}. \]
(ii) Substituting $x = x_0$ in the Breit-Wigner formula gives for the maximum value $2/(\pi\Gamma)$. Substituting $x = x_0 \pm \Gamma/2$ gives a value of $1/(\pi\Gamma)$ which is just half the maximum.
Exercise 3.4
For z = x + y and z = xy we have
\[ p(z|I) = \iint \delta(z - x - y)\, f(x) g(y)\, dx\, dy = \int f(z-y)\, g(y)\, dy \]
and
\[ p(z|I) = \iint \delta(z - xy)\, f(x) g(y)\, dx\, dy = \iint \delta(z - w)\, f(w/y)\, g(y)\, \frac{dw}{|y|}\, dy = \int f(z/y)\, g(y)\, \frac{dy}{|y|}. \]
Exercise 3.5
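(i) The inverse transformations are
\[ x = \frac{u+v}{2}, \qquad y = \frac{u-v}{2} \]
so that the determinant of the Jacobian is
\[ |J| = \left| \det \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} \end{pmatrix} \right| = \frac{1}{2} \]
and
\[ p(u,v) = \frac{1}{2} f(x) g(y) = \frac{1}{2}\, f\!\left( \frac{u+v}{2} \right) g\!\left( \frac{u-v}{2} \right). \]
(ii) Here the inverse transformations are
\[ x = \sqrt{uv}, \qquad y = \sqrt{\frac{u}{v}}, \qquad |J| = \left| \det \begin{pmatrix} v/(4uv)^{1/2} & u/(4uv)^{1/2} \\ 1/(4uv)^{1/2} & -u/(4uv^3)^{1/2} \end{pmatrix} \right| = \frac{1}{2v} \]
so that
\[ p(u,v) = \frac{1}{2v}\, f\!\left( \sqrt{uv} \right) g\!\left( \sqrt{\frac{u}{v}} \right). \]
Marginalization of v gives
\[ p(u) = \int p(u,v)\, dv = \int f\!\left( \sqrt{uv} \right) g\!\left( \sqrt{\frac{u}{v}} \right) \frac{dv}{2v} = \int f(w)\, g\!\left( \frac{u}{w} \right) \frac{dw}{w}. \]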
Exercise 3.6
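According to (3.14) the distribution of $z$ is given by
\[ p(z|I) = \frac{1}{2\pi\sigma_1\sigma_2} \int dx\, \exp\left\{ -\frac{1}{2} \left[ \left( \frac{x - \mu_1}{\sigma_1} \right)^2 + \left( \frac{z - x - \mu_2}{\sigma_2} \right)^2 \right] \right\} . \]
The standard way to deal with such an integral is to complete the squares, that is, to write the exponent in the form $-\frac{1}{2}[a(x - b)^2 + c]$. This allows us to carry out the integral
\[ \int dx\, \exp\left\{ -\tfrac{1}{2} \left[ a(x - b)^2 + c \right] \right\} = \exp\left( -\tfrac{1}{2} c \right) \int dy\, \exp\left( -\tfrac{1}{2} a y^2 \right) = \sqrt{\frac{2\pi}{a}}\, \exp\left( -\tfrac{1}{2} c \right) . \]
Our problem is now reduced to finding the coefficients $a$, $b$ and $c$ such that the following equation holds
\[ p(x - r)^2 + q(x - s)^2 = a(x - b)^2 + c \]
where the left-hand side is a generic expression for the exponent of the convolution of our two Gaussians. Since the coefficients of the powers of $x$ must be equal at both sides of the equation we have
\[ p + q = a, \qquad pr + qs = ab, \qquad pr^2 + qs^2 = ab^2 + c, \]
which is solved by
\[ a = p + q, \qquad b = \frac{pr + qs}{p + q}, \qquad c = \frac{pq}{p + q}\, (s - r)^2 . \]
Then substituting
\[ p = 1/\sigma_1^2, \qquad q = 1/\sigma_2^2, \qquad r = \mu_1, \qquad s = z - \mu_2 \]
yields the desired result
\[ p(z|I) = \frac{1}{\sqrt{2\pi(\sigma_1^2 + \sigma_2^2)}}\, \exp\left[ -\frac{(z - \mu_1 - \mu_2)^2}{2(\sigma_1^2 + \sigma_2^2)} \right] . \]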
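The closed form for the convolution of two Gaussians can be checked against a direct numerical integration. The sketch below is only an illustration: the trapezoidal integrator, the integration range and the parameter values are assumptions chosen for the check, not part of the text.

```python
import math

def gauss(x, mu, sigma):
    """Normalized Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def convolve_at(z, mu1, s1, mu2, s2, lo=-30.0, hi=30.0, steps=6000):
    """Trapezoidal estimate of the integral over x of f(x) g(z - x)."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * dx
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * gauss(x, mu1, s1) * gauss(z - x, mu2, s2)
    return total * dx

mu1, s1, mu2, s2 = 0.5, 1.0, -1.0, 2.0   # illustrative values
z = 0.7
numeric = convolve_at(z, mu1, s1, mu2, s2)
# Closed form: Gaussian with mean mu1 + mu2 and variance s1^2 + s2^2
closed = gauss(z, mu1 + mu2, math.sqrt(s1 ** 2 + s2 ** 2))
```

Within the accuracy of the integration, `numeric` and `closed` should agree for any choice of `z`.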
Exercise 3.7
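For independent random variables $x_i$ with variance $\sigma_i^2$ we have $\langle \Delta x_i \Delta x_j \rangle = \sigma_i \sigma_j \delta_{ij}$.
(i) For the sum $z = \sum_i x_i$ we have $\partial z/\partial x_i = 1$ so that (3.19) gives
\[ \langle \Delta z^2 \rangle = \sum_i \sum_j \sigma_i \sigma_j \delta_{ij} = \sum_i \sigma_i^2 . \]
(ii) For the product $z = \prod_i x_i$ we have $\partial z/\partial x_i = z/x_i$ so that (3.19) gives
\[ \langle \Delta z^2 \rangle = \sum_i \sum_j \frac{z}{x_i} \frac{z}{x_j}\, \sigma_i \sigma_j \delta_{ij} = z^2 \sum_i \left( \frac{\sigma_i}{x_i} \right)^2 . \]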
Exercise 3.8
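We have
\[ \epsilon = \frac{n}{N} = \frac{n}{n + m}, \qquad \frac{\partial \epsilon}{\partial n} = \frac{m}{(n + m)^2} = \frac{N - n}{N^2}, \qquad \frac{\partial \epsilon}{\partial m} = -\frac{n}{(n + m)^2} = -\frac{n}{N^2} \]
and
\[ \langle \Delta n^2 \rangle = n, \qquad \langle \Delta m^2 \rangle = m = N - n, \qquad \langle \Delta n \Delta m \rangle = 0 . \]
Error propagation then gives
\[ \langle \Delta \epsilon^2 \rangle = \left( \frac{\partial \epsilon}{\partial n} \right)^2 \langle \Delta n^2 \rangle + \left( \frac{\partial \epsilon}{\partial m} \right)^2 \langle \Delta m^2 \rangle = \left( \frac{N - n}{N^2} \right)^2 n + \left( \frac{n}{N^2} \right)^2 (N - n) = \frac{\epsilon (1 - \epsilon)}{N} . \]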
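The propagation above is easy to verify for concrete numbers. The sketch below takes the variances $\langle \Delta n^2 \rangle = n$ and $\langle \Delta m^2 \rangle = m$ with vanishing covariance, as in the solution; the counts and the function name are illustrative assumptions.

```python
def efficiency_variance(n, m):
    """Propagate variances n and m (zero covariance) to eps = n / (n + m)."""
    N = n + m
    d_eps_dn = m / N ** 2          # partial derivative with respect to n
    d_eps_dm = -n / N ** 2         # partial derivative with respect to m
    return d_eps_dn ** 2 * n + d_eps_dm ** 2 * m

n, m = 30, 70                      # illustrative counts
N = n + m
eps = n / N
propagated = efficiency_variance(n, m)
binomial = eps * (1.0 - eps) / N   # the result eps (1 - eps) / N derived above
```

Both numbers should coincide, which shows that the propagated error reproduces the binomial variance of the efficiency.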
Exercise 3.9
Writing out (3.24) in components gives, using the definition (3.25),
\[ \sum_k V_{ik} U^T_{kj} = \sum_k U^T_{ik} \lambda_k \delta_{kj} \quad \Leftrightarrow \quad V u_j = \lambda_j u_j . \]
Exercise 3.10
(i) For a symmetric matrix $V$ and two arbitrary vectors $x$ and $y$ we have
\[ y V x = \sum_{ij} y_i V_{ij} x_j = \sum_{ij} x_j V^T_{ji} y_i = \sum_{ij} x_j V_{ji} y_i = x V y . \]
Exercise 4.1
Decomposition of $P(R_2|I)$ in the hypothesis space $\{R_1, W_1\}$ gives
\[ P(R_2|I) = P(R_2|R_1, I) P(R_1|I) + P(R_2|W_1, I) P(W_1|I) = \frac{R - 1}{N - 1}\, \frac{R}{N} + \frac{R}{N - 1}\, \frac{W}{N} = \frac{R}{N} . \]
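The decomposition can be evaluated exactly with rational arithmetic. In the sketch below the urn contents are illustrative and the function name is hypothetical; draws are without replacement, as in the solution above.

```python
from fractions import Fraction

def prob_second_red(R, W):
    """P(R2|I) for draws without replacement from an urn with R red and W white balls."""
    N = R + W
    p_r1 = Fraction(R, N)                    # P(R1|I)
    p_w1 = Fraction(W, N)                    # P(W1|I)
    p_r2_given_r1 = Fraction(R - 1, N - 1)   # one red ball already removed
    p_r2_given_w1 = Fraction(R, N - 1)       # one white ball already removed
    return p_r2_given_r1 * p_r1 + p_r2_given_w1 * p_w1

result = prob_second_red(3, 7)               # illustrative urn: 3 red, 7 white
```

For any urn contents the sum reduces to $R/N$: not knowing the outcome of the first draw leaves the probability for the second draw unchanged.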
Exercise 4.2
For draws with replacement we have $P(R_2|R_1, I) = P(R_2|W_1, I) = P(R_1|I) = R/N$ and $P(W_1|I) = W/N$. Inserting this in Bayes theorem gives
\[ P(R_1|R_2, I) = \frac{P(R_2|R_1, I) P(R_1|I)}{P(R_2|R_1, I) P(R_1|I) + P(R_2|W_1, I) P(W_1|I)} = \frac{R}{N} . \]
Exercise 4.3
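Without loss of generality we can consider marginalization of the multinomial distribution over all but the first probability. According to (4.10) we have
\[ n_2' = \sum_{i=2}^{k} n_i = N - n_1 \qquad \text{and} \qquad p_2' = \sum_{i=2}^{k} p_i = 1 - p_1 , \]
so that the marginal distribution of $n_1$ is the binomial
\[ P(n_1|I) = \frac{N!}{n_1! (N - n_1)!}\, p_1^{n_1} (1 - p_1)^{N - n_1} . \]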
Exercise 4.4
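From the product rule we have
\[ P(n_1, \ldots, n_k|I) = P(n_1, \ldots, n_{k-1}|n_k, I)\, P(n_k|I) . \]
From this we find for the conditional distribution
\begin{align*}
P(n_1, \ldots, n_{k-1}|n_k, I) &= \frac{P(n_1, \ldots, n_k|I)}{P(n_k|I)} \\
&= \frac{N!}{n_1! \cdots n_k!}\, p_1^{n_1} \cdots p_{k-1}^{n_{k-1}} p_k^{n_k} \times \frac{n_k! (N - n_k)!}{N!}\, \frac{1}{p_k^{n_k} (1 - p_k)^{N - n_k}} \\
&= \frac{(N - n_k)!}{n_1! \cdots n_{k-1}!} \left( \frac{p_1}{1 - p_k} \right)^{n_1} \cdots \left( \frac{p_{k-1}}{1 - p_k} \right)^{n_{k-1}} .
\end{align*}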
Exercise 4.5
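(i) The likelihood to observe $n$ counts in a time window $t$ is given by the Poisson distribution
\[ P(n|\mu) = \frac{\mu^n}{n!} \exp(-\mu) \]
with $\mu = Rt$ and $R$ the average counting rate. Assuming a flat prior for $\mu \in [0, \infty]$ the posterior is
\[ p(\mu|n) = C\, \frac{\mu^n}{n!} \exp(-\mu) \]
with $C$ a normalization constant which turns out to be unity: the Poisson distribution has the remarkable property that it is normalized with respect to both $n$ and $\mu$:
\[ \sum_{n=0}^{\infty} P(n|\mu) = \sum_{n=0}^{\infty} \frac{\mu^n}{n!} \exp(-\mu) = 1 \qquad \text{and} \qquad \int_0^{\infty} p(\mu|n)\, d\mu = \int_0^{\infty} \frac{\mu^n}{n!} \exp(-\mu)\, d\mu = 1 . \]
The mean, second moment and the variance of the posterior are
\[ \langle \mu \rangle = n + 1, \qquad \langle \mu^2 \rangle = (n + 1)(n + 2), \qquad \langle \Delta \mu^2 \rangle = n + 1 . \]
The negative log posterior and its second derivative are
\[ L = \text{constant} - n \ln(\mu) + \mu, \qquad \frac{d^2 L}{d\mu^2} = \frac{n}{\mu^2} . \]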
\[ \ldots = \exp(-R\tau)\, R\, d\tau . \]
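For the posterior of part (i), $p(\mu|n) = \mu^n \exp(-\mu)/n!$, the normalization and the moments follow from the identity $\int_0^\infty \mu^{s} e^{-\mu}\, d\mu = \Gamma(s + 1)$. The sketch below evaluates them for one illustrative value of $n$; the function name is hypothetical.

```python
import math

def posterior_moment(n, k):
    """k-th moment of p(mu|n) = mu^n exp(-mu) / n!, equal to Gamma(n + k + 1) / n!."""
    return math.gamma(n + k + 1) / math.factorial(n)

n = 4                                   # illustrative number of observed counts
norm = posterior_moment(n, 0)           # normalization integral
mean = posterior_moment(n, 1)           # <mu>
second = posterior_moment(n, 2)         # <mu^2>
variance = second - mean ** 2           # <mu^2> - <mu>^2
```

Note that the mean $n + 1$ differs from the mode $\hat{\mu} = n$ because the posterior is skewed towards larger values of $\mu$.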
Exercise 4.6
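The derivatives of the characteristic function (4.17) are
\[ \frac{d\phi(k)}{dk} = (i\mu - \sigma^2 k) \exp\left( i\mu k - \tfrac{1}{2} \sigma^2 k^2 \right), \qquad \frac{d^2\phi(k)}{dk^2} = \left[ (i\mu - \sigma^2 k)^2 - \sigma^2 \right] \exp\left( i\mu k - \tfrac{1}{2} \sigma^2 k^2 \right) . \]
From (4.16) we find for the first and second moment
\[ \langle x \rangle = \frac{1}{i} \frac{d\phi(0)}{dk} = \mu, \qquad \langle x^2 \rangle = \frac{1}{i^2} \frac{d^2\phi(0)}{dk^2} = \mu^2 + \sigma^2 \]
from which it immediately follows that the variance is given by $\langle x^2 \rangle - \langle x \rangle^2 = \sigma^2$.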
Exercise 4.7
From (4.15) and (4.17) we have
\[ \phi(k) = \phi_1(k)\, \phi_2(k) = \exp\left[ i(\mu_1 + \mu_2) k - \tfrac{1}{2} (\sigma_1^2 + \sigma_2^2) k^2 \right] \]
which is just the characteristic function of a Gauss with mean $\mu_1 + \mu_2$ and variance $\sigma_1^2 + \sigma_2^2$.
Exercise 5.2
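From differentiating the logarithm of (5.12) we get
\begin{align*}
-\frac{\partial \ln Z}{\partial \lambda_k} &= -\frac{1}{Z} \frac{\partial Z}{\partial \lambda_k} = -\frac{1}{Z} \sum_{i=1}^{n} m_i \frac{\partial}{\partial \lambda_k} \exp\Big( -\sum_{j=1}^{m} \lambda_j f_{ji} \Big) \\
&= \frac{1}{Z} \sum_{i=1}^{n} f_{ki}\, m_i \exp\Big( -\sum_{j=1}^{m} \lambda_j f_{ji} \Big) = \sum_{i=1}^{n} f_{ki}\, p_i = \beta_k .
\end{align*}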
Exercise 6.1
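The log posterior of (6.3) is given by
\[ L = \text{Constant} + \frac{1}{2} \sum_{i=1}^{n} \left( \frac{d_i - \mu}{\sigma} \right)^2 . \]
Setting the first derivative to zero gives
\[ \frac{dL}{d\mu} = -\sum_{i=1}^{n} \frac{d_i - \mu}{\sigma^2} = 0 \quad \Rightarrow \quad \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} d_i . \]
For the second derivative we find
\[ \frac{d^2 L}{d\mu^2} = \sum_{i=1}^{n} \frac{1}{\sigma^2} = \frac{n}{\sigma^2} \quad \Rightarrow \quad \left[ \frac{d^2 L(\hat{\mu})}{d\mu^2} \right]^{-1} = \frac{\sigma^2}{n} . \]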
Exercise 6.3
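The negative logarithm of the posterior (6.7) is given by
\[ L = \text{Constant} + \frac{n - 1}{2} \ln\left( V + n z^2 \right), \]
where $z$ is the difference between $\mu$ and the sample mean $\bar{d}$. The first derivative vanishes for
\[ \frac{dL}{dz} = \frac{(n - 1) n z}{V + n z^2} = 0 \quad \Rightarrow \quad \hat{z} = 0 \quad \Rightarrow \quad \hat{\mu} = \bar{d} . \]
For the second derivative we find
\[ \frac{d^2 L}{d\mu^2} = \frac{d^2 L}{dz^2} = \frac{(n - 1) n}{V + n z^2} - \frac{2 (n - 1) n^2 z^2}{(V + n z^2)^2} \]
so that
\[ H = \frac{d^2 L(\hat{\mu})}{d\mu^2} = \frac{d^2 L(0)}{dz^2} = \frac{(n - 1) n}{V} . \]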
Exercise 6.4
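By substituting $t = \frac{1}{2} \chi^2$ we find for the average
\[ \langle \chi^2 \rangle = \int_0^{\infty} \chi^2\, p(\chi^2|\nu)\, d\chi^2 = \frac{2}{\Gamma(\alpha)} \int_0^{\infty} t^{\alpha} e^{-t}\, dt = \frac{2\, \Gamma(\alpha + 1)}{\Gamma(\alpha)} = 2\alpha = \nu . \]
Likewise, the second moment is found to be
\[ \langle \chi^4 \rangle = \int_0^{\infty} \chi^4\, p(\chi^2|\nu)\, d\chi^2 = \frac{4}{\Gamma(\alpha)} \int_0^{\infty} t^{\alpha + 1} e^{-t}\, dt = \frac{4\, \Gamma(\alpha + 2)}{\Gamma(\alpha)} = 4\alpha(\alpha + 1) = \nu(\nu + 2) . \]
Therefore the variance is
\[ \langle \chi^4 \rangle - \langle \chi^2 \rangle^2 = \nu(\nu + 2) - \nu^2 = 2\nu . \]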
Exercise 6.5
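The negative logarithm of (6.10) is given by
\[ L = \text{Constant} + (n - 1) \ln(\sigma) + \frac{V}{2\sigma^2} \]
whence we obtain
\[ \frac{dL}{d\sigma} = \frac{n - 1}{\sigma} - \frac{V}{\sigma^3} = 0 \quad \Rightarrow \quad \hat{\sigma}^2 = \frac{V}{n - 1} \]
\[ \frac{d^2 L}{d\sigma^2} = \frac{1 - n}{\sigma^2} + \frac{3V}{\sigma^4} \quad \Rightarrow \quad H = \frac{d^2 L(\hat{\sigma})}{d\sigma^2} = \frac{2 (n - 1)^2}{V} . \]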
Exercise 6.6
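In case of a polynomial parameterization
\[ f(x; a) = a_1 + a_2 x + a_3 x^2 + a_4 x^3 + \cdots \]
the basis functions are given by $f_\lambda(x) = x^{\lambda - 1}$. Truncated after the quadratic term, equation (6.19) takes the form
\[ \begin{pmatrix} \sum_i w_i & \sum_i w_i x_i & \sum_i w_i x_i^2 \\ \sum_i w_i x_i & \sum_i w_i x_i^2 & \sum_i w_i x_i^3 \\ \sum_i w_i x_i^2 & \sum_i w_i x_i^3 & \sum_i w_i x_i^4 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix} = \begin{pmatrix} \sum_i w_i d_i \\ \sum_i w_i d_i x_i \\ \sum_i w_i d_i x_i^2 \end{pmatrix} . \]
Exercise 6.7
In case $f(x; a) = a_1$ (fit to a constant) the matrix $W$ and the vector $b$ are one-dimensional:
\[ W = \sum_i w_i, \qquad b = \sum_i w_i d_i . \]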
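The matrix equation for a polynomial fit can be assembled and solved in a few lines. The sketch below is only an illustration of the weighted normal equations $W a = b$ with $W_{\lambda\mu} = \sum_i w_i x_i^{\lambda + \mu - 2}$ and $b_\lambda = \sum_i w_i d_i x_i^{\lambda - 1}$; the helper names, the data and the weights are assumptions, and the data are generated without noise so that the fit should return the input coefficients.

```python
def normal_equations(xs, ds, ws, order=3):
    """Build W and b for basis functions x**k, k = 0 .. order - 1."""
    W = [[sum(w * x ** (r + c) for x, w in zip(xs, ws)) for c in range(order)]
         for r in range(order)]
    b = [sum(w * d * x ** r for x, d, w in zip(xs, ds, ws)) for r in range(order)]
    return W, b

def solve(W, b):
    """Solve W a = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    A = [row[:] + [rhs] for row, rhs in zip(W, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [u - factor * v for u, v in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ds = [1.0 + 2.0 * x - 0.5 * x ** 2 for x in xs]      # noise-free parabola
ws = [1.0, 2.0, 1.0, 0.5, 1.0, 2.0]                  # weights w_i = 1 / sigma_i^2
W, b = normal_equations(xs, ds, ws)
a = solve(W, b)                                      # estimates of a_1, a_2, a_3
```

With `order=1` the same helper reduces to the one-dimensional case of a fit to a constant, where the solution is the weighted average $\sum_i w_i d_i / \sum_i w_i$.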
Exercise 6.8
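With the covariances of the systematic parameters $s_\lambda$ given by
\[ V_{\lambda\mu} = \delta_{\lambda\mu}, \qquad V_{i\lambda} = 0 \]
and the derivative matrices
\[ D_{ij} = \frac{\partial d_i}{\partial d_j} = \delta_{ij}, \qquad D_{i\lambda} = \frac{\partial d_i}{\partial s_\lambda} = \Delta_{i\lambda}, \]
linear error propagation gives
\begin{align*}
V_{ij} &= \sum_k \sum_l D_{ik} V_{kl} D^T_{lj} + \sum_\lambda \sum_\mu D_{i\lambda} V_{\lambda\mu} D^T_{\mu j} \\
&= \sum_k \sum_l \delta_{ik}\, \sigma_k^2\, \delta_{kl}\, \delta_{lj} + \sum_\lambda \sum_\mu \Delta_{i\lambda}\, \delta_{\lambda\mu}\, \Delta_{j\mu} \\
&= \sigma_i^2 \delta_{ij} + \sum_\lambda \Delta_{i\lambda} \Delta_{j\lambda} .
\end{align*}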
Exercise 6.9
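From (6.38) we have for the log posterior
\[ L(a, s) = \text{Constant} + \frac{1}{2} \sum_{i=1}^{n} w_i \Big( d_i - t_i(a) - \sum_{\lambda=1}^{m} s_\lambda \Delta_{i\lambda} \Big)^2 + \frac{1}{2} \sum_{\lambda=1}^{m} s_\lambda^2 . \]
Setting the derivative with respect to $s_\lambda$ to zero gives
\[ \sum_{\mu=1}^{m} \Big( \delta_{\lambda\mu} + \sum_{i=1}^{n} w_i \Delta_{i\lambda} \Delta_{i\mu} \Big) s_\mu = \sum_{i=1}^{n} w_i (d_i - t_i) \Delta_{i\lambda}, \]
which is the equation $S s = b$ as given in (6.40).
Exercise 6.10
The weighted average gives for the best estimate and its variance
\[ \hat{\mu} = \frac{\sum_i w_i d_i}{\sum_i w_i} = \bar{d} \qquad \text{and} \qquad \langle \Delta \mu^2 \rangle = \frac{1}{\sum_i w_i} = \frac{\sigma^2}{n} . \]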
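1. Offsetting all data points by an amount $\pm\Delta$ gives for the best estimate $\hat{\mu} = \bar{d} \pm \Delta$. Adding the statistical and systematic deviations in quadrature we thus find from the offset method that
\[ \hat{\mu} = \bar{d} \pm \sqrt{\frac{\sigma^2}{n} + \Delta^2} \;\rightarrow\; \bar{d} \pm \Delta \qquad \text{for} \quad n \rightarrow \infty . \]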
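2. The matrix $S$ and the vector $b$ defined in (6.40) are in this case one-dimensional. Up to a term independent of $\mu$ we have
\[ S = 1 + \frac{n \Delta^2}{\sigma^2}, \qquad b = \frac{n (\bar{d} - \mu) \Delta}{\sigma^2}, \qquad \sum_i w_i (d_i - t_i)^2 = \frac{n (\bar{d} - \mu)^2}{\sigma^2} . \]
Inserting this in (6.43) we find, after some algebra
\[ L(\mu) = \frac{1}{2}\, \frac{n (\bar{d} - \mu)^2}{\sigma^2 + n \Delta^2}, \qquad \frac{dL(\mu)}{d\mu} = -\frac{n (\bar{d} - \mu)}{\sigma^2 + n \Delta^2}, \qquad \frac{d^2 L(\mu)}{d\mu^2} = \frac{n}{\sigma^2 + n \Delta^2}, \]
so that
\[ \hat{\mu} = \bar{d} \pm \sqrt{\frac{\sigma^2}{n} + \Delta^2} \;\rightarrow\; \bar{d} \pm \Delta \qquad \text{for} \quad n \rightarrow \infty . \]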
We now add a second set of $n$ measurements which do not have a systematic error $\Delta$. The weighted average gives for the best estimate and the variance of the combined data
\[ \hat{\mu} = \bar{d} \qquad \text{and} \qquad \langle \Delta \mu^2 \rangle = \frac{\sigma^2}{2n} . \]
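1. Offsetting the first set of $n$ data points by an amount $\pm\Delta$ but leaving the second set intact gives for the best estimate $\hat{\mu} = \bar{d} \pm \Delta/2$. Adding the statistical and systematic deviations in quadrature we thus find from the offset method that
\[ \hat{\mu} = \bar{d} \pm \sqrt{\frac{\sigma^2}{2n} + \left( \frac{\Delta}{2} \right)^2} \;\rightarrow\; \bar{d} \pm \frac{\Delta}{2} \qquad \text{for} \quad n \rightarrow \infty . \]
But this error is larger than if we had considered only the second data set:
\[ \hat{\mu} = \bar{d} \pm \frac{\sigma}{\sqrt{n}} \;\rightarrow\; \bar{d} \pm 0 \qquad \text{for} \quad n \rightarrow \infty . \]
In other words, the offset method violates the requirement that the error derived from all available data must always be smaller than that derived from a subset of the data.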
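2. The matrix $S$ and the vector $b$ defined in (6.40) are now two-dimensional but with many zero-valued elements since the systematic error of the second data set is zero. We find
\[ S' = \begin{pmatrix} S & 0 \\ 0 & 1 \end{pmatrix}, \qquad b' = \begin{pmatrix} b \\ 0 \end{pmatrix} \]
where $S$ and $b$ are defined above. The log posterior of (6.43) is found to be
\[ L(\mu) = \frac{1}{2} \left[ 2n \left( \frac{\bar{d} - \mu}{\sigma} \right)^2 - b\, S^{-1} b \right] = \frac{1}{2}\, \frac{n (\bar{d} - \mu)^2}{\sigma^2 + n \Delta^2} + \frac{1}{2}\, n \left( \frac{\bar{d} - \mu}{\sigma} \right)^2 . \]
Solving the equation $dL(\mu)/d\mu = 0$ immediately yields $\hat{\mu} = \bar{d}$. The inverse of the second derivative gives an estimate of the error. After some straightforward algebra we find
\[ \hat{\mu} = \bar{d} \pm \frac{\sigma}{\sqrt{n}} \sqrt{\frac{\sigma^2 + n \Delta^2}{2\sigma^2 + n \Delta^2}} . \]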
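The difference between the offset method and the Bayesian treatment for the two combined data sets can be made concrete with a few numbers. The sketch below assumes the error formulas derived in this exercise: $\sqrt{\sigma^2/(2n) + (\Delta/2)^2}$ for the offset method, $\sigma/\sqrt{n}$ for the second data set alone, and the inverse Hessian of $L(\mu)$, with Hessian $n/(\sigma^2 + n\Delta^2) + n/\sigma^2$, for the Bayesian result. The values of `sigma` and `delta` and the function names are illustrative.

```python
import math

def offset_error(sigma, delta, n):
    """Offset method: statistical error and half the offset added in quadrature."""
    return math.sqrt(sigma ** 2 / (2 * n) + (delta / 2.0) ** 2)

def second_set_only_error(sigma, n):
    """Statistical error of the second data set, which has no systematic error."""
    return sigma / math.sqrt(n)

def bayes_error(sigma, delta, n):
    """Inverse square root of the Hessian n/(sigma^2 + n delta^2) + n/sigma^2."""
    hessian = n / (sigma ** 2 + n * delta ** 2) + n / sigma ** 2
    return 1.0 / math.sqrt(hessian)

sigma, delta = 1.0, 0.5        # illustrative statistical and systematic errors
rows = [(n, offset_error(sigma, delta, n),
         second_set_only_error(sigma, n),
         bayes_error(sigma, delta, n)) for n in (10, 100, 1000)]
```

For these values the Bayesian error stays below that of the second data set alone for every $n$, whereas the offset error exceeds it as soon as $n > 2\sigma^2/\Delta^2$ and saturates at $\Delta/2$.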
Exercise 6.11
In (3.11) we state that the posterior $p(\lambda|d, H_1)$ is Gaussian distributed in the neighborhood of the mode $\hat{\lambda}$. Indeed, that is exactly what we find inserting Eqs. (6.47), (6.48) and (6.49) in Bayes theorem,
\[ p(\lambda|d, H_1) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{\lambda - \hat{\lambda}}{\sigma} \right)^2 \right] . \]
The approximations made in Eqs. (6.47) and (6.48) are thus consistent with those made in (3.11).
Exercise 6.12
When $H_i$ has $n$ free parameters $\lambda$ and $H_j$ has $m$ free parameters $\mu$, the Occam factor in (6.51) becomes the ratio of multivariate Gaussian normalization factors, see (3.11):
\[ \sqrt{\frac{(2\pi)^m |V_\mu|}{(2\pi)^n |V_\lambda|}} . \]
Exercise 7.1
For the negative log posterior of the binomial distribution we have
\[ L = \text{Constant} - n \ln(h) - (N - n) \ln(1 - h) . \]
The first and second derivatives of $L$ with respect to $h$ are
\[ \frac{dL}{dh} = \frac{hN - n}{h(1 - h)}, \qquad \frac{d^2 L}{dh^2} = \frac{h^2 N + n - 2hn}{h^2 (1 - h)^2} . \]
From $dL/dh = 0$ we obtain $\hat{h} = n/N$. The Hessian is thus $d^2 L(\hat{h})/dh^2 = N/[\hat{h}(1 - \hat{h})]$ from which (7.5) follows.
Exercise 7.2
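The posterior of $\epsilon$ for $N$ counts in $N$ events is, from (7.4),
\[ p(\epsilon|N, N) = (N + 1)\, \epsilon^N . \]
Integrating the posterior we find for the confidence level $\alpha$
\[ \int_a^1 d\epsilon\, p(\epsilon|N, N) = 1 - \int_0^a d\epsilon\, p(\epsilon|N, N) = 1 - a^{N+1} = \alpha \quad \Rightarrow \quad a = (1 - \alpha)^{1/(N+1)} . \]
For $N = 4$ and $\alpha = 0.65$ we find $a = 0.81$ so that the result on the efficiency can be reported as
\[ \epsilon = 1\,^{+0}_{-0.19} \qquad \text{(65\% CL)} \]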
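The lower limit follows from a single line of arithmetic. The sketch below evaluates $a = (1 - \alpha)^{1/(N+1)}$ for the numbers quoted above and checks that the posterior $(N + 1)\epsilon^N$ integrated from $a$ to 1 returns the requested confidence level; the function name is a hypothetical helper.

```python
def lower_limit(N, alpha):
    """Lower bound a with the integral of (N + 1) eps^N from a to 1 equal to alpha."""
    return (1.0 - alpha) ** (1.0 / (N + 1))

N, alpha = 4, 0.65
a = lower_limit(N, alpha)
coverage = 1.0 - a ** (N + 1)    # integral of the posterior from a to 1
```

Since the posterior peaks at the physical boundary $\epsilon = 1$, the interval is one-sided and the upper error is zero.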
References
[1] T. Loredo, From Laplace to Supernova SN 1987A: Bayesian Inference in Astrophysics in Maximum Entropy and Bayesian Methods, ed. P.F. Foug`ere, Kluwer
Academic Publishers, Dordrecht (1990).
[2] G.L. Bretthorst, An Introduction to Parameter Estimation using Bayesian Probability Theory in Maximum Entropy and Bayesian Methods, ed. P.F. Foug`ere,
Kluwer Academic Publishers, Dordrecht (1990).
[3] G.L. Bretthorst, An Introduction to Model Selection using Probability Theory as
Logic in Maximum Entropy and Bayesian Methods, ed. G.L. Heidbreder, Kluwer
Academic Publishers, Dordrecht (1996).
[4] G. DAgostini, Bayesian Inference in Processing Experimental DataPrinciples
and Basic Applications, arXiv:physics/0304102 (2003).
64
[5] D.S. Sivia, Data Analysisa Bayesian Tutorial , Oxford University Press (1997).
[6] P. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, Cambridge
University Press (2005).
[7] E.T. Jaynes, Probability TheoryThe Logic of Science, Cambridge University
Press (2003). See also http://omega.albany.edu:8008/JaynesBook.html (1998).
[8] G. Cowan, Statistical Data Analysis, Oxford University Press (1998).
[9] R.T. Cox, Probability, frequency and reasonable expectation, Am. J. Phys. 14, 1
(1946).
[10] F. James et al. eds., Proc. Workshop on Confidence Limits, CERN Yellow Report
2000-005. See also http://www.cern.ch/cern/divisions/ep/events/clw.
[11] PDG, S. Eidelman et al., Phys. Lett. B592, 1 (2004).
[12] S.F. Gull, Bayesian Inductive Inference and Maximum Entropy in MaximumEntropy and Bayesian Methods in Science and Engineering, G.J. Ericson and
C.R. Smith eds., Vol. I, 53, Kluwer Academic Publishers (1988).
[13] C.E. Shannon, The mathematical theory of communication, Bell Systems Tech. J.
27, 379 (1948).
[14] S.I. Alekhin, Statistical properties of estimators using the covariance matrix , hepex/0005042.
[15] D. Stump et al., Uncertainties of predictions from parton distribution functions,
Phys. Rev. D65 014012 (2002), hep-ph/0101051.
[16] D. Stump, private communication.
[17] G. DAgostini, Nucl. Inst. Meth. A346, 306 (1994);
T. Takeuchi, Prog. Theor. Phys. Suppl. 123, 247 (1996), hep-ph/9603415.
[18] NA49 Collab., C. Alt et al., Upper limit of D 0 production in central Pb-Pb collisions
at 158A GeV , Phys. Rev. C73, 034910 (2006), nucl-ex/0507031.
65