
Introduction to Bayesian Networks

Practical and Technical Perspectives

Stefan Conrady, stefan.conrady@conradyscience.com

Dr. Lionel Jouffe, jouffe@bayesia.com

February 15, 2011

Conrady Applied Science, LLC - Bayesia’s North American Partner for Sales and Consulting

Table of Contents

Introduction

Bayesian Networks from a Practitioner’s Perspective
    Knowledge Unification
    Knowledge Representation & Communication
    Reasoning
    Summary

Technical Introduction
    Introduction
    Probabilistic Semantics
    Evidential Reasoning
    Learning Bayesian Networks
    Uncertainty Over Time
    Causal Networks
    Causal Discovery

References

Contact Information
    Conrady Applied Science, LLC
    Bayesia SAS

Introduction

A simplistic analogy may help to jump-start our introduction to Bayesian networks: just as one can use a phone book
without having to memorize all the names and numbers in it, one can deliberately (and correctly) reason with the
domain knowledge contained in a Bayesian network without having to become a domain expert.

Over the last 25 years, Bayesian networks have emerged as a practically feasible form of knowledge representation,
primarily through the seminal works of UCLA Professor Judea Pearl. With ever-increasing computing power, Bayesian
networks have become a powerful tool for developing a deep understanding of very complex, high-dimensional problem
domains. Their computational efficiency and inherently visual structure make Bayesian networks attractive for exploring
and explaining complex problems.

However, Bayesian networks are somewhat of a disruptive technology, as they challenge a number of common practices in
the world of business and science. So, beyond the world of academia, promoting Bayesian networks as a new tool for
practical knowledge management and reasoning still requires significant persuasion efforts. With this short paper, we
attempt to provide a concise justification, from both a practitioner’s and a technical perspective1, of why Bayesian
networks are so important.

1 Author notes: portions of the technical chapter of this paper are adapted, with permission, from Pearl and Russell
(2000).


Bayesian Networks from a Practitioner’s Perspective

In our quest to “evangelize” about Bayesian networks (and the BayesiaLab software package2), we are often limited to
presenting our case in just a few PowerPoint slides with a handful of catchy bullet points. In this context, and without
any claim to being comprehensive, we selected the following headings to highlight the key benefits of Bayesian networks
to research practitioners and business executives:

1. Knowledge Unification

2. Knowledge Representation & Communication

3. Reasoning

Under these headings, the following paragraphs are meant to provide a glimpse of the powerful properties and wide-
ranging practical advantages of Bayesian networks.

Knowledge Unification
Many fields are characterized by the proverbial conflict between “art” and “science.” This manifests itself in debates,
such as the one about evidence-based medicine versus the prevailing practice of physicians with years of experience.
Even more common is the discrepancy between scientifically derived market research insights and expertise-based
marketing decisions of business executives. Traditional frameworks typically don't facilitate leveraging the knowledge
available on both sides.

Bayesian networks have the ability to capture both qualitative knowledge (through their network structure) and
quantitative knowledge (through their parameters). While expert knowledge from practitioners is mostly qualitative, it
can be used directly for building the structure of a Bayesian network. In addition, data mining algorithms can extract
both qualitative and quantitative knowledge from data and encode both forms simultaneously in a Bayesian network. As a
result, Bayesian networks can bridge the gap between different types of knowledge and serve to unify all available
knowledge into a single form of representation.

2 Developed by Bayesia SAS, BayesiaLab is a comprehensive software package designed for learning, editing and analyzing
Bayesian networks. It is available in North America from Conrady Applied Science, LLC.


Figure 1: Knowledge unification with Bayesian networks. Qualitative expert knowledge (“art”) and quantitative
mathematical representations (“science”) of a domain are combined into a single, unified knowledge representation.

Knowledge Representation & Communication


Relaying knowledge typically includes an array of factual and causal statements. In natural language communication,
such statements will often contain generalizations, approximations, and implicit assumptions regarding their probability.
Such simplifications are widely accepted in casual conversation or in media headlines.

However, for the more precise communication required in science or business, it is necessary to spell out the
exceptions, uncertainty, and conditions attached to statements about knowledge. With natural language alone, this can
become very cumbersome, especially when it concerns a complex domain (hence the substantial girth of many textbooks).

Also, the need for precision in describing complex domains is often at odds with modern business culture, which, as
already mentioned in the introduction, dictates communication via PowerPoint in a few concise bullet points. Needless to
say, the complex dynamics of a domain often cannot be relayed correctly to policy makers and other stakeholders in this
way.

Bayesian networks are very well suited for capturing probabilistic and incomplete causal knowledge regarding a domain.
They can easily accommodate exceptions to a rule, e.g. “all swans are white, except for a certain species,” as well
as partial causal information, for instance “alcohol caused the accident,” even though more factors may actually be
involved, such as poor road conditions.

Through its structure and its parameters, a Bayesian network comprehensively describes what is known about a particular
domain, and especially the interactions of all the variables contained within that domain. As such, a Bayesian
network is a “Portable Knowledge Format” that can succinctly communicate the state of the domain as well as its
dynamics.


Reasoning
By representing these interactions, a (correctly formulated) Bayesian network can yield a deep understanding of a domain.
Deep understanding means knowing not merely how things behaved yesterday, but also how things will behave under
new, hypothetical circumstances tomorrow. More specifically, a Bayesian network allows explicit reasoning, and deliberate
reasoning allows us to anticipate the consequences of actions we have not yet taken. Bayesian networks thus become
an instrument for formal reasoning that is entirely transparent to stakeholders, as opposed to the more opaque,
internalized process in the decision maker’s mind (or gut).

Figure 2: Using Bayesian networks for formal reasoning about the consequences of hypothetical actions. Data from the
domain under study are learned into a Bayesian network; manipulations of the network correspond to hypothetical
manipulations of the domain.

Summary
In summary, Bayesian networks are a highly universal knowledge framework and they provide a common reasoning
language between stakeholders from different backgrounds, such as business executives and market research scientists.
With all available knowledge unified, properly communicated and quite literally put into a “reasonable” format, Bayes-
ian network are a powerful tool for making decisions and shaping policies.


Technical Introduction

For the technical portion of this introduction, we defer to the words of Judea Pearl, who originally coined the term
“Bayesian network”. We are grateful to him for allowing us to use and adapt large sections from one of his technical
reports for our purposes (Pearl and Russell, 2000).

Introduction
Probabilistic models based on directed acyclic graphs have a long and rich tradition, beginning with the work of
geneticist Sewall Wright in the 1920s. Variants have appeared in many fields. Within statistics, such models are known
as directed graphical models; within cognitive science and artificial intelligence, such models are known as Bayesian
networks. The name honors the Rev. Thomas Bayes (1702-1761), whose rule for updating probabilities in the light of new
evidence is the foundation of the approach.

Rev. Bayes addressed both the case of discrete probability distributions of data and the more complicated case of
continuous probability distributions. In the discrete case, Bayes’ theorem relates the conditional and marginal
probabilities of events A and B, provided that the probability of B does not equal zero:

P(A∣B) = P(B∣A) P(A) / P(B)

In Bayes’ theorem, each probability has a conventional name:

• P(A) is the prior probability (or “unconditional” or “marginal” probability) of A. It is “prior” in the sense that it does
not take into account any information about B; however, the event B need not occur after event A. In the nineteenth
century, the unconditional probability P(A) in Bayes’s rule was called the “antecedent” probability; in deductive logic,
the antecedent set of propositions and the inference rule imply consequences. The unconditional probability P(A) was
called “a priori” by Ronald A. Fisher.

• P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from
or depends upon the specified value of B.

• P(B|A) is the conditional probability of B given A. It is also called the likelihood.

• P(B) is the prior or marginal probability of B, and acts as a normalizing constant.

Bayes’ theorem in this form gives a mathematical representation of how the conditional probability of event A given B is
related to the converse conditional probability of B given A.
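
To make the theorem concrete, the following short Python sketch applies Bayes’ rule to a hypothetical diagnostic test;
all numbers are invented for illustration and are not taken from any study.

# A minimal numerical illustration of Bayes' theorem (all numbers invented).
# A = "person has a rare condition", B = "a hypothetical test is positive".
p_a = 0.01               # prior P(A)
p_b_given_a = 0.95       # likelihood P(B|A), i.e. the test's sensitivity
p_b_given_not_a = 0.05   # false-positive rate P(B|not A)

# Marginal probability of a positive test, P(B), by total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(A|B) via Bayes' theorem.
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))   # approximately 0.161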

The initial development of Bayesian networks in the late 1970s was motivated by the need to model the top-down
(semantic) and bottom-up (perceptual) combination of evidence in reading. The capability for bidirectional inferences,
combined with a rigorous probabilistic foundation, led to the rapid emergence of Bayesian networks as the method of
choice for uncertain reasoning in AI and expert systems, replacing earlier ad hoc rule-based schemes.


The nodes in a Bayesian network represent propositional variables of interest (e.g. the temperature of a device, the
gender of a patient, a feature of an object, the occurrence of an event) and the links represent statistical
(informational)3 or causal dependencies among the variables. The dependencies are quantified by conditional probabilities
for each node given its parents in the network. The network supports the computation of the posterior probabilities of
any subset of variables given evidence about any other subset.

Figure 1 shows a very simple Bayesian network consisting of only two nodes and one link, representing the joint
probability distribution of the variables Eye Color and Hair Color in a given population. In this case, the conditional
probabilities of Hair Color given the values of its parent, Eye Color, are provided in a table. It is important to point
out that this Bayesian network does not contain any causal assumptions, i.e. we have no knowledge of the causal order
between the variables, so the interpretation here should be merely statistical (informational).

Figure 1: A Bayesian network representing the statistical relationship between two variables.

Figure 2 illustrates another simple yet typical Bayesian network. In contrast to the statistical relationships in Figure 1,
the diagram in Figure 2 describes the causal relationships among the season of the year (X1), whether it is raining (X2),
whether the sprinkler is on (X3), whether the pavement is wet (X4), and whether the pavement is slippery (X5). Here, the
absence of a direct link between X1 and X5, for example, captures our understanding that there is no direct influence of
season on slipperiness; the influence is mediated by the wetness of the pavement (if freezing were a possibility, a
direct link could be added).

3 “informational” and “statistical” are treated here as equivalent concepts and can be used interchangeably.


Figure 2: A Bayesian network representing causal influences among five variables

Perhaps the most important aspect of Bayesian networks is that they are direct representations of the world, not of
reasoning processes. The arrows in the diagram represent real causal connections and not the flow of information during
reasoning (as in rule-based systems and neural networks). Reasoning processes can operate on Bayesian networks by
propagating information in any direction. For example, if the sprinkler is on, then the pavement is probably wet
(prediction, simulation); if someone slips on the pavement, that also provides evidence that it is wet (abduction,
reasoning to a probable cause, or diagnosis). On the other hand, if we see that the pavement is wet, that makes it more
likely that the sprinkler is on or that it is raining (abduction); but if we then observe that the sprinkler is on, that
reduces the likelihood that it is raining (explaining away). It is this last form of reasoning, explaining away, that is
especially difficult to model in rule-based systems and neural networks in any natural way, because it seems to require
the propagation of information in two directions.

Probabilistic Semantics
Any complete probabilistic model of a domain must, either explicitly or implicitly, represent the joint probability
distribution, that is, the probability of every possible event as defined by the combination of the values of all the
variables. There are exponentially many such events, yet Bayesian networks achieve compactness by factoring the joint
distribution into local, conditional distributions for each variable given its parents. If xi denotes some value of the
variable Xi and pai denotes some set of values for the parents of Xi, then P(xi∣pai) denotes this conditional
distribution. For example, P(x4∣x2, x3) is the probability of wetness given the values of sprinkler and rain. The global
semantics of Bayesian networks specifies that the full joint distribution is given by the product

P(x1, ..., xn) = ∏i P(xi ∣ pai)      (1)

In our example network, we have

P(x1, x2, x3, x4, x5) = P(x1) P(x2∣x1) P(x3∣x1) P(x4∣x2, x3) P(x5∣x4).      (2)

It becomes clear that the number of parameters grows linearly with the size of the network, i.e. with the number of
variables, whereas each conditional probability table grows exponentially with the number of parents of its node.
Further savings can be achieved by using compact parametric representations, such as noisy-OR models, decision trees, or
neural networks, for the conditional distributions.
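
To make the factorization concrete, the following Python sketch encodes a simplified version of the sprinkler network
from Figure 2 and evaluates Equation (2) for one complete assignment. To keep the example short, every variable is
treated as binary (the season is reduced to a “dry season” indicator), and all probability values are made-up
placeholders rather than estimates.

# Illustrative encoding of a simplified sprinkler network (Figure 2).
# All probability values are invented placeholders, not taken from the paper.

# Parents of each variable (the DAG structure).
parents = {
    "X1": [],            # season (here reduced to a "dry season" flag)
    "X2": ["X1"],        # rain
    "X3": ["X1"],        # sprinkler
    "X4": ["X2", "X3"],  # pavement wet
    "X5": ["X4"],        # pavement slippery
}

# P(variable = True | parent values); every variable is binary here.
p_true = {
    "X1": lambda pa: 0.5,
    "X2": lambda pa: 0.2 if pa["X1"] else 0.75,   # rain is rare in the dry season
    "X3": lambda pa: 0.6 if pa["X1"] else 0.1,    # sprinkler mostly runs when dry
    "X4": lambda pa: {  # wet pavement given (rain, sprinkler)
        (True, True): 0.99, (True, False): 0.9,
        (False, True): 0.9, (False, False): 0.01,
    }[(pa["X2"], pa["X3"])],
    "X5": lambda pa: 0.8 if pa["X4"] else 0.05,   # slippery given wet
}

def prob(var, value, assignment):
    """P(var = value | parents(var)) under the given full assignment."""
    pa = {p: assignment[p] for p in parents[var]}
    p = p_true[var](pa)
    return p if value else 1.0 - p

def joint(assignment):
    """Joint probability of a complete assignment via Equation (2)."""
    result = 1.0
    for var in ["X1", "X2", "X3", "X4", "X5"]:
        result *= prob(var, assignment[var], assignment)
    return result

example = {"X1": True, "X2": False, "X3": True, "X4": True, "X5": True}
print(joint(example))  # 0.5 * 0.8 * 0.6 * 0.9 * 0.8 = 0.1728

The only point of the sketch is that the joint probability of a complete assignment is obtained by multiplying one local
factor per variable.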

There is also an entirely equivalent local semantics, which asserts that each variable is independent of its
nondescendants in the network given its parents. For example, the parents of X4 in Figure 2 are X2 and X3, and they
render X4 independent of the remaining nondescendant, X1. That is,

P(x4∣x1, x2, x3) = P(x4∣x2, x3).

Figure 3: Variable X4 is independent of its nondescendants, in this case X1, given its parents, X2 and X3

The collection of independence assertions formed in this way suffices to derive the global assertion in Equation 1, and
vice versa. The local semantics is most useful in constructing Bayesian networks, because selecting as parents all the
direct causes (or direct relationships) of a given variable invariably satisfies the local conditional independence
conditions. The global semantics leads directly to a variety of algorithms for reasoning.

Evidential Reasoning
From the product specification in Equation 1, one can express the probability of any desired proposition in terms of the
conditional probabilities specified in the network. For example, the probability that the sprinkler is on given that the
pavement is slippery is


P(X3 = on ∣ X5 = true) = P(X3 = on, X5 = true) / P(X5 = true)

= [∑x1,x2,x4 P(x1, x2, X3 = on, x4, X5 = true)] / [∑x1,x2,x3,x4 P(x1, x2, x3, x4, X5 = true)]

= [∑x1,x2,x4 P(x1) P(x2∣x1) P(X3 = on∣x1) P(x4∣x2, X3 = on) P(X5 = true∣x4)] /
  [∑x1,x2,x3,x4 P(x1) P(x2∣x1) P(x3∣x1) P(x4∣x2, x3) P(X5 = true∣x4)]

These expressions can often be simplified in ways that reflect the structure of the network itself. The first algorithms
proposed for probabilistic calculations in Bayesian networks used a local distributed message-passing architecture,
typical of many cognitive activities. Initially this approach was limited to tree-structured networks, but it was later
extended to general networks in Lauritzen and Spiegelhalter’s (1988) method of junction tree propagation. A number of
other exact methods have been developed and can be found in recent textbooks.
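
The same computation can also be carried out as a brute-force enumeration over the joint distribution. The sketch below
reuses the parents, prob, and joint definitions from the sprinkler sketch above (with the same illustrative numbers);
enumeration is exponential in the number of variables and is shown only to mirror the expression above, not as a
practical inference algorithm.

from itertools import product

# Inference by enumeration over the sprinkler network defined earlier.
variables = ["X1", "X2", "X3", "X4", "X5"]

def marginal(fixed):
    """Sum the joint over all assignments consistent with `fixed`."""
    total = 0.0
    free = [v for v in variables if v not in fixed]
    for values in product([True, False], repeat=len(free)):
        assignment = dict(fixed, **dict(zip(free, values)))
        total += joint(assignment)
    return total

# P(sprinkler on | pavement slippery), mirroring the derivation above.
posterior = marginal({"X3": True, "X5": True}) / marginal({"X5": True})
print(round(posterior, 3))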

It is easy to show that reasoning in Bayesian networks subsumes the satisfiability problem in propositional logic and
hence is NP-hard. Monte Carlo simulation methods can be used for approximate inference (Pearl, 1988), giving gradually
improving estimates as sampling proceeds. Unlike junction tree methods, these methods use local message propagation on
the original network structure. Alternatively, variational methods provide bounds on the true probability.

Learning Bayesian Networks


The conditional probabilities P(xi|pai) of a given structure can be estimated from data by using the maximum likelihood
approach (observed frequencies). They can also be updated continuously from observational data using gradient-based
or EM methods that use just local information derived from inference — in much the same way as weights are adjusted
in neural networks.
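
As a minimal illustration of the maximum likelihood approach, the following Python sketch estimates the conditional
probability table of X4 (wet pavement) given X2 (rain) and X3 (sprinkler) from a handful of invented observations,
simply by counting relative frequencies.

from collections import Counter

# Maximum-likelihood CPT estimation for one node; the data are made up.
data = [
    # (rain, sprinkler, wet)
    (True, False, True), (False, True, True), (False, False, False),
    (True, True, True), (False, True, False), (False, False, False),
]

wet_counts = Counter()
parent_counts = Counter()
for rain, sprinkler, wet in data:
    parent_counts[(rain, sprinkler)] += 1
    if wet:
        wet_counts[(rain, sprinkler)] += 1

# P(X4 = wet | X2, X3) estimated as the observed relative frequency.
cpt_x4 = {pa: wet_counts[pa] / n for pa, n in parent_counts.items()}
print(cpt_x4)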

It is also possible to machine-learn the structure of a Bayesian network, and two families of methods are available for
that purpose. The first, the constraint-based algorithms, is based on the probabilistic semantics of Bayesian networks.
Links are added or deleted according to the results of statistical tests, which identify marginal and conditional
independencies. The second approach, the score-based algorithms, is based on a metric measuring the quality of candidate
networks with respect to the observed data. This metric trades off network complexity against degree of fit to the data,
typically expressed as the likelihood of the data given the network.
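
The following toy sketch illustrates the score-based idea for a single candidate link between two binary variables,
using a BIC-style score (log-likelihood minus a complexity penalty). The data and the scoring details are simplified
assumptions chosen purely for illustration.

import math
from collections import Counter

# Compare two candidate structures for binary variables X and Y.
data = [(0, 0)] * 40 + [(0, 1)] * 10 + [(1, 0)] * 15 + [(1, 1)] * 35
n = len(data)

def log_lik_independent():
    """Log-likelihood of the data under the structure with no link."""
    px = Counter(x for x, _ in data)
    py = Counter(y for _, y in data)
    return sum(math.log(px[x] / n) + math.log(py[y] / n) for x, y in data)

def log_lik_with_link():
    """Log-likelihood under the structure X -> Y, i.e. P(x) P(y | x)."""
    px = Counter(x for x, _ in data)
    pxy = Counter(data)
    return sum(math.log(px[x] / n) + math.log(pxy[(x, y)] / px[x]) for x, y in data)

def bic(log_lik, n_params):
    """Fit minus a complexity penalty that grows with the parameter count."""
    return log_lik - 0.5 * n_params * math.log(n)

# The independent model has 2 free parameters, the linked model has 3.
print("no link:", round(bic(log_lik_independent(), 2), 2))
print("X -> Y :", round(bic(log_lik_with_link(), 3), 2))  # higher score: keep the link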

As a substrate for learning, Bayesian networks have the advantage that it is relatively easy to encode prior knowledge in
network form, either by fixing portions of the structure or by using prior distributions over the network parameters.
Such prior knowledge can allow a system to learn accurate models from much less data than are required for tabula rasa
approaches.
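
A simple way to encode such a prior over parameters is a Beta (or, more generally, Dirichlet) prior, which acts like a
set of “virtual” observations added to the data. The sketch below is a minimal illustration with invented numbers.

# Encoding prior knowledge as a prior over one parameter, P(X = True).
def posterior_mean(successes, trials, prior_mean=0.5, prior_strength=4.0):
    """Beta-posterior mean; prior_strength acts like the number of
    'virtual' observations encoding the prior belief."""
    alpha = prior_mean * prior_strength
    beta = (1.0 - prior_mean) * prior_strength
    return (successes + alpha) / (trials + alpha + beta)

# With only 3 observations the prior still dominates ...
print(round(posterior_mean(3, 3), 3))      # about 0.71 rather than the ML estimate of 1.0
# ... with 300 observations the data dominate.
print(round(posterior_mean(300, 300), 3))  # about 0.99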

Uncertainty Over Time

Entities that live in a changing environment must keep track of variables whose values change over time. Dynamic
Bayesian networks capture this process by representing multiple copies of the state variables, one for each time step. A
set of variables Xt denotes the world state at time t and a set of sensor variables Et denotes the observations available
at time t. The sensor model P(Et|Xt) is encoded in the conditional probability distributions for the observable variables,
given the state variables. The transition model P(Xt+1|Xt) relates the state at time t to the state at time t+1. Keeping
track of the world means computing the current probability distribution over world states given all past observations,
i.e., P(Xt|E1,…,Et). Dynamic Bayesian networks are strictly more expressive than other temporal probability models such as
hidden Markov models and Kalman filters.
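
As a minimal sketch of this bookkeeping, the following Python snippet tracks a single binary state variable, which makes
the dynamic Bayesian network equivalent to a two-state hidden Markov model; the transition and sensor probabilities are
illustrative assumptions.

# Keeping track of a binary world state over time (illustrative numbers).
transition = {True: 0.7, False: 0.3}   # P(X_{t+1} = True | X_t)
sensor = {True: 0.9, False: 0.2}       # P(E_t = True | X_t)

def filter_step(belief, evidence):
    """One step of P(X_t | e_1..t): predict with the transition model,
    then weight by the sensor model and renormalize."""
    predicted = {
        s: sum(belief[prev] * (transition[prev] if s else 1 - transition[prev])
               for prev in (True, False))
        for s in (True, False)
    }
    weighted = {
        s: predicted[s] * (sensor[s] if evidence else 1 - sensor[s])
        for s in (True, False)
    }
    z = sum(weighted.values())
    return {s: weighted[s] / z for s in (True, False)}

belief = {True: 0.5, False: 0.5}
for e in [True, True, False]:          # a short stream of observations
    belief = filter_step(belief, e)
    print(belief)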

Causal Networks
Most probabilistic models, including general Bayesian networks, describe a distribution over possible observed events, as
in Equation 1, but say nothing about what will happen if a certain intervention occurs. For example, what if I turn
the sprinkler on? What effect does that have on the season, or on the connection between wetness and slipperiness? A
causal network, intuitively speaking, is a Bayesian network with the added property that the parents of each node are its
direct causes, as in Figure 2. In such a network, the result of an intervention is obvious: the sprinkler node is set to
X3 = on and the causal link between the season X1 and the sprinkler X3 is removed (see Figure 4). All other causal links
and conditional probabilities remain intact, so the new model is

P(x1, x2, x4, x5) = P(x1) P(x2∣x1) P(x4∣x2, X3 = on) P(x5∣x4).

Notice that this differs from observing that X3=on, which would result in a new model that included the term
P(X3=on|x1). This mirrors the difference between seeing and doing: after observing that the sprinkler is on, we wish to
infer that the season is dry, that it probably did not rain, and so on; an arbitrary decision to turn the sprinkler on should
not result in any such beliefs.

Figure 4: A causal network reflecting the intervention, X3=on

Causal networks are more properly defined, then, as Bayesian networks in which the correct probability model after
intervening to fix any node’s value is given simply by deleting links from the node’s parents. For example, Fire → Smoke
is a causal network whereas Smoke → Fire is not, even though both networks are equally capable of representing any
joint distribution on the two variables. Causal networks model the environment as a collection of stable component
mechanisms. These mechanisms may be reconfigured locally by interventions, with correspondingly local changes in the
model. This, in turn, allows causal networks to be used very naturally for prediction by an agent that is considering
various courses of action.
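
The contrast between seeing and doing can be reproduced with the enumeration sketches above (reusing the parents, prob,
marginal, variables, and product definitions, with the same illustrative numbers): observing X3 = on shifts our belief
about rain, whereas intervening to set X3 = on leaves it at its prior value, because the factor P(X3∣X1) is simply
dropped from the product.

def joint_do_x3_on(assignment):
    """Joint distribution under the intervention do(X3 = on): the link
    X1 -> X3 is cut, so the factor P(X3 | X1) is dropped (Figure 4)."""
    if not assignment["X3"]:
        return 0.0
    result = 1.0
    for var in ["X1", "X2", "X4", "X5"]:
        result *= prob(var, assignment[var], assignment)
    return result

def marginal_do(fixed):
    """Sum the post-intervention joint over assignments consistent with `fixed`."""
    free = [v for v in variables if v not in fixed]
    total = 0.0
    for values in product([True, False], repeat=len(free)):
        assignment = dict(fixed, **dict(zip(free, values)))
        total += joint_do_x3_on(assignment)
    return total

# Seeing: observing the sprinkler on lowers the probability of rain
# (roughly 0.28 with the illustrative numbers used above).
print(marginal({"X2": True, "X3": True}) / marginal({"X3": True}))

# Doing: turning the sprinkler on by decree leaves the probability of rain
# at its prior value (0.475 with the same numbers).
print(marginal_do({"X2": True, "X3": True}) / marginal_do({"X3": True}))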


Causal Discovery
One of the most exciting prospects in recent years has been the possibility of using Bayesian networks to discover
causal structures in raw statistical data, a task previously considered impossible without controlled experiments.
Consider, for example, the following intransitive pattern of dependencies among three events: A and B are dependent, B
and C are dependent, yet A and C are independent. If you ask a person to supply an example of three such events, the
example would invariably portray A and C as two independent causes and B as their common effect, namely A → B ← C.
(For instance, A and C could be the outcomes of two fair coins and B represents a bell that rings whenever either coin
comes up heads.)

Figure 5: Causal model for variables A, C and B, representing two fair coins and a bell, respectively.

Fitting this dependence pattern with a scenario in which B is the cause and A and C are the effects is mathematically
feasible but very unnatural (see Figure 5), because it must entail fine tuning of the probabilities involved; the desired
dependence pattern will be destroyed as soon as the probabilities undergo a slight change.
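
The coin-and-bell example can be checked directly by simulation. The following Python sketch, a toy illustration rather
than a formal test, samples from the model A → B ← C and compares joint frequencies with products of marginal
frequencies to expose the intransitive dependence pattern described above.

import random

# Simulate the two-coin / bell example: A and C are fair coins, and
# B rings whenever either coin comes up heads.
random.seed(0)
n = 100_000
samples = []
for _ in range(n):
    a = random.random() < 0.5
    c = random.random() < 0.5
    b = a or c
    samples.append((a, b, c))

def dependence(i, j):
    """|P(Xi=T, Xj=T) - P(Xi=T) P(Xj=T)| as a rough measure of association."""
    p_i = sum(s[i] for s in samples) / n
    p_j = sum(s[j] for s in samples) / n
    p_ij = sum(s[i] and s[j] for s in samples) / n
    return abs(p_ij - p_i * p_j)

print("A,B:", round(dependence(0, 1), 3))  # clearly nonzero: A and B are dependent
print("B,C:", round(dependence(1, 2), 3))  # clearly nonzero: B and C are dependent
print("A,C:", round(dependence(0, 2), 3))  # close to zero: A and C are independent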

Such thought experiments tell us that certain patterns of dependency, which are totally void of temporal information,
are conceptually characteristic of certain causal directionalities and not others. When put together systematically, such
patterns can be used to infer causal structures from raw data and to guarantee that any alternative structure compatible
with the data must be less stable than the one(s) inferred; namely, slight fluctuations in parameters will render that
structure incompatible with the data.


References
Barber, David. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2011. http://www.cs.ucl.ac.uk/staff/d.barber/brml.
Barnard, G. A, and T. Bayes. “Studies in the History of Probability and Statistics: IX. Thomas Bayes's Essay Towards
Solving a Problem in the Doctrine of Chances.” Biometrika 45, no. 3 (1958): 293–315.  
Darwiche, Adnan. “Bayesian networks.” Communications of the ACM 53, no. 12 (December 2010): 80.
Hilbert, M., and P. Lopez. “The World's Technological Capacity to Store, Communicate, and Compute Information.”
Science (February 2011). http://www.sciencemag.org/cgi/doi/10.1126/science.1200970.
Koller, Daphne, and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.
Lauritzen, Steffen L., and David J. Spiegelhalter. “Local Computations with Probabilities on Graphical Structures and
Their Application to Expert Systems.” Journal of the Royal Statistical Society, Series B 50, no. 2 (1988): 157–224.
Neapolitan, Richard E., and Xia Jiang. Probabilistic Methods for Financial and Marketing Informatics. 1st ed. Morgan
Kaufmann, 2007.
Pearl, Judea. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
Pearl, Judea, and Stuart Russell. Bayesian Networks. UCLA Cognitive Systems Laboratory, November 2000.
http://bayes.cs.ucla.edu/csl_papers.html.
Pearl, Judea. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.  
Pearl, Judea. Causality: Models, Reasoning and Inference. 2nd ed. Cambridge University Press, 2009.  
Spirtes, Peter, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search, Second Edition. 2nd ed. The
MIT Press, 2001.  


Contact Information

Conrady Applied Science, LLC


312 Hamlet’s End Way
Franklin, TN 37067
USA
+1 888-386-8383
info@conradyscience.com
www.conradyscience.com

Bayesia SAS
6, rue Léonard de Vinci
BP 119
53001 Laval Cedex
France
+33(0)2 43 49 75 69
info@bayesia.com
www.bayesia.com
