
The Performance of Bayesian Network Classifiers

Constructed using Different Techniques

Michael G. Madden

Department of Information Technology
National University of Ireland, Galway, Ireland
michael.madden@nuigalway.ie

Abstract. This paper presents empirical results for classification using
Bayesian networks constructed using the K2 Bayesian metric, and
compares these results with those of other researchers who have used
Bayesian networks constructed using the MDL score and using
conditional independence tests. There are significant disparities in these
results, which is somewhat paradoxical as it has been shown that the
MDL score is asymptotically equivalent to the Bayesian metric, and that
structure search based on maximising a score is equivalent to structure
search based on conditional independence tests. To resolve this paradox,
we analyse the differences in methods used by different researchers to
identify the source of the disparities. We conclude that differences in
performance are attributable to differences in parameter estimation and
structure search heuristics, rather than to differences in the scores/tests
used.

1 Introduction

In their well-known paper on Bayesian classifiers, Friedman et al. [10] showed
that general Bayesian networks constructed using the minimal description
length (MDL) score tend to perform badly in benchmark classification tasks,
in several cases performing worse than Naive Bayes. On the other hand,
Cheng & Greiner [3, 4] have presented results demonstrating that Bayesian
networks constructed using conditional independence (CI) tests perform well
at classification tasks. While it is tempting to conclude that the CI approach
is more suitable for constructing BN classifiers, Cowell [8] has shown that for
any structure search procedure based on CI tests, an equivalent procedure
based on maximising a logarithmic score can be specified. Furthermore, as
reported in Section 3 below, we have found that Bayesian networks
constructed using the K2 algorithm [7] perform well in classification on
benchmark datasets, in contrast with the results of Friedman et al., even
though the MDL score is asymptotically equivalent to K2's Bayesian score
[10, 12].
As well as evaluating the classification performance of the general Bayesian
network (GBN), Friedman et al. [10] and Cheng & Greiner [3] evaluate the
performance of Naive Bayes (NB) and Tree-Augmented Naive Bayes (TAN)
classifiers. This paper also examines those classifiers.
The structure of this paper is as follows. Section 2 briefly reviews Bayesian
networks, Bayesian classifiers and the K2 algorithm [7]. Section 3 describes
our experiments of applying K2 and the restricted Bayesian classifiers to
standard benchmark datasets, and also presents the results reported by
Friedman et al. [10] and Cheng & Greiner [3] for the same datasets. Section 4
explores why significantly different results have been reported for the
performance of Bayesian networks on the same classification tasks. Section 5
summarises the main results and draws conclusions.

2 Bayesian Networks and Classification

Bayesian networks graphically represent the joint probability distribution of a
set of random variables. A Bayesian network is composed of a qualitative
portion (its structure) and a quantitative portion (its conditional
probabilities). The structure BS is a directed acyclic graph where the nodes
correspond to domain variables x1, …, xn and the arcs between nodes
represent direct dependencies between the variables. Conversely, the absence of
an arc between two nodes x1 and x2 indicates that x2 is independent of x1
given its parents in BS. Using the notation of Cooper & Herskovits [7], the set
of parents of a node xi in BS is denoted πi. The structure is annotated with a
set of conditional probabilities, BP, containing a term P(Xi | Πi) for each
possible value Xi of xi and each possible instantiation Πi of πi.
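
To make the 〈BS, BP〉 decomposition concrete, the following minimal sketch (our illustration, not taken from any of the cited implementations) shows one way such a network could be represented in Python; the node names and probability values are invented.

# B_S maps each node to its list of parents; B_P holds one conditional
# distribution per (node, parent instantiation) pair.
B_S = {
    "x1": [],        # x1 has no parents
    "x2": ["x1"],    # arc x1 -> x2: x2 depends directly on x1
}
B_P = {
    ("x1", ()):     {"a": 0.6, "b": 0.4},   # marginal P(x1)
    ("x2", ("a",)): {"t": 0.9, "f": 0.1},   # P(x2 | x1 = a)
    ("x2", ("b",)): {"t": 0.2, "f": 0.8},   # P(x2 | x1 = b)
}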

2.1 Inductive Learning of Bayesian Networks

Several algorithms have been proposed in the last decade for inductive
learning of Bayesian networks. Recent developments include the Three-Phase
Dependency Analysis algorithm [5] and the Greedy Equivalence Search
algorithm [6]. Cheng et al. [5] provide a good summary and comparison of
earlier algorithms. This section briefly summarises three approaches that have
been used as the basis for Bayesian network classifiers, and for which accuracy
results are listed in Table 2. In each case, the network structure is found by
heuristic search and then its probabilities are estimated.

Bayesian Scoring Approach


Our experiments, reported below in Section 3, are based on the Bayesian
scoring approach first used in K2 [7]. Four assumptions are identified in
deriving the Bayesian score (though Heckerman [12] identifies others also):
1. Variables are discrete and all are observed
2. Cases occur independently, given a network
3. There are no cases with missing values
4. Network structures have uniform priors
Let Z be a set of n discrete variables, where a variable xi in Z has ri possible
value assignments, (v_{i1}, …, v_{ir_i}). Let D be a database of m cases, each with a
value assignment for each variable in Z. Let BS denote a network structure
containing just the variables in Z. Each variable xi in BS has zero or more
parents, represented as a list of variables πi. Let wij denote the jth unique
instantiation of πi relative to D, and assume that there are qi such unique
instantiations of πi. Let Nijk be defined as the number of cases in D in which
variable xi has the value vik and πi is instantiated as wij. Let Nij be defined as:
    N_{ij} = \sum_{k=1}^{r_i} N_{ijk} .     (1)

Then, given the assumptions outlined above,

    P(B_S, D) = P(B_S) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!     (2)

From the fourth assumption above, P(BS) is constant, so maximizing
P(BS,D) just requires finding the set of parents for each node that maximizes
the inner products of Eq. (2). K2 proceeds by initially assuming that a
node has no parents, and then incrementally adding the parent whose
addition most increases the probability of the resulting network. Parents are
added greedily to a node until no single additional parent can increase the
structure probability.
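
As an illustration of this procedure, the sketch below implements the greedy parent-selection step in Python, using the logarithm of Eq. (2) for numerical stability. It is a minimal sketch under stated assumptions, not the authors' code: data is assumed to be a list of dicts mapping variable names to values, values a dict giving each variable's possible values, and candidates would be restricted to nodes that precede the child in the node ordering, as K2 requires.

import math
from itertools import product

def log_k2_score(data, child, parents, values):
    # Log of the per-node term of Eq. (2):
    #   sum_j [ log (r-1)! - log (N_ij + r - 1)! + sum_k log N_ijk! ]
    r = len(values[child])
    total = 0.0
    for inst in product(*(values[p] for p in parents)):
        counts = {v: 0 for v in values[child]}
        for case in data:
            if all(case[p] == w for p, w in zip(parents, inst)):
                counts[case[child]] += 1
        n_ij = sum(counts.values())
        total += math.lgamma(r) - math.lgamma(n_ij + r)   # lgamma(x) = log (x-1)!
        total += sum(math.lgamma(n + 1) for n in counts.values())
    return total

def k2_parents(data, child, candidates, values, max_parents=3):
    # Greedy K2 parent selection: start with no parents, repeatedly add
    # the candidate that most improves the score, stop when none does.
    parents = []
    best = log_k2_score(data, child, parents, values)
    while len(parents) < max_parents:
        scored = [(log_k2_score(data, child, parents + [c], values), c)
                  for c in candidates if c not in parents]
        if not scored:
            break
        score, c = max(scored)
        if score <= best:
            break    # no single additional parent improves the score
        parents.append(c)
        best = score
    return parents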

MDL Scoring Approach


Friedman et al. [10] use a scoring function based on the minimal description
length (MDL) principle. The MDL score of a network B = 〈BS,BP〉 given a
database of training cases D is:
    MDL(B \mid D) = \frac{\log N}{2} |B| - LL(B \mid D) .     (3)
where |B| is the number of parameters in the network and LL(B | D) is the
log-likelihood of B given D. To calculate LL(B | D), let \hat{P}_D(\cdot) be the empirical
probability measure defined by frequencies of events in D. Then:

    LL(B \mid D) = N \sum_{i} \sum_{X_i, \Pi_i} \hat{P}_D(X_i, \Pi_i) \log \hat{P}_D(X_i \mid \Pi_i) .     (4)

The search heuristic used by Friedman et al. is to start with the empty
network and successively apply local operations that greedily reduce the MDL
score until a local minimum is found. The local operations applied are arc
insertion, arc deletion and arc reversal.
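
For comparison with the Bayesian score, here is a hedged sketch of Eqs. (3) and (4) under the same data representation as the previous example; structure maps each node to its parent list, and |B| is taken to be the number of free parameters, \sum_i q_i (r_i - 1).

import math
from collections import Counter

def mdl_score(data, structure, values):
    # MDL(B|D) = (log N / 2) * |B| - LL(B|D), per Eqs. (3) and (4).
    # Lower is better; the search greedily reduces this score.
    N = len(data)
    n_params = 0
    ll = 0.0
    for child, parents in structure.items():
        r = len(values[child])
        q = 1
        for p in parents:
            q *= len(values[p])
        n_params += q * (r - 1)    # free parameters for this node
        joint = Counter((case[child], tuple(case[p] for p in parents))
                        for case in data)
        parent_counts = Counter(tuple(case[p] for p in parents)
                                for case in data)
        for (x, pi), n in joint.items():
            # N * P_hat(x, pi) * log P_hat(x | pi) simplifies to:
            ll += n * math.log(n / parent_counts[pi])
    return 0.5 * math.log(N) * n_params - ll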

Conditional Independence Approach


Cheng & Greiner [3] construct a Bayesian network structure by identifying
the conditional independence relationships among the nodes in the network.
They use the CBL1 algorithm [2], a precursor to the TPDA algorithm [5]. The
basis for this algorithm is testing whether two nodes xi and xj are
conditionally independent given a set of nodes c. This is determined by testing
whether the conditional mutual information of the nodes is smaller than a
threshold value ε. The conditional mutual information is calculated as:
    I(x_i, x_j \mid c) = \sum_{X_i, X_j, C} P(X_i, X_j, C) \log \frac{P(X_i, X_j \mid C)}{P(X_i \mid C) \, P(X_j \mid C)} .     (5)

Their search procedure operates in three phases:
1. Drafting: a singly-connected graph is constructed between the nodes.
2. Thickening: arcs are added wherever two nodes cannot be d-separated.
3. Thinning: any arc is removed where its two nodes can be d-separated.
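
A short sketch of the conditional mutual information test of Eq. (5), again assuming the list-of-dicts data representation used in the earlier examples; in CBL1, two nodes are treated as conditionally independent given the conditioning set when this quantity falls below the threshold ε.

import math
from collections import Counter

def cond_mutual_info(data, xi, xj, cond):
    # I(xi; xj | cond) from Eq. (5), using empirical frequencies in data.
    N = len(data)
    joint = Counter((c[xi], c[xj], tuple(c[v] for v in cond)) for c in data)
    pi_c = Counter((c[xi], tuple(c[v] for v in cond)) for c in data)
    pj_c = Counter((c[xj], tuple(c[v] for v in cond)) for c in data)
    pc = Counter(tuple(c[v] for v in cond) for c in data)
    info = 0.0
    for (vi, vj, vc), n in joint.items():
        # P(xi,xj|C) / (P(xi|C) P(xj|C)) expressed via raw counts
        ratio = (n * pc[vc]) / (pi_c[(vi, vc)] * pj_c[(vj, vc)])
        info += (n / N) * math.log(ratio)
    return info

# xi and xj are treated as conditionally independent given cond when
# cond_mutual_info(data, xi, xj, cond) < epsilon, for a small epsilon.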

2.2 Restricted Bayesian Classifiers

Figure 1 schematically illustrates the structure of the Bayesian classifiers
considered in this paper. The simplest form of Bayesian classifier is Naive
Bayes. When represented as a Bayesian network, a Naive Bayes classifier has
a simple structure whereby there is an arc from the classification node to each
other node, and there are no arcs between other nodes [11], as illustrated in
Figure 1(a).
Several researchers have examined ways of achieving better performance
than Naive Bayes. Friedman et al. [10] analyse Tree Augmented Naive Bayes
(TAN), which allows arcs between the children of the classification node xc as
in Figure 1(b), thereby relaxing the assumption of conditional independence.
In their approach, each node has xc and at most one other node as a parent,
so that the nodes excluding xc form a tree structure. They use the MDL
metric discussed earlier, whereas Cheng & Greiner [3] use conditional
independence tests and we use the K2 metric to construct TAN classifiers. Our
TAN implementation has another minor difference, which is that it does not
add arcs that would reduce the structure probability (Eq. 2).
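
The tree-construction step shared by these TAN variants can be sketched as a maximum-weight spanning tree over the features. The following illustration is ours (the cited implementations differ in the weight function and other details); it uses Kruskal's algorithm with edges weighted by the cond_mutual_info function sketched in Section 2.1.

def build_tan_tree(data, features, class_var):
    # Maximum-weight spanning tree over the features, with each edge
    # (a, b) weighted by I(a; b | class), per the FGG TAN construction.
    edges = sorted(
        ((cond_mutual_info(data, a, b, [class_var]), a, b)
         for i, a in enumerate(features) for b in features[i + 1:]),
        reverse=True)
    parent = {f: f for f in features}      # union-find forest
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    tree = []
    for weight, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree.append((a, b))
    return tree  # undirected edges; orient them outward from a chosen root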

[Figure omitted: three network diagrams over class node xC and feature nodes x1–x4; panels (a) Naive Bayes, (b) TAN, (c) General BN.]

Figure 1: Illustration of Naive Bayes, TAN and General BN Structures


3 Experiments

For this work, 18 datasets were selected from the UCI repository of machine
learning datasets [1], as listed in Table 1. Datasets 1-15 are included in the
analyses of Friedman et al. [10] and datasets 13-18 are included in the
analyses of Cheng & Greiner [3].
A training set with 2/3 or 4/5 of the data was randomly drawn without
replacement from each dataset, with the remainder used for testing. This was
repeated 10 times for training sets of size 2/3 and 20 times for those of size
4/5 (50 times on particularly small datasets, to reduce variance). The only
exception was the Waveform-21 dataset, where the training set size used was
6%, for consistency with Friedman et al. Training set sizes were selected to
allow comparability with the experiments of those other authors: where the
other authors use a training set of 2/3, we use the same size here; where they
use 5-fold cross-validation (CV5) we use a training set of size 4/5.
Table 1 provides details of the training set sizes used in this work
(captioned "Madden") and the work of Friedman et al. ("FGG") and/or
Cheng & Greiner ("CG"). In this work, all datasets were preprocessed in the
same way as those authors did: where datasets had continuous variables, they
were discretized using the discretization utility in MLC++ [14] with its
default entropy-based setting [9], and any cases with missing values were
removed from datasets, except for the Mushroom dataset where they were
assigned the value "unknown" (as done by Cheng & Greiner).
Table 1: Data sets, training set sizes and number of runs per training set
No Dataset Size Training Size/Runs
Madden FGG/CG
1 Breast Cancer 699 80% x10 80% CV5
2 Cleve 296 80% x20 80% CV5
3 Diabetes 768 80% x20 80% CV5
4 German 1000 80% x20 80% CV5
5 Glass 214 80% x20 80% CV5
6 Glass2 163 80% x50 80% CV5
7 Iris 150 80% x50 80% CV5
8 Lymphography 148 80% x50 80% CV5
9 Segment 2310 66.7% x10 66.7% x1
10 Soybean-Large 562 80% x20 80% CV5
11 Vehicle 846 80% x20 80% CV5
12 Waveform-21 5000 6% x10 6% x1
13 Chess 3196 66.7% x10 66.7% x1
14 Flare 1066 80% x20 80% CV5
15 Vote 435 80% x20 80% CV5
16 DNA 3190 66.7% x10 62.8% x1
17 Nursery 12960 66.7% x10 66.7% x1
18 Mushroom 8124 66.7% x10 66.7% x1

Each dataset was analysed using our implementation of the Naive Bayes,
TAN and General BN algorithms described in Section 2. The results
are summarised in Table 2. For each dataset and each algorithm, this table
lists the mean classification accuracy and its standard deviation. Table 2 also
lists the classification accuracies published by Friedman et al. [10] and Cheng
& Greiner [3] for these datasets. The Naive Bayes and TAN values listed for
Friedman et al. are those obtained with parameter smoothing (discussed
later in Section 4.2). Where those authors used only one
training/testing set, they estimated the standard deviation of the accuracy
using a theoretical formula provided in MLC++ [14].

Table 2: Results of this author's experiments and corresponding results
reported by other authors

             Madden                       Friedman et al               Cheng & Greiner
No. NB TAN GBN NB TAN GBN NB TAN GBN
1 97.81 ± 0.51 97.47 ± 0.68 97.17 ± 1.05 97.36 ± 0.55 96.92 ± 0.67 96.92 ± 0.63 N/A N/A N/A
2 84.75 ± 3.81 83.47 ± 4.95 82.97 ± 4.28 82.76 ± 1.27 81.76 ± 0.33 81.39 ± 1.82 N/A N/A N/A
3 72.21 ± 3.70 74.84 ± 2.93 74.97 ± 2.84 74.48 ± 0.89 75.52 ± 1.11 75.39 ± 0.29 N/A N/A N/A
4 70.38 ± 3.84 73.98 ± 3.50 74.28 ± 2.89 74.60 ± 1.34 73.10 ± 1.54 72.30 ± 1.57 N/A N/A N/A
5 64.60 ± 6.91 69.12 ± 7.41 69.12 ± 6.37 67.78 ± 2.17 67.78 ± 3.43 55.57 ± 5.39 N/A N/A N/A
6 79.24 ± 3.58 79.55 ± 5.55 78.03 ± 4.04 79.77 ± 1.50 77.92 ± 1.11 75.49 ± 2.47 N/A N/A N/A
7 93.93 ± 4.13 93.80 ± 4.15 93.87 ± 4.17 94.00 ± 1.25 94.00 ± 1.25 94.00 ± 1.25 N/A N/A N/A
8 83.60 ± 9.82 85.47 ± 9.49 81.47 ± 10.4 81.72 ± 2.62 85.03 ± 3.09 75.03 ± 1.58 N/A N/A N/A
9 90.21 ± 1.31 94.31 ± 0.92 94.40 ± 0.72 90.91 ± 1.04 95.58 ± 0.74 93.51 ± 0.89 N/A N/A N/A
10 90.40 ± 2.52 91.70 ± 2.30 91.34 ± 2.78 92.00 ± 1.32 92.17 ± 1.02 58.54 ± 4.84 N/A N/A N/A
11 60.92 ± 4.80 70.15 ± 3.16 67.60 ± 2.87 58.99 ± 1.57 69.63 ± 2.11 61.00 ± 2.02 N/A N/A N/A
12 79.09 ± 0.87 78.07 ± 0.69 76.72 ± 1.25 78.68 ± 0.60 78.38 ± 0.60 69.45 ± 0.67 N/A N/A N/A
13 87.63 ± 1.61 91.68 ± 1.09 94.03 ± 0.87 87.05 ± 1.03 92.31 ± 0.82 95.59 ± 0.63 87.34 ± 1.02 92.50 ± 0.81 94.65 ± 0.69
14 73.45 ± 2.12 83.10 ± 1.77 82.44 ± 2.49 79.65 ± 1.23 82.27 ± 1.86 82.74 ± 1.90 80.11 ± 3.14 83.49 ± 1.29 82.27 ± 1.45
15 90.11 ± 2.73 94.02 ± 2.46 94.83 ± 2.52 89.89 ± 1.60 93.56 ± 0.28 94.94 ± 0.46 89.89 ± 5.29 94.25 ± 3.63 95.17 ± 1.89
16 94.80 ± 0.44 94.75 ± 0.42 96.22 ± 0.64 N/A N/A N/A 94.27 ± 0.68 93.59 ± 0.71 79.09 ± 1.18
17 90.48 ± 0.41 94.16 ± 0.33 92.63 ± 0.67 N/A N/A N/A 90.32 ± 0.45 91.71 ± 0.42 89.72 ± 0.46
18 95.16 ± 0.49 99.93 ± 0.06 100.0 ± 0.00 N/A N/A N/A 95.79 ± 0.39 99.82 ± 0.08 99.30 ± 0.16

Notably, our results show that the performance of the GBN constructed
using K2 is good relative to NB. Figure 2(a) is a scatter-plot of our
experimental results for accuracies of both classifiers, plotted relative to each
other. In this and subsequent graphs, "A vs B" indicates that A is plotted on
the horizontal axis and B is plotted on the vertical axis. Visually, points
above the diagonal are those where classifier B has higher accuracy. For each
dataset, a two-tailed paired t-test on the accuracy values for each run was
performed. This indicated that there was no significant difference in
performance on four of the datasets (numbers 1, 6, 7, 8), NB was better on
two (2, 12) and GBN was better on the remaining twelve. This is clearly at
variance with the experimental results of Friedman et al., who compared GBN
and NB on 25 datasets and reported that GBN was significantly better on 6
and significantly worse on 6. (Five of those datasets are included in this
study.) Figure 2(b) compares the NB and GBN classifiers' performance as
reported by Friedman et al. Our results are more similar to those of Cheng &
Greiner: for the eight datasets they analysed, in one case NB performed
substantially better than GBN, in four both performed similarly well and in
three cases GBN performed better.
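
For reference, the per-dataset significance test described above corresponds to a standard paired t-test on the per-run accuracies; a sketch using SciPy, with invented accuracy values:

from scipy import stats

# acc_nb and acc_gbn hold per-run accuracies for one dataset,
# measured on the same train/test splits (illustrative numbers).
acc_nb = [0.84, 0.86, 0.83, 0.85, 0.87]
acc_gbn = [0.86, 0.88, 0.84, 0.87, 0.88]

t_stat, p_value = stats.ttest_rel(acc_gbn, acc_nb)   # two-tailed paired t-test
if p_value < 0.05:
    winner = "GBN" if t_stat > 0 else "NB"
    print(f"significant difference, {winner} better (p={p_value:.3f})")
else:
    print(f"no significant difference (p={p_value:.3f})")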
[Scatter plots omitted: both axes show accuracy (%) from 50 to 100; panels titled "MM: NB vs GBN" and "FGG: NB vs GBN".]

Figure 2: (a) Relative accuracies of NB and GBN classifiers from our experiments;
(b) Relative accuracies of NB and GBN classifiers as reported by FGG.
Note: The last paragraph of Sec. 3 explains the layout of these and subsequent plots.

4 Explaining the Discrepancies

Figure 3(a), which is based on some of the data in Table 2, plots the relative
classification accuracies achieved using general Bayesian network classifiers by
ourselves, Friedman et al. and Cheng & Greiner. A total of 24 pairwise
comparisons can be made between these results. In 9 of these, there are
substantial differences between results (Datasets 5, 8, 10, 11, 12, 13, 16, 17
and 18), based on a non-paired t-test. Six of these are disparities between our
results and those of Friedman et al. and the other three are disparities
between our results and those of Cheng & Greiner. While differences in
methodology (e.g. different sub-sampling of test sets) could cause disparities,
they would not explain the skewness evident in the results of Figure 3(a).

[Scatter plots omitted (one panel titled "Naive Bayes Accuracy Comparisons"): both axes show accuracy (%) from 50 to 100; series plotted are FGG vs MM, CG vs MM and FGG vs CG.]

Figure 3: (a) Relative classification accuracies of GBNs reported by different authors;
(b) Relative classification accuracies of different authors' NB classifiers
Four possible sources of these differences are:
1. Evaluation methodology (including dataset sub-sampling)
2. Parameter estimation
3. Structure scoring
4. Structure search (including node ordering)
Each of these is now considered.

4.1 Evaluation Methodology

The evaluation methodology used both by Friedman et al. and by Cheng &
Greiner is to divide larger datasets into 2/3 for training and 1/3 for testing,
and to use 5-fold cross-validation for smaller datasets. The justification for the
methodology used in this paper is:
- Smaller datasets are divided into 4/5 for training and 1/5 for testing, and
results averaged over 20 (or more) runs; this is consistent with the 80%
training set sizes used by the other authors
- Larger datasets are divided into 2/3 for training and 1/3 for testing, and
averaged over 10 runs; this is consistent with the approach of the other
authors, while minimizing the risk of unrepresentative training sets and
allowing us to calculate sample standard deviations.
Accordingly, we assume that differences in methodology should not be a
significant factor in discrepancies between results. This assumption is
supported by the observation that there is good correspondence between our
results and the other authors' for NB (Figure 3(b)) and TAN (Figure 4(a)).
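
The evaluation loop described above amounts to repeated random sub-sampling, which can be sketched as follows; fit, predict and the "class" label key are placeholder names, not part of any cited system.

import random

def evaluate(dataset, train_frac, n_runs, fit, predict, seed=0):
    # Repeated random sub-sampling: draw train_frac of the cases without
    # replacement for training, test on the remainder, record accuracy.
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_runs):
        cases = dataset[:]
        rng.shuffle(cases)
        cut = int(len(cases) * train_frac)
        train, test = cases[:cut], cases[cut:]
        model = fit(train)
        # "class" is a placeholder key for each case's true label
        correct = sum(predict(model, case) == case["class"] for case in test)
        accuracies.append(correct / len(test))
    return accuracies   # report the mean and sample standard deviation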

4.2 Parameter Estimation

Our method for estimating parameters is taken from Cooper & Herskovits [7].
Let θijk denote the conditional probability that a variable xi in BS has the
value vik, for some k from 1 to ri, given that the parents of xi, represented by
πi, are instantiated as wij. Then θijk = P(xi=k|πi=wij) is termed a network
conditional probability. Its value is estimated as:
    \theta_{ijk} = \frac{N_{ijk} + 1}{N_{ij} + r_i} .     (6)

This is the m-estimate of probability with uniform priors and with m equal
to the number of different values that may be assigned to xi [16, Ch. 6].
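
A direct rendering of Eq. (6) in Python, under the same data representation as the earlier sketches:

from collections import Counter
from itertools import product

def estimate_theta(data, child, parents, values):
    # Eq. (6): theta_ijk = (N_ijk + 1) / (N_ij + r_i).
    # Unseen parent instantiations fall back to the uniform 1/r_i.
    r = len(values[child])
    n_ijk = Counter((tuple(c[p] for p in parents), c[child]) for c in data)
    n_ij = Counter(tuple(c[p] for p in parents) for c in data)
    return {
        (inst, v): (n_ijk[(inst, v)] + 1) / (n_ij[inst] + r)
        for inst in product(*(values[p] for p in parents))
        for v in values[child]
    }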
Cheng & Greiner report that they also use the estimation method of
Cooper & Herskovits. Friedman et al. consider two methods: simple frequency
counts and a more sophisticated technique they term parameter smoothing.
They present results for NB and TAN with and without parameter
smoothing, but GBN results for networks without parameter smoothing only.
Because NB has a fixed structure, comparison of NB results should indicate
differences arising from the combined effects of evaluation methodology and
parameter estimation. Figure 3(b) compares the different authors' NB results,
taken from Table 2 (i.e. using the results of Friedman et al. for Smoothed
NB). The only anomalous one is our result for Dataset 14,
which differs from the results both of Friedman et al. and of Cheng &
Greiner. The relatively high level of agreement between these results provides
evidence that experimental methodology does not account for the
discrepancies between GBN results. It also provides some evidence that
parameter estimation is not a source of divergence. However, Friedman et al.
note that sophisticated parameter estimation is less important in NB than in
classifiers with more parameters, as there should be sufficient empirical data
available for frequency counts to be accurate. Accordingly, we must look at
relative performance of more sophisticated classifiers to evaluate this.
Figure 4(a) compares the different authors' TAN results; again, the results
for Friedman et al. are for Smoothed TAN. This figure shows extremely
strong agreement between results, providing stronger evidence (given the
comments in the preceding paragraph) that parameter estimation does not
account for the observed divergence in GBN results. Of the 24 pairwise
comparisons plotted, only one (Dataset 17) shows a significant difference and
its magnitude is relatively small.
Figure 4(b), on the other hand, does not show such strong agreement in
results. This figure plots results for Friedman et al.'s Unsmoothed TAN
against our TAN results and those of Cheng & Greiner. In the 18 pairs of
results plotted in Figure 4(b), there are 7 with substantial differences
(Datasets 2, 8, 9, 10, 12 and 15; the result of Friedman et al. for Dataset 15 is
different from our result and from that of Cheng & Greiner). This indicates
the importance of parameter smoothing, as explained by Friedman et al.
Considered in conjunction with Figure 4(a), it also indicates that the
probability estimate of Eq. (6) achieves the same effect as the parameter
smoothing used by Friedman et al., at least for TAN structures. (It is possible
that for structures with even more parameters, such as GBN structures,
parameter smoothing may provide additional benefit.)
[Scatter plots omitted: both axes show accuracy (%) from 50 to 100; series compare the FGG, CG and MM results pairwise.]

Figure 4: (a) Relative classification accuracies of different authors' TAN classifiers,
where FGG results are for Smoothed TAN; (b) Comparison of the Unsmoothed TAN
results of FGG with the TAN results of the other authors
4.3 Structure Scoring

Because exhaustive enumeration, rather than heuristic search, is used for
construction of TAN classifiers, differences in the performance of different
authors' TAN classifiers will be due primarily to differences in structure
scoring, assuming (as has been concluded in Section 4.2) that parameter
estimation and methodology are equivalent. As mentioned in Section 2,
Friedman et al.'s TAN classifier is constructed using the MDL score
conditioned on the class variable. Cheng & Greiner's TAN algorithm is based
on that of Friedman et al., except it uses conditional independence tests.
Likewise, our TAN algorithm works similarly but using the Bayesian score.
Looking again at Figure 4(a), it is clear that TAN classifiers built using the
different scores produce remarkably similar results, provided that equivalent
forms of parameter estimation are used. This supports the theoretical analysis
of Cowell [8], showing that structure search based on CI tests and structure
search based on maximising scoring metrics are equivalent.

4.4 Structure Search

The methods for structure search used by the different authors, as described
already in Section 2, are superficially quite different. Unlike the K2 procedure
we use and the approach of Friedman et al., the procedure of Cheng &
Greiner comes with guarantees: given a dataset that is large enough and has a
DAG-isomorphic probability distribution, their CBL1 algorithm is guaranteed
to generate the perfect map [17] of the underlying dependency model.
Node ordering can of course affect heuristic search, particularly for K2 and
CBL1 as they restrict candidate parents for a node to appear before it in the
ordering. We use the ordering given for the features in all datasets, except
that we place the classification node first so that it can potentially be a parent
of any other node. Cheng & Greiner do the same. Details of the node
orderings used by Friedman et al. are not published.
Given that Friedman et al. do not report results for GBN with parameter
smoothing, and that their results for TAN without parameter smoothing do
not agree well with our TAN results or those of Cheng & Greiner, it might
initially appear that the disparity of results illustrated in Figure 2(a) could be
explained by differences in parameter estimation. However, three of the
anomalous results are in comparisons between our classifiers and those of
Cheng & Greiner (Datasets 16, 17 and 18), where the same parameter
estimation is used. The other six anomalous results are in comparisons
between K2 and the GBN of Friedman et al. (Datasets 5, 8, 10, 11, 12, 13).
Of those datasets, there was good agreement on Datasets 5, 11 and 13
between our TAN classifier and Friedman et al.'s Unsmoothed TAN classifier,
where structure search is not a factor. Accordingly, we conclude that
differences in structure search do indeed appear to account for some of the
disparity in results.
As an aside, we note that in the three cases where our GBN results are
different from the GBN results of Cheng & Greiner, our results are better. It
appears that K2 is a serendipitous algorithm for use as a classifier, if the
classification node is placed first in the node ordering provided to it (as is
typically done). In this case, the classification node is the first candidate
considered when selecting parents for all other nodes. The result is effectively
a selective BN augmented Naive Bayes structure.

5 Conclusions

This paper has presented experimental results applying Naive Bayes (NB),
Tree Augmented Naive Bayes (TAN) and general Bayesian network (GBN)
algorithms to classification of 18 well-known datasets. The algorithm used to
construct GBNs was K2 [7]. The TAN algorithm also used the Bayesian
scoring function from K2. There are significant disparities between these
experimental results and the results that have been previously presented by
other authors. Friedman et al. [10] evaluated the classification performance of
a GBN algorithm based on the MDL score using 25 standard datasets and
concluded that, on average, it did not out-perform NB. Conversely, Cheng &
Greiner [3] evaluated a GBN algorithm based on conditional independence
testing on eight standard datasets and presented results showing it
outperformed NB more often than not.
Given that the Bayesian score is known to be asymptotically equivalent to
the MDL score [12] and that using conditional independence tests as a basis
for structure search is equivalent to search based on locally maximising a
scoring metric [8], this paper has sought to identify the source of the
disparity. Our conclusion is that the principal causes of differences are the
formulae used for parameter estimation and the structure search procedures
followed. Additionally, our results provide experimental support for the
analysis of Cowell [8]; in particular, our comparison of TAN classifiers (Figure
4(a)) demonstrates that very similar classification performance can be
achieved by classifiers constructed using different approaches, when heuristic
search is not required.
Finally, it must be emphasised that it is not the objective of this paper to
promote the use of general Bayesian networks as classifiers; in their papers
from which the NB, TAN and GBN results of Table 2 are taken, Friedman et
al. and Cheng & Greiner propose and analyse alternative structures based on
Bayesian networks that outperform the algorithms discussed in this paper.
More recently, Keogh and Pazzani [13] propose an algorithm for constructing
TAN-type classifiers using classification accuracy rather than maximum-
likelihood scores. Nonetheless, the results presented here challenge the claim,
quite often repeated in the literature on Bayesian network classifiers, that
general Bayesian networks are not in general better for classification tasks
than Naive Bayes.
References

1. Blake, C.L. and Merz, C.J. (1998). UCI Repository of machine learning databases:
http://www.ics.uci.edu/~mlearn/MLRepository.html. University of California at
Irvine, Department of Information and Computer Science.
2. Cheng, J., Bell, D.A. and Liu, W. (1997). An Algorithm for Bayesian Belief
Network Construction from Data. Proc. 6th International Workshop on Artificial
Intelligence and Statistics.
3. Cheng, J. and Greiner, R. (1999). Comparing Bayesian Network Classifiers. Proc.
15th International Conference on Uncertainty in Artificial Intelligence.
4. Cheng, J. and Greiner, R. (2001). Learning Bayesian Belief Network Classifiers:
Algorithms and System. Proc. 14th Canadian Conference on Artificial Intelligence.
5. Cheng, J., Greiner, R., Kelly, J., Bell, D. and Liu, W. (2002). Learning Belief
Networks from Data: An Information Theory Based Approach. Artificial
Intelligence, Vol. 137, pp 43-90.
6. Chickering, D.M. (2002). Optimal Structure Identification with Greedy Search.
Journal of Machine Learning Research, Vol. 3, pp 507-554.
7. Cooper, G.F. and Herskovits, E. (1992). A Bayesian Method for the Induction of
Probabilistic Networks from Data. Machine Learning, Vol. 9, pp 309-347. Kluwer
Academic Publishers, Boston.
8. Cowell, R.G. (2001). Conditions Under Which Conditional Independence and
Scoring Methods Lead to Identical Selection of Bayesian Network Models. Proc.
17th International Conference on Uncertainty in Artificial Intelligence.
9. Dougherty, J., Kohavi, R. and Sahami, M. (1995). Supervised and Unsupervised
Discretization of Continuous Features. Proc. 12th International Conference on
Machine Learning.
10. Friedman, N., Geiger, D. and Goldszmidt, M. (1997). Bayesian Network Classifiers.
Machine Learning, Vol. 29, pp 131-163. Kluwer Academic Publishers, Boston.
11. Friedman, N. and Goldszmidt, M. (1996). Building Classifiers Using Bayesian
Networks. Proc. 13th National Conference on Artificial Intelligence. Vol. 2, pp
1277-1284.
12. Heckerman, D. (1996). A Tutorial on Learning with Bayesian Networks. Technical
Report MSR-TR-95-06, Microsoft Corporation, Redmond.
13. Keogh, E. and Pazzani, M.J. (2002). Learning the Structure of Augmented
Bayesian Classifiers. International Journal on Artificial Intelligence Tools, Vol. 11,
No. 4, pp 587-601.
14. Kohavi, R., Sommerfield, D. and Dougherty, J. (1997). Data Mining using
MLC++. International Journal on Artificial Intelligence Tools, Vol. 6, No. 4, pp
537-566.
15. Meek, C. (1997). Graphical Models: Selecting Causal and Statistical Models. Ph.D.
Thesis, Carnegie Mellon University.
16. Mitchell, T.M. (1997). Machine Learning. McGraw-Hill, New York.
17. Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of
Plausible Inference. Morgan Kaufmann, San Francisco.
