Michael G. Madden
1 Introduction
Several algorithms have been proposed in the last decade for inductive
learning of Bayesian networks. Recent developments include the Three-Phase
Dependency Analysis algorithm [5] and the Greedy Equivalence Search
algorithm [6]. Cheng et al. [5] provide a good summary and comparison of
earlier algorithms. This section briefly summarises three approaches that have
been used as the basis for Bayesian network classifiers, and for which accuracy
results are listed in Table 2. In each case, the network structure is found by
heuristic search and then its probabilities are estimated.
The search heuristic used by Friedman et al. is to start with the empty
network and successively apply local operations that greedily reduce the MDL
score until a local minimum is found. The local operations applied are arc
insertion, arc deletion and arc reversal.
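As an illustration of this procedure, the following is a minimal sketch of greedy local search over DAG structures, assuming a hypothetical scoring function mdl_score(dag, data) that returns the MDL score of a candidate structure (lower is better). It illustrates the general scheme rather than reproducing the implementation of Friedman et al.

    import itertools

    def is_acyclic(parents):
        """Check that the parent map encodes a DAG (depth-first search)."""
        visiting, done = set(), set()

        def visit(v):
            if v in done:
                return True
            if v in visiting:
                return False          # back-edge: cycle found
            visiting.add(v)
            ok = all(visit(p) for p in parents[v])
            visiting.discard(v)
            if ok:
                done.add(v)
            return ok

        return all(visit(v) for v in parents)

    def apply_op(parents, x, y, op):
        """Return a copy of `parents` with arc x->y inserted, deleted or
        reversed, or None if the operation is inapplicable or creates a cycle."""
        new = {v: set(ps) for v, ps in parents.items()}
        if op == "insert" and x not in new[y]:
            new[y].add(x)
        elif op == "delete" and x in new[y]:
            new[y].remove(x)
        elif op == "reverse" and x in new[y]:
            new[y].remove(x)
            new[x].add(y)
        else:
            return None
        return new if is_acyclic(new) else None

    def greedy_search(variables, data, mdl_score):
        """Start from the empty network and repeatedly apply a local operation
        that reduces the MDL score, stopping at a local minimum."""
        dag = {v: set() for v in variables}   # maps each node to its parent set
        best = mdl_score(dag, data)
        improved = True
        while improved:
            improved = False
            for x, y in itertools.permutations(variables, 2):
                for op in ("insert", "delete", "reverse"):
                    cand = apply_op(dag, x, y, op)
                    if cand is not None:
                        s = mdl_score(cand, data)
                        if s < best:
                            dag, best, improved = cand, s, True
        return dag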
Figure 1: Example network structures over the class node xC and attributes x1-x4: (a) Naive Bayes; (b) TAN; (c) General BN.
For this work, 18 datasets were selected from the UCI repository of machine
learning datasets [1], as listed in Table 1. Datasets 1-15 are included in the
analyses of Friedman et al. [10] and datasets 13-18 are included in the
analyses of Cheng & Greiner [3].
A training set with 2/3 or 4/5 of the data was randomly drawn without
replacement from each dataset, with the remainder used for testing. This was
repeated 10 times for training sets of size 2/3 and 20 times for those of size
4/5 (50 times on particularly small datasets, to reduce variance). The only
exception was the Waveform-21 dataset, where the training set size used was
6%, for consistency with Friedman et al. Training set sizes were selected to
allow comparability with the experiments of those other authors: where the
other authors use a training set of 2/3, we use the same size here; where they
use 5-fold cross-validation (CV5) we use a training set of size 4/5.
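For concreteness, this sampling protocol amounts to the following, sketched here with scikit-learn utilities; make_classifier is a hypothetical factory for any of the classifiers under study, and this is an illustration of the protocol rather than the code used for the experiments.

    import numpy as np
    from sklearn.model_selection import train_test_split

    def repeated_holdout(X, y, make_classifier, train_frac=0.8, runs=20, seed=0):
        """Draw `runs` random train/test splits without replacement and return
        the mean and sample standard deviation of the test-set accuracy."""
        rng = np.random.RandomState(seed)
        accuracies = []
        for _ in range(runs):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=train_frac, random_state=rng)
            clf = make_classifier().fit(X_tr, y_tr)
            accuracies.append(clf.score(X_te, y_te))
        return np.mean(accuracies), np.std(accuracies, ddof=1)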
Table 1 provides details of the training set sizes used in this work
(captioned "Madden") and the work of Friedman et al. ("FGG") and/or
Cheng & Greiner ("CG"). In this work, all datasets were preprocessed in the
same way as those authors did: where datasets had continuous variables, they
were discretized using the discretization utility in MLC++ [14] with its
default entropy-based setting [9], and any cases with missing values were
removed from datasets, except for the Mushroom dataset where they were
assigned the value "unknown" (as done by Cheng & Greiner).
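A sketch of this preprocessing, using pandas, is given below. The function name and arguments are illustrative, and the discretize callable stands in for MLC++'s default entropy-based discretization utility [14, 9].

    import pandas as pd

    def preprocess(df: pd.DataFrame, dataset_name, continuous_cols, class_col,
                   discretize):
        if dataset_name == "Mushroom":
            # Mushroom: missing values are assigned the explicit value
            # "unknown" (as done by Cheng & Greiner).
            df = df.fillna("unknown")
        else:
            # All other datasets: remove any case with a missing value.
            df = df.dropna()
        # Discretize continuous variables; `discretize` is assumed to
        # implement an entropy-based (Fayyad & Irani style) method.
        for col in continuous_cols:
            df[col] = discretize(df[col], df[class_col])
        return df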
Table 1: Data sets, training set sizes and number of runs per training set

 No  Dataset          Size    Training Size/Runs
                              Madden         FGG/CG
  1  Breast Cancer     699    80% x10        80% CV5
  2  Cleve             296    80% x20        80% CV5
  3  Diabetes          768    80% x20        80% CV5
  4  German           1000    80% x20        80% CV5
  5  Glass             214    80% x20        80% CV5
  6  Glass2            163    80% x50        80% CV5
  7  Iris              150    80% x50        80% CV5
  8  Lymphography      148    80% x50        80% CV5
  9  Segment          2310    66.7% x10      66.7% x1
 10  Soybean-Large     562    80% x20        80% CV5
 11  Vehicle           846    80% x20        80% CV5
 12  Waveform-21      5000    6% x10         6% x1
 13  Chess            3196    66.7% x10      66.7% x1
 14  Flare            1066    80% x20        80% CV5
 15  Vote              435    80% x20        80% CV5
 16  DNA              3190    66.7% x10      62.8% x1
 17  Nursery         12960    66.7% x10      66.7% x1
 18  Mushroom         8124    66.7% x10      66.7% x1
Each dataset was analysed using our implementation of the Naive Bayes,
TAN and General BN algorithms, as described in Section 2. The results
are summarised in Table 2. For each dataset and each algorithm, this table
lists the mean classification accuracy and its standard deviation. Table 2 also
lists the classification accuracies published by Friedman et al. [10] and Cheng
& Greiner [3] for these datasets. The values listed for Friedman et al.'s Naive Bayes and TAN are those obtained with parameter smoothing (discussed further in Section 4.2). Where those authors used only one
training/testing set, they estimated the standard deviation of the accuracy
using a theoretical formula provided in MLC++ [14].
Notably, our results show that the performance of the GBN constructed
using K2 is good relative to NB. Figure 2(a) is a scatter-plot of our
experimental results for accuracies of both classifiers, plotted relative to each
other. In this and subsequent graphs, "A vs B" indicates that A is plotted on
the horizontal axis and B is plotted on the vertical axis. Visually, points
above the diagonal are those where classifier B has higher accuracy. For each
dataset, a two-tailed paired t-test on the accuracy values for each run was
performed. This indicated that there was no significant difference in
performance on four of the datasets (numbers 1, 6, 7, 8), NB was better on
two (2, 12) and GBN was better on the remaining twelve. This is clearly at
variance with the experimental results of Friedman et al., who compared GBN
and NB on 25 datasets and reported that GBN was significantly better on 6
and significantly worse on 6. (Five of those datasets are included in this
study.) Figure 2(b) compares the NB and GBN classifiers' performance as
reported by Friedman et al. Our results are more similar to those of Cheng &
Greiner: for the eight datasets they analysed, in one case NB performed
substantially better than GBN, in four both performed similarly well and in
three cases GBN performed better.
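The per-dataset significance test described above can be carried out with SciPy's paired t-test; a minimal sketch follows, with illustrative names (the per-run accuracy lists of the two classifiers on the same sequence of train/test splits).

    from scipy.stats import ttest_rel

    def compare(acc_nb, acc_gbn, alpha=0.05):
        """Two-tailed paired t-test over per-run accuracies of NB and GBN."""
        t_stat, p_value = ttest_rel(acc_nb, acc_gbn)   # two-tailed by default
        if p_value >= alpha:
            return "no significant difference"
        return "GBN better" if sum(acc_gbn) > sum(acc_nb) else "NB better"

    # e.g. compare([0.91, 0.93, 0.92], [0.95, 0.96, 0.94])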
Figure 2: (a) Relative accuracies of NB and GBN classifiers from our experiments; (b) relative accuracies of NB and GBN classifiers as reported by FGG. (The last paragraph of Sec. 3 explains the layout of these and subsequent plots.)
Figure 3(a), which is based on some of the data in Table 2, plots the relative
classification accuracies achieved using general Bayesian network classifiers by
ourselves, Friedman et al. and Cheng & Greiner. A total of 24 pairwise
comparisons can be made between these results. In 9 of these, there are
substantial differences between results (Datasets 5, 8, 10, 11, 12, 13, 16, 17
and 18), based on a non-paired t-test. Six of these are disparities between our
results and those of Friedman et al. and the other three are disparities
between our results and those of Cheng & Greiner. While differences in
methodology (e.g. different sub-sampling of test sets) could cause disparities,
they would not explain the skewness evident in the results of Figure 3(a).
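Since the other authors publish only mean accuracies and standard deviations (with the number of runs known from Table 1), pairwise comparisons of this kind can be made with a two-sample t-test computed from summary statistics. A minimal sketch using SciPy follows; the numbers are made up, and Welch's unequal-variance variant is an assumption rather than necessarily the exact test used here.

    from scipy.stats import ttest_ind_from_stats

    t_stat, p_value = ttest_ind_from_stats(
        mean1=92.1, std1=1.4, nobs1=10,   # e.g. our result over 10 runs
        mean2=88.3, std2=2.0, nobs2=10,   # e.g. a published result, 10 runs
        equal_var=False)                  # Welch's unequal-variance variant
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")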
Figure 3: Accuracy comparisons between authors: (a) GBN results (series: FGG vs MM, CG vs MM, FGG vs CG); (b) Naive Bayes results (series: FGG vs MM, CG vs MM, FGG vs CG).
The evaluation methodology used both by Friedman et al. and by Cheng &
Greiner is to divide larger datasets into 2/3 for training and 1/3 for testing,
and to use 5-fold cross-validation for smaller datasets. The justification for the
methodology used in this paper is:
- Smaller datasets are divided into 4/5 for training and 1/5 for testing, and
results averaged over 20 (or more) runs; this is consistent with the 80%
training set sizes used by the other authors;
- Larger datasets are divided into 2/3 for training and 1/3 for testing, and
averaged over 10 runs; this is consistent with the approach of the other
authors, while minimizing the risk of unrepresentative training sets and
allowing us to calculate sample standard deviations.
Accordingly, we assume that differences in methodology should not be a
significant factor in discrepancies between results. This assumption is
supported by the observation that there is good correspondence between our
results and the other authors' for NB (Figure 3(b)) and TAN (Figure 4(a)).
Our method for estimating parameters is taken from Cooper & Herskovits [7].
Let θijk denote the conditional probability that a variable xi in BS has the
value vik, for some k from 1 to ri, given that the parents of xi, represented by
πi, are instantiated as wij. Then θijk = P(xi = vik | πi = wij) is termed a network
conditional probability. Its value is estimated as:

    θijk = (Nijk + 1) / (Nij + ri)        (6)

where Nijk is the number of cases in the training data in which xi takes the value
vik while πi is instantiated as wij, and Nij = Σk Nijk.
This is the m-estimate of probability with uniform priors and with m equal
to the number of different values that may be assigned to xi [16, Ch. 6].
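A minimal sketch of the estimate in Eq. (6), computed from raw counts in the training data, is given below. Here data is a list of dicts mapping variable names to values; the function name and arguments are illustrative.

    from collections import Counter

    def estimate_theta(data, child, child_values, parent_config):
        """Return the m-estimate of P(child = v | parents = parent_config)
        for each value v of the child, per Eq. (6): (Nijk + 1) / (Nij + ri)."""
        r_i = len(child_values)                  # number of values of x_i
        matching = [case for case in data        # cases with parents = w_ij
                    if all(case[p] == w for p, w in parent_config.items())]
        n_ij = len(matching)                     # N_ij
        counts = Counter(case[child] for case in matching)  # N_ijk per value
        return {v: (counts[v] + 1) / (n_ij + r_i) for v in child_values}

    # e.g. estimate_theta(cases, "x3", ["low", "high"], {"xC": "yes", "x1": "a"})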
Cheng & Greiner report that they also use the estimation method of
Cooper & Herskovits. Friedman et al. consider two methods: simple frequency
counts and a more sophisticated technique they term parameter smoothing.
They present results for NB and TAN with and without parameter
smoothing, but GBN results for networks without parameter smoothing only.
Because NB has a fixed structure, comparison of NB results should indicate
differences arising from the combined effects of evaluation methodology and
parameter estimation. Figure 3(b) compares the different authors' NB results,
taken from Table 2 (i.e. using the results of Friedman et al. for Smoothed
NB). The only anomalous one is our result for Dataset 14 (arrowed),
which differs from the results both of Friedman et al. and of Cheng &
Greiner. The relatively high level of agreement between these results provides
evidence that experimental methodology does not account for the
discrepancies between GBN results. It also provides some evidence that
parameter estimation is not a source of divergence. However, Friedman et al.
note that sophisticated parameter estimation is less important in NB than in
classifiers with more parameters, as there should be sufficient empirical data
available for frequency counts to be accurate. Accordingly, we must look at
relative performance of more sophisticated classifiers to evaluate this.
Figure 4(a) compares the different authors' TAN results; again, the results
for Friedman et al. are for Smoothed TAN. This figure shows extremely
strong agreement between results, providing stronger evidence (given the
comments in the preceding paragraph) that parameter estimation does not
account for the observed divergence in GBN results. Of the 24 pairwise
comparisons plotted, only one (Dataset 17) shows a significant difference and
its magnitude is relatively small.
Figure 4(b), on the other hand, does not show such strong agreement in
results. This figure plots results for Friedman et al.'s Unsmoothed TAN
against our TAN results and those of Cheng & Greiner. In the 18 pairs of
results plotted in Figure 4(b), there are 7 with substantial differences,
spanning six datasets (2, 8, 9, 10, 12 and 15; Dataset 15 accounts for two of
the seven, as the result of Friedman et al. differs both from our result and
from that of Cheng & Greiner). This indicates
the importance of parameter smoothing, as explained by Friedman et al.
Considered in conjunction with Figure 4(a), it also indicates that the
probability estimate of Eq. (6) achieves the same effect as the parameter
smoothing used by Friedman et al., at least for TAN structures. (It is possible
that for structures with even more parameters, such as GBN structures,
parameter smoothing may provide additional benefit.)
Figure 4: (a) TAN accuracy comparisons between authors (series: FGG vs MM, CG vs MM, FGG vs CG); (b) the Unsmoothed TAN of FGG compared against the TAN results of MM and CG (series: FGG vs MM, FGG vs CG).
The methods for structure search used by the different authors, as described
already in Section 2, are superficially quite different. Unlike the K2 procedure
we use and the approach of Friedman et al., the procedure of Cheng &
Greiner comes with guarantees: given a dataset that is large enough and has a
DAG-isomorphic probability distribution, their CBL1 algorithm is guaranteed
to generate the perfect map [17] of the underlying dependency model.
Node ordering can of course affect heuristic search, particularly for K2 and
CBL1, as both restrict a node's candidate parents to those that precede it in
the ordering. In each dataset we use the feature ordering as given, except that
we place the classification node first so that it can potentially be a parent
of any other node; Cheng & Greiner do the same. Details of the node
orderings used by Friedman et al. are not published.
Given that Friedman et al. do not report results for GBN with parameter
smoothing, and that their results for TAN without parameter smoothing do
not agree well with our TAN results or those of Cheng & Greiner, it might
initially appear that the disparity of results illustrated in Figure 2(a) could be
explained by differences in parameter estimation. However, three of the
anomalous results are in comparisons between our classifiers and those of
Cheng & Greiner (Datasets 16, 17 and 18), where the same parameter
estimation is used. The other six anomalous results are in comparisons
between K2 and the GBN of Friedman et al. (Datasets 5, 8, 10, 11, 12, 13).
Of those datasets, there was good agreement on Datasets 5, 11 and 13
between our TAN classifier and Friedman et al.'s Unsmoothed TAN classifier,
where structure search is not a factor. Accordingly, we conclude that
differences in structure search do indeed appear to account for some of the
disparity in results.
As an aside, we note that in the three cases where our GBN results are
different from the GBN results of Cheng & Greiner, our results are better. It
appears that K2 is a serendipitous algorithm for use as a classifier, if the
classification node is placed first in the node ordering provided to it (as is
typically done). In this case, the classification node is the first candidate
considered when selecting parents for all other nodes. The result is effectively
a selective BN augmented Naive Bayes structure.
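To make this concrete, the following sketch of K2-style parent selection shows why that ordering has this effect. Here g(node, parents) is a hypothetical stand-in for the Bayesian (K2) scoring function, and the code illustrates the greedy scheme rather than our exact implementation.

    def k2_parents(ordering, g, max_parents):
        """ordering: nodes with the class node first; returns a parent map."""
        parents = {v: [] for v in ordering}
        for i, node in enumerate(ordering):
            old = g(node, parents[node])
            while len(parents[node]) < max_parents:
                # candidate parents must precede the node in the ordering
                candidates = [u for u in ordering[:i] if u not in parents[node]]
                if not candidates:
                    break
                best = max(candidates,
                           key=lambda u: g(node, parents[node] + [u]))
                new = g(node, parents[node] + [best])
                if new <= old:       # no candidate improves the score: stop
                    break
                parents[node].append(best)
                old = new
        return parents

    # e.g. k2_parents(["xC", "x1", "x2", "x3", "x4"], g, max_parents=2):
    # xC is the first candidate parent considered for every other node, so the
    # learned structure tends toward a selective augmented Naive Bayes.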
5 Conclusions
This paper has presented experimental results applying Naive Bayes (NB),
Tree Augmented Naive Bayes (TAN) and general Bayesian network (GBN)
algorithms to classification of 18 well-known datasets. The algorithm used to
construct GBNs was K2 [7]. The TAN algorithm also used the Bayesian
scoring function from K2. There are significant disparities between these
experimental results and the results that have been previously presented by
other authors. Friedman et al. [10] evaluated the classification performance of
a GBN algorithm based on the MDL score using 25 standard datasets and
concluded that, on average, it did not out-perform NB. Conversely, Cheng &
Greiner [3] evaluated a GBN algorithm based on conditional independence
testing on eight standard datasets and presented results showing it
outperformed NB more often than not.
Given that the Bayesian score is known to be asymptotically equivalent to
the MDL score [12] and that using conditional independence tests as a basis
for structure search is equivalent to search based on locally maximising a
scoring metric [8], this paper has sought to identify the source of the
disparity. Our conclusion is that the principal causes of differences are the
formulae used for parameter estimation and the structure search procedures
followed. Additionally, our results provide experimental support for the
analysis of Cowell [8]; in particular, our comparison of TAN classifiers (Figure
4(a)) demonstrates that very similar classification performance can be
achieved by classifiers constructed using different approaches, when heuristic
search is not required.
Finally, it must be emphasised that it is not the objective of this paper to
promote the use of general Bayesian networks as classifiers; in their papers
from which the NB, TAN and GBN results of Table 2 are taken, Friedman et
al. and Cheng & Greiner propose and analyse alternative structures based on
Bayesian networks that outperform the algorithms discussed in this paper.
More recently, Keogh and Pazzani [13] propose an algorithm for constructing
TAN-type classifiers using classification accuracy rather than maximum-
likelihood scores. Nonetheless, the results presented here challenge the claim,
quite often repeated in the literature on Bayesian network classifiers, that
general Bayesian networks are not in general better for classification tasks
than Naive Bayes.
References
1. Blake, C.L. and Merz, C.J. (1998). UCI Repository of machine learning databases:
http://www.ics.uci.edu/~mlearn/MLRepository.html. University of California at
Irvine, Department of Information and Computer Science.
2. Cheng, J., Bell, D.A. and Liu, W. (1997). An Algorithm for Bayesian Belief
Network Construction from Data. Proc. 6th International Workshop on Artificial
Intelligence and Statistics.
3. Cheng, J. and Greiner, R. (1999). Comparing Bayesian Network Classifiers. Proc.
15th International Conference on Uncertainty in Artificial Intelligence.
4. Cheng, J. and Greiner, R. (2001). Learning Bayesian Belief Network Classifiers:
Algorithms and System. Proc. 14th Canadian Conference on Artificial Intelligence.
5. Cheng, J., Greiner, R., Kelly, J., Bell, D. and Liu, W. (2002). Learning Belief
Networks from Data: An Information Theory Based Approach. Artificial
Intelligence, Vol. 137, pp 43-90.
6. Chickering, D.M. (2002). Optimal Structure Identification with Greedy Search.
Journal of Machine Learning Research, Vol. 3, pp 507-554.
7. Cooper, G.F. and Herskovits, E. (1992). A Bayesian Method for the Induction of
Probabilistic Networks from Data. Machine Learning, Vol. 9, pp 309-347. Kluwer
Academic Publishers, Boston.
8. Cowell, R.G. (2001). Conditions Under Which Conditional Independence and
Scoring Methods Lead to Identical Selection of Bayesian Network Models. Proc.
17th International Conference on Uncertainty in Artificial Intelligence.
9. Dougherty, J., Kohavi, R. and Sahami, M. (1995). Supervised and Unsupervised
Discretization of Continuous Features. Proc. 12th International Conference on
Machine Learning.
10. Friedman, N., Geiger, D. and Goldszmidt, M. (1997). Bayesian Network Classifiers.
Machine Learning, Vol. 29, pp 131-163. Kluwer Academic Publishers, Boston.
11. Friedman, N. and Goldszmidt, M. (1996). Building Classifiers Using Bayesian
Networks. Proc. 13th National Conference on Artificial Intelligence. Vol. 2, pp
1277-1284.
12. Heckerman, D. (1996). A Tutorial on Learning with Bayesian Networks. Technical
Report MSR-TR-95-06, Microsoft Corporation, Redmond.
13. Keogh, E. and Pazzani, M.J. (2002). Learning the Structure of Augmented
Bayesian Classifiers. International Journal on Artificial Intelligence Tools, Vol. 11,
No. 4, pp 587-601.
14. Kohavi, R., Sommerfield, D. and Dougherty, J. (1997). Data Mining using
MLC++. International Journal on Artificial Intelligence Tools, Vol. 6, No. 4, pp
537-566.
15. Meek, C. (1997). Graphical Models: Selecting Causal and Statistical Models. Ph.D.
Thesis, Carnegie Mellon University.
16. Mitchell, T.M. (1997). Machine Learning. McGraw-Hill, New York.
17. Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of
Plausible Inference. Morgan Kaufmann, San Francisco.