IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS, VOL. 32, NO. 1, FEBRUARY 2002 49
level; only at the class level. There is a desired number of cells per class, but the network may achieve that count in a variety of ways. The testing results show that, although yielding lower classification rates, the network trained to minimize SSCE performs better at counting than a classification network with the same structure, even though both are trained for a comparable number of iterations.

This result is consistent with other results we have achieved in our experiments in the field of handwriting recognition. In particular, we have shown repeatedly that, in handwritten word recognition, high character recognition rates are not good indicators of high word recognition rates in standard, lexicon-driven handwritten word recognition systems [11], [12]. This led us to investigate the issue of word level training, in which the character level NNs are trained using word level objective functions. In that case, as in the case discussed in this correspondence, the results indicate that system level training is more appropriate for training NNs to perform as system components.

We point out that the objective functions and design methodologies in the blood cell problem and the handwriting problem are quite different, although the guiding principles are the same. The principle is that system level objective functions lead to recognition systems that perform the ultimate task better than intermediate level objective functions. These results can be interpreted as special cases of the principle of least commitment of Marr [13], which states that decisions should be delayed as long as possible. In this case, the decision concerns setting the desired outputs of NNs. By training them to count rather than classify, we allow more flexibility in producing outputs for each particular input sample, as long as the sum of the outputs over the entire training set is close to the desired count.

In Section II, we describe the data. In Section III, we define the counting objective function and discuss training. We describe experiments and evaluation measures in Section IV. In Section V, we provide results, and in Section VI we conclude.

II. DATA DESCRIPTION

The bone marrow images used in the experiments were collected at the University of Missouri Ellis-Fischel Cancer Center. Multicell images were captured from slides by an Olympus BX50 microscope, a B/W CCD camera, and a digitizer (8 bits/pixel, PDI IMAXX) at 600× magnification. Individual cells were detected and cropped manually to form single-cell images. Each single-cell image was classified by Dr. C. William Caldwell, Professor of Pathology and Director of the Pathology Laboratory at the center.

The data set contains 526 single-cell images of 33 Myeloblasts, 61 Promyelocytes, 77 Myelocytes, 93 Metamyelocytes, 128 Bands, and 134 PMNs. Cell segmentation is still an area of research. To decouple the effects of segmentation errors from the recognition issues to be compared (class-coded versus minimum count error objective functions), each single-cell image was segmented manually into three regions—nucleus, cytoplasm, and background—with gray scales of 0, 176, and 255, respectively.

Ten features are extracted from each single-cell image. These are the best features found in [7]. There are six features extracted from and Fourier descriptors 3, 12, and 15. The remaining features are texture features, namely, light number of patches in nucleus, energy of cy-

achieve minimum classification error (albeit indirectly). We propose a new training scheme running in the batch mode to achieve the minimum counting error.

We need to establish notation. Let x_n = [x_{1,n}, x_{2,n}, ..., x_{P,n}] denote an input feature vector, where P is the number of features, and let Q denote the number of classes. Let the training set be {x_n}, n = 1, 2, ..., N. We assume a standard feed-forward classification network with one output node for each class. If x_n is an input to the network, we let o_{q,n} denote the output value of the q-th output node. Let the number of cells assigned to the q-th class by an expert be c_exp,q. We define the sigma count of the q-th class to be

    c_sigma,q = Σ_{n=1}^{N} o_{q,n},   q = 1, 2, ..., Q.   (1)

Fig. 2. Notation used to derive training rule.

The sigma count is a key notion in this approach. If the outputs are all 0 and 1, then the sigma count for a given class is just the number of cells that are assigned an output value of 1 for that class. If we set the outputs from a given input sample x_n to 0 or 1 by the rule

    c_crisp,q,n = 1, if o_{q,n} ≥ o_{k,n} for k = 1, ..., Q;  0, else   (2)

then the quantity

    c_crisp,q = Σ_{n=1}^{N} c_crisp,q,n,   q = 1, 2, ..., Q   (3)

is called the crisp count.

With the sigma count, we do not require that the outputs are all 0 and 1. Therefore, each cell can contribute partially to each class. For example, a cell that is an "old" Promyelocyte and a "young" Myelocyte may have an output of 0.5 for both classes, thereby reflecting the continuous aging of cells more accurately.

The SSCE objective function is defined to be the sum of the squares of the differences between the sigma counts and the expert counts. There are no cell level desired outputs:

    E = (1/2) Σ_{q=1}^{Q} ( Σ_{n=1}^{N} o_{q,n} − c_exp,q )²
      = (1/2) Σ_{q=1}^{Q} [ c_sigma,q − c_exp,q ]².   (4)
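As a concrete illustration of (1)-(3), the sigma and crisp counts can be computed directly from a matrix of network outputs. The following is a minimal numpy sketch (ours, not the authors' code), assuming the outputs are stored as an N x Q array:

```python
import numpy as np

def sigma_counts(O):
    """Sigma counts, Eq. (1): sum the q-th output over all N samples."""
    return O.sum(axis=0)

def crisp_counts(O):
    """Crisp counts, Eqs. (2)-(3): each sample contributes a count of 1
    to the class with the maximum output and 0 to every other class."""
    C = np.zeros_like(O)
    C[np.arange(O.shape[0]), O.argmax(axis=1)] = 1.0
    return C.sum(axis=0)

# Three cells, two classes; the second cell is ambiguous (0.5 / 0.5),
# so it contributes half a cell to each class's sigma count.
O = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.2, 0.8]])
```

Here sigma_counts(O) gives [1.6, 1.4], preserving the ambiguous cell's partial membership, while crisp_counts(O) gives [2, 1], forcing that cell entirely into one class.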
    ∂E/∂w_ji = Σ_{n=1}^{N} [ c_sigma,q − c_exp,q ] φ'_j(v_{j,n}) y_{i,n}.   (10)

Consider the j-th neuron in the hidden layer

    ∂o_{q,n}/∂w_ji = (∂o_{q,n}/∂y_{j,n}) (∂y_{j,n}/∂v_{j,n}) (∂v_{j,n}/∂w_ji)   (11)

    ∂o_{q,n}/∂y_{j,n} = w_qj   (12)

    ∂y_{j,n}/∂v_{j,n} = φ'_j(v_{j,n})   (13)

    ∂v_{j,n}/∂w_ji = y_{i,n}.   (14)

Substituting (12)-(14) into (11) yields

    ∂o_{q,n}/∂w_ji = w_qj φ'_j(v_{j,n}) y_{i,n}.   (15)

Substituting (15) into (5) yields

    ∂E/∂w_ji = Σ_{q=1}^{Q} Σ_{n=1}^{N} w_qj [ c_sigma,q − c_exp,q ] φ'_j(v_{j,n}) y_{i,n}.   (16)

Thus, the update equation for the weights at the (m+1)-th iteration is

    w_ji(m+1) = w_ji(m) − η ∂E(m)/∂w_ji(m)   (17)

where ∂E/∂w_ji is calculated from (10) or (16) for the output layer or the hidden layer, respectively, and η is a learning rate.

IV. EXPERIMENTAL FRAMEWORK

A. Evaluation Measures

The counting rate is used as an evaluation measure in the experiments. The counting rate is defined by

    Counting rate = ( 1 − Σ_{q=1}^{6} |c_exp,q − c_alg,q| / Σ_{q=1}^{6} |c_exp,q| ) × 100%   (18)

where c_exp,q and c_alg,q are the expert's and the algorithm's counts of the number of cells in class q, respectively. The notation c_alg,q can refer to either c_crisp,q or c_sigma,q. We also measure the average classification rate. The following abbreviations are used:

    cr  classification rate;
    cc  crisp count rate;
    sc  sigma count rate;
    Cl  classification NN;
    Co  counting NN;
    Tr  training set;
    Te  test set.

Generally, when n-fold cross validation is applied to a classification problem, we sum the numbers of correct classifications divided by the total number of data points over n folds as the overall classification rate. However, this method should not be applied to the counting performance measure. If the summation of count estimates in each class over n folds is used, then an underestimated count in one fold can compensate for an overestimated count in another fold. The compensation yields an artificially high overall counting rate, while, in fact, both underestimation and overestimation yield lower counting rates in both folds. For example, if in twofold cross validation we overestimate by ten in one fold and underestimate by ten in the other fold, then the method of summing the errors would produce an error of zero, which is obviously incorrect. Therefore, the counting rate in each fold is calculated first, and then the average counting rate over n folds is used as the overall counting rate. This method is more pessimistic, but more accurate, and is the method that we use.

For clarity, we provide an example. Consider a fourfold cross validation experiment in which the actual counts for each class are all 100. Assume that the count estimates are as shown in Table I. The correct counts for each class in this case should all be 25.

If we sum the estimates over the folds, then the estimated numbers of cells per class would be 100, 90, 100, 110, 95, and 105, respectively, which looks quite accurate. Indeed, this yields a counting rate of (1 − 30/600) × 100% = 95.0%, which is misleading, since the counts are inaccurate. However, by computing the error over each fold first and averaging those, we achieve a counting rate of (80.0 + 80.0 + 66.7 + 66.7)/4 = 73.4%, which is more accurate.

B. Algorithm Descriptions

We used fourfold cross validation in the experiments because formal training and test sets are not available for this data set. More specifically, we randomly divided the data in each class into four groups with approximately equal numbers of data points. For example, the 33 Myeloblasts are divided into four groups of 8, 8, 8, and 9 data points.
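The counting rate of (18), and the reason it is averaged over folds rather than pooled, can be illustrated with a small sketch (ours, using hypothetical twofold numbers that echo the over-by-ten/under-by-ten example above):

```python
import numpy as np

def counting_rate(c_exp, c_alg):
    """Counting rate of Eq. (18), in percent."""
    c_exp = np.asarray(c_exp, dtype=float)
    c_alg = np.asarray(c_alg, dtype=float)
    return (1.0 - np.abs(c_exp - c_alg).sum() / np.abs(c_exp).sum()) * 100.0

# Hypothetical twofold example: the true count is 50 cells per fold;
# one fold overestimates by ten, the other underestimates by ten.
exp_folds = [np.array([50.0]), np.array([50.0])]
alg_folds = [np.array([60.0]), np.array([40.0])]

# Pooling the counts over folds lets the two errors cancel ...
pooled = counting_rate(sum(exp_folds), sum(alg_folds))
# ... while the fold-wise average reports the real error.
foldwise = np.mean([counting_rate(e, a)
                    for e, a in zip(exp_folds, alg_folds)])
```

Pooling yields a 100% counting rate even though both folds are wrong, whereas the fold-wise average yields 80%, the more pessimistic but more accurate figure.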
In each fold, data points in three groups (about 75% of the entire data) are used as a training set, and the data points in the remaining group (about 25% of the entire data) are used as a test set. Hence, we have four folds of data. The training and test sets in each fold are independent. Moreover, the experiment using the data in each fold is done independently. Hence, cross validation is used here for separating the data set into several groups of training and test sets, not for avoiding overfitting [15].

The counting and classification percentages of all four folds were averaged at the end of the cross validation. To remove the effects of network initialization, we performed the cross validation ten times and averaged the counting and classification percentages at the end. The counting network should not be trained from initial random weights because, without any prior information, it can easily achieve the correct counts by producing a constant output of c_exp,q / N for each class, which is what it does. Therefore, a classification network is trained as the initialization for both the classification and counting networks, as indicated in the following algorithm:
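As a rough sketch of such a two-phase scheme (not the original listing; the layer sizes, epoch counts, and learning rate are illustrative assumptions), a network could first be pretrained with class-coded targets and then switched to the batch-mode SSCE gradient of (10)-(17):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, W2):
    H = sigmoid(X @ W1)      # hidden activations y_{j,n}
    O = sigmoid(H @ W2)      # outputs o_{q,n}
    return H, O

def train(X, T, c_exp, sizes=(10, 12, 6), lr=0.01,
          init_epochs=10, count_epochs=200, seed=0):
    """Pretrain with class-coded targets T (N x Q one-hot), then switch
    to batch-mode minimization of the SSCE objective of Eq. (4)."""
    rng = np.random.default_rng(seed)
    P, J, Q = sizes
    W1 = rng.normal(scale=0.1, size=(P, J))
    W2 = rng.normal(scale=0.1, size=(J, Q))
    # Phase 1: ordinary classification training (the initialization).
    for _ in range(init_epochs):
        H, O = forward(X, W1, W2)
        dO = (O - T) * O * (1 - O)       # squared-error output delta
        dH = (dO @ W2.T) * H * (1 - H)
        W2 -= lr * (H.T @ dO)
        W1 -= lr * (X.T @ dH)
    # Phase 2: counting training.  The count error [c_sigma,q - c_exp,q]
    # is shared by every sample of the batch, cf. Eqs. (10) and (16).
    for _ in range(count_epochs):
        H, O = forward(X, W1, W2)
        err = O.sum(axis=0) - c_exp      # c_sigma,q - c_exp,q
        dO = err[None, :] * O * (1 - O)
        dH = (dO @ W2.T) * H * (1 - H)
        W2 -= lr * (H.T @ dO)
        W1 -= lr * (X.T @ dH)
    return W1, W2
```

Note that the counting phase must run in batch mode: the error term depends on the sum of outputs over the whole training set, so per-sample updates cannot evaluate it.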
TABLE III
SAMPLE CONFUSION MATRIX FROM COUNTING NETWORK ON TRAINING SET
TABLE IV
SAMPLE CONFUSION MATRIX FROM CLASSIFICATION NETWORK ON TEST SET
TABLE V
SAMPLE CONFUSION MATRIX FROM COUNTING NETWORK ON TEST SET
the actual class, and 0s at the other output nodes. In order to remove the effects of different initializations, each final rate is averaged over four folds and over ten experiments. We believe that this approach strengthens the conclusions drawn from the experiments comparing the networks' performances [19].

Figs. 3 and 4 depict the plots of overall classification performance on the training and test sets, respectively. It is worth noting that we cannot show all confusion matrices here because there is one confusion matrix for each cross validation, which leads to a total of 360 confusion matrices for training and testing the classification and counting networks on ten independent experiments and nine different initial epochs. Tables II-V are samples of confusion matrices from an experiment with ten initial epochs in fourfold cross validation. Each element in a confusion matrix is the sum of four elements from four folds at that location. Therefore, if we sum the elements in a confusion matrix of the test set along a row, we will have the total number of cells in that class. If we do so in a confusion matrix of the training set, the result will be triple the total number of cells in that class.

Fig. 7. Average (over 40 networks) crisp counting rates on training set.
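The row-sum property just described follows from each cell appearing in exactly one test fold and three training folds. A toy check (hypothetical class sizes and perfect-classification diagonal matrices, chosen only to keep the example simple):

```python
import numpy as np

# Hypothetical: two classes with 8 and 4 cells, fourfold cross validation.
totals = np.array([8, 4])
# In each fold, 1/4 of each class is tested and 3/4 is trained on.
test_cms = [np.diag(totals // 4) for _ in range(4)]
train_cms = [np.diag(3 * totals // 4) for _ in range(4)]

test_sum = sum(test_cms)    # summed test-set confusion matrix
train_sum = sum(train_cms)  # summed training-set confusion matrix
```

The row sums of test_sum equal the class totals, while the row sums of train_sum equal three times the class totals, matching the description above.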
The plots shown in Figs. 3 and 4 should provide enough information on the overall classification performances of both networks. The plots of overall crisp and sigma counting performances of both sets of networks are shown in Figs. 5-8.

Fig. 8. Average (over 40 networks) sigma counting rates on test set.

VI. DISCUSSION AND CONCLUSION

As we expected, Fig. 3 shows that the overall classification performance of the classification network is clearly better than that of the counting network, especially on the training set. From Fig. 4, on the test set, the classification network has a better overall classification performance when the number of epochs used in the network initialization is small; they have comparable performances when the number of epochs is large.

As shown in Figs. 5-8, the counting network has better counting performance for both the crisp and sigma counts, and on both the training and test sets. The crisp count performed better than the sigma count. These results were achieved using extensive experimentation.

The reason that the crisp count yields better counting performance than the sigma count is that, for each input vector, a network produces a count of one for the class with the maximum output. Therefore, the total number of cells from the crisp count and the total number of true counts are identical. On the other hand, the total number of cells from the sigma count is generally different from the total of the true counts. Hence, the sigma count is more likely to yield a worse counting rate than the crisp count.

In this particular problem, the main goal is to achieve accurate counts. Classification is only an indirect tool to achieve that goal. In this correspondence, we have shown that forming the objective function in terms of the main, system-level goal can increase the overall performance of NNs as modules in overall systems. This is a consistent and important theme in the design and implementation of recognition system applications.

ACKNOWLEDGMENT

The authors would like to thank the reviewers for their useful remarks.

REFERENCES

[1] L. W. Diggs, D. Sturm, and A. Bell, The Morphology of Human Blood Cells. Abbott Park, IL: Abbott Laboratories, 1985.
[2] V. Minnich, Immature Cells in the Granulocytic, Monocytic, and Lymphocytic Series. Chicago, IL: Amer. Soc. Clin. Pathol., 1982.
[3] M. Beksaç, M. S. Beksaç, V. B. Tipi, H. A. Duru, M. U. Karakas, and A. Nur Çakar, "An artificial intelligent diagnostic system on differential recognition of hematopoietic cells from microscopic images," Cytometry, vol. 30, pp. 145-150, 1997.
[4] H. Harms, H. Aus, M. Haucke, and U. Gunzer, "Segmentation of stained blood cell images measured at high scanning density with high magnification and high numerical aperture optics," Cytometry, vol. 7, pp. 522-531, 1986.
[5] J. Park and J. Keller, "Fuzzy patch label relaxation in bone marrow cell segmentation," in Proc. IEEE Int. Conf. Syst., Man, Cybern., Orlando, FL, 1997, pp. 1133-1138.
[6] S. S. S. Poon, R. K. Ward, and B. Palcic, "Automated image detection and segmentation in blood smears," Cytometry, vol. 13, pp. 766-774, 1992.
[7] S. Sohn, "Bone marrow white blood cell classification," M.S. thesis, Univ. Missouri, Columbia, 1999.
[8] N. Theera-Umpon, "Morphological granulometric estimation with random primitives and applications to blood cell counting," Ph.D. dissertation, Univ. Missouri, Columbia, 2000.
[9] N. Theera-Umpon and P. D. Gader, "Counting white blood cells using morphological granulometries," J. Electron. Imag., vol. 9, no. 2, pp. 170-177, 2000.
[10] N. Theera-Umpon, E. R. Dougherty, and P. D. Gader, "Non-homothetic granulometric mixing theory with application to blood cell counting," Pattern Recognit., vol. 34, no. 12, pp. 2547-2560, 2001.
[11] J. Chiang and P. D. Gader, "Hybrid fuzzy-neural systems in handwritten word recognition," IEEE Trans. Fuzzy Syst., vol. 5, pp. 497-510, Nov. 1997.
[12] P. D. Gader, J. M. Keller, R. Krishnapuram, J. H. Chiang, and M. Mohamed, "Neural and fuzzy methods in handwriting recognition," IEEE Comput., vol. 30, pp. 79-86, Feb. 1997.
[13] D. Marr, Vision. San Francisco, CA: Freeman, 1982.
[14] S. Haykin, Neural Networks: A Comprehensive Foundation. Englewood Cliffs, NJ: Prentice-Hall, 1999.
[15] N. Morgan and H. Bourlard, "Generalization and parameter estimation in feedforward nets: Some experiments," in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990, pp. 630-637.
[16] H. Demuth and M. Beale, Neural Network Toolbox: For Use With MATLAB. Natick, MA: Mathworks, 1998.
[17] S. Geman, E. Bienenstock, and R. Doursat, "Neural networks and the bias/variance dilemma," Neural Comput., vol. 4, pp. 1-58, 1992.
[18] L. Prechelt, "Early stopping—But when?," in Neural Networks: Tricks of the Trade, G. B. Orr and K.-R. Müller, Eds. Berlin, Germany: Springer-Verlag, 1998, pp. 55-69.
[19] F. Provost, T. Fawcett, and R. Kohavi, "The case against accuracy estimation for comparing induction algorithms," in Proc. 15th Int. Conf. Machine Learning, Madison, WI, July 1998, pp. 445-453.