
Using One-Class and Two-Class SVMs for

Multiclass Image Annotation


King-Shy Goh, Edward Y. Chang, Senior Member, IEEE, and Beitao Li
Abstract—We propose using one-class, two-class, and multiclass SVMs to annotate images for supporting keyword retrieval of
images. Providing automatic annotation requires an accurate mapping of images' low-level perceptual features (e.g., color and texture)
to some high-level semantic labels (e.g., landscape, architecture, and animals). Much work has been performed in this area; however,
there is a lack of ability to assess the quality of annotation. In this paper, we propose a confidence-based dynamic ensemble (CDE),
which employs a three-level classification scheme. At the base level, CDE uses one-class Support Vector Machines (SVMs) to
characterize a confidence factor for ascertaining the correctness of an annotation (or a class prediction) made by a binary SVM
classifier. The confidence factor is then propagated to the multiclass classifiers at subsequent levels. CDE uses the confidence factor
to make dynamic adjustments to its member classifiers so as to improve class-prediction accuracy, to accommodate new semantics,
and to assist in the discovery of useful low-level features. Our empirical studies on a large real-world data set demonstrate CDE to be
very effective.

Index Terms—Pattern recognition, models, statistical, artificial intelligence, learning.

1 INTRODUCTION
A typical image-annotation system consists of a training
data set containing annotated images, a set of low-level
features describing the perceptual attributes of the images
[31], and a set of keywords representing the semantic
content of the images. From the training data, an annotation
system learns a classifier and then uses it to predict
semantics for unlabeled images. Our goal is to improve
the effectiveness of image retrieval and organization by
mapping low-level perceptual features (such as color or
texture) to high-level semantics (keywords describing
image content).
Several recent studies have proposed using either a
generative statistical model (such as the Markov model [16])
or a discriminative approach (such as SVMs [13] and Bayes
Point Machines (BPMs) [6]) to learn a classifier for
annotating images. These traditional methods are static in
the sense that once the three components (semantics, low-
level features, and training data) of a classifier have been
determined, there is no effective way to improve them. In
this work, we propose a confidence-based dynamic ensemble (CDE) to improve the classifier by adaptively
improving the semantic set, perceptual features, and
training data. CDE can effectively help determine 1) when
retraining of the classifier is necessary, 2) what new
semantics or training data should be included, and
3) whether new low-level features should be incorporated.
One problem with traditional classifiers is the lack of
ability to assess the quality of their predictions. Typically, a
classifier predicts a semantic for an unlabeled image with an
estimate of the probability of its accuracy. However, the
estimated probability might not correlate well with predic-
tion quality due to factors such as noise or training-data
imbalance [37]. (Section 2 will discuss these problems in
detail.) The core of CDE is a set of robust indicators, which
we call confidence factors, used for asserting the class-prediction
confidence at all levels of its hierarchical classifiers: the binary
classifier (SVM) level, the multiclass ensemble level, and the
bag level.1 As we shall show, the three levels work in concert
to improve class-prediction accuracy.
At the binary level, CDE uses two-class Support Vector
Machines (SVMs) to train a set of binary classifiers, each of
which predicts one semantic. The confidence factor of each
binary prediction is characterized using the algorithm of
one-class SVMs (OC-SVMs) [25], which estimate the
training data density distribution, also known as the support
of the data. Next, the multiclass level will consolidate the
binary predictions to predict one semantic for an image. The
confidence factor at this level is characterized by the margin,
which is the difference between the two highest
confidence factors from the binary level. For example,
suppose the top two binary classifiers, one for the label
tigers and the other for cats, produce confidence factors
of 0.8 and 0.3, respectively. The confidence factor at the
multiclass level will be 0.8 − 0.3 = 0.5. At the top level,
CDE uses the bagging scheme [4] to aggregate predictions
among several sets of multiclass classifiers, each of which
has been trained on a different subset (bag) of the training
data. At this top level, CDE assigns more weight to the bags
with higher confidence factors.
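The margin computation described above can be sketched in a few lines; the label names and confidence values below simply mirror the tigers/cats example and are illustrative only.

```python
# Hypothetical sketch of the multiclass margin described above; the labels
# and CF values mirror the tigers/cats example and are illustrative only.

def multiclass_margin(binary_cfs):
    """Given {label: CF} from the binary level, return the winning label
    and the margin between the two highest confidence factors."""
    ranked = sorted(binary_cfs.items(), key=lambda kv: kv[1], reverse=True)
    (top_label, top_cf), (_, runner_up_cf) = ranked[0], ranked[1]
    return top_label, top_cf - runner_up_cf

winner, margin = multiclass_margin({"tigers": 0.8, "cats": 0.3, "dogs": 0.1})
assert winner == "tigers" and abs(margin - 0.5) < 1e-9
```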
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 17, NO. 10, OCTOBER 2005 1333
. K.-S. Goh can be reached at 2121 Cornell St., Palo Alto, CA 94306.
E-mail: kingshy.goh@gmail.com.
. E.Y. Chang is with the Department of Electrical and Computer
Engineering, University of California, Santa Barbara, CA 93106.
E-mail: echang@ece.ucsb.edu.
. B. Li can be reached at 19 Dickerson Dr., Piscataway, NJ 08854.
E-mail: beitao_li@yahoo.com.
Manuscript received 11 Dec. 2003; revised 10 Nov. 2004; accepted 30 Mar.
2005; published online 18 Aug. 2005.
For information on obtaining reprints of this article, please send e-mail to:
tkde@computer.org, and reference IEEECS Log Number TKDE-0256-1203.
1. CDE uses the bagging scheme [4] at the top level to reduce class-prediction variance. The bagging scheme divides training data into overlapping subsets, each called a bag.
When the confidence in the final class-prediction is high,
CDE assigns a semantic to the query image. Otherwise,
CDE employs two methods to improve the annotation:
. Disambiguating conflicts. When ambiguity exists
between two or more classes (e.g., when the multi-
class confidence is low), CDE narrows down the
candidate classes to a subset of semantics and then
dynamically generates a new set of multiclass
classifiers to improve class-prediction accuracy.
The training data set for the new classifiers contains
only images labeled by that subset of semantics.
. Discovering new semantics and/or new perceptual
features. If the resulting class-prediction confidence
is still low, CDE diagnoses the causes and provides
remedies. There are three potential causes (and,
hence, remedies) of a misprediction: 1) new seman-
tics (remedy: adding new keywords to the set of
semantics), 2) underrepresentative low-level features
(remedy: adding new features to the feature set), and
3) underrepresentative training data (remedy: add-
ing the query image to the training data set).
Our empirical study shows that the confidence-based
approach of CDE can significantly improve class-prediction
accuracy. We will discuss the details in the rest of the
paper, which is organized as follows: Section 2 discusses
related work. Section 3 presents the working mechanism of
our CDE scheme. We depict the core of CDE, a hierarchy
of confidence factors. In particular, we discuss how one-
class SVMs are used for generating confidence factors at the
binary level. In Section 4, we present our empirical results.
We offer our conclusions, along with some ideas for future
work, in Section 5.
2 RELATED WORK
In this section, we first discuss image annotation methods
that have been proposed. We then present related work in
multiclass classification.
2.1 Image Annotation
The methods for extracting semantic information from
images can be divided into two main categories:
1. Text-based methods. The text surrounding images is
analyzed and the system extracts those that appear to
be relevant. Shen et al. [27] explore the context of
Web pages as potential annotations for images in the
same pages. Srihari et al. [29] propose extracting
named entities from the surrounding text to index
images. Benitez and Chang [2] present a method to
extract semantic concepts by disambiguating word
senses with the help of the lexical database WordNet.
In addition, the relationships between keywords can
be extracted using relations established in WordNet.
2. Content-based methods. The word content commonly
refers to the low-level features that describe the visual
aspects of an image, such as color, texture, and shape.
Content-based methods extract semantic information
directly from the low-level features describing the
image. An approach proposed by Chang et al. [7] uses
the Semantic Visual Templates (SVTs), a collection of
regional objects within a video shot, to express the
semantic concept of a user's query. The templates can
be further refined through a two-way interaction
between the user and the system. Wang et al. [34]
propose SIMPLIcity, a system that captures semantics
using the integrated region matching metric. The
semantics are used to classify images into two broad
categories, which are then used to support semantics-
sensitive image retrievals. More recently, Fan et al.
[11] used a two-level scheme to annotate images. At
the first level, salient objects extracted fromthe image
are classified using SVMs. At the next level, a finite-
mixture model is used to map the annotated
objects to some high-level semantic labels. Most of
these approaches rely heavily on local features, which
in turn rely on high quality segmentations or regions
with semantic meaning. However, segmentation can
hardly be done reliably, especially on compressed
images. As pointed out by Wang and Li [35], human
beings tend to view images as a whole. Thus, some
semantic concepts may not be learnable through a
single region. The relationship between regions has
also been considered for the semantic indexing of
images [17], [28], [35]. In the ALIP system [16], a
statistical technique is used to select the annotation
labels. The query image is first compared with the
trained models in a concept dictionary and a fixed
number of top-ranked concepts are identified as
candidate annotations. Among these candidates, the
rarer annotations are deemed more significant and,
hence, chosen as the query's annotations. Finally,
Zhang et al. suggest the use of a semantic feature
vector to model images and incorporate the semantic
classification into the relevance feedback for image
retrieval [15], [36], [38]. Almost all of these ap-
proaches assume that no changes will occur in the
feature set and keyword set.
The major constraint of text-based methods is that they
require the presence of high quality textual information
describing an image. In many situations, this requirement
may not be satisfied, so a content-based approach is
favored. For example, stock-photo companies' images are
often digitized versions of printed photographs with little
or no textual information. In such situations, the content-
based approach is the only viable option.
2.2 Multiclass Classification
A number of methods have been proposed for decomposing
a multicategory classification problem into a collection of
binary classification problems and combining their predic-
tions in various ways [1], [10], [18], [20]. The most frequently
used scheme is the One Per Class (OPC) approach.
Suppose we have C classes. For each of the C classes, OPC
constructs a binary classifier to make a yes or no
prediction for that class alone. Given a query instance, each
binary classifier will produce a prediction and the class of
the instance is determined by the class with the highest
confidence in the yes prediction. The DAG-SVM scheme
proposed by [20] constructs C(C − 1)/2 one-versus-one
binary classifiers and then uses the Decision Directed
Acyclic Graph (DDAG) to classify a query instance.
Different instances may follow different decision paths in
the acyclic graph to reach their most likely class. Another
approach for improving multicategory classification is to
perform multistage classification [21], [23]. This approach
uses inexpensive classifiers to obtain an initial, coarse
prediction (for a query instance). It then selects the most
relevant localized classifiers to refine the prediction.
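As a minimal illustration of the OPC scheme described above (a sketch, not any of the cited systems' implementations), each class gets its own binary scorer and the query is assigned to the class whose scorer is most confident in a yes; the scorers below are hypothetical stand-ins for trained binary SVMs.

```python
# A minimal one-per-class (OPC) sketch: one binary scorer per class; the
# query goes to the class whose scorer is most confident in "yes".
# The scorers are hypothetical stand-ins for trained binary SVMs.

def opc_predict(scorers, x):
    """scorers: {class_label: f}, where f(x) is the binary 'yes' confidence."""
    return max(scorers, key=lambda c: scorers[c](x))

scorers = {
    "landscape":    lambda x: x[0],  # confident when feature 0 dominates
    "architecture": lambda x: x[1],
    "animals":      lambda x: x[2],
}
prediction = opc_predict(scorers, (0.9, 0.2, 0.4))
assert prediction == "landscape"
```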
The CDE scheme differs from the traditional ensemble
schemes in its ability to assess class-prediction confidence
and take measures to correct and improve accuracy. The
concept of prediction confidence is not new in the field of
pattern recognition. In 1970, Chow [8] proposed using the
rejection option to exclude a low-confidence classification.
What is new in CDE is that its multilevel confidence
assessment is more robust and that CDE can
dynamically train an ensemble to zero in on a small subset
of most probable classes to improve prediction accuracy.
Furthermore, the confidence factors can assist annotation-quality
improvement through adding new semantic labels,
enriching the feature set, and enhancing the training data set.
To measure confidence, various methods [3], [19] have
been proposed to map the output of a binary classifier into a
posterior probability value, which represents the confidence
in a yes prediction. Unfortunately, two problems hinder a
reliable estimate of the posterior probability in the annota-
tion setting.
1. Random guesses. An ensemble can sometimes force a
guess upon a query image when the guess has little
evidence (training data) to back it up. Fig. 1 shows an
example where an almost random guess is forced
upon a query instance. The shaded query instance sits
on two class boundaries: class 1-2 and class 1-3.
Both the class 1-2 and class 1-3 classifiers weakly
predict that the point belongs to class 1. However, the
class 2-3 classifier in this example can confidently
say that the point is in class 2. In this example, we
actually do not have sufficient evidence to predict the
class membership of the query instance. This false
confidence results from a lack of support of training
data in the near neighborhood of the query instance.
Unfortunately, most classifiers will simply declare
class 2 as the winner with high confidence.
2. Imbalanced training data. The second problem is a
subtle but serious one. When we deal with a large
number of classes, the training data of the target class
are usually significantly outnumbered by the other
training instances. For example, consider that we
have 100 equal-size classes. The training data in any
one of the 100 classes is outnumbered by the rest of
the classes by a ratio of 99 : 1. When the training data
is imbalanced, no classifier can accurately compute
posterior probability. To understand the nature of the
problem, let us consider it in a binary classification
setting (positive versus negative). We know that the
Bayesian framework estimates the posterior prob-
ability using the class conditional and the prior [12].
When the training data are highly imbalanced, we
can infer that the natural state favors the majority
class. Hence, when ambiguity arises in classifying a
particular sample because of similar class-conditional
densities for two candidate classes, the Bayesian
framework will break the tie by favoring the majority
class. Consequently, a class with more training data
will, on the average, have a larger posterior prob-
ability than a class with fewer training data.
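The effect described above can be checked with a two-line Bayes computation; the numbers are made up for illustration. With identical class-conditional densities, the posterior simply reproduces the 99 : 1 prior, so the majority class always wins the tie.

```python
# Numeric illustration (made-up numbers) of the imbalance problem: when the
# class-conditional densities tie, the Bayesian posterior is decided
# entirely by the 99:1 priors, so the majority class always wins.
likelihood = {"target": 0.05, "rest": 0.05}   # identical p(x | c)
prior      = {"target": 0.01, "rest": 0.99}   # 99:1 training imbalance

unnorm = {c: likelihood[c] * prior[c] for c in prior}
z = sum(unnorm.values())
posterior = {c: v / z for c, v in unnorm.items()}
assert abs(posterior["rest"] - 0.99) < 1e-9   # tie broken for the majority
```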
CDE remedies the above problems by employing the
algorithm of one-class SVMs (OC-SVMs) to estimate the
support of each individual class. If a query instance receives
no support from any one-class classifier, CDE will discount
its class-prediction confidence by multiplying the low
support with the estimated posterior probability. Since
OC-SVMs deal only with one class at a time, CDE can avoid
the problem of imbalanced training data.
3 CONFIDENCE-BASED DYNAMIC ENSEMBLE
SCHEME
Our annotation scheme is designed to produce a class
prediction (or semantic annotation) for unlabeled query
images. Each prediction is also accompanied by a con-
fidence factor to accomplish three goals:
1. Provide a quantitative measure to assess the class-
prediction correctness.
2. Disambiguate confusing class-predictions by con-
structing a dynamic ensemble for analyzing errors
and making corrections or improvements in
interpretation.
3. Assist in semiautomatic knowledge discovery to
enhance the quality of the existing low-level features
and to improve the descriptive precision of the high-
level semantics.
In order to achieve good prediction performance, our
annotation scheme consists of a three-level hierarchical
classifier:
1. Binary-level. We use Support Vector Machines (SVMs)
as our base-classifier in a binary classification setting.
Each base-classifier is responsible for performing the
class prediction of one semantic label. We map the
SVM output of the base-classifier to a posterior
probability to characterize the likelihood that a query
image belongs to the semantic category that the
classifier controls. More importantly, we employ the
algorithm of one-class SVMs (OC-SVMs) to formulate
a confidence factor (CF_bin) for each binary prediction.
The trust in the SVM output is conditioned on the
support. When the support is low, we discount the
posterior probability generated from the SVM output.
2. Multiclass level. The confidence factors from multiple
base-classifiers are aggregated to provide a single
class-prediction. A multiclass-level confidence factor
(CF_mul) is estimated for this aggregated prediction.
3. Bag-level. To reduce classification variance, we
combine the predictions from multiple sets (or bags)
of multiclass classifiers to make an overall prediction.
Fig. 1. A noisy irrelevant classifier example.
Each set of classifiers is trained by a different subset
of the training data. An overall confidence factor
(CF_bag) is also produced at this level. If CF_bag is low, a
new ensemble of classifiers is dynamically constructed
to improve the prediction accuracy. If the
new ensemble's prediction confidence level remains
low, we flag the image as a potential candidate for
new knowledge discovery.
In the following sections, we first depict how we estimate
prediction confidence at each level (Sections 3.1 to 3.3).
Then, we present the dynamic ensemble scheme for
annotation enhancement (Section 3.4). Finally, we discuss
semantics discovery (Section 3.5).
3.1 Binary-Level Prediction and Confidence
We employ Support Vector Machines (SVMs) as our base-
classifier. We shall consider SVMs in the binary classification
setting. We are given training data {x_1, ..., x_n} that are
vectors in some space X ⊆ R^d. We are also given their
labels {y_1, ..., y_n}, where y_i ∈ {−1, +1}. In their simplest form,
SVMs are hyperplanes that separate the training data by a
maximal margin. All vectors lying on one side of the
hyperplane are labeled as +1 and all vectors lying on the
other side are labeled as −1. The training images that lie
closest to the hyperplane are called support vectors. More
generally, SVMs learn a decision boundary between
two classes by mapping the training examples onto a
higher dimensional feature space F via a Mercer kernel
operator K. In other words, we consider the set of classifiers
of the form f(x) = Σ_{i=1}^{n} α_i K(x_i, x), where x is the query
image we want to classify. When the SVM output f(x) ≥ 0,
we classify x as +1; otherwise, we classify x as −1.

When K satisfies Mercer's condition [5], we can write
K(u, v) = Φ(u) · Φ(v), where Φ : X → F and "·" denotes an
inner product. We can then rewrite f as:

  f(x) = w · Φ(x), where w = Σ_{i=1}^{n} α_i Φ(x_i).   (1)
Commonly used kernels include the polynomial kernel
K(u, v) = (u · v + 1)^p, the Gaussian kernel
K(u, v) = e^{−γ(u−v)·(u−v)}, and the Laplacian kernel
K(u, v) = e^{−γ Σ_i |u_i − v_i|}.
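Written out directly, these kernels look as follows; γ (gamma) and p are free hyperparameters here, not values taken from the paper.

```python
# The three kernels above written out directly; gamma and p are free
# hyperparameters, not values taken from the paper.
import math

def poly_kernel(u, v, p=2):
    return (sum(a * b for a, b in zip(u, v)) + 1) ** p

def gaussian_kernel(u, v, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def laplacian_kernel(u, v, gamma=1.0):
    return math.exp(-gamma * sum(abs(a - b) for a, b in zip(u, v)))

# Sanity checks: the exponential kernels peak at 1 when u == v.
assert gaussian_kernel((1, 2), (1, 2)) == 1.0
assert laplacian_kernel((1, 2), (1, 2)) == 1.0
assert poly_kernel((1, 0), (1, 0)) == 4   # (1*1 + 0*0 + 1)^2
```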
While the sign of the SVM output determines the class
prediction, the magnitude of the SVM output can indicate
the confidence level of that prediction. However, the SVM
output is an uncalibrated value and it might not translate
directly to a probability value useful for estimating
confidence. Of late, [14], [19] proposed methods to improve
the mapping from an SVM score to probability. However, as
we have pointed out in Section 2.2, these methods are
susceptible to the problems of noise and training-data
imbalance [37]. We propose using one-class Support Vector
Machines (OC-SVMs) [25] to estimate the support of
individual classes to avoid these problems. We present
our method next.
We make use of the idea of outlier detection [26], [30] to
estimate the confidence of a prediction. First, the probability
density distribution of the training data is estimated, then a
query image is tested to see how it differs from the
estimated distribution. More specifically, we learn a
function which characterizes the neighborhood in the input
space where the training data resides. This neighborhood is
commonly referred to as the support. If a query image does
not dwell in or is nowhere near the neighborhood of the
training data (and thus lacks support), the prediction is
given a low confidence factor.
3.1.1 Support of Data with One-Class SVMs
Density estimation has been extensively studied in the field
of statistics and has been commonly applied to the outlier-detection
problem [25]. Recently, the range of applications
has been extended to query concept learning [39] as well as
general pattern classification [22].
Most proposed methods focus on estimating the dis-
tribution of the regular (or nonoutlier) portion of the data. It
is usually assumed that the data can be described by some
model (such as Gaussian) or a mixture of models. The
parameters of the models can be obtained using Bayesian
techniques, EM algorithms, clustering algorithms, etc. If the
distribution of the outliers is required, a different model set
can be fitted. The main assumption of these methods is that
we have a large amount of training data for learning model
parameters and that there are sufficient outliers in the
training data. However, the intrinsic nature of outliers is that
they are scarce, unpredictable, and distant from others, so no
training data set can possibly contain all forms of outliers.
For our purpose of formulating a confidence measure,
we do not require a precise estimate of the underlying
training data distribution. Instead, we make use of OC-
SVMs [25]. Given a set of labeled training data from just
one class, OC-SVMs will attempt to learn a function that fits
the data in a small region of the input space. No assumption
about the data distribution is made by OC-SVMs. Given a
query instance, the learned function returns a positive value
if the query instance belongs to the region that contains
most of the training data; otherwise, it returns a negative value. The
strategy used to learn the function is to separate the training
data from the origin by a maximal margin hyperplane w;
outliers fall on the side of the hyperplane that contains the
origin. When this linear separation is not possible in the
input space, we can employ the kernel trick [32] to project the
data to a high-dimensional feature space, where the
prospect of finding a linear separating hyperplane is higher.
Using an appropriate kernel (such as the Gaussian kernel)
ensures that all the dot products between two instances are
positive, thus guaranteeing that all instances are mapped to
the same orthant [26].
Fig. 2 presents a simple example of OC-SVMs to
illustrate how outliers can be separated from the training
Fig. 2. An OC-SVM example. (a) The training data and a small group of
outliers in the input space, and (b) the projected feature space, where the
hyperplane w separates the training data from the origin by a maximal margin.
data. Fig. 2a shows the distribution of the training data (the
support is indicated by the circle that encloses the data)
and a small group of outliers in the input space. Outliers
are any data instances that lie outside the support of the
training data. After using a suitable kernel to project the
data onto the feature space, the data distribution is shown
in Fig. 2b. The hyperplane w separates the training data
from the origin by a maximal margin ρ/‖w‖. Data that are
mapped to the same side as the origin are given a
negative OC-SVM value f_oc(x) < 0, whereas those mapped
to the side of the training data have positive values.
With training data {x_1, ..., x_n} ∈ X, the optimal hyperplane w
can be found by solving the following quadratic
programming problem [25]:

  min_{w ∈ F, ξ ∈ R^n, ρ ∈ R}  (1/2)‖w‖² + (1/(νn)) Σ_i ξ_i − ρ
  subject to  w · Φ(x_i) ≥ ρ − ξ_i,  ξ_i ≥ 0.   (2)
This optimization problem can be solved using Lagrangian
multipliers. The dual problem can be stated as:

  min_α  (1/2) Σ_{i,j} α_i α_j K(x_i, x_j)
  subject to  0 ≤ α_i ≤ 1/(νn),  Σ_i α_i = 1.   (3)
The decision function f_oc(x) = sgn(w · Φ(x) − ρ) can be
written as a kernel expansion:

  f_oc(x) = sgn( Σ_i α_i K(x_i, x) − ρ ).   (4)

By using those x_i whose α_i is not at the upper or lower
bound of the constraints in (3), the offset ρ is given by
ρ = w · Φ(x_i) = Σ_j α_j K(x_j, x_i).
With a unique solution of w and ρ, the decision function
f_oc(x) of (4) should be positive for most of the training
instances, while the regularization term ‖w‖ remains small. In
other words, we would like most of the training instances to
fall on the side of the hyperplane that does not contain the
origin, and we would like this hyperplane to be as far away
from the origin as possible. The tradeoff between these two
desirable yet conflicting properties is controlled by the
parameter ν ∈ (0, 1]. In [25], the authors prove that ν is an
upper bound on the fraction of outliers allowed by the
solution, as well as a lower bound on the fraction of support
vectors. When using OC-SVMs as a confidence measure, we
would like ν to be as small as possible since we are trying to
obtain the best estimate of the training data support.
However, we can also use OC-SVMs for multiclass
classification, in which case a smaller ν may not always
be optimal, as it may lead to overfitting. In Section 4, we
present our empirical studies on the effect of ν on the
effectiveness of OC-SVMs as a confidence measure.
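As a rough empirical check of ν's role (a sketch, not the authors' setup), one can train scikit-learn's OneClassSVM, which implements the OC-SVM of [25], on synthetic one-class data and verify that the fraction of training points flagged as outliers stays below ν:

```python
# Sketch: nu upper-bounds the fraction of training points treated as
# outliers by the OC-SVM. The data here is synthetic, not the paper's.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                            # one "class" of data

oc = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1).fit(X)
outlier_fraction = float(np.mean(oc.predict(X) == -1))   # -1 marks outliers
assert outlier_fraction <= 0.1 + 0.05                    # bounded (roughly) by nu
```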
3.1.2 Binary-Level Confidence Factor (CF_bin)
Let C denote a set of semantic labels. The training data
consist of images from |C| semantic categories. In order to
use SVMs, we decompose the |C|-nary classification
problem into |C| binary subproblems and use the one-per-class
(OPC) ensemble scheme to produce the multiclass
prediction. With OPC, the positive training data of a binary
classifier come from one semantic category, while the
negative data encompass the remaining training data
from the other |C| − 1 categories. Therefore, we train |C| binary
SVM classifiers f_1, f_2, ..., f_{|C|} for the |C| categories in our
semantic set. For each query image x, we first map each
binary SVM output f(x) to a posterior probability P(y = c | x)
using the following equation proposed by Platt [19]:

  P(y = 1 | x) = 1 / (1 + e^{A f(x) + B}).   (5)
We then compute an OC-SVM output (4) for each of the |C|
binary predictions. We normalize f_oc to [0, 1] using a simple
transformation:

  f'_oc = (f_oc − min f_oc) / (max f_oc − min f_oc).   (6)

The binary-level confidence factor for image x is defined as

  CF_bin(y = c | x) = f'_oc(x) × P(y = c | x).   (7)
Intuitively, we do not entirely trust the posterior probability
estimated by SVMs (again, for the reasons discussed in
Section 2.2). We use the support estimated by OC-SVMs to
determine how much we can trust the output of a two-class
SVM classifier. In our empirical studies, we will demonstrate
that using CF_bin to produce a multiclass prediction can lead
to higher prediction accuracy. In Section 3.5, we will illustrate
how CF_bin can be used to discover new semantics in a
query image.
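Equations (5)-(7) can be combined into a short sketch: a Platt-style posterior from the two-class SVM output, discounted by the normalized OC-SVM support. The Platt parameters A and B below are hypothetical, since the paper fits them per classifier.

```python
# Sketch combining (5)-(7). The Platt parameters A, B and all numeric
# inputs are hypothetical, for illustration only.
import math

def platt_posterior(f, A=-2.0, B=0.0):
    return 1.0 / (1.0 + math.exp(A * f + B))          # equation (5)

def normalized_support(f_oc, f_min, f_max):
    return (f_oc - f_min) / (f_max - f_min)           # equation (6)

def cf_bin(svm_output, f_oc, f_min, f_max):
    return normalized_support(f_oc, f_min, f_max) * platt_posterior(svm_output)

# Identical SVM outputs, but weak support sharply discounts the confidence:
well_supported = cf_bin(1.0, f_oc=0.9,  f_min=-1.0, f_max=1.0)
unsupported    = cf_bin(1.0, f_oc=-0.8, f_min=-1.0, f_max=1.0)
assert well_supported > unsupported
```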
3.2 Multiclass Level Prediction and Confidence
To label a query image x with one of the |C| possible
semantics, the OPC scheme examines the CF_bin(c | x) of (7)
for each binary SVM classifier and chooses the prediction from
the most confident classifier. Thus, the multiclass prediction
label of x is

  ω = argmax_{1 ≤ c ≤ |C|} CF_bin(c | x).   (8)
To estimate the confidence of this prediction, we first
introduce two useful parameters:
Definition 1: Top Binary-Level CF. T_bin = CF_bin(ω | x).

Definition 2: Multiclass Margin. T_m = T_bin − max_{1 ≤ c ≤ |C|, c ≠ ω} CF_bin(c | x).
Although T_bin is the highest confidence factor from the
|C| binary classifiers and it determines the multiclass
prediction label ω, T_bin alone may not be a sufficiently
accurate estimate of the confidence. To illustrate this, we
generate a scatterplot for a 4K-image data set showing the
distributions of the correct and wrong predictions (Fig. 3a).
From Fig. 3a, it is apparent that the correct predictions
tend to have high T_bin as well as a large multiclass margin T_m,
whereas the wrong predictions may have high T_bin but
smaller corresponding T_m. We observe that there is a better
separation of the correct predictions from the erroneous
ones if we use the multiclass margin T_m [13], [24] as a
supplemental criterion. The larger the T_m, the less likely an
image is wrongly predicted. It is unlikely that an image
with both high T_bin and large T_m will be wrongly predicted.
Fig. 3b displays the relationship between the class-prediction
accuracy and the multiclass margin T_m.
There are a number of ways to fuse the two parameters
T_bin and T_m. Given the simple data pattern in Fig. 3a, we
treat the fusion task as a function-fitting problem. We
model the relationship between the prediction accuracy and
T_m with the sigmoid function:

  g(T_m) = A / (1 + e^{B − C·T_m}).   (9)

Parameters A, B, and C are determined through
empirical fitting, as shown in Fig. 3b. With the margin T_m,
we formulate the confidence factor of a prediction at the
multiclass level as follows:

  CF_mul = √( T_bin × g(T_m) ).   (10)

CF_mul considers multiple factors in determining the
confidence of a prediction while retaining a linear relationship
with the expected prediction accuracy. A higher CF_mul
implies that the OPC classifier has a higher confidence in its
multiclass prediction.
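A compact sketch of the multiclass-level computation (Definitions 1 and 2 plus (9) and (10)); the sigmoid parameters A, B, C are hypothetical placeholders for the empirically fitted values of Fig. 3b.

```python
# Sketch of the multiclass level: Definitions 1 and 2 plus equations (9)
# and (10). The sigmoid parameters A, B, C are hypothetical placeholders
# for the empirically fitted values.
import math

def g(margin, A=1.0, B=2.0, C=10.0):
    return A / (1.0 + math.exp(B - C * margin))       # equation (9)

def cf_mul(cf_bins):
    ranked = sorted(cf_bins, reverse=True)
    t_bin = ranked[0]                                 # Definition 1
    t_m = ranked[0] - ranked[1]                       # Definition 2
    return math.sqrt(t_bin * g(t_m))                  # equation (10)

# A clear winner (large margin) is scored far more confidently than a near tie:
assert cf_mul([0.8, 0.3, 0.1]) > cf_mul([0.8, 0.75, 0.1])
```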
3.3 Bag-Level Prediction and Confidence
To reduce class-prediction variance, we make use of the
bagging scheme proposed by Breiman [4]. The overall
class prediction is the result of majority voting among
several bags of multiclass OPC classifiers. Each bag of
classifiers is trained by a different subset of the training
data. Suppose we use B bags to determine the class of an
image. With the help of confidence factors, not only is the
prediction of the bth bag ω_b (b = 1, ..., B) known, but also
the confidence level of that prediction (given by CF_mul(ω_b)).
For the bags with higher confidence, their votes should be
given greater consideration during the final tally. Thus, we
weigh each bag's vote by the confidence factor CF_mul(ω_b).
The final prediction is formulated as

  argmax_{1 ≤ c ≤ |C|}  Σ_{ω_b = c} CF_mul(ω_b).   (11)
To evaluate the confidence level of the overall prediction,
we follow the same principles used for the multiclass level.
We identify two useful parameters:

Definition 3: Top Bagging Score. V_bag = Σ_{ω_b = ω} CF_mul(ω_b).

Definition 4: Bagging Margin. V_m = V_bag − max_{1 ≤ c ≤ |C|, c ≠ ω} Σ_{ω_b = c} CF_mul(ω_b).

Under the situation of unanimous voting, V_m = V_bag.
Finally, the bag-level CF for the overall prediction is

  CF_bag = √( V_bag × g(V_m) ) / B.   (12)

The denominator B normalizes the confidence factor to
within the range [0, 1]. A prediction with a high CF_bag is
more likely to be accurate; hence, we will output the
predicted class label. For those predictions with CF_bag less
than a threshold θ_b, we enhance the prediction with the dynamic
ensemble algorithm, which we will describe in the next
section. The formal description of the multilevel annotation
algorithm is presented in Fig. 4.
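The bag-level tally of (11), Definitions 3 and 4, and (12) can be sketched as follows; the sigmoid g and its parameters are again hypothetical placeholders for the empirically fitted function.

```python
# Sketch of bag-level voting: equation (11), Definitions 3 and 4, and
# equation (12). g's parameters are hypothetical placeholders.
import math

def g(margin, A=1.0, B=2.0, C=10.0):
    return A / (1.0 + math.exp(B - C * margin))

def bag_level(predictions):
    """predictions: one (class_label, CF_mul) pair per bag."""
    totals = {}
    for label, cf in predictions:                      # confidence-weighted votes
        totals[label] = totals.get(label, 0.0) + cf    # equation (11)
    winner = max(totals, key=totals.get)
    v_bag = totals[winner]                             # Definition 3
    rest = [v for c, v in totals.items() if c != winner]
    v_m = v_bag - max(rest) if rest else v_bag         # Definition 4
    n_bags = len(predictions)                          # B, for normalization
    return winner, math.sqrt(v_bag * g(v_m)) / n_bags  # equation (12)

label, cf_bag = bag_level([("tigers", 0.9), ("tigers", 0.8), ("cats", 0.4)])
assert label == "tigers" and 0.0 <= cf_bag <= 1.0
```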
3.4 Dynamic Ensemble Scheme
For the images with low annotation confidence, our system
makes an extra effort to diagnose and enhance their
annotations. The system dynamically builds an ensemble
of OPC classifiers for each image with low-confidence
annotation. Our aim is to reduce the number of classes in
the dynamic ensemble without losing any classes in which
the image semantically belongs.
The principle of the dynamic ensemble is best explained
by the theory of Structural Risk Minimization [33]. Let C be
the set of all classes and A be the set of classes considered in
the dynamic ensemble. The difference between the two sets,
Ā = C − A, is the set of excluded classes. The elimination of Ā
from the classification can result in a gain or pose a risk.
When an image x belongs to a class in the set Ā, we risk
misclassifying x since its true class has been excluded from
the classification process. Conversely, when x belongs to a
class in the set A, an improvement in the expected
classification accuracy is likely when we exclude Ā from
the dynamic ensemble. The gain in the classification
1338 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 17, NO. 10, OCTOBER 2005
Fig. 3. Using margin as part of CF. (a) Scatterplot of misclassified and correctly classified images and (b) classification accuracy versus multiclass
margin with sigmoid fit.
accuracy results from two sources: 1) x will not be
misclassified into the classes in Ā and 2) the decision
boundaries will be more accurate since we are only
considering a subset of the more relevant classes.
Formally, let P(Ā|x) be the probability that the image x
belongs to a class in Ā, and let P^+(x, C) or P^-(x, C) be the
expected classification accuracy when all classes in C are
considered during classification. The + sign represents
the situation when x belongs to a class in Ā, while the −
sign denotes the situation when x does not belong to any
classes in Ā. When only the classes in A are considered, the
expected classification accuracy is denoted by P^-(x, A). We
can then express the overall expected gain, denoted as G, in
classification accuracy as

G = [P^-(x, A) - P^-(x, C)]\,[1 - P(\bar{A}|x)] - P^+(x, C)\,P(\bar{A}|x). \qquad (13)
For each image x, the goal of the dynamic ensemble is to
maximize G. From (13), we observe that this goal can be
accomplished by selecting an appropriate subset of classes A
to keep P(Ā|x) low. The binary-level CFs derived in
Section 3.1.2 provide critical support for the class-selection
task. To keep P(Ā|x) low, the classes in Ā should be selected
such that the probability of x's true class belonging to Ā is
low. It is more logical to exclude those semantic classes
which are unlikely to include x. In other words, classes with
a lower CF_bin are considered less relevant to the annotation
of x. Formally, we select the set of candidate classes
A = {c | CF_bin(c|x) ≥ θ_p}, where c = 1, ..., |C| and θ_p is the
thresholding parameter for the binary CFs. The higher the
θ_p, the higher the value of the term P^-(x, A) - P^-(x, C),
which leads to a higher G. However, at the same time, P(Ā|x)
will also be higher, which leads to a lower G. Through the
selection of an optimal θ_p, we can maximize the expected
accuracy gain G. In Section 4, we will examine the relation-
ship between θ_p and annotation accuracy.
Once the candidate classes of A have been identified, the
system dynamically composes an ensemble of SVM classi-
fiers to enhance the annotation. The dynamic ensemble
includes |A| binary SVM classifiers, each of which compares
one candidate class against the other candidate classes. By
excluding the influence of the less relevant classes in Ā, the
dynamic ensemble can significantly enhance the annotation
accuracy. To illustrate this with a simple example, suppose
we are very confident that an image should be labeled as
either architecture or landscape. It would be counter-produc-
tive to include images from irrelevant classes such as flowers
or fireworks when training the classifiers. Instead, by
focusing on the relevant classes, the dynamic ensemble
can reduce the noise from less relevant classes, thus
producing a more accurate annotation. The formal algo-
rithm of the dynamic ensemble scheme is presented in Fig. 5.
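The class-selection step and the expected gain of (13) can be sketched as follows (a hedged illustration; the function names and the toy CF values are our own, not the paper's):

```python
def select_candidates(cf_bin, theta_p):
    """Keep only classes whose binary-level confidence reaches theta_p (the set A)."""
    return {c for c, cf in cf_bin.items() if cf >= theta_p}

def expected_gain(p_minus_A, p_minus_C, p_plus_C, p_excluded):
    """Eq. (13): expected accuracy gain from restricting classification to A.

    p_minus_A  -- P-(x, A): accuracy over A when x's true class is in A
    p_minus_C  -- P-(x, C): accuracy over C when x's true class is in A
    p_plus_C   -- P+(x, C): accuracy over C when x's true class was excluded
    p_excluded -- P(A-bar | x): probability that x's true class was excluded
    """
    return (p_minus_A - p_minus_C) * (1.0 - p_excluded) - p_plus_C * p_excluded

# Toy example: four classes, two survive the threshold theta_p = 0.05.
cf_bin = {"architecture": 0.40, "landscape": 0.35, "flowers": 0.02, "fireworks": 0.01}
A = select_candidates(cf_bin, theta_p=0.05)
```

With these made-up values, only architecture and landscape enter the dynamic ensemble, mirroring the example in the text.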
In our empirical studies, we confirmed that our dynamic
ensemble is capable of improving annotation accuracy (see
Section 4). By dynamically composing an ensemble of SVM
classifiers for low-confidence images, we will incur addi-
tional computational overhead. However, the overhead is
not too serious a concern for the following reasons:
1. For most annotation applications, the annotation time
is less of a concern than the annotation quality.
Frequently, the annotation process is carried out
offline.
2. A dynamic ensemble is applied only to low-
confidence annotations, and the percentage of such
annotations is usually small. Furthermore, we can
control that percentage by tuning the threshold θ_b.
3. The dynamic ensemble considers only the relevant
classes. Thus, the training data set for the dynamic
GOH ET AL.: USING ONE-CLASS AND TWO-CLASS SVMS FOR MULTICLASS IMAGE ANNOTATION 1339
Fig. 4. Algorithm for multilevel predictions.
ensemble is only about |A|/|C| as large as the entire
training data set. The ratio |A|/|C| is usually small
(especially for data sets with a large number of
classes). Besides, we can control this ratio through the
parameter θ_p. Based on the study of Collobert and
Bengio [9], the training time for a binary SVMTorch
classifier is about O(ℓ^1.8), where ℓ stands for the size
of the training data. In addition, the dynamic ensemble
needs to train only |A| binary SVM classifiers
rather than |C| (C denotes the set of semantics and |C|
represents the number of classes). To sum up, the total
overhead is on the order of O(ℓ_l (|A|/|C|)^2.8) of the training
time for the entire data set, where ℓ_l is the number of
low-confidence annotations. For most annotation
applications, |C| usually is large; as a result, the
overhead costs of a dynamic ensemble usually come
within an acceptable range.
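The overhead estimate above can be reproduced with a back-of-the-envelope computation (the helper and the example numbers are our own illustration):

```python
def relative_overhead(num_low_conf, frac_candidates):
    """Dynamic-ensemble training cost relative to one full training pass.

    Per the O(l^1.8) SVMTorch estimate [9], retraining |A| classifiers on a
    |A|/|C| fraction of the data costs about (|A|/|C|)**2.8 of a full pass,
    repeated once per low-confidence annotation.
    """
    return num_low_conf * frac_candidates ** 2.8

# Hypothetical numbers: |A| = 5 of |C| = 116 classes, 100 low-confidence images.
ratio = relative_overhead(100, 5 / 116)
```

Even with 100 low-confidence images, the total extra cost here stays well under the cost of one full training pass, which is why the overhead is acceptable when |C| is large.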
Our empirical studies show that when we pick an
appropriate θ_p and θ_b, the average dynamic ensemble
training time for a low-confidence annotation is affordable.
For a large data set with 25,000 images from 116 classes, the
extra training time is 1.1 seconds.
If the confidence of an annotation remains low after the
dynamic ensemble scheme has been applied, there are
several possible reasons for that low confidence level. The
image could be semantically ambiguous and thus better
characterized by multiple labels, or the existing features
could be insufficient for describing the semantics in that
image. Another possible reason might be the presence of
completely new semantics in the image. A full exploration
of the above scenarios is beyond the scope of this paper.
However, we will address some of the issues in the
following section, especially those associated with the
discovery of new semantics.
3.5 Knowledge Discovery
The annotation process assigns one or multiple semantic
labels to an image. As new images are added to the
data set, however, some of them cannot be characterized by
the existing semantic labels. When this situation occurs, the
system should signal an alert so that proper human actions
can be taken to maintain the annotation quality (e.g.,
creating new semantic labels or researching new low-level
features). In this section, we briefly discuss how we utilize
our multilevel confidence factors to facilitate the detection
of new semantics and ambiguous semantics. For an
extended treatment of this subject, please consult [40].
3.5.1 New Semantics Discovery
We make use of the OC-SVM outputs to systematically
discover completely new semantics in a query image. There
are two possible manifestations of new semantics:
1. New semantics outside existing semantics. Suppose we
already have the following classes: architecture,
flower, and vehicle. If an image of a panda appears,
it presents new semantics outside of the existing
classes. In such a situation, we want to enhance the
keyword set by adding new labels to it.
2. Under-represented semantics within existing ones. Sup-
pose we have an animals annotation system that was
trained only with images of these land-based
animals: tigers, elephants, bears, and monkeys. If we
are given a query image of camels to annotate, our
system will likely make a wrong prediction even
though the broad concept of animals is present in our
system. An extreme example would be a query image
of a whale swimming in the ocean. In these scenarios,
the system contains the high-level semantic con-
cept, but the representation of the concept is
inadequate. The remedy is to enhance the training
data set with the addition of the query image.
To conduct knowledge discovery, a well-defined ontol-
ogy is necessary. The ontology can help to determine the
best course of action to follow when an image is singled out
as one with new semantics. It can dictate how specific a
keyword should be for describing a particular semantic. For
example, the ontology will decide whether the bear category
encompasses only polar and brown bears or it includes black
bears. Thus, the ontology will make it clear if the mis-
predicted image should be added to the training data set or a
new semantic label should be created in the set of semantics.
Fig. 6, which can be found on the Computer Society Digital
Library at http://www.computer.org/tkde/archives.htm,
presents an example where an ontology is useful for
determining the best remedy for a misprediction. The query
Fig. 5. Algorithm for DynamicEnsemble.
image (in Fig. 6a) consists of a lighthouse on a cliff; its true
label is landscape and the predicted label is wave. The other
five images (in Figs. 6b, 6c, 6d, 6e, and 6f) show the
five nearest neighbors of the query image; four belong to
the wave category and one belongs to landscape. In this
example, the vast expanse of the sky and the sea in the query
image causes it to bear more resemblance to the color features
of wave images, hence the misprediction. The only lighthouse
image in the training data set is the second nearest neighbor.
As we have discussed above, there are two possible
remedies: add the query image to the landscape training data
set and retrain the classifiers so as to avoid future mispredic-
tions for images with lighthouses or create a new semantic
category lighthouse. With an ontology as a guide, it would be
clear which remedy is preferable. If the keyword lighthouse is
present in the ontology, then it would be preferable to create a
new label in the semantic set. If the keyword is absent, we
need to add the query image to the training data of the label
(wave) that best describes the image.
To discover new semantics, we employ a simple
approach: We examine the unnormalized OC-SVM output
(f_oc of (4)) to determine the amount of support that the
query image is receiving from the training data. Each of the
|C| OC-SVM classifiers is trained using data from just one
semantic class. We are interested not only in the magnitude
of the output, but also in the sign. If all the outputs are
negative, it indicates that the query image severely lacks
support from all semantic categories. Mathematically, when

\max_{1 \le c \le |C|} f_{oc}^{c}(x) < 0, \qquad (14)

it signifies that the image contains completely new
semantics that our classifiers have never encountered.
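The zero-threshold test of (14) reduces to a sign check on the per-class OC-SVM outputs. A minimal sketch (the scores below are made-up stand-ins for real f_oc values, not from the paper):

```python
def contains_new_semantics(f_oc_outputs):
    """Eq. (14): flag an image as new semantics iff every OC-SVM output is negative."""
    return max(f_oc_outputs) < 0

# Made-up per-class scores for two hypothetical query images.
panda_scores = [-1.3, -0.7, -0.2]    # no trained class supports the image
flower_scores = [-0.9, 0.4, -0.5]    # one OC-SVM gives positive support
```

The first image would be flagged for manual labeling; the second would proceed through normal annotation.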
3.5.2 Ambiguity Discovery
There are two main causes of ambiguous predictions: 1) the
query image contains multiple semantics that make
multiple labels applicable and/or 2) the existing feature
set is unable to distinguish between two or more classes. By
including the multiclass and voting margins in our
confidence factors, CF_bag will be low when there is a close
competition between classifiers of differing classes. If a
query image x satisfies the following three conditions, the
system deems its prediction to be ambiguous:

. There are no new semantics, that is:

\max_{1 \le c \le |C|} f_{oc}^{c}(x) \ge 0.

. The confidence of the final prediction, CF_bag, is low.
. After the dynamic ensemble technique is applied,
the ambiguity is still unresolved.

When the final class-prediction confidence level is still
low after applying CDE, we examine the candidates chosen
to form the dynamic ensemble and their multiclass
confidence factors (CF_mul). If all the CFs are close to each
other, we use the query image as a possible candidate for
feature discovery.
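The three conditions above can be combined into a single check (a sketch of our own; the threshold and tolerance values are hypothetical):

```python
def is_ambiguous(f_oc_outputs, cf_bag, theta_b, candidate_cf_muls, spread_tol=0.05):
    """Deem a prediction ambiguous when (1) it is not new semantics,
    (2) its bag-level confidence stays below theta_b even after the dynamic
    ensemble, and (3) the candidates' CF_mul values are nearly tied."""
    no_new_semantics = max(f_oc_outputs) >= 0        # negation of Eq. (14)
    low_confidence = cf_bag < theta_b
    nearly_tied = max(candidate_cf_muls) - min(candidate_cf_muls) <= spread_tol
    return no_new_semantics and low_confidence and nearly_tied
```

An image flagged this way becomes a candidate for multiple labels or for feature discovery, as described in the text.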
In Fig. 7, which can be found on the Computer
Society Digital Library at http://www.computer.org/
tkde/archives.htm, we show an example where addi-
tional features can help correct a misprediction. The
query image shown in Fig. 7a is that of an elephant
against a mostly brownish background. Using CDE, the
prediction for this query is the tiger class. Figs. 7b to 7d
show the three most similar training images from the
elephant class. These images contain elephants of different
sizes with a variety of backgrounds, including the sky
and greenish land. The three nearest training images
from the tiger class (Figs. 7e, 7f, and 7g) mostly show a
single tiger against a predominantly brownish back-
ground. From this example, we can see that our existing
global color and texture features are insufficient. If we
have local features, or some feature-weighting schemes
that assign less importance to background information,
we can potentially avoid this sort of misprediction. The
task of automatic feature discovery remains one of the
most challenging research problems. Nevertheless, CDE
can assist in identifying useful features in a semiauto-
matic way. (More results are presented in Section 4.5.)
4 EMPIRICAL STUDY
Our testbed consists of 25K images compiled from both the
Corel CDs and the Internet. Each image was first manually
annotated with one of the 116 semantic-category labels
shown in Appendix A. We characterized each image by
two main perceptual feature sets: color and texture. The
color set includes color histograms, color means, color
variances, color spreadness, and color-blob elongations.
Texture features were extracted from three orientations
(vertical, horizontal, and diagonal) in three resolutions
(coarse, medium, and fine). A total of 144 features, 108 from
colors and 36 from textures, were extracted for representing
each image. For the detailed description of these perceptual
features, please consult [31].
The experimental setup is as follows:
1. Classifier Training. We first set aside 85 percent of the
image-feature vectors (about 21K vectors) from each
semantic category. 80 percent of these vectors were
used as the training set. The remaining 20 percent
were used as the cross-validation set for finding the
best parameter settings of CDE (e.g., the parameters
of OC-SVMs and SVMs).
2. Annotation Testing. The remaining 15 percent of the
image-feature vectors were used for testing.
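The split above can be sketched numerically (an illustrative helper of our own; the counts assume the 25K testbed):

```python
def split_counts(total, holdout=0.85, train_frac=0.80):
    """Reproduce the experimental split: 85% set aside, of which 80% trains the
    classifiers and 20% tunes parameters; the remaining 15% is for testing."""
    held = round(total * holdout)          # vectors set aside per the setup
    train = round(held * train_frac)       # training set
    valid = held - train                   # cross-validation set
    test = total - held                    # held-out test set
    return train, valid, test

train, valid, test = split_counts(25_000)  # -> (17000, 4250, 3750)
```

So roughly 17K vectors train the classifiers, 4.25K tune the parameters, and 3.75K are held out for measuring the annotation error rate.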
We first applied CDE on the training data to train a
classifier ensemble. We then used the ensemble to predict
the best label for each testing image. Our evaluation metric
is the annotation error rate: the percentage of images in the
testing set whose predicted label disagrees with the
manually assigned one. The lower the error rate, the better
the annotation quality.
We report the results of our empirical study, which
consists of five parts:
1. Before and after summary. We compared the annota-
tion accuracy of CDE with its static version. We
studied and accounted for the contribution of each
individual component of CDE (confidence factor,
bagging, and dynamic ensemble) to the overall
accuracy improvement.
2. Binary-level confidence factor evaluation. We studied
the effect of the parameters ν and γ of OC-SVMs on
annotation accuracy.
3. Multiclass-level confidence factor evaluation. We ana-
lyzed in detail how the confidence factor affects
annotation accuracy at the multiclass and bag levels
of CDE.
4. Dynamic ensemble scheme evaluation. We examined
whether our dynamic ensemble (DE) scheme could
improve the annotation of images with low CF_bag,
and thereby improve the overall annotation accuracy.
5. Knowledge Discovery. We investigated the use of OC-
SVMs to identify images that contain new semantics.
4.1 Before and After Evaluation
Fig. 8, which can be found on the Computer Society Digital
Library at http://www.computer.org/tkde/archives.htm,
shows 12 frames of qualitative examples of our annotation
results. The labels show the categories and confidence
factors (CF_bag) for each frame. Figs. 8a, 8b, 8c, 8d, 8e, and
8f each show an example with a high prediction confidence
where the label is an accurate description of the content.
Figs. 8g, 8h, and 8i show examples with low annotation CFs.
In Table 1, we summarize the error rates when various
annotation schemes were employed. The first column
reports the error rates using John Platt's mapped prob-
ability; the second column reports the error rates using our
proposed confidence factors. CDE using confidence factors
outperforms its static version (without using confidence) by
about 3 percentage points.
Next, the table reports the contribution of each compo-
nent of CDE to the overall improvement. Using five bags,
the error rate (reported in the second row of the table) is
reduced by 2.6 percent compared to the one-bag version of
CDE (reported in the first row of the table). When the dynamic
ensemble is employed, the error rate is further reduced by
another 2.8 percent (see the third row of the table).
4.2 Evaluation of One-Class SVMs
For one-class SVMs (OC-SVMs), there are two tunable
parameters, γ and ν. The width γ of the Gaussian has the
same implications for the one-class classifiers as it has for a
binary SVM classifier. When we increase the width, we
increase the region of influence of the support vectors,
which may improve classification results, but only to a
certain extent. The other parameter is ν, which controls the
fraction of outliers. Using the validation data, we studied
the effect of varying these parameters on two types of
prediction error. First, we used OC-SVMs as a classification
scheme where the classifier with the highest OC-SVM
output f_oc determines the class prediction. Second, we used
the OC-SVM output as part of the confidence factor (7), and
the resulting CF_bin values are used to give a class prediction.
In Fig. 9a, we plot the prediction error rate for various
values of ν when the algorithm of OC-SVMs was used as a
classification scheme. The x-axis shows the different ν
values from 0 to 0.7, while the y-axis shows the prediction
error rate. We observe that as we increased ν from 0.0001 to
0.4, the prediction error rate decreased: dramatically from
ν = 0.0001 to 0.001, before posting more moderate reduc-
tions. As ν was increased further, the error rate increased
again. Here, we do not plot the results for the case where
the algorithm of OC-SVMs is used in CF_bin because the
error rate hovers around 70.5 percent for various values of
ν. Since changing ν did not have an effect on its use in the
confidence factor, we set ν to be as small as possible. In this
way, we can define more precisely the region where the
training data reside.
Fig. 9b shows the prediction error rate when γ is varied
(with ν fixed at 0.001). We notice that as we increase γ, the
error rate decreases substantially up to γ = 0.5 before
increasing again (solid line). This trend is similar to the
ones observed when using binary SVM classifiers in many
studies on SVMs. When the algorithm of OC-SVMs is used
in CF_bin, the reduction is moderate, but we still observe a
dip at γ = 0.5 (dashed line).
Based on this evaluation, we set ν = 0.001 and γ = 0.5 for
the rest of our experiments whenever the algorithm of OC-
SVMs was being utilized.
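The role of ν can be observed directly on synthetic data. The sketch below uses scikit-learn's OneClassSVM (an assumption on our part; the paper uses its own SVM implementation, and the Gaussian data here merely stands in for image features):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(300, 2))   # synthetic stand-in features

def training_outlier_fraction(nu, gamma=0.5):
    """Fraction of training points the OC-SVM leaves outside its boundary;
    nu upper-bounds this fraction (asymptotically)."""
    clf = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X)
    return float(np.mean(clf.decision_function(X) < 0))

# Sweep nu as in Fig. 9a: larger nu carves away more of the training data.
fracs = {nu: training_outlier_fraction(nu) for nu in (0.01, 0.1, 0.5)}
```

This is why a small ν (0.001 in the paper) defines the region of the training data most tightly: almost no training points are treated as outliers.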
4.3 Evaluation of the Multiclass-Level Annotation Scheme
Our classification scheme makes use of confidence factors in
three areas:
1. While aggregating the binary predictions to produce
a multiclass prediction, we utilize the binary-level
TABLE 1
Summary of Annotation Error Rates
Fig. 9. Effect of ν and γ on OC-SVMs for classification and for CF_bin. (a) Vary ν with γ set to 0.1. (b) Vary γ with ν set to 0.001.
confidence factor CF_bin (7) instead of the raw
probability produced by SVMs.
2. During bagging, we aggregate the prediction from
each bag using the multiclass confidence factor
CF_mul (10), instead of the top binary confidence
factor T_bin alone, to produce the final prediction of a
query image.
3. Once bagging has been completed, a bag-level
confidence factor CF_bag (12) is assigned to the final
prediction. CF_bag can be used to identify query
images with ambiguous predictions so that further
analysis may be carried out.
Before we perform bagging, we evaluate the multiclass
prediction performance of each bag of classifiers. Fig. 10a
shows the prediction error rate for the five different bags.
The bars drawn with solid lines represent the case where
the raw probability is used during a multiclass prediction,
while the bars with dashed lines indicate where CF_bin is
used. We observe that for all bags of classifiers, the
prediction error rate is lower by at least 2 percent when
we use CF_bin. The reduction is especially substantial (about
7 percent) for the fourth bag.
Next, we evaluated the effectiveness of the multiclass
confidence factor CF_mul for aggregating the predictions
from the bags of classifiers. Fig. 10b shows a bar chart
comparing using CF_mul versus not using the confidence
factor. The x-axis shows the number of bags used, while the
y-axis denotes the bagging prediction error rate. We notice
that when CF_mul is used, the error is lower by 2 percent for
the 3-bag case and 4 percent for the 5-bag case. In addition,
we see that with raw probabilities, using more bags actually
leads to a lower error rate.
Both plots of Fig. 10 show that our confidence factors are
able to improve the prediction performance of our
classifiers. This is an indication that the CFs are effective
in assigning lower confidence values to annotation labels
that do not adequately describe the query images, thus
lowering their influence on the final prediction outcome.
Finally, we examined the usefulness of our bag-level
confidence factor CF_bag. Ideally, when a prediction is correct,
we expect to assign a high confidence level to it. In Fig. 11,
we plot a curve showing the annotation accuracy at each
CF_bag value. The figure shows that when CF_bag is high,
the annotation accuracy is also high, and at lower CF_bag
values, the accuracy tends to be low. There is a clear
correlation between CF_bag and the annotation accuracy of
the test data. This indicates that the CF_bag we formulated can
be generalized to track the annotation accuracy of test data.
4.4 Evaluation of the Dynamic Ensemble Scheme
This experiment examined the impact of the dynamic
ensemble (DE) scheme on the annotation quality. More
specifically, we tried to improve the annotation accuracy by
disambiguating the original low-confidence annotations. In
Fig. 12, we plot the error rates of the low-confidence
predictions against the threshold θ_b for the bag-level CF_bag.
Fig. 10. Comparison of prediction error with and without using confidence factor to aggregate predictions. (a) For multiclass prediction. (b) For bagging.
Fig. 11. Effectiveness of bag-level confidence factor CF_bag.
Fig. 12. Error rates for low-confidence predictions after dynamic ensemble.
When a prediction's CF_bag is below θ_b, we deem the
prediction to be a low-confidence one. In our experiment,
the top binary confidence factor threshold θ_p was set
empirically to 0.05. In Fig. 12, we observe that the error rates
are higher at lower θ_b values. The main function of a confidence
factor is to assign a low CF_bag to potentially wrong
annotations, and CF_bag provides a good model for annota-
tion accuracy. When θ_b is set lower, most of the low-
confidence predictions are likely to have wrong annota-
tions. Hence, the error rate of low-confidence predictions
tends to be high at low θ_b.
In addition, the DE scheme achieves a greater error rate
reduction at lower thresholds: a 6.9 percent reduction at
θ_b = 0.05. This trend of error reduction shows that DE is
able to disambiguate conflicts for the low-confidence
predictions while leaving the high-confidence predictions
intact. As shown in Table 1 at the beginning of this section,
the use of DE can further reduce the error rate of bagging by
2.7 percent.
Next, we report the influence of the probability threshold
θ_p on the overall annotation accuracy. As noted in
Section 3.4, the expected annotation accuracy changes with
the value of θ_p; the empirical relationship between θ_p and
annotation accuracy is shown in Fig. 13. When the value of
θ_p increases from 0.01 to 0.1, the overall prediction error
rate continues to decrease. The lowest error rate attained is
61.1 percent. Further increments of θ_p will only result in
higher annotation error rates. This phenomenon conforms
well to our theoretical analysis in Section 3.4.
4.5 Knowledge Discovery
After applying CDE to disambiguate conflicts, we may
still end up with a low confidence level for the query
image's prediction. As discussed in Section 3.5, the low
confidence level can be attributed to the presence of new
semantics or the lack of representative training data. In this
section, we evaluate the effectiveness of OC-SVMs for new
semantics discovery.²
4.5.1 New Semantics Discovery
In order to evaluate the effectiveness of OC-SVMs for new
knowledge discovery, we first derived two sub-data sets
from our 25K data set:
1. Old-semantics data set. For our first data set, we
selected fourteen categories: architecture, bears, clouds,
elephants, fireworks, flowers, food, landscape, pattern,
people, textures, tigers, tools, and waves. We chose
between 100 and 200 images from each category to
form a 1,900-image data set. This data set was then
divided into two parts: a test set (20 percent of the images)
and a training set (80 percent of the images) to create the
one-class and the binary classifiers.
2. New-semantics data set. We constructed this data set
with images from seven categories: alligator, cactus,
cave, pyramid, sunset, tulip, and zebra. The total
number of images in this new data set was 383.
We added this new-semantics set to the old-semantics
test set to form a larger test set of 765 images.
We trained one OC-SVM classifier for each of the
fourteen categories of the old-semantics data set; there-
fore, each test image has 14 unnormalized OC-SVM
outputs f_oc (see (4)) associated with it. The tuning of
the parameters ν and γ is done in the same manner as in
Section 4.2; the final ν is set to 0.001 and γ is set to 0.01.
When the maximal f_oc value is low, it indicates that the
training-data support received by the query image is
potentially low. By using different thresholds as the criterion
to determine whether the maximal f_oc is low, we can plot the
precision/recall (PR) curves for both scenarios in Fig. 14.
Recall refers to the percentage of images with new
semantics that we recover, and precision refers to the
percentage of images from the new data set in the entire
pool of images that are considered new knowledge by our
system. When the threshold is low, the precision is high at
the expense of recall. Conversely, a high threshold results in
low precision but high recall.
The figure shows that the precision is 77 percent at
10 percent recall and 60 percent at 100 percent recall. One
major interpretation of the OC-SVM output is that, for
images that are similar to the training data, outputs are
likely to be positive. This interpretation provides theoretical
justification for us to set the f_oc threshold to 0. In
addition, we avoid the common pitfall of having to change
the threshold for different data sets, or for adding new data
to the training set. Any query image with max f_oc < 0 is
considered to contain new semantics. In the figure, the
marked point corresponds to this zero threshold, where we
get a precision of 64 percent with recall at 90 percent. The
PR results show that at the zero threshold, we can effectively
identify most of the images with new semantics.
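The precision and recall at a given f_oc threshold can be computed as follows (a sketch of our own; the scores and labels are made up for illustration):

```python
def precision_recall(max_foc_scores, is_new, threshold=0.0):
    """Flag images whose maximal OC-SVM output falls below the threshold,
    then score the flags against ground-truth new/old labels."""
    flagged = [s < threshold for s in max_foc_scores]
    tp = sum(f and n for f, n in zip(flagged, is_new))       # new and flagged
    fp = sum(f and not n for f, n in zip(flagged, is_new))   # old but flagged
    total_new = sum(is_new)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / total_new if total_new else 0.0
    return precision, recall

# Seven hypothetical images: the first three carry new semantics.
scores = [-0.5, -0.1, 0.2, -0.2, 0.3, 0.4, 1.0]
labels = [True, True, True, False, False, False, False]
```

Sweeping the threshold over all observed scores traces out a PR curve like the one in Fig. 14; the zero threshold is just one point on it.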
2. This evaluation differs from that in [41] through the use of OC-SVMs,
which produce better results and provide a more intuitive interpretation.
Fig. 13. Relationship between the posterior probability threshold θ_p and
overall annotation accuracy for the 2K-image data set.
Fig. 14. Results for new semantics discovery.
Once images with potentially new semantics have been
isolated, we need to inspect these images and manually
annotate them if necessary. The manually assigned labels
can be added to the existing set of semantics, where they
can be used when new classifiers are retrained.
4.5.2 Discovery of Underrepresented Training Data
Insufficient training data is another cause of a low-
confidence prediction. To remedy this situation, we want
to add the image containing the underrepresented semantics
to the training set and retrain the relevant classifiers for
future predictions.
In Fig. 15, which can be found on the Computer
Society Digital Library at http://www.computer.org/
tkde/archives.htm, we show an example in which the
misclassified query image, a white flower (Fig. 15a),
should belong to an existing class. Due to space
limitations, we show only the five support vectors nearest
to the query image (in Figs. 15b, 15c, 15d, 15e, and 15f). It
is evident that the absence of white flowers in the flowers
category of the training data has caused the misclassifica-
tion of the query image. The surprising prediction
provided by CDE was bear, a category that contains
many white polar bear images. Hence, by adding the
query image to the training set and retraining the flower
classifier, we will be able to classify flowers more
accurately in the future.
4.5.3 Discovery of Insufficient Features
In Fig. 16, which can be found on the Computer Society
Digital Library at http://www.computer.org/tkde/
archives.htm, we show a case where insufficient low-level
features cause a misprediction. The query image is
shown in Fig. 16a; its true label is architecture, but it has
been assigned the bear label. Figs. 16b, 16c, and 16d show
the 3-NNs from the architecture category. Figs. 16e and 16f
show the 3-NNs from the bear category, where dark-
colored bears resemble the darkened doorway of the
query image. We observe that the three architecture images
have colors that are visually similar to those of the query
image, but their brightness and semantic content are not.
If we had features that characterize the shape of the
objects in the images, or features that contain spatial
information about the color blobs in the image, we could
potentially avoid this misprediction. Similar to the
previous example, CDE will identify this mispredicted
query image as a suitable candidate for feature discovery.
5 CONCLUSIONS
In this paper, we have proposed a confidence-based
dynamic ensemble (C11) scheme to overcome the shortcomings of traditional static classifiers. In contrast to
traditional models, C11 makes adjustments to accommodate new semantics, to assist in the discovery of useful
low-level features, and to improve class-prediction accuracy. The key components of C11 are a multilevel
prediction scheme that uses confidence factors to assert the
class-prediction confidence, and a dynamic ensemble
scheme that uses the confidence factors to form new
classifiers adaptively to improve low-confidence predictions. More specifically, C11 uses one-class Support
Vector Machines (SVMs) at the binary level to assert the
correctness of the base-level two-class SVMs' predictions.
The binary confidence factors are also propagated to the
multiclass and bag levels to aid the formulation of
confidence factors at those levels. Our empirical results
have shown that our confidence factors improve the
bag-level prediction accuracy and effectively identify
potential mispredictions. We have also illustrated
the ability of our dynamic ensemble scheme to enhance
the annotations of low-confidence predictions. Finally, we
have demonstrated that, using one-class SVMs, we are able
to identify images that might contain new semantics and
to pick out images whose semantics may be underrepresented
in the existing training data. For future work,
we plan to delve deeper into exploring the use of C11 for
feature discovery.
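As a rough illustration of the base-level idea (a hedged sketch using scikit-learn on synthetic data, not the paper's implementation or features), a one-class SVM trained per class can gate a two-class SVM's prediction, with its decision value serving as a confidence factor:

```python
import numpy as np
from sklearn.svm import SVC, OneClassSVM

rng = np.random.default_rng(0)
# Hypothetical 2-D "perceptual features" for two semantic classes.
X_a = rng.normal([0.0, 0.0], 0.5, size=(50, 2))
X_b = rng.normal([3.0, 3.0], 0.5, size=(50, 2))
X = np.vstack([X_a, X_b])
y = np.array([0] * 50 + [1] * 50)

# Base level: a two-class SVM makes the class prediction.
binary = SVC(kernel="rbf").fit(X, y)
# One one-class SVM per class characterizes that class's support;
# its decision value acts as a confidence factor for the prediction.
occ = {c: OneClassSVM(nu=0.1, gamma="scale").fit(X[y == c]) for c in (0, 1)}

def annotate(x):
    """Predict a class and attach a one-class-SVM confidence factor."""
    x = x.reshape(1, -1)
    c = int(binary.predict(x)[0])
    conf = float(occ[c].decision_function(x)[0])
    return c, conf  # a negative conf flags a likely misprediction

in_dist, _ = annotate(np.array([0.1, -0.2]))    # typical class-0 sample
_, conf_out = annotate(np.array([10.0, 10.0]))  # outlier: low confidence
print(in_dist, conf_out < 0)
```

A prediction whose confidence factor falls below a threshold would be routed to the dynamic ensemble for re-examination rather than accepted outright.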
APPENDIX A
See Table 2.
REFERENCES
[1] E.L. Allwein, R.E. Schapire, and Y. Singer, Reducing Multiclass
to Binary: A Unifying Approach for Margin Classifiers, J. Machine
Learning Research, vol. 1, 2000.
[2] A.B. Benitez and S.-F. Chang, Semantic Knowledge Construction
from Annotated Image Collection, Proc. IEEE Intl Conf. Multi-
media, Aug. 2002.
[3] D. Bouchaffra, V. Govindaraju, and S.N. Srihari, A Methodology
for Mapping Scores to Probabilities, IEEE Trans. Pattern Analysis
and Machine Intelligence, vol. 21, no. 9, pp. 923-927, 1999.
[4] L. Breiman, Bagging Predictors, Machine Learning, pp. 123-140,
1996.
[5] C. Burges, A Tutorial on Support Vector Machines for Pattern
Recognition, Data Mining and Knowledge Discovery, vol. 2, pp. 121-
167, 1998.
[6] E. Chang, K. Goh, G. Sychay, and G. Wu, Content-Based Soft
Annotation for Multimodal Image Retrieval Using Bayes Point
Machines, IEEE Trans. Circuits and Systems for Video Technology,
special issue on conceptual and dynamical aspects of multimedia
content description, vol. 13, no. 1, pp. 26-38, 2003.
TABLE 2
Category Names and Sizes for the 25k-Image Data Set
[7] S.-F. Chang, W. Chen, and H. Sundaram, Semantic Visual
Templates: Linking Visual Features to Semantics, Proc. IEEE Intl
Conf. Image Processing, 1998.
[8] C.K. Chow, On Optimum Recognition Error and Reject Trade-
off, IEEE Trans. Information Theory, vol. 16, no. 1, pp. 41-46, 1970.
[9] R. Collobert and S. Bengio, SVMtorch: Support Vector Machines
for Large-Scale Regression Problems, J. Machine Learning Re-
search, vol. 1, pp. 143-160, 2001.
[10] T. Dietterich and G. Bakiri, Solving Multiclass Learning Problems
via Error-Correcting Output Codes, J. Artificial Intelligence
Research, vol. 2, 1995.
[11] J. Fan, Y. Gao, and H. Luo, Multi-Level Annotation of Natural
Scenes Using Dominant Image Components and Semantic
Concepts, Proc. ACM Intl Conf. Multimedia, Oct. 2004.
[12] K. Fukunaga, Introduction to Statistical Pattern Recognition, second
ed. Boston, Mass.: Academic Press, 1990.
[13] K. Goh, E. Chang, and K.T. Cheng, SVM Binary Classifier
Ensembles for Image Classification, Proc. ACM Conf. Information
and Knowledge Management, pp. 395-402, Nov. 2001.
[14] T. Hastie and R. Tibshirani, Classification by Pairwise Coupling,
Advances in Neural Information Processing Systems, M.I. Jordan,
M.J. Kearns, and S.A. Solla, eds., vol. 10, The MIT Press, 1998.
[15] X. He, W.-Y. Ma, O. King, M. Li, and H. Zhang, Learning and
Inferring a Semantic Space from Users' Relevance Feedback for
Image Retrieval, Proc. ACM Intl Conf. Multimedia, pp. 343-347,
Dec. 2002.
[16] J. Li and J.Z. Wang, Automatic Linguistic Indexing of Pictures by
a Statistical Modeling Approach, IEEE Trans. Pattern Analysis and
Machine Intelligence, vol. 25, no. 2, Feb. 2003.
[17] P. Lipson, Context and Configuration Based Scene Classification, PhD dissertation, MIT EECS Dept., Sept. 1996.
[18] M. Moreira and E. Mayoraz, Improving Pairwise Coupling
Classification with Error Correcting Classifiers, Proc. 10th
European Conf. Machine Learning, Apr. 1998.
[19] J. Platt, Probabilistic Outputs for SVMs and Comparisons to
Regularized Likelihood Methods, Advances in Large Margin
Classifiers. MIT Press, 1999.
[20] J. Platt, N. Cristianini, and J. Shawe-Taylor, Large Margin Dags
for Multiclass Classification, Advances in Neural Information
Processing Systems, vol. 12, pp. 547-553, MIT Press, 2000.
[21] P. Poddar and P. Rao, Hierarchical Ensemble of Neural
Networks, Proc. Intl Conf. Neural Networks, vol. 1, 1993.
[22] G. Ritter and M.T. Gallegos, Outliers in Statistical Pattern
Recognition and an Application to Automatic Chromosome
Classification, Pattern Recognition Letters, vol. 18, pp. 525-539, 1997.
[23] C. Rodriguez, J. Muguerza, M. Navarro, A. Zarate, J. Martin, and
J. Perez, A Two-Stage Classifier for Broken and Blurred Digits
in Forms, Proc. Intl Conf. Pattern Recognition, vol. 2, pp. 1101-
1105, 1998.
[24] R.F. Schapire and Y. Singer, Improved Boosting Algorithms
Using Confidence-Rated Predictions, Proc. 11th Ann. Conf.
Computational Learning Theory, pp. 80-91, July 1998.
[25] B. Scholkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, and R.C.
Williamson, Estimating the Support of a High-Dimensional
Distribution, Technical Report MSR-TR-99-87, Microsoft, Nov.
1999.
[26] B. Scholkopf, R.C. Williamson, A.J. Smola, J. Shawe-Taylor, and
J.C. Platt, Support Vector Method for Novelty Detection,
Advances in Neural Information Processing Systems, S.A. Solla,
T.K. Leen, and K.-R. Müller, eds., vol. 12, The MIT Press, 2000.
[27] H.T. Shen, B.C. Ooi, and K.L. Tan, Giving Meanings to WWW
Images, Proc. ACM Multimedia, pp. 39-48, Nov. 2000.
[28] J.R. Smith and S.-F. Chang, Multi-Stage Classification of Images
from Features and Related Text, Proc. Fourth DELOS Workshop,
Aug. 1997.
[29] R. Srihari, Z. Zhang, and A. Rao, Intelligent Indexing and
Semantic Retrieval of Multimodal Documents, Information
Retrieval, vol. 2, pp. 245-275, 2000.
[30] D.M.J. Tax and R.P.W. Duin, Data Domain Description by
Support Vectors, Proc. European Symp. Artificial Neural Networks,
pp. 251-256, Apr. 1999.
[31] S. Tong and E. Chang, Support Vector Machine Active Learning
for Image Retrieval, Proc. ACM Intl Conf. Multimedia, Oct. 2001.
[32] V. Vapnik, The Nature of Statistical Learning Theory. New York:
Springer, 1995.
[33] V. Vapnik, Statistical Learning Theory. Wiley, 1998.
[34] J. Wang, J. Li, and G. Wiederhold, Simplicity: Semantics-Sensitive
Integrated Matching for Picture Libraries, IEEE Trans. Pattern
Analysis and Machine Intelligence, vol. 23, no. 9, pp. 947-963, 2001.
[35] J.Z. Wang and J. Li, Learning-Based Linguistic Indexing of
Pictures with 2-D MHMMs, Proc. ACM Multimedia, pp. 436-445,
Dec. 2002.
[36] L. Wenyin, S. Dumais, Y. Sun, H. Zhang, M. Czerwinski, and B.
Field, Semi-Automatic Image Annotation, Proc. Interact 2001:
Conf. Human-Computer Interaction, pp. 326-333, July 2001.
[37] G. Wu and E. Chang, Adaptive Feature-Space Conformal
Transformation for Learning Imbalanced Data, Proc. Intl Conf.
Machine Learning, Aug. 2003.
[38] H. Wu, M. Li, H. Zhang, and W.-Y. Ma, Improving Image
Retrieval with Semantic Classification Using Relevance Feed-
back, Proc. Sixth Conf. Visual Database Systems, pp. 327-339, 2002.
[39] Y. Chen, X.S. Zhou, and T.S. Huang, One-Class SVM for Learning in
Image Retrieval, Proc. IEEE Intl Conf. Image Processing, 2001.
[40] K. Goh, B. Li, and E.Y. Chang, Semantics and Feature Discovery
via Confidence-Based Dynamic Ensemble, ACM Trans. Multimedia
Computing, Comm., and Applications, vol. 1, no. 2, pp. 168-189, 2005.
[41] B. Li, K. Goh, and E.Y. Chang, Confidence-Based Dynamic
Ensemble for Image Annotation and Semantics Discovery, Proc.
Intl Conf. Multimedia, 2003.
King-Shy Goh received the PhD degree in
computer engineering from the University of
California, Santa Barbara, in 2004. During her
graduate studies, she was a member of Professor Edward Chang's Multimedia Database Laboratory. She worked as a summer intern at
MERL (2003) and VIMA Technology (2004). Her
research interests encompass statistical learning,
multimedia retrieval, high-dimensional data indexing, and computer vision algorithms that are
dexing, and computer vision algorithms that are
applicable to video surveillance. Since graduation, she has been part of
the engineering team at Proximex, which designs advanced and
scalable video surveillance solutions.
Edward Y. Chang received the MS degree in
computer science and the PhD degree in
electrical engineering from Stanford University
in 1994 and 1999, respectively. Since 2003, he
has been an associate professor of electrical
and computer engineering at the University of
California, Santa Barbara. His recent research
activities have been in the areas of machine
learning, data mining, high-dimensional data
indexing, and their applications to image data-
bases and video surveillance. Professor Chang has served on several
ACM, IEEE, and SIAM conference program committees. He cofounded
the annual ACM Video Sensor Network Workshop and has cochaired it
since 2003. He will cochair major conferences such as ACM Multimedia
and Multimedia Modeling in 2006. He serves as an associate editor for
IEEE Transactions on Knowledge and Data Engineering and ACM
Multimedia Systems Journal. Professor Chang is a recipient of the IBM
Faculty Partnership Award and the US National Science Foundation
NSF Career Award. He is a cofounder of VIMA Technologies, which
provides image searching and filtering solutions. He is a senior member
of the IEEE.
Beitao Li graduated from the Special Class for
Gifted Young at the University of Science and
Technology of China in 1997. He received the
master's degree in 2001 and the PhD degree in
2003, both in computer engineering, from the
University of California, Santa Barbara (UCSB).
From 1999 to 2003, he worked as a graduate
student researcher in the Multimedia Database
Laboratory of UCSB. His research interests
include statistical learning, data mining,
information retrieval, and multimedia data analysis. He has worked as a
research engineer for Ask Jeeves Inc. since August 2003.
1346 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 17, NO. 10, OCTOBER 2005