
Available online at www.sciencedirect.com

ScienceDirect
Cognitive Systems Research 45 (2017) 109–123
www.elsevier.com/locate/cogsys

Active and semi-supervised learning for object detection with imperfect data
Action editor: Simona Doboli
Phill Kyu Rhee ⇑, Enkhbayar Erdenee, Shin Dong Kyun, Minhaz Uddin Ahmed,
Songguo Jin
Computer and Information Engineering Department, Inha University, 100 Inha-ro, Nam-gu 22212, Incheon, Republic of Korea

Received 24 June 2016; received in revised form 21 March 2017; accepted 18 May 2017
Available online 26 May 2017

Abstract

In this paper, we address the combination of active learning (AL) and semi-supervised learning (SSL), called ASSL, to leverage the strengths of both learning paradigms and improve the performance of object detection. Considering the pros and cons of the AL and SSL methods, ASSL lets the SSL method incrementally improve semi-supervised detection performance by incorporating the concept of diversity imported from AL methods. The proposed method demonstrates outstanding performance compared with state-of-the-art methods on the challenging Caltech pedestrian detection dataset, reducing the miss rate to 12.2%, which is significantly lower than the current state of the art. In addition, extensive experiments have been carried out using the ILSVRC detection dataset and an online evaluation for activity recognition.
© 2017 Elsevier B.V. All rights reserved.

Keywords: Active Learning (AL); Semi-Supervised Learning (SSL); Convolutional Neural Network (CNN); Deep learning; Object detection

⇑ Corresponding author.
E-mail addresses: pkrhee@inha.ac.kr (P.K. Rhee), enkhbayar@inha.edu (E. Erdenee), tisgb_1@naver.com (S.D. Kyun), minhaz@inha.edu (M.U. Ahmed), sgkim735@inha.edu (S. Jin).

http://dx.doi.org/10.1016/j.cogsys.2017.05.006
1389-0417/© 2017 Elsevier B.V. All rights reserved.

1. Introduction

Object detection is a key problem for robotics, automotive applications, and surveillance systems. Despite the great advances in object detection technology, which categorizes and localizes visual objects in a scene, imbalanced training data poses a very challenging problem since it very often causes biased results and performance degradation. Supervised detection methods (Girshick, 2015; Ren, He, Girshick, & Sun, 2015) have shown great promise for automatic object detection under the assumption that the training and testing sets have the same data distribution. However, this assumption may not be valid in the real world, and the underlying distributions of training and testing data are very often substantially imbalanced. Sample selection bias can cause a significant performance degradation in object detection based on a supervised learning method.

Recently, convolutional neural networks (CNNs) (Bengio, Courville, & Vincent, 2013; Hinton, 2006; LeCun et al., 1989) have become dominant in the object detection area since Krizhevsky, Sutskever, and Hinton (2012) broke through the performance barrier of large-scale object detection in ILSVRC 2012 (Russakovsky et al., 2015). High-dimensional deep feature spaces, such as those of Fast RCNN (Girshick, 2015) and SPP (He, Zhang, Ren, & Sun, 2015), cause performance degradation due to not only the relatively small quantity but also the insufficient quality of training samples, since most previous object detection approaches rely on the assumption

that training data samples are selected randomly from the underlying distributions. However, imbalanced sample selection is very likely to occur in object detector learning, where the available training samples are relatively few compared with the size of the feature space and huge computational power is necessary. In real-world applications, many researchers and developers also very often encounter the problem of sample selection bias (Heckman, 1979; Zadrozny, 2004) when training samples are drawn manually and might not follow random sampling assumptions. The imbalance problem in training sample selection for an object detector is due to not only biased training samples (the sample quality problem) but also a shortage of samples (the sample quantity problem). Such an imbalance in training samples causes not only an incorrect measure of the object class distribution but also a degradation of detection accuracy.

Active learning (AL) and semi-supervised learning (SSL) methods, which were originally invented to improve classification accuracy using both labeled and unlabeled data, can be adopted to overcome imbalanced sample distributions, imperfect labeling, and selection biases in training an object detector. An AL method adds labeled data to the training set at each iteration from an unlabeled or imperfect dataset, using a selective sampling strategy followed by queries, and is expected to improve detection performance. While AL methods select the most uncertain samples (the highest disagreement) based on the assumption of a perfect expert, SSL methods select the most confident samples (the highest agreement among classifiers) based on the cluster assumption that if samples are in the same cluster, the probability that they are in the same class is high (Chapelle, 2006). In AL, the labeling is performed by human experts. In SSL, the labeling is performed on the basis of the current classification rules under the cluster assumption. The assumption of infallible human experts is not valid in object detection, since if strict labeling rules are applied, the cost becomes too high for a real-world application. If the cluster assumption is not satisfied, SSL will not produce better performance and may even degrade the classification accuracy.

In this paper, the labeled samples are divided into perfect and imperfect samples. A perfect sample has almost correct region of interest (RoI) information such as object category, attribute, position, and size, but some information of an imperfect sample may not be sufficiently correct to be used in training. For example, the attribute of an imperfect RoI represented by a CNN deep feature may be ill-posed or too noisy to be modeled at training time. The dataset of imperfect samples tends to be imbalanced in distribution or biased. Note that an imperfect RoI is different from an unlabeled sample, where the only given information is the position and bounding box of an object in a scene, i.e., a region proposal, and no additional information is provided such as the category or existence of an object. The imperfect samples should be handled differently from the perfect samples since some imperfect samples do not contribute to, or even have a harmful effect on, building a high-performing object detector. The proposed method takes advantage of the concepts of AL and SSL in terms of exploration and exploitation at training time. Assuming that an imperfect training dataset is given (e.g., ILSVRC; Russakovsky et al., 2015), we first construct a simple object detection model using a small number of perfect samples instead of training on the entire set of training samples. We partition the imperfect samples into several batches. Then, we employ ASSL, a batch-mode combination of AL and SSL that serves as an effective incremental training framework, which exploits confident and reliable sampling and explores diverse and informative sampling. The goal is to construct an object detector based on batch-mode active semi-supervised learning, starting from a small number of approximately perfect labeled samples and adding the set of imperfect or unlabeled samples.

The novelty and contributions of ASSL are summarized as follows:

• A combined framework of AL and SSL for efficient object detection, with quality assurance to solve the data imbalance problem.
• We combine AL and SSL to minimize the effect of imbalanced classification due to the bias of training samples and to ensure the modeling capability of the deep feature distribution for efficient object detection.
• We extensively evaluate our method on the challenging Caltech pedestrian detection benchmark (Dollar, Wojek, Schiele, & Perona, 2011), the ILSVRC object detection dataset (Russakovsky et al., 2015), and an online evaluation for action detection. The proposed method demonstrates outstanding performance compared with state-of-the-art methods on the Caltech pedestrian detection dataset, reducing the miss rate to 12.23%, which is significantly lower than the 17.1% of the current state-of-the-art Checkerboard (Zhang, Benenson, & Schiele, 2015). Experimental results confirm that the proposed ASSL method can be applied to a wide variety of application domains including cognitive systems, automotive applications, and surveillance systems.

2. Related works

2.1. Active learning

Active learning is one kind of semi-supervised learning method that iteratively selects the most informative samples to be labeled manually (Settles, 2010). In active learning, the uncertain and erroneous portion of the training data is queried, annotated manually at minimum human cost, and added to the training dataset to obtain the highest gain in classification accuracy (Roy & McCallum, 2001). Thus, the query strategy is the main consideration in an active learning technique. There are several

query functions, focusing on sample selection, that have been proposed.

The uncertainty sampling strategy (Lewis & Gale, 1994) tries to select the samples closest to the class decision boundary. In uncertainty sampling, the samples about which the current classifier is least certain are considered informative. Li and Sethi (2006) proposed confidence-based active learning, which identifies and labels only uncertain samples by computing the uncertainty level of all samples according to the output value of the current classifier and selecting the samples whose outputs are within the uncertainty range. Several successful works adopt uncertainty sampling, including non-probabilistic approaches, such as the nearest-neighbor classifier (Lindenbaum, Markovitch, & Rusakov, 2004), decision trees (Lewis & Catlett, 1994), and support vector machines (Tong & Koller, 2001), and probabilistic approaches (Culotta & Mccallum, 2005; Lewis & Gale, 1994). As popular as uncertainty sampling is, it has several problems. First, the most uncertain samples do not represent the whole data distribution well and might make the classifier focus more on noise or outliers. Second, since it employs a single classifier, the performance of uncertainty sampling approaches is strongly limited by the performance of that classifier.

Unlike uncertainty sampling, instead of relying on a single classifier, the query-by-committee (QBC) strategy (Dagan & Engelson, 1995; Seung, Opper, & Sompolinsky, 1992) uses a committee of classifiers with different hypotheses and selects the samples on which the classifiers disagree the most. Generally, the query-by-committee approach generates accurate yet diverse classifiers and generalizes better than a single classifier by combining ensemble methods (Breiman, 1996; Lu, Wu, & Bongard, 2015; Xu, Li, & Chen, 2012).

While most active learning methods select only a single sample at each iteration, batch-mode active learning (Demir, Persello, & Bruzzone, 2011) selects a batch of samples at each step and, instead of updating the model at each iteration, updates the model once per batch. We adopt batch-mode active learning since batch querying is more efficient and less expensive (M. Li & Sethi, 2006). Considering diversity, selecting batches of samples at each iteration speeds up the learning process under different considerations such as minimizing the margin and maximizing the diversity (Brinker, 2003). Incorporating diversity into the query function by clustering (Xu, Yu, Tresp, Xu, & Wang, 2003) identifies uncertain samples while avoiding redundancy.

Many researchers have investigated active learning methods for large-scale image classification. Active learning has been employed for object region labeling (Siddiquie & Gupta, 2010; Vijayanarasimhan & Grauman, 2009) and image recognition tasks (Kapoor, Grauman, Urtasun, & Darrell, 2007; Qi, Hua, Rui, Tang, & Zhang, 2008) with less annotation effort and better annotation quality. Since annotation quality varies from person to person, some researchers have investigated automatic annotation precision assurance (Sorokin & Forsyth, 2008; Welinder & Perona, 2010). Assuming that a single prominent object of interest exists in an image, some approaches have tried to learn object models directly from noisy keyword search (Fergus, Fei-Fei, Perona, & Zisserman, 2005; Li & Fei-Fei, 2010; Vijayanarasimhan & Grauman, 2009).

2.2. Semi-supervised learning

SSL approaches, which improve classifier performance using both labeled and unlabeled samples, are divided into self-training, co-training, generative probabilistic models, and graph-based SSL (Zhu, 2008). In self-training SSL, the classifier is first constructed from a few labeled training samples; a portion of the unlabeled training dataset is then labeled using the current classifier, and the most confident samples among the predicted labeled samples are added to the training dataset, repeatedly, until convergence. The uncertainty sampling method in AL is a complementary approach, where the least confident samples are selected for querying. In co-training (Mitchell & Blum, 1998), an ensemble method is employed. First, separate models are learned using independently labeled datasets. The current models then classify the unlabeled data, and the next models are learned using a few selected samples with the most confident predicted labels. QBC (query-by-committee) is the complementary active learning version, where the committee queries the unlabeled instances with the most disagreement. Active and semi-supervised learning try to solve the same problem from different directions (Settles, 2010).

2.3. Deep feature for object detection

Motivated by recent advances of deep convolutional neural networks (CNNs) (Bengio et al., 2013; Hinton, 2006; LeCun et al., 1989) on visual object recognition tasks (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; Szegedy et al., 2015), significant improvements have been made in the object detection task (Girshick, 2015; Girshick, Donahue, Darrell, & Malik, 2014; He et al., 2015; Makantasis, Doulamis, Doulamis, & Psychas, 2016). Among them, most notably, Girshick et al. (2014) proposed the region-based convolutional network object detector, called R-CNN, and achieved state-of-the-art performance on object detection benchmarks by a large margin over the previous best results, such as Overfeat (Sermanet et al., 2013) and the deformable part model (DPM) (Felzenszwalb, Girshick, Mcallester, & Ramanan, 2009). The RCNN framework employs AlexNet (Krizhevsky et al., 2012) and VGGNet (Simonyan & Zisserman, 2014) to extract a fixed-length feature vector from object proposals generated by selective search (Uijlings, Van De Sande, Gevers, & Smeulders, 2013) and then uses a linear SVM to classify each region. In order to reduce the computation of the forward pass for each region proposal in RCNN (Girshick et al., 2014), Fast RCNN (Girshick, 2015) and SPP (He et al., 2015) compute feature

maps once and then pool the features for each object region only from the last convolutional layer. Most recently, Faster RCNN (Ren et al., 2015) introduced a region proposal network that replaces the region proposals provided by selective search and achieved better performance and speed.

3. System overview

We very often encounter the problem of sample selection bias (Heckman, 1979; Zadrozny, 2004) when training samples are drawn with manual classification rules, whereby samples are error-prone and may not follow a random sampling assumption. Furthermore, the collected dataset tends to be imbalanced in class distribution. Such imperfect training samples result in incorrect modeling of object classes and lead to the degradation of system performance. Owing to the intrinsic complexity of fallible labeling and imbalanced data samples in object detection, we need a novel approach to sample selection and learning. If imperfect or imbalanced samples exist in a training dataset, heavy computation is required due to a huge number of iterations, which results in slow or inefficient learning in a supervised learning framework. We address how to combine AL and SSL to solve the problem of training an imbalanced object detector using biased or imperfect training samples. AL and SSL methods can be combined to define an efficient learning framework that exploits both labeled and semi-labeled samples to ensure the modeling capability of the deep feature distribution for efficient object detection by selecting well-balanced samples to be labeled or relabeled by the human expert.

In this section, we give an overview of an efficient batch-mode learning method for object detection in the presence of imperfect training samples, which combines AL and SSL based on collaborative sample selection and is called ASSL. In general, AL tends to explore the unknown data distribution, whereas SSL tries to exploit the unknown aspect. While AL can start with few labeled samples without much affecting the generalization capability, SSL requires more confidently labeled samples, which strongly affect the convergence of the classifier performance. The combination of AL and SSL is expected to produce better discrimination power. ASSL can improve the accuracy of object detection based on a flexible sampling strategy that alleviates the selection bias and class distribution imbalance that might occur frequently in object detection.

We formulate object detector learning on imperfect training data as batch-mode active semi-supervised learning. Batch-mode learning, whereby a batch of samples is selected and learned iteratively, is a more practical approach for object detection since it is not realistic to deal with one data sample at a time.

In this paper, the labeled samples are divided into perfect and imperfect samples. The confident dataset consists of the samples labeled with correct object locations, poses, scales and object classes. The imperfect dataset is allowed to include samples that are not confidently labeled, i.e., it includes not only correctly labeled samples but also samples some of whose information carries biased labels under an imbalanced underlying distribution (see Fig. 2). For example, the information on object location may be incorrect while that on the scale and pose is correct, or the object may be correctly located but ill-posed or noisy. The imperfect samples should be handled differently from the confident samples since some samples do not contribute to, or even have a harmful effect on, learning a high-performing object detector. Note that an unlabeled sample can be treated as a member of an imperfect dataset with a missing label. In this context, the imperfect dataset can be thought of as a generalization of labeled and unlabeled datasets whose samples are described by confidence scores.

4. Proposed method

We present the ASSL framework, which combines deep features and a batch-mode AL algorithm in a manner similar to that of Persello and Bruzzone (2011), whereby the confidence score function is used in the semi-labeled sample selection to satisfy the diversity criterion. A semi-label is defined as a label assigned by a classification model, and thus the label may be incorrect, i.e., either a true positive or a false positive. We employ batch-mode active learning, where the batches of samples selected at each iteration maximize the diversity to minimize the convergence time (Brinker, 2003). Instead of random selection, we adopt an efficient and flexible collaborative sampling strategy by integrating the uncertainty and diversity criteria from AL and the confidence criterion from SSL. ASSL initially trains a deep convolutional neural network (CNN) using a small number of confidently labeled samples (the confident dataset), and it repeatedly retrains the next deep feature model (CNN) by adding the batch of samples selected using the current classifier (stored in the semi-confident dataset) until convergence. The overall architecture of the proposed ASSL framework for object detection is illustrated in Fig. 1. We assume that human classification (labeling) rules are based on the properties of the validation dataset, the distribution of which is similar to that of the test dataset. We select the set of semi-labeled samples based on the confidence score function, whereby the semi-labeled samples satisfy either the diversity criterion, the confidence criterion, or both. This imperfect dataset assumption is reasonable in the sense that the diversity of the training dataset compared with the validation dataset causes inconsistency and imperfection in the manual labeling process.

Fig. 1. Overview of ASSL framework. First, a simple object detection model is constructed using a small number of perfect samples. Then, we employ ASSL, a batch-mode combination of AL and SSL for effective incremental training, which exploits confident and reliable sampling and explores diverse and informative sampling.

The network architectures (Simonyan & Zisserman, 2014; Zeiler & Fergus, 2014) used in this work have several convolutional layers, rectified linear units (ReLU), and max-pooling layers, followed by a spatial pooling layer and several fully connected layers. The final layer of the network has two sibling layers: one is a softmax classification layer that outputs a score between 0 and 1, indicating the class probability over the $K$ object classes plus "background", and the second is a bounding box regression (Felzenszwalb et al., 2009) layer that outputs four coordinates, indicating the box location for each of the $K$ object classes.

4.1. ASSL collaborative sampling

Considering a batch-mode AL framework combined with the SSL approach, the training phase of ASSL is as follows. First, the classifier (softmax classifier) $F$ is learned for multiple object classes using the confident training samples $D_{\mathrm{conf}}$. We then randomly select a pool of candidate samples $D_{\mathrm{batch}}$ from the imperfect training dataset $D_{\mathrm{imp}}$ ($D_{\mathrm{conf}} \ll D_{\mathrm{imp}}$). Then, pseudo labels are (re)assigned to the selected imperfect samples according to the uncertainty criterion of AL. We use a simple but effective sampling criterion for the uncertainty measure of the classifier to filter samples with sufficient uncertainty. The uncertain-labeled samples are filtered by a diversity criterion to produce active-labeled samples with minimum redundancy. Then the confidence criterion of SSL is applied to produce semi-labeled data. Finally, we train the deep feature model on the semi-labeled data and repeat these steps until convergence or until the imperfect dataset is exhausted.

Given an imperfect training set $D_{\mathrm{imp}} = \{x_i\}_{i=1}^{N}$ and a detection model (softmax classifier) $F$, we compute confidence scores over the $K$ classes for each bounding box as follows. The current classifier is applied to each sample in $D_{\mathrm{imp}}$, and the probability (Hasan & Roy-Chowdhury, 2015) that a sample $x_i$ belongs to class $c$ is defined as:

$$CS(x_i) = p(y_i = c \mid x_i; W) = \frac{\exp(W_c^T x_i)}{\sum_{j=1}^{K} \exp(W_j^T x_i)} \qquad (1)$$

where $c \in \{1, \ldots, K\}$ is a class label, $W_c$ is the weight vector corresponding to class $c$, and the superscript $T$ denotes transposition.

In the uncertainty sampling stage, pseudo labels are (re)assigned to the set of unlabeled or imperfect samples in $D_{\mathrm{imp}}$ by the current classifier. For each object class, a set of samples is randomly selected, scored ($CS(x_i)$) using the current classifier, and added to the candidate set $D_{\mathrm{batch}}$. The selected regions of interest (RoIs) are eliminated from the whole training dataset $D_{\mathrm{imp}}$. Instead of selecting only one sample per iteration, ASSL minimizes the learning time by building a pool of $2\eta$ samples based on the uncertainty criterion. Formally, considering the training dataset, the imperfect/unlabeled sample with the lowest confidence can be selected using the uncertainty measure $\hat{x}_i = \arg\min_{x_i \in D_{\mathrm{imp}}} |CS(x_i)|$, i.e., a sample that is closest to the decision boundary, since the most uncertain samples have the lowest probability of being correctly classified by the current classification model (Demir et al., 2011). When using a probabilistic model, uncertainty sampling simply selects the samples whose probability of being positive is around 0.5 (Settles, 2010). We define a margin consisting of upper and lower confidence scores, in a manner similar to that described in Li and Sethi (2006), and construct the set of RoI samples $D_{\mathrm{uncertain}}$ by selecting $2\eta$ samples inside the margin. That is, $\eta$ samples with the score $CS(x_i)$ closer to the upper boundary are selected from the upper half of the margin, and the remaining $\eta$ samples are selected from the lower half of the margin with the score $CS(x_i)$ closer to the lower boundary. We have a total of $2\eta$ samples in the uncertain-labeled sample set $D_{\mathrm{uncertain}}$. However, the uncertainty criterion may result in the selection of redundant or noisy samples. We therefore examine the uncertain-labeled samples to select representative and non-redundant uncertain samples as follows.

The diversity step employs k-means clustering to analyze the distribution of the uncertain-labeled samples and apply the diversity measure. Since the samples in the same cluster are similar, a representative sample should be selected from each cluster in order to reduce redundant samples. Our batch-mode ASSL also has the advantage of incorporating a diversity measure (Xu et al., 2003). A batch of samples satisfying the diversity criterion, $D_{\mathrm{divers}}$, is determined by selecting $\vartheta$ samples with more diverse properties among the $2\eta$ candidates ($\eta < \vartheta$), where the more informative samples are expected to be contained in $D_{\mathrm{uncertain}}$.
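As an illustration only, the confidence score of Eq. (1) and the margin-based uncertainty selection described above can be sketched in Python with NumPy. The function names, the margin bounds (0.3 and 0.7), and the toy data are assumptions made for this sketch; they are not taken from the authors' implementation, and the deep features are assumed to have been extracted already.

```python
import numpy as np


def confidence_scores(features, weights):
    """Softmax class probabilities, one row per sample (cf. Eq. (1)).

    features: (N, d) array of deep features x_i (assumed precomputed).
    weights:  (K, d) array whose rows are the class weight vectors W_c.
    """
    logits = features @ weights.T                         # entry (i, c) is W_c^T x_i
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilise exp()
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


def select_uncertain(scores, eta, lower=0.3, upper=0.7):
    """Return indices of up to 2*eta samples inside the margin [lower, upper].

    Up to eta samples come from the upper half of the margin (scores closest
    to `upper`) and up to eta from the lower half (scores closest to `lower`).
    The margin bounds are hypothetical defaults, not values from the paper.
    """
    scores = np.asarray(scores, dtype=float)
    mid = 0.5 * (lower + upper)
    idx = np.arange(len(scores))
    upper_half = idx[(scores >= mid) & (scores <= upper)]
    lower_half = idx[(scores >= lower) & (scores < mid)]
    # Sort each half so that the scores nearest its boundary come first.
    upper_pick = upper_half[np.argsort(upper - scores[upper_half])][:eta]
    lower_pick = lower_half[np.argsort(scores[lower_half] - lower)][:eta]
    return np.concatenate([upper_pick, lower_pick])


# Toy usage: random stand-ins for deep features and a two-class classifier.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 16))
W = rng.normal(size=(2, 16))              # row 0: background, row 1: object
cs = confidence_scores(feats, W)[:, 1]    # score of the object class
uncertain_idx = select_uncertain(cs, eta=20)
```

The pool returned by `select_uncertain` plays the role of $D_{\mathrm{uncertain}}$; it may contain fewer than $2\eta$ indices when too few scores fall inside the margin.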
The k-means clustering is applied in the deep feature space of the uncertain-labeled samples $D_{\mathrm{uncertain}}$ to define $\theta$ clusters, and $\vartheta/\theta$ samples are finally taken from each cluster.

Finally, we construct $D_{\mathrm{tr}}$ to include the SSL philosophy by initializing $D_\Delta$ with the sample $x_{\mathrm{top}} = \arg\max_{x_i \in D_{\mathrm{divers}}} CS(x_i)$. At each step, our sampling strategy chooses a sample from $D_{\mathrm{divers}}$ to add to $D_\Delta$, namely the sample that is most confident/similar with respect to $D_\Delta$ in terms of a similarity measure, i.e., $x_{\mathrm{top}} = \arg\min_{x_i \in D_{\mathrm{divers}} \setminus D_\Delta} \{\max_{x_j \in D_\Delta} d(x_i, x_j)\}$. (Here we use the Euclidean distance between two deep features to calculate $d(x_i, x_j)$.) When the cardinality of $D_\Delta$ reaches $\gamma$, the sample selection process is stopped, and the final sample set is $D_\Delta$. We retrain the deep convolutional network (with softmax classifier) after the selection of the pool of samples, and the process is repeated until there is no more data left or a convergence criterion is satisfied. The overall algorithm for ASSL is presented in Algorithm 1.

Algorithm 1. ASSL Collaborative Sample Selection.
Input: $\eta$, $\vartheta$, and $\gamma$ ($2\eta > \vartheta > \gamma$); confidently labeled dataset $D_{\mathrm{conf}}$ and imperfectly labeled dataset $D_{\mathrm{imp}} = \{x_i\}_{i=1}^{N}$ with $D_{\mathrm{conf}} \ll D_{\mathrm{imp}}$
Notations: Progressive batch dataset $D_\Delta$, and batch candidate dataset $D_{\mathrm{batch}}$ with $D_\Delta \ll D_{\mathrm{batch}}$.
Output: Optimally labeled dataset $D_\Delta$, target $L$ detection models $F = \{f^{(1)}, f^{(2)}, \ldots, f^{(L)}\}$
Method:
Step 1. Let $D_{\mathrm{tr}} = D_{\mathrm{conf}}$; train an initial softmax classifier $F_0$ using $D_{\mathrm{tr}}$.
Repeat
  Step 2. Select a batch of candidate samples $D_{\mathrm{batch}}$ from $D_{\mathrm{imp}}$; compute $CS(x_i)$, $x_i \in D_{\mathrm{batch}}$, using the current classifier $F_t$ (Eq. (1)).
  Step 3. Determine a batch dataset.
    1. Select the set of uncertain samples $D_{\mathrm{uncertain}}$ by defining upper and lower boundaries.
    2. Select the set of diversity samples $D_{\mathrm{divers}}$ by taking a total of $\vartheta$ samples from the clusters in the uncertainty sample set $D_{\mathrm{uncertain}}$.
    3. Initialize with the sample $x_{\mathrm{top}} = \arg\max_{x_i \in D_{\mathrm{divers}}} CS(x_i)$, $D_\Delta = \{x_{\mathrm{top}}\}$.
    4. $x_{\mathrm{top}} = \arg\min_{x_i \in D_{\mathrm{divers}} \setminus D_\Delta} \{\max_{x_j \in D_\Delta} d(x_i, x_j)\}$; $D_\Delta = D_\Delta \cup \{x_{\mathrm{top}}\}$.
    5. Repeat 4 until $|D_\Delta| = \gamma$.
  Step 4. Human annotator corrects incorrectly labeled samples in $D_\Delta$.
  Step 5. Retrain $F_t$ using $D_\Delta = D_{\mathrm{tr}} \cup D_\Delta$; $D_{\mathrm{imp}} = D_{\mathrm{imp}} - D_\Delta$.
Until a convergence criterion is satisfied or $D_{\mathrm{imp}} = \emptyset$.

The ASSL task would take $O(NLG)$ time simply to choose the next query and would require $O(C^2NLG)$ time, where $C$ is the number of class labels, $N$ the size of the unlabeled pool, $L$ the size of the current training set, and $G$ the number of gradient computations (Xu & Akella R., 2008).

5. Experiments

The main objectives of the experiments are to confirm the effectiveness of the framework in different domains and to analyze the performance of our proposed framework in learning detection models incrementally from imperfect data. In other words, we want to see whether the performance increases smoothly when new samples are added to the training set. Depending on how the data is presented to the model at the learning stage, each run of the framework on the same dataset shows variance in accuracy. Therefore, we run the same experiments multiple times with the same parameter settings and report the average of the results in this paper.

We carried out different kinds of experiments in order to do the following: (1) compare the proposed ASSL method with other methods used in sample selection, (2) evaluate the incremental learning, and (3) confirm the effectiveness of our framework in different domains including object detection, pedestrian detection and action recognition.

5.1. Dataset overview

We conduct extensive experiments on the ILSVRC detection dataset (Russakovsky et al., 2015), the Caltech pedestrian detection dataset (Dollar et al., 2011) and an online evaluation of activity recognition to verify the effectiveness of our framework. To build the initial model, we use the MS COCO detection dataset (Lin et al., 2014). Below we briefly describe the datasets used in our experiments.

MS COCO. The MS COCO dataset has 80 categories, fewer than ILSVRC (Russakovsky et al., 2015), but it has more instances per category, which allows more complex and general models to be learned. It contains around 82,700 training, 40,500 validation, and 40,700 testing images. Since MS COCO (Lin et al., 2014) consists of samples labeled with correct object localization, poses and object categories, we employ it as the confident dataset and use it to build an initial detection model.

ILSVRC. The ILSVRC detection dataset has 200 categories. There are three sets of images and labels in the ILSVRC2015 detection dataset (Russakovsky et al., 2015): training data (456,567), validation data (20,121) and test data (40,152), where the numbers in parentheses are the numbers of images in each set. The val and test sets are split from the same image distribution; in contrast, the training set is drawn from the ILSVRC classification image distribution. All instances in the val and test sets are fully annotated with bounding boxes. Unlike the validation and test sets, the training set is not fully annotated, is imbalanced, and contains many misclassified examples. The ILSVRC dataset (Russakovsky et al., 2015) is used as the imperfect dataset, and the semi-confident dataset is constructed from the imperfect dataset by applying the proposed ASSL method. We select 30,000 person images from the ILSVRC training data for ASSL learning, which are divided into 6 batches (5000 per batch). Around 200 images are randomly sampled from the ILSVRC dataset and used as a test set. We make several test sets and evaluate our framework on these test

sets and report mean of the results. We primarily evaluate from the crowd sourcing (Su, Deng, & Fei-Fei 2012)
detection Average Precision (AP) for ILSVRC, because method in deciding bounding boxes instead of simple
this is the actual metric for object detection. annotation method based on majority voting, i.e., consen-
Caltech pedestrian detection dataset. We also evaluated sus of multiple workers, (Deng et al. 2009; Sorokin &
effectiveness of our method on the Caltech dataset Forsyth 2008) for our object detection experiments. The
(Dollar et al., 2011), using subsets from set00 to set05 for challenge is how to obtain high annotation quality and
training and from set06 to set10 for testing. The Caltech proper coverage for both perfect and imperfect datasets
dataset is the dominant pedestrian detection datasets. It with minimum cost. The steps are box drawing, quality ver-
consists of approximately 10 h of 30 Hz video taken from ification, and coverage verification. In box drawing step,
a vehicle driving through street of Los Angeles, USA. the initial detection model is used to draw bounding boxes.
There are 350 k bounding boxes of around 2300 unique In the quality verification step, a worker decides whether a
pedestrians in 250 k frames. The Caltech evaluation devkit bounding box is correctly drawn (perfect) or not (imper-
provides several subsets of test set different aspect ratio, fect). In the third step, a second worker investigate whether
scale, occlusion and overlaps. Overall performance is eval- all objects in a scene have bounding boxes or not.
uated on the entire test set. In the reasonable evaluation
setting (Dollar et al., 2011), the performance is evaluated 5.3. Experiments on ILSVRC dataset
on unoccluded pedestrians over 50 pixels tall. We mainly
consider overall and reasonable subsets for evaluating In experiments on ILVRC dataset, we replaced 81 out-
our method. Readers are referred to (Dollar et al., 2011) put channels of VGG16 (Simonyan & Zisserman, 2015)
for detailed explanation of different types of evaluation set- pre-trained net on MS COCO dataset by 2 output channels
tings. We follow the evaluation rule of Caltech pedestrian for 2 classes of ILSVRC detection dataset (person and plus
detection benchmark (Dollar et al., 2011), that uses the one for background) and fine-tune it on ILSVRC person
log-average miss rate measured by averaging miss rate at data. In increment learning step, we continue previous
nine False Positive Per Image (FPPI) rates ranging from training on the semi-confident dataset which constructed
102 to 100. from the imperfect dataset using proposed ASSL method.
In increment learning phase, we used same learning rate
5.2. Experiment setup of 0.0001.
ASSL sampling parameters are selected by a grid search
Our experiments use publicly available ZF net (Zeiler & (Girshick et al., 2014) over g = {0.5, 0.6, . . . , 0.9}, # =
Fergus, 2014) that has 5 convolutional layers and 3 fully {0.4, 0.5, . . . , 0.8} c = {0.2, 0.3, . . . , 0.6} on validation set
connected layers and VGG16 (Simonyan & Zisserman, and set as uncertainty parameter g = 0.6, diversity param-
2015) that has 13 convolutional layers and 3 fully con- eter # = 0.5 and confidence parameter c = 0.3 which are
nected layers. This networks are pre-trained ImageNet the values give the best performance. As an optimization
for 1000 category classification task. For the detection sys- method for CNN training, we use mini-batch stochastic
tem ASSL employees the Faster-RCNN (Ren et al., 2015) gradient descent (SGD) with mini-batch size of 256 exam-
and ASSL frameworks is implemented on the popular deep ples, weight decay of 0.0005 and the momentum of 0.9.
learning tool Caffe (Jia et al., 2014). All implementations
are on a single server with cuDNN (Chetlur & Woolley, 5.3.1. Comparison with the proposed ASSL with other
2014) and a single NVIDIA GeForce GTX 970. sampling methods
For all datasets, we first fine-tune a pre-trained Ima- Fig. 3 shows the averaged detection accuracy versus the
geNet model on MS COCO dataset (Lin et al., 2014) to number of semi-labeled samples obtained in the 1st and 6th
make an initial model. Since categories on MS COCO batch experiments. We compare ASSL with random sam-
are is superset of ILSVRC (Russakovsky et al., 2015), we pling, uncertainty sampling and confidence sampling. At
can transfer previous knowledge COCO to ILSVRC and the first batches model is not stable, but on the following
Caltech (Dollar et al., 2011). We fine-tune only from the batches becomes more stable. As shown in Fig 3, our
conv3_1 convolutional layer and above. To train deep con- method performs better than compared methods; uncer-
volution neural network, we follow the guidelines described tainty, confidence and random sampling. In Fig 3, orange
in (Ren et al., 2015). The 1000 output channels classifica- color indicates detection accuracy when we use all training
tion layer of the ImageNet model is replaced with 81 out- samples without any sampling method. Our proposed
put channels classification layer for the 80 classes of MS ASSL sampling provides almost the same detection perfor-
COCO detection dataset plus one for background. We mance with when all training samples used even it used few
trained the pre-trained ImageNet model for 240 k itera- small number of training samples than one used all sam-
tions with a learning rate of 0.001 and then for 80 k itera- ples. On the all batches, ASSL method is able to signifi-
tions with 0.0001 on MS COCO dataset. Completing this cantly increase the detection accuracy.
step we get an initial detection model of our framework. As described in Section 3, since all of the samples in
We spilt the training dataset into the confidently labeled training set are not important for efficient training, ASSL
dataset and imperfectly dataset using a method modified selects only informative samples by combining AL and
Fig. 2. Some examples of imperfect samples in the training set. Left: An example of missing labels of an object. Left middle: Mislabeled sample. Right
middle: An example of missing labels and mislabeling of multiple objects. In this example, computer keyboard, computer mouse, table and lamp’s labels
are missing and a laptop is mislabeled as TV monitor. Right: An example of a noisy sample.

Fig. 3. Average detection accuracy of the 1st and 6th batches over 5 runs provided by random, uncertainty, confidence and the proposed ASSL sampling for the person category of the ILSVRC dataset on each batch. (Higher curves indicate better performance.)
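
The diversity-driven part of the ASSL sampler compared in Fig. 3 seeds the selected set with the highest-scoring candidate and then repeatedly adds the candidate whose largest d-value to the already selected samples is smallest. A minimal sketch of that greedy step follows. It is an illustration only, not the authors' implementation: the function names, the treatment of d as a cosine similarity, and the toy feature vectors and scores are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (assumed choice for d)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_diverse(features, scores, budget, d=cosine_similarity):
    """Greedy diverse selection over candidate indices.

    features: one feature vector per candidate (hypothetical input)
    scores:   one score CS(x_i) per candidate
    budget:   maximum number of samples to select
    """
    if not features or budget <= 0:
        return []
    remaining = set(range(len(features)))
    # Seed the selected set with the highest-scoring candidate.
    first = max(remaining, key=lambda i: scores[i])
    selected = [first]
    remaining.remove(first)
    while remaining and len(selected) < budget:
        # Add the candidate whose largest similarity to the selected set is smallest.
        nxt = min(
            remaining,
            key=lambda i: max(d(features[i], features[j]) for j in selected),
        )
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


# Toy data: candidates 0 and 1 are near-duplicates, 2 is orthogonal to 0.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
toy_scores = [0.9, 0.8, 0.5, 0.6]
chosen = select_diverse(toy_features, toy_scores, budget=3)
```

On the toy data the near-duplicate candidate 1 is skipped in favour of candidates 2 and 3, which is the behaviour the diversity criterion is meant to produce.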

SSL based on the collaborative sample selection. In Fig. 4, we show some qualitative examples indicating that the proposed ASSL selects more informative samples than random sampling.
Average detection performance on each batch is reported in Table 1 and illustrated in Fig. 5. As shown in Table 1, our method incrementally improves detection performance at each batch and outperforms the compared methods.

5.3.2. Evaluation of incremental learning

In this experiment, we analyze the effectiveness of incremental learning. At the beginning, a detection may be misclassified, be poorly localized, or be correctly classified with a low probability score by the detection model. However, in our framework, the models continue to improve at each batch and can later correctly classify the same misclassified activities with a higher probability score or provide good localization. In Table 2, we report top false positive

Fig. 4. An illustrative comparison between random sampling and our proposed ASSL sampling method. We show four selected examples by random
sampling and ASSL sampling method for person category of ILSVRC dataset. Random sampling selects ill-posed or noisy samples, whereas ASSL
sampling selects samples with clear poses.
Table 1
Comparison of average detection performance between different sampling methods: random, uncertainty, confidence and the proposed ASSL sampling for the person category of the ILSVRC dataset. (Higher result indicates better performance.)
Method        Batch0  Batch1  Batch2  Batch3  Batch4  Batch5  Batch6
ASSL          64.1    69.4    70.8    71.2    71.6    72.2    72.7
Random        64.1    67.7    68.2    68.9    69.6    70.3    70.5
Uncertainty   64.1    68.2    68.7    70.2    70.1    71.6    71.7
Confidence    64.1    69.0    69.2    69.5    70.7    71.4    71.6
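
To make the trend in Table 1 concrete, the short script below computes each method's gain from Batch0 to Batch6 and the final margin of ASSL over the best competing sampler. The numbers are copied from Table 1; the variable names are only illustrative.

```python
# Average detection performance per batch, copied from Table 1.
table1 = {
    "ASSL": [64.1, 69.4, 70.8, 71.2, 71.6, 72.2, 72.7],
    "Random": [64.1, 67.7, 68.2, 68.9, 69.6, 70.3, 70.5],
    "Uncertainty": [64.1, 68.2, 68.7, 70.2, 70.1, 71.6, 71.7],
    "Confidence": [64.1, 69.0, 69.2, 69.5, 70.7, 71.4, 71.6],
}

# Gain of each method between the initial model (Batch0) and Batch6.
gains = {name: round(vals[-1] - vals[0], 1) for name, vals in table1.items()}

# Margin of ASSL over the best other sampler at Batch6.
best_other = max(vals[-1] for name, vals in table1.items() if name != "ASSL")
margin = round(table1["ASSL"][-1] - best_other, 1)

for name, gain in gains.items():
    print(f"{name}: +{gain}")
print(f"ASSL margin over best other method at Batch6: {margin}")
```

ASSL gains 8.6 points over the six batches, against 6.4 for random sampling, 7.6 for uncertainty sampling and 7.5 for confidence sampling, and finishes 1.0 point ahead of the best competing sampler.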

Fig. 5. Average detection accuracy over 6 batches with 5 runs provided by different sampling methods; random, uncertainty, confidence and the proposed
ASSL sampling for person category of ILSVRC dataset.

Table 2
Top false positive rates on each batch. Loc – poor localization, Sim – confusion with similar objects, Other – confusion with other objects and BG – confusion with the background. (Lower result indicates better performance.)
Batch    Total  Loc    Sim    Other  BG
Batch1   1090   0.131  0.005  0.077  0.787
Batch2   1072   0.127  0.002  0.076  0.796
Batch3   1070   0.136  0.003  0.078  0.783
Batch4   1075   0.13   0.002  0.081  0.787
Batch5   1068   0.13   0.002  0.082  0.768
Batch6   1067   0.118  0.003  0.087  0.792

rates on each batch. At the beginning of learning, there are 1090 false positives in total. After incrementally updating the model on the semi-confident dataset constructed by ASSL, this number decreases to 1067. We can observe that the model becomes smarter in later batches. To calculate the false positive impact and rate, we applied the detection analysis tool from Hoiem, Chodpathumwan, and Dai (2012). Readers are referred to Hoiem et al. (2012) for more details about the analysis tool.
Figs. 6 and 7 report the comparison of false positive types between the first batch and the last batch. As shown in Fig. 6, the correct detection rate increases from 58% on batch 1 to 63% on batch 6. This confirms the effectiveness of incremental learning.

5.4. Experiment results on Caltech dataset

In experiments on the Caltech pedestrian dataset, we replaced the 81 output channels of the VGG16 (Simonyan & Zisserman, 2015) model pre-trained on the MS COCO dataset (Lin et al., 2014) by 2 output channels (pedestrian and background) and fine-tuned it on the Caltech train set. As in the experiments on the ILSVRC dataset, ASSL sampling parameters are selected by a grid search and set to uncertainty parameter g = 0.8, diversity parameter # = 0.6 and confidence parameter c = 0.5, which are the values that give the best detection performance. Other settings are the same as the experiment settings on the ILSVRC dataset.

5.4.1. Comparison of the proposed ASSL with other sampling methods

We compare the performance of our method with different sampling methods on the Caltech test set (Dollar et al., 2011), including random, most uncertain and confidence sampling. The comparison of log-average miss rate under different evaluation settings on the Caltech pedestrian detection dataset is summarized in Table 3. Our initial model, which is the VGG16 (Simonyan & Zisserman, 2015) model fine-tuned on the MS COCO (Lin et al., 2014) dataset, achieves a miss rate of 31.4% in the reasonable setting. Incremental learning with random sampling achieves a 25.3% miss rate. The detection results can be improved to a miss rate of 12.2% by applying the proposed ASSL method.
In the overall evaluation setting, the miss rates of all methods increase because the overall setting uses all images in the test set, including very small pedestrians (such as 30 pixels tall) and pedestrians occluded by more than 75%. The performance measurements shown in Table 3 are described in Section 5.1. Our initial model, which is fine-tuned on the MS COCO dataset,
Fig. 6. Evaluation of top-ranked false positives in incremental learning for (a) batch 1 and (b) batch 6. The pie charts show that the percentage of correct (Cor) labeling improves after several batches. Pie charts: fraction of top-ranked false positives that are due to poor localization (Loc), confusion with similar objects (Sim), confusion with other objects (Oth) or confusion with background or unlabeled objects (BG).

Fig. 7. Comparison of top-ranked false positive types between (a) batch 1 and (b) batch 6. Cor – correctly detected, Loc – poor localization (a detection with IoU overlap with the correct label between 0.1 and 0.5), Sim – confusion with similar objects, BG – confusion with background and Oth – confusion with other objects. Line plots show recall.

Table 3
Comparison of log-average miss rate (%) on the Caltech dataset between the ASSL method and different sampling methods. (Lower result indicates better performance.)
                                  Scale                       Aspect ratio             Occlusion             Overlap
Sampling     Overall  Reasonable  Far    Large  Medium  Near  All   Atypical  Typical  Heavy  None  Partial  25    50    75
Initial      75.6     31.4        100    4.9    71.7    9.1   28.8  34.9      27.4     88     28.8  49.5     24.5  31.4  73.9
Random       73.9     25.3        99.8   2.8    69.9    4.4   22.8  34.3      20.4     71.5   22.8  44.2     22.6  25.3  69.4
Confidence   71.4     18.7        99.1   1.1    67.5    3.9   16.7  24        15.1     66.7   16.7  34.9     14.9  18.7  68.4
Uncertainty  69.2     17          98.9   1.1    65.1    3.8   15.2  23.6      13.3     59.6   15.2  32.8     13.5  17    66.3
ASSL         62.5     12.2        98.8   1      54.2    2.7   9.7   19.2      7.7      54.7   9.7   29.2     8.5   12.2  66
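
The log-average miss rate reported in Table 3 summarizes a miss rate versus FPPI curve by sampling it at nine FPPI values between 10^-2 and 10^0. The sketch below shows one way to compute it. The source only states that the miss rate is averaged at nine FPPI rates in that range; the evenly log-spaced reference points, the geometric mean, the rule for sampling the curve and the function name are assumptions that follow the usual Caltech benchmark convention rather than details confirmed by this paper.

```python
import math


def log_average_miss_rate(curve, num_points=9, lo_exp=-2.0, hi_exp=0.0):
    """Summarize a detector's curve as a single log-average miss rate.

    curve: iterable of (fppi, miss_rate) operating points, miss_rate in [0, 1].
    """
    points = sorted(curve)
    step = (hi_exp - lo_exp) / (num_points - 1)
    # Reference FPPI values evenly spaced in log space (assumed convention).
    refs = [10 ** (lo_exp + i * step) for i in range(num_points)]
    sampled = []
    for ref in refs:
        # Use the operating point with the largest FPPI not exceeding ref;
        # if the curve never reaches that FPPI, count it as a miss rate of 1.
        reachable = [mr for fppi, mr in points if fppi <= ref]
        sampled.append(reachable[-1] if reachable else 1.0)
    # Geometric mean, with a floor to avoid log(0).
    log_sum = sum(math.log(max(mr, 1e-10)) for mr in sampled)
    return math.exp(log_sum / len(sampled))


# Hypothetical two-point curve, for illustration only.
toy_curve = [(0.005, 0.8), (0.5, 0.2)]
lamr = log_average_miss_rate(toy_curve)
```

Averaging in log space keeps the low-FPPI end of the curve from being swamped by the high-FPPI end, which is why a geometric mean is the usual choice for this metric.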

achieves a miss rate of 75.6%, while our final result provided by ASSL achieves a significantly better miss rate of 62.5%. Miss rate versus false positives per image curves in the reasonable and overall evaluation settings are shown in Fig. 8.

5.4.2. Comparison with state-of-art methods

We compare the performance of the proposed ASSL method with existing state-of-the-art methods on the Caltech test set, including DeepCascade+ (Angelova et al., 2015), Katamari (Benenson, Omran, Hosang, & Schiele, 2015), TA-CNN (Tian, Luo, Wang, & Tang, 2015), SpatialPooling+ (Paisitkriangkrai, Shen, & Hengel, 2014), SCF + AlexNet (Hosang, Omran, Benenson, & Schiele, 2015), SCCPriors (Yang, Wang, & Wu, 2015), Checkerboards (Zhang et al., 2015), CCF + CF (Yang, Yan, Lei, & Li, 2015) and CCF (Yang et al., 2015). The comparison of log-average miss rate between the proposed ASSL and the

Fig. 8. The comparison of ASSL with different sampling methods and baseline result on Caltech pedestrian dataset (lower curves indicate better
performance). (a) Log-average miss rate in reasonable evaluation setting (unoccluded pedestrians over 50 pixels tall). (b) Overall performance on all
images in test set.

state-of-the-art methods is reported in Table 4. We evaluate the comparison under different evaluation settings on the Caltech pedestrian test set. Among these settings, we illustrate the comparison with state-of-the-art methods under the reasonable and overall evaluation settings in Fig. 9.
It can be observed that our method outperforms other methods on most test subsets. ASSL achieves the lowest log-average miss rate of 12.2%, which is lower than that of the current state-of-the-art method Checkerboards (Zhang et al., 2015) by 4.9% in the reasonable evaluation setting.

5.5. Online evaluation of ASSL for action recognition

Results on the ILSVRC object detection and Caltech pedestrian detection datasets show that the ASSL framework can significantly outperform many other models on object detection and pedestrian detection tasks and confirm that the proposed method can be applied to a wide variety of application domains including cognitive systems, automotive applications and surveillance systems. In this section, we show that the ASSL method can be applied to solving the imperfect data problem in the human action recognition task. In the online evaluation of ASSL for action recognition, we define five types of human actions: jumping, phoning, reading, taking a photo, and using a computer. We collected around 7000 frames containing the defined actions using a web camera. In the training phase, ASSL selects the most informative samples from the collected unlabeled data and then incrementally updates the model using the selected samples. After completing the training stage, we evaluate our model in real-time action recognition. As in the previous experiments, ASSL sampling parameters are selected by a grid search and set to uncertainty parameter g = 0.8, diversity parameter # = 0.7 and confidence parameter c = 0.4, which are the values that give the best performance. We illustrate some example snapshots taken while evaluating real-time online action recognition in Fig. 10.
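
The parameter selection used in these experiments, a grid search over the uncertainty, diversity and confidence parameters that keeps the combination scoring best on validation data, can be sketched as follows. The grids are the ones stated for the ILSVRC experiments in Section 5.3. The function `grid_search`, the argument names and the scoring callback are hypothetical stand-ins, since the text does not specify how a candidate setting is scored beyond validation performance; the toy scoring function exists only to make the example runnable.

```python
from itertools import product


def grid_search(score_fn, uncertainty_grid, diversity_grid, confidence_grid):
    """Return the parameter triple with the highest score and that score."""
    best_params, best_score = None, float("-inf")
    for params in product(uncertainty_grid, diversity_grid, confidence_grid):
        score = score_fn(*params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Candidate values as listed in Section 5.3.
uncertainty_grid = [0.5, 0.6, 0.7, 0.8, 0.9]
diversity_grid = [0.4, 0.5, 0.6, 0.7, 0.8]
confidence_grid = [0.2, 0.3, 0.4, 0.5, 0.6]


def toy_validation_score(uncertainty, diversity, confidence):
    # Placeholder for "train with these sampling parameters, then measure
    # validation performance"; it simply peaks at (0.8, 0.7, 0.4).
    return -(abs(uncertainty - 0.8) + abs(diversity - 0.7) + abs(confidence - 0.4))


best, best_score = grid_search(
    toy_validation_score, uncertainty_grid, diversity_grid, confidence_grid
)
```

With 5 values per parameter the search evaluates 125 settings, so in practice the cost is dominated by whatever training and validation run hides behind the scoring callback.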

Table 4
Comparison with state-of-art methods in terms of log-average miss rate (%). (Lower result indicates better performance.)
                                       Scale                       Aspect ratio             Occlusion             Overlap
Methods           Overall  Reasonable  Far    Large  Medium  Near  All   Atypical  Typical  Heavy  None  Partial  25    50    75
DeepCascade+      71.9     26.2        100    4      64.8    7.3   23.2  31.4      21.4     82.2   23.2  47.7     21.3  26.2  73.9
Katamari          71.3     22.5        100    6.8    63.1    9.8   20.1  30.2      18.1     84.4   20.1  41.7     18.5  22.5  62.6
TA-CNN            71.2     20.9        100    7      63.6    8     19    26.8      15.7     70.4   18.6  32.8     16.8  20.9  60
SpatialPooling+   71.1     21.9        100    7.7    63.4    8     19.5  26.8      17.8     78.2   19.5  39.2     18.4  21.9  61.1
SCF + AlexNet     70.3     23.3        100    7      62.3    10.6  20    30.5      17.8     74.6   20    48.5     20.2  23.3  58.9
SCCPriors         70.3     21.9        100    8.8    61.3    8     19.3  26.7      17.6     80.9   19.3  41.3     17.4  21.9  61.6
Checkerboards+    67.7     17.1        100    2.4    58      4.9   15.1  20.8      13.4     77.9   15.1  31.3     13.3  17.1  55.1
CCF + CF          68.6     17.3        100    4.2    59.6    5.5   14.6  22.6      12.8     72.7   14.6  37.7     13.1  17.3  83.4
CCF               66.7     18.7        100    2.9    56.3    4.7   15.9  25        14.2     72.4   14.6  40.6     14.8  18.7  84.7
ASSL              62.5     12.2        98.8   1      54.2    2.7   9.7   19.2      7.7      54.7   9.7   29.2     8.5   12.2  66

Fig. 9. The comparison of ASSL with state-of-art methods on Caltech pedestrian dataset (lower curves indicate better performance). (a) Log-average miss
rate in reasonable evaluation setting (unoccluded pedestrians over 50 pixels tall). (b) Overall performance on all images in test set.

Fig. 10. Example snapshots of real-time online evaluation of action recognition.

Table 5
Action recognition accuracy in online evaluation.
Method    Avg. acc      Jumping         Phoning       Reading       Taking a photo  Using a computer
Baseline  56.5 ± 0.7%   67.1 ± 0.8%     51.1 ± 0.6%   58.3 ± 0.2%   50.4 ± 0.9%     55.6 ± 0.3%
Random    74.8 ± 0.4%   78.5 ± 0.8%     75.9 ± 0.1%   76.8 ± 0.3%   72.9 ± 0.6%     69.9 ± 0.4%
ASSL      82.2 ± 0.8%   86.4 ± 00.0.4%  75.2 ± 0.9%   84.9 ± 0.7%   79.7 ± 0.5%     84.7 ± 0.6%
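
Each entry in Table 5 comes from dividing the number of correctly recognized frames by the number of frames presented in a run, then summarizing over repeated runs. A minimal sketch of that bookkeeping is given below. The function names and the toy labels are illustrative, and reporting the spread as a sample standard deviation is an assumption: the source does not say what the ± values represent.

```python
from statistics import mean, stdev


def run_accuracy(predicted, actual):
    """Fraction of presented frames whose predicted action matches the label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def summarize_runs(run_accuracies):
    """Mean accuracy over runs plus the sample standard deviation (assumed spread)."""
    spread = stdev(run_accuracies) if len(run_accuracies) > 1 else 0.0
    return mean(run_accuracies), spread


# Toy example: one run of four frames, then a summary over two hypothetical runs.
acc = run_accuracy(
    ["jumping", "reading", "phoning", "reading"],
    ["jumping", "reading", "reading", "reading"],
)
avg, spread = summarize_runs([0.8, 0.9])
```

Repeating the run several times and averaging is what makes the comparison between the baseline, random selection and ASSL meaningful when individual runs vary.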

Depending on the data presented to the model in the learning phase, each run of the framework on the same dataset shows variance in accuracy. Therefore, we run the same experiments multiple times with the same parameter setting and report the average of the results in this paper. We present over 200 frames in each run and performed 10 runs in the online evaluation. To obtain the accuracy, we divide the number of correctly recognized actions by the total number of frames presented to the detector. The online evaluation result is shown in Table 5. We compare our ASSL method against the baseline and random selection. ASSL outperforms random selection and gives a significant improvement compared to the baseline.
In Fig. 11, we show the performance of our framework on each action category. Each group of stacked bars shows the performance for one action category. Each group contains three bars corresponding to baseline accuracy, random sampling accuracy and proposed ASSL method accuracy.

Fig. 11. Action-wise performance analysis comparing the proposed ASSL with the baseline result and random sampling in online evaluation.
6. Discussion and conclusion

In this paper we presented a novel combination of active learning (AL) and semi-supervised learning (SSL), called ASSL, to leverage the strong points of both learning paradigms for improving the performance of object detection. Considering the pros and cons of the AL and SSL methods, ASSL uses the SSL method to provide incremental improvement of semi-supervised detection performance, combined with the concept of diversity imported from AL methods. We carried out extensive experiments on the ILSVRC detection dataset and the Caltech pedestrian detection dataset.
These experiments lead us to several observations that will be useful in developing future detection systems based on imperfect data. First, the results show that it is possible to achieve detection performance similar to that obtained with fully labeled data, even when only a small fraction of the training data is used in the training set. Second, as a practical matter, the experiments show that active and semi-supervised learning can be applied to an existing detector that was originally designed for supervised training. Experiment results show that the ASSL framework can significantly outperform many other models on object detection and pedestrian detection tasks and confirm that the proposed method can be applied to a wide variety of application domains including cognitive systems, automotive applications, vision-based intelligent systems and surveillance systems.

Acknowledgement

This work was supported by an Inha University research grant. GPUs used in this research were generously donated by NVIDIA Corporation.

References

Angelova, A., Krizhevsky, A., Vanhoucke, V., Ogale, A., & Ferguson, D. (2015). Real-time pedestrian detection with deep network cascades. BMVC, 2015, 1–12, Retrieved from <http://www.vision.caltech.edu/anelia/publications/Angelova15RealTimePedestrian.pdf>.
Benenson, R., Omran, M., Hosang, J., & Schiele, B. (2015). Ten years of pedestrian detection, what have we learned? Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), 8926, 613–627. http://dx.doi.org/10.1007/978-3-319-16181-5_47.
Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798–1828. http://dx.doi.org/10.1109/TPAMI.2013.50.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(421), 123–140. http://dx.doi.org/10.1007/BF00058655.
Brinker, K. (2003). Incorporating diversity in active learning with support vector machines. Strategy, 20, 59, Retrieved from <http://www.aaai.org/Papers/ICML/2003/ICML03-011.pdf>.
Chapelle. (2006). Semi-supervised learning. Interdisciplinary sciences computational life sciences (Vol. 1). doi:http://dx.doi.org/10.1007/s12539-009-0016-2.
Chetlur, S., & Woolley, C. (2014). cuDNN: Efficient primitives for deep learning. arXiv Preprint arXiv: . . ., 1-9. Retrieved <http://arxiv.org/abs/1410.0759>.
Culotta, A., & Mccallum, A. (2005). Reducing labeling effort for structured prediction tasks. Proceedings of the National Conference on Artificial Intelligence, 20(2), 746, Retrieved from <http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Reducing+labeling+effort+for+structured+prediction+tasks#0>.
Dagan, I., & Engelson, S. P. (1995). Committee-based sampling for training probabilistic classifiers. Proceedings of the Twelfth International Conference on Machine Learning, 150–157. http://doi.org/10.1.1.30.6148.
Demir, B., Persello, C., & Bruzzone, L. (2011). Batch-mode active-learning methods for the interactive classification of remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 49(3), 1014–1031. http://dx.doi.org/10.1109/TGRS.2010.2072929.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR09.
Dollar, P., Wojek, C., Schiele, B., & Perona, P. (2011). Pedestrian detection: An evaluation of the state of the art. IEEE Transaction on Pattern Analysis and Machine Intelligence, 1–2.
Felzenszwalb, P. F., Girshick, R. B., Mcallester, D., & Ramanan, D. (2009). Object detection with discriminatively trained part based models, doi:http://dx.doi.org/10.1109/TPAMI.2009.167.
Fergus, R., Fei-Fei, L., Perona, P., & Zisserman, A. (2005). Learning object categories from Google's image search. In Tenth IEEE international conference on computer vision (ICCV'05) volume 1 (Vol. 2, pp. 1816–1823). doi:10.1109/ICCV.2005.142.
Girshick, R. (2015). Fast R-CNN, doi:http://dx.doi.org/10.1109/ICCV.2015.169.
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 580–587), doi:http://dx.doi.org/10.1109/CVPR.2014.81.
Hasan, M., & Roy-Chowdhury, A. K. (2015). A continuous learning framework for activity recognition using deep hybrid feature models. IEEE Transactions on Multimedia, 17(11), 1909–1922. http://dx.doi.org/10.1109/TMM.2015.2477242.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1904–1916. http://dx.doi.org/10.1109/TPAMI.2015.2389824.
Heckman, J. (1979). Sample selection bias as a specification error. Econometrica, 47(1), 153–161. http://dx.doi.org/10.2307/1912352.
Hinton, G. E. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507. http://dx.doi.org/10.1126/science.1127647.
Hoiem, D., Chodpathumwan, Y., & Dai, Q. (2012). Diagnosing error in object detectors. In Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics) (Vol. 7574 LNCS(PART 3), pp. 340–353), doi:http://dx.doi.org/10.1007/978-3-642-33712-3_25.
Hosang, J., Omran, M., Benenson, R., & Schiele, B. (2015). Taking a deeper look at pedestrians. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (Vol. 07-12-June, pp. 4073–4082), doi:http://dx.doi.org/10.1109/CVPR.2015.7299034.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., . . ., Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM international conference on multimedia (pp. 675–678), doi:http://dx.doi.org/10.1145/2647868.2654889.
Kapoor, A., Grauman, K., Urtasun, R., & Darrell, T. (2007). Active learning with gaussian processes for object categorization. In 2007 IEEE 11th international conference on computer vision, doi:http://dx.doi.org/10.1109/ICCV.2007.4408844.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 1–9. http://dx.doi.org/10.1016/j.protcy.2014.09.007.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation. http://dx.doi.org/10.1162/neco.1989.1.4.541.
Lewis, D. D., & Catlett, J. (1994). Heterogeneous uncertainty sampling for supervised learning. In Proceedings of the 11th international conference on machine learning (ICML'94) (pp. 148–156) <http://www.cs.brynmawr.edu/cs372/LeC94.pdf>.
Lewis, D. D., & Gale, W. A. (1994). A sequential algorithm for training text classifiers. In Proceedings of the 17th international conference on research and development in information retrieval (SIGIR'94) (pp. 3–12), doi:http://dx.doi.org/10.1145/219587.219592.
Li, M., & Sethi, I. K. (2006). Confidence-based active learning, 28(8), 1251–1261.
Li, L. J., & Fei-Fei, L. (2010). OPTIMOL: Automatic online picture collection via incremental model learning. International Journal of Computer Vision, 88(2), 147–168. http://dx.doi.org/10.1007/s11263-009-0265-6.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., . . ., Zitnick, C. L. (2014). Microsoft COCO: common objects in context. In Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics) (Vol. 8693 LNCS, pp. 740–755), doi:http://dx.doi.org/10.1007/978-3-319-10602-1_48.
Lindenbaum, M., Markovitch, S., & Rusakov, D. (2004). Selective sampling for nearest neighbor classifiers. Machine Learning, 54(2), 125–152. http://dx.doi.org/10.1023/B:MACH.0000011805.60520.fe.
Lu, Z., Wu, X., & Bongard, J. C. (2015). Active learning through adaptive heterogeneous ensembling. IEEE Transactions on Knowledge and Data Engineering, 27(2), 368–381. http://dx.doi.org/10.1109/TKDE.2014.2304474.
Makantasis, K., Doulamis, A., Doulamis, N., & Psychas, K. (2016). Deep learning based human behavior recognition in industrial workflows. In IEEE international conference on image processing (ICIP), 2016 (pp. 1609–1613). IEEE.
Mitchell, T., & Blum, A. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on computational learning theory (pp. 92–100). doi:http://dx.doi.org/10.1145/279943.279962.
Paisitkriangkrai, S., Shen, C., & Hengel, A. Van Den. (2014). Pedestrian detection with spatially pooled features and structured ensemble learning, 1–19, doi:http://dx.doi.org/10.1109/TPAMI.2015.2474388, arXiv Preprint arXiv:1409.5209.
Persello, C., & Bruzzone, L. (2011). Active and semi-supervised learning for the classification of remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 52(11), 6937–6956, doi.org/81800c 10.1117/12.898483.
Qi, G.-J., Hua, X.-S., Rui, Y., Tang, J., & Zhang, H.-J. (2008). Two-Dimensional Active Learning for image classification. In IEEE conference on computer vision and pattern recognition (pp. 1–8), doi:http://dx.doi.org/10.1109/CVPR.2008.4587383.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Nips, 1–10. http://dx.doi.org/10.1016/j.nima.2015.05.028.
Roy, N., & McCallum, A. (2001). Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the 18th international conference on machine learning (pp. 441–448). Retrieved from <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.28.9963&rep=rep1&type=pdf>.
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., & LeCun, Y. (2013). OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv Preprint arXiv:1312.6229. Retrieved from <http://arxiv.org/abs/1312.6229>.
Settles, B. (2010). Active learning literature survey. Machine Learning, 15(2), 201–221. http://doi.org/10.1.1.167.4245.
Seung, H. S., Opper, M., & Sompolinsky, H. (1992). Query by committee. Proceedings of the fifth annual workshop on computational learning theory - COLT'92, 287–294. http://dx.doi.org/10.1145/130385.130417.
Siddiquie, B., & Gupta, A. (2010). Beyond active noun tagging: Modeling contextual interactions for multi-class active learning. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 2979–2986), doi:http://dx.doi.org/10.1109/CVPR.2010.5540044.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. ImageNet Challenge, 1–10. http://dx.doi.org/10.1016/j.infsof.2008.09.005.
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. Iclr, 1–14. http://dx.doi.org/10.1016/j.infsof.2008.09.005.
Sorokin, A., & Forsyth, D. (2008). Utility data annotation with Amazon Mechanical Turk. In 2008 IEEE Computer society conference on computer vision and pattern recognition workshops, CVPR workshops, doi:http://dx.doi.org/10.1109/CVPRW.2008.4562953.
Su, H., Deng, J., Fei-Fei, Li. (2012). Crowdsourcing annotations for visual object detection. Human Computation AAAI Technical Report WS-12-08.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., . . ., Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition, 07-12-June (pp. 1–9), doi:http://dx.doi.org/10.1109/CVPR.2015.7298594.
Tian, Y., Luo, P., Wang, X., & Tang, X. (2015). Pedestrian detection aided by deep learning semantic tasks. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (Vol. 07-12-June, pp. 5079–5087). doi:http://dx.doi.org/10.1109/CVPR.2015.7299143.
Tong, S., & Koller, D. (2001). Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 45–66. http://dx.doi.org/10.1162/153244302760185243.
Uijlings, J. R. R., Van De Sande, K. E. A., Gevers, T., & Smeulders, A. W. M. (2013). Selective search for object recognition. International Journal of Computer Vision, 104(2), 154–171. http://dx.doi.org/10.1007/s11263-013-0620-5.
Vijayanarasimhan, S., & Grauman, K. (2009). Multi-level active prediction of useful image annotations for recognition. Advances in Neural Information Processing Systems, 21, 1705–1712.
Welinder, P., & Perona, P. (2010). Online crowdsourcing: Rating annotators and obtaining cost-effective labels. In 2010 IEEE computer society conference on computer vision and pattern recognition - workshops, CVPRW 2010 (pp. 25–32). doi:http://dx.doi.org/10.1109/CVPRW.2010.5543189.
Xu, Z., & Akella, R. (2008). Active relevance feedback for difficult queries. In Proceedings of the 17th ACM conference on Information and knowledge management, Napa Valley, California, USA (pp. 459–468).
Xu, Z., Yu, K., Tresp, V., Xu, X., & Wang, J. (2003). Representative sampling for text classification using support vector machines. In Proceedings of ECIR-03, 25th European conference on information retrieval (pp. 393-407). Retrieved from <http://link.springer.de/link/service/series/0558/papers/2633/26330393.pdf>.
Xu, L., Li, B., & Chen, E. (2012). Ensemble pruning via constrained eigen-optimization. In Proceedings - IEEE international conference on data mining, ICDM (pp. 715–724). doi:http://dx.doi.org/10.1109/ICDM.2012.97.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... Fei- Yang, B., Yan, J., Lei, Z., & Li, S. Z. (2015). Convolutional channel
Fei, L. (2015). ImageNet large scale visual recognition challenge. features. In ICCV, doi:http://dx.doi.org/10.1109/ICCV.2015.18.
International Journal of Computer Vision, 115(3), 211–252. http://dx. Yang, Y., Wang, Z., & Wu, F. (2015). Exploring prior knowledge for
doi.org/10.1007/s11263-015-0816-y. pedestrian detection. In BMVC2015 (pp. 1–12).
Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. In Twenty-first international conference on machine learning - ICML ’04 (p. 114). doi:http://dx.doi.org/10.1145/1015330.1015425.
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer vision–ECCV 2014 (Vol. 8689, pp. 818–833), doi:http://dx.doi.org/10.1007/978-3-319-10590-1_53, arXiv:1311.2901v3 [cs.CV] 28 Nov 2013.
Zhang, S., Benenson, R., & Schiele, B. (2015). Filtered channel features for pedestrian detection. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (Vol. 07–12-June, pp. 1751–1760). doi:http://dx.doi.org/10.1109/CVPR.2015.7298784.
Zhu, X. (2008). Semi-supervised learning literature survey contents. Sciences New York, 10(1530), 10, http://doi.org/10.1.1.146.2352.
