
2017 2nd International Conference on Image, Vision and Computing

A Novel Approach for Visual Saliency Detection and Segmentation Based on Objectness and Top-down Attention

Yang Xu, Jun Li, Jianbin Chen, Guangtian Shen, Yangjian Gao
Automation School, Chongqing University, Chongqing, China
e-mail: {xuyang, lijun_777, chenjianbin, shenguangtian, gaoyangjian}@cqu.edu.cn

Abstract-Visual saliency detection is usually a prerequisite for image processing tasks such as object segmentation, object recognition and information compression. In this work, we first present a novel method that combines cognitive-based objectness and image-based saliency to obtain a better saliency map with negligible background information. Then, by introducing several top-down attention priors, we further propose a computational selective attention model for the object segmentation task on the saliency map. Experiments show that our method clearly refines the saliency maps of several state-of-the-art algorithms, and that our selective attention model evidently improves salient object segmentation performance on a challenging saliency benchmark, the Extended Complex Scene Saliency Dataset (ECSSD).

Keywords-saliency detection; object segmentation; objectness; visual attention

I. INTRODUCTION

Saliency detection originates from the visual attention of primates, which enables us to quickly distinguish attractive objects from common ones or from the background. The attention mechanism is of great importance for filtering and refining information, and so is saliency detection for handling explosively growing image data. Specifically, finding salient objects or regions in an image facilitates tasks such as object recognition, object segmentation, content-aware image editing, and adaptive compression of images.

A saliency map is frequently used in previous work to quantify the distribution of visual attention over an image. Two classes of knowledge dominate our attention assignment: bottom-up (BU) and top-down (TD) knowledge. BU knowledge focuses on cues that we are naturally interested in, such as colors, orientations, density [1] and frequency [2,6], while TD knowledge is not innate but learned from tasks; for example, our eye fixations basically favor objects with big volume [4] or with higher similarity to something we know [3,5]. Previous work has also shown that a purely BU fashion can hardly generate object-level saliency maps for a given task, and that a reasonable map should take both image-based BU cues and cognitive-based TD cues into consideration to detect the most salient objects in images.

Saliency segmentation is the last step for selecting a concrete object, and segmentation performance largely depends on the saliency map generated above: regions with higher saliency values are more likely to be chosen. However, besides taking saliency as an object's inherent attribute, we also obey some shared attention habits or principles when exposed to a dynamic scene; for example, we prefer to give objects right in the middle of our view more attention than those near the edge [4]. To be more persuasive, the segmentation process should bring this TD attention guidance into play so that a more intuitively favorable object can be selected.

Figure 1. Illustration of our method: (a) input image (b) saliency map (c) objectness map (d) overall saliency map (e) most salient region (f) ground-truth mask.

In this paper, we attempt to refine the saliency map by incorporating cognitive-based objectness and image-based saliency, for the purpose of bringing both TD object-level information and BU pixel-level appearance into saliency map generation. Objectness quantifies how likely an image window is to contain an object of any class [5,9]. Unlike the category-dependent objectness in early work [7,8], we only care about category-independent representations. Refined by objectness, our saliency map shows a clear edge over the results of conventional algorithms, and its advantage extends to the subsequent segmentation operation. Objectness enables us to find more spatial cues in the saliency map; it counters the drawback of scattered saliency in many algorithms [1,10,11], which potentially improves segmentation performance. After obtaining the saliency map, we further propose a TD attention-based segmentation model; it is more intuition-driven and outperforms the state-of-the-art adaptive threshold segmentation method [6] on the ECSSD benchmark (see the illustration in Figure 1).

II. RELATED WORK

The term saliency was introduced by Olshausen and Anderson in their work [13].



Itti and Koch adopt the center-surround and inhibition-of-return principles from neurology to implement their model [1]. Hou and Harel base their model [2] on frequency-domain information, Valenti and Sebe utilize gradient information to highlight regions [14], and Wei et al. compare the background with objects to measure geodesic saliency [15]. In general, most models employ low-level information by contrasting regional image cues like color, orientation and density with their surrounding counterparts.

These unsupervised bottom-up methods have shared drawbacks, such as scattered and inconsistent saliency. For better generalization, top-down knowledge is introduced to solve those problems. Objectness, as high-level information, describes how likely a region is to contain an object; Alexe et al. [5] and Cheng et al. [9] come up with two mainstream objectness frameworks. Based on that, Chang and Liu [16] leverage objectness information as a prior to construct an energy model that fuses objectness and saliency. Jia and Han [3] present a saliency detector combining objectness with a Gaussian MRF. In [17], Huang and Gan perform object segmentation based on saliency information and an objectness frequency map. All these works demonstrate that object information can help saliency detection. Different from the methods mentioned before [3,17], we choose the minimal set of objectness bounding boxes that score highest under the framework of [5], and fuse them pixel-wise with the per-pixel saliency scores obtained from the algorithms [1,10,11,12] to better define the saliency map without increasing the computation budget much.

Some previous works have explored approaches to segment salient objects in a saliency map. Ko and Nam [18] use a Support Vector Machine on Itti's maps to choose the most salient regions. Ma and Zhang [19] utilize fuzzy growing on saliency maps to extract salient regions within a rectangular window. Achanta and Hemami [6] unite hill-climbing-based mean-shift segmentation with an adaptive threshold to produce the most salient region and, according to them, enjoy the best results on two different datasets. However, these methods rely heavily on mathematical machinery and offer limited interpretability. Inspired by [4], we propose an attention model for salient region selection using several widely accepted attention habits or principles; coupled with our objectness-involved saliency map, our segmentation performance transcends method [6] in precision, recall and F-Measure by considerable margins.

III. OBJECTNESS FOR SALIENCY MAP

In this section, we explicitly introduce our method for generating the overall saliency map by a combination of cognitive-based objectness and image-based saliency.

A. Objectness Map (OM) Generation

The objectness algorithm can produce a manually fixed number of bounding boxes (BB) that might contain potential objects, and each bounding box has a corresponding confidence score representing its objectness probability. For every image, we focus on the following four cues for objectness measurement:

• Multi-scale Saliency: attaining frequency uniqueness for different parts of the image by using the spectral residual of the FFT.
• Color Contrast: measuring the local dissimilarity of the color information of a bounding box to its surrounding area.
• Edge Density: using Canny edge detection to calculate the density of the edges close to the borders of the bounding box candidate.
• Superpixel Straddling: considering the rule that pixels in the same superpixel often belong to the same semantic group (foreground or background), this cue computes the agreement between a bounding box and the superpixels obtained from [20].

Figure 2. The top row shows the input images and their top-3 confident bounding boxes, and the bottom row shows the corresponding objectness maps.

Here, we assume that the image size is $m \times n$ (width $m$, height $n$), and $p_{i,j}$ is the pixel at coordinate $(i, j)$. The chosen objectness bounding box candidates are $\{(B_1, O_1), (B_2, O_2), \ldots, (B_k, O_k)\}$, where $B_t$ is a bounding box position in the image and $O_t$ is the corresponding confidence score. We first merge all bounding boxes into an $m \times n$ image called the objectness map $OM$:

$$OM_{i,j} = \frac{1}{k}\sum_{t=1}^{k}\left[O_t\, I(p_{i,j} \in B_t) + (1 - O_t)\, I(p_{i,j} \notin B_t)\right] \qquad (1)$$

where $OM_{i,j}$ is the objectness score for pixel $(i, j)$ in the map, and $I(p_{i,j} \in B_t)$ is a 0-or-1 indicator function denoting whether $p_{i,j}$ is inside the bounding box. We further normalize the objectness scores to $[0, 1]$:

$$OM'_{i,j} = \frac{OM_{i,j} - OM_{\min}}{OM_{\max} - OM_{\min}} \qquad (2)$$

where $OM_{\min}$ is the smallest objectness score in the map and $OM_{\max}$ is the largest; see the result in Figure 2.
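For concreteness, the following is a minimal NumPy sketch of Eqs. (1)-(2). It assumes each candidate is given as pixel coordinates (x0, y0, x1, y1) with a confidence score; the function name and box format are our own illustration, not part of the framework of [5].

import numpy as np

def objectness_map(boxes, scores, height, width):
    # Eq. (1): a pixel inside box B_t contributes O_t, a pixel outside
    # contributes (1 - O_t); contributions are averaged over the k boxes.
    om = np.zeros((height, width), dtype=np.float64)
    for (x0, y0, x1, y1), o in zip(boxes, scores):
        inside = np.zeros((height, width), dtype=bool)
        inside[y0:y1, x0:x1] = True
        om += np.where(inside, o, 1.0 - o)
    om /= len(scores)
    # Eq. (2): min-max normalization of the merged map to [0, 1]
    return (om - om.min()) / (om.max() - om.min() + 1e-12)

# e.g. three hypothetical candidates on a 240x320 image:
# om = objectness_map([(60, 40, 200, 180), (50, 30, 210, 190),
#                      (80, 60, 180, 160)], [0.9, 0.7, 0.6], 240, 320)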
B. Saliency Map (SM) Generation

Many algorithms have defined their own standards for generating an image-based saliency map, and such a map is capable of showing the salient degree of each pixel in an image.
In this paper, we choose four representative saliency detection algorithms to compute bottom-up saliency maps with low-level pixel cues. The choice of these algorithms is motivated by the following reasons. Itti's center-surround model [1] is the most classic model and is ground-breaking for computational saliency maps. Yan et al. present a tree model [12] that decomposes image information into three layers and captures hierarchical saliency; it shows remarkable performance on many saliency detection datasets such as the MSRA-1000 and 5000 datasets [6]. Wei et al. [11] take an opposite perspective: they focus more on the background instead of the object, and exploit two common priors about backgrounds in natural images, boundary and connectivity, in a novel saliency measure called geodesic saliency. Perazzi et al. [10] base their saliency estimation on uniqueness and spatial distribution contrast, formulated in a unified way using high-dimensional Gaussian filters; this largely simplifies the conception and leads to an efficient implementation with linear complexity. We refer to these four detectors as IT, HS, GS and SF below.

For each image, we thus have four preliminary saliency maps derived from the aforementioned algorithms, as demonstrated in Figure 3.

Figure 3. Saliency map demonstration: (a) input image (b) IT SM (c) HS SM (d) GS SM (e) SF SM (f) ground-truth mask.

C. Overall Saliency Map (OSM) Generation

After the objectness map and the preliminary saliency maps are obtained, a linear approach is used to combine them in order to form the overall saliency map:

$$OSM_{i,j} = a \times OM'_{i,j} + (1 - a) \times SM_{i,j} \qquad (3)$$

where $OSM$ stands for the overall saliency map and $a$ is the weight of the objectness map. Likewise, it needs to be normalized:

$$OSM'_{i,j} = \frac{OSM_{i,j} - \mu}{\sigma} \qquad (4)$$

where $\mu$ and $\sigma$ denote the mean and standard deviation of $OSM$. We normalize $OSM$ so that it obeys a normal distribution, and set negatives to zero; in this way pixel scores become smoother and gain stronger expressive ability. Figure 4 illustrates the overall fusing process.

Figure 4. General framework to produce the overall saliency map. Four overall saliency maps are created from four different saliency maps and one identical objectness map.
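A minimal sketch of Eqs. (3)-(4), assuming OM' and SM are arrays of the same shape already scaled to [0, 1]. We read the normalization of Eq. (4) as standardization to zero mean and unit variance, which is one plausible interpretation of "obeys a normal distribution".

import numpy as np

def overall_saliency_map(om_norm, sm, a=0.5):
    # Eq. (3): linear fusion of objectness map and preliminary saliency map
    osm = a * om_norm + (1.0 - a) * sm
    # Eq. (4), as we read it: standardize, then set negative scores to zero
    osm = (osm - osm.mean()) / (osm.std() + 1e-12)
    return np.maximum(osm, 0.0)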

IV. SELECTIVE ATTENTION MODEL

A. Introduction of Top-down Factors

Besides objectness, four additional top-down factors are raised, based on principles or habits that human visual attention abides by when detecting salient objects in a scene. Together with the overall saliency map, they are:

• Pixel Saliency Value: fusing object-level information and bottom-up saliency.
• Area Size: assigning more attention to bigger objects.
• Object Position: assigning more attention to objects near the optical center.
• Average Saliency: assigning more attention to overall salient objects.
• Regional Variance: eliminating the effect of abnormal points with high saliency values.

B. Feature Quantification and Merging

Before quantifying the features, a binary segmentation is performed on the overall saliency map with an adaptive threshold:

$$B_{i,j} = \begin{cases} 1, & OSM'_{i,j} \geq r \times sal_{mean} \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

where $B_{i,j}$ is the binary value after segmentation, $sal_{mean}$ is the mean value of the overall saliency map, and $r$ is the threshold coefficient.

After building the binary image, we also filter out tiny regions that have few pixels. Then the five nominated factors are quantified for our computational attention model. Because we may have one or more regions in a segmented binary map, when talking about a region $R$, all pixels inside it are labeled 1 and pixels outside are labeled 0; a specific pixel thus has label = 1 or label = 0.
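A sketch of this binarization and filtering step. The minimum region size is an illustrative assumption, since the paper only states that regions with few pixels are removed.

import numpy as np
from scipy import ndimage

def binarize_and_label(osm, r=1.2, min_pixels=50):
    # Eq. (5): foreground where OSM' >= r * mean(OSM')
    binary = osm >= r * osm.mean()
    # label connected regions, then discard tiny ones
    labels, num = ndimage.label(binary)
    for region_id in range(1, num + 1):
        if np.count_nonzero(labels == region_id) < min_pixels:
            labels[labels == region_id] = 0
    return labels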
The first factor is abstracted as the maximal saliency value in each region; its expression is as follows:

$$MaxSal_R = \max_{label=1} OSM'_{i,j} \qquad (6)$$

The second factor is measured by the pixels contained in each region. We further denote the proportion of every area size to the whole map as $AreaP$, and a scale parameter 2 is added for the importance of area in saliency detection. With $Num$ denoting the number of pixels in a region, we have this feature:

$$Num_R = \sum_{i=1}^{m}\sum_{j=1}^{n} label_{i,j} \qquad (7)$$

$$AreaP_R = \frac{2\,Num_R}{m \times n} \qquad (8)$$

The third factor is intuitively represented by the Euclidean distance of a region to the image center. By averaging the pixel distance bias from the image center, we have:

$$EuDist_R = \frac{1}{Num_R}\sum_{label=1}\sqrt{(i - midX)^2 + (j - midY)^2} \qquad (9)$$

where $(midX, midY)$ is the center coordinate of the binary map:

$$midX = \frac{m}{2}, \quad midY = \frac{n}{2} \qquad (10)$$

The fourth feature is obtained by averaging the saliency scores in every region to measure its overall salient degree:

$$Avg_R = \frac{1}{Num_R}\sum_{label=1} OSM'_{i,j} \qquad (11)$$

The last feature is the regional saliency variance, which accounts for the noise effect in the obtained overall saliency map:

$$Var_R = \frac{1}{Num_R}\sum_{label=1}\left(OSM'_{i,j} - Avg_R\right)^2 \qquad (12)$$

Finally, we design a normalization method for our model:

$$feature_{e,R} = \frac{feature_{e,R} - \min(feature_e)}{\max(feature_e) - \min(feature_e)} \qquad (13)$$

$$feature_{p,R} = \frac{\max(feature_p) - feature_{p,R}}{\max(feature_p) - \min(feature_p)} \qquad (14)$$

in which $e \in \{MaxSal, AreaP, Avg\}$, $p \in \{EuDist, Var\}$, $\max(feature_{e,p})$ is the biggest feature value over all regions, and $\min(feature_{e,p})$ is the smallest. After feature quantification, we take a synthesis operation over the five normalized feature values of each region, such that every region gets an attention value:

$$AttentionValue_R = \sum_{k \in \{e, p\}} feature_{k,R} \qquad (15)$$
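The feature quantification and merging of Eqs. (6)-(15) can be sketched as follows, assuming osm is the normalized overall saliency map and labels is the filtered region labeling from the earlier sketch; this is an illustration of the model, not a reference implementation.

import numpy as np

def attention_values(osm, labels):
    # Eqs. (6)-(15): five features per region, normalized so that
    # MaxSal, AreaP, Avg reward a region (Eq. (13)) while EuDist and
    # Var penalize it (Eq. (14)); the attention value is their sum.
    n, m = osm.shape                       # height n, width m
    mid_x, mid_y = m / 2.0, n / 2.0        # Eq. (10)
    ids = [r for r in np.unique(labels) if r != 0]
    if not ids:
        return {}
    feats = {k: [] for k in ("MaxSal", "AreaP", "Avg", "EuDist", "Var")}
    for r in ids:
        mask = labels == r
        vals = osm[mask]
        ys, xs = np.nonzero(mask)
        feats["MaxSal"].append(vals.max())                    # Eq. (6)
        feats["AreaP"].append(2.0 * vals.size / (m * n))      # Eqs. (7)-(8)
        feats["Avg"].append(vals.mean())                      # Eq. (11)
        feats["EuDist"].append(np.hypot(xs - mid_x, ys - mid_y).mean())  # Eq. (9)
        feats["Var"].append(((vals - vals.mean()) ** 2).mean())          # Eq. (12)
    score = np.zeros(len(ids))
    for name, v in feats.items():
        v = np.asarray(v, dtype=np.float64)
        rng = v.max() - v.min() + 1e-12
        if name in ("EuDist", "Var"):
            score += (v.max() - v) / rng   # Eq. (14)
        else:
            score += (v - v.min()) / rng   # Eq. (13)
    return dict(zip(ids, score))           # Eq. (15) per region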
Thereafter, attention values can be projected onto region saliency, and conversely the most salient object can be selected by picking the region with the highest attention score.

V. EXPERIMENTS

We evaluated our method and model on the very challenging ECSSD saliency benchmark, which contains 1000 complex images together with salient object annotations given as ground-truth masks. We separately compare our refined overall saliency maps with the saliency maps of four conventional saliency detectors, and our selective attention model with a highly recognized segmentation method.

In our experiments, we use only three objectness bounding box candidates to form an objectness map with a small computation budget. We also let $a$ be 0.5 to weigh OM and SM equally in the OSM. For the binary segmentation, the threshold $r$ is set to 1.2.

A. Saliency Map Comparison

We mainly adopted the criteria presented in [6] to evaluate the performance of the saliency algorithms, using precision-recall (PR) curves. The PR curve needs a fixed threshold $T$ to obtain the precision and recall values for all images and calculate their averages. Varying $T$ from 0 to 255 gives the average PR curves of Figure 5.

Figure 5. The PR curves of the saliency algorithms (IT, HS, GS, SF) and our improved ones. Names with OB are from our method supplemented with objectness.

As shown in Figure 5, the introduction of objectness information significantly improves the performance of IT [1] and SF [10], slightly improves GS [11], but degrades HS [12] a little. Considering that we use a minimal set of objectness bounding boxes, this result is quite reasonable: the former three algorithms emphasize pixel-level information, so object-level information makes their salient pixels more clustered, while the HS map is composed of several feature layers that already largely avoid scattered salient points; HS also yields the best performance on ECSSD so far.

B. Segmentation Performance Comparison

Threshold selection is vital for image segmentation. A mean-shift with adaptive threshold segmentation method is proposed in [6] to segment the saliency map, where the threshold $T_a$ is twice the saliency mean:

$$T_a = \frac{2}{m \times n}\sum_{x=1}^{m}\sum_{y=1}^{n} Sal(x, y) \qquad (16)$$

Our selective attention model is very different: we design a computational approach that selects salient regions by their region attention scores; see Figure 6 and Table I for an illustration.
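To make the contrast concrete, here is a minimal sketch of the baseline threshold of Eq. (16) next to our region-level selection step; attention_values refers to the earlier sketch, and the function names are ours.

import numpy as np

def ats_segment(sal):
    # Adaptive threshold segmentation of [6], Eq. (16):
    # binarize a saliency map at twice its mean value.
    return sal >= 2.0 * sal.mean()

def sam_select(osm, labels):
    # Our selection step: keep only the region whose attention
    # value (Eq. (15), earlier sketch) is the highest.
    scores = attention_values(osm, labels)
    best = max(scores, key=scores.get)
    return labels == best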
TABLE I. ATTENTION SCORE FOR TOP THREE REGIONS

Figure 6. Illustration of the selective attention model based on the IT saliency detector: (a) input image (b) ground-truth mask (c) overall SM (d) highest-score region (e) second-highest-score region (f) third-highest-score region.

In general, both methods obtain a binary map for all algorithms, and their average precision, recall and F-Measure (Eq. (17)) are measured over the same ground-truth database of ECSSD:

$$F_\beta = \frac{(1 + \beta^2)\, P \times R}{\beta^2 \times P + R} \qquad (17)$$

where $P$ is precision, $R$ is recall, and $F_\beta$ is the F-Measure; for comparison, we use $\beta^2 = 0.3$ as in [3,6].
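A minimal sketch of these metrics for one image, assuming binary NumPy masks; dataset-level scores are obtained by averaging over all images as described above.

import numpy as np

def precision_recall_f(pred, gt, beta2=0.3):
    # Eq. (17): precision, recall and F-measure of a binary
    # prediction against the ground-truth mask, beta^2 = 0.3.
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    p = tp / (pred.sum() + 1e-12)
    r = tp / (gt.sum() + 1e-12)
    f = (1.0 + beta2) * p * r / (beta2 * p + r + 1e-12)
    return p, r, f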
The results are shown in Table II. Our selective attention model clearly outperforms method [6] in precision, recall and F-Measure for all saliency detectors. We can also see that despite the slight degradation caused by using objectness information in HS overall saliency map generation, it is more than compensated in the segmentation task, with improvements of 1.78%, 3.38% and 2.4% in precision, recall and F-Measure respectively. We attribute this improvement partly to the objectness bounding boxes: because they locate objects, segmentation becomes easier and more accurate.

TABLE II. ADAPTIVE THRESHOLD SEGMENTATION (ATS) AND OUR SELECTIVE ATTENTION MODEL (SAM) RESULTS ON IT, HS, GS, SF OVERALL SALIENCY MAPS (BOLD ARE THE BETTER ONES)

Method   | Precision | Recall | F-Measure
IT+ATS   | 0.4896    | 0.4423 | 0.4778
IT+SAM   | 0.6038    | 0.4846 | 0.5759
HS+ATS   | 0.7408    | 0.5323 | 0.6794
HS+SAM   | 0.7586    | 0.5661 | 0.7034
GS+ATS   | 0.6445    | 0.6411 | 0.6437
GS+SAM   | 0.6806    | 0.6054 | 0.6616
SF+ATS   | 0.6120    | 0.3688 | 0.5312
SF+SAM   | 0.6271    | 0.4263 | 0.5656

VI. CONCLUSION AND FUTURE WORK

In this paper we propose a refinement scheme for conventional saliency detection algorithms that introduces object-level information to obtain an overall saliency map. Based on the overall saliency map, we further present an improved selective attention model for salient object segmentation that is more intuitively sound. Experiments have shown that our refined saliency map evidently improves the performance of some state-of-the-art algorithms, and that our segmentation model also outperforms a highly recognized method by a large margin.

For future work, we plan to apply our refinement method to more algorithms and to increase the number of objectness bounding box candidates in order to thoroughly investigate the effect of object-level information. In addition, we plan to introduce more top-down factors into our selective attention model to make it practical and task-oriented when facing scenes with more salient objects. Lastly, our methods bring a reasonable time budget, and we will explore more time-saving approaches.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61473052.

REFERENCES

[1] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. TPAMI, 20(11):1254-1259, 1998.
[2] X. Hou, J. Harel, and C. Koch. Image signature: Highlighting sparse salient regions. TPAMI, 34(1):194-201, 2012.
[3] Y. Jia and M. Han. Category-independent object-level saliency detection. In ICCV, 2013:1761-1768.
[4] W. Chen, F. Shen, and J. Zhao. A computational model of selecting visual attention based on bottom-up and top-down feature combination. 2013:1-5.
[5] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In CVPR, 2010:73-80.
[6] R. Achanta, S. Hemami, F. Estrada, et al. Frequency-tuned salient region detection. In CVPR, 2009:1597-1604.
[7] C. Li, Z. Cao, Y. Xiao, et al. Fast object detection from unmanned surface vehicles via objectness and saliency. In Chinese Automation Congress, 2015:500-505.
[8] T. Judd, K. Ehinger, F. Durand, and A. Torralba. Learning to predict where humans look. In ICCV, 2009.
[9] M. M. Cheng, Z. Zhang, W. Y. Lin, et al. BING: Binarized normed gradients for objectness estimation at 300fps. In CVPR, 2014:3286-3293.
[10] F. Perazzi, P. Krähenbühl, Y. Pritch, et al. Saliency filters: Contrast based filtering for salient region detection. In CVPR, 2012:733-740.
[11] Y. Wei, F. Wen, W. Zhu, et al. Geodesic saliency using background priors. In ECCV, 2012:29-42.
[12] Q. Yan, L. Xu, J. Shi, et al. Hierarchical saliency detection. In CVPR, 2013:1155-1162.
[13] B. Olshausen, C. Anderson, and D. Van Essen. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 13:4700-4719, 1993.
[14] R. Valenti, N. Sebe, and T. Gevers. Image saliency by isocentric curvedness and color. In ICCV, 2009.
[15] Y. Wei, F. Wen, W. Zhu, et al. Geodesic saliency using background priors. In ECCV, 2012:29-42.
[16] K.-Y. Chang, T.-L. Liu, H.-T. Chen, and S.-H. Lai. Fusing generic objectness and visual saliency for salient object detection. In CVPR, 2011.
[17] L. Huang, R. Gan, and G. Zeng. Object cosegmentation by similarity propagation with saliency information and objectness frequency map. In ICSAI, 2016.
[18] B. C. Ko and J.-Y. Nam. Object-of-interest image segmentation based on human attention and semantic region clustering. Journal of the Optical Society of America A, 23(10):2462-2470, 2006.
[19] Y.-F. Ma and H.-J. Zhang. Contrast-based image attention analysis by using fuzzy growing. In ACM International Conference on Multimedia, 2003.
[20] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. IJCV, 59(2):167-181, 2004.
