
The Segmented and Annotated IAPR-TC12 Benchmark

Hugo Jair Escalante, Carlos A. Hernández, Jesus A. Gonzalez, A. López-López, Manuel Montes, Eduardo F. Morales, L. Enrique Sucar, and Luis Villaseñor
National Institute of Astrophysics, Optics and Electronics, Department of Computational Sciences, Luis Enrique Erro # 1, Tonantzintla, Puebla, 72840, México

Michael Grubinger
Victoria University, Australia School of Computer Science and Mathematics PO Box 14428, Melbourne VIC 8001, Australia

Abstract
This paper introduces the segmented and annotated IAPR-TC12 benchmark, an extended resource for the evaluation of automatic image annotation methods and for studying their impact on multimedia information retrieval. We describe the methodology adopted for the manual segmentation and annotation of images and present statistics of the extended collection. Also, a soft measure for the evaluation of annotation performance is proposed. The extended collection is publicly available and can be used to evaluate a variety of tasks besides image annotation; in particular, this resource can serve to study the benefits of using automatic annotations for multimedia image retrieval. Furthermore, we outline several applications and important questions for which this collection is useful, motivating research in the areas of segmentation, image annotation, multimedia information retrieval and machine learning.

Key words: Data set creation, Ground truth collection, Evaluation metrics, Automatic Image Annotation, Image Retrieval

Manuscript submitted to the Special Issue on Image and Video Retrieval Evaluation. We would like to thank editors and anonymous reviewers for their useful comments that helped us to improve this paper. This project was partially supported by CONACyT under project grant 61335. Email addresses: hugojair@ccc.inaoep.mx (Hugo Jair Escalante), michael.grubinger@gmx.at

(Michael Grubinger).

Preprint submitted to Elsevier

26 February 2009

1 Introduction

The task of automatically assigning semantic labels to images is known as automatic image annotation (AIA). This problem has been identified as one of the hot topics in the new age of image retrieval [1]. Despite being relatively new, the task has seen significant progress over the last decade [2-9]. However, the lack of a benchmark collection specifically designed for this task means that most AIA methods are evaluated on small collections of unrealistic images [3-9]. Furthermore, the lack of a region-level AIA benchmark has caused many region-level methods to be evaluated by their annotation performance at image level, which results in unreliable estimations of localization performance [5,10]. The ultimate goal of AIA is to allow un-annotated image collections to be searched using keywords; this sort of image search is known as annotation-based image retrieval (ABIR) 1 [2]. Recently, the combination of automatic and manual annotations has been proposed to improve retrieval performance and diversify results in annotated collections [11]. However, despite its goal and its popularity, the impact of AIA methods on image retrieval has not been studied under realistic settings, which calls into question the advantages offered by annotation methods in the retrieval task. In order to provide reliable ground-truth data for benchmarking AIA and studying its advantages for multimedia image retrieval, we introduce the segmented and annotated IAPR-TC12 benchmark. The IAPR-TC12 collection is an established image retrieval benchmark composed of about 20,000 images manually annotated with free-text descriptions in three languages [12]. We extended this benchmark by manually segmenting and annotating the entire collection according to a carefully defined vocabulary. This extension allows the evaluation of more multimedia tasks than those currently supported. Because the IAPR-TC12 is already an image retrieval benchmark, the extended collection can be used to assess the impact of AIA methods on the multimedia retrieval task; also, it can be used to objectively compare CBIR (content-based image retrieval), ABIR and TBIR techniques, and to evaluate the usefulness of combining information from diverse sources.

1.1 Automatic Image Annotation

Textual descriptions of images are very useful because, when they are complete (i.e., the visual and semantic content of images is available in the description), standard information retrieval techniques have reported very good results on image retrieval [13,14]. However, manually assigning textual information to images is both expensive and subjective; therefore, there has recently been increasing interest in performing this task automatically.
1 Note that ABIR is different from text-based image retrieval (TBIR), since the latter approach uses text that has been manually assigned to images.

There are two ways of facing this problem: at image-level and at region-level (see Figure 1). In the first case, labels are assigned to the image as a whole, without specifying which words are related to which objects within the image. In the second approach, annotations are assigned at region-level within each image, providing a one-to-one correspondence between words and regions. The latter approach provides more information (e.g. spatial relationships) that can be used to improve annotation and retrieval performance; note that any region-level annotation is also an image-level annotation. This work considers the region-level AIA problem.

Fig. 1. Sample images from three related tasks. From left to right: image-level annotation and region-level annotation (from the Corel subsets of Carbonetto et al. [8]), object detection (from the PASCAL VOC-2007 data set [15]) and object recognition (from the Caltech-256 data set [16]).

The AIA task has been approached with semi-supervised and supervised machine learning techniques [3-11,17,18]. Supervised methods have reported better results than their semi-supervised counterparts in this task [9,17,18]. Supervised methods, however, require a training set of region-label pairs, while semi-supervised methods only need weakly annotated images; both kinds of methods therefore offer complementary advantages. An important feature of the extended IAPR-TC12 benchmark is that it can support both kinds of methods.

1.2 AIA and Object Recognition

Region-level AIA is often considered an object recognition task; however, this is true only to some extent, and, therefore, object recognition benchmarks are not well suited for AIA. In both AIA and object recognition, the task is to assign the correct label to a region in an image. However, in object recognition collections the data consist of images where the object to recognize is centered and occupies more than 50% of the image (see Figure 1, rightmost image); usually, no other object from the set of objects to recognize is present in the same image. In region-level AIA collections, on the other hand, the data consist of annotated regions from segmented images, where the target object may not be the main theme of the image and many other target objects can be present in the same image (see Figure 1). Another difference lies in the type of objects to recognize. While in object recognition the objects are very specific entities, like cars, gloves, or specific weapons, in region-level AIA the concepts are more general, for example buildings, grass and trees. These differences are mainly due to the applications they are designed for: while object recognition is mostly related to surveillance, identification, and tracking systems, AIA methods are designed for image retrieval and related tasks.

1.3 Evaluation of Region-level AIA Methods

Duygulu et al. adopted an evaluation methodology that has been widely used to assess the performance of both region-level and image-level AIA techniques [4]. Under this formulation, the AIA method is used to label the regions of images in a test set; for each test image, the assigned region-level annotations are merged to obtain an image-level annotation, which is then compared to the respective image-level ground truth annotation. The performance 2 of AIA methods is measured using standard information retrieval measures (e.g. precision and recall). While this formulation can provide information on image-level performance, localization performance cannot be effectively evaluated with it. For example, consider the annotations shown in Figure 2; under the above methodology both annotations have equal performance, yet the annotation on the right has very poor localization performance. A better and simpler methodology would be to average the number of correctly labeled regions [8,10]; this measure would adequately evaluate the localization performance of both annotations. The image-level approach, however, has been adopted to evaluate AIA methods regardless of their type (i.e., supervised or semi-supervised) or their goal (i.e., region-level or image-level) [4-7]; this is because of the lack of a benchmark collection with region-level annotations. In this paper we describe a segmented and annotated benchmark collection that can be used to evaluate AIA methods.

Fig. 2. Sample image from the Corel subsets due to Carbonetto et al. [8]. Left: correct region-level annotation; right: incorrect region-level annotation. Although the annotation on the right has zero (the worst possible) localization performance, both annotations are equally correct at image level.
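To make the contrast concrete, the following minimal sketch (our own illustration, not code from the paper) compares the two evaluation views on a toy example in the spirit of Figure 2: merging region labels into an image-level annotation hides localization errors that a per-region measure exposes.

```python
# Sketch: image-level evaluation (labels merged per image) vs. a simple
# region-level measure (fraction of correctly labeled regions).

def image_level_scores(true_regions, pred_regions):
    """Precision/recall over the sets of labels obtained by merging regions."""
    true_set, pred_set = set(true_regions), set(pred_regions)
    hits = len(true_set & pred_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(true_set) if true_set else 0.0
    return precision, recall

def region_level_accuracy(true_regions, pred_regions):
    """Fraction of regions whose predicted label matches the ground truth."""
    correct = sum(t == p for t, p in zip(true_regions, pred_regions))
    return correct / len(true_regions)

# Hypothetical example: same labels in both annotations, wrong positions in one.
truth = ["sky", "plane", "grass"]
good  = ["sky", "plane", "grass"]   # correct localization
bad   = ["plane", "grass", "sky"]   # labels permuted across regions

print(image_level_scores(truth, good), region_level_accuracy(truth, good))  # (1.0, 1.0) 1.0
print(image_level_scores(truth, bad), region_level_accuracy(truth, bad))    # (1.0, 1.0) 0.0
```

Both annotations look perfect at image level, but only the per-region measure reveals that the second one localizes nothing correctly.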

The rest of this paper is organized as follows. The next section describes related benchmark collections. Section 3 presents the methodology adopted for the segmentation and annotation of the IAPR-TC12 benchmark and describes statistics of the extended collection. Section 4 introduces an annotation performance measure for the extended collection. Section 5 outlines applications for the benchmark. Finally, Section 6 presents the conclusions derived from this work.
2 To evaluate localization performance, Duygulu et al. analyzed localization results for 100 images [4] (500 were used in a later work [5]); however, this analysis can only give partial evidence of the true localization performance, and in most cases, when AIA methods are evaluated, such an analysis is not carried out [4-7].

2 Related work

A widely used collection for evaluating AIA is the Corel data set [1,4-6,8,10,17]; it consists of around 800 CDs, each containing 100 images related to a common semantic concept. Each image is accompanied by a few keywords describing its semantic or visual content. Although this collection is large enough for obtaining significant results, several issues make it an unsuitable benchmark: i) images are unrealistic because most of them were taken in difficult poses and under controlled situations; ii) it contains the same number of images for each semantic concept, which is rarely found in realistic collections; iii) Corel images are annotated at image level and therefore cannot be used for region-level AIA; iv) it has been shown that subsets of this database can be tailored to show improvements [20]; v) it is copyright protected, which makes the collection expensive, means images cannot be distributed among researchers, and it is no longer available. Computer games have been used to build resources for tasks related to computer vision [21,22]. ESP is an online game that has been used for image-level annotation of real images [21]. The annotation process ensures that only correct labels (correctness is measured by the agreement of annotators) are assigned to images. The volumes of data that this game is able to produce are considerably large; however, images are annotated at image level and the data is not readily available. Peekaboom is another game that uses the annotated images generated with the ESP game [22]. In this game, the goal is to provide object locations and geometric labels. The resulting collection could be used to train learning algorithms for a variety of tasks, including region-level AIA. However, since the number of annotators can be in the range of millions, there is no objective criterion to obtain concise object localizations. An important effort is being carried out by Russell et al. [23] with the goal of creating benchmark collections for diverse computer vision applications. The LabelMe project uses a web-based online tool for segmenting and annotating images [23]. Segmentations are specified by drawing polygons around each object and the annotation vocabulary is defined by the users; note that any Internet user is a potential annotator in LabelMe. The advantages of this collection are its size and that it is publicly available. The problem with this collection is that it has an open vocabulary, so that regions can be assigned any word depending on the annotator's background, and very different segmentations can be obtained for similar images as well. Yao et al. are carrying out a valuable effort to create another large-scale benchmark collection; currently, more than 630,000 images have been considered in this project [24]. Manual segmentations are very detailed; segmented objects are decomposed and organized into a hierarchy similar to a syntactic tree in linguistics; information about localization and 2D and 3D geometry is also available. The collection is divided into 13 subsets, according to the type of images and their applications. This collection will be a very useful resource for building
visual dictionaries and as training data for learning algorithms. It can also be used to evaluate AIA methods; however, since the collection lacks ground truth data to evaluate image retrieval (i.e., relevance judgments), it cannot be used to assess the impact of AIA methods on multimedia information retrieval. There are several collections available for object recognition benchmarking [25]; most notably, the Caltech-101 [19], Caltech-256 [16], PASCAL VOC-2006 [26] and VOC-2007 [15] collections. The type of images in such collections, however, cannot be used for evaluating AIA methods (see Section 1.2); even their use for the evaluation of object recognition methods has been challenged [25]. There are a few collections that can be used for effectively evaluating region-level AIA methods; most of these, however, are restricted to specific domains, such as cars [27], nature-roadways [28], animals [29], landscape vs. structured classification [30], and natural scene classification [31]. The size of these data sets and their restricted domains make them inadequate for the evaluation of general-purpose AIA. Winn et al. segmented and annotated a small set of 240 images, considering 9 labels only [32]. In a more recent work, a larger collection with 591 images and 23 labels was created by Shotton et al. [33]. However, the size of these data sets and the number of concepts are not adequate for evaluating AIA. Carbonetto et al. provided three small data sets with a larger number of labels (from 22 to 56) [8]. To the best of our knowledge, these are the largest publicly available data sets that have been annotated at region level. However, the data sets are still small and come from the Corel collection. A highly relevant collection for AIA is that provided by Barnard et al. [10]. It consists of 1,041 images taken from a previous study for benchmarking segmentation algorithms [34]. Segmentations are very detailed and annotations are specified using WordNet and well defined criteria 3. A straightforward methodology was proposed for the evaluation of localization performance in region-level AIA; an important aspect of this methodology is that it can be used to evaluate the performance of methods that do not use the same vocabulary as that used in [10]. Therefore, although the data set is small and the images come from the Corel collection, the main contribution comes from the evaluation methodology, which can be used to assess AIA techniques on other collections (such as the extended IAPR-TC12 benchmark). The IAPR-TC12 benchmark was created with the goal of providing a realistic collection of images suitable for a wide number of evaluation purposes [12]. The collection is composed of about 20,000 images taken from locations around the world and comprising a varying cross-section of still natural images. The image collection includes pictures of sports, actions, people, animals, cities, landscapes, and many other topics. Manual annotations in three languages are provided with each image. This collection has been used to evaluate cross-language TBIR methods, CBIR and multimedia image retrieval methods [13,14]. It has also been used for object retrieval [35] and visual concept detection [36]. For visual concept detection, about 1,800 images were annotated with visual concepts [36]; this was an early effort to use the collection for tasks related to AIA. However, only 17 concepts were considered for this task; given the variety of images, this limited vocabulary cannot be used for annotating the entire collection. Also, we must emphasize that those annotations are available at image level only. Previously, the IAPR-TC12 collection was used for the task of object retrieval, using the PASCAL VOC-2006 collection for training and the IAPR-TC12 as testing set [35]. However, the number of objects was 10 and the accuracy of most of the methods was poor [35]; the results on the latter task showed that specialized collections are required for benchmarking different tasks. Despite being an established benchmark, the IAPR-TC12 collection cannot be used to evaluate region-level AIA methods in its original form; therefore, an extension is needed in order to increase the tasks supported by the collection. Such an extension is the main contribution of this work. It is very important to emphasize that, although several collections are available that can be used to evaluate the performance of AIA to some extent [8,10,21-24,29,32,33], there is no collection that can be used to study the impact of using AIA methods for image retrieval. The extended IAPR-TC12 collection can be used for this end to some extent. Furthermore, it can be used for studying approaches to combine automatic annotations, free-text descriptions and images, motivating further research in several fields.

3 It would be very useful to adopt this methodology for the segmentation and annotation of the IAPR-TC12 collection; however, the large size of the collection and the detailed segmentation/annotation processes introduced by Martin et al. [34] and Barnard et al. [10] make its application impractical; nevertheless, the segmentation and annotation guidelines described in Section 3 are based on those presented in [10,34].

3 The Annotation of the IAPR-TC12 Benchmark

This section describes the methodology adopted for extending the IAPR-TC12 collection. The extension consists of manually segmenting and annotating the entire collection; accordingly, the following aspects are covered in this section: the definition of the vocabulary, its hierarchical organization, and the segmentation and annotation guidelines; furthermore, statistics of the extended collection are shown. Due to space limitations we have focused on describing the main aspects of the project; further technical details can be found in [37]. The products derived from this work (see Figure 3), namely segmentation masks, annotations, visual features and spatial relationships, are publicly available for research purposes from both the ImageCLEF (http://imageclef.org/photodata) and INAOE-TIA websites (http://ccc.inaoep.mx/tia).

3.1 Vocabulary

The vocabulary plays a key role in the annotation process because it must cover most of the concepts that one can find in the collection of images.

Fig. 3. Summary of the products derived from this work, namely: segmentation masks, spatial relationships among regions, segmented images, annotation hierarchy, annotated regions and visual features.

At the same time, the vocabulary must not be too large, because AIA performance is closely related to the number of labels considered (see Section 4.1). Diverse annotation vocabularies have been considered in different works, some of them specific to the type of images in the collection considered. In this work we considered a study carried out by Hanbury [38], where a list of 494 labels was obtained by analyzing several AIA benchmark collections. We took this word list (H-list) as a reference and adapted it to the IAPR-TC12 collection. We also considered the list of nouns from the manual annotations of the IAPR-TC12 collection (A-list) and the list of nouns in the textual descriptions of topics used for ImageCLEF 2006 and 2007 (T-list) [13,14]. Labels appearing in at least two lists were included in a candidate list (C-list); the C-list was then manually filtered by analyzing a large number of images from the IAPR-TC12 collection and discarding labels not present or rarely found in the images (e.g. binoculars, printer), and by considering the frequency of labels in the annotations of the IAPR-TC12 collection: highly-frequent useful words were kept (e.g. sky, mountain), while useless highly-frequent and non-frequent words were not considered (e.g. background, marble). Finally, some words in the H-list that were initially dropped from the C-list (e.g. herd) and words identified by the authors (e.g. sky-red) that did not appear in any of the three lists were incorporated into the final C-list. This procedure was iterated several times until the authors fully agreed on the final list; the resulting list of words is shown in Table 1.

3.2 Conceptual hierarchy

When annotating the IAPR-TC12 benchmark, the need for a hierarchical organization of the vocabulary arose, because a structured annotation was one of the main goals of this work. With this goal in mind, a hierarchical arrangement of the vocabulary was proposed. The hierarchy was manually defined by the authors after carefully analyzing the images, the annotation vocabulary and the vocabulary of the manual annotations.

ar balloon antelope beach bird bridge cactus caribou child clock cougar curtain dolphin fabric fish floor other fowl giraffe ground veh head horse island lamp llama man made mountain ocean painting per rel obj primate reptile rooster scorpion shell sky night stairs surfboard toy trunk viola wooden fur wood
air vehicles apple bear boat building camel castle child boy cloth couple per deer door face flag floor wood fox glacier group per hedgehog house jewelry landscape lobster
airplane astronaut beaver boat rafting bull camera cat child girl cloud cow desert dragonfly feline flamingo flower fruit glass guitar helicopter humans jaguar leaf log man made ot mandril mural carving non wo fur ocean anim octopus palm panda piano pigeon public sign pyramid rhinoceros river ruin arch sand screen seahorse ship shore sky red smoke starfish statue swim pool table train trash turtle umbrella violin volcano waterfall wave zebra
ancient build arctic bed bobcat bus can caterpillar chimney column coyote desk eagle fence flock birds flowerbed furniture goat hand herd hut kangaroo leopard lynx marsupial mushroom orange paper plant rabbit road sand desert seal sidewalk snake steam telephone tree vegetable wall whale
ape baby beetle book bush canine cello church construction crab dish edifice field floor food furniture ot grapes handcraft highway ice kitchen pot lighthouse mammal monkey musical inst ot entity parrot plant pot rafter rock sand beach semaphore sky snow strawberry tiger trees vegetation water window
animal ball bench bottle butterfly cannon chair church int cons ot crocodile diver elephant fire floor carpet forest furniture wo grass hat hill iguana koala lion mammal ot monument nest owl penguin polar bear railroad rodent saxophone shadow sky blue sp shuttle street tire trombone vehicle water re wolf
ant balloon bicycle branch cabin car cheetah city coral cup dog elk firework floor court fountain generic obj ground hawk horn insect lake lizard man motorcycle object pagoda person pot reflection roof school fishes sheep sky light squirrel sun tower trumpet veh tires water veh woman
Table 1
Annotation vocabulary obtained with the methodology described in Section 3; words included during the creation of the hierarchy described in Section 3.2 are shown in bold. Abbreviations are as follows: ar-aerostatic, int-interior, per-person, ot-other, wo-wood, anim-animal, ins-instrument, sp-space, obj-object, veh-vehicle, fur-furniture, swim-swimming and re-reflection.

The annotation vocabulary was organized mostly using is-a relations between labels, although relations like part-of and kind-of were also included. Note that the hierarchy was defined by thinking about its usefulness for annotation and the representation of images in the IAPR-TC12 collection, rather than by considering the semantics of the labels. According to the proposed hierarchy, an object can be in one of six main branches: animal, landscape, man-made, human, food, or other. Figure 4 shows the landscape branch of the hierarchy; other branches and further details can be found in [37]. As can be seen, the hierarchy is intuitively sound. The creation of the hierarchy required adding a few labels to the vocabulary (shown in bold in Table 1).

Fig. 4. Detailed view of the branch landscape-nature in the hierarchy. Different colors are used for different levels.

Some nodes at the same level contain more descendant labels than others (e.g. Vegetation and Arctic in Figure 4) due to the type of images in the IAPR-TC12 collection. Note that some nodes are barely populated, for example fruit (see Figure 4); one could easily include all the names of fruits under this node, increasing the coverage of the variety of fruits; however, this would lead to a large vocabulary that would be scarcely used for annotation, because the images in the collection do not contain a considerable diversity of fruits. The hierarchy reflects the principle adopted for defining the annotation vocabulary: keep it compact and ensure it covers most of the concepts present in images of the IAPR-TC12 collection. Barnard et al. considered WordNet for annotating images [10]; however, we did not consider this resource because assigning a region to its nearest WordNet meaning would lead to ambiguities due to the subjective knowledge of the annotator 4 (i.e., we considered more generic words which are more likely to be agreed on, trading precision for agreement); furthermore, the use of WordNet would make annotation slower. A restricted hierarchy allows a more concise and objective annotation of the collection; of course, some concepts may not be covered by the proposed hierarchy, although for annotating the IAPR-TC12 collection it proved to be very useful. It is interesting to note that the proposed hierarchy resembles hierarchies proposed in related works [10,24,36], giving some evidence of its validity.
4 E.g. a region of water may be labeled with river by one annotator, while the same region may be labeled with stream, watercourse or simply water by other annotators; even the same annotator may assign different WordNet concepts to similar regions in different images or at different times.


The main purpose of this hierarchy is to facilitate and improve the annotation process; this can be done by going through the hierarchy top-down each time a region needs to be labeled. This reduces ambiguities when annotating similar regions referring to different concepts (visual homonymy) and different regions about the same concept (visual synonymy). The hierarchical organization of concepts was also helpful for the soft evaluation of AIA methods (see Section 4). Furthermore, the hierarchy can be useful for organizing and categorizing the IAPR-TC12 collection, which could be very helpful for the creation of topics and relevance assessments for diverse multimedia tasks.

3.3 Segmentation and Annotation Guidelines

Manually segmenting and annotating images are tasks prone to subjectivity, because different persons can segment and annotate the same image in completely different ways; even the same individual can provide different segmentations of the same image at different times. In order to deal with the subjectivity issue, we considered a reduced number of annotators: four individuals were considered for this task. In order to standardize the process of segmenting and annotating images, a set of guidelines (adapted from those considered by Martin et al. [34] and Barnard et al. [10]) was defined. The goal was to make the segmentation and annotation processes as consistent as possible by reducing ambiguities and confusion among annotators. The main guidelines are summarized below (it is assumed that the annotators know the full list of words in advance and are familiar with the segmentation and annotation tool).

- Avoid segmenting regions that are too small with respect to the size of the image. What is considered too small depends on the sort of object under consideration (e.g. bottle regions are usually smaller than mountain ones).
- Avoid segmenting regions where the object is incomplete: at least one third of the object must be visible.
- Each region must contain information from a single object.
- When the shape of an object is highly relevant, the annotators are advised to provide a detailed segmentation; otherwise, they can relax the segmentation so that it can be performed faster. Also, when shape is considered to be irrelevant, the annotators can divide the object into more than one region if they consider that creating those smaller regions is easier than segmenting the original single one.
- In the presence of excessive illumination, shadows or other conditions that make it difficult to segment a region, the annotators must avoid such areas and just segment what can be seen without difficulty.
- For labeling a region, the annotators should go through the hierarchy top-down looking for the label that best describes the region. The annotators should use the label closest to the term they consider the region belongs to. Whenever a suitable label is not found, they must select the label in the upper level of the hierarchy, avoiding as much as possible the overuse of general-purpose labels. In several cases, when a correct label is not found under the category of interest, an adequate one can be found by navigating the other branches of the hierarchy.
- When segmenting groups of objects (e.g. persons, mammals or trees), the annotators must segment the group as a unit in cases where the respective group label exists in the vocabulary and, as far as possible, also segment its individual units.

For the segmentation and annotation of the IAPR-TC12 collection, an interactive software tool in MATLAB was developed (ISATOOL, Interactive Segmentation and Annotation Tool). ISATOOL allows the interactive segmentation of objects by drawing points around the desired object; splines are used to join the marked points. One should note that although several tools for interactive segmentation are available [29,34], they are designed to provide detailed segmentations of images. This would be desirable for the current work; however, the number of images and annotators involved would make the use of such methods impractical. Nevertheless, as we can see in Figure 5, accurate segmentations can be obtained with ISATOOL. Once a region is segmented, the user is asked to provide a label for the region by using the hierarchy described in Section 3.2. Additionally, a set of simple visual features extracted from the regions and spatial relationships are provided with the extended collection. The following features were extracted from each region: area, boundary/area, width and height of the region, average and standard deviation in x and y, convexity, and average, standard deviation and skewness in both the RGB and CIE-Lab color spaces. The following spatial relationships are calculated and provided with the collection: adjacent, disjoint, beside, X-aligned, above, below and Y-aligned [37]. Spatial relations are provided in order to promote the use of contextual information in the AIA task. We considered these features and spatial relationships because they have been successfully used in previous AIA works [8,18,39,17]; however, each user can extract their own set of features and spatial relationships, since the segmentation masks and images are publicly available.
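As an illustration of the point-and-spline interaction described above, the following sketch shows one way to turn a handful of clicked points into a closed contour and a binary region mask. It is not ISATOOL (which is a MATLAB tool); the function names and parameters here are our own assumptions.

```python
# Sketch: join a few clicked points with a closed (periodic) spline and
# rasterize the resulting contour into a binary region mask.
import numpy as np
from scipy.interpolate import splprep, splev
from matplotlib.path import Path

def points_to_mask(points, height, width, samples=300):
    """points: list of (x, y) clicks around an object, in image coordinates."""
    x, y = np.asarray(points, dtype=float).T
    # Fit a periodic (closed) spline through the clicked points.
    tck, _ = splprep([x, y], s=0.0, per=True)
    xs, ys = splev(np.linspace(0.0, 1.0, samples), tck)
    contour = Path(np.column_stack([xs, ys]))
    # A pixel belongs to the region if its center lies inside the contour.
    xx, yy = np.meshgrid(np.arange(width), np.arange(height))
    inside = contour.contains_points(np.column_stack([xx.ravel(), yy.ravel()]))
    return inside.reshape(height, width)

mask = points_to_mask([(30, 20), (80, 25), (95, 70), (50, 95), (15, 60)], 120, 120)
print(mask.sum(), "pixels in the segmented region")
```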

Fig. 5. Sample images from the extended IAPR-TC12 collection.

3.4 Statistics of the Collection

The 20,000 images in the collection have been segmented and annotated as described above; 96,234 regions compose the segmented collection, for which 256 labels have been used; this represents about 94% of the vocabulary (only 19 labels of animals and musical instruments were not used). On average, 4.82 regions have been segmented per image, and each region occupies 15.74% of the image area on average. As is usual in AIA collections, some labels have been used for a considerable number of regions while others have been barely used. Figure 6 plots the number of regions annotated with the 50 most frequent labels. The ten most common labels are sky-blue (5,176), man (3,634), group-persons (3,548), ground (3,284), cloud (2,770), rock (2,740), grass (2,609), vegetation (2,455), woman (2,339), and trees (2,291). We can clearly appreciate that the most common labels agree with the type of images present in the collection (i.e., pictures of people on vacation trips [12]). There are 22, 111 and 140 labels that have been used in more than 1,000, 100 and 50 regions, respectively.
Fig. 6. Histogram of the number of regions annotated with each label, for the 50 most common labels.

A total of 218 leaves in the hierarchy have been used for annotation; Table 2 shows the distribution of annotations for the nodes in the first level of the hierarchy. There are more than 39,000 regions annotated with labels below the Landscape node, which has 45 descendants, of which 33 are leaves. More than 27,000 regions have been annotated with labels from the Man-made node, which is also a large number of regions; however, the number of descendants of Man-made is 109 nodes, of which 86 are leaves.


Label       Frequency   Norm. Freq.   Descendants   Leaves
Animal          1,796         25.29            70       56
Humans         14,910        877.06            14       12
Food              728        104.01             6        5
Man-made       27,617        253.36           109       86
Landscape      39,436        857.31            45       33
Other             538         76.85             6        6

Table 2
Distribution of annotations for labels in and below the nodes in the first level of the hierarchy described in Section 3.2. Frequency shows the number of regions annotated with labels in or below each node. Norm. Freq. shows the frequency amortized by the number of descendants of each node.

ID             Adjacent   Disjoint    Beside    X-alig     Above     Below    Y-alig
No. Ex.          77,306    187,904   206,308    58,902   104,124   104,124    56,962
Norm. Freq.       9.72%     23.62%    25.93%      7.4%    13.09%    13.09%     7.16%

Table 3
Frequency of spatial relationships among regions in the extended IAPR-TC12 collection. No. Ex. shows the number of regions that present each relationship; Norm. Freq. shows its normalized frequency.

Humans is a node with many regions as well; however, its number of descendants is small when compared to other nodes at the same level. The normalized frequency (Norm. Freq. in Table 2) shows the average number of labels assigned to each descendant of the considered nodes. We can note that the branch Humans is the one with the most annotations per descendant, with Landscape and Man-made coming next. This fact, again, reflects the type of images in the collection: Humans appear in most of the images, since most of the images were taken by/for tourists, and most of the pictures were taken in South American natural places, therefore many images contain labels from the landscape-nature branch. Regarding spatial relationships, the normalized frequency of the extracted spatial relationships is described in Table 3. We can notice that the most frequent relations are beside and disjoint, with 25.93% and 23.62%, respectively. Note that beside is a generalization of the left and right relations, and this is reflected in its frequency. X-alignment and Y-alignment are low-frequency relations, with 7.4% and 7.16%, respectively. Finally, the proportions obtained by above and below reflect their symmetry property, both with 13.09%.
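For illustration, the sketch below derives some of the listed relations from region bounding boxes. The exact definitions and thresholds used for the benchmark are given in [37]; the ones here (e.g. the alignment tolerance) are our own simplifying assumptions.

```python
# Sketch: plausible bounding-box definitions of a few spatial relations.
def spatial_relations(a, b, align_tol=10):
    """a, b: bounding boxes as (x_min, y_min, x_max, y_max); y grows downwards."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    overlap_x = ax0 < bx1 and bx0 < ax1
    overlap_y = ay0 < by1 and by0 < ay1
    rel = {
        "disjoint": not (overlap_x and overlap_y),
        "beside": overlap_y and not overlap_x,   # side by side, left or right
        "above": ay1 <= by0,                     # a ends before b starts vertically
        "below": by1 <= ay0,
        "X-aligned": abs((ax0 + ax1) / 2 - (bx0 + bx1) / 2) <= align_tol,
        "Y-aligned": abs((ay0 + ay1) / 2 - (by0 + by1) / 2) <= align_tol,
    }
    return [name for name, holds in rel.items() if holds]

# Toy example: a "sky" box above a "mountain" box.
print(spatial_relations((0, 0, 100, 40), (10, 50, 90, 100)))
# ['disjoint', 'above', 'X-aligned']
```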

4 An Evaluation Measure for the Benchmark

Clearly, with the extended IAPR-TC12 collection, region-level AIA methods can be evaluated by using common classification-performance measures (e.g. area under the ROC curve, misclassification rate, squared error, etc.). These measures can effectively assess the localization performance of annotation methods. However, they can be too rigorous for the current state of the art in region-level AIA. Consider, for instance, the case in which the correct label for a given region is water and the model under study classifies that region as river. In this situation a classification-performance measure would consider the assignment totally incorrect, even though the prediction is partially correct. In order to give partial credit to such annotations,

we introduce a new evaluation measure 5 (based on the annotation hierarchy described in Section 3.2) for region-level AIA in the context of the annotated IAPR-TC12 collection:

ehierarchy(t, p) = [ 1inpath(t,p) · |fdepth(t) − fdepth(p)| ] / max(fdepth(t), fdepth(p))    (1)

where 1inpath(t,p) is an indicator function that takes the unit value when both the predicted label p and the true label t lie on the same path of the annotation hierarchy, and fdepth(x) is the depth of label x within the hierarchy. Intuitively, ehierarchy(t, p) assigns to a label predicted by a model an error value proportional to its normalized distance (within the hierarchy) from the ground truth label. A predicted annotation is evaluated as partially good if and only if it appears in the same branch as the correct label. Note that this measure only applies when the true and predicted labels are different; otherwise ehierarchy = 0 by definition. Also note that ehierarchy can be modified so that it considers as partially correct only predictions that are more specific (or more general) than the true label; this can be done by modifying 1inpath(t,p) so that it takes the value of one only when the predicted label p is more specific (general) than the true label t. Furthermore, a threshold can be set on ehierarchy so that only certain types of errors are considered partially correct. This measure is well suited to evaluate AIA methods whose ultimate goal is to support image retrieval, because words related to the correct annotation of an image can be considered an expansion of the annotation, even when the annotation is incorrect.
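As an illustration, the following sketch computes Equation (1) from a parent map encoding a tiny, hypothetical fragment of the hierarchy (not the actual hierarchy distributed with the benchmark). Following the text, a prediction outside the branch of the true label is treated as a full error (ehierarchy = 1), and only same-branch predictions receive partial credit.

```python
# Sketch of Eq. (1): depths and path membership derived from a parent map.
PARENT = {"landscape": None, "vegetation": "landscape", "trees": "vegetation",
          "tree": "trees", "water": "landscape", "river": "water"}

def depth(label):
    d = 1
    while PARENT[label] is not None:
        label = PARENT[label]
        d += 1
    return d

def ancestors(label):
    path = {label}
    while PARENT[label] is not None:
        label = PARENT[label]
        path.add(label)
    return path

def e_hierarchy(t, p):
    if t == p:
        return 0.0
    in_path = p in ancestors(t) or t in ancestors(p)   # same branch of the hierarchy
    if not in_path:
        return 1.0                                     # unrelated label: full error
    return abs(depth(t) - depth(p)) / max(depth(t), depth(p))

print(e_hierarchy("tree", "trees"))    # 0.25: one level apart on a four-level path
print(e_hierarchy("water", "river"))   # partially correct (more specific prediction)
print(e_hierarchy("tree", "river"))    # 1.0: different branches
```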

4.1 Benchmarking AIA

This section presents experiments and results on region-level AIA using subsets of the extended IAPR-TC12 collection. The goal is to illustrate how this collection can be used to evaluate region-level AIA and to roughly compare the soft evaluation measure to a hard one. We face the AIA problem as one of multi-class classification with as many classes as labels in our vocabulary. State-of-the-art classifiers were considered over subsets of the already annotated regions. Table 4 describes the classifiers we considered with their respective (default) parameters; these classifiers are included in the CLOP machine learning toolbox [40]. Different subsets of data were considered according to the frequency of annotated regions per class; Table 5 describes the considered data subsets and the distribution of classes and examples. For each data subset, the available data were randomly split into disjoint training (70%) and testing (30%) sets. In each experiment, k classifiers were trained under the one-versus-all (OVA) formulation, k being the number of classes.
5 This measure is not supposed to replace a classification-performance measure for evaluating region-level AIA. Instead, it is intended to provide an alternative and complementary evaluation of performance, specifically for the IAPR-TC12 collection.


Classifier   Description                   Parameters
Zarbi        A simple linear classifier    (default)
Naive        Naive Bayes classifier        (default)
Klogistic    Kernel logistic regression    (default)
Neural       Feedforward Neural Network    Units=10, Shrinkage=0.1, Epochs=50
SVC          Support Vector Classifier     Kernel=Poly, Degree=1, Shrinkage=0.001
Kridge       Kernel ridge regression       Kernel=Poly, Degree=1, Shrinkage=0.001
RF           Random Forest                 Depth=1, Shrinkage=0.3, Units=100

Table 4
Classifiers from the CLOP toolbox considered in the experiments; see [40] for details.

ID   No. Classes   No. Examples
A            2          5,609
B            3          7,318
C            4          9,013
D            5         10,661
E           10         17,886
F           25         29,448
G           50         37,475
H          100         44,500

Table 5
Distribution of classes and examples for the subsets considered in the experiments.

Under this scheme, a binary classifier is trained for each label; the k-th classifier is trained taking as positive the examples from class k and the rest as negative. Despite being very simple, OVA has proved to be competitive with more sophisticated and complex approaches [41]; furthermore, OVA is widely used for supervised AIA [9,17,18,30,39]. One of the main challenges in OVA classification is choosing a way to combine the outputs of the binary classifiers so that the error on unseen data is minimized [41]. Since the goal of the experiments is only illustrative, we adopted the simplest strategy for merging the outputs of the individual classifiers: when more than one classifier is triggered, we prefer the prediction of the classifier with the highest confidence in its respective class.
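The following sketch illustrates this OVA scheme with confidence-based merging. It uses scikit-learn classifiers in place of the CLOP (MATLAB) classifiers of Table 4, and the synthetic data and parameter choices are ours, not those of the experiments reported here.

```python
# Sketch: one-versus-all training and confidence-based merging of predictions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ova(X, y):
    """Train one binary classifier per label; y holds integer class ids."""
    models = {}
    for k in np.unique(y):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, (y == k).astype(int))        # class k vs. rest
        models[k] = clf
    return models

def predict_ova(models, X):
    # Positive-class score from every binary classifier; the most confident wins.
    scores = np.column_stack([m.predict_proba(X)[:, 1] for m in models.values()])
    labels = np.array(list(models.keys()))
    return labels[np.argmax(scores, axis=1)]

# Tiny synthetic example with 3 labels and 2 region features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)) + np.repeat(np.array([[0, 0], [4, 0], [0, 4]]), 100, axis=0)
y = np.repeat([0, 1, 2], 100)
models = train_ova(X, y)
print((predict_ova(models, X) == y).mean())    # training accuracy on the toy data
```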
Fig. 7. Percentage of correct annotations for the classifiers and data sets described in Tables 4 and 5. Left: ehard performance; right: ehierarchy performance. Two baselines are shown: baseline-1 is the performance one would get by randomly selecting labels; baseline-2 is the performance one would get by always assigning the majority class.

For evaluation we considered the widely used measure for assessing the performance of classifiers: the percentage of correct classifications (ehard). ehard is compared to ehierarchy as described in Equation (1) (we say a predicted annotation is correct whenever ehierarchy < 1). Figure 7 shows the average ehard (left) and ehierarchy (right) for each data set and for each classifier considered. We can clearly appreciate that both measures decrease similarly as the number of labels considered increases.

Correct       Predicted    ehierarchy
tree          trees              0.25
person        child              0.33
food          dish               0.5
landscape     river              0.6
child-boy     child              0.25
sky           sky-light          0.33
vegetation    bush               0.5
leaf          landscape          0.75

Table 6
Analysis of some labels evaluated as non-erroneous with ehierarchy and as errors with the hard evaluation measure. Correct shows the correct label for the region and predicted stands for the label predicted by the model.

This is because ehard is a special case of ehierarchy. For the data sets A-D (i.e., 5 classes at most) both measures obtain the same result, because there are no hierarchical relations among the top-5 most frequent classes. There are small differences for the data sets F-H, for which information from the hierarchy can be used by ehierarchy. The average differences between ehard and ehierarchy are 4.27%, 5.91%, 5.44% and 4.37% for the data sets E, F, G and H, respectively. This reflects how the soft measure is indeed less rigid for the evaluation of the classifiers, although the difference is not too large, which means that the considered classifiers are not misclassifying instances with hierarchically related classes. Table 6 shows some labels assigned to test regions that were classified as incorrect by ehard and correct by ehierarchy; for this experiment all the labels used for annotation were considered (i.e., 256 classes). We can clearly appreciate that ehierarchy effectively assesses the performance of the considered classifier (naive, see Table 4); all the labels considered partially good are indeed closely related to the correct label, which can be very useful for ABIR. Note that the value of ehierarchy reflects how close the AIA method was to predicting the correct class. We would like to emphasize that ehierarchy is well suited to the extended IAPR-TC12 collection; however, any other measure can be used to evaluate AIA performance under different settings. For example, the evaluation methodology proposed by Barnard et al. can be adopted to evaluate AIA performance when using automatic segmentations [10].

5 Applications for the extended IAPR-TC12 Benchmark

In this section we outline possible applications and important questions that could be answered, to some extent, by using the fully segmented and annotated IAPR-TC12 benchmark.

Benchmarking. The extended benchmark can be readily used to evaluate the tasks of region-level AIA, image-level AIA, visual concept detection, and object retrieval. It can also be considered for assessing the performance of object detection and object recognition techniques (e.g. face and skin detection); segmentation performance can be evaluated, to some extent, with this collection as well. Furthermore, the collection motivates further research on the use of spatial context for AIA, ABIR and CBIR.

Multimedia image retrieval. The extended collection can be used to shed some light on assumptions commonly made in multimedia image retrieval, for example the assumption that information from AIA methods can be useful to improve the retrieval performance of CBIR or TBIR methods; despite being intuitively sound, this has been proved neither theoretically nor empirically. Similarly, it can be studied to what extent automatic segmentation affects retrieval and annotation performance, among several other interesting aspects [37]. Furthermore, spatial relationships can be used to allow complex queries on the collection, where the interest could be in finding objects in specific positions with respect to others, motivating research on the use of spatial relations for multimedia image retrieval.

Machine learning applications. The annotated collection could be used to bridge the gap between the machine learning and multimedia information retrieval communities because it allows studying: multi-class classification with a large number of labels (note that the hierarchy of labels can be used to study the performance of classifiers with an increasing number of labels); hierarchical classification for AIA; spatial data mining; multi-modal clustering; classification for imbalanced data sets; and the application of structured prediction methods to the problems of AIA, ABIR, and CBIR.

6 Conclusions

The IAPR-TC12 image collection is an established image retrieval benchmark that has several attractive features: it is a large collection, it is composed of diverse and realistic images, it offers textual annotations in three languages and it provides relevance assessments for evaluating image retrieval performance. A benchmark, however, is not supposed to be static, but to evolve according to the needs of the tasks it is designed for and the emergence of new related issues. In this paper, we introduced the segmented and annotated IAPR-TC12 collection, an extension to the benchmark that increases the number of tasks that can be covered with the collection; the extension also significantly augments the number of applications for the IAPR-TC12 collection. We described the methodology adopted for extending the IAPR-TC12 collection, which included the definition of an ad-hoc annotation vocabulary, its hierarchical organization and well defined criteria for objectively segmenting and annotating images. Also, an alternative evaluation measure based on the hierarchy was proposed, with the goal of assessing the localization performance of region-level AIA methods. Statistics of the extended collection give evidence that the adopted methodology is reliable and well suited for the IAPR-TC12 collection. Initial results with the proposed evaluation measure give evidence that it can effectively evaluate region-level AIA. An important contribution of this work is the identification of applications for the annotated IAPR-TC12 collection and important questions that can be answered with it; the latter will motivate further research that can significantly help to advance the state of the art in segmentation, AIA, CBIR, TBIR, ABIR and machine learning.

References
[1] R. Datta, D. Joshi, J. Li, and J. Z. Wang, Image Retrieval: Ideas, Influences, and Trends of the New Age, ACM Computing Surveys, Vol. 40-2, 2008.
[2] M. Inoue, On the Need for Annotation-based Image Retrieval, IRiX 04: Procs. of the Workshop on Information Retrieval in Context, SIGIR, Sheffield, UK, 2004.
[3] Y. Mori, H. Takahashi, and R. Oka, Image-to-word transformation based on dividing and vector quantizing images with words, MISRM 99: First Intl. Workshop on Multimedia Intelligent Storage and Retrieval Management, 1999.
[4] P. Duygulu, K. Barnard, N. de Freitas, and D. A. Forsyth, Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary, ECCV 02: Procs. of the 7th European Conf. on Comp. Vis.-Part IV, 97-112, LNCS 2353, Springer-Verlag, London, UK, 2002.
[5] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. I. Jordan, Matching Words and Pictures, J. Mach. Learn. Res., Vol. 3, 1107-1135, 2003.
[6] J. Jeon, V. Lavrenko, and R. Manmatha, Automatic image annotation and retrieval using cross-media relevance models, Procs. of the 26th Intl. ACM-SIGIR Conf. on Research and Development in Information Retrieval, 119-126, Toronto, Canada, 2003.
[7] D. Blei, Probabilistic Models of Text and Images, Ph.D. thesis, University of California, Berkeley, 2004.
[8] P. Carbonetto, N. de Freitas, and K. Barnard, A Statistical Model for General Contextual Object Recognition, ECCV 04: Procs. of the 8th Europ. Conf. on Comp. Vis., 350-362, LNCS 3021, Springer-Verlag, Canada, 2004.
[9] G. Carneiro, A. B. Chan, P. J. Moreno, and N. Vasconcelos, Supervised Learning of Semantic Classes for Image Annotation and Retrieval, IEEE Trans. on PAMI, Vol. 29-3, 394-410, 2007.

[10] K. Barnard, Q. Fan, R. Swaminathan, A. Hoogs, R. Collins, P. Rondot, and J. Kaufhold, Evaluation of Localized Semantics: Data, Methodology, and Experiments, Int. J. Comput. Vis., Vol. 77:1-3, 199-217, 2008.
[11] H. J. Escalante, J. A. González, C. A. Hernández, A. López, M. Montes, E. Morales, L. E. Sucar, and L. Villaseñor, TIA-INAOE's Participation at ImageCLEF Photo 2008, Working Notes of the 2008 CLEF Workshop, Aarhus, Denmark, 2008.
[12] M. Grubinger, Analysis and Evaluation of Visual Information Systems Performance, PhD Thesis, School of Computer Science and Mathematics, Faculty of Health, Engineering and Science, Victoria University, Melbourne, Australia, 2007.
[13] P. Clough, M. Grubinger, T. Deselaers, A. Hanbury, and H. Müller, Overview of the ImageCLEF 2006 photographic retrieval and object annotation tasks, In Advances in Multilingual and Multimodal Information Retrieval, 7th Workshop of the CLEF Forum, CLEF 2006, revised selected papers, 579-594, LNCS Vol. 4730, Springer, Alicante, Spain, September 20-22, 2006.
[14] P. Clough, M. Grubinger, T. Deselaers, A. Hanbury, and H. Müller, Overview of the ImageCLEF 2007 photographic retrieval task, In Advances in Multilingual and Multimodal Information Retrieval, 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, revised selected papers, LNCS Vol. 5152, Springer, Budapest, Hungary, September 2007.
[15] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results, http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html, 2007.
[16] G. Griffin, A. Holub, and P. Perona, The Caltech-256, CALTECH Technical Report, 2007.
[17] H. J. Escalante, M. Montes, and L. E. Sucar, Word Co-occurrence and Markov Random Fields for Improving Automatic Image Annotation, Procs. of the 18th British Machine Vis. Conf., Vol. 2, 600-609, Warwick, UK, 2007.
[18] H. J. Escalante, M. Montes, and L. E. Sucar, Multi-Class PSMS for Automatic Image Annotation, Submitted to Applications on Swarm Intelligence, 2008.


[19] L. Fei-Fei, R. Fergus, and P. Perona, Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories, CVPR 04: Workshop on Generative-Model Based Vis., 2004.
[20] H. Müller, S. Marchand-Maillet, and T. Pun, The Truth about Corel - Evaluation in Image Retrieval, CIVR 02: Procs. of the Int. Conf. on Image and Video Retrieval, London, UK, LNCS 2383, Springer-Verlag, 38-49, 2002.
[21] L. von Ahn and L. Dabbish, Labeling Images with a Comp. Game, CHI 04: Procs. of the ACM Conf. on Comp.-Human Interaction, 319-326, Vienna, Austria, 2004.
[22] L. von Ahn, R. Liu, and M. Blum, Peekaboom: A Game for Locating Objects in Images, CHI 06: Procs. of the ACM Conf. on Comp.-Human Interaction, 55-64, Montréal, Québec, Canada, 2006.
[23] B. Russell, A. Torralba, K. P. Murphy, and W. Freeman, LabelMe: a Database and Web-Based Tool for Image Annotation, Int. J. Comput. Vis., Vol. 77:1-3, 157-173, 2008.
[24] B. Yao, X. Yang, and S. Zhu, Introduction to a Large-Scale General Purpose Ground Truth Database: Methodology, Annotation Tool and Benchmarks, EMMCVPR 07: Procs. of Energy Minimization Methods in Comp. Vis. and Pattern Recognition, 169-183, Hubei, China, 2007.
[25] J. Ponce, T. L. Berg, M. Everingham, D. A. Forsyth, M. Hebert, S. Lazebnik, M. Marszalek, C. Schmid, B. C. Russell, A. Torralba, C. K. I. Williams, J. Zhang, and A. Zisserman, Dataset Issues in Object Recognition, Chapter 2 in Toward Category-Level Object Recognition, LNCS 4170, Springer-Verlag, 29-48, 2006.
[26] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool, The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results, http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf, 2006.
[27] S. Agarwal, A. Awan, and D. Roth, The UIUC Image Database for Car Detection, Available from http://l2r.cs.uiuc.edu/~cogcomp/Data/Car/, 2002.
[28] C. K. I. Williams and F. Vivarelli, Using Bayesian neural networks to classify segmented images, Procs. of IEEE Intl. Conf. on Artificial Neural Networks, 268-273, 1997.
[29] A. Hanbury and A. Tavakoli-Targhi, A Dataset of Annotated Animals, Procs. of the Second MUSCLE / ImageCLEF Workshop on Image and Video Retrieval Evaluation, Czech Republic, 2006.
[30] A. Vailaya, A. Jain, and H. Zhang, On Image Classification: City versus Landscape, Pattern Recognition, Vol. 31, 1921-1936, 1998.
[31] J. Vogel and B. Schiele, Semantic Scene Modeling and Retrieval for Content-Based Image Retrieval, Int. J. Comput. Vis., Vol. 72:2, pp. 133-157, April 2007.
[32] J. Winn, A. Criminisi, and T. Minka, Object Categorization by Learned Universal Visual Dictionary, ICCV 05: Procs. IEEE Intl. Conf. on Comp. Vis., 1800-1807, Beijing, China, 2005.
[33] J. Shotton, J. Winn, C. Rother, and A. Criminisi, TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-Class Object Recognition and Segmentation, ECCV 06: Procs. European Conf. on Comp. Vis., 1-15, Graz, Austria, 2006.
[34] D. Martin, C. Fowlkes, D. Tal, and J. Malik, A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics, ICCV 01: Procs. IEEE Intl. Conf. on Comp. Vis., Vol. II, pp. 416-421, 2001.
[35] T. Deselaers, A. Hanbury, V. Viitaniemi, et al., Overview of the ImageCLEF 2007 Object Retrieval Task, In Advances in Multilingual and Multimodal Information Retrieval, 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, revised selected papers, 445-471, LNCS Vol. 5152, Springer, Budapest, Hungary, September 2007.
[36] T. Deselaers and A. Hanbury, The Visual Concept Detection Task in ImageCLEF 2008, Working Notes of the 2008 CLEF Workshop, Aarhus, Denmark, 2008.
[37] H. J. Escalante, C. Hernandez, J. Gonzalez, A. Lopez, M. Montes, E. Morales, E. Sucar, and L. Villaseñor, Segmenting and Annotating the IAPR-TC12 Benchmark, INAOE Technical Report, CCC-08-05, 2008, available from http://ccc.inaoep.mx/tia/.
[38] A. Hanbury, Review of Image Annotation for the Evaluation of Comp. Vis. Algorithms, Technical Report PRIP, Vienna University of Technology, 102, 2006.
[39] C. Hernandez and L. E. Sucar, Markov Random Fields and Spatial Information to Improve Automatic Image Annotation, Procs. of the 2007 Pacific-Rim Symposium on Image and Video Technology, 879-892, Springer-Verlag, 2007.
[40] I. Guyon, A. Saffari, H. J. Escalante, G. Bakir, and G. Cawley, CLOP: a Matlab Learning Object Package, Demonstration session, NIPS, Vancouver B.C., Canada, 2007.
[41] C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

