

Neurocomputing 73 (2010) 1831–1839


A fast recognition framework based on extreme learning machine using hybrid object information
Rashid Minhas*, Abdul Adeel Mohammed, Q.M. Jonathan Wu
Computer Vision and Sensing Systems Laboratory, Department of Electrical and Computer Engineering, University of Windsor, Ontario, N9B 3P4, Canada

Article info
Available online 19 March 2010
Keywords: Extreme learning machine; Object recognition; Ferns; Two-dimensional PCA; Object classification

Abstract

This paper presents a new supervised learning scheme which uses hybrid information, i.e. global and local object information, for accurate identification and classification at considerably high speed in both the training and testing phases. The first contribution of this paper is a unique image representation using bidirectional two-dimensional PCA and a Ferns style approach to represent the global and local information of an object, respectively. Secondly, the application of the extreme learning machine supports reliable recognition with minimum error and a learning speed approximately thousands of times faster than that of traditional neural networks. The proposed method is capable of classifying various datasets in a fraction of a second, compared to other modern algorithms that require at least 2–3 s per image [14].
© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Object recognition or categorization is the task of classifying an individual object as belonging to a certain category. Automated vision systems, in general, do not perform categorization better than humans due to a lack of intelligence and knowledge. Despite some early success in automatic recognition, the problem is far from being solved due to preserved non-planar geometry and significant 3D depth variations in images of natural scenes.

Image databases are an essential part of recognition research. For comparison of emerging algorithms, a number of common publicly available databases have been established, such as UIUC, Caltech, MIT, GRAZ and PASCAL. These databases have provided a common ground for the evaluation and assessment of fresh algorithms. Detecting objects in measurements is a complicated task owing to the enormously large number of possible poses, appearances under varying image acquisition conditions, and occlusions (see Fig. 1 for sample images). The computer vision community has been following a line of investigation to develop algorithms which can efficiently detect features, global or local, and regions for robust recognition of objects of interest. Each recognition method proposed in the past has its own merits and limitations; in general, common approaches use image databases which contain the object of interest at a perceptible scale with minor deformations/pose variations. Feature extraction and representation of significant objects in an incoming image using the

* Corresponding author. Tel.: +1 519 253 3000 x2580; fax: +1 519 971 3695.

E-mail addresses: minhasr@uwindsor.ca (R. Minhas), mohammea@uwindsor.ca (A.A. Mohammed), jwu@uwindsor.ca (Q.M. Jonathan Wu).
0925-2312/$ - see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.neucom.2009.11.049

generated features is an initial step towards visual recognition. Next, a classifier is trained using the representation established at the earlier stage. Popular classifiers include support vector machines (SVM), the Bayes classifier, Fisher linear discriminant, traditional neural networks, and hidden Markov models (HMM), to name a few. For the above classifiers, degraded classification is observed due to the non-convex feature space caused by images captured under different geometric and lighting environments. Unfortunately, categorization turns out to be a complicated chore due to noticeable changes in appearance and other deformations caused by variations in the scene depth. Classification schemes use a wide variety of features, such as color, texture, orientation, blobs, center of gravity and the mutual geometric relationship amongst feature points, to learn a classifier. Visual recognition frameworks range from constellations of local features [1,2] and complex geometric models [3] to the use of motion cues [4,7]. Object categorization schemes with smaller variations in pose [1,5,6] and manual pre-segmentation of objects to minimize the computational cost have also been proposed [8]. Part based schemes [1,5,9,13,22] represent object structure using patches covering distinctive parts of an object. Such patches are extracted from the neighborhood of interest points detected using localized operators like the Harris corner detector. In [14] Ali and Shah proposed a promising approach that uses the global structure of an object by modeling the nonlinear subspace of categories using kernel PCA (KPCA) and selecting a discriminative feature set with the AdaBoost algorithm. Opelt et al. [13] used multiple kinds of features to encode the extracted patches and later used an AdaBoost framework to select the best features for categorization. Performance deterioration is observed in approaches which use only global information, especially in images with considerably large background clutter, geometric deformations,


Fig. 1. Sample images from Caltech database.

and occlusions. The correlation amongst neighborhood pixels is also ignored due to the vectorization of an image during dimensionality reduction operations such as PCA and kernel PCA. On the other hand, part based recognition schemes are computationally expensive, requiring a significantly large number of training samples (extracted/synthesized for different view points), which eventually leads to a momentous increase in computational cost.

We propose a hybrid approach for recognition that combines global and local object information for robust and reliable recognition. The use of two-dimensional PCA (2D-PCA) [19] along mutually orthogonal directions is proposed to encode the global information of an image, which better preserves the association amongst neighboring pixels. Recently, [20] used multidimensional PCA for face recognition. The use of general tensor discriminant analysis (GTDA) for gait recognition, proposed by Tao et al. [26], is proven to generate improved accuracy with a minimal undersample problem during classification. In [27,28] incremental and supervised approaches for tensor analysis have been proposed which generate better-quality recognition with structure preserving processing of higher dimensional data, similar to bidirectional 2D-PCA. However, our method differs from [26] in two ways: first, we use hybrid object information for classification; secondly, we synthesize images for various affine deformations to train our classifiers for potential object views. For local object information, feature vectors are generated from patches around stable feature points detected using the Harris corner detector. Multiple views of such patches are generated through affine deformations, which results in a considerably increased number of training samples. Later, the extreme learning machine (ELM) [15] is applied for recognition using both kinds of feature vectors, i.e. global and local information. The use of any other supervised learning framework, such as a neural network or AdaBoost, may require longer training intervals due to their specific learning strategies, while ELM can finish a similar training task at a speed approximately thousands of times faster than traditional neural networks with minimum training error. ELM has been successfully applied to multicategory classification, such as microarray gene expression for cancer diagnosis [16] and classification of music genres [17]. Our proposed method combines the strengths of both types of features and exploits highly discriminative feature sets for classification using ELM. A wide variety of experiments using standard datasets are presented to ascertain the superior performance of our proposed scheme over other state-of-the-art methods.

2. Extreme learning machine

Feedforward neural networks (FNN) have been widely used in different areas due to their approximation capabilities for nonlinear mappings. It is a well known fact that the slow learning speed of FNN has been a major bottleneck in different applications. In past theoretical research work, the input weights and hidden layer biases had to be adjusted using some parameter tuning approach such as gradient descent based methods. However, gradient descent based learning techniques are generally time-consuming due to inappropriate learning steps and the significantly large latency to converge to a local minimum. Huang et al. [15,18] showed that a single-hidden layer feedforward neural network (SLFNN), also termed ELM in their work, can exactly learn N distinct observations for almost any nonlinear activation function with at most N hidden nodes (see Fig. 2). Contrary to the popular thinking that network parameters need to be tuned, the input weights and first hidden layer biases need not be adjusted; they are randomly assigned. Such an approach has been proven to perform learning at an extremely fast speed and to obtain good generalization performance for activation functions that are infinitely differentiable in the hidden layers. ELM transforms the learning problem into a simple linear system whose output weights can be analytically determined through a generalized inverse operation of the hidden layer output matrix. Such a learning scheme operates at considerably faster speed than the learning strategies of traditional frameworks. The improved generalization performance of ELM, with the smallest training error and norm of weights, demonstrates its superior classification capability for real-time applications at an exceptionally fast pace without any learning bottleneck.

Fig. 2. Simplified structure of ELM: p input neurons, L hidden layer neurons and m output neurons feeding the decision function.

For N arbitrary distinct samples $(x_i, \gamma_i)$, where $x_i = [x_{i1}, x_{i2}, \ldots, x_{ip}]' \in R^p$ and $\gamma_i = [\gamma_{i1}, \gamma_{i2}, \ldots, \gamma_{im}]' \in R^m$ (the superscript ' represents the transpose), a standard ELM with L hidden nodes and an activation function g(x) is modeled by

$\sum_{i=1}^{L} \beta_i\, g(w_i \cdot x_l + b_i) = o_l, \quad l \in \{1,2,3,\ldots,N\},$   (1)

where $w_i = [w_{i1}, w_{i2}, \ldots, w_{ip}]'$ and $\beta_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]'$ represent the weight vectors connecting the input nodes to the i-th hidden node and the i-th hidden node to the output nodes, respectively, $b_i$ denotes the threshold of the i-th hidden node, and $w_i \cdot x_l$ represents the inner product of $w_i$ and $x_l$. The above modeled ELM can reliably


approximate N samples with zero error, i.e.

$\sum_{l=1}^{N} \| o_l - \gamma_l \| = 0,$   (2)

$\sum_{i=1}^{L} \beta_i\, g(w_i \cdot x_l + b_i) = \gamma_l, \quad l \in \{1,2,\ldots,N\}.$   (3)

The above N equations can be written compactly as $U\beta = \Gamma$, where $\beta = [\beta_1', \ldots, \beta_L']'_{L \times m}$ and $\Gamma = [\gamma_1', \ldots, \gamma_N']'_{N \times m}$. In this formulation U is called the hidden layer output matrix of ELM, where the i-th column of U is the output of the i-th hidden node with respect to the inputs $x_1, x_2, \ldots, x_N$. If the activation function g is infinitely differentiable, the number of hidden nodes can be chosen such that $L \ll N$. Thus U is represented as

$U(w_1,\ldots,w_L, b_1,\ldots,b_L, x_1,\ldots,x_N) = \big[\, g(w_i \cdot x_l + b_i) \,\big]_{N \times L}.$   (4)

The training of ELM requires minimization of an error function e in terms of the defined parameters,

$e = \sum_{l=1}^{N} \Big( \sum_{i=1}^{L} \beta_i\, g(w_i \cdot x_l + b_i) - \gamma_l \Big)^2,$   (5)

where it is sought to minimize the error $e = \| U\beta - \Gamma \|$. Traditionally, the unknown parameters are determined using a gradient descent based scheme in which the weight vector W, a combination of the $w_i$, $\beta_i$ and the bias parameters $b_i$, is tuned iteratively by

$W_k = W_{k-1} - \eta\, \frac{\partial e(W)}{\partial W}.$   (6)

In the above relation, the learning rate $\eta$ significantly affects the accuracy and learning speed; a small value of $\eta$ causes the learning algorithm to converge at a significantly slower rate, whereas a larger learning step leads to instability and divergence. Huang et al. proposed a minimum norm least-squares solution for ELM [15] to avoid the aforementioned limitations encountered in the conventional learning paradigm; it states that the input weights and the hidden layer biases can be randomly assigned if the activation function is infinitely differentiable. It is an interesting solution: instead of tuning the entire set of network parameters, such random allocation helps to analytically determine the hidden layer output matrix U. For the fixed network parameters, the learning of ELM simply amounts to finding a least-squares solution of

$\| U(\hat{w}_1,\ldots,\hat{w}_L, \hat{b}_1,\ldots,\hat{b}_L)\, \hat{\beta} - \Gamma \|$   (7)

$= \min_{w_i, b_i, \beta} \| U(w_1,\ldots,w_L, b_1,\ldots,b_L)\, \beta - \Gamma \|.$   (8)

For a number of hidden nodes $L \ll N$, U is a non-square matrix, and the minimum norm least-squares solution of the above linear system becomes

$\hat{\beta} = U^{\dagger} \Gamma,$   (9)

where $U^{\dagger}$ is the Moore–Penrose generalized inverse of the matrix U. It should be noted that the above relationship holds for a non-square matrix U, whereas the solution is straightforward for $N = L$. The smallest training error is achieved using the above model since it represents a least-squares solution of the linear system $U\beta = \Gamma$:

$\| U\hat{\beta} - \Gamma \| = \| U U^{\dagger} \Gamma - \Gamma \| = \min_{\beta} \| U\beta - \Gamma \|.$   (10)
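To make the training procedure concrete, the following is a minimal sketch of an ELM classifier written for this discussion in Python/NumPy (the paper's experiments used Matlab, and this is not the authors' code): the hidden parameters are drawn at random and the output weights are obtained with the pseudo-inverse, as in Eqs. (7)-(10). All function and variable names are illustrative assumptions.

import numpy as np

def elm_train(X, T, L, rng=None):
    """X: N x p inputs, T: N x m targets (e.g. +/-1 labels), L: hidden nodes."""
    rng = np.random.default_rng(0) if rng is None else rng
    p = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(L, p))    # random input weights w_i
    b = rng.uniform(-1.0, 1.0, size=L)         # random hidden biases b_i
    U = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))   # hidden layer output matrix (sigmoid g)
    beta = np.linalg.pinv(U) @ T               # output weights via Moore-Penrose inverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    U = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return U @ beta                            # decision values; their sign gives the class

# Toy usage: 200 random 36-dimensional feature vectors with +/-1 labels.
X = np.random.randn(200, 36)
T = np.sign(X[:, :1].sum(axis=1, keepdims=True))
W, b, beta = elm_train(X, T, L=60)
print(np.mean(np.sign(elm_predict(X, W, b, beta)) == T))

Because there is no iterative tuning, training cost is dominated by one pseudo-inverse of the N x L hidden-layer output matrix, which is what gives ELM its speed advantage over gradient-based learning.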

3. Global feature vector computation

The Karhunen–Loeve expansion, also known as principal component analysis (PCA), is a data representation technique widely used in pattern recognition and compression schemes. In [19], Yang et al. proposed two-dimensional PCA for image representation. As opposed to PCA, 2D-PCA is based on 2D image matrices rather than 1D vectors, therefore the image matrix does not need to be vectorized prior to feature extraction. An image covariance matrix is constructed directly from the original image matrices. Let X denote an M-dimensional unitary column vector. To project a $Q \times M$ image matrix A on X, a linear transformation $Y = AX$ is used, which results in a Q-dimensional projected vector Y. The total scatter of the projected data is introduced to measure the discriminatory power of a projection vector X. The total scatter can be characterized by the trace of the covariance matrix of the projected feature vectors, i.e. $J(X) = \mathrm{tr}(S_x)$, where tr(.) represents the trace of a matrix and $S_x$ denotes the covariance matrix of the projected feature vectors. The covariance matrix $S_x$ can be computed as

$S_x = E\big[(Y - EY)(Y - EY)'\big]$   (11)

$= E\big[(A - EA)X\big]\big[(A - EA)X\big]',$   (12)

$\mathrm{tr}(S_x) = X'\, E\big[(A - EA)'(A - EA)\big]\, X.$   (13)

The image covariance matrix is defined as $G_t = E[(A - EA)'(A - EA)]$. It is easy to verify that $G_t$ is an $M \times M$ non-negative definite matrix. Suppose that there are P training image samples, with the j-th sample of size $Q \times M$ denoted by $A_j$, where $1 \le j \le P$. Then $G_t$ is computed by

$G_t = \frac{1}{P} \sum_{j=1}^{P} (A_j - \bar{A})'(A_j - \bar{A}),$   (14)

$J(X) = X'\, G_t\, X,$   (15)

where $\bar{A}$ represents the average image of all training samples. The above criterion is called the generalized total scatter criterion. The unitary vector X that maximizes the criterion is called the optimal projection axis. We are usually required to select a set of projection axes $X_1, X_2, \ldots, X_d$ (where the subscript d is a scalar representing the number of dimensions), subject to an orthonormality constraint and maximizing the criterion J(X). Yang et al. [19] showed that extraction of image features using 2D-PCA is computationally efficient and achieves better recognition accuracy than traditional PCA. However, the main limitation of 2D-PCA based recognition is the processing of a higher number of coefficients, since it works in the row direction only. Zhang and Zhou [21] proposed (2D)2PCA based on the assumption that the training sample images are zero mean and that the image covariance matrix can be computed from the outer products of the row/column vectors of the images. We propose a modified bidirectional 2D-PCA to extract features by computing two image covariance matrices of the square training samples in their original and transposed forms, respectively, while the training image mean need not necessarily be zero. The vectorization of the mutual product of these covariance matrices results in a considerably smaller feature vector which retains better structural and correlation information amongst neighboring pixels. Fig. 3 shows the better ability of bidirectional 2D-PCA to represent the global structure of various object categories. Figs. 3(a and b) are plotted using the Caltech (Airplanes and Leaves) and MIT (Cars and Pedestrians) datasets, respectively, whereas the Caltech (Airplanes and Motorbikes) datasets have been used for Figs. 3(c and d). The first two components of the feature vectors obtained using bidirectional 2D-PCA and kernel PCA are plotted against each other. In Fig. 3(a) and (b), we observe classes that are close to convex for the Caltech and MIT datasets; this validates our claim that 2D-PCA achieves superior categorization (see Table 2) for these datasets. For kernel PCA (Fig. 3(d)), it is quite clear that the first two components, representing the largest eigenvalues, are almost identical, and the analogous overlap of these feature vectors may lead to poor classification. For the same datasets, the use of bidirectional 2D-PCA generates classes which are partly converged as shown in


Fig. 3. Non-linearity captured among various datasets using bidirectional 2D-PCA (a-c, left to right and top to bottom) and KPCA (d, bottom right).

Table 1
Time spent for training and testing using ELM on global feature vectors.

            Planes         Background     Cars           Bikes          Faces          Leaves
            TT     CT      TT     CT      TT     CT      TT     CT      TT     CT      TT     CT
Planes      NA     NA      0.078  0.062   0.140  0.062   0.078  0.078   0.078  0.062   0.094  0.047
Background  0.094  0.078   NA     NA      0.047  0.047   0.094  0.047   0.109  0.031   0.156  0.016
Cars        0.156  0.094   0.031  0.047   NA     NA      0.125  0.047   0.140  0.031   0.125  0.031
Bikes       0.125  0.078   0.078  0.047   0.094  0.047   NA     NA      0.109  0.047   0.078  0.047
Faces       0.094  0.047   0.140  0.031   0.109  0.031   0.094  0.047   NA     NA      0.109  0.016
Leaves      0.094  0.047   0.109  0.016   0.125  0.016   0.140  0.047   0.094  0.016   NA     NA

TT: training time (s) and CT: classification time (s).

Table 2
Training and detection accuracy using ELM on global feature vectors.

            Planes         Background     Cars           Bikes          Faces          Leaves
            TA     CA      TA     CA      TA     CA      TA     CA      TA     CA      TA     CA
Planes      NA     NA      98     86      100    100     99.5   91.2    100    90.6    100    97.6
Background  96     79.8    NA     NA      100    100     97.5   73.7    99.5   82.6    99     92
Cars        100    100     100    100     NA     NA      100    100     100    100     100    100
Bikes       99.5   91.6    99     80.1    100    100     NA     NA      100    93.8    100    95
Faces       100    97.1    99.5   87      100    100     100    91.3    NA     NA      100    92.7
Leaves      100    97.2    97.5   85.8    100    100     100    96.8    100    95.4    NA     NA

TA: training accuracy (%) and CA: classification accuracy (%).


Fig. 3(c). Table 1 demonstrates the extremely fast classification capability of ELM using global features. However, the accuracy achieved using these global features is not stable and varies with the selection of datasets in different combinations to represent the positive and negative classes during recognition (see Table 2 for mutual combinations of Caltech Airplanes, Caltech Background and Caltech Motorbikes); due to space limitations we do not present similar results for the MIT and GRAZ datasets. Therefore, we propose to combine complementary information, i.e. local feature vectors, along with the global contents of an image to attain reliable classification which is independent of view point changes and dataset combinations. It is worth mentioning that all Matlab implementations of our experiments are executed on a desktop computer equipped with an Intel Core 2 Duo processor running at 2.6 GHz and 2 GB of RAM.
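As an illustration of the global feature computation, the following Python/NumPy sketch builds the row-direction and column-direction image covariance matrices of Eq. (14) and projects each image onto the leading axes of both, in the spirit of (2D)2PCA [21], which the paper's bidirectional 2D-PCA modifies. The exact way the two covariance matrices are combined in the published method (the text mentions their mutual product) is not fully specified, so this is our illustrative reading rather than the authors' implementation; names and the choice d = 6 (giving 36-dimensional features, as used in Section 6) are assumptions.

import numpy as np

def bidirectional_2dpca_features(images, d=6):
    """images: P x M x M array of square grayscale training images.
    Returns P x (d*d) global feature vectors plus the learned projection."""
    A = np.asarray(images, dtype=float)
    mean = A.mean(axis=0)                        # average training image
    Ac = A - mean
    # Row-direction and column-direction image covariance matrices (Eq. (14)).
    G_row = np.einsum('pqm,pqn->mn', Ac, Ac) / len(A)
    G_col = np.einsum('pmq,pnq->mn', Ac, Ac) / len(A)
    # Leading eigenvectors give the projection axes; eigh sorts eigenvalues ascending.
    _, X = np.linalg.eigh(G_row)
    _, Z = np.linalg.eigh(G_col)
    X_d, Z_d = X[:, -d:], Z[:, -d:]
    # Project every image to a small d x d matrix and vectorize it.
    feats = np.einsum('mi,pmn,nj->pij', Z_d, A, X_d).reshape(len(A), -1)
    return feats, (mean, X_d, Z_d)

The resulting small feature vectors keep row- and column-wise neighborhood structure without ever flattening the raw images, which is the property the text attributes to bidirectional 2D-PCA.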

4. Local feature vector extraction

Identifying textured patches that are distinctive and detectable under varying pose and lighting conditions in the neighborhood of stable feature points is a widely researched area with numerous applications. Different strategies have been proposed to use local patches or contour based information for object detection, with features being shared among different classes [4,22-25]. This paper proposes a new object detection scheme where partial object information is represented using feature vectors computed from local patches surrounding a set of stable feature points. Such feature vectors are used as complementary information to enhance classification accuracy. A semi-naive Bayesian classifier, recently proposed by Ozuysal et al. [12], is used to determine the class of the local patches surrounding stable key points. The solution to the patch correspondence problem provided in [12] shows promising results, comparable to the state-of-the-art, yet remains simple by exploiting statistical information of pixel intensities. To detect preliminary stable key points, randomly selected images of a specific class are deformed and the Harris corner detector is applied. We select the Harris corner detector for its simple and efficient implementation to detect key points with minimal computational burden compared to other schemes such as SIFT, complex filters, PCA-SIFT, and

cross-correlation [10,11,29]. The parameters for the affine deformations are randomly picked from a uniform distribution, therefore two images of a similar object may have been warped differently at two different time instances. The corners identified for the deformed set of images change based on the chosen parameters and background clutter. The authors do realize that the rising number of affine-warped images can lead to a higher computational load; however, the comprehensive training of our algorithm to mimic possible appearances of local patches of an object generates improved recognition. On the other hand, such a pronounced computational cost is faced only in training, whereas the recognition of an incoming image during the testing phase is an undemanding and high-speed procedure. Fig. 4 shows the different feature points detected at varying deformations; two sample images from the Caltech Airplanes and Motorbikes datasets are used, and it is obvious that the number of detected feature points changes across different transformations. A list is maintained to keep track of the points which have been repeated the most, in order to extract local patches. We declare the identified key points as stable if their rate of recognition is more than 75% of the total number of deformations. A high variation in feature point detection, i.e. corners, is observed; please refer to the histogram presented in Fig. 5, where the number of identified key points changes significantly for varying affine parameters. For practicality, white noise is also added so that the patch is processed in conditions akin to a real life situation. Patches of size 16 x 16 surrounding the stable key points are extracted, whereas the deformed versions of the images help to synthesize the possible appearances under varying poses. Fig. 5 (left) shows extracted patches (blue squares), with a height and width of 16 pixels each, using an illustrative image from the Caltech (Airplanes) dataset; please note that the recognized stable key points are shown in green whereas outliers are shown in red. After extraction of the local patches, assigning each patch to the most probable object class is the subsequent task. Let $c_i$, $i = 1,2,\ldots,H$ be a set of classes and $f_j$, $j = 1,2,\ldots,Z$ be a set of binary features to be computed from the extracted patches. We want to classify a patch based upon its binary features as

$\hat{c}_i = \arg\max_{c_i} P(C = c_i \mid f_1, f_2, \ldots, f_Z),$   (16)

Fig. 4. Preliminary key points detected under varying poses.
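The stable-keypoint selection described above (random affine warps, Harris corners, and a 75% repeatability threshold, as illustrated in Fig. 4) can be outlined as in the Python/OpenCV sketch below. It is an illustrative reconstruction under our own assumptions: the warp parameter ranges, corner-detector settings and matching tolerance are not specified in the paper, and the function names are hypothetical.

import cv2
import numpy as np

def stable_keypoints(img, n_warps=20, keep_ratio=0.75, rng=None):
    """img: grayscale uint8 image. Count how often each Harris corner reappears
    under random affine warps; keep corners seen in >= keep_ratio of the warps."""
    rng = np.random.default_rng(0) if rng is None else rng
    h, w = img.shape
    base = cv2.goodFeaturesToTrack(img, 500, 0.01, 5,
                                   useHarrisDetector=True)      # assumes corners are found
    base = base.reshape(-1, 2)
    votes = np.zeros(len(base))
    for _ in range(n_warps):
        # Random affine deformation: rotation, scale and a small translation.
        angle = rng.uniform(-30, 30)
        scale = rng.uniform(0.8, 1.2)
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
        M[:, 2] += rng.uniform(-5, 5, size=2)
        warped = cv2.warpAffine(img, M, (w, h))
        pts = cv2.goodFeaturesToTrack(warped, 500, 0.01, 5,
                                      useHarrisDetector=True)
        if pts is None:
            continue
        # Map warped corners back to the original frame and vote for the
        # nearest base corner within a 3-pixel tolerance.
        back = cv2.transform(pts, cv2.invertAffineTransform(M)).reshape(-1, 2)
        d = np.linalg.norm(base[:, None, :] - back[None, :, :], axis=2)
        votes += (d.min(axis=1) < 3.0)
    return base[votes >= keep_ratio * n_warps]

The surviving corners are the stable key points around which the 16 x 16 patches are cropped for fern computation.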



Fig. 5. Left: extracted local patches for Ferns computation; right: histogram of identified key points for varying affine deformation parameters.

Fig. 6. Different steps involved in our algorithm: training and test images undergo global and local feature extraction, the resulting feature sets feed parallel ELMs (ELM (1) to ELM (n)), and majority voting followed by a fusion step produces the final labeled image.

By Bayes' rule,

$P(C = c_i \mid f_1, f_2, \ldots, f_Z) = \frac{P(f_1, f_2, \ldots, f_Z \mid C = c_i)\, P(C = c_i)}{P(f_1, f_2, \ldots, f_Z)}.$   (17)

Assuming a uniform prior probability P(C) and treating the denominator $P(f_1, f_2, \ldots, f_Z)$ as a scaling factor, our problem reduces to

$\hat{c}_i = \arg\max_{c_i} P(f_1, f_2, \ldots, f_Z \mid C = c_i).$   (18)

The computation of each binary feature $f_j$ depends upon the mutual relationship of two pixel intensities located at $d_{j,1}$ and $d_{j,2}$ in the patch,

$f_j = \begin{cases} 1 & \text{if } I(d_{j,1}) < I(d_{j,2}), \\ 0 & \text{otherwise}, \end{cases}$   (19)

where I(.) represents an image patch. Assuming complete independence between the features leads us to

$P(f_1, f_2, \ldots, f_Z \mid C = c_i) = \prod_{j=1}^{Z} P(f_j \mid C = c_i).$   (20)

However, the correlation amongst neighboring pixels of a patch is then ignored, hence an acceptable compromise can be modeled as

$P(f_1, f_2, \ldots, f_Z \mid C = c_i) = \prod_{k=1}^{M} P(F_k \mid C = c_i),$   (21)

where M represents the number of feature clusters of size S = Z/M each, and a fern $F_k$ is represented by

$F_k = \{ f_{\sigma(k,1)}, f_{\sigma(k,2)}, \ldots, f_{\sigma(k,S)} \},$   (22)

where $\sigma(k,s)$ is a random permutation function with range $1, \ldots, Z$. A reliable and fast patch correspondence using the above relationship is reported in [12]. A trade-off between performance and computational load is observed for varying values of M and S. In the training phase, class conditional probabilities for the individual ferns are estimated and combined to label the corresponding extracted patches. We generate (M + 1)-dimensional local feature vectors for the individual patches, which comprise the conditional probabilities of the mutually independent ferns and their combined information used to compute the conditional probability of a local patch. Finally, such feature vectors are used for training and testing with ELM as the classifier working on the local type of features.
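A minimal sketch of the fern evaluation in Eqs. (19)-(22) is given below in Python/NumPy. The pixel-pair locations, the add-one smoothing of the probability estimates and all names are our own assumptions for illustration; this is neither the authors' implementation nor the exact scheme of [12].

import numpy as np

class Ferns:
    def __init__(self, n_ferns=10, fern_size=8, patch=16, n_classes=2, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        Z = n_ferns * fern_size
        # Random pixel pairs (d_j1, d_j2) shared by all patches; Z binary tests in total.
        self.pairs = rng.integers(0, patch * patch, size=(Z, 2))
        self.M, self.S, self.C = n_ferns, fern_size, n_classes
        # Per-fern tables of counts for P(F_k = value | C = c), add-one smoothed.
        self.counts = np.ones((self.M, n_classes, 2 ** fern_size))

    def _fern_values(self, flat_patch):
        bits = (flat_patch[self.pairs[:, 0]] < flat_patch[self.pairs[:, 1]]).astype(int)
        bits = bits.reshape(self.M, self.S)
        return bits @ (1 << np.arange(self.S))      # integer index of each fern

    def train(self, patches, labels):
        for p, y in zip(patches, labels):
            for k, v in enumerate(self._fern_values(p.ravel())):
                self.counts[k, y, v] += 1

    def log_posterior(self, patch):
        probs = self.counts / self.counts.sum(axis=2, keepdims=True)
        vals = self._fern_values(patch.ravel())
        # Sum of log P(F_k | C = c) over the ferns, Eq. (21) in log form.
        return np.array([np.log(probs[np.arange(self.M), c, vals]).sum()
                         for c in range(self.C)])

Taking the argmax of log_posterior realizes the decision of Eq. (18); in the proposed framework, the per-fern probabilities and their combination instead populate the (M + 1)-dimensional local feature vector that is handed to an ELM.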

5. Learning classifiers with parallel ELMs

The classification algorithm is provided with a set of training images, where a positive label indicates that an object of interest is present in an image while a negative label represents its absence. All images are converted to gray level and resized to square matrices. No further pre-processing is applied to the datasets, and we assume no prior information about location, view point and/or image acquisition constraints. To avoid the curse of dimensionality, bidirectional 2D-PCA is employed, which requires multiplication between two covariance matrices (one for each of the row and column directions). The output of this dimensionality reduction step is a square matrix which is vectorized and termed the global feature vector.


The proposed recognition scheme declares an incoming image as positive class if the relevant object is present. For fast pre-processing, direct intensity values are used to extract both kinds of features, i.e. global and local. Generating a variety of features representing various contents of an image allows us to handle the varying geometric attributes of an object and to achieve better categorization. Fig. 6 represents a generalized framework that supports integration of a wide variety of learnable local descriptors for enhanced classification. We used only a single type of local feature vector, generated using Ferns [12] style patches surrounding stable feature points identified with the Harris corner detector. Due to the significantly shorter training time and minimized computational burden, we use more than one ELM in a parallel fashion to process all categories of image features simultaneously for real-time classification. The training process for an ELM operating on global feature vectors is not the same as for one using a set of local features. Computed global training feature vectors are directly input to an ELM, whereas training for an ELM that deals with local features starts with the application of the corner detector to deformed training images while keeping track of the number of times the same feature point is identified. Such image deformations are suggested to train our classifier for possible pose variations, and this proves to be feasible due to the fundamentally high training speed (see Tables 1 and 3). The number of feature vectors representing the local patches of an image may vary depending upon the stable key points detected from the set of images synthesized for different affine deformations. A majority voting scheme is adopted for reliable estimation of an image class, since we observe false alarms for individual local patches due to their low information content and accidental matching among different regions of

two different objects. The ELM operating on global feature vectors does not require any such voting scheme, since a one-to-one correspondence holds between feature vectors and individual images. Finally, a fusion process is initiated to combine the n estimates originating from all ELMs based on a normalized weighted sum strategy that allows us to assign importance to each approximation based on confidence:

$f_j = \begin{cases} 1 & \text{for } \sum_{i=1}^{n} w_i \cdot e_i \ge Th, \\ -1 & \text{otherwise}, \end{cases} \qquad \sum_{i=1}^{n} w_i = 1,$   (23)

where $w_i$, $e_i$ and Th represent the weight, the estimate of an individual ELM and the threshold, respectively. In our proposed framework the user has better control over the preference given to an individual feature type, since the weighting strategy for different feature types depends solely upon the application and the photometric and geometric elements of the individual objects. The value of the threshold Th may vary between zero and one depending upon the required confidence. It is obvious that a high value of Th may result in increased reliability of classification with fewer false positives but an increased chance of false negative alarms. During the experiments, we assigned a weight value of 0.5 to each of the estimates originating from the two ELMs operating on the two kinds of feature vectors, whereas a threshold of 0.75 was set for the final classification. The different steps involved in our proposed algorithm are presented in Fig. 6.
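The fusion rule of Eq. (23), together with the per-patch majority vote that produces the local-feature ELM's estimate, could be coded as follows. This is an illustrative Python sketch under our assumptions (equal weights of 0.5 and a threshold of 0.75, as used in the experiments; estimates expressed as confidences in [0, 1]), not the authors' implementation.

import numpy as np

def local_estimate(patch_scores):
    """Fraction of local patches whose ELM output votes for the positive class."""
    return float(np.mean(np.asarray(patch_scores) > 0))

def fuse(estimates, weights=(0.5, 0.5), th=0.75):
    """Eq. (23): normalized weighted sum of per-ELM estimates, thresholded to +/-1."""
    estimates = np.asarray(estimates, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()           # enforce sum(w_i) = 1
    return 1 if float(weights @ estimates) >= th else -1

# Example: global ELM is fairly confident (0.9), 70% of local patches vote positive.
print(fuse([0.9, 0.7]))   # 1, since 0.5*0.9 + 0.5*0.7 = 0.8 >= 0.75

Raising th toward 1 trades false positives for more rejected-but-correct detections, which is the behavior analyzed in Section 6.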

6. Results and discussion

We used standard datasets to test the viability of our proposed method. The datasets from Caltech include Airplanes, Cars (Brad), Faces, Leaves, Background, and Motorbikes, whereas the GRAZ image sets comprise Bikes, Cars, and Persons, and the MIT image sets comprise Cars and Pedestrians.

Table 3
Computational time (s) for the GRAZ dataset.

                      Bikes   Cars    Persons
Training time
  Bikes               NA      4.29    4.31
  Cars                4.30    NA      4.34
  Persons             4.30    4.37    NA
Classification time
  Bikes               NA      3.21    2.93
  Cars                3.25    NA      3.01
  Persons             2.91    2.96    NA

Table 4
Details of the datasets used in the analysis of changing threshold against accuracy.

          Positive class   Negative class   Training images   Testing images
Caltech   Leaves           Faces            200               436
MIT       Pedestrians      Cars             300               1140
GRAZ      Bikes            Persons          200               476

Fig. 7. Classification accuracy for the MIT dataset using varying numbers of principal components.


Fig. 8. Performance analysis for changing threshold.

Table 3 presents the CPU time required for classification of the GRAZ datasets using 53 712, 53 455, and 46 495 features for the Bikes, Cars and Persons datasets, respectively. It is clear from the allocated CPU time that our proposed algorithm performs categorization at a tremendously swift speed. In addition to speed, accuracy is another critical issue in judging the competence of a classifier. A number of experiments using challenging categorization image sets have been performed to analyze the performance of our ELM based classifier; the classification results for the MIT and Caltech datasets are presented here for comparison of classification accuracy. Fig. 7 shows the accuracy achieved for classification of the MIT datasets (Cars and Pedestrians used as positive and negative classes, respectively). Above 95% categorization accuracy is achieved for the MIT database using multiple principal components; it is also realized that increasing the number of principal components is not a solution for improving detection accuracy. Our experiments are based on a binary classification problem; however, they can be extended to the multiclass scenario with minor modifications. It is an interesting aspect to investigate the impact of the changing threshold Th on accurate classification. The optimal value of Th can help to minimize false alarms and precisely identify the object class present in an input image. Since we do not obtain classification with perfect confidence, because of noisy measurements and various distorting parameters during image acquisition, we try to estimate an acceptable compromise between accuracy and confidence. Setting Th to a value of 1 may lead to rejection of correct classifications with lower confidence, whereas increased false alarm ringing is witnessed for lower values of the threshold. Various experiments were conducted with changing values of Th to obtain an optimal solution; the number of principal components used to represent the global feature vectors is fixed to 36, whereas the numbers of training and test images differ for specific datasets based on their respective sizes. For these trials, we randomly picked Leaves and Faces, Pedestrians and Cars, and Bikes and Persons to represent the positive and negative classes from the Caltech, MIT and GRAZ datasets, respectively. The number of images for both classes of objects also varies with the datasets; readers may refer to Table 4 for further details. The accuracy achieved for the various image sets is presented in Fig. 8 for varying values of the threshold Th. The proposed method performs considerably well for the MIT dataset under a changing threshold; however, an inconsistent classification is observed for the rest of the datasets included in our experiments. One may notice that the deviation in achieved correctness is higher for the combined results of all three input sets. However, a decrease in deviation amongst the generated classifications is observed for $0.7 \le Th \le 0.8$. Such a smaller

Table 5
Accuracy comparison for different approaches.

Dataset(s): Bikes, Planes, Cars, Leaves, Faces
Our method (%): 94.6, 95.3, 99.0, 98.3, 97.9
[14] (%): 93.4, 90.0, 96.0, 94.2, 98.0
[25] (%): 92.5, 90.2, 90.3, 96.4 (one dataset entry is missing in the source)
[1] (%): 73.9, 92.7, 97.0, 97.8 (one dataset entry is missing in the source)

deviation value amongst the classifications for all datasets represents the steady performance of our scheme for images with differing geometric and radiometric variations. Therefore one may conclude, based on experimental evidence, that setting Th to a very low value or very close to 1 may lead to rising false alarms or to the rejection of correctly classified objects with lower confidence under uncertain conditions. The Caltech dataset is used to test the performance of our proposed framework against the state-of-the-art. Images are randomly selected from various Caltech datasets to represent the negative class. For the Caltech datasets, Table 5 presents an accuracy comparison of categorization for different modern algorithms; our proposed method achieves an average accuracy above 97% and outperforms other well-established schemes.

7. Conclusion

We present a novel supervised learning algorithm for object detection and categorization that combines the strengths of both global and local features, and we demonstrate its considerably high speed in both the training and testing phases. The proposed framework is capable of handling changes in pose, illumination, and inter-class and intra-class attributes. The proposed parallel architecture, in which each ELM module works simultaneously on a distinct feature type, formulates a potential classifier for problems requiring significantly faster and reliable categorization. Features obtained through synthesized views of the extracted local patches add further information to the classification which is partially invariant to pose and lighting conditions.

Acknowledgments

This research has been supported in part by the Canada Research Chair Program, the AUTO21 NCE and the NSERC Discovery Grant. The authors would also like to express their gratitude to


the anonymous reviewers and the editor for their valuable suggestions to improve this manuscript.

References
[1] R. Fergus, P. Perona, A. Zisserman, Object class recognition by unsupervised scale-invariant learning, in: Proceedings of the International Conference on CVPR, 2003, pp. 264-272.
[2] M. Weber, M. Welling, P. Perona, Unsupervised learning of models for recognition, in: Proceedings of the Sixth ECCV, 2000, pp. 18-32.
[3] P. Felzenszwalb, D. Huttenlocher, Pictorial structures for object recognition, IJCV 61 (1) (2004) 55-79.
[4] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the International Conference on CVPR, 2001, pp. 511-518.
[5] S. Agarwal, D. Roth, Learning a sparse representation for object detection, in: Proceedings of ECCV, 2002, pp. 113-130.
[6] B. Leibe, A. Leonardis, B. Schiele, Combined object categorization and segmentation with an implicit shape model, in: Proceedings of the ECCV Workshop on Statistical Learning in Computer Vision, 2004, pp. 17-32.
[7] P. Viola, M. Jones, D. Snow, Detecting pedestrians using patterns of motion and appearance, IJCV (2003) 734-741.
[8] G.Y. Dorko, C. Schmid, Selection of scale-invariant parts for object class recognition, in: Proceedings of ICCV, 2003, pp. 634-640.
[9] H. Schneiderman, T. Kanade, Object detection using the statistics of parts, IJCV 56 (3) (2004) 151-177.
[10] D.G. Lowe, Distinctive image features from scale-invariant keypoints, IJCV 60 (2) (2004) 91-110.
[11] K. Mikolajczyk, C. Schmid, An affine invariant interest point detector, in: Proceedings of ECCV, 2002, pp. 128-142.
[12] M. Ozuysal, V. Lepetit, P. Fua, Fast keypoint recognition using random ferns, IEEE Trans. PAMI, in press.
[13] A. Opelt, M. Fussenegger, A. Pinz, P. Auer, Weak hypotheses and boosting for generic object detection and recognition, in: Proceedings of ECCV, 2004, pp. 71-84.
[14] S. Ali, M. Shah, A supervised learning framework for generic object detection in images, in: Proceedings of ICCV, 2005, pp. 1347-1354.
[15] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing (2005) 489-501.
[16] R. Zhang, G.-B. Huang, N. Sundararajan, P. Saratchandran, Multicategory classification using an ELM for gene expression for cancer diagnosis, Comput. Biol. 1 (2007) 485-495.
[17] Q.-J. Benedict, S. Emmanuel, ELM for classification of music genres, in: Proceedings of the ICARCV, 2006, pp. 1-6.
[18] G.-B. Huang, H.A. Babri, Upper bounds on the number of hidden neurons in feedforward networks with arbitrary bounded nonlinear activation functions, IEEE Trans. Neural Networks 9 (1) (1998) 224-229.
[19] J. Yang, D. Zhang, A.F. Frangi, J.-Y. Yang, Two-dimensional PCA: a new approach to appearance-based face representation and recognition, IEEE Trans. PAMI 26 (1) (2004) 131-137.
[20] P. Sanguansat, W. Asdornwised, S. Marukatat, S. Jitapunkul, Two-dimensional random subspace analysis for face recognition, in: Proceedings of ISCIT, 2007, pp. 628-631.
[21] D. Zhang, Z.-H. Zhou, (2D)2PCA for efficient face representation and recognition, Neurocomputing 69 (2005) 224-231.
[22] R. Minhas, A.A. Mohammed, J. Wu, A generic moments invariant based supervised learning framework for classification using partial object information, in: Proceedings of the Conference on CRV, Canada, 2009, pp. 45-52.
[23] A. Opelt, A. Pinz, A. Zisserman, Incremental learning of object detectors using a visual shape alphabet, in: Proceedings of the International Conference on CVPR, 2006, pp. 3-10.
[24] J. Shotton, A. Blake, R. Cipolla, Contour-based learning for object detection, in: Proceedings of ICCV, 2005, pp. 503-510.
[25] A. Torralba, K.P. Murphy, W.T. Freeman, Sharing features: efficient boosting procedures for multiclass object detection, in: Proceedings of the International Conference on CVPR, 2004, pp. 762-769.
[26] D. Tao, X. Li, X. Wu, S.J. Maybank, General tensor discriminant analysis and Gabor features for gait recognition, IEEE Trans. PAMI 29 (10) (2007) 1700-1715.
[27] J. Sun, D. Tao, S. Papadimitriou, P.S. Yu, C. Faloutsos, Incremental tensor analysis: theory and applications, ACM Trans. Knowl. Discovery Data 2 (3) (2008) 11:1-11:37.
[28] D. Tao, X. Li, X. Wu, W. Hu, S.J. Maybank, Supervised tensor learning, Knowl. Inf. Syst. 13 (2007) 1-42.
[29] K. Mikolajczyk, C. Schmid, A performance evaluation of local descriptors, IEEE Trans. PAMI 27 (10) (2005) 1615-1630.

Rashid Minhas is a Ph.D. candidate at the Computer Vision and Sensing Systems Laboratory, Department of Electrical Engineering, University of Windsor, Canada. He completed a B.Sc. in Computer Science from BZU Multan, Pakistan, and an M.S. in Mechatronics from GIST, Korea. His research interests include object and action recognition, and image registration and fusion using machine learning and statistical techniques.

Abdul Adeel Mohammed is a Ph.D. student at the Department of Electrical Engineering, University of Windsor, Ontario, Canada. He completed his Bachelor of Engineering (2001) from Osmania University (India) and his Master of Applied Science from Ryerson University (Canada) in 2005. His main areas of research are 3D pose estimation, robotics, computer vision, image compression and pattern recognition.

Jonathan Wu (M'92, SM'09) received his Ph.D. degree in Electrical Engineering from the University of Wales, Swansea, UK, in 1990. From 1995, he worked at the National Research Council of Canada (NRC) for 10 years, where he became a Senior Research Officer and Group Leader. He is currently a Professor in the Department of Electrical and Computer Engineering at the University of Windsor, Canada. Dr. Wu holds the Tier 1 Canada Research Chair (CRC) in Automotive Sensors and Sensing Systems. He has published more than 150 peer-reviewed papers in the areas of computer vision, image processing, intelligent systems, robotics, micro-sensors and actuators, and integrated microsystems. His current research interests include 3D computer vision, active video object tracking and extraction, interactive multimedia, sensor analysis and fusion, and visual sensor networks. Dr. Wu is an Associate Editor for the IEEE Transactions on Systems, Man, and Cybernetics (Part A). He has served on the Technical Program Committees and International Advisory Committees of many prestigious conferences.
