
Pergamon

Pattern Recognition, Vol. 29, No. 6, pp. 919-935, 1996. Elsevier Science Ltd. Copyright © 1996 Pattern Recognition Society. Printed in Great Britain. 0031-3203/96 $15.00 + .00

0031-3203(95)00132-8

3D OBJECT CLASSIFICATION USING MULTI-OBJECT KOHONEN NETWORKS


J. M. CORRIDONI, A. DEL BIMBO* and L. LANDI
Dipartimento di Sistemi e Informatica, Facoltà di Ingegneria, Università degli Studi di Firenze, via S. Marta 3, 50139 Firenze, Italy

(Received 6 September 1995; received for publication 20 September 1995)


Abstract--The problem of three-dimensional planar-faced opaque object recognition from one single perspective view is addressed in this paper. A view parameterization, based on the Hough Transform, is described. Classification is accomplished by multiple multi-object Kohonen networks. A new object grouping criterion has been investigated to assign disjoint subsets of objects to each Kohonen network of the system, on the basis of the input space topology. Recognition tests are presented on both synthetic and real-world 3D objects, even in the presence of partial occlusion. Copyright 1996 Pattern Recognition Society. Published by Elsevier Science Ltd.
Three-dimensional planar-faced objects; Recognition; Kohonen networks; Multi-object system; Single perspective view

1. INTRODUCTION

Three-dimensional (3D) object recognition from digitized images is a central problem in order to provide computers with human-like capabilities.(1) Similarly to what is performed by human perception, automatic recognition systems must be able to identify 3D objects on the basis of one two-dimensional (2D) perspective view, independently from their pose. This implicitly requires that information about the object is previously gathered from several points of view, so that a view-independent model of the object can be built inside the system. On the basis of this model, the system should be able to assign the observed view to the most plausible object, irrespective of any rotation, translation or scaling being applied. Operating conditions, such as illumination sources, acquisition noise or incomplete knowledge, can increase the complexity of this task.

In the literature about 3D object recognition from single perspective views, both orthographic and perspective projections have been used, the latter being of major interest as it resembles the way human vision operates.(2) Using perspective projection, a computational method to back-project image features, such as angles and curvatures, into 3D space is described by Barnard.(3) 3D object recognition from single views using a back-projection method is proposed by Horaud.(4) The method is used to determine the possible orientations of a 3D object, by analyzing the object junctions and considering three space angles as known. Recognition using back-projections is performed through a model-to-image matching in the 3D

*Author to whom correspondence should be addressed.

space, which requires a large 3D model database and a heavy computational effort. Some object recognition systems are based on the information recoverable in the object silhouette.(5,6) The set of all perspective views of an object is partitioned into a finite set of topological classes of equivalence (i.e., Characteristic Views). The method requires silhouette determination to guide the matching process. Among 3D object recognition methods, some are based purely on object silhouettes,(1) others rely on moment-based silhouette shape description,(7,8) or on discriminant functions and de-correlation transformations.(9) Neural network systems can be used to capture the 3D geometrical structure of the objects on the basis of a limited set of 2D views taken from distinct vantage points.(10) Based on the training set, the system learns some discriminating functions in a multidimensional space, and performs a mapping from a viewpoint to an object class. Experiences of 3D object recognition with neural networks mostly address 3D objects that essentially have a two-dimensional structure. For example, recognition of planar-faced real-world objects (keys, mechanical parts and leaves) has been accomplished by combining complex-log conformal mappings and a distributed associative memory.(11) An improved version of the system has been studied for the classification of wooden logs.(12) Neural network architectures (feed-forward back-propagation and Kohonen self-organizing map) have been compared for the classification of boundaries of 3D objects.(13) Only closed boundaries with no occlusions are addressed. Features that are invariant to rotation and scaling transforms are used as an input parametric description for



object boundaries. The system can classify a limited number of object types, and objects are recognized only on the basis of their dominant views (the views that mostly characterize the object). Basak et al. have presented a multi-object recognition system which takes into account the occlusion problem.(14) The system is made up of a three-layer network, which is able to learn each individual object during a supervised training phase. The total number of output nodes denotes the maximum number of objects the network can recognize.

The recognition of fully 3D objects from multiple 2D views has been addressed by few authors. Hopfield neural networks have been used to access a database of planar-faced object views through a coarse-to-fine strategy.(15) Poggio addresses only planar-faced solids with a wire-frame structure.(1) The peculiarity of this study is based on the fact that every perspective view of an object can be mapped into a standard view through a vector-valued object-specific function, which can be approximated starting from a small number of views. Nevertheless, the representation used prevents the usage of these networks for recognition of objects in real-world contexts. Networks obtained with this approach are object-specific. Each network can recognize only the object whose views it was trained on, and rejects different objects by thresholding. Several networks can be combined to provide a multi-object neural classifier.(16)

In this paper, we address the problem of classifying fully 3D planar-faced objects from a single perspective view. Neural classification of 3D objects is performed using Kohonen networks,(17) which are able to discriminate among several distinct objects. Several multi-object Kohonen networks can be combined in a modular way to form a larger multi-object recognition system. In this paper, a synthetic description of a planar-faced object's view is proposed and the recognition system is described.
The paper is organized as follows. In Section 2, the object view representation used is described and its properties are pointed out. Classification with multi-object Kohonen networks is described in Section 3. In Section 4, we discuss the way the input object representation is reflected into the Kohonen network. Then, a criterion is introduced according to which several Kohonen networks are used to independently classify disjoint subsets of objects. Finally, experimental results are reported in Section 5.
2. REPRESENTATION OF 3D PLANAR-FACED OBJECT VIEWS

Shape descriptors that are unaffected by the projection of the object onto the image plane are at the heart of model-based vision systems.(18) Their use allows the design of recognition systems that are position independent. In this study, an invariant object representation is achieved by training: different perspective views of the objects to be classified are presented to the network, which is expected, after the learning process, to generalize from examples and correctly classify views not belonging to the training set. The input information of our recognition system is represented by the perspective views of the objects belonging to a sample set. In order to perform a task of automatic perception and reasoning, the distributed sensor information has to be processed, in order to extract those features which seem to be relevant for that task. In 3D object recognition the first problem to solve is finding a good parameterization that can synthesize the distributed information contained in one object's image. Since only Planar Faced Solids (PFS) are addressed in our paper, the straightforward description of their perspective views is a rough parameterization of the visible edges. For its characteristics, the Hough Transform algorithm has shown to be the most suitable for our aim. A so-called Extended Hough Transform has been used to project the binary images, representing PFS perspective views, onto an n-dimensional parameter space.(19,20) After the visible edges have been identified through a Sobel preprocessing, each of them is described by the Hough triplet (θ, ρ, d) of line parameters, where θ is the slope, ρ is the distance from the image origin,(21) and d is the length of each line (which is approximated by the number of votes collected during the Hough algorithm computation). In order to obtain a lower dependency on scaling, the length d of each segment is normalized relative to the longest segment in the view. Parameter θ takes values in the range [0, π], ρ in the range [0, C] (where C is the maximum value allowed in the image area), and values for d are considered in the interval [1, m] (where m is the number of quantization intervals).

Furthermore, it has to be pointed out that the Hough transform is particularly suitable for real-time applications, given its characteristics of parallelism and the availability of dedicated hardware. In order to represent not only image edges but the overall object view, a correlation law has been set among all triplets (θi, ρi, di) associated with edges Ei in the image. Let n_max be the maximum number of visible edges of an object. If one view has n visible edges, with n ≤ n_max, then it is described by a feature vector x ∈ ℝ^(3·n_max), having 3 × n non-zero components (θi, ρi, di) for i = 1, ..., n, and 3 × (n_max − n) zero values. Triplets are ordered first according to θ, then according to ρ and finally according to d, following the lexicographic ordering.

Parameterization of pictures using the standard Hough transform can be strongly impaired by noise. Typical effects of noise are: (1) breaking lines into several segments, due to undetected pixels, and (2) appearance of spurious pixels in regions where no edge is actually present. Broken lines have no significant effect on the parameterization. On the contrary, the presence of spurious pixels may result in the detection of false lines. A rough quantization of parameters may also result in false lines being detected. If a great number of pixels are aligned along a line with parameter values (θ*, ρ*), also lines whose parameters are sufficiently close to (θ*, ρ*) will receive a large number of votes ("noisy voting"). In order to cope with these drawbacks, and considering the polyhedral structure of the objects involved, the Weighted Polarized Hough Transform (WPHT)(22) has been used. The WPHT algorithm has been further improved by a procedure aimed at reducing the negative effect due to the "noisy voting". After the WPHT has been computed, we select the couple (θ*, ρ*) which turns out to be the most voted among the couples falling inside a specified window in the parameter space, so as to extract only the parameters (θ*, ρ*) associated with the real edges found during preprocessing. Such a window has fixed dimensions and is shifted through the whole parameter space.
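The view encoding described above (Hough triplets, length normalization, lexicographic ordering and zero padding) can be sketched as follows. This is a minimal illustration, not the authors' code: `view_feature_vector`, `n_max` and the sample triplets are our assumptions, and ascending lexicographic order is assumed.

```python
import numpy as np

def view_feature_vector(triplets, n_max=24):
    """Encode one view as an ordered, zero-padded vector of (theta, rho, d) triplets.

    `triplets` is a list of (theta, rho, d) tuples, one per detected edge.
    d is normalized to the longest segment in the view, the triplets are
    sorted lexicographically (theta, then rho, then d), and the result is
    padded with zeros up to 3 * n_max components.
    """
    if len(triplets) > n_max:
        raise ValueError("more edges than the assumed maximum n_max")
    d_max = max(t[2] for t in triplets)
    normalized = [(theta, rho, d / d_max) for theta, rho, d in triplets]
    ordered = sorted(normalized)          # lexicographic: theta, then rho, then d
    x = np.zeros(3 * n_max)
    for i, (theta, rho, d) in enumerate(ordered):
        x[3 * i:3 * i + 3] = (theta, rho, d)
    return x

# A view with four visible edges (illustrative values):
x = view_feature_vector([(1.2, 40.0, 80.0), (0.3, 10.0, 100.0),
                         (0.3, 55.0, 60.0), (2.9, 25.0, 100.0)])
```

The resulting vector has 3 × 24 = 72 components, of which only the first 3 × 4 = 12 are non-zero.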


3. CLASSIFICATION WITH MULTI-OBJECT KOHONEN MAPS

Object views, described according to the Extended Hough Transform, are classified by means of multi-object Kohonen maps. This type of map has been preferred to other neural architectures for both a physiological and a technical reason. The Kohonen learning algorithm seems to be plausible from a biological point of view. It is now generally accepted that cortical maps are a product of self-organization and are to some extent "learned" by visual experience.(23,24) Specifically, object views represented by orientation of edges and contour elements of an image are considered to be one of the main features detected by neurons of the primary visual cortex.(25,26) The Kohonen self-organizing feature map algorithm(17) can be thought of as an abstract computational model of neo-cortex processes involving the formation of two-dimensional topographic maps of peripheral sense.(27) Such maps are believed to develop in a self-organizing process based on cooperation and competition between neurons.(28-30) The learning phase is computationally much lighter than in traditional Back Propagation neural nets. Moreover, the weights of self-organizing maps stabilize into a narrower dynamic range, and the accuracy requirements are then modest.(31) Even if both self-organizing maps and Back Propagation adapt weights by a gradient descent search on some potential function, weaker convergence properties are available for Back Propagation.(32,33)

Several remarkable properties have been proved for the Kohonen self-organizing map: (1) it represents most faithfully those dimensions of the input space along which the variance in the sequence of inputs is most pronounced; (2) it preserves the topology of the input space, i.e., maps similar inputs to neighboring locations in the output space; (3) it reflects differences in the sampling density of the input space in a natural way: regions from which inputs have frequently occurred are mapped onto larger domains of the output space, and therefore with a better resolution with respect to regions from which only few inputs have emerged.

The objects' classification task is performed by a two-layer hierarchical structure in which the first layer is made up of an array of multi-object Kohonen maps and the second layer is a MAXNET, which selects the Kohonen map with the highest output response, see Fig. 1. Given a set C of sample objects to be classified, each map is trained in order to classify a subset of them. Weights w_i are updated according to the equation:

w_i(t + 1) = C_i(t)[w_i(t) + ε(t)h_i(t)x(t)],

where the factors C_i(t) and ε(t) are used to keep Σ w_i constant and to scale the size of the weight change, respectively. The function h_i(t) has a maximum at i = c, where c = arg min_i d(x, w_i), and decays rapidly to zero for high distances from unit c. The local smoothing operations on the spatial variation of the weights force the variation of the weights w_i to be as continuous as possible.(34) The unit c is selected on the basis of a norm measure in the matching process. We have experimentally verified, comparing the inner product to the Euclidean distance, the former measure to be more suitable to detect differences among vectors x representing object views. In order to obtain a finer classification, the supervised procedure of Learning Vector Quantization(31) has been applied to the networks after the self-organizing phase has converged.

As to the mathematical results so far available, each Kohonen self-organizing map has been proved to define near-optimal decision boundaries between the classes of objects, even in the sense of classical Bayesian decision theory.(31) Conditions under which the self-organizing learning process converges and is stable have been analysed in the literature, but the only case where a thorough analysis could be achieved, for a very large class of input distributions, is the one-dimensional one,(17,35,36) while for higher dimensions the results are only partial. As far as the stationary case in any dimension is concerned, the final phase after the self-organization has been studied by Ritter and Schulten.(37,38) Recently, Fort and Pagès have given some results in higher dimensions, but their results are very partial.

3.1. The object view representation

As pointed out in the former paragraph, the Kohonen network is highly sensitive to the input space distribution. This property highlights the importance of investigating the input space topology, with particular reference to the regions of discontinuity. The transformation T that projects an object's image into the
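As a concrete illustration, one update step of the kind described above can be sketched as follows. This is only a sketch under our own assumptions: the dot-product matching, the Gaussian form of h_i(t), and all names and parameter values (`som_step`, the 5 × 5 grid, ε, σ) are illustrative, and the normalization C_i is rendered here as rescaling each weight vector to unit length.

```python
import numpy as np

def som_step(W, x, grid, eps, sigma):
    """One self-organizing-map step with inner-product matching.

    W: (n_units, dim) weight matrix; grid: (n_units, 2) unit coordinates.
    The winner c maximizes <x, w_i>; h is a Gaussian neighborhood around c;
    the final renormalization plays the role of the factor C_i(t) that keeps
    the weight magnitudes constant.
    """
    c = int(np.argmax(W @ x))                           # inner-product matching
    dist2 = np.sum((grid - grid[c]) ** 2, axis=1)       # grid distance to winner
    h = np.exp(-dist2 / (2.0 * sigma ** 2))             # neighborhood function
    W_new = W + eps * h[:, None] * x[None, :]           # Hebbian-like update
    W_new /= np.linalg.norm(W_new, axis=1, keepdims=True)  # C_i normalization
    return W_new, c

rng = np.random.default_rng(0)
grid = np.array([(i, j) for i in range(5) for j in range(5)], dtype=float)
W = rng.standard_normal((25, 6))
W /= np.linalg.norm(W, axis=1, keepdims=True)
W, c = som_step(W, np.ones(6) / np.sqrt(6.0), grid, eps=0.8, sigma=2.5)
```

A second-layer MAXNET selection then amounts to taking the map whose winning unit has the largest response.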


Fig. 1. General system architecture, made up of a first layer with an array of self-organizing maps and a MAXNET in the second layer.

parameter space of the Extended Hough Transform is the composition of three applications:

T = M ∘ H_E ∘ Π = M(H_E(Π(v, o))) = M(H_E(i)),   (1)

where Π is the perspective operator, H_E the Hough transform and M the lexicographic ordering operator. The Π operator is an application that, for every vantage point, associates an image to each object. It can be formally defined as Π: O × V → I, where O ⊂ ℝ³ is the planar-faced object space, V ⊂ ℝ³ is the vantage point space, and I ⊂ ℝ² is the image space. The Extended Hough Transform H_E associates a triplet (θ, ρ, d) to each segment in an image and can be formally expressed as follows:

H_E: I → R_{θ,ρ,d},

where R_{θ,ρ,d} ⊂ ℝ³ is the parameter space of slope θ, distance ρ and length d. The application M identifies the lexicographic ordering operation, which associates to a set of points in R_{θ,ρ,d} a vector belonging to F:

M: R_{θ,ρ,d} → F,

where F is the feature space, a subspace of ℝ^(3·n_max), where n_max is the maximum number of lines in an image. As to the analysis of the discontinuity sources, in the following each function composing the overall transformation T in (1) will be studied. The continuity of the perspective operator with respect to the vantage point position has been widely proved. Therefore, the perspective operator Π is a continuous application of the solid edges onto the image plane.

Given two points in the image plane P1 = (X1, Y1) and P2 = (X2, Y2), the value of the parameter θ in the H_E transformation can be determined according to:

θ = arccot((Y2 − Y1)/(X2 − X1))  if X2 ≠ X1,
θ = 0                            if X2 = X1.

Assuming that (Y2 − Y1) ≥ 0, the limits for X2 → X1+ and X2 → X1− (i.e., when lines become vertical) have distinct values:

lim_{X2→X1+} arccot((Y2 − Y1)/(X2 − X1)) = 0,
lim_{X2→X1−} arccot((Y2 − Y1)/(X2 − X1)) = π.

As a consequence, due to the discontinuity of θ, the application H_E is discontinuous when lines are vertical (similar results can be obtained if (Y2 − Y1) ≤ 0).

The last function to be analyzed is the one that, assuming that the number of lines in the image is n, associates a vector x in the feature space F ⊂ ℝ^(3·n_max) to the corresponding n points (θi, ρi, di), i = 1, ..., n. Let us consider two 3D vectors identified by two triplets (θ1, ρ1, d1) and (θ2, ρ2, d2). Assuming θ1 > θ2, the 6-dimensional vector obtained according to the lexicographic ordering rule is x1 = (θ1, ρ1, d1, θ2, ρ2, d2). Suppose that θ1 ≈ θ2 and that θ1 < θ2; then the new vector will be x2 = (θ2, ρ2, d2, θ1, ρ1, d1). It is obvious that θ1 ≈ θ2, but it can happen that either ρ1 ≫ ρ2 or ρ1 ≪ ρ2, so that a discontinuity is introduced whenever a change in the ordering occurs. As a result, the lexicographic ordering is a non-linear and discontinuous function operating on 3D points of the parameter space R_{θ,ρ,d}. This means that there exist two object views that, although slightly different, are transformed into two unconnected regions of the parameter space by the lexicographic ordering. According to what was discussed above, the overall transformation T introduced in (1) is discontinuous. Moreover, note that other discontinuities, which are not due to the features of the specific transformation, can arise whenever an object view changes to another with a different number of visible edges. This is the case of a meaningful change of the geometrical structure of the object view.

4. NETWORK STRUCTURE AND ORGANIZATION

In this section, we discuss the features of the multi-object self-organizing maps that are used in our system, taking advantage of the properties of the transformation accomplished by the function T and by the learning algorithm.

4.1. Maps organization
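The vertical-line discontinuity can be checked numerically. This is a minimal sketch under our own assumptions: the helper `edge_theta` and the atan2-based rendering of arccot are not from the paper.

```python
import math

def edge_theta(p1, p2):
    """Slope parameter theta = arccot((y2 - y1)/(x2 - x1)), in [0, pi).

    arccot is expressed through atan2 so that, for y2 > y1, the limit is 0
    as x2 approaches x1 from above and pi as it approaches from below.
    """
    (x1, y1), (x2, y2) = p1, p2
    if x2 == x1:
        return 0.0                      # the convention adopted for vertical lines
    return math.atan2(x2 - x1, y2 - y1) % math.pi

# Two nearly vertical edges on either side of x2 = x1:
right = edge_theta((0.0, 0.0), (1e-9, 1.0))     # close to 0
left = edge_theta((0.0, 0.0), (-1e-9, 1.0))     # close to pi
```

Although the two edges are almost identical, their θ values sit at opposite ends of [0, π), which is exactly the discontinuity discussed above.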

Given an object with n visible edges in one particular view, the input vector x ∈ ℝ^(3·n_max) representing such a view is characterized by n non-zero triplets, and it can be thought of as belonging to an n-dimensional hyperplane in the space ℝ^(3·n_max). For each training set, the number of distinct hyperplanes containing view representations is given by the number of view sets having a different number of edges. For each hyperplane, the distribution of the view representations can be approximately characterized with the variance with respect to the centroid. The variance of the views within each hyperplane depends on the number of discontinuities that are generated by the application T.* Since Kohonen maps reflect differences in the sampling density distribution of the input space, a rough description of the parameter space is fundamental to understand how the Kohonen algorithm creates clusters on the map. Assuming that the views have uniform distribution over each hyperplane, the greater the variance, the greater will be the number of neurons assigned by the self-organizing algorithm in the map. For each hyperplane, we may depict the object view representation with an ellipsoid whose center is given by the average values of the three components (θi, ρi, di), i.e., (θ̄i, ρ̄i, d̄i), and whose three main axes are set equal to the average variance of the three components (σ²_θi, σ²_ρi, σ²_di). Figure 2 shows a 3D graphic display of the ellipsoids corresponding to the training patterns of a cube. The greater the ellipsoids are, the larger is the spreading of points around the hyperplane centroids. From Fig. 2, it can be noted that in the input distribution of the cube views there should be points distributed in only three hyperplanes, i.e., three ellipsoids, corresponding to four, seven and nine visible edges in the object views. The hyperplane containing views of the cube with eight edges derives from edges missing due to the finite precision of the WPHT.
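A sketch of how the per-hyperplane ellipsoid parameters (centroid and per-component variances) could be computed. The function name and data layout are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def hyperplane_ellipsoids(views):
    """Group views by edge count and compute, per group, the centroid
    (mean theta, rho, d) and the per-component variances, i.e., the
    ellipsoid parameters used to depict each hyperplane.

    `views` is an assumed format: a list of views, each view being a
    list of (theta, rho, d) triplets, one triplet per visible edge.
    """
    groups = {}
    for v in views:
        groups.setdefault(len(v), []).append(np.asarray(v, dtype=float))
    stats = {}
    for n, vs in groups.items():
        pts = np.concatenate(vs)              # all triplets in this hyperplane
        stats[n] = (pts.mean(axis=0), pts.var(axis=0))
    return stats

# Two illustrative 4-edge views of the same object:
stats = hyperplane_ellipsoids([
    [(0.1, 10, 1.0), (1.3, 40, 0.5), (2.0, 5, 0.7), (0.9, 22, 1.0)],
    [(0.2, 12, 1.0), (1.2, 38, 0.6), (2.1, 7, 0.8), (1.0, 20, 0.9)],
])
```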
As the number of edges in the views presented to the network changes, the neurons tend to be located in specific areas of the map. Specifically, if the neurons of an area of the map react to views with n edges, then the neurons of the adjacent areas react to object views with (n + 1) edges, so that hyperplanes are mapped onto neighboring regions of the Kohonen map. Please note that the same neurons may be activated by views belonging to different hyperplanes, so that adjacent areas of the map which are sensitive to views of different hyperplanes may have overlapping borders. These features can be recognized in the map organization shown in Fig. 3, where neurons reacting to views belonging to the same hyperplanes have been labeled with the same color. These results confirm the principle of Continuous Mapping for self-organizing feature maps.(34) Such

*A rough estimation of the maximum number of discontinuities that can be generated inside each hyperplane whose views have n edges is given by n!.



Fig. 2. 3D representation of the set of cube training views. (Axes: theta, rho; ellipsoids shown for views with 4, 7, 8 and 9 lines.)

Fig. 3. Organization of a (14 × 14) map on the basis of the views belonging to different hyperplanes. Neurons with a darker gray level have been activated by views belonging to hyperplanes with a higher dimension, i.e., a greater number of edges.

principle states that the mapping from a higher-dimensional feature space onto the two-dimensional topographic surface is achieved according to a maximizing-continuity principle. As a result, neurons reacting to the same object classes will be clustered in different areas of the map.

4.2. Object grouping criterion

The analysis of the input view representation, besides giving insights for the understanding of the process of self-organization of Kohonen maps, is useful to improve the performance of the object recognition process. The overall recognition system has been built according to two main principles: (a) hierarchical organization of the networks and (b) grouping of objects. Although single maps have the capability to define hierarchical, ultra-metric representations of structured data, it has been suggested that the real potential of these maps lies in a sort of "hierarchical" or otherwise "structured" system that may consist of several interconnected map modules. The problem of "hierarchical" maps has turned out to be very difficult and, at this time, we are not aware of any studies which have been able to exploit it extensively. Driven by the specific problem, instead of a huge map made up of thousands of neurons, we have designed a system made up of several small Kohonen maps, so that, inside each of them, the complexity of the decision boundaries is minimized. The maps are then arranged in a parallel fashion so as to accomplish the final classification. Given that the system is made up of multiple networks, a criterion has been defined to partition the sample objects' set into disjoint subsets, such that each of them is used to train a different Kohonen network. The grouping criterion has been derived from the observations that, first, the input distribution is structured in hyperplanes and the distributions inside each hyperplane are made up of views of different objects; secondly, that the self-organizing algorithm maps object view representations of adjacent hyperplanes into adjacent areas of the network; and, finally, that nets with a lower number of objects give rise to decision boundaries with lower interferences. As a result, objects are grouped and assigned to networks according to two principles: the number of views of distinct objects belonging to the same hyperplane is minimized, and objects which have views in more hyperplanes are grouped in different sets.

Partition of objects into distinct subsets is recursively iterated until an efficient organization of the net has been reached. "Efficiency" has been measured by evaluating the classification error percentage with respect to the size of the map. Assuming that views have uniform distribution in each hyperplane, the number of neurons assigned to each hyperplane (N_i) has been estimated. If the variance of the object views belonging to the i-th hyperplane is denoted with σ_i² and P_i is the probability that a view is mapped into the i-th hyperplane, then the following can be stated:

N̂_i ∝ (σ_i² × P_i),

where N̂_i is the estimated number of neurons assigned to the i-th hyperplane. Referring to the quantity (σ_i² × P_i) as A_i and given the overall network size N, N̂_i can be given as:

N̂_i = N · A_i / Σ_{j=1}^{N_H} A_j,

where N_H is the overall number of hyperplanes. In Fig. 4, the relative difference between the estimate N̂_i and N_i is plotted for the different hyperplanes obtained with the test set used for the experiments reported in Section 5. Once an estimate of N̂_i is given, for each hyperplane the number of neurons assigned to the k-th object can be computed from an evaluation of the input object view distribution. Given the distribution of the views belonging to the i-th hyperplane, we can compute, for the k-th object, the variance of the views around the centroid ((σ_i^(k))²) and the probability of having a view inside the i-th hyperplane (P_i^(k)). The organization achieved by the Kohonen map for this restricted set of views has the same features described above, so that the number of neurons assigned to the k-th object in the i-th hyperplane satisfies:

n̂_i^(k) ∝ ((σ_i^(k))² × P_i^(k)) = A_i^(k).

As a result, the following value is obtained for the number of neurons assigned to the k-th object within the i-th hyperplane:

n̂_i^(k) = N̂_i · A_i^(k) / Σ_{k=1}^{N_O} A_i^(k),

where N_O is the overall number of objects. Therefore, the total number of neurons for the k-th object in the Kohonen net can be computed as:

n̂^(k) = Σ_{i=1}^{N_H} n̂_i^(k).

In Fig. 4, the relative difference between the value of n^(k) and n̂^(k) averaged over the hyperplanes is also plotted. The relative error in estimating the value of n^(k), measured as |n^(k) − n̂^(k)|/n^(k) and averaged over all the objects of the test set used for the experiments reported in Section 5, has been found equal to 0.11.

5. EXPERIMENTAL RESULTS

Several experiments of 3D planar-faced object recognition have been carried out with multi-object Kohonen networks according to the criterion expounded in the previous sections. Image data used in the experiments are both synthetic and real. Synthetic data have been obtained from a 3D object CAD drawing and editing system. Real data have been used for the experiments with noise. In this case, sample 3D objects have been built and acquired with a 50 mm camera. Several objects have been used to test the robustness and the performance of the system. A maximum number of 24 edges per object view has been assumed, so that all the networks have 72 = (24 × 3) inputs, i.e., x ∈ ℝ⁷². Training sets have been obtained by changing the camera position continuously through rotation and translation along the three axes; no zoomed views have been used to train the network. For each object, the corresponding training set is made up of 312 views. The Kohonen layers used in the experiments are square shaped. Objects have been grouped in different
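The proportional allocation of neurons to hyperplanes, N̂_i = N · A_i / Σ A_j with A_i = σ_i² × P_i, can be sketched as follows. The function name and the sample variances and probabilities are illustrative assumptions, not values from the paper.

```python
import numpy as np

def allocate_neurons(N, var_h, p_h):
    """Estimate neurons per hyperplane: N_hat_i = N * A_i / sum_j A_j,
    with A_i = var_i * P_i (variance times occupation probability).

    `var_h` and `p_h` are assumed arrays with one entry per hyperplane.
    The same rule, applied within one hyperplane with per-object A_i^(k),
    would split N_hat_i among the objects.
    """
    A = np.asarray(var_h, dtype=float) * np.asarray(p_h, dtype=float)
    return N * A / A.sum()

# A 12 x 12 net (144 neurons) split over three hyperplanes:
N_hat = allocate_neurons(144, var_h=[2.0, 1.0, 1.0], p_h=[0.5, 0.3, 0.2])
```

By construction the estimates sum to the network size, and the hyperplane with the largest variance-probability product receives the most neurons.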




Fig. 4. Estimate of the number of neurons per hyperplane (N̂_i) compared to the actual value N_i, i.e., |N_i − N̂_i|/N_i, and estimate of the number of neurons per object per hyperplane (n̂_i^(k)) compared to the actual value n_i^(k), averaged over the whole set of objects. (x-axis: number of edges per hyperplane, i.)

Kohonen networks according to the criterion given in Section 4. The network dimension has been tuned according to the best trade-off between performance and number of neurons.(41) It has been observed that this number increases as the network has to classify a greater number of objects. In the following, only the experiments with the networks with the best performance have been reported. The learning process is started by choosing arbitrary initial values w_i(0) for the network weights. Recalling that learning is a stochastic process, the final statistical accuracy of the mapping depends on the number of steps, which should be at least 500 times the number of network units.(31) Since the networks used have reduced dimensions, a larger number of steps, in the order of magnitude of 100,000 (1000 times the number of network units), has been employed to train the network. For our experiments, an initial value of 0.8 with a linear decrease rule has been adopted for the function h_c(t). To achieve a global ordering of the map, it turned out experimentally to be advantageous to take a large initial neighborhood of the winning neuron w_c, i.e., one half the diameter of the network, and then let it shrink linearly with time during the ordering phase. The network learning algorithm was implemented in C and run on a SUN Sparc workstation. Testing sets are made up of 100 views per object. These views were distinct from those used in the training set and include rotated, x- and y-translated views, as well as zoomed views.
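The linear decrease rules mentioned above (initial amplitude 0.8, initial neighborhood radius of half the map diameter, on the order of 1000 steps per unit) can be sketched as a simple schedule. The function name and exact endpoints are our assumptions.

```python
def linear_schedule(t, t_max, start, end=0.0):
    """Linearly decay a training parameter from `start` at t = 0 to `end`
    at t = t_max, as used here for both the neighborhood amplitude
    (initial value 0.8) and the neighborhood radius (initially half the
    map diameter). Names and endpoints are illustrative.
    """
    frac = min(t, t_max) / t_max
    return start + (end - start) * frac

steps = 100_000                  # ~1000 x the number of units for a 10 x 10 map
amp0, radius0 = 0.8, 5.0         # 5.0 = half the diameter of a 10 x 10 map
amp_mid = linear_schedule(50_000, steps, amp0)      # halfway through training
```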
If a network responds to a test view with a neuron which had not been labeled during the LVQ phase, then this neuron is assigned the label of its closest labeled neighbor, according to a minimal-distance criterion. Since it is reasonable to expect the recognition of concave objects to be more critical, owing to edge occlusion, we have tested the performance of our system on both concave and convex objects. In the following, results from several experimental sets are reported.
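The fallback labeling rule above can be sketched as follows; the grid layout and all names are illustrative assumptions, not taken from the paper's implementation.

```python
# If the winning neuron has no label from the LVQ phase, assign the
# label of the closest labeled neuron on the map grid.
def resolve_label(winner, labels, cols):
    """winner: flat index of the winning neuron;
    labels: dict {flat index: object label} built during the LVQ phase;
    cols: number of columns of the rectangular map."""
    if winner in labels:
        return labels[winner]
    wr, wc = divmod(winner, cols)
    def grid_dist(i):
        r, c = divmod(i, cols)
        return (r - wr) ** 2 + (c - wc) ** 2
    # minimal-distance criterion over the labeled neurons
    return labels[min(labels, key=grid_dist)]
```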

Experiments with convex objects


In the following, experiments are surveyed on seven sample convex objects (see Fig. 5). These objects have been selected in such a way that some of them have common views, which makes classification sometimes ambiguous. The solid convex objects have the form of a cube (Oc), a parallelepiped (Opa), a pyramid with triangular base (Opyt), a pyramid with square base (Opys), a pyramid with pentagonal base (Opyp) and two prisms with pentagonal (Opap) and hexagonal (Opae) basis.

Given the initial set S,

S = {Oc, Opyt, Opys, Opyp, Opa, Opap, Opae},

it has been partitioned, according to the criterion discussed in Section 4, into two subsets S1 and S2:

S1 = {Oc, Opyt, Opyp, Opae}, S2 = {Opa, Opys, Opap}.

Two Kohonen networks, 14 x 14 and 12 x 12, have been trained on the view representations of the objects belonging respectively to S1 and S2. Since the misclassification rates of both nets were not satisfactory, to achieve better performance the set S1 has been further partitioned into two disjoint subsets:

S1a = {Opyt, Opae}, S1b = {Oc, Opyp}.

Two Kohonen 10 x 10 networks have been trained on the views of the objects belonging to the sets S1a and S1b.

Fig. 5. Planar-faced convex objects.

Thus, the overall recognition system used for the seven convex objects is made up of three networks: two 10 x 10 nets and one map made up of 12 x 12 neurons. The self-organization of the resulting maps is shown in Fig. 6. The overall number of neurons assigned to each object in each map reflects the input pattern distribution. The percentage of error of this multi-object system is reported in Fig. 7. Both normal and noise-corrupted images have been tested. The noise has been assumed to be normally distributed with variance set equal to

Fig. 6. Overall system made up of a (10 x 10) Kohonen network trained on objects Oc and Opyp, i.e., training set S1b, a (10 x 10) Kohonen network trained on objects Opae and Opyt, i.e., training set S1a, and a (12 x 12) Kohonen network trained on objects Opys, Opa and Opap, i.e., training set S2, linked through a MAXNET stage. In each map, neurons are labeled with a color corresponding to one of the objects used for training.
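The maps are combined through a MAXNET stage (see Fig. 6): a winner-take-all network that iteratively suppresses all but the largest activation. The following is a textbook MAXNET sketch, not the paper's implementation; the inhibition constant eps must be smaller than the reciprocal of the number of inputs.

```python
# MAXNET winner-take-all: each unit inhibits the others until only
# one activation remains positive; its index is the winning network.
def maxnet(activations, eps=0.05, max_iter=1000):
    a = list(activations)
    alive = list(range(len(a)))
    for _ in range(max_iter):
        # lateral inhibition: subtract a fraction of the other outputs
        a = [max(0.0, a[i] - eps * (sum(a) - a[i])) for i in range(len(a))]
        alive = [i for i, v in enumerate(a) if v > 0.0]
        if len(alive) <= 1:
            break
    return alive[0] if alive else None
```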

0.2. Note that some recognition errors are due to identical views of different objects.

Fig. 7. Error rate of the system made up of three Kohonen networks with dimensions (10 x 10), (10 x 10) and (12 x 12), tested on the seven convex objects, which have been partitioned as S = {S1a, S1b, S2}. Error rates are reported for synthetic views and for views corrupted with noise of variance 0.2.

Experiments with concave and convex objects

Experiments with three sample concave objects (see Fig. 8) and the seven convex objects in Fig. 5 are surveyed below. The concave solids are made up of two orthogonal parallelepipeds with rectangular base forming an L-shaped solid (Ol), a T-shaped solid (Ot) and three orthogonal parallelepipeds with square base (Oax). Evaluating the training view distribution, it can be noted that object views are distributed over a greater number of distinct hyperplanes. This allows the creation of subsets where the interference of object representations is lower than in the case of convex objects alone. Applying the grouping criterion to the set of ten objects (seven convex and three concave), the following three subsets have been selected:

S1 = {Opyp, Opae, Ot}, S2 = {Opa, Opys, Ol}, S3 = {Oc, Opyt, Opap, Oax}.

Two 14 x 14 networks have been trained on the views of the objects belonging to subsets S1 and S2, and one 16 x 16 net has been trained on the views of the objects in subset S3. After learning, the self-organization of the maps used to classify the concave objects is shown in Fig. 9. The misclassification rate is reported in Fig. 10. Note that the misclassification rate is slightly worse than that obtained with convex objects only, mainly because in this case a single network maps more objects than in the previous case.

Fig. 8. Planar-faced concave objects.

Influence of noise

In order to check the robustness of Kohonen networks in the classification of polyhedral objects, some experiments have been carried out on noisy images. Images have been corrupted with randomly generated Gaussian noise of increasing variance. In the following, results are reported for noise variances equal to 0.2, 0.3 and 0.4. The same set of experiments, for the three concave and seven convex objects, has been performed using the new images. Results of the testing for convex and concave objects are reported in Fig. 10. Note that the error percentages for images with a noise variance equal to 0.4 have not been reported, since the WPHT worked poorly.

Experiments with real-world objects

In order to check the reliability of the whole recognition system in the presence of real-world conditions, such as critical illumination, 10 hand-crafted wooden objects (see Figs 11 and 13) have been purposely created, on the basis of the geometrical models considered in the previous paragraphs. The experiments carried out on these objects were also aimed at verifying whether the image processing and recognition chain could be applied in real-time applications, once the learning process had been performed off-line. On the basis of each wooden object, a 3D CAD model has been created and 312 training views have been extracted, by



Fig. 9. Overall system made up of a (14 x 14) Kohonen network trained on objects Opyp, Opae and Ot, i.e., training set S1, a (14 x 14) Kohonen network trained on objects Opa, Opys and Ol, i.e., training set S2, and a (16 x 16) Kohonen network trained on objects Opyt, Oc, Opap and Oax, i.e., training set S3. In each map, neurons are labeled with a color corresponding to one of the objects used for training.


Fig. 10. Error rate of the system made up of three Kohonen networks with dimensions (14 x 14), (14 x 14) and (16 x 16), tested on both convex and concave objects, which have been partitioned as S = {S1, S2, S3}. The different operating conditions (synthetic objects, noise variances of 0.2 and 0.3, and real objects) are reported in the diagram.


moving the vantage point uniformly around the three-dimensional model. The learning process is therefore based on 2D synthetic views, which have the advantage of being easily extracted and processed, thus enabling a fast learning procedure. Once the final organization of the network weights has been attained, the recognition task has been tested on real objects. A few grey-level images have been taken of each solid, using a camera with a 50 mm lens and different lighting conditions, some of which are rather critical. The images have been processed using the Canny edge extractor(42) and a thinning algorithm. Examples of the processed images are reported in Figs 12 and 14, where the edges of the images in Figs 11 and 13, respectively, have been extracted. The preprocessing step has proven to be the most critical, since our wooden objects have good reflectivity properties, resulting in missing edges or false edge detection. Once the edges have been detected, the modified WPHT algorithm has been applied to compute the associated numerical vector, which has been input to the Kohonen network array for classification. The results of these experiments are summarized in Fig. 10 and show good performance of the overall system, in some cases even better than that obtained with CAD models. These statistics can be easily justified by noting that only a few stable positions can be assumed by real objects, and that not many critical views are probable in the test set.

Given the availability of many dedicated processors for edge detection and Hough transformation, the classification system based on Kohonen's self-organizing maps has not only shown good performance, but is also readily applicable to real-time use.
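The 312 training views mentioned above are obtained by moving the vantage point uniformly around the 3D CAD model. One common way to place viewpoints roughly uniformly on a viewing sphere is the Fibonacci spiral; the sketch below is illustrative and not necessarily the paper's sampling scheme.

```python
# Place n vantage points roughly uniformly on a sphere of given
# radius using the golden-angle (Fibonacci) spiral.
import math

def viewpoints(n, radius=1.0):
    pts = []
    golden = math.pi * (3.0 - math.sqrt(5.0))   # golden angle
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n           # uniform in height
        r = math.sqrt(max(0.0, 1.0 - z * z))    # radius of the latitude circle
        theta = golden * i
        pts.append((radius * r * math.cos(theta),
                    radius * r * math.sin(theta),
                    radius * z))
    return pts
```

Each point defines a camera position from which a synthetic perspective view of the CAD model can be rendered.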

Experiments with real-world occluded objects


A set of experiments has been made to verify the robustness of the recognition system in the presence of partially occluded objects. Multi-object sample scenes have been created in which objects are occluded by other objects. In order to segment the different solids in the scene, we have employed the system presented in reference (43). Since one of the main geometrical properties which the edges should satisfy in order to be grouped is symmetry, this method is well suited to work with PFSs. A set of experiments has been performed on all ten wooden models, whose views are subject to different degrees of occlusion. Since the edge is the prominent feature used in our representation model of solid shapes, the degree of occlusion of an object view is a function of the ratio between the visible edges and the total number of edges in the view. The occlusion of an edge in an object view can be total (the edge completely disappears) or only partial (the edge is not visible in its full length, but part of it has disappeared). Since the Hough transform is used as the input operator, partial occlusion of an edge can result in

Fig. 11. Image of the five wooden objects: four convex Oc, Opa, Opyp, Opap and one concave Ot.


Fig. 12. Edge image relative to Fig. 11, obtained with Canny edge filtering.

Fig. 13. Image of the five wooden objects: three convex Opyt, Opys, Opae and two concave Ol, Oax.

its total loss, whenever the length of its visible part does not exceed the Hough threshold described in Section 2. To take this kind of effect into account, we introduce a first occlusion index in which the partially occluded edges (below or above the Hough threshold) are also considered as contributing to the global occlusion degree of the view. Let n be the number of edges in a view; n_to the number of edges in the same


Fig. 14. Edge image relative to Fig. 13, obtained with Canny edge filtering.

view, which are totally occluded by the presence of another object; and n_po the number of partially occluded edges; then the occlusion degree OD1 is defined as follows:

OD1 = (n_to + n_po)/n.
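The occlusion indices can be computed directly from the edge counts defined above; a small sketch (function and variable names are illustrative), also covering the second index OD2, which counts only the fully occluded edges:

```python
# Occlusion degrees of an object view from its edge counts:
# n edges in total, n_to totally occluded, n_po partially occluded.
def occlusion_degrees(n, n_to, n_po):
    if n == 0:
        raise ValueError("view has no edges")
    od1 = (n_to + n_po) / n   # partially occluded edges also contribute
    od2 = n_to / n            # only fully occluded edges
    return od1, od2
```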

In order to take into account the disparity between the behavior of the Hough transform and what is perceived by humans as occlusion, we introduce a second occlusion degree index, OD2, which only counts the number of fully occluded edges:

OD2 = n_to/n.

The tests have been made on 111 views of the ten wooden models, which have been occluded at increasing values of OD1 and OD2. Sample occluded views taken from the test set are shown in Fig. 15, each with the corresponding non-occluded view and the two occlusion degree indices indicated. The neural networks used are the same three nets, trained with the CAD models and tested on real objects with no occlusion. The results are reported in Tables 1 and 2, where the first column displays the occlusion degree and the other two columns display the absolute and relative numbers of misclassified views, having considered as misclassified both the views attributed to a wrong object and the views that have not been assigned to any object. The results show that a good performance is attained whenever the degree of occlusion does not exceed OD1 = 0.4 and OD2 = 0.3, which is a remarkable level of occlusion. The performance degenerates when the objects are almost totally hidden.

6. CONCLUSIONS

In this paper, the problem of fully 3D planar-faced object classification from single perspective views has been addressed. A connectionist approach to 3D recognition has been formulated, based on an array of multi-object Kohonen networks, linked together with a MAXNET stage, to create a modular hierarchical architecture which has the property of being extensible to a set of objects of arbitrary number. The neural approach avoids the need for a large model database, since each network is able to "learn" the whole 3D structure of an arbitrary planar-faced object in an unsupervised manner. In this architectural context, criteria have been defined for an efficient representation of a 3D planar-faced shape, based on an extension of the Hough transform, whose computation algorithm has been optimized and tuned to the specific environment of polyhedral objects.


Fig. 15. Edge images of sample occluded views taken from the testing set. In the upper part the corresponding non-occluded views are displayed. The two occlusion degrees OD1 and OD2 are also reported for each view.

Table 1. Misclassification rate with different values of OD1

OD1     Misclassified views     Misclassification rate (%)
0.1     0                       0
0.15    1                       8
0.2     1                       8
0.25    1                       8
0.3     2                       10
0.4     2                       10
0.5     3                       37
0.6     4                       40
0.7     3                       37

Table 2. Misclassification rate with different values of OD2

OD2     Misclassified views     Misclassification rate (%)
0.0     0                       0
0.1     0                       0
0.15    1                       7
0.2     3                       14
0.25    2                       15
0.3     2                       13
0.4     4                       33
0.5     5                       56

As to the assignment of an objects' subset to a specific network, a quasi-optimal partition criterion of objects into disjoint parts has been inferred, combining the analysis of the feature space topology and the Kohonen network clustering properties. Regarding the network design, the former assessments, in accordance with experimental results and heuristic considerations, have been used to give a rough estimate of network dimensions, on the basis of the objects' structure. Performance of the recognition system has been tested on a sample set of both convex and concave


objects. The results reported in this paper show that the system exhibits good classification rates, as well as robustness with respect to noise in the input views and to errors due to the WPHT. Extensive tests on real scenes have been performed, showing good behavior even in the presence of partial occlusion. Moreover, the intrinsic parallelism of the system suggests its implementation for real-time recognition tasks.

REFERENCES

1. P. J. Besl and R. C. Jain, Three-dimensional object recognition, Comput. Surveys 17, 75-145 (March 1985).
2. F. Ulupinar and R. Nevatia, Constraints for interpretation of line drawings under perspective projection, Comput. Vision Graphics Image Process. 53, 88-96 (January 1991).
3. S. T. Barnard, Interpreting perspective images, Artif. Intell. 21, 435-462 (1983).
4. R. Horaud, New methods for matching 3D objects with single perspective views, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-9, 401-412 (May 1987).
5. H. Freeman and I. Chakravarty, The use of characteristic views in the recognition of three-dimensional objects, Pattern Recognition in Practice, E. Gelsema and L. Kanal, eds, pp. 277-288, North-Holland, Amsterdam (1980).
6. R. Wang and H. Freeman, The use of characteristic view classes for 3D object recognition, Machine Vision for Three-dimensional Scenes, Academic Press (1990).
7. F. A. Sadjadi and E. L. Hall, Three-dimensional moment invariants, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-2, 127-136 (March 1980).
8. C.-H. Teh and R. T. Chin, On image analysis by the methods of moments, IEEE Trans. Pattern Anal. Mach. Intell. 10, 496-513 (July 1988).
9. D. Casasent, B. Vijaya-Kumar and V. Sharma, Synthetic discriminant functions for three-dimensional object recognition, Proc. SPIE Conf. Robotics and Industrial Inspection 360, 136-142, San Diego, California (1982).
10. T. Poggio and F. Girosi, Networks for approximation and learning, Proc. IEEE 78, 1481-1497 (1990).
11. H. Wechsler and G. L. Zimmerman, 2D invariant object recognition using distributed associative memory, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-10, 811-821 (November 1988).
12. W. Polzleitner and H. Wechsler, Selective and focused invariant recognition using distributed associative memories (DAM), IEEE Trans. Pattern Anal. Mach. Intell. PAMI-12, 809-814 (August 1990).
13. G. N. Bebis and G. Papadourakis, Object recognition using invariant object boundary representations and neural network models, Pattern Recognition 25, 25-44 (1992).
14. J. Basak, C. A. Murthy, S. Chaudhury and D. D. Majumder, A connectionist model for category perception: theory and implementation, IEEE Trans. Neural Networks 4, 257-269 (March 1993).
15. W.-C. Lin, F.-H. Liao, C.-K. Tsao and T. Lingutla, A hierarchical multiple-view approach to three-dimensional object recognition, IEEE Trans. Neural Networks 2, 84-92 (January 1991).
16. T. Poggio and S. Edelman, A network that learns to recognize three-dimensional objects, Nature 343, 263-266 (January 1990).
17. T. Kohonen, Self-Organization and Associative Memory, Springer-Verlag (1988).
18. D. Forsyth, J. L. Mundy, A. Zisserman, C. Coelho, A. Heller and C. Rothwell, Invariant descriptors for 3D object recognition and pose, IEEE Trans. Pattern Anal. Mach. Intell. 13, 971-991 (October 1991).
19. J. B. Burns, A. R. Hanson and E. M. Riseman, Extracting straight lines, IEEE Trans. Pattern Anal. Mach. Intell. 8, 425-455 (1986).
20. L. S. Davis, A survey of the edge detection techniques, Comput. Graph. Image Process. 4, 248-270 (1975).
21. P. V. C. Hough, Method and Means for Recognizing Complex Patterns, U.S. Patent 3 069 654 (1962).
22. W.-N. Lie and Y.-C. Chen, Robust line-drawing extraction for polyhedra using weighted polarized Hough transform, Pattern Recognition 23, 261-274 (1990).
23. K. D. Miller, Neuroscience and Connectionist Theory, M. A. Gluck and D. E. Rumelhart, eds, pp. 267-353, Lawrence Erlbaum Associates, New Jersey (1990).
24. C. J. Shatz, Neuron 5 (1990).
25. D. H. Hubel and T. N. Wiesel, Receptive fields, binocular interaction and functional architecture in the cat's visual cortex, J. Physiol. 160, 106-154 (1962).
26. D. H. Hubel and T. N. Wiesel, Sequence, regularity and geometry of orientation columns in the monkey striate cortex, J. Comp. Neurol. 158, 267-293 (1974).
27. T. Kohonen, Physiological interpretation of the self-organizing map algorithm, Neural Networks 6, 895-905 (1993).
28. D. H. Hubel and T. N. Wiesel, Receptive fields and functional architecture in two non-striate visual areas (18 and 19) of the cat, J. Neurophysiol. 28, 229-289 (1965).
29. C. von der Malsburg, Self-organization of orientation-sensitive cells in the striate cortex, Kybernetik 15, 85-100 (1973).
30. C. von der Malsburg and W. Singer, Principles of cortical network organization, Neurobiology of Neocortex, P. Rakic and W. Singer, eds, pp. 69-99, Wiley, New York (1988).
31. T. Kohonen, The self-organizing map, Proc. IEEE 78, 1464-1480 (September 1990).
32. H. White, Some asymptotic results for learning in single hidden-layer feedforward network models, J. Am. Stat. Assoc. 84, 1003-1013 (December 1989).
33. P. Ružička, On the convergence of learning algorithm for topological maps, Neural Network World 4, 413-424 (1993).
34. K. Obermayer, G. G. Blasdel and K. J. Schulten, Statistical-mechanical analysis of self-organization and pattern formation during the development of visual maps, Phys. Rev. A 45, 7568-7589 (1992).
35. C. Bouton and G. Pagès, Self-organization of the one-dimensional Kohonen algorithm with non-uniformly distributed stimuli, Stochastic Proc. Applic. 47, 249-274 (1993).
36. C. Bouton and G. Pagès, Convergence in distribution of the one-dimensional Kohonen algorithm when the stimuli are not uniform, Adv. Appl. Probability 26 (March 1994).
37. H. Ritter and K. Schulten, On the stationary state of Kohonen's self-organizing sensory mapping, Biol. Cybern. 54, 99-106 (1986).
38. H. Ritter and K. Schulten, Convergence properties of Kohonen's topology conserving maps: fluctuations, stability and dimension selection, Biol. Cybern. 60, 59-71 (1988).
39. J. C. Fort and G. Pagès, Sur la convergence p.s. de l'algorithme de Kohonen généralisé, t. 317, Série I, pp. 389-394 (1993).
40. R. M. Haralick, Using perspective transformations in scene analysis, Comput. Graph. Image Process. 13, 191-221 (1980).
41. A. Del Bimbo, L. Landi and S. Santini, 3D planar-faced object recognition using Kohonen networks, Optical Engng 32, 1222-1234 (June 1993).

Multi-object Kohonen networks

935

42. J. F. Canny, A computational approach to edge detection, IEEE Trans. Pattern Anal. Mach. Intell. 8, 679-698 (November 1986).

43. R. Nevatia and R. Mohan, Perceptual organization for scene segmentation and description, IEEE Trans. Pattern Anal. Mach. Intell. 14, 616-635 (June 1992).

About the Author--ALBERTO DEL BIMBO was born in Firenze, Italy, in 1952. He received the doctoral degree in Electronic Engineering from the Università di Firenze, Italy, in 1977. He was with IBM ITALIA from 1978 to 1988. He is Full Professor of Computer Systems at the University of Brescia and the University of Florence, Italy. Dr Del Bimbo is a member of IEEE and of the International Association for Pattern Recognition (IAPR). He is on the board of the IAPR Technical Committee n. 8 (Industrial Applications) and is the Vice-President of the IAPR Italian Chapter. He presently serves as Associate Editor of Pattern Recognition Journal and of the Journal of Visual Languages and Computing. His research interests and activities are in the field of image analysis, image databases, visual languages and virtual reality.

About the Author--JACOPO MARIA CORRIDONI was born in Fano, Italy, in 1968. He received the Laurea in Electronic Engineering from the Università di Firenze, Italy, in 1993. He is a Ph.D. student at the Dipartimento di Sistemi e Informatica of the University of Firenze. His research interests are in the field of image analysis, neural networks for computer vision and multimedia systems. Jacopo Maria Corridoni is a member of IAPR.

About the Author--LEONARDO LANDI was born in Florence, Italy, in 1965. He received the doctoral degree in Electronic Engineering from the Faculty of Engineering, University of Florence, Italy, in 1991. He was at the IBM Almaden Research Center, San Jose (CA), during the summer of 1994. He is currently a Ph.D. student at the University of Florence. His research interests include computer vision, pattern recognition, time series analysis and neural networks. Leonardo Landi is a member of IAPR.
