but only to distinguish between null and non-null vectors, the low reliability of their estimation has to be taken into proper account.

2.2. Clustering of the interest points

In order to compensate for changes in the number of points due to perspective and to partial occlusions, the algorithm needs to partition the detected points into clusters corresponding to separated groups of persons, so as to be able to compute for each group its distance from the camera and its density.

The clustering problem at hand is characterized by the fact that we have no a priori knowledge about the number and the shape of the clusters to be found, because people can appear in different positions in the scene and can be aggregated in many different ways. In this situation, the most commonly used clustering methods (such as k-means) cannot be applied, because they require the user to provide either the number of desired clusters or a threshold on cluster diameter or on inter-cluster distance. As observed in [12], the clustering algorithms based on graph theory are well suited to clustering problems where no assumptions can be made about the clusters. In particular, we adopted the technique presented in [6], since (differently from other algorithms in the graph-based clustering family) it requires no parameters to be tuned or adapted to the particular application.

This algorithm represents the set of points as a graph in which each point corresponds to a node and each edge is labeled with the distance between its endpoints. The minimum spanning tree (MST) of the graph is computed; this tree will contain some edges between nodes in the same cluster (intra-cluster edges) and other edges between nodes of different clusters (inter-cluster edges). Assuming that the clusters are well separated, it can be expected that the intra-cluster edges are shorter than the inter-cluster edges. The algorithm therefore thresholds the edges into two sets: those whose length is below a threshold λ and those whose length is above it. The edges in the second set are deleted, and the remaining connected components are the clusters output by the algorithm.

The use of a fixed value for the threshold λ would be problematic, since the threshold would need to be adjusted depending on the resolution, the distance from the camera and so on. Instead, we have used a threshold proportional to the average edge length, computed as:

\[ \lambda = \gamma \cdot \frac{1}{N} \sum_{i=1}^{N} x_i \qquad (2) \]

where γ is the proportionality factor, N is the number of edges of the spanning tree, and x_i is the weight of the i-th edge of the tree. As described in the experimental section, values of γ ≥ 2.0 work adequately.

In the ideal case, all the intra-cluster edges are preserved, while the inter-cluster edges are removed, leaving a set of connected trees corresponding to the desired clusters of interest points. However, it commonly happens that some edges are misclassified, producing two types of clustering errors.

The first type of error is due to inter-cluster edges classified as intra-cluster: this misclassification prevents some clusters from being split, so that they end up merged. However, this situation does not represent a problem when the joined clusters refer to groups of people which are at the same distance from the camera. This typically happens when the clusters are horizontally aggregated (see the example in Figure 2.a). In fact, in this case the perspective distortion does not change significantly among the joined clusters and the error introduced can be considered negligible. Conversely, when the erroneously combined clusters refer to groups of people at different distances from the camera (typically when clusters are joined vertically), this causes errors in the estimation of the number of people inside the groups whose distance from the camera is erroneously evaluated (see the box with ID = 2 in Figure 3.b for an example of this problem). It is worth noting that even if the estimation for clusters formed by people at different distances may be inaccurate, it is still an improvement over the use of a global estimate based on all the detected points in the scene. Furthermore, we have experimentally verified that the latter circumstance occurs rather infrequently, hence its impact on the overall performance is limited.

The second type of error is due to intra-cluster edges classified as inter-cluster: this phenomenon causes a cluster to be split into several parts. Similar considerations apply to this type of error as regards the different incidence and impact on the overall performance depending on the way the splits occur (horizontal or vertical). An example of this type of error is shown in Figure 2.b.

Another important problem faced in this stage is the removal of outliers, i.e. those interest points which are output by the previous stage but are not associated to people. It is quite easy to distinguish between the correct moving points and the outliers on the basis of the local point density. In fact, while the points associated to people are concentrated in small areas in the input image (those occupied by the persons), erroneously detected moving points are randomly spread throughout the frame. Consequently, after clustering, the outliers tend to form singleton or very small isolated clusters, which can be simply cut off by adopting a procedure that deletes those clusters with a number of points below a fixed threshold τ.

Overall, the computations carried out in this second stage of the system can be summarized in the following steps:

1. define the graph G(V, E, w) from V and E, where V
is the set of the detected moving SURF points, while E is the set of all weighted edges calculated ∀(v_i, v_j) ∈ V × V with i ≠ j; with w(e_{i,j}) we denote the value of the weight of the edge e_{i,j} ∈ E, which is calculated as the Euclidean distance between v_i, v_j ∈ V in the image plane;

2. compute the minimum spanning tree G(V, T, w) of G(V, E, w);

3. calculate λ according to Eq. 2;

4. define the forest G(V, F, w), where F = T − {e ∈ T | w(e) > λ};

5. find the set C = {G(V_i, F_i, w) | V = ∪ V_i, F = ∪ F_i, G(V_i, F_i, w) is a connected component in G(V, F, w)};

6. determine the set C′ = C − {G(V_i, F_i, w) | |F_i| ≤ τ}.

2.3. Feature extraction and regression

In this stage of the algorithm, a feature vector is computed from each cluster detected in the previous step, and is fed into a regressor. The output of the regressor is the estimated number of persons in the group represented by the cluster.

The basic idea of the method in [2] is that the average number of interest points associated to each person is a global property of the scene. Thus, once the scene has been defined, it is possible to assume a simple direct proportionality relation between the number of points and the number of persons.

As noted by the same authors in [2], this model, albeit extremely simple, performs well in scenes where people are more or less at the same distance from the camera, and there are only limited overlaps between persons. When these assumptions are satisfied, deviations from the model are due either to the fact that some interest points are missed (e.g. because a part of the body is very similar to the underlying background) or to the limited reliability of motion vector estimation, which may cause some static points to be considered as moving. However, as can be deduced from the good performance shown by Albiol's method on the PETS2009 dataset, those deviations from the model often compensate each other, and so the method gives a reasonable count, at least on average. Unfortunately, this model does not take into account the effects of perspective: the farther the person is from the camera, the fewer the detected interest points (see Figure 3 for an example of this problem). Hence, the number of points associated to a person and the distance of the person from the camera are somehow related, and the relation is nonlinear.

Moreover, the assumption of a proportionality relation between the number of persons and the number of points holds only when people are well separated from each other. On the other hand, when people are close to each other some parts of their bodies are occluded and, consequently, some interest points are not detected. Therefore, there is also a relation (whose exact form is not easy to find analytically) between the average number of points per person and the people density. Unfortunately, we do not know the people density of each cluster. However, we can reasonably assume that when the density of people increases, the detected points get closer to each other. So we can consider the density of the points as related to people density, and we can indirectly take into account people density by establishing a relation between the average number of points per person and the point density.

In conclusion, the relation between the number of interest points and the number of people appears more complex than a direct proportionality, as we also have to take into account the distance of the people from the camera and the point density. We can formulate this relation as:

Figure 2. Clusters of points detected by the second stage of the system. Each cluster is enclosed by a bounding box. The images also contain examples of clustering errors. (a) In cluster 1 (green interest points) two groups of people have been erroneously aggregated. (b) A group of people is erroneously split in two clusters (yellow and cyan points, clusters 3 and 5).
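Returning to the clustering stage of Section 2.2, steps 1 to 6 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `cluster_points`, the use of Prim's algorithm for the MST, and the union-find bookkeeping are choices made here for brevity, and the default values γ = 2.0 and τ = 5 are the ones reported for the tests. Following step 6 literally, a component is discarded when its number of edges |F_i| is at most τ.

```python
import math

def cluster_points(points, gamma=2.0, tau=5):
    """Hypothetical sketch of steps 1-6: returns clusters as lists of point indices."""
    n = len(points)
    if n < 2:
        return []
    # Steps 1-2: MST of the complete Euclidean graph (Prim's algorithm, an assumption).
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    mst = []                # MST edges as (weight, u, v)
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            mst.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    # Step 3: threshold proportional to the mean MST edge length (Eq. 2).
    lam = gamma * sum(w for w, _, _ in mst) / len(mst)
    # Steps 4-5: drop edges with w > lambda, then collect connected components.
    root = list(range(n))
    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]
            i = root[i]
        return i
    for w, u, v in mst:
        if w <= lam:
            root[find(u)] = find(v)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # Step 6: a tree component with k points has k - 1 edges; discard it if |F_i| <= tau.
    return [sorted(g) for g in groups.values() if len(g) - 1 > tau]
```

For example, two compact groups of eight points each, placed far apart, plus a single stray point between them, should yield two clusters, with the stray point discarded as an outlier.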
Region   Top-left (x, y)   Bottom-right (x, y)
R0       (10, 10)          (750, 550)
R1       (290, 160)        (710, 430)
R2       (30, 130)         (230, 290)

Table 1. Coordinates of the top left and bottom right corners (in pixels) of the regions of interest considered for person count.
                 ID of the video sequences of dataset S1
Region   Index   L1.13-57   L1.13-59   L2.14-06   L3.14-17
R0       MAE     1.14       1.59       5.12       2.20
         MRE     5.7%       13.4%      21.4%      10.1%
R1       MAE     1.46       0.82       1.99       1.62
         MRE     16.9%      15.1%      15.1%      24.7%
R2       MAE     0.73       0.87       3.12       2.88
         MRE     10.4%      15.6%      29.2%      19.4%

Table 2. Performance of the proposed algorithm in terms of the MAE and the MRE (as a percentage) indices.

required number of training frames does not have to be very large to achieve a good performance level (in our tests we used about 15 training frames from each of the video sequences), also taking into account the fact that a single frame usually contains several people clusters at different distances, so it may cover several cases of the function to be learned. Overall, the training set adopted for our experiments is composed of about 800 clusters. Furthermore, for the test we set γ = 2.0 and τ = 5.

Testing has been carried out by comparing the actual number of people in the three regions of the video sequences with the number of people calculated by the algorithm. The indices used to report the performance are the Mean Absolute Error (MAE) and the Mean Relative Error (MRE), defined as:

\[ MAE = \frac{1}{N} \cdot \sum_{i=1}^{N} |G(i) - T(i)| \qquad (4) \]

\[ MRE = \frac{1}{N} \cdot \sum_{i=1}^{N} \frac{|G(i) - T(i)|}{T(i)} \qquad (5) \]

where N is the number of frames of the test sequence and G(i) and T(i) are the guessed and the true number of persons in the i-th frame, respectively.

The MAE index quantifies the error in the estimation of the number of persons in the region of interest, but it does not relate this error to the number of people; in fact, the same absolute error can be considered negligible if the number of persons in the ROI is high, while it becomes significant if the number of persons is of the same order of magnitude as the error. For this reason, we also use the MRE index, which relates the estimation error to the true number of people.

The performance of the proposed method on the four considered sequences of the dataset S1 is reported in Table 2. The results show that the proposed method performs well on all the considered sequences, both in terms of absolute and relative estimation errors (MAE between 0.73 and 5.12, MRE between 5.7% and 29.2%). This is more evident if the results are compared with those obtained by the algorithms that participated in the PETS 2009 benchmarking on people counting.

In order to have a deeper insight into the behavior of the considered algorithms, Figure 4 shows the estimated and the true number of people over time for the considered regions (R0, R1 and R2). From these figures it is possible to notice that in almost all cases the method is able to provide a good estimate of the number of people in the scene. The only exception is the sequence S1.L2.14-06, where the number of persons is underestimated. This is due to the fact that in that video there is a very large and dense crowd that is probably not well represented in the training set.

In the following we report the results of some tests aimed at evaluating the robustness of the proposed method with respect to the main setup parameters. In particular, we take into account the dependence of the estimation error on the value of the parameter γ used during the clustering procedure, on the minimum number of points in a cluster τ, and on the size of the set adopted for the training of the ε-SVR regressor.

Figure 5 shows the plots of the average estimation error (MAE) on the region R0 for different values of γ. From the graphs it is evident that the value of this parameter can vary in a wide range (γ ≥ 2.0) without significantly affecting the overall performance of the system. Conversely, if too low a value is chosen for γ, the performance degrades significantly. This behavior can be explained by considering that in this case a large number of edges of the MST are deleted, producing many small clusters that correspond only to parts of persons.

Figure 6 shows the plots of the average estimation error (MAE) on the region R0 for different values of τ. Also in this case the performance of the proposed method is only loosely dependent on the value of the parameter τ, so that, similarly to the setting of γ, no fine-tuning procedure is required in order to select the best value for it.

The results of the analysis regarding the dependence of the performance of the system on the size of the training set are shown in Figure 7. As can be seen, the number of training samples required to learn the function f is not very large, confirming that the setup procedure for the proposed algorithm is reasonably easy and robust.

4. Conclusions

In this paper, we have presented a method for counting moving people in a video surveillance scene. The proposed method has its roots in the algorithm by Albiol et al., which was among the best techniques that participated in the PETS 2009 benchmarking session on people counting. The experimentation on the PETS 2009 database has confirmed that the proposed method obtains good performance and has proved to be very robust with respect to the choice of the parameters.
Figure 4. Plots of the estimated and the true number of people over time for the considered regions (R0, R1 and R2).
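The gap between the estimated and true curves of Figure 4 is summarized by the MAE and MRE indices of Eqs. 4 and 5. As a minimal sketch, they could be computed from two per-frame count sequences as follows. The function names `mae` and `mre` and the error handling are hypothetical choices made here; in particular, rejecting frames with a true count of zero is an assumption, since Eq. 5 is undefined when T(i) = 0 and the text does not say how such frames are treated.

```python
def mae(guessed, true):
    """Mean Absolute Error (Eq. 4) between guessed and true per-frame counts."""
    if len(guessed) != len(true) or not true:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(g - t) for g, t in zip(guessed, true)) / len(true)

def mre(guessed, true):
    """Mean Relative Error (Eq. 5); assumes every true count is positive."""
    if len(guessed) != len(true) or not true:
        raise ValueError("sequences must be non-empty and of equal length")
    if any(t <= 0 for t in true):
        # Assumption: frames with no people are not handled by Eq. 5.
        raise ValueError("true counts must be positive")
    return sum(abs(g - t) / t for g, t in zip(guessed, true)) / len(true)
```

With guessed counts (9, 12, 20) against true counts (10, 10, 20), the absolute errors are 1, 2 and 0, so the MAE is 1.0, while the relative errors are 0.1, 0.2 and 0, giving an MRE of about 0.1 (10%).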
Figure 5. Plots of the MAE values on the region R0 for the considered test video sequences as a function of γ.

Figure 6. Plots of the MAE values on the region R0 for the considered test video sequences as a function of τ.

Figure 7. Plots of the MAE values on the region R0 for the considered test video sequences as a function of the size of the training set (number of samples).

References
[1] http://www.cvg.rdg.ac.uk/PETS2009/.
[2] A. Albiol, M. J. Silla, A. Albiol, and J. M. Mossi. Video analysis using corner motion statistics. In IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, pages 31–38, 2009.
[3] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool. SURF: Speeded up robust features. Computer Vision and Image Understanding, 110(3):346–359, 2008.
[4] G. J. Brostow and R. Cipolla. Unsupervised Bayesian detection of independent motion in crowds. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 594–601, 2006.
[5] S.-Y. Cho, T. W. S. Chow, and C.-T. Leung. A neural-based crowd estimation by hybrid global learning algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 29(4):535–541, 1999.
[6] P. Foggia, G. Percannella, C. Sansone, and M. Vento. A graph-based algorithm for cluster detection. International Journal of Pattern Recognition and Artificial Intelligence, 22(5):843–860, 2008.
[7] C. Harris and M. Stephens. A combined corner and edge detector. In Proceedings of the 4th Alvey Vision Conference, pages 147–151, 1988.
[8] D. Kong, D. Gray, and H. Tao. A viewpoint invariant approach for crowd counting. In International Conference on Pattern Recognition, pages 1187–1190, 2006.
[9] A. N. Marana, L. da F. Costa, R. A. Lotufo, and S. A. Velastin. Estimating crowd density with Minkowski fractal dimension. In Int. Conf. on Acoustics, Speech and Signal Processing, volume 6, pages 3521–3524, 1999.
[10] H. Rahmalan, M. S. Nixon, and J. N. Carter. On crowd density estimation for surveillance. In The Institution of Engineering and Technology Conference on Crime and Security, 2006.
[11] J. Rittscher, P. Tu, and N. Krahnstoever. Simultaneous estimation of segmentation and shape. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 486–493, 2005.
[12] S. Theodoridis and K. Koutroumbas. Pattern Recognition, Third Edition. Academic Press, February 2006.
[13] T. Zhao, R. Nevatia, and B. Wu. Segmentation and tracking of multiple humans in crowded environments. IEEE Trans. Pattern Anal. Mach. Intell., 30(7):1198–1211, 2008.