
2010 Seventh IEEE International Conference on Advanced Video and Signal Based Surveillance

A Method Based on the Indirect Approach for Counting People in Crowded Scenes∗

D. Conte, P. Foggia, G. Percannella and M. Vento


Dipartimento di Ingegneria dell’Informazione ed Ingegneria Elettrica
Via Ponte Don Melillo, 1 - 84084 Fisciano (SA) - Italy
{dconte,pfoggia,pergen,mvento}@unisa.it

∗ This research has been partially supported by A.I.Tech s.r.l., a spin-off company of the University of Salerno.

Abstract

This paper presents a method for counting people in a scene by establishing a mapping between some scene features and the number of people, thus avoiding the complex foreground detection problem. The method is based on the use of SURF features and of an ε-SVR regressor to provide an estimate of this count. The algorithm specifically takes into account problems due to partial occlusions and to perspective.

1. Introduction

The estimation of the number of people present in an area can be an extremely useful piece of information, both for security/safety reasons (for instance, an anomalous change in the number of persons could be the cause or the effect of a dangerous event) and for economic purposes (for instance, optimizing the schedule of a public transportation system on the basis of the number of passengers). Hence, several works in the fields of video analysis and intelligent video surveillance have addressed this task.

The problem of people counting has been faced using two different approaches. In the direct approach (also called detection-based), people in the scene are first individually detected, using some form of segmentation and object detection, and then counted. In the indirect approach (also called map-based or measurement-based), counting is instead performed using the measurement of some feature that does not require the separate detection of each person in the scene. The indirect approach is considered to be more robust, since the correct segmentation of people in the scene is by itself a complex problem that cannot be solved reliably, especially in crowded conditions.

Recent examples of the direct approach are [11], [4] and [13]. In [11], the shape of a standing person is modeled as a rectangle of fixed width and height (normalized on the basis of perspective mapping). The system first detects foreground regions using background subtraction, and then tries to match people models with the observed edges of foreground regions, using a global optimization technique based on the Expectation-Maximization algorithm. While the method is able to deal with partially occluded persons, the assumption that the foreground region contour contains enough edges that can be ascribed to each person limits its applicability to cases where the density of the crowd is low. In [4], the tracking of feature points is performed first, and then the points are grouped into objects according to their motion characteristics. Namely, feature points are extracted using methods from the literature, and are tracked using a combination of optical flow and searching in a 2D window around the previous feature position. Then, the feature points are clustered using a Bayesian framework, under the assumption that pairs of points belonging to the same person have a small variance in their mutual distance (quasi-rigid motion). While the method seems to perform well, even with high crowd densities, when the motion is mainly parallel to the image plane, it has problems with motion directed towards or away from the camera. Also, the method can have problems in low density conditions, where the motion of arms and legs is clearly visible, because of its rigid motion assumption. In [13], a 3D model of the human body is used. Each person is represented as a set of ellipsoids corresponding to the head, the body and the legs. The model is matched to the detected foreground regions using a Markov Chain Monte Carlo (MCMC) approach that performs a global optimization of the a posteriori probability across multiple frames (in order to exploit temporal coherence). While the method provides good performance with low to medium crowd densities, it could have problems in very crowded scenes. Furthermore, it is very computationally intensive, which makes it impractical for real-time applications.

For the indirect approach, recent methods have proposed, among others, the use of measurements such as the amount of moving pixels [5], blob size [8], fractal dimension [9] or other texture features [10].



Figure 1. System architecture (video frames → Moving SURF points detection → moving SURF points → Clustering → clusters of salient points → Features Extraction → one feature vector per cluster → ε-SVR Regression → estimated number of persons per cluster).

A recent method following the indirect approach has been proposed by Albiol et al. in [2]. This method was submitted to the PETS 2009 contest on people counting, and obtained the best performance among the contest participants. In Albiol's paper, the authors propose the use of corner points as features. Namely, corner points are found using a variant of the popular Harris corner detector [7]. Then, background corner points are separated from foreground corner points using an estimate of the motion vector based on block matching between adjacent frames: points whose motion speed is under a threshold are not considered for further analysis. Finally, the number of people is estimated from the number of moving corner points assuming a direct proportionality relation, with a constant factor determined using a frame of the video sequence. The count so obtained is then smoothed by averaging over a few adjacent frames to remove fluctuations due to noise.

Although the assumptions underlying Albiol's paper may appear rather simplistic, the method has proven to be considerably more robust than more sophisticated competitors. However, the accuracy it can attain is limited by the fact that it does not take into account problems such as the instability of the Harris corner detector, the presence of occlusions, or the need for a perspective correction.

In this paper we discuss the results of a method that, while retaining the overall simplicity and robustness of Albiol's approach, tries to provide a more accurate estimation of the count by also considering these factors.

2. System Architecture of the proposed method

The overall system architecture of the proposed algorithm for people counting is shown in Figure 1. The system operates according to the three phases reported below:

1. Detection of the interest points associated to people
2. Clustering of the interest points
3. Feature extraction and regression

In the following we provide some details about each of the considered phases.

2.1. Detection of the interest points associated to people

In order to detect interest points associated to people we make two basic assumptions: persons within the scene are not static, and there are no other moving elements in the scene. Thus, if we compute the interest points of the image and the associated motion information, the above assumptions guarantee that only the interest points with a non-null motion vector must be associated to people. Note that the first assumption holds very often: in fact, although a person might appear static, some motion, even very small, is usually associated to her/him. The second assumption stands in most real-world applications where it is required to count people in the scene; of course, in the rare cases in which the second assumption is not verified (waving trees, moving vehicles, etc.), the proposed method cannot be adopted.

As proposed in [2], the interest points associated to people are extracted in two steps. First, we determine all the interest points within the frame under analysis. Then, we prune the points not associated to persons by taking into account their motion information. Interest points are determined by using the SURF algorithm [3] and not the Harris corner detector as in the paper by Albiol et al. [2]. As discussed in the previous Section, the motivation behind this choice is that the interest points extracted using SURF are scale-invariant, thus they are much more stable than the points found by the Harris corner detector.

In order to remove the static interest points (that are not associated to people), for each point detected by the SURF algorithm we estimate the motion vector with respect to the previous frame by using a block-matching technique. Then we distinguish between static and moving interest points on the basis of the following rule:

$$p(x,y) = \begin{cases} \text{moving point} & \text{if } |\vec{v}(x,y)| > \beta \\ \text{static point} & \text{if } |\vec{v}(x,y)| \le \beta \end{cases} \qquad (1)$$

where p(x, y) is the interest point at the (x, y) coordinates, |v(x, y)| is the magnitude of the motion vector calculated in (x, y) with respect to the previous frame, and β is a threshold value (in our experiments we set β = 0.0).

Unfortunately, motion vectors are only relatively reliable. Occlusions, sudden changes in illumination, and artifacts introduced by the compression of the video stream may cause errors in the estimation of the motion vectors. Although we are not interested in the exact value of the motion vectors, but only in distinguishing between null and non-null vectors, the low reliability of their estimation has to be taken into proper account.
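As an illustration of this detection step, the sketch below (Python, not the authors' implementation) extracts SURF keypoints and keeps only those whose block-matching displacement exceeds β, as in Eq. (1). The Hessian threshold, block size and search range are illustrative values not given in the paper, and SURF requires an OpenCV build with the non-free contrib modules; another detector (e.g. ORB) could be substituted for experimentation.

```python
# Sketch of the moving-interest-point detection of Sec. 2.1 (not the authors' code).
import cv2
import numpy as np

BETA = 0.0     # motion-magnitude threshold of Eq. (1), as in the paper
BLOCK = 8      # half-size of the matching block (assumed value)
SEARCH = 6     # half-size of the search window (assumed value)

def moving_surf_points(prev_gray, curr_gray):
    """Return the SURF keypoints of curr_gray whose estimated motion
    with respect to prev_gray has magnitude greater than BETA."""
    detector = cv2.xfeatures2d.SURF_create(hessianThreshold=400)  # needs opencv-contrib
    keypoints = detector.detect(curr_gray, None)
    h, w = curr_gray.shape
    moving = []
    for kp in keypoints:
        x, y = int(round(kp.pt[0])), int(round(kp.pt[1]))
        # Skip points too close to the border for the search window.
        if not (BLOCK + SEARCH <= x < w - BLOCK - SEARCH and
                BLOCK + SEARCH <= y < h - BLOCK - SEARCH):
            continue
        block = curr_gray[y - BLOCK:y + BLOCK, x - BLOCK:x + BLOCK].astype(np.float32)
        # Exhaustive block matching against the previous frame: the displacement
        # of the best-matching block is taken as the motion vector of the point.
        best_sad, best_disp = None, (0, 0)
        for dy in range(-SEARCH, SEARCH + 1):
            for dx in range(-SEARCH, SEARCH + 1):
                cand = prev_gray[y + dy - BLOCK:y + dy + BLOCK,
                                 x + dx - BLOCK:x + dx + BLOCK].astype(np.float32)
                sad = np.abs(block - cand).sum()
                if best_sad is None or sad < best_sad:
                    best_sad, best_disp = sad, (dx, dy)
        # Rule of Eq. (1): keep the point only if the motion magnitude exceeds beta.
        if np.hypot(*best_disp) > BETA:
            moving.append(kp)
    return moving
```

Both frames are assumed to be single-channel (grayscale) images of the same size.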

2.2. Clustering of the interest points

In order to compensate for changes in the number of points due to perspective and to partial occlusions, the algorithm needs to partition the detected points into clusters corresponding to separate groups of persons, so as to be able to compute for each group its distance from the camera and its density.

The clustering problem faced here is characterized by the fact that we do not have any a priori knowledge about the number and the shape of the clusters to be found. This depends on the fact that people can appear in different positions in the scene and can be aggregated in many different ways. In this situation, the most commonly used clustering methods (such as k-means) could not be applied, because they require the user to provide either the number of desired clusters or a threshold on the cluster diameter or on the inter-cluster distance. As observed in [12], clustering algorithms based on graph theory are well suited to clustering problems where no assumptions can be made about the clusters. In particular, we adopted the technique presented in [6], since (differently from other algorithms in the graph-based clustering family) it requires no parameters to be tuned or adapted to the particular application.

This algorithm represents the set of points as a graph in which each point corresponds to a node and each edge is labeled with the distance between its endpoints. The minimum spanning tree (MST) of the graph is computed; this tree will contain some edges between nodes in the same cluster (intra-cluster edges) and other edges between nodes of different clusters (inter-cluster edges). Assuming that the clusters are well separated, it can be expected that the intra-cluster edges are shorter than the inter-cluster edges. So the algorithm uses a threshold, say λ, to divide the edges into two sets (the ones below λ and the ones above λ). The edges in the second set are deleted, and the remaining connected components are the clusters output by the algorithm.

The use of a fixed value for the threshold λ would be problematic, since the threshold would need to be adjusted depending on the resolution, the distance from the camera and so on. Instead, we have used a threshold proportional to the average edge length, computed as:

$$\lambda = \gamma \cdot \frac{1}{N}\sum_{i=1}^{N} x_i \qquad (2)$$

where γ is the proportionality factor, N is the number of edges of the spanning tree, and x_i is the weight of the i-th edge of the tree. As will be described in the experimental section, a value of γ ≥ 2.0 works adequately.

In the ideal case, all the intra-cluster edges are preserved, while the inter-cluster edges are removed, leaving a set of connected trees corresponding to the desired clusters of interest points. However, it commonly happens that some edges are misclassified, producing two types of clustering errors.

The first type of error is due to inter-cluster edges classified as intra-cluster: this misclassification prevents some clusters from being split, so they result aggregated. However, this situation does not represent a problem when the joined clusters refer to groups of people which are at the same distance from the camera. This typically happens when the clusters are horizontally aggregated (see the example in Figure 2.a). In fact, in this case the perspective distortion does not change significantly among the joined clusters and the error introduced can be considered negligible. Conversely, when the erroneously combined clusters refer to groups of people at different distances from the camera (typically when clusters are joined vertically), this causes errors in the estimation of the number of people inside the groups whose distance from the camera is erroneously evaluated (see the box with ID = 2 in Figure 3.b for an example of this problem). It is worth noting that even if the estimation for clusters formed by people at different distances may be inaccurate, it is still an improvement over the use of a global estimate based on all the detected points in the scene. Furthermore, we have experimentally verified that the latter circumstance occurs rather infrequently, hence its impact on the overall performance is limited.

The second type of error is due to intra-cluster edges classified as inter-cluster: this phenomenon causes a cluster to be split into several parts. Similar considerations can be made for this type of error regarding its different incidence and impact on the overall performance depending on the way the splits occur (horizontal or vertical). An example of this type of error is shown in Figure 2.b.

Another important problem faced in this stage is the removal of outliers, i.e. those interest points which are output by the previous stage but are not associated to people. It is quite easy to distinguish between the correct moving points and the outliers on the basis of some considerations about the local point density. In fact, while the points associated to people are concentrated in small areas of the input image (those occupied by the persons), erroneously detected moving points are randomly spread throughout the frame. Consequently, after clustering, the outliers tend to form singletons or very small isolated clusters, which can simply be cut off by adopting a procedure that deletes those clusters with a number of points below a fixed threshold τ.

Overall, the computations carried out in this second stage of the system can be summarized in the following steps:

1. define the graph G(V, E, w), where V is the set of the detected moving SURF points, E is the set of all weighted edges (v_i, v_j) ∈ V × V with i ≠ j, and w(e_ij) denotes the weight of the edge e_ij ∈ E, calculated as the Euclidean distance between v_i and v_j in the image plane;
2. compute the minimum spanning tree G(V, T, w) of G(V, E, w);
3. calculate λ according to Eq. 2;
4. define the forest G(V, F, w), where F = T − {e ∈ T | w(e) > λ};
5. find the set C = {G(V_i, F_i, w) | V = ∪V_i, F = ∪F_i, G(V_i, F_i, w) is a connected component of G(V, F, w)};
6. determine the set C′ = C − {G(V_i, F_i, w) | |F_i| ≤ τ}.
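A minimal sketch of this clustering stage, assuming NumPy and SciPy are available (this is not the authors' code). It follows the six steps above; the size test of step 6 is applied here to the number of points in a cluster, matching the outlier-removal description, with τ = 5 as in the experiments.

```python
# MST-based clustering of the moving interest points (Sec. 2.2), sketched with SciPy.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def cluster_points(points, gamma=2.0, tau=5):
    """points: (M, 2) array of image coordinates of the moving interest points.
    Returns a list of index arrays, one per retained cluster."""
    points = np.asarray(points, dtype=float)
    if len(points) == 0:
        return []
    # Step 1: complete graph weighted with Euclidean distances in the image plane.
    dist = squareform(pdist(points))
    # Step 2: minimum spanning tree of the complete graph.
    mst = minimum_spanning_tree(csr_matrix(dist)).toarray()
    edge_weights = mst[mst > 0]
    if edge_weights.size > 0:
        # Step 3: threshold proportional to the average MST edge length, Eq. (2).
        lam = gamma * edge_weights.mean()
        # Step 4: remove the edges longer than the threshold, leaving a forest.
        mst[mst > lam] = 0.0
    # Step 5: the connected components of the forest are the clusters.
    n_comp, labels = connected_components(csr_matrix(mst), directed=False)
    clusters = [np.flatnonzero(labels == c) for c in range(n_comp)]
    # Step 6: discard very small clusters, which are likely outliers.
    return [c for c in clusters if len(c) >= tau]
```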

Figure 2. Clusters of points detected by the second stage of the system. Each cluster is enclosed by a bounding box. The images also contain examples of clustering errors. (a) In cluster 1 (green interest points) two groups of people have been erroneously aggregated. (b) A group of people is erroneously split into two clusters (yellow and cyan points, clusters 3 and 5).

2.3. Feature extraction and regression

In this stage of the algorithm, a feature vector is computed from each cluster detected in the previous step and is fed into a regressor. The output of the regressor is the estimated number of persons in the group represented by the cluster.

The basic idea of the method in [2] is that the average number of interest points associated to each person is a global property of the scene. Thus, once the scene has been defined, it is possible to assume a simple direct proportionality relation between the number of points and the number of persons.

As noted by the same authors in [2], this model, albeit extremely simple, performs well in scenes where people are more or less at the same distance from the camera and there are only limited overlaps between persons. When these assumptions are verified, deviations from the model are either due to the fact that some interest points are missed (e.g. because a part of the body is very similar to the underlying background) or to the limited reliability of motion vector estimation, which may cause some static points to be considered as moving. However, as can be deduced from the good performance shown by Albiol's method on the PETS2009 dataset, those deviations from the model often compensate each other, and so the method gives a reasonable count, at least on average. Unfortunately, this model does not take into account the effects of perspective: the farther the person is from the camera, the fewer interest points are detected (see Figure 3 for an example of this problem). Hence, the number of points associated to a person and the distance of the person from the camera are somehow related, and the relation is nonlinear.

Figure 3. Effect of perspective distortion on the number of detected interest points: note how the same woman in (a) is far from the camera (cluster 0, red dots) and only 9 interest points are detected, while in (b) she is closer to the camera (cluster 1, green dots) and 30 interest points are associated to her.

Moreover, the assumption of a proportionality relation between the number of persons and the number of points holds only when people are well separated from each other. On the other hand, when people are close to each other some parts of their bodies are occluded and, consequently, some interest points are not detected. Therefore, there is a relation (whose exact form is not easy to find analytically) also between the average number of points per person and the people density. Unfortunately, we do not know the people density of each cluster. However, we can reasonably assume that when the density of people increases, the detected points get closer to each other. So we can consider the density of the points as related to the people density, and we can indirectly take people density into account by establishing a relation between the average number of points per person and the point density.

In conclusion, the relation between the number of interest points and the number of people appears more complex than a direct proportionality, as we also have to take into account the distance of the people from the camera and the point density. We can formulate this relation as:

$$n_{\text{people}} = f(n_{\text{points}}, \rho, d) \qquad (3)$$

where:

∙ n_people is the estimated number of people;

∙ n_points is the number of interest points within the cluster;

∙ ρ is the average density of the points in the cluster: the value is obtained as the ratio between the number of points in the cluster and the area of the bounding box. Note that the area of the bounding box is computed with respect to real-world coordinates. This allows us to normalize the average density of the points to the value it would have if the cluster were moved to a predefined distance from the camera;

∙ d is the distance of the cluster from the camera: assuming that the bottom points of the bounding box lie on the ground plane, the calculation is done by applying an Inverse Perspective Mapping, and is referred to the center of the bottom edge of the cluster's bounding box.
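The sketch below shows one plausible way to compute this per-cluster feature vector in Python; it is not taken from the paper. The 3×3 image-to-ground-plane homography H_IPM used for the Inverse Perspective Mapping, the camera ground-plane position, and the approximation of the real-world bounding-box area from its projected corners are all assumptions introduced here for illustration.

```python
# Per-cluster feature vector (n_points, rho, d) of Eq. (3) -- a sketch under
# assumptions; the paper does not specify how the IPM is calibrated.
import numpy as np

def to_ground_plane(pt, H_IPM):
    """Map an image point to ground-plane coordinates with the assumed homography."""
    x, y, w = H_IPM @ np.array([pt[0], pt[1], 1.0])
    return np.array([x / w, y / w])

def cluster_features(points, H_IPM, camera_xy=(0.0, 0.0)):
    """points: (M, 2) image coordinates of one cluster of moving SURF points.
    Returns the feature vector fed to the epsilon-SVR regressor."""
    points = np.asarray(points, dtype=float)
    n_points = len(points)
    x_min, y_min = points.min(axis=0)
    x_max, y_max = points.max(axis=0)
    # Approximate the bounding-box area in real-world coordinates by projecting
    # its corners onto the ground plane, so that the density is distance-normalized.
    bl = to_ground_plane((x_min, y_max), H_IPM)   # bottom-left corner
    br = to_ground_plane((x_max, y_max), H_IPM)   # bottom-right corner
    tl = to_ground_plane((x_min, y_min), H_IPM)   # top-left corner
    area = max(np.linalg.norm(br - bl) * np.linalg.norm(tl - bl), 1e-6)
    rho = n_points / area
    # Distance of the cluster from the (assumed) camera ground position, referred
    # to the center of the bottom edge of the bounding box, which is taken to lie
    # on the ground plane.
    bottom_center = to_ground_plane(((x_min + x_max) / 2.0, y_max), H_IPM)
    d = float(np.linalg.norm(bottom_center - np.asarray(camera_xy, dtype=float)))
    return np.array([n_points, rho, d])
```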

Since we do not know the analytical form of f, we have chosen to learn this function from a set of labeled examples by using an ε-SVR regressor. Once trained, the ε-SVR acts as a function estimator; for each detected cluster it receives as its input the above features and outputs the estimated number of people within the cluster. The total number of persons in the frame (or in a predetermined region of interest) is then obtained by summing the number of people calculated for each cluster.

Finally, in order to smooth out the oscillations in the number of counted persons among consecutive frames, we employ a low-pass filter. Specifically, the final count of the persons within the scene is calculated as the average value of the people count over the last k frames of the video.
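A brief sketch of the regression and smoothing steps using scikit-learn's SVR with an ε-insensitive loss (not the authors' code); the kernel, C, ε and the window length k are illustrative choices, as the paper does not report the hyperparameters it used.

```python
# epsilon-SVR regression per cluster, frame-level summation and temporal smoothing.
from collections import deque
import numpy as np
from sklearn.svm import SVR

# Rows of X_train are per-cluster feature vectors (n_points, rho, d); y_train holds
# the true number of persons in each cluster (manually labeled training samples).
regressor = SVR(kernel="rbf", C=10.0, epsilon=0.5)   # assumed hyperparameters

def train(X_train, y_train):
    regressor.fit(np.asarray(X_train, dtype=float), np.asarray(y_train, dtype=float))

def count_in_frame(cluster_feature_vectors):
    """Estimate the frame-level count by summing the per-cluster estimates."""
    if len(cluster_feature_vectors) == 0:
        return 0.0
    per_cluster = regressor.predict(np.asarray(cluster_feature_vectors, dtype=float))
    return float(per_cluster.sum())

class SmoothedCounter:
    """Low-pass filter: average of the raw counts over the last k frames."""
    def __init__(self, k=10):                         # k is an assumed value
        self.window = deque(maxlen=k)

    def update(self, raw_count):
        self.window.append(raw_count)
        return sum(self.window) / len(self.window)
```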

3. Experimental Results

The performance of the proposed method has been assessed using section S1 of the PETS2009 dataset [1]. For our experiments, we used four videos of view 1 of Dataset S1, namely S1.L1.13-57, S1.L1.13-59, S1.L2.14-06 and S1.L3.14-17. For all the sequences we calculated the number of people in the three regions of interest (ROI) whose coordinates are given in Table 1.

Region   Top-left (x, y)   Bottom-right (x, y)
R0       (10, 10)          (750, 550)
R1       (290, 160)        (710, 430)
R2       (30, 130)         (230, 290)

Table 1. Coordinates of the top left and bottom right corners (in pixels) of the regions of interest considered for the person count.

In order to use the proposed system for people counting, we first had to train the ε-SVR regressor. The minimum size of the training set needed to achieve an acceptable performance, as the statistical learning theory by Vapnik and Chervonenkis has demonstrated, depends on both the complexity of the problem and the complexity of the estimator to be trained. The training set was built by manually collecting some samples of people groups from a subset of the test frames. For each selected box we calculated the feature vector and the associated ground truth, i.e. the true number of persons inside the box. Samples were selected in order to guarantee that all the possible combinations in terms of number of persons in the group, point density and distance from the camera were adequately represented in the training set. It is worth pointing out that the required number of training frames does not have to be very large to achieve a good performance level (in our tests we used about 15 training frames from each of the video sequences), also because a single frame usually contains several people clusters at different distances, so it may cover several cases of the function to be learned. On the whole, the training set adopted for our experiments is composed of about 800 clusters. Furthermore, for the tests we set γ = 2.0 and τ = 5.

Testing has been carried out by comparing the actual number of people in the three regions of the video sequences with the number of people calculated by the algorithm. The indices used to report the performance are the Mean Absolute Error (MAE) and the Mean Relative Error (MRE), defined as:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |G(i) - T(i)| \qquad (4)$$

$$\mathrm{MRE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|G(i) - T(i)|}{T(i)} \qquad (5)$$

where N is the number of frames of the test sequence and G(i) and T(i) are the guessed and the true number of persons in the i-th frame, respectively.
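For reference, the two indices can be computed directly from the per-frame counts, as in the Python snippet below; the skipping of frames with T(i) = 0, which would make the relative error undefined, is an assumption not discussed in the paper.

```python
# Mean Absolute Error and Mean Relative Error of Eqs. (4) and (5).
import numpy as np

def mae(guessed, true):
    guessed, true = np.asarray(guessed, float), np.asarray(true, float)
    return float(np.mean(np.abs(guessed - true)))

def mre(guessed, true):
    guessed, true = np.asarray(guessed, float), np.asarray(true, float)
    mask = true > 0   # skip frames with no people to avoid division by zero
    return float(np.mean(np.abs(guessed[mask] - true[mask]) / true[mask]))
```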
The MAE index quantifies the error in the estimation of the number of persons in the region of interest, but it does not relate this error to the number of people: the same absolute error can be considered negligible if the number of persons in the ROI is high, while it becomes significant if the number of persons is of the same order of magnitude as the error. For this reason, we also use the MRE index, which relates the estimation error to the true number of people.

The performance of the proposed method on the four considered sequences of dataset S1 is reported in Table 2. From these results it is evident that the proposed method yields very interesting performance on all the considered sequences, both in terms of absolute and relative estimation errors. This is even more evident if the results are compared with those obtained by the algorithms that participated in the PETS 2009 benchmarking on people counting.

            L1.13-57   L1.13-59   L2.14-06   L3.14-17
R0   MAE      1.14       1.59       5.12       2.20
     MRE      5.7%      13.4%      21.4%      10.1%
R1   MAE      1.46       0.82       1.99       1.62
     MRE     16.9%      15.1%      15.1%      24.7%
R2   MAE      0.73       0.87       3.12       2.88
     MRE     10.4%      15.6%      29.2%      19.4%

Table 2. Performance of the proposed algorithm in terms of the MAE and the MRE (as a percentage) indices for the considered video sequences of dataset S1.

In order to have a deeper insight into the behavior of the algorithm, Figure 4 shows the estimated and the true number of people over time for the considered regions (R0, R1 and R2). From these plots it is possible to notice that in almost all cases the method is able to provide a good estimate of the number of people in the scene. The only exception is represented by the sequence S1.L2.14-06, where the number of persons is underestimated. This is due to the fact that in that video there is a very large and dense crowd that probably is not well represented in the training set.

In the following we report the results of some tests aimed at evaluating the robustness of the proposed method with respect to the main setup parameters. In particular, we consider the dependence of the estimation error on the value of the parameter γ used during the clustering procedure, on the minimum number of points in a cluster τ, and on the size of the set adopted for training the ε-SVR regressor.

In Figure 5 the plots of the average estimation error (MAE) on region R0 for different values of γ are shown. From the graphs it is evident that the value of this parameter can vary over a wide range (γ ≥ 2.0) without significantly affecting the overall performance of the system. Conversely, if too low a value is chosen for γ, the performance degrades significantly. This behavior can be explained by considering that in this case a large number of edges of the MST are deleted, producing many small clusters that correspond only to parts of persons.

In Figure 6 the plots of the average estimation error (MAE) on region R0 for different values of τ are shown. It is interesting to note that also in this case the performance of the proposed method is only loosely dependent on the value of the parameter τ, so that, similarly to the setting of γ, no fine tuning procedure is required to select the best value for it.

The results of the analysis regarding the dependence of the performance of the system on the size of the training set are shown in Figure 7. As can be seen, the number of training samples required to learn the f function is not very large, confirming that the setup procedure for the proposed algorithm is reasonably easy and robust.

4. Conclusions

In this paper, we have presented a method for counting moving people in a video surveillance scene. The proposed method has its roots in the algorithm by Albiol et al., which was among the best techniques that participated in the PETS 2009 benchmarking session on people counting. The experimentation on the PETS 2009 database has confirmed that the proposed method obtains good performance and has shown it to be very robust with respect to the choice of the parameters.
Figure 4. Plots of the estimated and the true number of people over time for the considered regions (R0, R1 and R2).

Figure 5. Plots of the MAE values on the region R0 for the considered test video sequences as a function of γ.

Figure 6. Plots of the MAE values on the region R0 for the considered test video sequences as a function of τ.

Figure 7. Plots of the MAE values on the region R0 for the considered test video sequences as a function of the size of the training set (number of samples).

References

[1] http://www.cvg.rdg.ac.uk/PETS2009/.
[2] A. Albiol, M. J. Silla, A. Albiol, and J. M. Mossi. Video analysis using corner motion statistics. In IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, pages 31–38, 2009.
[3] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool. SURF: Speeded up robust features. Computer Vision and Image Understanding, 110(3):346–359, 2008.
[4] G. J. Brostow and R. Cipolla. Unsupervised Bayesian detection of independent motion in crowds. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 594–601, 2006.
[5] S.-Y. Cho, T. W. S. Chow, and C.-T. Leung. A neural-based crowd estimation by hybrid global learning algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 29(4):535–541, 1999.
[6] P. Foggia, G. Percannella, C. Sansone, and M. Vento. A graph-based algorithm for cluster detection. International Journal of Pattern Recognition and Artificial Intelligence, 22(5):843–860, 2008.
[7] C. Harris and M. Stephens. A combined corner and edge detector. In Proceedings of the 4th Alvey Vision Conference, pages 147–151, 1988.
[8] D. Kong, D. Gray, and H. Tao. A viewpoint invariant approach for crowd counting. In International Conference on Pattern Recognition, pages 1187–1190, 2006.
[9] A. N. Marana, L. da F. Costa, R. A. Lotufo, and S. A. Velastin. Estimating crowd density with Minkowski fractal dimension. In Int. Conf. on Acoustics, Speech and Signal Processing, volume 6, pages 3521–3524, 1999.
[10] H. Rahmalan, M. S. Nixon, and J. N. Carter. On crowd density estimation for surveillance. In The Institution of Engineering and Technology Conference on Crime and Security, 2006.
[11] J. Rittscher, P. Tu, and N. Krahnstoever. Simultaneous estimation of segmentation and shape. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 486–493, 2005.
[12] S. Theodoridis and K. Koutroumbas. Pattern Recognition, Third Edition. Academic Press, February 2006.
[13] T. Zhao, R. Nevatia, and B. Wu. Segmentation and tracking of multiple humans in crowded environments. IEEE Trans. Pattern Anal. Mach. Intell., 30(7):1198–1211, 2008.

