
Pattern Analysis and Applications manuscript No.

(will be inserted by the editor)

Understanding People Motion in Video Sequences Using Voronoi Diagrams
Detecting and Classifying Groups
Julio Cezar Silveira Jacques Jr., Adriana Braun, John Soldera, Soraia Raupp Musse, Cláudio Rosito Jung
Graduate School of Applied Computing
Universidade do Vale do Rio dos Sinos
Av. Unisinos 950, 93022-000 São Leopoldo, RS, Brazil
phone: +55 51 3591-1122 ext. 1626 and fax: +55 51 3590-8162

Received: September 30, 2005

Abstract This work describes a model for understanding people motion in video sequences using Voronoi Diagrams, focusing on group detection and classification. We use the position of each individual as a site for the Voronoi Diagram at each frame, and determine the temporal evolution of some sociological and psychological parameters, such as distance to neighbors and personal spaces. These parameters are used to compute individual characteristics (such as perceived personal space and comfort levels), which are analyzed to detect the formation of groups and to classify them as voluntary or involuntary. Experimental results based on videos obtained from real life as well as from a crowd simulator are analyzed and discussed.

Correspondence to: C. Jung, e-mail: crjung@unisinos.br

1 Introduction

The understanding of people motion in video sequences can be very useful in several applications. For example, motion analysis is useful for video surveillance, pedestrian monitoring in traffic applications, measurement of athletic performance and virtual reality systems, among other potential applications.

Several approaches have been proposed for action/behavior understanding from video sequences with different levels of analysis, ranging from global (crowd events without individuality) to local (individual behaviors and pose detection). For middle-level analysis, people interactions may provide useful cues. In particular, group detection and classification may be useful in a variety of applications. For instance, an approach behavior followed by grouping can represent a theft or kidnapping, and the detection of large involuntary groups may indicate the presence of strangle points in the environment (people form groups due to lack of physical space).

In this work, we explore and quantify some concepts related to personal space, proxemics and the spatial relationship among people, which are mainly described in other research fields such as psychology and sociology. These concepts are then used to detect and classify groups, and are briefly discussed next.

The term proxemics was first proposed by Edward Hall [1] to describe the social use of space (in particular, personal space). Personal space is the area with invisible boundaries surrounding an individual's body. This area acts as a comfort zone during interpersonal communication, and can disappear in specific environments or situations (e.g. elevators, dense crowds). Hall proposes four main distances (see Table 1) observed in American interactions. Each distance has a particular meaning, in terms of the kind of interaction allowed. Hall argues that those meanings depend on the culture, and also shows how distance constrains the types of interaction that are likely to occur.

Robert Sommer [2], an American sociologist, calls this spatial envelope a "portable space", meaning that each organism carries his/her own space wherever he/she goes. Each individual has an invisible "personal space" around himself/herself, and either consciously or unconsciously, he/she tries to maintain that space. The desired physical distance between individuals may differ greatly depending on culture. However, in spite of such cultural differences, people implicitly recognize and respect the personal spaces of others.

This work describes a model in which the personal space and the distance from each individual to his/her neighbors can be easily estimated through Voronoi Di-

agrams (VDs). In the proposed approach, the position of each person is captured using computer vision algorithms at each frame, and it is used as a site to compute a VD. As people move, the positions of the sites are modified in time, generating a temporal evolution of Voronoi polygons (called in this work the Dynamic Voronoi Diagram, or DVD). The geometry of Voronoi polygons is explored to extract and quantify sociological and psychological individual characteristics, which are used to detect the possible kinds of interactions proposed by Hall [1]. Such interactions are then analyzed to detect group formations. We also propose a variation of the personal space: the Perceived Personal Space (PPS), which takes into account the vision field of each individual to characterize group types. The main contribution of this paper is related to the understanding of the motion and behavior of individuals and groups using Voronoi Diagrams.

This paper is organized as follows. The next Section presents some works related to pattern recognition used in crowd motion understanding. Section 3 presents the proposed model, while Section 4 discusses some results of the proposed model applied to video sequences filmed in real public environments as well as to video sequences resulting from a crowd simulator. Finally, Section 5 draws final remarks and some ideas for future work.

2 Related Work

Several authors have presented methods for understanding the motion of people filmed by static cameras. There is a great variety of approaches and applications, ranging from global analysis (crowd behavior) to very local analysis (tracking of individual trajectories or body parts). Within this range of applications, there are methods that rely on people interactions for motion understanding, and some of these techniques are briefly reviewed next (a more comprehensive review on human motion understanding and surveillance can be found in the survey papers [3,4]).

Stauffer and Grimson [5] worked with statistical methods to track people and other objects, which are classified and recognized through prior models. The system learns from such recognitions and accumulates prototype motion histograms from recognized filmed sequences, which allow the system to detect unusual events. A watching inspector may help to classify the level of danger of a scene. In that work, they are more concerned with people recognition and their interaction with objects for surveillance applications, and less with interactions among people.

Buxton and Gong [6] described a method to determine individual people behavior using Bayesian networks. In this approach, the object dynamics are tracked and their behaviors are described as Bayesian networks, which contain information such as time and events. To classify events, the system detects the proximity of agents as not near, nearby, close, very close and touching. Hosie and collaborators [7] proposed a method for group behavior detection based on surveillance data. Their approach relies on pair primitives, which are pre-defined movements that can occur between two targets in the scene over one time sample. Such primitives are used to detect simple events, such as convergence, divergence or stationarity. These two methods are more focused on pairwise primitives, and are not suited for analyzing larger groups.

Oliver et al. [8] used computer vision tracking algorithms and Coupled Hidden Markov Models (CHMMs) to model and recognize human tasks. Du and collaborators [9] proposed a similar approach using Dynamic Bayesian Networks (DBNs) instead of CHMMs. These approaches are able to detect some kinds of pre-defined human interactions (such as follow, or approach + talk + continue together), but are also limited mostly to interactions between only two persons.

Gong and Xiang [10,11] explored Dynamic Probabilistic Networks (DPNs) for modeling the temporal relationships among a set of different object temporal events in the scene, aiming at a coherent and robust scene-level behavior interpretation. Although grouping is embedded in their approach, the main focus is activity recognition. Also, psychological aspects are not considered for grouping purposes.

Liu and Chua [12] presented a method to model and classify multi-agent activities based on observation decomposed hidden Markov models (ODHMMs), also proposed by the authors. In their approach, pre-defined activities with interacting people can be trained, using the relative distances between any two people as feature vectors. Despite the good results achieved for detecting "snatch thefts", the proposed method presents a high computational cost as the number of agents increases. Furthermore, tests were performed using manual tracking of people, making its practical application difficult to evaluate.

Fuentes and Velastin [13,14] proposed event detection algorithms based on trajectories and foreground blobs, designed for closed-circuit television (CCTV) surveillance systems. In their approach, some pre-defined events involving two or more persons can be detected (such as fights, attacks and vandalism), but the concept of grouping was not explicitly used.

In this work, we propose a new method that uses the trajectory of each tracked person to compute a temporal sequence of VDs. Then, Voronoi polygons are explored to extract individual characteristics based on psychological concepts, such as personal space (proxemics) [1] and perceived personal space (a new concept introduced in this paper). Such individual characteristics are used to detect and characterize group formations, which can be further explored for higher-level event detection.

The proposed model is described in the next Section.
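The per-frame computation sketched above can be illustrated with a small stand-in. For brevity, the snippet below uses a brute-force O(N²) nearest-neighbor search instead of a true O(N log N) Voronoi construction; the half-distance it returns is exactly the quantity that the Voronoi polygon edges encode (see Section 3.2):

```python
import math

def half_distances(frame_positions):
    """For one frame, return for each person half the distance to the
    nearest neighbor. In a Voronoi diagram this equals the orthogonal
    distance from the site to the shared polygon edge; here a brute-force
    O(N^2) search stands in for the O(N log N) VD computation."""
    result = []
    for i, (xi, yi) in enumerate(frame_positions):
        nearest = min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(frame_positions) if j != i
        )
        result.append(nearest / 2.0)
    return result

def dynamic_half_distances(trajectory_frames):
    """Temporal sequence of half-distances (one list per frame),
    mirroring the role of the Dynamic Voronoi Diagram (DVD)."""
    return [half_distances(frame) for frame in trajectory_frames]
```

For example, three people at x = 0, 2 and 10 meters yield half-distances of 1, 1 and 4 meters to their nearest neighbors.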

3 The Model for Understanding People Motion

The main goal of the proposed model is to extract information from video sequences and to use such information to recognize patterns of individual and group behavior using VDs. First of all, we justify our decision concerning the usage of VDs.

According to Hall [1], the Personal Space (PS) is the region with invisible boundaries surrounding an individual's body. Mathematically, this concept can be quantified as the set of points that are closer to this person than to any other individual. In fact, this property arises naturally when computing the Voronoi Diagram using the position of each person as a site: any point belonging to the polygon related to a given site is closer to this site than to any other site used in the Voronoi tessellation. Furthermore, according to Sommer [2], the PS is in fact a "portable space", meaning that each individual carries his/her own space wherever he/she goes. As people move, their new positions are used as sites to recompute the VD, and the PSs are updated accordingly. In other words, the time-varying positions of tracked people are used as sites to compute a series of time-varying VDs. As people move, there is one VD for each frame of the video sequence, and the temporal evolution of the VDs is called a Dynamic Voronoi Diagram (DVD). Such a DVD provides the PS for each tracked person across time.

Besides providing the temporal evolution of personal spaces, the Voronoi tessellation can also be used to quantify other psychological concepts. In particular, the following assumptions and definitions are adopted in this work:

• Perceived Personal Space (PPS): this concept is introduced in this paper, and it is defined as the portion of the PS that lies within the vision field of the person;
• People Comfort (PC): we define different "comfort levels", based on the PPS and the sociological interpersonal distances proposed by Hall [1].

The temporal evolution of these parameters (given by the DVD) is then used to detect group formations. We assume that a set of people forms a group if they keep close enough for a certain period of time (the definition of "close enough" is provided by Hall's classification, shown in Table 1, while "a certain period of time" is quantified in terms of a grouping period Tg, described in Section 3.3). When a group is detected, it is further classified as:

• Voluntary group: a group of individuals that keep an intimate distance during a certain period of time, with sufficiently large PPSs (this means that, although individuals perceive enough space in their vision fields, they choose to stay at a short distance from other individuals);
• Involuntary group: a group of individuals that keep an intimate distance during a certain period of time, but have small PPSs (this usually indicates that they form a group because they have no free space left to go).

More details on group formation and classification will be provided in the next Sections. However, we first describe an algorithm for automatic people tracking suited for top-view images in monochromatic video sequences, which is used to obtain the trajectory of each person (and, hence, the DVD). Although color cameras are becoming very popular nowadays, we decided to use grayscale video sequences because there are still scenarios (e.g. surveillance cameras) that employ monochromatic cameras (or cameras with poor color definition). Furthermore, the procedure described next can be applied to the luminance component of color cameras, whereas most tracking algorithms designed for color cameras cannot be applied to grayscale video sequences.

3.1 An Algorithm for Automatic People Tracking in Top-View Cameras

Several vision-based techniques for people tracking in video sequences have been proposed in the past years (a recent survey can be found in [4]), most of them with surveillance purposes. In these applications, an oblique (or almost lateral) view of the scene is required, so that faces of the individuals can be recognized. However, such camera setups often result in occlusions, and the mapping from image pixels to world coordinates may not be accurate (due to camera projection). Since our main goal is to extract trajectories for each individual, we chose a camera setup that provides a normal view with respect to the ground (thus reducing perspective problems and occlusions). In fact, the general plane-to-plane projective mapping is given by:

x = (au + bv + c) / (gu + hv + i),   y = (du + ev + f) / (gu + hv + i),   (1)

where (u, v) are world coordinates, (x, y) are image coordinates, and a, b, c, d, e, f, g, h, i are constants. In top-view camera setups, it is easy to show that Equation (1) reduces to:

x = au,   y = ev,   (2)

and the mapping from image to world coordinates is trivial. Furthermore, Equation (2) indicates that persons have the same dimensions at all positions, indicating that an area thresholding technique may be used for people detection.

It is important to notice that the expected longitudinal projection of a person in oblique-lateral views is explored by several tracking algorithms, such as [15–17,14]. However, such a hypothesis clearly does not apply to top-view cameras, requiring a different strategy for people

tracking. In fact, a person's head is a relatively invariant feature in top-view (or almost top-view) camera setups, indicating that tracking can be performed through template matching.

To improve the accuracy of template matching, we also employ a background subtraction technique to isolate foreground pixels from the static background. Although there are several algorithms for background subtraction reported in the literature [14–24], just a subset deals with shadow suppression in grayscale video sequences, such as [18,21–24]. In this work, we adopted the algorithm described in [24], due to its simplicity, speed and adaptation to illumination changes (including shadows and highlights).

In the first frame in which a new person enters the scene, only a portion of his/her body is expected to be visible. As the person keeps moving, a larger portion of the body becomes visible, until he/she enters completely into the viewing area of the camera. Hence, each new foreground "blob"¹ that appears is analyzed, and its area is computed over time. If this blob is related to a real person, then its area will increase progressively, until it reaches a certain threshold that represents the minimum area allowed for a valid person (such threshold is determined a priori, based on average person sizes in a specific camera setup).

Once a person is fully detected in the scene, he/she must be tracked across time. For that purpose, we use a simple and robust tracking procedure based on feature correlation [25]. To find the approximate position of the center of the head (which is the center of the correlation template T), the Distance Transform (DT) is applied to the negative of each foreground blob (i.e. the blob exterior is considered the foreground object). The global maximum of the DT corresponds to the center of the largest circle that can be inscribed in the blob, and it provides an estimate of the person's head center.

If the blob area exceeds a certain threshold (in this work, we used three times the minimum area value), then such a blob is probably related to two or more connected persons, as in the rightmost blob in Fig. 1(b). In such a case, not only the global maximum of the DT is analyzed, but also the local maxima with largest values. If the distance between the global maximum and a certain local maximum is larger than the diameter of the template (so that templates do not overlap), such local maximum is also considered a head center, as shown in the rightmost blob of Fig. 1(c). Fig. 1(a) illustrates the corresponding frame, along with head centers and correlation templates. It is important to note that the procedure for detecting individual templates belonging to the same blob may fail when people enter the scene forming a compact cluster, which generates a blob without a recognizable geometry. In such cases, individual persons will only be detected when the cluster breaks apart.

The next step is to identify template T in the following frame. Although there are several correlation metrics designed for matching a small template within a larger image, Martin and Crowley [25] indicated that the Sum of Squared Differences (SSD) provides a more stable result than other correlation metrics in generic applications, leading us to choose the SSD as the correlation metric. It should be noticed that more generic object tracking methods (such as the covariance tracking described in [26]) can be used instead of the SSD, but with additional computational cost.

Also, it does not make sense to compute the correlation over the entire image, since a person moves with limited speed. We use a reduced correlation space, taking into consideration the frame rate of the sequence and the maximum speed allowed for each person. For example, if the acquisition rate is 15 frames per second (FPS) and the maximum allowed speed is 5 m/s, then the center of the template cannot be displaced by more than 5 × 1/15 ≈ 0.33 meters in the subsequent frame (for the camera setup shown in Fig. 1, 0.33 meters corresponds to approximately 35 pixels). Since the template center must be moved to a foreground pixel, the correlation space is further reduced by removing all background pixels. The SSD between the template T and the correlation space is then computed, and the center of the template is moved to the point related to the global minimum of the SSD. Such a correlation procedure is repeated for all subsequent frames, until the person disappears from the camera view.

Although the head is a good choice for the correlation template, head tilts and illumination changes may vary the graylevels within the template. Also, the procedure for selecting the initial template may not detect exactly the center of the head. To cope with such situations, T is updated every N frames (we used N = 5 for sequences acquired at 15 FPS).

As a result of our tracking procedure, we can determine the trajectory and velocity of each person captured by the camera. We use the position of each person to obtain the Voronoi sites for each frame, and the associated velocity to compute the individual's orientation (which is used to obtain the PPS). The usage of VDs for obtaining individual and group characteristics is explained next.

3.2 Extracting Individual Information

We argue that group events can only be precisely extracted if individual information is consistently treated. We defined the personal space (PS) of an individual as the area of the corresponding Voronoi polygon. In fact, such a choice matches the psychological principle of personal space, since all the points in the interior of a Voronoi polygon are closer to the site that generated this polygon than to any other site.

¹ A blob is a set of connected foreground pixels.
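The matching step above can be made concrete with a short sketch. The calibration constant below is hypothetical, chosen only so that 0.33 m maps to roughly 35 pixels as in the setup of Fig. 1, and the images are toy nested lists; a real implementation would operate on camera frames:

```python
# Hypothetical top-view calibration for Equation (2): x = a*u, y = e*v.
A_PX_PER_M = 106.0  # pixels per meter (illustrative: 0.33 m ~ 35 px)

def max_displacement_px(max_speed_mps=5.0, fps=15.0):
    """Radius of the reduced correlation space: the farthest a person can
    move between consecutive frames, converted to pixels."""
    return A_PX_PER_M * max_speed_mps / fps

def ssd(template, image, cx, cy):
    """Sum of Squared Differences between the template and the image
    patch centered at (cx, cy); template dimensions are assumed odd."""
    th, tw = len(template), len(template[0])
    total = 0
    for i in range(th):
        for j in range(tw):
            d = template[i][j] - image[cx - th // 2 + i][cy - tw // 2 + j]
            total += d * d
    return total

def best_match(template, image, candidates):
    """Move the template center to the candidate with minimal SSD.
    `candidates` holds the foreground pixels lying within
    max_displacement_px of the previous center."""
    return min(candidates, key=lambda c: ssd(template, image, c[0], c[1]))
```

With the defaults, `max_displacement_px()` gives about 35 pixels, consistent with the figure quoted above.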

We can also use the Voronoi polygon of each person to compute his/her distance to his/her neighbors. In fact, a point located at the edge of a Voronoi polygon is equidistant from two sites. Thus, the orthogonal distance from the site of a VD to its polygon edges represents half of the distance between this site and a neighboring site. With this distance, we can classify the kind of interaction that each two neighbors have according to Hall's distances, shown in Table 1. An example of a VD and the computed distances is illustrated in Fig. 2. In this Figure, values d1, d2 and d3 are half-distances from the person marked with a small square with respect to his/her neighbors. It should be noticed that the VD for a set of N sites can be computed with complexity O(N log N) using a divide and conquer algorithm [27], which is cheaper than performing an exhaustive search to compute pairwise distances.

The DVD provides the temporal evolution of VDs as people move, allowing the computation of the PS and the distance to neighbors for each individual as a function of time. The temporal analysis of such information can be used to determine both individual and group information (as will be discussed in Section 3.3).

An important individual characteristic that can be assessed using the temporal evolution of the PS is the level of comfort of people, according to Sommer [2]. Although it is intuitive to think that a person feels comfortable if the personal space around him/her is sufficiently large, we believe that the PS is not the best metric to estimate the level of comfort. In fact, the psychological concept of PS is static, meaning that it considers a region around the individual, disregarding his/her direction of movement. A person can have lots of personal space behind him/her, but he/she may not feel comfortable if there is an obstacle or another person right in front of his/her desired path. The proposed Perceived Personal Space (PPS) provides a much better metric to estimate the level of comfort, because it takes into account the vision field of the individual and his/her PS.

The vision field is modeled as a circular sector with angle α, symmetric with respect to the velocity vector of the individual. According to Vaughan et al. [28], the human vision field has an aperture angle of approximately 170° in the horizontal plane, but can be reduced due to several factors, such as eye dysfunctions (e.g. myopia) or movement at higher speeds (e.g. driving a car on a highway). Also, the region of attentional focus is considerably smaller, approximately 40° [29]. Consequently, if one is focused on a particular direction, then objects in the peripheral vision field will be out of focus, or blurry (in fact, the perception of objects decreases as they get closer to the boundaries of the vision field). In this work, we used α = 120° as the perception angle, a compromise between the attentional focus (40°) and the full viewing angle (170°). The PPS is then defined as the area of the region formed by the intersection of the vision field and the corresponding Voronoi polygon, shown as darker regions in Fig. 3. It should be noticed that α is a parameter in our algorithm, and can be adjusted to increase or decrease the synthetic vision field, or even be chosen randomly for each individual within a certain angular range.

We believe that a person feels comfortable if there is enough space for him/her in the PPS, within a certain distance range Rc. We use Hall's values (Table 1) to determine different comfort levels depending on Rc. For instance, using Rc = 3.5 meters would provide a comfort level with respect to the public distance (highest comfort level), while using Rc = 0.5 meters would provide a comfort level with respect to the intimate distance (lowest comfort level). Mathematically, we say that an individual is comfortable at frame t if:

PPS(t) ≥ (απ/360°) Rc²,   (3)

where (απ/360°) Rc² is the area of a circular sector with angle α and radius Rc, and PPS(t) is the perceived personal space at frame t. In particular, we use the intimate comfort level (Rc = 0.5) to determine voluntary group formations, as will be explained in Section 3.3.

According to the proposed metric for comfort, individual 7 is more comfortable than individual 2 in Fig. 3. In fact, there is no one close to person 7 in his/her vision field, while person 1 is in front of person 2.

Next, we describe our approach for group formation and characterization based on inter-person distances and comfort levels.

3.3 Extracting Group Information

To detect group formation, we keep track of the distance from each person to his/her neighbors (such distances are provided directly by the VD) across time. If two or more people keep short distances among them for a certain period of time (let us denote this time period Tg, measured in frames, and called the grouping period), we consider that they form a group. In practice, even a very strong group (e.g. a married couple) can be apart during some frames, when avoiding obstacles and/or other people, while still keeping the group link. To cope with this kind of situation, we consider that two individuals form a group if they keep an intimate distance for at least a fraction p of the grouping period Tg, where 0 ≤ p ≤ 1.

Formally, let us consider two individuals Ii(t) and Ij(t) at frame t, and define a binary function:

g(i, j, t) = 1 if d(Ii(t), Ij(t)) ≤ Dintimate, and 0 otherwise,   (4)

where d(Ii(t), Ij(t)) represents the distance between agents Ii and Ij at frame t, and Dintimate = 0.5 meters is the distance for an intimate relationship, as defined in Table 1.

Then, Ii(t) and Ij(t) are considered a group at frame t if:

Σ_{k=t−(Tg−1)}^{t} g(i, j, k) ≥ p·Tg.   (5)

Our experimental results indicated that 5 seconds is enough time for group formation (leading to Tg = 150 frames in video sequences acquired at 30 frames per second), with p = 0.8. However, such parameters can be fine-tuned for specific applications.

We also impose our grouping property to be transitive, meaning that if Ii and Ij are grouped, and Ij and Ik are also grouped, then Ii and Ik must also be grouped. In this case, Ii, Ij and Ik will be in the same group. Such a transitive property is important, because we can detect large groups using only pairwise comparisons.

An example of group formation is shown in Fig. 5. This Figure illustrates four frames of a video sequence, and people belonging to two different groups according to Equation (5) are highlighted. The "trail" behind each tracked person shows his/her positions in the previous frames.

After detecting the formation of a group, we want to characterize it as voluntary or involuntary. We believe that there are two main causes for group formation: people can form a group either because they want to (e.g. friends) or because they are forced to, due to lack of space (e.g. exiting a crowded football stadium). It is reasonable to think that such group characterization is related to the perceived personal spaces of the individuals in the group: friends walking together in a non-crowded environment may have plenty of PPS, but they choose to stay close; on the other hand, if several people approach a door at the same time, their PPSs will be small, and they have no choice but to stay close to other people.

To characterize a group as voluntary or involuntary, we evaluate the PPSs of all individuals of the group (which is equivalent to evaluating the comfort of the individuals). It is expected that, in voluntary groups, most people have large PPSs during the grouping time Tg. However, some individuals may walk a little behind others, and consequently their PPSs would be small, even if they belong to a voluntary group. Thus, to detect voluntary group formations, we check whether at least half of the persons in the group have sufficiently large PPSs.

More specifically, let us consider a group with N persons, formed by individuals I1, I2, ..., IN. Let ci(t) be a binary "comfort" function with respect to the public distance, that returns 1 if Equation (3) is satisfied with Rc = 1.25, and 0 otherwise. Individual Ii(t) is said to be comfortable in the previous Tg frames if:

Σ_{k=t−(Tg−1)}^{t} ci(k) ≥ p·Tg,   (6)

where p is the same parameter used in Equation (5). The group is qualified as voluntary at frame t if at least N/2 elements of the group satisfy Equation (6). Otherwise, the group is characterized as involuntary.

4 Experimental Results

In this Section we show some experimental results of our model, and indicate some higher-level events that could be detected using group formation and classification. We analyzed sequences filmed in real life as well as videos generated by a crowd simulator (in the latter, we do not apply the tracking algorithm, since the positions of all agents are known). The reason to include data generated by a crowd simulator is the flexibility to generate controlled situations with a variety of known parameters and events, such as the generation of groups, their size, people velocity, environment restrictions, locations of strangle points, etc. Furthermore, tracking algorithms may present erroneous results in very crowded video sequences.

Fig. 4 illustrates four frames of a filmed video sequence taken a few frames apart from each other. In this scenario, a group of two persons moves from right to left, and a single person walks in the opposite direction (PPSs are marked as brighter polygons). In particular, the PPSs of the single person and the top person in the group decrease as they approach each other, as does the distance between them, until they cross each other. Such relative distance is classified according to Hall [1] as social in Fig. 4(a), personal in Fig. 4(b) and intimate in Fig. 4(c). After they cross, their relative distance increases, as shown in Fig. 4(d). Since an intimate distance was kept for only a few frames (just before and after crossing), the single person was not included in the group. The two persons walking close to each other from right to left are classified as a voluntary group after Tg frames, since they enter the scene at an intimate distance and keep such distance during the whole sequence.

An example of group formation is shown in Fig. 5. Fig. 5(a) illustrates two "clusters" of people that walk together. After some frames, the cluster with four persons is identified as a group (Fig. 5(b)), and a few frames later the second cluster is also characterized as a group (Fig. 5(c)), until they disappear from the viewing area of the camera. It should be noticed that both groups were detected as voluntary, since most individuals maintained their PPS sufficiently large in at least a fraction p of the previous Tg frames.

An example of dynamic group formation is illustrated in Fig. 6. In this example, a person rapidly approaches and reaches another person. Then, he/she reduces his/her speed, so that the two persons walk side by side, at an intimate distance. After some time (the grouping time), a voluntary group is detected. A higher-level interpretation of this grouping behavior could be simply two friends meeting, or maybe a kidnapping situation. A finite automaton could be easily implemented in the proposed
Understanding People Motion Using Voronoi Diagrams 7
model to detect sequences of events that could be related to suspect behavior, such as "approach" followed by "voluntary group".

The evaluation of group formation and classification is a challenging task. In fact, the concept of grouping is somewhat subjective, and the classification of a group as voluntary or involuntary is difficult without a priori knowledge. To validate our results, a set of 20 short movies was shown to 20 Computer Science undergraduate and graduate students, who had to answer, for each movie, whether there was grouping or not. In case of grouping, they had to choose between voluntary and involuntary. We analyzed their opinions by contrasting them with the results generated by our algorithm, and such comparison is discussed below.

According to our algorithm, most of the short movies (70%) contained from 1 to 5 voluntary groups, a few (15%) contained involuntary groups (from 1 to 2), and the remaining (15%) contained no group at all. From all the groups (voluntary or involuntary) that were detected by our algorithm, 74% were also pointed out as groups by human subjects; 25% of the answers provided by human evaluators indicated a larger number of groups, and only 1% of the answers indicated a smaller number of groups. From the total of short movies where no group was detected by our algorithm, 90% were in agreement with human subjects, and 10% of the answers pointed out at least one group.

The evaluation of group classification as voluntary or involuntary was also performed. From the total of voluntary groups detected by the method, 86% were also classified as voluntary by humans; 3% of the answers indicated more voluntary groups, while 11% indicated fewer voluntary groups. From the total of involuntary groups, 68% were also classified as involuntary by the evaluators; 31% of the answers pointed out a larger number of groups, and the remaining 1% indicated fewer involuntary groups. The worst case concerning the classification of groups occurred in a short movie where a group formed by two persons climbs up a small set of stairs, and a third person suddenly follows the same path with higher velocity. There is no space left for the third person to deviate from the group, so he/she slows down and stays behind the group (during enough time for grouping in our method) until there is free space to accelerate again. In such case, the algorithm detected only one voluntary group, since more than half of the persons in the group keep high PPSs. On the other hand, 70% of the subjects classified this group as involuntary, because they could perceive the temporary situation of grouping behavior, as people group and ungroup during the filmed sequence.

A fast and easy way to generate video sequences with knowledge of the ground truth is to use crowd simulators, in which crowd and group events can be controlled and handled in different virtual environments. At our research laboratory (CROMOS Lab: http://www.inf.unisinos.br/~cromoslab), we have developed specific studies on crowd simulation, and developed algorithms that are able to populate virtual environments [30–32].

In the first experiment, a 90 × 20 meters rectangular virtual environment was created. People arise at one end of the largest dimension, and their goal is to move to the other end of the room along this dimension. The simulation was executed with different numbers of agents (100, 150 and 200, ten times each), with 80% of the agents forming groups (a total of 24 groups in all experiments). When a certain agent is detected as belonging to a group, a positive match is triggered; otherwise, a negative match is triggered. A quantitative evaluation of the proposed algorithm was performed in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). The evolution in time of the average TP, TN, FP and FN for simulations with different numbers of agents is illustrated in Figs. 7(a)-(c), and their averages for all three experiments are shown in Table 2. The number of groups as a function of time is depicted in Fig. 7(d), in which frame 0 represents the first frame when a group was detected. The average number of groups for the experiments with 100, 150 and 200 agents is 23.5, 22.9 and 24.8, respectively. It can be observed that some false positives are detected because some agents that were not supposed to form groups walk close to others for enough time (the grouping period), so a grouping event is detected. If the grouping period Tg is increased, the number of false positives tends to decrease (but actual groups will be detected after a larger period of time).

In another simulated experiment, a virtual environment with a length of 120 meters was created, containing a narrow corridor in the middle with a length of 20 meters. Such environment was populated with 300 agents using our CSHuV Simulator [33], and their goal is to move from left to right. A few of them are part of a family, meaning that they know each other and try to keep together during the whole simulation. Fig. 8 illustrates a snapshot of the simulation, and it can be noticed that individuals belonging to the same family were correctly identified as voluntary groups (black dots) in non-crowded regions. In this simulation, it is also possible to recognize large involuntary groups in the corridor (dots in dark gray), because the PPSs of the agents are drastically reduced due to environment constraints in that region. Individuals that were not grouped are marked as light gray dots. The sudden formation of large involuntary groups may indicate the presence of spatial constraints or traffic jams that compromise the flow of people, and could be included in systems for higher-level event detection.

For the same simulation, we have also analyzed the comfort level of the agents. More specifically, we highlighted individuals that do not have a personal comfort level
8 Julio Cezar Silveira Jacques Jr. et al.
(i.e., agents for which Equation (3) is not satisfied for Rc = 1.25 meters). Such individuals are marked as darker dots in Fig. 9. It can be observed that few agents present a low level of comfort in regions A and C, as opposed to the situation in region B. As expected, the level of comfort was drastically reduced in the corridor, since the agents get crowded. Some videos illustrating the proposed method applied to filmed sequences and crowd simulators can be accessed at http://www.inf.unisinos.br/~cromoslab/voronoi_analysis.html.

5 Conclusions

Visual evaluation of the results in filmed video sequences is in agreement with what was expected, meaning that persons who appear to know each other and walk together were identified as voluntary groups. In our controlled experiments with crowd simulators, most generated families were correctly labeled as groups. Furthermore, large involuntary groups were detected in strangle points, where unknown people stay very close to each other for enough time.

As future work, we intend to further explore group detection and characterization for the automatic detection of higher-level events, and also to obtain physical parameters from the sensed environment (such as obstacles and strangle points). We also plan to study a variety of filmed video sequences in order to adapt Hall's distances to other cultures.

6 Originality and Contribution

This paper presented a model to recognize individual and group information in video sequences. We described a model using Dynamic Voronoi Diagrams (DVDs) in which interpersonal distances and personal spaces (PSs) of people are detected and analyzed across time. In addition, we proposed a new parameter, the perceived personal space (PPS), which we consider more adequate for measuring comfort levels. A model for group formation was proposed using the sociological interpersonal distances defined by Hall [1], and groups were classified as voluntary or involuntary based on the PPSs of the individuals in the group during a certain time interval.

7 Acknowledgments

This work was developed in collaboration with HP Brazil R&D. The authors would like to thank the anonymous reviewers for their fruitful contributions to improve this work.

References

1. E. T. Hall. The Silent Language. Doubleday Company, Garden City, NY, 1959.
2. R. Sommer. Personal Space: The Behavioural Basis of Design. Prentice Hall, Englewood Cliffs, New Jersey, 1969.
3. M. Valera and S. A. Velastin. Intelligent distributed surveillance systems: a review. IEE Vision, Image and Signal Processing, 152(2):192–204, April 2005.
4. T. B. Moeslund, A. Hilton, and V. Kruger. A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104, 2006.
5. C. Stauffer and W. E. L. Grimson. Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):747–757, August 2000.
6. H. Buxton and S. Gong. Advanced visual surveillance using Bayesian networks. In IEEE International Conference on Computer Vision, Cambridge, Massachusetts, June 1995.
7. R. Hosie, S. Venkatesh, and G. A. W. West. Classifying and detecting group behaviour from visual surveillance data. In IEEE International Conference on Pattern Recognition, volume I, pages 602–604, 1998.
8. N. M. Oliver, B. Rosario, and A. P. Pentland. A Bayesian computer vision system for modeling human interactions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):831–843, August 2000.
9. Y. Du, G. Chen, W. Xu, and Y. Li. Recognizing interaction activities using dynamic Bayesian network. In IEEE International Conference on Pattern Recognition, volume 1, pages 618–621, August 2006.
10. S. Gong and T. Xiang. Recognition of group activities using dynamic probabilistic networks. In IEEE International Conference on Computer Vision, page 742, Washington, DC, USA, 2003. IEEE Computer Society.
11. T. Xiang and S. Gong. Beyond tracking: Modelling activity and understanding behaviour. International Journal of Computer Vision, 67(1):21–51, 2006.
12. X. H. Liu and C. S. Chua. Multi-agent activity recognition using observation decomposed hidden Markov models. Image and Vision Computing, 24(2):166–175, February 2006.
13. L. M. Fuentes and A. Velastin. Tracking-based event detection for CCTV systems. Pattern Analysis and Applications, 7(4):356–364, 2004.
14. L. M. Fuentes and S. A. Velastin. People tracking in surveillance applications. Image and Vision Computing, 24(11):1165–1171, November 2006.
15. I. Haritaoglu, D. Harwood, and L. S. Davis. W4: Real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):809–830, August 2000.
16. A. M. Elgammal, R. Duraiswami, D. Harwood, and L. S. Davis. Background and foreground modeling using nonparametric kernel density estimation for visual surveillance. Proceedings of the IEEE, 90(7):1151–1163, 2002.
17. F. H. Cheng and Y. L. Chen. Real time multiple objects tracking and identification based on discrete wavelet transform. Pattern Recognition, 39(6):1126–1139, June 2006.
18. S.-Y. Chien, S.-Y. Ma, and L.-G. Chen. Efficient moving object segmentation algorithm using background registration technique. IEEE Transactions on Circuits and Systems for Video Technology, 12(7):577–586, 2002.
19. R. Cucchiara, C. Grana, M. Piccardi, and A. Prati. De-
tecting moving objects, ghosts, and shadows in video
streams. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 25(10):1337–1342, October 2003.
20. H. Ning, T. Tan, L. Wang, and W. Hu. People tracking
based on motion model and motion constraints with au-
tomatic initialization. Pattern Recognition, 37(7):1423–
1440, July 2004.
21. D. Xu, X. Li, Z. Liu, and Y. Yuan. Cast shadow detec-
tion in video segmentation. Pattern Recognition Letters,
26(1):5–26, 2005.
22. Y. Wang, T. Tan, K. F. Loe, and J. K. Wu. A prob-
abilistic approach for foreground and shadow segmenta-
tion in monocular image sequences. Pattern Recognition,
38(11):1937–1946, November 2005.
23. Y. L. Tian, M. Lu, and A. Hampapur. Robust and
efficient foreground analysis for real-time video surveil-
lance. In IEEE Computer Vision and Pattern Recogni-
tion, pages I: 1182–1187, 2005.
24. J. C. S. Jacques Jr., C. R. Jung, and S. R. Musse. A
background subtraction model adapted to illumination
changes. In IEEE International Conference on Image
Processing, pages 1817–1820. IEEE Press, 2006.
25. J. Martin and J. L. Crowley. Comparison of correlation
techniques. In Conference on Intelligent Autonomous
Systems, Karlsruhe, Germany, March 1995.
26. F. Porikli, O. Tuzel, and P. Meer. Covariance tracking
using model update based on lie algebra. In IEEE Com-
puter Vision and Pattern Recognition, pages I: 728–735,
2006.
27. F. Aurenhammer. Voronoi diagrams: a survey of a funda-
mental geometric data structure. ACM Computing Sur-
veys, 23(3):345–405, 1991.
28. D. G. Vaughan, T. Asbury, and P. R. Riordan-Eva. Gen-
eral Ophthalmology. Lange Medical Publications, New
York, 1995.
29. E. Fuchs. Text-book of ophthalmology. D. Appleton, 1898.
30. A. Braun, B. E. J. Bodmann, L. P. L. Oliveira, and S. R.
Musse. Modelling individual behavior in crowd simula-
tion. In Proceedings of Computer Animation and Social
Agents 2003, pages 143–148, New Brunswick, USA, 2003.
IEEE Computer Society.
31. L. M. Barros, A. T. da Silva, and S. R. Musse. Petrosim:
An architecture to manage virtual crowds in panic situa-
tions. In Proceedings of Computer Animation and Social
Agents 2004, pages 111–120, Geneva, Switzerland, 2004.
32. N. Courty and S. R. Musse. Simulation of large crowds in
emergency situations including gaseous phenomena. In
Proceedings of Computer Graphics International 2005,
pages 206–212, Stony Brook, NY, 2005.
33. A. Braun, B. E. Bodmann, and S. R. Musse. Crowd
simulation in emergency situations. In Short Paper in
ACM Symposium on Computer Animation 2004, Greno-
ble, France, 2004.
Fig. 2 Voronoi Diagram and distances (d1, d2, d3) from one person to his/her neighbors.

Fig. 3 Voronoi polygons and Perceived Personal Spaces (PPSs, shown as translucent regions).
Hall's classification | Approximate distance  | Kind of interaction
Intimate distance     | up to 0.5 meters      | Comforting, threatening
Personal distance     | 0.5 to 1.25 meters    | Conversation between friends
Social distance       | 1.25 to 3.5 meters    | Impersonal business dealings
Public distance       | more than 3.5 meters  | Addressing a crowd

Table 1 Hall's classification for Personal Space.
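The distance ranges of Table 1 map directly onto a simple classifier. The sketch below (in Python, for illustration; the function name `hall_category` and the string labels are our own, not part of the paper) returns Hall's proxemic zone for a measured interpersonal distance in meters:

```python
def hall_category(distance_m: float) -> str:
    """Classify an interpersonal distance (in meters) into one of Hall's
    proxemic zones, using the thresholds of Table 1 (0.5, 1.25, 3.5 m)."""
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    if distance_m <= 0.5:
        return "intimate"
    if distance_m <= 1.25:
        return "personal"
    if distance_m <= 3.5:
        return "social"
    return "public"

# Example: two people walking 1.0 m apart are at 'personal' distance,
# typical of a conversation between friends.
print(hall_category(1.0))
```

In a tracking pipeline, such a function would be applied per frame to the distances between Voronoi neighbors (d1, d2, ... in Fig. 2).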

(a) 100 agents: TP = 77.3,  TN = 7.5,  FP = 13.5, FN = 1.7
(b) 150 agents: TP = 119.3, TN = 13.5, FP = 16.5, FN = 0.7
(c) 200 agents: TP = 158.2, TN = 12.9, FP = 27.1, FN = 0.8

Table 2 Average number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) for the simulated scenarios with (a) 100, (b) 150 and (c) 200 agents.

Fig. 1 (a) Frame of a video sequence, with detected head centers and correlation templates. (b) Result of background subtraction. (c) Distance Transform and its maxima, which are used to detect head centers.
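The head-center detection summarized in the caption of Fig. 1 (maxima of the Distance Transform of the background-subtraction mask) can be sketched as follows. This is an illustrative reconstruction using SciPy, not the authors' implementation; the function name and the synthetic mask are our own:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, maximum_filter

def head_center_candidates(foreground_mask: np.ndarray) -> list:
    """Return the local maxima of the Euclidean Distance Transform of a
    binary foreground mask; such maxima are candidate person centers."""
    edt = distance_transform_edt(foreground_mask)
    # A pixel is a candidate if it is foreground and no 3x3 neighbor
    # has a larger distance value.
    local_max = (edt == maximum_filter(edt, size=3)) & (edt > 0)
    return [(int(r), int(c)) for r, c in np.argwhere(local_max)]

# Synthetic example: one 5x5 foreground blob; its center is the EDT maximum.
mask = np.zeros((9, 9), dtype=bool)
mask[2:7, 2:7] = True
print(head_center_candidates(mask))  # the blob center, (4, 4)
```

In practice the mask would come from the background-subtraction stage of Fig. 1(b), and plateau ties would need extra handling for elongated blobs.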

Fig. 4 Illustration of the PPS and the evolution of interpersonal distances.

Fig. 5 Groups recognized in the filmed sequence.

Fig. 6 Example of the dynamic formation of a voluntary group.

Fig. 7 Evolution in time of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) for group detection in a simulated environment using (a) 100 agents, (b) 150 agents and (c) 200 agents. (d) Temporal evolution of the number of groups for the three experiments.

Fig. 8 Voluntary and involuntary groups recognized in the "narrow corridor" simulation sequence.

Fig. 9 Individual comfort highlighted in the "narrow corridor" simulation sequence.