
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 13, NO. 11, NOVEMBER 2004

Statistical Modeling of Complex Backgrounds for Foreground Object Detection


Liyuan Li, Member, IEEE, Weimin Huang, Member, IEEE, Irene Yu-Hua Gu, Senior Member, IEEE, and Qi Tian, Senior Member, IEEE

Abstract: This paper addresses the problem of background modeling for foreground object detection in complex environments. A Bayesian framework that incorporates spectral, spatial, and temporal features to characterize the background appearance is proposed. Under this framework, the background is represented by the most significant and frequent features, i.e., the principal features, at each pixel. A Bayes decision rule is derived for background and foreground classification based on the statistics of principal features. Principal feature representation for both the static and dynamic background pixels is investigated. A novel learning method is proposed to adapt to both gradual and sudden once-off background changes. The convergence of the learning process is analyzed and a formula to select a proper learning rate is derived. Under the proposed framework, a novel algorithm for detecting foreground objects from complex environments is then established. It consists of change detection, change classification, foreground segmentation, and background maintenance. Experiments were conducted on image sequences containing targets of interest in a variety of environments, e.g., offices, public buildings, subway stations, campuses, parking lots, airports, and sidewalks. Good results of foreground detection were obtained. Quantitative evaluation and comparison with the existing method show that the proposed method provides much improved results.

Index Terms: Background maintenance, background modeling, background subtraction, Bayes decision theory, complex background, feature extraction, motion analysis, object detection, principal features, video surveillance.

I. INTRODUCTION

IN COMPUTER vision applications, such as video surveillance, human motion analysis, human-machine interaction, and object-based video encoding (e.g., MPEG4), the objects of interest are often the moving foreground objects in an image sequence. One effective way of foreground object extraction is to suppress the background points in the image frames [1]–[6]. To achieve this, an accurate and adaptive background model is often desirable. The background usually contains nonliving objects that remain passive in the scene. The background objects can be stationary objects, such as walls, doors, and room furniture, or nonstationary objects, such as wavering bushes or moving escalators.

Manuscript received June 19, 2003; revised January 29, 2004. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Luca Lucchese. L. Li, W. Huang, and Q. Tian are with the Institute for Infocomm Research (I2R), Singapore 119613 (e-mail: lyli@i2r.a-star.edu.sg; wmhuang@i2r.a-star.edu.sg; tian@i2r.a-star.edu.sg). I. Y.-H. Gu is with the Department of Signals and Systems, Chalmers University of Technology, SE-412 96 Göteborg, Sweden (e-mail: irenegu@s2.chalmers.se). Digital Object Identifier 10.1109/TIP.2004.836169

The appearance of background objects often undergoes various changes over time, e.g., changes in brightness caused by changing weather conditions or the switching on/off of lights. The background image can be described as consisting of static and dynamic pixels. The static pixels belong to the stationary objects, and the dynamic pixels are associated with nonstationary objects. A static background pixel can become a dynamic one as time advances, e.g., when a computer screen is turned on, and a dynamic background pixel can also turn into a static one, such as a pixel in a bush when the wind stops. To describe a general background scene, a background model must be able to
1) represent the appearance of a static background pixel;
2) represent the appearance of a dynamic background pixel;
3) self-evolve to adapt to gradual background changes;
4) self-evolve to adapt to sudden once-off background changes.

For background modeling without specific domain knowledge, the background is usually represented by image features at each pixel. The features extracted from an image sequence can be classified into three types: spectral, spatial, and temporal features. Spectral features could be associated with gray-scale or color information, spatial features with gradients or local structure, and temporal features with interframe changes at the pixel. Many existing methods utilize spectral features (distributions of intensities or colors at each pixel) to model the background [4], [5], [7]–[9]. To be robust to illumination changes, some spatial features have also been exploited [2], [10], [11]. The spectral and spatial features are suitable for describing the appearance of static background pixels. Recently, a few methods have introduced temporal features to describe the dynamic background pixels associated with nonstationary objects [6], [12], [13]. There is, however, a lack of systematic approaches that incorporate all three types of features into a representation of a complex background containing both stationary and nonstationary objects.

The features that characterize stationary and dynamic background objects should be different. If a background model can describe a general background, it should be able to learn the significant features of the background at each pixel and provide the information for foreground and background classification. Motivated by this, a Bayesian framework which incorporates multiple types of features for modeling complex backgrounds is proposed in this paper. The major novelties of the proposed method are as follows.
1) A Bayesian framework is proposed for incorporating spectral, spatial, and temporal features in the background modeling.



2) A new formula of the Bayes decision rule is derived for background and foreground classification.
3) The background is represented using statistics of principal features associated with stationary and nonstationary background objects.
4) A novel method is proposed for learning and updating background features under both gradual and once-off background changes.
5) The convergence of the learning process is analyzed and a formula is derived to select a proper learning rate.
6) A new real-time algorithm is developed for foreground object detection from complex environments.

Further, a wide range of tests is conducted on a variety of environments, including offices, campuses, parks, commercial buildings, hotels, subway stations, airports, and sidewalks.

The remaining part of the paper is organized as follows. After a brief literature review of existing work in Section I-A, Section II describes the statistical modeling of complex backgrounds based on principal features. First, a new formula of the Bayes decision rule for background and foreground classification is derived. Based on this formula, an effective data structure to record the statistics of principal features is established, and principal feature representation for different background objects is addressed. In Section III, the method for learning and updating the statistics of principal features is described; strategies to adapt to both gradual and sudden once-off background changes are proposed, and properties of the learning process are analyzed. In Section IV, an algorithm for foreground object detection based on the statistical background modeling is described. It contains four steps: change detection, change classification, foreground segmentation, and background maintenance. Section V presents the experimental results on various environments; evaluations and comparisons with an existing method are also included. Finally, conclusions are given in Section VI.

A. Related Work

A simple and direct way to describe the background at each pixel is to use the spectral information, i.e., the gray-scale or color of the background pixel. Early studies describe background features using an average of gray-scale or color intensities at each pixel. Infinite impulse response (IIR) or Kalman filters [7], [14], [15] are employed to update slow and gradual changes in the background. These methods are applicable to backgrounds consisting of stationary objects. To tolerate the background variations caused by imaging noise, illumination changes, and the motion of nonstationary objects, statistical models are used to represent the spectral features at each background pixel. The frequently used models include the Gaussian [8], [16]–[22] and the mixture of Gaussians (MoG) [4], [23]–[25]. In these models, one or a few Gaussians are used to represent the color distribution at each background pixel. A mixture of Gaussian distributions can represent various background appearances, e.g., road surfaces under the sun or in the shadows [23]. The parameters (mean, variance, and weight) of each Gaussian are updated using an IIR filter to adapt to gradual background changes. Moreover, by replacing an old Gaussian with a newly learned color distribution, MoG can adapt to

once-off background changes. In [9], a nonparametric model is proposed for background modeling, where a kernel-based function is employed to represent the color distribution of each background pixel. The kernel-based distribution is a generalization of MoG which does not require parameter estimation, but its computational cost is high. A variant model is used in [5], where the distribution of temporal variations in color at each pixel is used to model the spectral feature of the background. MoG performs better in a time-varying environment where the background is not completely stationary, but the method can lead to misclassification of the foreground if the background scene is complex [19], [26]. For example, if the background contains a nonstationary object with significant motion, the colors of pixels in that region may change widely over time, and foreground objects with similar colors (camouflaged foreground objects) could easily be misclassified as background.

Spatial information has recently been exploited to improve the accuracy of background representation. The local statistics of the spectral features [27], [28], local texture features [2], [3], or global structure information [29] are found helpful for accurate foreground extraction. These methods are most suitable for stationary backgrounds. Paragios and Ramesh [10] use a mixture model (Gaussians or Laplacians) to represent the distributions of background differences for static background points. A Markov random field (MRF) model is developed to incorporate the spatio-spectral coherence for robust foreground segmentation. In [11], gradient distributions are introduced into MoG to reduce the misclassification arising when the classification depends purely on color distributions. Spatial information helps to detect camouflaged foreground objects and suppress shadows. The spatial features are, however, not applicable to nonstationary background objects at the pixel level, since the corresponding spatial features vary over time.

A few more attempts to segment foreground objects from nonstationary backgrounds have been made by using temporal features. One way is to estimate the consistency of optical flow over a short duration of time [13], [30]. The dynamic features of nonstationary background objects are represented by the significant variation of accumulated local optical flows. In [12], Li et al. propose a method that employs the statistics of color co-occurrence between two consecutive frames to model the dynamic features associated with a nonstationary background object. Temporal features are suitable for modeling the appearance of nonstationary objects. In Wallflower [6], Toyama et al. use a linear Wiener filter, a self-regression model, to represent the intensity changes at each background pixel. The linear predictor can learn and estimate the intensity variations of a background pixel and works well for periodic changes, but it is difficult for the linear regression model to predict shadows and background changes of varying frequency in natural scenes. A brief summary of the existing methods, based on the types of features used, is listed in Table I.

Further, most existing methods perform the background and foreground classification with one or more heuristic thresholds. For backgrounds with different complexities, the thresholds have to be adjusted empirically. In addition, these methods are often tested only on a few background environments (e.g., laboratories, campuses, etc.).


TABLE I CLASSIFICATION OF PREVIOUS METHODS AND THE PROPOSED METHOD

II. STATISTICAL MODELING OF THE BACKGROUND

A. Bayes Classification of Background and Foreground

For arbitrary background and foreground objects or regions, the classification of the background and the foreground can be formulated under Bayes decision theory.

Let $s = (x, y)$ be the position of an image pixel, $I_t$ be the input image at time $t$, and $v_t$ be a $d$-dimensional feature vector extracted from the position $s$ at time $t$ from the image sequence. Then, the posterior probability of the feature vector $v_t$ coming from the background at $s$ can be computed by using the Bayes rule

$$P(b \mid v_t, s) = \frac{P(v_t \mid b, s)\, P(b \mid s)}{P(v_t \mid s)} \qquad (1)$$

where $b$ indicates the background, $P(v_t \mid b, s)$ is the probability of the feature vector $v_t$ being observed as background at $s$, $P(b \mid s)$ is the prior probability of the pixel $s$ belonging to the background, and $P(v_t \mid s)$ is the prior probability of the feature vector $v_t$ being observed at the position $s$. Similarly, the posterior probability that the feature vector comes from a foreground object at $s$ is

$$P(f \mid v_t, s) = \frac{P(v_t \mid f, s)\, P(f \mid s)}{P(v_t \mid s)} \qquad (2)$$

where $f$ denotes the foreground. Using the Bayes decision rule, a pixel is classified as belonging to the background according to its feature vector $v_t$ observed at time $t$ if

$$P(b \mid v_t, s) > P(f \mid v_t, s). \qquad (3)$$

Otherwise, it is classified as belonging to the foreground. Since a feature vector observed at an image pixel comes from either background or foreground objects, it follows that

$$P(v_t \mid s) = P(v_t \mid b, s)\, P(b \mid s) + P(v_t \mid f, s)\, P(f \mid s). \qquad (4)$$

Substituting (1) and (4) into (3), it follows that the Bayes decision rule (3) becomes

$$2\, P(v_t \mid b, s)\, P(b \mid s) > P(v_t \mid s). \qquad (5)$$

Using (5), the pixel with observed feature vector $v_t$ at time $t$ can be classified as a background or a foreground point, provided that the prior and conditional probabilities $P(b \mid s)$, $P(v_t \mid b, s)$, and $P(v_t \mid s)$ are known in advance.

B. Principal Feature Representation of Background

To apply (5) for classification of background and foreground, the probability functions $P(b \mid s)$, $P(v_t \mid b, s)$, and $P(v_t \mid s)$ should be known in advance, or should be properly estimated. For complex backgrounds, the forms of these probability functions are unknown. One way to estimate them is to use the histogram of features. The problem that would be encountered is the high cost for storage and computation. Assuming $v$ is a $d$-dimensional vector and each of its elements is quantized to $K$ values, the histogram would contain $K^d$ cells. For example, assuming the resolution of color has 256 levels for each component, the histogram would contain $256^3$ cells. The method would be unrealistic in terms of computational and memory requirements.

It is reasonable to assume that, if the selected features represent the background effectively, the intraclass spread of background features should be small, which implies that the distribution of background features will be highly concentrated in a small region of the histogram. Further, features from various foreground objects would spread widely in the feature space. Therefore, there would be little overlap between the distributions of background and foreground features. This implies that, with a proper selection and quantization of features, it would be possible to approximately describe the background by using only a small number of feature vectors. A concise data structure to implement such a representation of the background is created as follows.

Let $v_1, \ldots, v_L$ be the quantized feature vectors sorted in descending order with respect to $P(v_i \mid b, s)$ for each pixel $s$. Then, for a proper selection of features, there would be a small integer $N_1$, a high percentage value $M_1$, and a low percentage value $M_2$ (e.g., $M_1 = 90\%$ and $M_2 = 10\%$) such that the background could be well approximated by

$$\sum_{i=1}^{N_1} P(v_i \mid b, s) > M_1 \quad \text{and} \quad \sum_{i=1}^{N_1} P(v_i \mid f, s) < M_2. \qquad (6)$$

The value of $N_1$ and the existence of $M_1$ and $M_2$ depend on the selection and quantization of the feature vectors. The first $N_1$ feature vectors are defined as the principal features of the background at the pixel $s$. To learn and update the prior and conditional probabilities for the principal feature vectors, a table of statistics for the $N_2$ ($N_2 > N_1$) possible principal features is established for each feature type at $s$. The table is denoted as

$$T_v(s) = \{ S^1_v(s), S^2_v(s), \ldots, S^{N_2}_v(s) \} \qquad (7)$$

where $T_v(s)$ is learned based on the observations of the features and records the statistics of the $N_2$ most frequent feature vectors at pixel $s$.
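To make the table of (7) and the decision rule (5) concrete, here is a minimal Python sketch (an illustration, not the authors' implementation; the class name, the tiny default values of N1 and N2, and the matching threshold delta are assumptions):

```python
import numpy as np

class FeatureTable:
    """Per-pixel table T_v(s) of (7)/(8): N2 candidate feature vectors,
    each with statistics p_v ~ P(v_i | s) and p_vb ~ P(v_i, b | s),
    kept sorted by p_vb so the first N1 entries are the principal features."""

    def __init__(self, dim, n1=3, n2=5):
        self.n1, self.n2 = n1, n2          # tiny illustrative sizes
        self.p_v = np.zeros(n2)            # p_v^i
        self.p_vb = np.zeros(n2)           # p_vb^i
        self.vecs = np.zeros((n2, dim))    # quantized feature vectors v_i

    def matches(self, v, delta=0.1):
        """Indices of stored vectors within distance delta of v, using the
        normalized inner-product distance of (10)."""
        num = 2.0 * (self.vecs @ v)
        den = np.sum(self.vecs ** 2, axis=1) + float(v @ v) + 1e-12
        return np.flatnonzero(1.0 - num / den < delta)

    def is_background(self, v, delta=0.1):
        """Bayes rule (5), 2 P(v|b,s)P(b|s) > P(v|s), with both sides
        estimated by summing statistics of matched principal features;
        an unmatched vector gives 0 > 0, i.e. foreground."""
        idx = [i for i in self.matches(v, delta) if i < self.n1]
        return 2.0 * self.p_vb[idx].sum() > self.p_v[idx].sum()
```

Keeping the table sorted by p_vb (the resorting step of Section III) is what guarantees that the first N1 entries consulted here are indeed the principal features.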


Fig. 1. One example of learned principal features for a static background pixel in a busy scene. The left image shows the position of the selected pixel. The two right images are the histograms of the statistics for the most significant colors and gradients, where the height of a bar is the value of $p^i_v$, the light gray part is $p^i_{vb}$, and the top dark gray part is $p^i_v - p^i_{vb}$. The icons below the histograms are the corresponding color and gradient features.

Each element $S^i_v(s)$ contains three components

$$S^i_v(s) = \{ p^i_v,\ p^i_{vb},\ v_i = [a^i_1, \ldots, a^i_{d_v}] \} \qquad (8)$$

where $p^i_v \approx P(v_i \mid s)$, $p^i_{vb} \approx P(v_i, b \mid s)$, and $d_v$ is the dimension of the feature vector $v$. The elements in the table are sorted in descending order with respect to the value $p^i_{vb}$. The first $N_1$ elements from the table $T_v(s)$, together with $P(b \mid s)$, are used in (5) for background and foreground classification.

C. Feature Selection

The next essential issue for principal feature representation is feature selection. The significant features of different background objects are different. To achieve an effective and accurate representation of background pixels with principal features, the employment of proper types of features is important. Three types of features, the spectral, spatial, and temporal features, are used for background modeling.

1) Features for Static Background Pixels: For a pixel belonging to a stationary background object, the stable and most significant features are its color and local structure (gradient). Hence, two tables are used to learn the principal features. They are $T_c(s)$ and $T_e(s)$, with $c_t$ and $e_t$ representing the color and gradient vectors, respectively. Since the gradient is less sensitive to illumination changes, the two types of feature vectors can be integrated under the Bayes framework as follows. Let $v_t = \{c_t, e_t\}$ and assume that $c_t$ and $e_t$ are independent; the Bayes decision rule (5) becomes

$$2\, P(c_t \mid b, s)\, P(e_t \mid b, s)\, P(b \mid s) > P(c_t \mid s)\, P(e_t \mid s). \qquad (9)$$

For the features from static background pixels, the quantization measure should be less sensitive to illumination changes. Here, a normalized distance measure based on the inner product of two vectors is employed for both the color and gradient vectors. The distance measure is

$$d(v_t, v_i) = 1 - \frac{2\, v_t^{\mathsf T} v_i}{\|v_t\|^2 + \|v_i\|^2} \qquad (10)$$

where $v$ can be $c$ or $e$, respectively. If $d(v_t, v_i)$ is less than a small value $\delta$, $v_t$ and $v_i$ are matched to each other. The robustness of the distance measure (10) to illumination changes and imaging noise is shown in [2]. The color vector is obtained directly from the input images with 256 resolution levels for each component, while the gradient vector is obtained by applying the Sobel operator to the corresponding gray-scale input images, again with 256 resolution levels. With proper values of $N_1$ and $N_2$, this is found accurate enough to learn the principal features for static background pixels. An example of principal feature representation for a static background pixel is shown in Fig. 1, where the histograms of the most significant color and gradient features in $T_c(s)$ and $T_e(s)$ are displayed. The histogram of the color features shows that only the first two are the principal colors for the background, and the histogram of the gradients shows that the first six, excluding the fourth, are the principal gradients for the background.

2) Features for Dynamic Background Pixels: For dynamic background pixels associated with nonstationary objects, color co-occurrences are used as their dynamic features. This is because the color co-occurrence between consecutive frames has been found suitable to describe the dynamic features associated with nonstationary background objects, such as moving tree branches or a flickering screen [12]. Given an interframe change from the color at time instant $t-1$ to the color at time $t$ at the pixel $s$, the feature vector of color co-occurrence is defined as $cc_t = [r_{t-1}, g_{t-1}, b_{t-1}, r_t, g_t, b_t]$. Similarly, a table of statistics for color co-occurrences $T_{cc}(s)$ is maintained at each pixel. Let $I_t$ be the input color image; the color co-occurrence vector is generated by quantizing the color components to a low resolution. For example, by quantizing the color resolution to 32 levels for each component and selecting proper $N_1$ and $N_2$, one may obtain a good principal feature representation for dynamic background pixels. An example of the principal feature representation with color co-occurrences for a flickering screen is shown in Fig. 2. Compared with the full quantized color co-occurrence feature space of $32^6$ cells, this implies that, with a very small number of feature vectors, the principal features are capable of modeling the dynamic background pixels.
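To illustrate the feature extraction for static pixels, the short sketch below (illustrative values; the gradient is a plain Sobel response as described above) shows why the measure (10) tolerates mild illumination changes: scaling a vector by a factor k near 1 leaves the distance close to 0.

```python
import numpy as np

def feature_distance(v1, v2):
    """Distance (10): d = 1 - 2<v1, v2> / (|v1|^2 + |v2|^2).
    For v2 = k * v1 it equals (1 - k)^2 / (1 + k^2), small for k near 1."""
    return 1.0 - 2.0 * np.dot(v1, v2) / (np.dot(v1, v1) + np.dot(v2, v2) + 1e-12)

def sobel_gradient(gray, x, y):
    """Gradient feature e = (gx, gy) at pixel (x, y) via the Sobel operator."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    patch = gray[y - 1:y + 2, x - 1:x + 2].astype(float)
    return np.array([np.sum(kx * patch), np.sum(kx.T * patch)])

c = np.array([120.0, 100.0, 80.0])       # a color vector
print(feature_distance(c, 1.1 * c))      # ~0.0045: a 10% brighter pixel still matches
print(sobel_gradient(np.arange(25.0).reshape(5, 5), 2, 2))  # [8., 40.] on a ramp
```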


Fig. 2. One example of learned principal features for dynamic background pixels. The left image shows the position of the selected pixel. The right image is the histogram of the statistics for the most significant color co-occurrences in $T_{cc}(s)$, where the height of a bar is the value of $p^i_{cc}$, the light gray part is $p^i_{ccb}$, and the top dark gray part is $p^i_{cc} - p^i_{ccb}$. The icons below the histogram are the corresponding color co-occurrence features. In the screen, the color changes among white, dark blue, and light blue periodically.

III. LEARNING AND UPDATING THE STATISTICS FOR PRINCIPAL FEATURES

Since the background might undergo both gradual and once-off changes, two strategies to learn and update the statistics for principal features are proposed. The convergence of the learning process is analyzed, and a formula to select a proper learning rate is derived.

A. For Gradual Background Changes

At each time instant, if the pixel is identified as a static point, the features of color and gradient are used for foreground and background classification. Otherwise, the feature of color co-occurrence is used. Let us assume that the feature vector $v_t$ is used to classify the pixel at time $t$ based on the principal features learned previously. Then the statistics of the corresponding feature vectors in the table $T_v(s)$ ($v = c$ and $e$, or $v = cc$) are gradually updated at each time instant by

$$p_{b,t+1} = (1-\alpha)\, p_{b,t} + \alpha L_t, \qquad p^i_{v,t+1} = (1-\alpha)\, p^i_{v,t} + \alpha M^i_t, \qquad p^i_{vb,t+1} = (1-\alpha)\, p^i_{vb,t} + \alpha M^i_t L_t \qquad (11)$$

where the learning rate $\alpha$ is a small positive number and $L_t, M^i_t \in \{0, 1\}$. In (11), $L_t = 1$ means that $s$ is classified as a background point at time $t$ in the final segmentation; otherwise, $L_t = 0$. Similarly, $M^i_t = 1$ means that the $i$th vector $v_i$ of the table $T_v(s)$ matches the input feature vector $v_t$, and $M^i_t = 0$ otherwise.

The above updating operation states the following. If the pixel $s$ is labeled as a background point at time $t$, $p_{b,t+1}$ is slightly increased from $p_{b,t}$ due to $L_t = 1$. Further, the probabilities for the matched feature vector are also increased due to $M^i_t = 1$. However, if $M^i_t = 0$, then the statistics for the unmatched feature vectors are slightly decreased. If there is no match between the feature vector $v_t$ and the vectors in the table $T_v(s)$, the $N_2$th (least significant) vector in the table is replaced by a new feature vector

$$p^{N_2}_{v,t+1} = \alpha, \qquad p^{N_2}_{vb,t+1} = \alpha L_t, \qquad v_{N_2} = v_t. \qquad (12)$$

If the pixel is labeled as a foreground point at time $t$, $p_{b,t+1}$ and $p^i_{vb,t+1}$ are slightly decreased with $L_t = 0$; however, $p^i_v$ for the matched vector in the table is slightly increased. The updated elements in the table are resorted in descending order with respect to $p^i_{vb}$, such that the table may keep the $N_2$ most frequent and significant feature vectors observed at pixel $s$.

B. For Once-Off Background Changes

According to (4), the statistics of the principal features satisfy

$$p^i_{v,t} = p^i_{vb,t} + P(v_i \mid f, s)\, P(f \mid s). \qquad (13)$$

These probabilities are learned gradually with the operations described by (11) and (12) at each pixel $s$. When a once-off background change has happened, the new background appearance soon becomes dominant after the change. With the replacement operation (12), the gradual accumulation operation (11), and the resorting at each time step, the learned new features will gradually be moved to the first few positions in $T_v(s)$. After some time duration, the term on the left-hand side of (13) becomes large ($\approx 1$ when summed over the new principal features) and the first term on the right-hand side of (13) becomes very small, since the new background features are classified as foreground. From (6) and (13), a new background appearance at $s$ can be found if

$$\sum_{i=1}^{N_1} P(v_i \mid f, s)\, P(f \mid s) > M_1 \sum_{i=1}^{N_1} \big[ P(v_i \mid b, s)\, P(b \mid s) + P(v_i \mid f, s)\, P(f \mid s) \big]. \qquad (14)$$

In (14), $b$ denotes the previous background before the once-off change and $f$ denotes the new background appearance after the once-off change. The factor $M_1$ prevents errors caused by a small number of foreground features. Using the notation in (7) and (8), the condition (14) becomes

$$\sum_{i=1}^{N_1} \big( p^i_{v,t} - p^i_{vb,t} \big) > M_1 \sum_{i=1}^{N_1} p^i_{v,t}. \qquad (15)$$

Once the above condition is satisfied, the statistics for the foreground should be tuned to become the new background appearance. According to (4), the once-off learning operation is performed as follows:

$$p^i_{vb,t+1} = p^i_{v,t} - p^i_{vb,t}, \qquad p_{b,t+1} = 1 - p_{b,t} \qquad (16)$$

for $i = 1, \ldots, N_2$.
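One update step of (11) and (12) on the FeatureTable sketched earlier can be written as follows (again an illustration; the small initial statistics assigned on replacement, and keeping the scalar prior P(b | s) outside the table, are assumed choices):

```python
import numpy as np

def update_table(tab, v, is_bg, alpha=0.005, delta=0.1):
    """Gradual update (11) plus replacement (12). `is_bg` is the label
    L_t(s) fed back from the final segmentation; the matched table entry
    plays the role of M_t^i = 1, and all other entries decay."""
    L = 1.0 if is_bg else 0.0
    matched = tab.matches(v, delta)
    if matched.size == 0:
        # (12): replace the least significant (last) entry with v_t
        tab.p_v[-1], tab.p_vb[-1] = alpha, alpha * L
        tab.vecs[-1] = np.asarray(v, dtype=float)
    else:
        M = np.zeros(tab.n2)
        M[matched[0]] = 1.0
        tab.p_v = (1 - alpha) * tab.p_v + alpha * M          # (11)
        tab.p_vb = (1 - alpha) * tab.p_vb + alpha * M * L
    order = np.argsort(-tab.p_vb)    # resort so principal features stay first
    tab.p_v, tab.p_vb, tab.vecs = tab.p_v[order], tab.p_vb[order], tab.vecs[order]
```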


C. Convergence of the Learning Process

If the time-evolving principal feature representation has successfully approximated the background, then $\sum_{i=1}^{N_2} p^i_{vb,t} = 1$ should be satisfied. Hence, it is desirable that $\sum_{i=1}^{N_2} p^i_{vb,t}$ will converge to 1 with the evolution of the learning process. We shall show in the following that the learning operation (11) indeed meets such a condition.

Suppose $\sum_{i=1}^{N_2} p^i_{vb,t} = 1$ at time $t$, and the $k$th vector in the table $T_v(s)$ matches the input feature vector $v_t$, which has been detected as background in the final segmentation at time $t$. Then, according to (11), we have

$$\sum_{i=1}^{N_2} p^i_{vb,t+1} = (1-\alpha) \sum_{i=1}^{N_2} p^i_{vb,t} + \alpha M^k_t L_t = (1-\alpha) + \alpha = 1 \qquad (17)$$

which implies that the sum of the conditional probabilities of the principal features being background will remain equal or close to 1 during the evolution of the learning process.

Let us suppose $\sum_{i=1}^{N_2} p^i_{vb,t} = 1 - \varepsilon_t$ with $\varepsilon_t \neq 0$ at time $t$, due to some reasons such as the disturbance from foreground objects or the operation of once-off learning, and that the $k$th vector from the first $N_1$ vectors of $T_v(s)$ matches the input feature vector $v_t$. Then we have

$$\sum_{i=1}^{N_2} p^i_{vb,t+1} = (1-\alpha)(1 - \varepsilon_t) + \alpha M^k_t L_t. \qquad (18)$$

If the pixel is detected as a background point at time $t$ ($M^k_t L_t = 1$), it leads to

$$\varepsilon_{t+1} = (1-\alpha)\, \varepsilon_t. \qquad (19)$$

If $\varepsilon_t > 0$, then $\varepsilon_{t+1} < \varepsilon_t$. In this case, the sum of the conditional probabilities of the principal features being background increases slightly. On the other hand, if $\varepsilon_t < 0$, there will be $|\varepsilon_{t+1}| < |\varepsilon_t|$, and the sum of the conditional probabilities of the principal features being background decreases slightly. From these two cases, it can be concluded that the sum of the conditional probabilities of the principal features being background converges to 1 as long as the background features are observed frequently.

D. Selection of the Learning Rate

In general, for an IIR filtering-based learning process, there is a tradeoff in the selection of the learning rate $\alpha$. To make the learning process adapt to the gradual background changes smoothly and not be perturbed by noise and foreground objects, a small value should be selected for $\alpha$. On the other hand, if $\alpha$ is too small, the system becomes too slow to respond to once-off background changes. Previous methods select it empirically [4], [5], [8], [14]. Here, a formula is derived to select $\alpha$ according to the required time for the system to respond to once-off background changes.

An ideal once-off background change at time $t_0$ can be assumed to be a step function. Suppose the features before $t_0$ fall into the first $N_1$ vectors in the table $T_v(s)$, and the features after $t_0$ fall into the next $N_1$ elements of $T_v(s)$. Then, the statistics at time $t_0$ can be described as

$$\sum_{i=1}^{N_1} p^i_{vb,t_0} \approx \sum_{i=1}^{N_1} p^i_{v,t_0} \approx 1, \qquad p^i_{v,t_0} \approx p^i_{vb,t_0} \approx 0 \ \text{ for } i = N_1 + 1, \ldots, 2N_1. \qquad (20)$$

Since the new background appearance at pixel $s$ after time $t_0$ is classified as foreground before the once-off updating with (16), $p_{b,t}$, $\sum_{i=1}^{N_1} p^i_{v,t}$, and $\sum_{i=1}^{N_1} p^i_{vb,t}$ decrease exponentially, whereas $\sum_{i=N_1+1}^{2N_1} p^i_{v,t}$ increases exponentially, and the new features will be shifted to the first positions in the updated table with the sorting at each time step. Once the condition of (15) is met at a time $t_0 + T$, the new background state is learned. To make the expression simpler, let us assume that there is no resorting operation. Then the condition (15) becomes

$$\sum_{i=N_1+1}^{2N_1} \big( p^i_{v,t} - p^i_{vb,t} \big) > M_1 \sum_{i=1}^{2N_1} p^i_{v,t}. \qquad (21)$$

From (11) and (20), it follows that at time $t_0 + T$ the following conditions hold:

$$\sum_{i=1}^{N_1} p^i_{v,t_0+T} \approx (1-\alpha)^T \qquad (22)$$

$$\sum_{i=N_1+1}^{2N_1} p^i_{v,t_0+T} \approx 1 - (1-\alpha)^T \qquad (23)$$

$$\sum_{i=N_1+1}^{2N_1} p^i_{vb,t_0+T} \approx 0. \qquad (24)$$

By substituting (22)–(24) into (21) and rearranging terms, one can obtain

$$\alpha \geq 1 - (1 - M_1)^{1/T} \qquad (25)$$

where $T$ is the number of frames required to learn the new background appearance. Equation (25) implies that if one wishes the system to learn the new background state in no later than $T$ frames, one should choose $\alpha$ such that (25) is satisfied. For example, if the system is to respond to a once-off background change in 20 s with the frame rate being 20 fps (i.e., $T = 400$), then $\alpha \geq 0.006$ should be satisfied.
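A quick numeric check of the bound (25) as reconstructed above (assuming M1 = 90%):

```python
def min_learning_rate(T, M1=0.9):
    """Smallest alpha with 1 - (1 - alpha)**T > M1, i.e. (25): the new
    appearance accumulates more than M1 of the probability mass in T frames."""
    return 1.0 - (1.0 - M1) ** (1.0 / T)

# a once-off change answered within 20 s at 20 fps -> T = 400 frames
print(round(min_learning_rate(20 * 20), 4))   # 0.0057, hence alpha >= 0.006
```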


Fig. 3. Block diagram of the proposed method.

IV. FOREGROUND OBJECT DETECTION: THE ALGORITHM

With the Bayesian formulation of background and foreground classification, as well as the background representation with principal features, an algorithm for foreground object detection from complex environments is developed. It consists of four parts: change detection, change classification, foreground object segmentation, and background maintenance. The block diagram of the algorithm is shown in Fig. 3. The white blocks from left to right correspond to the first three steps, and the blocks with gray shades correspond to background maintenance.

In the first step, unchanged background pixels in the current frame are filtered out by using simple background and temporal differencing. The detected changes are separated into static and dynamic points according to interframe changes. In the second step, the detected static and dynamic change points are further classified as background or foreground using the Bayes rule and the statistics of principal features for the background. Static points are classified based on the statistics of principal colors and gradients, whereas dynamic points are classified based on those of principal color co-occurrences. In the third step, foreground objects are segmented by combining the classification results from both static and dynamic points. In the fourth step, the background models are updated. This includes updating the statistics of principal features for the background as well as a reference background image. Brief descriptions of the steps are presented in the following.

A. Change Detection

In this step, simple adaptive image differencing is used to filter out nonchange background pixels. The minor variations of colors caused by imaging noise are filtered out to save the computation for further processing. Let $I_t$ be the input image and $B_t$ be the reference background image maintained at time $t$, each with three color components. The background difference is obtained as follows. First, image differencing and thresholding for each color component are performed, where the threshold is automatically generated using the least median of squares (LMedS) method [31]. The background difference is then obtained by fusing the results from the three color components. Similarly, the temporal (or interframe) difference between two consecutive frames $I_{t-1}$ and $I_t$ is obtained. If neither a background difference nor a temporal difference is detected at a pixel, it is classified as a nonchange background point. In general, more than 50% of the pixels would be filtered out in this step.
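The change-detection step can be sketched as below. This is a simplification: a single fixed threshold stands in for the per-channel LMedS thresholds that the method actually derives.

```python
import numpy as np

def detect_changes(frame, background, prev_frame, thresh=15):
    """Background and temporal differencing per color channel, fused
    across R, G, B; returns boolean maps of nonchange, static, and
    dynamic points (frames are H x W x 3 uint8 arrays)."""
    f = frame.astype(np.int32)
    bd = np.any(np.abs(f - background.astype(np.int32)) > thresh, axis=2)
    td = np.any(np.abs(f - prev_frame.astype(np.int32)) > thresh, axis=2)
    non_change = ~(bd | td)   # typically filters out more than 50% of pixels
    dynamic = td              # interframe change -> dynamic point
    static = bd & ~td         # background change only -> static point
    return non_change, static, dynamic
```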

B. Change Classification

If an interframe change is detected at a pixel $s$, it is classified as a dynamic point; otherwise, it is classified as a static point. A change that occurs at a static point could be caused by illumination changes, once-off background changes, or a temporarily motionless foreground object. A change detected at a dynamic point could be caused by a moving background or foreground object. These points are further classified as background or foreground by using the Bayes decision rule and the statistics of the corresponding principal features. Let $v_t$ be the input feature vector at $s$ and time $t$. The probabilities $P(v_t \mid s)$ and $P(v_t \mid b, s) P(b \mid s)$ are estimated as

$$\hat{P}(v_t \mid s) = \sum_{v_i \in U(v_t)} p^i_{v,t}, \qquad \hat{P}(v_t \mid b, s)\, \hat{P}(b \mid s) = \sum_{v_i \in U(v_t)} p^i_{vb,t} \qquad (26)$$

where $U(v_t)$ is a feature vector set composed of those vectors in $T_v(s)$ which match the input vector $v_t$, i.e.,

$$U(v_t) = \{ v_i : d(v_t, v_i) < \delta \ \text{and} \ i \leq N_1 \}. \qquad (27)$$

If no principal feature vector in the table matches $v_t$, both probabilities are set to 0. Then, the change point is classified as background or foreground as follows.

Classification of a Static Point: For a static point, the probabilities for both the color and gradient features are estimated by (26) with $T_c(s)$ and $T_e(s)$, respectively, where the vector distance measure $d(\cdot, \cdot)$ in (27) is calculated as (10). In this work, the statistics of the two types of principal features ($c$ and $e$) are learned separately. In general cases, the estimates from the two feature types are consistent, and the Bayes decision rule (9) can be applied for background and foreground classification. In some complex cases,


one type of the features from the background might be unstable. One example is the temporary static states of a wavering water surface; for these states, the gradient features are not constant. Another example is video captured with an auto-gain camera: the gain is often self-tuned due to the motion of objects, and the gradient features are then more stable than the color features for static background pixels. To work stably in various conditions, the following method is adopted. Let $P_c$ and $P_e$ be the background probabilities estimated from the color and gradient features, respectively. If the two estimates are coincident (their difference is below a small threshold in our test), both features are used for classification using the Bayes rule (9). Otherwise, only the type of features with a larger prior value is used for classification using the Bayes rule (5).

Classification of a Dynamic Point: For a dynamic point at time $t$, the feature vector of color co-occurrence $cc_t$ is generated. The probabilities for $cc_t$ are calculated as in (26), where the distance between two feature vectors in (27) is computed as the maximum absolute difference of the quantized components

$$d(cc_t, cc_i) = \max_k \big| a_k - a^i_k \big| \qquad (28)$$

and a small value of $\delta$ is chosen. Finally, the Bayes rule (5) is applied for background and foreground classification. As observed from our experiments, only a small percentage of the dynamic background points are wrongly classified as foreground changes [12]. Further, the remainders become isolated points, which can easily be removed by a smoothing operation.

C. Foreground Object Segmentation

A post-processing is applied to segment the remaining change points into foreground regions. This is done by first applying a morphological operation (a pair of open and close operations) to suppress the residual errors. Then the foreground regions are extracted, holes are filled, and small regions are removed. Further, an AND operation is applied to the resulting segments in consecutive frames to remove the false foreground regions detected by temporal differencing [32].

D. Background Maintenance

With the feedback from the above segmentation, the background models are updated. First, the statistics of principal features are updated as described in Section III. For the static points, the tables $T_c(s)$ and $T_e(s)$ are updated; for the dynamic points, the table $T_{cc}(s)$ is updated. Meanwhile, a reference background image is also maintained to keep the background difference accurate. Let $s$ be a background point in the final segmentation result at time $t$. If it is identified as an unchanged background point in the change detection step, the background reference image at $s$ is smoothly updated by

$$B_{t+1}(s) = (1 - \beta)\, B_t(s) + \beta\, I_t(s) \qquad (29)$$

where $\beta$ is a small positive number. If $s$ is classified as background in the change classification step, the background reference image at $s$ is replaced by the new background appearance

$$B_{t+1}(s) = I_t(s). \qquad (30)$$

With (30), the reference background image can follow the dynamic background changes, e.g., the changes of color between tree branches and sky, as well as once-off background changes.
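The reference-image maintenance of (29) and (30) is a masked IIR update; a sketch (the mask names are illustrative):

```python
import numpy as np

def update_reference(B, I, bg_mask, changed_mask, beta=0.005):
    """(29): smooth unchanged background pixels toward the current frame;
    (30): replace pixels re-classified as background outright, so the
    reference follows dynamic and once-off background changes."""
    B = B.astype(np.float32).copy()
    I = I.astype(np.float32)
    smooth = bg_mask & ~changed_mask       # unchanged background points
    replace = bg_mask & changed_mask       # changed, but classified background
    B[smooth] = (1.0 - beta) * B[smooth] + beta * I[smooth]
    B[replace] = I[replace]
    return B
```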
Fig. 4. Summary of the complete algorithm.

E. Memory Requirement and Computational Time

The complete algorithm is summarized in Fig. 4. The major part of the memory usage is to store the tables of statistics ($T_c(s)$, $T_e(s)$, and $T_{cc}(s)$) for each pixel. In our implementation, the memory requirement for each pixel is approximately 1.78 KB. For a video with images sized 160 × 120 pixels, the required memory is approximately 33.4 MB, while for images sized 320 × 240 pixels, 133.5 MB of memory is required. For a standard PC, this is still feasible. With a 1.7-GHz Pentium CPU, real-time processing of image sequences is achievable at a rate of about 15 frames per second (fps) for images sized 160 × 120 pixels and about 3 fps for images sized 320 × 240 pixels.
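The quoted memory figures follow directly from the stated 1.78 KB per-pixel table size, as a quick check confirms:

```python
per_pixel_kb = 1.78                        # tables T_c, T_e, and T_cc per pixel
for w, h in [(160, 120), (320, 240)]:
    print(f"{w}x{h}: {per_pixel_kb * w * h / 1024:.1f} MB")
# 160x120: 33.4 MB
# 320x240: 133.5 MB
```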


Fig. 5. Experimental results on a meeting room environment (MR) with wavering curtains in the wind. The two examples are the results for frames 1816 and 2268.

Fig. 6. Experimental results on a lobby environment (LB) in an ofce building with switching on/off lights. Upper row: a frame before switching off some lights (364). Lower row: the frame 15 s after switching off some lights (648).

V. EXPERIMENTAL RESULTS

The proposed method has been tested on a variety of indoor and outdoor environments, including offices, campuses, parking lots, shopping malls, restaurants, airports, subway stations, sidewalks, and other private and public sites. It has also been tested on image sequences captured under various weather conditions, including sunny, cloudy, and rainy weather, as well as on night and crowded scenes. In all the tests, the proposed method was automatically initialized (bootstrap) from the blinking background (i.e., the statistics in the tables $T_c(s)$, $T_e(s)$, and $T_{cc}(s)$ were initialized to zero). The system gradually learned the most significant features for both stationary and nonstationary background objects. Once the once-off updating is performed, the system is able to separate the foreground from the background well.

MoG [4] is a widely used adaptive background subtraction method, and among the existing methods it performs quite well for both stationary and nonstationary backgrounds [6]. The proposed method has therefore been compared with MoG in the experiments. The same learning rate was used for both the proposed method and MoG in each test.1 Further, for a fair comparison, the post-processing used in the proposed method was applied to the MoG method as well.
1A similar analysis of the learning process and dynamic performance can be made for MoG, as in Sections III-C and III-D.

The visual examples and quantitative evaluations of the experiments are described in the following two subsections, respectively.

A. Examples on Various Environments

Selected results on five typical indoor and outdoor environments are displayed in this section. The typical environments are offices, campuses, shopping malls, subway stations, and sidewalks. In the figures of this subsection, pictures are arranged in rows. In each row, the images from left to right are the input frame, the background reference image maintained by the proposed method at that moment, the manually generated ground truth, the result of the proposed method, and the result of MoG.

1) Office Environments: Office environments include offices, laboratories, meeting rooms, corridors, lobbies, and entrances. An office environment is usually composed of stationary background objects. The difficulties for foreground detection in these scenes can be caused by shadows, changes of illumination conditions, and camouflaged foreground objects (i.e., the color of the foreground object is similar to that of the covered background). In some cases, the background may contain dynamic objects, such as waving curtains, running fans, and flickering screens. Examples from two test sequences are shown in Figs. 5 and 6, respectively.


Fig. 7. Experimental results on a campus environment (CAM) containing wavering tree branches in strong winds. The examples are frames 1019, 1337, and 1393.

The first sequence (MR) was captured by an auto-gain camera in a meeting room. The background curtain was moving in the wind. The first example, in the upper row, came from a scenario containing significant motion of the curtain, as well as background changes caused by automatic gain adjustment. In the next example, the person wore clothes of bright colors similar to the color of the curtain. In both cases, the proposed method separated the background and foreground satisfactorily.

The second sequence (LB) was captured in a lobby of an office building. On this occasion, background changes were mainly caused by switching lights on and off. Two examples from this sequence are shown in Fig. 6. The first example shows a scene before some lights are switched off; a significant shadow of the person can be observed. The result of the proposed method is rather satisfactory, apart from a small included shadow. The second example shows a scene about 220 frames (about 15 s) after some lights have been switched off. In this example, even though the background reference image had not been recovered completely, the proposed method detected the person successfully.

2) Campus Environments: The second type of environment is campuses or parks. Changes in the background are often caused by the motion of tree branches and their shadows on the ground surface, or by changes in the weather. Three examples displayed in Fig. 7 were taken from a sequence (CAM) captured on a campus containing moving tree branches. The great motion of the tree branches was caused by strong winds, which can be observed from the waving yellow flag in the left of the images. The moving tree branches also resulted in changes of the tree shadows. The three example frames contain vehicles of different colors. The results show that the proposed method detected the vehicles quite well in such an environment.

3) Shopping Malls: The third type of typical environment is shopping centers, hotels, museums, airports, and restaurants. In these environments, the lights are distributed on the

ceilings, and some ground surfaces produce specular highlights. In such cases, if multiple persons move in the scene, the shadows on the ground surface vary significantly in the image sequences. In these environments, the shadows can be classified into umbra and penumbra [33]. The umbra corresponds to the background area where the direct light is almost totally blocked by the foreground object, whereas in the penumbra area of the background the lighting is only partially blocked. Three examples from such environments are shown in Fig. 8. They came from a busy shopping center (SC), an airport (AP), and a buffet restaurant (BR) [6]. Significant shadows of moving persons, cast on the ground surfaces from different directions, can be observed. As one can see, the proposed method has obtained satisfactory results in these three environments, apart from small parts of the shadows being detected. The recognized shadows can also be observed in the maintained background reference images. This can be explained as follows: a) the feature distance measure (10), which is robust to illumination changes, has played a major role in suppressing the penumbra areas; b) the learned color co-occurrences of the changes from the normal background appearance to umbra and vice versa could identify many background pixels in the umbra areas. Hence, without special models for the shadows, the proposed method has suppressed much of the various shadows in these environments.

4) Subway Stations: Subway stations are other public sites that often require monitoring. In these situations, the motion of background objects (e.g., trains and escalators) makes the background modeling difficult. Further, the background model is hard to establish if there are frequent human crowds in the scene. Fig. 9 shows two examples from a sequence of a subway station (SS) recorded on tape by a CCTV surveillance system. The scene contains three moving escalators and frequent human flows in the right side of the images. In addition, there are significant background changes caused by variations of the lighting conditions due to the many glass and stainless steel


Fig. 8. Experimental results on shopping mall environments which contain specular ground surfaces. The three examples came from a busy shopping center (SC), an airport (AP), and a buffet restaurant (BR), respectively.

Fig. 9. Experimental results on a subway station environment (SS). The examples are frames 1993 and 2634.

materials inside the building. Another difficulty of this sequence is the noise due to the old video recording device. The busy flow of human crowds can be observed in the first example in the figure. Our test results show that the proposed method performed quite satisfactorily in such difficult scenarios.

5) Sidewalks: Pedestrians are often the targets of interest in many video surveillance systems. In such a case, a surveillance system may monitor the scene from day to night under a range of weather conditions. The tests were performed on such an environment around the clock. The image sequences (SW) were obtained from highly compressed MPEG4 videos through a local wireless network. There were large variations of the background in the images. Five examples and test results are shown in Fig. 10. These correspond to sunny, cloudy, and rainy weather conditions, as well as night and crowded scenes. The interval between the first two frames was less than 10 s. Comparing the results with the ground truths, one can find that the proposed method performed very robustly in this complex environment.

From the comparisons with MoG in the examples shown in Figs. 5–10, one can find that the proposed method has outperformed the MoG method in these selected difficult situations. The parameters used for these tests are listed in Tables II and III. The parameters in Table II were applied to all tests. The learning rates in the first row of Table III were applied to all tests except for three shorter sequences, where the larger rates in the second row of the table were applied. This is because, if the image sequences are short, a slightly faster learning rate should be used to speed up the initial learning. Since the decision rule (5) for the classification of background and foreground does not depend directly on any threshold, the performance of the proposed method is not very sensitive to these parameters.

B. Quantitative Evaluations

To obtain a systematic evaluation, the performance of the proposed method was also evaluated quantitatively on randomly selected samples from ten sequences.


Fig. 10. Experimental results of pedestrian detection in a sidewalk environment (SW) around the clock. From top to bottom are frames from sunny, cloudy, rainy, night, and crowded scenes.

TABLE II PARAMETERS USED FOR ALL TEST EXAMPLES

TABLE III LEARNING RATES USED IN THE TEST EXAMPLES

In the previous work [6], the results were evaluated quantitatively from the comparison with the ground truths in terms of 1) the false negative error, the number of foreground pixels that are missed, and 2) the false positive error, the number of background pixels that are misdetected as foreground. However, it is found that, when these measures are averaged over various environments, they are not accurate enough. In this paper, a new similarity measure is introduced to evaluate the results of foreground segmentation. Let $A$ be a detected region and $B$ be the corresponding ground truth; then the similarity measure between the regions $A$ and $B$ is defined as

$$S(A, B) = \frac{A \cap B}{A \cup B}. \qquad (31)$$

Using this measure, $S(A, B)$ approaches a maximum value of 1.0 if $A$ and $B$ are the same; otherwise, $S(A, B)$ varies between 1 and 0 according to their similarity, approaching 0 with the least similarity. It integrates the false positive and negative errors in one measure. One drawback of the measure (31) is that it is a nonlinear measure. To obtain a visual impression of the quantities of the similarity measures, some matching images and their similarity values are displayed in Fig. 11.

For systematic evaluation and comparison, the similarity measure (31) has been applied to the experimental results of the proposed method and the MoG method. A total of ten image sequences were used, including those in Figs. 5–10, as well as two others [watersurface (WS) and fountain (FT)]. We randomly selected 20 frames from each sequence, leading to a total of 200 sample frames for evaluation. The ground truths of these 200 frames were generated manually by four invited persons. All of the ten test sequences, the results, and the ground truths of the sample frames are available.2

2 http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html
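For boolean foreground masks, (31) is the intersection-over-union; a minimal sketch:

```python
import numpy as np

def similarity(A, B):
    """S(A, B) = |A intersect B| / |A union B| of (31), for boolean masks."""
    union = np.logical_or(A, B).sum()
    if union == 0:
        return 1.0          # two empty masks are identical
    return float(np.logical_and(A, B).sum()) / float(union)
```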


Fig. 11. Some examples of matching images with different similarity measure values. In the images, the bright color indicates the intersection of the detected regions and the ground truths, the dark gray color indicates the false negatives, and the light gray color indicates the false positives.

TABLE IV QUANTITATIVE EVALUATION AND COMPARISON RESULTS: S(A, B) VALUES FROM THE TEST SEQUENCES

The average values of the similarity measure for each individual sequence and over all ten sequences are shown in Table IV. The corresponding values obtained with the MoG method are also included. The ten test sequences were chosen among the difficult sequences: besides the various background changes described in the previous subsection, they contain global background changes as well as persons staying motionless for quite a while. Taking these situations into account, the obtained evaluation values for both methods are quite good. Comparing the results in Table IV with those in Fig. 11, the performance of the proposed method is rather satisfactory. The comparison shows that the proposed method provides improved results over those from the MoG method, especially for image sequences with complex backgrounds.

C. Limitations of the Method

Since the statistics are related to each individual pixel without considering its neighborhood, the method can wrongly absorb a foreground object into the background if the object remains motionless for a long time, e.g., a moving person or car that suddenly stops in the scene and remains still for a long duration. Further improvement could be made, e.g., by combining information from high-level object recognition and tracking in the background updating [34], [35]. Another potential problem is that the method can wrongly learn the features of foreground objects as background if crowded foreground objects (e.g., crowds) are constantly present in the scene. Adjusting the learning rate based on feedback from optical flow could provide a possible solution [36]. A method of controlling the learning processes from multilevel feedbacks is being investigated in order to further improve the results.

VI. CONCLUSION

For detecting foreground objects from complex environments, this paper has proposed a novel statistical method for background modeling. In the proposed method, the background appearance is characterized by the principal features and their statistics.

Foreground objects are detected through foreground and background classification under a Bayesian framework. Our test results have shown that the principal features are effective in representing the spectral, spatial, and temporal characteristics of the background. A learning method to adapt to time-varying background features has been proposed and analyzed. Experiments have been conducted on a variety of environments, including offices, public buildings, subway stations, campuses, parking lots, airports, and sidewalks. The experimental results have shown the effectiveness of the proposed method. Quantitative evaluation and comparison with an existing method have shown that improved performance for foreground object detection in complex backgrounds has been achieved. Some limitations of the method have been discussed, with suggestions for possible improvements.

ACKNOWLEDGMENT

The authors would like to thank R. Luo, J. Shang, X. Huang, and W. Liu for their work in generating the ground truths for the evaluation.

REFERENCES
[1] D. Gavrila, "The visual analysis of human movement: A survey," Comput. Vis. Image Understanding, vol. 73, no. 1, pp. 82–98, 1999.
[2] L. Li and M. Leung, "Integrating intensity and texture differences for robust change detection," IEEE Trans. Image Processing, vol. 11, pp. 105–112, Feb. 2002.
[3] E. Durucan and T. Ebrahimi, "Change detection and background extraction by linear algebra," Proc. IEEE, vol. 89, pp. 1368–1381, Oct. 2001.
[4] C. Stauffer and W. Grimson, "Learning patterns of activity using real-time tracking," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 747–757, Aug. 2000.
[5] I. Haritaoglu, D. Harwood, and L. Davis, "W4: Real-time surveillance of people and their activities," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 809–830, Aug. 2000.
[6] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, "Wallflower: Principles and practice of background maintenance," in Proc. IEEE Int. Conf. Computer Vision, Sept. 1999, pp. 255–261.
[7] K. Karmann and A. Von Brandt, "Moving object recognition using an adaptive background memory," in Time-Varying Image Processing and Moving Object Recognition, vol. 2, 1990, pp. 289–296.
[8] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, "Pfinder: Real-time tracking of the human body," IEEE Trans. Pattern Anal. Machine Intell., vol. 19, pp. 780–785, July 1997.
[9] A. Elgammal, D. Harwood, and L. Davis, "Non-parametric model for background subtraction," in Proc. Eur. Conf. Computer Vision, 2000.


[10] N. Paragios and V. Ramesh, "A MRF-based approach for real-time subway monitoring," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, Dec. 2001, pp. I-1034–I-1040.
[11] O. Javed, K. Shafique, and M. Shah, "A hierarchical approach to robust background subtraction using color and gradient information," in Proc. IEEE Workshop Motion and Video Computing, Dec. 2002, pp. 22–27.
[12] L. Li, W. M. Huang, I. Y. H. Gu, and Q. Tian, "Foreground object detection in changing background based on color co-occurrence statistics," in Proc. IEEE Workshop Applications of Computer Vision, Dec. 2002, pp. 269–274.
[13] L. Wixson, "Detecting salient motion by accumulating directionally-consistent flow," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 774–780, Aug. 2000.
[14] N. J. B. McFarlane and C. P. Schofield, "Segmentation and tracking of piglets in images," Mach. Vis. Applicat., vol. 8, pp. 187–193, 1995.
[15] D. Koller, J. Weber, T. Huang, J. Malik, G. Ogasawara, B. Rao, and S. Russel, "Toward robust automatic traffic scene analysis in real-time," in Proc. Int. Conf. Pattern Recognition, 1994, pp. 126–131.
[16] A. Bobick, J. Davis, S. Intille, F. Baird, L. Cambell, Y. Irinov, C. Pinhanez, and A. Wilson, "Kidsroom: Action recognition in an interactive story environment," Mass. Inst. Technol., Cambridge, Perceptual Computing Tech. Rep. 398, 1996.
[17] J. Rehg, M. Loughlin, and K. Waters, "Vision for a smart kiosk," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 1997, pp. 690–696.
[18] T. Olson and F. Brill, "Moving object detection and event recognition algorithm for smart cameras," in Proc. DARPA Image Understanding Workshop, 1997, pp. 159–175.
[19] T. Boult, "Frame-rate multi-body tracking for surveillance," in Proc. DARPA Image Understanding Workshop, 1998.
[20] T. Darrell, G. Gordon, M. Harville, and J. Woodfill, "Integrated person tracking using stereo, color, and pattern detection," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 1998, pp. 601–608.
[21] A. Shafer, J. Krumm, B. Brumitt, B. Meyers, M. Czerwinski, and D. Robbins, "The new EasyLiving project at Microsoft," in Proc. DARPA/NIST Smart Space Workshop, 1998.
[22] C. Eveland, K. Konolige, and R. C. Bolles, "Background modeling for segmentation of video-rate stereo sequences," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 1998, pp. 266–271.
[23] N. Friedman and S. Russell, "Image segmentation in video sequences: A probabilistic approach," in Proc. 13th Conf. Uncertainty in Artificial Intelligence, 1997.
[24] A. J. Lipton, H. Fujiyoshi, and R. S. Patil, "Moving target classification and tracking from real-time video," in Proc. IEEE Workshop Applications of Computer Vision, Oct. 1998, pp. 8–14.
[25] M. Harville, G. Gordon, and J. Woodfill, "Foreground segmentation using adaptive mixture models in color and depth," in Proc. IEEE Workshop Detection and Recognition of Events in Video, July 2001, pp. 3–11.
[26] X. Gao, T. Boult, F. Coetzee, and V. Ramesh, "Error analysis of background adaption," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2000, pp. 503–510.
[27] K. Skifstad and R. Jain, "Illumination independent change detection from real world image sequences," Comput. Vis., Graph., Image Process., vol. 46, pp. 387–399, 1989.
[28] S. C. Liu, C. W. Fu, and S. Chang, "Statistical change detection with moments under time-varying illumination," IEEE Trans. Image Processing, vol. 7, pp. 1258–1268, Aug. 1998.
[29] N. Oliver, B. Rosario, and A. Pentland, "A Bayesian computer vision system for modeling human interactions," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 831–843, Aug. 2000.
[30] A. Iketani, A. Nagai, Y. Kuno, and Y. Shirai, "Detecting persons on changing background," in Proc. Int. Conf. Pattern Recognition, vol. 1, 1998, pp. 74–76.
[31] P. Rosin, "Thresholding for change detection," in Proc. IEEE Int. Conf. Computer Vision, Jan. 1998, pp. 274–279.
[32] Q. Cai, A. Mitiche, and J. K. Aggarwal, "Tracking human motion in an indoor environment," in Proc. IEEE Int. Conf. Image Processing, Oct. 1995, pp. 215–218.
[33] C. Jiang and M. O. Ward, "Shadow identification," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 1992, pp. 606–612.
[34] L. Li, I. Y. H. Gu, M. K. H. Leung, and Q. Tian, "Knowledge-based fuzzy reasoning for maintenance of moderate-to-fast background changes in video surveillance," in Proc. 4th IASTED Int. Conf. Signal and Image Processing, 2002, pp. 436–440.
[35] M. Harville, "A framework for high-level feedback to adaptive, per-pixel, mixture-of-Gaussian background models," in Proc. Eur. Conf. Computer Vision, 2002, pp. 543–560.
[36] D. Gutchess, M. Trajkovic, E. Cohen-Solal, D. Lyons, and A. K. Jain, "A background model initialization algorithm for video surveillance," in Proc. IEEE Int. Conf. Computer Vision, vol. 1, July 2001, pp. 733–740.

Liyuan Li (M'96) received the B.E. and M.E. degrees from Southeast University, Nanjing, China, in 1985 and 1988, respectively, and the Ph.D. degree from Nanyang Technological University, Singapore, in 2001. From 1988 to 1999, he was on the faculty at Southeast University, where he was an Assistant Lecturer (1988 to 1990), Lecturer (1990 to 1994), and Associate Professor (1995 to 1999). Since 2001, he has been a Research Scientist at the Institute for Infocomm Research, Singapore. His current research interests include video surveillance, object tracking, and event and behavior understanding.

Weimin Huang (M'97) received the B.Eng. degree in automation and the M.Eng. and Ph.D. degrees in computer engineering from Tsinghua University, Beijing, China, in 1989, 1991, and 1996, respectively. He is a Research Scientist at the Institute for Infocomm Research, Singapore. He has worked on handwriting signature verification, biometric authentication, and audio/video event detection. His current research interests include image processing, computer vision, pattern recognition, human-computer interaction, and statistical learning.

Irene Yu-Hua Gu (M'94–SM'03) received the Ph.D. degree in electrical engineering from the Eindhoven University of Technology, Eindhoven, The Netherlands, in 1992. She is an Associate Professor in the Department of Signals and Systems, Chalmers University of Technology, Göteborg, Sweden. She was a Research Fellow at Philips Research Institute IPO, The Netherlands, and Staffordshire University, Staffordshire, U.K., and a Lecturer at the University of Birmingham, Birmingham, U.K., from 1992 to 1996. Since 1996, she has been with the Department of Signals and Systems, Chalmers University of Technology. Her current research interests include image processing, video surveillance and object tracking, video communications, and signal processing applications to electric power systems. Dr. Gu has served as an Associate Editor for the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS since 2000, and she is currently the Chair-Elect of the IEEE Swedish Signal Processing Chapter.

Qi Tian (M'83–SM'90) received the B.S. and M.S. degrees in electrical and computer engineering from Tsinghua University, Beijing, China, in 1967 and 1981, respectively, and the Ph.D. degree in electrical and computer engineering from the University of South Carolina, Columbia, in 1984. He is a Principal Scientist at the Media Division, Institute for Infocomm Research, Singapore. His main research interests are image/video/audio analysis, indexing and retrieval, media content identification and security, computer vision, and pattern recognition. He joined the Institute of System Science, National University of Singapore, in 1992, where he worked on robust character ID recognition and video indexing. He was the Program Director for the Media Engineering Program at Kent Ridge Digital Labs, later the Laboratories for Information Technology, from 2001 to 2002. Dr. Tian has served on the editorial boards of professional journals and as a chair and member of technical committees of the IEEE Pacific-Rim Conference on Multimedia (PCM), the IEEE International Conference on Multimedia and Expo (ICME), etc.
