CHAPTER I
INTRODUCTION
In order to overcome boundary inaccuracy, multiple features, multiple frames, and spatio-temporal entropy methods are used. In addition, this gives robustness to noise and occluding pixels.
The first stage is applied to the first two frames of a video shot to discover moving objects, while the second stage is applied to the remaining frames to track the detected objects through the shot. The first two frames of the video sequence are taken and motion vectors are computed using the Adaptive Rood Pattern Search (ARPS) algorithm. Simultaneously, the components of optical flow are computed for each block in the image. Using the motion vectors, a motion-compensated frame is generated.
Initial segmentation is performed on the first frame of the traffic sequence. Applying the watershed transformation directly to the gradient of the image results in over-segmentation. To avoid this, the morphological gradient is computed on the frame before the watershed transformation is performed. Even so, some regions may still need to be merged after the transformation because of residual over-segmentation.
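As an illustration of this preprocessing step, the sketch below computes a morphological gradient (local dilation minus local erosion over a 3x3 window) in plain NumPy. The function name and the toy frame are ours; a real system would feed the result to a library watershed routine rather than this didactic loop.

```python
import numpy as np

def morphological_gradient(img, size=3):
    """Morphological gradient: local dilation (max) minus local erosion (min).
    Its ridges are cleaner than those of the raw intensity gradient, which
    reduces watershed over-segmentation."""
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    h, w = img.shape
    dil = np.zeros_like(img)
    ero = np.zeros_like(img)
    for i in range(h):
        for j in range(w):
            window = padded[i:i + size, j:j + size]
            dil[i, j] = window.max()
            ero[i, j] = window.min()
    return dil - ero

# A flat region gives zero gradient; a step edge gives a thin, clean ridge.
frame = np.zeros((8, 8), dtype=np.int32)
frame[:, 4:] = 100                      # vertical step edge between columns 3 and 4
grad = morphological_gradient(frame)
```

On this test frame the gradient is zero inside both flat regions and forms a two-pixel ridge along the boundary, exactly the kind of relief that keeps the watershed from flooding spurious minima.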
A Canny binary edge image is used to localize an object in subsequent frames of the video sequence and to detect the true weak edges. Intensity edge pixels are used as feature points because of the key role that edges play in the human visual process and because edges are little affected by variations in luminance. Object models evolve from one frame to the next, capturing the changes in the shape of objects as they move. The algorithm naturally establishes the temporal correspondence of objects throughout the video sequence; its output is a sequence of binary models representing the motion and shape changes of the objects.
The object model is obtained by subtracting the background edges from the edge image and eliminating unlinked pixels. After a binary model for the object of interest has been derived, the motion vectors generated by the ARPS algorithm are used to match the subsequent frames in the sequence. Matching is performed on edge images because it is computationally efficient and fairly insensitive to changes in illumination. The degree of change in the shape of an object from one frame to the next is measured with a simplified Hausdorff distance, defined as a combination of a distance transformation and correlation. The image is distance-transformed and then thresholded by different amounts to form a family of dilated image sets. To locate the object, we find the dilation amount for which the maximum number of points in the object model are matched to the image set.
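The matching step above can be sketched as follows. This is a simplified illustration rather than the exact implementation: the distance transform is computed by brute force, and `match_fraction` counts how many model points fall inside the image edge set dilated by a given radius. All names and the toy edge sets are ours.

```python
import numpy as np

def distance_transform(edges):
    """Brute-force Euclidean distance from every pixel to the nearest edge pixel."""
    pts = np.argwhere(edges)
    h, w = edges.shape
    yy, xx = np.mgrid[0:h, 0:w]
    d = np.full((h, w), np.inf)
    for (py, px) in pts:
        d = np.minimum(d, np.hypot(yy - py, xx - px))
    return d

def match_fraction(model_pts, dist, dy, dx, radius=1.0):
    """Fraction of (translated) model points landing within `radius` of an
    image edge, i.e. inside the image edge set dilated by `radius`."""
    hits = 0
    h, w = dist.shape
    for (py, px) in model_pts:
        y, x = py + dy, px + dx
        if 0 <= y < h and 0 <= x < w and dist[y, x] <= radius:
            hits += 1
    return hits / len(model_pts)

# Toy example: the image edges are the model translated by (dy, dx) = (2, 3).
model_pts = [(1, 1), (1, 2), (2, 1)]
image = np.zeros((12, 12), dtype=bool)
for (py, px) in model_pts:
    image[py + 2, px + 3] = True
dist = distance_transform(image)
scores = {(dy, dx): match_fraction(model_pts, dist, dy, dx, radius=0.5)
          for dy in range(6) for dx in range(6)}
best = max(scores, key=scores.get)    # -> (2, 3): all model points matched
```

Raising `radius` corresponds to thresholding the distance transform by larger amounts, i.e. dilating the edge set further, which trades localization accuracy for tolerance to shape change.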
In this automatic VO segmentation algorithm, edge change detection starts with edge detection, the first and most important stage of the human visual process. Edge information plays a key role in extracting the physical change of the corresponding surface in a real scene. However, taking the simple difference of edges to extract the shape of moving objects suffers from a great deal of noise, even against a stationary background. This is because the random noise created in one frame differs from that created in the next, producing slight shifts in edge locations between successive frames. The difference edge of frames therefore suppresses this noise in the luminance difference by means of the Canny edge detector.
Motion estimation is based on temporal changes in image intensities. The underlying assumption is that the patterns corresponding to objects and background in one frame of a video sequence move within the frame to form the corresponding objects in the subsequent frame. Motion estimation is accomplished using the ARPS algorithm. ARPS exploits the fact that the general motion in a frame is usually coherent: if the macro-blocks around the current macro-block move in a particular direction, there is a high probability that the current macro-block will have a similar motion vector. The algorithm therefore uses the motion vector of the macro-block to its immediate left to predict its own motion vector.
The ARPS algorithm tries to achieve a Peak Signal-to-Noise Ratio (PSNR) similar to that of the Exhaustive Search (ES) algorithm. ES, also known as Full Search, is the most computationally expensive block-matching algorithm: it calculates the cost function at every possible location in the search window. As a result it finds the best possible match and gives the highest PSNR of any block-matching algorithm. Its obvious disadvantage is the number of computations needed to estimate the motion vectors. ARPS aims at the same PSNR as ES with far fewer computations.
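For reference, a minimal Exhaustive Search block matcher illustrates the baseline that ARPS approximates; ARPS itself probes only a rood-shaped subset of these positions sized by the predicted motion vector, which this sketch omits. The function and toy frames are our own.

```python
import numpy as np

def es_block_match(ref_block, frame, top, left, p=3):
    """Exhaustive Search: evaluate every displacement in a (2p+1)^2 window
    around (top, left) and return the motion vector minimizing the Sum of
    Absolute Differences (SAD)."""
    bh, bw = ref_block.shape
    best_sad, best_mv = np.inf, (0, 0)
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + bh > frame.shape[0] or x + bw > frame.shape[1]:
                continue
            cand = frame[y:y + bh, x:x + bw].astype(int)
            sad = np.abs(ref_block.astype(int) - cand).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad

# Toy pair: frame2 is frame1 shifted down by 1 pixel and right by 2 pixels,
# so the true motion vector of any interior block is (1, 2).
rng = np.random.default_rng(0)
frame1 = rng.integers(0, 256, (16, 16), dtype=np.uint8)
frame2 = np.roll(np.roll(frame1, 1, axis=0), 2, axis=1)
block = frame1[4:8, 4:8]
mv, sad = es_block_match(block, frame2, 4, 4, p=3)   # recovers mv == (1, 2)
```

ES evaluates all (2p+1)^2 = 49 candidates here; ARPS would reach the same zero-SAD minimum after checking only the rood points plus a small local refinement, which is where its computational saving comes from.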
CHAPTER II
LITERATURE SURVEY
2.1.1. Introduction
A human can easily determine the subject of interest in a video, even though that
subject is presented in an unknown or cluttered background or even has never been seen
before. With the complex cognitive capabilities exhibited by human brains, this process can
be interpreted as simultaneous extraction of both foreground and background information
from a video.
Many researchers have been working toward closing the gap between human and
computer vision. However, without any prior knowledge on the subject of interest or training
data, it is still very challenging for computer vision algorithms to automatically extract the
foreground object of interest in a video. As a result, if one needs to design an algorithm to
automatically extract the foreground objects from a video, several tasks need to be addressed.
1) The object category and the number of object instances in a video are unknown.
2) Foreground objects may exhibit complex or unexpected motion due to articulated parts or arbitrary poses.
3) Foreground and background regions may have ambiguous appearance due to similar color, low contrast, insufficient lighting, and other conditions.
matching algorithm and update the template using appearance features smoothed by a Kalman filter. Tao presented a dynamic background layer model, modeling each moving object as a foreground layer; together with the foreground ordering, this provides the complete information necessary for reliably tracking objects through occlusion. Alper tracked the complete object, evolving the contour from frame to frame by minimizing energy functions.
An obvious step towards video segmentation, noted by Matthias et al. in Efficient Hierarchical Graph-Based Video Segmentation, is to apply image segmentation techniques to video frames without considering temporal coherence. These methods are inherently scalable and may generate segmentation results in real time. However, the lack of temporal information from neighboring frames may cause jitter across frames. Freedman and Kisilev applied a sampling-based fast mean shift approach to a cluster of 10 frames, treated as a larger set of image features, to generate smoother results without taking temporal information into account.
Spatio-temporal video segmentation techniques can be distinguished by whether the
information from future frames is used in addition to past frames. Causal methods apply
Kalman filtering to aggregate data over time, which only consider past data. Paris et al.
derived the equivalent tool of mean-shift image segmentation for video streams based on the
ubiquitous use of the Gaussian kernel. They achieved real-time performance without
considering future frames in the video.
Another class of spatio-temporal techniques takes advantage of both past and future
data in a video. They treat the video as a 3D space-time volume, and typically use a variant of
the mean shift algorithm for segmentation. Dementhon applied mean shift on a 3D lattice and
used a hierarchical strategy to cluster the space-time video stack for computational efficiency.
Wang et al. used anisotropic kernel mean shift segmentation for video tooning. Wang and
Adelson used motion heuristics to iteratively segment video frames into motion consistent
layers. Tracking-based video segmentation methods generally define segments at frame-level
and use motion, color and spatial cues to force temporal coherence.
Following the same line of work, Brendel and Todorovic used contour cues to allow
splitting and merging of segments to boost the tracking performance. Interactive object
segmentation has recently shown significant progress. These systems produce high quality
segmentations driven by user input. We exhibit a similar interactive framework driven by our
segmentation. Our video segmentation method builds on Felzenszwalb and Huttenlocher's graph-based image segmentation technique. Their algorithm is efficient, being nearly linear in the number of edges in the graph, which makes it suitable for extension to spatio-temporal segmentation. We extend the technique to video, making use of both past and future frames, and improve performance and efficiency using a hierarchical framework.
2.1.2. Methodology
In this paper, we aim at automatically extracting foreground objects in videos captured by freely moving cameras. Instead of assuming that the background motion is dominant and different from that of the foreground, as prior methods did, we relax this assumption and allow foreground objects to be present in freely moving scenes.
We advance both visual and motion saliency information across video frames, and a CRF model is utilized to integrate the associated features for VOE (i.e., visual saliency, shape, foreground/background color models, and spatial/temporal energy terms). From our quantitative and qualitative experiments, we verify that our VOE results exhibit spatial consistency and temporal continuity, and our method is shown to outperform state-of-the-art unsupervised VOE approaches. It is worth noting that our proposed VOE framework is an unsupervised approach, which requires neither prior knowledge (i.e., training data) of the object of interest nor user interaction for annotation.
[Block diagram: input video → visual saliency → cooperating multiple cues → detected object]
each image pixel. Although the color features can be automatically determined from the input
video, these methods still need the user to train object detectors for extracting shape or
motion features.
Recently, researchers proposed to use some preliminary strokes to manually select the
foreground and background regions, and they utilized such information to train local
classifiers to detect the foreground objects. While these works produce promising results, it
might not be practical for users to manually annotate a large amount of video data.
2.1.3. Visual Saliency
The salience (also called saliency) of an item, whether an object, a person, or a pixel, is the state or quality by which it stands out relative to its neighbors. Saliency detection is considered to be a key attention mechanism that facilitates learning and survival by enabling organisms to focus their limited perceptual and cognitive resources on the most pertinent subset of the available sensory data.
Saliency typically arises from contrasts between items and their neighborhood, such
as a red dot surrounded by white dots, a flickering message indicator of an answering
machine, or a loud noise in an otherwise quiet environment. Saliency detection is often
studied in the context of the visual system, but similar mechanisms operate in other sensory
systems. What is salient can be influenced by training: for example, for human subjects
particular letters can become salient by training.
When attention deployment is driven by salient stimuli, it is considered to be bottom-up, memory-free, and reactive. Attention can also be guided by top-down, memory-dependent, or anticipatory mechanisms, such as when looking ahead of moving objects or sideways before crossing streets. Humans and other animals have difficulty paying attention to more than one item simultaneously, so they are faced with the challenge of continuously integrating and prioritizing different bottom-up and top-down influences.
In the domain of psychology, efforts have been made in modeling the mechanism of
human attention, including the learning of prioritizing the different bottom-up and top-down
influences.
In the domain of computer vision, efforts have been made in modeling the mechanism
of human attention, especially the bottom-up attention mechanism. Such a process is also
called visual saliency detection.
Generally speaking, there are two kinds of models that mimic the bottom-up saliency mechanism. One is based on spatial contrast analysis: for example, a center-surround mechanism can be used to define saliency across scales, inspired by the putative neural mechanism. The other is based on frequency-domain analysis: while some methods use the amplitude spectrum to assign saliency to rarely occurring magnitudes, Guo et al. use the phase spectrum instead, and other work introduced a system that uses both the amplitude and the phase information.
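As a sketch of the phase-spectrum idea, the code below (our own simplification, omitting the smoothing and temporal components of the published method) reconstructs an image from its Fourier phase alone: structures that break the image's repetitive content survive the reconstruction and appear salient.

```python
import numpy as np

def phase_saliency(img):
    """Phase-spectrum saliency: keep only the phase of the 2-D Fourier
    transform, invert it, and square the magnitude of the reconstruction."""
    f = np.fft.fft2(img)
    phase_only = np.exp(1j * np.angle(f))   # unit magnitude, original phase
    recon = np.fft.ifft2(phase_only)
    return np.abs(recon) ** 2

# A single bright dot on an empty background is maximally "unpredictable",
# so the saliency map peaks exactly at the dot.
img = np.zeros((32, 32))
img[10, 20] = 1.0
sal = phase_saliency(img)
peak = np.unravel_index(np.argmax(sal), sal.shape)   # -> (10, 20)
```

The appeal of this family of methods is that the entire map costs two FFTs, which is why frequency-domain saliency is popular where real-time performance matters.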
A key limitation of many such approaches is their computational complexity, which prevents real-time performance even on modern computer hardware. Some recent work attempts to overcome this, but at the expense of saliency detection quality under some conditions.
Our attention is attracted to visually salient stimuli. It is important for complex
biological systems to rapidly detect potential prey, predators, or mates in a cluttered visual
world. However, simultaneously identifying any and all interesting targets in one's visual field has prohibitive computational complexity, making it a daunting task even for the most sophisticated biological brains, let alone any existing computer.
Fig. 2.3. Example of visual saliency calculation. (a) Original video frame. (b) and (c) Visual saliency of (a)
One solution, adopted by primates and many other animals, is to restrict complex
object recognition process to a small area or a few objects at any one time. The many objects
or areas in the visual scene can then be processed one after the other. This serialization of
where the pixel pair is the one detected by forward or backward optical flow propagation.
Although motion saliency allows us to capture motion salient regions within and
across video frames, those regions might only correspond to moving parts of the foreground
object within some time interval.
Besides the motion-induced shape information, we also extract both foreground and background color information for improved VOE performance. According to the
observation and the assumption that each moving part of the foreground object forms a
Fig. 2.4. Motion Saliency Calculated for Fig 2.3. (a) Calculation of the Optical Flow.
(b) Motion Saliency Derived from (a)
2.1.5. Conditional Random Field (CRF)
A conditional random field (CRF) is used to estimate the structural information (e.g., class label) of a set of variables from the associated observations. For video foreground object segmentation, the CRF is applied to predict the label of each observed pixel in an image I.
It is a class of statistical modeling methods often applied in pattern recognition and machine learning, where they are used for structured prediction. Whereas an ordinary classifier predicts a label for a single sample without regard to "neighboring" samples, a CRF can take context into account; e.g., the linear-chain CRF popular in natural language processing predicts sequences of labels for sequences of input samples.
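The difference from an ordinary classifier can be illustrated with Viterbi decoding over hypothetical unary (per-observation) and pairwise (transition) scores; all numbers and names below are ours, not from a trained model.

```python
import numpy as np

def viterbi(unary, pairwise):
    """Most-likely label sequence for a linear-chain CRF.
    unary[t, y]    : score of label y at position t (from the observation)
    pairwise[y, y2]: score of transitioning from label y to y2
    Each label choice is influenced by its neighbors via the pairwise term."""
    T, K = unary.shape
    score = unary[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + pairwise + unary[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Two labels (0/1). The observations weakly suggest 0, 1, 0, but the chain
# strongly rewards label agreement, so the middle position is smoothed to 0.
unary = np.array([[2.0, 0.0], [0.9, 1.0], [2.0, 0.0]])
pairwise = np.array([[1.0, -1.0], [-1.0, 1.0]])    # reward agreeing neighbors
labels = viterbi(unary, pairwise)                  # -> [0, 0, 0]
```

A per-sample classifier would output [0, 1, 0] here; the pairwise term is exactly the "context" that overturns the weak middle observation.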
CRFs are a type of discriminative undirected probabilistic graphical model, used to encode known relationships between observations and to construct consistent interpretations. They are often used for labeling or parsing sequential data, such as natural language text or biological sequences, and in computer vision. Specifically, CRFs find applications in shallow parsing, named entity recognition, and gene finding, among other tasks, as an alternative to the related hidden Markov models. In computer vision, CRFs are often used for object recognition and image segmentation.
Conditional random fields (CRFs) are a probabilistic framework for labeling and
segmenting structured data, such as sequences, trees and lattices. The underlying idea is that
of defining a conditional probability distribution over label sequences given a particular
observation sequence, rather than a joint distribution over both label and observation
sequences.
The primary advantage of CRFs over hidden Markov models is their conditional
nature, resulting in the relaxation of the independence assumptions required by HMMs in
order to ensure tractable inference.
Additionally, CRFs avoid the label bias problem, a weakness exhibited by maximum
entropy Markov models (MEMMs) and other conditional Markov models based on directed
graphical models. CRFs outperform both MEMMs and HMMs on a number of real-world
tasks in many fields, including bioinformatics, computational linguistics and speech
recognition.
Imagine you have a sequence of snapshots from a day in Justin Bieber's life, and you want to label each image with the activity it represents (eating, sleeping, driving, etc.). How can you do this?
One way is to ignore the sequential nature of the snapshots and build a per-image classifier. For example, given a month's worth of labeled snapshots, you might learn that dark images taken at 6am tend to be about sleeping, images with lots of bright colors tend to be about dancing, images of cars are about driving, and so on.
By ignoring this sequential aspect, however, you lose a lot of information. For example, what happens if you see a close-up picture of a mouth: is it about singing or eating? If you know that the previous image is a picture of Justin Bieber eating or cooking, then it is more likely this picture is about eating; if, however, the previous image contains Justin Bieber singing or dancing, then this one probably shows him singing as well.
Thus, to increase the accuracy of our labeler, we should incorporate the labels of
nearby photos, and this is precisely what a conditional random field does.
2.1.6. Disadvantages
Shape and colour cues extract foreground objects with missing parts.
or background, and an observation is a frame of the input video. The density calculated in the previous step can be utilized for estimating the priors of objects/backgrounds and the feature likelihoods of the MRF. When calculating priors and likelihoods, the regions extracted from the previous frames are also available.
Image segmentation
Consider a set of random variables A = {A_x}, x ∈ I, defined on a set I of coordinates. Each random variable A_x takes a value a_x from the set L = {0, 1}, corresponding to background (0) and object (1), respectively.
A MAP-based MRF estimation can be formulated as an energy minimization problem, where the energy of a configuration a is the negative log-likelihood of the joint posterior density of the MRF, E(a|D) = −log p(A = a|D), with D the input image.
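As a sketch of this energy minimization (using Iterated Conditional Modes rather than an exact optimizer, and with assumed data costs), the following toy example removes a speckle from a noisy binary labeling:

```python
import numpy as np

def icm(data_cost, beta=1.0, iters=5):
    """Iterated Conditional Modes for a binary MRF on a 4-connected grid.
    Energy = sum_x data_cost[x, a_x] + beta * (# disagreeing neighbor pairs);
    each sweep sets every label to the locally energy-minimizing choice,
    approximating the MAP configuration a* = argmin E(a|D)."""
    h, w, _ = data_cost.shape
    labels = data_cost.argmin(axis=2)          # start from the data-only guess
    for _ in range(iters):
        for i in range(h):
            for j in range(w):
                best_l, best_e = labels[i, j], np.inf
                for l in (0, 1):
                    e = data_cost[i, j, l]
                    for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                        if 0 <= ni < h and 0 <= nj < w:
                            e += beta * (labels[ni, nj] != l)
                    if e < best_e:
                        best_e, best_l = e, l
                labels[i, j] = best_l
    return labels

# Noisy observation of a 3x3 object (label 1) on background (label 0):
# the data cost favors the noisy labels; the smoothness term removes the speckle.
truth = np.zeros((6, 6), dtype=int)
truth[2:5, 2:5] = 1
noisy = truth.copy()
noisy[0, 0] = 1                      # one flipped pixel far from the object
cost = np.zeros((6, 6, 2))
cost[..., 0] = noisy                 # choosing 0 costs 1 where observation says 1
cost[..., 1] = 1 - noisy
labels = icm(cost, beta=1.0)         # recovers the speckle-free labeling
```

The isolated pixel flips back to background because paying its unit data cost is cheaper than disagreeing with both of its neighbors, which is the mechanism by which the smoothness prior regularizes the segmentation.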
Result
Proposed a new method for achieving precise video segmentation without any
supervised interactions.
Disadvantage:
Segmented regions were randomly switched
Detection of visually salient image regions is useful for applications like object segmentation, adaptive compression, and object recognition. Recently, full-resolution saliency maps that retain well-defined boundaries have attracted attention. In these maps, boundaries are preserved by retaining substantially more frequency content from the original image than older techniques do. However, if the salient regions comprise more than half the pixels of the image, or if the background is complex, the background gets highlighted instead of the salient object.
This paper introduces a method that retains the advantages of such saliency maps while overcoming their shortcomings. The method exploits features of color and luminance, is simple to implement, and is computationally efficient.
Methods used
Boykov and Jolly perform interactive segmentation using graph cuts. They require the user to provide scribble-based input to indicate foreground and background regions.
A graph-cuts-based algorithm then segments foreground from background. We use a similar approach; however, instead of the user indicating the background and foreground pixels with scribbles, we use the saliency map to assign these pixels automatically.
Result
This method improves upon six existing state-of-the-art algorithms in precision and
recall with respect to a ground truth database.
Disadvantage
The saliency maps generated by this method suffer from low resolution
The method combines the motion-induced shape and the foreground/background color models into a CRF. Without prior knowledge of the object of interest, this CRF model is designed to address VOE problems in an unsupervised setting.
A CRF is used to estimate the structural information (e.g., class label) of a set of variables. For object segmentation, the CRF predicts the label of each observed pixel in an image I. As shown in Figure 1, pixel i is associated with observation zi, while the hidden node Fi indicates its corresponding label (i.e., foreground or background).
In this CRF framework, the label Fi is inferred from the observation zi, while the spatial coherence with the neighboring observations zj and labels Fj is simultaneously taken into consideration.
CRF construction. To detect the moving parts and their corresponding pixels, we perform
dense optical flow forward and backward propagation at every frame.
Result
Proposed a method which utilizes multiple motion-induced features, such as shape and foreground/background color models, to extract foreground objects in single-concept videos.
We advanced a unified CRF framework to integrate the above feature models.
Disadvantage
Some of the motion cues might be negligible due to low contrast
Methods used
Feature points (detected three-dimensionally in space and scale) have many attractive properties, such as robust invariant descriptor matching over scale, rotation, and affine offset, which could be useful in combination with their use as a primitive saliency detector.
Feature matching has been used to estimate inter-frame homography mismatch as a measure of temporal saliency in video, but not as a measure of spatial saliency. Harding and Robertson compare the co-occurrence of a set of computer vision feature points with predictive maps of visual saliency.
Saliency has been referred to as visual attention, unpredictability, rarity, or surprise.
Saliency estimation methods can broadly be classified as biologically based, purely computational, or those that combine the two. In general, most methods employ a low-level approach, determining the contrast of image regions relative to their surroundings using one or more features of intensity, color, and orientation.
Result
Presented an image segmentation algorithm to segment image areas which are visually salient to observers performing multiple tasks. In contrast to bottom-up saliency alone, and combined with specific task search models, the technique finds image areas relevant to the performance of multiple objective tasks, without the need for prior learning.
The general mode operates on eye-level imagery with parameters chosen from careful experimentation and requires no machine-learning stage. The technique is built upon feature points; the descriptors of these feature points can be compared with database representations of stored objects to narrow the focus of the attention prediction map for object-class search, all in one algorithmic iteration.
Disadvantage
The application to compression is just one possible use of a segmentation algorithm based on visually salient information.
Methods used:
The proposed technique is able to detect attended regions as well as attended actions in video sequences. Different from previous methods, most of which are based on dense optical flow fields, the proposed temporal attention model utilizes interest-point correspondences and the geometric transformations between images.
In this model, feature points are first detected in consecutive video images, and correspondences are established between the interest points using the Scale Invariant Feature Transform (SIFT).
For the spatial attention model, color statistics of the images are used to reveal the color contrast information in the scene. Given the pixel-level saliency map, attended points are detected by finding the pixels with locally maximal saliency values. The region-level attention is constructed upon the attended points: given an attended point, a unit region is created centered at that point.
This region is then iteratively expanded by computing the expansion potentials on the
sides of the region. Rectangular attended regions are finally achieved. The temporal and
spatial attention models are finally combined in a dynamic fashion. Higher weights are
assigned to the temporal model if large motion contrast is present in the sequence.
Result
Presented a spatiotemporal attention detection framework for detecting both attention
regions and interesting actions in video sequences. The saliency maps are computed
separately for the temporal and spatial information of the videos.
Disadvantage
Fails to highlight the entire salient region, or highlights smaller salient regions better than larger ones.
Methods used
Background Modeling
In the context of a traffic surveillance system, Friedman and Russell proposed to
model each background pixel using a mixture of three Gaussians corresponding to road,
vehicle and shadows. This model is initialized using an EM algorithm. Then, the Gaussians
are manually labeled in a heuristic manner as follows: the darkest component is labeled as
shadow; in the remaining two components, the one with the largest variance is labeled as
vehicle and the other one as road.
This labeling remains fixed for the whole process, giving a lack of adaptation to changes over time. First, each pixel is characterized by its intensity in the RGB color space. Then, the probability of observing the current pixel value X_t is given, in the multidimensional case, by

P(X_t) = Σ_{i=1..K} ω_{i,t} η(X_t, μ_{i,t}, Σ_{i,t}),

where K is the number of Gaussians and ω_{i,t}, μ_{i,t}, and Σ_{i,t} are the weight, mean, and covariance matrix of the i-th component at time t.
Foreground Detection
In this case, the ordering and labeling phases are conserved and only the matching test is changed to be statistically more exact. Indeed, Stauffer and Grimson check every new pixel against the K existing distributions, classifying it as background if it matches one of the background Gaussians (i.e., lies within 2.5 standard deviations of its mean) and as foreground otherwise.
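A minimal sketch of this matching test, under assumed component parameters (scalar intensities rather than full RGB covariances, with illustrative values of our own):

```python
import numpy as np

def classify_pixel(x, means, sigmas, weights, bg_ratio=0.7, k=2.5):
    """Stauffer-Grimson style test: a pixel matches a Gaussian if it lies
    within k standard deviations of its mean. Components ranked by
    weight/sigma and covering bg_ratio of the total weight are treated as
    background; matching any other component (or none) means foreground."""
    order = np.argsort(-(weights / sigmas))          # most stable components first
    background, cum = set(), 0.0
    for i in order:
        background.add(int(i))
        cum += weights[i]
        if cum > bg_ratio:
            break
    for i in range(len(means)):
        if abs(x - means[i]) < k * sigmas[i]:        # the matching test
            return "background" if i in background else "foreground"
    return "foreground"                              # matches no component

# Assumed components for a traffic scene: road (dominant), shadow, rare vehicle.
means = np.array([120.0, 40.0, 220.0])
sigmas = np.array([5.0, 8.0, 10.0])
weights = np.array([0.7, 0.2, 0.1])
r_road = classify_pixel(118, means, sigmas, weights)     # near the road mean
r_vehicle = classify_pixel(221, means, sigmas, weights)  # matches the rare mode
```

The pixel near the road mean is accepted as background; the bright pixel matches only the low-weight component, which is not part of the background model, so it is flagged as foreground.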
Result
Allows the reader to survey the strategies and effectively guides the selection of the best improvement for a specific application.
Disadvantage
Leads to misdetection of foreground objects and background.
Methods used
Graph-based Algorithm
Specifically, for image segmentation, a graph is defined with the pixels as nodes,
connected by edges based on 8- neighborhood. Edge weights are derived from the per-pixel
normalized color difference.
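The merge criterion of the Felzenszwalb-Huttenlocher algorithm that this line of work builds on can be sketched as follows; this is a didactic reimplementation on intensity images, not the authors' code.

```python
import numpy as np

class DisjointSet:
    """Union-find that also tracks each component's size and internal
    variation (the largest edge weight merged into it so far)."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.internal = [0.0] * n
    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]
            a = self.parent[a]
        return a
    def union(self, a, b, w):
        a, b = self.find(a), self.find(b)
        self.parent[b] = a
        self.size[a] += self.size[b]
        self.internal[a] = max(self.internal[a], self.internal[b], w)

def segment(img, k=100.0):
    """Pixels are nodes, 4-neighbor edges are weighted by intensity
    difference; edges are processed in increasing weight order, and two
    components merge only if the joining edge is no heavier than either
    component's internal variation plus k/size (the merge criterion)."""
    h, w = img.shape
    idx = lambda i, j: i * w + j
    edges = []
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                edges.append((abs(float(img[i, j]) - img[i + 1, j]), idx(i, j), idx(i + 1, j)))
            if j + 1 < w:
                edges.append((abs(float(img[i, j]) - img[i, j + 1]), idx(i, j), idx(i, j + 1)))
    ds = DisjointSet(h * w)
    for wgt, a, b in sorted(edges):
        ra, rb = ds.find(a), ds.find(b)
        if ra != rb and wgt <= min(ds.internal[ra] + k / ds.size[ra],
                                   ds.internal[rb] + k / ds.size[rb]):
            ds.union(ra, rb, wgt)
    return ds

# Two flat regions separated by a strong step edge survive as two components.
img = np.zeros((4, 4))
img[:, 2:] = 100.0
ds = segment(img, k=100.0)
roots = {ds.find(i) for i in range(16)}
```

Because the merge threshold shrinks as components grow (k/size), large homogeneous regions stop absorbing strong edges, which is what makes the criterion adaptive rather than a fixed gradient threshold.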
An out-of-core variant operates on a subset of the video volume at a time. Performing multiple passes over windows of increasing size, it still generates a segmentation identical to that of the in-memory algorithm. Besides segmenting large videos, this algorithm takes advantage of modern multi-core processors and segments several parts of the same video in parallel.
Result
The segmentation algorithm is applied to a wide range of videos, from classic examples to long dynamic movie shots, and the contribution of each part of the algorithm is studied.
Disadvantage
However, as it defines a graph over the entire video volume, there is a restriction on the size of the video that it can process, especially in the pixel-level over-segmentation stage.
2.9. ENERGY MINIMIZATION VIA GRAPH CUTS
The first move we consider is an α-β swap: for a pair of labels α, β, this move exchanges the labels between an arbitrary set of pixels labeled α and another arbitrary set labeled β. Our first algorithm generates a labeling such that there is no swap move that decreases the energy. The second move we consider is an α-expansion: for a label α, this move assigns an arbitrary set of pixels the label α. Our second algorithm, which requires the smoothness term to be a metric, generates a labeling such that there is no expansion move that decreases the energy. Moreover, this solution is within a known factor of the global minimum.
Methods used
even when large moves are allowed. In this section, we discuss the moves we allow, which
are best described in terms of partitions. We sketch the algorithms and list their basic
properties. We then formally introduce the notion of a graph cut, which is the basis for our
methods.
Graph cuts
The minimum cut problem is to find the cut with smallest cost. There are numerous
algorithms for this problem with low-order polynomial complexity; in practice these methods
run in near-linear time.
Disadvantage
This method can produce only low energy
2.10. GEODESIC MATTING: A FRAMEWORK FOR FAST INTERACTIVE
IMAGE AND VIDEO SEGMENTATION AND MATTING
-Xue Bai, Guillermo Sapiro
The proposed technique is based on the optimal, linear time, computation of weighted
geodesic distances to user-provided scribbles, from which the whole data is automatically
segmented. The weights are based on spatial and/or temporal gradients, considering the
statistics of the pixels scribbled by the user, without explicit optical flow or any advanced and
often computationally expensive feature detectors. These could be naturally added to the
proposed framework as well if desired, in the form of weights in the geodesic distances.
Methods used
Segmentation
The algorithm starts from two types of user-provided scribbles, F for foreground and B for background, roughly placed across the main regions of interest. The problem is then how to learn from them and propagate this prior information/labeling to the entire image, exploiting both the marked pixels' statistics and their positions.
The first step is to collect labeled information from the F/B regions. We use discriminative features to learn from the samples on the scribbles (pixels marked by the user) and to classify all the remaining pixels in the image.
Geodesic Distance:
We use the geodesic distance from these user-provided scribbles to classify the pixels
x in the image (outside of the scribbles), labeling them F or B. The geodesic distance d(x) is
simply the smallest integral of a weight function over all possible paths from the scribbles to
x (in contrast with the average distance as used in random walks or diffusion/Laplace based
frameworks).
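A sketch of this geodesic-distance computation on a discrete 4-connected grid, using Dijkstra's algorithm with assumed per-pixel weights (the published method derives its weights from gradients of the scribble-based likelihoods; the simple edge-cost map below is our own illustration):

```python
import heapq
import numpy as np

def geodesic_distance(weight, seeds):
    """Smallest accumulated weight along any 4-connected path from a seed
    (scribble pixel) to each pixel: a discrete weighted geodesic distance,
    computed with Dijkstra's algorithm."""
    h, w = weight.shape
    dist = np.full((h, w), np.inf)
    heap = []
    for (i, j) in seeds:
        dist[i, j] = 0.0
        heapq.heappush(heap, (0.0, i, j))
    while heap:
        d, i, j = heapq.heappop(heap)
        if d > dist[i, j]:
            continue                     # stale heap entry
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if 0 <= ni < h and 0 <= nj < w:
                nd = d + weight[ni, nj]  # cost of stepping onto (ni, nj)
                if nd < dist[ni, nj]:
                    dist[ni, nj] = nd
                    heapq.heappush(heap, (nd, ni, nj))
    return dist

# High weights along a vertical "edge" at column 2 make crossing it costly,
# so pixels left of the edge stay geodesically close to a left-side scribble.
weight = np.ones((5, 5))
weight[:, 2] = 50.0
d_f = geodesic_distance(weight, [(2, 0)])    # foreground scribble, left side
d_b = geodesic_distance(weight, [(2, 4)])    # background scribble, right side
label = d_f < d_b                            # True where the pixel is closer to F
```

Each pixel is then assigned the label of the nearer scribble, so the strong-gradient column acts as a natural segmentation boundary without any explicit optical flow.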
Result
Presented a geodesics-based algorithm for (interactive) natural image, 3D, and video segmentation and matting. We introduced the framework for still images and extended it to video segmentation and matting, as well as to 3D medical data.
Disadvantage
Although the proposed framework is general, we mainly exploited weights in the geodesic computation that depend on the pixel value distributions. The algorithm does not work when these distributions significantly overlap.
An interactive step allows the user to correct incorrect segmentation labels. Rather than requiring individual pixels to be re-labeled, in our case the interactive correction step can be performed more easily on the superpixels directly.
The thermal signal provides an additional cue for the superpixels. This is especially helpful for scenes with human actors, since it helps to separate the actors from their background and could also help to separate actors from each other.
Result:
Described an interactive video segmentation approach based on the propagation of known segmentations for the first and last frames to the intermediate frames of a video sequence. The straightforward matching of superpixels across the video sequence can easily handle moving cameras and non-rigidly moving objects.
Disadvantage:
The method is less effective when objects move very fast
2.12. SUMMARY
No. | Technique Used | Drawbacks
1 | Visual and motion saliency with a CRF | Shape and colour cues extract foreground objects with missing parts
2 | Feature distribution estimation | Segmented regions were randomly switched
3 | Saliency computation methods | Saliency maps suffer from low resolution
4 | Feature points extraction | Compression is just one possible use for a segmentation algorithm based on visually salient information
5 | Spatiotemporal saliency map | Fails to highlight the entire salient region, or highlights smaller salient regions better than larger ones
6 | Background modeling | Leads to misdetection of foreground objects and background
7 | Hierarchical spatio-temporal segmentation | Restriction on the size of the video that can be processed
8 | Energy minimization via graph cuts | This method can produce only low energy
9 | A framework for fast interactive image and video segmentation and matting | Does not work when the foreground and background distributions significantly overlap
10 | Interactive segmentation correction | Less effective when objects move very fast
CHAPTER III
PROPOSED SYSTEM
3.1. INTRODUCTION
The project proposes efficient motion detection and people counting based on
background subtraction using a dynamic threshold. Three different methods are used for
object detection, and their performance is compared in terms of detection accuracy: frame
differencing, dynamic-threshold-based detection, and the mixture-of-Gaussians model. After
foreground detection, parameters such as the speed, velocity, and angle of motion of each
object are determined.
Most previous methods depend on the assumption that the background is static over
short time periods. However, structured motion patterns of the background, which are
distinct from variations due to noise, are hardly tolerated under this assumption and thus
still lead to high false-positive rates in previous models. In dynamic-threshold-based object
detection, morphological processing and filtering are also used to remove unwanted pixels
from the background.
Along with this dynamic threshold, we introduce a background subtraction algorithm
for temporally dynamic texture scenes using a mixture of Gaussians, which greatly
attenuates the color variations generated by background motion while still highlighting
moving objects. Finally, the proposed method will be shown to be effective for background
subtraction in dynamic texture scenes compared with several competitive methods, and the
moving-object parameters will be evaluated for all methods.
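As an illustrative sketch of this idea, the fragment below keeps a single running Gaussian per pixel and flags pixels that deviate by more than k standard deviations; the full mixture-of-Gaussians model keeps several weighted Gaussians per pixel in the same spirit. All class names and parameter values are our own assumptions, not the report's:

```python
import numpy as np

class RunningGaussianBG:
    """One Gaussian (mean, variance) per pixel; the full mixture model of
    the proposed method keeps several weighted Gaussians per pixel."""
    def __init__(self, first_frame, alpha=0.05, k=2.5):
        self.mean = first_frame.astype(float)
        self.var = np.full(first_frame.shape, 25.0)  # assumed initial variance
        self.alpha, self.k = alpha, k

    def apply(self, frame):
        frame = frame.astype(float)
        fg = np.abs(frame - self.mean) > self.k * np.sqrt(self.var)
        # Update only pixels judged background, so the model absorbs the
        # dynamic texture without being corrupted by foreground objects.
        a = np.where(fg, 0.0, self.alpha)
        self.mean += a * (frame - self.mean)
        self.var += a * ((frame - self.mean) ** 2 - self.var)
        return fg

rng = np.random.default_rng(0)
frames = 100 + rng.normal(0, 2, (20, 8, 8))  # flickering background
model = RunningGaussianBG(frames[0])
for f in frames[1:]:
    model.apply(f)
scene = frames[-1].copy()
scene[2:5, 2:5] += 60                        # a moving object enters
mask = model.apply(scene)                    # True where foreground
```

Because the per-pixel variance is learned online, background flicker stays inside the k-sigma band while a genuinely new object falls far outside it.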
3.2. OBJECTIVE
The objective is background subtraction for accurate moving-object detection in
dynamic scenes, using dynamic threshold detection and a mixture-of-Gaussians model,
together with the determination of object parameters.
3.3. BLOCK DIAGRAM
Input Video → Frame Separation → Frame Subtraction → Dynamic Threshold →
Foreground Detection → Connected Component Analysis
3.4. METHODOLOGIES
Frame Separation
Frame Subtraction
Morphological Filtering
Object Detection
Parameter analysis
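The separation, subtraction, and connected-component steps above can be sketched end to end as follows (toy frames, an assumed fixed threshold, and a plain flood-fill as a stand-in for MATLAB's bwlabel):

```python
import numpy as np

def detect_objects(prev_frame, frame, thresh):
    """Frame subtraction -> thresholding -> connected-component labeling."""
    diff = np.abs(frame.astype(int) - prev_frame.astype(int))
    mask = diff > thresh
    labels = np.zeros(mask.shape, dtype=int)  # 0 = background
    h, w = mask.shape
    count = 0
    for r in range(h):
        for c in range(w):
            if mask[r, c] and labels[r, c] == 0:
                count += 1                     # start a new component
                stack = [(r, c)]
                labels[r, c] = count
                while stack:                   # 4-connected flood fill
                    y, x = stack.pop()
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny, nx] and labels[ny, nx] == 0):
                            labels[ny, nx] = count
                            stack.append((ny, nx))
    return mask, labels, count

prev = np.zeros((10, 10), dtype=np.uint8)
cur = prev.copy()
cur[1:4, 1:4] = 200   # first moving blob
cur[6:9, 6:9] = 180   # second moving blob
mask, labels, count = detect_objects(prev, cur, thresh=50)
```

Each labeled component can then be tracked across frames to estimate speed and direction of motion.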
Visible fixed-pattern noise is often caused by hot pixels: pixel sensors with higher-than-normal dark current. On long exposures, they can appear as bright pixels. Sensors on the
CCD that always appear as brighter pixels are called stuck pixels, while sensors that only
brighten after long exposures are called hot pixels.
The dark-frame-subtraction technique is also used in digital photogrammetry to
improve the contrast of satellite and aerial photographs, and is considered part of "best
practice".
Each CCD has its own dark-signal signature that is present on every acquired image
(bias, dark frame, light frame). By cooling the CCD to a very low temperature (down to
−30 °C), the dark signal is attenuated but not completely cancelled. Image processing is then
required to remove the dark signature from each raw image. The method usually used
consists of acquiring dark frames that are subsequently subtracted from each image. This
permits keeping only the relevant information related to the observed target.
During an acquisition session, an ideal dark frame is obtained by acquiring and
averaging several dark frames. This ideal dark frame has a reduced noise signature and is, in
theory, acquired with the same conditions (temperature and moisture) as the observation
images. This averaged dark frame will be subtracted from every raw image.
This approach to reducing the dark signal produces very clean astronomical images. It is
suitable for many applications that do not require very accurate photometry. However, the
technique is not necessarily accurate enough for detecting and analyzing objects with very
low signal-to-noise ratios.
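A minimal numerical sketch of the procedure (synthetic dark signature and noise levels; all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
shape = (16, 16)

# Fixed-pattern dark signal: a low floor plus a few hot/stuck pixels.
dark_signature = np.full(shape, 2.0)
dark_signature[3, 5] = 40.0    # hot pixel
dark_signature[10, 2] = 55.0   # stuck pixel

def acquire(scene):
    # Each exposure = scene + dark signature + random readout noise.
    return scene + dark_signature + rng.normal(0, 1.0, shape)

# Master dark frame: average many closed-shutter (scene = 0) exposures,
# which reduces the random noise but keeps the fixed signature.
master_dark = np.mean([acquire(np.zeros(shape)) for _ in range(32)], axis=0)

raw = acquire(np.full(shape, 100.0))  # light frame of a flat 100-count scene
calibrated = raw - master_dark        # only target-related signal remains
```

After subtraction, the hot and stuck pixels are removed along with the dark floor, leaving the scene plus residual random noise.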
3.4.3. Dynamic Threshold Approach
Fixed-decision-boundary (fixed-threshold) classification approaches have been
successfully applied to segment human skin. These fixed thresholds mostly fail in two
situations, since they only search within a certain skin-color range:
1) a non-skin object may be classified as skin if its color values fall within the fixed
threshold range;
2) true skin may be mistakenly classified as non-skin if its color values fall outside the
fixed threshold range.
Instead of predefined fixed thresholds, novel online-learned dynamic thresholds are used
to overcome the above drawbacks. The experimental results show that this method is robust
in overcoming them.
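The report does not spell out the learning rule, so the sketch below uses one plausible instance: a threshold re-estimated from each difference image as mean + k·sigma, which adapts to the noise level instead of being fixed:

```python
import numpy as np

def dynamic_threshold(diff, k=2.0):
    """Threshold learned from the current difference image: mu + k*sigma."""
    return diff.mean() + k * diff.std()

rng = np.random.default_rng(2)
quiet = np.abs(rng.normal(0, 3, (32, 32)))  # noise-only difference image
busy = quiet.copy()
busy[8:16, 8:16] += 50                      # an object enters the scene

t_quiet = dynamic_threshold(quiet)          # adapts down to the noise floor
t_busy = dynamic_threshold(busy)            # rises when real motion appears
mask = busy > t_busy
```

Because the threshold is re-learned per frame, the same rule tolerates both quiet and busy scenes without manual tuning.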
g(x) = b(f(x + y1), f(x + y2), ..., f(x + yn))    (1)
where b(v1, ..., vn) is a Boolean function of n variables and y1, ..., yn are the offsets of the
n-point window. The mapping f → b(f) is called a Boolean filter. By varying the Boolean
function b, a large variety of Boolean filters can be obtained. For example, choosing a
Boolean AND for b shrinks the input image object, whereas a Boolean OR expands it.
Numerous other Boolean filters are possible, since there are 2^(2^n) possible Boolean
functions of n variables. The main applications of such Boolean image operations have been
in biomedical image processing, character recognition, object detection, and general 2D
shape analysis.
Among the important concepts offered by mathematical morphology was to use sets
to represent binary images and set operations to represent binary image transformations.
Specifically, given a binary image, let the object be represented by the set X and its
background by the set complement Xc. The Boolean OR transformation of X by a (window)
set B is equivalent to the Minkowski set addition (+), also called dilation, of X by B:
(2)
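These set operations can be checked directly; the sketch below represents the window B as a set of pixel offsets (our own convention) and implements dilation as a Boolean OR and erosion as a Boolean AND over the window:

```python
import numpy as np

# The window B is represented as a set of (row, col) offsets.
B = {(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)}   # 5-pixel cross

def dilate(X, B):
    """Boolean OR filter: Minkowski addition expands the object."""
    h, w = X.shape
    out = np.zeros_like(X)
    for r in range(h):
        for c in range(w):
            if X[r, c]:
                for dr, dc in B:
                    if 0 <= r + dr < h and 0 <= c + dc < w:
                        out[r + dr, c + dc] = True
    return out

def erode(X, B):
    """Boolean AND filter: keeps a pixel only if the whole window fits in X."""
    h, w = X.shape
    out = np.zeros_like(X)
    for r in range(h):
        for c in range(w):
            out[r, c] = all(
                0 <= r + dr < h and 0 <= c + dc < w and X[r + dr, c + dc]
                for dr, dc in B)
    return out

X = np.zeros((7, 7), dtype=bool)
X[3, 3] = True                 # a single object pixel
cross = dilate(X, B)           # OR expands it to the 5-pixel cross
back = erode(cross, B)         # AND shrinks it back to the single pixel
```

The same pair of operations underlies the opening and closing filters used later in this chapter.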
Extending morphological operators from binary to gray-level images can be done by
using set representations of signals and transforming these input sets via morphological set
operations. Thus, consider an image signal f(x) defined on the continuous or discrete plane
D = R² or Z² and taking values in R̄ = R ∪ {−∞, +∞}. Thresholding f at all amplitude
levels v produces an ensemble of binary images represented by the threshold sets
Xv(f) = { x ∈ D : f(x) ≥ v },  v ∈ R̄    (3)
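The threshold-set decomposition can be verified numerically. The 1-D sketch below (toy signal, 3-sample flat window; all names illustrative) checks that the threshold sets reconstruct the signal and that a flat dilation commutes with thresholding:

```python
import numpy as np

f = np.array([0, 2, 1, 3, 1])                 # toy gray-level signal, values >= 0
levels = range(1, f.max() + 1)
threshold_sets = {v: f >= v for v in levels}  # Xv(f) = {x : f(x) >= v}

# Stacking the threshold sets reconstructs the signal:
recon = sum(threshold_sets[v].astype(int) for v in levels)

def flat_dilate(g):
    """Max over a 3-sample window (flat structuring element), zero-padded."""
    p = np.pad(g, 1, constant_values=0)
    return np.maximum(np.maximum(p[:-2], p[1:-1]), p[2:])

# Flat dilation commutes with thresholding: dilating f equals stacking the
# dilations of its threshold sets.
dilated = flat_dilate(f)
stacked = sum(flat_dilate(threshold_sets[v]).astype(int) for v in levels)
```

This is the threshold-decomposition property that lets flat gray-level operators be analyzed level by level as binary operators.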
structuring element, then the simple opening or closing of f by B will eliminate these 1D
objects. Another problem arises when f contains large-scale objects with sharp corners that
need to be preserved; in such cases opening or closing f by a disk B will round these corners.
These two problems could be avoided in some cases if we replace the conventional opening
with a radial opening.
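A radial opening takes the union of openings by line segments at several orientations, so thin structures aligned with any of the orientations survive. The sketch below uses only horizontal and vertical segments on a toy binary image (a disk opening of the same size would erase both thin bars):

```python
import numpy as np

def open_by_line(X, L, axis):
    """Binary opening by a length-L line segment along `axis`:
    erosion (segment must fit inside X) then dilation (paste it back)."""
    h, w = X.shape
    ero = np.zeros_like(X)
    out = np.zeros_like(X)
    for r in range(h):
        for c in range(w):
            pts = [(r, c + k) if axis == 1 else (r + k, c) for k in range(L)]
            ero[r, c] = all(0 <= y < h and 0 <= x < w and X[y, x]
                            for y, x in pts)
    for r in range(h):
        for c in range(w):
            pts = [(r, c - k) if axis == 1 else (r - k, c) for k in range(L)]
            out[r, c] = any(0 <= y < h and 0 <= x < w and ero[y, x]
                            for y, x in pts)
    return out

X = np.zeros((9, 9), dtype=bool)
X[2, 1:7] = True    # thin horizontal bar
X[4:9, 7] = True    # thin vertical bar
X[6, 2] = True      # isolated noise pixel

# Union of openings by a horizontal and a vertical segment: both bars
# survive, the isolated pixel is removed.
radial = open_by_line(X, 4, axis=1) | open_by_line(X, 4, axis=0)
```

Adding diagonal segments to the union makes the opening closer to truly radial while still preserving sharp corners.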
Correlation
Sensitivity
MSE
PSNR
Correlation
Correlation and Convolution are basic operations that we will perform to extract
information from images. They are in some sense the simplest operations that we can perform
on an image, but they are extremely useful. Moreover, because they are simple, they can be
analyzed and understood very well, and they are also easy to implement and can be computed
very efficiently.
Our main goal is to understand exactly what correlation and convolution do, and why
they are useful. We will also touch on some of their interesting theoretical properties; though
developing a full understanding of them would take more time than we have.
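One concrete property worth keeping in mind: convolution is correlation with a flipped kernel, so the two coincide exactly when the kernel is symmetric. A short numerical check:

```python
import numpy as np

signal = np.array([0., 1., 3., 2., 0., 0.])
kernel = np.array([1., 2., 3.])              # deliberately asymmetric

corr = np.correlate(signal, kernel, mode='full')        # kernel slides as-is
conv = np.convolve(signal, kernel, mode='full')         # kernel is flipped
conv_via_corr = np.correlate(signal, kernel[::-1], mode='full')
```

For an asymmetric kernel the two results differ, which is why the distinction matters when designing filters.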
Sensitivity
We estimate the sensitivity of the image processing operators with respect to
parameter changes by performing a sensitivity analysis. This is not new to the image
processing community but often left out for performance reasons.
MSE & PSNR
Image quality assessment is an important but difficult issue in image processing
applications such as compression coding and digital watermarking. For a long time, mean
square error (MSE) and peak signal-to-noise ratio (PSNR) have been widely used to measure
the degree of image distortion, because they represent the overall gray-value error contained
in the entire image and are mathematically tractable. In many applications, it is
straightforward to design systems that minimize MSE or PSNR.
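Both measures are straightforward to compute; a minimal sketch for 8-bit images:

```python
import numpy as np

def mse(a, b):
    """Mean square error between two images of the same size."""
    return np.mean((a.astype(float) - b.astype(float)) ** 2)

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB (infinite for identical images)."""
    e = mse(a, b)
    return float('inf') if e == 0 else 10 * np.log10(peak ** 2 / e)

ref = np.full((8, 8), 128, dtype=np.uint8)
noisy = ref.copy()
noisy[0, 0] = 138            # a single pixel off by 10 gray levels
```

Note that MSE weights every pixel equally regardless of position, which is exactly the limitation with respect to the human visual system discussed next.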
neighbors. Moreover, all pixels in an image are assumed to be equally important. This, of
course, is far from being true. As a matter of fact, pixels at different positions in an image can
have very different effects on the human visual system (HVS).
3.4.7. Advantages
Higher detection accuracy.
3.4.8. Applications
Video surveillance
Object detection
People counting
CHAPTER IV
SYSTEM SPECIFICATION
4.1. HARDWARE SPECIFICATION

Processor    : Pentium IV
RAM          : 256 MB DDR
Hard Disc    : 80 GB
Floppy Disc  : 1.44 MB
CD Drive     : 52x CD-ROM drive
Monitor      : Samsung 17"
Keyboard
Mouse
Zip Drive    : 250 MB
Printer      : HP DeskJet

4.2. SOFTWARE SPECIFICATION

Operating System : Windows XP
Software         : MATLAB
4.3. SOFTWARE FEATURES
The software used in this project is MATLAB 7.10 or above.
4.3.1. Introduction to MATLAB
MATLAB is a product of The MathWorks, Inc.
MATLAB is an interactive system whose basic data element is an array that does not
require dimensioning. This allows you to solve many technical computing problems,
especially those with matrix and vector formulations, in a fraction of the time it would take to
write a program in a scalar non interactive language such as C or FORTRAN.
The name MATLAB stands for matrix laboratory. MATLAB was originally written to
provide easy access to matrix software developed by the LINPACK and EISPACK projects.
Today, MATLAB engines incorporate the LAPACK and BLAS libraries, embedding the
state of the art in software for matrix computation.
MATLAB has evolved over a period of years with input from many users. In
university environments, it is the standard instructional tool for introductory and advanced
courses in mathematics, engineering, and science. In industry, MATLAB is the tool of choice
for high-productivity research, development, and analysis.
Many of these tools are graphical user interfaces. It includes the MATLAB desktop and
Command Window, a command history, an editor and debugger, a code analyzer and other
reports, and browsers for viewing help, the workspace, files, and the search path.
MATLAB windows:
1. Command Window: This is the main window, characterized by the MATLAB
command prompt >>. All commands, including those for running user-written programs, are
typed in this window at the MATLAB prompt.
2. Graphics Window: The output of all graphics commands is sent to the graphics or
figure window.
3. Edit Window: This is the place to create, write, edit, and save programs in files called
M-files. Any text editor can be used to carry out these tasks.
functions, like sum, sine, cosine, and complex arithmetic, to more sophisticated functions like
matrix inversion, matrix eigenvalues, Bessel functions, and fast Fourier transforms.
Handle Graphics
MATLAB has extensive facilities for displaying vectors and matrices as graphs, as
well as annotating and printing these graphs. It includes high-level functions for
two-dimensional and three-dimensional data visualization, image processing, animation, and
presentation graphics. It also includes low-level functions that allow you to fully customize
the appearance of graphics as well as to build complete graphical user interfaces for your
MATLAB applications.
4.3.3. The MATLAB application program interface
This is a library that allows you to write C and Fortran programs that interact with
MATLAB. It includes facilities for calling routines from MATLAB (dynamic linking),
calling MATLAB as a computational engine, and reading and writing MAT-files.
4.3.4. MATLAB documentation
MATLAB provides extensive documentation, in both printed and online format, to
help you learn about and use all of its features. If you are a new user, start with this Getting
Started book. It covers all the primary MATLAB features at a high level, including many
examples. The MATLAB online help provides task-oriented and reference information about
MATLAB.
4.3.5.
program their own new functions. MATLAB function programming is flexible and
particularly easy to learn.
4.3.6.
M-files
M-files in MATLAB can be scripts that simply execute a series of MATLAB
statements, or they can be functions that accept arguments and produce one or more
outputs. M-file functions extend the capabilities of both MATLAB and the Image
Processing Toolbox to address specific, user-defined applications. M-files are created using a
text editor and are stored with a name of the form filename.m, such as average.m and filter.m.
The H1 line
Help text
Comments
4.3.7.
Operators
Logical operators that perform the functions AND, OR, and NOT
4.3.8. Image Processing Toolbox
Image Processing Toolbox provides a comprehensive set of reference-standard
algorithms and graphical tools for image processing, analysis, visualization, and algorithm
development. You can perform image enhancement, image deblurring, feature detection,
noise reduction, image segmentation, geometric transformations, and image registration.
Many toolbox functions are multithreaded to take advantage of multicore and multiprocessor
computers.
Image Processing Toolbox supports a diverse set of image types, including
high-dynamic-range, gigapixel-resolution, embedded-ICC-profile, and tomography images.
Toolbox features include spatial image transformations, morphological operations,
neighborhood and block operations, linear filtering and filter design, transforms, image
analysis and enhancement, image registration, deblurring, and region-of-interest operations.
You can extend the capabilities of Image Processing Toolbox by writing your own
M-files, or by using the toolbox in combination with other toolboxes, such as Signal
Processing Toolbox and Wavelet Toolbox.
Graphical tools let you explore an image, examine a region of pixels, adjust the
contrast, create contours or histograms, and manipulate regions of interest (ROIs). With
toolbox algorithms you can restore degraded images, detect and measure features, analyze
shapes and textures, and adjust color balance.
4.3.9.
many MATLAB libraries (for example, XML or SQL support) are implemented as wrappers
around Java or ActiveX libraries. Calling MATLAB from Java is more complicated, but can
be done with a MATLAB extension, which is sold separately by MathWorks, or through an
undocumented mechanism called JMI (Java-to-MATLAB Interface), which should not be
confused with the unrelated Java Metadata Interface that is also called JMI.
As alternatives to the MuPAD-based Symbolic Math Toolbox available from
MathWorks, MATLAB can be connected to Maple or Mathematica.
MATLAB has a direct link with modeFRONTIER, a multidisciplinary and
multi-objective optimization and design environment, written to allow coupling to almost
any computer-aided engineering (CAE) tool. Once a result is obtained in MATLAB, the
data can be transferred to and stored in modeFRONTIER.
4.4.
SYSTEM DESIGN
System design is the process of planning a new system to complement or altogether replace
the old system. The design phase is the first step in moving from the problem domain to the
solution domain. The design of the system is the critical aspect that determines the quality
of the software. System design is also called top-level design. The design phase translates
the logical aspects of the system into the physical aspects of the system.
CHAPTER V
RESULT AND IMPLEMENTATION
Visual Saliency
CHAPTER VI
WORKS TO BE DONE IN PHASE II
Background subtraction for accurate moving-object detection in dynamic scenes,
using dynamic threshold detection and a mixture-of-Gaussians model, together with the
determination of object parameters with the help of morphological filtering, will be carried
out in the next phase.
CHAPTER VII
REFERENCES
1. K. Fukuchi, K. Miyazato, A. Kimura, S. Takagi, and J. Yamato, "Saliency-based video
segmentation with graph cuts and sequentially updated priors," in Proc. IEEE Int. Conf.
Multimedia Expo, Jun.-Jul. 2009, pp. 638-641.
IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 11, pp. 1222-1239, Nov. 2001.
9. X. Bai and G. Sapiro, "A geodesic framework for fast interactive image and video
segmentation and matting," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2007, pp. 1-8.