
Real-time pedestrian detection in FIR

and grayscale images

Dissertation
zur Erlangung des Grades eines
Doktor-Ingenieurs (Dr.-Ing.)
an der Fakultät
für Elektrotechnik und Informationstechnik
der Ruhr-Universität Bochum

vorgelegt von

Aalzen Jaap Wiegersma


Drachten

Bochum 2006

Contents

1 Introduction  10
1.1 Problem description  10
1.2 Object detection in computer vision  12
1.2.1 Segmentation  12
1.2.2 Classification  17
1.2.3 Tracking  18
1.3 Related work in pedestrian detection  19
1.3.1 Initial detection in grayscale images  19
1.3.2 Initial detection in fir images  21
1.3.3 Classification  22
1.3.4 Tracking  24
1.4 Selected approach  25
1.4.1 Image preprocessing  26
1.4.2 Initial detection  26
1.4.3 Classification  27
1.4.4 Tracking  28
1.5 Contributions  29
1.6 Outline of this thesis  31

2 Initial detection  32
2.1 Image pre-processing  33
2.1.1 Image enhancement  33
2.1.2 Image features  35
2.2 Gradient based segmentation  39
2.2.1 Grouping vertical gradients  39
2.2.2 Scanning a vertical edge detector through the whole image  45
2.3 Temperature segmentation in fir images  48

3 Classification  52
3.1 Classifiers  52
3.1.1 Neural networks  53
3.1.2 Support vector machines  55
3.2 Image features for classification  56
3.2.1 Rectangle features  57
3.2.2 Histograms of gradients and orientations features  57
3.3 Training and validation of the classifier  58
3.3.1 Creating training datasets  58
3.3.2 Training neural networks  60
3.3.3 Training support vector machines  60
3.4 Optimization of the classification  61
3.4.1 Bootstrapping  61
3.4.2 Component classification  62
3.4.3 A simplified approximation to the support vector machine decision rule  62
3.5 Feature selection  63
3.5.1 Principal component analysis  63
3.5.2 Adaboost  65
3.5.3 Multi-objective optimization  66
3.6 Scanning a classifier through the image at every position and scale  69

4 Tracking  72
4.1 Tracking using the Hausdorff distance  73
4.2 Mean shift tracking  75
4.3 Tracking with Condensation  77
4.4 Integrating initial detections through time  80
4.5 Tracking the classification output  81

5 Experimental results  83
5.1 Initial detection  83
5.1.1 Results on fir images  86
5.1.2 Results on grayscale images  89
5.2 Classification  92
5.2.1 Results on fir images  94
5.2.2 Results on grayscale images  104
5.2.3 Results of component classification  109
5.2.4 Results of classifier optimization on classification performance  109
5.2.5 Results of classifier optimization on classification speed  112
5.2.6 Scanning a classifier through the image at every position and scale  115
5.3 Tracking  115
5.3.1 Results on fir images  116
5.3.2 Results on grayscale images  118

6 Discussion  119
6.1 Achievements and limitations  119
6.1.1 Initial detection  120
6.1.2 Classification  124
6.1.3 Tracking  130
6.1.4 Complete system performance  136
6.2 Further work  137
6.3 Summary and conclusions  140

List of Figures

1.1 Examples of pedestrian detection.  10
2.1 Rectangle filters.  36
2.2 Calculation of the rectangle sum.  37
2.3 Image gradients and energy.  38
2.4 Grouping vertical gradients in a fir image.  41
2.5 Grouping vertical gradients in a grayscale image.  42
2.6 Use of the gradient sign for initial detection.  42
2.7 Combining regions of interest.  44
2.8 A scanning window.  46
2.9 Scanning an edge detector through the image at every position and scale.  47
2.10 Scanning an edge detector through the image at every position and scale.  48
2.11 Temperature segmentation in fir images.  49
2.12 Combining regions of interest.  50
3.1 Rectangle features for classification.  57
3.2 Calculation of image features in subregions.  58
3.3 Subregions for component classification.  62
3.4 Classifying the whole image at every scale and resolution.  71
4.1 Model contour of the Condensation tracker.  79
4.2 Integrating initial detections through time.  81
5.1 Matching ground truth data.  84
5.2 Suitability of initial detections for classification.  85
5.3 Fir images at different temperature ranges.  87
5.4 Initial detection results.  88
5.5 Percentage positive detections.  88
5.6 Calculation times of initial detection for fir images.  89
5.7 Illumination changes in grayscale images.  90
5.8 Initial detection results in grayscale images.  91
5.9 Calculation times of initial detection grayscale.  92
5.10 An example ROC curve.  93
5.11 Results of support vector machine classification orientations features.  96
5.12 Results of support vector machine classification gradients and orientations features.  97
5.13 Results of support vector machine classification rectangle features.  98
5.14 Results of neural network classification orientations features.  100
5.15 Results of neural network classification gradients and orientations features.  101
5.16 Results of neural network classification rectangle features.  102
5.17 Processing times feature vector calculation.  103
5.18 Classification times of classifier/image feature combination.  104
5.19 Classification results grayscale images orientations features.  106
5.20 Classification results grayscale images gradients orientations features.  107
5.21 Classification results grayscale images rectangle features.  108
5.22 Results of component classification.  109
5.23 Results of PCA feature selection.  110
5.24 Results of Adaboost feature selection.  111
5.25 Results of multi-objective optimization feature selection.  111
5.26 Classification times of optimization algorithms.  113
5.27 Classification times of optimization algorithms.  114
5.28 Pedestrian tracking in fir images.  117
5.29 Processing times tracking algorithms.  118
6.1 Difficulties for initial detection because of vertical structures in the background.  122
6.2 Difficulties for initial detection because of many vertical structures in background.  122
6.3 Pedestrian pushing a stroller.  123
6.4 A group of pedestrians in a fir image.  124
6.5 Persistence of false positives over time.  127
6.6 Example of Hausdorff tracker in fir images.  132
6.7 Tracking a car in fir images with the Hausdorff tracker.  133
6.8 Tracking a car in grayscale images with the Hausdorff tracker.  134

List of Algorithms

1 Connected components labeling.  40
2 Grouping vertical gradients  45
3 Scanning a vertical edge detector  47
4 Temperature segmentation in fir images  51
5 Adaboost. Algorithm from [39].  65
6 Assignment of a rank to an individual. Algorithm from [16].  68
7 Crowding-distance-assignment. Algorithm from [16].  69
8 Scanning a classifier through the image at every scale and location.  70
9 The Hausdorff tracker.  75
10 Mean shift tracking  77
11 The Condensation algorithm.  79
12 Initial detection tracker.  80
13 Tracking the classification output.  82

Chapter 1
Introduction
1.1 Problem description

This thesis describes a complete system for the detection of pedestrians in monocular camera images and monocular fir (far infrared) images recorded from a moving car. The objective of this work is to have a system that can warn the driver of the car when a pedestrian crosses the street. Two examples of a street scene and the output of the detection system are shown in figure 1.1. There are a few important properties such a system should have:

(a) FIR pedestrian detection

(b) Pedestrian detection in a grayscale image

Figure 1.1: Examples of pedestrian detection.


- It should be very reliable with respect to false detections; it should have a high detection rate and a low false alarm rate.

- It should be very reliable with respect to environment conditions; it should be able to operate in a complex city environment as well as on a rural road.

- It should be very reliable with respect to weather conditions; it should be able to operate under various weather conditions, for example high temperatures, low temperatures, rain, and snow.

- It should be able to operate reliably in a dynamic environment; both the car and the pedestrians are usually moving.

- It should be able to operate in real-time on moderate hardware. The aim of the system presented in this work is to run at at least 20 frames per second on a 1 GHz desktop processor. This is roughly the processing power that can be expected to be available for driver assistance systems in cars in the next years.

Developing a system which satisfies all of these properties is not trivial. Therefore, the system described here is divided into four components:

- An image pre-processing component which calculates relevant features from the image.

- An initial detection component which generates regions of interest in the image. Several initial detection routines can be used for different environment conditions. The initial detection routine can also have several parameter setups for different weather conditions.

- A classification component which determines which of the regions of interest generated by the initial detection routine contain pedestrians. The quality of the classification component influences the number of false detections of the system.

- A tracking component which keeps track of pedestrians in successive frames once they have been positively classified by the classification component. The tracking component should be able to handle the motion of the car and the motion of the pedestrian. The tracking component makes it possible to stabilize detections over time and to predict the time of contact in case of a possible collision.

This thesis describes a system that can be used for pedestrian detection in grayscale camera images and for pedestrian detection in far infrared images. The fir detection system described here is used by BMW as a reference system for testing commercial systems.

1.2 Object detection in computer vision

Computer vision research is applied in fields of computer science like robotics and medical image analysis, in industrial applications like driver assistance systems and assembly, and in law enforcement applications like face detection and finger print comparison. This section gives a short overview of how the methods in this thesis relate to other fields of computer vision.

1.2.1 Segmentation

The purpose of segmentation is to extract interesting regions from the image. An example of segmentation in an image recorded from a medical scanner is locating the exact boundaries of an organ in the image. An example of segmentation in pedestrian detection in fir images is locating bright image regions. In this thesis, the term initial detection is used for segmenting regions of interest from the image. The remainder of this section describes segmentation methods which are applied in various fields of computer vision.
Color is a cue which can be used for segmentation, for example by grouping pixels of similar color value into a region. An application of color segmentation is for example a robot vision system which segments objects on a table for grabbing. An example of a color based segmentation method is graph based segmentation. In graph based segmentation, the image to be segmented is represented as a graph G = (V, E) where each node v_i \in V corresponds to a pixel in the image, and the edges (v_i, v_j) \in E connect pairs of neighboring pixels. Each edge in the graph has a weight w((v_i, v_j)) which is a measure of dissimilarity in color between the two pixels connected by that edge. A segmentation S is a partition of V into components where each component C \in S is a connected component in a graph G' = (V, E'), where E' \subseteq E. The edges between two pixels inside a component should have relatively low weights and the edges between pixels in different components should have relatively high weights. An example of a graph based segmentation method is [18].
Motion also provides information that can be used for segmentation. One motion based segmentation method for static cameras is background subtraction. In background subtraction, an image of the background is stored. In successive camera frames, the current image is subtracted from the stored image. A difference between the current image and the stored image indicates that something has moved at the pixels in the image where the difference is unequal to zero. Usually, the stored image of the background needs to be updated frequently because of changes in illumination conditions over time. Applications of background subtraction are for example people detection in security camera images and traffic monitoring.
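A minimal sketch of this idea (not a method used in this thesis), where the background is maintained as a running average and moving pixels are extracted by thresholding the difference; the update rate alpha and the threshold are illustrative parameters:

```python
import numpy as np

def background_subtraction(frames, alpha=0.05, threshold=25):
    """Yield a binary motion mask per frame using a running-average background."""
    background = None
    for frame in frames:  # frame: 2D numpy array of gray values
        gray = frame.astype(np.float32)
        if background is None:
            background = gray.copy()
        # pixels whose difference to the background exceeds the threshold are moving
        mask = np.abs(gray - background) > threshold
        # slowly blend the current frame into the stored background
        # to follow illumination changes over time
        background = (1 - alpha) * background + alpha * gray
        yield mask
```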
Optic flow [24] is the apparent motion of brightness in the image. Like background subtraction, optic flow can be used for object segmentation. If the image intensity at a point p in the image at time t is I(p_x, p_y, t), the derivative dI/dt with respect to t is

\frac{dI}{dt} = \frac{\partial I}{\partial p_x}\frac{dp_x}{dt} + \frac{\partial I}{\partial p_y}\frac{dp_y}{dt} + \frac{\partial I}{\partial t}\frac{dt}{dt}.

The assumption made in the calculation of the optic flow is that the intensity of a point p does not change over time, so dI/dt = 0. This gives the first optic flow constraint:

-\frac{\partial I}{\partial t} = \frac{\partial I}{\partial p_x}\frac{dp_x}{dt} + \frac{\partial I}{\partial p_y}\frac{dp_y}{dt}. \qquad (1.1)

The second assumption made in optic flow is that it is constant in a small region S_p around p in the image over time. So measuring the gradients in a region S_p makes it possible to solve for the optic flow vector. The optic flow in each point in this region is constrained by (1.1). The optic flow v = (dp_x/dt, dp_y/dt) can be calculated by minimizing

E_p = \sum_{x, y \in S_p} \left( \frac{\partial I}{\partial p_x} v_x + \frac{\partial I}{\partial p_y} v_y + \frac{\partial I}{\partial t} \right)^2.
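A minimal least-squares sketch of this minimization at a single pixel, assuming numpy; the central differences for the derivatives, the window size, and the border handling are illustrative choices, not the method used in this thesis:

```python
import numpy as np

def lucas_kanade_flow(img1, img2, x, y, half_window=2):
    """Estimate the optic flow v = (vx, vy) at (x, y) by minimizing E_p
    over a small window S_p (least-squares solution of constraint (1.1))."""
    img1 = img1.astype(np.float32)
    img2 = img2.astype(np.float32)
    # central-difference spatial gradients and temporal derivative
    Ix = (np.roll(img1, -1, axis=1) - np.roll(img1, 1, axis=1)) / 2.0
    Iy = (np.roll(img1, -1, axis=0) - np.roll(img1, 1, axis=0)) / 2.0
    It = img2 - img1
    s = (slice(y - half_window, y + half_window + 1),
         slice(x - half_window, x + half_window + 1))
    A = np.stack([Ix[s].ravel(), Iy[s].ravel()], axis=1)  # one constraint per pixel in S_p
    b = -It[s].ravel()
    v, *_ = np.linalg.lstsq(A, b, rcond=None)  # minimizes E_p
    return v  # (vx, vy)
```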

An application of optic flow in a static camera environment is the segmentation of moving objects from the scene. At locations where the magnitude of the optic flow vector is larger than zero, there is a moving object. The direction of the vector indicates in which direction the object is moving. An application of optic flow is the detection of moving people from a static camera, for example from a security camera or from a standing car. An application of optic flow in a moving camera environment is obstacle avoidance. The whole scene recorded from a moving camera appears to be moving because of the movement of the camera itself. Objects closer to the camera generate a stronger flow vector field magnitude than objects further away from the camera because of motion parallax. To avoid a collision with an object in the scene, there should be a movement into the direction where the flow vector field has a low magnitude.
Another computer vision technology that can be used for object segmentation is stereo vision. From two calibrated cameras, a disparity map can be calculated which provides relative distance information from pixels in the image to the camera. A disparity map can be calculated using a correlation measure based on for example intensity values. The correlation measure is used to find the corresponding position of a gray level value from the first image in the second image. The disparity is the difference in position of a gray level value. A large disparity of an image region means the corresponding object in that region is relatively closer to the camera than an object in an image region with a smaller disparity. The disparity value of an image region can be used for segmenting it from the image. Example applications of stereo vision are pedestrian detection from a moving car [42], [45], [20], and obstacle detection in robotics.
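As an illustration of correlation-based disparity calculation, a deliberately naive sum-of-absolute-differences sketch; it assumes rectified images, and the block size and disparity range are hypothetical parameters:

```python
import numpy as np

def disparity_block_matching(left, right, max_disparity=32, block=5):
    """For each pixel, find the horizontal shift along the scanline that
    minimizes the sum of absolute differences (SAD) between blocks."""
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    h, w = left.shape
    r = block // 2
    disparity = np.zeros((h, w), dtype=np.int32)
    for y in range(r, h - r):
        for x in range(r + max_disparity, w - r):
            patch = left[y - r:y + r + 1, x - r:x + r + 1]
            costs = [np.abs(patch - right[y - r:y + r + 1,
                                          x - d - r:x - d + r + 1]).sum()
                     for d in range(max_disparity)]
            disparity[y, x] = int(np.argmin(costs))  # best-matching shift
    return disparity
```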
The Active Contour Model [26] is a method which detects contours (edges) in an image. A snake is initialized around the object to be segmented and moves along its interior normal until it stops at the edge of the object. A contour (snake) is a spline represented by a vector v(s) = (x(s), y(s)) where s is the arc length. Finding a contour in an image means minimizing an energy function:

E_{Snake} = \int_0^1 E_{int}(v(s)) + E_{image}(v(s)) + E_{ext}(v(s)) \, ds

where E_{int} represents the internal energy (bending and discontinuity) of the contour, E_{image} represents the forces caused by the image, and E_{ext} represents the external forces caused by a higher level process. The energy function E_{Snake} is minimized using variational calculus. The Active Contour Model is much used in medical image processing, for example for the segmentation of organs in images recorded by a medical scanner.
An alternative to the spline representation of Active Contour Models is the Level Set representation [10], which can detect contours with discontinuities. The Active Contour Model from [10] is based on the Mumford-Shah functional [29] for image segmentation. The energy function E(c_1, c_2, \phi) for a piecewise smooth image is:

E(c_1, c_2, \phi) = \int_\Omega (f - c_1)^2 H(\phi) + \int_\Omega (f - c_2)^2 (1 - H(\phi)) + \mu \int_\Omega |\nabla H(\phi)|

where \Omega is the image domain, f is the image, c_1 and c_2 are the average values of f inside and outside the object respectively, H(\phi) is the Heaviside function, and \mu is a constant for weighting the length term of the contour. The energy function E(c_1, c_2, \phi) is minimized using methods from variational calculus. An example application of the Level Set representation of the Active Contour Model is segmentation in images recorded from a medical scanner. An improvement to the Active Contour Models is to add a shape prior term E_{shape} to the energy function. In this way, a prior shape of the object to be segmented can be used to improve the segmentation results. Examples of the use of a shape prior term in the energy function are [12, 9].

1.2.2 Classification

The purpose of classification in the context of computer vision is to learn to distinguish between images of a target class of objects and images of a non-target class of objects. A classifier is usually trained on a set of target objects and a set of non-target objects so that it can estimate the class of an unseen image example. The training is done on a database of target images and a database of non-target images which are given a class label by a human observer. From the labeled target and non-target images the classifier learns to distinguish between the two classes.

Commonly used classifiers for computer vision applications are nearest neighbour classifiers, neural networks, and support vector machines. In a two class classification problem, a k-nearest neighbour classifier selects the k example points closest to the example being classified and votes between those example points to determine the class of the example. Neural networks and support vector machines are explained in section 3.1.1 and section 3.1.2 respectively.
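A minimal sketch of the k-nearest neighbour vote just described, assuming feature vectors and labels stored in numpy arrays:

```python
import numpy as np

def knn_classify(train_features, train_labels, example, k=5):
    """Two class k-nearest-neighbour vote.
    train_features: (n, d) array, train_labels: (n,) array of 0/1."""
    distances = np.linalg.norm(train_features - example, axis=1)
    nearest = np.argsort(distances)[:k]           # indices of the k closest examples
    votes = np.bincount(train_labels[nearest], minlength=2)
    return int(np.argmax(votes))                  # majority class among the neighbours
```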
An important choice is which kind of image features are used for classification. Usually, the gray values of an image are not directly used for classification because they are strongly dependent on the illumination conditions. First, features are calculated from the image, and these features are used as input to the classifier. Usually, the features used are some kind of gradient response, for example gradients from a Sobel filter, the orientations of the gradients, or wavelets. Example applications of neural networks and support vector machines are face detection [35] and car detection [22]. In section 3, the use of classifiers for pedestrian detection is described in detail.

1.2.3 Tracking

The purpose of tracking in the context of computer vision is usually to determine some properties (usually position and scale) of an object in a next time step or next image frame. Tracking usually involves an estimation and a confirmation of the estimation. Tracking makes it possible to integrate information about an object through time, and it makes it possible to estimate the position and scale of an object in the near future. This can be useful for estimating time to contact, for example. In addition, this makes it possible to limit processing in the next frame to the area in the image around where the tracker estimated the object to be. This is important in real-time image processing, for example.

Two commonly used methods for estimating the state of a system from measurements are the Kalman filter and Condensation [25]. They consist of two main steps: a prediction step based on a dynamical model of the object which estimates the state (often position in computer vision) of the object in the next time step, and an update step which updates the prediction based on some measurement. The main difference between Condensation and a Kalman filter is that Condensation is based on factored sampling, which does not require a Gaussian measurement density. This property makes it possible to track objects in a cluttered scene. An example application of tracking with a Kalman filter is vehicle tracking from a static camera [5]. An example application of Condensation is tracking humans from a static camera [25].
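The following sketch illustrates the generic predict/update structure with factored sampling; it is a simplified illustration, not the Condensation variant applied later in this thesis, and the zero-order motion model and noise level are assumptions:

```python
import numpy as np

def condensation_step(samples, weights, measure, motion_noise=2.0):
    """One predict/update cycle of a factored-sampling tracker.
    samples: (n, d) array of state hypotheses, weights: (n,) normalized weights,
    measure: function mapping a state to an observation likelihood."""
    n = len(samples)
    # resample according to the current weights (factored sampling)
    indices = np.random.choice(n, size=n, p=weights)
    samples = samples[indices]
    # predict: apply the dynamical model; here a zero-order model plus noise
    samples = samples + np.random.normal(0.0, motion_noise, samples.shape)
    # update: reweight each hypothesis by its measurement likelihood
    weights = np.array([measure(s) for s in samples])
    return samples, weights / weights.sum()
```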
Color can also be used as a feature for tracking. The Mean Shift Tracker [11] is based on color histograms. The Mean Shift Tracker is described in detail in section 4.2. An example application of the Mean Shift Tracker is face tracking in color images.

1.3 Related work in pedestrian detection

Recently, there has been a lot of interest in driver assistance applications like lane detection, car detection, traffic sign recognition, and pedestrian detection. The reasons for this are the desire to make traffic safer, to make driving more comfortable for drivers, and the recent technological progress in camera systems and computer hardware which makes this research possible. At the time of writing, several driver assistance systems are commercially available, for example a lane departure warning system from Citroen and a fir pedestrian detection system from Honda. This section gives an overview of relevant other pedestrian detection research.

1.3.1 Initial detection in grayscale images

In [42], [45], and [20], stereo vision is used to generate hypotheses of pedestrians. A depth map is calculated from which foreground objects can be segmented using range thresholding. The advantage of using stereo is that processing can be limited to a number of objects close to the car. This speeds up further processing and limits the number of false detections. A disadvantage is that much processing capacity or special hardware is required to calculate the disparity map. Two cameras are needed for the detection process, making it more expensive than monocular processing. Also, it is not clear if it is possible to keep the cameras calibrated over a period of many years for use in everyday driving.

A different approach is shape-based detection. A shape matching method for initial detection in monocular images is presented in [21]. In order to detect a pedestrian in a grayscale image, a hierarchy of edge templates of pedestrians is matched with a distance transform image calculated from a feature image of the original grayscale image. Templates of pedestrians in different poses are grouped together in prototypes. The grouping of templates is done at multiple levels, resulting in a hierarchy. The leaves of the hierarchy contain all templates and the nodes of the hierarchy contain the prototypes. The hierarchy of templates is scanned at a coarse-to-fine scale through the image. If a template at a certain node in the hierarchy matches (the distance measure between template and image is below a certain threshold), the templates under the leaf are processed. In [32], saliency maps calculated from intensity, color, and orientation features are used for initial detection.
In [7] and [3], a combination of a shape-based method and stereo is used for initial detection. An edge image is calculated from a grayscale image. Stereo vision is used to eliminate background edges. A symmetry map is calculated from the vertical foreground edges. A bounding box is created around symmetrical vertical objects and a search for the head of the pedestrian is performed through matching a head model. Further processing discards bounding boxes which do not have the correct width to height ratio or which are too homogeneous to contain a pedestrian. In [14], initial detection is performed by calculating image entropy. In image areas with much structure, a template is matched with image contours. Depth information from stereo vision is used to adjust the size of the template for matching image contours. A restriction of using image features like edges for initial detection is that good quality image features are required. Often, it is difficult to obtain good image features from low contrast grayscale images.

One important cue humans use for object detection is motion. Humans are good at detecting relative motion, for example detecting a moving pedestrian from a moving car. Motion is generally not a good cue for a real-time pedestrian detection system: the calculation of motion information using optical flow is expensive and does not provide a good segmentation result because different parts of the pedestrian are moving at different speeds. It does not provide a complete segmentation; for classification it is necessary to have a complete bounding box around an object. Also, because usually the car is moving at a relatively high speed in comparison to the pedestrian, the whole scene is moving and the motion of the pedestrian is negligible. There are also many pedestrians that are not moving, so motion information is useless in this case.

In [23], leg motion is used for the detection of pedestrians in color images. The input image is clustered on color values. Usually, the legs of a pedestrian are combined into the same cluster. The temporal change in the shape of such a cluster is used to distinguish it from clusters belonging to other objects in the image. The limitation of using motion for detection is that it can only detect moving pedestrians. Also, the clustering of a color image is time consuming and does not always provide good segmentation results if the input image has a complex background.

1.3.2 Initial detection in fir images

The most straightforward approach to pedestrian detection in fir images is to use body heat. People appear brighter in a fir image than most other objects because of their body temperature. In [30], a probabilistic template is created from a training database containing pedestrians in different poses. The template is created by thresholding bright regions from the images in the training database. The probability of each pixel in the template belonging to a pedestrian is calculated based on how frequently it is thresholded in the training database. The template is scanned through the image at various scales from a coarse to fine resolution. In [43], thresholding is also used to segment bright regions from the image. After thresholding, bright regions that have an incorrect width to height ratio for pedestrians are discarded, and regions which are unlikely to be a pedestrian based on position in the image are discarded. In addition to detecting complete pedestrians, a detector that searches for pedestrian heads is also used to generate initial detections. In [4], the fir image and its vertical edges are searched for symmetrical regions. Symmetrical regions in cold image areas are discarded. A vertical histogram of edges is calculated in each of the bounding boxes surrounding the symmetrical regions. The shape of the histogram is used to discard bounding boxes which do not match a pedestrian. Also, bounding boxes with an incorrect aspect ratio and bounding boxes which do not meet perspective constraints are discarded. The problem with using image brightness for detection is that at higher outside temperatures, people do not necessarily appear brighter in the image than the background.

1.3.3 Classification

It is possible to completely omit the initial detection step using a pattern recognition approach. In [31], a support vector machine classifier is scanned through the image at every scale and every location. The classifier is trained on Haar wavelets calculated from a database of pedestrian and non-pedestrian color images. An improved version of this system [28] is based on a component classifier. There are multiple classifiers, each classifying a body part. The final classification result is calculated from the output of the component classifiers. In [15], a support vector machine trained on histograms of oriented gradients is scanned at every location and scale through the image. The scanning window is divided into cells. A histogram is calculated in each of the cells. The combined histograms are the input to the support vector machine classifier. In [40], a classifier trained on rectangle features is scanned through the image at every location and scale. Additionally, motion information is used for classification. Motion information is extracted by calculating differences between images in time. Adaboost [19] is used to train a classifier. The features for classification are selected from all possible rectangle features and all possible motion features. The limitation of using motion filters is that these can only be used with a static camera. In [38], a convolutional neural network is scanned through the image at every scale and location. In the convolutional neural network architecture [27], feature selection is performed simultaneously with training the neural network. The feature selection is implemented in a hidden layer with shared weights which are optimized together with the weights of the classification so that a complete classification/feature selection architecture is optimized. A drawback of these approaches is that scanning a classifier at every scale and location through the image is too slow for real-time processing. In addition, because there is always a certain misclassification rate, it would generate too high a number of false detections because many classifications are performed per frame.

An alternative to scanning a classifier through the image at every scale and location is to classify the output of the initial detection routine. In [45], the output of a stereo initial detection routine is classified with a neural network. The image gradients are the features used for classification. The regions of interest from the initial detection routine are rescaled to a fixed size for classification. In [21], the regions of interest from the initial detection routine are classified with a radial basis function network. To select negative examples for training, a bootstrapping procedure is used. The classifier is trained in an iterative way; at each iteration it is validated on a new set of negative examples. The negative examples are added to the training dataset and the classifier is retrained. In [37], the data for training the classifier is divided into mutually exclusive training clusters. Each cluster contains data from a particular pose, a particular articulation, and a particular illumination condition. The idea of clustering the training data is that reducing the variability of the training data by dividing it into clusters is more effective than training a classifier on all data. The regions of interest for classification are divided into 9 subregions. A classifier is trained multiple times for each subregion, once per training cluster. The features for classification are histograms of gradient orientations. The histograms are weighted by smoothed gradient magnitudes to achieve invariance to illumination changes. The classifier is trained using Adaboost, which selects features from all subregions. In [42], regions of interest generated by a stereo algorithm searching for the legs of a pedestrian are fed into a feed-forward time delay neural network. The input to the network consists of pixel values from multiple frames. Neurons in a higher layer are only connected to a subselection of neurons in the lower layer, called receptive fields, which makes it possible to detect specific leg poses and motion patterns.

1.3.4 Tracking

One approach to tracking is building a model of human shapes and matching this model with image data. In [2, 34], a linear shape model of a pedestrian is built from pedestrian silhouettes. A B-spline with a fixed number of equally spaced control points is fitted to the contours of each of the silhouettes. After alignment of the shapes, a mean shape and a set of modes of variation are generated with Principal Component Analysis. The Condensation algorithm [25] is used for tracking. Each sample represents a possible position of the object to be tracked. A zero-order motion model is used because no assumptions can be made about how the pedestrian or the car are moving. The observation density for weighting the samples is generated by measuring the distance to the nearest image feature along the normals of the shape model. Oriented edges discretized into eight bins are used as features for tracking. A drawback of this approach is that it cannot handle scale well. The shape model should be matched to the image at multiple resolutions for each sample. This makes it expensive and introduces aliasing problems. Also, a linear shape model cannot capture the many possible poses and orientations of a pedestrian. Work related to learning non-linear shape models can be found in for example [12].

In [43], the heads of pedestrians in fir images are tracked. The position of the pedestrian in the next frame is estimated with a Kalman filter and the measurement update is performed by calculating an exact position around the estimated position with a mean-shift method. It is not clear how scaling is handled.

In [32], a classifier is scanned through the image at each position and each resolution. The output of the classifier, which is interpreted as a probability, is used for tracking. Condensation [25] is used for the propagation of classifier outputs through time. The position and scale of a person in the image are represented by the state density in the Condensation algorithm. The classifier outputs are represented by the observation density in the Condensation algorithm. Instead of using the standard factored sampling, the posterior state density of the previous frame is directly taken as the prior for the current frame. The pedestrian detection system is run and the state density is reinforced or inhibited in the locations where the support vector machine outputs are high. To perform detection of people, the densities are thresholded. This method can effectively handle scaling. The disadvantage of this method is that unlike the two previously described tracking methods, it cannot be used without running the whole detection system first.

1.4 Selected approach

The detection system described in this document consists of four components: an image pre-processing component, an initial detection component, a classification component, and a tracking component.

1.4.1 Image preprocessing

The image pre-processing component performs the calculation of image features for the initial detection component, classification component, and tracking component. It also contains methods for image enhancement like smoothing, contrast stretching, and histogram equalization.

1.4.2 Initial detection

The initial detection component selects regions of interest in the image which may contain pedestrians. So unlike systems that scan a classifier through the whole image at each scale and location ([31], [15], and [40]), processing is limited to certain regions in the image. It is desirable to have an initial detection routine because otherwise too much processing time is spent on uninteresting parts of the image. Also, there is always a certain false positive classification rate. The larger the number of objects that are classified per frame, the larger the number of false positive classifications per frame. For a production system, the false positive rate should be as low as possible.

Motion information is not used, because it cannot be assumed that a pedestrian is moving in the scene. The assumption is made that a single pedestrian can be segmented from the scene. The methods described here are not specifically designed for the detection of pedestrians that are overlapping each other, although this may work.

In this work, three different initial detection routines are used. Two initial detection routines are gradient based, the other is region based. The gradient based initial detection routines calculate the vertical gradients from the intensity image. Image locations with high vertical gradient response are regions of interest. The first routine calculates the vertical gradients of the image and combines vertical structures into regions of interest. The second routine scans a vertical edge detector through the image at every scale and location. The filter size of the edge detector is made dependent on the size of the scan window. This makes the initial detection routine suitable for detecting objects at all scales. These routines are mainly interesting for fir images. In grayscale images, there is usually so much gradient response from background structures that it is difficult to segment pedestrians based on gradient information alone.

The region based initial detection routine operates directly on the intensity values. This routine combines regions with similar intensity values. For fir images, human body temperature makes pedestrians appear bright in the image, at least at low outside temperatures. In fir images, bright regions with a vertical shape are regions of interest.

1.4.3 Classification

The classification of pedestrians is challenging because of the high variability among objects in the pedestrian class. Also, there are many pedestrian-like objects in traffic scenes, for example trees, poles, parts of buildings, and parts of cars. Therefore, an advanced classification component is required. Because many objects per frame may be classified, the classifier should also be very efficient. In this work, support vector machines and neural networks are used for classification.

An important choice is the type of image features used for classification. In this work, several features are tested both on video sequences and fir sequences. The features that are used here are rectangle features [39], histograms of image gradients, and orientations from the image gradients. The reason for using rectangle features is that they can be calculated very efficiently, which makes them very suitable for a real-time system. The histograms of orientations from the image gradients are used because they are invariant to small translations inside the regions from which the histograms are calculated.

For a real system, the classification performance should be nearly perfect. Especially, the false positive rate (false alarms) should be practically zero. In practice, this is difficult to achieve. A classifier trained on a complete set of image features usually has a too high misclassification rate to be useful in a real system. Therefore, in this work, the classification system is optimized with different feature selection methods. The idea is that not all features are stable for classification. Selecting the subset of stable features from all features and using only the stable features for classification improves classification performance.

To improve classification performance even more, the training data is divided into subdatasets. The training data is divided based on pedestrian orientation, pedestrian size, and environment conditions like outside temperature.

Classification speed is also important for a real system. In order to improve the classification speed of support vector machines, the number of support vectors is reduced using the method from [36], and with a multi-objective optimization.

1.4.4 Tracking

Reliable tracking of pedestrians is also challenging: the contrast between the pedestrian and the background is often low, the resolution of pedestrians in traffic scenes is often low, there are many background objects that resemble pedestrians, due to the movement of the car pedestrians often scale strongly within a few frames, multiple pedestrians may occlude each other, and pedestrians usually change shape because they move. In this work, five tracking methods are applied for tracking pedestrians. The first tracker is based on the Hausdorff distance. Once a pedestrian is detected in a certain frame, a model is created from the gradient image calculated in the region of interest which contains the pedestrian. In the next frame, the tracker searches in the area around the previous location for the position with the smallest Hausdorff distance to the model. The second tracker is a mean shift tracker. A model is created for tracking by calculating a histogram of image features of the detected pedestrian. In the next frame, the position containing the pedestrian is determined by minimizing the difference between a target histogram and the model histogram. The third tracker is a Condensation tracker. The tracker is based on propagating a state density of tracked objects over time. The fourth tracker is a tracker that integrates initial detections through time. This tracker is based on the assumption that a pedestrian which is detected in a certain frame will be detected at approximately the same location in the next frame. The fifth tracker is a tracker which integrates classifications through time. This tracker is based on the assumption that a pedestrian which is classified in a certain frame will be classified at approximately the same location in the next frame.

1.5 Contributions

This work has resulted in a pedestrian detection system for grayscale and fir images. The two main goals were to develop a real-time detection system which operates with a high detection rate under various environment conditions. The fir version of the system runs easily at 20 frames per second on a 1 GHz processor. Both the grayscale and fir detection systems are strongly optimized for a high detection rate.

Three new real-time initial detection routines were developed: two gradient based routines and one region based routine. A systematic overview of the performance of these detection routines is presented on a variety of conditions by comparing the results of the detection routines to ground truth data.

As mentioned in section 1.3.3, several types of classifiers and image features are described in the pedestrian detection literature. It is not clear, however, which classifier/image feature combination works best. A systematic overview of the performance of support vector machines and neural networks in combination with multiple feature types is presented in this thesis. In addition, to improve classification performance, different classifiers are trained for different temperature ranges (in the case of fir images) and different classifiers are trained for different appearances of pedestrians, based on orientation. Also, to improve classification performance, feature selection is performed with PCA, Adaboost, and multi-objective optimization. In order to achieve real-time classification speed, the number of support vectors of a support vector machine is reduced. An overview of all classification results for different classifiers trained on different image features is presented in this work.

Five tracking methods are evaluated on data of pedestrians. The first is based on the Hausdorff distance. The second is a mean shift tracker. The third is a Condensation tracker. The fourth is based on integrating initial detections through time. And the fifth is based on integrating classification outputs through time. These trackers can be used to stabilize detections over time and, for example, make it possible to estimate the time of contact with the pedestrian. An overview of tracking performance of all trackers on fir images and grayscale images is presented.

1.6 Outline of this thesis

The rest of this thesis is structured in the following way: chapter two contains a description of the initial detection routines and the image preprocessing methods they use. Chapter three describes the methods used for classification and optimization of the classification. Chapter four describes the methods used for tracking. Chapter five contains the experimental results. Chapter six contains the discussion and conclusion.

Chapter 2
Initial detection
The goal of the initial detection is to find regions of interest in the image. It can be seen as a focus of attention mechanism which limits processing to interesting parts of the image. These regions of interest can contain pedestrians or other objects. It is desirable to have an initial detection routine in contrast to scanning a classifier through the whole image at all locations and scales for the following reasons:

- There is always an unavoidable number of misclassifications. By classifying only the output of the initial detection routine instead of the whole image, the number of false detections is reduced.

- Limiting processing to interesting image regions saves calculation time.

It appears that it is necessary to have different initial detection routines for different image types and different conditions. For example, it may be necessary to have a different initial detection routine for color/grayscale images than for fir images. This section describes three initial detection routines: two gradient based initial detection routines which use the image gradients for generating initial detections, and a region based initial detection routine which operates on the intensity values directly.


An important rate for a detection system is the rate of positive
examples detected as positive by the system. Another important
rate is the rate of negative examples detection as positive by the
system. In the literature, it seems that inconsistent terminology
exists for these rates.

In this work, these rates are called true

positive rate and false positive rate, respectively.

2.1 Image pre-processing

2.1.1 Image enhancement

Before the detection algorithms are applied to an image, it is usually preprocessed to enhance its quality. To enhance the contrast in the image, contrast stretching can be applied. Usually, before the image features are calculated, smoothing is performed to remove image artifacts.

Contrast stretching

Contrast stretching enhances an image by stretching the range of intensity values to the maximum possible range. It does this by applying a linear scaling to the image. The pixel with minimum intensity in the image is set to the lowest possible value, the pixel with maximum intensity in the image is set to the highest possible value, and the other pixels are interpolated between the lowest and highest possible values.

To perform contrast stretching, the minimum intensity value min and the maximum intensity value max are calculated from the input image. The stretched image j can be calculated from the original image i using

j(x, y) = lower                                          if i(x, y) = min
j(x, y) = upper * (i(x, y) - min) / (max - min)          if min < i(x, y) < max
j(x, y) = upper                                          if i(x, y) = max

where x and y are image coordinates, lower is the lowest possible intensity value, and upper is the highest possible intensity value. To reduce the effect of outliers, min and max can be selected at the pixel values at for example 3% and 97% of the image histogram, respectively.
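A minimal numpy sketch of this stretching, using the percentile-based min and max mentioned above; the generalization to an arbitrary lower bound is an assumption:

```python
import numpy as np

def contrast_stretch(image, lower=0, upper=255, low_pct=3, high_pct=97):
    """Linearly stretch intensities to [lower, upper]; min/max are taken at
    percentiles of the histogram to reduce the effect of outliers."""
    img = image.astype(np.float32)
    vmin = np.percentile(img, low_pct)
    vmax = np.percentile(img, high_pct)
    stretched = lower + (upper - lower) * (img - vmin) / max(vmax - vmin, 1e-6)
    return np.clip(stretched, lower, upper).astype(np.uint8)
```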

Smoothing

A convolution kernel for smoothing is usually constructed with the 2D Gaussian

G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}.

For performance reasons, in this work smoothing is performed with two 1-dimensional approximations to a Gaussian:

G(x) = {0.25, 0.5, 0.25}  and  G(y) = {0.25, 0.5, 0.25}^T

for a kernel size of three, and

G(x) = {0.0625, 0.25, 0.375, 0.25, 0.0625}  and  G(y) = {0.0625, 0.25, 0.375, 0.25, 0.0625}^T

for a kernel size of five. The floating point values in the kernels are selected in a way that a convolution with a kernel can be performed efficiently with integer math. For example, G(x) = {0.25, 0.5, 0.25} can be efficiently calculated as G(x) = {1, 2, 1} shifted to the right by two bits.
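A sketch of the separable {1, 2, 1} smoothing with integer math, assuming numpy; borders are handled here by wrap-around for brevity, which is an implementation shortcut:

```python
import numpy as np

def smooth_3x3(image):
    """Separable {1, 2, 1} smoothing using integer math; the division by 4
    per pass is done with a right shift by two bits."""
    img = image.astype(np.uint32)
    # horizontal pass: (1*left + 2*center + 1*right) >> 2
    h = (np.roll(img, 1, axis=1) + 2 * img + np.roll(img, -1, axis=1)) >> 2
    # vertical pass with the same kernel
    v = (np.roll(h, 1, axis=0) + 2 * h + np.roll(h, -1, axis=0)) >> 2
    return v.astype(image.dtype)
```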

2.1.2 Image features


Rectangle features
The rectangle features applied here are the same as those used in [39] for real-time face detection. Figure 2.1 shows an example of the filters to calculate these features. The sum of the pixels in the white region is subtracted from the sum of the pixels in the dark region. The filter in figure 2.1(a) gives a high output at a vertical edge, the filter in figure 2.1(b) gives a high output at a horizontal edge, and the filter in figure 2.1(c) gives a high output at a diagonal edge. The motivation for using these features is that they are extremely fast to calculate. They can be calculated in constant time regardless of their size using an integral image representation, related to summed area tables from texture mapping in computer graphics. The integral image at location x, y is the sum of the pixels above and to the left of x, y:

ii(x, y) = \sum_{x' \le x, \, y' \le y} i(x', y')

where ii(x, y) is the integral image and i(x, y) is the original image. The integral image can be calculated using the following formulas:

s(x, y) = s(x, y - 1) + i(x, y)
ii(x, y) = ii(x - 1, y) + s(x, y)

where s(x, y) is the cumulative column sum, s(x, -1) = 0, and ii(-1, y) = 0. Using the integral image, any rectangular sum can be calculated in only four array references. An example calculation is shown in figure 2.2. The area of rectangle D can be calculated with (ii(4) + ii(1)) - (ii(2) + ii(3)), where the values 1, 2, 3, and 4 are coordinates of the integral image. A sketch of these calculations is given after figure 2.2.

Figure 2.1: Rectangle filters.


Figure 2.2: Calculation of the rectangle sum.
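A sketch of the integral image and the four-reference rectangle sum described above, assuming numpy arrays indexed as [y, x]; the two-rectangle feature at the end is an illustrative example:

```python
import numpy as np

def integral_image(image):
    """ii(x, y): sum of all pixels above and to the left of (x, y), inclusive."""
    return image.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x0, y0, x1, y1):
    """Sum of pixels in the rectangle [x0, x1] x [y0, y1] from four references."""
    total = ii[y1, x1]
    if x0 > 0:
        total -= ii[y1, x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1, x1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total

def vertical_edge_feature(ii, x, y, w, h):
    """Two-rectangle feature: left (white) sum minus right (dark) sum."""
    left = rect_sum(ii, x, y, x + w // 2 - 1, y + h - 1)
    right = rect_sum(ii, x + w // 2, y, x + w - 1, y + h - 1)
    return left - right
```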

Gradients

The image gradients are calculated by convolving the smoothed image with two 1-dimensional kernels. For a filter size of 5, the horizontal gradient image g_x is calculated with the following Sobel-like kernel:

g_x = {1, 2, 0, -2, -1}^T.

The vertical gradient image g_y is calculated with the following kernel:

g_y = {1, 2, 0, -2, -1}.

An approximation to the energy image E is calculated from the horizontal gradient image and the vertical gradient image:

E = |g_x| + |g_y|.

The gradient images and energy images can be used for edge detection. Also, the angle of orientation \theta of the gradient is calculated from g_x and g_y:

\theta = \arctan\left(\frac{g_y}{g_x}\right).

An example of an image, its vertical gradients, its horizontal gradients, and its energy is shown in figure 2.3. A sketch of this calculation is given after figure 2.3.

(a) A fir image  (b) Its vertical gradient image  (c) Its horizontal gradient image  (d) Its energy image

Figure 2.3: Image gradients and energy.
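A sketch of the gradient, energy, and orientation calculation with the kernel above, assuming numpy; which image axis corresponds to g_x and which to g_y is an assumption of this sketch:

```python
import numpy as np

def gradients_and_energy(smoothed):
    """Convolve with the 1-D kernel {1, 2, 0, -2, -1} along each axis and
    form the energy approximation E = |gx| + |gy|."""
    kernel = np.array([1, 2, 0, -2, -1], dtype=np.float32)
    img = smoothed.astype(np.float32)
    # apply the kernel along columns (gx) and along rows (gy)
    gx = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, img)
    gy = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, img)
    energy = np.abs(gx) + np.abs(gy)
    orientation = np.arctan2(gy, gx)  # angle of the gradient, arctan(gy / gx)
    return gx, gy, energy, orientation
```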

2.2 Gradient based segmentation

Two different gradient based initial detection routines are described in this section. The first is based on clustering vertical gradients. The second is based on scanning a vertical edge detector through the image at every location and scale.

2.2.1 Grouping vertical gradients

The initial detection routine that groups vertical gradients works as follows: the vertical gradient image is calculated with the convolution kernel from section 2.1.2. A threshold is applied to create a binary image where the regions with a high gradient magnitude are foreground pixels. In the case of fir images, the threshold is calculated from the image histogram of the vertical gradient image. The intensity value at 95% in the cumulative image histogram of the vertical gradient image is selected as the threshold. The threshold is selected at 95% because this ensures that only image regions with strong vertical gradients are considered for further processing. In the case of color/grayscale images, the threshold is set to a fixed value. This threshold is calculated in the same way as for infrared images: an image histogram is calculated and the threshold is selected at the intensity value at 95% in the cumulative image histogram. Using the connected components algorithm described in algorithm 1, the foreground pixels in the binary image are clustered into regions. Regions with a width to height ratio from 0.5 to 1.5 (width divided by height) may contain a pedestrian and are regions of interest. The values of 0.5 and 1.5 are selected based on measuring width to height ratios from different front view and side view pedestrian images. A sketch of the threshold selection is given below.
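A minimal sketch of the 95% cumulative-histogram threshold, assuming numpy; the number of histogram bins is an illustrative choice:

```python
import numpy as np

def gradient_threshold(vertical_gradients, fraction=0.95):
    """Select the threshold at 95% of the cumulative histogram of absolute
    vertical gradient magnitudes, then binarize."""
    mag = np.abs(vertical_gradients)
    hist, bin_edges = np.histogram(mag, bins=256)
    cumulative = np.cumsum(hist) / mag.size
    threshold = bin_edges[np.searchsorted(cumulative, fraction)]
    return mag > threshold, threshold
```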


Algorithm 1 Connected components labeling.

Input: A binary image.
Output: An image where each foreground region is assigned a unique label.
1. Repeat steps 2 and 3 for each pixel in the image.
2. Starting at the first pixel: scan the image row by row until a foreground pixel is found.
3. Consider all adjacent 8-connected pixels to the current pixel (west pixel, south pixel, south west pixel and south east pixel) which were already visited:
(a) If all of the visited pixels are background pixels, assign a new label to the current pixel.
(b) If one or more of the visited pixels are foreground pixels of the same group, assign the label of that pixel to the current pixel.
(c) If multiple visited pixels belong to a different group, assign one of the labels of those pixels to the current pixel and mark that the different labels belong to the same group.
4. In an additional pass over the whole image, assign a unique group label to the pixels of groups which were marked as equal.
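A minimal two-pass implementation of algorithm 1 in Python, assuming 8-connectivity and a row-by-row scan from the top, so that the already-visited neighbours are the three pixels in the row above plus the pixel to the west; the union-find helper is an implementation detail not spelled out in the algorithm:

import numpy as np

def connected_components(binary):
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=np.int32)
    parent = [0]                      # union-find forest over labels

    def find(a):                      # follow parents up to the root label
        while parent[a] != a:
            a = parent[a]
        return a

    next_label = 1
    for y in range(h):                # pass 1: provisional labels (steps 1-3)
        for x in range(w):
            if not binary[y, x]:
                continue
            neigh = [labels[ny, nx]
                     for ny, nx in ((y-1, x-1), (y-1, x), (y-1, x+1), (y, x-1))
                     if ny >= 0 and 0 <= nx < w and labels[ny, nx] > 0]
            if not neigh:             # step 3(a): no labelled neighbour
                parent.append(next_label)
                labels[y, x] = next_label
                next_label += 1
            else:                     # steps 3(b)/(c): reuse and merge labels
                labels[y, x] = min(neigh)
                for n in neigh:
                    parent[find(n)] = find(min(neigh))
    for y in range(h):                # pass 2: resolve equivalences (step 4)
        for x in range(w):
            if labels[y, x]:
                labels[y, x] = find(labels[y, x])
    return labels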

An example of this method for fir images is shown in figure 2.4. Figure 2.4(a) shows the input image, figure 2.4(b) shows the vertical gradient image, figure 2.4(c) shows the binary image, and figure 2.4(d) shows the regions of interest. An example of this method for grayscale images is shown in figure 2.5. Figure 2.5(a) shows the input image, figure 2.5(b) shows the vertical gradients, figure 2.5(c) shows the binary image, and figure 2.5(d) shows the regions of interest.
In the case of fir images, there is usually a transition from dark (background) to light (region of interest) to dark (background). So additionally, the sign of the gradient is used to discard regions of interest which cannot contain a pedestrian.


(a) A fir image
(b) Its vertical gradients
(c) Its binary image
(d) The initial detections

Figure 2.4: Grouping vertical gradients in a fir image.

An example of the use of the sign of the gradients is shown in figure 2.6. The gray regions contain negative gradients, the white regions contain positive gradients. Only regions of interest consisting of a positive gradient region followed by a negative gradient region (from left to right) are accepted.


(a) A grayscale image
(b) Its vertical gradient image
(c) Its binary image
(d) The initial detections

Figure 2.5: Grouping vertical gradients in a grayscale image.

Figure 2.6: Use of the gradient sign for initial detection.


Often, in the case of fir images, the feet and the head of pedestrians are brighter than the upper body because of the insulation of the coat. These upper body regions do not generate strong gradient magnitudes. Therefore, once a region of interest is generated, a search along a vertical line is performed upwards for another region of interest. This makes it possible to combine the feet with the head. When another region of interest is found, the two regions are combined and it is tested whether the combined region matches the width to height ratio of a pedestrian. An example of this is shown in figure 2.7. In figure 2.7(b), the complete pedestrian cannot be segmented as one region. After the combination of the feet with the head, the pedestrian is segmented as one object (figure 2.7(c)). The complete initial detection method is described in algorithm 2.


(a) A fir image
(b) The binary image
(c) The initial detections

Figure 2.7: Combining regions of interest.


Algorithm 2 Grouping vertical gradients.

Input: An image.
Output: A list of initial detections.
1. Perform smoothing on the input image.
2. Calculate the vertical gradients from the smoothed image.
3. Calculate an image histogram from the vertical gradients image and select the intensity value at 95% of the histogram as threshold.
4. Create a binary image from the vertical gradient image using the threshold. Foreground pixels are those pixels whose absolute gradient magnitude is larger than the threshold.
5. Cluster the foreground pixels into regions using algorithm 1.
6. Repeat steps 7 and 8 for each foreground region.
7. If the width to height ratio of the region matches the width to height ratio of a pedestrian, add the coordinates of the region to the output list. Additionally, in fir images, if the sign of the gradients of the region does not match the sign of a warm object against a cold background, discard the region.
8. Search for another region above the current region. If the width to height ratio of the combined region matches the width to height ratio of a pedestrian, add the coordinates of the combined region to the output list.
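Step 3, selecting the intensity value at 95% of the cumulative image histogram, is small enough to show in full; a sketch assuming 8-bit magnitude values (the same routine serves the temperature segmentation of section 2.3):

import numpy as np

def cumulative_histogram_threshold(values, fraction=0.95):
    hist, edges = np.histogram(values, bins=256, range=(0, 256))
    cum = np.cumsum(hist) / values.size
    # first intensity value whose cumulative frequency reaches the fraction
    return edges[np.searchsorted(cum, fraction)]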

2.2.2 Scanning a vertical edge detector through the whole image
A different approach to finding vertical structures in the image is to scan a search window through the image at every position and scale. In practice, it is sufficient to scan through a certain range of the image, because there are positions where there cannot be any pedestrians due to perspective constraints. For example, at the top of the image there is usually sky, so there is no need to search there. The search window consists of two or four edge detectors which search for vertical structures. An example of a search window is shown in figure 2.8. In order to make the detection results invariant to the size of the objects in the image, the size of the filters of the edge detectors is adapted to the size of the search window. The advantage of this method over the initial detection routine which groups vertical features is that when a pedestrian is located under another object with vertical structure (traffic sign, tree, building), it can still be detected correctly in the search window in which only the pedestrian fits. The vertical feature grouping initial detection routine would only find the pedestrian clustered together with the other object. Some experimentation is necessary to find a threshold value for the edge detector. In general a lower threshold is selected in grayscale images, in which there is not much contrast between the pedestrian and the background. In fir images, a higher threshold can be selected because of the higher contrast between pedestrians and background. The complete method is described in algorithm 3.

Figure 2.8: A scanning window.


Algorithm 3 Scanning a vertical edge detector.

Input: An image.
Output: A list of initial detections.
1. Select a threshold dependent on the image type (grayscale or fir).
2. Repeat steps 3 and 4 for every scale and image position.
3. In the scanning window, calculate the vertical edge response with four vertical filters: two in the left half of the scanning window and two in the right half.
4. If all four responses are larger than the threshold, add this scan window to the output list.
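A sketch of algorithm 3 on top of the integral-image helpers sketched in section 2.1.2; the filter geometry (two vertical edge filters per window half, each a difference of two adjacent columns) follows figure 2.8, but the exact offsets and widths are illustrative assumptions:

def vertical_edge_responses(ii, x, y, w, h):
    c = max(w // 8, 1)                      # filter size adapted to the window
    def edge(x0):                           # vertical edge: column difference
        return abs(rect_sum(ii, x0, y, c, h) - rect_sum(ii, x0 + c, y, c, h))
    # two detectors in the left half and two in the right half of the window
    return [edge(x), edge(x + w // 4), edge(x + w // 2), edge(x + 3 * w // 4)]

def scan_edge_detector(ii, img_w, img_h, scales, step, threshold):
    detections = []
    for (w, h) in scales:                   # every scale ...
        for y in range(0, img_h - h, step): # ... and every image position
            for x in range(0, img_w - w, step):
                if all(r > threshold
                       for r in vertical_edge_responses(ii, x, y, w, h)):
                    detections.append((x, y, w, h))
    return detections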

This initial detection routine can be implemented very efficiently using the integral image described in section 2.1.2. An example of the output from this initial detection routine on fir images is shown in figure 2.9. An example of the output from this initial detection routine on grayscale images is shown in figure 2.10. The regions of interest are marked with orange rectangles.

(a) A fir image
(b) The initial detections

Figure 2.9: Scanning an edge detector through the image at every position and scale.


(a) A grayscale image
(b) The initial detections

Figure 2.10: Scanning an edge detector through the image at every position and scale.

2.3 Temperature segmentation in fir images

Region based segmentation groups pixels with similar intensity values into regions. In fir images, humans appear brighter in the image than most other objects because of their body temperature. Temperature is therefore a property that can be used for segmenting objects from the image. Temperature segmentation is a method that is often used for the initial detection of pedestrians in fir images [30], [43], [4]. In this work, a related method is applied. A threshold is calculated from the outside temperature using the image histogram of intensity values (as described in section 2.2.1). The threshold is selected at 95% of the cumulative image histogram. A high threshold is calculated for high outside temperatures and a low threshold is calculated for low outside temperatures. With the threshold, a binary image is calculated. Pixels with an intensity value higher than the threshold are selected as foreground regions. The other pixels are selected as background regions. The foreground pixels in the binary image are warm areas and the background pixels are cold areas. The foreground areas are clustered with algorithm 1. Clusters with the correct width to height ratio for a pedestrian (as described in section 2.2.1) are selected as regions of interest. An example of this method is shown in figure 2.11. Figure 2.11(a) shows the input fir image, figure 2.11(b) shows the binary image, and figure 2.11(c) shows the regions of interest marked with orange rectangles.

(a) A fir image
(b) The binary image
(c) The initial detections

Figure 2.11: Temperature segmentation in fir images.

A problem with using image brightness for initial detection of pedestrians is that pedestrians are often not uniformly bright. Usually, the head and the feet of the pedestrian are much brighter than the body because of the insulation of the coat. An improvement of this method creates additional regions of interest by trying to combine each region of interest with a higher located region of interest. In this way, the region containing the feet is combined with the region containing the head. An example of this improved method is shown in figure 2.12. Figure 2.12(a) shows the fir image, figure 2.12(b) shows the binary image, and figure 2.12(c) shows an initial detection consisting of two regions. The complete algorithm is described in algorithm 4.

(a) A fir image
(b) Its binary image
(c) The initial detections

Figure 2.12: Combining regions of interest.


Algorithm 4 Temperature segmentation in fir images.

Input: A fir image.
Output: A list of regions of interest.
1. Select an intensity threshold at 95% of the cumulative image histogram.
2. Create a binary image from the fir image using the threshold. Pixels with an intensity value higher than the threshold are foreground pixels.
3. Cluster the foreground pixels in the binary image into regions using algorithm 1.
4. Repeat steps 5 and 6 for each of the foreground regions.
5. If the width to height ratio of the region matches the width to height ratio of a pedestrian (as described in section 2.2.1), add the coordinates of the region to the output list.
6. Search for another region directly above the current region. If the width to height ratio of the combined region matches the width to height ratio of a pedestrian, add the coordinates of the combined region to the list.

Chapter 3
Classification

The purpose of classification is to determine which of the regions of interest, provided for example by the initial detection routine, contain pedestrians and which do not. The classification consists of the following steps:

- the generation of representative, mutually exclusive training examples and validation examples of positive examples (pedestrians) and negative examples (non-pedestrians),
- the extraction of features from the training and validation examples,
- the training of the classifier with the training features,
- the evaluation of the generalization performance of the classifier on the validation examples.

3.1 Classifiers

The classifier is the learning algorithm which is taught how to separate between two classes of input: features from positive examples and features from negative examples. The classifier also performs the forward propagation: the assignment of a class label (positive or negative) to an unseen example. In this work, the classification is always a two-class problem. The classifier learns to separate between two classes of objects, pedestrians and non-pedestrians. So, the output of the classifier is always interpreted as a binary value, pedestrian or non-pedestrian. Two types of classifiers are applied in this work: neural networks and support vector machines.

3.1.1 Neural networks

A neural network is a set of processing units which communicate through weighted connections. The processing units (neurons) are divided up into layers. A neural network has

- one input layer which receives input from outside the network,
- zero, one or more hidden layers which receive input from the input layer and from each other (in the case that there are multiple hidden layers),
- an output layer which receives input from the hidden layer(s).

The output value y_k of a neuron k in the network at time step t is calculated in the following way

y_k(t) = F_k( Σ_j w_{jk}(t−1) y_j(t−1) + θ_k(t) ),

where w_{jk} is the weighted connection from neuron j to neuron k, θ_k(t) is the bias of neuron k, and F_k is the activation function of neuron k. In this work, a linear activation function is used for the output neurons. A sigmoidal activation function is used for the hidden neurons:

F(s_k) = 1 / (1 + e^{−s_k}).

The neural network used in this work is a three layer feedforward neural network, where the propagation through the network is only forward. The network is not fully connected: neurons only receive input from neurons in the previous layer. The number of hidden neurons varies depending on the problem. The number of neurons in the output layer is one. The number of input neurons is equal to the number of features of the data.
During training, the weights and biases of the connections in the network are updated so that the output of the network y^p is as close as possible to the target value d^p for all examples p. This is done by minimizing an error function with gradient descent. The error function E is defined as

E = (1/2) Σ_p (d^p − y^p)^2.

For a neural network with at least one hidden layer, the error is minimized with error back-propagation. The change in a weight is proportional to the negative of the derivative of the error:

Δ_p w_j = −γ ∂E^p/∂w_j,

where p is the current training example, γ is the learning rate, and j is the index of the weight. This can be rewritten as

Δ_p w_{jk} = −γ (∂E^p/∂s^p_k) y^p_j,

where s^p_k is the input to neuron k. For an output neuron o it holds that

∂E^p/∂s^p_o = −(d^p_o − y^p_o) F'_o(s^p_o),

and for a hidden unit h it holds that

∂E^p/∂s^p_h = −F'(s^p_h) Σ_{o=1..N_o} (d^p_o − y^p_o) F'_o(s^p_o) w_{ho}.
For more information about neural networks, see for example [6].
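As an illustration of the forward propagation described above, a minimal three-layer feedforward pass in Python/NumPy, with sigmoidal hidden neurons and a single linear output neuron as used in this work; the weight shapes and names are illustrative:

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))             # F(s) = 1 / (1 + e^(-s))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    # x: feature vector; w_hidden: (n_hidden, n_inputs); w_out: (n_hidden,)
    hidden = sigmoid(w_hidden @ x + b_hidden)   # sigmoidal hidden layer
    return float(w_out @ hidden + b_out)        # linear output neuron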

3.1.2 Support vector machines

The basic idea of support vector machines is to map the training data into a high dimensional feature space F and to calculate a separating hyperplane in that feature space. A separating function f: R^N → {±1} is calculated from the training examples (x_1, y_1), ..., (x_l, y_l) ∈ R^N × {±1}. By using a mapping Φ: R^N → F it is not necessary to work in the high dimensional space F, because there exists a kernel k(x, x') for which holds: k(x, x') = (Φ(x) · Φ(x')). Training a support vector machine consists of calculating a hyperplane w · x − b = 0 which maximally separates the training data in F. This means minimizing ||w||^2 subject to y_i(w · x_i + b) ≥ 1. This gives the quadratic optimization problem

minimize (1/2) ||w||^2

with respect to w, b, subject to y_i((w · Φ(x_i)) + b) ≥ 1. Introducing the Lagrange parameters α_i gives the Lagrangian

(1/2) ||w||^2 − Σ_{i=1..l} α_i ( y_i((w · Φ(x_i)) + b) − 1 ),

CHAPTER 3.

56

CLASSIFICATION

which should be minimized with respect to


maximized with respect to

where

and should be

i .

The resulting separation function is

b),

w, b

P
f (x) = sign( li=1 i yi k(x, xi )+

is the number of support vectors(the training data

points which dene the separation plane), and

i are the Lagrange

multipliers.
Two examples of kernels for classication are: a radial basis function kernel

k(x, y) = exp( kx yk2 )


where

is the width of the Gaussian and a polynomial kernel

k(x, y) = (x y)d
where

is the degree of the polynomial.

For more information

about support vector machines see for example [13].
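The resulting decision function with a radial basis function kernel, as a Python/NumPy sketch; support_vectors, alphas, labels, gamma and b are assumed to come from a trained machine:

import numpy as np

def rbf_kernel(x, y, gamma):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def svm_decide(x, support_vectors, alphas, labels, b, gamma):
    # f(x) = sign( sum_i alpha_i y_i k(x, x_i) + b )
    s = sum(a * y * rbf_kernel(x, sv, gamma)
            for a, y, sv in zip(alphas, labels, support_vectors))
    return np.sign(s + b)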

3.2 Image features for classification

An important choice is the type of image features that are used for classification. The type of image features influences the classification performance and the forward propagation speed. Before the features are calculated, the regions of interest are rescaled to a fixed size so that the number of features is constant in each region of interest. For an image size of 320x240 pixels, the window size used is 24x48 pixels. This window size is selected because it is large enough to calculate image features from (as described in section 2.1.2). On the other hand it is small enough to avoid a very large input space to the classifier.


3.2.1 Rectangle features

The rectangle features are the features as described in section 2.1.2. Two different scales of filters are used to capture different levels of detail: a filter size of four by four pixels and a filter size of eight by eight pixels. Both filters have an overlap of 2 pixels in the horizontal and vertical direction. An example of the application of the filters on pedestrian images is shown in figure 3.1. For the classification of pedestrians, the filter outputs are normalized from 0.1 to 0.9 for neural network classification and from -1.0 to 1.0 for support vector machine classification.

(a) A grayscale image
(b) Its vertical gradients

Figure 3.1: Rectangle features for classification.

3.2.2 Histograms of gradients and orientations features

The gradients and orientations features are those from section 2.1.2. First the gradient image and orientations image are calculated with a filter size of 5 pixels. This filter size is selected because of the low image resolution of 320x240 pixels; for a higher resolution image a larger filter size would be selected. An example of the gradient calculation was shown in figure 2.3. The region of interest is divided into m by n fields. An example of this is shown in figure 3.2 for m is 4 and n is 8. In each of these regions, a histogram of 8 bins is calculated of both the gradient magnitudes and the orientations of the gradient. The index of the bin in the histogram is calculated by linearly scaling the gradient magnitude and orientation between their minimum and maximum values. The histograms of all regions are concatenated and are normalized from 0.1 to 0.9 for neural network classification and from -1.0 to 1.0 for support vector machine classification.
Both the gradient magnitudes and gradient orientations are used as features for classification. In this way, the information provided by the strength of the gradient as well as the information provided by its direction are used.

Figure 3.2: Calculation of image features in subregions.
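A sketch of the histogram feature extraction over the m by n subregions, assuming the magnitude and orientation images of section 2.1.2; the bin count and normalization range follow the description above, while the per-patch bin edges implement the linear scaling between minimum and maximum:

import numpy as np

def subregion_histograms(magnitude, orientation, m=4, n=8, bins=8):
    h, w = magnitude.shape
    feats = []
    for row in range(n):                        # n fields vertically
        for col in range(m):                    # m fields horizontally
            ys = slice(row * h // n, (row + 1) * h // n)
            xs = slice(col * w // m, (col + 1) * w // m)
            for img in (magnitude, orientation):
                patch = img[ys, xs]
                hist, _ = np.histogram(
                    patch, bins=bins,
                    range=(patch.min(), patch.max() + 1e-9))
                feats.append(hist / patch.size)
    v = np.concatenate(feats)                   # concatenate all histograms
    # rescale to [0.1, 0.9] for neural network classification
    return 0.1 + 0.8 * (v - v.min()) / (v.max() - v.min() + 1e-9)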

3.3 Training and validation of the classifier

3.3.1 Creating training datasets

The classifier is trained with positive examples of pedestrian images and negative examples of non-pedestrian images. The initial detection routines and a bootstrapping method (section 3.4.1) are used to generate training data. So for example, a classifier is trained on data from a particular initial detection routine. The reason for using the initial detection routines for generating training data is that in this way, the classes of positive and negative examples are well defined. In particular, the class of non-pedestrians is in principle infinitely large. By using the output from the initial detection routines, the negative examples are limited to those that are generated by the initial detection routine. For each initial detection routine, a separate classifier is trained. This is done because different initial detection routines generate different positive and negative examples.
The training data is created by manually marking the pedestrians in image sequences. This is done by drawing a rectangle around a pedestrian in each frame. This results in a set of ground truth data. The initial detection routines are applied to the image sequences and the output of the initial detection routines is compared to the ground truth data. When an initial detection is at the same coordinates as a labeled pedestrian in the ground truth data, it is stored as a positive example in the training data. Otherwise it is stored as a negative example. Separate labels are used for pedestrians in front/rear views and pedestrians in side views to reduce the amount of inner class variability. So a separate classifier is trained for back/front views and side views. The idea is that the classification is improved when the classes of objects are better defined.
The complete set of image data is divided into three approximately equally large, mutually exclusive subdatasets. The first subset is the training data, which is used for training the classifier. The second subset is the validation dataset, which is used to prevent overfitting during training of the classifier. The third dataset is the test dataset, which is used to evaluate the performance of an optimized classifier. There are always three times as many negative examples as there are positive examples in the training and validation datasets. This results in classifiers with a low false positive rate. For a real system, false detections are unacceptable, so the classifier is trained in a way to have the lowest false positive rate possible.
Once the available data has been divided into subdatasets, the image features are calculated and stored as feature vectors. For each of the initial detection routines, feature vectors are generated.

3.3.2 Training neural networks

For training neural networks, early stopping is used. The network is trained on the training dataset. At each training iteration, the classification error on the validation dataset is monitored. During a certain number of iterations after the training has started, both the training error and the validation error decrease. From a certain iteration on, the training error keeps decreasing but the validation error increases. From this point on, the network starts overfitting the data: it memorizes the training data and does not generalize well anymore on the validation data. The network at the training iteration where the validation error starts to increase is selected as the final network. The optimal network from the early stopping procedure is then tested on a test dataset, mutually exclusive from the training and validation datasets. After some experimentation, the number of hidden neurons is fixed at a value of 10 such that a direct comparison between the performance of different networks is possible. The value of 10 neurons was found to be the minimum required for the optimal performance of the network: decreasing the number of hidden neurons below 10 results in a reduced classification performance.

3.3.3 Training support vector machines

For training support vector machines, a grid search is used to find the optimal value for C, the trade off between the maximization of the margin of the separating hyperplane and training error minimization, and, in the case of a radial basis function kernel, to find the optimal value for γ, the width of the Gaussian in the kernel. The grid search is performed by training a support vector machine on a training dataset, and evaluating the performance of the support vector machine on a validation dataset mutually exclusive to the training dataset. The evaluation criteria are true positive rate, false positive rate, and number of support vectors. The optimal machine from the grid search is then tested on a test dataset, mutually exclusive from the training and validation datasets.
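The grid search can be stated in a few lines; a sketch in which train_svm and evaluate are placeholders for whatever support vector machine implementation is used:

import itertools

def grid_search(train_svm, evaluate, C_grid, gamma_grid):
    # train_svm(C, gamma) -> model; evaluate(model) -> (tp_rate, fp_rate, n_sv)
    best, best_key = None, None
    for C, gamma in itertools.product(C_grid, gamma_grid):
        model = train_svm(C=C, gamma=gamma)
        tp, fp, n_sv = evaluate(model)
        key = (tp, -fp, -n_sv)        # prefer high tp, low fp, few vectors
        if best_key is None or key > best_key:
            best, best_key = (C, gamma), key
    return best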

3.4 Optimization of the classification

The standard classification with neural networks and support vector machines is often not optimal. Also, in the case of support vector machines, the training generates so many support vectors that the classification is too slow for real-time pedestrian detection. Therefore, some form of optimization is necessary, both in classification performance and in classification time.

3.4.1 Bootstrapping

The class of negative training examples is usually not well defined. To generate a dataset of representative negative examples, a bootstrapping technique can be applied. This works as follows: a training dataset is generated from all positive training examples and a small number of negative examples. A classifier is trained on this data and is tested on a dataset not used for training. The false positives from the test dataset are added to the training dataset and a new classifier is trained on the training dataset. Again, this classifier is tested on another test dataset and the false positives from this dataset are added to the training examples. This procedure is repeated until a desired false positive rate is achieved.


3.4.2 Component classification

The idea of component classification is to train a classifier for each part of a pedestrian. A set of local classifiers whose classification outputs are combined may give better results than one global classifier. Figure 3.3 shows an example of the subregions. There are seven subregions in the region of interest, and a classifier is trained for each subregion. The final classification decision is made by voting the outputs of the subregion classifiers. For example, for each of the regions of figure 3.3(b), a neural network is trained. When the trained networks are applied to an unseen example, this results in seven network outputs. The network outputs are voted to get the final output: if at least four networks give a positive output, the final result is positive. Otherwise the final result is negative.

Figure 3.3: Subregions for component classification.

3.4.3 A simplified approximation to the support vector machine decision rule

The number of support vectors of a decision surface scales approximately with the number of training examples. This makes the forward propagation through a support vector machine slow compared to other classifiers such as neural networks. To improve the forward propagation speed, a reduced set of vectors can be calculated from the original set of support vectors. To calculate a reduced set of vectors which approximates the original decision surface, the algorithm from [8] is applied. From the original decision surface

Ψ = Σ_{a=1..N_s} α_a y_a Φ(s_a),

where α_a are the weights calculated during training, s_a are the N_s support vectors, and y_a ∈ {−1, 1} are the class labels of the support vectors, a reduced vector set z_k of size N_z < N_s is calculated with decision surface

Ψ' = Σ_{k=1..N_z} β_k Φ(z_k),

where β_k are the weights. This is done by minimizing the euclidean distance ρ = ||Ψ − Ψ'||. The reduced set of vectors z_k, k = 1, ..., N_z is calculated with an unconstrained conjugate gradient method.

3.5 Feature selection

Feature selection is applied to select a stable set of features for classification from the whole set of features and to speed up the forward propagation of the classifier. The smaller the dimension of the input data, the faster the forward propagation.

3.5.1 Principal component analysis

Principal component analysis transforms a set of possibly correlated variables into a smaller number of uncorrelated variables. In feature selection this means finding a linear subspace of the complete set of features. The method for generating a linear subspace of a lower dimension is the Karhunen-Loève transform, a standard technique from statistical pattern recognition. Principal component analysis on a dataset consists of the following steps:

1. Calculation of the mean of each of the dimensions in the dataset.
2. Subtraction of the mean of each dimension in the dataset such that the mean of the dataset is zero.
3. Calculation of the covariance matrix of the data.
4. Calculation of the eigenvectors and eigenvalues of the covariance matrix.
5. Selection of the eigenvectors with the highest eigenvalues as the principal components of the dataset.
6. Calculation of the reduced dataset by multiplying the matrix of eigenvectors with the original dataset.

For classification this is used as follows: principal component analysis is applied on the feature vectors of pedestrian images in the training dataset. The matrix containing the eigenvectors for transforming the training dataset into the reduced features dataset is stored. The transformation matrix is used to transform the feature vectors of the training data to a lower dimension and a classifier is trained on the reduced set of feature vectors. Before each forward propagation through the classifier, the transformation is applied to the feature vector using the stored matrix. The purpose of principal component analysis for a real-time system is to reduce the input space to the classifier. For testing the performance of pca classification, the full set of 224 input features to a neural network is reduced to 100 features and 50 features.
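The six steps above in Python/NumPy; a minimal sketch assuming the rows of X are feature vectors, not the exact implementation used for the 224-feature experiments:

import numpy as np

def pca_fit(X, n_components):
    mean = X.mean(axis=0)                    # steps 1-2: center the data
    cov = np.cov(X - mean, rowvar=False)     # step 3: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # step 4: eigen decomposition
    order = np.argsort(eigvals)[::-1]        # step 5: highest eigenvalues
    return mean, eigvecs[:, order[:n_components]]

def pca_transform(x, mean, W):
    return (x - mean) @ W                    # step 6: project to the subspace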


3.5.2 Adaboost

Adaboost [19] is a learning algorithm which constructs a strong classifier from a number of weak classifiers. Adaboost is explained in algorithm 5. Adaboost can be used for feature selection by limiting the number of iterations in the algorithm. The features which are the most discriminative are found in increasing iteration order. In this work, a single layer perceptron with a sigmoidal activation function is used as weak learner. The Adaboost algorithm is iterated until a certain classification performance is achieved or until the number of desired features is reached. Like principal component analysis, Adaboost can be used to reduce the input dimension to the classifier. In addition, it can boost the classification performance by finding the most discriminative features.

Algorithm 5 Adaboost. Algorithm from [39].

Input: A set of examples {(x_1, y_1), ..., (x_n, y_n)} with labels y_i ∈ {−1, 1}.
Output: A strong classifier h(x).
1. Initialize the weights of the training examples: w_{1,i} = 1/n, for all i.
2. Repeat steps 3 until 5 for t = 1, ..., T.
3. Normalize the weights: w_{t,i} = w_{t,i} / Σ_{j=1..n} w_{t,j}.
4. For each feature j, train a weak classifier h_j. The error of the classifier is ε_j = Σ_i w_i |h_j(x_i) − y_i|.
5. Update the weights: w_{t+1,i} = w_{t,i} β_t^{1−e_i}, where e_i = 0 if x_i is classified correctly, e_i = 1 if x_i is classified incorrectly, ε_t is the error of the classifier h_t with the lowest error, and β_t = ε_t / (1 − ε_t).
6. The final strong classifier is

h(x) = 1 if Σ_{t=1..T} α_t h_t(x) ≥ (1/2) Σ_{t=1..T} α_t, and 0 otherwise,

where α_t = log(1/β_t).


3.5.3 Multi-objective optimization

Feature selection with a genetic algorithm works as follows: the indices of the feature vector are coded on a binary chromosome. A one at the feature index on the chromosome means the feature is selected, a zero means the feature is not selected. A genetic algorithm is used to evolve a population of individuals, each with one chromosome. The chromosomes of the individuals are all initialized to random values (zero or one). The genetic algorithm performs selection, crossover, mutation, and fitness evaluation of each individual. In [44], a genetic algorithm is used to select a subset of features for medical classification problems. In [17], a multi-objective feature selection is used with classification performance and feature dimension as fitness criteria. In [33], the genetic algorithm evolves a real valued chromosome. In this way, the genetic search results in a relative weighting of features.
In this work, a binary tournament selection is used to select the individuals for reproduction. Tournament selection with n individuals means that n individuals are selected from the population with a probability proportional to their fitness. The individual with the highest fitness value from the n individuals is selected. Uniform crossover is used to generate an offspring population from the parent population. This means each gene on a chromosome of an offspring is randomly selected from the corresponding genes on the parent chromosomes. The probability of crossover is set to 0.7. During mutation, the value of a bit can be flipped. The probability for mutation is set to 1.0 / number of features.
The fitness of an individual is calculated by training and validating a classifier as described in sections 3.3.2 and 3.3.3 on the features that have the value one in the chromosome. The fitness value has two objectives: the true positive rate on the validation dataset and the false positive rate on the validation dataset. An important choice in multi-objective optimization is how individuals with multiple fitness values are sorted before reproduction. In this work, the method from [16] is used. This method is based on a comparison operator ≺_n. Each individual i in the population has two attributes: a rank i_rank and a crowding distance i_distance. The calculation of the rank is shown in algorithm 6, the calculation of the crowding distance is shown in algorithm 7. The sort function in algorithm 7 sorts on objective value. The parameters f_m^max and f_m^min are the maximum and minimum value of objective m, respectively. The ≺ operator in algorithm 6 is the domination operator. An individual p with n objectives {p_1, ..., p_n} dominates an individual q if

∀i ∈ {1, ..., n}: p_i ≥ q_i and ∃i ∈ {1, ..., n}: p_i > q_i.

If individuals do not dominate each other, they are assigned the same rank. All solutions of the optimization that are not dominated by a different solution make up the Pareto optimal set. The comparison operator i ≺_n j is defined as

i ≺_n j if (i_rank < j_rank) or ((i_rank = j_rank) and (i_distance > j_distance)).


Algorithm 6 Assignment of a rank to an individual. Algorithm from [16].

Input: A population P with fitness values assigned to each individual.
Output: A ranking assigned to each individual.
1. Repeat steps 2 until 6 for each p ∈ P.
2. S_p = ∅, n_p = 0.
3. Repeat steps 4 and 5 for each q ∈ P.
4. If (p ≺ q) then S_p = S_p ∪ {q}.
5. Else if (q ≺ p) then n_p = n_p + 1.
6. If n_p = 0 then p_rank = 1, F_1 = F_1 ∪ {p}.
7. i = 1.
8. Repeat steps 9 until 15 while F_i ≠ ∅.
9. Q = ∅.
10. Repeat steps 11 until 13 for each p ∈ F_i.
11. Repeat steps 12 until 13 for each q ∈ S_p.
12. n_q = n_q − 1.
13. If n_q = 0 then q_rank = i + 1, Q = Q ∪ {q}.
14. i = i + 1.
15. F_i = Q.


Algorithm 7 Crowding-distance-assignment. Algorithm from [16].

Input: A population P of size N with fitness values assigned to each individual.
Output: A crowding distance assigned to each individual.
1. Repeat step 2 for i = 1, ..., N.
2. P(i)_distance = 0.
3. Repeat steps 4 until 7 for each objective m.
4. P = sort(P, m).
5. P(1)_distance = P(N)_distance = ∞.
6. Repeat step 7 for i = 2, ..., N − 1.
7. P(i)_distance = P(i)_distance + (P(i + 1).m − P(i − 1).m) / (f_m^max − f_m^min).

3.6 Scanning a classifier through the image at every position and scale

It is interesting to compare the initial detection based approaches as described in this work with methods where a classifier is scanned through the image at every position and scale. Examples of such a system are [39], [32] for face detection, and [40], [32] for pedestrian detection. The working of such a system is described in algorithm 8. For training a classifier for this method, manually labeled regions generated as described in section 3.3.1 are the positive examples. The negative examples are generated by applying the bootstrapping method as described in section 3.4.1 to randomly selected regions of interest of image sequences which contain no pedestrians. As an alternative, the classifiers trained on the initial detection routines can be used.

Algorithm 8 Scanning a classifier through the image at every scale and location.

Input: An image.
Output: A list of positive classifications.
1. Repeat steps 2 until 4 for every scale and image position.
2. In the current scanning window, calculate the image features for classification.
3. Classify the feature vector from the current scanning window.
4. If the classification output is positive, add the coordinates of the scanning window to the output list.
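Algorithm 8 as a Python sketch; extract_features and classify stand in for the feature extraction of section 3.2 and a trained classifier:

def scan_classifier(image, scales, step, extract_features, classify):
    img_h, img_w = image.shape
    hits = []
    for (w, h) in scales:                    # every scale ...
        for y in range(0, img_h - h, step):  # ... and every image position
            for x in range(0, img_w - w, step):
                window = image[y:y + h, x:x + w]
                if classify(extract_features(window)) > 0:
                    hits.append((x, y, w, h))
    return hits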

To improve the method in algorithm 8, the following modifications are made:

- A separate classifier is trained for side views and front/rear views of pedestrians, and a separate classifier is trained for pedestrians at a low resolution and pedestrians at a high resolution.
- The optimization methods from section 3.4 and the feature selection methods from section 3.5 are applied.

An example of classifying the whole image at every position and scale is shown in figure 3.4.


Figure 3.4: Classifying the whole image at every scale and resolution.


Chapter 4
Tracking

The purpose of tracking is to keep track of an object through successive image frames after a positive classification. In the case of pedestrians, this is challenging. The reasons for this are that:

- pedestrians keep changing shape if they are moving,
- they usually increase in size because of the movement of the car,
- a reliable motion model is not available because of the motion and pitching of the car,
- the contrast between the pedestrian and the background is often low, so it is difficult to calculate reliable image features for tracking, and there are usually changes in illumination conditions,
- the background in urban traffic scenes is complex and often contains many pedestrian-like objects,
- and the resolution of pedestrians is often small in the image, so not much information is available for tracking.


4.1 Tracking using the Hausdorff distance

The Hausdorff distance is a measure of inequality between two sets of points. Given two sets of points P = {p_1, ..., p_m} and Q = {q_1, ..., q_n}, their Hausdorff distance is

H(P, Q) = max( max_{p∈P} min_{q∈Q} ||p − q||, max_{q∈Q} min_{p∈P} ||q − p|| ).

If d_1 is the maximum distance of set P to the nearest point in Q and d_2 is the maximum distance of set Q to the nearest point in P, then the Hausdorff distance is the maximum of these two distances.
The partial Hausdorff distance can be used to measure the inequality between subsets of two sets of points. Given two sets P = {p_1, ..., p_m} and Q = {q_1, ..., q_n}, their partial Hausdorff distance is

H_k(P, Q) = K^th_{p∈P} min_{q∈Q} ||p − q||,

where 1 ≤ k ≤ m. Here, the k-th largest element is used instead of the maximum element. The partial Hausdorff distance makes it possible to locate objects which are partially occluded.
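The (partial) Hausdorff distance in Python/NumPy; a direct O(mn) sketch for small point sets, not the optimized variant from [41]:

import numpy as np

def directed_partial_hausdorff(P, Q, k=None):
    # For every p in P the distance to its nearest q in Q, then the k-th
    # largest of those values (the maximum when k is not given).
    d = np.sqrt(((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2))
    nearest = d.min(axis=1)
    return nearest.max() if k is None else np.sort(nearest)[-k]

def hausdorff(P, Q):
    return max(directed_partial_hausdorff(P, Q),
               directed_partial_hausdorff(Q, P))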
The tracker based on the Hausdorff distance [41] works as follows. The tracker is initialized with a set of image features P (intensity values or gradients); this is the model set of the tracker. In the next frame, at several image positions and scales with feature set Q, the partial Hausdorff distance is calculated between the feature set P and the feature set Q. The location of the feature set Q with the smallest partial Hausdorff distance between P and Q is selected as the new location and scale of the object which is tracked. A Kalman filter with a linear motion model is used to estimate the position of the search region for the pedestrian in the next frame, in which the feature set Q is calculated. The process noise and measurement noise for the Kalman filter are drawn from a normal distribution.


A problem with using the Hausdorff distance for tracking is that when a model for tracking is generated from the image features, the model does not contain just image features from the object to be tracked but also image features from the background. This invalidates the tracking model when the object is moving. In order to separate model features from background features, the following model adaptation mechanism is used. The set of image features in the next frame is compared to the image features of the tracker model. A list is made of image features which are present in the model but do not correspond to an image feature in the next frame. Also, a list is made of image features which are present in the next frame but do not correspond to a feature in the model. From the first list, model features are removed randomly with a certain probability. From the second list, image features are randomly added to the model features with a certain probability. The degree of adaptiveness of the tracker can be controlled by varying the probability parameter. The complete description of the Hausdorff tracker is given in algorithm 9. In this work, for pedestrians, only the upper half of the body is tracked, because the movement of the legs would invalidate the model of the tracker in a few frames. In order to handle the problem of background features in the model, the tracker model is adapted every few frames because of the often high relative speed difference between the pedestrian and the car.


Algorithm 9 The Hausdorff tracker.

Input: A feature image I, the position of the object in the previous frame p_{t−1}, the feature set of the model P.
Output: The position of the object in the current frame p_t.
1. Estimate with a Kalman filter the next position x_t of the object.
2. Search in I in the area around the estimated position x_t for the location p_t of the object with feature set Q such that the Hausdorff distance between P and Q is minimal.
3. If the tracker model needs to be updated this frame, make a list L_1 of image features which are present in the model but do not correspond to an image feature in frame I, and make a list L_2 of image features which are present in the frame I but do not correspond to a feature in the model.
4. If the tracker model needs to be updated this frame, randomly remove features in L_1 from the model and randomly add features from L_2 to the model.

4.2 Mean shift tracking

Mean shift tracking [11] is based on the assumption that the position and scale of an object will not change much from one frame to the next. A target model with image features z has a density function q_z; the target candidate located at position y has a feature distribution p_z(y). Mean shift tracking finds the position y whose density p_z(y) matches the density q_z best. The metric used for the density similarity is the Bhattacharyya coefficient:

ρ(y) ≡ ρ[p(y), q] = ∫ sqrt(p_z(y) q_z) dz.    (4.1)

For sampled data, the discrete densities are calculated from m-bin histograms: q = {q_1, ..., q_m} with Σ_{u=1..m} q_u = 1 for the model and p(y) = {p_1(y), ..., p_m(y)} with Σ_{u=1..m} p_u = 1 for the candidate.


A kernel is used to assign a weighting to pixels: pixels further away from the center of the target are given a lower weight than pixels at the center of the target. This is done because pixels further away from the center are more likely to be affected by occlusion or background. The kernel applied in [11] is the Epanechnikov kernel

k(x) = (1/2) c_d^{−1} (d + 2)(1 − x) if x < 1, and 0 otherwise,

for normalized image coordinates x. To find the new location y_1 of the target, the mean shift vector is applied

y_1 = ( Σ_{i=1..n_h} x_i w_i g(||(y_0 − x_i)/h||^2) ) / ( Σ_{i=1..n_h} w_i g(||(y_0 − x_i)/h||^2) ),    (4.2)

where y_0 is the location of the target in the previous frame, g is the kernel for weighting, w_i is the weight of pixel x_i, and n_h is the number of pixels in the radius h of the kernel around the current location y_0. The weights w_i are calculated as follows

w_i = Σ_{u=1..m} sqrt( q_u / p_u(y_0) ) δ[b(x_i) − u],

where b is a function that assigns an index of a bin to a pixel.

The complete tracking algorithm is described in algorithm 10. In this work, for tracking pedestrians, the features used for calculating the densities are intensity values or image gradients. The radius of the kernel is dependent on the size of the region of interest to be tracked: for a large region of interest a large radius is used, for a small region of interest a small radius is used.


Algorithm 10 Mean shift tracking.

Input: The distribution q of the target object and its position y_0 in the previous frame, an image with pixels x.
Output: The position of the object in the current frame y_1.
1. Calculate the distribution {p_u(y_0)}_{u=1..m} in the current frame at y_0 and the Bhattacharyya coefficient ρ[p(y_0), q] with 4.1.
2. Calculate the new position of the target y_1 with 4.2.
3. Calculate the Bhattacharyya coefficient ρ[p(y_1), q] at the new position y_1.
4. Repeat step 5 while ρ[p(y_1), q] < ρ[p(y_0), q].
5. y_1 = (1/2)(y_0 + y_1).
6. If ||y_1 − y_0|| > ε: y_0 = y_1, go to step 1.
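The discrete densities and the Bhattacharyya coefficient used in steps 1 and 3 of algorithm 10, sketched in Python/NumPy; the bin assignment b(x) is simplified here to a plain intensity histogram:

import numpy as np

def density(patch, m=16):
    # discrete m-bin density of the pixel intensities in a patch
    hist, _ = np.histogram(patch, bins=m, range=(0, 256))
    return hist / hist.sum()

def bhattacharyya(p, q):
    # rho[p, q] = sum_u sqrt(p_u q_u); 1.0 means identical densities
    return np.sqrt(p * q).sum()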

4.3 Tracking with Condensation

In the Condensation algorithm [25], tracking is performed by propagating a state density p(x_t | Z_t) over time with Bayes' rule:

p(x_t | Z_t) = k_t p(z_t | x_t) p(x_t | Z_{t−1}),

where x_t is the state at time t, z_t is the observation from the image at time t, p(z_t | x_t) is the observation density at timestep t, the prior p(x_t | Z_{t−1}) is a prediction step from the posterior density p(x_{t−1} | Z_{t−1}) of the previous timestep t − 1, and k_t is a constant. The prior density of timestep t is generated by factored sampling. The posterior density p(x_{t−1} | Z_{t−1}) from the previous timestep t − 1 is represented by a set of N weighted samples {(s^(n)_{t−1}, π^(n)_{t−1}), n = 1, ..., N}. N samples are sampled with replacement from the sample set. Each sample s^(n)_{t−1} is selected with a probability of π^(n)_{t−1}. Samples with high weights may be selected


multiple times. Each selected sample undergoes a deterministic step followed by a random step. This gives the prior density p(x_t | Z_{t−1}) of timestep t. Now, the observation density p(z_t | x_t) is used to calculate the weights of the samples. This gives the posterior density p(x_t | Z_t) of timestep t.
In this work, the prior density p(x_t | Z_{t−1}) of timestep t is calculated by applying only a random step. It is assumed that a reliable motion model of a pedestrian does not exist, because the camera may move or stand still and the pedestrian may move or stand still. In addition, the pedestrian may be moving perpendicular to the viewing direction of the camera or along it. Therefore, only random motion is used. For tracking pedestrians from a moving car, x position, y position, and scale are the parameters for tracking. Each sample represents a position and scale. For calculating the observation density from the image data, the contour based approach from [25] is followed. A small set of model contours is calculated by performing a principal component analysis on a set of manually labeled contours (as described in [1]). An example of a model contour is shown in figure 4.1. A sample represents a contour with a certain scale. The distance from the contour along a fixed set of M normals along the contour is used to calculate the observation density p(z|x) of a sample:

p(z|x) ∝ exp( −(1/(2rM)) Σ_{m=1..M} min(||z_1(s_m) − r(s_m)||^2; μ) ),

where r(s_m) is a model contour, z_1(s_m) is the closest feature to r along normal m, and r and μ are constants. In the case of multiple model contours, the contour with the smallest sum is used for calculating the observation density. The sample with the smallest sum of distances is selected as the current position/scale of the pedestrian. For tracking pedestrians, the image energy is used as the feature for tracking. The complete outline of the Condensation algorithm is shown in algorithm 11.

Figure 4.1: Model contour of the Condensation tracker.

Algorithm 11 The Condensation algorithm.

Input: An initial sample set {(s^1_1, π^1_1), ..., (s^n_1, π^n_1)} representing position and scale at timestep 1.
Output: A sample set {(s^1_T, π^1_T), ..., (s^n_T, π^n_T)} of timestep T.
1. Repeat steps 2 until 4 for t = 1, ..., T iterations.
2. Sample with replacement N samples from the set {(s^1_t, π^1_t), ..., (s^n_t, π^n_t)}. Each sample s^i_t is selected with probability π^i_t.
3. Each of the selected samples undergoes a deterministic motion followed by a random motion.
4. Each of the samples is weighted by calculating the observation density. This gives the sample set {(s^1_{t+1}, π^1_{t+1}), ..., (s^n_{t+1}, π^n_{t+1})} of timestep t + 1.
5. The sample s^i_{t+1} with the smallest sum of distances calculated for its observation density is selected as the current position/scale of the pedestrian.
CHAPTER 4.

80

TRACKING

Integrating initial detections through time

4.4

A very simple model-free tracking method is based on the assumption that if a pedestrian is present in a certain frame, it will be present at approximately the same location and at approximately the same scale in the next image. To find the pedestrian in the next image, the output from the initial detection routine is used. The positions and scales of the regions of interest from the initial detection routine are evaluated. If a matching region of interest is found, it is selected as the location of the pedestrian in the next frame. An example of this method is shown in figure 4.2. The advantage of using the initial detection for tracking is that it is invariant to the scaling of the pedestrians through successive frames, since the initial detection is developed to detect pedestrians at various sizes. It is also developed for finding pedestrians in complex, structured environments and in changing illumination conditions. The initial detection tracker is described in algorithm 12.

Algorithm 12 Initial detection tracker.

Input: The current center position (x_current, y_current) and scale s_current of the tracked region of interest; a list (x_1, y_1, s_1), ..., (x_N, y_N, s_N) of N initial detections in the next frame.
Output: A new center position (x_new, y_new) and scale s_new in the next frame.
1. Initialize j = 1, d_j = sqrt((x_current − x_j)^2 + (y_current − y_j)^2) and d_s = s_current − s_j.
2. Repeat steps 3 and 4 for i = 2, ..., N.
3. d_i = sqrt((x_current − x_i)^2 + (y_current − y_i)^2) and d_s = s_current − s_i.
4. If d_i < d_j, then d_j = d_i and j = i.
5. (x_new, y_new) = (x_j, y_j) and s_new = s_j.
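A short Python sketch of algorithm 12, matching on positional distance only; any gating on the scale difference d_s would be added to the if-condition:

import math

def nearest_initial_detection(current, detections):
    # current: (x, y, s); detections: list of (x, y, s) in the next frame
    xc, yc, sc = current
    best, best_d = None, float("inf")
    for (x, y, s) in detections:
        d = math.hypot(xc - x, yc - y)   # positional distance (steps 1 and 3)
        if d < best_d:                   # step 4: keep the nearest region
            best, best_d = (x, y, s), d
    return best                          # step 5: new position and scale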


(a) A positive classification in a fir image
(b) The initial detections in the next frame
(c) The initial detection selected as the new position of the pedestrian

Figure 4.2: Integrating initial detections through time.

4.5 Tracking the classification output

When detection is performed by scanning a classifier through the whole image at every scale and location, as described in section 3.6, the classification output can also be used for tracking. The tracker described in this section resembles the tracker from [32], where Condensation is used for propagating a density of classification outputs over time. The assumption is made that the classification outputs decrease proportionally to the distance from the pedestrian in both location and scale. In addition, the assumption is made that the location and scale of a pedestrian do not change much from one frame to another. Tracking a pedestrian then means locating the highest peak in classification outputs near the position of the pedestrian in the previous frame. This method can possibly reduce the false positive detection rate, because it can be expected that false positives do not generate a similar peak in classification output as a pedestrian does. An example of the classification outputs at a few scales was shown in figure 3.4. The complete method is described in algorithm 13.

Algorithm 13 Tracking the classification output.

Input: An initial set of positions {p_1, ..., p_n} containing a pedestrian; T frames.
Output: A set of new positions after T frames.
1. Repeat steps 2 until 5 for t = 1, ..., T.
2. Calculate the classification outputs at each location and scale in frame t.
3. Repeat steps 4 and 5 for each of the positions p ∈ {p_1, ..., p_n}.
4. Find the highest peak in classification outputs around p.
5. Set the position and scale of p to the position and scale of the highest peak.

Chapter 5
Experimental results

5.1 Initial detection

In this section, the performance of the initial detection routines evaluated on fir and grayscale images is described. To evaluate the performance of the initial detection routines, the output of each of the routines is compared to manually labeled ground-truth data. The comparison is performed as follows: for each frame of a video sequence in the ground-truth database, the output of the initial detection routine is calculated. Then the output of the initial detection routine is matched to the ground truth data. The matching process is shown in figure 5.1. The ground truth data is represented by red outlined rectangles and the output of the initial detection routine is represented by blue outlined rectangles. There is a true positive detection if a region of interest output by the initial detection routine corresponds to a region of interest in the ground-truth database. This is shown by the overlap of the red outlined rectangle with the blue outlined rectangle in figure 5.1. There is a false positive detection if the initial detection routine outputs a region of interest and there is no corresponding region of interest in the ground truth database. This is shown by the blue outlined rectangle in figure 5.1 for which there is no corresponding red outlined rectangle. There is a false negative detection if the ground-truth database contains a region of interest for which there is no corresponding region of interest output by the initial detection routine. This is shown by the red outlined rectangle in figure 5.1. Ideally, the true positive rate of an initial detection routine is one and the false positive rate is zero. In practice, a compromise has to be found between the true positive detection rate and the false positive detection rate.

Figure 5.1: Matching ground truth data.

A certain amount of deviation is allowed for a correspondence. This is shown in figure 5.1: the blue outlined rectangle from the initial detection routine does not exactly match the red outlined rectangle from the ground-truth database. A deviation of 10% of the width and height of the ground-truth region of interest is allowed in the horizontal and vertical direction. The deviation is allowed because not all the initial detection routines deliver accurate regions of interest. In addition, the ground-truth database is produced by manually labeling image sequences, and there is always a certain amount of inaccuracy in the ground truth data. The deviation value of 10% is selected because at this level of deviation it is still possible to classify the output of the initial detection routine. The classifier requires a certain level of accuracy in order to properly learn the class of pedestrians. At a larger deviation, the spatial distribution of image features inside the region of interest makes the region of interest a non-typical pedestrian example. An example of this is shown in figure 5.2. Figure 5.2(a) contains an example suitable for training a classifier: the pedestrians fit nicely in the regions of interest. Figure 5.2(b) contains an example unsuitable for training a classifier: the pedestrian does not fit the region of interest.

(a) Initial detections suitable for classification
(b) Initial detections unsuitable for classification

Figure 5.2: Suitability of initial detections for classification.
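The correspondence test with the 10% deviation tolerance as a sketch; rectangles are (x, y, w, h) tuples and the ground-truth box supplies the tolerance:

def matches_ground_truth(det, gt, tol=0.10):
    # a detection corresponds to a ground-truth region if every rectangle
    # edge deviates by at most tol times the ground-truth width/height
    dx, dy = tol * gt[2], tol * gt[3]
    return (abs(det[0] - gt[0]) <= dx and abs(det[1] - gt[1]) <= dy and
            abs(det[0] + det[2] - gt[0] - gt[2]) <= dx and
            abs(det[1] + det[3] - gt[1] - gt[3]) <= dy)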

In order to properly measure the performance of the initial detection routines, the ground-truth database should contain a large set of representative sequences. The fir images ground-truth database for initial detection consists of 4443 pedestrians for the temperature range of 5°, 3510 pedestrians for the temperature range of 15°, and 2904 pedestrians for the temperature range of 25°. The grayscale images ground-truth database for initial detection consists of 50 sequences containing at least one pedestrian each. One and the same pedestrian which is present in multiple frames is counted once for each frame in which it is present. A measurement is made as soon as the pedestrian is large enough for detection.


In the 320x240 pixel images used, a minimum size of 10x20 pixels is used. The exact thresholds for the initial detection routines are not mentioned because they depend strongly on the type of camera used, especially in fir images. The results displayed in this section are consistently measured on data from a single fir camera and a single grayscale camera.

5.1.1 Results on fir images

To measure the performance of the initial detection routines on fir sequences, the sequences are divided based on outside temperature. This is done because the intensity distribution of fir images changes with temperature. Four temperature ranges are selected: 5°, 15°, 25°, and 35°, each with a maximum deviation of 3°. An example of fir images at different temperature ranges is shown in figure 5.3. The exact temperature is always supplied as an argument to the initial detection routine.


(a) Fir image at 5 degrees
(b) Fir image at 15 degrees
(c) Fir image at 25 degrees
(d) Fir image at 35 degrees
Figure 5.3: Fir images at different temperature ranges.

Figure 5.4 shows the mean true positive rate of the initial detection routines on the different temperature ranges. The range 35° is omitted because image processing in fir images is not possible anymore at this temperature range with the particular camera used. At 35° there is too little contrast between objects in the image and the background. This is clearly visible in figure 5.3. It is not possible to report the false positive rate of the initial detection routines, because there is no ground truth data of negative examples. It is possible to calculate the percentage of positive detections of the total number of detections. This is shown in figure 5.5.


Figure 5.4: Initial detection results.

Figure 5.5: Percentage positive detections.


Figure 5.6 shows the calculation times of the initial detection routines for fir images. These values do not include the time required for image pre-processing. All values are measured on a computer with a 1470 MHz AMD Athlon processor.

Figure 5.6: Calculation times of initial detection for fir images.

5.1.2 Results on grayscale images


In the case of grayscale images, the distribution of intensity values is mainly dependent on the lighting conditions. Because the illumination conditions may change while the car is moving, the sequences are not divided on illumination conditions, so there is a single database containing all grayscale sequences. An example of a sudden change in illumination conditions is shown in figure 5.7. In figure 5.7(a), the image is rather dark, while in figure 5.7(b), which is recorded 6 frames later, the sudden appearance of sunlight causes the image to be much brighter than the previous image.

Figure 5.7: Illumination changes in grayscale images.

A set of sequences containing images under various illumination conditions makes it more difficult for the initial detection to find regions of interest, so a compromise has to be found between generality and performance of the initial detection routines. Of course, the region based brightness segmentation cannot be applied in the case of grayscale images. Figure 5.8 shows the true positive rates of the initial detection routines on grayscale images and the percentage of positive detections of the total number of detections.


Figure 5.8: Initial detection results in grayscale images.

Figure 5.9 shows the calculation times of the initial detection routines for grayscale images. These values include the time required for image pre-processing. All values are measured on a computer with a 1470 MHz AMD Athlon processor.


Figure 5.9: Calculation times of initial detection for grayscale images.

5.2 Classification

The performance of the classification is measured by calculating the true positive rate and the false positive rate of a classifier/image feature combination. This is usually visualized with a so-called ROC (Receiver Operating Characteristic) curve. A ROC curve displays the true positive rate of the classifier at a certain false positive rate. An example of a ROC curve is shown in figure 5.10. For neural networks, the ROC curve is created by varying the threshold (normally 0.5) at the output neuron. For support vector machines, the ROC curve is created by varying the threshold $b$ in the separating function

$f(x) = \operatorname{sign}\left( \sum_{i=1}^{l} \alpha_i y_i k(x, x_i) + b \right)$.


Figure 5.10: An example ROC curve.
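For illustration, such a curve can be produced from raw classifier outputs as in the following minimal sketch (here `scores` stands for the kernel sum, or the output neuron activation, before thresholding; the helper itself is an assumption, not part of the system described):

import numpy as np

def roc_points(scores, labels, n_thresholds=100):
    """Sweep the decision threshold over the raw classifier outputs and
    record one (FPR, TPR) pair per threshold.  `labels` holds +1 for
    pedestrians and -1 for non-pedestrians."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    thresholds = np.linspace(scores.min(), scores.max(), n_thresholds)
    points = []
    for t in thresholds:
        pred = np.where(scores >= t, 1, -1)
        tpr = np.mean(pred[labels == 1] == 1)    # true positive rate
        fpr = np.mean(pred[labels == -1] == 1)   # false positive rate
        points.append((fpr, tpr))
    return points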

A separate classifier is trained for each classifier/image feature combination. In addition, separate classifiers are trained for side views and front/back views of pedestrians. In general there are two different ways of generating data for training a classifier: the first is manually labeling the output of the initial detection routines. The second is manually creating ground truth data of pedestrians (positive examples) and using random image parts not containing pedestrians as negative examples for training. The bootstrapping method from section 3.4.1 is used to create a representative dataset of negative examples. In fir images the object size for classification is selected as 20x40 pixels. In grayscale images the object size for classification is selected as 24x48 pixels. Regions of interest smaller than 20x40 pixels in fir images and smaller than 24x48 pixels in grayscale images are not classified because too little information is available for feature calculation.
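A minimal sketch of the bootstrapping method from section 3.4.1 mentioned above reads as follows (the function names, the retraining interface, and the number of rounds are illustrative assumptions):

def bootstrap_negatives(train_pos, neg_pool, train_fn, rounds=3):
    """Iteratively grow the negative training set with the classifier's
    own false positives.  `neg_pool` is a large set of pedestrian-free
    image regions; `train_fn` trains a classifier on (positives,
    negatives) and returns it."""
    negatives = neg_pool[:3 * len(train_pos)]    # start with random negatives
    clf = train_fn(train_pos, negatives)
    for _ in range(rounds):
        # collect regions the current classifier wrongly accepts
        false_pos = [x for x in neg_pool if clf.predict(x) == 1]
        if not false_pos:
            break
        negatives = negatives + false_pos        # add the hard examples
        clf = train_fn(train_pos, negatives)     # retrain
    return clf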
To measure the performance of classification, a representative set of training, validation, and test data has to be selected. All three sub-datasets consist of mostly urban sequences and some land street sequences.

5.2.1 Results on fir images


As with the initial detection in fir images, the training dataset is divided into sub-datasets based on temperature range. As mentioned in section 5.1.1, the intensity distribution of fir images changes with temperature. The same three temperature ranges as used for the initial detection are applied: 5°, 15°, and 25°, each with a maximum deviation of 3°. For evaluating the classifier, the datasets of 5° and 15° are used because a large amount of data is available for these temperature ranges. The training dataset of 5° consists of 729 pedestrian images generated from temperature initial detection, 2581 pedestrian images generated from vertical gradients initial detection, and 3165 pedestrian images generated from scanning a vertical edge detector initial detection.

The training dataset of 15° consists of 287 front/back view pedestrian images generated from temperature initial detection, 1728 pedestrian images generated from vertical gradients initial detection, and 1448 pedestrian images generated from the initial detection based on scanning a vertical edge detector through the image.

The training dataset is divided into 3 sub-datasets: one for training, one for the validation of the trained classifier, and one test dataset. The datasets are divided in such a way that no training image is part of more than one subset. There are always three times as many negative training examples in the dataset as there are positive training examples. The reason for having more negative examples than positive examples in the dataset is to achieve a low false positive rate.


Figures 5.11, 5.12, and 5.13 show the ROC curves for the support vector machine classification on a test set where the training data is generated by the initial detection routines from sections 2.2 and 2.3, for different image features.


(a) Orientations features 5 degrees
(b) Orientations features 15 degrees
Figure 5.11: Results of support vector machine classification, orientations features.


(a) Gradients orientations features 5 degrees
(b) Gradients orientations features 15 degrees
Figure 5.12: Results of support vector machine classification, gradients and orientations features.


(a) Rectangle features 5 degrees
(b) Rectangle features 15 degrees
Figure 5.13: Results of support vector machine classification, rectangle features.


Figures 5.14, 5.15, and 5.16 show the ROC curves for the neural network classification on a test set where the training data is generated by the initial detection routines from sections 2.2 and 2.3.


(a) Orientations features 5 degrees
(b) Orientations features 15 degrees
Figure 5.14: Results of neural network classification, orientations features.


(a) Gradients orientations features 5 degrees
(b) Gradients orientations features 15 degrees
Figure 5.15: Results of neural network classification, gradients and orientations features.


(a) Rectangle features 5 degrees
(b) Rectangle features 15 degrees
Figure 5.16: Results of neural network classification, rectangle features.


Figure 5.17 shows the processing times of the feature vector calculation, i.e., of computing the input for forward propagation through the classifier.

Figure 5.17: Processing times of the feature vector calculation.

Figure 5.18 shows the classification times of the different classifier/feature combinations.


Figure 5.18: Classification times of the classifier/image feature combinations.

5.2.2 Results on grayscale images


As with the initial detection in grayscale images, one single dataset is used for training a classifier for grayscale images. The reason for this is that, because the distribution of intensity values in the image depends on changing illumination conditions, it is impossible to select a classifier for a particular illumination condition beforehand. Figure 5.8 shows that the performance of the initial detection routines on grayscale images is limited. Therefore only ground truth data in combination with the bootstrapping method from section 3.4.1 is used to generate training data. The training dataset consists of 444 front/back view pedestrian images and 342 side view pedestrian images. There are three times as many negative examples in the training database as there are positive examples. The remaining positive examples are divided between the validation and test databases. Figures 5.19, 5.20, and 5.21 show the support vector machine and neural network classification results on a test set of grayscale images where the training data is generated using the bootstrapping method from section 3.4.1.


(a) Support vector machine orientations features
(b) Neural network orientations features
Figure 5.19: Classification results on grayscale images, orientations features.


(a) Support vector machine gradients orientations features
(b) Neural network gradients orientations features
Figure 5.20: Classification results on grayscale images, gradients orientations features.


(a) Support vector machine rectangle features
(b) Neural network rectangle features
Figure 5.21: Classification results on grayscale images, rectangle features.


Figure 5.18 shows the classification times of the different classifier/feature combinations for fir and grayscale images.

5.2.3 Results of component classification


Figure 5.22 shows the results of the component classification of fir images with a support vector machine and a neural network trained on histograms of gradients and orientations features, on data from 5°.

Figure 5.22: Results of component classification.

5.2.4 Results of classifier optimization on classification performance
Figure 5.23 shows the results of the feature selection methods from section 3.5 on the classification performance. All data is from fir images recorded at 5°. Figure 5.23 shows the results of a support vector machine trained on a feature set of histograms of orientations reduced with a principal component analysis. Figure 5.24 shows the results of Adaboost compared to a neural network trained on the full set of histograms of gradients and orientations features. As sub-feature set for Adaboost, a single histogram of 8 bins is used. The maximum number of weak learners is limited to 20. Figure 5.25 shows the results of multi-objective optimization feature selection compared to the full set of histograms of orientations features.

Figure 5.23: Results of PCA feature selection.


Figure 5.24: Results of Adaboost feature selection.

Figure 5.25: Results of multi-objective optimization feature selection.


5.2.5 Results of classifier optimization on classification speed

This section shows the results of the support vector machine reduction algorithm described in section 3.4.3 and the feature selection algorithms from section 3.5. Figures 5.26 and 5.27 display the classification time in milliseconds for each of the optimization algorithms.


(a) Support vector reduction
(b) Adaboost
Figure 5.26: Classification times of optimization algorithms.


(a) Principal component analysis
(b) Feature selection with multi-objective optimization
Figure 5.27: Classification times of optimization algorithms.


5.2.6 Scanning a classifier through the image at every position and scale
For scanning a classifier through the image at every position and scale, the training dataset for grayscale images is generated from manually labeled ground truth data and the bootstrapping method from section 3.4.1. For fir images, the training datasets generated by the initial detection routines are used. The support vector machine classifier trained on a complete set of image features is used for fir images as well as for grayscale images. Figure 3.4 showed an example of scanning a classifier through the image at every position and scale. As can be seen from figure 3.4, there are often multiple positive classifications on one object at different scales. There are a total of 4863 classifications per frame for fir images and for grayscale images. Figures 5.11, 5.14, and 5.19 show the false positive rate for each of the classifier/feature combinations. Because of the high number of classifications per frame, there is a high average number of false positives per frame. Of interest is also the computation time per frame. Figure 5.18 shows the computation time in milliseconds per frame for the different classifier/feature combinations.
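A minimal sketch of such an exhaustive scan over positions and scales is given below (the stride and scale factor are illustrative assumptions and determine the number of windows per frame; the base window corresponds to the 20x40 or 24x48 pixel object sizes from section 5.2):

import numpy as np

def scan_image(image, classify, base=(20, 40), stride=4, scale=1.25):
    """Slide a fixed-size classifier window over every position and
    scale by repeatedly shrinking the image (an image pyramid).
    `classify` maps a base-sized patch to a score; detections are
    returned as (x, y, w, h, score) in original image coordinates."""
    detections = []
    factor = 1.0
    img = image
    while img.shape[0] >= base[1] and img.shape[1] >= base[0]:
        for y in range(0, img.shape[0] - base[1] + 1, stride):
            for x in range(0, img.shape[1] - base[0] + 1, stride):
                patch = img[y:y + base[1], x:x + base[0]]
                score = classify(patch)
                if score > 0:
                    detections.append((int(x * factor), int(y * factor),
                                       int(base[0] * factor),
                                       int(base[1] * factor), score))
        factor *= scale
        h, w = int(image.shape[0] / factor), int(image.shape[1] / factor)
        if h < base[1] or w < base[0]:
            break
        # nearest-neighbour downsampling keeps the sketch dependency-free
        ys = (np.arange(h) * image.shape[0] / h).astype(int)
        xs = (np.arange(w) * image.shape[1] / w).astype(int)
        img = image[ys][:, xs]
    return detections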

5.3 Tracking

To measure the performance of the tracking algorithms, a similar method is applied as for measuring the performance of the initial detection routines. The output of the tracking algorithms is matched to manually labeled ground-truth data (the same ground-truth data as used for initial detection and classification). The same procedure as described in section 5.1 is used for comparing the output of the tracking algorithms to the ground-truth data. A certain small amount of deviation from the ground-truth data is allowed. As with the initial detection, a deviation of 10% of the width and height of the ground-truth region of interest is allowed in the horizontal and vertical direction. Usually, the tracker is started by a positive classification. In order to make the results of the tracking algorithms independent of the initial detection algorithms and the classification when measuring the tracker performance, the tracker is started manually.

As with measuring the performance of the initial detection routines and the classification, a representative set of data is selected for testing the tracker. This set consists of a mixture of land street and urban scenarios. To avoid the tracking results being influenced by pedestrians moving out of the image (for example caused by the car moving past the pedestrian), only tracking runs are counted where the tracker loses track of the pedestrian before it moves out of the image.

5.3.1 Results on fir images


Figure 5.28 shows the average number of frames a pedestrian is tracked with the different tracking algorithms on fir sequences. The test dataset consists of 30 sequences containing at least one pedestrian. Only the tracker which integrates initial detections through time, the tracker which integrates classifications through time, and the tracker based on the Hausdorff distance are shown. The Mean shift tracker and the Condensation tracker do not produce any useful results at all and are not displayed.


Figure 5.28: Pedestrian tracking in fir images.

Figure 5.29 shows the processing times for the different tracking algorithms in fir images.


Figure 5.29: Processing times of the tracking algorithms.

5.3.2 Results on grayscale images


Only the tracker which integrates classifications over time generates somewhat acceptable results on grayscale images. The average number of frames tracked is 17 with a standard deviation of 14 frames. The test dataset for evaluating this tracker consists of 30 sequences containing at least one pedestrian. The processing time for this tracker is the same as for the tracker which integrates classifications over time in fir images (see figure 5.29). The tracker which integrates initial detections through time, the tracker based on the Hausdorff distance, the Mean shift tracker, and the Condensation tracker do not produce any useful results at all.

Chapter 6
Discussion

6.1 Achievements and limitations

The results of the initial detection algorithms are reasonable for fir images and disappointing for grayscale images, as figure 5.4 and figure 5.8 demonstrate. For both image types, the true positive detection rate is quite low and the false positive rate is high. The true positive rate for fir images is still acceptable: the probability that a pedestrian is detected within a few frames is still high at a frame rate of 20 frames per second. The true positive rate for grayscale images is so low that it is not useful to use the initial detection routines for grayscale images.
As figures 5.11, 5.12, 5.13 and figures 5.14, 5.15, 5.16 for fir images and figures 5.19, 5.20, 5.21 for grayscale images demonstrate, the results of classification are very good. Figure 5.22 and figures 5.23, 5.24, and 5.25 show that the component classifier and the optimization methods for classification lead to better classification performance.

Figure 5.28 and section 5.3.2 show that, aside from the tracking method which integrates the output of the classifier that is scanned through the whole image at every position and scale, the tracking results are very disappointing.

This holds for fir images as well as for grayscale images. The Hausdorff tracker, the Mean Shift tracker, and the Condensation tracker perform so poorly when tracking pedestrians from a moving car that they are not useful for this purpose.

Before going into more detail on the strengths and weaknesses of the components of the detection system, it should be mentioned that the disappointing results of the initial detection routines for grayscale images and of the tracking methods can to some degree be explained by the complexity of pedestrian detection. As mentioned in section 1.2, many computer vision applications operate in a controlled and static environment where color or motion segmentation can be applied, or where a search for a fixed pattern can be used to segment objects from a scene. Although these computer vision problems are not trivial, they are not as complicated as detecting moving, shape-changing objects from a moving camera against a complex structured urban background in low resolution images. Image processing applications for driver assistance systems like lane detection and car detection in a highway scenario also deal with detecting rigid objects against a largely homogeneous background.

6.1.1 Initial detection


Figure 5.4 shows that for fir images, temperature segmentation performs best at low outside temperatures. This is to be expected: at low outside temperatures, many objects have a low intensity value because they are cool, while pedestrians have a high intensity value because of their body temperature. Figure 5.4 shows that the true positive rate of the temperature segmentation decreases with increasing temperature. The reason for this is that at higher temperatures, the contrast between the pedestrian and the background is lower than at lower temperatures. At temperatures higher than 20°, the temperature segmentation does not perform satisfactorily anymore.
Figure 5.4 shows that for fir images, the vertical gradient based initial detection routine and the initial detection routine which scans a vertical edge detector through the image at every position and scale perform best at low outside temperatures. Their performance decreases with an increase in outside temperature. The reason for this is that at higher temperatures, the contrast between the pedestrian and the background is lower than at lower temperatures. The gradient magnitudes calculated from the intensity image have a weaker response, so the initial detection routines have more difficulty segmenting objects from the background than at lower temperatures.
Figure 5.8 shows that for grayscale images, the vertical gradient based initial detection routine and the initial detection routine which scans a vertical edge detector through the image at every position and scale do not produce acceptable results. The main difficulty for the initial detection routines is the complex structure of the background. What all initial detection routines basically do is search for vertical objects in the image. Because of the many vertical structures in the background in urban scenarios, the initial detection routines segment the pedestrian together with a background structure, or do not segment the pedestrian at all because there are vertical structures with a stronger gradient magnitude in the image. Examples are shown in figure 6.1 for fir images. An example of the many vertical structures in grayscale images and the difficulties this creates for the initial detection routines from section 2.2 is shown in figure 6.2. What may also happen is that the initial detection routines only find a part of the pedestrian. The match with the ground truth data, which is used for evaluating the initial detection routines, can for this reason also be unsuccessful.


Figure 6.1: Difficulties for initial detection because of vertical structures in the background.

(a) A grayscale image
(b) Its vertical gradients
(c) Its initial detections
Figure 6.2: Difficulties for initial detection because of many vertical structures in the background.


In a scene with little background structure, the initial detection performs quite well, both in the case of fir images and grayscale images. This was shown, for example, in figure 2.4 for fir images and figure 2.5 for grayscale images. This is also the reason the results displayed in figure 5.4 are at least somewhat acceptable for fir images. Especially at low temperatures, many background objects have a low intensity value because they are cool, while pedestrians have a high intensity value because they are warm. So there is no strong gradient response from background structures. This increases the performance of the initial detection of pedestrians.


Other difficulties for the initial detection are that pedestrians are not always perfectly vertical structures, pedestrians change shape when they move, and pedestrians may be carrying or pushing objects, for example a backpack or a stroller. Figure 6.3 shows an example situation. The initial detection routines described in this work were not specifically adapted to these situations, although they are set up to be quite flexible with respect to the shape of the objects they detect.

Figure 6.3: Pedestrian pushing a stroller.

Even more complications arise when there are groups of pedestrians present in the image. Groups may contain an arbitrary number of pedestrians which may be occluding each other in an arbitrary number of ways. An example of this is shown in figure 6.4. When pedestrians appear as one single bright blob in fir images, as shown in figure 6.4, it is not possible to segment them with the initial detection routines described in this work. The initial detection routines described in this work are designed for detecting single pedestrians only.

Figure 6.4: A group of pedestrians in a fir image.

A solution to the problems with the initial detection routines is to omit the initial detection altogether by scanning a classifier through the image at every position and scale, as described in section 3.6. The results of this method are discussed in section 6.1.2.
The threshold parameters applied in algorithms 2.5, 3, and 4 adapt the initial detection routines for fir images to a specific temperature range. For example, at a higher temperature, the threshold value is lowered because the gradient magnitude response is weaker at higher temperatures. For grayscale images, these threshold parameters are set to a fixed value for images recorded in daylight conditions.
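As an illustration of this adaptation, consider the following minimal sketch (the anchor temperatures and threshold values are invented for illustration; the camera-specific values are deliberately not published in this work):

def gradient_threshold(outside_temp_c,
                       anchors=((5, 40.0), (15, 28.0), (25, 18.0))):
    """Linearly interpolate a gradient-magnitude threshold between
    per-temperature anchor points: warmer scenes give weaker gradients,
    so the threshold is lowered.  Anchor values are purely illustrative."""
    temps = [t for t, _ in anchors]
    vals = [v for _, v in anchors]
    if outside_temp_c <= temps[0]:
        return vals[0]
    if outside_temp_c >= temps[-1]:
        return vals[-1]
    for (t0, v0), (t1, v1) in zip(anchors, anchors[1:]):
        if t0 <= outside_temp_c <= t1:
            w = (outside_temp_c - t0) / (t1 - t0)
            return v0 + w * (v1 - v0)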

6.1.2 Classification

Figures 5.11, 5.12, 5.13 and 5.14, 5.15, 5.16 for fir images, and figures 5.19, 5.20, 5.21 for grayscale images show that the classification results are very good in general. The classifier/image feature combinations achieve a high true positive rate at a low false positive rate. However, the results should be considered in the context of the complete detection system. In the case of fir images, the initial detection routines may generate up to 12 times as many negative detections as they generate positive detections. This is shown in figure 5.5. In the case of grayscale images, the classifier is scanned through the whole image at every position and scale. With such a large number of negative examples, the absolute number of false positive detections per second may still be high.
As figures 5.11, 5.12, 5.13 and figures 5.14, 5.15, 5.16 show, all classifier/feature combinations perform better on data from 5° than on data from 15°. Better means that the ROC curve lies higher and more to the left. The reason for this is that at 5°, the contrast between the pedestrian and the background is larger than at 15°. This means that at 5°, there is a better gradient response than at 15°. All classification features use gradient information, which explains why the results are better at 5°.
Figures 5.11, 5.12, 5.13 and figures 5.14, 5.15, 5.16 show that for fir images, the classification results of the histograms of gradients and gradient orientations features are generally the best for both support vector machines and neural networks, at both 5° and 15°. Second best are the rectangle features. Last are the histograms of orientations features. This shows that the magnitude of the gradient in combination with the orientation of the gradient performs better than the orientation of the gradient alone. In addition, a feature set as large as the rectangle feature set (1140 features) is not required for successful classification; the smaller feature set of histograms of gradient magnitudes and gradient orientations (448 features) performs better.

From figures 5.19, 5.20, and 5.21 it becomes clear that in the case of grayscale images, the classification results of the histograms of orientations features are comparable to the results of the rectangle features.
By comparing figures 5.11, 5.12, 5.13 to figures 5.14, 5.15, 5.16 and by looking at figures 5.19, 5.20, 5.21, it becomes clear that the results of the support vector machine classification are generally a bit better than the results of the neural network classification. However, the forward propagation speed of neural networks is much higher than the forward propagation speed of support vector machines. This can be seen in figure 5.18.
Comparing figures 5.11, 5.12, 5.13 and figures 5.14, 5.15, 5.16 to figures 5.19, 5.20, 5.21 shows that the classification results on fir images are comparable to the classification results on grayscale images. These figures should not be compared directly, because the training data for the fir data is generated from initial detection routines while the training data for the grayscale data is generated from ground truth data using a bootstrapping method.
The classification results on front/back views of pedestrians are comparable to the results on side views of pedestrians. This becomes clear by looking at figures 5.11, 5.12, 5.13 and figures 5.14, 5.15, 5.16 for fir images and figures 5.19, 5.20, 5.21 for grayscale images. In some figures, the front view results are better, while in other figures the side view results are better. This is remarkable because the class of side views of pedestrians is less well defined than the class of front/back views of pedestrians. The side view class contains pedestrians oriented to the left and to the right, while the front view class contains pedestrians of a homogeneous orientation. Apparently, in the case of side view classification, the classifier has enough representational capability to learn both orientations.
Section 5.2 mentions that a minimum region of interest size of 20x40 pixels is used for fir images and a minimum size of 24x48 pixels is used for grayscale images. Below these minimum image sizes there is too little feature information for calculating the histograms of gradient orientations and gradient magnitudes, and the classification performance strongly decreases. In general, it is better to use higher resolution images for object detection. For pedestrian detection, a minimum image size of 640x480 pixels is recommendable.
For the detection system described in section 3.6, which scans a classifier through the image at every scale and location, the classification results from figures 5.11, 5.12, 5.13 and figures 5.14, 5.15, 5.16 for fir images and figures 5.19, 5.20, 5.21 for grayscale images are of importance. The false positive rates of these classifiers are very low. However, with 4863 classifications per frame at 20 frames per second, this still means a high average number of false positive detections per second. For a real production system, this is unacceptable. One possible solution to this problem is to integrate detections over time by tracking the positively classified object: only when an object is classified as a pedestrian in successive frames is it considered a positive detection. Unfortunately, in practice it appears that false detections may be persistent in time. An example of this is shown in figure 6.5.

(a) A false positive classification
(b) The same false positive 10 frames later
Figure 6.5: Persistence of false positives over time.


Another issue concerning scanning a classifier through the image at every scale and location is computation time. The computation time consists of the feature calculation time (including preprocessing) and the time for the classifier to perform forward propagation. As can be seen from figure 5.17 and figure 5.18, the combination of orientations features with a neural network classifier has a computation time of 0.5 milliseconds for feature calculation plus 0.2 milliseconds for forward propagation, i.e., 0.7 milliseconds. With 4863 classifications per frame, this gives a computation time of 6808.2 milliseconds per frame (2 x 3404.1 milliseconds for front/back view and side view classification). This is clearly too slow for a real-time system, although there is potential for optimization. Still, despite these two drawbacks, this classifier based system in combination with the tracker which integrates classification outputs through time is the best performing system.
By comparing figure 5.22 to figure 5.12 and figure 5.15, it becomes clear that the component based classifier performs better than the full-body classifier. The combined answer of a set of classifiers, each trained for a part of a pedestrian, has better discriminative capabilities than one full-body classifier.
Figure 5.17 shows that the rectangle features are the fastest to calculate. All feature types take less than a millisecond to calculate for one region of interest. Figure 5.18 shows the forward propagation speed of the classifier/image feature combinations. Neural networks take much less time than support vector machines. The orientations features take the least time in forward propagation. This is expected: the orientations feature vector consists of fewer features (224) than the gradients and orientations feature vector (448 features) and the rectangle gradients feature vector (1140 features).
Figures 5.23, 5.24, and 5.25 show the results of feature selection on the classification performance. The classification results of feature selection with a PCA are approximately equal to the classification results on the full set of features. This can be seen by comparing figure 5.23 to figure 5.11. This shows that the full feature set is not required for an optimal classification result.
As figure 5.24 shows, the results of Adaboost are somewhat disappointing, although it achieves its performance with a small subset of features. The reason for the disappointing results is that all types of features used are generated by gradient filters of very small size compared to the whole region of interest containing the pedestrian. Adaboost searches for small feature subsets which are characteristic for the object to be classified. There are no single key features which are characteristic for a pedestrian; it is a larger set of features which makes it possible to achieve good classification performance.
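For reference, the boosting loop with single-feature threshold stumps as weak learners can be sketched as follows (a generic discrete Adaboost simplification, not the exact sub-feature handling from section 3.5.2):

import numpy as np

def adaboost_stumps(X, y, rounds=20):
    """Discrete AdaBoost with one-feature threshold stumps.
    X: (n_samples, n_features), y: labels in {-1, +1}.
    Returns a list of (feature, threshold, polarity, alpha)."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                 # example weights
    model = []
    for _ in range(rounds):
        best = None
        for j in range(d):                  # pick the best weighted stump
            for t in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = np.where(pol * (X[:, j] - t) >= 0, 1, -1)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, t, pol, pred)
        err, j, t, pol, pred = best
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)      # reweight: boost misclassified
        w /= w.sum()
        model.append((j, t, pol, alpha))
    return model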
This is what happens in feature selection with multi-objective optimization: a larger subset of features, selected from the complete set of features, is selected at once. As can be seen from figure 5.25, this gives better results than Adaboost. A disadvantage of using multi-objective optimization for feature selection is that it is almost prohibitively slow. For each individual of each generation, a network has to be trained. Training one support vector machine is slow because of the grid search required to select the training parameters. Training a neural network is slow because of the many iterations required for gradient descent.
Figures 5.26 and 5.27 show the results of the classifier optimization on the forward propagation time through the network. From figure 5.26(a) it becomes clear that the support vector reduction results in a classifier with a much smaller number of support vectors. A reduction of 95% can be achieved without a loss of classification performance. Figure 5.26(b) compares the forward propagation speed of Adaboost to a neural network trained on all features. Adaboost is much faster than a classifier trained on the full set of features. Figure 5.27(a) compares the forward propagation speed of a support vector machine trained on all features to a support vector machine trained on a reduced vector set created with principal component analysis. It is obvious that as the number of features gets smaller, the forward propagation speed increases. Figure 5.27(b) compares the forward propagation speed of a support vector machine trained with feature selection based on multi-objective optimization to one trained on the full set of features. Although the feature set selected with multi-objective optimization is smaller (116 features) than the full set of features (224 features), the forward propagation speed is slower.
The reason for this is that the original network parameters $C$ and $\gamma$ were used for training the support vector machine. These are not the optimal values for the network trained on the reduced feature set and resulted in a large set of support vectors.
Of the feature selection methods, the PCA feature selection method gives the best classification results. It is also by far the easiest and fastest feature selection method to apply. The PCA is therefore strongly preferable over Adaboost and multi-objective optimization.
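A minimal sketch of this kind of PCA feature reduction reads as follows (the number of retained components is an illustrative assumption; section 3.5.1 describes the variant actually evaluated):

import numpy as np

def pca_reduce(X, n_components=50):
    """Project feature vectors onto the principal components of the
    training data.  X: (n_samples, n_features).  Returns the projected
    data and a function that projects new feature vectors the same way."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # eigenvectors of the covariance matrix via SVD, sorted by variance
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    components = vt[:n_components]
    project = lambda v: (v - mean) @ components.T
    return Xc @ components.T, project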
One general limitation of neural networks and support vector machines is that both types of classifiers are black boxes. It cannot be verified that a trained network produces a correct output for an unseen example. This should be taken into consideration when applying a system using a classifier in a production system for braking a car in the case of a possible collision. There is never a guarantee that the system will make the correct decision.

6.1.3 Tracking

It becomes clear by looking at figure 5.28 and section 5.3.2 that the tracking methods which use image features for tracking (the Hausdorff tracker, the Mean Shift tracker, and the Condensation tracker) perform poorly on pedestrian images. The integration of initial detections and classification outputs through time performs better; these trackers do not operate on image features directly. There are two conditions which must be satisfied for a tracker to be able to track an object based on its image features:

- the shape of the object should not change much from one frame to the other, so the distribution of image features does not change much from frame to frame,
- there should be a motion model available for estimating the movement the object makes.

In the case of tracking pedestrians from a moving car, usually neither condition is satisfied. First, the shape of a pedestrian changes through its own movement, and its size changes through the movement of the car. Second, a reliable motion model of a pedestrian is not available: it is not always the case that a pedestrian is moving. In addition, because of the possible movement of the car, a pedestrian may be moving in the image even when it is not moving itself.
The Hausdorff tracker is based on calculating, at a target location, the Hausdorff distance of a set of image features to the image features of the tracker model. The Hausdorff tracker allows for some degree of discrepancy between the model set of features and the target set of features by randomly removing model features and randomly adding target features. It also allows for some degree of scaling of the object by calculating the target set of features at different scales. Because of the strong scaling of pedestrians caused by the motion of the car and the possible motion of the pedestrian (arm movement, orientation change), the Hausdorff tracker cannot reliably keep track of the pedestrian. What usually happens is that some time after the tracker is started, it locks on to a part of the pedestrian before it loses track of the pedestrian. An example of this for fir images is shown in figure 6.6. In addition, because there exists no reliable motion model of a pedestrian, the Kalman filter used in the Hausdorff tracker cannot accurately predict the location of the pedestrian in the next frame. This is what causes the tracker to lose track of its target.
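For reference, the directed Hausdorff distance underlying this matching can be sketched as follows (the point sets are, for example, edge pixel coordinates; the randomized partial matching described above is omitted):

import numpy as np

def directed_hausdorff(model_pts, target_pts):
    """h(A, B) = max over a in A of the distance to the nearest b in B.
    model_pts, target_pts: (n, 2) arrays of feature point coordinates."""
    diff = model_pts[:, None, :] - target_pts[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=2))     # pairwise distances
    return dists.min(axis=1).max()               # nearest neighbour, worst case

def hausdorff(a, b):
    """Symmetric Hausdorff distance: max of both directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))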

(a) On a positive classification, the tracker is started
(b) Tracking locks on to a part of the pedestrian
Figure 6.6: Example of the Hausdorff tracker in fir images.

To demonstrate that the tracker can in principle be used for tracking objects from a moving camera, it is applied to tracking cars. As can be seen in figure 6.7 and figure 6.8, the Hausdorff tracker can successfully track passing cars in fir images and grayscale images. Cars are easier to track than pedestrians for the following reasons: in contrast to pedestrians, cars do not change shape when they move; the relative speed difference between the camera and a moving car is much smaller than the speed difference between the camera and the pedestrian, so there is much less scaling of the object to be tracked; the resolution of cars in the image is usually higher than the resolution of pedestrians in the image; and the position and scale of a passing car can be estimated more accurately than the position and scale of a pedestrian.
(a) frame 1
(b) frame 100
(c) frame 170
Figure 6.7: Tracking a car in fir images with the Hausdorff tracker.

The Condensation tracker has related problems. It contains a set of fixed shapes of pedestrians which it uses for comparing target locations to. Because of the strong scaling of pedestrians caused by the motion of the car and the low resolution of the images, the comparison of the fixed shapes with target locations suffers from aliasing problems.
In addition, because three dimensions are tracked (horizontal position, vertical position, and scale), the number of samples needed in the Condensation algorithm gets prohibitively large. This effect is increased by the absence of a motion model: if there is no motion model which moves the samples somewhat in the right direction, an even larger number of randomly moving samples is required to achieve the same performance. If a large number of samples is used for tracking in three dimensions, the tracker gets too slow because of the time it takes to evaluate all the samples. If the number of samples is reduced to improve efficiency, the tracker rapidly loses the object it is tracking.

(a) frame 1
(b) frame 100
(c) frame 185
Figure 6.8: Tracking a car in grayscale images with the Hausdorff tracker.
The mean shift tracker is originally designed for tracking objects in color images. The reason for this is that 24-bit color images contain more information for tracking than 8-bit grayscale images. The distinction in histograms between the tracker model and the target object is usually not as large in 8-bit intensity images as it is in 24-bit color images. For this reason, it is difficult for the mean shift tracker to track objects in the low contrast 8-bit intensity images. Although the mean shift tracker can handle changes in scale of the object it tracks by increasing or decreasing the number of pixels it calculates a histogram over, it cannot handle large changes in the shape of the object it is tracking. The mean shift tracker locks on to a part of the object it is tracking and eventually loses track.
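The histogram comparison underlying the mean shift tracker can be sketched with the Bhattacharyya coefficient as follows (the bin count is an illustrative assumption, and the kernel weighting from section 4.2 is omitted):

import numpy as np

def intensity_histogram(patch, bins=32):
    """Normalized intensity histogram of an 8-bit image region."""
    h, _ = np.histogram(patch, bins=bins, range=(0, 256))
    return h / max(h.sum(), 1)

def bhattacharyya(p, q):
    """Similarity in [0, 1] between two normalized histograms; the mean
    shift tracker seeks the candidate location maximizing this value."""
    return float(np.sum(np.sqrt(p * q)))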
The trackers which integrate initial detection and classification outputs over time perform better than the Hausdorff tracker, the Condensation tracker, and the Mean shift tracker, which are based on image features. The advantage of the trackers which integrate initial detection outputs and classification outputs over time is that the initial detection routines and the classifiers which they are based on are designed and trained, respectively, for detecting pedestrians in a variety of shapes. The disadvantage of these trackers is that they cannot operate autonomously: they require the initial detection routines and the classification routines, respectively, to run.
The performance of the tracker which integrates initial detections through time is bounded by the performance of the initial detection routines. As figure 5.4 and figure 5.8 show, the performance of the initial detection routines applied to fir images is reasonable, and the performance of the initial detection routines applied to grayscale images is unacceptable. This severely limits the performance of the tracker which integrates initial detections over time. Much better performing is the tracker which integrates classification outputs over time. By interpreting the normalized output of the classifier at every position and scale in the image as a probability, the object position in the next frame can be estimated by locating the peak in classification output around the previous location and scale of the object to be tracked. Because the classification results are good on both fir images and grayscale images, the tracker which integrates classification outputs through time also performs well. One limitation of this method is that classification outputs are not smooth in time. This results in sudden changes in scale of the region of interest around the object to be tracked. A partial solution to this problem is to apply an alpha-beta filter on the coordinates of the region of interest output by the tracker, as in the sketch below. This is done for visualization purposes only; internally, the tracker stores the original region of interest.
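A minimal alpha-beta filter for one region of interest coordinate reads as follows (the gain values are illustrative assumptions; one such filter runs per box coordinate):

class AlphaBetaFilter:
    """Smooths one coordinate (e.g. ROI x, y, width, or height) with
    the classic alpha-beta predictor/corrector."""
    def __init__(self, x0, alpha=0.5, beta=0.1, dt=1.0):
        self.x, self.v = float(x0), 0.0
        self.alpha, self.beta, self.dt = alpha, beta, dt

    def update(self, measurement):
        # predict assuming constant velocity, then correct with residual
        x_pred = self.x + self.v * self.dt
        r = measurement - x_pred
        self.x = x_pred + self.alpha * r
        self.v = self.v + (self.beta / self.dt) * r
        return self.x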

6.1.4 Complete system performance

Although it seems attractive from a design point of view to have a system which performs initial detection to limit processing to promising parts of the image, performs classification on the output of the initial detection, and finally performs tracking on the positive classifications to stabilize detections over time, this does not work in practice. If the initial detection routines do not detect a large part of the pedestrians, the classifier and tracker never get the chance to process these initially undetected pedestrians. The system which classifies the whole image at every position and scale, described in section 3.6, combined with the tracker described in section 4.5 which uses classification outputs for tracking, performs much better.

When evaluating the performance of the pedestrian detection system, the system is usually compared to how well humans can detect pedestrians in these images. A simple pattern recognition system as described in this work has some severe limitations compared to the human vision system:

- Humans use knowledge about the world in object detection. For example, when a human sees a vertical object standing along the road at a large distance, the hypothesis is made that it is a pedestrian. This can be confirmed as soon as the object is near enough. The pedestrian detection system maintains no hypotheses about objects, but detects pedestrians as soon as they are large enough to be detected, without considering hypotheses made in the past.
- Humans can detect motion while they are moving themselves, for example when driving a car. Motion detection methods in computer vision, for example optic flow, cannot solve this problem well.
- Humans have no problem assigning object parts or object features to an object. In image processing, simple heuristics or exhaustive search is required to assign object parts or object features to an object.

The real-time constraint of the system is satisfied quite well. When combining the initial detection routine based on vertical gradients with a support vector machine trained on histograms of gradients and orientations features and the support vector reduction algorithm from section 3.4.3, and assuming the initial detection delivers 20 regions of interest per frame, the processing time per frame is 7 milliseconds + 20 x (0.7 + 0.2) milliseconds = 25 milliseconds on a 1470 MHz AMD Athlon processor. Tracking is not measured in this case because the computation times of the initial detection based tracker and the tracker based on classification outputs are almost zero. This gives a frame rate of 40 frames per second. On a 1 GHz processor, this gives approximately a frame rate of 27 frames per second.

6.2 Further work

To improve the initial detection, the use of distance information would bring the most progress. Usually, pedestrians are closer to the camera than background structures. This can be used in segmentation. Distance information is available from stereo cameras or from distance measurement sensors. Much work has been done on the use of stereo cameras in pedestrian detection [42], [45], and [20]. In addition, the use of distance information makes classification easier because the class of negative examples is limited to non-pedestrian objects at a certain distance from the camera.

When the car is not moving because it is waiting for a traffic light, the number of pedestrians crossing the road in front of the car in the image may be large. In this case, motion information from optic flow or background subtraction can be used for initial detection.
For a human, pedestrians are easier to detect in color images than in grayscale images. Color information may also be helpful for a pedestrian detection system. For example, the gradients calculated from color images may be stronger than the gradients calculated from grayscale images.
Another possible improvement for initial detection is, instead of scanning a vertical edge detector through the image at every position and scale as described in section 2.2.2, to have the scanning window test whether there is any vertical gradient response at the current location and scale. The advantage of such a method is that it generates more regions of interest than the initial detection routine which scans a vertical edge detector through the image at every position and scale. In addition, the number of required classifications is reduced compared to the method which scans a classifier through the image at every scale and position.
The most important improvement for the classification would be to reduce the false positive rate to zero while keeping the true positive rate constant. It is, however, unlikely to find an image feature/classifier combination which achieves a zero false positive rate while maintaining a high true positive rate on the low resolution images used in this work. The largest performance improvement can be achieved by using high resolution, high quality (with respect to contrast, reflections, overblending) image data. It is recommended to use a resolution of at least 640x480 pixels. First experiments indicate that this leads to better classification performance. In addition, a very large set of several thousands of pedestrian and non-pedestrian images should be available for the training and the optimization of the classification to get good generalization performance. To improve classification performance, another possibility would be to test other image features for classification, for example Gabor features, although they may be difficult to apply in real-time.
In order to reliably track moving pedestrians from a fast moving camera in real-time, a tracker is required which is flexible with respect to shape change, large translation, and scaling of the object it tracks. The most promising seems to be a shape adaptive method as applied in [2]. This method has the disadvantages that it cannot be used for very small pedestrians because of aliasing problems, and that it is difficult to apply in real-time. The use of higher resolution images makes the use of a method like this more feasible.
The problem of detecting occluded pedestrians or pedestrians in a group occluding each other is an even more difficult problem than detecting single pedestrians. One approach that could segment an occluded pedestrian is a shape based segmentation method with active contour models [9], [12]. This approach cannot be applied for general segmentation in images, however: the initial contour must be around the object to be segmented. Also, it is not possible to apply this method in real-time at 20 frames per second on moderate hardware. Shape based segmentation with active contour models is more suitable for segmentation in medical images.


It appeared that the cycle of initial detection, classification, and tracking is not very successful for pedestrian detection. It would be interesting to adapt the algorithms to other problems like car detection or face detection to compare the results to pedestrian detection.

6.3 Summary and conclusions

This work describes a system for detecting pedestrians in fir images and grayscale images. The pedestrian detection system consists of three main components:

- an initial detection component which locates regions of interest in the image,
- a classification component which classifies the regions of interest from the initial detection routines as pedestrians or non-pedestrians,
- and a tracking component which keeps track of objects through successive frames.

The initial detection component consists of the following three methods:

- a temperature based method for fir images which uses image brightness for segmentation,
- a method based on clustering vertical gradients for fir images and grayscale images,
- and a method based on scanning a vertical edge detector through the image for fir images and grayscale images.


The classification component applies neural networks and support vector machines in combination with the following image features for fir images and grayscale images:

- rectangle features,
- image gradients calculated with a Sobel-like filter,
- and orientations of the image gradients.

The following optimization methods are applied to improve the classification performance and the forward propagation speed of the classification:

- reduction of the number of support vectors,
- and feature selection with:
  - principal component analysis,
  - Adaboost,
  - multi-objective optimization.

The tracking component consists of the following tracking methods:

- a tracker based on the Hausdorff distance,
- a mean shift tracker,
- a Condensation tracker,
- a tracker based on integrating initial detections through time,
- and a tracker based on integrating classification outputs through time.


The most important contributions of this work are:

- the development of three new initial detection routines for pedestrian detection, one based on intensity information, the other two based on gradient information,
- the systematic evaluation of classifier/image feature combinations for pedestrian classification in fir and grayscale images,
- an evaluation of the following classifier optimization algorithms on classification performance and forward propagation speed of the classifiers:
  - a support vector reduction algorithm,
  - principal component analysis for feature selection,
  - Adaboost for feature selection,
  - and multi-objective optimization for feature selection,
- the evaluation of three existing tracking algorithms for tracking pedestrians in real-time in fir and grayscale images:
  - a tracker based on the Hausdorff distance,
  - a Mean shift tracker,
  - a tracker based on Condensation,
- the development of two tracking algorithms which are based on integrating classification outputs and initial detection outputs over time. The tracker which integrates classification outputs over time is a modification of the tracker based on Condensation from [32]; the difference is that the estimation, and the confirmation of the estimation from the image data, are performed in a different way.


From this work, the most important conclusions are:

- The initial detection methods proposed in this work give reasonable results on fir images. The results on grayscale images are not acceptable.
- The classification methods deliver good results on fir images as well as on grayscale images. However, although the false positive rate is low, it is still too high for a real system.
- The classification results on fir images are comparable to the classification results on grayscale images.
- The support vector machine classification results are generally a bit better than the neural network classification results.
- The optimization methods for classification generally improve the classification performance and the classification forward propagation speed. The component based classification gives the best results.
- The system which scans a classifier through the image at every position and scale performs better than the initial detection/classification combination. However, with approximately 5000 classifications per image this method is not applicable in real-time, and its false positive rate is too high for use in a real system.
- The tracking methods which track image features are not successful in tracking pedestrians. They are able to track objects in simpler situations, like cars in grayscale images and fir images from a moving camera. The tracker which integrates classification outputs over time performs much better.
- Integrating detections over time for the stabilization of positive detections does not improve detection results, because false positives are to a large degree persistent in time.


- The applied image resolution of 320x240 pixels is too low to deliver optimal results, especially for classification. It is recommendable to use a resolution of at least 640x480 pixels.

Bibliography

[1] A. Baumberg and D. Hogg. Learning flexible models from image sequences. In ECCV (1), pages 299–308, 1994.

[2] A.M. Baumberg and D.C. Hogg. An efficient method for contour tracking using active shape models. Technical report.

[3] M. Bertozzi, A. Broggi, A. Fascioli, A. Tibaldi, R. Chapuis, and F. Chausse. Pedestrian localization and tracking system with Kalman filtering. IEEE Intelligent Vehicles Symposium 2004, pages 584–589, 2004.

[4] M. Bertozzi, A. Broggi, P. Grisleri, A. Tibaldi, T. Graf, and M. Meinecke. Pedestrian detection in infrared images. Proceedings of the IEEE Intelligent Vehicles Symposium 2003, pages 662–667, 2003.

[5] D. Beymer, P. McLauchlan, B. Coifman, and J. Malik. A real-time computer vision system for measuring traffic parameters. IEEE Conference on Computer Vision and Pattern Recognition, pages 495–501, 1997.

[6] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[7] A. Broggi, M. Bertozzi, A. Fascioli, and M. Sechi. Shape-based pedestrian detection. Proceedings of the IEEE Intelligent Vehicles Symposium 2000, pages 215–220, 2000.

[8] C.J.C. Burges. Simplified support vector decision rules. International Conference on Machine Learning, pages 71–77, 1996.

[9] T. Chan and W. Zhu. Level set based shape prior segmentation. Technical Report 03-66, Computational Applied Mathematics, UCLA, Los Angeles, 2003.

[10] T.F. Chan and L.M. Vese. Active contours without edges. IEEE Transactions on Image Processing, 10(2):266–277, 2001.

[11] D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking of non-rigid objects using mean shift. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 142–151, 2000.

[12] D. Cremers, T. Kohlberger, and C. Schnörr. Nonlinear shape statistics in Mumford–Shah based segmentation. In A. Heyden et al., editors, European Conference on Computer Vision (ECCV), volume 2351 of LNCS, pages 93–108, Copenhagen, May 2002. Springer.

[13] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

[14] C. Curio, J. Edelbrunner, T. Kalinke, C. Tzomakas, and W. von Seelen. Walking pedestrian recognition. IEEE Transactions on Intelligent Transportation Systems, 1(3), 2000.

[15] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. IEEE Conference on Computer Vision and Pattern Recognition, 1:886–893, 2005.

[16] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm. IEEE Transactions on Evolutionary Computation, 6(2):182–197, 2002.

[17] C. Emmanouilidis, A. Hunter, and J. MacIntyre. A multiobjective evolutionary setting for feature selection and a commonality-based crossover operator. Proceedings of the 2000 Congress on Evolutionary Computation, pages 309–316, 2000.

[18] P.F. Felzenszwalb and D.P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.

[19] Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. Proceedings of the International Conference on Machine Learning (ICML), pages 148–156, 1996.

[20] D.M. Gavrila, J. Giebel, and S. Munder. Vision-based pedestrian detection: The PROTECTOR system. Proceedings of the IEEE Intelligent Vehicles Symposium (IV 2004), 2004.

[21] D.M. Gavrila and V. Philomin. Real-time object detection for "smart" vehicles. Proceedings of the IEEE International Conference on Computer Vision, 1:87–93, 1999.

[22] C. Goerick, D. Noll, and M. Werner. Artificial neural networks in real-time car detection and tracking applications. Pattern Recognition Letters, 17:335–343, 1996.

[23] B. Heisele and C. Wöhler. Motion-based recognition of pedestrians. Proceedings of the Fourteenth International Conference on Pattern Recognition, 2:1325–1330, 1998.

[24] B.K.P. Horn and B.G. Schunck. Determining optical flow. MIT A.I. Memo, (572), 1980.

[25] M. Isard and A. Blake. Condensation - conditional density propagation for visual tracking. International Journal of Computer Vision, 29:5–28, 1998.

[26] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. Pages 259–269, 1987.

[27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

[28] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(4):349–361, 2001.

[29] D. Mumford and J. Shah. Optimal approximation by piecewise smooth functions and associated variational problems. Communications on Pure and Applied Mathematics, 42:577–685, 1989.

[30] H. Nanda and L. Davis. Probabilistic template based pedestrian detection in infrared videos. IEEE Intelligent Vehicles Symposium, 2002.

[31] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio. Pedestrian detection using wavelet templates. Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 193–199, 1997.

[32] C. Papageorgiou. A trainable system for object detection in images and video sequences. PhD thesis, Massachusetts Institute of Technology, 2000.

[33] M. Pei, E.D. Goodman, and W.F. Punch. Feature extraction using genetic algorithms. Technical report, Michigan State University, June 1997.

[34] V. Philomin, R. Duraiswami, and L. Davis. Pedestrian tracking from a moving vehicle. Proceedings of the IEEE Intelligent Vehicles Symposium 2000, pages 350–355, 2000.

[35] H.A. Rowley, S. Baluja, and T. Kanade. Rotation invariant neural network-based face detection. IEEE Conference on Computer Vision and Pattern Recognition, pages 38–44, 1998.

[36] S. Romdhani, P. Torr, B. Schölkopf, and A. Blake. Computationally efficient face detection. Proceedings of the 8th International Conference on Computer Vision, 1:695–700, 2001.

[37] A. Shashua, Y. Gdalyahu, and G. Hayun. Pedestrian detection for driving assistance systems: Single-frame classification and system level performance. Proceedings of the IEEE Intelligent Vehicles Symposium (IV 2004), 2004.

[38] M. Szarvas, A. Yoshizawa, M. Yamamoto, and J. Ogata. Pedestrian detection with convolutional neural networks. Proceedings of the IEEE Intelligent Vehicles Symposium 2005, pages 224–229, 2005.

[39] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1:511–518, 2001.

[40] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. Proceedings of the International Conference on Computer Vision (ICCV), 2003.

[41] M. Werner. Objektverfolgung und Objekterkennung mittels der partiellen Hausdorff-Distanz, 1998.

[42] C. Wöhler, J.K. Anlauf, T. Pörtner, and U. Franke. A time delay neural network algorithm for real-time pedestrian recognition. IEEE International Conference on Intelligent Vehicles, pages 247–251, 1998.

[43] F. Xu, X. Liu, and K. Fujimura. Pedestrian detection and tracking with night vision. IEEE Transactions on Intelligent Transportation Systems, 6(1):63–71, 2005.

[44] J. Yang and V. Honavar. Feature subset selection using a genetic algorithm. IEEE Intelligent Systems, 13:44–49, 1998.

[45] L. Zhao and C.E. Thorpe. Stereo- and neural network-based pedestrian detection. IEEE Transactions on Intelligent Transportation Systems, 1(3):298–303, 2000.

Curriculum Vitae

20.02.1977         Born in Drachten

09.1989 - 06.1997  Atheneum with university-entrance diploma in Drachten

09.1997 - 11.2002  Studies in Artificial Intelligence at the University of Amsterdam

02.2003 - 04.2006  Research assistant, Institut für Neuroinformatik, Ruhr-Universität Bochum
