
Real-time pedestrian detection in FIR

and grayscale images

Dissertation
zur Erlangung des Grades eines
Doktor-Ingenieurs (Dr.-Ing.)
an der Fakultät
für Elektrotechnik und Informationstechnik
der Ruhr-Universität Bochum

vorgelegt von

Aalzen Jaap Wiegersma


Drachten

Bochum 2006

Contents

1 Introduction  10
1.1 Problem description  10
1.2 Object detection in computer vision  12
1.2.1 Segmentation  12
1.2.2 Classification  17
1.2.3 Tracking  18
1.3 Related work in pedestrian detection  19
1.3.1 Initial detection in grayscale images  19
1.3.2 Initial detection in fir images  21
1.3.3 Classification  22
1.3.4 Tracking  24
1.4 Selected approach  25
1.4.1 Image preprocessing  26
1.4.2 Initial detection  26
1.4.3 Classification  27
1.4.4 Tracking  28
1.5 Contributions  29
1.6 Outline of this thesis  31

2 Initial detection  32
2.1 Image pre-processing  33
2.1.1 Image enhancement  33
2.1.2 Image features  35
2.2 Gradient based segmentation  39
2.2.1 Grouping vertical gradients  39
2.2.2 Scanning a vertical edge detector through the whole image  45
2.3 Temperature segmentation in fir images  48

3 Classification  52
3.1 Classifiers  52
3.1.1 Neural networks  53
3.1.2 Support vector machines  55
3.2 Image features for classification  56
3.2.1 Rectangle features  57
3.2.2 Histograms of gradients and orientations features  57
3.3 Training and validation of the classifier  58
3.3.1 Creating training datasets  58
3.3.2 Training neural networks  60
3.3.3 Training support vector machines  60
3.4 Optimization of the classification  61
3.4.1 Bootstrapping  61
3.4.2 Component classification  62
3.4.3 A simplified approximation to the support vector machine decision rule  62
3.5 Feature selection  63
3.5.1 Principal component analysis  63
3.5.2 Adaboost  65
3.5.3 Multi-objective optimization  66
3.6 Scanning a classifier through the image at every position and scale  69

4 Tracking  72
4.1 Tracking using the Hausdorff distance  73
4.2 Mean shift tracking  75
4.3 Tracking with Condensation  77
4.4 Integrating initial detections through time  80
4.5 Tracking the classification output  81

5 Experimental results  83
5.1 Initial detection  83
5.1.1 Results on fir images  86
5.1.2 Results on grayscale images  89
5.2 Classification  92
5.2.1 Results on fir images  94
5.2.2 Results on grayscale images  104
5.2.3 Results of component classification  109
5.2.4 Results of classifier optimization on classification performance  109
5.2.5 Results of classifier optimization on classification speed  112
5.2.6 Scanning a classifier through the image at every position and scale  115
5.3 Tracking  115
5.3.1 Results on fir images  116
5.3.2 Results on grayscale images  118

6 Discussion  119
6.1 Achievements and limitations  119
6.1.1 Initial detection  120
6.1.2 Classification  124
6.1.3 Tracking  130
6.1.4 Complete system performance  136
6.2 Further work  137
6.3 Summary and conclusions  140

List of Figures

1.1 Examples of pedestrian detection.  10
2.1 Rectangle filters.  36
2.2 Calculation of the rectangle sum.  37
2.3 Image gradients and energy.  38
2.4 Grouping vertical gradients in a fir image.  41
2.5 Grouping vertical gradients in a grayscale image.  42
2.6 Use of the gradient sign for initial detection.  42
2.7 Combining regions of interest.  44
2.8 A scanning window.  46
2.9 Scanning an edge detector through the image at every position and scale.  47
2.10 Scanning an edge detector through the image at every position and scale.  48
2.11 Temperature segmentation in fir images.  49
2.12 Combining regions of interest.  50
3.1 Rectangle features for classification.  57
3.2 Calculation of image features in subregions.  58
3.3 Subregions for component classification.  62
3.4 Classifying the whole image at every scale and resolution.  71
4.1 Model contour of the Condensation tracker.  79
4.2 Integrating initial detections through time.  81
5.1 Matching ground truth data.  84
5.2 Suitability of initial detections for classification.  85
5.3 Fir images at different temperature ranges.  87
5.4 Initial detection results.  88
5.5 Percentage positive detections.  88
5.6 Calculation times of initial detection for fir images.  89
5.7 Illumination changes in grayscale images.  90
5.8 Initial detection results in grayscale images.  91
5.9 Calculation times of initial detection grayscale.  92
5.10 An example ROC curve.  93
5.11 Results of support vector machine classification orientations features.  96
5.12 Results of support vector machine classification gradients and orientations features.  97
5.13 Results of support vector machine classification rectangle features.  98
5.14 Results of neural network classification orientations features.  100
5.15 Results of neural network classification gradients and orientations features.  101
5.16 Results of neural network classification rectangle features.  102
5.17 Processing times feature vector calculation.  103
5.18 Classification times of classifier/image feature combination.  104
5.19 Classification results grayscale images orientations features.  106
5.20 Classification results grayscale images gradients orientations features.  107
5.21 Classification results grayscale images rectangle features.  108
5.22 Results of component classification.  109
5.23 Results of PCA feature selection.  110
5.24 Results of Adaboost feature selection.  111
5.25 Results of multi-objective optimization feature selection.  111
5.26 Classification times of optimization algorithms.  113
5.27 Classification times of optimization algorithms.  114
5.28 Pedestrian tracking in fir images.  117
5.29 Processing times tracking algorithms.  118
6.1 Difficulties for initial detection because of vertical structures in the background.  122
6.2 Difficulties for initial detection because of many vertical structures in background.  122
6.3 Pedestrian pushing a stroller.  123
6.4 A group of pedestrians in a fir image.  124
6.5 Persistence of false positives over time.  127
6.6 Example of Hausdorff tracker in fir images.  132
6.7 Tracking a car in fir images with the Hausdorff tracker.  133
6.8 Tracking a car in grayscale images with the Hausdorff tracker.  134

List of Algorithms

1 Connected components labeling.  40
2 Grouping vertical gradients  45
3 Scanning a vertical edge detector  47
4 Temperature segmentation in fir images  51
5 Adaboost. Algorithm from [39].  65
6 Assignment of a rank to an individual. Algorithm from [16].  68
7 Crowding-distance-assignment. Algorithm from [16].  69
8 Scanning a classifier through the image at every scale and location.  70
9 The Hausdorff tracker.  75
10 Mean shift tracking  77
11 The Condensation algorithm.  79
12 Initial detection tracker.  80
13 Tracking the classification output.  82

Chapter 1
Introduction
1.1 Problem description

This thesis describes a complete system for the detection of pedestrians in monocular camera images and monocular fir (far infrared) images recorded from a moving car. The objective of this work is to have a system that can warn the driver of the car when a pedestrian crosses the street. Two examples of a street scene and the output of the detection system are shown in figure 1.1. There are a few important properties such a system should have:

(a) FIR pedestrian detection

(b) Pedestrian detection in a grayscale image

Figure 1.1: Examples of pedestrian detection.


- It should be very reliable with respect to false detections; it should have a high detection rate and a low false alarm rate.

- It should be very reliable with respect to environment conditions; it should be able to operate in a complex city environment as well as on a rural road.

- It should be very reliable with respect to weather conditions; it should be able to operate under various weather conditions, for example high temperatures, low temperatures, rain, and snow.

- It should be able to operate reliably in a dynamic environment; both the car and the pedestrians are usually moving.

- It should be able to operate in real-time on moderate hardware. The aim of the system presented in this work is to run at at least 20 frames per second on a 1 GHz desktop processor. This is roughly the processing power that can be expected to be available for driver assistance systems in cars in the next years.

Developing a system which satisfies all of these properties is not trivial. Therefore, the system described here is divided into four components:

- An image pre-processing component which calculates relevant features from the image.

- An initial detection component which generates regions of interest in the image. Several initial detection routines can be used for different environment conditions. The initial detection routine can also have several parameter setups for different weather conditions.

- A classification component which determines which of the regions of interest generated by the initial detection routine contain pedestrians. The quality of the classification component influences the number of false detections of the system.

- A tracking component which keeps track of pedestrians in successive frames once they have been positively classified by the classification component. The tracking component should be able to handle the motion of the car and the motion of the pedestrian. The tracking component makes it possible to stabilize detections over time and to predict the time of contact in case of a possible collision.

This thesis describes a system that can be used for pedestrian detection in grayscale camera images and for pedestrian detection in far infrared images. The fir detection system described here is used by BMW as a reference system for testing commercial systems.

1.2 Object detection in computer vision

Computer vision research is applied in fields of computer science like robotics and medical image analysis, in industrial applications like driver assistance systems and assembly, and in law enforcement applications like face detection and finger print comparison. This section gives a short overview of how the methods in this thesis relate to other fields of computer vision.

1.2.1 Segmentation

The purpose of segmentation is to extract interesting regions from the image. An example of segmentation in an image recorded from a medical scanner is locating the exact boundaries of an organ in the image. An example of segmentation in pedestrian detection in fir images is locating bright image regions. In this thesis, the term initial detection is used for segmenting regions of interest from the image. The remainder of this section describes segmentation methods which are applied in various fields of computer vision.
Color is a cue which can be used for segmentation, for example by grouping pixels of similar color value into a region. An application of color segmentation is for example a robot vision system which segments objects on a table for grabbing. An example of a color based segmentation method is graph based segmentation. In graph based segmentation, the image to be segmented is represented as a graph G = (V, E) where each node v_i \in V corresponds to a pixel in the image, and the edges (v_i, v_j) \in E connect pairs of neighboring pixels. Each edge in the graph has a weight w((v_i, v_j)) which is a measure of dissimilarity in color between the two pixels connected by that edge. A segmentation S is a partition of V into components where each component C \in S is a connected component in a graph G' = (V, E'), where E' \subseteq E. The edges between two pixels inside a component should have relatively low weights and the edges between pixels in different components should have relatively high weights. An example of a graph based segmentation method is [18].
Motion also provides information that can be used for segmentation. One motion based segmentation method for static cameras is background subtraction. In background subtraction, an image of the background is stored. In successive camera frames, the current image is subtracted from the stored image. A difference between the current image and the stored image indicates that something has moved at the pixels in the image where the difference is unequal to zero. Usually, the stored image of the background needs to be updated frequently because of changes in illumination conditions over time. Applications of background subtraction are for example people detection in security camera images and traffic monitoring.
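A minimal sketch of this idea (not a method used in this thesis), where the background is maintained as a running average and moving pixels are extracted by thresholding the difference; the update rate alpha and the threshold are illustrative parameters:

```python
import numpy as np

def background_subtraction(frames, alpha=0.05, threshold=25):
    """Yield a binary motion mask per frame using a running-average background."""
    background = None
    for frame in frames:  # frame: 2D numpy array of gray values
        gray = frame.astype(np.float32)
        if background is None:
            background = gray.copy()
        # pixels whose difference to the background exceeds the threshold are moving
        mask = np.abs(gray - background) > threshold
        # slowly blend the current frame into the stored background
        # to follow illumination changes over time
        background = (1 - alpha) * background + alpha * gray
        yield mask
```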
Optic flow [24] is the apparent motion of brightness in the image. Like background subtraction, optic flow can be used for object segmentation. If the image intensity at a point p in the image at time t is I(p_x, p_y, t), the derivative dI/dt with respect to t is

\frac{dI}{dt} = \frac{\partial I}{\partial p_x}\frac{dp_x}{dt} + \frac{\partial I}{\partial p_y}\frac{dp_y}{dt} + \frac{\partial I}{\partial t}\frac{dt}{dt}.

The assumption made in the calculation of the optic flow is that the intensity of a point p does not change over time, so dI/dt = 0. This gives the first optic flow constraint:

-\frac{\partial I}{\partial t} = \frac{\partial I}{\partial p_x}\frac{dp_x}{dt} + \frac{\partial I}{\partial p_y}\frac{dp_y}{dt}. \qquad (1.1)

The second assumption made in optic flow is that it is constant in a small region S_p around p in the image over time. So measuring the gradients in a region S_p makes it possible to solve for the optic flow vector. The optic flow in each point in this region is constrained by (1.1). The optic flow v = (dp_x/dt, dp_y/dt) can be calculated by minimizing

E_p = \sum_{x, y \in S_p} \left( \frac{\partial I}{\partial p_x} v_x + \frac{\partial I}{\partial p_y} v_y + \frac{\partial I}{\partial t} \right)^2.
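A minimal least-squares sketch of this minimization at a single pixel, assuming numpy; the central differences for the derivatives, the window size, and the border handling are illustrative choices, not the method used in this thesis:

```python
import numpy as np

def lucas_kanade_flow(img1, img2, x, y, half_window=2):
    """Estimate the optic flow v = (vx, vy) at (x, y) by minimizing E_p
    over a small window S_p (least-squares solution of constraint (1.1))."""
    img1 = img1.astype(np.float32)
    img2 = img2.astype(np.float32)
    # central-difference spatial gradients and temporal derivative
    Ix = (np.roll(img1, -1, axis=1) - np.roll(img1, 1, axis=1)) / 2.0
    Iy = (np.roll(img1, -1, axis=0) - np.roll(img1, 1, axis=0)) / 2.0
    It = img2 - img1
    s = (slice(y - half_window, y + half_window + 1),
         slice(x - half_window, x + half_window + 1))
    A = np.stack([Ix[s].ravel(), Iy[s].ravel()], axis=1)  # one constraint per pixel in S_p
    b = -It[s].ravel()
    v, *_ = np.linalg.lstsq(A, b, rcond=None)  # minimizes E_p
    return v  # (vx, vy)
```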

An application of optic flow in a static camera environment is the segmentation of moving objects from the scene. At locations where the magnitude of the optic flow vector is larger than zero, there is a moving object. The direction of the vector indicates in which direction the object is moving. An application of optic flow is the detection of moving people from a static camera, for example from a security camera or from a standing car. An application of optic flow in a moving camera environment is obstacle avoidance. The whole scene recorded from a moving camera appears to be moving because of the movement of the camera itself. Objects closer to the camera generate a stronger flow vector field magnitude than objects further away from the camera because of motion parallax. To avoid a collision with an object in the scene, there should be a movement into the direction where the flow vector field has a low magnitude.
Another computer vision technology that can be used for object segmentation is stereo vision. From two calibrated cameras, a disparity map can be calculated which provides relative distance information from pixels in the image to the camera. A disparity map can be calculated using a correlation measure based on for example intensity values. The correlation measure is used to find the corresponding position of a gray level value from the first image in the second image. The disparity is the difference in position of a gray level value. A large disparity of an image region means the corresponding object in that region is relatively closer to the camera than an object in an image region with a smaller disparity. The disparity value of an image region can be used for segmenting it from the image. Example applications of stereo vision are pedestrian detection from a moving car [42], [45], [20], and obstacle detection in robotics.
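As an illustration of correlation-based disparity calculation, a deliberately naive sum-of-absolute-differences sketch; it assumes rectified images, and the block size and disparity range are hypothetical parameters:

```python
import numpy as np

def disparity_block_matching(left, right, max_disparity=32, block=5):
    """For each pixel, find the horizontal shift along the scanline that
    minimizes the sum of absolute differences (SAD) between blocks."""
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    h, w = left.shape
    r = block // 2
    disparity = np.zeros((h, w), dtype=np.int32)
    for y in range(r, h - r):
        for x in range(r + max_disparity, w - r):
            patch = left[y - r:y + r + 1, x - r:x + r + 1]
            costs = [np.abs(patch - right[y - r:y + r + 1,
                                          x - d - r:x - d + r + 1]).sum()
                     for d in range(max_disparity)]
            disparity[y, x] = int(np.argmin(costs))  # best-matching shift
    return disparity
```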
The Active Contour Model [26] is a method which detects contours (edges) in an image. A snake is initialized around the object to be segmented and moves along its interior normal until it stops at the edge of the object. A contour (snake) is a spline represented by a vector v(s) = (x(s), y(s)) where s is the arc length. Finding a contour in an image means minimizing an energy function:

E_{Snake} = \int_0^1 E_{int}(v(s)) + E_{image}(v(s)) + E_{ext}(v(s)) \, ds

where E_{int} represents the internal energy (bending and discontinuity) of the contour, E_{image} represents the forces caused by the image, and E_{ext} represents the external forces caused by a higher level process. The energy function E_{Snake} is minimized using variational calculus. The Active Contour Model is much used in medical image processing, for example for the segmentation of organs in images recorded by a medical scanner.
An alternative to the spline representation of Active Contour Models is the Level Set representation [10], which can detect contours with discontinuities. The Active Contour Model from [10] is based on the Mumford-Shah functional [29] for image segmentation. The energy function E(c_1, c_2, \phi) for a piecewise smooth image is:

E(c_1, c_2, \phi) = \int_\Omega (f - c_1)^2 H(\phi) + \int_\Omega (f - c_2)^2 (1 - H(\phi)) + \mu \int_\Omega |\nabla H(\phi)|

where \Omega is the image domain, f is the image, c_1 and c_2 are the average values of f inside and outside the object respectively, H(\phi) is the Heaviside function, and \mu is a constant for weighting the length term of the contour. The energy function E(c_1, c_2, \phi) is minimized using methods from variational calculus. An example application of the Level Set representation of the Active Contour Model is segmentation in images recorded from a medical scanner. An improvement to the Active Contour Models is to add a shape prior term E_{shape} to the energy function. In this way, a prior shape of the object to be segmented can be used to improve the segmentation results. Examples of the use of a shape prior term in the energy function are [12, 9].

1.2.2 Classification

The purpose of classification in the context of computer vision is to learn to distinguish between images of a target class of objects and images of a non-target class of objects. A classifier is usually trained on a set of target objects and a set of non-target objects so that it can estimate the class of an unseen image example. The training is done on a database of target images and a database of non-target images which are given a class label by a human observer. From the labeled target and non-target images the classifier learns to distinguish between the two classes.

Commonly used classifiers for computer vision applications are nearest neighbour classifiers, neural networks, and support vector machines. In a two class classification problem, a k-nearest neighbour classifier selects the k example points closest to the example being classified and votes between those example points to determine the class of the example. Neural networks and support vector machines are explained in section 3.1.1 and section 3.1.2 respectively.
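A minimal sketch of the k-nearest neighbour vote just described, assuming feature vectors and labels stored in numpy arrays:

```python
import numpy as np

def knn_classify(train_features, train_labels, example, k=5):
    """Two class k-nearest-neighbour vote.
    train_features: (n, d) array, train_labels: (n,) array of 0/1."""
    distances = np.linalg.norm(train_features - example, axis=1)
    nearest = np.argsort(distances)[:k]           # indices of the k closest examples
    votes = np.bincount(train_labels[nearest], minlength=2)
    return int(np.argmax(votes))                  # majority class among the neighbours
```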
An important choice is which kind of image features are used for classification. Usually, the gray values of an image are not directly used for classification because they are strongly dependent on the illumination conditions. First, features are calculated from the image, and these features are used as input to the classifier. Usually, the features used are some kind of gradient response, for example gradients from a Sobel filter, the orientations of the gradients, or wavelets. Example applications of neural networks and support vector machines are face detection [35] and car detection [22]. In section 3, the use of classifiers for pedestrian detection is described in detail.

1.2.3 Tracking

The purpose of tracking in the context of computer vision is usually to determine some properties (usually position and scale) of an object in a next time step or next image frame. Tracking usually involves an estimation and a confirmation of the estimation. Tracking makes it possible to integrate information about an object through time, and it makes it possible to estimate the position and scale of an object in the near future. This can be useful for estimating time to contact, for example. In addition, this makes it possible to limit processing in the next frame to the area in the image around where the tracker estimated the object to be. This is important in real-time image processing, for example.

Two commonly used methods for estimating the state of a system from measurements are the Kalman filter and Condensation [25]. They consist of two main steps: a prediction step based on a dynamical model of the object which estimates the state (often position in computer vision) of the object in the next time step, and an update step which updates the prediction based on some measurement. The main difference between Condensation and a Kalman filter is that Condensation is based on factored sampling, which does not require a Gaussian measurement density. This property makes it possible to track objects in a cluttered scene. An example application of tracking with a Kalman filter is vehicle tracking from a static camera [5]. An example application of Condensation is tracking humans from a static camera [25].
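The following sketch illustrates the generic predict/update structure with factored sampling; it is a simplified illustration, not the Condensation variant applied later in this thesis, and the zero-order motion model and noise level are assumptions:

```python
import numpy as np

def condensation_step(samples, weights, measure, motion_noise=2.0):
    """One predict/update cycle of a factored-sampling tracker.
    samples: (n, d) array of state hypotheses, weights: (n,) normalized weights,
    measure: function mapping a state to an observation likelihood."""
    n = len(samples)
    # resample according to the current weights (factored sampling)
    indices = np.random.choice(n, size=n, p=weights)
    samples = samples[indices]
    # predict: apply the dynamical model; here a zero-order model plus noise
    samples = samples + np.random.normal(0.0, motion_noise, samples.shape)
    # update: reweight each hypothesis by its measurement likelihood
    weights = np.array([measure(s) for s in samples])
    return samples, weights / weights.sum()
```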
Color can also be used as a feature for tracking. The Mean Shift Tracker [11] is based on color histograms. The Mean Shift Tracker is described in detail in section 4.2. An example application of the Mean Shift Tracker is face tracking in color images.

1.3 Related work in pedestrian detection

Recently, there has been a lot of interest in driver assistance applications like lane detection, car detection, traffic sign recognition, and pedestrian detection. The reasons for this are the desire to make traffic safer, to make driving more comfortable for drivers, and the recent technological progress in camera systems and computer hardware which makes this research possible. At the time of writing, several driver assistance systems are commercially available, for example a lane departure warning system from Citroen and a fir pedestrian detection system from Honda. This section gives an overview of relevant other pedestrian detection research.

1.3.1 Initial detection in grayscale images

In [42], [45], and [20], stereo vision is used to generate hypotheses of pedestrians. A depth map is calculated from which foreground objects can be segmented using range thresholding. The advantage of using stereo is that processing can be limited to a number of objects close to the car. This speeds up further processing and limits the number of false detections. A disadvantage is that much processing capacity or special hardware is required to calculate the disparity map. Two cameras are needed for the detection process, making it more expensive than monocular processing. Also, it is not clear if it is possible to keep the cameras calibrated over a period of many years for use in everyday driving.

A different approach is shape-based detection. A shape matching method for initial detection in monocular images is presented in [21]. In order to detect a pedestrian in a grayscale image, a hierarchy of edge templates of pedestrians is matched with a distance transform image calculated from a feature image of the original grayscale image. Templates of pedestrians in different poses are grouped together in prototypes. The grouping of templates is done at multiple levels, resulting in a hierarchy. The leaves of the hierarchy contain all templates and the nodes of the hierarchy contain the prototypes. The hierarchy of templates is scanned at a coarse-to-fine scale through the image. If a template at a certain node in the hierarchy matches (the distance measure between template and image is below a certain threshold), the templates under the leaf are processed. In [32], saliency maps calculated from intensity, color, and orientation features are used for initial detection.
In [7] and [3], a combination of a shape-based method and stereo is used for initial detection. An edge image is calculated from a grayscale image. Stereo vision is used to eliminate background edges. A symmetry map is calculated from the vertical foreground edges. A bounding box is created around symmetrical vertical objects and a search for the head of the pedestrian is performed through matching a head model. Further processing discards bounding boxes which do not have the correct width to height ratio or which are too homogeneous to contain a pedestrian. In [14], initial detection is performed by calculating image entropy. In image areas with much structure, a template is matched with image contours. Depth information from stereo vision is used to adjust the size of the template for matching image contours. A restriction of using image features like edges for initial detection is that good quality image features are required. Often, it is difficult to obtain good image features from low contrast grayscale images.

One important cue humans use for object detection is motion. Humans are good at detecting relative motion, for example detecting a moving pedestrian from a moving car. Motion is generally not a good cue for a real-time pedestrian detection system: the calculation of motion information using optical flow is expensive and does not provide a good segmentation result because different parts of the pedestrian are moving at different speeds. It does not provide a complete segmentation; for classification it is necessary to have a complete bounding box around an object. Also, because usually the car is moving at a relatively high speed in comparison to the pedestrian, the whole scene is moving and the motion of the pedestrian is negligible. There are also many pedestrians that are not moving, so motion information is useless in this case.

In [23], leg motion is used for the detection of pedestrians in color images. The input image is clustered on color values. Usually, the legs of a pedestrian are combined into the same cluster. The temporal change in the shape of such a cluster is used to distinguish it from clusters belonging to other objects in the image. The limitation of using motion for detection is that it can only detect moving pedestrians. Also, the clustering of a color image is time consuming and does not always provide good segmentation results if the input image has a complex background.

1.3.2 Initial detection in fir images

The most straightforward approach to pedestrian detection in fir images is to use body heat. People appear brighter in a fir image than most other objects because of their body temperature. In [30], a probabilistic template is created from a training database containing pedestrians in different poses. The template is created by thresholding bright regions from the images in the training database. The probability of each pixel in the template belonging to a pedestrian is calculated based on how frequently it is thresholded in the training database. The template is scanned through the image at various scales from a coarse to fine resolution. In [43], thresholding is also used to segment bright regions from the image. After thresholding, bright regions that have an incorrect width to height ratio for pedestrians are discarded, and regions which are unlikely to be a pedestrian based on position in the image are discarded. In addition to detecting complete pedestrians, a detector that searches for pedestrian heads is also used to generate initial detections. In [4], the fir image and its vertical edges are searched for symmetrical regions. Symmetrical regions in cold image areas are discarded. A vertical histogram of edges is calculated in each of the bounding boxes surrounding the symmetrical regions. The shape of the histogram is used to discard bounding boxes which do not match a pedestrian. Also, bounding boxes with an incorrect aspect ratio and bounding boxes which do not meet perspective constraints are discarded. The problem with using image brightness for detection is that at higher outside temperatures, people do not necessarily appear brighter in the image than the background.

1.3.3 Classification

It is possible to completely omit the initial detection step using a pattern recognition approach. In [31], a support vector machine classifier is scanned through the image at every scale and every location. The classifier is trained on Haar wavelets calculated from a database of pedestrian and non-pedestrian color images. An improved version of this system [28] is based on a component classifier. There are multiple classifiers, each classifying a body part. The final classification result is calculated from the output of the component classifiers. In [15], a support vector machine trained on histograms of oriented gradients is scanned at every location and scale through the image. The scanning window is divided into cells. A histogram is calculated in each of the cells. The combined histograms are the input to the support vector machine classifier. In [40], a classifier trained on rectangle features is scanned through the image at every location and scale. Additionally, motion information is used for classification. Motion information is extracted by calculating differences between images in time. Adaboost [19] is used to train a classifier. The features for classification are selected from all possible rectangle features and all possible motion features. The limitation of using motion filters is that these can only be used with a static camera. In [38], a convolutional neural network is scanned through the image at every scale and location. In the convolutional neural network architecture [27], feature selection is performed simultaneously with training the neural network. The feature selection is implemented in a hidden layer with shared weights which are optimized together with the weights of the classification so that a complete classification/feature selection architecture is optimized. A drawback of these approaches is that scanning a classifier at every scale and location through the image is too slow for real-time processing. In addition, because there is always a certain misclassification rate, it would generate too high a number of false detections because many classifications are performed per frame.

An alternative to scanning a classifier through the image at every scale and location is to classify the output of the initial detection routine. In [45], the output of a stereo initial detection routine is classified with a neural network. The image gradients are the features used for classification. The regions of interest from the initial detection routine are rescaled to a fixed size for classification. In [21], the regions of interest from the initial detection routine are classified with a radial basis function network. To select negative examples for training, a bootstrapping procedure is used. The classifier is trained in an iterative way; at each iteration it is validated on a new set of negative examples. The negative examples are added to the training dataset and the classifier is retrained. In [37], the data for training the classifier is divided into mutually exclusive training clusters. Each cluster contains data from a particular pose, a particular articulation, and a particular illumination condition. The idea of clustering the training data is that reducing the variability of the training data by dividing it into clusters is more effective than training a classifier on all data. The regions of interest for classification are divided into 9 subregions. A classifier is trained multiple times for each subregion, once per training cluster. The features for classification are histograms of gradient orientations. The histograms are weighted by smoothed gradient magnitudes to achieve invariance to illumination changes. The classifier is trained using Adaboost, which selects features from all subregions. In [42], regions of interest generated by a stereo algorithm searching for the legs of a pedestrian are fed into a feed-forward time delay neural network. The input to the network consists of pixel values from multiple frames. Neurons in a higher layer are only connected to a subselection of neurons in the lower layer, called receptive fields, which makes it possible to detect specific leg poses and motion patterns.

1.3.4 Tracking

One approach to tracking is building a model of human shapes and matching this model with image data. In [2, 34], a linear shape model of a pedestrian is built from pedestrian silhouettes. A B-spline with a fixed number of equally spaced control points is fitted to the contours of each of the silhouettes. After alignment of the shapes, a mean shape and a set of modes of variation are generated with Principal Component Analysis. The Condensation algorithm [25] is used for tracking. Each sample represents a possible position of the object to be tracked. A zero-order motion model is used because no assumptions can be made about how the pedestrian or the car are moving. The observation density for weighting the samples is generated by measuring the distance to the nearest image feature along the normals of the shape model. Oriented edges discretized into eight bins are used as features for tracking. A drawback of this approach is that it cannot handle scale well. The shape model should be matched to the image at multiple resolutions for each sample. This makes it expensive and introduces aliasing problems. Also, a linear shape model cannot capture the many possible poses and orientations of a pedestrian. Work related to learning non-linear shape models can be found in for example [12].

In [43], the heads of pedestrians in fir images are tracked. The position of the pedestrian in the next frame is estimated with a Kalman filter and the measurement update is performed by calculating an exact position around the estimated position with a mean-shift method. It is not clear how scaling is handled.

In [32], a classifier is scanned through the image at each position and each resolution. The output of the classifier, which is interpreted as a probability, is used for tracking. Condensation [25] is used for the propagation of classifier outputs through time. The position and scale of a person in the image are represented by the state density in the Condensation algorithm. The classifier outputs are represented by the observation density in the Condensation algorithm. Instead of using the standard factored sampling, the posterior state density of the previous frame is directly taken as the prior for the current frame. The pedestrian detection system is run and the state density is reinforced or inhibited in the locations where the support vector machine outputs are high. To perform detection of people, the densities are thresholded. This method can effectively handle scaling. The disadvantage of this method is that unlike the two previously described tracking methods, it cannot be used without running the whole detection system first.

1.4 Selected approach

The detection system described in this document consists of four components: an image pre-processing component, an initial detection component, a classification component, and a tracking component.

1.4.1 Image preprocessing

The image pre-processing component performs the calculation of image features for the initial detection component, classification component, and tracking component. It also contains methods for image enhancement like smoothing, contrast stretching, and histogram equalization.

1.4.2 Initial detection

The initial detection component selects regions of interest in the image which may contain pedestrians. So unlike systems that scan a classifier through the whole image at each scale and location ([31], [15], and [40]), processing is limited to certain regions in the image. It is desirable to have an initial detection routine because otherwise too much processing time is spent on uninteresting parts of the image. Also, there is always a certain false positive classification rate. The larger the number of objects that are classified per frame, the larger the number of false positive classifications per frame. For a production system, the false positive rate should be as low as possible.

Motion information is not used, because it cannot be assumed that a pedestrian is moving in the scene. The assumption is made that a single pedestrian can be segmented from the scene. The methods described here are not specifically designed for the detection of pedestrians that are overlapping each other, although this may work.

In this work, three different initial detection routines are used. Two initial detection routines are gradient based, the other is region based. The gradient based initial detection routines calculate the vertical gradients from the intensity image. Image locations with high vertical gradient response are regions of interest. The first routine calculates the vertical gradients of the image and combines vertical structures into regions of interest. The second routine scans a vertical edge detector through the image at every scale and location. The filter size of the edge detector is made dependent on the size of the scan window. This makes the initial detection routine suitable for detecting objects at all scales. These routines are mainly interesting for fir images. In grayscale images, there is usually so much gradient response from background structures that it is difficult to segment pedestrians based on gradient information alone.

The region based initial detection routine operates directly on the intensity values. This routine combines regions with similar intensity values. For fir images, human body temperature makes pedestrians appear bright in the image, at least at low outside temperatures. In fir images, bright regions with a vertical shape are regions of interest.

1.4.3 Classification

The classification of pedestrians is challenging because of the high variability among objects in the pedestrian class. Also, there are many pedestrian-like objects in traffic scenes, for example trees, poles, parts of buildings, and parts of cars. Therefore, an advanced classification component is required. Because many objects per frame may be classified, the classifier should also be very efficient. In this work, support vector machines and neural networks are used for classification.

An important choice is the type of image features used for classification. In this work, several features are tested both on video sequences and fir sequences. The features that are used here are rectangle features [39], histograms of image gradients, and orientations from the image gradients. The reason for using rectangle features is that they can be calculated very efficiently, which makes them very suitable for a real-time system. The histograms of orientations from the image gradients are used because they are invariant to small translations inside the regions from which the histograms are calculated.

For a real system, the classification performance should be nearly perfect. Especially, the false positive rate (false alarms) should be practically zero. In practice, this is difficult to achieve. A classifier trained on a complete set of image features usually has a too high misclassification rate to be useful in a real system. Therefore, in this work, the classification system is optimized with different feature selection methods. The idea is that not all features are stable for classification. Selecting the subset of stable features from all features and using only the stable features for classification improves classification performance.

To improve classification performance even more, the training data is divided into subdatasets. The training data is divided based on pedestrian orientation, pedestrian size, and environment conditions like outside temperature.

Classification speed is also important for a real system. In order to improve the classification speed of support vector machines, the number of support vectors is reduced using the method from [36], and with a multi-objective optimization.

1.4.4 Tracking

Reliable tracking of pedestrians is also challenging: the contrast between the pedestrian and the background is often low, the resolution of pedestrians in traffic scenes is often low, there are many background objects that resemble pedestrians, due to the movement of the car pedestrians often scale strongly within a few frames, multiple pedestrians may occlude each other, and pedestrians usually change shape because they move. In this work, five tracking methods are applied for tracking pedestrians. The first tracker is based on the Hausdorff distance. Once a pedestrian is detected in a certain frame, a model is created from the gradient image calculated in the region of interest which contains the pedestrian. In the next frame, the tracker searches in the area around the previous location for the position with the smallest Hausdorff distance to the model. The second tracker is a mean shift tracker. A model is created for tracking by calculating a histogram of image features of the detected pedestrian. In the next frame, the position containing the pedestrian is determined by minimizing the difference between a target histogram and the model histogram. The third tracker is a Condensation tracker. The tracker is based on propagating a state density of tracked objects over time. The fourth tracker is a tracker that integrates initial detections through time. This tracker is based on the assumption that a pedestrian which is detected in a certain frame will be detected at approximately the same location in the next frame. The fifth tracker is a tracker which integrates classifications through time. This tracker is based on the assumption that a pedestrian which is classified in a certain frame will be classified at approximately the same location in the next frame.

1.5 Contributions

This work has resulted in a pedestrian detection system for grayscale and fir images. The two main goals were to develop a real-time detection system which operates with a high detection rate under various environment conditions. The fir version of the system runs easily at 20 frames per second on a 1 GHz processor. Both the grayscale and fir detection systems are strongly optimized for a high detection rate.

Three new real-time initial detection routines were developed: two gradient based routines and one region based routine. A systematic overview of the performance of these detection routines is presented on a variety of conditions by comparing the results of the detection routines to ground truth data.

As mentioned in section 1.3.3, several types of classifiers and image features are described in the pedestrian detection literature. It is not clear, however, which classifier/image feature combination works best. A systematic overview of the performance of support vector machines and neural networks in combination with multiple feature types is presented in this thesis. In addition, to improve classification performance, different classifiers are trained for different temperature ranges (in the case of fir images) and different classifiers are trained for different appearances of pedestrians, based on orientation. Also, to improve classification performance, feature selection is performed with PCA, Adaboost, and multi-objective optimization. In order to achieve real-time classification speed, the number of support vectors of a support vector machine is reduced. An overview of all classification results for different classifiers trained on different image features is presented in this work.

Five tracking methods are evaluated on data of pedestrians. The first is based on the Hausdorff distance. The second is a mean shift tracker. The third is a Condensation tracker. The fourth is based on integrating initial detections through time. And the fifth is based on integrating classification outputs through time. These trackers can be used to stabilize detections over time and, for example, make it possible to estimate the time of contact with the pedestrian. An overview of tracking performance of all trackers on fir images and grayscale images is presented.

1.6 Outline of this thesis

The rest of this thesis is structured in the following way: chapter two contains a description of the initial detection routines and the image preprocessing methods they use. Chapter three describes the methods used for classification and optimization of the classification. Chapter four describes the methods used for tracking. Chapter five contains the experimental results. Chapter six contains the discussion and conclusion.

Chapter 2
Initial detection
The goal of the initial detection is to find regions of interest in the image. It can be seen as a focus of attention mechanism which limits processing to interesting parts of the image. These regions of interest can contain pedestrians or other objects. It is desirable to have an initial detection routine in contrast to scanning a classifier through the whole image at all locations and scales for the following reasons:

- There is always an unavoidable number of misclassifications. By classifying only the output of the initial detection routine instead of the whole image, the number of false detections is reduced.

- Limiting processing to interesting image regions saves calculation time.

It appears that it is necessary to have different initial detection routines for different image types and different conditions. For example, it may be necessary to have a different initial detection routine for color/grayscale images than for fir images. This section describes three initial detection routines: two gradient based initial detection routines which use the image gradients for generating initial detections, and a region based initial detection routine which operates on the intensity values directly.


An important rate for a detection system is the rate of positive
examples detected as positive by the system. Another important
rate is the rate of negative examples detection as positive by the
system. In the literature, it seems that inconsistent terminology
exists for these rates.

In this work, these rates are called true

positive rate and false positive rate, respectively.

2.1 Image pre-processing

2.1.1 Image enhancement

Before the detection algorithms are applied to an image, it is usually preprocessed to enhance its quality. To enhance the contrast in the image, contrast stretching can be applied. Usually, before the image features are calculated, smoothing is performed to remove image artifacts.

Contrast stretching

Contrast stretching enhances an image by stretching the range of intensity values to the maximum possible range. It does this by applying a linear scaling to the image. The pixel with minimum intensity in the image is set to the lowest possible value, the pixel with maximum intensity in the image is set to the highest possible value, and the other pixels are interpolated between the lowest and highest possible values.

To perform contrast stretching, the minimum intensity value min and the maximum intensity value max are calculated from the input image. The stretched image j can be calculated from the original image i using

j(x, y) = lower                                          if i(x, y) = min
j(x, y) = upper * (i(x, y) - min) / (max - min)          if min < i(x, y) < max
j(x, y) = upper                                          if i(x, y) = max

where x and y are image coordinates, lower is the lowest possible intensity value, and upper is the highest possible intensity value. To reduce the effect of outliers, min and max can be selected at the pixel values at for example 3% and 97% of the image histogram, respectively.
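A minimal numpy sketch of this stretching, using the percentile-based min and max mentioned above; the generalization to an arbitrary lower bound is an assumption:

```python
import numpy as np

def contrast_stretch(image, lower=0, upper=255, low_pct=3, high_pct=97):
    """Linearly stretch intensities to [lower, upper]; min/max are taken at
    percentiles of the histogram to reduce the effect of outliers."""
    img = image.astype(np.float32)
    vmin = np.percentile(img, low_pct)
    vmax = np.percentile(img, high_pct)
    stretched = lower + (upper - lower) * (img - vmin) / max(vmax - vmin, 1e-6)
    return np.clip(stretched, lower, upper).astype(np.uint8)
```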

Smoothing

A convolution kernel for smoothing is usually constructed with the 2D Gaussian

G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}.

For performance reasons, in this work smoothing is performed with two 1-dimensional approximations to a Gaussian:

G(x) = {0.25, 0.5, 0.25}  and  G(y) = {0.25, 0.5, 0.25}^T

for a kernel size of three, and

G(x) = {0.0625, 0.25, 0.375, 0.25, 0.0625}  and  G(y) = {0.0625, 0.25, 0.375, 0.25, 0.0625}^T

for a kernel size of five. The floating point values in the kernels are selected in a way that a convolution with a kernel can be performed efficiently with integer math. For example, G(x) = {0.25, 0.5, 0.25} can be efficiently calculated as G(x) = {1, 2, 1} shifted to the right by two bits.
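A sketch of the separable {1, 2, 1} smoothing with integer math, assuming numpy; borders are handled here by wrap-around for brevity, which is an implementation shortcut:

```python
import numpy as np

def smooth_3x3(image):
    """Separable {1, 2, 1} smoothing using integer math; the division by 4
    per pass is done with a right shift by two bits."""
    img = image.astype(np.uint32)
    # horizontal pass: (1*left + 2*center + 1*right) >> 2
    h = (np.roll(img, 1, axis=1) + 2 * img + np.roll(img, -1, axis=1)) >> 2
    # vertical pass with the same kernel
    v = (np.roll(h, 1, axis=0) + 2 * h + np.roll(h, -1, axis=0)) >> 2
    return v.astype(image.dtype)
```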

2.1.2 Image features


Rectangle features
The rectangle features applied here are the same as those used in [39] for real-time face detection. Figure 2.1 shows an example of the filters to calculate these features. The sum of the pixels in the white region is subtracted from the sum of the pixels in the dark region. The filter in figure 2.1(a) gives a high output at a vertical edge, the filter in figure 2.1(b) gives a high output at a horizontal edge, and the filter in figure 2.1(c) gives a high output at a diagonal edge. The motivation for using these features is that they are extremely fast to calculate. They can be calculated in constant time regardless of their size using an integral image representation, related to summed area tables from texture mapping in computer graphics. The integral image at location x, y is the sum of the pixels above and to the left of x, y:

ii(x, y) = \sum_{x' \le x, \, y' \le y} i(x', y')

where ii(x, y) is the integral image and i(x, y) is the original image. The integral image can be calculated using the following formulas:

s(x, y) = s(x, y - 1) + i(x, y)
ii(x, y) = ii(x - 1, y) + s(x, y)

where s(x, y) is the cumulative column sum, s(x, -1) = 0, and ii(-1, y) = 0. Using the integral image, any rectangular sum can be calculated in only four array references. An example calculation is shown in figure 2.2. The area of rectangle D can be calculated with (ii(4) + ii(1)) - (ii(2) + ii(3)), where the values 1, 2, 3, and 4 are coordinates of the integral image. A sketch of these calculations is given after figure 2.2.

Figure 2.1: Rectangle filters.


Figure 2.2: Calculation of the rectangle sum.
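A sketch of the integral image and the four-reference rectangle sum described above, assuming numpy arrays indexed as [y, x]; the two-rectangle feature at the end is an illustrative example:

```python
import numpy as np

def integral_image(image):
    """ii(x, y): sum of all pixels above and to the left of (x, y), inclusive."""
    return image.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x0, y0, x1, y1):
    """Sum of pixels in the rectangle [x0, x1] x [y0, y1] from four references."""
    total = ii[y1, x1]
    if x0 > 0:
        total -= ii[y1, x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1, x1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return total

def vertical_edge_feature(ii, x, y, w, h):
    """Two-rectangle feature: left (white) sum minus right (dark) sum."""
    left = rect_sum(ii, x, y, x + w // 2 - 1, y + h - 1)
    right = rect_sum(ii, x + w // 2, y, x + w - 1, y + h - 1)
    return left - right
```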

Gradients

The image gradients are calculated by convolving the smoothed image with two 1-dimensional kernels. For a filter size of 5, the horizontal gradient image g_x is calculated with the following Sobel-like kernel:

g_x = {1, 2, 0, -2, -1}^T.

The vertical gradient image g_y is calculated with the following kernel:

g_y = {1, 2, 0, -2, -1}.

An approximation to the energy image E is calculated from the horizontal gradient image and the vertical gradient image:

E = |g_x| + |g_y|.

The gradient images and energy images can be used for edge detection. Also, the angle of orientation \theta of the gradient is calculated from g_x and g_y:

\theta = \arctan\left(\frac{g_y}{g_x}\right).

An example of an image, its vertical gradients, its horizontal gradients, and its energy is shown in figure 2.3. A sketch of this calculation is given after figure 2.3.

(a) A fir image  (b) Its vertical gradient image  (c) Its horizontal gradient image  (d) Its energy image

Figure 2.3: Image gradients and energy.
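A sketch of the gradient, energy, and orientation calculation with the kernel above, assuming numpy; which image axis corresponds to g_x and which to g_y is an assumption of this sketch:

```python
import numpy as np

def gradients_and_energy(smoothed):
    """Convolve with the 1-D kernel {1, 2, 0, -2, -1} along each axis and
    form the energy approximation E = |gx| + |gy|."""
    kernel = np.array([1, 2, 0, -2, -1], dtype=np.float32)
    img = smoothed.astype(np.float32)
    # apply the kernel along columns (gx) and along rows (gy)
    gx = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, img)
    gy = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, img)
    energy = np.abs(gx) + np.abs(gy)
    orientation = np.arctan2(gy, gx)  # angle of the gradient, arctan(gy / gx)
    return gx, gy, energy, orientation
```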

2.2 Gradient based segmentation

Two different gradient based initial detection routines are described in this section. The first is based on clustering vertical gradients. The second is based on scanning a vertical edge detector through the image at every location and scale.

2.2.1 Grouping vertical gradients

The initial detection routine that groups vertical gradients works as follows: the vertical gradient image is calculated with the convolution kernel from section 2.1.2. A threshold is applied to create a binary image where the regions with a high gradient magnitude are foreground pixels. In the case of fir images, the threshold is calculated from the image histogram of the vertical gradient image. The intensity value at 95% in the cumulative image histogram of the vertical gradient image is selected as the threshold. The threshold is selected at 95% because this ensures that only image regions with strong vertical gradients are considered for further processing. In the case of color/grayscale images, the threshold is set to a fixed value. This threshold is calculated in the same way as for infrared images: an image histogram is calculated and the threshold is selected at the intensity value at 95% in the cumulative image histogram. Using the connected components algorithm described in algorithm 1, the foreground pixels in the binary image are clustered into regions. Regions with a width to height ratio from 0.5 to 1.5 (width divided by height) may contain a pedestrian and are regions of interest. The values of 0.5 and 1.5 are selected based on measuring width to height ratios from different front view and side view pedestrian images. A sketch of the threshold selection is given below.
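A minimal sketch of the 95% cumulative-histogram threshold, assuming numpy; the number of histogram bins is an illustrative choice:

```python
import numpy as np

def gradient_threshold(vertical_gradients, fraction=0.95):
    """Select the threshold at 95% of the cumulative histogram of absolute
    vertical gradient magnitudes, then binarize."""
    mag = np.abs(vertical_gradients)
    hist, bin_edges = np.histogram(mag, bins=256)
    cumulative = np.cumsum(hist) / mag.size
    threshold = bin_edges[np.searchsorted(cumulative, fraction)]
    return mag > threshold, threshold
```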


Algorithm 1 Connected components labeling.

Input: A binary image.
Output: An image where each foreground region is assigned a unique label.
1. Repeat steps 2 and 3 for each pixel in the image.
2. Starting at the first pixel: scan the image row by row until a foreground pixel is found.
3. Consider all adjacent 8-connected pixels to the current pixel (west pixel, south pixel, south west pixel and south east pixel) which were already visited:
(a) If all of the visited pixels are background pixels, assign a new label to the current pixel.
(b) If one or more of the visited pixels are foreground pixels of the same group, assign the label of that pixel to the current pixel.
(c) If multiple visited pixels belong to a different group, assign one of the labels of those pixels to the current pixel and mark that the different labels belong to the same group.
4. In an additional pass over the whole image, assign a unique group label to the pixels of groups which were marked as equal.
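A minimal two-pass implementation of algorithm 1 in Python, assuming 8-connectivity and a row-by-row scan from the top, so that the already-visited neighbours are the three pixels in the row above plus the pixel to the west; the union-find helper is an implementation detail not spelled out in the algorithm:

import numpy as np

def connected_components(binary):
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=np.int32)
    parent = [0]                      # union-find forest over labels

    def find(a):                      # follow parents up to the root label
        while parent[a] != a:
            a = parent[a]
        return a

    next_label = 1
    for y in range(h):                # pass 1: provisional labels (steps 1-3)
        for x in range(w):
            if not binary[y, x]:
                continue
            neigh = [labels[ny, nx]
                     for ny, nx in ((y-1, x-1), (y-1, x), (y-1, x+1), (y, x-1))
                     if ny >= 0 and 0 <= nx < w and labels[ny, nx] > 0]
            if not neigh:             # step 3(a): no labelled neighbour
                parent.append(next_label)
                labels[y, x] = next_label
                next_label += 1
            else:                     # steps 3(b)/(c): reuse and merge labels
                labels[y, x] = min(neigh)
                for n in neigh:
                    parent[find(n)] = find(min(neigh))
    for y in range(h):                # pass 2: resolve equivalences (step 4)
        for x in range(w):
            if labels[y, x]:
                labels[y, x] = find(labels[y, x])
    return labels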

An example of this method for fir images is shown in figure 2.4. Figure 2.4(a) shows the input image, figure 2.4(b) shows the vertical gradient image, figure 2.4(c) shows the binary image, and figure 2.4(d) shows the regions of interest. An example of this method for grayscale images is shown in figure 2.5. Figure 2.5(a) shows the input image, figure 2.5(b) shows the vertical gradients, figure 2.5(c) shows the binary image, and figure 2.5(d) shows the regions of interest.
In the case of fir images, there is usually a transition from dark (background) to light (region of interest) to dark (background). So additionally, the sign of the gradient is used to discard regions of interest which cannot contain a pedestrian.


(a) A fir image
(b) Its vertical gradients
(c) Its binary image
(d) The initial detections

Figure 2.4: Grouping vertical gradients in a fir image.

An example of the use of the sign of the gradients is shown in figure 2.6. The gray regions contain negative gradients, the white regions contain positive gradients. Only regions of interest consisting of a positive gradient region followed by a negative gradient region (from left to right) are accepted.


(a) A grayscale image
(b) Its vertical gradient image
(c) Its binary image
(d) The initial detections

Figure 2.5: Grouping vertical gradients in a grayscale image.

Figure 2.6: Use of the gradient sign for initial detection.


Often, in the case of fir images, the feet and the head of pedestrians are brighter than the upper body because of the insulation of the coat. These upper body regions do not generate strong gradient magnitudes. Therefore, once a region of interest is generated, a search along a vertical line is performed upwards for another region of interest. This makes it possible to combine the feet with the head. When another region of interest is found, the two regions are combined and it is tested whether the combined region matches the width to height ratio of a pedestrian. An example of this is shown in figure 2.7. In figure 2.7(b), the complete pedestrian cannot be segmented as one region. After the combination of the feet with the head, the pedestrian is segmented as one object (figure 2.7(c)). The complete initial detection method is described in algorithm 2.


(a) A fir image
(b) The binary image
(c) The initial detections

Figure 2.7: Combining regions of interest.


Algorithm 2 Grouping vertical gradients.

Input: An image.
Output: A list of initial detections.
1. Perform smoothing on the input image.
2. Calculate the vertical gradients from the smoothed image.
3. Calculate an image histogram from the vertical gradients image and select the intensity value at 95% of the histogram as threshold.
4. Create a binary image from the vertical gradient image using the threshold. Foreground pixels are those pixels whose absolute gradient magnitude is larger than the threshold.
5. Cluster the foreground pixels into regions using algorithm 1.
6. Repeat steps 7 and 8 for each foreground region.
7. If the width to height ratio of the region matches the width to height ratio of a pedestrian, add the coordinates of the region to the output list. Additionally, in fir images, if the sign of the gradients of the region does not match the sign of a warm object against a cold background, discard the region.
8. Search for another region above the current region. If the width to height ratio of the combined region matches the width to height ratio of a pedestrian, add the coordinates of the combined region to the output list.
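Step 3, selecting the intensity value at 95% of the cumulative image histogram, is small enough to show in full; a sketch assuming 8-bit magnitude values (the same routine serves the temperature segmentation of section 2.3):

import numpy as np

def cumulative_histogram_threshold(values, fraction=0.95):
    hist, edges = np.histogram(values, bins=256, range=(0, 256))
    cum = np.cumsum(hist) / values.size
    # first intensity value whose cumulative frequency reaches the fraction
    return edges[np.searchsorted(cum, fraction)]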

2.2.2 Scanning a vertical edge detector through the whole image
A different approach to finding vertical structures in the image is to scan a search window through the image at every position and scale. In practice, it is sufficient to scan through a certain range of the image, because there are positions where there cannot be any pedestrians due to perspective constraints. For example, at the top of the image there is usually sky, so there is no need to search there. The search window consists of two or four edge detectors which search for vertical structures. An example of a search window is shown in figure 2.8. In order to make the detection results invariant to the size of the objects in the image, the size of the filters of the edge detectors is adapted to the size of the search window. The advantage of this method over the initial detection routine which groups vertical features is that when a pedestrian is located under another object with vertical structure (traffic sign, tree, building), it can still be detected correctly in the search window in which only the pedestrian fits. The vertical feature grouping initial detection routine would only find the pedestrian clustered together with the other object. Some experimentation is necessary to find a threshold value for the edge detector. In general a lower threshold is selected in grayscale images, in which there is not much contrast between the pedestrian and the background. In fir images, a higher threshold can be selected because of the higher contrast between pedestrians and background. The complete method is described in algorithm 3.

Figure 2.8: A scanning window.


Algorithm 3 Scanning a vertical edge detector.

Input: An image.
Output: A list of initial detections.
1. Select a threshold dependent on the image type (grayscale or fir).
2. Repeat steps 3 and 4 for every scale and image position.
3. In the scanning window, calculate the vertical edge response with four vertical filters: two in the left half of the scanning window and two in the right half.
4. If all four responses are larger than the threshold, add this scan window to the output list.
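A sketch of algorithm 3 on top of the integral-image helpers sketched in section 2.1.2; the filter geometry (two vertical edge filters per window half, each a difference of two adjacent columns) follows figure 2.8, but the exact offsets and widths are illustrative assumptions:

def vertical_edge_responses(ii, x, y, w, h):
    c = max(w // 8, 1)                      # filter size adapted to the window
    def edge(x0):                           # vertical edge: column difference
        return abs(rect_sum(ii, x0, y, c, h) - rect_sum(ii, x0 + c, y, c, h))
    # two detectors in the left half and two in the right half of the window
    return [edge(x), edge(x + w // 4), edge(x + w // 2), edge(x + 3 * w // 4)]

def scan_edge_detector(ii, img_w, img_h, scales, step, threshold):
    detections = []
    for (w, h) in scales:                   # every scale ...
        for y in range(0, img_h - h, step): # ... and every image position
            for x in range(0, img_w - w, step):
                if all(r > threshold
                       for r in vertical_edge_responses(ii, x, y, w, h)):
                    detections.append((x, y, w, h))
    return detections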

This initial detection routine can be implemented very efficiently using the integral image described in section 2.1.2. An example of the output from this initial detection routine on fir images is shown in figure 2.9. An example of the output from this initial detection routine on grayscale images is shown in figure 2.10. The regions of interest are marked with orange rectangles.

(a) A fir image
(b) The initial detections

Figure 2.9: Scanning an edge detector through the image at every position and scale.


(a) A grayscale image
(b) The initial detections

Figure 2.10: Scanning an edge detector through the image at every position and scale.

2.3 Temperature segmentation in fir images

Region based segmentation groups pixels with similar intensity values into regions. In fir images, humans appear brighter in the image than most other objects because of their body temperature. Temperature is therefore a property that can be used for segmenting objects from the image. Temperature segmentation is a method that is often used for the initial detection of pedestrians in fir images [30], [43], [4]. In this work, a related method is applied. A threshold is calculated from the outside temperature using the image histogram of intensity values (as described in section 2.2.1). The threshold is selected at 95% of the cumulative image histogram. A high threshold is calculated for high outside temperatures and a low threshold is calculated for low outside temperatures. With the threshold, a binary image is calculated. Pixels with an intensity value higher than the threshold are selected as foreground regions. The other pixels are selected as background regions. The foreground pixels in the binary image are warm areas and the background pixels are cold areas. The foreground areas are clustered with algorithm 1. Clusters with the correct width to height ratio for a pedestrian (as described in section 2.2.1) are selected as regions of interest. An example of this method is shown in figure 2.11. Figure 2.11(a) shows the input fir image, figure 2.11(b) shows the binary image, and figure 2.11(c) shows the regions of interest marked with orange rectangles.

(a) A fir image
(b) The binary image
(c) The initial detections

Figure 2.11: Temperature segmentation in fir images.

A problem with using image brightness for initial detection of pedestrians is that pedestrians are often not uniformly bright. Usually, the head and the feet of the pedestrian are much brighter than the body because of the insulation of the coat. An improvement of this method creates additional regions of interest by trying to combine each region of interest with a higher located region of interest. In this way, the region containing the feet is combined with the region containing the head. An example of this improved method is shown in figure 2.12. Figure 2.12(a) shows the fir image, figure 2.12(b) shows the binary image, and figure 2.12(c) shows an initial detection consisting of two regions. The complete algorithm is described in algorithm 4.

(a) A fir image
(b) Its binary image
(c) The initial detections

Figure 2.12: Combining regions of interest.


Algorithm 4 Temperature segmentation in fir images.

Input: A fir image.
Output: A list of regions of interest.
1. Select an intensity threshold at 95% of the cumulative image histogram.
2. Create a binary image from the fir image using the threshold. Pixels with an intensity value higher than the threshold are foreground pixels.
3. Cluster the foreground pixels in the binary image into regions using algorithm 1.
4. Repeat steps 5 and 6 for each of the foreground regions.
5. If the width to height ratio of the region matches the width to height ratio of a pedestrian (as described in section 2.2.1), add the coordinates of the region to the output list.
6. Search for another region directly above the current region. If the width to height ratio of the combined region matches the width to height ratio of a pedestrian, add the coordinates of the combined region to the list.

Chapter 3
Classification

The purpose of classification is to determine which of the regions of interest, provided for example by the initial detection routine, contain pedestrians and which do not. The classification consists of the following steps:

- the generation of representative, mutually exclusive training examples and validation examples of positive examples (pedestrians) and negative examples (non-pedestrians),
- the extraction of features from the training and validation examples,
- the training of the classifier with the training features,
- the evaluation of the generalization performance of the classifier on the validation examples.

3.1 Classifiers

The classifier is the learning algorithm which is taught how to separate between two classes of input: features from positive examples and features from negative examples. The classifier also performs the forward propagation: the assignment of a class label (positive or negative) to an unseen example. In this work, the classification is always a two-class problem. The classifier learns to separate between two classes of objects, pedestrians and non-pedestrians. So, the output of the classifier is always interpreted as a binary value, pedestrian or non-pedestrian. Two types of classifiers are applied in this work: neural networks and support vector machines.

3.1.1 Neural networks

A neural network is a set of processing units which communicate through weighted connections. The processing units (neurons) are divided up into layers. A neural network has

- one input layer which receives input from outside the network,
- zero, one or more hidden layers which receive input from the input layer and from each other (in the case that there are multiple hidden layers),
- an output layer which receives input from the hidden layer(s).

The output value y_k of a neuron k in the network at time step t is calculated in the following way

y_k(t) = F_k( Σ_j w_{jk}(t−1) y_j(t−1) + θ_k(t) ),

where w_{jk} is the weighted connection from neuron j to neuron k, θ_k(t) is the bias of neuron k, and F_k is the activation function of neuron k. In this work, a linear activation function is used for the output neurons. A sigmoidal activation function is used for the hidden neurons:

F(s_k) = 1 / (1 + e^{−s_k}).

The neural network used in this work is a three layer feedforward neural network, where the propagation through the network is only forward. The network is not fully connected: neurons only receive input from neurons in the previous layer. The number of hidden neurons varies depending on the problem. The number of neurons in the output layer is one. The number of input neurons is equal to the number of features of the data.
During training, the weights and biases of the connections in the network are updated so that the output of the network y^p is as close as possible to the target value d^p for all examples p. This is done by minimizing an error function with gradient descent. The error function E is defined as

E = (1/2) Σ_p (d^p − y^p)^2.

For a neural network with at least one hidden layer, the error is minimized with error back-propagation. The change in a weight is proportional to the negative of the derivative of the error:

Δ_p w_j = −γ ∂E^p/∂w_j,

where p is the current training example, γ is the learning rate, and j is the index of the weight. This can be rewritten as

Δ_p w_{jk} = −γ (∂E^p/∂s^p_k) y^p_j,

where s^p_k is the input to neuron k. For an output neuron o it holds that

∂E^p/∂s^p_o = −(d^p_o − y^p_o) F'_o(s^p_o),

and for a hidden unit h it holds that

∂E^p/∂s^p_h = −F'(s^p_h) Σ_{o=1..N_o} (d^p_o − y^p_o) F'_o(s^p_o) w_{ho}.
For more information about neural networks, see for example [6].
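As an illustration of the forward propagation described above, a minimal three-layer feedforward pass in Python/NumPy, with sigmoidal hidden neurons and a single linear output neuron as used in this work; the weight shapes and names are illustrative:

import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))             # F(s) = 1 / (1 + e^(-s))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    # x: feature vector; w_hidden: (n_hidden, n_inputs); w_out: (n_hidden,)
    hidden = sigmoid(w_hidden @ x + b_hidden)   # sigmoidal hidden layer
    return float(w_out @ hidden + b_out)        # linear output neuron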

3.1.2 Support vector machines

The basic idea of support vector machines is to map the training data into a high dimensional feature space F and to calculate a separating hyperplane in that feature space. A separating function f: R^N → {±1} is calculated from the training examples (x_1, y_1), ..., (x_l, y_l) ∈ R^N × {±1}. By using a mapping Φ: R^N → F it is not necessary to work in the high dimensional space F, because there exists a kernel k(x, x') for which holds: k(x, x') = (Φ(x) · Φ(x')). Training a support vector machine consists of calculating a hyperplane w · x − b = 0 which maximally separates the training data in F. This means minimizing ||w||^2 subject to y_i(w · x_i + b) ≥ 1. This gives the quadratic optimization problem

minimize (1/2) ||w||^2

with respect to w, b, subject to y_i((w · Φ(x_i)) + b) ≥ 1. Introducing the Lagrange parameters α_i gives the Lagrangian

(1/2) ||w||^2 − Σ_{i=1..l} α_i ( y_i((w · Φ(x_i)) + b) − 1 ),

CHAPTER 3.

56

CLASSIFICATION

which should be minimized with respect to


maximized with respect to

where

and should be

i .

The resulting separation function is

b),

w, b

P
f (x) = sign( li=1 i yi k(x, xi )+

is the number of support vectors(the training data

points which dene the separation plane), and

i are the Lagrange

multipliers.
Two examples of kernels for classication are: a radial basis function kernel

k(x, y) = exp( kx yk2 )


where

is the width of the Gaussian and a polynomial kernel

k(x, y) = (x y)d
where

is the degree of the polynomial.

For more information

about support vector machines see for example [13].
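The resulting decision function with a radial basis function kernel, as a Python/NumPy sketch; support_vectors, alphas, labels, gamma and b are assumed to come from a trained machine:

import numpy as np

def rbf_kernel(x, y, gamma):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def svm_decide(x, support_vectors, alphas, labels, b, gamma):
    # f(x) = sign( sum_i alpha_i y_i k(x, x_i) + b )
    s = sum(a * y * rbf_kernel(x, sv, gamma)
            for a, y, sv in zip(alphas, labels, support_vectors))
    return np.sign(s + b)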

3.2 Image features for classification

An important choice is the type of image features that are used for classification. The type of image features influences the classification performance and the forward propagation speed. Before the features are calculated, the regions of interest are rescaled to a fixed size so that the number of features is constant in each region of interest. For an image size of 320x240 pixels, the window size used is 24x48 pixels. This window size is selected because it is large enough to calculate image features from (as described in section 2.1.2). On the other hand it is small enough to avoid a very large input space to the classifier.


3.2.1 Rectangle features

The rectangle features are the features as described in section 2.1.2. Two different scales of filters are used to capture different levels of detail: a filter size of four by four pixels and a filter size of eight by eight pixels. Both filters have an overlap of 2 pixels in the horizontal and vertical direction. An example of the application of the filters on pedestrian images is shown in figure 3.1. For the classification of pedestrians, the filter outputs are normalized from 0.1 to 0.9 for neural network classification and from -1.0 to 1.0 for support vector machine classification.

(a) A grayscale image
(b) Its vertical gradients

Figure 3.1: Rectangle features for classification.

3.2.2 Histograms of gradients and orientations features

The gradients and orientations features are those from section 2.1.2. First the gradient image and orientations image are calculated with a filter size of 5 pixels. This filter size is selected because of the low image resolution of 320x240 pixels; for a higher resolution image a larger filter size would be selected. An example of the gradient calculation was shown in figure 2.3. The region of interest is divided into m by n fields. An example of this is shown in figure 3.2 for m is 4 and n is 8. In each of these regions, a histogram of 8 bins is calculated of both the gradient magnitudes and the orientations of the gradient. The index of the bin in the histogram is calculated by linearly scaling the gradient magnitude and orientation between their minimum and maximum values. The histograms of all regions are concatenated and are normalized from 0.1 to 0.9 for neural network classification and from -1.0 to 1.0 for support vector machine classification.
Both the gradient magnitudes and gradient orientations are used as features for classification. In this way, the information provided by the strength of the gradient as well as the information provided by its direction are used.

Figure 3.2: Calculation of image features in subregions.
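A sketch of the histogram feature extraction over the m by n subregions, assuming the magnitude and orientation images of section 2.1.2; the bin count and normalization range follow the description above, while the per-patch bin edges implement the linear scaling between minimum and maximum:

import numpy as np

def subregion_histograms(magnitude, orientation, m=4, n=8, bins=8):
    h, w = magnitude.shape
    feats = []
    for row in range(n):                        # n fields vertically
        for col in range(m):                    # m fields horizontally
            ys = slice(row * h // n, (row + 1) * h // n)
            xs = slice(col * w // m, (col + 1) * w // m)
            for img in (magnitude, orientation):
                patch = img[ys, xs]
                hist, _ = np.histogram(
                    patch, bins=bins,
                    range=(patch.min(), patch.max() + 1e-9))
                feats.append(hist / patch.size)
    v = np.concatenate(feats)                   # concatenate all histograms
    # rescale to [0.1, 0.9] for neural network classification
    return 0.1 + 0.8 * (v - v.min()) / (v.max() - v.min() + 1e-9)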

3.3 Training and validation of the classifier

3.3.1 Creating training datasets

The classifier is trained with positive examples of pedestrian images and negative examples of non-pedestrian images. The initial detection routines and a bootstrapping method (section 3.4.1) are used to generate training data. So for example, a classifier is trained on data from a particular initial detection routine. The reason for using the initial detection routines for generating training data is that in this way, the classes of positive and negative examples are well defined. In particular, the class of non-pedestrians is in principle infinitely large. By using the output from the initial detection routines, the negative examples are limited to those that are generated by the initial detection routine. For each initial detection routine, a separate classifier is trained. This is done because different initial detection routines generate different positive and negative examples.
The training data is created by manually marking the pedestrians in image sequences. This is done by drawing a rectangle around a pedestrian in each frame. This results in a set of ground truth data. The initial detection routines are applied to the image sequences and the output of the initial detection routines is compared to the ground truth data. When an initial detection is at the same coordinates as a labeled pedestrian in the ground truth data, it is stored as a positive example in the training data. Otherwise it is stored as a negative example. Separate labels are used for pedestrians in front/rear views and pedestrians in side views to reduce the amount of inner class variability. So a separate classifier is trained for back/front views and side views. The idea is that the classification is improved when the classes of objects are better defined.
The complete set of image data is divided into three approximately equally large, mutually exclusive subdatasets. The first subset is the training data, which is used for training the classifier. The second subset is the validation dataset, which is used to prevent overfitting during training of the classifier. The third dataset is the test dataset, which is used to evaluate the performance of an optimized classifier. There are always three times as many negative examples as there are positive examples in the training and validation datasets. This results in classifiers with a low false positive rate. For a real system, false detections are unacceptable, so the classifier is trained in a way to have the lowest false positive rate possible.
Once the available data has been divided into subdatasets, the image features are calculated and stored as feature vectors. For each of the initial detection routines, feature vectors are generated.

3.3.2 Training neural networks

For training neural networks, early stopping is used. The network is trained on the training dataset. At each training iteration, the classification error on the validation dataset is monitored. During a certain number of iterations after the training has started, both the training error and the validation error decrease. From a certain iteration on, the training error keeps decreasing but the validation error increases. From this point on, the network starts overfitting the data: it memorizes the training data and does not generalize well anymore on the validation data. The network at the training iteration where the validation error starts to increase is selected as the final network. The optimal network from the early stopping procedure is then tested on a test dataset, mutually exclusive from the training and validation datasets. After some experimentation, the number of hidden neurons is fixed at a value of 10 such that a direct comparison between the performance of different networks is possible. The value of 10 neurons was found to be the minimum required for the optimal performance of the network: decreasing the number of hidden neurons below 10 results in a reduced classification performance.

3.3.3 Training support vector machines

For training support vector machines, a grid search is used to find the optimal value for C, the trade off between the maximization of the margin of the separating hyperplane and training error minimization, and, in the case of a radial basis function kernel, to find the optimal value for γ, the width of the Gaussian in the kernel. The grid search is performed by training a support vector machine on a training dataset, and evaluating the performance of the support vector machine on a validation dataset mutually exclusive to the training dataset. The evaluation criteria are true positive rate, false positive rate, and number of support vectors. The optimal machine from the grid search is then tested on a test dataset, mutually exclusive from the training and validation datasets.
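The grid search can be stated in a few lines; a sketch in which train_svm and evaluate are placeholders for whatever support vector machine implementation is used:

import itertools

def grid_search(train_svm, evaluate, C_grid, gamma_grid):
    # train_svm(C, gamma) -> model; evaluate(model) -> (tp_rate, fp_rate, n_sv)
    best, best_key = None, None
    for C, gamma in itertools.product(C_grid, gamma_grid):
        model = train_svm(C=C, gamma=gamma)
        tp, fp, n_sv = evaluate(model)
        key = (tp, -fp, -n_sv)        # prefer high tp, low fp, few vectors
        if best_key is None or key > best_key:
            best, best_key = (C, gamma), key
    return best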

3.4 Optimization of the classification

The standard classification with neural networks and support vector machines is often not optimal. Also, in the case of support vector machines, the training generates so many support vectors that the classification is too slow for real-time pedestrian detection. Therefore, some form of optimization is necessary, both in classification performance and in classification time.

3.4.1 Bootstrapping

The class of negative training examples is usually not well defined. To generate a dataset of representative negative examples, a bootstrapping technique can be applied. This works as follows: a training dataset is generated from all positive training examples and a small number of negative examples. A classifier is trained on this data and is tested on a dataset not used for training. The false positives from the test dataset are added to the training dataset and a new classifier is trained on the training dataset. Again, this classifier is tested on another test dataset and the false positives from this dataset are added to the training examples. This procedure is repeated until a desired false positive rate is achieved.


3.4.2 Component classification

The idea of component classification is to train a classifier for each part of a pedestrian. A set of local classifiers whose classification outputs are combined may give better results than one global classifier. Figure 3.3 shows an example of the subregions. There are seven subregions in the region of interest, and a classifier is trained for each subregion. The final classification decision is made by voting the outputs of the subregion classifiers. For example, for each of the regions of figure 3.3(b), a neural network is trained. When the trained networks are applied to an unseen example, this results in seven network outputs. The network outputs are voted to get the final output: if at least four networks give a positive output, the final result is positive. Otherwise the final result is negative.

Figure 3.3: Subregions for component classification.

3.4.3 A simplified approximation to the support vector machine decision rule

The number of support vectors of a decision surface scales approximately with the number of training examples. This makes the forward propagation through a support vector machine slow compared to other classifiers such as neural networks. To improve the forward propagation speed, a reduced set of vectors can be calculated from the original set of support vectors. To calculate a reduced set of vectors which approximates the original decision surface, the algorithm from [8] is applied. From the original decision surface

Ψ = Σ_{a=1..N_s} α_a y_a Φ(s_a),

where α_a are the weights calculated during training, s_a are the N_s support vectors, and y_a ∈ {−1, 1} are the class labels of the support vectors, a reduced vector set z_k of size N_z < N_s is calculated with decision surface

Ψ' = Σ_{k=1..N_z} β_k Φ(z_k),

where β_k are the weights. This is done by minimizing the euclidean distance ρ = ||Ψ − Ψ'||. The reduced set of vectors z_k, k = 1, ..., N_z is calculated with an unconstrained conjugate gradient method.

3.5 Feature selection

Feature selection is applied to select a stable set of features for classification from the whole set of features and to speed up the forward propagation of the classifier. The smaller the dimension of the input data, the faster the forward propagation.

3.5.1 Principal component analysis

Principal component analysis transforms a set of possibly correlated variables into a smaller number of uncorrelated variables. In feature selection this means finding a linear subspace of the complete set of features. The method for generating a linear subspace of a lower dimension is the Karhunen-Loève transform, a standard technique from statistical pattern recognition. Principal component analysis on a dataset consists of the following steps:

1. Calculation of the mean of each of the dimensions in the dataset.
2. Subtraction of the mean of each dimension in the dataset such that the mean of the dataset is zero.
3. Calculation of the covariance matrix of the data.
4. Calculation of the eigenvectors and eigenvalues of the covariance matrix.
5. Selection of the eigenvectors with the highest eigenvalues as the principal components of the dataset.
6. Calculation of the reduced dataset by multiplying the matrix of eigenvectors with the original dataset.

For classification this is used as follows: principal component analysis is applied on the feature vectors of pedestrian images in the training dataset. The matrix containing the eigenvectors for transforming the training dataset into the reduced features dataset is stored. The transformation matrix is used to transform the feature vectors of the training data to a lower dimension and a classifier is trained on the reduced set of feature vectors. Before each forward propagation through the classifier, the transformation is applied to the feature vector using the stored matrix. The purpose of principal component analysis for a real-time system is to reduce the input space to the classifier. For testing the performance of pca classification, the full set of 224 input features to a neural network is reduced to 100 features and 50 features.
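The six steps above in Python/NumPy; a minimal sketch assuming the rows of X are feature vectors, not the exact implementation used for the 224-feature experiments:

import numpy as np

def pca_fit(X, n_components):
    mean = X.mean(axis=0)                    # steps 1-2: center the data
    cov = np.cov(X - mean, rowvar=False)     # step 3: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # step 4: eigen decomposition
    order = np.argsort(eigvals)[::-1]        # step 5: highest eigenvalues
    return mean, eigvecs[:, order[:n_components]]

def pca_transform(x, mean, W):
    return (x - mean) @ W                    # step 6: project to the subspace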


3.5.2 Adaboost

Adaboost [19] is a learning algorithm which constructs a strong classifier from a number of weak classifiers. Adaboost is explained in algorithm 5. Adaboost can be used for feature selection by limiting the number of iterations in the algorithm. The features which are the most discriminative are found in increasing iteration order. In this work, a single layer perceptron with a sigmoidal activation function is used as weak learner. The Adaboost algorithm is iterated until a certain classification performance is achieved or until the number of desired features is reached. Like principal component analysis, Adaboost can be used to reduce the input dimension to the classifier. In addition, it can boost the classification performance by finding the most discriminative features.

Algorithm 5 Adaboost. Algorithm from [39].

Input: A set of examples {(x_1, y_1), ..., (x_n, y_n)} with labels y_i ∈ {−1, 1}.
Output: A strong classifier h(x).
1. Initialize the weights of the training examples: w_{1,i} = 1/n, for all i.
2. Repeat steps 3 until 5 for t = 1, ..., T.
3. Normalize the weights: w_{t,i} = w_{t,i} / Σ_{j=1..n} w_{t,j}.
4. For each feature j, train a weak classifier h_j. The error of the classifier is ε_j = Σ_i w_i |h_j(x_i) − y_i|.
5. Update the weights: w_{t+1,i} = w_{t,i} β_t^{1−e_i}, where e_i = 0 if x_i is classified correctly, e_i = 1 if x_i is classified incorrectly, ε_t is the error of the classifier h_t with the lowest error, and β_t = ε_t / (1 − ε_t).
6. The final strong classifier is

h(x) = 1 if Σ_{t=1..T} α_t h_t(x) ≥ (1/2) Σ_{t=1..T} α_t, and 0 otherwise,

where α_t = log(1/β_t).


3.5.3 Multi-objective optimization

Feature selection with a genetic algorithm works as follows: the indices of the feature vector are coded on a binary chromosome. A one at the feature index on the chromosome means the feature is selected, a zero means the feature is not selected. A genetic algorithm is used to evolve a population of individuals, each with one chromosome. The chromosomes of the individuals are all initialized to random values (zero or one). The genetic algorithm performs selection, crossover, mutation, and fitness evaluation of each individual. In [44], a genetic algorithm is used to select a subset of features for medical classification problems. In [17], a multi-objective feature selection is used with classification performance and feature dimension as fitness criteria. In [33], the genetic algorithm evolves a real valued chromosome. In this way, the genetic search results in a relative weighting of features.
In this work, a binary tournament selection is used to select the individuals for reproduction. Tournament selection with n individuals means that n individuals are selected from the population with a probability proportional to their fitness. The individual with the highest fitness value from the n individuals is selected. Uniform crossover is used to generate an offspring population from the parent population. This means each gene on a chromosome of an offspring is randomly selected from the corresponding genes on the parent chromosomes. The probability of crossover is set to 0.7. During mutation, the value of a bit can be flipped. The probability for mutation is set to 1.0 / number of features.
The fitness of an individual is calculated by training and validating a classifier as described in sections 3.3.2 and 3.3.3 on the features that have the value one in the chromosome. The fitness value has two objectives: the true positive rate on the validation dataset and the false positive rate on the validation dataset. An important choice in multi-objective optimization is how individuals with multiple fitness values are sorted before reproduction. In this work, the method from [16] is used. This method is based on a comparison operator ≺_n. Each individual i in the population has two attributes: a rank i_rank and a crowding distance i_distance. The calculation of the rank is shown in algorithm 6, the calculation of the crowding distance is shown in algorithm 7. The sort function in algorithm 7 sorts on objective value. The parameters f_m^max and f_m^min are the maximum and minimum value of objective m, respectively. The ≺ operator in algorithm 6 is the domination operator. An individual p with n objectives {p_1, ..., p_n} dominates an individual q if

∀i ∈ {1, ..., n}: p_i ≥ q_i and ∃i ∈ {1, ..., n}: p_i > q_i.

If individuals do not dominate each other, they are assigned the same rank. All solutions of the optimization that are not dominated by a different solution make up the Pareto optimal set. The comparison operator i ≺_n j is defined as

i ≺_n j if (i_rank < j_rank) or ((i_rank = j_rank) and (i_distance > j_distance)).


Algorithm 6 Assignment of a rank to an individual. Algorithm from [16].

Input: A population P with fitness values assigned to each individual.
Output: A ranking assigned to each individual.
1. Repeat steps 2 until 6 for each p ∈ P.
2. S_p = ∅, n_p = 0.
3. Repeat steps 4 and 5 for each q ∈ P.
4. If (p ≺ q) then S_p = S_p ∪ {q}.
5. Else if (q ≺ p) then n_p = n_p + 1.
6. If n_p = 0 then p_rank = 1, F_1 = F_1 ∪ {p}.
7. i = 1.
8. Repeat steps 9 until 15 while F_i ≠ ∅.
9. Q = ∅.
10. Repeat steps 11 until 13 for each p ∈ F_i.
11. Repeat steps 12 until 13 for each q ∈ S_p.
12. n_q = n_q − 1.
13. If n_q = 0 then q_rank = i + 1, Q = Q ∪ {q}.
14. i = i + 1.
15. F_i = Q.


Algorithm 7 Crowding-distance-assignment. Algorithm from [16].

Input: A population P of size N with fitness values assigned to each individual.
Output: A crowding distance assigned to each individual.
1. Repeat step 2 for i = 1, ..., N.
2. P(i)_distance = 0.
3. Repeat steps 4 until 7 for each objective m.
4. P = sort(P, m).
5. P(1)_distance = P(N)_distance = ∞.
6. Repeat step 7 for i = 2, ..., N − 1.
7. P(i)_distance = P(i)_distance + (P(i + 1).m − P(i − 1).m) / (f_m^max − f_m^min).

3.6 Scanning a classifier through the image at every position and scale

It is interesting to compare the initial detection based approaches as described in this work with methods where a classifier is scanned through the image at every position and scale. Examples of such a system are [39], [32] for face detection, and [40], [32] for pedestrian detection. The working of such a system is described in algorithm 8. For training a classifier for this method, manually labeled regions generated as described in section 3.3.1 are the positive examples. The negative examples are generated by applying the bootstrapping method as described in section 3.4.1 to randomly selected regions of interest of image sequences which contain no pedestrians. As an alternative, the classifiers trained on the initial detection routines can be used.

Algorithm 8 Scanning a classifier through the image at every scale and location.

Input: An image.
Output: A list of positive classifications.
1. Repeat steps 2 until 4 for every scale and image position.
2. In the current scanning window, calculate the image features for classification.
3. Classify the feature vector from the current scanning window.
4. If the classification output is positive, add the coordinates of the scanning window to the output list.
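Algorithm 8 as a Python sketch; extract_features and classify stand in for the feature extraction of section 3.2 and a trained classifier:

def scan_classifier(image, scales, step, extract_features, classify):
    img_h, img_w = image.shape
    hits = []
    for (w, h) in scales:                    # every scale ...
        for y in range(0, img_h - h, step):  # ... and every image position
            for x in range(0, img_w - w, step):
                window = image[y:y + h, x:x + w]
                if classify(extract_features(window)) > 0:
                    hits.append((x, y, w, h))
    return hits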

To improve the method in algorithm 8, the following modifications are made:

- A separate classifier is trained for side views and front/rear views of pedestrians, and a separate classifier is trained for pedestrians at a low resolution and pedestrians at a high resolution.
- The optimization methods from section 3.4 and the feature selection methods from section 3.5 are applied.

An example of classifying the whole image at every position and scale is shown in figure 3.4.


Figure 3.4: Classifying the whole image at every scale and resolution.


Chapter 4
Tracking

The purpose of tracking is to keep track of an object through successive image frames after a positive classification. In the case of pedestrians, this is challenging. The reasons for this are that:

- pedestrians keep changing shape if they are moving,
- they usually increase in size because of the movement of the car,
- a reliable motion model is not available because of the motion and pitching of the car,
- the contrast between the pedestrian and the background is often low, so it is difficult to calculate reliable image features for tracking, and there are usually changes in illumination conditions,
- the background in urban traffic scenes is complex and often contains many pedestrian-like objects,
- and the resolution of pedestrians is often small in the image, so not much information is available for tracking.


4.1 Tracking using the Hausdorff distance

The Hausdorff distance is a measure of inequality between two sets of points. Given two sets of points P = {p_1, ..., p_m} and Q = {q_1, ..., q_n}, their Hausdorff distance is

H(P, Q) = max( max_{p∈P} min_{q∈Q} ||p − q||, max_{q∈Q} min_{p∈P} ||q − p|| ).

If d_1 is the maximum distance of set P to the nearest point in Q and d_2 is the maximum distance of set Q to the nearest point in P, then the Hausdorff distance is the maximum of these two distances.
The partial Hausdorff distance can be used to measure the inequality between subsets of two sets of points. Given two sets P = {p_1, ..., p_m} and Q = {q_1, ..., q_n}, their partial Hausdorff distance is

H_k(P, Q) = K^th_{p∈P} min_{q∈Q} ||p − q||,

where 1 ≤ k ≤ m. Here, the k-th largest element is used instead of the maximum element. The partial Hausdorff distance makes it possible to locate objects which are partially occluded.
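The (partial) Hausdorff distance in Python/NumPy; a direct O(mn) sketch for small point sets, not the optimized variant from [41]:

import numpy as np

def directed_partial_hausdorff(P, Q, k=None):
    # For every p in P the distance to its nearest q in Q, then the k-th
    # largest of those values (the maximum when k is not given).
    d = np.sqrt(((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2))
    nearest = d.min(axis=1)
    return nearest.max() if k is None else np.sort(nearest)[-k]

def hausdorff(P, Q):
    return max(directed_partial_hausdorff(P, Q),
               directed_partial_hausdorff(Q, P))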
The tracker based on the Hausdorff distance [41] works as follows. The tracker is initialized with a set of image features P (intensity values or gradients); this is the model set of the tracker. In the next frame, at several image positions and scales with feature set Q, the partial Hausdorff distance is calculated between the feature set P and the feature set Q. The location of the feature set Q with the smallest partial Hausdorff distance between P and Q is selected as the new location and scale of the object which is tracked. A Kalman filter with a linear motion model is used to estimate the position of the search region for the pedestrian in the next frame, in which the feature set Q is calculated. The process noise and measurement noise for the Kalman filter are drawn from a normal distribution.


A problem with using the Hausdorff distance for tracking is that when a model for tracking is generated from the image features, the model does not contain just image features from the object to be tracked but also image features from the background. This invalidates the tracking model when the object is moving. In order to separate model features from background features, the following model adaptation mechanism is used. The set of image features in the next frame is compared to the image features of the tracker model. A list is made of image features which are present in the model but do not correspond to an image feature in the next frame. Also, a list is made of image features which are present in the next frame but do not correspond to a feature in the model. From the first list, model features are removed randomly with a certain probability. From the second list, image features are randomly added to the model features with a certain probability. The degree of adaptiveness of the tracker can be controlled by varying the probability parameter. The complete description of the Hausdorff tracker is given in algorithm 9. In this work, for pedestrians, only the upper half of the body is tracked, because the movement of the legs would invalidate the model of the tracker in a few frames. In order to handle the problem of background features in the model, the tracker model is adapted every few frames because of the often high relative speed difference between the pedestrian and the car.


Algorithm 9 The Hausdorff tracker.

Input: A feature image I, the position of the object in the previous frame p_{t−1}, the feature set of the model P.
Output: The position of the object in the current frame p_t.
1. Estimate with a Kalman filter the next position x_t of the object.
2. Search in I in the area around the estimated position x_t for the location p_t of the object with feature set Q such that the Hausdorff distance between P and Q is minimal.
3. If the tracker model needs to be updated this frame, make a list L_1 of image features which are present in the model but do not correspond to an image feature in frame I, and make a list L_2 of image features which are present in the frame I but do not correspond to a feature in the model.
4. If the tracker model needs to be updated this frame, randomly remove features in L_1 from the model and randomly add features from L_2 to the model.

4.2 Mean shift tracking

Mean shift tracking [11] is based on the assumption that the position and scale of an object will not change much from one frame to the next. A target model with image features z has a density function q_z; the target candidate located at position y has a feature distribution p_z(y). Mean shift tracking finds the position y whose density p_z(y) matches the density q_z best. The metric used for the density similarity is the Bhattacharyya coefficient:

ρ(y) ≡ ρ[p(y), q] = ∫ sqrt(p_z(y) q_z) dz.    (4.1)

For sampled data, the discrete densities are calculated from m-bin histograms: q = {q_1, ..., q_m} with Σ_{u=1..m} q_u = 1 for the model and p(y) = {p_1(y), ..., p_m(y)} with Σ_{u=1..m} p_u = 1 for the candidate.


A kernel is used to assign a weighting to pixels: pixels further away from the center of the target are given a lower weight than pixels at the center of the target. This is done because pixels further away from the center are more likely to be affected by occlusion or background. The kernel applied in [11] is the Epanechnikov kernel

k(x) = (1/2) c_d^{−1} (d + 2)(1 − x) if x < 1, and 0 otherwise,

for normalized image coordinates x. To find the new location y_1 of the target, the mean shift vector is applied

y_1 = ( Σ_{i=1..n_h} x_i w_i g(||(y_0 − x_i)/h||^2) ) / ( Σ_{i=1..n_h} w_i g(||(y_0 − x_i)/h||^2) ),    (4.2)

where y_0 is the location of the target in the previous frame, g is the kernel for weighting, w_i is the weight of pixel x_i, and n_h is the number of pixels in the radius h of the kernel around the current location y_0. The weights w_i are calculated as follows

w_i = Σ_{u=1..m} sqrt( q_u / p_u(y_0) ) δ[b(x_i) − u],

where b is a function that assigns an index of a bin to a pixel.

The complete tracking algorithm is described in algorithm 10. In this work, for tracking pedestrians, the features used for calculating the densities are intensity values or image gradients. The radius of the kernel is dependent on the size of the region of interest to be tracked: for a large region of interest a large radius is used, for a small region of interest a small radius is used.


Algorithm 10 Mean shift tracking.

Input: The distribution q of the target object and its position y_0 in the previous frame, an image with pixels x.
Output: The position of the object in the current frame y_1.
1. Calculate the distribution {p_u(y_0)}_{u=1..m} in the current frame at y_0 and the Bhattacharyya coefficient ρ[p(y_0), q] with 4.1.
2. Calculate the new position of the target y_1 with 4.2.
3. Calculate the Bhattacharyya coefficient ρ[p(y_1), q] at the new position y_1.
4. Repeat step 5 while ρ[p(y_1), q] < ρ[p(y_0), q].
5. y_1 = (1/2)(y_0 + y_1).
6. If ||y_1 − y_0|| > ε: y_0 = y_1, go to step 1.
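The discrete densities and the Bhattacharyya coefficient used in steps 1 and 3 of algorithm 10, sketched in Python/NumPy; the bin assignment b(x) is simplified here to a plain intensity histogram:

import numpy as np

def density(patch, m=16):
    # discrete m-bin density of the pixel intensities in a patch
    hist, _ = np.histogram(patch, bins=m, range=(0, 256))
    return hist / hist.sum()

def bhattacharyya(p, q):
    # rho[p, q] = sum_u sqrt(p_u q_u); 1.0 means identical densities
    return np.sqrt(p * q).sum()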

4.3 Tracking with Condensation

In the Condensation algorithm [25], tracking is performed by propagating a state density p(x_t | Z_t) over time with Bayes' rule:

p(x_t | Z_t) = k_t p(z_t | x_t) p(x_t | Z_{t−1}),

where x_t is the state at time t, z_t is the observation from the image at time t, p(z_t | x_t) is the observation density at timestep t, the prior p(x_t | Z_{t−1}) is a prediction step from the posterior density p(x_{t−1} | Z_{t−1}) of the previous timestep t − 1, and k_t is a constant. The prior density of timestep t is generated by factored sampling. The posterior density p(x_{t−1} | Z_{t−1}) from the previous timestep t − 1 is represented by a set of N weighted samples {(s^(n)_{t−1}, π^(n)_{t−1}), n = 1, ..., N}. N samples are sampled with replacement from the sample set. Each sample s^(n)_{t−1} is selected with a probability of π^(n)_{t−1}. Samples with high weights may be selected


multiple times. Each selected sample undergoes a deterministic step followed by a random step. This gives the prior density p(x_t | Z_{t−1}) of timestep t. Now, the observation density p(z_t | x_t) is used to calculate the weights of the samples. This gives the posterior density p(x_t | Z_t) of timestep t.
In this work, the prior density p(x_t | Z_{t−1}) of timestep t is calculated by applying only a random step. It is assumed that a reliable motion model of a pedestrian does not exist, because the camera may move or stand still and the pedestrian may move or stand still. In addition, the pedestrian may be moving perpendicular to the viewing direction of the camera or along it. Therefore, only random motion is used. For tracking pedestrians from a moving car, x position, y position, and scale are the parameters for tracking. Each sample represents a position and scale. For calculating the observation density from the image data, the contour based approach from [25] is followed. A small set of model contours is calculated by performing a principal component analysis on a set of manually labeled contours (as described in [1]). An example of a model contour is shown in figure 4.1. A sample represents a contour with a certain scale. The distance from the contour along a fixed set of M normals along the contour is used to calculate the observation density p(z|x) of a sample:

p(z|x) ∝ exp( −(1/(2rM)) Σ_{m=1..M} min(||z_1(s_m) − r(s_m)||^2; μ) ),

where r(s_m) is a model contour, z_1(s_m) is the closest feature to r along normal m, and r and μ are constants. In the case of multiple model contours, the contour with the smallest sum is used for calculating the observation density. The sample with the smallest sum of distances is selected as the current position/scale of the pedestrian. For tracking pedestrians, the image energy is used as the feature for tracking. The complete outline of the Condensation algorithm is shown in algorithm 11.

Figure 4.1: Model contour of the Condensation tracker.

Algorithm 11 The Condensation algorithm.

Input: An initial sample set {(s^1_1, π^1_1), ..., (s^n_1, π^n_1)} representing position and scale at timestep 1.
Output: A sample set {(s^1_T, π^1_T), ..., (s^n_T, π^n_T)} of timestep T.
1. Repeat steps 2 until 4 for t = 1, ..., T iterations.
2. Sample with replacement N samples from the set {(s^1_t, π^1_t), ..., (s^n_t, π^n_t)}. Each sample s^i_t is selected with probability π^i_t.
3. Each of the selected samples undergoes a deterministic motion followed by a random motion.
4. Each of the samples is weighted by calculating the observation density. This gives the sample set {(s^1_{t+1}, π^1_{t+1}), ..., (s^n_{t+1}, π^n_{t+1})} of timestep t + 1.
5. The sample s^i_{t+1} with the smallest sum of distances calculated for its observation density is selected as the current position/scale of the pedestrian.
CHAPTER 4.

80

TRACKING

Integrating initial detections through time

4.4

A very simple model-free tracking method is based on the assumption that if a pedestrian is present in a certain frame, it will be present at approximately the same location and at approximately the same scale in the next image. To find the pedestrian in the next image, the output from the initial detection routine is used. The positions and scales of the regions of interest from the initial detection routine are evaluated. If a matching region of interest is found, it is selected as the location of the pedestrian in the next frame. An example of this method is shown in figure 4.2. The advantage of using the initial detection for tracking is that it is invariant to the scaling of the pedestrians through successive frames, since the initial detection is developed to detect pedestrians at various sizes. It is also developed for finding pedestrians in complex, structured environments and in changing illumination conditions. The initial detection tracker is described in algorithm 12.

Algorithm 12 Initial detection tracker.

Input: The current center position (x_current, y_current) and scale s_current of the tracked region of interest; a list (x_1, y_1, s_1), ..., (x_N, y_N, s_N) of N initial detections in the next frame.
Output: A new center position (x_new, y_new) and scale s_new in the next frame.
1. Initialize j = 1, d_j = sqrt((x_current − x_j)^2 + (y_current − y_j)^2) and d_s = s_current − s_j.
2. Repeat steps 3 and 4 for i = 2, ..., N.
3. d_i = sqrt((x_current − x_i)^2 + (y_current − y_i)^2) and d_s = s_current − s_i.
4. If d_i < d_j, then d_j = d_i and j = i.
5. (x_new, y_new) = (x_j, y_j) and s_new = s_j.
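A short Python sketch of algorithm 12, matching on positional distance only; any gating on the scale difference d_s would be added to the if-condition:

import math

def nearest_initial_detection(current, detections):
    # current: (x, y, s); detections: list of (x, y, s) in the next frame
    xc, yc, sc = current
    best, best_d = None, float("inf")
    for (x, y, s) in detections:
        d = math.hypot(xc - x, yc - y)   # positional distance (steps 1 and 3)
        if d < best_d:                   # step 4: keep the nearest region
            best, best_d = (x, y, s), d
    return best                          # step 5: new position and scale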


(a) A positive classification in a fir image
(b) The initial detections in the next frame
(c) The initial detection selected as the new position of the pedestrian

Figure 4.2: Integrating initial detections through time.

4.5 Tracking the classification output

When detection is performed by scanning a classifier through the whole image at every scale and location, as described in section 3.6, the classification output can also be used for tracking. The tracker described in this section resembles the tracker from [32], where Condensation is used for propagating a density of classification outputs over time. The assumption is made that the classification outputs decrease proportionally to the distance from the pedestrian in both location and scale. In addition, the assumption is made that the location and scale of a pedestrian do not change much from one frame to another. Tracking a pedestrian then means locating the highest peak in classification outputs near the position of the pedestrian in the previous frame. This method can possibly reduce the false positive detection rate, because it can be expected that false positives do not generate a similar peak in classification output as a pedestrian does. An example of the classification outputs at a few scales was shown in figure 3.4. The complete method is described in algorithm 13.

Algorithm 13 Tracking the classification output.

Input: An initial set of positions {p_1, ..., p_n} containing a pedestrian; T frames.
Output: A set of new positions after T frames.
1. Repeat steps 2 until 5 for t = 1, ..., T.
2. Calculate the classification outputs at each location and scale in frame t.
3. Repeat steps 4 and 5 for each of the positions p ∈ {p_1, ..., p_n}.
4. Find the highest peak in classification outputs around p.
5. Set the position and scale of p to the position and scale of the highest peak.

Chapter 5
Experimental results

5.1 Initial detection

In this section, the performance of the initial detection routines evaluated on fir and grayscale images is described. To evaluate the performance of the initial detection routines, the output of each of the routines is compared to manually labeled ground-truth data. The comparison is performed as follows: for each frame of a video sequence in the ground-truth database, the output of the initial detection routine is calculated. Then the output of the initial detection routine is matched to the ground truth data. The matching process is shown in figure 5.1. The ground truth data is represented by red outlined rectangles and the output of the initial detection routine is represented by blue outlined rectangles. There is a true positive detection if a region of interest output by the initial detection routine corresponds to a region of interest in the ground-truth database. This is shown by the overlap of the red outlined rectangle with the blue outlined rectangle in figure 5.1. There is a false positive detection if the initial detection routine outputs a region of interest and there is no corresponding region of interest in the ground truth database. This is shown by the blue outlined rectangle in figure 5.1 for which there is no corresponding red outlined rectangle. There is a false negative detection if the ground-truth database contains a region of interest for which there is no corresponding region of interest output by the initial detection routine. This is shown by the red outlined rectangle in figure 5.1. Ideally, the true positive rate of an initial detection routine is one and the false positive rate is zero. In practice, a compromise has to be found between the true positive detection rate and the false positive detection rate.

Figure 5.1: Matching ground truth data.

A certain amount of deviation is allowed for a correspondence. This is shown in figure 5.1: the blue outlined rectangle from the initial detection routine does not exactly match the red outlined rectangle from the ground-truth database. A deviation of 10% of the width and height of the ground-truth region of interest is allowed in the horizontal and vertical direction. The deviation is allowed because not all the initial detection routines deliver accurate regions of interest. In addition, the ground-truth database is produced by manually labeling image sequences, and there is always a certain amount of inaccuracy in the ground truth data. The deviation value of 10% is selected because at this level of deviation it is still possible to classify the output of the initial detection routine. The classifier requires a certain level of accuracy in order to properly learn the class of pedestrians. At a larger deviation, the spatial distribution of image features inside the region of interest makes the region of interest a non-typical pedestrian example. An example of this is shown in figure 5.2. Figure 5.2(a) contains an example suitable for training a classifier: the pedestrians fit nicely in the regions of interest. Figure 5.2(b) contains an example unsuitable for training a classifier: the pedestrian does not fit the region of interest.

(a) Initial detections suitable for classification
(b) Initial detections unsuitable for classification

Figure 5.2: Suitability of initial detections for classification.
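The correspondence test with the 10% deviation tolerance as a sketch; rectangles are (x, y, w, h) tuples and the ground-truth box supplies the tolerance:

def matches_ground_truth(det, gt, tol=0.10):
    # a detection corresponds to a ground-truth region if every rectangle
    # edge deviates by at most tol times the ground-truth width/height
    dx, dy = tol * gt[2], tol * gt[3]
    return (abs(det[0] - gt[0]) <= dx and abs(det[1] - gt[1]) <= dy and
            abs(det[0] + det[2] - gt[0] - gt[2]) <= dx and
            abs(det[1] + det[3] - gt[1] - gt[3]) <= dy)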

In order to properly measure the performance of the initial detection routines, the ground-truth database should contain a large set of representative sequences. The fir images ground-truth database for initial detection consists of 4443 pedestrians for the temperature range of 5°, 3510 pedestrians for the temperature range of 15°, and 2904 pedestrians for the temperature range of 25°. The grayscale images ground-truth database for initial detection consists of 50 sequences containing at least one pedestrian each. One and the same pedestrian which is present in multiple frames is counted once for each frame in which it is present. A measurement is made as soon as the pedestrian is large enough for detection.


In the 320x240 pixel images used, a minimum size of 10x20 pixels is used. The exact thresholds for the initial detection routines are not mentioned because they depend strongly on the type of camera used, especially in fir images. The results displayed in this section are consistently measured on data from a single fir camera and a single grayscale camera.

5.1.1 Results on fir images

To measure the performance of the initial detection routines on fir sequences, the sequences are divided based on outside temperature. This is done because the intensity distribution of fir images changes with temperature. Four temperature ranges are selected: 5°, 15°, 25°, and 35°, each with a maximum deviation of 3°. An example of fir images at different temperature ranges is shown in figure 5.3. The exact temperature is always supplied as an argument to the initial detection routine.


(a) Fir image at 5 degrees
(b) Fir image at 15 degrees
(c) Fir image at 25 degrees
(d) Fir image at 35 degrees
Figure 5.3: Fir images at different temperature ranges.

Figure 5.4 shows the mean true positive rate of the initial detection routines on the different temperature ranges. The range 35° is omitted because image processing in fir images is not possible anymore at this temperature range with the particular camera used. At 35° there is too little contrast between objects in the image and the background. This is clearly visible in figure 5.3. It is not possible to report the false positive rate of the initial detection routines, because there is no ground truth data of negative examples. It is possible to calculate the percentage of positive detections of the total number of detections. This is shown in figure 5.5.


Figure 5.4: Initial detection results.

Figure 5.5: Percentage positive detections.


Figure 5.6 shows the calculation times of the initial detection routines for fir images. These values do not include the time required for image pre-processing. All values are measured on a computer with a 1470 MHz AMD Athlon processor.

Figure 5.6: Calculation times of initial detection for fir images.

5.1.2 Results on grayscale images


In the case of grayscale images, the distribution of intensity values is mainly dependent on the lighting conditions. Because the illumination conditions may change while the car is moving, the sequences are not divided on illumination conditions, so there is a single database containing all grayscale sequences. An example of a sudden change in illumination conditions is shown in figure 5.7. In figure 5.7(a), the image is rather dark, while in figure 5.7(b), which is recorded 6 frames later, the sudden appearance of sunlight causes the image to be much brighter than the previous image.

Figure 5.7: Illumination changes in grayscale images.

A set of sequences containing images under various illumination conditions makes it more difficult for the initial detection to find regions of interest, so a compromise has to be found between generality and performance of the initial detection routines. Of course, the region based brightness segmentation cannot be applied in the case of grayscale images. Figure 5.8 shows the true positive rates of the initial detection routines on grayscale images and the percentage of positive detections of the total number of detections.


Figure 5.8: Initial detection results in grayscale images.

Figure 5.9 shows the calculation times of the initial detection routines for grayscale images. These values include the time required for image pre-processing. All values are measured on a computer with a 1470 MHz AMD Athlon processor.


Figure 5.9: Calculation times of initial detection for grayscale images.

5.2 Classification

The performance of the classification is measured by calculating the true positive rate and the false positive rate of a classifier/image feature combination. This is usually visualized with a so-called ROC (Receiver Operating Characteristic) curve. A ROC curve displays the true positive rate of the classifier at a certain false positive rate. An example of a ROC curve is shown in figure 5.10. For neural networks, the ROC curve is created by varying the threshold (normally 0.5) at the output neuron. For support vector machines, the ROC curve is created by varying the threshold $b$ in the separating function

$f(x) = \operatorname{sign}\left( \sum_{i=1}^{l} \alpha_i y_i k(x, x_i) + b \right)$.


Figure 5.10: An example ROC curve.
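For illustration, such a curve can be produced from raw classifier outputs as in the following minimal sketch (here `scores` stands for the kernel sum, or the output neuron activation, before thresholding; the helper itself is an assumption, not part of the system described):

import numpy as np

def roc_points(scores, labels, n_thresholds=100):
    """Sweep the decision threshold over the raw classifier outputs and
    record one (FPR, TPR) pair per threshold.  `labels` holds +1 for
    pedestrians and -1 for non-pedestrians."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    thresholds = np.linspace(scores.min(), scores.max(), n_thresholds)
    points = []
    for t in thresholds:
        pred = np.where(scores >= t, 1, -1)
        tpr = np.mean(pred[labels == 1] == 1)    # true positive rate
        fpr = np.mean(pred[labels == -1] == 1)   # false positive rate
        points.append((fpr, tpr))
    return points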

A separate classifier is trained for each classifier/image feature combination. In addition, separate classifiers are trained for side views and front/back views of pedestrians. In general there are two different ways of generating data for training a classifier: the first is manually labeling the output of the initial detection routines. The second is manually creating ground truth data of pedestrians (positive examples) and using random image parts not containing pedestrians as negative examples for training. The bootstrapping method from section 3.4.1 is used to create a representative dataset of negative examples. In fir images the object size for classification is selected as 20x40 pixels. In grayscale images the object size for classification is selected as 24x48 pixels. Regions of interest smaller than 20x40 pixels in fir images and smaller than 24x48 pixels in grayscale images are not classified because too little information is available for feature calculation.
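A minimal sketch of the bootstrapping method from section 3.4.1 mentioned above reads as follows (the function names, the retraining interface, and the number of rounds are illustrative assumptions):

def bootstrap_negatives(train_pos, neg_pool, train_fn, rounds=3):
    """Iteratively grow the negative training set with the classifier's
    own false positives.  `neg_pool` is a large set of pedestrian-free
    image regions; `train_fn` trains a classifier on (positives,
    negatives) and returns it."""
    negatives = neg_pool[:3 * len(train_pos)]    # start with random negatives
    clf = train_fn(train_pos, negatives)
    for _ in range(rounds):
        # collect regions the current classifier wrongly accepts
        false_pos = [x for x in neg_pool if clf.predict(x) == 1]
        if not false_pos:
            break
        negatives = negatives + false_pos        # add the hard examples
        clf = train_fn(train_pos, negatives)     # retrain
    return clf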
To measure the performance of classification, a representative set of training, validation, and test data has to be selected. All three sub-datasets consist of mostly urban sequences and some land street sequences.

5.2.1 Results on fir images


As with the initial detection in fir images, the training dataset is divided into sub-datasets based on temperature range. As mentioned in section 5.1.1, the intensity distribution of fir images changes with temperature. The same three temperature ranges as used for the initial detection are applied: 5°, 15°, and 25°, each with a maximum deviation of 3°. For evaluating the classifier, the datasets of 5° and 15° are used because a large amount of data is available for these temperature ranges. The training dataset of 5° consists of 729 pedestrian images generated from temperature initial detection, 2581 pedestrian images generated from vertical gradients initial detection, and 3165 pedestrian images generated from scanning a vertical edge detector initial detection.

The training dataset of 15° consists of 287 front/back view pedestrian images generated from temperature initial detection, 1728 pedestrian images generated from vertical gradients initial detection, and 1448 pedestrian images generated from the initial detection based on scanning a vertical edge detector through the image.

The training dataset is divided into 3 sub-datasets: one for training, one for the validation of the trained classifier, and one test dataset. The datasets are divided in such a way that no training image is part of more than one subset. There are always three times as many negative training examples in the dataset as there are positive training examples. The reason for having more negative examples than positive examples in the dataset is to achieve a low false positive rate.


Figures 5.11, 5.12, and 5.13 show the ROC curves for the support vector machine classification on a test set where the training data is generated by the initial detection routines from sections 2.2 and 2.3, for different image features.


(a) Orientations features 5 degrees
(b) Orientations features 15 degrees
Figure 5.11: Results of support vector machine classification, orientations features.


(a) Gradients orientations features 5 degrees
(b) Gradients orientations features 15 degrees
Figure 5.12: Results of support vector machine classification, gradients and orientations features.


(a) Rectangle features 5 degrees
(b) Rectangle features 15 degrees
Figure 5.13: Results of support vector machine classification, rectangle features.


Figures 5.14, 5.15, and 5.16 show the ROC curves for the neural network classification on a test set where the training data is generated by the initial detection routines from sections 2.2 and 2.3.


(a) Orientations features 5 degrees
(b) Orientations features 15 degrees
Figure 5.14: Results of neural network classification, orientations features.


(a) Gradients orientations features 5 degrees
(b) Gradients orientations features 15 degrees
Figure 5.15: Results of neural network classification, gradients and orientations features.


(a) Rectangle features 5 degrees
(b) Rectangle features 15 degrees
Figure 5.16: Results of neural network classification, rectangle features.


Figure 5.17 shows the processing times of the feature vector calculation, i.e., of computing the input for forward propagation through the classifier.

Figure 5.17: Processing times of the feature vector calculation.

Figure 5.18 shows the classification times of the different classifier/feature combinations.


Figure 5.18: Classification times of the classifier/image feature combinations.

5.2.2 Results on grayscale images


As with the initial detection in grayscale images, one single dataset is used for training a classifier for grayscale images. The reason for this is that, because the distribution of intensity values in the image depends on changing illumination conditions, it is impossible to select a classifier for a particular illumination condition beforehand. Figure 5.8 shows that the performance of the initial detection routines on grayscale images is limited. Therefore only ground truth data in combination with the bootstrapping method from section 3.4.1 is used to generate training data. The training dataset consists of 444 front/back view pedestrian images and 342 side view pedestrian images. There are three times as many negative examples in the training database as there are positive examples. The remaining positive examples are divided between the validation and test databases. Figures 5.19, 5.20, and 5.21 show the support vector machine and neural network classification results on a test set of grayscale images where the training data is generated using the bootstrapping method from section 3.4.1.


(a) Support vector machine orientations features
(b) Neural network orientations features
Figure 5.19: Classification results on grayscale images, orientations features.


(a) Support vector machine gradients orientations features
(b) Neural network gradients orientations features
Figure 5.20: Classification results on grayscale images, gradients orientations features.


(a) Support vector machine rectangle features
(b) Neural network rectangle features
Figure 5.21: Classification results on grayscale images, rectangle features.


Figure 5.18 shows the classification times of the different classifier/feature combinations for fir and grayscale images.

5.2.3 Results of component classification


Figure 5.22 shows the results of the component classification of fir images with a support vector machine and a neural network trained on histograms of gradients and orientations features, on data from 5°.

Figure 5.22: Results of component classification.

5.2.4 Results of classifier optimization on classification performance
Figure 5.23 shows the results of the feature selection methods from section 3.5 on the classification performance. All data is from fir images recorded at 5°. Figure 5.23 shows the results of a support vector machine trained on a feature set of histograms of orientations reduced with a principal component analysis. Figure 5.24 shows the results of Adaboost compared to a neural network trained on the full set of histograms of gradients and orientations features. As sub-feature set for Adaboost, a single histogram of 8 bins is used. The maximum number of weak learners is limited to 20. Figure 5.25 shows the results of multi-objective optimization feature selection compared to the full set of histograms of orientations features.

Figure 5.23: Results of PCA feature selection.


Figure 5.24: Results of Adaboost feature selection.

Figure 5.25: Results of multi-objective optimization feature selection.


5.2.5 Results of classifier optimization on classification speed

This section shows the results of the support vector machine reduction algorithm described in section 3.4.3 and the feature selection algorithms from section 3.5. Figures 5.26 and 5.27 display the classification time in milliseconds for each of the optimization algorithms.


(a) Support vector reduction
(b) Adaboost
Figure 5.26: Classification times of optimization algorithms.


(a) Principal component analysis
(b) Feature selection with multi-objective optimization
Figure 5.27: Classification times of optimization algorithms.


5.2.6 Scanning a classifier through the image at every position and scale
For scanning a classifier through the image at every position and scale, the training dataset for grayscale images is generated from manually labeled ground truth data and the bootstrapping method from section 3.4.1. For fir images, the training datasets generated by the initial detection routines are used. The support vector machine classifier trained on a complete set of image features is used for fir images as well as for grayscale images. Figure 3.4 showed an example of scanning a classifier through the image at every position and scale. As can be seen from figure 3.4, there are often multiple positive classifications on one object at different scales. There are a total of 4863 classifications per frame for fir images and for grayscale images. Figures 5.11, 5.14, and 5.19 show the false positive rate for each of the classifier/feature combinations. Because of the high number of classifications per frame, there is a high average number of false positives per frame. Of interest is also the computation time per frame. Figure 5.18 shows the computation time in milliseconds per frame for the different classifier/feature combinations.
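A minimal sketch of such an exhaustive scan over positions and scales is given below (the stride and scale factor are illustrative assumptions and determine the number of windows per frame; the base window corresponds to the 20x40 or 24x48 pixel object sizes from section 5.2):

import numpy as np

def scan_image(image, classify, base=(20, 40), stride=4, scale=1.25):
    """Slide a fixed-size classifier window over every position and
    scale by repeatedly shrinking the image (an image pyramid).
    `classify` maps a base-sized patch to a score; detections are
    returned as (x, y, w, h, score) in original image coordinates."""
    detections = []
    factor = 1.0
    img = image
    while img.shape[0] >= base[1] and img.shape[1] >= base[0]:
        for y in range(0, img.shape[0] - base[1] + 1, stride):
            for x in range(0, img.shape[1] - base[0] + 1, stride):
                patch = img[y:y + base[1], x:x + base[0]]
                score = classify(patch)
                if score > 0:
                    detections.append((int(x * factor), int(y * factor),
                                       int(base[0] * factor),
                                       int(base[1] * factor), score))
        factor *= scale
        h, w = int(image.shape[0] / factor), int(image.shape[1] / factor)
        if h < base[1] or w < base[0]:
            break
        # nearest-neighbour downsampling keeps the sketch dependency-free
        ys = (np.arange(h) * image.shape[0] / h).astype(int)
        xs = (np.arange(w) * image.shape[1] / w).astype(int)
        img = image[ys][:, xs]
    return detections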

5.3 Tracking

To measure the performance of the tracking algorithms, a similar method is applied as for measuring the performance of the initial detection routines. The output of the tracking algorithms is matched to manually labeled ground-truth data (the same ground-truth data as used for initial detection and classification). The same procedure as described in section 5.1 is used for comparing the output of the tracking algorithms to the ground-truth data. A certain small amount of deviation from the ground-truth data is allowed. As with the initial detection, a deviation of 10% of the width and height of the ground-truth region of interest is allowed in the horizontal and vertical direction. Usually, the tracker is started by a positive classification. In order to make the results of the tracking algorithms independent of the initial detection algorithms and the classification when measuring the tracker performance, the tracker is started manually.

As with measuring the performance of the initial detection routines and the classification, a representative set of data is selected for testing the tracker. This set consists of a mixture of land street and urban scenarios. To avoid the tracking results being influenced by pedestrians moving out of the image (for example caused by the car moving past the pedestrian), only tracking runs are counted where the tracker loses track of the pedestrian before it moves out of the image.

5.3.1 Results on fir images


Figure 5.28 shows the average number of frames a pedestrian is tracked with the different tracking algorithms on fir sequences. The test dataset consists of 30 sequences containing at least one pedestrian. Only the tracker which integrates initial detections through time, the tracker which integrates classifications through time, and the tracker based on the Hausdorff distance are shown. The Mean shift tracker and the Condensation tracker do not produce any useful results at all and are not displayed.


Figure 5.28: Pedestrian tracking in fir images.

Figure 5.29 shows the processing times for the different tracking algorithms in fir images.


Figure 5.29: Processing times of the tracking algorithms.

5.3.2 Results on grayscale images


Only the tracker which integrates classifications over time generates somewhat acceptable results on grayscale images. The average number of frames tracked is 17 with a standard deviation of 14 frames. The test dataset for evaluating this tracker consists of 30 sequences containing at least one pedestrian. The processing time for this tracker is the same as for the tracker which integrates classifications over time in fir images (see figure 5.29). The tracker which integrates initial detections through time, the tracker based on the Hausdorff distance, the Mean shift tracker, and the Condensation tracker do not produce any useful results at all.

Chapter 6
Discussion

6.1 Achievements and limitations

The results of the initial detection algorithms are reasonable for fir images and disappointing for grayscale images, as figure 5.4 and figure 5.8 demonstrate. For both image types, the true positive detection rate is quite low and the false positive rate is high. The true positive rate for fir images is still acceptable: the probability that a pedestrian is detected within a few frames is still high at a frame rate of 20 frames per second. The true positive rate for grayscale images is so low that it is not useful to use the initial detection routines for grayscale images.
As figures 5.11, 5.12, 5.13 and figures 5.14, 5.15, 5.16 for fir images and figures 5.19, 5.20, 5.21 for grayscale images demonstrate, the results of classification are very good. Figure 5.22 and figures 5.23, 5.24, and 5.25 show that the component classifier and the optimization methods for classification lead to better classification performance.

Figure 5.28 and section 5.3.2 show that, aside from the tracking method which integrates the output of the classifier that is scanned through the whole image at every position and scale, the tracking results are very disappointing.

This holds for fir images as well as for grayscale images. The Hausdorff tracker, the Mean Shift tracker, and the Condensation tracker perform so poorly when tracking pedestrians from a moving car that they are not useful for this purpose.

Before going into more detail on the strengths and weaknesses of the components of the detection system, it should be mentioned that the disappointing results of the initial detection routines for grayscale images and of the tracking methods can to some degree be explained by the complexity of pedestrian detection. As mentioned in section 1.2, many computer vision applications operate in a controlled and static environment where color or motion segmentation can be applied, or where a search for a fixed pattern can be used to segment objects from a scene. Although these computer vision problems are not trivial, they are not as complicated as detecting moving, shape-changing objects from a moving camera against a complex structured urban background in low resolution images. Image processing applications for driver assistance systems like lane detection and car detection in a highway scenario also deal with detecting rigid objects against a largely homogeneous background.

6.1.1 Initial detection


Figure 5.4 shows that for fir images, temperature segmentation performs best at low outside temperatures. This is to be expected: at low outside temperatures, many objects have a low intensity value because they are cool, while pedestrians have a high intensity value because of their body temperature. Figure 5.4 shows that the true positive rate of the temperature segmentation decreases with increasing temperature. The reason for this is that at higher temperatures, the contrast between the pedestrian and the background is lower than at lower temperatures. At temperatures higher than 20°, the temperature segmentation does not perform satisfactorily anymore.
Figure 5.4 shows that for fir images, the vertical gradient based initial detection routine and the initial detection routine which scans a vertical edge detector through the image at every position and scale perform best at low outside temperatures. Their performance decreases with an increase in outside temperature. The reason for this is that at higher temperatures, the contrast between the pedestrian and the background is lower than at lower temperatures. The gradient magnitudes calculated from the intensity image have a weaker response, so the initial detection routines have more difficulty segmenting objects from the background than at lower temperatures.
Figure 5.8 shows that for grayscale images, the vertical gradient based initial detection routine and the initial detection routine which scans a vertical edge detector through the image at every position and scale do not produce acceptable results. The main difficulty for the initial detection routines is the complex structure of the background. What all initial detection routines basically do is search for vertical objects in the image. Because of the many vertical structures in the background in urban scenarios, the initial detection routines segment the pedestrian together with a background structure, or do not segment the pedestrian at all because there are vertical structures with a stronger gradient magnitude in the image. Examples are shown in figure 6.1 for fir images. An example of the many vertical structures in grayscale images and the difficulties this creates for the initial detection routines from section 2.2 is shown in figure 6.2. What may also happen is that the initial detection routines only find a part of the pedestrian. The match with the ground truth data, which is used for evaluating the initial detection routines, can for this reason also be unsuccessful.


Figure 6.1: Difficulties for initial detection because of vertical structures in the background.

(a) A grayscale image
(b) Its vertical gradients
(c) Its initial detections
Figure 6.2: Difficulties for initial detection because of many vertical structures in the background.


In a scene with little background structure, the initial detection performs quite well, both in the case of fir images and grayscale images. This was shown, for example, in figure 2.4 for fir images and figure 2.5 for grayscale images. This is also the reason the results displayed in figure 5.4 are at least somewhat acceptable for fir images. Especially at low temperatures, many background objects have a low intensity value because they are cool, while pedestrians have a high intensity value because they are warm. So there is no strong gradient response from background structures. This increases the performance of the initial detection of pedestrians.


Other difficulties for the initial detection are that pedestrians are not always perfectly vertical structures, pedestrians change shape when they move, and pedestrians may be carrying or pushing objects, for example a backpack or a stroller. Figure 6.3 shows an example situation. The initial detection routines described in this work were not specifically adapted to these situations, although they are set up to be quite flexible with respect to the shape of the objects they detect.

Figure 6.3: Pedestrian pushing a stroller.

Even more complications arise when there are groups of pedestrians present in the image. Groups may contain an arbitrary number of pedestrians which may be occluding each other in an arbitrary number of ways. An example of this is shown in figure 6.4. When pedestrians appear as one single bright blob in fir images, as shown in figure 6.4, it is not possible to segment them with the initial detection routines described in this work. The initial detection routines described in this work are designed for detecting single pedestrians only.

Figure 6.4: A group of pedestrians in a fir image.

A solution to the problems with the initial detection routines is to omit the initial detection altogether by scanning a classifier through the image at every position and scale, as described in section 3.6. The results of this method are discussed in section 6.1.2.
The threshold parameters applied in algorithms 2.5, 3, and 4 adapt the initial detection routines for fir images to a specific temperature range. For example, at a higher temperature, the threshold value is lowered because the gradient magnitude response is weaker at higher temperatures. For grayscale images, these threshold parameters are set to a fixed value for images recorded in daylight conditions.
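As an illustration of this adaptation, consider the following minimal sketch (the anchor temperatures and threshold values are invented for illustration; the camera-specific values are deliberately not published in this work):

def gradient_threshold(outside_temp_c,
                       anchors=((5, 40.0), (15, 28.0), (25, 18.0))):
    """Linearly interpolate a gradient-magnitude threshold between
    per-temperature anchor points: warmer scenes give weaker gradients,
    so the threshold is lowered.  Anchor values are purely illustrative."""
    temps = [t for t, _ in anchors]
    vals = [v for _, v in anchors]
    if outside_temp_c <= temps[0]:
        return vals[0]
    if outside_temp_c >= temps[-1]:
        return vals[-1]
    for (t0, v0), (t1, v1) in zip(anchors, anchors[1:]):
        if t0 <= outside_temp_c <= t1:
            w = (outside_temp_c - t0) / (t1 - t0)
            return v0 + w * (v1 - v0)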

6.1.2 Classification

Figures 5.11, 5.12, 5.13 and 5.14, 5.15, 5.16 for fir images, and figures 5.19, 5.20, 5.21 for grayscale images show that the classification results are very good in general. The classifier/image feature combinations achieve a high true positive rate at a low false positive rate. However, the results should be considered in the context of the complete detection system. In the case of fir images, the initial detection routines may generate up to 12 times as many negative detections as they generate positive detections. This is shown in figure 5.5. In the case of grayscale images, the classifier is scanned through the whole image at every position and scale. With such a large number of negative examples, the absolute number of false positive detections per second may still be high.
As figures 5.11, 5.12, 5.13 and figures 5.14, 5.15, 5.16 show, all classifier/feature combinations perform better on data from 5° than on data from 15°. Better means that the ROC curve lies higher and more to the left. The reason for this is that at 5°, the contrast between the pedestrian and the background is larger than at 15°. This means that at 5°, there is a better gradient response than at 15°. All classification features use gradient information, which explains why the results are better at 5°.
Figures 5.11, 5.12, 5.13 and figures 5.14, 5.15, 5.16 show that for fir images, the classification results of the histograms of gradients and gradient orientations features are generally the best for both support vector machines and neural networks, at both 5° and 15°. Second best are the rectangle features. Last are the histograms of orientations features. This shows that the magnitude of the gradient in combination with the orientation of the gradient performs better than the orientation of the gradient alone. In addition, a feature set as large as the rectangle feature set (1140 features) is not required for successful classification; the smaller feature set of histograms of gradient magnitudes and gradient orientations (448 features) performs better.

From figures 5.19, 5.20, and 5.21 it becomes clear that in the case of grayscale images, the classification results of the histograms of orientations features are comparable to the results of the rectangle features.
By comparing figures 5.11, 5.12, 5.13 to figures 5.14, 5.15, 5.16 and by looking at figures 5.19, 5.20, 5.21, it becomes clear that the results of the support vector machine classification are generally a bit better than the results of the neural network classification. However, the forward propagation speed of neural networks is much higher than the forward propagation speed of support vector machines. This can be seen in figure 5.18.
Comparing figures 5.11, 5.12, 5.13 and figures 5.14, 5.15, 5.16 to figures 5.19, 5.20, 5.21 shows that the classification results on fir images are comparable to the classification results on grayscale images. These figures should not be compared directly, because the training data for the fir data is generated from initial detection routines while the training data for the grayscale data is generated from ground truth data using a bootstrapping method.
The classification results on front/back views of pedestrians are comparable to the results on side views of pedestrians. This becomes clear by looking at figures 5.11, 5.12, 5.13 and figures 5.14, 5.15, 5.16 for fir images and figures 5.19, 5.20, 5.21 for grayscale images. In some figures, the front view results are better, while in other figures the side view results are better. This is remarkable because the class of side views of pedestrians is less well defined than the class of front/back views of pedestrians. The side view class contains pedestrians oriented to the left and to the right, while the front view class contains pedestrians of a homogeneous orientation. Apparently, in the case of side view classification, the classifier has enough representational capability to learn both orientations.
Section 5.2 mentions that a minimum region of interest size of 20x40 pixels is used for fir images and a minimum size of 24x48 pixels is used for grayscale images. Below these minimum image sizes there is too little feature information for calculating the histograms of gradient orientations and gradient magnitudes, and the classification performance strongly decreases. In general, it is better to use higher resolution images for object detection. For pedestrian detection, a minimum image size of 640x480 pixels is recommendable.
For the detection system described in section 3.6, which scans a classifier through the image at every scale and location, the classification results from figures 5.11, 5.12, 5.13 and figures 5.14, 5.15, 5.16 for fir images and figures 5.19, 5.20, 5.21 for grayscale images are of importance. The false positive rates of these classifiers are very low. However, with 4863 classifications per frame at 20 frames per second, this still means a high average number of false positive detections per second. For a real production system, this is unacceptable. One possible solution to this problem is to integrate detections over time by tracking the positively classified object: only when an object is classified as a pedestrian in successive frames is it considered a positive detection. Unfortunately, in practice it appears that false detections may be persistent in time. An example of this is shown in figure 6.5.

(a) A false positive classification
(b) The same false positive 10 frames later
Figure 6.5: Persistence of false positives over time.


Another issue concerning scanning a classifier through the image at every scale and location is computation time. The computation time consists of the feature calculation time (including preprocessing) and the time for the classifier to perform forward propagation. As can be seen from figure 5.17 and figure 5.18, the combination of orientations features with a neural network classifier has a computation time of 0.5 milliseconds for feature calculation plus 0.2 milliseconds for forward propagation, i.e., 0.7 milliseconds. With 4863 classifications per frame, this gives a computation time of 6808.2 milliseconds per frame (2 x 3404.1 milliseconds for front/back view and side view classification). This is clearly too slow for a real-time system, although there is potential for optimization. Still, despite these two drawbacks, this classifier based system in combination with the tracker which integrates classification outputs through time is the best performing system.
By comparing figure 5.22 to figure 5.12 and figure 5.15, it becomes clear that the component based classifier performs better than the full-body classifier. The combined answer of a set of classifiers, each trained for a part of a pedestrian, has better discriminative capabilities than one full-body classifier.
Figure 5.17 shows that the rectangle features are the fastest to calculate. All feature types take less than a millisecond to calculate for one region of interest. Figure 5.18 shows the forward propagation speed of the classifier/image feature combinations. Neural networks take much less time than support vector machines. The orientations features take the least time in forward propagation. This is expected: the orientations feature vector consists of fewer features (224) than the gradients and orientations feature vector (448 features) and the rectangle gradients feature vector (1140 features).
Figures 5.23, 5.24, and 5.25 show the results of feature selection on the classification performance. The classification results of feature selection with a PCA are approximately equal to the classification results on the full set of features. This can be seen by comparing figure 5.23 to figure 5.11. This shows that the full feature set is not required for an optimal classification result.
As figure 5.24 shows, the results of Adaboost are somewhat disappointing, although it achieves its performance with a small subset of features. The reason for the disappointing results is that all types of features used are generated by gradient filters of very small size compared to the whole region of interest containing the pedestrian. Adaboost searches for small feature subsets which are characteristic for the object to be classified. There are no single key features which are characteristic for a pedestrian; it is a larger set of features which makes it possible to achieve good classification performance.
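For reference, the boosting loop with single-feature threshold stumps as weak learners can be sketched as follows (a generic discrete Adaboost simplification, not the exact sub-feature handling from section 3.5.2):

import numpy as np

def adaboost_stumps(X, y, rounds=20):
    """Discrete AdaBoost with one-feature threshold stumps.
    X: (n_samples, n_features), y: labels in {-1, +1}.
    Returns a list of (feature, threshold, polarity, alpha)."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                 # example weights
    model = []
    for _ in range(rounds):
        best = None
        for j in range(d):                  # pick the best weighted stump
            for t in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = np.where(pol * (X[:, j] - t) >= 0, 1, -1)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, t, pol, pred)
        err, j, t, pol, pred = best
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)      # reweight: boost misclassified
        w /= w.sum()
        model.append((j, t, pol, alpha))
    return model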
This is what happens in feature selection with multi-objective optimization: a larger subset of features, selected from the complete set of features, is selected at once. As can be seen from figure 5.25, this gives better results than Adaboost. A disadvantage of using multi-objective optimization for feature selection is that it is almost prohibitively slow. For each individual of each generation, a network has to be trained. Training one support vector machine is slow because of the grid search required to select the training parameters. Training a neural network is slow because of the many iterations required for gradient descent.
Figures 5.26 and 5.27 show the results of the classifier optimization on the forward propagation time through the network. From figure 5.26(a) it becomes clear that the support vector reduction results in a classifier with a much smaller number of support vectors. A reduction of 95% can be achieved without a loss of classification performance. Figure 5.26(b) compares the forward propagation speed of Adaboost to a neural network trained on all features. Adaboost is much faster than a classifier trained on the full set of features. Figure 5.27(a) compares the forward propagation speed of a support vector machine trained on all features to a support vector machine trained on a reduced vector set created with principal component analysis. It is obvious that as the number of features gets smaller, the forward propagation speed increases. Figure 5.27(b) compares the forward propagation speed of a support vector machine trained with feature selection based on multi-objective optimization to one trained on the full set of features. Although the feature set selected with multi-objective optimization is smaller (116 features) than the full set of features (224 features), the forward propagation speed is slower.
The reason for this is that the original network parameters $C$ and $\gamma$ were used for training the support vector machine. These are not the optimal values for the network trained on the reduced feature set and resulted in a large set of support vectors.
Of the feature selection methods, the PCA feature selection method gives the best classification results. It is also by far the easiest and fastest feature selection method to apply. The PCA is therefore strongly preferable over Adaboost and multi-objective optimization.
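A minimal sketch of this kind of PCA feature reduction reads as follows (the number of retained components is an illustrative assumption; section 3.5.1 describes the variant actually evaluated):

import numpy as np

def pca_reduce(X, n_components=50):
    """Project feature vectors onto the principal components of the
    training data.  X: (n_samples, n_features).  Returns the projected
    data and a function that projects new feature vectors the same way."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # eigenvectors of the covariance matrix via SVD, sorted by variance
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    components = vt[:n_components]
    project = lambda v: (v - mean) @ components.T
    return Xc @ components.T, project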
One general limitation of neural networks and support vector machines is that both types of classifiers are black boxes. It cannot be verified that a trained network produces a correct output for an unseen example. This should be taken into consideration when applying a system using a classifier in a production system for braking a car in the case of a possible collision. There is never a guarantee that the system will make the correct decision.

6.1.3 Tracking

It becomes clear by looking at figure 5.28 and section 5.3.2 that the tracking methods which use image features for tracking (the Hausdorff tracker, the Mean Shift tracker, and the Condensation tracker) perform poorly on pedestrian images. The integration of initial detections and classification outputs through time performs better; these trackers do not operate on image features directly. There are two conditions which must be satisfied for a tracker to be able to track an object based on its image features:

- the shape of the object should not change much from one frame to the other, so the distribution of image features does not change much from frame to frame,
- there should be a motion model available for estimating the movement the object makes.

In the case of tracking pedestrians from a moving car, usually neither condition is satisfied. First, the shape of a pedestrian changes through its own movement, and its size changes through the movement of the car. Second, a reliable motion model of a pedestrian is not available: it is not always the case that a pedestrian is moving. In addition, because of the possible movement of the car, a pedestrian may be moving in the image even when it is not moving itself.
The Hausdorff tracker is based on calculating, at a target location, the Hausdorff distance of a set of image features to the image features of the tracker model. The Hausdorff tracker allows for some degree of discrepancy between the model set of features and the target set of features by randomly removing model features and randomly adding target features. It also allows for some degree of scaling of the object by calculating the target set of features at different scales. Because of the strong scaling of pedestrians caused by the motion of the car and the possible motion of the pedestrian (arm movement, orientation change), the Hausdorff tracker cannot reliably keep track of the pedestrian. What usually happens is that some time after the tracker is started, it locks on to a part of the pedestrian before it loses track of the pedestrian. An example of this for fir images is shown in figure 6.6. In addition, because there exists no reliable motion model of a pedestrian, the Kalman filter used in the Hausdorff tracker cannot accurately predict the location of the pedestrian in the next frame. This is what causes the tracker to lose track of its target.
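For reference, the directed Hausdorff distance underlying this matching can be sketched as follows (the point sets are, for example, edge pixel coordinates; the randomized partial matching described above is omitted):

import numpy as np

def directed_hausdorff(model_pts, target_pts):
    """h(A, B) = max over a in A of the distance to the nearest b in B.
    model_pts, target_pts: (n, 2) arrays of feature point coordinates."""
    diff = model_pts[:, None, :] - target_pts[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=2))     # pairwise distances
    return dists.min(axis=1).max()               # nearest neighbour, worst case

def hausdorff(a, b):
    """Symmetric Hausdorff distance: max of both directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))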

(a) On a positive classification, the tracker is started
(b) Tracking locks on to a part of the pedestrian
Figure 6.6: Example of the Hausdorff tracker in fir images.

To demonstrate that the tracker can in principle be used for tracking objects from a moving camera, it is applied to tracking cars. As can be seen in figure 6.7 and figure 6.8, the Hausdorff tracker can successfully track passing cars in fir images and grayscale images. Cars are easier to track than pedestrians for the following reasons: in contrast to pedestrians, cars do not change shape when they move; the relative speed difference between the camera and a moving car is much smaller than the speed difference between the camera and the pedestrian, so there is much less scaling of the object to be tracked; the resolution of cars in the image is usually higher than the resolution of pedestrians in the image; and the position and scale of a passing car can be estimated more accurately than the position and scale of a pedestrian.
(a) frame 1
(b) frame 100
(c) frame 170
Figure 6.7: Tracking a car in fir images with the Hausdorff tracker.

The Condensation tracker has related problems. It contains a set of fixed shapes of pedestrians which it uses for comparing target locations to. Because of the strong scaling of pedestrians caused by the motion of the car and the low resolution of the images, the comparison of the fixed shapes with target locations suffers from aliasing problems.
In addition, because three dimensions are tracked (horizontal position, vertical position, and scale), the number of samples needed in the Condensation algorithm gets prohibitively large. This effect is increased by the absence of a motion model: if there is no motion model which moves the samples somewhat in the right direction, an even larger number of randomly moving samples is required to achieve the same performance. If a large number of samples is used for tracking in three dimensions, the tracker gets too slow because of the time it takes to evaluate all the samples. If the number of samples is reduced to improve efficiency, the tracker rapidly loses the object it is tracking.

(a) frame 1
(b) frame 100
(c) frame 185
Figure 6.8: Tracking a car in grayscale images with the Hausdorff tracker.
The mean shift tracker is originally designed for tracking objects in color images. The reason for this is that 24-bit color images contain more information for tracking than 8-bit grayscale images. The distinction in histograms between the tracker model and the target object is usually not as large in 8-bit intensity images as it is in 24-bit color images. For this reason, it is difficult for the mean shift tracker to track objects in the low contrast 8-bit intensity images. Although the mean shift tracker can handle changes in scale of the object it tracks by increasing or decreasing the number of pixels it calculates a histogram over, it cannot handle large changes in the shape of the object it is tracking. The mean shift tracker locks on to a part of the object it is tracking and eventually loses track.
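The histogram comparison underlying the mean shift tracker can be sketched with the Bhattacharyya coefficient as follows (the bin count is an illustrative assumption, and the kernel weighting from section 4.2 is omitted):

import numpy as np

def intensity_histogram(patch, bins=32):
    """Normalized intensity histogram of an 8-bit image region."""
    h, _ = np.histogram(patch, bins=bins, range=(0, 256))
    return h / max(h.sum(), 1)

def bhattacharyya(p, q):
    """Similarity in [0, 1] between two normalized histograms; the mean
    shift tracker seeks the candidate location maximizing this value."""
    return float(np.sum(np.sqrt(p * q)))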
The trackers which integrate initial detection and classification outputs over time perform better than the Hausdorff tracker, the Condensation tracker, and the Mean shift tracker, which are based on image features. The advantage of the trackers which integrate initial detection outputs and classification outputs over time is that the initial detection routines and the classifiers which they are based on are designed and trained, respectively, for detecting pedestrians in a variety of shapes. The disadvantage of these trackers is that they cannot operate autonomously: they require the initial detection routines and the classification routines, respectively, to run.
The performance of the tracker which integrates initial detections through time is bounded by the performance of the initial detection routines. As figure 5.4 and figure 5.8 show, the performance of the initial detection routines applied to fir images is reasonable, and the performance of the initial detection routines applied to grayscale images is unacceptable. This severely limits the performance of the tracker which integrates initial detections over time. Much better performing is the tracker which integrates classification outputs over time. By interpreting the normalized output of the classifier at every position and scale in the image as a probability, the object position in the next frame can be estimated by locating the peak in classification output around the previous location and scale of the object to be tracked. Because the classification results are good on both fir images and grayscale images, the tracker which integrates classification outputs through time also performs well. One limitation of this method is that classification outputs are not smooth in time. This results in sudden changes in scale of the region of interest around the object to be tracked. A partial solution to this problem is to apply an alpha-beta filter on the coordinates of the region of interest output by the tracker, as in the sketch below. This is done for visualization purposes only; internally, the tracker stores the original region of interest.
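A minimal alpha-beta filter for one region of interest coordinate reads as follows (the gain values are illustrative assumptions; one such filter runs per box coordinate):

class AlphaBetaFilter:
    """Smooths one coordinate (e.g. ROI x, y, width, or height) with
    the classic alpha-beta predictor/corrector."""
    def __init__(self, x0, alpha=0.5, beta=0.1, dt=1.0):
        self.x, self.v = float(x0), 0.0
        self.alpha, self.beta, self.dt = alpha, beta, dt

    def update(self, measurement):
        # predict assuming constant velocity, then correct with residual
        x_pred = self.x + self.v * self.dt
        r = measurement - x_pred
        self.x = x_pred + self.alpha * r
        self.v = self.v + (self.beta / self.dt) * r
        return self.x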

6.1.4 Complete system performance

Although it seems attractive from a design point of view to have a system which performs initial detection to limit processing to promising parts of the image, performs classification on the output of the initial detection, and finally performs tracking on the positive classifications to stabilize detections over time, this does not work in practice. If the initial detection routines do not detect a large part of the pedestrians, the classifier and tracker never get the chance to process these initially undetected pedestrians. The system which classifies the whole image at every position and scale, described in section 3.6, combined with the tracker described in section 4.5 which uses classification outputs for tracking, performs much better.

When evaluating the performance of the pedestrian detection system, the system is usually compared to how well humans can detect pedestrians in these images. A simple pattern recognition system as described in this work has some severe limitations compared to the human vision system:

- Humans use knowledge about the world in object detection. For example, when a human sees a vertical object standing along the road at a large distance, the hypothesis is made that it is a pedestrian. This can be confirmed as soon as the object is near enough. The pedestrian detection system maintains no hypotheses about objects, but detects pedestrians as soon as they are large enough to be detected, without considering hypotheses made in the past.
- Humans can detect motion while they are moving themselves, for example when driving a car. Motion detection methods in computer vision, for example optic flow, cannot solve this problem well.
- Humans have no problem assigning object parts or object features to an object. In image processing, simple heuristics or exhaustive search is required to assign object parts or object features to an object.

The real-time constraint of the system is satisfied quite well. When combining the initial detection routine based on vertical gradients with a support vector machine trained on histograms of gradients and orientations features and the support vector reduction algorithm from section 3.4.3, and assuming the initial detection delivers 20 regions of interest per frame, the processing time per frame is 7 milliseconds + 20 x (0.7 + 0.2) milliseconds = 25 milliseconds on a 1470 MHz AMD Athlon processor. Tracking is not measured in this case because the computation times of the initial detection based tracker and the tracker based on classification outputs are almost zero. This gives a frame rate of 40 frames per second. On a 1 GHz processor, this gives approximately a frame rate of 27 frames per second.

6.2 Further work

To improve the initial detection, the use of distance information would bring the most progress. Usually, pedestrians are closer to the camera than background structures. This can be used in segmentation. Distance information is available from stereo cameras or from distance measurement sensors. Much work has been done on the use of stereo cameras in pedestrian detection [42], [45], and [20]. In addition, the use of distance information makes classification easier because the class of negative examples is limited to non-pedestrian objects at a certain distance from the camera.

When the car is not moving because it is waiting for a traffic light, the number of pedestrians crossing the road in front of the car in the image may be large. In this case, motion information from optic flow or background subtraction can be used for initial detection.
For a human, pedestrians are easier to detect in color images than in grayscale images. Color information may also be helpful for a pedestrian detection system. For example, the gradients calculated from color images may be stronger than the gradients calculated from grayscale images.
Another possible improvement for initial detection is, instead of scanning a vertical edge detector through the image at every position and scale as described in section 2.2.2, to have the scanning window test whether there is any vertical gradient response at the current location and scale. The advantage of such a method is that it generates more regions of interest than the initial detection routine which scans a vertical edge detector through the image at every position and scale. In addition, the number of required classifications is reduced compared to the method which scans a classifier through the image at every scale and position.
The most important improvement for the classification would be to reduce the false positive rate to zero while keeping the true positive rate constant. It is, however, unlikely to find an image feature/classifier combination which achieves a zero false positive rate while maintaining a high true positive rate on the low resolution images used in this work. The largest performance improvement can be achieved by using high resolution, high quality (with respect to contrast, reflections, overblending) image data. It is recommended to use a resolution of at least 640x480 pixels. First experiments indicate that this leads to better classification performance. In addition, a very large set of several thousands of pedestrian and non-pedestrian images should be available for the training and the optimization of the classification to get good generalization performance. To improve classification performance, another possibility would be to test other image features for classification, for example Gabor features, although they may be difficult to apply in real-time.
In order to reliably track moving pedestrians from a fast moving camera in real-time, a tracker is required which is flexible with respect to shape change, large translation, and scaling of the object it tracks. The most promising seems to be a shape adaptive method as applied in [2]. This method has the disadvantages that it cannot be used for very small pedestrians because of aliasing problems, and that it is difficult to apply in real-time. The use of higher resolution images makes the use of a method like this more feasible.
The problem of detecting occluded pedestrians or pedestrians in a group occluding each other is an even more difficult problem than detecting single pedestrians. One approach that could segment an occluded pedestrian is a shape based segmentation method with active contour models [9], [12]. This approach cannot be applied for general segmentation in images, however: the initial contour must be around the object to be segmented. Also, it is not possible to apply this method in real-time at 20 frames per second on moderate hardware. Shape based segmentation with active contour models is more suitable for segmentation in medical images.


It appeared that the cycle of initial detection, classification, and tracking is not very successful for pedestrian detection. It would be interesting to adapt the algorithms to other problems like car detection or face detection to compare the results to pedestrian detection.

6.3 Summary and conclusions

This work describes a system for detecting pedestrians in fir images and grayscale images. The pedestrian detection system consists of three main components:

- an initial detection component which locates regions of interest in the image,
- a classification component which classifies the regions of interest from the initial detection routines as pedestrians or non-pedestrians,
- and a tracking component which keeps track of objects through successive frames.

The initial detection component consists of the following three methods:

- a temperature based method for fir images which uses image brightness for segmentation,
- a method based on clustering vertical gradients for fir images and grayscale images,
- and a method based on scanning a vertical edge detector through the image for fir images and grayscale images.


The classification component applies neural networks and support vector machines in combination with the following image features for fir images and grayscale images:

- rectangle features,
- image gradients calculated with a Sobel-like filter,
- and orientations of the image gradients.

The following optimization methods are applied to improve the classification performance and the forward propagation speed of the classification:

- reduction of the number of support vectors,
- and feature selection with:
  - principal component analysis,
  - Adaboost,
  - multi-objective optimization.

The tracking component consists of the following tracking methods:

- a tracker based on the Hausdorff distance,
- a mean shift tracker,
- a Condensation tracker,
- a tracker based on integrating initial detections through time,
- and a tracker based on integrating classification outputs through time.


The most important contributions of this work are:

- the development of three new initial detection routines for pedestrian detection, one based on intensity information, the other two based on gradient information,
- the systematic evaluation of classifier/image feature combinations for pedestrian classification in fir and grayscale images,
- an evaluation of the following classifier optimization algorithms on classification performance and forward propagation speed of the classifiers:
  - a support vector reduction algorithm,
  - principal component analysis for feature selection,
  - Adaboost for feature selection,
  - and multi-objective optimization for feature selection,
- the evaluation of three existing tracking algorithms for tracking pedestrians in real-time in fir and grayscale images:
  - a tracker based on the Hausdorff distance,
  - a Mean shift tracker,
  - a tracker based on Condensation,
- the development of two tracking algorithms which are based on integrating classification outputs and initial detection outputs over time. The tracker which integrates classification outputs over time is a modification of the tracker based on Condensation from [32]; the difference is that the estimation, and the confirmation of the estimation from the image data, are performed in a different way.


From this work, the most important conclusions are:

- The initial detection methods proposed in this work give reasonable results on fir images. The results on grayscale images are not acceptable.
- The classification methods deliver good results on fir images as well as on grayscale images. However, although the false positive rate is low, it is still too high for a real system.
- The classification results on fir images are comparable to the classification results on grayscale images.
- The support vector machine classification results are generally a bit better than the neural network classification results.
- The optimization methods for classification generally improve the classification performance and the classification forward propagation speed. The component based classification gives the best results.
- The system which scans a classifier through the image at every position and scale performs better than the initial detection/classification combination. However, with approximately 5000 classifications per image this method is not applicable in real-time, and its false positive rate is too high for use in a real system.
- The tracking methods which track image features are not successful in tracking pedestrians. They are able to track objects in simpler situations, like cars in grayscale images and fir images from a moving camera. The tracker which integrates classification outputs over time performs much better.
- Integrating detections over time for the stabilization of positive detections does not improve detection results, because false positives are to a large degree persistent in time.


- The applied image resolution of 320x240 pixels is too low to deliver optimal results, especially for classification. It is recommendable to use a resolution of at least 640x480 pixels.

Bibliography

[1] A. Baumberg and D. Hogg. Learning flexible models from image sequences. In ECCV (1), pages 299–308, 1994.

[2] A.M. Baumberg and D.C. Hogg. An efficient method for contour tracking using active shape models. Technical report.

[3] M. Bertozzi, A. Broggi, A. Fascioli, A. Tibaldi, R. Chapuis, and F. Chausse. Pedestrian localization and tracking system with Kalman filtering. IEEE Intelligent Vehicles Symposium 2004, pages 584–589, 2004.

[4] M. Bertozzi, A. Broggi, P. Grisleri, A. Tibaldi, T. Graf, and M. Meinecke. Pedestrian detection in infrared images. Proceedings of the IEEE Intelligent Vehicles Symposium 2003, pages 662–667, 2003.

[5] D. Beymer, P. McLauchlan, B. Coifman, and J. Malik. A real-time computer vision system for measuring traffic parameters. IEEE Conference on Computer Vision and Pattern Recognition, pages 495–501, 1997.

[6] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[7] A. Broggi, M. Bertozzi, A. Fascioli, and M. Sechi. Shape-based pedestrian detection. Proceedings of the IEEE Intelligent Vehicles Symposium 2000, pages 215–220, 2000.

[8] C.J.C. Burges. Simplified support vector decision rules. International Conference on Machine Learning, pages 71–77, 1996.

[9] T. Chan and W. Zhu. Level set based shape prior segmentation. Technical Report 03-66, Computational Applied Mathematics, UCLA, Los Angeles, 2003.

[10] T.F. Chan and L.M. Vese. Active contours without edges. IEEE Transactions on Image Processing, 10(2):266–277, 2001.

[11] D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking of non-rigid objects using mean shift. IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 142–151, 2000.

[12] D. Cremers, T. Kohlberger, and C. Schnörr. Nonlinear shape statistics in Mumford–Shah based segmentation. In A. Heyden et al., editors, European Conference on Computer Vision (ECCV), volume 2351 of LNCS, pages 93–108, Copenhagen, May 2002. Springer.

[13] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

[14] C. Curio, J. Edelbrunner, T. Kalinke, C. Tzomakas, and W. von Seelen. Walking pedestrian recognition. IEEE Transactions on Intelligent Transportation Systems, 1(3), 2000.

[15] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. IEEE Conference on Computer Vision and Pattern Recognition, 1:886–893, 2005.

[16] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm. IEEE Transactions on Evolutionary Computation, 6(2):182–197, 2002.

[17] C. Emmanouilidis, A. Hunter, and J. MacIntyre. A multiobjective evolutionary setting for feature selection and a commonality-based crossover operator. Proceedings of the 2000 Congress on Evolutionary Computation, pages 309–316, 2000.

[18] P.F. Felzenszwalb and D.P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.

[19] Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. Proceedings of the International Conference on Machine Learning (ICML), pages 148–156, 1996.

[20] D.M. Gavrila, J. Giebel, and S. Munder. Vision-based pedestrian detection: The PROTECTOR system. Proceedings of the IEEE Intelligent Vehicles Symposium (IV 2004), 2004.

[21] D.M. Gavrila and V. Philomin. Real-time object detection for "smart" vehicles. Proceedings of the IEEE International Conference on Computer Vision, 1:87–93, 1999.

[22] C. Goerick, D. Noll, and M. Werner. Artificial neural networks in real-time car detection and tracking applications. Pattern Recognition Letters, 17:335–343, 1996.

[23] B. Heisele and C. Wöhler. Motion-based recognition of pedestrians. Proceedings of the Fourteenth International Conference on Pattern Recognition, 2:1325–1330, 1998.

[24] B.K.P. Horn and B.G. Schunck. Determining optical flow. MIT A.I. Memo, (572), 1980.

[25] M. Isard and A. Blake. Condensation - conditional density propagation for visual tracking. International Journal of Computer Vision, 29:5–28, 1998.

[26] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. Pages 259–269, 1987.

[27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

[28] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(4):349–361, 2001.

[29] D. Mumford and J. Shah. Optimal approximation by piecewise smooth functions and associated variational problems. Communications on Pure and Applied Mathematics, 42:577–685, 1989.

[30] H. Nanda and L. Davis. Probabilistic template based pedestrian detection in infrared videos. IEEE Intelligent Vehicles Symposium, 2002.

[31] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio. Pedestrian detection using wavelet templates. Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 193–199, 1997.

[32] C. Papageorgiou. A trainable system for object detection in images and video sequences. PhD thesis, Massachusetts Institute of Technology, 2000.

[33] M. Pei, E.D. Goodman, and W.F. Punch. Feature extraction using genetic algorithms. Technical report, Michigan State University, June 1997.

[34] V. Philomin, R. Duraiswami, and L. Davis. Pedestrian tracking from a moving vehicle. Proceedings of the IEEE Intelligent Vehicles Symposium 2000, pages 350–355, 2000.

[35] H.A. Rowley, S. Baluja, and T. Kanade. Rotation invariant neural network-based face detection. IEEE Conference on Computer Vision and Pattern Recognition, pages 38–44, 1998.

[36] S. Romdhani, P. Torr, B. Schölkopf, and A. Blake. Computationally efficient face detection. Proceedings of the 8th International Conference on Computer Vision, 1:695–700, 2001.

[37] A. Shashua, Y. Gdalyahu, and G. Hayun. Pedestrian detection for driving assistance systems: Single-frame classification and system level performance. Proceedings of the IEEE Intelligent Vehicles Symposium (IV 2004), 2004.

[38] M. Szarvas, A. Yoshizawa, M. Yamamoto, and J. Ogata. Pedestrian detection with convolutional neural networks. Proceedings of the IEEE Intelligent Vehicles Symposium 2005, pages 224–229, 2005.

[39] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1:511–518, 2001.

[40] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. Proceedings of the International Conference on Computer Vision (ICCV), 2003.

[41] M. Werner. Objektverfolgung und Objekterkennung mittels der partiellen Hausdorff-Distanz, 1998.

[42] C. Wöhler, J.K. Anlauf, T. Pörtner, and U. Franke. A time delay neural network algorithm for real-time pedestrian recognition. IEEE International Conference on Intelligent Vehicles, pages 247–251, 1998.

[43] F. Xu, X. Liu, and K. Fujimura. Pedestrian detection and tracking with night vision. IEEE Transactions on Intelligent Transportation Systems, 6(1):63–71, 2005.

[44] J. Yang and V. Honavar. Feature subset selection using a genetic algorithm. IEEE Intelligent Systems, 13:44–49, 1998.

[45] L. Zhao and C.E. Thorpe. Stereo- and neural network-based pedestrian detection. IEEE Transactions on Intelligent Transportation Systems, 1(3):298–303, 2000.

Curriculum Vitae

20.02.1977         Born in Drachten

09.1989 - 06.1997  Atheneum with university-entrance diploma in Drachten

09.1997 - 11.2002  Studies in Artificial Intelligence at the University of Amsterdam

02.2003 - 04.2006  Research assistant, Institut für Neuroinformatik, Ruhr-Universität Bochum
