A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Fei-Fei Li
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Andrew Ng
This signature page was generated electronically upon submission of this dissertation in
electronic format. An original signed hard copy of the signature page is on file in
University Archives.
Abstract
Every year, 1.2 million people die in automobile accidents and up to 50 million are injured [1]. Many of these deaths are due to driver error and other preventable causes. Autonomous or highly aware cars have the potential to positively impact tens of millions of people. Building an autonomous car is not easy. Although the absolute number of traffic fatalities is tragically large, the failure rate of human driving is actually very small. A human driver makes a fatal mistake once in about 88 million miles [2].
As a co-founding member of the Stanford Racing Team, we have built several relevant prototypes of autonomous cars. These include Stanley, the winner of the 2005 DARPA Grand Challenge, and Junior, the car that took second place in the 2007 Urban Challenge. These prototypes demonstrate that autonomous vehicles can be successful in challenging environments. Nevertheless, reliable, cost-effective perception under uncertainty is a major challenge to the deployment of robotic cars in practice.
This dissertation presents selected perception technologies for autonomous driving in the context of Stanford's autonomous cars. We consider speed selection in response to terrain conditions, smooth road finding, improved visual feature optimization, and cost-effective car detection. Our work does not rely on manual engineering or even supervised machine learning. Rather, the car learns on its own, training itself without human teaching or labeling. We show that this "self-supervised" learning often meets or exceeds the performance of traditional methods. Furthermore, we believe self-supervised learning is the only approach with the potential to provide the very low failure rates necessary to improve on human driving performance.
Acknowledgments
Sebastian Thrun, Cliff Nass, Vaughan Pratt, Mike Montemerlo, Sven Strohband, Hendrik Dahlkamp, Gabe Hoffmann, Jesse Levinson, Alex Teichman, Gary Bradski, and Adrian Kaehler were principal collaborators on the Stanford Racing Team, shaped aspects of this work, and contributed other systems to the vehicle (e.g., path planning) that were invaluable to the completion of this dissertation.
We received generous financial support from DARPA, Volkswagen, Red Bull, Mohr Davidow Ventures, Google, Intel, Qualcomm, and Boeing. I held the David Cheriton Stanford Graduate Fellowship from 2005 to 2008 and the Qualcomm Innovation Fellowship in 2010.
The surgical data set in Chapter 4 is courtesy of Professor Greg Hager of Johns Hopkins
University.
Special thanks to my faculty committee: Sebastian Thrun (Adviser/Examiner/Reader),
Daphne Koller (Examiner), Andrew Ng (Examiner/Reader), Chris Gerdes (Examination
Chair), and Fei-Fei Li (Examiner/Reader).
Contents

Abstract iv
Acknowledgments v
1 Introduction 1
2 Speed Selection 6
  2.1 Abstract 6
  2.2 Introduction 7
  2.3 Related Work 8
  2.4 Speed Selection Algorithm 9
    2.4.1 Z Acceleration Mapping and Filtering 9
    2.4.2 Generating Speed Recommendations 12
    2.4.3 Supervised Machine Learning 13
  2.5 Experimental Results 15
    2.5.1 Qualitative Analysis 15
    2.5.2 Quantitative Analysis 17
    2.5.3 Results from the 2005 Grand Challenge 20
  2.6 Discussion 21
  3.4 Acquiring A 3-D Point Cloud 26
  3.5 Surface Classification Function 28
  3.6 Data Labeling 31
    3.6.1 Raw Data Filtering 31
    3.6.2 Calculating the Shock Index 32
  3.7 Learning the Model Parameters 33
  3.8 Experimental Results 34
    3.8.1 Sensitivity Versus Specificity 35
    3.8.2 Effect on the Vehicle 36
  3.9 Discussion 38
4 Feature Optimization 39
  4.1 Abstract 39
  4.2 Introduction 40
  4.3 Related Work 42
  4.4 Generating Sets of Patches 43
  4.5 Optimizing Higher-Level Features 44
  4.6 Experimental Analysis 45
    4.6.1 Data Sets 46
    4.6.2 Experimental Protocol 46
    4.6.3 Experimental Results 48
    4.6.4 SIFT Rotational Invariance 49
    4.6.5 HOG Parameters 50
  4.7 Discussion 51
  5.5.2 Feature Coding 67
  5.5.3 Geometric Constraints 69
  5.5.4 Learning Algorithm 70
  5.6 Experiments 70
    5.6.1 General Setup 70
    5.6.2 Data Collection and Preparation 71
    5.6.3 Ground Truth and Metrics 71
    5.6.4 Results 73
  5.7 Discussion 74
Bibliography 80
List of Tables

4.1 Learned HOG parameters for the cars and surgical data sets. 50
List of Figures

2.1 We find empirically that the relationship between perceived vertical acceleration amplitude and vehicle speed over a static obstacle can be approximated by a linear function in the operational region of interest. 10
2.2 Filtered vs. unfiltered IMU data. The 40-tap FIR band-pass filter extracts the excitation of the suspension system, in the 0.3 to 12 Hz band. This removes the offset of gravity, as terrain slope varies, and higher-frequency noise, such as engine system vibration, that is unrelated to oscillations in the vehicle's suspension. 11
2.3 The results of coordinate descent machine learning are shown. The learning generated the race parameters of 0.25 G and 1 mph/s, a good match to human driving. Note that the human often drives under the speed limit. This suggests the importance of our algorithm for desert driving. 14
2.4 Each point indicates a specific shock event under both the velocity controller (vertical axis) and speed limits alone (horizontal axis). There are very few high-shock events on this easy terrain. The velocity controller does not reduce shock or speed much. 17
2.5 Each point indicates a specific shock event under both the velocity controller (vertical axis) and speed limits alone (horizontal axis). There are many high-shock events on this difficult terrain. The velocity controller reduces shock and speed a great deal. 18
2.6 Here we show where the algorithm reduced speed during the 2005 Grand Challenge event. The velocity controller modified speed during 17.6% of the race. The height of the ellipses shown at the top is proportional to the maximum speed reduction for each 1/10th-mile segment. Numbers represent mile markers. Simulated terrain at the bottom approximates route elevation. 19
2.7 Here we plot the trade-off of completion time and shock on four desert routes. α is constant (0.25) and β is varied along each curve. With the single set of parameters learned in Section 2.4.3, the algorithm reduces shock by 50% to 80% while increasing completion time by 2.5% to 5%. 20
3.1 The left graphic shows a single laser scan. The right shows integration of these scans over time. 27
3.2 Pose estimation errors can generate false surfaces that are problematic for navigation. Shown here is a central laser ray traced over time, sensing a relatively flat surface. 27
3.3 The measured z-difference from two scans of flat terrain indicates that error increases linearly with the time between scans. This empirical analysis suggests the second term in Equation (3.4). 28
3.4 We find empirically that the relationship between perceived vertical acceleration and vehicle speed over a static obstacle can be approximated tightly by a linear function in the operational region of interest. 32
3.5 The true-positive/false-positive trade-off of our approach and two interpretations of prior work. The self-supervised method is significantly better for nearly any fixed acceptable false-positive rate. 35
3.6 The trade-off of completion time and shock experienced for our new self-supervised approach and the reactive method we used in the Grand Challenge. 37
4.1 An example of corner feature tracking with optical flow. The video sequence moves frame to frame from left to right. The center of each yellow box in the left-most (first) frame is a corner detected by [75, 16]. The green arrows show the motion of these corners into the next frame as determined with optical flow [57, 16]. This is not a prediction. The flow algorithm uses the next frame in the sequence for this calculation. The yellow boxes contain the patches extracted and saved for later optimization. The yellow path tails in the subsequent frames depict patch movement as estimated by the optical flow. In total, four patch sets are produced, each with four patches. They are shown in Figure 4.2. Our algorithm produces additional patches for these frames. We do not show them for a simpler illustration. 52
4.2 The extracted patch sets from Figure 4.1. The correspondence is not perfect, but satisfactory for good optimization and application-specific feature performance. 53
4.3 Four non-consecutive images from the surgical data set. The images are non-consecutive because the camera moves slowly, making the actual motion between frames small. 53
4.4 HOG feature optimization for the cars task. The features optimized for the cars task (blue line) outperform those optimized for the surgical task (red line). This verifies that our algorithm captures application-dependent feature invariances. The approximate theoretical best (light blue line) is only slightly better than our method. The approximate theoretical worst is reasonable for a 2-class problem. 54
4.5 HOG feature optimization for the surgical task. The features optimized for the surgical task (red line) outperform those optimized for the cars task (blue line). This verifies that our algorithm captures application-dependent feature invariances. The approximate theoretical best (light blue) is only slightly better than our method. The approximate theoretical worst is reasonable for a 20-class problem. Because this is a correspondence task on patches, reasonable performance for small numbers of features or images is not surprising. 55
4.6 Examples from the cars task that were misclassified with the surgical-SIFT but not the car-SIFT features. 56
4.7 SIFT feature optimization for the cars task. The features optimized for the cars task (blue line) outperform those optimized for the surgical task (red line) and the parameters provided by Lowe (green line). This verifies our claim that our algorithm captures application-dependent feature invariances. 57
4.8 SIFT feature optimization for the surgical task. The features optimized for the surgical task (red line) outperform those optimized for the cars task (blue line). This verifies our claim that our algorithm captures application-dependent feature invariances. Intuitively, we expect correspondence to work well even with a single known point. This justifies the good performance presented even with one training example. The performance of our surgical-optimized features is nearly identical to Lowe's (green line). Lowe's extensive optimization was geared toward matching (correspondence) problems. 58
4.9 The effect of rotation on our surgical-optimized SIFT features. Our optimization removed rotational invariance. This is correct given the invariances that are specific to the application. 59
5.2 A 3D point cloud example created by a single rotation of a 3D laser scanner. The method in [84] tracks objects through these point clouds, segments them, and classifies them using a learning algorithm. We project these classifications back into a camera frame and use them as training data for our vision-only algorithm. Nearly unlimited amounts of training data can be automatically produced. 65
5.3 Several examples of camera frames after labeling by the laser algorithm. These labels are the training data for our vision-only classifier. This training data is very inexpensive to acquire. Millions of examples can be obtained by simply driving the car around. 75
5.4 The 5,000 cluster centers, or bases, used for image classification, represented by their corresponding closest image patch. 76
5.5 Our algorithm updates a variable number of image vector elements per feature. The frequency of the number of updates is shown here. 76
5.6 Classification results from the vision algorithm. Car detection is excellent. The number of false positives is slightly higher with the camera than the laser despite the similar non-car class accuracy (Table 5.1). This is because the camera must consider more segmentations due to the absence of 3D. 77
5.7 Doubling the amount of camera training data cuts the error, relative to the laser's performance, by a fixed percentage. By adding additional labeled camera training data – which is automatic – we can achieve performance that is arbitrarily close to the laser's performance for object classification. 78
5.8 Histogram of accuracy as a function of segment size. Poor performance occurs on small segments, which imply occlusion or significant distance from the camera. 79
5.9 Accuracy as a function of random labeling errors. Significant amounts of random labeling error do not substantially degrade classification accuracy. 79
Chapter 1
Introduction
Traffic accidents are a major source of disability and mortality worldwide. Every year, 1.2 million people die and up to 50 million people are injured [1]. Autonomous or highly aware cars have the potential to reduce these numbers dramatically. However, building a self-driving car that exceeds human driving performance is not easy. While traffic injuries and deaths are large in number, they are relatively infrequent: in the United States, a human driver makes a fatal mistake only once in approximately 88 million miles [2]. As a co-founding member of Stanford's Racing Team, we have built several relevant prototypes of autonomous cars. These include Stanley [88], the winner of the 2005 DARPA Grand Challenge, and Junior [63], the car that took second place in the 2007 Urban Challenge. Stanley and Junior are shown in Figures 1.1 and 1.2. These prototypes demonstrate that autonomous vehicles can be successful in challenging environments.
Work on self-driving cars spans several decades [29, 66, 37, 11]. The DARPA Grand and Urban Challenge competitions [39, 40] offered a modern, uniform testing opportunity to examine the state of the art in autonomous cars. The Grand Challenge competitions, held in 2004 and 2005, focused on endurance driving across the desert, stationary obstacles, and road following. Performance in 2004 was disappointing due, in part, to widely varying expectations about the nature of the competition. In 2005, the competition was repeated on a different (but similarly challenging) route, and several teams successfully finished. In 2007, DARPA held a new challenge focused on urban environments. The problem statement expanded significantly to include much of what a typical human driver must do: for example, car detection, lane keeping, parking, and intersections were included. Important exclusions were high-speed (highway) driving, traffic lights, bicycles, pedestrians, and the ability to drive without a highly accurate GPS course skeleton. Several teams were successful, although reliability was not perfect despite the abridged definition of the problem. Nevertheless, it is remarkable how quickly the technology progressed. Today, the Google driving project is expanding the code base from these efforts [85].
Any mobile robot must be able to localize itself, perceive its environment, make decisions in response to those perceptions, and control actuators to move about [20, 78]. In many ways, autonomous cars are no different. Thus, many ideas from mobile robotics generally are directly applicable to robotic driving. Examples include GPS/IMU fusion with Kalman filters [86], map-based localization [28, 94, 42, 52, 67], and path planning based on trajectory scoring [46, 47, 66, 29, 37, 11, 20]. Actuator control for high-speed driving is different than for typical mobile robots and is very challenging. However, excellent solutions exist [83, 3].
However, general perception is unsolved for mobile robots and is the focus of major efforts within the research community. This dissertation focuses on perception for robotic cars. Perception is much more tractable within the context of autonomous driving, for a number of reasons: the number of object classes is smaller, the classes are more distinct, rules offer a strong prior on what objects may be where at any point in time, and expensive, high-quality laser sensing is appropriate. Nevertheless, perception is still very challenging due to the extremely low acceptable error rate. We focus on four driving-related perception topics in this dissertation: speed selection in response to terrain conditions [80], smooth road detection [81], visual feature optimization [82], and car detection. Each of these four topics is presented as a distinct chapter and is self-contained with its own literature review.

Figure 1.2: Junior took second place in the 2007 DARPA Urban Challenge.
A unifying theme throughout our work is the use of self-supervised machine learning. In self-supervised learning, the robot teaches itself automatically, without using human-generated labels. Self-supervised learning is different from other non-supervised learning methods. Unlike semi-supervised and self-taught learning, labeled data in the traditional sense is not needed. However, self-supervised learning is also not totally unsupervised, as in clustering or deep learning [51, 50]. Self-supervised learning uses multi-view coherence. Specifically, two different algorithms or sensors view the same data in a different way. Each algorithm or sensor brings a different strength to its view of the data. By using knowledge
Chapter 2
Speed Selection
2.1 Abstract
The mobile robotics community has traditionally addressed motion planning and navigation in terms of steering decisions. However, selecting the best speed is also important – beyond its relationship to stopping distance and lateral maneuverability. Consider a high-speed (35 mph) autonomous vehicle driving off-road through challenging desert terrain. The vehicle should drive slowly on terrain that poses substantial risk. However, it should not dawdle on safe terrain. In this chapter we address one aspect of risk – shock to the vehicle. We present an algorithm for trading off shock and speed in real time and without human intervention. The trade-off is optimized using supervised learning to match human driving. The learning process is essential due to the discontinuous and spatially correlated nature of the control problem – classical techniques do not directly apply. We evaluate performance over hundreds of miles of autonomous driving, including performance during the 2005 DARPA Grand Challenge. This approach was the deciding factor in our vehicle's speed for nearly 20% of the DARPA competition – more than any other constraint except the DARPA-imposed speed limits – and resulted in the fastest finishing time.
2.2 Introduction
In mobile robotics, motion planning and navigation have traditionally focused on steering decisions. This chapter presents speed decisions as another crucial part of planning – beyond the relationship of speed to obstacle avoidance concerns, such as stopping distance and lateral maneuverability. Consider a high-speed (35 mph) autonomous vehicle driving off-road through challenging desert terrain. We want the vehicle to drive slower on more dangerous terrain. However, we also want to minimize completion time. Thus, the robot must trade off speed and risk in real time. This is a natural process for human drivers, but it is not at all trivial to endow a robot with this ability.
We address this trade-off for one component of risk: the shock the vehicle experiences. Minimizing shock is important for several reasons. First, shock increases the risk of damage to the vehicle, its mechanical actuators, and its electronic components. Second, a key perceptive technology, laser range scanning, relies on accurate estimation of orientation; shock causes the vehicle to shake violently, making accurate estimates difficult. Third, shocks substantially reduce traction during oscillations. Finally, we demonstrate that shock is strongly correlated with speed and, independently, with subjectively difficult terrain. That is, minimizing shock implies slowing on challenging roads when necessary – a crucial behavior to mitigate risk to the vehicle.
Our algorithm uses the linear relationship between shock and speed, which we derive analytically. The algorithm has three states. First, the vehicle drives at the maximum allowed speed until a shock threshold is exceeded. Second, the vehicle slows immediately to bring itself within the shock threshold, using the relationship between speed and shock. Finally, the vehicle gradually accelerates. It returns to the first state, or to the second if the threshold is exceeded during acceleration.
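The three-state loop can be condensed into a single update function. This is our illustrative sketch, not Stanley's implementation: the default threshold (0.25 G) and acceleration rate (1 mph/s) echo the learned values reported in Section 2.4.3, the 5 mph floor is the clamp applied to the velocity plan, and the time step is arbitrary.

```python
def step_speed(v_cmd, v_max, shock, alpha=0.25, accel_rate=1.0, dt=0.1):
    """One control step. v_cmd and v_max in mph, shock in G, dt in seconds."""
    if shock > alpha:
        # State 2: slow immediately, exploiting the approximately linear
        # relationship between shock and speed.
        v_cmd = v_cmd * alpha / shock
    else:
        # States 1 and 3: hold the speed limit, or gradually accelerate
        # back toward it.
        v_cmd = min(v_max, v_cmd + accel_rate * dt)
    return max(v_cmd, 5.0)  # never command less than 5 mph
```

For example, a 0.5 G shock at 35 mph immediately halves the commanded speed to 17.5 mph, after which the vehicle creeps back up at the acceleration rate.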
To maximize safety, the vehicle must react to shock by slowing down immediately, resulting in a discontinuous speed command. Further, it should accelerate cautiously, because rough terrain tends to be clustered, a property we demonstrate experimentally. For these reasons, it is not possible to determine the optimal parameters using classical control techniques. The parameters cannot be tuned from any physical model of the system, because performance is best measured statistically, from driving experience.
Therefore, we generate the actual parameters for this algorithm – the shock threshold and acceleration rate – using supervised learning. The algorithm generates a reasonable match to the human teacher. Of course, humans are proactive, also using perception to make speed decisions. We use inertial-only cues because it is very difficult to assess terrain roughness with perceptive sensors: the accuracy required exceeds that for obstacle avoidance by a substantial margin because rough patches are often just a few centimeters in height. Still, experimental results indicate our reactive approach is very effective in practice.
Our algorithm is part of the software suite developed for Stanley, an autonomous vehicle entered by Stanford University in the 2005 DARPA Grand Challenge [27]. The Grand Challenge was a robot race organized by the U.S. Government. In 2005, vehicles had to navigate 132 miles of unrehearsed desert terrain. The algorithm described here played an important role in that event. Specifically, it was the deciding factor in our robot's speed for nearly 20% of the competition – more than any other constraint except the DARPA-imposed speed limit. It resulted in the fastest finishing time of any robot.
2.3 Related Work
Some work has explicitly considered speed as important for path planning. For example, [33] presents a method for trading off "progress" and "velocity" with regard to the number and type of obstacles present. That is, slower speeds are desirable in more cluttered environments; faster speeds are better for open spaces. This concept, applied to desert driving, is analogous to slowing for turns in the road and for discrete obstacles. Our vehicle exhibits those behaviors as well. However, they are separate from the algorithm we describe in this chapter.
Other entrants in the 2005 DARPA Grand Challenge realized the importance of adapting speed to rough terrain. For example, two teams from Carnegie Mellon University (CMU) used "preplanning" for speed selection [35, 90, 91, 65]. For months, members of the CMU teams collected extensive environment data in the general area where the competition was thought to be held. Once the precise route was revealed by DARPA, two hours before the race, the team preprogrammed speeds according to the previously collected data. The end result of the CMU approach is similar to ours. However, our fully online method requires neither human intervention nor prior knowledge of the terrain. The latter distinction is particularly important since desert terrain readily changes over time due to many factors, including weather.
Figure 2.1: We find empirically that the relationship between perceived vertical acceleration amplitude and vehicle speed over a static obstacle can be approximated by a linear function in the operational region of interest.
to be the sum of sinusoids of a large number of unknown spatial frequencies and amplitudes. Consider one component of this summation, with a spatial frequency of $\omega_s$ and an amplitude $A_{z,g,\omega_s}$. Due to motion of the vehicle, the frequency in time of $z_{g,\omega_s}$ is $\omega = v\omega_s$. By taking two derivatives, the acceleration, $\ddot{z}_{g,\omega_s}$, has an amplitude of $v^2 \omega_s^2 A_{z,g,\omega_s}$ and a frequency of $v\omega_s$. This is the acceleration input to the vehicle's tire.
To model the response of the suspension system, we use the quarter car model [34]. The tire is modeled as a connection between the axle and the road with a spring with stiffness $k_t$, and the suspension system is modeled as a connection between the axle and the vehicle's main body, using a spring with stiffness $k_s$ and a damper with coefficient $c_s$. Using this model, the amplitude of acceleration of the main body of the vehicle can be easily approximated, by assuming infinitely stiff tires, to be:

$$A_{\ddot{z},\omega_s} = v^2 \omega_s^2 A_{z,g,\omega_s} \sqrt{\frac{(c_s v \omega_s)^2 + k_s^2}{(c_s v \omega_s)^2 + (k_s - m v^2 \omega_s^2)^2}} \qquad (2.1)$$
Figure 2.2: Filtered vs. unfiltered IMU data. The 40-tap FIR band-pass filter extracts the excitation of the suspension system, in the 0.3 to 12 Hz band. This removes the offset of gravity, as terrain slope varies, and higher-frequency noise, such as engine system vibration, that is unrelated to oscillations in the vehicle's suspension.
where $m$ is the mass of one quarter of the vehicle, and the magnitude of the acceleration input to the tire is taken to be $v^2 \omega_s^2 A_{z,g,\omega_s}$, as described above. The resonant frequency of a standard automobile is around 1 to 1.5 Hz. Below this frequency, the amplitude increases from zero at 0 Hz, slowly, but exponentially, with frequency. Above this frequency, the amplitude increases linearly, approaching $\frac{c_s v \omega_s}{m} A_{z,g,\omega_s}$. Modern active damping systems further quench the effect of the resonant frequency, leading to a curve that is nearly linear in frequency throughout, to within the noise of the sensors, which also measure the effect of random forcing disturbances.
The frequency, $\omega$, is directly proportional to speed $v$, so the amplitude of the acceleration response of the vehicle is also approximately a linear function of the velocity. Now, by superposition, considering the sum of the responses for all components of the terrain profile at all spatial frequencies, the amplitude of the acceleration response of the vehicle, $A_{\ddot{z}}$, is the sum of many approximately linear functions in $v$, and hence is approximately linear in $v$ itself. This theoretical analysis can also be verified experimentally, as in Figure 2.1.

This approximation assumed infinitely stiff tires. In reality, the finite stiffness of tires leads to a higher-frequency resonance in the suspension system around 10 to 20 Hz, followed by an exponential drop-off in $A_{\ddot{z}}$ to zero.
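The near-linearity argued above can be spot-checked numerically from Equation (2.1). The quarter-car numbers below ($m$, $k_s$, $c_s$) are round, illustrative values, chosen only so that the body resonance lands near 1.1 Hz; they are not Stanley's measured parameters.

```python
import math

def body_accel_amplitude(v, omega_s, a_g, m=400.0, k_s=2.0e4, c_s=2.0e3):
    """Equation (2.1): body-acceleration amplitude under the quarter-car
    model with infinitely stiff tires. v in m/s, omega_s in rad/m."""
    w = v * omega_s                                # temporal frequency, rad/s
    num = (c_s * w) ** 2 + k_s ** 2
    den = (c_s * w) ** 2 + (k_s - m * w ** 2) ** 2
    return (w ** 2) * a_g * math.sqrt(num / den)

# Resonance sits at sqrt(k_s / m) / (2*pi) ~ 1.1 Hz with these numbers.
# Above resonance, the amplitude-to-speed ratio flattens toward the
# asymptotic slope (c_s * omega_s / m) * a_g:
slopes = [body_accel_amplitude(v, 2.0, 0.01) / v for v in (10.0, 20.0, 40.0)]
```

With these placeholder values, the three sampled ratios agree to within roughly 25% and settle toward the asymptote, the approximately linear operating region plotted in Figure 2.1.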
Although the acceleration is approximately a linear function of $v$, a barrier to the use of pure z-accelerometer inertial measurement unit (IMU) data is that acceleration can reflect inputs other than road roughness. First, the projection of the gravity vector onto the z-axis, which changes with the pitch of the road, causes a large, low-frequency offset. Second, the driveline and non-uniformities in the wheels cause higher-frequency vibrations.

To remove these effects, we pass the IMU data through a 40-tap FIR band-pass filter to extract the excitation of the suspension system, in the 0.3 to 12 Hz band. This band was found, experimentally, to both reject the offset of gravity and eliminate vibrations unrelated to the suspension system, without eliminating the vibration response of the suspension.
A comparison of the raw and filtered data is shown in Figure 2.2. The mean of the raw data, acquired on terrain with no slope, is about 1 G, whereas the mean of the filtered data is about 0 G. Therefore, solely for presentation in this figure, 1 G was subtracted from the raw data.
vp∗ = α · vp / |z̈p| .    (2.2)
That is, vp∗ is the velocity that would have delivered the maximum acceptable shock (α) to
the vehicle.
Our instantaneous recommendation for vehicle speed is vp∗. Notice that the vehicle slows
reactively. Reactive behavior would be ineffective if shocks were instantaneous, random, and
unclustered. That is, our approach relies on one shock event momentarily "announcing" the
arrival of another. Experimental results over hundreds of miles of desert driving indicate this
is an effective strategy.
We consider vp∗ an observation. The algorithm incorporates these observations over
time, generating the actual recommended speed, vpr :
For our purposes, we also clamp vpr such that it is never less than 5 mph. We call the vpr
series a velocity plan.
Figure 2.3: The results of the coordinate descent machine learning. The learning
generated the race parameters of 0.25 Gs and 1 mph/s, a good match to human driving. Note
that the human often drives under the speed limit. This suggests the importance of our
algorithm for desert driving.
∫p ψ |vph − vpr| (1 + αβ^−1) dp    (2.4)

where ψ = 1 if vpr ≤ vph and ψ = 3 otherwise. The ψ term penalizes speeding relative
to the human driver. The value 3 was selected arbitrarily. The αβ^−1 term penalizes
parametrizations that tolerate a great deal of shock or accelerate too slowly.
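A discretized reading of this objective can be sketched as follows. The absolute-difference form and the function name are our own assumptions for illustration; lower scores indicate a closer match to the human driver.

```python
def match_score(v_human, v_algo, alpha, beta, dp=1.0):
    """Discretized sketch of the learning objective (Eq. 2.4); lower is better.

    v_human, v_algo : per-position speed series (mph)
    alpha, beta     : shock threshold (Gs) and recovery acceleration (mph/s)
    dp              : segment length between positions
    """
    total = 0.0
    for vh, vr in zip(v_human, v_algo):
        psi = 1.0 if vr <= vh else 3.0      # penalize speeding past the human
        total += psi * abs(vh - vr) * (1.0 + alpha / beta) * dp
    return total
```

With α = 0.25 and β = 1, driving 10 mph over the human's speed costs three times as much as driving 10 mph under it, reflecting the asymmetric ψ penalty.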
For this experiment, a human driver drove 1.91 miles of the previous year’s (2004)
Grand Challenge route. The vehicle’s state vector was estimated by an Unscented Kalman
Filter (UKF) [44]. The UKF combined measurements from a Novatel GPS with Omnistar
differential corrections, a GPS compass, the IMU, and wheel speed encoders. The GPS
devices produced data at 10 Hz. All other devices, and the UKF itself, produced data at
100 Hz. All data was logged with millisecond-accurate timestamps.
We do not model driver reaction time, and we correct for the delay due to IMU filtering.
Thus, in this data, driver velocity at position p is the vehicle’s recorded velocity at position
p, according to the UKF. The algorithm’s velocity at position p is the decision of the veloc-
ity controller considering all data up to and including the IMU and UKF data points logged
at position p.
The parametrization learned was (α, β) = (0.27 Gs, 0.909 mph/s). This is illustrated
in Figure 2.3. The top and bottom halves of the figure show the training and test sets,
respectively. The horizontal axis is position along the 2004 Grand Challenge route in miles.
The vertical axis is speed in mph. For practical reasons we rounded these parameters to
0.25 and 1, respectively, which are essentially identical in terms of vehicle performance.
The robot used these parameters during the Grand Challenge event, and they are evaluated
extensively in later sections.
We notice a reasonable match to human driving. Without the penalty for speeding, a
closer match is possible. However, that parametrization causes the vehicle to drive too ag-
gressively. This is, no doubt, because a human driver also uses perceptual cues to determine
speed.
Notice that the speed limits are often much greater than the safe speed the human (and
the algorithm) selects. This demonstrates the importance of our technique over blindly fol-
lowing the provided speed limits. Notice also, however, both the human and the algorithm
at times exceed the speed limit. This is not allowed on the actual robot or in the experiments
in Section 2.5. However, we permit it here to simplify the scoring function, the learning
process, and the idea of matching a human driver’s speed.
Figure 2.4: Each point indicates a specific shock event with both the velocity controller
(vertical axis) versus speed limits alone (horizontal axis). There are very few high shock
events on this easy terrain. The velocity controller does not reduce shock or speed much.
Having verified the basic function of the algorithm, we now proceed to analyze its perfor-
mance over long stretches of driving. For long stretches, it becomes intractable to compare
algorithm performance to human decisions (as in Section 2.4.3) or relative to qualitatively
assessed terrain conditions (as in Section 2.5.1). Therefore, we must consider other metrics.
Over long stretches of driving, we use the aggregate reduction of large shocks to measure
algorithm effectiveness.
However, to evaluate the reduction of large shocks, we cannot simply sum the amount
of shock. Extreme shock is rare. Less than 0.3% of all readings (2005 Grand Challenge,
Speed Limits Alone) are above our maximum acceptable shock (α) threshold. Thus, the
aggregate effect of frequent, small shock could mask the algorithm’s performance on rare,
Figure 2.5: Each point indicates a specific shock event with both the velocity controller
(vertical axis) versus speed limits alone (horizontal axis). There are many high shock
events on this difficult terrain. The velocity controller reduces shock and speed a great
deal.
large shock. To avoid this problem, we raise each shock to the 4th power (the "L4
metric") before adding it to the sum. This accentuates large shocks and diminishes small
ones in the aggregate score.
Whenever the vehicle travels along a road – regardless of whether under human or
autonomous control – it can map the road for velocity control experiments. Each raw IMU
reading is logged along with the vehicle’s velocity and position according to the UKF. Then,
offline, each IMU reading can be filtered and divided by the vehicle's speed. This generates
a sequence of position-tagged road roughness values in units of Gs-felt-per-mile-per-hour.
We use these sequences in this section to evaluate the velocity controller.
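The offline map-building step just described amounts to the following small sketch (the function name is ours, and the band-pass filtering of the raw readings is assumed to have already been applied):

```python
def roughness_sequence(imu_filtered, speeds, positions):
    """Position-tagged road roughness in Gs-felt-per-mph, built offline
    from logged, band-pass-filtered IMU readings and UKF speed estimates."""
    return [(p, abs(a) / v)
            for a, v, p in zip(imu_filtered, speeds, positions)
            if v > 0]   # skip samples where the vehicle was stopped
```

Dividing by speed exploits the approximately linear shock/speed relationship, making the stored roughness values speed-independent and hence reusable for simulated runs at other speeds.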
We calculate completion time by summing the time spent traversing each route segment
(i.e., from pi to pi+1) according to the average speed driven during that segment. The
simulation time step, dt, corresponds to 100 Hz. Because the velocity controller may change
speed instantaneously at each pi, we smooth the velocities it outputs – for these simulations
only – with a simple rate-limiting controller.
Figure 2.6: Here we show where the algorithm reduced speed during the 2005 Grand Chal-
lenge event. The velocity controller modified speed during 17.6% of the race. Height
of the ellipses shown at the top is proportional to the maximum speed reduction for each
1/10th mile segment. Numbers represent mile markers. Simulated terrain at the bottom
approximates route elevation.
We consider the trade-off of shock versus completion time on four desert routes. See Fig-
ure 2.7. The horizontal axis plots completion time. The vertical axis plots shock felt. Both
are for the algorithm normalized against speed limits alone. Each of the four curves repre-
sents a different desert route: the 2005 Grand Challenge route, the 2004 Grand Challenge
route, and two routes outside of Phoenix, Arizona, USA that we created for testing pur-
poses. Individual points on the curves depict the completion time versus shock trade-off
for a particular set of (α, β) values. For these curves, α is fixed and β is varied. The asterisk
on each curve indicates the trade-off generated by the parameters selected in Section 2.4.3.
The fifth asterisk, at the top-left, indicates what happens with speed limits alone. We notice
that, for all routes, completion time is increased between 2.5% and 5% over the baseline.
However, shock is reduced by 50% to 80% compared to the baseline, a substantial savings.
It appears that the algorithm is most effective along the 2004 Grand Challenge route.
There are two reasons for this. First, we prepared for the 2005 event along the 2004 course.
Therefore, our algorithm is especially optimized for that route. However, the effect is
mostly due to a different phenomenon.
Figure 2.7: Here we plot the trade-off of completion time and shock on four desert routes. α
is held constant (0.25) and β is varied along each curve. With the single set of parameters
learned in Section 2.4.3, the algorithm reduces shock by 50% to 80% while increasing
completion time by 2.5% to 5%.
The 2004 course had sections that were in extremely bad condition when this data
was collected (June 2005). Such extreme off-road driving was not seen on the other three
routes. Because the core functionality of our algorithm involves being cautious about clus-
tered, dangerous terrain, it was particularly effective for the 2004 course. This implies our
approach may be more beneficial as terrain difficulty increases.
Finally, we notice occasional abrupt, unsmooth drops in shock in the figure. These occur
when a parametrization is just aggressive enough to slow the vehicle before a major rough
section of road, whereas the immediately preceding parametrization did not slow the
vehicle in time.
Simulated terrain at the bottom of Figure 2.6 approximates elevation data from the route. At the top of the figure, ellipse height indicates
the maximum speed reduction generated by the algorithm for that portion of the route. For
the purposes of this figure, the route was divided into 1/10th mile segments. Log files
taken by the vehicle during the Grand Challenge event indicate the velocity controller was
controlling vehicle speed for 17.6% of the entire race – more than any other factor except
speed limits.
We notice that the algorithm was particularly active in the mountains and especially
inactive in plains. We also notice that miles 122-129 – which we identified in Section 2.5.1
as the most challenging part of the route – do not represent the greatest speed reductions the
velocity controller made. This is because speed limits (given by DARPA) were already low
in those regions, and the amount of speed reduction is, of course, related to the accuracy of
the original speed limit as a good speed guide.
2.6 Discussion
Speed decisions are a crucial part of planning – even beyond their relationship to obstacle
avoidance and lateral maneuverability. We have presented an online approach to speed
adaptation for high-speed off-road autonomous driving. Our approach addresses shock,
an important component of overall risk when driving off-road. Our method requires no
rehearsal or remote imaging of the route.
The algorithm has three states. First, the vehicle drives at the speed limit until an accept-
able shock threshold is exceeded. Second, the vehicle reduces speed to bring itself within
the threshold. Finally, the vehicle gradually increases speed back to the speed limit. The
algorithm uses the linear relationship between speed and shock to determine the amount of
speed reduction needed to stay within the limit. This limit and the acceleration parameter
from the third state are tuned using supervised learning to match human driving.
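The three-state logic above can be sketched as follows. Because Equation (2.3) is not reproduced here, the recovery and clamping details are our own assumptions based on the description; the function and variable names are hypothetical.

```python
def velocity_plan(speed_limits, shocks, alpha=0.25, beta=1.0, dt=0.01):
    """Sketch of the three-state speed logic.

    speed_limits : per-step speed limits (mph)
    shocks       : per-step filtered vertical accelerations, |z̈| (Gs)
    alpha        : maximum acceptable shock (Gs)
    beta         : recovery acceleration back toward the limit (mph/s)
    dt           : time step (s), 100 Hz as in the experiments
    """
    v = speed_limits[0]
    plan = []
    for limit, shock in zip(speed_limits, shocks):
        if shock > alpha:
            # State 2: threshold exceeded. Use the linear shock/speed
            # relationship to pick the speed that would have produced alpha.
            v = alpha * v / shock
        else:
            # State 3: gradually accelerate back toward the speed limit.
            v = v + beta * dt
        v = max(v, 5.0)          # never recommend less than 5 mph
        v = min(v, limit)        # State 1: never exceed the speed limit
        plan.append(v)
    return plan
```

For example, at a 60 mph limit, a single 0.5 G shock with α = 0.25 halves the recommendation to 30 mph, after which the plan ramps back up at β = 1 mph/s.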
Experimental results allow us to draw numerous conclusions. First, our algorithm
seems to be more cautious on more difficult terrain, although this is difficult to prove in
the general case (Figures 2.4 and 2.5). Second, by slowing, we can substantially reduce
high shock events with minimal effect on route completion time (Figure 2.7). Finally, the
algorithm had significant influence on our vehicle’s behavior during the 2005 Grand Chal-
lenge (Figures 2.4, 2.5, 2.6, and 2.7). It reduced shock by 52% and slowed the vehicle in
difficult terrain, all while adding less than 10 minutes to the completion time. The algorithm
also had more influence than any other factor – except speed limits – on our robot’s speed.
Moreover, it generated the fastest completion time despite being fully automated, unlike some
other competitors in the Grand Challenge [35, 90, 91].
Chapter 3
Smooth Road Detection
3.1 Abstract
Accurate perception is a principal challenge of autonomous off-road driving. Perceptive
technologies generally focus on obstacle avoidance. However, at high speed, terrain roughness
is also important, to control the shock the vehicle experiences. The accuracy required to
detect rough terrain is significantly greater than that necessary for obstacle avoidance.
We present a self-supervised machine learning approach for estimating terrain rough-
ness from laser range data. Our approach compares sets of nearby surface points acquired
with a laser. This comparison is challenging due to uncertainty. For example, at range, laser
readings may be so sparse that significant information about the surface is missing. Also, a
high degree of precision is required in projecting laser readings. This precision may be un-
available due to latency or error in the pose estimation. We model these sources of error as
a multivariate polynomial. The coefficients of this polynomial are obtained through a self-
supervised learning process. The “labels” of terrain roughness are automatically generated
from actual shock, measured when driving over the target terrain. In this way, the approach
provides its own training labels. It “transfers” the ability to measure terrain roughness from
the vehicle’s inertial sensors to its range sensors. Thus, the vehicle can slow before hitting
rough terrain.
Our experiments use data from the 2005 DARPA Grand Challenge. We find our approach
is substantially more effective at identifying rough surfaces and assuring vehicle safety.
3.2 Introduction
In robotic autonomous off-road driving, the primary perceptual problem is terrain assess-
ment in front of the robot. For example, in the 2005 DARPA Grand Challenge [27] (a
robot competition organized by the U.S. Government) robots had to identify drivable sur-
face while avoiding a myriad of obstacles – cliffs, berms, rocks, fence posts. To perform
terrain assessment, it is common practice to endow vehicles with forward-pointed range
sensors. The terrain is then analyzed for potential obstacles. The result is used to adjust the
direction of vehicle motion [46, 47, 48, 91].
At high speed – one key challenge in the DARPA Grand Challenge – terrain
roughness must also dictate vehicle behavior. Rough terrain induces shock, and at high speed
the effect of shock can be detrimental [18]. Thus, to be safe, a vehicle must sense ter-
rain roughness and adjust speed, avoiding high shock. The degree of accuracy necessary
for assessing terrain roughness exceeds that required for obstacle finding by a significant
margin – rough patches are often just a few centimeters in height. This makes design of a
competent terrain assessment function difficult.
In this chapter, we present a method that enables a vehicle to acquire a competent
roughness estimator for high speed navigation. Our method uses self-supervised machine
learning. This allows the vehicle to learn to detect rough terrain while in motion and with-
out human training input. Training data is obtained by a filtered analysis of inertial data
acquired at the vehicle core. This data is used to train (in a supervised fashion) a classifier
that predicts terrain roughness from laser data. In this way, the learning approach “trans-
fers” the capability to sense rough terrain from inertial sensors to environment sensors. The
resulting module detects rough terrain in advance, allowing the vehicle to slow. Thus, the
vehicle avoids high shock that would otherwise cause damage.
We evaluate our method using data acquired in the 2005 DARPA Grand Challenge.
Our experiments measure ability to predict shock and effect of such predictions on vehicle
safety.
We find our method is more effective at identifying rough surfaces than previous techniques
derived from obstacle avoidance algorithms. This is because we use training data
that emphasizes the characteristics that distinguish very small surface discontinuities. The
self-supervised approach is crucial to making this possible.
Furthermore, we find our method reduces vehicle shock significantly with only a small
reduction in average vehicle speed. The ability to slow before hitting rough terrain is in
stark contrast to the speed controller used in the race [80] presented in Chapter 2. That
controller measured vehicle shock exclusively with inertial sensing. Hence, it sometimes
slowed the vehicle after hitting rough terrain.
We present the chapter in four parts. First, we define the functional form of the
laser-based terrain estimator. Then we describe the exact method for generating training data
from inertial measurements. Third, we train the estimator with the learning algorithm.
Finally, we examine the results experimentally.
effectively breaks the correspondence of ICP. Therefore, we believe that accurate recovery
of a full 3-D world model from this noisy, incomplete data is impossible. Instead we use a
machine learning approach to define tests that indicate when terrain is likely to be rough.
Other work uses machine learning in lieu of recovering a full 3-D model. For example,
[73] and [61] use learning to reason about depth in a single monocular image. Although full
recovery of a world model is impossible from a single monocular image, useful data can
be extracted using appropriate learned tests. Our work has two key differences from these
prior papers: the use of lasers rather than vision and the emphasis upon self-supervised
rather than reinforcement learning.
In addition, other work has used machine learning in a self-supervised manner. The
method is an important component of the DARPA LAGR Program. It was also used for
visual road detection in the DARPA Grand Challenge [25]. The “trick” of using self-
supervision to generate training data automatically, without the need for hand-labeling, has
been adapted from this work. However, none of these approaches addresses the problem of
roughness estimation and intelligent speed control of fast moving vehicles. As a result, the
source of the data and the underlying mathematical models are quite different from those
proposed here.
Figure 3.1: The left graphic shows a single laser scan. The right shows integration of these
scans over time.
Figure 3.2: Pose estimation errors can generate false surfaces that are problematic for nav-
igation. Shown here is a central laser ray traced over time, sensing a relatively flat surface.
The integration is achieved through an unscented Kalman filter [44] at an update rate of
100 Hz. The pose estimation is subject to error due to measurement noise and communication
latencies.
Thus, the resulting z-values might be misleading. We illustrate this in Figure 3.2. There we
show a laser ray scanning a relatively flat surface over time. The underlying pose estimation
errors may be on the order of 0.5 degrees. When extended to the endpoint of a laser scan,
the pose error can lead to z-errors exceeding 50 centimeters. This makes it obvious that the
3-D point cloud cannot be taken at face value. Separating the effects of measurement noise
from actual surface roughness is one key challenge addressed in this work.
Figure 3.3: The measured z-difference from two scans of flat terrain indicates that error
increases linearly with the time between scans. This empirical analysis suggests the second
term in Equation (3.4).
The error in pose estimation is not easy to model. One of the experiments in [88] shows
it tends to grow over time. That is, the more time that elapses between the acquisition of
two scans, the larger the relative error. Empirical data backing up this hypothesis is shown
in Figure 3.3. This figure depicts the measured z-difference from two scans of flat terrain
graphed against the time between the scans. This time-dependent characteristic suggests
that elapsed time should be a parameter in the model.
Our algorithm compares features pairwise, resulting in an N × N scoring matrix for N laser
points. Let us denote this matrix by S. In practice, the N points nearest the predicted
future location of each rear wheel are used, with each wheel treated separately. Data from
the rear wheels will be combined later in Equation (3.5).
The (r, c) entry in the matrix S is the score generated by comparison of feature r with
feature c where r and c are both in {0, . . . , N − 1}:
The difference function ∆ is symmetric and, when applied to two identical points, yields
a difference of zero. Hence S is symmetric with a diagonal of zeros, and its computation
requires just (N² − N)/2 comparisons. The element-by-element product of S and the lower
triangular matrix whose non-zero elements are 1 is taken. (The product zeros out the sym-
metric, redundant elements in the matrix.) Finally, the largest ω elements are extracted into
the vector W and accumulated in ascending order to generate the total score, R:

R = Σ_{i=0}^{ω} Wi υ^i .    (3.3)
Here υ is the increasing weight given to successive scores. Both υ and ω are parameters
that are learned according to Section 3.7. As we will see in Equation (3.4), points that are
far apart in (x, y) are penalized heavily in our technique. Thus, they are unlikely to be
included in the vector W .
Intuitively, each value in W is evidence regarding the magnitude of the second deriva-
tive. The increasing weight indicates that large, compelling evidence should win out over
the cumulative effect of small evidence. This is because laser points on very rough patches
are likely to be a small fraction of the overall surface. R is the quantitative verdict regarding
the roughness of the surface.
The comparison function ∆ is a polynomial that combines additively a number of cri-
teria. In particular, we use the following function:
Here euclidean(Li , Lj ) denotes Euclidean distance in (x, y)-space between laser points i
and j. The various parameters α[k] are generated by the machine learning.
The function ∆ implicitly models the uncertainties that exist in the 3-D point cloud.
Specifically, large change in z raises our confidence that this pair of points is a witness
for a rough surface. However, if significant time elapsed in between scans, our functional
form of ∆ allows for the confidence to decrease. Similarly, large (x, y) distance between
points may decrease confidence as it dilutes the effect of the large z discontinuity on surface
roughness. Finally, if the first derivatives of roll and pitch are large at the time the reading
was taken, the confidence may further be diminished. These various components make it
possible to accommodate the various noise sources in the robot system.
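The scoring pipeline of Equations (3.3) and (3.4) can be sketched as follows. Since Equation (3.4) and its learned coefficients are not reproduced here, the ∆ below is a simplified stand-in that captures only the criteria described (large |∆z| raises the score; elapsed time and (x, y) distance lower it, and the roll/pitch-rate terms are omitted); the weights are illustrative, not learned values.

```python
import math

def delta(p_i, p_j, a=(1.0, 0.5, 0.5)):
    """Simplified stand-in for the comparison function of Eq. (3.4).
    Each point is (x, y, z, t); weights `a` are illustrative."""
    dz = abs(p_i[2] - p_j[2])
    dt = abs(p_i[3] - p_j[3])
    dxy = math.hypot(p_i[0] - p_j[0], p_i[1] - p_j[1])
    # Large z-change is evidence of roughness; time gap and planar
    # distance discount that evidence, as described in the text.
    return max(0.0, a[0] * dz - a[1] * dt - a[2] * dxy)

def roughness_score(points, omega=5, upsilon=1.5):
    """Total score R of Eq. (3.3): the omega largest pairwise scores,
    accumulated in ascending order with increasing weight upsilon**i."""
    n = len(points)
    scores = [delta(points[r], points[c]) for r in range(n) for c in range(r)]
    top = sorted(scores)[-omega:]            # largest omega scores, ascending
    return sum(w * upsilon ** i for i, w in enumerate(top))
```

Iterating `c` over `range(r)` visits only the lower triangle of S, realizing the (N² − N)/2 comparisons noted above.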
Finally, we model the nonlinear transfer of shock from the left and right wheels to the
IMU. (The IMU is rigidly mounted and centered just above the rear axle.) Our model
assumes a transfer function of the form
where Rleft and Rright are calculated according to Equation (3.3). Here ζ is an unknown
parameter that is learned simultaneously with the roughness classifier. Clearly, this model
is extremely crude. However, as we will show, it is sufficient for effective shock prediction.
Ultimately, Rcombined is the output of the classifier. In some applications, using this
continuous value has advantages. However, we use a threshold µ for binary classifica-
tion. (A binary assessment simplifies integration with our speed selection process.) Terrain
whose Rcombined value exceeds the threshold is classified as rugged, and terrain below the
threshold is assumed smooth. µ is also generated by machine learning.
Figure 3.4: We find empirically that the relationship between perceived vertical accelera-
tion and vehicle speed over a static obstacle can be approximated tightly by a linear function
in the operational region of interest.
The relationship is surprisingly linear. The ruggedness coefficient value, thus, is simply the
quotient of the measured vertical acceleration and the vehicle speed.
Tp − λFp (3.6)
where Tp and Fp are the true and false positive classification rate, respectively. A true pos-
itive is when, given a patch of road whose ruggedness coefficient exceeds a user-selected
threshold, that patch’s Rcombined value exceeds µ. The trade-off parameter λ is selected
arbitrarily by the user. Throughout this chapter, we use λ = 5. That is, we are especially
interested in learning parameters that minimize false positives in the ruggedness detection.
To maximize this objective function, the learning algorithm adapts the various param-
eters in our model. Specifically, the parameters are the α values as well as υ, ω, ζ, and µ.
The exact learning algorithm is an instantiation of coordinate descent. The algorithm be-
gins with an initial guess of the parameters. This guess takes the form of a 13-dimensional
vector B containing the 10 α values as well as υ, ω, and ζ. For carrying out the search
in parameter space, there are two more 13-dimensional vectors I and S. These contain an
initial increment for each parameter and the current signs of the increments, respectively.
A working set of parameters T is generated according to the equation:
T = B + SI. (3.7)
The SI product is element-by-element. Each element of S rotates through the set {-1,1}
while all other elements are held at zero. For each valuation of S, all values of Rcombined
are considered as possible classification thresholds, µ. If the parameters generate an im-
provement in the classification, we set B = T and save µ.
After a full rotation by all the elements of S, the elements of I are halved. This contin-
ues for a fixed number of iterations. The resulting vector B and the saved µ comprise the
set of learned parameters.
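The search procedure above can be sketched generically. This is a minimal sketch of the sign-rotation and increment-halving loop, not the exact race implementation; the objective here is any function to maximize, standing in for the classification score of Equation (3.6).

```python
def coordinate_descent(objective, b, increments, iterations=20):
    """Sketch of the coordinate descent described above.

    objective  : maps a parameter vector to a score (higher is better)
    b          : initial parameter guess (the vector B)
    increments : initial per-parameter step sizes (the vector I)
    """
    b = list(b)
    inc = list(increments)
    best = objective(b)
    for _ in range(iterations):
        # Rotate a single +/- step through each coordinate (the vector S):
        # one element of S is -1 or +1, all others are held at zero.
        for k in range(len(b)):
            for sign in (-1.0, 1.0):
                t = list(b)                 # working parameters T = B + S I
                t[k] += sign * inc[k]
                score = objective(t)
                if score > best:            # keep T only if it improves things
                    best, b = score, t
        inc = [i / 2.0 for i in inc]        # halve step sizes each full rotation
    return b, best
```

On a simple concave objective this converges quickly; in the chapter's setting the objective evaluation additionally sweeps all Rcombined values as candidate thresholds µ.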
Figure 3.5: The true positive / false positive trade-off of our approach and two interpreta-
tions of prior work. The self-supervised method is significantly better for nearly any fixed
acceptable false-positive rate.
Figure 3.6: The trade off of completion time and shock experienced for our new self-
supervised approach and the reactive method we used in the Grand Challenge.
This is indeed the case. The results are shown in Figure 3.6. The graph plots on the
vertical axis the shock experienced. The horizontal axis indicates the completion time of
the 10-mile section used for testing. Both axes are normalized by the values obtained with
no speed modification. The curve labeled “reactive only” corresponds to the actual race
software. In contrast, the curve labeled “self-supervised” indicates our new method.
The results clearly demonstrate that reducing speed proactively with our new terrain
assessment technique constitutes a substantial improvement. The amount of shock experi-
enced is dramatically lower for essentially all possible completion times. The reduction is
as much as 50%. Thus, while the old method was good enough to win the race, our new
approach would have permitted Stanley to drive significantly faster while still avoiding
excessive shock due to rough terrain.
3.9 Discussion
We have presented a novel, self-supervised learning approach for estimating the roughness
of outdoor terrain. Our main application is the detection of small discontinuities that are
likely to create significant shock for a high-speed robotic vehicle. By slowing, the vehicle
can reduce the shock it experiences. Estimating roughness demands the detection of very
small surface discontinuities – often a few centimeters. Thus the problem is significantly
more challenging than finding obstacles.
Our approach monitors the actual vertical accelerations caused by the unevenness of
the ground. From that, it extracts a ruggedness coefficient. A self-supervised learning
procedure is then used to predict this coefficient from laser data, using a forward-pointed
laser. In this way, the vehicle can safely slow down before hitting rough surfaces. The
approach in this chapter was formulated (and implemented) as an offline learning method.
But it can equally be used as an online learning method where the vehicle learns as it drives.
Experimental results indicate that our approach offers significant improvement in shock
detection and – when used for speed control – reduction. In comparison to previous meth-
ods, our algorithm has a significantly higher true-positive rate for (nearly) any fixed value
of acceptable false-positive rate. Results also indicate our new proactive method allows for
additional shock reduction without increase in completion time compared to our previous
reactive work. Put differently, Stanley could have completed the race even faster without
any additional shock.
There exist many open questions that warrant further research. One is to train a camera
system so rough terrain can be extracted at farther ranges. Also, the mathematical model for
surface analysis is quite crude. Further work could improve upon it. Additional parameters,
such as the orientation of the vehicle and the orientation of the surface roughness, might
add further information. Finally, we believe using this method online, during driving, could
add further insights to its strengths and weaknesses.
Chapter 4
Feature Optimization
4.1 Abstract
We present an algorithm that learns invariant features from real data in an entirely unsuper-
vised fashion. The principal benefit of our method is that it can be applied without human
intervention to a particular application or data set, learning the specific invariances neces-
sary for excellent feature performance on that data. Our algorithm relies on the ability to
track image patches over time using optical flow. With the wide availability of high frame
rate video (e.g., on the web or from a robot), good tracking is straightforward to achieve. The
algorithm then optimizes feature parameters such that patches corresponding to the same
physical location have feature descriptors that are as similar as possible while simultane-
ously maximizing the distinctness of descriptors for different locations. Thus, our method
captures data or application specific invariances yet does not require any manual super-
vision. We apply our algorithm to learn domain-optimized versions of SIFT and HOG.
SIFT and HOG features are excellent and widely used. However, they are general and
by definition not tailored to a specific domain. Our domain-optimized versions offer a
substantial performance increase for classification and correspondence tasks we consider.
Furthermore, we show that the features our method learns are near the optimum that would
be achieved by directly optimizing the test set performance of a classifier. Finally, we
demonstrate that the learning often allows fewer features to be used for some tasks, which
has the potential to dramatically reduce computational cost for very large data sets.
4.2 Introduction
Many feature functions have been proposed in the literature. However, they are usually
not tailored to a specific domain. While some of these features achieve excellent results
across a broad range of applications, their performance for particular applications can be
improved because they do not capture domain-specific invariances. Optimizing a feature
function for a particular domain is challenging because most features have numerous pa-
rameters that effect their invariances, such as the standard deviation of Gaussians used for
smoothing and the number of bins in histograms used for orientation calculation. These
parameters can be set entirely by hand, using hand-labeled data and learning, or by mathe-
matical processes such as warping an image patch while optimizing the feature value to be
invariant. However, we believe those approaches have limitations. A completely manual
approach or one involving hand-labeling can be costly or time consuming and may not pro-
duce optimal results. Patch warping and related methods tend to be synthetic, not capturing
all the invariances in the data.
We present an algorithm that uses unsupervised machine learning to optimize feature
invariances using video. Because our algorithm is entirely unsupervised and computation-
ally efficient, it can handle huge amounts of data. Furthermore, because it is data driven,
the invariances it learns are accurate for the specific application domain or data set. Our
method has two key steps.
First, our method performs correspondence between successive video frames. For this
step we use basic Harris corner features and the Lucas-Kanade pyramidal optical flow
algorithm. This approach is well-known as a method for correspondence on video and
works well despite its simplicity because the high frame rate limits motion between frames. A
small patch of pixels around each Harris corner feature is extracted and saved, resulting in
many sets of patches. Each set contains multiple views of approximately the same location
in the environment. The number of sets is bounded only by video length and availability of
corner features to track.
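The bookkeeping of this first step can be sketched as follows. This is a minimal sketch: the corner detector and pyramidal optical flow are abstracted behind a `flow` callback, and the names (`extract_patch`, `build_patch_sets`) and the patch size are illustrative, not taken from our implementation.

```python
import numpy as np

PATCH = 8  # patch half-width in pixels; an illustrative value

def extract_patch(frame, x, y, r=PATCH):
    """Cut a (2r x 2r) pixel patch centered on an integer corner location."""
    x, y = int(round(x)), int(round(y))
    return frame[y - r:y + r, x - r:x + r].copy()

def build_patch_sets(frames, corners, flow):
    """Track each corner through the video and group its views into a set.

    frames  : list of 2D grayscale arrays (the video).
    corners : list of (x, y) corner locations detected in frames[0]
              (e.g. Harris corners).
    flow    : callable (prev_frame, next_frame, (x, y)) -> (x, y) giving
              the corner's position in next_frame (e.g. pyramidal
              Lucas-Kanade optical flow).

    Returns a list of patch sets; each set holds multiple views of
    approximately the same world location. The number of sets is bounded
    only by video length and the corners available to track.
    """
    sets = [[extract_patch(frames[0], x, y)] for (x, y) in corners]
    positions = list(corners)
    for prev, nxt in zip(frames, frames[1:]):
        for t, (x, y) in enumerate(positions):
            x2, y2 = flow(prev, nxt, (x, y))   # motion into the next frame
            positions[t] = (x2, y2)
            sets[t].append(extract_patch(nxt, x2, y2))
    return sets
```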
The second step optimizes the higher-level features. Here, “higher-level” implies a
feature of greater sophistication and complexity than Harris corners. Because SIFT and
HOG are widely used in the literature, we focus on them in this chapter. However, any
type of feature may be used. It is well-established that a feature function should compute
the same value when applied to the same real-world object or type of object but be distinct
for different objects. To reach this goal, we optimize feature parameters using an efficient,
effective search procedure and scoring function. Our method requires no hand-labeling and
can be completely automated for any task involving video.
To show that our optimized features capture the appropriate domain-specific invari-
ances, we demonstrate that parameters optimized for one video stream outperform, on that
stream, those optimized for another and vice-versa. In particular, we look at the perfor-
mance of optimized versions of SIFT and HOG on two tasks: car detection in an urban
environment and correspondence for 3D reconstruction of surgery-related imagery. We
show that the version of SIFT optimized for car detection, on a car detection task, out-
performs the version of SIFT optimized for the surgical task. We also show the converse
is true; the surgical SIFT features outperform the car features on the surgical task. When
we compare our parameters to the original highly-tuned parameters presented by Lowe,
we find our performance is better for the car detection task and essentially the same for
the correspondence task. We present similar results for HOG features. Furthermore, we
analyze the effect of optimization on the rotational invariance property of SIFT.
In addition, the parameters our method learns are near the optimal achieved by directly
optimizing the test set performance of a classifier. That is, iteratively, we train the classifier
on a training set, evaluate its performance on the test set, and change feature parameters
to improve test set performance. We find that our method, despite being entirely unsupervised,
produces parameters near this optimum. Finally, we show that optimized parameters
can allow fewer features to be used for comparable application performance, reducing
the computational burden associated with very large data sets or high-resolution images.
Our method does have limitations. For example, in highly repetitive environments or
video with loops, the same object may be encountered multiple times and thus recorded in
distinct patch sets. In such cases, optimizing for distinctness between patch sets could be
sub-optimal. However, our experimental results indicate that for typical, non-pathological
cases performance is excellent.
Our algorithm uses corner features and optical flow. However any features and corre-
spondence algorithm could be used. Corner features work with virtually any textured image
patch. We argue that texture is a necessary condition for any algorithm seeking multiple
views of the same area. Thus, our method covers essentially the full range of patch types
such algorithms can use, which makes the optimization step particularly general. Furthermore,
corners and optical flow are very fast to compute, so our method could be used online to
adapt features in real time.
Corner features are not enough for many vision applications such as object recognition
and correspondence for 3D reconstruction in the absence of dense data. This motivates the
next step of our method where higher-level features are calculated and optimized for the
tracked patches.
\[
\arg_{\vec{\theta}} \sum_{\forall P \in \vec{P}} \Bigg( \gamma \, \min \sum_{\forall p, q \in P} \frac{1}{|p|} \sum_{i=0}^{|p|} |p_i - q_i| \;+\; \max \sum_{\forall p \in P,\; \forall q \in (\forall (Q \in \vec{P}) \neq P)} \frac{1}{|p|} \sum_{i=0}^{|p|} |p_i - q_i| \Bigg). \tag{4.1}
\]
Finally, the outer summation adds over all patch sets, turning our optimization for the one
set into an optimization for all. That is, we are making all the patches that are actually the
same as close as possible in feature value and all the patches that are different as different
as possible in feature value. γ is a parameter trading off the extent to which we would
rather make the same patches match as opposed to making differing patches not match.
Simplifying without loss of generality we have:
\[
\arg\min_{\vec{\theta}} \sum_{\forall P \in \vec{P}} \Bigg( \gamma \sum_{\forall p, q \in P} \frac{1}{|p|} \sum_{i=0}^{|p|} |p_i - q_i| \;-\; \sum_{\forall p \in P,\; \forall q \in (\forall (Q \in \vec{P}) \neq P)} \frac{1}{|p|} \sum_{i=0}^{|p|} |p_i - q_i| \Bigg). \tag{4.2}
\]
Since \(|\vec{P}|\) is potentially enormous, in practice we replace \(\forall (Q \in \vec{P}) \neq P\) with a small
number of samples. In fact, we find experimentally that one is sufficient for good performance.
Even with this approximation, the optimization is still challenging computationally. Thus
we also assume the elements of θ~ are conditionally independent. This allows us to perform
coordinate descent in parameter space. However, we find experimentally that the space is
fraught with local minima. Thus a more reliable method is to simply search the space. A
search is tractable in this case because a feature can be generated rapidly (in about .0025
seconds) and the well-known “good” range of many parameters is not enormous, even at
small discretization [54].
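The scoring and search described above can be sketched as follows, assuming a hypothetical `make_feature(theta)` factory that instantiates a feature function from a parameter dictionary. The single-sample approximation of the inter-set term and the per-parameter sweep reflect the procedure described above, but all names and the toy structure are illustrative:

```python
import itertools
import random

import numpy as np

def set_score(feature, P, Q, gamma):
    """Equation 4.2 objective restricted to one patch set P, with the
    inter-set term approximated by a single sampled set Q."""
    d = lambda a, b: float(np.mean(np.abs(a - b)))  # normalized L1 distance
    intra = sum(d(feature(p), feature(q))
                for p, q in itertools.combinations(P, 2))
    inter = sum(d(feature(p), feature(q)) for p in P for q in Q)
    return gamma * intra - inter

def search_parameters(make_feature, grids, patch_sets, gamma=1.0):
    """Sweep each parameter over its discretized 'good' range while holding
    the others fixed (the conditional-independence assumption), keeping the
    value that minimizes the total score."""
    theta = {name: grid[0] for name, grid in grids.items()}
    for name, grid in grids.items():
        best_value, best_score = theta[name], float('inf')
        for value in grid:
            feature = make_feature(dict(theta, **{name: value}))
            score = sum(
                set_score(feature, P,
                          random.choice([Q for Q in patch_sets if Q is not P]),
                          gamma)
                for P in patch_sets)
            if score < best_score:
                best_value, best_score = value, score
        theta[name] = best_value
    return theta
```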
in [93]. For SIFT features, 3 scales were used. Each of these features was concatenated
together in one (large!) vector describing the image. A standard SVM with a linear kernel
was trained and tested in the standard fashion. This is a simple classification scheme, not
the most recent presented in the literature. Our goal, however, is to emphasize the qual-
ity of our features, not the learning algorithm. For full reproducibility, we used the SVM
implementation publicly available in [16]. The vertical axis shows classification accuracy,
the fraction of all images in the test set classified correctly. We use two different horizontal
axes: the resolution of the grid (for SIFT, multiplied by scales) and the number of training
examples per class. The number of features is important as it allows us to illustrate the ef-
ficacy of our features. For plots specifying the number of training examples, the resolution
of the grid was 6 × 6.
For the Street View data set, we consider car detection. To establish ground truth, we
manually labeled cars with bounding boxes. To generate negative examples, we randomly
cut bounding boxes from the image. We spot-checked the negative examples to verify cars
were not present. For all plots varying the number of features, training and testing each
occurred on 100 examples per class.
For the surgical data set, we consider patch correspondence for 3D reconstruction. We
do not perform 3D reconstruction itself. To evaluate correspondence, we randomly selected
20 patch sets (20 \(P\)'s from \(\vec{P}\); see Section 4.5). For plots varying the number of features,
we used 20 patches from each of these sets. The 20 were split evenly for training and test-
ing. Thus our classifier performs correspondence as a 20-way classification problem. The
classification of each patch in the test set was deemed its correspondence. Intuitively, we
expect correspondence to perform reasonably at finding a match even if only one point has
been seen before. This explains why our plots for the surgical data show good performance
even with one training example.
For SIFT features, we also plot the original/default parameters used in Lowe’s imple-
mentation. These provide a basis of comparison. Additionally, for HOG features, we
evaluated the optimality of our method. In particular, we searched all parametrizations,
iteratively training and testing the classifier for each. We kept the parametrization with the
best test set performance for a static number of features and training examples. Then, with
that parametrization, we evaluated performance for all numbers of features and training
examples as described above. As a check, we also retained the lowest test set performance.
We present these bounds only for HOG features because the search is computationally in-
tensive, and HOG features compute significantly faster in our implementation. To make
computation tractable, we used the conditional independence assumption described in Sec-
tion 4.5 for this search as well. Thus it is an approximate theoretical best. When searching,
we averaged the performance of a parameter set over many trials. To save computation, we
used fewer trials for the lower bound making it less accurate.
the number of features, fewer of the task-specific features are needed for a particular per-
formance level. This may be useful for algorithms on large data sets. Interestingly, our
surgical-specific parameters for the surgical correspondence task produce results nearly
identical to Lowe’s. However, for the cars task, our parameters are significantly better.
Lowe’s original intent was to produce features for matching/correspondence. Evidently his
extensive tuning well-optimized them for that task. Finally, Figure 4.6 shows selected im-
ages from the cars task that were classified correctly with the car-optimized SIFT but not
the surgical-optimized SIFT features.
Data Set   Window Size   Block Size   Block Stride   Cell Size
Cars       32            32           10             16
Surgical   12            12           4              6

Data Set   Aperture   Sigma   L2 Threshold   Gamma Correction
Cars       1          none    .61            false
Surgical   1          5       .51            false
Table 4.1: Learned HOG parameters for the cars and surgical data sets.
\[
\sqrt{\frac{1}{|f|}\sum_{i=0}^{|f|}\left(f_{\mathrm{angle},\,i}-f_{0,\,i}\right)^{2}}. \tag{4.3}
\]
That is, a single score is the square-root of the normalized sum of squared differences of
each element of two SIFT feature vectors. The vectors are from a rotation of angle and
zero degrees, respectively. Figure 4.9 shows averages and standard deviations over 18 trials
as a function of angle. Clearly, our optimization has removed rotational invariance from
the surgical-optimized SIFT features. This is the correct behavior given the invariances in
our data. An analogous plot could be produced for car-optimized SIFT.
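This score is straightforward to compute; a minimal numpy sketch (the function name is illustrative):

```python
import numpy as np

def rotation_score(f_angle, f_0):
    """Equation 4.3: the square root of the normalized sum of squared
    differences between a SIFT vector computed on a rotated patch and one
    computed at zero degrees. A rotation-invariant feature scores near
    zero at every angle; a score that grows with angle indicates the
    invariance has been removed."""
    f_angle = np.asarray(f_angle, dtype=float)
    f_0 = np.asarray(f_0, dtype=float)
    return float(np.sqrt(np.mean((f_angle - f_0) ** 2)))
```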
Rotational invariance is not the only invariance optimized for SIFT. There are many
other parameters being optimized as discussed in Section 4.6.2. For example, the stan-
dard deviation of the Gaussian used for smoothing is important. Our optimization disables
rotational invariance for both of the application-specific SIFT features. Thus, rotational
invariance does not account for the performance difference between them, though it may
account for differences relative to Lowe's parameters, which retain rotational invariance.
4.7 Discussion
We present an algorithm that learns optimized features in an unsupervised fashion. Our
features capture application or data-specific invariances, improving performance. We fo-
cus on optimizing SIFT and HOG features due to their widespread use in the literature.
However our method will work for any feature function with parameters. For HOG fea-
tures, we show our optimized versions are near the best possible for both the surgical and
car tasks. For SIFT, our parameters for the surgical task produce performance very near
the extensively researched parameters from Lowe. For the cars task, our SIFT parameters
offer substantially better performance. Due to the widespread use of SIFT and HOG, we
feel our method has the potential for immediate impact.
There are numerous opportunities for future work. Because our method is entirely un-
supervised, we plan to explore how it scales to enormous data sets. For example, we could
use every video on the web (or a very large subset) to learn optimized but general-purpose
features. Rather than optimizing, we could learn a whole new class of features directly
from pixels. We could also pursue additional applications. For example, numerous robotic
systems would benefit from real-time feature optimization. It is also possible our core idea
of unsupervised patch-extraction and optimization could be applied to other tasks in vi-
sion, such as camera calibration. Finally, we have alluded to the intersection of optimized
features and feature compression and could explore it in more depth.
Figure 4.1: An example of corner feature tracking with optical flow. The video sequence
moves frame-to-frame from left to right. The center of each yellow box in the left-most
(first) frame is a corner detected by [75, 16]. The green arrows show the motion of these
corners into the next frame as determined with optical flow [57, 16]. This is not a prediction.
The flow algorithm uses the next frame in the sequence for this calculation. The yellow
boxes contain the patches extracted and saved for later optimization. The yellow path tails
in the subsequent frames depict patch movement as estimated by the optical flow. In total,
four patch sets are produced, each with four patches. They are shown in Figure 4.2. Our
algorithm produces additional patches for these frames. We omit them to keep the illustration
simple.
Figure 4.2: The extracted patch sets from Figure 4.1. The correspondence is not perfect,
but satisfactory for good optimization and application-specific feature performance.
Figure 4.3: Four non-consecutive images from the surgical data set. The images are non-
consecutive because the camera moves slowly, making the actual motion between frames
small.
Figure 4.4: HOG feature optimization for the cars task. The features optimized for the cars
task (blue line) outperform those optimized for the surgical task (red line). This verifies
that our algorithm captures application-dependent feature invariances. The approximate
theoretical best (light blue line) is only slightly better than our method. The approximate
theoretical worst is reasonable for a 2-class problem.
Figure 4.5: HOG feature optimization for the surgical task. The features optimized for
the surgical task (red line) outperform those optimized for the cars task (blue line). This
verifies that our algorithm captures application-dependent feature invariances. The approx-
imate theoretical best (light blue) is only slightly better than our method. The approximate
theoretical worst is reasonable for a 20-class problem. Because this is a correspondence
task on patches, reasonable performance for small numbers of features or images is not
surprising.
Figure 4.6: Examples from the cars task that were misclassified with the surgical-SIFT but
not the car-SIFT features.
Figure 4.7: SIFT feature optimization for the cars task. The features optimized for the cars
task (blue line) outperform those optimized for the surgical task (red line) and the param-
eters provided by Lowe (green line). This verifies our claim that our algorithm captures
application-dependent feature invariances.
Figure 4.8: SIFT feature optimization for the surgical task. The features optimized for the
surgical task (red line) outperform those optimized for the cars task (blue line). This verifies
our claim that our algorithm captures application-dependent feature invariances. Intuitively,
we expect correspondence to work well even with a single known point. This justifies the
good performance presented even with one training example. The performance of our
surgical-optimized features is nearly identical to Lowe’s (green-line). Lowe’s extensive
optimization was geared toward matching (correspondence) problems.
Figure 4.9: The effect of rotation on our surgical-optimized SIFT features. Our optimiza-
tion removed rotational invariance. This is correct given the invariances that are specific to
the application.
Chapter 5
5.1 Abstract
Recent work on autonomous driving has used laser scanners almost exclusively. The choice
is not surprising: lasers provide 3D range information at a high refresh rate, work in all
light conditions, and even provide basic color information through reflectivity. However,
camera-based sensing has fundamental advantages. Laser data is often ambiguous, even to
a human viewer. Lasers are orders of magnitude more expensive. They also consume more
energy. Nevertheless, lasers are ubiquitous in modern autonomous driving due to the belief
that excellent performance cannot be achieved with cameras given present science. This
chapter aims to dispel that belief. We show that lasers can be used to automatically label
essentially unlimited amounts of camera data. Using this data as a training set, a camera-
only classifier is learned whose performance is arbitrarily close to the laser’s. Specifically,
we show that doubling the amount of training data cuts the error rate by a fixed percentage.
Since labeled training data is automatically produced, the error rate of the camera-based
method can be cut indefinitely, ultimately converging to the laser’s performance. We eval-
uate performance using real data from an autonomous car.
CHAPTER 5. COST EFFECTIVE SENSING 61
5.2 Introduction
Laser scanners enjoy wide use in autonomous cars. For example, in the DARPA Challenge
competitions, lasers were used almost exclusively by the top performing vehicles [39, 40].
The Google autonomous cars rely heavily on laser perception as well [85]. The reasons are
clear. When compared to cameras, laser data is much easier to segment, works in all light
conditions, natively provides scale, and even provides monochrome intensity information
from reflectivity. Laser-based methods have been used to drive over 100,000 autonomous
miles with more than 1,000 miles between failures.
Nevertheless, camera-based methods still have an important role for autonomous driv-
ing. The key argument for vision is that dense, detailed perception information carries
value. Laser point cloud data is often difficult to interpret even for humans. This is not
the case for visual data. Furthermore, cameras are much less expensive. The common
Velodyne scanner used in most robotic cars costs nearly USD$85,000. Lasers usually re-
quire integration of data over time with highly accurate roll and pitch information from a
ring laser gyro which can also increase costs by USD$20,000 or more. In contrast, a high
quality camera is often less than USD$100. This is a dramatic and important difference.
Cameras also consume less power. Unlike lasers, cameras on distinct autonomous cars can
face one another without interference. Finally, on a basic science level, the interpretation
of camera images is an important and interesting problem.
The principal reason lasers dominate robotic car perception is the belief that producing
highly accurate camera-based methods is not possible with present science. This chapter
aims to dispel that belief. We present a vision-only car detection system whose performance
at object classification is arbitrarily close to the state-of-the-art in laser-based driving [84].
The key insight allowing this performance is the size of the training set. We show experi-
mentally that doubling the amount of training data cuts the error rate relative to the laser’s
performance by a fixed percentage. We also show how to automatically acquire extremely
large sets of laser-labeled camera images. Putting these properties together, it follows that
the performance of a camera-only system converges to that of a laser-based one.
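To make the convergence argument concrete: if each doubling of the training set cuts the error relative to the laser by a fixed fraction, the relative error follows a power law in the training set size, which is the straight line seen on log-log axes. A small sketch with hypothetical numbers (the 10% starting error and 20% cut per doubling are illustrative, not measured values):

```python
import math

def relative_error(n, n0, e0, cut):
    """Error relative to the laser after growing the training set from n0
    to n examples, assuming each doubling cuts the relative error by the
    fraction `cut`. On log-log axes this is a straight line."""
    return e0 * (1.0 - cut) ** math.log2(n / n0)
```

Because the labeled set is produced automatically, n is effectively unbounded, so the relative error can be driven arbitrarily low.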
Since the amount of labeled data required is very large, it is essential to avoid hand
labeling. Labels can be automatically acquired using the laser. Specifically, we run the
Figure 5.1: Example of an image labeled by the laser-based recognition system [84]. We
use this method to generate vast amounts of training data. This large corpus of data allows
us to match the performance of the laser with the camera to arbitrary precision. The laser
labeling is suitable for training camera-based classifiers because false positives are low and
false negatives are largely due to segmentation errors rather than some property inherent to
the object.
laser algorithm from [84] and project the laser labels back into a camera image. Figure 5.1
shows an example. Thus, by simply letting an autonomous car drive around, we can acquire
unlimited training examples automatically. Once labels are acquired from the laser, the
laser is no longer needed. Of course, the labels will have the error rate of the laser. But
this is acceptable, since we are merely trying to match the laser’s performance. This is an
example of self-supervised learning.
The basic pipeline of many modern vision algorithms is similar. Low-level informa-
tion is extracted from the image in the form of pixels or features. This information is
encoded using an unsupervised coding scheme. These codes are pooled, often with geo-
metric information, into an image descriptor. Finally, the image descriptors are classified
using a learning algorithm. Overall, we use the learning algorithm of [49] except we con-
sider multi-hypothesis data association using probabilistic and also triangle [22] coding.
We also explore several different learning algorithms in addition to the histogram intersec-
tion kernel SVM of [49] such as Random Forests [17], Nearest Neighbors [23], and linear
SVMs [43]. As suggested in [100], we encode features not pixel patches and at keypoints
instead of a grid.
We rigorously test our algorithm on a real autonomous car over a test set of more than
20,000 manually labeled examples. We show that, for object classification, our camera-
only method can converge arbitrarily close to the performance of the state-of-the-art laser-
based autonomous driving algorithm [84].
Figure 5.2: A 3D point cloud example created by a single rotation of a 3D laser scanner.
The method in [84] tracks objects through these point clouds, segments them, and classifies
them using a learning algorithm. We project these classifications back into a camera frame
and use them as training data for our vision-only algorithm. Nearly unlimited amounts of
training data can be automatically produced.
the laser-only classifier can be projected back into the corresponding frame of a camera.
Examples of this projection are found in Figures 5.1 and 5.3. Figure 5.1 shows convex
hulls for illustration. While those hulls were automatically extracted, we use the simpler
bounding boxes of Figure 5.3 without loss of generality. We notice from Figure 5.3 that the
data is very realistic and represents an interesting and challenging vision problem. We also
notice that the laser system makes occasional errors. See the upper left image in the figure
where several cars are missed. The accuracy in these examples appears to be approximately
95%.
We do not have access to the original, manually labeled laser data. Instead, we have
only the resulting classifier. By simply mounting the laser on a car, driving around, and
running the classifier, we automatically label a vast database of images. This is the self-
supervised portion of our method. The labels are not perfect. Nevertheless, they are very
useful for training. Our goal is to match the laser’s performance, not to achieve 100%
accuracy. We use these labels and the vision algorithm presented in Section 5.5 to train
a vision-only classifier whose performance – in the limit of an enormous training set – is
arbitrarily close to that of the laser.
Codebook construction is similar to [49]. Features are computed over the entire training
set as described in Section 5.5.1. k-Means clustering [14] is used to cluster features into
centroids or bases.
We begin with \(N\) feature vectors \(\vec{f}_1\) to \(\vec{f}_N\), \(K\) clusters centered at \(\vec{\mu}_1, \ldots, \vec{\mu}_K\), and \(NK\)
assignment variables \(a_{1,1}, \ldots, a_{N,K}\). At any given time, for each feature vector \(\vec{f}_i\), one of its
\(K\) assignment variables \(a_{i,1}, \ldots, a_{i,K}\) equals 1 and the rest are zero. The clusters are initially
centered at randomly chosen features. The minimization is then:
\[
\min_{\forall a,\, \forall \vec{\mu}} \sum_{i=1}^{N} \sum_{k=1}^{K} a_{i,k} \, \|\vec{f}_i - \vec{\mu}_k\|^{2}. \tag{5.1}
\]
The optimization alternates two steps. In the E-step, each feature vector is assigned to its
nearest cluster center: \(a_{i,k}\) is set to 1, indicating feature \(i\) is assigned to cluster \(k\), and
\(a_{i,j}\) for all \(j \neq k\) are set to zero. In the M-step, the cluster centers are re-calculated based
on the new assignments:
\[
\vec{\mu}_k = \frac{\sum_{i=1}^{N} a_{i,k} \, \vec{f}_i}{\sum_{i=1}^{N} a_{i,k}}. \tag{5.3}
\]
The optimization ends when the change in the cluster centers between two iterations drops
below a threshold.
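The alternation above can be sketched in a few lines of numpy. This is a generic hard-assignment k-means, not our production implementation; the handling of empty clusters is one of several reasonable choices:

```python
import numpy as np

def kmeans(features, k, tol=1e-6, seed=0):
    """Hard-assignment k-means as in Equations 5.1 and 5.3: the E-step
    assigns each feature vector to its nearest center, the M-step
    recomputes each center as the mean of its assigned features. Iterate
    until the centers move less than `tol`."""
    rng = np.random.default_rng(seed)
    # Initial centers are randomly chosen features.
    centers = features[rng.choice(len(features), size=k, replace=False)]
    while True:
        # E-step: distance from every feature to every center.
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :],
                               axis=2)
        assign = dists.argmin(axis=1)
        # M-step (Equation 5.3): per-cluster mean of the assigned features;
        # an empty cluster simply keeps its previous center.
        new_centers = np.array([
            features[assign == j].mean(axis=0) if np.any(assign == j)
            else centers[j]
            for j in range(k)])
        if np.linalg.norm(new_centers - centers) < tol:
            return new_centers, assign
        centers = new_centers
```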
The cluster centers or bases for K = 5000 appear in Figure 5.4. The figure contains
5000 image patches arranged in a grid. Each patch corresponds to the SURF feature –
over all keypoints in more than 1,200,000 images – that most closely resembles that cluster
center. While SURF keypoints vary in size, we extract patches of fixed size for presentation
purposes. The figure is best viewed under zoom on a high-resolution display. We notice
that the cluster centers correspond to many car parts, road markings, and environment
fragments. Also, illuminated brake lights appear with some frequency.
Codebook Assignment
Notice that each code \(C_1, \ldots, C_K\) is a scalar. We also compute the mean of the code set:
\[
\mu = \frac{1}{K} \sum_{j=1}^{K} C_j. \tag{5.5}
\]
Finally, the training vector is updated. For triangle coding, following [22], the update for
each feature is:
\[
z_j \leftarrow z_j + \max(0, \, \mu - C_j), \qquad j = 1, \ldots, K. \tag{5.6}
\]
Equations (5.6) and (5.7) mean each element of \(\vec{z}\) is updated. Finally, \(\vec{z}\) is normalized so its
elements sum to one. Naturally, a separate \(\vec{z}\) is computed for every image.
Equation (5.6) implies the number of non-zero updates for each feature is approxi-
mately half the number of clusters. Figure 5.5 shows a histogram of the frequency of each
number of updates for K = 5000. The x-axis indicates the number of vector elements
updated. The y-axis shows the number of features whose update modified that many vector
elements. The integral under the curve is over 32,000,000 features. Notice the contrast
to [49] which updates only one vector element per feature – specifically, the maximum
likelihood (minimum score) from an equation similar to Equation (5.4).
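A minimal sketch of this per-feature activation, assuming the scalar codes \(C_j\) are Euclidean distances from the feature to the cluster centers (an assumption, since Equation (5.4) is not reproduced here):

```python
import numpy as np

def triangle_code(feature, centers):
    """Triangle ('soft') coding in the style of [22]: compute the K scalar
    codes as distances to the cluster centers, then keep only the codes
    closer than the mean distance (Equation 5.5), clamping the rest to
    zero. Roughly half the K elements come out non-zero, matching the
    update statistics discussed above."""
    C = np.linalg.norm(centers - feature, axis=1)  # K scalar codes
    mu = C.mean()                                  # mean of the code set
    return np.maximum(0.0, mu - C)
```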
5.6 Experiments
We evaluate the performance of our algorithm experimentally. We implemented our algo-
rithm on the codebase from our autonomous car. As described in Section 5.4, we drive the
car which automatically produces laser point clouds, segments the clouds, tracks the ob-
jects, classifies the tracks, and projects the labels back into a camera image. (Equivalently,
the car can drive autonomously, but this is not a relevant distinction for this paper.) These
noisily labeled, segmented images are used as training data to produce image vectors as de-
scribed in Section 5.5. Notice again, after acquiring the training data, the laser is no longer
needed or used. Thus we produce a vision-only algorithm. Finally, we test our vision-
only algorithm on a manually labeled test set, showing performance for several different
learning algorithms, comparing it to [84, 49], and, importantly, showing that doubling the
amount of training data cuts the error by a fixed percentage.
ground truth for the non-car windows in a programmatic manner by labeling windows as
“non-car” if they did not intersect a car label. Notice this means we did not consider sliding
windows that covered both car and background. We feel these are too subjective to score.
We ran the vision algorithm on the test set and recorded its accuracy in the standard fashion.
For the laser-based algorithm, the laser performed its own segmentation on the frames in
the test set and then classified the segments. Our manually labeled car bounding boxes were
compared to the boxes from the laser’s segmentation and classification. We performed the
comparison automatically in software. We visualized the output to verify our programmatic
comparison matched intuition. The software worked as follows. Each manually labeled car
bounding box was matched into the set of laser bounding boxes. Matching was based on
the location and dimensions of the box. We allowed for 100 pixels of error. Of course,
matching boxes had to be within the same frame. A manual car box that matched a laser
car box was scored as a correct car. Otherwise, an incorrect car was scored. The laser’s
own segmentations were used to score the non-car class. Laser car boxes were matched to
manual car labels (the opposite direction). Laser car boxes that did not match were scored
as incorrect non-cars. Next, we considered the non-car laser boxes. If a non-car laser box
did not intersect any manually labeled car box, it was scored as a correct non-car. Recall,
as with the vision, we do not consider overlapping car and background. Thus, there were
no intersections between non-car laser boxes and manual car labels. If a non-car laser box
entirely covered a manual car, that car would still be marked incorrect as it would not match
a car laser box.
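The car-class half of this scoring procedure can be sketched as follows; the (x, y, w, h) box representation and helper names are hypothetical, and the non-car direction (matching laser boxes back to manual labels) follows the same pattern:

```python
def boxes_match(a, b, tol=100):
    """Boxes are (x, y, w, h) tuples; they match when location and
    dimensions agree to within `tol` pixels, mirroring the 100-pixel
    tolerance above. Frame agreement is assumed checked by the caller."""
    return all(abs(ai - bi) <= tol for ai, bi in zip(a, b))

def score_cars(manual_cars, laser_cars):
    """A manually labeled car box that matches some laser car box scores a
    correct car; otherwise an incorrect car is scored. Returns a
    (correct, incorrect) pair."""
    correct = sum(any(boxes_match(m, l) for l in laser_cars)
                  for m in manual_cars)
    return correct, len(manual_cars) - correct
```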
Importantly, the laser and the camera are doing car detection on the same set of frames
using identical car classes. However, the non-car classes may contain different non-car
segmentations. This is because of the difference between the segmentation methods. The
laser’s native segmentation discards many non-objects through 3D segmentation. The vi-
sion, however, must classify all the sliding windows to reject non-objects. Nevertheless,
the end-result is the same: finding cars and rejecting non-cars on the same test set.
Table 5.1: Results per class for a training size of 20,000 examples.
5.6.4 Results
The principal experimental results can be found in Figure 5.7. The x-axis shows the amount
of training data which was always equally split per class. The y-axis shows the average
error per class relative to the laser’s performance. Notice both axes are log scale. Here,
“Cubic SVM, Sparse” refers to [49]. Clearly, doubling the amount of camera training data
cuts the error relative to the laser’s performance by a fixed percentage. Because unlimited
amounts of training data can be automatically acquired, the performance of the camera can
be arbitrarily close to the laser. This trend holds everywhere except for the linear SVM with
K = 100. This is because the feature space was too small to continue learning. Clearly,
expanding the feature space to K = 1000, as shown in the plot, solves the problem and
continues the trend. Figure 5.6 shows selected vision-only classification results. Green
boxes depict correctly classified cars, and red boxes show false positives. Car detection
is excellent although the number of false positives is slightly higher using the camera.
Although the non-car class accuracies are similar, the camera must try more segmentations.
This increases the total number of misclassifications. In future work, we could resolve
this by adding an image-based segmenter to our method’s pipeline. Specific performance
numbers for N = 20,000 appear in Table 5.1.
Notice the difficulty of the classification task in Figure 5.6. For example, in the upper-left
image, a small white sedan under the tree (next to the green box) is missed. This counts
as an error. Most errors are like that one: the object is far from the sensor, appears small,
and has few features. Importantly, autonomous cars do not need to react to distant objects,
so they achieve much lower driving failure rates than perception accuracy might imply.
To quantify this property, a histogram of accuracy as a function of segment size appears in
Figure 5.8 for the Cubic SVM method with triangle coding.
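The aggregation behind a histogram like Figure 5.8 can be sketched with synthetic data. Everything below is hypothetical (the segment sizes, the accuracy trend, and the bin edges); it only illustrates how per-segment outcomes are binned by size and averaged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic segments: size (e.g. in points or pixels) and whether each was
# classified correctly. The simulated trend matches the observation above:
# larger segments are classified more reliably.
sizes = rng.integers(10, 1000, size=5000)
p_correct = np.clip(0.5 + sizes / 1200.0, 0.0, 1.0)
correct = rng.random(5000) < p_correct

# Aggregate accuracy per size bin, as in the histogram.
bins = np.array([10, 50, 100, 250, 500, 1000])
idx = np.digitize(sizes, bins)  # bin index for each segment
acc = [correct[idx == i].mean() for i in range(1, len(bins))]
```

With data like this, the per-bin accuracies rise with segment size, mirroring the poor performance on small (distant or occluded) segments.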
Figure 5.9 shows classification accuracy as a function of random labeling errors. To
produce this plot, we manually labeled a training set and then programmatically introduced
random labeling errors. The results indicate that a substantial amount of labeling error is
tolerable before classification performance degrades significantly. This suggests there may
be a hard limit on classification performance that is a function of the features or the
information content of the examples: some examples may be too low resolution, too small,
too far away, or simply too ambiguous to classify correctly. However, the errors in the
laser's labeling are not random; they correspond to examples near its decision boundary.
Conclusions from this plot therefore do not necessarily transfer back to the laser labeling
process.
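The label-corruption procedure behind Figure 5.9 can be sketched as follows; the function name and the noise rate below are illustrative, not the experimental code we actually ran.

```python
import numpy as np

def flip_labels(y, noise_rate, rng):
    """Return a copy of binary labels y with the given fraction flipped at random."""
    y = y.copy()
    n_flip = int(round(noise_rate * len(y)))
    flip_idx = rng.choice(len(y), size=n_flip, replace=False)
    y[flip_idx] = 1 - y[flip_idx]
    return y

rng = np.random.default_rng(0)
y = np.zeros(1000, dtype=int)
y[:500] = 1  # balanced car / non-car labels
noisy = flip_labels(y, 0.2, rng)
print((noisy != y).mean())  # → 0.2
```

Sweeping `noise_rate` and retraining the classifier at each setting produces a curve like Figure 5.9. Note this injects *uniform* random flips, which is exactly why its conclusions need not transfer to the laser's boundary-correlated errors.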
Non-linear SVMs are O(N^3) in training complexity. Nevertheless, training time was
reasonable on our machine. The 20,000-example SVM with K = 5000 (the largest case)
took 22,528.5 seconds (about 6.3 hours) to train. Recall that, due to the pyramid, image
vectors are 5K in length. For Cubic SVMs, we did not attempt to parallelize SVM training
or optimize code beyond standard OpenCV, and no attempt was made to dedicate all
system resources to training by, for example, stopping background Linux services. This
training time therefore represents a rough upper bound; we report it to verify that Cubic
SVM training is tractable in a typical environment.
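The O(N^3) cost stems from solving a quadratic program coupled through the N×N kernel (Gram) matrix. As a minimal sketch of the kernel itself, assuming a homogeneous degree-3 polynomial form (the exact kernel constants of our implementation are not restated here):

```python
import numpy as np

def cubic_kernel(X, Z):
    """Homogeneous cubic polynomial kernel: K(x, z) = (x . z)^3."""
    return (X @ Z.T) ** 3

# For N = 20,000 training examples, the Gram matrix alone has 4e8 entries,
# which is why kernel SVM training scales so poorly with N.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = cubic_kernel(X, X)
print(K[2, 2])  # (x3 . x3)^3 = 2^3 → 8.0
```

A linear SVM avoids materializing this matrix entirely, which is why the linear variants in Figure 5.7 trade accuracy for far cheaper training.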
5.7 Discussion
Recent autonomous or robotic cars are often characterized by expensive laser scanners,
due largely to the belief that building effective camera-only systems is beyond the state of
the art. We present a vision-only algorithm for car classification. Our method is trained
in a self-supervised fashion from a laser system, without manual labeling. By utilizing
vast amounts of training data, we show that a camera-only method can get arbitrarily
close to the performance of the laser-based labeling source [84] on a substantial and
realistic test set of more than 20,000 examples.
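The self-supervised pipeline described here can be sketched as follows. All names below (`laser_classify`, `project_to_image`, the frame layout) are hypothetical stand-ins for illustration, not the actual system's API.

```python
# Sketch: the laser classifier labels each segment, and the corresponding
# image crops become training data for the vision-only classifier.
# No manual labeling occurs anywhere in the loop.

def self_supervised_training_set(frames, laser_classify, project_to_image):
    """Build a labeled image dataset from laser-labeled drive data."""
    dataset = []
    for frame in frames:
        for segment in frame["laser_segments"]:
            label = laser_classify(segment)  # "car" / "non-car", from the laser
            patch = project_to_image(frame["image"], segment)  # matching image crop
            dataset.append((patch, label))
    return dataset

# Toy usage with stub sensors:
frames = [{"image": "img0", "laser_segments": ["s0", "s1"]}]
data = self_supervised_training_set(
    frames,
    laser_classify=lambda s: "car" if s == "s0" else "non-car",
    project_to_image=lambda img, s: (img, s),
)
print(len(data))  # → 2
```

Because labels arrive for free as the car drives, the dataset grows without bound, which is what lets camera performance approach the laser's in Figure 5.7.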
Figure 5.3: Several examples of camera frames after labeling by the laser algorithm. These
labels are the training data for our vision-only classifier. This training data is very
inexpensive to acquire: millions of examples can be obtained by simply driving the car around.
Figure 5.4: The 5,000 cluster centers, or bases, used for image classification, each
represented by its closest corresponding image patch.
Figure 5.5: Our algorithm updates a variable number of image vector elements per feature.
The frequency of the number of updates is shown here.
Figure 5.6: Classification results from the vision algorithm. Car detection is excellent.
The number of false positives is slightly higher with the camera than the laser despite the
similar non-car class accuracy (Table 5.1), because the camera must consider more
segmentations in the absence of 3D information.
Figure 5.7: Doubling the amount of camera training data cuts the error, relative to the
laser's performance, by a fixed percentage. By adding labeled camera training data, which
is acquired automatically, we can achieve object classification performance arbitrarily
close to the laser's.
Figure 5.8: Histogram of accuracy as a function of segment size. Poor performance occurs
on small segments, which imply occlusion or significant distance from the camera.
[2] U.S. National Highway Traffic Safety Administration, Fatality Analysis Reporting
System (FARS). http://www-fars.nhtsa.dot.gov/Main/index.aspx.
[4] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing over-
complete dictionaries for sparse representation. IEEE Trans. on Signal Processing,
54(11):4311–4322, 2006.
[6] T. Barfoot. Online visual motion estimation using FastSLAM with SIFT features. In
Proc. of the 2005 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS),
2005.
[7] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features.
Computer Vision and Image Understanding (CVIU), 110(3):346–359, 2008.
BIBLIOGRAPHY 81
[10] M. Bertozzi and A. Broggi. GOLD: A parallel real-time stereo vision system for
generic obstacle and lane detection. IEEE Trans. on Image Processing, 7(1):62–81,
1998.
[13] M. Bicego, A. Lagorio, E. Grosso, and M. Tistarelli. On the use of SIFT features
for face authentication. In Proc. of the IEEE Conf. on Computer Vision and Pattern
Recognition (CVPR) Workshop, 2007.
[15] Y-L. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for
recognition. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition
(CVPR), 2010.
[18] C.A. Brooks and K. Iagnemma. Vibration-based terrain classification for planetary
exploration rovers. IEEE Transactions on Robotics, 21(6):1185–1191, 2005.
[19] M. Brown and D. G. Lowe. Recognising panoramas. In Proc. of the Ninth IEEE Int.
Conf. on Computer Vision (ICCV), 2003.
[23] T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Trans. on
Information Theory, 13(1):21–27, 1967.
[26] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In
Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2005.
[28] F. Dellaert, D. Fox, W. Burgard, and S. Thrun. Monte Carlo localization for mobile
robots. In Proc. of the International Conference on Robotics and Automation
(ICRA), 1999.
[31] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene
categories. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition
(CVPR), 2005.
[33] D. Fox, W. Burgard, and S. Thrun. Controlling synchro-drive robots with the dy-
namic window approach to collision avoidance. In Proceedings of the IEEE/RSJ
International Conference on Intelligent Robots and Systems, 1996.
[36] C. Harris and M. Stephens. A combined corner and edge detector. In Proc. of the
4th Alvey Vision Conf., pages 147–151, 1988.
[37] M. Hebert, C. Thorpe, and A. Stentz. Intelligent Unmanned Ground Vehicles: Au-
tonomous Navigation Research at Carnegie Mellon University. Dordrecht: Kluwer
Academic, 1997.
[38] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief
nets. Neural Computation, 18(7):1527–1554, 2006.
[39] K. Iagnemma and M. Buehler, editors. Journal of Field Robotics: Special Issue on
the DARPA Grand Challenge, Parts 1 and 2. Wiley, 2006.
[40] K. Iagnemma, M. Buehler, and S. Singh, editors. Journal of Field Robotics: Special
Issue on the DARPA Urban Challenge, Parts I, II, and III. Wiley, 2008.
[41] K. Iagnemma, S. Kang, H. Shibly, and S. Dubowsky. Online terrain parameter esti-
mation for wheeled mobile robots with application to planetary rovers. IEEE Trans-
actions on Robotics and Automation, 2004.
[42] P. Jensfelt, D. Austin, O. Wijk, and M. Anderson. Feature based condensation for
mobile robot localization. In Proc. of the International Conference on Robotics and
Automation (ICRA), pages 2531–2537, 2000.
[43] T. Joachims. Training linear SVMs in linear time. In Proc. of the ACM Conf. on
Knowledge Discovery and Data Mining (KDD), 2006.
[44] S. Julier and J. Uhlmann. A new extension of the Kalman filter to nonlinear systems.
In International Symposium on Aerospace/Defense Sensing, Simulation and Controls,
Orlando, FL, 1997.
[46] A. Kelly and A. Stentz. Rough terrain autonomous mobility, part 1: A theoretical
analysis of requirements. Autonomous Robots, 5:129–161, 1998.
[47] A. Kelly and A. Stentz. Rough terrain autonomous mobility, part 2: An active vision,
predictive control approach. Autonomous Robots, 5:163–198, 1998.
[49] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid
matching for recognizing natural scene categories. In Proc. of the IEEE Conf. on
Computer Vision and Pattern Recognition (CVPR), 2006.
[50] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks
for scalable unsupervised learning of hierarchical representations. In Proc. of the
Int. Conf. on Machine Learning (ICML), 2009.
[51] H. Lee, Y. Largman, P. Pham, and A. Y. Ng. Unsupervised feature learning for
audio classification using convolutional deep belief networks. In Proc. of Neural
Information Processing Systems (NIPS), 2009.
[53] A. Lookingbill, J. Rogers, D. Lieb, J. Curry, and S. Thrun. Reverse optical flow
for self-supervised adaptive autonomous robot navigation. International Journal of
Computer Vision, 74(3), 2007.
[55] D. Lowe. Distinctive image features from scale-invariant keypoints. Int. Journal of
Computer Vision (IJCV), 60(2):91–110, 2004.
[56] D. G. Lowe. Object recognition from local scale-invariant features. In Proc. of the
Seventh Int. Conf. on Computer Vision (ICCV), pages 1150–1157, 1999.
[57] B. D. Lucas and T. Kanade. An iterative image registration technique with an ap-
plication to stereo vision. In Proc. of the 1981 DARPA Imaging Understanding
Workshop, pages 121–130, 1981.
[58] J. Luo, Y. Ma, E. Takikawa, S. Lao, M. Kawade, and B.-L. Lu. Person-specific SIFT
features for face recognition. In Proc. of the IEEE Int. Conf. on Acoustics, Speech,
and Signal Processing (ICASSP), 2007.
[59] J. Mairal, M. Elad, and G. Sapiro. Sparse learned representations for image restora-
tion. In Proc. of the 4th World Conf. of the Int. Assoc. for Statistical Computing
(IASC), 2008.
[60] J. Mairal, M. Elad, and G. Sapiro. Sparse representation for color image restoration.
IEEE Trans. on Image Processing, 17(1):53–69, 2008.
[61] J. Michels, A. Saxena, and A. Ng. High-speed obstacle avoidance using monocular
vision and reinforcement learning. In Proceedings of the Seventeenth International
Conference on Machine Learning (ICML), 2005.
[62] H. Mobahi, R. Collobert, and J. Weston. Deep learning from temporal coherence in
video. In Proc. of the Int. Conf. on Machine Learning (ICML), 2009.
[65] NOVA. The Great Robot Race (Documentary of the DARPA Grand Challenge),
2006. http://www.pbs.org/wgbh/nova/darpa/.
[66] D. Pomerleau and T. Jochem. Rapidly adapting machine vision for automated vehi-
cle steering. IEEE Expert, 11(2):19–27, 1996.
[68] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: Transfer
learning from unlabeled data. In Proc. of the 24th Int. Conf. on Machine Learning
(ICML), 2007.
[69] M. Ranzato, F.-J. Huang, Y.-L. Boureau, and Y. LeCun. Unsupervised learning of
invariant feature hierarchies with applications to object recognition. In Proc. of the
IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2007.
[72] D. Sadhukhan, C. Moore, and E. Collins. Terrain estimation using internal sen-
sors. In Proceedings of the 10th IASTED International Conference on Robotics and
Applications (RA), Honolulu, Hawaii, USA, 2004.
[73] A. Saxena, S. Chung, and A. Ng. Learning depth from single monocular images.
In Proceedings of Conference on Neural Information Processing Systems (NIPS),
Cambridge, MA, 2005. MIT Press.
[74] S. Se, D. Lowe, and J. Little. Vision-based mobile robot localization and mapping
using scale-invariant features. In Proc. of the Int. Conf. on Robotics and Automation
(ICRA), 2001.
[75] J. Shi and C. Tomasi. Good features to track. In Proc. of the 9th IEEE Conf. on
Computer Vision and Pattern Recognition (CVPR), 1994.
[76] S. Shimoda, Y. Kuroda, and K. Iagnemma. Potential field navigation of high speed
unmanned ground vehicles in uneven terrain. In Proceedings of the IEEE Interna-
tional Conference on Robotics and Automation (ICRA), Barcelona, Spain, 2005.
[77] C. Shoemaker, J. Bornstein, S. Myers, and B. Brendle. Demo III: Department of de-
fense testbed for unmanned ground mobility. In Proceedings of SPIE, Vol. 3693,
AeroSense Session on Unmanned Ground Vehicle Technology, Orlando, Florida,
1999.
[80] D. Stavens, G. Hoffmann, and S. Thrun. Online speed adaptation using supervised
learning for high-speed, off-road autonomous driving. In Proc. of the Int. Joint Conf.
on Artificial Intelligence (IJCAI), 2007.
[81] D. Stavens and S. Thrun. A self-supervised terrain roughness estimator for off-
road autonomous driving. In Proc. of the Int. Conf. on Uncertainty in Artificial
Intelligence (UAI), 2006.
[82] D. Stavens and S. Thrun. Unsupervised learning of invariant features using video.
In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR),
2010.
[83] K. L. R. Talvala, K. Kritayakirana, and J. Christian Gerdes. Pushing the limits: From
lanekeeping to autonomous racing. Annual Reviews in Control, 2011.
[84] A. Teichman, J. Levinson, and S. Thrun. Towards 3d object recognition via classifi-
cation of arbitrary object tracks. In Proc. of the Int. Conf. on Robotics and Automa-
tion (ICRA), 2011.
[86] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press, 2005.
[87] S. Thrun, M. Montemerlo, and A. Aron. Probabilistic terrain analysis for high-speed
desert driving. In G. Sukhatme, S. Schaal, W. Burgard, and D. Fox, editors,
Proceedings of the Robotics: Science and Systems Conference, Philadelphia, PA,
2006.
[94] N. Vlassis, B. Terwijn, and B. Kröse. Auxiliary particle filter robot localization from
high-dimensional sensor observations. In Proc. of the International Conference on
Robotics and Automation (ICRA), 2002.
[95] J. Weston, F. Ratle, and R. Collobert. Deep learning via semi-supervised embedding.
In Proc. of the Int. Conf. on Machine Learning (ICML), 2008.
[96] S. Winder and M. Brown. Learning local image descriptors. In Proc. of the IEEE
Conf. on Computer Vision and Pattern Recognition (CVPR), 2007.
[97] S. Winder, G. Hua, and M. Brown. Picking the best daisy. In Proc. of the IEEE
Conf. on Computer Vision and Pattern Recognition (CVPR), 2009.
[98] L. Wiskott and T. Sejnowski. Slow feature analysis: Unsupervised learning of in-
variances. Neural Computation, 14(4):715–770, 2002.
[100] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using
sparse coding for image classification. In Proc. of the IEEE Conf. on Computer
Vision and Pattern Recognition (CVPR), 2009.
[101] W. Zhang, G. Zelinsky, and D. Samaras. Real-time accurate object detection using
multiple resolutions. In Proc. of the Int. Conf. on Computer Vision (ICCV), 2007.
[102] Q. Zhu, S. Avidan, M.-C. Yeh, and K.-T. Cheng. Fast human detection using a
cascade of histograms of oriented gradients. In Proc. of the IEEE Conf. on Computer
Vision and Pattern Recognition (CVPR), 2006.