A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Fei-Fei Li
I certify that I have read this dissertation and that, in my opinion, it is fully adequate
in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Andrew Ng
This signature page was generated electronically upon submission of this dissertation in
electronic format. An original signed hard copy of the signature page is on file in
University Archives.
Abstract
Every year, 1.2 million people die in automobile accidents and up to 50 million are injured [1]. Many of these deaths are due to driver error and other preventable causes. Autonomous or highly aware cars have the potential to positively impact tens of millions of people. Building an autonomous car is not easy. Although the absolute number of traffic fatalities is tragically large, the failure rate of human driving is actually very small. A human driver makes a fatal mistake once in about 88 million miles [2].
As a co-founding member of the Stanford Racing Team, we have built several relevant prototypes of autonomous cars. These include Stanley, the winner of the 2005 DARPA Grand Challenge, and Junior, the car that took second place in the 2007 Urban Challenge. These prototypes demonstrate that autonomous vehicles can be successful in challenging environments. Nevertheless, reliable, cost-effective perception under uncertainty is a major challenge to the deployment of robotic cars in practice.
This dissertation presents selected perception technologies for autonomous driving in the context of Stanford's autonomous cars. We consider speed selection in response to terrain conditions, smooth road finding, improved visual feature optimization, and cost-effective car detection. Our work does not rely on manual engineering or even supervised machine learning. Rather, the car learns on its own, training itself without human teaching or labeling. We show that this "self-supervised" learning often meets or exceeds the performance of traditional methods. Furthermore, we believe self-supervised learning is the only approach with the potential to provide the very low failure rates necessary to improve on human driving performance.
Acknowledgments
Sebastian Thrun, Cliff Nass, Vaughan Pratt, Mike Montemerlo, Sven Strohband, Hendrik Dahlkamp, Gabe Hoffmann, Jesse Levinson, Alex Teichman, Gary Bradski, and Adrian Kaehler were principal collaborators on the Stanford Racing Team, shaped aspects of this work, and contributed other systems to the vehicle (e.g., path planning) that were invaluable to the completion of this dissertation.
We received generous financial support from DARPA, Volkswagen, Red Bull, Mohr Davidow Ventures, Google, Intel, Qualcomm, and Boeing. I held the David Cheriton Stanford Graduate Fellowship from 2005 to 2008 and the Qualcomm Innovation Fellowship in 2010.
The surgical data set in Chapter 4 is courtesy of Professor Greg Hager of Johns Hopkins
University.
Special thanks to my faculty committee: Sebastian Thrun (Adviser/Examiner/Reader),
Daphne Koller (Examiner), Andrew Ng (Examiner/Reader), Chris Gerdes (Examination
Chair), and Fei-Fei Li (Examiner/Reader).
Contents

Abstract iv
Acknowledgments v
1 Introduction 1
2 Speed Selection 6
  2.1 Abstract 6
  2.2 Introduction 7
  2.3 Related Work 8
  2.4 Speed Selection Algorithm 9
    2.4.1 Z Acceleration Mapping and Filtering 9
    2.4.2 Generating Speed Recommendations 12
    2.4.3 Supervised Machine Learning 13
  2.5 Experimental Results 15
    2.5.1 Qualitative Analysis 15
    2.5.2 Quantitative Analysis 17
    2.5.3 Results from the 2005 Grand Challenge 20
  2.6 Discussion 21
  3.4 Acquiring A 3-D Point Cloud 26
  3.5 Surface Classification Function 28
  3.6 Data Labeling 31
    3.6.1 Raw Data Filtering 31
    3.6.2 Calculating the Shock Index 32
  3.7 Learning the Model Parameters 33
  3.8 Experimental Results 34
    3.8.1 Sensitivity Versus Specificity 35
    3.8.2 Effect on the Vehicle 36
  3.9 Discussion 38
4 Feature Optimization 39
  4.1 Abstract 39
  4.2 Introduction 40
  4.3 Related Work 42
  4.4 Generating Sets of Patches 43
  4.5 Optimizing Higher-Level Features 44
  4.6 Experimental Analysis 45
    4.6.1 Data Sets 46
    4.6.2 Experimental Protocol 46
    4.6.3 Experimental Results 48
    4.6.4 SIFT Rotational Invariance 49
    4.6.5 HOG Parameters 50
  4.7 Discussion 51
  5.5.2 Feature Coding 67
  5.5.3 Geometric Constraints 69
  5.5.4 Learning Algorithm 70
  5.6 Experiments 70
    5.6.1 General Setup 70
    5.6.2 Data Collection and Preparation 71
    5.6.3 Ground Truth and Metrics 71
    5.6.4 Results 73
  5.7 Discussion 74
Bibliography 80
List of Tables

4.1 Learned HOG parameters for the cars and surgical data sets. 50
List of Figures

2.1 We find empirically that the relationship between perceived vertical acceleration amplitude and vehicle speed over a static obstacle can be approximated by a linear function in the operational region of interest. 10
2.2 Filtered vs. unfiltered IMU data. The 40-tap FIR band-pass filter extracts the excitation of the suspension system, in the 0.3 to 12 Hz band. This removes the offset of gravity, as terrain slope varies, and higher-frequency noise, such as engine system vibration, that is unrelated to oscillations in the vehicle's suspension. 11
2.3 The results of coordinate descent machine learning are shown. The learning generated the race parameters of 0.25 G and 1 mph/s, a good match to human driving. Note that the human often drives under the speed limit. This suggests the importance of our algorithm for desert driving. 14
2.4 Each point indicates a specific shock event under both the velocity controller (vertical axis) and speed limits alone (horizontal axis). There are very few high-shock events on this easy terrain. The velocity controller does not reduce shock or speed much. 17
2.5 Each point indicates a specific shock event under both the velocity controller (vertical axis) and speed limits alone (horizontal axis). There are many high-shock events on this difficult terrain. The velocity controller reduces shock and speed a great deal. 18
2.6 Here we show where the algorithm reduced speed during the 2005 Grand Challenge event. The velocity controller modified speed during 17.6% of the race. The height of the ellipses shown at the top is proportional to the maximum speed reduction for each 1/10th-mile segment. Numbers represent mile markers. Simulated terrain at the bottom approximates route elevation. 19
2.7 Here we plot the trade-off of completion time and shock on four desert routes. α is constant (0.25) and β is varied along each curve. With the single set of parameters learned in Section 2.4.3, the algorithm reduces shock by 50% to 80% while increasing completion time by 2.5% to 5%. 20
3.1 The left graphic shows a single laser scan. The right shows integration of these scans over time. 27
3.2 Pose estimation errors can generate false surfaces that are problematic for navigation. Shown here is a central laser ray traced over time, sensing a relatively flat surface. 27
3.3 The measured z-difference from two scans of flat terrain indicates that error increases linearly with the time between scans. This empirical analysis suggests the second term in Equation (3.4). 28
3.4 We find empirically that the relationship between perceived vertical acceleration and vehicle speed over a static obstacle can be approximated tightly by a linear function in the operational region of interest. 32
3.5 The true-positive/false-positive trade-off of our approach and two interpretations of prior work. The self-supervised method is significantly better for nearly any fixed acceptable false-positive rate. 35
3.6 The trade-off of completion time and shock experienced for our new self-supervised approach and the reactive method we used in the Grand Challenge. 37
4.1 An example of corner feature tracking with optical flow. The video sequence moves frame to frame from left to right. The center of each yellow box in the left-most (first) frame is a corner detected by [75, 16]. The green arrows show the motion of these corners into the next frame as determined with optical flow [57, 16]. This is not a prediction. The flow algorithm uses the next frame in the sequence for this calculation. The yellow boxes contain the patches extracted and saved for later optimization. The yellow path tails in the subsequent frames depict patch movement as estimated by the optical flow. In total, four patch sets are produced, each with four patches. They are shown in Figure 4.2. Our algorithm produces additional patches for these frames. We do not show them for a simpler illustration. 52
4.2 The extracted patch sets from Figure 4.1. The correspondence is not perfect, but satisfactory for good optimization and application-specific feature performance. 53
4.3 Four non-consecutive images from the surgical data set. The images are non-consecutive because the camera moves slowly, making the actual motion between frames small. 53
4.4 HOG feature optimization for the cars task. The features optimized for the cars task (blue line) outperform those optimized for the surgical task (red line). This verifies that our algorithm captures application-dependent feature invariances. The approximate theoretical best (light blue line) is only slightly better than our method. The approximate theoretical worst is reasonable for a 2-class problem. 54
4.5 HOG feature optimization for the surgical task. The features optimized for the surgical task (red line) outperform those optimized for the cars task (blue line). This verifies that our algorithm captures application-dependent feature invariances. The approximate theoretical best (light blue) is only slightly better than our method. The approximate theoretical worst is reasonable for a 20-class problem. Because this is a correspondence task on patches, reasonable performance for small numbers of features or images is not surprising. 55
4.6 Examples from the cars task that were misclassified with the surgical-SIFT but not the car-SIFT features. 56
4.7 SIFT feature optimization for the cars task. The features optimized for the cars task (blue line) outperform those optimized for the surgical task (red line) and the parameters provided by Lowe (green line). This verifies our claim that our algorithm captures application-dependent feature invariances. 57
4.8 SIFT feature optimization for the surgical task. The features optimized for the surgical task (red line) outperform those optimized for the cars task (blue line). This verifies our claim that our algorithm captures application-dependent feature invariances. Intuitively, we expect correspondence to work well even with a single known point. This justifies the good performance presented even with one training example. The performance of our surgical-optimized features is nearly identical to Lowe's (green line). Lowe's extensive optimization was geared toward matching (correspondence) problems. 58
4.9 The effect of rotation on our surgical-optimized SIFT features. Our optimization removed rotational invariance. This is correct given the invariances that are specific to the application. 59
5.2 A 3D point cloud example created by a single rotation of a 3D laser scanner. The method in [84] tracks objects through these point clouds, segments them, and classifies them using a learning algorithm. We project these classifications back into a camera frame and use them as training data for our vision-only algorithm. Nearly unlimited amounts of training data can be automatically produced. 65
5.3 Several examples of camera frames after labeling by the laser algorithm. These labels are the training data for our vision-only classifier. This training data is very inexpensive to acquire. Millions of examples can be obtained by simply driving the car around. 75
5.4 The 5,000 cluster centers, or bases, used for image classification, represented by their corresponding closest image patch. 76
5.5 Our algorithm updates a variable number of image vector elements per feature. The frequency of the number of updates is shown here. 76
5.6 Classification results from the vision algorithm. Car detection is excellent. The number of false positives is slightly higher with the camera than the laser despite the similar non-car class accuracy (Table 5.1). This is because the camera must consider more segmentations due to the absence of 3D. 77
5.7 Doubling the amount of camera training data cuts the error, relative to the laser's performance, by a fixed percentage. By adding additional labeled camera training data – which is automatic – we can achieve performance that is arbitrarily close to the laser's performance for object classification. 78
5.8 Histogram of accuracy as a function of segment size. Poor performance occurs on small segments, which imply occlusion or significant distance from the camera. 79
5.9 Accuracy as a function of random labeling errors. Significant amounts of random labeling error do not substantially degrade classification accuracy. 79
Chapter 1
Introduction
Traffic accidents are a major source of disability and mortality worldwide. Every year, 1.2 million people die and up to 50 million people are injured [1]. Autonomous or highly aware cars have the potential to reduce these numbers dramatically. However, building a self-driving car that exceeds human driving performance is not easy. While traffic injuries and deaths are large in number, they are relatively infrequent: in the United States, a human driver makes a fatal mistake only once in approximately 88 million miles [2]. As a co-founding member of Stanford's Racing Team, we have built several relevant prototypes of autonomous cars. These include Stanley [88], the winner of the 2005 DARPA Grand Challenge, and Junior [63], the car that took second place in the 2007 Urban Challenge. Stanley and Junior are shown in Figures 1.1 and 1.2. These prototypes demonstrate that autonomous vehicles can be successful in challenging environments.
Work on self-driving cars spans several decades [29, 66, 37, 11]. The DARPA Grand and Urban Challenge competitions [39, 40] offered a modern, uniform testing opportunity to examine the state of the art in autonomous cars. The Grand Challenge competitions, held in 2004 and 2005, focused on endurance driving across the desert, stationary obstacles, and road following. Performance in 2004 was disappointing due, in part, to widely varying expectations about the nature of the competition. In 2005, the competition was repeated on a different (but similarly challenging) route, and several teams successfully finished. In 2007, DARPA held a new challenge focused on urban environments. The problem statement expanded significantly to include much of what a typical human driver must do: for example, car detection, lane keeping, parking, and intersections were included. Important exclusions were high-speed (highway) driving, traffic lights, bicycles, pedestrians, and the ability to drive without a highly accurate GPS course skeleton. Several teams were successful, although reliability was not perfect despite the abridged definition of the problem. Nevertheless, it is remarkable how quickly the technology progressed. Today, the Google driving project is expanding the code base from these efforts [85].
Any mobile robot must be able to localize itself, perceive its environment, make decisions in response to those perceptions, and control actuators to move about [20, 78]. In many ways, autonomous cars are no different. Thus, many ideas from mobile robotics generally are directly applicable to robotic driving. Examples include GPS/IMU fusion with Kalman filters [86], map-based localization [28, 94, 42, 52, 67], and path planning based on trajectory scoring [46, 47, 66, 29, 37, 11, 20]. Actuator control for high-speed driving is different than for typical mobile robots and is very challenging. However, excellent solutions exist [83, 3].
However, general perception is unsolved for mobile robots and is the focus of major efforts within the research community. This dissertation focuses on perception for robotic cars. Perception is much more tractable within the context of autonomous driving, for a number of reasons: the number of object classes is smaller, the classes are more distinct, rules offer a strong prior on what objects may be where at any point in time, and expensive, high-quality laser sensing is appropriate. Nevertheless, perception is still very challenging due to the extremely low acceptable error rate. We focus on four driving-related perception topics in this dissertation: speed selection in response to terrain conditions [80], smooth road detection [81], visual feature optimization [82], and car detection. Each of these four topics is presented as a distinct chapter and is self-contained with its own literature review.

Figure 1.2: Junior took second place in the 2007 DARPA Urban Challenge.
A unifying theme throughout our work is the use of self-supervised machine learning. In self-supervised learning, the robot teaches itself automatically, without using human-generated labels. Self-supervised learning is different from other non-supervised learning methods. Unlike semi-supervised and self-taught learning, labeled data in the traditional sense is not needed. However, self-supervised learning is also not totally unsupervised, as in clustering or deep learning [51, 50]. Self-supervised learning uses multi-view coherence. Specifically, two different algorithms or sensors view the same data in a different way. Each algorithm or sensor brings a different strength to its view of the data. By using knowledge
Chapter 2
Speed Selection
2.1 Abstract
The mobile robotics community has traditionally addressed motion planning and navigation in terms of steering decisions. However, selecting the best speed is also important – beyond its relationship to stopping distance and lateral maneuverability. Consider a high-speed (35 mph) autonomous vehicle driving off-road through challenging desert terrain. The vehicle should drive slowly on terrain that poses substantial risk. However, it should not dawdle on safe terrain. In this chapter we address one aspect of risk – shock to the vehicle. We present an algorithm for trading off shock and speed in real time and without human intervention. The trade-off is optimized using supervised learning to match human driving. The learning process is essential due to the discontinuous and spatially correlated nature of the control problem – classical techniques do not directly apply. We evaluate performance over hundreds of miles of autonomous driving, including performance during the 2005 DARPA Grand Challenge. This approach was the deciding factor in our vehicle's speed for nearly 20% of the DARPA competition – more than any other constraint except the DARPA-imposed speed limits – and resulted in the fastest finishing time.
2.2 Introduction
In mobile robotics, motion planning and navigation have traditionally focused on steering decisions. This chapter presents speed decisions as another crucial part of planning – beyond the relationship of speed to obstacle avoidance concerns, such as stopping distance and lateral maneuverability. Consider a high-speed (35 mph) autonomous vehicle driving off-road through challenging desert terrain. We want the vehicle to drive slower on more dangerous terrain. However, we also want to minimize completion time. Thus, the robot must trade off speed and risk in real time. This is a natural process for human drivers, but it is not at all trivial to endow a robot with this ability.
We address this trade-off for one component of risk: the shock the vehicle experiences. Minimizing shock is important for several reasons. First, shock increases the risk of damage to the vehicle, its mechanical actuators, and its electronic components. Second, a key perceptive technology, laser range scanning, relies on accurate estimation of orientation; shock causes the vehicle to shake violently, making accurate estimates difficult. Third, shocks substantially reduce traction during oscillations. Finally, we demonstrate that shock is strongly correlated with speed and, independently, with subjectively difficult terrain. That is, minimizing shock implies slowing on challenging roads when necessary – a crucial behavior to mitigate risk to the vehicle.
Our algorithm uses the linear relationship between shock and speed, which we derive analytically. The algorithm has three states. First, the vehicle drives at the maximum allowed speed until a shock threshold is exceeded. Second, the vehicle slows immediately to bring itself within the shock threshold, using the relationship between speed and shock. Finally, the vehicle gradually accelerates. It returns to the first state, or to the second if the threshold is exceeded during acceleration.
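The three-state loop can be condensed into a single update function. This is our illustrative sketch, not Stanley's implementation: the default threshold (0.25 G) and acceleration rate (1 mph/s) echo the learned values reported in Section 2.4.3, the 5 mph floor is the clamp applied to the velocity plan, and the time step is arbitrary.

```python
def step_speed(v_cmd, v_max, shock, alpha=0.25, accel_rate=1.0, dt=0.1):
    """One control step. v_cmd and v_max in mph, shock in G, dt in seconds."""
    if shock > alpha:
        # State 2: slow immediately, exploiting the approximately linear
        # relationship between shock and speed.
        v_cmd = v_cmd * alpha / shock
    else:
        # States 1 and 3: hold the speed limit, or gradually accelerate
        # back toward it.
        v_cmd = min(v_max, v_cmd + accel_rate * dt)
    return max(v_cmd, 5.0)  # never command less than 5 mph
```

For example, a 0.5 G shock at 35 mph immediately halves the commanded speed to 17.5 mph, after which the vehicle creeps back up at the acceleration rate.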
To maximize safety, the vehicle must react to shock by slowing down immediately, resulting in a discontinuous speed command. Further, it should accelerate cautiously, because rough terrain tends to be clustered, a property we demonstrate experimentally. For these reasons, it is not possible to determine the optimal parameters using classical control techniques. The parameters cannot be tuned from any physical model of the system, because performance is best measured statistically, from driving experience.
Therefore, we generate the actual parameters for this algorithm – the shock threshold and acceleration rate – using supervised learning. The algorithm generates a reasonable match to the human teacher. Of course, humans are proactive, also using perception to make speed decisions. We use inertial-only cues because it is very difficult to assess terrain roughness with perceptive sensors: the accuracy required exceeds that for obstacle avoidance by a substantial margin because rough patches are often just a few centimeters in height. Still, experimental results indicate our reactive approach is very effective in practice.
Our algorithm is part of the software suite developed for Stanley, an autonomous vehicle entered by Stanford University in the 2005 DARPA Grand Challenge [27]. The Grand Challenge was a robot race organized by the U.S. Government. In 2005, vehicles had to navigate 132 miles of unrehearsed desert terrain. The algorithm described here played an important role in that event. Specifically, it was the deciding factor in our robot's speed for nearly 20% of the competition – more than any other constraint except the DARPA-imposed speed limit. It resulted in the fastest finishing time of any robot.
2.3 Related Work
Some work has explicitly considered speed as important for path planning. For example, [33] presents a method for trading off "progress" and "velocity" with regard to the number and type of obstacles present. That is, slower speeds are desirable in more cluttered environments; faster speeds are better for open spaces. This concept, applied to desert driving, is analogous to slowing for turns in the road and for discrete obstacles. Our vehicle exhibits those behaviors as well. However, they are separate from the algorithm we describe in this chapter.
Other entrants in the 2005 DARPA Grand Challenge realized the importance of adapting speed to rough terrain. For example, two teams from Carnegie Mellon University (CMU) used "preplanning" for speed selection [35, 90, 91, 65]. For months, members of the CMU teams collected extensive environment data in the general area where the competition was thought to be held. Once the precise route was revealed by DARPA, two hours before the race, the team preprogrammed speeds according to the previously collected data. The end result of the CMU approach is similar to ours. However, our fully online method requires neither human intervention nor prior knowledge of the terrain. The latter distinction is particularly important since desert terrain readily changes over time due to many factors, including weather.
Figure 2.1: We find empirically that the relationship between perceived vertical acceleration amplitude and vehicle speed over a static obstacle can be approximated by a linear function in the operational region of interest.
to be the sum of sinusoids of a large number of unknown spatial frequencies and amplitudes. Consider one component of this summation, with a spatial frequency of $\omega_s$ and an amplitude $A_{z,g,\omega_s}$. Due to motion of the vehicle, the frequency in time of $z_{g,\omega_s}$ is $\omega = v\omega_s$. By taking two derivatives, the acceleration, $\ddot{z}_{g,\omega_s}$, has an amplitude of $v^2 \omega_s^2 A_{z,g,\omega_s}$ and a frequency of $v\omega_s$. This is the acceleration input to the vehicle's tire.
To model the response of the suspension system, we use the quarter car model [34]. The tire is modeled as a connection between the axle and the road with a spring with stiffness $k_t$, and the suspension system is modeled as a connection between the axle and the vehicle's main body, using a spring with stiffness $k_s$ and a damper with coefficient $c_s$. Using this model, the amplitude of acceleration of the main body of the vehicle can be easily approximated, by assuming infinitely stiff tires, to be:

$$A_{\ddot{z},\omega_s} = v^2 \omega_s^2 A_{z,g,\omega_s} \sqrt{\frac{(c_s v \omega_s)^2 + k_s^2}{(c_s v \omega_s)^2 + (k_s - m v^2 \omega_s^2)^2}} \qquad (2.1)$$
Figure 2.2: Filtered vs. unfiltered IMU data. The 40-tap FIR band-pass filter extracts the excitation of the suspension system, in the 0.3 to 12 Hz band. This removes the offset of gravity, as terrain slope varies, and higher-frequency noise, such as engine system vibration, that is unrelated to oscillations in the vehicle's suspension.
where $m$ is the mass of one quarter of the vehicle, and the magnitude of the acceleration input to the tire is taken to be $v^2 \omega_s^2 A_{z,g,\omega_s}$, as described above. The resonant frequency of a standard automobile is around 1 to 1.5 Hz. Below this frequency, the amplitude increases from zero at 0 Hz, slowly, but exponentially, with frequency. Above this frequency, the amplitude increases linearly, approaching $\frac{c_s v \omega_s}{m} A_{z,g,\omega_s}$. Modern active damping systems further quench the effect of the resonant frequency, leading to a curve that is nearly linear in frequency throughout, to within the noise of the sensors, which also measure the effect of random forcing disturbances.
The frequency, $\omega$, is directly proportional to speed $v$, so the amplitude of the acceleration response of the vehicle is also approximately a linear function of the velocity. Now, by superposition, considering the sum of the responses for all components of the terrain profile at all spatial frequencies, the amplitude of the acceleration response of the vehicle, $A_{\ddot{z}}$, is the sum of many approximately linear functions in $v$, and hence is approximately linear in $v$ itself. This theoretical analysis can also be verified experimentally, as in Figure 2.1.

This approximation assumed infinitely stiff tires. In reality, the finite stiffness of tires leads to a higher-frequency resonance in the suspension system around 10 to 20 Hz, followed by an exponential drop-off in $A_{\ddot{z}}$ to zero.
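The near-linearity argued above can be spot-checked numerically from Equation (2.1). The quarter-car numbers below ($m$, $k_s$, $c_s$) are round, illustrative values, chosen only so that the body resonance lands near 1.1 Hz; they are not Stanley's measured parameters.

```python
import math

def body_accel_amplitude(v, omega_s, a_g, m=400.0, k_s=2.0e4, c_s=2.0e3):
    """Equation (2.1): body-acceleration amplitude under the quarter-car
    model with infinitely stiff tires. v in m/s, omega_s in rad/m."""
    w = v * omega_s                                # temporal frequency, rad/s
    num = (c_s * w) ** 2 + k_s ** 2
    den = (c_s * w) ** 2 + (k_s - m * w ** 2) ** 2
    return (w ** 2) * a_g * math.sqrt(num / den)

# Resonance sits at sqrt(k_s / m) / (2*pi) ~ 1.1 Hz with these numbers.
# Above resonance, the amplitude-to-speed ratio flattens toward the
# asymptotic slope (c_s * omega_s / m) * a_g:
slopes = [body_accel_amplitude(v, 2.0, 0.01) / v for v in (10.0, 20.0, 40.0)]
```

With these placeholder values, the three sampled ratios agree to within roughly 25% and settle toward the asymptote, the approximately linear operating region plotted in Figure 2.1.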
Although the acceleration is approximately a linear function of $v$, a barrier to the use of pure z-accelerometer inertial measurement unit (IMU) data is that acceleration can reflect inputs other than road roughness. First, the projection of the gravity vector onto the z-axis, which changes with the pitch of the road, causes a large, low-frequency offset. Second, the driveline and non-uniformities in the wheels cause higher-frequency vibrations.

To remove these effects, we pass the IMU data through a 40-tap FIR band-pass filter to extract the excitation of the suspension system, in the 0.3 to 12 Hz band. This band was found, experimentally, to both reject the offset of gravity and eliminate vibrations unrelated to the suspension system, without eliminating the vibration response of the suspension.
A comparison of the raw and filtered data is shown in Figure 2.2. The mean of the raw data, acquired on terrain with no slope, is about 1 G, whereas the mean of the filtered data is about 0 G. Therefore, solely for presentation in this figure, 1 G was subtracted from the raw data.
vp∗ = α · vp / |z̈p| .    (2.2)
That is, vp∗ is the velocity that would have delivered the maximum acceptable shock (α) to
the vehicle.
Our instantaneous recommendation for vehicle speed is vp∗. Notice that the vehicle slows
reactively. Reactive behavior would be ineffective if shocks were instantaneous, random, and
unclustered. That is, our approach relies on one shock event momentarily "announcing" the
arrival of another. Experimental results over hundreds of miles of desert driving indicate this
is an effective strategy.
We consider vp∗ an observation. The algorithm incorporates these observations over
time, generating the actual recommended speed, vpr :
For our purposes, we also clamp vpr such that it is never less than 5 mph. We call the vpr
series a velocity plan.
Figure 2.3: The results of the coordinate descent machine learning. The learning
generated the race parameters of 0.25 Gs and 1 mph/s, a good match to human driving. Note
that the human often drives under the speed limit. This suggests the importance of our
algorithm for desert driving.
∫p ψ |vph − vpr| (1 + αβ^−1) dp    (2.4)

where ψ = 1 if vpr ≤ vph and ψ = 3 otherwise. The ψ term penalizes speeding relative
to the human driver. The value 3 was selected arbitrarily. The αβ^−1 term penalizes
parametrizations that tolerate a great deal of shock or accelerate too slowly.
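A discretized reading of this objective can be sketched as follows. The absolute-difference form and the function name are our own assumptions for illustration; lower scores indicate a closer match to the human driver.

```python
def match_score(v_human, v_algo, alpha, beta, dp=1.0):
    """Discretized sketch of the learning objective (Eq. 2.4); lower is better.

    v_human, v_algo : per-position speed series (mph)
    alpha, beta     : shock threshold (Gs) and recovery acceleration (mph/s)
    dp              : segment length between positions
    """
    total = 0.0
    for vh, vr in zip(v_human, v_algo):
        psi = 1.0 if vr <= vh else 3.0      # penalize speeding past the human
        total += psi * abs(vh - vr) * (1.0 + alpha / beta) * dp
    return total
```

With α = 0.25 and β = 1, driving 10 mph over the human's speed costs three times as much as driving 10 mph under it, reflecting the asymmetric ψ penalty.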
For this experiment, a human driver drove 1.91 miles of the previous year’s (2004)
Grand Challenge route. The vehicle’s state vector was estimated by an Unscented Kalman
Filter (UKF) [44]. The UKF combined measurements from a Novatel GPS with Omnistar
differential corrections, a GPS compass, the IMU, and wheel speed encoders. The GPS
devices produced data at 10 Hz. All other devices, and the UKF itself, produced data at
100 Hz. All data was logged with millisecond-accurate timestamps.
We do not model driver reaction time, and we correct for the delay due to IMU filtering.
Thus, in this data, driver velocity at position p is the vehicle’s recorded velocity at position
p, according to the UKF. The algorithm’s velocity at position p is the decision of the veloc-
ity controller considering all data up to and including the IMU and UKF data points logged
at position p.
The parametrization learned was (α, β) = (0.27 Gs, 0.909 mph/s). This is illustrated
in Figure 2.3. The top and bottom halves of the figure show the training and test sets,
respectively. The horizontal axis is position along the 2004 Grand Challenge route in miles.
The vertical axis is speed in mph. For practical reasons we rounded these parameters to
0.25 and 1, respectively, which are essentially identical in terms of vehicle performance.
The robot used these parameters during the Grand Challenge event, and they are evaluated
extensively in later sections.
We notice a reasonable match to human driving. Without the penalty for speeding, a
closer match is possible. However, that parametrization causes the vehicle to drive too ag-
gressively. This is, no doubt, because a human driver also uses perceptual cues to determine
speed.
Notice that the speed limits are often much greater than the safe speed the human (and
the algorithm) selects. This demonstrates the importance of our technique over blindly fol-
lowing the provided speed limits. Notice also, however, both the human and the algorithm
at times exceed the speed limit. This is not allowed on the actual robot or in the experiments
in Section 2.5. However, we permit it here to simplify the scoring function, the learning
process, and the idea of matching a human driver’s speed.
Figure 2.4: Each point indicates a specific shock event with both the velocity controller
(vertical axis) versus speed limits alone (horizontal axis). There are very few high shock
events on this easy terrain. The velocity controller does not reduce shock or speed much.
Having verified the basic function of the algorithm, we now proceed to analyze its perfor-
mance over long stretches of driving. For long stretches, it becomes intractable to compare
algorithm performance to human decisions (as in Section 2.4.3) or relative to qualitatively
assessed terrain conditions (as in Section 2.5.1). Therefore, we must consider other metrics.
Over long stretches of driving, we use the aggregate reduction of large shocks to measure
algorithm effectiveness.
However, to evaluate the reduction of large shocks, we cannot simply sum the amount
of shock. Extreme shock is rare. Less than 0.3% of all readings (2005 Grand Challenge,
Speed Limits Alone) are above our maximum acceptable shock (α) threshold. Thus, the
aggregate effect of frequent, small shock could mask the algorithm’s performance on rare,
Figure 2.5: Each point indicates a specific shock event with both the velocity controller
(vertical axis) versus speed limits alone (horizontal axis). There are many high shock
events on this difficult terrain. The velocity controller reduces shock and speed a great
deal.
large shock. To avoid this problem, we raise each shock to the 4th power (the "L4
metric") before adding it to the sum. This accentuates large shocks and diminishes small
ones in the aggregate score.
Whenever the vehicle travels along a road – regardless of whether under human or
autonomous control – it can map the road for velocity control experiments. Each raw IMU
reading is logged along with the vehicle’s velocity and position according to the UKF. Then,
offline, each IMU reading can be filtered and divided by the vehicle's speed. This generates
a sequence of position-tagged road roughness values in units of Gs-felt-per-mile-per-hour.
We use these sequences in this section to evaluate the velocity controller.
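The offline map-building step just described amounts to the following small sketch (the function name is ours, and the band-pass filtering of the raw readings is assumed to have already been applied):

```python
def roughness_sequence(imu_filtered, speeds, positions):
    """Position-tagged road roughness in Gs-felt-per-mph, built offline
    from logged, band-pass-filtered IMU readings and UKF speed estimates."""
    return [(p, abs(a) / v)
            for a, v, p in zip(imu_filtered, speeds, positions)
            if v > 0]   # skip samples where the vehicle was stopped
```

Dividing by speed exploits the approximately linear shock/speed relationship, making the stored roughness values speed-independent and hence reusable for simulated runs at other speeds.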
We calculate completion time by summing the time spent traversing each route segment
(i.e., from pi to pi+1) according to the average speed driven during that segment. The
simulation time step, dt, corresponds to 100 Hz. Because the velocity controller may change
speed instantaneously at each pi, we smooth the velocities it outputs – for these simulations
only – with a simple rate-limiting controller.
Figure 2.6: Here we show where the algorithm reduced speed during the 2005 Grand Chal-
lenge event. The velocity controller modified speed during 17.6% of the race. Height
of the ellipses shown at the top is proportional to the maximum speed reduction for each
1/10th mile segment. Numbers represent mile markers. Simulated terrain at the bottom
approximates route elevation.
We consider the trade-off of shock versus completion time on four desert routes. See Fig-
ure 2.7. The horizontal axis plots completion time. The vertical axis plots shock felt. Both
are for the algorithm normalized against speed limits alone. Each of the four curves repre-
sents a different desert route: the 2005 Grand Challenge route, the 2004 Grand Challenge
route, and two routes outside of Phoenix, Arizona, USA that we created for testing pur-
poses. Individual points on the curves depict the completion time versus shock trade-off
for a particular set of (α, β) values. For these curves, α is fixed and β is varied. The asterisk
on each curve indicates the trade-off generated by the parameters selected in Section 2.4.3.
The fifth asterisk, at the top-left, indicates what happens with speed limits alone. We notice
that, for all routes, completion time is increased between 2.5% and 5% over the baseline.
However, shock is reduced by 50% to 80% compared to the baseline, a substantial savings.
It appears that the algorithm is most effective along the 2004 Grand Challenge route.
There are two reasons for this. First, we prepared for the 2005 event along the 2004 course.
Therefore, our algorithm is especially optimized for that route. However, the effect is
mostly due to a different phenomenon.
Figure 2.7: Here we plot the trade-off of completion time and shock on four desert routes. α
is held constant (0.25) and β is varied along each curve. With the single set of parameters
learned in Section 2.4.3, the algorithm reduces shock by 50% to 80% while increasing
completion time by 2.5% to 5%.
The 2004 course had sections that were in extremely bad condition when this data
was collected (June 2005). Such extreme off-road driving was not seen on the other three
routes. Because the core functionality of our algorithm involves being cautious about clus-
tered, dangerous terrain, it was particularly effective for the 2004 course. This implies our
approach may be more beneficial as terrain difficulty increases.
Finally, we notice occasional abrupt, unsmooth drops in shock in the figure. These occur
when a parametrization is just aggressive enough to slow the vehicle before a major rough
section of road, whereas the immediately preceding parametrization did not slow the
vehicle in time.
Simulated terrain at the bottom of Figure 2.6 approximates elevation data from the route. At the top of the figure, ellipse height indicates
the maximum speed reduction generated by the algorithm for that portion of the route. For
the purposes of this figure, the route was divided into 1/10th mile segments. Log files
taken by the vehicle during the Grand Challenge event indicate the velocity controller was
controlling vehicle speed for 17.6% of the entire race – more than any other factor except
speed limits.
We notice that the algorithm was particularly active in the mountains and especially
inactive in plains. We also notice that miles 122-129 – which we identified in Section 2.5.1
as the most challenging part of the route – do not represent the greatest speed reductions the
velocity controller made. This is because speed limits (given by DARPA) were already low
in those regions, and the amount of speed reduction is, of course, related to the accuracy of
the original speed limit as a good speed guide.
2.6 Discussion
Speed decisions are a crucial part of planning – even beyond their relationship to obstacle
avoidance and lateral maneuverability. We have presented an online approach to speed
adaptation for high-speed off-road autonomous driving. Our approach addresses shock,
an important component of overall risk when driving off-road. Our method requires no
rehearsal or remote imaging of the route.
The algorithm has three states. First, the vehicle drives at the speed limit until an accept-
able shock threshold is exceeded. Second, the vehicle reduces speed to bring itself within
the threshold. Finally, the vehicle gradually increases speed back to the speed limit. The
algorithm uses the linear relationship between speed and shock to determine the amount of
speed reduction needed to stay within the limit. This limit and the acceleration parameter
from the third state are tuned using supervised learning to match human driving.
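The three-state logic above can be sketched as follows. Because Equation (2.3) is not reproduced here, the recovery and clamping details are our own assumptions based on the description; the function and variable names are hypothetical.

```python
def velocity_plan(speed_limits, shocks, alpha=0.25, beta=1.0, dt=0.01):
    """Sketch of the three-state speed logic.

    speed_limits : per-step speed limits (mph)
    shocks       : per-step filtered vertical accelerations, |z̈| (Gs)
    alpha        : maximum acceptable shock (Gs)
    beta         : recovery acceleration back toward the limit (mph/s)
    dt           : time step (s), 100 Hz as in the experiments
    """
    v = speed_limits[0]
    plan = []
    for limit, shock in zip(speed_limits, shocks):
        if shock > alpha:
            # State 2: threshold exceeded. Use the linear shock/speed
            # relationship to pick the speed that would have produced alpha.
            v = alpha * v / shock
        else:
            # State 3: gradually accelerate back toward the speed limit.
            v = v + beta * dt
        v = max(v, 5.0)          # never recommend less than 5 mph
        v = min(v, limit)        # State 1: never exceed the speed limit
        plan.append(v)
    return plan
```

For example, at a 60 mph limit, a single 0.5 G shock with α = 0.25 halves the recommendation to 30 mph, after which the plan ramps back up at β = 1 mph/s.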
Experimental results allow us to draw numerous conclusions. First, our algorithm
seems to be more cautious on more difficult terrain, although this is difficult to prove in
the general case (Figures 2.4 and 2.5). Second, by slowing, we can substantially reduce
high shock events with minimal effect on route completion time (Figure 2.7). Finally, the
algorithm had significant influence on our vehicle’s behavior during the 2005 Grand Chal-
lenge (Figures 2.4, 2.5, 2.6, and 2.7). It reduced shock by 52% and slowed the vehicle in
difficult terrain, all while adding less than 10 minutes to the completion time. The algorithm
also had more influence than any other factor – except speed limits – on our robot’s speed.
Moreover, it generated the fastest completion time despite being fully automated, unlike some
other competitors in the Grand Challenge [35, 90, 91].
Chapter 3
Smooth Road Detection
3.1 Abstract
Accurate perception is a principal challenge of autonomous off-road driving. Perceptive
technologies generally focus on obstacle avoidance. However, at high speed, terrain roughness
is also important, to control the shock the vehicle experiences. The accuracy required to
detect rough terrain is significantly greater than that necessary for obstacle avoidance.
We present a self-supervised machine learning approach for estimating terrain rough-
ness from laser range data. Our approach compares sets of nearby surface points acquired
with a laser. This comparison is challenging due to uncertainty. For example, at range, laser
readings may be so sparse that significant information about the surface is missing. Also, a
high degree of precision is required in projecting laser readings. This precision may be un-
available due to latency or error in the pose estimation. We model these sources of error as
a multivariate polynomial. The coefficients of this polynomial are obtained through a self-
supervised learning process. The “labels” of terrain roughness are automatically generated
from actual shock, measured when driving over the target terrain. In this way, the approach
provides its own training labels. It “transfers” the ability to measure terrain roughness from
the vehicle’s inertial sensors to its range sensors. Thus, the vehicle can slow before hitting
rough terrain.
Our experiments use data from the 2005 DARPA Grand Challenge. We find our approach
is substantially more effective at identifying rough surfaces and assuring vehicle safety.
3.2 Introduction
In robotic autonomous off-road driving, the primary perceptual problem is terrain assess-
ment in front of the robot. For example, in the 2005 DARPA Grand Challenge [27] (a
robot competition organized by the U.S. Government) robots had to identify drivable sur-
face while avoiding a myriad of obstacles – cliffs, berms, rocks, fence posts. To perform
terrain assessment, it is common practice to endow vehicles with forward-pointed range
sensors. The terrain is then analyzed for potential obstacles. The result is used to adjust the
direction of vehicle motion [46, 47, 48, 91].
At high speed – one key challenge in the DARPA Grand Challenge – terrain
roughness must also dictate vehicle behavior. Rough terrain induces shock, and at high speed
the effect of shock can be detrimental [18]. Thus, to be safe, a vehicle must sense ter-
rain roughness and adjust speed, avoiding high shock. The degree of accuracy necessary
for assessing terrain roughness exceeds that required for obstacle finding by a significant
margin – rough patches are often just a few centimeters in height. This makes design of a
competent terrain assessment function difficult.
In this chapter, we present a method that enables a vehicle to acquire a competent
roughness estimator for high speed navigation. Our method uses self-supervised machine
learning. This allows the vehicle to learn to detect rough terrain while in motion and with-
out human training input. Training data is obtained by a filtered analysis of inertial data
acquired at the vehicle core. This data is used to train (in a supervised fashion) a classifier
that predicts terrain roughness from laser data. In this way, the learning approach “trans-
fers” the capability to sense rough terrain from inertial sensors to environment sensors. The
resulting module detects rough terrain in advance, allowing the vehicle to slow. Thus, the
vehicle avoids high shock that would otherwise cause damage.
We evaluate our method using data acquired in the 2005 DARPA Grand Challenge.
Our experiments measure ability to predict shock and effect of such predictions on vehicle
safety.
We find our method is more effective at identifying rough surfaces than previous techniques
derived from obstacle avoidance algorithms. This is because we use training data
that emphasizes the characteristics that distinguish very small surface discontinuities. The
self-supervised approach is crucial to making this possible.
Furthermore, we find our method reduces vehicle shock significantly with only a small
reduction in average vehicle speed. The ability to slow before hitting rough terrain is in
stark contrast to the speed controller used in the race [80] presented in Chapter 2. That
controller measured vehicle shock exclusively with inertial sensing. Hence, it sometimes
slowed the vehicle after hitting rough terrain.
We present the chapter in four parts. First, we define the functional form of the
laser-based terrain estimator. Then we describe the exact method for generating training data
from inertial measurements. Third, we train the estimator with the learning algorithm.
Finally, we examine the results experimentally.
effectively breaks the correspondence of ICP. Therefore, we believe that accurate recovery
of a full 3-D world model from this noisy, incomplete data is impossible. Instead we use a
machine learning approach to define tests that indicate when terrain is likely to be rough.
Other work uses machine learning in lieu of recovering a full 3-D model. For example,
[73] and [61] use learning to reason about depth in a single monocular image. Although full
recovery of a world model is impossible from a single monocular image, useful data can
be extracted using appropriate learned tests. Our work has two key differences from these
prior papers: the use of lasers rather than vision and the emphasis upon self-supervised
rather than reinforcement learning.
In addition, other work has used machine learning in a self-supervised manner. The
method is an important component of the DARPA LAGR Program. It was also used for
visual road detection in the DARPA Grand Challenge [25]. The “trick” of using self-
supervision to generate training data automatically, without the need for hand-labeling, has
been adapted from this work. However, none of these approaches addresses the problem of
roughness estimation and intelligent speed control of fast moving vehicles. As a result, the
source of the data and the underlying mathematical models are quite different from those
proposed here.
Figure 3.1: The left graphic shows a single laser scan. The right shows integration of these
scans over time.
Figure 3.2: Pose estimation errors can generate false surfaces that are problematic for nav-
igation. Shown here is a central laser ray traced over time, sensing a relatively flat surface.
The integration is achieved through an unscented Kalman filter [44] at an update rate of
100 Hz. The pose estimation is subject to error due to measurement noise and communication
latencies.
Thus, the resulting z-values might be misleading. We illustrate this in Figure 3.2. There we
show a laser ray scanning a relatively flat surface over time. The underlying pose estimation
errors may be on the order of 0.5 degrees. When extended to the endpoint of a laser scan,
the pose error can lead to z-errors exceeding 50 centimeters. This makes it obvious that the
3-D point cloud cannot be taken at face value. Separating the effects of measurement noise
from actual surface roughness is one key challenge addressed in this work.
Figure 3.3: The measured z-difference from two scans of flat terrain indicates that error
increases linearly with the time between scans. This empirical analysis suggests the second
term in Equation (3.4).
The error in pose estimation is not easy to model. One of the experiments in [88] shows
it tends to grow over time. That is, the more time that elapses between the acquisition of
two scans, the larger the relative error. Empirical data backing up this hypothesis is shown
in Figure 3.3. This figure depicts the measured z-difference from two scans of flat terrain
graphed against the time between the scans. This time-dependent characteristic suggests
that elapsed time should be a parameter in the model.
Our algorithm compares features pairwise, resulting in an N × N scoring matrix for N laser
points. Let us denote this matrix by S. In practice, the N points nearest the predicted
future location of each rear wheel are used, with each wheel treated separately. Data from
the rear wheels will be combined later in Equation (3.5).
The (r, c) entry in the matrix S is the score generated by comparison of feature r with
feature c where r and c are both in {0, . . . , N − 1}:
The difference function ∆ is symmetric and, when applied to two identical points, yields
a difference of zero. Hence S is symmetric with a diagonal of zeros, and its computation
requires just (N² − N)/2 comparisons. The element-by-element product of S and the lower
triangular matrix whose non-zero elements are 1 is taken. (The product zeros out the sym-
metric, redundant elements in the matrix.) Finally, the largest ω elements are extracted into
the vector W and accumulated in ascending order to generate the total score, R:

R = Σ_{i=0}^{ω} Wi υ^i .    (3.3)
Here υ is the increasing weight given to successive scores. Both υ and ω are parameters
that are learned according to Section 3.7. As we will see in Equation (3.4), points that are
far apart in (x, y) are penalized heavily in our technique. Thus, they are unlikely to be
included in the vector W .
Intuitively, each value in W is evidence regarding the magnitude of the second deriva-
tive. The increasing weight indicates that large, compelling evidence should win out over
the cumulative effect of small evidence. This is because laser points on very rough patches
are likely to be a small fraction of the overall surface. R is the quantitative verdict regarding
the roughness of the surface.
The comparison function ∆ is a polynomial that combines additively a number of cri-
teria. In particular, we use the following function:
Here euclidean(Li , Lj ) denotes Euclidean distance in (x, y)-space between laser points i
and j. The various parameters α[k] are generated by the machine learning.
The function ∆ implicitly models the uncertainties that exist in the 3-D point cloud.
Specifically, large change in z raises our confidence that this pair of points is a witness
for a rough surface. However, if significant time elapsed in between scans, our functional
form of ∆ allows for the confidence to decrease. Similarly, large (x, y) distance between
points may decrease confidence as it dilutes the effect of the large z discontinuity on surface
roughness. Finally, if the first derivatives of roll and pitch are large at the time the reading
was taken, the confidence may further be diminished. These various components make it
possible to accommodate the various noise sources in the robot system.
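The scoring pipeline of Equations (3.3) and (3.4) can be sketched as follows. Since Equation (3.4) and its learned coefficients are not reproduced here, the ∆ below is a simplified stand-in that captures only the criteria described (large |∆z| raises the score; elapsed time and (x, y) distance lower it, and the roll/pitch-rate terms are omitted); the weights are illustrative, not learned values.

```python
import math

def delta(p_i, p_j, a=(1.0, 0.5, 0.5)):
    """Simplified stand-in for the comparison function of Eq. (3.4).
    Each point is (x, y, z, t); weights `a` are illustrative."""
    dz = abs(p_i[2] - p_j[2])
    dt = abs(p_i[3] - p_j[3])
    dxy = math.hypot(p_i[0] - p_j[0], p_i[1] - p_j[1])
    # Large z-change is evidence of roughness; time gap and planar
    # distance discount that evidence, as described in the text.
    return max(0.0, a[0] * dz - a[1] * dt - a[2] * dxy)

def roughness_score(points, omega=5, upsilon=1.5):
    """Total score R of Eq. (3.3): the omega largest pairwise scores,
    accumulated in ascending order with increasing weight upsilon**i."""
    n = len(points)
    scores = [delta(points[r], points[c]) for r in range(n) for c in range(r)]
    top = sorted(scores)[-omega:]            # largest omega scores, ascending
    return sum(w * upsilon ** i for i, w in enumerate(top))
```

Iterating `c` over `range(r)` visits only the lower triangle of S, realizing the (N² − N)/2 comparisons noted above.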
Finally, we model the nonlinear transfer of shock from the left and right wheels to the
IMU. (The IMU is rigidly mounted and centered just above the rear axle.) Our model
assumes a transfer function of the form
where Rleft and Rright are calculated according to Equation (3.3). Here ζ is an unknown
parameter that is learned simultaneously with the roughness classifier. Clearly, this model
is extremely crude. However, as we will show, it is sufficient for effective shock prediction.
Ultimately, Rcombined is the output of the classifier. In some applications, using this
continuous value has advantages. However, we use a threshold µ for binary classifica-
tion. (A binary assessment simplifies integration with our speed selection process.) Terrain
whose Rcombined value exceeds the threshold is classified as rugged, and terrain below the
threshold is assumed smooth. µ is also generated by machine learning.
Figure 3.4: We find empirically that the relationship between perceived vertical accelera-
tion and vehicle speed over a static obstacle can be approximated tightly by a linear function
in the operational region of interest.
The relationship is surprisingly linear. The ruggedness coefficient value, thus, is simply the
quotient of the measured vertical acceleration and the vehicle speed.
Tp − λFp (3.6)
where Tp and Fp are the true and false positive classification rate, respectively. A true pos-
itive is when, given a patch of road whose ruggedness coefficient exceeds a user-selected
threshold, that patch’s Rcombined value exceeds µ. The trade-off parameter λ is selected
arbitrarily by the user. Throughout this chapter, we use λ = 5. That is, we are especially
interested in learning parameters that minimize false positives in the ruggedness detection.
To maximize this objective function, the learning algorithm adapts the various param-
eters in our model. Specifically, the parameters are the α values as well as υ, ω, ζ, and µ.
The exact learning algorithm is an instantiation of coordinate descent. The algorithm be-
gins with an initial guess of the parameters. This guess takes the form of a 13-dimensional
vector B containing the 10 α values as well as υ, ω, and ζ. For carrying out the search
in parameter space, there are two more 13-dimensional vectors I and S. These contain an
initial increment for each parameter and the current signs of the increments, respectively.
A working set of parameters T is generated according to the equation:
T = B + SI. (3.7)
The SI product is element-by-element. Each element of S rotates through the set {-1,1}
while all other elements are held at zero. For each valuation of S, all values of Rcombined
are considered as possible classification thresholds, µ. If the parameters generate an im-
provement in the classification, we set B = T and save µ.
After a full rotation by all the elements of S, the elements of I are halved. This contin-
ues for a fixed number of iterations. The resulting vector B and the saved µ comprise the
set of learned parameters.
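The search procedure above can be sketched generically. This is a minimal sketch of the sign-rotation and increment-halving loop, not the exact race implementation; the objective here is any function to maximize, standing in for the classification score of Equation (3.6).

```python
def coordinate_descent(objective, b, increments, iterations=20):
    """Sketch of the coordinate descent described above.

    objective  : maps a parameter vector to a score (higher is better)
    b          : initial parameter guess (the vector B)
    increments : initial per-parameter step sizes (the vector I)
    """
    b = list(b)
    inc = list(increments)
    best = objective(b)
    for _ in range(iterations):
        # Rotate a single +/- step through each coordinate (the vector S):
        # one element of S is -1 or +1, all others are held at zero.
        for k in range(len(b)):
            for sign in (-1.0, 1.0):
                t = list(b)                 # working parameters T = B + S I
                t[k] += sign * inc[k]
                score = objective(t)
                if score > best:            # keep T only if it improves things
                    best, b = score, t
        inc = [i / 2.0 for i in inc]        # halve step sizes each full rotation
    return b, best
```

On a simple concave objective this converges quickly; in the chapter's setting the objective evaluation additionally sweeps all Rcombined values as candidate thresholds µ.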
Figure 3.5: The true positive / false positive trade-off of our approach and two interpreta-
tions of prior work. The self-supervised method is significantly better for nearly any fixed
acceptable false-positive rate.
Figure 3.6: The trade off of completion time and shock experienced for our new self-
supervised approach and the reactive method we used in the Grand Challenge.
This is indeed the case. The results are shown in Figure 3.6. The graph plots on the
vertical axis the shock experienced. The horizontal axis indicates the completion time of
the 10-mile section used for testing. Both axes are normalized by the values obtained with
no speed modification. The curve labeled “reactive only” corresponds to the actual race
software. In contrast, the curve labeled “self-supervised” indicates our new method.
The results clearly demonstrate that reducing speed proactively with our new terrain
assessment technique constitutes a substantial improvement. The amount of shock experi-
enced is dramatically lower for essentially all possible completion times. The reduction is
as much as 50%. Thus, while the old method was good enough to win the race, our new
approach would have permitted Stanley to drive significantly faster while still avoiding
excessive shock due to rough terrain.
3.9 Discussion
We have presented a novel, self-supervised learning approach for estimating the roughness
of outdoor terrain. Our main application is the detection of small discontinuities that are
likely to create significant shock for a high-speed robotic vehicle. By slowing, the vehicle
can reduce the shock it experiences. Estimating roughness demands the detection of very
small surface discontinuities – often a few centimeters. Thus the problem is significantly
more challenging than finding obstacles.
Our approach monitors the actual vertical accelerations caused by the unevenness of
the ground. From that, it extracts a ruggedness coefficient. A self-supervised learning
procedure is then used to predict this coefficient from laser data, using a forward-pointed
laser. In this way, the vehicle can safely slow down before hitting rough surfaces. The
approach in this chapter was formulated (and implemented) as an offline learning method.
But it can equally be used as an online learning method where the vehicle learns as it drives.
Experimental results indicate that our approach offers significant improvement in shock
detection and – when used for speed control – reduction. In comparison to previous meth-
ods, our algorithm has a significantly higher true-positive rate for (nearly) any fixed value
of acceptable false-positive rate. Results also indicate our new proactive method allows for
additional shock reduction without increase in completion time compared to our previous
reactive work. Put differently, Stanley could have completed the race even faster without
any additional shock.
There exist many open questions that warrant further research. One is to train a camera
system so rough terrain can be extracted at farther ranges. Also, the mathematical model for
surface analysis is quite crude. Further work could improve upon it. Additional parameters,
such as the orientation of the vehicle and the orientation of the surface roughness, might
add further information. Finally, we believe using this method online, during driving, could
add further insights to its strengths and weaknesses.
Chapter 4
Feature Optimization
4.1 Abstract
We present an algorithm that learns invariant features from real data in an entirely unsuper-
vised fashion. The principal benefit of our method is that it can be applied without human
intervention to a particular application or data set, learning the specific invariances neces-
sary for excellent feature performance on that data. Our algorithm relies on the ability to
track image patches over time using optical flow. With the wide availability of high frame
rate video (e.g., on the web or from a robot), good tracking is straightforward to achieve. The
algorithm then optimizes feature parameters such that patches corresponding to the same
physical location have feature descriptors that are as similar as possible while simultane-
ously maximizing the distinctness of descriptors for different locations. Thus, our method
captures data or application specific invariances yet does not require any manual super-
vision. We apply our algorithm to learn domain-optimized versions of SIFT and HOG.
SIFT and HOG features are excellent and widely used. However, they are general and
by definition not tailored to a specific domain. Our domain-optimized versions offer a
substantial performance increase for classification and correspondence tasks we consider.
Furthermore, we show that the features our method learns are near the optimum that would
be achieved by directly optimizing the test set performance of a classifier. Finally, we
demonstrate that the learning often allows fewer features to be used for some tasks, which
has the potential to dramatically reduce computational cost for very large data sets.
4.2 Introduction
Many feature functions have been proposed in the literature. However, they are usually
not tailored to a specific domain. While some of these features achieve excellent results
across a broad range of applications, their performance for particular applications can be
improved because they do not capture domain-specific invariances. Optimizing a feature
function for a particular domain is challenging because most features have numerous pa-
rameters that effect their invariances, such as the standard deviation of Gaussians used for
smoothing and the number of bins in histograms used for orientation calculation. These
parameters can be set entirely by hand, using hand-labeled data and learning, or by mathe-
matical processes such as warping an image patch while optimizing the feature value to be
invariant. However, we believe those approaches have limitations. A completely manual
approach or one involving hand-labeling can be costly or time consuming and may not pro-
duce optimal results. Patch warping and related methods tend to be synthetic, not capturing
all the invariances in the data.
We present an algorithm that uses unsupervised machine learning to optimize feature
invariances using video. Because our algorithm is entirely unsupervised and computation-
ally efficient, it can handle huge amounts of data. Furthermore, because it is data driven,
the invariances it learns are accurate for the specific application domain or data set. Our
method has two key steps.
First, our method performs correspondence between successive video frames. For this
step we use basic Harris corner features and the Lucas-Kanade pyramidal optical flow
algorithm. This approach is well-known as a method for correspondence on video and
works well despite its simplicity because the high frame rate limits motion between frames. A
small patch of pixels around each Harris corner feature is extracted and saved, resulting in
many sets of patches. Each set contains multiple views of approximately the same location
in the environment. The number of sets is bounded only by video length and availability of
corner features to track.
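The bookkeeping of this first step can be sketched as follows. This is a minimal sketch: the corner detector and pyramidal optical flow are abstracted behind a `flow` callback, and the names (`extract_patch`, `build_patch_sets`) and the patch size are illustrative, not taken from our implementation.

```python
import numpy as np

PATCH = 8  # patch half-width in pixels; an illustrative value

def extract_patch(frame, x, y, r=PATCH):
    """Cut a (2r x 2r) pixel patch centered on an integer corner location."""
    x, y = int(round(x)), int(round(y))
    return frame[y - r:y + r, x - r:x + r].copy()

def build_patch_sets(frames, corners, flow):
    """Track each corner through the video and group its views into a set.

    frames  : list of 2D grayscale arrays (the video).
    corners : list of (x, y) corner locations detected in frames[0]
              (e.g. Harris corners).
    flow    : callable (prev_frame, next_frame, (x, y)) -> (x, y) giving
              the corner's position in next_frame (e.g. pyramidal
              Lucas-Kanade optical flow).

    Returns a list of patch sets; each set holds multiple views of
    approximately the same world location. The number of sets is bounded
    only by video length and the corners available to track.
    """
    sets = [[extract_patch(frames[0], x, y)] for (x, y) in corners]
    positions = list(corners)
    for prev, nxt in zip(frames, frames[1:]):
        for t, (x, y) in enumerate(positions):
            x2, y2 = flow(prev, nxt, (x, y))   # motion into the next frame
            positions[t] = (x2, y2)
            sets[t].append(extract_patch(nxt, x2, y2))
    return sets
```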
The second step optimizes the higher-level features. Here, “higher-level” implies a
feature of greater sophistication and complexity than Harris corners. Because SIFT and
HOG are widely used in the literature, we focus on them in this chapter. However, any
type of feature may be used. It is well-established that a feature function should compute
the same value when applied to the same real-world object or type of object but be distinct
for different objects. To reach this goal, we optimize feature parameters using an efficient,
effective search procedure and scoring function. Our method requires no hand-labeling and
can be completely automated for any task involving video.
To show that our optimized features capture the appropriate domain-specific invari-
ances, we demonstrate that parameters optimized for one video stream outperform, on that
stream, those optimized for another and vice-versa. In particular, we look at the perfor-
mance of optimized versions of SIFT and HOG on two tasks: car detection in an urban
environment and correspondence for 3D reconstruction of surgery-related imagery. We
show that the version of SIFT optimized for car detection, on a car detection task, out-
performs the version of SIFT optimized for the surgical task. We also show the converse
is true; the surgical SIFT features outperform the car features on the surgical task. When
we compare our parameters to the original highly-tuned parameters presented by Lowe,
we find our performance is better for the car detection task and essentially the same for
the correspondence task. We present similar results for HOG features. Furthermore, we
analyze the effect of optimization on the rotational invariance property of SIFT.
In addition, the parameters our method learns are near the optimal achieved by directly
optimizing the test set performance of a classifier. That is, iteratively, we train the classifier
on a training set, evaluate its performance on the test set, and change feature parameters
to improve test set performance. We find that our method, despite being entirely unsupervised,
produces parameters near this optimum. Finally, we show that optimized parameters
can allow fewer features to be used for comparable application performance, reducing
the computational burden associated with very large data sets or high-resolution images.
Our method does have limitations. For example, in highly repetitive environments or
video with loops, the same object may be encountered multiple times and thus recorded in
distinct patch sets. In such cases, optimizing for distinctness between patch sets could be
sub-optimal. However, our experimental results indicate that for typical, non-pathological
cases performance is excellent.
Our algorithm uses corner features and optical flow. However any features and corre-
spondence algorithm could be used. Corner features work with virtually any textured image
patch. We argue that texture is a necessary condition for any algorithm seeking multiple
views of the same area. Thus, our method covers essentially the full range of patch types
such algorithms can use, which makes the optimization step particularly general. Furthermore,
corners and optical flow are very fast to compute, so our method could be used online to
adapt features in real time.
Corner features are not enough for many vision applications such as object recognition
and correspondence for 3D reconstruction in the absence of dense data. This motivates the
next step of our method where higher-level features are calculated and optimized for the
tracked patches.
\[
\arg_{\vec{\theta}} \sum_{\forall P \in \vec{P}} \Bigg( \gamma \, \min \sum_{\forall p, q \in P} \frac{1}{|p|} \sum_{i=0}^{|p|} |p_i - q_i| \;+\; \max \sum_{\forall p \in P,\; \forall q \in (\forall (Q \in \vec{P}) \neq P)} \frac{1}{|p|} \sum_{i=0}^{|p|} |p_i - q_i| \Bigg). \tag{4.1}
\]
Finally, the outer summation adds over all patch sets, turning our optimization for the one
set into an optimization for all. That is, we are making all the patches that are actually the
same as close as possible in feature value and all the patches that are different as different
as possible in feature value. γ is a parameter trading off the extent to which we would
rather make the same patches match as opposed to making differing patches not match.
Simplifying without loss of generality we have:
\[
\arg\min_{\vec{\theta}} \sum_{\forall P \in \vec{P}} \Bigg( \gamma \sum_{\forall p, q \in P} \frac{1}{|p|} \sum_{i=0}^{|p|} |p_i - q_i| \;-\; \sum_{\forall p \in P,\; \forall q \in (\forall (Q \in \vec{P}) \neq P)} \frac{1}{|p|} \sum_{i=0}^{|p|} |p_i - q_i| \Bigg). \tag{4.2}
\]
Since \(|\vec{P}|\) is potentially enormous, in practice we replace \(\forall (Q \in \vec{P}) \neq P\) with a small
number of samples. In fact, we find experimentally that one is sufficient for good performance.
Even with this approximation, the optimization is still challenging computationally. Thus
we also assume the elements of θ~ are conditionally independent. This allows us to perform
coordinate descent in parameter space. However, we find experimentally that the space is
fraught with local minima. Thus a more reliable method is to simply search the space. A
search is tractable in this case because a feature can be generated rapidly (in about .0025
seconds) and the well-known “good” range of many parameters is not enormous, even at
small discretization [54].
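The scoring and search described above can be sketched as follows, assuming a hypothetical `make_feature(theta)` factory that instantiates a feature function from a parameter dictionary. The single-sample approximation of the inter-set term and the per-parameter sweep reflect the procedure described above, but all names and the toy structure are illustrative:

```python
import itertools
import random

import numpy as np

def set_score(feature, P, Q, gamma):
    """Equation 4.2 objective restricted to one patch set P, with the
    inter-set term approximated by a single sampled set Q."""
    d = lambda a, b: float(np.mean(np.abs(a - b)))  # normalized L1 distance
    intra = sum(d(feature(p), feature(q))
                for p, q in itertools.combinations(P, 2))
    inter = sum(d(feature(p), feature(q)) for p in P for q in Q)
    return gamma * intra - inter

def search_parameters(make_feature, grids, patch_sets, gamma=1.0):
    """Sweep each parameter over its discretized 'good' range while holding
    the others fixed (the conditional-independence assumption), keeping the
    value that minimizes the total score."""
    theta = {name: grid[0] for name, grid in grids.items()}
    for name, grid in grids.items():
        best_value, best_score = theta[name], float('inf')
        for value in grid:
            feature = make_feature(dict(theta, **{name: value}))
            score = sum(
                set_score(feature, P,
                          random.choice([Q for Q in patch_sets if Q is not P]),
                          gamma)
                for P in patch_sets)
            if score < best_score:
                best_value, best_score = value, score
        theta[name] = best_value
    return theta
```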
in [93]. For SIFT features, 3 scales were used. Each of these features was concatenated
together in one (large!) vector describing the image. A standard SVM with a linear kernel
was trained and tested in the standard fashion. This is a simple classification scheme, not
the most recent presented in the literature. Our goal, however, is to emphasize the qual-
ity of our features, not the learning algorithm. For full reproducibility, we used the SVM
implementation publicly available in [16]. The vertical axis shows classification accuracy,
the fraction of all images in the test set classified correctly. We use two different horizontal
axes: the resolution of the grid (for SIFT, multiplied by scales) and the number of training
examples per class. The number of features is important as it allows us to illustrate the ef-
ficacy of our features. For plots specifying the number of training examples, the resolution
of the grid was 6 × 6.
For the Street View data set, we consider car detection. To establish ground truth, we
manually labeled cars with bounding boxes. To generate negative examples, we randomly
cut bounding boxes from the image. We spot-checked the negative examples to verify cars
were not present. For all plots varying the number of features, training and testing each
occurred on 100 examples per class.
For the surgical data set, we consider patch correspondence for 3D reconstruction. We
do not perform 3D reconstruction itself. To evaluate correspondence, we randomly selected
20 patch sets (20 \(P\)'s from \(\vec{P}\); see Section 4.5). For plots varying the number of features,
we used 20 patches from each of these sets. The 20 were split evenly for training and test-
ing. Thus our classifier performs correspondence as a 20-way classification problem. The
classification of each patch in the test set was deemed its correspondence. Intuitively, we
expect correspondence to perform reasonably at finding a match even if only one point has
been seen before. This explains why our plots for the surgical data show good performance
even with one training example.
For SIFT features, we also plot the original/default parameters used in Lowe’s imple-
mentation. These provide a basis of comparison. Additionally, for HOG features, we
evaluated the optimality of our method. In particular, we searched all parametrizations,
iteratively training and testing the classifier for each. We kept the parametrization with the
best test set performance for a static number of features and training examples. Then, with
that parametrization, we evaluated performance for all numbers of features and training
examples as described above. As a check, we also retained the lowest test set performance.
We present these bounds only for HOG features because the search is computationally in-
tensive, and HOG features compute significantly faster in our implementation. To make
computation tractable, we used the conditional independence assumption described in Sec-
tion 4.5 for this search as well. Thus it is an approximate theoretical best. When searching,
we averaged the performance of a parameter set over many trials. To save computation, we
used fewer trials for the lower bound making it less accurate.
the number of features, fewer of the task-specific features are needed for a particular per-
formance level. This may be useful for algorithms on large data sets. Interestingly, our
surgical-specific parameters for the surgical correspondence task produce results nearly
identical to Lowe’s. However, for the cars task, our parameters are significantly better.
Lowe’s original intent was to produce features for matching/correspondence. Evidently his
extensive tuning well-optimized them for that task. Finally, Figure 4.6 shows selected im-
ages from the cars task that were classified correctly with the car-optimized SIFT but not
the surgical-optimized SIFT features.
Data Set   Window Size   Block Size   Block Stride   Cell Size
Cars       32            32           10             16
Surgical   12            12           4              6

Data Set   Aperture   Sigma   L2 Threshold   Gamma Correction
Cars       1          none    .61            false
Surgical   1          5       .51            false
Table 4.1: Learned HOG parameters for the cars and surgical data sets.
\[
\sqrt{\frac{1}{|f|}\sum_{i=0}^{|f|}\left(f_{\mathrm{angle},\,i}-f_{0,\,i}\right)^{2}}. \tag{4.3}
\]
That is, a single score is the square-root of the normalized sum of squared differences of
each element of two SIFT feature vectors. The vectors are from a rotation of angle and
zero degrees, respectively. Figure 4.9 shows averages and standard deviations over 18 trials
as a function of angle. Clearly, our optimization has removed rotational invariance from
the surgical-optimized SIFT features. This is the correct behavior given the invariances in
our data. An analogous plot could be produced for car-optimized SIFT.
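This score is straightforward to compute; a minimal numpy sketch (the function name is illustrative):

```python
import numpy as np

def rotation_score(f_angle, f_0):
    """Equation 4.3: the square root of the normalized sum of squared
    differences between a SIFT vector computed on a rotated patch and one
    computed at zero degrees. A rotation-invariant feature scores near
    zero at every angle; a score that grows with angle indicates the
    invariance has been removed."""
    f_angle = np.asarray(f_angle, dtype=float)
    f_0 = np.asarray(f_0, dtype=float)
    return float(np.sqrt(np.mean((f_angle - f_0) ** 2)))
```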
Rotational invariance is not the only invariance optimized for SIFT. There are many
other parameters being optimized as discussed in Section 4.6.2. For example, the stan-
dard deviation of the Gaussian used for smoothing is important. Our optimization disables
rotational invariance for both of the application-specific SIFT features. Thus, rotational
invariance does not account for the performance difference between them, though it may
account for differences relative to Lowe's parameters, which retain rotational invariance.
4.7 Discussion
We present an algorithm that learns optimized features in an unsupervised fashion. Our
features capture application or data-specific invariances, improving performance. We fo-
cus on optimizing SIFT and HOG features due to their widespread use in the literature.
However our method will work for any feature function with parameters. For HOG fea-
tures, we show our optimized versions are near the best possible for both the surgical and
car tasks. For SIFT, our parameters for the surgical task produce performance very near
the extensively researched parameters from Lowe. For the cars task, our SIFT parameters
offer substantially better performance. Due to the widespread use of SIFT and HOG, we
feel our method has the potential for immediate impact.
There are numerous opportunities for future work. Because our method is entirely un-
supervised, we plan to explore how it scales to enormous data sets. For example, we could
use every video on the web (or a very large subset) to learn optimized but general-purpose
features. Rather than optimizing, we could learn a whole new class of features directly
from pixels. We could also pursue additional applications. For example, numerous robotic
systems would benefit from real-time feature optimization. It is also possible our core idea
of unsupervised patch-extraction and optimization could be applied to other tasks in vi-
sion, such as camera calibration. Finally, we have alluded to the intersection of optimized
features and feature compression and could explore it in more depth.
Figure 4.1: An example of corner feature tracking with optical flow. The video sequence
moves frame-to-frame from left to right. The center of each yellow box in the left-most
(first) frame is a corner detected by [75, 16]. The green arrows show the motion of these
corners into the next frame as determined with optical flow [57, 16]. This is not a prediction.
The flow algorithm uses the next frame in the sequence for this calculation. The yellow
boxes contain the patches extracted and saved for later optimization. The yellow path tails
in the subsequent frames depict patch movement as estimated by the optical flow. In total,
four patch sets are produced, each with four patches. They are shown in Figure 4.2. Our
algorithm produces additional patches for these frames. We omit them to keep the illustration
simple.
Figure 4.2: The extracted patch sets from Figure 4.1. The correspondence is not perfect,
but satisfactory for good optimization and application-specific feature performance.
Figure 4.3: Four non-consecutive images from the surgical data set. The images are non-
consecutive because the camera moves slowly, making the actual motion between frames
small.
Figure 4.4: HOG feature optimization for the cars task. The features optimized for the cars
task (blue line) outperform those optimized for the surgical task (red line). This verifies
that our algorithm captures application-dependent feature invariances. The approximate
theoretical best (light blue line) is only slightly better than our method. The approximate
theoretical worst is reasonable for a 2-class problem.
Figure 4.5: HOG feature optimization for the surgical task. The features optimized for
the surgical task (red line) outperform those optimized for the cars task (blue line). This
verifies that our algorithm captures application-dependent feature invariances. The approx-
imate theoretical best (light blue) is only slightly better than our method. The approximate
theoretical worst is reasonable for a 20-class problem. Because this is a correspondence
task on patches, reasonable performance for small numbers of features or images is not
surprising.
Figure 4.6: Examples from the cars task that were misclassified with the surgical-SIFT but
not the car-SIFT features.
Figure 4.7: SIFT feature optimization for the cars task. The features optimized for the cars
task (blue line) outperform those optimized for the surgical task (red line) and the param-
eters provided by Lowe (green line). This verifies our claim that our algorithm captures
application-dependent feature invariances.
Figure 4.8: SIFT feature optimization for the surgical task. The features optimized for the
surgical task (red line) outperform those optimized for the cars task (blue line). This verifies
our claim that our algorithm captures application-dependent feature invariances. Intuitively,
we expect correspondence to work well even with a single known point. This justifies the
good performance presented even with one training example. The performance of our
surgical-optimized features is nearly identical to Lowe’s (green-line). Lowe’s extensive
optimization was geared toward matching (correspondence) problems.
Figure 4.9: The effect of rotation on our surgical-optimized SIFT features. Our optimiza-
tion removed rotational invariance. This is correct given the invariances that are specific to
the application.
Chapter 5
5.1 Abstract
Recent work on autonomous driving has used laser scanners almost exclusively. The choice
is not surprising: lasers provide 3D range information at a high refresh rate, work in all
light conditions, and even provide basic color information through reflectivity. However,
camera-based sensing has fundamental advantages. Laser data is often ambiguous, even to
a human viewer. Lasers are orders of magnitude more expensive. They also consume more
energy. Nevertheless, lasers are ubiquitous in modern autonomous driving due to the belief
that excellent performance cannot be achieved with cameras given present science. This
chapter aims to dispel that belief. We show that lasers can be used to automatically label
essentially unlimited amounts of camera data. Using this data as a training set, a camera-
only classifier is learned whose performance is arbitrarily close to the laser’s. Specifically,
we show that doubling the amount of training data cuts the error rate by a fixed percentage.
Since labeled training data is automatically produced, the error rate of the camera-based
method can be cut indefinitely, ultimately converging to the laser’s performance. We eval-
uate performance using real data from an autonomous car.
CHAPTER 5. COST EFFECTIVE SENSING 61
5.2 Introduction
Laser scanners enjoy wide use in autonomous cars. For example, in the DARPA Challenge
competitions, lasers were used almost exclusively by the top performing vehicles [39, 40].
The Google autonomous cars rely heavily on laser perception as well [85]. The reasons are
clear. When compared to cameras, laser data is much easier to segment, works in all light
conditions, natively provides scale, and even provides monochrome intensity information
from reflectivity. Laser-based methods have been used to drive over 100,000 autonomous
miles with more than 1,000 miles between failures.
Nevertheless, camera-based methods still have an important role for autonomous driv-
ing. The key argument for vision is that dense, detailed perception information carries
value. Laser point cloud data is often difficult to interpret even for humans. This is not
the case for visual data. Furthermore, cameras are much less expensive. The common
Velodyne scanner used in most robotic cars costs nearly USD$85,000. Lasers usually re-
quire integration of data over time with highly accurate roll and pitch information from a
ring laser gyro which can also increase costs by USD$20,000 or more. In contrast, a high
quality camera is often less than USD$100. This is a dramatic and important difference.
Cameras also consume less power. Unlike lasers, cameras on distinct autonomous cars can
face one another without interference. Finally, on a basic science level, the interpretation
of camera images is an important and interesting problem.
The principal reason lasers dominate robotic car perception is the belief that producing
highly accurate camera-based methods is not possible with present science. This chapter
aims to dispel that belief. We present a vision-only car detection system whose performance
at object classification is arbitrarily close to the state-of-the-art in laser-based driving [84].
The key insight allowing this performance is the size of the training set. We show experi-
mentally that doubling the amount of training data cuts the error rate relative to the laser’s
performance by a fixed percentage. We also show how to automatically acquire extremely
large sets of laser-labeled camera images. Putting these properties together, it follows that
the performance of a camera-only system converges to that of a laser-based one.
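To make the convergence argument concrete: if each doubling of the training set cuts the error relative to the laser by a fixed fraction, the relative error follows a power law in the training set size, which is the straight line seen on log-log axes. A small sketch with hypothetical numbers (the 10% starting error and 20% cut per doubling are illustrative, not measured values):

```python
import math

def relative_error(n, n0, e0, cut):
    """Error relative to the laser after growing the training set from n0
    to n examples, assuming each doubling cuts the relative error by the
    fraction `cut`. On log-log axes this is a straight line."""
    return e0 * (1.0 - cut) ** math.log2(n / n0)
```

Because the labeled set is produced automatically, n is effectively unbounded, so the relative error can be driven arbitrarily low.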
Since the amount of labeled data required is very large, it is essential to avoid hand
labeling. Labels can be automatically acquired using the laser. Specifically, we run the
Figure 5.1: Example of an image labeled by the laser-based recognition system [84]. We
use this method to generate vast amounts of training data. This large corpus of data allows
us to match the performance of the laser with the camera to arbitrary precision. The laser
labeling is suitable for training camera-based classifiers because false positives are low and
false negatives are largely due to segmentation errors rather than some property inherent to
the object.
laser algorithm from [84] and project the laser labels back into a camera image. Figure 5.1
shows an example. Thus, by simply letting an autonomous car drive around, we can acquire
unlimited training examples automatically. Once labels are acquired from the laser, the
laser is no longer needed. Of course, the labels will have the error rate of the laser. But
this is acceptable, since we are merely trying to match the laser’s performance. This is an
example of self-supervised learning.
The basic pipeline of many modern vision algorithms is similar. Low-level informa-
tion is extracted from the image in the form of pixels or features. This information is
encoded using an unsupervised coding scheme. These codes are pooled, often with geo-
metric information, into an image descriptor. Finally, the image descriptors are classified
using a learning algorithm. Overall, we use the learning algorithm of [49] except we con-
sider multi-hypothesis data association using probabilistic and also triangle [22] coding.
We also explore several different learning algorithms in addition to the histogram intersec-
tion kernel SVM of [49] such as Random Forests [17], Nearest Neighbors [23], and linear
SVMs [43]. As suggested in [100], we encode features not pixel patches and at keypoints
instead of a grid.
We rigorously test our algorithm on a real autonomous car over a test set of more than
20,000 manually labeled examples. We show that, for object classification, our camera-
only method can converge arbitrarily close to the performance of the state-of-the-art laser-
based autonomous driving algorithm [84].
Figure 5.2: A 3D point cloud example created by a single rotation of a 3D laser scanner.
The method in [84] tracks objects through these point clouds, segments them, and classifies
them using a learning algorithm. We project these classifications back into a camera frame
and use them as training data for our vision-only algorithm. Nearly unlimited amounts of
training data can be automatically produced.
the laser-only classifier can be projected back into the corresponding frame of a camera.
Examples of this projection are found in Figures 5.1 and 5.3. Figure 5.1 shows convex
hulls for illustration. While those hulls were automatically extracted, we use the simpler
bounding boxes of Figure 5.3 without loss of generality. We notice from Figure 5.3 that the
data is very realistic and represents an interesting and challenging vision problem. We also
notice that the laser system makes occasional errors. See the upper left image in the figure
where several cars are missed. The accuracy in these examples appears to be approximately
95%.
We do not have access to the original, manually labeled laser data. Instead, we have
only the resulting classifier. By simply mounting the laser on a car, driving around, and
running the classifier, we automatically label a vast database of images. This is the self-
supervised portion of our method. The labels are not perfect. Nevertheless, they are very
useful for training. Our goal is to match the laser’s performance, not to achieve 100%
accuracy. We use these labels and the vision algorithm presented in Section 5.5 to train
a vision-only classifier whose performance – in the limit of an enormous training set – is
arbitrarily close to that of the laser.
Codebook construction is similar to [49]. Features are computed over the entire training
set as described in Section 5.5.1. k-Means clustering [14] is used to cluster features into
centroids or bases.
We begin with \(N\) feature vectors \(\vec{f}_1\) to \(\vec{f}_N\), \(K\) clusters centered at \(\vec{\mu}_1, \ldots, \vec{\mu}_K\), and \(NK\)
assignment variables \(a_{1,1}, \ldots, a_{N,K}\). At any given time, for each feature vector \(\vec{f}_i\), one of its
\(K\) assignment variables \(a_{i,1}, \ldots, a_{i,K}\) equals 1 and the rest are zero. The clusters are initially
centered at randomly chosen features. The minimization is then:
\[
\min_{\forall a,\, \forall \vec{\mu}} \sum_{i=1}^{N} \sum_{k=1}^{K} a_{i,k} \, \|\vec{f}_i - \vec{\mu}_k\|^{2}. \tag{5.1}
\]
The optimization alternates two steps. In the E-step, each feature vector is assigned to its
nearest cluster center: \(a_{i,k}\) is set to 1, indicating feature \(i\) is assigned to cluster \(k\), and
\(a_{i,j}\) for all \(j \neq k\) are set to zero. In the M-step, the cluster centers are re-calculated based
on the new assignments:
\[
\vec{\mu}_k = \frac{\sum_{i=1}^{N} a_{i,k} \, \vec{f}_i}{\sum_{i=1}^{N} a_{i,k}}. \tag{5.3}
\]
The optimization ends when the change in the cluster centers between two iterations drops
below a threshold.
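The alternation above can be sketched in a few lines of numpy. This is a generic hard-assignment k-means, not our production implementation; the handling of empty clusters is one of several reasonable choices:

```python
import numpy as np

def kmeans(features, k, tol=1e-6, seed=0):
    """Hard-assignment k-means as in Equations 5.1 and 5.3: the E-step
    assigns each feature vector to its nearest center, the M-step
    recomputes each center as the mean of its assigned features. Iterate
    until the centers move less than `tol`."""
    rng = np.random.default_rng(seed)
    # Initial centers are randomly chosen features.
    centers = features[rng.choice(len(features), size=k, replace=False)]
    while True:
        # E-step: distance from every feature to every center.
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :],
                               axis=2)
        assign = dists.argmin(axis=1)
        # M-step (Equation 5.3): per-cluster mean of the assigned features;
        # an empty cluster simply keeps its previous center.
        new_centers = np.array([
            features[assign == j].mean(axis=0) if np.any(assign == j)
            else centers[j]
            for j in range(k)])
        if np.linalg.norm(new_centers - centers) < tol:
            return new_centers, assign
        centers = new_centers
```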
The cluster centers or bases for K = 5000 appear in Figure 5.4. The figure contains
5000 image patches arranged in a grid. Each patch corresponds to the SURF feature –
over all keypoints in more than 1,200,000 images – that most closely resembles that cluster
center. While SURF keypoints vary in size, we extract patches of fixed size for presentation
purposes. The figure is best viewed under zoom on a high-resolution display. We notice
that the cluster centers correspond to many car parts, road markings, and environment
fragments. Also, illuminated brake lights appear with some frequency.
Codebook Assignment
Notice that each code \(C_1, \ldots, C_K\) is a scalar. We also compute the mean of the code set:
\[
\mu = \frac{1}{K} \sum_{j=1}^{K} C_j. \tag{5.5}
\]
Finally, the training vector is updated. For triangle coding, following [22], the update for
each feature is:
\[
z_j \leftarrow z_j + \max(0, \, \mu - C_j), \qquad j = 1, \ldots, K. \tag{5.6}
\]
Equations (5.6) and (5.7) mean each element of \(\vec{z}\) is updated. Finally, \(\vec{z}\) is normalized so its
elements sum to one. Naturally, a separate \(\vec{z}\) is computed for every image.
Equation (5.6) implies the number of non-zero updates for each feature is approxi-
mately half the number of clusters. Figure 5.5 shows a histogram of the frequency of each
number of updates for K = 5000. The x-axis indicates the number of vector elements
updated. The y-axis shows the number of features whose update modified that many vector
elements. The integral under the curve is over 32,000,000 features. Notice the contrast
to [49] which updates only one vector element per feature – specifically, the maximum
likelihood (minimum score) from an equation similar to Equation (5.4).
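A minimal sketch of this per-feature activation, assuming the scalar codes \(C_j\) are Euclidean distances from the feature to the cluster centers (an assumption, since Equation (5.4) is not reproduced here):

```python
import numpy as np

def triangle_code(feature, centers):
    """Triangle ('soft') coding in the style of [22]: compute the K scalar
    codes as distances to the cluster centers, then keep only the codes
    closer than the mean distance (Equation 5.5), clamping the rest to
    zero. Roughly half the K elements come out non-zero, matching the
    update statistics discussed above."""
    C = np.linalg.norm(centers - feature, axis=1)  # K scalar codes
    mu = C.mean()                                  # mean of the code set
    return np.maximum(0.0, mu - C)
```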
5.6 Experiments
We evaluate the performance of our algorithm experimentally. We implemented our algo-
rithm on the codebase from our autonomous car. As described in Section 5.4, we drive the
car which automatically produces laser point clouds, segments the clouds, tracks the ob-
jects, classifies the tracks, and projects the labels back into a camera image. (Equivalently,
the car can drive autonomously, but this is not a relevant distinction for this paper.) These
noisily labeled, segmented images are used as training data to produce image vectors as de-
scribed in Section 5.5. Notice again, after acquiring the training data, the laser is no longer
needed or used. Thus we produce a vision-only algorithm. Finally, we test our vision-
only algorithm on a manually labeled test set, showing performance for several different
learning algorithms, comparing it to [84, 49], and, importantly, showing that doubling the
amount of training data cuts the error by a fixed percentage.
ground truth for the non-car windows in a programmatic manner by labeling windows as
“non-car” if they did not intersect a car label. Notice this means we did not consider sliding
windows that covered both car and background. We feel these are too subjective to score.
We ran the vision algorithm on the test set and recorded its accuracy in the standard fashion.
For the laser-based algorithm, the laser performed its own segmentation on the frames in
the test set and then classified the segments. Our manually labeled car bounding boxes were
compared to the boxes from the laser’s segmentation and classification. We performed the
comparison automatically in software. We visualized the output to verify our programmatic
comparison matched intuition. The software worked as follows. Each manually labeled car
bounding box was matched into the set of laser bounding boxes. Matching was based on
the location and dimensions of the box. We allowed for 100 pixels of error. Of course,
matching boxes had to be within the same frame. A manual car box that matched a laser
car box was scored as a correct car. Otherwise, an incorrect car was scored. The laser’s
own segmentations were used to score the non-car class. Laser car boxes were matched to
manual car labels (the opposite direction). Laser car boxes that did not match were scored
as incorrect non-cars. Next, we considered the non-car laser boxes. If a non-car laser box
did not intersect any manually labeled car box, it was scored as a correct non-car. Recall,
as with the vision, we do not consider overlapping car and background. Thus, there were
no intersections between non-car laser boxes and manual car labels. If a non-car laser box
entirely covered a manual car, that car would still be marked incorrect as it would not match
a car laser box.
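The car-class half of this scoring procedure can be sketched as follows; the (x, y, w, h) box representation and helper names are hypothetical, and the non-car direction (matching laser boxes back to manual labels) follows the same pattern:

```python
def boxes_match(a, b, tol=100):
    """Boxes are (x, y, w, h) tuples; they match when location and
    dimensions agree to within `tol` pixels, mirroring the 100-pixel
    tolerance above. Frame agreement is assumed checked by the caller."""
    return all(abs(ai - bi) <= tol for ai, bi in zip(a, b))

def score_cars(manual_cars, laser_cars):
    """A manually labeled car box that matches some laser car box scores a
    correct car; otherwise an incorrect car is scored. Returns a
    (correct, incorrect) pair."""
    correct = sum(any(boxes_match(m, l) for l in laser_cars)
                  for m in manual_cars)
    return correct, len(manual_cars) - correct
```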
Importantly, the laser and the camera are doing car detection on the same set of frames
using identical car classes. However, the non-car classes may contain different non-car
segmentations. This is because of the difference between the segmentation methods. The
laser’s native segmentation discards many non-objects through 3D segmentation. The vi-
sion, however, must classify all the sliding windows to reject non-objects. Nevertheless,
the end-result is the same: finding cars and rejecting non-cars on the same test set.
Table 5.1: Results per class for a training size of 20,000 examples.
5.6.4 Results
The principal experimental results can be found in Figure 5.7. The x-axis shows the amount
of training data which was always equally split per class. The y-axis shows the average
error per class relative to the laser’s performance. Notice both axes are log scale. Here,
“Cubic SVM, Sparse” refers to [49]. Clearly, doubling the amount of camera training data
cuts the error relative to the laser’s performance by a fixed percentage. Because unlimited
amounts of training data can be automatically acquired, the performance of the camera can
be arbitrarily close to the laser. This trend holds everywhere except for the linear SVM with
K = 100. This is because the feature space was too small to continue learning. Clearly,
expanding the feature space to K = 1000, as shown in the plot, solves the problem and
continues the trend. Figure 5.6 shows selected vision-only classification results. Green
boxes depict correctly classified cars, and red boxes show false positives. Car detection
is excellent although the number of false positives is slightly higher using the camera.
Although the non-car class accuracies are similar, the camera must try more segmentations.
This increases the total number of misclassifications. In future work, we could resolve
this by adding an image-based segmenter to our method’s pipeline. Specific performance
numbers for N = 20,000 appear in Table 5.1.
Notice the difficulty of the classification task in Figure 5.6. For example, in the upper-left
image, a small white sedan under the tree (next to the green box) is missed. This counts
as an error. Most errors are like that one: the object is far from the sensor, appears small,
and has few features. Importantly, autonomous cars do not need to react to distant objects,
so they achieve much lower driving failure rates than perception accuracy might imply.
To quantify this property, a histogram of accuracy as a function of segment size appears in
Figure 5.8 for the Cubic SVM method with triangle coding.
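The aggregation behind a histogram like Figure 5.8 can be sketched with synthetic data. Everything below is hypothetical (the segment sizes, the accuracy trend, and the bin edges); it only illustrates how per-segment outcomes are binned by size and averaged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic segments: size (e.g. in points or pixels) and whether each was
# classified correctly. The simulated trend matches the observation above:
# larger segments are classified more reliably.
sizes = rng.integers(10, 1000, size=5000)
p_correct = np.clip(0.5 + sizes / 1200.0, 0.0, 1.0)
correct = rng.random(5000) < p_correct

# Aggregate accuracy per size bin, as in the histogram.
bins = np.array([10, 50, 100, 250, 500, 1000])
idx = np.digitize(sizes, bins)  # bin index for each segment
acc = [correct[idx == i].mean() for i in range(1, len(bins))]
```

With data like this, the per-bin accuracies rise with segment size, mirroring the poor performance on small (distant or occluded) segments.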
Figure 5.9 shows classification accuracy as a function of random labeling errors. To
produce this plot, we manually labeled a training set and then programmatically introduced
random labeling errors. The results indicate that a substantial amount of labeling error is
tolerable before classification performance degrades significantly. This suggests there may
be a hard limit on classification performance that is a function of the features or the
information content of the examples: some examples may be too low resolution, too small,
too far away, or simply too ambiguous to classify correctly. However, the errors in the
laser's labeling are not random; they correspond to examples near its decision boundary.
Conclusions from this plot therefore do not necessarily transfer back to the laser labeling
process.
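The label-corruption procedure behind Figure 5.9 can be sketched as follows; the function name and the noise rate below are illustrative, not the experimental code we actually ran.

```python
import numpy as np

def flip_labels(y, noise_rate, rng):
    """Return a copy of binary labels y with the given fraction flipped at random."""
    y = y.copy()
    n_flip = int(round(noise_rate * len(y)))
    flip_idx = rng.choice(len(y), size=n_flip, replace=False)
    y[flip_idx] = 1 - y[flip_idx]
    return y

rng = np.random.default_rng(0)
y = np.zeros(1000, dtype=int)
y[:500] = 1  # balanced car / non-car labels
noisy = flip_labels(y, 0.2, rng)
print((noisy != y).mean())  # → 0.2
```

Sweeping `noise_rate` and retraining the classifier at each setting produces a curve like Figure 5.9. Note this injects *uniform* random flips, which is exactly why its conclusions need not transfer to the laser's boundary-correlated errors.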
Non-linear SVMs are O(N^3) in training complexity. Nevertheless, training time was
reasonable on our machine. The 20,000-example SVM with K = 5000 (the largest case)
took 22,528.5 seconds (about 6.3 hours) to train. Recall that, due to the pyramid, image
vectors are 5K in length. For Cubic SVMs, we did not attempt to parallelize SVM training
or optimize code beyond standard OpenCV, and no attempt was made to dedicate all
system resources to training by, for example, stopping background Linux services. This
training time therefore represents a rough upper bound; we report it to verify that Cubic
SVM training is tractable in a typical environment.
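The O(N^3) cost stems from solving a quadratic program coupled through the N×N kernel (Gram) matrix. As a minimal sketch of the kernel itself, assuming a homogeneous degree-3 polynomial form (the exact kernel constants of our implementation are not restated here):

```python
import numpy as np

def cubic_kernel(X, Z):
    """Homogeneous cubic polynomial kernel: K(x, z) = (x . z)^3."""
    return (X @ Z.T) ** 3

# For N = 20,000 training examples, the Gram matrix alone has 4e8 entries,
# which is why kernel SVM training scales so poorly with N.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = cubic_kernel(X, X)
print(K[2, 2])  # (x3 . x3)^3 = 2^3 → 8.0
```

A linear SVM avoids materializing this matrix entirely, which is why the linear variants in Figure 5.7 trade accuracy for far cheaper training.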
5.7 Discussion
Recent autonomous or robotic cars are often characterized by expensive laser scanners,
due largely to the belief that building effective camera-only systems is beyond the state of
the art. We present a vision-only algorithm for car classification. Our method is trained
in a self-supervised fashion from a laser system, without manual labeling. By utilizing
vast amounts of training data, we show that a camera-only method can get arbitrarily
close to the performance of the laser-based labeling source [84] on a substantial and
realistic test set of more than 20,000 examples.
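The self-supervised pipeline described here can be sketched as follows. All names below (`laser_classify`, `project_to_image`, the frame layout) are hypothetical stand-ins for illustration, not the actual system's API.

```python
# Sketch: the laser classifier labels each segment, and the corresponding
# image crops become training data for the vision-only classifier.
# No manual labeling occurs anywhere in the loop.

def self_supervised_training_set(frames, laser_classify, project_to_image):
    """Build a labeled image dataset from laser-labeled drive data."""
    dataset = []
    for frame in frames:
        for segment in frame["laser_segments"]:
            label = laser_classify(segment)  # "car" / "non-car", from the laser
            patch = project_to_image(frame["image"], segment)  # matching image crop
            dataset.append((patch, label))
    return dataset

# Toy usage with stub sensors:
frames = [{"image": "img0", "laser_segments": ["s0", "s1"]}]
data = self_supervised_training_set(
    frames,
    laser_classify=lambda s: "car" if s == "s0" else "non-car",
    project_to_image=lambda img, s: (img, s),
)
print(len(data))  # → 2
```

Because labels arrive for free as the car drives, the dataset grows without bound, which is what lets camera performance approach the laser's in Figure 5.7.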
Figure 5.3: Several examples of camera frames after labeling by the laser algorithm. These
labels are the training data for our vision-only classifier. This training data is very
inexpensive to acquire: millions of examples can be obtained by simply driving the car around.
Figure 5.4: The 5,000 cluster centers, or bases, used for image classification, each
represented by its closest corresponding image patch.
Figure 5.5: Our algorithm updates a variable number of image vector elements per feature.
The frequency of the number of updates is shown here.
Figure 5.6: Classification results from the vision algorithm. Car detection is excellent.
The number of false positives is slightly higher with the camera than the laser despite the
similar non-car class accuracy (Table 5.1), because the camera must consider more
segmentations in the absence of 3D information.
Figure 5.7: Doubling the amount of camera training data cuts the error, relative to the
laser's performance, by a fixed percentage. By adding labeled camera training data, which
is acquired automatically, we can achieve object classification performance arbitrarily
close to the laser's.
Figure 5.8: Histogram of accuracy as a function of segment size. Poor performance occurs
on small segments, which imply occlusion or significant distance from the camera.
[2] U.S. National Highway Traffic Safety Administration, Fatality Analysis Reporting
System (FARS). http://www-fars.nhtsa.dot.gov/Main/index.aspx.
[4] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing over-
complete dictionaries for sparse representation. IEEE Trans. on Signal Processing,
54(11):4311–4322, 2006.
[6] T. Barfoot. Online visual motion estimation using FastSLAM with SIFT features. In
Proc. of the 2005 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS),
2005.
[7] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features.
Computer Vision and Image Understanding (CVIU), 110(3):346–359, 2008.
BIBLIOGRAPHY 81
[10] M. Bertozzi and A. Broggi. GOLD: A parallel real-time stereo vision system for
generic obstacle and lane detection. IEEE Trans. on Image Processing, 7(1):62–81,
1998.
[13] M. Bicego, A. Lagorio, E. Grosso, and M. Tistarelli. On the use of SIFT features
for face authentication. In Proc. of the IEEE Conf. on Computer Vision and Pattern
Recognition (CVPR) Workshop, 2007.
[15] Y-L. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for
recognition. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition
(CVPR), 2010.
[18] C.A. Brooks and K. Iagnemma. Vibration-based terrain classification for planetary
exploration rovers. IEEE Transactions on Robotics, 21(6):1185–1191, 2005.
[19] M. Brown and D. G. Lowe. Recognising panoramas. In Proc. of the Ninth IEEE Int.
Conf. on Computer Vision (ICCV), 2003.
[23] T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Trans. on
Information Theory, 13(1):21–27, 1967.
[26] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In
Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2005.
[28] F. Dellaert, D. Fox, W. Burgard, and S. Thrun. Monte Carlo localization for mobile
robots. In Proc. of the International Conference on Robotics and Automation
(ICRA), 1999.
[31] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene
categories. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition
(CVPR), 2005.
[33] D. Fox, W. Burgard, and S. Thrun. Controlling synchro-drive robots with the dy-
namic window approach to collision avoidance. In Proceedings of the IEEE/RSJ
International Conference on Intelligent Robots and Systems, 1996.
[36] C. Harris and M. Stephens. A combined corner and edge detector. In Proc. of the
4th Alvey Vision Conf., pages 147–151, 1988.
[37] M. Hebert, C. Thorpe, and A. Stentz. Intelligent Unmanned Ground Vehicles: Au-
tonomous Navigation Research at Carnegie Mellon University. Dordrecht: Kluwer
Academic, 1997.
[38] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief
nets. Neural Computation, 18(7):1527–1554, 2006.
[39] K. Iagnemma and M. Buehler, editors. Journal of Field Robotics: Special Issue on
the DARPA Grand Challenge, Parts 1 and 2. Wiley, 2006.
[40] K. Iagnemma, M. Buehler, and S. Singh, editors. Journal of Field Robotics: Special
Issue on the DARPA Urban Challenge, Parts I, II, and III. Wiley, 2008.
[41] K. Iagnemma, S. Kang, H. Shibly, and S. Dubowsky. Online terrain parameter esti-
mation for wheeled mobile robots with application to planetary rovers. IEEE Trans-
actions on Robotics and Automation, 2004.
[42] P. Jensfelt, D. Austin, O. Wijk, and M. Anderson. Feature based condensation for
mobile robot localization. In Proc. of the International Conference on Robotics and
Automation (ICRA), pages 2531–2537, 2000.
[43] T. Joachims. Training linear SVMs in linear time. In Proc. of the ACM Conf. on
Knowledge Discovery and Data Mining (KDD), 2006.
[44] S. Julier and J. Uhlmann. A new extension of the Kalman filter to nonlinear systems.
In International Symposium on Aerospace/Defense Sensing, Simulation and Controls,
Orlando, FL, 1997.
[46] A. Kelly and A. Stentz. Rough terrain autonomous mobility, part 1: A theoretical
analysis of requirements. Autonomous Robots, 5:129–161, 1998.
[47] A. Kelly and A. Stentz. Rough terrain autonomous mobility, part 2: An active vision,
predictive control approach. Autonomous Robots, 5:163–198, 1998.
[49] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid
matching for recognizing natural scene categories. In Proc. of the IEEE Conf. on
Computer Vision and Pattern Recognition (CVPR), 2006.
[50] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks
for scalable unsupervised learning of hierarchical representations. In Proc. of the
Int. Conf. on Machine Learning (ICML), 2009.
[51] H. Lee, Y. Largman, P. Pham, and A. Y. Ng. Unsupervised feature learning for
audio classification using convolutional deep belief networks. In Proc. of Neural
Information Processing Systems (NIPS), 2009.
[53] A. Lookingbill, J. Rogers, D. Lieb, J. Curry, and S. Thrun. Reverse optical flow
for self-supervised adaptive autonomous robot navigation. International Journal of
Computer Vision, 74(3), 2007.
[55] D. Lowe. Distinctive image features from scale-invariant keypoints. Int. Journal of
Computer Vision (IJCV), 60(2):91–110, 2004.
[56] D. G. Lowe. Object recognition from local scale-invariant features. In Proc. of the
Seventh Int. Conf. on Computer Vision (ICCV), pages 1150–1157, 1999.
[57] B. D. Lucas and T. Kanade. An iterative image registration technique with an ap-
plication to stereo vision. In Proc. of the 1981 DARPA Imaging Understanding
Workshop, pages 121–130, 1981.
[58] J. Luo, Y. Ma, E. Takikawa, S. Lao, M. Kawade, and B.-L. Lu. Person-specific SIFT
features for face recognition. In Proc. of the IEEE Int. Conf. on Acoustics, Speech,
and Signal Processing (ICASSP), 2007.
[59] J. Mairal, M. Elad, and G. Sapiro. Sparse learned representations for image restora-
tion. In Proc. of the 4th World Conf. of the Int. Assoc. for Statistical Computing
(IASC), 2008.
[60] J. Mairal, M. Elad, and G. Sapiro. Sparse representation for color image restoration.
IEEE Trans. on Image Processing, 17(1):53–69, 2008.
[61] J. Michels, A. Saxena, and A. Ng. High-speed obstacle avoidance using monocular
vision and reinforcement learning. In Proceedings of the Seventeenth International
Conference on Machine Learning (ICML), 2005.
[62] H. Mobahi, R. Collobert, and J. Weston. Deep learning from temporal coherence in
video. In Proc. of the Int. Conf. on Machine Learning (ICML), 2009.
[65] NOVA. The Great Robot Race (Documentary of the DARPA Grand Challenge),
2006. http://www.pbs.org/wgbh/nova/darpa/.
[66] D. Pomerleau and T. Jochem. Rapidly adapting machine vision for automated vehi-
cle steering. IEEE Expert, 11(2):19–27, 1996.
[68] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: Transfer
learning from unlabeled data. In Proc. of the 24th Int. Conf. on Machine Learning
(ICML), 2007.
[69] M. Ranzato, F.-J. Huang, Y.-L. Boureau, and Y. LeCun. Unsupervised learning of
invariant feature hierarchies with applications to object recognition. In Proc. of the
IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2007.
[72] D. Sadhukhan, C. Moore, and E. Collins. Terrain estimation using internal sen-
sors. In Proceedings of the 10th IASTED International Conference on Robotics and
Applications (RA), Honolulu, Hawaii, USA, 2004.
[73] A. Saxena, S. Chung, and A. Ng. Learning depth from single monocular images.
In Proceedings of Conference on Neural Information Processing Systems (NIPS),
Cambridge, MA, 2005. MIT Press.
[74] S. Se, D. Lowe, and J. Little. Vision-based mobile robot localization and mapping
using scale-invariant features. In Proc. of the Int. Conf. on Robotics and Automation
(ICRA), 2001.
[75] J. Shi and C. Tomasi. Good features to track. In Proc. of the 9th IEEE Conf. on
Computer Vision and Pattern Recognition (CVPR), 1994.
[76] S. Shimoda, Y. Kuroda, and K. Iagnemma. Potential field navigation of high speed
unmanned ground vehicles in uneven terrain. In Proceedings of the IEEE Interna-
tional Conference on Robotics and Automation (ICRA), Barcelona, Spain, 2005.
[77] C. Shoemaker, J. Bornstein, S. Myers, and B. Brendle. Demo III: Department of de-
fense testbed for unmanned ground mobility. In Proceedings of SPIE, Vol. 3693,
AeroSense Session on Unmanned Ground Vehicle Technology, Orlando, Florida,
1999.
[80] D. Stavens, G. Hoffmann, and S. Thrun. Online speed adaptation using supervised
learning for high-speed, off-road autonomous driving. In Proc. of the Int. Joint Conf.
on Artificial Intelligence (IJCAI), 2007.
[81] D. Stavens and S. Thrun. A self-supervised terrain roughness estimator for off-
road autonomous driving. In Proc. of the Int. Conf. on Uncertainty in Artificial
Intelligence (UAI), 2006.
[82] D. Stavens and S. Thrun. Unsupervised learning of invariant features using video.
In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR),
2010.
[83] K. L. R. Talvala, K. Kritayakirana, and J. Christian Gerdes. Pushing the limits: From
lanekeeping to autonomous racing. Annual Reviews in Control, 2011.
[84] A. Teichman, J. Levinson, and S. Thrun. Towards 3d object recognition via classifi-
cation of arbitrary object tracks. In Proc. of the Int. Conf. on Robotics and Automa-
tion (ICRA), 2011.
[86] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press, 2005.
[87] S. Thrun, M. Montemerlo, and A. Aron. Probabilistic terrain analysis for high-speed
desert driving. In G. Sukhatme, S. Schaal, W. Burgard, and D. Fox, editors,
Proceedings of the Robotics: Science and Systems Conference, Philadelphia, PA,
2006.
[94] N. Vlassis, B. Terwijn, and B. Kröse. Auxiliary particle filter robot localization from
high-dimensional sensor observations. In Proc. of the International Conference on
Robotics and Automation (ICRA), 2002.
[95] J. Weston, F. Ratle, and R. Collobert. Deep learning via semi-supervised embedding.
In Proc. of the Int. Conf. on Machine Learning (ICML), 2008.
[96] S. Winder and M. Brown. Learning local image descriptors. In Proc. of the IEEE
Conf. on Computer Vision and Pattern Recognition (CVPR), 2007.
[97] S. Winder, G. Hua, and M. Brown. Picking the best daisy. In Proc. of the IEEE
Conf. on Computer Vision and Pattern Recognition (CVPR), 2009.
[98] L. Wiskott and T. Sejnowski. Slow feature analysis: Unsupervised learning of in-
variances. Neural Computation, 14(4):715–770, 2002.
[100] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using
sparse coding for image classification. In Proc. of the IEEE Conf. on Computer
Vision and Pattern Recognition (CVPR), 2009.
[101] W. Zhang, G. Zelinsky, and D. Samaras. Real-time accurate object detection using
multiple resolutions. In Proc. of the Int. Conf. on Computer Vision (ICCV), 2007.
[102] Q. Zhu, S. Avidan, M.-C. Yeh, and K.-T. Cheng. Fast human detection using a
cascade of histograms of oriented gradients. In Proc. of the IEEE Conf. on Computer
Vision and Pattern Recognition (CVPR), 2006.