Vous êtes sur la page 1sur 14



Adel Fakih, Samantha Ng, and John Zelek
University of Waterloo

Inertial sensors and cameras are two sources of odometry that can provide a means of continuous localization when the GPS signal is blocked because of jamming or interference. Furthermore, they augment the GPS
localization with orientation (heading) information. Although there have been many approaches to fuse GPS
with vision and inertial measurements, most of these approaches are still subject to many conditions,
assumptions, and specific applications that do not guarantee the most accurate localization from inferred
measurements of the available sensors at all times. We propose a comprehensive framework to fuse GPS intelligently with vision and inertial data by automatically, and adaptively, selecting the best combination of sensors
and the best set of parameters. Our framework, based on multiple hypotheses, is a roadmap for real time systems that provide high accuracy, outlier robustness, and scalability when mapping a large number of 3D points.
Les capteurs inertiels et les camras sont deux sources dodomtrie qui fournissent une mthode de
localisation en continu lorsque le signal GPS est bloqu cause dun brouillage intentionnel ou d'interfrence.
En outre, ils amliorent la localisation du GPS avec de linformation sur lorientation (le cap). Mme sil y a
eu plusieurs approches pour fusionner le GPS avec des mesures visuelles et dinertie, la plupart de ces
approches sont encore assujetties de nombreuses conditions, suppositions et applications prcises qui ne
garantissent pas la localisation la plus exacte des mesures infres des capteurs disponibles en tout temps.
Nous proposons un cadre approfondi pour fusionner de faon intelligente le GPS avec les donnes visuelles
et les donnes inertielles en choisissant, de faon automatique et adaptative, la meilleure combinaison de
capteurs et le meilleur ensemble de paramtres. Bas sur des hypothses multiples, notre cadre est une
feuille de route pour les systmes en temps rel qui fournissent une grande exactitude, une robustesse centre
les valeurs aberrantes et une variabilit dimensionnelle pour cartographier un grand nombre de points

Adel Fakih

1. Introduction
Mobile mapping vehicles, as well as pedestrians, equipped with GPS are often subject to satellite
shading and signal dropout when operating along
tree-lined streets, in wooded areas, and environments where tall buildings can block reception.
Vehicle mounted GPS positioning is further hindered by structures which cause satellites to go in
and out of view, limiting a GPS receivers ability to
provide continuous and accurate position information. In addition, multi-path problems encountered
as a result of signals bouncing off reflective surfaces
often cause GPS systems to be unable to provide
reliable positioning information in this type of area.
One way to provide uninterrupted measurements of
the position, roll, pitch, and true heading of moving
vehicles is to integrate precision GPS with
advanced inertial technology, similar to what
planes and rockets use; however, such technology is
unfeasible for the automotive or pedestrian sector
due to the expense.
As humans, we are very successful in navigating,
localizing, and finding our way with our eyes and

inertial system (vestibular system). In this investigation, we mimic sensor fusion algorithms that combine
the results of computer vision algorithms with the outputs of GPS and inexpensive inertial sensor measurements. Our system is fast, scalable, and flexible; it
selectively fuses measurements from the given sensors
in a robust, probabilistic, multi-hypothesis approach
that always gives the best possible localization and
mapping. We also include a magnetometer (i.e. compass) in the system apparatus. The magnetometer,
providing information about absolute orientation,
enhances the ability of the system to avoid large drift
after losing the GPS signal. Our system is not tied to
any specific image feature detector and tracker.
Therefore, any feature detector can be used to detect
features in the images and any tracker (or matcher) can
be used to track those features between images. Our
system is resilient to the problems and failures of the
feature tracker; firstly, it uses multiple hypotheses and,
secondly, it uses additional information from the other
sensors. We provide preliminary results to prove the
validity and feasibility of our proposed approach.

Samantha Ng

John Zelek

GEOMATICA Vol. 63, No. 4, 2009, pp. 383 to 396

This paper is organized into six sections.

Section 1 is this introduction. Section 2 provides an
overview of previous work on fusing GPS with
either vision or inertial sensors; it constitutes a preamble for the presentation of the proposed framework. Section 3 introduces the notations and equations to be used in the framework. Section 4 is a
detailed explanation of the proposed approach.
Section 5 presents some of our results and Section
6 is our set of conclusions.

2. Background
The Global Positioning System (GPS) has long
been used in combination with inertial sensors for
localization and mapping [Grewal et al. 2007].
Together, GPS and inertial measurements units
(IMUs) naturally complement each other. GPS
compensates for drift in inertial measurements while
inertial measurements are available at higher sampling rates and in a wider range of environments.
There have been many approaches introduced in
recent years for combining GPS and inertial measurements [Grewal et al. 2007; Li et al. 2006;
Mohamed and Schwarz 1999; Sasiadek et al. 2000].
These approaches range from loose coupling to
ultra-tight coupling and from using different estimation techniques such as the conventional Kalman
Filter, the adaptive Kalman Filter, and the fuzzy
adaptive Kalman filter. Although inertial sensors can
operate in more environments (e.g. indoors) when
compared to GPS, they can only compensate for loss
in GPS signal for a short period of time and will drift
quickly from the true trajectory.
More recently, localization methods use vision
with filter-based approaches [Chiuso et al. 2002;
Davison et al. 2007; Eade and Drummond 2006] or
with odometry-based approaches [Yang and

Figure 1: Rotational velocity error on an indoor sequence.


Pollefeys 2003; Nister et al. 2006; Konolige et al.

2007]. However, there are two main problems with
vision systems. First, there is positional drift; drift
that occurs due to the disappearance of old features
and the appearance of new features in filter-based
approaches, or drift that occurs due to the accumulation of errors, especially during rotations in
odometry-based approaches. The drift problem is
sometimes addressed using loop-closure, as in
SLAM (Simultaneous Localization and Mapping)
[Davison et al. 2007; Eade and Drummond 2006],
although it is not generalizable to all scenarios. The
second problem with vision is its high dependence
on the quality of feature detection and tracking
components, which are sensitive to lighting conditions and occlusions. When tracking fails, the
vision-based localization fails catastrophically. In
addition, the effect of feature outliers is a serious
problem for vision-based solutions. Vision has been
used previously with GPS applications mainly to
help in road detection and following [Bai and Wang
2008; Goldbeck et al. 2000; Xia et al. 2006; Hu and
Uchimura 2006], or in the context of augmented
reality applications [Carceroni et al. 2006;
Ramachandran et al. 2007; Forlani et al. 2005],
where GPS aids vision in localization and mapping.
To address the problems of positional drift and
feature detection quality, many authors propose to
combine vision and inertial measurements together
[Jones et al. 2007; Chikuma et al. 2004; Qian et al.
2001; Roumeliotis et al. 2002; Konolige et al. 2007].
Generally, the combination of vision and inertial
measurements gives more accurate results than
either one on its own provided that uncertainties in
both measurements are well-modeled. We showed
these results in the context of trying to determine the
camera velocity from an optical flow field and from
the measurements of a gyroscope using an optimal
non-linear minimization [Fakih and Zelek 2008]. As
shown in Figures 1 and 2, we demonstrated that, with
a good understanding of measurement uncertainties
in both sensors, we can usually get a combined estimate that is better than the estimates obtained from
each of the measurements by itself.
However, if we have an inaccurate assumption
of measurement uncertainty, which may be the likely case for visual measurements, the combination
could lead to results that are worse than the estimates
obtained from the individual measurements. This is
shown in Figure 3, where incorrect uncertainties
were assumed in the combination and led to results
with higher error than the gyro errors.
From these results, we conclude that although
the fusion of vision and inertial is a good strategy
in many cases, in some cases it might not be the
right thing to do. Accordingly, a good system

should be able to fuse data when the fusion is efficient and it should be able to rely on individual sensors when measurements become very noisy or
unknown uncertainties are suspected. One
approach to resolve a similar problem is based on
using multiple Kalman filters [Drolet et al. 2000];
a bank of Kalman filters is used, one of which is
running at all times. Whenever new sensor measurements become available, the suitable filter is
selected from the bank of available filters. The
Adaptive Kalman Filter [Gustafsson and Firm
2000] is another way to resolve this problem. The
adaptive filter tries to estimate the noise covariance
in the measurements; however, this is not suitable
for vision, since features are short-lived and, hence,
not practical for estimating their noise covariance.

(a) Rotational velocity error.

3. Methodology Preliminaries
We wish to determine the motion of a moving
platform from the readings of a set of sensors mounted on the platform. Figure 4 shows our platform with
three types of sensors.
We use a Trimble ProXH GPS receiver, a Point
Grey Flea2 firewire camera, and a Memsense
nIMU inertial unit, which provides measurements
of the rotational velocity, the translational acceleration, and the components of the Earths magnetic
field along the three axes of the unit. The pose of
the platform is represented by a rotation matrix
RSO(3) (Lie group of rotations in R3) and a transla-

tion vector T = T N ,TE,TD, Texpressed with respect

to the inertial frame which we chose to be a local
North East Down (NED) frame. The velocity of the
moving platform is represented by a rotational

(b) Rotational velocity error.

Figure 2: Results on two outdoor sequences.

velocity vector = x, y,z, and a translational


velocity V = V x,V y,Vz, expressed in the body frame

which we chose to be coincident with the camera
frame (i.e. oriented so that the axis coincides with
the optical axis of the camera). The motion evolves
according to:

Tt+1 =e
Rt+1 =e

t x

t +V t



where is the skew symmetric matrix defined as


0 z y
z 0 x .
y x 0


For filtering purposes, we represent R by the

vector of its exponential coordinates = x,y,z

Figure 3: Example of fusing gyro and vision when the

assumed uncertainties of the two sensors are inaccurate.
The rotational error in the fused estimate is worse than
the gyro estimate.

Assuming a perspective projection model, X c

projects to the point u = u,v, f

on the image plane:

f c .


= proj X c def


f is the focal length. A calibrated Euclidean
framework is assumed, and without loss of generality, f is taken as 1. In this paper, visual features are
tracked using a pyramidal implementation of the
Lucas-Kanade feature tracker [Bouguet 2000].
Figure 5 shows a test sequence with the feature
tracked between two of the frames.

Figure 4: Setup.

with R = e t

and = Logso3 R .

The measurements that the platform receives

from the sensors are the absolute position of the
GPS receiver unit in the Earth Centered Earth Fixed

ECEF frame T g , the accelerometer's measurement

of the translational acceleration a ins in the IMU frame,

which is a sum of the true acceleration of the IMU

and the gravity of the earth g, the gyroscope's meas

urement of the rotational velocity ins in the IMU

frame, and the magnetometers measurement of the

orientation of the IMU Rmag of the unit with respect
to the NED frame. To determine the origin and orient our system so that the camera frame coincides
with the NED frame, we leave the platform at rest
for 5 seconds and use the average of the GPS, magnetometer, and accelerometer measurements over
these first few seconds. The camera provides the
positions of salient features that are tracked in the
images. The 3D positions of the world points corresponding to the tracked features are added to the
parameters to be estimated for two reasons: (1) the
simultaneous estimation of these 3D points with the
motion adds to the accuracy of the vision system,
and (2) these points constitute a map that can be
used for mapping applications. The 3D position of
a world feature is represented in the inertial frame

by the vector X = X,Y,Z

and in the camera frame

Figure 5: A frame from Sequence 2 and the directions

of the optical flow used to track the features. Mean
feature displacement is four pixels per frame.

4. Methodology
4.1 Meta Level Overview
We wish to determine the position and velocity of the platform accurately, at every time step,
regardless of the environment (e.g. indoor, outdoor), the natural conditions (e.g. bright, dark), or
the type of motions (e.g. fast, slow, jerky). Using a
probabilistic formulation, we want to determine, at
every time step t, the probability distribution of the

state vector S = ,T,,V,a,X 1,.,X N given all the

measurements T GPS , ins,a ins,mag i.e. inertial

by X c = X c,Y c,Z c .
At time t, the position of the feature point X

The notation {,} is used to represent the concatenation of vectors on top of each other. We denote the

with respect to the camera frame is:

Xc t = R t X t + T t .

and U = u 1,.,u N i.e. visual features from 0 to t.


desired distribution as p S t U 0 : t ,T g 0 : t ,

ins 0:t ,V ins 0:t ,m 0:t .

A traditional way to do the estimation of such
systems is by using the Extended Kalman Filter
(EKF). However, the EKF has two main problems in
this context; firstly, its computationally intractable
if the number of 3D features tracked by the vision
system is high and, secondly, it assumes that all the
uncertainties are Gaussian and perfectly known. To
address these problems, Rao-Blackwellized
Particle Filters (RBPF) have been introduced. The
first generation of these filters [Montemerlo 2003]
reduced the computational complexity by capitalizing on the important characteristic that features are
independent given the camera motion. Then, a particle filter can be used in which each particle is
composed of a motion sample and N low dimensional filter EKFs, one for each of the N features.
However, there was still a need for further reduction of the number of particles and improving the
convergence properties of the filter. A drastic
improvement [Eade and Drummond 2006;
Montemerlo et al. 2003; Sim et al. 2007] over the
simple Rao-Blackwellized particle filter was found
in using the optimal importance function, which is
the distribution of the estimates given not only the
state at the previous time-step but also the most
recent observation. This ensures that samples are
propagated in an efficient way, which reduces the
number of wasted samples. However, in the mentioned approaches, the way samples are drawn from
this importance function is not optimal in the sense
of providing particles that lie in the regions of highest probability of the posterior. This is not optimal
because the function does not use the uncertainty of
a motion sample during the update process and,
therefore, it updates the feature locations based on
the predicted motion, which might not be very accurate if the motion is changing rapidly. To deal with
that, some researchers [Karlsson et al. 2005; Sim et
al. 2007; Elinas et al. 2006] use mixed proposal
distributions that combine hypotheses from the
motion and observation models.
The approach we present in this paper borrows
some principles from the Rao-Blackwellized particle filter; we divide the state vector into two parts,
where the second one is estimated, analytically
conditioned, on the first part. However, our
approach is intrinsically different from the RBPF.
First, we use hypotheses that are much more general than the particles in the Rao-Blackwellized particles. Our hypotheses represent not only a motion
estimate, but also a set of sensors to be used.
Hence, we have multiple types of hypotheses. In
the current paper, we limit the types of hypotheses
to three: vision only hypotheses, vision-inertial

hypotheses, and inertial only hypotheses. The

approach is more general to allow for the easy
incorporation of any additional sensor. Furthermore,
our hypotheses do not represent an exact probability distribution over the 3D parameters; however,
they can be seen as a mixture of Gaussian representations of this distribution. Another major difference
between our approach and the Rao-Blackwellized
approach is that in our approach, instead of dividing
the state vector into a motion part and a feature
locations part, we divide it into a first part that consists of the camera motion and a small number K of
random 3D points, and another part containing the
remaining 3D points conditioned on the first part.
The K features that are associated with the first part
are chosen randomly for every hypothesis by drawing putative sets from the available tracked features
when the hypothesis is created. This new division, as
we will show, has very important implications on the
efficiency and the mechanism of filter operations.
As we mentioned, the state vector in our

approach is divided into a part S 1 containing the

A drastic
over the
simple RaoBlackwellized
particle filter
was found in
using the

motion plus a few number K of features, and another

part S 2 consisting of the remaining features. K typically should be a small number (>5). A relatively
large number leads to better results but increases
the computational expense. In this paper we settle
on K = 8,

S = S 1,S 2
S 1 = t ,T t , t ,V t ,X L l t , ,
L = randperm N ,l 1,k
S 2 = X i t ,., ,i = 1,.,N, i L l ,


where randperm(n) is a random permutation of the

numbers from 1 to N. The elements (3D feature

locations in the second part S 2 of S are independent

given S 1. The desired probability distribution can be
written, therefore, as follows (n.b. dropping the
time indices):

p S U,Tg,ins,mag =

p S 1,S 2 U 1,U 2,Tg,ins,Vins,mag


where U 1 are the image projections of the features in

S 1 and U 2 are the image projections of the features in S 2.
The rationale of dividing the state vector in the
presented way is that we can now approximate the
desired distribution as follows:

S 1,S 2 U 1,U 2,Tg,ins,Vins,mag =

p S 1 U 1,Tg,ins,Vins,mag p S 2 S 1U 2


since S 2 is independent of U 1 given S 1. The approximation comes from the fact that S 1 is now independ-

As mentioned before, although the combination

of inertial measurements with vision generally provides estimates that are better than those provided by
either one of the sensors, in some situations, if one of
the sensors measurements is really poor or if the
uncertainties in the measurements are not well
known, the combination is not fruitful. For this reason, we keep three kinds of hypotheses:

ent only on the first part of the visual measurements

U 1 which is a set of few K features instead of being

Hypotheses using vision without inertial sensors corresponding to the distribution

dependent on all the features U 1 + U 2 . However,

p S 1 U 1,Tgps p S 2 S 1,U 2 . The part S 1 of

we can deal with that by forming multiple hypotheses in which the features in U 1 are different. This is

this hypothesis is estimated only from the

somewhat similar to the RANSAC [Fischler and

Bolles. 1981] paradigm and it provides robustness
to outliers in the tracking of the features.
Now every one of those hypotheses can be

U 1 data associated with it, and then S 2 is estimated from the U 2 data using the motion in S 1.

distribution p S 1 U 1,T g, ins,V ins,mag

efficiently estimated, since estimating S 1 can be

p S 2 S 1,U 2 . The part S 1 of this hypothesis

done in a low dimensional Extended Kalman Filter

is estimated from the U 1 data associated with it

see Section 4.2 , and then every 3D feature in S 2

and from the inertial measurements than S 2 is

can be estimated in a 3-dimensional EKF condition

estimated from the U 2 data using the motion

on S 1 see Secion 4.3 . This way of estimation is a

in S 1.

drastic improvement over the RBPF, since S 2 relies

on its update process of a motion in S 1 that is
already updated and not only predicted from the
previous time instant.

Hypotheses using both vision with inertial sensors (plus magnetometer) corresponding to the

And one hypothesis corresponding to inertial

measurements (plus magnetometer) without

vision: p S 1 Tg,ins,Vins,mag p S 2 S 1,U .

In this case, S 1 contains only motion estimates
while all the features are in S 2. S 1, in this hypothe-

Figure 6: Particles used in the filter.


sis, is estimated using an Adaptive Kalman

Filter (AKF) [Gustafsson and Firm 2000],
which estimates also the noise covariance ins
in the inertial measurements. This is done to
deal with cases in situations where ins is not
known or changes with time. The estimate of
ins from this hypothesis is fed to the other
Figure 6 shows the three types of particles in our
system. Figures 7, 8, and 9 depict the estimation
process of each type of hypothesis.
At every time step, the estimation of all the
hypotheses can be done in a GPS mode or in a nonGPS mode depending on the availability of the GPS
signal. Also, a procedure is designed to determine the
best hypotheses (see Section 4.4). Figure 10 shows
the mean visual reprojection errors of 20 hypotheses
over 160 frames and the chosen hypothesis.

4.2 Estimation of Part 1 for

Every Hypothesis
Three types of dynamical systems can be
definedone for each of our hypotheses. The equations in these systems are a subset of the following
system of equations:

1 X L l t + 1 = X L l t l = 1.K

2 T t + 1 = e w t xT t + V t
3 t + 1 = Log so3 e w t xe t

Figure 7: Estimation of a vision hypothesis: Part 1 is estimated using an EKF

whose measurements are the projections of the features in Part 1 and the GPS
fix (when available). Then every feature in Part 2 is estimated using an individual EKF based on the motion in Part 1.

4V t+1 =V t +at
5 t+1 = t +at
6at+1 =at
7 u i t = proj R t X i t + T t
+ n vis,i t i = 1.N
8 T GPS t = CT t + B + n GPS
9 a ins t = RTci a t + RT t g + R t T ci + a bias + n ins,acc
10 ins t = RTci t + bias + n ins,gyro
11 mag t = t + n mag

with X 0 =

X 0,T

0 = T 0. = Log so3 R is the

vector of the exponential coordinates of R, i.e. R

= exp

. 0 = 0V 0 = V 0 0 = 0,

n vis,i ~ N 0, vis,i t

where vis,i t is the covari-

Figure 8: Estimation of a vision-inertial hypothesis: Part 1 is estimated using an

EKF whose measurements are the projections of the K features in Part 1, the
inertial measurements, and the GPS fix (when available). Then every feature in
Part 2 is estimated using an individual EKF based on the motion in Part 1.

ance matrix representing the uncertainty on the

position of the feature; it is obtained from the feature tracker. C is a matrix representing the transformation from the NED reference frame to the ECEF
reference frame, and B is a vector from the earth cen-

ter to the origin of the NED frame. Rci is the rotation between the body frame and the nIMU frame.
Tci is the vector from the nIMU frame to the body
frame. n GPS 0, GPS is the noise from uncertainty in

the GPS measurement; it can be obtained from the

Dilution of Precision (DOP) values provided by the

GPS unit. n ins,acc 0, ins,acc and n ins,gyro 0, ins,gyro

are the noise from uncertainties in the accelerometers
and gyroscopes of the inertial sensor, which are estimated adaptively using an Adaptive Kalman Filter

Figure 9: Estimation of an inertial hypothesis: Part 1 of the hypothesis is estimated using an adaptive KF [Gustafsson and Firm 2000] whose measurements
are the inertial measurements and the GPS fix (when available). Then every
feature in Part 2 is estimated using an individual EKF based on the motion in
Part 1 and whose measurement is the projection of the considered feature.

the form (n.b. dropping the index j):

S1 t = f S1 t 1 + ns t
n s t ~ N 0, s
m t = h S1 + nm t
n m t ~ N 0, m , l = 1,.,k


where f is a subset of the system of equations in (8),

dependent on the hypothesis type, and h is the system of equations 8.6, 8.7 and 8.10.

Therefore, to update S 1 from t 1 to t, we fol-

low the EKF approach described in [Chiuso et al.

2002] and use the standard notations of dynamical
systems filtering where x tt 1 refers to an esti-

Figure 10: Mean reprojection errors of Sequence 2 of the State Vector at

frame 160 on all the previous frames.

mate of x predicted from the previous estimate and

in the inertial hypothesis. a bias and bias are the

biases for the accelerometers and gyroscopes given
in the hardware specfications. n mag 0 mag is the

vious estimate and the most recent measurement.

The filtering equations are:

noise the from uncertainty in the magnetometer

measurement and is also estimated adaptively.
The vision hypotheses will have equations 8.9
and 8.10 missing. Inertial hypotheses will have
equations 8.1 and 8.7 missing. In the non-GPS
mode, all hypotheses will have equation 8.8 missing. Note that the K features in equation 8.1 are
chosen randomly from the N total features and are
different for every hypothesis. There are two reasons for which GPS measurements are not available at all times. The first reason is that the GPS
provide measurements at 1Hz, whereas vision and
inertial measurements are much more frequent (e.g.
30Hz). The second reason is that the GPS receiver
may not be in line of site of enough satellites to
determine its position (e.g. in tunnels or buildings).
Unfortunately, the scale and reference frame are not
observable with vision (i.e. visual features do not
provide measurements of absolute scale). Thus,
when the GPS signal disappears, we have to fix the
scale and reference frame for vision-type hypotheses. To do this, whenever the GPS measurement is
not available, we replace equation 8.1 with

4.2.1 EKF Formulation


For each hypothesis S 1, we have a system of



S 1 tt 1 = f S 1 t 1 t 1
Pm tt 1 = F t 1 Pm t 1 t 1 .

F t 1 + m



S 1 tt = S 1 tt 1 + A t u t h S 1 tt 1
Pm tt = t Pm tt 1 t + A t n A t

t = H t P t H T t + n

t = P tt 1 H t
t =IAt H t

t .


S 1 t 1t 1
S 1
H t = h S 1 tt 1
S 1



Ft =

Z L l t + 1 = Z L l t , l = 48
x L l t + 1 = x L t t , l = 48
where X L l = Z L l x L l ,Z L l y L l ,Z L l

x t t refers to an estimate of x that uses the pre-

For calculation details of the above matrices

refer to (6).

4.3 Estimation of Part 2



up i t,k = proj R k X i t + T k


p i t,k = G m,i k Q k G m,i k + G x,i t,k Px t G x,i t,k ,

For every feature, X i of S 2, we have the following

dynamical system:



t =


G x,i t,k =

t 1 + n x t ,0.3cmn x t ~ N 0, x

u i t = proj R tt

t + T tt +n vis,j t ,

Px is the covariance of

k=0 i=1

Note that in the above equation, the measurement

model is formulated in terms of R tt , T tt


and not in terms of R tt 1 , T tt 1 . This

is one of the major advantages of our approach. Let

Q be the covariance matrix of t , T t . Let

Gm be the linearization of the measurement equation

with respect to R t and T t at R j t ,T t :

t +T t

R t ,T t


e vis j,t = e j,t 1

R j t ,T t

p i t,k + vis,j k

up i t,k u i ,

where is a robust function which we simply

define as (e) = e if 0 e < th and (e) = th if e
th, where th is a threshold. The errors are expressed
in terms of mean pixel error per feature. However,
this would require us to keep track of all the image
projections and to recompute the reprojection
errors of the state vector on all the frames, at every
time, which can not be done in real time. Therefore,
we use an approximation of this error, which can be
computed recursively at every frame as follows:


up i t,k u i



X ij.

e vis j,t =

proj R t

X i t

n t ~ N 0, vis,j

G m,j
t =

proj R j k X i t + T k

t +

up i t,t u i

p i t,t + vis t

up i t,t u i



In this case, instead of computing the reprojection error of the state vector S t on all the previous


Then X i can be updated using similar equations

as in Section 4.2.1. However, instead of using vis,i
we replace it by vis,i + G m,i
m,i to take into con

frames k, k = 0,.,t, we are computing for every frame

k the reprojection error of S k on this frame, which

would be equivalent to the following:

sideration the effect of the uncertainty in the camera


e vis j,t = 1

up ij k,k u i

4.4 Evaluating the Hypotheses

A weight for the hypothesis, based on the error

vis (j,t), then is defined as follows:

At every time instant, we rank each of the

hypotheses based on their fitness to the available
4.4.1 Weights from Vision Measurements
Ideally, to test a given hypothesis j at time t
using the visual measurements, we have to determine its re-projection error evis(j,t) on all the previous frames. The reprojection of the ith 3D point in
the jth at time t hypothesis on the kth frame and its
covariance matrix are given by:

k=0 i=1

P ji k,k + vis,j k

wvis (j,t) = exp(vis (j,t)).

up ij k,k u i ,


The weights from all the hypotheses are, then,

normalized to sum to one.
4.4.2 Inertial Weights
Inertial weights are measured by the deviations
of V and from the inertial measurements. This

error consists of two parts: one for the accelerometer

and one for the gyros. We define the mean and covariance matrices of the translational and rotational velocities of the jth hypothesis in the IMU reference frame:

PTj t is the covariance of T in the hypothesis j.

ap ins t =RTci a j t + RT,j t g + R j t T + a

The weights are computed as:

p jins,acc t = RTciPa t R ci

wGPS(j,t) = exp(GPS(j,t)),

p jins t = RTci j t + wbias

p jins,gyro t = RTciPj t R ci



and then normalized to 1 over all the considered


Pa t and P t are the covariance matrices of a and

in the hypothesis j.

4.4.5 Combining Weights

To compute an aggregate weight for each hypothesis, we simply treat the weights from each type of
measurement as probabilities and multiply them:

e acc j,t = e acc j,t 1 +


ap ins

p ins,acc

t a ins t

t + ins,acc t

ap ins t a ins t

e gyro j,t = e gyro j,t 1 +

p ins t ins t

ins,gyro p ins t ins t




The weights corresponding to acc(j,t) and

gyro(j,t) are calculated as follows:
wacc(j,t) = exp(acc(j,t))
wgyro(j,t) = exp(gyro(j,t))


The gyro and accelerometer weights of the

hypotheses are then normalized to 1.
4.4.3 Magnetometer Weights
The magnetometer weights are measured by
the deviation of from the magnetometer measure


e mag j,t = e mag j,t 1 +


t mag t

mag + pj t

t mag t .



Pj t is the covariance of in the hypothesis j.

The weights are computed as:
wmag(j,t) = exp(mag(j,t)),


and then normalized to 1 over all the considered

4.4.4 GPS Weights
The GPS weights are measured by the deviation of T from the GPS measurement:

e GPS j,t = e GPS j,t 1 +


CT t B T GPS t


CT t + B T GPS t .

w(j,t) = wvis(j,t)wacc(j,t)wgyro(j,t)wmag(j,t)wGPS(j,t).
Note that when the visual measurements are not
good, all the reprojection errors are very high. The
robust function then leads, almost, to equal visual weights for the hypotheses, and then w(j,t) will
not be affected by the vision weights.

4.5 System Initialization

and Operation
Upon starting our system, a number of features
are detected in the first image and then tracked
through subsequent frames. We randomly sample
Ns putative sets of K = 8 features each. Every set is
used to generate a hypothesis (vision or vision inertial). A hypothesis is generated by initializing an
EKF in which the initial translation is set to 0, the

rotation with respect to the NED frame is determined from the accelerometer and magnetometer at
rest, and the translational and rotational velocities
are set to zero. After a number of frames, when the
EKFs are assumed to have converged for every
hypothesis, we triangulate the 3D positions of the
features in its Part 2 using the motion in its Part 1.
The covariance of every 3D feature is determined
using the unscented transform [Julier and Uhlmann
1997], the covariance of the motion in the Part 1
and the covariance of its image feature. The triangulated 3D features and their covariances are used
as an initialization of the individual EKFs in Part 2.
A similar procedure is used to insert new features.
When a new feature is detected (n.b. in our system
we keep the number of visible features constant), it
is tracked for a certain number of frames (e.g. from
t1 to t2). Then, for each of the available hypotheses,
the 3D location of this feature is determined by triangulation and used as an initial value for the EKFs
of this feature in every hypothesis.

When one of the K features in the part S 1 of a

given hypothesis is not tracked (e.g. disappears or
gets occluded), it is replaced by another feature
from the part S 2 of the hypothesis and this latter
feature is chosen as the one with the lowest variance. The replacement is done by eliminating the
occluded, or disappearing, feature from the state
vector and substituting its mean by the mean of the
new feature. In addition, the lines and rows in the
covariance matrix corresponding to the old feature
are set to zero and only the part corresponding to
the autocovariance of the old feature is replaced by
the covariance of the new feature.
When the weight of a given hypothesis
becomes smaller than some threshold th1 and if
none of its individual weights is higher than another thereshold th2, the hypothesis is discarded and
replaced by a new hypothesis generated from one
then interchanging the feature having the largest

of the good hypotheses. The new hypothesis S 1


is generated by copying the old hypothesis S and


variance in S 1
variance in


with the feature having the lowest


5 Experimental Results with

the Vision System
As a proof of concept for our hypothesis-based
approach, we tested the system using vision only
hypotheses. We calibrated a Flea2 camera, from
Point Grey, using a checkerboard pattern and used
the camera to capture several sequences of images.
In some of the sequences, the camera is fixed on a
pan-tilt unit; and in other sequences, the camera is
moved by hand. We report, hereafter, the results on
three sequences. We compare the reprojection error
of the state vector estimated by our system, in all the
considered frames, against the error of the full
covariance EKF [Chiuso et al. 2002] and of the optimal nonlinear offline optimization (i.e. bundle
adjustment) [Triggs et al. 2000]. The bundle adjustment is a Gauss-Newton batch estimation of the
camera poses at all the frames and the positions of
all the considered points by minimizing the reprojection errors of the 3D parameters in all the images.
The reprojection errors in a given frame are determined by projecting the estimates of the 3D points
on this frame using the corresponding camera pose
estimate, then taking the difference between the

obtained projections and the detected features positions. We use these reprojection errors to evaluate
different estimators. If the errors in the image projections are Gaussians, the estimator corresponding
to the smallest reprojection error would be the
Maximum Likelihood estimator; otherwise, in the
non-Gaussian case it would be the BLUE estimator
(Best Linear Unbiased Estimator). Figures 11, 12,
and 13 show the reprojection errors of the state vectors at time 160, of all the (20) considered hypotheses on all the frames, from time 10 till 160, for the
three sequences respectively. The blue solid lines
correspond to the different hypotheses; the green
dash-dotted line corresponds to the optimal bundle
adjustment solution; the dashed red line corresponds
to the chosen hypothesis; and the dotted black line
corresponds to the full EKF solution. The chosen
hypothesis in our system always corresponds to one
of the best hypotheses; it has a reprojection error
much lower than the EKF error and it is also very
close to the error of the optimal bundle adjustment.
Figure 14 shows the estimated positions of the scene
landmarks, where the diameter of the circles is proportional to the depth of the feature. The larger the
diameter of circle, the farther it is from the camera
center. A qualitative examination of these features
shows that their obtained depth is in accord with the
distance to the cameras and that the 3D positions are
obtained up to a good accuracy. The results presented, thus far, show the effectiveness and efficiency of
our multi-hypothesis approach. Since the kernel of
every hypothesis is the EKF filter of its Part 1, and
based on previous results of vision-inertial EKFs
[Jones et al. 2007], we expect the same performance
of our approach, with added accuracy, when we
include inertial measurements in the tests.

Figure 11: Mean reprojection errors of Sequence 1 of the State Vectors at

frame 160 on all the previous frames.

6. Conclusion

Figure 12: Mean reprojection errors of Sequence 2 of the State Vectors at

frame 160 on all the previous frames.

We proposed an efficient system to fuse GPS

with vision and inertial measurements for localization and mapping. The system maintains multiple
hypotheses that are more general than the particles in
a Particle Filter and allow our system much more
flexibility in selecting the set of measurements with
the set of parameters that give the most accurate
results. The system is independent of the image features tracking algorithm while the multiple hypotheses scheme provides robustness to the potential failures of the tracker. We showed preliminary results
that outlined the superiority of our framework versus
the traditional EKF in terms of giving the best possible localization and mapping. Our framework can
easily be generalized so that it is scalable to different
networks of sensors. Future work will focus on an
extensive experimental evaluation of our approach
with very long sequences in addition to various
indoor and outdoor conditions. We will rigorously
test our systems robustness to extended loss of GPS
data, as well as confirm the superior computational
efficiency and scalability implied by the derivation
of our probabilistic framework when compared to
other sensor fusion algorithms.


Figure 13: Mean reprojection errors of Sequence 3 of the State Vectors at

frame 160 on all the previous frames.

Figure 14: Recovered 3D points. The diameter of the circles is proportional to the distance from the camera.

Bai, L. and Y. Wang. 2008. Fusing image, GPS and GIS

for road tracking using multiple condensation particle filters. In 2008 IEEE Intelligent Vehicles
Symposium, pp. 162167.
Bouguet, J. 2000. Pyramidal implementation of the
Lucas Kanade feature tracker description of the
algorithm. Intel Corporation, Microprocessor
Research Labs.
Carceroni, R., A. Kumar, and K. Daniilidis. 2006.
Structure from motion with known camera positions. In Proc. of IEEE Intl. Conf. on Computer
Vision and Pattern Recognition, pp. 477484.
Chikuma, M., Z. Hu, and K. Uchimura. 2004. Fusion of
vision, GPS and 3D-gyro in solving camera global
registration problem for vision-based road navigation. IEIC Technical Report (Institute of
Electronics, Information and Communication
Engineers), 103(643):7176.
Chiuso, A., P. Favaro, H. Jin, and S. Soatto. 2000. 3-D
motion and structure from 2-D motion causally integrated over time: Implementation. In ECCV 00:
Proceedings of the 6th European Conference on
Computer Vision-Part II, pp. 734750, London,
U.K., Springer-Verlag.
Chiuso, A., P. Favaro, H. Jin, and S. Soatto. 2002.
Structure from motion causally integrated over
time. IEEE Transactions on Pattern Analysis and
Machine Intelligence, pp. 523535.

Davison, A., I. Reid, N. Molton, and O. Stasse. 2007.

MonoSLAM: Real-time single camera SLAM.
IEEE Transactions on Pattern Analysis and
Machine Intelligence, pp. 10521067.
Drolet, L., F. Michaud, and J. Cote. 2000. Adaptable sensor fusion using multiple Kalman filters. In 2000
IEEE/RSJ International Conference on Intelligent
Robots and Systems, 2000. (IROS 2000).
Proceedings, volume 2.
Eade, E. and T. Drummond. 2006. Scalable monocular
SLAM. Computer Vision and Pattern Recognition,
2006 IEEE Computer Society Conference on, 1.
Elinas, P., R. Sim, and J. Little. 2006. SLAM: Stereo
vision SLAM using the Rao-Blackwellised particle
filter and a novel mixture proposal distribution. pp.
Fakih, A. and J. Zelek. 2008. On the benefits of using
gyroscope measurements with structure from
motion. In ECCV workshop on Multi-camera and
Multi-modal Sensor Fusion (M2SFA208), Oct.
Fischler, M.A. and R.C. Bolles. 1981. Random sample
consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24:381395.
Forlani, G., R. Roncella, and F. Remondino. 2005.
Structure and motion reconstruction of short mobile
mapping image sequences. In VII Conference on
Optical 3D Measurement Techniques, pp. 265274.
Goldbeck, J., B. Huertgen, S. Ernst, and L. Kelch. 2000.
Lane following combining vision and DGPS. Image
and Vision Computing, 18(5):425433.
Grewal, M., L. Weill, and A. Andrews. 2007. Global
positioning systems, inertial navigation, and integration. Wiley-Interscience.
Gustafsson, F. and K. Firm. 2000. Adaptive filtering and
change detection. Wiley New York.
Hu, Z. and K. Uchimura. 2006. Fusion of vision, GPS
and 3D gyro data in solving camera registration
problem for direct visual navigation. International
Journal of ITS Research, 4(1):312.
Jones, E., A. Vedaldi, and S. Soatto. 2007. Inertial structure from motion with autocalibration. In
Proceedings of the ICCV Workshop on Dynamical
Julier, S. and J. Uhlmann. 1997. A new extension of the
Kalman filter to nonlinear systems. 3.
Karlsson, N., E. di Bernardo, J. Ostrowski, L. Goncalves,
P. Pirjanian, and M. Munich. 2005. The vSLAM
algorithm for robust localization and mapping. pp.
Konolige, K., M. Agrawal, and J. Sola. 2007. Large scale
visual odometry for rough terrain. In Proc.
International Symposium on Robotics Research.
Li, Y., J. Wang, C. Rizos, P. Mumford, and W. Ding.
2006. Low-cost tightly coupled GPS/INS integration based on a nonlinear Kalman filtering design.
In Proceedings of ION National Technical Meeting,
pp. 1820.
Mohamed, A. and K. Schwarz. 1999. Adaptive Kalman
filtering for INS/GPS. Journal of Geodesy,

Montemerlo, M. 2003. FastSLAM: A factored solution to

the simultaneous localization and mapping problem
with unknown data association. Ph.D. thesis,
Robotics Institute, Carnegie Mellon University,
Pittsburgh, PA, July.
Montemerlo M., S. Thrun, D. Koller, and B. Wegbreit.
2003. FastSLAM 2.0: An improved particle filtering
algorithm for simultaneous localization and mapping that provably converges. 18:11511156.
Nister. D., O. Naroditsky, and J. Bergen. 2006. Visual
odometry for ground vehicle applications. Journal
of Field Robotics, 23(1):320.
Point Grey Research, Inc. http://www.ptgrey.com.
Qian, G., R. Chellappa, and Q. Zheng. 2001. Robust
structure from motion estimation using inertial data.
Journal of the Optical Society of America A,
Ramachandran, M., A. Veeraraghavan, and R. Chellappa.
2007. Fast bilinear SfM with side information. In
IEEE 11th International Conference on Computer
Vision, 2007. ICCV 2007, pp. 18.
Roumeliotis, S., A. Johnson, and J. Montgomery. 2002.
Augmenting inertial navigation with image-based
motion estimation. In IEEE International
Conference on Robotics and Automation, 2002.
Proceedings. ICRA02, volume 4.
Sasiadek, J., Q. Wang, and M. Zeremba. 2000. Fuzzy
adaptive Kalman filtering for INS/GPS data fusion.
In Intelligent Control, 2000. Proceedings of the
2000 IEEE International Symposium on, pp.
Sim, R., P. Elinas, and J. J. Little. 2007. A study of the
Rao-Blackwellised particle filter for efficient and
accurate vision-based SLAM. Int. J. Comput.
Vision, 74(3):303318.
Triggs, B., P. McLauchlan, R. Hartley, and A. Fitzgibbon.
2000. Bundle adjustmenta modern synthesis.
Vision Algorithms: Theory and Practice,
Xia, T., M. Yang, and R. Yang. 2006. Vision based global localization for intelligent vehicles. In 2006 IEEE
Intelligent Vehicles Symposium, pp. 571576.
Yang, R. and M. Pollefeys. 2003. Multi-resolution realtime stereo on commodity graphics hardware. In
2003 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, 2003.
Proceedings, volume 1.

Adel Fakih is a Ph.D. student in Systems
Design Engineering at the University of Waterloo.
His research interests are in visual motion analysis,
structure from motion, visual navigation and tracking human limbs in video sequences. He holds a
Masters degree in Computer and Communications
Engineering from the American University of

Beirut, in Lebanon. His masters research was about

the detection and extraction of vertebrae from spinal
X-ray images. He did his undergraduate studies at
the Lebanese University where he graduated with a
Diploma in Electrical and Electronics Engineering.
Samantha Ng is currently pursuing a M.Sc.
degree at the University of Waterloo. Her main
research area is computer vision and human pose
estimation. Samantha received her B.Sc. from the
University of Waterloo, specializing in mechatronics, statistics, and management science.
John Zelek is Associate Professor in Systems
Design Engineering at the University of Waterloo,
with expertise in the area of intelligent mechatronic
control systems that interface with humans; specifically, (1) wearable sensory substitution and assistive
devices; (2) probabilistic visual and tactile percep-


tion; (3) wearable haptic devices including their

design, synthesis and analysis; and (4) human-robot
interaction. Zelek has co-founded two start-up companies. He was awarded the best paper award at the
international IEEE/IAPRS Computer and Robot
Vision conference. He was also awarded with best
paper at the 1995 IEEE SMC conference, and his
student won best student paper at the 2002 Vision
Interface conference. He was awarded a 2006 &
2008 Distinguished Performance Award from the
Faculty of Engineering at the University of
Waterloo. He was also awarded the 2004 Young
Investigator Award by the Canadian Image
Processing & Pattern Recognition Society for his
work in robotic vision. His research has been reported by the media in the Globe & Mail, Discovery
Channel, CBC, CTV, and many other venues such as
the Science & Vie Junior Hors Srie (renowned
young science magazine in France). o