
VISION AND INERTIAL SENSING

Adel Fakih, Samantha Ng, and John Zelek

University of Waterloo

Inertial sensors and cameras are two sources of odometry that can provide a means of continuous localization when the GPS signal is blocked because of jamming or interference. Furthermore, they augment the GPS

localization with orientation (heading) information. Although there have been many approaches to fuse GPS

with vision and inertial measurements, most of these approaches are still subject to many conditions,

assumptions, and specific applications that do not guarantee the most accurate localization from inferred

measurements of the available sensors at all times. We propose a comprehensive framework to fuse GPS intelligently with vision and inertial data by automatically, and adaptively, selecting the best combination of sensors

and the best set of parameters. Our framework, based on multiple hypotheses, is a roadmap for real time systems that provide high accuracy, outlier robustness, and scalability when mapping a large number of 3D points.


Adel Fakih

afakih@uwaterloo.ca

1. Introduction

Mobile mapping vehicles, as well as pedestrians, equipped with GPS are often subject to satellite

shading and signal dropout when operating along

tree-lined streets, in wooded areas, and environments where tall buildings can block reception.

Vehicle mounted GPS positioning is further hindered by structures which cause satellites to go in

and out of view, limiting a GPS receiver's ability to

provide continuous and accurate position information. In addition, multi-path problems encountered

as a result of signals bouncing off reflective surfaces

often cause GPS systems to be unable to provide

reliable positioning information in this type of area.

One way to provide uninterrupted measurements of

the position, roll, pitch, and true heading of moving

vehicles is to integrate precision GPS with

advanced inertial technology, similar to what

planes and rockets use; however, such technology is

unfeasible for the automotive or pedestrian sector

due to the expense.

As humans, we are very successful in navigating,

localizing, and finding our way with our eyes and

inertial system (vestibular system). In this investigation, we mimic this capability with sensor fusion algorithms that combine the results of computer vision algorithms with the outputs of GPS and inexpensive inertial sensors. Our system is fast, scalable, and flexible; it

selectively fuses measurements from the given sensors

in a robust, probabilistic, multi-hypothesis approach

that always gives the best possible localization and

mapping. We also include a magnetometer (i.e. compass) in the system apparatus. The magnetometer,

providing information about absolute orientation,

enhances the ability of the system to avoid large drift

after losing the GPS signal. Our system is not tied to

any specific image feature detector and tracker.

Therefore, any feature detector can be used to detect

features in the images and any tracker (or matcher) can

be used to track those features between images. Our

system is resilient to the problems and failures of the

feature tracker; firstly, it uses multiple hypotheses and,

secondly, it uses additional information from the other

sensors. We provide preliminary results to prove the

validity and feasibility of our proposed approach.

Samantha Ng

sjng@uwaterloo.ca

John Zelek

jzelek@uwaterloo.ca

Section 1 is this introduction. Section 2 provides an

overview of previous work on fusing GPS with

either vision or inertial sensors; it constitutes a preamble for the presentation of the proposed framework. Section 3 introduces the notations and equations to be used in the framework. Section 4 is a

detailed explanation of the proposed approach.

Section 5 presents some of our results and Section

6 is our set of conclusions.

2. Background

The Global Positioning System (GPS) has long

been used in combination with inertial sensors for

localization and mapping [Grewal et al. 2007].

Together, GPS and inertial measurement units

(IMUs) naturally complement each other. GPS

compensates for drift in inertial measurements while

inertial measurements are available at higher sampling rates and in a wider range of environments.

There have been many approaches introduced in

recent years for combining GPS and inertial measurements [Grewal et al. 2007; Li et al. 2006;

Mohamed and Schwarz 1999; Sasiadek et al. 2000].

These approaches range from loose coupling to

ultra-tight coupling, and employ different estimation techniques such as the conventional Kalman

Filter, the adaptive Kalman Filter, and the fuzzy

adaptive Kalman filter. Although inertial sensors can

operate in more environments (e.g. indoors) when

compared to GPS, they can only compensate for loss

in GPS signal for a short period of time and will drift

quickly from the true trajectory.

More recently, localization methods use vision

with filter-based approaches [Chiuso et al. 2002;

Davison et al. 2007; Eade and Drummond 2006] or

with odometry-based approaches [Yang and


2007]. However, there are two main problems with

vision systems. First, there is positional drift; drift

that occurs due to the disappearance of old features

and the appearance of new features in filter-based

approaches, or drift that occurs due to the accumulation of errors, especially during rotations in

odometry-based approaches. The drift problem is

sometimes addressed using loop-closure, as in

SLAM (Simultaneous Localization and Mapping)

[Davison et al. 2007; Eade and Drummond 2006],

although it is not generalizable to all scenarios. The

second problem with vision is its high dependence

on the quality of feature detection and tracking

components, which are sensitive to lighting conditions and occlusions. When tracking fails, the

vision-based localization fails catastrophically. In

addition, the effect of feature outliers is a serious

problem for vision-based solutions. Vision has been

used previously with GPS applications mainly to

help in road detection and following [Bai and Wang

2008; Goldbeck et al. 2000; Xia et al. 2006; Hu and

Uchimura 2006], or in the context of augmented

reality applications [Carceroni et al. 2006;

Ramachandran et al. 2007; Forlani et al. 2005],

where GPS aids vision in localization and mapping.

To address the problems of positional drift and

feature detection quality, many authors propose to

combine vision and inertial measurements together

[Jones et al. 2007; Chikuma et al. 2004; Qian et al.

2001; Roumeliotis et al. 2002; Konolige et al. 2007].

Generally, the combination of vision and inertial

measurements gives more accurate results than

either one on its own provided that uncertainties in

both measurements are well-modeled. We showed

these results in the context of trying to determine the

camera velocity from an optical flow field and from

the measurements of a gyroscope using an optimal

non-linear minimization [Fakih and Zelek 2008]. As

shown in Figures 1 and 2, we demonstrated that, with

a good understanding of measurement uncertainties

in both sensors, we can usually get a combined estimate that is better than the estimates obtained from

each of the measurements by itself.

However, if we have an inaccurate assumption

of measurement uncertainty, which may be the likely case for visual measurements, the combination

could lead to results that are worse than the estimates

obtained from the individual measurements. This is

shown in Figure 3, where incorrect uncertainties

were assumed in the combination and led to results

with higher error than the gyro errors.

From these results, we conclude that although

the fusion of vision and inertial is a good strategy

in many cases, in some cases it might not be the

right thing to do. Accordingly, a good system

should be able to fuse data when the fusion is efficient and it should be able to rely on individual sensors when measurements become very noisy or

unknown uncertainties are suspected. One

approach to resolve a similar problem is based on

using multiple Kalman filters [Drolet et al. 2000];

a bank of Kalman filters is used, one of which is

running at all times. Whenever new sensor measurements become available, the suitable filter is

selected from the bank of available filters. The

Adaptive Kalman Filter [Gustafsson and Firm

2000] is another way to resolve this problem. The

adaptive filter tries to estimate the noise covariance

in the measurements; however, this is not suitable

for vision, since features are short-lived and, hence, it is not practical to estimate their noise covariance.

3. Methodology Preliminaries

We wish to determine the motion of a moving

platform from the readings of a set of sensors mounted on the platform. Figure 4 shows our platform with

three types of sensors.

We use a Trimble ProXH GPS receiver, a Point Grey Flea2 firewire camera, and a Memsense nIMU inertial unit, which provides measurements of the rotational velocity, the translational acceleration, and the components of the Earth's magnetic field along the three axes of the unit. The pose of the platform is represented by a rotation matrix R ∈ SO(3) (the Lie group of rotations in R^3) and a translation T ∈ R^3 with respect to the inertial frame, which we chose to be a local North East Down (NED) frame. The velocity of the moving platform is represented by a rotational velocity ω and a translational velocity V, expressed in the body frame, which we chose to be coincident with the camera frame (i.e. oriented so that the z axis coincides with the optical axis of the camera). The motion evolves according to:

T_{t+1} = e^{(ω_t)_×} T_t + V_t,    R_{t+1} = e^{(ω_t)_×} R_t,    (1)

where, for ω = (ω_x, ω_y, ω_z)^T,

(ω)_× ≝ [[0, -ω_z, ω_y], [ω_z, 0, -ω_x], [-ω_y, ω_x, 0]].    (2)

Figure 2: Results on two outdoor sequences.
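As an illustration, the motion model in equations (1) and (2) can be simulated with a few lines of numpy. This is a sketch of ours (the function names are not from the paper), using Rodrigues' formula for the matrix exponential of a skew-symmetric matrix:

```python
import numpy as np

def skew(w):
    # (w)_x : the skew-symmetric matrix of equation (2)
    wx, wy, wz = w
    return np.array([[0.0, -wz,  wy],
                     [ wz, 0.0, -wx],
                     [-wy,  wx, 0.0]])

def so3_exp(w):
    # e^{(w)_x} via Rodrigues' formula
    th = np.linalg.norm(w)
    W = skew(w)
    if th < 1e-12:
        return np.eye(3) + W
    return np.eye(3) + (np.sin(th) / th) * W + ((1.0 - np.cos(th)) / th**2) * (W @ W)

def motion_step(R, T, w, V):
    # One step of equation (1): rotate the pose by the rotational
    # velocity w and advance it by the translational velocity V.
    E = so3_exp(w)
    return E @ R, E @ T + V
```

For example, with w = 0 the rotation is unchanged and T simply advances by V.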

Figure 3: Results when the assumed uncertainties of the two sensors are inaccurate. The rotational error in the fused estimate is worse than the gyro estimate.


A point with coordinates X_c = (X_c, Y_c, Z_c)^T in the camera frame projects to the image point u = (u_t, v_t, f)^T:

(u_t, v_t, f)^T = proj(X_c) ≝ (f X_c / Z_c, f Y_c / Z_c, f)^T.    (4)

f is the focal length. A calibrated Euclidean

framework is assumed, and without loss of generality, f is taken as 1. In this paper, visual features are

tracked using a pyramidal implementation of the

Lucas-Kanade feature tracker [Bouguet 2000].

Figure 5 shows a test sequence with the features tracked between two of the frames.
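Equations (3) and (4) amount to a pinhole projection of a world point after mapping it into the camera frame. A minimal sketch (the helper names are ours):

```python
import numpy as np

def proj(Xc, f=1.0):
    # Equation (4): perspective projection; f = 1 in the calibrated framework.
    X, Y, Z = Xc
    return np.array([f * X / Z, f * Y / Z, f])

def project_world_point(Xw, R, T, f=1.0):
    # Equation (3) followed by equation (4): world frame -> camera frame -> image.
    return proj(R @ Xw + T, f)
```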

Figure 4: Setup.

The orientation is parameterized by Θ ∈ R^3, with R = e^{(Θ)_×} and Θ = Log_{SO(3)}(R). The measurements available from the sensors are: the absolute position T_GPS of the GPS receiver unit in the Earth Centered Earth Fixed (ECEF) frame; the measurement of the translational acceleration a_ins in the IMU frame, which includes the gravity of the earth g; the gyroscope's measurement ω_ins of the rotational velocity; and the magnetometer's measurement of the orientation R_mag of the unit with respect to the NED frame. To determine the origin and orient our system so that the camera frame coincides

with the NED frame, we leave the platform at rest

for 5 seconds and use the average of the GPS, magnetometer, and accelerometer measurements over

these first few seconds. The camera provides the

positions of salient features that are tracked in the

images. The 3D positions of the world points corresponding to the tracked features are added to the

parameters to be estimated for two reasons: (1) the

simultaneous estimation of these 3D points with the

motion adds to the accuracy of the vision system,

and (2) these points constitute a map that can be

used for mapping applications. The 3D position of a world feature is represented in the inertial frame by X = (X, Y, Z)^T.

Figure 5: The optical flow used to track the features. Mean feature displacement is four pixels per frame.

4. Methodology

4.1 Meta Level Overview

We wish to determine the position and velocity of the platform accurately, at every time step,

regardless of the environment (e.g. indoor, outdoor), the natural conditions (e.g. bright, dark), or

the type of motions (e.g. fast, slow, jerky). Using a

probabilistic formulation, we want to determine, at every time step t, the probability distribution of the state given the image measurements and the GPS, inertial, and magnetometer measurements {T_GPS, a_ins, ω_ins, γ_mag}. At time t, the position of a feature point X in the camera frame is

X_c(t) = R(t) X(t) + T(t).    (3)

The notation {.,.} is used to represent the concatenation of vectors on top of each other. We denote the desired distribution as p(S_t | U_{0:t}, T_{g,0:t}, a_{ins,0:t}, ω_{ins,0:t}, γ_{mag,0:t}).

A traditional way to do the estimation of such

systems is by using the Extended Kalman Filter

(EKF). However, the EKF has two main problems in

this context: firstly, it is computationally intractable

if the number of 3D features tracked by the vision

system is high and, secondly, it assumes that all the

uncertainties are Gaussian and perfectly known. To

address these problems, Rao-Blackwellized

Particle Filters (RBPF) have been introduced. The

first generation of these filters [Montemerlo 2003]

reduced the computational complexity by capitalizing on the important characteristic that features are

independent given the camera motion. Then, a particle filter can be used in which each particle is

composed of a motion sample and N low-dimensional EKFs, one for each of the N features.

However, there was still a need for further reduction of the number of particles and improving the

convergence properties of the filter. A drastic

improvement [Eade and Drummond 2006;

Montemerlo et al. 2003; Sim et al. 2007] over the

simple Rao-Blackwellized particle filter was found

in using the optimal importance function, which is

the distribution of the estimates given not only the

state at the previous time-step but also the most

recent observation. This ensures that samples are

propagated in an efficient way, which reduces the

number of wasted samples. However, in the mentioned approaches, the way samples are drawn from

this importance function is not optimal in the sense

of providing particles that lie in the regions of highest probability of the posterior. This is not optimal

because the function does not use the uncertainty of

a motion sample during the update process and,

therefore, it updates the feature locations based on

the predicted motion, which might not be very accurate if the motion is changing rapidly. To deal with

that, some researchers [Karlsson et al. 2005; Sim et

al. 2007; Elinas et al. 2006] use mixed proposal

distributions that combine hypotheses from the

motion and observation models.

The approach we present in this paper borrows

some principles from the Rao-Blackwellized particle filter; we divide the state vector into two parts,

where the second part is estimated analytically, conditioned on the first part. However, our

approach is intrinsically different from the RBPF.

First, we use hypotheses that are much more general than the particles in the Rao-Blackwellized particle filter. Our hypotheses represent not only a motion

estimate, but also a set of sensors to be used.

Hence, we have multiple types of hypotheses. In

the current paper, we limit the types of hypotheses

to three: vision-only hypotheses, vision-inertial hypotheses, and inertial-only hypotheses. However, the approach is more general to allow for the easy

incorporation of any additional sensor. Furthermore,

our hypotheses do not represent an exact probability distribution over the 3D parameters; rather, they can be seen as a mixture-of-Gaussians representation of this distribution. Another major difference

between our approach and the Rao-Blackwellized

approach is that in our approach, instead of dividing

the state vector into a motion part and a feature

locations part, we divide it into a first part that consists of the camera motion and a small number K of

random 3D points, and another part containing the

remaining 3D points conditioned on the first part.

The K features that are associated with the first part

are chosen randomly for every hypothesis by drawing putative sets from the available tracked features

when the hypothesis is created. This new division, as

we will show, has very important implications on the

efficiency and the mechanism of filter operations.

As we mentioned, the state vector in our approach is divided into a first part S_1 consisting of the motion and K randomly chosen features, and a second part S_2 consisting of the remaining features. K typically should be a small number (> 5). A relatively large number leads to better results but increases the computational expense. In this paper we settle on K = 8:

S = {S_1, S_2},
S_1 = {Θ_t, T_t, ω_t, V_t, X^{L(l)}_t, l = 1,...,K},    L = randperm(N),
S_2 = {X^i_t, i = 1,...,N, i ∉ {L(1),...,L(K)}}.    (5)

numbers from 1 to N. The elements (3D feature positions) of S_2 are estimated conditioned on S_1. The desired probability distribution can be written, therefore, as follows (n.b. dropping the time indices):

p(S | U, T_g, ω_ins, γ_mag) = p(S_1 | U, T_g, ω_ins, γ_mag) p(S_2 | S_1, U),    (6)

where U_1 denotes the image projections of the features in S_1 and U_2 the image projections of the features in S_2. The rationale of dividing the state vector in the presented way is that we can now approximate the desired distribution as follows:

p(S | U, T_g, ω_ins, γ_mag) ≈ p(S_1 | U_1, T_g, ω_ins, γ_mag) p(S_2 | S_1, U_2),    (7)

since S_2 is independent of U_1 given S_1. The approximation comes from the fact that S_1 is now taken to be independent of U_2.
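The random partition of equation (5) can be sketched as follows; `split_state` is our name for this putative-set draw, not the paper's:

```python
import random

def split_state(n_features, K=8, rng=random):
    # Draw K feature indices for S1 (the first K entries of randperm(N));
    # the remaining features form S2, estimated conditioned on S1.
    L = rng.sample(range(n_features), K)
    S2 = [i for i in range(n_features) if i not in set(L)]
    return L, S2
```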

Although the combination of inertial measurements with vision generally provides estimates that are better than those provided by either one of the sensors, in some situations, if one of the sensors' measurements is really poor or if the uncertainties in the measurements are not well known, the combination is not fruitful. For this reason, we keep three kinds of hypotheses:

- Hypotheses using vision only, whose measurements are the projections of the features in S_1 and the GPS fix (when available).
- Hypotheses using both vision and inertial sensors (plus magnetometer), corresponding to the full set of measurements.
- Hypotheses using inertial measurements (plus magnetometer) without vision. In this case, S_1 contains only motion estimates while all the features are in S_2. S_1, in this hypothesis, is estimated using an Adaptive Kalman Filter (AKF) [Gustafsson and Firm 2000], which also estimates the noise covariance Σ_ins of the inertial measurements. This is done to deal with situations where Σ_ins is not known or changes with time. The estimate of Σ_ins from this hypothesis is fed to the other hypotheses.

In addition, we can deal with outliers by forming multiple hypotheses in which the features in U_1 are different. This is akin to the RANSAC [Fischler and Bolles 1981] paradigm and it provides robustness to outliers in the tracking of the features. Every one of those hypotheses can be estimated from the U_1 data associated with it, and then S_2 is estimated from the U_2 data using the motion in S_1. Note that the update of a feature in S_2 is based on a motion in S_1 that is already updated and not only predicted from the previous time instant.

Figure 6 shows the three types of particles in our

system. Figures 7, 8, and 9 depict the estimation

process of each type of hypothesis.

At every time step, the estimation of all the

hypotheses can be done in a GPS mode or in a non-GPS mode depending on the availability of the GPS

signal. Also, a procedure is designed to determine the

best hypotheses (see Section 4.4). Figure 10 shows

the mean visual reprojection errors of 20 hypotheses

over 160 frames and the chosen hypothesis.

4.2 The Dynamical System of Every Hypothesis

Three types of dynamical systems can be defined, one for each of our hypotheses. The equations in these systems are a subset of the following system of equations:

(8.1) X^{L(l)}(t+1) = X^{L(l)}(t),  l = 1,...,K
(8.2) T(t+1) = e^{(ω_t)_×} T(t) + V(t)
(8.3) Θ(t+1) = Log_{SO(3)}(e^{(ω_t)_×} e^{(Θ_t)_×})
(8.4) V(t+1) = V(t) + a(t)
(8.5) ω(t+1) = ω(t) + α(t)
(8.6) a(t+1) = a(t)
(8.7) u^i(t) = proj(R(t) X^i(t) + T(t)) + n_vis,i(t),  i = 1,...,N
(8.8) T_GPS(t) = C T(t) + B + n_GPS
(8.9) a_ins(t) = R_ci^T a(t) + R^T(t) g + R(t) T_ci + a_bias + n_ins,acc
(8.10) ω_ins(t) = R_ci^T ω(t) + ω_bias + n_ins,gyro
(8.11) γ_mag(t) = Θ(t) + n_mag    (8)

where a(t) and α(t) denote the translational and angular accelerations of the platform, with X^i(0) = X^i_0, T(0) = T_0, R(0) = e^{(Θ_0)_×}, ω(0) = ω_0, V(0) = V_0, a(0) = a_0, and n_vis,i ~ N(0, Σ_vis,i(t)).

Figure 7: Estimation of a vision hypothesis: Part 1 of the hypothesis is estimated using an EKF whose measurements are the projections of the features in Part 1 and the GPS fix (when available). Then every feature in Part 2 is estimated using an individual EKF based on the motion in Part 1.

Figure 8: Estimation of a vision-inertial hypothesis: Part 1 of the hypothesis is estimated using an EKF whose measurements are the projections of the K features in Part 1, the inertial measurements, and the GPS fix (when available). Then every feature in Part 2 is estimated using an individual EKF based on the motion in Part 1.

u^i(t) is the image position of the ith feature; it is obtained from the feature tracker. C is a matrix representing the transformation from the NED reference frame to the ECEF reference frame, and B is a vector from the earth center to the origin of the NED frame. R_ci is the rotation between the body frame and the nIMU frame, and T_ci is the vector from the nIMU frame to the body frame. n_GPS ~ N(0, Σ_GPS) is the GPS noise, whose covariance is obtained from the Dilution of Precision (DOP) values provided by the receiver. n_ins,acc and n_ins,gyro are the noise from uncertainties in the accelerometers and gyroscopes of the inertial sensor, which are estimated adaptively using an Adaptive Kalman Filter.

Figure 9: Estimation of an inertial hypothesis: Part 1 of the hypothesis is estimated using an adaptive KF [Gustafsson and Firm 2000] whose measurements are the inertial measurements and the GPS fix (when available). Then every feature in Part 2 is estimated using an individual EKF based on the motion in Part 1 and whose measurement is the projection of the considered feature.

S_1(t) = f(S_1(t-1)) + n_s(t),    n_s(t) ~ N(0, Σ_s),
m(t) = h(S_1(t)) + n_m(t),    n_m(t) ~ N(0, Σ_m),    (9)

where f is the dynamical model, dependent on the hypothesis type, and h is the system of equations 8.6, 8.7 and 8.10.

a_bias and ω_bias are the biases for the accelerometers and gyroscopes given in the hardware specifications. n_mag ~ N(0, Σ_mag) is the noise in the magnetometer measurement and is also estimated adaptively. We follow [Chiuso et al. 2002] and use the standard notations of dynamical systems filtering, where x(t|t-1) refers to an estimate of x at time t given the measurements up to time t-1.

The vision hypotheses will have equations 8.9

and 8.10 missing. Inertial hypotheses will have

equations 8.1 and 8.7 missing. In the non-GPS

mode, all hypotheses will have equation 8.8 missing. Note that the K features in equation 8.1 are

chosen randomly from the N total features and are

different for every hypothesis. There are two reasons for which GPS measurements are not available at all times. The first reason is that the GPS

provides measurements at 1 Hz, whereas vision and

inertial measurements are much more frequent (e.g.

30Hz). The second reason is that the GPS receiver

may not be in line of sight of enough satellites to

determine its position (e.g. in tunnels or buildings).

Unfortunately, the scale and reference frame are not

observable with vision (i.e. visual features do not

provide measurements of absolute scale). Thus,

when the GPS signal disappears, we have to fix the

scale and reference frame for vision-type hypotheses. To do this, whenever the GPS measurement is

not available, we replace equation 8.1 with

j

390

Prediction

S 1 tt 1 = f S 1 t 1 t 1

Pm tt 1 = F t 1 Pm t 1 t 1 .

T

F t 1 + m

(10)

Update

S 1 tt = S 1 tt 1 + A t u t h S 1 tt 1

T

T

Pm tt = t Pm tt 1 t + A t n A t

(11)

Gain

t = H t P t H T t + n

T

t = P tt 1 H t

t =IAt H t

t .

(12)

f

S 1 t 1t 1

S 1

.

H t = h S 1 tt 1

S 1

(13)

Linearization

Ft =

Z L l t + 1 = Z L l t , l = 48

,

x L l t + 1 = x L t t , l = 48

where X L l = Z L l x L l ,Z L l y L l ,Z L l

refer to (6).
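A generic step of the filtering equations (10) to (13) can be sketched as follows. This is a minimal numpy version with precomputed Jacobians F and H; the function names are ours:

```python
import numpy as np

def ekf_step(s, P, u, f, h, F, H, Qm, Qn):
    # Prediction, equation (10)
    s_pred = f(s)
    P_pred = F @ P @ F.T + Qm
    # Gain, equation (12)
    Lam = H @ P_pred @ H.T + Qn
    A = P_pred @ H.T @ np.linalg.inv(Lam)
    G = np.eye(len(s)) - A @ H
    # Update, equation (11) (Joseph-like covariance form)
    s_new = s_pred + A @ (u - h(s_pred))
    P_new = G @ P_pred @ G.T + A @ Qn @ A.T
    return s_new, P_new
```

For a static scalar state observed directly with unit noise, one step moves the estimate halfway toward the measurement, as expected from the Kalman gain.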

Every feature X^j_i in the part S_2 of hypothesis j is estimated with an individual EKF based on the following dynamical system:

X^j_i(t) = X^j_i(t-1) + n_x(t),    n_x(t) ~ N(0, Σ_x),    (14)

u_i(t) = proj(R^j(t|t) X^j_i(t) + T^j(t|t)) + n_vis,j(t),    n_vis,j(t) ~ N(0, Σ_vis,j),    (15)

where R^j(t|t) and T^j(t|t) are the updated motion estimates of the hypothesis. Note that in the above equation, the measurement equation of a feature uses a motion that is already updated and not only predicted; this is one of the major advantages of our approach. The features in S_2 are updated as in Section 4.2.1; however, instead of using Σ_vis,i, we replace it by Σ_vis,i + G^j_m,i Q G^{jT}_m,i to take into consideration the uncertainty of the motion. G^j_m,i is the linearization of the measurement equation with respect to the motion (R^j(t), T^j(t)), G^j_x,i is its linearization with respect to the feature position X^j_i, Q is the covariance of the motion, and P_x is the covariance of X^j_i.

4.4 Weighting the Hypotheses

At every time step, we weight the hypotheses based on their fitness to the available measurements.

4.4.1 Weights from Vision Measurements

Ideally, to test a given hypothesis j at time t using the visual measurements, we have to determine its re-projection error e_vis(j,t) on all the previous frames. The reprojection of the ith 3D point of the jth hypothesis at time t on the kth frame and its covariance matrix are given by:

up^j_i(t,k) = proj(R^j(k) X^j_i(t) + T^j(k)),    (16)

Σp^j_i(t,k) = G^j_m,i(k) Q(k) G^{jT}_m,i(k) + G^j_x,i(t,k) P_x(t) G^{jT}_x,i(t,k).    (17)

The error e_vis(j,t) is then defined as follows:

e_vis(j,t) = 1/(t+1) Σ_{k=0}^{t} 1/N Σ_{i=1}^{N} ρ( (up^j_i(t,k) - u_i(k))^T (Σp^j_i(t,k) + Σ_vis,j(k))^{-1} (up^j_i(t,k) - u_i(k)) ),    (18)

where ρ is a robust function that we define as ρ(e) = e if 0 ≤ e < th and ρ(e) = th if e ≥ th, where th is a threshold. The errors are expressed in terms of mean pixel error per feature. However, this would require us to keep track of all the image projections and to recompute the reprojection errors of the state vector on all the frames, at every time, which cannot be done in real time. Therefore, we use an approximation of this error, which can be computed recursively at every frame as follows:

e_vis(j,t) = t/(t+1) e_vis(j,t-1) + 1/(t+1) 1/N Σ_{i=1}^{N} ρ( (up^j_i(t,t) - u_i(t))^T (Σp^j_i(t,t) + Σ_vis,j(t))^{-1} (up^j_i(t,t) - u_i(t)) ).    (19)

In this case, instead of computing the reprojection error of the state vector S(t) on all the previous frames, we accumulate, for every previous frame k, the reprojection error of S(k) on this frame, which leads to:

e_vis(j,t) = 1/(t+1) Σ_{k=0}^{t} 1/N Σ_{i=1}^{N} ρ( (up^j_i(k,k) - u_i(k))^T (P^j_i(k,k) + Σ_vis,j(k))^{-1} (up^j_i(k,k) - u_i(k)) ).    (20)

The visual weight of hypothesis j is then w_vis(j,t) = exp(-e_vis(j,t)); the visual weights of the hypotheses are normalized to sum to one.
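The robust function ρ and the recursive error of equation (19) are simple to implement. A sketch (the names are ours):

```python
def rho(e, th):
    # Robust clamp used in equations (18) to (20): errors above th saturate,
    # so gross outliers contribute a bounded amount.
    return e if e < th else th

def evis_update(e_prev, t, frame_error):
    # Recursion of equation (19): running mean of the per-frame errors,
    # e(t) = t/(t+1) * e(t-1) + 1/(t+1) * frame_error(t).
    return (t * e_prev + frame_error) / (t + 1)
```

Feeding per-frame errors one at a time reproduces their running mean without storing past projections.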

4.4.2 Inertial Weights

Inertial weights are measured by the deviations of V and ω from the inertial measurements. This gives two weights, one for the accelerometers and one for the gyros. We define the mean and covariance matrices of the translational and rotational velocities of the jth hypothesis in the IMU reference frame:

ap^j_ins(t) = R_ci^T a^j(t) + R^{jT}(t) g + R^j(t) T_ci + a_bias,
Σp^j_ins,acc(t) = R_ci^T P_a(t) R_ci,
ωp^j_ins(t) = R_ci^T ω^j(t) + ω_bias,
Σp^j_ins,gyro(t) = R_ci^T P_ω(t) R_ci,    (21)

where P_a(t) and P_ω(t) are the covariances of a and ω in the hypothesis j. The corresponding errors are accumulated recursively:

e_acc(j,t) = t/(t+1) e_acc(j,t-1) + 1/(t+1) (ap^j_ins(t) - a_ins(t))^T (Σp^j_ins,acc(t) + Σ_ins,acc(t))^{-1} (ap^j_ins(t) - a_ins(t)),

e_gyro(j,t) = t/(t+1) e_gyro(j,t-1) + 1/(t+1) (ωp^j_ins(t) - ω_ins(t))^T (Σp^j_ins,gyro(t) + Σ_ins,gyro(t))^{-1} (ωp^j_ins(t) - ω_ins(t)).    (22)

The weights wacc(j,t) and wgyro(j,t) are calculated as follows:

wacc(j,t) = exp(-e_acc(j,t)),
wgyro(j,t) = exp(-e_gyro(j,t)),    (23)

and the weights over all hypotheses are then normalized to 1.

4.4.3 Magnetometer Weights

The magnetometer weights are measured by the deviation of Θ from the magnetometer measurement:

e_mag(j,t) = t/(t+1) e_mag(j,t-1) + 1/(t+1) (Θ^j(t) - γ_mag(t))^T (Σ_mag + P^j_Θ(t))^{-1} (Θ^j(t) - γ_mag(t)).    (24)

The weights are computed as:

wmag(j,t) = exp(-e_mag(j,t)),    (25)

normalized over the hypotheses.

4.4.4 GPS Weights

The GPS weights are measured by the deviation of T from the GPS measurement:

e_GPS(j,t) = t/(t+1) e_GPS(j,t-1) + 1/(t+1) (C T^j(t) + B - T_GPS(t))^T (C P_T(t) C^T + Σ_GPS)^{-1} (C T^j(t) + B - T_GPS(t)),    (26)

wGPS(j,t) = exp(-e_GPS(j,t)),    (27)

normalized over the hypotheses.

To compute an aggregate weight for each hypothesis, we simply treat the weights from each type of measurement as probabilities and multiply them:

w(j,t) = wvis(j,t) wacc(j,t) wgyro(j,t) wmag(j,t) wGPS(j,t).    (28)

Note that when the visual measurements are not good, all the reprojection errors are very high. The robust function ρ then leads to almost equal visual weights for the hypotheses, and then w(j,t) will not be affected by the vision weights.

4.5 Initialization and Operation

Upon starting our system, a number of features

are detected in the first image and then tracked

through subsequent frames. We randomly sample

Ns putative sets of K = 8 features each. Every set is

used to generate a hypothesis (vision or vision-inertial). A hypothesis is generated by initializing an

EKF in which the initial translation is set to 0, the

rotation with respect to the NED frame is determined from the accelerometer and magnetometer at

rest, and the translational and rotational velocities

are set to zero. After a number of frames, when the

EKFs are assumed to have converged for every

hypothesis, we triangulate the 3D positions of the

features in its Part 2 using the motion in its Part 1.

The covariance of every 3D feature is determined

using the unscented transform [Julier and Uhlmann

1997], the covariance of the motion in the Part 1

and the covariance of its image feature. The triangulated 3D features and their covariances are used

as an initialization of the individual EKFs in Part 2.

A similar procedure is used to insert new features.

When a new feature is detected (n.b. in our system

we keep the number of visible features constant), it

is tracked for a certain number of frames (e.g. from

t1 to t2). Then, for each of the available hypotheses,

the 3D location of this feature is determined by triangulation and used as an initial value for the EKFs

of this feature in every hypothesis.

When a feature in the part S_1 of a given hypothesis is no longer tracked (e.g. it disappears or

gets occluded), it is replaced by another feature

from the part S 2 of the hypothesis and this latter

feature is chosen as the one with the lowest variance. The replacement is done by eliminating the

occluded, or disappearing, feature from the state

vector and substituting its mean by the mean of the

new feature. In addition, the rows and columns in the

covariance matrix corresponding to the old feature

are set to zero and only the part corresponding to

the autocovariance of the old feature is replaced by

the covariance of the new feature.

When the weight of a given hypothesis

becomes smaller than some threshold th1 and if

none of its individual weights is higher than another threshold th2, the hypothesis is discarded and replaced by a new hypothesis. The new hypothesis is generated from one of the remaining hypotheses by interchanging the feature having the largest variance in its part S_1 with the feature having the smallest variance in its part S_2.
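The discard rule described above can be sketched as follows (th1 and th2 as in the text; the function name is ours):

```python
def should_discard(w_total, individual_weights, th1, th2):
    # A hypothesis is dropped when its aggregate weight w(j,t) falls below
    # th1 and none of its per-sensor weights exceeds th2.
    return w_total < th1 and max(individual_weights) < th2
```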

5. Results with the Vision System

As a proof of concept for our hypothesis-based

approach, we tested the system using vision only

hypotheses. We calibrated a Flea2 camera, from

Point Grey, using a checkerboard pattern and used

the camera to capture several sequences of images.

In some of the sequences, the camera is fixed on a

pan-tilt unit; and in other sequences, the camera is

moved by hand. We report, hereafter, the results on

three sequences. We compare the reprojection error

of the state vector estimated by our system, in all the

considered frames, against the error of the full

covariance EKF [Chiuso et al. 2002] and of the optimal nonlinear offline optimization (i.e. bundle

adjustment) [Triggs et al. 2000]. The bundle adjustment is a Gauss-Newton batch estimation of the

camera poses at all the frames and the positions of

all the considered points by minimizing the reprojection errors of the 3D parameters in all the images.

The reprojection errors in a given frame are determined by projecting the estimates of the 3D points

on this frame using the corresponding camera pose

estimate, then taking the difference between the

obtained projections and the detected features positions. We use these reprojection errors to evaluate

different estimators. If the errors in the image projections are Gaussian, the estimator corresponding

to the smallest reprojection error would be the

Maximum Likelihood estimator; otherwise, in the

non-Gaussian case it would be the BLUE estimator

(Best Linear Unbiased Estimator). Figures 11, 12,

and 13 show the reprojection errors of the state vectors at time 160, of all the (20) considered hypotheses on all the frames, from time 10 till 160, for the

three sequences respectively. The blue solid lines

correspond to the different hypotheses; the green

dash-dotted line corresponds to the optimal bundle

adjustment solution; the dashed red line corresponds

to the chosen hypothesis; and the dotted black line

corresponds to the full EKF solution. The chosen

hypothesis in our system always corresponds to one

of the best hypotheses; it has a reprojection error

much lower than the EKF error and it is also very

close to the error of the optimal bundle adjustment.

Figure 14 shows the estimated positions of the scene

landmarks, where the diameter of the circles is proportional to the depth of the feature. The larger the

diameter of circle, the farther it is from the camera

center. A qualitative examination of these features

shows that their obtained depth is in accord with the

distance to the cameras and that the 3D positions are

obtained up to a good accuracy. The results presented, thus far, show the effectiveness and efficiency of

our multi-hypothesis approach. Since the kernel of

every hypothesis is the EKF filter of its Part 1, and

based on previous results of vision-inertial EKFs

[Jones et al. 2007], we expect the same performance

of our approach, with added accuracy, when we

include inertial measurements in the tests.
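The reprojection-error criterion used throughout this comparison can be sketched as follows (a minimal pinhole-camera illustration; the function and variable names are our assumptions, not the paper's code):

```python
import numpy as np

def reprojection_error(K, R, t, X, x_detected):
    """Mean reprojection error of estimated 3-D points in one frame.

    K : 3x3 intrinsic matrix;  R, t : camera pose (world -> camera)
    X : Nx3 estimated 3-D points;  x_detected : Nx2 tracked features
    """
    Xc = X @ R.T + t                   # points in camera coordinates
    uvw = Xc @ K.T                     # homogeneous image coordinates
    proj = uvw[:, :2] / uvw[:, 2:3]    # perspective division
    # Distance between each projection and its detected feature.
    return np.linalg.norm(proj - x_detected, axis=1).mean()
```

Evaluating this quantity per frame, for each hypothesis and for the EKF and bundle adjustment baselines, yields curves like those described for Figures 11 through 13.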

Figures 11, 12, and 13: Reprojection errors of the state vector at frame 160 on all the previous frames, for the three sequences respectively.


6. Conclusion


We presented a framework that fuses GPS with vision and inertial measurements for localization and mapping. The system maintains multiple hypotheses that are more general than the particles in a Particle Filter and give our system much more flexibility in selecting the set of measurements, together with the set of parameters, that yields the most accurate results. The system is independent of the image feature tracking algorithm, while the multiple-hypotheses scheme provides robustness to potential failures of the tracker. We showed preliminary results that demonstrate the superiority of our framework over the traditional EKF in terms of giving the best possible localization and mapping. Our framework can easily be generalized so that it is scalable to different networks of sensors. Future work will focus on an extensive experimental evaluation of our approach with very long sequences, in addition to various indoor and outdoor conditions. We will rigorously test our system's robustness to extended loss of GPS data, as well as confirm the superior computational efficiency and scalability implied by the derivation of our probabilistic framework when compared to other sensor fusion algorithms.

References


Figure 14: Recovered 3D points. The diameter of the circles is proportional to the distance from the camera.


for road tracking using multiple condensation particle filters. In 2008 IEEE Intelligent Vehicles Symposium, pp. 162–167.

Bouguet, J. 2000. Pyramidal implementation of the Lucas Kanade feature tracker: Description of the algorithm. Intel Corporation, Microprocessor Research Labs.

Carceroni, R., A. Kumar, and K. Daniilidis. 2006. Structure from motion with known camera positions. In Proc. of IEEE Int. Conf. on Computer Vision and Pattern Recognition, pp. 477–484.

Chikuma, M., Z. Hu, and K. Uchimura. 2004. Fusion of vision, GPS and 3D-gyro in solving camera global registration problem for vision-based road navigation. IEICE Technical Report (Institute of Electronics, Information and Communication Engineers), 103(643):71–76.

Chiuso, A., P. Favaro, H. Jin, and S. Soatto. 2000. 3-D motion and structure from 2-D motion causally integrated over time: Implementation. In ECCV '00: Proceedings of the 6th European Conference on Computer Vision, Part II, pp. 734–750, London, U.K., Springer-Verlag.

Chiuso, A., P. Favaro, H. Jin, and S. Soatto. 2002. Structure from motion causally integrated over time. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):523–535.

Davison, A., I. Reid, N. Molton, and O. Stasse. 2007. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052–1067.

Drolet, L., F. Michaud, and J. Cote. 2000. Adaptable sensor fusion using multiple Kalman filters. In Proceedings of the 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2000), volume 2.

Eade, E. and T. Drummond. 2006. Scalable monocular SLAM. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1.

Elinas, P., R. Sim, and J. Little. 2006. Stereo vision SLAM using the Rao-Blackwellised particle filter and a novel mixture proposal distribution. In Proc. IEEE International Conference on Robotics and Automation, pp. 1564–1570.

Fakih, A. and J. Zelek. 2008. On the benefits of using gyroscope measurements with structure from motion. In ECCV Workshop on Multi-camera and Multi-modal Sensor Fusion (M2SFA2 2008), Oct.

Fischler, M.A. and R.C. Bolles. 1981. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24:381–395.

Forlani, G., R. Roncella, and F. Remondino. 2005. Structure and motion reconstruction of short mobile mapping image sequences. In VII Conference on Optical 3D Measurement Techniques, pp. 265–274.

Goldbeck, J., B. Huertgen, S. Ernst, and L. Kelch. 2000. Lane following combining vision and DGPS. Image and Vision Computing, 18(5):425–433.

Grewal, M., L. Weill, and A. Andrews. 2007. Global positioning systems, inertial navigation, and integration. Wiley-Interscience.

Gustafsson, F. 2000. Adaptive filtering and change detection. Wiley, New York.

Hu, Z. and K. Uchimura. 2006. Fusion of vision, GPS and 3D gyro data in solving camera registration problem for direct visual navigation. International Journal of ITS Research, 4(1):3–12.

Jones, E., A. Vedaldi, and S. Soatto. 2007. Inertial structure from motion with autocalibration. In Proceedings of the ICCV Workshop on Dynamical Vision.

Julier, S. and J. Uhlmann. 1997. A new extension of the Kalman filter to nonlinear systems. Volume 3.

Karlsson, N., E. di Bernardo, J. Ostrowski, L. Goncalves, P. Pirjanian, and M. Munich. 2005. The vSLAM algorithm for robust localization and mapping. In Proc. IEEE International Conference on Robotics and Automation, pp. 24–29.

Konolige, K., M. Agrawal, and J. Sola. 2007. Large scale visual odometry for rough terrain. In Proc. International Symposium on Robotics Research.

Li, Y., J. Wang, C. Rizos, P. Mumford, and W. Ding. 2006. Low-cost tightly coupled GPS/INS integration based on a nonlinear Kalman filtering design. In Proceedings of ION National Technical Meeting, pp. 18–20.

Mohamed, A. and K. Schwarz. 1999. Adaptive Kalman filtering for INS/GPS. Journal of Geodesy, 73(4):193–203.

Montemerlo, M. 2003. FastSLAM: A factored solution to the simultaneous localization and mapping problem with unknown data association. Ph.D. thesis, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, July.

Montemerlo, M., S. Thrun, D. Koller, and B. Wegbreit. 2003. FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), pp. 1151–1156.

Nistér, D., O. Naroditsky, and J. Bergen. 2006. Visual odometry for ground vehicle applications. Journal of Field Robotics, 23(1):3–20.

Point Grey Research, Inc. http://www.ptgrey.com.

Qian, G., R. Chellappa, and Q. Zheng. 2001. Robust structure from motion estimation using inertial data. Journal of the Optical Society of America A, 18:2982–2997.

Ramachandran, M., A. Veeraraghavan, and R. Chellappa. 2007. Fast bilinear SfM with side information. In IEEE 11th International Conference on Computer Vision (ICCV 2007), pp. 1–8.

Roumeliotis, S., A. Johnson, and J. Montgomery. 2002. Augmenting inertial navigation with image-based motion estimation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '02), volume 4.

Sasiadek, J., Q. Wang, and M. Zeremba. 2000. Fuzzy adaptive Kalman filtering for INS/GPS data fusion. In Proceedings of the 2000 IEEE International Symposium on Intelligent Control, pp. 181–186.

Sim, R., P. Elinas, and J. J. Little. 2007. A study of the Rao-Blackwellised particle filter for efficient and accurate vision-based SLAM. Int. J. Comput. Vision, 74(3):303–318.

Triggs, B., P. McLauchlan, R. Hartley, and A. Fitzgibbon. 2000. Bundle adjustment: A modern synthesis. Vision Algorithms: Theory and Practice, 1883:298–372.

Xia, T., M. Yang, and R. Yang. 2006. Vision based global localization for intelligent vehicles. In 2006 IEEE Intelligent Vehicles Symposium, pp. 571–576.

Yang, R. and M. Pollefeys. 2003. Multi-resolution real-time stereo on commodity graphics hardware. In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1.

Authors:

Adel Fakih is a Ph.D. student in Systems Design Engineering at the University of Waterloo. His research interests are in visual motion analysis, structure from motion, visual navigation, and tracking human limbs in video sequences. He holds a Master's degree in Computer and Communications Engineering from the American University of Beirut, where he worked on the detection and extraction of vertebrae from spinal X-ray images. He did his undergraduate studies at the Lebanese University, where he graduated with a Diploma in Electrical and Electronics Engineering.

Samantha Ng is currently pursuing an M.Sc. degree at the University of Waterloo. Her main research area is computer vision and human pose estimation. Samantha received her B.Sc. from the University of Waterloo, specializing in mechatronics, statistics, and management science.

John Zelek is Associate Professor in Systems Design Engineering at the University of Waterloo, with expertise in the area of intelligent mechatronic control systems that interface with humans; specifically, (1) wearable sensory substitution and assistive devices; (2) probabilistic visual and tactile perception; (3) design, synthesis and analysis; and (4) human-robot interaction. Zelek has co-founded two start-up companies. He was awarded the best paper award at the international IEEE/IAPR Computer and Robot Vision conference, the best paper award at the 1995 IEEE SMC conference, and his student won best student paper at the 2002 Vision Interface conference. He received 2006 and 2008 Distinguished Performance Awards from the Faculty of Engineering at the University of Waterloo, as well as the 2004 Young Investigator Award from the Canadian Image Processing & Pattern Recognition Society for his work in robotic vision. His research has been reported by the media in the Globe & Mail, Discovery Channel, CBC, CTV, and many other venues such as Science & Vie Junior Hors Série (a renowned young-science magazine in France).