Inertial sensors and cameras are two sources of odometry that can provide continuous localization when the GPS signal is blocked by jamming or interference. Furthermore, they augment GPS localization with orientation (heading) information. Although many approaches have been proposed to fuse GPS with vision and inertial measurements, most remain subject to conditions, assumptions, and application-specific constraints that do not guarantee the most accurate localization from the available sensors at all times. We propose a comprehensive framework that fuses GPS intelligently with vision and inertial data by automatically, and adaptively, selecting the best combination of sensors and the best set of parameters. Our framework, based on multiple hypotheses, is a roadmap for real-time systems that provide high accuracy, outlier robustness, and scalability when mapping a large number of 3D points.
Adel Fakih (afakih@uwaterloo.ca)
1. Introduction
Mobile mapping vehicles, as well as pedestrians, equipped with GPS are often subject to satellite shading and signal dropout when operating along tree-lined streets, in wooded areas, and in environments where tall buildings block reception. Vehicle-mounted GPS positioning is further hindered by structures that cause satellites to go in and out of view, limiting a GPS receiver's ability to provide continuous and accurate position information. In addition, multi-path problems, caused by signals bouncing off reflective surfaces, often leave GPS systems unable to provide reliable positioning information in such areas. One way to provide uninterrupted measurements of the position, roll, pitch, and true heading of moving vehicles is to integrate precision GPS with advanced inertial technology, similar to what planes and rockets use; however, such technology is too expensive for the automotive or pedestrian sector.
As humans, we are very successful at navigating, localizing, and finding our way using our eyes and our inertial (vestibular) system. In this investigation, we mimic this ability with sensor fusion algorithms that combine the results of computer vision algorithms with the outputs of GPS and inexpensive inertial sensors. Our system is fast, scalable, and flexible; it selectively fuses measurements from the given sensors in a robust, probabilistic, multi-hypothesis approach that always gives the best possible localization and mapping. We also include a magnetometer (i.e., a compass) in the system apparatus. The magnetometer, by providing information about absolute orientation, enhances the system's ability to avoid large drift after losing the GPS signal. Our system is not tied to any specific image feature detector or tracker: any feature detector can be used to detect features in the images, and any tracker (or matcher) can be used to track those features between images. Our system is resilient to the problems and failures of the feature tracker, firstly because it uses multiple hypotheses and, secondly, because it uses additional information from the other sensors. We provide preliminary results to demonstrate the validity and feasibility of our proposed approach.
Samantha Ng (sjng@uwaterloo.ca)
John Zelek (jzelek@uwaterloo.ca)
2. Background
The Global Positioning System (GPS) has long
been used in combination with inertial sensors for
localization and mapping [Grewal et al. 2007].
Together, GPS and inertial measurement units (IMUs) naturally complement each other: GPS compensates for drift in inertial measurements, while inertial measurements are available at higher sampling rates and in a wider range of environments. Many approaches have been introduced in recent years for combining GPS and inertial measurements [Grewal et al. 2007; Li et al. 2006; Mohamed and Schwarz 1999; Sasiadek et al. 2000]. These approaches range from loose to ultra-tight coupling and employ different estimation techniques, such as the conventional Kalman filter, the adaptive Kalman filter, and the fuzzy adaptive Kalman filter. Although inertial sensors can operate in more environments than GPS (e.g., indoors), they can only compensate for loss of the GPS signal for a short period of time before drifting quickly from the true trajectory.
More recently, localization methods use vision with filter-based approaches [Chiuso et al. 2002; Davison et al. 2007; Eade and Drummond 2006] or with odometry-based approaches [Yang and …]. A fusion system should be able to fuse data when the fusion is beneficial, and it should be able to rely on individual sensors when measurements become very noisy or
unknown uncertainties are suspected. One
approach to resolve a similar problem is based on
using multiple Kalman filters [Drolet et al. 2000];
a bank of Kalman filters is used, one of which is
running at all times. Whenever new sensor measurements become available, the suitable filter is
selected from the bank of available filters. The
adaptive Kalman filter [Gustafsson and Firm 2000] is another way to resolve this problem: it tries to estimate the noise covariance of the measurements. However, this is not suitable for vision, since visual features are short-lived, making it impractical to estimate their noise covariance.
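The bank-of-filters idea can be sketched as follows. This is a minimal illustration, not the implementation of [Drolet et al. 2000]: the filter objects are placeholders, and the names `filter_bank` and `pick_filter` are ours.

```python
# Sketch of a bank of Kalman filters: one filter is configured per sensor
# combination, and the filter matching the currently available measurements
# is selected at each step. Placeholder strings stand in for tuned filters.

from itertools import chain, combinations

SENSORS = ("gps", "camera", "imu")

def powerset(items):
    """All non-empty sensor combinations."""
    return chain.from_iterable(
        combinations(items, r) for r in range(1, len(items) + 1))

# One (placeholder) filter per combination; a real system would store a
# Kalman filter tuned to that combination's measurement model.
filter_bank = {frozenset(c): f"KF[{'+'.join(sorted(c))}]"
               for c in powerset(SENSORS)}

def pick_filter(available):
    """Select the filter whose sensor set matches the available measurements."""
    return filter_bank[frozenset(available)]
```

With three sensors the bank holds seven filters; whenever a measurement set arrives, e.g. GPS plus IMU during a camera dropout, the matching filter is retrieved in constant time.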
3. Methodology Preliminaries
We wish to determine the motion of a moving
platform from the readings of a set of sensors mounted on the platform. Figure 4 shows our platform with
three types of sensors.
We use a Trimble ProXH GPS receiver, a Point Grey Flea2 FireWire camera, and a Memsense nIMU inertial unit, which provides measurements of the rotational velocity, the translational acceleration, and the components of the Earth's magnetic field along the three axes of the unit. The pose of
the platform is represented by a rotation matrix R ∈ SO(3) (the Lie group of rotations in R^3) and a translation vector T ∈ R^3. The pose evolves as

    T(t+1) = e^{[ω(t)]_×} T(t) + V(t),                                  (1)
    R(t+1) = e^{[ω(t)]_×} R(t),                                         (2)

where V is the translational velocity, ω the rotational velocity, and [ω]_× the skew-symmetric matrix

    [ω]_× := [   0    −ω_z    ω_y ]
             [  ω_z     0    −ω_x ]
             [ −ω_y    ω_x     0  ].

A 3D point in the camera frame, X^c = (X^c, Y^c, Z^c), projects to the image point

    (u(t), v(t)) = proj(X^c) := ( f X^c / Z^c,  f Y^c / Z^c ),          (4)

where
f is the focal length. A calibrated Euclidean
framework is assumed, and without loss of generality, f is taken as 1. In this paper, visual features are
tracked using a pyramidal implementation of the
Lucas-Kanade feature tracker [Bouguet 2000].
Figure 5 shows a test sequence with the features tracked between two of the frames.
Figure 4: Setup.
The rotation matrix and the rotation vector are related by the exponential map, with R = e^{[Ω]_×} and Ω = Log_{SO(3)}(R).
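As a minimal illustration of the pose dynamics (1)-(2) and the projection (4), the sketch below implements the exponential map via Rodrigues' formula. The function names (`skew`, `exp_so3`, `step_pose`, `proj`) are ours, not the paper's.

```python
import numpy as np

def skew(w):
    """[w]_x: the skew-symmetric matrix of eq. (2)."""
    wx, wy, wz = w
    return np.array([[0.0, -wz,  wy],
                     [ wz, 0.0, -wx],
                     [-wy,  wx, 0.0]])

def exp_so3(w):
    """Rodrigues' formula for e^{[w]_x}."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    K = skew(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def step_pose(R, T, w, V):
    """One step of eqs. (1)-(2): T' = e^{[w]_x} T + V, R' = e^{[w]_x} R."""
    E = exp_so3(w)
    return E @ R, E @ T + V

def proj(Xc, f=1.0):
    """Pinhole projection of eq. (4); f is taken as 1 in the paper."""
    return f * Xc[:2] / Xc[2]
```

For example, with zero rotational velocity one step simply translates the platform by V, and a point at depth 2 with coordinates (2, 4, 2) projects to (1, 2).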
4. Methodology
4.1 Meta Level Overview
We wish to determine the position and velocity of the platform accurately, at every time step, regardless of the environment (e.g., indoor, outdoor), the natural conditions (e.g., bright, dark), or the type of motion (e.g., fast, slow, jerky). Using a probabilistic formulation, we want to determine, at every time step t, the probability distribution of the desired state,

    p( S(t) | U(0:t), T^g(0:t), ins(0:t), mag(0:t) ),

conditioned on all visual, GPS, inertial, and magnetometer measurements up to time t. The notation {·, ·} is used to represent the concatenation of vectors on top of each other. At time t, the position of a feature point X in the camera frame is

    X^c(t) = R(t) X(t) + T(t),                                          (3)

where X^c = {X^c, Y^c, Z^c}.
A drastic improvement over the simple Rao-Blackwellized particle filter was found in using the optimal importance function.
The state S is split into a part S1, containing the motion and K randomly selected features, and a part S2, consisting of the remaining features. K typically should be a small number (> 5); a relatively large number leads to better results but increases the computational expense. In this paper we settle on K = 8:

    S  = { S1, S2 },
    S1 = { ω(t), T(t), Ω(t), V(t), X^{L_l}(t), … },  L = randperm(N),  l ∈ [1, K],
    S2 = { X^i(t), i = 1, …, N, i ∉ L }.                                (5)
The desired distribution then factors as

    p(S | U, T^g, ins, mag) = p(S1 | U, T^g, ins, mag) p(S2 | S1, U2)   (6)
                            ≈ p(S1 | U1, T^g, ins, mag) p(S2 | S1, U2), (7)

since S2 is independent of U1 given S1. The approximation comes from the fact that S1 is now treated as independent of U2; we deal with that by forming multiple hypotheses in which the features in U1 are different. S1 is estimated using the U1 data associated with it, and then S2 is estimated from the U2 data using the motion in S1.
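The random partition of eq. (5) can be sketched as follows; `partition_state` is an illustrative name, and NumPy's `permutation` plays the role of randperm.

```python
import numpy as np

def partition_state(n_features, K=8, rng=None):
    """Split feature indices per eq. (5): K randomly chosen features go with
    the motion in part S1; the remaining N - K features form part S2."""
    rng = np.random.default_rng(rng)
    perm = rng.permutation(n_features)   # L = randperm(N)
    L = perm[:K]                         # feature indices kept in S1
    rest = perm[K:]                      # feature indices in S2
    return set(L.tolist()), set(rest.tolist())
```

Each hypothesis would draw its own partition, so that different features back the motion estimate in different hypotheses.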
Hypotheses using both vision and inertial sensors (plus the magnetometer) correspond to the following model:

    1.  X^{L_l}(t+1) = X^{L_l}(t),  l = 1, …, K
    2.  T(t+1) = e^{[ω(t)]_×} T(t) + V(t)
    3.  Ω(t+1) = Log_{SO(3)}( e^{[ω(t)]_×} e^{[Ω(t)]_×} )
    4.  V(t+1) = V(t) + a(t)
    5.  ω(t+1) = ω(t) + α(t)
    6.  a(t+1) = a(t)
    7.  u^i(t) = proj( R(t) X^i(t) + T(t) ) + n_vis,i(t),  i = 1, …, N
    8.  T_GPS(t) = C T(t) + B + n_GPS
    9.  a_ins(t) = R^T_ci a(t) + R^T(t) g + R̈(t) T_ci + a_bias + n_ins,acc
    10. ω_ins(t) = R^T_ci ω(t) + ω_bias + n_ins,gyro
    11. Ω_mag(t) = Ω(t) + n_mag                                         (8)

with X^i(0) = X^i_0, Ω(0) = Ω_0, V(0) = V_0, ω(0) = 0, and a(0) = 0. Here n_vis,i ~ N(0, Σ_vis,i(t)) is the visual measurement noise; C and B map the platform position to the GPS frame, with B the offset to the origin of the NED frame; R_ci is the rotation between the body frame and the nIMU frame; T_ci is the vector from the nIMU frame to the body frame; and n_GPS ~ N(0, Σ_GPS) is the GPS measurement noise.
Figure 9: Estimation of an inertial hypothesis: Part 1 of the hypothesis is estimated using an adaptive KF [Gustafsson and Firm 2000] whose measurements
are the inertial measurements and the GPS fix (when available). Then every
feature in Part 2 is estimated using an individual EKF based on the motion in
Part 1 and whose measurement is the projection of the considered feature.
Part S1 of each hypothesis is estimated with a Kalman filter for the model

    S1(t) = f( S1(t−1) ) + n_s(t),   n_s(t) ~ N(0, Σ_s),
    m(t)  = h( S1(t) ) + n_m(t),     n_m(t) ~ N(0, Σ_m).                (9)

Prediction:

    S1(t|t−1) = f( S1(t−1|t−1) ),
    P(t|t−1)  = F(t−1) P(t−1|t−1) F^T(t−1) + Σ_s.                       (10)

Update:

    S1(t|t) = S1(t|t−1) + A(t) [ u(t) − h( S1(t|t−1) ) ],
    P(t|t)  = Γ(t) P(t|t−1) Γ^T(t) + A(t) Σ_m A^T(t).                   (11)

Gain:

    Λ(t) = H(t) P(t|t−1) H^T(t) + Σ_m,
    A(t) = P(t|t−1) H^T(t) Λ^{−1}(t),
    Γ(t) = I − A(t) H(t).                                               (12)

Linearization:

    F(t) = ∂f/∂S1 |_{S1(t−1|t−1)},   H(t) = ∂h/∂S1 |_{S1(t|t−1)}.      (13)
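The prediction/update cycle of eqs. (10)-(12) can be sketched generically as below. `ekf_step` and its argument names are ours; the Jacobians F and H are passed in precomputed rather than obtained by the differentiation of eq. (13).

```python
import numpy as np

def ekf_step(s, P, u, f, h, F, H, Q, Rm):
    """One EKF cycle for part S1, following eqs. (10)-(12).
    s, P: previous estimate and covariance; u: measurement;
    f, h: process and measurement functions; F, H: their Jacobians;
    Q, Rm: process and measurement noise covariances (Sigma_s, Sigma_m)."""
    # Prediction (eq. 10)
    s_pred = f(s)
    P_pred = F @ P @ F.T + Q
    # Gain (eq. 12)
    Lam = H @ P_pred @ H.T + Rm
    A = P_pred @ H.T @ np.linalg.inv(Lam)
    G = np.eye(len(s)) - A @ H
    # Update (eq. 11, Joseph form, which keeps P symmetric positive definite)
    s_new = s_pred + A @ (u - h(s_pred))
    P_new = G @ P_pred @ G.T + A @ Rm @ A.T
    return s_new, P_new
```

In a scalar identity model with unit prior and measurement variances, a measurement of 1 moves the estimate halfway and halves the covariance, as expected from the 0.5 gain.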
    Z^{L_l}(t+1) = Z^{L_l}(t),
    x^{L_l}(t+1) = x^{L_l}(t),   l = 4, …, 8,

where X^{L_l} = { Z^{L_l} x^{L_l}, Z^{L_l} y^{L_l}, Z^{L_l} }.
Each feature in S2 is estimated with its own EKF, with dynamics and measurement

    X^j_i(t) = X^j_i(t−1) + n_x(t),   n_x(t) ~ N(0, Σ_x),               (14)
    u_i(t) = proj( R(t|t) X^j_i(t) + T(t|t) ) + n_vis,j(t).             (15)

Every hypothesis j is then evaluated by comparing its predicted measurements with the actual ones. The predicted projection of feature i under hypothesis j, using the motion at time k, is

    up^j_i(t,k) = proj( R^j(k) X^j_i(t) + T^j(k) ),                     (16)

with covariance

    Σp^j_i(t,k) = G^j_{m,i}(k) Q(k) G^{j,T}_{m,i}(k) + G^j_{x,i}(t,k) P_x(t) G^{j,T}_{x,i}(t,k),   (17)

where G^j_{m,i} and G^j_{x,i} are the Jacobians of the projection with respect to the motion and to X^j_i, respectively, and P_x is the covariance of X^j_i. The visual error of hypothesis j is the normalized reprojection error of its state over all frames,

    e_vis(j,t) = 1/(t+1) Σ_{k=0}^{t} Σ_{i=1}^{N} ( up^j_i(k,k) − u_i(k) )^T ( Σp^j_i(k,k) + Σ_vis,j(k) )^{−1} ( up^j_i(k,k) − u_i(k) ).   (20)

In practice, instead of recomputing the reprojection error of the state vector S(t) on all the previous frames at every time step, the errors are accumulated recursively. The inertial errors are computed in the same way from the predicted accelerometer reading,

    ap^j_ins(t) = R^T_ci a^j(t) + R^{j,T}(t) g + R̈^j(t) T_ci + a_bias,
    Σp^j_ins,acc(t) = R^T_ci P_a(t) R_ci,                               (21)

giving the accumulated accelerometer error

    e_acc(j,t) = e_acc(j,t−1) + ( ap^j_ins(t) − a_ins(t) )^T ( Σp^j_ins,acc(t) + Σ_ins,acc(t) )^{−1} ( ap^j_ins(t) − a_ins(t) ),

and similarly for the gyroscope,

    e_gyro(j,t) = e_gyro(j,t−1) + ( ωp^j_ins(t) − ω_ins(t) )^T ( Σp^j_ins,gyro(t) + Σ_ins,gyro(t) )^{−1} ( ωp^j_ins(t) − ω_ins(t) ),   (22)

the magnetometer,

    e_mag(j,t) = e_mag(j,t−1) + ( Ω^j(t) − Ω_mag(t) )^T ( Σ_mag + P^j_Ω(t) )^{−1} ( Ω^j(t) − Ω_mag(t) ),   (24)

and the GPS,

    e_GPS(j,t) = e_GPS(j,t−1) + ( C T^j(t) + B − T_GPS(t) )^T ( C P^j_T(t) C^T + Σ_GPS )^{−1} ( C T^j(t) + B − T_GPS(t) ).   (26)

Each error is converted to a weight, e.g.,

    w_GPS(j,t) = exp( −e_GPS(j,t) ),                                    (27)

and the overall weight of hypothesis j is the product

    w(j,t) = w_vis(j,t) w_acc(j,t) w_gyro(j,t) w_mag(j,t) w_GPS(j,t).   (28)
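The weighting scheme can be sketched numerically: each per-sensor error is a squared Mahalanobis distance, and the hypothesis weight is the product of exp(−e) terms. The function names and the sample residuals and covariances below are illustrative, not taken from the paper's experiments.

```python
import numpy as np

def mahalanobis_err(residual, cov):
    """Squared Mahalanobis distance r^T cov^{-1} r, the form of the
    per-sensor error terms."""
    r = np.atleast_1d(np.asarray(residual, dtype=float))
    return float(r @ np.linalg.solve(np.atleast_2d(cov), r))

def hypothesis_weight(errors):
    """Overall weight: the product of exp(-e) over all sensor errors,
    i.e. exp(-(e_vis + e_acc + e_gyro + e_mag + e_GPS))."""
    return float(np.exp(-sum(errors.values())))

# Hypothetical residuals/covariances for one hypothesis:
errors = {
    "vis": mahalanobis_err([0.5, -0.2], np.eye(2) * 0.25),   # pixels
    "gps": mahalanobis_err([1.0, 0.0, 0.5], np.eye(3) * 4.0) # metres
}
w = hypothesis_weight(errors)
```

Hypotheses with large, poorly explained residuals receive exponentially smaller weights; in a full system the weights would be normalized across hypotheses before selection.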
Note that when the visual measurements are not good, all the reprojection errors are very high. The robust function then leads to almost equal visual weights for all hypotheses, so w(j,t) is not much affected by the vision weights.
The initial rotation with respect to the NED frame is determined from the accelerometer and magnetometer at rest, and the translational and rotational velocities are set to zero. After a number of frames, when the EKFs are assumed to have converged, for every hypothesis we triangulate the 3D positions of the features in its Part 2 using the motion in its Part 1. The covariance of every 3D feature is determined using the unscented transform [Julier and Uhlmann 1997], the covariance of the motion in Part 1, and the covariance of its image feature. The triangulated 3D features and their covariances are used to initialize the individual EKFs in Part 2. A similar procedure is used to insert new features. When a new feature is detected (n.b., our system keeps the number of visible features constant), it is tracked for a certain number of frames (e.g., from t1 to t2). Then, for each of the available hypotheses, the 3D location of the feature is determined by triangulation and used as an initial value for that feature's EKF in every hypothesis.
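The triangulation step above can be sketched with a linear two-view (DLT-style) triangulation in the calibrated setting of eq. (3), with f = 1. The unscented-transform covariance computation is omitted, and the function name is ours.

```python
import numpy as np

def triangulate(R1, T1, u1, R2, T2, u2):
    """Linear two-view triangulation of a feature from normalized image
    points u1, u2 and camera poses (R, T) with Xc = R X + T (eq. 3)."""
    rows = []
    for (R, T, u) in ((R1, T1, u1), (R2, T2, u2)):
        P = np.hstack([R, T.reshape(3, 1)])  # 3x4 projection (calibrated)
        # Each view contributes two linear constraints on the homogeneous X.
        rows.append(u[0] * P[2] - P[0])
        rows.append(u[1] * P[2] - P[1])
    A = np.array(rows)
    _, _, Vt = np.linalg.svd(A)              # null vector of A
    X = Vt[-1]
    return X[:3] / X[3]                      # dehomogenize
```

For example, a point at (1, 1, 4) seen from the identity pose and from a pose translated by (−1, 0, 0) projects to (0.25, 0.25) and (0, 0.25), and the triangulation recovers the original point.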
The reprojection errors are computed between the obtained projections and the detected feature positions. We use these reprojection errors to evaluate different estimators. If the errors in the image projections are Gaussian, the estimator with the smallest reprojection error would be the maximum likelihood estimator; otherwise, in the non-Gaussian case, it would be the BLUE (Best Linear Unbiased Estimator). Figures 11, 12,
and 13 show the reprojection errors of the state vectors at time 160, of all the (20) considered hypotheses on all the frames, from time 10 till 160, for the
three sequences respectively. The blue solid lines
correspond to the different hypotheses; the green
dash-dotted line corresponds to the optimal bundle
adjustment solution; the dashed red line corresponds
to the chosen hypothesis; and the dotted black line
corresponds to the full EKF solution. The chosen
hypothesis in our system always corresponds to one
of the best hypotheses; it has a reprojection error
much lower than the EKF error and it is also very
close to the error of the optimal bundle adjustment.
Figure 14 shows the estimated positions of the scene landmarks, where the diameter of each circle is proportional to the depth of the feature: the larger the diameter, the farther the point is from the camera center. A qualitative examination of these features shows that the recovered depths accord with the distances to the camera and that the 3D positions are obtained with good accuracy. The results presented thus far show the effectiveness and efficiency of
our multi-hypothesis approach. Since the kernel of
every hypothesis is the EKF filter of its Part 1, and
based on previous results of vision-inertial EKFs
[Jones et al. 2007], we expect the same performance
of our approach, with added accuracy, when we
include inertial measurements in the tests.
6. Conclusion
References
Figure 14: Recovered 3D points. The diameter of the circles is proportional to the distance from the camera.
Authors:
Adel Fakih is a Ph.D. student in Systems Design Engineering at the University of Waterloo. His research interests are in visual motion analysis, structure from motion, visual navigation, and tracking human limbs in video sequences. He holds a Master's degree in Computer and Communications Engineering from the American University of