Académique Documents
Professionnel Documents
Culture Documents
IEEE TRANSACTIONS ON CONTROL SYSTEMS TECHNOLOGY, VOL. 21, NO. 6, NOVEMBER 2013
I. I NTRODUCTION
(1)
(2)
(3)
(4)
2225
j
with (k) =
j w(k) . Finally, the a posteriori pdf gives
the means to estimate x(k) as x (k), which was our objective.
For example, we can choose x (k) to be the particle with the
highest weight.
In addition, after the weight determination, the particle
population is typically resampled and the new weights are set
to be identical. This resampling step is a crucial component of
particle filter algorithms. Resampling is necessary since it can
provide the chance for good particles to amplify themselves
and produce better and more accurate results. Moreover, it
overcomes the degeneracy phenomenon, where, after a few
iterations, all but one particle will have negligible weights.
However, it also introduces other practical issues that need
careful attention. First, it limits the opportunity to parallelize
since all the particles must be combined. Second, the particles
that have high weights are statistically selected many times.
This leads to a loss of diversity among the particles as the
resultant samples contain many repeated points. Therefore
the choice of the number of particles and of the resampling
procedure fundamentally determines the properties of the
particle filter.
We can summarize a prototypical SIR particle filter algorithm as follows [4]:
1) draw m samples x(k) j from the a priori distribution
p(x(k)|x(k 1)), using (3);
2) compute the importance weight w(k) j for each j ,
using (4);
3) compute the state estimate according to the approximated a posteriori pdf (6);
4) resample the set of particles according to their weight;
5) set w(k) j = 1/m for all particles j .
Typically, when the state dimension is larger than 5, a
high number of particles is required to capture the nonlinear
underlying dynamics and measurement equations (1) and (2)
and to obtain accurate estimates. Often, this endangers the realtime capabilities of particle filters in usual single-core CPU
systems, and it is a fundamental drawback when accurate yet
fast algorithms are needed, for example, for high-frequency
feedback control purposes. With the rise of reasonably priced
GPU hardware, these disadvantages could be overcome by
distributing the computation among the different computing
2226
IEEE TRANSACTIONS ON CONTROL SYSTEMS TECHNOLOGY, VOL. 21, NO. 6, NOVEMBER 2013
zi
zi
zi
z
zi
zi
PF
(a)
(xj , w j )
(
xi, Pi)
(
x, P )
PFi
PFi
PFi
PFi
PFi
(b)
PFi
PFi
PFi
(xji , wij )
PFi
PFi
(c)
Fig. 1. Distributed particle filter classes along with the standard centralized approach. In the distributed sensing setting, the exchange of information and
the distributed estimation process is done at the level of the sensors that share typically the local means and covariances of the state estimates ( xi , Pi ). Via
In the distributed computation setting, the distribution and communication is done
some consensus mechanism [5], they agree on a common couple ( x,
P).
j
j
at the level of computing units with particle-weight couples. The final outcome is a number of different a posteriori distributions represented via (xi , wi ).
The fact that these distributions are different means that the different local particle filters do not necessarily have to agree on a common one. (a) Standard
centralized approach. (b) Distributed sensing approach. (c) Distributed computation approach.
2227
2228
IEEE TRANSACTIONS ON CONTROL SYSTEMS TECHNOLOGY, VOL. 21, NO. 6, NOVEMBER 2013
TABLE I
AVAILABLE M ETHODS TO I MPLEMENT D ISTRIBUTED C OMPUTATION PARTICLE F ILTERS
Refs.
Sampling + Weight
Resampling
Estimation
Particles
State Dimension
[12]
Local
Local
Central
32 000
GDPF
Local
Central
Central
5000
[13]
LDPF
Local
Local
Central
5000
CDPF
Local
Central
Central
5000
Central
50 000
RPA
Local
Central
[14]
RNA
Local
Local
Central
50 000
[17]
Local
Central
Central
1 000 000
[18]
Local
Local
Unknown
4000
[19]
Local
Central
Central
130 000
only part of the particles are sent to a few specific coordinating computing units.
these are theoretical limits based on the considered hardware rather than a measured performance.
3
5
5
5
3
3
2
3
4
Runtime [ms]
1100
100
10
10
1
0.11
1000
200
10
2229
host
kernel
device
thread block
Topology mapping
Fig. 2. Basic concept of GPU architectures as available in CUDA. The global device memory is mapped into a specific topology representing data exchange
between thread blocks.
2230
IEEE TRANSACTIONS ON CONTROL SYSTEMS TECHNOLOGY, VOL. 21, NO. 6, NOVEMBER 2013
Algorithm 1 Distributed Computation Particle Filter: HighLevel Description on Each Computing Unit i
Input: {xi (k 1) j } j =1,...,m , z(k)
1. Local Filter:
for j = 1: m
1.1: sampling: xi (k) j p(xi (k)|xi (k 1) j )
1.2: weight_calculation: wi (k) j = p(z(k)|xi (k) j )
end
2.
Sorting: sort {xi (k) j } j =1,...,m
according to
{wi (k) j } j =1,...,m
3. Estimation: local_estimation: pick xi (k) j with maximal
wi (k) j
4. Particle Exchange:
foreach neighbor do
send and receive t particle-weight couples to/from
neighbors
end
5. Resampling: resample the m + Ni t particles into m
particles
6. Reset: set wi (k) j = 1/m for all j
Output: {xi (k) j } j =1,...,m
p i (x (k) |z (k)) =
j =1
particles, as
(t )
p i (x (k) |z (k)) =
t
(t )
i (k) j =1
wi (k) j x (k) xi (k) j
(8)
t
(t )
j
with i (k) =
j =1 wi (k) ; then the claim that the t
particles with highest weights are the most representative
for (7) is formally expressed as follows.
Proposition (From [24]): The KL divergence between (7)
and its approximation (8), which employes only t < m
particles, i.e., D( pi , p i(t )), can be written as
(t )
t
j
i (k)
wi (k)
(t )
= log
D( pi , p i ) = log
(9)
i (k)
i (k)
j =1
and it is minimal when we use for (8) the t particles with the
highest weights.
We can distinguish two main aspects that affect the performance of Algorithm 1. First, we have to analyze how well
each of the local a posteriori p i (x(k)|z(k)) represents the
global p(x(k)|z(k)).
m i
N
N
j
wi (k)
D( p,
p i ) =
log
(k)
i=1
i=1
j =1
m+N
N
i t wi (k) j
i (k)
(10)
=
log
+
(k)
(k)
i=1
Nm
j =m+1
(11)
2231
(S1) The higher the Ni t, the closer the local and global
a posteriori pdfs are. This follows from (10) with
W (or in the approximated sense, with Nit ).
Therefore, increasing the communication leads to an
increase of estimation quality for a given time step k
(note that the effect on the time step k + 1 depends
also on the resampling, which is analyzed next). In
particular, if we choose all-to-all communication and
t = m, then each local population is comprised at least
of the m particles with highest weight, which are the
most representative to describe p(x(k)|z(k)).
j =1
2232
IEEE TRANSACTIONS ON CONTROL SYSTEMS TECHNOLOGY, VOL. 21, NO. 6, NOVEMBER 2013
L2
L3
2
3
L1
(a)
(b)
Fig. 3. Robotic arm used for the experimental test. (a) Real testbed. (b) Schematic representation showing the camera at the end tip, marked with a grey
circle, and the monitor on which a moving object are displayed.
(13a)
(13b)
(13c)
v y (k) = v y (k 1) + wv y (k 1)
v z (k) = v z (k 1) + wv z (k 1)
(13d)
(13e)
pw pc pp ps
world
camera
plane
sensor
where describes the lens distortions. The full sensor model
is described by
ps = (R, pw , q) + s
(14)
(15)
where h s is the nonlinear function that represents the composition of the three maps in the sensor model (14). In order
to complete our measurement model, we add the independent
measurements of the angles
i (k) = i (k) + i (k),
i = 0, . . . , J
(16)
(17)
2233
TABLE II
E XPERIMENT AND S IMULATION PARAMETERS
Experiments
High-Noise Simulations
0.015 rad
0.001 m
0.05 m/s
0.075 rad
0.005 m
0.25 m/s
10 px
0.01 rad
10 px
0.01 rad
0.01s
200s
0.03 m/s
0.01s
200s
0.03 m/s
12%
Process Noise
w i
w y , wz
wv y , wv z
Measurement Noise
s
i
Other Parameters
t
Tf
Velocity of the Target
Camera View Area
Total Area
with the sampling rate of 100 Hz. This update rate is close to
the hardware limit.
The performance metric we consider is the estimation error
of the position of the target.2 In particular we define the
average error e as
e=
Tf
1
||p w (k) pw (k)||
Tf
(18)
k=1
IEEE TRANSACTIONS ON CONTROL SYSTEMS TECHNOLOGY, VOL. 21, NO. 6, NOVEMBER 2013
0.06
0.06
0.04
0.04
0.02
0.02
z [m]
z [m]
2234
0.02
0.02
0.04
0.04
0.06
0.06
0.1
0.05
0.05
0.1
0.1
0.05
y [m]
y [m]
(a)
(b)
0.05
0.1
Fig. 4. Estimated position (y, z) (in grey dots) superimposed on the actual trajectory of the object. (a) Setting is N = 32 and m = 64, which does not allow
the robotic arm to follow the object. (b) Setting is N = 2048 and m = 512, which allows the robotic arm to follow the object very well. Both experiments
have been run at 100 Hz. The empty rectangle on the left figure represents the camera view area.
7
6
m = 64
m = 128
m = 256
m = 512
5
4
3
2
128
256
512
1024
2048
Ring Topology, t = 1
All-to-all Topology, t = 1
1K
m=8
m = 16
m = 32
m = 64
m = 128
m = 256
m = 512
100
1K
2235
m=8
m = 16
m = 32
m = 64
m = 128
m = 256
m = 512
100
10
10
32
256
2048
16384
32
256
2048
16384
Fig. 6. Average error (e averaged over the 100 simulation runs) for different (N, m) settings varying the communication topology G. (a) Setting: ring
topology, t = 1. (b) Setting: all-to-all topology, t = 1.
Ring Topology, t = 0
Ring Topology, t = 4
1K
m=8
m = 16
m = 32
m = 64
m = 128
m = 256
m = 512
100
10
1K
m=8
m = 16
m = 32
m = 64
m = 128
m = 256
m = 512
100
10
32
256
2048
16384
32
256
2048
16384
Fig. 7. Average error (e averaged over the 100 simulation runs) for different (N, m) settings varying the number of exchanged particles t. (a) Setting: ring
topology, t = 0. (b) Setting: ring topology, t = 4.
able with a low number of particles (Nm < 1K), since the
single core is more powerful computation-wise than the local
cores of the GPU architecture and there is no communication
involved.
Fig. 9 can also be used for a comparison with the methods
presented in the literature. With reference to Table I, we can
report in Table III the performance of the proposed method.
As we see from Table III, the presented method outperforms in terms of the number of particles, state dimension,
and/or runtime state-of-the-art methods employing GPU architectures. Furthermore, Algorithm 1 increases the achievable
performances often by orders of magnitude.
In Fig. 10 we report the estimation error in both the
distributed implementation and the centralized one. As we
notice, the estimation error for cases in which m 128 is
comparable with the centralized setting. Furthermore, in the
case of m = 512 and N 1024, the distributed algorithm
delivers better estimates than the centralized one. This has to
2236
IEEE TRANSACTIONS ON CONTROL SYSTEMS TECHNOLOGY, VOL. 21, NO. 6, NOVEMBER 2013
12
Total
100
Sampling
Runtime [ms]
10
Resampling
Exchange
6
4
m=8
m = 16
m = 32
m = 64
m = 128
m = 256
m = 512
Centralized
10
2
256
12 16 20 24 28 32 36 40 44 48 52 56
State dimension
1K
4K
16K
64K
256K
Number of particles N m
1M
4M
Fig. 10.
Comparison of the estimation error between distributed and
centralized sequential algorithms.
VII. C ONCLUSION
A. Summary
TABLE III
P ERFORMANCE OF THE P ROPOSED A PPROACH
State
Runtime
Sampling +
Resampling Estimation Particles
Weight
Dimension
[ms]
Ref.
Algorithm 1
1024
local
distributed
local
64 000
0.3
1M
2.3
4M
4.6
Distributed Implementation
Runtime [ms]
0.25
0.015
256
1K
4K
16K
64K
256K
Number of particles N m
1M
4M
In this paper, we showed that fast yet accurate nonlinear estimation is realizable and can be used in relatively high sample
rate real-time feedback controllers. In particular, we designed,
analyzed, and implemented a distributed computation particle
filter that can handle over a million particles at 100 Hz with
remarkable estimation accuracy. This was one of our goals
as formulated in question (Q) in Section II, which has been
therefore answered positively.
The results obtained were made possible by the use of a
novel distributed resampling technique that is based on the
exchange of particles via specified topologies that link local
filters. This has several advantages, among which are (P1),
(P2), and (P3), as discussed in Section V-B. In particular, it
is crucial in limiting the number of exchanged particles and
still increasing the estimation accuracy (without the need for
centralized data collection).
Furthermore, we were able to infer some theoretical properties of the proposed algorithm, which also demonstrated why
our method performs better than available implementations.
These properties, explicitly (S1), (S2), (S3), and (S4) of
Section V-E, indicate that the choice of exchange topology
(e.g., the number of neighbors) and number of exchanged
particles are critical for the accuracy of the filter and depend on
the problem at hand. In general, a significant gain in accuracy
has to be expected passing from no-exchange to exchange of
particles among the local filters (even if the filters exchange
just one particle), while for high process noise settings it is
advisable not to choose topologies with a high number of
neighbors, which will cause high levels of distortion.
Through simulations and experiments we showed that our
proposed method outperforms other implementations that can
be found in the literature. In particular, our implementation increased the number of particles, the state dimension,
and/or the sampling frequency often by orders of magnitude with respect to state-of-the-art GPU solutions (typically
based on parallel algorithms instead of our distributed ones).
Moreover, we showed that the proposed scheme has compara-
2237
2238
IEEE TRANSACTIONS ON CONTROL SYSTEMS TECHNOLOGY, VOL. 21, NO. 6, NOVEMBER 2013
Tams Keviczky received the M.S. degree in electrical engineering from the Budapest University of
Technology and Economics, Budapest, Hungary, in
2001, and the Ph.D. degree from the Control Science and Dynamical Systems Center, University of
Minnesota, Minneapolis, in 2005.
He is currently an Assistant Professor with the
Delft Center for Systems and Control, Delft University of Technology, Delft, The Netherlands. He
has been a Research Intern with Honeywell Laboratories, Minneapolis, and a Post-Doctoral Scholar
with the Control and Dynamical Systems, California Institute of Technology,
Pasadena. His current research interests include distributed optimization
and optimal control, model predictive control, and distributed control and
estimation of large-scale systems with applications in aerospace, automotive,
mobile robotics, industrial processes, and infrastructure systems.
Dr. Keviczky was a co-recipient of the AACC O. Hugo Schuck Best Paper
Award for Practice in 2005.