
Parallel Computing 32 (2006) 205–221

www.elsevier.com/locate/parco

A parallel algorithm, architecture and FPGA realization


for landmark determination and map construction
in a planar unknown environment
P. Rajesh Kumar, K. Sridharan *, S. Srinivasan
Department of Electrical Engineering, Indian Institute of Technology Madras, Chennai 600 036, India

Received 15 March 2005; received in revised form 24 June 2005; accepted 26 September 2005
Available online 29 November 2005

Abstract

Good representations of the environment are crucial for navigation of autonomous systems. This paper presents a time and
space-efficient hardware-based solution for construction of an environment map based on range sensor data gathered by a robot
at P locations in an unknown 2D environment. In particular, the paper proposes a new parallel algorithm based on training a set
of K place units for generating ‘‘perceptual landmarks’’ and connecting the landmarks based on sensor information. The
computational complexity of this parallel algorithm is O(MP log K), where M is the number of training iterations. The algorithm has
been mapped to hardware. The architecture comprises components such as Content Addressable Memory and Wallace tree
adders. The architecture has been implemented on a Xilinx Field Programmable Gate Array (FPGA), and results show that the
proposed design achieves high speed, taking approximately 100 ms to identify 49 landmarks and build a map of an environment
of size 3.75 m × 3.75 m. Further, the entire design for an environment with 196 sample points and 49 processing elements fits in
one XCV3200E (or XC2V6000) device. The details of the experimental setup for gathering data are also presented.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Landmark determination; Parallel algorithm; Architecture; Field Programmable Gate Array (FPGA) implementation;
Unknown environments

1. Introduction

Navigation of autonomous systems in unknown environments has been of tremendous interest during the
last decade. Our interest in this paper is on a parallel solution for an aspect of navigation of a robot (denoted
by R henceforth) that has multiple range sensors.
One approach for navigation is based on identifying a set of landmarks. A simple method to construct such
a set is based on using the dimensions of the room (environment) in which the navigation takes place and the
step size of R to determine G grid points at which range sensor readings should be obtained.

* Corresponding author.
E-mail addresses: rkpanakala@yahoo.com (P. Rajesh Kumar), sridhara@ee.iitm.ac.in (K. Sridharan), srini@ee.iitm.ac.in (S. Srinivasan).

0167-8191/$ - see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.parco.2005.09.004

A strategy that attempts to use the sensor data at all of these G grid points would encounter the following
difficulty: some of the grid points may not be accessible to R because of obstacles (such as chairs, tables)
in those locations. We consider here the subset P (of the G grid points) at which the robot can obtain
sensor readings. Among P, there may be many points in close proximity and it is desirable to simply have
representatives of sets of nearby points. Good representatives are appropriate for navigation purposes
since they enable rapid path finding. We call these representatives perceptual landmarks or simply
landmarks.
After the robot has explored the environment (by some form of graph search) and collected data, it is
important to compute these perceptual landmarks as fast as possible and use them to build a map of the
environment for navigation. The necessity for high speed calculations is motivated by semi-dynamic/
dynamic objects that move between different locations in the environment (examples include furniture
getting rearranged or humans moving between places during short periods). If one could process the
environment data quickly, it would facilitate early transfer of R from one point to another. Further, the
ability to process the environment data on one small device is valuable and crucial to the success of various
tasks.
Landmark determination has been pursued by many researchers in the past. Different definitions have
been used to characterise a landmark. These include location of the sensing agent [1], a real-world object
at a known location [2], a place unit [3] and one based on clustering [4]. The landmarks computed in this
paper use ideas on perceptual landmark determination [4] based on self organizing mechanisms [5]. The
focus of prior work has been primarily on the development of sequential algorithms and implementations
on single general-purpose processors. Prior work on dedicated hardware is in the area of architectures for
environment representation—these include the definition of a virtual rectangle for path calculations [6] and
a linear approximation-based hybrid approach for binary logarithmic conversion [7]. A path planning archi-
tecture based on assumption of availability of vision data is presented in [8]. To the best of our knowledge,
there are no parallel algorithms or architectures for landmark determination and map construction in
unknown environments.
Our objective is twofold: (i) to present a fast parallel algorithm (time-efficiency) for landmark determination
and map building and (ii) to perform the processing for a reasonably-sized environment within a device (space-
efficiency). Further, we also present power consumption data for processing using a Field Programmable
Gate Array (FPGA) device.
An important feature of the proposed parallel solution for landmark determination and map construction
is that it works for an environment that may have obstacles of arbitrary shapes. The algorithm assumes the
existence of a network of K place units [3] that is trained iteratively on the sample data set of size P. The out-
put of the algorithm is a set of landmarks and connectivity information. The complexity of the parallel scheme
is O(MP log K) where M is the number of training iterations. Our hardware-directed approach treats each
place unit as a separate Processing Element (PE). The PEs operate in parallel during various phases of land-
mark determination and map construction.
An architecture based on the new algorithm is developed that incorporates parallelism while keeping space
requirements low. The architecture has a hybrid nature with partly distributed processing and partly centra-
lised processing: some portion of the processing is done by all the processing elements in parallel. However, in
addition to the processing elements (PEs) corresponding to the place units, there are other computational units
such as a winner determination unit, a Content Addressable Memory (CAM) controller, etc. The winner
determination task is, for example, centrally performed.
An implementation of the proposed architecture on FPGAs is also presented. FPGAs are attractive for this
application due to a variety of reasons: First, a parallel algorithm can be directly implemented on an FPGA
device. Second, they provide a compact and lightweight alternative (to PCs) for processing sensor data—they
can be placed atop the moving robot/autonomous system. FPGAs have been identified for use in various
applications [9,10] in view of the ability to get high performance solutions on a small device that may other-
wise take up an entire board of parts. Codesign systems and coprocessors using FPGAs have been developed
and employed in many applications in the last decade [11–13].
The paper concludes with a brief description of the experimental setup used for gathering data to perform
landmark determination and map construction.

2. Assumptions and terminology

The robot R is assumed to begin collection of data from a corner (representing the origin). The environment
may have obstacles of arbitrary shape. The motion of the system is restricted to the +x, −x, +y and −y directions.
R is equipped with four range sensors. The sensors point along the +x, +y, −x, and −y directions. Ultrasonic
transducers with a narrow beam are part of the range sensor setup; nevertheless, obstacles up to a distance of
approximately 5 m can be handled. The nature of permitted obstacles is discussed below. The details of the
range sensor setup are given in Section 6.
Obstacles are assumed not to be anechoic and, further, obstacles do not have any openings that may prevent
reception of the signal by the transducer setup. We also assume that obstacles are sufficiently isolated—otherwise,
energy may return from multiple surface reflections, resulting in erroneous distance values. Further, any loca-
tion's coordinates are computed with respect to an origin from where the exploration begins, based on the
dimensions of the room, the step size of R and the steps taken. Data gathered at a given robot location s
are stored in a sample point structure (a non-homogeneous entity analogous to a struct in C) and are given by
[W0[s], . . . , W5[s]]. P sample points (each of type sample point) are assumed to be available.
The notion of a place unit, which is a structure (a non-homogeneous entity with six components), is used in
this paper. The first two components of the place unit structure contain coordinates (defining the location of a
place unit) while the remaining four correspond to sensor values along the +x, −x, +y and −y directions. A set of K
place units (each of type place unit and denoted by N[i], i = 1, . . . , K) will represent perceptual landmarks (or
simply landmarks) after training and appropriate adjustment. The representation for place unit N[i] will be
denoted by [N0[i], . . . , N5[i]]. The place unit components are randomly initialised. Figs. 3 and 4 illustrate the
notions of sample point, place unit and landmark.
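
For illustration, the two structures described above can be written in C; the field names below are ours, since the paper fixes only that each entity has six components, held as 16-bit values in the PE registers of Section 4.2.

#include <stdint.h>

/* Illustrative C structures for the entities defined above; field names
 * are hypothetical.  Components are 16-bit, matching the PE registers
 * described in Section 4.2. */
typedef struct {
    int16_t x, y;                    /* W0[s], W1[s]: robot location          */
    int16_t d_xp, d_xn, d_yp, d_yn;  /* W2[s]..W5[s]: range along +x,-x,+y,-y */
} sample_point;

typedef struct {
    int16_t x, y;                    /* N0[i], N1[i]: place unit coordinates  */
    int16_t d_xp, d_xn, d_yp, d_yn;  /* N2[i]..N5[i]: sensor values (Step 5)  */
} place_unit;
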
Landmarks are obtained from the place units since clustering takes place and the values of the components
get altered. The activity at each place unit, which would otherwise be performed sequentially on a single
general-purpose processor (one place unit at a time), is parallelizable. Hence, each place unit can be mapped and imple-
mented as a separate Processing Element (PE) in hardware. It turns out that a small number of roughly 30–40
landmarks (therefore the same number of PEs) is usually adequate for rooms of dimension 5 m × 5 m (approx-
imately). The maximum number of PEs used in the hardware implementation presented in this paper has been
determined based on estimation of the amount of hardware required (and therefore space consumed) in each
PE. Our design accommodates 49 PEs in a moderately-sized FPGA device (such as XCV3200E or
XC2V6000).
Training of a set of place units is a process wherein the randomly assigned initial values to each component
of a place unit get altered depending on the robot's data at each sample point. Training takes place with
respect to each sample point and for several iterations. Identification of place units whose components will
have changes is done by determining a winner place unit for each sample point and at each iteration as follows.
The distance of each place unit from a given sample point is found using city-block metric (l1 norm) since it is
digital hardware-friendly. The distance in l1 norm between a place unit N[i] and a sample point s is given by
|N0[i] − W0[s]| + |N1[i] − W1[s]|. (It should be noted that training in our scheme uses only the first two com-
ponents to get a good distribution of place units; more details are presented in Section 3). The nearest place
unit (using the l1 norm) is declared to be the winner place unit. The components of the winner place unit are
adjusted using a parameter named learning rate and denoted by a. The new values for the components are
obtained using Eq. (1).

N0^new[i] = N0^old[i] + a[W0[s] − N0^old[i]]
N1^new[i] = N1^old[i] + a[W1[s] − N1^old[i]]          (1)

a is assigned a positive value between 0 and 1 initially. When the learning rate is high (>0.5), the network tends
to learn only the latter samples while for very small values of a (<0.01), the ability to learn slows down.
In general, besides the winner place unit, a few other place units within some distance of the winning place
unit may also have their components altered. These place units are identified by a parameter named neighbor-
hood radius and denoted by r. The adjustment of components of these neighbors also follows Eq. (1). An initial
value for the neighborhood radius is determined from the dimensions of the grid of place units. While the
network is trained via a number of iterations, the learning rate does not remain fixed. In order to generate good
landmarks, the learning rate has to be steadily decreased and this process begins after a certain number of iter-
ations. A parameter of interest here is the threshold value for iteration number (denoted by I1 henceforth)
from which a is steadily reduced.
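
To make the training update concrete, a minimal sequential C sketch of the winner search and the Eq. (1) adjustment, using the structures sketched earlier in this section, could read as follows (the hardware realisation with barrel shifters is described in Section 4.2).

#include <stdlib.h>

/* City-block (l1) distance between a place unit and a sample point;
 * only the coordinate components are used, as stated above. */
static int l1_dist(const place_unit *n, const sample_point *w) {
    return abs(n->x - w->x) + abs(n->y - w->y);
}

/* Index of the nearest place unit (the winner) for sample point w. */
static int find_winner(const place_unit units[], int K, const sample_point *w) {
    int best = 0;
    for (int i = 1; i < K; i++)
        if (l1_dist(&units[i], w) < l1_dist(&units[best], w))
            best = i;
    return best;
}

/* Eq. (1): move a unit (the winner or one of its neighbors) toward w
 * by a fraction a of the difference, 0 < a < 1. */
static void update_unit(place_unit *n, const sample_point *w, double a) {
    n->x = (int16_t)(n->x + a * (w->x - n->x));
    n->y = (int16_t)(n->y + a * (w->y - n->y));
}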

3. The proposed algorithm

3.1. Key ideas

The algorithm presented in this paper assumes the existence of a network of 49 PEs. The PEs are arranged
as a 7 × 7 array. Each PE has six components and two of them correspond to x and y coordinates—the com-
ponents are initially assigned random values indicating a random ‘‘location’’ of the place unit in the space in
which R operates (along with random sensor values at those locations). The place units ‘‘settle’’ after a train-
ing phase followed by an adjustment phase.
The explored environment data is stored in global memory. The number of sample points is quite large
(here, the total number is four times the number of place units). For our experiments, the total number of
iterations is 1000. The learning rate a has been initialised to 0.5 for our work based on some experiments.
I1 defined in Section 2 has been set to 500 since we set the total number of iterations to 1000. In general, a
is decreased after approximately half the number of iterations.
The proposed algorithm has three key steps: The first step involves training of place units taking only the
coordinates of the sample points stored in the memory (the sensor values are not used for training as such
since our initial experience using all the components has resulted in a less desirable distribution of place units;
assignment and use of sensor values to determine information on adjacency between place units is described
below). The training process identifies a winner place unit for each sample point at each iteration. The com-
ponents of the winner place unit are adjusted as per Eq. (1).
The second step consists of translating place units to a sample point in the ‘‘vicinity’’ in the 2D space (this
takes into account the situations where a place unit after training goes into the interior of some object—how-
ever, since there is no prior knowledge of the object geometries, translation of all place units is essential). Fig. 1
provides an illustration for the situation where a place unit could be after training suggesting the need for
adjustment of position. The sample point in the vicinity to which a place unit is translated is determined as
follows. Sample points may not be uniformly spaced since the location of an obstacle may prevent R from get-
ting data at a specific grid vertex. Place unit locations correspond to one of the sample points after translation.
For example, in an environment of size 400 cm × 400 cm (the environment size here uses information on step
size (40 cm) of the robot used), the sample points may have coordinates given by (40, 40), (40, 120) and so on.
If a place unit has coordinates of (37, 35) after training, it will get adjusted to (40, 40). It should be noted that a
grid point such as (40, 80) may not be a valid one owing to the presence of an obstacle.
The sensor values are assigned to each place unit after the adjustment of location and they precisely cor-
respond to those of the sample point (to which a place unit has translated).

Fig. 1. Scenario indicating adjustment of place unit locations.



Fig. 2. Connecting a place unit to another.

The third step consists of generation of the map taking into account the sensor values stored for each place
unit after movement. In particular, we build an adjacency structure containing information on the list of place
units adjacent to each place unit. Fig. 2 illustrates how a place unit may be connected to another place unit on
the basis of sensor values.
An important feature of the algorithm is the amenability for hardware implementation. The algorithm can
be readily parallelized. When the components (of place units and sample points) have integer values, the oper-
ations on them are such that they do not introduce floating point numbers. In the next section, the proposed
parallel algorithm is presented.

3.2. The new parallel algorithm

Algorithm (CONSTRUCT_MAP).

Input: Sample points of the explored environment.


Output: Adjacency structure of size K × K. A place unit N[i] is connected to another place unit N[j] typically by
two segments (one horizontal segment taking sensor data along +x (−x) for N[i] and N[j] and one vertical
segment taking sensor data for N[i] and N[j] along +y (−y)).

Step 1: Initialize learning rate (a), neighborhood radius (r) and I1.
Step 2: Initialize first two components of all place units in parallel with random values.
Step 3: for j = 0 to M
          if (j >= I1)
            a = a * 0.5;
          for s = 0 to P
            for each of the K place units do in parallel
              compute the distance in the city-block metric
              (l1 norm, as defined in Section 2)
            end
            Determine the winner place unit by binary
            tree-structured comparison.
            Update the components of the winner place unit and
            the four neighbors (along +x, −x, +y
            and −y) as per Eq. (1) in Section 2
          end /* for sample points */
        end /* for iterations */
Step 4: for all place units do in parallel
find a sample point in the vicinity in
constant time and adjust x and y coordinates
end

Step 5: for all place units do in parallel
          load the remaining components of
          all the place units with the sensor values
          of the vicinity sample point (found in Step 4).
        end
Step 6: for each place unit N[i], 1 ≤ i ≤ 49
          determine connectivity to place units N[j], j = 1, . . . , K, i ≠ j,
          based on the sensor values stored; repeat for all place units.
end

A summary of key steps is as follows. In step 3, training begins. In step 4, the locations of the place units are
adjusted. In step 5, sensor values are loaded in the place units using those of the sample points (to which the
place units are transferred). Step 6 establishes connectivity between the place units. Parallelism in various steps
contributes to the desired speedup. Sequential operation in Step 6 ensures that the design fits in a moderately-
sized FPGA device. Further, K < P and K < M in practice. Hence, the overall time complexity is unaltered as
established in Section 3.3.
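
The O(log K) winner-search cost quoted in Section 3.3 comes from the tree-structured comparison in Step 3. A software emulation of that comparator tree is sketched below; padding the distance array to a power of two with large sentinels is our simplification, and in hardware each level of the reduction runs in parallel.

#include <limits.h>

/* Emulation of the binary tree-structured winner search.  dist[] holds
 * the l1 distance of each of the K place units to the current sample
 * point and idx[] their indices (idx[i] = i); both arrays must have room
 * for the next power of two above K.  The outer loop corresponds to the
 * levels of the comparator tree, i.e. O(log K) parallel steps. */
static int tree_winner(int dist[], int idx[], int K) {
    int n = 1;
    while (n < K) n <<= 1;
    for (int i = K; i < n; i++) { dist[i] = INT_MAX; idx[i] = -1; }
    for (int step = 1; step < n; step <<= 1)
        for (int i = 0; i < n; i += 2 * step)
            if (dist[i + step] < dist[i]) {
                dist[i] = dist[i + step];
                idx[i] = idx[i + step];
            }
    return idx[0];    /* index of the winning place unit */
}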

Remark 1. M in the algorithm has been set to 999 for our experiments. In general, when the number of iter-
ations increases, the training results are better. P corresponds to 196. This number has been chosen taking into
account the dimensions of a room and the step size of our robot. I1 has been set to 500 (in accordance with the chosen
value for M).

Remark 2. In Step 4, we find a sample point in the vicinity of a place unit in constant time by checking
between which pair of x values of sample points the x value of the place unit falls (in constant time). Once
the x adjustment is done, the same procedure is repeated for the y value of the place unit.
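
As a concrete reading of Remark 2, when the grid pitch is uniform the vicinity coordinate can be obtained by a direct index computation; the sketch below assumes the 40 cm step of the example in Section 3.1 and leaves the rejection of obstructed grid points to the CAM/special memory mechanism of Section 4.

/* Snap one trained coordinate to the nearest grid coordinate in O(1).
 * step is the sampling pitch (40 cm in the example of Section 3.1) and
 * max_steps the number of steps along that axis.  For instance, 37
 * snaps to 40, matching the (37, 35) -> (40, 40) example in the text. */
static int snap_to_grid(int value, int step, int max_steps) {
    int k = (value + step / 2) / step;   /* nearest multiple of the pitch */
    if (k < 1) k = 1;
    if (k > max_steps) k = max_steps;
    return k * step;
}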

3.3. Time complexity analysis

The complexity of the parallel algorithm can be analysed as follows. Steps 1 and 2 take O(1) time. Step 3
takes O(MP + MP log K) time since all place units find distance to a point s in parallel and then binary-tree
structured comparison takes O(log K) time to identify the nearest place unit. Steps 4 and 5 also take O(1) time.
Step 6 takes O(K2) time. The overall complexity is therefore O(MP + MP log K + K2) or O(MP log K + K2).
Typically, K < M and further, K < P. Hence, the overall complexity can be expressed as O(MP log K).
Remark 3. A sequential version of the algorithm presented above would have a higher computational
complexity. In particular, Step 2 would take O(K) time. Step 3 would take O(MPK) time. Steps 4 and 5 require
O(KP) time while Step 6 takes O(K2) time. The overall complexity is therefore O(MPK). It should, however, be
noted that the KP factor in step 4 and the lack of parallelism in Step 2 (in addition to the lack of parallelism in
Step 3) lead to a much larger execution time on a PC. Further, a PC-based solution demands more (physical)
space than an FPGA-based solution.

3.4. Simulation results

The sequential version of this algorithm has been simulated on a PC with a Pentium IV processor
(1700 MHz) and 256 MB of RAM to verify correctness. Different types of objects (polygonal, curved) were
included in the simulation.
The first simulation is for a room of size 4 m × 4 m. The place units prior to training, along with the sample
points, are shown in Fig. 3. The landmarks are shown in Fig. 4. The step size was taken to be that of R. The
total number of sample points is 60. The number of landmarks is 25. It can be observed that the landmarks
cover the space well. Since they represent a small subset of the sample points, computations for path finding

Fig. 3. Simulation for a room of size 4 m × 4 m: place units prior to training shown.

Fig. 4. Simulation for a room of size 4 m × 4 m: landmarks shown.

would require less time (than that using the entire set of sample points). The map obtained is therefore
valuable for path finding.

The second simulation is for a room of size 6 m × 6 m. Various types of objects have been included. Step
size is the same as for the first simulation. A total of 162 sample points are present. The number of landmarks
is 36. The place units before training are shown in Fig. 5. Landmarks after training are shown in Fig. 6.

Fig. 5. Simulation for a room of size 6 m × 6 m: place units before training are shown.

Fig. 6. Simulation for a room of size 6 m × 6 m: landmarks shown.



4. The proposed architecture

The overall architecture is shown in Fig. 7. Internal details of various elements are given in the subsections
to follow.

4.1. Random number generation

Step 2 of the proposed algorithm assigns random values to the coordinates of the place units. This is accom-
plished in hardware by means of a random number generator in Fig. 7. The random number generator pro-
duces a 4-bit address that is used for initialisation of the first two components of all the place units. It is built
using a Linear Feedback Shift Register (LFSR) [14]. The LFSR consists of four one-bit registers. A 4-bit value is
applied as a seed to initialise the random number generator. The LFSR then cycles through 15 states before
repeating the sequence, producing a pattern of 4-bit random numbers.
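
A software model of such a generator is shown below; the paper states only that the LFSR has four one-bit registers and cycles through 15 states, so the particular tap positions (feedback = bit 3 XOR bit 2, a maximal-length configuration) are our choice.

#include <stdint.h>

/* 4-bit Fibonacci LFSR.  The tap positions are an assumption; with
 * feedback = bit3 XOR bit2 the register visits all 15 non-zero states
 * before repeating, as the text requires.  The seed must be non-zero. */
static uint8_t lfsr4_next(uint8_t state) {
    uint8_t fb = (uint8_t)(((state >> 3) ^ (state >> 2)) & 1u);
    return (uint8_t)(((state << 1) | fb) & 0x0Fu);
}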

4.2. Processing element (PE) structure

The circuitry for each PE is shown in Fig. 8. The circuitry for the block labelled ‘‘Training Unit’’ in Fig. 8 is
shown in Figs. 9 and 10.
In Step 3 of the proposed algorithm, training of place units takes place. A feature of the design is that the inter-
nal state machine of Fig. 8 is operated at twice the frequency of the main controller to speed up the training (in the
implementation, the clock frequency of the main controller is obtained and the frequency of the internal state
machine is set suitably; the critical path in the actual main controller circuit is approximately three times that
of the circuit for the internal state machine). The internal state machine sequences the tasks such as initialisation

Fig. 7. Overall architecture of the map finder.



Fig. 8. Internal block diagram of a PE.

Fig. 9. Circuitry for winner determination.

of components, computation of difference between components of place units and sample points, loading of sen-
sor values into the corresponding components of the place unit, and loading of elements of a row of the adjacency
structure into adjacency memory upon receiving control signals from the main control unit. A set of six 16-bit
registers is used to store components of the PE (corresponding to a place unit). Bidirectional data buses are used
during initialisation of components, training of the PE and loading of sensor values. Index carrying buses supply
the identity of the PE to the external winner determination circuit during training.
Another feature of the PE architecture is the use of barrel shifters to speed up updating of the components
of place units depending on the value of the learning rate parameter. Two barrel shifters are in particular used
to compute two place unit components. The ALU performs computations to measure the distance between

Fig. 10. Circuitry for updation of components of winner PE.

first two components of the place unit and coordinates of sample points. Each PE interacts with global mem-
ory (during Steps 3 and 5) whose organisation is presented in the next subsection.
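
Since the learning rate starts at 0.5 and is halved thereafter, the multiplication in Eq. (1) reduces to an arithmetic right shift, which is our reading of the barrel shifters and the Shft input in Fig. 8. A fixed-point sketch of that update, with a = 2^(-shift), follows.

#include <stdint.h>

/* Shift-based form of Eq. (1): with a = 2^(-shift), a*(W - N_old) is an
 * arithmetic right shift of the difference (arithmetic on typical
 * targets).  Restricting the learning rate to powers of two is our
 * interpretation of the barrel shifter datapath; the paper does not
 * state it explicitly. */
static int16_t shift_update(int16_t n_old, int16_t w, unsigned shift) {
    int32_t diff = (int32_t)w - (int32_t)n_old;   /* W[s] - N_old[i] */
    return (int16_t)(n_old + (diff >> shift));    /* N_new[i]        */
}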

4.3. Global memory organisation

Steps 3 and 5 of the proposed algorithm require access to global memory (implemented so as to use
components of the FPGA device). The organisation of this memory is shown in Fig. 11. Global memory is

Fig. 11. Global memory organisation.



organised in 2-D as a set of u × v memory blocks, where u denotes the maximum number of steps that can be
made along x-axis and v denotes the number of components in a place unit. Each memory block holds q 16-bit
elements, where q denotes maximum number of steps that can be made along y-axis. When v = 6, the place
unit holds coordinates of sample point position and distance data from four sensors. The interconnection
of PEs and global memory is such that parallel loading of data items of different memory blocks into all
PEs is possible during initialisation of place unit components.
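
One way to read this organisation is as a two-level index: the x step and the component select one of the u × v blocks, and the y step addresses the 16-bit word inside the chosen block. The numbering below (block = x_step * v + component) is our reading of Fig. 11; with u = 14 and v = 6 in the implementation described here it gives 84 blocks, but the exact numbering is not stated in the paper.

/* Hypothetical addressing of the u-by-v global memory organisation
 * (u = 14 steps along x, v = 6 components, q = 14 words per block here). */
static unsigned gm_block(unsigned x_step, unsigned component, unsigned v) {
    return x_step * v + component;     /* which of the u*v memory blocks */
}
static unsigned gm_word(unsigned y_step) {
    return y_step;                     /* 16-bit word within the block   */
}
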
We now describe one of the key elements in the architecture, namely Content Addressable Memory (CAM).
CAM contributes to high speed as well as effective device utilization (when mapped suitably onto target FPGA
devices).

4.4. Content Addressable Memory (CAM)

Step 4 in the proposed algorithm consists of a ‘‘match-finder’’ that is implemented in hardware using
CAMs. The time required to find an item in memory can be reduced considerably if stored data can be
accessed by the content itself rather than by an address [15–18]. Conventional RAM implementations use
addresses. An address-based search for a data element is typically performed sequentially and is slow. The
alternative that yields high speed is based on a circuit that uses a few registers and a select logic which per-
forms the function of a CAM. In our application, we have multiple CAMs (one for each x-coordinate). A sam-
ple point in the ‘‘vicinity’’ of a PE is found as described in Section 3 without operations such as squaring and
square root calculation. The circuit structure is shown in Fig. 12. Details are presented next.
For each (sampling) step along x-axis, one CAM has been assigned. The total number of CAMs
corresponds to the number of steps along the x-axis. Our implementation uses 14 since we have a total of

Fig. 12. CAM related circuits.



196 sample points (at some of these points, there may be an obstacle preventing data collection, hence storage
must be appropriately handled and is discussed below). Each CAM contains as many entries as the number of
steps along the y-axis. A location in a given CAM may store the (step) value along y, if that could be explored
(and no obstacle was present). Otherwise, we store zero. Each PE selects a CAM depending on the first two
components of the place unit structure. If the CAM generates a MATCH_OK signal, then the PE is loaded
with sensor values from the global memory. Otherwise, it is loaded with sensor values from special memory
(described in Section 4.5).
Multiple CAM controllers, one for each CAM, are present. Each CAM controller contains a register
(whose size matches the number of PEs). The appropriate bit is set by those PEs which require the address
of the matched data. A Wallace tree adder is used to count number of ones in the register (in our implemen-
tation, the register is 49-bit since we have 49 PEs). A priority encoder determines for which PE, the match
finding has to be carried through the controller. FPGA-based aspects of CAMs are presented in Section 5.
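
The bookkeeping performed by a CAM controller can be emulated in software as follows; the population count stands in for the Wallace tree adder and the lowest-set-bit search for the priority encoder, with the 49-bit request register held in a 64-bit word as our simplification.

#include <stdint.h>

/* ones_count mirrors the Wallace tree adder that counts pending PE
 * requests; lowest_set mirrors the priority encoder that picks which
 * PE's match is serviced next. */
static int ones_count(uint64_t requests) {
    int c = 0;
    while (requests) { requests &= requests - 1; c++; }  /* clear lowest set bit */
    return c;
}

static int lowest_set(uint64_t requests) {
    if (requests == 0) return -1;           /* no PE is requesting */
    int i = 0;
    while (((requests >> i) & 1u) == 0) i++;
    return i;                               /* index of the PE served next */
}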

4.5. Special memory

While the CAM portion of the architecture presented in Section 4.4 handles storage of sample points (cor-
responding to sampling of the environment), the special memory unit (shown in Fig. 7) deals with storage of
components for the place units (PEs). Step 5 of the proposed algorithm requires use of special memory. In the

Fig. 13. Hardware for adjacency determination.



absence of obstacles, identification of a sample point in the vicinity for any place unit would be straightfor-
ward. However, when one or more obstacles exist, some of the ‘‘planned’’ sampling locations are unreach-
able—one coordinate (or both) for such a point stored in the CAM would be zero. A place unit cannot be
moved to such a location. Whenever a match for a sample point in the vicinity (for a place unit as per Step
5 of the algorithm) is not found, the PE (corresponding to the place unit) is loaded with the values from the
special memory block from a randomly chosen address. Special memory consists of four blocks corresponding
to sensors pointing in the directions +x, −x, +y and −y.

4.6. Adjacency determination unit

Step 6 of the proposed algorithm that connects the appropriate place units (mapping to PEs) is imple-
mented as a separate unit in hardware. Fig. 13 presents details of adjacency determination for each place unit
I. I0, I1, . . . , I5 represent its components. A place unit S whose connectivity to I is examined has components
denoted by S0, S1, . . . , S5. Direction information is also indicated: A place unit may be to the north or south of
another. In general, O, N, E, W, S, NE, NW, SE and SW in the figure denote respectively on (place unit on
another), north, east, west, south, north-east, north-west, south-east, and south-west. It must be noted that the
memory requirements for Step 6 are handled by the separate set of memory blocks shown in Fig. 7.
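
The paper does not spell out the exact decision rule used by the adjacency decider, but one plausible reading of Fig. 2 and of the Output description in Section 3.2 is that N[i] and N[j] are declared adjacent when the stored range readings clear the horizontal and vertical legs connecting them. A heavily hedged sketch of such a test, using the structures from Section 2, is given below.

/* Hypothetical adjacency test based on our reading of Fig. 2: join a and
 * b by a horizontal leg (at a's y) and a vertical leg (at b's x) and
 * require the relevant stored range readings to cover each leg.  The
 * rule actually implemented by the adjacency decider may differ. */
static int maybe_adjacent(const place_unit *a, const place_unit *b) {
    int dx = b->x - a->x, dy = b->y - a->y;
    int x_clear = (dx >= 0) ? (a->d_xp >= dx) : (a->d_xn >= -dx);
    int y_clear = (dy >= 0) ? (b->d_yn >= dy) : (b->d_yp >= -dy);
    return x_clear && y_clear;
}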

5. FPGA implementation

The design of the environment map finder has been coded in VERILOG-2001. The design has been simu-
lated using ModelSim for functional verification. It has then been synthesized using Synplify 7.6 for implemen-
tation on Xilinx FPGAs. In particular, the design has been mapped onto two devices of Xilinx belonging to
the Virtex series to see the performance on each one of them. The specifications of the chosen devices are given
in Table 1. General information is provided without reference to the packaging. After mapping, placement
and routing of FPGA components have been carried out.
The percentage of different components of the device used by the design after placement and routing as well
as the execution time are tabulated in Table 2. The data on execution time has been obtained from number of
clock cycles required (obtained from ModelSim) and the frequency of operation. The results are for M = 1000,
P = 196 and K = 49.
The results in Table 2 can be interpreted as follows. XCV3200E has a Configurable Logic Block (CLB)
array of 104 × 156. It is to be noted that every CLB has two slices. Every slice has two Look-up Tables (LUTs),
for a total of 64,896 LUTs, so the LUT usage is 94%. Also, the XCV3200E has 851,968 block RAM bits (208 4096-bit
block RAMs) and the block RAM usage is 57%. With regard to the XC2V6000, the total number of LUTs is 67,584 while the
total number of 18 Kbit block RAMs is 144, leading to usages of 89% and 82%, respectively. The data on the two
different Xilinx devices provide an indication of speed versus space consumption for two devices from different
Xilinx FPGA families: Virtex-E and Virtex-II. While speedup with XC2V6000 is more, it is in general easier to

Table 1
Specifications of target FPGA devices
Device System gates CLBs Block RAMs
XCV3200E 4,074,387 104 × 156 208 (each 4 Kbits)
XC2V6000 6M 96 × 88 144 (each 18 Kbits)

Table 2
Results of the environment mapper with 7 × 7 PEs
Environment Device Execution time (ms) LUTs Block RAMs
196 sample points XCV3200E 121 61,050 (94%) 118 (57%)
196 sample points XC2V6000 85 60,562 (89%) 118 (82%)

Fig. 14. Floorplan for implementation on XC2V6000.

fit a design into an XCV3200E than into an XC2V6000. The floorplan for implementation on XC2V6000 is
presented in Fig. 14.
One feature that has enabled accommodation of the entire design with 49 place units (processing elements)
in one FPGA device is the following. The adjustment of the first weight vector component is done by the cor-
responding PE while the adjustment of the second component of all PEs is done with the help of an adjustment
control unit, which operates at twice the clock frequency of the PE.
Another feature of the implementation is the use of block RAMs for handling CAMs. CAMs are built
using Xilinx-provided primitives. The use of block RAMs for this purpose makes available the LUTs
for other tasks in map construction. Global memory and special memory have also been implemented using
block RAMs on Xilinx Virtex devices. This has also facilitated effective device utilization.
Information on the extent of speedup (obtained) by a custom hardware implementation has been gathered
by timing a C implementation of the sequential algorithm for constructing the map on a general-purpose PC
(1700 MHz Intel processor, on-board memory of 256 MB and runs Red Hat Linux). The actual running times
of the sequential algorithm have been obtained via clock ticks using the times() function in C. The CPU time
taken for finding the environment map is approximately 50 s when M = 1000, P = 196 and K = 49. For envi-
ronments that are larger, the PC-based implementation typically takes close to a few minutes. This amounts to a
speed-up for the FPGA implementation by a factor of approximately 500. There do not appear to be any
other hardware implementations for this problem in the literature and hence our comparisons are with a ded-
icated software solution.
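
For reference, the clock-tick based timing mentioned above follows the usual times() pattern; construct_map_sequential() below is a placeholder name standing in for the authors' C implementation.

#include <stdio.h>
#include <sys/times.h>
#include <unistd.h>

extern void construct_map_sequential(void);   /* placeholder for the timed code */

int main(void) {
    struct tms t0, t1;
    long ticks = sysconf(_SC_CLK_TCK);         /* clock ticks per second */
    times(&t0);
    construct_map_sequential();
    times(&t1);
    printf("CPU time: %.2f s\n",
           (double)(t1.tms_utime - t0.tms_utime) / (double)ticks);
    return 0;
}
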
Data on power consumed by the design on the two Xilinx devices is presented in Table 3. Power data is
valuable in the context of the life of the battery used on-board (the robot). The power calculation was done
using XPower. The frequency of operation (25 MHz), the Value Change Dump (vcd) file from ModelSim for
the proposed design coded in Verilog and the voltage specification served as the basis for the power
calculation.

Table 3
Power consumption on Xilinx Virtex devices
Environment (number of points) Frequency (MHz) Power consumed (milliwatts)
196 points on XCV3200E 25 1400
196 points on XC2V6000 25 2400

Fig. 15. Mobile robot setup used.

6. Experimental setup

The robotic setup used is shown with all the accessories in Fig. 15. The setup has been designed and made in
our laboratory. The unit has three wheels: two big wheels controlled by means of stepper motors and a small
free wheel.
Each range sensor circuit has been fabricated on a PCB using ultrasonic transducers and other components
based on the schematic available at www.hobby-elec.org/e_srm1_1.htm. Our transducers give out a narrow
beam but have a range of approximately 5 m (obstacles/wall far away from the robot are detected). The min-
imum distance handled by the unit is 40 cm.
One I/O pin on the FPGA device is used to get data from each range sensor. A common ground connection
is also provided. The range sensor circuit is interfaced to the FPGA board through a buffer circuit constructed
using 74LS14 (Schmitt trigger).

7. Conclusions

The paper has presented an efficient hardware-directed solution for landmark determination and map con-
struction in planar environments. One of the features of the approach presented is that it does not make any
assumptions about the shape/geometry of the objects in the environment.
A new parallel algorithm with complexity O(MP log K) has been developed. Space-efficiency has also been
obtained by appropriate reuse of hardware. The architecture developed is based on Content Addressable
Memory (CAM) and Wallace-tree adders. These contribute to high performance. Further, the implementa-
tions of CAMs, global memory and special memory take advantage of block RAMs present on Xilinx Virtex
devices thereby making available the LUTs on the devices for other purposes. Effective device utilization has
therefore been achieved.
Results of implementation on Xilinx Virtex FPGAs indicate that the hardware-based solution is approxi-
mately 500 times faster than a PC-based solution. Further, the design for an environment with roughly 200
sample points fits in one XCV3200E device as well as in one XC2V6000 device.

References

[1] T. Dean, K. Basye, L. Kaelbling, Uncertainty in graph-based map learning, in: J. Connel, S. Mahadevan (Eds.), Robot Learning,
1994, pp. 171–192.
[2] R. Greiner, R. Isukapalli, Learning to select useful landmarks, IEEE Transactions on Systems, Man, and Cybernetics 26 (1996) 437–
449.

[3] B. Yamauchi, R. Beer, Spatial learning for navigation in dynamic environments, IEEE Transactions on Systems, Man, and
Cybernetics 26 (1996) 496–505.
[4] C. Owen, U. Nehmzow, Landmark-based navigation for a mobile robot, in: Proceedings of Simulation of Adaptive Behavior, MIT
Press, 1998.
[5] T. Kohonen, The Self Organizing Map, Proceedings of the IEEE 78 (1990) 1464–1479.
[6] S.K. Lam, T. Srikanthan, High speed environment representation scheme for dynamic path planning, Journal of Intelligent and
Robotic Systems 32 (2001) 307–319.
[7] S.K. Lam, T. Srikanthan, A linear approximation-based hybrid approach for binary logarithmic conversion, Microprocessors and
Microsystems 26 (2002) 353–361.
[8] N. Ranganathan, B. Parthasarathy, K. Hughes, A parallel algorithm and architecture for robot path planning, in: Proceedings of
IEEE Parallel Processing Symposium, 1994, pp. 275–279.
[9] A. DeHon, The density advantage of reconfigurable computing, IEEE Computer 33 (2000) 41–49.
[10] D.S. Katz, R.R. Some, NASA advances robotic space exploration, IEEE Computer 36 (2003) 52–61.
[11] G.J. Gent, S.R. Smith, R.L. Haviland, An FPGA-based custom coprocessor for automatic image segmentation applications, in:
Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, 1994, pp. 172–179.
[12] M. Alderighi, F. Casini, S. D'Angelo, D. Salvi, G.R. Sechi, A fault-tolerant FPGA-based multi-stage interconnection network for
space applications, in: Proceedings of IEEE International Workshop on Electronic Design, Test and Applications, 2002, pp. 302–306.
[13] S.J. Visser, A.S. Dawood, J.A. Williams, FPGA based real-time adaptive filtering for space applications, in: Proceedings of IEEE
International Conference on Field-Programmable Technology, 2002, pp. 322–326.
[14] Jan M. Rabaey, A. Chandrakasan, B. Nikolic, Digital Integrated Circuits, Prentice-Hall, 2003.
[15] T. Ogura, S. Yamada, T. Nikaido, A 4-kbit associative memory LSI, IEEE Journal of Solid State Circuits SC-20 (6) (1985) 1277–
1282.
[16] T. Ogura, S. Yamada, J. Yamada, A 20-kbit CMOS associative memory LSI for artificial intelligence machines, in: Proceedings of
IEEE International Conference on Computer Design: VLSI in Computers, 1986, pp. 574–577.
[17] J.P. Hayes, Computer Architecture and Organization, McGraw-Hill, 1998.
[18] L. Chisvin, R.J. Duckworth, Content-addressable and associative memory: alternatives to the ubiquitous RAM, IEEE Computer
(1989) 51–64.
