
Fully-Pipelined VLSI Architectures for the Kinematics of Robot Arm Manipulators

Jeong-A Lee, Department of Electrical Engineering, University of Houston, Houston, TX 77204-4793


ABSTRACT This paper presents a set of VLSI architectures for robot direct kinematic computation. The homogeneous link transformation matrix is decomposed into products of translation/rotation matrices, each of which is implemented via an augmented CORDIC as a processing element. We propose a specific scheme for a 6-link Robot Kinematics Processor, utilizing full pipelining at the macro-level, parallel redundant arithmetic, and full pipelining at the micro-level. Performance of the scheme is analyzed with respect to the time to compute one location of the end-effector of a 6-link manipulator and the number of transistors required. The scheme is assessed to be implementable as a single-chip VLSI using current state-of-the-art MOS technology. Also, the comparison table reveals the CORDIC-based robotics processor as a prospective VLSI solution for a wide range of kinematic calculation requirements.

Kiseon Kim, Superconducting Super Collider Lab., 2550 Beckleymeade Avenue, Dallas, TX 75237

Introduction

Robot manipulation is a research field based on the description of the relationship between objects and manipulators. One of the interesting problems in this area is direct kinematics computation, which calculates the location of the end-effector of a manipulator from the measurements of all joint angles. The well-known Denavit-Hartenberg matrix representation describes the direct kinematic problem of an n-link manipulator in terms of a matrix product, where each matrix is simply a homogeneous transformation describing the relative translation and orientation between two adjacent links [1, 2]. The typical serial 6-link manipulator is the robot of our interest. In general, the resulting kinematic computation involves many elementary operations, including transcendental functions, to represent the origin, approach, and normal vectors of the end-effector. Consequently, the CORDIC-based engine was proposed as an additional device to be attached to a conventional computing system [3]. Conceptually, CORDIC is a regular and simple computation algorithm that can be used to implement several elementary functions with the proper selection of input, output, and mode [4]; it is, therefore, a basic building block for kinematic computation. The idea has been exploited at the level of function-block substitution (based on the conventional CORDIC and a bit-serial implementation thereof) for the inverse [7, 6] and direct kinematic computations [5].

Two issues arise regarding structural optimization for direct kinematic computation: the first is the optimization of the CORDIC-based processing element (PE), and the second is the appropriate array design. Several works have investigated how to optimize the architecture of the CORDIC element and of CORDIC-based systems. Initially, there were several points of improvement in the conventional CORDIC element, as illustrated by the development of redundant [9] and constant-factor-redundant (CFR) CORDIC [14]. These try to improve the performance of a single CORDIC cell by reducing the computation time along the critical path. The architectural structure can be further optimized by an augmented CORDIC, as shown in this paper. With the recurrence structure unfolded, the regularity of CORDIC is fully utilized, and a simple local communication cost is achieved. The observation that CORDIC is expressed recursively makes the concept of systolic architecture applicable to each PE, producing micro-level pipelining. (Note that we denote these two levels with the prefixes macro- and micro-.) The cost-effectiveness of each variation should be carefully analyzed based on its implementation complexity and performance. In this paper, we adopt an augmented CORDIC as the basic algorithm along with the previous approach.

The essential distinctions between this paper and the others are (i) improvement of the basic PE based on CORDIC and (ii) exploitation of various architectures to suggest a cost-effective direct kinematic computation module through trade-offs in area and step-time. We will introduce the direct kinematic problem shortly, followed by a review of conventional CORDIC, redundant CORDIC, and the improved version, CFR-CORDIC, in Section 2. In Section 3, a fully-pipelined architectural scheme will be analyzed under a general design strategy. The summary and conclusion are presented in Section 4.

1.4.3.1
0080
CH3129-4/92/0000-0080 $3.00 © 1992 IEEE
IPCCC '92

1.1 Direct Kinematic Computation

Consider a manipulator that consists of a series of n links connected by joints. We denote the three unit vectors as a, o, and n in right-handed 3-dimensional Cartesian coordinates (x, y, and w). These vectors describe the hand orientation: the approach, orientation, and normal vectors, respectively. The transform specifying the end-effector is written as T.

Recalling the configuration-dependent A matrix defined in [1], we adopt the decomposition of the jth homogeneous transformation matrix into a four-term product [3]:

$$A_j = \mathrm{Trans}(w_{j-1}, d_j)\,\mathrm{Rot}(w_{j-1}, \theta_j)\,\mathrm{Trans}(x_j, a_j)\,\mathrm{Rot}(x_j, \phi_j) \qquad (2)$$

Each term describes the relationship between the successive (j-1)th and jth frames through the following rotations and translations, in sequential order:
(i) translate along $w_{j-1}$ by a distance $d_j$;
(ii) rotate about $w_{j-1}$ by an angle $\theta_j$;
(iii) translate along $x_{j-1} = x_j$ by a distance $a_j$;
(iv) rotate about $x_j$ by an angle $\phi_j$.
Note that $a_j$ is equal to zero for a prismatic joint. The position and orientation of the nth link, with respect to the base, are given by the matrix product of the $A_j$:

$$T = \prod_{j=1}^{n} A_j, \qquad (3)$$

where the initial $A_1$ describes the position and orientation of the 1st link. Each of the 12 nontrivial elements in T can be expressed as a function of sine/cosine values of the measured angles, multiplications, and additions. The transform can be implemented by either (i) a set of elementary functions generated with an appropriate lookup table or (ii) a unified algorithm such as CORDIC.

As VLSI technology evolves, the complexity of an algorithm is no longer determined solely by the number of multiplication elements. Regularity, modularity, and communication complexity are equally important factors to be considered when one seeks to develop a better algorithm [8]. In this sense, CORDIC has been reviewed as a prospective processing element for a wide range of applications. Here, we consider the rotation mode of CORDIC for direct kinematic computation; however, a minor modification also makes inverse kinematic computation possible [7, 6]. The apparent difference between the two problems is that the latter must exploit the various CORDIC modes to obtain a set of trigonometric values for a given vector, while the former requires the rotation of an input vector by a given angle. The basic ideas behind the direct computation can be extended to the inverse problem by adopting CORDIC as the computational PE.

CORDIC Techniques

The implementation of the matrix $A_j$, which describes the jth link, was proposed via 4 CORDICs: a parallel pair for the w-axis operation and another parallel pair for the x-axis. The 4 CORDICs can be arranged as a 2-stage cascade since the rotation and translation are disjoint from each other [3].

Let the jth joint orientation vector be denoted by $p_j$, where $p_j = A_j p_{j-1}$. Consider an intermediate vector $p'_j$ between $p_j$ and $p_{j-1}$:

$$p_j = \mathrm{Trans}(w_{j-1}, d_j)\,\mathrm{Rot}(w_{j-1}, \theta_j)\,p'_j \quad \text{(stage 1)} \qquad (4)$$

$$p'_j = \mathrm{Trans}(x_j, a_j)\,\mathrm{Rot}(x_j, \phi_j)\,p_{j-1} \quad \text{(stage 2)} \qquad (5)$$

One set of transformations along each axis, i.e. $\mathrm{Trans}(w, d)\,\mathrm{Rot}(w, \theta)$, is a block-diagonal matrix and can be orthogonally implemented by two 2x2 matrix transformations. Note that the matrix can be implemented through an augmented PE rather than through two different PEs, observing that $\mathrm{Trans}(w, d)$ is a trivial operation. Recall the vector transformations $\mathrm{Rot}(w_j, \theta_j)$ (6) and $\mathrm{Trans}(w_j, d_j)$ (7). Then $p_j$ (and, similarly, $p'_j$) is decomposed into two blocks, e.g., the first two elements of $p_j$ become one vector $X_j$:

$$p_j = [X_j, W_j, 1]^t = [\mathrm{Rot}(w_j, \theta_j);\ w_j + d_j,\ 1]^t, \qquad (8)$$

where $w_j$ corresponds to the w-axis component of the vector $p'_j$, and $X_j$ corresponds to the x- and y-axis components rotated by $\theta_j$. Similarly, for $p'_j$ we can choose a rotated vector of the y- and w-axis components disjointly through axis shuffling. Finally, n consecutive pairs of rotation and translation can be implemented via a 2n-stage cascade. We name each stage a macro-PE (or an augmented PE); 2n macro-PEs can be pipelined to compose an n-link computation processor. To avoid differentiating between the two sets of transformations, about the w-axis and the x-axis respectively, we employ the index i in a unified notation for a macro-PE.
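As a rough illustration of the two-stage decomposition in Eqs. (4) and (5), the following Python sketch models a macro-PE as a planar rotation plus a translation along the remaining axis and cascades two of them per link with axis shuffling. The function names (`macro_pe`, `link_transform`, `direct_kinematics`, `dh_reference`), the counterclockwise sign convention, and the floating-point arithmetic are assumptions made for this illustration; the paper itself targets fixed-point CORDIC hardware.

```python
import math

def macro_pe(u, v, t, angle, dist):
    """Hypothetical macro-PE model: rotate (u, v) by `angle`, shift t by `dist`."""
    c, s = math.cos(angle), math.sin(angle)
    return c * u - s * v, s * u + c * v, t + dist

def link_transform(p, d, theta, a, phi):
    """Apply A_j = Trans(w,d) Rot(w,theta) Trans(x,a) Rot(x,phi) to p = (x, y, w)."""
    x, y, w = p
    # Stage 2 (x-axis): axes shuffled so that (y, w) is rotated and x is translated.
    y, w, x = macro_pe(y, w, x, phi, a)
    # Stage 1 (w-axis): (x, y) is rotated and w is translated.
    x, y, w = macro_pe(x, y, w, theta, d)
    return x, y, w

def direct_kinematics(p, links):
    """Cascade 2n macro-PE stages for n links given as (d, theta, a, phi) tuples."""
    for d, theta, a, phi in reversed(links):
        p = link_transform(p, d, theta, a, phi)
    return p

def dh_reference(p, d, theta, a, phi):
    """Same link transform written out as the explicit homogeneous matrix product."""
    x, y, w = p
    ct, st = math.cos(theta), math.sin(theta)
    cp, sp = math.cos(phi), math.sin(phi)
    return (ct * x - st * cp * y + st * sp * w + a * ct,
            st * x + ct * cp * y - ct * sp * w + a * st,
            sp * y + cp * w + d)

# Example: base-frame position of the origin of the last frame of a 2-link chain.
example_links = [(0.3, 0.5, 0.2, -0.4), (0.0, 1.1, 0.4, 0.7)]
print(direct_kinematics((0.0, 0.0, 0.0), example_links))
```

Each call to `macro_pe` touches only a 2-vector and a scalar, which is the property that lets the rotation be handed to a CORDIC cell while the translation reduces to a single addition.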

Each macro-PE, comprising one $\mathrm{Trans}(w_i, d_i)$ and one $\mathrm{Rot}(w_i, \theta_i)$, can be implemented as in Figure 1.a. With the one-joint processor constructed by cascading two macro-PEs, Figure 1.b shows a fully pipelined structure for a 6-joint system. From this point on, we concentrate on the implementation of a macro-PE. Observing that the Rot and Trans functions are disjoint from each other, let us isolate the rotation function first. The vector rotation of $X_i = (x_i, y_i)$ by the angle $\theta_i$ can be realized by an iterative algorithm called CORDIC [4] instead of by computing trigonometric functions and applying a matrix multiplication. CORDIC realizes a vector rotation as a sequence of micro-angle rotations with a pre-fixed sequence of angles. The rotation macro-angle is represented as a sum of decomposed micro-angles, i.e.,

$$\theta_i = \sum_k \theta_{i,k},$$

and $\cos\theta_{i,k}$ is the micro-scale factor of the kth micro-rotation, which contributes to the final scale factor explained later. A pre-fixed micro-angle sequence of the specific form $\tan^{-1} 2^{-k}$ is attractive for VLSI implementation since the rotation is then composed only of additions, shifts, and an arctangent lookup table.

Figure 1: CORDIC-based Pipelined Architecture for Direct Kinematics Computation: (a) a macro-PE, one stage from an orientation to an intermediate; (b) a complete pipelined computation module for a 6-link system.

For simplicity of notation, the subscript i indexing a certain stage will be omitted, and X, Y, and Z will stand for the quantities having subscript i. The micro-iterations of the conventional (non-redundant) CORDIC are 3 linear recursive equations: the X-recurrence (X-rec.), Y-recurrence (Y-rec.), and Z-recurrence (Z-rec.) [4]. With an initial value of $Z[0] = \theta$, CORDIC rotates the initial values X[0] and Y[0] to the final values X[n] and Y[n]. CORDIC forces Z[n] to zero by keeping Z[i] close to zero during each iteration. With n iterations, n-bit accuracy of X and Y at the output can be achieved. The CORDIC rotation does not preserve the input norm. To get a rotated vector having the same length as the input vector (X[0], Y[0]), X[n] and Y[n] need to be compensated by a scaling factor K,

$$K = \prod_{i=0}^{n-1} \sqrt{1 + 2^{-2i}}.$$

Redundant: Non-redundant CORDIC is inherently slow, with a delay of $O(n^2)$. The delay is caused by its recursiveness and serial dependency, since a micro-rotation with delay $O(n)$ must finish before the next micro-rotation is processed. The delay of a macro-rotation (n micro-rotations) can be improved from $O(n^2)$ to $O(n)$ by using redundant arithmetic (carry-free addition such as carry-save or signed-digit addition) and determining the direction of the rotation, $\sigma_i$, from an estimate instead of an exact value [9]. The redundant arithmetic gives an addition delay of $O(1)$ instead of $O(n)$, and the estimation of the direction is necessary to avoid eroding the advantage of $O(1)$. This requires modifying the recurrences and the selection function. The redundant CORDIC scheme produces its output approximately 4 times faster than the non-redundant scheme. It introduces additional cost, however, since allowing $\sigma_i$ to be in {-1, 0, 1} makes the scale factor K variable, depending on the macro-angle.

Constant-Factor-Redundant: To reduce the implementation cost of redundant CORDIC, it is desirable to have a constant scale factor by forcing $\sigma_i$ to be in {-1, 1}. However, since $\sigma_i$ is determined from an estimate, the question of assuring convergence arises. A scheme was proposed that appends correcting iteration stages at proper positions [10]. Building on this idea, the number of extra correcting iterations is further reduced by dividing the micro-iterations (i = 0 to i = n - 1) into two groups: one in which the direction of the rotation is in {-1, 1}, for i = 0 to i = n/2, and the other in which it is in {-1, 0, 1}, for i = (n + 1)/2 to i = n - 1 [14]. This reduces the number of correcting iterations by 50%, since no correcting iteration is needed for the second half of the micro-iterations, and we still obtain a constant scale factor K since the value of K in n-bit precision does not depend on the $\sigma_i$ value for $(n+1)/2 \le i \le n-1$. The Z-recurrence can also be modified so that $\sigma_i$ is determined quickly by looking at a few most significant bits. This new scheme is called Constant-Factor-Redundant CORDIC (CFR-CORDIC). The modified recurrences and selection functions for the scheme are described below.

$$X[i+1] = X[i] + \sigma_i 2^{-i} Y[i]$$
$$Y[i+1] = Y[i] - \sigma_i 2^{-i} X[i]$$
$$U[i+1] = 2\,(U[i] - \sigma_i 2^{i} \tan^{-1} 2^{-i}) \qquad (10)$$

where U[i], equal to $2^i Z[i]$, is introduced for implementation simplicity, and the selection function is given as

[Equation (11), the selection function, is not legible in the source.]

When t fractional bits are used in the estimate, i.e., $\hat{U}[i]$ is computed using t fractional bits of the redundant representation of U[i], correcting iterations need to be included; the interval between the indexes of successive correcting iterations should be less than or equal to (t - 1), up to the last iteration index equal to n/2. The correction stage at the jth step of the micro-iterations is written as

$$U^c[j+1] = U[j+1] - 2\,\sigma^c_j\, 2^{j} \tan^{-1} 2^{-j} \qquad (12)$$

The direction of the rotation $\sigma^c_j$ is determined from the same selection function as in Eq. (11), except that it is decided based on $U[j+1]$ instead of $U[i]$.
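The non-redundant recurrences and the scale factor K described above can be sketched in a few lines of Python. This is only an illustrative floating-point model: the function name `cordic_rotate`, the default of 16 iterations, and the use of `math.atan` in place of an arctangent lookup table are assumptions, and the sign convention follows the X- and Y-recurrences of Eq. (10), which rotate the vector clockwise by the input angle.

```python
import math

def cordic_rotate(x, y, theta, n=16):
    """Rotate (x, y) clockwise by theta with n conventional CORDIC micro-iterations.

    Floating-point model for illustration; hardware would use shifts and adds.
    Converges for |theta| up to about 1.74 rad.
    """
    z = theta
    for i in range(n):
        sigma = 1 if z >= 0 else -1      # direction taken from the exact sign of Z[i]
        shift = 2.0 ** -i                # multiplying by 2^-i is a shift in hardware
        x, y = x + sigma * shift * y, y - sigma * shift * x
        z -= sigma * math.atan(shift)    # an arctangent lookup table in hardware
    # Constant scale factor K = prod sqrt(1 + 2^-2i) of the non-redundant scheme.
    k = math.prod(math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(n))
    return x / k, y / k, z

xr, yr, residual = cordic_rotate(1.0, 0.0, 0.6)
print(xr, yr, residual)
```

The final division by `k` is the explicit compensation step; the returned `z` is the residual angle that the iterations have driven toward zero.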

So far, we have discussed the recursive structures of several CORDIC schemes as methods of implementing the rotation part of the basic PE depicted in Figure 1. The PE, augmented by a translator, necessitates a scaling operation at each stage, because the shuffling of the output at each stage makes continuous accumulation of the scaling factor too complex to be handled at the final stage. The scaling operation has been solved by either an explicit or an implicit method. The explicit method divides the rotated vector by the scale factor, which is constant for the non-redundant scheme [4] and needs to be calculated for the redundant scheme [9]. The division can be performed by another CORDIC (in linear mode) or by a divider. The implicit approach reconfigures the sequence of micro-iterations of the CORDIC and includes additional iterations, called scaling micro-iterations, such that the resulting scale factor becomes either 2 or 1. In that case, the scaling operation can be implemented for free (by wiring when the scale factor is 2). Each micro-iteration can be composed of (i) reduction axis-scaling [11], (ii) repetition of vector-scaling [12], (iii) expansion axis-scaling, or combinations thereof [13]. For efficient implementation, the number of additional micro-iterations needs to be minimized. The search for the minimum number of micro-iterations has been studied, and a decomposed search has been developed [14] that reduces the search time from $T$ to $\sqrt{T}$. A different approach [15] uses modified CORDIC recurrence equations for the implicit scaling, so that the effect of micro-scaling is included in each micro-iteration. In summary, explicit scaling almost doubles the system complexity, while implicit scaling increases it by about 25% for the non-redundant scheme and about 30% for the constant-factor-redundant scheme.
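To make the scaling issue concrete, the sketch below evaluates the scale factor for different choices of the direction digits. It assumes, as in redundant CORDIC, that a zero digit skips the micro-rotation and therefore contributes a factor of 1; the helper name `scale_factor` and the choice n = 16 are illustrative assumptions, not values taken from the paper.

```python
import math

def scale_factor(sigmas):
    """Scale factor for a digit sequence; digit i = 0 means micro-rotation i is skipped."""
    k = 1.0
    for i, s in enumerate(sigmas):
        if s != 0:
            k *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return k

n = 16
half = (n + 1) // 2

# All digits in {-1, 1}: the constant factor of the non-redundant scheme.
k_const = scale_factor([1] * n)

# Redundant scheme: a zero digit in an early iteration changes K noticeably.
k_zero_early = scale_factor([0] + [1] * (n - 1))

# CFR scheme: zeros allowed only in the second half; worst case is all zeros there.
k_cfr_worst = scale_factor([1] * half + [0] * (n - half))
rel_dev = abs(k_const - k_cfr_worst) / k_const

print(k_const, k_zero_early, rel_dev, 2.0 ** -n)
```

With these assumptions the relative deviation in the worst CFR case stays below 2^-n, which matches the statement that K is constant to n-bit precision, whereas a single early zero digit changes K by a factor of sqrt(2).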

Application to Direct Kinematics

For the performance comparison, we define the following parameters:
$b_i$: the number of bits in each input x, y, and w
$b_f$: the number of bits in each output
$n_j$: the number of links (= 6)
$f_c$: the available data shift rate
$\Delta$: the step time per micro CORDIC iteration
$f_i$: the input bit rate
$T_\Delta$ = step time ($\Delta$) * number of steps.
For a discrete element implementation, $\Delta$ corresponds to one single external clock time, $1/f_c$. Note that $\Delta$ varies depending on the particular implementation of a macro-PE. We define the unit of $\Delta$ to be 1 for one-bit full addition time.

Redundant Parallel: The step time of the basic PE can be shortened by adopting CFR-CORDIC, where a carry-free adder (signed-digit adder) replaces the carry-propagate adder. Figure 2.a shows a macro-PE in components, and Figure 2.b shows the detail of each block (X-recurrence or Y-recurrence) employing parallel/redundant arithmetic. Figure 2.c shows the Z-recurrence. In this case, the sign of Z[i] at the ith micro-iteration cannot be detected by looking at the most significant bit, since Z[i] is in a redundant number representation. To determine the sign of Z[i] quickly by looking at a few significant bits, CFR-CORDIC uses an estimate of the shifted Z[i], $\hat{U}[i]$, computed using t fractional bits. As discussed earlier, the number of fractional bits used for the estimate also determines how often a correcting iteration is needed: the more fractional bits used, the fewer correcting iterations required. If we let the number of correcting iterations be denoted by $\eta$, the corresponding $T_{\Delta_2}$ becomes

[Equation not legible in the source.]

Figure 2: A parallel/redundant PE: (a) a macro-PE with X- and Y-recurrence; (b) detail of either block; (c) Z-recurrence.


where $\Delta_2$ equals the time for a carry-free addition plus the maximum of the times for the selection function and the variable shifter, approximately $(1 + \log_2 b_i)$. Note that the practical number of correcting iterations is much smaller than $b_i$ (e.g., 1 for 16-bit resolution); hence, we can approximate $T_{\Delta_2}$ by that of the redundant scheme without a correcting iteration.

Pipelined: We obtain a micro-pipelined CORDIC system when the internal recurrences of CORDIC are unfolded. We are interested here only in the redundant scheme. Figure 3.a shows a micro-PE in components with the Z-recurrence. The basic recurrence blocks of X, Y, and Z are the same as those in Figure 2. Figure 3.b shows the implementation of a macro-PE via micro-pipelining of micro-PEs. In this case, since the shifting in each iteration can be realized by wiring, the critical path for the stage becomes the path of Z. Again, the number of fractional bits used for the estimate, t, must be chosen so that the time for the selection function will not slow down the system. The corresponding $T_{\Delta_3}$ is

[Equation not legible in the source.]

To make $\Delta_3 = 1$, we choose t such that the time for the selection function becomes negligible compared with that of the carry-free adder. There are two factors associated with the speed-up obtained in the micro-pipelined system: one is shifting by wiring, and the other is unfolding the recurrences.
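The effect of unfolding the recurrences can be illustrated with a small software model of the micro-pipeline, in which stage i always applies the fixed shift 2^-i (a wired shift in hardware) and a new operand enters on every clock tick. For brevity the model takes the direction from the exact sign of Z, not from the redundant estimate used in the paper; the names `micro_pe` and `run_pipeline` are hypothetical, and the outputs are left unscaled.

```python
import math

def micro_pe(i, state):
    """Stage i of the unfolded CORDIC: one micro-rotation with the fixed shift 2^-i."""
    x, y, z = state
    sigma = 1 if z >= 0 else -1
    shift = 2.0 ** -i
    return x + sigma * shift * y, y - sigma * shift * x, z - sigma * math.atan(shift)

def run_pipeline(inputs, n=16):
    """Feed one (x, y, theta) operand per tick through n stages; return (outputs, ticks)."""
    regs = [None] * n                        # regs[i] holds the result of stage i
    outputs, ticks = [], 0
    for item in list(inputs) + [None] * n:   # trailing None bubbles flush the pipeline
        ticks += 1
        if regs[-1] is not None:
            outputs.append(regs[-1])         # the last stage result leaves the pipeline
        for i in range(n - 1, 0, -1):        # update back to front so stages read old data
            regs[i] = micro_pe(i, regs[i - 1]) if regs[i - 1] is not None else None
        regs[0] = micro_pe(0, item) if item is not None else None
    return outputs, ticks

angles = [0.1, 0.6, -0.9, 1.2]
results, ticks = run_pipeline([(1.0, 0.0, t) for t in angles])
print(len(results), ticks)  # 4 operands leave after 4 + 16 ticks
```

After the initial latency of n ticks, one result leaves the pipeline per tick, which is the throughput gain that the unfolded structure trades against the roughly n-fold increase in hardware.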

Examples: For $b_i = 12$ and $b_f = 16$, the estimated $T_\Delta$ is summarized in Table 1. To get first-order estimates of the available speed and area, we assume that one full adder (or one bit shifter) requires approximately 50 transistors (TRs). The 5th column of Table 1 shows the first-order estimate of the available input processing rate using the capability of current MOS technology (up to 500K TRs/chip on a 100 mm² area and a 20 nsec clock time). The total number of TRs required for each type of implementation is tabulated in the 6th column. The supportable input rate for direct kinematic computation ranges from a fraction of a MHz to tens of MHz. The most prospective architecture is type 4 with macro-pipelining for a 6-link kinematic computation. Based on the previous figures (i.e., 100 mm² for 500K TRs), the first-order estimate of the chip size becomes 80 mm² per 6 stages for direct computation. 1.2 μm CMOS implementations have been successfully demonstrated to verify the first-order estimate for a basic PE [16].
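The first-order figures quoted above can be reproduced with simple arithmetic. The sketch below assumes that the 80 mm² estimate is obtained by scaling the 400K-transistor figure listed for the pipelined type in Table 1 by the quoted density of 500K TRs per 100 mm²; the constant and function names are illustrative only.

```python
import math

TRS_PER_CHIP = 500_000     # quoted technology capability
CHIP_AREA_MM2 = 100.0      # area for that transistor count
CLOCK_NS = 20.0            # quoted clock time

def area_mm2(transistors):
    """First-order chip area at the quoted transistor density."""
    return transistors * CHIP_AREA_MM2 / TRS_PER_CHIP

def delta2(b_i):
    """Approximate step time of the parallel/redundant PE, in full-adder delays."""
    return 1 + math.log2(b_i)

print(area_mm2(400_000))       # 80.0 (mm^2)
print(round(delta2(12), 2))    # 4.58, rounded up to 5 in Table 1
print(1e3 / CLOCK_NS)          # 50.0 (MHz clock rate for a 20 ns clock)
```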

Figure 3: An n-pipelined redundant PE: (a) a micro-PE; (b) n-pipelining.


Type  Description          Δi/Δ   TΔ     Processing rate   TRs estimate
2     Parallel/Redundant   5      500Δ   10M               40K
3     Pipelined            1      20Δ    50M               400K

Table 1: Time and complexity comparison

Conclusion

We have examined alternative CORDIC schemes for the implementation of the macro-PE module of the direct kinematic processor and shown that the macro-PE module's micro-level regularity is suitable for VLSI implementation. We depicted specific schematics, which include the redundant and the Constant-Factor-Redundant schemes. The cost-effectiveness of selected architectures has been analyzed using parallel or pipelined structures. The analysis has been performed with respect to the time and the number of modules required to compute one location of the end-effector for a 6-link manipulator, given a set of angle measurements. A special scheme has been proposed to implement the 6-link robot kinematic computation. This scheme utilizes a fully parallel macro-PE and parallel redundant arithmetic and is fully pipelined at the micro-level. Also, the comparison table reveals the CORDIC-based robotics processor as a prospective VLSI solution for a wide range of kinematic calculations, with flexibility in both its size and speed.

References
[1] J. Denavit and R. Hartenberg, A Kinematic Notation for Lower-Pair Mechanisms Based on Matrices, Journal of Applied Mechanics, pp. 215-221, 1955.
[2] P. Nanua, K. Waldron and V. Murthy, Direct Kinematic Solution of a Stewart Platform, IEEE Trans. on Robotics and Automation, Vol. 6, No. 4, pp. 438-444, Aug. 1990.
[3] C. Chen and C. Lee, Computational Structures for Robot Kinematics and Dynamics Computations, Proc. ASME Int. Computers in Engineering, pp. 349-354, 1989.
[4] J. Walther, A Unified Algorithm for Elementary Functions, AFIPS Spring Joint Computer Conference, pp. 379-385, 1971.
[5] C. Lee, CORDIC-based Architectures for Robot Direct Kinematics and Jacobian Computation, 3rd Int. Symp. Intelligent Control, pp. 609-614, 1988.
[6] M. Kameyama, T. Matsumoto and H. Hideki, Implementation of a High Performance LSI for Inverse Kinematics Computation, IEEE Int. Conf. Robotics and Automation, pp. 757-762, 1989.
[7] R. Harber et al., Bit-serial CORDIC Circuits for Use in a VLSI Silicon Compiler, Int. Conf. Circuit and System, pp. 154-157, 1989.
[8] H. Kung, Let's Design Algorithms for VLSI Systems, Caltech Conf. VLSI, pp. 65-90, 1979.
[9] M. Ercegovac and T. Lang, Redundant and On-Line CORDIC: Application to Matrix Triangularization and SVD, IEEE Trans. on Computers, Vol. C-39, No. 6, pp. 725-740, June 1990.
[10] N. Takagi, T. Asada and S. Yajima, Redundant CORDIC methods with a constant scale factor for sine and cosine computation, submitted to IEEE Trans. on Computers, 1989.
[11] G. Haviland and A. Tuszynski, A CORDIC Arithmetic Processor Chip, IEEE Trans. on Computers, Vol. C-29, No. 2, pp. 68-79, Feb. 1980.
[12] H. M. Ahmed, Signal Processing Algorithms and Architectures, Ph.D. Dissertation, Department of Electrical Engineering, Stanford University, 1982.
[13] J. Delosme, VLSI Implementation of Rotations in Pseudo-Euclidean Spaces, Proc. of ICASSP, pp. 927-930, 1983.
[14] J. Lee and T. Lang, Matrix triangularization by fixed-point redundant CORDIC with a constant scale factor, Proc. SPIE Conference on Advanced Signal Processing Algorithms, Architectures, and Implementations, July 1990.
[15] A.A.J. de Lange, A.J. van der Hoeven, E.F. Deprettere and J. Bu, An optimal floating-point pipeline CMOS CORDIC processor, ISCAS, pp. 2043-2047, 1988.
[16] J. Harding, T. Lang and J. Lee, A Comparison of Redundant CORDIC Rotation Engines, Int. Conf. Computer Design 91, Oct. 1991.
