
2009 22nd International Conference on VLSI Design

Efficient Implementation of Floating-Point Reciprocator on FPGA


Manish Kumar Jaiswal
M.S. (by Research), Department of Electrical Engineering, IIT Madras, Chennai-36, India. e-mail: ee06s024@smail.iitm.ac.in

Nitin Chandrachoodan
Assistant Professor, Department of Electrical Engineering, IIT Madras, Chennai-36, India. e-mail: nitin@ee.iitm.ac.in

Abstract—In this paper we present an efficient FPGA implementation of a reciprocator for both IEEE single-precision and double-precision floating-point numbers. The method is based on the use of look-up tables and partial block multipliers. Compared with previously reported work, the modules occupy less area with higher performance and lower latency. The designs trade off either 1 unit in the last place (ulp) or 2 ulp of accuracy (for double or single precision respectively), without rounding, to obtain a better implementation. Rounding can also be added to the design to restore some accuracy at a slight cost in area.

Index Terms—Floating-point arithmetic, reciprocator, FPGA, double-precision, partial block-multipliers, binomial expansion

I. INTRODUCTION

Floating-point arithmetic is widely used in many scientific and signal processing applications. The greater dynamic range and the lack of any need to scale the numbers make the development of algorithms much easier. However, implementing arithmetic operations for floating-point numbers in hardware is very challenging. Among the basic operations (add, subtract, multiply, divide), division is generally the most difficult to implement in hardware. Division is a fairly common operation in many scientific and signal processing applications, so there is a need for efficient hardware implementations of division. The IEEE standard for floating point (IEEE-754) defines the format of the numbers, and also specifies various rounding modes that determine the accuracy of the result. For many signal processing and graphics applications, it is acceptable to trade off some accuracy [1] (in the least significant bit positions) for faster and smaller implementations.

A lot of work has been done on obtaining efficient implementations of this operation. Generally, the operation can be split into two parts: first take the inverse of the divisor, then multiply by the dividend. Because of this, many hardware dividers focus on efficiently obtaining the reciprocal of a floating-point number. The architectures proposed in the literature are based on the Newton-Raphson method [3], [5], [7], [10], [11], digit-recurrence methods [3], [8], [11], [15], seed architectures [12], etc. Previous works have used large look-up tables along with wide multipliers, which affects both area and performance.

Our approach also focuses on finding the reciprocal. It is based on the well-known binomial expansion, uses a small look-up table together with partial block multipliers, and results in less area, less delay, and results correct up to the required level (an accuracy trade-off). We have restricted ourselves to normalized numbers. All the exceptional cases are detected and indicated as invalid input/output. Comparisons of our implementation with previous work in the literature show that we obtain small look-up tables and overall very efficient hardware. We have used the Xilinx ISE 8.2 synthesis tool, the ModelSim SE 6.1b simulation tool, and the XC2VP30-7FF896 as our FPGA platform.

II. APPROACH

The format of a floating-point number is as follows:

For single precision:

  sign (1 bit) | exponent (8 bits) | mantissa (23 bits)

For double precision:

  sign (1 bit) | exponent (11 bits) | mantissa (52 bits)
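To make the layout concrete, the small C program below (our addition, not part of the paper) unpacks the three fields of a single-precision value; the double-precision layout is analogous, with 11 exponent bits and 52 mantissa bits. The example value is arbitrary.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float f = 0.5773503f;                    /* an arbitrary example value */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* reinterpret the 32 bits */
    uint32_t sign     = bits >> 31;          /*  1 bit                    */
    uint32_t exponent = (bits >> 23) & 0xFF; /*  8 bits, biased by 127    */
    uint32_t mantissa = bits & 0x7FFFFF;     /* 23 bits, hidden leading 1 */
    printf("sign=%u exponent=%u mantissa=0x%06x\n", sign, exponent, mantissa);
    return 0;
}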

In this paper, we do not discuss the exponent manipulation, as it is a standard process. The benefits of our implementation are in the computation of the inverse of the mantissa. Let y be the inverse of the mantissa a. Then y = 1/(1.a), where the leading 1 in 1.a is the hidden bit of the mantissa.

We have divided the mantissa into two parts, a1 and a2. a1 is used to fetch some pre-calculated data from a look-up table. Now, since y = 1/(a1 + a2),

  y = (a1 + a2)^-1 = a1^-1 - a1^-2·a2 + a1^-3·a2^2 - a1^-4·a2^3 + ...   (1)

The content of each term of equation (1) will be as follows:


  a1^-1      = 0.xxxxxxxx...                          (full significant bits)
  a1^-2·a2   = 0.00...00 xxxx...                      (m zero bits, then significant bits)
  a1^-3·a2^2 = 0.00...00 00...00 xxxx...              (2m zero bits, then significant bits)
  a1^-4·a2^3 = 0.00...00 00...00 00...00 xxxx...      (3m zero bits, then significant bits)

and so on, where m is the number of bits of a1.


We can see that as we move towards higher terms, their contribution to the result decreases. Thus, depending upon the precision required, we can take a suitable number of terms from equation (1) to calculate the inverse, based on the value of m. For our implementation, based on experiments over a large number of random test cases, we have chosen the number of terms as described below. In the case of single precision we have taken the first three terms, while for double precision seven terms have been taken. The value of m we have chosen is 8 for both cases. These values were selected based on the available FPGA resources, as will be shown shortly. We have simplified the desired terms in such a way that we can use less hardware with low latency and good accuracy. For single precision we have taken all three terms as they are:

  y = a1^-1 - a1^-2·a2 + a1^-3·a2^2   (2)
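As a quick plausibility check (our addition, not from the paper), the following C program evaluates equation (2) with m = 8 in double precision and compares it against the exact reciprocal; the residual error is of order a2^3 < 2^-24, consistent with the 2-ulp figure reported later.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double x  = 1.7236528396606445;        /* a mantissa value in [1, 2)       */
    double a1 = floor(x * 256.0) / 256.0;  /* hidden bit + top 8 fraction bits */
    double a2 = x - a1;                    /* remaining mantissa bits, < 2^-8  */

    double r = 1.0 / a1;                   /* a1^-1: would come from the BRAM  */
    double y = r - r * r * a2 + r * r * r * a2 * a2;   /* equation (2)         */

    printf("approx = %.12f\n", y);
    printf("exact  = %.12f\n", 1.0 / x);
    printf("error  = %.3e\n", fabs(y - 1.0 / x));      /* well below 2^-24     */
    return 0;
}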


Fig. 1. Architecture for the single-precision floating-point reciprocator. [Figure omitted: the 23-bit mantissa is split into a1 (8 bits) and a2 (15 bits); stage 1 fetches a1^-1, a1^-2, a1^-3 from the BRAM; stages 2-3 use 17-bit multipliers to form a1^-2·a2 and a1^-3·a2^2; stage 4 combines them with a 30-bit subtractor (a1^-1 - a1^-2·a2) and a 30-bit adder to produce the 23-bit mantissa.]

For double precision, the simplified form is

  y = a1^-1 - a1^-1·[(a1^-1·a2 - a1^-2·a2^2)·(1 + a1^-2·a2^2 + a1^-4·a2^4)]   (3)

Though we could simplify the above equations a little more, doing so would affect the area, latency and accuracy. The accuracy is affected by the fact that floating-point operations do not distribute exactly, i.e., u(v + w) may not be exactly equal to uv + uw. This is due to the finite number of bits used to represent the numbers.
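Expanding the factored form shows that it reproduces exactly the first seven terms of equation (1); the short C check below (our addition) confirms this numerically for an arbitrary split with a1 on an 8-bit boundary and a2 < 2^-8.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double a1 = 1.37109375, a2 = 0.00091552734375;  /* both exact in binary */
    double r = 1.0 / a1;                            /* a1^-1 */

    /* seven-term truncation of equation (1) */
    double series = 0.0;
    for (int k = 0; k < 7; k++)
        series += (k % 2 ? -1.0 : 1.0) * pow(r, k + 1) * pow(a2, k);

    /* factored form of equation (3) */
    double t = r * a2 - r * r * a2 * a2;
    double u = 1.0 + r * r * a2 * a2 + pow(r, 4) * pow(a2, 4);
    double factored = r - r * (t * u);

    printf("series   = %.17g\n", series);   /* the two agree to rounding error */
    printf("factored = %.17g\n", factored);
    return 0;
}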

III. IMPLEMENTATION

We present the implementations for single precision and double precision separately, as different issues arise in each case. Some of the design decisions are based on the fact that multipliers of size 18x18 are readily available as hard IP cores in many common FPGA families. We have based our computations on the Xilinx Virtex-II platform. However, the basic idea of saving some of the block multiplications holds even if a different-sized multiplier core is used, although the exact numbers would change.

A. Single-precision Floating-point

The architecture of the single-precision floating-point reciprocator is shown in Fig. 1. It includes a block memory (BRAM) which contains pre-calculated values of a1^-1 (24 bits), a1^-2 (17 bits), and a1^-3 (17 bits) in a single 58-bit data word, with the 8-bit content of a1 as the address bits. The contents of the BRAM have been calculated using a separate program written in C, with the float data type for the numbers. The content of a1^-1 has been extended to 30 bits (by appending the 6 bits 111111 at the least significant bit positions (LSBs)) for addition/subtraction purposes. We could also perform the above operation with only the value of a1^-1, but that would increase the total operation latency and the size of the multipliers. In both cases only a single BRAM is used on the FPGA, so we prefer the first approach. The architecture has a latency of four, though we could include the BRAM access in the first stage with a slight loss in maximum operating frequency. By using pipelined multipliers we could approximately double the overall frequency. We have shown the results with a latency of four; our aim here is only to show the use of the minimum necessary hardware. The given architecture can be pipelined very easily.

B. Double-precision Floating-point

The architecture of the double-precision floating-point reciprocator is shown in Fig. 2. It also includes a single BRAM, which contains pre-calculated values of only a1^-1 (54 bits), with the 8-bit content of a1 as the address bits. The contents of the BRAM have been calculated using a C program, with double as the data type of the floating-point numbers. The content of a1^-1 has been extended to 60 bits (by appending the 6 bits 111111 at the LSBs) for addition/subtraction purposes. Here we have a large saving in block memory compared to the other methods discussed later. Three types of multipliers (based on the Xilinx MULT18x18 block) have been used, described below after the figures.
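The paper does not list its C table generator; the sketch below is our reconstruction for the single-precision BRAM contents, with the quantization details (truncation, saturation at a1 = 1.0, packing order) assumed rather than taken from the paper. The double-precision generator would be analogous, emitting a single 54-bit a1^-1 field.

#include <stdio.h>
#include <stdint.h>

/* Quantize v (in (0, 1]) to an unsigned fixed-point field of the given
 * width, saturating at the top code; the paper's exact rounding and its
 * handling of a1 = 1.0 are not specified, so this is an assumption. */
static uint64_t fixpt(double v, int bits)
{
    uint64_t q = (uint64_t)(v * (double)((uint64_t)1 << bits));
    uint64_t max = ((uint64_t)1 << bits) - 1;
    return q > max ? max : q;
}

int main(void)
{
    for (int i = 0; i < 256; i++) {
        double a1 = 1.0 + i / 256.0;                     /* hidden bit + 8 MSBs */
        uint64_t inv1 = fixpt(1.0 / a1, 24);             /* a1^-1, 24 bits */
        uint64_t inv2 = fixpt(1.0 / (a1 * a1), 17);      /* a1^-2, 17 bits */
        uint64_t inv3 = fixpt(1.0 / (a1 * a1 * a1), 17); /* a1^-3, 17 bits */
        uint64_t word = (inv1 << 34) | (inv2 << 17) | inv3;  /* 58-bit word */
        printf("%02x: %015llx\n", i, (unsigned long long)word);
    }
    return 0;
}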


Fig. 2. Architecture for the double-precision floating-point reciprocator. [Figure omitted: the 52-bit mantissa is split into a1 (8 bits) and a2 (44 bits); stage 1 fetches a1^-1 from the BRAM; stages 2-3 use 51-bit partial block multipliers to form a1^-1·a2 and a1^-2·a2^2; stage 4 uses a 34-bit full block multiplier for a1^-4·a2^4; stage 5 uses a 60-bit adder and subtractor to form (1 + a1^-2·a2^2 + a1^-4·a2^4) and (a1^-1·a2 - a1^-2·a2^2); stages 6-7 multiply these with the reduced and ordinary 51-bit partial block multipliers; stage 8 performs a final 60-bit subtraction to produce the 52-bit mantissa.]

Fig. 3. Partial 51-bit multiplier for stages 2, 3 and 7. [Figure omitted: each 51-bit operand is split into three 17-bit words A1-A3 and B1-B3; the three lowest-order partial products A1·B1, A1·B2 and A2·B1 are ignored, so the sum of the remaining products needs only 6 MULT18x18 blocks.]

Fig. 4. Reduced partial 51-bit multiplier for stage 6. [Figure omitted: the most significant 17-bit word of the first operand is the constant 0x10000, so its partial products reduce to shifts ({B1,0x0000}, {B2,0x0000}, {B3,0x0000}) and only A1·B3, A2·B2 and A2·B3 need multipliers, i.e., only 3 MULT18x18 blocks.]

The second, third and seventh stages use a 51-bit partial multiplier, shown in Fig. 3. It uses only six MULT18x18 blocks instead of nine to produce more than 52 correct MSBs of the result, which is all that we need. Stage six is also a 51-bit partial multiplier, but due to the specific nature of its input (the top 17 bits of the first operand are the constant 0x10000 in hex), it needs only three MULT18x18 blocks (Fig. 4). The fourth-stage multiplier is a 34-bit full multiplier, but instead of using an IP core for it we have designed it using four MULT18x18 blocks (shown in Fig. 5); it takes less (about 2/3) glue logic and is faster than the IP core available from Xilinx. The overall latency of the module is eight, which we could increase further through pipelining, as discussed for the single-precision case, for better performance.
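To illustrate the idea behind Fig. 3, here is a sketch in C (our addition; the actual design is an HDL pipeline): split each 51-bit operand into three 17-bit words and sum only the six high-order partial products, dropping the three lowest that a full nine-multiplier product would also compute. The dropped terms sum to less than about 2^52, so only the low bits of the 102-bit product are perturbed.

#include <stdio.h>
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension */

/* Sum of the six high-order 17x17 partial products of two 51-bit values. */
static u128 partial_mul_51(uint64_t a, uint64_t b)
{
    uint64_t A[3], B[3];
    for (int i = 0; i < 3; i++) {
        A[i] = (a >> (17 * i)) & 0x1FFFF;   /* A[0] = LSW, A[2] = MSW */
        B[i] = (b >> (17 * i)) & 0x1FFFF;
    }
    u128 sum = 0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            if (i + j >= 2)                 /* keep only the 6 high products */
                sum += (u128)(A[i] * B[j]) << (17 * (i + j));
    return sum;
}

int main(void)
{
    uint64_t a = 0x5ABCDEF012345ULL;        /* two arbitrary 51-bit operands */
    uint64_t b = 0x6FEDCBA987654ULL;
    u128 full = (u128)a * (u128)b;          /* exact 102-bit product */
    u128 part = partial_mul_51(a, b);
    /* the high halves agree up to a unit or two in the last place shown */
    printf("full >> 52 = %llx\n", (unsigned long long)(full >> 52));
    printf("part >> 52 = %llx\n", (unsigned long long)(part >> 52));
    return 0;
}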

IV. RESULTS

Hardware utilization and performance of both the single-precision and double-precision modules are shown in Table I. Since our implementation neglects some of the lower-order bits of the computation, it is important to estimate the impact of this on the overall accuracy of the results. To assess the error performance, 5 million randomly generated test cases were used. The error performance is shown in Table II for both precisions. The error was obtained by comparing results from the proposed module with the results produced by a C compiler on a workstation. In all cases, the maximum error for single precision was 2 ulp (units in the last place), while for double precision it was 1 ulp. These errors are without rounding.
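The ulp distance used above can be computed by bit-level comparison; the following is a sketch of that measurement (our assumption of the method, not the authors' test harness), shown for single precision.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* ulp distance between two same-sign normalized floats: the difference
 * of their bit patterns read as integers. */
static int32_t ulp_diff(float a, float b)
{
    int32_t ia, ib;
    memcpy(&ia, &a, sizeof ia);
    memcpy(&ib, &b, sizeof ib);
    return ia - ib;
}

int main(void)
{
    float exact  = 1.0f / 1.724f;       /* reference from the compiler     */
    float approx = exact + 2e-7f;       /* stand-in for a module output    */
    printf("error = %d ulp\n", (int)ulp_diff(approx, exact));
    return 0;
}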
TABLE I
HARDWARE UTILIZATION AND PERFORMANCE

Parameters     Single-precision   Double-precision
MULT18x18      3                  25
BRAM           1                  1
Slices         108                672
Freq (MHz)     192.365            88.594
Latency        4                  8


Fig. 5. 34-bit block multiplier for stage 4. [Figure omitted: each 34-bit operand is split into 17-bit words A1, A2 and B1, B2; the sum of all four partial products A1·B1, A2·B1, A1·B2 and A2·B2 gives the complete result.]

TABLE II
ERROR PERFORMANCE

Error                Single-precision           Double-precision
Max. ULP             2                          1
Mean                 4.7867e-08                 7.7636e-17
Mean (absolute)      5.0060e-08 (2^-24.2518)    7.8125e-17 (2^-53.5070)
Variance             2.5959e-15                 7.5177e-33
Variance (absolute)  2.3812e-15                 7.4415e-33

V. COMPARISON

The basis of our implementation is the well-known technique of using look-up tables and multipliers. The main benefits come from optimizing the resource usage. In this section, we compare our implementation against previous approaches from the literature. Our comparisons are based on Xilinx hardware resources. Even on this platform, many different multiplier implementations are available with differing speed-area-latency tradeoffs; by using different instances, we can obtain suitable tradeoffs. Similarly, on a different platform with different basic resources, the main ideas developed in this paper still hold; only the details of the hardware usage will differ.

For many of the comparisons, direct FPGA implementations of the methods are not available. In such cases, we have estimated the resource usage based on the components in the design. The number of block RAMs required to implement a given look-up table, and the number of MULT18x18 blocks required to implement a given multiplication, have been estimated using the Xilinx CORE Generator software.

One of the most popular methods for computing reciprocals is the Newton-Raphson (NR) iterative procedure [3], [5], [7], [10], [11]. The Newton-Raphson iteration for the reciprocal of A is given by x_{i+1} = x_i(2 - x_i·A). Each iteration requires two multiplications and one subtraction. The value of x_0 is usually taken from a look-up table. Thus, for two iterations (results based on [7]), single precision requires one look-up table with an 8-bit address space, two 8x16 multiplications and two 16x32 multiplications (equivalently 1 BRAM and 6 MULT18x18). Double precision requires one look-up table with a 15-bit address space, two 15x30 multiplications and two 30x60 multiplications (equivalently 28 BRAM and 20 MULT18x18). The error performance of the 2-NR method is discussed in [3], which presents the division of double-precision floating-point numbers by combining the Newton-Raphson method and a digit-by-digit recurrence method with a link module; as a whole it takes more area than the 2-NR method. They show that the error produced by 2-NR iterations for computing the reciprocal is at minimum 1.999999993e-55 and at maximum 1.284729483e-49, which is more than in our method.
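For reference, a minimal C sketch (our addition, not from any of the cited papers) of the NR recurrence just quoted; a hardware design seeds x_0 from a look-up table accurate to roughly 2^-9, whereas this sketch uses a crude linear seed and therefore needs an extra iteration.

#include <stdio.h>

int main(void)
{
    double A = 1.732;                /* mantissa in [1, 2)                   */
    double x = 1.5 - 0.5 * A;        /* crude linear seed standing in for    */
                                     /* the table-based x0                   */
    for (int i = 0; i < 3; i++) {    /* each iteration doubles the good bits */
        x = x * (2.0 - x * A);
        printf("iter %d: x = %.16f\n", i + 1, x);
    }
    printf("exact  : x = %.16f\n", 1.0 / A);
    return 0;
}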

Ito et al. [5] implement the reciprocal computation using a multiply-accumulate unit (similar to the NR method). They use a linear initial approximation and propose an accelerated convergence method. It results in a speedup with respect to conventional NR, but requires an additional look-up table.

Hung et al. [6] propose computing the division as X/Y = X(Y_h - Y_l)/(Y_h^2 - Y_l^2) ≈ X(Y_h - Y_l)·Y_h^-2, where X and Y are 2m-bit mantissas and Y_h is the (m+1)-bit MSB part of Y, used as the address of a look-up table holding Y_h^-2 to (2m+2) bits. Division is thus computed by first a 2m-bit multiplication Z = X(Y_h - Y_l) and then a (2m+2)-bit multiplication Z·Y_h^-2. For our comparison we do not include the final multiplication. Even then, for single precision it needs a 2^13 x 26-bit look-up table (12 BRAM) and 4 MULT18x18; for double precision, a 2^27 x 56-bit look-up table (impractical on available FPGA platforms) and 16 MULT18x18.

Ercegovac et al. [7] propose a method in which the reciprocal of Y (m bits) is computed in three steps: reduction, evaluation, and post-processing. They take 7 MULT18x18 and 1 BRAM for single precision, and 10 MULT18x18 and 30 BRAM for double precision.

In [12], the authors propose a method for computing the initial seed approximation. However, they require look-up tables addressed by the complete word, making the method difficult to use for the 23-bit and 52-bit mantissas of floating point. The methods described in [4] and [2] both require relatively large look-up tables and overall more resources than our implementation. In [9], Jeong et al. propose an idea based on [6]; though its area is less than that of [6], it is still larger than our proposed method. In [10] the division operation is based on the method of [5]: it first takes an initial approximation and then computes the reciprocal and division using NR iterations, taking approximately the same hardware resources as [5].

In [8], the authors report floating-point division and square root using the SRT^1 division method on FPGA. In terms of performance, their pipelined approach is closest to our proposed implementation, but requires significantly more area (3245 slices and 14 BRAM for a clock period of 6 ns and a latency of 47 cycles). [15] presents another SRT-based implementation that has similar area to ours but considerably less throughput and speed.

Wang et al. [13] present a library for floating-point operations. For division, this library uses the method of [6].
^1 SRT: named for Sweeney, Robertson and Tocher, the inventors of the algorithm.


Thus, since it uses [6], for single precision it needs a 2^13 x 26-bit look-up table (12 BRAM) and 4 MULT18x18; for double precision, a 2^27 x 56-bit look-up table (impractical on available FPGA platforms) and 16 MULT18x18. Also, for single precision, in spite of a relatively large latency (14 cycles, while our method has only 4), its maximum frequency is 129 MHz (ours is 192 MHz).

In [16], division of double-precision floating-point numbers is performed using Goldschmidt's algorithm, implemented on an Altera Stratix-II FPGA platform. The area reported is large relative to our design (about 3500 ALMs, equivalent to about 4600 slices on a Virtex-II [17]), with less performance and throughput.

In terms of performance, the floating-point library from Sandia Labs [14] and the cores from Xilinx [18] are among the best. The Sandia implementations obtain a high frequency of operation at the cost of increased latency (33 cycles for single precision and 62 cycles for double precision), while the reported areas (BRAM counts are not mentioned in the paper) are similar to those of our implementation. These designs are hand-optimized and specific to the Xilinx platform, whereas we have used an HDL implementation that is easy to re-target.

Table III presents a direct resource comparison across some of the reported implementations. Since many of the implementations do not give accurate numbers for RAM usage, it is difficult to form a complete comparison. However, from this table and the above discussion it is clear that our implementation is very efficient in terms of resources. As can be noted from the operating frequencies mentioned earlier, the proposed implementation also maintains high performance with low latency.
TABLE III
RESOURCE COMPARISON

            Single-precision        Double-precision
Method      MULT18x18   BRAM        MULT18x18   BRAM
2-NR        6           1           20          28
[2]         14          2           36          2
[5][10]     12          1           48          29
[6][13]     4           12          16          impractical
[7]         7           1           10          30
[9]         8           1           32          50
[12]        impractical             impractical
Proposed    3           1           25          1

VI. CONCLUSION

We have implemented an efficient reciprocal unit on FPGA for both single- and double-precision floating-point numbers. The method uses the idea of neglecting higher-order terms in the partial block multiplications to reduce the number of multipliers. At the same time, the look-up table requirements are kept to a minimum, and are the smallest reported in the literature for a double-precision implementation. The initial latency of our modules is also low (4 for single and 8 for double precision), with promising operating frequency, and can be improved very easily by pipelining. The error performance is also within an acceptable range (1 ulp for double precision). The implementation can thus form a useful core for use in hardware dividers, especially for applications like signal processing that can be more tolerant of inaccuracies in the least significant bits.

REFERENCES
[1] J. Hopf, "A parameterizable HandelC divider generator for FPGAs with embedded hardware multipliers," IEEE International Conference on Field-Programmable Technology, pages 355-358, Dec. 2004.
[2] W. F. Wong and E. Goto, "Fast Hardware-Based Algorithms for Elementary Function Computations Using Rectangular Multipliers," IEEE Transactions on Computers, vol. 43, issue 3, March 1994.
[3] P. Montuschi, L. Ciminiera, A. Giustina, "Division unit with Newton-Raphson approximation and digit-by-digit refinement of the quotient," IEE Proceedings - Computers and Digital Techniques, vol. 141, issue 6, pages 317-324, Nov. 1994.
[4] W. F. Wong and E. Goto, "Fast Evaluation of the Elementary Functions in Single Precision," IEEE Transactions on Computers, vol. 44, issue 3, pages 453-457, March 1995.
[5] M. Ito, N. Takagi, and S. Yajima, "Efficient Initial Approximation and Fast Converging Methods for Division and Square Root," Proceedings of the 12th Symposium on Computer Arithmetic, pages 2-9, July 1995.
[6] P. Hung, H. Fahmy, O. Mencer, M. J. Flynn, "Fast division algorithm with a small look-up table," 33rd Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, vol. 2, pages 1465-1468, Oct. 1999.
[7] M. D. Ercegovac, T. Lang, J.-M. Muller, A. Tisserand, "Reciprocation, Square Root, Inverse Square Root, and Some Elementary Functions Using Small Multipliers," IEEE Transactions on Computers, vol. 49, issue 7, July 2000.
[8] X. Wang, B. E. Nelson, "Tradeoffs of designing floating-point division and square root on Virtex FPGAs," 11th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2003), pages 195-203, Apr. 2003.
[9] J.-C. Jeong, W.-C. Park, W. Jeong, T.-D. Han, M.-K. Lee, "A cost-effective pipelined divider with a small look-up table," IEEE Transactions on Computers, vol. 53, issue 4, pages 489-495, April 2004.
[10] U. Kucukkabak, A. Akkas, "A Combined Interval and Floating-Point Reciprocal Unit," 39th Asilomar Conference on Signals, Systems and Computers, pages 1366-1371, Nov. 2005.
[11] E. Antelo, T. Lang, P. Montuschi, A. Nannarelli, "Low latency digit-recurrence reciprocal and square-root reciprocal algorithm and architecture," 17th IEEE Symposium on Computer Arithmetic, pages 147-154, June 2005.
[12] M. Ercegovac, J.-M. Muller, A. Tisserand, "Simple seed architectures for reciprocal and square root reciprocal," 39th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, California, USA, pages 1167-1171, Oct. 2005.
[13] X. Wang, S. Braganza, M. Leeser, "Advanced Components in the Variable Precision Floating-Point Library," 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2006), pages 249-258, April 2006.
[14] K. S. Hemmert, K. D. Underwood, "Open Source High Performance Floating-Point Modules," 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2006), pages 349-350, April 2006.
[15] H. Bessalah, M. Anane, M. Issad, N. Anane, K. Messaoudi, "Digit recurrence divider: Optimization and verification," International Conference on Design & Technology of Integrated Systems in Nanoscale Era, pages 70-75, Sept. 2007.
[16] R. Goldberg, G. Even, P. M. Seidel, "An FPGA implementation of pipelined multiplicative division with IEEE rounding," 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2007), pages 185-196, Apr. 2007.
[17] Stratix II vs. Virtex-4 Density Comparison. [Online]. Available: http://www.altera.com/literature/wp/wpstxiixlnx.pdf
[18] Xilinx Floating-Point Unit v2.0. [Online]. Available: www.xilinx.com


