Performance Evaluation of Fixed-Point Array Multipliers on Xilinx FPGAs

2019 6th International Conference on Signal Processing and Integrated Networks (SPIN)

Burhan Khurshid, Saniya Syed, Syed Mehtaab, Junaid Peerzada, Adnan Yaqoob
Department of ECE, IUST, Awantipora (J&K), India
burhan32.iust@gmail.com, saniyawolf95@gmail.com, syedmehtaab123@gmail.com, peerzada954@gmail.com, adnanyaqoob@gmail.com
Abstract – Multiplication is one of the most frequently used arithmetic operations in digital signal processing. It forms the core element in many complex operations such as filtering, Fast Fourier Transforms and convolution. Since these circuits perform key operations in signal processing, their speed and power optimization are crucial quality factors. Evidently, fixed-point representation has gained a lot of importance when hardware realization of multipliers is considered. Traditionally, three implementation styles have been used, viz. bit-parallel, bit-serial and digit-serial. However, bit-parallel systems are preferred in performance-critical applications because of their ability to process multiple bits in parallel. This work considers the implementation of two widely used fixed-point bit-parallel multipliers on Xilinx Spartan-6 FPGAs. These include parallel Ripple Carry Array multipliers and parallel Carry Save Array multipliers. A distinguishing feature of this work is that the realizations are carried out using LUT instantiations. This is in contrast to previous realizations, where inferential codes are used to represent the functionality of the design and the synthesizer decides the mapping of the coded functionality, thereby resulting in sub-optimal realizations. We have compared our realizations against some conventional multiplier designs. Experimental analysis using Xilinx Spartan-6 FPGA reveals that our approach results in a substantial improvement in performance with a little runtime overhead.

Keywords – DSP, FPGA, Array Multipliers, Fixed-point Arithmetic, LUT

978-1-7281-1380-7/19/$31.00 ©2019 IEEE

I. INTRODUCTION
Multiplication is one of the frequent and critical arithmetic operations encountered in signal processing applications. Fixed-point multipliers are the elements of choice when the design and implementation of high performance digital signal processing (DSP) hardware is concerned [1-3]. Performance versus accuracy trade-offs often compel designers to use these multiplier circuits in their realizations. Since these circuits form key elements in DSP, their performance is crucial and will affect the overall performance of the DSP system they are part of [4-7]. Hence, there is an immense need for low power and high speed realization of these multipliers on different hardware platforms.

The fixed-point representation is a degenerate case of the floating-point representation. Such a representation has a fixed exponent that cannot vary with time. Further, the use of a barrel shifter unit in the circuit realization is eliminated, as no variable alignment is required in addition and subtraction. Such a representation is suitable for FPGA applications as it results in the conservation of resources and thereby limits the critical path delays and power consumption. In floating-point notation, the mantissa needs to be properly aligned before a particular operation can be performed, and this consumes a lot of hardware, leading to longer combinational paths and more power dissipation.

Field programmable gate arrays (FPGAs) are often used as co-processors to perform high speed tasks that cannot be achieved using conventional processors. Advantages of these platforms include the ability to be re-programmed in the field, post-production design verification, relatively lower non-recurring engineering (NRE) costs, a reconfigurable design approach, high integration levels etc. [8-9]. FPGAs are truly revolutionary in the sense that they can combine the benefits of both hardware and software. The hardware aspects of FPGAs provide designers with huge power, resource, and speed benefits, while the software aspects enable an FPGA to be re-programmed in the field to realize a wide range of logical functions. The hardware aspects of FPGAs enable implementations that are distributed spatially. Therefore, the resultant designs have the potential to be hundreds of times faster than microprocessor-based designs. However, unlike in Application Specific Integrated Circuits (ASICs), these computations are programmed into the chip rather than fixed permanently by the manufacturing process. This means that an FPGA-based system can be programmed and re-programmed many times. A single FPGA can replace thousands of discrete components by incorporating millions of logic gates in a single integrated circuit (IC) chip. Owing to these advantages, FPGAs are fast moving from prototype applications to low and medium volume productions [10-12].

The rest of the paper is organized as follows. Section II briefly discusses bit-parallel multipliers based on fixed-point arithmetic. Section III and section IV discuss the architecture of Ripple Carry and Carry Save based array multipliers, respectively. Synthesis, implementation and analysis are carried out in section V. Conclusions are drawn in section VI and references are listed at the end.

II. PARALLEL MULTIPLIERS
Two widely used bit-parallel multipliers have been considered in this work. These include multipliers based on ripple carry logic (RCA multipliers) and carry save logic (CSA multipliers). In each case, the multiplicand (P) and multiplier (Q) vectors are represented in fixed-point 2's complement form. A typical representation of N-bit operands would, therefore, be [7]:

$P = p_{N-1} \,.\, p_{N-2} \,.\, p_{N-3} \ldots p_2 \,.\, p_1 \,.\, p_0$ (1)

$Q = q_{N-1} \,.\, q_{N-2} \,.\, q_{N-3} \ldots q_2 \,.\, q_1 \,.\, q_0$ (2)

In magnitude, these numbers occupy the range [-1, 1) and are given by:

$P = -p_{N-1} + \sum_{i=1}^{N-1} p_{N-1-i} \, 2^{-i}$ (3)

$Q = -q_{N-1} + \sum_{i=1}^{N-1} q_{N-1-i} \, 2^{-i}$ (4)

The product Y = P×Q is given by:

$Y = -y_{2N-2} + \sum_{i=1}^{2N-2} y_{2N-2-i} \, 2^{-i}$ (5)

and may be represented as:

$Y = y_{2N-2} \,.\, y_{2N-3} \,.\, y_{2N-4} \ldots y_1 \,.\, y_0$ (6)

Fig. 1 Block schematic of multiplier based on ripple carry logic
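The fixed-point definitions in equations (1) to (6) can be checked numerically. The following is a minimal sketch in Python, used here purely as an illustration language (the designs in this paper are written in VHDL). The function names `to_bits`, `decode` and `multiply` are hypothetical helpers, and operands are assumed to be supplied as signed integers scaled by 2^(N-1). The sketch also forms the N-bit truncated product by dropping the N-1 low-order bits of the full-precision result.

```python
from fractions import Fraction


def to_bits(raw, n):
    """Return the n-bit two's complement pattern of the integer raw, MSB first."""
    return [(raw >> k) & 1 for k in range(n - 1, -1, -1)]


def decode(bits):
    """Evaluate Eq. (3): -b[0] + sum of b[i] * 2**-i, where b[0] is the sign bit."""
    value = Fraction(-bits[0])
    for i in range(1, len(bits)):
        value += Fraction(bits[i], 2 ** i)
    return value


def multiply(p_raw, q_raw, n):
    """Multiply two n-bit operands given as signed integers scaled by 2**(n-1).

    Returns (y_bits, w_bits): the (2n-1)-bit full-precision product and the
    n-bit truncated product. The single overflow case (-1) * (-1) is rejected.
    """
    if p_raw == q_raw == -(2 ** (n - 1)):
        raise ValueError("(-1) * (-1) = +1 is not representable")
    y_raw = p_raw * q_raw                 # scaled by 2**(2n-2)
    w_raw = y_raw >> (n - 1)              # arithmetic shift drops the n-1 low bits
    return to_bits(y_raw, 2 * n - 1), to_bits(w_raw, n)


if __name__ == "__main__":
    N = 4
    y_bits, w_bits = multiply(6, -4, N)   # 0.75 * (-0.5)
    print(decode(to_bits(6, N)), decode(to_bits(-4, N)))
    print(y_bits, decode(y_bits))
    print(w_bits, decode(w_bits))
```

Exact rational arithmetic (`Fraction`) is used so that the decoded values can be compared with the weighted sums in equation (3) without rounding effects.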
Since constant word-length multiplication is considered, the N-1 lower order bits in the product Y are truncated, and the product is given by:

$W = -w_{N-1} + \sum_{i=1}^{N-1} w_{N-1-i} \, 2^{-i}$ (7)

The constant word-length product W may be represented as:

$W = w_{N-1} \,.\, w_{N-2} \,.\, w_{N-3} \ldots w_2 \,.\, w_1 \,.\, w_0$ (8)

The product W is used when a constant word-length multiplier is considered and Y is used when a full precision product is required. Out of the (2N-1)-bit product, the N most significant bits of an N×N bit multiplication are retained. The number will lie in the range $(-1, 1-2^{-N+1})$. 2's complement arithmetic has the advantage that a correct final result is guaranteed irrespective of overflow in the intermediate stages. In other words, the final result is correct if it is known to lie in the range $[-(1-2^{-N+1}), (1-2^{-N+1})]$.

III. RIPPLE CARRY ARRAY MULTIPLIER
The RCA multiplier is based on the traditional ripple carry logic. In each row, the carry generated by the previous full adder (FA) block ripples to the next FA block in the chain. The critical path of this multiplier thus depends on the length of the carry chain, which in turn depends on the operand word-length N. A block level schematic of a 4-bit RCA multiplier is shown in fig. 1. Based on this schematic, an N×N RCA multiplier will require $N^2$ adder cells and $N^2$ AND gates. The critical path for such a realization will include the delay associated with the first carry chain, plus the intermediate rows, plus the final carry chain. The critical path for a pure combinational realization will be given by:

$CP_{RCA} = T_a + N \cdot T_A + (N-2) \cdot T_A + N \cdot T_A = T_a + (3N-2) \cdot T_A$ (9)

where $T_a$ and $T_A$ are the delays associated with an AND gate and an FA cell, respectively.

IV. CARRY SAVE ARRAY MULTIPLIER
The major drawback of the RCA multiplier is that the carry chain has a large delay associated with it. This delay has a linear dependency on the operand word-length. An alternate approach is the carry save logic, where the rippling of the carry within a row is avoided. Instead, the carry generated by each FA cell is saved and propagated to the next row. This carry propagation from one row to the next makes the bit-wise addition operation within a row independent, enabling an increase in the speed of the overall multiplication process. A block level schematic of a 4-bit CSA multiplier is shown in fig. 2. An additional overhead in the CSA multiplier is the terminating vector merging adder (VMA). The VMA merges the partial vectors corresponding to sum and carry and generates the final output. It is essentially a ripple carry adder and incurs an additional hardware cost in a CSA multiplier. An N×N CSA multiplier will require $N^2$ adder cells, $N^2$ AND gates and the additional overhead of the VMA. The critical path will have an additional VMA component. The critical path for a pure combinational realization will be given by:

$CP_{CSA} = T_a + N \cdot T_A + T_{VMA}$ (10)

where $T_{VMA}$ is the delay associated with the VMA.

Fig. 2 Block schematic of multiplier based on carry save logic
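The difference between the two carry-handling schemes can be illustrated with a small bit-level model. The Python sketch below is a behavioural illustration only, not the LUT-instantiated VHDL used in this work, and it makes several simplifying assumptions. Operands are treated as unsigned, so the sign handling needed for 2's complement operands is not modelled. The carry-save model compresses across the full 2N-bit width instead of using exactly N cells per row. All function names are hypothetical. The helpers `cp_rca` and `cp_csa` evaluate the critical-path expressions for the ripple carry and carry save arrays for caller-supplied delay values.

```python
def full_adder(a, b, cin):
    """One FA cell: returns (sum, carry_out)."""
    return a ^ b ^ cin, (a & b) | (cin & (a ^ b))


def bits_of(value, width):
    """Bits of a non-negative integer, least significant bit first."""
    return [(value >> k) & 1 for k in range(width)]


def value_of(bits):
    """Integer value of an LSB-first bit list."""
    return sum(bit * 2 ** k for k, bit in enumerate(bits))


def rca_multiply(x, y, n):
    """Ripple-carry array: the carry ripples through every FA of a row."""
    xb, yb = bits_of(x, n), bits_of(y, n)
    acc = [0] * (2 * n)                       # running sum, LSB first
    for j in range(n):                        # one row per multiplier bit
        carry = 0
        for i in range(n):
            pp = xb[i] & yb[j]                # AND gate: partial-product bit
            acc[i + j], carry = full_adder(acc[i + j], pp, carry)
        acc[j + n] = carry                    # row carry-out
    return value_of(acc)


def csa_multiply(x, y, n):
    """Carry-save array: carries go to the next row, then a VMA merges them."""
    width = 2 * n
    xb, yb = bits_of(x, n), bits_of(y, n)
    s, c = [0] * width, [0] * width           # saved sum and carry vectors
    for j in range(n):
        pp = [0] * width
        for i in range(n):
            pp[i + j] = xb[i] & yb[j]
        next_s, next_c = [0] * width, [0] * width
        for k in range(width):                # cells in a row are independent
            next_s[k], cout = full_adder(s[k], c[k], pp[k])
            if k != width - 1:
                next_c[k + 1] = cout          # carry saved for the next row
        s, c = next_s, next_c
    merged, carry = [], 0                     # vector merging adder (ripple carry)
    for k in range(width):
        bit, carry = full_adder(s[k], c[k], carry)
        merged.append(bit)
    return value_of(merged)


def cp_rca(n, t_and, t_fa):
    """Critical path of the ripple carry array: T_a + (3N - 2) * T_A."""
    return t_and + (3 * n - 2) * t_fa


def cp_csa(n, t_and, t_fa, t_vma):
    """Critical path of the carry save array: T_a + N * T_A + T_VMA."""
    return t_and + n * t_fa + t_vma
```

Both models return the same product for any pair of N-bit unsigned operands. They differ only in where the carries travel, which is what separates the two critical-path expressions.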
V. SYNTHESIS, IMPLEMENTATION AND ANALYSIS
The basic logic element in an FPGA is the look-up table (LUT). The input bandwidth of an LUT varies from family to family. This work considers FPGAs that have 6-input LUTs as their basic logic element. Specifically, we have considered Spartan-6 FPGAs from Xilinx. Each logic slice in Spartan-6 has a general fabric of LUTs, carry-chains, function generators etc. The LUTs have an input bandwidth of six and can be used to implement a 6-input Boolean function, or two 5-input Boolean functions. The former is a single mode operation and the latter is a dual mode operation. To enable a dual mode operation, the two logical functions should have shared variables.

Detailed implementation has been done by realizing multiplier structures of varying input word-lengths. The parameters used to specify the performance include resources utilized, timing and power dissipation. Resources include the number of LUTs and logic slices used. Timing gives the notion of speed for a particular realization. In this work, we have considered both combinational and pipelined realizations of RCA and CSA multipliers. For combinational realizations, timing analysis is mainly concerned with the delay incurred along the critical path, which is defined as the longest combinational path in a circuit. For pipelined realizations, timing analysis is concerned with the frequency at which the multipliers can be clocked. Timing analysis may be done after synthesis or after placement and route (PAR). However, post-synthesis timing analysis is often vulnerable to changes, as the logic has not yet been mapped on the LUTs. Therefore, post-PAR timing analysis, which is more accurate, has been done. Post-PAR timing analysis also enables the designer to capture a realistic picture of the switching activity that is occurring within a routed design. The same is used to assess the dynamic power dissipation of an implemented design. Generally, a value change dump (VCD) or a simulation activity interchange format (SAIF) file captures the switching activity of a design, which in combination with the design netlist and power constraint file is used to generate a detailed power report. Power dissipation involves both static and dynamic components. While static power is device specific, dynamic power depends on the complexity of the routed design. The amount of mapped logic, clock frequency, density of interconnects and the toggle rate of signals along nodes are some of the factors that will affect the reported power metrics from the synthesizer database.

Language based design entry is used by coding the multiplier functionality in VHDL. However, unlike traditional approaches, an instantiation based coding strategy is followed. Conventional codes are inferential. Inferential coding styles do not give the designer any control over the placement and distribution of the logic to the underlying FPGA fabric. The synthesizer infers the VHDL code and distributes the logic to the underlying FPGA fabric as per its internal strategies, which are not known to the designer. Although such codes are easy to write and have a high degree of readability, the resulting design suffers from poor performance.

In this work, VHDL codes have been written following an instantiation strategy. Such a strategy involves writing a VHDL script that directly calls upon the target element, in this case a 6-input LUT, and assigns some part of the available logic to it. Instantiation increases the complexity of the code, makes it time consuming to write and often renders the code incomprehensible, but generally results in improved performance.

Xilinx ISE 14.1 has been used to carry out the synthesis, simulation and implementation of different multiplier structures. For power estimation, XPower Analyzer has been used. The entire analysis has been done on a comparative basis and some frequently used traditional multiplier designs have been considered. Such an analysis gives a good measure of the achievable performance speed-up in fixed-point array multipliers based on our approach. Multiplier realizations reported in [13-14] have been considered. However, the multipliers in [13-14] have been implemented using Virtex-5 FPGAs. Since our work focuses on Spartan-6 FPGAs, the multiplier designs presented in [13-14] are re-implemented using Spartan-6 FPGAs.

A. Resource Analysis
Table 1 provides a comparison of the different FPGA resources utilized by different multiplier realizations proposed in this work and those reported in [13-14]. The analysis considers an input operand word-length of 16 bits. It is observed that fixed-point array multipliers show an improved resource utilization when compared to traditional realizations. This is due to the fixed word-length nature of the multipliers, which truncate the extra bits so that the input operands and the final product have the same word length. Among different multipliers, CSA based array multipliers show higher resource utilization due to the additional terminating VMA part. Among different implementation styles, pipelined multipliers have higher resource utilization due to the involvement of pipeline registers. Further analysis is carried out by plotting the different utilized resources against the input operand word-length. The results are shown in fig. 3.

TABLE 1. RESOURCE UTILIZATION FOR DIFFERENT MULTIPLIER REALIZATIONS

Multiplier Design                                  No. of LUTs   No. of Slices
Carry Save Mult. (CSM) [13-14]                     942           458
Carry Ripple Mult. (CRM) [13-14]                   1062          527
Type-I Signed Booth Mult. (BSM-I) [13-14]          1142          567
Type-II Signed Booth Mult. (BSM-II) [13-14]        884           379
Type-III Signed Booth Mult. (BSM-III) [13-14]      1073          532
Comb. Ripple Carry Array (CRCAr) [this work]       764           325
Pip. Ripple Carry Array (PRCAr) [this work]        835           336
Comb. Carry Save Array (CCSAr) [this work]         857           432
Pip. Carry Save Array (PCSAr) [this work]          902           445
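To make the instantiation strategy of this section more concrete, the sketch below shows how the contents of a dual mode 6-input LUT could be derived for one array cell (an AND gate feeding a full adder). It is written in Python as an offline helper, not as part of the VHDL design, and it is an illustration under stated assumptions, not the mapping actually used in this work. It assumes the usual Xilinx convention that INIT bit k holds the LUT output for input address k, with I0 as the least significant address bit. It also assumes that in dual mode the upper 32 INIT bits drive the O6 output (with I5 tied high) and the lower 32 bits drive O5. The cell functions and all names are hypothetical. The sum and carry outputs depend on the same four variables, which is the shared-variable condition required for dual mode operation.

```python
def lut_init(func, n_inputs):
    """Truth table of func as an integer: bit `addr` holds func(I0, I1, ...)."""
    init = 0
    for addr in range(2 ** n_inputs):
        inputs = [(addr >> k) & 1 for k in range(n_inputs)]
        if func(*inputs):
            init |= 2 ** addr
    return init


def cell_sum(x, y, s_in, c_in, _unused=0):
    """Sum output of one array cell: AND gate feeding a full adder."""
    return (x & y) ^ s_in ^ c_in


def cell_carry(x, y, s_in, c_in, _unused=0):
    """Carry output of the same cell (majority of the three adder inputs)."""
    pp = x & y
    return (pp & s_in) | (pp & c_in) | (s_in & c_in)


def dual_mode_init(f_o6, f_o5):
    """64-bit INIT with f_o6 in the upper half and f_o5 in the lower half."""
    return lut_init(f_o6, 5) * 2 ** 32 | lut_init(f_o5, 5)


if __name__ == "__main__":
    init = dual_mode_init(cell_sum, cell_carry)
    # Formatted as a VHDL hexadecimal bit-string literal for an INIT generic.
    print('INIT => X"{:016X}"'.format(init))
```

A value computed this way would be passed as the INIT generic of an instantiated LUT primitive. The fifth input is unused here, so each 16-entry truth table is simply repeated in both halves of its 32-bit field.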
Fig. 3 Resource utilization versus word length. (a) LUTs (b) Slices [plots of LUTs and occupied slices against word length for CRCAr, CCSAr, PRCAr and PCSAr]

B. Timing Analysis
For timing, the critical path and maximum achievable clock frequency of different multiplier realizations have been considered and compared against those reported in [13-14]. The results are shown in table 2. Again, an input operand word-length of 16 bits has been considered. It is observed that fixed-point array multipliers show an improved timing response when compared to traditional realizations. Among different multipliers, CSA based array multipliers show reduced critical paths because they do not involve the rippling of carry from one unit to another within a row. Among different implementation styles, pipelined multipliers have a reduced critical path due to the involvement of pipeline registers, which break the combinational path. Further analysis is done by plotting the measured critical path against the input operand word-length. The results are shown in fig. 4.

TABLE 2. CRITICAL PATH AND THROUGHPUT FOR DIFFERENT MULTIPLIERS

Multiplier Design                                  Crit. Path (ns)   Op. Freq. (MHz)
Carry Save Mult. (CSM) [13-14]                     29.6078           356.21
Carry Ripple Mult. (CRM) [13-14]                   46.30314          308.71
Type-I Signed Booth Mult. (BSM-I) [13-14]          50.75136          298.71
Type-II Signed Booth Mult. (BSM-II) [13-14]        30.81981          323.01
Type-III Signed Booth Mult. (BSM-III) [13-14]      30.5715           325
Comb. Ripple Carry Array (CRCAr) [this work]       28.853            578
Pip. Ripple Carry Array (PRCAr) [this work]        7.667             578
Comb. Carry Save Array (CCSAr) [this work]         19.941            578
Pip. Carry Save Array (PCSAr) [this work]          3.641             578

Fig. 4 Critical Path versus word length [plot of critical path delay (ns) against word length for CRCAr, CCSAr, PRCAr and PCSAr]

C. Power Analysis
Finally, the dynamic power dissipation of different multiplier realizations is reported in table 3. The analysis is done for maximum operating frequency and a fixed input operand word-length of 16 bits. Fixed-point array multipliers show a considerable reduction in dynamic power dissipation when compared to traditional multipliers. This is due to the reduction in logic that is achieved by truncating extra bits in fixed-point multipliers. Among different multipliers, CSA based array multipliers show higher power dissipation because of the extra logic overhead in terms of the VMA. Among different implementation styles, pipelined multipliers have higher power dissipation due to the involvement of pipeline registers, which consume extra power. Further analysis is done by plotting dynamic power dissipation as a function of operand word-length. The results are shown in fig. 5.

TABLE 3. DYNAMIC POWER ANALYSIS FOR DIFFERENT MULTIPLIER REALIZATIONS

Multiplier Design                                  Dynamic Power (mW)
Carry Save Mult. (CSM) [13-14]                     35.69
Carry Ripple Mult. (CRM) [13-14]                   67.13
Type-I Signed Booth Mult. (BSM-I) [13-14]          69.74
Type-II Signed Booth Mult. (BSM-II) [13-14]        68.89
Type-III Signed Booth Mult. (BSM-III) [13-14]      56.52
Comb. Ripple Carry Array (CRCAr) [this work]       23.91
Pip. Ripple Carry Array (PRCAr) [this work]        29.43
Comb. Carry Save Array (CCSAr) [this work]         25.04
Pip. Carry Save Array (PCSAr) [this work]          31.05

Fig. 5 Dynamic Power versus word length [plot of dynamic power (mW) against word length for CRCAr, CCSAr, PRCAr and PCSAr]
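The qualitative comparisons in this section can be expressed as relative reductions computed from the reported 16-bit figures. The short Python sketch below is only a convenience for reading the tables. The pairing of each proposed combinational design with a baseline (CRCAr against CRM, CCSAr against CSM) is an assumption made here for illustration and is not a pairing stated in the text. The names `TABLES`, `reduction` and `compare` are hypothetical.

```python
# (LUTs, critical path in ns, dynamic power in mW) as listed in Tables 1 to 3.
TABLES = {
    "CSM": (942, 29.6078, 35.69),
    "CRM": (1062, 46.30314, 67.13),
    "CRCAr": (764, 28.853, 23.91),
    "CCSAr": (857, 19.941, 25.04),
}
METRICS = ("LUTs", "critical path", "dynamic power")


def reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


def compare(baseline_name, proposed_name):
    """Per-metric percentage reduction of one design relative to another."""
    base, prop = TABLES[baseline_name], TABLES[proposed_name]
    return {m: reduction(b, p) for m, b, p in zip(METRICS, base, prop)}


if __name__ == "__main__":
    for base, prop in (("CRM", "CRCAr"), ("CSM", "CCSAr")):
        for metric, value in compare(base, prop).items():
            print("{} vs {}: {} reduced by {:.2f}%".format(prop, base, metric, value))
```

Only the combinational designs are included because the baseline designs from [13-14] are compared here on critical path, which for the pipelined designs reflects a single stage.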
VI. CONCLUSION
This work carried out the analysis of fixed-point bit-parallel array multipliers. Realizations based on ripple carry logic and carry save logic were considered. Performance comparisons, based on the data obtained from the synthesizer database, were found to be consistent with the theoretical arguments. A distinguishing feature of this work was the use of an instantiation based coding style. A substantial improvement in performance was observed using the instantiation based coding. One important feature of array multipliers is that they can be easily pipelined, thereby resulting in extremely fast structures. The analysis done in this work also highlights this feature of array multipliers. One limitation of fixed-point representation, however, is that it may affect the accuracy of the final product. In this work, it was assumed that the operands are normalized so that their magnitude lies between 0 and 1, resulting in fairly accurate results.

ACKNOWLEDGEMENTS
The authors would like to acknowledge the TEQIP-III project team for their assistance and financial support during the entire course of study.

REFERENCES
[1] G. L. Narayan and B. Venkataramani, “Optimization Techniques for FPGA based Wave Pipelined DSP Blocks,” IEEE Transc. Very Large Scale Integr. (VLSI) syst., vol. 13, No. 7, pp. 783-792, July 2005.
[2] M. A. Ashour and H. I. Saleh, “An FPGA Implementation guide for some different types of Serial-Parallel Multiplier Structures,” Microelectronics Journal, vol. 31, pp. 161-168, 2000.
[3] K. Compton, S. Hauck, “Reconfigurable Computing: A survey of Systems and Software,” ACM Computing Surveys, vol. 34, No. 2, pp. 171-210, June 2002.
[4] IEEE Standards Board. IEEE Standard for Binary Floating-Point Arithmetic.
[5] Technical Report ANSI/IEEE Std. 754-1985, the Institute of Electrical and Electronics Engineers, 1985.
[6] C. Inacio, D. Ombres. The DSP decision: Fixed point or floating? IEEE Spectrum 33(9), September 1996.
[7] R. Tessier and W. Burleson, “Reconfigurable Computing for DSP: A Survey,” Journal of VLSI Signal Processing, Vol. 28, pp. 7-27, 2001, Kluwer Academic Publisher.
[8] K. K. Parhi, Chapter 13, Bit-Level Arithmetic Architectures, VLSI Digital Signal Processing Systems: Design and Implementation, Wiley, 1999.
[9] T. J. Todman, G. A. Constantinides, S. J. E. Wilton, O. Mencer, W. Luk and P. Y. K. Cheung, “Reconfigurable Computing: Architecture and Design Methods,” IEEE Proceedings. Computer Digital Technology, Vol. 152, No. 2, March 2005.
[10] R. Naseer, M. Balakrishnan, and A. Kumar, “Direct Mapping of RTL Structures onto LUT-Based FPGAs,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 17, No. 7, July 1998.
[11] O. Kwon, K. Nowka, and Jr. Swartzlander, “A 16-bit by 16-bit MAC design using fast 5:3 compressor cells,” Journal of VLSI Signal Processing, Vol. 31, No. 2, pp. 77-89, June 2002.
[12] Scott Hauck and Andre Dehon, “Reconfigurable Computing: The Theory and Practice of FPGA based Computation,” Morgan Kaufmann Publisher, November 2007.
[13] S. Bhattacharjee, S. Sil, B. Basak and A. Chakarbarti, “Evaluation of Power Efficient Adder and Multiplier Circuits for FPGA based DSP Applications,” in Proceedings of the International Conference on Communication and Industrial Applications (ICCIA), December 2011.
[14] B. Khurshid and R. Naaz, “Technology Optimized Fixed-Point Bit-Parallel Multiplier for LUT based FPGAs,” International Journal of High Performance Systems Architecture, Vol. 6, No. 1, pp. 28-35, 2016.