IMPACT-2009

A High-Speed, Hierarchical 16×16 Array of Array Multiplier Design

Abhijit Asati (1) and Chandrashekhar (2)
(1) EEE Group, BITS, Pilani, India, abhijitmicro@gmail.com
(2) CEERI, Pilani, India, chandra@ceeri.ernet.in
978-1-4244-3604-0/09/$25.00 ©2009 IEEE

Abstract: Array multipliers are preferred for smaller operand sizes because of their simpler VLSI implementation, in spite of their linear time complexity. Tree multipliers have a time complexity of $O(\log n)$ but are less suitable for VLSI implementation: being less regular, they require a larger total routing length, which may degrade their performance. Hybrid architectures called ‘array of array’ multipliers have intermediate performance. These multipliers have a time complexity better than that of array multipliers and therefore become an obvious choice for higher-performance multiplier designs of moderate operand sizes. In this paper a 16×16 unsigned ‘array of array’ multiplier circuit is designed with a hierarchical structure and implemented using conventional CMOS logic in the 0.6 μm N-well CMOS process (SCN_SUBM, lambda = 0.3) of MOSIS. The proposed multiplier implementation shows a large reduction in propagation delay and in average power consumption (at 20 MHz) compared with the 16-bit Booth encoded Wallace tree multiplier of F. Jalil [3]. The total transistor count, maximum instantaneous power, leakage power, core area, total routing length and number of vias are also presented.

I. INTRODUCTION

The multiplier is a fundamental building block in standard and ASIC digital signal processors. Multiplication is used in many neural computing and DSP applications such as instrumentation and measurement, communications, audio and video processing, graphics, image enhancement, 3-D rendering, navigation, radar and GPS, and in control applications such as robotics, machine vision and guidance. It is mainly used to implement algorithms such as frequency-domain filtering (FIR and IIR), frequency-time transformations (FFT) and correlation. Most DSP tasks require real-time processing, so the multiplier must perform these tasks quickly while minimizing cost and power.

Multiplication algorithms differ in how they perform ‘partial product generation’ and ‘partial product addition’ [1]. The array multiplier has linear time complexity, i.e. $O(n)$, so its delay degrades for larger operand sizes. It also has poor space complexity, $O(n^2)$, since it requires approximately $n^2$ cells to produce the product. Therefore, as the operand size grows, the circuit takes more area and power [2], [5], [6]. Radix-m Booth encoding, where $m = 2^n$, reduces the number of partial product rows by a factor of n. Booth radix-4 ($m = 4 = 2^2$) encoding reduces the number of partial product rows by a factor of two [3]. Since the number of partial product rows is halved, the hardware required to generate the partial products is reduced to $n^2/2$ cells [2].

In Wallace tree multipliers the ripple effect is reduced, so they produce products in far less time. The time complexity is reduced to $O(\log n)$, but a larger routing area is required than for regular array multipliers, making them less suitable for VLSI implementation [2]. The hardware reduction of the Booth encoding scheme can be combined with accelerated Wallace tree accumulation of partial products to obtain a time complexity of $O(\log n)$, which is well suited to multipliers with large operand sizes [2], [3].

In the sub-micron and deep sub-micron era, for multipliers of moderate operand sizes, tree-based architectures may lose performance because of their larger routing lengths. Some hybrid architectures then show better performance, since gate-level analysis of these architectures shows moderate area and delay. These multiplier architectures have moderate area requirements and a time complexity of O(N) [4].

In this paper we present a hierarchical implementation of a 16×16 multiplier design using the array of array technique. The VLSI implementation of the multiplier circuit is done in the 0.6 μm N-well CMOS process (SCN_SUBM, lambda = 0.3) of MOSIS, using conventional CMOS logic. Simulation results are compared with the Booth encoded Wallace tree multiplier of [3]. Section II explains the design of a 2×2 multiplier, Section III describes the hierarchical design of a 4×4 multiplier, and Section IV describes the hierarchical design of the 8×8 and 16×16 multipliers. Physical implementation and results are described in Section V. Section VI concludes the paper.

II. DESIGN OF A 2×2 MULTIPLIER

In this architecture the 2×2 unsigned multiplier is the basic building block in the hierarchical design of larger multipliers. The truth table for a 2×2 combinational multiplier is shown in Table I. Simplifying the truth table with a K-map gives equation (1), from which a 2×2 combinational circuit can be realized.

$$
\begin{aligned}
P_0 &= A_0 B_0 \\
P_1 &= A_0 B_1 \left(\overline{B_0} + \overline{A_1}\right) + A_1 B_0 \left(\overline{B_1} + \overline{A_0}\right) \\
P_2 &= A_1 B_1 \left(\overline{A_0} + \overline{B_0}\right) \\
P_3 &= A_1 A_0 B_1 B_0
\end{aligned}
\tag{1}
$$

TABLE I. TRUTH TABLE OF 2×2 MULTIPLIER

A1 A0 B1 B0    P3 P2 P1 P0
0  0  0  0     0  0  0  0
0  0  0  1     0  0  0  0
0  0  1  0     0  0  0  0
0  0  1  1     0  0  0  0
0  1  0  0     0  0  0  0
0  1  0  1     0  0  0  1
0  1  1  0     0  0  1  0
0  1  1  1     0  0  1  1
1  0  0  0     0  0  0  0
1  0  0  1     0  0  1  0
1  0  1  0     0  1  0  0
1  0  1  1     0  1  1  0
1  1  0  0     0  0  0  0
1  1  0  1     0  0  1  1
1  1  1  0     0  1  1  0
1  1  1  1     1  0  0  1

III. DESIGN OF A 4×4 MULTIPLIER

The first step in the design of the 4-bit multiplier is to find the different combinations of input bit pairs that can be expressed in terms of the 2×2 multiplier. Each combination of input bit pairs is handled by a separate 2×2 combinational multiplier, producing 4 partial product rows. These partial product rows are then added optimally, using 5-bit full adder cells, to generate the final product bits. The design procedure for the 4×4 combinational multiplier is shown in Table II, while Fig. 1 shows the schematic of a 4×4 combinational multiplier built from 2×2 combinational multipliers.

TABLE II. DESIGN OF A 4-BIT MULTIPLIER USING 2×2 COMBINATIONAL MULTIPLIER

Operand A:  A3 A2 (Group II)    A1 A0 (Group I)
Operand B:  B3 B2 (Group IV)    B1 B0 (Group III)

Pair
I × III                             PP3   PP2   PP1   PP0
II × III                PP7   PP6   PP5   PP4
I × IV                  PP11  PP10  PP9   PP8
II × IV     PP15  PP14  PP13  PP12
Sum         P7    P6    P5    P4    P3    P2    P1    P0

Figure 1. A 4×4 combinational multiplier

IV. DESIGN OF AN 8×8 MULTIPLIER AND A 16×16 MULTIPLIER

In the design of the 8×8 multiplier, the first step is to find the different combinations of input bit groups that can be expressed in terms of the 4×4 multiplier. Each combination is handled by a separate 4×4 combinational multiplier, which has already been designed from 2×2 multipliers as explained in Section III. These separate 4×4 combinational multipliers produce 4 partial product rows, which are then added optimally to generate the final product bits of the 8×8 multiplier, as shown in Fig. 2.

Similarly, in the design of the 16×16 multiplier, the first step is to find the different combinations of input bit groups that can be expressed in terms of the 8×8 multiplier. Each combination is handled by a separate 8×8 combinational multiplier (designed as described above) to produce 4 partial product rows. These partial product rows are then added optimally to generate the final product bits of the 16×16 multiplier, as shown in Fig. 3.

At each level of the design hierarchy only four partial product rows have to be handled. The accumulation of partial product rows at each level is therefore much simpler than in other multiplier architectures.
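The hierarchy of Sections II to IV can be sketched in software as follows. This is only a behavioural illustration, not the authors' circuit: the names `mul2x2` and `multiply` are hypothetical, the 2×2 block uses equation (1) with the complemented literals chosen so that the outputs match Table I, and ordinary integer addition stands in for the full adder network that accumulates the four partial product rows.

```python
def mul2x2(a, b):
    """2x2 unsigned multiplier written from the Boolean equations (1).

    a and b are 2-bit integers (0..3); the result is the 4-bit product.
    """
    a1, a0 = (a >> 1) & 1, a & 1
    b1, b0 = (b >> 1) & 1, b & 1

    def inv(x):
        # 1-bit complement
        return x ^ 1

    p0 = a0 & b0
    p1 = (a0 & b1 & (inv(b0) | inv(a1))) | (a1 & b0 & (inv(b1) | inv(a0)))
    p2 = a1 & b1 & (inv(a0) | inv(b0))
    p3 = a1 & a0 & b1 & b0
    return (p3 << 3) | (p2 << 2) | (p1 << 1) | p0


def multiply(a, b, width):
    """Array-of-array product of two unsigned operands of `width` bits.

    `width` is assumed to be 2, 4, 8 or 16 and both operands are assumed
    to fit in `width` bits. Each level splits the operands into a low and
    a high group and forms four partial product rows from half-width
    multipliers, following the grouping of Table II.
    """
    if width == 2:
        return mul2x2(a, b)

    half = width // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half   # Group I, Group II
    b_lo, b_hi = b & mask, b >> half   # Group III, Group IV

    pp_ll = multiply(a_lo, b_lo, half)  # I x III
    pp_hl = multiply(a_hi, b_lo, half)  # II x III
    pp_lh = multiply(a_lo, b_hi, half)  # I x IV
    pp_hh = multiply(a_hi, b_hi, half)  # II x IV

    # Integer addition is a stand-in for the adder cells of the real design;
    # the shifts place each row at its bit weight.
    return pp_ll + ((pp_hl + pp_lh) << half) + (pp_hh << width)
```

For example, `multiply(12345, 54321, 16)` should equal `12345 * 54321`. Under this decomposition a 16×16 product is built from 4 8×8 blocks, 16 4×4 blocks and 64 2×2 blocks, with only four rows to accumulate at every level.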
V. PHYSICAL IMPLEMENTATION AND RESULTS

The layout of the 16×16 unsigned multiplier circuit shown in Fig. 3 is implemented in the 0.6 μm N-well CMOS process (SCN_SUBM, lambda = 0.3) of MOSIS, using conventional CMOS logic. A schematic library of 4 functional cells is defined for the static CMOS design style: a 1-bit full adder, a 2-input NAND, a 2-input NOR and an inverter. The selected 1-bit full adder and the other logic cells show better power-delay products than other logic design styles [7]. Corresponding to the schematic library, physical libraries were designed in the conventional CMOS logic style using the design principles of [8], [9], [10], [11].

Three versions of each physical library were developed by sizing the W/L ratio of the NMOS transistor to 3, 5 and 7 respectively. W/L values smaller than 3 were also tried but not pursued: the weak transistor drive made them parasitic-dominated and slow, so they were not good candidates for high performance. The layout of the 16-bit multiplier was assembled from these cell libraries with the automatic place and route tool L-Edit (SPR) from Tanner Research Inc. The physical library with an NMOS W/L ratio of 3 gave the smallest average switching energy-delay product.

The generated layouts were simulated after parasitic extraction using the circuit simulator ELDO SPICE. The supply voltage VDD was kept at 3.3 V. Design parameters such as propagation delay, transistor count, core area and power dissipation at a 20 MHz data rate were compared. The layout with NMOS transistor sizing of 3 gave the best results, which are compared with the Booth encoded Wallace tree multiplier (BEWM) of reference [3] in Table III.

TABLE III. COMPARISON TABLE

Algorithm (technology)   VDD (V)   Propagation delay τ (ns)   Average power (mW)   Transistor count
Proposed (0.6 μm)        3.3       10.94                      31.58                16032
BEWM (1.25 μm)           5         60                         100                  7858

Compared with the BEWM, the proposed array of array multiplier reduces the delay to 0.182 of the BEWM value (10.94 ns against 60 ns) and the average power consumption to about 0.316 of the BEWM value (31.58 mW against 100 mW). The improvement in delay is larger than that predicted by scaling theory, in spite of the increased transistor count, and is attributed to the reduced interconnect length. The maximum instantaneous power, leakage power, core area, total routing length and number of vias are shown in Table IV to highlight the VLSI implementation characteristics.

TABLE IV. OTHER IMPLEMENTATION DETAILS

Algorithm (technology)   Maximum power (mW)   Leakage power (nW)   Core area (mm²)   Total routing length (mm)   Number of vias
Proposed (0.6 μm)        543.34               64.02                35.07             2101.75                     5704

VI. CONCLUSION

This paper describes an array of array unsigned multiplier design. The accumulation of partial product rows at each level of the hierarchy is much simpler than in other multiplier architectures. The multiplier circuit is implemented in a 0.6 μm N-well CMOS process using the conventional fully static CMOS logic design style with appropriate transistor sizes. The simulation results are compared with a Booth encoded Wallace tree multiplier architecture and show a reduction in propagation delay by a factor of about 5.5 and in average switching power by a factor of approximately three. The maximum instantaneous power, leakage power, core area, total routing length and number of vias are also presented.

REFERENCES

[1] A. Hesham, “Technology scaling effects on multipliers,” IEEE Transactions on Computers, Vol. 47, No. 11, pp. 1201-1215, November 1998.
[2] Z. Kiamal, “Multiplexer-based array multipliers,” IEEE Transactions on Computers, Vol. 48, No. 1, pp. 15-23, January 1999.
[3] F. Jalil, “M*N Booth encoded multiplier generator using optimized Wallace trees,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 1, No. 2, pp. 120-125, June 1993.
[4] V. Chanramouli, “Self-timed design in GaAs: case study on a high-speed, parallel multiplier,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 4, No. 1, pp. 146-149, March 1996.
[5] P. Kornerup, “A systolic, linear-array multiplier for a class of right-shift algorithms,” IEEE Transactions on Computers, Vol. 43, No. 8, pp. 892-898, August 1994.
[6] L. Ciminiera, “Carry-save multiplication schemes without final addition,” IEEE Transactions on Computers, Vol. 45, No. 9, pp. 1050-1055, September 1996.
[7] Reto Zimmermann and Wolfgang Fichtner, “Low-Power Logic Styles: CMOS Versus Pass-Transistor Logic,” IEEE Journal of Solid-State Circuits, Vol. 32, No. 7, pp. 1079-1090, July 1997.
[8] Mohab Anis, Mohamed Allam and Mohamed Elmasry, “Impact of Technology Scaling on CMOS Logic Styles,” IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, Vol. 49, No. 8, pp. 577-587, August 2002.
[9] S. M. Kang and Yusuf Leblebici, “CMOS Digital Integrated Circuits: Analysis and Design,” Third Edition, McGraw-Hill, 2003.
[10] N. Weste and K. Eshraghian, “Principles of CMOS VLSI Design,” Addison-Wesley, 1994.
[11] Jan M. Rabaey, Anantha Chandrakasan, Borivoje Nikolic, “Digital Integrated Circuits,” Second Edition, Prentice-Hall of India Private Limited, 2004.
Figure 2. An 8×8 combinational multiplier

Figure 3. A 16×16 combinational multiplier
