Abstract: In the nanometer regime, optimization of System-on-Chip (SoC) designs with respect to speed, power and area is a major concern for VLSI designers today. Imprecise/approximate design relaxes the constraints on accuracy, giving rise to novel Speed-Power-Accuracy-Area (SPAA) metrics which can lead to tremendous improvements in speed and/or power with a small compromise in accuracy. This advantage has drawn researchers to imprecise/approximate VLSI design. In this paper, we present a new accuracy-configurable multiplier architecture (ACMA) for error-resilient systems. The ACMA uses a technique called Carry-in Prediction for approximate multiplication based on efficient precomputation logic that increases its throughput. The proposed multiplication reduces the latency of an accurate multiplier by almost half by reducing its critical path. The simulation results suggest that SPAA metrics can be administered by exploiting the design for an appropriate number of iterations. The results for 16-bit multiplication show a mean accuracy of 99.85% to 99.9% when there is no lower bound on the size of operands; if the operands are 10 bits or more (numbers > 1000), the mean accuracy is 99.965%.

Index Terms: Approximate Arithmetic; Error-Resilient Designs; Multiplier Architecture; Accuracy-Configurable Multiplier

I. INTRODUCTION

With the recent spectacular progress in sub-nanometer technology, shrinking of transistor sizes has led to integration of many cores on a single chip, thereby increasing system performance. This continuous technology scaling is putting forth new design challenges because of contradictory design specifications such as low power and high speed. Further, it would be impractical to meet these design specifications by enlarging the design at exorbitant costs of manufacturing, verification, and test. The International Technology Roadmap for Semiconductors (ITRS) [1] has anticipated imprecise/approximate designs, which have become a state-of-the-art demand in view of the emerging class of killer applications that manifest inherent error-resilience, such as multimedia, graphics, and wireless communications. In complex multimedia applications, DSP blocks are implemented as cores that process signals relevant to human senses, e.g., sight and hearing. The limited perception of human senses relaxes the constraints on accuracy, resulting in error-resilient System-on-Chip (SoC) designs.

In the SoC design for the aforementioned applications, adders and multipliers are used as basic building blocks and, in accordance with error-resilient system design, imprecise/approximate designs have attracted significant research interest in recent years. Conventional approaches investigated several mechanisms such as truncation [2] [3], over-clocking, and voltage over-scaling (VOS) [4], which could not configure Speed-Power-Accuracy-Area (SPAA) metrics effectively. Apart from these, other design techniques rely on functional approximations and mostly focus on imprecise/approximate adders via the concept of shortening the carry chain to elevate design performance. Lu [5] proposed a carry look-ahead adder in which only a limited number of previous bits are considered to estimate the current carry signal. Lu's adder is found unattractive due to a low probability of getting a correct sum and increased area overhead. Shin et al. [6] reduce data-path delay by re-designing the data-path modules and cutting the critical path in the carry chain, exploiting a given error rate to improve parametric yield. Further, Shin et al. [7] explored a logic synthesis approach to exploit a given error rate to reduce the area of imprecise/approximate circuits.

Zhu et al. [8] [9] present four error-tolerant adders: ETA-I, ETA-II, ETA-IIM, and ETA-IV. ETA-I [8] segments inputs into: 1) Accurate part, and 2) Inaccurate part, in which no carry signal is considered at any bit position. ETA-II [9] concurrently completes carry propagation by dividing the path into a number of short paths. In ETA-IIM, more MSBs are considered in evaluating the carry signal at the expense of degradation in speed, and ETA-IV uses a Carry Select Adder (CSL). Gupta et al. [10] target low power by leveraging error resiliency and propose five different versions of the mirror adder by reducing the number of transistors and internal node capacitances. Verma et al. [11] proposed a Variable Latency Speculative Adder (VLSA) which provides approximate/accurate results but incurs delay and large area overhead. Kahng et al. [12] demonstrated an accuracy-configurable adder (ACA) with reduced critical-path delay and error rate. The ACA provides a better trade-off between accuracy, speed and power but with a large area overhead.

In comparison to the above work on adders, very few researchers have reported work on approximate multipliers. Sullivan et al. [13] investigated an iterative approximate multiplier based on Truncated Error Correction (TEC) in which a small amount of error correcting circuitry is added for each iteration. This circuitry inexpensively replicates the effects of multiple pipeline iterations for the most problematic inputs. Kulkarni et al. [14] presented a 2 × 2 underdesigned multiplier
that is of the remaining lower order bits. The multiplication process begins at the point where the bits split and moves simultaneously in the two opposite directions till all bits are taken care of. The ETM is able to achieve a reduction in delay, power and hardware cost as compared to the traditional 12-bit multipliers.

Most of the above design approaches of approximate multipliers

and fine tuning to dynamically update the probability of input combinations exhibiting accurate results which bound SPAA

Fig. 1. (a) Recursive Multiplication (b) Recursive Tree (Each box represents a partial product of equal size.)
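As a small illustration of the recursive multiplication in Fig. 1, the sketch below splits two 2b-bit operands into b-bit halves and recombines the four partial products. The function names `split` and `recursive_product` are illustrative and do not come from the paper. All four partial products are computed exactly here, so the sketch shows only the decomposition and not any approximation.

```python
def split(value: int, b: int) -> tuple[int, int]:
    """Split a 2b-bit value into its upper and lower b-bit halves."""
    return value >> b, value & ((1 << b) - 1)


def recursive_product(a: int, x: int, b: int) -> int:
    """Multiply two 2b-bit operands from four b x b partial products."""
    a_h, a_l = split(a, b)
    x_h, x_l = split(x, b)
    # Each term below is one box of the recursive tree in Fig. 1.
    hh = a_h * x_h
    hl = a_h * x_l
    lh = a_l * x_h
    ll = a_l * x_l
    # Weights: HH is shifted by 2b bits, the cross terms by b bits.
    return (hh << (2 * b)) + ((hl + lh) << b) + ll


if __name__ == "__main__":
    a, x = 0xBEEF, 0x1234
    print(recursive_product(a, x, 8) == a * x)
```

Replacing some of the four exact products with approximate ones is what turns this decomposition into an approximate multiplier.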
However, in order to minimize the error, we derive a logic that gives accurate results for certain most significant bits (MSBs). In our logic, we render the most significant $2b$ bits (out of a total of $4b$ bits) accurate to a large extent. Note that the objective here is to make the upper $2b$ bits accurate to a high extent and not completely accurate. This minimizes the degree of error involved and at the same time gives a suitable tradeoff between accuracy and power consumed in the multiplication. Further, out of the least significant $2b$ bits that remain, the lower $b/2$ bits will be accurate and the upper $3b/2$ bits will be inaccurate. We make the lower $b/2$ bits accurate because it does not take a lot of hardware and at the same time increases accuracy. In other words, out of $4b$ bits in the final product, the least significant $b/2$ bits will be accurate, the most significant $2b$ bits will be accurate to a large extent, and the remaining $3b/2$ bits will be inaccurate. This arrangement not only provides accuracy but also gives promising results for speed, power and area. As we shall see, this is achieved by employing approximate multipliers for the $A_H X_L$, $A_L X_H$ and $A_L X_L$ computation in Figure 2, while an accurate multiplier is used for the $A_H X_H$ evaluation.

In order to obtain most significant bits accurate (out of $4b$ bits in the final product), we make $A_H X_H$ ($b \times b$ multiplication) completely accurate. Further, we reduce the critical path to a large extent by dividing the approximate partial product computation into two parts: accurate part and inaccurate part. We now introduce a new concept named carry-in prediction logic, which is used in approximate partial product computation only ($A_H X_L$, $A_L X_H$ and $A_L X_L$) and not for the accurate $A_H X_H$.

C. Carry-in Prediction Logic

In the proposed architecture, our aim is to make the most significant $2b$ bits (out of a total of $4b$ bits) accurate to a large extent, and we achieve this as illustrated in Figure 3. In this illustration, the most significant $b$ bits of the approximate partial products ($A_H X_L$, $A_L X_H$, and $A_L X_L$) must be accurate to a certain extent so that the sum of the upper $b$ bits of $A_H X_L$, the upper $b$ bits of $A_L X_H$ and the $2b$ bits of $A_H X_H$ achieves a higher degree of accuracy.

Fig. 4. Carry-in Prediction for $b = 8$, i.e. $16 \times 16$ multiplication

In order to make the approximate partial product multipliers fast, it is necessary to break the carry chain, which reduces the horizontal critical path. Therefore, we propose a novel Carry-in Prediction Logic. This logic reduces the horizontal critical path by half and makes the upper $b$ bits of approximate partial product multipliers accurate to some extent. We divide the products $A_H X_L$, $A_L X_H$ and $A_L X_L$ of $2b$ bits each into two parts: an accurate-to-a-large-extent part and an inaccurate part. Both these parts are of $b$ bits each, as shown in Figure 4 for $b = 8$. The circles in the figure represent the tree of partial products obtained from multiplication of the $b$-bit multiplicand and the $b$-bit multiplier. Please note that here the multiplicand is $A_H$ or $A_L$ and the multiplier is $X_H$ or $X_L$, as required for the $A_H X_L$, $A_L X_H$ and $A_L X_L$ evaluation. As evident from the figure, the inaccurate part further consists of a completely inaccurate part and a completely accurate part of $b/2$ bits each. In the rest of the paper, we denote the accurate-to-a-large-extent part as the Accurate part and the remaining lower $b$ bits of the approximate partial product as the Inaccurate part.

Figure 4 further shows a critical column, which is the column containing the maximum number of partial products. Carry-in Prediction logic exploits the fact that if there are two or more 1s in the critical column, then a carry of at least 1 is definitely propagated to the next column. When $b$ is large (greater than 5, i.e. for operands of size 10 bits or so) we make the carry-in propagated to the accurate part 1 if one or more of the circles in the critical column are 1, since there is a very low probability of a 0 carry being propagated. In fact, for such a magnitude of $b$, there is a good probability that the carry propagated to the accurate part from the most significant bit of the inaccurate part will be more than 1. In order to minimize this error we make the upper $b/2$ bits of the inaccurate part as a
series of 1s. The error is minimized because the difference between the actual and approximate partial products will be analogous to that between 128 and 127, i.e. in binary, 128 is represented as 10000000 and 127 is given by 01111111 (128 just passes an extra carry). Now that we have a carry-in for the accurate part beforehand, we can start the multiplication procedure simultaneously from both sides, thus reducing the horizontal critical path by half. This completes the so-called Carry-in Prediction Logic and its impact on reducing the critical path in the accuracy configurable multiplier design, as described next.

Fig. 5. Reducing the critical path of first stage of pipelined approximate multiplier

III. ACCURACY-CONFIGURABLE MULTIPLIER

In this section, we present the design of an accuracy configurable 16 × 16 approximate multiplier. The first stage of the pipelined multiplier uses a total of 4 multipliers: 1 accurate (8 × 8) and 3 approximate (8 × 8). Because the inaccurate multiplier is inherently faster than a corresponding accurate multiplier, using an accurate 8 × 8 multiplier in the same pipeline stage as the approximate ones would give no improvement in critical path. Therefore, to reduce the critical path of stage 1, we further recursively divide the 8 × 8 accurate product into four 4 × 4 accurate partial products as shown in Figure 5. We then add them up together in the second stage. In other words, stage 1 of the pipelined approximate multiplier effectively consists of 7 multipliers, namely $A_{HH}X_{HH}$, $A_{HH}X_{HL}$, $A_{HL}X_{HH}$ and $A_{HL}X_{HL}$, which are 4 × 4 and accurate, and $A_H X_L$, $A_L X_H$ and $A_L X_L$, which are 8 × 8 and inaccurate. Although this decreases the latency of stage 1 significantly when compared to an accurate 16 × 16 pipelined multiplier, it appears that this methodology increases the latency of stage 2, as we need to compute more additions. It can be observed that the latency of stage 2 remains unaffected if we perform the addition of all the partial products in parallel, i.e. the addition of $A_{HH}X_{HL}$ and $A_{HL}X_{HH}$ ($b$ bits) and that of the resulting sum with $A_{HH}X_{HH}$, $A_{HL}X_{HL}$ ($b + 1$ bits) in parallel with the addition of $A_H X_L$ and $A_L X_H$ ($2b$ bits). Since the size of operands of the former two additions is just $b$ and $b + 1$ respectively, they take almost the same time to complete as the latter addition of $2b$ bits. Further, from Figure 5, it is observed that the carry-out emerging from the latter $2b$-bit addition can be propagated using half adders, and a 1-bit full adder can be employed at the position where a carry-out emerges from the former $b + 1$ addition to add the 3 bits together at that position. The carry-out thus produced will be further propagated using half adders till the MSB. Hence, we significantly reduce the latency of stage 1 and get almost the same latency for stage 2 using the methodology discussed above, thus reducing the overall critical path of the said multiplier.

Now that we have 7 multipliers in stage 1, we can vary the accuracy level of the proposed multiplier by varying the number of multipliers that are accurate. In any case, we keep $A_H X_H$ always accurate, so that the accuracy level does not fall below a certain level. Therefore, we obtain an Accuracy Configurable Multiplier whose accuracy can be adjusted according to the error tolerance of the application. The number of inaccurate multipliers used will directly determine the amount of power saved by the multiplier. Also, because the algorithm has been designed in general for $2b \times 2b$ multiplications, it is also configurable according to the bit-width of operands, i.e. the size of the inaccurate part in the approximate partial products will always be equal to $b$ bits. For instance, if the bit-width of operands is scaled to 12 bits, the value of $b$ becomes 6, and there will be three approximate partial products of 6 × 6 and four accurate partial products of 3 × 3 each. Hence, the proposed algorithm is reconfigurable with operand bit width as well as accuracy configurable.

Next, we present experimental results by considering a simple and suitable tradeoff between accuracy and power: 3 partial products as inaccurate (8 × 8) and 1 partial product as accurate (further recursively divided into 4 accurate partial products that are 4 × 4) to reduce latency. Theoretical analysis of power, area and latency is also described in the later part of the next section.

IV. EXPERIMENTAL RESULTS AND ANALYSIS

We simulate the proposed ACMA by writing a C program and generating 5000 random numbers to compute accurate and approximate products for all possible combinations without repetition. We use the following design metrics [8], [12] for the analysis of the proposed multiplier:
1) Overall Error (OE): It is the absolute error between approximate and accurate products. It is given by $OE = |R_c - R_e|$, where $R_c$ is the correct result and $R_e$ is the result obtained from the approximate arithmetic circuit. All the numbers here are in decimal.
2) Relative Error: Relative Error is simply $(OE / R_c) \times 100\%$. It gives the percentage of error involved in the result of an algorithm.
3) Accuracy (ACC): It is given by $(1 - OE / R_c) \times 100\%$. It measures the degree of correctness of the output of an approximate algorithm.
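The three metrics above can be computed directly from an exact and an approximate product. The following is a minimal Python sketch, not the authors' C program; the function and parameter names are assumptions chosen for illustration, and the 128 versus 127 values simply reuse the example from the carry-in discussion.

```python
def overall_error(correct: int, approx: int) -> int:
    """Absolute difference between the exact and approximate results."""
    return abs(correct - approx)


def relative_error(correct: int, approx: int) -> float:
    """Overall error as a percentage of the exact result."""
    if correct == 0:
        raise ValueError("relative error is undefined for a zero exact result")
    return overall_error(correct, approx) / correct * 100.0


def accuracy(correct: int, approx: int) -> float:
    """Complement of the relative error, in percent."""
    return 100.0 - relative_error(correct, approx)


if __name__ == "__main__":
    # 128 versus 127: the two values differ by a single carry.
    print(overall_error(128, 127))
    print(relative_error(128, 127))
    print(accuracy(128, 127))
```

The zero check is a choice of this sketch: the metric definitions do not say how a zero exact product should be treated.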
TABLE I
SIMULATION RESULTS USING C-PROGRAM

Run No.   Operand Range   Mean Error   Mean Accuracy   Acceptance Probability
1         > 1000          0.034%       99.966%         99.72%
2         > 1             0.10%        99.90%          98.44%
3         > 1             0.11%        99.89%          98.37%
4         > 1             0.13%        99.87%          97.93%
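A simulation of the kind summarised in Table I might be structured as in the Python sketch below. It is one possible reading of the carry-in prediction logic, not the authors' C program: the names `approx_partial_product`, `approx_multiply` and `mean_accuracy` are hypothetical, the predicted carry is taken as 1 whenever any bit of the critical column is 1, and the choice to fill the upper b/2 bits of the inaccurate part with 0s when no carry is predicted is an assumption that the text does not confirm. The sketch is not expected to reproduce the figures in Table I.

```python
import random


def approx_partial_product(a: int, x: int, b: int) -> int:
    """Approximate b x b product (2b bits), one reading of carry-in prediction."""
    half = b // 2
    exact_low = (a * x) & ((1 << half) - 1)  # lower b/2 bits are kept exact

    # Critical column: partial-product bits a_i AND x_j with i + j == b - 1.
    carry_in = 0
    for i in range(b):
        if (a >> i) & (x >> (b - 1 - i)) & 1:
            carry_in = 1
            break

    # Accurate part: sum of partial-product columns b .. 2b-2 plus predicted carry.
    upper = carry_in
    for i in range(b):
        for j in range(b):
            if i + j >= b and (a >> i) & (x >> j) & 1:
                upper += 1 << (i + j - b)

    # ASSUMPTION: remaining bits of the inaccurate part are all 1s when a
    # carry is predicted and all 0s otherwise.
    mid = ((1 << (b - half)) - 1) if carry_in else 0
    return (upper << b) | (mid << half) | exact_low


def approx_multiply(a: int, x: int, b: int = 8) -> int:
    """2b x 2b product with an exact HH term and three approximate terms."""
    mask = (1 << b) - 1
    a_h, a_l = a >> b, a & mask
    x_h, x_l = x >> b, x & mask
    hh = a_h * x_h  # kept accurate
    hl = approx_partial_product(a_h, x_l, b)
    lh = approx_partial_product(a_l, x_h, b)
    ll = approx_partial_product(a_l, x_l, b)
    return (hh << (2 * b)) + ((hl + lh) << b) + ll


def mean_accuracy(samples: int, b: int = 8, lo: int = 1, seed: int = 0) -> float:
    """Mean accuracy (percent) over random operand pairs in [lo, 2^(2b) - 1]."""
    rng = random.Random(seed)
    top = (1 << (2 * b)) - 1
    total = 0.0
    for _ in range(samples):
        a, x = rng.randint(lo, top), rng.randint(lo, top)
        exact = a * x
        total += 100.0 * (1 - abs(exact - approx_multiply(a, x, b)) / exact)
    return total / samples


if __name__ == "__main__":
    print(mean_accuracy(5000))
    print(mean_accuracy(5000, lo=1001))
```

Raising `lo` mimics the operand-range restriction used for the first run in Table I; under the assumptions above, small operands are where a mispredicted carry costs the most in relative terms.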
Fig. 8. Normalized Clock Period vs. Half Operand Bit-width

Fig. 9. Percentage Clock Period Reduction vs. Half Operand Bit-width

TABLE II
RESULTS OF RTL COMPILER

               Power     Area
Approximate    0.295     2298.24
Accurate       0.438     3004.20
% Reduction    32.73%    23.5%

on the reduction of number of full adders. For instance, in the case of a 16 × 16 multiplier, for a single approximate partial product (8 × 8), the reduction in the number of adders is about 40%. But a 16 × 16 pipelined multiplier requires 4 such multiplications, so the net reduction in the number of adders is about 30% in stage 1. Therefore, this provides at least a 30% reduction in power. We verified this hypothesis using Cadence RTL Compiler and the results obtained are tabulated in Table II. We used a 45 nm standard cell library called Nangate Opencell Library for RTL synthesis. Further, it can be observed from Table II that the power and area results are in agreement with the theoretical results.

V. CONCLUSION AND FUTURE SCOPE OF THE WORK

In this paper, we proposed an Accuracy-Configurable Multiplier Architecture (ACMA) for error-resilient System-on-Chip designs. The ACMA design is based on a new algorithm for approximate multiplication where an efficient precomputation

size of operands are 10-bit or more (numbers > 1000), the mean accuracy is found to be 99.965%. These results demonstrate the effectiveness of the proposed accuracy-configurable approximate multiplier design w.r.t. the SPAA design metrics.

As far as the future scope of this research is concerned, the proposed ACMA design can be employed in Error-Resilient System Architectures for well-known Recognition, Mining and Synthesis (RMS) applications and also in error-tolerant processor designs where speed, power, and area are major design concerns and not accuracy.

REFERENCES

[1] International Technology Roadmap for Semiconductors, http://www.itrs.net.
[2] M. Sheng, H. Libo, L. Mingce, and W. Zhiying, "A comparative study of subword parallel adders for multimedia applications," in ASIC, 2009. ASICON '09. IEEE 8th International Conference on, Oct. 2009, pp. 179-182.
[3] E. J. Swartzlander, "Truncated multiplication with approximate rounding," in Signals, Systems, and Computers, 1999. Conference Record of the Thirty-Third Asilomar Conference on, vol. 2, Oct. 1999, pp. 1480-1483.
[4] L. N. Chakrapani, K. K. Muntimadugu, L. Avinash, J. George, and K. V. Palem, "Highly energy and performance efficient embedded computing through approximately correct arithmetic: a mathematical foundation and preliminary experimental validation," CASES, pp. 187-196, 2008.
[5] S. L. Lu, "Speeding up processing with approximation circuits," Computer, vol. 37, no. 3, pp. 67-73, Mar. 2004.
[6] D. Shin and S. Gupta, "A re-design technique for datapath modules in error tolerant applications," in Asian Test Symposium, 2008. ATS '08. 17th, Nov. 2008, pp. 431-437.
[7] D. Shin and S. Gupta, "Approximate logic synthesis for error tolerant applications," in Design, Automation Test in Europe Conference Exhibition (DATE), 2010, Mar. 2010, pp. 957-960.
[8] N. Zhu, W. L. Goh, W. Zhang, K. S. Yeo, and Z. H. Kong, "Design of low-power high-speed truncation-error-tolerant adder and its application in digital signal processing," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 18, no. 8, pp. 1225-1229, Aug. 2010.
[9] N. Zhu, W. L. Goh, G. Wang, and K. S. Yeo, "Enhanced low-power high-speed adder for error-tolerant application," in SoC Design Conference (ISOCC), 2010 International, Nov. 2010, pp. 323-327.
[10] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, "Low-power digital signal processing using approximate adders," Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 32, no. 1, pp. 124-137, Jan. 2013.
[11] A. Verma, P. Brisk, and P. Ienne, "Variable latency speculative addition: A new paradigm for arithmetic circuit design," in Design, Automation and Test in Europe, 2008. DATE '08, Mar. 2008, pp. 1250-1255.
[12] A. Kahng and S. Kang, "Accuracy-configurable adder for approximate arithmetic designs," in Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, June 2012, pp. 820-825.
[13] M. B. Sullivan and E. E. Swartzlander, "Truncated error correction for flexible approximate multiplication," in Signals, Systems and Computers (ASILOMAR), 2012 Conference Record of the Forty Sixth Asilomar Conference on, 2012, pp. 355-359.
[14] P. Kulkarni, P. Gupta, and M. Ercegovac, "Trading accuracy for power with an underdesigned multiplier architecture," in VLSI Design (VLSI Design), 2011 24th International Conference on, 2011, pp. 346-351.
[15] K. Y. Kyaw, W.-L. Goh, and K.-S. Yeo, "Low-power high-speed multiplier for error-tolerant application," in Electron Devices and Solid-State Circuits (EDSSC), 2010 IEEE International Conference of, 2010, pp. 1-4.