
III

VLSI SYSTEMS

Magdy Bayoumi
The Center for Advanced Computer Studies, University of Louisiana at Lafayette, Lafayette, Louisiana, USA

This Section covers the broad spectrum of VLSI arithmetic, custom memory organization and data transfer, the role of hardware description languages, clock scheduling, low-power design, micro-electro-mechanical systems, and noise analysis and design. It has been written and developed for practicing electrical engineers in industry, government, and academia. The goal is to provide the most up-to-date information in the field.

Over the years, the fundamentals of the field have evolved to include a wide range of topics and a broad range of practice. To encompass such a wide range of knowledge, the section focuses on the key concepts, models, and equations that enable the design engineer to analyze, design, and predict the behavior of large-scale systems. While design formulas and tables are listed, emphasis is placed on the key concepts and the theories underlying the processes. In order to do so, the material is reinforced with frequent examples and illustrations.

The compilation of this section would not have been possible without the dedication and efforts of the section editor and contributing authors. I wish to thank them all.

Wai-Kai Chen
Editor
1
Logarithmic and Residue Number Systems for VLSI Arithmetic

Thanos Stouraitis
Department of Electrical and Computer Engineering, University of Patras, Greece

1.1 Introduction ....................................................................... 179
1.2 LNS Basics ......................................................................... 179
    1.2.1 LNS and Linear Representations • 1.2.2 LNS Operations • 1.2.3 LNS and Power Dissipation
1.3 The Residue Number System .......................................................... 185
    1.3.1 RNS Basics • 1.3.2 RNS Architectures • 1.3.3 Error Tolerance in RNS Systems • 1.3.4 RNS and Power Dissipation
References ............................................................................. 190

1.1 Introduction

Very large-scale integrated circuit (VLSI) arithmetic units are essential for the operation of the data paths and/or the addressing units of microprocessors, digital signal processors (DSPs), as well as data-processing application-specific integrated circuits (ASICs) and programmable integrated circuits. Their optimized realization, in terms of power or energy consumption, area, and/or speed, is important for meeting the demanding operational specifications of such devices.

In modern VLSI design flows, the design of standard arithmetic units is available from design libraries. These units employ binary encoding of numbers, such as one's or two's complement, or sign-magnitude encoding, to perform additions and multiplications. If nonstandard operations are required, or if high-performance components are needed, then the design of special arithmetic units is necessary. In this case, the choice of arithmetic system is of utmost importance.

The impact of arithmetic in a digital system is not limited to the definition of the architecture of arithmetic circuits. Arithmetic affects several levels of the design abstraction because it may reduce the number of operations, the signal activity, and the strength of the operators. The choice of arithmetic may lead to substantial power savings, reduced area, and enhanced speed.

This chapter describes two arithmetic systems that employ nonstandard encoding of numbers. The logarithmic number system (LNS) and the residue number system (RNS) are singled out because they have been shown to offer important advantages in the efficiency of their operation and may at the same time be more power- or energy-efficient, faster, and/or smaller than other systems.

Although a detailed comparison of the performance of these systems to their counterparts is not offered here, one must keep in mind that such comparisons are only meaningful when the systems in question cover the same dynamic range and present the same precision of operations. This necessity usually translates into certain data word lengths, which, in their turn, affect the operating characteristics of the systems.

1.2 LNS Basics

Traditionally, the LNS has been considered as an alternative to floating-point representation (Koren, 1993; Stouraitis, 1986). The organization of an LNS word is shown in Figure 1.1. The LNS maps a linear real number X to a triplet as follows:
Copyright © 2005 by Academic Press. All rights of reproduction in any form reserved.
FIGURE 1.1 The Organization of an (n + 1)-bit LNS Digital Word: a sign flag s_X followed by the encoding of x = log_b |X|.

X --LNS--> (z_X, s_X, x = log_b |X|),                                  (1.1)

where s_X is the sign of X, b is the base of the logarithmic representation, and z_X is a single-bit flag which, when asserted, denotes that X is zero. A zero flag is required because log_b X is not a finite number for X = 0. Similarly, since the logarithm of a negative number is not a real number, the sign information of X is stored in the flag s_X. The logarithm x = log_b |X| is encoded as a binary number, and it may comprise a number of k integer and l fractional bits.

The inverse mapping of a logarithmic triplet (z_X, s_X, x) to a linear number X is defined by:

(z_X, s_X, x) --LNS^-1--> X : X = (1 - z_X)(-1)^s_X b^x.               (1.2)

1.2.1 LNS and Linear Representations

Two important issues in a finite word length number system are the range of numbers that can be represented and the precision of the representation (Koren, 1993).

Let (k, l, b)-LNS denote an LNS of integer and fractional word lengths k and l, respectively, and of base b. These three parameters determine the properties of the LNS and can be computed so that the LNS meets certain specifications. For example, for a (k, l, b)-LNS to be considered as equivalent to an n-bit linear fixed-point system, the following two restrictions may be posed:

1. The two representations should exhibit equal average representational error.
2. The two representations should cover equivalent data ranges.

The average representational error, ε_ave, is defined as:

ε_ave = ( Σ_{X=X_min..X_max} ε_rel(X) ) / (X_max - X_min + 1),         (1.3)

where X_min and X_max define the range of representable numbers in each system and where ε_rel(X) is the relative representational error of a number X encoded in a number system. This error is, in general, a function of the value of X, and it is defined as:

ε_rel(X) = |X - X̂| / X,                                               (1.4)

in which X is the actual value and X̂ is the corresponding value representable in the system. Notice that X ≠ X̂ due to the finite length of the words. Assuming that the logarithm of X is represented as a two's complement number, the relative representational error ε_rel,LNS for a (k, l, b)-LNS is independent of X and, therefore, is equal to the average representational error. It is given by [refer to Koren (1993) for the case b = 2]:

ε_ave,LNS = ε_rel,LNS = b^(2^-l) - 1.                                  (1.5)

Due to formula 1.3, the average representational error for the n-bit linear fixed-point case is given by:

ε_ave,FXP = (1 / (2^n - 1)) Σ_{i=1..2^n-1} (1/i),                      (1.6)

which, by computing the sum on the right-hand side, can be written as:

ε_ave,FXP = (ψ(2^n) + γ) / (2^n - 1),                                  (1.7)

where γ is the Euler gamma constant and the function ψ is defined through:

ψ(x) = (d/dx) ln Γ(x),                                                 (1.8)

where Γ(x) is the Euler gamma function.

In the following, the maximum number representable in each number system is computed and used to compare the ranges of the representations. Notice that different figures could also have been used for range comparison, such as the ratio X_max/X_min (Stouraitis, 1986). The maximum number representable by an n-bit linear integer is 2^n - 1; therefore, the upper bound of the fixed-point range is given by:

X_max^FXP = 2^n - 1.                                                   (1.9)

The maximum number representable by a (k, l, b)-LNS encoding 1.1 is as follows:

X_max^LNS = b^(2^k + 1 - 2^-l).                                        (1.10)

Therefore, according to the equivalence restrictions posed above, to make an LNS equivalent to an n-bit linear fixed-point representation, the following inequalities should be simultaneously satisfied:
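As a concrete sketch of mappings 1.1 and 1.2, the following Python fragment (function names and the sample value are illustrative, not from the text) truncates the logarithm to l fractional bits and checks the resulting relative error against the bound of equation 1.5:

```python
import math

def lns_forward(X, b=2.0, l=8):
    """Map a real X to an LNS triplet (z_X, s_X, x), eq. 1.1.
    The logarithm is truncated to l fractional bits."""
    if X == 0:
        return (1, 0, 0.0)            # zero flag asserted
    s = 1 if X < 0 else 0             # sign flag
    x = math.log(abs(X), b)
    x = math.floor(x * 2**l) / 2**l   # keep l fractional bits
    return (0, s, x)

def lns_inverse(triplet, b=2.0):
    """Recover the linear value, eq. 1.2: X = (1 - z)(-1)^s b^x."""
    z, s, x = triplet
    return (1 - z) * (-1)**s * b**x

X = 13.7
Xhat = lns_inverse(lns_forward(X, b=2.0, l=8), b=2.0)
rel_err = abs(X - Xhat) / X           # eq. 1.4
bound = 2.0**(2.0**-8) - 1            # eq. 1.5 for b = 2, l = 8
print(rel_err, bound, rel_err <= bound)
```

With truncation, the recovered value stays within the (k, l, b)-LNS error bound b^(2^-l) - 1; a rounding scheme would roughly halve the error.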
X_max^LNS ≥ X_max^FXP.                                                 (1.11)

ε_ave,LNS ≤ ε_ave,FXP.                                                 (1.12)

Hence, from equations 1.5 and 1.7 through 1.10, the following equations are obtained:

l = ⌈ -log_2 log_b (1 + (ψ(2^n) + γ)/(2^n - 1)) ⌉.                     (1.13)

k = ⌈ log_2 (log_b (2^n - 1) + 2^-l - 1) ⌉.                            (1.14)

Values of k and l that correspond to various values of n for various values of b can be seen in Table 1.1, where for each base there is a third column (explained next).

Although the word lengths k and l computed via equations 1.13 and 1.14 meet the posed equivalence specifications of equations 1.11 and 1.12, the LNS is capable of covering a significantly larger range than the equivalent fixed-point representation. Let n_eq denote the word length of a fixed-point system that can cover the range offered by an LNS defined through equations 1.13 and 1.14. Equivalently, let n_eq be the smallest integer that satisfies:

2^n_eq - 1 ≥ b^(2^k + 1 - 2^-l).                                       (1.15)

From equation 1.15, it follows that:

n_eq = ⌈ (2^k + 1 - 2^-l) log_2 b ⌉.                                   (1.16)

It should be stressed that when n_eq > n, the precision of the particular fixed-point system is better than that of the LNS derived by equations 1.13 and 1.14. Equation 1.16 reveals that the particular LNS, while meeting the precision of an n-bit linear representation, in fact covers the range provided by an n_eq-bit linear system.

Of course, the average (relative) error is not the only way to compare the accuracy of computing systems. Especially for signal processing systems, one may use the signal-to-noise ratio (SNR), assuming that quantization errors represent noise, to compare the precision of two systems. In that case, by equating the SNRs of the LNS and the fixed-point system that covers the required dynamic range, the integer and fractional word lengths of the LNS may be computed.

1.2.2 LNS Operations

The mapping of equation 1.1 is of practical interest because it can simplify certain arithmetic operations (i.e., it can reduce the implementation complexity, also called strength, of several operators). For example, due to the properties of the logarithm function, the multiplication of two linear numbers, X = b^x and Y = b^y, is reduced to the addition of their logarithmic images, x and y.

The basic arithmetic operations and their LNS counterparts are summarized in Table 1.2, where, for simplicity and without loss of generality, the zero flag z_X is omitted and it is assumed that X > Y. Table 1.2 reveals that, while the complexity of most operations is reduced, the complexity of LNS addition and LNS subtraction is significant. In particular, for d = |x - y|, LNS addition requires the computation of the nonlinear function:

s_a(d) = log_b (1 + b^-d),                                             (1.17)

and subtraction requires the computation of the nonlinear function:

s_s(d) = log_b (1 - b^-d).                                             (1.18)

Equations 1.17 and 1.18 substantially limit the data word lengths for which LNS can offer efficient VLSI implementations.

TABLE 1.1 Correspondence of n, k, l, and n_eq for Various Bases b

          b = 1.5            b = 2              b = 2.5
 n     k   l   n_eq       k   l   n_eq       k   l   n_eq
 5     3   2     6        3   3     9        2   3     7
 6     4   3    10        3   4     9        2   4     7
 7     4   4    10        3   5     9        3   5    12
 8     4   5    10        3   5     9        3   6    12
 9     4   5    10        4   6    17        3   7    12
10     5   6    20        4   7    17        3   7    12
11     5   7    20        4   8    17        3   8    12
12     5   8    20        4   9    17        4   9    23
13     5   9    20        4  10    17        4  10    23
14     5  10    20        4  11    17        4  11    23
15     5  11    20        4  12    17        4  12    23
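Equations 1.13, 1.14, and 1.16 can be evaluated directly. The Python sketch below uses the harmonic-sum form of equation 1.6 in place of the ψ function; under this reading of the formulas it reproduces the Table 1.1 entries noted in the comments:

```python
import math

def lns_word_lengths(n, b):
    """Return (k, l, n_eq) for an LNS equivalent to an n-bit
    fixed-point system, per equations 1.13, 1.14, and 1.16."""
    # eq. 1.6: average relative error of the n-bit fixed-point system
    eps_fxp = sum(1.0 / i for i in range(1, 2**n)) / (2**n - 1)
    # eq. 1.13: fractional bits needed to match that error
    l = math.ceil(-math.log2(math.log(1 + eps_fxp, b)))
    # eq. 1.14: integer bits needed to cover the range
    k = math.ceil(math.log2(math.log(2**n - 1, b) + 2.0**-l - 1))
    # eq. 1.16: fixed-point word length matching the LNS range
    n_eq = math.ceil((2**k + 1 - 2.0**-l) * math.log2(b))
    return k, l, n_eq

print(lns_word_lengths(5, 2))      # Table 1.1 lists k=3, l=3, n_eq=9
print(lns_word_lengths(9, 2))      # Table 1.1 lists k=4, l=6, n_eq=17
print(lns_word_lengths(5, 2.5))    # Table 1.1 lists k=2, l=3, n_eq=7
```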
TABLE 1.2 Basic Linear Arithmetic Operations and Their LNS Counterparts

Operation   Linear operation                                  Logarithmic operation
Multiply    W = X·Y = b^x · b^y = b^(x+y)                     w = x + y, s_w = s_x XOR s_y
Divide      W = X/Y = b^x / b^y = b^(x-y)                     w = x - y, s_w = s_x XOR s_y
Root        W = X^(1/m) = (b^x)^(1/m)                         w = x/m, m integer, s_w = s_x
Power       W = X^m = (b^x)^m                                 w = m·x, m integer, s_w = s_x
Add         W = X + Y = b^x + b^y = b^x (1 + b^(y-x))         w = x + log_b(1 + b^(y-x)), s_w = s_x
Subtract    W = X - Y = b^x - b^y = b^x (1 - b^(y-x))         w = x + log_b(1 - b^(y-x)), s_w = s_x

FIGURE 1.2 The Organization of a Basic LNS Processor: the processor comprises an adder, two multiplexers, a sign-inversion unit, a look-up table for log_b(1 ± b^-|x-y|), and a final adder. It may perform the four operations of addition, subtraction, multiplication, or division.

The organization of an LNS processor that can perform the four basic operations of addition, subtraction, multiplication, or division is shown in Figure 1.2. Note that to implement LNS subtraction (i.e., the addition of two quantities of opposite sign), a different memory look-up table (LUT) is required.

The main complexity of an LNS processor is the implementation of the LUTs for storing the values of the functions s_a(d) and s_s(d). A straightforward implementation is only feasible for small word lengths. A different technique can be used for larger word lengths, based on the partitioning of an LUT into an assortment of smaller LUTs. The particular partitioning becomes possible due to the nonlinear behavior of the addition and subtraction functions, log_b (1 + b^-d) and log_b (1 - b^-d), respectively, which are depicted in Figure 1.3 for b = 2. By exploiting the different minimal word lengths required by groups of function samples, the overall size of the LUT is compressed, leading to the LUT organization of Figure 1.4. In addition to the above techniques, reduction of the size of memory can be achieved by proper selection of the base of the logarithms. It turns out that the same bases that yield minimum power consumption for the LNS arithmetic unit by reducing the bit activity, as mentioned in the next section, also result in minimum LUT sizes.

To use the benefits of LNS, a conversion overhead is required in most cases to perform the forward LNS mapping defined by equation 1.1. It is noted that the conversions of equations 1.1 and 1.2 are required if an LNS processor receives input or transmits output linear data in digital format. Since all arithmetic operations can be performed in the logarithmic domain, only an initial conversion is imposed; therefore, as the amount of processing implemented in LNS grows, the contribution of the conversion overhead to power dissipation and to area-time complexity becomes negligible because it remains constant.

In stand-alone DSP systems, the adoption of a different solution to the conversion problem is possible. In particular, the LNS forward and inverse mapping overhead can be mitigated by converting the analog data directly into digital logarithms.

LNS Arithmetic Example
Let X = 2.75, Y = 5.65, and b = 2. Perform the operations X·Y, X + Y, √X, and X² using the LNS.

Initially, the data are transferred to the logarithmic domain as implied by equation 1.1:
FIGURE 1.3 The Functions s_a(d) and s_s(d): Approximations required for LNS addition and subtraction.

FIGURE 1.4 The Partitioning of the LUT: The partitioning stores the addition and subtraction functions into a set of smaller LUTs, which leads to memory compression.

X --LNS--> (z_X, s_X, x = log_2 |X|) = (0, 0, x = log_2 2.75) = (0, 0, 1.4594).   (1.19)

Y --LNS--> (z_Y, s_Y, y = log_2 |Y|) = (0, 0, y = log_2 5.65) = (0, 0, 2.4983).   (1.20)

Using the LNS images from equations 1.19 and 1.20, the required arithmetic operations are performed as follows. The logarithmic image w of the product W = X·Y is given by:

w = x + y = 1.4594 + 2.4983 = 3.9577.                                  (1.21)

As both operands are of the same sign (i.e., s_x = s_y = 0), the sign of the product is s_w = 0. In addition, because z_x ≠ 1 and z_y ≠ 1, the result is non-zero (i.e., z_w = 0).

To retrieve the actual result W from equation 1.21, the inverse conversion of 1.2 is used as follows:

W = (1 - z_w)(-1)^s_w 2^w = 2^3.9577 = 15.5377.                        (1.22)

By directly multiplying X by Y, it is found that W = 15.5375. The difference of 0.0002 is due to round-off error during the conversion from the linear to the LNS domain.

The calculation of the logarithmic image w of W = √X is performed as follows:

w = (1/2) x = (1/2) 1.4594 = 0.7297.                                   (1.23)

The actual result is retrieved as follows:

W = 2^0.7297 = 1.6583.                                                 (1.24)

The calculation of the logarithmic image w of W = X² can be done as:

w = 2 · 1.4594 = 2.9188.                                               (1.25)

Again, the actual result is obtained as:

W = 2^2.9188 = 7.5622.                                                 (1.26)

The operation of logarithmic addition is rather awkward, and its realization is usually based on a memory LUT operation. The logarithmic image w of the sum W = X + Y is as follows:

w = max(x, y) + log_2 (1 + 2^(min(x,y) - max(x,y)))                    (1.27)
  = 2.4983 + log_2 (1 + 2^-1.0389)                                     (1.28)
  = 3.0704.                                                            (1.29)
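The example can be replayed in a few lines of Python; this sketch uses exact floating-point logarithms rather than a finite (k, l, b) word, so the small round-off differences noted in the text do not appear:

```python
import math

x = math.log2(2.75)     # ≈ 1.4594, eq. 1.19
y = math.log2(5.65)     # ≈ 2.4983, eq. 1.20

w_mul  = x + y          # product image, eq. 1.21
w_sqrt = x / 2          # square-root image, eq. 1.23
w_sq   = 2 * x          # square image, eq. 1.25
w_add  = max(x, y) + math.log2(1 + 2**(min(x, y) - max(x, y)))  # eq. 1.27

# inverse mapping, eq. 1.2, compared against direct linear arithmetic
for w, direct in [(w_mul, 2.75 * 5.65), (w_sqrt, 2.75**0.5),
                  (w_sq, 2.75**2), (w_add, 2.75 + 5.65)]:
    print(round(2**w, 4), round(direct, 4))
```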
The actual value of the sum W = X + Y is obtained as:

W = 2^3.0704 = 8.4001.                                                 (1.30)

1.2.3 LNS and Power Dissipation

Power dissipation minimization is sought at all levels of design abstraction, ranging from software and hardware partitioning down to technology-related issues. The average power dissipation in a circuit is computed via the relationship:

P_ave = a · f_clk · C_L · V_dd²,                                       (1.31)

where f_clk is the clock frequency, C_L is the total switching capacitance, V_dd is the supply voltage, and a is the average activity in a clock period.

The LNS is applicable for low-power design because it reduces the complexity of certain arithmetic operators and the bit activity.

Power Dissipation and LNS Architecture

LNS exploits properties of the logarithm function to reduce the strength of several arithmetic operations; thus, it leads to complexity savings. By reducing the area complexity of operations, the switching capacitance C_L of equation 1.31 can be reduced. Furthermore, reduction in latency allows for further reduction in supply voltage, which also reduces power dissipation (Chandrakasan and Brodersen, 1995). A study of the impact of the choice of the number system on the QRD-RLS algorithm revealed that LNS offers accuracy comparable to that of floating-point operations but at only a fraction of the switched capacitance per iteration of the algorithm (Sacha and Irwin, 1998). The reduction of average switched capacitance of LNS systems stems from the simplification of basic arithmetic operations, shown in Table 1.2. It can be seen that n-bit multiplication and division are reduced to (k + l)-bit addition and subtraction, respectively, while the computation of roots and powers is reduced to division and multiplication by a constant, respectively. For the common cases of square root or square, the operation is reduced to a left or right shift, respectively. For example, assume that an n-bit carry-save array multiplier, which has a complexity of n² - n 1-bit full adders (FAs), is replaced by an n-bit adder, which, assuming k + l = n, has a complexity of n FAs for a ripple-carry implementation (Koren, 1993). Therefore, multiplication complexity is reduced by a factor r_CL, given as:

r_CL = (n² - n) / n = n - 1.                                           (1.32)

Equation 1.32 reveals that the reduction factor r_CL grows with the word length n.

Addition and subtraction, however, are complicated in LNS because they require an LUT operation for the evaluation of log_b(1 ± b^(y-x)), although different approaches have been proposed in the literature (Orginos et al., 1995; Paliouras and Stouraitis, 1996). An LUT operation requires a ROM of n × 2^n bits, a size that can inhibit the use of LNS for large values of n. In an attempt to solve this problem, efficient table reduction techniques have been proposed (Taylor et al., 1988). As a result of the above analysis, applications with a computational load dominated by operations of simple LNS implementation can be expected to gain a power dissipation reduction due to the LNS impact on architecture complexity.

Since multiplication-additions are important in DSP applications, the power requirements of an LNS and a linear fixed-point adder-multiplier have been compared. It has been reported that approximately a two-times reduction in power dissipation is possible for operations with word sizes of 8 to 14 bits (Paliouras and Stouraitis, 2001). Given a sufficient number of consecutive multiplication-additions, the LNS implementation becomes more efficient from the low-power dissipation viewpoint, even when a constant conversion overhead is taken into consideration.

Power Dissipation and LNS Encoding

The encoding of data through logarithms of various bases implies variations in the bit activity (i.e., the a factor of equation 1.31 and, therefore, the power dissipation) (Paliouras and Stouraitis, 1996, 2001).

Assuming a uniform distribution of linear n-bit input numbers, the distribution of bit assertions of the corresponding LNS words reveals that LNS can be exploited to reduce the average activity. Let p_0→1(i) be the bit assertion probabilities (i.e., the probability of the ith bit's transition from 0 to 1). Assuming that data are temporally independent, it holds that:

p_0→1(i) = p_0(i) p_1(i) = (1 - p_1(i)) p_1(i),                        (1.33)

where p_0(i) and p_1(i) are the probabilities of the ith bit being 0 or 1, respectively. Due to the assumption of uniform data distribution, it holds that:

p_0(i) = p_1(i) = 1/2,                                                 (1.34)

which, due to equation 1.33, gives:

p_0→1(i) = 1/4.                                                        (1.35)

Therefore, all bits in the linear fixed-point representation exhibit an equal p_0→1(i), i = 0, 1, ..., n - 1.

The activities of the bits in an LNS-encoded word are quantified under similar assumptions. Since there is a one-to-one correspondence of linear fixed-point values to their LNS images defined by equation 1.1, the LNS values follow a probability function identical to that of the fixed-point case.
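One minimal reading of this activity analysis, sketched in Python: draw X uniformly over an n-bit range, truncate log_b X to l fractional bits, and apply equation 1.33 per bit (the word-length choice k = 3, l = 5 follows Table 1.1 for n = 8, b = 2):

```python
import math

def bit_activities(n=8, b=2.0, k=3, l=5):
    """p_{0->1}(i) for each bit of the (k, l, b)-LNS images of all
    X in [1, 2^n - 1], using eq. 1.33: p01 = (1 - p1) * p1."""
    total_bits = k + l
    ones = [0] * total_bits
    count = 2**n - 1
    for X in range(1, 2**n):
        q = math.floor(math.log(X, b) * 2**l)  # x truncated to l fractional bits
        for i in range(total_bits):            # i = 0 is the LSB
            ones[i] += (q >> i) & 1
    p1 = [c / count for c in ones]
    return [(1 - p) * p for p in p1]           # eq. 1.33

act = bit_activities()
print([round(a, 3) for a in act])
# Fixed-point activity is 1/4 per bit (eq. 1.35); eq. 1.36 with n = 8:
savings = (1 - sum(act) / (8 / 4)) * 100
print(round(savings, 1), "%")
```

The most significant bit toggles far less often than the least significant one, and the resulting average-activity savings come out positive, in line with the trend of Figure 1.6(A).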
In fact, the LNS mapping can be considered as a continuous transformation of the discrete random variable X, which is a word in the linear representation, to the discrete random variable x, an LNS word. Hence, the two discrete random variables follow the same probability function (Peebles, 1987).

The probabilities p_0→1^LNS of bit assertions in LNS words, however, are not constant like the p_0→1(i) of equation 1.35; they depend on the significance of the ith bit. To evaluate the probabilities p_0→1^LNS(i), the following experiment is performed. For all possible values of X in an n-bit system, the corresponding ⌊log_b X⌋ values in a (k, l, b)-LNS format are derived, and the probabilities p_1(i) for each bit are computed. Then, p_0→1^LNS(i) is computed as in equation 1.33. The actual assertion probabilities for the bits in an LNS word, p_0→1^LNS(i), are depicted in Figure 1.5. It can be seen that p_0→1(i) for the more significant bits is substantially lower than p_0→1(i) for the less significant bits. Moreover, it can be seen that p_0→1(i) depends on b. This behavior, which is due to the inherent data compression property of the logarithm function, leads to a reduction of the average activity in the entire word. The average activity savings percentage, S_ave, is computed as:

S_ave = ( 1 - ( Σ_{i=0..k+l-1} p_0→1^LNS(i) ) / ( Σ_{i=0..n-1} p_0→1^FXP(i) ) ) · 100%,     (1.36)

where p_0→1^FXP(i) = 1/4 for i = 0, 1, ..., n - 1; the word lengths k and l are computed via equations 1.13 and 1.14, and n denotes the length of the fixed-point system. The savings percentage S_ave is demonstrated in Figure 1.6(A) for various values of n and b, and the percentage is found to be more than 15% in certain cases.

As implied by the definition of n_eq in equation 1.16, however, the linear system that provides an equivalent range to that of a (k, l, b)-LNS requires n_eq bits. If the reduced precision of a (k, l, b)-LNS, compared to an n_eq-bit fixed-point system, is acceptable for a particular application, S'_ave is used to describe the relative efficiency of LNS, instead of equation 1.36, where:

S'_ave = ( 1 - ( Σ_{i=0..k+l-1} p_0→1^LNS(i) ) / ( Σ_{i=0..n_eq-1} p_0→1^FXP(i) ) ) · 100%.  (1.37)

The savings percentage S'_ave is demonstrated in Figure 1.6(B) for various values of n and b. Savings are found to exceed 50% in some cases. Notice that Figure 1.6 reveals that, for a particular word length n, the proper selection of the logarithm base b can significantly affect the average activity. Therefore, the choice of b is important in designing a low-power LNS-based system.

Finally, it should be noted that overhead is imposed for linear-to-logarithmic and logarithmic-to-linear conversion. Conversion overhead contributes additional area and time complexity as well as power dissipation. As the number of operations grows, however, the conversion overhead remains constant; therefore, the overhead's contribution to the overall budget becomes negligible.

FIGURE 1.5 Activities Against Bit Significance i (in an LNS Word for n = 8 and n = 12) and Various Values of the Base b. The horizontal dashed line is the activity of the corresponding n-bit fixed-point system.

1.3 The Residue Number System

A concept different from the nonlinear logarithmic transformation is followed by mapping data to appropriately selected finite fields. This may be achieved through the use of one of the many available versions of the residue number system (RNS) (Szabo and Tanaka, 1967). RNS arithmetic faces difficulties with sign detection, division, and magnitude comparison. These difficulties may outweigh the benefits it presents for addition, subtraction, and multiplication as far as general computing is concerned. Its use in specialized computations, like those for signal processing, offers many advantages. RNS has been used to offer superior fault tolerance capabilities as well as high-speed, small-area, and/or significant power-dissipation savings in the design of signal processing architectures for FIR filters (Freking and Parhi, 1997) and other circuits (Chren, 1998). RNS may even reduce the computational load in complex-number processing (Taylor et al., 1985), thus
FIGURE 1.6 Percentage of Average Activity Reduction from the Use of LNS. The percentage is compared to an n-bit and to an n_eq-bit linear fixed-point system for various bases b of the logarithm. The diagram reveals that the optimal selection of b depends on n, and it can lead to significant power dissipation reduction.

providing speed and power savings at the algorithmic level of the design abstraction.

1.3.1 RNS Basics

The RNS maps a natural number X in the range [0, M - 1], with M = Π_{i=1..N} m_i, to an N-tuple of residues x_i:

X --RNS--> {x_1, x_2, ..., x_N},                                       (1.38)

where x_i = ⟨X⟩_m_i, ⟨·⟩_m_i denotes the mod m_i operation, and m_i is a member of the set of co-prime integers B = {m_1, m_2, ..., m_N}, called moduli. Co-prime integers' greatest common divisor is gcd(m_i, m_j) = 1, i ≠ j. The set of RNS moduli is called the base of the RNS. The modulo operation ⟨x⟩_m returns the integer remainder of the integer division x div m (i.e., an integer k such that x = m·l + k, where l is an integer).

RNS is of interest because basic arithmetic operations can be performed in a digit-parallel, carry-free manner, such as in:

z_i = ⟨x_i ∘ y_i⟩_m_i,                                                 (1.39)

where i = 1, 2, ..., N and where the symbol ∘ stands for addition, subtraction, or multiplication. Every integer in the range 0 ≤ X < Π_{i=1..N} m_i has a unique RNS representation.

Inverse conversion may be accomplished by means of the Chinese remainder theorem (CRT) or the mixed-radix conversion (Soderstrand et al., 1986). The CRT retrieves an integer from its RNS representation as:
X = ⟨ Σ_{i=1..N} m̂_i ⟨m̂_i^-1 x_i⟩_m_i ⟩_M,                           (1.40)

where m̂_i = M/m_i and m̂_i^-1 is the multiplicative inverse of m̂_i modulo m_i (i.e., an integer such that ⟨m̂_i · m̂_i^-1⟩_m_i = 1).

Using an associated mixed-radix system, inverse conversion may also be performed by translating the residue representations to a mixed-radix representation. By choosing the RNS moduli to be the weights in the mixed-radix representation, the inverse mapping is facilitated by associating the mixed-radix system with the RNS. Specifically, an integer 0 ≤ X < M can be represented by N mixed-radix digits (x'_1, ..., x'_N) as:

X = x'_N (m_{N-1} m_{N-2} ··· m_1) + ··· + x'_3 (m_2 m_1) + x'_2 m_1 + x'_1,   (1.41)

where 0 ≤ x'_i < m_i, i = 1, ..., N, and the x'_i can be generated sequentially from the x_i using only residue arithmetic, such as in:

x'_1 = ⟨X⟩_m_1 = x_1,
x'_2 = ⟨m_1^-1 (X - x'_1)⟩_m_2,                                        (1.42)
x'_3 = ⟨m_2^-1 (m_1^-1 (X - x'_1) - x'_2)⟩_m_3,

and so on, or as in the following:

x'_1 = x_1,
x'_2 = ⟨(x_2 - x'_1) ⟨m_1^-1⟩_m_2⟩_m_2,
x'_3 = ⟨((x_3 - x'_1) ⟨m_1^-1⟩_m_3 - x'_2) ⟨m_2^-1⟩_m_3⟩_m_3,          (1.43)
...
x'_N = ⟨(···((x_N - x'_1) ⟨m_1^-1⟩_m_N - x'_2) ⟨m_2^-1⟩_m_N - ··· - x'_{N-1}) ⟨m_{N-1}^-1⟩_m_N⟩_m_N.

The digits x'_i can be generated sequentially through residue subtraction and multiplication by the fixed m_i^-1. The sequential nature of the calculation increases the latency of the conversion of residues to binary numbers.

The set of RNS moduli is often chosen so that the implementation of the various RNS operations (e.g., addition, multiplication, and scaling) becomes efficient. A common choice is the set of moduli {2^n - 1, 2^n, 2^n + 1}, which may also form a subset of the base of the RNS.

RNS Arithmetic Example
Consider the base B = {3, 5, 7} and two integers X = 10 and Y = 5. The RNS images of X and Y are as written here:

X --RNS--> {x_1, x_2, x_3} = {⟨10⟩_3, ⟨10⟩_5, ⟨10⟩_7} = {1, 0, 3}.     (1.44)

Y --RNS--> {y_1, y_2, y_3} = {⟨5⟩_3, ⟨5⟩_5, ⟨5⟩_7} = {2, 0, 5}.        (1.45)

The RNS image of the sum Z = X + Y is obtained as:

Z --RNS--> {z_1, z_2, z_3} = {⟨1 + 2⟩_3, ⟨0 + 0⟩_5, ⟨3 + 5⟩_7} = {0, 0, 1}.   (1.46)

To retrieve the integer that corresponds to the RNS representation {0, 0, 1} by applying the CRT of equation 1.40, the following quantities are precomputed: M = 3·5·7 = 105, m̂_1 = 105/3 = 35, m̂_2 = 105/5 = 21, m̂_3 = 105/7 = 15, m̂_1^-1 = 2, m̂_2^-1 = 1, and m̂_3^-1 = 1. The value of the sum in integer form is obtained by applying equation 1.40:

Z = X + Y = ⟨35·⟨2·0⟩_3 + 21·⟨1·0⟩_5 + 15·⟨1·1⟩_7⟩_105 = ⟨15⟩_105 = 15.   (1.47)

To verify the result of equation 1.46, notice that X + Y = 10 + 5 = 15 and that:

15 --RNS--> {⟨15⟩_3, ⟨15⟩_5, ⟨15⟩_7} = {0, 0, 1} = {z_1, z_2, z_3},    (1.48)

which is the result obtained in equation 1.46. The same integer may be retrieved by using an associated mixed-radix system defined by equation 1.41 as:

Z = z'_3 · 15 + z'_2 · 3 + z'_1,

with 0 ≤ z'_1 < 3, 0 ≤ z'_2 < 5, 0 ≤ z'_3 < 7, and the following:

z'_1 = z_1 = 0,
z'_2 = ⟨3^-1 (z_2 - z'_1)⟩_5 = ⟨2·z_2⟩_5 = 0,
z'_3 = ⟨5^-1 [3^-1 (z_3 - z'_1) - z'_2]⟩_7 = ⟨3·[5·(1 - 0) - 0]⟩_7 = 1,   (1.49)

so that Z = 1·15 + 0·3 + 0 = 15.
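The example can be replayed with the short Python sketch below, covering the forward mapping (equation 1.38), digit-parallel addition (equation 1.39), CRT reconstruction (equation 1.40), and mixed-radix conversion (equations 1.42-1.43); the function names are illustrative, and the modular inverses use Python 3.8's pow(a, -1, m):

```python
from math import prod

def to_rns(X, m):                        # eq. 1.38
    return [X % mi for mi in m]

def rns_add(x, y, m):                    # eq. 1.39, carry-free per channel
    return [(xi + yi) % mi for xi, yi, mi in zip(x, y, m)]

def crt(x, m):                           # eq. 1.40
    M = prod(m)
    return sum((M // mi) * ((pow(M // mi, -1, mi) * xi) % mi)
               for xi, mi in zip(x, m)) % M

def to_mixed_radix(x, m):                # eqs. 1.42-1.43, sequential digits
    x, d = list(x), []
    for i, mi in enumerate(m):
        d.append(x[i])
        for j in range(i + 1, len(m)):   # peel off digit i in later channels
            x[j] = ((x[j] - d[i]) * pow(mi, -1, m[j])) % m[j]
    return d                             # value = d0 + d1*m0 + d2*m0*m1 + ...

m = [3, 5, 7]
z = rns_add(to_rns(10, m), to_rns(5, m), m)
print(z)                                 # [0, 0, 1], eq. 1.46
print(crt(z, m))                         # 15, eq. 1.47
d = to_mixed_radix(z, m)
print(d, d[0] + d[1]*3 + d[2]*15)        # digits and eq. 1.41 value: 15
```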
The basic architecture of an RNS processor in comparison to a
RNS Arithmetic Example binary counterpart is depicted in Figure 1.7. This figure shows
Consider the base B = {3, 5, 7} and two integers X = I0 and that the word length n of the binary counterpart is parti-
Y = 5. The R N S images of X and Yare as written here: tioned into N subwords, the residues, that can be processed
188 Thanos Stouraitis

n
8 \
-O

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . n o

n/M[ ~ ~ _ i n/M
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

(A) Structure of a Binary Architecture (B) Corresponding RNS Processor

FIGURE 1.7 Basic Architectures

independently and are of word length significantly smaller than n. The architecture in Figure 1.7 assumes, without loss of generality, that the moduli are of equal word length. The ith residue channel performs arithmetic modulo m_i.

Most implementations of arithmetic units for RNS consist of an accumulator and a multiplier and are based on ROMs or PLAs. Bayoumi et al. (1983) have analyzed the efficiency of various VLSI implementations of RNS adders. Moreover, implementations of arithmetic units that operate in a finite integer ring R(m), called AU_m's, are offered in the literature (Stouraitis et al., 1993). They are less costly, requiring smaller area, lower hardware complexity, and lower power consumption. They are based on continuously decomposing the residue bits that correspond to powers of 2 larger than or equal to 2^n, until they are reduced to a set of bits that corresponds to a sum of powers of 2 less than 2^n, where n = ⌈log_2 m⌉. This decomposition is implemented by using full adder (FA) arrays. For all moduli, the FA-based AU_m's are shown to execute much faster, as well as to have much smaller hardware complexity and time-complexity products, than ROM-based general multipliers. Since the AU_m's use full adders as their basic units, they lead to modular and regular designs, which are inexpensive and easy to implement in VLSI.

1.3.3 Error Tolerance in RNS Systems
Because there is no interaction among digits (channels) in residue arithmetic, any errors generated at one digit cannot propagate and contaminate other channels during subsequent operations, given that no conversion has occurred from the RNS to a weighted representation.

In addition, because there is no weight associated with the RNS residues (digits), if any digit becomes corrupted, the associated channel may be easily identified and dealt with. Based on the amount of redundancy that is built into an RNS processor, the faulty channels may be replaced or just isolated, with the rest of the system operating in a "soft failure" mode, being allowed to gracefully degrade into accurate operations of reduced dynamic range. Provided that the remaining dynamic range contains the results, there is no problem with this degradation.

The more redundant an RNS is, the easier it is to identify and correct errors. A redundant RNS (RRNS) uses a number r of moduli in addition to the N standard moduli that are necessary for covering the desired dynamic range. All N + r moduli must be relatively prime. In an RRNS, a number X is represented by a total of N nonredundant residue digits {x_1, ..., x_N} plus r redundant residue digits {x_{N+1}, ..., x_{N+r}}. A total of M_R = Π_{i=1}^{N+r} m_i states is represented by the RRNS. The first M = Π_{i=1}^{N} m_i states constitute its "legitimate range," while any number that lies in the range (M, M_R) is called "illegitimate."

Any single error moves a legitimate number X into an illegitimate number X'. Once it is verified that the number being tested is illegitimate, its digits are discarded one by one, until a legitimate representation is found. The discarded digit whose omission results in the legitimate representation is the erroneous one. A correct digit can then be produced by extending the base of the reduced RNS that produced the legitimate representation. The above error-locating-and-correcting procedure can be implemented in a variety of ways. Assuming that the mixed radix representations of all the reduced RNS representations can be efficiently generated, the legitimate one can be easily identified by checking the highest order mixed radix digit against zero. If it is zero, the representation is legitimate.

Mixed radix representations associated with the RNS numbers can be used to detect overflows as well as to detect and correct errors in redundant RNS systems. For example, to detect overflows, a redundant modulus m_{N+1} is added to the base and the corresponding highest order mixed radix digit a_{N+1} is found and compared to zero. Assuming that the

number being tested for overflow is not large enough to overflow the augmented range of the redundant system, overflow occurs whenever a_{N+1} is not zero.

1.3.4 RNS and Power Dissipation
RNS may reduce power dissipation because it reduces the hardware cost, the switching activity, and the supply voltage (Freking and Parhi, 1997). By employing binary-like RNS filter structures (Ibrahim, 1994), it has been reported that RNS reduces the bit activity up to 38% in (4 × 4)-bit multipliers. As the critical path in an RNS architecture increases logarithmically with the equivalent binary word length, RNS can tolerate a larger reduction in the supply voltage than the corresponding binary architecture while achieving a particular delay specification. To demonstrate the overall impact of the RNS on the power budget of an FIR filter, Freking and Parhi (1997) report that a filter unit with 16-bit coefficients and 32-bit dynamic range, operating at 50 MHz, dissipates 26.2 mW on average for a two's complement implementation, while the RNS equivalent architecture dissipates 3.8 mW. Hence, power dissipation reduction becomes more significant as the number of filter taps increases, and a 3-fold reduction is possible for filters with more than 100 taps.

Low power may also be achieved via a different RNS implementation. It has been suggested to one-hot encode the residues in an RNS-based architecture, thus defining the one-hot RNS (OHR) (Chren, 1998). Instead of encoding a residue value x_i in a conventional positional notation, an (m − 1)-bit word is employed. In this word, the assertion of the ith bit denotes the residue value x_i. The one-hot approach allows for a further reduction in bit activity and power-delay products using residue arithmetic. OHR is found to require simple circuits for processing. The power reduction is rendered possible since all basic operations (i.e., addition, subtraction, and multiplication), as well as the RNS-specific operations of scaling (i.e., division by a constant), modulus conversion, and index computation, are performed using transposition of bit lines and barrel shifters. The performance of the obtained residue architectures is demonstrated through the design of a direct digital frequency synthesizer that exhibits a power-delay product reduction of 85% over the conventional approach (Chren, 1998).

RNS Signal Activity for Gaussian Input
The bit activity in an RNS architecture with positionally encoded residues has been experimentally studied for the encoding of 8-bit data using the base {2, 151}, which provides a linear fixed-point dynamic range of approximately 8.24 bits. Assuming data sampled from a Gaussian process, the bit assertion activities of the particular RNS, an 8-bit sign-magnitude, and an 8-bit two's-complement system are measured and compared. The results are depicted in Figure 1.8 for 100 Monte Carlo runs. It is observed that RNS performs better than the two's complement representation for anticorrelated data and slightly worse than the sign-magnitude and two's complement representations for uncorrelated and correlated sequences.

FIGURE 1.8 Number of Low-to-High Transitions. (A) This figure shows strongly anticorrelated (ρ = −0.99) Gaussian data for the two's complement, RNS, and sign-magnitude number systems for 100 Monte Carlo runs. (B) Shown here are uncorrelated (ρ = 0) Gaussian data. (C) This figure illustrates strongly correlated (ρ = 0.99) Gaussian data.
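The bit-decomposition idea behind the FA-based AU_m units of Section 1.3.2 (folding every bit of weight 2^n or higher back below 2^n, since 2^n ≡ 2^n mod m) can also be sketched in software; plain integer arithmetic stands in here for the full-adder arrays, and the function name is ours:

```python
def fold_mod(x, m):
    """Reduce x modulo m by repeatedly folding the bits of weight >= 2^n
    back into the low n bits, where n = ceil(log2 m), using the identity
    2^n == (2^n mod m) modulo m."""
    n = (m - 1).bit_length()        # n = ceil(log2 m) for m > 1
    fold = (1 << n) % m             # value of a bit at position n, mod m
    while x >= (1 << n):
        low = x & ((1 << n) - 1)    # bits below position n
        high = x >> n               # bits at position n and above
        x = low + high * fold       # strictly smaller, same residue
    return x if x < m else x - m    # one final correction, since x < 2^n < 2m
```

For m = 2^n − 1 the fold value is 1 and the loop degenerates to the familiar end-around-carry addition, which is one reason the {2^n − 1, 2^n, 2^n + 1} moduli are so convenient.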

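For the one-hot RNS of Section 1.3.4, modular addition amounts to a cyclic rotation of the one-hot word, the kind of data movement that transposed bit lines and barrel shifters provide. The following is a hypothetical sketch using an m-bit word per channel for clarity (Chren's OHR packs the code into m − 1 bits):

```python
M = 7  # channel modulus, chosen only for illustration

def one_hot(x, m=M):
    """m-bit one-hot word: bit i is asserted iff the residue value is i."""
    return 1 << x

def one_hot_add(word, y, m=M):
    """Add residue y by cyclically rotating the one-hot word y positions,
    so bit x moves to bit (x + y) mod m; a barrel shifter does the same."""
    mask = (1 << m) - 1
    return ((word << y) | (word >> (m - y))) & mask
```

For example, one_hot_add(one_hot(3), 4) yields one_hot(0), since (3 + 4) mod 7 = 0; only one bit is asserted per word, which is what keeps the switching activity low.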
References
Bayoumi, M.A., Jullien, G.A., and Miller, W.C. (1983). Models of VLSI implementation of residue number system arithmetic modules. Proceedings of the 6th Symposium on Computer Arithmetic, 412-413.
Chandrakasan, A.P., and Brodersen, R.W. (1995). Low power digital CMOS design. Boston: Kluwer Academic Publishers.
Chren, W.A., Jr. (1998). One-hot residue coding for low delay-power product CMOS design. IEEE Transactions on Circuits and Systems, Part II, 45, 303-313.
Freking, W.L., and Parhi, K.K. (1997). Low-power FIR digital filters using residue arithmetic. Proceedings of the Thirty-First Asilomar Conference on Signals, Systems, and Computers, 739-743.
Ibrahim, M.K. (1994). Novel digital filter implementations using hybrid RNS-binary arithmetic. Signal Processing, 40, 287-294.
Koren, I. (1993). Computer arithmetic algorithms. Englewood Cliffs, NJ: Prentice Hall.
Orginos, I., Paliouras, V., and Stouraitis, T. (1995). A novel algorithm for multioperand logarithmic number system addition and subtraction using polynomial approximation. Proceedings of the International Symposium on Circuits and Systems, III.1992-III.1995.
Paliouras, V., and Stouraitis, T. (1996). A novel algorithm for accurate logarithmic number system subtraction. Proceedings of the International Symposium on Circuits and Systems, 4, 268-271.
Paliouras, V., and Stouraitis, T. (2001). Signal activity and power consumption reduction using the logarithmic number system. Proceedings of the IEEE International Symposium on Circuits and Systems, II.653-II.656.
Paliouras, V., and Stouraitis, T. (2001). Low-power properties of the logarithmic number system. Proceedings of the 15th Symposium on Computer Arithmetic (ARITH15), 229-236.
Peebles, P.Z., Jr. (1987). Probability, random variables, and random signal principles. New York: McGraw-Hill.
Sacha, J.R., and Irwin, M.J. (1998). The logarithmic number system for strength reduction in adaptive filtering. Proceedings of the International Symposium on Low-Power Electronics and Design, 256-261.
Soderstrand, M.A., Jenkins, W.K., Jullien, G.A., and Taylor, F.J. (1986). Residue number arithmetic: Modern applications in digital signal processing. New York: IEEE Press.
Stouraitis, T. (1986). Logarithmic number system: Theory, analysis and design. Ph.D. diss., University of Florida.
Stouraitis, T., Kim, S.W., and Skavantzos, A. (1993). Full adder-based units for finite integer rings. IEEE Transactions on Circuits and Systems, Part II, 40, 740-745.
Szabó, N., and Tanaka, R. (1967). Residue arithmetic and its applications to computer technology. New York: McGraw-Hill.
Taylor, F., Gill, R., Joseph, J., and Radke, J. (1988). A 20-bit logarithmic number system processor. IEEE Transactions on Computers, 37, 190-199.
Taylor, F.J., Papadourakis, G., Skavantzos, A., and Stouraitis, T. (1985). A radix-4 FFT using complex RNS arithmetic. IEEE Transactions on Computers, C-34, 573-576.
