IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 29, NO. 2, FEBRUARY 2010 275

New Reconfigurable Architectures for Implementing FIR Filters with Low Complexity
R. Mahesh, Member, IEEE, and A. P. Vinod, Senior Member, IEEE

Abstract: Reconfigurability and low complexity are the two key requirements of finite impulse response (FIR) filters employed in multistandard wireless communication systems. In this paper, two new reconfigurable architectures of low complexity FIR filters are proposed, namely constant shifts method and programmable shifts method. The proposed FIR filter architecture is capable of operating for different wordlength filter coefficients without any overhead in the hardware circuitry. We show that dynamically reconfigurable filters can be efficiently implemented by using common subexpression elimination algorithms. The proposed architectures have been implemented and tested on Virtex 2v3000ff1152-4 field-programmable gate array and synthesized on 0.18 µm complementary metal–oxide–semiconductor technology with a precision of 16 bits. Design examples show that the proposed architectures offer good area and power reductions and speed improvement compared to the best existing reconfigurable FIR filter implementations in the literature.

Index Terms: Channelizer, common subexpression elimination, FIR filter, high level synthesis, reconfigurability.

Manuscript received October 17, 2008; revised April 20, 2009 and September 28, 2009. Current version published January 22, 2010. This paper was recommended by Associate Editor J. Lach.
The authors are with the School of Computer Engineering, Nanyang Technological University, 639798 Singapore (e-mail: rpmahesh@ntu.edu.sg; asvinod@ntu.edu.sg).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCAD.2009.2035548
0278-0070/$26.00 © 2010 IEEE

I. Introduction

FIR digital filters find extensive applications in mobile communication systems for applications such as channelization, channel equalization, matched filtering, and pulse shaping, due to their absolute stability and linear phase properties. The filters employed in mobile systems must be realized to consume less power and operate at high speed. Recently, with the advent of software defined radio (SDR) technology, finite impulse response (FIR) filter research has been focused on reconfigurable realizations. The fundamental idea of an SDR is to replace most of the analog signal processing in the transceivers with digital signal processing in order to provide the advantage of flexibility through reconfiguration. This will enable different air-interfaces to be implemented on a single generic hardware platform to support multistandard wireless communications [1]. Wideband receivers in SDR must be realized to meet the stringent specifications of low power consumption and high speed. Reconfigurability of the receiver to work with different wireless communication standards is another key requirement in an SDR. The most computationally intensive part of an SDR receiver is the channelizer since it operates at the highest sampling rate [2]. It extracts multiple narrowband channels from a wideband signal using a bank of FIR filters, called channel filters. Using polyphase filter structure, decimation can be done prior to channel filtering so that the channel filters need to operate only at relatively low sampling rates. This can relax the speed of operation of the filters to a good extent [22]. However due to the stringent adjacent channel attenuation specifications of wireless communication standards, higher order filters are required for channelization and consequently the complexity and power consumption of the receiver will be high. As the ultimate aim of the future multi-standard wireless communication receiver is to realize its functionalities in mobile handsets, where its full utilization is possible, low power and low area implementation of FIR channel filters is inevitable. In [37], the filter multiplications are done via state machines in an iterative shift and add component and as a result of this there is huge savings in area. For lower order filters, the approach in [37] offers good trade-off between speed and area. But in general, the channel filters in wireless communication receivers need to be of high order to achieve sharp transition band and low adjacent channel attenuation requirements. For such applications, the approach in [37] results in low speed of operation.

The complexity of FIR filters is dominated by the complexity of coefficient multipliers. It is well known that the common subexpression elimination (CSE) methods based on canonical signed digit (CSD) coefficients produce low complexity FIR filter coefficient multipliers [3]. The goal of CSE is to identify multiple occurrences of identical bit patterns that are present in the CSD representation of coefficients, and eliminate these redundant multiplications. A modification of the 2-bit CSE technique in [3] for identifying the proper patterns for elimination of redundant computations and to maximize the optimization impact was proposed in [4]. In [5], the technique in [3] was modified to minimize the logic depth (LD) (LD is defined as the number of adder-steps in a maximal path of decomposed multiplications [27]) and thus to improve the speed of operation. In [6], we have proposed the binary common subexpression elimination (BCSE) method which provided improved adder reductions and thus low complexity FIR filters compared to [3]–[5]. In [7], a method based on the pseudo floating point method was used to encode the filter coefficients and thus to reduce the complexity of the filter. But the method in [7] is limited to filter lengths less than 40. In general, the methods in [3]–[7] are only suitable for
application specific filters where the coefficients are fixed and hence not suitable for reconfigurable filters.

Several implementation approaches for reconfigurable FIR filters have been proposed in literature [8]–[15]. These designs include either a fully programmable multiply-accumulate (MAC) based filter processor or dedicated architectures where the filter coefficients can be stored in registers. The architecture of a filter processor consists of a datapath with a single MAC unit, data and program memories, and a control unit [8], [9]. The datapath includes a 16-bit adder/subtractor, a multiplier, and a 32-bit accumulator. The performance of the processor is mainly restricted by the delay of this datapath, more specifically that of the multiplier. The main disadvantage of the filter processors is that the area and power requirements are significantly large. In [10], a comparison was done for the performance of speech based algorithms on dedicated architectures and general-purpose processors. It was shown that the power consumption for a general-purpose processor can be a factor of four times more than dedicated architectures for a complex algorithm [10].

The works in [11]–[15] and [20], [21] present reconfigurable FIR filter architectures. In [11], a CSD based digit reconfigurable FIR filter architecture was proposed. This architecture was independent of the number of taps because the number of taps and non-zero digits in each tap were arbitrarily assigned. The intention of the authors was to reduce the precision of coefficients and thus the filter complexity without affecting the filter performance. But the architecture in [11] demanded huge hardware resources and this makes the method infeasible for power constrained SDR receiver applications. In [12], a high-speed programmable CSD based FIR filter was proposed. The filter architecture consisted of a programmable CSD based Booth encoding scheme and partial product Wallace adder tree. The final adder was a carry look-ahead adder. Though this method offered a high speed solution, the resulting filters consume more power. Another high-speed programmable FIR filter based on polyphase decomposition was proposed in [13]. However, this method used the built-in block multipliers of Virtex II field-programmable gate array (FPGA) and there was no consideration for the complexity reduction of the FIR filter. In [14], the concept of reconfigurable multiplier block (ReMB) was introduced. The ReMB will generate all the coefficient products and a multiplexer will select the required ones depending on the input. It was shown that by pushing the multiplexer deep into the multiplier block design, the redundancy can be reduced. The resulting specialized multiplier design can be more efficient in terms of area and computational complexity compared to the general-purpose multiplier plus the coefficient store [14]. But the ReMB proposed in [14] has its area, power, and speed dependent on the filter-length, making it inappropriate for higher order FIR filters. In [15], a multiplexed multiple constant multiplications (MMCM) approach was proposed. This method considers the coefficient set as a constant and uses the graph dependence (GD) algorithms for reducing redundancy. But this method follows a directed acyclic graph structure which will result in long LD and thus lower speed of operation as reported in [3], [24], [25]. Also the area of the architecture linearly increases with the filter length as in [14] and filters with filter-length above 40 are infeasible. In [20], the common digital signal processing (DSP) operations such as filtering and matrix multiplication were identified and expressed as vector scaling operations. In order to apply vector scaling, simple number decomposition strategies were identified. The idea was to precompute the values such as x, 3x, 5x, 7x, 9x, 11x, 13x, and 15x, where x is the input signal and then reuse these precomputations efficiently using multiplexers. The presence of multiplexers gave the option of adaptive computing for the method in [20]. In [21], the method in [20] was modified and efficient circuit-level techniques, namely a new carry-select adder and conditional capture flip-flop, were used to further improve power and performance. The architectures in [11]–[15] and [20], [21] are appropriate only for relatively lower order filters and hence not suitable for channel filters in communication receivers.

Although a few works addressed the problem of reducing the complexity of coefficient multipliers in reconfigurable FIR filters, hardly any work demonstrated reconfigurability in higher order filters. Moreover, we note that there is sufficient scope for more work on complexity reduction in reconfigurable filters especially for wireless communication applications where higher order filters are often required to meet the stringent adjacent channel attenuation specifications.

In this paper, we propose two architectures that integrate reconfigurability and low complexity to realize FIR filters. The FIR filter architectures proposed are called constant shifts method (CSM) and programmable shifts method (PSM). We have presented the preliminary design of these architectures in a recent conference paper [34]. In this journal paper, we elaborate the CSM and PSM architectures introduced in [34] by providing the detailed design. The design analysis of the architectures and their extension to high-level synthesis are presented. The proposed architectures have been synthesized on 0.18 µm complementary metal–oxide–semiconductor (CMOS) technology and compared with the recent methods such as [11], [12], [14], and [15]. Also we have implemented two CSD based methods based on our CSM and PSM to compare the complexities of the CSD and binary based CSE techniques. The implementation and verification of the proposed architectures on Virtex 2v3000ff1152-4 FPGA is also presented.

The proposed architectures consider coefficients as constants (as they are stored in LUTs) and input signal as variable. The coefficient multiplication in such a case is known as multiple constant multiplications (MCM), i.e., multiplication of one variable (input signal) with multiple constants (filter coefficients) [35]. The MCM is then optimized for eliminating redundancy using our recently proposed BCSE algorithm [6] to minimize the filter complexity. The proposed CSM focuses on implementing FIR filters by partitioning the filter coefficients into fixed groups. The PSM has a pre-analysis part which eliminates the redundancy in filter coefficients using the BCSE algorithm. The advantage of CSM is that it produces high-speed filters at the cost of a slight increase in area and power consumption. On the contrary, the PSM produces filters with low area and power consumption at the cost of a slight increase in delay. Another advantage of PSM is
that the wordlength of the filter coefficients can be dynamically changed without any modification in the hardware.

The paper is organized as follows. In Section II, the BCSE method is reviewed. Section III presents our reconfigurable architectures. In Section IV, the proposed architectures are extended to high-level synthesis. The synthesis and design results of proposed architectures are presented in Section V. In Section VI, implementation results of the proposed architectures are presented. Section VII has our conclusion.

II. Review of BCSE Method

This section reviews the BCSE algorithm [6], which deals with the elimination of redundant binary common subexpressions (BCSs) that occur within the coefficients. The BCSE technique focuses on eliminating redundant computations in coefficient multipliers by reusing the most common binary bit patterns (BCSs) present in coefficients. An n-bit binary number can form $2^n - (n + 1)$ BCSs among themselves. For example, a 3-bit binary representation can form four BCSs, which are [0 1 1], [1 0 1], [1 1 0], and [1 1 1]. These BCSs can be expressed as [0 1 1] = $x_2 = 2^{-1}x + 2^{-2}x$, [1 0 1] = $x_3 = x + 2^{-2}x$, [1 1 0] = $x_4 = x + 2^{-1}x$, and [1 1 1] = $x_5 = x + 2^{-1}x + 2^{-2}x$, where $x$ is the input signal. Note that other BCSs such as [0 0 1], [0 1 0], and [1 0 0] do not require any adder for implementation as they have only one nonzero bit. A straightforward realization of above BCSs would require five adders. However, $x_2$ can be obtained from $x_4$ by a right shift operation (without using any extra adders): $x_2 = 2^{-1}x + 2^{-2}x = 2^{-1}(x + 2^{-1}x) = 2^{-1}x_4$. Also, $x_5$ can be obtained from $x_4$ using an adder: $x_5 = x + 2^{-1}x + 2^{-2}x = x_4 + 2^{-2}x$. Thus, only three adders are needed to realize the BCSs $x_2$ to $x_5$. The number of adders required for all the possible n-bit binary subexpressions is $2^{n-1} - 1$ [6]. The number of adders needed to implement the coefficient multipliers using the binary representation-based BCSE is considerably less than the CSD-based CSE methods [6].

The proposed FIR filter architecture is based on transposed direct form as shown in Fig. 1. In the transposed direct form, the coefficient multipliers (shown as dotted outline in Fig. 1) share the same input and are hence commonly known as multiplier block (MB). The MB reduces the complexity of the FIR filter implementations, by exploiting the redundancy in MCM. Thus, redundant computations (partial product additions in the multiplier) are eliminated using BCSE. The BCSE method in [6] was formulated as a low complexity solution to realize application specific filters where the coefficients are fixed. In the case of channel filters for SDR receivers, the coefficients need to be changed as the filter specification changes with the communication standard. Therefore, reconfigurability is a necessary requirement for SDR channel filters. In the next section, we propose two architectures that incorporate reconfigurability into the BCSE-based low complexity filter architecture. Although we use BCSE to illustrate proposed reconfigurable filter architectures in this paper, it must be noted that the proposed architectures can be used for any CSE method with appropriate modifications.

Fig. 1. Transposed direct form of an FIR filter.

III. Proposed Filter Architectures

In this section, the architecture of the proposed FIR filter is presented. Our architecture is based on the transposed direct form FIR filter structure as shown in Fig. 1. The dotted portion in Fig. 1 represents the MB. In Fig. 1, PE-i represents the processing element corresponding to the ith coefficient. PE performs the coefficient multiplication operation with the help of a shift and add unit which will be explained in the latter part of this section. The architecture of PE is different for proposed CSM and PSM. In the CSM, the filter coefficients are partitioned into fixed groups and hence the PE architecture involves constant shifters. But in the PSM, the PE consists of programmable shifters (PS). The FIR filter architecture can be realized in a serial way in which the same PE is used for generation of all partial products by convolving the coefficients with the input signal (h ∗ x[n]) or in a parallel way, where parallel PE architectures are employed. The first option is used when power consumption and area are of prime concern. The basic architecture of the PE (dotted portion) is shown in Fig. 2. The functions of different blocks of the PE are explained below.

Fig. 2. Architecture of the proposed method.

1) Shift and Add Unit: It is well known that one of the efficient ways to reduce the complexity of multiplication operation is to realize it using shift and add operations. In contrast to conventional shift and add units used in previously proposed reconfigurable filter architectures, we use the BCSs-based shift and add unit in our proposed CSM and PSM architectures. The architecture of shift and add unit is shown in Fig. 3. The shift and add unit is used to realize all the 3-bit BCSs of the input signal ranging from [0 0 0] to [1 1 1]. In Fig. 3, “x>>k” represents the input x shifted right by k units. All the 3-bit BCSs [0 1 1], [1 0 1], [1 1 0], and [1 1 1] of a 3-bit number are generated using only three adders, whereas a conventional shift and add unit would require five adders. Since the shifts to obtain the BCSs are known beforehand, PS are not required. All these eight BCSs (including [0 0 0]) are then fed to the multiplexer unit. In both the architectures (CSM and PSM) proposed in this paper, we use the same shift and add unit. Thus, the use of 3-bit BCSs reduces the number of adders needed to implement the shift and add unit compared to conventional shift and add units.

Fig. 3. Architecture of shift and add unit.

2) Multiplexer Unit: The multiplexer units are used to select the appropriate output from the shift and add unit. All the multiplexers will share the outputs of the shift and add unit. The inputs to the multiplexers are the 8/4 inputs from the shift and add unit and hence 8:1/4:1 multiplexer units are employed in the architecture. The select signals of the multiplexers are the filter coefficients which are previously stored in a look up table (LUT). The CSM and PSM architectures basically differ in the way filter coefficients are stored in the LUT. In the CSM, the coefficients are directly stored in LUTs without any modification whereas in PSM, the coefficients are stored in a coded format. The number of multiplexers will also be different for PSM and CSM. In CSM, the number of multiplexers will be dependent on the number of groups after the partitioning of the filter coefficient into fixed groups. The number of multiplexers in the PSM is dependent on the number of non-zero operands in the coefficient for the worst case after the application of BCSE algorithm.

3) Final Shifter Unit: The final shifter unit will perform the shifting operation after all the intermediate additions (i.e., intra-coefficient additions) are done. This can be illustrated using the output expression

$y = 2^{-4}x + 2^{-6}x + 2^{-15}x + 2^{-16}x$. (1)

By coefficient-partitioning [16], we obtain

$y = 2^{-4}(x + 2^{-2}x) + 2^{-15}(x + 2^{-1}x)$. (2)

After obtaining the intermediate sums $(x + 2^{-2}x)$ and $(x + 2^{-1}x)$ from the shift and add units with the help of multiplexer unit, the final shifter unit will perform the shift operations $2^{-4}$ and $2^{-15}$ in (2). The PSM and CSM architectures also differ in the nature of final shifters. In the CSM, the final shifts are constants and hence no PS are required. In the PSM, we have used PS.

4) Final Adder Unit: This unit will compute the sum of all the intermediate additions $2^{-4}(x + 2^{-2}x)$ and $2^{-15}(x + 2^{-1}x)$ as in (2).

As the filter specifications of different communication standards are different, the coefficients change with the standards. In conventional reconfigurable filters, the new coefficient set corresponding to the filter specification of the new communication standard is loaded in the LUT. Subsequently, the shift and add unit performs a bitwise addition after appropriate shifts. On the contrary, the proposed CSM and PSM architectures perform a binary common subexpression (BCS)-wise addition (instead of bitwise addition). Thus, the same hardware architecture can be used for different filter specifications to achieve the necessary reconfigurability. Moreover, the proposed BCS-based shift and add unit reduces addition operations and hence offers hardware complexity reduction. In the next section, the CSM is explained in a detailed manner.

A. Architecture of CSM

In the CSM architecture, the coefficients are stored directly in the LUT. These coefficients are partitioned into groups of 3-bits and are used as the select signal for the multiplexers. The number of multiplexer units required is n/3 (rounded up to the next integer), where n is the wordlength of the filter coefficients. The CSM can be explained with the help of an 8-bit coefficient h = “0.11111111.” This coefficient h is the worst-case 8-bit coefficient since all the bits are nonzero and hence needs a maximum number of additions and shifts. In this case, n = 8, and therefore the number of multiplexers required is 3. The output y = h ∗ x is expressed as

$y = 2^{-1}x + 2^{-2}x + 2^{-3}x + 2^{-4}x + 2^{-5}x + 2^{-6}x + 2^{-7}x + 2^{-8}x$. (3)

By partitioning (3) into groups of three bits from the most significant bit (MSB), we obtain

$y = 2^{-1}(x + 2^{-1}x + 2^{-2}x + 2^{-3}x + 2^{-4}x + 2^{-5}x + 2^{-6}x + 2^{-7}x)$ (4)

$y = 2^{-1}(x + 2^{-1}x + 2^{-2}x + 2^{-3}(x + 2^{-1}x + 2^{-2}x) + 2^{-6}(x + 2^{-1}x))$. (5)

Note that the terms $x + 2^{-1}x + 2^{-2}x$ and $x + 2^{-1}x$ can be obtained from the shift and add unit. Then by using the three multiplexers (mux), two 8:1 mux for the first two 3-bit groups and one 4:1 mux for the last two bits of the filter coefficients, the intermediate sums shown inside the brackets of (5) can be obtained. The final shifter unit will perform the shift operations $2^{-1}$, $2^{-3}$, and $2^{-6}$. Since these shifts are always constant irrespective of the coefficients, programmable shifters are not required and these shifts can be hardwired.
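The shift and add unit and the worked 8-bit example in (3)–(5) can be sketched in Python as follows. This is a minimal behavioral model under stated assumptions, not the authors' hardware: the names `shift_add_unit` and `csm_multiply` are hypothetical, exact `Fraction` arithmetic stands in for fixed-point right shifts, and a short last group is zero-padded instead of being routed to a smaller 4:1 multiplexer (which selects the same value).

```python
from fractions import Fraction


def shift_add_unit(x):
    """Return the eight 3-bit BCS outputs of x (hypothetical helper).

    Keys are bit patterns (b0, b1, b2) with weights 2^0, 2^-1, 2^-2.
    Only three additions are used; everything else is a shift.
    """
    x = Fraction(x)
    x4 = x + x / 2        # [1 1 0], adder 1
    x3 = x + x / 4        # [1 0 1], adder 2
    x5 = x4 + x / 4       # [1 1 1], adder 3 (reuses x4)
    x2 = x4 / 2           # [0 1 1], right shift of x4, no adder
    return {
        (0, 0, 0): Fraction(0),
        (0, 0, 1): x / 4,
        (0, 1, 0): x / 2,
        (0, 1, 1): x2,
        (1, 0, 0): x,
        (1, 0, 1): x3,
        (1, 1, 0): x4,
        (1, 1, 1): x5,
    }


def csm_multiply(bits, x):
    """Multiply x by the fraction 0.b1 b2 ... bn using fixed 3-bit groups.

    Each group acts as a multiplexer select into the shared shift and
    add unit; the group shifts 2^0, 2^-3, 2^-6, ... are constants, and
    the common factor 2^-1 is applied last, as in (4) and (5).
    """
    bcs = shift_add_unit(x)
    total = Fraction(0)
    for start in range(0, len(bits), 3):
        group = tuple(bits[start:start + 3])
        group += (0,) * (3 - len(group))      # assumption: zero-pad last group
        total += bcs[group] / 2 ** start      # constant (hardwired) shift
    return total / 2                          # final 2^-1 shift


print(csm_multiply([1] * 8, 1))  # worst-case 8-bit coefficient: 255/256
```

For h = 0.11111111 the sketch selects [1 1 1], [1 1 1], and [1 1 0] from the shared unit and applies the constant shifts shown in (5).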
The final adder unit will compute the sum of all the intermediate sums to obtain h ∗ x[n].

The architecture of PE for CSM is shown in Fig. 4. The coefficient wordlength is considered as 16 bits. The filter coefficients are stored in the LUT in sign-magnitude form with the MSB reserved for the sign bit. The first bit after the sign bit is used to represent the integer part of the coefficient and the remaining 16 bits are used to represent the fractional part of the coefficient. Thus, each 16-bit coefficient is stored as an 18-bit value in LUTs. Each row in LUT corresponds to one coefficient. Note that only half the number of coefficients need to be stored as FIR filter coefficients are symmetric. The coefficient values corresponding to $2^0$ to $2^{-14}$ are partitioned into groups of three bits and are used as select signals to multiplexers Mux1 to Mux5, i.e., the set ($2^0$, $2^{-1}$, $2^{-2}$) forms the select signal to Mux1 and so on. Since there are 3-bits, eight combinations are possible and hence Mux1 to Mux5 are 8:1 multiplexers. The value corresponding to $2^{-15}$ forms the select to a 2:1 multiplexer, Mux6. The output from the ith multiplexer is denoted as $r_i$. Note that even though we are taking coefficient with values up to a precision of 16 bits, the shifting of $2^{-1}$ is done finally as shown in (4) and (5) and hence the maximum shift will be $2^{-15}$. Mux7 determines whether the output needs to be complemented based on the sign bit of the filter coefficient and hence it is a 2:1 multiplexer. In FIR filters, coefficient values are always less than one. [In our design examples, we used the Parks–McClellan algorithm to design filters (using “firpm” command in MATLAB)]. Hence, we have not employed the integer bit. However, if an integer digit is required, the proposed architectures do not impose any restrictions to accommodate it.

Fig. 4. Architecture of PE for CSM.

In Fig. 4, the shifts are obtained as follows. Let $r_1$ to $r_6$ denote the outputs of Mux1 to Mux6, respectively. Then

$y = 2^{-1}r_1 + 2^{-4}r_2 + 2^{-7}r_3 + 2^{-10}r_4 + 2^{-13}r_5 + 2^{-16}r_6$. (6)

The shifts are obtained by partitioning the 16-bit coefficient into groups of 3-bits. By partitioning (6)

$y = 2^{-1}[(r_1 + 2^{-3}r_2) + 2^{-6}[(r_3 + 2^{-3}r_4) + 2^{-6}(r_5 + 2^{-3}r_6)]]$. (7)

Substituting $(r_1 + 2^{-3}r_2)$, $(r_3 + 2^{-3}r_4)$, and $(r_5 + 2^{-3}r_6)$ by $r_7$, $r_8$, and $r_9$, respectively, we get

$y = 2^{-1}[r_7 + 2^{-6}(r_8 + 2^{-6}r_9)]$. (8)

By substituting $(r_8 + 2^{-6}r_9)$ by $r_{10}$

$y = 2^{-1}(r_7 + 2^{-6}r_{10})$. (9)

By substituting $(r_7 + 2^{-6}r_{10})$ by $r_{11}$

$y = 2^{-1}(r_{11})$. (10)

The expressions from (6)–(10) are represented in Fig. 4. The main advantage of the CSM architecture is that all the shifts are constants irrespective of the coefficients and hence can be hardwired resulting in high speed operation of the filter.

We have employed the shift and add unit which can generate all the 3-bit BCSs using only three adders. The impact of using higher order BCSs (4-bit, 5-bit BCSs, etc.) has also been investigated. The choice of the best shift and add unit will depend on the complexities of: 1) shift and add unit; 2) multiplexer unit; and 3) final adder unit. The number of adders needed to implement n-bit CSs is $2^{n-1} - 1$ [6]. Thus, shift and add units capable of generating 4-bit, 5-bit, and 6-bit BCSs would require 7, 15, and 31 adders, respectively. The LD is two adder-steps for both the 3-bit and the 4-bit BCSs-based shift and add units, and hence they have the same speed. The LD of 5-bit and 6-bit BCSs-based shift and add units are same, i.e., three adder-steps, which is one adder-step more than that of the 3-bit and 4-bit BCSs. Thus, the 3-bit BCSs-based shift and add unit results in fewer adders than the 4-bit BCSs-based shift and add unit (reduction of four adders) with the same LD. The requirement of additional four adders would increase the complexity of the 4-bit BCSs-based shift and add unit. Note that the cost of shift and add unit is independent of the number of coefficients (filter length) as the same shift and add unit is shared by all the coefficients. In the proposed CSM architecture, W/3 number of 8:1 multiplexers (W/3 8:1 multiplexers and remaining 2:1 or 4:1 multiplexers in some cases) of bit-width (x + 2) are required, where W is the coefficient wordlength and x is the input data wordlength. For example, if W = 16 (16-bit coefficient), the proposed 3-bit BCSs-based approach requires five 8:1 multiplexers and one 2:1 multiplexer. On the other hand, if 4-bit BCSs were used instead of 3-bit BCSs, four 16:1 multiplexers are required. Assuming an 8:1 multiplexer is equivalent to four 2:1 multiplexers and a 16:1 multiplexer is equivalent to eight 2:1 multiplexers, then the 3-bit BCSs-based PE requires twenty one 2:1 multiplexers and 4-bit BCSs-based PE requires thirty two 2:1 multiplexers, respectively. Thus, the multiplexer complexity would increase when 4-bit BCSs are used.
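The regrouping in (6)–(10) and the adder and multiplexer counts quoted above can be checked with a short Python sketch. The names `pe_direct`, `pe_nested`, `bcs_adders`, and `MUX2_EQUIV` are illustrative and do not come from the paper; the table encodes only the equivalences assumed in the text (an 8:1 multiplexer counted as four 2:1 multiplexers, a 16:1 multiplexer as eight), and exact fractions stand in for hardwired right shifts.

```python
from fractions import Fraction


def pe_direct(r):
    """Equation (6): weighted sum of the six multiplexer outputs r1..r6."""
    shifts = (1, 4, 7, 10, 13, 16)
    return sum(Fraction(ri) / 2 ** s for ri, s in zip(r, shifts))


def pe_nested(r):
    """Equations (7)-(10): the same sum with constant shifts and five adders."""
    r1, r2, r3, r4, r5, r6 = (Fraction(v) for v in r)
    r7 = r1 + r2 / 8       # adder 1, shift 2^-3
    r8 = r3 + r4 / 8       # adder 2, shift 2^-3
    r9 = r5 + r6 / 8       # adder 3, shift 2^-3
    r10 = r8 + r9 / 64     # adder 4, shift 2^-6
    r11 = r7 + r10 / 64    # adder 5, shift 2^-6
    return r11 / 2         # final 2^-1 shift, equation (10)


def bcs_adders(n):
    """Adders needed by a shift and add unit for n-bit BCSs: 2^(n-1) - 1."""
    return 2 ** (n - 1) - 1


# 2:1-multiplexer equivalents, as assumed in the text for W = 16.
MUX2_EQUIV = {2: 1, 8: 4, 16: 8}
muxes_3bit = 5 * MUX2_EQUIV[8] + 1 * MUX2_EQUIV[2]   # five 8:1 and one 2:1
muxes_4bit = 4 * MUX2_EQUIV[16]                      # four 16:1

sample = [1, 2, 3, 4, 5, 6]                          # arbitrary mux outputs
print(pe_direct(sample) == pe_nested(sample))
print([bcs_adders(n) for n in (3, 4, 5, 6)], muxes_3bit, muxes_4bit)
```

The nested form uses exactly five additions per PE, which is where the per-PE adder count of the 3-bit design comes from.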
To be more precise, for each PE with 16-bit filter coefficients, the multiplexer complexity of 4-bit BCSs-based PE is increased by eleven 2:1 multiplexers when compared to 3-bit BCSs based shift and add unit. But it can be noted that the total number of adders required for 3-bit BCS-based filter with n coefficients is 3 + 5n (three adders for shift and add unit and five adders for each PE) and that for 4-bit BCS-based PE is 7 + 3n (seven adders for shift and add unit and three adders for each PE). Hence, two adders are saved for 4-bit BCSs-based filter for each PE. From the above discussion, it can be concluded that if 4-bit BCSs were used instead of 3-bit BCSs, the complexity of shift and add unit and multiplexer unit of PE would have increased, whereas complexity of final adder unit would decrease.

To provide a quantitative comparison, let us consider a 16-bit (W = 16) coefficient with an 8-bit quantized (x = 8) input signal. The proposed 3-bit BCSs-based CSM architecture requires twenty one 2:1 multiplexers of wordlength x + 2 = 10 bits and 4-bit BCSs-based CSM architecture requires thirty two 2:1 multiplexers of wordlength x + 3 = 11 bits. For a 1-bit 2:1 multiplexer, we need eight 2-input NAND gates [36]. This means the 3-bit BCSs-based CSM architecture requires 21 × 8 × 10 = 1680 NAND gates whereas the 4-bit BCSs-based CSM architecture requires 32 × 8 × 11 = 2816 NAND gates. The 3-bit BCSs-based shift and add unit requires three adders with adder-length [number of full adders (FAs)] of 10-bits each. Thus, roughly 3-bit BCSs-based shift and add unit requires 30 FAs (assuming ripple carry addition). For each FA, we require fifteen 2-input NAND gates [36]. Thus, the 3-bit BCSs-based shift and add unit requires 30 × 15 = 450 NAND gates. Similarly the 4-bit BCSs-based shift and add unit requires approximately seven adders of adder-length 11 bits each, i.e., 7 × 11 = 77 FAs. Thus, the 4-bit BCSs-based shift and add unit requires 77 × 15 = 1155 NAND gates. For the final adder unit, the proposed 3-bit BCSs-based PE requires five adders (as shown in Fig. 4). Adders A1 to A3 require 13 FAs {10 (output wordlength of shift and add unit) + 3 (shift-length)}, A4 requires 13 + 6 = 19 FAs and A5 requires 19 + 6 = 25 FAs. Thus, a total of 83 FAs are required. This means the 3-bit BCSs-based PE requires 83 × 15 = 1245 NAND gates. Similarly the 4-bit BCSs-based PE requires three adders. These three adders will require 15 + 15 + 15 + 8 = 53 FAs. Thus, the 4-bit BCSs-based PE requires 53 × 15 = 795 NAND gates. Now considering the total complexity of PE (Note that complexity of PE is directly proportional to the number of filter coefficients and total complexity = complexity of multiplexers + complexity of final adder unit), the 3-bit BCSs-based PE requires 2925 NAND gates. The 4-bit BCSs-based PE requires 3611 NAND gates. Thus, for a filter with n taps, the 3-bit BCSs-based CSM architecture requires (450 + 2925n) NAND gates whereas the 4-bit BCSs-based CSM architecture

multiplication is not avoided. Also in case of the outputs of any of the multiplexers becoming zero, the adder corresponding to that mux will be used, which is not required if the output is zero. But it can be seen that the adders at the output of the multiplexers can be combined in many ways and hence the best power-saving solution can be utilized. Also carry save adders can be employed if much faster operation is required. The drawbacks in CSM are resolved by employing the BCSE algorithm proposed by us in [6]. This forms the PSM architecture which is explained in the next section.

B. Architecture of PSM

The PSM is based on the BCSE algorithm presented in our previous work [6]. The PSM architecture presented in this section incorporates reconfigurability into BCSE. The PSM has a pre-analysis part in which the filter coefficients are analyzed using the BCSE algorithm in [6]. Thus, the redundant computations (additions) are eliminated using the BCSs and the resulting coefficients in a coded format are stored in the LUT. The coding format is explained in the latter part of this section. The shift and add unit is identical for both PSM and CSM. The number of multiplexer units required can be obtained from the filter coefficients after the application of BCSE [6]. The number of multiplexers is selected after considering the number of non-zero operands (BCSs and unpaired bits) in each of the coefficients after the application of the BCSE algorithm. The number of multiplexers will be corresponding to the number of non-zero operands for the worst-case coefficient (worst-case coefficient being defined as coefficient that has the maximum number of non-zero operands).

The architecture of PE for PSM is shown in Fig. 5. The coefficient wordlength is fixed as 16 bits. We have done the statistical analysis for various filters with coefficient precision of 16 bits and different filter lengths (20, 50, 80, 120, 200, 400, and 800 taps) and it was found that the maximum number of non-zero operands is 5 for any coefficient. The analysis was done for filters with different passband ($\omega_p$) and stopband ($\omega_s$) frequency specifications given by 1) $\omega_p = 0.1\pi$, $\omega_s = 0.12\pi$; 2) $\omega_p = 0.15\pi$, $\omega_s = 0.25\pi$; 3) $\omega_p = 0.2\pi$, $\omega_s = 0.22\pi$; and 4) $\omega_p = 0.2\pi$, $\omega_s = 0.3\pi$, respectively. Based on our statistical analysis, we have fixed the number of multiplexers as 5 (same as the number of non-zero operands). The LUT consists of two rows of 18 bits for each coefficient of the form SDDDDXXDDDDXXMMMML and DDDDXXDDDDXXDDDDXX, where “S” represents the sign bit, “DDDD” represents the shift values from $2^0$ to $2^{-15}$ and “XX” represents the input “x” or the BCSs obtained from the shift and add unit. In the coded format, XX =
requires (1155 + 3611n) NAND gates. Thus, it is very evident “01” represents “x,” “10” represents x + 2−1 x, “11” represents
that the 3-bit BCSs-based shift and add unit results in low x + 2−2 x, and “00” represents x + 2−1 x + 2−2 x, respectively.
complexity implementation when compared to the 4-bit BCSs- Thus, the two rows can store up to five operands which is
based implementation. Nevertheless, it must be noted that the the worst case number of operands for a 16-bit coefficient. In
CSM architecture can be easily modified to incorporate 4-bit most of the practical coefficients, the number of operands is
or 5-bit shift and add unit based CSM architectures, if there less than the worst case number of operands, 5. In that case
is such a requirement. “MMMML” can be used to avoid unnecessary additions. The
In the CSM approach, the coefficients are directly stored values “MMMM” will be given as select signal to the Mux6
in the LUT and hence complete redundancy in coefficient and “L” to Mux8. “MMMML” indicates the presence of five
operands. A “1” in each position indicates the presence of the corresponding operand. Thus, the presence of all operands is indicated by “MMMML” = “11111.” This means that Mux6 will select the output of adder A4 and Mux8 will select the output of adder A2. If only the first operand is present, “MMMML” = “10000.” This means that Mux8 will select the output of PS shr4 and Mux6 will select the output of PS shr1. As a result, none of the adders shr1 to shr4 will be loaded, saving a significant amount of dynamic power. The coding can be explained as given below. Consider the positive coefficient h

h = [1010011001010011]. (11)

By using the BCSE [6], substituting 2 = [1 1], 3 = [1 0 1], (11) becomes

h = [3000020003000020]. (12)

Then (12) will be stored in the LUT as 000001101011011110 and 100111111010000000. It must be noted that as (12) has only four operands, the fifth operand values “DDDDXX” are substituted as 000000 and “MMMML” as “11110.” The XX values are given as select signals for Mux1 to Mux5. The values of DDDD are fed to the corresponding PS. The multiplexers Mux6 and Mux8 will select the appropriate output in case the number of operands after BCSE is less than 5. The use of Mux6 and Mux8 reduces the number of adders utilized by selecting the output from the appropriate adder, as all the adders in the PE are not always needed. For example, in (12), as only four operands occur, the output can be taken from the output of PS shr4 without using adder shr2. Mux8 will do this and hence the adder shr2 is not loaded and consumes zero current and power. The select signals of Mux6 and Mux8 have five bits and hence 2^5 = 32 different control signals are possible, which adds flexibility to the architecture that can be employed in future if required. Mux7 is used to complement the output in case of a negative coefficient and its select signal is the sign bit “S” of the coefficient.

The PSM architecture has two advantages: first, it guarantees a reduced number of additions compared to CSM, and second, it offers the flexibility of changing the wordlength of coefficients. The same PSM architecture designed for 16-bit coefficients is capable of operating for any coefficient wordlength less than 16 bits. This means, if the wordlength is reduced, the format of the LUT can be changed if required. The main advantage of reducing the precision is that some of the adders in the PSM architecture will be unloaded, resulting in zero dynamic power. To the best of our knowledge, the PSM architecture is the first approach toward programmable coefficient wordlength FIR filter architecture. This means that the coefficient wordlength of the proposed PSM architecture can be changed dynamically without any change in hardware.

Fig. 5. Architecture of PE for PSM.

C. Comparison Between CSM and PSM

The idea of CSM is to split the filter coefficients into groups of three bits and use these groups as selectors to the multiplexer unit to obtain the product h ∗ x[n]. This does not guarantee the minimum number of additions to be performed. In PSM, since the BCSE algorithm is employed, the number of additions to be performed will always be reduced compared to CSM. This can be illustrated as follows. Consider the coefficient h = [010100001010]. If CSM is employed, four multiplexers are always needed, which means that the shift and add unit in Fig. 3 needs to be used four times. Thus, three additions are always required for CSM. But if PSM is used, we first apply BCSE, which gives h1 = [020000002000]. The output computation requires only two additions, one for h1 and one for obtaining 2 = [101]. This reduction is significant for higher order filters. The filters used in SDR channelizers must have a large number of taps to meet the stringent adjacent channel attenuation specifications. Therefore, the proposed PSM architecture is best suited for the channel filters in SDRs. In the case of PSM, the final shifting is done based on the values from the LUT using programmable shifters, whereas in the case of CSM, the shifts are constants as we are always splitting or partitioning the filter coefficients into groups of 3 bits. Thus, the CSM architecture results in faster coefficient multiplication at the cost of a few extra adders compared to the PSM architecture, whereas the PSM architecture results in fewer additions and thus less area and power consumption compared to the CSM architecture.

Another advantage of PSM is that it is independent of the wordlength of the filter coefficients. For the PSM architecture, the number of multiplexers is fixed based on the number of BCSs present in a given coefficient set (worst-case coefficient of the set). Thus, even if the wordlength changes, it hardly affects the architecture of PSM. In [11], it was pointed out that for many filter taps, the highest coefficient precision is
not required. Valuable hardware resources will be wasted if all taps are implemented with the highest precision. The proposed PSM can be implemented for dynamically varying coefficient precision as it is wordlength independent. One of the limitations of the PSM architecture is that it requires pre-analysis of filter coefficients and hence on-the-fly reconfigurability is not always feasible. But this restriction does not impose constraints on popular reconfigurable filter applications like wireless communications. This is because in such applications, we have a distinct filter for each communication standard and the coefficients of the filter are fixed for a specific standard. In other words, when the communication system is operating on a particular wireless standard, the filter coefficients do not change, i.e., the filter is not required to be an adaptive filter. When the system changes its mode of operation to a different wireless communication standard (as in the case of a multi-standard transceiver), the coefficient set corresponding to the specification of the new standard is loaded (replacing the current filter coefficients). Note that the coefficients of the new standard are known beforehand (pre-stored) and therefore the pre-analysis can be done offline and the problem with reconfigurability can be solved.

In this paper, we have employed a tree-structured adder for the final adder unit in both CSM and PSM architectures. But it is possible to further optimize the CSM architecture by employing techniques such as compression techniques [36]. 3:2 or 4:2 compressors can be employed for carry free addition, making the entire final adder unit more power efficient with improved speed of operation. But the use of addition structures other than the tree-structured adder would impair the flexibility of the PSM architecture. For example, the multiplexer Mux6 in our PSM architecture (Fig. 5), which is employed to load/unload the adders, plays a significant role in achieving low power consumption and dynamic wordlength capabilities. If compression techniques were to be employed, the final adder unit should be made free of multiplexers including Mux6, which will in turn impair the above merits of the PSM approach.

IV. Extension of CSM and PSM to High Level Synthesis

In this section, we present an extension of the proposed reconfigurable architectures to high level synthesis. CSE techniques have been used in the literature as a powerful transformation for eliminating hardware redundancies to reduce power consumption and area [6], [27], [35]. However, there is hardly any work that has addressed the problem of designing reconfigurable architectures using CSE techniques. In [14], the concept of ReMB was introduced, which utilized GD algorithms for eliminating coefficient redundancies and thus reducing the number of additions for the ReMB. However, the approach in [14] can be considered suitable only for static reconfigurable systems, where either the filter coefficients are known beforehand or the filters need to switch between coefficient sets which are already available. Also, the approach in [14] reduces the redundancies in multiplications by pushing multiplexers deep into the ReMB design, thus increasing the number of multiplexers. In other words, the reduction in complexity achieved by the approach in [14] is directly proportional to the number of multiplexers. But it is shown in [29] that the delay imposed by multiplexers in reconfigurable designs can heavily degrade the performance of the system, which will have adverse effects on the architecture in [14]. In this paper, we have used our BCSE algorithm [6] to reduce the redundancies in multiplications in the reconfigurable filter architecture. To the best of our knowledge, this is the first approach that employs the CSE technique to achieve high-level synthesis goals for reconfigurable systems. The proposed CSM and PSM methods make use of architectures with a fixed number of multiplexers and the reduction in complexity is achieved by applying the BCSE algorithm proposed in [6]. Also, the shift and add unit, which significantly reduces the number of adders compared to direct implementation, has no multiplexers, in contrast to the approach in [14].

The high level synthesis literature has an extensive coverage of employing partitioning techniques to integrate low power realization within the scheduling process [29]–[32]. These methods generally use some scheduling techniques or path analysis to identify regions that can be combined into partitions. Each partition will have an activation/deactivation mechanism, which can be controlled. The basic idea is that the partition can be switched off when it is not used and consequently power can be saved. The methods in [29]–[32] have not exploited hardware level redundancy in operations, which can result in better performance of the system. In [33], an algorithm based on graphs was devised which reuses the hardware, resulting in less power consumption. But reusing hardware results in an increased amount of multiplexer logic being created, which degrades the system performance as discussed in [29]. The partitioning of coefficients into 3-bit groups in our proposed CSM is a high level synthesis transformation targeted to reduce power consumption. In the CSM architecture, the partitioned bit groups of coefficients are given as select signals to multiplexers. These multiplexers will load and unload different parts of the circuit and thus save a significant amount of power as discussed in Section III-B. In CMOS technology, there are three sources of power dissipation arising from switching (dynamic) currents, short circuit currents, and leakage currents. Among these parameters, the switching component, which is a function of the effective capacitance, plays the most significant role [28]. It is possible to reduce the power by employing transformations such as reductions in LD, number of operations, and average transition activity. In [28], it was shown that a binary tree-structured adder always ensures the lowest LD and consequently the least number of transitions. Our CSM and PSM architectures also employ the binary tree-structured approach so as to achieve low LD. It must be noted that as the coefficients are synthesized sequentially in GD algorithms, the resulting filter structures do not have a binary tree structure and hence will always result in an increased LD. Hence, the reconfigurable approaches in [14] and [15] which employ GD algorithms will result in increased power consumption when applied to high-level synthesis. In addition to reducing the LD in our CSM and PSM architectures, the number of operations is also reduced by employing the BCSE [6]. Furthermore, the proposed PSM architecture can make use

Fig. 6. Implementation of the proposed CSM and PSM architecture on Virtex 2v3000ff1152-4 FPGA.

TABLE I
Synthesis Results for an FIR Filter with 20 Taps and Coefficient Wordlength of 16 Bits

                            Proposed PSM    Proposed CSM
Gate count                  22 581          22 956
Sampling frequency (MHz)    24              26
Data arrival time (ns)      33.64           26.824

TABLE II
Synthesis Results for MB (PSM) with Different Coefficient Wordlengths

Wordlength                  8-bit     12-bit    16-bit
Gate count                  2878      3532      3771
Sampling frequency (MHz)    35        30        24
Data arrival time (ns)      7.96      8.84      9.92

of dynamic change of coefficient wordlength which will save a significant amount of dynamic power as explained in Section III-B. Thus, the proposed CSM and PSM approaches improve the efficiency of reconfigurable systems in high-level synthesis and offer a power efficient solution by reducing the LD as well as the number of operations (additions).

V. Experimental Results

In this section, the synthesis and design results of the proposed CSM and PSM architectures are presented and compared with the recently proposed reconfigurable FIR filter architectures in the literature [11], [12], [14], [15], [20], [21].

A. Synthesis Results

We have used Xilinx 8.1i ISE for synthesis. The synthesis has been done on Xilinx’s Virtex-II 2v3000ff1152-4 FPGA. Table I shows the synthesis results of the CSM and PSM 20-tap FIR filter that has a coefficient wordlength of 16 bits. We have done the implementation of filters with different passband edge (ωp) and stopband edge (ωs) specifications given by: 1) ωp = 0.1π, ωs = 0.12π; 2) ωp = 0.15π, ωs = 0.2π; 3) ωp = 0.2π, ωs = 0.22π; and 4) ωp = 0.2π, ωs = 0.3π, respectively. Even though the proposed architectures are reconfigurable, the usage of adders and shifters is dependent on the filter coefficient values. Some of the adders may not be used by the multiplexers. As a result, they are unloaded and do not consume any dynamic power. Hence, the power and speed values of the synthesis results are dependent on the filter coefficients, and we have therefore considered an average of the synthesis results in all the tables in this paper. From the comparison it is evident that the CSM requires 475 gates more than PSM, whereas PSM requires 6.82 ns more for the data to arrive at the output compared to CSM. Thus, the CSM results in higher speed whereas the PSM results in lower area. The lower speed of PSM is due to the presence of programmable shifters and its smaller area is due to the elimination of redundant additions by the BCSE algorithm. We have also analyzed the effect of the MB for different filter coefficient wordlengths of 8, 12, and 16 bits for the PSM architecture. The results are shown in Table II. It can be noted that as the precision of the coefficient is made high,

TABLE III
Synopsys Synthesis Results for 20-Tap FIR Filter Implementation of Section V-B

BPSM = binary programmable shifts method; BCSM = binary constant shifts method

                      BPSM      BCSM      CSD-CSM   CSD-PSM   FIR Filter [15]
Area (mm^2)           0.2594    0.275     0.304     0.2796    0.5467
Delay (ns)            8.2       7.67      8.5       9.34      15.6
Dynamic power (mW)    5.98      7.8       10        13.97     16
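
As a rough cross-check on how the entries of Table III translate into percentage savings, the sketch below computes a relative reduction as (reference − proposed) / reference × 100. The helper names `reduction_percent` and `compare` and the dictionary layout are illustrative assumptions, not part of the original work; the numbers are copied from Table III.

```python
# Values copied from Table III (20-tap filter, 0.18 um CMOS).
# The key "FIR[15]" is only a label for the FIR filter of reference [15].
TABLE_III = {
    "BPSM":    {"area_mm2": 0.2594, "delay_ns": 8.2,  "power_mw": 5.98},
    "BCSM":    {"area_mm2": 0.275,  "delay_ns": 7.67, "power_mw": 7.8},
    "CSD-CSM": {"area_mm2": 0.304,  "delay_ns": 8.5,  "power_mw": 10.0},
    "CSD-PSM": {"area_mm2": 0.2796, "delay_ns": 9.34, "power_mw": 13.97},
    "FIR[15]": {"area_mm2": 0.5467, "delay_ns": 15.6, "power_mw": 16.0},
}


def reduction_percent(proposed, reference):
    """Relative saving of `proposed` with respect to `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference


def compare(proposed_name, reference_name, table=TABLE_III):
    """Return the percentage reduction for every metric listed in the table."""
    proposed, reference = table[proposed_name], table[reference_name]
    return {m: reduction_percent(proposed[m], reference[m]) for m in proposed}


if __name__ == "__main__":
    for ref in ("CSD-CSM", "CSD-PSM", "FIR[15]"):
        for prop in ("BCSM", "BPSM"):
            row = compare(prop, ref)
            print(prop, "vs", ref, {k: round(v, 1) for k, v in row.items()})
```

Under this definition, BCSM against the filter of [15] comes out at roughly 49.7% (area), 50.8% (delay), and 51.3% (power). Taking the reference design as the denominator is an assumption of this sketch.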

TABLE IV
Synopsys Synthesis Results for 32-Tap FIR Filter Implementation of Section V-B

               BPSM     BCSM     FIR Filter [11]   FIR Filter [14]
Area (mm^2)    0.245    0.27     1.47              1.394
Delay (ns)     4.2      3.67     5.97              7.17
Power (mW)     4.3      5.2      8.5               6.5

the area consumption is increased and the speed of operation is reduced. Thus, by choosing the appropriate filter coefficient wordlength, it is possible to obtain reduced area and power as well as increased speed for the PSM architecture.

B. CSD Based Reconfigurable FIR Filter Architecture

CSD based CSE algorithms are considered to be among the best algorithms for low complexity fixed-coefficient FIR filter implementations. However, to the best of our knowledge, the implementation of CSD-CSE based reconfigurable filter architectures has not been addressed in the literature. We have implemented a CSD based FIR filter using the CSM architecture (CSD-CSM) and a CSD-CSE based FIR filter using the PSM architecture (CSD-PSM). For low complexity, we have employed the CSE algorithm in [3] on the coefficients before they are stored in the LUT. We have implemented a CSD based shift and add unit to generate common subexpressions (CSs) such as [1 0 1], [1 0 −1], [1 0 0 1], and [1 0 0 −1] and their negated versions. In the previous works based on the CSE algorithm [3]–[5], it was considered that CSs such as [−1 0 −1] and [−1 0 1] can be generated from their respective negated versions [1 0 1] and [1 0 −1] without using any extra adder by configuring the existing adder as a subtractor. But this is applicable only for fixed coefficient filters. An n-bit adder circuit would require n additional XOR gates to reconfigure the adder to subtractor mode. These additional XOR gates would increase the critical path of the adder circuit (equivalent to the delay imposed by n half-adders) and impose overheads for CSD implementation of the FIR filter. Another drawback of CSD implementation is the storage of coefficients in the LUT. A CSD value like [1 0 −1 0 −1 0 1 0 −1] can be stored in an LUT as [01 00 11 00 11 00 01 00 11] with “00” corresponding to 0, “01” corresponding to 1, and “11” corresponding to −1. Therefore, for the worst-case scenario, an 8-bit CSD coefficient requires 16 bits for its representation. This can be optimized as no adjacent bits in CSD are ones. But CSD still requires more bits than binary. Since all the bits in binary representation are positive, this problem does not arise. Thus, the additional half-adders required for implementing the adder/subtractor circuit and the additional storage space required for CSD will increase the area, and the additional half-adders in the adder/subtractor unit reduce the speed of operation of the CSD based reconfigurable FIR filters compared to binary based FIR filter implementations. This becomes highly significant, as the order of the channel filters in wireless communication transceivers is very high.

We have done the synthesis using the Synopsys tool for all the FIR filter specifications mentioned in Section V-A on 0.18 µm CMOS technology. The synthesis results for a 20-tap FIR filter with 16-bit coefficient wordlength are summarized in Table III. The proposed CSM and PSM architectures which employ binary representation of filter coefficients are denoted as BCSM and BPSM, respectively. The CSD based implementations of CSM and PSM are denoted as CSD-CSM and CSD-PSM, respectively. Table III shows that the CSD-CSM and CSD-PSM architectures consume more area and power and are slower than our binary representation based BPSM and BCSM architectures. The BCSM architecture has an area reduction of 10% and 1% over the CSD-CSM and CSD-PSM architectures, respectively, and the area reduction for the BPSM architecture over the CSD-CSM and CSD-PSM architectures is 15% and 7%, respectively. The improvement in the speed of operation for the BCSM architecture over the CSD-CSM and CSD-PSM architectures is 10% and 22%, respectively. The BPSM architecture offers an improvement in the speed of operation of 4% and 12% over the CSD-CSM and CSD-PSM architectures, respectively. The dynamic power reductions for the BCSM architecture are 22% and 44% over the CSD-CSM and CSD-PSM architectures, respectively. The BPSM architecture offers dynamic power reductions of 40% and 57% over the CSD-CSM and CSD-PSM architectures, respectively. The BPSM architecture offers area and power reductions of 6% and 23% over the BCSM architecture, respectively. The BCSM architecture offers an improvement in the speed of operation of 7% compared to the BPSM architecture. In Table III, the proposed architectures are also compared with the MMCM architecture based FIR filter in [15]. The BCSM architecture offers an area reduction of 49.7%, a power reduction of 51.3%, and a speed improvement of 50.8% over the MMCM [15]. The area and power reductions offered by the BPSM architecture over MMCM [15] are 52.7% and 62.5%, respectively, with

TABLE V
Synopsys Synthesis Results for 18-Tap FIR Filter Implementation of Section V-B

               BPSM     BCSM     FIR Filter [12]
Area (mm^2)    0.137    0.14     13.872
Delay (ns)     5.7      5.47     7

TABLE VI
Synopsys Synthesis Results for D-AMPS Channel Filter Implementations

               BPSM     BCSM     CSD-CSM   CSD-PSM
Area (mm^2)    4.531    4.82     5.82      5.17
Delay (ns)     9.87     9.08     13.5      15.34
Power (mW)     48.3     84       105       114.9
an improvement in speed of 47.7%. It must be noted that the MMCM [15] architecture is limited to a maximum filter length of 40 whereas no such restrictions exist for the proposed architectures. Table IV shows the comparison of a 32-tap FIR filter with coefficient wordlength of eight bits using the BCSM and BPSM architectures and the FIR filters given in [11] and [14]. It can be seen from Table IV that the BCSM architecture offers an area reduction of 81% over the architecture in [11] with a reduction in power consumption of 38.8% and an increase in speed of 38.5%. Table IV also shows that the BPSM results in an area reduction of 83.7%, a reduction in power consumption of 45.9%, and an increase in speed of 29.7% over the architecture in [11], respectively. Compared to the architecture in [14], the BCSM offers an area reduction of 81%, a power reduction of 20%, and a speed improvement of 48.8%, whereas the BPSM offers area and power reductions of 82.8% and 29.2%, respectively, and a speed improvement of 41.4%. Table V shows the comparison of an 18-tap FIR filter with a coefficient wordlength of ten bits using the BCSM and BPSM architectures and the FIR filter given in [12]. It can be seen from Table V that the area of the filter architecture in [12] is almost 100% more than the BPSM and BCSM architectures. The increase in area and delay of [12] is due to the employment of Booth encoding schemes, Wallace adder trees, etc.

Thus, from the comparisons, it can be concluded that the proposed BCSM and BPSM architectures are equally suitable for higher and lower order filters. In the next section, a higher order channel filter in an SDR receiver using our architectures is analyzed.

C. Design Results

In this example, the FIR filters employed in the filter bank channelizer of Digital Advanced Mobile Phone Systems (D-AMPS) in [17] are considered. The sampling rate chosen is 34.02 MHz as in [17]. The channel filters extract 30 kHz D-AMPS channels from the input signal after downsampling by a factor of 350. The passband and stopband edges are 30 and 30.5 kHz, respectively. The peak passband ripple is chosen as 0.1 dB. The filter stopband specifications are chosen as in the D-AMPS standard [18]. The length of the FIR filter N is determined using [19]

N = \frac{-10 \log_{10}(\delta_1 \delta_2) - 13}{14.6\,\Delta f} + 1    (13)

where δ1 and δ2 are the peak passband and stopband ripples, respectively, and Δf is the normalized width of the transition band. We have chosen a stopband ripple of −24 dB and the transition width to be 0.01 so that the optimum filter length is found to be 350 according to (13). The coefficient wordlength used in our implementation is 16 bits. It must be noted that the proposed architectures do not alter the coefficient values and therefore the frequency response of the filter implemented using our architectures is unaffected.

We have done the synthesis of the above channel filter on 0.18 µm CMOS technology. The synthesis results for a 350-tap FIR filter with 16-bit coefficient wordlength are shown in Table VI. From Table VI, it can be seen that the CSM and PSM architectures give much better results compared to the CSD-CSM and CSD-PSM architectures. The BCSM architecture has an area reduction of 17% and 6.6% over the CSD-CSM and CSD-PSM architectures, respectively, and the area reduction for the BPSM architecture over the CSD-CSM and CSD-PSM architectures is 22.8% and 12.2%, respectively. The improvement in the speed of operation for the BCSM architecture over the CSD-CSM and CSD-PSM architectures is 32.8% and 40.8%, respectively. The BPSM architecture offers an improvement in the speed of operation of 26.9% and 35.66% over the CSD-CSM and CSD-PSM architectures, respectively. The dynamic power reductions for the BCSM architecture are 20% and 24% over the CSD-CSM and CSD-PSM architectures, respectively. The BPSM architecture offers dynamic power reductions of 53.3% and 57.4% over the CSD-CSM and CSD-PSM architectures, respectively. The BPSM architecture offers area and power reductions of 6% and 42% over the BCSM architecture, respectively. The BCSM architecture offers an improvement in the speed of operation of 8% compared to the BPSM architecture.

D. Comparison with FIR Filter [11]

In this section, we present a comparison of the proposed CSM and PSM architectures with the reconfigurable architecture proposed in a recent work [11]. In [11], a low power CSD based digit reconfigurable FIR filter was proposed. The architecture in [11] performs multiplication using the direct shift and add method and hence will result in low speed operation. If we consider a coefficient of n-bit wordlength, then the LD (critical path length) for the method in [11] is n as it is digit based. But for our CSM architecture, the LD is only (log_2(⌈n/3⌉ − 1)) + 2. Also, for the PSM architecture, the LD is (log_2 b) + 2, where “b” is the number of non-zero operands in the worst case coefficient after the application of BCSE as explained in Section III-B. In both CSM and PSM, “2” denotes the LD of the 3-bit BCS-based shift and add unit. Note that “b” is considerably less than “n” in the PSM as common subexpressions are used. Consequently, the LD of the PSM is also less. It can be seen that the digit processing unit (DPU) is used in series to form the processing

TABLE VII
Comparison of the Proposed CSM Approach with Approach in [20], [21]

                    Proposed CSM                                         Approach [20], [21]
                    Shift and   Multiplexer   Final Adder   Over-all     Shift and   Multiplexer   Final Adder   Over-all
                    Add Unit    Unit          Unit                       Add Unit    Unit          Unit
4 input LUTs        2058        3026          2976          8976         2987        5434          2342          10 875
Slices              1420        1768          1678          5023         1976        2345          1245          5800
Slice flip-flops    1125        1450          1345          4198         1597        2467          1056          5170
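
To make the composition of the over-all figures in Table VII easier to follow, the sketch below sums the three per-unit columns and reports what remains, which the text attributes to the LUTs storing the coefficients and to the complementer. It also gives the over-all saving of the CSM relative to [20], [21]. The data layout and the function names `remainder` and `overall_saving_percent` are assumptions made for illustration; the figures are copied from Table VII.

```python
# FPGA resource counts copied from Table VII.
# Tuple order: (shift and add unit, multiplexer unit, final adder unit, over-all).
TABLE_VII = {
    "4-input LUTs": {
        "CSM": (2058, 3026, 2976, 8976),
        "[20],[21]": (2987, 5434, 2342, 10875),
    },
    "Slices": {
        "CSM": (1420, 1768, 1678, 5023),
        "[20],[21]": (1976, 2345, 1245, 5800),
    },
    "Slice flip-flops": {
        "CSM": (1125, 1450, 1345, 4198),
        "[20],[21]": (1597, 2467, 1056, 5170),
    },
}


def remainder(row):
    """Over-all count minus the three listed units (coefficient storage, complementer)."""
    shift_add, mux, final_adder, overall = row
    return overall - (shift_add + mux + final_adder)


def overall_saving_percent(resource, table=TABLE_VII):
    """Over-all saving of the CSM column relative to the [20],[21] column, in percent."""
    csm = table[resource]["CSM"][3]
    ref = table[resource]["[20],[21]"][3]
    return 100.0 * (ref - csm) / ref


if __name__ == "__main__":
    for resource, methods in TABLE_VII.items():
        for method, row in methods.items():
            print(resource, method, "remainder:", remainder(row))
        print(resource, "over-all saving (%):", round(overall_saving_percent(resource), 1))
```

For the 4-input LUT row, this gives a remainder of 916 for the CSM and 112 for [20], [21], and an over-all saving of roughly 17.5% for the CSM.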

element. Thus, for a higher order filter, the number of DPUs will be significantly large, which would delay the filtering operation substantially. The proposed architectures employ CSE techniques and hence a much faster filtering operation for any filter-length is feasible. Also, the method in [11] is CSD based, which has many inherent difficulties, as explained in detail in Section V-B. A more detailed comparison between the method in [11] and the proposed architectures is also given in Section V-B.

E. Comparison with FIR Filters [20], [21]

The main difference of the proposed CSM architecture from the architectures in [20], [21] is the use of the BCSs-based shift and add unit and hardwiring of shifts. In the architectures proposed in [20], [21], there are pre-computers which are used to generate x, 3x, 5x, 7x, 9x, 11x, 13x, and 15x using nine adders employing a special carry select adder, where x is the input signal. This is in comparison with only seven adders required by our 4-bit BCSs-based shift and add unit. Thus, the CSM architecture offers adder reduction over the architectures in [20], [21] and is different from the latter ones because it employs a BCSE-based shift and add unit for complexity reduction. Another major difference is that [20], [21] employ two programmable shifters named SHIFTER and ISHIFTER with coefficient values as select values. The shifters were used to identify the most significant non-zero bit (digit) in each filter coefficient. These shifters should always be preceded by 8:1 multiplexers as shown in Fig. 6 of [20] and hence the multiplexer complexity is also not reduced. These programmable shifters will reduce the overall speed of operation of the resulting filters, especially for higher order channel filter applications in wireless communication receivers. In the proposed CSM architecture, all the shifts are constants and hence can be hardwired using a constant propagation tool, which results in better speed of operation compared to the methods in [20], [21]. This can be clarified using an example. For a 16-bit coefficient, the proposed CSM (3-bit BCSs-based shift and add unit) architecture requires five 8:1 multiplexers and one 2:1 multiplexer (equivalent to twenty-one 2:1 multiplexers), eight adders (three adders for shift and add unit and five adders for the final adder unit, i.e., 3 + 5n adders for n-tap

three adders for final adder unit, i.e., 9 + 3n adders for n-tap filter) and eight programmable shifters. From the above example, it is evident that the proposed CSM is less complex than the methods in [20], [21].

For a quantitative comparison, consider a 16-bit coefficient with an 8-bit quantized input signal. The proposed CSM approach and the approach in [20], [21] have been implemented on a Virtex-4 XC4VSX35-10ff668 FPGA and the complexities of the shift and add unit, multiplexer unit, final adder unit, and over-all complexities have been compared. The comparison results (synthesis results on FPGA) are shown in Table VII. The over-all complexity (shown in the right-most column under the respective methods in Table VII) is the sum of the complexities associated with the shift and add unit, multiplexer unit, and final adder unit, that of the LUTs used for storing the coefficients, and the complexity of the complementer. It should be noted that the final adder unit in [20], [21] is optimized by using carry select addition. However, the architecture implemented for this comparison uses ripple carry adders for a fair and straightforward comparison. We used ripple carry adders on account of their low power consumption and simple layout. From Table VII, it is clear that the complexity of the shift and add unit and the multiplexer unit is more for the approach in [20], [21] and the complexity of the final adder unit is more for the proposed CSM approach. It is evident from Table VII that the over-all complexity of the proposed CSM approach is less than that of the approach in [20], [21].

VI. Implementation Results

We have implemented the proposed CSM and PSM architectures for a 20-tap FIR filter with 16-bit coefficient precision on Xilinx’s Virtex-II 2v3000ff1152-4 FPGA associated with the dual DSP-FPGA Signalmaster kit provided by Lyrtech [23]. A model based design using MATLAB Simulink and Xilinx System Generator was employed for the implementation as shown in Fig. 6 (copied directly from the Simulink environment). Fig. 6 consists of eight components/blocks whose details are given below.

1) Multi-Tone Input Signal: A multi-tone input signal was generated by summing up sine waves of frequencies
filter). Note that programmable shifters are not requited in 300 Hz, 1000 Hz, 2500 Hz, 3500 Hz, and 4200 Hz, each
CSM since all shifts are constants which can be hardwired. sampled at 10 MHz. Note that the signal frequencies
On the other hand, the approach in [20], [21] requires four and the sampling frequency in this example are only
8:1 multiplexers (main multiplexers) + four 4:1 multiplexers for illustration purposes. By dynamically changing the
(for programmable shifters if not implemented using power input frequencies using the function in Simulink, we
consuming barrel shifters) (equivalent to twenty four 2:1 verified that the CSM and PSM architectures work well
multiplexers), 12 adders (nine adders for precomputers and for frequencies of several tens of MHz.
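As a rough illustration of the multi-tone stimulus described in item 1, the following Python sketch sums the five sine waves at the stated 10 MHz sampling rate. Unit amplitudes and zero initial phases are assumptions, since the text does not specify them, and the function name `multitone` is hypothetical. The original signal was generated in Simulink, not Python.

```python
import math

# Tone frequencies (Hz) and sampling rate (Hz) as stated in item 1.
TONES_HZ = [300.0, 1000.0, 2500.0, 3500.0, 4200.0]
FS_HZ = 10e6


def multitone(num_samples, tones_hz=TONES_HZ, fs_hz=FS_HZ):
    """Return num_samples of a sum of sines.

    Assumption: every tone has unit amplitude and zero initial phase.
    """
    return [
        sum(math.sin(2.0 * math.pi * f * n / fs_hz) for f in tones_hz)
        for n in range(num_samples)
    ]


# Example: the first 1000 samples (100 microseconds of signal).
samples = multitone(1000)
```

At a 10 MHz sampling rate, one full period of the 300 Hz tone spans about 33 333 samples, so a much longer record than the 1000 samples above is needed before all five tones become distinguishable in a spectrum.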
2) Lyrtech Signal Master Controller: The Lyrtech signal master controller consists of three components: 1) the board configuration for configuring the FPGA/DSP and for downloading the bitstream to the FPGA; 2) the Xilinx System Generator for generating the bitstream to be downloaded to the FPGA; and 3) the log viewer, which gives implementation information about the area and delay of the architecture used. Xilinx XPower can be employed for the calculation of power dissipation.
3) Coefficient Controller: The CSM and PSM architectures have been implemented with provision for changing the filter coefficients dynamically. This was made possible with the help of a multiport switch. Based on the select signal value, the switch selects one of the inputs, h2 to h6, which are the coefficients stored in LUTs, generated using a MATLAB program according to the specifications of the architecture as explained in Sections III-A and III-B. As shown in Fig. 6, h2 corresponds to a lowpass filter with a cutoff frequency of 500 Hz, h3 to a cutoff frequency of 1000 Hz, h4 to a cutoff frequency of 2000 Hz, h5 to a cutoff frequency of 3000 Hz, and h6 to a cutoff frequency of 4000 Hz. For example, if the constant value is 4 (as shown in Fig. 6), the specification h5 will be chosen, which corresponds to a lowpass filter with a cutoff frequency of 3000 Hz. Simulink provides an option to change the constant value dynamically. Hence, if we change the constant value to 1, the coefficient h2, corresponding to a lowpass filter of 500 Hz, will be chosen and the output of the filter changes dynamically. We used this scheme to achieve dynamic reconfigurability. We have employed only five filter specifications, h2 to h6, but it is possible to include filters for additional specifications.
4) Coefficient Extractor: The coefficient extractor is used to extract the coefficients individually and to provide the extracted coefficient to each processing element of the proposed CSM and PSM architectures.
5) Gateways: Gateways are employed as an interface between Xilinx’s blocks which are used for developing the proposed CSM and PSM architectures. Thus, gateways provide the connection of the bitstream file with input sources and output sinks of the Simulink environment.
6) Simulation Architecture: This forms the CSM and PSM architectures developed as shown in Figs. 4 and 5, respectively, using the Xilinx block set in the Simulink library. The simulation architecture is used as a reference for comparing with the hardware implementation on the FPGA (bitstream running on FPGA).
7) Bitstream Running on FPGA: We have generated the bitstream of the simulation architecture using the Xilinx System Generator. The generated bitstream can be downloaded to the FPGA, and it appears as a “Lyrtech Cosim Engine” block. The performances of the bitstream and the simulation architecture were checked to ensure that they are identical.
8) Fast Fourier Transform (FFT) of Outputs: The FFT block was employed for observing the outputs of the simulation and FPGA architectures. The FFT block plots the energy of the output in dB against the frequencies.

The implementation results are shown in Table VIII. The area and delay results are obtained using the log viewer block in Fig. 6. The power dissipation is obtained using Xilinx XPower. The results show that the proposed CSM has lower delay than the proposed PSM, whereas the latter has lower area and power. From Table VIII, it can be concluded that the CSM architecture results in an improvement in speed of 5.3% over the PSM architecture, whereas the PSM architecture results in area and power reductions of 8.2% and 8% over the CSM architecture, respectively.

TABLE VIII
Implementation Results for Proposed Architectures with 20 Taps and Coefficient Wordlength of 16 Bits

                              Proposed CSM    Proposed PSM
Gate count                    27 979          25 679
Data arrival time (ns)        38.672          40.824
Sampling frequency (MHz)      26              24
Power dissipation (mW)        408.29          375.45

VII. Conclusion

We have proposed two new approaches, namely CSM and PSM, for implementing reconfigurable higher order filters with low complexity. The CSM architecture results in high speed filters, and the PSM architecture results in low area and thus low power filter implementations. The PSM also provides the flexibility of changing the filter coefficient wordlengths dynamically. We have implemented the architectures on a Virtex-II 2v3000ff1152-4 FPGA, synthesized them using 0.18 µm CMOS technology with a high coefficient precision of 16 bits, and compared them to numerous reconfigurable FIR filter architectures. The proposed reconfigurable architectures can be easily modified to employ any CSE (MCM) method. Thus, our method is a general approach for low complexity reconfigurable channel filters.

References

[1] T. Hentschel and G. Fettweis, “Software radio receivers,” in CDMA Techniques for Third Generation Mobile Systems. Dordrecht, The Netherlands: Kluwer Academic, 1999, pp. 257–283.
[2] J. Mitola, “Object-oriented approaches to wireless systems engineering,” in Software Radio Architecture. New York: Wiley, 2000.
[3] R. I. Hartley, “Subexpression sharing in filters using canonic signed digit multipliers,” IEEE Trans. Circuits Syst. II, vol. 43, no. 10, pp. 677–688, Oct. 1996.
[4] R. Pasko, P. Schaumont, V. Derudder, S. Vernalde, and D. Durackova, “A new algorithm for elimination of common subexpressions,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 18, no. 1, pp. 58–68, Jan. 1999.
[5] M. M. Peiro, E. I. Boemo, and L. Wanhammar, “Design of high-speed multiplierless filters using a nonrecursive signed common subexpression algorithm,” IEEE Trans. Circuits Syst. II, vol. 49, no. 3, pp. 196–203, Mar. 2002.
[6] R. Mahesh and A. P. Vinod, “A new common subexpression elimination algorithm for realizing low complexity higher order digital filters,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 27, no. 2, pp. 217–219, Feb. 2008.
[7] A. P. Vinod and E. Lai, “Low power and high-speed implementation of FIR filters for software defined radio receivers,” IEEE Trans. Wireless Commun., vol. 5, no. 7, pp. 1669–1675, Jul. 2006.
[8] T. Solla and O. Vainio, “Comparison of programmable FIR filter architectures for low power,” in Proc. 28th Eur. Solid-State Circuits Conf., Firenze, Italy, Sep. 2002, pp. 759–762.
[9] T. Solla, R. Mäkelä, M. Liljeroos, and O. Vainio, “Application-specific filter processor for flexible receivers,” in Proc. 19th NORCHIP Conf., Kista, Sweden, Nov. 2001, pp. 53–58.
[10] D. Hwang, C. Mittelsteadt, and I. Verbauwhede, “Low power showdown: Comparison of five DSP platforms implementing an LPC speech codec,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Salt Lake City, UT, May 2001, pp. 1125–1128.
[11] K. H. Chen and T. D. Chiueh, “A low-power digit-based reconfigurable FIR filter,” IEEE Trans. Circuits Syst. II, vol. 53, no. 8, pp. 617–621, Aug. 2006.
[12] T. Zhangwen, J. Zhang, and H. Min, “A high-speed, programmable, CSD coefficient FIR filter,” IEEE Trans. Consumer Electron., vol. 48, no. 4, pp. 834–837, Nov. 2002.
[13] X. Chenghuan, C. He, Z. Shunan, and W. Hua, “Design and implementation of a high-speed programmable polyphase FIR filter,” in Proc. 5th Int. Conf. Applicat.-Specific Integr. Circuit, vol. 2. Oct. 2003, pp. 783–787.
[14] S. S. Demirsoy, I. Kale, and A. G. Dempster, “Efficient implementation of digital filters using novel reconfigurable multiplier blocks,” in Proc. 38th Asilomar Conf. Signals Syst. Comput., vol. 1. Nov. 2004, pp. 461–464.
[15] P. Tummeltshammer, J. C. Hoe, and M. Puschel, “Multiplexed multiple constant multiplication,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 26, no. 9, pp. 1551–1563, Sep. 2007.
[16] A. P. Vinod and E. M.-K. Lai, “An efficient coefficient-partitioning algorithm for realizing low complexity digital filters,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 12, pp. 1936–1946, Dec. 2005.
[17] K. C. Zangi and R. D. Koilpillai, “Software radio issues in cellular base stations,” IEEE J. Select. Areas Commun., vol. 17, no. 4, pp. 561–573, Apr. 1999.
[18] N. Spencer, “An overview of digital telephony standards,” in Proc. IEE Colloq. Design Dig. Cellular Handsets, Mar. 1998, pp. 1/1–1/7.
[19] J. G. Proakis and D. G. Manolakis, “Design of digital filters,” in Digital Signal Processing Principles, Algorithms, and Applications. Upper Saddle River, NJ: Prentice-Hall, 1998, pp. 614–738.
[20] K. Muhammad and K. Roy, “Reduced computational redundancy implementation of DSP algorithms using computation sharing vector scaling,” IEEE Trans. Very Large Scale Integr. Syst., vol. 10, no. 3, pp. 292–300, Jun. 2002.
[21] J. Park, W. Jeong, H. Mahmoodi-Meimand, Y. Wang, H. Choo, and K. Roy, “Computation sharing programmable FIR filter for low-power and high-performance applications,” IEEE J. Solid State Circuits, vol. 39, no. 2, pp. 348–357, Feb. 2004.
[22] Y. Wang, H. Mahmoodi, L.-Y. Chiou, H. Choo, J. Park, W. Jeong, and K. Roy, “Hardware architecture and VLSI implementation of a low-power high-performance polyphase channelizer with applications to subband adaptive filtering,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., vol. 5. May 2004, pp. 97–100.
[23] http://www.lyrtech.com/index.php?act=view&pv=SignalMaster space Quad
[24] C. Y. Yao, H. H. Chen, C. J. Chien, and C. T. Hsu, “A novel common-subexpression-elimination method for synthesizing fixed-point FIR filters,” IEEE Trans. Circuits Syst. I, vol. 51, no. 11, pp. 2215–2221, Nov. 2004.
[25] I. C. Park and H. J. Kang, “FIR filter synthesis algorithms for minimizing the delay and the number of adders,” IEEE Trans. Circuits Syst. II, vol. 48, no. 8, pp. 770–777, Aug. 2001.
[26] R. A. Walker and D. E. Thomas, “Introduction,” in A Survey of High-Level Synthesis Systems. Boston, MA: Kluwer, 1991, pp. 20–37.
[27] A. P. Vinod and E. M.-K. Lai, “On the implementation of efficient channel filters for wideband receivers by optimizing common subexpression elimination methods,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 2, pp. 295–304, Feb. 2005.
[28] A. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R. W. Brodersen, “Optimizing power using transformations,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 14, no. 1, pp. 12–31, Jan. 1995.
[29] M. Meribout and M. Motomura, “A combined approach to high-level synthesis for dynamically reconfigurable systems,” IEEE Trans. Comput., vol. 53, no. 12, pp. 1508–1522, Dec. 2004.
[30] X.-J. Zhang, K.-W. Ng, and W. Luk, “A combined approach to high-level synthesis for dynamically reconfigurable systems,” in Proc. 10th Int. Workshop Field Programmable Logic Applicat., 2000, pp. 361–370.
[31] R. Kress and A. Pyttel, “High-level synthesis for dynamically reconfigurable hardware/software systems,” in Proc. 8th Int. Workshop Field-Programmable Logic Applicat. Field-Programmable Gate Arrays Comput. Paradigm, 1998, pp. 288–297.
[32] A. Rettberg and F.-J. Rammig, “Integration of energy reduction into high-level synthesis by partitioning,” in Proc. Working Conf. Distributed Parallel Embedded Syst. (DIPES), Braga, Portugal, Oct. 2006, pp. 225–234.
[33] N. Moreano, E. Borin, C. de Souza, and G. Araujo, “Efficient datapath merging for partially reconfigurable architectures,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 7, pp. 969–980, Jul. 2005.
[34] R. Mahesh and A. P. Vinod, “Reconfigurable low complexity FIR filters for software radio receivers,” in Proc. 17th IEEE Int. Symp. Personal Indoor Mobile Radio Commun. (PIMRC), Helsinki, Finland, Sep. 2006, pp. 1–5.
[35] M. Potkonjak, M. B. Srivastava, and A. P. Chandrakasan, “Multiple constant multiplications: Efficient and versatile framework and algorithms for exploring common subexpression elimination,” IEEE Trans. Comput.-Aided Design, vol. 15, no. 2, pp. 151–165, Feb. 1996.
[36] B. Parhami, “Implementation details,” in Computer Arithmetic. New York: Oxford Press, 2000, p. 131.
[37] Y. Linn, “Efficient loop filter design in FPGAs for phase lock loops in high-data rate wireless receivers: Theory and case study,” in Proc. 6th Annu. Wireless Telecommun. Symp., Pomona, CA, Apr. 2007, pp. 1–8.

R. Mahesh (M’06) received the B.Tech. degree in electrical and electronics engineering from Mahatma Gandhi University, Kottayam, Kerala, India, in 2003, and the Ph.D. degree in computer engineering from Nanyang Technological University, Singapore, in 2009. He was a Lecturer at the College of Engineering, Mahatma Gandhi University, from August 2003 to July 2005. Currently, he is a Research Fellow with the School of Computer Engineering, Nanyang Technological University. His main research interests include low complexity and high speed digital signal processing circuits and computer arithmetic.

A. P. Vinod (SM’01) received the B.Tech. degree in instrumentation and control engineering from the University of Calicut, Malappuram, Kerala, India, in 1994, and the M.E. and Ph.D. degrees in computer engineering from Nanyang Technological University (NTU), Singapore, in 2000 and 2004, respectively. From 1993 to 1998, he was an Automation Engineer with Kirloskar, Bangalore, India, Tata Honeywell, Pune, India, and Shell Singapore, Singapore. From September 2000 to September 2002, he was a Lecturer at the School of Electrical and Electronic Engineering, Singapore Polytechnic, Singapore. He was a Lecturer at the School of Computer Engineering, NTU from September 2002 to November 2004. Since December 2004, he has been an Assistant Professor with the School of Computer Engineering, NTU. He has published more than 90 papers in refereed journals and international conferences. His research interests include digital signal processing, low power and reconfigurable digital signal processing circuits, software radio, cognitive radio, and brain–computer interface. Dr. Vinod is an Editor of the International Journal of Advancements in Computing Technology.
