Abstract: Reconfigurability and low complexity are the two key requirements of finite impulse response (FIR) filters employed in multistandard wireless communication systems. In this paper, two new reconfigurable architectures of low complexity FIR filters are proposed, namely the constant shifts method and the programmable shifts method. The proposed FIR filter architecture is capable of operating for different wordlength filter coefficients without any overhead in the hardware circuitry. We show that dynamically reconfigurable filters can be efficiently implemented by using common subexpression elimination algorithms. The proposed architectures have been implemented and tested on a Virtex 2v3000ff1152-4 field-programmable gate array and synthesized on 0.18 µm complementary metal–oxide–semiconductor technology with a precision of 16 bits. Design examples show that the proposed architectures offer good area and power reductions and speed improvement compared to the best existing reconfigurable FIR filter implementations in the literature.

Index Terms: Channelizer, common subexpression elimination, FIR filter, high level synthesis, reconfigurability.

I. Introduction

[...] intensive part of an SDR receiver is the channelizer since it operates at the highest sampling rate [2]. It extracts multiple narrowband channels from a wideband signal using a bank of FIR filters, called channel filters. Using a polyphase filter structure, decimation can be done prior to channel filtering so that the channel filters need to operate only at relatively low sampling rates. This can relax the speed of operation of the filters to a good extent [22]. However, due to the stringent adjacent channel attenuation specifications of wireless communication standards, higher order filters are required for channelization, and consequently the complexity and power consumption of the receiver will be high. As the ultimate aim of the future multi-standard wireless communication receiver is to realize its functionalities in mobile handsets, where its full utilization is possible, low power and low area implementation of FIR channel filters is essential. In [37], the filter multiplications are done via state machines in an iterative shift and add component, and as a result there are large savings in area. For lower order filters, the approach in [37] offers a good trade-off between speed and area. But in general, the
[...] application specific filters where the coefficients are fixed and hence not suitable for reconfigurable filters.

Several implementation approaches for reconfigurable FIR filters have been proposed in the literature [8]–[15]. These designs include either a fully programmable multiply-accumulate (MAC) based filter processor or dedicated architectures where the filter coefficients can be stored in registers. The architecture of a filter processor consists of a datapath with a single MAC unit, data and program memories, and a control unit [8], [9]. The datapath includes a 16-bit adder/subtractor, a multiplier, and a 32-bit accumulator. The performance of the processor is mainly restricted by the delay of this datapath, more specifically that of the multiplier. The main disadvantage of the filter processors is that the area and power requirements are significantly large. In [10], a comparison was done for the performance of speech based algorithms on dedicated architectures and general-purpose processors. It was shown that the power consumption of a general-purpose processor can be four times that of dedicated architectures for a complex algorithm [10].

The works in [11]–[15] and [20], [21] present reconfigurable FIR filter architectures. In [11], a CSD based digit reconfigurable FIR filter architecture was proposed. This architecture was independent of the number of taps because the number of taps and non-zero digits in each tap were arbitrarily assigned. The intention of the authors was to reduce the precision of coefficients and thus the filter complexity without affecting the filter performance. But the architecture in [11] demanded huge hardware resources, and this makes the method infeasible for power constrained SDR receiver applications. In [12], a high-speed programmable CSD based FIR filter was proposed. The filter architecture consisted of a programmable CSD based Booth encoding scheme and a partial product Wallace adder tree. The final adder was a carry look-ahead adder. Though this method offered a high speed solution, the resulting filters consume more power. Another high-speed programmable FIR filter based on polyphase decomposition was proposed in [13]. However, this method used the built-in block multipliers of the Virtex II field-programmable gate array (FPGA) and there was no consideration for the complexity reduction of the FIR filter. In [14], the concept of the reconfigurable multiplier block (ReMB) was introduced. The ReMB will generate all the coefficient products and a multiplexer will select the required ones depending on the input. It was shown that by pushing the multiplexer deep into the multiplier block design, the redundancy can be reduced. The resulting specialized multiplier design can be more efficient in terms of area and computational complexity compared to the general-purpose multiplier plus the coefficient store [14]. But the ReMB proposed in [14] has its area, power, and speed dependent on the filter length, making it inappropriate for higher order FIR filters. In [15], a multiplexed multiple constant multiplications (MMCM) approach was proposed. This method considers the coefficient set as a constant and uses the graph dependence (GD) algorithms for reducing redundancy. But this method follows a directed acyclic graph structure which will result in long LD and thus lower speed of operation as reported in [3], [24], [25]. Also the area of the architecture linearly increases with the filter length as in [14], and filters with filter length above 40 are infeasible. In [20], the common digital signal processing (DSP) operations such as filtering and matrix multiplication were identified and expressed as vector scaling operations. In order to apply vector scaling, simple number decomposition strategies were identified. The idea was to precompute values such as x, 3x, 5x, 7x, 9x, 11x, 13x, and 15x, where x is the input signal, and then reuse these precomputations efficiently using multiplexers. The presence of multiplexers gave the option of adaptive computing for the method in [20]. In [21], the method in [20] was modified and efficient circuit-level techniques, namely a new carry-select adder and conditional capture flip-flop, were used to further improve power and performance. The architectures in [11]–[15] and [20], [21] are appropriate only for relatively lower order filters and hence not suitable for channel filters in communication receivers.

Although a few works addressed the problem of reducing the complexity of coefficient multipliers in reconfigurable FIR filters, hardly any work demonstrated reconfigurability in higher order filters. Moreover, we note that there is sufficient scope for more work on complexity reduction in reconfigurable filters, especially for wireless communication applications where higher order filters are often required to meet the stringent adjacent channel attenuation specifications.

In this paper, we propose two architectures that integrate reconfigurability and low complexity to realize FIR filters. The FIR filter architectures proposed are called the constant shifts method (CSM) and the programmable shifts method (PSM). We have presented the preliminary design of these architectures in a recent conference paper [34]. In this journal paper, we elaborate the CSM and PSM architectures introduced in [34] by providing the detailed design. The design analysis of the architectures and their extension to high-level synthesis are presented. The proposed architectures have been synthesized on 0.18 µm complementary metal–oxide–semiconductor (CMOS) technology and compared with recent methods such as [11], [12], [14], and [15]. Also, we have implemented two CSD based methods based on our CSM and PSM to compare the complexities of the CSD and binary based CSE techniques. The implementation and verification of the proposed architectures on the Virtex 2v3000ff1152-4 FPGA is also presented. The proposed architectures consider coefficients as constants (as they are stored in LUTs) and the input signal as variable. The coefficient multiplication in such a case is known as multiple constant multiplications (MCM), i.e., multiplication of one variable (input signal) with multiple constants (filter coefficients) [35]. The MCM is then optimized for eliminating redundancy using our recently proposed BCSE algorithm [6] to minimize the filter complexity. The proposed CSM focuses on implementing FIR filters by partitioning the filter coefficients into fixed groups. The PSM has a pre-analysis part which eliminates the redundancy in filter coefficients using the BCSE algorithm. The advantage of CSM is that it produces high-speed filters at the cost of a slight increase in area and power consumption. On the contrary, the PSM produces filters with low area and power consumption at the cost of a slight increase in delay. Another advantage of PSM is
MAHESH AND VINOD: NEW RECONFIGURABLE ARCHITECTURES FOR IMPLEMENTING FIR FILTERS WITH LOW COMPLEXITY 277
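The multiple constant multiplication (MCM) view described in the introduction, where one input is multiplied by several fixed coefficients using only shifts and adds, can be illustrated with a minimal sketch. The function names and the integer constants 11 and 13 are hypothetical choices made for illustration; this is not the BCSE algorithm of [6], only a small example of how a shared subexpression reduces the adder count.

```python
def shift_add_product(x: int, coeff: int) -> int:
    """Multiply x by a non-negative integer constant using only shifts and adds."""
    acc = 0
    bit = 0
    while coeff >> bit:
        if (coeff >> bit) & 1:
            acc += x << bit  # one partial product per non-zero coefficient bit
        bit += 1
    return acc


def adders_without_sharing(coeff: int) -> int:
    """Adders needed for one constant when nothing is shared: non-zero bits minus one."""
    return max(bin(coeff).count("1") - 1, 0)


def mcm_shared(x: int) -> tuple[int, int]:
    """Compute 11x and 13x with the bit pattern '11' (3x) as a shared subexpression."""
    s = x + (x << 1)      # 3x, computed once (one adder)
    p11 = (x << 3) + s    # 11x = 8x + 3x (one adder)
    p13 = (s << 2) + x    # 13x = 12x + x (one adder)
    return p11, p13


x = 7
print(mcm_shared(x))
print(adders_without_sharing(11) + adders_without_sharing(13))
```

In this toy case the shared version uses three adders in total, whereas implementing 11x and 13x independently would use two adders each.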
Fig. 4. Architecture of PE for CSM.

[...] final adder unit will compute the sum of all the intermediate sums to obtain $h * x[n]$.

The architecture of the PE for CSM is shown in Fig. 4. The coefficient wordlength is considered as 16 bits. The filter coefficients are stored in the LUT in sign-magnitude form with the MSB reserved for the sign bit. The first bit after the sign bit is used to represent the integer part of the coefficient and the remaining 16 bits are used to represent the fractional part of the coefficient. Thus, each 16-bit coefficient is stored as an 18-bit value in the LUTs. Each row in the LUT corresponds to one coefficient. Note that only half the number of coefficients need to be stored, as FIR filter coefficients are symmetric. The coefficient values corresponding to $2^{0}$ to $2^{-14}$ are partitioned into groups of three bits and are used as select signals to multiplexers Mux1 to Mux5, i.e., the set $(2^{0}, 2^{-1}, 2^{-2})$ forms the select signal to Mux1 and so on. Since there are 3 bits, eight combinations are possible and hence Mux1 to Mux5 are 8:1 multiplexers. The value corresponding to $2^{-15}$ forms the select to a 2:1 multiplexer, Mux6. The output from the $i$th multiplexer is denoted as $r_i$. Note that even though we are taking coefficients with values up to a precision of 16 bits, the shifting of $2^{-1}$ is done finally as shown in (4) and (5), and hence the maximum shift will be $2^{-15}$. Mux7 determines whether the output needs to be complemented based on the sign bit of the filter coefficient and hence it is a 2:1 multiplexer. In FIR filters, coefficient values are always less than one. [In our design examples, we used the Parks–McClellan algorithm to design filters (using the "firpm" command in MATLAB).] Hence, we have not employed the integer bit. However, if an integer digit is required, the proposed architectures do not impose any restrictions to accommodate it.

In Fig. 4, the shifts are obtained as follows. Let $r_1$ to $r_6$ denote the outputs of Mux1 to Mux6, respectively. Then

$$y = 2^{-1} r_1 + 2^{-4} r_2 + 2^{-7} r_3 + 2^{-10} r_4 + 2^{-13} r_5 + 2^{-16} r_6. \tag{6}$$

Factoring out the common shifts, this can be written as

$$y = 2^{-1}\left[(r_1 + 2^{-3} r_2) + 2^{-6}\left[(r_3 + 2^{-3} r_4) + 2^{-6}(r_5 + 2^{-3} r_6)\right]\right]. \tag{7}$$

Substituting $(r_1 + 2^{-3} r_2)$, $(r_3 + 2^{-3} r_4)$, and $(r_5 + 2^{-3} r_6)$ by $r_7$, $r_8$, and $r_9$, respectively, we get

$$y = 2^{-1}\left[r_7 + 2^{-6}(r_8 + 2^{-6} r_9)\right]. \tag{8}$$

By substituting $(r_8 + 2^{-6} r_9)$ by $r_{10}$

$$y = 2^{-1}(r_7 + 2^{-6} r_{10}). \tag{9}$$

By substituting $(r_7 + 2^{-6} r_{10})$ by $r_{11}$

$$y = 2^{-1}(r_{11}). \tag{10}$$

The expressions from (6)–(10) are represented in Fig. 4. The main advantage of the CSM architecture is that all the shifts are constants irrespective of the coefficients and hence can be hardwired, resulting in high speed operation of the filter.

We have employed the shift and add unit which can generate all the 3-bit BCSs using only three adders. The impact of using higher order BCSs (4-bit, 5-bit BCSs, etc.) has also been investigated. The choice of the best shift and add unit will depend on the complexities of: 1) shift and add unit; 2) multiplexer unit; and 3) final adder unit. The number of adders needed to implement $n$-bit CSs is $2^{n-1} - 1$ [6]. Thus, shift and add units capable of generating 4-bit, 5-bit, and 6-bit BCSs would require 7, 15, and 31 adders, respectively. The LD is two adder-steps for both the 3-bit and the 4-bit BCSs-based shift and add units, and hence they have the same speed. The LD of the 5-bit and 6-bit BCSs-based shift and add units is the same, i.e., three adder-steps, which is one adder-step more than that of the 3-bit and 4-bit BCSs. Thus, the 3-bit BCSs-based shift and add unit requires fewer adders than the 4-bit BCSs-based shift and add unit (a reduction of four adders) with the same LD. The requirement of four additional adders would increase the complexity of the 4-bit BCSs-based shift and add unit. Note that the cost of the shift and add unit is independent of the number of coefficients (filter length), as the same shift and add unit is shared by all the coefficients. In the proposed CSM architecture, $W/3$ 8:1 multiplexers (with a remaining 2:1 or 4:1 multiplexer in some cases) of bit-width $(x + 2)$ are required, where $W$ is the coefficient wordlength and $x$ is the input data wordlength. For example, if $W = 16$ (16-bit coefficient), the proposed 3-bit BCSs-based approach requires five 8:1 multiplexers and one 2:1 multiplexer. On the other hand, if 4-bit BCSs were used instead of 3-bit BCSs, four 16:1 multiplexers are required. Assuming an 8:1 multiplexer is equivalent to four 2:1 multiplexers and a 16:1 multiplexer is equivalent to eight 2:1 multiplexers, the 3-bit BCSs-based PE requires twenty one 2:1 multiplexers and the 4-bit BCSs-based PE requires thirty two 2:1 multiplexers. Thus, the multiplexer complexity would increase when 4-bit BCSs are used. To be more precise, for each PE with 16-bit filter coefficients, the multiplexer complexity of 4-bit BCSs-based
280 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 29, NO. 2, FEBRUARY 2010
PE is increased by eleven 2:1 multiplexers when compared to the 3-bit BCSs-based shift and add unit. But it can be noted that the total number of adders required for a 3-bit BCS-based filter with $n$ coefficients is $3 + 5n$ (three adders for the shift and add unit and five adders for each PE) and that for a 4-bit BCS-based filter is $7 + 3n$ (seven adders for the shift and add unit and three adders for each PE). Hence, two adders are saved per PE for the 4-bit BCSs-based filter. From the above discussion, it can be concluded that if 4-bit BCSs were used instead of 3-bit BCSs, the complexity of the shift and add unit and the multiplexer unit of the PE would increase, whereas the complexity of the final adder unit would decrease.

To provide a quantitative comparison, let us consider a 16-bit ($W = 16$) coefficient with an 8-bit quantized ($x = 8$) input signal. The proposed 3-bit BCSs-based CSM architecture requires twenty one 2:1 multiplexers of wordlength $x + 2 = 10$ bits and the 4-bit BCSs-based CSM architecture requires thirty two 2:1 multiplexers of wordlength $x + 3 = 11$ bits. For a 1-bit 2:1 multiplexer, we need eight 2-input NAND gates [36]. This means the 3-bit BCSs-based CSM architecture requires $21 \times 8 \times 10 = 1680$ NAND gates whereas the 4-bit BCSs-based CSM architecture requires $32 \times 8 \times 11 = 2816$ NAND gates. The 3-bit BCSs-based shift and add unit requires three adders with an adder-length [number of full adders (FAs)] of 10 bits each. Thus, the 3-bit BCSs-based shift and add unit requires roughly 30 FAs (assuming ripple carry addition). For each FA, we require fifteen 2-input NAND gates [36]. Thus, the 3-bit BCSs-based shift and add unit requires $30 \times 15 = 450$ NAND gates. Similarly, the 4-bit BCSs-based shift and add unit requires seven adders of adder-length 11 bits, i.e., approximately $7 \times 11 = 77$ FAs. Thus, the 4-bit BCSs-based shift and add unit requires $77 \times 15 = 1155$ NAND gates. For the final adder unit, the proposed 3-bit BCSs-based PE requires five adders (as shown in Fig. 4). Adders A1 to A3 require 13 FAs each {10 (output wordlength of shift and add unit) + 3 (shift-length)}, A4 requires $13 + 6 = 19$ FAs, and A5 requires $19 + 6 = 25$ FAs. Thus, a total of 83 FAs are required. This means the 3-bit BCSs-based PE requires $83 \times 15 = 1245$ NAND gates. Similarly, the 4-bit BCSs-based PE requires three adders. These three adders will require $15 + 15 + 15 + 8 = 53$ FAs. Thus, the 4-bit BCSs-based PE requires $53 \times 15 = 795$ NAND gates. Now considering the total complexity of the PE (note that the complexity of the PEs is directly proportional to the number of filter coefficients, and total complexity = complexity of multiplexers + complexity of final adder unit), the 3-bit BCSs-based PE requires $1680 + 1245 = 2925$ NAND gates. The 4-bit BCSs-based PE requires $2816 + 795 = 3611$ NAND gates. Thus, for a filter with $n$ taps, the 3-bit BCSs-based CSM architecture requires $(450 + 2925n)$ NAND gates whereas the 4-bit BCSs-based CSM architecture requires $(1155 + 3611n)$ NAND gates. Thus, it is evident that the 3-bit BCSs-based shift and add unit results in a lower complexity implementation than the 4-bit BCSs-based implementation. Nevertheless, it must be noted that the CSM architecture can be easily modified to incorporate 4-bit or 5-bit shift and add units, if there is such a requirement.

In the CSM approach, the coefficients are directly stored in the LUT and hence complete redundancy in coefficient multiplication is not avoided. Also, when the output of any of the multiplexers becomes zero, the adder corresponding to that multiplexer will still be used, which is not required if the output is zero. But it can be seen that the adders at the output of the multiplexers can be combined in many ways and hence the combination with the best power saving can be used. Also, carry save adders can be employed if much faster operation is required. The drawbacks in CSM are resolved by employing the BCSE algorithm proposed by us in [6]. This forms the PSM architecture which is explained in the next section.

B. Architecture of PSM

The PSM is based on the BCSE algorithm presented in our previous work [6]. The PSM architecture presented in this section incorporates reconfigurability into BCSE. The PSM has a pre-analysis part in which the filter coefficients are analyzed using the BCSE algorithm in [6]. Thus, the redundant computations (additions) are eliminated using the BCSs and the resulting coefficients are stored in the LUT in a coded format. The coding format is explained in the latter part of this section. The shift and add unit is identical for both PSM and CSM. The number of multiplexer units required can be obtained from the filter coefficients after the application of BCSE [6]. The number of multiplexers is selected after considering the number of non-zero operands (BCSs and unpaired bits) in each of the coefficients after the application of the BCSE algorithm. The number of multiplexers will correspond to the number of non-zero operands for the worst-case coefficient (the worst-case coefficient being defined as the coefficient that has the maximum number of non-zero operands).

The architecture of the PE for PSM is shown in Fig. 5. The coefficient wordlength is fixed as 16 bits. We have done the statistical analysis for various filters with a coefficient precision of 16 bits and different filter lengths (20, 50, 80, 120, 200, 400, and 800 taps) and it was found that the maximum number of non-zero operands is 5 for any coefficient. The analysis was done for filters with different passband ($\omega_p$) and stopband ($\omega_s$) frequency specifications given by 1) $\omega_p = 0.1\pi$, $\omega_s = 0.12\pi$; 2) $\omega_p = 0.15\pi$, $\omega_s = 0.25\pi$; 3) $\omega_p = 0.2\pi$, $\omega_s = 0.22\pi$; and 4) $\omega_p = 0.2\pi$, $\omega_s = 0.3\pi$, respectively. Based on our statistical analysis, we have fixed the number of multiplexers as 5 (same as the number of non-zero operands). The LUT consists of two rows of 18 bits for each coefficient of the form SDDDDXXDDDDXXMMMML and DDDDXXDDDDXXDDDDXX, where "S" represents the sign bit, "DDDD" represents the shift values from $2^{0}$ to $2^{-15}$, and "XX" represents the input "x" or the BCSs obtained from the shift and add unit. In the coded format, XX = "01" represents $x$, "10" represents $x + 2^{-1}x$, "11" represents $x + 2^{-2}x$, and "00" represents $x + 2^{-1}x + 2^{-2}x$, respectively. Thus, the two rows can store up to five operands, which is the worst-case number of operands for a 16-bit coefficient. In most practical coefficients, the number of operands is less than the worst-case number of operands, 5. In that case "MMMML" can be used to avoid unnecessary additions. The values "MMMM" will be given as the select signal to Mux6 and "L" to Mux8. "MMMML" indicates the presence of five
[...] using adder, shr2. Mux8 will do this and hence the adder shr2 is not loaded and consumes zero current and power. The select signals of Mux6 and Mux8 have five bits and hence $2^{5}$ different control signals are possible, which adds considerable flexibility to the architecture that can be employed in future if required. Mux7 is used to complement the output in case of a negative coefficient and its select signal is the sign bit "S" of the coefficient.

The PSM architecture has two advantages: first, it guarantees a reduced number of additions compared to CSM, and second, it offers the flexibility of changing the wordlength of coefficients. The same PSM architecture designed for 16-bit coefficients is capable of operating for any coefficient wordlength less than 16 bits. This means that if the wordlength is reduced, the format of the LUT can be changed if required. The main advantage of reducing the precision is that some of the adders in the PSM architecture will be unloaded, resulting in zero dynamic power. To the best of our knowledge, the PSM architecture is the first approach toward a programmable coefficient wordlength FIR filter architecture. This means that the coefficient wordlength of the proposed PSM architecture can be changed dynamically without any change in hardware.
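The operand coding of the PSM processing element, and the reason a shorter coefficient wordlength needs no hardware change, can be sketched behaviorally as follows. This is a minimal illustration under stated assumptions: the function name `psm_product` and the list-of-pairs operand representation are hypothetical, the "MMMML" control field and the bit-level packing of the LUT rows are not modeled, and exact fractions stand in for fixed-point hardware arithmetic. Only the "XX" code table and the shift range are taken from the text.

```python
from fractions import Fraction

# "XX" code -> multiple of x supplied by the shift and add unit (per the coded format)
BCS = {
    "01": Fraction(1),                                    # x
    "10": Fraction(1) + Fraction(1, 2),                   # x + 2^-1 x
    "11": Fraction(1) + Fraction(1, 4),                   # x + 2^-2 x
    "00": Fraction(1) + Fraction(1, 2) + Fraction(1, 4),  # x + 2^-1 x + 2^-2 x
}


def psm_product(x: int, negative: bool, operands: list[tuple[int, str]]) -> Fraction:
    """Multiply x by a coefficient given as up to five (shift, code) operands.

    Each operand contributes BCS[code] * x * 2^-shift, with shift in 0..15.
    """
    if len(operands) > 5:
        raise ValueError("at most five operands per coefficient")
    acc = Fraction(0)
    for shift, code in operands:
        if not 0 <= shift <= 15:
            raise ValueError("shift must be in 0..15")
        acc += BCS[code] * Fraction(x) / (1 << shift)  # programmable shift, then add
    return -acc if negative else acc


# Hypothetical coefficient 0.3828125 = 2^-2 * (1 + 2^-1) + 2^-7
full = psm_product(128, False, [(2, "10"), (7, "01")])
# Reduced wordlength: the low-order operand is dropped, so one addition is skipped
reduced = psm_product(128, False, [(2, "10")])
print(full, reduced)
```

In this model a lower-precision coefficient is simply a shorter operand list, which mirrors the statement that unused adders are left unloaded rather than removed.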
[...] not required. Valuable hardware resources will be wasted if all taps are implemented with the highest precision. The proposed PSM can be implemented for dynamically varying coefficient precision as it is wordlength independent. One of the limitations of the PSM architecture is that it requires pre-analysis of filter coefficients and hence on-the-fly reconfigurability is not always feasible. But this restriction does not impose constraints on popular reconfigurable filter applications like wireless communications. This is because in such applications, we have a distinct filter for each communication standard and the coefficients of the filter are fixed for a specific standard. In other words, when the communication system is operating on a particular wireless standard, the filter coefficients do not change, i.e., the filter is not required to be an adaptive filter. When the system changes its mode of operation to a different wireless communication standard (as in the case of a multi-standard transceiver), the coefficient set corresponding to the specification of the new standard is loaded (replacing the current filter coefficients). Note that the coefficients of the new standard are known beforehand (pre-stored) and therefore the pre-analysis can be done offline and the problem with reconfigurability can be solved.

In this paper, we have employed a tree-structured adder for the final adder unit in both CSM and PSM architectures. But it is possible to further optimize the CSM architecture by employing techniques such as compression techniques [36]. 3:2 or 4:2 compressors can be employed for carry free addition, making the entire final adder unit more power efficient with improved speed of operation. But the use of addition structures other than the tree-structured adder would impair the flexibility of the PSM architecture. For example, the multiplexer Mux6 in our PSM architecture (Fig. 6), which is employed to load/unload the adders, plays a significant role in achieving low power consumption and dynamic wordlength capabilities. If compression techniques were to be employed, the final adder unit should be made free of multiplexers including Mux6, which will in turn impair the above merits of the PSM approach.

IV. Extension of CSM and PSM to High Level Synthesis

In this section, we present an extension of the proposed reconfigurable architectures to high level synthesis. CSE techniques have been used in the literature as a powerful transformation for eliminating hardware redundancies to reduce power consumption and area [6], [27], [35]. However, there is hardly any work that addressed the problem of designing reconfigurable architectures using CSE techniques. In [14], the concept of ReMB was introduced, which utilized GD algorithms for eliminating coefficient redundancies and thus reducing the number of additions for the ReMB. However, the approach in [14] can be considered suitable only for static reconfigurable systems, where either the filter coefficients are known beforehand or the filters need to switch between coefficient sets which are already available. Also, the approach in [14] reduces the redundancies in multiplications by pushing multiplexers deep into the ReMB design, thus increasing the number of multiplexers. In other words, the reduction in complexity achieved by the approach in [14] is directly proportional to the number of multiplexers. But it is shown in [29] that the delay imposed by multiplexers in reconfigurable designs can heavily degrade the performance of the system, which will have adverse effects on the architecture in [14]. In this paper, we have used our BCSE algorithm [6] to reduce the redundancies in multiplications in the reconfigurable filter architecture. To the best of our knowledge, this is the first approach that employs the CSE technique to achieve high-level synthesis goals for reconfigurable systems. The proposed CSM and PSM methods make use of architectures with a fixed number of multiplexers, and the reduction in complexity is achieved by applying the BCSE algorithm proposed in [6]. Also, the shift and add unit, which significantly reduces the number of adders compared to direct implementation, has no multiplexers, in contrast to the approach in [14].

The high level synthesis literature has an extensive coverage of employing partitioning techniques to integrate low power realization within the scheduling process [29]–[32]. These methods generally use some scheduling techniques or path analysis to identify regions that can be combined into partitions. Each partition will have an activation/deactivation mechanism, which can be controlled. The basic idea is that the partition can be switched off when it is not used and consequently power can be saved. The methods in [29]–[32] have not exploited hardware level redundancy in operations, which can result in better performance of the system. In [33], an algorithm based on graphs was devised which reuses the hardware, resulting in less power consumption. But reusing hardware creates an increased amount of multiplexer logic, which degrades the system performance as discussed in [29]. The partitioning of coefficients into 3-bit groups in our proposed CSM is a high level synthesis transformation targeted to reduce power consumption. In the CSM architecture, the partitioned bit groups of coefficients are given as select signals to multiplexers. These multiplexers will load and unload different parts of the circuit and thus save a significant amount of power as discussed in Section III-B. In CMOS technology, there are three sources of power dissipation arising from switching (dynamic) currents, short circuit currents, and leakage currents. Among these parameters, the switching component, which is a function of the effective capacitance, plays the most significant role [28]. It is possible to reduce the power by employing transformations such as reductions in LD, number of operations, and average transition activity. In [28], it was shown that a binary tree-structured adder always ensures the lowest LD and consequently the least number of transitions. Our CSM and PSM architectures also employ the binary tree-structured approach so as to achieve low LD. It must be noted that as the coefficients are synthesized sequentially in GD algorithms, the resulting filter structures do not have a binary tree structure and hence will always result in an increased LD. Hence, the reconfigurable approaches in [14] and [15] which employ GD algorithms will result in increased power consumption when applied to high-level synthesis. In addition to reducing the LD in our CSM and PSM architectures, the number of operations is also reduced by employing the BCSE [6]. Furthermore, the proposed PSM architecture can make use
of dynamic change of coefficient wordlength, which will save a significant amount of dynamic power as explained in Section III-B. Thus, the proposed CSM and PSM approaches improve the efficiency of reconfigurable systems in high-level synthesis and offer a power efficient solution by reducing the LD as well as the number of operations (additions).

Fig. 6. Implementation of the proposed CSM and PSM architecture on Virtex 2v3000ff1152-4 FPGA.

TABLE I
Synthesis Results for an FIR Filter with 20 Taps and Coefficient Wordlength of 16 Bits

TABLE II
Synthesis Results for MB (PSM) with Different Coefficient Wordlengths

V. Experimental Results

In this section, the synthesis and design results of the proposed CSM and PSM architectures are presented and compared with the recently proposed reconfigurable FIR filter architectures in the literature [11], [12], [14], [15], [20], [21].

A. Synthesis Results

We have used Xilinx 8.1i ISE for synthesizing purposes. The synthesis has been done on Xilinx's Virtex-II 2v3000ff1152-4 FPGA. Table I shows the synthesis results of the CSM and PSM 20-tap FIR filter that has a coefficient wordlength of 16 bits. We have done the implementation of filters with different passband edge ($\omega_p$) and stopband edge ($\omega_s$) specifications given by: 1) $\omega_p = 0.1\pi$, $\omega_s = 0.12\pi$; 2) $\omega_p = 0.15\pi$, $\omega_s = 0.2\pi$; 3) $\omega_p = 0.2\pi$, $\omega_s = 0.22\pi$; and 4) $\omega_p = 0.2\pi$, $\omega_s = 0.3\pi$, respectively. Even though the proposed architectures are reconfigurable, the usage of adders and shifters is dependent on the filter coefficient values. Some of the adders may not be used by the multiplexers. As a result, they are unloaded and do not consume any dynamic power. Hence, the power and speed values of the synthesis results are dependent on the filter coefficients, and we have therefore considered an average of the synthesis results in all the tables in this paper. From the comparison it is evident that the CSM requires 475 gates more than the PSM, whereas the PSM requires 6.82 ns more for the data to arrive at the output compared to the CSM. Thus, the CSM results in higher speed whereas the PSM results in lower area. The lower speed of PSM is due to the presence of programmable shifters, and its smaller area is due to the elimination of redundant additions by the BCSE algorithm. We have also analyzed the effect of the MB for different filter coefficient wordlengths of 8, 12, and 16 bits for the PSM architecture. The results are shown in Table II. It can be noted that as the precision of the coefficient is made high,
TABLE III
Synopsys Synthesis Results for 20-Tap FIR Filter Implementation of Section V-B
                      BPSM      BCSM      CSD-CSM   CSD-PSM   FIR Filter [15]
Area (mm²)            0.2594    0.275     0.304     0.2796    0.5467
Delay (ns)            8.2       7.67      8.5       9.34      15.6
Dynamic power (mW)    5.98      7.8       10        13.97     16
BPSM: binary programmable shifts method. BCSM: binary constant shifts method.
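The percentage figures quoted in the discussion of Table III can be checked with a few lines of arithmetic. The sketch below is only an illustration: the dictionary layout, the key names, and the helper `reduction_percent` are hypothetical, and it assumes each reduction is normalized by the value of the reference architecture. Under that assumption it reproduces, for example, the 49.7% area and 50.8% delay figures of BCSM over the MMCM based filter in [15]; figures that differ slightly in the text may reflect rounding or a different baseline.

```python
# Values transcribed from Table III (20-tap filter, 0.18 um CMOS synthesis).
# The key names below are illustrative, not taken from the paper.
TABLE_III = {
    "BPSM":     {"area_mm2": 0.2594, "delay_ns": 8.2,  "power_mw": 5.98},
    "BCSM":     {"area_mm2": 0.275,  "delay_ns": 7.67, "power_mw": 7.8},
    "CSD-CSM":  {"area_mm2": 0.304,  "delay_ns": 8.5,  "power_mw": 10.0},
    "CSD-PSM":  {"area_mm2": 0.2796, "delay_ns": 9.34, "power_mw": 13.97},
    "MMCM[15]": {"area_mm2": 0.5467, "delay_ns": 15.6, "power_mw": 16.0},
}


def reduction_percent(proposed: str, reference: str, metric: str) -> float:
    """Percentage by which `proposed` is below `reference` for `metric`.

    Assumption: the saving is normalized by the reference value.
    """
    ref = TABLE_III[reference][metric]
    new = TABLE_III[proposed][metric]
    return 100.0 * (ref - new) / ref


if __name__ == "__main__":
    # Print the reductions of both binary architectures over the filter of [15].
    for metric in ("area_mm2", "delay_ns", "power_mw"):
        for proposed in ("BCSM", "BPSM"):
            pct = reduction_percent(proposed, "MMCM[15]", metric)
            print(f"{proposed} vs MMCM[15], {metric}: {pct:.1f}%")
```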
TABLE IV
Synopsys Synthesis Results for 32-Tap FIR Filter Implementation of Section V-B
the area consumption is increased and the speed of operation is reduced. Thus, by choosing the appropriate filter coefficient wordlength, it is possible to obtain reduced area and power as well as increased speed for the PSM architecture.
B. CSD Based Reconfigurable FIR Filter Architecture
CSD based CSE algorithms are considered to be among the best algorithms for low complexity fixed-coefficient FIR filter implementations. However, to the best of our knowledge, the implementation of CSD-CSE based reconfigurable filter architectures has not been addressed in the literature. We have implemented a CSD based FIR filter using the CSM architecture (CSD-CSM) and a CSD-CSE based FIR filter using the PSM architecture (CSD-PSM). For low complexity, we have employed the CSE algorithm in [3] on the coefficients before they are stored in the LUT. We have implemented a CSD based shift and add unit to generate common subexpressions (CSs) such as [1 0 1], [1 0 −1], [1 0 0 1], and [1 0 0 −1] and their negated versions. In the previous works based on the CSE algorithm [3]–[5], it was considered that CSs such as [−1 0 −1] and [−1 0 1] can be generated from their respective negated versions [1 0 1] and [1 0 −1] without using any extra adder by configuring the existing adder as a subtractor. But this is applicable only for fixed coefficient filters. An n-bit adder circuit would require n additional XOR gates to reconfigure the adder to subtractor mode. These additional XOR gates would increase the critical path of the adder circuit (equivalent to the delay imposed by n half-adders) and impose overheads for CSD implementation of the FIR filter. Another drawback of CSD implementation is the storage of coefficients in the LUT. A CSD value like [1 0 −1 0 −1 0 1 0 −1] can be stored in an LUT as [01 00 11 00 11 00 01 00 11], with “00” corresponding to 0, “01” corresponding to 1, and “11” corresponding to −1. Therefore, for the worst-case scenario, an 8-bit CSD coefficient requires 16 bits for its representation. This can be optimized, as no two adjacent digits in CSD are nonzero. Even so, CSD requires more bits than binary. Since all the bits in binary representation are positive, this problem does not arise. Thus, the additional half-adders required for implementing the adder/subtractor circuit and the additional storage space required for CSD increase the area, and the additional half-adders in the adder/subtractor unit reduce the speed of operation of the CSD based reconfigurable FIR filters compared to binary based FIR filter implementations. This becomes highly significant, as the order of the channel filters in wireless communication transceivers is very high.
We have done the synthesis using the Synopsys tool for all the FIR filter specifications as mentioned in Section IV-A on 0.18 µm CMOS technology. The synthesis results for a 20-tap FIR filter with 16-bit coefficient wordlength are summarized in Table III. The proposed CSM and PSM architectures which employ binary representation of filter coefficients are denoted as BCSM and BPSM, respectively. The CSD based implementations of CSM and PSM are denoted as CSD-CSM and CSD-PSM, respectively. Table III shows that the CSD-CSM and CSD-PSM architectures consume more area and power and have lower speed compared to our binary representation based BPSM and BCSM architectures. The BCSM architecture has an area reduction of 10% and 1% over the CSD-CSM and CSD-PSM architectures, respectively, and the area reductions for the BPSM architecture over the CSD-CSM and CSD-PSM architectures are 15% and 7%, respectively. The improvement in the speed of operation for the BCSM architecture over the CSD-CSM and CSD-PSM architectures is 10% and 22%, respectively. The BPSM architecture offers an improvement in the speed of operation of 4% and 12% over the CSD-CSM and CSD-PSM architectures, respectively. The dynamic power reductions for the BCSM architecture are 22% and 44% over the CSD-CSM and CSD-PSM architectures, respectively. The BPSM architecture offers dynamic power reductions of 40% and 57% over the CSD-CSM and CSD-PSM architectures, respectively. The BPSM architecture offers area and power reductions of 6% and 23% over the BCSM architecture, respectively. The BCSM architecture offers an improvement in the speed of operation of 7% compared to the BPSM architecture. In Table III, the proposed architectures are also compared with the MMCM architecture based FIR filter in [15]. The BCSM architecture offers an area reduction of 49.7%, a power reduction of 51.3%, and a speed improvement of 50.8% over the MMCM [15]. The area and power reductions offered by the BPSM architecture over MMCM [15] are 52.7% and 62.5%, respectively, with
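The LUT storage overhead of CSD coefficients discussed in Section V-B can be illustrated with a short sketch. The code below is not the authors' implementation: the function names `to_csd`, `from_csd`, and `encode_for_lut` are hypothetical, and the conversion uses the standard non-adjacent-form recoding as an assumed way of obtaining CSD digits. Only the 2-bit digit codes (“00” for 0, “01” for 1, “11” for −1) are taken from the text.

```python
from typing import List

# 2-bit codes for the CSD digits, as given in the text.
DIGIT_CODE = {0: "00", 1: "01", -1: "11"}


def to_csd(value: int) -> List[int]:
    """Return the CSD digits (MSB first) of a non-negative integer.

    Uses the standard non-adjacent-form recoding (an assumption; the paper
    does not state which conversion routine was used).
    """
    if value == 0:
        return [0]
    digits = []
    while value != 0:
        if value & 1:
            # +1 when value mod 4 == 1, -1 when value mod 4 == 3
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits[::-1]


def from_csd(digits: List[int]) -> int:
    """Evaluate MSB-first CSD digits back to an integer."""
    result = 0
    for digit in digits:
        result = 2 * result + digit
    return result


def encode_for_lut(digits: List[int]) -> str:
    """Map every CSD digit to its 2-bit LUT code, separated by spaces."""
    return " ".join(DIGIT_CODE[d] for d in digits)


if __name__ == "__main__":
    digits = to_csd(179)  # 179 corresponds to the digit string used in the text
    print(digits)
    print(encode_for_lut(digits))
    print("stored bits:", 2 * len(digits), "vs binary bits:", (179).bit_length())
```

For the value 179 the recoding yields the digit string [1 0 −1 0 −1 0 1 0 −1] used in the text, and the encoded word needs two stored bits per digit, which is the storage overhead the section refers to.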
MAHESH AND VINOD: NEW RECONFIGURABLE ARCHITECTURES FOR IMPLEMENTING FIR FILTERS WITH LOW COMPLEXITY 285
TABLE V
Synopsys Synthesis Results for 18-Tap FIR Filter Implementation of Section V-B
TABLE VI
Synopsys Synthesis Results for D-AMPS Channel Filter Implementations
TABLE VII
Comparison of the Proposed CSM Approach with Approach in [20], [21]
element. Thus for a higher order filter, the number of DPUs will be significantly large, which would delay the filtering operation substantially. The proposed architectures employ CSE techniques and hence a much faster filtering operation is feasible for any filter length. Also, the method in [11] is CSD based, which has many inherent difficulties, as explained in detail in Section V-B. A more detailed comparison between the method in [11] and the proposed architectures is also given in Section V-B.
E. Comparison with FIR Filters [20], [21]
The main difference of the proposed CSM architecture from the architectures in [20], [21] is the use of the BCSs-based shift and add unit and hardwiring of shifts. In the architectures proposed in [20], [21], there are precomputers which are used to generate x, 3x, 5x, 7x, 9x, 11x, 13x, and 15x using nine adders employing a special carry select adder, where x is the input signal. This is in comparison with only seven adders required by our 4-bit BCSs-based shift and add unit. Thus, the CSM architecture offers adder reduction over the architectures in [20], [21] and is different from the latter ones because it employs a BCSE-based shift and add unit for complexity reduction. Another major difference is that [20], [21] employ two programmable shifters named SHIFTER and ISHIFTER with coefficient values as select values. The shifters were used to identify the most significant non-zero bit (digit) in each filter coefficient. These shifters should always be preceded by 8:1 multiplexers as shown in Fig. 6 of [20] and hence the multiplexer complexity is also not reduced. These programmable shifters will reduce the overall speed of operation of the resulting filters, especially for higher order channel filter applications in wireless communication receivers. In the proposed CSM architecture, all the shifts are constants and can therefore be hardwired using a constant propagation tool, which results in better speed of operation compared to the methods in [20], [21]. This can be clarified using an example. For a 16-bit coefficient, the proposed CSM (3-bit BCSs-based shift and add unit) architecture requires five 8:1 multiplexers and one 2:1 multiplexer (equivalent to twenty one 2:1 multiplexers), eight adders (three adders for shift and add unit and five adders for the final adder unit, i.e., 3 + 5n adders for n-tap filter). Note that programmable shifters are not required in CSM since all shifts are constants which can be hardwired. On the other hand, the approach in [20], [21] requires four 8:1 multiplexers (main multiplexers) + four 4:1 multiplexers (for programmable shifters if not implemented using power consuming barrel shifters) (equivalent to twenty four 2:1 multiplexers), 12 adders (nine adders for precomputers and three adders for final adder unit, i.e., 9 + 3n adders for n-tap filter) and eight programmable shifters. From the above example, it is evident that the proposed CSM is less complex than the methods in [20], [21].
For a quantitative comparison, consider a 16-bit coefficient with an 8-bit quantized input signal. The proposed CSM approach and the approach in [20], [21] have been implemented on a Virtex-4 XC4VSX35-10ff668 FPGA and the complexities of the shift and add unit, multiplexer unit, final adder unit and the overall complexities have been compared. The comparison results (synthesis results on FPGA) are shown in Table VII. The overall complexity (shown in the right-most column under the respective methods in Table VII) is the sum of the complexities associated with the shift and add unit, multiplexer unit, final adder unit, the LUTs used for storing the coefficients, and the complementer. It should be noted that the final adder unit in [20], [21] is optimized by using carry select addition. However, the architecture implemented for this comparison uses ripple carry adders for a fair and straightforward comparison. We used ripple carry adders on account of their low power consumption and simple layout. From Table VII, it is clear that the complexity of the shift and add unit and the multiplexer unit is higher for the approach in [20], [21] and the complexity of the final adder unit is higher for the proposed CSM approach. It is evident from Table VII that the overall complexity of the proposed CSM approach is less than that of the approach in [20], [21].
VI. Implementation Results
We have implemented the proposed CSM and PSM architectures for a 20-tap FIR filter with 16-bit coefficient precision on Xilinx’s Virtex-II 2v3000ff1152-4 FPGA associated with the dual DSP-FPGA Signalmaster kit provided by Lyrtech [23]. A model based design using MATLAB Simulink and Xilinx System Generator was employed for the implementation, as shown in Fig. 6 (copied directly from the Simulink environment). Fig. 6 consists of eight components/blocks whose details are given below.
1) Multi-Tone Input Signal: A multi-tone input signal was generated by summing up sine waves of frequencies 300 Hz, 1000 Hz, 2500 Hz, 3500 Hz, and 4200 Hz, each sampled at 10 MHz. Note that the signal frequencies and the sampling frequency in this example are only for illustration purposes. By dynamically changing the input frequencies using the function in Simulink, we verified that the CSM and PSM architectures work well for frequencies of several tens of MHz.
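The multi-tone test input described in item 1) can be sketched outside Simulink as follows. This is only an illustrative stand-in for the Simulink source block: the function name `multitone`, the unit amplitudes, and the zero initial phases are assumptions, since the text specifies only the tone frequencies and the sampling rate.

```python
import math

# Tone frequencies and sampling rate taken from the text.
TONE_FREQUENCIES_HZ = (300.0, 1000.0, 2500.0, 3500.0, 4200.0)
SAMPLE_RATE_HZ = 10e6


def multitone(num_samples, freqs=TONE_FREQUENCIES_HZ, fs=SAMPLE_RATE_HZ):
    """Return `num_samples` samples of the sum of unit-amplitude sine tones.

    Assumption: every tone has amplitude 1 and zero initial phase.
    """
    return [
        sum(math.sin(2.0 * math.pi * f * n / fs) for f in freqs)
        for n in range(num_samples)
    ]


if __name__ == "__main__":
    samples = multitone(8)
    print(samples)
```

Passing a different `freqs` tuple plays the role of changing the input frequencies dynamically; the resulting samples would still need to be quantized to the input wordlength before being applied to a fixed-point filter model.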
[7] A. P. Vinod and E. Lai, “Low power and high-speed implementation of FIR filters for software defined radio receivers,” IEEE Trans. Wireless Commun., vol. 5, no. 7, pp. 1669–1675, Jul. 2006.
[8] T. Solla and O. Vainio, “Comparison of programmable FIR filter architectures for low power,” in Proc. 28th Eur. Solid-State Circuits Conf., Firenze, Italy, Sep. 2002, pp. 759–762.
[9] T. Solla, R. Mäkelä, M. Liljeroos, and O. Vainio, “Application-specific filter processor for flexible receivers,” in Proc. 19th NORCHIP Conf., Kista, Sweden, Nov. 2001, pp. 53–58.
[10] D. Hwang, C. Mittelsteadt, and I. Verbauwhede, “Low power showdown: Comparison of five DSP platforms implementing an LPC speech codec,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Salt Lake City, UT, May 2001, pp. 1125–1128.
[11] K. H. Chen and T. D. Chiueh, “A low-power digit-based reconfigurable FIR filter,” IEEE Trans. Circuits Syst. II, vol. 53, no. 8, pp. 617–621, Aug. 2006.
[12] T. Zhangwen, J. Zhang, and H. Min, “A high-speed, programmable, CSD coefficient FIR filter,” IEEE Trans. Consumer Electron., vol. 48, no. 4, pp. 834–837, Nov. 2002.
[13] X. Chenghuan, C. He, Z. Shunan, and W. Hua, “Design and implementation of a high-speed programmable polyphase FIR filter,” in Proc. 5th Int. Conf. Applicat.-Specific Integr. Circuit, vol. 2. Oct. 2003, pp. 783–787.
[14] S. S. Demirsoy, I. Kale, and A. G. Dempster, “Efficient implementation of digital filters using novel reconfigurable multiplier blocks,” in Proc. 38th Asilomar Conf. Signals Syst. Comput., vol. 1. Nov. 2004, pp. 461–464.
[15] P. Tummeltshammer, J. C. Hoe, and M. Puschel, “Multiplexed multiple constant multiplication,” IEEE Trans. Comput.-Aided Design Integr. Circuits, vol. 26, no. 9, pp. 1551–1563, Sep. 2007.
[16] A. P. Vinod and E. M.-K. Lai, “An efficient coefficient-partitioning algorithm for realizing low complexity digital filters,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 12, pp. 1936–1946, Dec. 2005.
[17] K. C. Zangi and R. D. Koilpillai, “Software radio issues in cellular base stations,” IEEE J. Select. Areas Commun., vol. 17, no. 4, pp. 561–573, Apr. 1999.
[18] N. Spencer, “An overview of digital telephony standards,” in Proc. IEE Colloq. Design Dig. Cellular Handsets, Mar. 1998, pp. 1/1–1/7.
[19] J. G. Proakis and D. G. Manolakis, “Design of digital filters,” in Digital Signal Processing Principles, Algorithms, and Applications. Upper Saddle River, NJ: Prentice-Hall, 1998, pp. 614–738.
[20] K. Muhammad and K. Roy, “Reduced computational redundancy implementation of DSP algorithms using computation sharing vector scaling,” IEEE Trans. Very Large Scale Integr. Syst., vol. 10, no. 3, pp. 292–300, Jun. 2002.
[21] J. Park, W. Jeong, H. Mahmoodi-Meimand, Y. Wang, H. Choo, and K. Roy, “Computation sharing programmable FIR filter for low-power and high-performance applications,” IEEE J. Solid State Circuits, vol. 39, no. 2, pp. 348–357, Feb. 2004.
[22] Y. Wang, H. Mahmoodi, L.-Y. Chiou, H. Choo, J. Park, W. Jeong, and K. Roy, “Hardware architecture and VLSI implementation of a low-power high-performance polyphase channelizer with applications to subband adaptive filtering,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., vol. 5. May 2004, pp. 97–100.
[23] http://www.lyrtech.com/index.php?act=view&pv=SignalMaster space Quad
[24] C. Y. Yao, H. H. Chen, C. J. Chien, and C. T. Hsu, “A novel common-subexpression-elimination method for synthesizing fixed-point FIR filters,” IEEE Trans. Circuits Syst. I, vol. 51, no. 11, pp. 2215–2221, Nov. 2004.
[25] I. C. Park and H. J. Kang, “FIR filter synthesis algorithms for minimizing the delay and the number of adders,” IEEE Trans. Circuits Syst. II, vol. 48, no. 8, pp. 770–777, Aug. 2001.
[26] R. A. Walker and D. E. Thomas, “Introduction,” in A Survey of High-Level Synthesis Systems. Boston, MA: Kluwer, 1991, pp. 20–37.
[27] A. P. Vinod and E. M.-K. Lai, “On the implementation of efficient channel filters for wideband receivers by optimizing common subexpression elimination methods,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 2, pp. 295–304, Feb. 2005.
[28] A. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R. W. Brodersen, “Optimizing power using transformations,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 14, no. 1, pp. 12–31, Jan. 1995.
[29] M. Meribout and M. Motomura, “A combined approach to high-level synthesis for dynamically reconfigurable systems,” IEEE Trans. Comput., vol. 53, no. 12, pp. 1508–1522, Dec. 2004.
[30] X.-J. Zhang, K.-W. Ng, and W. Luk, “A combined approach to high-level synthesis for dynamically reconfigurable systems,” in Proc. 10th Int. Workshop Field Programmable Logic Applicat., 2000, pp. 361–370.
[31] R. Kress and A. Pyttel, “High-level synthesis for dynamically reconfigurable hardware/software systems,” in Proc. 8th Int. Workshop Field-Programmable Logic Applicat. Field-Programmable Gate Arrays Comput. Paradigm, 1998, pp. 288–297.
[32] A. Rettberg and F.-J. Rammig, “Integration of energy reduction into high-level synthesis by partitioning,” in Proc. Working Conf. Distributed Parallel Embedded Syst. (DIPES), Braga, Portugal, Oct. 2006, pp. 225–234.
[33] N. Moreano, E. Borin, C. de Souza, and G. Araujo, “Efficient datapath merging for partially reconfigurable architectures,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 7, pp. 969–980, Jul. 2005.
[34] R. Mahesh and A. P. Vinod, “Reconfigurable low complexity FIR filters for software radio receivers,” in Proc. 17th IEEE Int. Symp. Personal Indoor Mobile Radio Commun. (PIMRC), Helsinki, Finland, Sep. 2006, pp. 1–5.
[35] M. Potkonjak, M. B. Srivastava, and A. P. Chandrakasan, “Multiple constant multiplications: Efficient and versatile framework and algorithms for exploring common subexpression elimination,” IEEE Trans. Comput.-Aided Design, vol. 15, no. 2, pp. 151–165, Feb. 1996.
[36] B. Parhami, “Implementation details,” in Computer Arithmetic. New York: Oxford Press, 2000, p. 131.
[37] Y. Linn, “Efficient loop filter design in FPGAs for phase lock loops in high-data rate wireless receivers: Theory and case study,” in Proc. 6th Annu. Wireless Telecommun. Symp., Pomona, CA, Apr. 2007, pp. 1–8.

R. Mahesh (M’06) received the B.Tech. degree in electrical and electronics engineering from Mahatma Gandhi University, Kottayam, Kerala, India, in 2003, and the Ph.D. degree in computer engineering from Nanyang Technological University, Singapore, in 2009.
He was a Lecturer at the College of Engineering, Mahatma Gandhi University, from August 2003 to July 2005. Currently, he is a Research Fellow with the School of Computer Engineering, Nanyang Technological University. His main research interests include low complexity and high speed digital signal processing circuits and computer arithmetic.

A. P. Vinod (SM’01) received the B.Tech. degree in instrumentation and control engineering from the University of Calicut, Malappuram, Kerala, India, in 1994, and the M.E. and Ph.D. degrees in computer engineering from Nanyang Technological University (NTU), Singapore, in 2000 and 2004, respectively.
From 1993 to 1998, he was an Automation Engineer with Kirloskar, Bangalore, India, Tata Honeywell, Pune, India, and Shell Singapore, Singapore. From September 2000 to September 2002, he was a Lecturer at the School of Electrical and Electronic Engineering, Singapore Polytechnic, Singapore. He was a Lecturer at the School of Computer Engineering, NTU from September 2002 to November 2004. Since December 2004, he has been an Assistant Professor with the School of Computer Engineering, NTU. He has published more than 90 papers in refereed journals and international conferences. His research interests include digital signal processing, low power and reconfigurable digital signal processing circuits, software radio, cognitive radio, and brain–computer interface.
Dr. Vinod is an Editor of the International Journal of Advancements in Computing Technology.