The Implementation Methods of High Speed FIR Filter On FPGA: 978-1-4244-2186-2/08/$25.00 ©2008 IEEE

The Implementation methods of High Speed
FIR Filter on FPGA

Ying Li 1*, Chungan Peng 1, Dunshan Yu 1, Xing Zhang 1
1Key Laboratory of Microelectronic Devices and Circuits, Institute of Microelectronics, Peking University, 100871
*Email: liying@ime.pku.edu.cn
Abstract
This paper implements a sixteen-order high-speed Finite
Impulse Response (FIR) filter with four popular methods:
conventional multiplications and additions; a full-custom
Distributed Arithmetic (DA) scheme; an add-and-shift method
with an advanced calculation schedule; and, as an industry
reference, a DA filter generated by Xilinx Coregen™. Each
scheme is analyzed in detail, including its implementation
process and its advantages and drawbacks, in order to provide
a practical reference. All implementations target Xilinx
Spartan-3 devices. Compared with the largest implementation,
the conventional parallel multiplication scheme, the
add-and-shift method achieves up to an 80% reduction in total
occupied slices and a 63.3% reduction in critical path length.
1. Introduction
Digital Finite Impulse Response (FIR) filters are widely used
in Digital Signal Processing (DSP) systems because of their
stability and ease of implementation, especially in
communication and multimedia applications. These filters are
major determinants of the performance and power consumption
of the whole system. Since field programmable gate arrays
(FPGAs) contain a very large number of Configurable Logic
Blocks (CLBs), they have become increasingly practical for
implementing specific DSP functions such as FIR filters. The
main difficulty in designing FIR filters is the large number
of multiplications, which leads to excessive area and power
consumption. Many works have focused on alternative
algorithms such as Distributed Arithmetic (DA) [1,2,3,4].
Other works optimize the multiplications by decomposing them
into simple operations such as addition, subtraction and
shift, or by sharing common sub-expressions [5,6,7,8,9].
This paper investigates and compares four implementation
schemes targeting FPGAs: (1) the conventional multiplication
and addition scheme; (2) a customized, multiplierless DA
method used in place of the multiplier block; (3) decomposing
the multiplications into addition, subtraction and shift
operations and reducing the number of these operations with a
calculation schedule (add-and-shift); (4) generating a
black-box synthesized filter with the vendor's core
generation software.
A chain of multiplications and additions is a straightforward
translation from the behavior of the filter function to a
circuit design. However, its bottleneck is resource
consumption, and its performance depends strongly on the
quality of the multiplier unit.
The DA scheme, a popular approach in industry, can
significantly reduce the number of occupied slices and
delivers excellent performance thanks to the high access
speed of Look-Up Tables (LUTs). In FPGA applications,
however, the way DA is applied in the filter (fully parallel
or serial LUTs) directly determines the number of occupied
block RAMs, which are a limited resource on the device.
Add-and-shift is an alternative method that uses adders and
register chains instead of multipliers to build the filter
structure. Moreover, with an advanced calculation schedule
and registered outputs, this scheme reduces the critical path
by 63.3% and the resource consumption by 80% compared with
the conventional scheme.
To make the comparison of results reliable, we use a standard
filter generated by Xilinx Coregen™ [10] as a reference
implementation. All four schemes have been investigated by
implementing a 16-order FIR filter on a Xilinx Spartan-3
device.
The aim of this paper is to give an overview of
implementation methods for FIR filters on FPGAs, with
comparison and analysis that can serve as a useful reference
for other researchers.
The rest of the paper is organized as follows. Section 2
introduces the requirements and constraints of a 16-order
high-speed FIR filter targeting a Xilinx Spartan-3 FPGA.
Section 3 analyzes each of the methods used to implement the
filter. Section 4 compares the results, including consumed
resources and critical path lengths, and closes with
conclusions and advice on choosing a scheme for different
applications.
2. Requirements of the target filter

1. Basic function: The filter should satisfy the conventional
filter expression, equation (1), with N=16 and B=7. N is the
number of taps of the filter, and B is the width of the input
signal in bits; that is, the input signal X is expressed as a
7-bit two's-complement code.
$$Y = \sum_{n=0}^{N-1} c[n] \times \sum_{b=0}^{B-1} x_b[n] \times 2^b \tag{1}$$
2. Coefficients: The constant, symmetric, quantized
coefficients are:
C0=255, C1=227, C2=176, C3=113,
C4=48, C5=-6, C6=-41, C7=-56,
C8=C7, C9=C6, C10=C5, C11=C4,
C12=C3, C13=C2, C14=C1, C15=C0
They are represented as 9-bit two's-complement codes.
3. Delay constraint: The final filter result should be
available at the output no later than 6 clock periods after
X is input, under the 50 MHz clock constraint. In other
words, the number of clock cycles available for the filtering
calculation is less than the number of bits of the input
signal.
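The three requirements above can be collected into a small software reference model. The following Python sketch is only an illustration and is not part of the implementations reported in this paper; the names `fir_output` and `signed_bits_needed` are hypothetical, and the convention that `window[n]` pairs with coefficient Cn is an assumption.

```python
# Coefficients C0..C15 from requirement 2 (symmetric: C[15-n] == C[n]).
HALF = [255, 227, 176, 113, 48, -6, -41, -56]
COEFFS = HALF + HALF[::-1]

N = 16  # number of taps
B = 7   # input width in bits, two's complement


def fir_output(window):
    """Direct-form result for one output sample.

    Assumption: window[n] is the input sample multiplied by COEFFS[n].
    """
    assert len(window) == N
    assert all(-(1 << (B - 1)) <= x < (1 << (B - 1)) for x in window)
    return sum(c * x for c, x in zip(COEFFS, window))


def signed_bits_needed(lo, hi):
    """Smallest two's-complement width that holds every value in [lo, hi]."""
    width = 1
    while not (-(1 << (width - 1)) <= lo and hi < (1 << (width - 1))):
        width += 1
    return width


# Worst-case outputs: each input sits at -64 or +63, with its sign chosen
# to align with (or oppose) the sign of its coefficient.
x_min, x_max = -(1 << (B - 1)), (1 << (B - 1)) - 1
y_max = sum(c * (x_max if c > 0 else x_min) for c in COEFFS)
y_min = sum(c * (x_min if c > 0 else x_max) for c in COEFFS)

print(signed_bits_needed(min(COEFFS), max(COEFFS)))  # coefficient width
print(signed_bits_needed(y_min, y_max))              # worst-case output width
```

Under these assumptions the coefficients need 9 bits, matching requirement 2, and the worst-case output needs 18 bits, which fits in the 19-bit output quoted for the DA scheme in section 3.2.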
3. Schemes of implementation
3.1 Conventional multiplier and addition scheme
Figure 1 The conventional structure of a FIR filter

As shown in Figure 1, the basic structure of a FIR filter
translates the conventional tapped-delay-line realization of
the filter equation into L multiplications and L-1 additions
per sample. Since all coefficients are constants, it is not
required to generate a general-purpose multiplier with full
flexibility. Therefore, we use a full-custom 9x7 multiplier
in this scheme to reduce the area. To trade off the required
performance against the limited resources on the FPGA, we
apply pipelining together with a parallel architecture of
L/2 multiplications to implement this conventional filter.
The detailed resource consumption after place and route can
be found in the next section.
3.2 DA scheme
The DA algorithm is a well-known multiplierless alternative
that can save a large amount of resources [1]. It can
implement the filter in either bit-serial or fully parallel
mode, trading bandwidth for area utilization [2,3,4].
According to the signed DA method, the input variable x[n]
is equal to:
$$x[n] = -2^{B-1} \times x_{B-1}[n] + \sum_{b=0}^{B-2} x_b[n] \times 2^b \tag{2}$$
where x_b[n] is the bth bit of x[n] and B is the input width.
Therefore, the conventional equation (1) can be rewritten as
(3), where f(c,x) is a partial product obtained by
multiplying the corresponding coefficient by one bit of the
input data x[n], i.e., an AND operation. This function is
always implemented in LUTs on an FPGA. The content of the LUT
for this 16-order filter is shown in Table 1.
$$Y = -2^{B-1} \times \sum_{n=0}^{N-1} f(c[n], x_{B-1}[n]) + \sum_{b=0}^{B-2} 2^b \times \sum_{n=0}^{N-1} f(c[n], x_b[n]) \tag{3}$$
In this full-custom DA procedure (as shown in Figure 2), the
first step is to load the sixteen 7-bit data words through
registers with a serial-to-parallel transfer operation, which
forms the table addresses
t_i = {X_15[i], X_14[i], ..., X_2[i], X_1[i], X_0[i]}, i = 0, 1, ..., 6,
where X_n[i] denotes bit i of the nth input word. Secondly,
t_0 to t_6 are looked up in the table to obtain M_0 to M_6.
Finally, the filter output y (19 bits) is accumulated
according to expression (4), in which the sign-bit term M_6
carries negative weight, as in (3):
$$y = M_0 \times 2^0 + M_1 \times 2^1 + \dots + M_5 \times 2^5 - M_6 \times 2^6 \tag{4}$$
Figure 2 Main procedure of the DA method (change the data
from serial to parallel; look up the tables to get the
products; accumulate the result; output)

If the number of calculation cycles available for filtering
were larger than the number of bits of the input signal, the
serial-mode scheme could be used. However, given the delay
requirement of this filter, we have to use the parallel-LUT
scheme instead of the serial one to ensure the performance,
which means that a 64k x 13 bit table must be instantiated 7
times. Since the look-up tables occupy block RAM on the FPGA,
the total area occupied by RAM would exceed the limited
resources of the target device by about 30%.
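The DA procedure of this section can be sketched in software as follows. This is a minimal Python model under stated assumptions, not the hardware described in the paper: the helper names (`build_lut`, `bit_plane`, `da_output`) are hypothetical, sample n is assumed to drive address bit n of the table, and the sign-bit plane is given negative weight as in the signed DA decomposition of equation (2).

```python
HALF = [255, 227, 176, 113, 48, -6, -41, -56]
COEFFS = HALF + HALF[::-1]   # C0..C15 from section 2
N, B = 16, 7                 # taps, input width in bits


def build_lut(coeffs):
    """Entry `addr` holds the sum of coeffs[n] over every set bit n of addr."""
    lut = [0] * (1 << len(coeffs))
    for addr in range(1, len(lut)):
        low = addr & -addr            # lowest set bit of addr
        n = low.bit_length() - 1      # index of that bit
        lut[addr] = lut[addr ^ low] + coeffs[n]
    return lut


def bit_plane(window, b):
    """Table address for bit b: bit b of sample n is placed at address bit n."""
    mask = (1 << B) - 1               # keep the 7-bit two's-complement pattern
    addr = 0
    for n, x in enumerate(window):
        addr |= (((x & mask) >> b) & 1) << n
    return addr


def da_output(window, lut):
    """Signed DA: planes 0..B-2 weigh +2^b, the sign plane weighs -2^(B-1)."""
    y = 0
    for b in range(B):
        m = lut[bit_plane(window, b)]  # partial product for bit plane b
        y += -(m << b) if b == B - 1 else (m << b)
    return y


lut = build_lut(COEFFS)               # 2^16 entries
window = [3, -64, 63, 0, -1, 17, -20, 5, 9, -33, 42, 1, -2, 60, -60, 7]
direct = sum(c * x for c, x in zip(COEFFS, window))
print(da_output(window, lut) == direct)
```

The model uses seven look-ups per output sample, one per input bit, which is what the parallel-LUT scheme performs with seven table instances.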
Because of the limited resources on the device, we modify the
original DA method in two ways:
1. Divide the original large look-up table into two
2^8 x 13 bit = 0.25k x 13 bit tables. In this case, X0-X7
address the low table and the others address the high one.
This optimization significantly reduces the block RAM area,
even though the number of instances becomes 14 (7 for the low
table and 7 for the high table). The implementation results
can be found in section 4.
2. To further decrease the number of block RAMs, we introduce
a faster clock at double frequency. During the first clock
cycle, the low and high LUTs are each instantiated 4 times;
during the second clock cycle, 3 instances of each are used,
which covers the whole 7-bit data. Because of this
time-division control scheme, the later look-ups can reuse
the table instances of the earlier ones. As a result, this
improvement saves 8 block RAMs on the FPGA at the cost of
some additional registers. The detailed resource consumption
is shown in section 4.
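The two modifications can be illustrated with a short Python sketch. It is a model under assumptions rather than the circuit itself: the names `split_lookup` and `schedule` are hypothetical, the low table is assumed to be addressed by the bits of X0-X7 in the low address byte, and the assignment of bit planes to the two fast-clock cycles is one possible ordering.

```python
HALF = [255, 227, 176, 113, 48, -6, -41, -56]
COEFFS = HALF + HALF[::-1]   # C0..C15 from section 2
B = 7                        # input width in bits


def build_lut(coeffs):
    """Entry `addr` holds the sum of coeffs[n] over every set bit n of addr."""
    lut = [0] * (1 << len(coeffs))
    for addr in range(1, len(lut)):
        low = addr & -addr
        lut[addr] = lut[addr ^ low] + coeffs[low.bit_length() - 1]
    return lut


LOW = build_lut(COEFFS[:8])    # 2^8 entries, addressed by bits of X0..X7
HIGH = build_lut(COEFFS[8:])   # 2^8 entries, addressed by bits of X8..X15


def split_lookup(addr16):
    """Partial product from the two small tables plus one extra addition."""
    return LOW[addr16 & 0xFF] + HIGH[addr16 >> 8]


def schedule(bits=B, instances=4):
    """Bit planes handled in each fast-clock cycle, `instances` per cycle."""
    return [list(range(start, min(start + instances, bits)))
            for start in range(0, bits, instances)]


full = build_lut(COEFFS)       # original 2^16-entry table, for comparison
print(all(split_lookup(a) == full[a] for a in range(1 << 16)))
print(len(LOW) + len(HIGH), "entries instead of", len(full))
print(schedule())
```

The split is exact because each table entry is a plain sum of coefficients, so the sum over sixteen address bits separates into a sum over the low eight and a sum over the high eight. The price is one additional adder per look-up.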
3.3 Add-and-shift scheme with advanced calculation schedule
A useful development in implementing constant multiplications
is to decompose them into simple operations such as addition,
subtraction and shift, or to share common sub-expressions
[5,6,7,8,11]. [8,11] used Canonical Signed Digit (CSD)
encoding to reduce the number of adders in FIR filters, while
[5,7] claimed that sharing common sub-expressions in a proper
way could run at higher sample rates than [8].
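To make the CSD idea concrete, the sketch below recodes the filter coefficients into signed digits. It uses the standard non-adjacent-form recoding as an illustration; it is not taken from [8] or [11], and the function names `to_csd` and `csd_value` are hypothetical.

```python
def to_csd(value):
    """Return the CSD digits of `value` (each -1, 0 or +1), LSB first."""
    digits = []
    while value != 0:
        if value & 1:
            # Choose +1 when value % 4 == 1 and -1 when value % 4 == 3,
            # so that the next digit is guaranteed to be zero.
            d = 2 - (value & 3)
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits


def csd_value(digits):
    """Rebuild the integer from its signed digits (LSB first)."""
    return sum(d << i for i, d in enumerate(digits))


for c in [255, 227, 176, 113, 48, -6, -41, -56]:
    digits = to_csd(c)
    nonzero = sum(1 for d in digits if d)
    # A constant with k nonzero digits needs k - 1 adders or subtractors.
    print(c, digits, "adders:", nonzero - 1)
```

For example, 255 is recoded as 256 - 1, so the multiplication by C0 needs a single subtractor instead of the seven adders implied by its plain binary form.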
Taking into account the structure of the FPGA slices,
high-speed implementations can be achieved by registering
each adder, so that the critical path becomes equal to the
delay of a single adder [7]. Registering an adder output
introduces no extra cost on an FPGA because a D flip-flop is
present at the output of each LUT, as shown in Figure 3.
Therefore, we can register each addition result to shorten
the critical path without any extra area cost.
Figure 3 (a) Slice of an adder without registered output,
(b) slice of an adder with registered output

We implement this filter mainly following the algorithm
presented in [5,7], which can enhance the maximum sample rate
with the fewest occupied FPGA slices. The structure of the
filter in Figure 1 can be replaced by an optimized set of
addition and shift operations, as shown in Figure 4. In the
multiplier block, the current input variable x[n] is
multiplied by all the coefficients of the filter to produce
the outputs Fn (n=0-15). These outputs are then delayed and
added to produce the final output Y of the filter.
Figure 4 Structure of the add-and-shift scheme (input Xi, a
multiplier block producing the products F0 onward, followed
by adders and registers)

Since the constant coefficients are expressed in 9-bit
two's-complement code, the products can be calculated with
the following add-and-shift expressions (registers are
omitted), where X is the current input and << denotes a left
shift. F0 to F7 correspond to the coefficients C7 down to C0,
respectively.
F0=(X<<7)+(X<<6)+(X<<3)-(X<<8)
F1=(X<<7)+(X<<6)+(X<<4)+(X<<2)+(X<<1)+X-(X<<8)
F2=(X<<7)+(X<<6)+(X<<5)+(X<<4)+(X<<3)+(X<<1)-(X<<8)
F3=(X<<5)+(X<<4)
F4=(X<<6)+(X<<5)+(X<<4)+X
F5=(X<<7)+(X<<5)+(X<<4)
F6=(X<<7)+(X<<6)+(X<<5)+(X<<1)+X
F7=(X<<7)+(X<<6)+(X<<5)+(X<<4)+(X<<3)+(X<<2)+(X<<1)+X
Due to the symmetry of the coefficients, we only need to
consider the calculation of F0-F7. According to the
calculation schedule and organization in [7], if two products
have common sub-expressions, the adders can be reused to save
resources.
We further examine these 8 products and divide them into 4
groups that share the most common sub-expressions (dm, m=0,
1, 2, 3), which gives (5). The final filter result is then
accumulated from all 16 products.
3.4 Implementing a soft core with Coregen™
Xilinx provides a dedicated tool called Coregen™ [10] to
instantiate a DA digital filter on its FPGA devices. Only a
few parameters need to be configured to obtain such a
synthesizable soft core. The generated filter is treated as a
black box during simulation, synthesis, and place and route.
For further information, please contact the Xilinx service
department.
4. Performance and conclusion

All the schemes introduced in section 3 have been implemented
on a Xilinx Spartan-3 FPGA. Based on the implementation
results, the number of occupied block RAMs and the other
relevant resources of each method are shown in Table 2. We
also compare the critical path lengths of the schemes in
Figure 5.
Scheme 1: conventional multiplier and addition scheme
Scheme 2: customized DA scheme with time division
Scheme 3: add-and-shift scheme with advanced calculation
schedule
Scheme 4: soft core implemented by Coregen™
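The sub-expression grouping (5) referred to in section 3.3
is:
$$\begin{aligned}
F_0 &= d_0, & F_2 &= d_0 + \{5,4,1\}, & d_0 &= \{-8,7,6,3\} \\
F_1 &= d_1 + \{-8\}, & F_7 &= d_1 + \{5,3\}, & d_1 &= \{7,6,4,2,1,0\} \\
F_3 &= d_2, & F_5 &= d_2 + \{7\}, & d_2 &= \{5,4\} \\
F_4 &= d_3 + \{0,4\}, & F_6 &= d_3 + \{0,7,1\}, & d_3 &= \{6,5\}
\end{aligned} \tag{5}$$
Each set lists shift amounts: {7,6,3} stands for
(X<<7)+(X<<6)+(X<<3), 0 stands for X itself, and -8 stands
for the subtracted term -(X<<8).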
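The grouping in (5) can be checked numerically. The Python sketch below is an illustration only: the names `shifts`, `multiplier_block` and `adder_counts` are hypothetical, and the adder count is a simple tally of two-input adders and subtractors that ignores the registers and the actual calculation schedule of [7].

```python
C = [255, 227, 176, 113, 48, -6, -41, -56]   # C0..C7 from section 2

# Shared sub-expressions d0..d3 and, per product F0..F7, the shared term
# it builds on plus its extra shift amounts, transcribed from (5).
SHARED = {
    "d0": [-8, 7, 6, 3],
    "d1": [7, 6, 4, 2, 1, 0],
    "d2": [5, 4],
    "d3": [6, 5],
}
PRODUCTS = [
    ("d0", []),         # F0
    ("d1", [-8]),       # F1
    ("d0", [5, 4, 1]),  # F2
    ("d2", []),         # F3
    ("d3", [0, 4]),     # F4
    ("d2", [7]),        # F5
    ("d3", [0, 7, 1]),  # F6
    ("d1", [5, 3]),     # F7
]


def shifts(x, terms):
    """Sum of shifted copies of x; a negative entry -k subtracts x << k."""
    return sum(-(x << -k) if k < 0 else (x << k) for k in terms)


def multiplier_block(x):
    """Products F0..F7 of x, computed with the shared terms of (5)."""
    d = {name: shifts(x, terms) for name, terms in SHARED.items()}
    return [d[name] + shifts(x, extra) for name, extra in PRODUCTS]


def adder_counts():
    """Two-input adder/subtractor counts: (with sharing, without sharing)."""
    shared = (sum(len(t) - 1 for t in SHARED.values())
              + sum(len(extra) for _, extra in PRODUCTS))
    flat = sum(len(SHARED[name]) + len(extra) - 1 for name, extra in PRODUCTS)
    return shared, flat


# F0..F7 should equal C7*x down to C0*x for every 7-bit input.
ok = all(multiplier_block(x) == [c * x for c in reversed(C)]
         for x in range(-64, 64))
print(ok, adder_counts())
```

With this tally the shared form needs 22 adders for the eight products against 32 when every product is built separately, which shows where the area saving of the scheme comes from.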
Figure 5 Critical path lengths of the provided schemes
(axis: length of critical path, in ns)

From Table 2, we can see that scheme 3 consumes the fewest
slice flip-flops, although schemes 1, 2 and 3 are generally
of the same order of magnitude. In terms of 4-input LUTs and
total occupied slices, scheme 1 consumes the most resources,
scheme 2 consumes fewer, and schemes 3 and 4 consume the
least. Moreover, the DA method instantiates block RAMs on the
FPGA, while the others do not.
In conclusion, scheme 3 is the most area-efficient and
fastest method for implementing this high-speed 16-order FIR
filter. It achieves up to an 83% reduction in the total
number of occupied slices relative to the fully parallel
multiplication implementation, about 60% relative to the DA
method, and about 30% relative to the core-generated result.
Its expected speed is about twice that of the other schemes,
as shown in Figure 5.
To sum up, the actual requirements of the application
determine which scheme is suitable. For example, FPGA systems
in which the filter is not on the critical path, or designs
that do not target a later full-custom integrated circuit,
can use Coregen™ to generate a FIR filter conveniently.
Furthermore, if the filter speed is not strictly constrained,
the serial mode of the modified DA method is also a viable
scheme. For most FPGA-based FIR filter applications that
demand both high speed and low cost, the add-and-shift method
with a proper calculation schedule appears to be the best
scheme.
x_b[15]  x_b[14]  x_b[13]  ...  x_b[2]  x_b[1]  x_b[0]   f(c, x) (13 bit)
0        0        0        ...  0       0       0        M_0 = 0×c15 + ... + 0×c0
0        0        0        ...  0       0       1        M_1 = 0×c15 + ... + 1×c0
...      ...      ...      ...  ...     ...     ...      ...
1        1        1        ...  1       1       1        M_(2^16-1) = 1×c15 + ... + 1×c0
Table 1 Look-up table for the 16-order FIR filter. The size of the entire table is 2^16 x 13 bit = 64k x 13 bit.
Resource                       Scheme 1     Scheme 2    Scheme 3    Scheme 4
Slice flip-flops               572 (3%)     667 (4%)    483 (3%)    527 (4%)
Total number of 4-input LUTs   2094 (13%)   646 (4%)    380 (2%)    383 (2%)
Number of block RAMs           0            8 (33%)     0           0
Number of clks                 1            2           1           1
Total occupied slices          1203 (15%)   564 (7%)    269 (3%)    290 (4%)
Table 2 Implementation results on the Xilinx Spartan-3 device
References
[1] C. Sidney Burrus, "Digital Filter Structures Described by
Distributed Arithmetic", IEEE Transactions on Circuits and
Systems, Vol. CAS-24, No.12 (1977)
[2] Ching-Long Su, Yin-Tsung Hwang, Chein-Wei Jen, "A Novel
Recursive Digital Filter Based on Signed Digit Distributed
Arithmetic", IEEE International Symposium on Circuits and
Systems, p.2104--2107 (1997)
[3] Heejong Yoo, David V. Anderson, "Hardware-efficient
distributed arithmetic architecture for high-order digital
filters", ICASSP, Vol. 5, p.125--128 (2005)
[4] M. Rawski, P. Tomaszewicz, H. Selvaraj, "Efficient
Implementation of Digital Filters with Use of Advanced
Synthesis Methods Targeted FPGA Architectures", Proceedings
of the 8th Euromicro Conference on Digital System Design,
p.2455 (2005)
[5] Huy T. Nguyen, Abhijit Chatterjee, "Number-splitting with
shift-and-add Decomposition for Power and Hardware
Optimization in Linear DSP Synthesis", IEEE Transactions on
Very Large Scale Integration (VLSI) Systems, Vol.8, No.4
(2000)

[6] Hyeong-Ju Kang, Hansoo Kim, In-Cheol Park, "FIR filter
synthesis algorithms for minimizing the delay and the number
of adders", p.51--54 (2000)
[7] S. Mirzaei, A. Hosangadi, R. Kastner, "FPGA
Implementation of High Speed FIR Filters Using Add and Shift
Method", International Conference on Computer Design,
p.308--313 (2006)
[8] Richard I. Hartley, "Subexpression Sharing in Filters
Using Canonic Signed Digit Multipliers", IEEE Transactions on
Circuits and Systems II: Analog and Digital Signal
Processing, Vol. 43, No.10, p.611--623 (1996)
[9] Hyeong-Ju Kang, Hansoo Kim, In-Cheol Park, "FIR filter
synthesis algorithms for minimizing the delay and the number
of adders", IEEE/ACM International Conference on Computer
Aided Design, p.51--54 (2000)
[10] "Distributed Arithmetic FIR Filter v9.0," Xilinx Product
Specification (2004)
[11] Mitsuru Yamada, Akinori Nishihara, "High-Speed FIR
Digital Filter with CSD Coefficients Implemented on FPGA",
Asia and South Pacific Design Automation Conference, p.7--8
(2001)
