Abstract
This paper implements a sixteen-order high-speed
Finite Impulse Response (FIR) filter with four different
popular methods: conventional multiplications and
additions; a full-custom Distributed Arithmetic (DA)
scheme; and an add-and-shift method with an advanced
calculation schedule. Each scheme is analyzed in detail,
including its implementation process and its advantages
and drawbacks, in order to provide a practical reference.
All of these implementations target Xilinx Spartan-3
devices, and we also compare our results with an industry
result produced by Xilinx Coregen™, which also uses
Distributed Arithmetic. The premium add-and-shift method
achieves up to an 80% reduction in total occupied slices
and 63.3% versus the largest conventional parallel
multiplication implementation.
1. Introduction
Digital Finite Impulse Response (FIR) filters are
frequently used in Digital Signal Processing (DSP)
systems by virtue of their stability and ease of
implementation, especially in communication and
multimedia applications. The filters are major
determinants of the performance and power consumption
of the whole system. Since field programmable gate
arrays (FPGAs) contain a very large number of
Configurable Logic Blocks (CLBs), they have become more
feasible for implementing specific DSP functions such as
FIR filters. The problem in designing FIR filters is that
they require a large number of multiplications, which
leads to excessive area and power consumption. Many
works have focused on designing alternative algorithms
such as Distributed Arithmetic (DA) [1,2,3,4]. Other
works have concentrated on optimizing multiplications
by decomposing them into simple operations such as
addition, subtraction and shift, or by sharing common
sub-expressions [5,6,7,8,9].
This paper investigates and compares four available
implementation schemes targeting FPGAs: (1) the
conventional multiplication and addition scheme; (2) a
custom DA-based multiplierless method used in place of
the multiplier block; (3) decomposing multiplications into
addition, subtraction and shift operations and eliminating
operations through calculation scheduling (add-and-shift);
(4) generating a black-box synthesized filter with
place-and-route software.
Application of multiplication and addition chains can
be considered as a straight and simple translation from
[Equations (1)-(3) and the accompanying figure are not recoverable from the source.]

$$ y = -2^{B-1} \times \sum_{n=0}^{N-1} f(c[n], x_{B-1}[n]) + \sum_{b=0}^{B-2} 2^{b} \times \sum_{n=0}^{N-1} f(c[n], x_{b}[n]) \tag{4} $$
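As a minimal sketch of the signed decomposition in Eq. (4), the Python below evaluates an inner product bit-serially from a single look-up table and subtracts the sign-bit term. It assumes two's-complement samples of width B and takes f(c[n], x_b[n]) to be the product c[n] * x_b[n]; the function names and the four-tap example values are hypothetical and do not come from the paper.

```python
def build_lut(coeffs):
    """Precompute sum(c[n] * bit_n) for every address.

    Bit n of the address selects whether tap n contributes.
    """
    return [sum(c for n, c in enumerate(coeffs) if (addr >> n) & 1)
            for addr in range(1 << len(coeffs))]


def da_inner_product(coeffs, samples, bits):
    """Bit-serial DA inner product for two's-complement samples of width `bits`."""
    lut = build_lut(coeffs)
    mask = (1 << bits) - 1
    acc = 0
    for b in range(bits):
        # Collect bit b of every sample into one LUT address.
        addr = 0
        for n, x in enumerate(samples):
            addr |= (((x & mask) >> b) & 1) << n
        term = lut[addr] * (1 << b)
        # The sign bit (b = bits - 1) has weight -2^(B-1); all others +2^b.
        acc += -term if b == bits - 1 else term
    return acc


# Hypothetical 4-tap example with 4-bit signed samples.
example_coeffs = [3, -5, 7, 2]
example_samples = [-8, 7, -1, 4]
direct = sum(c * x for c, x in zip(example_coeffs, example_samples))
print(da_inner_product(example_coeffs, example_samples, bits=4), direct)
```

The table has 2^N entries for N taps, which is why a 16-tap filter motivates the table-splitting optimizations discussed in the text.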
look up the low table and others look up the high one.
This optimization significantly reduces the block RAM
area, although the number of table instances is 14 (7 for
the low table and 7 for the high one). The implementation
results can be found in Section 4.
2. In order to further decrease the number of block
RAMs, we introduce a faster clock at double the frequency.
During the first clock cycle, the low and high LUTs are
each used 4 times. During the second clock cycle they are
each used 3 times, which covers the whole 7-bit data.
Because of this time-division control scheme, the later
table accesses can reuse the earlier instances. As a
result, this improvement saves 8 block RAMs on the FPGA at
the cost of introducing some registers. The detailed
resource consumption is shown in Section 4.
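The two optimizations above can be sketched in Python as follows. The sketch assumes unsigned 7-bit samples, splits the 16 table address bits into a low and a high group of eight taps each, and models the double-rate clock as two phases that reuse four table pairs. The coefficient values, the function names, and the assignment of bits 0-3 to the first phase and bits 4-6 to the second are illustrative assumptions, since the source does not spell them out.

```python
def build_lut(coeffs):
    """Precompute sum(c[n] * bit_n) for every address of a partial table."""
    return [sum(c for n, c in enumerate(coeffs) if (addr >> n) & 1)
            for addr in range(1 << len(coeffs))]


def two_phase_split_da(coeffs, samples, bits=7, pairs=4):
    """DA inner product with low/high tables and a time-multiplexed schedule.

    `pairs` is the assumed number of physical (low, high) table pairs;
    each phase handles at most `pairs` bit positions.
    """
    half = len(coeffs) // 2
    low_lut = build_lut(coeffs[:half])    # taps 0 .. half-1
    high_lut = build_lut(coeffs[half:])   # taps half .. N-1
    acc = 0
    schedule = []
    for start in range(0, bits, pairs):
        phase = list(range(start, min(start + pairs, bits)))
        schedule.append(phase)
        for b in phase:
            low_addr = sum(((samples[n] >> b) & 1) << n for n in range(half))
            high_addr = sum(((samples[half + n] >> b) & 1) << n
                            for n in range(half))
            acc += (low_lut[low_addr] + high_lut[high_addr]) * (1 << b)
    return acc, schedule, len(low_lut) + len(high_lut)


# Hypothetical 16-tap coefficients and unsigned 7-bit samples.
demo_coeffs = [1, -2, 3, -4, 5, -6, 7, -8, 8, -7, 6, -5, 4, -3, 2, -1]
demo_samples = [(17 * n + 5) % 128 for n in range(16)]
result, phases, entries = two_phase_split_da(demo_coeffs, demo_samples)
print(result, phases, entries)
```

With these assumptions the two partial tables hold 2 x 256 entries instead of 65536, and the schedule visits 4 bit positions in the first phase and 3 in the second, mirroring the 4 + 3 split described in the text.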
3.3 Add-and-Shift scheme with advanced calculation schedule
A useful development in implementing constant
multiplications is to decompose them into simple
operations such as addition, subtraction and shift, or to
share common sub-expressions [5,6,7,8,11]. [8,11]
used Canonical Signed Digit (CSD) encoding to
reduce the number of adders in FIR filters, while [5,7]
claimed that sharing common sub-expressions in a proper
way could achieve higher sample rates than [8].
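To illustrate the CSD idea mentioned above, the sketch below recodes an integer constant into digits from {-1, 0, +1} with no two adjacent non-zero digits, and then multiplies by it using only shifts, additions and subtractions. This is a generic textbook recoding, not the specific procedure of [8,11]; the function names and the constant 215 are illustrative assumptions.

```python
def to_csd(value):
    """Return the CSD digits of an integer, least significant digit first."""
    digits = []
    while value != 0:
        if value & 1:
            # +1 when value mod 4 == 1, -1 when value mod 4 == 3.
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def csd_multiply(x, digits):
    """Multiply x by the constant encoded in `digits` using shifts and adds."""
    return sum(d * (x << k) for k, d in enumerate(digits))


constant = 215                      # illustrative constant, binary 11010111
csd_digits = to_csd(constant)
binary_terms = bin(constant).count("1")
csd_terms = sum(1 for d in csd_digits if d != 0)
print(csd_digits, binary_terms, csd_terms)
```

For this constant the plain binary form has six non-zero terms while the CSD form has four (256 - 32 - 8 - 1), so a shift-and-add multiplier needs three adders or subtractors instead of five.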
Taking into account the structure of the FPGA slices,
[Figure with panels (a) and (b). Recoverable labels: Multiplier block, F0, x0, x1, y0, y1, s0; legend: + adder, register.]
F0 = d0,            F2 = d0 + {5,4,1},    d0 = {-8,7,6,3}
F1 = d1 + {-8},     F7 = d1 + {5,3},      d1 = {7,6,4,2,1,0}
F3 = d2,            F5 = d2 + {7},        d2 = {5,4}
F4 = d3 + {0,4},    F6 = d3 + {0,7,1},    d3 = {6,5}
(5)
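One way to read Eq. (5) is sketched below in Python. It assumes that each set lists power-of-two exponents, that a negative entry such as -8 denotes a subtracted term (-2^8), and that each d_k is a sub-expression computed once and shared by the two coefficients that reference it. Under that reading the eight coefficients F0 to F7 evaluate to -56, -41, -6, 48, 113, 176, 227 and 255. The dictionary layout and function names are hypothetical.

```python
# Shared sub-expressions of Eq. (5), as lists of signed exponents.
# Assumption: a negative entry -k means "subtract x << k".
# (An exponent of 0 can therefore only appear as an added term.)
SHARED = {
    "d0": [-8, 7, 6, 3],
    "d1": [7, 6, 4, 2, 1, 0],
    "d2": [5, 4],
    "d3": [6, 5],
}

# Each coefficient: (shared sub-expression, extra exponents).
COEFFS = {
    0: ("d0", []),
    1: ("d1", [-8]),
    2: ("d0", [5, 4, 1]),
    3: ("d2", []),
    4: ("d3", [0, 4]),
    5: ("d2", [7]),
    6: ("d3", [0, 7, 1]),
    7: ("d1", [5, 3]),
}


def shift_add(x, exponents):
    """Sum of +/- (x << k) over the signed exponent list."""
    total = 0
    for e in exponents:
        total += -(x << -e) if e < 0 else (x << e)
    return total


def multiplier_block(x):
    """Return [F0*x, ..., F7*x], computing each shared term only once."""
    shared = {name: shift_add(x, exps) for name, exps in SHARED.items()}
    return [shared[COEFFS[i][0]] + shift_add(x, COEFFS[i][1])
            for i in range(8)]


print(multiplier_block(1))
```

Calling `multiplier_block(1)` returns the coefficient values themselves, which is a quick way to check the assumed reading of the notation against a known coefficient set.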
xb[15]  xb[14]  xb[13]  ......  xb[2]  xb[1]  xb[0]  |  f(c, x) (13 bit)
0       0       0       ......  0      0      0      |  0
......
1       1       1       ......  1      1      1      |  1 x c15 + ... + 1 x c0
Table 1. Look-up table for the 16-order FIR filter. The scale of the entire table is 2^16 x 13 bit = 64k x 13 bit.
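The table size in the caption can be checked with a few lines of Python. The sketch also compares it with two 8-input partial tables, assuming (hypothetically) that each partial table keeps the same 13-bit word width; the function name is illustrative.

```python
def table_bits(address_bits, word_bits):
    """Storage in bits for a look-up table with 2^address_bits words."""
    return (1 << address_bits) * word_bits


full_table = table_bits(16, 13)        # one 16-input table
split_tables = 2 * table_bits(8, 13)   # assumed low + high 8-input tables
print(full_table, split_tables, full_table // split_tables)
```

Under these assumptions the single table needs 851968 bits, while the two partial tables together need 6656 bits, a factor of 128 less.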
Scheme                         1            2           3
Slice flip-flops               572 (3%)     667 (4%)    483 (3%)
Total number of 4-input LUTs   2094 (13%)   646 (4%)    380 (2%)
Number of block RAMs           0            8 (33%)     0
Number of clocks               1            2           1
Total occupied slices          1203 (15%)   564 (7%)    269 (3%)
Table 2. Implementation results on a Xilinx Spartan-3 device
References
[1] C. Sidney Burrus, "Digital Filter Structures Described by Distributed Arithmetic", IEEE Transactions on Circuits and Systems, Vol. CAS-24, No. 12 (1977)
[2] Ching-Long Su, Yin-Tsung Hwang, Chein-Wei Jen, "A Novel Recursive Digital Filter Based on Signed Digit Distributed Arithmetic", IEEE International Symposium on Circuits and Systems, pp. 2104-2107 (1997)
[3] Heejong Yoo, David V. Anderson, "Hardware-efficient distributed arithmetic architecture for high-order digital filters", ICASSP, Vol. 5, pp. 125-128 (2005)
[4] M. Rawski, P. Tomaszewicz, H. Selvaraj, "Efficient Implementation of Digital Filters with Use of Advanced Synthesis Methods Targeted FPGA Architectures", Proceedings of the 8th Euromicro Conference on Digital System Design, p. 2455 (2005)
[5] Huy T. Nguyen, Abhijit Chatterjee, "Number-splitting with shift-and-add Decomposition for Power and Hardware Optimization in Linear DSP Synthesis", IEEE Transactions on Very Large Scale
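Table 2 (continued), column 4: slice flip-flops 527 (4%); total number of 4-input LUTs 383 (2%); block RAMs 0; clocks 1; total occupied slices 290 (4%).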