ABSTRACT

Digital filtering algorithms are most commonly implemented using general purpose digital signal processing chips for audio applications, or special purpose digital filtering chips and application-specific integrated circuits (ASICs) for higher rates. This paper describes an approach to the implementation of digital filter algorithms based on field programmable gate arrays (FPGAs). The advantages of the FPGA approach to digital filter implementation include higher sampling rates than are available from traditional DSP chips, lower costs than an ASIC for moderate volume applications, and more flexibility than the alternate approaches. Since many current FPGA architectures are in-system programmable, the configuration of the device may be changed to implement different functionality if required. Our examples illustrate that the FPGA approach is both flexible and provides performance comparable or superior to traditional approaches.

1. INTRODUCTION

The most common approaches to the implementation of digital filtering algorithms are general purpose digital signal processing chips for audio applications, or special purpose digital filtering chips and application-specific integrated circuits (ASICs) for higher rates [9, 14]. This paper describes an approach to the implementation of digital filter algorithms on field programmable gate arrays (FPGAs).

Recent advances in FPGA technology have enabled these devices to be applied to a variety of applications traditionally reserved for ASICs. FPGAs are well suited to datapath designs, such as those encountered in digital filtering applications. The density of the new programmable devices is such that a non-trivial number of arithmetic operations such as those encountered in digital filtering may be implemented on a single device. The advantages of the FPGA approach to digital filter implementation include higher sampling rates than are available from traditional DSP chips, lower costs than an ASIC for moderate volume applications, and more flexibility than the alternate approaches. In particular, multiple multiply-accumulate (MAC) units may be implemented on a single FPGA, which provides comparable performance to general-purpose architectures which have a single MAC unit. Further, since many current FPGA architectures are in-system programmable, the configuration of the device may be changed to implement alternate filtering operations, such as lattice filters and gradient-based adaptive filters, or entirely different functionality.

This research is partially supported by the Kansas Technology Enterprise Corporation through the Center for Excellence in Computer-Aided Systems Engineering and by the University of Kansas General Research allocation 3626-20-0038.

2. BACKGROUND

Research on digital filter implementation has concentrated on custom implementation using various VLSI technologies. The architecture of these filters has been largely determined by the target applications of the particular implementations. Several widely used digital signal processors such as the Texas Instruments TMS320, Motorola 56000, and Analog Devices ADSP-2100 families have been designed to efficiently implement filtering operations at audio rates. These devices are extremely flexible, but are limited in performance. High performance designs for filtering at sampling rates above 100 MHz have also been demonstrated using CMOS [3, 4, 6, 8, 9, 14, 17, 19, 20, 21] and BiCMOS [8, 20, 22] technologies, using approaches ranging from full custom to traditional factory-configured gate arrays. These efforts have produced high performance designs for specific application domains.

There are several potential shortcomings of the custom VLSI approach, although it does promise the best performance and efficiency for the specific application for which a particular design is intended. The most obvious problem is the lack of flexibility in the custom approach. Custom devices are often suited only for use in a particular application, and can not be easily reconfigured for other operations even within that same domain. Another problem which the custom VLSI approach often imposes is a lack of adaptability once a device is in use within a system. Typical custom approaches do not allow the function of a device to be modified within the system, for purposes such as correcting faults, for example. Although these problems can be overcome with sufficient forethought, the costs in performance, implementation complexity, and additional design time often preclude flexible solutions.

Lack of flexibility can forestall the cost-effective evaluation of exotic algorithms in a high performance real-time environment. Only high volume applications or extremely critical low volume applications can justify the expense of developing a full custom solution. There are a variety of algorithms which are not within the performance envelope of general purpose processors, and which are not sufficiently commonplace or well-understood to justify implementation in a full custom design. These algorithms cannot be evaluated with the traditional approaches, thus limiting innovation.

Field programmable gate arrays (FPGAs) can be used to alleviate some of the problems with the custom approach. FPGAs are programmable logic devices which bear a significant resemblance to traditional custom gate arrays. While there are a variety of approaches to FPGA implementation, some of the more popular series consist of an array of arbitrarily programmable function
[Figure: full adder types (columns: Type, Logic Symbol, Operation). Type 0: X + Y + Z gives C, S. Type 1: X + Y + (-Z) gives C, (-S). Type 2: (-X) + (-Y) + Z gives (-C), S.]

[Figure 2. Cell Structure of Combinatorial Array Multiplier]

[Figure 3. Combinatorial Array Multiplier Block Diagram]

$$S = \bar{X}\bar{Y}Z + \bar{X}Y\bar{Z} + X\bar{Y}\bar{Z} + XYZ, \qquad C = XY + YZ + XZ \tag{1}$$

and those for the Type 1 and 2 adders are

$$S = \bar{X}\bar{Y}Z + \bar{X}Y\bar{Z} + X\bar{Y}\bar{Z} + XYZ, \qquad C = XY + X\bar{Z} + Y\bar{Z} \tag{2}$$

Type 0 and Type 3 full adders are characterized by the same pair of logic equations, identical to that of the conventional 1-bit full adder (Type 0). This is because a Type 3 full adder can be obtained from a Type 0 full adder by negating all of the input and output values and vice versa. A similar relationship can be established between Type 1 and Type 2 full adders.

For Type 0, 1, 2, and 3 full adders, the two independent 4-input functions of a CLB were used to generate the sum and carry outputs. We can easily include the AND gate in the CLB just by replacing, for example, X with (xi AND ai) when configuring the CLB. The horizontal inputs (xi, ai) can use the horizontal longlines which are associated with each row for distribution of the signal with a very short routing delay. Other interconnections can be made using the single-length or double-length lines via Programmable Interconnection Points (PIP) or switching matrices.
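The relationship between the logic equations and the signed arithmetic performed by each adder type can be checked exhaustively. The following is a minimal sketch, assuming the weight convention suggested by the adder type figure (Type 1 treats Z and S as negatively weighted, Type 2 treats X, Y, and C as negatively weighted); the function names are illustrative and do not come from the paper.

```python
from itertools import product


def sum_bit(x, y, z):
    # Sum output shared by all adder types: parity of the three inputs.
    return x ^ y ^ z


def carry_type0(x, y, z):
    # Equation (1): C = XY + YZ + XZ (majority of the inputs).
    return (x & y) | (y & z) | (x & z)


def carry_type1(x, y, z):
    # Equation (2): C = XY + X Z' + Y Z', where Z' is the complement of Z.
    nz = 1 - z
    return (x & y) | (x & nz) | (y & nz)


def check_adder_types():
    """Return True if the equations match the signed sums for every input."""
    for x, y, z in product((0, 1), repeat=3):
        s = sum_bit(x, y, z)
        c0 = carry_type0(x, y, z)
        c1 = carry_type1(x, y, z)
        if x + y + z != 2 * c0 + s:        # Type 0: all weights positive
            return False
        if -(x + y + z) != -(2 * c0) - s:  # Type 3: all weights negative
            return False
        if x + y - z != 2 * c1 - s:        # Type 1: Z and S negative
            return False
        if -x - y + z != -(2 * c1) + s:    # Type 2: X, Y, and C negative
            return False
    return True
```

Calling `check_adder_types()` walks all eight input combinations, which illustrates why a single pair of equations serves both members of each type pair.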
3.2. Adder Implementation
In the XC4000 series, each CLB includes high-speed carry logic
that can be activated by configuration. The two 4-input function
generators may be configured as a 2-bit adder with built-in hidden
carry that can be expanded to any length. The 16-bit adder in
our MAC unit, which uses the dedicated carry logic, requires nine
CLBs. The middle 14 bits use 7 CLBs, one CLB is used for the
MSB, and one is used for the LSB of the adder. For each CLB in
the middle section, the F function is used for the lower-order bit
and the G function is used for the higher-order bit. Consequently,
the LSB must use the G function of its CLB and the MSB must use
the F function of its CLB.
In the case of the LSB CLB, two values must be input on
the G1 and G4 pins. The carry signal enters on the F1 pin,
propagates through the G carry logic, and exits on the COUT
pin. The F function of this CLB is not used and can be used for
other purposes. For the middle CLBs, the logic is configured to
perform a 2-bit addition of A+B in both the F and G functions,
with the lower-order A and B inputs on the F1 and F2 pins, and
the higher-order A and B inputs on the G1 and G4 pins. The carry
signal enters on the CIN pin, propagates through the F and G carry
logic, and exits on the COUT pin. For the MSB CLB, the two
values must be input on F1 and F2 pins. The carry signal enters
on the CIN pin, propagates through the F carry logic, and exits
on the COUT pin. The G function generator of this CLB is used
to access the carry out signal or calculate a two’s complement
overflow.
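The partitioning of the 16-bit adder across nine CLBs can be modeled behaviorally. The sketch below is an assumption-laden software model, not a description of the actual CLB configuration: it only mirrors the bit allocation described above (one bit in the LSB CLB, two bits in each of the seven middle CLBs, one bit in the MSB CLB) and derives the two's complement overflow from the carries into and out of the MSB. The names `full_add` and `clb_adder16` are hypothetical.

```python
def full_add(a, b, cin):
    # Conventional 1-bit full adder.
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout


def clb_adder16(a, b, carry_in=0):
    """Add two 16-bit words by rippling a carry through nine CLB stages.

    Returns (sum, carry_out, overflow, number_of_clbs).
    """
    bits_per_clb = [1] + [2] * 7 + [1]  # LSB CLB, seven middle CLBs, MSB CLB
    carry = carry_in
    carry_into_msb = 0
    result = 0
    pos = 0
    for width in bits_per_clb:
        for _ in range(width):
            if pos == 15:
                carry_into_msb = carry
            s, carry = full_add((a >> pos) & 1, (b >> pos) & 1, carry)
            result |= s << pos
            pos += 1
    # Two's complement overflow: carry into the MSB differs from carry out.
    overflow = carry_into_msb ^ carry
    return result, carry, overflow, len(bits_per_clb)
```

For example, `clb_adder16(0x7FFF, 0x0001)` produces the sum `0x8000` with the overflow flag set, since the largest positive 16-bit value has wrapped into the negative range.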
The limitation of using this built-in carry logic is that the carry
out (COUT) pin of a CLB can only be connected to the carry in
(CIN) pin of the CLBs above or below. Thus the adder using fast
carry logic can only be configured vertically in the array.

[Figure 5. Layout of Single MAC Unit on XC4010 FPGA]
The dedicated carry circuitry greatly increases the efficiency
and performance of adders. Conventional methods for improving
performance such as carry generate/propagate are not useful even
at 16-bit level, and are of marginal benefit at longer wordlengths.
In our case, the 16-bit adder has a combinatorial delay of only
20.5 ns.
3.3. MAC Implementation
We use the most significant 8 output bits of the multiplier as the
input to the low order bits of the adder. The 8-bit input of the
adder is sign-extended and added with previous outputs using
two's complement addition.
The basic structure of the MAC unit can use pipeline registers
between the multiplier and accumulator to increase the throughput.
The flip-flops in the CLBs are used as pipeline registers and hence
no additional CLBs are needed.
The layout of a single MAC unit on an XC4000-series part is
shown in Figure 5.
The performance of the MAC unit with an 8-bit by 8-bit multiply
and 16 bit accumulator is determined by the speed of the
multiplier. The worst case multiplier delay reported is approaching
100 ns. The MAC unit can thus support a clock speed better
than 10 MHz. With the use of the horizontal longlines to distribute
the critical path signals, the speed can be further improved,
although this may restrict the use of the MAC unit in various
system configurations. The implementation of a MAC unit on an
XC4000-series part requires 73 CLBs.
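The arithmetic of the MAC unit described in this section can be illustrated with a short behavioral model. This is a sketch under stated assumptions: operands are treated as two's complement integers, the upper 8 bits of the 16-bit product are obtained by an arithmetic right shift, and the accumulator wraps at 16 bits. The helper names are hypothetical, and the 64-CLB multiplier figure is taken from the floorplan labels rather than from the text of this section.

```python
MULTIPLIER_CLBS = 64  # 8 x 8 multiplier, per the floorplan figure labels
ADDER_CLBS = 9        # 16-bit adder using dedicated carry logic


def to_signed(value, bits):
    # Interpret the low `bits` bits of value as a two's complement number.
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def mac_step(acc, x, a):
    """One multiply-accumulate step on 8-bit operands and a 16-bit accumulator."""
    product = to_signed(x, 8) * to_signed(a, 8)  # fits in 16 bits
    top8 = to_signed(product >> 8, 8)            # most significant 8 product bits
    # Python integers are already sign-extended, so adding top8 to the
    # accumulator and wrapping to 16 bits models the sign-extended addition.
    return to_signed(acc + top8, 16)


def max_clock_mhz(critical_path_ns):
    # Upper bound on clock rate for a given combinatorial delay.
    return 1000.0 / critical_path_ns
```

With a 100 ns multiplier delay, `max_clock_mhz(100)` gives 10.0, matching the clock rate quoted above, and the two CLB constants sum to the 73 CLBs reported for the MAC unit.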
[Figure 6: (a) Canonic Form, (b) Pipelined Form, (c) Inverted Form]
4. FIR FILTERS
4.1. Filter Structures
The transfer function of an N tap FIR filter is given by
4.2. High Performance Filters on FPGAs
The inverted form shown in Figure 6(c) is well-suited for achieving
a high sampling rate even for higher order filters. This is
possible because the throughput does not depend strongly on the
number of taps due to extensive pipelining. The fact that the
multipliers occupy a large area, however, might render the
implementation of higher order filters impractical.
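The reason the inverted form pipelines well can be seen in a small software model. The sketch below assumes that the inverted form corresponds to the transposed direct form, in which the input sample is broadcast to every multiplier and a register sits between successive adders, so that each clock period contains only one multiply and one add regardless of the number of taps. Both function names are illustrative.

```python
def fir_canonic(coeffs, samples):
    """Canonic form: y[k] is the sum over i of a[i] * x[k - i]."""
    delay = [0] * len(coeffs)
    out = []
    for x in samples:
        delay = [x] + delay[:-1]  # shift the new sample into the delay line
        out.append(sum(a * d for a, d in zip(coeffs, delay)))
    return out


def fir_inverted(coeffs, samples):
    """Transposed form: input broadcast to all taps, registers between adders."""
    n = len(coeffs)
    regs = [0] * n  # regs[i] holds the partial sum feeding the adder of tap i
    out = []
    for x in samples:
        products = [a * x for a in coeffs]
        out.append(products[0] + regs[0])
        # Each register captures one product plus the neighbouring register.
        regs = [products[i + 1] + regs[i + 1] for i in range(n - 1)] + [0]
    return out
```

Both functions return the same output sequence for the same coefficients and input; the difference lies only in where the delays are placed, which is what shortens the critical path in hardware.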
It has been shown in [2] that a high performance FIR filter

[Figure: floorplan on a Xilinx XC4010 (20 x 20 CLBs) with four 8 x 8 multipliers (64 CLBs each)]

[Figure: filter structure with input Xk, output Yk, delays D, and coefficients a1 ... aN-1 and b0 ... bN-1]
the dominant delay is not necessarily in the logic path with two adders as it may seem, but could very well be in other paths with one multiplier and adder, unless placement is carefully considered.

The floorplan is illustrated in Figure 10. The placement was done carefully to reduce the routing delays. Three 8 by 8 multiplier units are arranged at the top of the array of 24 by 24 CLBs and two multipliers below them. The placement of the multipliers corresponding to a2 and b2 is not critical because there are no adders in their logic paths. The adders are arranged vertically to make use of the dedicated carry logic. The scaling block is implemented with a shifter whose shift is controlled by a 3 bit input (S). This shifter is implemented in two stages, the first stage

the first second order section needs 12 columns and the second one needs 11 columns. This was implemented in a single XC4013 chip. The placement of the key modules is given in Figure 11.

Addition of N inputs is performed by N − 1 stages of dedicated carry logic (DC) adder/subtractors. The dotted lines represent horizontal longlines and the shaded triangles, the shifts. Blocks which have registers are shaded. The shift for the scaling block and the shifts needed for the multiplication tend to cancel each other, thus shifting in the scaling block is absorbed into the shift of the multiplier units. Negative coefficients were handled by using the dedicated carry subtractors; these columns are represented in the layout by the small circles at the inputs to the adders. The
[Figure: floorplan on a Xilinx XC4013 (24 x 24 CLBs), ratios to scale, with 8 x 8 multipliers (64 CLBs each), adders, and a shifter; inputs a1, a2, b0, b1, b2, xk, and s; output yk]
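The text above notes that N inputs are summed by N − 1 stages of dedicated carry adder/subtractors, with subtractors used where a coefficient is negative. A minimal sketch of that idea follows, assuming each product is formed with the coefficient magnitude and that the first term belongs to a positive coefficient so it can seed the chain; the function name and argument layout are hypothetical.

```python
def accumulate_terms(products, negative):
    """Sum N product terms using N - 1 add/subtract stages.

    products[i] is the input multiplied by the magnitude of coefficient i.
    negative[i] is True when coefficient i is negative, selecting a subtractor.
    Returns (total, number_of_stages).
    """
    if not products or len(products) != len(negative):
        raise ValueError("products and negative must be non-empty and equal length")
    if negative[0]:
        raise ValueError("first term is assumed to have a positive coefficient")
    total = products[0]
    stages = 0
    for value, neg in zip(products[1:], negative[1:]):
        # A subtractor stage replaces an explicit negation of the product.
        total = total - value if neg else total + value
        stages += 1
    return total, stages
```

For three terms, `accumulate_terms([10, 4, 3], [False, True, False])` uses two stages, one of them a subtractor.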
for many applications. Our examples of FIR and IIR filter implementations illustrate that the FPGA approach is both flexible and provides performance comparable or superior to traditional approaches. Because of the programmability of this technology, the examples in this paper can be extended to provide a variety of other high performance FIR and IIR filter realizations.

REFERENCES

[1] P. R. Cappello, editor. VLSI Signal Processing. IEEE Press, 1984.
[2] J. B. Evans. An efficient FIR filter architecture. In IEEE Int. Symp. Circuits and Syst., pages 627–630, May 1993.
[3] J. B. Evans, Y. C. Lim, and B. Liu. A high speed programmable digital FIR filter. In IEEE Int. Conf. Acoust., Speech, Signal Processing, Apr 1990.
[4] R. Hartley, P. Corbett, P. Jacob, and S. Karr. A high speed FIR filter designed by compiler. In IEEE Cust. IC Conf., pages 20.2.1–20.2.4, May 1989.
[5] M. Hatamian and G. Cash. A 70-MHz 8-bit x 8-bit parallel pipelined multiplier in 2.5-μm CMOS. IEEE J. Solid State Circuits, SC-21:505–513, Aug 1986.
[6] M. Hatamian and S. Rao. A 100 MHz 40-tap programmable FIR filter chip. In IEEE Int. Symp. Circuits and Syst., pages 3053–3056, May 1990.
[7] K. Hwang. Computer Arithmetic: Principles, Architecture, and Design. John Wiley & Sons, Inc., 1979.
[8] R. Jain, P. Yang, and T. Yoshino. Firgen: A computer-aided design system for high performance FIR filter integrated circuits. IEEE Trans. Signal Processing, 39(7):1655–1668, Jul 1991.
[9] K.-Y. Khoo, A. Kwentus, and A. N. Willson, Jr. An efficient 175 MHz programmable FIR digital filter. In IEEE Int. Symp. Circuits and Syst., pages 72–75, May 1993.
[10] I. Koren. Computer Arithmetic Algorithms. Prentice-Hall, 1993.
[11] S. Y. Kung. VLSI Array Processors. Prentice-Hall, 1988.
[12] S. Y. Kung, R. E. Owen, and J. G. Nash, editors. VLSI Signal Processing II. IEEE Press, 1986.
[13] S. Y. Kung, H. J. Whitehouse, and T. Kailath, editors. VLSI and Modern Signal Processing. Prentice-Hall, Inc., 1985.
[14] J. Laskowski and H. Samueli. A 150-MHz 43-tap half-band FIR digital filter in 1.2-μm CMOS generated by compiler. In IEEE Cust. IC Conf., pages 11.4.1–11.4.4, May 1992.
[15] D. Li, J. Song, and Y. C. Lim. A polynomial-time algorithm for designing digital filters with power-of-two coefficients. In IEEE Int. Symp. Circuits and Syst., pages 84–87, May 1993.
[16] Y. C. Lim and B. Liu. Design of cascade form FIR filters with discrete valued coefficients. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-36:1735–1739, Nov 1988.
[17] H. Moscovitz, W. Bullman, J. Carelli, et al. Automatic full-custom design of high performance DSP chips. In Proc. IEEE Globecom, 1988.
[18] A. Oppenheim and R. Schafer. Digital Signal Processing. Prentice-Hall, Inc., 1975.
[19] P. Ruetz. The architectures and design of a 20-MHz real-time DSP chip set. IEEE J. Solid State Circuits, 24(2):338–348, Apr 1989.
[20] P. Yang, T. Yoshino, R. Jain, and W. Gass. A functional silicon compiler for high speed FIR digital filters. In IEEE Int. Conf. Acoust., Speech, Signal Processing, pages 1329–1332, Apr 1990.
[21] F. Yassa, J. Jasica, et al. A silicon compiler for digital signal processing: Methodology, implementation, and applications. Proc. IEEE, 75(9):1272–1282, Sep 1987.
[22] T. Yoshino, R. Jain, et al. A 100-MHz 64-tap FIR digital filter in 0.8-μm BiCMOS gate array. IEEE J. Solid State Circuits, 25(6):1494–1501, Dec 1990.