
IP FFT Processors for OFDM in FPGA

Christopher Kennedy
Broadband Wizard Inc.
Email: kennedy@bbwizard.com

May 28, 2009

Abstract

This brief examines the important issues surrounding the fast Fourier transform (FFT) processor for an orthogonal frequency-
division multiplexing (OFDM) system when implemented on a field-programmable gate array (FPGA). The mathematical
preliminaries of the FFT computation are reviewed, and the formulations of common radix values as well as their butterfly
architectures are derived. Afterward, the FFT processor performance requirements for OFDM are stated. Finally, the intellectual
property (IP) FFT cores offered by Altera and Xilinx are compared for deployment in the BBWizard OFDM system.

Index Terms

orthogonal frequency-division multiplexing (OFDM), fast Fourier transform (FFT), field-programmable gate array (FPGA)

I. INTRODUCTION
Orthogonal frequency-division multiplexing (OFDM) was first proposed by Chang in [1]. OFDM is a spectrum-efficient
multi-carrier modulation technique [2] that is used in many modern wireless transmission systems. The performance of
the fast Fourier transform (FFT) hardware block is most critical to the successful implementation of OFDM [3]. In [4], it is
noted that the FFT requirement affords some interesting possibilities given the speed, power, and size constraints of a platform.
The FFT is a method for the fast computation of the discrete Fourier transform (DFT). Modern FFT approaches are
based on the algorithm proposed by Cooley and Tukey [5]; this brief uses its decimation-in-frequency (DIF) form. Assuming
the sequence length is fixed, the radix is the most important characteristic of an approach: it determines the number of stages
as well as the number of elements processed during an operation. Other considerations include the input and output sequence
ordering of the computation, the precision of the operands, and the chosen architecture. Clearly, designers face many decisions
when implementing the FFT in hardware.
Intellectual property (IP) cores that are designed for the FFT computation are typically configurable, offering support for
most of the common parameters, i.e., different sequence lengths, radix values, bit orderings, precision, and architectures. The
performance of these IP cores is typically lower than that of custom implementations since they are designed for the general
case and to be scalable. However, using IP cores can have some performance advantages if the core exploits dedicated circuitry
such as external multipliers or high-performance configurable logic blocks (CLBs).
In this brief, we review the major concepts relating to the realization of the FFT on a field-programmable gate array (FPGA)
for use in an OFDM system. The formulations and butterfly architectures for the three most common radix values, i.e., Radix-2
[5], Radix-4, and Radix-2/4 (Split-Radix) [6], are derived. Then, the IP cores provided by Altera [7] and Xilinx [8] that
implement the FFT computation in FPGA are examined. The features offered by these cores are summarized so that a
comparison can be made.
The remainder of this brief is organized as follows. In Section II, the FFT computation is reviewed. In Section III, the
OFDM FFT requirements for IEEE 802.11a are given. In Section IV, the two IP cores are discussed. Finally, this brief is
concluded in Section V.

II. FFT COMPUTATION


This section outlines the mathematics and basic hardware operations of the FFT and inverse FFT (IFFT) computations. We
begin by explaining the motivation for developing the FFT. Afterward, the FFT formulations for the popular radix
values are derived and compared. Finally, the architecture to implement the IFFT computation is shown.

A. DFT Computation

The DFT, denoted by X[k], of a finite input sequence x[n] is defined as

$$
X[k] = \sum_{n=0}^{N-1} x[n] \cdot e^{-j 2\pi k n / N} = \sum_{n=0}^{N-1} x[n] \cdot W_N^{nk}, \tag{1}
$$

for 0 ≤ k ≤ N − 1, where N is the sequence length, $j = \sqrt{-1}$, and

$$
\begin{aligned}
W_N^{nk} &= e^{-j 2\pi \cdot nk / N} \\
&= \cos\left(\frac{2\pi \cdot nk}{N}\right) - j \cdot \sin\left(\frac{2\pi \cdot nk}{N}\right) \\
&= \left[\cos\left(\frac{2\pi \cdot n}{N}\right) - j \cdot \sin\left(\frac{2\pi \cdot n}{N}\right)\right]^{k}.
\end{aligned} \tag{2}
$$

Observing (1), one notices that the computation of each sample X[k] requires N complex multiplications and N − 1 complex
additions. Hence, the computation of an N-point DFT sequence requires N^2 complex multiplications and N · (N − 1) complex
additions, or 4N^2 real multiplications and (4N − 2) · N real additions. Consequently, the total number of operations to compute
an N-point DFT increases rapidly as N increases [9].
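The definition in (1) and the operation counts above can be checked with a few lines of code. The following is a minimal sketch in plain Python, not an implementation taken from any of the cited cores; the function names `dft` and `dft_operation_counts` are illustrative assumptions.

```python
import cmath


def dft(x):
    """Direct evaluation of (1): X[k] = sum over n of x[n] * W_N^(n*k)."""
    N = len(x)
    # Twiddle factors W_N^m = exp(-j*2*pi*m/N) for m = 0..N-1.
    # W_N^(n*k) depends only on (n*k) mod N because of periodicity.
    W = [cmath.exp(-2j * cmath.pi * m / N) for m in range(N)]
    return [sum(x[n] * W[(n * k) % N] for n in range(N)) for k in range(N)]


def dft_operation_counts(N):
    """Operation counts for the direct N-point DFT, as stated in the text."""
    return {
        "complex_mul": N * N,
        "complex_add": N * (N - 1),
        "real_mul": 4 * N * N,
        "real_add": (4 * N - 2) * N,
    }


if __name__ == "__main__":
    # Counts for the transform length used in this brief (N = 64).
    print(dft_operation_counts(64))
```

The real-operation counts follow from treating each complex multiplication as four real multiplications and two real additions, and each complex addition as two real additions.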

B. FFT Computation

The basic concept behind the FFT computation is to successively decompose the N-point DFT computation into smaller
sub-computations, and then to capitalize on the periodicity and symmetry properties of the twiddle factors ($W_N^{nk}$) to reduce
the number of operations [9] required for the DFT computation. The FFT algorithm presented in [5] is designed for situations
when N is a power of 2, i.e., N = 2^m, and offers far superior performance over the previous algorithms which do not place
restrictions on N. Consequently, most of the modern FFT implementations (including the IP cores [7] and [8]) are derived
from [5] and only consider N = 2^m (as in the BBWizard OFDM system, which requires N = 64).
When implementing the FFT computation, there are many ways to structure the computation [10]. The radix determines
how many sequence items are processed in a given operation as well as the number of stages. As stated in the introduction,
three common radix values are Radix-2, -4, and -2/4. Note that, for a Radix-l approach, log_l N stages are required, with N/l
operations per stage. Next, we provide the derivations for each of the three common radix values.
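As a quick illustration of the stage and operation counts for a Radix-l approach, the sketch below evaluates log_l N and N/l for a given length and radix. The helper name `radix_schedule` is hypothetical and is introduced only for this example.

```python
def radix_schedule(N, l):
    """Return (stages, operations_per_stage) for a Radix-l FFT of length N.

    stages = log_l(N) and operations_per_stage = N / l, as stated in the text.
    """
    stages = 0
    m = N
    while m > 1:
        if m % l != 0:
            raise ValueError("N must be a power of the radix l")
        m //= l
        stages += 1
    return stages, N // l


if __name__ == "__main__":
    for radix in (2, 4):
        stages, ops = radix_schedule(64, radix)
        print(f"Radix-{radix}: {stages} stages, {ops} operations per stage")
```

For N = 64, Radix-2 needs twice as many stages as Radix-4, but each Radix-4 operation processes four samples instead of two.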
1) Common Decimation-in-Frequency FFT Algorithms: Here, we provide the derivations for three common Cooley-Tukey
FFT algorithms, Radix-2, Radix-4, and Radix-2/4. These formulations are derived from the definition of the DFT in (1).
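Radix-2: The Radix-2 approach begins by decomposing x[n] in (1) into two pieces, i.e.,

$$
\begin{aligned}
X[k] &= \sum_{n=0}^{\frac{N}{2}-1} x[n] \cdot W_N^{nk} + \sum_{n=\frac{N}{2}}^{N-1} x[n] \cdot W_N^{nk} \\
&= \sum_{n=0}^{\frac{N}{2}-1} x[n] \cdot W_N^{nk} + \sum_{n=0}^{\frac{N}{2}-1} x\left[n+\frac{N}{2}\right] \cdot W_N^{\left(n+\frac{N}{2}\right)k} \\
&= \sum_{n=0}^{\frac{N}{2}-1} x[n] \cdot W_N^{nk} + W_N^{\frac{Nk}{2}} \cdot \sum_{n=0}^{\frac{N}{2}-1} x\left[n+\frac{N}{2}\right] \cdot W_N^{nk} \\
&= \sum_{n=0}^{\frac{N}{2}-1} \left\{ x[n] + (-1)^k \cdot x\left[n+\frac{N}{2}\right] \right\} \cdot W_N^{nk},
\end{aligned} \tag{3}
$$

for 0 ≤ k ≤ N − 1. Note that $W_N^{Nk/2} = (-1)^k$ from (2). Splitting (decimating) X[k] in (3) into even- and odd-samples, one
obtains

$$
\begin{aligned}
X[2k] &= \sum_{n=0}^{\frac{N}{2}-1} \left\{ x[n] + x\left[n+\frac{N}{2}\right] \right\} \cdot W_N^{n(2k)} \\
&= \sum_{n=0}^{\frac{N}{2}-1} \left\{ x[n] + x\left[n+\frac{N}{2}\right] \right\} \cdot W_{N/2}^{nk}
\end{aligned} \tag{4}
$$

and

$$
\begin{aligned}
X[2k+1] &= \sum_{n=0}^{\frac{N}{2}-1} \left\{ x[n] - x\left[n+\frac{N}{2}\right] \right\} \cdot W_N^{n(2k+1)} \\
&= \sum_{n=0}^{\frac{N}{2}-1} \left\{ x[n] - x\left[n+\frac{N}{2}\right] \right\} \cdot W_N^{n} \cdot W_{N/2}^{nk}
\end{aligned} \tag{5}
$$

for 0 ≤ k ≤ N/2 − 1. Recalling (1), one notices that the expressions (4) and (5) are the N/2-point DFTs of the following sequences

$$
\begin{aligned}
x_0[n] &= x[n] + x\left[n+\frac{N}{2}\right] \\
x_1[n] &= \left\{ x[n] - x\left[n+\frac{N}{2}\right] \right\} \cdot W_N^{n},
\end{aligned} \tag{6}
$$

respectively. The architecture that implements (6) is shown in Figure 1 and consists of two complex adders and one complex
multiplier. This architecture is commonly referred to as the Radix-2 Butterfly, and to perform an N-point DFT computation,
(N/2) log_2 N Radix-2 Butterfly operations are required. Consequently, compared to the naive implementation of the DFT in (1),
this approach yields a reduction from N^2 to (N/2) log_2 N complex multiplications and from N(N − 1) to N log_2 N complex additions.

Figure 1: The Radix-2 Butterfly architecture (inputs x0 and x1, outputs y0 and y1, twiddle factor $W_N^{n}$).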
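The Radix-2 decomposition in (4) to (6) can be written directly as a recursive routine. The following is a minimal software sketch, assuming N is a power of 2; it is meant to illustrate the butterfly of (6), not the architecture of any particular core, and the name `fft_radix2_dif` is an assumption made for this example.

```python
import cmath


def fft_radix2_dif(x):
    """Radix-2 decimation-in-frequency FFT following (4)-(6).

    Returns X[k] in natural order; len(x) must be a power of 2.
    """
    N = len(x)
    if N == 1:
        return list(x)
    if N % 2:
        raise ValueError("length must be a power of 2")
    half = N // 2
    # Butterfly of (6): one sum, one difference, one twiddle multiplication.
    x0 = [x[n] + x[n + half] for n in range(half)]
    x1 = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)
          for n in range(half)]
    X = [0j] * N
    X[0::2] = fft_radix2_dif(x0)  # even-indexed outputs, equation (4)
    X[1::2] = fft_radix2_dif(x1)  # odd-indexed outputs, equation (5)
    return X


if __name__ == "__main__":
    print(fft_radix2_dif([1, 0, 0, 0, 0, 0, 0, 0]))
```

The recursion places the even and odd outputs back into natural order in software; a hardware pipeline would typically leave them in bit-reversed order instead.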

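Radix-4: The Radix-4 approach is a straightforward extension of the Radix-2 formulation. Here, one decomposes x[n]
in (1) into four pieces, i.e.,

$$
\begin{aligned}
X[k] &= \sum_{n=0}^{\frac{N}{4}-1} x[n] \cdot W_N^{nk}
 + W_N^{\frac{Nk}{4}} \cdot \sum_{n=0}^{\frac{N}{4}-1} x\left[n+\frac{N}{4}\right] \cdot W_N^{nk}
 + W_N^{\frac{Nk}{2}} \cdot \sum_{n=0}^{\frac{N}{4}-1} x\left[n+\frac{N}{2}\right] \cdot W_N^{nk} \\
&\quad + W_N^{\frac{3Nk}{4}} \cdot \sum_{n=0}^{\frac{N}{4}-1} x\left[n+\frac{3N}{4}\right] \cdot W_N^{nk} \\
&= \sum_{n=0}^{\frac{N}{4}-1} \left\{ x[n] + (-j)^k \cdot x\left[n+\frac{N}{4}\right] + (-1)^k \cdot x\left[n+\frac{N}{2}\right] + j^k \cdot x\left[n+\frac{3N}{4}\right] \right\} \cdot W_N^{nk},
\end{aligned} \tag{7}
$$

for 0 ≤ k ≤ N − 1. Note that $W_N^{Nk/4} = (-j)^k$, $W_N^{Nk/2} = (-1)^k$, and $W_N^{3Nk/4} = j^k$, all from (2). Splitting (decimating) X[k] in
(7) into four sample cases gives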
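$$
\begin{aligned}
X[4k] &= \sum_{n=0}^{\frac{N}{4}-1} \left\{ x[n] + x\left[n+\frac{N}{4}\right] + x\left[n+\frac{N}{2}\right] + x\left[n+\frac{3N}{4}\right] \right\} \cdot W_N^{n(4k)} \\
&= \sum_{n=0}^{\frac{N}{4}-1} \left\{ x[n] + x\left[n+\frac{N}{4}\right] + x\left[n+\frac{N}{2}\right] + x\left[n+\frac{3N}{4}\right] \right\} \cdot W_{N/4}^{nk}
\end{aligned} \tag{8}
$$

$$
\begin{aligned}
X[4k+1] &= \sum_{n=0}^{\frac{N}{4}-1} \left\{ x[n] - j \cdot x\left[n+\frac{N}{4}\right] - x\left[n+\frac{N}{2}\right] + j \cdot x\left[n+\frac{3N}{4}\right] \right\} \cdot W_N^{n(4k+1)} \\
&= \sum_{n=0}^{\frac{N}{4}-1} \left\{ x[n] - j \cdot x\left[n+\frac{N}{4}\right] - x\left[n+\frac{N}{2}\right] + j \cdot x\left[n+\frac{3N}{4}\right] \right\} \cdot W_N^{n} \cdot W_{N/4}^{nk}
\end{aligned} \tag{9}
$$

$$
\begin{aligned}
X[4k+2] &= \sum_{n=0}^{\frac{N}{4}-1} \left\{ x[n] - x\left[n+\frac{N}{4}\right] + x\left[n+\frac{N}{2}\right] - x\left[n+\frac{3N}{4}\right] \right\} \cdot W_N^{n(4k+2)} \\
&= \sum_{n=0}^{\frac{N}{4}-1} \left\{ x[n] - x\left[n+\frac{N}{4}\right] + x\left[n+\frac{N}{2}\right] - x\left[n+\frac{3N}{4}\right] \right\} \cdot W_N^{2n} \cdot W_{N/4}^{nk}
\end{aligned} \tag{10}
$$

and

$$
\begin{aligned}
X[4k+3] &= \sum_{n=0}^{\frac{N}{4}-1} \left\{ x[n] + j \cdot x\left[n+\frac{N}{4}\right] - x\left[n+\frac{N}{2}\right] - j \cdot x\left[n+\frac{3N}{4}\right] \right\} \cdot W_N^{n(4k+3)} \\
&= \sum_{n=0}^{\frac{N}{4}-1} \left\{ x[n] + j \cdot x\left[n+\frac{N}{4}\right] - x\left[n+\frac{N}{2}\right] - j \cdot x\left[n+\frac{3N}{4}\right] \right\} \cdot W_N^{3n} \cdot W_{N/4}^{nk}
\end{aligned} \tag{11}
$$

for 0 ≤ k ≤ N/4 − 1. Now, recalling (1), one notices that the expressions (8), (9), (10), and (11) are the N/4-point DFTs of the
following sequences

$$
\begin{aligned}
x_0[n] &= x[n] + x\left[n+\frac{N}{4}\right] + x\left[n+\frac{N}{2}\right] + x\left[n+\frac{3N}{4}\right] \\
x_1[n] &= \left\{ x[n] - j \cdot x\left[n+\frac{N}{4}\right] - x\left[n+\frac{N}{2}\right] + j \cdot x\left[n+\frac{3N}{4}\right] \right\} \cdot W_N^{n} \\
x_2[n] &= \left\{ x[n] - x\left[n+\frac{N}{4}\right] + x\left[n+\frac{N}{2}\right] - x\left[n+\frac{3N}{4}\right] \right\} \cdot W_N^{2n} \\
x_3[n] &= \left\{ x[n] + j \cdot x\left[n+\frac{N}{4}\right] - x\left[n+\frac{N}{2}\right] - j \cdot x\left[n+\frac{3N}{4}\right] \right\} \cdot W_N^{3n},
\end{aligned}
$$

respectively, which can be expressed as

$$
\begin{aligned}
x_0[n] &= \left( x[n] + x\left[n+\frac{N}{2}\right] \right) + \left( x\left[n+\frac{N}{4}\right] + x\left[n+\frac{3N}{4}\right] \right) \\
x_1[n] &= \left\{ \left( x[n] - x\left[n+\frac{N}{2}\right] \right) - j \cdot \left( x\left[n+\frac{N}{4}\right] - x\left[n+\frac{3N}{4}\right] \right) \right\} \cdot W_N^{n} \\
x_2[n] &= \left\{ \left( x[n] + x\left[n+\frac{N}{2}\right] \right) - \left( x\left[n+\frac{N}{4}\right] + x\left[n+\frac{3N}{4}\right] \right) \right\} \cdot W_N^{2n} \\
x_3[n] &= \left\{ \left( x[n] - x\left[n+\frac{N}{2}\right] \right) + j \cdot \left( x\left[n+\frac{N}{4}\right] - x\left[n+\frac{3N}{4}\right] \right) \right\} \cdot W_N^{3n}.
\end{aligned} \tag{12}
$$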

An architecture that implements (12) is shown in Figure 2; it consists of eight complex adders and three complex
multipliers. Note that the multiplication by −j is trivial. This architecture is commonly referred to as the Radix-4 Butterfly (or
Dragonfly), and to perform an N-point DFT computation, (N/4) log_4 N Radix-4 Butterfly operations are required. The Radix-4
architecture offers speed improvement over the Radix-2 architecture at the cost of additional area.

Figure 2: The Radix-4 Butterfly (Dragonfly) architecture (inputs x0 to x3; outputs in the order y0, y2, y1, y3; twiddle factors $W_N^{2n}$, $W_N^{n}$, and $W_N^{3n}$; trivial multiplication by −j).
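The grouped form in (12) maps onto code in the same way as the Radix-2 case. The sketch below is a software illustration only, assuming N is a power of 4 (as with N = 64); the intermediate names `a`, `b`, `c`, `d` and the function name `fft_radix4_dif` are assumptions made for this example.

```python
import cmath


def fft_radix4_dif(x):
    """Radix-4 decimation-in-frequency FFT following (8)-(12).

    Returns X[k] in natural order; len(x) must be a power of 4.
    """
    N = len(x)
    if N == 1:
        return list(x)
    if N % 4:
        raise ValueError("length must be a power of 4")
    q = N // 4
    x0, x1, x2, x3 = [], [], [], []
    for n in range(q):
        # First rank of adders (four complex additions).
        a = x[n] + x[n + 2 * q]
        b = x[n] - x[n + 2 * q]
        c = x[n + q] + x[n + 3 * q]
        d = x[n + q] - x[n + 3 * q]
        # Second rank of adders (four more) and three twiddle multipliers.
        w1 = cmath.exp(-2j * cmath.pi * n / N)
        w2 = cmath.exp(-2j * cmath.pi * 2 * n / N)
        w3 = cmath.exp(-2j * cmath.pi * 3 * n / N)
        x0.append(a + c)
        x1.append((b - 1j * d) * w1)
        x2.append((a - c) * w2)
        x3.append((b + 1j * d) * w3)
    X = [0j] * N
    X[0::4] = fft_radix4_dif(x0)  # equation (8)
    X[1::4] = fft_radix4_dif(x1)  # equation (9)
    X[2::4] = fft_radix4_dif(x2)  # equation (10)
    X[3::4] = fft_radix4_dif(x3)  # equation (11)
    return X


if __name__ == "__main__":
    print(fft_radix4_dif([1, 0, 0, 0]))
```

The multiplication by `1j` stands in for the trivial −j and +j terms; in hardware this is a swap of real and imaginary parts with a sign change rather than a multiplier.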

Radix-2/4 (Split-Radix): The Split-Radix (SRFFT) algorithm uses both radix-2 and radix-4 decompositions to reduce the
number of computations, and this approach processes four samples during an iteration. Beginning with the computation
of the even-indexed samples of X[k] in (1), i.e., X[2k], one obtains

$$
\begin{aligned}
X[2k] &= \sum_{n=0}^{N-1} x[n] \cdot W_N^{2nk} \\
&= \sum_{n=0}^{\frac{N}{2}-1} x[n] \cdot W_N^{2nk} + \sum_{n=\frac{N}{2}}^{N-1} x[n] \cdot W_N^{2nk} \\
&= \sum_{n=0}^{\frac{N}{2}-1} x[n] \cdot W_N^{2nk} + \sum_{n=0}^{\frac{N}{2}-1} x\left[n+\frac{N}{2}\right] \cdot W_N^{2\left(n+\frac{N}{2}\right)k} \\
&= \sum_{n=0}^{\frac{N}{2}-1} x[n] \cdot W_N^{2nk} + \sum_{n=0}^{\frac{N}{2}-1} x\left[n+\frac{N}{2}\right] \cdot W_N^{2nk} \cdot W_N^{Nk} \\
&= \sum_{n=0}^{\frac{N}{2}-1} \left\{ x[n] + x\left[n+\frac{N}{2}\right] \right\} \cdot W_N^{2nk} \\
&= \sum_{n=0}^{\frac{N}{2}-1} \left\{ x[n] + x\left[n+\frac{N}{2}\right] \right\} \cdot W_{N/2}^{nk}
\end{aligned} \tag{13}
$$

for 0 ≤ k ≤ N/2 − 1. Note that $W_N^{Nk} = 1$ from (2). Next, observe that the computation in (13) can be accomplished using the
Radix-2 approach (4).

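Now, for the computation of the odd-indexed samples of X[k] in (1), i.e., X[4k + 1] and X[4k + 3], one recalls the Radix-4
equations (9) and (11), reproduced here for convenience with the terms grouped as in (12):

$$
X[4k+1] = \sum_{n=0}^{\frac{N}{4}-1} \left\{ \left( x[n] - x\left[n+\frac{N}{2}\right] \right) - j \cdot \left( x\left[n+\frac{N}{4}\right] - x\left[n+\frac{3N}{4}\right] \right) \right\} \cdot W_N^{n} \cdot W_{N/4}^{nk} \tag{14}
$$

and

$$
X[4k+3] = \sum_{n=0}^{\frac{N}{4}-1} \left\{ \left( x[n] - x\left[n+\frac{N}{2}\right] \right) + j \cdot \left( x\left[n+\frac{N}{4}\right] - x\left[n+\frac{3N}{4}\right] \right) \right\} \cdot W_N^{3n} \cdot W_{N/4}^{nk} \tag{15}
$$

for 0 ≤ k ≤ N/4 − 1. One observes that (14) and (15) can be realized using the radix-4 approach.

Figure 3: The Radix-2/4 Butterfly architecture (inputs x0 to x3; outputs in the order y0, y2, y1, y3; twiddle factors $W_N^{n}$ and $W_N^{3n}$; trivial multiplication by j).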
Considering the above remarks, the Radix-2/4 Butterfly or Split-Radix Butterfly architecture is shown in Figure 3. Like
Radix-4, the Split-Radix approach requires (N/4) log_4 N Butterfly operations to compute an N-point DFT. One observes that
compared to the Radix-4 Butterfly, this architecture has a lower hardware complexity, consisting of two complex multipliers
and six complex adders.
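Combining (13) with (14) and (15) gives a recursive Split-Radix routine. The following is a minimal sketch under the assumption that N is a power of 2; the function name `fft_split_radix` is illustrative, and the recursion is a software rendering of the equations rather than a description of the butterfly scheduling in hardware.

```python
import cmath


def fft_split_radix(x):
    """Split-radix (Radix-2/4) DIF FFT following (13), (14), and (15).

    Returns X[k] in natural order; len(x) must be a power of 2.
    """
    N = len(x)
    if N == 1:
        return list(x)
    if N == 2:
        return [x[0] + x[1], x[0] - x[1]]
    if N % 4:
        raise ValueError("length must be a power of 2")
    h, q = N // 2, N // 4
    # Radix-2 part, equation (13): even-indexed outputs, no twiddle factors.
    even = [x[n] + x[n + h] for n in range(h)]
    # Radix-4 part, equations (14) and (15): two twiddle multiplications.
    u1, u3 = [], []
    for n in range(q):
        b = x[n] - x[n + h]
        d = x[n + q] - x[n + 3 * q]
        u1.append((b - 1j * d) * cmath.exp(-2j * cmath.pi * n / N))
        u3.append((b + 1j * d) * cmath.exp(-2j * cmath.pi * 3 * n / N))
    X = [0j] * N
    X[0::2] = fft_split_radix(even)  # X[2k]
    X[1::4] = fft_split_radix(u1)    # X[4k + 1]
    X[3::4] = fft_split_radix(u3)    # X[4k + 3]
    return X


if __name__ == "__main__":
    print(fft_split_radix([1, 0, 0, 0, 0, 0, 0, 0]))
```

For each group of four inputs the loop body uses six complex additions (two for the even branch, four for the odd branches) and two twiddle multiplications, matching the hardware counts quoted above.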
2) Higher-Order Radix Decimation-in-Frequency FFT Algorithms: Higher-order radix implementations are possible. They
require additional hardware; however, they generally offer lower multiplicative complexity [11]. The difficulty
with the implementation of these approaches is the large amount of hardware resources consumed. We note that one can derive
the Radix-8 and Radix-2/4/8 using approaches similar to those used for Radix-4 and Radix-2/4, respectively.

C. IFFT Computation

It is widely known that the IFFT of a sequence can be obtained as

$$
x[n] = \frac{1}{N} \left\{ \sum_{k=0}^{N-1} X^{*}[k] \cdot W_N^{nk} \right\}^{*} \tag{16}
$$

for 0 ≤ n ≤ N − 1, where ∗ denotes the complex conjugate. The beauty of the operation described in (16) is that it can be
performed with the same FFT hardware block. To do so, the imaginary components of both the input and output sequences are
negated, and the output is scaled by 1/N. This is illustrated in Figure 4.

Figure 4: IFFT computation via FFT [9] (the imaginary parts are negated at the input and at the output of the N-point FFT, and both output parts are scaled by 1/N).
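The conjugation trick in (16) is easy to verify in software. The sketch below is an illustration under stated assumptions: the forward transform `fft` is a simple recursive Radix-2 routine standing in for the FFT hardware block, and `ifft_via_fft` is a hypothetical name for the wrapper that negates the imaginary parts and scales by 1/N.

```python
import cmath


def fft(x):
    """Forward FFT (recursive radix-2 DIF); stands in for the hardware block."""
    N = len(x)
    if N == 1:
        return list(x)
    half = N // 2
    s = [x[n] + x[n + half] for n in range(half)]
    d = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)
         for n in range(half)]
    X = [0j] * N
    X[0::2] = fft(s)
    X[1::2] = fft(d)
    return X


def ifft_via_fft(X):
    """Inverse transform per (16): conjugate, forward FFT, conjugate, scale."""
    N = len(X)
    y = fft([complex(v).conjugate() for v in X])  # negate Im at the input
    return [v.conjugate() / N for v in y]          # negate Im at the output, scale by 1/N


if __name__ == "__main__":
    x = [1, 2, 3, 4, 0, 0, 0, 0]
    print(ifft_via_fft(fft(x)))  # recovers x up to rounding error
```

A forward transform followed by this inverse returns the original sequence, which is the unity-gain property of an FFT and IFFT pair in floating-point arithmetic.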



Table I: Radix comparisons: (a) Butterfly hardware complexity, (b) number of nontrivial operations for N = 64.

(a) Butterfly hardware complexity
Hardware     Radix-2   Radix-4   Radix-2/4
Complex ×    1         3         2
Complex +    2         8         6

(b) Number of nontrivial operations
N = 64       Radix-2   Radix-4   Radix-2/4
Real ×       264       208       196
Real +       1032      976       964

Table II: Typical OFDM modem FFT parameters.

Application                  FFT length   Clock speed   Data precision   FFT precision
Digital video broadcasting   1024-2048    20 MHz        8 bits           10-16 bits
ADSL downstream              256          20 MHz        12-16 bits       12-20 bits
ADSL upstream                32           10 MHz        12-16 bits       12-20 bits
WLAN                         16-64        20-200 MHz    8-10 bits        10-16 bits

III. OFDM FFT REQUIREMENTS


The requirements of the FFT processor for the wireless IEEE 802.11a standard [12] are as follows:
• 64-point FFT and IFFT computations; and
• t_FFT = 3.2 µs.
One notices that there is a great amount of flexibility in the design of the FFT processor, provided it meets the strict timing
criterion. For instance, Table I provides a comparison of the hardware requirements for the common butterfly architectures and
the number of nontrivial operations for N = 64.
Note that since real numbers are being represented in the digital domain, the precision of the computation is an important
consideration. Also, the FFT/IFFT blocks will interface to A/D and D/A converters; therefore, floating-point I/O is not desirable.
In [4], a table is provided that states the FFT precision for some OFDM standards; it is reproduced here in Table II for
convenience. Note that the FFT precision listed for WLAN is 10 to 16 bits; therefore, selecting a precision of 16
bits for FFT inputs should be more than acceptable in the BBWizard OFDM application.
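The timing requirement stated at the start of this section can be turned into a rough cycle budget. The sketch below is only an illustration: the clock frequency `clock_mhz` is a hypothetical value chosen inside the WLAN range of Table II, and the function names are assumptions made for this example. It does not model the latency of either IP core.

```python
def fft_timing_budget(n_points=64, t_fft_us=3.2, clock_mhz=100.0):
    """Cycle budget for one transform; clock_mhz is a hypothetical FPGA clock."""
    cycles = t_fft_us * clock_mhz  # microseconds times MHz gives clock cycles
    return {
        "sample_rate_msps": n_points / t_fft_us,
        "cycles_per_transform": cycles,
        "cycles_per_sample": cycles / n_points,
    }


def butterfly_count(n_points, radix):
    """Total butterfly operations: (N / radix) * log_radix(N)."""
    stages = 0
    m = n_points
    while m > 1:
        if m % radix != 0:
            raise ValueError("n_points must be a power of the radix")
        m //= radix
        stages += 1
    return (n_points // radix) * stages


if __name__ == "__main__":
    print(fft_timing_budget())
    print("Radix-2 butterflies:", butterfly_count(64, 2))
    print("Radix-4 butterflies:", butterfly_count(64, 4))
```

Comparing the butterfly count with the cycles available per transform gives a first indication of how many butterfly units, or how much pipelining, a given architecture would need at the assumed clock.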

IV. ALTERA AND XILINX FFT IP CORES


In this section, the IP FFT cores provided by Altera [7] and Xilinx [8] are compared. Note that one can use the structure
in Figure 4 to obtain the IFFT with either core. Some of the common features offered by these cores include:
• highly parameterizable;
• support for N = 64 transforms;
• bit-accurate MATLAB models;
• fixed-point two’s complement and IEEE-754 single-precision floating-point I/O (see footnote 1);
• natural and bit-reversed output orders; and
• burst- and streaming-I/O architectures.
Most of these features are expected and straightforward. Next, we provide further explanation for some of them.
First, it is known that due to the nature of the FFT computation, the magnitude of the internal values increases throughout
the stages of the computation. Therefore, when using fixed-point representation, larger bus widths are required for the
output values than for the input values in order to maintain precision. In [8], it is noted that for each Radix-2 and Radix-4
butterfly stage, the output can grow by a factor of up to 1 + √2 ≈ 2.414 and 1 + 3√2 ≈ 5.242, respectively. This implies a bit
growth of 2 and 3 bits per stage for Radix-2 and Radix-4, respectively. This issue is handled differently by each core, and
explanations are provided in the following sections.
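The step from the worst-case growth factors to the per-stage bit growth is a base-2 logarithm rounded up to a whole bit. The sketch below reproduces that arithmetic; the names `growth_bits`, `RADIX2_MAX_GAIN`, and `RADIX4_MAX_GAIN` are assumptions made for this example.

```python
import math

# Worst-case magnitude gain of one butterfly stage, as quoted from [8].
RADIX2_MAX_GAIN = 1 + math.sqrt(2)
RADIX4_MAX_GAIN = 1 + 3 * math.sqrt(2)


def growth_bits(max_gain):
    """Whole bits needed to absorb a worst-case magnitude gain of max_gain."""
    return math.ceil(math.log2(max_gain))


if __name__ == "__main__":
    for name, gain in (("Radix-2", RADIX2_MAX_GAIN), ("Radix-4", RADIX4_MAX_GAIN)):
        print(f"{name}: gain {gain:.3f}, log2 {math.log2(gain):.3f}, "
              f"{growth_bits(gain)} bits per stage")
```

The logarithms are about 1.27 and 2.39, which round up to the 2 and 3 bits per stage given in the text.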
Again, due to the nature of the FFT computation, the output samples are computed in bit-reversed order by default. That is,
Cooley-Tukey algorithms reorder the samples during the computation. Consequently, converting the output to a natural order
imposes a small time and area penalty. For the BBWizard OFDM system, paying this small penalty is likely worthwhile to
simplify the control of the system.

Footnote 1: 32-bit representation consisting of a 1-bit sign, 8-bit exponent, and 23-bit mantissa.
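The reordering from bit-reversed to natural order discussed above can be sketched as an index permutation. This example assumes a Radix-2 computation, where output position p holds the sample whose index is p with its bits reversed (a Radix-4 computation would use digit reversal instead); the function names are illustrative.

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def bit_reversed_to_natural(y):
    """Reorder a bit-reversed output block y into natural order.

    Assumes y[p] holds the sample with index bit_reverse(p), so the sample
    with index k is found at position bit_reverse(k).
    """
    N = len(y)
    bits = N.bit_length() - 1
    if N < 1 or (1 << bits) != N:
        raise ValueError("length must be a power of 2")
    return [y[bit_reverse(k, bits)] for k in range(N)]


if __name__ == "__main__":
    # For N = 8, a bit-reversed block carries indices 0, 4, 2, 6, 1, 5, 3, 7.
    print(bit_reversed_to_natural([0, 4, 2, 6, 1, 5, 3, 7]))
```

In hardware this permutation costs a buffer of N samples and one extra pass of addressing, which is the small time and area penalty mentioned in the text.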
At this stage of the project, for the BBWizard OFDM application, it seems more suitable to use a burst-I/O architecture.
We feel that the streaming-I/O architectures are not necessary because:
• our transform size is small,
• streaming architectures require a large amount of additional area, and
• control will be more complicated.
For these reasons, a properly configured burst architecture will be able to deliver adequate performance for our application.
In the following sections, we summarize the features of the Altera and Xilinx FFT IP cores with the burst-I/O and fixed-point
representation options selected.

A. Altera FFT MegaCore Function

The IP FFT processor provided by Altera is named “FFT MegaCore Function” [7]. This core is fully supported by the
Arria™ GX, Cyclone®, Cyclone II, Cyclone III, HardCopy® II, Stratix®, Stratix II, Stratix II GX, Stratix III, and Stratix GX
FPGAs, and has preliminary support for the Arria II GX, HardCopy III, HardCopy IV E, and Stratix IV FPGAs. The burst-I/O
architectures offered by this core are:
• Radix-4 and
• Mixed-radix.
To handle the growing magnitude of the data during the FFT computation, this core maintains the same data width for both
input and output buses and provides an exponent output value. In [7], performance data for selected FPGAs and transform
sizes is provided.
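Keeping a fixed data width and reporting a separate exponent is the general idea behind block scaling. The sketch below shows a generic version of that idea for a block of integers; it is not the algorithm documented in [7], it does not reproduce the sign convention or timing of the core's exponent output, and the function name `block_normalize` is hypothetical.

```python
def block_normalize(samples, width):
    """Scale a block by a common power of two so it fits in `width` signed bits.

    Returns (scaled_samples, shift), where each original sample is
    approximately scaled_sample * 2**shift.
    """
    limit = (1 << (width - 1)) - 1  # largest positive two's complement value
    peak = max(abs(s) for s in samples)
    shift = 0
    while (peak >> shift) > limit:
        shift += 1
    # Arithmetic right shift; the low-order bits are discarded.
    return [s >> shift for s in samples], shift


if __name__ == "__main__":
    scaled, shift = block_normalize([100, -300, 20], 8)
    print(scaled, shift)
```

The trade-off of this style of scaling is that the whole block shares one exponent, so small samples lose precision whenever a large sample forces a shift.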

B. Xilinx Fast Fourier Transform v6.0

The IP FFT processor provided by Xilinx is named “Xilinx Fast Fourier Transform v6.0” [8]. This core is supported
by the Virtex®-II Pro, Virtex-4, Virtex-5, Spartan®-3/XA, Spartan-3E/XA, and Spartan-3A/AN DSP FPGAs. The burst-I/O FFT
architectures offered by this core are:
• radix-4,
• radix-2, and
• radix-2 lite.
This core handles the bit growth by allowing the designer to choose between no scaling and a fixed scaling schedule.
With no scaling, all the significant integer bits are carried to the end of the computation; in other words, the output bus width
is greater than the input bus width. A crude plot illustrates the speed and area trade-offs of the various architectures.

V. SUMMARY AND RECOMMENDATIONS

In this final section, we summarize this brief and give our recommendations.

A. Summary

In this brief, we have reviewed the fast Fourier transform and its hardware implementation. The brief began with a short
review of the DFT. Then, the three common radixes for the FFT computation, i.e., Radix-2, Radix-4, and Radix-2/4, were
derived from the definition of the DFT. The only requirement for the FFT block in the BBWizard OFDM system is that it
must be able to perform the 64-point computation in under 3.2 µs. Finally, we examined the features of the FFT IP cores
provided by Altera and Xilinx.

B. Recommendations
This investigation suggests that the Xilinx core is the better option for the BBWizard OFDM application. Even though
performance data is not provided by Xilinx, our transform size is small (N = 64) and several architecture options are
available to meet our requirements. Having access to the full-precision outputs is another advantage that this core
has over the one offered by Altera.
Finally, we note that a possible follow-up to this report could detail the customization of the Xilinx Fast Fourier Transform
v6.0 block on the Spartan-3A DSP 1800 board. Another upcoming objective is achieving unity gain with an FFT and IFFT
computation.

REFERENCES
[1] R. W. Chang, “Synthesis of Band-Limited Orthogonal Signals for Multichannel Data Transmission,” The Bell System Technical Journal, pp. 1775–1796,
1966.
[2] M. Petrov and M. Glesner, “Optimal FFT Architecture Selection for OFDM Receivers on FPGA,” in IEEE 2005 Conference on Field-Programmable
Technology (FPT’05), 2005, pp. 313–314.
[3] H. Jiang, H. Luo, J. Tian, and W. Song, “Design of an Efficient FFT Processor for OFDM Systems,” IEEE Transactions on Consumer Electronics,
vol. 51, no. 4, pp. 1099–1103, 2005.
[4] N. Weste and D. J. Skellern, “VLSI for OFDM,” IEEE Communications Magazine, vol. 36, pp. 127–131, 1998.
[5] J. W. Cooley and J. W. Tukey, “An Algorithm for the Machine Calculation of Complex Fourier Series,” Mathematics of Computation, vol. 19, pp.
297–301, 1965.
[6] P. Duhamel, “Algorithms Meeting the Lower Bounds on the Multiplicative Complexity of Length-2^n DFT’s and Their Connection with Practical
Algorithms,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, pp. 1504–1511, 1990.
[7] FFT MegaCore Function: User Guide, 9th ed., Altera Corporation, 101 Innovation Drive, San Jose, CA 95134, March 2009.
[8] Fast Fourier Transform v6.0, 6th ed., Xilinx Inc., 2100 Logic Drive, San Jose, CA 95124, September 2008.
[9] S. K. Mitra, Digital Signal Processing: A Computer-Based Approach, 3rd ed. McGraw Hill, 2006.
[10] K. S. Hemmert and K. D. Underwood, “An Analysis of the Double-Precision Floating-Point FFT on FPGAs,” in 13th Annual IEEE Symposium on
Field-Programmable Custom Computing Machines (FCCM 2005), 2005, pp. 171–180.
[11] W.-C. Yeh and C.-W. Jen, “High-Speed and Low-Power Split-Radix FFT,” IEEE Transactions on Signal Processing, vol. 51, pp. 864–874, 2003.
[12] Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications, IEEE Std 802.11-2007 ed., IEEE, 3 Park Ave, New
York, NY 10015.
