EFFICIENT LOW MULTIPLIER COST 256-POINT FFT DESIGN WITH RADIX-2^4 SDF ARCHITECTURE

Chih-Peng Fan*, Mau-Shih Lee, and Guo-An Su

Department of Electrical Engineering,
National Chung Hsing University
Tai-Chung 402, Taiwan, R.O.C.

Key Words : Fast Fourier Transform (FFT), Single-Path Delay Feedback (SDF) Structure, Digital Signal Processing

Abstract

In this paper, we propose an efficient, low-cost 256-point fast Fourier transform (FFT) architecture and its implementation, targeted at WiMAX OFDM systems. Based on the radix-16 FFT algorithm, the proposed 256-point FFT processor uses a cascade of simplified radix-2^4 single-path delay feedback (SDF) structures. The control circuit of the proposed simplified radix-2^4 SDF architecture is simpler than that of the direct radix-16 SDF structure, and its multiplier cost is lower than that of previous FFT structures in 256-point FFT applications. The throughput of the proposed FFT processor is one sample per clock cycle. In hardware verification, the design processes up to 35.5M samples/sec on a Xilinx Virtex-II 1500 FPGA and up to 51.5M samples/sec in UMC 0.18 µm standard-cell technology. This throughput is suitable for the WiMAX 802.16a application, whose maximum sample rate is 32 MHz.

*Corresponding author ,
Email: cpfan@dragon.nchu.edu.tw

I. Introduction
Orthogonal frequency division multiplexing (OFDM) has become increasingly important in modern communication systems. The key concept of OFDM is multi-carrier modulation. For example, WiMAX [1], DVB-T [2], Wireless LAN [3], ADSL [4], and VDSL [5] all use OFDM for baseband modulation. OFDM is advantageous because it uses the channel efficiently, mitigates multi-path fading, and allows simpler equalizers. In WiMAX systems, OFDM increases the efficiency of spectrum utilization and supports transmission in non-line-of-sight (NLOS) environments. OFDM systems need FFT and IFFT processors that perform the orthogonal multi-carrier modulation and demodulation in real time. Fast algorithms reduce the computational complexity of the discrete Fourier transform (DFT) from O(N^2) to O(N log N) [6], and in recent years many researchers [7-22] have worked to reduce the computational complexity of FFT algorithms further. In OFDM communications, the required FFT length follows from the number of subcarriers used by the system. For instance, the 802.16a, 802.16Rev and 802.16e WiMAX [1] systems support a 256-point transform length. Thus, an efficient 256-point FFT processor is required for WiMAX OFDM systems.
Pipelined FFT/IFFT processor architectures have been studied since the 1970s, and there are many ways to implement the FFT in hardware. Hardware implementations of the pipelined FFT can be categorized into three kinds of architectures: multi-path delay commutator (MDC), single-path delay commutator (SDC), and single-path delay feedback (SDF). Among the three, the SDF architecture is the most suitable here, for the following reasons: (1) it is easily adapted to different FFT lengths; (2) it requires fewer registers than the MDC and SDC structures; (3) its controller is simpler than those of the other structures.
In our 256-point FFT design, the radix-16 FFT algorithm is used as the underlying fast algorithm. The radix-16 FFT algorithm can be implemented directly with the radix-16 SDF architecture, but the control unit of the direct radix-16 SDF architecture is very complex. The radix-2^4 SDF architecture avoids this complexity without increasing chip area or power consumption, so it is preferable to the direct radix-16 SDF architecture. Thus, we choose the radix-2^4 based SDF architecture to design the 256-point FFT processor.
In this paper, we propose a novel low-cost 256-point FFT SDF design for OFDM systems. The rest of the paper is organized as follows. Section 2 begins with the derivation of the radix-16 FFT algorithm. Starting from the structure of the radix-8 FFT algorithm [7], the radix-16 FFT algorithm is described in a form that guides the pipelined hardware architectures. The proposed simplified radix-2^4 SDF pipelined FFT architecture, which realizes the radix-16 FFT, is also presented in that section. Section 3 provides a low multiplier cost 256-point FFT architecture built from the simplified radix-2^4 SDF structure and discusses the detailed functional realization of that structure. Section 4 shows that the proposed simplified radix-2^4 based FFT processor reduces both the hardware complexity and the control circuit complexity. It reports the results of the VLSI realizations and compares computational complexity and hardware complexity with previous designs. Finally, Section 5 gives a conclusion.

II. Radix-16 FFT Algorithm and the
Simplified Pipelined SDF Architecture
2.1 Radix-16 FFT Algorithm
The butterfly structure for computing the radix-8 FFT algorithm was proposed in [7]. Fig. 1 shows the butterfly structure of the radix-8 FFT algorithm, where $W_N^{nk} = e^{-i 2\pi nk/N}$. We replace the index k with the index 2k and with the index 2k+1 in the radix-8 FFT structure. The resulting even and odd parts form the radix-16 FFT algorithm described in Eq.(1). The radix-16 FFT algorithm in Eq.(1) can be summarized by the butterfly structure shown in Fig. 2. First, the radix-16 FFT algorithm can be realized with the direct radix-16 SDF architecture, which is shown in Fig. 3. The direct radix-16 SDF structure consists of an N/16 memory and the radix-16 butterfly processing element (PE). Fig. 4 shows the detailed function block of the radix-16 PE. Although the radix-16 FFT algorithm has lower computational complexity, the control circuit of the direct radix-16 SDF architecture that implements it is very complex. Thus, the efficient simplified radix-2^4 SDF structure, which is described in Section 2.2, will be applied to the radix-16 FFT algorithm.

Fig. 1 Butterfly structure of the radix-8 FFT algorithm [7]

$$
\begin{aligned}
X[16k+a+2b+4c+8d] = \sum_{n=0}^{N/16-1} \Big\{ &\big\{ [x(n) + (-1)^{a} x(n+N/2)] + W_N^{(a+2b)N/4} [x(n+N/4) + (-1)^{a} x(n+3N/4)] \big\} \\
&+ W_N^{(a+2b+4c)N/8} \big\{ [x(n+N/8) + (-1)^{a} x(n+5N/8)] + W_N^{(a+2b)N/4} [x(n+3N/8) + (-1)^{a} x(n+7N/8)] \big\} \\
&+ W_N^{(a+2b+4c+8d)N/16} \Big[ \big\{ [x(n+N/16) + (-1)^{a} x(n+9N/16)] + W_N^{(a+2b)N/4} [x(n+5N/16) + (-1)^{a} x(n+13N/16)] \big\} \\
&\quad + W_N^{(a+2b+4c)N/8} \big\{ [x(n+3N/16) + (-1)^{a} x(n+11N/16)] + W_N^{(a+2b)N/4} [x(n+7N/16) + (-1)^{a} x(n+15N/16)] \big\} \Big] \Big\} \, W_N^{(a+2b+4c+8d)n} \, W_{N/16}^{nk},
\end{aligned}
\tag{1}
$$
where a, b, c, and d are each 0 or 1, and $k = 0, 1, \ldots, N/16 - 1$.

Fig. 2 Butterfly structure of the radix-16 FFT algorithm

Fig. 3 Direct radix-16 SDF architecture

Fig. 4. Block diagram of the radix-16 PE

2.2 Simplified Radix-2^4 SDF Pipelined Architecture
In order to simplify the radix-16 control unit, we derive the radix-2^4 algorithm so that the hardware can be implemented by cascading four simple radix-2 processing elements. The resulting radix-2^4 architecture is simple and has high spatial regularity. We use the common factor algorithm (CFA) decomposition method to develop the proposed FFT algorithm, so the frequency-domain index k and the time-domain index n are factorized as follows.
$$
n = \left\langle \frac{N}{2} n_1 + \frac{N}{4} n_2 + \frac{N}{8} n_3 + \frac{N}{16} n_4 + n_5 \right\rangle_N, \tag{2}
$$
where N = 256, $n_i = 0, 1$ for i = 1, 2, 3 and 4, and $n_5 = 0, 1, 2, \ldots, N/16 - 1$. And
$$
k = \left\langle k_1 + 2k_2 + 4k_3 + 8k_4 + 16k_5 \right\rangle_N, \tag{3}
$$
where $k_i = 0, 1$ for i = 1, 2, 3 and 4, and $k_5 = 0, 1, 2, \ldots, N/16 - 1$.
In Eq.(2) and Eq.(3), the indices n and k are mapped linearly from one dimension to five dimensions. By using Eq.(2) and Eq.(3), the DFT formula is rewritten in the multi-dimensional form shown in Eq.(4).

$$
X[k_1 + 2k_2 + 4k_3 + 8k_4 + 16k_5] = \sum_{n_5=0}^{N/16-1} \sum_{n_4=0}^{1} \sum_{n_3=0}^{1} \sum_{n_2=0}^{1} \sum_{n_1=0}^{1} x\!\left(\frac{N}{2} n_1 + \frac{N}{4} n_2 + \frac{N}{8} n_3 + \frac{N}{16} n_4 + n_5\right) W_N^{\left(\frac{N}{2} n_1 + \frac{N}{4} n_2 + \frac{N}{8} n_3 + \frac{N}{16} n_4 + n_5\right)\left(k_1 + 2k_2 + 4k_3 + 8k_4 + 16k_5\right)}. \tag{4}
$$
Subsequently, we use the periodicity of the twiddle factor in Eq.(4) and decompose the twiddle factor as follows.
$$
W_N^{\left(\frac{N}{2} n_1 + \frac{N}{4} n_2 + \frac{N}{8} n_3 + \frac{N}{16} n_4 + n_5\right)\left(k_1 + 2k_2 + 4k_3 + 8k_4 + 16k_5\right)}
= W_N^{\frac{N}{2} n_1 k_1} \, W_N^{\frac{N}{4} n_2 (k_1 + 2k_2)} \, W_N^{\frac{N}{8} n_3 (k_1 + 2k_2 + 4k_3)} \, W_N^{\frac{N}{16} n_4 (k_1 + 2k_2 + 4k_3 + 8k_4)} \, W_N^{n_5 (k_1 + 2k_2 + 4k_3 + 8k_4)} \, W_{N/16}^{n_5 k_5}. \tag{5}
$$
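The index maps of Eq.(2) and Eq.(3) and the twiddle-factor decomposition of Eq.(5) can be checked numerically. The following Python sketch does so for N = 256. The helper names (`W`, `n_index`, `k_index`, `twiddle_direct`, `twiddle_factored`, `DIGITS`) are illustrative assumptions and are not taken from the paper.

```python
import cmath
from itertools import product

N = 256  # transform length considered in the paper


def W(M, e):
    """Twiddle factor W_M^e = exp(-i*2*pi*e/M)."""
    return cmath.exp(-2j * cmath.pi * e / M)


def n_index(n1, n2, n3, n4, n5):
    """Time-domain index of Eq.(2)."""
    return (N // 2 * n1 + N // 4 * n2 + N // 8 * n3 + N // 16 * n4 + n5) % N


def k_index(k1, k2, k3, k4, k5):
    """Frequency-domain index of Eq.(3)."""
    return (k1 + 2 * k2 + 4 * k3 + 8 * k4 + 16 * k5) % N


def twiddle_direct(nd, kd):
    """Left-hand side of Eq.(5): W_N^(n*k)."""
    return W(N, n_index(*nd) * k_index(*kd))


def twiddle_factored(nd, kd):
    """Right-hand side of Eq.(5): product of the six partial twiddle factors."""
    n1, n2, n3, n4, n5 = nd
    k1, k2, k3, k4, k5 = kd
    q2 = k1 + 2 * k2          # k1 + 2*k2
    q3 = q2 + 4 * k3          # k1 + 2*k2 + 4*k3
    q4 = q3 + 8 * k4          # k1 + 2*k2 + 4*k3 + 8*k4
    return (W(N, N // 2 * n1 * k1)
            * W(N, N // 4 * n2 * q2)
            * W(N, N // 8 * n3 * q3)
            * W(N, N // 16 * n4 * q4)
            * W(N, n5 * q4)
            * W(N // 16, n5 * k5))


# All 256 digit tuples: four binary digits plus one digit in 0..N/16-1.
DIGITS = list(product(range(2), range(2), range(2), range(2), range(N // 16)))
```

Comparing `twiddle_direct` and `twiddle_factored` over the tuples in `DIGITS` should give agreement up to floating-point rounding, and both index maps should visit every value from 0 to N-1 exactly once.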
We expand the dimensional variables $n_1$, $n_2$, $n_3$, and $n_4$ in Eq.(4) and Eq.(5), and then the butterfly expressions, which range from the first stage to the fourth stage, are denoted as $A_{N/2}$, $B_{N/4}$, $C_{N/8}$, and $D_{N/16}$, respectively.

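$$
A_{N/2}\!\left(\frac{N}{4} n_2 + \frac{N}{8} n_3 + \frac{N}{16} n_4 + n_5,\; k_1\right) = x\!\left(\frac{N}{4} n_2 + \frac{N}{8} n_3 + \frac{N}{16} n_4 + n_5\right) + (-1)^{k_1}\, x\!\left(\frac{N}{4} n_2 + \frac{N}{8} n_3 + \frac{N}{16} n_4 + n_5 + \frac{N}{2}\right), \tag{6}
$$
$$
B_{N/4}\!\left(\frac{N}{8} n_3 + \frac{N}{16} n_4 + n_5,\; k_1, k_2\right) = A_{N/2}\!\left(\frac{N}{8} n_3 + \frac{N}{16} n_4 + n_5,\; k_1\right) + W_N^{\frac{N}{4}(k_1 + 2k_2)}\, A_{N/2}\!\left(\frac{N}{8} n_3 + \frac{N}{16} n_4 + n_5 + \frac{N}{4},\; k_1\right), \tag{7}
$$
$$
C_{N/8}\!\left(\frac{N}{16} n_4 + n_5,\; k_1, k_2, k_3\right) = B_{N/4}\!\left(\frac{N}{16} n_4 + n_5,\; k_1, k_2\right) + W_N^{\frac{N}{8}(k_1 + 2k_2 + 4k_3)}\, B_{N/4}\!\left(\frac{N}{16} n_4 + n_5 + \frac{N}{8},\; k_1, k_2\right), \tag{8}
$$
and
$$
D_{N/16}(n_5, k_1, k_2, k_3, k_4) = C_{N/8}(n_5, k_1, k_2, k_3) + W_N^{\frac{N}{16}(k_1 + 2k_2 + 4k_3 + 8k_4)}\, C_{N/8}\!\left(n_5 + \frac{N}{16},\; k_1, k_2, k_3\right). \tag{9}
$$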
In Eq.(6), Eq.(7), Eq.(8) and Eq.(9), each of the four butterflies can be implemented by a single radix-2 SDF architecture. In Eq.(6) and Eq.(7), the multiplications by the coefficients $(-1)^{k_1}$ and $W_N^{\frac{N}{4}(k_1 + 2k_2)}$ are trivial, because these coefficients only take the values 1, -1, -i, and i. We only need to exchange the real and imaginary parts of the operand and negate the appropriate part of the result. In Eq.(8) and Eq.(9), the multiplications by the coefficients $W_N^{\frac{N}{8}(k_1 + 2k_2 + 4k_3)}$ and $W_N^{\frac{N}{16}(k_1 + 2k_2 + 4k_3 + 8k_4)}$ are constant multiplications, which are implemented with canonical signed digit (CSD) circuits, so the number of general multipliers is greatly reduced. Moreover, the hardware area and power consumption of the proposed FFT architecture are decreased. After the CFA decompositions, the required radix-2^4 FFT algorithm is obtained as follows.

$$
X[k_1 + 2k_2 + 4k_3 + 8k_4 + 16k_5] = \sum_{n_5=0}^{N/16-1} D_{N/16}(n_5, k_1, k_2, k_3, k_4)\, W_N^{n_5 (k_1 + 2k_2 + 4k_3 + 8k_4)}\, W_{N/16}^{n_5 k_5}. \tag{10}
$$
In Eq.(10), the multiplication by the coefficient $W_N^{n_5 (k_1 + 2k_2 + 4k_3 + 8k_4)}$ is non-trivial and needs a general multiplier. The proposed simplified radix-2^4 SDF (SR2^4SDF) architecture is derived from the radix-2 based SDF architecture and corresponds to the 4-stage radix-16 butterfly structure: the four computing stages of the proposed SR2^4SDF structure correspond to the four stages of the radix-16 FFT butterfly flow shown in Fig. 2. The architecture of the proposed SR2^4SDF is shown in Fig. 5. For the realization of an N-point FFT, the SR2^4SDF architecture needs $\log_{16} N - 1$ multipliers, $4\log_4 N$ adders, and $N - 1$ registers. The control circuit of the SR2^4SDF is simpler than that of the direct radix-16 SDF architecture. The hardware requirements of different pipelined SDF FFT architectures are listed in Table 1.
The SR2^4SDF architecture is constructed from four radix-2 based processing elements and the coefficient multipliers. The first twiddle factor, -i, exchanges the real and imaginary parts of the data and changes a sign. The second twiddle factor unit, w1, performs three fixed-coefficient multiplications. The third twiddle factor unit, w2, performs seven fixed-coefficient multiplications. To reduce chip area and power consumption, the computations of w1 and w2 can be implemented with canonical signed digit (CSD) circuits, where the coefficients of w1 and w2 are those appearing in Eq.(8) and Eq.(9). Finally, the last stage computation is realized with a complex multiplier, whose multiplicative coefficient $W_N^{n_5 (k_1 + 2k_2 + 4k_3 + 8k_4)}$ is given in Eq.(10). In Section 3, we will describe the detailed hardware implementation of the proposed simplified R2^4SDF architecture.
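The four butterfly stages of Eqs.(6)-(9) and the final summation of Eq.(10) can be sketched in software to confirm that together they reproduce the DFT. The Python code below is a minimal behavioral sketch under stated assumptions, not the hardware design: the function names (`W`, `radix2_4_dft`, `direct_dft`) are hypothetical, and the remaining N/16-point sum of Eq.(10) is evaluated directly rather than by a second SR2^4SDF stage.

```python
import cmath


def W(M, e):
    """Twiddle factor W_M^e = exp(-i*2*pi*e/M)."""
    return cmath.exp(-2j * cmath.pi * e / M)


def radix2_4_dft(x):
    """DFT computed through the butterflies of Eqs.(6)-(9) and the sum of Eq.(10)."""
    N = len(x)
    assert N % 16 == 0, "length must be a multiple of 16"
    X = [0j] * N
    for q in range(16):              # q = k1 + 2*k2 + 4*k3 + 8*k4
        k1 = q & 1
        # Eq.(6): first radix-2 butterfly, sign (-1)^k1
        A = [x[m] + (-1) ** k1 * x[m + N // 2] for m in range(N // 2)]
        # Eq.(7): trivial twiddle W_N^{(N/4)(k1+2k2)}, one of 1, -i, -1, i
        tb = W(N, (N // 4) * (q & 3))
        B = [A[m] + tb * A[m + N // 4] for m in range(N // 4)]
        # Eq.(8): constant twiddle W_N^{(N/8)(k1+2k2+4k3)}
        tc = W(N, (N // 8) * (q & 7))
        C = [B[m] + tc * B[m + N // 8] for m in range(N // 8)]
        # Eq.(9): constant twiddle W_N^{(N/16)(k1+2k2+4k3+8k4)}
        td = W(N, (N // 16) * q)
        D = [C[m] + td * C[m + N // 16] for m in range(N // 16)]
        # Eq.(10): general twiddle W_N^{n5*q}, then an N/16-point DFT over n5
        for k5 in range(N // 16):
            X[q + 16 * k5] = sum(
                D[n5] * W(N, n5 * q) * W(N // 16, n5 * k5)
                for n5 in range(N // 16)
            )
    return X


def direct_dft(x):
    """Reference O(N^2) DFT used only for comparison."""
    N = len(x)
    table = [W(N, e) for e in range(N)]
    return [sum(x[n] * table[(n * k) % N] for n in range(N)) for k in range(N)]
```

For a 256-sample input, `radix2_4_dft` and `direct_dft` should agree up to floating-point rounding. Only the factor `W(N, n5 * q)` depends on the sample position, which is why a single general multiplier per radix-2^4 stage suffices.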

Fig. 5 SR2^4SDF architecture for FFT
Table 1. Comparison of Hardware Requirements for N-length FFT with Different SDF Architectures

| Architecture | Complex Multiplier | Complex Adder in Radix-2 PEs | Register | Control Circuit |
|---|---|---|---|---|
| R2SDF [11] | $2(\log_4 N - 1)$ | $4\log_4 N$ | $N - 1$ | simple |
| R4SDF [12] | $\log_4 N - 1$ | $8\log_4 N$ | $N - 1$ | medium |
| R2^2SDF [8] | $\log_4 N - 1$ | $4\log_4 N$ | $N - 1$ | simple |
| R2^3SDF [9] | $\log_8 N - 1$ | $4\log_4 N$ | $N - 1$ | simple |
| R2/4/8 SDF [13] | $\log_4 N - 1$ | $4\log_4 N$ | $N - 1$ | simple |
| Proposed SR2^4SDF | $\log_{16} N - 1$ | $4\log_4 N$ | $N - 1$ | simple |
| R16SDF | $\log_{16} N - 1$ | $4\log_4 N$ | $N - 1$ | complex |
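As a worked example of the Table 1 formulas, the following Python sketch evaluates the resource counts of the proposed SR2^4SDF row and the multiplier count of the R2SDF row. The function names (`ilog`, `sr2_4_sdf_resources`, `r2_sdf_multipliers`) are illustrative assumptions.

```python
def ilog(n, base):
    """Exact integer logarithm; raises ValueError if n is not a power of base."""
    exponent, value = 0, 1
    while value < n:
        value *= base
        exponent += 1
    if value != n:
        raise ValueError(f"{n} is not a power of {base}")
    return exponent


def sr2_4_sdf_resources(n_points):
    """Resource counts from the 'Proposed SR2^4SDF' row of Table 1."""
    return {
        "complex_multipliers": ilog(n_points, 16) - 1,   # log16(N) - 1
        "complex_adders": 4 * ilog(n_points, 4),         # 4*log4(N)
        "registers": n_points - 1,                       # N - 1
    }


def r2_sdf_multipliers(n_points):
    """Multiplier count from the 'R2SDF' row of Table 1: 2*(log4(N) - 1)."""
    return 2 * (ilog(n_points, 4) - 1)
```

For N = 256 this gives 1 complex multiplier, 16 complex adders, and 255 registers for the proposed architecture, against 6 complex multipliers for R2SDF.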

III. Proposed Efficient 256-point FFT
Architecture
The control circuit of the proposed radix-2^4 FFT SDF architecture is simpler than that of the corresponding radix-16 SDF architecture. The radix-2^4 FFT SDF architecture reduces the complexity of the control circuits while keeping the same number of multipliers, adders, and subtracters and the same memory utilization. Fig. 6 shows the proposed FFT processor applied to the 256-point FFT computation.

The proposed efficient architecture needs 1 complex multiplier and 16 complex adders. According to the simplified radix-2^4 SDF architecture in Section 2.2, the SR2^4SDF FFT architecture can be separated into 4-stage recursive computations that follow the radix-16 FFT butterfly flow. Although the SR2^4SDF control circuit is simpler than that of the radix-16 SDF architecture, the output throughput rate of the SR2^4SDF architecture is still equal to that of the more complex radix-16 SDF architecture.
[Figure: serial input; eight cascaded radix-2 PEs with feedback memories R of 128, 64, 32, 16, 8, 4, 2, and 1 words (R: register file or SRAM); -i, W1, and W2 units in each radix-2^4 stage; ROM-fed multiplier between the two stages; serial output]

Fig. 6 The 256-point FFT processor with two-cascaded simplified radix-2^4 SDF architecture

In Fig. 6, the first twiddle factor, -i, exchanges the real and imaginary parts of the data with a sign change, as shown in Fig. 7. The output real part is obtained by direct connection to the input imaginary part. The output imaginary part is obtained by negating the input real part. The simplified radix-2^4 SDF architecture needs many fixed-coefficient multiplications. If we realized them with general multipliers, we would pay extra costs in chip area and power consumption. Therefore, CSD circuits, which are constructed from only shifters and adders, are used to implement the fixed twiddle factor coefficients in the radix-2^4 SDF architecture. The fixed twiddle factor coefficients of the simplified radix-2^4 SDF architecture include $W_{16}^{2}$, $W_{16}^{6}$, $W_{16}^{1}$, $W_{16}^{5}$, $W_{16}^{3}$ and $W_{16}^{7}$. Based on fixed word-length simulations for suitable accuracy, we assign 8 fractional bits to the CSD implementations of all fixed twiddle factors.
The CSD realization of the multiplications by the fixed twiddle factors $W_{16}^{2}$ and $W_{16}^{6}$ is given in Eq.(11).
$$
Y = \{[(x \gg 1) + (x \gg 3)] + [(x \gg 4) + (x \gg 6)]\} + (x \gg 8), \tag{11}
$$
where x is the input and Y is the output of the CSD circuit. The symbol $\gg$ denotes the right shift operation, and the integer after it is the number of bits shifted. Four complex adders are required for the realization of Eq.(11). The CSD realization of the multiplications by the fixed twiddle factors $W_{16}^{1}$ and $W_{16}^{7}$ is given in Eq.(12) and Eq.(13).
$$
Y_{real} = \{[(x \gg 1) + (x \gg 2)] + [(x \gg 3) + (x \gg 5)]\} + (x \gg 6), \tag{12}
$$
$$
Y_{imag} = [(x \gg 2) + (x \gg 3)] + (x \gg 8), \tag{13}
$$
where four complex adders are required for the realization of Eq.(12), and two complex adders are required for the realization of Eq.(13). The CSD realization of the multiplications by $W_{16}^{5}$ and $W_{16}^{3}$ is given in Eq.(14) and Eq.(15).
$$
Y_{real} = [(x \gg 2) + (x \gg 3)] + (x \gg 8), \tag{14}
$$
$$
Y_{imag} = \{[(x \gg 1) + (x \gg 2)] + [(x \gg 3) + (x \gg 5)]\} + (x \gg 6), \tag{15}
$$
where two complex adders are required for the realization of Eq.(14), and four complex adders are required for the realization of Eq.(15). In Fig. 6, the w1 circuit performs three fixed-coefficient twiddle factor multiplications, and the w2 circuit performs seven fixed-coefficient twiddle factor multiplications. Fig. 7 and Fig. 8 show the functional structures of the w1 and w2 circuits, respectively. The three twiddle factors of the w1 circuit need 8 complex adders and the seven twiddle factors of the w2 circuit need 32 complex adders, so the total number of complex adders in the CSD circuits is 40. The coefficients in the w1 and w2 circuits are hard-wired with the CSD schemes. Thus, the area of the hard-wired realization is smaller than that of a ROM-based realization.
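The shift-and-add structures of Eqs.(11)-(15) and the trivial multiplication by -i can be illustrated with a short Python sketch. The shift amounts are taken from the equations above. The names (`SHIFTS_EQ11`, `SHIFTS_EQ12`, `SHIFTS_EQ13`, `coefficient`, `shift_add`, `times_minus_i`) are illustrative assumptions, and the comparison against cos(pi/4), cos(pi/8), and sin(pi/8) reflects our reading of which constants the shift patterns approximate.

```python
import math

# Right-shift amounts read from Eq.(11), Eq.(12)/(15), and Eq.(13)/(14).
SHIFTS_EQ11 = (1, 3, 4, 6, 8)   # magnitude used for W16^2 and W16^6
SHIFTS_EQ12 = (1, 2, 3, 5, 6)   # larger component for W16^1, W16^7, W16^3, W16^5
SHIFTS_EQ13 = (2, 3, 8)         # smaller component for the same factors


def coefficient(shifts):
    """Real constant realized by summing x >> s over the given shifts."""
    return sum(2.0 ** -s for s in shifts)


def shift_add(x, shifts):
    """Multiply the integer x by the constant using only shifts and adds."""
    return sum(x >> s for s in shifts)


def adders_needed(shifts):
    """A sum of k shifted terms needs k - 1 adders."""
    return len(shifts) - 1


def times_minus_i(re, im):
    """Multiply (re + i*im) by -i: swap the parts and negate the new imaginary part."""
    return im, -re
```

With 8 fractional bits, each constant differs from the corresponding cosine or sine value by less than 2^-8, and the adder counts of 4, 4, and 2 match those stated for Eqs.(11)-(13).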
[Figure: input x; -i, $W_{16}^{2}$, and $W_{16}^{6}$ units with enable signals; MUX with Select; outputs $Y_{real}$ and $Y_{imag}$]

Fig. 7 Three twiddle factors for w1 circuit

[Figure: input x; -i, $W_{16}^{7}$, $W_{16}^{2}$, $W_{16}^{6}$, $W_{16}^{1}$, $W_{16}^{5}$, and $W_{16}^{3}$ units with enable signals; MUX with Select; outputs $Y_{real}$ and $Y_{imag}$]

Fig. 8 Seven twiddle factors for w2 circuit
[Figure: radix-2 PE with two-point butterfly (inputs a and b, outputs a+b and a-b), SRAM, two registers, and two multiplexers with select signals; Input and Output ports]

Fig. 9 Block diagram of the radix-2 PE stage
Fig. 9 shows the block diagram of the radix-2 PE cell for the simplified radix-2^4 architecture. Each radix-2 PE stage includes the two-point butterfly, an SRAM of size P-1, two multiplexers and two D-type registers, where P is 128, 64, 32, 16, 8, 4, or 2 for the corresponding stage. Each radix-2 PE performs a two-point add/subtract butterfly operation, which needs two complex adders. In the first P/2 clock cycles, the data stored in the SRAM are output sequentially to the next PE stage while the data from the previous PE stage are written sequentially into the SRAM. During the last P/2 clock cycles, the data read from the SRAM are combined with the input data from the previous PE stage; one butterfly result is output to the next PE stage and the other is stored in the SRAM. To capture the correct output value at each PE stage, a D-type register latches the input data of each PE. The input latch gives each PE stage enough operation time for the FFT computation. The proposed 256-point FFT architecture, shown in Fig. 6, needs only one general complex multiplier. We implement the general complex multiplier with a ROM, two multiplexers, two transmission gates and a complex multiplication core.
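The two-phase operation of a radix-2 SDF stage described above can be modeled behaviorally in a few lines of Python. This is a simplified sketch: the function name `sdf_radix2_stage` and the parameter `depth` (the number of samples held in the feedback memory) are assumptions for illustration, and the input latches, twiddle units, and exact SRAM sizing of the actual PE are not modeled.

```python
from collections import deque


def sdf_radix2_stage(samples, depth):
    """Behavioral model of one radix-2 single-path delay feedback stage.

    Fill phase (depth cycles): the memory content is passed to the output
    while the incoming sample is stored.
    Butterfly phase (depth cycles): the stored sample a and the incoming
    sample b are combined; a + b goes to the output and a - b is written
    back to the memory, to be emitted during the next fill phase.
    """
    fifo = deque([0] * depth)
    out = []
    for t, b in enumerate(samples):
        a = fifo.popleft()
        if (t // depth) % 2 == 0:      # fill phase
            out.append(a)
            fifo.append(b)
        else:                          # butterfly phase
            out.append(a + b)
            fifo.append(a - b)
    return out
```

For example, with `depth = 2` and the block `[1, 2, 3, 4]` followed by two zeros to flush the memory, the stage first emits two initial zeros, then the sums 4 and 6, then the differences -2 and -2. The output therefore lags the input by `depth` cycles while still accepting one sample per clock.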

[Figure: 64QAM Mapper, Ideal IFFT, AWGN, My Design FFT, SNR]

Fig. 10 Simulation environment for decision of the suitable fixed word-length

[Plot: x-axis Quantization Bit Length (0 to 12); y-axis Output SNR (dB) (0 to 40); curves for SNR = 10 dB, 20 dB, 30 dB, and 40 dB]

Fig. 11 SNR simulations for twiddle factors with different decimal word-length
Based on the fixed word-length simulation in Fig. 10, we assign 9 fractional bits to the twiddle factors stored in ROM. In Fig. 5, the last stage of the single radix-2^4 SDF architecture requires a multiplier to perform the multiplications by $W_N^{n_5 (k_1 + 2k_2 + 4k_3 + 8k_4)}$. A larger data width in the hardware implementation reduces the difference between the FFT outputs computed in hardware and those computed in software. In practice, the data width is limited by finite hardware resources. Therefore, we estimate the optimal bit width of the multiplier coefficients under different AWGN levels, where the AWGN is added to the input sequences of the FFT algorithm. From the MATLAB simulation, we find that the output SNR converges when we assign 9 binary bits to the twiddle factors $W_N^{n_5 (k_1 + 2k_2 + 4k_3 + 8k_4)}$ of this general complex multiplier. In our chip implementation, the multiplier module is generated by the Synopsys DesignWare tool.
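The effect of the twiddle-factor word length on output accuracy can be explored with a simplified Python sketch. It is not the simulation environment of Fig. 10: the 64QAM mapper, the ideal IFFT, and the AWGN channel are omitted, a 64-point direct DFT with a random complex input stands in for the FFT design, and the names `quantize`, `dft`, and `output_snr_db` are hypothetical.

```python
import cmath
import math
import random


def quantize(value, frac_bits):
    """Round value to the nearest multiple of 2^-frac_bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def dft(x, frac_bits=None):
    """Direct DFT; twiddle factors are quantized when frac_bits is given."""
    n_points = len(x)
    table = []
    for e in range(n_points):
        w = cmath.exp(-2j * cmath.pi * e / n_points)
        if frac_bits is not None:
            w = complex(quantize(w.real, frac_bits), quantize(w.imag, frac_bits))
        table.append(w)
    return [
        sum(x[n] * table[(n * k) % n_points] for n in range(n_points))
        for k in range(n_points)
    ]


def output_snr_db(frac_bits, n_points=64, seed=0):
    """Ratio of ideal output power to twiddle-quantization error power, in dB."""
    rng = random.Random(seed)
    x = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_points)]
    ideal = dft(x)
    approx = dft(x, frac_bits)
    signal = sum(abs(v) ** 2 for v in ideal)
    error = sum(abs(a - b) ** 2 for a, b in zip(ideal, approx))
    return 10 * math.log10(signal / error)
```

Calling `output_snr_db` for increasing values of `frac_bits` shows the error introduced by coefficient quantization shrinking as bits are added. In a setup that also includes channel noise, the output SNR stops improving once this error falls below the noise floor, which is the convergence behavior used above to select the word length.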

IV. Complexity Comparisons and VLSI
Implementation
Table 2 compares the computational complexity of different architectures for the computation of the 256-point FFT. For a fair comparison, each complex multiplication is counted as three real multiplications and three real additions. Our proposed architecture needs 1701 real multiplications and 5797 real additions for the 256-point FFT computation. To derive closed-form expressions for the computational complexity, we define the terms multiplier stage and non-multiplier stage. In Fig. 5, if a single radix-2^4 SDF architecture needs all of the functional units w1, w2, and the last general multiplier, we call it a multiplier stage. If the last general multiplier is not needed, we call it a non-multiplier stage. For instance, the first radix-2^4 SDF stage in Fig. 6 is a multiplier stage and the second radix-2^4 SDF stage in Fig. 6 is a non-multiplier stage. Thus, the number of multiplier stages is $\log_{16} N - 1$ and the number of real multiplications at each multiplier stage is $33N/8 - 27$. The number of non-multiplier stages is 1 for all power-of-16 SDF architectures, and the number of real multiplications at the non-multiplier stage is $3N/2$. In addition, each radix-2^4 SDF stage contributes $9N/16$ real multiplications, which come from the constant multiplications. The closed-form expressions for the numbers of real multiplications and additions are as follows.

$$
M_r(N) = \left(\frac{75N}{16} - 27\right) \log_{16} N - \frac{21N}{8} + 27, \tag{16}
$$
$$
A_r(N) = 2N \log_2 N + M_r(N), \tag{17}
$$

where N is a power of 16. In the proposed FFT algorithm, the transform length N is 256 in Eq.(16) and Eq.(17). By using Eq.(16) and Eq.(17), the computational complexity of the proposed 256-point FFT algorithm can be compared with that of the previous methods in Table 2. The computational complexity of the proposed 256-point design is smaller than that of the radix-2 method in [17], but it is larger than that of the methods in [14], [15], and [17] that use the radix-4 or the split-radix fast algorithms.
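Eq.(16) and Eq.(17) can be evaluated directly to confirm the figures quoted for N = 256. The Python sketch below also rebuilds the multiplication count from the per-stage terms given in the text. The function names (`ilog`, `real_multiplications`, `real_multiplications_by_stage`, `real_additions`) are illustrative assumptions.

```python
def ilog(n, base):
    """Exact integer logarithm; raises ValueError if n is not a power of base."""
    exponent, value = 0, 1
    while value < n:
        value *= base
        exponent += 1
    if value != n:
        raise ValueError(f"{n} is not a power of {base}")
    return exponent


def real_multiplications(n):
    """Eq.(16): M_r(N) = (75N/16 - 27)*log16(N) - 21N/8 + 27, N a power of 16."""
    stages = ilog(n, 16)
    return (75 * n // 16 - 27) * stages - 21 * n // 8 + 27


def real_multiplications_by_stage(n):
    """Same count assembled from the per-stage terms stated in the text."""
    stages = ilog(n, 16)
    multiplier_stages = (stages - 1) * (33 * n // 8 - 27)   # log16(N) - 1 stages
    non_multiplier_stage = 3 * n // 2                        # exactly one stage
    constant_multiplications = stages * (9 * n // 16)        # every stage
    return multiplier_stages + non_multiplier_stage + constant_multiplications


def real_additions(n):
    """Eq.(17): A_r(N) = 2N*log2(N) + M_r(N)."""
    return 2 * n * ilog(n, 2) + real_multiplications(n)
```

For N = 256, the per-stage terms are 1029, 384, and 288 real multiplications, which sum to the 1701 given by Eq.(16); Eq.(17) then gives 4096 + 1701 = 5797 real additions, matching the "Proposed" row of Table 2.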

Table 2. Comparison of Computational Complexity with Different Architectures for Computation of 256-point FFT

| Architectures | Numbers of Real Multiplications | Numbers of Real Additions |
|---|---|---|
| Radix-2 [17] | 1737 | 5959 |
| Radix-4 [17] | 1350 | 5530 |
| Takahashi [17] | 1245 | 5555 |
| Bouguezel et al. [15] | 1284 | 5380 |
| Bouguezel [14] | 1284 | 5380 |
| Proposed | 1701 | 5797 |

Table 3. Comparison of Hardware Requirement for 256-point FFT with Different Architectures

| Architectures | Complex Multipliers | Complex Adders in radix-2 PEs | Complex Adders in CSD circuits | Register | Control Circuits |
|---|---|---|---|---|---|
| Oh and Lim [10] | 2.1 | 16 | --- | 255 | Simple |
| Oh and Lim [22] | 2.8 | 16 | --- | 255 | Simple |
| He et al. [8] | 3 | 16 | --- | 255 | Simple |
| Yeh and Jen [16] | 3 | 16 | --- | 255 | Simple |
| Son et al. [18] | 3 | 8 | --- | 256 | --- |
| Proposed | 1 | 16 | 40 | 255 | Simple |

Table 4. Comparison of Hardware Requirement for 128-point FFT with the radix-2^4 fast algorithm

| Architectures | Programmable Complex Multipliers | CSD Complex Constant Multipliers | Complex Adders in PEs | Register | Data format |
|---|---|---|---|---|---|
| Lee et al. [20] | 2 | 3 | 28 | 190 | Two data-path |
| Oh and Lim [22] | 1 | 2 | 14 | 127 | Serial |
| Proposed | 1 | 3 | 14 | 127 | Serial |

Table 3 compares the hardware requirements of different FFT architectures for the 256-point FFT. In Table 3, the proposed 256-point FFT realization requires only one complex multiplier, fewer than the other FFT schemes. In Fig. 6, the hardware architecture needs two cascaded radix-2^4 SDF stages to perform the 256-point FFT computation. The first SR2^4SDF stage requires one multiplier and two CSD circuits, which process the constant multiplications by w1 and w2, whereas the second SR2^4SDF stage needs only two CSD circuits to generate the final FFT outputs. According to the structures in Eqs.(11), (12), (13), (14), and (15) and Fig. 9, the total number of complex adders in the four CSD circuits and eight radix-2 PEs is 56 (40 in the CSD circuits and 16 in the PEs). To sum up, our proposed architecture needs 1 complex multiplier and 56 complex adders for the 256-point FFT computation.
Table 4 compares the hardware requirements for the 128-point FFT with the radix-2^4 fast algorithm. Our FFT design needs only one complex multiplier, the same as the design of Oh and Lim [22]. The hardware in [20] is larger than that of the proposed scheme because it uses two parallel data-paths.
Table 5 and Table 6 show the results of the hardware verification of the proposed 256-point FFT processor. For Table 6, we use Synopsys Design Compiler for logic synthesis. Synopsys Design Compiler reports the chip area in µm^2, and we divide the total chip area by the area of a simple two-input AND gate (e.g. 9 µm^2) to obtain the estimated gate count. The 51.5 MHz figure in Table 6 is not a worst-case value obtained with the worst-case UMC 0.18 µm library; it is the maximum working frequency obtained with the typical UMC 0.18 µm 1P6M cell library. The power consumption is also estimated by Synopsys Design Compiler. When Synopsys Design Compiler is used for power estimation, no test vector is required. No scan chain is applied to the cell-based chip design. The hardware verification shows that the throughput rate of this FFT realization is suitable for the WiMAX 802.16a application, whose maximum sample rate is 32 MHz.

Table 5. Hardware verification with FPGA

| Item | Value |
|---|---|
| Target Device | Xilinx Virtex-II 1500 FG676 -4 |
| FPGA gates | 176127 |
| Max. speed | 35.76 MHz |

Table 6. Hardware verification with Standard Cell Design

| Item | Value |
|---|---|
| Design Process | UMC 0.18 µm 1P6M |
| Gates count | 173875 |
| Max. speed | 51.5 MHz |
| Power consumption | 162.7 mW @ 33.3 MHz |
| Supply voltage | 1.8 V |

V. Conclusion
In this paper, we propose a low-cost 256-point FFT architecture design intended for WiMAX OFDM systems. The proposed FFT architecture uses a cascade of simplified radix-2^4 SDF structures, whose control circuit is simple. The proposed FFT architecture needs only 1 complex multiplier and 56 complex adders to support 256-point computations. The throughput of the realized FFT module is one sample per clock cycle. In hardware verification, the realized FFT design processes up to 35.5M samples/sec on a Xilinx Virtex-II FPGA and up to 51.5M samples/sec in a UMC 0.18 µm standard cell design. The throughput of this scheme is suitable for the WiMAX 802.16a application, whose maximum sample rate is 32 MHz.

Acknowledgment
This implementation was supported by the National Science Council, Taiwan, R.O.C., under grant NSC96-2220-E-005-006. The authors would like to thank the anonymous reviewers, whose careful reviews and detailed comments helped to improve the readability of this paper. The authors would also like to thank the National Chip Implementation Center (CIC) in Taiwan for providing the synthesis environment and related EDA tools.

References
1. IEEE Std. 802.16-2004, Part 16: Air Interface for Fixed Broadband Wireless Access Systems (2004).
2. ETSI EN 300 744 (v1.2.1): Digital Video Broadcasting (DVB); Framing Structure, Channel Coding and Modulation for Digital Terrestrial Television (1999).
3. IEEE 802.11, IEEE Standard for Wireless LAN Medium Access Control and Physical Layer Specifications (1999).
4. T1E1.4/98-007R4: Standards Project for Interfaces Relating to Carrier to Customer Connection of Asymmetrical Digital Subscriber Line (ADSL) Equipment (1998).
5. ETSI TS 101 270-2 (v1.1.1): Transmission and Multiplexing (TM); Access transmission systems on metallic access cables; Very high speed Digital Subscriber Line (VDSL); Part 2: Transceiver specification (2001).
6. Oppenheim, A. V., Schafer, R. W. and Buck, J. R., Discrete-Time Signal Processing (1989).
7. Jia, L., Gao, Y., and Tenhunen, H., Efficient VLSI Implementation of Radix-8 FFT Algorithm, IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, pp. 468-471 (1999).
8. He, S. and Torkelson, M., A New Approach to Pipeline FFT Processor, The 10th International Parallel Processing Symposium, pp. 766-770 (1996).
9. He, S. and Torkelson, M., Designing Pipeline FFT Processor for OFDM (de)Modulation, International Symposium on Signals, Systems, and Electronics, pp. 257-262 (1998).
10. Oh, J. Y. and Lim, M. S., A Radix-2^4 SDF Pipeline FFT Processor for OFDM Modulation, The First IEEE VTS Asia Pacific Wireless Communications Symposium (2004).
11. Wold, E. H. and Despain, A. M., Pipeline and Parallel-Pipeline FFT Processors for VLSI Implementation, IEEE Transactions on Computer, Vol. 33, No. 5, pp. 414-426 (1984).
12. Despain, A. M., Fourier Transform Computer Using CORDIC Iterations, IEEE Transactions on Computer, Vol. 23, No. 10, pp. 993-1001 (1974).
13. Jia, L., Gao, Y., Isoaho, J. and Tenhunen, H., A New VLSI-Oriented FFT Algorithm and Implementation, IEEE ASIC Conference, pp. 337-341 (1998).
14. Bouguezel, S., A New Radix-2/8 FFT Algorithm for Length-q×2^m DFTs, IEEE Transactions on Circuits and Systems - Part I, Vol. 51, No. 9, pp. 1723-1732 (2004).
15. Bouguezel, S., Ahmad, M. O. and Swamy, M. N. S., An Efficient Split-Radix FFT Algorithm, IEEE International Symposium on Circuits and Systems, Vol. 4, pp. 65-68 (2003).
16. Yeh, W. C. and Jen, C. W., High-Speed and Low-Power Split-Radix FFT, IEEE Transactions on Signal Processing, Vol. 51, No. 3, pp. 864-874 (2003).
17. Takahashi, D., An Extended Split-Radix FFT Algorithm, IEEE Signal Processing Letters, Vol. 8, No. 5, pp. 145-147 (2001).
18. Son, B. S., Jo, B. G., Sunwoo, M. H. and Kim, Y. S., A High-Speed FFT Processor for OFDM Systems, IEEE International Symposium on Circuits and Systems, Vol. 3, pp. 281-284 (2002).
19. Fan, C. P., Lee, M. S. and Su, G. A., A Low Multiplier and Multiplication Costs 256-point FFT Implementation with Simplified Radix-2^4 SDF Architecture, IEEE Asia-Pacific Conference on Circuits and Systems (2006).
20. Lee, H. and Shin, M., A high-speed low-complexity two-parallel radix-2^4 FFT/IFFT processor for UWB applications, IEEE Asia Solid-State Circuits Conference, pp. 284-287 (2007).
21. Lin, Y. W., Liu, H. Y., and Lee, C. Y., A 1-GS/s FFT/IFFT Processor for UWB Applications, IEEE Journal of Solid-State Circuits, Vol. 40, No. 8, pp. 1726-1735 (2005).
22. Oh, J. Y. and Lim, M. S., Fast Fourier Transform Processor Based on Low-power and Area-efficient Algorithm, IEEE Asia-Pacific Conference on Advanced System Integrated Circuits, pp. 198-201 (2004).
