
A Radix 2^2 Based Parallel Pipeline FFT Processor for MB-OFDM UWB System


Nuo Li and N.P. van der Meijs
Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS)
Delft University of Technology, Delft, Netherlands
Email: leenuo@gmail.com
Abstract: This paper presents a novel parallel pipeline FFT
processor tailored for the Multiband Orthogonal Frequency Division Multiplexing (MB-OFDM) Ultra Wideband
(UWB) system defined by ECMA International. The
proposed Radix 2^2 Parallel Pipeline processor, which combines a
Radix 2^2 algorithm with two parallel data paths and a single-path delay feedback (SDF) pipeline architecture, is a small-area, low-power solution for MB-OFDM UWB systems. Synthesis results are
presented for both a Xilinx Virtex4 FPGA and a 90 nm ASIC
technology with a 1 V supply voltage. The results show that,
owing to the revised algorithm and novel architecture, a clock
frequency of 264 MHz suffices to meet the ECMA requirement.
The design requires about 39000 gates without the testing block,
and the corresponding area is 181140 µm^2.

I. INTRODUCTION
Ultra-Wideband (UWB) technology brings the convenience
and mobility of wireless communications to high-speed interconnects in devices throughout the digital home and office
[1]. The Multiband-OFDM standard is one solution for UWB
technology. A proposal for a Multiband-OFDM UWB standard
was published by the IEEE 802.15.3a study group [2]. In December
2007, the second revised version of Standard ECMA-368 was
released, which specifies the physical layer (PHY) and medium
access control layer (MAC) of UWB technology based on
Multiband-OFDM [3].
Several key issues must be solved when designing a CMOS
based Multiband-OFDM UWB solution that meets the low
power requirement. One of them concerns the FFT (Fast
Fourier Transform) block, which accounts for 25% of the design complexity
of the total digital baseband transceiver [4]. Although many
results have been published in this research area over the
past few years [5], [6], [7], the area and power consumption
of the FFT block still need to be improved, since the system
targets portable wireless devices. This paper therefore
focuses on improving area and power consumption
under the ECMA-368 standard requirements. Section II describes the requirements for the FFT block and the algorithm
on which the proposed design is based. Section III presents
the proposed FFT solution at the algorithm,
architecture, and implementation levels. Section
IV shows the synthesis results for both FPGA and
ASIC implementations, together with a comparison with other
published implementations.


  
     

II. BACKGROUND
A. The Requirements of FFT for Multiband OFDM System
According to ECMA-368, the required sampling frequency is 528 MHz and the total number of subcarriers, which
determines the FFT size, is 128. The time period available for
the IFFT and FFT is 242.42 ns, which is the FFT size multiplied by the inverse of the sampling
frequency ($T_{FFT} = 128 \cdot \frac{1}{f_s}$). There
are 37 zero-padded suffix samples, which take 70.08 ns. So the
total symbol interval is 312.5 ns ($T_{SYM} = T_{FFT} + T_{ZPS}$).
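The timing figures above can be rechecked with a few lines of Python. This is only an arithmetic sketch using the numbers quoted from ECMA-368 in this section. The names `symbol_timing` and `path_clock_mhz` are illustrative, and the two-path clock calculation assumes, as stated in the abstract, that the input stream is split over two parallel data paths.

```python
# Timing budget of the MB-OFDM FFT, using the figures quoted in this section.
F_S_HZ = 528e6   # sampling frequency
N_FFT = 128      # FFT size (number of subcarriers)
N_ZPS = 37       # zero-padded suffix samples
N_PATHS = 2      # assumption: two parallel data paths, as in the proposed design


def symbol_timing(f_s=F_S_HZ, n_fft=N_FFT, n_zps=N_ZPS):
    """Return (T_FFT, T_ZPS, T_SYM) in nanoseconds."""
    t_sample_ns = 1e9 / f_s
    t_fft = n_fft * t_sample_ns
    t_zps = n_zps * t_sample_ns
    return t_fft, t_zps, t_fft + t_zps


def path_clock_mhz(f_s=F_S_HZ, n_paths=N_PATHS):
    """Clock needed per path when the sample stream is split over n_paths."""
    return f_s / n_paths / 1e6


if __name__ == "__main__":
    t_fft, t_zps, t_sym = symbol_timing()
    print(f"T_FFT = {t_fft:.2f} ns, T_ZPS = {t_zps:.2f} ns, T_SYM = {t_sym:.2f} ns")
    print(f"clock per path = {path_clock_mhz():.0f} MHz")
```

Splitting the 528 MHz sample stream over two paths gives the 264 MHz clock quoted in the abstract.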
The word length choice is a critical issue for FFT processor
design. The trade-off between chip area consideration and
signal to quantization noise ratio (SQNR) directly determines
the choice. Based on the analysis of [5] and [8], the word
length is chosen to be 10 bits in this paper for simulation and
comparison with their designs.
B. The Selection of FFT Algorithms
The traditional radix 2 FFT algorithms have a simple structure
and a clear data flow; they are easy to implement and
suitable for generic FFT implementations. Nevertheless, these
algorithms need a large memory to store data at the inner stages,
which costs considerable power and area. Currently,
there are two trends for FFT implementation in OFDM systems:
mixed radix algorithms, such as [7], and pipeline
structure based algorithms, such as [9]. Based on extensive
algorithm analysis and selection, the proposed design employs
the Radix 2^2 algorithm developed by He and Torkelson [10],
which integrates the twiddle factor decomposition every two
stages. The Radix 2^2 algorithm has the same multiplicative
complexity as the radix 4 algorithm but retains the butterfly
structure of the radix 2 algorithm, which makes it very suitable for ASIC
implementation.
The detailed derivation of the algorithm can be found in [10]. Its
application to an 8 point FFT, shown in Figure 1, is used here to briefly explain the
algorithm. In this application the
Radix 2^2 algorithm is used only once, for the first two stages,
because an 8 point DFT can be decomposed only once by radix
4. For the last stage, the normal radix 2 DIF algorithm is used. With
the Radix 2^2 algorithm, the complex multiplication by the twiddle factor in the first stage is changed into a multiplication by (-j).
Therefore, in a pipeline structure, one complex multiplier can
be saved for the 8 point FFT.
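The decomposition described above can be sketched as a short recursive routine. This is a floating-point behavioral model written for illustration, not the hardware data path of the paper. The function names are hypothetical, and the twiddle arrangement follows the standard Radix 2^2 DIF form of [10]: a trivial (-j) factor between the two butterfly stages, followed by full twiddle factors after every second stage.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def fft_radix22(x):
    """Radix-2^2 DIF FFT with natural-order output; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:  # a leftover stage is a plain radix-2 butterfly
        return [x[0] + x[1], x[0] - x[1]]
    half, quarter = n // 2, n // 4

    def twiddle(exponent):
        return cmath.exp(-2j * cmath.pi * exponent / n)

    # First butterfly stage (span N/2).
    a = [x[i] + x[i + half] for i in range(half)]
    b = [x[i] - x[i + half] for i in range(half)]
    # Trivial twiddle: the lower half of the difference branch is multiplied by -j.
    for i in range(quarter, half):
        b[i] *= -1j
    # Second butterfly stage (span N/4), followed by the full twiddle factors.
    c0 = [a[i] + a[i + quarter] for i in range(quarter)]
    c2 = [(a[i] - a[i + quarter]) * twiddle(2 * i) for i in range(quarter)]
    c1 = [(b[i] + b[i + quarter]) * twiddle(i) for i in range(quarter)]
    c3 = [(b[i] - b[i + quarter]) * twiddle(3 * i) for i in range(quarter)]
    # Each branch feeds an N/4-point transform producing outputs X[4k + r].
    out = [0j] * n
    for r, branch in enumerate((c0, c1, c2, c3)):
        out[r::4] = fft_radix22(branch)
    return out


if __name__ == "__main__":
    samples = [complex(i, -i) for i in range(8)]
    error = max(abs(u - v) for u, v in zip(fft_radix22(samples), dft(samples)))
    print(f"max deviation from direct DFT: {error:.2e}")
```

For 8 points the recursion applies the Radix 2^2 step once and finishes with a radix 2 butterfly, matching the description above. For 128 points it applies the step three times (128, 32, 8) before the final radix 2 stage, giving seven stages in total.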



Fig. 1. Radix 2^2 based parallel FFT algorithm data flow

III. THE PROPOSED PROCESSOR


The proposed processor is described at the algorithm,
architecture, and implementation levels.
A. The Revision in the Algorithm Level
Analysis of the normal Radix 2^2 algorithm shows
that the input data can be separated into odd and
even parts, and that these parts are not mixed until the
last stage. This is one of the key points of the proposed processor;
it can be exploited in the architecture design
to reduce the working frequency and the number of registers.
The eight point FFT data flow in Figure 1 is used again to illustrate the
changes. The dashed lines
show the odd input data flow, while the solid lines show the
even input data flow. In the first and second stages, the dashed
and solid lines do not cross, which means that
the even and odd input data can be processed separately in
these stages. Only in the last stage do the dashed
and solid lines cross, which means that the even
and odd data must be combined there.
The 128 point parallel algorithm data flow with twiddle
factor positions is shown in Figure 2. The horizontal lines
are not shown. The input data and twiddle factors are
separated into even and odd parts, which are processed
separately through the first six stages and combined only
in the final seventh stage. Note that the output data are
produced in bit reversed order.
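The separation property can be checked directly from the butterfly index pairs. The sketch below assumes the usual DIF indexing, in which stage s of an N point transform pairs index i with index i + N/2^s inside each block. The helper names are illustrative only.

```python
def butterfly_pairs(n_points, stage):
    """Index pairs (i, j) combined by the butterflies of DIF stage `stage` (1-based)."""
    half = n_points >> stage  # butterfly span N / 2^stage
    pairs = []
    for start in range(0, n_points, 2 * half):
        for offset in range(half):
            pairs.append((start + offset, start + offset + half))
    return pairs


def mixes_parity(n_points, stage):
    """True if any butterfly of this stage combines an even and an odd index."""
    return any((i ^ j) & 1 for i, j in butterfly_pairs(n_points, stage))


if __name__ == "__main__":
    for stage in range(1, 8):
        print(f"stage {stage}: mixes even and odd inputs = {mixes_parity(128, stage)}")
```

Every span except the last one is an even number, so only the final stage pairs an even index with an odd one.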
B. Architecture Level
From the previous analysis, two-path parallelism
is appropriate for the first six stages, because
these stages can process the even and odd input data
separately, whereas the last (seventh) stage needs to mix
the even and odd data. This architecture brings some extra
requirements. First, a demultiplexer
is required to separate the input data into the even and odd
parts. On the other hand, the controller can be shared by the
even and odd paths. Special care should be taken to generate
the right control signals for the last stage so that the even
and odd parts are combined properly.
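The two-path organization can be illustrated with a behavioral model. In this sketch the even and odd samples run through all stages but the last in two independent lists, and a single butterfly per sample pair combines them at the end. To keep it short it uses the plain radix 2 DIF twiddle factors rather than the Radix 2^2 twiddle arrangement, and floating-point arithmetic rather than the 10 bit fixed-point data path. All function names are hypothetical.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def run_path(samples, parity, n_points):
    """Run every DIF stage except the last on one path (parity 0 = even, 1 = odd)."""
    d = list(samples)
    stages = n_points.bit_length() - 1        # log2(N)
    for s in range(1, stages):                # stages 1 .. log2(N) - 1
        g_half = n_points >> s                # butterfly span in the full data flow
        l_half = g_half // 2                  # the same span seen inside one path
        for start in range(0, len(d), 2 * l_half):
            for k in range(l_half):
                i, j = start + k, start + k + l_half
                n = 2 * k + parity            # offset inside the full-size block
                w = cmath.exp(-2j * cmath.pi * n / (2 * g_half))
                d[i], d[j] = d[i] + d[j], (d[i] - d[j]) * w
    return d


def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def two_path_fft(x):
    """FFT computed as two independent paths plus one combining stage."""
    n = len(x)
    bits = n.bit_length() - 1
    even = run_path(x[0::2], 0, n)
    odd = run_path(x[1::2], 1, n)
    y = []
    for e, o in zip(even, odd):               # last stage: one butterfly per pair
        y.extend((e + o, e - o))
    # The pipeline output is bit reversed; reorder it for comparison.
    return [y[bit_reverse(k, bits)] for k in range(n)]


if __name__ == "__main__":
    x = [complex(i % 7, -(i % 5)) for i in range(128)]
    err = max(abs(u - v) for u, v in zip(two_path_fft(x), dft(x)))
    print(f"max deviation from direct DFT: {err:.2e}")
```

Each path handles half of the samples, which is why it can run at half the sampling rate.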
Fig. 2. The 128 point parallel Radix 2^2 based algorithm data flow

Figure 3 shows the proposed parallel pipeline architecture.
It has seven stages and consists of demultiplexers, circular buffers, ROMs, complex multipliers, and butterfly units.
BF1 denotes butterfly type 1, which consists of four 2-to-1
multiplexers and four adders. BF2 denotes butterfly type 2,
which additionally swaps the real and imaginary parts and
inverts the sign, as required by the (-j) multiplication of the Radix
2^2 algorithm. The input data are first streamed in and split
by the demultiplexer. They are then processed in the even and odd
parts of the architecture, where the dashed arrows stand
for the flow of odd data and the solid lines for the even
data. For each of the odd and even parts, a single-path delay feedback
(SDF) pipeline structure [10] is used to process the data separately.
There are three controllers, which produce the control signals
and the addresses for reading the twiddle factors from the ROM.
The even and odd parts of each stage share the same controller.
There are five complex multipliers in the architecture. In
the sixth stage, the even part outputs need neither multiplication
nor twiddle factor storage, as can be seen in Figure 3.
The reason is that, after twiddle factor separation in this stage,
all the twiddle factors in the even part become the constant 1.
Therefore, no multiplication is required.
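Two of the points above lend themselves to a quick check: the (-j) multiplication in BF2 needs no multiplier, and the even path carries only unit twiddle factors after the sixth stage. The sketch below assumes the standard Radix 2^2 twiddle arrangement of [10], in which the factors following the third pair of stages of a 128 point transform are W_8^(q*n) with q = 0..3 and n = 0, 1. The function names are illustrative.

```python
import cmath


def mul_neg_j(re, im):
    """(re + j*im) * (-j) = im - j*re: swap the parts, negate the new imaginary part."""
    return im, -re


def twiddles_after_stage6(n_points=128):
    """Split the twiddles W_B^(q*n) after the third stage pair by position parity."""
    block = n_points // 16                    # remaining DFT size: 128 -> 32 -> 8
    even, odd = [], []
    for q in range(4):                        # the four branches of the stage pair
        for n in range(block // 4):           # n = 0 is an even position, n = 1 odd
            w = cmath.exp(-2j * cmath.pi * q * n / block)
            (even if n % 2 == 0 else odd).append(w)
    return even, odd


if __name__ == "__main__":
    print(mul_neg_j(3, 4))
    even, odd = twiddles_after_stage6()
    print("even path twiddles:", even)
    print("odd path twiddles: ", odd)
```

Under this assumption, twiddle factors follow stages 2, 4, and 6 in each of the two paths, which gives six positions; removing the even path multiplier after the sixth stage leaves the five complex multipliers stated above.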
C. Implementation Level
As can be seen from Figure 3, there are seven stages. Based
on the required control, it is advantageous to combine stages
1 and 2, stages 3 and 4, and stages 5 and 6 into three common
controller blocks. These common controller blocks all have the
structure shown in Figure 4. Therefore, the whole parallel
architecture can be divided into the first three common
controller blocks, the last block, and the arithmetic blocks. The
arithmetic blocks are composed of five ROMs and complex
multipliers.

Fig. 3. The parallel Radix 2^2 based pipeline architecture

Fig. 4. The common controller block

The basic idea of the data flow in these common controller
blocks is that stage 1 repeats after calculating $N/2^{2r}$ data
and stage 2 repeats after calculating $N/2^{2r+1}$ data, where r
(r = 1, 2, 3) is the index of the common controller block and
N is the FFT size. Only one counter is used to produce
control signals I and II for both stage 1 and stage 2. In the
first common controller block, control signal I is first set to
zero to let $N/4$ data be read into stage 1, and in the
following $N/4$ cycles control signal I is set to one to enable
the butterfly function in stage 1. At the same time, stage 2
reads in the $N/8$ data outputs of stage 1, which is controlled
by control signal II. In the next $N/8$ cycles, butterfly II of stage 2
works and control signal II equals one. The data flow analysis
is shown in Figure 5.

Fig. 5. The operation modes of the block
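The shared-counter control can be sketched as follows. The paper does not give the exact control logic, so this model rests on two assumptions: control signals I and II are derived from a single free-running counter by integer division (equivalent to tapping two adjacent counter bits), and pipeline latency between the stages is ignored. The function name and parameters are hypothetical.

```python
def control_signals(cycle, n_points=128, r=1):
    """Return (control_I, control_II) of common controller block r at a given cycle."""
    phase_1 = n_points >> (2 * r)        # N / 2^(2r): cycles per phase in stage 1
    phase_2 = n_points >> (2 * r + 1)    # N / 2^(2r+1): cycles per phase in stage 2
    count = cycle % (2 * phase_1)        # the single shared counter
    ctrl_1 = (count // phase_1) & 1      # 0: fill the buffer, 1: butterfly active
    ctrl_2 = (count // phase_2) & 1
    return ctrl_1, ctrl_2


if __name__ == "__main__":
    # First common controller block of a 128 point FFT: N/4 = 32 and N/8 = 16.
    for cycle in (0, 16, 32, 48):
        print(cycle, control_signals(cycle))
```

For r = 1 and N = 128, control signal I is zero for 32 cycles and one for the next 32. During the second half, control signal II is zero for 16 cycles while stage 2 reads the stage 1 outputs, and one for the following 16 cycles.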

The last block includes only the seventh stage. Because the
odd and even data need to be commutated, two demultiplexers
seem to be required to switch the data, as shown in
Figure 3. However, this can be improved by analyzing the
scheduling of the last stage. Only one
butterfly is working per clock cycle, and the first output
of the even path of the 6th stage is processed together with the first output of the
odd path. As long as the timing is matched,
the even path outputs are processed with the corresponding odd path
outputs. Therefore, the two demultiplexers are
not necessary, and only one butterfly is required in the last stage
to process the data. The modified structure of the last stage
and its interface with the previous stage are shown in Figure 6.
Fig. 6. The improved version of the 7th stage

IV. IMPLEMENTATION AND RESULT ANALYSIS
A. FPGA Implementation
The proposed design is synthesized and implemented with
Xilinx ISE, targeting a Xilinx Virtex4 FPGA. The arithmetic blocks are directly mapped to
DSP48 components in the Xilinx Virtex4. Table I lists the performance of the proposed implementation in comparison
with [7]. The table clearly shows the reduced resource count
of the proposed design compared with the implementation in
[7]. The reason is that the proposed design employs far fewer
memory blocks and complex multipliers.

TABLE I
THE COMPARISON WITH [7]

                                [7]      proposed
Word length (bits)              11       10
Total Number Slice Registers    7390     717
*Number used as Flip Flops      3860     457
Total Number of 4 input LUTs    12749    2230
Number of DSP48s                48       20

The word length used is lower than in [7]. However, even when
the word length of the proposed design is increased to 15 bits, the total
equivalent gate count is still much lower than that of [7]. At 15 bits,
the total numbers of slice registers, 4 input LUTs, and DSP48s of
the proposed design are 1052, 3600, and 20, respectively.
B. ASIC targeted results
The proposed design is also synthesized with Synopsys Design Compiler, targeting an ASIC implementation.
The synthesis library is the Faraday 90 nm standard cell library
[11], which is tailored for the UMC 90 nm logic LL-RVT (low-K)
process. During the implementation stage of our processor, [8]
was published, which employs a similar parallel structure.
However, there are some key differences between the two
architectures. The most important differences are in the first
and last stages, where the proposed design reduces the number
of shift registers and the latency of the processor. Table II
lists the performance of the proposed implementation in
comparison with other state-of-the-art designs. The table shows
that the number of gates used by the proposed design is only
55% of that of [8]. If 180 nm technology is scaled linearly to 90
nm, the area is reduced by a factor of 4. Hence, the design of
[12] in 180 nm would correspond to an area of 616595.5 µm^2 in
90 nm technology, which is still much larger than that of the proposed
design.

TABLE II
THE ASIC IMPLEMENTATION COMPARISON

                                proposed implementation   [8]               [12]
Technology                      90 nm, 1 V                0.18 µm, 1.8 V    0.18 µm, 1.8 V
Clock frequency (MHz)           264                       450               250
Parallel data format            2 data-path               2 data-path       4 data-path
Algorithm                       Radix 2^2                 Radix 2^4         Mixed Radix
Word length (bits)              10                        10                10
Complex multipliers             5                         2+0.41            2+2.48
Registers                       128                       190
Gates                           38540                     70000
Area (µm^2)                     181140                    -                 2466382
Area (µm^2) scaled for 90 nm    181140                                      616595.5

V. CONCLUSION
In this paper, a novel parallel pipeline FFT processor is
designed for the ECMA-368 standard. Our architecture is
based on a revised version of the Radix 2^2 algorithm. Our
revision amounts to restructuring the associated signal flow
graph into an even and an odd part. As such, it achieves not only
the low multiplier count of the standard Radix 2^2 algorithms, but
also a 50% reduction of the clock frequency and the lowest
circular buffer count compared to the traditional SDF architectures. Both FPGA and ASIC targeted synthesis results of this
architecture are presented. The results show that the required
area is dramatically reduced by the proposed design.

REFERENCES

[1] Intel, "Ultra-wideband (UWB) technology," http://www.intel.com/technology/comms/uwb/.
[2] A. Batra et al., "Multi-band OFDM physical layer proposal for IEEE 802.15 Task Group 3a," Tech. Rep., IEEE P.802.15-04/0493r0, 2004.
[3] "Standard ECMA-368: High Rate Ultra Wideband PHY and MAC Standard," 2nd Edition.
[4] A. Batra, J. Balakrishnan, G. Aiello, J. Foerster, and A. Dabak, "Design of a multiband OFDM system for realistic UWB channel environments," Microwave Theory and Techniques, IEEE Transactions on, vol. 52, no. 9, pp. 2123-2138, Sept. 2004.
[5] Y.-W. Lin, H.-Y. Liu, and C.-Y. Lee, "A 1-GS/s FFT/IFFT processor for UWB applications," Solid-State Circuits, IEEE Journal of, vol. 40, no. 8, pp. 1726-1735, Aug. 2005.
[6] R. Chidambaram, "A scalable and high-performance FFT processor, optimized for UWB-OFDM," Master's thesis, Delft University of Technology, 2005.
[7] N. Rodrigues, H. Neto, and H. Sarmento, "A OFDM module for a MB-OFDM receiver," Design & Technology of Integrated Systems in Nanoscale Era, 2007. DTIS. International Conference on, pp. 25-29, Sept. 2007.
[8] J. Lee and H. Lee, "A High-Speed Two-Parallel Radix-2^4 FFT/IFFT Processor for MB-OFDM UWB Systems," IEICE Trans. Fundamentals, vol. E91-A, no. 4, pp. 1206-1211, 2008.
[9] E. Saberinia, K. C. Chang, G. Sobelman, and A. H. Tewfik, "Implementation of a Multi-band Pulsed-OFDM Transceiver," J. VLSI Signal Process. Syst., vol. 43, no. 1, pp. 73-88, 2006.
[10] S. He and M. Torkelson, "A new approach to pipeline FFT processor," Parallel Processing Symposium, 1996., Proceedings of IPPS '96, The 10th International, pp. 766-770, Apr. 1996.
[11] Faraday, "FSD0A A 90 nm Logic SP-RVT (Low-K) Process." Faraday Technology Corporation, 2006.
[12] T. Chakraborty and S. Chakrabarti, "A reduced area 1 GSPS FFT design using MRMDF architecture for UWB communication," in Circuits and Systems, 2008. APCCAS 2008. IEEE Asia Pacific Conference on, 30 2008-Dec. 3 2008, pp. 1128-1131.


