
ABSTRACT

The high growth of the semiconductor industry over the past two decades has put Very Large Scale Integration (VLSI) in demand all over the world. Digital Signal Processing (DSP) has played a great role in expanding the VLSI application area, and the recent rapid advances in multimedia computing and in high-speed wired and wireless communications have brought DSP increased attention.
For an N-point transform, direct computation of the Discrete Fourier Transform (DFT) requires N^2 operations. Cooley and Tukey introduced the Fast Fourier Transform (FFT), which reduces the order of computation to N log2 N. Basically, the FFT decomposes the set of data to be transformed into a series of smaller data sets to be transformed; the size of this decomposition is called the "radix". It then decomposes those smaller sets into even smaller sets.
In this work, the radix-2 butterfly processor will be designed, which includes the adder, the subtractor and the twiddle-factor multiplier. An efficient complex-multiplier algorithm will be used in order to achieve the complex multiplication efficiently. The radix-2 butterfly processor will be modeled in VHDL, a Hardware Description Language (HDL) used to describe digital systems. The modeled design can be simulated using the ModelSim tool, the intended functionality can be verified with the help of the simulation results, and the design can be synthesized using the Xilinx tool.

CHAPTER 1
INTRODUCTION AND LITERATURE SURVEY
High performance VLSI-based FFT architectures are key to signal processing and
telecommunication systems since they meet the hard real-time constraints at low silicon area and
low power compared to CPU-based solutions.
The fast Fourier transform (FFT) plays an important role in the design and
implementation of discrete-time signal processing algorithms and systems. In recent years,
motivated by the emerging applications in the modern digital communication systems and
television terrestrial broadcasting systems, there has been tremendous growth in the design of
high-performance dedicated FFT processors. The pipelined FFT processor is a class of real-time FFT architectures characterized by continuous processing of input data which, for reasons of transmission economy, usually arrives in word-sequential format. However, the FFT operation is very communication intensive, which calls for spatially global interconnection. Therefore, much effort in the design of FFT processors focuses on how to efficiently map the FFT algorithm onto hardware so as to accommodate the serial input for computation.
The FFT is a faster version of the Discrete Fourier Transform (DFT). The FFT uses some clever algorithms to do the same thing as the DFT, but in much less time. But what is the DFT? The DFT is extremely important in the area of frequency (spectrum) analysis because it takes a discrete signal in the time domain and transforms it into its discrete frequency-domain representation. Without a discrete-time to discrete-frequency transform we would not be able to compute the Fourier transform with a microprocessor- or DSP-based system.
The Discrete Fourier Transform (DFT) plays an important role in the analysis, design and implementation of discrete-time signal-processing algorithms and systems. It is used to convert samples in the time domain to the frequency domain. The Fast Fourier Transform (FFT) is simply a fast (computationally efficient) way to calculate the Discrete Fourier Transform (DFT). The wide use of DFTs in Digital Signal Processing applications is the motivation to implement FFTs. The DFT is identical to samples of the Fourier transform at equally spaced frequencies.
Consequently, computation of the N-point DFT corresponds to the computation of N samples of the Fourier transform at N equally spaced frequencies $\omega_k = 2\pi k/N$. Considering the input x[n] to be complex, N complex multiplications and (N-1) complex additions are required to compute each value of the DFT when it is computed directly from the formula

$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j2\pi kn/N}, \qquad k = 0, 1, \ldots, N-1$    ...(1.1)
To compute all N values therefore requires a total of N^2 complex multiplications and N(N-1) complex additions. Each complex multiplication requires four real multiplications and two real additions, and each complex addition requires two real additions. Therefore a total of 4N^2 real multiplications and N(4N-2) real additions are required. Besides these multiplications and additions, there should be provision for storing the N complex input samples and the N output values. In contrast, by using the radix-2 Decimation-In-Frequency FFT algorithm, the numbers of complex multiplications and additions are reduced to (N/2) log2 N and N log2 N, respectively, to compute the DFT of a given complex x[n]. Hence in this project the radix-2 Decimation-In-Frequency FFT algorithm is implemented to compute the DFT of a sequence.
1.1 Definition of the Fourier Transform:
The Fourier transform (FT) of the function f(x) is the function $F(\omega)$, where

$F(\omega) = \int_{-\infty}^{\infty} f(x)\, e^{-j\omega x}\, dx$    ...(1.2)

and the inverse Fourier transform is

$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} F(\omega)\, e^{j\omega x}\, d\omega$    ...(1.3)
1.3 Fast Fourier Transform
Fourier transforms play an important role in many digital signal processing applications, including acoustics, optics, telecommunications, speech, signal and image processing. However, their wide use makes their computational requirements a heavy burden in many real-world applications, so their efficient computation has received considerable attention from mathematicians, engineers and applied scientists.
For an N-point transform, direct computation of the Discrete Fourier Transform (DFT) requires N^2 operations. Cooley and Tukey introduced the Fast Fourier Transform (FFT), which reduces the order of computation to N log2 N. The FFT is not an approximation of the DFT; it is exactly equal to the DFT; it is the DFT.

Functionally, the FFT decomposes the set of data to be transformed into a series of smaller data sets to be transformed; the size of this decomposition is called the "radix". It then decomposes those smaller sets into even smaller sets. At each stage of processing, the results of the previous stage are combined with twiddle-factor multiplications. Finally, the DFT is calculated for each small data set. FFTs can be decomposed using DFTs of even- and odd-indexed points, which is called a Decimation-In-Time (DIT) FFT, or they can be decomposed using a first-half/second-half approach, which is called a Decimation-In-Frequency (DIF) FFT.
The FFT is simply an algorithm to speed up the DFT calculation by reducing the number of multiplications and additions required. It was popularized by J. W. Cooley and J. W. Tukey in the 1960s and was actually a rediscovery of an idea of Runge (1903) and of Danielson and Lanczos (1942), first occurring prior to the availability of computers and calculators, when numerical calculation could take many man-hours. In addition, the German mathematician Karl Friedrich Gauss (1777-1855) had used the method more than a century earlier. In order to understand the basic concepts of the FFT and its derivation, note that the DFT expansion can be greatly simplified by taking advantage of the symmetry and periodicity of the twiddle factors. If the equations are rearranged and factored, the result is the Fast Fourier Transform (FFT), which requires only (N/2) log2 N complex multiplications. The computational efficiency of the FFT versus the DFT becomes highly significant when the FFT point size increases to several thousands. However, notice that the FFT computes all the output frequency components (either all or none!). If only a few spectral points need to be calculated, the DFT may actually be more efficient; calculation of a single spectral output using the DFT requires only N complex multiplications.
A fast Fourier transform (FFT) is an efficient algorithm to compute the discrete Fourier transform (DFT) and its inverse. There are many distinct FFT algorithms involving a wide range of mathematics, from simple complex-number arithmetic to group theory and number theory; this chapter gives an overview of the available techniques and some of their general properties.
A DFT decomposes a sequence of values into components of different frequencies. This
operation is useful in many fields (see discrete Fourier transform for properties and applications
of the transform) but computing it directly from the definition is often too slow to be practical.
An FFT is a way to compute the same result more quickly: computing a DFT of N points in the naive way, using the definition, takes O(N^2) arithmetic operations, while an FFT can compute the same result in only O(N log N) operations. The difference in speed can be substantial, especially for long data sets where N may be in the thousands or millions; in practice, the computation time can be reduced by several orders of magnitude in such cases, and the improvement is roughly proportional to N/log(N). This huge improvement made many DFT-based algorithms practical; FFTs are of great importance to a wide variety of applications, from digital signal processing and solving partial differential equations to algorithms for the quick multiplication of large integers.
The most well known FFT algorithms depend upon the factorization of N, but (contrary to popular misconception) there are FFTs with O(N log N) complexity for all N, even for prime N. Many FFT algorithms only depend on the fact that $e^{-j2\pi/N}$ is a primitive N-th root of unity, and thus can be applied to analogous transforms over any finite field, such as number-theoretic transforms.
Since the inverse DFT is the same as the DFT, but with the opposite sign in the exponent
and a 1/N factor, any FFT algorithm can easily be adapted for it.
FFTs perform exactly the same task as the DFT, but are able to dramatically reduce the
amount of complex arithmetic required to do so, by taking advantage of some of the symmetries,
repetitions and redundancies within the transform. There are many different forms of FFT
algorithms, all with their advantages and disadvantages. The RADIX of the transform is the number of samples grouped together when performing the next calculations. The radix is normally a small power of two, e.g. 2 or 4. A larger-radix transform has the advantage of a slightly smaller number of calculations, but as the total number of points of the transform must be a power of the radix, the disadvantage is that transforms of certain sizes may not be possible; e.g. radix-4 transforms can be 4-point, 16-point, 64-point, 256-point, 1024-point, etc., so a 512-point transform, for example, must be radix-2. An in-place FFT algorithm is one
that overwrites its input samples with the results that it generates at each stage or pass. This has
the disadvantage that the original data is lost, but has the advantage of needing less memory
space. This is important if the calculations are to be carried out on data in the fast internal
memory, and is the reason that in-place algorithms are the most popular. In-place FFT algorithms
work with either the input samples (in the case of decimation-in-time FFTs), or the output
samples (in the case of decimation-in-frequency FFTs), jumbled into bit-reversed order. That is, decimation-in-time transforms need their input samples to be swapped into bit-reversed order prior to the transform, which gives normal-order results, while decimation-in-frequency transforms use normal-ordered input samples and generate their output samples in bit-reversed order.
However, this overhead is insignificant when compared to the transform itself.
$W_N = e^{-j2\pi/N}$    ...(1.4)
This leads to the definition of the twiddle factors as
$W_N^{nk} = e^{-j2\pi nk/N}$    ...(1.5)
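For reference, the symmetry and periodicity of the twiddle factors, which the FFT exploits as noted above, can be written as

$W_N^{k+N/2} = -W_N^{k}$ (symmetry), $\qquad W_N^{k+N} = W_N^{k}$ (periodicity),

both of which follow directly from $e^{-j\pi} = -1$ and $e^{-j2\pi} = 1$.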

Radix-2 FFT Algorithms


Let us consider the computation of the N = 2^v point DFT by the divide-and-conquer approach. We split the N-point data sequence into two N/2-point data sequences f1(n) and f2(n), corresponding to the even-numbered and odd-numbered samples of x(n), respectively, that is,

$f_1(n) = x(2n), \qquad f_2(n) = x(2n+1), \qquad n = 0, 1, \ldots, N/2 - 1$    ...(1.6)

Thus f1(n) and f2(n) are obtained by decimating x(n) by a factor of 2, and hence the resulting FFT algorithm is called a decimation-in-time algorithm.
Now the N-point DFT can be expressed in terms of the DFTs of the decimated sequences as follows:

$X(k) = \sum_{n=0}^{N-1} x(n) W_N^{kn} = \sum_{m=0}^{N/2-1} f_1(m) W_N^{2mk} + W_N^{k} \sum_{m=0}^{N/2-1} f_2(m) W_N^{2mk}$

But $W_N^2 = W_{N/2}$. With this substitution, the equation can be expressed as

$X(k) = F_1(k) + W_N^{k} F_2(k), \qquad k = 0, 1, \ldots, N-1$    ...(1.7)

where F1(k) and F2(k) are the N/2-point DFTs of the sequences f1(m) and f2(m), respectively.
Since F1(k) and F2(k) are periodic, with period N/2, we have

$F_1(k+N/2) = F_1(k)$ and $F_2(k+N/2) = F_2(k)$    ...(1.8)

In addition, the factor $W_N^{k+N/2} = -W_N^{k}$. Hence the equations may be expressed as

$X(k) = F_1(k) + W_N^{k} F_2(k)$
$X(k+N/2) = F_1(k) - W_N^{k} F_2(k), \qquad k = 0, 1, \ldots, N/2 - 1$    ...(1.9)
We observe that the direct computation of F1(k) requires (N/2)^2 complex multiplications. The same applies to the computation of F2(k). Furthermore, there are N/2 additional complex multiplications required to compute W_N^k F2(k). Hence the computation of X(k) requires 2(N/2)^2 + N/2 = N^2/2 + N/2 complex multiplications. This first step results in a reduction of the number of multiplications from N^2 to N^2/2 + N/2, which is about a factor of 2 for N large.
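As a concrete illustration (the value of N is chosen here only for the example), for N = 1024 the direct DFT needs N^2 = 1,048,576 complex multiplications, while after this first decimation only 2(N/2)^2 + N/2 = 524,288 + 512 = 524,800 are needed, roughly half.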

Figure 1.1 First steps in the decimation-in-time algorithm.


By computing N/4-point DFTs, we would obtain the N/2-point DFTs F1(k) and F2(k) from the
relations

The decimation of the data sequence can be repeated again and again until the resulting sequences are reduced to one-point sequences. For N = 2^v, this decimation can be performed v = log2 N times. Thus the total number of complex multiplications is reduced to (N/2) log2 N, and the number of complex additions is N log2 N.
For illustrative purposes, Figure 1.2 depicts the computation of an N = 8 point DFT. We observe that the computation is performed in three stages, beginning with the computation of four two-point DFTs, then two four-point DFTs, and finally one eight-point DFT. The combination of the smaller DFTs to form the larger DFT is illustrated in Figure 1.3 for N = 8.

Figure 1.2 Three stages in the computation of an N = 8-point DFT.

Figure 1.3 Eight-point decimation-in-time FFT algorithm.

Figure 1.4 Basic butterfly computation in the decimation-in-time FFT algorithm.


An important observation concerns the order of the input data sequence after it is decimated (v-1) times. For example, if we consider the case where N = 8, the first decimation yields the sequence x(0), x(2), x(4), x(6), x(1), x(3), x(5), x(7), and the second decimation results in the sequence x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7). This shuffling of the input data sequence has a well-defined order, as can be ascertained from Figure 1.5, which illustrates the decimation of the eight-point sequence.
Another important radix-2 FFT algorithm, called the decimation-in-frequency algorithm, is
obtained by using the divide-and-conquer approach. To derive the algorithm, we begin by
splitting the DFT formula into two summations, one of which involves the sum over the first N/2
data points and the second sum involves the last N/2 data points. Thus we obtain

$X(k) = \sum_{n=0}^{N/2-1} x(n) W_N^{kn} + \sum_{n=N/2}^{N-1} x(n) W_N^{kn} = \sum_{n=0}^{N/2-1} \left[ x(n) + (-1)^k\, x(n+N/2) \right] W_N^{kn}$    ...(1.10)
Now, let us split (decimate) X(k) into the even- and odd-numbered samples. Thus we obtain

Figure 1.5 Shuffling of the data and bit reversal.

$X(2k) = \sum_{n=0}^{N/2-1} \left[ x(n) + x(n+N/2) \right] W_{N/2}^{kn}$
$X(2k+1) = \sum_{n=0}^{N/2-1} \left[ x(n) - x(n+N/2) \right] W_N^{n}\, W_{N/2}^{kn}$    ...(1.11)

where we have used the fact that $W_N^2 = W_{N/2}$.
The computational procedure above can be repeated through decimation of the N/2-point DFTs X(2k) and X(2k+1). The entire process involves v = log2 N stages of decimation, where each stage involves N/2 butterflies of the type shown in Figure 1.7. Consequently, the computation of the N-point DFT via the decimation-in-frequency FFT requires (N/2) log2 N complex multiplications and N log2 N complex additions, just as in the decimation-in-time algorithm. For illustrative purposes, the eight-point decimation-in-frequency algorithm is given in Figure 1.8.

Figure 1.6 First stage of the decimation-in-frequency FFT algorithm.

Figure 1.7 Basic butterfly computation in the decimation-in-frequency FFT algorithm.

Figure 1.8 N = 8-point decimation-in-frequency FFT algorithm.


We observe from Figure 1.8 that the input data x(n) occurs in natural order, but the output DFT
occurs in bit-reversed order. We also note that the computations are performed in place.
However, it is possible to reconfigure the decimation-in-frequency algorithm so that the input
sequence occurs in bit-reversed order while the output DFT occurs in normal order. Furthermore,
if we abandon the requirement that the computations be done in place, it is also possible to have
both the input data and the output DFT in normal order.
Fast Fourier Transform (FFT)
The naive implementation of the N-point digital Fourier transform involves calculating the scalar product of the sample buffer (treated as an N-dimensional vector) with N separate basis vectors. Since each scalar product involves N multiplications and N additions, the total time is proportional to N^2 (in other words, it is an O(N^2) algorithm). However, it turns out that by cleverly re-arranging these operations, one can optimize the algorithm down to O(N log N),

which for large N makes a huge difference. The optimized version of the algorithm is called the
fast Fourier transform, or the FFT.
Let's do some back-of-the-envelope calculations. Suppose that we want to do a real-time Fourier transform of one channel of CD-quality sound. That's 44k samples per second. Suppose also that we have a 1k buffer that is being re-filled with data 44 times per second. To generate a 1000-point Fourier transform directly we would have to perform 2 million floating-point operations (1M multiplications and 1M additions). To keep up with the incoming buffers, we would need at least 88 Mflops (millions of floating-point operations per second). If you are lucky enough to have a 100 Mflop machine, that might be fine, but you would be left with very little processing power to spare.
Using the FFT, on the other hand, we would perform on the order of 2*1000*log2(1000) operations per buffer, which is more like 20,000. This requires about 880 kflops, less than 1 Mflop: a hundred-fold speedup.
The standard strategy to speed up an algorithm is to divide and conquer. We have to find some way to group the terms in the equation

$V[k] = \sum_{n=0}^{N-1} W_N^{kn}\, v[n]$

Let's see what happens when we separate odd n's from even n's (from now on, let's assume that N is even):

$V[k] = \sum_{n\ \mathrm{even}} W_N^{kn} v[n] + \sum_{n\ \mathrm{odd}} W_N^{kn} v[n]$
$\quad = \sum_{r=0}^{N/2-1} W_N^{k(2r)} v[2r] + \sum_{r=0}^{N/2-1} W_N^{k(2r+1)} v[2r+1]$
$\quad = \sum_{r=0}^{N/2-1} W_N^{k(2r)} v[2r] + W_N^{k} \sum_{r=0}^{N/2-1} W_N^{k(2r)} v[2r+1]$
$\quad = \sum_{r=0}^{N/2-1} W_{N/2}^{kr} v[2r] + W_N^{k} \sum_{r=0}^{N/2-1} W_{N/2}^{kr} v[2r+1]$

where we have used one crucial identity:

$W_N^{k(2r)} = e^{-2\pi i \cdot 2kr/N} = e^{-2\pi i \cdot kr/(N/2)} = W_{N/2}^{kr}$    ...(1.12)

Notice an interesting thing: the two sums are nothing else but the N/2-point Fourier transforms of, respectively, the even subset and the odd subset of samples. Terms with k greater than or equal to N/2 can be reduced using another identity:

$W_{N/2}^{m+N/2} = W_{N/2}^{m}\, W_{N/2}^{N/2} = W_{N/2}^{m}$

which is true because $W_m^m = e^{-2\pi i} = \cos(-2\pi) + i\sin(-2\pi) = 1$.    ...(1.13)

If we start with N that is a power of 2, we can apply this subdivision recursively until we
get down to 2-point transforms.
We can also go backwards, starting with the 2-point transform:

$V[k] = W_2^{0\cdot k} v[0] + W_2^{1\cdot k} v[1], \qquad k = 0, 1$

The two components are:

$V[0] = W_2^{0} v[0] + W_2^{0} v[1] = v[0] + W_2^{0} v[1]$
$V[1] = W_2^{0} v[0] + W_2^{1} v[1] = v[0] + W_2^{1} v[1]$

We can represent the two equations for the components of the 2-point transform graphically using the so-called butterfly:

Fig.1.9 Butterfly calculation


Furthermore, using the divide-and-conquer strategy, a 4-point transform can be reduced to two 2-point transforms: one for the even elements, one for the odd elements. The odd one is multiplied by $W_4^k$. Diagrammatically, this can be represented as two levels of butterflies. Notice that, using the identity $W_{N/2}^{n} = W_N^{2n}$, we can always express all the multipliers as powers of the same $W_N$ (in this case we choose N = 4).

Fig.1.10 Diagrammatical representation of the 4-point Fourier transforms calculation
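In equation form (this is simply the N = 4 case of the decomposition derived above), the two butterfly levels compute

$V[k] = \bigl(v[0] + W_4^{2k} v[2]\bigr) + W_4^{k}\bigl(v[1] + W_4^{2k} v[3]\bigr), \qquad k = 0, 1, 2, 3,$

where the terms in parentheses are the two 2-point transforms and $W_4^{2k}$, $W_4^{k}$ are the multipliers appearing in the diagram.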

I encourage the reader to derive the analogous diagrammatical representation for N=8.
What will become obvious is that all the butterflies have similar form:

Fig.1.11 Generic butterfly graph


This graph can be further simplified using the identity

$W_N^{s+N/2} = W_N^{s}\, W_N^{N/2} = -W_N^{s}$    ...(1.14)

which is true because

$W_N^{N/2} = e^{-2\pi i (N/2)/N} = e^{-\pi i} = \cos(-\pi) + i\sin(-\pi) = -1$
Here's the simplified butterfly:

Fig.1.12 Simplified generic butterfly


Using this result, we can now simplify our 4-point diagram.

Fig.1.13 4-point FFT calculation

This diagram is the essence of the FFT algorithm. The main trick is that you don't
calculate each component of the Fourier transform separately. That would involve unnecessary
repetition of a substantial number of calculations. Instead, you do your calculations in stages. At
each stage you start with N (in general complex) numbers and "butterfly" them to obtain a new
set of N complex numbers. Those numbers, in turn, become the input for the next stage. The
calculation of a 4-point FFT involves two stages. The input of the first stage is the 4 original
samples. The output of the second stage is the 4 components of the Fourier transform. Notice that
each stage involves N/2 complex multiplications (or N real multiplications), N/2 sign inversions
(multiplication by -1), and N complex additions. So each stage can be done in O(N) time. The
number of stages is log2 N (which, since N is a power of 2, is the exponent m in N = 2^m). Altogether, the FFT requires on the order of O(N log N) calculations.
Moreover, the calculations can be done in-place, using a single buffer of N complex numbers.
The trick is to initialize this buffer with appropriately scrambled samples. For N=4, the order of
samples is v[0], v[2], v[1], v[3]. In general, according to our basic identity, we first divide the
samples into two groups, even ones and odd ones. Applying this division recursively, we split
these groups of samples into two groups each by selecting every other sample. For instance, the
group (0, 2, 4, 6, 8, 10, ... 2N-2) will be split into (0, 4, 8, ...) and (2, 6, 10, ...), and so on. If you
write these numbers in binary notation, you'll see that the first split (odd/even) is done according
to the lowest bit; the second split is done according to the second lowest bit, and so on. So if we
start with the sequence of, say, 8 consecutive binary numbers:
000, 001, 010, 011, 100, 101, 110, 111
we will first scramble them like this:
[even] (000, 010, 100, 110), [odd] (001, 011, 101, 111)
then we'll scramble the groups:
((000, 100), (010, 110)), (001, 101), (011, 111))
which gives the result:
000, 100, 010, 110, 001, 101, 011, 111
This is equivalent to sorting the numbers in bit-reversed order--if you reverse bits in each number
(for instance, 110 becomes 011, and so on), you'll get a set of consecutive numbers.
So this is how the FFT algorithm works (more precisely, this is the decimation-in-time in-place
FFT algorithm).
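A minimal VHDL sketch of this bit-reversed addressing is given below; it is not part of the project code in the appendix, and the package and function names are illustrative assumptions.

-- Bit-reversal sketch (illustrative, not part of the project code):
-- the function mirrors the bits of an index, e.g. "110" becomes "011".
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

package bitrev_pkg is
  function bit_reverse(idx : std_logic_vector) return std_logic_vector;
end bitrev_pkg;

package body bitrev_pkg is
  function bit_reverse(idx : std_logic_vector) return std_logic_vector is
    -- normalise the index to a (length-1 downto 0) range first
    variable v      : std_logic_vector(idx'length-1 downto 0) := idx;
    variable result : std_logic_vector(idx'length-1 downto 0);
  begin
    for i in v'range loop
      result(i) := v(v'length - 1 - i);   -- mirror the bit positions
    end loop;
    return result;
  end bit_reverse;
end bitrev_pkg;

For N = 8 (3-bit indices) this maps the addresses 000, 001, ..., 111 to 000, 100, 010, 110, 001, 101, 011, 111, which is exactly the scrambled order listed above.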

1. Select N that is a power of two. You'll be calculating an N-point FFT.


2. Gather your samples into a buffer of size N
3. Sort the samples in bit-reversed order and put them in a complex N-point buffer (set the
imaginary parts to zero)
4. Apply the first stage butterfly using adjacent pairs of numbers in the buffer
5. Apply the second stage butterfly using pairs that are separated by 2
6. Apply the third stage butterfly using pairs that are separated by 4
7. Continue butterflying the numbers in your buffer until you get to separation of N/2
8. The buffer will contain the Fourier transform
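As a quick check of these steps, here is a worked 4-point example with arbitrarily chosen samples v = (1, 2, 3, 4). After bit-reversal the buffer holds v[0], v[2], v[1], v[3] = 1, 3, 2, 4. The first stage butterflies the adjacent pairs (twiddle $W_2^0 = 1$), giving (1+3, 1-3) = (4, -2) and (2+4, 2-4) = (6, -2). The second stage uses pairs separated by 2 with twiddles $W_4^0 = 1$ and $W_4^1 = -j$:

$V[0] = 4 + 6 = 10, \quad V[2] = 4 - 6 = -2, \quad V[1] = -2 + (-j)(-2) = -2 + 2j, \quad V[3] = -2 - 2j,$

which agrees with the DFT of (1, 2, 3, 4) computed directly from the definition.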

CHAPTER 2
Design of Butterfly Processor: Results and Discussions
2.1 HDL Coding for the FFT Algorithm
In the FFT, all the operations to be performed are fixed. At the top level the design has instantiations of the basic radix-2 butterflies. Each butterfly includes the multiplication by the twiddle factor and some complex additions and subtractions, so the butterfly code in turn instantiates the signed complex adder/subtractors and multipliers. In these two main modules the complexity lies in matching the ports properly during instantiation. The port connections should match the equations to be implemented so that the connections, and hence the resulting design, are optimized.
The adder/subtractor and the multipliers need to handle signed complex numbers, so they are coded with parameterized signed operation. Also, in all of these operations the bit width of the output increases, and in a cascaded connection of such logic the proper word length has to be maintained. The modules are parameterized for this as well, and the input bit width is defined when a module is instantiated. In the case of the multipliers, one of the operands is the twiddle factor, whose value lies between +1 and -1, so proper scaling is required. In the complex multiplier a scale-down by a factor of 128 is done by truncating the 7 least significant bits of the product. This introduces a quantization error into the calculations performed by the code. Test cases are used to check whether this quantization error causes significant deviation from the accurate value.
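The scale-down described above can be expressed in a few lines of VHDL; the following is only an illustrative sketch (the entity and signal names are assumptions, not part of the project code), mirroring the r_out <= r(W2-3 DOWNTO W-1) slice used in the ccmul module in the appendix.

-- Illustrative sketch of the scale-down by 128 (truncation of 7 LSBs)
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity scale128 is
  generic (W2 : integer := 17;    -- width of the complex-multiplier product
           W  : integer := 8);    -- width of the scaled result
  port (product : in  STD_LOGIC_VECTOR(W2-1 downto 0);
        scaled  : out STD_LOGIC_VECTOR(W-1 downto 0));
end scale128;

architecture rtl of scale128 is
begin
  -- selecting bits (W2-3 downto W-1) = (14 downto 7) drops the 7 least
  -- significant bits, i.e. divides the fixed-point product by 2**7 = 128;
  -- the discarded bits are the source of the quantization error
  scaled <= product(W2-3 downto W-1);
end rtl;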

2.2 Comparing Different Schemes of Coding


Decimation in Time (DIT) and Decimation in Frequency (DIF) are the two coding schemes used to model the design. In the DIT scheme the twiddle-factor multiplication takes place before the butterfly operation, while in DIF it takes place after the butterfly operation. It is observed that the DIF scheme provides a further reduction in hardware in comparison with DIT. In both cases, the output appears in bit-reversed order for an ordered input.

2.3 HDL Simulation Analysis


The 2-point radix-2 algorithm is implemented in VHDL. Suitable test cases over a range of values are taken from hand calculations.

The MATLAB tool from MathWorks Inc. has a built-in function for the fast Fourier transform (FFT). To obtain proper test cases, the inputs were run through MATLAB and the corresponding output values noted. The outputs of the design match the expected output values to within the quantization error; the error fluctuates with the quantization taking place in the calculation.

2.4 VHDL coding:


The VHDL code is developed for the 2-point radix-2 Decimation-In-Frequency FFT algorithm. Different modules are written in order to implement the algorithm, such as the complex multiplication, the single-stage butterfly, and a top module that instantiates the sub-modules.
In the code, a total of 2 inputs, each having a 9-bit real part and a 9-bit imaginary part, are fed to the first-stage butterfly. Before that, a twiddle factor would have to be multiplied in, but this is omitted since those twiddle factors are all 1. The output of the first-stage butterfly is 9 bits and is again multiplied by a twiddle factor; only one multiplication process actually takes place, since twiddle factors equal to unity are skipped. The 11-bit output is fed to the second-stage butterfly, and hence a final output of 13 bits is obtained.
Here the decimation-in-frequency radix-2 algorithm is implemented. The radix-2 algorithms are the simplest FFT algorithms. The decimation-in-frequency (DIF) radix-2 FFT partitions the DFT computation into even-indexed and odd-indexed outputs, each of which can be computed by shorter-length DFTs of different combinations of input samples. Recursive application of this decomposition to the shorter-length DFTs results in the full radix-2 decimation-in-frequency FFT.
The radix-2 decimation-in-frequency and decimation-in-time fast Fourier transforms (FFTs) are the simplest FFT algorithms. Like all FFTs, they compute the discrete Fourier transform (DFT)

$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j2\pi nk/N}, \qquad k = 0, 1, \ldots, N-1$    ...(2.1)

Fig.2.1 Radix 2 Butterfly architecture

$W_N^{k} = e^{-j2\pi k/N}$    ...(2.2)
The twiddle factors of radix-2 and radix-4 are the same. The twiddle factor e^{jπ/9} used here is arrived at as follows. The input data is 8 bits wide and becomes 9 bits once the sign bit is included; allowing 9 bits for the -1 side and 9 bits for the +1 side gives a total of N = 18 (9 + 9). Starting from

Twiddle factor = e^{-j2πk/N}

with k = -1 and N = 18, the twiddle factor becomes e^{j2π/18} = e^{jπ/9}. Since 180°/9 = 20°, this is

e^{j20°} = cos 20° + j sin 20°,

which is multiplied by the scaling factor 128 for the fixed-point implementation:

(cos 20° + j sin 20°) * 128.
Here the COS value is 121 and the SIN value is 39. These values are given as inputs, and from them COS + SIN and COS - SIN are identified; the twiddle-factor multiplication uses the values COS+SIN and COS-SIN. The outputs of the radix-2 butterfly are

Dre + jDim = [(Are + jAim) + (Bre + jBim)]/2
Ere + jEim = [(dif_re + j dif_im) * ((cos 20° + j sin 20°)*128)]/128

where

dif_re + j dif_im = [(Are + jAim) - (Bre + jBim)]/2.

Writing

dif_re = x
dif_im = y
COS = c
SIN = s,
the complex multiplication to be performed is

R + jI = (x + jy)(c + js),

where c and s are the cosine and sine values scaled by 128. First the common factor

Z = c(x - y)

is computed. Then R and I each need only one further multiplication:

R = (c - s)y + Z = cy - sy + cx - cy = cx - sy    ...(2.3)
I = (c + s)x - Z = cx + sx - cx + cy = cy + sx    ...(2.4)
Equations 2.3 and 2.4 are the results of the complex multiplication. Since the values of cos and sin are already known, only the intermediate values of the complex multiplier need to be found. The values COS - SIN and COS + SIN are fed in and multiplied with dif_im and dif_re respectively, so the two extra subtractions can be avoided in the HDL.
According to these equations, a complex adder, a complex subtractor and a complex multiplier are needed.
For the test case of the butterfly processor the given inputs are

Are + jAim = 100 + j110    (Input_1)
Bre + jBim = -40 + j10     (Input_2)

Dre + jDim = [(Are + jAim) + (Bre + jBim)]/2
           = (100 + j110 + (-40 + j10))/2
           = 30 + j60    (Output_1)

dif_re + j dif_im = [(Are + jAim) - (Bre + jBim)]/2
                  = ((100 + j110) - (-40 + j10))/2
                  = 70 + j50    (Output_2)

Ere + jEim = [(dif_re + j dif_im) * ((cos 20° + j sin 20°)*128)]/128
           = [(70 + j50) * (121 + j39)]/128
           = (6520 + j8780)/128
           ≈ 50 + j68    (Output_3)
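A minimal testbench sketch for this case is shown below. It is not the project's actual testbench (which is not listed in the appendix); the entity and signal names other than the bfproc ports are illustrative assumptions, and the input values are those of the test case above.

-- Testbench sketch applying the test case above to the bfproc butterfly
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;

entity tb_bfproc is
end tb_bfproc;

architecture sim of tb_bfproc is
  signal clk                : std_logic := '0';
  signal Are, Aim, Bre, Bim : std_logic_vector(7 downto 0);
  signal c                  : std_logic_vector(7 downto 0);
  signal cps, cms           : std_logic_vector(8 downto 0);
  signal Dre, Dim, Ere, Eim : std_logic_vector(7 downto 0);
begin
  clk <= not clk after 10 ns;            -- free-running reference clock

  dut: entity work.bfproc                -- butterfly processor under test
    port map (clk => clk,
              Are_in => Are, Aim_in => Aim, c_in => c,
              Bre_in => Bre, Bim_in => Bim,
              cps_in => cps, cms_in => cms,
              Dre_out => Dre, Dim_out => Dim,
              Ere_out => Ere, Eim_out => Eim);

  stimulus: process
  begin
    -- test case: A = 100 + j110, B = -40 + j10,
    -- COS = 121, COS+SIN = 160, COS-SIN = 82 (twiddle terms scaled by 128)
    Are <= conv_std_logic_vector(100, 8);
    Aim <= conv_std_logic_vector(110, 8);
    Bre <= conv_std_logic_vector(-40, 8);
    Bim <= conv_std_logic_vector(10, 8);
    c   <= conv_std_logic_vector(121, 8);
    cps <= conv_std_logic_vector(160, 9);
    cms <= conv_std_logic_vector(82, 9);
    wait for 200 ns;                     -- expect D = 30 + j60 and E = 50 + j68
    wait;
  end process stimulus;
end sim;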

Fig.2.2 Result for the above test case in Butterfly architecture


The architecture is modeled in HDL and the simulation results are shown in figure 2.2. The inputs of the butterfly processor are two points, each of 9 bits. After the butterfly the result has to be divided by 2^1, and after the twiddle-factor multiplication the result has to be divided by 2^7. The butterfly outputs are scaled according to the number of inputs, which is why overflow in the arithmetic operations is avoided. Figure 2.3 shows the output of each stage.
The results after the twiddle-factor multiplication are shown in figure 2.3. The twiddle-factor multiplication architecture is developed using the twiddle-factor equation (2.2); the k values are taken from the architecture, and N is the number of points, here N = 18. The cosine and sine values are worked out and stored in a LUT so that they can be multiplied with the input of each stage.
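A minimal sketch of such a look-up is shown below; the package and constant names are illustrative assumptions (they are not part of the project code), and the values are the ones quoted in the text.

-- Illustrative sketch of storing the scaled twiddle-factor terms as constants
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;

package twiddle_lut is
  constant COS_W     : std_logic_vector(7 downto 0) := conv_std_logic_vector(121, 8); -- COS
  constant COS_P_SIN : std_logic_vector(8 downto 0) := conv_std_logic_vector(160, 9); -- COS + SIN
  constant COS_M_SIN : std_logic_vector(8 downto 0) := conv_std_logic_vector(82, 9);  -- COS - SIN
end twiddle_lut;

These constants would be fed to the c_in, cps_in and cms_in ports of the butterfly at each stage.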

Fig.2.3 Result for the above test case showing each stage output (reference clock; inputs 100 + j110 and -40 + j10; COS = 121, COS+SIN = 160, COS-SIN = 82; stage outputs 70 + j50, 30 + j60 and 50 + j68)

Summary

The identified radix-2 algorithm is modelled and simulated using the ModelSim tool.

The simulation results are discussed by considering different cases.

CHAPTER 3
Implementation of Butterfly processor
3.1 Introduction to FPGA
Digital circuits can be broadly classified as Standard, ASSP (Application Specific Standard Product) or Customized. Field Programmable Gate Arrays (FPGAs) are one of the ways to implement custom logic. As the name suggests, these ICs can be programmed in the field, i.e. by the system manufacturer; they are like programmable ASSPs. The per-unit cost of FPGAs is significantly higher than that of ASICs, but the Non-Recurring Engineering (NRE) costs are reduced to zero.

3.2 Literature Survey on FPGA


FPGAs provide a lot of flexibility to the user to program them in the way required. All the IO pins, the internal blocks and even the interconnects are programmable. This flexibility extends to design changes even after the layout is complete.

Figure 3.1 FPGA Design Flow


The main part of the FPGA design cycle is to design the logic and then tune it during verification and implementation on an FPGA board. During implementation, various constraints can be defined. Finally, it is checked whether the design meets the design specifications.
The FPGA design flow shortens the design cycle by a great margin and provides the option of iterating to optimize the design in a short period.

3.3 Implementation of the HDL on FPGA


For implementing the HDL code for the 2-point radix-2 FFT designed in chapter two, a Xilinx Spartan 3E XC3S500E FPGA in a 320-ball Fine-pitch Ball Grid Array package (3s500efg320-4) is used. The steps for implementation are:


a. Create a new project in Xilinx ISE.
b. Select the family (Spartan 3E), device (XC3S500E), package (FG320) and speed grade (-4), as required for the FPGA board being used.
c. Add the VHDL HDL file to the project.
d. Check the syntax of the file.
e. Create implementation constraints through a User Constraint File (.ucf). Assign constraints such as timing constraints, pin assignments and area constraints.
f. Synthesize the logic.
g. View and study the RTL schematic.
h. Implement the design (Translate, Map, Place & Route).
i. Change properties, constraints, and source as necessary, then re-synthesize and re-implement your design. Repeat until the design requirements are met.
j. View the placed design in Floorplanner.
k. Create a programming file (.BIT) to program your FPGA.
l. Program the device with a programming cable & check the result with test vectors.
The reports are analyzed to check the timing details and the success of implementation of the
logic on FPGA board.

3.4 General Implementation Flow

The generalized implementation flow diagram of the project is represented as follows.

Figure 3.2 General Implementation Flow Diagram


Initially, market research should be carried out, covering the previous versions of the design and the current requirements on the design. Based on this survey, the specification and the architecture must be identified. Then the RTL modeling should be carried out in VHDL with respect to the identified architecture. Once the RTL modeling is done, it should be simulated and verified for all the cases. The functional verification should match the intended architecture and should pass all the test cases.
Once the functional verification is complete, the RTL model is taken to the synthesis process. Three operations are carried out in the synthesis process:
Translate
Map
Place and Route
Translate: merge multiple design files into a single netlist.
Map: group logical symbols from the netlist into physical components.
Place and Route: place components into the chip, connect the components, and extract timing data into reports.
The developed RTL model is translated to a mathematical equation format that is understandable to the tool. These translated equations are then mapped to the library, that is, mapped to the hardware. Once the mapping is done, the gates are placed and routed. Before these processes, constraints can be given in order to optimize the design. Finally, the BIT MAP file, which holds the design information in binary format, is generated and downloaded to the FPGA board.
The above shows the basic steps involved in implementation. The initial design entry may be VHDL, a schematic or a Boolean expression. The Boolean expression is then optimized for area or speed.

Figure 3.3 Logic Block


In technology mapping, the optimized Boolean expressions are transformed into FPGA logic blocks; here area and delay optimization take place. During placement, algorithms are used to place each block in the FPGA array. Routing then assigns the programmable FPGA wire segments to establish connections among the FPGA blocks. The configuration of the final chip is made in the programming unit.

3.5 Synthesis Result


The developed 2-point radix-2 FFT design is simulated and its functionality verified. Once the functional verification is done, the RTL model is taken to the synthesis process using the Xilinx ISE tool. In the synthesis process, the RTL model is converted to a gate-level netlist mapped to a specific technology library. This 2-point radix-2 FFT design is synthesized for the Spartan 3E family.
In the Spartan 3E family, many different devices are available in the Xilinx ISE tool. To synthesize this design, the device XC3S500E has been chosen, with the FG320 package and speed grade -4. The design of the radix-2 FFT algorithm is synthesized and its results are analyzed as follows.
Device utilization summary:

This device utilization includes the following.

Logic Utilization

Logic Distribution

Total Gate count for the Design

The device utilization summary is shown below. It gives the number of resources used out of those available, also expressed as a percentage. Hence, as the result of the synthesis process, the device utilization for the chosen device and package is reported below.
Memory Usage
Total memory usage is 161708 kilobytes
Device Utilization Summary:
Selected Device: 3s500efg320-4

Number of Slices:             96 out of 4656    2%
Number of Slice Flip Flops:   64 out of 9312    0%
Number of 4 input LUTs:      157 out of 9312    1%
Number of IOs:                91
Number of bonded IOBs:        91 out of  232   39%
Number of MULT18X18SIOs:       3 out of   20   15%
Number of GCLKs:               1 out of   24    4%

Timing Summary:
Speed Grade: -4
Minimum period: 17.748ns (Maximum Frequency: 56.344MHz)
Minimum input arrival time before clock: 2.535ns
Maximum output required time after clock: 4.283ns
Maximum combinational path delay: No path found
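For reference, the reported maximum frequency is simply the reciprocal of the minimum period: f_max = 1 / 17.748 ns ≈ 56.344 MHz.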

3.6 Summary

The implementation requirements, including the primary inputs and primary outputs of the design and the proper notation and conventions, were discussed.

The general implementation flow of the design was presented and explained in order to understand the proper flow.

The implementation details, including the implementation style of each process, have been discussed.

3.7 Analysis of Implementation Reports

During the implementation of the logic, various intermediate results are obtained. These, together with the related reports, are detailed in the sub-sections ahead.

3.7.1 RTL Black Box of the Logic


On successful synthesis of the HDL code, the RTL black box (as shown in figure 3.4) is obtained. It shows all the inputs on the left and the outputs on the right.

Figure 3.4 RTL Schematic for 2 point Radix-2 FFT


3.8 Implementation of the Radix-2 Butterfly Algorithm


For further analysis of the FFT algorithm implementation, the code is synthesized in Xilinx ISE and the reports are analyzed. The level-1 view of the RTL schematic for the FFT logic is shown in figure 3.4. The synthesis report lists the options for synthesis and provides the HDL compilation reports. It also lists the design utilization details, both in numbers of blocks and as a percentage of usage. This helps to analyze the area utilization of the FPGA board used for implementation.

3.9 Developing Specifications


Every design has some specifications which are to be met during the design. Some other parameters related to the functionality and operation of the design are also available as part of its specification. The FFT algorithm implemented has the following specifications:
Points (no. of inputs): 2
Algorithm used: Radix-2
Coding scheme: Decimation-in-Frequency (DIF)
Maximum frequency of operation: 56.344 MHz
Offset in (period): 3.760 ns
Offset out (period): 2.799 ns

3.10 New Ideas for Realization of Design on FPGA

FPGAs are finding more and more applications because of the flexibility they provide in prototyping any logic and because of their re-configurability. This creates a need to find new ideas for realizing designs on FPGAs.
Limited by the FPGA board hardware, the number of inputs and outputs available is found to be very small. This could be increased by using rocker or piano type micro-switches. A 4:1 switch can be directly connected to the seven-segment display to represent the 4-bit input equivalent in hexadecimal. These hardware elements, being small, can easily be accommodated on FPGA boards, although complexity is added in providing the interconnects required to connect them to the control logic and to make them an integral part of the programmability.

Summary

The RTL model is synthesized using the Xilinx tool, targeting the Spartan 3E.

The synthesis results were discussed with the help of the generated reports.

According to the synthesis report, the area utilization and the timing details are analyzed.

CHAPTER 4
CONCLUSION
Fast Fourier Transform (FFT) techniques have revolutionized Digital Signal Processing over the past 30 years. They not only provide a fast response, but also bring many functions once thought unrealizable within practical reach; parallel processing of the FFT has accordingly been proposed for use in multiprocessor algorithms. When an FFT is implemented in special-purpose hardware, errors and constraints due to finite word lengths are unavoidable. While deciding the word length needed for an FFT, these quantization effects must be considered.
The algorithm for the 2-point radix-2 FFT is studied for implementation. The HDL coding in the DIF scheme is done and tested with test vectors generated using MATLAB. The design is synthesized for the FPGA Spartan 3E configuration and is found to have a maximum frequency of operation of
56.344 MHz. Each value x(n) is represented by two 8-bit words: 8 bits for the real part and 8 bits for the imaginary part. In each 8-bit word the most significant 4 bits represent the integer part and the least significant 4 bits represent the fractional part. The twiddle factor Wn is represented by 8 bits, of which the most significant bit represents the sign of the number and the next bit represents the integer part, while the remaining 6 bits represent the fractional part. The twiddle factor represents a sine value or a cosine value; since the sine or cosine function varies from -1 to 1, the integer part of the twiddle factor is represented by only 1 bit.
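As a small illustration of this fixed-point format (the value is chosen arbitrarily), the real part 7.25 would be stored as 0111.0100 in binary: the upper four bits hold the integer 7 and the lower four bits hold the fraction 0.25 = 4/16.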
Appendix

RADIX 2 Butterfly FFT HDL


library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_SIGNED.ALL;

PACKAGE mul_package IS                      -- user defined components
  COMPONENT ccmul
    GENERIC (W2 : INTEGER := 17;            -- multiplier bit width
             W1 : INTEGER := 9;             -- bit width of c+s sum
             W  : INTEGER := 8);            -- input bit width
    PORT (clk              : STD_LOGIC;     -- clock for the output register
          x_in, y_in, c_in : IN  STD_LOGIC_VECTOR(W-1 DOWNTO 0);   -- inputs
          cps_in, cms_in   : IN  STD_LOGIC_VECTOR(W1-1 DOWNTO 0);  -- c+s and c-s inputs
          r_out, i_out     : OUT STD_LOGIC_VECTOR(W-1 DOWNTO 0));  -- results
  END COMPONENT;
END mul_package;
LIBRARY work;
USE work.mul_package.ALL;

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_SIGNED.ALL;

ENTITY bfproc IS
  GENERIC (W2 : INTEGER := 17;              -- multiplier bit width
           W1 : INTEGER := 9;               -- bit width of c+s sum
           W  : INTEGER := 8);              -- input bit width
  PORT (clk                  : STD_LOGIC;   -- clock for the output register
        Are_in, Aim_in, c_in : IN  STD_LOGIC_VECTOR(W-1 DOWNTO 0);   -- 8-bit inputs
        Bre_in, Bim_in       : IN  STD_LOGIC_VECTOR(W-1 DOWNTO 0);   -- inputs
        cps_in, cms_in       : IN  STD_LOGIC_VECTOR(W1-1 DOWNTO 0);  -- coefficient inputs
        Dre_out, Dim_out, Ere_out, Eim_out : OUT STD_LOGIC_VECTOR(W-1 DOWNTO 0));
END bfproc;

ARCHITECTURE flex OF bfproc IS
  SIGNAL dif_re, dif_im     : STD_LOGIC_VECTOR(W-1 DOWNTO 0);
  SIGNAL Are, Aim, Bre, Bim : INTEGER RANGE -128 TO 127;             -- inputs as integers
  SIGNAL c                  : STD_LOGIC_VECTOR(W-1 DOWNTO 0);        -- cos input
  SIGNAL cps, cms           : STD_LOGIC_VECTOR(W1-1 DOWNTO 0);       -- coefficients in
BEGIN

  PROCESS                                   -- register inputs and compute D = (A + B)/2
  BEGIN
    WAIT UNTIL clk = '1';
    Are <= CONV_INTEGER(Are_in);
    Aim <= CONV_INTEGER(Aim_in);
    Bre <= CONV_INTEGER(Bre_in);
    Bim <= CONV_INTEGER(Bim_in);
    c   <= c_in;
    cps <= cps_in;
    cms <= cms_in;
    Dre_out <= CONV_STD_LOGIC_VECTOR((Are + Bre)/2, W);              -- output ports
    Dim_out <= CONV_STD_LOGIC_VECTOR((Aim + Bim)/2, W);
  END PROCESS;

  PROCESS (Are, Bre, Aim, Bim)              -- compute dif = (A - B)/2
  BEGIN
    dif_re <= CONV_STD_LOGIC_VECTOR((Are/2 - Bre/2), 8);
    dif_im <= CONV_STD_LOGIC_VECTOR((Aim/2 - Bim/2), 8);
  END PROCESS;

  ccmul_1: ccmul                            -- twiddle-factor (complex) multiplier: E = dif * W
    GENERIC MAP (W2 => W2, W1 => W1, W => W)
    PORT MAP (clk => clk, x_in => dif_re, y_in => dif_im,
              c_in => c, cps_in => cps, cms_in => cms,
              r_out => Ere_out, i_out => Eim_out);

END flex;

--------------------------------------------------------------------------------
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;

ENTITY ccmul IS
  GENERIC (W2 : INTEGER := 17;              -- multiplier bit width
           W1 : INTEGER := 9;               -- bit width of c+s sum
           W  : INTEGER := 8);              -- input bit width
  PORT (clk              : STD_LOGIC;       -- clock for the output register
        x_in, y_in, c_in : IN  STD_LOGIC_VECTOR(W-1 DOWNTO 0);   -- inputs
        cps_in, cms_in   : IN  STD_LOGIC_VECTOR(W1-1 DOWNTO 0);  -- c+s and c-s inputs
        r_out, i_out     : OUT STD_LOGIC_VECTOR(W-1 DOWNTO 0));  -- results
END ccmul;

ARCHITECTURE flex OF ccmul IS
  SIGNAL x, y, c                   : STD_LOGIC_VECTOR(W-1 DOWNTO 0);   -- inputs and outputs
  SIGNAL r, i, cmsy, cpsx, xmyc    : STD_LOGIC_VECTOR(W2-1 DOWNTO 0);  -- products
  SIGNAL xmy, cps, cms, sxtx, sxty : STD_LOGIC_VECTOR(W1-1 DOWNTO 0);  -- x-y etc.

  COMPONENT lpm_mult IS
    GENERIC (LPM_WIDTHA : INTEGER := 9;
             LPM_WIDTHB : INTEGER := 8;
             LPM_WIDTHP : INTEGER := 17;
             LPM_WIDTHS : INTEGER := 9);
    PORT (dataa  : IN  STD_LOGIC_VECTOR(LPM_WIDTHA-1 DOWNTO 0);
          datab  : IN  STD_LOGIC_VECTOR(LPM_WIDTHB-1 DOWNTO 0);
          result : OUT STD_LOGIC_VECTOR(LPM_WIDTHP-1 DOWNTO 0));
  END COMPONENT;

  COMPONENT lpm_sub IS
    GENERIC (LPM_WIDTH : INTEGER := 9);
    PORT (dataa  : IN  STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0);
          datab  : IN  STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0);
          result : OUT STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0));
  END COMPONENT;

  COMPONENT lpm_sub_17bits IS
    GENERIC (LPM_WIDTH : INTEGER := 17);
    PORT (dataa  : IN  STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0);
          datab  : IN  STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0);
          result : OUT STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0));
  END COMPONENT;

  COMPONENT lpm_add_17bits IS
    GENERIC (LPM_WIDTH : INTEGER := 17);
    PORT (dataa  : IN  STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0);
          datab  : IN  STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0);
          result : OUT STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0));
  END COMPONENT;

BEGIN
  x   <= x_in;                              -- x
  y   <= y_in;                              -- j * y
  c   <= c_in;                              -- cos
  cps <= cps_in;                            -- cos + sin
  cms <= cms_in;                            -- cos - sin

  PROCESS
  BEGIN
    WAIT UNTIL clk = '1';
    r_out <= r(W2-3 DOWNTO W-1);            -- scaling (drop 7 LSBs) and FF
    i_out <= i(W2-3 DOWNTO W-1);            -- for the outputs
  END PROCESS;

  -- ccmul with 3 multipliers and 3 adders/subtractors

  sxtx <= x(x'high) & x;                    -- sign extension of x
  sxty <= y(y'high) & y;                    -- sign extension of y

  sub_1: lpm_sub                            -- subtract: xmy = x - y
    GENERIC MAP (LPM_WIDTH => W1)
    PORT MAP (dataa => sxtx, datab => sxty, result => xmy);

  mul_1: lpm_mult                           -- multiply: xmyc = (x - y) * c
    GENERIC MAP (LPM_WIDTHA => W1, LPM_WIDTHB => W,
                 LPM_WIDTHP => W2, LPM_WIDTHS => W2)
    PORT MAP (dataa => xmy, datab => c, result => xmyc);

  mul_2: lpm_mult                           -- multiply: cmsy = (c - s) * y
    GENERIC MAP (LPM_WIDTHA => W1, LPM_WIDTHB => W,
                 LPM_WIDTHP => W2, LPM_WIDTHS => W2)
    PORT MAP (dataa => cms, datab => y, result => cmsy);

  mul_3: lpm_mult                           -- multiply: cpsx = (c + s) * x
    GENERIC MAP (LPM_WIDTHA => W1, LPM_WIDTHB => W,
                 LPM_WIDTHP => W2, LPM_WIDTHS => W2)
    PORT MAP (dataa => cps, datab => x, result => cpsx);

  sub_2: lpm_sub_17bits                     -- subtract: i <= (c + s)*x - (x - y)*c
    GENERIC MAP (LPM_WIDTH => W2)
    PORT MAP (dataa => cpsx, datab => xmyc, result => i);

  add_1: lpm_add_17bits                     -- add: r <= (c - s)*y + (x - y)*c
    GENERIC MAP (LPM_WIDTH => W2)
    PORT MAP (dataa => cmsy, datab => xmyc, result => r);

END flex;

--------------------------------------------------------------------------------
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_SIGNED.ALL;

ENTITY lpm_sub IS
  GENERIC (LPM_WIDTH : INTEGER := 9);
  -- LPM_DIRECTION      : "SUB"
  -- LPM_REPRESENTATION : "SIGNED"
  PORT (dataa  : IN  STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0);
        datab  : IN  STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0);
        result : OUT STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0));
END lpm_sub;

ARCHITECTURE behavioural OF lpm_sub IS
begin
  result <= dataa - datab;   -- signed subtraction
end behavioural;

--------------------------------------------------------------------------------
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_SIGNED.ALL;

ENTITY lpm_sub_17bits IS
  GENERIC (LPM_WIDTH : INTEGER := 17);
  -- LPM_DIRECTION      : "SUB"
  -- LPM_REPRESENTATION : "SIGNED"
  PORT (dataa  : IN  STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0);
        datab  : IN  STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0);
        result : OUT STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0));
END lpm_sub_17bits;

ARCHITECTURE behavioural OF lpm_sub_17bits IS
begin
  result <= dataa - datab;   -- signed subtraction
end behavioural;

--------------------------------------------------------------------------------
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_SIGNED.ALL;

ENTITY lpm_mult IS
  GENERIC (LPM_WIDTHA : INTEGER := 9;
           LPM_WIDTHB : INTEGER := 8;
           LPM_WIDTHP : INTEGER := 17;
           LPM_WIDTHS : INTEGER := 9);
  -- LPM_REPRESENTATION : "SIGNED"
  PORT (dataa  : IN  STD_LOGIC_VECTOR(LPM_WIDTHA-1 DOWNTO 0);
        datab  : IN  STD_LOGIC_VECTOR(LPM_WIDTHB-1 DOWNTO 0);
        result : OUT STD_LOGIC_VECTOR(LPM_WIDTHP-1 DOWNTO 0));
END lpm_mult;

ARCHITECTURE behavioural OF lpm_mult IS
begin
  result <= dataa * datab;   -- signed multiplication
end behavioural;

--------------------------------------------------------------------------------
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_SIGNED.ALL;

ENTITY lpm_add_17bits IS
  GENERIC (LPM_WIDTH : INTEGER := 17);
  -- LPM_DIRECTION      : "ADD"
  -- LPM_REPRESENTATION : "SIGNED"
  PORT (dataa  : IN  STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0);
        datab  : IN  STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0);
        result : OUT STD_LOGIC_VECTOR(LPM_WIDTH-1 DOWNTO 0));
END lpm_add_17bits;

ARCHITECTURE behavioural OF lpm_add_17bits IS
begin
  result <= dataa + datab;   -- signed addition
end behavioural;
