
2014 IEEE 28th Convention of Electrical and Electronics Engineers in Israel

Designing of Single Precision Floating Point DSP Co-Processor
Evgeni R. Overchick, Binyamin Abramov
Dept. Electrical Engineering
Afeka College of Engineering
Tel Aviv, Israel
ronyevgenio@mail.afeka.ac.il, binyamina@afeka.ac.il

Abstract—In this paper, we show that using an FPGA as a co-processor with floating point arithmetic can enhance DSP system performance through optimized core implementations of critical compute-intensive digital signal processing algorithms such as the Fast Fourier Transform (FFT). Our approach is based on building basic building blocks, by implementing optimized, multi-cycle floating point arithmetic cores, which are then used to implement more complex layers of logic such as the FFT butterfly, a complex multiplier and a DFT block. We present performance results showing that a speedup of 10-19X can be achieved with an optimized FFT DSP coprocessor implementation on a low cost FPGA such as the Cyclone IV.

Index Terms - DSP Coprocessor, FFT, Single Precision Floating Point Numbers, IEEE 754, FPGA.

I. INTRODUCTION

The Fourier transform converts information from the time domain into the frequency domain [16]. It has been widely applied in the analysis and implementation of video, audio and digital communication systems. Since digital communication is quite an active field, the arithmetic complexity of the Discrete Fourier Transform (DFT) algorithm became a significant factor in global computational costs. The Fast Fourier Transform (FFT) is an efficient algorithm for computing the DFT and requires fewer computations than a direct evaluation of the DFT. The FFT is the generic name for a class of computationally efficient algorithms that are widely used in the field of digital signal processing [7]. The FFT can be implemented using different techniques. Because of the complexity of the FFT processing algorithm, various FFT architectures have been proposed over the last decades to meet real-time processing requirements and to reduce hardware complexity, such as the single-memory architecture, dual-memory architecture, cached memory architecture, array architecture and pipelined architecture [17]. We implement single precision floating point arithmetic units, such as an adder/subtractor and a multiplier, and use them as building blocks for the coprocessor. The algorithm we implement is the decimation-in-time radix-2 FFT. In order to boost the performance of the coprocessor, we use in-place calculation and optimized floating point cores.

Effective hardware implementation of the FFT algorithm can be achieved through two approaches. The first, related to modern and newer techniques, is the Field Programmable Gate Array (FPGA) approach; the second is based on an Application Specific Integrated Circuit (ASIC) architecture. FPGA technologies are quite mature for Digital Signal Processing (DSP) applications [5] due to fast progress in Very Large Scale Integration (VLSI) technology. Today, FPGA devices provide fully programmable system-on-chip environments by incorporating the programmability of logic cells and the architecture of gate arrays. They consist of tens of thousands of configurable logic blocks, which makes them an appropriate solution for specialized digital signal processing applications.

The objective of this work was to obtain an area- and time-efficient architecture that could be used as a coprocessor, with all necessary resources built in, for an embedded DSP application. The basic concepts of the FPGA technique are described in detail, covering the subjects needed in the application design. The design procedure of the FFT for the FPGA is presented together with its results.

The paper is structured as follows: Section II describes related work done in the field, Section III describes the Single Precision IEEE 754 format and Section IV describes the FFT. Our implementation is discussed in Section V, Section VI describes the results in terms of logic elements (LE) and maximum frequency, and Section VII presents our conclusions.

II. RELATED WORK

Unlike CPUs, FPGAs have a high degree of hardware configurability. The dataflow nature of computation implemented in an FPGA overcomes some of the memory and bandwidth limitations that floating point arithmetic raises [1]. Much work has been done in designing coprocessor units on FPGAs in order to accelerate performance in different fields, e.g. geophysics, image processing and molecular dynamics [2][3][4]. FPGAs consist of an array of programmable logic blocks of potentially different types, including general logic, memory and multiplier blocks, surrounded by a programmable routing fabric that allows the blocks to be programmably interconnected. The array is surrounded by programmable input/output (I/O) blocks that connect the chip to the outside world.

The term "programmable" in FPGA indicates the ability to program a function into the chip after silicon fabrication is complete. This customization is made possible by the programming technology, a method that can change the behavior of the pre-fabricated chip after fabrication, "in the field," where system users create designs [18]. The use of FPGAs in signal processing offers an alternative solution for the computationally intensive functions found in DSP [5][6]. One of DSP's most widely used algorithms is the FFT. Using this transform, signals are represented in the frequency domain, where performing analysis or filtering is much less complex. When choosing among the alternative implementations of the FFT algorithm, one must carefully weigh several parameters: execution speed, hardware complexity, flexibility and precision. For real-time systems, execution speed is the main concern. Many FPGA implementations of the FFT for real-time signals use fixed point arithmetic [7][8]. The flexibility of the FPGA allows one to design application-specific floating point operators by tailoring their parameters to the application requirements [9]. Floating point arithmetic requires much more area per operation and, more importantly, it requires almost as much area for an adder as for a multiplier. It also places higher demands on memory and bandwidth capacities. A number of vendors sell IEEE 754 single precision floating point FFT cores [10][11], but their implementation details and various design choices are not discussed. An open source library that implements IEEE 754 floating point arithmetic also exists [12], but it is not optimized, and its baseline performance is low if it is used as is.

III. IEEE 754 FLOATING POINT OVERVIEW

Floating point encoding and functionality are defined in the IEEE 754 standard, last revised in 2008 [13][14]. The standard binary floating point format has three components: a one-bit sign, followed by exponent bits encoding the exponent offset by a numeric bias, and the mantissa, which encodes the significand or fraction. In order to ensure consistent computation across different platforms and to allow the exchange of floating point data, IEEE 754 defines basic formats. The Single Precision format is a 32-bit binary floating point number with the following bit lengths:

Sign – 1 bit; Exponent – 8 bits; Fraction – 23 bits

The sign is either negative or positive. The exponent field encodes the exponent in base 2 and is biased by 127 to allow exponents to extend from negative to positive. The fraction field encodes the significand without its most significant non-zero bit; an extra bit of precision comes from an implied (hidden) leading one bit, except in the special case of subnormal numbers, where this implied bit is dropped. IEEE 754 also describes infinity and Not-a-Number (NaN). Moreover, since the fraction field is described by a limited number of bits, not all real numbers can be represented exactly; this is handled by the rounding rules specified in the IEEE 754 standard.

IV. FAST FOURIER TRANSFORM

The Fast Fourier Transform (FFT) is a discrete Fourier transform algorithm which reduces the number of computations needed for N points from 2N^2 to 2N·log N, where log is the base-2 logarithm [16]. Fourier analysis converts time (or space) to frequency and vice versa. In many domains fixed point is sufficient; however, the FFT also has uses in scientific applications ranging from climate modeling to molecular dynamics and radar, which require floating point arithmetic. The fundamental calculation of the N-point DFT is described as:

Y[j] = Σ_{k=0}^{N−1} X[k]·W_N^{jk}, where W_N^{jk} = e^{−i2πjk/N}    (1)

We chose to implement the FFT using the radix-2 decimation-in-time (DIT) algorithm, where each stage operates pairwise on the data set. A complete flow graph of the radix-2 DIT decomposition of an 8-point DFT is presented in Figure 1.

Fig. 1. 8-point decimation-in-time FFT flow graph

The radix-2 algorithm was chosen because of its simplicity: it yields the smallest butterfly unit compared to other radices. Moreover, other radices reduce the total number of operations but are much more complex [17]. Also, in Figure 1, we notice that the butterfly unit is a basic operation which repeats itself at each stage, but with different inputs. Hence, our effort to optimize the process begins at the butterfly unit. The structure of the butterfly unit, shown in Figure 2, consists of a single complex multiplication and two complex additions.
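The radix-2 DIT scheme described above can be sketched as a short software model (stdlib Python; an illustrative reference, not the hardware implementation, and the function names are ours):

```python
import cmath

def fft_radix2_dit(x):
    """In-place iterative radix-2 decimation-in-time FFT.

    Mirrors the flow graph of Fig. 1: bit-reversed input ordering,
    then log2(N) stages of butterflies. len(x) must be a power of two.
    """
    n = len(x)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    x = list(x)
    # Bit-reversal permutation puts inputs in the order the DIT flow graph expects.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    # log2(N) stages of butterflies, each pairing points 'half' apart.
    size = 2
    while size <= n:
        half = size // 2
        step = cmath.exp(-2j * cmath.pi / size)  # principal twiddle factor
        for start in range(0, n, size):
            w = 1.0 + 0j
            for k in range(start, start + half):
                t = w * x[k + half]          # the single complex multiplication
                x[k + half] = x[k] - t       # lower butterfly output
                x[k] = x[k] + t              # upper butterfly output
                w *= step
        size *= 2
    return x

def dft_direct(x):
    """Direct O(N^2) evaluation of Eq. (1), for reference."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]
```

Comparing `fft_radix2_dit` against `dft_direct` on an 8-point vector reproduces the structure of Figure 1: three stages of four butterflies each, every butterfly being one complex multiplication and two complex additions.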

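Before turning to the hardware units, the Single Precision bit layout described in Section III, together with the flush-to-zero treatment of denormalized numbers adopted in Section V, can be modeled in a few lines of stdlib Python (an illustrative model with our own helper names, not the hardware description):

```python
import struct

def fields(x):
    """Split a value into IEEE 754 single precision fields.

    Layout: 1 sign bit | 8 exponent bits (bias 127) | 23 fraction bits.
    Normalized values carry an implied leading 1 in the significand.
    """
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction

def flush_denormals(x):
    """Model the design decision of Section V.A: denormalized numbers
    (exponent field == 0, non-zero fraction) are treated as zero."""
    sign, exponent, _ = fields(x)
    if exponent == 0:
        return -0.0 if sign else 0.0
    return x

# 1.0 is stored as exponent 127 (the bias) and an all-zero fraction;
# the leading 1 of the significand is implied, not stored.
assert fields(1.0) == (0, 127, 0)
# Values below 2**-126 are denormal in single precision and get flushed.
assert flush_denormals(1e-45) == 0.0
assert flush_denormals(1.5) == 1.5
```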
Fig. 2. Radix-2 decimation-in-time butterfly

V. FFT COPROCESSOR IMPLEMENTATION

We exploit the FPGA's parallelism in the FFT computation in two ways: pipelined floating point arithmetic units (the multiplier and the add/sub unit) and parallelism within the stages of the FFT.

A. Floating Point Hardware Considerations

In any operation using floating point numbers, the first task is to analyze the exponent and mantissa operands in order to determine the number type. Conversely, the last task of the operation is to compose the sign, exponent and mantissa of the result into a number. Therefore, our floating point units, the multiplication unit and the add/sub unit, are composed of three stages: the pre-processing stage, the calculation stage and the post-processing stage, as illustrated in Figure 3.

Fig. 3. Floating point operation flow

Two key design decisions were introduced. The first is the simplification of denormalized numbers, and the second is the limitation of rounding to truncation toward zero. Most of the logic needed in the pre-processing and post-processing stages is used for handling denormalized numbers, and this significant use of resources negatively affects the performance of the arithmetic units. However, floating point operation algorithms require normalized numbers, i.e. the mantissa's leading bit has to be equal to 1. Moreover, the use of denormalized numbers does not contribute to most applications, because denormalized numbers represent a small, infrequent part of the format; most of the floating point format is reserved for normalized numbers, and the denormalized value range (i.e. < 2^−126) is rarely used in most applications. Therefore, in order to simplify the pre- and post-processing stages, which leads to reduced synthesized logic and a boost in performance, denormalized numbers are handled as zero, so all related logic is eliminated. The cost of this solution is in resolution and in accuracy. However, this loss of accuracy is relative, as the resolution of a denormalized number depends on the position of its leading 1.

B. Floating Point Adder/Subtractor

In the pre-processing stage, the unit checks which input number is bigger by comparing the exponents, and then aligns the mantissa of the smaller number by the difference. In the calculation stage of the floating-point adder/subtractor unit, we use a fixed point adder/subtractor, which does not present any particular complexity, to calculate the mantissa of the result. The sign of the result is calculated in the pre-processing stage by taking into account the input signs, whether the operation is an addition or a subtraction, and which operand is bigger. The exponent of the result, which equals the exponent of the biggest operand, is adjusted during post-processing if the mantissa result presents a carry (addition) or a cancellation of its most significant bits (subtraction).

C. Floating Point Multiplication

In order to optimize the multiplication unit, we chose to implement the multiplication operation in the calculation stage using the FPGA's embedded multipliers. Nowadays, most FPGAs include a combination of on-chip resources and external interfaces that help increase the performance of DSP systems. In Cyclone IV devices (we used the EP4CE115), each embedded multiplier can be configured as either one 18x18 multiplier or two 9x9 multipliers. We used embedded multipliers to build a three-cycle 24x24 multiplier by cascading 18x18 and 9x9 multipliers.

D. Complex Multiplication

In FPGAs, complex multiplication is an expensive operation in terms of logical operations in hardware. This is also true when it is implemented using floating point arithmetic. As such, we reduce the complexity of this multiplication in the following manner:

R + jI = (X + jY) × (C + jS)    (2)
R = (C − S) × Y + Z             (3)
I = (C + S) × X − Z             (4)
where Z = C × (X − Y)           (5)

In this manner, we reduce the multiplicative complexity by calculating only three add/sub operations and three real multiplications. The twiddle factor is a number represented by C and S, and is pre-computed. Therefore, it is only necessary to store the coefficients C+S, C−S and C in a memory table. The signal flow of the complex multiplier unit can be seen in Figure 4.

Fig. 4. Complex multiplier implementation

E. DFT Unit

The DFT unit is the building block of the FFT. It is composed of the complex multiplier described previously, a twiddle unit composed of two floating point adder/subtractors, and a ROM that holds the twiddle factor coefficients calculated in advance (C+S, C−S and C). A simple state machine with only three states controls this unit. The first two states control the operation of the complex multiplier and of the twiddle unit, while the third state is the end state, where the DFT unit is on standby for another input.

F. FFT Unit

This is the main unit that performs the FFT calculation. It is composed of four parallel DFT units, two serial-in parallel-out registers and two parallel-in serial-out registers, one pair for the real numbers vector and one for the imaginary numbers vector (Figure 5). Both register sets receive eight floating point numbers (real and imaginary) represented in IEEE 754 format (32 bit). An address generator unit is in charge of providing addresses to a RAM module that holds the input/output values, and to the coefficient ROM in the complex multiplier. The FFT unit calculates the FFT in place. The first state of the FFT operation is to receive an input vector into the RAM module. The next state sends eight points of the input vector to the parallel DFT units through the memory registers. Inside the DFT unit, the first operation to be performed is the complex multiplication of the input and the pre-calculated coefficients, which are chosen using the address provided by the address generator unit of the FFT. The last step in the DFT unit is the butterfly calculation. After this calculation, the DFT results are written to the same place in the RAM module, again using the address provided by the address generator unit. In fact, the address generator unit is composed of several counters of different lengths that are used by all the control units and all the memory units, such as the coefficient ROM and the I/O results RAM. This flow repeats itself three times, according to the eight-point radix-2 FFT algorithm. At each stage in the FFT flow graph of Figure 1, the DFT units receive different input data and coefficients, but the operation remains the same. This is why our FFT unit has only four parallel DFT units. We use a control unit and in-place memory calculation in order to repeat the operation three times with different inputs and coefficients.

Fig. 5. FFT unit principal architecture

VI. RESULTS

In order to examine the proposed approach for the design of an FFT DSP coprocessor, we built a system simulating an input of a 1024-point vector of normalized numbers. The system was composed of test software, written in C#, that calculated a 1024-point vector of a sine function. The system injected the vector into a RAM unit and activated the coprocessor. The results of the coprocessor were compared to a pre-calculated FFT of the same vector. The coprocessor and the test system were simulated using ModelSim. In Figure 6, the FFT unit simulation is presented. The first signal in the simulation is the coefficient address, which changes at each stage. The next four signals are the DFT unit reset signals. The stage signal is a counter which represents the current stage, used to determine the correct address that the address generator provides to the different units (signals address_sig, addr_input_stage, counter_load). Signals d_i, d_r, q_r_out and q_i_out are the data signal vectors, each composed of 8 numbers of 32-bit length. For synthesis we used Altera Quartus, and an Altera Cyclone IV FPGA was used as the basis for the implementation of the coprocessor and the test system. The results of the experiments are presented in Table I and Table II. In Table I, the first column shows the unit name, and the next three columns give the number of cycles each unit needs in order to finish its operation and produce a result, the maximum frequency of the unit, and the number of logic elements the unit requires. In comparison to the floating point adder/subtractor instantiated from the Bishop floating point library [12], which required approximately 6000 LE and had a maximum operating frequency of 14 MHz, our FP adder/subtractor takes ten times fewer logic elements (600 LE) and has a maximum operating frequency of 140 MHz. Moreover, the coprocessor calculates the FFT at 50 MHz tenfold faster than a 3.5 GHz i7 quad-core processor, and at 130 MHz almost 20 times faster, with an error of ≈ 1×10^−7 compared to results pre-calculated in Matlab.

TABLE I
DESIGN SUMMARY OF DIFFERENT COPROCESSOR MODULES

Unit                    Cycles    Max Frequency [MHz]    LE
FP Adder/Subtractor     4         140                    600
FP Multiplier           3         250                    200
Complex Multiplication  12        140                    ~2400
FFT                     96        130                    ~18k

TABLE II
FFT COPROCESSOR PERFORMANCE

Unit          50 [MHz]    100 [MHz]    130 [MHz]
CoProcessor   1.8 [µs]    0.9 [µs]     0.6932 [µs]

TABLE III
FFT COPROCESSOR PERFORMANCE COMPARISON

FFT CoProcessor   i7 @ 3.5 GHz   i5 @ 1.7 GHz   E2160 @ 1.8 GHz   E8400 @ 3 GHz
0.6932 [µs]       10 [µs]        24 [µs]        60 [µs]           12 [µs]
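The three-multiplication complex product of Eqs. (2)-(5), and the order of magnitude of the single-precision error reported above, can be sanity-checked numerically with a short stdlib Python sketch (illustrative only; the helper names are ours):

```python
import random
import struct

def complex_mul_3(x, y, c, s):
    """Complex product (X + jY) * (C + jS) with three real multiplications,
    per Eqs. (2)-(5): Z = C*(X-Y), R = (C-S)*Y + Z, I = (C+S)*X - Z.
    In the hardware, C+S, C-S and C are the precomputed ROM coefficients."""
    z = c * (x - y)
    r = (c - s) * y + z
    i = (c + s) * x - z
    return complex(r, i)

def to_float32(v):
    """Round-trip a Python float through IEEE 754 single precision."""
    return struct.unpack('<f', struct.pack('<f', v))[0]

random.seed(0)
for _ in range(1000):
    x, y, c, s = (random.uniform(-1, 1) for _ in range(4))
    exact = complex(x, y) * complex(c, s)
    fast = complex_mul_3(x, y, c, s)
    # Same product, one real multiplication fewer than the naive four.
    assert abs(exact - fast) < 1e-12
    # Storing a value in single precision alone already limits absolute
    # accuracy to roughly 1e-7 for |v| < 1, consistent with the ~1e-7
    # FFT error reported against the Matlab reference.
    assert abs(to_float32(x) - x) < 1e-7
```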
VII. CONCLUSIONS

The present paper proposes an architecture for a DSP coprocessor for FFT calculation using floating point arithmetic. The proposed architecture provides:
• Optimized floating point arithmetic units, such as an adder/subtractor and a multiplier, with IEEE 754 single precision representation.
• An in-place 8-point radix-2 FFT using single precision IEEE 754 floating point representation.

Result analysis shows that the proposed solution can boost the performance of complex DSP algorithms. Our FFT coprocessor is 20 times faster than an i7 @ 3.5 GHz CPU. Two design decisions targeting the simplification of floating point complexity have been explained. Handling complexity requires logic resources, so as the complexity is reduced, so are the logic resources needed for an operator. Additionally, as fewer resources are used, arithmetic operator implementations become more efficient, and speed (clock frequency) is also improved. For latency-dependent applications, instead of increasing the clock frequency, we can reduce the number of pipeline stages. Our approach focused on optimizing those features of floating-point arithmetic that require a heavy processing overhead while their relevance in the format is minimal or can be neglected. Also, when implementing the higher level building blocks of the FFT, we optimized the mathematical complexity of the complex multiplier and used pre-calculated coefficients stored in memory. Moreover, the in-place FFT approach, combined with an address generator, reduces the memory requirements, simplifies the control unit of the system and allows a more efficient implementation.

Fig. 6. FFT unit simulation result

VIII. REFERENCES

[1] Underwood, Keith. "FPGAs vs. CPUs: trends in peak floating-point performance." Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays. ACM, 2004.
[2] He, Chuan, Mi Lu, and Chuanwen Sun. "Accelerating seismic migration using FPGA-based coprocessor platform." 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2004). IEEE, 2004.
[3] Thavot, Richard, et al. "Dataflow design of a co-processor architecture for image processing." Proceedings of the Workshop on Design and Architectures for Signal and Image Processing (DASIP 2008), Brussels, Belgium, 2008.
[4] Yongfeng Gu, Tom VanCourt, and Martin C. Herbordt. "Explicit design of FPGA-based coprocessors for short-range force computations in molecular dynamics simulations." Parallel Computing, Volume 34, Issues 4–5, May 2008, Pages 261-277, ISSN 0167-8191, http://dx.doi.org/10.1016/j.parco.2008.01.007.
[5] Mittal, Sparsh, Saket Gupta, and S. Dasgupta. "System generator: The state-of-art FPGA design tool for DSP applications." Third International Innovative Conference on Embedded Systems, Mobile Communication and Computing (ICEMC2 2008), 2008.
[6] Knapp, Steven K. "Using programmable logic to accelerate DSP functions." Xilinx, Inc. (1995): 1-8.
[7] Uzun, Isa Servan, Abbes Amira, and Ahmed Bouridane. "FPGA implementations of fast Fourier transforms for real-time signal and image processing." IEE Proceedings - Vision, Image and Signal Processing, Vol. 152, No. 3, 2005.
[8] Garcia, Joaquin, and Rene Cumplido. "On the design of an FPGA-based OFDM modulator for IEEE 802.16-2004." International Conference on Reconfigurable Computing and FPGAs (ReConFig 2005). IEEE, 2005.
[9] De Dinechin, Florent, et al. "An FPGA-specific approach to floating-point accumulation and sum-of-products." International Conference on ICECE Technology (FPT 2008). IEEE, 2008.
[10] Altera. "Floating Point FFT Processor (IEEE 754 Single Precision) Radix 2 Core." http://www.altera.com.cn/literature/wp/wp_fft_radix2.pdf
[11] Xilinx. "LogiCORE IP Fast Fourier Transform v8.0." http://www.xilinx.com/support/documentation/ip_documentation/ds808_xfft.pdf
[12] Bishop, David. "Floating point package user's guide." Packages and bodies for IEEE 1076-2008, 2008.
[13] ANSI/IEEE 754-1985. American National Standard - IEEE Standard for Binary Floating-Point Arithmetic. American National Standards Institute, New York, 1985.
[14] IEEE 754-2008. IEEE Standard for Floating-Point Arithmetic. August 2008.
[15] Hemmert, K. Scott, and Keith D. Underwood. "An analysis of the double-precision floating-point FFT on FPGAs." 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2005). IEEE, 2005.
[16] Cooley, J. W., and J. W. Tukey. "An algorithm for the machine calculation of complex Fourier series." Mathematics of Computation, 19:297-301, 1965.
[17] Johnson, S. G., and M. Frigo. "A modified split-radix FFT with fewer arithmetic operations." IEEE Transactions on Signal Processing, vol. 55, no. 1, pp. 111-119, Jan. 2007.
[18] Brown, Stephen, and Jonathan Rose. "Architecture of FPGAs and CPLDs: A tutorial." IEEE Design and Test of Computers 13.2 (1996): 42-57.
