Vous êtes sur la page 1sur 18


FPGA vs GPU Performance Comparison on the Implementation of FIR Filters

Interfacing GPU FPGA


CPUs - Ease of programming and native floating
point support with complex and cumbersome
memory systems, as well as significant operating
.system overhead
GPUs - Fine grain SIMD processing and native
floating point with a streaming memory architecture
.and a more difficult programming environment
FPGAs - Ultimate flexibility in processing, control
and interfacing, at the extreme end of programming
difficulty and lower clock rates with only
.cumbersome floating point support

GPGPU platforms
GP - GeneralPurpose computation usingGPU
GPU - Graphics Processing Unit
CUDA and OpenCL are both frameworks for
task-based and data-based general purpose
parallel execution. Their architectures show a
great similarity. The key difference between
these two frameworks is that OpenCL is a crossplatform framework (implemented on CPUs,
GPUs, DSPs and etc.); whereas, CUDA is
supported only by NVIDIA GPUs

Memory Hierarchy
Memory hierarchy of GPGPU architectures show similarity with CPU
At the bottom level of the memory hierarchy the slowest but the
largest capacity memory type resides. This type of memory is
named as global memory in CUDA terminology. A typical global
memory is 2 or 4 gigabyte size and resides outside of the GPU chip
Global memories are usually manufactured using DRAM
Constant memory is another memory type in CUDA devices and is
optimized for broadcasting operations, so that it can be accessed
faster than global memory.
Like caches in CPU memory hierarchy, there is a faster but smaller
memory type in CUDA memory hierarchy called as shared memory.
Registers are other storage units in CUDA memory hierarchy which
are private for each thread. Registers have the smallest latency and
maximum throughput, but their amount is very limited.

Memory Hierarchy

CUDA Memory Hierarchy

Filter Overview FIR

FIR filter structure is constructed from its
transfer function and linear difference
equation which is obtained from taking
inverse Z-transform of the transfer function
of the filter

Filter Overview FIR

The output stream y(n) is calculated
by multiplying the input signal [x(n),
x(n-1), x(n-M+1)] with the
corresponding filter coefficients [b0,
b1, ,bM-1] and adding all the
multiplication results together.

Filter Overview FIR

GPU Implementation
Three different implementation techniques are
designed to compare the performance of GPUs with
the FPGAs.
Two of the designs are implemented using CUDA.
The other design is an OpenCL kernel
The first CUDA design is a nave and simple kernel
that does not include any significant optimization.
The other optimized CUDA kernel uses shared
memory and coalesce global memory accesses.
The third one is just an OpenCL version of the highly
optimized CUDA FIR filter implementation.

FPGA Implementation
Three different implementation techniques are selected
to synthesize FIR filters on various FPGAs: Direct-form,
symmetric-form, and distributed arithmetic.
It is possible to achieve massive level of parallelism by
utilizing multiplier sources of the FPGAs. Most Xilinx
FPGAs have DSP48 macro blocks embedded in their
chips. These slices have 18x18-bit multiplier units
with pre-accumulator, 48-bit accumulators and
selection multiplexers in order to speed up DSP
For the direct-form and symmetric-form FPGA
implementations Xilinxs DSP48 macro slices are utilized.

FPGA Implementation
Distributed arithmetic (DA) technique is an efficient
method for implementing multiplication operations without
using the DSP macro blocks of the FPGA. In the DA
technique, the coefficients of the FIR filter is represented in
twos complement binary form and all possible sum
combinations of the filter coefficients are stored in lookup tables (LUT). Using classical shift-adder method the
multiplication operation can be performed effectively without
using multiplier units of the FPGA. We used 4-input LUTs of
the FPGA to implement the DA form of the FIR filter structure.
We chose three different FPGAs to compare the performance
results of the FIR filters. Utilized FPGAs and their properties
are given in Table 1. Xilinx ISE v14.1 software is used to
synthesize the circuits.

GPU and CPU Performance Results of

FIR Filter Application (Million Samples
per Second)

FIR filter order has a noticeable effect on performance.
For lower order FIR filters both FPGA and GPU achieved
better performance with respect to higher order FIR filters.
Serialization due to the lack of enough multiplier units is
the main performance decrease reason for FPGAs.
Logic resource capacity of an FPGA is another limiting factor
to implement high order FIR filters
FPGAs have relatively lower prices than GPUs, yet GPUs
enjoy the ease of programmability where FPGAs are still tough
to program.
In general, FPGA performance is higher than GPU when the FIR
filter is fully parallelized on FPGA device.
However, GPU outperforms FPGA when the FIR filter has to be
implemented with serial parts on FPGA.

Parallel processing has hit
mainstream computing in the form of
CPUs, GPUs and FPGAs. While
explorations proceed with all three
platforms individually and with the
CPU-GPU pair, little exploration has
been performed with the synergy of
GPU-FPGA. This is due in part to the
cumbersome nature of communication
.between the two

.The world of computing is experiencing an upheaval

The end of clock scaling has forced developers and
alike to begin to fully explore parallel computation in
mainstream. Multi-core CPUs, GPUs and to a lesser
FPGAs, are being employed to fill the computational
gap left
between clock rate and predicted performance



Ray Bittner, Erik Ruf,Microsoft
Direct GPU/FPGA Communication
Via PCI Express