
Real-Time Image Processing

A report
submitted in fulfillment
of the requirements for the degree
of
Bachelor of Computing and Mathematical Sciences
with Honours
at
The University of Waikato

by
Mark Will

Department of Computer Science


Hamilton, New Zealand
October 15, 2013

© 2013 Mark Will


Abstract

For thousands of years engineers have dreamed of building machines with vision capabilities that match their own. However, most image processing algorithms are computationally intensive, making real-time implementation difficult, expensive, and not usually possible with microprocessor-based hardware. In this project a computationally intensive image-processing algorithm is implemented, analysed, and optimized for an ARM Cortex-A9 microprocessor. The highly optimized code is then further accelerated using state-of-the-art Xilinx devices, which package an ARM Cortex-A9 and reconfigurable logic on the same chip, to offload the most computationally intense functions into reconfigurable logic, improving performance and power usage without drastically increasing development time.

Acknowledgements

I am very grateful to my supervisor Anthony Blake for providing support,
direction, and for pushing me to my limits. Without his motivation, I never
would have got through the many late nights spent working in the lab. I would
also like to thank Adam and Matt from the WAND research group, for the
daily walks to get energy drinks and ice creams, and for fancy lunch Fridays.
Finally, thanks to all my family and friends, of which there are too many to
mention, for supporting me throughout my degree.

Contents

Abstract ii

Acknowledgements iii

1 Introduction 1
1.1 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Background 4
2.1 FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.1 CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.2 GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.3 FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Image Stabilisation Techniques . . . . . . . . . . . . . . . . . . 7

3 Implementation 10
3.1 Image Stabilisation Algorithm . . . . . . . . . . . . . . . . . . 10
3.1.1 Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.2 Fast Fourier Transform . . . . . . . . . . . . . . . . . . . . 11
3.1.3 Log Polar Transform . . . . . . . . . . . . . . . . . . . . . . 12
3.1.4 Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.5 Scale and Rotation . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.1 Locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.2 Common Computation . . . . . . . . . . . . . . . . . . . . . 17
3.2.3 Intrinsics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.4 Optimised Flow Diagram . . . . . . . . . . . . . . . . . . . 20
3.3 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22


3.3.1 Xilinx Zedboard . . . . . . . . . . . . . . . . . . . . . . . . 22


3.3.2 Hardware Descriptive Language . . . . . . . . . . . . . . . . 22
3.3.3 Vivado . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.4 DDR3 AXI Interface . . . . . . . . . . . . . . . . . . . . . . 25
3.3.5 FFT Implementation . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.1 Boot Process . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.2 Driver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.3 System Call . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.4 Datapath . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4 Results and Discussion 45


4.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1.1 Intel x86 Performance . . . . . . . . . . . . . . . . . . . . . 45
4.1.2 ARM Performance . . . . . . . . . . . . . . . . . . . . . . . 48
4.1.3 FPGA Performance . . . . . . . . . . . . . . . . . . . . . . 49
4.1.4 Power Consumption . . . . . . . . . . . . . . . . . . . . . . 51
4.1.5 Image Stabilisation . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.1 Image Stabilisation . . . . . . . . . . . . . . . . . . . . . . 55
4.2.2 David vs. Goliath . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.3 Developer Time . . . . . . . . . . . . . . . . . . . . . . . . 56

5 Conclusion 57
5.1 Revisiting the Hypotheses . . . . . . . . . . . . . . . . . . . . 57
5.2 Contributions and Future Work . . . . . . . . . . . . . . . . . 57
5.2.1 Image Stabilisation . . . . . . . . . . . . . . . . . . . . . . 57
5.2.2 Software Optimisations . . . . . . . . . . . . . . . . . . . . 58
5.2.3 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 Final Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . 59

References 61

A Image Stabilisation 66
A.1 C code for computing the 1D FFT . . . . . . . . . . . . . . . . 67

B Implementation 68
B.1 Screen shot of the Zynq layout from within Vivado. . . . . . . . 69


B.2 Implemented design in Vivado IP Integrator. . . . . . . . . . . 70


B.3 Implemented design in the reconfigurable logic. . . . . . . . . . 71
B.4 Write and interrupt handler functions for the device driver. . . . 72
B.5 C code for walking page tables on ARM. . . . . . . . . . . . . 73

List of Figures

1.1 Performance of FFT code on an Asus Eee Pad TF201, which uses
the Nvidia Tegra 3 quad-core SoC. Source: [7] . . . . . . . . . . . 2

2.1 An FPGA is composed of configurable logic and I/O blocks tied


together with programmable interconnects. Source: [23] . . . . . 5

3.1 Gray scale images of a model car. . . . . . . . . . . . . . . . . . . 13


3.2 Modulus images of the computed 2D FFT for both images in Fig-
ure 3.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Log Polar transform of an image. . . . . . . . . . . . . . . . . . . 14
3.4 Splitting complex number into parts using SSE4. . . . . . . . . . 19
3.5 Dividing floats in Neon. . . . . . . . . . . . . . . . . . . . . . . . 20
3.6 Algorithm Flow Diagram . . . . . . . . . . . . . . . . . . . . . . . 21
3.7 Forcing signals to be untouched for debugging. . . . . . . . . . . . 24
3.8 DDR Controller IP. . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.9 Sample burst read command. . . . . . . . . . . . . . . . . . . . . 28
3.10 Sample burst write command. . . . . . . . . . . . . . . . . . . . . 28
3.11 FFT controller state machine for an FFT of size 2048. . . . . . . 30
3.12 Sample device tree structure for memory declaration. . . . . . . . 34
3.13 Sample device tree structure for Fast Fourier Transform (FFT)
reconfigurable logic. . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.14 Retrieving parameters in the device driver. . . . . . . . . . . . . . 37
3.15 Device driver setup script. . . . . . . . . . . . . . . . . . . . . . . 37
3.16 File operations for the driver. . . . . . . . . . . . . . . . . . . . . 39
3.17 Converting a virtual to physical address in Linux on ARM. . . . . 42

4.1 Intel x86 Performance with an array size of 2048x2048. . . . . . . 46


4.2 720p image using an array size of 2048x2048 and 1024x1024. . . . 47
4.3 Intel x86 Performance for the working array size. . . . . . . . . . 47


4.4 ARM Performance. . . . . . . . . . . . . . . . . . . . . . . . . . . 48


4.5 Software vs. Hardware: FFT computation of an array size 2048. . 49
4.6 FFT computation time for different 1D array sizes in reconfig-
urable logic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.7 Burst read operations using different sizes in reconfigurable logic. 51
4.8 Power Usage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.9 Three second video from drone, showing 3 FPS. Left is the input
and right is the output. . . . . . . . . . . . . . . . . . . . . . . . . 54

List of Tables

3.1 DDR Controller IP signal details. . . . . . . . . . . . . . . . . . . . 27


3.2 Zedboard boot from SD card jumper settings. . . . . . . . . . . . . 34

List of Acronyms

ASIC Application-Specific Integrated Circuit

AVX Advanced Vector Extensions

AVX2 Advanced Vector Extensions 2

AXI Advanced eXtensible Interface

BRAM Block RAM

CPU Central Processing Unit

DSP Digital Signal Processor

DoD Department of Defense

DoG Difference of Gaussian

FFT Fast Fourier Transform

FFTS The Fastest Fourier Transform in the South

FIFO First In, First Out

FPGA Field-Programmable Gate Array

FPS Frames Per Second

FSBL First-Stage Boot Loader

GPU Graphics Processing Unit

HDL Hardware Description Language

HLS High-Level Synthesis

I/O Input/Output

IEEE Institute of Electrical and Electronic Engineers


IRQ Interrupt Request

JVM Java Virtual Machine

MMCM Mixed-Mode Clock Manager

PGD Page Global Directory

PMD Page Middle Directory

PTE Page Table Entry

PUD Page Upper Directory

SIFT Scale-Invariant Feature Transform

SIMD Single Instruction, Multiple Data

SPI Shared Peripheral Interrupt

SSE4 Streaming SIMD Extensions 4

SURF Speeded Up Robust Features

TDP Thermal Design Power

VHSIC Very High Speed Integrated Circuit

1

Introduction

People who are more than casually interested
in computers should have at least some idea of
what the underlying hardware is like. Otherwise
the programs they write will be pretty weird.

Donald Knuth

Humans visually perceive the world around them, estimating depth, rotation and speed primarily with their eyes. Current efforts to build machines that can navigate and interact with the world as we do rely on more than just a video camera. They involve much larger equipment, often active sensors such as laser range finders and GPS units [13][31]. Systems that rely solely on image-based sensors are difficult to implement in real-time on microprocessor-based systems because of the computationally intensive nature of the algorithms employed [11][41][49][42][12][54][15]. This project explored implementing a computationally intensive image stabilisation algorithm based around the Fast Fourier Transform (FFT), the most important numerical algorithm of our lifetime [16], with the goal of achieving real-time performance.

Programmers in many applications treat the computer system as a black box, relying on the compiler to optimise their code for the underlying machine. In applications where performance is not critical, the development time spent optimising the code is too costly. However, for real-time applications, this thesis shows that the programmer should apply optimisations that are amenable to
Figure 1.1: Performance of FFT code on an Asus Eee Pad TF201, which uses the Nvidia Tegra 3 quad-core SoC. Source: [7]

the underlying machine.

Even then, software running on a microprocessor can benefit from hardware accelerators sharing the computational load, especially on low-powered devices such as mobile phones. Apple is a world leader in this area: it released the first 64-bit mobile processor in Q3 2013 and incorporates a lot of custom hardware into its chips [45]. For example, the Audience audio processing circuitry [28] was used in the iPhone 4, and also in other mobile devices such as the Samsung Galaxy S3.

Programming languages that use virtual machines are popular among many programmers. Java is an example, using the Java Virtual Machine (JVM) to run applications, which allows the same code to support many systems because it always runs on a virtual machine. This portability comes at a cost, however, with the overall performance of Java applications suffering considerably. Figure 1.1 compares the performance of different FFT implementations, and shows that the Java implementation cannot compete with performance-optimised solutions.


1.1 Hypothesis
Real-time implementation of computationally intensive image processing al-
gorithms requires the application of optimisations that are amenable to the
underlying machine, with low power systems requiring further acceleration via
customised hardware to meet power constraints. To evaluate this hypothesis,
the performance of a typical C implementation of an image stabilisation algo-
rithm is optimised by hand to remove common computation, use the memory
subsystem efficiently, and use machine specific instructions on both the x86 and
ARM architectures. This is discussed in Section 3.2 with the results presented
in Sections 4.1.1 and 4.1.2, where the optimised code is compared to the origi-
nal code. Once the algorithm has been highly optimised in software, offloading
computationally intense functions such as the FFT into hardware will further
improve the performance of the algorithm while not drastically increasing the
power usage of the device, or the overall development time. This was per-
formed using a state-of-the-art Xilinx Zynq device which couples together an
ARM Cortex-A9 and reconfigurable logic, with the implementation discussed
in Section 3.3 and results in Section 4.1.3.

1.2 Scope
The scope of this project has been limited to accommodate the short period of time available. With regards to the image stabilisation algorithm, the quality of the output image only needs to show improvement over the input image, and the camera can be treated as a fixed point in space, meaning the algorithm does not have to handle desirable movements. This is because the hypothesis is based around performance, with the quality of the output just needing to remain consistent. The hardware used for performance testing was limited to technology available at the start of the project: an early 2013 MacBook Pro, and an engineering-sample Xilinx Zedboard that uses a Zynq 7020. The hardware chosen for further acceleration was a Field-Programmable Gate Array (FPGA), because previous work described in Section 2.2 showed that its performance exceeds that of Graphics Processing Units (GPUs) for real-time applications. Only reprogrammable chips were considered, and not more advanced approaches such as an Application-Specific Integrated Circuit (ASIC) or a Digital Signal Processor (DSP), because of the time constraints.

2

Background

2.1 FPGAs
A Field-Programmable Gate Array (FPGA) bridges the gap between hardware and software by providing performance closer to that of an Application-Specific Integrated Circuit (ASIC), while having the reconfigurability of a microprocessor. FPGAs contain a finite amount of programmable logic, also known as reconfigurable logic, that can be used to implement digital circuits by applying a bitstream file to the device. This bitstream file is analogous to a compiled program in software, but where programs contain machine instructions, a bitstream contains a sequence of bits which configure circuits and logical functions. The inner workings of an FPGA are very complex, with a detailed overview given in Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation [22]; from a high level, however, FPGAs are made up of logic blocks and Input/Output (I/O) blocks, along with other built-in components such as DSPs, MMCMs and BRAMs. Figure 2.1 shows an overview of the internals of an FPGA.

2.2 Previous Work


There is a large body of work concerned with real-time image processing using a Central Processing Unit (CPU) or Graphics Processing Unit (GPU), much of which claims to give real-time performance [49][42][12][54][15][41][11][47][46]. However, there is also some work targeting FPGAs which claims real-time performance [43][34].

Figure 2.1: An FPGA is composed of configurable logic and I/O blocks tied together with programmable interconnects. Source: [23]

2.2.1 CPU

Applications such as video surveillance need to detect moving objects, and Cucchiara et al. [11] propose a system called Sakbot to do just that, while trying to eliminate object shadows and the ghosts caused by moving objects becoming stationary for a long period of time. They accomplish this using a technique called background subtraction, with easy-to-compute methods that still provide good repeatability. On a Pentium 4 with a 320x240 video stream, the performance was 9-11 Frames Per Second (FPS) (ibid.).

Pattern recognition is a common use of video processing, and Ruta et al. [41] developed a two-stage road sign detection and classification system. Unlike previous work in this area, they used colour information instead of black and white, as colours carry important information on a road sign. To detect a sign in the video, they looked for shapes and rim colours. All of their 210 sign images had a resolution of 60x60, and with a video stream of 640x480 they managed 25-30 FPS (ibid.).

Google have developed a video stabilisation algorithm [20] and a camera shutter removal algorithm [19] for use on YouTube. Both of these algorithms are computationally intense; for example, they use feature extraction to detect key points in each frame, and apply full-image warps resulting in poor cache locality. The video stabilisation algorithm can reach 20 FPS on a low-resolution video, however with wobble suppression this drops to 10 FPS. The camera shutter removal algorithm also applies video stabilisation, and is reported to get 5-10 FPS.

These are just a few examples, but there is a lot of work claiming to achieve real-time performance on a CPU, with frame rates anywhere between 5-40 FPS at resolutions ranging from 240x180 to 640x480 [49][42][12][54][15].

2.2.2 GPU

The primary purpose of a GPU is to accelerate the rendering of graphics to be displayed on a monitor, but more recently GPUs are also being utilised for general-purpose computing. They process many warps in parallel, with each warp executing the same instruction across multiple threads [29]. In order to maximise performance, GPU programmers are advised to maximise occupancy and fully saturate the GPU with threads; however, it is claimed that combining thread-level parallelism with instruction-level parallelism gives the best results in many cases [44].

A framework for creating abstracted video and images in real-time was designed in 2006 by Winnemoller et al. to be highly parallel, allowing it to run on a GPU [47]. An image flows through six different techniques in this framework before it is abstracted: feature space conversion, abstraction, luminance quantization, Difference of Gaussian (DoG) edges, colour space conversion and image-based warping. Since they were focused on real-time performance, they chose cheaper computation methods over more expensive ones. For a 640x480 video stream their framework managed 9-15 FPS on an Nvidia GeForce 6800 GT. They also tested it on an Athlon 64 3200+ CPU and got 0.3-0.5 FPS (ibid.).

Digital matting is where the background of an image is removed, leaving the foreground object, similar to how movie sets use green and blue screens for special effects. A real-time matting system using a Time-of-Flight camera and multichannel Poisson equations was proposed by Wang et al. [46] for the CPU and GPU. They split up the algorithm, performing tasks requiring flexible looping and branching on the CPU, and optimised the parallelisable parts to take advantage of the GPU. Using a 320x240 video stream they managed to achieve 32 FPS overall, however 70% of this computation, in terms of runtime, was spent on the CPU (ibid.).


In summary, highly parallel algorithms may achieve high throughput on a GPU, but because GPUs have limited gather-operation capabilities [47] and limited bus bandwidth, plus CPU and memory bottlenecks when transferring data and control between the CPU and GPU, the overall runtime can be up to 50x longer than the time taken for the GPU to compute the result [18]. It has also been claimed that GPUs offer 100x the performance of CPUs, but other more comprehensive studies put that figure at about 2.5x on average [29].

2.2.3 FPGA

Edge detection algorithms such as Canny edge detection [34] and the Sobel operator [43] have been implemented within an FPGA. Canny edge detection involves Gaussian blurring/smoothing, finding derivatives of Gaussian images, non-maximal suppression and hysteresis thresholding, which is more complicated than Sobel, where the main computation is applying two 3x3 kernels to the image. The Canny implementation runs at 264MHz and manages to process over 4000 FPS at a 256x256 frame size. The Sobel implementation managed 50 FPS with a frame size of 750x576 running at 27MHz. Even though Canny edge detection involves more computation, it was faster than Sobel. The authors of the Sobel implementation mention that adding more pipeline stages to their design would let it run at a much higher frequency, but the main reason the Canny design is so fast is its use of a line buffering technique. In summary, the performance of FPGAs exceeds that of GPUs and CPUs for image processing, which is why they were chosen for this project.

2.3 Image Stabilisation Techniques


Image stabilisation algorithms try to eliminate rotation differences between frames while also removing unwanted x-y translations. For a human, comparing two images and figuring out the rotation and translation difference is fairly trivial in most cases; for a machine it is non-trivial. Most current solutions for finding the rotation and translation between two frames either use a key point feature extraction algorithm [5][20], or use sensor data from gyroscopes [27]. Using data from gyroscopes limits the use cases, since not all devices have gyroscopes, and it makes post-processing more difficult because the gyroscope data would need to be encoded with the video. So even though gyroscope data would give an accurate rotation between frames, it is not a solution that can be widely applied.

There are two main algorithms for key point feature extraction: Scale-Invariant Feature Transform (SIFT) and Speeded Up Robust Features (SURF). These work by trying to find points in an image which are unique compared to the points around them. Finding the rotation between two frames involves comparing the key point descriptors and finding matching points between the two frames. The matching points' x-y values can then be used to find the rotation and translation between the two frames. Using key points works well and is widely used, but there are many cases where it can struggle, for example if there are not many key points, or too many similar key points. Key point methods are also heavily affected by moving objects in the scene, since these objects contain key points of their own.

There is not often going to be a huge difference between two neighbouring frames, because they are only 1/30th of a second apart (at 30 FPS), but it depends on how close the objects are to the video camera. For example, video footage from a drone contains objects far away from the camera, so the difference between neighbouring frames is not going to be that great. However, using a smartphone's front-facing video camera involves having your face very close, so any movement can cause bigger differences between frames than with the drone. Because this application is being designed more for a drone-type device, two neighbouring frames are going to be similar. Instead of using an algorithm that has been widely used for video stabilisation, like key point feature extraction, this project explores the use of a Fast Fourier Transform (FFT) based algorithm of the kind used for automatic image registration [39][50].

This type of algorithm can find the rotation and translation between two images which share a lot of the same content, and is used in applications where images need to be stitched together, like panoramas or large aerial photos. One reason why this type of algorithm might not be used for video stabilisation is that it is computationally expensive; however, since it will be implemented partially in hardware here, this should not be a big concern. Implementing this approach in hardware is also much easier than a feature extraction approach, because if the camera resolution is known then the hardware design can be static. A feature extraction approach, by contrast, would have to deal with an arbitrary number of features and try to match them in a dynamically sized tree. Furthermore, initial testing of feature extraction using custom and library implementations of SURF did not give positive results: using an artificially rotated image, the angle returned was not very accurate, which is not ideal when it is being used for stabilisation. The fact that both SIFT and SURF are patented also meant that using one of them could have been challenging.

3

Implementation

People who are really serious about software
should make their own hardware

Alan Kay

This chapter describes the implementation of the image stabilisation algorithm in software and hardware, and has been split into four parts: algorithm overview, software implementation, hardware implementation, and integration. The algorithm overview describes how the algorithm in this project works, along with the baseline implementation. The software section details the improvements and optimisations made to the baseline implementation on x86 and ARM. The hardware section details the implementation of a Fast Fourier Transform (FFT) module in reconfigurable logic, and the integration section explains how the software and hardware were connected together so that they could communicate with each other.

3.1 Image Stabilisation Algorithm


The image stabilisation algorithm implemented and optimised in this project will first be described, to give an appreciation of how computationally intense it is. The baseline implementation is described first, then the two intense transforms used, the Fast Fourier Transform (FFT) and the Log Polar transform, are explained. Lastly, how the algorithm finds the translation and rotation is described, with the algorithm's execution flow shown in Figure 3.6.


3.1.1 Baseline

The first implementation of the image stabilisation algorithm was written in C and is based on the frequency-domain algorithms described in [50] and [40]. For computing the seven 2D FFTs originally needed for this algorithm, the highly optimised The Fastest Fourier Transform in the South (FFTS) library [9] was chosen because it reports better performance than other optimised solutions, supports the C language, and supports both the x86 and ARM architectures. This first implementation was never about making the code tidy or efficient; it was to test that the algorithm could correctly find the angle and translation between two neighbouring frames.

This implementation first converts the input image from FFmpeg [38] to gray scale, reducing the size of the image in memory to one third. Currently the output is still gray scale, but it would be possible to buffer the colour image and correct it once the rotation and translation values are known. Originally the algorithm produced accurate outputs when tested on a few images, with the second image artificially rotated and translated so that the expected values were known. However, it had problems with images that were slightly blurry or had little gray scale variance, for example a close-up of a tree. To solve this, the Scharr variant of the Sobel edge detector was used in a preprocessing step to find the outlines of objects, which made the image more crisp while also reducing the amount of data that could affect the result. This improved the rotation and translation calculations quite significantly and made the algorithm more robust when working with noisy or out-of-focus images.

3.1.2 Fast Fourier Transform

The FFT is an algorithm used to speed up the computation of discrete Fourier transforms. These transforms convert time and space into the frequency domain and vice versa. The basic idea of a Fourier transform is that, given an object, it uses filters to find what the object is made of. For example, a radio wave is composed of many different sine waves; a Fourier transform can find the magnitude and phase of these individual sine waves from the original radio wave. This image stabilisation implementation needs to apply a discrete Fourier transform to an image in two dimensions and find the inverse Fourier transform. It uses the FFTS [9] library because it is highly optimised for both the x86 and ARM architectures.


Detailed explanations of the FFT can be found in [10][8], but to gain an appreciation of how computationally intense the FFT is, computing the 1D FFT of an array with a transform size of 1024 using a radix-2 implementation takes over 50,000 floating-point operations [30]. The overall complexity of a radix-2 implementation on N points is O(N log2 N), where 2 is the radix size. This is because each element in the array is accessed log2 N times while the FFT computes the two-point transform, then the four-point transform, through to the Nth-point transform, which also results in poor temporal locality. Sample code that was written and tested during this project is shown in Appendix A.1. One note: the sign parameter should be -1 for the forward 1D FFT and 1 for the inverse 1D FFT.

The 2D FFT of an NxN array can be decomposed into 2N 1D FFTs. The first step is to compute the 1D FFT for each row in the 2D array; then the 1D FFT is computed for every column of the results from the first step. The column pass has poor spatial locality, hindering the performance of the 2D FFT even more. Instead of computing on the columns, the array can be transposed, allowing rows to be computed again, but this is only a slight improvement.

An example of the 2D FFT being applied to a pair of images is shown in Figure 3.1 and Figure 3.2. The images in Figure 3.1 show cropped images of a model car, with Figure 3.1b artificially rotated by 10 degrees. The 2D FFT was computed for both images and represented in modulus form in Figure 3.2, resulting in galaxy-like images. These two modulus images look very similar; however, focusing on the two main lines that cross through the centre point of both images, the line in Figure 3.2b appears slightly rotated. It is in fact rotated by 10 degrees, the same as the image in Figure 3.1b. Thus the 2D FFT can provide very useful information for applications like image stabilisation; however, this is not actually how the rotation is found, it is just an example.

3.1.3 Log Polar Transform

The Log Polar transform is critical for the algorithm to find the angle difference between two images. It is similar to the Polar transform, in that it warps the image around the centre point by converting the Cartesian coordinates of the input image to polar coordinates in the output [3]. An example of the Log


(a) 0 Degrees Rotation (b) 10 Degrees Rotation

Figure 3.1: Gray scale images of a model car.

(a) 0 Degrees Rotation (b) 10 Degrees Rotation

Figure 3.2: Modulus images of the computed 2D FFT for both images in Figure 3.1.

Polar transform is shown in Figure 3.3, with the input image in Figure 3.3a and the output image in Figure 3.3b. The output has the original image warped in such a way that every angle exists within it, and it is this property that allows the angle between two images to be found, by finding the strongest match for the second image in the Log Polar output.

3.1.4 Translation

Using the Fourier transform to find the translation between two images relies
on the Fourier Shift theorem. This states that the phase of a defined ratio
is equal to the phase difference between two images. Therefore to find the
translation, first the ratio must be computed, as shown in Equation 3.1b,
and then the inverse Fourier transform is applied to the ratio as shown in
Equation 3.1c which will result in an array of numbers. Most of these numbers
will be zero apart from a small region around a point. This point will be the


(a) Input Image (b) Log Polar Output Image

Figure 3.3: Log Polar transform of an image.

maximum value in the array, and the index of this point gives the x,y
translation between the two images, as shown in Equation 3.1d. For
this to work properly, the images should have the same scale and rotation.
The current implementation looks at previous frames, takes an average
translation, and adjusts the current frame by that amount. This works well
when the camera is treated as a fixed point, but a more advanced approach
needs to be looked at in the future, mainly for dealing with desired movement,
for example stabilising footage from a drone.

F_x   = 2DFFT(Image_x)                            (3.1a)

ratio = (F_1 · conjugate(F_2)) / |F_1 · F_2|      (3.1b)

array = 2DFFT⁻¹(ratio)                            (3.1c)

index = max(array)                                (3.1d)

3.1.5 Scale and Rotation

Before the translation can be calculated, the scale and rotation of the two
images must first be made the same. The basic idea for finding scale and
rotation is very similar to translation; the main difference is that the images'
coordinates need to be converted from rectangular to log polar before the ratio
is calculated.


The first step is to compute the 2D FFT of both images and then take the
absolute values of the complex numbers, shown in Equation 3.2a. A highpass
filter is applied to remove any low frequency noise before the log polar values
are computed, as shown in Equation 3.2b, then the 2D FFT is applied again.
The ratio and index are then computed in the same way as in Equations 3.1b, 3.1c
and 3.1d. However, this time the index gives the angle and scale values. This
is shown in Equation 3.3 and Equation 3.4, assuming the index is into a
1D array. Currently the scale is not corrected, since neighbouring frames
should already have a very similar scale; however, this is an area which can be
explored in the future.

Fa_x  = abs(2DFFT(Image_x))                       (3.2a)

Flp_x = logpolar(highpass(Fa_x))                  (3.2b)

F_x   = 2DFFT(Flp_x)                              (3.2c)

angle = (180 · index) / (height · width)          (3.3)

b     = 10^(log10(width) / width)
scale = b^(index mod height)                      (3.4)


3.2 Software
3.2.1 Locality

Once a baseline implementation was established, the code was rewritten from
scratch to make efficient use of the machine's cache. Cache access has lower
latency than main memory, which reduces stalls when accessing memory with
poor locality [25]. Memory locality can be broadly divided into two types:
temporal and spatial.

Temporal locality is where data is reused within a short period of time, and
applies heavily to loops, where a section of memory is read, processed and
stored, with the process repeating shortly after. During the processing
stage the cache is being used, while the other two stages involve the
physical memory. Therefore, for the best performance, as much computation
as possible should be performed on a section of data at once, by combining it
into one loop. For example, applying the Sobel filter, normalising, and
converting the data to complex numbers are all performed in a single loop
instead of three. This reduces memory accesses by approximately two thirds
because the cache is fully utilised. All the functions in between the FFTs were
able to be combined into one function apart from the Log Polar transform,
which will be explained later.

The idea behind spatial locality is that if one address is accessed, its
neighbours are likely to be accessed soon after. When an address in memory has
just been accessed, the likelihood of the next access being to a neighbouring
address is high, so there is a better chance of the address already being in a
cache line, reducing the number of cache misses and in turn the amount of
physical memory access, increasing overall performance. The standard
row-column method of computing the 2D FFT has poor spatial locality because it
accesses columns of the array, which is why the FFTS library transposes the
data when the columns are required. Another function in this algorithm
with poor spatial locality is the Log Polar transform. It reads data
sequentially, but because of the nature of the transform, where the data is
wrapped around itself, the output is non-sequential, giving very poor spatial
locality which, like the 2D FFT, is difficult to avoid. The remaining functions
have good spatial locality because they access and store data sequentially.


3.2.2 Common Computation

This algorithm is computed over many frames, so factoring out computation
common to all frames can increase performance. The computation time for the
Log Polar transform was greatly improved by computing the output indices only
once, storing them in an array, and reusing them for every frame. This
improved performance because each output index requires computationally
intense power and trigonometric functions [26], which take more instructions
and CPU time than fetching the index from cache or memory. However, this
approach cannot be applied to all computations; there is a trade-off in
storing pre-computed results in memory because of the access time, so even if
a few instructions are common (give the same output), recomputing them each
time can still be faster than fetching them from memory.

When computing the algorithm on the current frame, the previous frame
requires a lot of computation before the two frames can be compared. However,
the previous frame has already been through all the required computation,
so there is no point in repeating it. Therefore, during stages of the
algorithm the current frame's frequency representation is cached, so it can be
used when comparing against the next frame. This reduces the number of
2D FFTs from seven to five, along with reducing the use of other functions.
Buffering the previous frame's data does have a small cost though: the angle
and translation values are now computed against the unfiltered previous frame
instead of the filtered one. This means all the previous values need to be
combined so the current angle and translation are known, which does not affect
performance, but the quality of the output can be slightly impacted in some
cases with regards to translation.

3.2.3 Intrinsics

Modern CPUs have a hardware performance feature which programmers can take
advantage of through vector intrinsics. Intrinsics are simply a way of
utilising low-level SIMD instructions without having to write assembly,
because supported compilers have intimate knowledge of them. These
instructions compute the same operation on multiple words at once, greatly
reducing the number of instructions the CPU has to execute. The downside of
using intrinsics is that it makes the code far less portable across

multiple architectures, and even CPU generations, because they are very
hardware specific. To solve this problem, C macros were used within the code
so that different intrinsics can be swapped out by simply changing a header
file, and compiler macros allow different versions of the code to be built
depending on the available intrinsics.
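The header-switching approach can be sketched like this; the macro names and the small add_arrays kernel are illustrative, not the project's actual header. One macro vocabulary is defined per backend, selected at compile time, with a scalar fallback so the same kernel source always builds.

```c
#if defined(__AVX__)
  #include <immintrin.h>
  #define VEC_COUNT 8
  typedef __m256 vec_t;
  #define VEC_LOAD(p)     _mm256_loadu_ps(p)
  #define VEC_ADD(a, b)   _mm256_add_ps(a, b)
  #define VEC_STORE(p, v) _mm256_storeu_ps(p, v)
#elif defined(__ARM_NEON)
  #include <arm_neon.h>
  #define VEC_COUNT 4
  typedef float32x4_t vec_t;
  #define VEC_LOAD(p)     vld1q_f32(p)
  #define VEC_ADD(a, b)   vaddq_f32(a, b)
  #define VEC_STORE(p, v) vst1q_f32(p, v)
#else                       /* portable scalar fallback */
  #define VEC_COUNT 1
  typedef float vec_t;
  #define VEC_LOAD(p)     (*(p))
  #define VEC_ADD(a, b)   ((a) + (b))
  #define VEC_STORE(p, v) (*(p) = (v))
#endif

/* The same kernel source compiles against whichever backend exists. */
void add_arrays(const float *a, const float *b, float *out, int n) {
    int i = 0;
    for (; i + VEC_COUNT <= n; i += VEC_COUNT)
        VEC_STORE(out + i, VEC_ADD(VEC_LOAD(a + i), VEC_LOAD(b + i)));
    for (; i < n; i++)      /* scalar tail for the remainder */
        out[i] = a[i] + b[i];
}
```

Swapping backends then touches only the macro definitions, leaving the kernels untouched.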

As with writing the functions to have good cache locality, not all functions
can be improved with intrinsics; it is possible to actually make a function
perform worse than with scalar instructions. A vector of size N does
not give N times the performance, especially when dealing with
complex numbers and pixel data where the data is interleaved, because getting
the data into the correct positions in the vectors is not always easy, and can
require more instructions than the scalar version. Once again the
Log Polar transform was a good example of this: because of its non-sequential
output, using intrinsics did not improve performance, but instead made it
worse.

Intel SSE4 and AVX

On Intel, Streaming SIMD Extensions 4 (SSE4) and Advanced Vector Exten-
sions (AVX) were the first to be implemented in this project. SSE4 operates
on 128-bit vectors, while AVX widens this to 256 bits, allowing eight 32-bit
floating-point values to be processed in parallel. These extensions do not
support strided loads or stores, also known as vector scatter/gather, which
allow memory to be accessed using a step size; for example, filling a vector
with the red values from an RGB pixel array would require a step size of three
and only one instruction. Therefore, getting the complex numbers into the
correct format involved using shuffles, blends, permutes, and unpacking of
high and low words. The other issue is that the 256 bits are actually made up
of two 128-bit lanes, meaning data in one lane cannot easily be moved to the
other, which often meant that a vector of
the real or imaginary parts could not be kept in sequence. One example of
splitting complex numbers into their parts is shown in Figure 3.4, which loads
two vectors and uses the shuffle instruction to extract the parts. The shuffle
mask for the real vector is 10001000₂ (0x88), where each pair of bits selects
a word within a 128-bit lane: the low two fields select the 0th and 2nd words
from the first vector's lane, and the high two fields select the 0th and 2nd
words from the second vector's lane. The same selection is applied to the
other 128-bit lane of both vectors.

# define _VEC_COUNT 8
# define __VEC __m256
# define _VEC_FLOAT_LOAD (a, os)          _mm256_load_ps (a + os)
# define _VEC_FLOAT_SHUFFLE (a, b, mask)  _mm256_shuffle_ps (a, b, mask)

__VEC v0 , v1 , re , im ;
v0 = _VEC_FLOAT_LOAD ( input , index ) ;
v1 = _VEC_FLOAT_LOAD ( input , index + _VEC_COUNT ) ;
re = _VEC_FLOAT_SHUFFLE ( v0 , v1 , 0x88 ) ;
im = _VEC_FLOAT_SHUFFLE ( v0 , v1 , 0xDD ) ;

Figure 3.4: Splitting complex numbers into parts using AVX.

This results in the real vector containing {R0, R1, R4, R5, R2, R3, R6, R7}.
So these extensions are not straightforward, and a lot of knowledge is
required to use them efficiently. To help with this problem, the latest Intel
family, Haswell, added a new instruction set extension, Advanced Vector
Extensions 2 (AVX2), which supports gather operations that allow a complex
number to be easily split into real and imaginary vectors. However, AVX2 has
not been used in this project because the available hardware does not
support it.

ARM NEON

The ARM NEON general-purpose Single Instruction, Multiple Data (SIMD)
engine supports computing an instruction on four 32-bit floating-point values
in parallel, half that of AVX, but allows strided load and store operations up
to a stride length of four. Therefore, splitting a complex number into its
real and imaginary parts on NEON involves the single vld2q_f32 instruction,
where the 2 is the stride length, which loads data into a float32x4x2_t
type. This contains two 4x32-bit vectors, with the data loaded alternately
into each vector, resulting in one containing the real parts and the other the
imaginary parts. AVX requires four instructions to process eight complex
numbers, where NEON uses only one instruction to process four complex numbers;
this appears to be better, but with the raw performance Intel has over ARM,
using AVX is still faster. An interesting feature of NEON is that it does not
have a divide intrinsic; instead, a reciprocal estimate is refined to the
required accuracy using the Newton-Raphson method, as shown in Figure 3.5,
giving the programmer more choice


between performance and accuracy.

# define _VEC_FLOAT_DIV (a, b) vec_neon_div (a, b)

inline float32x4_t vec_neon_div ( float32x4_t a , float32x4_t b ) {
    float32x4_t reciprocal = vrecpeq_f32 ( b ) ;
    reciprocal = vmulq_f32 ( vrecpsq_f32 ( b , reciprocal ) , reciprocal ) ;
    reciprocal = vmulq_f32 ( vrecpsq_f32 ( b , reciprocal ) , reciprocal ) ;
    return vmulq_f32 ( a , reciprocal ) ;
}

Figure 3.5: Dividing floats in NEON.

3.2.4 Optimised Flow Diagram

The optimised image stabilisation algorithm is shown in Figure 3.6, with
combined functions grouped by red boxes, and the arrows labelled with the
output from the last function into the input of the next. Most of these
arrows show the array size at that point, to convey how much memory this
algorithm requires; a size of 2048x2048x2 means a complex number array of
size 2048x2048. This optimised algorithm only uses five 2D FFTs because it
buffers results from the previous frame, which is fewer than other 2D FFT
image stabilisation approaches [50] [32]. The reason that the original image
is corrected for the angle and then fed through the Sobel filter again is for
future work, to fix the missing parts of the image after rotation by looking
at other frames; in addition, the Sobel, Normalise and Shift functions are not
very computationally intense.


[Figure 3.6 flow: the 1280x720 input frame is converted to gray scale, passed
through the Sobel filter and the Normalise and Shift step into a 2048x2048x2
complex array, then through a 2D FFT whose result is buffered for the next
frame. One path computes Absolute, High Pass, Log Polar, a second 2D FFT
(also buffered for the next frame), Ratio, Inverse 2D FFT, Absolute, and Find
the Angle, after which the angle is corrected. The corrected frame is fed
through Sobel, Normalise and Shift, and a 2D FFT again, then Ratio, Inverse
2D FFT, Absolute, and Find Translation, before Correct Translation produces
the 1280x720 output frame.]

Figure 3.6: Algorithm Flow Diagram


3.3 Hardware
3.3.1 Xilinx Zedboard

The platform chosen for this project was the Xilinx Zedboard, which is
populated with a Zynq 7020 device integrating a dual-core ARM Cortex-A9 and
Artix-7 class Field-Programmable Gate Array (FPGA) fabric on the same die,
allowing the communication between the processing system and reconfigurable
logic to have low latency. The 28 nm chip is the first of its kind, and the
actual board used was an engineering sample, with production grade boards
only becoming available in Q4 2013. There was little documentation and few
examples available, so many issues had to be solved with debugging.

3.3.2 Hardware Descriptive Language

A Hardware Description Language (HDL) is used to describe the structure,
design and overall operation of electronic circuits. The most commonly used
HDLs in industry are Verilog [14] and the Very High Speed Integrated Circuit
(VHSIC) HDL [21], with VHDL used in this project. The U.S. Department of
Defense (DoD) originally developed VHDL in 1981 because it needed a way to
document the behaviour of Application-Specific Integrated Circuit (ASIC)
components. At the time the DoD was using the Ada programming language,
which meant VHDL had to be syntactically similar to avoid the need to
recreate concepts, resulting in VHDL being heavily based on Ada with regards
to concepts and syntax. VHDL was later transferred to the Institute of
Electrical and Electronics Engineers (IEEE), who finalised a standard in
1987, called VHDL-87. There have been many revisions and improvements to VHDL
over the years, with the most recent in 2008.

3.3.3 Vivado

This project evaluated the new Xilinx Vivado Design Suite for developing and
building projects for the Xilinx Zedboard. Victor Peng, the Senior Vice
President of the Programmable Platforms Group at Xilinx, claims the design
suite "will provide a 4x boost to productivity and attacks the major
bottlenecks in programmable systems integration and implementation. We
believe it will dramatically boost customer productivity." This is a bold
statement, and without knowing what development stages and design suites
Peng used to get the 4x


boost, it is hard to verify, but as this report shows in the discussion section,
there is definitely an improvement in the overall productivity for getting a
design from the planning stage to the implementation stage.

Unification

In the previous Xilinx suite, there were a number of separate programs to
perform different tasks; with Vivado, Xilinx has combined most of these into
one tool. This made designing a project much easier, because there was no
need to open different applications to perform tasks on the same design,
which often caused errors, especially with mismatched device configurations.
That said, Vivado has a large number of bugs, most of which are connected to
the graphical interface, so it is far from perfect at this stage.

Simulation

When designing and building HDL projects, simulation is a quick way to verify
that a design works, and makes it easy to find errors by analysing waveforms.
A simulation tool is built directly into Vivado, but it is recommended to use
a more advanced tool, for example ModelSim by Mentor Graphics, because it
can simulate the logic more accurately. However, simulation was not used
heavily in this project, because most of the components needed to interact
with the hardwired components on the Zedboard, which simulation tools cannot
model accurately. Some HDL components were briefly tested with ModelSim
just to make sure they were behaving in the correct manner, but they still
needed to be primarily tested on the physical board.

Debugging

Testing designs on the physical board is easy, unless something goes wrong.
Unlike software, where it is simple to use debugging tools or insert fprintf
statements, hardware does not have an easy method of debugging. The eight
LEDs on the Zedboard can be used like an fprintf statement to debug small
issues, but this method is very limited and time consuming. Instead, a debug
core has to be added to the design and connected to the required signals and
clocks, so a snapshot of them can be taken when a condition is true. Before
Vivado, this involved one tool to add the debug core and connect up the
signals, then another tool to program the device and set up the capture
condition, before the waveforms could be viewed. The worst part of this
process was that the signal names were removed, so another file had to be
imported into the waveform viewer to know which signals were which.

Vivado has simplified this process: at the synthesis stage the debug core
can be added and connected to signals using a wizard, then once the bitfile
is built, the device can be programmed and probed with the built-in Hardware
Session manager. Unfortunately, adding a debug core to the design in this
project increased the build time by over 20 minutes, making debugging a
painful process and causing a lot of long nights. The only other limitation
of the current version is that signals often get removed or renamed during
synthesis, meaning finding the required signal can be annoying. There is a
way around this however, as shown in Figure 3.7, where the VHDL code tells
the synthesiser not to modify the defined signals; this should only be used
for debugging, not the final design, because it limits the number of signals
the synthesiser can optimise.

attribute KEEP : string ;
attribute S    : string ;

attribute KEEP of < signal_name > : signal is " true " ;
attribute S    of < signal_name > : signal is " true " ;

Figure 3.7: Forcing signals to be untouched for debugging.

High-Level Synthesis

Along with Vivado, Xilinx has also released a separate tool called Vivado
High-Level Synthesis (HLS), which is still in development and converts C or
C++ into HDL. This tool was evaluated during this project to see if it could
improve development time and allow ideas to be explored faster. Unfortunately,
it does not work on large designs like the image stabilisation algorithm,
meaning the source code needs to be split up into smaller components. The
documentation states that Vivado HLS can infer an FFT core and other OpenCV
functions, however these are only at the beta stage, and Xilinx would not give
the university early access. This meant that only the Sobel filter could be
fully tested.


When writing the Sobel filter in software, the naive approach is to apply
the 3x3 kernel to the array directly while looping over it. However, when
translating this to hardware, the HDL produced is very inefficient, because
the image has to be stored in block RAM, meaning only one value can be
accessed each clock cycle. To solve this, the C code must be written with
hardware in mind, so in this case the data should be thought of as streaming
through the Sobel filter, instead of looping over the memory. Since each
pixel requires its neighbours, pixels need to go into a delay buffer as they
are streamed in, until they are required. When this is converted to HDL it
is much more efficient, with the number of clock cycles equalling the number
of pixels plus the delay buffer's size. Therefore, for Vivado HLS to produce
good HDL, the input source code must be written by someone with knowledge of
hardware, who would probably prefer to write directly in HDL; so for this
tool to be widely used, it needs to become smarter so that standard software
programmers can produce efficient HDL.

3.3.4 DDR3 AXI Interface

The reconfigurable logic on the Zedboard has 140 36Kb Block RAM (BRAM)
resources [53], with 560 KB of space available in total. When dealing with
large images, a lot of space is required to buffer one entire image, and the
BRAM resources are not dense enough. For example, a gray scale image (using a
byte per pixel) with a resolution of 2048x2048 would use a total of 4 MB of
space. This means the reconfigurable logic would only be able to store a
couple of rows of the image in each block RAM at a given time, and in total
approximately 1/6th of the full image. Streaming algorithms, like a standard
3x3 Sobel filter, do not have a problem with this limitation, but for
algorithms like the 2D FFT, which require the entire image, other techniques
must be considered. This means off-chip storage is required to buffer images
and other large pieces of data.

On the Zedboard there are two DDR3 components, each with 256 MB of space
and a data width of 16 bits. These are combined into one DDR3 interface by
the built-in memory controller, giving a total of 512 MB of space and a data
width of 32 bits. The processing system has full control over the built-in
memory controller, and the reconfigurable logic has four high performance
Advanced eXtensible Interface (AXI) 3 slave ports for reading and writing the
DDR3 memory. The current design needs the processing system to initialise
the memory controller, whether on boot or by a bare metal Zynq application,
because if it is not enabled by software, the reconfigurable logic cannot
access memory properly. This resulted in days' worth of debugging, because
the design appeared to be working and the data was being read correctly, yet
a few of the control signals from the memory controller never changed when
they should have.

The high performance AXI 3 ports allow bursts of 16x64-bit words, which
increases performance. Without burst mode, a number of clock cycles are used
to set up the request, one clock cycle to receive or transmit the data, and
finally a few more clock cycles to acknowledge the completion of the request,
so many clock cycles are spent on control signals and just one on the actual
data. With bursting, the control signal cost remains the same, but now
approximately 16 clock cycles are used to receive or transmit data
continuously from the given address, reducing the overall number of clock
cycles needed to read or write a large amount of data. There is a newer
version of AXI, version 4, which has been defined for a couple of years and
allows bursts of up to 256x64-bit words, but because the Zedboard project had
already been established, it was left at version 3.

An AXI interface for the FFT IP was implemented in VHDL for this project and
converted to a Vivado IP component, shown in Figure 3.8 and described in
Table 3.1. Creating an IP block with the AXI interface allows it to be
connected to other components quickly using the Vivado IP Integrator, instead
of manually wiring up components, which can be very time consuming. The bulk
of the VHDL is made up of many smaller component modules and state machines,
which makes it more stable and able to support a wider range of
configurations. It can be used with AXI version 3 and 4 slave ports, making
it more universal across devices, and includes a simple control interface so
other components and state machines can use it to access the DDR3 memory.
Sample signals are shown in Figure 3.9 and Figure 3.10 for burst read and
write respectively. The timing of rvld and wnext can vary, and at the start
of a command there is usually a delay between the first and second data words
being read or written, so it is important for these signals to be handled
appropriately.


Figure 3.8: DDR Controller IP.

Name            Width  Direction  Description

m_axi_clk       1      In         Rising-edge clock.
m_axi_aresetn   1      In         Active-low synchronous reset. Two cycles
                                  are required for the IP to fully reset.
ddr_ctrl_cmd    1      In         Command type signal. 1 = Write, 0 = Read.
ddr_ctrl_burst  1      In         Enable burst mode of size 16.
                                  1 = Burst, 0 = Non-burst.
ddr_ctrl_start  1      In         Start the command on high tick.
ddr_ctrl_addr   32     In         Read or write physical address. For
                                  bursting, this is the start address.
ddr_ctrl_wdata  64     In         Write data.
ddr_ctrl_done   1      Out        Active-high on command completion or idle.
ddr_ctrl_rdata  64     Out        Read data.
ddr_ctrl_rvld   1      Out        Active-high read data valid.
ddr_ctrl_rlast  1      Out        Active-high read data last.
ddr_ctrl_wnext  1      Out        Active-high next write data required. The
                                  next clock cycle will use the write data
                                  signal.
md_error        1      Out        Active-high AXI master detected error. This
                                  is a sticky bit, and is only cleared on
                                  reset.
m_axi           -      Inout      Master AXI bus interface. Needs to be
                                  connected to a high performance AXI slave
                                  interface, or to an AXI interconnect.

Table 3.1: DDR Controller IP signal details.


[Figure 3.9 waveform: with ddr_ctrl_cmd low (read) and ddr_ctrl_burst high,
ddr_ctrl_start is pulsed while address A1 is held on ddr_ctrl_addr; sixteen
words R0-R15 then appear on ddr_ctrl_rdata with ddr_ctrl_rvld asserted for
each, ddr_ctrl_rlast marking the final word, and ddr_ctrl_done asserted on
completion.]

Figure 3.9: Sample burst read command.

[Figure 3.10 waveform: with ddr_ctrl_cmd high (write) and ddr_ctrl_burst
high, ddr_ctrl_start is pulsed while address A2 is held on ddr_ctrl_addr;
sixteen words W0-W15 are presented on ddr_ctrl_wdata as ddr_ctrl_wnext
requests each one, with ddr_ctrl_done asserted on completion.]

Figure 3.10: Sample burst write command.


3.3.5 FFT Implementation

Implementing the entire image stabilisation algorithm in reconfigurable logic
would have been costly in terms of area and development time. It would take
a long time to implement each part of the algorithm in VHDL, then a controller
and datapath would have to be designed and built to connect the components
together, while guaranteeing the input and output signals work at 30 Frames
Per Second (FPS). Finally, on top of all that, the tools would not be able to
fit the design into the reconfigurable logic automatically, so the design
would have to be laid out by hand, which is very time consuming and rare in
industry. An alternative approach is to implement only the FFT function in
reconfigurable logic, with the software able to use the hardware to compute
the FFT.

Once again, the time requirements of this project meant that developing an
FFT core from the ground up would not have been possible, so the Xilinx FFT
Logicore [52] was used instead for computing the FFT. This core is highly
configurable, with many performance versus area options available. Because
this project is about raw performance, the Radix-4 burst IO butterfly
architecture [48] was selected, which allows for parallel computation, along
with a 4-multiplier structure using the XtremeDSP slices, meaning the
butterfly stages are implemented with DSPs instead of slice logic.

The core has three data buses: the configuration bus, which allows the
transform length of the FFT to be set, and the data input and output buses.
The input bus has four signals: a 64-bit wide data signal for the complex
number; the data last signal, which tells the core this is the last complex
number and starts the computation; the data valid signal, which is set high
when the input data is valid; and the ready signal, which is an output of the
core used by the controller to know when the core is ready to accept data.
The output bus has the same signals in the opposite direction, apart from the
ready signal.

Another controller was developed to drive the FFT core and handle reading
and writing data from memory using the AXI interface controller; the state
machine for this controller is shown in Figure 3.11. Note that the *
character marks the lowest priority condition, and the current state
transitions to the next state on the rising edge of each clock cycle. The two
counters that count to 128 are actually counting bursts of complex numbers:
128 x 16 (the burst size)


[Figure 3.11 shows the controller's states, IDLE, INIT READ, READ, READ NEXT,
WAIT FFT, OUT FFT, WRITE, WRITE WAIT, NEXT WRITE and DONE, with transitions
driven by fft_ctrl_start, ddr_ctrl_done, the read and write counters reaching
128, and the FFT core's m_tvalid and m_tlast signals.]

Figure 3.11: FFT controller state machine for an FFT of size 2048.

equals the size of the data array in the figure, 2048. Currently the
controller uses a fixed transform length to increase performance; in the
future this could be made dynamic, which would involve using the Xilinx FFT
Logicore's configuration bus.

The controller burst reads the data into the FFT core, and when the last
complex number is being read, the last signal on the FFT core's input is set
high, which starts the execution. Once the FFT is computed, the output is
burst into a First In, First Out (FIFO) queue implemented using a block RAM
configured as a first-word fall-through FIFO; as this happens, the burst
write stage begins, which pops data from the queue. Since both the AXI DDR
controller and the FFT core run at the same clock frequency, and because the
state machine has a delay state to account for the clock delay in writing
data to the block RAM (so the write stage does not start for a few clock
cycles), the correct data is written back to memory. To make the controller
act in three separate stages, the tlast condition for moving from WAIT FFT to
OUT FFT can be set to 1, which can make debugging easier. This state machine
only shows the basic execution sequence for the controller; there is far more
logic involved.


To create a state machine within reconfigurable logic, three process blocks
are required for an efficient solution. The main process block is driven by a
change in state, and defines what each signal should be set to for each state.
This involves one large case statement, and it is important to include all the
signals that any state requires in order for the tools to infer the state
machine correctly. Quite often some states do not actually set any signals;
they are used simply as a clock delay, or to wait for a condition to occur,
like the WAIT FFT state in Figure 3.11. Using the when others case allows a
default state to be defined, which makes the case statement cleaner and
reduces the time needed to develop and modify the state machine.

Calculating the next state involves a second process block, also driven by a
change in state, with a case statement similar to the first. However, instead
of defining signals, it sets the next state signal, which is the state for the
next clock cycle. It is also good practice to set the next state to the
current state before the case statement, to guarantee the next state is always
defined. Finally, in the third process block, the state signal gets set to the
next state signal, synchronously to the clock's rising edge. Thus when the
state signal is set on a clock tick, it triggers the other two process blocks
to execute. The controller also has process blocks to calculate the addresses,
control the FIFO queue, and handle the data, so even though the state machine
seems simple, in reality the controller is quite complex.

Once the controller was implemented, there needed to be a way for the
addresses and data length to be set by the software. The best way to achieve
this is to create a register file in the reconfigurable logic, connected to
a General Purpose Master AXI interface (M AXI GP0). This GP0 interface is
routed into the central interconnect within the processing system on the Zynq
chip as shown in Appendix B.1; after setting the base address of the
registers in Vivado, the software is able to write to that address and
therefore to the registers. The register component implemented in this
project has 5x32 bit registers for the source address, destination address,
length, a trigger and one extra for possible future use. The trigger register
is used to tell the controller to start computing the FFT. The value written
to the register does not actually matter; instead the start signal is
connected to the trigger register's bit in the 5 bit register select signal.
This 5 bit signal controls which register the data is written to, and only
one bit can ever be set at a time, therefore by wiring the start signal to
the correct bit, whenever the software writes to the trigger


register, the FFT will be computed. When the computation is finished, the
software needs to know so it can continue executing. The spare register could
be used so that the software reads from it, checking for a completion flag;
however this would require constant polling and would affect performance.
Instead an interrupt signal is used, because it allows the software to be
notified instantly with very little extra computation on the software side.
The register file component handles this interrupt, and has an output signal
defined as an interrupt signal that is connected to the Interrupt Request
(IRQ) handler on the processing system, shown in Appendix B.1. From within
this window the IRQ number can also be set, which will be needed for the
integration section.

Three components created in this project have been converted into Vivado IPs,
using the IP packager so they can be added to the Vivado IP Integrator tool:
an AXI DDR3 controller, the FFT controller and an FFT configuration
component. These IPs were connected together, along with other standard IPs,
to create the overall design, which is shown in Appendix B.2. The Zynq
Processing System has two external buses for other I/O and the DDR3 bus,
with the pins defined in a default design constraints file for the Zedboard.
Two of the output pins of the Processing System drive the reconfigurable
logic: the not-reset signal, which is connected to the reset IP, and the
fixed 100MHz clock signal. In order for a faster clock to be used in the
reconfigurable logic, the 100MHz signal must be input into a Mixed-Mode
Clock Manager (MMCM) [51], a built-in resource found in a clock management
tile, which allows the generation of different clock frequencies and phases.
The design currently runs at 165MHz; faster speeds are possible, however the
design then fails timing, and finding the part causing the issue is not
easy, even with the tools supposedly reporting whereabouts the problem is.

When the design has been run through synthesis and place and route, a floor
plan of the design is produced, shown in Appendix B.3, which visualises the
implemented design on the actual die. The floor plan makes it clear how much
area is required just to compute the FFT, represented by the aqua colour,
and how much area all the other logic needs just to control the design.


3.4 Integration
The operating system used in this project was Linaro 13.04, which has been
designed for the ARM architecture by Linaro, a non-profit organization [1].
This operating system is fully open source, which allowed the Zedboard
community to make modifications in order to get it running. There are many
tutorials available online that supposedly document the process of getting
Linaro running; however few work, and those that do still require further
changes to work properly. A lot of very important details were missing from
these tutorials, such as the actual boot sequence, device trees, U-Boot
customisations and how to go about making changes to the Linux kernel. This
section summarises these differences while describing the process of
communicating with custom hardware on the Zedboard.

3.4.1 Boot Process

Understanding how a Linux kernel boots on the Zedboard is critical, because
it is during this stage that all the customisation in the reconfigurable
logic and kernel takes effect. A factory-programmed ROM is responsible for
the first stage of the boot process, and this cannot be edited by the user.
This stage partially initializes the system and performs some checks,
including determining which device to boot from based on the jumper
settings; the jumper settings for booting from the SD card are described in
Table 3.2.

There need to be at least two partitions on the SD card: the first needs to
be approximately 40MB in size and is used primarily for the boot process;
the second can take up the remaining space and is used for the file system
if the operating system needs one. The boot partition must contain a
First-Stage Boot Loader (FSBL) file which the ROM program will execute. This
FSBL file is combined into a single binary with the reconfigurable logic
definition file (.bit) and then a U-Boot executable, so straight after the
FSBL executes, the reconfigurable logic gets configured, before the Zedboard
tries to load an operating system. The reconfigurable logic has to be
configured second because it can contain important devices, such as graphics
and audio, that the operating system requires.


Jumper Mode Connection


JP7 Mode0 GND
JP8 Mode1 GND
JP9 Mode2 3V3
JP10 Mode3 3V3
JP11 Mode4 GND

Table 3.2: Zedboard boot from SD card jumper settings.

memory {
    device_type = "memory";
    reg = <0x00000000 0x20000000>;
};

Figure 3.12: Sample device tree structure for memory declaration.

U-Boot is used as the third-stage boot loader. Very early on in the process,
U-Boot loads a device tree from the boot partition, which contains
information about the hardware connected to the system in a data structure.
This data structure is then passed to the operating system when it is first
called, allowing it to have knowledge of the available devices when it
starts to boot. An example of the structure for declaring the memory on the
Zedboard is shown in Figure 3.12. The reg parameter defines the start
address (0x0) available for the operating system to use in order to access
memory, and the size of the memory, 512MB (0x20000000) in the case of the
Zedboard. Any reconfigurable logic which the operating system needs to
communicate with is also defined in this manner. Figure 3.13 shows the
structure used to declare the FFT module used in this project.


fft_1d@6D600000 {
    compatible = "fft_1d";
    reg = <0x6D600000 0x1000000>;
    clock-frequency = <165000000>;
    interrupts = <0 59 1>;
    interrupt-parent = <&gic>;
};

Figure 3.13: Sample device tree structure for FFT reconfigurable logic.

The three parameters of note are reg, clock-frequency and interrupts. Like
the memory structure, reg specifies the address and size, meaning that if
the operating system writes to the address 0x6D600000, it will be sending
data to the FFT module's register file in reconfigurable logic. The
operating system then needs to know the clock frequency of these registers
in order to read and write correctly, because if it were to use a different
clock, the data may become corrupt and the whole interface unstable. If the
reconfigurable logic needs to alert the operating system, for example to
signal that it has finished computing a task like the FFT, an interrupt is
required. The interrupts parameter has three flag values. The first
indicates whether the interrupt is a Shared Peripheral Interrupt (SPI),
shared between devices (flag 0), or a Private Peripheral Interrupt (flag 1).
The second flag is the IRQ number, but it is not the same value as the
Xilinx tools use, or the value used in the operating system: to get the
correct IRQ number for use in the operating system, this flag needs to be
incremented by 32 if the first flag is 0, or 16 if the first flag is 1.
The last flag specifies the type of interrupt; the two main types are
rising edge (1) and level sensitive (4, active when high). The other two
parameters, compatible and interrupt-parent, specify what type of driver
can be used and which device the interrupt connects to.

In the final boot stage, U-Boot loads the operating system together with the
device tree. A compiled Linux kernel image needs to be on the boot
partition, which by default must be named uImage. This can be changed when
building the U-Boot executable: within the U-Boot source,
include/configs/zynq_common.h is the file which sets most of the boot
parameters, such as the kernel filename or memory options. The kernel then
looks for the file system on the second partition, but never looks at this
partition for

another kernel. Therefore any changes to the kernel need to be compiled into
an image and placed on the boot partition. Simply rebuilding the kernel
while it is running and then rebooting the Zedboard will not load the new
kernel.

3.4.2 Driver

The operating system treats the reconfigurable logic like any other device,
which means it requires a driver to communicate with it. A character driver,
which sends/receives byte by byte, was created for the FFT module to write
the data width, source address, and destination address to the registers
within the reconfigurable logic, and also to handle the interrupt when the
FFT computation is complete. When writing a driver, or any other kernel
module, it is important to remember that the code is not a separate program:
it becomes part of the kernel. Therefore all errors need to be handled
appropriately to prevent system crashes, and there are standards and
templates that should be followed [37].

When the kernel sets up a driver for the device, it needs to know the
initialization and exit functions, just as programs need an entry point,
often called the main function. But instead of standardising the function
names, there are two built-in macros used to specify these functions,
module_init(driver_init) and module_exit(driver_exit), where driver_init
and driver_exit are functions within the driver.

Device drivers require a lot of information to operate, like the device
name, address, and IRQ number. Because these values change from system to
system, this information must be passed into the driver at runtime. This is
achieved using another built-in macro, module_param(variable, type,
permissions), as shown in Figure 3.14, which is a snippet of the implemented
device driver and also shows how to provide a description of each parameter.
How these parameters are passed to the driver is shown in the driver setup
script in Figure 3.15: the insmod command installs the driver module into
the running kernel, then mknod creates the character device node so it can
actually be accessed, and then the permissions are set.

The initialisation function is called when the driver is inserted into the
running kernel. It registers the device, and then requests and enables the
interrupt. Because this is a character device, the register_chrdev function
was used, which takes the device major number, device name, and the file
operations the

static long int register_address;
static long int register_length;
static char *device_name;
static int irq_number;
static int device_major;

// Retrieve parameters
module_param(register_address, long, 0);
MODULE_PARM_DESC(register_address, "Base address for the peripheral");
module_param(register_length, long, 0);
MODULE_PARM_DESC(register_length, "Address length for the peripheral");
module_param(device_name, charp, 0);
MODULE_PARM_DESC(device_name, "Name of the peripheral");
module_param(irq_number, int, 0);
MODULE_PARM_DESC(irq_number, "IRQ number for the peripheral's interrupt");
module_param(device_major, int, 0);
MODULE_PARM_DESC(device_major, "Major number of the peripheral");

Figure 3.14: Retrieving parameters in the device driver.

directory=/home/linaro/fft_1d_driver
mod_name=fft_1d.ko
base=0x6D600000
length=0x1000000
name="fft_1d"
irq=91
mjor=62
dev_name=fft_1d_fpga

insmod $directory/$mod_name register_address=$base \
    register_length=$length device_name=$name \
    irq_number=$irq device_major=$mjor
mknod /dev/$dev_name c $mjor 0
chmod 700 /dev/$dev_name

Figure 3.15: Device driver setup script.

driver supports. The operations supported by this driver are defined in the
file_operations struct, shown in Figure 3.16. The read and llseek functions
are not actually required, because data is only ever written from offset 0;
however they were added for testing and possible future use. Enabling the
interrupt


requires the request_irq function, which needs five parameters, but only the
IRQ number, a pointer to the IRQ handler, and the device name are important;
the flags and dev_id parameters can simply be set to NULL. Finally, enabling
the interrupt uses the enable_irq function, which just takes the IRQ number.
The driver is now initialised and running in the kernel.

When a process writes data to the device, the driver's write function is
called, which only returns to the calling process once the computation on
the reconfigurable logic has finished. This allows the calling process to
write to the device and automatically wait for completion, rather than
writing data and polling the driver for the interrupt. To write data to the
reconfigurable logic, the memcpy_toio function is used, which requires the
address of the device, in this case 0x6D600000, the address of the data
after it has been copied to kernel space using the copy_from_user function,
and the number of bytes to copy. The process writes to the driver twice:
the first write contains the source and destination addresses and the
length, then the second write can be any byte value, which is used as a
trigger. This works because the driver's offset value after the first write
is pointing to the last register in the reconfigurable logic, and the
register file in reconfigurable logic has been modified so that instead of
writing to the last register, it starts the computation. The second write
was required to guarantee the addresses and length signals had propagated
through the reconfigurable logic before they were needed, by delaying the
start by a couple of clock cycles. The driver requires that this second
write only contain one byte, because it currently uses the number of bytes
to tell whether it is the trigger or not; in future revisions this will be
changed so that it looks at the offset value instead.

Once the driver has triggered the reconfigurable logic, it needs to wait for
the interrupt before returning. This could be achieved by polling to see if
the interrupt has fired, then sleeping for a while before trying again, but
that is not efficient: it requires CPU time and does not know the instant
the interrupt has fired, which is why an event-driven technique was used
instead. The kernel provides functions to handle waiting for interrupts,
because they are such a huge part of any computing system that handling them
efficiently is crucial. These functions work by putting the process to
sleep, then waking it up when the interrupt fires. In the write function,
the wait_event_interruptible function is called, which requires the head of
a wait queue, and a condition for the process to continue. This condition
checks that a variable is true, which


struct file_operations driver_file_operations = {
    read:    driver_read,
    write:   driver_write,
    open:    driver_open,
    release: driver_release,
    llseek:  driver_llseek
};

Figure 3.16: File operations for the driver.

is set true by the interrupt handler, and set false at the start of the
write function. The process is now sleeping, and will only be woken when the
wake_up_interruptible function is called by the interrupt handler. This
function also needs the address of the head of the wait queue, and will
iterate over the queue, waking up every process. The woken processes then
check their continue condition: if it is true they carry on executing,
otherwise they go back to sleep. In this driver, because only one process
gets added to the queue, it is guaranteed that when it is woken up, its
condition is true. Therefore only a few CPU instructions are used for
sleeping, unlike with the polling method, improving the performance of the
system. The write function for the driver is shown in Appendix B.4 to
highlight this event-driven approach.

3.4.3 System Call

In order for a user space program to call the driver which handles the FFT
module in the reconfigurable logic, a system call was added to the Linux
kernel. It was implemented within the kernel rather than user space because
the kernel contains many useful memory-management functions which were
required, and because it allows better handling of multiple user programs
trying to use the programmable logic at once. On x86 the process of adding a
new system call is well documented, but for ARM there is very little, so
grep was used to find the necessary source files by searching for known
system calls. How these files were edited to add the fft_1d system call will
now be described.

arch/arm/kernel/calls.S
This file has a list of system calls, therefore add the custom call to the
end. For this kernel version, the custom call has an offset of 380.


/* 375 */ CALL(sys_setns)
          CALL(sys_process_vm_readv)
          CALL(sys_process_vm_writev)
          CALL(sys_kcmp)
          CALL(sys_finit_module)
/* 380 */ CALL(sys_fft_1d)

arch/arm/include/asm/unistd.h
Because a new system call has been added, the number of calls needs to
be incremented by one.
#define __NR_syscalls (381)

arch/arm/include/uapi/asm/unistd.h
Add the system call with its offset.
#define __NR_setns              (__NR_SYSCALL_BASE+375)
#define __NR_process_vm_readv   (__NR_SYSCALL_BASE+376)
#define __NR_process_vm_writev  (__NR_SYSCALL_BASE+377)
#define __NR_kcmp               (__NR_SYSCALL_BASE+378)
#define __NR_finit_module       (__NR_SYSCALL_BASE+379)
#define __NR_fft_1d             (__NR_SYSCALL_BASE+380)

include/linux/syscalls.h
At the end of this file is where the system call is declared.
asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
                         unsigned long idx1, unsigned long idx2);
asmlinkage long sys_finit_module(int fd, const char __user *uargs,
                                 int flags);
asmlinkage long sys_fft_1d(char *source, char *destin, int width);
#endif

kernel/fft_1d.c
Create a source file for the new system call.

kernel/Makefile
Edit the kernel Makefile to include the fft_1d source.
obj-y = fork.o exec_domain.o panic.o printk.o \
        ... \
        fft_1d.o


arch/arm/kernel/entry-common.S
There was a small problem when compiling the kernel, however: during one
check it was not picking up the updated system call properly, so in this
file a few lines were commented out to allow the kernel to compile. This
has not caused any issues, but it would need to be fixed in the future.
/*
.ifne NR_syscalls - __NR_syscalls
.error "__NR_syscalls is not equal to the size of the syscall table"
.endif
*/

The primary purpose of the system call is to take the source and destination
memory pointers, pass them to the driver, and wait for the computation to
complete before returning from the call. However, because Linux uses virtual
memory addresses and the reconfigurable logic uses physical addresses, the
two memory pointers being passed to the system call cannot simply be handed
straight to the driver; they have to be converted into physical addresses
first. Within the kernel there is a function, virt_to_phys(virtual_address),
which will return the physical address; however the virtual address must be
within the kernel address space, not user space, meaning it will not work as
expected for the two pointer parameters. The next solution involved walking
page tables to manually find the physical address. This is well documented
on the x86 architecture but not very well on ARM, a recurring pattern in
this project.

The code in Appendix B.5 follows the idea described in Figure 3.17: there
are four tables which need to be walked before the physical address can be
discovered. The virtual address provides an offset in the Page Global
Directory (PGD) table, which gives a pointer to one Page Upper Directory
(PUD) table. The virtual address is then used again to provide an offset in
the PUD table to find the pointer to the next table, the Page Middle
Directory (PMD). This process is the same for the PMD and Page Table Entry
(PTE) tables, after which the pointer to the physical page is known. When
this was implemented, the early results seemed promising: the physical
address did contain the test data that was written to memory using the
virtual address. However, when a large array of data was written to memory
using virtual addresses, the physical

[Diagram: PGD → PUD → PMD → PTE → physical page.]

Figure 3.17: Converting a virtual to physical address in Linux on ARM.

addresses often only contained parts of the array. The memory was therefore
not guaranteed to contain contiguous data, which meant the reconfigurable
logic would read invalid data, because it starts reading from the given
address and then increments the address until the whole array has been read.
So walking page tables was not going to work for this project's needs. Since
user space page tables did not meet the requirements, the next method was to
use kernel space memory. It was possible to allocate contiguous memory from
within the kernel; however there were a few issues with this approach:
memory alignment was not guaranteed automatically, and there was limited
space within the kernel's memory space, meaning contiguous memory was hard
to come by. It would also mean either giving the user program access to the
kernel's memory or copying the data from user space. All of these trade-offs
meant another solution was required.

In both user and kernel space, it is possible to map into physical memory
directly for reading and writing data. The only problem is that physical
memory is controlled by the kernel, which automatically swaps pages in and
out, so there is the danger of reading the wrong data because it has been
overwritten, or of writing over another process's memory. To get around
these dangers, a boot parameter which tells the kernel its entry point into
memory was changed, allowing some memory to be hidden from the kernel. This
is possible because the kernel enters at the highest address and uses all
the memory below it, so for the Zedboard, changing the high address
parameter from 0x20000000 (512MB) to 0x19C00000 (412MB) left 100MB of
memory that is safe to use directly.

This solution gives two options for implementation: performance and
simplicity. The best performance is achieved by the user space program
mapping


into physical memory itself, meaning the system call only needs to pass the
physical addresses to the driver. This requires the code to be rewritten so
it uses physical memory directly, and the program needs to handle addresses
correctly to avoid multiple processes or threads using the same address.
For simplicity, however, the user space program can use virtual memory as
usual, meaning that when the system call is invoked, it has to work out
that it has been given a virtual address, and copy the memory from user
space to the free physical memory before calling the driver. Then, when the
driver returns, it copies the memory back into user space. This allows the
program code to just use the system call; no other changes are needed
within the code. It is also safer, because the system call can dynamically
choose the physical address, since it is aware of what is in use. But this
simplicity involves extra copying of data compared to the performance
option, so there is a trade-off between the time to rewrite the code and
the computation time.

3.4.4 Datapath

Traditional datapaths in reconfigurable logic are built to allow the design
to execute in parallel and are pipelined to increase throughput. However,
because this design involves software and hardware working together, a
traditional datapath/pipeline design cannot be applied. Instead the main
datapath has to be designed in software, with the hardware supporting a few
functions, like the 1D FFT, that can be executed in parallel. Recalling the
floor plan of the FFT design in Appendix B.3, however, available space is an
issue which limits the number of FFT modules that can be implemented, so
the software datapath has to be designed to share the FFT hardware modules.

Within software, threads are required in order to get functions working in
parallel, which initially sounds like a dreadful idea when dealing with a
memory bandwidth bottleneck; however, because most of the intensive
computation is now in reconfigurable logic, this does not become an issue.
To pipeline the algorithm in software and guarantee 30 FPS, functions have
to be created so that their execution time is under 1/30th of a second,
with each thread running the sequence of functions at a different phase
from the others. When the next frame is ready, each thread starts executing
the next function in its sequence, creating a pipeline for the frames to
flow through, resulting in an output of 30 FPS. Because the algorithm
requires five 2D FFTs, the functions have to share the available FFT
modules equally so that no thread will be starved,


which would break the pipeline. This type of design results in the software
acting like a controller for the hardware, while also computing the smaller
functions, and shows the true potential of coupling an FPGA and an ARM
Cortex-A9 on the same chip.

4

Results and Discussion

I came here to drink milk and kick ass. And
I've just finished my milk.

Maurice Moss, The IT Crowd

This chapter presents the results for the x86 architecture, the ARM
architecture, and the Fast Fourier Transform (FFT) module in reconfigurable
logic, and then discusses the findings. All timing benchmarks show
computation time only, with read/write delays from the storage device
factored out.

4.1 Results
4.1.1 Intel x86 Performance

The x86 benchmarks were performed using an early 2013 MacBook Pro Retina
13, which is equipped with a 2.6GHz Intel Core i5-3230M, 8GB 1600MHz
DDR3, and a 128GB solid state drive.

Three different versions of the code were benchmarked: an initial
implementation in C as a baseline, the refactored version which has improved
cache locality and common computation factored out, and the final version
which uses AVX. The results are shown in Figure 4.1, using an array size of
2048x2048; note that all three stages use the optimized FFTS library, so
the FFT computation time is consistent, with the rest of the algorithm
being improved.

45
Chapter 4 Results and Discussion

[Bar chart: seconds per frame for the Baseline, Refactor and Intrinsics
implementations.]

Figure 4.1: Intel x86 Performance with an array size of 2048x2048.

The results show the importance of having knowledge of the CPU's cache and
of only computing common values once, as these nearly halved the computation
time per frame over the baseline version. They also show the raw performance
available with vector intrinsics, which decrease computation time by nearly
a third over the refactored version; this time could be further improved by
using the latest Intel Haswell chips, which support AVX2.

When working with a 720p image stream (1280x720), using an array size of
2048x2048 for computing the 2D FFTs requires a lot of zero padding, as
shown in Figure 4.2a, where the blue rectangle represents the image and the
gray is padding. This results in the array containing nearly 80% padding,
with the remaining 20% being actual data. This is a lot of wasted
computation time, so instead of padding up to the next power of two, the
array size is reduced to the previous power of two, meaning an array of
2048x2048 becomes 1024x1024. For a 720p image, this only results in slicing
10% off each side, as shown in Figure 4.2b; there is still enough image
data for this not to impact the quality of the output results, assuming the
sliced 20% does not contain the only useful data in the whole image.

Figure 4.3 shows how reducing the array size impacts the computation time
required per frame. These array sizes can be applied to any image
resolution; however, using either a 256x256 or 512x512 array size for a
720p image


(a) 2048x2048 (b) 1024x1024

Figure 4.2: 720p image using an array size of 2048x2048 and 1024x1024.

[Plot: seconds per frame against 2D array sizes of 256, 512, 1024 and 2048,
with the 30 FPS target marked.]

Figure 4.3: Intel x86 Performance for the working array size.

will produce poor quality results; for smaller image resolutions, however,
this algorithm is close to running in real time at 30 Frames Per Second
(FPS). Reducing the array size from 2048x2048 to 1024x1024 for the 720p
image improves the computation time per frame by a factor of three, giving
over 7 FPS. Google claim 10 FPS [20][19] for their image stabilisation
algorithm on a single machine, so even though this algorithm is running on
a MacBook, and uses the very computationally intense 2D FFT function five
times, the


[Bar chart: seconds per frame for the Standard and NEON modes.]
Figure 4.4: ARM Performance.

performance results are impressive. As mentioned, the new AVX2 instruction
set has been released with extra functionality such as vector
scatter/gather, and would post even more impressive numbers by improving
both the algorithm and the FFTS library.

4.1.2 ARM Performance

The ARM benchmarks were performed on the Xilinx Zedboard, which is equipped
with a Zynq 7020 containing a dual core ARM Cortex-A9 clocked at 667MHz,
512MB of DDR3, and an 8GB SD card.

Due to the Zedboard not having the level of resources on the Intel platform,
the results for NEON in Figure 4.4 were with an array size of 1024x1024.
They show that NEON has improved the computation time per frame by over
a second, however this is the first implementation, with further analysing of
the code required to increase performance even more. Due to time restrictions
and three SD cards corrupting, resulting in them being in read-only states,
this was not possible, and also meant that getting results for the FFT module
in the algorithm could not happen. This is one of the challenges of working
with an engineering sample board, many components do not work properly,

48
Chapter 4 Results and Discussion

3,000

2,500

2,000
Microseconds

1,500

1,000

500

0
FFTS Simple Advanced
System Calls
FFT Function
Figure 4.5: Software vs. Hardware: FFT computation of an array size 2048.

and given that three SD cards have become corrupted, the board is probably
at fault. There have been a few known issues regarding the SD card reader [4],
but these do not mention corruption.

4.1.3 FPGA Performance

The Field-Programmable Gate Array (FPGA) benchmarks were performed on
the Xilinx Zedboard, which is equipped with a Zynq 7020 packaging a dual-core
ARM Cortex-A9 clocked at 667MHz and an Artix 7 FPGA. The board also has
512MB of DDR3 and uses an 8GB SD card.

Figure 4.5 compares computing the 1D FFT of a transform of size 2048 on
the Xilinx Zedboard, using the FFTS library on the ARM and the FFT module
in the reconfigurable logic developed in this project. The FFT module
results are in two sets: the first, labeled simple system call, is where the
user program calls the system call with virtual pointers, requiring the
kernel to copy the virtual memory to the known physical memory location.
The second, labeled advanced system call, is where the kernel only needs to
pass the physical addresses to the driver, and no copying is required because
the user program accesses the physical memory directly. This involves
more coding, but the cost of copying memory is nearly the same as actually
computing the FFT, doubling the computation time. The results show that
the reconfigurable logic is currently achieving a 13 times speed-up over the
software implementation, with the logic clocked at 165MHz. With some timing
closure effort, this could be increased to over 300MHz.

Figure 4.6: FFT computation time for different 1D array sizes in reconfigurable logic.

The results for the FFT module in reconfigurable logic scaling with
transform size are shown in Figure 4.6, and the time differences between the
array sizes reveal that the computation time of the 1D FFT is not the only
factor influencing the overall time: the control signals and memory access
account for a large fraction of it. Figure 4.7 shows the number of clock
cycles for a burst read operation when using the AXI memory interface IP
developed in this project. The clock cycle value is the number of ticks from
when the IP receives the start signal to when the done signal goes back
high. These results show that using a bursting AXI interface to access
memory gives a huge performance increase, because the setup cost of an
operation is high; by bursting x words at a time, the number of setup
operations required is reduced by a factor of x. The results also show a
near linear increase in the number of clock cycles across the varying burst
sizes, but this is not guaranteed, because the DDR3 memory is shared
throughout the system, so it is possible to experience small delays.

Figure 4.7: Burst read operations using different sizes in reconfigurable logic.

The Zedboard only supports the previous AXI standard for the high-performance
memory interface, version 3, meaning the number of words that can be burst
at any one time is only 16, whereas version 4 supports bursting of 256 words.
Using the linear trend, it is possible to estimate how much of a performance
boost the new standard would give. Adding 224 clock cycles to the result for
size 16 gives 273 clock cycles per 256-word burst. For an array of size
2048, AXI3 requires 6272 clock cycles, whereas AXI4 would only require 2184
clock cycles, making AXI4 nearly 3 times faster at reading the same amount
of data.

4.1.4 Power Consumption

The Zynq 7020 chip when idling is estimated to use 2.2W [6]. This was tested
using the R292 10mΩ shunt resistor [17], which is connected on the 12V
positive rail of the Zynq. Using a precise digital multimeter, the voltage
difference across this resistor with the board in a default state was
1.83mV. Using the equation I = V/R, 0.00183/0.010 gives a current of 0.183A.
The power usage is then P = VI = 12 × 0.183, which results in 2.196W.

Figure 4.8: Power usage (Watts) for the Zynq, Zynq-FPGA, and i5.

Vivado gives an estimate of the power consumption of the design in
reconfigurable logic, 1.8W, but many factors influence this value and it is
not very accurate. However, using the R292 resistor, the overall power the
Zynq chip requires with the FFT module loaded can be calculated from the
physical circuitry. The multimeter gave a 3.16mV voltage difference across
the resistor, which gives a current of 0.316A (0.00316/0.010), resulting in
a power usage of 3.792W (12 × 0.316). Therefore the total power usage of
this design when idling is 3.792W, with the reconfigurable logic accounting
for 1.596W. The multimeter reported the same readings while the FFT was
being computed in the reconfigurable logic, which probably means the
sampling rate of the multimeter was not fast enough. It would be safe to
assume the power usage increases slightly during the computation, but for
the meter not to pick this up, it can only be increasing for a very short
period of time.

It is difficult to know how much power the Intel i5-3230M CPU is drawing;
however, the CPU alone is rated at 35W Thermal Design Power (TDP) [24], with
other components, for example the memory, also using a lot of power.
Therefore the MacBook will be drawing a great deal more power than the
Zedboard; yet even looking at the CPU ratings alone, the i5 requires over 9
times the wattage of the Zynq with the FFT design running in reconfigurable
logic, as shown in Figure 4.8.

4.1.5 Image Stabilisation

The output of a video captured from a drone is shown in Figure 4.9. The video
was three seconds long, and the figure has been reduced to show only 3 FPS,
with the left-hand frames taken from the original video and the right-hand
frames from the stabilised video.

Figure 4.9: Three second video from a drone, showing 3 FPS. Left is the input
and right is the output.

4.2 Discussion
4.2.1 Image Stabilisation

In Figure 4.9, looking at the input sequence first and focusing on where the
grass touches the sides and on the orientation of the grass, it is easy to
see the video moving up and down, with roll from the drone caused by wind.
In the output video, by comparison, the grass touches the sides at roughly
the same height and the rotation stays constant. The amount of correction is
easy to see because of the black border around the frame caused by moving
and rotating the frame. These results are very positive and show that the
rotation has been minimised. The translation is also being correctly removed
for this video, with the algorithm treating the drone as a fixed camera,
like a security camera mounted on a pole in the wind. However, more work
needs to be done to handle the drone flying around; currently a basic
solution is in place that averages the translation over the previous N
frames, but a more advanced technique is required.

4.2.2 David vs. Goliath

The results from the Intel i5 processor showcase its raw performance, and
with the next generation of AVX released, Goliath has become even stronger,
appearing to crush the little ARM chip. However, David brought a secret
weapon to the fight: an FPGA. The NEON results appear to be far worse than
Intel's, but they do not tell the whole story. Given the troubles experienced
with the SD cards, the code could not be developed further, nor could the
FFT in reconfigurable logic be used within the algorithm. The computation
time of the NEON code for an array of size 1024x1024 should be able to be
decreased by over a second. When the reconfigurable logic is used, which
currently gives a 13 times speed-up for the 1D FFT, and because the FFT
makes up over half of the computation, the time will be decreased by at
least a factor of 7, resulting in around 0.7 seconds per frame. Taking into
consideration the power requirements of both systems, the i5 alone draws 9
times the power required by the Zynq but only gives 5 times the performance,
and the other components in the MacBook use far more power than those on the
Zedboard. Therefore, in terms of throughput per watt, the Zynq platform
demolishes the Intel platform.


4.2.3 Developer Time

During this project, approximately equal amounts of time were spent
developing the software and the hardware, with both requiring new concepts,
which meant the overall development took longer due to the steep learning
curve. Learning how the initial algorithm worked with regard to the FFT and
other transforms, in order to analyse and improve upon it, was more time
consuming than learning how to use AVX and NEON, because FFTs are not an
easy transform to master. The hardware development was the opposite, with
most of the time spent debugging and testing on the Zedboard to get
components working correctly. To put it in context, over a month of late
nights was spent getting the DDR3 AXI interface working reliably on the
Zedboard and then building the controller IP component for use in the
project. This time was mostly due to the lack of documentation and examples
available for the interface. Much of this learning and testing time is a
one-off cost that can be applied to other projects, so developing parts of
an algorithm to run in reconfigurable logic does not add a significant
amount of development time, given the huge performance gain it provides.

Vivado, when compared to the previous design suite, was far easier to use,
and as a whole felt faster to develop with because there was no need to keep
switching tools. The 2013 version of Vivado supported the Xilinx Zedboard,
which meant there were no configuration mismatch issues of the kind
experienced with the older suite. The biggest improvement, however, was in
debugging: the time to set up and probe a debug core has reduced
significantly, which in this project saved days of debugging. Another tool
was used to speed up HDL development: Aquamacs VHDL mode [2]. Ideally Vivado
would have this editor built in for writing HDL, because its current
implementation is basically just a text editor, unlike Aquamacs, which can
automatically generate a lot of code.

5

Conclusion

5.1 Revisiting the Hypotheses


After testing the hypothesis outlined in Chapter 1, the results presented in
Chapter 4 show the performance gains of efficiently using the underlying
hardware, offering over 80% faster performance than a naive implementation
without affecting the output quality of the algorithm. In terms of power
usage, the results in this thesis show that adding custom hardware does
increase power usage, by approximately 70%, but gives 13 times the
performance, which is a positive trade-off. Developer time is discussed in
Section 4.2.3 of this thesis, which details that once the steep learning
curve is overcome, the additional development time to add custom hardware on
the Zedboard is minimal compared to the performance gains, and that writing
the algorithm to use the microprocessor efficiently also involves very
little extra development time.

5.2 Contributions and Future Work


5.2.1 Image Stabilisation

The current implementation performs very well; however, it could be improved
by also handling ego motion. Treating the camera as a fixed node was adequate
for this project, but being able to stabilise video from a car or drone would
give this algorithm more applications. There are already techniques [20][27]
that look into handling movement, and these could be implemented. The
quality of the output frames could also be improved by filling in the black
space created by the correction, and by applying other warps to remove
issues like the distortion caused by rolling shutters [33][19]. Even though
a new image stabilisation algorithm has been developed in this project,
optimised to use five 2D FFTs, fewer than other solutions researched
[50][32], all of these improvements were out of the scope of this project,
because the goal was to evaluate performance optimisations and the FFT, not
to achieve the best output quality possible, which is a vast project in
itself.

5.2.2 Software Optimisations

Results from this project show how difficult it is to get computationally
intense image processing algorithms running in real time at large
resolutions. However, with Intel releasing the Haswell generation, which
supports scatter/gather, or with the use of higher-end hardware, the
algorithm could manage real-time performance, so this is an area which can
be explored in the future. The NEON intrinsics also need to be further
optimised and improved upon, which, due to faults in the engineering sample
of the Zedboard, could not be accomplished in this project's time period.
The software optimisations that were completed during this project really
show how important knowledge of the underlying architecture is when
performance is a concern.

5.2.3 Hardware

The FFT module in the reconfigurable logic currently outperforms the
software FFT by a factor of 13. These results can still be improved: the FFT
core is only running at 165MHz, and the implementation does not take full
advantage of the FPGA's potential, as it is not very parallel. The FFT core
supports many channels, allowing multiple FFTs to be computed at once, while
also allowing the transpose for the 2D FFT to occur in reconfigurable logic.
In addition, the reconfigurable logic has a total of four memory channels
available, with only one currently used, so fully optimising the memory and
FFT core channels will increase the performance of the FFT module by well
over a factor of 4. In total, there is room for another order of magnitude
of improvement.

A motion estimation project [35][36] by John Perrone will use the FFT module
in reconfigurable logic, the Linux driver, and the system call developed in
this project, combined with the knowledge gained of the Zynq 7020 chip, to
improve the computation time of the estimation algorithm in order to achieve
real-time performance while also lowering the power requirements. The end
goal is to design and build an autonomous drone controller that will be on
board a quadcopter, which is why power usage is so important. Other
computationally intense functions will also need to be implemented in the
reconfigurable logic. Given the lack of documentation available for the
Zedboard, this project will be able to help many other projects, especially
with regard to communicating with the DDR3 memory interface, and through the
possible release of the DDR3 interface IP component to the Zedboard
community. Consequently this project has a real-world application and will
have some lasting impact.

5.3 Final Remarks


This thesis showed that using the underlying hardware efficiently can
drastically improve the performance of computationally intense algorithms,
and that it is possible to further accelerate these algorithms using custom
hardware without a huge increase in power usage. The results also show that
a low-power Zynq chip beats an Intel i5 in terms of throughput per watt,
which is very impressive.


FIN.

References

[1] Linaro. http://www.linaro.org/linux-on-arm (Accessed October 2013).

[2] VHDL mode. http://www.iis.ee.ethz.ch/~zimmi/emacs/vhdl-mode.html
(Accessed October 2013).

[3] Helder Araujo and Jorge M. Dias. An Introduction to the Log-Polar Map-
ping. Department of Electrical Eng. University of Coimbra, 1997.

[4] Avnet, Inc. ZedBoard Rev C.1 Errata, 1.3 edition, May 2013.

[5] Sebastiano Battiato, Giovanni Gallo, Giovanni Puglisi, and Salvatore
Scellato. SIFT features tracking for video stabilization. pages 825–830, 2007.

[6] Jim Beneke. Zynq-7000 EPP in Action: Exploring ZedBoard. Avnet,


2012.

[7] Anthony M Blake. Dynamically generating FFT code on mobile devices.

[8] Anthony M Blake. Computing the fast Fourier transform on SIMD mi-
croprocessors. PhD thesis, University of Waikato, 2012.

[9] Anthony M Blake, Ian H Witten, and Michael J Cree. The fastest Fourier
transform in the south. 2013.

[10] E Oran Brigham. The Fast Fourier Transform and its applications. Pren-
tice Hall, 1988.

[11] Rita Cucchiara, Costantino Grana, Massimo Piccardi, and Andrea Prati.
Detecting moving objects, ghosts, and shadows in video streams. Pattern
Analysis and Machine Intelligence, IEEE Transactions on, 25(10):1337–1342,
2003.

[12] Andrew J Davison. Real-time simultaneous localisation and mapping with


a single camera. pages 1403–1410, 2003.


[13] Dmitri Dolgov, Sebastian Thrun, Michael Montemerlo, and James Diebel.
Path planning for autonomous vehicles in unknown semi-structured envi-
ronments. The International Journal of Robotics Research, 29(5):485–501,
2010.

[14] Doulos. The Verilog Golden Reference Guide, 1996.

[15] Alexandre Francois and Gerard Medioni. A modular software architecture


for real-time video processing. Computer Vision Systems, pages 35–49,
2001.

[16] Matteo Frigo. A Fast Fourier Transform Compiler. MIT Laboratory for
Computer Science, February 1999.

[17] GMAD. ZedBoard Schematic. Digilent, c.1 edition, June 2012.

[18] Chris Gregg and Kim Hazelwood. Where is the data? Why you cannot
debate CPU vs. GPU performance without the answer. Performance Analysis
of Systems and Software (ISPASS), 2011 IEEE International Symposium
on, pages 134–144, 2011.

[19] Matthias Grundmann, Vivek Kwatra, Daniel Castro, and Irfan Essa.
Calibration-free rolling shutter removal. 2012.

[20] Matthias Grundmann, Vivek Kwatra, and Irfan Essa. Auto-directed video
stabilization with robust l1 optimal camera paths. 2011.

[21] HARDI Electronics AB. VHDL Handbook, 2000.

[22] Scott Hauck and André DeHon. Reconfigurable computing: the theory and
practice of FPGA-based computation. Morgan Kaufmann, 2010.

[23] National Instruments. Chapter 5: Customizing hardware through LabVIEW
FPGA. http://www.ni.com/pdf/products/us/criodevguidesec3.pdf
(Accessed October 2013).

[24] Intel. Mobile 3rd Generation Intel Core Processor Family, Mobile
Intel Pentium Processor Family, and Mobile Intel Celeron Processor
Family, datasheet, volume 1 of 2 edition, June 2013.

[25] Bruce Jacob, Spencer Ng, and David Wang. Memory systems: cache,
DRAM, disk. Morgan Kaufmann, 2010.

[26] Vitit Kantabutra. On hardware for computing exponential and trigono-
metric functions. IEEE Transactions on Computers, 45(3):328–339, 1996.

[27] Alexandre Karpenko, David Jacobs, Jongmin Baek, and Marc Levoy. Dig-
ital video stabilization and rolling shutter correction using gyroscopes.
2011.

[28] Patent: Jean Laroche. Multi-microphone active noise cancellation system.

[29] Victor W Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun
Kim, Anthony D Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas
Chennupaty, Per Hammarlund, et al. Debunking the 100x GPU vs. CPU
myth: an evaluation of throughput computing on CPU and GPU. ACM
SIGARCH Computer Architecture News, 38(3):451–460, 2010.

[30] Paralant Ltd. FFTs on the multi-core Anemone processor. 2012.

[31] John Markoff. Google cars drive themselves, in traffic. The New York
Times, 10:A1, 2010.

[32] J.R. Martínez-de Dios and Aníbal Ollero. A real-time image stabilization
system based on Fourier-Mellin transform. pages 376–383, 2004.

[33] Marci Meingast, Christopher Geyer, and Shankar Sastry. Geometric mod-
els of rolling-shutter cameras. arXiv preprint cs/0503076, 2005.

[34] Hong Shan Neoh and Asher Hazanchuk. Adaptive edge detection for
real-time video processing using FPGAs. 2004.

[35] John A Perrone and Leland S Stone. A model of self-motion estimation


within primate extrastriate visual cortex. Vision Research, 34(21):2917–2938,
1994.

[36] John A Perrone, Alexander Thiele, et al. Speed skills: measuring the
visual speed analyzing properties of primate MT neurons. Nature
Neuroscience, 4(5):526–532, 2001.

[37] Peter Jay Salzman, Michael Burian, and Ori Pomerantz. The Linux Kernel
Module Programming Guide, 2.6.4 edition, 2007.

[38] Qin Chen and Dapeng Oliver Wu. Realtime H.264 Encoding and Decoding
Using FFmpeg and x264. Dept. of Electrical & Computer Engineering,
University of Florida, Gainesville, FL 32611, USA.

[39] B Srinivasa Reddy and BN Chatterji. An FFT-based technique for
translation, rotation, and scale-invariant image registration. Image
Processing, IEEE Transactions on, 5(8):1266–1271, 1996.

[40] Cynthia Rodriguez. Understanding FFT-based algorithm to calculate im-


age displacements with IDL programming language. University of Texas
at San Antonio, 2007.

[41] Andrzej Ruta, Yongmin Li, and Xiaohui Liu. Real-time traffic sign recog-
nition from video by class-specific discriminative features. Pattern Recog-
nition, 43(1):416–430, 2010.

[42] Daniel Scharstein and Amy J Briggs. Real-time recognition of self-similar


landmarks. Image and Vision Computing, 19(11):763–772, 2001.

[43] Sanjay Singh, Anil Kumar Saini, and Ravi Saini. Real-time FPGA-based
implementation of color image edge detection. International Journal of
Image, Graphics and Signal Processing (IJIGSP), 4(12):19, 2012.

[44] Vasily Volkov. Better performance at lower occupancy. In Proceedings of


the GPU Technology Conference, GTC, volume 10.

[45] Patent: Kaenel Vincent R. Von. Flexible packaging for chip-on-chip and
package-on-package technologies.

[46] Liang Wang, Minglun Gong, Chenxi Zhang, Ruigang Yang, Cha Zhang,
and Yee-Hong Yang. Automatic real-time video matting using time-of-
flight camera and multichannel Poisson equations. International Journal of
Computer Vision, 97(1):104–121, 2012.

[47] Holger Winnemöller, Sven C Olsen, and Bruce Gooch. Real-time video
abstraction. 25(3):1221–1226, 2006.

[48] Charles Wu. Implementing the Radix-4 Decimation in Frequency (DIF)


Fast Fourier Transform (FFT) Algorithm Using a TMS320C80 DSP.
Texas Instruments, January 1998.

[49] Stephan Würmlin, Edouard Lamboray, and Markus Gross. 3D video
fragments: Dynamic point samples for real-time free-viewpoint video.
Computers & Graphics, 28(1):3–14, 2004.

[50] Hongjie Xie, Nigel Hicks, G Randy Keller, Haitao Huang, and Vladik
Kreinovich. An IDL/ENVI implementation of the FFT-based algorithm for
automatic image registration. Computers & Geosciences, 29(8):1045–1055,
2003.

[51] Xilinx. 7 Series FPGAs Clocking Resources, v1.8 edition, August 2013.

[52] Xilinx. LogiCORE IP Fast Fourier Transform v9.0, Product Guide for
Vivado Design Suite, March 2013.

[53] Xilinx. Zynq-7000 All Programmable SoC Overview, v1.5 edition, Septem-
ber 2013.

[54] Changsheng Xu, Kong Wan, Son Bui, and Qi Tian. Implanting virtual
advertisement into broadcast soccer video. Advances in Multimedia In-
formation Processing-PCM 2004, 2005.

A

Image Stabilisation


A.1 C code for computing the 1D FFT


void fft_execute_1d(float *vector, int N, int sign) {
    int mmax, index_step, i, j, m;
    float theta, temp, wpr, wpi, wr, wi, tr, ti;

    fft_rearrangeVector(vector);

    mmax = 2;
    while (mmax < N) {
        index_step = mmax * 2;
        theta = sign * (2 * PI / mmax);
        temp = sinf(0.5f * theta);
        wpr = -2.0f * temp * temp;
        wpi = sinf(theta);
        wr = 1.0f;
        wi = 0.0f;
        for (m = 1; m < mmax; m += 2) {
            for (i = m; i <= N; i += index_step) {
                j = i + mmax;
                tr = wr * vector[j - 1] - wi * vector[j];
                ti = wr * vector[j] + wi * vector[j - 1];
                vector[j - 1] = vector[i - 1] - tr;
                vector[j] = vector[i] - ti;
                vector[i - 1] += tr;
                vector[i] += ti;
            }
            wr = (temp = wr) * wpr - wi * wpi + wr;
            wi = wi * wpr + temp * wpi + wi;
        }
        mmax = index_step;
    }
}

B

Implementation


B.1 Screenshot of the Zynq layout from within Vivado.


B.2 Implemented design in Vivado IP Integrator.


B.3 Implemented design in the reconfigurable logic.

Component
1D FFT Controller
AXI DDR Interface Controller
AXI DDR Interconnect
Configuration Registers
FFT Core
MMCM (Clock Manager)
Processing System
Processing System AXI Interconnect
System Reset


B.4 Write and interrupt handler functions for the device driver.
ssize_t driver_write(struct file *filp, const char *buf, size_t count,
                     loff_t *f_pos) {
    char tmp[no_of_bytes];
    ssize_t no_of_bytes_to_write;

    // Reserve some memory
    memset(tmp, 0, no_of_bytes);
    // Reset the wakeup condition
    fft_complete = 0;
    // Check count to stop overflows
    if (count <= no_of_bytes) {
        no_of_bytes_to_write = count;
    }
    else {
        no_of_bytes_to_write = no_of_bytes;
    }
    // Don't go off the end of the file
    if (no_of_bytes_to_write > (no_of_bytes - *f_pos)) {
        no_of_bytes_to_write = (no_of_bytes - *f_pos);
    }
    // Copy the memory from user space
    if (copy_from_user(&tmp, buf, no_of_bytes_to_write)) {
        // An error occurred
        no_of_bytes_to_write = -EFAULT;
    }
    else {
        // Copy the data to the reconfigurable logic
        memcpy_toio(baseptr + (*f_pos * sizeof(char)), &tmp,
                    no_of_bytes_to_write);
        *f_pos += no_of_bytes_to_write;
        // Check for the trigger
        if (no_of_bytes_to_write == 1) {
            // Wait for the interrupt to fire
            wait_event_interruptible(wait_for_complete, fft_complete == 1);
        }
    }
    return no_of_bytes_to_write;
}

irqreturn_t irq_handler(int irq, void *dummy, struct pt_regs *regs) {
    // Set the wakeup condition to true
    fft_complete = 1;
    // Wake up the write process
    wake_up_interruptible(&wait_for_complete);
    return IRQ_HANDLED;
}


B.5 C code for walking page tables on ARM.


unsigned long get_physical_address(unsigned long address)
{
    pgd_t *pgd;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *pte;
    unsigned long paddr;
    // Walk the page tables
    pgd = pgd_offset(current->mm, address);
    pud = pud_offset(pgd, address);
    pmd = pmd_offset(pud, address);
    pte = pte_offset_map(pmd, address);
    // Use the pte entry to find the physical address
    paddr = (pte_val(*pte) & PAGE_MASK) | (address & (PAGE_SIZE - 1));
    // Unmap the temporarily mapped page table entry
    pte_unmap(pte);
    return paddr;
}

73

Vous aimerez peut-être aussi