
CUDA Algorithms for JAVA

Kosta Krauth

B.Sc. in Computing Science


Department of Computing Science
Griffith College Dublin

May 2009

Submitted in partial fulfilment of the requirements of the Degree of


Bachelor of Science at Griffith College Dublin

Disclaimer

I hereby certify that this material, which I now submit for assessment on the
programme of study leading to the Degree of Bachelor of Science in Computing at
Griffith College Dublin, is entirely my own work and has not been submitted for
assessment for an academic purpose at this or any other academic institution other
than in partial fulfilment of the requirements of that stated above.

Signed: Date:

Abstract

This paper describes in detail a method of invoking NVIDIA CUDA algorithms from
within JAVA. The points of focus are performance, ease of use, feasibility of the
approach and the possibility of expanding the proposed library of algorithms through
the open-source community. It also covers the implementation of an on-line self-
service CUDA to JAVA library compiler.

Table of Contents

Introduction and overview ............................................................................................. 6


Distributed computing and the Internet ..................................................................... 6
Parallel computing and Moore’s law ......................................................................... 6
Meanwhile, in the GPU world... ................................................................................ 7
The birth of GPGPU .............................................................................................. 8
The project’s aim ....................................................................................................... 9
Requirements analysis and specification ..................................................................... 10
CUDA API ............................................................................................................... 10
Platform independence............................................................................................. 11
JAVA JNI............................................................................................................. 11
SWIG ................................................................................................................... 12
Virtualization ....................................................................................................... 13
The Web Page .......................................................................................................... 13
Library download page ........................................................................................ 14
Algorithm description page.................................................................................. 15
Self-service compiler ........................................................................................... 16
Community / Open Source aspects ...................................................................... 17
Design Structure........................................................................................................... 18
System Components................................................................................................. 18
Operating System ................................................................................................. 18
CUDA Driver and Compiler ................................................................................ 18
The Virtual Machine ............................................................................................ 19
LAMP .................................................................................................................. 19
Zend Framework .................................................................................................. 19
Shell Scripting and Backing Services .................................................................. 20
CUDA algorithms .................................................................................................... 21
Parallel Sort (Bitonic) .......................................................................................... 21
GPU-Quicksort .................................................................................................... 23
Black-Scholes Option Pricing .............................................................................. 25
Fast Fourier Transform ........................................................................................ 26
Matrix Transform ................................................................................................. 27
Implementation Details ................................................................................................ 28
Algorithms ............................................................................................................... 28
SWIG & JNI ........................................................................................................ 28

CUDA algorithm adaptation ................................................................................ 30
Reusing the CUDA compiler configuration ........................................................ 33
The Virtual Machine ............................................................................................ 35
Test suites............................................................................................................. 37
The web site ............................................................................................................. 39
General features ................................................................................................... 39
Library download page ........................................................................................ 40
Self-service compiler ........................................................................................... 40
Project Tracking (TRAC) .................................................................................... 42
Conclusion ................................................................................................................... 42
Meeting the Objectives ........................................................................................ 42
Taking a different approach ................................................................................. 43
Bibliography ................................................................................................................ 44

Introduction and overview

Distributed computing and the Internet


Research in the areas of distributed computing and processing has been constantly
on the rise over the last decade. Being a distributed architecture itself, the
Internet has been one of the major drivers of advancement in the field.

Cluster computing has been actively used for processing and organizing vast amounts
of data on the Internet. Google is one of the pioneers in this area with its MapReduce
model, which is capable of processing terabytes of data across thousands of machines.
[1] Other popular uses of cluster computing on the Internet include ensuring
availability for high-traffic web servers and providing scalability to hosting
platforms.

Grid computing differs from cluster computing mainly in the fact that it is loosely
coupled, meaning the participating nodes can all operate on different hardware and
software. Many popular distributed applications use this architecture, most notably
projects like Folding@Home and SETI@Home. By taking advantage of volunteer computers
connected to the Internet, Folding@Home has been able to create a grid with a
combined computational power exceeding 8 petaflops. [2] As a comparison, the
current fastest supercomputer (IBM’s Roadrunner) achieves a sustained rate of 1.1
petaflops. [3]

More recently a completely new paradigm called “cloud computing” has emerged,
sometimes used as a metaphor for the Internet itself. Clouds can be defined as large
pools of easily usable and accessible virtualized resources (such as hardware,
development platforms and/or services). These resources can be dynamically
reconfigured to adjust to a variable load (scale), allowing also for an optimum
resource utilization. [4] They have been widely used in providing web and application
hosting platforms such as Amazon’s EC2 services and most of Google’s services like
Gmail and AppEngine.

Because of all the advantages and benefits of using distributed computing in
networked environments, as we will see below, the paradigm has successfully been
applied to non-networked systems too.

Parallel computing and Moore’s law


Moore’s Law was proposed by Intel co-founder Gordon Moore and it states that the
number of transistors on a chip will double about every two years. [5] It is often said
that this is not so much a law as it is a pace the manufacturers are constantly trying to
match in any way imaginable. Intel has especially been hard at work in trying to keep
up with the law considering it was proposed by one of its co-founders.

As shown in Figure 1, around 1995, the ability to decrease the size of an IC
(integrated circuit) had slowed down dramatically as physical limits were reached.
The transistors were already so small that the electrons had started to jump from one
to the other in an unpredictable fashion, making it impossible to produce a stable and
reliable processor.

Figure 1, source Wolfram Alpha

Then, as Figure 1 clearly shows, in 2004 something happened and the transistor count
increased by an order of magnitude. In 2005 this behaviour occurred again. In order to
keep up with Moore’s law, Intel proposed a different architecture - one in which the
processor cores were multiplied instead of miniaturized. As such, on 7th September
2004, Intel introduced its first multi-core CPU and announced parallelism as its
main microprocessor philosophy. [6] This happened again in early 2006 with the
introduction of quad-core CPUs, hence the second jump shown in Figure 1.

This was a major shift that affected electrical engineers and computer
programmers alike. The goal was no longer to squeeze every last megahertz out of
the processor’s clock, but rather to parallelize programs so that they could
spread their tasks into working units capable of running on multiple cores. This
proved to be a very successful and beneficial model, and has continued to be the main
driving force behind CPU development to this day.

Meanwhile, in the GPU world...


CPU manufacturers weren’t the only ones active in this area. As a matter of fact,
some companies had been working on parallel processors years before Intel: the
manufacturers of graphical processing units (GPUs), with Silicon Graphics
International and 3dfx Interactive being the pioneers in the field. As the demand for
virtual realities that match our own increased (both for the CAD and entertainment
industries), graphics cards had to employ multiple processors designed specifically for
performing a single task. These were primarily graphical operations such as pixel
shading, rasterization and vertex rendering.

As time progressed, virtual reality started matching our own more and more precisely,
but this came at a computational cost. With increase in resolution and introduction of
various algorithms that shaped virtual worlds in the way to make them look more
believable (most notably anti-aliasing and bilinear filtering), the manufacturers of
graphical processing units found themselves adding more and more of these
specialized processors that could crunch more data in less time, providing higher
frame rates and better quality. All these operations were well suited for being
executed in parallel and as such the GPUs started multiplying their processors much
earlier than CPUs.

Even though these specialized processors were much slower than modern CPUs, they
came in very high numbers, so in the early 2000s people started wondering if there was
a more general way in which they could be used.

The birth of GPGPU

This gave rise to a whole new research field called General-Purpose computation on
Graphics Processing Units (GPGPU). It started off as a very small community of
scientists and researchers that explored various ways in which graphical computations
could be mapped to a more general set of instructions.

Initial efforts produced simple distributed algorithms for searching, sorting and
solving various problems in the scientific community. The initial results were very
promising. The advantages of using such a massively parallel architecture showed
extreme benefits for certain computations. Obviously, getting texture-transformation
computations to perform something that even resembled operations from basic linear
algebra was a long, tedious and hacky process.

Eventually the manufacturers of graphics cards realised that their architectures could
be used for more than just 3D acceleration. In November of 2006, both major graphics
card manufacturers of the time (ATI and NVIDIA) released their flavours of a
GPGPU API for their hardware. ATI’s (acquired by AMD) implementation was
called Close To Metal (CTM). CTM gave developers direct access to the native
instruction set and memory of the massively parallel computational elements in AMD
Stream Processors. Using CTM, stream processors effectively become powerful,
programmable open architectures like today’s central processing units (CPUs). [7]
NVIDIA’s API was called Compute Unified Device Architecture (CUDA). CUDA
expanded on the features and ideas of CTM, offering full support for BLAS (Basic
Linear Algebra Subprograms) and FFT (Fast Fourier Transform) operations. [8]

This was made possible by exploiting programmable stream processors that can
execute code written in a common language like C. Architecturally these processors
are very simple and have limited instructions sets, but their real power lies in
numbers, not speed. As an example, NVIDIA’s current flagship graphics card, the
GeForce GTX 295 contains 480 stream processors, each having an internal clock of
1242 MHz. [9]

Figure 2, source: NVIDIA

As illustrated by Figure 2, the combined computational output of these processors
(measured in floating point operations per second) far exceeded that of mainstream
Intel CPUs. This performance doesn’t come for free, however. Getting the maximum
out of a graphics card is an involved and highly complex task. The peculiarities of the
GPU architecture have a huge impact on the performance of algorithms executed on
them. Therefore, in-depth knowledge of the processor and memory architecture is
required in order to achieve the maximum possible efficiency and throughput, as well
as to avoid any pitfalls in the process.

The project’s aim


CUDA has been a great success ever since its introduction. Many mainstream
projects have started using the benefits of sharing the workload between the CPU and
GPU, depending on the task at hand. Adobe’s latest Photoshop has GPGPU
functionality integrated, as do the Folding@home and SETI@home projects. CUDA’s
homepage is also bursting with scientific applications ranging from physics to
chemistry, biology, mathematics, AI and medicine.

So the benefits are obviously there, yet CUDA remains a closed community
consisting mostly of scientists and researchers. This is mostly due to the high entry
barrier created by the specifics of the GPU architecture and a relatively obscure API
that is only accessible using C++ and that a non-mathematical person would have a
hard time following and understanding.

This project’s aim is to make the benefits of CUDA available to a wider audience
through an easy-to-use, ready-to-deploy JAVA library. The approach taken in order to
assess the feasibility of such a solution is covered in detail. While it is more of a proof
of concept than anything else, the library shows potential to grow and expand through
open-source community contributions, accessible through a web page.

Requirements analysis and specification

CUDA API
As previously mentioned, CUDA is essentially an extension of C. All of the memory
allocations, pre-calculations and variable and data initializations are done in regular
C. CUDA then builds on top of that and introduces several specific symbols and
keywords that only the CUDA compiler can process. As such, a CUDA program
usually consists of two main parts – a C program that initializes the memory on the
device and host, and prepares the data that needs to be processed, and a CUDA file
that contains the kernel which is uploaded to the graphics card and executed on the
stream processors.

The kernel file is compiled by the CUDA compiler and cloned across all the available
processors on the graphics card. Prior to that, the dataset needs to be uploaded to the
graphics card’s main memory. Once these steps have been completed, the processors
start performing calculations on the available data and storing the results. The key
to achieving maximum performance lies in optimal use of the shared memory (as
opposed to global texture memory, which is much slower), good synchronization of
threads and maximizing the number of busy cores at any given moment. The optimal
number of threads differs from one graphics card to another, but most modern cards
benefit most from thread counts running into the thousands. [8]

Figure 3, GPU memory diagram, source (10)

Figure 3 illustrates how the memory on a graphics card is divided into a grid that
contains a number of blocks inside which the threads run. All threads operating within
the same block can synchronize their execution. Access to global memory is much
slower than access to shared memory and registers, in the same way that a CPU can
access its registers much faster than RAM. Therefore, global memory should only be
used when retrieving new datasets or writing results back once the data in shared
memory has been fully processed.

The second most time-consuming and costly operation is the transfer of memory
between the host (CPU) and the device (GPU). All memory needs to be pre-allocated,
and the data transferred, before processing starts, adding quite a bit of latency to the
task. This is why GPUs are best at performing computationally intensive tasks that
have a low number of memory accesses and a high number of iterations over the same
dataset. This way, a dataset can be uploaded only once, and the threads can crunch
data independently, without wasting time competing for resources and synchronizing
with each other.

Since the CUDA API allows direct access to all of these subsystems, programmers
need to be very careful when writing programs. Naive implementations of algorithms
can be up to 100 times slower than carefully tuned ones, which is why a solid
understanding of the memory hierarchy and thread organization is essential.

Platform independence
One of the first things that had to be considered during the planning stage of the
project was how to solve the problem of combining JAVA’s platform independence
with CUDA’s complete platform dependence. The following chapters show how this
challenge was approached.

JAVA JNI
Since the CUDA API and compiler are C based, integration with JAVA is far from
trivial. Therefore, the only way to consume these algorithms from JAVA is through
the use of the JAVA Native Interface (JNI).

The JNI allows JAVA applications to load precompiled binary libraries written in a
language like C, and exchange data with the methods residing within them. Figure 4
shows a very high level diagram of where and how the native applications are
invoked.

Figure 4, JNI diagram, source (11)

Even though this is a great way to exploit libraries written in other programming
languages, by using JNI, JAVA loses its platform independence. Since C libraries are
platform dependent, a JAVA program consuming such a library can only run on the
kind of system the library was compiled for. For example, when compiling shared
libraries, Windows produces Dynamic-Link Libraries (DLLs) and Linux produces
Shared Objects (SOs). To make it possible for an application to run on both operating
systems, both of these need to be present.

Fortunately, JAVA comes with mechanisms that make this runtime loading of JNI
libraries somewhat easier. As long as the library resides on the JAVA library path, we
only need to provide its base name, and JAVA will decide whether it should load the
.so or the .dll depending on the operating system it is running on. However, in order to
produce a library that is easily pluggable into existing JAVA programs and runs on a
majority of operating systems, it should ship with at least Windows and Linux
versions. The way this problem was solved is described in subsequent chapters.
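
To illustrate the loading mechanism, below is a minimal sketch of how a JAVA class
might load and declare one of these native libraries. The library name "parallelsort"
and the method signature are purely illustrative placeholders, not the actual names
produced by the tooling described later.

public class ParallelSortNative {
    static {
        // JAVA resolves the base name to libparallelsort.so on Linux or
        // parallelsort.dll on Windows, searching the JAVA library path.
        System.loadLibrary("parallelsort");
    }

    // The body of this method lives inside the compiled native library.
    public static native void sort(float[] data);
}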

Another downside of JNI is that it was designed primarily for supporting legacy
systems. As such, it is very verbose, and writing the “glue” is a long and tedious
process involving lots of low-level code that JAVA programmers are usually not very
familiar with. Because of this, one of the requirements of the project is that users are
completely abstracted from JNI. The only thing they need to do is call the native
method, passing the relevant data to it; the rest is completely transparent.

SWIG
In order to avoid writing JNI code manually, and to enable developers to quickly wrap
their CUDA algorithms into JNI compatible libraries, an automated approach was
necessary. After some research, the Simplified Wrapper and Interface Generator, or
SWIG for short [10], seemed to have all the functionality required. What SWIG does is
generate the wrapper code (the “glue”) between JAVA and C programs.

Figure 5, SWIG architecture

As seen from Figure 5, SWIG takes an interface file as input, parses it and then, using
the code generator, creates the wrappers from it. The wrappers can be generated for
Python, Perl, Tcl and many other languages, but for this project’s particular use, only
the JAVA JNI one is needed.

All in all, SWIG was another step towards achieving complete automation between
supplying the CUDA source files and receiving ready to use, platform specific, JNI
compatible libraries. The generation of these interface files and further explanation of
how the wrappers fit into the system will be given in subsequent chapters.

Virtualization
As previously mentioned, even with the JNI wrappers generated, the problem of
platform dependence still remained. The only way around this was to try and
cover most of the platforms in one go. Windows and Linux cover over 90% of the
installed platforms worldwide, so they were a good place to start. Somehow, a way to
compile the source files for those two operating systems simultaneously had to be
devised.

The first thing that came to mind was to have two machines, one running Linux and
the other running Windows. However, for the purposes of the demo this wasn’t very
practical, and maintaining two separate computers to perform that task seemed
superfluous.

A much saner approach was to run two systems in parallel on one computer by using
a virtual machine. Linux was chosen as the base OS, with Windows XP installed as a
virtual machine inside it. This proved to be a good solution, as compilation tasks could
be passed to both systems at once and executed in parallel. This approach worked very
well, and the details of how it was streamlined with the rest of the process will be
described in subsequent chapters.

With the JNI interface generator in place, as well as compiling environments set up on
the Linux base OS and the Windows virtual machine, the system began taking shape.

The Web Page


The final steps were exposing the functionality of this automated compilation system
to the outside world, and creating some basic downloadable algorithms for the library.
The open source community is a vibrant and active one, so it would be worthwhile to
allow CUDA programmers to contribute their own algorithms to the library, therefore
increasing its usefulness and scope.

The page consists of four main sections:

Library download page
Algorithm description page
Self-service compiler
Community / Open Source aspects

These are outlined in the chapters below.

Library download page

The first section is the download page, where users can hand-pick the library
components they require; based on the choices made, the system then builds a
custom-made package and sends it to the user. Below is a screenshot of a portion of
this page:

Figure 6, Library download page

As you can see from Figure 6, the precompiled libraries are sorted by the category they
fall into. Users can check the boxes next to the ones they are interested in, or view
further information by clicking the “Details...” link. Finally, by clicking the
“Download selection” button, the system fetches all the individually selected
precompiled libraries, packages them into an archive and sends it to the user.

Algorithm description page

The algorithm description page lists all the relevant information about an algorithm,
outlining details such as general information, available methods and benchmarks.
The screenshot below shows one such page:

Figure 7, algorithm description page

As seen from Figure 7, general information about the algorithm is followed by a table
that shows the exact methods that can be accessed within the library. It also lists the
parameter data types that should be used and any special rules that need to be
followed. Finally, the graph under the benchmark heading displays a line-chart that
compares the performance of the CUDA algorithm to its CPU equivalent, as well as
the information about the system the benchmark was run on.

Self-service compiler

The self-service compiler exposes the automated compilation system to the outside
world. It allows anyone to submit a CUDA based algorithm and immediately receive
a compiled JNI library that can be consumed from a JAVA program. The screenshot
below shows the page:

Figure 8, self-service compiler

I wanted to keep the self-service compiling functionality as simple as possible, ideally
asking for the least possible input. This required a convention-over-configuration
approach, as there are many factors that determine how the SWIG interface file
should be generated. Therefore, there are certain rules that need to be followed when
submitting the source, outlined in the page’s opening paragraph. These will be
explained in greater detail later on.

Community / Open Source aspects
Finally, the last major component of the website is the community aspect of it. In
order to make it accessible and intuitive, on a platform that people are used to, Trac
[11] was installed. The screenshot below shows one of its pages:

Figure 9, TRAC

Trac is a well established system for project and source code management. It consists
of a Wiki, an issue tracking mechanism, a complete bug/ticket system, source control
access and browsing, roadmaps and more. Overall it’s a practical way to keep the
development of an open source project under control and make it consistent with an
obvious plan and roadmap.

With these major components in place, the website should serve as a good starting
point for a person to get introduced to the idea behind the project, the service it
offers, and the various algorithms that are precompiled and readily downloadable.

Design Structure

System Components

Operating System
The first choice that had to be made was which operating system to use as the
platform. The CUDA SDK and drivers are compatible with all major operating
systems – Windows, Linux and MacOS. However, since this project required various
technologies that had to be streamlined into a single process flow, all running behind
a web server, Linux was chosen, primarily for its versatility when it comes to shell
scripting and its excellent support for scripting languages in general.

All the major Linux distributions are supported by CUDA, but we decided to use
Ubuntu for a couple of reasons. Ubuntu is one of the most popular Linux distributions,
and as such, almost every product and library is guaranteed to support it. Also, the
community support is top-notch, as the forums and mailing lists are bursting with
activity. All of this ensured that there would be a good chance of overcoming any
potential, seemingly insurmountable, problems.

Finally, Ubuntu comes with a very good software manager that makes the installation
of web servers, databases, libraries and programming languages a breeze. This was
important, as these backing components were not a research part of the project, so
their deployment needed to be as quick and painless as possible.

CUDA Driver and Compiler


Once the operating system was in place, the second required component was the
CUDA driver, toolkit and software development kit.

Unfortunately, the installation is not as quick and painless as one would hope. At the
time the project was started, the only Ubuntu release that CUDA supported was 8.04.
The installation itself was done quickly using a Debian package, but getting the driver
and compiler to run successfully was a long and tedious process.

CUDA doesn’t ship with all its dependencies, and neither does Ubuntu ship with them
by default. All of these had to be tracked down manually and installed. Once these
were in place, the NVIDIA display driver had to be installed, which meant modifying
the module installation configuration file and replacing the default driver with the
NVIDIA one.

Finally, certain environment variables had to be added to the shell profile scripts in
order to successfully resolve all the dependencies during compile time. With all of
this in place, it was possible to compile and run CUDA programs that ship with the
SDK.

The Virtual Machine
A choice had to be made as to which virtual machine to use in order to run the
Windows compile environment. As previously mentioned, this was required in order
to ship the final library with both Linux and Windows compatible algorithms.

The final choice was between VMware and VirtualBox. The latter was chosen mostly
due to the fact that VirtualBox is open source and licensed under the GNU General
Public License [12]. As such, community support is free, and it has been a tried and
tested VM on the Linux platform for a long time.

Windows XP was chosen for its stability, speed and moderate demand for resources.
The installation of the CUDA compile environment was quick and painless. The
Express edition of Visual C++ was installed as the backing compiler for CUDA.

In order to facilitate communication between the virtual machine and the host OS, a
dedicated port was opened on both ends to exchange data. The virtual machine had to
share the host’s network interface card in order to make this possible.

With all of this in place, the basic building blocks were ready. This was a solid
platform on top of which the remainder of the system could be built.

LAMP
In order to quickly get up to speed in terms of hosting the page, the LAMP
environment was chosen. LAMP is an acronym for Linux, Apache, MySQL, and
PHP. This is a widely used and standard environment for building rich web
applications.

Ubuntu is often used as a web server, so installing the LAMP environment was quick
using the Synaptic package manager. Once the software was installed, Apache’s
virtual hosts had to be configured in order to enable HTTP access to specific folders on
the system. To simulate the on-line environment more closely, the hosts file was
modified so that a working domain could be assigned to a virtual host on my local
machine. This way the on-line experience could be replicated without relying on an
internet connection.

MySQL would be used for storing the news, algorithm information and TRAC.
Finally, PHP would be used for programming the actual web application.

Zend Framework
Writing a web application by starting with a clean slate is a perfectly valid method,
however, with the incredible growth of the Internet and the number of pages hosted,
there have been many advancements in web application building techniques. There
are countless frameworks out there that help kick-start the development process in
a structured and organized way.

More than anything, the MVC (model – view – controller) design pattern has been
shown to map very well to web based applications. This clear separation of the
display logic, business logic and database logic usually results in much cleaner code
and overall better organization.

When choosing an appropriate framework for backing my web application, many
aspects had to be considered. In the end, Zend Framework was chosen as a stable,
feature-rich and extensible platform. Also, Zend was developed by the makers of PHP
itself so it uses best practices when it comes to the implementation and extending of
the PHP object model.

Although with ZF it is not as quick and easy to get an application off the ground as
with some other frameworks out there, it was chosen because it offers numerous
modules and libraries that would speed up deployment and development in the long
run.

Shell Scripting and Backing Services


With the OS, compile environment, web server, database and virtual machine in
place, all of the separate components still had to come together behind the scenes in
order to provide the required functionality.

Python is used for executing the compile directives, moving files around and setting
up directories. Python is an excellent choice for shell scripting because it is very
minimalistic, has excellent support for executing OS commands, is backed by
numerous third-party libraries and has very well written documentation. Python was
chosen over Perl due to personal preference, since the two offer similar functionality,
although Python is considered the more modern and slick of the two.

In order to get a better understanding of the system and how it comes together, Figure
10 shows a high-level overview:

Figure 10, system overview

CUDA algorithms
With all of the above components working together, everything was in place for
supporting the library of JAVA compatible CUDA algorithms. In order to try and
demonstrate the potential versatility of the library, we tried to cover a wide range of
areas where these algorithms could be applied. A high level overview of the
algorithms, their functionalities, performance and any pitfalls will be outlined below.
Note: all tests were executed on an NVIDIA 8600M-GT 512MB graphics card and a
Core 2 Duo T7500 CPU (2.2 GHz). All of the GPU-labelled algorithms show the
performance of the modified algorithms when called from JAVA through the JNI
interface.
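
As an illustration of how such measurements can be taken from JAVA, the sketch
below times a CPU sort with System.nanoTime(); the GPU-backed calls are timed in
exactly the same way. The gpu_sort name in the comment is a placeholder, not the
actual API of the library.

public class TimingExample {
    public static void main(String[] args) {
        int n = 1 << 20;
        float[] data = new float[n];
        java.util.Random rnd = new java.util.Random(42);
        for (int i = 0; i < n; i++) {
            data[i] = rnd.nextFloat();
        }

        // CPU reference measurement
        long start = System.nanoTime();
        java.util.Arrays.sort(data);
        long cpuMillis = (System.nanoTime() - start) / 1000000L;
        System.out.println("CPU sort took " + cpuMillis + " ms");

        // The GPU-backed call would be wrapped in the same way, e.g.:
        // start = System.nanoTime();
        // gpu_sort.sort(gpuBuffer, n);   // hypothetical wrapped call
    }
}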

Parallel Sort (Bitonic)


The first algorithm implemented was a sort, due to the fact that sorting is a commonly
used function in most applications. It is also a tricky algorithm to parallelize as there
are numerous ways in which this can be done, each one having its advantages and
disadvantages.

The parallel sort implementation was created by Alan Kaatz at the University of
Illinois at Urbana-Champaign. The algorithm is an implementation of the bitonic
sorter, a sorting algorithm designed specifically for parallel machines. A bitonic
sorter is an implementation of a sorting network: it works by comparing and swapping
elements in pairs, and the resulting sorted subsets are then merged.

Figure 11, parallel bitonic sort [13]

Figure 11 clearly shows why bitonic sort, even though it is not the most efficient
algorithm, is very suitable for execution in massively parallel environments. Taking
into consideration that most modern GPUs contain over 100 processors, this can
massively reduce the algorithm’s running time.

The downside of this implementation, though, is how it uses memory. More
specifically, the implementation is dependent on the size of the dataset: its efficiency
reaches its maximum when the number of elements is exactly a power of two (2^n),
and as soon as the element count exceeds 2^n the efficiency halves, producing a step
function as shown in Figure 12:

Figure 12, parallel sort performance

However, even with this seeming inefficiency, the algorithm still outperforms the
fastest CPU quicksort implementation, even on a mid-range graphics card. Certain
improvements could be made to this algorithm in order to maintain a linear
performance curve, but for the purposes of this library, this implementation was
sufficient.
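
Because of this step behaviour, a caller can pad the input up to the next power of two
before invoking the wrapped sort. The sketch below illustrates the idea from the JAVA
side; parallel_sort.sort() is a placeholder for the SWIG-generated method, and a CPU
Arrays.sort call stands in for it so that the snippet runs on its own.

public class PaddedSortExample {
    static int nextPowerOfTwo(int n) {
        int p = 1;
        while (p < n) {
            p <<= 1;
        }
        return p;
    }

    public static void main(String[] args) {
        float[] data = {5.0f, 1.5f, 3.2f, 0.7f, 9.9f};
        int padded = nextPowerOfTwo(data.length);

        // copy into a power-of-two sized buffer, padding with the largest
        // float so that the padding ends up at the tail after sorting
        float[] buffer = new float[padded];
        System.arraycopy(data, 0, buffer, 0, data.length);
        java.util.Arrays.fill(buffer, data.length, padded, Float.MAX_VALUE);

        // parallel_sort.sort(buffer, padded);   // hypothetical wrapped GPU call
        java.util.Arrays.sort(buffer);           // CPU stand-in so the sketch runs

        System.arraycopy(buffer, 0, data, 0, data.length);
        System.out.println(java.util.Arrays.toString(data));
    }
}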

Figure 13 shows the performance difference between the GPU implementation of
parallel sort and a CPU implementation of quicksort in JAVA. As you can see, the GPU
consistently outperforms the CPU, even though quicksort runs in O(n log n) time
whereas bitonic sort runs in O(n log² n) time. This is due to the fact that each
sorting network can run independently on its own processor, effectively reducing the
complexity to O(log² n) when running on n processors, yielding a much better
performance [14].

Figure 13, parallel sort GPU vs CPU performance

GPU-Quicksort
Quicksort is one of the most popular sorting algorithms and needs no special
introduction. It is suitable for large data sets because, as opposed to simpler
sorting algorithms, it does not run in quadratic time on average. There are many
implementations of the quicksort algorithm for the CPU; however, parallelizing it in a
manner that allows it to run on the GPU is far from trivial.

Quicksort had previously been considered an inefficient sorting solution for
graphics processors, but in January 2008, Daniel Cederman from Chalmers University
in Sweden published a paper on how quicksort could be mapped and executed
efficiently on the GPU architecture. The paper demonstrates that GPU-Quicksort often
performs better than the fastest known sorting implementations for graphics
processors, such as radix and bitonic sort, making it a viable alternative for sorting
large quantities of data on graphics processors [15].

The only downside of this implementation, when compared to parallel bitonic sort, is
that it currently only supports integer sorting. Sorting floating point numbers was
disabled due to a problem with how CUDA handles C++ templates. This is expected
to be corrected in one of the upcoming releases.
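
In practice this means the wrapped method only accepts integer keys for now;
floating point data has to fall back to the parallel bitonic sort described earlier. A
minimal usage sketch, with gpu_quicksort.sort() as a placeholder name for the
generated wrapper:

int[] keys = {42, 7, 19, 3, 88, 1};
// gpu_quicksort.sort(keys, keys.length);   // hypothetical wrapped GPU call
java.util.Arrays.sort(keys);                // CPU stand-in so the sketch runs
System.out.println(java.util.Arrays.toString(keys));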

Just like its CPU counterpart, GPU-Quicksort has an average complexity of O(n log n)
[15]. As the number of processors increases, the complexity of the algorithm when
executed on p processors can be expressed as:

O((n / p) log n)

In words, when the number of processors becomes equal to the number of elements in
the sorting set, the complexity drops to O(log n), which is extremely fast for a
sorting algorithm.

Figure 14 shows the performance comparison between JAVA implementations of
CPU and GPU quicksort.

Figure 14, GPU-Quicksort vs CPU-Quicksort

The reason why the advantage is not as obvious for smaller datasets is that CUDA
comes with a lot of overhead. In order to execute an algorithm on a GPU, the CUDA
environment needs to be bootstrapped first, then the memory needs to be allocated on
the GPU, followed by the actual computation and, finally, the transfer of the results
back to the host. All these operations are factored into the final time, so the benefits
really only become obvious for bigger datasets. This generally holds true for all
algorithms; the question is only how computationally intensive and memory-bandwidth
dependent they are. Generally, more computationally intensive algorithms with
minimal memory-bandwidth dependency tend to perform best, as we will see in the
next few examples.

Black-Scholes Option Pricing
The Black-Scholes model is a very popular and widely used economic model, created
in 1973 by Fischer Black, Myron Scholes and Robert Merton. It is a method for pricing
equity options; prior to its development, there had been no standard way to do this
since the creation of organized option trading. In 1997, Scholes and Merton were
awarded the Nobel Prize in Economics for this work (Black had died two years earlier).

The most common definition of an option is an agreement between two parties, the
option seller and the option buyer, whereby the option buyer is granted a right (but not
an obligation), secured by the option seller, to carry out some operation (or exercise
the option) at some moment in the future [16].

The Black-Scholes pricing algorithm greatly benefits from a massively parallel
implementation due to the numerous complex mathematical functions that can be
executed simultaneously on different stocks. The CUDA kernel was taken from the
NVIDIA CUDA SDK [17]. Figure 15 shows the performance of the JAVA algorithms
for Black-Scholes option pricing, executed on the CPU and GPU respectively.

Figure 15, Black-Scholes option pricing, CPU vs GPU

We can see that the performance difference increases dramatically as the dataset
grows; for the largest datasets tested, the CUDA implementation is 20 times faster,
even on a mid-range graphics card. As previously mentioned, this is mainly due to the
fact that this algorithm can be implemented with a low number of memory accesses
and a high number of calculations per stock.
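
A typical call from JAVA supplies one entry per option across several parallel input
arrays and receives the computed prices back in result arrays, which is exactly why
the workload parallelizes so well. The sketch below is purely illustrative: the class
name BlackScholes and the method price() are placeholders for the SWIG-generated
wrapper, whose real signature follows the adapted SDK kernel.

public class BlackScholesExample {
    public static void main(String[] args) {
        int optionCount = 4;
        float[] stockPrice    = {30.0f, 45.0f, 60.0f, 95.0f};   // current underlying prices
        float[] strikePrice   = {35.0f, 40.0f, 55.0f, 100.0f};  // strike prices
        float[] yearsToExpiry = {0.25f, 0.5f, 1.0f, 2.0f};      // time to expiry in years
        float[] callResult = new float[optionCount];
        float[] putResult  = new float[optionCount];

        // Hypothetical wrapped call; riskless rate and volatility passed as scalars:
        // BlackScholes.price(callResult, putResult, stockPrice, strikePrice,
        //                    yearsToExpiry, 0.02f, 0.30f, optionCount);
    }
}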

Fast Fourier Transform
The Fast Fourier Transform (FFT) is an efficient algorithm for computing the discrete
Fourier transform (DFT) and its inverse. A Fourier transform is a calculation needed
to visualise a wave function not only in the time domain but also in the frequency
domain [18]. By observing a wave as a function of frequency, we can analyse it much
more closely than by visualising the amplitude alone. The Fourier transform is an
invaluable tool in the field of electronics and digital communications, and as such, a
widely used algorithm in computer science. There is a whole field of mathematics,
called Fourier analysis, which grew out of the study of Fourier series.

Because of the importance of Fourier analysis, NVIDIA has developed a whole sub-
library containing primarily Fourier transform functions. The chosen implementation
is an improved version developed by Vasily Volkov at UC Berkeley. As with many FFT
implementations, dataset sizes have to be a power of two. Figure 16 shows the
performance of my modified GPU implementation called through JNI when compared
to an equivalent JAVA CPU implementation.

Figure 16, Fast Fourier Transform, GPU vs CPU

The Fast Fourier transform can be applied simultaneously to a number of point sets,
and the point set sizes have to be a power of two. As a general mid-range value, a
512-point set was chosen to test the performance. On average, this is a good
representation of the performance, since the speed-up varied depending on the chosen
number of points per set. For all set sizes, the GPU implementation far outperformed
the CPU equivalent, even with the JNI and CUDA overhead.

Matrix Transform
As the final algorithm to adapt for JNI calls, a 2D matrix transpose program was
chosen. Matrix transpose is a widely used mathematical operation, used in various
fields of science. More importantly, it was an interesting algorithm because it doesn’t
perform any calculations. Rather, all it does is item swapping within a two
dimensional array. As such, it is a good example to show the performance benefits
that can be achieved when shared memory is used efficiently within the thread blocks.
Figure 3 shows the architecture clearly.

Since all threads within a block can access the shared memory concurrently, and
operate on their own sets of data independently, the way this algorithm is implemented
has a profound impact on its performance. A common naive implementation suffers
from non-coalesced writes; threads within a block can instead coordinate their writes
to global memory so that they are combined into coalesced transactions. The optimized
transpose algorithm with fully coalesced memory access and no bank conflicts can be
more than 10x faster for large matrices [19].

Figure 17, matrix transpose, GPU vs CPU

Figure 17 clearly shows the benefits of using the very high speed memory present on
modern graphics cards, as well as exploiting the synchronized memory access features
offered by CUDA. The optimized version runs approximately five times faster than the
CPU version, whereas a naive implementation would have been roughly five times
slower instead. This also illustrates how important knowledge of the underlying
hardware is when developing CUDA algorithms.

Implementation Details

As previously stated, there are many components in this system, and getting them to
work together was an iterative process. Therefore, in order to make the
implementation details clearer, code segments will be explained in the chronological
order of their creation. Even though the entry point to the system is the OS and the
compile environment, their installation process will not be detailed, as it was long,
tedious and consisted of a trial and error approach. Instead, the following chapter
will cover the implementation of the previously mentioned algorithms and the JNI
automation.

Algorithms

SWIG & JNI


In order to make the CUDA algorithms accessible from JAVA, JNI interfaces had
to be written. In earlier chapters it was mentioned that this can be a long and tedious
process, in which mappings have to be created for every method and argument
type that needs to be exposed. To make matters worse, in order to make these callable
through a shared library, special directives and headers are required to stop the
C++ compiler from mangling the function names. The C++ compiler does this
internally because, when all functions from all objects are collected, there could be
clashes between names. This is fine as long as the references are resolved internally
to the mangled names, but it becomes a problem when a function needs to be called
externally.

In order to distance ourselves from these details, and also automate the JNI wrappers
as much as possible we decided to use SWIG. SWIG does just that, through the use of
an interface file that describes all methods that need to be exposed. It’s not as simple
as it sounds though. Since JAVA doesn’t support all the types that C supports, and in
most cases these are not implemented equally, mapping types between JAVA and C
can be very complicated. This is especially true with complex data types like arrays,
pointers, unsigned types and objects.

Thankfully, SWIG comes with some predefined mappings for arrays. Since all of the
algorithms use arrays for holding the data to be processed, this feature was put to use.
The initial prototypes, however, were quite slow and disappointing. The segment below
shows an example of an early prototype of a SWIG interface file:
%module sqrt
%include <arrays_java.i>
%apply float[] {float *};
%{
extern void csquare(float *f);
%}

extern void csquare(float *f);

This is a simple interface that builds JNI wrappers for a C function that squares every
element of a float array passed to it. There are upsides and downsides to this approach.
The upside is that, when this type of mapping is used (through the arrays_java.i
interface file), it is possible to use native JAVA types and pass them directly to the C
function, as shown below:
float[] f = {1, 2, 3};
sqrt.csquare(f);
for (float x : f) {
    System.out.println(x);
}

However, the downside of this mapping is that the conversion from the JAVA native
type to the C array is done internally in the generated JNI wrapper file. What this
means is that once the f variable is passed to the sqrt.csquare function, the JNI
wrapper will allocate a new memory segment and then copy each element to the new
location in a C compatible data structure. It will then pass the reference to that new
memory location to the C algorithm. The same happens, only the other way around,
once the C function has completed execution. This flexibility comes with two
overheads: first of all, for each array passed this way we need double the memory
we would normally need, and secondly, the copying of elements back and forth
between C and JAVA also carries a significant time cost.

For small datasets this would be a perfectly acceptable solution with minimal impact
on the programming logic on the JAVA side. However, in the high-performance,
large-dataset environments in which CUDA is most often used, such a performance hit
could outweigh any possible performance gain of using CUDA in the first place. It
would also lower the maximum dataset size that can be handled, since the memory
requirements double. This was a major hurdle, but further research resulted in the
discovery of an alternative method that would mitigate all of these problems.

Instead of copying the data back and forth, SWIG provides an alternative way of
mapping data structures, by allowing C and JAVA to share the same memory
location. This way, we don’t have to copy the entire data in order to make it available
in the C program, but rather just pass the pointer reference to the location where the
data is stored. This solves all our previous problems, but introduces certain issues as
well. First of all, this method of mapping prevents the use of JAVA in-built types.
Rather, in order to instantiate variables and assign values to them, we have to use
specialized functions provided by the JNI wrapper. For example, for an integer array,
SWIG would generate the following wrappers:
int *new_intArray(int nelements);
void delete_intArray(int *x);
int intArray_getitem(int *x, int index);
void intArray_setitem(int *x, int index, int value);

This provides full functionality for manipulating the data inside these types; however,
there is another caveat. Once the data is passed to the C function, it becomes
immutable. This means that the results of any operations performed on this array
cannot be stored in that same array, since its elements cannot be overwritten.
Therefore, in order to send source data to and receive result data from the C program
using this method, two parameters had to be used – one holding the source data, and
the other simply being an empty, initialized array that stores the resulting elements.
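
From the JAVA side, the generated helpers are used roughly as sketched below: one
array carries the source data and a second, empty array of the same length receives
the results. The module name example and the wrapped method process() are
placeholders, while SWIGTYPE_p_int is the type SWIG typically generates for a raw
int pointer.

// create and fill the source array through the generated helpers
SWIGTYPE_p_int source = example.new_intArray(4);
for (int i = 0; i < 4; i++) {
    example.intArray_setitem(source, i, i * i);
}

// a second, empty array of the same size receives the results
SWIGTYPE_p_int result = example.new_intArray(4);
// example.process(source, result, 4);   // hypothetical wrapped CUDA call
int first = example.intArray_getitem(result, 0);

// the helpers allocate native memory, so it has to be freed explicitly
example.delete_intArray(source);
example.delete_intArray(result);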

The modified interface file, supporting this faster but less flexible method, looks as
follows:
%module transpose
%include "carrays.i"
%array_class(float, floatArray)
%{
extern void transpose(float* h_data, float* result, int size_x, int size_y);
%}

extern void transpose(float* h_data, float* result, int size_x, int size_y);

What this does is map every float pointer argument to the floatArray type, as defined
by SWIG. This creates all the wrappers required for creating and modifying the float
elements within the array, as well as casting it to the pointer representations required
by the CUDA algorithm.
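
On the JAVA side, the %array_class directive produces a small floatArray proxy class
with setitem/getitem accessors and a cast() method that yields the raw float pointer
the wrapped function expects. The sketch below assumes the module and wrapped
function are both called transpose, as in the interface above, and that the matrix
dimensions are a multiple of the kernel's block size:

int sizeX = 256, sizeY = 256;
floatArray source = new floatArray(sizeX * sizeY);
floatArray result = new floatArray(sizeX * sizeY);

// fill the source matrix in row-major order
for (int i = 0; i < sizeX * sizeY; i++) {
    source.setitem(i, (float) i);
}

// call the wrapped CUDA routine through the generated module class
transpose.transpose(source.cast(), result.cast(), sizeX, sizeY);

// read back a single element of the transposed matrix
float first = result.getitem(0);

// the proxy objects own native memory, so release them when done
source.delete();
result.delete();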

After testing the performance with this alternative approach, the results were very
encouraging. Previously, as the number of elements increased, the speed advantage
started fading since more and more data needed to be transferred between JAVA and
the CUDA algorithm, but this method enabled us to see performance increases in the
order of more than 10 times for the first time, making the overall project feasible
again.

CUDA algorithm adaptation


With efficient JNI wrappers now being generated from simple interface files, it was
time to modify the algorithms so that they could be called through the exposed
methods.

This was quite a demanding task, as a lot of code had to be refactored. What had to
remain in the end was a pure algorithm that accepts the input data, runs it on the
GPU and then places the results into a referenced memory location that JAVA can
use. Each CUDA algorithm consists of two main parts:

1. the data and device initialization program
2. the CUDA kernel (executed on the graphics card)

The actual CUDA kernels did not have to be modified in any way, as they are executed
on the graphics card itself and called by the setup program, so there is no direct
communication between the kernel and JAVA. The first part of the program had to be
modified, though. One of the simpler ones is the matrix transpose algorithm, shown
below:
void transpose(float* h_data, float* result, int size_x, int size_y)
{
    // size of memory required to store the matrix
    const unsigned int mem_size = sizeof(float) * size_x * size_y;

    // declare device variables
    float* d_idata;
    float* d_odata;

    // initialize memory on device
    cutilSafeCall( cudaMalloc( (void**) &d_idata, mem_size) );
    cutilSafeCall( cudaMalloc( (void**) &d_odata, mem_size) );

    // copy host memory to device
    cutilSafeCall( cudaMemcpy( d_idata, h_data, mem_size,
                               cudaMemcpyHostToDevice) );

    // setup execution parameters
    dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);
    dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);

    // execute the kernel
    gputranspose<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
    cudaThreadSynchronize();

    // copy result from device to host
    cutilSafeCall( cudaMemcpy( result, d_odata, mem_size,
                               cudaMemcpyDeviceToHost) );

    // cleanup memory
    cutilSafeCall( cudaFree(d_idata) );
    cutilSafeCall( cudaFree(d_odata) );
}

Let’s run through this quickly, as the base logic used was the same across all
algorithms, even though some were a lot more complex due to the usage of structs
(custom classes) and data types.

As seen at the top, the first thing we have to do is calculate the memory size required
for storing the data. This will be needed in order to allocate the space on the graphics
card. In this particular case, the data passed is a two-dimensional unbounded array.
The h_data variable holds the input data. The size_x and size_y integer variables
specify the number of rows and columns in the matrix we are transposing. Since in C
unbounded arrays are passed as pointer references, there is no way to know how big
they are. In order to stop at the last element and avoid overflowing into potentially
protected memory space (thus crashing the program), these two variables serve as
bounds. The memory size is simply calculated as the product of the number of rows
and columns in the matrix and the size of the float datatype.

The second step involves declaring the variables that will hold the source and resulting
data on the graphics card. Once that has been done, we can use the cudaMalloc
function, which allocates memory on the device. We pass it references to both the
source and destination variables, along with their size.

With the memory space allocated, we can now copy the data sent from JAVA (held in
the h_data variable) to the graphics card’s main memory. This is done using the
cudaMemcpy function. The reason all CUDA calls are wrapped in the cutilSafeCall
helper is so that they fail gracefully if anything goes wrong.

The block under “setup execution parameters” sets the grid size and the number of
threads that run per block, depending on the matrix size. Finally, the gputranspose
method uploads the kernel to the graphics card and executes it on the data previously
supplied. The function call passes the reference to the source and result memory
locations, as well as the dataset size.

Finally, the cudaThreadSynchronize() function waits for the graphics card to complete
the operation. Since the task is performed asynchronously, this ensures that the main
program waits for the results to be ready on the graphics card before proceeding
further. Once this has completed, the results on the graphics card (held in the d_odata
variable) are copied into the blank result variable we passed from JAVA. The last two
calls perform cleanup on the card, flushing any data left over in its memory.

At this point our results are stored starting at the memory position to which the
result variable points. We can now read those results in JAVA using the
aforementioned SWIG wrappers.

Even though some other algorithm implementations are a lot more complicated than
this (GPU-Quicksort and FFT), this example portrays the general logic in a concise
and easy to understand manner.

The Fast Fourier transform algorithm, on the other hand, uses a non-standard type
called float2. This is basically a struct consisting of a pair of floats, used to represent
a point with two coordinates. Since SWIG doesn’t come with a predefined mapping for
this type, the mapping has to be defined manually. The interface below shows this:
%include "arrays_java.i"
JAVA_ARRAYSOFCLASSES(float2)
%module fft
%{
#include "fft.h"
%}

struct float2 {
    float x, y;
};

%extend float2 {
    char *toString() {
        static char tmp[1024];
        sprintf(tmp, "float2(%f,%f)", $self->x, $self->y);
        return tmp;
    }
    float2(float x, float y) {
        float2 *f = (float2 *) malloc(sizeof(float2));
        f->x = x;
        f->y = y;
        return f;
    }
};

As you can see, we first redefine the float2 struct and tell SWIG that it simply
consists of two floats, x and y. The JAVA_ARRAYSOFCLASSES directive tells SWIG to
generate a wrapper that can work with arrays of this struct. Finally, we can extend the
basic struct to give it some more functionality. For example, the default constructor
created by SWIG is blank, so each coordinate would have to be set separately; we
therefore add a constructor that accepts both coordinates at once, shortening the
initialization process. A toString method is also implemented for convenience, so that
the contents can be printed in a simple way. Through the same extension mechanism we
could also throw and catch exceptions, reducing the possibility of random and
untraceable crashes; this is one of the improvements proposed for future versions of
the library.
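
On the JAVA side, the generated float2 proxy class might then be used roughly as
follows. This is only a sketch: the two-argument constructor and toString() come from
the %extend block above, while getX()/getY() follow SWIG's default accessor naming
and are an assumption here rather than code taken from the thesis:

// Sketch of using the SWIG-generated float2 class from JAVA.
float2 point = new float2(1.5f, -2.0f);         // extended constructor
System.out.println(point);                      // should print something like float2(1.500000,-2.000000)
float x = point.getX();                         // default SWIG accessors (assumed)
float y = point.getY();

// Thanks to JAVA_ARRAYSOFCLASSES(float2), arrays of float2 can be built and
// passed to the wrapped FFT functions as ordinary JAVA arrays.
float2[] signal = new float2[1024];
for (int i = 0; i < signal.length; i++) {
    signal[i] = new float2((float) Math.sin(i), 0.0f);
}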

The last addition to the modified algorithms was the init function. It does not have to
be used in order to execute the algorithms, but it is useful when performing batch
runs. One of the issues with CUDA is that it has to bootstrap its runtime environment
the first time it is invoked. This usually happens at the beginning of the program,
when the available graphics cards are detected and the fastest one is chosen. The
start-up time varies from computer to computer, but in my particular case it was 200
milliseconds on average. This was quite a performance hit overall, and it skewed the
results for smaller datasets, where the average running times were well under 50
milliseconds.

In order to exclude this bootstrapping time from the algorithm execution timings, a
helper init function was created for each algorithm present in the library. Calling it
at program start-up creates the CUDA runtime environment straight away, so that it
does not affect subsequent algorithm executions. Here is the body of the init() function:
void init(){
    cudaSetDevice( cutGetMaxGflopsDeviceId() );
    float* d_data = NULL;
    cudaMalloc((void **) &d_data, 1*sizeof(float));
}

This code segment simply finds the fastest graphics card present on the system and
sets it as the active device. It then creates a blank pointer and allocates a single
empty float on the graphics card. This is the point at which the CUDA runtime
environment is bootstrapped, and it does not happen again for as long as the program
is running. This proved very useful for benchmarking the algorithms, and it would be
equally useful in any situation where the user wants to make all preparations up front
so that the algorithm executes immediately when called.
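
As a usage sketch, the init call would simply be made once at application start-up,
before any timed work; the module name used here is an assumption for illustration:

// Warm up the CUDA runtime once so later calls are not charged the bootstrap
// cost (roughly 200 ms on the test machine). The module name is assumed.
System.loadLibrary("transpose");
transpose.init();   // detects the fastest device and bootstraps the runtime

// ... subsequent algorithm calls now run without the start-up penalty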

Reusing the CUDA compiler configuration


With the modified algorithms and JNI wrappers in place, it was time to compile the
source files into a usable library. To do this under Linux, the GCC and CUDA compilers
had to be used. Since the CUDA compiler is really a front end that extends the C
compiler, and needs a C compiler to be present in order to compile CUDA files, it is in
fact fully capable of compiling any file that GCC can compile. Therefore, the initial
compile environment used a single long compile directive issued directly to the CUDA
compiler, which looked something like this depending on the algorithm:
nvcc -shared -I$JAVA_HOME/include -I$JAVA_HOME/include/linux \
     -I/home/kosta/NVIDIA_CUDA_SDK/common/inc -Xcompiler -fPIC \
     -L/usr/local/cuda/lib -lcuda -lcudart \
     -o parallel_sort.so parallel_sort.cu parallel_sort_wrap.c

This worked well for simple examples, but as the complexity of the algorithms
increased it became less and less successful. The biggest issues were making the
functions accessible externally (avoiding name mangling) and properly linking all the
components within the library. Eventually this method became more trial and error than
anything else, which made it unusable in the automated system that had to be built for
the self-service compiler.

In the second attempt to achieve full automation, a compile script was used. This
called the compile directives one by one, creating a more structured approach. First,
all the C files would be compiled by GCC, followed by the CUDA files, and finally the
linker would be run to collect all the intermediate object files into a standalone
library. A script to do this would look as follows:
gcc -fPIC -g -c transpose_wrap.c -I$JAVA_HOME/include \
    -I$JAVA_HOME/include/linux

nvcc -c transpose.cu -o transpose.o

ld -shared -soname libtranspose.so -o libtranspose.so \
    transpose_wrap.o transpose.o

As you can see, the three steps are clearly visible: GCC compiles the JNI wrapper,
nvcc (the CUDA compiler) compiles the CUDA sources, and finally the GNU linker (ld) is
used to create a shared library from the compiled object files. This technique worked
better, but eventually it started failing as well; several algorithms would not compile
successfully because of unresolved reference errors, so this method had to be discarded
too.

Further examination of the CUDA compile directives from the SDK uncovered a
common makefile script. Makefiles are used on Linux for organizing and determining
the sequences in which files are compiled. This is important, especially for bigger
projects, and provides a logical way of precisely handling every step of the
compilation process. The CUDA SDK makefile was very detailed, and covered many
situations in which my compilation techniques failed. Therefore, as a last option this
script was reused and extended by adding the SWIG JNI generation steps, as well as a
final linking step during which everything is combined into a shared library.

The additions to this file included the following directives:


$(TARGET): swig makedirectories $(CUBINS) $(OBJS) Makefile
	$(VERBOSE)$(CXX) $(CXXFLAGS) -o $(OBJDIR)/$(PROJECT)_wrap.cpp.o -c $(SRCDIR)$(PROJECT)_wrap.cpp
	$(VERBOSE)$(LINKLINE)
	$(VERBOSE)mv *.java java/
	$(VERBOSE)javac java/*.java
	$(VERBOSE)rm -f *.linkinfo

cubindirectory:
	$(VERBOSE)mkdir -p $(CUBINDIR)

makedirectories:
	$(VERBOSE)mkdir -p java
	$(VERBOSE)mkdir -p $(OBJDIR)

swig:
	$(VERBOSE)swig -c++ -addextern -java $(PROJECT).i
	$(VERBOSE)mv $(PROJECT)_wrap.cxx $(PROJECT)_wrap.cpp

Basically, the additions instruct the common makefile script to run the SWIG
generation process first, then create the directories for the JAVA files and object
files, and then compile the generated JNI wrapper file. The rest of the process
executed normally and was unchanged. This ensured that the resulting libraries would
have no missing references and that none of the function names would get mangled in
the process. More importantly, it resulted in a more streamlined process, since all
that was required to fully compile a library was a simple Makefile such as the one
below:
#####################################################################
# Build script for project
#####################################################################

PROJECT := transpose

# Cuda source files (compiled with cudacc)
CUFILES := $(PROJECT).cu

# CUDA dependency files
CU_DEPS := $(PROJECT)_kernel.cu

# Rules and targets
include ../common.mk

This is a very clean script with minimal inputs. Like the rest of the project, it
follows a convention-over-configuration approach: as long as the files are named
correctly, all that needs to be specified is the project name, and the rest is derived
from that. Generating these files is therefore a simple matter, which greatly
simplifies the automation effort and removes intermediate steps that could cause
problems in the long run. With all the source files ready and the Linux library
automation complete, it was time to implement the same process on the Windows virtual
machine and obtain the final piece of the puzzle.

The Virtual Machine


The first obstacle standing in the way of successfully adding the Windows VM to the
existing system was the file exchange mechanism. There are many ways in which a VM can
connect to the existing network and its host, and many ways to transfer files between
the two, so the simplest one had to be chosen in order to minimize the number of places
where something could go wrong.

After carefully considering all the options, the final choice was to use the File
Transfer Protocol (FTP). It is a long-established and well-understood method of
transferring files, and as such served as a good starting point. The second obstacle
was getting the host FTP client and the virtual machine's FTP server to communicate.
This could have been done through the network infrastructure, but that was not
necessary since the virtual machine resided on the same computer as the host, so a
direct connection between the two made more sense. This was done by forwarding a port
between the host and the guest: for the purposes of FTP communication, guest port 21
was mapped to host port 8021. The following commands made this possible:
VBoxManage setextradata "WindowsXP" "VBoxInternal/Devices/pcnet/0/LUN#0/Config/guestftp/Protocol" TCP

VBoxManage setextradata "WindowsXP" "VBoxInternal/Devices/pcnet/0/LUN#0/Config/guestftp/GuestPort" 21

VBoxManage setextradata "WindowsXP" "VBoxInternal/Devices/pcnet/0/LUN#0/Config/guestftp/HostPort" 8021

With the guest and host being able to communicate with each other, it was time to
install the necessary FTP servers. The Cerberus FTP Server [20] was used on the
Windows machine in order to accept any incoming connections.

The last step was to write a script that would upload the source files via FTP to the
VM, call a script on the VM that would find these files and compile them, place the
result back in the FTP home directory, and notify the host script that the compiled
library was ready to be picked up. The host script would then download the resulting
package and extract it to the appropriate location. To do this, Python scripts were
used on both the host and the VM. First, a snippet from the host script:

# Snippet only: the unzip() helper and the tmp, algorithm, cwd and f variables
# are defined earlier in the full host script.
import os
import urllib
from ftplib import FTP

ftp = FTP()
ftp.connect('localhost', 8021)   # forwarded by VirtualBox to port 21 on the guest
ftp.login()
ftp.storbinary("STOR " + os.path.basename(f.name), f, 1024)
ftp.quit()
os.remove("%s/%s.zip" % (tmp, algorithm))

# Trigger the compile script on the VM over HTTP
url = "http://127.0.0.1:8080/ccompiler/index.py?f=%s.zip" % algorithm
urllib.urlopen(url)

# Download the compiled package, extract it and clean up
ftp.connect('localhost', 8021)
ftp.login()
ftp.retrbinary("RETR %s_compiled.zip" % algorithm,
               open("%s/%s_compiled.zip" % (tmp, algorithm), 'wb').write)
ftp.delete('%s_compiled.zip' % algorithm)
ftp.quit()
unzip("%s/%s_compiled.zip" % (tmp, algorithm),
      "%s/algorithms/%s/windows" % (cwd, algorithm))
os.remove("%s/%s_compiled.zip" % (tmp, algorithm))

This is not the complete script, but it shows the logic. The host connects to the
VM's FTP server, which is reachable through port 8021 on the host, and uploads the zip
package with all the required sources. Once this is done, it calls the Python script on
the VM over HTTP, passing it the name of the algorithm it just uploaded (this script
will be explained in more detail later on). Once that has completed, the host connects
to the FTP server again and downloads the compiled package to a temporary area.
Finally, it extracts the package and cleans up any files that are no longer required.
The following is the snippet from the script running on the VM:
snippet from the script running on the VM:

# Snippet only: the params request object and the unzip() helper are defined
# elsewhere in the full script, which runs as a mod_python handler.
import os
import glob
import zipfile

def handler(req):
    req.content_type = 'text/html'
    home = 'C:/wamp/www/ccompiler'
    filename = params.getfirst("f")
    algorithm = os.path.splitext(filename)[0]

    # Extract the uploaded sources into a working directory
    unzip('c:/ftproot/%s' % (filename), '%s/extracted/%s' % (home, algorithm))
    os.chdir('%s/extracted/%s/' % (home, algorithm))

    # Use the default single compile directive unless a make.bat is supplied
    files = glob.glob(os.path.join('%s/extracted/%s/' % (home, algorithm), 'make.bat'))
    if len(files) == 0:
        os.system('nvcc.exe -shared -I"C:\Program Files\NVIDIA CUDA SDK\common\inc" '
                  '-I"C:\Program Files\Java\jdk1.6.0_12\include" '
                  '-I"C:\Program Files\Java\jdk1.6.0_12\include\win32" -IC:\CUDA\include '
                  '-l"C:\CUDA\lib\cudart" -l"C:\CUDA\lib\cutil32" '
                  '-o %s.dll %s.cu %s_wrap.cpp' % (algorithm, algorithm, algorithm))
    else:
        os.system('make.bat')

    # Package the resulting DLL and place it back in the FTP home directory
    zip = zipfile.ZipFile('c:/ftproot/%s_compiled.zip' % algorithm, 'w')
    for f in os.listdir(os.getcwd()):
        if f.find('.dll') != -1:
            zip.write('%s' % f)

As you can quickly see from this script, the Windows environment actually uses a
single compile directive by default. This was done because the makefile script from
Linux could not be reused. In case a project is too complex to compile with that
simple directive, an optional make.bat file with the required directives can be
supplied; this is used for the FFT algorithm. The rest of the script simply takes care
of extracting the archive to a working directory, locating the library once compilation
has finished, and then packaging it and moving it back to the FTP home directory, where
it can be picked up by the host.

Test suites
In order to test the performance, a benchmark program was written for each
algorithm. The benchmarks consist of the following components:

- CPU equivalent algorithm to compare the performance against
- Comparison algorithm for finding the median difference between the two resulting datasets
- Initialization of data and timings of each run
- CSV file storing the results

Finally, a shell script is used to compile all the required files, run the benchmarks
and tests, and plot the CSV files on a graph using gnuplot. Let us look at one such
benchmark suite; it should give a good indication of how the algorithms are supposed to
be used in a real-world example.

The first step is to load the library, which is done dynamically at runtime. As
mentioned in the first chapter, JAVA has two ways of doing this: it can either resolve
a library from an absolute file path, or from a plain library name. The second approach
is a lot more flexible because it comes with some extra functionality. If the library
is loaded on a Windows platform, the JVM will automatically append the .dll extension
and try to load it that way; if the JVM is running on Linux, it will prepend “lib” and
append .so to the given name, as per convention. The downside is that the library we
are calling needs to reside in one of the directories specified on the
java.library.path. Below is the actual statement:
System.loadLibrary("blackscholes");

This gives us instant access to any methods available within that library.
Initialization of the elements is quite straightforward, but it shows that we cannot
use the standard JAVA types for passing the dataset to the C library:
floatArray cudaResultCall = new floatArray(size);

Instead of a standard JAVA float[], we need to use the SWIG-provided floatArray class.
Once the elements have been initialized, we simply run the algorithm on both the GPU
and the CPU, timing the two runs separately. To reduce the effect of background tasks
influencing the execution times, each test is executed several times and the average
running time is taken.
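
A sketch of how such a timed, averaged run might look is shown below; RUNS and the
runOnGpu()/runOnCpu() helpers are hypothetical placeholders standing in for the wrapped
CUDA call and the pure JAVA reference implementation:

// Sketch: averages execution time over several runs to smooth out background
// noise. runOnGpu()/runOnCpu() are stand-ins, not methods from the library.
public class TimingSketch {
    static void runOnGpu() { /* call into the SWIG-wrapped CUDA library here */ }
    static void runOnCpu() { /* run the JAVA reference implementation here */ }

    public static void main(String[] args) {
        final int RUNS = 10;
        long gpuTotal = 0, cpuTotal = 0;
        for (int run = 0; run < RUNS; run++) {
            long start = System.currentTimeMillis();
            runOnGpu();
            gpuTotal += System.currentTimeMillis() - start;

            start = System.currentTimeMillis();
            runOnCpu();
            cpuTotal += System.currentTimeMillis() - start;
        }
        System.out.println("GPU average: " + gpuTotal / (double) RUNS + " ms");
        System.out.println("CPU average: " + cpuTotal / (double) RUNS + " ms");
    }
}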

Finally, in order to ensure that we are getting correct results from the graphics card,
the returned dataset is compared to the reference result executed on the CPU. For
algorithms that deal with integers or don’t modify the actual data, a simple one to one
comparison is sufficient. However, with algorithms that perform complex
mathematical operations on floating point values, we need to ensure that the resulting
data is close enough to the reference. For the Black-Scholes and fast Fourier
transform algorithms, this was necessary. The algorithm chosen for the comparison was
the L1 norm, also known as the rectilinear distance. By calculating the L1 distance
between the GPU result and the reference result, we can check whether it falls within
the generally accepted margin of error for floating point operations. The method for
this is shown below:
//Calculate L1 (rectilinear) distance between CPU and GPU results
public static double L1norm(float[] reference, floatArray cuda){
    double sum_delta = 0;
    double sum_ref = 0;
    for(int i = 0; i < reference.length; i++){
        double ref = reference[i];
        double delta = Math.abs(reference[i] - cuda.getitem(i));
        sum_delta += delta;
        sum_ref += Math.abs(ref);
    }
    return sum_delta / sum_ref;
}

If the distance is within this margin, the GPU result is accepted and considered valid.
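
Continuing the benchmark code above, the check might be applied along the following
lines; the 1e-6 tolerance used here is an assumed, illustrative threshold for
single-precision results, not a figure taken from the original text:

// Accept the GPU result only if its L1 distance from the CPU reference is
// below an assumed tolerance (1e-6 is illustrative only).
double distance = L1norm(cpuReference, cudaResultCall);
if (distance < 1e-6) {
    System.out.println("PASSED, L1 distance = " + distance);
} else {
    System.out.println("FAILED, L1 distance = " + distance);
}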

With all the timings for the various dataset sizes completed, and all the tests passed,
the last step is to read the produced CSV file and plot a graph. gnuplot was chosen for
this purpose since it is a versatile and powerful tool for reading CSV files and
displaying graphs. A small script is required in order to specify certain parameters,
as shown below:
set title "2D Matrix Transpose"
set xlabel "Elements"
set ylabel "Milliseconds"
set logscale x 2
set data style linespoints
set grid ytics
set terminal png size 640,480
set output "benchmark.png"
plot "benchmark.txt" using 1:2 title "CPU", \
     "benchmark.txt" using 1:3 title "CUDA" ls 3
set terminal wxt
set output
replot

Using a script like this, it is possible to define the axis labels, graph title, size
of the output file, input files, type of graph and so on. The above script produces the
graphs that were shown in the previous chapter under the CUDA Algorithms heading.

To tie all of this together, a shell script was written to perform the clean-up,
compiling, building, execution and graph display for any given algorithm. This was done
to enable quick profiling of any changes and tweaks. It is shown below:
#!/bin/bash
if [ $# -ne 1 ]; then
    echo 1>&2 Usage: $0 algorithm_name
    exit 127
fi
algorithm=$1

# Recompile the test classes if any are missing or out of date
if [ `find . -name "*.class" | wc -l` -lt `find . -name "*.java" | wc -l` ]; then
    rm -f algorithms/$algorithm/test/*.class
    javac -classpath algorithms/$algorithm/java/:algorithms/$algorithm/test/ \
        algorithms/$algorithm/test/*.java
fi

java -Xms128m -Xmx256m \
    -Djava.library.path=/var/www/cuda4j/algorithms/$algorithm/linux/ \
    -classpath algorithms/$algorithm/java/:algorithms/$algorithm/test/ \
    Test

gnuplot plot

The web site

General features
The website is built on top of the Zend Framework, with jQuery as the supporting
JavaScript library for effects and visual polish. In combination, the two form a widely
used and thoroughly tested platform for building feature-rich, modern web applications.
A MySQL database is used for storing the algorithm information.

Getting the Zend Framework up and running is probably the most time-consuming
operation, because it requires a bootstrap file. This file determines everything about
the running instance: the database configuration, directory structure, front controller
definitions, session information and more. The following sections cover the library
download page, the self-service compiler and TRAC.

Library download page
The purpose of this page is to let users hand-pick the library components they are
interested in. This “build it yourself” approach has proven popular with many
successful frameworks and libraries on the internet, so the same approach was taken
here. This way, users get exactly what they need, keeping clutter and wasted bandwidth
to a minimum.

Figure 6 shows the screenshot of this page. All the individual algorithms are grouped
under category headings for easier navigation. A checkbox is positioned next to each
one for quick and easy selection. Finally the download button submits the user
selections to the web server and returns a zip with the packaged choices. The PHP
code below performs this functionality:
public function downloadAction(){
    if($this->_request->getPost("lib")){
        $libs = $this->_request->getPost("lib");
        $file = "/tmp/".session_id().".zip";
        $command = "zip -q ".$file;
        foreach($libs as $lib){
            $command .= " algorithms/".$lib."/linux/*.so"
                      . " algorithms/".$lib."/windows/*.dll"
                      . " algorithms/".$lib."/java/*.class"
                      . " algorithms/".$lib."/test/Test.java";
        }
        if(file_exists($file)) @unlink($file);
        exec($command);
        header("Pragma: public");
        header("Expires: 0");
        header("Cache-Control: must-revalidate, post-check=0, pre-check=0");
        header("Cache-Control: private", false);
        header("Content-Type: application/zip");
        header("Content-Disposition: attachment; filename=\"cuda4jlib.zip\";");
        header("Content-Transfer-Encoding: binary");
        header("Content-Length: ".filesize($file));
        readfile($file);
        exit();
    }
}

The method first reads the posted form data. It then constructs a zip command that
compresses all the selected algorithms, iterating through the user's choices and adding
each one to the archive. Finally, it sets the HTTP headers necessary for sending a file
to the user and streams the archive back. The user receives an archive containing all
the classes and native libraries needed to run the chosen algorithms.

Self-service compiler
The self-service compiler is one of the main features of the website. Getting the
compile environment in place was a big task to start with, but most of the effort went
into the automation of the JNI wrapper generation and synchronized compiling on
both Windows and Linux platforms. Exposing this functionality to the outside world
was one of the key drivers behind the project. With the ability to perform self-service
compiling, a user can write CUDA algorithms without writing a single line of JNI glue
code, or having to install the CUDA compile environment and the backing C compilers.
This is a great time saver and lowers the barrier to entry for JAVA/CUDA programming
considerably. The PHP code below shows how the JNI wrapper generation and CUDA
compilation are kicked off from the web application:
$formData = $this->_request->getPost();
if ($form->isValid($formData)) {
    // success - do something with the uploaded file
    $uploadedData = $form->getValues();

    $project = $uploadedData['name'];
    $exposed = $uploadedData['exposed'];
    $mappings = $uploadedData['mappings'];

    $fullFilePath = $form->file->getFileName();
    $destinationPath = APPLICATION_PATH.'/data/uploads/'.session_id().'/'.$project;

    if(!file_exists($destinationPath)) mkdir($destinationPath);

    @exec("unzip " . $fullFilePath . " -d " . $destinationPath);

    // Swig interface file generation
    $swigfile = $destinationPath."/".$project.".i";
    $fh = fopen($swigfile, 'w') or die("can't open file");
    fwrite($fh, $this->generateSwigInterface($project, $exposed, $mappings));
    fclose($fh);

    // Linux makefile generation
    $makefile = $destinationPath."/Makefile";
    $fh = fopen($makefile, 'w') or die("can't open file");
    fwrite($fh, $this->generateMakefile($project));
    fclose($fh);

    chdir(APPLICATION_PATH.'/data');

    copy("/var/www/cuda4j/algorithms/common.mk",
         APPLICATION_PATH.'/data/uploads/'.session_id().'/common.mk');

    exec("python compile.py ".session_id()." ".$project);

    $this->downloadAction($project, $destinationPath);

    exit();
}

Since most of the system was automated to begin with, all that remains is some simple
code that makes system calls to the existing scripts, which do the bulk of the work.
All the earlier effort paid off in the end.

Figure 8 shows the self-service compiler form. It takes minimal input, but it assumes
certain things about the submitted archive. The file must be submitted as a zip
containing all the source files that need to be compiled. Also required is a C header
file declaring all the functions that are to be exposed, meaning accessible in the
final JAVA-compatible library. From this header file the compiler knows how to
construct the JNI wrappers around the library, so if a submitted archive does not
contain one, the system rejects it. The project name field is self-explanatory, which
leaves “pointer mappings” as the only ambiguous one. In this early prototype of the
application, only simple SWIG interface file generation is supported, and with it only
one kind of type mapping. This is a quick way of handling arguments referenced as
pointers: for example, a float pointer can be mapped to a floatArray if that is its
intended use. Future versions should offer a more fine-grained way of editing the SWIG
interface file, but as a proof of concept the form simply accepts a comma-separated
list of types to map.
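
To illustrate what that single mapping does, consider a hypothetical exposed
declaration void scale(float* data, int n); with float listed in the pointer mappings
field, the generated binding would accept a SWIG floatArray instead of a raw pointer.
The function and module names below are invented purely for the example:

// Hypothetical example of calling a function whose float* parameter was
// mapped to floatArray by the self-service compiler.
floatArray data = new floatArray(1024);
for (int i = 0; i < 1024; i++) {
    data.setitem(i, i * 0.5f);
}
myproject.scale(data, 1024);        // myproject is the assumed module class
float first = data.getitem(0);      // results are read back the same way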

Project Tracking (TRAC)


The final component of the site is the project tracker, a much-needed part of any
open-source application or library. There is no need to reinvent the wheel in this
area, as there are many excellent free applications for project management. TRAC is no
exception and is probably the most widely used platform in the open-source community,
so it was the system of choice for our library too. Its deployment on Ubuntu is
relatively straightforward. Once set up, it was connected to the Subversion (SVN)
repository where all the algorithm code is checked in. This enables browsing and
comparing of revisions from within TRAC itself. It can also tag major release versions
and streamline the roadmaps, milestones and bug logging into one harmonious system.
With TRAC in place, our system was ready for contributions from any interested parties.

Conclusion

Meeting the Objectives


This project was a venture into the unknown. The idea was to make the benefits of
NVIDIA CUDA technology available to a wider audience in a ready-to-use package. The
feasibility of turning this idea into reality was unknown at the time the project was
initiated. As such, it was an incremental process, with small obstacles being overcome
every week. It was also quite an ambitious undertaking, since nothing of the sort had
been done before. There are several projects, such as JACUDA [21], that try to achieve
a similar goal, but they are complicated to set up, have numerous dependencies without
which they cannot be used, and offer a very limited feature set. We, on the other hand,
did not want any compromises, but rather a system that even a JAVA beginner could use.

With that said, it wasn't a smooth process either. There were many situations in which
it was necessary to take a step back and rethink parts of the system. Certain aspects
had to be redeveloped from scratch several times in order to achieve the required
performance and flexibility. It is also important to note that CUDA is a brand new
technology, barely over two years old, so supporting documentation and technical papers
were difficult to procure. The obtained materials mostly consisted of the official SDK
examples and documentation, and some stray university courses with on-line materials.
In diving into the JAVA world and trying to unify the two, we were left on our own.

With that in mind, I can say with confidence that most of the initial objectives were
met. Not only that, but progress was incremental and on time for every week's milestone
meeting. Throughout, the emphasis was on producing a successful proof of concept rather
than a fully polished product. There are still many areas left for improvement, such as
safer error handling, more advanced type mapping for the self-service compiler and a
wider choice of algorithms in the precompiled library showcase. This only shows that
the project has the potential to grow, and with sufficient help and support from the
open-source community it could turn into a versatile product that many would find
useful.

Taking a different approach


The method used for achieving the project objectives was chosen early on. There were
many paths that could have been taken, and only one could be chosen. But as is usually
the case with complex systems, each solution had its advantages and disadvantages.
Looking back, several different approaches might have been taken, depending on the
intended usage.

First and foremost, the fact that these libraries are platform-dependent and only
support Linux and Windows is far from perfect considering the versatility of the JVM.
Secondly, the algorithms in their current state are very vulnerable to erroneous input
and can crash quite easily, since there is presently no mechanism for throwing
exceptions. To reduce these drawbacks, a library called JCUBLAS [22] could have been
used to rebuild the algorithms from scratch, entirely in JAVA. JCUBLAS provides JNI
wrappers for the entire NVIDIA CUDA BLAS (basic linear algebra subprograms) set of
functions, so any CUDA program that uses BLAS functions can be rewritten in JAVA code.

There are a few downsides to this approach. As previously discussed in detail, this
method copies and duplicates the memory contents between the JVM and CUDA. That process
proved to be very slow for larger datasets, and in most cases it completely outweighed
any benefit of using CUDA in the first place. Secondly, such an implementation suffers
from the same problem as our own: it is platform dependent, and the JCUBLAS library
specific to the running operating system needs to be present. On the other hand, the
upside of this method is that it is extremely flexible, since no special data types are
needed and no third-party compilers or wrappers are required to compile the programs,
making the end product more robust and less error prone. Considering that CUDA is
primarily a high-performance technology, introducing such a performance hit was deemed
unacceptable, although for algorithms where the speedups are measured in orders of
magnitude this approach might be better suited.

Lastly, there is always a chance that JAVA will incorporate a CUDA-consuming library
into the JRE package itself. Naturally, this would be the best solution, as it would
remove all of the overheads associated with the alternative methods and also provide
the much-needed platform independence and stability. However, due to licensing reasons
and the fact that CUDA is not open source and uses its own proprietary compiler, the
chances of this happening in the near future are minimal. As such, an intermediary
solution such as the one achieved with this project will be a welcome addition to the
growing CUDA community.

Bibliography

[1]. Dean, Jeffrey and Ghemawat, Sanjay. MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004.

[2]. Client Statistics by OS. Folding@Home. [Online] 16 May 2009. [Cited: 17 May 2009.] http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats.

[3]. Fact Sheet & Background: Roadrunner Smashes the Petaflop Barrier. IBM Press room. [Online] IBM, 09 June 2008. [Cited: 17 May 2009.] http://www-03.ibm.com/press/us/en/pressrelease/24405.wss.

[4]. Vaquero, Luis M. A break in the clouds: towards a cloud definition. ACM SIGCOMM Computer Communication Review. 2009, Vol. 39, 1.

[5]. Moore's Law: Made real by Intel® innovation. Intel. [Online] Intel. [Cited: 17 May 2009.] http://www.intel.com/technology/mooreslaw/.

[6]. Intel will demo its first multi-core CPU at IDF. EE Times. [Online] United Business Media, Sept 2004. [Cited: 17 May 2009.] http://www.eetimes.com/news/semi/showArticle.jhtml?articleID=46200165.

[7]. AMD “Close to Metal”™ Technology Unleashes the Power of Stream Computing. AMD Newsroom. [Online] AMD, 14 November 2006. [Cited: 18 May 2009.] http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~114147,00.html.

[8]. NVIDIA CUDA Programming Guide. s.l. : NVIDIA, 2009. 2.2.

[9]. GeForce GTX 295. NVIDIA Web. [Online] NVIDIA. [Cited: 19 May 2009.] http://www.nvidia.com/object/product_geforce_gtx_295_us.html.

[10]. SWIG. Simplified Wrapper and Interface Generator. [Online] University of Chicago. [Cited: 20 May 2009.] http://www.swig.org/.

[11]. Trac. Integrated SCM & Project Management. [Online] Edgewall. [Cited: 20 May 2009.] http://trac.edgewall.org/.

[12]. VirtualBox. Licensing FAQ. [Online] Sun. [Cited: 22 May 2009.] http://www.virtualbox.org/wiki/Licensing_FAQ.

[13]. Chalopin, Thierry and Demussat, Olivier. Parallel Bitonic Sort on MIMD shared-memory computer. Metz : Supelec, 2002.

[14]. Kider, Joseph T. GPU as a Parallel Machine: Sorting on the GPU. Philadelphia : Penn Engineering CIS, 2005.

[15]. Cederman, Daniel and Tsigas, Philippas. A Practical Quicksort Algorithm for Graphics Processors. Goteborg, Sweden : Chalmers University of Technology, 2008.

[16]. Kolb, Craig and Pharr, Matt. GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation. s.l. : Addison-Wesley Professional, 2005.

[17]. Podlozhnyuk, Victor. Black-Scholes option pricing. s.l. : NVIDIA Corporation, 2007.

[18]. Morita, Kiyoshi. Applied Fourier transform. s.l. : IOS Press, 1995.

[19]. NVIDIA Corporation. Matrix Transpose Source Code. 2008.

[20]. Cerberus FTP Server. Cerberus Software. [Online] Cerberus LLC. [Cited: 17 April 2009.] http://www.cerberusftp.com/.

[21]. JaCUDA. Sourceforge. [Online] [Cited: 24 May 2009.] http://jacuda.wiki.sourceforge.net/.

[22]. JCUBLAS. [Online] [Cited: 23 May 2009.] http://javagl.de/jcuda/jcublas/JCublas.html.

[23]. Kirk, David and Hwu, Wen-Mei. CUDA Textbook. 2009.

[24]. Liang, Sheng. The Java native interface. s.l. : Addison-Wesley, 1999.
