
Finite-Difference Time-Domain Method Implemented on the CUDA Architecture

Wei Chern TEE

School of Information Technology and Electrical Engineering University of Queensland

Submitted for the degree of Bachelor of Engineering (Honours) in the division of Electrical and Electronic Engineering

June 2011

Statement of Originality

June 3, 2011

Head of School School of Information Technology and Electrical Engineering University of Queensland St Lucia, Q 4072

Dear Professor Paul Strooper,

In accordance with the requirements of the degree of Bachelor of Engineering (Honours) in the division of Electrical and Electronic Engineering, I present the following thesis entitled Finite-Difference Time-Domain Method Implemented on the CUDA Architecture. This work was performed under the supervision of Dr. David Ireland. I declare that the work submitted in this thesis is my own, except as acknowledged in the text and footnotes, and has not been previously submitted for a degree at the University of Queensland or any other institution.

Yours sincerely,

Wei Chern TEE

Abstract

The finite-difference time-domain (FDTD) method is a numerical method that is relatively simple, robust and accurate. Moreover, it lends itself well to a parallel implementation. Modern FDTD simulations, however, are often time-consuming and can take days or months to complete depending on the complexity of the problem. A potential way of reducing this simulation time is the use of graphics processing units (GPUs). This thesis therefore studies the challenges of using GPUs to solve the FDTD algorithm. GPUs are no longer used just to render graphics; recent advancements in the use of GPUs for general-purpose and scientific computing have sparked an interest in their use for the FDTD method. New graphics processors such as NVIDIA CUDA GPUs provide a cost-effective alternative to traditional supercomputers and cluster computers. The parallel nature of the FDTD algorithm, coupled with the use of GPUs, can potentially reduce simulation time significantly compared to the CPU. The focus of the thesis is to utilize NVIDIA CUDA GPUs to implement the FDTD method. A brief study is made of CUDA's architecture and how it is capable of reducing the FDTD simulation time. The thesis examines implementations of the FDTD method in one, two and three dimensions using CUDA and the CPU. Comparisons of code complexity, accuracy and simulation time are made in order to provide substantial arguments for concluding whether the implementation of FDTD on CUDA is beneficial. In summary, speed-ups of over 20x, 60x and 50x were achieved for one, two and three dimensions respectively. However, there are challenges involved in using CUDA, which are investigated in the thesis.


Acknowledgements

I would like to acknowledge my supervisor, Dr. David Ireland, especially for his patience and guidance. I would also like to acknowledge Dr. Konstanty Bialkowski for his ideas and help.

To my parents, without whom I would be sorely put. Thank you.

Special thanks to Ahmad Faiz, Tan Jon Wen, Christina Lim, Anne-Sofie Pederson, Franciss Chuah, Kenny Heng and Lee Kam Heng. For friendship.


Contents

1 Introduction
  1.1 Thesis Introduction
  1.2 Motivation
  1.3 Significance
  1.4 Thesis Outline
    1.4.1 Chapter 2
    1.4.2 Chapter 3
    1.4.3 Chapter 4
    1.4.4 Chapter 5
    1.4.5 Chapter 6
    1.4.6 Chapter 7

2 Finite-Difference Time-Domain (FDTD)
  2.1 Introduction
  2.2 One-Dimensional FDTD Equations
  2.3 Two-Dimensional FDTD Equations
  2.4 Three-Dimensional FDTD Equations

3 Compute Unified Device Architecture (CUDA)
  3.1 Introduction
  3.2 Computation Capability of Graphics Processing Units
  3.3 Memory Structure
  3.4 Conclusions

4 Literature Review
  4.1 Summary

5 1-D FDTD Results
  5.1 Introduction
  5.2 Test Platform
  5.3 Results
  5.4 Discrepancy In Results
  5.5 Conclusions

6 2-D FDTD Results
  6.1 Introduction
  6.2 Test Parameters
  6.3 Initial Run
  6.4 Updating PML Using CUDA
  6.5 Compute Profiler
  6.6 CUDA Memory Restructuring
  6.7 Discrepancy In Results
  6.8 Alternative Absorbing Boundaries
  6.9 Conclusions

7 3-D FDTD Results
  7.1 Introduction
  7.2 Test Parameters
  7.3 CUDA Block & Grid Configurations
  7.4 Three-Dimensional Arrays In CUDA
  7.5 Initial Run
  7.6 Visual Profiling
  7.7 Memory Coalescing for 3-D Arrays
  7.8 Discrepancy In Results
  7.9 Alternative Absorbing Boundaries
  7.10 Conclusions

8 Conclusions
  8.1 Thesis Conclusions
  8.2 Future Work

Bibliography
A Emory Specifications
B Uncoalesced Memory Access Test
C Coalesced Memory Access Test

List of Figures

2.1 Position of the electric and magnetic fields in Yee's scheme. (a) Electric element. (b) Relationship between the electric and magnetic elements. Source: [4].
3.1 A multithreaded program is partitioned into blocks of threads that execute independently from each other, so that a GPU with more cores will automatically execute the program in less time than a GPU with fewer cores. Source: [6].
3.2 The GPU devotes more transistors to data processing. Source: [6].
3.3 Hierarchy of threads, blocks and grid in CUDA. Source: [6].
3.4 Growth in single precision computing capability of NVIDIA's GPUs compared to Intel's CPUs. Source: [8].
3.5 Hierarchy of various types of memory in CUDA. Source: [6].
5.1 Speed-up for one-dimensional FDTD simulation.
5.2 Throughput for one-dimensional FDTD simulation running on CUDA.
5.3 Plot of 1-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size of 100 cells. The location of the probe is at x = 30.
5.4 Plot of 1-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size of 100 cells. The location of the probe is at x = 30. The plot is centered between time-steps 50 and 90.
6.1 Flowchart for two-dimensional FDTD simulation.
6.2 Screen-shot of Visual Profiler GUI from CUDA Toolkit 3.2 running on Windows 7. The data in the screen-shot is imported from the results of the memory test in Listing 6.4. These results are also available in Appendix B.
6.3 Simple program to determine advantages of cudaMallocPitch().
6.4 Throughput for two-dimensional FDTD simulation running on CUDA.
6.5 Plot of 2-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size 128 x 128. The location of the probe is at cell (25, 25) as specified in Table 6.1.
6.6 Comparison of speed-up for various ABCs in two-dimensional FDTD simulation running on CUDA.
6.7 Comparison of throughput for various ABCs in two-dimensional FDTD simulation running on CUDA.
7.1 Flowchart for three-dimensional FDTD simulation.
7.2 Speed-up for three-dimensional FDTD simulation.
7.3 Throughput for three-dimensional FDTD simulation running on CUDA.
7.4 Plot of 3-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size 160 x 160 x 160 cells. The location of the probe is at cell (60, 60, 60) as specified in Table 7.1.
7.5 Plot of 3-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size 160 x 160 x 160 cells. The location of the probe is at cell (60, 60, 60) as specified in Table 7.1. The plot is centered between time-steps 300 and 350.
7.6 Comparison of speed-up for various ABCs in three-dimensional FDTD simulation running on CUDA.
7.7 Comparison of throughput for various ABCs in three-dimensional FDTD simulation running on CUDA.
7.8 Plot of 3-D FDTD simulation results to compare accuracy between various ABCs. The plots are generated from the result from simulation size 160 x 160 x 160 cells. The location of the probe is at cell (60, 60, 60) as specified in Table 7.1.
7.9 Plot of 3-D FDTD simulation results to compare accuracy between various ABCs. The plots are generated from the result from simulation size 160 x 160 x 160 cells. The location of the probe is at cell (60, 60, 60) as specified in Table 7.1. The plot is centered between time-steps 400 and 700.

List of Tables

5.1 Specifications of the test platform.
5.2 Results for one-dimensional FDTD simulation.
5.3 Discrepancy in results between CPU and CUDA simulation of the one-dimensional FDTD method.
6.1 Set-up for two-dimensional FDTD simulation.
6.2 Results for two-dimensional FDTD simulation on CPU.
6.3 Results for two-dimensional FDTD simulation on CUDA (initial run).
6.4 Results for two-dimensional FDTD simulation on CUDA (using CUDA to update PML).
6.5 Textual profiling results for investigation into memory coalescing.
6.6 Results for two-dimensional FDTD simulation on CUDA (using cudaMallocPitch() for Ex memory allocation).
6.7 Snippet of results from textual profiling on uncoalesced memory access.
6.8 Snippet of results from textual profiling on coalesced memory access.
6.9 Results for two-dimensional FDTD simulation on CUDA (using cudaMallocPitch() and column-major indexing).
6.10 Discrepancy in results between CPU and CUDA simulation of the two-dimensional FDTD method.
7.1 Set-up for three-dimensional FDTD simulation.
7.2 Results for three-dimensional FDTD simulation on CPU.
7.3 Limitations of the block and grid configuration for CUDA for Compute Capability 1.x [6].
7.4 Results for three-dimensional FDTD simulation on CUDA (initial run).
7.5 Results from visual profiling on three-dimensional FDTD simulation.
7.6 Results for three-dimensional FDTD simulation on CUDA (with coalesced memory access).
7.7 Results from visual profiling on three-dimensional FDTD simulation (with new indexing of arrays in kernel).
7.8 Discrepancy in results between CPU and CUDA simulation of the three-dimensional FDTD method.
B.1 CUDA Textual Profiler results from testing uncoalesced memory access.
C.1 CUDA Textual Profiler results from testing coalesced memory access.

1 Introduction
1.1 Thesis Introduction
The finite-difference time-domain (FDTD) modelling technique is used to solve Maxwell's equations in the time domain. The FDTD method was introduced by Yee in 1966 [1], and interest in the topic has increased almost exponentially over the past 30 years [2]. The FDTD method provides a relatively simple mathematical solution to Maxwell's equations, but a simulation can take days or months to complete depending on the complexity of the problem. However, the FDTD technique is also highly parallel, which allows it to leverage parallel processing architectures to achieve speed-ups. This project involves the use of graphics processing units (GPUs), which have parallel architectures, to implement the FDTD method in order to reduce computation time.

Historically, the GPU has been used as a co-processor to the main processor of a computer, the central processing unit (CPU). The GPU is designed with a mathematically intensive, highly parallel architecture for rendering graphics. Modern GPUs, however, are becoming increasingly popular for performing general-purpose computations instead of just graphics processing. By utilizing the GPU's massively parallel architecture and high memory bandwidth, applications running on the GPU can achieve speed-ups of orders of magnitude compared to CPU implementations. The utilization of GPUs for applications other than graphics rendering is known as general-purpose computing on graphics processing units (GPGPU). The thesis therefore focuses on combining the parallel nature of the FDTD technique with the parallel architecture of the GPU to achieve speed-ups compared to the traditional use of the CPU for computation.

1.2 Motivation

The Microwave and Optical Communications (MOC) Group at the University of Queensland requires computer simulations of very large, realistic models of the human anatomy and tissues interacting with electromagnetic energy. The FDTD technique is used for these simulations and takes a long time to run due to the large simulation models. Therefore, the MOC Group is interested in utilizing the GPU to reduce simulation time.

The FDTD technique is gaining popularity in many areas of study such as telecommunications, optoelectronics, biomedical engineering and geophysics. Thus, the outcome of the thesis and the understanding of GPUs will be valuable for reducing computation time in scientific applications. As many engineering practices require design by repetitive simulation, such as in design optimization, a more thorough optimization procedure can be achieved with a faster simulation time.


1.3 Significance

From the results of the thesis, a conclusion will be made on the feasibility of using GPUs as an alternative to the CPU. If the outcome of the thesis is a successful implementation of the FDTD method using the CUDA architecture, the thesis could be a motivation for further research into both the FDTD method and GPU acceleration. While the focus of this thesis is on obtaining speed-ups for the FDTD method, the results could be used as a gauge for the benefits of GPU acceleration in other applications. Other applications can also benefit from leveraging a technology that already exists in our computers today.

1.4 Thesis Outline

A review of the remaining chapters of the thesis is given here.

1.4.1 Chapter 2

This chapter serves to introduce the finite-difference time-domain method. A short introduction is given for the one-, two- and three-dimensional FDTD equations.

1.4.2 Chapter 3

The CUDA architecture is introduced in this chapter. A short summary of the various types of memory available on a GPU is given, and the differences between a CUDA GPU and a CPU are discussed. CUDA's potential in reducing computation time for the FDTD method is also explored.


1.4.3 Chapter 4

A selective review of existing literature is given in this chapter. Focus is placed on literature covering implementations of the FDTD method on GPUs and their findings.

1.4.4 Chapter 5

The implementation of the FDTD method in one dimension is discussed in this chapter. While this chapter serves more as an introduction to programming with CUDA, significant speed-ups of over 20x were achieved. Details of the test platform used throughout the thesis are also listed in this chapter.

1.4.5 Chapter 6

The implementation of the two-dimensional FDTD method is discussed in this chapter. The chapter explores the use of CUDA's Compute Profiler as a tool for determining the efficiency of the kernel code. Memory coalescing requirements are also discussed, and this chapter starts to introduce the complexity involved in programming with the CUDA framework. The use of various absorbing boundary conditions (ABCs) as an alternative to perfectly matched layers (PMLs) is also discussed.

1.4.6 Chapter 7

In this chapter, the FDTD method in three dimensions is explored. The chapter explains the difficulties involved in implementing three-dimensional grids and blocks of threads in CUDA. The use of the CUDA Visual Profiler as an effective tool for debugging CUDA applications is discussed. As with Chapter 6, alternative ABCs are explored in order to produce faster execution times and better throughput.

2 Finite-Difference Time-Domain (FDTD)

2.1 Introduction

The FDTD method is a numerical method introduced by Yee in 1966 [1] to solve the differential form of Maxwell's equations in the time domain. Although the method has existed for over four decades, enhancements to improve the FDTD method are continuously being published [2]. The FDTD method discretizes Maxwell's curl equations in the time and spatial domains. The electric fields are generally located at the edges of the Yee cell and the magnetic fields are located at the centres of the Yee cell faces. This is shown in Figure 2.1. In three dimensions, the number of cells in one time step can easily be in the order of millions: a domain of 100 x 100 x 100 cells already yields a total of one million cells. For example, in [3], a high-resolution head model of an adult male has a total of 4,642,730 Yee cells, with each cell having a dimension of 1 x 1 x 1 mm3.


Figure 2.1: Position of the electric and magnetic fields in Yee's scheme. (a) Electric element. (b) Relationship between the electric and magnetic elements. Source: [4].

2.2 One-Dimensional FDTD Equations

Maxwell's curl equations in free space for one dimension are

    \frac{\partial E_x}{\partial t} = -\frac{1}{\epsilon_0}\,\frac{\partial H_y}{\partial z}    (2.1)

    \frac{\partial H_y}{\partial t} = -\frac{1}{\mu_0}\,\frac{\partial E_x}{\partial z}    (2.2)

Using the finite-difference method of central approximation and rearranging [5], the equations become

    E_x\big|_k^{n+1/2} = E_x\big|_k^{n-1/2} - \frac{\Delta t}{\epsilon_0 \Delta x}\left(H_y\big|_{k+1/2}^{n} - H_y\big|_{k-1/2}^{n}\right)    (2.3)

    H_y\big|_{k+1/2}^{n+1} = H_y\big|_{k+1/2}^{n} - \frac{\Delta t}{\mu_0 \Delta x}\left(E_x\big|_{k+1}^{n+1/2} - E_x\big|_{k}^{n+1/2}\right)    (2.4)

The electric fields and magnetic fields are calculated alternately over the entire spatial domain at time-step n, and this process is continued over all the time-steps until convergence is achieved. Depending on the size of the computation domain, millions of iterations could be required to solve the differential equations. However, the benefit of using the FDTD method is that it only requires the exchange of data with neighbouring cells. The equations above show that E_x|_k^{n+1/2} is only updated from E_x|_k^{n-1/2}, H_y|_{k+1/2}^{n} and H_y|_{k-1/2}^{n}. Similarly, H_y|_{k+1/2}^{n+1} is updated from H_y|_{k+1/2}^{n}, E_x|_{k+1}^{n+1/2} and E_x|_k^{n+1/2}. Therefore, the FDTD algorithm is highly parallel in nature and significant speed-ups can be achieved by harnessing the power of parallel processing. Further reading on parallel FDTD can be found in [4].
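To make this neighbour-only dependence concrete, a minimal C sketch of one time-step of Equations 2.3 and 2.4 is given below, with the constant factors Delta t/(epsilon_0 Delta x) and Delta t/(mu_0 Delta x) folded into pre-computed coefficient arrays; this is essentially the form that reappears in the implementations of Chapter 5.

    /* One time-step over Ncells cells. ca, cb, da and db hold the
       pre-computed coefficients of Equations 2.3 and 2.4. */
    for (int k = 1; k < Ncells; k++)
        ex[k] = ca[k] * ex[k] - cb[k] * (hy[k] - hy[k-1]);

    for (int k = 0; k < Ncells - 1; k++)
        hy[k] = da[k] * hy[k] - db[k] * (ex[k+1] - ex[k]);

Each iteration of either loop touches only a cell and its immediate neighbour, which is why every cell update can in principle be assigned to its own thread.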

2.3 Two-Dimensional FDTD Equations

For the two-dimensional FDTD method in transverse-magnetic (TM) mode, the update equations are

    H_z\big|_{i,j}^{n+1/2} = D_a\big|_{i,j}\, H_z\big|_{i,j}^{n-1/2} + D_b\big|_{i,j}\left(\frac{E_x\big|_{i,j+1/2}^{n} - E_x\big|_{i,j-1/2}^{n}}{\Delta y} - \frac{E_y\big|_{i+1/2,j}^{n} - E_y\big|_{i-1/2,j}^{n}}{\Delta x}\right)    (2.5)

    D_a\big|_{i,j} = \frac{1 - \dfrac{\sigma^*_{i,j}\,\Delta t}{2\mu_{i,j}}}{1 + \dfrac{\sigma^*_{i,j}\,\Delta t}{2\mu_{i,j}}}    (2.6)

    D_b\big|_{i,j} = \frac{\dfrac{\Delta t}{\mu_{i,j}}}{1 + \dfrac{\sigma^*_{i,j}\,\Delta t}{2\mu_{i,j}}}    (2.7)

    E_x\big|_{i,j}^{n+1} = C_a\big|_{i,j}\, E_x\big|_{i,j}^{n} + C_b\big|_{i,j}\left(\frac{H_z\big|_{i,j+1/2}^{n+1/2} - H_z\big|_{i,j-1/2}^{n+1/2}}{\Delta y}\right)    (2.8)

    E_y\big|_{i,j}^{n+1} = C_a\big|_{i,j}\, E_y\big|_{i,j}^{n} + C_b\big|_{i,j}\left(\frac{H_z\big|_{i-1/2,j}^{n+1/2} - H_z\big|_{i+1/2,j}^{n+1/2}}{\Delta x}\right)    (2.9)

    C_a\big|_{i,j} = \frac{1 - \dfrac{\sigma_{i,j}\,\Delta t}{2\epsilon_{i,j}}}{1 + \dfrac{\sigma_{i,j}\,\Delta t}{2\epsilon_{i,j}}}    (2.10)

    C_b\big|_{i,j} = \frac{\dfrac{\Delta t}{\epsilon_{i,j}}}{1 + \dfrac{\sigma_{i,j}\,\Delta t}{2\epsilon_{i,j}}}    (2.11)

where \sigma and \sigma^* are the electric and magnetic conductivity respectively, and \epsilon and \mu are the permittivity and permeability respectively.

2.4 Three-Dimensional FDTD Equations

For the three-dimensional FDTD method, the update equations are

    E_x\big|_{i,j,k}^{n+1} = C_a\big|_{i,j,k}\, E_x\big|_{i,j,k}^{n} + C_b\big|_{i,j,k}\left(\frac{H_z\big|_{i,j+1/2,k}^{n+1/2} - H_z\big|_{i,j-1/2,k}^{n+1/2}}{\Delta y} - \frac{H_y\big|_{i,j,k+1/2}^{n+1/2} - H_y\big|_{i,j,k-1/2}^{n+1/2}}{\Delta z}\right)    (2.12)

    E_y\big|_{i,j,k}^{n+1} = C_a\big|_{i,j,k}\, E_y\big|_{i,j,k}^{n} + C_b\big|_{i,j,k}\left(\frac{H_x\big|_{i,j,k+1/2}^{n+1/2} - H_x\big|_{i,j,k-1/2}^{n+1/2}}{\Delta z} - \frac{H_z\big|_{i+1/2,j,k}^{n+1/2} - H_z\big|_{i-1/2,j,k}^{n+1/2}}{\Delta x}\right)    (2.13)

    E_z\big|_{i,j,k}^{n+1} = C_a\big|_{i,j,k}\, E_z\big|_{i,j,k}^{n} + C_b\big|_{i,j,k}\left(\frac{H_y\big|_{i+1/2,j,k}^{n+1/2} - H_y\big|_{i-1/2,j,k}^{n+1/2}}{\Delta x} - \frac{H_x\big|_{i,j+1/2,k}^{n+1/2} - H_x\big|_{i,j-1/2,k}^{n+1/2}}{\Delta y}\right)    (2.14)

    C_a\big|_{i,j,k} = \frac{1 - \dfrac{\sigma_{i,j,k}\,\Delta t}{2\epsilon_{i,j,k}}}{1 + \dfrac{\sigma_{i,j,k}\,\Delta t}{2\epsilon_{i,j,k}}}    (2.15)

    C_b\big|_{i,j,k} = \frac{\dfrac{\Delta t}{\epsilon_{i,j,k}}}{1 + \dfrac{\sigma_{i,j,k}\,\Delta t}{2\epsilon_{i,j,k}}}    (2.16)

    H_x\big|_{i,j,k}^{n+1/2} = D_a\big|_{i,j,k}\, H_x\big|_{i,j,k}^{n-1/2} + D_b\big|_{i,j,k}\left(\frac{E_y\big|_{i,j,k+1/2}^{n} - E_y\big|_{i,j,k-1/2}^{n}}{\Delta z} - \frac{E_z\big|_{i,j+1/2,k}^{n} - E_z\big|_{i,j-1/2,k}^{n}}{\Delta y}\right)    (2.17)

    H_y\big|_{i,j,k}^{n+1/2} = D_a\big|_{i,j,k}\, H_y\big|_{i,j,k}^{n-1/2} + D_b\big|_{i,j,k}\left(\frac{E_z\big|_{i+1/2,j,k}^{n} - E_z\big|_{i-1/2,j,k}^{n}}{\Delta x} - \frac{E_x\big|_{i,j,k+1/2}^{n} - E_x\big|_{i,j,k-1/2}^{n}}{\Delta z}\right)    (2.18)

    H_z\big|_{i,j,k}^{n+1/2} = D_a\big|_{i,j,k}\, H_z\big|_{i,j,k}^{n-1/2} + D_b\big|_{i,j,k}\left(\frac{E_x\big|_{i,j+1/2,k}^{n} - E_x\big|_{i,j-1/2,k}^{n}}{\Delta y} - \frac{E_y\big|_{i+1/2,j,k}^{n} - E_y\big|_{i-1/2,j,k}^{n}}{\Delta x}\right)    (2.19)

    D_a\big|_{i,j,k} = \frac{1 - \dfrac{\sigma^*_{i,j,k}\,\Delta t}{2\mu_{i,j,k}}}{1 + \dfrac{\sigma^*_{i,j,k}\,\Delta t}{2\mu_{i,j,k}}}    (2.20)

    D_b\big|_{i,j,k} = \frac{\dfrac{\Delta t}{\mu_{i,j,k}}}{1 + \dfrac{\sigma^*_{i,j,k}\,\Delta t}{2\mu_{i,j,k}}}    (2.21)

where \sigma and \sigma^* are the electric and magnetic conductivity respectively, and \epsilon and \mu are the permittivity and permeability respectively.

3 Compute Unified Device Architecture (CUDA)

3.1 Introduction

CUDA is the hardware and software architecture introduced by NVIDIA in November 2006 [6] to provide developers with access to the parallel computational elements of NVIDIA GPUs. The CUDA architecture enables NVIDIA GPUs to execute programs written in various high-level languages such as C, Fortran, OpenCL and DirectCompute. The newest architecture of GPUs by NVIDIA (codenamed Fermi) also fully supports programming through the C++ language [7]. Because of advancements in technology, the processing power and parallelism of GPUs are continuously increasing. CUDA's scalable programming model makes it easy to provide this abstraction to software developers, allowing a program to scale automatically according to the capabilities of the GPU without any change in code, unlike traditional graphics programming languages such as OpenGL [8]. This is illustrated in Figure 3.1.


Figure 3.1: A multithreaded program is partitioned into blocks of threads that execute independently from each other, so that a GPU with more cores will automatically execute the program in less time than a GPU with fewer cores. Source: [6].


Figure 3.2: The GPU devotes more transistors to data processing. Source: [6].

Because the GPU and the CPU serve different purposes in a computer, their microprocessor architectures, as shown in Figure 3.2, are very different. While CPUs currently have up to six processor cores (Intel Core i7-970), a GPU has hundreds; for example, the NVIDIA Tesla 20-series has 448 CUDA cores [7]. Compared to the CPU, the GPU devotes more transistors to data processing rather than data caching and flow control. This allows GPUs to specialize in math-intensive, highly parallel operations, whereas the CPU serves as a multi-purpose microprocessor. Therefore, calculations of the FDTD algorithm are potentially much faster when executed on the GPU instead of the CPU. This is becoming increasingly true as graphics card vendors such as NVIDIA and AMD are now developing more graphics cards for high-performance computing (HPC), such as the NVIDIA Tesla [9].

CUDA has a single-instruction multiple-thread (SIMT) execution model, where multiple independent threads execute concurrently using a single instruction [7]. CUDA GPUs have a hierarchy of grids, threads and blocks as shown in Figure 3.3. Each thread has its own private memory.

Shared memory is available per block, and global memory is accessible by all threads. This multi-threaded architecture model puts the focus on data calculations rather than data caching; thus, it can sometimes be faster to recalculate than to cache on a GPU.

A CUDA program is called a kernel, and the kernel is invoked by a CPU program. The CUDA programming model assumes that CUDA threads execute on a physically separate device (the GPU). The device is a co-processor to the host (the CPU), which runs the program. CUDA also assumes that the host and device have separate memory spaces: host memory and device memory, respectively. Because host and device each have their own memory space, there is potentially a lot of memory allocation, deallocation and data transfer between host and device. Thus, memory management is a key issue in GPGPU computing. Inefficient use of memory can significantly increase the computation time and mask the speed-ups obtained by the data calculations.
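To make the host/device split concrete, the following is a minimal, self-contained sketch (not taken from the thesis code) of the typical structure of a CUDA program: device memory is allocated, data is copied from host to device, a kernel is launched, and the result is copied back. The array size and the kernel body are illustrative only.

    #include <cuda_runtime.h>

    /* Illustrative kernel: each thread scales one element of the array. */
    __global__ void scale(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = 2.0f * data[i];
    }

    int main(void)
    {
        const int n = 1024;
        float h_data[1024];                       /* host memory   */
        for (int i = 0; i < n; i++) h_data[i] = (float)i;

        float *d_data;                            /* device (global) memory */
        cudaMalloc((void**)&d_data, n * sizeof(float));

        /* host-to-device transfer before the kernel can use the data */
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

        /* kernel launch: 4 blocks of 256 threads cover the 1024 elements */
        scale<<<4, 256>>>(d_data, n);

        /* device-to-host transfer to retrieve the result */
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_data);
        return 0;
    }

Every transfer in this pattern crosses the PCI Express bus, which is why, as noted above, minimising host-device traffic is central to obtaining speed-ups.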

3.2 Computation Capability of Graphics Processing Units

The number of floating-point operations per second (flops) of a computer is one measure of its computational ability. This is an important measure, especially in scientific calculations, as it is an indication of a computer's arithmetic capabilities. While a high-performance CPU can have a double precision computation capability of 140 Gflops (Intel Nehalem architecture) [8], an NVIDIA Tesla 20-series (NVIDIA Fermi architecture) GPU has a peak single precision performance of 1.03 Tflops and a peak double precision performance of 515 Gflops [9].


Figure 3.3: Hierarchy of threads, blocks and grid in CUDA. Source: [6].


Figure 3.4: Growth in single precision computing capability of NVIDIA's GPUs compared to Intel's CPUs. Source: [8].

Furthermore, Figure 3.4 shows that the computation capability of GPUs is growing at a much faster pace than that of CPUs. Although the compute capability of a GPU is impressive when compared to the CPU, it has one significant disadvantage in scientific applications: not all GPUs fully conform to the IEEE standard for floating-point operations [10]. Although the floating-point arithmetic of NVIDIA graphics cards is similar to the IEEE 754-2008 standard used by many CPU vendors, it is not quite the same, especially for double precision [6].

In computers, the natural form of representation of numbers is binary (1s and 0s). Thus, computers cannot represent all real numbers exactly. There are standards for representing floating-point numbers in computers, and the most widely used is the IEEE 754 standard.

Accuracy in the representation of floating-point numbers is important in scientific applications. There have been many cases where errors in floating-point representation have caused catastrophes. One example is the failure of the American Patriot Missile defence system to intercept an incoming Iraqi Scud missile at Dhahran, Saudi Arabia on February 25, 1991, which resulted in the deaths of 28 Americans [11]. The cause was determined to be a loss of accuracy in the conversion of an integer to a real number in the Patriot's computer. Other examples of catastrophes resulting from floating-point representation errors can be found in [12]. Thus, accurate floating-point representation is important in scientific computations. Most CPU manufacturers now use the IEEE 754 floating-point standard. As developments in GPGPU continue, GPU vendors will inevitably conform to the IEEE 754 standard for floating-point representation as well. This is evident with the newest Fermi architecture from NVIDIA, which implements the IEEE 754-2008 floating-point standard for both single and double precision arithmetic [7].

3.3 Memory Structure

The CUDA memory hierarchy is shown in Figure 3.5. The different types of memory differ in size, access time and restrictions. Detailed descriptions of the various memory types are available in the CUDA Programming Guide [6] and the CUDA Best Practices Guide [13].

In short, global memory is the largest in size, is located off the GPU chip, and can be accessed by any thread. Because it is off-chip, its access time is the slowest amongst all the types of memory. Shared memory is located on-chip, which makes memory access very fast compared to global memory.


Figure 3.5: Hierarchy of various types of memory in CUDA. Source: [6].

However, shared memory is limited in size and access to it is only at block level; this means that a thread cannot access shared memory that is allocated outside its block. Local memory is located off-chip and thus has a long latency. It is used to store variables when there are insufficient registers available. Other types of memory not shown in Figure 3.5 are registers and constant memory. Registers are located on-chip and are scarce; they are not shared between threads. Constant memory is located off-chip but is cached, and the caching makes accesses to constant memory fast even though it is off-chip.

While global memory is off-chip and has the longest latency, there are techniques available that can reduce the number of GPU clock cycles required to access large amounts of memory at one time. This can be done through memory coalescing. Memory coalescing refers to the alignment of threads and memory. For example, if memory access is coalesced, it takes only one memory request to read 64 bytes of data. On the other hand, if it is not coalesced, it could take up to 16 memory requests depending on the GPU's compute capability. This is further explained in Section 3.2.1 of the CUDA Best Practices Guide [13].
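As an illustration (a sketch, not code from the thesis), the two kernels below read and write the same amount of data, but in the first one the threads of a half-warp touch consecutive words, which the hardware can service with a single coalesced transaction, while in the second one the accesses are strided and are spread over several memory segments.

    /* Coalesced: thread i accesses element i, so a half-warp of 16 threads
       reads one contiguous, aligned 64-byte segment. */
    __global__ void coalescedCopy(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    /* Strided: neighbouring threads access elements that are 'stride' apart,
       so the accesses of a half-warp are scattered over several segments. */
    __global__ void stridedCopy(float *out, const float *in, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n)
            out[i * stride] = in[i * stride];
    }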

3.4 Conclusions

GPUs have a parallel architecture with the capability of executing thousands of threads simultaneously. This gives the GPU an advantage over the CPU when it comes to intensive computations on large amounts of data. With the CUDA framework, developers have access to CUDA-enabled NVIDIA GPUs, which allows them to leverage this computational capability for applications other than graphics rendering. Along with this, CUDA provides a relatively cheap alternative to supercomputing.

4 Literature Review

In the NVIDIA GPU Computing Software Development Kit (SDK), there is sample code for a three-dimensional FDTD simulation. However, the FDTD method implemented is not the conventional FDTD method discussed in Chapter 2. In [14], the code was tested on an NVIDIA Tesla S1070 and a throughput of nearly 3,000 Mcells/s was achieved. While the code is not of much use to the FDTD method explored in this thesis, the throughput it achieved provides a good indication of the NVIDIA Tesla S1070's capability.

In [15], the use of CUDA for the FDTD method is explored and a short summary of CUDA's architecture is given. In this paper, four NVIDIA Tesla C1060s were used for testing, resulting in a throughput of almost 2,000 Mcells/s.

In [16], a two-dimensional FDTD simulation for mobile communications systems was implemented on CUDA. In this paper, the convolutional perfectly matched layer (CPML) was used as the absorbing region. The paper discusses the use of shared memory and the configuration of block sizes for optimal performance. The results from the simulation on an NVIDIA Tesla C870 produced a throughput of 760 Mcells/s. MATLAB was also used for comparison and was slower than both the CPU and CUDA implementations.

In the article by Garland, M. et al. [17], a detailed explanation of CUDA's architecture is given. The article also summarises a few applications that are suited to running on CUDA, such as molecular dynamics, medical imaging and fluid dynamics.

In [8], the author discusses the three-dimensional FDTD algorithm, including its implementation on CUDA. Applications of the FDTD method, such as in microwave systems and biomedicine, are discussed. The author argues that there are an infinite number of ways in which an algorithm can be partitioned for parallel execution on a GPU. On an NVIDIA Tesla S1070, a maximum throughput of 1,680 Mcells/s was achieved, and the optimum simulation size is said to be up to 380 Mcells. By using a cluster of Tesla S1070s, a throughput of over 15,000 Mcells/s was achieved.

4.1 Summary

In summary, while there is much existing literature on accelerating the FDTD method using CUDA, few publications detail the difficulty and complexity involved in developing the program. All of the literature reviewed shows that significant speed-ups were achieved by using CUDA to accelerate the computation. Thus, this thesis investigates the complexity involved in programming with the CUDA framework to achieve these speed-ups.


5 1-D FDTD Results

5.1 Introduction

The development of the code for running the FDTD method using CUDA was done incrementally, starting from one dimension (1-D), followed by 2-D and finally 3-D; the latter two are given in later chapters. For each phase, new methods and techniques were used to provide more speed-up as the algorithm became more complex and the amount of data processed increased. For consistency, the platform used for testing was not changed throughout the development. The specifications of the test platform are listed in the following section. Each chapter also presents the various methods and techniques used to achieve speed-ups. In each phase, various simulation sizes were tested. The results of the simulations are compared and explained, and execution times and speed-ups are listed. To address concerns about the accuracy of the CUDA implementation, the variation in results between CPU and GPU is also detailed.

Throughout the thesis, speed-ups and throughputs are used to quantify performance.

They are defined as

    \text{Speed-up} = \frac{\text{CPU Execution Time}}{\text{GPU Execution Time}}    (5.1)

    \text{Throughput (Mcells/s)} = \frac{\text{Number of cells} \times \text{Number of time-steps}}{10^6 \times \text{Execution Time (s)}}    (5.2)

where one Mcell is one million cells.

5.2 Test Platform

The finite-difference time-domain implementation for running on CUDA was tested on the computer Emory. The specifications of the computer are shown in Table 5.1.

    Operating System   64-bit CentOS Linux
    Memory (RAM)       32 GB
    CPU                Dual Intel Xeon E54x30
                       Clock Speed: 2.66 GHz
                       Number of cores: 4
    GPU                Dual NVIDIA Tesla S1070
                       Clock Speed: 1.30 GHz
                       Number of processors: 4
                       Number of cores per processor: 240

Table 5.1: Specifications of the test platform.

The NVIDIA Tesla S1070 has a CUDA Compute Capability of 1.3. Detailed specifications of the Tesla S1070 are listed in Appendix A. As shown in Table 5.1, there are 4 processors in the Tesla S1070. However, for the purpose of this thesis, only one of the processors is utilized.
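Selecting a single one of these devices is done through the CUDA runtime; a minimal sketch (the thesis does not show its device-selection code) is:

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);   /* the Tesla S1070 appears as several CUDA devices */
    cudaSetDevice(0);                   /* use a single device, as is done in this thesis  */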

5.3 Results

The one-dimensional FDTD algorithm that is used for porting to CUDA is a simulation of a wave travelling in free space with absorbing boundary conditions. The computer equations for the algorithm are
    ex[i] = ca[i] * ex[i] - cb[i] * (hy[i] - hy[i-1]);
    hy[i] = da[i] * hy[i] - db[i] * (ex[i+1] - ex[i]);

Listing 5.1: Main update equations for the 1-D FDTD method.

Here ex is the electric field and hy is the magnetic field. ca, cb, da and db are coefficients which are pre-calculated before running the main update equations of Listing 5.1. These coefficients remain constant throughout the main update loop.

As this is the first attempt at getting the FDTD method to run on CUDA, the algorithm in Listing 5.1 was made to run in a CUDA kernel with very little modification to other segments of the code. The initialization routines and the pre-calculation of the coefficients (ca, cb, da and db) are still done by the CPU. After all initialization, all necessary data (ex, hy, ca, cb, da and db) are transferred from the CPU to the GPU's global memory. Then the CUDA kernel, which contains the update equations of Listing 5.1, is executed.

To compare execution times, CUDA's timer functions are utilized. Only the time taken to run the main loop of the FDTD update equations is recorded. In order to analyse both the accuracy and the execution time between CPU and GPU, the CUDA code has to be executed twice. This is because the GPU works with the data stored in its global memory, and the data has to be transferred to the CPU before it can be analysed. Thus, after each time-step, the data in the GPU's global memory is transferred to the CPU for processing. However, with these memory transfers, the execution time for the main loop cannot be accurately obtained, so the CUDA code is executed again, this time without the memory transfers. This provides a more accurate and fair comparison against the CPU's execution time.

Listing 5.2 and Listing 5.3 show the C code that utilizes the CPU and CUDA respectively. Listing 5.4 shows the C code for the CUDA kernel.
    for (int n = 0; n < Nmax; n++) {
        int m;

        pulse = exp(-0.5 * pow((no - n)/spread, 2));

        for (m = 1; m < Ncells; m++)
            ex[m] = ca[m] * ex[m] - cb[m] * (hy[m] - hy[m-1]);

        ex[Location] = ex[Location] + pulse;

        for (m = 0; m < Ncells - 1; m++)
            hy[m] = da[m] * hy[m] - db[m] * (ex[m+1] - ex[m]);
    }

Listing 5.2: Main update loop for the 1-D FDTD method using CPU.

    for (int n = 0; n < Nmax; n++) {
        calcOneTimeStep_cuda<<<numberOfBlocks, threadsPerBlock>>>(ex, hy, ca_d, cb_d,
                                                                  da_d, db_d, Ncells,
                                                                  Ncells/2, n);
    }

Listing 5.3: Main update loop for the 1-D FDTD method using CUDA.


    __global__ void calcOneTimeStep_cuda(float *ex, float *hy, float *ca, float *cb,
                                         float *da, float *db, int Ncells,
                                         int Location, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float pulse;
        float no = 40;
        float spread = 12;

        pulse = exp(-0.5 * pow((no - n)/spread, 2));

        if (i > 0 && i < Ncells)
            ex[i] = ca[i] * ex[i] - cb[i] * (hy[i] - hy[i-1]);

        if (i == Location)
            ex[i] = ex[i] + pulse;

        __syncthreads();

        if (i < Ncells - 1)
            hy[i] = da[i] * hy[i] - db[i] * (ex[i+1] - ex[i]);
    }

Listing 5.4: CUDA kernel for the 1-D FDTD method.
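The thesis does not list which timer functions were used; one common way to time only the main update loop on the GPU is with CUDA events, sketched below around the loop of Listing 5.3 (the event-based approach is an assumption, not the thesis' actual timing code).

    cudaEvent_t start, stop;
    float elapsed_ms;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int n = 0; n < Nmax; n++) {
        calcOneTimeStep_cuda<<<numberOfBlocks, threadsPerBlock>>>(ex, hy, ca_d, cb_d,
                                                                  da_d, db_d, Ncells,
                                                                  Ncells/2, n);
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);              /* wait for the loop to finish on the GPU */
    cudaEventElapsedTime(&elapsed_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);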


Table 5.2 shows the results obtained from running the code on Emory. The number of time steps is 3,000 and each execution time is an average of five runs.

    Simulation Size (cells)   CPU Time (ms)    CUDA Time (ms)   Speed-up
    100                             5.368          178.111        0.0301
    512                            25.340          188.3199       0.1346
    1,000                          49.781          187.775        0.2651
    1,024                          51.324          183.427        0.2798
    5,120                         263.243          193.365        1.3614
    10,240                        520.865          203.430        2.5604
    51,200                      2,674.100          315.700        8.4704
    102,400                     5,125.420          456.463       11.2286
    512,000                    28,962.292        1,739.171       16.6529
    1,024,000                  57,911.306        2,938.229       19.7096
    5,120,000                 290,531.271       13,897.174       20.9058
    10,240,000                558,871.438       27,616.772       20.2367

Table 5.2: Results for one-dimensional FDTD simulation.

There are a few interesting observations that can be made from the results. Firstly, as expected, higher speed-ups are obtained when the simulation size is increased. At the simulation size of approximately ten million cells, the CPU takes more than nine minutes to run while the GPU takes only slightly more than 27 seconds. This speed-up will become even more appreciable as the simulation size is increased and the FDTD is done in three dimensions instead of only one. A simulation that would take hours to run on the CPU could potentially take only minutes to run on a GPU.

Figure 5.1: Speed-up for one-dimensional FDTD simulation.

Figure 5.2: Throughput for one-dimensional FDTD simulation running on CUDA.

Secondly, the results show that for small simulation sizes such as 100 cells, the CPU runs much faster than the GPU. This is a result of the latency of memory transfers between host (CPU) and device (GPU), as discussed in Section 3.1. Because host and device have separate memory spaces, data has to be transferred between them in order for the GPU to perform the calculations; when the simulation size is small, the speed of the GPU in arithmetic operations is obscured by the time taken to transfer the data.

Also, the runtime of the GPU is longer when the simulation size is 1,000 than when it is 1,024. The reason for this lies in the thread and block organization of the GPU. For this simulation, the number of threads per block was kept constant at 512, and the number of blocks required is obtained by dividing the simulation size by the number of threads per block. Simulation sizes of 1,000 and 1,024 cells both require two blocks, but the former does not fully use all the threads in one of the blocks. CUDA uses a SIMT model, as discussed in Section 3.1, in which multiple threads run the same instruction simultaneously. With the simulation size of 1,000 cells, not all the threads in the blocks are used; specifically, 24 threads are not performing the calculations. This breaks homogeneity and causes the CUDA GPU, which has a SIMT model, to perform more slowly. One method to prevent breaking homogeneity would be to program the GPU to calculate on all threads but ignore the results from the 24 unused threads.

It is also interesting to note from Figure 5.2 that the throughput appears to saturate at around 1,100 Mcells/s.
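As a numerical check of the definitions in Section 5.1 against Table 5.2, consider the largest run of 10,240,000 cells over 3,000 time-steps on CUDA:

    Throughput = (10,240,000 x 3,000) / (10^6 x 27.617 s) = approximately 1,112 Mcells/s
    Speed-up   = 558,871.438 ms / 27,616.772 ms = approximately 20.2

which agrees with the last row of Table 5.2 and with the saturation near 1,100 Mcells/s visible in Figure 5.2.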


5.4 Discrepancy In Results

The results from CUDA and the CPU were compared by obtaining the difference between the CPU and CUDA electric fields. The percentage change was also calculated:

    \text{Difference} = E_{x,\mathrm{CPU}}\big|_k^n - E_{x,\mathrm{CUDA}}\big|_k^n    (5.3)

    \text{Percentage change} = \frac{E_{x,\mathrm{CPU}}\big|_k^n - E_{x,\mathrm{CUDA}}\big|_k^n}{E_{x,\mathrm{CPU}}\big|_k^n} \times 100\%    (5.4)

where n is the time step and k is the spatial location. Instead of analysing all the differences at every time step and at every cell, only the largest difference and the largest percentage change over all the time steps in the whole simulation space are recorded. This is sufficient to determine whether CUDA is performing correctly. The results are shown in Table 5.3.

To illustrate the significance of the discrepancy in results, MATLAB was used to plot the results generated from both the CPU and CUDA. Only the results for the simulation size of 100 cells are plotted; it is redundant to plot the other simulation sizes because all parameters, including the type of wave used, are kept constant for all simulation sizes, and the only change is in spatial size. From the plots, it can be concluded that there are no appreciable differences in the results. The differences are small enough to ignore and are insignificant considering the speed-ups obtained. Although Table 5.3 shows a discrepancy of almost 50%, the magnitude of this difference is less than 9 x 10^-7, and the plots confirm that the differences are negligible. Nevertheless, these comparisons are crucial and were made throughout the development. There are many parts of the program that could easily go wrong, and these checks are necessary in order to ensure that the changes made to the program do not cause CUDA to produce incorrect results.
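A minimal C sketch of how the largest difference and largest percentage change of Equations 5.3 and 5.4 can be tracked is shown below; the arrays ex_cpu and ex_cuda are hypothetical names for the fields produced by the two implementations at a given time step, and the loop would be repeated at every time step.

    float maxDiff = 0.0f, maxPct = 0.0f;

    for (int k = 0; k < Ncells; k++) {
        float diff = fabsf(ex_cpu[k] - ex_cuda[k]);       /* magnitude of Equation 5.3 */
        if (diff > maxDiff)
            maxDiff = diff;
        if (ex_cpu[k] != 0.0f) {
            float pct = 100.0f * diff / fabsf(ex_cpu[k]); /* magnitude of Equation 5.4 */
            if (pct > maxPct)
                maxPct = pct;
        }
    }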


    Simulation Size   Difference             Percentage change
    100               8.940697 x 10^-7       46.901276%
    512               1.184060 x 10^-6       20.490625%
    1,000             5.365000 x 10^-6       37.937469%
    1,024             5.598064 x 10^-6       18.774035%
    5,120             4.082790 x 10^-5        2.390608%
    10,240            4.082790 x 10^-5        2.390608%
    51,200            4.082790 x 10^-5        2.390608%
    102,400           4.082790 x 10^-5        2.390608%
    512,000           4.082790 x 10^-5        2.390608%
    1,024,000         4.082790 x 10^-5        2.390608%
    5,120,000         4.082790 x 10^-5        2.390608%
    10,240,000        4.082790 x 10^-5        2.390608%

Table 5.3: Discrepancy in results between CPU and CUDA simulation of the one-dimensional FDTD method.


Figure 5.3: Plot of 1-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size of 100 cells. The location of the probe is at x = 30.

Figure 5.4: Plot of 1-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the result from simulation size of 100 cells. The location of the probe is at x = 30. The plot is centered between time-steps 50 and 90.


5.5 Conclusions

Porting the one-dimensional FDTD method to CUDA has demonstrated that there is a lot of potential in GPU acceleration for the FDTD method. Speed-ups of over 20x and throughputs of over 1,100 Mcells/s (compared to a CPU throughput of only 54 Mcells/s) are convincing. The results have also shown that CUDA is best suited to repetitive processing of large amounts of data; for a small data set, CUDA does not perform well due to memory latency issues. Concerns about discrepancies between the CPU and GPU results, discussed in Section 3.2, were addressed, and the simulations performed show no significant difference. Although the results are convincing, the one-dimensional FDTD method is relatively simple compared to its two- and three-dimensional counterparts. The difficulty of extracting speed-ups from CUDA increases with the number of dimensions, and various optimizations have to be performed. This is discussed in the following chapters.


6 2-D FDTD Results

6.1 Introduction

Just as in the one-dimensional FDTD method, a wave emanating through free space from a point source is simulated. The computer code for the two-dimensional FDTD method's main field update equations is shown in Listing 6.1.
    ex[i][j] = caex[i][j] * ex[i][j] + cbex[i][j] * (hz[i][j] - hz[i][j-1]);
    ey[i][j] = caey[i][j] * ey[i][j] + cbey[i][j] * (hz[i-1][j] - hz[i][j]);
    hz[i][j] = dahz[i][j] * hz[i][j] + dbhz[i][j] * (ex[i][j+1] - ex[i][j] + ey[i][j] - ey[i+1][j]);

Listing 6.1: Main update equations for the 2-D FDTD method.

Perfectly matched layers (PML) were used as the artificial absorbing layer to prevent the travelling waves from reflecting at the boundaries of the simulation space. This causes the main update loop to be more complex than the 1-D loop, because the equations for the PML region are different from the FDTD update equations. Apart from that, the use of the PMLs introduces four more regions in addition to the main field domain: the PML regions on the top, bottom, left and right of the central FDTD region. The flowchart in Figure 6.1 provides an illustration of the main update loop for the two-dimensional FDTD method with PMLs as the absorbing region. As the PML update equations are too long, they will not be listed here. However, it is important to note that the PML update equations and the main field update equations are independent of each other, which allows the PML and the field equations to be updated concurrently.

Obtaining a noticeable speed-up for the two-dimensional FDTD method proved to be more challenging than for the one-dimensional FDTD method. As was discovered, the main factors causing CUDA to run slowly were related to memory. All problems faced during the development and the methods used to improve the speed-up are discussed in the following sections.
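As an illustration of how the updates of Listing 6.1 map onto CUDA threads, a minimal kernel for the Ex update is sketched below. This is not the thesis' kernel: for simplicity it assumes that ex, caex, cbex and hz are flattened 1-D device arrays that all share the same row width jb, with one thread per cell and 16 x 16 thread blocks as in Table 6.1.

    __global__ void update_ex(float *ex, const float *caex, const float *cbex,
                              const float *hz, int ie, int jb)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;   /* column */
        int i = blockIdx.y * blockDim.y + threadIdx.y;   /* row    */

        if (i < ie && j > 0 && j < jb) {
            int idx = i * jb + j;                        /* row-major index */
            ex[idx] = caex[idx] * ex[idx] + cbex[idx] * (hz[idx] - hz[idx - 1]);
        }
    }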

6.2 Test Parameters

The program execution set-up used for testing is shown in Table 6.1. As in the one-dimensional FDTD simulation, there needs to be a way of determining whether CUDA is performing correctly and producing accurate results. To do this, the magnetic field, Hz, is monitored at a particular location. This probe location could be anywhere in the FDTD region; for consistency, the cell at location (25, 25) is used. The magnetic field calculated by the CPU and by CUDA at this location is recorded throughout all the time-steps.


Figure 6.1: Flowchart for two-dimensional FDTD simulation. (The loop: initialize the model; update the electric fields ex and ey in the main grid; update ex in all PML regions; update ey in all PML regions; update the magnetic field hz in the main grid; update hzx in all PML regions; update hzy in all PML regions; repeat until the last time-step, then end.)


    Number of time steps              1000
    x-location of wave source         75
    y-location of wave source         75
    x-location of probe               25
    y-location of probe               25
    CUDA x-dimension of a block       16
    CUDA y-dimension of a block       16
    Simulation Sizes                  128 x 128, 256 x 256, 512 x 512,
                                      1024 x 1024, 2048 x 2048
    Thickness of PML                  8 cells

Table 6.1: Set-up for two-dimensional FDTD simulation.


The original CPU execution times are shown in Table 6.2. Throughout this chapter, the speed-ups calculated are based on the CPU execution times in this table.

    Simulation Size   CPU Execution Time (ms)
    128 x 128                   5,177.621
    256 x 256                  18,342.557
    512 x 512                  75,437.734
    1024 x 1024               288,112.225
    2048 x 2048             1,129,977.275

Table 6.2: Results for two-dimensional FDTD simulation on CPU.

6.3 Initial Run

Similar to the development of the 1-D code, the 2-D code was first written to execute only the main update equations using CUDA; specifically, the Ex, Ey and Hz update equations are executed using CUDA, while the field updates in the PML regions are still executed on the CPU. In order to do this, the Ex and Ey fields had to be transferred from the host to CUDA before executing a CUDA kernel and then transferred back from CUDA to the host after the kernel had finished executing (the structure of this main loop is sketched below). As expected from the discussion in Section 3.3, the memory operations were costly and resulted in a slower execution than the CPU. The results are shown in Table 6.3. The results show that as the simulation size increases, the speed-up decreases. This is because at larger simulation sizes, the amount of memory transferred per time-step is larger. To reduce the memory transactions, CUDA is used to update the PML regions as well.
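The structure of this initial version's main loop is sketched below; the kernel and routine names, and the array sizes, are illustrative assumptions, but the pattern of per-time-step host-device transfers is the one described above.

    for (int n = 0; n < nmax; n++) {
        /* copy the E fields to the device and run the main-grid updates on the GPU */
        cudaMemcpy(ex_d, ex, ie * jb * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(ey_d, ey, ie * jb * sizeof(float), cudaMemcpyHostToDevice);
        updateMainGrid<<<grid, block>>>(ex_d, ey_d, hz_d, n);     /* hypothetical kernel  */

        /* copy the updated fields back so the CPU can update the PML regions */
        cudaMemcpy(ex, ex_d, ie * jb * sizeof(float), cudaMemcpyDeviceToHost);
        cudaMemcpy(ey, ey_d, ie * jb * sizeof(float), cudaMemcpyDeviceToHost);
        updatePmlOnCpu(ex, ey, hz);                               /* hypothetical routine */
    }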


    Simulation Size   CUDA Execution Time (ms)   Speed-up
    128 x 128                    5,327.528       0.97186181
    256 x 256                   19,307.889       0.950003219
    512 x 512                  108,445.182       0.695630109
    1024 x 1024                392,133.552       0.734729848
    2048 x 2048              1,550,150.542       0.728946799

Table 6.3: Results for two-dimensional FDTD simulation on CUDA (initial run).

6.4 Updating PML Using CUDA

By using CUDA to update the PML regions, the need for any data transfers between the host and the device in the main update loop is eliminated. However, this prevents us from probing the magnetic field in order to determine the accuracy of CUDA's execution, as discussed in Section 6.2. This is solved in a similar fashion to the 1-D FDTD simulation: the program is executed twice, first without probing and then with probing. The first execution is timed so that an accurate speed-up can be calculated; in the second execution, the magnetic field is saved so that a comparison can be made between CUDA and the CPU.

Comparing the results of Table 6.3 and Table 6.4, there is a noticeable improvement in the execution times when CUDA is used to calculate the PML. However, CUDA's performance is still very poor, which is certainly not what is expected given the results from the one-dimensional FDTD simulation. The reason behind this is not immediately apparent, but further investigation into the CUDA execution model reveals the cause, and it is discussed in Section 6.6.


    Simulation Size   CUDA Execution Time (ms)   Speed-up
    128 x 128                    2,388.178       2.168021637
    256 x 256                    7,920.042       2.315967187
    512 x 512                   67,085.505       1.124501243
    1024 x 1024                286,012.760       1.007340458
    2048 x 2048              1,157,616.333       0.976124163

Table 6.4: Results for two-dimensional FDTD simulation on CUDA (using CUDA to update PML).

6.5 Compute Profiler

In order to understand what is happening on the GPU throughout the execution of the program and to determine where the bottlenecks occur, the NVIDIA Compute Visual Profiler tool is used. This tool is bundled with the NVIDIA CUDA Toolkit. The Compute Visual Profiler is used to measure performance and find potential opportunities for optimization in order to achieve maximum performance from NVIDIA GPUs [18].

Since the testing of CUDA on Emory is performed through a command-line interface, a textual method of profiling the CUDA program is used instead of the graphical user interface (GUI) method shown in Figure 6.2. The article in [19] provides an excellent tutorial on textual profiling. For the purpose of profiling the FDTD program, focus is given to the following counters:

    gld_incoherent: number of non-coalesced global memory loads
    gld_coherent:   number of coalesced global memory loads
    gst_incoherent: number of non-coalesced global memory stores
    gst_coherent:   number of coalesced global memory stores


Figure 6.2: Screen-shot of the Visual Profiler GUI from CUDA Toolkit 3.2 running on Windows 7. The data in the screen-shot is imported from the results of the memory test in Listing 6.4. These results are also available in Appendix B.


The results of the profiling were expected to show a high number of non-coalesced memory loads and non-coalesced memory stores, as no optimizations had yet been made to the utilization of CUDA's memory. However, the profiling showed that all global memory loads and stores were coalesced. The source code for the test and the profiling results are shown in Appendix B. The reason behind this could be the CUDA compute capability of Emory's GPU: as shown in Appendix A, the NVIDIA Tesla S1070's compute capability is 1.3, and the memory coalescing requirements for compute capability 1.3 are relaxed. This is explained in Section 3.2.1 of the CUDA Best Practices Guide 3.1 [13].

Although the profiling shows that all memory accesses are coalesced, it does not mean that memory accesses are optimised. Improvements can be made, and a simple program based on Section 3.2.1.3 of the CUDA Best Practices Guide was tested on Emory. Table 6.5 summarises the textual profiling results for the program. The results in Table 6.5 show 0 incoherent memory loads (gld_incoherent) and 0 incoherent memory stores (gst_incoherent). This is inconsistent with what is expected from the program: it should show uncoalesced memory accesses at offsets other than 0 and 16. Therefore, the example above clearly shows that on Emory's GPU, textual profiling does not provide sufficient information on whether memory accesses are optimized.


    Offset   gld_coherent   gld_incoherent   gst_coherent   gst_incoherent
    0             128              0              512              0
    1             192              0              512              0
    2             192              0              512              0
    3             192              0              512              0
    4             192              0              512              0
    5             192              0              512              0
    6             192              0              512              0
    7             192              0              512              0
    8             192              0              512              0
    9             128              0              512              0
    10            192              0              512              0
    11            192              0              512              0
    12            192              0              512              0
    13            192              0              512              0
    14            192              0              512              0
    15            192              0              512              0
    16            128              0              512              0

Table 6.5: Textual profiling results for investigation into memory coalescing.


6.6 CUDA Memory Restructuring

As explained in Section 3.3, there are different types of memory in CUDA, and they all have different advantages and disadvantages. While global memory is the largest, it is not cached and is located off-chip, which makes access to it slow. Therefore, to improve the results of the simulation, CUDA's memory structure is investigated in order to obtain better speed-ups. It is worth noting that in the CUDA C Best Practices Guide [13], memory optimizations are recommended as high-priority.

In Section 6.4, the amount of memory transferred between host and device was reduced by using CUDA to perform the calculations in the PML regions, which eliminated the need for any data transfers within the main update loop. However, the data was not stored in a way that assists coalesced memory accesses. Although the simulation is of a 2-D space, memory was allocated only as 1-D arrays in CUDA. An example of this is shown in Listing 6.2.
// allocate a flat 1-D array on the device and copy the Ex field to and from it
cudaMalloc(&ex_d, ie*jb*sizeof(float));
cudaMemcpy(ex_d, ex, ie*jb*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(ex, ex_d, ie*jb*sizeof(float), cudaMemcpyDeviceToHost);

Listing 6.2: 1-D array in CUDA.

To improve on this, the cudaMallocPitch() function is used for allocating memory instead of cudaMalloc(). This function is recommended by the NVIDIA CUDA Programming Guide 3.1.1 for allocating 2-D memory. cudaMallocPitch() ensures that the memory allocation is padded properly to meet the alignment requirements for coalesced memory access. Further explanation on this can be found in Section 5.3.2.1.2 of the Programming Guide. Listing 6.3 shows how cudaMallocPitch() is used in place of the code in Listing 6.2.
// allocate a pitched (padded) 2-D array on the device and copy the Ex field to and from it
cudaMallocPitch((void**)&ex_d, &ex_pitch, jb * sizeof(float), ie);
cudaMemcpy2D(ex_d, ex_pitch, ex, jb * sizeof(float), jb * sizeof(float), ie, cudaMemcpyHostToDevice);
cudaMemcpy2D(ex, jb * sizeof(float), ex_d, ex_pitch, jb * sizeof(float), ie, cudaMemcpyDeviceToHost);

Listing 6.3: 1-D array in CUDA allocated using cudaMallocPitch().
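The listings above only show the allocation and the host–device copies. For completeness, the sketch below shows the access pattern the Programming Guide recommends for a pitched allocation inside a kernel: the row start is computed in bytes using the pitch and then cast back to float. The kernel and variable names here are illustrative and are not taken from the thesis code.

// Minimal sketch: doubling every element of a pitched 2-D array.
__global__ void scale_pitched(float *data, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column (contiguous direction)
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row

    if (x < width && y < height) {
        // pitch is in bytes, so the row base is computed on a char* first
        float *row = (float *)((char *)data + y * pitch);
        row[x] = row[x] * 2.0f;
    }
}

Because consecutive values of threadIdx.x address consecutive elements of a row, a half-warp touches one contiguous, aligned segment of global memory, which is the condition for coalescing on compute capability 1.x devices.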

To test this, the allocation of the electric field Ex in CUDA is changed to use cudaMallocPitch() instead of cudaMalloc(). The result of this change is shown in Table 6.6. However, the results are not convincing; there is little difference compared to the previous result obtained using cudaMalloc().

Simulation Size   CUDA Execution Time (ms)   Speed-up
128 × 128         1,380.316                  3.751039236
256 × 256         5,523.732                  3.320682159
512 × 512         49,619.559                 1.520322553
1024 × 1024       211,769.401                1.360499787
2048 × 2048       879,721.688                1.284471318

Table 6.6: Results for two-dimensional FDTD simulation on CUDA (using cudaMallocPitch() for Ex memory allocation).

To investigate the cause of this, a simple program was developed to determine whether there are any advantages in using cudaMallocPitch() for 2-D arrays. The algorithm for the program is shown in Figure 6.3. The kernel invocation and kernel definition are shown in Listing 6.4. A full listing of the source code is attached in Appendix B.


 1  // Kernel definition
 2  __global__ void copy(float *odata, float* idata, int pitch, int size_x, int size_y) {
 3      unsigned int yid = blockIdx.x * blockDim.x + threadIdx.x;
 4      unsigned int xid = blockIdx.y * blockDim.y + threadIdx.y;
 5
 6      if ( xid < size_y && yid < size_x ) {
 7          int index = ( yid * pitch / sizeof(float) ) + xid;
 8          odata[index] = idata[index] * 2.0;
 9      }
10  }
11
12  int main() {
13      ...
14      dim3 grid(ceil((float)size_x/16), ceil((float)size_y/16), 1);
15      dim3 threads(16, 16, 1);
16
17      // Kernel invocation
18      copy<<<grid, threads>>>(d_odata, d_idata, pitch, size_x, size_y);
19  }

Listing 6.4: Simple program to determine advantages of using cudaMallocPitch().


[Figure 6.3 is a flowchart with the following steps: run initialization; create a 2-D array in the CPU and fill it with random data; create a 2-D array in CUDA using cudaMallocPitch(); copy the array from the CPU to CUDA; multiply each element in the array by 2 using CUDA; copy the array from CUDA back to the CPU; check that CUDA has multiplied all elements correctly; end.]

Figure 6.3: Simple program to determine advantages of cudaMallocPitch().

A snippet of the textual profiling results is shown in Table 6.7; for the full results, refer to Appendix B. Again, the results from executing this program were unconvincing. Although there was no indication of uncoalesced memory reads or writes, the execution time was still quite slow and did not improve even though cudaMallocPitch() was used.


method           gputime    gld_coherent   gld_incoherent   gst_coherent   gst_incoherent
_Z4copyPfS_iii   7579.68    104960         0                209920         0

Table 6.7: Snippet of results from textual profiling of uncoalesced memory access.
Thus, this program was profiled on another computer with Compute Capability 1.0, and the CUDA Compute Profiler showed that memory accesses were still uncoalesced. This was not expected, and further testing showed that the memory allocation in CUDA was column-major. To fix the problem, the code from Listing 6.4 is replaced with the code in Listing 6.5. The only changes made to the source code are to the thread-index calculation (Lines 3 and 4) and to the grid dimensions (Line 14). The full source code is available in Appendix C. A snippet of the test results is shown in Table 6.8; for the full results, refer to Appendix C. This change to column-major indexing produced a significantly faster result, 46 times faster in this test. It is also interesting to note that the number of global memory loads recorded in the profiling dropped by a factor of 16 (104,960 loads in Table 6.7 versus 6,560 in Table 6.8). This indicates that for the uncoalesced memory test, although the profiler did not report incoherent memory accesses, 16 times more memory was accessed than required for a coalesced kernel. The number 16 is equivalent to a half-warp of threads. This is consistent with the explanation in Section 3.2.1 of the CUDA Best Practices Guide [13].


 1  // Kernel definition
 2  __global__ void copy(float *odata, float* idata, int pitch, int size_x, int size_y) {
 3      unsigned int xid = blockIdx.x * blockDim.x + threadIdx.x;
 4      unsigned int yid = blockIdx.y * blockDim.y + threadIdx.y;
 5
 6      if ( xid < size_y && yid < size_x ) {
 7          int index = ( yid * pitch / sizeof(float) ) + xid;
 8          odata[index] = idata[index] * 2.0;
 9      }
10  }
11
12  int main() {
13      ...
14      dim3 grid(ceil((float)size_y/16), ceil((float)size_x/16), 1);
15      dim3 threads(16, 16, 1);
16
17      // Kernel invocation
18      copy<<<grid, threads>>>(d_odata, d_idata, pitch, size_x, size_y);
19  }

Listing 6.5: Simple program to determine advantages of using cudaMallocPitch().


method           gputime    gld_coherent   gld_incoherent   gst_coherent   gst_incoherent
_Z4copyPfS_iii   161.792    6560           0                26240          0

Table 6.8: Snippet of results from textual profiling of coalesced memory access.

This change to the memory access pattern is applied to the 2-D FDTD code. The results are shown in Table 6.9.

Simulation Size   CUDA Execution Time (ms)   Speed-up
128 × 128         855.447                    6.05253435
256 × 256         1,273.917                  14.39854914
512 × 512         2,514.767                  29.99790577
1024 × 1024       6,739.438                  42.75018872
2048 × 2048       22,020.758                 51.31418658

Table 6.9: Results for two-dimensional FDTD simulation on CUDA (using cudaMallocPitch() and column-major indexing).

A plot of the throughput for CUDA is shown in Figure 6.4. The throughput for CUDA is over 190 Mcells/s, which is significantly faster than the CPU's throughput of less than 4 Mcells/s.


[Figure 6.4 plots throughput (Mcells/s) against simulation size (16,384 to 4,192,304 cells) for the 2-D FDTD simulation on CUDA.]

Figure 6.4: Throughput for two-dimensional FDTD simulation running on CUDA.


6.7 Discrepancy In Results

The same tests for discrepancy between CPU and GPU as in the one-dimensional FDTD simulation detailed in Section 5.4 are performed here. The results are shown in Table 6.10 below.

Simulation Size   Difference        Percentage change
128 × 128         4.507601 × 10⁻⁷   0.299014%
256 × 256         3.613532 × 10⁻⁷   0.304905%
512 × 512         3.613532 × 10⁻⁷   0.304905%
1024 × 1024       3.613532 × 10⁻⁷   0.304905%
2048 × 2048       3.613532 × 10⁻⁷   0.304905%

Table 6.10: Discrepancy in results between CPU and CUDA simulation of the two-dimensional FDTD method.

The conversion to utilize CUDA for computation has not caused any significant change in the FDTD results. The discrepancy is only around 0.3%, and Figure 6.5 shows no noticeable difference in the plot of CPU and CUDA results.
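These differences are of the order of the single-precision machine epsilon (about 1.2 × 10⁻⁷), which suggests they arise from ordinary floating-point round-off in the single-precision arithmetic used on both platforms rather than from any algorithmic difference.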

6.8 Alternative Absorbing Boundaries

During the development of the code, it became obvious that the PML used to absorb the waves at the boundaries of the simulation space was complex. The PMLs did not appear to be a good candidate for running on CUDA and thus, other absorbing boundary conditions (ABCs) were explored. Mur's first order ABC, Mur's second order ABC and Liao's ABC were tested as replacements for the PMLs. Figure 6.6 shows that both versions of Mur's ABC achieve better speed-ups than the PML, while Liao's ABC has the lowest speed-up.


[Figure 6.5 plots magnitude against time-step (0 to 5000) for the CPU and GPU (CUDA) results.]

Figure 6.5: Plot of 2-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the results for simulation size 128 × 128. The location of the probe is at cell (25, 25) as specified in Table 6.1.


[Figure 6.6 plots speed-up against simulation size (16,384 to 4,192,304 cells) for the PML, 1st order Mur, 2nd order Mur and Liao boundaries.]

Figure 6.6: Comparison of speed-up for various ABCs in two-dimensional FDTD simulation running on CUDA.


[Figure 6.7 plots throughput (Mcells/s) against simulation size (16,384 to 4,192,304 cells) for the PML, 1st order Mur, 2nd order Mur and Liao boundaries.]

Figure 6.7: Comparison of throughput for various ABCs in two-dimensional FDTD simulation running on CUDA.


However, this does not mean that Liao's ABC is the slowest. In fact, Liao's ABC has a much higher throughput than the PML, as shown in Figure 6.7. The reason Liao's ABC performs slower than Mur's ABC lies in the use of doubles instead of floats for the field variables; the use of floats causes Liao's ABC to become unstable. The results confirm that the PML's complexity significantly reduces the performance of the FDTD simulations. The PML is widely used because its accuracy is better than that of the ABCs. However, the performance benefit that can be obtained from using other ABCs on CUDA makes using the PML less attractive. At 2048 × 2048 cells, it takes 22 seconds for the 2-D FDTD simulation using PMLs to complete, while using Mur's second order ABC reduces that time to less than four seconds. Furthermore, for Mur's second order ABC and Liao's ABC, there was no significant discrepancy noticed when compared to the PML.

6.9 Conclusions

In the one-dimensional FDTD method, speed-ups of over 20x and a throughput of over 1,100 Mcells/s for CUDA were recorded. In the two-dimensional FDTD method, similarly convincing results were achieved, with over 60x speed-up and a throughput of over 1,400 Mcells/s. However, achieving those results comes with increasing complexity and difficulty. While a simple port to CUDA with almost no optimizations yielded significant speed-ups for 1-D, the same cannot be said for 2-D. It became obvious that programming on CUDA's framework required many optimizations for optimal performance. It also becomes clear that the question of


whether it is worth taking the time to obtain these speed-ups has to be answered. Alternative ABCs were explored in this chapter because the PML is not well-suited to running on CUDA. The throughput was increased by more than five times when Mur's second order ABC was used. Also, Mur's second order ABC and Liao's ABC did not show significant discrepancies compared to the PML implementation.


7 3-D FDTD Results
7.1 Introduction
The three-dimensional FDTD method has six main equations, as listed in Section 2.4. The equations are converted into computer code as shown in Listing 7.1.
// electric field updates
ex[i][j][k] = ca[id] * ex[i][j][k] + cby[id] * (hz[i][j][k] - hz[i][j-1][k]) - cbz[id] * (hy[i][j][k] - hy[i][j][k-1]);
ey[i][j][k] = ca[id] * ey[i][j][k] + cbz[id] * (hx[i][j][k] - hx[i][j][k-1]) - cbx[id] * (hz[i][j][k] - hz[i-1][j][k]);
ez[i][j][k] = ca[id] * ez[i][j][k] + cbx[id] * (hy[i][j][k] - hy[i-1][j][k]) - cby[id] * (hx[i][j][k] - hx[i][j-1][k]);
// magnetic field updates (db{axis} multiplies the finite difference taken along that axis)
hx[i][j][k] = da[id] * hx[i][j][k] + dbz[id] * (ey[i][j][k+1] - ey[i][j][k]) - dby[id] * (ez[i][j+1][k] - ez[i][j][k]);
hy[i][j][k] = da[id] * hy[i][j][k] + dbx[id] * (ez[i+1][j][k] - ez[i][j][k]) - dbz[id] * (ex[i][j][k+1] - ex[i][j][k]);
hz[i][j][k] = da[id] * hz[i][j][k] + dby[id] * (ex[i][j+1][k] - ex[i][j][k]) - dbx[id] * (ey[i+1][j][k] - ey[i][j][k]);

Listing 7.1: Main update equations for the 3-D FDTD Method.


The flowchart in Figure 7.1 illustrates the main update loop of the FDTD method for three-dimensional space. For development and testing purposes, a Gaussian wave propagating from a dipole antenna into free space is simulated. After stable and fast speed-ups were obtained on the CUDA GPU, a head model was then simulated. The head model was chosen because it was the motivation of the project: simulations of the head model were taking a long time to complete using the CPU, and thus it is only appropriate that the findings of this project be used to improve the execution time on the head model. While the two-dimensional FDTD method was more challenging to work with on CUDA compared to the one-dimensional FDTD method, the three-dimensional FDTD method proves to be significantly more difficult. As will be discussed in this chapter, the CUDA architecture does not support running a large number of threads in a three-dimensional configuration, and thus alternative methods had to be used to get around this. As was found in the two-dimensional FDTD simulations, the PML is not best suited to run on CUDA as an absorbing region; in fact, it was found that updating the PMLs was taking longer than updating the main grid. Thus, alternative absorbing boundary conditions are explored and the results are presented at the end of this chapter.
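The source excitation itself is not listed in this chapter. Purely as an illustration of how a Gaussian excitation of the dipole typically appears in FDTD code (the variable names, delay t0 and pulse width tau below are illustrative and not taken from the thesis), a soft source added at the feed cell each time step might read:

// add a Gaussian pulse to Ez at the dipole feed cell (ic, jc, kc)
float pulse = expf(-powf(((float)t - t0) / tau, 2.0f));
ez[ic][jc][kc] += pulse;

Here t is the current time step, t0 the delay and tau the pulse width, chosen so that the pulse spectrum covers the frequencies of interest.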

7.2 Test Parameters

Table 7.1 shows the configuration used for the three-dimensional FDTD simulations during development. The original CPU execution times are shown in Table 7.2, and these execution times are used as the base for the calculation of speed-ups.


[Figure 7.1 is a flowchart of the main update loop: initialize the model; update the electric fields ex, ey and ez in the main grid; update ex, ey and ez in all PML regions; update the magnetic fields hx, hy and hz in the main grid; update hx, hy and hz in all PML regions; if the time-steps have not ended, repeat; otherwise end.]

Figure 7.1: Flowchart for three-dimensional FDTD simulation.


Number of time steps          1,000
Center of dipole antenna      (60, 60, 60)
Location of probe             (60, 60, 60)
CUDA x-dimension of a block   16
CUDA y-dimension of a block   16
Simulation Sizes              128 × 128 × 128, 160 × 160 × 160, 192 × 192 × 192, 224 × 224 × 224, 256 × 256 × 256
Thickness of PML              4 cells

Table 7.1: Set-up for three-dimensional FDTD simulation.

Simulation Size     CPU Execution Time (ms)
128 × 128 × 128     515,913.3910
160 × 160 × 160     947,819.1530
192 × 192 × 192     1,588,263.3060
224 × 224 × 224     2,452,357.4220
256 × 256 × 256     3,609,773.9260

Table 7.2: Results for three-dimensional FDTD simulation on CPU.


7.3 CUDA Block & Grid Configurations

In the one-dimensional FDTD method, the blocks and grids of the CUDA kernel were configured as one-dimensional. Similarly, in the two-dimensional FDTD method, the CUDA kernel was configured for two-dimensional blocks and grids, because this simplifies programming the CUDA kernel. However, this could not be extended to the three-dimensional FDTD method. At this point, it is worth noting the limitations of CUDA's block and grid configuration, shown in Table 7.3.

Maximum x- or y-dimension of a grid of thread blocks   65,535
Maximum z-dimension of a grid of thread blocks         1
Maximum x- or y-dimension of a block                   512
Maximum z-dimension of a block                         64

Table 7.3: Limitations of the block and grid configuration for CUDA for Compute Capability 1.x [6].

While CUDA does support three-dimensional blocks, the number of threads in the z-dimension of a block is limited to only 64, compared to 512 threads for the x- and y-dimensions. More importantly, CUDA does not support three-dimensional grids. To circumvent this limitation, a two-dimensional configuration of blocks and threads is used. Then, within the CUDA kernel, a for-loop is used to cycle through the z-dimension. An example of this is shown in Listing 7.2.


// Kernel definition
__global__ void kernel(float *odata, int size_x, int size_y, int size_z) {
    unsigned int xid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int yid = blockIdx.y * blockDim.y + threadIdx.y;

    if ( xid < size_x && yid < size_y ) {
        // loop over the z-dimension inside the kernel
        for ( int zid = 0; zid < size_z; zid++ ) {
            ...
            odata[index] = ...;
        }
    }
}

int main() {
    ...
    dim3 grid(ceil((float)size_x/16), ceil((float)size_y/16), 1);
    dim3 threads(16, 16, 1);

    // Kernel invocation
    kernel<<<grid, threads>>>(d_odata, size_x, size_y, size_z);
}

Listing 7.2: Looping through a three-dimensional array using two-dimensional blocks and grids in CUDA.


7.4 Three-Dimensional Arrays In CUDA

In the two-dimensional FDTD simulation, cudaMallocPitch() was used to allocate memory for the two-dimensional arrays. The use of cudaMallocPitch() ensured that the arrays were properly aligned to meet CUDA's requirements for coalesced memory access. For three-dimensional arrays, the cudaMalloc3D() and make_cudaExtent() functions are used. Listing 7.3 shows how these functions are used.


int main() {
    ...
    // cudaMalloc3D() fills in a cudaPitchedPtr rather than a raw float*
    cudaPitchedPtr ptr;
    cudaExtent extent = make_cudaExtent(width * sizeof(float), height, depth);
    cudaMalloc3D(&ptr, extent);
    ...
}

Listing 7.3: Allocating a three-dimensional array in CUDA.

Further details of these functions are described in the CUDA Reference Manual [20].
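The Reference Manual only documents the allocation itself. As background, the access pattern that the Programming Guide suggests for such an allocation is sketched below; the kernel is illustrative and is not one of the thesis kernels. The pitched pointer carries the padded row width in bytes, and a slice pitch is derived from it.

// Minimal sketch: visiting every element of a cudaMalloc3D() allocation.
__global__ void touch3d(cudaPitchedPtr p, int width, int height, int depth)
{
    char *base = (char *)p.ptr;
    size_t pitch = p.pitch;             // padded row width in bytes
    size_t slicePitch = pitch * height; // size of one x-y slice in bytes

    for (int z = 0; z < depth; ++z) {
        char *slice = base + z * slicePitch;
        for (int y = 0; y < height; ++y) {
            float *row = (float *)(slice + y * pitch);
            for (int x = 0; x < width; ++x) {
                row[x] = row[x] * 2.0f;   // example operation
            }
        }
    }
}

In the FDTD kernels, the x and y loops are replaced by the thread indices of the two-dimensional block configuration of Listing 7.2, so only the z loop remains inside the kernel.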

7.5 Initial Run

Table 7.4 shows the results obtained from converting the main update loop of Figure 7.1 to utilize CUDA. At 256 × 256 × 256 cells, the throughput was calculated to be less than 19 Mcells/s. This is significantly less than what was achieved for the two-dimensional FDTD, approximately 222 Mcells/s for the same number of cells (4096 × 4096).


Simulation Size     CUDA Execution Time (ms)   Speed-up
128 × 128 × 128     107,880.5770               4.7823
160 × 160 × 160     194,591.8580               4.8708
192 × 192 × 192     568,707.0920               2.7928
224 × 224 × 224     881,243.1030               2.7828
256 × 256 × 256     890,487.4880               4.0537

Table 7.4: Results for three-dimensional FDTD simulation on CUDA (initial run).

However, it is worth noting that although not much optimization has been done yet, it takes CUDA less than 15 minutes to complete a simulation that takes an hour to complete on the CPU.
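As a check on the quoted figure, if the throughput is taken as the number of cells multiplied by the number of time steps and divided by the execution time (this definition is inferred rather than stated explicitly here), the 256 × 256 × 256 run with the 1,000 time steps of Table 7.1 gives 16,777,216 × 1,000 / 890.5 s ≈ 18.8 Mcells/s, consistent with the value of just under 19 Mcells/s quoted above.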

7.6 Visual Profiling

To improve on the speed-ups, the CUDA textual profiler was used. The results were similar to those obtained in Chapter 6: the textual profiler did not show any uncoalesced memory accesses and provided insufficient information to determine where the bottleneck was. Later on in the project, it was discovered that although Emory does not have a GUI natively, the Visual Profiler could be executed using X11 forwarding. A secondary computer was used to connect to Emory through an SSH client with X11 forwarding enabled, which allowed the CUDA Visual Profiler to display on the secondary computer.
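(Such a session can typically be opened with the X11-forwarding option of an SSH client, for example ssh -X user@emory when using OpenSSH; the user and host names here are only illustrative.)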


The Visual Profiler proved to be very useful, and the bottleneck for the three-dimensional FDTD method was found immediately: the profiler showed that the global memory load and global memory store efficiency was just 0.06 for most of the CUDA kernels. The profiling results for the six main kernels are shown in Table 7.5. On the other hand, when the Visual Profiler was executed for the two-dimensional FDTD method developed in Chapter 6, the results showed global memory load and global memory store efficiencies that were close to one.

Method        gld efficiency   gst efficiency
ex_cuda_d     0.0630682        0.0630686
ey_cuda_d     0.0613636        0.0613641
ez_cuda_d     0.0630682        0.0630686
hx_cuda_d     0.06             0.0600002
hy_cuda_d     0.065625         0.0656253
hz_cuda_d     0.061875         0.0618752

Table 7.5: Results from visual profiling of the three-dimensional FDTD simulation.

These results were unexpected because the arrays were allocated using cudaMalloc3D() as recommended by the CUDA Programming Guide [6] and the CUDA Best Practices Guide [13].

7.7 Memory Coalescing for 3-D Arrays

The CUDA documentation [6, 13, 20] was consulted, but there was no mention of how to index three-dimensional arrays in the kernel for coalesced memory accesses. The NVIDIA forums were not helpful in this matter either. However, by slowly analysing and debugging the CUDA kernel, a simple


solution was found for indexing the arrays in the kernel while maintaining aligned accesses to global memory. The only limitation of this solution is that the x- and y-dimensions of threads in a block must be the same. This solution was a huge milestone in the development of the three-dimensional code, and the speed-ups achieved are listed in Table 7.6. Figures 7.2 and 7.3 show the plots for the speed-up and throughput respectively.

Simulation Size     CUDA Execution Time (ms)   Speed-up
128 × 128 × 128     25,842.8480                19.9635
160 × 160 × 160     35,609.7410                26.6169
192 × 192 × 192     83,756.5080                18.9629
224 × 224 × 224     142,680.5420               17.1877
256 × 256 × 256     126,128.8680               28.6197

Table 7.6: Results for three-dimensional FDTD simulation on CUDA (with coalesced memory access).

A throughput of over 130 Mcells/s and a speed-up of over 28x were achieved for a simulation size of 16 Mcells. While CUDA's execution time for the three-dimensional FDTD is significantly better than that of the CPU, the throughput is still much lower than for the two-dimensional FDTD executed on CUDA. Again, the low throughput can be attributed to the use of PMLs as absorbing regions. Thus, alternative absorbing boundary conditions are used in place of the PMLs for comparison; this is discussed in the following section. The Visual Profiler was executed on the improved kernels to check the efficiency of the global memory accesses. The results are shown in Table 7.7. Comparing the results of Table 7.5 and Table 7.7, it is obvious that there is a huge improvement in the memory accesses in the new kernels.
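Although the final index expression is not reproduced here, a scheme consistent with the constraints described above (and with the pitched allocation of Listing 7.3) is to let threadIdx.x run along the contiguous, pitch-aligned dimension of each slice and to compute a flat element index such as index = (zid * size_y + yid) * (pitch / sizeof(float)) + xid inside the z-loop of Listing 7.2. This is offered only as an illustration of the idea, not as the exact indexing used in the thesis code.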


[Figure 7.2 plots speed-up against simulation size (2,097,152 to 16,777,216 cells) for the three-dimensional FDTD simulation.]

Figure 7.2: Speed-up for three-dimensional FDTD simulation.


[Figure 7.3 plots throughput (Mcells/s) against simulation size (2,097,152 to 16,777,216 cells) for the three-dimensional FDTD simulation on CUDA.]

Figure 7.3: Throughput for three-dimensional FDTD simulation running on CUDA.


In Table 7.7, there are efficiencies that exceed 1. This could be due to inaccuracies in the Visual Profiler program, since the CUDA Compute Visual Profiler User Guide confirms that the efficiency should be between 0 and 1.

Method        gld efficiency   gst efficiency
ex_cuda_d     0.81388          1.01576
ey_cuda_d     0.785047         0.984379
ez_cuda_d     0.75882          1.00782
hx_cuda_d     0.857143         0.666667
hy_cuda_d     0.847646         0.658067
hz_cuda_d     0.941704         0.681822

Table 7.7: Results from visual profiling of the three-dimensional FDTD simulation (with the new indexing of arrays in the kernel).

Also, the plots of Figure 7.2 and Figure 7.3 show an irregularity in performance with respect to simulation size. In the one- and two-dimensional FDTD simulations, the trend is an increase in throughput and speed-up with increasing simulation size. However, this trend did not extend to the three-dimensional FDTD simulations. The results generated by the Visual Profiler showed no obvious reason for this occurrence, so its cause still needs to be investigated.

7.8 Discrepancy In Results

The same tests for discrepancy between CPU and GPU as in the one-dimensional FDTD simulation detailed in Section 5.4 are performed here. The results are shown in Table 7.8 below.


Simulation Size     Difference        Percentage change
128 × 128 × 128     2.980232 × 10⁻⁷   0.113527%
160 × 160 × 160     2.980232 × 10⁻⁷   0.018938%
192 × 192 × 192     2.980232 × 10⁻⁷   0.136416%
224 × 224 × 224     2.980232 × 10⁻⁷   0.079594%
256 × 256 × 256     2.980232 × 10⁻⁷   0.068311%

Table 7.8: Discrepancy in results between CPU and CUDA simulation of the three-dimensional FDTD method.

While the CUDA framework does not fully conform to the IEEE 754-2008 standard for floating-point computation (as discussed in Section 3.2), the results show that the discrepancy between CUDA and the CPU is relatively small. The speed-ups and throughput that can be achieved by CUDA are more than sufficient to justify disregarding the small discrepancies. This holds in all of the one-, two- and three-dimensional FDTD simulations. Moreover, newer CUDA GPUs are designed to conform to the IEEE 754-2008 standard. Figure 7.4 and Figure 7.5 show plots of the data generated from the simulation of 160 × 160 × 160 cells. The plots show that the discrepancy between the CPU and CUDA is neither significant nor noticeable.

7.9 Alternative Absorbing Boundaries

Berenger's PMLs have been used as the absorbing region for the three-dimensional FDTD simulation up until this point. Just as in the case of the two-dimensional FDTD method, it was found that the PMLs were not particularly well-suited to running in CUDA. Thus, alternative absorbing boundary conditions were used to investigate how much more speed-up and throughput could be achieved.


[Figure 7.4 plots magnitude against time-step (0 to 1000) for the CPU and GPU (CUDA) results.]

Figure 7.4: Plot of 3-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the results for simulation size 160 × 160 × 160 cells. The location of the probe is at cell (60, 60, 60) as specified in Table 7.1.


[Figure 7.5 plots magnitude against time-step (300 to 350) for the CPU and GPU (CUDA) results.]

Figure 7.5: Plot of 3-D FDTD simulation results to compare accuracy between CPU and GPU. The plots are generated from the results for simulation size 160 × 160 × 160 cells. The location of the probe is at cell (60, 60, 60) as specified in Table 7.1. The plot is zoomed in on time-steps 300 to 350.


The three ABCs used are Mur's first order ABC, Mur's second order ABC and Liao's ABC. This is similar to what was done for the two-dimensional FDTD simulation in Section 6.8. The equations for the ABCs are shown below. The equations assume that the boundary is at x = 0; although only Ez is considered, the equations have to be applied to Ex and Ey as well.

Mur's 1st order ABC:

E_z^{n+1}\big|_{0,j,k} = E_z^{n}\big|_{1,j,k} + \frac{c\Delta t - \Delta x}{c\Delta t + \Delta x}\left(E_z^{n+1}\big|_{1,j,k} - E_z^{n}\big|_{0,j,k}\right)   (7.1)

Mur's 2nd order ABC:

E_z^{n+1}\big|_{0,j,k} = -E_z^{n-1}\big|_{1,j,k} + EQ_1 + EQ_2 + EQ_3 + EQ_4   (7.2)

EQ_1 = \frac{c\Delta t - \Delta x}{c\Delta t + \Delta x}\left(E_z^{n+1}\big|_{1,j,k} + E_z^{n-1}\big|_{0,j,k}\right)   (7.3)

EQ_2 = \frac{2\Delta x}{c\Delta t + \Delta x}\left(E_z^{n}\big|_{0,j,k} + E_z^{n}\big|_{1,j,k}\right)   (7.4)

EQ_3 = \frac{\Delta x\,(c\Delta t)^2}{2(\Delta y)^2(c\Delta t + \Delta x)}\left(C_a + C_b\right)   (7.5)

EQ_4 = \frac{\Delta x\,(c\Delta t)^2}{2(\Delta z)^2(c\Delta t + \Delta x)}\left(C_c + C_d\right)   (7.6)

C_a = E_z^{n}\big|_{0,j+1,k} - 2E_z^{n}\big|_{0,j,k} + E_z^{n}\big|_{0,j-1,k}   (7.7)

C_b = E_z^{n}\big|_{1,j+1,k} - 2E_z^{n}\big|_{1,j,k} + E_z^{n}\big|_{1,j-1,k}   (7.8)

C_c = E_z^{n}\big|_{0,j,k+1} - 2E_z^{n}\big|_{0,j,k} + E_z^{n}\big|_{0,j,k-1}   (7.9)

C_d = E_z^{n}\big|_{1,j,k+1} - 2E_z^{n}\big|_{1,j,k} + E_z^{n}\big|_{1,j,k-1}   (7.10)

Liao's ABC:

E_z^{n+1}\big|_{0,j,k} = \sum_{m=1}^{N} (-1)^{m+1}\, C_m^N\, E_z^{\,n+1-(m-1)}\big|_{mc\Delta t,\,j,k}   (7.11)

where N is the order of the boundary condition and C_m^N is the binomial coefficient given by:

C_m^N = \frac{N!}{m!\,(N-m)!}   (7.12)
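Equation (7.1) is the simplest of the three to map onto CUDA, since each boundary cell depends only on stored previous-time values at i = 0 and i = 1. The following sketch illustrates one way such an update could be written as a kernel over the x = 0 face, with one thread per (j, k) cell. It is only an illustration under stated assumptions: the flat row-major index, the array names and the stored-field array ez_old are not taken from the thesis code.

// Mur first-order ABC on the x = 0 face, following Eq. (7.1).
// ez     : field after the interior update (time n+1)
// ez_old : copy of ez at cells i = 0 and i = 1 from the previous step (time n)
// coef   : (c*dt - dx) / (c*dt + dx)
__global__ void mur1_x0(float *ez, const float *ez_old, float coef,
                        int ny, int nz)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int k = blockIdx.y * blockDim.y + threadIdx.y;
    if (j >= ny || k >= nz) return;

    // illustrative flat index: cell (i, j, k) -> (i*ny + j)*nz + k
    int idx0 = (0 * ny + j) * nz + k;   // boundary cell  (0, j, k)
    int idx1 = (1 * ny + j) * nz + k;   // neighbour cell (1, j, k)

    // Eq. (7.1): Ez^{n+1}(0) = Ez^n(1) + coef * ( Ez^{n+1}(1) - Ez^n(0) )
    ez[idx0] = ez_old[idx1] + coef * (ez[idx1] - ez_old[idx0]);
}

After the kernel runs, the values of ez at i = 0 and i = 1 are saved into ez_old for use in the next time step.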

The speed-up and throughput for the three-dimensional FDTD simulation on CUDA are shown in Figure 7.6 and Figure 7.7 respectively. It is apparent from the results that the irregularity in performance for varying simulation sizes noted with the PMLs still exists when other ABCs are used. However, Mur's ABCs performed better than the PMLs. It must also be noted that Liao's ABC was implemented using doubles while the other ABCs were implemented using floats. From the results, it can be argued that Liao's ABC performed reasonably well, which could be attributed to the simplicity of Liao's ABC compared to the PML and Mur's ABCs. All three types of ABC produced results that were reasonably similar to those of the PML. Also, the PML width used for the simulations is only 4 cells, which should be taken into consideration when analysing the discrepancy: if the width of the PML region were increased, there would be less reflection at the boundaries, but the execution time would increase.


[Figure 7.6 plots speed-up against simulation size (2,097,152 to 16,777,216 cells) for the PML, 1st order Mur, 2nd order Mur and Liao boundaries.]

Figure 7.6: Comparison of speed-up for various ABCs in three-dimensional FDTD simulation running on CUDA.


[Figure 7.7 plots throughput (Mcells/s) against simulation size (2,097,152 to 16,777,216 cells) for the PML, 1st order Mur, 2nd order Mur and Liao boundaries.]

Figure 7.7: Comparison of throughput for various ABCs in three-dimensional FDTD simulation running on CUDA.


[Figure 7.8 plots magnitude against time-step (0 to 1000) for the PML, 1st order Mur, 2nd order Mur and Liao boundaries.]

Figure 7.8: Plot of 3-D FDTD simulation results to compare accuracy between various ABCs. The plots are generated from the results for simulation size 160 × 160 × 160 cells. The location of the probe is at cell (60, 60, 60) as specified in Table 7.1.


[Figure 7.9 plots magnitude against time-step (400 to 700) for the PML, 1st order Mur, 2nd order Mur and Liao boundaries.]

Figure 7.9: Plot of 3-D FDTD simulation results to compare accuracy between various ABCs. The plots are generated from the results for simulation size 160 × 160 × 160 cells. The location of the probe is at cell (60, 60, 60) as specified in Table 7.1. The plot is zoomed in on time-steps 400 to 700.


7.10 Conclusions

Porting the three-dimensional FDTD method to execute on CUDA produced significant speed-ups. However, it is significantly more difficult to obtain optimal performance than for the one- and two-dimensional FDTD simulations. One reason for this is that CUDA does not support launching a large number of threads in a three-dimensional configuration. Although the throughput achieved was less than that of the one- and two-dimensional code, the speed-up achieved is encouraging. For example, it takes the CPU more than an hour to complete the simulation of 256 × 256 × 256 cells, but CUDA takes just over two minutes to complete the same simulation. Also, as was discovered with the two-dimensional FDTD simulations, there are other ABCs that perform better on CUDA than the PMLs; the PMLs are not as well-suited to leveraging CUDA's parallel architecture.


8 Conclusions
8.1 Thesis Conclusions
The thesis finds that the FDTD method is very well suited to run on parallel architectures such as CUDA. Throughputs of over 1,000 Mcells/s can be achieved on CUDA where the CPU only manages 4 Mcells/s. However, achieving the speed-ups can be difficult; it is in fact significantly more difficult to write code that runs optimally on CUDA than on the CPU. Most of the problems that were faced during development were related to memory accesses on CUDA. There is various documentation and there are tools available to developers, but occasionally these resources are insufficient. However, it is the author's opinion that the difficulty in producing optimal CUDA programs is not a strong deterrent to using CUDA to execute the FDTD method. Taking into perspective the results that have been achieved in the thesis, the use of CUDA is without doubt a good choice. It is hard to imagine waiting an hour for the CPU to complete a simulation of 256 × 256 × 256 cells when it can be done in just over two minutes using CUDA. It should also be noted that not only are the computational capabilities of


GPUs continuing to rise at a faster pace than those of CPUs [8], but the CUDA framework is also consistently being improved. With these improvements, it is almost certain that programming on CUDA will become easier and higher speed-ups will be achieved in the near future. As for the progress of the thesis, the simulations for the three-dimensional FDTD method took longer than expected due to unforeseen complexity. As the development progressed, it was found that experimenting with the use of alternative ABCs in place of the PMLs would be beneficial. Thus, Mur's first order, Mur's second order and Liao's ABCs were implemented as advanced work for the thesis.

8.2 Future Work

This thesis explored the capabilities and prospects of using CUDA as an alternative to CPUs; thus, the implementation of the CUDA program was relatively straightforward. There is still a lot of work that can be done, such as investigating the use of shared memory and texture memory. While the thesis focused on the higher-priority recommendations of the CUDA Best Practices Guide [13], there are numerous other optimizations that can be explored to produce a more optimal CUDA program. For the FDTD method on CUDA, the use of Mur's and Liao's ABCs can definitely reduce simulation times compared to using PMLs; this is another topic that can be studied in the future. Although the focus of this thesis is on NVIDIA's CUDA architecture, there are other parallel architectures in the market today. GPU manufacturers including NVIDIA and AMD support the Open Computing Language (OpenCL), which is a royalty-free standard for parallel computing


[21, 22]. Research into OpenCL and comparisons with CUDA can also be done as future work.


Bibliography

[1] Kane S. Yee. Numerical solution of initial boundary value problems involving Maxwell's equations in isotropic media. Antennas and Propagation, IEEE Transactions on, 14(3):302–307, 1966.

[2] K. L. Shlager and J. B. Schneider. A selective survey of the finite-difference time-domain literature. Antennas and Propagation Magazine, IEEE, 37(4):39–57, 1995.

[3] L. M. Angelone, S. Tulloch, G. Wiggins, S. Iwaki, N. Makris, and G. Bonmassar. New high resolution head model for accurate electromagnetic field computation. In ISMRM Thirteenth Scientific Meeting, page 881, Miami, FL, USA, 2005.

[4] Wenhua Yu et al. Parallel finite-difference time-domain method. Artech House electromagnetic analysis series. Artech House, Boston, MA, 2006.

[5] Dennis Michael Sullivan. Electromagnetic simulation using the FDTD method. IEEE Press series on RF and microwave technology. IEEE Press, New York, 2000.

[6] NVIDIA. CUDA C Programming Guide, version 3.1.1, 2010.

[7] NVIDIA. Fermi Compute Architecture White Paper, 2009.


[8] Ong Cen Yen, M. Weldon, S. Quiring, L. Maxwell, M. Hughes, C. Whelan, and M. Okoniewski. Speed it up. Microwave Magazine, IEEE, 11(2):70–78, 2010.

[9] NVIDIA. Tesla C2050/C2070 GPU computing processor at 1/10th the cost, 2010.

[10] Karl E. Hillesland and Anselmo Lastra. GPU floating-point paranoia, 2004.

[11] United States General Accounting Office. Patriot missile defense: Software problem led to system failure at Dhahran, Saudi Arabia, February 4, 1992.

[12] Robert Sedgewick and Kevin Daniel Wayne. Floating point, 2010.

[13] NVIDIA. CUDA C Best Practices Guide, version 3.1, 2010.

[14] Paulius Micikevicius. 3D finite difference computation on GPUs using CUDA, 2009.

[15] James F. Stack. Accelerating the finite difference time domain (FDTD) method with CUDA, 2010.

[16] Alvaro Valcarce and Jie Zhang. Implementing a 2D FDTD scheme with CPML on a GPU using CUDA, 2010.

[17] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Zhang Yao, and V. Volkov. Parallel computing experiences with CUDA. Micro, IEEE, 28(4):13–27, 2008.

[18] NVIDIA. Compute Visual Profiler, 2010.

[19] Rob Farber. CUDA, supercomputing for the masses: Part 6, 2008.


[20] NVIDIA. CUDA Reference Manual, version 3.1, 2010.

[21] Khronos Group. Khronos launches heterogeneous computing initiative, 2008.

[22] Khronos Group. The Khronos Group releases OpenCL 1.0 specification, 2008.

[23] S. Adams, J. Payne, and R. Boppana. Finite difference time domain (FDTD) simulations using graphics processors. In DoD High Performance Computing Modernization Program Users Group Conference, 2007, pages 334–338, 2007.

[24] G. Cummins, R. Adams, and T. Newell. Scientific computation through a GPU. In Southeastcon, 2008. IEEE, pages 244–246, 2008.

[25] Atef Z. Elsherbeni and Veysel Demir. The finite-difference time-domain method for electromagnetics with MATLAB simulations. SciTech Publishing, Raleigh, NC, 2009.

[26] C. D. Moss, F. L. Teixeira, and Kong Jin Au. Analysis and compensation of numerical dispersion in the FDTD method for layered, anisotropic media. Antennas and Propagation, IEEE Transactions on, 50(9):1174–1184, 2002.

[27] NVIDIA. CUDA architecture introduction & overview, 2009.

[28] Robert Sedgewick and Kevin Daniel Wayne. Introduction to programming in Java: an interdisciplinary approach. Pearson Addison-Wesley, Boston, 2008.


[29] Allen Taflove. Computational electrodynamics: the finite-difference time-domain method. Artech House, Boston, 1995.

[30] F. Zheng and Z. Chen. Numerical dispersion analysis of the unconditionally stable 3-D ADI-FDTD method. Microwave Theory and Techniques, IEEE Transactions on, 49(5):1006–1009, 2001.


A Emory Specifications

These specifications are obtained by executing the deviceQuery program bundled together with the NVIDIA GPU Computing SDK.
CUDA Device Query (Runtime API) version (CUDART static linking)
There are 2 devices supporting CUDA

Device 0: "Tesla T10 Processor"
  CUDA Driver Version:                           3.10
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         3
  Total amount of global memory:                 4294770688 bytes
  Number of multiprocessors:                     30
  Number of cores:                               240
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.30 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default
    (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   No
  Device has ECC support enabled:                No

Device 1: "Tesla T10 Processor"
  CUDA Driver Version:                           3.10
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         3
  Total amount of global memory:                 4294770688 bytes
  Number of multiprocessors:                     30
  Number of cores:                               240
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.30 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default
    (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   No
  Device has ECC support enabled:                No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.10, CUDA Runtime Version = 3.10, NumDevs = 2, Device = Tesla T10 Processor, Device = Tesla T10 Processor


B Uncoalesced Memory Access Test
/*
 * Test cudaMallocPitch() usage.
 *
 * ./binary <size_x> <size_y>
 *
 */
#include <stdlib.h>
#include <stdio.h>
#include <cuda.h>
#include <math.h>

#define DIM_SIZE 16

#define NUM_REPS 10

__global__ void copy(float *odata, float* idata, int pitch, int size_x, int size_y) {
    unsigned int yid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int xid = blockIdx.y * blockDim.y + threadIdx.y;

    if ( xid < size_y && yid < size_x ) {
        int index = ( yid * pitch / sizeof(float) ) + xid;
        odata[index] = idata[index] * 2.0;
    }
}

// http://www.drdobbs.com/high-performance-computing/207603131
void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( err != cudaSuccess ) {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }
}

int main (int argc, char** argv) {
    int size_x = 64;
    int size_y = 64;

    if ( argc > 2 ) {
        sscanf(argv[1], "%d", &size_x);
        sscanf(argv[2], "%d", &size_y);
    }

    printf("size_x: %d\n", size_x);
    printf("size_y: %d\n", size_y);

    // execution configuration parameters
    dim3 grid(ceil((float)size_x/DIM_SIZE), ceil((float)size_y/DIM_SIZE), 1), threads(DIM_SIZE, DIM_SIZE, 1);

    // size of memory required to store the matrix
    const int mem_size = sizeof(float) * size_x * size_y;

    // allocate host memory
    float *h_idata = (float*) malloc(mem_size);
    float *h_odata = (float*) malloc(mem_size);

    // initalize host data
    for ( int i = 0; i < size_x*size_y; i++ ) {
        h_idata[i] = (float)i;
        h_odata[i] = 0.0f;
    }

    // allocate device memory
    float *d_idata, *d_odata;
    size_t pitch;
    cudaMallocPitch((void**)&d_idata, &pitch, size_y * sizeof(float), size_x);
    checkCUDAError("cudaMallocPitch");
    cudaMallocPitch((void**)&d_odata, &pitch, size_y * sizeof(float), size_x);
    checkCUDAError("cudaMallocPitch");

    printf("dpitch: %d\n", pitch);

    // copy host data to device
    cudaMemcpy2D(d_idata, pitch, h_idata, size_y * sizeof(float), size_y * sizeof(float), size_x, cudaMemcpyHostToDevice);
    checkCUDAError("cudaMemcpy2D H to D");

    for ( int i=0; i < NUM_REPS; i++) {
        copy<<<grid, threads>>>(d_odata, d_idata, pitch, size_x, size_y);
    }

    // copy device data to host
    cudaMemcpy2D(h_odata, size_y * sizeof(float), d_odata, pitch, size_y * sizeof(float), size_x, cudaMemcpyDeviceToHost);
    checkCUDAError("cudaMemcpy2D D to H");

    for ( int k = 0; k < size_x * size_y; k++ ) {
        if ( h_odata[k] != h_idata[k] * 2.0 ) {
            printf("Mismatch!\n");
            printf("h_idata[%d] = %f\n", k, h_idata[k]);
            printf("h_odata[%d] = %f\n", k, h_odata[k]);

            printf("---result---\n");
            for ( int i = 0; i < size_x; i++ ) {
                for ( int j = 0; j < size_y; j++ ) {
                    printf("%d ", (int)h_odata[i*size_y+j]);
                }
                printf("\n");
            }
            break;
        }
    }

    free(h_idata);
    free(h_odata);

    cudaFree(d_idata);
    cudaFree(d_odata);

    printf("Completed.\n");
    return 0;
}

Listing B.1: Source code for testing uncoalesced memory access.


# CUDA PROFILE LOG VERSION 2.0

# CUDA DEVICE 0 Tesla T10 Processor

# CUDA PROFILE CSV 1 cputime gld coherent gld incoherent gst coherent gst incoherent 3602 7635 7508 7597 7647 7530 7436 7629 7536 7530 7580 3253 104704 104960 0 0 104704 0 104960 0 104960 0 104704 0 104960 0 209920 209408 209920 209920 209408 209920 209408 104704 0 209408 104960 0 209920 104960 0 209920 0 0 0 0 0 0 0 0 0 0

method

gputime

memcpyHtoD

3000.768

Z4copyPfS iii

7579.68

Z4copyPfS iii

7488.16

Z4copyPfS iii 7576.928

Z4copyPfS iii 7628.576

Z4copyPfS iii

7511.2

Z4copyPfS iii 7417.344

Z4copyPfS iii 7609.152

Z4copyPfS iii

7517.76

Z4copyPfS iii 7514.816

Z4copyPfS iii 7563.744

memcpyDtoH

2666.496


Table B.1: CUDA Textual Profiler results from testing uncoalesced memory access.

C Coalesced Memory Access Test
/*
 * Test cudaMallocPitch() usage.
 *
 * ./binary <size_x> <size_y>
 *
 */
#include <stdlib.h>
#include <stdio.h>
#include <cuda.h>
#include <math.h>

#define DIM_SIZE 16

#define NUM_REPS 10

__global__ void copy(float *odata, float* idata, int pitch, int size_x, int size_y) {
    unsigned int xid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int yid = blockIdx.y * blockDim.y + threadIdx.y;

    if ( xid < size_y && yid < size_x ) {
        int index = ( yid * pitch / sizeof(float) ) + xid;
        odata[index] = idata[index] * 2.0;
    }
}

// http://www.drdobbs.com/high-performance-computing/207603131
void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( err != cudaSuccess ) {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }
}

int main (int argc, char** argv) {
    int size_x = 64;
    int size_y = 64;

    if ( argc > 2 ) {
        sscanf(argv[1], "%d", &size_x);
        sscanf(argv[2], "%d", &size_y);
    }

    printf("size_x: %d\n", size_x);
    printf("size_y: %d\n", size_y);

    // execution configuration parameters
    dim3 grid(ceil((float)size_y/DIM_SIZE), ceil((float)size_x/DIM_SIZE), 1), threads(DIM_SIZE, DIM_SIZE, 1);

    // size of memory required to store the matrix
    const int mem_size = sizeof(float) * size_x * size_y;

    // allocate host memory
    float *h_idata = (float*) malloc(mem_size);
    float *h_odata = (float*) malloc(mem_size);

    // initalize host data
    for ( int i = 0; i < size_x*size_y; i++ ) {
        h_idata[i] = (float)i;
        h_odata[i] = 0.0f;
    }

    // allocate device memory
    float *d_idata, *d_odata;
    size_t pitch;
    cudaMallocPitch((void**)&d_idata, &pitch, size_y * sizeof(float), size_x);
    checkCUDAError("cudaMallocPitch");
    cudaMallocPitch((void**)&d_odata, &pitch, size_y * sizeof(float), size_x);
    checkCUDAError("cudaMallocPitch");

    printf("dpitch: %d\n", pitch);

    // copy host data to device
    cudaMemcpy2D(d_idata, pitch, h_idata, size_y * sizeof(float), size_y * sizeof(float), size_x, cudaMemcpyHostToDevice);
    checkCUDAError("cudaMemcpy2D H to D");

    for ( int i=0; i < NUM_REPS; i++) {
        copy<<<grid, threads>>>(d_odata, d_idata, pitch, size_x, size_y);
    }

    // copy device data to host
    cudaMemcpy2D(h_odata, size_y * sizeof(float), d_odata, pitch, size_y * sizeof(float), size_x, cudaMemcpyDeviceToHost);
    checkCUDAError("cudaMemcpy2D D to H");

    for ( int k = 0; k < size_x * size_y; k++ ) {
        if ( h_odata[k] != h_idata[k] * 2.0 ) {
            printf("Mismatch!\n");
            printf("h_idata[%d] = %f\n", k, h_idata[k]);
            printf("h_odata[%d] = %f\n", k, h_odata[k]);

            printf("---result---\n");
            for ( int i = 0; i < size_x; i++ ) {
                for ( int j = 0; j < size_y; j++ ) {
                    printf("%d ", (int)h_odata[i*size_y+j]);
                }
                printf("\n");
            }
            break;
        }
    }

    free(h_idata);
    free(h_odata);

    cudaFree(d_idata);
    cudaFree(d_odata);

    printf("Completed.\n");
    return 0;
}

Listing C.1: Source code for testing coalesced memory access.


# CUDA PROFILE LOG VERSION 2.0

# CUDA DEVICE 0 Tesla T10 Processor

# CUDA PROFILE CSV 1 cputime gld coherent gld incoherent gst coherent gst incoherent 3615s 201 190 178 181 181 177 179 176 179 190 3280 6544 6560 0 0 6544 0 6560 0 6560 0 6544 0 6560 0 26240 26176 26240 26240 26176 26240 26176 6544 0 26176 6560 0 26240 6560 0 26240 0 0 0 0 0 0 0 0 0 0

method

gputime

memcpyHtoD

3002.4

Z4copyPfS iii

161.792

Z4copyPfS iii

161.92

Z4copyPfS iii

159.456

Z4copyPfS iii

161.472

Z4copyPfS iii

162.272

Z4copyPfS iii

161.152

Z4copyPfS iii

160.224

Z4copyPfS iii

160.576

Z4copyPfS iii

160.32

Z4copyPfS iii

162.112

memcpyDtoH

2678.368


Table C.1: CUDA Textual Profiler results from testing coalesced memory access.
