
Applications of GPU Computing

Alex Karantza 0306-722 Advanced Computer Architecture Fall 2011

Outline
- Introduction
- GPU Architecture
  - Multiprocessing
  - Vector ISA
- GPUs in Industry
  - Scientific Computing
  - Image Processing
  - Databases
- Examples and Benefits

Introduction
"GPUs have evolved to the point where many real world applications are easily implemented on them and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs."

- Prof. Jack Dongarra, director of the Innovative Computing Laboratory at the University of Tennessee and author of LINPACK

GPU Architecture
(as typified by NVIDIA CUDA)

A parallel coprocessor to conventional CPUs
Implements a SIMD structure: many threads running the same code

Grid of Blocks of Threads
- Thread: local registers
- Block: local memory and control
- Grid: global memory

Grids, Blocks, and Threads
- Thread -> Thread Processor: local registers and memory; a scalar processor
- Thread Block -> Multiprocessor: shared memory and registers; shared control logic
- Grid -> Device(s): global memory; can be easily distributed across devices
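The mapping above can be sketched as a plain-Python CPU simulation (illustrative only, not CUDA; the kernel, names, and grid/block sizes here are made up for the example). Each (block, thread) pair computes one element, indexed the way a CUDA kernel computes `blockIdx.x * blockDim.x + threadIdx.x`:

```python
# Illustrative CPU simulation of a 1-D grid of blocks of threads.
# Each "thread" handles one element, using the same global index a
# CUDA kernel would: global_id = block_id * block_dim + thread_id.

def launch_kernel(kernel, grid_dim, block_dim, *args):
    for block_id in range(grid_dim):        # blocks run independently
        for thread_id in range(block_dim):  # threads within a block
            kernel(block_id, thread_id, block_dim, *args)

def scale_kernel(block_id, thread_id, block_dim, data, out, factor):
    gid = block_id * block_dim + thread_id  # global thread index
    if gid < len(data):                     # guard threads past the end
        out[gid] = data[gid] * factor

data = list(range(10))
out = [0] * len(data)
launch_kernel(scale_kernel, 3, 4, data, out, 2)  # 3 blocks of 4 threads
```

On a real device the two loops do not exist; every (block, thread) pair runs concurrently, which is why the guard against running past the end of the data is needed.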

GPU Architecture
Processors also implement vector instructions
- Vectors of length 2, 3, or 4 of any fundamental type: integer, float, bits, predicate
- Instructions for conversion between vector and scalar forms

To encourage uniform execution, conditionals use predicates rather than branches
- Any instruction can be conditionally executed based on predicate registers

Vectors and Predicates

.global .v4 .f32 V;    // a length-4 vector of floats
.shared .v2 .u16 uv;   // a length-2 vector of unsigned shorts
.global .v4 .b8  v;    // a length-4 vector of bytes

.reg .s32  a, b;       // two 32-bit signed ints
.reg .pred p;          // a predicate register

   setp.lt.s32 p, a, b;          // if a < b, set p
@p add.v4.f32  V, V, {1,0,0,0};  // if p, V.x = V.x + 1
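A rough Python model of what predication buys across SIMD lanes (an illustrative sketch, not full PTX semantics; the lane values are made up): every lane executes the add, and a per-lane predicate bit decides whether the result is committed, so no lane ever takes a branch.

```python
# Illustrative model of predicated execution across SIMD lanes:
# all lanes execute the instruction in lockstep; the predicate mask
# selects which lanes commit the result. No branches are taken.

def setp_lt(a, b):
    # like PTX setp.lt: one predicate bit per lane
    return [x < y for x, y in zip(a, b)]

def predicated_add(pred, vec, inc):
    # like "@p add ...": evaluated everywhere, written only where pred is set
    return [x + inc if p else x for p, x in zip(pred, vec)]

a = [1, 5, 2, 9]
b = [4, 3, 8, 0]
p = setp_lt(a, b)                            # per-lane comparison
v = predicated_add(p, [10, 10, 10, 10], 1)   # add lands only where p is set
```

Keeping all lanes on the same instruction stream is what lets the multiprocessor's shared control logic drive many scalar processors at once.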

(Image: NSF Keeneland cluster, 360 Tesla 20-series GPUs)

GPUs in Industry
Many applications have been developed to use GPUs for supercomputing in various fields
- Scientific Computing: CFD, molecular dynamics, genome sequencing, mechanical simulation, quantum electrodynamics
- Image Processing: registration, interpolation, feature detection, recognition, filtering
- Data Analysis: databases, sorting and searching, data mining

Major Categories of Algorithm
- 2D/3D filtering operations
- n-body simulations
- Parallel tree operations: searching and sorting
All suited to GPUs because of their data-parallel requirements and uniform kernels

Computational Fluid Dynamics
Simulates fluids in a discrete volume over time
- Involves iteratively solving the Navier-Stokes partial differential equations on a grid
- Can be considered a filtering operation
When parallelized on a GPU using multigrid solvers, 10x speedups have been reported
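To sketch why an iterative PDE solve looks like filtering, here is a minimal Jacobi relaxation for a 1-D Laplace problem in Python (an illustrative stand-in; real CFD solvers run multigrid over Navier-Stokes in 2-D/3-D, and the grid values below are made up). Each sweep applies a small fixed stencil at every cell, which is exactly a filtering pass, so all cells can be updated in parallel.

```python
# One Jacobi sweep for the 1-D Laplace equation: each interior cell
# becomes the average of its neighbors -- a 3-point stencil, i.e. a
# filtering pass. Cells are independent within a sweep.

def jacobi_sweep(u):
    new = u[:]                        # boundary values stay fixed
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new

def solve(u, sweeps):
    for _ in range(sweeps):
        u = jacobi_sweep(u)
    return u

# Fixed endpoints 0 and 1: the interior relaxes toward a straight line.
u = solve([0.0, 0.0, 0.0, 0.0, 1.0], sweeps=200)
```

Multigrid accelerates exactly this kind of sweep by relaxing on a hierarchy of coarser grids, but each level remains a data-parallel stencil.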

Molecular Dynamics
Large set of particles with forces between them
- protein behavior, material simulation
Calculating forces can be done in parallel for each particle
Accumulation of forces can be implemented as multilevel parallel sums
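Both halves of that statement can be sketched in Python (illustrative only; the toy 1-D inverse-square force and particle positions are made up, not a real molecular potential): the per-particle force loop is embarrassingly parallel, and each accumulation can itself be a tree-shaped parallel sum.

```python
# Illustrative n-body sketch: the force on each particle is computed
# independently (one thread per particle on a GPU), and each
# per-particle accumulation is a pairwise tree reduction.

def pair_force(xi, xj):
    # toy 1-D inverse-square attraction; stands in for a real potential
    d = xj - xi
    return d / (abs(d) ** 3) if d != 0 else 0.0

def tree_sum(vals):
    # pairwise reduction: the shape a GPU parallel sum takes,
    # halving the number of partial sums at every level
    while len(vals) > 1:
        vals = [vals[i] + (vals[i + 1] if i + 1 < len(vals) else 0.0)
                for i in range(0, len(vals), 2)]
    return vals[0] if vals else 0.0

def forces(xs):
    out = []
    for i, xi in enumerate(xs):        # each entry is independent
        contrib = [pair_force(xi, xj) for j, xj in enumerate(xs) if j != i]
        out.append(tree_sum(contrib))
    return out
```

The tree reduction matters because it finishes in O(log n) parallel steps, where a serial accumulation would take O(n).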

Genetics
Large strings of genome sequences must be searched through to organize and identify samples
GPUs enable multiple parallel queries against the database to perform string matching
Again, order-of-magnitude speedups have been reported

Electrodynamics
Simulation of electric fields and Coulomb forces
Requires iterative solving of partial differential equations
Cell phone modeling applications have reported 50x speedups using GPUs

Image Processing
Medical imaging was the early adopter
- Registration of massive 3D voxel images
- Both the cost function for deformable registration and the interpolation of results are filtering operations
Generic feature detection, recognition, and object extraction are all filters
For object recognition, one can search a database of objects in parallel
Moving these algorithms off the CPU can allow real-time interaction

Data Analysis
Huge databases for web services require instant results for many simultaneous users
- There is insufficient room in main memory; disk is too slow and doesn't allow parallel reads
- GPUs can split up the data and perform fast searches, each keeping its section in memory
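A minimal Python sketch of that partitioned-search idea (illustrative; the round-robin split, worker count, and predicate are made up): the table is split across workers, as it would be across GPU multiprocessors or devices, each scans only the rows it holds, and partial results are merged.

```python
# Illustrative partitioned search: split the table across workers,
# let each scan only its own resident partition, then merge.
# On a GPU all partitions would be scanned simultaneously.

def partition(rows, n):
    # round-robin split into n partitions
    return [rows[i::n] for i in range(n)]

def scan(part, predicate):
    # each worker filters only the rows it holds in its own memory
    return [row for row in part if predicate(row)]

def parallel_select(rows, predicate, n_workers=4):
    parts = partition(rows, n_workers)
    results = [scan(p, predicate) for p in parts]  # parallel on a GPU
    return sorted(r for res in results for r in res)

rows = list(range(100))
evens = parallel_select(rows, lambda r: r % 2 == 0)
```

Keeping each partition resident in device memory is what sidesteps the disk bottleneck described above.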

Example: Filtering Operation

Many algorithms can be reduced to a filtering operation. As an example, consider image convolution for blurring:

Kernel = Gaussian2D(size);
for (x,y) in Input {
    for (p,q) in Kernel {
        Output(x,y) += Input(x+p,y+q) * Kernel(p,q);
    }
}

Example: Filtering Operation

A quick optimization applies to the many filters that are separable: the filter can be done in one pass per dimension

Kernel = Gaussian1D(size);

for (x,y) in Input {
    for (p) in Kernel {
        Output(x,y) += Input(x+p,y) * Kernel(p);
    }
}

for (x,y) in Input {
    for (q) in Kernel {
        Output(x,y) += Input(x,y+q) * Kernel(q);
    }
}
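The two passes above can be written as runnable Python (an illustrative CPU version, not the GPU code; the 3-tap kernel stands in for `Gaussian1D(size)`, and zero padding is assumed at the borders): filtering rows and then columns with a 1-D kernel gives the same result as one pass with the 2-D outer-product kernel.

```python
# Separable convolution as two 1-D passes with zero padding,
# equivalent to convolving once with the 2-D outer-product kernel.

def conv_rows(img, k):
    h, w, r = len(img), len(img[0]), len(k) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for p in range(-r, r + 1):
                if 0 <= x + p < w:          # zero padding at the borders
                    out[y][x] += img[y][x + p] * k[p + r]
    return out

def conv_cols(img, k):
    # transpose, filter rows, transpose back
    t = [list(row) for row in zip(*img)]
    return [list(row) for row in zip(*conv_rows(t, k))]

def separable_blur(img, k):
    return conv_cols(conv_rows(img, k), k)
```

The equivalence follows from linearity: summing over p and then q is the same double sum as the 2-D loop, just reordered.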

Example: Filtering Operation

This is still O(2n²m) on a sequential processor (n×n image, m-tap kernel)
Each output pixel is independent, but shares spatially local data and a constant kernel

UploadGPU(Kernel, CONSTANT);
UploadGPU(Input, TEXTURE);
ConvolveColumnsGPU<blocks,threads>();
ConvolveRowsGPU<blocks,threads>();
DownloadGPU(Output, TEXTURE);

Example: Filtering Operation

The complexity remains the same; however, each multiply-accumulate instruction can be executed on as many processors as are available, and memory can be accessed quickly because of the assignment of blocks and texture memory

In practice, the overhead of uploading to and downloading from the GPU is far less than the performance gained in the kernel

Example: Filtering Operation

__global__ void convolutionColumnsKernel(
    float *d_Dst, float *d_Src,
    int imageW, int imageH, int pitch
){
    __shared__ float s_Data[COLUMNS_BLOCKDIM_X]
        [(COLUMNS_RESULT_STEPS + 2 * COLUMNS_HALO_STEPS) * COLUMNS_BLOCKDIM_Y + 1];

    // *snip* Populate s_Data from d_Src
    __syncthreads();

    #pragma unroll
    for(int i = COLUMNS_HALO_STEPS; i < COLUMNS_HALO_STEPS + COLUMNS_RESULT_STEPS; i++){
        float sum = 0;
        #pragma unroll
        for(int j = -KERNEL_RADIUS; j <= KERNEL_RADIUS; j++)
            sum += c_Kernel[KERNEL_RADIUS - j] *
                   s_Data[threadIdx.x][threadIdx.y + i * COLUMNS_BLOCKDIM_Y + j];
        d_Dst[i * COLUMNS_BLOCKDIM_Y * pitch] = sum;
    }
}

Even More Fun

Some of that overhead can be avoided when the destination of the GPU's data is graphics
- Texture memory can be shared between general-purpose computations and normal rendering
- For post-processing effects or visualizing particles, the pixel/vertex data never needs to leave the GPU

Conclusions
Certain classes of problem appear in many different fields and involve very data-parallel operations such as filtering, sorting, or integration
By taking advantage of the architectural decisions behind graphics processing units, such as their multiprocessing and native vector operations, these problems can be solved quickly and cheaply

References
1. Ziegler, Gernot. Introduction to the CUDA Architecture. [Online] 2009. http://www.cse.scitech.ac.uk/disco/workshops/200907/Day1_01_Intro_CUDA_Architecture.pdf
2. NVIDIA Corporation. NVIDIA Compute PTX: Parallel Thread Execution ISA Version 1.1. 2007.
3. Göddeke, Dominik. Fast and Accurate Finite-Element Multigrid Solvers for PDE Simulations on GPU Clusters. Berlin: Logos Verlag, 2010. ISBN 978-3-8325-2768-6.
4. Stone, John E.; Phillips, James C.; Freddolino, Peter L.; Hardy, David J.; Trabuco, Leonardo G.; Schulten, Klaus. Accelerating molecular modeling applications with graphics processors. Journal of Computational Chemistry, 28:2618-2640, 2007.
5. Schatz, Michael C.; Trapnell, Cole; Delcher, Arthur L.; Varshney, Amitabh. High-throughput sequence alignment using Graphics Processing Units. BMC Bioinformatics, 2007.
6. ANSYS, Inc. ANSYS Unveils GPU Computing for Accelerated Engineering Simulations. [Online] 2010. http://investors.ansys.com/releasedetail.cfm?releaseid=509436
7. Warburton, Tim. Parallel Numerical Methods for Partial Differential Equations. Rocky Mountain Mathematics Consortium. [Online] 2008. http://www.caam.rice.edu/~timwar/RMMC/gpuDG.html
8. Ansorge, Richard. AIRWC: Accelerated Image Registration With CUDA. BSS Group, Cavendish Laboratory, University of Cambridge, UK. 2008.
9. Cornelis, N.; Van Gool, L. Fast Scale Invariant Feature Detection and Matching on Programmable Graphics Hardware. CVPR 2008 Workshop, 2008.
10. Di Blas, Andrea; Kaldewey, Tim. Data Monster: Why graphics processors will transform database processing. IEEE Spectrum. [Online] 2009. http://spectrum.ieee.org/computing/software/data-monster/0
11. Podlozhnyuk, Victor. Image Convolution with CUDA. [Online] 2007. http://developer.download.nvidia.com/compute/DevZone/C/html/C/src/convolutionSeparable/doc/convolutionSeparable.pdf
12. Goodnight, Nolan. CUDA/OpenGL Fluid Simulation. [Online] 2007. http://new.math.uiuc.edu/MA198-2008/schaber2/fluidsGL.pdf

Questions?
