
Applications of GPU Computing

Alex Karantza 0306-722 Advanced Computer Architecture Fall 2011

Outline
- Introduction
- GPU Architecture
  - Multiprocessing
  - Vector ISA
- GPUs in Industry
  - Scientific Computing
  - Image Processing
  - Databases
- Examples and Benefits

Introduction
"GPUs have evolved to the point where many real world applications are easily implemented on them and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs."

- Prof. Jack Dongarra, director of the Innovative Computing Laboratory at the University of Tennessee and author of LINPACK

GPU Architecture
(as typified by NVIDIA CUDA)

A parallel coprocessor to conventional CPUs
Implements a SIMD structure: many threads running the same code

Grid of Blocks of Threads
- Thread: local registers
- Block: local memory and control
- Grid: global memory

Grids, Blocks, and Threads
- Thread -> Thread Processor: local registers and memory; a scalar processor
- Thread Block -> Multiprocessor: shared memory and registers; shared control logic
- Grid -> Device(s): global memory; can be easily distributed across devices
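The mapping above can be sketched as a plain-Python CPU simulation (illustrative only, not CUDA; the kernel, names, and grid/block sizes here are made up for the example). Each (block, thread) pair computes one element, indexed the way a CUDA kernel computes `blockIdx.x * blockDim.x + threadIdx.x`:

```python
# Illustrative CPU simulation of a 1-D grid of blocks of threads.
# Each "thread" handles one element, using the same global index a
# CUDA kernel would: global_id = block_id * block_dim + thread_id.

def launch_kernel(kernel, grid_dim, block_dim, *args):
    for block_id in range(grid_dim):        # blocks run independently
        for thread_id in range(block_dim):  # threads within a block
            kernel(block_id, thread_id, block_dim, *args)

def scale_kernel(block_id, thread_id, block_dim, data, out, factor):
    gid = block_id * block_dim + thread_id  # global thread index
    if gid < len(data):                     # guard threads past the end
        out[gid] = data[gid] * factor

data = list(range(10))
out = [0] * len(data)
launch_kernel(scale_kernel, 3, 4, data, out, 2)  # 3 blocks of 4 threads
```

On a real device the two loops do not exist; every (block, thread) pair runs concurrently, which is why the guard against running past the end of the data is needed.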

GPU Architecture
Processors also implement vector instructions
- Vectors of length 2, 3, or 4 of any fundamental type: integer, float, bits, predicate
- Instructions for conversion between vector and scalar forms

To encourage uniform execution, conditionals use predicates rather than branches
- Any instruction can be conditionally executed based on predicate registers

Vectors and Predicates

.global .v4 .f32 V;    // a length-4 vector of floats
.shared .v2 .u16 uv;   // a length-2 vector of unsigned shorts
.global .v4 .b8  v;    // a length-4 vector of bytes

.reg .s32  a, b;       // two 32-bit signed ints
.reg .pred p;          // a predicate register

   setp.lt.s32 p, a, b;          // if a < b, set p
@p add.v4.f32  V, V, {1,0,0,0};  // if p, V.x = V.x + 1
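A rough Python model of what predication buys across SIMD lanes (an illustrative sketch, not full PTX semantics; the lane values are made up): every lane executes the add, and a per-lane predicate bit decides whether the result is committed, so no lane ever takes a branch.

```python
# Illustrative model of predicated execution across SIMD lanes:
# all lanes execute the instruction in lockstep; the predicate mask
# selects which lanes commit the result. No branches are taken.

def setp_lt(a, b):
    # like PTX setp.lt: one predicate bit per lane
    return [x < y for x, y in zip(a, b)]

def predicated_add(pred, vec, inc):
    # like "@p add ...": evaluated everywhere, written only where pred is set
    return [x + inc if p else x for p, x in zip(pred, vec)]

a = [1, 5, 2, 9]
b = [4, 3, 8, 0]
p = setp_lt(a, b)                            # per-lane comparison
v = predicated_add(p, [10, 10, 10, 10], 1)   # add lands only where p is set
```

Keeping all lanes on the same instruction stream is what lets the multiprocessor's shared control logic drive many scalar processors at once.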

(Image: NSF Keeneland cluster, 360 Tesla 20-series GPUs)

GPUs in Industry
Many applications have been developed to use GPUs for supercomputing in various fields
- Scientific Computing: CFD, molecular dynamics, genome sequencing, mechanical simulation, quantum electrodynamics
- Image Processing: registration, interpolation, feature detection, recognition, filtering
- Data Analysis: databases, sorting and searching, data mining

Major Categories of Algorithm
- 2D/3D filtering operations
- n-body simulations
- Parallel tree operations: searching and sorting
All suited to GPUs because of their data-parallel requirements and uniform kernels

Computational Fluid Dynamics
Simulates fluids in a discrete volume over time
- Involves iteratively solving the Navier-Stokes partial differential equations on a grid
- Can be considered a filtering operation
When parallelized on a GPU using multigrid solvers, 10x speedups have been reported
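To sketch why an iterative PDE solve looks like filtering, here is a minimal Jacobi relaxation for a 1-D Laplace problem in Python (an illustrative stand-in; real CFD solvers run multigrid over Navier-Stokes in 2-D/3-D, and the grid values below are made up). Each sweep applies a small fixed stencil at every cell, which is exactly a filtering pass, so all cells can be updated in parallel.

```python
# One Jacobi sweep for the 1-D Laplace equation: each interior cell
# becomes the average of its neighbors -- a 3-point stencil, i.e. a
# filtering pass. Cells are independent within a sweep.

def jacobi_sweep(u):
    new = u[:]                        # boundary values stay fixed
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new

def solve(u, sweeps):
    for _ in range(sweeps):
        u = jacobi_sweep(u)
    return u

# Fixed endpoints 0 and 1: the interior relaxes toward a straight line.
u = solve([0.0, 0.0, 0.0, 0.0, 1.0], sweeps=200)
```

Multigrid accelerates exactly this kind of sweep by relaxing on a hierarchy of coarser grids, but each level remains a data-parallel stencil.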

Molecular Dynamics
Large set of particles with forces between them
- protein behavior, material simulation
Calculating forces can be done in parallel for each particle
Accumulation of forces can be implemented as multilevel parallel sums
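Both halves of that statement can be sketched in Python (illustrative only; the toy 1-D inverse-square force and particle positions are made up, not a real molecular potential): the per-particle force loop is embarrassingly parallel, and each accumulation can itself be a tree-shaped parallel sum.

```python
# Illustrative n-body sketch: the force on each particle is computed
# independently (one thread per particle on a GPU), and each
# per-particle accumulation is a pairwise tree reduction.

def pair_force(xi, xj):
    # toy 1-D inverse-square attraction; stands in for a real potential
    d = xj - xi
    return d / (abs(d) ** 3) if d != 0 else 0.0

def tree_sum(vals):
    # pairwise reduction: the shape a GPU parallel sum takes,
    # halving the number of partial sums at every level
    while len(vals) > 1:
        vals = [vals[i] + (vals[i + 1] if i + 1 < len(vals) else 0.0)
                for i in range(0, len(vals), 2)]
    return vals[0] if vals else 0.0

def forces(xs):
    out = []
    for i, xi in enumerate(xs):        # each entry is independent
        contrib = [pair_force(xi, xj) for j, xj in enumerate(xs) if j != i]
        out.append(tree_sum(contrib))
    return out
```

The tree reduction matters because it finishes in O(log n) parallel steps, where a serial accumulation would take O(n).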

Genetics
Large strings of genome sequences must be searched through to organize and identify samples
GPUs enable multiple parallel queries against the database to perform string matching
Again, order-of-magnitude speedups have been reported

Electrodynamics
Simulation of electric fields and Coulomb forces
Requires iterative solving of partial differential equations
Cell phone modeling applications have reported 50x speedups using GPUs

Image Processing
Medical imaging was the early adopter
- Registration of massive 3D voxel images
- Both the cost function for deformable registration and the interpolation of results are filtering operations
Generic feature detection, recognition, and object extraction are all filters
For object recognition, one can search a database of objects in parallel
Moving these algorithms off the CPU can allow real-time interaction

Data Analysis
Huge databases for web services require instant results for many simultaneous users
- There is insufficient room in main memory; disk is too slow and doesn't allow parallel reads
- GPUs can split up the data and perform fast searches, each keeping its section in memory
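A minimal Python sketch of that partitioned-search idea (illustrative; the round-robin split, worker count, and predicate are made up): the table is split across workers, as it would be across GPU multiprocessors or devices, each scans only the rows it holds, and partial results are merged.

```python
# Illustrative partitioned search: split the table across workers,
# let each scan only its own resident partition, then merge.
# On a GPU all partitions would be scanned simultaneously.

def partition(rows, n):
    # round-robin split into n partitions
    return [rows[i::n] for i in range(n)]

def scan(part, predicate):
    # each worker filters only the rows it holds in its own memory
    return [row for row in part if predicate(row)]

def parallel_select(rows, predicate, n_workers=4):
    parts = partition(rows, n_workers)
    results = [scan(p, predicate) for p in parts]  # parallel on a GPU
    return sorted(r for res in results for r in res)

rows = list(range(100))
evens = parallel_select(rows, lambda r: r % 2 == 0)
```

Keeping each partition resident in device memory is what sidesteps the disk bottleneck described above.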

Example: Filtering Operation

Many algorithms can be reduced to a filtering operation. As an example, consider image convolution for blurring:

Kernel = Gaussian2D(size);
for (x,y) in Input {
    for (p,q) in Kernel {
        Output(x,y) += Input(x+p,y+q) * Kernel(p,q);
    }
}

Example: Filtering Operation

A quick optimization applies to the many filters that are separable: the filter can be done in one pass per dimension

Kernel = Gaussian1D(size);

for (x,y) in Input {
    for (p) in Kernel {
        Output(x,y) += Input(x+p,y) * Kernel(p);
    }
}

for (x,y) in Input {
    for (q) in Kernel {
        Output(x,y) += Input(x,y+q) * Kernel(q);
    }
}
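The two passes above can be written as runnable Python (an illustrative CPU version, not the GPU code; the 3-tap kernel stands in for `Gaussian1D(size)`, and zero padding is assumed at the borders): filtering rows and then columns with a 1-D kernel gives the same result as one pass with the 2-D outer-product kernel.

```python
# Separable convolution as two 1-D passes with zero padding,
# equivalent to convolving once with the 2-D outer-product kernel.

def conv_rows(img, k):
    h, w, r = len(img), len(img[0]), len(k) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for p in range(-r, r + 1):
                if 0 <= x + p < w:          # zero padding at the borders
                    out[y][x] += img[y][x + p] * k[p + r]
    return out

def conv_cols(img, k):
    # transpose, filter rows, transpose back
    t = [list(row) for row in zip(*img)]
    return [list(row) for row in zip(*conv_rows(t, k))]

def separable_blur(img, k):
    return conv_cols(conv_rows(img, k), k)
```

The equivalence follows from linearity: summing over p and then q is the same double sum as the 2-D loop, just reordered.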

Example: Filtering Operation

This is still O(2n²m) on a sequential processor (n×n image, m-tap kernel)
Each output pixel is independent, but shares spatially local data and a constant kernel

UploadGPU(Kernel, CONSTANT);
UploadGPU(Input, TEXTURE);
ConvolveColumnsGPU<blocks,threads>();
ConvolveRowsGPU<blocks,threads>();
DownloadGPU(Output, TEXTURE);

Example: Filtering Operation

The complexity remains the same; however, each multiply-accumulate instruction can be executed on as many processors as are available, and memory can be accessed quickly because of the assignment of blocks and texture memory

In practice, the overhead of uploading to and downloading from the GPU is far less than the performance gained in the kernel

Example: Filtering Operation

__global__ void convolutionColumnsKernel(
    float *d_Dst, float *d_Src,
    int imageW, int imageH, int pitch
){
    __shared__ float s_Data[COLUMNS_BLOCKDIM_X]
        [(COLUMNS_RESULT_STEPS + 2 * COLUMNS_HALO_STEPS) * COLUMNS_BLOCKDIM_Y + 1];

    // *snip* Populate s_Data from d_Src
    __syncthreads();

    #pragma unroll
    for(int i = COLUMNS_HALO_STEPS; i < COLUMNS_HALO_STEPS + COLUMNS_RESULT_STEPS; i++){
        float sum = 0;
        #pragma unroll
        for(int j = -KERNEL_RADIUS; j <= KERNEL_RADIUS; j++)
            sum += c_Kernel[KERNEL_RADIUS - j] *
                   s_Data[threadIdx.x][threadIdx.y + i * COLUMNS_BLOCKDIM_Y + j];
        d_Dst[i * COLUMNS_BLOCKDIM_Y * pitch] = sum;
    }
}

Even More Fun

Some of that overhead can be avoided when the destination of the GPU's data is graphics
- Texture memory can be shared between general-purpose computations and normal rendering
- For post-processing effects or visualizing particles, the pixel/vertex data never needs to leave the GPU

Conclusions
Certain classes of problem appear in many different fields and involve very data-parallel operations such as filtering, sorting, or integration
By taking advantage of the architectural decisions behind graphics processing units, such as their multiprocessing and native vector operations, these problems can be solved quickly and cheaply

References
1. Ziegler, Gernot. Introduction to the CUDA Architecture. [Online] 2009. http://www.cse.scitech.ac.uk/disco/workshops/200907/Day1_01_Intro_CUDA_Architecture.pdf
2. NVIDIA Corporation. NVIDIA Compute PTX: Parallel Thread Execution ISA Version 1.1. 2007.
3. Göddeke, Dominik. Fast and Accurate Finite-Element Multigrid Solvers for PDE Simulations on GPU Clusters. Berlin: Logos Verlag, 2010. ISBN 978-3-8325-2768-6.
4. Stone, John E.; Phillips, James C.; Freddolino, Peter L.; Hardy, David J.; Trabuco, Leonardo G.; Schulten, Klaus. Accelerating molecular modeling applications with graphics processors. Journal of Computational Chemistry, 28:2618-2640, 2007.
5. Schatz, Michael C.; Trapnell, Cole; Delcher, Arthur L.; Varshney, Amitabh. High-throughput sequence alignment using Graphics Processing Units. BMC Bioinformatics, 2007.
6. ANSYS, Inc. ANSYS Unveils GPU Computing for Accelerated Engineering Simulations. [Online] 2010. http://investors.ansys.com/releasedetail.cfm?releaseid=509436
7. Warburton, Tim. Parallel Numerical Methods for Partial Differential Equations. Rocky Mountain Mathematics Consortium. [Online] 2008. http://www.caam.rice.edu/~timwar/RMMC/gpuDG.html
8. Ansorge, Richard. AIRWC: Accelerated Image Registration With CUDA. BSS Group, Cavendish Laboratory, University of Cambridge, UK. 2008.
9. Cornelis, N.; Van Gool, L. Fast Scale Invariant Feature Detection and Matching on Programmable Graphics Hardware. CVPR 2008 Workshop, 2008.
10. Di Blas, Andrea; Kaldewey, Tim. Data Monster: Why graphics processors will transform database processing. IEEE Spectrum. [Online] 2009. http://spectrum.ieee.org/computing/software/data-monster/0
11. Podlozhnyuk, Victor. Image Convolution with CUDA. [Online] 2007. http://developer.download.nvidia.com/compute/DevZone/C/html/C/src/convolutionSeparable/doc/convolutionSeparable.pdf
12. Goodnight, Nolan. CUDA/OpenGL Fluid Simulation. [Online] 2007. http://new.math.uiuc.edu/MA198-2008/schaber2/fluidsGL.pdf

Questions?
