
CUDA Basics

Murphy Stein
New York University

Overview

Device Architecture
CUDA Programming Model
Matrix Transpose in CUDA
Further Reading

What is CUDA?
CUDA stands for:
Compute Unified Device Architecture

It is 2 things:
1. Device Architecture Specification
2. A small extension to C
   = New Syntax + Built-in Variables + Restrictions + Libraries

Device Architecture: Streaming Multiprocessor (SM)
1 SM contains 8 scalar cores

Up to 8 cores can run simultaneously
Each core executes an identical instruction set, or sleeps
SM schedules instructions across cores with 0 overhead
Up to 32 threads may be scheduled at a time, called a warp, but at most 24 warps are active in 1 SM
Thread-level memory sharing supported via Shared Memory
Register memory is local to a thread, and divided amongst all blocks on the SM

[Figure: SM block diagram: an Instruction Fetch/Dispatch unit feeding Streaming Cores #1 through #8, with Shared Memory (16 KB), Registers (8 KB), Texture Memory Cache (5-8 KB), and Constant Memory Cache (8 KB)]

Transparent Scalability
Hardware is free to assign blocks to any processor at any time
A kernel scales across any number of parallel processors
[Figure: a kernel grid of 8 blocks mapped onto two devices: a 2-SM device runs Block 0 through Block 7 two at a time, a 4-SM device runs them four at a time, progressing over time]

Each block can execute in any order relative to other blocks.

David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009

SM Warp Scheduling
SM hardware implements zero-overhead warp scheduling

[Figure: SM multithreaded warp scheduler issuing over time: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96]

Warps whose next instruction has its operands ready for consumption are eligible for execution
Eligible warps are selected for execution on a prioritized scheduling policy
All threads in a warp execute the same instruction when selected

4 clock cycles are needed to dispatch the same instruction for all threads in a warp in G80


If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate 200-cycle memory latency (each warp issues 4 instructions at 4 cycles each, i.e. 16 cycles of work, and 200 / 16 = 12.5, rounded up to 13)

David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009

Device Architecture

[Figure: device architecture: the Host talks to 1 GPU; a Block Execution Manager distributes blocks across Streaming Multiprocessors (SM) 1 through N; Constant Memory (64 KB), Texture Memory, and Global Memory (768 MB - 4 GB) are shared across the SMs]

C Extension
Consists of:

New Syntax and Built-in Variables
Restrictions to ANSI C
API / Libraries

C Extension: New Syntax

New Syntax:

<<<...>>>
__host__, __global__, __device__
__constant__, __shared__, __device__
__syncthreads()
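
A minimal sketch of how this syntax fits together (the kernel name scale, its parameters, and the launch configuration are illustrative, not from the slides):

#include <cuda_runtime.h>

// __global__ marks a function that runs on the device and is launched from the host.
__global__ void scale(float *data, float factor, int n)
{
    // __shared__ declares memory visible to every thread in the block.
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = data[i] * factor;

    // __syncthreads() is a barrier for all threads in the block.
    __syncthreads();

    if (i < n)
        data[i] = tile[threadIdx.x];
}

int main(void)
{
    const int n = 1024;
    float *d = NULL;
    cudaMalloc((void **)&d, n * sizeof(float));

    // <<<grid, block>>> is the launch configuration: 4 blocks of 256 threads each.
    scale<<<4, 256>>>(d, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}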

C Extension: Built-in Variables

Built-in Variables:

dim3 gridDim;
Dimensions of the grid in blocks (gridDim.z unused)

dim3 blockDim;
Dimensions of the block in threads

dim3 blockIdx;
Block index within the grid

dim3 threadIdx;
Thread index within the block
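
A sketch of how these variables combine to give each thread a unique 2-D position (the kernel name indexDemo is illustrative; the same pattern appears in the transpose kernel later in the deck):

__global__ void indexDemo(float *out, unsigned int width)
{
    // The block's position in the grid (blockIdx) times the block's shape
    // (blockDim), plus the thread's position inside the block (threadIdx),
    // gives this thread's global x/y coordinates.
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Row-major linear index into a width-wide 2-D array.
    out[y * width + x] = (float)(y * width + x);
}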

C Extension: Restrictions

New Restrictions:

No recursion in device code
No function pointers in device code
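
For example, a recursive helper has to be rewritten as a loop before it can run on the device; a hypothetical sketch:

// Not allowed in device code: recursion.
// __device__ int sumTo(int n) { return n == 0 ? 0 : n + sumTo(n - 1); }

// Allowed: the same computation expressed iteratively.
__device__ int sumTo(int n)
{
    int total = 0;
    for (int i = 1; i <= n; i++)
        total += i;
    return total;
}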

CUDA API

CUDA Runtime (Host and Device)
Device Memory Handling (cudaMalloc, ...)
Built-in Math Functions (sin, sqrt, mod, ...)
Atomic operations (for concurrency)
Data types (2D textures, dim2, dim3, ...)
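
A sketch of the typical host-side pattern built from these calls: allocate device memory, copy data in, launch a kernel that uses a built-in math function, copy the result back, and free (the kernel rootKernel and the sizes are illustrative):

#include <cuda_runtime.h>
#include <stdio.h>

// Device code may call built-in math functions such as sqrtf().
__global__ void rootKernel(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = sqrtf(v[i]);
}

int main(void)
{
    const int N = 256;
    const int SIZE = N * sizeof(float);
    float h[N];
    for (int i = 0; i < N; i++) h[i] = (float)i;

    // Device memory handling: allocate, copy in, run, copy out, free.
    float *d = NULL;
    cudaMalloc((void **)&d, SIZE);
    cudaMemcpy(d, h, SIZE, cudaMemcpyHostToDevice);
    rootKernel<<<N / 64, 64>>>(d, N);
    cudaMemcpy(h, d, SIZE, cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("sqrt of element 4: %f\n", h[4]);   // prints 2.000000
    return 0;
}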

Compiling a CUDA Program

[Figure: compilation flow: NVCC splits a C/C++ CUDA Application into plain C/C++ code, which gcc compiles into CPU instructions, and virtual PTX code, which the PTX-to-target compiler turns into GPU instructions]

Matrix Transpose

[Figure: the transpose maps element M_(i,j) to M_(j,i)]

Matrix Transpose

[Figure: transposing a matrix partitioned into four blocks:
A C        A B
B D  -->   C D ]

Matrix Transpose: First idea

Each thread block transposes an equal-sized block of matrix M
Assume M is square (n x n)
What is a good block size?
CUDA places limitations on the number of threads per block
512 threads per block is the maximum allowed by CUDA
[Figure: Matrix M, n x n]

Matrix Transpose: First idea
#include <stdio.h>
#include <stdlib.h>

// Each thread writes one element of the output at its transposed position.
__global__ void transpose(float *in, float *out, unsigned int width)
{
    unsigned int tx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int ty = blockIdx.y * blockDim.y + threadIdx.y;
    out[tx * width + ty] = in[ty * width + tx];
}

int main(int args, char **vargs)
{
    const int HEIGHT = 1024;
    const int WIDTH  = 1024;
    const int SIZE   = WIDTH * HEIGHT * sizeof(float);

    // 16 x 16 = 256 threads per block, under the 512-thread limit
    dim3 bDim(16, 16);
    dim3 gDim(WIDTH / bDim.x, HEIGHT / bDim.y);

    // Fill the host matrix
    float *M = (float *)malloc(SIZE);
    for (int i = 0; i < HEIGHT * WIDTH; i++) { M[i] = i; }

    // Copy the input to the device and allocate space for the output
    float *Md = NULL;
    cudaMalloc((void **)&Md, SIZE);
    cudaMemcpy(Md, M, SIZE, cudaMemcpyHostToDevice);
    float *Bd = NULL;
    cudaMalloc((void **)&Bd, SIZE);

    // Launch the kernel and copy the result back to the host
    transpose<<<gDim, bDim>>>(Md, Bd, WIDTH);
    cudaMemcpy(M, Bd, SIZE, cudaMemcpyDeviceToHost);

    cudaFree(Md);
    cudaFree(Bd);
    free(M);
    return 0;
}

Further Reading

Online Course:
UIUC NVIDIA Programming Course by David Kirk and Wen-mei W. Hwu
http://courses.ece.illinois.edu/ece498/al/Syllabus.html

CUDA@MIT '09:
http://sites.google.com/site/cudaiap2009/materials1/lectures

Great Memory Latency Study:
"LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs" by Vasily Volkov & James Demmel

Book of advanced examples:
GPU Gems 3, edited by Hubert Nguyen

CUDA SDK:
Tons of source code examples available for download from NVIDIA's website
