
CUDA Basics

Murphy Stein
New York University

Overview

Device Architecture
CUDA Programming Model
Matrix Transpose in CUDA
Further Reading

What is CUDA?
CUDA stands for:
Compute Unified Device Architecture

It is 2 things:
1. Device Architecture Specification
2. A small extension to C
   = New Syntax + Built-in Variables + Restrictions + Libraries

Device Architecture: Streaming Multiprocessor (SM)
1 SM contains 8 scalar cores

Up to 8 cores can run simultaneously
Each core executes an identical instruction set, or sleeps
SM schedules instructions across cores with 0 overhead
Up to 32 threads may be scheduled at a time, called a warp, but at most 24 warps are active in 1 SM
Thread-level memory sharing supported via Shared Memory
Register memory is local to a thread, and divided amongst all blocks on the SM

[Figure: SM block diagram: an Instruction Fetch/Dispatch unit feeding Streaming Cores #1 through #8, with Shared Memory (16 KB), Registers (8 KB), Texture Memory Cache (5-8 KB), and Constant Memory Cache (8 KB)]

Transparent Scalability
Hardware is free to assign blocks to any processor at any time
A kernel scales across any number of parallel processors
[Figure: a kernel grid of 8 blocks mapped onto two devices: a 2-SM device runs Block 0 through Block 7 two at a time, a 4-SM device runs them four at a time, progressing over time]

Each block can execute in any order relative to other blocks.

David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009

SM Warp Scheduling
SM hardware implements zero-overhead warp scheduling

[Figure: SM multithreaded warp scheduler issuing over time: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96]

Warps whose next instruction has its operands ready for consumption are eligible for execution
Eligible warps are selected for execution on a prioritized scheduling policy
All threads in a warp execute the same instruction when selected

4 clock cycles are needed to dispatch the same instruction for all threads in a warp in G80


If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate 200-cycle memory latency (each warp issues 4 instructions at 4 cycles each, i.e. 16 cycles of work, and 200 / 16 = 12.5, rounded up to 13)

David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009

Device Architecture

[Figure: device architecture: the Host talks to 1 GPU; a Block Execution Manager distributes blocks across Streaming Multiprocessors (SM) 1 through N; Constant Memory (64 KB), Texture Memory, and Global Memory (768 MB - 4 GB) are shared across the SMs]

C Extension
Consists of:

New Syntax and Built-in Variables
Restrictions to ANSI C
API / Libraries

C Extension: New Syntax

New Syntax:

<<<...>>>
__host__, __global__, __device__
__constant__, __shared__, __device__
__syncthreads()
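
A minimal sketch of how this syntax fits together (the kernel name scale, its parameters, and the launch configuration are illustrative, not from the slides):

#include <cuda_runtime.h>

// __global__ marks a function that runs on the device and is launched from the host.
__global__ void scale(float *data, float factor, int n)
{
    // __shared__ declares memory visible to every thread in the block.
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = data[i] * factor;

    // __syncthreads() is a barrier for all threads in the block.
    __syncthreads();

    if (i < n)
        data[i] = tile[threadIdx.x];
}

int main(void)
{
    const int n = 1024;
    float *d = NULL;
    cudaMalloc((void **)&d, n * sizeof(float));

    // <<<grid, block>>> is the launch configuration: 4 blocks of 256 threads each.
    scale<<<4, 256>>>(d, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}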

C Extension: Built-in Variables

Built-in Variables:

dim3 gridDim;
Dimensions of the grid in blocks (gridDim.z unused)

dim3 blockDim;
Dimensions of the block in threads

dim3 blockIdx;
Block index within the grid

dim3 threadIdx;
Thread index within the block
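
A sketch of how these variables combine to give each thread a unique 2-D position (the kernel name indexDemo is illustrative; the same pattern appears in the transpose kernel later in the deck):

__global__ void indexDemo(float *out, unsigned int width)
{
    // The block's position in the grid (blockIdx) times the block's shape
    // (blockDim), plus the thread's position inside the block (threadIdx),
    // gives this thread's global x/y coordinates.
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Row-major linear index into a width-wide 2-D array.
    out[y * width + x] = (float)(y * width + x);
}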

C Extension: Restrictions

New Restrictions:

No recursion in device code
No function pointers in device code
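
For example, a recursive helper has to be rewritten as a loop before it can run on the device; a hypothetical sketch:

// Not allowed in device code: recursion.
// __device__ int sumTo(int n) { return n == 0 ? 0 : n + sumTo(n - 1); }

// Allowed: the same computation expressed iteratively.
__device__ int sumTo(int n)
{
    int total = 0;
    for (int i = 1; i <= n; i++)
        total += i;
    return total;
}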

CUDA API

CUDA Runtime (Host and Device)
Device Memory Handling (cudaMalloc, ...)
Built-in Math Functions (sin, sqrt, mod, ...)
Atomic operations (for concurrency)
Data types (2D textures, dim2, dim3, ...)
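
A sketch of the typical host-side pattern built from these calls: allocate device memory, copy data in, launch a kernel that uses a built-in math function, copy the result back, and free (the kernel rootKernel and the sizes are illustrative):

#include <cuda_runtime.h>
#include <stdio.h>

// Device code may call built-in math functions such as sqrtf().
__global__ void rootKernel(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = sqrtf(v[i]);
}

int main(void)
{
    const int N = 256;
    const int SIZE = N * sizeof(float);
    float h[N];
    for (int i = 0; i < N; i++) h[i] = (float)i;

    // Device memory handling: allocate, copy in, run, copy out, free.
    float *d = NULL;
    cudaMalloc((void **)&d, SIZE);
    cudaMemcpy(d, h, SIZE, cudaMemcpyHostToDevice);
    rootKernel<<<N / 64, 64>>>(d, N);
    cudaMemcpy(h, d, SIZE, cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("sqrt of element 4: %f\n", h[4]);   // prints 2.000000
    return 0;
}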

Compiling a CUDA Program

[Figure: compilation flow: NVCC splits a C/C++ CUDA Application into plain C/C++ code, which gcc compiles into CPU instructions, and virtual PTX code, which the PTX-to-target compiler turns into GPU instructions]

Matrix Transpose

[Figure: the transpose maps element M_(i,j) to M_(j,i)]

Matrix Transpose

[Figure: transposing a matrix partitioned into four blocks:
A C        A B
B D  -->   C D ]

Matrix Transpose: First idea

Each thread block transposes an equal-sized block of matrix M
Assume M is square (n x n)
What is a good block size?
CUDA places limitations on the number of threads per block
512 threads per block is the maximum allowed by CUDA
[Figure: Matrix M, n x n]

Matrix Transpose: First idea
#include <stdio.h>
#include <stdlib.h>

// Each thread writes one element of the output at its transposed position.
__global__ void transpose(float *in, float *out, unsigned int width)
{
    unsigned int tx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int ty = blockIdx.y * blockDim.y + threadIdx.y;
    out[tx * width + ty] = in[ty * width + tx];
}

int main(int args, char **vargs)
{
    const int HEIGHT = 1024;
    const int WIDTH  = 1024;
    const int SIZE   = WIDTH * HEIGHT * sizeof(float);

    // 16 x 16 = 256 threads per block, under the 512-thread limit
    dim3 bDim(16, 16);
    dim3 gDim(WIDTH / bDim.x, HEIGHT / bDim.y);

    // Fill the host matrix
    float *M = (float *)malloc(SIZE);
    for (int i = 0; i < HEIGHT * WIDTH; i++) { M[i] = i; }

    // Copy the input to the device and allocate space for the output
    float *Md = NULL;
    cudaMalloc((void **)&Md, SIZE);
    cudaMemcpy(Md, M, SIZE, cudaMemcpyHostToDevice);
    float *Bd = NULL;
    cudaMalloc((void **)&Bd, SIZE);

    // Launch the kernel and copy the result back to the host
    transpose<<<gDim, bDim>>>(Md, Bd, WIDTH);
    cudaMemcpy(M, Bd, SIZE, cudaMemcpyDeviceToHost);

    cudaFree(Md);
    cudaFree(Bd);
    free(M);
    return 0;
}

Further Reading

Online Course:
UIUC NVIDIA Programming Course by David Kirk and Wen-mei W. Hwu
http://courses.ece.illinois.edu/ece498/al/Syllabus.html

CUDA@MIT '09:
http://sites.google.com/site/cudaiap2009/materials1/lectures

Great Memory Latency Study:
"LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs" by Vasily Volkov & James Demmel

Book of advanced examples:
GPU Gems 3, edited by Hubert Nguyen

CUDA SDK:
Tons of source code examples available for download from NVIDIA's website
