
CUDA

Advanced Memory Usage and Optimization


Yukai Hung
a0934147@gmail.com
Department of Mathematics
National Taiwan University

Register as Cache?

Volatile Qualifier
! Volatile qualifier
__global__ void kernelFunc(int* result)
{
  int temp1;
  int temp2;
  if(threadIdx.x<warpSize)
  {
    temp1=array[threadIdx.x];  //identical reads: the compiler
    array[threadIdx.x+1]=2;    //optimizes the second read away

    temp2=array[threadIdx.x];
    result[threadIdx.x]=temp1*temp2;
  }
}

Volatile Qualifier
! Volatile qualifier
__global__ void kernelFunc(int* result)
{
  int temp1;
  int temp2;
  if(threadIdx.x<warpSize)
  {
    //what the compiler actually generates: one read, reused twice
    int temp=array[threadIdx.x];
    temp1=temp; array[threadIdx.x+1]=2;
    temp2=temp; result[threadIdx.x]=temp1*temp2;
  }
}

Volatile Qualifier
! Volatile qualifier
__global__ void kernelFunc(int* result)
{
  int temp1;
  int temp2;
  if(threadIdx.x<warpSize)
  {
    temp1=array[threadIdx.x]*1;
    array[threadIdx.x+1]=2;
    __syncthreads();  //barrier prevents the two reads from being merged
    temp2=array[threadIdx.x]*2;
    result[threadIdx.x]=temp1*temp2;
  }
}

Volatile Qualifier
! Volatile qualifier
__global__ void kernelFunc(int* result)
{
  volatile int temp1;
  volatile int temp2;
  if(threadIdx.x<warpSize)
  {
    temp1=array[threadIdx.x]*1;  //volatile forces both reads
    array[threadIdx.x+1]=2;      //to actually be issued
    temp2=array[threadIdx.x]*2;
    result[threadIdx.x]=temp1*temp2;
  }
}
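The snippets above read from an array declared elsewhere. A minimal self-contained sketch, assuming the array is a __shared__ buffer and placing the volatile qualifier on the array itself (the name, size, and 64-thread block are illustrative):

```cuda
//sketch: volatile on the shared array forces every read to return
//to shared memory instead of reusing a cached register value;
//assumed launched with 64 threads per block
__global__ void kernelFunc(int* result)
{
    __shared__ volatile int array[64];
    array[threadIdx.x]=threadIdx.x;
    __syncthreads();
    if(threadIdx.x<warpSize)
    {
        int temp1=array[threadIdx.x];  //first read
        array[threadIdx.x+1]=2;        //may change a neighboring value
        int temp2=array[threadIdx.x];  //re-read is not optimized away
        result[threadIdx.x]=temp1*temp2;
    }
}
```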

Data Prefetch

Data Prefetch
! Hide memory latency by overlapping loading and computing
- double buffer is a traditional software pipeline technique
[figure: tiled matrix-matrix multiplication — load the blue tile of Md and Nd into shared memory, compute the Pdsub tile of Pd from shared memory while loading the next tile]

Data Prefetch
! Hide memory latency by overlapping loading and computing
- double buffer is a traditional software pipeline technique
for loop
{
load data from global to shared memory
synchronize block
compute data in the shared memory
synchronize block
}

Data Prefetch
! Hide memory latency by overlapping loading and computing
- double buffer is a traditional software pipeline technique
load data from global memory to registers
for loop
{
store data from registers to shared memory
synchronize block
load next data from global memory to registers
compute data in the shared memory
synchronize block
}
- very small overhead: computing and loading overlap
- both memories are very fast; registers and shared memory are independent
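The pipeline above can be sketched for a simplified tiled kernel (all names and the tile width are illustrative, and the inner compute is a placeholder, not a full matrix multiply):

```cuda
#define TILE 16
//sketch of the register double-buffer pipeline: registers hold the
//next tile while shared memory holds the one being computed on;
//assumes width is a multiple of TILE and a TILExTILE thread block
__global__ void prefetchKernel(float* Md,float* Pd,int width)
{
    __shared__ float tile[TILE][TILE];
    int row=blockIdx.y*TILE+threadIdx.y;
    int col=blockIdx.x*TILE+threadIdx.x;
    float sum=0.0f;
    //load the first tile from global memory into a register
    float next=Md[row*width+threadIdx.x];
    for(int m=0;m<width/TILE;m++)
    {
        //store prefetched data from register to shared memory
        tile[threadIdx.y][threadIdx.x]=next;
        __syncthreads();
        //load the next tile while computing on the current one
        if(m+1<width/TILE)
            next=Md[row*width+(m+1)*TILE+threadIdx.x];
        for(int k=0;k<TILE;k++)
            sum+=tile[threadIdx.y][k];  //placeholder compute on the tile
        __syncthreads();
    }
    Pd[row*width+col]=sum;
}
```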


Data Prefetch
! Matrix-matrix multiplication
[figure: tiled matrix-matrix multiplication kernel with data prefetch]

Constant Memory

Constant Memory
! Where is constant memory?
- data is stored in the device global memory
- read data through the multiprocessor constant cache
- 64KB constant memory and 8KB cache for each multiprocessor
! How about the performance?
- optimized when a warp of threads reads the same location
- 4 bytes per cycle, broadcast to the whole warp of threads
- serialized when a warp of threads reads different locations
- very slow on a cache miss (read data from global memory)
- access latency can range from one to hundreds of clock cycles


Constant Memory
! How to use constant memory?
- declare constant memory at file scope (global variable)
- copy data to constant memory from the host (because it is constant!!)
//declare constant memory
__constant__ float cst_ptr[size];
//copy data from host to constant memory
cudaMemcpyToSymbol(cst_ptr,host_ptr,data_size);


Constant Memory
//declare constant memory
__constant__ float cangle[360];
int main(int argc,char** argv)
{
int size=3200;
float* darray;
float hangle[360];
//allocate device memory
cudaMalloc((void**)&darray,sizeof(float)*size);
//initialize allocated memory
cudaMemset(darray,0,sizeof(float)*size);
//initialize angle array on host
for(int loop=0;loop<360;loop++)
hangle[loop]=acos(-1.0f)*loop/180.0f;
//copy host angle data to constant memory
cudaMemcpyToSymbol(cangle,hangle,sizeof(float)*360);


Constant Memory
//execute device kernel
test_kernel<<<size/64,64>>>(darray);
//free device memory
cudaFree(darray);

return 0;
}

__global__ void test_kernel(float* darray)
{
int index;
//calculate each thread global index
index=blockIdx.x*blockDim.x+threadIdx.x;
#pragma unroll 10
for(int loop=0;loop<360;loop++)
darray[index]=darray[index]+cangle[loop];

return;
}


Texture Memory

Texture Memory
! Texture mapping
[figure: texture mapping of an image onto geometry]

Texture Memory
! Texture filtering
nearest-neighborhood interpolation
[figure: nearest-neighborhood interpolation]

Texture Memory
! Texture filtering
linear/bilinear/trilinear interpolation
[figure: linear interpolation between neighboring texels]

Texture Memory
! Texture filtering
two times bilinear interpolation
[figure: trilinear interpolation performed as two bilinear interpolations]
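The filtering modes above reduce to nested linear interpolations. A sketch of what the texture filtering units compute (the helper names are my own, not a CUDA API):

```cuda
//sketch: bilinear interpolation of four neighboring texels,
//the operation the texture filtering units perform in hardware
__host__ __device__ float lerp1(float a,float b,float t)
{
    return a+t*(b-a);
}
__host__ __device__ float bilerp(float v00,float v10,float v01,float v11,
                                 float tx,float ty)
{
    //two linear interpolations in x, then one in y
    return lerp1(lerp1(v00,v10,tx),lerp1(v01,v11,tx),ty);
}
//trilinear interpolation is two bilinear interpolations (the front and
//back texel planes) followed by one linear interpolation in z
```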

Texture Memory
[figure: GeForce 8 series block diagram — host, input assembler, vertex/pixel thread issue, work distribution, streaming processors (SP), texture filtering units (TF), L1/L2 caches, framebuffer (FB); the TF units perform graphical texture operations]

Texture Memory
- two SMs cooperate as a texture processing cluster
- scalable units on graphics
- the texture-specific unit is only available for texture

Texture Memory
texture-specific unit:
- texture address units compute texture addresses
- texture filtering units compute data interpolation
- read-only texture L1 cache

Texture Memory
[figure: block diagram highlighting the texture caches — a read-only texture L2 cache shared by all TPCs and a read-only texture L1 cache in each TPC]

Texture Memory
[figure: die layout with the texture-specific units highlighted]

Texture Memory
! Texture is an object for reading data
- data is stored in the device global memory
- global memory is bound with the texture cache
[figure: block diagram — global memory accessed through the texture caches]

What are the advantages of texture?

Texture Memory
! Data caching
- helpful when global memory coalescing is the main bottleneck
[figure: block diagram — texture reads are cached in the L1/L2 texture caches]

Texture Memory
! Data filtering
- supports linear/bilinear and trilinear hardware interpolation
- the texture-specific unit performs intrinsic interpolation
cudaFilterModePoint
cudaFilterModeLinear

Texture Memory
! Access modes
- clamp and wrap memory accessing for out-of-bound addresses
cudaAddressModeWrap (wrap boundary)
cudaAddressModeClamp (clamp boundary)

Texture Memory
! Bound to linear memory
- only supports 1-dimension problems
- only gets the benefit of the texture cache
- does not support addressing modes and filtering
! Bound to cuda array
- supports float addressing
- supports addressing modes
- supports hardware interpolation
- supports 1/2/3-dimension problems

Texture Memory
! Host code
- allocate global linear memory or cuda array
- create and set the texture reference at file scope
- bind the texture reference to the allocated memory
- unbind the texture reference to free cache resources
! Device code
- fetch data by indicating the texture reference
- fetch data by using the texture fetch functions

Texture Memory
! Texture memory constraints

Texture type               Compute capability 1.3   Compute capability 2.0
1D texture linear memory   8192                     32768
1D texture cuda array      1024x128
2D texture cuda array      (65536,32768)            (65536,65536)
3D texture cuda array      (2048,2048,2048)         (4096,4096,4096)

Texture Memory
! Measuring texture cache miss or hit numbers
- the latest visual profiler can count cache misses and hits
- needs device compute capability higher than 1.2

Example: 1-dimension linear memory

Texture Memory
//declare texture reference
texture<float,1,cudaReadModeElementType> texreference;
int main(int argc,char** argv)
{
int size=3200;
float* harray;
float* diarray;
float* doarray;
//allocate host and device memory
harray=(float*)malloc(sizeof(float)*size);
cudaMalloc((void**)&diarray,sizeof(float)*size);
cudaMalloc((void**)&doarray,sizeof(float)*size);
//initialize host array before usage
for(int loop=0;loop<size;loop++)
harray[loop]=(float)rand()/(float)(RAND_MAX-1);
//copy array from host to device memory
cudaMemcpy(diarray,harray,sizeof(float)*size,cudaMemcpyHostToDevice);


Texture Memory
//bind texture reference with linear memory
cudaBindTexture(0,texreference,diarray,sizeof(float)*size);
//execute device kernel
kernel<<<(int)ceil((float)size/64),64>>>(doarray,size);
//unbind texture reference to free resource
cudaUnbindTexture(texreference);
//copy result array from device to host memory
cudaMemcpy(harray,doarray,sizeof(float)*size,cudaMemcpyDeviceToHost);
//free host and device memory
free(harray);
cudaFree(diarray);
cudaFree(doarray);
return 0;
}


Texture Memory
__global__ void kernel(float* doarray,int size)
{
int index;
//calculate each thread global index
index=blockIdx.x*blockDim.x+threadIdx.x;
//fetch global memory through texture reference
doarray[index]=tex1Dfetch(texreference,index);
return;
}

Texture Memory
__global__ void offsetCopy(float* idata,float* odata,int offset)
{
//compute each thread global index
int index=blockIdx.x*blockDim.x+threadIdx.x;
//copy data from global memory (uncoalesced when offset is nonzero)
odata[index]=idata[index+offset];
}

Texture Memory
__global__ void offsetCopy(float* idata,float* odata,int offset)
{
//compute each thread global index
int index=blockIdx.x*blockDim.x+threadIdx.x;
//copy data through the texture cache instead of plain global reads
odata[index]=tex1Dfetch(texreference,index+offset);
}

Example: 2-dimension cuda array

Texture Memory
#define size 3200
//declare texture reference
texture<float,2,cudaReadModeElementType> texreference;
int main(int argc,char** argv)
{
dim3 blocknum;
dim3 blocksize;
int bytes;
float* hmatrix;
float* dmatrix;
cudaArray* carray;
cudaChannelFormatDesc channel;
//allocate host and device memory
hmatrix=(float*)malloc(sizeof(float)*size*size);
cudaMalloc((void**)&dmatrix,sizeof(float)*size*size);
//initialize host matrix before usage
for(int loop=0;loop<size*size;loop++)
hmatrix[loop]=(float)rand()/(float)(RAND_MAX-1);


Texture Memory
//create channel to describe data type
channel=cudaCreateChannelDesc<float>();
//allocate device memory for cuda array
cudaMallocArray(&carray,&channel,size,size);
//copy matrix from host to device memory
bytes=sizeof(float)*size*size;
cudaMemcpyToArray(carray,0,0,hmatrix,bytes,cudaMemcpyHostToDevice);
//set texture filter mode property
//use cudaFilterModePoint or cudaFilterModeLinear
texreference.filterMode=cudaFilterModePoint;
//set texture address mode property
//use cudaAddressModeClamp or cudaAddressModeWrap
texreference.addressMode[0]=cudaAddressModeWrap;
texreference.addressMode[1]=cudaAddressModeClamp;


Texture Memory
//bind texture reference with cuda array
cudaBindTextureToArray(texreference,carray);
blocksize.x=16;
blocksize.y=16;
blocknum.x=(int)ceil((float)size/16);
blocknum.y=(int)ceil((float)size/16);
//execute device kernel
kernel<<<blocknum,blocksize>>>(dmatrix,size);
//unbind texture reference to free resource
cudaUnbindTexture(texreference);
//copy result matrix from device to host memory
cudaMemcpy(hmatrix,dmatrix,bytes,cudaMemcpyDeviceToHost);
//free host and device memory
free(hmatrix);
cudaFree(dmatrix);
cudaFreeArray(carray);

return 0;
}

Texture Memory
__global__ void kernel(float* dmatrix,int size)
{
int xindex;
int yindex;
//calculate each thread global index
xindex=blockIdx.x*blockDim.x+threadIdx.x;
yindex=blockIdx.y*blockDim.y+threadIdx.y;
//fetch cuda array through texture reference
dmatrix[yindex*size+xindex]=tex2D(texreference,xindex,yindex);
return;
}


Example: 3-dimension cuda array

Texture Memory
#define size 256
//declare texture reference
texture<float,3,cudaReadModeElementType> texreference;
int main(int argc,char** argv)
{
dim3 blocknum;
dim3 blocksize;
float* hmatrix;
float* dmatrix;
cudaArray* cudaarray;
cudaExtent volumesize;
cudaChannelFormatDesc channel;
cudaMemcpy3DParms copyparms={0};
//allocate host and device memory
hmatrix=(float*)malloc(sizeof(float)*size*size*size);
cudaMalloc((void**)&dmatrix,sizeof(float)*size*size*size);


Texture Memory
//initialize host matrix before usage
for(int loop=0;loop<size*size*size;loop++)
hmatrix[loop]=(float)rand()/(float)(RAND_MAX-1);
//set cuda array volume size
volumesize=make_cudaExtent(size,size,size);
//create channel to describe data type
channel=cudaCreateChannelDesc<float>();
//allocate device memory for cuda array
cudaMalloc3DArray(&cudaarray,&channel,volumesize);
//set cuda array copy parameters
copyparms.extent=volumesize;
copyparms.dstArray=cudaarray;
copyparms.kind=cudaMemcpyHostToDevice;
copyparms.srcPtr=
make_cudaPitchedPtr((void*)hmatrix,sizeof(float)*size,size,size);
cudaMemcpy3D(&copyparms);


Texture Memory
//set texture filter mode property
//use cudaFilterModePoint or cudaFilterModeLinear
texreference.filterMode=cudaFilterModePoint;
//set texture address mode property
//use cudaAddressModeClamp or cudaAddressModeWrap
texreference.addressMode[0]=cudaAddressModeWrap;
texreference.addressMode[1]=cudaAddressModeWrap;
texreference.addressMode[2]=cudaAddressModeClamp;
//bind texture reference with cuda array
cudaBindTextureToArray(texreference,cudaarray,channel);
blocksize.x=8;
blocksize.y=8;
blocksize.z=8;
blocknum.x=(int)ceil((float)size/8);
blocknum.y=(int)ceil((float)size/8);
//execute device kernel
kernel<<<blocknum,blocksize>>>(dmatrix,size);


Texture Memory
//unbind texture reference to free resource
cudaUnbindTexture(texreference);
//copy result matrix from device to host memory
cudaMemcpy(hmatrix,dmatrix,sizeof(float)*size*size*size,cudaMemcpyDeviceToHost);
//free host and device memory
free(hmatrix);
cudaFree(dmatrix);
cudaFreeArray(cudaarray);

return 0;
}

Texture Memory
__global__ void kernel(float* dmatrix,int size)
{
int loop;
int xindex;
int yindex;
int zindex;
//calculate each thread global index
xindex=threadIdx.x+blockIdx.x*blockDim.x;
yindex=threadIdx.y+blockIdx.y*blockDim.y;
for(loop=0;loop<size;loop++)
{
zindex=loop;
//fetch cuda array via texture reference
dmatrix[zindex*size*size+yindex*size+xindex]=
tex3D(texreference,xindex,yindex,zindex);
}
return;
}

Performance comparison: image projection

Texture Memory
[figure: image projection / ray casting — global memory accessing is very close to random; the intrinsic interpolation units are very powerful; trilinear interpolation uses the nearby 8 pixels]

Texture Memory
object size 512 x 512 x 512 / ray number 512 x 512

Method                              Time    Speedup
global                              1.891   -
global/locality                     0.198   9.5
texture/point                       0.072   26.2
texture/linear                      0.037   51.1
texture/linear/locality             0.012   157.5
texture/linear/locality/fast math   0.011   171.9

Why is texture memory so powerful?

Texture Memory
! CUDA array is reordered to something like a space-filling Z-order
- the software driver supports reordering the data
- the hardware supports this spatial memory layout
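A sketch of such a Z-order (Morton) index shows why 2D-local accesses become cache-friendly; the actual driver layout is undocumented, so this is illustrative only:

```cuda
//sketch of a space-filling Z-order (Morton) index: interleave the
//bits of x and y so that texels that are close in 2D end up close
//in linear memory, improving texture cache hit rates
__host__ __device__ unsigned int mortonIndex(unsigned int x,unsigned int y)
{
    unsigned int index=0;
    for(int bit=0;bit<16;bit++)
    {
        index|=((x>>bit)&1u)<<(2*bit);    //x bits go to even positions
        index|=((y>>bit)&1u)<<(2*bit+1);  //y bits go to odd positions
    }
    return index;
}
//example: mortonIndex(2,3)=14 and mortonIndex(3,3)=15,
//so the 2D neighbors (2,3) and (3,3) are adjacent in memory
```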

Why is the texture cache read-only?

Texture Memory
! The texture cache cannot detect dirty data
[figure: float array in host memory — data loaded from memory to the cache can be modified by other threads; with lazy update for write-back, operations performed on the cache require a reload from memory to stay coherent]

Texture Memory
! Write data to global memory directly without the texture cache
- only suitable for global linear memory, not cuda array
tex1Dfetch(texreference,index)  //read data through the texture cache
darray[index]=value;            //write data to global memory directly
- the texture cache may not be updated by the write
[figure: float array in device memory — reads go through the texture cache, writes bypass it]
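A sketch of this read-through-texture, write-direct pattern (assuming the reference was bound with cudaBindTexture as in the earlier 1D example; the kernel name and scale factor are illustrative):

```cuda
//file-scope texture reference, assumed bound to darray by the host
texture<float,1,cudaReadModeElementType> texreference;

//safe as long as no thread reads a location another thread has
//written in the same kernel launch (the cache would be stale)
__global__ void scaleKernel(float* darray,int size,float factor)
{
    int index=blockIdx.x*blockDim.x+threadIdx.x;
    if(index<size)
    {
        float value=tex1Dfetch(texreference,index);  //read via cache
        darray[index]=value*factor;                  //write bypasses cache
    }
}
```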

How about the texture data locality?

Texture Memory
Why does CUDA distribute the work blocks in the horizontal direction?
- all blocks get scheduled round-robin based on the number of shaders

Texture Memory
- load balancing on SMs: suppose consecutive blocks have very similar work loads
- overall texture cache data locality: suppose consecutive blocks use similar nearby data

Texture Memory
- reorder the block indexing into Z-order to take advantage of the texture L1 cache

Texture Memory
! Concurrent execution for independent units
- streaming processors: temp1=a/b+sin(c)
- special function units: temp2[loop]=__cos(d)
- texture operation units: temp3=tex2D(ref,x,y)

Texture Memory

Memory       Location   Cache   Speed            Access
global       off-chip   no      hundreds         all threads
constant     off-chip   yes     one ~ hundreds   all threads
texture      off-chip   yes     one ~ hundreds   all threads
shared       on-chip    -       one              block threads
local        off-chip   no      very slow        single thread
register     on-chip    -       one              single thread
instruction  off-chip   yes     -                invisible

Texture Memory

Memory     Read/Write   Property
global     read/write   input or output
constant   read         no structure
texture    read         locality structure
shared     read/write   shared within block
local      read/write   -
register   read/write   local temp variable

! Reference
- Mark Harris http://www.markmark.net/
- Wei-Chao Chen http://www.cs.unc.edu/~ciao/
- Wen-Mei Hwu http://impact.crhc.illinois.edu/people/current/hwu.php
