
Developing Kernels: Part 2

Algorithm Considerations, Multi-Kernel Programs and Optimization

Steven Gratton, Institute of Astronomy

Want to break the problem up into multiple steps, each of which could do with a large number of blocks and threads within blocks thrown at it

How do we synchronize between steps? Remember, __syncthreads() only works within a block. Key point: global memory persists on the GPU after a kernel finishes, and a new kernel isn't started until pending memory writes complete.

Global Synchronization

So, enforce global synchronization by stopping one kernel and starting another one!

A bit drastic, but it works! There is an O(10^-5 s) kernel launch overhead, and shared memory is erased between kernels.

No problem as long as you dont have to do it too often!
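As a concrete (made-up) illustration of the idea: two back-to-back launches in which the second kernel reads what the first wrote to global memory. Launches issued to the same stream run in order, so the kernel boundary itself is the synchronization point. The step1/step2 kernels below are hypothetical, for illustration only:

#include <cuda_runtime.h>

__global__ void step1(float *d) { d[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f; }
__global__ void step2(float *d) { d[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }

int main(void)
{
    float *d;
    cudaMalloc((void **)&d, 1024 * sizeof(float));
    step1<<<4, 256>>>(d);    // no explicit synchronization call between the launches...
    step2<<<4, 256>>>(d);    // ...step2 does not start until step1's writes have completed
    cudaDeviceSynchronize(); // only needed before the host inspects the result
    cudaFree(d);
    return 0;
}

The Cholesky kernels below rely on exactly the same behaviour.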

So, we'll need something like


__global__ void topleft(...);
__global__ void strip(...);
__global__ void therest(...);

for (n = 0; n < SZ; n++) {
    topleft<<<...>>>(n, ...);
    strip<<<...>>>(n, ...);
    therest<<<...>>>(n, ...);
}
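Fleshing that skeleton out a little, here is one plausible shape for the host loop. The launch configurations, the argument list (matrix pointer, leading dimension, step index) and the 16x16 block shape (discussed later) are assumptions for illustration, not the exact interface of cholesky.cu:

#define TILE 16

__global__ void topleft(float *m, int lda, int n) { /* factorize the 16x16 diagonal tile n */ }
__global__ void strip(float *m, int lda, int n)   { /* update the strip of tiles next to it */ }
__global__ void therest(float *m, int lda, int n) { /* rank-16 update of the trailing submatrix */ }

void cholesky(float *d_m, int N)            // N assumed to be a multiple of TILE
{
    int tiles = N / TILE;
    dim3 Db(TILE, TILE);                    // 16x16 threads per block
    for (int n = 0; n < tiles; n++) {
        topleft<<<1, Db>>>(d_m, N, n);      // one block is enough for one tile
        if (n < tiles - 1) {
            strip<<<tiles - n - 1, Db>>>(d_m, N, n);     // one block per strip tile
            dim3 Dg(tiles - n - 1, tiles - n - 1);
            therest<<<Dg, Db>>>(d_m, N, n);              // one block per trailing tile
        }
    }
}

Each launch sees the global-memory writes of the launches before it, which is all the inter-step synchronization we need.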

Only worry about/optimize what matters!

For us, therest<<<...>>>() is what matters. As long as we don't do anything silly, the other two basically just have to work, and no more.

So, what can we do?

The matrix suggests a tiled approach, with each thread doing one element. Keeping our thread blocks square for simplicity, a natural choice is 16x16 (more on this later), so we'll need a dim3 Db(16,16); N.B. 16x16 = 8x32, so no incomplete warps! But a problem now is that we have to subtract a vector times its transpose, and vectors are a bit unnatural when dealing with 16x16 tiles. Can we improve our algorithm?

Now treat our elements as tiles!

i.e.

Just another Cholesky factorization, smaller by one tile!
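The slide's figure is not reproduced here; presumably it shows the standard blocked step, which (in my notation, a reconstruction) reads, with A_{11} a single 16x16 tile:

\begin{pmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{11} & 0 \\ L_{21} & \tilde{L} \end{pmatrix}
\begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & \tilde{L}^T \end{pmatrix},

giving L_{11} L_{11}^T = A_{11} (a small Cholesky: the "topleft" job), L_{21} = A_{21} L_{11}^{-T} (the "strip" job), and \tilde{L} \tilde{L}^T = A_{22} - L_{21} L_{21}^T, the smaller factorization handled by "therest".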

What have we gained?

(N/16)^3 tiled operations altogether. Each of these is now a 16x16 matrix multiply, so 16^3 operations: (N/16)^3 x 16^3 = N^3. No change in the FLOPs.

But what about memory? If each block can store the tiles it needs in shared memory, then it only has to do 16x16 loads per tile, which saves a factor of 16 in global memory traffic!

But why 16?

A magic number in CUDA, for many reasons. We've already seen that 16^2 = 32x8. But here's another one: on existing hardware, memory accesses are grouped by half-warp, which is 16 threads, and global memory accesses are optimal (coalesced) if all threads in a half-warp access consecutive memory locations. This was even more important on 1st-generation hardware (now you can shuffle your accesses somewhat).
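A hedged illustration (not from the slides) of the difference, for a 16x16 block reading a row-major N x N array with N a multiple of 16:

__global__ void copy_tiles(const float *in, float *out, int N)
{
    int row = blockIdx.y * 16 + threadIdx.y;
    int col = blockIdx.x * 16 + threadIdx.x;

    out[row * N + col] = in[row * N + col];
    // A half-warp (fixed threadIdx.y, threadIdx.x = 0..15) touches 16 consecutive
    // floats here, so the access is coalesced.

    // out[col * N + row] = in[col * N + row];
    // Swapping the roles of row and col would make consecutive threads stride
    // by N floats: uncoalesced, and much slower on 1st-gen hardware.
}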

Sharing memory banks

Remember, we're caching the tiles in shared memory! Like global memory, shared memory is accessed per half-warp. It is divided up into banks, currently 16, each a float wide, so __shared__ float c[16][16]; has one column per bank.

Shared memory is quick if all threads in a half-warp access a different bank (c[?][tx] is fine), or if all threads access the same element in a given bank (c[ty][3] is fine, it's a broadcast).

But c[tx][3] is not: the 16 threads hit 16 different elements of the same bank.

Look at B: we need to load b^T and c. For the untransposed matrix we'll have something like c[ty][tx]=m[ty][tx]; but what about the transposed one? We need something like b[ty][tx]=m[ty][tx]; too (the transpose is then handled by how b is indexed in the multiply below).

To do the matrix multiply within a block, we'll need

for (i = 0; i < 16; i++) {
    m[ty][tx] = m[ty][tx] - b[i][ty] * c[i][tx];
}

so we're okay here
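Putting the loads, the synchronization and the multiply together, here is a self-contained sketch of one trailing-tile update in the same spirit. It assumes row-major storage with leading dimension lda, that the factor is stored transposed so the "strip" at step n is the block row to the right of the diagonal tile (which is what makes both shared-memory loads plain copies), and one block per trailing tile; the actual kernels in cholesky.cu / transchol.cu may use different conventions.

__global__ void tile_update(float *a, int lda, int n)
{
    __shared__ float b[16][16];   // strip tile for this block's row (used "transposed" via indexing)
    __shared__ float c[16][16];   // strip tile for this block's column

    int tx = threadIdx.x, ty = threadIdx.y;
    int I = n + 1 + blockIdx.y;   // tile row of the trailing tile this block updates
    int J = n + 1 + blockIdx.x;   // tile column

    // plain copies; tx runs along a row of the matrix, so both loads are coalesced
    b[ty][tx] = a[(n * 16 + ty) * lda + I * 16 + tx];
    c[ty][tx] = a[(n * 16 + ty) * lda + J * 16 + tx];
    __syncthreads();

    // subtract strip_I^T * strip_J from this tile; b[i][ty] is a broadcast,
    // c[i][tx] spans all 16 banks, so there are no bank conflicts
    float acc = a[(I * 16 + ty) * lda + J * 16 + tx];
    for (int i = 0; i < 16; i++)
        acc -= b[i][ty] * c[i][tx];
    a[(I * 16 + ty) * lda + J * 16 + tx] = acc;
}

(For simplicity this updates the whole trailing square rather than just its triangular part.)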

Digression: Loop unrolling

#pragma unroll 16
for (i = 0; i < 16; i++) {
    m[ty][tx] = m[ty][tx] - b[i][ty] * c[i][tx];
}

replaces the loop with 16 copies of its body, saving the loop overhead

What if you do have shared memory bank conflicts?

These typically happen when accessing a 16x16 array column-wise rather than row-wise. Consider defining your array a bit "off": __shared__ float c[16][17]; Now not only is each element of a row in a different bank, but each element of a column is too! (Just don't get a half-warp to access diagonal elements now.) Avoiding conflicts can be tricky if you're dealing with e.g. doubles.
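A hedged illustration (not from the slides) of the padded tile in action: a 16x16 block transpose writes the tile row-wise but reads it column-wise, and the extra column keeps the column reads conflict-free.

__global__ void transpose16(const float *in, float *out, int N)   // N a multiple of 16
{
    __shared__ float tile[16][17];       // 17, not 16: shifts each row by one bank

    int tx = threadIdx.x, ty = threadIdx.y;

    // read a 16x16 tile of in (coalesced) into shared memory
    tile[ty][tx] = in[(blockIdx.y * 16 + ty) * N + blockIdx.x * 16 + tx];
    __syncthreads();

    // write the transposed tile (also coalesced); tile[tx][ty] is a column read,
    // but thanks to the padding each thread of the half-warp hits a different bank
    out[(blockIdx.x * 16 + ty) * N + blockIdx.y * 16 + tx] = tile[tx][ty];
}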

So, let's have a look through and give it a go!

see cholesky.cu, transchol.cu, doublechol.cu

(Code available via www.ast.cam.ac.uk/~stg20/cuda/cholesky/)

What goes on under the hood?

nvcc splits your .cu code into .c/.cpp code for the host compiler and code for the device. The device code is compiled into a virtual assembly language (.ptx), which then gets further compiled into proper executable machine code for your GPU (.cubin). Everything then gets put back together and linked with the appropriate libraries to give you your executable a.out.

You can take a look:

nvcc -keep

Looking at the ptx can give you a good idea of what is going on.

Nvidia haven't disclosed the real ISA, but you can look at the .cubin file and see some interesting things, like how many registers per thread and how much shared memory your kernels need. The less of both, the more blocks per multiprocessor you can run, which helps hide memory latency.
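You can also get these numbers without digging into the .cubin: passing -v through to ptxas makes the compiler print the register and shared (and constant/local) memory usage of each kernel, e.g.

nvcc --ptxas-options=-v cholesky.cu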

Conclusions

We've seen that GPGPU is all about having many threads run the same program, and that hardware constraints make this more complicated than it might be. We've seen how to use multiple kernels when we need global synchronization, and that the choice of algorithm to implement is crucial. We've also looked at possible optimizations and at what goes on beneath the surface.

Further issues

You need about 6 warps per multiprocessor to hide the arithmetic pipeline latency.

As well as shared memory bank conflicts, you can also have register bank conflicts. The only advice is to have 64, 128, ... threads per block.

Global memory is partitioned round-robin between channels in 256-byte chunks. Be careful to avoid all threads only accessing memory that lies in one channel! Nvidia now recognize this and call it partition camping.

nvcc -maxrregcount n (which limits each kernel to n registers per thread) can sometimes help
