We want to break the problem into multiple steps, each of which can have a large number of blocks (and threads within blocks) thrown at it.
How do we synchronize between steps? Remember that __syncthreads() only works within a block. Key point: global memory persists on the GPU after a kernel finishes, and a new kernel isn't started until pending memory writes complete.
Global Synchronization
So, enforce global synchronization by stopping one kernel and starting another one!
A bit drastic, but it works! The cost: O(10^-5 s) of kernel launch overhead, and shared memory is erased between kernels.
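A minimal sketch of this pattern (kernel names, sizes and the doubling/increment work are illustrative placeholders, not the course's actual kernels):

```cuda
#include <cuda_runtime.h>

// Hypothetical step kernels. step1 writes to global memory; step2 reads
// those results. The kernel boundary acts as the global barrier.
__global__ void step1(float *d_data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_data[i] = 2.0f * d_data[i];   // write to persistent global memory
}

__global__ void step2(float *d_data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_data[i] += 1.0f;              // safely sees all of step1's writes
}

int main(void)
{
    float *d_data;
    cudaMalloc((void **)&d_data, 256 * sizeof(float));

    step1<<<2, 128>>>(d_data);
    // No explicit synchronization needed between the launches: step2
    // will not start until step1's pending memory writes complete.
    step2<<<2, 128>>>(d_data);

    cudaFree(d_data);
    return 0;
}
```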
For us, therest<<<,>>>() is the kernel that matters. As long as we don't do anything silly, the other two basically just have to work, no more.
The matrix suggests a tiled approach, each thread doing one element. Keeping our thread blocks square for simplicity, a natural choice is 16x16 (more on this later), so we'll need a dim3 Db(16,16); N.B. 16x16 = 8x32, so no incomplete warps! But there is a problem: now we have to subtract a vector times its transpose, and vectors are a bit unnatural when dealing with 16x16 tiles! Can we improve our algorithm?
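The launch configuration this implies might look like the following fragment (N and the argument list are placeholders; this is a sketch, not a complete program):

```cuda
// One thread per matrix element, in square 16x16 tiles.
dim3 Db(16, 16);             // 256 threads = 8 complete warps of 32
dim3 Dg(N / 16, N / 16);     // one block per 16x16 tile (assumes 16 divides N)
therest<<<Dg, Db>>>(/* ... */);
```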
i.e. (N/16)^3 tiled operations altogether. Each of these is now a 16x16 matrix multiply, so 16^3 FLOPs apiece: no change in the total FLOPs.
But what about memory? If each block can store the tiles it needs in shared memory, then it only has to do 16x16 loads per tile operation, saving a factor of 16 on global-memory traffic!
16 is a magic number in CUDA, for many reasons. We've already seen that 16^2 = 32x8. But here's another one: on existing hardware, memory accesses are grouped by half-warp, which is 16 threads, and global memory accesses are optimal ("coalesced") if all threads in a half-warp access consecutive memory locations. This was even more important on 1st-generation hardware (now you can shuffle your accesses somewhat).
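A sketch of the difference (the kernels and the stride of 16 are illustrative; exact coalescing rules depend on the hardware generation):

```cuda
// Coalesced: threads tx = 0..15 of a half-warp read consecutive floats,
// so the 16 loads combine into a single memory transaction.
__global__ void coalesced(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];          // thread k touches element base + k
}

// Strided: consecutive threads hit addresses 16 floats apart, so each
// load is serviced separately -- much slower, especially on 1st-gen parts.
__global__ void strided(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[16 * i];     // uncoalesced access pattern
}
```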
Remember, we're caching the tiles in shared memory! Like global memory, shared memory is accessed per half-warp. It is divided up into banks, currently 16, each one float wide. So __shared__ float c[16][16]; has one column per bank.
Shared memory is quick if: all threads access a different bank (c[?][tx] is fine), or all threads access the same element in a given bank (c[ty][3] is fine; it is broadcast).
Look at B: we need to load b^T and c. Fine for the untransposed matrix, where we'll have something like c[ty][tx]=m[ty][tx]; but what about the transposed one? We need something like b[tx][ty]=m[ty][tx]; which writes shared memory columnwise.
To do the matrix multiply within a block, we'll need for (int i=0; i<16; i++) { m[ty][tx] = m[ty][tx] - b[i][ty]*c[i][tx]; }
Bank conflicts typically happen when accessing a 16x16 array columnwise rather than rowwise. Consider defining your array a bit "off": __shared__ float c[16][17]; Now not only is each element of a row in a different bank, but each element of a column is as well, since consecutive column entries are 17 floats apart and 17 mod 16 cycles through all the banks. (Just don't get a half-warp to access diagonal elements now!) Avoiding conflicts can be tricky if you're dealing with e.g. doubles.
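Putting the padding trick together with the tile update, the block body might be sketched as follows (the load indices and the flat m[ty*N+tx] addressing are simplified placeholders; a real kernel would add per-block offsets):

```cuda
// Sketch of the per-block rank-16 update with the padding trick applied.
// b gets the extra column so that its columnwise (transposed) fill is
// conflict-free; c is filled rowwise and needs no padding.
__global__ void tile_update(float *m, int N)
{
    __shared__ float b[16][17];   // padded: column elements hit distinct banks
    __shared__ float c[16][16];
    int tx = threadIdx.x, ty = threadIdx.y;

    // Illustrative loads of the two input tiles:
    //   b[tx][ty] = ...;   // transposed, columnwise write: fine thanks to padding
    //   c[ty][tx] = ...;   // straight, rowwise write: fine anyway
    __syncthreads();

    float acc = m[ty * N + tx];
    for (int i = 0; i < 16; i++)
        acc -= b[i][ty] * c[i][tx];  // b[i][ty]: broadcast; c[i][tx]: distinct banks
    m[ty * N + tx] = acc;
}
```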
nvcc splits your .cu code into .c/.cpp code for the host compiler and code for the device. It compiles the device code into a virtual assembly language (.ptx), which then gets further compiled into proper executable machine code for your GPU (.cubin). Everything then gets put back together and linked with the appropriate libraries to give you your executable, a.out.
nvcc -keep
Looking at the ptx can give you a good idea of what is going on.
Nvidia haven't disclosed the real ISA, but you can look at the .cubin file and see some interesting things, like how many registers per thread and how much shared memory your kernels need. The less of both they use, the more blocks per multiprocessor you can run, which helps hide memory latency.
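In practice this inspection might look like the following (mykernel.cu is a placeholder, and the exact report format varies by toolkit version):

```shell
# Keep the intermediate files (.ptx, .cubin, ...) for inspection:
nvcc -keep mykernel.cu

# Ask ptxas to report per-kernel resource usage directly:
nvcc --ptxas-options=-v mykernel.cu
# prints something along the lines of:
#   ptxas info : Used 14 registers, 2084 bytes smem
```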
Conclusions
We've seen that GPGPU is all about having many threads run the same program. Hardware constraints make this more complicated than it might be. We've seen how we can use multiple kernels if we need global synchronization. The choice of algorithm to implement is crucial. And we've looked at possible optimizations and at what goes on beneath the surface.
Further issues
As well as shared memory bank conflicts, you can also have register bank conflicts. The only advice is to use a multiple of 64 threads per block (64, 128, ...).
Global memory is partitioned round-robin between channels in 256-byte chunks. Be careful to avoid having all threads access memory that lies in only one channel! Nvidia now recognize this problem and call it "partition camping".
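A sketch of how camping arises (the channel count of 8 is illustrative; real parts differ):

```cuda
// With 256-byte chunks round-robin over, say, 8 channels:
//   channel(byte_addr) = (byte_addr / 256) % 8
// A stride of 8 * 256 bytes = 512 floats therefore keeps every access
// in the SAME channel, leaving the other seven idle.
__global__ void camped(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i * 512];   // byte addr = i*2048; (i*2048/256) % 8 == 0 always
}
```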