NVIDIA Confidential
Optimization
GPUs are very fast, BUT:
Poor programming can lead to hideous performance
Squeezing out the most speed takes a bit of expertise
What to optimize?
GMEM coalescing
GMEM bandwidth
Occupancy
  Number of threads running on an SM
  Limited by registers, SMEM, a maximum of 8 blocks, and a maximum of 768 threads (1024 on GT200)
  More running threads allow more latency hiding!
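The limits listed above can be turned into a small host-side calculation. The following is a minimal sketch, not a replacement for NVIDIA's occupancy tools: the struct `SmLimits` and the function names are hypothetical, and the register and shared-memory sizes used in the usage note (8192 registers, 16 KB SMEM for G92-class parts) are assumptions that do not appear on the slide. It also ignores allocation granularity.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM resource limits (hypothetical helper type).
struct SmLimits {
    int max_threads;  // 768 on G92, 1024 on GT200 (from the slide)
    int max_blocks;   // 8 (from the slide)
    int registers;    // assumption: 8192 on G92-class parts
    int smem_bytes;   // assumption: 16384 bytes
};

// Number of blocks that can be resident on one SM at the same time:
// the tightest of the four limits wins.
inline int resident_blocks(const SmLimits& sm, int threads_per_block,
                           int regs_per_thread, int smem_per_block) {
    assert(threads_per_block > 0 && regs_per_thread > 0 && smem_per_block > 0);
    const int by_threads = sm.max_threads / threads_per_block;
    const int by_regs = sm.registers / (regs_per_thread * threads_per_block);
    const int by_smem = sm.smem_bytes / smem_per_block;
    return std::min(std::min(sm.max_blocks, by_threads),
                    std::min(by_regs, by_smem));
}

// Occupancy as the fraction of the SM's thread slots that are filled.
inline double occupancy(const SmLimits& sm, int threads_per_block,
                        int regs_per_thread, int smem_per_block) {
    const int blocks = resident_blocks(sm, threads_per_block,
                                       regs_per_thread, smem_per_block);
    return static_cast<double>(blocks * threads_per_block) / sm.max_threads;
}
```

Under these assumed limits, a 256-thread block using 16 registers per thread and 4 KB of SMEM is register-bound at 2 resident blocks, which fills 512 of the 768 thread slots.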
Additional restrictions:
Starting address for a region must be a multiple of region size
The kth thread in a half-warp must access the kth element in a block being read
[Figure: half-warp access patterns. Threads t0 to t15 mapped onto 4-byte words at addresses 128 through 192.]
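The two restrictions above can be checked mechanically for a given access pattern. This is a minimal host-side sketch under stated assumptions: the function name `is_coalesced` is hypothetical, the region size is taken to be 16 threads times the word size, and the word sizes accepted (4, 8, or 16 bytes) are an assumption about the part of the rule that is not shown in this section.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int kHalfWarp = 16;  // threads in a half-warp

// addr[k] is the byte address accessed by thread k of the half-warp.
// Returns true when both restrictions hold:
//   1. the region starts on a multiple of the region size, and
//   2. thread k accesses element k of the region.
inline bool is_coalesced(const std::uint64_t addr[kHalfWarp],
                         std::size_t word_bytes) {
    assert(word_bytes == 4 || word_bytes == 8 || word_bytes == 16);
    const std::uint64_t region_bytes = kHalfWarp * word_bytes;
    if (addr[0] % region_bytes != 0) return false;  // misaligned start
    for (int k = 0; k < kHalfWarp; ++k) {
        const std::uint64_t expected =
            addr[0] + static_cast<std::uint64_t>(k) * word_bytes;
        if (addr[k] != expected) return false;  // thread k is off its slot
    }
    return true;
}
```

With 4-byte words, a half-warp starting at address 128 passes the check (128 is a multiple of 64), while the same pattern shifted to start at 132 fails it.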
Tools
Look at the .cubin to find register, smem, and lmem usage
Use the -keep compiler option
Visual Profiler
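// Interior pixels only: the 5x5 window must stay inside the image
if (X > 1 && X < width-2 && Y > 1 && Y < height-2) {
    int sum = 0;
    int kidx = 0;
    // Accumulate the 5x5 neighborhood weighted by the filter coefficients
    for (int i = -2; i <= 2; i++) {
        for (int j = -2; j <= 2; j++) {
            sum += gpu_kernel[kidx++] * img_in[__umul24((Y+i),pitch) + X+j];
        }
    }
    // Normalize, then clamp to the 8-bit output range
    sum = (int)((float)sum * scale);
    img_out[__umul24(Y,pitch) + X] = CLAMP(sum,0,255);
}
}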
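When tuning a kernel like the one above, it helps to have a plain host-side version to compare results against. The following is a sketch of such a reference, assuming 8-bit pixels stored row by row with `pitch` elements per row and a 25-entry integer filter; the function name `conv5x5_reference` is hypothetical and the clamp is written out instead of using the `CLAMP` macro.

```cpp
#include <algorithm>
#include <cassert>

// Host-side reference for the 5x5 convolution, using the same bounds,
// indexing, scaling, and clamping as the GPU kernel fragment.
inline void conv5x5_reference(const unsigned char* img_in,
                              unsigned char* img_out,
                              int width, int height, int pitch,
                              const int* kernel, float scale) {
    assert(pitch >= width);
    for (int y = 2; y < height - 2; ++y) {
        for (int x = 2; x < width - 2; ++x) {
            int sum = 0;
            int kidx = 0;
            for (int i = -2; i <= 2; ++i) {
                for (int j = -2; j <= 2; ++j) {
                    sum += kernel[kidx++] * img_in[(y + i) * pitch + x + j];
                }
            }
            sum = static_cast<int>(static_cast<float>(sum) * scale);
            // Border pixels are never written, matching the kernel's guard.
            img_out[y * pitch + x] =
                static_cast<unsigned char>(std::min(std::max(sum, 0), 255));
        }
    }
}
```

Comparing this output pixel by pixel against the GPU result after each optimization step is a cheap way to catch indexing mistakes.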
Results
148 ms!
Nothing is coalesced
8-bit memory accesses are very inefficient
Example 3: Continued
Unfortunately, textures use a lot of registers (25 in this kernel), which reduces occupancy
A 16x16 block size limits us to 1 block / SM (on G92), so the entire block is stalled during the __syncthreads()
A 16x8 block allows 2 blocks / SM, but gives surprisingly little performance improvement (3.2 ms)
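The block counts quoted above follow from register arithmetic. This is a small sketch of that calculation; the function name is hypothetical, and the figure of 8192 registers per SM on G92 used in the usage note is an assumption not stated on the slide. Real hardware allocates registers in chunks, which this sketch ignores.

```cpp
#include <algorithm>
#include <cassert>

// Blocks per SM when registers are the limiting resource,
// capped at the per-SM block limit (8 by default).
inline int blocks_per_sm_by_registers(int regs_per_sm, int regs_per_thread,
                                      int block_w, int block_h,
                                      int max_blocks = 8) {
    assert(regs_per_sm > 0 && regs_per_thread > 0);
    assert(block_w > 0 && block_h > 0);
    const int regs_per_block = regs_per_thread * block_w * block_h;
    return std::min(regs_per_sm / regs_per_block, max_blocks);
}
```

With 25 registers per thread, a 16x16 block needs 6400 registers, so only one fits in an assumed 8192; a 16x8 block needs 3200, so two fit and a third does not.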
[Figure: an image tile surrounded by apron pixels.]
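The cost of the apron can be estimated with a little arithmetic. The sketch below assumes a filter of radius 2 (the 5x5 case) adds an apron of that width on every side of the tile; `TileFootprint` and `tile_with_apron` are hypothetical names, not taken from the slides.

```cpp
#include <cassert>

// Size of the shared-memory region needed for one tile plus its apron.
struct TileFootprint {
    int smem_w;        // tile width plus apron on both sides
    int smem_h;        // tile height plus apron on both sides
    int tile_pixels;   // pixels the block actually produces
    int apron_pixels;  // extra pixels loaded only as neighbors
};

inline TileFootprint tile_with_apron(int tile_w, int tile_h, int radius) {
    assert(tile_w > 0 && tile_h > 0 && radius >= 0);
    TileFootprint f;
    f.smem_w = tile_w + 2 * radius;
    f.smem_h = tile_h + 2 * radius;
    f.tile_pixels = tile_w * tile_h;
    f.apron_pixels = f.smem_w * f.smem_h - f.tile_pixels;
    return f;
}
```

For a 16x16 tile and radius 2, the region is 20x20, so 144 of the 400 loaded pixels are apron. A 16x8 tile loads 112 apron pixels for only 128 output pixels, so smaller tiles pay proportionally more for the apron.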
Example 6: Continued
Results (16x16 block, 2 blocks / SM):
3.4 ms. Memory only: 1.2 ms.
Note: texture is still slightly better, even with all this work!
16x20 Results:
3.4 ms. Again, no real benefit.
The Answer:
[Figure: shared-memory bank diagram, banks up to Bank 15.]
No Bank Conflicts
[Figure: Threads 0 to 15 mapped to Banks 0 to 15 through a random 1:1 permutation.]
[Figure: thread-to-bank mapping labeled x8.]
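The thread-to-bank mappings pictured above can be reproduced with a short host-side model. This is a simplified sketch assuming 16 banks of 32-bit words and accesses issued per half-warp of 16 threads; the function name `conflict_degree` is hypothetical. It counts distinct words per bank, so threads reading the same word are treated as a broadcast and not as a conflict. The x8 label presumably refers to an 8-way conflict, which the model produces for a stride of 8 words.

```cpp
#include <algorithm>
#include <cassert>

constexpr int kBanks = 16;    // shared-memory banks (assumed, per half-warp)
constexpr int kThreads = 16;  // threads in a half-warp

// word_index[k] is the 32-bit word index in shared memory read by thread k.
// Returns the worst-case number of distinct words that fall in one bank:
// 1 means conflict-free, n means the access is serialized n ways.
inline int conflict_degree(const int word_index[kThreads]) {
    int worst = 0;
    for (int bank = 0; bank < kBanks; ++bank) {
        int distinct = 0;
        for (int k = 0; k < kThreads; ++k) {
            assert(word_index[k] >= 0);
            if (word_index[k] % kBanks != bank) continue;
            bool seen = false;  // same word already counted for this bank?
            for (int m = 0; m < k; ++m) {
                if (word_index[m] == word_index[k]) seen = true;
            }
            if (!seen) ++distinct;
        }
        worst = std::max(worst, distinct);
    }
    return worst;
}
```

Any 1:1 permutation of 16 consecutive words gives a degree of 1, a stride of 2 words gives 2, a stride of 8 gives 8, and a stride of 16 puts every thread in the same bank.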
Results:
2.9 ms! Memory only: 0.4 ms.
For the 2-D convolution, the compiler only unrolled the inner loop (#pragma unroll didn't work for the outer loop)
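One common way to force an outer loop to unroll when the pragma is not honored is to expand it at compile time with a template. The sketch below shows that technique on the 5x5 sum; it is not necessarily what the author did, and the names `conv5x5_rows` and `conv5x5_unrolled` are hypothetical. It is written as host code; in a kernel the functions would additionally be marked `__device__`.

```cpp
#include <cassert>

// Accumulates filter row `Row` (0..4), then recurses to the next row.
// The recursion is resolved at compile time, so the outer loop over rows
// is fully expanded regardless of any unroll pragma.
template <int Row>
inline int conv5x5_rows(const int* kernel, const unsigned char* top_left,
                        int pitch) {
    int sum = 0;
    for (int j = 0; j < 5; ++j) {  // short fixed-trip inner loop
        sum += kernel[Row * 5 + j] * top_left[Row * pitch + j];
    }
    return sum + conv5x5_rows<Row + 1>(kernel, top_left, pitch);
}

// Terminating case: there is no sixth row.
template <>
inline int conv5x5_rows<5>(const int*, const unsigned char*, int) {
    return 0;
}

// top_left points at the upper-left pixel of the 5x5 window.
inline int conv5x5_unrolled(const int* kernel, const unsigned char* top_left,
                            int pitch) {
    assert(pitch >= 5);
    return conv5x5_rows<0>(kernel, top_left, pitch);
}
```

The trade-off is code size and register pressure: a fully expanded 25-term sum removes loop overhead but can raise register usage, which feeds back into occupancy.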
Results:
2.5 ms. Memory only: 0.4 ms.
Summary
[Figure: time (ms) per algorithm; series: Compute, Memory Only.]
GT200 optimization
Coalescing buffers greatly improve the performance, especially in non-optimal situations
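A rough way to see why non-optimal patterns hurt less on GT200 is to count how many memory segments a half-warp touches. The sketch below is a simplified model, assuming that what the slide calls coalescing buffers corresponds to the relaxed rule on GT200-class parts where a half-warp is served with one transaction per segment touched (128-byte segments for 4-byte words). The function name `segments_touched` is hypothetical, and the model ignores the hardware's ability to shrink a transaction when only part of a segment is used.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kHalfWarpSize = 16;  // threads in a half-warp

// addr[k] is the byte address accessed by thread k.
// Returns the number of distinct segments the half-warp touches, used here
// as an estimate of the number of memory transactions.
inline int segments_touched(const std::uint64_t addr[kHalfWarpSize],
                            std::uint64_t segment_bytes = 128) {
    assert(segment_bytes > 0);
    int count = 0;
    for (int k = 0; k < kHalfWarpSize; ++k) {
        bool seen = false;  // segment already counted for an earlier thread?
        for (int m = 0; m < k; ++m) {
            if (addr[m] / segment_bytes == addr[k] / segment_bytes) seen = true;
        }
        if (!seen) ++count;
    }
    return count;
}
```

In this model a contiguous but misaligned read that straddles a segment boundary costs 2 transactions, whereas under the strict rule on earlier hardware any violation falls back to one transaction per thread.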
[Figure: Convolution Approaches Comparison, GT200. Time (ms) per approach; series: Compute + Memory, Memory Only.]
GT200 Observations
The hideous case isn't so bad: 2.5x slower than the best case, vs. 60x slower on G92
128-bit is still best
The complex coalesced pattern to fill SMEM is actually slightly slower than just a simple shift
[Figure: time (ms), G92 vs. GT200.]
Broken Coalescing
[Figure: Effect of Wrong Pitch. Time (ms) on G92 and GT200 for Correct Coalescing, Correct Coalescing - Memory, Wrong Pitch, and Wrong Pitch - Memory.]
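A wrong pitch presumably means a row stride that is not a multiple of the coalescing alignment, so that every row after the first starts at a misaligned address. The sketch below shows the usual fix of rounding the row size up; the helper names are hypothetical, and the 64-byte alignment in the usage note is an assumption (it matches the region size for a half-warp of 4-byte words). In CUDA code, `cudaMallocPitch` returns a suitably padded pitch and is the normal way to obtain one.

```cpp
#include <cassert>
#include <cstddef>

// Rounds row_bytes up to the next multiple of alignment (a power of two).
inline std::size_t aligned_pitch(std::size_t row_bytes, std::size_t alignment) {
    assert(alignment > 0 && (alignment & (alignment - 1)) == 0);
    return (row_bytes + alignment - 1) & ~(alignment - 1);
}

// True when every row of an aligned allocation also starts aligned,
// which holds exactly when the pitch is a multiple of the alignment.
inline bool rows_stay_aligned(std::size_t pitch_bytes, std::size_t alignment) {
    assert(alignment > 0);
    return pitch_bytes % alignment == 0;
}
```

For a 1000-byte row and 64-byte alignment, the padded pitch is 1024 bytes; using 1000 directly would leave every subsequent row misaligned.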