
Dynamic parallelism

What is dynamic parallelism?

The ability to launch new grids from the GPU:
- Dynamically: based on run-time data.
- Simultaneously: from multiple threads at once.
- Independently: each thread can launch a different grid.

[Figure: CPU-GPU work flow. On Fermi, only the CPU can generate work for the GPU. On Kepler, the GPU can also generate work for itself.]
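As a concrete illustration, here is a minimal sketch of a device-side launch (the kernels, the work_per_segment array and the segment layout are hypothetical; compute capability 3.5+ and compilation with nvcc -rdc=true are assumed):

// Child grid: a hypothetical kernel processing one segment of the data.
__global__ void child_kernel(float *data, int offset, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[offset + i] *= 2.0f;             // toy workload
}

// Parent grid: one thread per block inspects run-time data and launches
// a child grid sized to the amount of work found in its segment.
__global__ void parent_kernel(float *data, const int *work_per_segment, int segment_size)
{
    if (threadIdx.x == 0) {                   // launch once per block, not once per thread
        int n = work_per_segment[blockIdx.x]; // run-time data decides the child grid size
        if (n > 0) {
            int threads = 128;
            int blocks  = (n + threads - 1) / threads;
            child_kernel<<<blocks, threads>>>(data, blockIdx.x * segment_size, n);
        }
    }
}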
The way we did things in the pre-Kepler era:
The GPU was a slave to the CPU.
High data bandwidth is required for communications:
External: more than 10 GB/s (PCI-express 3).
Internal: more than 100 GB/s (GDDR5 video memory on a 384-bit bus,
comparable to a six-channel CPU memory architecture).

[Figure: the CPU drives every step, calling functions and libraries and sending each operation (init, alloc, operations 1-3) to the GPU.]
The way we do things in Kepler:
GPUs launch their own kernels.
The pre-Kepler GPU is a co-processor; the Kepler GPU is autonomous thanks to dynamic parallelism.

[Figure: on pre-Kepler GPUs the CPU drives every kernel launch; on Kepler the GPU launches kernels for itself.]

Now programs run faster and are expressed in a more natural way.
Example 1: Dynamic work generation

Assign resources dynamically according to real-time demand, making it easier to compute irregular problems on the GPU.
This broadens the range of applications where the GPU is useful.

[Figure: three meshes of the same domain.
Coarse grid: higher performance, lower accuracy.
Fine grid: lower performance, higher accuracy.
Dynamic grid: targets performance where accuracy is required.]
Example 2: Deploying
parallelism based on level of detail

Computational power is allocated to the regions of interest.

CUDA until 2012:
• The CPU launches kernels on a regular grid.
• All pixels are treated the same.

CUDA on Kepler:
• The GPU launches a different number of kernels/blocks for each computational region.
Warnings when using dynamic parallelism

It is a much more powerful mechanism than its simplicity in the code suggests. However...
What we write within a CUDA kernel is replicated for all threads. Therefore, a kernel call will produce millions of launches unless it is placed inside an IF statement (which, for example, limits the launch to a single one from thread 0).
If a parent block launches children, can they use the shared memory of their parent?
No. It would be easy to implement in hardware, but very complex for the programmer to guarantee code correctness (avoiding race conditions).
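For instance, a minimal sketch of the guarded launch pattern described above (kernel names are hypothetical):

__global__ void child(float *data);           // hypothetical child kernel

__global__ void parent(float *data)
{
    // Unguarded, this launch would be replicated by every thread of every block.
    // The IF restricts it to a single launch issued by thread 0 of block 0.
    if (blockIdx.x == 0 && threadIdx.x == 0)
        child<<<256, 128>>>(data);
}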
Hyper-Q

In Fermi, several CPU processes could send thread blocks to the same GPU, but the concurrent execution of kernels was severely limited by hardware constraints.
In Kepler, we can execute up to 32 kernels simultaneously, launched from different:
- MPI processes,
- CPU threads (POSIX threads), or
- CUDA streams.
This increases the percentage of temporal occupancy of the GPU.

[Figure: Fermi handles 1 MPI task at a time; Kepler handles 32 simultaneous MPI tasks.]
An example:
3 streams, each composed of 3 kernels

__global__ void kernel_A(pars) { body }   // Same for B...Z

cudaStream_t stream_1, stream_2, stream_3;
...
cudaStreamCreateWithFlags(&stream_1, ...);
cudaStreamCreateWithFlags(&stream_2, ...);
cudaStreamCreateWithFlags(&stream_3, ...);
...
// stream 1
kernel_A <<< dimgridA, dimblockA, 0, stream_1 >>> (pars);
kernel_B <<< dimgridB, dimblockB, 0, stream_1 >>> (pars);
kernel_C <<< dimgridC, dimblockC, 0, stream_1 >>> (pars);
...
// stream 2
kernel_P <<< dimgridP, dimblockP, 0, stream_2 >>> (pars);
kernel_Q <<< dimgridQ, dimblockQ, 0, stream_2 >>> (pars);
kernel_R <<< dimgridR, dimblockR, 0, stream_2 >>> (pars);
...
// stream 3
kernel_X <<< dimgridX, dimblockX, 0, stream_3 >>> (pars);
kernel_Y <<< dimgridY, dimblockY, 0, stream_3 >>> (pars);
kernel_Z <<< dimgridZ, dimblockZ, 0, stream_3 >>> (pars);

[Figure: stream_1 queues kernel_A, kernel_B, kernel_C; stream_2 queues kernel_P, kernel_Q, kernel_R; stream_3 queues kernel_X, kernel_Y, kernel_Z.]
Grid management unit: Fermi vs. Kepler

Fermi:
- Stream queue: ordered queues of grids (kernels A, B, C on stream 1; P, Q, R on stream 2; X, Y, Z on stream 3).
- A single hardware queue multiplexes all streams.
- Work distributor: tracks blocks issued from grids; up to 16 active grids.
- Blocks are dispatched to the SMs.

Kepler GK110:
- Stream queue backed by parallel hardware streams.
- Grid Management Unit: holds pending and suspended grids (thousands of pending grids), allows suspending grids, and accepts CUDA-generated work (dynamic parallelism).
- Work distributor: actively dispatches grids; up to 32 active grids.
- Blocks are dispatched to the SMX units.
The relation between
software and hardware queues

Fermi: CUDA streams multiplex into a single hardware queue.
- Up to 16 grids can run at once on the GPU hardware.
- Stream 1 (A--B--C), stream 2 (P--Q--R) and stream 3 (X--Y--Z) are serialized into one queue: A--B--C -- P--Q--R -- X--Y--Z.
- Chances for overlapping: only at stream edges.

Kepler: No inter-stream dependencies.
- Up to 32 grids can run at once on the GPU hardware.
- Each stream (A--B--C, P--Q--R, X--Y--Z) maps to its own hardware queue.
- Concurrency at full-stream level.
Without Hyper-Q: Multiprocess by temporal division

[Figure: GPU utilization (%) over time. CPU processes A to F take turns on the GPU, each one alone keeping utilization far below 100%.]

With Hyper-Q: Simultaneous multiprocess

[Figure: GPU utilization (%) over time. The same CPU processes A to F are mapped onto the GPU concurrently, utilization approaches 100% and the total execution time shrinks.]
Unified memory
CUDA memory types

Zero-Copy (pinned memory):
- CUDA call: cudaMallocHost(&A, 4);
- Allocation fixed in: main memory (DDR3).
- Local access for: CPU.
- PCI-e access for: all GPUs.
- Other features: avoids swapping to disk.
- Coherency: at all times.
- Full support in: CUDA 2.2.

Unified Virtual Addressing:
- CUDA call: cudaMalloc(&A, 4);
- Allocation fixed in: video memory (GDDR5).
- Local access for: home GPU.
- PCI-e access for: other GPUs.
- Other features: no CPU access.
- Coherency: between GPUs.
- Full support in: CUDA 1.0.

Unified Memory:
- CUDA call: cudaMallocManaged(&A, 4);
- Allocation fixed in: both.
- Local access for: CPU and home GPU.
- PCI-e access for: other GPUs.
- Other features: migration between CPU and GPU on access.
- Coherency: only at launch & synchronization.
- Full support in: CUDA 6.0.
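To make the three columns concrete, a minimal sketch of the corresponding allocations (buffer size and variable names are illustrative):

float *h_pinned, *d_only, *managed;
size_t bytes = 1 << 20;

// Zero-copy / pinned host memory: page-locked main memory, reachable by all GPUs over PCI-e.
cudaMallocHost((void **)&h_pinned, bytes);

// Device memory under Unified Virtual Addressing: lives in video memory, no CPU access.
cudaMalloc((void **)&d_only, bytes);

// Unified (managed) memory: a single pointer, migrated between CPU and GPU.
cudaMallocManaged((void **)&managed, bytes);

// ... use the buffers ...

cudaFreeHost(h_pinned);
cudaFree(d_only);
cudaFree(managed);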
Additions to the CUDA API

New call: cudaMallocManaged(pointer, size, flag)
- Drop-in replacement for cudaMalloc(pointer, size).
- The flag indicates who shares the pointer with the device:
  - cudaMemAttachHost: only the CPU.
  - cudaMemAttachGlobal: any other GPU too.
- All operations valid on device memory are also valid on managed memory.

New keyword: __managed__
- Global variable annotation that combines with __device__.
- Declares a global-scope migratable device variable.
- The symbol is accessible from both GPU and CPU code.

New call: cudaStreamAttachMemAsync()
- Manages concurrency in multi-threaded CPU applications.
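A brief sketch combining these additions (a minimal sketch assuming CUDA 6.0 or later; the kernel touch and the buffer size are made up for illustration):

__device__ __managed__ int counter = 0;       // __managed__ global, visible to CPU and GPU

__global__ void touch(float *buf)             // hypothetical kernel
{
    buf[threadIdx.x] = (float) counter;
}

int main()
{
    float *buf;
    cudaStream_t s;
    cudaStreamCreate(&s);

    // Managed allocation initially attached to the host only.
    cudaMallocManaged((void **)&buf, 256 * sizeof(float), cudaMemAttachHost);

    // Associate the allocation with stream s, so other host threads can keep
    // working on their own managed allocations while s is busy.
    cudaStreamAttachMemAsync(s, buf, 0, cudaMemAttachSingle);
    cudaStreamSynchronize(s);

    counter = 7;                              // CPU writes the managed global (GPU is idle)
    touch<<<1, 256, 0, s>>>(buf);
    cudaStreamSynchronize(s);                 // required before the CPU reads buf again

    cudaFree(buf);
    cudaStreamDestroy(s);
    return 0;
}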
Unified memory: Technical details

The maximum amount of unified memory that can be allocated is limited by the smallest of the GPU memories available.
Memory pages from unified allocations touched by the CPU must migrate back to the GPU before any kernel launch.
The CPU cannot access any unified memory while the GPU is executing; that is, a cudaDeviceSynchronize() call is required before the CPU is allowed to access unified memory.
The GPU has exclusive access to unified memory while any kernel is executing on the GPU, and this holds even if the kernel does not touch the unified memory (see the example on the next slide).
First example:
Access constraints

__device__ __managed__ int x, y = 2;   // Unified memory

__global__ void mykernel()             // GPU territory
{
    x = 10;
}

int main()                             // CPU territory
{
    mykernel <<<1,1>>> ();

    y = 20;   // ERROR: CPU access concurrent with GPU

    return 0;
}
First example:
Access constraints

__device__ __managed__ int x, y = 2;   // Unified memory

__global__ void mykernel()             // GPU territory
{
    x = 10;
}

int main()                             // CPU territory
{
    mykernel <<<1,1>>> ();
    cudaDeviceSynchronize();   // Problem fixed!
    // Now the GPU is idle, so access to “y” is OK
    y = 20;
    return 0;
}
Second example:
Sorting elements from a file

CPU code in C:

void sortfile(FILE *fp, int N)
{
    char *data;
    data = (char *) malloc(N);

    fread(data, 1, N, fp);

    qsort(data, N, 1, compare);

    use_data(data);

    free(data);
}

GPU code from CUDA 6.0 on:

void sortfile(FILE *fp, int N)
{
    char *data;
    cudaMallocManaged(&data, N);

    fread(data, 1, N, fp);

    qsort<<<...>>>(data, N, 1, compare);
    cudaDeviceSynchronize();

    use_data(data);

    cudaFree(data);
}
Third example: Cloning dynamic data
structures WITHOUT unified memory

struct dataElem {
    int   prop1;
    int   prop2;
    char *text;
};

A “deep copy” is required:
- We must copy the structure and everything that it points to. This is why C++ invented the copy constructor.
- CPU and GPU cannot share a copy of the data (coherency). This prevents memcpy-style comparisons, checksumming and other validations.

[Figure: the dataElem struct and its “Hello, world” text string exist twice, once in CPU memory and once in GPU memory: two addresses and two copies of the data.]
Cloning dynamic data structures
WITHOUT unified memory (2)

void launch(dataElem *elem) {
    dataElem *g_elem;
    char *g_text;

    int textlen = strlen(elem->text);

    // Allocate storage for struct and text
    cudaMalloc(&g_elem, sizeof(dataElem));
    cudaMalloc(&g_text, textlen);

    // Copy up each piece separately, including the new “text” pointer value
    cudaMemcpy(g_elem, elem, sizeof(dataElem), cudaMemcpyHostToDevice);
    cudaMemcpy(g_text, elem->text, textlen, cudaMemcpyHostToDevice);
    cudaMemcpy(&(g_elem->text), &g_text, sizeof(g_text), cudaMemcpyHostToDevice);

    // Finally we can launch our kernel, but
    // CPU and GPU use different copies of “elem”
    kernel<<< ... >>>(g_elem);
}

[Figure: two addresses and two copies of the data, one dataElem in CPU memory and one in GPU memory.]
Cloning dynamic data structures
WITH unified memory

void launch(dataElem *elem) {
    kernel<<< ... >>>(elem);
}

What remains the same:
- Data movement: the GPU still accesses a local copy of “text”.

What has changed:
- The programmer sees a single pointer.
- CPU and GPU both reference the same object.
- There is coherence.
- To choose between pass-by-reference and pass-by-value you need to use C++.

[Figure: a single dataElem in unified memory is referenced from both CPU memory and GPU memory.]
Fourth example: Linked lists

[Figure: a linked list of (key, value, next) nodes in CPU memory and a mirrored list in GPU memory; all cross accesses go through the PCI-express bus.]

Almost impossible to manage in the original CUDA API.
The best you can do is use pinned memory:
- Pointers are global, just like unified memory pointers.
- Performance is low: the GPU is limited by PCI-e bandwidth.
- GPU latency is very high, which is critical for linked lists because of the intrinsic pointer chasing.
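As an illustration of this pinned-memory workaround, a minimal sketch (the node layout follows the figure; cudaHostAllocMapped, a 64-bit UVA platform and a hypothetical walk kernel are assumed):

struct Node { int key; int value; Node *next; };

// Allocate a node in pinned (page-locked) host memory; with UVA the GPU
// can dereference the very same pointer, at the cost of a PCI-e transaction.
Node *alloc_node(int key, int value)
{
    Node *n;
    cudaHostAlloc((void **)&n, sizeof(Node), cudaHostAllocMapped);
    n->key = key; n->value = value; n->next = NULL;
    return n;
}

// Hypothetical kernel chasing the pointers directly from the GPU:
// every hop crosses the PCI-e bus, which is why latency dominates.
__global__ void walk(Node *head)
{
    for (Node *p = head; p != NULL; p = p->next)
        p->value += 1;
}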
Linked lists with unified memory

[Figure: a single linked list of (key, value, next) nodes in unified memory, reachable from both CPU and GPU.]

- You can pass list elements between CPU and GPU.
- There is no need to move data back and forth between CPU and GPU.
- You can insert and delete elements from the CPU or the GPU.
- But the program must still ensure there are no race conditions (data is coherent between CPU and GPU at kernel launch only).
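And the same list built on unified memory, as a minimal sketch (node layout as before; the walk kernel is again hypothetical):

struct Node { int key; int value; Node *next; };

// Each node lives in managed memory, so the same pointer is valid on CPU and GPU.
Node *push_front(Node *head, int key, int value)
{
    Node *n;
    cudaMallocManaged((void **)&n, sizeof(Node));
    n->key = key; n->value = value; n->next = head;
    return n;
}

__global__ void walk(Node *head)              // hypothetical kernel
{
    for (Node *p = head; p != NULL; p = p->next)
        p->value += 1;                        // the GPU updates the list in place
}

// Usage sketch: build on the CPU, traverse on the GPU, read back on the CPU.
//   Node *head = push_front(NULL, 1, 10);
//   walk<<<1, 1>>>(head);
//   cudaDeviceSynchronize();                 // required before the CPU touches the list again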
Unified memory: Summary

- Drop-in replacement for cudaMalloc() using cudaMallocManaged().
- cudaMemcpy() is now optional.
- Greatly simplifies code porting.
- Less host-side memory management.
- Enables shared data structures between CPU & GPU:
  - A single pointer to data means no changes to data structures.
  - Powerful for high-level languages like C++.
Unified memory: The roadmap.
Contributions on every abstraction level

High abstraction level:
- Past, consolidated (2014-15): single pointer to data; no cudaMemcpy() is required.
- Present, recently available (2016-17): prefetching mechanisms to anticipate data arrival in copies.
- Future, in coming years: system allocator unified.

Medium abstraction level:
- Past, consolidated (2014-15): coherence at launch & synchronize.
- Present, recently available (2016-17): migration hints.
- Future, in coming years: stack memory unified.

Low abstraction level:
- Past, consolidated (2014-15): shared C/C++ data structures.
- Present, recently available (2016-17): additional OS support.
- Future, in coming years: hardware-accelerated coherence.
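The prefetching mechanisms and migration hints listed for 2016-17 are exposed in CUDA 8 through calls such as cudaMemPrefetchAsync() and cudaMemAdvise(); a minimal sketch (buffer size and device id are illustrative):

float *data;
size_t bytes = 1 << 20;
int device = 0;                               // illustrative device id

cudaMallocManaged((void **)&data, bytes);

// Migration hint: keep the pages preferably on this GPU.
cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device);

// Prefetch the pages to the GPU before the kernels need them.
cudaMemPrefetchAsync(data, bytes, device);

// ... launch kernels that use data ...

// Prefetch back to the CPU before host code touches the buffer again.
cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId);
cudaDeviceSynchronize();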
