
Dynamic parallelism

What is dynamic parallelism?

The ability to launch new grids from the GPU:
- Dynamically: based on run-time data.
- Simultaneously: from multiple threads at once.
- Independently: each thread can launch a different grid.

[Figure: CPU-GPU work flow. On Fermi, only the CPU can generate work for the GPU. On Kepler, the GPU can also generate work for itself.]
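As a concrete illustration, here is a minimal sketch of a device-side launch (the kernels, the work_per_segment array and the segment layout are hypothetical; compute capability 3.5+ and compilation with nvcc -rdc=true are assumed):

// Child grid: a hypothetical kernel processing one segment of the data.
__global__ void child_kernel(float *data, int offset, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[offset + i] *= 2.0f;             // toy workload
}

// Parent grid: one thread per block inspects run-time data and launches
// a child grid sized to the amount of work found in its segment.
__global__ void parent_kernel(float *data, const int *work_per_segment, int segment_size)
{
    if (threadIdx.x == 0) {                   // launch once per block, not once per thread
        int n = work_per_segment[blockIdx.x]; // run-time data decides the child grid size
        if (n > 0) {
            int threads = 128;
            int blocks  = (n + threads - 1) / threads;
            child_kernel<<<blocks, threads>>>(data, blockIdx.x * segment_size, n);
        }
    }
}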
The way we did things in the pre-Kepler era:
The GPU was a slave to the CPU.
High data bandwidth is required for communications:
External: more than 10 GB/s (PCI-express 3).
Internal: more than 100 GB/s (GDDR5 video memory on a 384-bit bus,
comparable to a six-channel CPU memory architecture).

[Figure: the CPU drives every step, calling functions and libraries and sending each operation (init, alloc, operations 1-3) to the GPU.]
The way we do things in Kepler:
GPUs launch their own kernels.
The pre-Kepler GPU is a co-processor; the Kepler GPU is autonomous thanks to dynamic parallelism.

[Figure: on pre-Kepler GPUs the CPU drives every kernel launch; on Kepler the GPU launches kernels for itself.]

Now programs run faster and are expressed in a more natural way.
Example 1: Dynamic work generation

Assign resources dynamically according to real-time demand, making it easier to compute irregular problems on the GPU.
This broadens the range of applications where the GPU is useful.

[Figure: three meshes of the same domain.
Coarse grid: higher performance, lower accuracy.
Fine grid: lower performance, higher accuracy.
Dynamic grid: targets performance where accuracy is required.]
Example 2: Deploying
parallelism based on level of detail

Computational power is allocated to the regions of interest.

CUDA until 2012:
• The CPU launches kernels on a regular grid.
• All pixels are treated the same.

CUDA on Kepler:
• The GPU launches a different number of kernels/blocks for each computational region.
Warnings when using dynamic parallelism

It is a much more powerful mechanism than its simplicity in the code suggests. However...
What we write within a CUDA kernel is replicated for all threads. Therefore, a kernel call will produce millions of launches unless it is placed inside an IF statement (which, for example, limits the launch to a single one from thread 0).
If a parent block launches children, can they use the shared memory of their parent?
No. It would be easy to implement in hardware, but very complex for the programmer to guarantee code correctness (avoiding race conditions).
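For instance, a minimal sketch of the guarded launch pattern described above (kernel names are hypothetical):

__global__ void child(float *data);           // hypothetical child kernel

__global__ void parent(float *data)
{
    // Unguarded, this launch would be replicated by every thread of every block.
    // The IF restricts it to a single launch issued by thread 0 of block 0.
    if (blockIdx.x == 0 && threadIdx.x == 0)
        child<<<256, 128>>>(data);
}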
Hyper-Q

In Fermi, several CPU processes could send thread blocks to the same GPU, but the concurrent execution of kernels was severely limited by hardware constraints.
In Kepler, we can execute up to 32 kernels simultaneously, launched from different:
- MPI processes,
- CPU threads (POSIX threads), or
- CUDA streams.
This increases the percentage of temporal occupancy of the GPU.

[Figure: Fermi handles 1 MPI task at a time; Kepler handles 32 simultaneous MPI tasks.]
An example:
3 streams, each composed of 3 kernels

__global__ void kernel_A(pars) { body }   // Same for B...Z

cudaStream_t stream_1, stream_2, stream_3;
...
cudaStreamCreateWithFlags(&stream_1, ...);
cudaStreamCreateWithFlags(&stream_2, ...);
cudaStreamCreateWithFlags(&stream_3, ...);
...
// stream 1
kernel_A <<< dimgridA, dimblockA, 0, stream_1 >>> (pars);
kernel_B <<< dimgridB, dimblockB, 0, stream_1 >>> (pars);
kernel_C <<< dimgridC, dimblockC, 0, stream_1 >>> (pars);
...
// stream 2
kernel_P <<< dimgridP, dimblockP, 0, stream_2 >>> (pars);
kernel_Q <<< dimgridQ, dimblockQ, 0, stream_2 >>> (pars);
kernel_R <<< dimgridR, dimblockR, 0, stream_2 >>> (pars);
...
// stream 3
kernel_X <<< dimgridX, dimblockX, 0, stream_3 >>> (pars);
kernel_Y <<< dimgridY, dimblockY, 0, stream_3 >>> (pars);
kernel_Z <<< dimgridZ, dimblockZ, 0, stream_3 >>> (pars);

[Figure: stream_1 queues kernel_A, kernel_B, kernel_C; stream_2 queues kernel_P, kernel_Q, kernel_R; stream_3 queues kernel_X, kernel_Y, kernel_Z.]
Grid management unit: Fermi vs. Kepler

Fermi:
- Stream queue: ordered queues of grids (kernels A, B, C on stream 1; P, Q, R on stream 2; X, Y, Z on stream 3).
- A single hardware queue multiplexes all streams.
- Work distributor: tracks blocks issued from grids; up to 16 active grids.
- Blocks are dispatched to the SMs.

Kepler GK110:
- Stream queue backed by parallel hardware streams.
- Grid Management Unit: holds pending and suspended grids (thousands of pending grids), allows suspending grids, and accepts CUDA-generated work (dynamic parallelism).
- Work distributor: actively dispatches grids; up to 32 active grids.
- Blocks are dispatched to the SMX units.
The relation between
software and hardware queues

Fermi: CUDA streams multiplex into a single hardware queue.
- Up to 16 grids can run at once on the GPU hardware.
- Stream 1 (A--B--C), stream 2 (P--Q--R) and stream 3 (X--Y--Z) are serialized into one queue: A--B--C -- P--Q--R -- X--Y--Z.
- Chances for overlapping: only at stream edges.

Kepler: No inter-stream dependencies.
- Up to 32 grids can run at once on the GPU hardware.
- Each stream (A--B--C, P--Q--R, X--Y--Z) maps to its own hardware queue.
- Concurrency at full-stream level.
Without Hyper-Q: Multiprocess by temporal division

[Figure: GPU utilization (%) over time. CPU processes A to F take turns on the GPU, each one alone keeping utilization far below 100%.]

With Hyper-Q: Simultaneous multiprocess

[Figure: GPU utilization (%) over time. The same CPU processes A to F are mapped onto the GPU concurrently, utilization approaches 100% and the total execution time shrinks.]
Unified memory
CUDA memory types

Zero-Copy (pinned memory):
- CUDA call: cudaMallocHost(&A, 4);
- Allocation fixed in: main memory (DDR3).
- Local access for: CPU.
- PCI-e access for: all GPUs.
- Other features: avoids swapping to disk.
- Coherency: at all times.
- Full support in: CUDA 2.2.

Unified Virtual Addressing:
- CUDA call: cudaMalloc(&A, 4);
- Allocation fixed in: video memory (GDDR5).
- Local access for: home GPU.
- PCI-e access for: other GPUs.
- Other features: no CPU access.
- Coherency: between GPUs.
- Full support in: CUDA 1.0.

Unified Memory:
- CUDA call: cudaMallocManaged(&A, 4);
- Allocation fixed in: both.
- Local access for: CPU and home GPU.
- PCI-e access for: other GPUs.
- Other features: migration between CPU and GPU on access.
- Coherency: only at launch & synchronization.
- Full support in: CUDA 6.0.
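To make the three columns concrete, a minimal sketch of the corresponding allocations (buffer size and variable names are illustrative):

float *h_pinned, *d_only, *managed;
size_t bytes = 1 << 20;

// Zero-copy / pinned host memory: page-locked main memory, reachable by all GPUs over PCI-e.
cudaMallocHost((void **)&h_pinned, bytes);

// Device memory under Unified Virtual Addressing: lives in video memory, no CPU access.
cudaMalloc((void **)&d_only, bytes);

// Unified (managed) memory: a single pointer, migrated between CPU and GPU.
cudaMallocManaged((void **)&managed, bytes);

// ... use the buffers ...

cudaFreeHost(h_pinned);
cudaFree(d_only);
cudaFree(managed);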
Additions to the CUDA API

New call: cudaMallocManaged(pointer, size, flag)
- Drop-in replacement for cudaMalloc(pointer, size).
- The flag indicates who shares the pointer with the device:
  - cudaMemAttachHost: only the CPU.
  - cudaMemAttachGlobal: any other GPU too.
- All operations valid on device memory are also valid on managed memory.

New keyword: __managed__
- Global variable annotation that combines with __device__.
- Declares a global-scope migratable device variable.
- The symbol is accessible from both GPU and CPU code.

New call: cudaStreamAttachMemAsync()
- Manages concurrency in multi-threaded CPU applications.
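A brief sketch combining these additions (a minimal sketch assuming CUDA 6.0 or later; the kernel touch and the buffer size are made up for illustration):

__device__ __managed__ int counter = 0;       // __managed__ global, visible to CPU and GPU

__global__ void touch(float *buf)             // hypothetical kernel
{
    buf[threadIdx.x] = (float) counter;
}

int main()
{
    float *buf;
    cudaStream_t s;
    cudaStreamCreate(&s);

    // Managed allocation initially attached to the host only.
    cudaMallocManaged((void **)&buf, 256 * sizeof(float), cudaMemAttachHost);

    // Associate the allocation with stream s, so other host threads can keep
    // working on their own managed allocations while s is busy.
    cudaStreamAttachMemAsync(s, buf, 0, cudaMemAttachSingle);
    cudaStreamSynchronize(s);

    counter = 7;                              // CPU writes the managed global (GPU is idle)
    touch<<<1, 256, 0, s>>>(buf);
    cudaStreamSynchronize(s);                 // required before the CPU reads buf again

    cudaFree(buf);
    cudaStreamDestroy(s);
    return 0;
}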
Unified memory: Technical details

The maximum amount of unified memory that can be allocated is limited by the smallest of the GPU memories available.
Memory pages from unified allocations touched by the CPU must migrate back to the GPU before any kernel launch.
The CPU cannot access any unified memory while the GPU is executing; that is, a cudaDeviceSynchronize() call is required before the CPU is allowed to access unified memory.
The GPU has exclusive access to unified memory while any kernel is executing on the GPU, and this holds even if the kernel does not touch the unified memory (see the example on the next slide).
First example:
Access constraints

__device__ __managed__ int x, y = 2;   // Unified memory

__global__ void mykernel()             // GPU territory
{
    x = 10;
}

int main()                             // CPU territory
{
    mykernel <<<1,1>>> ();

    y = 20;   // ERROR: CPU access concurrent with GPU

    return 0;
}
First example:
Access constraints

__device__ __managed__ int x, y = 2;   // Unified memory

__global__ void mykernel()             // GPU territory
{
    x = 10;
}

int main()                             // CPU territory
{
    mykernel <<<1,1>>> ();
    cudaDeviceSynchronize();   // Problem fixed!
    // Now the GPU is idle, so access to “y” is OK
    y = 20;
    return 0;
}
Second example:
Sorting elements from a file

CPU code in C:

void sortfile(FILE *fp, int N)
{
    char *data;
    data = (char *) malloc(N);

    fread(data, 1, N, fp);

    qsort(data, N, 1, compare);

    use_data(data);

    free(data);
}

GPU code from CUDA 6.0 on:

void sortfile(FILE *fp, int N)
{
    char *data;
    cudaMallocManaged(&data, N);

    fread(data, 1, N, fp);

    qsort<<<...>>>(data, N, 1, compare);
    cudaDeviceSynchronize();

    use_data(data);

    cudaFree(data);
}
Third example: Cloning dynamic data
structures WITHOUT unified memory

struct dataElem {
    int   prop1;
    int   prop2;
    char *text;
};

A “deep copy” is required:
- We must copy the structure and everything that it points to. This is why C++ invented the copy constructor.
- CPU and GPU cannot share a copy of the data (coherency). This prevents memcpy-style comparisons, checksumming and other validations.

[Figure: the dataElem struct and its “Hello, world” text string exist twice, once in CPU memory and once in GPU memory: two addresses and two copies of the data.]
Cloning dynamic data structures
WITHOUT unified memory (2)

void launch(dataElem *elem) {
    dataElem *g_elem;
    char *g_text;

    int textlen = strlen(elem->text);

    // Allocate storage for struct and text
    cudaMalloc(&g_elem, sizeof(dataElem));
    cudaMalloc(&g_text, textlen);

    // Copy up each piece separately, including the new “text” pointer value
    cudaMemcpy(g_elem, elem, sizeof(dataElem), cudaMemcpyHostToDevice);
    cudaMemcpy(g_text, elem->text, textlen, cudaMemcpyHostToDevice);
    cudaMemcpy(&(g_elem->text), &g_text, sizeof(g_text), cudaMemcpyHostToDevice);

    // Finally we can launch our kernel, but
    // CPU and GPU use different copies of “elem”
    kernel<<< ... >>>(g_elem);
}

[Figure: two addresses and two copies of the data, one dataElem in CPU memory and one in GPU memory.]
Cloning dynamic data structures
WITH unified memory

void launch(dataElem *elem) {
    kernel<<< ... >>>(elem);
}

What remains the same:
- Data movement: the GPU still accesses a local copy of “text”.

What has changed:
- The programmer sees a single pointer.
- CPU and GPU both reference the same object.
- There is coherence.
- To choose between pass-by-reference and pass-by-value you need to use C++.

[Figure: a single dataElem in unified memory is referenced from both CPU memory and GPU memory.]
Fourth example: Linked lists

[Figure: a linked list of (key, value, next) nodes in CPU memory and a mirrored list in GPU memory; all cross accesses go through the PCI-express bus.]

Almost impossible to manage in the original CUDA API.
The best you can do is use pinned memory:
- Pointers are global, just like unified memory pointers.
- Performance is low: the GPU is limited by PCI-e bandwidth.
- GPU latency is very high, which is critical for linked lists because of the intrinsic pointer chasing.
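As an illustration of this pinned-memory workaround, a minimal sketch (the node layout follows the figure; cudaHostAllocMapped, a 64-bit UVA platform and a hypothetical walk kernel are assumed):

struct Node { int key; int value; Node *next; };

// Allocate a node in pinned (page-locked) host memory; with UVA the GPU
// can dereference the very same pointer, at the cost of a PCI-e transaction.
Node *alloc_node(int key, int value)
{
    Node *n;
    cudaHostAlloc((void **)&n, sizeof(Node), cudaHostAllocMapped);
    n->key = key; n->value = value; n->next = NULL;
    return n;
}

// Hypothetical kernel chasing the pointers directly from the GPU:
// every hop crosses the PCI-e bus, which is why latency dominates.
__global__ void walk(Node *head)
{
    for (Node *p = head; p != NULL; p = p->next)
        p->value += 1;
}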
Linked lists with unified memory

[Figure: a single linked list of (key, value, next) nodes in unified memory, reachable from both CPU and GPU.]

- You can pass list elements between CPU and GPU.
- There is no need to move data back and forth between CPU and GPU.
- You can insert and delete elements from the CPU or the GPU.
- But the program must still ensure there are no race conditions (data is coherent between CPU and GPU at kernel launch only).
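And the same list built on unified memory, as a minimal sketch (node layout as before; the walk kernel is again hypothetical):

struct Node { int key; int value; Node *next; };

// Each node lives in managed memory, so the same pointer is valid on CPU and GPU.
Node *push_front(Node *head, int key, int value)
{
    Node *n;
    cudaMallocManaged((void **)&n, sizeof(Node));
    n->key = key; n->value = value; n->next = head;
    return n;
}

__global__ void walk(Node *head)              // hypothetical kernel
{
    for (Node *p = head; p != NULL; p = p->next)
        p->value += 1;                        // the GPU updates the list in place
}

// Usage sketch: build on the CPU, traverse on the GPU, read back on the CPU.
//   Node *head = push_front(NULL, 1, 10);
//   walk<<<1, 1>>>(head);
//   cudaDeviceSynchronize();                 // required before the CPU touches the list again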
Unified memory: Summary

- Drop-in replacement for cudaMalloc() using cudaMallocManaged().
- cudaMemcpy() is now optional.
- Greatly simplifies code porting.
- Less host-side memory management.
- Enables shared data structures between CPU & GPU:
  - A single pointer to data means no changes to data structures.
  - Powerful for high-level languages like C++.
Unified memory: The roadmap.
Contributions on every abstraction level

High abstraction level:
- Past, consolidated (2014-15): single pointer to data; no cudaMemcpy() is required.
- Present, recently available (2016-17): prefetching mechanisms to anticipate data arrival in copies.
- Future, in coming years: system allocator unified.

Medium abstraction level:
- Past, consolidated (2014-15): coherence at launch & synchronize.
- Present, recently available (2016-17): migration hints.
- Future, in coming years: stack memory unified.

Low abstraction level:
- Past, consolidated (2014-15): shared C/C++ data structures.
- Present, recently available (2016-17): additional OS support.
- Future, in coming years: hardware-accelerated coherence.
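The prefetching mechanisms and migration hints listed for 2016-17 are exposed in CUDA 8 through calls such as cudaMemPrefetchAsync() and cudaMemAdvise(); a minimal sketch (buffer size and device id are illustrative):

float *data;
size_t bytes = 1 << 20;
int device = 0;                               // illustrative device id

cudaMallocManaged((void **)&data, bytes);

// Migration hint: keep the pages preferably on this GPU.
cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device);

// Prefetch the pages to the GPU before the kernels need them.
cudaMemPrefetchAsync(data, bytes, device);

// ... launch kernels that use data ...

// Prefetch back to the CPU before host code touches the buffer again.
cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId);
cudaDeviceSynchronize();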
