The way we did things in the pre-Kepler era:
The GPU was a slave to the CPU.
High data bandwidth for communications:
External: more than 10 GB/s (PCI-express 3).
Internal: more than 100 GB/s (GDDR5 video memory on a 384-bit bus,
which is like a six-channel CPU memory architecture).
[Figure: timeline in which the CPU runs Init, allocates GPU memory, and dispatches Operation 1, Operation 2 and Operation 3 to the GPU one by one]
The way we do things in Kepler:
GPUs launch their own kernels
The pre-Kepler GPU is a co-processor; the Kepler GPU is autonomous thanks to dynamic parallelism.
[Figure: CPU-GPU control flow without vs. with dynamic parallelism; computational power is allocated to regions of interest]
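As a minimal sketch of what dynamic parallelism looks like in code (the kernel names, the doubling operation and the launch configuration are illustrative assumptions, not taken from these slides), a parent kernel can launch a child kernel directly from the GPU:

    // Child kernel: refines a region of interest (illustrative operation)
    __global__ void child(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    // Parent kernel: launches the child without returning to the CPU.
    // Requires compute capability 3.5+ and compilation with
    //   nvcc -arch=sm_35 -rdc=true ... -lcudadevrt
    __global__ void parent(float *data, int n) {
        if (threadIdx.x == 0)
            child<<<(n + 255) / 256, 256>>>(data, n);
    }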
Hyper-Q
An example:
3 streams, each composed of 3 kernels
__global__ void kernel_A(pars) {body}   // Same for B...Z

cudaStream_t stream_1, stream_2, stream_3;
...
cudaStreamCreateWithFlags(&stream_1, ...);
cudaStreamCreateWithFlags(&stream_2, ...);
cudaStreamCreateWithFlags(&stream_3, ...);
...

[Figure: stream_1 queues kernel_A, kernel_B, kernel_C (likewise stream_2 and stream_3), to be scheduled across the SMs of Fermi or the SMXs of Kepler]
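A hedged completion of the snippet (the grid and block dimensions and the kernel arguments are placeholders): each kernel is bound to its stream through the fourth launch parameter, so the three streams can overlap on the hardware:

    kernel_A<<<dimGrid, dimBlock, 0, stream_1>>>(pars);
    kernel_B<<<dimGrid, dimBlock, 0, stream_1>>>(pars);
    kernel_C<<<dimGrid, dimBlock, 0, stream_1>>>(pars);
    // Same pattern for P, Q, R on stream_2 and X, Y, Z on stream_3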
The relation between
software and hardware queues

Fermi: Up to 16 grids can run at once on GPU hardware, but CUDA streams multiplex into a single hardware queue, so chances for overlapping arise only at stream edges.

[Figure: Stream 1 (A--B--C), Stream 2 (P--Q--R) and Stream 3 (X--Y--Z) serialized into one hardware queue as A--B--C P--Q--R X--Y--Z]

Kepler: Up to 32 grids can run at once on GPU hardware, and each stream maps to its own hardware queue.

[Figure: Stream 1 (A--B--C), Stream 2 (P--Q--R) and Stream 3 (X--Y--Z) each keep a separate hardware queue]

Concurrency at full-stream level.
Without Hyper-Q: Multiprocess by temporal division.

[Figure: % GPU utilization (0-100) over time; processes A-F execute one after another, each using only part of the GPU]

With Hyper-Q: Simultaneous multiprocess.

[Figure: % GPU utilization (0-100) over time; processes A-F run concurrently, raising utilization and finishing earlier (time saved)]

CPU processes... ...mapped on GPU.
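As an illustrative sketch (our construction, not from the slides): within a single process, Hyper-Q already lets several host threads feed the GPU concurrently through their own streams; for separate CPU processes, as in the figure, the CUDA Multi-Process Service (MPS) provides the same effect.

    #include <cstdio>
    #include <thread>
    #include <vector>

    __global__ void work(int id) {
        if (threadIdx.x == 0) printf("task %d on GPU\n", id);
    }

    // Each producer thread owns a stream, which Hyper-Q maps to its
    // own hardware queue, so the six tasks can run concurrently.
    void producer(int id) {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        work<<<1, 32, 0, s>>>(id);
        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
    }

    int main() {
        std::vector<std::thread> threads;
        for (int i = 0; i < 6; ++i)          // six producers, like processes A-F
            threads.emplace_back(producer, i);
        for (auto &t : threads) t.join();
        return 0;
    }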
Unified memory
CUDA memory types
Additions to the CUDA API
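As a quick reference (our summary of the CUDA 6.0 unified-memory additions, not the slide's original list):

    // New allocator: memory migrated on demand between CPU and GPU
    cudaError_t cudaMallocManaged(void **ptr, size_t size,
                                  unsigned int flags = cudaMemAttachGlobal);

    // New qualifier for statically declared managed variables
    __device__ __managed__ int var;

    // Associates a managed allocation with a stream
    cudaError_t cudaStreamAttachMemAsync(cudaStream_t stream, void *ptr,
                                         size_t length = 0,
                                         unsigned int flags = cudaMemAttachSingle);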
First example:
Access constraints
__device__ __managed__ int x, y = 2; // Unified memory
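A hedged reconstruction of the constraint this example illustrates (the kernel body and the values assigned are our assumptions): while a kernel that may touch managed data is in flight, the CPU must not access that data until it synchronizes.

    __device__ __managed__ int x, y = 2;   // Unified memory

    __global__ void kernel() {
        x = 10;                            // GPU writes managed data
    }

    int main() {
        kernel<<<1, 1>>>();
        // y = 20;   // ERROR on Kepler-class GPUs: CPU access while the kernel may still be running
        cudaDeviceSynchronize();
        y = 20;                            // OK once the GPU work has completed
        return 0;
    }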
Second example:
Sorting elements from a file

CPU code in C:

    void sortfile(FILE *fp, int N)
    {
        char *data;
        data = (char *) malloc(N);
        ...
        free(data);
    }

GPU code from CUDA 6.0 on:

    void sortfile(FILE *fp, int N)
    {
        char *data;
        cudaMallocManaged(&data, N);
        ...
        cudaFree(data);
    }
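A self-contained sketch of the unified-memory version (reading the file and sorting with thrust::sort on the device are our choices to make it runnable, not the slide's):

    #include <cstdio>
    #include <thrust/sort.h>
    #include <thrust/execution_policy.h>

    void sortfile(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);       // one pointer, visible to CPU and GPU

        fread(data, 1, N, fp);             // CPU fills the managed buffer
        thrust::sort(thrust::device, data, data + N);  // GPU sorts it in place
        cudaDeviceSynchronize();           // CPU must wait before touching data again

        fwrite(data, 1, N, stdout);        // CPU reads the sorted result
        cudaFree(data);
    }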
Third example: Cloning dynamic data
structures WITHOUT unified memory

    struct dataElem {
        int prop1;
        int prop2;
        char *text;
    };

    void launch(dataElem *elem) {
        dataElem *g_elem;
        char *g_text;
        int textlen = strlen(elem->text);

        // Allocate GPU copies of the struct and of the string
        cudaMalloc(&g_elem, sizeof(dataElem));
        cudaMalloc(&g_text, textlen);

        // Copy each piece separately, patching the text pointer
        cudaMemcpy(g_elem, elem, sizeof(dataElem), cudaMemcpyHostToDevice);
        cudaMemcpy(g_text, elem->text, textlen, cudaMemcpyHostToDevice);
        cudaMemcpy(&(g_elem->text), &g_text, sizeof(g_text), cudaMemcpyHostToDevice);

        // Finally we can launch our kernel, but
        // CPU and GPU use different copies of "elem"
        kernel<<< ... >>>(g_elem);
    }

[Figure: "elem" in CPU memory and its clone in GPU memory, each holding prop1, prop2 and a *text pointer to "Hello, world"]
Cloning dynamic data structures
WITH unified memory

    void launch(dataElem *elem) {
        kernel<<< ... >>>(elem);
    }

[Figure: a single "elem" with prop1, prop2 and *text, visible from both CPU memory and GPU memory]

Can pass list elements between CPU & GPU.
No need to move data back and forth between CPU and GPU.
Can insert and delete elements from CPU or GPU.
But the program must still ensure no race conditions (data is coherent
between CPU & GPU at kernel launch only).
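For launch(elem) to work like this, the element itself has to live in managed memory; a hedged sketch of the allocation side (the helper name dataElem_create is ours):

    #include <cstring>

    dataElem *dataElem_create(const char *text) {
        dataElem *elem;
        cudaMallocManaged(&elem, sizeof(dataElem));        // struct visible to CPU and GPU
        cudaMallocManaged(&elem->text, strlen(text) + 1);  // so is the string it points to
        strcpy(elem->text, text);
        elem->prop1 = 0;
        elem->prop2 = 0;
        return elem;
    }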
Unified memory: Summary
Unified memory: The roadmap.
Contributions on every abstraction level:
Coherence @ launch & synchronize.
Medium term: migration hints; unified stack memory.