DATA INPUT PIPELINE
Simple pipeline
[Pipeline diagram: document sources (academic, professional, culture) → Batch Queue → Transformation (e.g. flip, colour) → Noise]
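As a concrete illustration (not part of the original slides), a minimal PyTorch/torchvision input pipeline covering decode, augment and batch might look like the sketch below; the dataset path, image size and batch size are placeholders.

```python
# Minimal sketch of a "simple pipeline": decode -> augment (flip, colour) -> batch.
# The dataset path and hyperparameters are illustrative placeholders.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),        # crop/resize after decode (PIL decodes inside ImageFolder)
    transforms.RandomHorizontalFlip(),        # geometric augmentation (flip)
    transforms.ColorJitter(0.4, 0.4, 0.4),    # photometric augmentation (colour)
    transforms.ToTensor(),
])

dataset = datasets.ImageFolder("/data/imagenet/train", transform=augment)  # hypothetical path
loader = DataLoader(dataset, batch_size=256, shuffle=True, num_workers=8)

images, labels = next(iter(loader))
print(images.shape, labels.shape)  # e.g. torch.Size([256, 3, 224, 224])
```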
Even when implemented correctly, you will need a Broadwell-class CPU just to saturate an 8-GPU Pascal node when training ResNet-50 on ImageNet
OPTIMIZING THE I/O PIPELINE
Overlapping data communication
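A minimal sketch of one way to overlap data movement with compute (here in PyTorch; the toy model and dataset are stand-ins, not part of the slides): pinned host memory plus non-blocking host-to-device copies let the next batch transfer while the GPU is still busy.

```python
# Sketch: overlap host-to-device copies with GPU compute (PyTorch, illustrative).
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins; in practice this would be a real dataset and model.
dataset = TensorDataset(torch.randn(2048, 3, 64, 64), torch.randint(0, 1000, (2048,)))
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1000)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# pin_memory=True keeps batches in page-locked host memory, which is required
# for asynchronous copies; num_workers decodes/augments in parallel on the CPU.
loader = DataLoader(dataset, batch_size=256, num_workers=8, pin_memory=True)

for images, labels in loader:
    # non_blocking=True issues the copy asynchronously, so it can overlap with
    # GPU work still in flight from the previous iteration.
    images = images.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    loss = loss_fn(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```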
INCREASING I/O CHALLENGES
GPU speed continues to increase (not matched by the CPU)
COMPUTE TO DATA RATIO
More compute equals more time
• The more compute you have to execute on a unit of data, the more time you have to deliver the next sample (a rough estimate is sketched below)
• Model choices are frequently requirements-driven (e.g. a self-driving car might not be able to use a large model because it has a strict latency, compute and power budget)
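A back-of-the-envelope sketch of this ratio (every constant below is an illustrative assumption, not a figure from the slides): divide the per-sample compute by the GPU's throughput to get the time budget per sample, then convert that into the input bandwidth the pipeline must sustain.

```python
# Rough compute-to-data estimate; all numbers are assumptions for illustration.
flops_per_image = 8e9             # ~8 GFLOPs per ResNet-50 training step per image (assumed)
gpu_flops = 15e12                 # ~15 TFLOP/s sustained on one modern GPU (assumed)
bytes_per_image = 3 * 224 * 224   # one decoded 8-bit 224x224 RGB image (~147 KB)

time_per_image = flops_per_image / gpu_flops        # seconds of compute per sample
images_per_second = 1 / time_per_image              # how fast the GPU consumes samples
required_bw = images_per_second * bytes_per_image   # bytes/s the input pipeline must deliver

print(f"time budget per sample: {time_per_image * 1e3:.2f} ms")
print(f"GPU demand: {images_per_second:.0f} images/s")
print(f"required input bandwidth: {required_bw / 1e6:.0f} MB/s")
```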
PROFILING YOUR MODEL
CNTK training ResNet-50 on DGX-1V
DOES LATENCY MATTER?
DOES THROUGHPUT MATTER?
ROOFLINE ANALYSIS
Understand your constraints
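The roofline model bounds attainable performance by the lesser of peak compute and arithmetic intensity times memory bandwidth. A minimal sketch, with assumed (roughly V100-class) hardware numbers:

```python
# Roofline model: attainable FLOP/s is capped either by the compute peak or by memory traffic.
def roofline(peak_flops, mem_bandwidth, arithmetic_intensity):
    """arithmetic_intensity is FLOPs performed per byte moved to/from memory."""
    return min(peak_flops, arithmetic_intensity * mem_bandwidth)

# Illustrative (assumed) numbers.
peak = 15e12   # FLOP/s
bw = 900e9     # bytes/s of HBM2 bandwidth
for ai in (1, 10, 100):
    print(f"AI={ai:>3} FLOP/byte -> attainable {roofline(peak, bw, ai) / 1e12:.1f} TFLOP/s")
```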
DATA INPUT PIPELINE
Overview
[Pipeline diagram: Download → Decode → Augment → Batch Queue → Transformation (e.g. flip, colour) → Noise]
AUGMENTATION
Consuming computational resources
• Optimize I/O
• Use the deep learning framework's own data storage and loading mechanisms (a sketch follows this list)
• Use multithreaded loading and preprocessing
• Be cautious with third-party libraries (for example, using OpenCV may not let you fully optimise performance)
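As one example of a framework-native storage and loading path (here TensorFlow's TFRecord/tf.data; the file pattern and the "image"/"label" feature names are assumptions for illustration), decoding and augmentation can be parallelised and overlapped with training:

```python
# Sketch of a framework-native input path: TFRecord files read via tf.data.
# File pattern and feature names ("image", "label") are illustrative assumptions.
import tensorflow as tf

def parse_example(serialized):
    features = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    example = tf.io.parse_single_example(serialized, features)
    image = tf.io.decode_jpeg(example["image"], channels=3)    # decode
    image = tf.image.random_flip_left_right(image)             # augment (flip)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, example["label"]

dataset = (
    tf.data.TFRecordDataset(tf.io.gfile.glob("/data/train-*.tfrecord"))
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)   # multithreaded decode/augment
    .shuffle(10_000)
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)                                # overlap input with training
)
```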
OPTIMIZE AUGMENTATION
Take action
• Optimize augmentation
• Understand the performance of your augmentation pipeline (a simple measurement loop is sketched below)
- Impact of inefficiencies in:
- File format and caching
- Decode logic
- Augmentation logic
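One simple way to see whether the input pipeline can keep up is to iterate it with no model attached and measure delivered images per second. A hedged sketch; the loader object is assumed to exist (e.g. from the earlier examples):

```python
# Measure input-pipeline throughput in isolation: no model, no GPU work.
# `loader` is any iterable of (images, labels) batches, e.g. a DataLoader.
import time

def measure_pipeline(loader, max_batches=100):
    n_images = 0
    start = time.perf_counter()
    for i, (images, _) in enumerate(loader):
        n_images += images.shape[0]
        if i + 1 >= max_batches:
            break
    elapsed = time.perf_counter() - start
    print(f"{n_images / elapsed:.0f} images/s from the input pipeline alone")
    return n_images / elapsed

# Compare this number with the GPU's training throughput; if it is lower, the
# pipeline (file format, caching, decode or augmentation logic) is the bottleneck.
```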
IN-NODE COMMUNICATION
DL DATA PARALLELISM – PCIE BASED
[Topology diagram: two CPUs connected by a QPI link, each with PCIe switches attaching GPUs 0–3 and 4–7]
• Data delivery flows from the CPUs to the GPUs over the same PCIe switches
DL DATA PARALLELISM – NVLINK
[Topology diagram: the same two-CPU/PCIe-switch tree, with the eight GPUs additionally interconnected by NVLink]
• PCIe is reserved for data delivery (and, for multi-node training, for external communication)
PERFORMANCE
Intra-node performance
[Bar chart: AllReduce bandwidth (OMB, size = 128 MB, in GB/s), y-axis 0–60, for configurations labelled 4 QPI, 4 CPU, 4 PCI and DGX-1]
NVSWITCH: ALL-TO-ALL CONNECTIVITY
NVSwitch Fabric
• B – available bandwidth
EVALUATION
Calculating the bandwidth requirements
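To make the calculation concrete, here is a hedged back-of-the-envelope sketch (all constants are assumptions, not the figures behind the slide's chart): with a ring allreduce, each GPU sends and receives roughly 2(N−1)/N times the gradient size per step, and the required bandwidth B follows from dividing that volume by the per-step compute time it must hide behind.

```python
# Back-of-the-envelope allreduce bandwidth requirement; every number is an assumption.
n_gpus = 8
params = 25.6e6          # ResNet-50 has ~25.6M parameters
bytes_per_param = 4      # FP32 gradients
step_time = 0.3          # assumed seconds of compute per step on one GPU

grad_bytes = params * bytes_per_param
# A ring allreduce moves about 2 * (N - 1) / N * grad_bytes through each GPU per step.
ring_bytes = 2 * (n_gpus - 1) / n_gpus * grad_bytes
required_bw = ring_bytes / step_time   # bandwidth needed to hide communication behind compute

print(f"gradient size: {grad_bytes / 1e6:.0f} MB")
print(f"per-GPU ring traffic per step: {ring_bytes / 1e6:.0f} MB")
print(f"required bandwidth B: {required_bw / 1e9:.2f} GB/s")
```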
IMPLICATIONS
Requirements highly dependent on the workload
10X PERFORMANCE GAIN IN LESS THAN A YEAR
[Bar chart: time to solution in days (x-axis 0–20); DGX-2 at 1.5 days, annotated "10 Times Faster"]
• Larger IB director switch (216 ports) with capacity for more pods via unused ports
MULTI-NODE DGX-1 LARGE CLUSTER
Up to 144 DGX-1 nodes (4 "pods")
• Implements 4 DGX-1 pods
• Distributed across 24 racks
• Full bi-section bandwidth within a pod, 2:1 between pods
• Login and management servers
• Storage and networking
[Diagram labels: Main Ethernet, Management Ethernet]
COMMUNICATION SOFTWARE
DESIGN
What is NCCL?
Easy to integrate into any DL framework, as well as traditional HPC apps using MPI.
Runs on the GPU using asynchronous CUDA kernels for faster access to GPU memory, parallel reductions and NVLink usage.
[Stack diagram: NCCL on top of CUDA on top of NVIDIA GPUs]
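A minimal sketch of NCCL used through a framework (here PyTorch's torch.distributed with the nccl backend; the rank/world-size environment variables are assumed to be set by the launcher):

```python
# Sketch: an all-reduce over NCCL via torch.distributed (one process per GPU).
# Assumes the launcher (e.g. torchrun) sets RANK, WORLD_SIZE and MASTER_ADDR/PORT.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")          # NCCL handles the GPU-to-GPU transport
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# Each rank contributes a tensor; after all_reduce every rank holds the sum.
grad = torch.full((1024,), float(dist.get_rank()), device="cuda")
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()}: first element = {grad[0].item()}")

dist.destroy_process_group()
```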
DESIGN
Rings
NCCL uses rings to move data across all GPUs and perform reductions.
[Ring diagram labels: sendbuff, recvbuff, FIFO, Reduction]
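To make the ring idea concrete, below is a toy NumPy simulation of a ring allreduce (reduce-scatter followed by all-gather). This is an illustration of the general algorithm, not NCCL's actual implementation, and the chunk scheduling shown is one common convention.

```python
# Toy ring allreduce (reduce-scatter + all-gather) over N simulated ranks.
# Illustrative only; NCCL's real kernels pipeline this over NVLink/PCIe/IB.
import numpy as np

def ring_allreduce(chunks):
    """chunks[r][c] is chunk c held by rank r; every rank's data is split into N chunks."""
    n = len(chunks)

    # Reduce-scatter: after N-1 steps, rank r holds the fully reduced chunk (r+1) % N.
    for step in range(n - 1):
        outgoing = [chunks[r][(r - step) % n].copy() for r in range(n)]
        for r in range(n):
            dst, c = (r + 1) % n, (r - step) % n
            chunks[dst][c] += outgoing[r]        # neighbour accumulates the received chunk

    # All-gather: circulate the completed chunks once more around the ring.
    for step in range(n - 1):
        outgoing = [chunks[r][(r + 1 - step) % n].copy() for r in range(n)]
        for r in range(n):
            dst, c = (r + 1) % n, (r + 1 - step) % n
            chunks[dst][c] = outgoing[r]         # neighbour overwrites with the reduced chunk
    return chunks

# Check against a direct sum across ranks.
n, chunk_len = 4, 3
data = [np.arange(n * chunk_len, dtype=float) + r for r in range(n)]
chunks = [list(np.split(d.copy(), n)) for d in data]
reduced = ring_allreduce(chunks)
expected = np.split(sum(data), n)
assert all(np.allclose(reduced[r][c], expected[c]) for r in range(n) for c in range(n))
print("every rank holds the full sum after 2*(N-1) ring steps")
```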
[Bar chart: allreduce bandwidth (y-axis 0–40) for MPI, Baidu Allreduce and NCCL on two configurations: 2 nodes x 4 GPUs (IB EDR, PCIe switch) and 4 nodes x 8 GPUs (DGX-1: 4x IB EDR, 4x NVLink)]
PERFORMANCE
Deep Learning - CNTK
[Chart: CNTK ResNet-50 scaling in images/s versus number of GPUs (0–32); plotted values include 217, 1645, 1744 and 6569 images/s]
• In the next lab we will see how to use Horovod to hide the explicit use of NCCL (a minimal sketch follows)
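A minimal Horovod sketch (PyTorch flavour; the toy model and dataset are stand-ins, not part of the slides) showing how the NCCL-backed allreduce is hidden behind a wrapped optimizer:

```python
# Sketch: Horovod hides the NCCL allreduce behind DistributedOptimizer.
# Launch with horovodrun/mpirun, one process per GPU.
import torch
import horovod.torch as hvd
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

hvd.init()
torch.cuda.set_device(hvd.local_rank())

# Toy stand-ins for a real model and dataset.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1000)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
dataset = TensorDataset(torch.randn(2048, 3, 64, 64), torch.randint(0, 1000, (2048,)))
loader = DataLoader(dataset, batch_size=256, num_workers=4, pin_memory=True)

# DistributedOptimizer averages gradients across workers (NCCL underneath) inside step().
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)  # start all workers from identical weights

for images, labels in loader:
    images, labels = images.cuda(non_blocking=True), labels.cuda(non_blocking=True)
    optimizer.zero_grad()
    loss_fn(model(images), labels).backward()
    optimizer.step()
```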
REFERENCE ARCHITECTURE
BALANCED HARDWARE
DGX-1 as a reference point for solution design
MULTI-NODE SCALING WHITEPAPER
Share, collaborate, and test applications
END-TO-END PRODUCT FAMILY
[Product family table spanning TRAINING and INFERENCE across fully integrated DL supercomputer, data center, automotive and embedded segments: DGX Family (DGX Station, DGX-1, DGX-2, Cloud Service Provider), Tesla P100/V100, TITAN V, Tesla P4, Tesla V100]
MANAGING RESOURCES
JOB SCHEDULING
The role of a job scheduler
Reuther, Albert, et al. "Scalable system scheduling for HPC and big data." Journal of Parallel and Distributed Computing 111 (2018): 76-92.
JOB SCHEDULING
Feature comparison (this is not an exhaustive list)
Reuther, Albert, et al. "Scalable system scheduling for HPC and big data." Journal of Parallel and Distributed Computing 111 (2018): 76-92.
NGC VISION
AI platform of choice
MORE THAN JOB SCHEDULING
Wider set of Machine Learning activities (Uber’s Michelangelo)
https://eng.uber.com/michelangelo/
DL IS AN HPC WORKLOAD
HPC expertise is important for success
It makes sense to build an AI team and a separate systems/HPC team and have the two teams sit next to each other.
That is because solving some of the problems discussed in this lecture requires very specialised systems/HPC knowledge, and it is incredibly difficult for any single person to acquire both the AI and the systems/HPC expertise.
For a detailed discussion see Andrew Ng's "Nuts and Bolts of Applying Deep Learning": https://www.youtube.com/watch?v=F1ka6a13S9I&t=120s
OTHER
MULTI-NODE DGX-1
“A-HA” MOMENTS IN DL CLUSTER DESIGN
Additional design insights to get you started
Overall Cluster:
• HPC similar to DL
• HPC expertise can help in design
• Even with HPC, the similarities are limited
Rack Design:
• DL drives close to operational limits
• Assume less headroom
• Proper airflow is crucial to cluster performance
Networking:
• Like HPC, InfiniBand is preferred
• Requires high bandwidth, low latency
• Maximize per-node IB connections
Storage:
• DGX-1 read cache is critical
• Datasets range from 10k's to millions of objects
• Terabyte levels of storage
• Large variance
Facilities:
• GPU data center operates at near-max power
• Assume higher watts per rack
• Dramatically higher FLOPS/watt = floor space saved
Software:
• Scale requires "cluster-aware" software
• NCCL2 = GPU/multi-node acceleration
• Automatic topology detection
• DL framework optimizations
TALK TO US
www.nvidia.com/dli