How to Build a GPU-Accelerated Research Cluster
By Pradeep Gupta, posted Apr 29 2013 at 11:48PM
NVIDIA Developer Zone > CUDA Zone
https://developer.nvidia.com/content/how-build-gpu-acce...
Some of the fastest computers in the world are cluster computers. A cluster is a computer system comprising two or more computers (nodes) connected by a high-speed network. Cluster computers can achieve higher availability, reliability, and scalability than is possible with an individual computer. With the increasing adoption of GPUs in high performance computing (HPC), NVIDIA GPUs are becoming part of some of the world's most powerful supercomputers and clusters. The most recent Top500 list of the world's fastest supercomputers included nearly 50 supercomputers powered by NVIDIA GPUs, and the current world's fastest supercomputer, Oak Ridge National Lab's TITAN, utilizes more than 18,000 NVIDIA Kepler GPUs.

In this post I will take you step by step through the process of designing, deploying, and managing a small research prototype GPU cluster for HPC. I will describe all the components needed for a GPU cluster, as well as the complete cluster management software stack. The goal is to build a research prototype GPU cluster using all open source and free software and with minimal hardware cost. I gave a talk on this topic at GTC 2013 (session S3516, "Building Your Own GPU Research Cluster Using an Open Source Software Stack"). The slides and a recording are available at that link, so please check it out!

There are multiple motivating reasons for building a GPU-based research cluster:

- Get a feel for production systems and performance estimates.
- Port your applications to GPUs and distributed computing (using CUDA-aware MPI).
- Tune GPU and CPU load balancing for your application.
- Use the cluster as a development platform.
- Early experience means increased readiness.
- The investment is relatively small for a research prototype cluster.

Figure 1 shows the steps to build a small GPU cluster. Let's look at the process in more detail.
1 of 7
10/04/2013 10:37 PM
Figure 1: Seven steps to build and test a small research GPU cluster.
Follow the steps in Chapter 3 of the Rocks user guide and do a CD-based installation. Install the NVIDIA drivers and CUDA Toolkit on the head node. (CUDA 5 provides a unified package that contains the NVIDIA driver, toolkit, and CUDA samples.) Install network interconnect drivers (e.g. InfiniBand) on the head node; these drivers are available from your interconnect manufacturer.

Nagios Core is an open source system and network monitoring application. It watches hosts and services that you specify, alerting you when things go wrong and again when they recover. To install it, follow the instructions given in the Nagios installation guide. The NRPE Nagios add-on allows you to execute Nagios plugins on remote Linux machines. This lets you monitor local resources like CPU load and memory usage (which are not usually exposed to external machines) on remote machines using Nagios. Install NRPE following its install guide.
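To give a concrete feel for what NRPE-based monitoring looks like once installed, here is a minimal sketch: a compute node's nrpe.cfg exposes plugin commands, and a service definition on the head node polls them through check_nrpe. All paths, hostnames, and thresholds below are illustrative, not prescribed by this post.

```
# /usr/local/nagios/etc/nrpe.cfg on a compute node (illustrative values)
allowed_hosts=10.1.1.1                  # head node's private address
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_swap]=/usr/local/nagios/libexec/check_swap -w 20% -c 10%

# Service definition on the head node, polling the node via check_nrpe
define service {
    use                 generic-service
    host_name           compute-0-0
    service_description CPU Load
    check_command       check_nrpe!check_load
}
```

After restarting the NRPE agent and Nagios, the head node's web interface should show the remote node's load and swap checks alongside its standard host checks.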
With nvidia-smi you can set each GPU's compute mode, which controls how CUDA contexts may be created on the device:

- Default compute mode: multiple host threads and processes can create contexts on the device at the same time.
- Exclusive-process compute mode: only one CUDA context may be created on the device across all processes in the system, and that context may be current to as many threads as desired within the process that created the context.
- Exclusive-process-and-thread compute mode: only one CUDA context may be created on the device across all processes in the system, and that context may only be current to one thread at a time.
- Prohibited compute mode: no CUDA context can be created on the device.

NVIDIA-SMI also allows you to turn ECC (error-correcting code memory) mode on and off. The default is on, but applications that do not need ECC can get higher memory bandwidth by disabling it.
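The corresponding nvidia-smi invocations look like the following sketch (the device index 0 is illustrative, and these commands must run as root; an ECC mode change only takes effect after the next reboot):

```shell
# Set the compute mode on device 0
# (0 = DEFAULT, 1 = EXCLUSIVE_THREAD, 2 = PROHIBITED, 3 = EXCLUSIVE_PROCESS)
nvidia-smi -i 0 -c 3

# Disable ECC on device 0 for extra memory bandwidth (pending until reboot)
nvidia-smi -i 0 -e 0

# Inspect the current settings
nvidia-smi -q -i 0
```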
NVML API
The NVML API is a C-based API that provides programmatic state monitoring and management of NVIDIA GPU devices. The NVML dynamic run-time library ships with the NVIDIA display driver, and the NVML SDK provides headers, stub libraries, and sample applications. NVML can be used from Python or Perl (bindings are available) as well as from C/C++ or Fortran. Ganglia is an open-source, scalable, distributed monitoring system for clusters and grids, with very low per-node overhead and high concurrency. An NVML-based Python module for Ganglia's gmond daemon enables monitoring of NVIDIA GPUs in the Ganglia interface.
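Using NVML (directly or through its bindings) requires an NVIDIA driver on the machine, so as a self-contained sketch of the kind of per-GPU summary such a monitoring module produces, the snippet below parses the CSV output of an `nvidia-smi --query-gpu` invocation, which is itself a thin wrapper over the same NVML calls. The sample output string is illustrative, not captured from real hardware.

```python
# Sketch: turn nvidia-smi CSV query output into per-GPU status records.
# In a real deployment the text would come from running, for example:
#   nvidia-smi --query-gpu=name,memory.used,memory.total,temperature.gpu --format=csv
import csv
import io

# Illustrative sample output (not from real hardware)
SAMPLE = """\
name, memory.used [MiB], memory.total [MiB], temperature.gpu
Tesla K20c, 512, 4799, 41
Tesla K20c, 1024, 4799, 45
"""

def parse_gpu_status(text):
    """Return a list of dicts, one per GPU line in the CSV output."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    gpus = []
    for row in reader:
        gpus.append({
            "name": row["name"],
            "mem_used_mib": int(row["memory.used [MiB]"]),
            "mem_total_mib": int(row["memory.total [MiB]"]),
            "temp_c": int(row["temperature.gpu"]),
        })
    return gpus

if __name__ == "__main__":
    for gpu in parse_gpu_status(SAMPLE):
        pct = 100.0 * gpu["mem_used_mib"] / gpu["mem_total_mib"]
        print(f'{gpu["name"]}: {pct:.1f}% memory used, {gpu["temp_c"]} C')
```

A gmond or Nagios plugin would run the query periodically and emit these fields as metrics; the NVML Python bindings expose the same information programmatically without shelling out.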
NVIDIA-HEALTHMON
This utility provides quick health checking of GPUs in cluster nodes. The tool detects issues and suggests remedies for software and system configuration problems, but it is not a comprehensive hardware diagnostic tool. Features include:

- basic CUDA and NVML sanity checks
- diagnosis of GPU failures
- detection of conflicting drivers
- detection of poorly seated GPUs
- detection of disconnected power cables
- ECC error detection and reporting
- a bandwidth test
- InfoROM validation
2. bandwidthTest: This is another of the CUDA samples included with the Toolkit. It measures the cudaMemcpy bandwidth of the GPU across PCI-e as well as internally. You should measure device-to-device copy bandwidth, host-to-device copy bandwidth for pageable and page-locked memory, and device-to-host copy bandwidth for pageable and page-locked memory.

To benchmark network performance, you should run the bandwidth and latency tests for your installed MPI distribution. MPI installations ship with standard benchmarks, such as /tests/osu_benchmarks-3.1.1. You should consider using an open source CUDA-aware MPI implementation like MVAPICH2, as described in the earlier Parallel Forall posts An Introduction to CUDA-Aware MPI and Benchmarking CUDA-Aware MPI.

To benchmark the entire cluster, you should run the LINPACK numerical linear algebra application. The Top500 list uses the HPL (High-Performance LINPACK) benchmark to rank the fastest supercomputers on Earth. A CUDA-enabled version of HPL optimized for GPUs is available from NVIDIA on request, and a Fermi-optimized version is available to all NVIDIA registered developers.

In this post I have provided an overview of the basic steps to build a GPU-accelerated research prototype cluster. For more details on GPU-based clusters and some of the best practices for production clusters, please refer to Dale Southard's GTC 2013 talk S3249, Introduction to Deploying, Managing, and Using GPU Clusters.
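As a footnote to the LINPACK discussion above: a quick back-of-the-envelope calculation relates a measured HPL Rmax to the cluster's theoretical peak (Rpeak), which is how you judge whether a run is reasonably tuned. The node count and per-device numbers below are illustrative; 1310 double-precision GFLOP/s is the published peak of a Tesla K20X, while the CPU figure stands in for a hypothetical dual-socket node.

```python
# Back-of-the-envelope Rpeak and HPL efficiency for a small GPU cluster.
# All hardware numbers in the example are illustrative.

def rpeak_gflops(nodes, gpus_per_node, gpu_dp_gflops, cpu_dp_gflops_per_node):
    """Theoretical double-precision peak of the whole cluster, in GFLOP/s."""
    return nodes * (gpus_per_node * gpu_dp_gflops + cpu_dp_gflops_per_node)

def hpl_efficiency(rmax_gflops, rpeak):
    """Fraction of theoretical peak achieved by a measured HPL Rmax."""
    return rmax_gflops / rpeak

if __name__ == "__main__":
    # 4 nodes, 2 K20X GPUs each (~1310 DP GFLOP/s per GPU),
    # plus an assumed ~330 DP GFLOP/s of CPU per node.
    peak = rpeak_gflops(nodes=4, gpus_per_node=2,
                        gpu_dp_gflops=1310.0, cpu_dp_gflops_per_node=330.0)
    print(f"Rpeak = {peak / 1000:.2f} TFLOP/s")
    # A hypothetical measured Rmax of 6.0 TFLOP/s would then give:
    print(f"HPL efficiency = {hpl_efficiency(6000.0, peak):.1%}")
```

Well-tuned GPU-accelerated HPL runs typically land well below 100% of Rpeak, so a low efficiency is not by itself a fault; sudden drops between otherwise identical nodes are what this sanity check is for.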
About the author: Pradeep Gupta is a Developer Technology Engineer at NVIDIA, where he supports developers with HPC and CUDA application development and optimization, and works to enable the GPU computing ecosystem in various universities and research labs across India. Before joining NVIDIA, Pradeep worked on various technologies including the Cell architecture and programming, MPI, OpenMP, and green data center technologies. Pradeep received a master's degree in research from the Indian Institute of Science (IISc), Bangalore. His research focused on developing compute-efficient algorithms for image denoising and inpainting using transform domains.
Parallel Forall is the NVIDIA Parallel Programming blog. If you enjoyed this post, subscribe to the Parallel Forall RSS feed! You may contact us via the contact form.
Copyright 2013 NVIDIA Corporation