John E. Stone
Beckman Institute, University of Illinois at Urbana-Champaign
David J. Hardy
Beckman Institute, University of Illinois at Urbana-Champaign
Barry Isralewitz
Beckman Institute, University of Illinois at Urbana-Champaign
Klaus Schulten
Department of Physics, University of Illinois at Urbana-Champaign
16.1 Introduction
Over the past decade, graphics processing units (GPUs) have become fully
programmable parallel computing devices capable of accelerating a wide va-
riety of data-parallel algorithms used in computational science and engineer-
ing. The tremendous computational capabilities of GPUs can be employed
to accelerate molecular modeling applications, enabling molecular dynamics
simulations and their analyses to be run much faster than before and allowing
the use of scientific techniques that are impractical on conventional hardware.

One example arises in simulation preparation, where the structure is solvated with
water, ions, and other small molecules. This step involves checking that all components
are arranged consistently with the known properties of the system. Electrostatics
calculations need to be performed to place ions in the energetically most favorable
positions near charged parts of a system. System setup may require multiple rounds of
iteration. GPU-accelerated calculation now completes this step in seconds or minutes
rather than hours or days, while requiring only a small amount of compute hardware
for on-demand calculations. Methods that can run on available GPU hardware are more
likely to be run on demand, and so provide shorter turnaround times than batch jobs.
The authors have used GPUs to accelerate ion placement in the preparation of several
large simulations [5, 24].

GPUs are particularly effective for algorithms
involving significant amounts of data parallelism [11, 13, 16]. Many algorithms
involved in molecular modeling and computational biology applications can
be adapted to GPU acceleration, achieving significant performance increases
relative to contemporary multicore CPUs [6, 19, 22, 23]. Below we describe de-
tails of GPU hardware and software organization that impact the design of
molecular modeling algorithms.
GPUs provide a large global memory, which can be read and written by any thread. Special “texture” fil-
tering hardware provides cached, read-only access to GPU global memory,
improving performance for random memory access patterns and for multi-
dimensional interpolation and filtering. Recent devices, such as the NVIDIA
“Fermi” GPU architecture, have added L1 and L2 caches to improve per-
formance for read-write accesses to global memory, particularly for atomic
update operations. GPUs also include thread-private registers, as well as two
other types of small, low-latency on-chip memory systems, one for inter-thread
communication and reduction operations (called “shared memory” in CUDA
or “local memory” in OpenCL) and another for read-only data (called “con-
stant memory”). GPUs are designed such that each group of SIMD processing
units performs best when given data layouts and traversal patterns
that are efficient for the underlying hardware.

FIGURE 16.1: A 1-D, 2-D, or 3-D work group composed of individual work
items, with the two-dimensional index space (0,0), (0,1), …, (1,0), (1,1), …
padded out to a multiple of the memory coalescing size.

GPU
global memory systems are attached by wide data paths that typically re-
quire memory accesses to be organized so they are at least as wide as the
memory channel (thereby achieving peak transfer efficiency) and are aligned
to particular address boundaries (thereby reducing the number of transfers
needed to service a request). The requirements for so-called coalesced mem-
ory accesses lead to data structures organized in fixed sized blocks that are a
multiple of the minimum required coalesced memory transfer size. As a result
of the need to maintain memory address alignment, it is often beneficial to pad
the edges of data structures and parallel decompositions so they are an even
multiple of the memory coalescing size, even if that means generating threads
and computing work items that are ultimately discarded. The illustration in
Figure 16.1 shows an example of how a two-dimensional grid of work items is
padded to be a multiple of the coalescing size.
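As an illustrative sketch of this padding strategy (the kernel, its arguments, and the 16x16 work-group size below are assumptions, not code from the chapter), the launch grid is rounded up to a whole number of work groups, and out-of-range work items are discarded by a bounds test:

// Sketch: a 2-D index space padded to a multiple of the work-group size so
// that each row begins on a coalescing boundary. The output array is assumed
// to be allocated with the padded row stride.
__global__ void my_kernel(float *out, int nx, int ny, int padded_nx) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x >= nx || y >= ny) return;  // discard work items in the padding
  out[y * padded_nx + x] = 0.0f;   // rows indexed with the padded stride
}

void launch_padded(float *d_out, int nx, int ny) {
  dim3 block(16, 16);              // example work-group dimensions
  dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
  my_kernel<<<grid, block>>>(d_out, nx, ny, grid.x * block.x);
}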
A first example is the direct Coulomb summation, which computes the electrostatic
potential at each lattice point \vec{r},

V(\vec{r}) = C \sum_{i=1}^{N} \frac{q_i}{|\vec{r} - \vec{r}_i|},

with the sum taken over the atoms modeled as point charges, where an atom
i having partial charge q_i is located at position \vec{r}_i, and C is a constant. For a
system of N atoms and a lattice of M points, the computational complexity
of direct summation is O(MN).
In a GPU implementation, lattice points are assigned to work items, and
the global pool of work items or threads is grouped together into blocks that
execute on the same SIMD execution unit, with access to the same on-chip
shared memory and constant memory.
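A minimal direct Coulomb summation kernel might be sketched as follows, assuming atoms are stored as float4 (x, y, z, q) values in constant memory with the Coulomb constant premultiplied into the charges; this illustrates the mapping of lattice points to work items, and is not the authors' production kernel:

#define MAXATOMS 4000            // 4000 x 16 bytes fits in 64KB constant memory

// Atom x, y, z coordinates and charge, held in cached constant memory.
__constant__ float4 atominfo[MAXATOMS];

// Each work item computes the potential at one lattice point of a 2-D slice.
__global__ void dcs_kernel(float *energygrid, int padded_nx, float gridspacing,
                           float slice_z, int numatoms) {
  int xi = blockIdx.x * blockDim.x + threadIdx.x;
  int yi = blockIdx.y * blockDim.y + threadIdx.y;
  float x = gridspacing * xi;    // lattice coordinates computed on the fly
  float y = gridspacing * yi;
  float energy = 0.0f;
  for (int i = 0; i < numatoms; i++) {
    float dx = x - atominfo[i].x;
    float dy = y - atominfo[i].y;
    float dz = slice_z - atominfo[i].z;
    // Coulomb constant assumed premultiplied into the charge (.w component)
    energy += atominfo[i].w * rsqrtf(dx*dx + dy*dy + dz*dz);
  }
  energygrid[yi * padded_nx + xi] += energy;
}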
On-the-fly calculation of such quantities as lattice coordinates makes better
use of the GPU resources, since the registers and fast on-chip memory spaces
are limited while arithmetic is comparatively inexpensive. Whenever neigh-
boring lattice points share a significant amount of input data, multiple lattice
points can be computed by a single work item, using an optimization technique
known as unroll-and-jam [3]. This optimization allows common components
of the distances between an atom and a group of neighboring lattice points to
be reused, increasing the ratio of arithmetic operations to memory operations
at the cost of a small increase in register use [6, 20, 22]. Manual unrolling of
these lattice point evaluation loops can provide an additional optimization,
increasing arithmetic intensity while reducing loop branching, indexing, and
control overhead.
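A sketch of the unroll-and-jam transformation applied to the kernel above (reusing the hypothetical atominfo array; the unrolling factor of four is an arbitrary choice) shows the reuse of the shared distance components:

#define UNROLLX 4
// Each work item now computes UNROLLX lattice points along x, so the
// dy*dy + dz*dz portion of each distance is computed once per atom and
// reused across all UNROLLX points.
__global__ void dcs_unrolled(float *energygrid, int padded_nx,
                             float gridspacing, float slice_z, int numatoms) {
  int xi = (blockIdx.x * blockDim.x + threadIdx.x) * UNROLLX;
  int yi = blockIdx.y * blockDim.y + threadIdx.y;
  float y = gridspacing * yi;
  float en[UNROLLX] = {0.0f};            // per-point partial sums in registers
  for (int i = 0; i < numatoms; i++) {
    float dy = y - atominfo[i].y;
    float dz = slice_z - atominfo[i].z;
    float dyz2 = dy*dy + dz*dz;          // common to all UNROLLX lattice points
    float dx = gridspacing * xi - atominfo[i].x;
#pragma unroll
    for (int j = 0; j < UNROLLX; j++) {  // manually unrollable inner loop
      en[j] += atominfo[i].w * rsqrtf(dx*dx + dyz2);
      dx += gridspacing;                 // step to the next lattice point
    }
  }
  for (int j = 0; j < UNROLLX; j++)
    energygrid[yi * padded_nx + xi + j] += en[j];
}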
Besides the memory accesses associated with lattice points, a good algo-
rithm must also optimize the memory organization of atomic coordinates and
any additional data, such as atomic element type, charge, radii, or force con-
stants. The needs of specific molecular modeling algorithms differ depending
on which atoms contribute to the accumulated quantity at a given lat-
tice point. For algorithms that accumulate contributions from all atoms at each
lattice point (e.g., direct Coulomb summation or computation of molecular or-
bitals), atom data are often stored in a flattened array with elements aligned
and padded to guarantee coalesced global memory accesses. Arrays containing
additional coefficients, such as quantum chemistry wavefunctions or basis sets,
must also be organized to facilitate coalesced memory accesses [23]. In cases
where coefficient arrays are traversed in a manner that is decoupled from the
lattice or atom index, the use of on-chip constant memory or shared memory
is often helpful.
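A sketch of such a flattened, padded atom array, assuming a float4 (x, y, z, q) layout and an illustrative 16-element padding granularity:

#include <cuda_runtime.h>

// Copy atom data into a flattened device array whose length is rounded up to
// a multiple of the padding granularity. Padded entries are zero-filled; a
// kernel is passed the true atom count so the padding is never used.
float4 *alloc_padded_atoms(const float4 *hostatoms, int numatoms) {
  int padcount = (numatoms + 15) & ~15;   // round up to a multiple of 16
  float4 *d_atoms = 0;
  cudaMalloc((void **)&d_atoms, padcount * sizeof(float4));
  cudaMemset(d_atoms, 0, padcount * sizeof(float4));
  cudaMemcpy(d_atoms, hostatoms, numatoms * sizeof(float4),
             cudaMemcpyHostToDevice);
  return d_atoms;
}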
Efficient GPU implementations of direct summation algorithms have previously
been shown to run as much as 44 times faster than a single core of a
contemporary-generation CPU for simple potential functions [16, 22].
Potential functions that involve exponential terms
for dielectric screening have much higher arithmetic intensity, enabling GPU
direct summation speedups of over a factor of 100 in some cases [21].
The use of mixed precision might offer the best tradeoff between accuracy and per-
formance, with double precision reserved for those parts of the computation
most affected by roundoff error, such as the accumulation of the potentials,
while single precision is used for everything else.
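The idea can be sketched as follows (a single-thread illustration; a real kernel would parallelize the loop across work items as described earlier):

// Sketch of mixed-precision accumulation: pairwise terms are evaluated in
// single precision, while the roundoff-sensitive running sum is kept in
// double precision (requires hardware double-precision support).
__global__ void potential_mixed(const float4 *atoms, int numatoms,
                                float3 r, double *result) {
  double energy = 0.0;                     // double-precision accumulator
  for (int i = 0; i < numatoms; i++) {
    float dx = r.x - atoms[i].x;
    float dy = r.y - atoms[i].y;
    float dz = r.z - atoms[i].z;
    float term = atoms[i].w * rsqrtf(dx*dx + dy*dy + dz*dz); // single precision
    energy += (double)term;                // only the sum is done in double
  }
  if (threadIdx.x == 0 && blockIdx.x == 0) *result = energy;
}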
The short-range nonbonded interactions combine Coulomb and Lennard-Jones
terms, summed over the non-excluded atom pairs,

E = \sum_i \sum_{j \notin \chi(i)} \left[ \frac{C q_i q_j}{r_{ij}} + 4\epsilon_{ij} \left( \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6} \right) \right], \quad r_{ij} = |\vec{r}_i - \vec{r}_j|,  (16.4)

where atom i has position \vec{r}_i and charge q_i, the constant C is fixed, the con-
stants εij and σij are determined by the atom types of i and j, and the set
χ(i) contains the atom indices j that are excluded from interacting with atom i,
typically including atom i and atoms that are covalently bonded to atom i.
The individual interaction terms decay quickly as the pairwise distance
increases, so the computational work for the N-body problem is typically
reduced by imposing a cutoff distance beyond which pairwise interactions
are neglected.
FIGURE 16.2: The atoms are spatially hashed into bins, which are assigned
to work groups. The work group for a bin (dark-gray shading) loops over
the neighboring bins (light-gray shading), loading each into shared memory.
Each work item loops over the atoms in shared memory (white-filled circles),
accumulating forces to its atom (black-filled circles).
Each bin is assigned to a work group, which loops over the neighboring bins,
cooperatively loading each bin into shared memory for processing, as depicted
in Figure 16.2. Each work item loops over the atoms stored in shared memory,
accumulating the force for any pairwise interactions that are within the cutoff
distance. The memory bandwidth is effectively amplified by having each work
item reuse the atom data in shared memory.
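A greatly simplified sketch of this scheme appears below, assuming fixed-capacity bins padded with zero-charge dummy atoms, a host-built list of neighboring bin indices (including the bin itself), Coulomb-only interactions with constants folded into the charges, and no exclusion handling; none of these choices is prescribed by the chapter:

#define BINSIZE 8   // illustrative bin capacity; launch one BINSIZE-thread
                    // block per bin
__global__ void cutoff_force(const float4 *binned_atoms, // BINSIZE atoms/bin
                             const int *neighbor_bins,   // per-bin neighbor ids
                             int nbrs_per_bin, float cutoff2, float3 *force) {
  __shared__ float4 shatoms[BINSIZE];
  int mybin = blockIdx.x;
  float4 myatom = binned_atoms[mybin * BINSIZE + threadIdx.x];
  float3 f = make_float3(0.0f, 0.0f, 0.0f);
  for (int n = 0; n < nbrs_per_bin; n++) {  // loop over neighboring bins
    int nbin = neighbor_bins[mybin * nbrs_per_bin + n];
    __syncthreads();                        // previous bin no longer in use
    shatoms[threadIdx.x] = binned_atoms[nbin * BINSIZE + threadIdx.x];
    __syncthreads();                        // bin cooperatively loaded
    for (int i = 0; i < BINSIZE; i++) {     // reuse atom data in shared memory
      float dx = shatoms[i].x - myatom.x;
      float dy = shatoms[i].y - myatom.y;
      float dz = shatoms[i].z - myatom.z;
      float r2 = dx*dx + dy*dy + dz*dz;
      if (r2 > 0.0f && r2 < cutoff2) {      // within cutoff, not self
        float s = myatom.w * shatoms[i].w * rsqrtf(r2) / r2;
        f.x -= s * dx;  f.y -= s * dy;  f.z -= s * dz;
      }
    }
  }
  force[mybin * BINSIZE + threadIdx.x] = f; // accumulated force for my atom
}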
The atom bin input data must include position and charge for each atom.
The constants for Lennard-Jones interactions are based on atom type and
determined through combining rules (involving arithmetic and/or geometric
averaging of the individual atom parameters) to obtain parameters εij and σij
in Eq. (16.4) for the (i, j) interaction. Although there are few enough atom
types to fit a square array of (εij, σij) pairs into the constant memory cache,
accessing the table will result in simultaneous reads to different table entries,
which will degrade performance. A better approach might be to perform the
extra calculation on-the-fly to obtain the combined εij and σij parameters.
The individual atom parameters could either be stored in the bins or read
from constant memory without conflict.
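For example, on-the-fly combination using the common Lorentz-Berthelot rules (an assumption; force fields differ in their combining rules) costs only a few arithmetic operations per pair:

// Sketch: combining per-atom Lennard-Jones parameters on the fly, using the
// Lorentz-Berthelot rules (arithmetic mean for sigma, geometric mean for
// epsilon), then evaluating the pair energy of Eq. (16.4).
__device__ float pair_energy(float qi, float qj, float eps_i, float eps_j,
                             float sig_i, float sig_j, float r2, float C) {
  float r_inv = rsqrtf(r2);
  float eps_ij = sqrtf(eps_i * eps_j);          // geometric mean
  float sig_ij = 0.5f * (sig_i + sig_j);        // arithmetic mean
  float sr2 = sig_ij * sig_ij * r_inv * r_inv;  // (sigma/r)^2
  float sr6 = sr2 * sr2 * sr2;                  // (sigma/r)^6
  return C * qi * qj * r_inv + 4.0f * eps_ij * (sr6 * sr6 - sr6);
}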
A sequential code typically makes use of Newton’s Third Law by evaluating
each interaction only once and updating the forces on both atoms. Doing
this cannot be readily adapted to GPUs without incurring additional costs to
coordinate the force updates between work items.
An alternative approach is to eval-
uate the interactions through table lookup from texture cache memory. Ad-
vantages to using a table lookup, besides avoiding expensive function evalu-
ations, include hardware-accelerated interpolation from texture memory, the
elimination of special cases like the switching function for the Lennard-Jones
potential, and the automatic clamping of the force to appropriate values at
the outer and inner cutoffs. Disadvantages include some reduction in the pre-
cision of forces and the fact that the floating-point performance will continue
to outpace the performance of the texture cache memory system on future
hardware.
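A sketch of such a force table using the CUDA texture-object API (a newer interface than the texture references contemporary with this chapter; indexing the table by squared distance and the scale factor are assumptions):

#include <cuda_runtime.h>

// Build a 1-D force table sampled through the texture unit, giving hardware
// linear interpolation and clamping at the table ends (the inner and outer
// cutoffs). Table contents are left to the caller.
cudaTextureObject_t make_force_table(const float *h_table, int n) {
  cudaArray_t arr;
  cudaChannelFormatDesc fmt = cudaCreateChannelDesc<float>();
  cudaMallocArray(&arr, &fmt, n);
  cudaMemcpy2DToArray(arr, 0, 0, h_table, n * sizeof(float),
                      n * sizeof(float), 1, cudaMemcpyHostToDevice);
  cudaResourceDesc res = {};
  res.resType = cudaResourceTypeArray;
  res.res.array.array = arr;
  cudaTextureDesc td = {};
  td.filterMode = cudaFilterModeLinear;      // hardware interpolation
  td.addressMode[0] = cudaAddressModeClamp;  // clamp at the cutoffs
  td.normalizedCoords = 1;                   // table indexed with [0,1)
  cudaTextureObject_t tex;
  cudaCreateTextureObject(&tex, &res, &td, NULL);
  return tex;
}

// Device-side lookup: r2 is the squared pair distance; scale maps it to [0,1).
__device__ float table_force(cudaTextureObject_t tex, float r2, float scale) {
  return tex1D<float>(tex, r2 * scale);
}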
In a message-driven parallel application, the GPU performs
work when a kernel is launched. The amount of work triggered (or otherwise
available) as a result of any individual incoming message is too small to yield
efficient execution on the GPU, so incoming messages must be aggregated
until there is enough work to fully occupy the GPU. Once enough work has
accumulated, data are copied to the GPU, and a GPU kernel is launched.
It is undesirable to allow the GPU to idle while waiting for work to ar-
rive; a refinement of the above approach partitions work into two categories:
the portion resulting from communications with peers and the portion that
depends only on local data. At the beginning of a simulation time-step, com-
putations involving only local data can begin immediately. By launching local
computations on the GPU immediately, the CPU can continue aggregation
of incoming work from peers, and the GPU is kept fruitfully occupied. One
drawback to this technique is that it will add latency to computations result-
ing from messages that arrive after the local work has already been launched,
potentially delaying peers waiting on results they need. In the ideal case, local
work would be scheduled immediately, but at a low priority, to be immedi-
ately superseded by work arriving from remote peers, thereby optimizing GPU
utilization and also minimizing latency [19].
Next-generation GPUs will allow multiple kernels to execute concur-
rently [15], enabling work to be submitted to the GPU in much smaller
batches, and making it possible to simultaneously process local and remote
workloads. Despite the lack of work prioritization capabilities in current GPU
hardware and software interfaces, it should be possible to take advantage of
concurrent kernel execution to reduce the latency of work for peers while maintaining full
GPU utilization. One possible approach for accomplishing this would submit
local work in small batches just large enough to keep the GPU fully occupied.
Work contained in messages arriving from remote peers would be immediately
submitted to the GPU, thereby minimizing latency. GPU work completion
events or periodic polling would trigger additional local work to be submitted
on the GPU to keep it productively occupied during periods when no remote
messages arrive. If future GPU hardware and software add support for work
prioritization, then this scheme could be implemented simply by assigning
highest priority to remote work, thereby minimizing latency [19].
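Stream priorities added in CUDA releases after this chapter was written permit a sketch of this scheme; the kernels below are placeholders for batched local and remote work:

#include <cuda_runtime.h>

__global__ void local_batch()  { /* ... a small batch of local work ...  */ }
__global__ void remote_batch() { /* ... work from a remote message ...   */ }

// Sketch: local work is submitted in small, low-priority batches; remote work
// is submitted at high priority as messages arrive, so its pending blocks are
// scheduled ahead of the remaining local batches.
void schedule_work(int nlocal) {
  int least, greatest;                 // numerically lower = higher priority
  cudaDeviceGetStreamPriorityRange(&least, &greatest);
  cudaStream_t local_s, remote_s;
  cudaStreamCreateWithPriority(&local_s, cudaStreamNonBlocking, least);
  cudaStreamCreateWithPriority(&remote_s, cudaStreamNonBlocking, greatest);
  for (int i = 0; i < nlocal; i++)     // keep the GPU occupied with local work
    local_batch<<<64, 128, 0, local_s>>>();
  remote_batch<<<64, 128, 0, remote_s>>>();  // minimize latency for peers
  cudaDeviceSynchronize();
}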
Another potential complication that arises with GPU computing is how
best to decompose work and manage execution with varying ratios of CPU
cores and GPUs. Today, most applications that take advantage of GPU accel-
eration use a relatively simple mapping of host CPU threads to GPU devices.
Existing GPU programming toolkits such as CUDA and OpenCL associate a
GPU device with a context that is specific to a particular host thread. This
creates pressure on application developers to organize the host-side applica-
tion code so that only one thread accesses a particular GPU device. Parallel
message passing systems such as MPI are completely unaware of the existence
of GPUs, and on clusters with an unequal number of CPU cores and GPUs,
multiple processes can end up competing for access to the same GPU. Since
current-generation GPUs can only execute a single kernel from a single context
at a time, kernels from competing processes are serialized on the
same system. Unexpected errors deep within a multi-GPU code are difficult
to handle, since errors must be handled by the thread detecting the problem,
and actions to correct the error may require cooperation with other threads.
In VMD, we use an error handling mechanism that is built into the dynamic
work distribution system, enabling errors to be propagated from the thread
encountering problems to its peers. This mechanism allows orderly cleanup in
the case of fatal errors, and allows automatic redistribution of failed work in
cases where the errors relate to resource exhaustion affecting only one GPU
device.
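The flavor of such a mechanism can be sketched on the host side as follows; this is an illustration of the idea, not VMD's actual implementation, and launch_tile_kernel and requeue_tile are hypothetical helpers:

#include <cuda_runtime.h>
#include <atomic>

void launch_tile_kernel(int tile);   // hypothetical: run one work tile
void requeue_tile(int tile);         // hypothetical: return a tile to the queue

std::atomic<int>  next_tile(0);
std::atomic<bool> fatal_error(false);

// One worker thread per GPU pulls tiles from a shared queue. Resource
// exhaustion on one device returns the failed tile for a peer to retry;
// any other error raises a flag so all workers clean up in an orderly way.
void gpu_worker(int device, int ntiles) {
  cudaSetDevice(device);
  while (!fatal_error.load()) {
    int tile = next_tile.fetch_add(1);      // dynamic work distribution
    if (tile >= ntiles) break;
    launch_tile_kernel(tile);
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaErrorMemoryAllocation) {
      requeue_tile(tile);                   // let a peer GPU retry this tile
      break;                                // this device drops out
    } else if (err != cudaSuccess) {
      fatal_error.store(true);              // propagate fatal error to peers
    }
  }
}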
Acknowledgments
This work was supported by the National Institutes of Health under grant
P41-RR05969. The authors also wish to acknowledge additional support pro-
vided by the NVIDIA CUDA center of excellence at the University of Illinois
at Urbana-Champaign. Performance experiments were made possible by a
generous hardware donation by NVIDIA.
Bibliography
[1] Anton Arkhipov, Jana Hüve, Martin Kahms, Reiner Peters, and Klaus
Schulten. Continuous fluorescence microphotolysis and correlation spec-
troscopy using 4Pi microscopy. Biophys. J., 93:4006–4017, 2007.
[2] David H. Bailey. High-precision floating-point arithmetic in scientific
computation. Comput. Sci. Eng., 7(3):54–61, 2005.
[3] David Callahan, John Cocke, and Ken Kennedy. Estimating interlock and
improving balance for pipelined architectures. J. Paral. Distr. Comp.,
5(4):334–358, 1988.
[4] U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee, and L. G.
Pedersen. A smooth particle mesh Ewald method. J. Chem. Phys.,
103:8577–8593, 1995.
[5] James Gumbart, Leonardo G. Trabuco, Eduard Schreiner, Elizabeth
Villa, and Klaus Schulten. Regulation of the protein-conducting channel
by a bound ribosome. Structure, 17:1453–1464, 2009.
[6] David J. Hardy, John E. Stone, and Klaus Schulten. Multilevel summa-
tion of electrostatic potentials using graphics processing units. Parallel
Comput., 35:164–177, 2009.
[7] Yun He and Chris H. Q. Ding. Using accurate arithmetics to improve
numerical reproducibility and stability in parallel applications. In ICS
’00: Proceedings of the 14th International Conference on Supercomputing,
pages 225–234, New York, NY, USA, 2000. ACM Press.
[8] R. W. Hockney and J. W. Eastwood. Computer Simulation Using Par-
ticles. McGraw-Hill, New York, 1981.
[9] William Humphrey, Andrew Dalke, and Klaus Schulten. VMD—Visual
molecular dynamics. J. Mol. Graphics, 14:33–38, 1996.
[13] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable
parallel programming with CUDA. ACM Queue, 6(2):40–53, 2008.
[14] NVIDIA. CUDA Compute Unified Device Architecture Programming
Guide. NVIDIA, Santa Clara, CA, USA, 2007.
[15] NVIDIA’s next generation CUDA compute architecture: Fermi. White
Paper, NVIDIA, 2009. Available online (Version 1.1, 22 pages).
[16] John D. Owens, Mike Houston, David Luebke, Simon Green, John E.
Stone, and James C. Phillips. GPU computing. Proc. IEEE, 96:879–899,
2008.
[17] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens
Kruger, Aaron E. Lefohn, and Timothy J. Purcell. A survey of general-
purpose computation on graphics hardware. Comput. Graph. Forum,
26(1):80–113, 2007.
[18] James C. Phillips, Rosemary Braun, Wei Wang, James Gumbart, Emad
Tajkhorshid, Elizabeth Villa, Christophe Chipot, Robert D. Skeel,
Laxmikant Kale, and Klaus Schulten. Scalable molecular dynamics with
NAMD. J. Comp. Chem., 26:1781–1802, 2005.
[19] James C. Phillips, John E. Stone, and Klaus Schulten. Adapting a
message-driven parallel application to GPU-accelerated clusters. In SC
’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing,
Piscataway, NJ, USA, 2008. IEEE Press.
[20] Christopher I. Rodrigues, David J. Hardy, John E. Stone, Klaus Schul-
ten, and Wen-mei W. Hwu. GPU acceleration of cutoff pair potentials
for molecular modeling applications. In CF’08: Proceedings of the 2008
Conference on Computing Frontiers, pages 273–282, New York, NY, USA,
2008. ACM.
[21] John E. Stone, David Gohara, and Guochun Shi. OpenCL: A parallel
programming standard for heterogeneous computing systems. Comput. Sci.
Eng., 12:66–73, 2010.