
Multiparadigm Parallel Acceleration for Reservoir Simulation
Larry S.K. Fung, SPE, Mohammad O. Sindi, SPE, and Ali H. Dogru, SPE, Saudi Aramco
Summary
With the advent of the multicore central-processing unit (CPU), today's commodity PC clusters are effectively a collection of interconnected parallel computers, each with multiple multicore CPUs and large shared random-access memory (RAM), connected together by means of high-speed networks. Each computer, referred to as a compute node, is a powerful parallel computer on its own. Each compute node can be equipped further with acceleration devices such as the general-purpose graphical processing unit (GPGPU) to further speed up computationally intensive portions of the simulator. Reservoir-simulation methods that can exploit this heterogeneous hardware system can be used to solve very-large-scale reservoir-simulation models and run significantly faster than conventional simulators. Because typical PC clusters are essentially distributed shared-memory computers, this suggests that the use of mixed-paradigm parallelism (distributed-shared memory), such as message-passing interface and open multiprocessing (MPI-OMP), should work well for computational efficiency and memory use. In this work, we compare and contrast the single-paradigm programming models, MPI or OMP, with the mixed-paradigm, MPI-OMP, programming model for a class of solver method that is suited for the different modes of parallelism. The results showed that the distributed-memory (MPI-only) model has superior multicompute-node scalability, whereas the shared-memory (OMP-only) model has superior parallel performance on a single compute node. The mixed MPI-OMP model and OMP-only model are more memory-efficient for the multicore architecture than the MPI-only model because they require less or no halo-cell storage for the subdomains.
To exploit the fine-grained shared-memory parallelism available on the GPGPU architecture, algorithms should be suited to the single-instruction multiple-data (SIMD) parallelism, and any recursive operations are serialized. In addition, solver methods and data store need to be reworked to coalesce memory access and to avoid shared-memory-bank conflicts. Wherever possible, the cost of data transfer through the peripheral component interconnect express (PCIe) bus between the CPU and GPGPU needs to be hidden by means of asynchronous communication. We applied multiparadigm parallelism to accelerate compositional reservoir simulation on a GPGPU-equipped PC cluster. On a dual-CPU-dual-GPGPU compute node, the parallelized solver running on the dual-GPGPU Fermi M2090Q achieved up to 19 times speedup over the serial CPU (1-core) results and up to 3.7 times speedup over the parallel dual-CPU X5675 results in a mixed MPI-OMP paradigm for a 1.728-million-cell compositional model. Parallel performance shows a strong dependency on the subdomain sizes. Parallel CPU solve has a higher performance for smaller domain partitions, whereas GPGPU solve requires large partitions for each chip for good parallel performance. This is related to improved cache efficiency on the CPU for small subdomains and the loading requirement for massive parallelism on the GPGPU. Therefore, for a given model, the multinode parallel performance decreases for the GPGPU relative to the CPU as the model is further subdivided into smaller subdomains to be solved on more compute nodes. To illustrate this, a modified SPE5 (Killough and Kossack 1987) model with various grid dimensions was run to generate comparative results. Parallel performances for three field compositional models of various sizes and dimensions are included to further elucidate and contrast CPU-GPGPU single-node and multiple-node performances. A PC cluster with the Tesla M2070Q GPGPU and the 6-core Xeon X5675 Westmere was used to produce the majority of the reported results. Another PC cluster with the Tesla M2090Q GPGPU was available for some cases, and the results are reported for the modified SPE5 (Killough and Kossack 1987) problems for comparison.
Introduction
Modern reservoir simulators in use by the oil and gas industry are computationally intensive software packages that are complex to build and maintain. A general-purpose simulator includes a diverse collection of algorithms and methods. Coupled multiphase-flow and transport problems are a tightly coupled system of nonlinear equations with significant spatial/temporal dependencies that need to be resolved for a stable transient solution. Core components include the nonlinear and linear solvers, various formulations, discretization methods both in time and space, treatment of faults and fractures, dual or multiple porosities and permeabilities, coupled geomechanics modeling, wellbore-modeling methods, near-well modeling methods, implicit coupled well-solution method, coupled surface-network modeling, field/reservoir/well/group management and optimization module, fluid- and rock-property calculation package with drainage and imbibition hysteresis modeling, phase-behavior package and equation-of-state calculation, Jacobian assembly, nonlinear update algorithm, timestepping algorithm, general fluid-in-place initialization algorithm that handles multiple fluid types, multiple rock types, a spatial distribution of composition and saturation, and complex input/output (I/O) processing required to manage history-match and field-performance evaluation workflow.
High-performance computing (HPC) technologies have evolved rapidly during the last 15 years from the very expensive centralized supercomputer to commodity-based PC clusters. Modern typical PC clusters may have hundreds of compute nodes connected together by means of high-speed networks. Each compute node contains tens of CPU cores and several gigabytes of memory. Primarily, two modes of parallelism have been used to speed up reservoir-simulation code: distributed-memory parallelism by use of the MPI standard and shared-memory parallelism by use of the OpenMP standard, because they are widely supported by hardware vendors. Shared-memory thread-based parallelization can be applied incrementally and locally to speed up certain computationally intensive loops. It is well-suited to parallelize code segments that fit the SIMD programming model. It is less flexible compared with the distributed-memory method, in which each process is independent. Distributed-memory programming is well-suited to the general multiple-instruction multiple-data (MIMD) programming model and can also be used for task parallelism; however, the entire simulator (algorithm and data structure) must be engineered for the distributed-memory model. In massively parallel applications, each of the simulator components needs to be efficiently parallelized and its data distributed for the overall simulator to be scalable. The diverse algorithms and methods within a production-level reservoir simulator pose significant challenges to fully exploit the available performance on this HPC hardware, which may contain several-thousand computing cores to speed up simulation.
Copyright © 2013 Society of Petroleum Engineers

This paper (SPE 163591) was accepted for presentation at the SPE Reservoir Simulation Symposium, The Woodlands, Texas, USA, 18-20 February 2013, and revised for publication. Original manuscript received for review 15 March 2013. Revised manuscript received for review 30 July 2013. Paper peer approved 16 October 2013.
The GPGPU is an emerging accelerator hardware for scientific computation. These units can be added by means of PCIe expansion slots to enhance computational horsepower. The GPGPU Fermi M2090Q has 512 cores and is capable of 256 fused multiply/add flop/clock peak performance at double precision. This is a significant performance multiple over the hex-core Westmere CPU X5675; however, algorithms suited for GPGPU computing need to exhibit massive parallelism with thousands of independent threads. This is not typical with many reservoir-simulation solution methods. Unstructured problems requiring indirect addressing that can lead to shared-memory-bank conflicts and inefficiency in global-memory access will lead to poor performance and should be avoided. Methods and associated data that cannot be restructured for efficiency on the GPGPU may not realize any performance gains. Furthermore, the code segment should contain a sufficiently large amount of work so that the data movement through the PCIe bus will not overwhelm performance gains in computation. In the best-case scenario, the data-copying cost can be hidden by means of asynchronous communication that can be overlaid with computation.
Earlier studies involved programming the GPGPU devices with the low-level CUDA programming language, which was seen by many as undesirable because it is not very programmer-friendly. It requires significant time and effort to develop and debug code on the device. Studies such as Vuduc et al. (2010) caution that the productivity loss during the tedious practice of porting scientific code to the GPGPU may outweigh the performance gain in some cases. Consequently, the availability of compilers with CUDA application-programming-interface (API) extensions for high-level languages such as C (NVIDIA 2012a,b) and FORTRAN (The Portland Group 2011) has improved this situation. More recently, a directive-based approach with the proposal of the new OpenACC Application Programming Interface Standard (2013) for multiple accelerator devices may further reduce the development time. OpenACC is a directive-based programming standard for parallel programming of heterogeneous CPU/GPU systems.
Previous work to accelerate linear solvers for reservoir simulation has been reported by Klie et al. (2011). They ported Krylov-based solvers [generalized conjugate residual (GCR) or generalized minimum residual (GMRES) methods] with variants of block incomplete LU (BILU) factorization or symmetric successive over-relaxation (SSOR) method preconditioners and compared performance between the GPGPU and CPU. By using the state-of-the-art hardware at the time (Fermi GPGPU C2050 and Nehalem CPU X5570), they found the GPGPU performance may be comparable to the performance obtained in eight cores of a single multicore device for their solver implementation.
Because the traditional preconditioning methods, such as the incomplete LU (ILU) factorization variants or the nested factorization (NF) method, are not suitable for this hardware, Appleyard et al. (2011) developed the multicolor NF (MCNF) preconditioner as the accelerator for the Krylov solver GMRES to solve the sparse linear equations for reservoir simulation on the GPGPU. Their study was limited to parallelization on a single GPGPU, for which they reported good speedup for large models (>100,000 gridblocks), surpassing their robust serial solver running on the CPU that uses NF as the preconditioning method.
Zhou and Tchelepi (2013) also reported on accelerating the MCNF algorithm on the GPGPU as the pressure preconditioner of the constrained-pressure-residual (CPR) method in the fully implicit system solution. They reported results for a single GPGPU as well as multiple GPGPUs on a single compute node (with multiple PCI expansion slots). In their work, only the pressure solve of the CPR algorithm was ported to run on the GPGPU, whereas the remainder of the solver runs on the CPU. For the SPE10 (Christie and Blunt 2001) problem, they reported a 19-times speedup, surpassing the one-core serial CPU speed for their implementation, and a factor of 3.2 out of 4 for four GPGPU cards in a one-compute-node implementation. The solution of the SPE10 (Christie and Blunt 2001) problem is overwhelmingly dominated by the pressure-solution time. Thus, leaving the full solve on the CPU did not pose apparent performance issues for them. However, for the more-general multiphase/multicomponent simulation problems, the full solve may be a more significant component of the overall solution time, and the achievable speedup factor will be lower.
Recently published data comparing the parallel performance of 11 parallel scientific applications indicate that the typical performance multiples obtained on GPUs over CPUs may be only a small fraction of their quoted peak-performance ratios (Table 1). Oftentimes, the achieved performance on a routine or on a code segment can be significantly better than the overall application speedup, which might have generated initial optimism. Of course, what can be achieved will strongly depend on the algorithms and methods of the respective applications.
In the following sections, the solver method that is well-suited for multiparadigm parallel acceleration is described. The heterogeneous computing environment used for the project is a PC cluster with compute nodes containing the hex-core Westmere CPUs and the Fermi GPGPUs. This is followed by an explanation of mixed-paradigm (MPI-OMP) parallelization with unstructured domain decomposition on the multicore CPUs. The comparison among the MPI-only, the OMP-only, and the mixed MPI-OMP parallelization for a single compute node and multiple compute nodes is illustrated with example problems of various model sizes. Some of the issues with the GPGPU parallel acceleration are then explained. These are related to the hardware architecture and memory model, which are significantly different from the cache-based multicore CPU architecture. The solver algorithms discussed in this work are well-suited to both the CPU mixed-paradigm parallelism and the GPGPU massive parallelism; however, data organization and code are necessarily different to address the different architectures. Some of the pertinent issues are explained.
The GPGPU parallelization conducted is different in several aspects from those previously cited. The use of GPGPU directives significantly reduces the development overhead involved with the low-level CUDA programming. Our implementation uses the hybrid approach: MPI + OpenMP + GPGPU directives. This approach enables us to run on multiple CPUs and GPGPUs on multiple compute nodes in a distributed parallel fashion. We report the performance not only on synthetic SPE-type models but also on three full-field compositional models. All our computations are performed in 64-bit double-precision arithmetic, which is required for reservoir simulation.
TABLE 1: RESEARCHERS SQUEEZE GPU PERFORMANCE FROM 11 BIG SCIENCE APPS (FELDMAN 2012)

Application                                            | Performance, XK6* vs. XE6** | Software Framework
S3D (turbulent combustion)                             | 1.4                         | OpenACC
NAMD (molecular dynamics)                              | 1.4                         | CUDA
CP2K (chemical physics)                                | 1.5                         | CUDA
CAM-SE (community-atmosphere model)                    | 1.5                         | PGI CUDA Fortran
WL-LSMS (statistical mechanics of magnetic materials)  | 1.6                         | CUDA
GTC/GTC-GPU (plasma physics for fusion energy)         | 1.6                         | CUDA
SPECFEM-3D (seismology)                                | 2.5                         | CUDA
QMCPACK (electronic structure of materials)            | 3.0                         | CUDA
LAMMPS (molecular dynamics)                            | 3.2                         | CUDA
Denovo (3D neutron transport for nuclear reactors)     | 3.3                         | CUDA
Chroma (lattice quantum chromodynamics)                | 6.1                         | CUDA

* XK6: one Opteron 6200 + one Fermi per node.
** XE6: two AMD Opteron 6200 per node (also known as Interlagos, with 16 cores, 8 dual-core modules on a chip).
Parallel-Solver Method
The iterative-solution method in this work uses the approximate inverse preconditioner known as the Z-line power-series method to accelerate a Krylov-subspace solver such as Orthomin (Vinsome 1976) or GMRES (Saad and Schultz 1986). The Z-line power-series method is a special instance of the general line-solve power-series (LSPS) method and was discussed previously (Fung and Dogru 2008a). The mechanics of the method is to subdivide the Jacobian matrix into two parts:
$$A = P + E. \qquad (1)$$

In this approach, the matrices $A$, $P$, and $E$ are fully unstructured. The important aspect of the approach is to choose $P$ such that it includes the dominant terms of $A$ and at the same time remains inexpensive to compute the inverse. Thus, matrix $A$ can be written as

$$A = \left(I + E P^{-1}\right) P. \qquad (2)$$

The approximate inverse preconditioner for $A$ by use of the $N$-term power series is

$$A^{-1} \approx M_N^{-1} = \left[\, I + \sum_{k=1}^{N} (-1)^k \left(P^{-1} E\right)^k \right] P^{-1}. \qquad (3)$$
In the Z-line power-series method, P is block-tridiagonal, with grid cells ordered in the Z-direction first. The two-level CPR method (Wallis et al. 1985; Fung and Dogru 2008a) can be constructed that uses the LSPS method as the base preconditioning method. An alternative choice of pressure solver, such as the algebraic multigrid method, can have a faster convergence rate but will be harder to parallelize and will have poorer scalability (Fung and Dogru 2008a). The present paper documents the performance results for the one-level solver only. If the LSPS preconditioner is used as the preconditioner for both the pressure solve and the full-system solve of the CPR algorithm, the expected performance gains by use of the multiparadigm parallelization approaches discussed herein will be comparable, with the attendant improvements in the robustness and the efficiency of the CPR method, as documented previously. The additional operations in the two-level CPR solver do not pose further issues for either the OMP parallelization or the GPGPU parallelization.
The Z-line power-series method involves Z-line solves coupled with matrix-vector (MV) multiplication operations in a sequence. For a structured grid, a red-black line-reduction step may be applied to reduce the work counts. The Orthomin method was chosen to generate the reported results in this paper. It consists of a series of vector dot products and an MV multiplication operation. It is relatively easy to parallelize on both the CPU and GPU in multiple paradigms and is a small fraction of the overall solution cost. However, on the GPU, the implementation of the reduction operation must be free of shared-memory-bank conflicts to have good performance. Memory-bank conflicts will lead to the serialization of operations. Branching will also lead to the serialization of operations on the GPU because each half-warp (16 threads) must execute identical instructions on the GPGPU cores. Algorithms that are suited to the SIMD hardware will naturally have an advantage running on the GPGPU. The NVIDIA GPGPU provides the single-instruction multiple-threads (SIMT) programming model that allows the more general MIMD code to run and handles the necessary serialization automatically for the application developers. However, such applications without proper re-engineering will suffer a significant performance disadvantage.
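As an illustration only (not the authors' implementation), the sketch below shows how the N-term preconditioner of Eq. 3 can be applied to a vector once two building blocks are available: a Z-line solve with P and an unstructured multiply with E. The routine names zline_solve and spmv_E are placeholders for those operations.

```cpp
// Sketch of y = M_N^{-1} v from Eq. 3 (illustrative, with user-supplied
// operators): t_0 = P^{-1} v, t_k = (P^{-1} E) t_{k-1}, y = sum_k (-1)^k t_k.
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Op  = std::function<void(const Vec&, Vec&)>;  // out = op(in)

void apply_lsps_preconditioner(const Vec& v, Vec& y, int N,
                               const Op& zline_solve,  // t = P^{-1} r (block-tridiagonal Z-line solve)
                               const Op& spmv_E)       // w = E t (coupling terms outside P)
{
    const std::size_t n = v.size();
    Vec t(n), w(n);
    zline_solve(v, t);          // k = 0 term: P^{-1} v
    y = t;
    double sign = -1.0;
    for (int k = 1; k <= N; ++k) {
        spmv_E(t, w);           // w = E t_{k-1}
        zline_solve(w, t);      // t_k = P^{-1} w
        for (std::size_t i = 0; i < n; ++i) y[i] += sign * t[i];
        sign = -sign;           // alternating (-1)^k factor
    }
}
```

Each power-series term costs one multiply with E and one set of independent Z-line solves, which is what makes the method amenable both to OMP threading over Z-lines and to SIMD/SIMT execution on the GPGPU.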
Heterogeneous (CPU-GPGPU) Parallel-Computing
Environment
The computer used for this study is a Dell cluster of PowerEdge C6100 nodes. The cluster consists of 32 compute nodes with dual multicore processors. Each processor is an Intel X5675 hex-core (Westmere) at 3.07 GHz; therefore, each node is equipped with 12 processing cores in total. The operating system running on the nodes is RedHat Enterprise Linux Server 5.4 with the 2.6.18-164.15.1.el5 kernel. Each node is equipped with an InfiniBand host-channel adapter supporting quad-data-rate (QDR) connections between the nodes through a QLogic 12800-40 InfiniBand switch. Each node is also equipped with 48 GB of memory. For MPI communication over the InfiniBand network, MVAPICH2 was used for all the simulation runs. In mixed-paradigm computing, it is important to correctly set up the environment variables for process pinning to CPU cores. This is documented in the MVAPICH2 user manual (Network-Based Computing Laboratory 2008).

In addition to the two CPU processors, each node of the cluster is also equipped with two NVIDIA M2070Q GPGPUs, which are based on NVIDIA's Fermi technology. Each GPGPU has 448 processing cores [14 streaming multiprocessors (SMs) with 32 cores each] at 1.15 GHz, along with 6 GB of memory. With this setup, a one-to-one binding between the CPU processors and GPGPUs is maintained; thereby, each node has two CPU processors and two GPGPUs. Each group of 16 GPGPUs is hosted in a Dell PowerEdge C410x PCIe expansion chassis, with a total of 64 GPGPUs being hosted in four PCIe expansion chassis. NVIDIA's 275.09.07 driver with CUDA 4.0 and compute capability 2.0 was used with the GPGPUs, along with PGI's 11.8 compiler supporting accelerator directives. Fig. 1 illustrates the cluster-hardware layout used to conduct this study.
Mixed-Paradigm (MPI-OMP) Parallelization With Unstructured-Domain Partitions
Mixed-paradigm parallelization (distributed-shared-memory model) partitions the grid into an equal division of Z-lines by use of a graph-partitioning algorithm that produces a nearly equal number of active cells per partition with minimized interdomain connections. The current implementation uses the MPI (1995) standard for distributed memory and the OpenMP (2011) standard for shared memory. Each grid partition is owned by an MPI process. The data, both matrix and vector, within each partition are organized to facilitate communication hiding and memory access for the solver methods. The dual-level graph-based matrix-data organization was introduced previously in terms of a distributed parallel method (Fung and Dogru 2008b) with global cell lists and ordering; however, in massively parallel implementation, local cell lists and ordering methods were implemented without a global hash table (Fung and Mezghani 2013). This methodology is more memory-efficient, and it has no upper limits on model sizes.

Fig. 1: Cluster layout consisting of 32 nodes in total with 64 GPGPUs. Nodes are interconnected by means of InfiniBand QDR through a QLogic 12800-40 switch; each PowerEdge C6100 chassis holds 4 nodes, and each PowerEdge C410x chassis hosts 16 GPGPUs.

In mixed-paradigm parallelism, each MPI process spawns a user-specified number of OMP threads. The OMP-loop limits for each thread are set on the basis of an equal division of Z-lines in each subdomain. When the line lengths are variable, a second-level graph partition is required in which the weight for each Z-line is the number of active cells. For good parallel performance, the number of fork-join and synchronization points for OMP threads needs to be minimized. This can be accomplished by expanding the OMP parallel regions to encompass the entire solver code. Communication hiding is easily implemented with proper matrix and vector data layouts and by use of an asynchronous peer-to-peer communication protocol. Computational work for the subdomain interior can now overlap with communication. The work on the subdomain boundary can start when all the data in the halo of the subdomain have been received. The current implementation uses the master thread of each MPI partition to handle the interprocess communication. When the code is instrumented with mixed-paradigm parallelism, the choice of MPI only, OMP only, or mixed-paradigm MPI-OMP is a runtime decision made by simply setting the number of compute nodes, the number of processes per node, and the number of OMP threads per process in a job script. Shared-memory parallelism is limited to one compute node in the typical PC-cluster computing environment.
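The following is a minimal sketch (our illustration, not the production code) of the funneled pattern described above: MPI is assumed to be initialized with MPI_THREAD_FUNNELED, the master thread of each process posts the asynchronous halo exchange, all threads work on interior Z-lines so the communication is hidden, and boundary work starts only after the halo data have arrived. The routines compute_interior and compute_boundary are placeholders.

```cpp
// Sketch of mixed MPI-OMP communication hiding (illustrative only).
#include <mpi.h>
#include <omp.h>

void halo_exchange_and_compute(double* send_buf, double* recv_buf, int halo_n,
                               int left, int right, MPI_Comm comm,
                               void (*compute_interior)(int tid, int nthreads),
                               void (*compute_boundary)(int tid, int nthreads))
{
    MPI_Request req[4];                        // shared by all threads of this process
    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        const int nthreads = omp_get_num_threads();

        #pragma omp master                     // only the master thread calls MPI
        {                                      // (requires MPI_THREAD_FUNNELED)
            MPI_Irecv(recv_buf,          halo_n, MPI_DOUBLE, left,  0, comm, &req[0]);
            MPI_Irecv(recv_buf + halo_n, halo_n, MPI_DOUBLE, right, 1, comm, &req[1]);
            MPI_Isend(send_buf,          halo_n, MPI_DOUBLE, left,  1, comm, &req[2]);
            MPI_Isend(send_buf + halo_n, halo_n, MPI_DOUBLE, right, 0, comm, &req[3]);
        }                                      // no implied barrier: other threads proceed

        compute_interior(tid, nthreads);       // interior work overlaps the communication

        #pragma omp master
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        #pragma omp barrier                    // all threads wait until halo data are in

        compute_boundary(tid, nthreads);       // boundary work uses the received halo cells
    }
}
```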
Results for Mixed-Paradigm (MPI-OMP) Parallelization
A series of problems with various domain sizes was set up to compare parallel performance among MPI-only, OMP-only, and mixed MPI-OMP models on one node (2 X5675, 12 cores). The model sizes ranged from 50,000 cells to 1 million cells (NX × NY × NZ = 100 × 100 × 5, 100 × 100 × 10, 100 × 100 × 20, 100 × 100 × 50, and 100 × 100 × 100). For each model size, the MPI-OMP configurations of 1-12, 12-1, 2-6, 6-2, and 4-3 were tested. The 1-12 configuration is the shared-memory (1-process, 12-thread) configuration, and the 12-1 configuration is the distributed-memory (12-process, 1-thread) configuration. The other three configurations are mixed-paradigm configurations. All runs are converged to 1.0 × 10^-8, and each run of the same model has exactly the same number of iterations, residual norms, and change vectors. The convergence tolerance used in our testing is to purposely check for the effects of the differences in the treatment of double-precision arithmetic on different hardware architectures. This also ensures the correctness of the parallel implementation and the independence of solution work counts for each parallelization option. Solver tolerances in production runs will normally be looser and more typically at 1 × 10^-4. The parallel speedup factors over the serial (1-core) solution time on one compute node are summarized in Table 2 and plotted in Fig. 2. The dependence of parallel performance on model sizes is clearly shown in Fig. 2, in which the smaller model has better parallel performance because of improved cache efficiency. On one node, shared-memory parallelization performs better than distributed memory or mixed paradigm for this solver method. Mixed-paradigm and MPI-only parallelization produce comparable performances. To test the multicompute-node performance, the 1-million-cell model is solved on up to 12 compute nodes (144 cores) in mixed-paradigm (MPI-OMP = n-12, 2n-6, 3n-4, 4n-3, or 6n-2) configurations or a 12n-1 MPI-only configuration, in which n is the number of compute nodes. The speedup factors, normalized to the parallel solution time on one compute node, are plotted in Fig. 3. The normalization is performed on the basis of the respective one-node result for each MPI-OMP configuration, as summarized in the last column of Table 2. For example, the four-compute-node run (n = 4) of the 2n-6 case uses the timing of the MPI-OMP = 2-6 case for normalization. This gives the valid one-node to multinode speedup factors for the respective MPI-OMP configurations. The speedup factors from one core to one node (12 cores) are stated in Table 2. Thus, the speedup factor for up to twelve nodes (144 cores) over the serial one-core run can be calculated by use of Table 2 and Fig. 3. That factor is 13.8 × 5.32 = 73.4 for the 12n-1 MPI-only configuration. In the multicompute-node application, MPI-only has superlinear scalability throughout the range of compute nodes tested. The improved parallel performance with decreasing subdomain sizes is evident; however, mixed-paradigm parallelization has poorer performance in the multinode situations.
TABLE 2: COMPARISON OF MIXED-PARADIGM (MPI-OMP) PARALLEL PERFORMANCE FOR VARIOUS MODEL SIZES ON 1 NODE (2 X5675 CPU, 12 CORES)

MPI-OMP | 50,000 cells | 100,000 cells | 200,000 cells | 500,000 cells | 1,000,000 cells
1-12    | 9.96         | 9.64          | 8.12          | 5.82          | 5.56
12-1    | 7.97         | 8.15          | 7.52          | 5.58          | 5.32
2-6     | 8.15         | 8.06          | 7.18          | 5.53          | 5.09
6-2     | 7.44         | 7.48          | 7.01          | 5.29          | 5.21
4-3     | 8.12         | 7.8           | 7.27          | 5.59          | 5.32

(Entries are solver speedup factors over the serial 1-core solution time.)
Fig. 2: Mixed-paradigm parallel performance on 1 node (2 X5675 CPU, 12 cores) for various model sizes.
Fig. 3: Parallel performance on multicompute nodes for the 1-million-cell model.
This is primarily because of cache-line conflicts of OMP parallelization in the halo computation, which can be improved with thread-private workspace or fewer threads for the halo computation.
GPU Parallelization
The abstracted view of the x86 CPU plus GPGPU Fermi architecture is shown in Fig. 4. The GPGPU Fermi architecture has 32 CUDA cores per SM. Each SM schedules threads in groups of 32 threads called warps. Each SM has a dual-warp scheduler that simultaneously schedules and dispatches instructions for two independent warps. A kernel program executing on the GPGPU organizes threads on a grid of thread blocks. Threads in a thread block cooperate through barrier synchronization and shared memory. It is important that kernel-code execution does not involve shared-memory-bank conflicts, which may lead to the serialization of execution. An example of a memory-bank conflict is illustrated in Fig. 5. The memory model for CUDA coding is per-thread private, per-block shared, and per-grid global after kernel-wide synchronization. Coalesced memory access allows the efficient movement of data from global memory to the shared memory of each SM. Code and data that are structured for efficiency on a cache-based architecture, such as the x86, must be reorganized to realize performance on the GPGPU.
Coalesced global-memory access happens when each half-warp (16 threads) of an SM accesses a contiguous region of device memory. Each SM has configurable shared memory of between 16 KB and 48 KB. The shared memory of the SM is organized into 16 banks of 4-byte words, with 256 rows of benches for the 16-KB instance, similar to benches in a football stadium. A memory bank represents a column of benches that feed data to the GPGPU cores of a warp. Accessing data on an adjacent bank within a warp leads to memory-bank conflicts for the SIMT operations. For CPU computation, data store is organized to optimize cache reuse, thus minimizing memory traffic. This means multiple data elements that are accessed by the same CPU core in computation may be stored consecutively. This organization is obviously not good for SIMT GPGPU computing.
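As an illustration of these layout points (a sketch, not the simulator's kernel), the partial dot product below uses a grid-stride loop so that consecutive threads read consecutive global-memory elements (coalesced loads) and a sequential-addressing shared-memory reduction so that active threads touch neighboring shared-memory words rather than a strided pattern that would serialize on bank conflicts.

```cpp
// Illustrative partial dot product for the Krylov solver's reductions.
// Launch with a power-of-two block size and blockDim.x * sizeof(double)
// bytes of dynamic shared memory.
__global__ void dot_partial(const double* __restrict__ x,
                            const double* __restrict__ y,
                            double* block_sums, int n)
{
    extern __shared__ double s[];                 // one slot per thread
    const int tid = threadIdx.x;
    double sum = 0.0;

    // Grid-stride loop: consecutive threads read consecutive elements,
    // so global-memory accesses coalesce.
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        sum += x[i] * y[i];

    s[tid] = sum;
    __syncthreads();

    // Sequential-addressing reduction: threads read s[tid] and s[tid + offset],
    // avoiding the strided shared-memory access that causes bank conflicts.
    for (int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if (tid < offset) s[tid] += s[tid + offset];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = s[0];  // per-block partial result
}
```

The per-block sums are then combined on the host (or in a second small kernel) and, across subdomains, with an MPI reduction; a library dot-product routine is an equivalent alternative to hand coding this step.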
To offload computationally intensive components of the simulator to the GPGPU, several programming choices are available. These include direct CUDA programming, OpenCL, DirectCompute, CUDA C/C++, CUDA Fortran, or a directive-based approach through the PGI accelerator (ACC) directives, which are one of the foundations of the new OpenACC standard used to offload kernels from the CPU to accelerator devices such as the GPGPU. To evaluate suitability for production-level application development, we chose the directive-based method and, where necessary, supplemented it with CUDA Fortran. In this approach, the GPGPU code can be compiled and run or debugged on the CPU before generating the acceleration kernels for the GPGPU. This was useful to speed up the development process.
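For readers unfamiliar with the directive style, the fragment below is a generic example of this approach (our illustration in C/C++ syntax; the paper's implementation uses PGI accelerator directives, supplemented with CUDA Fortran where necessary). The annotated loop compiles as ordinary CPU code when the directive is ignored, which is the debugging path mentioned above, and is offloaded as a GPU kernel when built with an accelerator-enabled compiler.

```cpp
// Generic directive-based offload of an AXPY-style vector update
// (illustrative OpenACC usage, not the authors' directives).
void axpy_acc(int n, double alpha, const double* x, double* y)
{
    // The copyin/copy clauses stage the arrays on the device for this region;
    // without accelerator compilation the pragma is ignored and the loop runs on the CPU.
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```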
Fig. 4: An abstracted view of the x86 CPU + GPU Fermi accelerator architecture (x86 host with thread processors and host memory; an execution queue; device memory with RDMA; a Level 2 cache; and streaming multiprocessors with dual-warp issue, special-function units, and user-selectable hardware/software cache).
Fig. 5: Examples showing GPGPU threads accessing shared memory with no bank conflict (a: good, no bank conflict) and with bank conflicts (b, c: bad, 2-way bank conflict).
Multiparadigm Parallelization With Structured-Domain Partitions
The first-generation in-house parallel reservoir simulator was chosen for multiparadigm parallelization on the heterogeneous computing environment. The simulator uses a structured domain-partitioning scheme along the X- and Y-axes of the grid. In mixed paradigm, the X-dimension is subdivided into n slices, in which n is equal to the number of MPI processes. Each MPI process spawns m threads within the OMP-parallel regions. The Y-dimension is subdivided into m equal slices, one assigned to each thread to compute. This domain-partitioning scheme is simple but less flexible, and the load balancing may be poorer than that achieved by the unstructured domain-partitioning method discussed earlier; however, the code is well-suited to multiparadigm parallelization on the CPU-GPGPU architecture. In this experimentation, the solver component of the simulator can be optionally offloaded to the GPUs, whereas the rest of the simulator runs parallel on the CPUs. Alternatively, the entire simulator can run in mixed paradigm (MPI-OMP) on CPUs only.
The Fermi CUDA architecture supports the concurrent execution of multiple kernels. This is useful for codes with multiple independent tasks that can be executed together but is not useful for porting the linear solver. The most-efficient approach is to offload one subdomain per GPGPU in a kernel from one MPI process. In our compute environment, each compute node consists of two CPUs and two GPUs. Therefore, a configuration of two MPI processes per node is used; each MPI process drives one Fermi M2070Q. A piece of supplementary code is added for this purpose. The code was written with PGI's CUDA Fortran runtime library routines for GPGPU accelerators so that odd-numbered MPI processes would bind to GPGPU 1 and even-numbered processes would bind to GPGPU 0. The code is called at initialization before entering the accelerated region. Data transfer between CPU and GPU memory through the PCIe bus occurs simultaneously, and asynchronous communication has been implemented by means of cudaMemcpyAsync function calls to overlap remote direct-memory access (RDMA) with computational work. In this work, the asynchronous transfer of the [A] matrix and the right-hand-side vector is overlain with computation and is better than 95% effective, but the halo communication for MPI partitions is not overlapped. In the solver implementation, block-line solves are serial. Lines are organized into blocks of threads for massive SIMD-type parallelism. The GPGPU solver is recoded to coalesce memory access and to avoid shared-memory-bank conflicts in MV operations as well as in reduction operations. In particular, a Krylov-subspace method, such as Orthomin or GMRES, involves many vector dot products. Hand coding of this aspect must be performed to eliminate memory-bank conflicts, or suitable cuBLAS library functions can simply be used.
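The paper implements the device binding and the asynchronous staging with PGI's CUDA Fortran runtime routines; the sketch below shows the same two ideas with the plain CUDA runtime API in C++ (an assumed, illustrative equivalent with hypothetical names): even ranks bind to device 0 and odd ranks to device 1, and the matrix copy is issued on a stream so that host work can proceed until the data are actually needed.

```cpp
// Illustrative sketch: rank-to-GPU binding and asynchronous host-to-device
// staging (not the authors' CUDA Fortran supplementary code).
#include <mpi.h>
#include <cuda_runtime.h>

void bind_rank_to_gpu(MPI_Comm comm)
{
    int rank = 0;
    MPI_Comm_rank(comm, &rank);
    cudaSetDevice(rank % 2);   // even rank -> GPGPU 0, odd rank -> GPGPU 1
}

void stage_matrix_async(const double* A_host,   // page-locked (pinned) host copy of [A]
                        double* A_dev, size_t bytes, cudaStream_t stream)
{
    // The copy proceeds in the background only if A_host is pinned
    // (allocated with cudaHostAlloc or registered with cudaHostRegister).
    cudaMemcpyAsync(A_dev, A_host, bytes, cudaMemcpyHostToDevice, stream);

    // ... other host-side solver setup can overlap with the transfer here ...

    cudaStreamSynchronize(stream);   // block only when the device data are required
}
```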
On a single node, MPI communication involves only a shared-memory copy and does not generate InfiniBand traffic. Data on the GPGPU must be moved back to the CPU through the PCIe bus for this purpose. For multinode parallel applications, and without the use of NVIDIA's GPUDirect technology (NVIDIA 2013), three data copies are required for MPI, as illustrated in Fig. 6. This data traffic can reduce parallel scalability. The use of the GPUDirect technology was not possible in our testing because the InfiniBand interconnectivity hardware did not support it. The more important issue that reduces the attractiveness of spreading work over multiple compute nodes comes from the reduction of subdomain sizes and, therefore, of the workload on each GPGPU. This is contrary to what we want to do to speed up the application by distributing the work to more hardware.
Results for Multiparadigm Parallelization With CPU + GPGPU
To test the parallel performance in multiparadigm, the SPE5 compositional model (Killough and Kossack 1987) has been modified into several model sizes. The SPE5 model has six hydrocarbon components with the water-alternating-gas process. We ran the model for 1.5 years with water/CO2/water injection at 6-month cycles. The original model has been modified to several model sizes, as shown in Table 3. We expect both the sizes and shapes of the model to have an impact on GPGPU performance. Fig. 7 shows GPGPU-solver speedup factors against the serial (1-core) CPU solver time. Both the Fermi M2070Q and M2090Q results are plotted, in which the M2090Q has an average of a 20% better performance than the M2070Q. In the figure, the dependence of parallel performance on model sizes is very evident. At a subdomain size of 864,000 cells (1.728 million cells in two subdomains), a factor of 19 over serial CPU speed is realized for the solver on one compute node (two-GPGPU/two-CPU configuration). For smaller model sizes, the speedup multiples are more modest. Fig. 8 shows the GPGPU solver speedup against the mixed-paradigm (MPI-OMP = 2-6) parallel CPU-solver time. Our solver-timing comparisons are the actual overall time cost to solution, inclusive of all communication costs when they are not, or cannot be, overlapped. This comparison represents the real gain by offloading the solver code onto the GPGPU. For these models, a factor of approximately 3.7 was achieved for the solver on the M2090Q GPGPU at model sizes larger than approximately 1 million grid cells.
Fig. 6: Extra buffer copying needed to move halo data of subdomains for MPI communication in multicompute-node applications (shown without and with GPUDirect).
TABLE 3: MODEL SIZES AND SHAPES OF THE COMPOSITIONAL MODEL USED FOR TESTING MULTIPARADIGM PARALLELIZATION

Model Sizes | Model Shapes
25,000      | 50 × 50 × 10
49,000      | 70 × 70 × 10
100,000     | 100 × 100 × 10
225,000     | 150 × 150 × 10
400,000     | 200 × 200 × 10
972,000     | 180 × 180 × 30
1,200,000   | 200 × 200 × 30
1,728,000   | 240 × 240 × 30
The grid dimensions also have an impact on GPGPU parallel performance because kernel computation is organized into parallel blocks of threads assigned to the SM processors. Actual field-model dimensions are not in multiples of 16, which is the optimal dimension to use for GPGPU computing. The effect of model-grid dimensions on parallel performance is not investigated in the present study, but it may account for some of the scatter in the performance trend against model sizes.

With the solver running on the GPGPU and the remainder of the simulator running on the CPU in mixed paradigm (MPI-OMP = 2-6 configuration), Fig. 9 shows the multiparadigm (MPI-OMP-ACC) speedup factors for the complete runs relative to serial (one-core) simulation time. They are the comparisons of the actual wall times for the simulation runs on the respective hardware tested. Depending on the model sizes and versions of Fermi GPGPU, speedup factors from four to nine were achieved. Fig. 10 shows the multiparadigm speedup factors relative to mixed-paradigm parallel simulation running on the CPU only. This represents the performance gain achieved by offloading the solver component of the simulator onto the GPGPU for a single compute node. It is noted that, in Fig. 9, the multiparadigm-simulator speedup over serial one-core CPU speed shows a monotonically increasing trend with model sizes for the entire range of model sizes tested. In Fig. 10, the parallel GPGPU-CPU multiparadigm parallelization vs. parallel CPU mixed-paradigm parallelization shows a parabolic trend with a somewhat lower speedup factor for the largest models. This is because the relative parallel scalability of the simulator portion and of the solver portion of the simulator is different. The solver portion is more scalable than the simulator portion. As a result, the percentage of time spent in the solver becomes less for the largest models, which is reflected in the overall speedup factors.
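This behavior can be summarized with a simple Amdahl-type bound (our illustration; the fractions below are hypothetical, not measured values from the paper). If a fraction $f$ of the parallel-CPU runtime is spent in the solver and only the solver is accelerated by a factor $S_s$ on the GPGPU, the overall gain over the parallel CPU run is

$$S_{\text{overall}} = \frac{1}{(1 - f) + f/S_s},$$

so, for example, $f = 0.6$ with $S_s = 3.7$ gives $S_{\text{overall}} \approx 1.8$, whereas a smaller solver fraction of $f = 0.45$ gives only approximately 1.5 at the same solver speedup.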
To compare multicompute-node scalability, the 1.728-million-cell model is also solved by use of two, four, and six nodes. The scalability results for the GPGPU and the CPU parallel solver are illustrated in Fig. 11, in which the multinode speedup factors are normalized to the results of their respective single-node performance. Internode MPI communication for the GPGPU (without GPUDirect) will involve three buffer copies, with an additional buffer copy on the CPU. In addition, the subdomain size decreases as the model is subdivided into smaller and smaller subdomains, which reduces the parallel performance of the GPGPUs. These combinations of factors result in lower scalability for parallel GPGPU solve compared with parallel CPU solve as the subdomain size decreases.
Fig. 7: Solver speedup factors on the dual-GPU Fermi against serial (1-core) X5675 CPU time for the modified SPE5 problem of various model sizes on a single compute node.

Fig. 8: Solver speedup factors on the dual-GPU Fermi against parallel hex-core dual-X5675 CPU time for the modified SPE5 problem of various model sizes on a single compute node.

Fig. 9: Overall simulator speedup factors by use of multiparadigm parallel acceleration (GPGPU + CPU) surpassing serial 1-core CPU speed for the modified SPE5 problem of various model sizes on a single compute node.

Fig. 10: Overall simulator speedup factors by use of multiparadigm parallel acceleration (GPGPU + CPU) surpassing parallel CPU speed for the modified SPE5 problem of various model sizes on a single compute node.
Three compositional models (Models A, B, and C) with increasing model sizes were also tested. Model A has a 90 × 138 × 10 grid (124,200 cells) and is an eight-component model with 21 vertical wells and a 30-year simulation period. Model B has an 81 × 106 × 30 grid (257,580 cells) and is an eight-component model with eight complex wells and 7 years of simulation. Model C has a 199 × 1,188 × 26 grid (6,146,712 cells) and is a nine-component model with 231 complex wells and 7 years of simulation. A detailed comparison for the solver components is given in Table 4 for Model A and in Table 5 for Model B. Models A and B are solved on a single compute node. The solver speedup factors of the M2070Q Fermi over the parallel CPU solve were 1.99 and 2.22, respectively, for these two models. Note that the M2070Q is approximately 20% slower than the M2090Q for this code. For Model C, simulation runs on 7, 8, 10, and 16 compute nodes were conducted for both the GPU parallel-solve and the CPU parallel-solve options. The speedup factors comparing the GPU-solve option to the CPU-solve option are shown in Fig. 12. The GPU-solve speedup multiple surpassing the CPU parallel solve decreases from 3.0 to 1.45 when the number of compute nodes increases from seven to 16. The improvement in terms of overall simulator runtime is also indicated in the plot. It decreases from a factor of 1.5 to a factor of 1.15. These multinode results are consistent with the results from the modified SPE5 (Killough and Kossack 1987) problems that were previously discussed in detail. All models tested are actual field models with model dimensions that are not divisible by 16 (the size of a half-warp). To improve coalesced-memory access, it will be necessary to use pitched memory on the multidimensional arrays. This is not implemented in our current report and may be considered as a future required code enhancement for GPGPU computing. In general, the GPGPU solver requires large subdomain sizes for good performance. Although large models are solved on more compute nodes, the smaller subdomain partitions yield lower parallel performance on the GPGPU compared with CPU-only simulation.

Fig. 12: Performance comparison between the use of GPGPU and CPU parallel solve on multiple compute nodes for Model C, which is a 6.15-million-cell nine-component compositional model.
Summary and Conclusions
On a single compute node with dual hex-core Westmere X5675 CPUs, the parallel LSPS-preconditioned Krylov iterative solver in mixed paradigm (MPI-OMP) has better parallel performance running in OMP-only mode, whereas the parallel efficiency is comparable between mixed MPI-OMP and MPI-only modes. In all cases, parallel performance increases as model size decreases because of better cache efficiency.
Fig. 11: Multiple-compute-node parallel scalability for the 1.728-million-cell modified SPE5 compositional model. Each node consists of dual-GPGPU Fermi M2070Q and dual-CPU hex-core Westmere X5675.
TABLE 4: SPEEDUP-FACTOR COMPARISON FOR INDIVIDUAL SOLVER COMPONENTS OF MODEL-A SIMULATION

Solver Components | M2070Q GPU Parallel | Hex-Core Westmere CPU Parallel | CPU Serial
Preconditioning   | 208 seconds         | 440 seconds                    | 2,097 seconds
  Speedup factor  | 10.08               | 4.76                           | 1
MV multiply       | 109 seconds         | 197 seconds                    | 922 seconds
  Speedup factor  | 8.46                | 4.68                           | 1
Orthomin          | 12 seconds          | 42 seconds                     | 108 seconds
  Speedup factor  | 9.0                 | 2.57                           | 1
Overall solve     | 361 seconds         | 720 seconds                    | 3,276 seconds
  Speedup factor  | 9.07                | 4.55                           | 1

Overall parallel GPU/CPU solver time ratio = 9.07/4.55 = 1.99.
TABLE 5: SPEEDUP-FACTOR COMPARISON FOR INDIVIDUAL SOLVER COMPONENTS OF MODEL-B SIMULATION

Solver Components | M2070Q GPU Parallel | Hex-Core Westmere CPU Parallel | CPU Serial
Preconditioning   | 306 seconds         | 655 seconds                    | 2,671 seconds
  Speedup factor  | 8.73                | 4.07                           | 1
MV multiply       | 95 seconds          | 238 seconds                    | 913 seconds
  Speedup factor  | 9.61                | 3.84                           | 1
Orthomin          | 29 seconds          | 66 seconds                     | 241 seconds
  Speedup factor  | 8.3                 | 3.65                           | 1
Overall solve     | 450 seconds         | 999 seconds                    | 3,921 seconds
  Speedup factor  | 8.71                | 3.94                           | 1

Overall parallel GPU/CPU solver time ratio = 8.71/3.94 = 2.21.
On multiple compute nodes, MPI-only has better parallel performance than mixed paradigm (MPI-OMP). This is primarily a result of cache-line conflicts of OMP parallelization in the halo computation, which can be improved with thread-private workspace or fewer threads for the halo computation. Mixed paradigm in multinode computing has lower memory consumption as a result of the reduction in halo-cell storage of the fewer distributed subdomains. The choice of MPI-only, OMP-only, or mixed MPI-OMP is a runtime decision.
A multiparadigm-parallelization approach has been successfully implemented in the first-generation in-house parallel reservoir simulator. The simulator is massively parallel, and it uses a structured domain-partitioning scheme. The simulator portion runs in mixed paradigm (MPI-OMP) on the CPUs. The linear solver can run optionally on either the GPGPU (MPI-ACC) or the CPU (MPI-OMP). Numerical experimentation shows the following characteristics:
• On a single compute node, the solver speedup factor on the GPGPU over the CPU strongly depends on the model size. For the range of model sizes tested, with the Fermi M2090Q, a solver speedup factor of 6 to 19 over the serial CPU speed and a factor of 1.4 to 3.7 over the parallel CPU speed were achieved. The overall simulator speedup factor is 5 to 9 over the serial CPU speed, and 1.37 to 1.57 over the parallel CPU speed, in which only the solver was ported to run on the GPGPU.
• Multicompute-node parallel scalability is better for the CPU-only simulation runs than for the CPU + GPGPU simulation runs. The primary reason is that the GPGPU requires a large amount of parallel work for good performance, whereas the CPU is more cache-efficient at smaller subdomain sizes. The division of a model into more subdomains gives less work to each GPGPU, which yields lower performance. The secondary reason is that MPI data movement requires an extra memory copy.
• The various components of the solver (preconditioner, matrix-vector multiplication, Orthomin) show similar performance multiples comparing the GPGPU vs. the CPU parallel speedup. Solver time is dominated by the preconditioner, and Orthomin represents less than 10% of parallel-solver time. This is true for both the GPGPU and the CPU parallelization.
Reservoir simulators in production practice are complex software with a diverse collection of algorithms and methods. A simulator is a tightly coupled system with strong spatial and temporal dependencies. The parallel-data management and associated methods to achieve good scalability on the multicore CPU require significant know-how and effort. Few simulators have achieved massive parallel scalability. The many-core GPGPU is an accelerator device and has an architecture that is very different from that of the multicore CPU. Some aspects have been discussed in the paper. Significant re-engineering of code and data layout will be required to accelerate code on the GPGPU for reservoir simulators. In some cases, different methods and algorithms may need to be built altogether. Some simulator components may not be suitable for porting to the GPGPU. Our current research results indicate that although research on heterogeneous HPC hardware platforms for reservoir simulation should continue, it is not yet mature enough for production-level code development.
Acknowledgments
The authors would like to thank Saudi Aramco management for the permission to publish this paper. We also thank NVIDIA Corporation for providing access to their test cluster with the GPGPU Fermi M2090Q, on which the results for some of the test cases were generated.
References
Appleyard, J.R., Appleyard, J.D., Wakefield, M.A. et al. 2011. Accelerating Reservoir Simulators Using GPU Technology. Paper SPE 141402 presented at the 2011 SPE Reservoir Simulation Symposium, The Woodlands, Texas, 21-23 February. http://dx.doi.org/10.2118/141402-MS.
Christie, M.A. and Blunt, M.J. 2001. Tenth SPE Comparative Solution Project: A Comparison of Upscaling Techniques. Presented at the SPE Reservoir Simulation Symposium, Houston, 11-14 February. SPE-66599-MS. http://dx.doi.org/10.2118/66599-MS.
Feldman, M. 2012. Researchers Squeeze GPU Performance from 11 Big Science Apps. HPCwire (18 July 2012). http://archive.hpcwire.com/hpcwire/2012-07-18/researchers_squeeze_gpu_performance_from_11_big_science_apps.html.
Fung, L.S.K. and Dogru, A.H. 2008a. Parallel Unstructured-Solver Methods for Simulation of Complex Giant Reservoirs. SPE J. 13 (4): 440-446. http://dx.doi.org/10.2118/106237-PA.
Fung, L.S.K. and Dogru, A.H. 2008b. Distributed Unstructured Grid Infrastructure for Complex Reservoir Simulation. Paper SPE 113906 presented at the SPE Europec/EAGE Annual Conference and Exhibition, Rome, Italy, 9-12 June. http://dx.doi.org/10.2118/113906-MS.
Fung, L.S.K. and Mezghani, M.M. 2013. Machine, Computer Program Product and Method to Carry Out Parallel Reservoir Simulation. US Patent 8,433,551.
Killough, J.E. and Kossack, C.A. 1987. Fifth Comparative Solution Project: Evaluation of Miscible Flood Simulators. Presented at the SPE Symposium on Reservoir Simulation, San Antonio, Texas, 1-4 February. SPE-16000-MS. http://dx.doi.org/10.2118/16000-MS.
Klie, H., Sudan, H., Li, R. et al. 2011. Exploiting Capabilities of Many Core Platforms in Reservoir Simulation. Paper SPE 141265 presented at the 2011 SPE Reservoir Simulation Symposium, The Woodlands, Texas, 21-23 February. http://dx.doi.org/10.2118/141265-MS.
MPI: A Message-Passing Interface Standard. 1995. Message Passing Interface Forum, http://www.mpi-forum.org, 12 June.
Network-Based Computing Laboratory. 2008. MVAPICH2 1.2 User Guide. Columbus, Ohio: Ohio State University. http://www.compsci.wm.edu/SciClone/documentation/software/communication/MVAPICH2-1.2/mvapich2-1.2rc2_user_guide.pdf.
NVIDIA. 2012a. CUDA C Best Practices Guide, Version 5.0, October.
NVIDIA. 2012b. CUDA C Programming Guide, Version 5.0, October.
NVIDIA. 2013. GPUDirect Technology, CUDA Toolkit, Version 5.5 (19 July 2013). https://developer.nvidia.com/gpudirect.
OpenACC. 2013. The OpenACC Application Programming Interface. OpenACC Standard Organization, Version 2.0, June. http://www.openacc-standard.org.
OpenMP Application Program Interface. 2011. Version 3.1, July. http://www.openmp.org.
Saad, Y. and Schultz, M.H. 1986. GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems. SIAM J. Sci. Stat. Comput. 7 (3): 856-869.
The Portland Group. 2011. CUDA Fortran Programming Guide and Reference, Release 2011, Version 11.8, August.
Vinsome, P.K.W. 1976. Orthomin, an Iterative Method for Solving Sparse Sets of Simultaneous Linear Equations. Paper SPE 5729 presented at the 4th SPE Symposium on Numerical Simulation of Reservoir Performance, Los Angeles, California, 19-20 February. http://dx.doi.org/10.2118/5729-MS.
Vuduc, R., Chandramowlishwaran, A., Choi, J. et al. 2010. On the Limits of GPU Acceleration. In Proceedings of the 2010 USENIX Workshop on Hot Topics in Parallelism (HotPar), Berkeley, California, June.
Wallis, J.R., Kendall, R.P., and Little, T.E. 1985. Constrained Residual Acceleration of Conjugate Residual Methods. Paper SPE 13563 presented at the 8th SPE Reservoir Simulation Symposium, Dallas, 10-13 February. http://dx.doi.org/10.2118/13563-MS.
Zhou, Y. and Tchelepi, H.A. 2013. Multi-GPU Parallelization of Nested Factorization for Solving Large Linear Systems. Paper SPE 163588 presented at the 2013 SPE Reservoir Simulation Symposium, The Woodlands, Texas, 18-20 February. http://dx.doi.org/10.2118/163588-MS.
Larry S.K. Fung is Principal Professional of Computational Mod-
eling Technology in the EXPEC Advanced Research Center of
Saudi Aramco, which he joined in 1997. He is a chief developer
of Saudi Aramco's in-house massively parallel reservoir simulators GigaPOWERS and POWERS. During this time, Fung has built
several core-simulator components, such as the linear and
nonlinear solvers, distributed parallel-data infrastructure, multi-
scale-fracture multimodal-porosity simulation system, unstruc-
tured gridding, multilevel local grid refinement method, and
fully coupled implicit-well solver. Before that, he was a staff
engineer at Computer Modelling Group for 11 years and had
built several features for the simulators IMEX and STARS, which
include the systems for coupled geomechanics thermal simu-
lation, and naturally fractured reservoir simulation. Fung has
published more than 30 papers on reservoir-simulation meth-
ods and holds seven US Patents on the subject. He served on
the steering committees of the 2007 SPE Forum on 70% Recov-
ery and the 2009, 2011, and 2013 Reservoir Simulation Sympo-
sium, and he was Cochair of the 2010 SPE Forum on Reservoir
Simulation. Fung holds BSc and MSc degrees in civil and envi-
ronmental engineering from the University of Alberta and is a
registered professional engineer with the Association of Profes-
sional Engineers and Geoscientists of Alberta in Canada.
Mohammad O. Sindi is a petroleum-engineering system ana-
lyst of computational modeling technology in the EXPEC
Advanced Research Center of Saudi Aramco, which he
joined in 2003. He specializes in high-performance computing
and has had numerous publications with the Institute of Electri-
cal and Electronic Engineers, the Association of Computing
Machinery, Intel, NVIDIA, and SPE. Sindi holds a BS degree
from the University of Kansas and an MS degree from George
Washington University, both in computer science.
Ali H. Dogru is Chief Technologist of Computational Modeling
Technology in the EXPEC Advanced Research Center of Saudi
Aramco. He previously worked for Mobil R&D Company and
Core Labs Inc., both in Dallas. Dogru's academic experience
covers various research and teaching positions at the Univer-
sity of Texas at Austin, Technical University of Istanbul, Califor-
nia Institute of Technology, and Norwegian Institute of
Technology, and he is currently a visiting scientist at the Massa-
chusetts Institute of Technology. He holds several US patents,
an MS degree from the Technical University of Istanbul, and a
PhD degree from the University of Texas at Austin. Dogru
chaired the 2004-2008 SPE JPT Special Series Committee;
served on various SPE committees, including the 2008-2011 R&D Technical Section, the Editorial Review, and SPE Fluid
Mechanics; and he was the chairman of the 2012 Joint SPE/
SIAM Symposium on Mathematical Methods in Large Scale
Simulation. He was recipient of the SPE Reservoir Dynamics &
Description Award in 2008 and the SPE John Franklin Carll Dis-
tinguished Professional Award in 2012.