
HPC Best Practices for FEA

John Higgins, PE
Senior Application Engineer
1 © 2012 ANSYS, Inc. May 18, 2012
Agenda
• Overview
• Parallel Processing Methods
• Solver Types
• Performance Review
• Memory Settings
• GPU Technology
• Software Considerations
• Appendix



Overview
Basic information: a model, a machine
Output data: elapsed time

Need for speed:
• Implicit structural FEA codes
• Mesh fidelity continues to increase
• More complex physics being analyzed
• Lots of computations!


Overview
Basic information → Solver configuration → Output data

A model:
- Size / number of DOF
- Analysis type

A machine:
- Number of cores
- RAM

Output data: elapsed time

Analysing the model before launching the run may help you choose the most suitable solver configuration at the first attempt.




Overview
Basic information
  A model:
  - Size / number of DOF
  - Analysis type
  A machine:
  - Number of cores
  - RAM

Solver configuration
  Parallel processing method:
  - Shared Memory (SMP)
  - Distributed Memory (DMP)
  Solver type:
  - Direct (Sparse)
  - Iterative (PCG)
  Memory settings

Information during the solve
  Resource Monitor:
  - CPU
  - Memory
  - Disk
  - Network

Output data
  - Elapsed time
  - Equation solver computational rate
  - Equation solver effective I/O rate (bandwidth)
  - Total memory used (in-core / out-of-core?)


Parallel Processing


Parallel Processing – Hardware

Workstation/Server:
• Shared memory (SMP) → single box, or
• Distributed memory (DMP) → single box


Parallel Processing – Hardware

Cluster (workstation cluster, node cluster):
• Distributed memory (DMP) → multiple boxes, cluster

Parallel Processing – Hardware + Software

                     Laptop/Desktop or Workstation/Server    Cluster
ANSYS                YES                                     SMP (per node)
Distributed ANSYS    YES                                     YES


Distributed ANSYS Design Requirements

No limitation in simulation capability

Reproducible and consistent results

Support all major platforms



Distributed ANSYS Architecture

Domain decomposition approach


• Break the problem into N pieces (domains)
• “Solve” the global problem independently within each domain
• Communicate information across the boundaries as necessary

[Figure: a model split into four domains, assigned to Processors 1 to 4]
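
The domain decomposition idea can be illustrated with a toy sketch. It assumes a hypothetical 1D chain of elements (element e joins nodes e and e+1) and a simple contiguous split; the function name and the splitting rule are illustrative assumptions, not the algorithm Distributed ANSYS actually uses.

```python
def decompose(n_elements, n_domains):
    """Split a 1D chain of elements into contiguous domains (toy model).

    Element e connects nodes e and e+1. Returns the element range of each
    domain and the interface nodes shared by two neighbouring domains.
    """
    base, extra = divmod(n_elements, n_domains)
    domains, start = [], 0
    for d in range(n_domains):
        size = base + (1 if d < extra else 0)  # spread the remainder
        domains.append(range(start, start + size))
        start += size
    # The first node of every domain except the first is shared with the
    # domain on its left: this is where communication is needed.
    interfaces = [dom.start for dom in domains[1:]]
    return domains, interfaces


domains, interfaces = decompose(n_elements=10, n_domains=4)
print([list(d) for d in domains])
print("interface nodes:", interfaces)
```

Only the interface nodes require data exchange between domains; everything else can be processed independently.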


Distributed ANSYS Architecture
[Diagram: process 0 (host), process 1, …, process n-1, linked by interprocess communication]

Step 1: domain decomposition
Step 2: each process works on its own domain:

domain 0        domain 1        …    domain n-1
elem            elem            …    elem
assemble        assemble        …    assemble
solve           solve           …    solve
elem output     elem output     …    elem output

Step 3: combining results

Distributed ANSYS Solvers

Distributed sparse (default)
• Supports all analyses supported with DANSYS (linear, nonlinear, static, transient)

Distributed PCG
• For static and full transient analyses

Distributed LANPCG (eigensolver)
• For modal analyses


Benefits of Distributed ANSYS

The entire SOLVE phase is parallel
• More computations performed in parallel → faster solution time
Better speedups than SMP
• Can achieve > 4x on 8 cores (try getting that with SMP!)
• Can be used for jobs running on up to hundreds of cores
Can take advantage of resources on multiple machines
• A whole new class of problems can be solved!
• Memory usage and bandwidth scale
• Disk (I/O) usage scales (i.e. parallel I/O)


Solver Types


Solver Types
Solution Overview

Step                                     Share of CPU time
Prep Data                                5%
Element Formation                        10%
Solution Procedures, Global Assembly     10%
Solve [K]{x} = {b} (the “solver”)        70%
Element Stress Recovery                  5%

The equation solver dominates solution CPU time, so it deserves the most attention.
The equation solver also consumes the most system resources (memory and I/O).




System Resolution
Solver Architecture

[Figure: solver data flow. The database feeds Element Formation (emat and esav files, in-core data objects), then Symbolic Assembly (full file), then either the PCG solver or the Sparse solver, and finally Element Output (rst, rth files).]

Solver Types: SPARSE (Direct)

Files:
- LN09
- *.BCS: statistics from the Sparse solver
- *.full: assembled stiffness matrix


Solver Types: SPARSE (Direct)

PROS
- More robust with poorly conditioned problems (shells and beams)
- Solution always guaranteed
- Fast for the 2nd and later solves (multiple load cases)
CONS
- Factoring the matrix and solving are resource intensive
- Large memory requirements


Solver Types: PCG (Iterative)

- Minimization of the residual / potential energy (standard Conjugate Gradient method): {r} = {f} – [K]{u}
- Iterative process requiring a convergence test (PCGTOL)
- Preconditioned CG is used instead to reduce the number of iterations: the preconditioner [Q] ≈ [K]^-1, with [Q] cheaper to obtain than [K]^-1
- Number of iterations
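
The iteration described above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions: a small dense symmetric positive-definite matrix, a Jacobi (diagonal) preconditioner standing in for [Q], and a relative residual test whose `tol` only plays a role analogous to PCGTOL. The actual preconditioner and convergence criterion used by ANSYS are not described in these slides.

```python
def pcg(K, f, tol=1e-8, max_iter=1000):
    """Jacobi-preconditioned conjugate gradient for a small dense SPD matrix.

    Returns the solution {u} and the number of iterations used.
    """
    n = len(f)

    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    q = [1.0 / K[i][i] for i in range(n)]  # [Q] = inverse of diag([K])
    u = [0.0] * n
    r = list(f)                            # {r} = {f} - [K]{u} with {u} = 0
    z = [q[i] * r[i] for i in range(n)]    # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    f_norm = dot(f, f) ** 0.5
    for it in range(1, max_iter + 1):
        Kp = matvec(K, p)
        alpha = rz / dot(p, Kp)
        u = [u[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Kp[i] for i in range(n)]
        if dot(r, r) ** 0.5 <= tol * f_norm:   # convergence test
            return u, it
        z = [q[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return u, max_iter


u, iterations = pcg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
print(u, iterations)
```

A better preconditioner brings [Q][K] closer to the identity and lowers the iteration count, which is why an ill-conditioned [K] makes PCG slow.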


Solver Types: PCG (Iterative)
For ill-conditioned models, PCGTOL needs to be set to a lower value (1e-9 or 1e-10) so that ANSYS follows the same path (equilibrium iterations) as the direct solver.

Solver Types: PCG (Iterative)

Files:
- *.PC*
- *.PCS: iterative solver statistics


Solver Types: PCG (Iterative)

PROS
- Lower memory requirements
- Better suited for larger, well-conditioned problems
CONS
- Not useful with rigid or near-rigid body behavior
- Less robust with ill-conditioned models (shells and beams, inadequate boundary conditions (rigid body motions), considerably elongated elements, nearly singular matrices…): it is more difficult to approximate [K]^-1 with [Q]


Solver Types: PCG (Iterative)
Level Of Difficulty
The LOD number is available in the solver output (solve.out). It can also be seen, along with the number of PCG iterations required to reach a converged solution, in the jobname.PCS file.


Solver Types: PCG (Iterative)
Other ways to evaluate ill-conditioning

An error message is also an indication.

Although the message proposes changing a MULT coefficient, the model should be carefully reviewed first, and the SPARSE solver considered instead.


Solver Types
Comparative



Performance Review


Performance Review
Process Resource Monitoring (only available on Windows 7)
Windows Resource Monitor is a powerful tool for understanding how your system resources are used by processes and services in real time.


Performance Review
How to access Process Resource Monitoring:
- From the OS Task Manager (Ctrl + Shift + Esc), or
- Click Start, click in the Start Search box, type resmon.exe, and then press ENTER.


Performance Review
Process Resource Monitoring - CPU

[Figures: CPU usage with Shared Memory (SMP) and with Distributed Memory (DMP)]


Performance Review
Process Resource Monitoring - Memory

[Figures: memory usage before the solve and during the solve, with the corresponding information from solve.out]




Performance Review
ANSYS End Statistics

Basic information about the solve, including the total elapsed time, is directly available at the end of the solver output file (*.out), under Solution Information.


Performance Review

Other main output data to check:

Output data                   Description
Elapsed Time (sec)            Total time of the simulation
Solver rate (Mflops)          Speed of the solver
Bandwidth (Gbytes/s)          I/O rate
Memory Used (Mbytes)          Memory required
Number of iterations (PCG)    Available for PCG only


Performance Review
[Figures: where to find the elapsed time, solver rate, bandwidth, memory used and number of iterations in the PCG (*.PCS) and SPARSE (*.BCS) files]


Memory Settings




Memory Settings – Test Case 1
Test Case 1: “Small model” (needs 4 GB of scratch memory, less than the RAM)
Machine reference: 6 GB RAM, enough memory, but…

Setting                          Elapsed time
Default (BCSOPTION,,OPTIMAL)     146 sec
BCSOPTION,,INCORE                77 sec




Memory Settings – Test Case 2
Test Case 2: “Large model” (needs 21.1 GB of scratch memory, more than the RAM)
Machine reference: DELL M6400, 12 GB RAM, 2 SATA 7200 rpm disks in RAID 0
Do not force in-core mode when the available memory is not sufficient!

Setting                          Elapsed time
Default (BCSOPTION,,OPTIMAL)     1249 sec
BCSOPTION,,INCORE                4767 sec




Memory Settings
SPARSE: Solver Output Statistics >> Memory Checkup

In this example, 206 MB is available for the Sparse solver at the time of factorization.

This is sufficient to run in optimal out-of-core mode (which requires 126 MB) and obtain good performance.

If more than 1547 MB is available, the solver runs fully in-core, which gives the best performance.

Avoid the minimum out-of-core mode, which is used when less than 126 MB is available.

In-core: > 1547 MB | Optimal out-of-core: between 126 MB and 1547 MB (about 1.5 GB) | Minimum out-of-core: < 126 MB

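
The memory checkup above amounts to comparing the available memory with two thresholds reported by the solver. A minimal sketch, assuming a hypothetical helper `sparse_memory_mode` whose threshold arguments are read by hand from the solver output (126 MB and 1547 MB are the values of this example only):

```python
def sparse_memory_mode(available_mb, optimal_mb, incore_mb):
    """Return the sparse-solver memory mode that the available memory allows.

    optimal_mb and incore_mb are the model-specific requirements printed in
    the solver output statistics; they are inputs here, not computed.
    """
    if available_mb > incore_mb:
        return "in-core"
    if available_mb >= optimal_mb:
        return "optimal out-of-core"
    return "minimum out-of-core"


# Values from the example above: 206 MB available, 126 MB / 1547 MB thresholds.
mode = sparse_memory_mode(206, optimal_mb=126, incore_mb=1547)
print(mode)
```

This mirrors Test Case 2: forcing in-core mode when the in-core requirement exceeds the RAM is what made that run slower.
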
Memory Settings
SPARSE: 3 memory modes can be observed (listed from best to worst performance)

In-core mode (optional)
• Requires the most memory
• Performs no I/O
Optimal out-of-core mode (default)
• Balances memory usage and I/O
Minimum core mode (not recommended)
• Requires the least memory
• Performs the most I/O


Memory Settings – Test Case 3
Test Case 3: a trap to avoid: launching a run on a network drive (or any slow drive)

Working directory                Elapsed time
Local disk                       1998 sec
Slow disk (networked or USB)     3185 sec

Performance Review
PCG: Solver Statistics - *.PCS File>> Number of Iterations

[Figure: annotated *.PCS file. Callouts: number of cores used (SMP, DMP); level of difficulty from PCGOPT,Lev_Diff; “Important statistic!”]


Performance Review
PCG: Solver Statistics - *.PCS File>> Number of Iterations
Check the total number of PCG iterations:
• Fewer than 1000 iterations: good performance
• More than 1000 iterations: performance deteriorates; try increasing Lev_Diff on PCGOPT
• More than 3000 iterations: assuming you have tried increasing Lev_Diff, either abandon PCG and use the Sparse solver, or improve element aspect ratios, boundary conditions, and/or contact conditions

[Figures: *.PCS examples with < 1000 iterations and > 3000 iterations]
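
The rule of thumb above can be written as a small helper for post-processing many runs. This is only a sketch: `pcg_iteration_advice` is a hypothetical name, and the slide does not say how to treat exactly 1000 or 3000 iterations, so the boundary handling below is an assumption.

```python
def pcg_iteration_advice(iterations):
    """Map a total PCG iteration count to the advice from the slide."""
    if iterations < 1000:
        return "good performance"
    if iterations <= 3000:  # assumption: boundaries fall in the middle band
        return "deteriorated: try increasing Lev_Diff on PCGOPT"
    return "use the Sparse solver or improve aspect ratios, BCs and contact"


for count in (350, 1800, 4200):
    print(count, "->", pcg_iteration_advice(count))
```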


Performance Review
PCG: Solver Statistics - *.PCS File>> Number of Iterations

If there are too many iterations:
• Use parallel processing
• Use PCGOPT,Lev_Diff
• Refine your mesh
• Check for excessively high stiffness


Performance Review
PCG: Solver Statistics - *.PCS File>> Efficiency & Solver MemoryUsage



GPU Technology
The machine information now includes the GPU, in addition to the number of cores and the RAM.


GPU Technology

Graphics processing units (GPUs)
• Widely used for gaming and graphics rendering
• Recently made available as general-purpose “accelerators”
  – Support for double-precision arithmetic
  – Performance exceeding the latest multicore CPUs

• So how can ANSYS Mechanical make use of this new technology to reduce the overall time to solution?


GPU Technology – Introduction

CPUs and GPUs are used in a collaborative fashion, connected through a PCI Express channel.

CPU: multi-core processors
• Typically 4-12 cores
• Powerful, general purpose

GPU: many-core processors
• Typically hundreds of cores
• Great for highly parallel code


GPU Accelerator capability

Motivation
• Equation solver dominates solution time
  – Logical place to add GPU acceleration

Step                                       Share of solution time
Element Formation                          5%-30%
Solution Procedures                        5%-10%
Global Assembly                            5%-10%
Equation Solver (e.g., [A]{x} = {b})       60%-90%
Element Stress Recovery                    1%-10%


GPU Accelerator capability

“Accelerate” the sparse direct solver (Boeing/DSP)
• The GPU is only used to factor a dense frontal matrix
• The decision whether to send data to the GPU is based on the frontal matrix size:
  – Too small: too much overhead, stays on the CPU
  – Too large: exceeds GPU memory, stays on the CPU
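
The size-based decision can be pictured as a simple window test. The sketch below is hypothetical: the function name, the lower threshold and the memory figure are illustrative assumptions, since the slides do not give the actual cut-off values used by the solver.

```python
def factor_on_gpu(front_mb, min_front_mb, gpu_memory_mb):
    """Return True when a frontal matrix is worth sending to the GPU.

    min_front_mb: assumed size below which transfer overhead dominates.
    gpu_memory_mb: assumed GPU memory available for a front.
    """
    return min_front_mb <= front_mb <= gpu_memory_mb


# Hypothetical thresholds: 50 MB minimum front, 3 GB card.
for front in (10, 500, 4000):
    print(front, "MB ->", "GPU" if factor_on_gpu(front, 50, 3072) else "CPU")
```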


GPU Accelerator capability

Supported hardware
• Currently recommending NVIDIA Tesla 20-series cards
• Recently added support for Quadro 6000
• Requires the following items:
  – Larger power supply (1 card needs about 225 W)
  – Open 2x form factor PCIe x16 Gen2 slot
• Supported on Windows/Linux 64-bit

                      NVIDIA Tesla C2050    NVIDIA Tesla C2070    NVIDIA Quadro 6000
Power                 225 W                 225 W                 225 W
CUDA cores            448                   448                   448
Memory                3 GB                  6 GB                  6 GB
Memory bandwidth      144 GB/s              144 GB/s              144 GB/s
Peak speed (SP/DP)    1030/515 Gflops       1030/515 Gflops       1030/515 Gflops

ANSYS Mechanical SMP – GPU Speedup

[Figure: solver kernel speedups and overall speedups]

Test machine:
• Intel Xeon 5560 processors (2.8 GHz, 8 cores total)
• 32 GB of RAM
• Windows XP SP2 (64-bit)
• Tesla C2050 (ECC on; WDDM driver)


Distributed ANSYS – GPU Speedup @ 14.0

Vibroacoustic harmonic analysis of an audio speaker
• Direct sparse solver
• Quarter-symmetry model with 700K DOF:
  – 657424 nodes
  – 465798 elements
  – higher-order acoustic fluid elements (FLUID220/221)

Distributed ANSYS results (baseline is 1 core):
• With GPU, ~11x speedup on 2 cores!
• 15-25% faster than SMP with the same number of cores

Cores    GPU    Speedup
2        no     2.25
4        no     4.29
2        yes    11.36
4        yes    11.51

Windows workstation: two Intel Xeon 5530 processors (2.4 GHz, 8 cores total), 48 GB RAM, NVIDIA Quadro 6000

[Figure: speedup on 2 and 4 cores for SMP, DANSYS, SMP+GPU and DANSYS+GPU]

ANSYS Mechanical 14.0 Performance for Tesla C2075

V13sp-5 model: turbine geometry, 2,100K DOF, SOLID187 elements, static nonlinear, one iteration, direct sparse solver.

ANSYS Mechanical times in seconds (lower is better), on 1 socket and 2 sockets:

Cores    Xeon 5670 2.93 GHz Westmere (dual socket)    Xeon 5670 + Tesla C2075    GPU speedup
1        1848                                         444                        4.2x
2        1192                                         342                        3.5x
4        846                                          314                        2.7x
6        564                                          273                        2.1x
8        516                                          270                        1.9x
12       399                                          -                          -

Add a Tesla C2075 to use with 6 cores: now 46% faster than 12 cores, with 6 cores available for other tasks.

Results from an HP Z800 workstation, 2 x Xeon X5670 2.93 GHz, 48 GB memory, CentOS 5.4 x64; Tesla C2075, CUDA 4.0.17

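
The speedup labels on this chart can be checked with a few lines of arithmetic. The pairing of each time with a core count is inferred here from the chart labels (it is an assumption about the original figure), and it reproduces every printed ratio as well as the "46% faster" claim.

```python
# Elapsed times in seconds, keyed by core count (pairing inferred from the chart).
cpu_only = {1: 1848, 2: 1192, 4: 846, 6: 564, 8: 516, 12: 399}
with_gpu = {1: 444, 2: 342, 4: 314, 6: 273, 8: 270}

# Speedup brought by the GPU at equal core count.
gpu_speedup = {n: round(cpu_only[n] / with_gpu[n], 1) for n in with_gpu}

# 6 cores + GPU compared with 12 cores and no GPU, as a percentage gain.
six_plus_gpu_vs_twelve = round(100 * (cpu_only[12] / with_gpu[6] - 1))

print(gpu_speedup)
print(six_plus_gpu_vs_twelve, "% faster")
```

The GPU benefit shrinks as cores are added because the CPU-only factorization is itself getting faster, while the GPU share of the work stays roughly constant.
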
GPU Accelerator capability

[Figure: V13sp-5 benchmark (turbine model). Factorization speed (Mflops, 0 to 200000) versus front size (MB, 0 to 1300)]

ANSYS Mechanical – Multi-Node GPU

Solder Joint Benchmark (4 MDOF, creep strain analysis); model components: solder balls, mold, PCB

R14 Distributed ANSYS with and without GPU

Linux cluster: each node contains 12 Intel Xeon 5600-series cores, 96 GB RAM, NVIDIA Tesla M2070, InfiniBand

[Figure: total speedup with and without GPU at 16, 32 and 64 cores; bar labels 1.7x, 1.9x, 3.2x, 3.4x and 4.4x]

Results courtesy of MicroConsult Engineering, GmbH

Trends in Performance by Solver Type
Comparative Trends

[Figure: elapsed time versus number of DOF for sparse, sparse+gpu, PCG1 and PCG2, divided into areas I, II and III]

With multiple cores and GPUs, all trends can change because of differences in speedup.

3 areas can be defined:
I. SPARSE is more efficient
II. Either SPARSE or PCG can be used
III. The PCG solver works faster since it needs fewer I/O exchanges with the hard drive

You need to evaluate Sparse and PCG behavior and speedup on your own model!

Other Software Considerations

Tips and Tricks on performance gains


• Some considerations on scalability of DANSYS
• Working with solution differences
• Working with a case that does not (or hardly) scale
• Working with programmable features for parallel runs



Scalability Considerations

Load balance
Improvements on domain decomposition
Amdahl’s Law
• Algorithmic enhancements: every part of the code has to run in parallel
User-controllable items:
• Contact pair definitions: big contact pairs hurt load balance (one contact pair is put into one domain in our code)
• CE definitions: many CE terms hurt load balance and Amdahl’s law (a CE needs communication among the domains in which it is defined)
• Use the best and most suitable hardware possible (speed of the CPU, memory, I/O and interconnects)
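
Amdahl’s law explains why every part of the code needs to run in parallel. As a worked sketch (the 70% figure is borrowed from the CPU-time breakdown earlier in the deck purely as an illustration, not as a measured parallel fraction):

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Ideal speedup when only parallel_fraction of the run time scales."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)


# If only 70% of the run is parallel, 8 cores give about 2.6x, not 8x.
print(round(amdahl_speedup(0.70, 8), 2))
# A fully parallel run would scale perfectly.
print(amdahl_speedup(1.0, 8))
```

The serial 30% caps the speedup at 1 / 0.30, about 3.3x, no matter how many cores are added.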


Scalability Considerations: Contact

• Avoid overlapping contact surfaces if possible
• Define a half circle as the target; don’t define the full circle
• Split potential contact surfaces into smaller pieces
• Avoid defining the whole exterior surface as a one-piece target
• Break pairs into smaller pieces if possible
• Remember: one whole contact pair is processed on one processor (contact work cannot be spread out)

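
The effect of one large, indivisible contact pair on load balance can be illustrated with a toy model. The sketch assumes made-up work units and a simple greedy assignment (largest item first, to the least-loaded domain); it is not how Distributed ANSYS distributes work, only an illustration of why smaller pairs balance better.

```python
def imbalance(work_items, n_domains):
    """Assign indivisible work items to domains and return max load / mean load.

    A value of 1.0 means perfect balance; larger values mean some domains
    sit idle while the most loaded one finishes.
    """
    loads = [0.0] * n_domains
    for work in sorted(work_items, reverse=True):
        loads[loads.index(min(loads))] += work  # least-loaded domain
    return max(loads) / (sum(loads) / n_domains)


one_big_pair = [40] + [1] * 60      # one large contact pair + other work
split_pairs = [10] * 4 + [1] * 60   # same contact work in 4 smaller pairs

print(imbalance(one_big_pair, 4))
print(imbalance(split_pairs, 4))
```

With the single big pair, one domain carries 40 of the 100 work units while the others carry 20 each; after splitting, all four domains carry 25.
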
Scalability Considerations: Contact
Trim

• Avoid defining unused surfaces as contact or target, i.e. reduce the potential contact definition to a minimum:
• In rev. 12.0: use the new control “CNCHECK,TRIM”
• In rev. 11.0: turn NLGEOM,OFF when defining contact pairs in WB. WB automatically turns on a facility like “CNCHECK,TRIM” internally.


Scalability Considerations: Remote Load/Disp

Point load distribution (remote load)

All nodes connected to one RBE3 node have to be grouped into the same domain. This hurts load balance! Try to reduce the number of RBE3 nodes.

[Figure: a point moment distributed to the internal surface of the hole, and the deformed shape]

Example of Bonded Contact and Remote Loads: Universal Joint Model

• 14 bonded contact pairs
• Internal CEs generated by bonded contact
• Torque defined by RBE3 on the end surface only: good practice

This model has small pieces of contact and RBE3, so it scales well in DANSYS.


Working With Solution Differences in Parallel Runs

Most solution differences come from contact applications when comparing NP = 1 with NP = 2, 3, 4, 5, 6, 7, …
• Check the contact pairs to make sure there is no bifurcation, and plot the deformations to examine the case.
• Tighten the CNVTOL convergence tolerance to check solution accuracy. If the solutions differ by less than, say, 1%, the difference can come from parallel computing affecting convergence, and all solutions are acceptable.
• If the solution is well defined and all input settings are correct, report the case to ANSYS, Inc. for investigation.


Working With a Case of Poor Scalability

No scalability (speedup) at all (or even slower than NP = 1)

• Is the problem too small (normally the number of DOFs should be greater than 50K)?
• Do I have a slow disk, or is the problem so big that the I/O size exceeds the memory I/O buffer?
• Is every node of my machines connected to the public network?
• Look at the scalability of the solver first, not of the entire run (e.g. /prep7 time is mostly not scalable)
• Resume data at the /solu level, and don’t read in the input files on every run
• etc.


Working With a Case of Poor Scalability

Yes, I have scalability, but it is poor (say, speedup < 2X)

• Is this GigE or another slow interconnect?
• Are all processors sharing one disk (SF mount)?
• Are other people running jobs on the same machine at the same time?
• Do I have many big contact pairs, or remote loads or displacements tied to major portions of the model?
• Am I using a generation of dual/quad cores where the memory bandwidth is totally shared within a core?
• Look at the scalability of the solver first, not of the entire run (e.g. /prep7 time is mostly not scalable)
• Resume data at the /solu level, and don’t read in the input files on every run
• etc.


APPENDIX



Platform MPI Installation for ANSYS 14

Note for ANSYS Mechanical R13 users

- Do not uninstall HP-MPI; it is required for compatibility with R13.
- Verify that HP-MPI is installed in its default location, “C:\Program Files (x86)\Hewlett-Packard\HP-MPI”; this is required for ANSYS Mechanical R13 to execute properly.


Platform MPI Installation for ANSYS 14
- Run “setup.exe” from the ANSYS R14 installation as Administrator
- Install Platform MPI
- Follow the Platform MPI installation instructions


Platform MPI Installation for ANSYS 14

Note for ANSYS Mechanical R13 users

ANSYS Mechanical customers who have R13 installed and wish to continue using R13 should run the following command to ensure compatibility:
"%AWP_ROOT140%\commonfiles\MPI\Platform\8.1.2\Windows\HPMPICOMPAT\hpmpicompat.bat"
(by default: “C:\Program Files\ANSYS Inc\v140\commonfiles\MPI\Platform\8.1.2\Windows\HPMPICOMPAT\hpmpicompat.bat”)
The command will display a dialog box with the title "ANSYS 13.0 SP1 Help".


Platform MPI Installation for ANSYS 14
To finish the installation:
- Go to %AWP_ROOT140%\commonfiles\MPI\Platform\8.1.2\Windows\setpcmpipassword.bat
(by default: “C:\Program Files\ANSYS Inc\v140\commonfiles\MPI\Platform\8.1.2\Windows\setpcmpipassword.bat”)
- Run "setpcmpipassword.bat", type your Windows user password and press Enter


Test MPI Installation for ANSYS 14
The installation is now finished. To verify that it works properly:
- Edit the file "test_mpi14.bat" provided in the .zip:
"c:\program files\ansys inc\v140\ansys\bin\winx64\ansys140" -mpitest -mpi pcmpi -np 2
- Change the ANSYS path and the number of processors if necessary (-np x)
- Save and run the file "test_mpi14.bat"

[Figure: expected output of the MPI test]


Test Case – Batch launch (Solver Sparse)
- The file "cube_sparse_hpc.txt" is an input file for a simple analysis (pressure on a cube).
- Edit the file "job_sparse.bat" and change the ANSYS path and/or the number of processors if necessary.
- You can change the number of mesh divisions of the cube to test the performance of your machine (-ndiv xx).
- Save and run the file "job_sparse.bat".

Information about the file "job_sparse.bat":

Option         Meaning
-b             batch
-j             jobname
-i             input file
-o             output file
-np            number of processors
-ndiv          number of divisions (for this example only)
-acc nvidia    use GPU acceleration
-mpi pcmpi     Platform MPI
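
The options listed for "job_sparse.bat" can be assembled programmatically when many runs have to be launched. This is a sketch only: `build_ansys_command` is a hypothetical helper, the jobname and output file name are made up, and only the flags documented on this slide are used (the example-specific -ndiv option is left out).

```python
def build_ansys_command(exe, jobname, input_file, output_file, np, gpu=False):
    """Build a batch launch command line as a list of arguments."""
    cmd = [exe, "-b",            # batch mode
           "-j", jobname,        # jobname
           "-i", input_file,     # input file
           "-o", output_file,    # output file
           "-np", str(np),       # number of processors
           "-mpi", "pcmpi"]      # Platform MPI
    if gpu:
        cmd += ["-acc", "nvidia"]  # GPU acceleration
    return cmd


cmd = build_ansys_command(
    r"c:\program files\ansys inc\v140\ansys\bin\winx64\ansys140",
    jobname="cube",                      # hypothetical jobname
    input_file="cube_sparse_hpc.txt",
    output_file="cube.out",              # hypothetical output name
    np=4,
    gpu=True,
)
print(" ".join(cmd))
```

Returning a list rather than a single string keeps the path with spaces intact if the command is later passed to a process launcher.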


Test Case – Batch launch (Solver Sparse)
- You can check that your processors are running with the Windows Task Manager (Ctrl+Shift+Esc).
Example with 6 processes requested:

[Figure: Windows Task Manager during the run]

Advice: do not request all the available processors if you want to do something else during the run.

Test Case – Batch launch (Solver Sparse)
Once the run is finished:
- Read the .out file to collect all the information about the solver output.
- The main items are:
  - Elapsed time (sec)
  - Latency time from master to core
  - Communication speed from master to core
  - Equation solver computational rate
  - Equation solver effective I/O rate


Test Case – Workbench launch
- Open a Workbench project with ANSYS R14
- Open Mechanical
- Go to: Tools -> Solve Process Settings… -> Advanced…
- Check "Distributed Solution", specify the number of processors to use, and enter the additional command (-mpi pcmpi)
- GPU acceleration can also be enabled


Test Case – Workbench launch
In the Analysis Settings:
- You can choose the solver type (Direct = Sparse, Iterative = PCG)
- Solve your model
- Read the solver output from Solution Information


Appendix

Automated run for a model


Compare customer results with ANSYS reference
First step for an HPC test on customer machine



General view
The goal of this Excel file is twofold:
• It writes the batch launch commands for multiple analyses to a file (job.bat)
• It extracts information from the different solve.out files and writes it to Excel

[Figure: the Excel sheet, with its INPUT DATA and OUTPUT DATA areas]


INPUT DATA

Names of the machines used for the solve with PCMPI (up to 3). Not required if the solve is performed on a single machine.


INPUT DATA

Input                Description                              Choice
Machine              Number of machines used                  1, 2 or 3
Solver               Type of solver used                      sparse or pcg
Division             Division of the edge for meshing         Any integer
Release              Select ANSYS release                     140 or 145
GPU                  Use GPU acceleration                     yes or no
np total             Total number of cores                    No choice (value calculated)
np / machine         Number of cores per machine              Any integer
PCG level            Only available for PCG solver            1, 2, 3 or 4
Simulation method    Shared Memory or Distributed Memory      SMP or DMP

INPUT DATA

Create a job.bat file from all the input data given in the Excel file.




OUTPUT DATA

Read the information from all the *.out files.

Note: all the files must be in the same directory.
If a *.out file is not found, a pop-up appears:
- Continue: skip this file and go to the next one
- STOP: stop reading the remaining *.out files


OUTPUT DATA

Output data                   Description
Elapsed Time (sec)            Total time of the simulation
Solver rate (Mflops)          Speed of the solver
Bandwidth (Gbytes/s)          I/O rate
Memory Used (Mbytes)          Memory required
Number of iterations (PCG)    Available for PCG only


OUTPUT DATA
All this information (elapsed time, solver rate, bandwidth, memory used, number of iterations) is extracted from the *.out files.

[Figures: example *.out extracts for PCG and SPARSE]


OUTPUT DATA

Hyperlinks are automatically created to open the different *.out files directly from Excel.

Note: if an error occurred during the solve (*** ERROR ***), it is automatically highlighted in the Excel file.
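
The same error check can be done outside Excel. A minimal sketch, assuming only that errors are flagged by the literal marker "*** ERROR ***" mentioned above; the function name and the sample text are hypothetical, and this is not the Excel tool's own code.

```python
ERROR_MARKER = "*** ERROR ***"


def error_lines(out_text):
    """Return the 1-based line numbers of a solver output that flag an error."""
    return [number
            for number, line in enumerate(out_text.splitlines(), start=1)
            if ERROR_MARKER in line]


# Hypothetical output fragment, for illustration only.
sample = "solution started\n *** ERROR ***   something went wrong\nend of run"
print(error_lines(sample))
```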


And now: we look forward to your feedback on your results.


For any suggestions or questions about improving the Excel tool:
gabriel.messager@ansys.com


THANK YOU
