
LS-DYNA Performance Benchmarks and Profiling
January 2009

Note
The following research was performed under the HPC Advisory Council activities
AMD, Dell, Mellanox
HPC Advisory Council Cluster Center

The participating members would like to thank LSTC for their support and guidelines
The participating members would like to thank Sharan Kalwani, HPC Automotive specialist, for his support and guidelines
For more info please refer to
www.mellanox.com, www.dell.com/hpc, www.amd.com

LS-DYNA
A general-purpose structural and fluid analysis simulation software package capable of simulating complex real-world problems
Developed by the Livermore Software Technology Corporation (LSTC)

LS-DYNA is used by
Automobile
Aerospace
Construction
Military
Manufacturing
Bioengineering

LS-DYNA
LS-DYNA SMP (Shared Memory Processing)
Optimized to exploit multiple CPUs within a single machine

LS-DYNA MPP (Massively Parallel Processing)
The MPP version of LS-DYNA allows the solver to run over a high-performance computing cluster
Uses message passing (MPI) to obtain parallelism

Many companies are switching from SMP to MPP
For cost-effective scaling and performance
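For illustration only (not LSTC code), a message-passing program has the following minimal shape in C with MPI; in an MPP solver each such rank owns one subdomain of the model and exchanges boundary data with its neighbors:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);                 /* start the MPI runtime  */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id      */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total process count    */

        /* In an MPP solver, each rank would own one subdomain of the
           finite-element mesh and exchange boundary data via MPI calls. */
        printf("rank %d of %d ready\n", rank, size);

        MPI_Finalize();
        return 0;
    }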

Objectives
The presented research was done to provide best practices for:
LS-DYNA performance benchmarking
Interconnect performance comparisons
Ways to increase LS-DYNA productivity
Understanding the LS-DYNA communication pattern
MPI library comparisons
Power-aware considerations

Test Cluster Configuration

Dell PowerEdge SC 1435 24-node cluster

Quad-Core AMD Opteron Model 2358 processors (Barcelona)

Mellanox InfiniBand ConnectX DDR HCAs

Mellanox InfiniBand DDR Switch

Memory: 16GB memory, DDR2 667MHz per node

OS: RHEL5U2, OFED 1.3 InfiniBand SW stack

MPI: HP MPI 2.2.7, Platform MPI 5.6.5

Application: LS-DYNA MPP971

Benchmark Workload
Three Vehicle Collision Test simulation
Neon-Refined Revised Crash Test simulation

Mellanox InfiniBand Solutions

Industry standard
Hardware, software, cabling, management
Designed for clustering and storage interconnect

Performance
40Gb/s node-to-node
120Gb/s switch-to-switch
1us application latency
Most aggressive roadmap in the industry

Reliable with congestion management

Efficient
RDMA and transport offload
Kernel bypass
CPU focuses on application processing

Scalable for Petascale computing & beyond
End-to-end quality of service
Virtualization acceleration
I/O consolidation including storage

[Chart: "The InfiniBand Performance Gap is Increasing" -- InfiniBand bandwidth roadmap from 20Gb/s to 240Gb/s (12X) compared with Ethernet and Fibre Channel; InfiniBand delivers the lowest latency]



Quad-Core AMD Opteron Processor

Performance
Quad-Core
Dual Channel Reg DDR2
Enhanced CPU IPC
4x 512K L2 cache
2MB L3 Cache
Direct Connect Architecture
HyperTransport technology, up to 24 GB/s
Floating Point
128-bit FPU per core
4 FLOPS/clk peak per core
Memory
1GB Page Support
DDR2-667 MHz

Scalability
48-bit Physical Addressing

Compatibility
Same power/thermal envelopes as Second-Generation AMD Opteron processors

[Diagram: Direct Connect Architecture -- four processors connected by HyperTransport links (8 GB/s each) to PCI-E bridges, USB, I/O hub and PCI]

Dell PowerEdge Servers helping Simplify IT


System Structure and Sizing Guidelines
24-node cluster built with Dell PowerEdge SC 1435 Servers
Servers optimized for High Performance Computing environments
Building Block Foundations for best price/performance and performance/watt
Dell HPC Solutions
Scalable Architectures for High Performance and Productivity
Dell's comprehensive HPC services help manage the lifecycle requirements.
Integrated, Tested and Validated Architectures
Workload Modeling
Optimized System Size, Configuration and Workloads
Test-bed Benchmarks
ISV Applications Characterization
Best Practices & Usage Analysis

LS-DYNA Performance Results - Interconnect

InfiniBand high-speed interconnect enables the highest scalability
Performance gains with cluster size
Performance over GigE does not scale
Slowdown occurs as the number of processors increases beyond 16 nodes
[Charts: LS-DYNA Neon Refined Revised and 3 Vehicle Collision -- elapsed time (seconds) vs number of nodes (4 to 24 nodes, 32 to 192 cores), InfiniBand vs GigE; lower is better]

LS-DYNA Performance Results - Interconnect

[Charts: Performance advantage of InfiniBand over GigE for 3 Vehicle Collision and Neon Refined Revised -- percentage advantage vs number of nodes (4 to 24 nodes, 32 to 192 cores)]

InfiniBand outperforms GigE by up to 132%
As the node count increases, a bigger advantage is expected

LS-DYNA Performance Results - CPU Affinity

[Chart: LS-DYNA 3 Vehicle Collision, CPU affinity vs non-affinity -- elapsed time (seconds) vs number of nodes (4 to 24 nodes, 32 to 192 cores)]

CPU affinity accelerates performance by up to 10%
Saves up to 177 seconds per simulation
Lower is better
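For reference, CPU affinity simply pins a process to a fixed core so the scheduler cannot migrate it away from its cache and local NUMA memory. A minimal Linux sketch using the standard sched_setaffinity() call (illustrative only; in practice the MPI launcher's binding options or tools such as numactl/taskset are used):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(2, &mask);    /* pin the calling process to core 2 */

        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... solver work now stays on core 2, keeping its cache warm
           and its memory accesses on the local NUMA node ... */
        return 0;
    }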

LS-DYNA Performance Results - Productivity

InfiniBand increases productivity by allowing multiple jobs to run simultaneously
Providing the productivity required for virtual vehicle design

Three cases are presented
Single job over the entire system (with CPU affinity)
Two jobs, each on a single CPU per server (job placement, CPU affinity)
Four jobs, each on two CPU cores per CPU per server (job placement, CPU affinity)

Running four parallel jobs increases productivity by 97% for Neon Refined Revised and 57% for the 3 Car Collision case (see the jobs-per-day estimate below)
An increased number of parallel processes (jobs) increases the load on the interconnect
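A rough way to read the jobs-per-day charts (an illustrative estimate, not taken from the slides): if N jobs run concurrently and each finishes in T_N seconds, then

    jobs per day ≈ N × 86,400 / T_N

For a purely hypothetical example, four concurrent jobs that each take 1,800 seconds deliver about 4 × 86,400 / 1,800 = 192 jobs per day, even though each individual job runs slower than a single job spread across the whole cluster.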

[Charts: Jobs per day for LS-DYNA Neon Refined Revised and 3 Vehicle Collision -- 1 job vs 2 parallel jobs vs 4 parallel jobs, vs number of nodes (4 to 24 nodes, 32 to 192 cores); higher is better]

A high-speed, low-latency interconnect solution is required for gaining high productivity

LS-DYNA Profiling - Data Transferred


[Chart: LS-DYNA MPI profiling, 3 Vehicle Collision -- total data transferred (MB, log scale) per message-size bucket (0-64B up to 4MB and above), for 4 to 24 nodes]

The majority of the data transfer is done with 256B-4KB messages
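Message-size histograms like the one above are typically gathered with the standard PMPI profiling interface, by interposing thin wrappers around the MPI calls. A minimal sketch for MPI_Send (illustrative only, not the profiler actually used for these measurements; a real tool wraps many more calls, and the MPI_Send prototype must match the installed MPI headers):

    #include <mpi.h>

    /* Byte histogram buckets roughly matching the profile above */
    static long long nmsg[4];   /* <256B, 256B-1KB, 1KB-4KB, >=4KB */

    static void account(int count, MPI_Datatype type)
    {
        int tsize;
        MPI_Type_size(type, &tsize);
        long bytes = (long)count * tsize;
        if      (bytes < 256)  nmsg[0]++;
        else if (bytes < 1024) nmsg[1]++;
        else if (bytes < 4096) nmsg[2]++;
        else                   nmsg[3]++;
    }

    /* Interpose MPI_Send: record the size, then call the real PMPI_Send.
       Prototype per MPI-3; older MPI-2 headers use a non-const buffer.  */
    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        account(count, type);
        return PMPI_Send(buf, count, type, dest, tag, comm);
    }

Such a wrapper is compiled separately and linked ahead of the MPI library so that the application's calls resolve to it first.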



LS-DYNA Profiling - Message Distribution


[Chart: LS-DYNA MPI profiling, 3 Vehicle Collision -- number of messages (log scale) per message-size bucket (0-64B up to 4MB and above), for 4 to 24 nodes]

The majority of the messages are in the range of 2B-4KB
2B-256B messages are mainly for synchronization, 256B-4KB for data communications

LS-DYNA Profiling - Message Distribution


[Chart: LS-DYNA MPI profiling, 3 Vehicle Collision -- percentage of total messages per message-size bucket (0-64B, 65-256B, 257B-1KB, 1KB-4KB, 4KB-16KB), for 4 to 24 nodes]

As the number of nodes scales, the percentage of small messages increases
The percentage of 256B-1KB messages is relatively consistent with cluster size
The actual number of messages increases with cluster size

LS-DYNA Profiling - MPI Collectives

Two key MPI collective functions in LS-DYNA
MPI_Allreduce
MPI_Bcast
They account for the majority of the MPI communication overhead (see the sketch below)
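To show what these two calls look like, here is a minimal, self-contained C sketch (an illustration of typical explicit-solver usage, not LS-DYNA source; compute_local_dt() is a hypothetical placeholder): every rank contributes its local stable time step to a global MPI_Allreduce, and rank 0 broadcasts a small block of control data with MPI_Bcast.

    #include <mpi.h>

    /* Hypothetical placeholder for a rank-local time-step computation */
    static double compute_local_dt(void) { return 1.0e-6; }

    int main(int argc, char **argv)
    {
        double local_dt, global_dt, control[8] = {0};

        MPI_Init(&argc, &argv);

        /* Global minimum stable time step across all subdomains */
        local_dt = compute_local_dt();
        MPI_Allreduce(&local_dt, &global_dt, 1, MPI_DOUBLE, MPI_MIN,
                      MPI_COMM_WORLD);

        /* Rank 0 distributes a small block of control data to everyone */
        MPI_Bcast(control, 8, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }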
[Chart: MPI collectives -- MPI_Allreduce and MPI_Bcast share of total MPI overhead (%) vs number of nodes (4 to 24 nodes, 32 to 192 cores)]

MPI Collective Benchmarking


MPI collective performance comparison
Two frequently called collective operations in LS-DYNA were benchmarked
MPI_Allreduce
MPI_Bcast

Platform MPI shows better latency for the MPI_Allreduce operation
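Collective latencies like those in the charts below can be measured with a simple timing loop around the operation. A minimal sketch of that methodology (assumed, with an arbitrary 128-byte payload and iteration count):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        enum { ITERS = 1000, COUNT = 16 };   /* 16 doubles = 128-byte payload */
        double in[COUNT], out[COUNT], t0, t1;
        int i, rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (i = 0; i < COUNT; i++) in[i] = 1.0;

        MPI_Barrier(MPI_COMM_WORLD);         /* align all ranks before timing */
        t0 = MPI_Wtime();
        for (i = 0; i < ITERS; i++)
            MPI_Allreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("MPI_Allreduce average latency: %.2f usec\n",
                   (t1 - t0) / ITERS * 1e6);

        MPI_Finalize();
        return 0;
    }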

[Charts: MPI_Allreduce and MPI_Bcast latency (usec) vs message size (0-512 bytes), HP-MPI vs Platform MPI]

LS-DYNA with Different MPI Libraries


LS-DYNA performance comparison
Each MPI library shows different benefits for latency and collectives
As a result, HP-MPI and Platform MPI show comparable performance
[Charts: LS-DYNA Neon Refined Revised and 3 Vehicle Collision -- elapsed time (seconds) vs number of nodes (4 to 24 nodes, 32 to 192 cores), Platform MPI vs HP-MPI; lower is better]

LS-DYNA Profiling Summary - Interconnect

LS-DYNA was profiled to determine its networking dependency

Majority of the data is transferred between compute nodes
Done with 256B-4KB messages; the amount of data transferred increases with cluster size

Most used message sizes
<64B messages, mainly synchronization
64B-4KB messages, mainly compute related

Message size distribution
Percentage of smaller messages (<64B) increases with cluster size
Mainly due to the needed synchronization
Percentage of mid-size messages (64B-4KB) stays the same with cluster size
Compute transactions increase with cluster size
Percentage of very large messages decreases with cluster size
Mainly used for problem data distribution at the simulation initialization phase

LS-DYNA interconnect sensitivity points (see the ping-pong sketch below)
Interconnect latency and throughput for the 64B-4KB message range
Collective operations performance, mainly MPI_Allreduce
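One common way to probe the first sensitivity point is a two-rank ping-pong over the message range of interest. A minimal sketch (assumed methodology; a 1KB payload is chosen as a mid-range example):

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        enum { ITERS = 10000, BYTES = 1024 };   /* mid-range message size */
        char buf[BYTES];
        double t0, t1;
        int i, rank;
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, 0, sizeof(buf));

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < ITERS; i++) {
            if (rank == 0) {                    /* rank 0 sends, then waits */
                MPI_Send(buf, BYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, BYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {             /* rank 1 echoes back       */
                MPI_Recv(buf, BYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, BYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("one-way latency at %d bytes: %.2f usec\n", BYTES,
                   (t1 - t0) / (2.0 * ITERS) * 1e6);

        MPI_Finalize();
        return 0;
    }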

Test Cluster Configuration - System Upgrade

The following results were achieved after a system upgrade (changes noted in parentheses)
Dell PowerEdge SC 1435 24-node cluster
Quad-Core AMD Opteron Model 2382 processors (Shanghai) (vs Barcelona in previous configuration)
Mellanox InfiniBand ConnectX DDR HCAs
Mellanox InfiniBand DDR Switch
Memory: 16GB memory, DDR2 800MHz per node (vs 667MHz in previous configuration)
OS: RHEL5U2, OFED 1.3 InfiniBand SW stack
MPI: HP MPI 2.2.7, Platform MPI 5.6.5
Application: LS-DYNA MPP971
Benchmark Workload
Three-Car Crash Test simulation
Neon-Refined Revised Crash Test simulation

Quad-Core AMD Opteron Processor

Performance
Quad-Core
Dual Channel Reg DDR2
Enhanced CPU IPC
4x 512K L2 cache
6MB L3 Cache
Direct Connect Architecture
HyperTransport technology, up to 24 GB/s peak per processor
Floating Point
128-bit FPU per core
4 FLOPS/clk peak per core
Integrated Memory Controller
Up to 12.8 GB/s
DDR2-800 MHz or DDR2-667 MHz

Scalability
48-bit Physical Addressing

Compatibility
Same power/thermal envelopes as 2nd / 3rd generation AMD Opteron processors

[Diagram: Direct Connect Architecture -- four processors connected by HyperTransport links (8 GB/s each) to PCI-E bridges, USB, I/O hub and PCI]

Performance Improvement
With the upgraded AMD CPUs and DDR2 memory, LS-DYNA run time decreased by more than 20%
Leveraging InfiniBand 20Gb/s for higher scalability
[Charts: LS-DYNA 3 Vehicle Collision and Neon Refined Revised -- elapsed time (seconds) vs number of nodes (4 to 24 nodes, 32 to 192 cores), Barcelona vs Shanghai; lower is better]

Maximize LS-DYNA Productivity


Scalable latency of InfiniBand and latest Shanghai
processor deliver scalable LS-DYNA performance
[Charts: Jobs per day for LS-DYNA Neon Refined Revised and 3 Vehicle Collision -- 1, 2, 4 and 8 parallel jobs vs number of nodes (4 to 24 nodes, 32 to 192 cores); higher is better]

LS-DYNA with Shanghai Processors


Shanghai processors provide higher performance compared to Barcelona

[Chart: LS-DYNA 3 Vehicle Collision, Shanghai vs Barcelona -- percentage of additional jobs per day for 1, 2 and 4 parallel jobs vs number of nodes (4 to 24 nodes, 32 to 192 cores)]

LS-DYNA Performance Results - Interconnect


InfiniBand 20Gb/s vs 10GigE vs GigE
InfiniBand 20Gb/s (DDR) outperforms 10GigE and GigE in all test cases
Reducing run time by up to 60% versus 10GigE and 61% versus GigE
Performance loss is seen beyond 16 nodes with 10GigE and GigE
InfiniBand 20Gb/s maintains scalability with cluster size
[Charts: LS-DYNA Neon Refined Revised and 3 Vehicle Collision (HP-MPI) -- elapsed time (seconds) vs number of nodes (4 to 24 nodes, 32 to 192 cores), GigE vs 10GigE vs InfiniBand; lower is better]

Power Consumption Comparison


[Chart: Power consumption in Wh per job, InfiniBand vs 10GigE vs GigE, for 3 Vehicle Collision and Neon Refined Revised (annotated savings of 50% and 62%)]

InfiniBand also enables power-efficient simulations
Reducing energy per job by up to 62% (24-node comparison)
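A back-of-the-envelope relation behind the Wh-per-job metric (an assumption about how it is computed, not stated explicitly in the slides):

    energy per job (Wh) ≈ average cluster power draw (W) × job elapsed time (s) / 3,600

So an interconnect that shortens the run time, or allows more jobs to share the same powered-on cluster, directly lowers the Wh-per-job figure.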

Conclusions
LS-DYNA is widely used to simulate many real-world problems
Automotive crash-testing and finite-element simulations
Developed by Livermore Software Technology Corporation (LSTC)

LS-DYNA performance and productivity rely on
Scalable HPC systems and interconnect solutions
Low-latency, high-throughput interconnect technology
NUMA-aware applications for fast access to local memory
Reasonable job distribution, which can dramatically improve productivity
Increasing the number of jobs per day while maintaining fast run times

Interconnect comparison shows
InfiniBand delivers superior performance and productivity at every cluster size
Scalability requires latency that is low and does not grow with cluster size
Lowest power consumption was achieved with InfiniBand
Savings in system power, cooling and real estate


Thank You
HPC Advisory Council
HPC@mellanox.com

All trademarks are the property of their respective owners. All information is provided "as is" without warranty of any kind. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein, and the HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.

