and Profiling
January 2009
Note
The following research was performed under the HPC Advisory
Council activities
AMD, Dell, Mellanox
HPC Advisory Council Cluster Center
LS-DYNA
LS-DYNA
A general purpose structural and fluid analysis simulation software
package capable of simulating complex real world problems
Developed by the Livermore Software Technology Corporation (LSTC)
LS-DYNA used by
Automobile
Aerospace
Construction
Military
Manufacturing
Bioengineering
LS-DYNA
LS-DYNA SMP (Shared Memory Processing)
Optimizes the power of multiple CPUs within a single machine
Objectives
The presented research was done to provide best practices
LS-DYNA performance benchmarking
Interconnect performance comparisons
Ways to increase LS-DYNA productivity
Understanding LS-DYNA communication pattern
MPI libraries comparisons
Power-aware consideration
Benchmark Workload
Three Vehicle Collision Test simulation
Neon-Refined Revised Crash Test simulation
Performance
40Gb/s node-to-node
120Gb/s switch-to-switch
1us application latency
Most aggressive roadmap in the industry
[Figure: interconnect bandwidth roadmap; labels range from 20Gb/s to 240Gb/s with 4X and 12X link widths, with Ethernet and Fibre Channel shown for comparison]
Performance
Quad-Core
Floating Point
128-bit FPU per core
4 FLOPS/clk peak per core
Memory
Dual Channel Reg DDR2
Compatibility
[Figure: processor block diagram with 8 GB/s links, PCI-E bridges, I/O hub, USB, and PCI; dated November 5, 2007]
[Figures: two charts comparing InfiniBand and GigE by number of nodes, from 4 nodes (32 cores) to 24 nodes (192 cores); y-axes 0 to 700 and 0 to 8000; lower is better]
[Figures: two "Performance Advantage" charts by number of nodes, from 4 nodes (32 cores) to 24 nodes (192 cores); y-axes 0% to 120% and 0% to 140%]
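One way to read a "performance advantage" percentage is as the relative reduction in elapsed time expressed as a speedup. The sketch below assumes the advantage is computed as (slower time / faster time - 1) x 100; the slides do not state their formula, and the elapsed times used here are made-up illustration values, not readings from the charts.

```python
def performance_advantage(baseline_seconds: float, improved_seconds: float) -> float:
    """Return the percent advantage of the improved run over the baseline.

    Assumed definition (not confirmed by the slides):
    (baseline / improved - 1) * 100, using elapsed times in seconds.
    """
    if baseline_seconds <= 0 or improved_seconds <= 0:
        raise ValueError("elapsed times must be positive")
    return (baseline_seconds / improved_seconds - 1.0) * 100.0


# Hypothetical elapsed times: the slower interconnect takes twice as long.
example = performance_advantage(3000.0, 1500.0)
```

Under this definition, a run that finishes in half the time shows a 100% advantage, which matches the scale of the chart axes but is not a claim about how they were produced.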
[Figure: CPU Affinity vs Without Affinity by number of nodes, from 4 nodes (32 cores) to 24 nodes (192 cores); y-axis 0 to 7000]
Running four parallel jobs increases productivity (jobs per day) by 97% for Neon Refined Revised and by 57% for the 3 Car collision case
Increased number of parallel processes (jobs) increases the load on the interconnect
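The productivity comparison can be sketched as a jobs-per-day calculation. This is a minimal illustration, assuming productivity is the number of jobs completed per day when a fixed number of jobs run side by side on the cluster; the function names and the elapsed times are hypothetical and are not taken from the benchmark data.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def jobs_per_day(parallel_jobs: int, elapsed_seconds: float) -> float:
    """Jobs completed per day when `parallel_jobs` jobs run side by side
    and each batch takes `elapsed_seconds` to finish (assumed model)."""
    if parallel_jobs < 1 or elapsed_seconds <= 0:
        raise ValueError("need at least one job and a positive elapsed time")
    return parallel_jobs * SECONDS_PER_DAY / elapsed_seconds


def productivity_gain_percent(baseline: float, candidate: float) -> float:
    """Percent increase of `candidate` jobs/day over `baseline` jobs/day."""
    return (candidate / baseline - 1.0) * 100.0


# Hypothetical timings: one job using the whole cluster takes 1000 s;
# four jobs sharing the cluster finish together after 2500 s.
single = jobs_per_day(1, 1000.0)
four = jobs_per_day(4, 2500.0)
gain = productivity_gain_percent(single, four)
```

With these illustrative numbers each job runs slower when the cluster is shared, yet more jobs finish per day, which is the trade-off the bullets describe.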
A high-speed, low-latency interconnect is required to achieve high productivity
[Figures: LS-DYNA - Neon Refined Revised and LS-DYNA - 3 Vehicle Collision, productivity by number of nodes for 1 job, 2 parallel jobs, and 4 parallel jobs; higher is better]
[Figure: (3 Vehicle Collision) distribution by message size on a log-scale y-axis for 4, 8, 12, 16, 20, and 24 nodes; bins [0..64B], [64..256B], [256B..1KB], [1..4KB], [4..16KB], [16..64KB], [64..256KB], [256KB..1M], [1..4M], [4M..infinity]]
[Figure: (3 Vehicle Collision) number of messages by message size on a log-scale y-axis for 4, 8, 12, 16, 20, and 24 nodes; same message size bins]
[Figure: % of total messages by message size for 4, 8, 12, 16, 20, and 24 nodes; bins [0..64], [65..256], [257..1024], [1025..4096], [4097..16384]; y-axis 0% to 70%]
[Figure: MPI_AllReduce and MPI_Bcast by number of nodes, from 4 nodes (32 cores) to 24 nodes (192 cores)]
[Figures: MPI_Bcast and MPI_AllReduce latency (usec) by message size (16 to 512), HP-MPI vs Platform MPI]
[Figures: two charts comparing Platform MPI and HP-MPI by number of nodes, from 4 nodes (32 cores) to 24 nodes (192 cores); lower is better]
The percentage of mid-size messages (64B-4KB) stays the same as cluster size grows
Compute transactions increase with cluster size
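The message-size profile behind these observations can be sketched as a simple histogram. The example below is a minimal illustration, assuming a list of per-message sizes in bytes is already available from some MPI profiling tool; the bin edges follow the labels on the "% of total messages" chart, while the function names, the overflow bin, and the sample sizes are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

# Inclusive upper bounds of the bins shown on the chart, in bytes.
BIN_UPPER_BOUNDS = [64, 256, 1024, 4096, 16384]
BIN_LABELS = ["[0..64]", "[65..256]", "[257..1024]",
              "[1025..4096]", "[4097..16384]", ">16384"]
# Bins treated here as "mid-size" (roughly 64B to 4KB).
MID_SIZE_LABELS = ("[65..256]", "[257..1024]", "[1025..4096]")


def bin_label(size_bytes: int) -> str:
    """Map one message size to its histogram bin label."""
    if size_bytes < 0:
        raise ValueError("message size cannot be negative")
    return BIN_LABELS[bisect_left(BIN_UPPER_BOUNDS, size_bytes)]


def message_share(sizes: list[int]) -> dict[str, float]:
    """Percent of total messages falling into each bin."""
    counts = Counter(bin_label(s) for s in sizes)
    total = len(sizes)
    return {label: 100.0 * counts[label] / total for label in BIN_LABELS}


def mid_size_share(sizes: list[int]) -> float:
    """Percent of messages in the mid-size bins."""
    share = message_share(sizes)
    return sum(share[label] for label in MID_SIZE_LABELS)


# Hypothetical sample of message sizes in bytes.
sample = [8, 64, 65, 200, 1024, 4096, 5000, 20000]
shares = message_share(sample)
mid = mid_size_share(sample)
```

Running the same calculation on traces from different cluster sizes would show whether the mid-size share stays flat, as the first bullet states.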
The following results were achieved after a system upgrade (changes from the previous configuration are noted in parentheses)
Dell PowerEdge SC 1435 24-node cluster
Quad-Core AMD Opteron Model 2382 processors (Shanghai) (vs Barcelona in previous configuration)
Mellanox InfiniBand ConnectX DDR HCAs
Mellanox InfiniBand DDR Switch
Memory: 16GB memory, DDR2 800MHz per node (vs 667MHz in previous configuration)
OS: RHEL5U2, OFED 1.3 InfiniBand SW stack
MPI: HP MPI 2.2.7, Platform MPI 5.6.5
Application: LS-DYNA MPP971
Benchmark Workload
Three-Car Crash Test simulation
Neon-Refined Revised Crash Test simulation
Performance
Quad-Core
Floating Point
128-bit FPU per core
4 FLOPS/clk peak per core
Memory
Dual Channel Reg DDR2
Up to 12.8 GB/s
DDR2-800 MHz or DDR2-667 MHz
Scalability
Compatibility
[Figure: processor block diagram with 8 GB/s links, PCI-E bridges, I/O hub, USB, and PCI]
Performance Improvement
Upgraded AMD CPU and DDR-2 Memory
LS-DYNA run time decreased by more than 20%
Leveraging InfiniBand 20Gb/s for higher scalability
[Figures: LS-DYNA - 3 Vehicle Collision and a second chart comparing Barcelona and Shanghai by number of nodes, from 4 nodes (32 cores) to 24 nodes (192 cores); lower is better]
[Figures: productivity by number of nodes, from 4 nodes (32 cores) to 24 nodes (192 cores), for 1 job and 2, 4, and 8 parallel jobs; higher is better]
[Figures: two charts comparing GigE, 10GigE, and InfiniBand by number of nodes, from 4 nodes (32 cores) to 24 nodes (192 cores); y-axes 0 to 600 and 0 to 6000; lower is better]
[Figure: Wh per job for 3 Vehicle Collision with GigE, 10GigE, and InfiniBand; y-axis 0 to 3500; annotations of 50% and 62%]
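A "Wh per job" figure can be sketched as average power multiplied by run time. The example below is a rough illustration, assuming a constant average cluster power draw over the run; the slides do not describe how power was measured, and the wattage and elapsed times are made-up values rather than results from this study.

```python
def wh_per_job(avg_power_watts: float, elapsed_seconds: float, jobs: int = 1) -> float:
    """Energy per job in watt-hours, assuming constant average power.

    `jobs` is the number of jobs completed during `elapsed_seconds`.
    """
    if avg_power_watts < 0 or elapsed_seconds <= 0 or jobs < 1:
        raise ValueError("invalid power, time, or job count")
    return avg_power_watts * (elapsed_seconds / 3600.0) / jobs


def energy_reduction_percent(baseline_wh: float, improved_wh: float) -> float:
    """Percent reduction in energy per job relative to the baseline."""
    return (1.0 - improved_wh / baseline_wh) * 100.0


# Hypothetical case: same 6000 W cluster, the job finishes in half the time.
slow = wh_per_job(6000.0, 3600.0)
fast = wh_per_job(6000.0, 1800.0)
saving = energy_reduction_percent(slow, fast)
```

Under this simple model a faster interconnect lowers energy per job mainly by shortening the run, since the power draw is assumed to stay roughly constant.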
Conclusions
LS-DYNA is widely used to simulate many real-world problems
Automotive crash-testing and finite-element simulations
Developed by Livermore Software Technology Corporation (LSTC)
Thank You
HPC Advisory Council
HPC@mellanox.com
All trademarks are property of their respective owners. All information is provided As-Is without any kind of warranty. The HPC Advisory Council makes no representation to the accuracy and
completeness of the information contained herein. HPC Advisory Council Mellanox undertakes no duty and assumes no obligation to update or correct any information presented herein