
LS-DYNA Performance Benchmarks and Profiling
January 2009

Note
The following research was performed under the HPC Advisory Council activities
AMD, Dell, Mellanox
HPC Advisory Council Cluster Center

The participating members would like to thank LSTC for their support and guidelines
The participating members would like to thank Sharan Kalwani, HPC Automotive specialist, for his support and guidelines
For more info please refer to
www.mellanox.com, www.dell.com/hpc, www.amd.com

LS-DYNA
A general-purpose structural and fluid analysis simulation software package capable of simulating complex real-world problems
Developed by the Livermore Software Technology Corporation (LSTC)

LS-DYNA is used by
Automobile
Aerospace
Construction
Military
Manufacturing
Bioengineering

LS-DYNA
LS-DYNA SMP (Shared Memory Processing)
Optimized to exploit multiple CPUs within a single machine

LS-DYNA MPP (Massively Parallel Processing)
The MPP version of LS-DYNA allows the solver to run over a high-performance computing cluster
Uses message passing (MPI) to obtain parallelism

Many companies are switching from SMP to MPP
For cost-effective scaling and performance
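For illustration only (not LSTC code), a message-passing program has the following minimal shape in C with MPI; in an MPP solver each such rank owns one subdomain of the model and exchanges boundary data with its neighbors:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);                 /* start the MPI runtime  */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id      */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total process count    */

        /* In an MPP solver, each rank would own one subdomain of the
           finite-element mesh and exchange boundary data via MPI calls. */
        printf("rank %d of %d ready\n", rank, size);

        MPI_Finalize();
        return 0;
    }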

Objectives
The presented research was done to provide best practices for:
LS-DYNA performance benchmarking
Interconnect performance comparisons
Ways to increase LS-DYNA productivity
Understanding the LS-DYNA communication pattern
MPI library comparisons
Power-aware considerations

Test Cluster Configuration

Dell PowerEdge SC 1435 24-node cluster

Quad-Core AMD Opteron Model 2358 processors (Barcelona)

Mellanox InfiniBand ConnectX DDR HCAs

Mellanox InfiniBand DDR Switch

Memory: 16GB memory, DDR2 667MHz per node

OS: RHEL5U2, OFED 1.3 InfiniBand SW stack

MPI: HP MPI 2.2.7, Platform MPI 5.6.5

Application: LS-DYNA MPP971

Benchmark Workload
Three Vehicle Collision Test simulation
Neon-Refined Revised Crash Test simulation

Mellanox InfiniBand Solutions

Industry standard
Hardware, software, cabling, management
Designed for clustering and storage interconnect

Performance
40Gb/s node-to-node
120Gb/s switch-to-switch
1us application latency
Most aggressive roadmap in the industry

Reliable with congestion management

Efficient
RDMA and transport offload
Kernel bypass
CPU focuses on application processing

Scalable for Petascale computing & beyond
End-to-end quality of service
Virtualization acceleration
I/O consolidation including storage

[Chart: "The InfiniBand Performance Gap is Increasing" -- InfiniBand bandwidth roadmap from 20Gb/s to 240Gb/s (12X) compared with Ethernet and Fibre Channel; InfiniBand delivers the lowest latency]



Quad-Core AMD Opteron Processor

Performance
Quad-Core
Dual Channel Reg DDR2
Enhanced CPU IPC
4x 512K L2 cache
2MB L3 Cache
Direct Connect Architecture
HyperTransport technology, up to 24 GB/s
Floating Point
128-bit FPU per core
4 FLOPS/clk peak per core
Memory
1GB Page Support
DDR2-667 MHz

Scalability
48-bit Physical Addressing

Compatibility
Same power/thermal envelopes as Second-Generation AMD Opteron processors

[Diagram: Direct Connect Architecture -- four processors connected by HyperTransport links (8 GB/s each) to PCI-E bridges, USB, I/O hub and PCI]

Dell PowerEdge Servers helping Simplify IT


System Structure and Sizing Guidelines
24-node cluster built with Dell PowerEdge SC 1435 Servers
Servers optimized for High Performance Computing environments
Building Block Foundations for best price/performance and performance/watt
Dell HPC Solutions
Scalable Architectures for High Performance and Productivity
Dell's comprehensive HPC services help manage the lifecycle requirements.
Integrated, Tested and Validated Architectures
Workload Modeling
Optimized System Size, Configuration and Workloads
Test-bed Benchmarks
ISV Applications Characterization
Best Practices & Usage Analysis

LS-DYNA Performance Results - Interconnect

InfiniBand high-speed interconnect enables the highest scalability
Performance gains with cluster size
Performance over GigE does not scale
Slowdown occurs as the number of processors increases beyond 16 nodes
[Charts: LS-DYNA Neon Refined Revised and 3 Vehicle Collision -- elapsed time (seconds) vs number of nodes (4 to 24 nodes, 32 to 192 cores), InfiniBand vs GigE; lower is better]

LS-DYNA Performance Results - Interconnect

[Charts: Performance advantage of InfiniBand over GigE for 3 Vehicle Collision and Neon Refined Revised -- percentage advantage vs number of nodes (4 to 24 nodes, 32 to 192 cores)]

InfiniBand outperforms GigE by up to 132%
As the node count increases, a bigger advantage is expected

LS-DYNA Performance Results - CPU Affinity

[Chart: LS-DYNA 3 Vehicle Collision, CPU affinity vs non-affinity -- elapsed time (seconds) vs number of nodes (4 to 24 nodes, 32 to 192 cores)]

CPU affinity accelerates performance by up to 10%
Saves up to 177 seconds per simulation
Lower is better
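For reference, CPU affinity simply pins a process to a fixed core so the scheduler cannot migrate it away from its cache and local NUMA memory. A minimal Linux sketch using the standard sched_setaffinity() call (illustrative only; in practice the MPI launcher's binding options or tools such as numactl/taskset are used):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(2, &mask);    /* pin the calling process to core 2 */

        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... solver work now stays on core 2, keeping its cache warm
           and its memory accesses on the local NUMA node ... */
        return 0;
    }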

LS-DYNA Performance Results - Productivity

InfiniBand increases productivity by allowing multiple jobs to run simultaneously
Providing the productivity required for virtual vehicle design

Three cases are presented
Single job over the entire system (with CPU affinity)
Two jobs, each on a single CPU per server (job placement, CPU affinity)
Four jobs, each on two CPU cores per CPU per server (job placement, CPU affinity)

Running four parallel jobs increases productivity by 97% for Neon Refined Revised and 57% for the 3 Car Collision case (see the jobs-per-day estimate below)
An increased number of parallel processes (jobs) increases the load on the interconnect
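A rough way to read the jobs-per-day charts (an illustrative estimate, not taken from the slides): if N jobs run concurrently and each finishes in T_N seconds, then

    jobs per day ≈ N × 86,400 / T_N

For a purely hypothetical example, four concurrent jobs that each take 1,800 seconds deliver about 4 × 86,400 / 1,800 = 192 jobs per day, even though each individual job runs slower than a single job spread across the whole cluster.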

[Charts: Jobs per day for LS-DYNA Neon Refined Revised and 3 Vehicle Collision -- 1 job vs 2 parallel jobs vs 4 parallel jobs, vs number of nodes (4 to 24 nodes, 32 to 192 cores); higher is better]

A high-speed, low-latency interconnect solution is required for gaining high productivity

LS-DYNA Profiling - Data Transferred


[Chart: LS-DYNA MPI profiling, 3 Vehicle Collision -- total data transferred (MB, log scale) per message-size bucket (0-64B up to 4MB and above), for 4 to 24 nodes]

The majority of the data transfer is done with 256B-4KB messages
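Message-size histograms like the one above are typically gathered with the standard PMPI profiling interface, by interposing thin wrappers around the MPI calls. A minimal sketch for MPI_Send (illustrative only, not the profiler actually used for these measurements; a real tool wraps many more calls, and the MPI_Send prototype must match the installed MPI headers):

    #include <mpi.h>

    /* Byte histogram buckets roughly matching the profile above */
    static long long nmsg[4];   /* <256B, 256B-1KB, 1KB-4KB, >=4KB */

    static void account(int count, MPI_Datatype type)
    {
        int tsize;
        MPI_Type_size(type, &tsize);
        long bytes = (long)count * tsize;
        if      (bytes < 256)  nmsg[0]++;
        else if (bytes < 1024) nmsg[1]++;
        else if (bytes < 4096) nmsg[2]++;
        else                   nmsg[3]++;
    }

    /* Interpose MPI_Send: record the size, then call the real PMPI_Send.
       Prototype per MPI-3; older MPI-2 headers use a non-const buffer.  */
    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        account(count, type);
        return PMPI_Send(buf, count, type, dest, tag, comm);
    }

Such a wrapper is compiled separately and linked ahead of the MPI library so that the application's calls resolve to it first.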



LS-DYNA Profiling - Message Distribution


[Chart: LS-DYNA MPI profiling, 3 Vehicle Collision -- number of messages (log scale) per message-size bucket (0-64B up to 4MB and above), for 4 to 24 nodes]

The majority of the messages are in the range of 2B-4KB
2B-256B messages are mainly for synchronization, 256B-4KB for data communications

LS-DYNA Profiling - Message Distribution


[Chart: LS-DYNA MPI profiling, 3 Vehicle Collision -- percentage of total messages per message-size bucket (0-64B, 65-256B, 257B-1KB, 1KB-4KB, 4KB-16KB), for 4 to 24 nodes]

As the number of nodes scales, the percentage of small messages increases
The percentage of 256B-1KB messages is relatively consistent with cluster size
The actual number of messages increases with cluster size

LS-DYNA Profiling - MPI Collectives

Two key MPI collective functions in LS-DYNA
MPI_Allreduce
MPI_Bcast
They account for the majority of the MPI communication overhead (see the sketch below)
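To show what these two calls look like, here is a minimal, self-contained C sketch (an illustration of typical explicit-solver usage, not LS-DYNA source; compute_local_dt() is a hypothetical placeholder): every rank contributes its local stable time step to a global MPI_Allreduce, and rank 0 broadcasts a small block of control data with MPI_Bcast.

    #include <mpi.h>

    /* Hypothetical placeholder for a rank-local time-step computation */
    static double compute_local_dt(void) { return 1.0e-6; }

    int main(int argc, char **argv)
    {
        double local_dt, global_dt, control[8] = {0};

        MPI_Init(&argc, &argv);

        /* Global minimum stable time step across all subdomains */
        local_dt = compute_local_dt();
        MPI_Allreduce(&local_dt, &global_dt, 1, MPI_DOUBLE, MPI_MIN,
                      MPI_COMM_WORLD);

        /* Rank 0 distributes a small block of control data to everyone */
        MPI_Bcast(control, 8, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }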
[Chart: MPI collectives -- MPI_Allreduce and MPI_Bcast share of total MPI overhead (%) vs number of nodes (4 to 24 nodes, 32 to 192 cores)]

MPI Collective Benchmarking


MPI collective performance comparison
Two frequently called collective operations in LS-DYNA were benchmarked
MPI_Allreduce
MPI_Bcast

Platform MPI shows better latency for the MPI_Allreduce operation
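Collective latencies like those in the charts below can be measured with a simple timing loop around the operation. A minimal sketch of that methodology (assumed, with an arbitrary 128-byte payload and iteration count):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        enum { ITERS = 1000, COUNT = 16 };   /* 16 doubles = 128-byte payload */
        double in[COUNT], out[COUNT], t0, t1;
        int i, rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (i = 0; i < COUNT; i++) in[i] = 1.0;

        MPI_Barrier(MPI_COMM_WORLD);         /* align all ranks before timing */
        t0 = MPI_Wtime();
        for (i = 0; i < ITERS; i++)
            MPI_Allreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("MPI_Allreduce average latency: %.2f usec\n",
                   (t1 - t0) / ITERS * 1e6);

        MPI_Finalize();
        return 0;
    }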

[Charts: MPI_Allreduce and MPI_Bcast latency (usec) vs message size (0-512 bytes), HP-MPI vs Platform MPI]

LS-DYNA with Different MPI Libraries


LS-DYNA performance comparison
Each MPI library shows different benefits for latency and collectives
As a result, HP-MPI and Platform MPI show comparable performance
[Charts: LS-DYNA Neon Refined Revised and 3 Vehicle Collision -- elapsed time (seconds) vs number of nodes (4 to 24 nodes, 32 to 192 cores), Platform MPI vs HP-MPI; lower is better]

LS-DYNA Profiling Summary - Interconnect

LS-DYNA was profiled to determine its networking dependency

Majority of the data is transferred between compute nodes
Done with 256B-4KB messages; the amount of data transferred increases with cluster size

Most used message sizes
<64B messages, mainly synchronization
64B-4KB messages, mainly compute related

Message size distribution
Percentage of smaller messages (<64B) increases with cluster size
Mainly due to the needed synchronization
Percentage of mid-size messages (64B-4KB) stays the same with cluster size
Compute transactions increase with cluster size
Percentage of very large messages decreases with cluster size
Mainly used for problem data distribution at the simulation initialization phase

LS-DYNA interconnect sensitivity points (see the ping-pong sketch below)
Interconnect latency and throughput for the 64B-4KB message range
Collective operations performance, mainly MPI_Allreduce
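One common way to probe the first sensitivity point is a two-rank ping-pong over the message range of interest. A minimal sketch (assumed methodology; a 1KB payload is chosen as a mid-range example):

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        enum { ITERS = 10000, BYTES = 1024 };   /* mid-range message size */
        char buf[BYTES];
        double t0, t1;
        int i, rank;
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, 0, sizeof(buf));

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < ITERS; i++) {
            if (rank == 0) {                    /* rank 0 sends, then waits */
                MPI_Send(buf, BYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, BYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {             /* rank 1 echoes back       */
                MPI_Recv(buf, BYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, BYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("one-way latency at %d bytes: %.2f usec\n", BYTES,
                   (t1 - t0) / (2.0 * ITERS) * 1e6);

        MPI_Finalize();
        return 0;
    }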

Test Cluster Configuration - System Upgrade

The following results were achieved after a system upgrade (changes noted in parentheses)
Dell PowerEdge SC 1435 24-node cluster
Quad-Core AMD Opteron Model 2382 processors (Shanghai) (vs Barcelona in previous configuration)
Mellanox InfiniBand ConnectX DDR HCAs
Mellanox InfiniBand DDR Switch
Memory: 16GB memory, DDR2 800MHz per node (vs 667MHz in previous configuration)
OS: RHEL5U2, OFED 1.3 InfiniBand SW stack
MPI: HP MPI 2.2.7, Platform MPI 5.6.5
Application: LS-DYNA MPP971
Benchmark Workload
Three-Car Crash Test simulation
Neon-Refined Revised Crash Test simulation

Quad-Core AMD Opteron Processor

Performance
Quad-Core
Dual Channel Reg DDR2
Enhanced CPU IPC
4x 512K L2 cache
6MB L3 Cache
Direct Connect Architecture
HyperTransport technology, up to 24 GB/s peak per processor
Floating Point
128-bit FPU per core
4 FLOPS/clk peak per core
Integrated Memory Controller
Up to 12.8 GB/s
DDR2-800 MHz or DDR2-667 MHz

Scalability
48-bit Physical Addressing

Compatibility
Same power/thermal envelopes as 2nd / 3rd generation AMD Opteron processors

[Diagram: Direct Connect Architecture -- four processors connected by HyperTransport links (8 GB/s each) to PCI-E bridges, USB, I/O hub and PCI]

Performance Improvement
With the upgraded AMD CPUs and DDR2 memory, LS-DYNA run time decreased by more than 20%
Leveraging InfiniBand 20Gb/s for higher scalability
[Charts: LS-DYNA 3 Vehicle Collision and Neon Refined Revised -- elapsed time (seconds) vs number of nodes (4 to 24 nodes, 32 to 192 cores), Barcelona vs Shanghai; lower is better]

Maximize LS-DYNA Productivity


Scalable latency of InfiniBand and latest Shanghai
processor deliver scalable LS-DYNA performance
[Charts: Jobs per day for LS-DYNA Neon Refined Revised and 3 Vehicle Collision -- 1, 2, 4 and 8 parallel jobs vs number of nodes (4 to 24 nodes, 32 to 192 cores); higher is better]

LS-DYNA with Shanghai Processors


Shanghai processors provide higher performance compared to Barcelona

[Chart: LS-DYNA 3 Vehicle Collision, Shanghai vs Barcelona -- percentage of additional jobs per day for 1, 2 and 4 parallel jobs vs number of nodes (4 to 24 nodes, 32 to 192 cores)]

LS-DYNA Performance Results - Interconnect


InfiniBand 20Gb/s vs 10GigE vs GigE
InfiniBand 20Gb/s (DDR) outperforms 10GigE and GigE in all test cases
Reducing run time by up to 60% versus 10GigE and 61% versus GigE
Performance loss is seen beyond 16 nodes with 10GigE and GigE
InfiniBand 20Gb/s maintains scalability with cluster size
[Charts: LS-DYNA Neon Refined Revised and 3 Vehicle Collision (HP-MPI) -- elapsed time (seconds) vs number of nodes (4 to 24 nodes, 32 to 192 cores), GigE vs 10GigE vs InfiniBand; lower is better]

Power Consumption Comparison


[Chart: Power consumption in Wh per job, InfiniBand vs 10GigE vs GigE, for 3 Vehicle Collision and Neon Refined Revised (annotated savings of 50% and 62%)]

InfiniBand also enables power-efficient simulations
Reducing energy per job by up to 62% (24-node comparison)
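A back-of-the-envelope relation behind the Wh-per-job metric (an assumption about how it is computed, not stated explicitly in the slides):

    energy per job (Wh) ≈ average cluster power draw (W) × job elapsed time (s) / 3,600

So an interconnect that shortens the run time, or allows more jobs to share the same powered-on cluster, directly lowers the Wh-per-job figure.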

Conclusions
LS-DYNA is widely used to simulate many real-world problems
Automotive crash-testing and finite-element simulations
Developed by Livermore Software Technology Corporation (LSTC)

LS-DYNA performance and productivity rely on
Scalable HPC systems and interconnect solutions
Low-latency, high-throughput interconnect technology
NUMA-aware applications for fast access to local memory
Reasonable job distribution, which can dramatically improve productivity
Increasing the number of jobs per day while maintaining fast run times

Interconnect comparison shows
InfiniBand delivers superior performance and productivity at every cluster size
Scalability requires latency that is low and does not grow with cluster size
Lowest power consumption was achieved with InfiniBand
Savings in system power, cooling and real estate


Thank You
HPC Advisory Council
HPC@mellanox.com

All trademarks are the property of their respective owners. All information is provided "as is" without warranty of any kind. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein, and the HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.

