
Parallel and Distributed Computing on Low Latency Clusters

Vittorio Giovara
M. S. Electrical Engineering and Computer Science
University of Illinois at Chicago
May 2009
Contents
• Motivation
• Strategy
• Technologies
- OpenMP
- MPI
- Infiniband
• Application
• Compiler Optimizations
• OpenMP and MPI over Infiniband
• Results
• Conclusions
Motivation

• Scaling trend has to stop for CMOS technology:
✓ Direct-tunneling limit in SiO2 ~3 nm
✓ Distance between Si atoms ~0.3 nm
✓ Variability
• Fundamental reason: rising fab cost

Motivation

• Multi-core processors are easy to build
• Human action is required to modify and adapt concurrent software
• New classification for computer architectures
Classification
[Figure: the four architecture classes, SISD, SIMD, MISD, and MIMD, each drawn as a data pool and an instruction pool feeding one or more CPUs]

Levels
[Figure: three levels of parallelism (algorithm, loop level, process management) placed along two axes labeled "abstraction level" and "easier to parallelize"]
[Figure: the same three levels annotated with related topics: recursion, memory management, profiling; data dependency, branching overhead, control flow; SMP, multiprogramming, multithreading and scheduling]
Backfire

• Difficult to fully exploit the parallelism offered
• Automatic tools required to adapt software to parallelism
• Compiler support for manual or semi-automatic enhancement
Applications
• OpenMP and MPI are two popular tools used to simplify the parallelization of both new and existing software
• Mathematics and Physics
• Computer Science
• Biomedicine
Specific Problem and Background
• Sally3D is a micromagnetism program suite for field analysis and modeling, developed at Politecnico di Torino (Department of Electrical Engineering)
• Computationally intensive (runs can take days of CPU time); speedup required
• Previous works do not fully cover the problem (no Infiniband or OpenMP+MPI solutions)
Strategy
• Install a Linux kernel with an ad-hoc configuration for scientific computation
• Compile an OpenMP-enabled GCC (supported from 4.3.1 onwards)
• Add the Infiniband link among clusters with proper drivers in kernel and user space
• Select an MPI implementation library
Strategy
• Verify Infiniband network through some
MPI test examples
• Install the target software
• Proceed to include OpenMP and MPI
directives in the code
• Run test cases
OpenMP

• standard
• supported by most of modern compilers
• requires little knowledge of the software
• very simple construction methods
OpenMP - example
[Figure: fork-join execution. The master thread forks Parallel Tasks 1 to 4, which are distributed between Thread A and Thread B; the threads then join back into the master thread.]
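The fork-join pattern in the diagram can be sketched in C with OpenMP sections. This is a minimal illustration, not code from Sally3D: the function names `task` and `run_four_tasks` are hypothetical, and the two-thread team is an assumption taken from the two threads shown in the figure.

```c
#include <assert.h>

/* Hypothetical stand-in for one "Parallel Task" box in the diagram. */
static int task(int id)
{
    return id * id;
}

/* Forks a team of (at most) two threads, runs four independent tasks,
   and joins. Returns the sum of the four task results, computed by the
   master thread after the join. If the compiler is run without OpenMP
   support the pragmas are ignored and the tasks run one after another. */
int run_four_tasks(void)
{
    int r[4] = {0, 0, 0, 0};

    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        r[0] = task(1);
        #pragma omp section
        r[1] = task(2);
        #pragma omp section
        r[2] = task(3);
        #pragma omp section
        r[3] = task(4);
    }   /* implicit barrier: this is the "Join" in the diagram */

    /* Every task has completed by the time the master thread gets here. */
    assert(r[0] && r[1] && r[2] && r[3]);
    return r[0] + r[1] + r[2] + r[3];
}
```

Which thread picks up which section is decided by the OpenMP runtime, so the assignment of tasks to Thread A and Thread B may differ from run to run.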
OpenMP Scheduler

• Which schedulers are available for the hardware?
- Static
- Dynamic
- Guided
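The three policies are selected with the `schedule` clause of a parallel loop. The sketch below assumes a trivial loop body (summing the loop index) purely for illustration; the function names are hypothetical and the benchmarked loop is not shown in the slides.

```c
#include <assert.h>

/* Each function sums 0 + 1 + ... + (n - 1); only the schedule clause
   differs. Without OpenMP support the pragmas are ignored and the loops
   run sequentially with the same result. */

long sum_static(int n, int chunk)
{
    long total = 0;
    assert(chunk > 0);
    /* static: chunks of `chunk` iterations are handed out to the threads
       round-robin, and the assignment is fixed before the loop starts */
    #pragma omp parallel for schedule(static, chunk) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += i;
    return total;
}

long sum_dynamic(int n, int chunk)
{
    long total = 0;
    assert(chunk > 0);
    /* dynamic: a thread requests the next chunk whenever it finishes
       its current one */
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += i;
    return total;
}

long sum_guided(int n, int chunk)
{
    long total = 0;
    assert(chunk > 0);
    /* guided: like dynamic, but chunks start large and shrink, with
       `chunk` as the minimum size (the last chunk may be smaller) */
    #pragma omp parallel for schedule(guided, chunk) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += i;
    return total;
}
```

The chunk size trades scheduling overhead against load balance: small chunks balance uneven work better but cost more runtime bookkeeping, especially under the dynamic policy.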
OpenMP Scheduler
[Chart: OpenMP static scheduler; time in microseconds (0 to 80000) vs. number of threads (1 to 16); series: chunk 1, chunk 10, chunk 100, chunk 1000, chunk 10000]
[Chart: OpenMP dynamic scheduler; time in microseconds (0 to 117000) vs. number of threads (1 to 16); series: chunk 1, chunk 10, chunk 100, chunk 1000, chunk 10000]
[Chart: OpenMP guided scheduler; time in microseconds (0 to 80000) vs. number of threads (1 to 16); series: chunk 1, chunk 10, chunk 100, chunk 1000, chunk 10000]
[Chart: comparison of the static, dynamic, and guided schedulers]

MPI
• standard
• widely used in cluster environments
• many transport links supported
• different implementations available
- OpenMPI
- MVAPICH
Infiniband

• standard
• widely used in cluster environments
• very low latency for small packets
• up to 16 Gb/s transfer speed
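To see why latency matters most for small packets, a first-order estimate of transfer time can be written as a fixed latency plus the time needed to push the bits through the link. This is a simplified model, not a measurement: the function name is hypothetical, the latency is a caller-supplied assumption, and protocol overhead is ignored.

```c
#include <assert.h>

/* Estimated one-way transfer time in microseconds:
   fixed latency + message size / link rate.
   1 Gb/s equals 1000 bits per microsecond. */
double transfer_time_us(double latency_us, double gbit_per_s, double bytes)
{
    assert(gbit_per_s > 0.0);
    double bits = bytes * 8.0;
    double bits_per_us = gbit_per_s * 1000.0;
    return latency_us + bits / bits_per_us;
}
```

For example, with the 16 Gb/s rate quoted above and an assumed latency of 2 µs, a 2000-byte message spends about 1 µs on the wire and 2 µs in latency, so the fixed latency dominates until messages grow well beyond a few kilobytes.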
MPI over Infiniband
[Chart: time in µs (logarithmic axis, 1 µs to 10,000,000 µs) vs. message size (bytes up to gigabytes); series: OpenMPI, Mvapich2]
[Chart: time in µs (logarithmic axis, 1 µs to 10,000,000 µs) vs. message size (bytes up to megabytes); series: OpenMPI, Mvapich2]
Optimizations

• Active at compile time


• Available only after porting the software to
standard FORTRAN
• Consistent documentation available
• Unexpected positive results
Optimizations

• -march=native
• -O3
• -ffast-math
• -Wl,-O1
Target Software

• Sally3D
• micromagnetic equation solver
• written in FORTRAN with some C libraries
• the program uses a linear formulation of its mathematical models
Implementation Scheme
[Figure: three loop styles. Sequential loop: standard programming model. Parallel loop: OpenMP threads on one host. Distributed loop: OpenMP threads on Host 1 and Host 2, with the hosts connected through MPI.]
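One common way to build a distributed loop of this kind is to give each host a contiguous block of iterations and let its OpenMP threads share that block. The sketch below shows only the index arithmetic and the per-host loop; the function names are hypothetical, the loop body is a placeholder sum, and no MPI calls are made. In a real MPI program `rank` and `nranks` would come from `MPI_Comm_rank` and `MPI_Comm_size`, and the partial results would be combined with a send/receive exchange or a reduction.

```c
#include <assert.h>

/* First iteration owned by `rank` when n iterations are split into
   nranks contiguous blocks. The first (n % nranks) ranks get one extra
   iteration. Passing rank == nranks yields n (one past the last block). */
int block_begin(int n, int nranks, int rank)
{
    assert(n >= 0 && nranks > 0 && rank >= 0 && rank <= nranks);
    int base = n / nranks;
    int extra = n % nranks;
    return rank * base + (rank < extra ? rank : extra);
}

/* One past the last iteration owned by `rank`. */
int block_end(int n, int nranks, int rank)
{
    return block_begin(n, nranks, rank + 1);
}

/* Work done by one host on its own block [begin, end), shared among
   that host's OpenMP threads. The body is a placeholder. */
long local_sum(int begin, int end)
{
    long s = 0;
    #pragma omp parallel for reduction(+:s)
    for (int i = begin; i < end; i++)
        s += i;
    return s;
}
```

With two hosts and ten iterations, host 0 would process iterations 0 to 4 and host 1 iterations 5 to 9; adding the two local sums gives the same value as the sequential loop.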
Implementation Scheme
• Data structure: not embarrassingly parallel
• Three-dimensional matrix
• Several temporary arrays: synchronization objects required
➡ send() and recv() mechanism
➡ critical regions using OpenMP directives
➡ function merging
➡ matrix conversion
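When several threads update the same temporary array, the updates have to be serialized. A minimal sketch of a critical region protecting such an array follows; the array, its size, and the function name are hypothetical and do not come from the Sally3D sources.

```c
#include <assert.h>

#define NBINS 4

/* Counts, for i in [0, n), how many indices fall in each of NBINS bins
   (bin = i % NBINS) and returns the count for bin `which`.
   bins[] is declared outside the parallel loop, so it is shared by all
   threads and every update must be protected. */
int bin_count(int n, int which)
{
    int bins[NBINS] = {0};
    assert(which >= 0 && which < NBINS);

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        /* only one thread at a time may execute the next statement */
        #pragma omp critical
        bins[i % NBINS]++;
    }

    return bins[which];
}
```

A critical region is the most general tool but also the slowest; for a single increment like this one, `#pragma omp atomic` would be a lighter alternative.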
Results
OMP MPI OPT seconds
* * * 133
* * - 400
* - * 186
* - - 487
- * * 200
- * - 792
- - * 246
- - - 1062

Total Speed Increase: 87.52%


Actual Results
OMP MPI seconds
* * 59
* - 129
- * 174
- - 249

Function Name      Normal    OpenMP    MPI       OpenMP+MPI
calc_intmudua      24.5 s    4.7 s     14.4 s    2.8 s
calc_hdmg_tet      16.9 s    3.0 s     10.8 s    1.7 s
calc_mudua         12.1 s    1.9 s     7.0 s     1.1 s
campo_effettivo    17.7 s    4.5 s     9.9 s     2.3 s
Actual Results

• OpenMP: 6-8x
• MPI: 2x
• OpenMP + MPI: 14-16x

Total Raw Speed Increment: 76%
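The total figure can be checked against the timing table, assuming that "speed increment" means the share of the baseline run time that was saved (an interpretation, not something the slide states). The helper names below are hypothetical.

```c
#include <assert.h>

/* Percentage of the baseline run time that was saved. */
double time_saved_percent(double t_base, double t_new)
{
    assert(t_base > 0.0);
    return 100.0 * (t_base - t_new) / t_base;
}

/* Speedup factor: how many times faster the new run is. */
double speedup(double t_base, double t_new)
{
    assert(t_new > 0.0);
    return t_base / t_new;
}
```

With the totals from the table (249 s with neither OpenMP nor MPI, 59 s with both), the saved share comes out at roughly 76%, which corresponds to an overall speedup factor of a little over 4x.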


Conclusions and Future Work
• Computational time has been significantly
decreased
• Speedup is consistent with expected results
• Submitted to COMPUMAG '09
• Continue inserting OpenMP and MPI directives
• Perform algorithm optimizations
• Increase cluster size