M. D. Jones, Ph.D.
Center for Computational Research, University at Buffalo, State University of New York
Background
This document covers the essentials of compiling and running MPI applications on the CCR platforms. It does not cover MPI programming itself, nor debugging, etc. (covered more thoroughly in separate presentations).
Modules
There are a large number of available software packages on the CCR systems, particularly the Linux clusters. To help maintain this often confusing environment, the modules package is used to add and remove these packages from your default environment (many of the packages conflict in terms of their names, libraries, etc., so the default is a minimally populated environment).
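For example, a typical sequence looks like the following (a minimal sketch; the exact module names, such as intel/12.1 and intel-mpi, vary, so check module avail on the system you are using):

module avail                        # list every package available on this system
module load intel/12.1 intel-mpi    # add the Intel compilers and Intel MPI to your environment
module list                         # show what is currently loaded
module unload intel-mpi             # remove a package from your environment again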
Modules
If you change shells in your batch script you may need to explicitly load the modules environment:

tcsh:

source $MODULESHOME/init/tcsh

bash:

. ${MODULESHOME}/init/bash

but generally you should not need to worry about this step (do a module list and if it works ok your environment should already be properly initialized).
Objective: Construct a very elementary MPI program to do the usual Hello World problem, i.e. have each process print out its rank in the communicator.
Hello World in C
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
  int myid, nprocs;
  int namelen, mpiv, mpisubv;
  char processor_name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Get_processor_name(processor_name, &namelen);
  printf("Process %d of %d on %s\n", myid, nprocs, processor_name);
  if (myid == 0) {
    MPI_Get_version(&mpiv, &mpisubv);
    printf("MPI Version: %d.%d\n", mpiv, mpisubv);
  }
  MPI_Finalize();
  return 0;
}
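To build this on the cluster you would normally use the MPI compiler wrapper; a minimal sketch, assuming the Intel compiler and Intel MPI modules from the modules section are loaded and the source file is named hello.c (mpiicc is Intel MPI's wrapper around the Intel C compiler):

module load intel/12.1 intel-mpi
mpiicc -o hello.impi hello.c     # produces the hello.impi executable used in the batch scripts below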
There are several commercial implementations of MPI, Intel and HP currently being the most prominent (IBM, Sun, SGI, etc. all have their own variants, but usually are only supported on their own hardware). CCR has a license for Intel MPI, and it has some nice features:

Support for multiple networks (InfiniBand, Myrinet, TCP/IP)
Part of the ScaLAPACK support in the Intel MKL
MPI-2 features (one-sided, dynamic tasks, I/O with parallel filesystems support)
CPU pinning/process affinity options (extensive)
Unfortunately Intel MPI still lacks tight integration with PBS/Torque, and instead relies on daemons (launched by you) or the hydra task launcher to initiate MPI tasks.
#PBS -S /bin/bash
##PBS -q debug
#PBS -l walltime=00:10:00
#PBS -l nodes=2:GM:ppn=2
#PBS -M jonesm@ccr.buffalo.edu
#PBS -m e
#PBS -N test
#PBS -o subQ.out
#PBS -j oe
#
# Note the above directives can be commented out using an
# additional "#" (as in the debug queue line above)
#
module load intel-mpi intel/12.1
#
# cd to directory from which job was submitted
#
cd $PBS_O_WORKDIR
#
# Intel MPI has no tight integration with PBS,
# so you have to tell it where to run, but its mpirun
# wrapper will auto-detect PBS.
# You can find description of all Intel MPI parameters in the
# Intel MPI Reference Manual.
# See <intel-mpi-install-dir>/doc/Reference_manual.pdf
#
export I_MPI_DEBUG=5   # nice debug level, spits out useful info
NPROCS=`cat $PBS_NODEFILE | wc -l`
NODES=`cat $PBS_NODEFILE | uniq`
NNODES=`cat $PBS_NODEFILE | uniq | wc -l`
#
# mpd-based way:
mpdboot -n $NNODES -f $PBS_NODEFILE -v
mpdtrace
mpiexec -np $NPROCS ./hello.impi
mpdallexit
#
# mpirun wrapper:
mpirun -np $NPROCS ./hello.impi
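As noted earlier, the hydra task launcher is the alternative to the mpd daemons; a minimal sketch of the equivalent launch line (option spellings can differ between Intel MPI versions, so check the Reference Manual for the version you load):

# hydra-based way (no mpdboot/mpdallexit needed):
mpiexec.hydra -np $NPROCS -machinefile $PBS_NODEFILE ./hello.impi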
[1] DAPL startup(): trying to open default DAPL provider from dat registry
[0] DAPL startup(): trying to open default DAPL provider from dat registry
[3] DAPL startup(): trying to open default DAPL provider from dat registry
[2] DAPL startup(): trying to open default DAPL provider from dat registry
[0] MPI startup(): DAPL provider mx2
[1] MPI startup(): DAPL provider mx2
...
[3] MPI startup(): shm and dapl data transfer modes
[0] MPI startup(): Rank    Pid      Node name                   Pin cpu
[0] MPI startup(): 0       30125    f09n35.ccr.buffalo.edu      0
[0] MPI startup(): 1       30126    f09n35.ccr.buffalo.edu      1
[0] MPI startup(): 2       32040    f09n34.ccr.buffalo.edu      0
[0] MPI startup(): 3       32041    f09n34.ccr.buffalo.edu      1
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS_LIST=dapl,tcp
[0] MPI startup(): I_MPI_FALLBACK=enable
[0] MPI startup(): I_MPI_PIN_MAPPING=2:0 0,1 1
[0] MPI startup(): I_MPI_PLATFORM=auto
Process 0 of 4 on f09n35.ccr.buffalo.edu
MPI Version: 2.1
Process 1 of 4 on f09n35.ccr.buffalo.edu
Process 2 of 4 on f09n34.ccr.buffalo.edu
Process 3 of 4 on f09n34.ccr.buffalo.edu
The newest U2 nodes have InfiniBand (IB) as the optimal interconnect for message-passing; running an Intel MPI job should automatically find and use IB on those machines (and they have 8 or 12 cores each, so adjust your script accordingly):
[u2:~/d_mpisamples]$ cat subQ.out
Job 2949999.d15n41.ccr.buffalo.edu has requested 12 cores/processors per node.
running mpdallexit on k16n13a
LAUNCHED mpd on k16n13a via
RUNNING: mpd on k16n13a
LAUNCHED mpd on k16n12b via k16n13a
RUNNING: mpd on k16n12b
k16n13a
k16n12b
[0] MPI startup(): shm and tmi data transfer modes
[12] MPI startup(): shm and tmi data transfer modes
[4] MPI startup(): shm and tmi data transfer modes
[19] MPI startup(): shm and tmi data transfer modes
[1] MPI startup(): shm and tmi data transfer modes
[20] MPI startup(): shm and tmi data transfer modes
[11] MPI startup(): shm and tmi data transfer modes
[14] MPI startup(): shm and tmi data transfer modes
[5] MPI startup(): shm and tmi data transfer modes
[13] MPI startup(): shm and tmi data transfer modes
[3] MPI startup(): shm and tmi data transfer modes
[23] MPI startup(): shm and tmi data transfer modes
[2] MPI startup(): shm and tmi data transfer modes
[21] MPI startup(): shm and tmi data transfer modes
[10] MPI startup(): shm and tmi data transfer modes
[15] MPI startup(): shm and tmi data transfer modes
[7] MPI startup(): shm and tmi data transfer modes
[22] MPI startup(): shm and tmi data transfer modes
[6] MPI startup(): shm and tmi data transfer modes
[18] MPI startup(): shm and tmi data transfer modes
[9] MPI startup(): shm and tmi data transfer modes
[17] MPI startup(): shm and tmi data transfer modes
[8] MPI startup(): shm and tmi data transfer modes
[16] MPI startup(): shm and tmi data transfer modes
[0] MPI startup(): Rank    Pid      Node name                   Pin cpu
Process 1 of 24 on k16n13a.ccr.buffalo.edu
Process 2 of 24 on k16n13a.ccr.buffalo.edu
Process 5 of 24 on k16n13a.ccr.buffalo.edu
Process 11 of 24 on k16n13a.ccr.buffalo.edu
...
Process 18 of 24 on k16n12b.ccr.buffalo.edu
Process 20 of 24 on k16n12b.ccr.buffalo.edu
Process 22 of 24 on k16n12b.ccr.buffalo.edu
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS_LIST=tmi,dapl,tcp
[0] MPI startup(): I_MPI_FALLBACK=enable
[0] MPI startup(): I_MPI_PIN_MAPPING=12:0 0,1 1,2 2,3 3,4 4,5 5,6 6,7 7,8 8,9 9,10 10,11 11
[0] MPI startup(): I_MPI_PLATFORM=auto
Process 0 of 24 on k16n13a.ccr.buffalo.edu
MPI Version: 2.1
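To run on those nodes the main change is the resource request in the batch script; a sketch only - the node property name and core count here are assumptions for illustration, so check the CCR documentation for the properties that actually select the IB nodes:

#PBS -l nodes=2:IB:ppn=12     # two InfiniBand nodes, 12 cores per node (property name assumed)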
Job 2950135.d15n41.ccr.buffalo.edu has requested 2 cores/processors per node.
running mpdallexit on f09n35
LAUNCHED mpd on f09n35 via
RUNNING: mpd on f09n35
LAUNCHED mpd on f09n34 via f09n35
RUNNING: mpd on f09n34
f09n35
f09n34
[3] MPI startup(): shared memory and socket data transfer modes
[0] MPI startup(): shared memory and socket data transfer modes
[2] MPI startup(): shared memory and socket data transfer modes
[1] MPI startup(): shared memory and socket data transfer modes
[1] MPI startup(): shm and tcp data transfer modes
[2] MPI startup(): shm and tcp data transfer modes
[3] MPI startup(): shm and tcp data transfer modes
[0] MPI startup(): shm and tcp data transfer modes
Process 2 of 4 on f09n34.ccr.buffalo.edu
Process 1 of 4 on f09n35.ccr.buffalo.edu
Process 3 of 4 on f09n34.ccr.buffalo.edu
[0] MPI startup(): Rank    Pid      Node name                   Pin cpu
[0] MPI startup(): 0       30385    f09n35.ccr.buffalo.edu      0
[0] MPI startup(): 1       30384    f09n35.ccr.buffalo.edu      1
[0] MPI startup(): 2       32261    f09n34.ccr.buffalo.edu      0
[0] MPI startup(): 3       32260    f09n34.ccr.buffalo.edu      1
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_DEVICE=ssm
[0] MPI startup(): I_MPI_FABRICS_LIST=dapl,tcp
[0] MPI startup(): I_MPI_FALLBACK=enable
[0] MPI startup(): I_MPI_PLATFORM=auto
[0] MPI startup(): MPICH_INTERFACE_HOSTNAME=10.106.9.35
Process 0 of 4 on f09n35.ccr.buffalo.edu
MPI Version: 2.1
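The I_MPI_DEVICE=ssm line above shows that this particular run was restricted to shared memory within a node and TCP sockets between nodes; a sketch of how such a restriction can be set in the job script (variable names differ between Intel MPI versions, so consult the Reference Manual for the version you load):

export I_MPI_DEVICE=ssm          # older syntax: shared memory + sockets (TCP)
# or, on newer Intel MPI versions:
export I_MPI_FABRICS=shm:tcp     # equivalent fabric selection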
MPI processes - things to keep in mind:

You can over-subscribe the processors if you want, but that is going to under-perform (it is often useful for debugging, though). Note that batch queuing systems (like those at CCR) may not let you easily over-subscribe the number of available processors.
Better MPI implementations will give you more options for the placement of MPI tasks (often through so-called "affinity" options, either for CPU or memory).
Typically you want a 1-to-1 mapping of MPI processes to available processors (cores), but there are times when that may not be desirable (see the sketch below).
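For illustration, a sketch of the two launch cases using the $NPROCS variable derived from $PBS_NODEFILE in the earlier script (over-subscribing here simply means asking for more ranks than allocated cores):

# usual 1-to-1 mapping: one MPI rank per allocated core
mpirun -np $NPROCS ./hello.impi
# deliberate 4x over-subscription, e.g. for debugging
mpirun -np $((4 * NPROCS)) ./hello.impi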
Affinity
Intel MPI has options for associating MPI tasks with cores - better known as CPU-process affinity:

I_MPI_PIN, I_MPI_PIN_MODE, I_MPI_PIN_PROCESSOR_LIST, I_MPI_PIN_DOMAIN in the current version of Intel MPI (it never hurts to check the documentation for the version that you are using; these options have a tendency to change)
You can specify the list of cores on which to run MPI tasks, as well as domains of cores for hybrid MPI-OpenMP applications (a sketch follows below)
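A sketch of how these variables might be set in a job script (core numbers and domain choices are illustrative only; check the Reference Manual for the Intel MPI version you load):

export I_MPI_PIN=1                        # enable process pinning (usually on by default)
export I_MPI_PIN_PROCESSOR_LIST=0,2,4,6   # pin the ranks on each node to these cores
# for hybrid MPI-OpenMP, give each rank a domain of cores for its threads:
export I_MPI_PIN_DOMAIN=omp
export OMP_NUM_THREADS=4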
Summary
Use the modules environment manager to choose your MPI flavor.
I recommend Intel MPI on the clusters, unless you need access to the source code of the implementation itself. It has a lot of nice features and is quite flexible.
Be careful with task launching - use mpiexec whenever possible.
Ensure that your MPI processes end up where you want - use ps and top to check (and use MPI_Get_processor_name in your code); a sketch is given below. Also use the CCR ccrjobviz.pl job visualizer utility to quickly scan for expected task placement and performance issues.
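For example, a quick placement check from a login node might look like this (a sketch; the node name and executable name are illustrative):

# list your processes on one of the job's nodes, showing the core (psr) each is on
ssh f09n35 ps -u $USER -o pid,psr,args | grep hello.impi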