
Computer Science and Engineering Department

University Politehnica of Bucharest

The NCIT Cluster


Resources Users Guide
Version 4.0
Emil-Ioan Slusanschi
Alexandru Herisanu
Razvan Dobre
2013

© 2013
Editura Paideia
Piata Unirii nr. 1, etaj 5, sector 3
Bucuresti, Romania
tel.: (031)425.34.42
e-mail: office@paideia.ro
www.paideia.ro
www.cadourialese.ro

ISBN 978-973-596-909-7

Contents

1 Acknowledgements and History

2 Introduction
  2.1 The Cluster
  2.2 Software Overview
  2.3 Further Information

3 Hardware
  3.1 Configuration
  3.2 Processor Datasheets
    3.2.1 The Intel Xeon Processors
    3.2.2 AMD Opteron Processors
    3.2.3 IBM Cell Broadband Engine Processors
  3.3 Server Datasheets
    3.3.1 IBM Blade Center H
    3.3.2 HS 21 blade
    3.3.3 HS 22 blade
    3.3.4 LS 22 blade
    3.3.5 QS 22 blade
    3.3.6 Fujitsu Celsius R620
    3.3.7 Fujitsu Esprimo Machines
    3.3.8 IBM eServer xSeries 336
    3.3.9 Fujitsu-SIEMENS PRIMERGY TX200 S3
  3.4 Storage System
  3.5 Network Connections
    3.5.1 Configuring VPN
  3.6 HPC Partner Clusters

4 Operating systems
  4.1 Linux
  4.2 Addressing Modes

5 Environment
  5.1 Login
    5.1.1 X11 Tunneling
    5.1.2 VNC
    5.1.3 FreeNX
    5.1.4 Running a GUI on your VirtualMachine
  5.2 File Management
    5.2.1 Tips and Tricks
    5.2.2 Sharing Files Using Subversion / Trac
  5.3 Module Package
  5.4 Batch System
    5.4.1 Sun Grid Engine
    5.4.2 Easy submit: MPRUN.sh
    5.4.3 Easy development: APPRUN.sh
    5.4.4 Running a custom VM on the NCIT-Cluster

6 The Software Stack
  6.1 Compilers
    6.1.1 General Compiling and Linker hints
    6.1.2 Programming Hints
    6.1.3 GNU Compilers
    6.1.4 GNU Make
    6.1.5 Sun Compilers
    6.1.6 Intel Compilers
    6.1.7 PGI Compiler
  6.2 OpenMPI
  6.3 OpenMP
    6.3.1 What does OpenMP stand for?
    6.3.2 OpenMP Programming Model
    6.3.3 Environment Variables
    6.3.4 Directives format
    6.3.5 The OpenMP Directives
    6.3.6 Examples using OpenMP with C/C++
    6.3.7 Running OpenMP
    6.3.8 OpenMP Debugging - C/C++
    6.3.9 OpenMP Debugging - FORTRAN
  6.4 Debuggers
    6.4.1 Sun Studio Integrated Debugger
    6.4.2 TotalView

7 Parallelization
  7.1 Shared Memory Programming
    7.1.1 Automatic Shared Memory Parallelization of Loops
    7.1.2 GNU Compilers
    7.1.3 Intel Compilers
    7.1.4 PGI Compilers
  7.2 Message Passing with MPI
    7.2.1 OpenMPI
    7.2.2 Intel MPI Implementation
  7.3 Hybrid Parallelization
    7.3.1 Hybrid Parallelization with Intel-MPI

8 Performance / Runtime Analysis Tools
  8.1 Sun Sampling Collector and Performance Analyzer
    8.1.1 Collecting experiment data
    8.1.2 Viewing the experiment results
  8.2 Intel MPI benchmark
    8.2.1 Installing and running IMB
    8.2.2 Submitting a benchmark to a queue
  8.3 Paraver and Extrae
    8.3.1 Local deployment - Installing
    8.3.2 Deployment on NCIT cluster
    8.3.3 Installing
    8.3.4 Checking for Extrae installation
    8.3.5 Visualization with Paraver
    8.3.6 Do it yourself tracing on the NCIT Cluster
    8.3.7 Observations
  8.4 Scalasca
    8.4.1 Installing Scalasca
    8.4.2 Running experiments

9 Application Software and Program Libraries
  9.1 Automatically Tuned Linear Algebra Software (ATLAS)
    9.1.1 Using ATLAS
    9.1.2 Performance
  9.2 MKL - Intel Math Kernel Library
    9.2.1 Using MKL
    9.2.2 Performance
  9.3 ATLAS vs MKL - level 1,2,3 functions
  9.4 Scilab
    9.4.1 Source code and compilation
    9.4.2 Using Scilab
    9.4.3 Basic elements of the language
  9.5 Deal II
    9.5.1 Introduction
    9.5.2 Description
    9.5.3 Installation
    9.5.4 Unpacking
    9.5.5 Configuration
    9.5.6 Running Examples

1 Acknowledgements and History

This document describes the use of the NCIT Cluster resources. It was developed starting
from the GridInitiative 2008 Summer School, with continuous additions over the years until July
2013. Future versions are expected to follow, as new hardware and software upgrades are
made to our computing center.
The authors and coordinators of this series would like to thank the following people who
have contributed to the information included in this guide: Sever Apostu, Alexandru Gavrila,
Ruxandra Cioromela, Alexandra-Nicoleta Firica, Adrian Lascateu, Cristina Ilie, Catalin-Ionut
Fratila, Vlad Spoiala, Alecsandru Patrascu, Diana Ionescu, Dumitrel Loghin, George-Danut
Neagoe, Ana Maria Tuleiu, Raluca Silvia Negru, Stefan Nour, and many others.

v.1.0   Jul 2008   Initial release
v.1.1   Nov 2008   Added examples, reformatted LaTeX code
v.2.0   Jul 2009   Added chapters 7, 8 and 9; updated chapters 3, 5 and 6
v.3.0   Jul 2010   Added sections in chapters 5 and 6; updated chapters 1-4
v.3.1   Jul 2012   Updated chapters 1-9
v.4.0   Jul 2013   Added sections 8.3, 8.4, 9.4, 9.5; updated chapters 1-9

2 Introduction

The National Center for Information Technology (NCIT) of the University Politehnica of
Bucharest started back in 2001 with the creation of the CoLaborator laboratory, a Research Base
with Multiple Users (R.B.M.U.) for High Performance Computing (HPC), which benefited from
funding by a World Bank project. CoLaborator was designed as a path of communication
between universities, at a national level, using the national network infrastructure for education
and research. Back in 2006, NCIT's infrastructure was enlarged with the creation of a
second, more powerful, computing site, more commonly referred to as the NCIT Cluster.
Both sites are used for research and teaching by teachers, PhD students, grad
students and students alike. Currently, a single sign-on (SSO) scenario is implemented, with
the same user credentials across sites in the entire computing infrastructure offered by our
center, using the already existing LDAP infrastructure behind the http://curs.cs.pub.ro
project.
This document was created with the given purpose of serving as an introduction to the
parallel computing paradigm and as a guide to using the cluster's specific resources. You will
find within the next paragraphs descriptions of the various existing hardware architectures,
operating and programming environments, as well as further (external) information on the
given subjects.

2.1 The Cluster

Although the two computing sites are different administrative entities and have different
physical locations, the approach we use throughout this document will be that of a single
cluster with various platforms. This approach is also justified by the planned upgrade to a
10Gb Ethernet link between the two sites.
Given the fact that the cluster was created over a rather large period of time, that new
machines are added continuously, and that there is a real need for software testing on multiple
platforms, the structure of our computing center is a heterogeneous one, in terms of hardware
platforms and operating/programming environments. More to the point, there are currently
six different computing architectures available on the cluster, namely:
Intel Xeon Quad 64b
Intel Xeon Nehalem 64b
AMD Opteron 64b
IBM Cell BE EDp 32/64b
IBM Power7 64b
NVidia Tesla M2070 32/64b
Not all of these platforms are currently given a frontend, the only frontends available at
the moment being the fep.grid.pub.ro and gpu.grid.pub.ro machines (Front End Processors).
The machines behind them all run non-interactively, being available only to jobs sent
through the frontends.

2.2 Software Overview

The tools used in the NCIT and CoLaborator clusters for software development are
Sun Studio, OpenMP and OpenMPI. For debugging we use the TotalView Debugger and
Sun Studio, and for profiling and performance analysis the Sun Studio Performance Tools and
Intel VTune. The Intel, Sun Studio and GNU compilers were used to compile our tools. The
installation of all the tools needed was done using the local repository of our computing
center, available online at http://storage.grid.pub.ro.

2.3 Further Information

The latest version of this document will always be kept online at:
https://cluster.grid.pub.ro/index.php/home
For any questions or feedback on this document, please feel free to contact us at:
https://support.grid.pub.ro/

3 Hardware

This section covers in detail the different hardware architectures available in the cluster.

3.1 Configuration

The following table contains the list of all the nodes available for general use. There are
also various machines which are currently used for maintenance purposes; they will not be
presented in the list below.
Model                      Processor Type                Sockets/Cores  Memory                    Hostname
IBM HS21 (28 nodes)        Intel Xeon E5405, 2 GHz       2/8            16 GByte                  quad-wn
IBM HS22 (4 nodes)         Intel Xeon E5630, 2.53 GHz    2/16           32 GByte                  nehalem-wn
IBM LS22 (14 nodes)        AMD Opteron 2435, 2.6 GHz     2/12           16 GByte                  opteron-wn
IBM QS22 (4 nodes)         Cell BE Broadband, 3.2 GHz    2/4            8 GByte                   cell-qs
IBM PS703 (8 nodes)        Power7, 2.4 GHz               2/8            32 GByte                  power-wn
IBM iDataPlex dx360M3      NVidia Tesla, 1.15 GHz        2/448          32 GByte (5 GByte VRAM)   dp-wn
  (4 nodes)
Fujitsu Esprimo (66 nodes) Intel P4, 3 GHz               1/1            2 GByte                   p4-wn
Fujitsu Celsius (2 nodes)  Intel Xeon, 3 GHz             2/2            2 GByte                   dual-wn

Additionally, our cluster has multiple partnerships and you can choose to run your code on
these remote sites using the same infrastructure and the same user ids. What must be considered
is the different architecture of the remote cluster and the delay involved in moving your data
through the VPN links involved.

3.2 Processor Datasheets

3.2.1 The Intel Xeon Processors

The Intel Xeon processor refers to many families of Intel's x86 multiprocessing CPUs for
dual-processor (DP) and multi-processor (MP) configurations on a single motherboard, targeted
at non-consumer markets of server and workstation computers, and also at blade servers and
embedded systems. The Xeon CPUs generally have more cache than their desktop counterparts,
in addition to multiprocessing capabilities.
Our cluster is currently equipped with the Intel Xeon 5000 processor sequence. Here is a
quick list of the processors available in our cluster and their corresponding datasheets:
CPU Name / Version   Speed      L2 Cache   Datasheet
Intel Xeon E5405     2 GHz      12 MB      click here
Intel Xeon E5630     2.53 GHz   12 MB      click here
Intel Xeon X5570     2.93 GHz   12 MB      click here
Intel P4             3 GHz      8          -

3.2.2 AMD Opteron Processors

The Six-Core AMD Opteron processor-based servers deliver performance efficiency to handle
real-world workloads with good energy efficiency. There is one IBM chassis with 14
Infiniband QDR connected Opteron blades available; the rest are used in our Hyper-V
virtualization platform.
Our cluster is currently equipped with the Six-Core AMD Opteron processor series. Click
on a link to see the corresponding datasheet.
CPU Name      Version   Speed     L2 Cache   Datasheet
AMD Opteron   2435      2.6 GHz   6x512 KB   click here

3.2.3 IBM Cell Broadband Engine Processors

The QS22, based on the new IBM PowerXCell 8i multicore processor, offers extraordinary
single-precision and double-precision floating-point computing power to accelerate key
algorithms, such as 3D rendering, compression, encryption, financial algorithms, and seismic
processing. Our cluster is currently equipped with four Cell B.E. blades. Click on a link to see the
datasheet.
CPU Name            Version   Speed     L2 Cache   Datasheet
IBM PowerXCell 8i   QS22      3.2 GHz   -          CELL Arch / PowerPC Arch

3.3 Server Datasheets

This section presents short descriptions of the server platforms that are used in our computing center.
3.3.1 IBM Blade Center H

There are seven chassis and each can fit 14 blades in 9U. You can find general information
about the model here. Currently there are five types of blades installed: Intel-based HS21
and HS22 blades, AMD-based LS22, IBM Cell-based QS22, and IBM Power7 PS703.
3.3.2 HS 21 blade

There are 32 HS21 blades, of which 28 are used for the batch system and 4 for development
and virtualization projects. Each blade has an Intel Xeon quad-core processor at
2 GHz with 2x6 MB L2 cache, 1333 MHz FSB and 16 GB of memory. Full specifications can be
found here. One can access these machines using the ibm-quad.q queue in SunGridEngine
and their hostname is dual-wnXX.grid.pub.ro - 172.16.3.X
3.3.3 HS 22 blade

There are 14 HS22 blades, of which 4 are used for the batch system and 10 are dedicated
to the Hyper-V virtualization environment. Each blade has two Intel Xeon processors at
2.53 GHz with 12 MB L2 cache, 1333 MHz FSB and 32 GB of memory. Full specifications here.
Also, if required, 17 blades are available in the Hyper-V environment for HPC applications
using Microsoft Windows HPCC. The user is responsible for setting up the environment. All
HS22 blades have FibreChannel connected storage and use high-speed FibreChannel disks.
These disks are only connected on demand; what one gets by using the batch system is a
local disk. One can access these machines using the ibm-nehalem.q queue in SunGridEngine
and their hostname is nehalem-wnXX.grid.pub.ro - 172.16.9.X

3.3.4 LS 22 blade

There are 20 LS22 blades, of which 14 are available for batch system use; the rest can be
used in the Hyper-V environment. Each blade has an Opteron six-core processor at 2.6 GHz.
Full specifications can be found here. One can access these machines using the ibm-opteron.q
queue in SunGridEngine and their hostname is opteron-wnXX.grid.pub.ro - 172.16.8.X
3.3.5 QS 22 blade

The Cell-based QS22 blade features two dual-core 3.2 GHz IBM PowerXCell 8i processors,
512 KB of L2 cache per IBM PowerXCell 8i processor, plus 256 KB of local store memory for
each eDP SPE. Their memory capacity is 8 GB. They have no local storage, ergo they boot
over the network. More details are listed on the QS 22 features page. One can access these
machines using the ibm-cell-qs22.q queue in SunGridEngine and their hostname is
cell-qs22-X.grid.pub.ro - 172.16.6.X. One can also connect to these systems using a
load-balanced connection at cell.grid.pub.ro (SSH).
3.3.6 Fujitsu Celsius R620

The Fujitsu-SIEMENS Celsius machines are workstations equipped with two Xeon processors.
Because of their high energy consumption, they were migrated, beginning January 2011, to the
training and pre-production labs. A couple of them are still accessible using the batch
system, and they are used to host CUDA-capable graphics cards. One can access these
machines using the fs-dual.q queue in SunGridEngine. Their hostname is dual-wnXX.grid.pub.ro -
172.16.3.X
3.3.7 Fujitsu Esprimo Machines

There are currently 60 Fujitsu Esprimo machines, model P5905, available. They each have an Intel
Pentium 4 3.0 GHz CPU with 2048 KB L2 cache, 2048 MB DDR2 main memory (upgradable to a
maximum of 4 GB) working at 533 MHz, and a 250 GB SATA II (300 MB/s) disk. More information
can be found here. One can access these machines using the fs-p4.q queue in SunGridEngine. If one
has special projects requiring physical and dedicated access to the machines, this is the
queue to use. Their corresponding hostname is p4-wnXXX.grid.pub.ro - 172.16.2.X.
3.3.8 IBM eServer xSeries 336

The IBM eServer xSeries 336 servers available in the NCIT Cluster are 1U rack-mountable
corporate business servers, each with one Intel Xeon 3.0 GHz processor with Intel Extended
Memory 64 Technology and upgrade possibility, an Intel E7520 chipset and a data bus
speed of 800 MHz. They are equipped with 512 MB DDR2 SDRAM ECC main memory working
at 400 MHz (upgradable to a maximum of 16 GB), one Ultra320 SCSI integrated controller and
one UltraATA 100 integrated IDE controller. They possess two network interfaces, Ethernet
10Base-T/100Base-TX/1000Base-T (RJ-45). More information on the IBM eServer xSeries
336 can be found on IBM's support site, here. Currently, these servers are part of the core
system of the NCIT Cluster and users do not have direct access to them.
3.3.9 Fujitsu-SIEMENS PRIMERGY TX200 S3

The Fujitsu-SIEMENS PRIMERGY TX200 S3 servers available in the NCIT Cluster have
two Intel Dual-Core Xeon 3.0 GHz processors, each with Intel Extended Memory 64 Technology
and upgrade possibility, an Intel 5000V chipset and a data bus speed of 1066 MHz. These
processors have 4096 KB of L2 cache, ECC.
They come with 1024 MB DDR2 SDRAM ECC main memory, upgradable to a maximum
of 16 GB, 2-way interleaved, working at 400 MHz, one 8-port SAS variant controller, one FastIDE
controller and a 6-port controller. They have two network interfaces, Ethernet
10Base-T/100Base-TX/1000Base-T (RJ-45). More information on the Fujitsu-SIEMENS PRIMERGY
TX200 S3 can be found on Fujitsu-SIEMENS's site, here. Currently, these servers are part of
the core system of the NCIT Cluster and you do not have direct access to them.

3.4 Storage System

The storage system is composed of the following DELL solutions: 2 PowerEdge 2900 and
2 PowerEdge 2950 servers, and 4 PowerVault MD1000 storage arrays. There are four types
of disk systems you can use: local disks, NFS, LustreFS and FibreChannel disks.

All home directories are NFS mounted. There are several reasons behind this approach:
first, many profiling tools cannot run over LustreFS because of its locking mechanism and second,
if the cluster is shut down, the time to start the Lustre filesystem is much greater than
starting NFS. The NFS partition is under /export/home/ncit-cluster. Jobs with high I/O are
forbidden on the NFS directories.
Each user also has access to a LustreFS directory, e.g. alexandru.herisanu/LustreFS
(a symbolic link to /d02/home/ncit-cluster/prof/alexandru.herisanu). The AMD Opteron nodes
(LS22 blades) are connected to the LustreFS servers through Infiniband; all the other nodes
use one of the 4 LNET routers to mount the filesystem. There are currently 3 OST servers
and 1 MDS node, available either over Infiniband or TCP. Last but not least, each job has a
default local scratch space created by our batch system.
Type        Where                               Observations
NFS         HOME (/export/home/ncit-cluster)    Do not use I/O jobs here
LustreFS    HOME/LustreFS (/d02/home)
Local HDD   /scratch/tmp                        Local on each node

Starting from December 2011, our virtualization platform has received an upgrade in the
form of a new FibreChannel storage system - an IBM DS 3950 machine with a total capacity of
12 TB. The computing center has additional licences available, so if required by really intensive
I/O applications where LustreFS is not an option, we can map some hard disks to one of the
nehalem nodes to satisfy these requirements. The NFS server is storage-2. The following table
lists the IPs of the storage servers.

The storage servers are: batch, storage, storage-2, storage-3, storage-4 and storage-5.
Their network connections are:

Connected Switch        Port     IP
NCitSw10GW4948-48-1     Gi1/37   172.16.1.1
NCitSw10GW4948-48-1     Gi1/38   141.85.224.101
NCitSw10GW4948-48-1     Gi1/39   N/A*
Infiniband Voltaire     -        N/A*
NCitSw10GW4948-48-1     Gi1/2    141.85.224.10
NCitMgmtSw-2950-48-1    Fa0/4    172.16.1.10
NCitSw10GW4948-48-1     Gi1/13   172.16.1.20
NCitSw10GW4948-48-1     Gi1/14   141.85.224.103
Infiniband Voltaire     -        192.168.5.20
NCitSw10GW4948-48-1     Gi1/15   172.16.1.30
Infiniband Voltaire     -        192.168.5.30
NCitSw10GW4948-48-1     Gi1/3    172.16.1.40
Infiniband Voltaire     -        192.168.5.40
NCitSw10GW4948-48-1     Gi1/4    172.16.1.60
NCitSw10GW4948-48-1     Gi1/7    141.85.224.49
Infiniband Voltaire     -        192.168.5.60

(*) This is the MDS for the Lustre system.

3.5 Network Connections

Our main worker node router is a Debian-based machine (141.85.241.163, 172.16.1.7,
192.168.6.1, 10.42.0.1). It also acts as a name-caching server. If you get a public IP directly,
your routers are 141.85.241.1 and 141.85.224.1, depending on the VLAN.
DNS Servers: 141.85.241.15, 141.85.164.62
Our IPv6 network is 2001:b30:800:f0::/54. To get the IPv6 address of a host, just use
the following rule:

IPv4 address: 172.16.1.7  | -> IPv6 address: 2001:b30:800:f006:172:16:1:7/64
VLAN: 6 (Cluster Nodes)   |

The network part becomes f0[06]: f0 is the fixed network prefix and 06 is the VLAN number
(VLAN 6 in this example); 172:16:1:7 is the IPv4 address of the host rewritten with colons.
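The same rule can be written as a small shell sketch (purely illustrative, not an official
cluster tool; the values used are the example above):

#!/bin/bash
# Illustrative only: build a host's IPv6 address from its IPv4 address and VLAN id.
ipv4="172.16.1.7"   # example host
vlan=6              # example VLAN (Cluster Nodes)
# f0 is the fixed network part, the VLAN id is appended as two digits,
# and the IPv4 octets are reused, separated by colons.
printf '2001:b30:800:f0%02d:%s/64\n' "$vlan" "${ipv4//./:}"
# prints: 2001:b30:800:f006:172:16:1:7/64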
3.5.1 Configuring VPN

Sometimes one needs access to resources that are not reachable from the internet. If one
has an NCIT AD account, then one can still connect through VPN to the training network.
Please write to us if you need a VPN account.
The VPN server is win2k3.grid.pub.ro (141.85.224.36). We do not route your traffic, so
deselect the following option: right click on the VPN connection - Properties - Networking
- TCP/IP v4 - Properties - Advanced, deselect Use default gateway on remote network.

3.6 HPC Partner Clusters

One can also run jobs on partner clusters like the HPC ICF and CNMSI Virtual Cluster.
The ICF cluster uses IBM BladeCenter Chassis of the same generation as our quad-wn nodes
and the CNMSI Cluster uses a lot of virtualized machines with limited memory and limited
storage space. Although you can use these systems in any project you like, please take note
of the networking and storage architecture involved.

The HPC cluster of the Institute of Physical Chemistry (http://www.icf.ro) is nearly
identical to ours. There are 65 HS21 blades available with dual quad-core Xeon processors.
The home directories are nfs-mounted through autofs. There are five chassis with 13 blades
each. Currently, your home directory is mounted over the VPN link, so it is advisable that you
store your data locally using the scratch directory provided by SunGridEngine or a local NFS
temporary storage. The use of the icf-hpc-quad.q queue is restricted. The IP range visible
from our cluster is 172.17.0.0/16 - quad-wnXX.hpc-icf.ro.

The second cluster is a collaboration between UPB and CNMSI (http://www.cnmsi.ro).
They provide us with access to 120 virtual machines, each with 10 GB of hard drive and 1 GB
of memory. The use of the cnmsi-virtual.q queue is also restricted. The IP range visible from
our cluster is 10.10.60.0/24 - cnmsi-wnXXX.grid.pub.ro.


4 Operating systems

There is only one operating system running in the NCIT Cluster and that is Linux. The
cluster is split into an HPC domain and a virtualization domain. If you need to run Windows
applications we can provide you with the necessary virtualized machines, documentation and
howtos, but you have to set them up yourself.

4.1 Linux

Linux is a UNIX-like operating system. Its name comes from the Linux kernel, originally
written in 1991 by Linus Torvalds. The system's utilities and libraries usually come from the
GNU operating system, announced in 1983 by Richard Stallman. The Linux release used
on the NCIT Cluster is a RHEL (Red Hat Enterprise Linux) clone called Scientific Linux,
co-developed by Fermi National Accelerator Laboratory and the European Organization for
Nuclear Research (CERN).
The Linux kernel version we use is:
$ uname -r
2.6.32-279.2.1.el6.x86_64
whereas the distribution release:
$ cat /etc/issue
Scientific Linux release 6.2 (Carbon)

4.2 Addressing Modes

Linux supports 64-bit addressing, thus programs can be compiled and linked either in 32
or 64-bit mode. This has no influence on the capacity or precision of floating point numbers (4
or 8 byte real numbers), affecting only memory addressing, i.e. the usage of 32 or 64-bit pointers.
Obviously, programs requiring more than 4 GB of memory have to use the 64-bit addressing
mode.
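For illustration, the -m32/-m64 gcc flags select the addressing mode (this assumes the 32-bit
runtime libraries are installed on the node; ptrsize.c is just an example file). Only the
pointer size changes:

$ cat ptrsize.c
#include <stdio.h>
int main(void) { printf("pointer size: %zu bytes\n", sizeof(void *)); return 0; }
$ gcc -m32 ptrsize.c -o ptrsize32 && ./ptrsize32
pointer size: 4 bytes
$ gcc -m64 ptrsize.c -o ptrsize64 && ./ptrsize64
pointer size: 8 bytes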


5 Environment

5.1 Login

Logging into UNIX-like systems is done through the secure shell (SSH); the SSH daemon is
usually installed by default on both Unix and Linux systems. You can log into each
one of the cluster's frontends from your local UNIX machine, using the ssh command:
$ ssh username@fep.grid.pub.ro
$ ssh username@gpu.grid.pub.ro
$ ssh username@cell.grid.pub.ro
Usage example:
$ ssh username@fep.grid.pub.ro
Logging into one of the frontends from Windows is done by using Putty.
We provide three ways to connect to the cluster with a graphical environment: you can
use X11 tunneling, VNC or FreeNX to run GUI apps.
5.1.1 X11 Tunneling

The simplest way to get GUI access is to use ssh X11 tunneling. This is also the slowest
method.
$ ssh -X username@fep.grid.pub.ro
$ xclock
Depending on your local configuration it may be necessary to use the -Y flag to enable
trusted forwarding of graphical programs (especially if you're a Mac user). If you're
running Windows, you need to run a local X server. A lightweight server (about 2 MB) is XMing.
To connect using Windows, run XMing, run Putty and select Connection - SSH - X11 -
Enable X11 forwarding, and connect to fep.
5.1.2 VNC

Another method is to use VNC. Because VNC is not encrypted, we use SSH port
forwarding just to be safe. The frontend runs a configuration named VNC Linux Terminal Services,
meaning that if you connect on port 5900 you'll get a VNC server with 1024x768 resolution,
5901 is 800x600 and so on. You can not connect directly, so you must use ssh.
ssh -L5900:localhost:5900 username@fep.grid.pub.ro
On the local computer:
vncviewer localhost
The first line connects to fep and creates a tunnel from your host port 5900 to fep's port 5900
(localhost:5900). On your computer use vncviewer to connect to localhost.
If you use Windows, use RealVNC Viewer and Putty. First configure tunneling in Putty:
run Putty and select Connection - SSH - Tunnels. We want to create a tunnel from our
machine, port 5900, to fep after we connect, so select Source port: 5900 and Destination
localhost:5900 and click Add. Connect to fep and then use RealVNC and connect to localhost.
You should get this:

Select IceWM from Sessions. There is no GNOME or KDE installed.


5.1.3 FreeNX

FreeNX uses a proprietary protocol over a secondary ssh connection. It is by far the most
efficient remote desktop for Linux, but requires a client to be installed. NX Client is used both
on Linux and Windows. After installing the client, run the NX Connection Wizard as in the
steps below.

(*) You must actually select Unix - Custom in step 2, as we use neither Gnome nor
KDE but IceWM.
A more thorough howto can be found here. Our configuration uses the default FreeNX
client keys.
You could also use our NXBuilder app to download, install and configure the connection
automatically. Just point your browser to http://cluster.grid.pub.ro/nx and make sure you
have Java installed.

5.1.4 Running a GUI on your VirtualMachine

If you wish, you can run your own custom virtual machine on any machine you like, but
depending on the virtual domain used, you may not have inbound internet access. You can
use ssh tunneling to access your machine, and this is how to do it.

First of all, you must decide which of the following methods you want to use: X11
tunneling, VNC or FreeNX. If you use KVM, then you can also connect to the console of the
virtual machine directly.

KVM Console
When you run the virtual machine with KVM, you will specify the host or the queue where
your machine runs. Using qstat or the output of apprun.sh, get the machine that is running
the VM, e.g. opteron-wn01.grid.pub.ro, port 11. Connect to the machine that hosts your VM
directly using vncviewer, either through X11 tunnelling or port forwarding.
a. X11 tunneling
$ ssh -X username@fep.grid.pub.ro
$ vncviewer opteron-wn01.grid.pub.ro:11
b. SSH tunneling (tunnel the VNC port on the localhost)
$ ssh -L5900:opteron-wn01.grid.pub.ro:5911 username@fep.grid.pub.ro
on the local host
$ vncviewer localhost
You can use any of the methods described earlier to get a GUI on fep (X11, VNC or
FreeNX). Use this method if you do not know your IP.

X11 Tunneling / VNC / FreeNX
If you know your IP, just install SSH/VNC or FreeNX on your VM, connect to fep and
connect with -X. For example, if the IP of your machine is 10.42.8.1:
a. X11 tunneling
$ ssh -X username@fep.grid.pub.ro
(fep)$ ssh -X root@10.42.8.1
(vm)$ xclock
b. SSH tunneling (tunnel the remote ssh port on the localhost)
$ ssh -L22000:10.42.8.1:22 username@fep.grid.pub.ro
on the local host
$ ssh -p 22000 -X root@localhost
All these three methods rely on services you install on your machine. The best way is
to forward the remote port locally: in the case of X11 and FreeNX you will tunnel the SSH
port (22), in the case of VNC, port 5900+.
For more information check these howtos: http://wiki.centos.org/HowTos/VNC-Server
(5. Remote login with vnc-ltsp-config) and http://wiki.centos.org/HowTos/FreeNX.

5.2 File Management

At the time of writing this section there were no quota limits on how much disk
space you can use. If you really need it, we can provide it. Every user of the cluster has a home
directory on an NFS shared filesystem within the cluster and a LustreFS mounted directory.
Your home directory is usually $HOME=/export/home/ncit-cluster/role/username.
5.2.1 Tips and Tricks

Here are some tips on how to manage your files:


SCP/WinSCP
Transferring files to the cluster from your local UNIX-like machine is done through the
secure copy command scp, e.g.:
$ scp localfile username@fep.hpc.pub.ro:~/
$ scp -r localdirectory username@fep.grid.pub.ro:~/
(use -r when transferring multiple files)
The default directory where scp copies the file is the home directory. If you want to specify
a different path where to save the file, you should write the path after the colon. For example:
$ scp localfile username@fep.hpc.pub.ro:your/relative/path/to/home
$ scp localfile username@fep.hpc.pub.ro:/your/absolute/path
Transferring files back from the cluster goes the same way:
$ scp username@fep.hpc.pub.ro:path/to/file /path/to/destination/on/local/machine
If you use Windows, use WinSCP. This is an scp client for Windows that provides a
graphical file manager for copying files to and from the cluster.
SSH-FS Fuse
This is actually the best method you can use if you plan to edit files locally and run them
remotely. First install sshfs (most distributions have this package already). Careful: one
must use the absolute path of one's home directory.
$ sshfs alexandru.herisanu@fep.grid.pub.ro:/export/home/ncit-cluster/prof/alexandru.herisanu /mnt
This allows you to use, for example, eclipse locally and see your files as local. Because it is
a mounted filesystem, the transfer is transparent (don't forget to unmount). To see what
your full home directory path is, do this:
[alexandru.herisanu@fep ~]$ echo $HOME
/export/home/ncit-cluster/prof/alexandru.herisanu
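When you are done, unmount the remote filesystem (the mount point below is the one from the
example above):
$ fusermount -u /mnt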


SSH Fish protocol
If you use Linux, you can use the Fish protocol. Install mc, then F9 (Left) - Shell link -
fep.grid.pub.ro. You now have all your remote files in the left pane of the file manager.
The same protocol can be used from GNOME and KDE. From Gnome, Places - Connect
to Server..., select Service type: SSH, Server: fep.grid.pub.ro, Port: 22, Folder: full path
of the home directory (/export...), User Name: your username, Name for connection: NCIT
Fep. After doing this you will have a desktop item that Nautilus will use to browse your
remote files.
Microsoft Windows Share
Your home directory is also exported as a samba share. Connect through VPN and browse
storage-2. You can also mount your home partition as a local disk. Contact us if you need
this feature enabled.
Wget
If you want to copy an archive from a web link into your current directory, do this: use Copy
Link Location (from your browser) and paste the link as a parameter for wget. For example:
wget http://link/for/download
5.2.2 Sharing Files Using Subversion / Trac

The NCIT Cluster can host your project on its SVN and Trac servers. Trac is an enhanced
wiki and issue tracking system for software development projects. Our SVN server is
https://svn-batch.grid.pub.ro and the Trac system is at https://ncit-cluster.grid.pub.ro.


Apache Subversion, more commonly known as Subversion (command name svn), is a version
control system. It is mostly used in software development projects, where a team of people
may alter the same files or folders. Changes are usually identified by a number (sometimes a
letter) code, called the revision number, revision level, or simply revision. For example,
an initial set of files is revision 1. When the first change is made, the resulting set is
revision 2, and so on. Each revision is associated with a timestamp and the person making
the change. Revisions can be compared, restored, and, with most types of files, merged.
First of all, a repository has to be created in order to host all the revisions. This is generally
done using the create command as shown below. Note that any machine can host this type
of repository, but in some cases, such as our cluster, you are required to have certain access
rights in order to create one.
$ svnadmin create /path/to/repository
Afterwards, the other users are provided with an address which hosts their les. Every user
must install a svn version (e.g.: subversion-1.6.2) in order to have access to the svn commands
and utilities. It is recommended that all the users involved in the same project use the same
version.
Before getting started, setting the default editor for the svn log messages is a good idea.
Choose whichever editor you see fit. In the example below vim is chosen.
$ export SVN_EDITOR=vim
Here are a few basic commands you should master in order to be able to use svn properly
and efficiently:
- import - this command is used only once, when file sources are added for the first time;
this part has been previously referred to as adding revision 1.
$ svn import /path/to/files/on/local/machine
/SVN_address/New_directory_for_your_project
- add - this command is used when you want to add a new file to the ones that already
exist. Be careful though - this phase itself does not commit the changes. It must
be followed by an explicit commit command.
$ svn add /path/to/new_file_to_add
- commit - this command is used when adding a new file or when submitting the changes
made to one of the files. Before the commit, the changes are only visible to the user who
makes them and not to the entire team.
$ svn commit /path/to/file/on/local/machine -m "explicit message explaining your change"
- checkout - this command is used when you want to retrieve the latest version of the
project and bring it to your local machine.
$ svn checkout /SVN_address/Directory_of_your_project /path/to/files/on/local/machine
- rm - this command is used when you want to delete an existing file from your project.
This change is visible to all of your team members.
$ svn rm /address/to/file_to_be_deleted -m "message explaining your action"

- merge - this command is used when you want to merge two or more revisions. M and N
are the revision numbers you want to merge.
$ svn merge sourceURL1[@N] sourceURL2[@M] [WORKINGPATH]
OR:
$ svn merge -r M:HEAD /SVN_address/project_directory
/path/to/files/on/local/machine
The last variant merges the files from revision M up to the latest existing revision.
- update - this command is used when you want to update the version you have on your
local machine to the latest revision. It is also an easy way to merge your files with the changes
made by your team before you commit your own changes. Do not worry, your changes will
not be lost. If by any chance both you and the other members have modified the same lines
in a file, a conflict will be signaled and you will be given the opportunity to choose the final
version of those lines.
$ svn update
For further information and examples check these links: SVN Redbook and SVN Tutorial.

5.3 Module Package

The Module package provides for the dynamic modification of the user's environment.
Initialization scripts can be loaded and unloaded to alter or set shell environment variables
such as $PATH or $LD_LIBRARY_PATH, to choose for example a specific compiler version or use
certain software packages.
The advantage of the modules system is that environment changes can easily be undone
by unloading a module. Dependencies and conflicts can be easily controlled. If, say, you need
MPI with gcc, then you'll just have to load both the gcc compiler and the mpi-gcc modules. The
module files will make all the necessary changes to the environment variables.
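For example, a typical session could look like this (illustrative; hello_mpi.c is a placeholder
source file, and the module names are taken from the module avail listing below):
$ module load compilers/gcc-4.1.2
$ module load mpi/openmpi-1.5.1_gcc-4.1.2
$ mpicc hello_mpi.c -o hello_mpi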
Note: The changes will remain active only for your current session. When you exit, they
will revert back to the initial settings.
For working with modules, the module command is used. The most important options
are explained in the following. To get help about the module command you can either read
the manual page (man module), or type
$ module help
To get the list of available modules type
$ module avail
--------------------------- /opt/modules/modulefiles ---------------------------
apps/bullet-2.77                 java/jdk1.6.0_23-32bit
apps/codesaturn-2.0.0RC1         java/jdk1.6.0_23-64bit
apps/gaussian03                  mpi/Sun-HPC8.2.1c-gnu
apps/gulp-3.4                    mpi/Sun-HPC8.2.1c-intel
apps/hrm                         mpi/Sun-HPC8.2.1c-pgi
apps/matlab                      mpi/Sun-HPC8.2.1c-sun
batch-system/sge-6.2u5           mpi/intelmpi-3.2.1_mpich
batch-system/sge-6.2u6           mpi/openmpi-1.3.2_gcc-4.1.2
blas/atlas-9.11_gcc              mpi/openmpi-1.3.2_gcc-4.4.0
blas/atlas-9.11_sunstudio12.1    mpi/openmpi-1.3.2_pgi-7.0.7
cell/cell-sdk-3.1                mpi/openmpi-1.3.2_sunstudio12.1
compilers/gcc-4.1.2              mpi/openmpi-1.5.0_gcc-4.1.2
compilers/gcc-4.4.0              mpi/openmpi-1.5.1_gcc-4.1.2
compilers/intel-11.0_083         oscar-modules/1.0.5(default)
compilers/pgi-7.0.7              tools/ParaView-3.8.1
compilers/sunstudio12.1          tools/ROOT-5.28.00
debuggers/totalview-8.4.1-7      tools/celestia-1.6.0
debuggers/totalview-8.6.2-2      tools/eclipse_helios-3.6.1
grid/gLite-UI-3.1.31-Prod        tools/scalasca-1.3.2_gcc-4.1.2

An available module can be loaded with


$ module load [module name] -> $ module load compilers/gcc-4.1.2
A module which has been loaded before but is no longer needed can be removed using
$ module unload [module name]
If you want to use another version of a software package (e.g. another compiler), we strongly recommend switching between modules.
$ module switch [oldfile] [newfile]
This will unload all modules from the bottom up to the oldfile, unload the oldfile, load the
newfile and then reload all previously unloaded modules. Due to this procedure the order of
the loaded modules is not changed and dependencies will be rechecked. Furthermore, some
modules adjust their environment variables to match previously loaded modules.
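For example, to switch from GCC 4.1.2 to GCC 4.4.0 (both modules appear in the listing above):
$ module switch compilers/gcc-4.1.2 compilers/gcc-4.4.0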
You will get a list of loaded modules with
$ module list
Short information about the software initialized by a module can be obtained by
$ module whatis [file]
e.g.: $ module whatis compilers/gcc-4.1.2
compilers/gcc-4.1.2 : Sets up the GCC 4.1.2 (RedHat 5.3) Environment.
You can add a directory with your own module files with
$ module use path
Note: If you loaded module files in order to compile a program, you probably have to load
the same module files before running that program. Otherwise some necessary libraries may
not be found at program start time. This is also true when using the batch system!

5.4 Batch System

A batch system controls the distribution of tasks (batch jobs) to the available machines
or resources. It ensures that the machines are not overbooked, to provide optimal program
execution. If no suitable machines have available resources, the batch job is queued and will
be executed as soon as resources become available. Compute jobs that are expected to run
for a long period of time or use a lot of resources should use the batch system in order to
reduce the load on the frontend machines.
You may submit your jobs for execution on one of the available queues. Each of the queues
has an associated environment.
To display a summary of the queues:

$ qstat -g c [-q queue]


CLUSTER QUEUE        CQLOAD   USED    RES   AVAIL   TOTAL   aoACDS   cdsuE
--------------------------------------------------------------------------------
all.q                  0.45      0      0     256     456      200       0
cnmsi-virtual.q        -NA-      0      0       0       0        0       0
fs-p4.q                -NA-      0      0       0       0        0       0
ibm-cell-qs22.q        0.00      0      0       0      16        0      16
ibm-nehalem.q          0.03     15      0      49      64        0       0
ibm-opteron.q          0.72    122      0      46     168        0       0
ibm-quad.q             0.37     90      0     134     224        0       0
5.4.1 Sun Grid Engine

To submit a job for execution on the cluster, you have two options: either specify the
command directly, or provide a script that will be executed. This behavior is controlled by
the -b y|n parameter as follows: y means the command may be a binary or a script, and
n means it will be treated as a script. Some examples of submitting jobs (both binaries and
scripts):
$ qsub -q [queue] -b y [executable] -> $ qsub -q queue_1 -b y /path/my_exec

$ qsub -pe [pe_name] [no_procs] -q [queue] -b n [script]


e.g: $ qsub -pe pe_1 4 -q queue_1 -b n my_script.sh
To watch the evolution of the submitted job, use qstat. Running it without any arguments
shows information about the jobs submitted by you alone.
To see the progress of all the jobs use the -f flag. You may also specify which queue's jobs
you are interested in by using the -q [queue] parameter, e.g.:
$ qstat [-f] [-q queue]
Typing watch qstat will automatically run qstat every 2 seconds. To exit type Ctrl-C.
In order to delete a job that was previously submitted invoke the qdel command, e.g.:
$ qdel [-f] [-u user_list] [job_range_list]
where:
-f - forces action for running jobs
-u - users whose jobs will be removed. To delete all the jobs for all users use -u *.
An example of submitting a job with SGE looks like that:
$ cat script.sh
#!/bin/bash
pwd
$ chmod +x script.sh
$ qsub -q queue_1 script.sh

(you may omit -b and it will behave like -b n)

To display the submitted jobs of all users (-u *) or a specified user, use:


$ qstat [-q queue] [-u user]
To display extended information about some jobs, use:

$ qstat -t [-u user]

To print detailed information about one job, use:


$ qstat -j job_id
MPI jobs need so-called parallel environments. There are two MPI integration types:
tight and loose. Tight integration means that Sun Grid Engine takes care of running all MPI
daemons for you on each machine. Loose means that you have to boot up and tear down the
MPI ring yourself. The OpenMPI libraries use a tight integration (you use mpirun directly),
and the Intel MPI library uses a loose integration (you must use mpdboot.py, mpiexec and
mpdallexit). Each configured PE has a different scheduling policy. To see a list of parallel
environments type:
[alexandru.herisanu@fep ~]$ qconf -spl
make
openmpi
openmpi*1
sun-hpc
sun-hpc*1
qsub -q ibm-quad.q -pe openmpi 5 means you want 5 slots from ibm-quad.q. Depending
on the type of job you want to run (simple/smp/mpi/hybrid), you may need to know the
scheduling type used on each parallel environment.
Basically, -pe openmpi 5 will schedule all five slots on the same machine, while -pe openmpi*1
5 will schedule each MPI process on a different machine, so you can use OpenMP or have better
I/O throughput.
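As an illustration, a minimal tight-integration OpenMPI submission could look like this
(mpi_job.sh and a.out are placeholders; the modules are the ones listed in section 5.3):
$ cat mpi_job.sh
#!/bin/bash
# load the same modules that were used to compile the program
module load compilers/gcc-4.1.2
module load mpi/openmpi-1.5.1_gcc-4.1.2
# tight integration: mpirun picks up the slots reserved by Sun Grid Engine
mpirun -np $NSLOTS ./a.out
$ qsub -q ibm-quad.q -pe openmpi 8 mpi_job.sh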
You can find a complete howto for Sun GridEngine here:
http://wikis.sun.com/display/gridengine62u5/Home
5.4.2 Easy submit: MPRUN.sh

mprun.sh is a helper script provided by us for easier application profiling. When you test
your application, you want to run the same application using different environment settings,
compiler settings and so on. The mprun.sh script lets you run and customize your application
from the command line.
$ mprun.sh -h
Usage: mprun.sh --job-name [job-name] --queue [queue-name] \
       --pe [Paralell Environment Name] [Nr. of Slots] \
       --modules [modules to load] --script [My script] \
       --out-dir [log dir] --show-qsub --show-script \
       --batch-job

Example:
mprun.sh --job-name MpiTest --queue ibm-opteron.q \
--pe openmpi*1 3 \
--modules "compilers/gcc-4.1.2:mpi/openmpi-1.5.1_gcc-4.1.2" \
--script exec_script.sh \
--show-qsub --show-script


=> exec_script.sh <=


# This is where you put what to run ...
mpirun -np $NSLOTS ./a.out
# End of script.
For example, you have an MPI program named a.out and you wish to test it using 1, 4 and
8 MPI processes on different queues and with different scheduling options. All you need is a run
script like:
mprun.sh --job-name MpiTest --queue ibm-opteron.q --pe openmpi 1 \
--modules "compilers/gcc-4.1.2:mpi/openmpi-1.5.1_gcc-4.1.2" \
--script exec_script.sh --show-qsub --show-script
mprun.sh --job-name MpiTest --queue ibm-nehalem.q --pe openmpi 2 \
--modules "compilers/gcc-4.1.2:mpi/openmpi-1.5.1_gcc-4.1.2" \
--script exec_script.sh --show-qsub --show-script
mprun.sh --job-name MpiTest --queue ibm-quad.q --pe openmpi*1 4 \
--modules "compilers/gcc-4.1.2:mpi/openmpi-1.5.1_gcc-4.1.2" \
--script exec_script.sh --show-qsub --show-script
Using this procedure, you can run your program, modify it and run it again under
the same conditions. A more advanced feature is using different compilers and environment
variables; for example, you can use this script to run your program either with the tight
OpenMPI integration or with the loose Intel MPI one.
MY_COMPILER=gcc
mprun.sh --job-name MpiTest --queue ibm-nehalem.q --pe openmpi 2 \
--modules "compilers/gcc-4.1.2:mpi/openmpi-1.5.1_gcc-4.1.2" \
--script exec_script.sh --show-qsub --show-script \
--additional-vars MY_COMPILER
MY_COMPILER=intel
mprun.sh --job-name MpiTest --queue ibm-quad.q --pe openmpi*1 4 \
--modules "compilers/intel-11.0_083:mpi/intelmpi-3.2.1_mpich" \
--script exec_script.sh --show-qsub --show-script \
--additional-vars MY_COMPILER
Your exec_script.sh must reflect these changes. When you run the execution script, you
will also have access to the MY_COMPILER variable.
# This is where you put what to run ...
if [[ $MY_COMPILER == "intel" ]]; then
    cat $PE_HOSTFILE | cut -f 1 -d " " > hostfile
    mpdboot --totalnum=$NSLOTS --file=hostfile --rsh=/usr/bin/ssh
    mpdtrace -l
    mpiexec -np $NSLOTS ./hello_world_intel
    mpdallexit
else
    mpirun -np $NSLOTS ./a.out
fi
# End of script.

5.4.3 Easy development: APPRUN.sh

Another tool built on top of the SunGridEngine capabilities is apprun. You can use apprun
to easily run programs in the batch system and export the display home. This is how it works:
You connect to fep.grid.pub.ro using a GUI-capable connection (see the Login section above).
Use apprun.sh eclipse, for example, to schedule a job that will run eclipse on an empty slot.
The graphical display will be exported through fep back to you. For example:
$ apprun.sh eclipse
will run eclipse. Currently available programs are eclipse and xterm (for interactive jobs).
5.4.4 Running a custom VM on the NCIT-Cluster

The NCIT Cluster has two virtualization strategies available: short-term virtual machines
and long-term, internet-connected VMs. The short-term virtual machines use KVM and the
LustreFS storage. They are meant for application scaling in cases where you need another
type of operating system or root access. The long-term VM domain is a Hyper-V R2 cluster.
You must be part of the kvm-users group to run a KVM machine. Basically you use the
SunGridEngine batch system to reserve resources (CPU and memory). This way the virtual
machines will not overlap with normal jobs. All virtual machines are hosted on LustreFS, so
you have an Infiniband connection on the opteron nodes and 4x1Gb maximal total throughput
if running on the other nodes.

This system is used for systems testing and scaling. You boot your machine once, customize
it, shut it down and boot several copy-on-write instances back up again. Copy-on-write
means the main disks are read-only and all the modified data is written to instance
files. If you wish to revert to the initial machine, just delete the instance files and you're
set. Additionally, you can run as many instances as you like without having to copy your master
machine all over again.
The VM startup script also uses apprun.sh. For example:

##
## Master Copy (for the master copy)
#
#apprun.sh kvm --queue ibm-quad.q@quad-wn05.grid.pub.ro --vmname ABDLab --cpu 2 \
--memory 2048M --hda db2-hda.qcow2 --status status.txt \
--mac 80:54:00:01:34:01 --vncport 10 --master
##
## Slave Copy (for the copy-on-write instances)
#
apprun.sh kvm --queue ibm-quad.q@quad-wn01.grid.pub.ro --vmname ABDLab01 --cpu 2 \
--memory 2048M --hda db2-hda.qcow2 --status status.txt --mac 80:54:01:01:34:01 \
--vncport 11
apprun.sh kvm --queue ibm-quad.q@quad-wn02.grid.pub.ro --vmname ABDLab02 --cpu 2 \
--memory 2048M --hda db2-hda.qcow2 --status status.txt --mac 80:54:02:02:34:02 \
--vncport 11
See http://cluster.grid.pub.ro for a complete howto.


6 The Software Stack

This section covers in detail the programming tools available on the cluster, including
compilers, debuggers and profiling tools. On the Linux operating system the freely available
GNU compilers are the somewhat natural choice. Code generated by the gcc C/C++
compiler performs acceptably on the Opteron-based machines. Starting with version 4.2,
gcc offers support for shared memory parallelization with OpenMP. Code generated
by the old g77 Fortran compiler typically does not perform well. Since version 4 of the GNU
compiler suite a Fortran 95 compiler - gfortran - is available. For performance reasons,
we recommend that Fortran programmers use the Intel or Sun compiler in 64-bit mode. As
there is an almost unlimited number of possible combinations of compilers and libraries, and
also the two addressing modes, 32- and 64-bit, we expect that there will be problems with
incompatibilities, especially when mixing C++ compilers.
Here is a short list of the software and middleware available on our cluster.

MPI

API               Flavor                               Observations
OpenMPI v.1.5.1   OpenMPI 1.5, Gcc 4.1.2 (if needed    The default MPI setup. On the opteron nodes it
                  we can compile it for PGI, Intel     uses the Infiniband network by default. All TCP
                  and Sun)                             nodes use both ethernet cards to transmit MPI
                                                       messages.
OpenMPI v.1.5     OpenMPI 1.5, Gcc 4.1.2               No Infiniband support compiled
OpenMPI v.1.3     OpenMPI 1.3, Gcc 4.1.2 and 4.4.0;    No Infiniband support compiled
                  Intel, PGI and Sun compilers
                  supported

Environments:
mpi/openmpi-1.3.2-gcc-4.1.2
mpi/openmpi-1.3.2-gcc-4.4.0
mpi/openmpi-1.3.2-pgi-7.0.7
mpi/openmpi-1.3.2-sunstudio12.1
mpi/openmpi-1.5.0-gcc-4.1.2
mpi/openmpi-1.5.1-gcc-4.1.2
One needs to load the compiler before the MPI environment. For example, if you use
mpi/openmpi-1.3.2_pgi-7.0.7 then a PGI 7.0.7 module load is required.
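For instance (using the module names from the module avail listing in section 5.3):
$ module load compilers/pgi-7.0.7
$ module load mpi/openmpi-1.3.2_pgi-7.0.7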
Website:
http://www.open-mpi.org/.
SDK Reference:
http://www.open-mpi.org/doc/v1.5/
http://www.open-mpi.org/doc/v1.3/
API             Flavor                             Observations
Sun-HPC8.2.1c   OpenMPI 1.4; Gcc, Intel, PGI       Sun Cluster Tools 8.2.1c is based on OpenMPI 1.4.
                and Sun compilers supported        Infiniband support is provided.

Environments: mpi/Sun-HPC8.2.1c-gnu, mpi/Sun-HPC8.2.1c-intel, mpi/Sun-HPC8.2.1c-pgi, mpi/Sun-HPC8.2.1c-sun.


Website: http://www.oracle.com/us/products/tools/message-passing-toolkit-070499.html
SDK Reference: http://download.oracle.com/docs/cd/E19708-01/821-1319-10/index.html
API               Flavor            Observations
Intel MPI 3.2.1   MPICH v2 flavor   Loose MPI implementation

For the Intel compiler, there is no tight integration setup; you must build your own MPI
ring using mpdboot and mpiexec. Sun Grid Engine uses a loose MPI integration for Intel MPI,
so you must start the mpd ring manually. For example:
cat $PE_HOSTFILE | cut -f 1 -d " " > hostfile
mpdboot --totalnum=$NSLOTS --file=hostfile --rsh=/usr/bin/ssh
mpdtrace -l
mpiexec -np $NSLOTS ./hello_world_intel
mpdallexit
When using a parallel environment, Sun Grid Engine exports the file containing the reserved
nodes in the variable PE_HOSTFILE. The first line parses that file and rewrites it
in MPICH format. The communication is done using ssh public key authentication.

Development software
The current compiler suite is composed of GCC 4.1.2, GCC 4.4.0, Sun Studio 12 U1, Intel
11.0-83, PGI 7.0.7 and PGI 10. Additionally, on the Cell B.E. platform there is an IBM XL
compiler for C and Fortran available, but it is currently inaccessible.
Supported Java versions are Sun Java 1.6 U23 and OpenJDK 1.6. We provide Matlab
14 access by running it on VMware machines. The main profiler applications are the Sun Studio
Analyzer and Collector and Intel VTune. Other MPI profiling tools available are Scalasca
(and its viewer cube3). Current math libraries available: Intel MKL, NAG (for C and Fortran,
currently unavailable) and ATLAS v.9.11 (compiled for gcc and sunstudio on the Xeon nodes).
Current debugging tools: TotalView 8.6.2, Eclipse Debugger (GDB), Valgrind. We provide
access to a remote desktop to use all GUI-enabled applications. We have a dedicated
set of computers on which you can run Eclipse, SunStudio, TotalView etc. and export your
display locally. We currently provide remote display capability through FreeNX, VNC or X11
forwarding.

6.1 Compilers
6.1.1 General Compiling and Linker hints

To access non-default compilers you have to load the appropriate module: use module
avail to see the available modules and module load to load the module (see 5.3 Module
Package). You can then access the compilers by their original name, e.g. g++, gcc, gfortran,
or by the environment variables $CXX, $CC or $FC. When loading more than one
compiler module, however, be aware that these environment variables point to the compiler
loaded last.
For convenient switching between compilers and platforms, we added environment variables
for the most important compiler flags. These variables can be used to write a generic makefile
which compiles on all our Unix-like platforms:

FC, CC, CXX - a variable containing the appropriate compiler name.
FLAGS_DEBUG - enable debug information.
FLAGS_FAST - include the options which usually offer good performance. For many
compilers this will be the -fast option.
FLAGS_FAST_NO_FPOPT - like FLAGS_FAST, but disallow any floating point optimizations
which would have an impact on rounding errors.
FLAGS_ARCH32 - build 32-bit executables or libraries.
FLAGS_ARCH64 - build 64-bit executables or libraries.
FLAGS_AUTOPAR - enable autoparallelization, if the compiler supports it.
FLAGS_OPENMP - enable OpenMP support, if supported by the compiler.
To produce debugging information in the operating system's native format, use the -g option at
compile time. In order to be able to mix different compilers, all these variables also exist with
the compiler name in the variable name, like GCC_CXX or FLAGS_GCC_FAST.
$ $CXX $FLAGS_FAST $FLAGS_ARCH64 $FLAGS_OPENMP $PSRC/cpop/pi.cpp
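For illustration, a minimal generic makefile built on these variables might look like the sketch
below (pi.cpp and the target name are placeholders; CXX and the FLAGS_* variables are
inherited from the environment set up by the loaded compiler module):
# compiler-independent makefile sketch
CXXFLAGS = $(FLAGS_FAST) $(FLAGS_ARCH64) $(FLAGS_OPENMP)
pi: pi.cpp
	$(CXX) $(CXXFLAGS) pi.cpp -o pi
clean:
	rm -f pi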
In general we recommend using the same flags for compiling and for linking; otherwise the
program may not run correctly or linking may fail. The order of the command line options
while compiling and linking does matter. If you get unresolved symbols while linking, this
may be caused by a wrong order of libraries. If a library xxx uses symbols from the library
yyy, the library yyy has to be to the right of xxx on the command line, e.g. ld ... -lxxx -lyyy.
The search path for header files is extended with the -Idirectory option and the library
search path with the -Ldirectory option. The environment variable LD_LIBRARY_PATH specifies
the search path where the program loader looks for shared libraries. Some compile time linkers,
e.g. the Sun linker, also use this variable while linking, while the GNU linker does not.
6.1.2 Programming Hints

Generally, when developing a program, one wants to make it run faster. In order to
improve the quality of a code, there are certain guidelines to follow in order to make better
use of the available hardware resources:
1. Turn on compiler optimization. The $FLAGS_FAST options may be a good starting
point. However, keep in mind that optimization may change the rounding errors of floating
point calculations. You may want to use the variables supplied by the compiler modules. An
optimized program typically runs 3 to 10 times faster than the non-optimized one.
2. Try another compiler. The ability of different compilers to generate efficient executables
varies. The runtime differences are often between 10% and 30%.
3. Write efficient code which can be optimized by the compiler. Look up information
regarding the compiler you want to use in its documentation, both online and offline.
4. Access memory contiguously in order to reduce cache and TLB misses. This especially
affects multidimensional arrays and complex data structures.
5. Use optimized libraries, e.g. the Sun Performance Library or the ACML library.
6. Use a profiling tool, like the Sun Collector and Analyzer, to find the compute- or
time-intensive parts of your program, since these are the parts where you want to start optimizing.
7. Consider parallelization to reduce the runtime of your program.

6.1.3 GNU Compilers

The GNU C/C++/Fortran compilers are available through the binaries gcc, g++, g77
and gfortran, or via the environment variables $CXX, $CC or $FC. If you cannot access them,
you have to load the appropriate module file as described in section 5.3 Module Package. For
further reference the manual pages are available. Some options to compile your programs and
increase their performance are:
-m32 or -m64, to produce code with 32-bit or 64-bit addressing - as mentioned above, the
default is platform dependent
-march=opteron, to optimize for the Opteron processors (NCIT Cluster)
-mcpu=ultrasparc, to optimize for the UltraSPARC I/II processors (CoLaborator)
-O2 or -O3, for different levels of optimization
-malign-double, for Pentium-specific optimization
-ffast-math, for floating point optimizations
GNU compilers with versions above 4.2 support OpenMP by default. Use the -fopenmp
flag to enable OpenMP support.
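For example, a typical optimized build of an OpenMP program on the Opteron nodes could
be (pi.c is just a placeholder source file):
$ gcc -m64 -O3 -march=opteron -ffast-math -fopenmp pi.c -o pi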
6.1.4 GNU Make

Make is a tool which allows the automation, and hence the efficient execution, of tasks. In
particular, it is used to auto-compile programs. In order to obtain an executable from several
sources it is inefficient to compile every file each time and link-edit them afterwards. GNU
Make compiles every file separately, and once one of them is changed, only the modified one
will be recompiled.
Make uses a configuration file called Makefile. Such a file contains rules and
automation commands. Here is a very simple Makefile example which helps clarify the
Make syntax.
Makefile example1
all:
	gcc -Wall hello.c -o hello
clean:
	rm -f hello
For the execution of the example above the following commands are used:
$ make
gcc -Wall hello.c -o hello
$ ./hello
hello world!
The example presented above contains two rules: all and clean. When run, the make
command performs the first rule written in the Makefile (in this case all - the name is not
particularly important).
The executed command is gcc -Wall hello.c -o hello. The user can choose explicitly which
rule will be performed by giving it as a parameter to the make command.

$ make clean
rm -f hello
$ make all
gcc -Wall hello.c -o hello
In the above example, the clean rule is used in order to delete the executable hello and the
make all command to obtain the executable again.
It can be seen that no other arguments are passed to the make command to specify which
Makefile will be analyzed. By default, GNU Make searches, in order, for the following files:
GNUmakefile, Makefile, makefile, and analyzes the first one it finds.
The syntax of a rule
Here is the syntax of a rule from a Makefile:
target: prerequisites
<tab> command
* target is, usually, the file which will be obtained by running command. As seen in the
previous examples, this can also be a virtual target, meaning that it has no file associated with it.
* prerequisites represents the dependencies needed to follow the rule. These are usually
the various files needed for obtaining the target.
* <tab> represents the tab character and it MUST, by all means, be used before specifying
the command.
* command - a list of one or more commands which are executed when the target is
obtained.
Here is another example of a Makefile:
Makefile example2
all: hello
hello: hello.o
	gcc hello.o -o hello
hello.o: hello.c
	gcc -Wall -c hello.c
clean:
	rm -f *.o hello
Observation: The rule all is performed implicitly.
* all has a hello dependency and executes no commands.
* hello depends on hello.o; it link-edits the file hello.o.
* hello.o has a hello.c dependency; it compiles and assembles the hello.c file.
In order to obtain the executable, the following commands are used:
$ make -f Makefile_example2
gcc -Wall -c hello.c
gcc hello.o -o hello
The use of variables
A Makefile allows the use of variables. Here is an example:
Makefile example3


CC = gcc
CFLAGS = -Wall -g
all: hello
hello: hello.o
	$(CC) $^ -o $@
hello.o: hello.c
	$(CC) $(CFLAGS) -c $<
.PHONY: clean
clean:
	rm -f *.o hello
In the example above, the variables CC and CFLAGS were defined. CC stands for the
compiler used, and CFLAGS for the options (flags) used when compiling. In this case, the options
used show warnings (-Wall) and compile with debugging support (-g). The reference to
a variable is made using the construction $(VAR_NAME). Therefore, $(CC) is replaced with
gcc, and $(CFLAGS) is replaced with -Wall -g.
There are also some predefined useful variables:
* $@ expands to the name of the target;
* $^ expands to the list of prerequisites;
* $< expands to the first prerequisite.
Thus, the command $(CC) $^ -o $@ reads as:
gcc hello.o -o hello
and the command $(CC) $(CFLAGS) -c $< reads as:
gcc -Wall -g -c hello.c
The usage of implicit rules
Many times there is no need to specify the command that must be executed, as it can be
deduced implicitly. This way, if the following rule is specified:
main.o: main.c
the implicit command $(CC) $(CFLAGS) -c -o $@ $< is used.
Thus, the Makefile example2 shown before can be simplified, using implicit rules, like this:
Makefile example4
CC = gcc
CFLAGS = -Wall -g
all: hello
hello: hello.o
hello.o: hello.c
.PHONY: clean
clean:
	rm -f *.o *~ hello
A phony target is one that is not really the name of a file. It is just a name for some
commands to be executed when you make an explicit request. There are two reasons to use a
phony target: to avoid a conflict with a file of the same name, and to improve performance.
If you write a rule whose commands will not create the target file, the commands will be
executed every time the target comes up for remaking. Here is an example:

clean:
	rm *.o hello
Because the rm command does not create a file named clean, probably no such file
will ever exist. Therefore, the rm command will be executed every time the make clean
command is run.
The phony target will cease to work if anything ever does create a file named clean in
that directory. Since it has no dependencies, the file clean would inevitably be considered
up to date, and its commands would not be executed. To avoid this problem, the explicit
declaration of the target as phony, using the special target .PHONY, is recommended.
.PHONY : clean
Once this is done, make clean will run the commands regardless of whether there is a
file named clean or not. Since make knows that phony targets do not name actual
files that could be remade from other files, it skips the implicit rule search for phony targets.
This is why declaring a target phony is good for performance, even if you are not worried
about the actual file existing. Thus, you first write the line that states that clean is a phony
target, then you write the rule, like this:
.PHONY: clean
clean:
	rm *.o hello
It can be seen that in the Makefile example4 implicit rules are used. The Makefile can be
simplified even more, like in the example below:
Makefile example5
CC = gcc
CFLAGS = -Wall -g
all: hello
hello: hello.o
.PHONY: clean
clean:
	rm -f *.o hello
In the above example, the rule hello.o: hello.c was deleted. Make sees that there is no file
hello.o and it looks for the C file from which it can be obtained. In order to do that, it creates
an implicit rule and compiles the file hello.c:
$ make -f Makefile.ex5
gcc -Wall -g -c -o hello.o hello.c
gcc hello.o -o hello
Generally, if we have only one source file, there is no need for a Makefile to obtain the
desired executable.
$ ls
hello.c
$ make hello
cc hello.c -o hello


Here is a complete example of a Makefile using gcc. Gcc can easily be replaced with other
compilers; the structure of the Makefile remains the same. Using all the facilities discussed up
to this point, we can write a complete example using gcc (the most commonly used compiler),
in order to obtain the executables of both a client and a server.
Files used:
* the server executable depends on the C files server.c, sock.c, cli_handler.c, log.c and on the
header files sock.h, cli_handler.h, log.h;
* the client executable depends on the C files client.c, sock.c, user.c, log.c and on the header
files sock.h, user.h, log.h.
The structure of the Makefile is presented below:
Makefile example6

CC = gcc            # the compiler used
CFLAGS = -Wall -g   # the compiling options
LDLIBS = -lefence   # the linking options
#create the client and server executables
all: client server
#link the modules client.o user.o sock.o in the client executable
client: client.o user.o sock.o log.o
#link the modules server.o cli_handler.o sock.o in the server executable
server: server.o cli_handler.o sock.o log.o
#compile the file client.c in the object module client.o
client.o: client.c sock.h user.h log.h
#compile the file user.c in the object module user.o
user.o: user.c user.h
# compile the file sock.c in the module object sock.o
sock.o: sock.c sock.h
#compiles the file server.c in the object module server.o
server.o: server.c cli_handler.h sock.h log.h
#compile the file cli_handler.c in the object module cli_handler.o
cli_handler.o: cli_handler.c cli_handler.h
#compile the file log.c in the object module log.o
log.o: log.c log.h
.PHONY: clean
clean:
rm -fr *.o server client
6.1.5 Sun Compilers

We use Sun Studio 6 on the Solaris machines (soon to be upgraded) and Sun Studio 12 on
the Linux machines. Nevertheless, the use of these two versions of Sun Studio is pretty much
the same.
The Sun Studio development tools include the Fortran 95, C and C++ compilers. To keep
your applications free of bugs and performing at their best, we recommend that you recompile
your code with the latest production compiler. To check the version that you are currently
using, use the -V flag.
The commands that invoke the compilers are cc, f77, f90, f95 and CC. An important aspect
is that, starting with Sun Studio 7, the Fortran 77 compiler is no longer available.
Instead, the f77 command invokes a wrapper script that passes the necessary
compatibility options, like -f77, to the f95 compiler. We recommend adding -f77
-trap=common in order to revert to f95 settings for error trapping. At the link step you
may want to use the -xlang=f77 option (when linking to old f77 object binaries). Detailed
information about compatibility issues between Fortran 77 and Fortran 95 can be found at
http://docs.sun.com/source/816-2457/5_f77.html
For more information about the use of the Sun Studio compilers you may use the man pages,
as well as the documentation found at http://developers.sun.com/sunstudio/documentation.
6.1.6 Intel Compilers

Use the module command to load the Intel compilers into your environment. The current
version of the Intel compiler is 11.1. The Intel C/C++ and Fortran77/Fortran90 compilers are
invoked by the commands icc, icpc and ifort on Linux. The corresponding manual pages are
available for further information. Some options that increase the performance of the produced
code include:
-O3 high optimization
-fp-model fast=2 enable floating point optimization
-openmp turn on OpenMP
-parallel turn on auto-parallelization
In order to read or write big-endian binary data in Fortran programs, you can use the compiler
option -convert big_endian.
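For example, an optimized OpenMP build with the Intel C compiler might look like this
(my_prog.c is a placeholder):
$ icc -O3 -fp-model fast=2 -openmp my_prog.c -o my_prog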
6.1.7 PGI Compiler

PGI compilers are a set of commercially available Fortran, C and C++ compilers for High
Performance Computing systems from The Portland Group.
PGI compilers:
PGF95 - for Fortran
PGCC - for C
PGC++ - for C++
PGI recommended default flags:
-fast A generally optimal set of options including global optimization, SIMD vectorization,
loop unrolling and cache optimizations.
-Mipa=fast,inline Aggressive inter-procedural analysis and optimization, including automatic
inlining.
-Msmartalloc Use optimized memory allocation (Linux only).
--zc_eh Generate low-overhead exception regions.
PGI tuning flags:
-Mconcur Enable auto-parallelization; for use with multi-core or multi-processor targets.
-mp Enable OpenMP; enable user-inserted parallel programming directives and pragmas.
-Mprefetch Control generation of prefetch instructions to improve memory performance
in compute-intensive loops.
-Msafeptr Ignore potential data dependencies between C/C++ pointers.
-Mfprelaxed Relax floating point precision; trade accuracy for speed.
-tp x64 Create a PGI Unified Binary which functions correctly on, and is optimized for,
both Intel and AMD processors.
-Mpfi/-Mpfo Profile Feedback Optimization; requires two compilation passes and an
interim execution to generate a profile.
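As an illustration, a typical aggressive build with the PGI C compiler might be (my_prog.c
is a placeholder; drop -Mipa=fast,inline if link times become an issue):
$ pgcc -fast -Mipa=fast,inline my_prog.c -o my_prog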

6.2 OpenMPI

RPMs are available, compiled for both 32- and 64-bit machines. OpenMPI was compiled with
both the Sun Studio and the GNU compilers and the user may select which one to use depending
on the task.
The compiler wrappers provided by OpenMPI are mpicc, mpiCC, mpic++, mpicxx, mpif77 and
mpif90. Please note that mpiCC, mpic++ and mpicxx all invoke the same C++ compiler
with the same options. Another aspect is that all of these commands are only wrappers that
actually call opal_wrapper. Using the -show flag does not invoke the compiler; instead it prints
the command that would be executed. To find out all the possible flags these commands may
receive, use the -flags flag.
To compile your program with mpicc, use:
$ mpicc -c pr.c
To link your compiled program, use:
$ mpicc -o pr pr.o
To compile and link all at once, use:
$ mpicc -o pr pr.c
For the other compilers the commands are the same - you only have to replace the compiler's
name with the proper one.
The mpirun command executes a program:
$ mpirun [options] <program> [<args>]
The most used option specifies the number of cores on which to run the job: -n #. It is not
necessary to specify the hosts on which the job will execute, because this is managed by Sun
Grid Engine.
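Putting the pieces together, a short compile-and-run session could look like the sketch below
(hello.c and the core count are placeholders; in production the mpirun line would normally go
inside a job script submitted to Sun Grid Engine):
$ module load mpi/openmpi-1.5.1-gcc-4.1.2
$ mpicc -o hello hello.c
$ mpirun -n 8 ./hello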


6.3 OpenMP

OpenMP is an Application Program Interface (API), jointly defined by a group of major
computer hardware and software vendors. OpenMP provides a portable, scalable model for
developers of shared memory parallel applications. The API supports C/C++ and Fortran on
multiple architectures, including UNIX and Windows NT. This tutorial covers most of the major
features of OpenMP, including its various constructs and directives for specifying parallel
regions, work sharing, synchronisation and data environment. Runtime library functions and
environment variables are also covered. This tutorial includes both C and Fortran example
codes and an exercise.
As an API that may be used to explicitly direct multi-threaded, shared memory parallelism,
OpenMP is comprised of three primary components:
Compiler Directives
Runtime Library Routines
Environment Variables
The API is specified for C/C++ and Fortran and has been implemented on most major
platforms, including Unix/Linux platforms and Windows NT, thus making it portable. It
is standardised: jointly defined and endorsed by a group of major computer hardware and
software vendors, and it is expected to become an ANSI standard.
6.3.1 What does OpenMP stand for?

Short answer: Open Multi-Processing
Long answer: Open specifications for Multi-Processing via collaborative work between
interested parties from the hardware and software industry, government and academia.
OpenMP is not meant for distributed memory parallel systems (by itself) and it is not
necessarily implemented identically by all vendors. It does not guarantee the most
efficient use of shared memory and it does not check for data dependencies, data
conflicts, race conditions or deadlocks. It does not check for code sequences that
cause a program to be classified as non-conforming. It is also not meant to cover
compiler-generated automatic parallel processing and directives to the compiler to assist it,
and the design does not guarantee that input or output to the same file is synchronous when
executed in parallel. The programmer is responsible for the synchronisation.
6.3.2 OpenMP Programming Model

OpenMP is based upon the existence of multiple threads in the shared memory programming
paradigm: a shared memory process consists of multiple threads.
OpenMP is an explicit (not automatic) programming model, offering the programmer full
control over the parallel processing. OpenMP uses the fork-join model of parallel execution.
All OpenMP programs begin as a single process: the master thread. The master thread runs
sequentially until the first parallel region construct is encountered.
FORK: the master thread then creates a team of parallel threads. The statements in
the program that are enclosed by the parallel region construct are then executed in parallel
amongst the various team threads.
JOIN: when the team threads complete, they synchronise and terminate, leaving only the
master thread.


Most OpenMP parallelism is specified through the use of compiler directives which are
embedded in C/C++ or Fortran source code. Nested parallelism support: the API provides
for the placement of parallel constructs inside other parallel constructs. Implementations
may or may not support this feature.
Also, the API provides for dynamically altering the number of threads which may be used
to execute different parallel regions. Implementations may or may not support this feature.
OpenMP specifies nothing about parallel I/O. This is particularly important if multiple
threads attempt to write to or read from the same file. If every thread conducts I/O to a different
file, the issue is not significant. It is entirely up to the programmer to ensure that I/O is
conducted correctly within the context of a multi-threaded program.
OpenMP provides a "relaxed-consistency" and "temporary" view of thread memory, as
the producers put it. In other words, threads can cache their data and are not required
to maintain exact consistency with real memory all of the time. When it is critical that all
threads view a shared variable identically, the programmer is responsible for ensuring that the
variable is FLUSHed by all threads as needed.
6.3.3 Environment Variables

OpenMP provides the following environment variables for controlling the execution of parallel
code. All environment variable names are uppercase. The values assigned to them
are not case sensitive.
OMP_SCHEDULE
Applies only to the DO and PARALLEL DO (Fortran) and for and parallel for (C/C++) directives
which have their schedule clause set to RUNTIME. The value of this variable determines how
iterations of the loop are scheduled on processors. For example:
setenv OMP_SCHEDULE "guided, 4"
setenv OMP_SCHEDULE "dynamic"
OMP_NUM_THREADS
Sets the maximum number of threads to use during execution. For example:
setenv OMP_NUM_THREADS 8
OMP_DYNAMIC
Enables or disables dynamic adjustment of the number of threads available for execution of
parallel regions. Valid values are TRUE or FALSE. For example:
setenv OMP_DYNAMIC TRUE
OMP_NESTED
Enables or disables nested parallelism. Valid values are TRUE or FALSE. For example:
setenv OMP_NESTED TRUE
Implementation notes:
Your implementation may or may not support nested parallelism and/or dynamic threads.
If nested parallelism is supported, it is often only nominal, meaning that a nested parallel
region may only have one thread. Consult your implementation's documentation for details -
or experiment and find out for yourself.
OMP_STACKSIZE
New feature available with OpenMP 3.0. Controls the size of the stack for created (non-master)
threads. Examples:

setenv OMP_STACKSIZE 2000500B
setenv OMP_STACKSIZE "3000 k "
setenv OMP_STACKSIZE 10M
setenv OMP_STACKSIZE " 10 M "
setenv OMP_STACKSIZE "20 m "
setenv OMP_STACKSIZE " 1G"
setenv OMP_STACKSIZE 20000

OMP_WAIT_POLICY
New feature available with OpenMP 3.0. Provides a hint to an OpenMP implementation
about the desired behaviour of waiting threads. A compliant OpenMP implementation may
or may not abide by the setting of this environment variable. Valid values are ACTIVE and
PASSIVE. ACTIVE specifies that waiting threads should mostly be active, i.e. consume
processor cycles, while waiting. PASSIVE specifies that waiting threads should mostly be
passive, i.e. not consume processor cycles, while waiting. The details of the ACTIVE and
PASSIVE behaviours are implementation defined. Examples:
setenv OMP_WAIT_POLICY ACTIVE
setenv OMP_WAIT_POLICY active
setenv OMP_WAIT_POLICY PASSIVE
setenv OMP_WAIT_POLICY passive

OMP_MAX_ACTIVE_LEVELS
New feature available with OpenMP 3.0. Controls the maximum number of nested active
parallel regions. The value of this environment variable must be a non-negative integer. The
behaviour of the program is implementation-defined if the requested value of
OMP_MAX_ACTIVE_LEVELS is greater than the maximum number of nested active parallel
levels an implementation can support, or if the value is not a non-negative integer. Example:
setenv OMP_MAX_ACTIVE_LEVELS 2
OMP_THREAD_LIMIT
New feature available with OpenMP 3.0. Sets the number of OpenMP threads to use for
the whole OpenMP program. The value of this environment variable must be a positive
integer. The behaviour of the program is implementation-defined if the requested value of
OMP_THREAD_LIMIT is greater than the number of threads an implementation can support,
or if the value is not a positive integer. Example:
setenv OMP_THREAD_LIMIT 8
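Note that the setenv examples above use csh/tcsh syntax; if your login shell is bash, the
equivalent form uses export, for example:
export OMP_NUM_THREADS=8
export OMP_STACKSIZE=10M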
6.3.4 Directives format

Fortran Directives Format
Format (not case sensitive):
sentinel directive-name [clause ...]
All Fortran OpenMP directives must begin with a sentinel. The accepted sentinels depend
on the type of Fortran source. Possible sentinels are:
!$OMP
C$OMP
*$OMP

Example:
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(BETA,PI)
Fixed Form Source:
- !$OMP, C$OMP and *$OMP are accepted sentinels and must start in column 1.
- All Fortran fixed form rules for line length, white space, continuation and comment
columns apply to the entire directive line.
- Initial directive lines must have a space/zero in column 6.
- Continuation lines must have a non-space/zero in column 6.
Free Form Source:
- !$OMP is the only accepted sentinel. It can appear in any column, but must be preceded
by white space only.
- All Fortran free form rules for line length, white space, continuation and comment columns
apply to the entire directive line.
- Initial directive lines must have a space after the sentinel.
- Continuation lines must have an ampersand as the last non-blank character of the line.
The following line must begin with a sentinel and then the continuation directives.
General Rules:
* Comments can not appear on the same line as a directive.
* Only one directive name may be specified per directive.
* Fortran compilers which are OpenMP enabled generally include a command line option
which instructs the compiler to activate and interpret all OpenMP directives.
* Several Fortran OpenMP directives come in pairs and have the form shown below. The
end directive is optional but advised for readability.
!$OMP directive
[ structured block of code ]
!$OMP end directive
C / C++ Directives Format
Format:
#pragma omp directive-name [clause, ...] newline
A valid OpenMP directive must appear after the pragma and before any clauses. Clauses
can be placed in any order, and repeated as necessary, unless otherwise restricted. The pragma
directive must precede the structured block which is enclosed by it.
Example:
#pragma omp parallel default(shared) private(beta,pi)
General Rules:
* Case sensitive.
* Directives follow the conventions of the C/C++ standards for compiler directives.
* Only one directive-name may be specified per directive.
* Each directive applies to at most one succeeding statement, which must be a structured
block.
* Long directive lines can be continued on succeeding lines by escaping the newline
character with a backslash (\) at the end of a directive line.
PARALLEL Region Construct
Purpose: A parallel region is a block of code that will be executed by multiple threads.
This is the fundamental OpenMP parallel construct.
Example:
Fortran

!$OMP PARALLEL [clause ...]


IF (scalar_logical_expression)
PRIVATE (list)
SHARED (list)
DEFAULT (PRIVATE | FIRSTPRIVATE | SHARED | NONE)
FIRSTPRIVATE (list)
REDUCTION (operator: list)
COPYIN (list)
NUM_THREADS (scalar-integer-expression)
block
!$OMP END PARALLEL
C/C++
#pragma omp parallel [clause ...] newline
if (scalar_expression)
private (list)
shared (list)
default (shared | none)
firstprivate (list)
reduction (operator: list)
copyin (list)
num_threads (integer-expression)
structured_block
Notes:
- When a thread reaches a PARALLEL directive, it creates a team of threads and becomes
the master of the team. The master is a member of that team and has thread number 0 within
that team.
- Starting from the beginning of this parallel region, the code is duplicated and all threads
will execute it.
- There is an implicit barrier at the end of a parallel section. Only the master thread
continues execution past this point.
- If any thread terminates within a parallel region, all threads in the team will terminate,
and the work done up until that point is undefined.
How Many Threads?
The number of threads in a parallel region is determined by the following factors, in order
of precedence:
1. Evaluation of the IF clause
2. Setting of the NUM_THREADS clause
3. Use of the omp_set_num_threads() library function
4. Setting of the OMP_NUM_THREADS environment variable
5. Implementation default - usually the number of CPUs on a node, though it could be
dynamic.
Threads are numbered from 0 (master thread) to N-1.
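As a small illustration of this precedence (a sketch only; the values 4 and 2 are arbitrary):
#include <omp.h>
#include <stdio.h>
int main(void) {
  omp_set_num_threads(4);              /* overrides OMP_NUM_THREADS ... */
  #pragma omp parallel num_threads(2)  /* ... but the clause overrides the call */
  {
    if (omp_get_thread_num() == 0)
      printf("team size = %d\n", omp_get_num_threads());  /* prints 2 */
  }
  return 0;
}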
Dynamic Threads:
Use the omp_get_dynamic() library function to determine if dynamic threads are enabled.
If supported, the two methods available for enabling dynamic threads are:
1. The omp_set_dynamic() library routine;
2. Setting the OMP_DYNAMIC environment variable to TRUE.
Nested Parallel Regions:
Use the omp_get_nested() library function to determine if nested parallel regions are enabled.
The two methods available for enabling nested parallel regions (if supported) are:
1. The omp_set_nested() library routine;
2. Setting the OMP_NESTED environment variable to TRUE.
If not supported, a parallel region nested within another parallel region results in the
creation of a new team consisting of one thread, by default.
Clauses:
IF clause: if present, it must evaluate to .TRUE. (Fortran) or non-zero (C/C++) in order
for a team of threads to be created. Otherwise, the region is executed serially by the master
thread.
Restrictions:
A parallel region must be a structured block that does not span multiple routines or code
files. It is illegal to branch into or out of a parallel region. Only a single IF clause is permitted.
Only a single NUM_THREADS clause is permitted.
Example: Parallel Region - Simple "Hello World" program
- Every thread executes all code enclosed in the parallel section
- OpenMP library routines are used to obtain thread identifiers and the total number of threads
Fortran - Parallel Region Example
      INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
C     Fork a team of threads with each thread having a private TID variable
!$OMP PARALLEL PRIVATE(TID)
C     Obtain and print thread id
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Hello World from thread = ', TID
C     Only master thread does this
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
        PRINT *, 'Number of threads = ', NTHREADS
      END IF
C     All threads join master thread and disband
!$OMP END PARALLEL
      END
C / C++ - Parallel Region Example
#include <omp.h>
#include <stdio.h>
main () {
  int nthreads, tid;
  /* Fork a team of threads with each thread having a private tid variable */
  #pragma omp parallel private(tid)
  {
    /* Obtain and print thread id */
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);
    /* Only master thread does this */
    if (tid == 0)
    {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  } /* All threads join master thread and terminate */
}
General rules of directives (for more details about these directives see the OpenMP Directives
section):
- They follow the standards and conventions of the C/C++ or Fortran compilers;
- They are case sensitive;
- In a directive, only one directive name can be specified;
- A directive applies only to the statement following it, which must be a structured block;
- Long directives can be continued on the next lines by adding a \ at the end of the
first line of the directive.
6.3.5 The OpenMP Directives

PARALLEL region: a block will be executed in parallel by OMP_NUM_THREADS threads.
It is the fundamental construction in OpenMP.
Work-sharing constructs:
DO/for - shares the iterations of a loop among all threads (data parallelism);
SECTIONS - splits the task into separate sections (functional parallelism);
SINGLE - serialises a code section.
Synchronisation constructs:
MASTER - only the master thread will execute the region of code;
CRITICAL - that region of code will be executed by only one thread at a time;
BARRIER - all threads from the pool synchronise;
ATOMIC - a certain region of memory will be updated atomically - a sort of
critical section;
FLUSH - identifies a synchronisation point at which the memory must be in a consistent
state;
ORDERED - the iterations of the loop under this directive will be executed in the same
order as the corresponding serial execution;
THREADPRIVATE - it is used to create, from global variables, separate per-thread
copies which persist across several parallel regions.
Clauses to set the data context:
These are important when programming in a shared memory programming model. They are
used together with the PARALLEL, DO/for and SECTIONS directives.
PRIVATE - the variables from the list are private to every thread;
SHARED - the variables from the list are shared by the threads of the current team;
DEFAULT - it allows the user to set the default scope (PRIVATE, SHARED or NONE)
for all the variables in a parallel region;

FIRSTPRIVATE - it combines the functionality of the PRIVATE clause with automatic
initialisation of the variables in the list: the local copies are initialised with the value the
variable had before the parallel construct;
LASTPRIVATE - it combines the functionality of the PRIVATE clause with a copy back of
the value from the last iteration or section of the current construct;
COPYIN - it offers the possibility to assign the same value to THREADPRIVATE variables
for all the threads in the pool;
REDUCTION - it performs a reduction on the variables that appear in the list (with a
specific operation: + - * /, etc.).
6.3.6 Examples using OpenMP with C/C++

Here are some examples using OpenMP with C/C++:


/******************************************************************************
 * OpenMP Example - Hello World - C/C++ Version
 * FILE: omp_hello.c
 * DESCRIPTION:
 *   In this simple example, the master thread forks a parallel region.
 *   All threads in the team obtain their unique thread number and print it.
 *   The master thread only prints the total number of threads. Two OpenMP
 *   library routines are used to obtain the number of threads and each
 *   thread's number.
 * SOURCE: Blaise Barney 5/99
 * LAST REVISED:
 ******************************************************************************/
#include <omp.h>
#include <stdio.h>
main () {
  int nthreads, tid;
  /* Fork a team of threads giving them their own copies of variables */
  #pragma omp parallel private(nthreads, tid)
  {
    /* Obtain thread number */
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);
    /* Only master thread does this */
    if (tid == 0)
    {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  } /* All threads join master thread and disband */
}
/******************************************************************************
 * OpenMP Example - Loop Work-sharing - C/C++ Version
 * FILE: omp_workshare1.c
 * DESCRIPTION:
 *   In this example, the iterations of a loop are scheduled dynamically
 *   across the team of threads. A thread will perform CHUNK iterations
 *   at a time before being scheduled for the next CHUNK of work.
 * SOURCE: Blaise Barney 5/99
 * LAST REVISED: 03/03/2002
 ******************************************************************************/
#include <omp.h>
#include <stdio.h>
#define CHUNKSIZE 10
#define N 100
main () {
  int nthreads, tid, i, chunk;
  float a[N], b[N], c[N];
  /* Some initializations */
  for (i=0; i < N; i++)
    a[i] = b[i] = i * 1.0;
  chunk = CHUNKSIZE;
  #pragma omp parallel shared(a,b,c,chunk) private(i,nthreads,tid)
  {
    tid = omp_get_thread_num();
    #pragma omp for schedule(dynamic,chunk)
    for (i=0; i < N; i++)
    {
      c[i] = a[i] + b[i];
      printf("tid= %d i= %d c[i]= %f\n", tid,i,c[i]);
    }
    if (tid == 0)
    {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  } /* end of parallel section */
}
/******************************************************************************
 * OpenMP Example - Sections Work-sharing - C/C++ Version
 * FILE: omp_workshare2.c
 * DESCRIPTION:
 *   In this example, the iterations of a loop are split into two different
 *   sections. Each section will be executed by one thread. Extra threads
 *   will not participate in the sections code.
 * SOURCE: Blaise Barney 5/99
 * LAST REVISED: 03/03/2002
 ******************************************************************************/
#include <omp.h>
#include <stdio.h>
#define N 50
main ()
{
  int i, nthreads, tid;
  float a[N], b[N], c[N];
  /* Some initializations */
  for (i=0; i < N; i++)
    a[i] = b[i] = i * 1.0;

  #pragma omp parallel shared(a,b,c) private(i,tid,nthreads)
  {
    tid = omp_get_thread_num();
    printf("Thread %d starting...\n",tid);
    #pragma omp sections nowait
    {
      #pragma omp section
      for (i=0; i < N/2; i++)
      {
        c[i] = a[i] + b[i];
        printf("tid= %d i= %d c[i]= %f\n",tid,i,c[i]);
      }
      #pragma omp section
      for (i=N/2; i < N; i++)
      {
        c[i] = a[i] + b[i];
        printf("tid= %d i= %d c[i]= %f\n",tid,i,c[i]);
      }
    } /* end of sections */
    if (tid == 0)
    {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  } /* end of parallel section */
}
/******************************************************************************
 * OpenMP Example - Combined Parallel Loop Reduction - C/C++ Version
 * FILE: omp_reduction.c
 * DESCRIPTION:
 *   This example demonstrates a sum reduction within a combined parallel loop
 *   construct. Notice that default data element scoping is assumed - there
 *   are no clauses specifying shared or private variables. OpenMP will
 *   automatically make loop index variables private within team threads, and
 *   global variables shared.
 * SOURCE: Blaise Barney 5/99
 * LAST REVISED:
 ******************************************************************************/
#include <omp.h>
#include <stdio.h>
main () {
  int i, n;
  float a[100], b[100], sum;
  /* Some initializations */
  n = 100;
  for (i=0; i < n; i++)
    a[i] = b[i] = i * 1.0;
  sum = 0.0;
  #pragma omp parallel for reduction(+:sum)
  for (i=0; i < n; i++)
    sum = sum + (a[i] * b[i]);
  printf("   Sum = %f\n",sum);
}
6.3.7 Running OpenMP

On Linux machines, the GNU C compiler now provides integrated support for OpenMP. To
compile programs that use #pragma omp directives, add the -fopenmp flag to the gcc command.
In order to compile on a local station, for a C/C++ program the command used is:
- gcc -fopenmp my_program.c
In order to compile on fep.grid.pub.ro, a similar command can be used with gcc:
- gcc -fopenmp -O3 file_name.c -o binary_name
(with the Sun compiler the equivalent flags are -xopenmp -xO3).
For defining the number of threads use a construction similar to the following one:
#define NUM_THREADS 2
combined with the function omp_set_num_threads(NUM_THREADS). Similarly,
export OMP_NUM_THREADS=4
can be used in the command line in order to create a 4-thread example.
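For instance, a complete compile-and-run sequence for the hello world example from the
previous section might be:
$ gcc -fopenmp -O2 omp_hello.c -o omp_hello
$ export OMP_NUM_THREADS=4
$ ./omp_hello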
6.3.8 OpenMP Debugging - C/C++

In this section there are some C and Fortran program examples using OpenMP that have
bugs. We will show you how to fix these programs, and we will briefly present a debugging
tool, called TotalView, which is explained later in the documentation.
/******************************************************************************
 * OpenMP Example - Combined Parallel Loop Work-sharing - C/C++ Version
 * FILE: omp_workshare3.c
 * DESCRIPTION:
 *   This example attempts to show use of the parallel for construct. However
 *   it will generate errors at compile time. Try to determine what is causing
 *   the error. See omp_workshare4.c for a corrected version.
 * SOURCE: Blaise Barney 5/99
 * LAST REVISED: 03/03/2002
 ******************************************************************************/
#include <omp.h>
#define N 50
#define CHUNKSIZE 5
main () {
  int i, chunk, tid;
  float a[N], b[N], c[N];
  /* Some initializations */
  for (i=0; i < N; i++)
    a[i] = b[i] = i * 1.0;
  chunk = CHUNKSIZE;
  #pragma omp parallel for   \
    shared(a,b,c,chunk)      \
    private(i,tid)           \
    schedule(static,chunk)
  {
    tid = omp_get_thread_num();
    for (i=0; i < N; i++)
    {
      c[i] = a[i] + b[i];
      printf("tid= %d i= %d c[i]= %f\n", tid, i, c[i]);
    }
  } /* end of parallel for construct */
}
The output of the gcc command is:
[testuser@fep ~]$ gcc -fopenmp test_openmp.c -o opens
test_openmp.c: In function 'main':
test_openmp.c:19: error: for statement expected before '{' token
test_openmp.c:24: warning: incompatible implicit declaration
of built-in function 'printf'
The cause of these errors is the form of the code that follows the pragma declaration. It is
not allowed to include code between the parallel for directive and the for loop. Also, it is not
allowed to enclose the code that follows the pragma declaration in braces (e.g. {}).
The revised, correct form of the program above is the following:
/******************************************************************************
 * OpenMP Example - Combined Parallel Loop Work-sharing - C/C++ Version
 * FILE: omp_workshare4.c
 * DESCRIPTION:
 *   This is a corrected version of the omp_workshare3.c example. Corrections
 *   include removing all statements between the parallel for construct and
 *   the actual for loop, and introducing logic to preserve the ability to
 *   query a thread's id and print it from inside the for loop.
 * SOURCE: Blaise Barney 5/99
 * LAST REVISED: 03/03/2002
 ******************************************************************************/
#include <omp.h>
#define N 50
#define CHUNKSIZE 5
main () {
  int i, chunk, tid;
  float a[N], b[N], c[N];
  char first_time;
  /* Some initializations */
  for (i=0; i < N; i++)
    a[i] = b[i] = i * 1.0;
  chunk = CHUNKSIZE;
  first_time = 'y';
  #pragma omp parallel for    \
    shared(a,b,c,chunk)       \
    private(i,tid)            \
    schedule(static,chunk)    \
    firstprivate(first_time)
  for (i=0; i < N; i++)
  {
    if (first_time == 'y')
    {
      tid = omp_get_thread_num();
      first_time = 'n';
    }
    c[i] = a[i] + b[i];
    printf("tid= %d i= %d c[i]= %f\n", tid, i, c[i]);
  }
}
While the error above could easily be detected just by examining the syntax, things will not
always be this simple. There are errors that cannot be detected at compile time. In such cases
a specialized debugger, called TotalView, is used. More details about these debuggers can be
found in the Debuggers section.
/******************************************************************************
 * FILE: omp_bug2.c
 * DESCRIPTION:
 *   Another OpenMP program with a bug.
 ******************************************************************************/
#include <omp.h>
main () {
  int nthreads, i, tid;
  float total;
  /*** Spawn parallel region ***/
  #pragma omp parallel
  {
    /* Obtain thread number */
    tid = omp_get_thread_num();
    /* Only master thread does this */
    if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
    printf("Thread %d is starting...\n",tid);
    #pragma omp barrier
    /* do some work */
    total = 0.0;
    #pragma omp for schedule(dynamic,10)
    for (i=0; i<1000000; i++)
      total = total + i*1.0;
    printf ("Thread %d is done! Total= %f\n",tid,total);
  } /*** End of parallel region ***/
}
The bugs in this case are caused by neglecting to scope the tid and total variables as
PRIVATE. By default, most OpenMP variables are scoped as SHARED, but these variables need
to be unique for each thread. It is also necessary to include stdio.h in order to have no
warnings.
The repaired form of the program is the following:
#include <omp.h>
#include <stdio.h>
main () {
int nthreads, i, tid;
float total;
/*** Spawn parallel region ***/
#pragma omp parallel private(tid,total)
{
/* Obtain thread number */
tid = omp_get_thread_num();
/* Only master thread does this */
if (tid == 0) {
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
printf("Thread %d is starting...\n",tid);
#pragma omp barrier
/* do some work */
total = 0.0;
#pragma omp for schedule(dynamic,10)
for (i=0; i<1000000; i++)
total = total + i*1.0;
printf ("Thread %d is done! Total= %f\n",tid,total);
} /*** End of parallel region ***/
}
/******************************************************************************
 * FILE: omp_bug3.c
 * DESCRIPTION:
 *   Run time error.
 * AUTHOR: Blaise Barney 01/09/04
 * LAST REVISED: 06/28/05
 ******************************************************************************/
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define N 50
int main (int argc, char *argv[])
{
int i, nthreads, tid, section;
float a[N], b[N], c[N];
void print_results(float array[N], int tid, int section);
/* Some initializations */
for (i=0; i<N; i++)
a[i] = b[i] = i * 1.0;
#pragma omp parallel private(c,i,tid,section)
{
tid = omp_get_thread_num();
if (tid == 0)
{
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
/*** Use barriers for clean output ***/
#pragma omp barrier
printf("Thread %d starting...\n",tid);
#pragma omp barrier
#pragma omp sections nowait
{
#pragma omp section
{
section = 1;
for (i=0; i<N; i++)
c[i] = a[i] * b[i];
print_results(c, tid, section);
}
#pragma omp section
{
section = 2;
for (i=0; i<N; i++)
c[i] = a[i] + b[i];
print_results(c, tid, section);
}
} /* end of sections */
/*** Use barrier for clean output ***/
#pragma omp barrier
printf("Thread %d exiting...\n",tid);
} /* end of parallel section */
}
void print_results(float array[N], int tid, int section) {
int i,j;
j = 1;

/*** use critical for clean output ***/


#pragma omp critical
{
printf("\nThread %d did section %d. The results are:\n", tid, section);
for (i=0; i<N; i++) {
printf("%e ",array[i]);
j++;
if (j == 6) {
printf("\n");
j = 1;
}
}
printf("\n");
} /*** end of critical ***/
#pragma omp barrier
printf("Thread %d done and synchronized.\n", tid);
}
Solving the problem:
The run time error is caused by the OMP BARRIER directive in the print_results
subroutine. By definition, an OMP BARRIER cannot be nested outside the static extent of
a SECTIONS directive. In this case it is orphaned outside the calling SECTIONS block. If
you delete the line with #pragma omp barrier from the print_results function, the program
will no longer hang.
/******************************************************************************
 * FILE: omp_bug4.c
 * DESCRIPTION:
 *   This very simple program causes a segmentation fault.
 * AUTHOR: Blaise Barney 01/09/04
 * LAST REVISED: 04/06/05
 ******************************************************************************/
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define N 1048
int main (int argc, char *argv[])
{
int nthreads, tid, i, j;
double a[N][N];
/* Fork a team of threads with explicit variable scoping */
#pragma omp parallel shared(nthreads) private(i,j,tid,a)
{
/* Obtain/print thread info */
tid = omp_get_thread_num();
if (tid == 0)
{
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);

}
printf("Thread %d starting...\n", tid);
/* Each thread works on its own private copy of the array */
for (i=0; i<N; i++)
for (j=0; j<N; j++)
a[i][j] = tid + i + j;
/* For confirmation */
printf("Thread %d done. Last element= %f\n",tid,a[N-1][N-1]);
} /* All threads join master thread and disband */
}
If you run the program, you can see that it causes a segmentation fault. The OpenMP thread
stack size is an implementation dependent resource. In this case, the array is too large to fit
into the thread stack space and causes the segmentation fault. On Linux you have to set the
KMP_STACKSIZE environment variable to the value 20000000.
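For example, before running the program again you could set (bash syntax; with csh use
setenv; KMP_STACKSIZE applies to the Intel OpenMP runtime, while for gcc/libgomp the
corresponding variable is OMP_STACKSIZE or GOMP_STACKSIZE):
$ export KMP_STACKSIZE=20000000
$ ./omp_bug4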
/******************************************************************************
 * FILE: omp_bug5.c
 * DESCRIPTION:
 *   Using SECTIONS, two threads initialize their own array and then add
 *   it to the other's array; however a deadlock occurs.
 * AUTHOR: Blaise Barney 01/29/04
 * LAST REVISED: 04/06/05
 ******************************************************************************/
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define N 1000000
#define PI 3.1415926535
#define DELTA .01415926535
int main (int argc, char *argv[]) {
int nthreads, tid, i;
float a[N], b[N];
omp_lock_t locka, lockb;
/* Initialize the locks */
omp_init_lock(&locka);
omp_init_lock(&lockb);
/* Fork a team of threads giving them their own copies of variables */
#pragma omp parallel shared(a, b, nthreads, locka, lockb) private(tid)
{
/* Obtain thread number and number of threads */
tid = omp_get_thread_num();
#pragma omp master
{
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
printf("Thread %d starting...\n", tid);
#pragma omp barrier

#pragma omp sections nowait


{
#pragma omp section
{
printf("Thread %d initializing a[]\n",tid);
omp_set_lock(&locka);
for (i=0; i<N; i++)
a[i] = i * DELTA;
omp_set_lock(&lockb);
printf("Thread %d adding a[] to b[]\n",tid);
for (i=0; i<N; i++)
b[i] += a[i];
omp_unset_lock(&lockb);
omp_unset_lock(&locka);
}
#pragma omp section
{
printf("Thread %d initializing b[]\n",tid);
omp_set_lock(&lockb);
for (i=0; i<N; i++)
b[i] = i * PI;
omp_set_lock(&locka);
printf("Thread %d adding b[] to a[]\n",tid);
for (i=0; i<N; i++)
a[i] += b[i];
omp_unset_lock(&locka);
omp_unset_lock(&lockb);
}
} /* end of sections */
} /* end of parallel region */
}
EXPLANATION:
The problem in omp_bug5 is that the first thread acquires locka and then tries to get lockb
before releasing locka. Meanwhile, the second thread has acquired lockb and then tries to get
locka before releasing lockb. The solution overcomes the deadlock by using the locks correctly.
/******************************************************************************
* FILE: omp_bug5fix.c
* AUTHOR: Blaise Barney 01/29/04
* LAST REVISED: 04/06/05
******************************************************************************/
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define N 1000000
#define PI 3.1415926535
#define DELTA .01415926535
int main (int argc, char *argv[])

{
int nthreads, tid, i;
float a[N], b[N];
omp_lock_t locka, lockb;
/* Initialize the locks */
omp_init_lock(&locka);
omp_init_lock(&lockb);
/* Fork a team of threads giving them their own copies of variables */
#pragma omp parallel shared(a, b, nthreads, locka, lockb) private(tid)
{
/* Obtain thread number and number of threads */
tid = omp_get_thread_num();
#pragma omp master
{
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
printf("Thread %d starting...\n", tid);
#pragma omp barrier
#pragma omp sections nowait
{
#pragma omp section
{
printf("Thread %d initializing a[]\n",tid);
omp_set_lock(&locka);
for (i=0; i<N; i++)
a[i] = i * DELTA;
omp_unset_lock(&locka);
omp_set_lock(&lockb);
printf("Thread %d adding a[] to b[]\n",tid);
for (i=0; i<N; i++)
b[i] += a[i];
omp_unset_lock(&lockb);
}
#pragma omp section
{
printf("Thread %d initializing b[]\n",tid);
omp_set_lock(&lockb);
for (i=0; i<N; i++)
b[i] = i * PI;
omp_unset_lock(&lockb);
omp_set_lock(&locka);
printf("Thread %d adding b[] to a[]\n",tid);
for (i=0; i<N; i++)
a[i] += b[i];
omp_unset_lock(&locka);
}
} /* end of sections */
} /* end of parallel region */

}
6.3.9 OpenMP Debugging - FORTRAN

C******************************************************************************
C FILE: omp_bug1.f
C DESCRIPTION:
C   This example attempts to show use of the PARALLEL DO construct. However
C   it will generate errors at compile time. Try to determine what is causing
C   the error. See omp_bug1fix.f for a corrected version.
C AUTHOR: Blaise Barney 5/99
C LAST REVISED:
C******************************************************************************
      PROGRAM WORKSHARE3
      INTEGER TID, OMP_GET_THREAD_NUM, N, I, CHUNKSIZE, CHUNK
      PARAMETER (N=50)
      PARAMETER (CHUNKSIZE=5)
      REAL A(N), B(N), C(N)
!     Some initializations
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = A(I)
      ENDDO
      CHUNK = CHUNKSIZE
!$OMP PARALLEL DO SHARED(A,B,C,CHUNK)
!$OMP& PRIVATE(I,TID)
!$OMP& SCHEDULE(STATIC,CHUNK)
      TID = OMP_GET_THREAD_NUM()
      DO I = 1, N
        C(I) = A(I) + B(I)
        PRINT *, 'TID= ',TID,'I= ',I,'C(I)= ',C(I)
      ENDDO
!$OMP END PARALLEL DO
      END
EXPLANATION:
This example illustrates the use of the combined PARALLEL DO directive. It fails
because the loop does not come immediately after the directive. Corrections include removing
all statements between the PARALLEL DO directive and the actual loop. Also, logic is
added to preserve the ability to query the thread id and print it from inside the loop. Notice
the use of the FIRSTPRIVATE clause to initialise the flag.
C******************************************************************************
C FILE: omp_bug1fix.f
C DESCRIPTION:
C   This is a corrected version of the omp_bug1.f example. Corrections
C   include removing all statements between the PARALLEL DO construct and
C   the actual DO loop, and introducing logic to preserve the ability to
C   query a thread's id and print it from inside the DO loop.
C AUTHOR: Blaise Barney 5/99

C LAST REVISED:
C******************************************************************************
      PROGRAM WORKSHARE4
      INTEGER TID, OMP_GET_THREAD_NUM, N, I, CHUNKSIZE, CHUNK
      PARAMETER (N=50)
      PARAMETER (CHUNKSIZE=5)
      REAL A(N), B(N), C(N)
      CHARACTER FIRST_TIME
!     Some initializations
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = A(I)
      ENDDO
      CHUNK = CHUNKSIZE
      FIRST_TIME = 'Y'
!$OMP PARALLEL DO SHARED(A,B,C,CHUNK)
!$OMP& PRIVATE(I,TID)
!$OMP& SCHEDULE(STATIC,CHUNK)
!$OMP& FIRSTPRIVATE(FIRST_TIME)
      DO I = 1, N
        IF (FIRST_TIME .EQ. 'Y') THEN
          TID = OMP_GET_THREAD_NUM()
          FIRST_TIME = 'N'
        ENDIF
        C(I) = A(I) + B(I)
        PRINT *, 'TID= ',TID,'I= ',I,'C(I)= ',C(I)
      ENDDO
!$OMP END PARALLEL DO
      END
C******************************************************************************
C FILE: omp_bug5.f
C DESCRIPTION:
C   Using SECTIONS, two threads initialize their own array and then add
C   it to the other's array; however a deadlock occurs.
C AUTHOR: Blaise Barney 01/09/04
C LAST REVISED:
C******************************************************************************
      PROGRAM BUG5
      INTEGER*8 LOCKA, LOCKB
      INTEGER NTHREADS, TID, I,
     +        OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
      PARAMETER (N=1000000)
      REAL A(N), B(N), PI, DELTA
      PARAMETER (PI=3.1415926535)
      PARAMETER (DELTA=.01415926535)


C     Initialize the locks
      CALL OMP_INIT_LOCK(LOCKA)
      CALL OMP_INIT_LOCK(LOCKB)
C     Fork a team of threads giving them their own copies of variables
!$OMP PARALLEL SHARED(A, B, NTHREADS, LOCKA, LOCKB) PRIVATE(TID)
C     Obtain thread number and number of threads
      TID = OMP_GET_THREAD_NUM()
!$OMP MASTER
      NTHREADS = OMP_GET_NUM_THREADS()
      PRINT *, 'Number of threads = ', NTHREADS
!$OMP END MASTER
      PRINT *, 'Thread', TID, 'starting...'
!$OMP BARRIER
!$OMP SECTIONS
!$OMP SECTION
      PRINT *, 'Thread',TID,' initializing A()'
      CALL OMP_SET_LOCK(LOCKA)
      DO I = 1, N
        A(I) = I * DELTA
      ENDDO
      CALL OMP_SET_LOCK(LOCKB)
      PRINT *, 'Thread',TID,' adding A() to B()'
      DO I = 1, N
        B(I) = B(I) + A(I)
      ENDDO
      CALL OMP_UNSET_LOCK(LOCKB)
      CALL OMP_UNSET_LOCK(LOCKA)
!$OMP SECTION
      PRINT *, 'Thread',TID,' initializing B()'
      CALL OMP_SET_LOCK(LOCKB)
      DO I = 1, N
        B(I) = I * PI
      ENDDO
      CALL OMP_SET_LOCK(LOCKA)
      PRINT *, 'Thread',TID,' adding B() to A()'
      DO I = 1, N
        A(I) = A(I) + B(I)
      ENDDO
      CALL OMP_UNSET_LOCK(LOCKA)
      CALL OMP_UNSET_LOCK(LOCKB)
!$OMP END SECTIONS NOWAIT
      PRINT *, 'Thread',TID,' done.'
!$OMP END PARALLEL
      END

C******************************************************************************
C FILE: omp_bug5fix.f
C DESCRIPTION:
C   The problem in omp_bug5.f is that the first thread acquires locka and then
C   tries to get lockb before releasing locka. Meanwhile, the second thread
C   has acquired lockb and then tries to get locka before releasing lockb.
C   This solution overcomes the deadlock by using locks correctly.
C AUTHOR: Blaise Barney 01/09/04
C LAST REVISED:
C******************************************************************************
      PROGRAM BUG5
      INTEGER*8 LOCKA, LOCKB
      INTEGER NTHREADS, TID, I, OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
      PARAMETER (N=1000000)
      REAL A(N), B(N), PI, DELTA
      PARAMETER (PI=3.1415926535)
      PARAMETER (DELTA=.01415926535)
C     Initialize the locks
      CALL OMP_INIT_LOCK(LOCKA)
      CALL OMP_INIT_LOCK(LOCKB)
C     Fork a team of threads giving them their own copies of variables
!$OMP PARALLEL SHARED(A, B, NTHREADS, LOCKA, LOCKB) PRIVATE(TID)
C     Obtain thread number and number of threads
      TID = OMP_GET_THREAD_NUM()
!$OMP MASTER
      NTHREADS = OMP_GET_NUM_THREADS()
      PRINT *, 'Number of threads = ', NTHREADS
!$OMP END MASTER
      PRINT *, 'Thread', TID, 'starting...'
!$OMP BARRIER
!$OMP SECTIONS
!$OMP SECTION
      PRINT *, 'Thread',TID,' initializing A()'
      CALL OMP_SET_LOCK(LOCKA)
      DO I = 1, N
        A(I) = I * DELTA
      ENDDO
      CALL OMP_UNSET_LOCK(LOCKA)
      CALL OMP_SET_LOCK(LOCKB)
      PRINT *, 'Thread',TID,' adding A() to B()'
      DO I = 1, N
        B(I) = B(I) + A(I)
      ENDDO
      CALL OMP_UNSET_LOCK(LOCKB)
!$OMP SECTION
      PRINT *, 'Thread',TID,' initializing B()'
      CALL OMP_SET_LOCK(LOCKB)
      DO I = 1, N
        B(I) = I * PI
      ENDDO
      CALL OMP_UNSET_LOCK(LOCKB)
      CALL OMP_SET_LOCK(LOCKA)
      PRINT *, 'Thread',TID,' adding B() to A()'
      DO I = 1, N
        A(I) = A(I) + B(I)
      ENDDO
      CALL OMP_UNSET_LOCK(LOCKA)
!$OMP END SECTIONS NOWAIT
      PRINT *, 'Thread',TID,' done.'
!$OMP END PARALLEL
      END

6.4 Debuggers

6.4.1 Sun Studio Integrated Debugger

Sun Studio includes a debugger for serial and multithreaded programs. You will find more
information on how to use this environment online. You may start debugging your program
from Run - Debug executable. There are a series of actions that are available from the Run
menu but which may also be specified from the command line when starting Sun Studio. In order
to start a debugging session, you can attach to a running program by:
$ sunstudio -A pid[:program_name]
or from Run - Attach Debugger. To analyse a core dump, use:
$ sunstudio -C core[:program_name]
or Run - Debug core file.
6.4.2 TotalView

TotalView is a sophisticated software debugger product of TotalView Technologies that has
been selected as the Department of Energy's Advanced Simulation and Computing program's
debugger. It is used for debugging and analyzing both serial and parallel programs and it is
especially designed for use with complex, multi-process and/or multi-threaded applications.
It is designed to handle most types of HPC parallel coding and it is supported on most
HPC platforms. It provides both a GUI and a command line interface and it can be used to
debug programs, running processes, and core files, including also memory debugging features.
It provides graphical visualization of array data and includes a comprehensive built-in help
system.
Supported Platforms and Languages
Supported languages include the usual HPC application languages:
o C/C++
o Fortran77/90
o Mixed C/C++ and Fortran
o Assembler
Compiling Your Program
-g:
Like many UNIX debuggers, you will need to compile your program with the appropriate
ag to enable generation of symbolic debug information. For most compilers, the -g option is
used for this. TotalView will allow you to debug executables which were not compiled with
the -g option. However, only the assembler code can be viewed.
Beyond -g:
61

Dont compile your program with optimization ags while you are debugging it. Compiler
optimizations can rewrite your program and produce machine code that doesnt necessarily
match your source code. Parallel programs may require additional compiler ags.
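For example, a hypothetical MPI code could be prepared for a TotalView session like this (the file name is only an illustration):
$ mpicc -g -O0 -o myprog myprog.c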
Overview
TotalView is a full-featured, source-level, graphical debugger for C, C++, and Fortran (77
and 90), assembler, and mixed source/assembler codes based on the X Window System from
Etnus. TotalView supports MPI, PVM and HPF. Information on TotalView is available in
the release notes and user guide at the Etnus Online Documentation page. Also see man
totalview for command syntax and options. Note: In order to use TotalView, you must be
using a terminal or workstation capable of displaying X Windows. See Using the X Window
System for more information.
TotalView on Linux Clusters
TotalView is available on NCSAs Linux Clusters. On Abe there is a 384 token TotalView
license and you only checkout the number of licenses you need . We do not currently have a
way to guarantee you will get a license when your job starts if you run in batch. GNU and
Intel compilers are both supported.
Important: For both compilers you need to compile and link your code with -g to enable
source code listing within TotalView. TotalView is also supported on Abe.
Starting TotalView on the cluster:
To use TotalView over a cluster there are a few steps to follow. It is very important to
have the $HOME directory shared over the network between nodes.
11 steps that show you how to run TotalView to debug your program:
1. Download an NX client on your local station (more details about how you need to
configure it can be found in the Environment chapter, File Management subsection of the
present Clusterguide);
2. Connect with the NX client to the cluster;
3. Write or upload your document in the home folder;
4. Compile it using the proper compiler and add -g to the compile flags. It produces
debugging information in the system's native format.
gcc -fopenmp -O3 -g -o app_lab4_gcc openmp_stack_quicksort.c
5. Find the value for $DISPLAY using the command:
$ echo $DISPLAY
You also need to find out the port the X11 connection is forwarded on. For example, if
DISPLAY is host:14.0 the connection port will be 6014.
setenv DISPLAY fep.grid.pub.ro:14.0
6. Type the command xhost +, in order to disable the access control
$ xhost +
Now access control is disabled and clients can connect from any host (X11 connections are
allowed).
7. Currently, the only available queue is ibm-quad.q and the version of TotalView is
totalview-8.6.2-2. To find out what queues are available, use the command:
qconf -sql
8. Considering the ibm-quad.q queue is available, run the following command:
qsub -q ibm-quad.q -cwd


9. After running the command above you have to write the following lines (:1000.0 can be
replaced by the value that you obtained after typing the command echo $DISPLAY):
module load debuggers/totalview-8.4.1-7
setenv DISPLAY fep.grid.pub.ro:1000.0
totalview
10. After that, press Ctrl+D in order to submit your request to the queue.
When the xterm window appears, we will launch TotalView. If the window does not
appear, check the job output. This should give you some clue as to why your script failed;
maybe you misspelled a command or maybe the port for the X11 forwarding is blocked by
the firewall. You should check these two things first.
11. Open /opt/totalview/toolworks/totalview.8.6.2-2/bin and run the totalview binary.

Now you should be able to see the graphical interface of TotalView. Here are some pictures
to help you interact with it. Next you have to select the executable file corresponding to your
program, which has been previously compiled using the -g option, from the appropriate
folder placed on the cluster (the complete path to your home folder).

Next you need to select the parallel environment you want to use.

The next step is to set the number of tasks you want to run:

Here is an example of how to debug your source. Try to use the facilities and options
offered by TotalView, combined with the examples shown in the tutorials below.
Some helpful links to help you with TotalView:
TotalView Tutorial
TotalView Exercise

TotalView Command Line Interpreter

The TotalView Command Line Interpreter (CLI) provides a command line debugger interface. It can be launched either stand-alone or via the TotalView GUI debugger.
The CLI consists of two primary components:
o The CLI commands
o A Tcl interpreter
Because the CLI includes a Tcl interpreter, CLI commands can be integrated into user-written Tcl programs/scripts for automated debugging. Of course, putting the CLI to real
use in this manner will require some expertise in Tcl.
Most often, the TotalView GUI is the method of choice for debugging. However, the CLI
may be the method of choice in those circumstances where using the GUI is impractical:
o When a program takes several days to execute.
o When the program must be run under a batch scheduling system or network conditions
that inhibit GUI interaction.
o When network traffic between the executing program and the person debugging is not
permitted or limits the use of the GUI.
See the TotalView documentation located at the TotalView Official Site for details:
o TotalView User Guide - relevant chapters
o TotalView Reference Guide - complete coverage of all CLI commands, variables and
usage.
Starting an Interactive CLI Debug Session:
Method 1: From within the TotalView GUI:
1. Use either path:
Process Window > Tools Menu > Command Line
Root Window > Tools Menu > Command Line
2. A TotalView CLI xterm window (below) will then open for you to enter CLI commands.
3. Load/Start your executable or attach to a running process
4. Issue CLI commands
Method 2: From a shell prompt window:
1. Invoke the totalviewcli command (provided that it is in your path).
2. Load/Start your executable or attach to a running process
3. Issue CLI commands
CLI Commands:
As of TotalView version 8, there are approximately 75 CLI commands. These are covered
completely in the TotalView Reference Guide.
Some representative CLI commands are shown in the subsequent table.

Environment Commands
alias      creates or views user-defined commands
capture    allows commands that print information to send their output to a string variable
dgroups    manipulates and manages groups
dset       changes or views values of CLI state variables
dunset     restores default settings of CLI state variables
help       displays help information
stty       sets terminal properties
unalias    removes a previously defined command
dworker    adds or removes a thread from a workers group

CLI Initialization and Termination
dattach    attaches to one/more processes executing in the normal run-time environment
ddetach    detaches from processes
dkill      kills existing user process, leaving debugging information in place
dload      loads debugging information about the target program & prepares it for execution
dreload    reloads the current executable
drerun     restarts a process
drun       starts or restarts the execution of users' processes under control of the CLI
dstatus    shows current status of processes and threads
quit       exits from the CLI, ending the debugging session

Program Information
dassign    changes the value of a scalar variable
dlist      browses source code relative to a particular file, procedure or line
dmstat     displays memory use information
dprint     evaluates an expression or program variable and displays the resulting value
dptsets    shows status of processes and threads
dwhat      determines what a name refers to
dwhere     prints information about the target thread's stack

Execution Control
dcont      continues execution of processes and waits for them
dfocus     changes the set of processes, threads, or groups upon which a CLI command acts
dgo        resumes execution of processes (without blocking)
dhalt      suspends execution of processes
dhold      holds threads or processes
dnext      executes statements, stepping over subfunctions
dnexti     executes machine instructions, stepping over subfunctions
dout       runs out from the current subroutine
dstep      executes statements, moving into subfunctions if required
dstepi     executes machine instructions, moving into subfunctions if required
dunhold    releases a held process or thread
duntil     runs the process until a target place is reached
dwait      blocks command input until processes stop

Action Points
dactions   views information on action point definitions and their current status
dbarrier   defines a process or thread barrier breakpoint
dbreak     defines a breakpoint
ddelete    deletes an action point
ddisable   temporarily disables an action point
denable    reenables an action point that has been disabled
dwatch     defines a watchpoint

Miscellaneous
dcache     clears the remote library cache
ddown      moves down the call stack
dflush     unwinds stack from suspended computations
dlappend   appends list elements to a TotalView variable
dup        moves up the call stack

7 Parallelization

Parallelization for computers with shared memory (SM) means the automatic distribution
of loop iterations over several processors (autoparallelization), the explicit distribution of work
over the processors by compiler directives (OpenMP) or function calls to threading libraries,
or a combination of those.
Parallelization for computers with distributed memory (DM) is done via the explicit distribution of work and data over the processors and their coordination with the exchange of
messages (Message Passing with MPI). MPI programs run on shared memory computers as
well, whereas OpenMP programs usually do not run on computers with distributed memory.
There are solutions that try to achieve the programming ease of shared memory parallelization on distributed memory clusters. For example, Intel's Cluster OpenMP offers a relatively
easy way to get OpenMP programs running on a cluster.
For large applications the hybrid parallelization approach, a combination of coarse-grained
parallelism with MPI and underlying fine-grained parallelism with OpenMP, might be attractive, in order to use as many processors as efficiently as possible.
Please note that large computing jobs should not be started interactively and that, when
submitting batch jobs, the GridEngine batch system determines the distribution of the
MPI tasks on the machines to a large extent.

7.1 Shared Memory Programming

For shared memory programming, OpenMP http://www.openmp.org is the de facto standard. The OpenMP API is defined for Fortran, C and C++ and consists of compiler directives,
runtime routines and environment variables.
In the parallel regions of a program several threads are started. They execute the contained
program segment redundantly, until they hit a worksharing construct. Within this construct,
the contained work (usually do- or for-loops) is distributed among the threads. Under normal
conditions all threads have access to all data (shared data). But pay attention: if data
accessed by several threads is modified, then the access to this data must be protected with
critical regions or OpenMP locks. Also, private data areas can be used, where the individual
threads hold their local data. Such private data (in OpenMP terminology) is only visible to
the thread owning it. Other threads will not be able to read or write private data.
Note: In many cases, the stack area for the slave threads must be increased by changing a
compiler specific environment variable (e.g. Sun Studio: STACKSIZE, Intel: KMP_STACKSIZE), and
the stack area for the master thread must be increased with the command ulimit -s xxx (zsh
shell, specification in KB) or limit s xxx (C-shell, in KB).
Hint: In a loop which is to be parallelized, the results must not depend on the order of
the loop iterations! Try to run the loop backwards in serial mode. The results should be the
same. This is a necessary, but not a sufficient condition! The number of threads has to be
specified by the environment variable OMP_NUM_THREADS.
Note: The OpenMP standard does not specify the value for OMP_NUM_THREADS in case it is
not set explicitly. If OMP_NUM_THREADS is not set, then Sun OpenMP for example starts only 1
thread, as opposed to the Intel compiler which starts as many threads as there are processors
available.
On a loaded system fewer threads may be employed than specified by this environment
variable, because the dynamic mode may be used by default. Use the environment variable
OMP_DYNAMIC to change this behavior. If you want to use nested OpenMP, the environment
variable OMP_NESTED=true has to be set.
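The following minimal C sketch (an illustration, not one of the cluster's example programs) shows a parallel region with shared and private data and a critical region protecting the update of the shared result:

#include <omp.h>
#include <stdio.h>

int main(void) {
    double sum = 0.0;                /* shared data, modified by all threads */
    #pragma omp parallel
    {
        int i;
        double local = 0.0;          /* private data, one copy per thread */
        #pragma omp for
        for (i = 1; i <= 1000; i++)
            local += 1.0 / i;        /* loop iterations are shared among threads */
        #pragma omp critical
        sum += local;                /* protected update of the shared variable */
    }
    printf("sum = %f\n", sum);
    return 0;
}

It can be compiled, for instance, with gcc -fopenmp and run with OMP_NUM_THREADS set as described above.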

The OpenMP compiler options are summarized in the following table. These compiler
flags will be set in the environment variables FLAGS_AUTOPAR and FLAGS_OPENMP (as
explained in section 6.1.1).

Compiler   FLAGS_OPENMP               FLAGS_AUTOPAR
Sun        -xopenmp                   -xautopar -xreduction
Intel      -openmp                    -parallel
GNU        -fopenmp (4.2 and above)   n.a. (planned for 4.3)
PGI        -mp                        -Mconcur -Minline

An example program using OpenMP is
/export/home/stud/username/PRIMER/PROFILE/openmp_only.c.
7.1.1 Automatic Shared Memory Parallelization of Loops

The Sun Fortran, C and C++ compilers are able to parallelize loops automatically. Success
or failure depends on the compiler's ability to prove it is safe to parallelize a (nested) loop.
This is often application area specific (e.g. finite differences versus finite elements), language
(pointers and function calls may make the analysis difficult) and coding style dependent. The
appropriate option is -xautopar, which includes -depend -xO3. Although the -xparallel option
is also available, we do not recommend using it. The option combines automatic and
explicit parallelization, but assumes the older Sun parallel programming model is used instead
of OpenMP. In case one would like to combine automatic parallelization and OpenMP, we
strongly suggest using the -xautopar -xopenmp combination. With the option -xreduction,
automatic parallelization of reductions is also permitted, e.g. accumulations, dot products
etc., whereby the modification of the sequence of the arithmetic operations can cause different
rounding error accumulations. Compiling with the option -xloopinfo supplies information
about the parallelization. The compiler messages are shown on the screen. If the number
of loop iterations is unknown during compile time, then code is produced which decides at
run-time whether a parallel execution of the loop is more efficient or not (alternate coding).
Also with automatic parallelization, the number of threads used can be specified by the
environment variable OMP_NUM_THREADS.
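As an illustration (source file name assumed), a loop-based C program could be built and run with automatic parallelization and per-loop reports like this:

$ cc -xautopar -xreduction -xloopinfo -xO3 -o loops loops.c
$ export OMP_NUM_THREADS=4
$ ./loops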
7.1.2 GNU Compilers

With version 4.2 the GNU compiler collection supports OpenMP with the option -fopenmp.
It supports nesting using the standard OpenMP environment variables. Using the variable
GOMP_STACKSIZE one can also set the default thread stack size (in kilobytes).
CPU binding can be done with the GOMP_CPU_AFFINITY environment variable.
The variable should contain a space- or comma-separated list of CPUs. This list may contain different kinds of entries: either single CPU numbers in any order, a range of CPUs
(M-N) or a range with some stride (M-N:S). CPU numbers are zero based. For example,
GOMP_CPU_AFFINITY="0 3 1-2 4-15:2" will bind the initial thread to CPU 0, the
second to CPU 3, the third to CPU 1, the fourth to CPU 2, the fifth to CPU 4, the sixth
through tenth to CPUs 6, 8, 10, 12 and 14 respectively, and then start assigning back from
the beginning of the list. GOMP_CPU_AFFINITY=0 binds all threads to CPU 0.
Automatic Shared Memory Parallelization of Loops Since version 4.3, GNU compilers are able to parallelize loops automatically using the -ftree-parallelize-loops=[threads]
option. However, the number of threads to use has to be specified at compile time and is thus
fixed at runtime.
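A possible build and CPU binding setup with the GNU toolchain (program name assumed) looks like this:

$ gcc -fopenmp -O2 -o myomp myomp.c
$ export OMP_NUM_THREADS=4
$ export GOMP_STACKSIZE=16384
$ export GOMP_CPU_AFFINITY="0 2 1 3"
$ ./myomp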
7.1.3 Intel Compilers

By adding the option -openmp the OpenMP directives are interpreted by the Intel compilers. Nested OpenMP is supported too.
The slave threads' stack size can be increased with the environment variable KMP_STACKSIZE=<megabytes>M.
Attention: By default the number of threads is set to the number of processors. It is
not recommended to set this variable to larger values than the number of processors available
on the current machine. By default, the environment variables OMP_DYNAMIC and OMP_NESTED
are set to false. Intel compilers provide an easy way for processor binding. Just set the
environment variable KMP_AFFINITY to compact or scatter, e.g.
$ export KMP_AFFINITY=scatter
Compact binds the threads as near as possible, e.g. two threads on different cores of one
processor chip. Scatter binds the threads as far away as possible, e.g. two threads, each on
one core on different processor sockets.
Automatic Shared Memory Parallelization of Loops The Intel Fortran, C and C++
compilers are able to parallelize certain loops automatically. This feature can be turned on
with the option -parallel. The number of threads used is specified by the environment
variable OMP_NUM_THREADS.
Note: using the option -O2 enables automatic inlining which may help the automatic
parallelization, if functions are called within a loop.
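A possible Intel build with OpenMP and processor binding (file name and values assumed) could look like this; for automatic parallelization only, -openmp would be replaced by -parallel:

$ icc -openmp -O2 -o myomp myomp.c
$ export OMP_NUM_THREADS=8
$ export KMP_STACKSIZE=16M
$ export KMP_AFFINITY=compact
$ ./myomp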
7.1.4 PGI Compilers

By adding the option -mp the OpenMP directives, according to the OpenMP version 1
specifications, are interpreted by the PGI compilers.
Explicit parallelization can be combined with the automatic parallelization of the compiler.
Loops within parallel OpenMP regions are no longer subject to automatic parallelization.
Nested parallelization is not supported. The slave threads' stack size can be increased with
the environment variable MPSTKZ=<megabytes>M.
By default OMP_NUM_THREADS is set to 1. It is not recommended to set this variable to a
larger value than the number of processors available on the current machine. The environment
variables OMP_DYNAMIC and OMP_NESTED have no effect!
The PGI compiler offers some support for NUMA architectures like the V40z Opteron
systems with the option -mp=numa. Using NUMA can improve performance of some parallel
applications by reducing memory latency. Linking with -mp=numa also allows you to use the
environment variables MP_BIND, MP_BLIST and MP_SPIN. When MP_BIND is set to yes, parallel
processes or threads are bound to a physical processor. This ensures that the operating system
will not move your process to a different CPU while it is running. Using MP_BLIST, you can
specify exactly which processors to attach your process to. For example, if you have a quad
socket dual core system (8 CPUs), you can set the blist so that the processes are interleaved
across the 4 sockets (MP_BLIST=2,4,6,0,1,3,5,7) or bound to a particular socket (MP_BLIST=6,7).
Threads at a barrier in a parallel region check a semaphore to determine if they can proceed.
If the semaphore is not free after a certain number of tries, the thread gives up the processor
for a while before checking again. The MP_SPIN variable defines the number of times a thread
checks a semaphore before idling. Setting MP_SPIN to -1 tells the thread never to idle. This
can improve performance but can waste CPU cycles that could be used by a different process
if the thread spends a significant amount of time in a barrier.

Automatic Shared Memory Parallelization of Loops The PGI Fortran, C and C++
compilers are able to parallelize certain loops automatically. This feature can be turned on
with the option -Mconcur. The number of threads used is also specified by the environment
variable OMP_NUM_THREADS.
Note: Using the option -Minline the compiler tries to inline functions. So even loops with
function calls may be parallelized.
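For example (file and CPU numbers assumed), a PGI build combining OpenMP, automatic parallelization and NUMA binding could look like this:

$ pgcc -mp=numa -Mconcur -Minline -O2 -o myomp myomp.c
$ export OMP_NUM_THREADS=4
$ export MP_BIND=yes
$ export MP_BLIST=0,2,4,6
$ ./myomp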

7.2 Message Passing with MPI

MPI (Message-Passing Interface) is the de-facto standard for parallelization on distributed
memory parallel systems. Multiple processes explicitly exchange data and coordinate their
work flow. MPI specifies the interface but not the implementation. Therefore, there are
plenty of implementations for PCs as well as for supercomputers. There are freely available
implementations and commercial ones, which are particularly tuned for the target platform.
MPI has a huge number of calls, although it is possible to write meaningful MPI applications
just employing some 10 of these calls.
An example program using MPI is:
/export/home/stud/alascateu/PRIMER/PROFILE/mpi.c.
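For reference, a minimal self-contained MPI program (an illustration, independent of the example above) needs only a handful of those calls:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* id of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();                         /* shut the MPI runtime down */
    return 0;
}

It would be compiled with mpicc and started with mpirun, e.g. $ mpicc hello.c -o hello and $ mpirun -np 4 ./hello.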
7.2.1 OpenMPI

The Open MPI Project (www.openmpi.org) is an open source MPI-2 implementation that
is developed and maintained by a consortium of academic, research, and industry partners.
Open MPI is therefore able to combine the expertise, technologies, and resources from all
across the High Performance Computing community in order to build the best MPI library
available. Open MPI offers advantages for system and software vendors, application developers
and computer science researchers.
The compiler drivers are mpicc for C, mpif77 and mpif90 for FORTRAN, mpicxx and
mpiCC for C++.
mpirun is used to start an MPI program. Refer to the manual page for a detailed description of mpirun ($ man mpirun).
We have several Open MPI implementations. To use the one suitable for your programs,
you must load the appropriate module (remember to also load the corresponding compiler
module). For example, if you want to use the PGI implementation you should type the
following:
$ module list
Currently Loaded Modulefiles:
  1) batch-system/sge-6.2u3             4) switcher/1.0.13
  2) compilers/sunstudio12.1            5) oscar-modules/1.0.5
  3) mpi/openmpi-1.3.2_sunstudio12.1
$ module avail
[...]
-------------------- /opt/modules/modulefiles --------------------
apps/hrm                    debuggers/totalview-8.6.2-2
apps/matlab                 grid/gLite-UI-3.1.31-Prod
apps/uso09                  java/jdk1.6.0_13-32bit
batch-system/sge-6.2u3      java/jdk1.6.0_13-64bit
cell/cell-sdk-3.1           mpi/openmpi-1.3.2_gcc-4.1.2
compilers/gcc-4.1.2         mpi/openmpi-1.3.2_gcc-4.4.0
compilers/gcc-4.4.0         mpi/openmpi-1.3.2_intel-11.0_081
compilers/intel-11.0_081    mpi/openmpi-1.3.2_pgi-7.0.7
compilers/pgi-7.0.7         mpi/openmpi-1.3.2_sunstudio12.1
compilers/sunstudio12.1     oscar-modules/1.0.5(default)

Load the PGI implementation of MPI:
$ module switch mpi/openmpi-1.3.2_pgi-7.0.7
Load the PGI compiler:
$ module switch compilers/pgi-7.0.7
Now if you type mpicc you'll see that the wrapper calls pgcc.
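If in doubt, the Open MPI compiler wrappers can also report the underlying compiler and flags they invoke:

$ mpicc --showme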
7.2.2 Intel MPI Implementation

Intel-MPI is a commercial implementation based on mpich2, which is a public domain
implementation of the MPI 2 standard provided by the Mathematics and Computer Science
Division of the Argonne National Laboratory.
The compiler drivers mpifc, mpiifort, mpiicc, mpiicpc, mpicc and mpicxx and the
instruction for starting an MPI application, mpiexec, will be included in the search path.
There are two different sets of compiler drivers: mpiifort, mpiicc and mpiicpc are the
compiler drivers for the Intel compilers. mpifc, mpicc and mpicxx are the compiler drivers for
GCC (GNU Compiler Collection).
To use the Intel implementation you must load the appropriate modules just like in the
PGI example in the OpenMPI section.
Examples:
$ mpiifort -c ... *.f90
$ mpiicc -o a.out *.o
$ mpirun -np 4 a.out
$ ifort -I$MPI_INCLUDE -c prog.f90
$ mpirun -np 4 a.out

7.3 Hybrid Parallelization

The combination of MPI and OpenMP and/or autoparallelization is called hybrid parallelization. Each MPI process may be multi-threaded. In order to use hybrid parallelization the
MPI library has to support it. There are 4 stages of possible support:
1. single - multi-threading is not supported.
2. funneled - only the main thread, which initializes MPI, is allowed to make MPI calls.
3. serialized - only one thread may call the MPI library at a time.
4. multiple - multiple threads may call MPI, without restrictions.
You can use the MPI_Init_thread function to query the multi-threading support of the MPI
implementation.
A quick example of a hybrid program is
/export/home/stud/alascateu/PRIMER/PROFILE/hybrid.c.
It is a standard Laplace equation program, with MPI support, in which a simple OpenMP
matrix multiply program was inserted. Thus, every process distributed over the cluster will
spawn multiple threads that will multiply some random matrices. The matrix dimensions
were augmented so the program would run a sufficient time to collect experiment data with
the Sun Analyzer presented in the Performance / Runtime Analysis Tools section. To
run the program (C environment in the example) compile it as an MPI program but with
OpenMP support:
$ mpicc -fopenmp hybrid.c -o hybrid
Run with (due to the Laplace layout you need 4 processors):
$ mpirun -np 4 hybrid
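A minimal hybrid sketch (an illustration, not the Laplace example above) that queries the threading support level and spawns OpenMP threads inside each MPI process could look like this:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, provided;
    /* request "funneled" support: only the main thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    #pragma omp parallel
    printf("rank %d, thread %d of %d (provided level %d)\n",
           rank, omp_get_thread_num(), omp_get_num_threads(), provided);
    MPI_Finalize();
    return 0;
}

It is built and run the same way as the hybrid example above ($ mpicc -fopenmp ... and $ mpirun -np ...).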
7.3.1 Hybrid Parallelization with Intel-MPI

Unfortunately, Intel-MPI is not thread safe by default. Calls to the MPI library should not
be made inside of parallel regions if the thread safe version of the library is not linked to the program. To provide full
MPI support inside parallel regions the program must be linked with the option -mt_mpi.
Note: If you specify either the -openmp or the -parallel option of the Intel C Compiler,
the thread safe version of the library is used.
Note: If you specify one of the following options for the Intel Fortran Compiler, the thread
safe version of the library is used:
1. -openmp
2. -parallel
3. -threads
4. -reentrancy
5. -reentrancy threaded
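For instance (file names assumed), a hybrid code built with the Intel tools might be linked like this; the -mt_mpi flag is only required when none of the options above is used:

$ mpiicc -openmp hybrid.c -o hybrid
$ mpiicc -mt_mpi threaded.c -o threaded -lpthread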

8 Performance / Runtime Analysis Tools

This chapter describes tools that are available to help you assess the performance of your
code, identify potential performance problems, and locate the part of the code where most of
the execution time is spent. It also covers the installation and running of the Intel MPI benchmark.

8.1 Sun Sampling Collector and Performance Analyzer

The Sun Sampling Collector and the Performance Analyzer are a pair of tools that you
can use to collect and analyze performance data for your serial or parallel application. The
Collector gathers performance data by sampling at regular time intervals and by tracing
function calls. The performance information is gathered in so-called experiment files, which
can then be displayed with the analyzer GUI or the er_print command line utility after the program
has finished. Since the collector is part of the Sun compiler suite, the studio compiler module
has to be loaded. However, programs to be analyzed do not have to be compiled with the Sun
compiler; the GNU or Intel compilers, for example, work as well.
8.1.1 Collecting experiment data

The first step in profiling with the Sun Analyzer is to obtain experiment data. For this
you must compile your code with the -g option. After that you can either run collect like
this
$ collect a.out
or use the GUI.
To use the GUI to collect experiment data, start the analyzer (X11 forwarding must be
enabled - $ analyzer), go to Collect Experiment under the File menu and select the Target,
Working Directory and add Arguments if you need to. Click on Preview Command to
view the command for collecting experiment data only. Now you can submit the command to
a queue. Some examples of scripts used to submit the command (the path to collect might be
different):
$ cat script.sh
#!/bin/bash
qsub -q [queue] -pe [pe] [np] -cwd -b y \
"/opt/sun/sunstudio12.1/prod/bin/collect -p high -M CT8.1 -S on -A on -L none \
mpirun -np 4 -- /path/to/file/test"
$ cat script_OMP_ONLY.sh
#!/bin/bash
qsub -q [queue] -pe [pe] [np] -v OMP_NUM_THREADS=8 -cwd -b y \
"/opt/sun/sunstudio12.1/prod/bin/collect -L none \
-p high -S on -A on /path/to/file/test"

$ cat scriptOMP.sh
#!/bin/bash
qsub -q [queue] -pe [pe] [np] -v OMP_NUM_THREADS=8 -cwd -b y \
"/opt/sun/sunstudio12.1/prod/bin/collect -p high -M CT8.1 -S on -A on -L none \
mpirun -np 4 -- /path/to/file/test"

The first one uses MPI tracing for testing MPI programs, the second one is intended for
OpenMP programs (that's why it sets the OMP_NUM_THREADS variable) and the last one
is for hybrid programs (they use both MPI and OpenMP). Some of the parameters used are
explained in the following. You can find more information in the manual ($ man collect).
-M CT8.1; Specify collection of an MPI experiment. CT8.1 is the MPI version installed.
-L size; Limit the amount of profiling and tracing data recorded to size megabytes. None
means no limit.
-S interval; Collect periodic samples at the interval specified (in seconds). on defaults
to 1 second.
-A option; Control whether or not load objects used by the target process should be
archived or copied into the recorded experiment. on archives load objects into the experiment.
-p option; Collect clock-based profiling data. high turns on clock-based profiling with
the default profiling interval of approximately 1 millisecond.
8.1.2 Viewing the experiment results

To view the results, open the analyzer and go to File - Open Experiment and select the
experiment you want to view. A very good tutorial for analyzing the data can be found here.
Performance Analyzer MPI Tutorial is a good place to start.
The following screenshots were taken from the analysis of the programs presented in the
Parallelization section under Hybrid Parallelization.

MPI only version

Hybrid (MPI + OpenMP) version

8.2 Intel MPI benchmark

The Intel MPI benchmark - IMB - is a tool for evaluating the performance of an MPI
installation. The idea of IMB is to provide a concise set of elementary MPI benchmark
kernels. With one executable, all of the supported benchmarks, or a subset specified by the
command line, can be run. The rules, such as time measurement (including a repetitive call of
the kernels for better clock synchronization), message lengths, and selection of communicators to
run a particular benchmark (inside the group of all started processes), are program parameters.
8.2.1 Installing and running IMB

The first step is to get the package from here. Unpack the archive. Make sure you have
the Intel compiler module loaded and a working OpenMPI installation. Go to the /imb/src
directory. There are three benchmarks available: IMB-MPI1, IMB-IO, IMB-EXT. You can
build them separately with:
$ make <benchmark name>
or all at once with:
$ make all
Now you can run any of the three benchmarks using:
$ mpirun -np <nr_of_procs> IMB-xxx
NOTE: there are useful documents in the /imb/doc directory detailing the benchmarks.

8.2.2 Submitting a benchmark to a queue

You can also submit a benchmark to run on a queue. The following two scripts are
examples:
$ cat submit.sh
#!/bin/bash
qsub -q [queue] -pe [pe] [np] -cwd -b n [script_with_the_run_cmd]
$ cat script.sh
#!/bin/bash
mpirun -np [np] IMB-xxx
To submit you just have to run:
$ ./submit.sh
After running the IMB-MPI1 benchmark on a queue with 24 processes the following result
was obtained (only parts are shown):
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.2, MPI-1 part
#---------------------------------------------------
# Date                  : Thu Jul 23 16:37:23 2009
# Machine               : x86_64
# System                : Linux
# Release               : 2.6.18-128.1.1.el5
# Version               : #1 SMP Tue Feb 10 11:36:29 EST 2009
# MPI Version           : 2.1
# MPI Thread Environment: MPI_THREAD_SINGLE
#
# Calling sequence was:
#
# IMB-MPI1
#
# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
# List of Benchmarks to run:
#
# PingPong
# PingPing
# Sendrecv
# Exchange
# Allreduce
# Reduce
# Reduce_scatter
# Allgather
# Allgatherv
# Gather
# Gatherv
# Scatter
# Scatterv
# Alltoall
# Alltoallv
# Bcast
# Barrier
[...]
#-----------------------------------------------------------------------------
# Benchmarking Exchange
# #processes = 2
# ( 22 additional processes waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]  Mbytes/sec
   524288           80      1599.30      1599.31      1599.31     1250.54
  1048576           40      3743.45      3743.48      3743.46     1068.53
  2097152           20      7290.26      7290.30      7290.28     1097.35
  4194304           10     15406.39     15406.70     15406.55     1038.51
[...]
#-----------------------------------------------------------------------------
# Benchmarking Exchange
# #processes = 24
#-----------------------------------------------------------------------------
   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]  Mbytes/sec
        0         1000        75.89        76.31        76.07        0.00
        1         1000        67.73        68.26        68.00        0.06
        2         1000        68.47        69.29        68.90        0.11
        4         1000        69.23        69.88        69.57        0.22
        8         1000        68.20        68.91        68.55        0.44
   262144          160     19272.77     20713.69     20165.05       48.28
   524288           80     63144.46     65858.79     63997.79       30.37
  1048576           40     83868.32     89965.37     87337.56       44.46
  2097152           20     91448.50    106147.55     99928.08       75.37
  4194304           10    121632.81    192385.91    161055.82       83.17
[...]
#----------------------------------------------------------------
# Benchmarking Alltoallv
# #processes = 8
# ( 16 additional processes waiting in MPI_Barrier)
#----------------------------------------------------------------
   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
        0         1000         0.10         0.10         0.10
        1         1000        18.49        18.50        18.49
        2         1000        18.50        18.52        18.51
        4         1000        18.47        18.48        18.47
        8         1000        18.40        18.40        18.40
       16         1000        18.42        18.43        18.43
       32         1000        18.89        18.90        18.89
       68         1000       601.29       601.36       601.33
    65536          640      1284.44      1284.71      1284.57
   131072          320      3936.76      3937.16      3937.01
   262144          160     10745.08     10746.09     10745.83
   524288           80     22101.26     22103.33     22102.58
  1048576           40     44044.33     44056.68     44052.76
  2097152           20     88028.00     88041.70     88037.15
  4194304           10    175437.78    175766.59    175671.63

[...]
#----------------------------------------------------------------
# Benchmarking Alltoallv
# #processes = 24
#----------------------------------------------------------------
   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
        0         1000         0.18         0.22         0.18
        1         1000       891.94       892.74       892.58
        2         1000       891.63       892.46       892.28
        4         1000       879.25       880.09       879.94
        8         1000       898.30       899.29       899.05
   262144           15    923459.34    950393.47    938204.26
   524288           10   1176375.79   1248858.31   1207359.81
  1048576            6   1787152.85   1906829.99   1858522.38
  2097152            4   3093715.25   3312132.72   3203840.16
  4194304            2   5398282.53   5869063.97   5702468.73
As you can see, if you specify 24 processes then the benchmark will also run the 2, 4, 8 and 16
process tests. You can fix the minimum number of processes to use with:
$ mpirun [...] <benchmark> -npmin <minimum_number_of_procs>
NOTE: Other useful control commands can be found in the User's Guide (/doc directory)
under section 5.

8.3 Paraver and Extrae

8.3.1 Local deployment - Installing

PAPI 4.2.0
You should use a kernel with version >= 2.6.33. For versions <= 2.6.32 perfctr patches
are required in the kernel, or you have to specify --with-perf-events to the configure script if
you have support for perf events in your kernel; I used 2.6.35.30 on 64 bits. From the root
folder of the svn repo:
cd papi-4.2.0/src
./configure --prefix=/usr/local/share/papi/
make
sudo make install-all
PAPI 4.2.0 can also be downloaded from here: http://icl.cs.utk.edu/papi/software/
OpenMPI 1.4.4
From the root folder of the svn repo:

cd openmpi-1.4.4/
./configure --prefix /usr/local/share/openmpi
sudo make all install
OpenMPI 1.4.4 can also be downloaded from here:
http://www.open-mpi.org/software/ompi/v1.4/
Extrae 2.2.0
From the root folder of the svn repo:
cd extrae-2.2.0/
./configure --with-mpi=/usr/local/share/openmpi \
--with-papi=/usr/local/share/papi \
--enable-posix-clock --without-unwind --without-dyninst \
--prefix=/usr/local/share/extrae
make
sudo make install
Extrae 2.2.0 can also be downloaded from here:
http://www.bsc.es/ssl/apps/performanceTools/
Obtaining traces
From the root folder of the svn repo:
cd acoustic_with_extrae/
Open extrae.xml. We have a few absolute paths that we need to change
in this file so that tracing will work correctly. Search for vlad.
There should be 3 occurrences in the file. Modify the paths you
find with vlad by replacing /home/vlad/dimemas_paraver_svn with
the path to your local copy of the svn repo.
make
./run_ldpreload.sh 3 (3 is the number of MPI processes)
Warning: the acoustic workload generates 950MB of output for this
run. All output files are located in the export folder. Please
make sure you have enough free space.
Extrae produces tracing files for each MPI Process. The files
are located in the trace folder. The trace/set-0 folder will
contain 3 files, one for each MPI process, which are merged
in the final line of the run_ldpreload.sh script. Each .mpits
file has between 20 and 30 MB for this run.
After the script finishes you should find 3 files like this:
EXTRAE_Paraver_trace.prv
EXTRAE_Paraver_trace.row
EXTRAE_Paraver_trace.pcf
We're interested in the .prv file (this contains the tracing info).

8.3.2 Deployment on NCIT cluster

8.3.3 Installing

Extrae, PAPI and OpenMPI are already installed on the cluster. The PAPI version is 4.2.0.
The Extrae version is 2.2.0. Extrae is configured to work with OpenMPI version 1.5.3 for gcc 4.4.0.
The module avail command will offer more information about each module.
8.3.4 Checking for Extrae installation

At the moment when this documentation was written Extrae was not installed on all the
nodes of the Opteron queue. We have created a script that checks on which nodes Extrae is
installed. Copy the check_extrae_install folder to fep.grid.pub.ro and then:
./qsub.sh
This will submit a job on each of the 14 nodes of the Opteron
queue which will check for the Extrae installation.
After all the jobs finish:
cat check_extrae.sh.o*
The output should be something like this:
...
opteron-wn10.grid.pub.ro 1
opteron-wn11.grid.pub.ro 0
opteron-wn12.grid.pub.ro 0
opteron-wn13.grid.pub.ro 1
...
When obtaining traces on the NCIT cluster we chose to use a single node. After Extrae is
installed on all the Opteron nodes this validation step will become unnecessary. Please don't
leave any jobs stuck in the queue in the waiting state.
Obtaining traces
Copy the acoustic_with_extrae_fep folder in your home on fep.grid.pub.ro.
Change the paths in the extrae.xml file to match the paths in which you wish
to collect the trace information. See section Local Deployment, subsection
Obtaining traces for more information.
. load_modules.sh
This loads the gcc, openmpi and extrae modules
Be sure to use . and not ./
make
./qsub.sh 6 # 6 in this case is the number of MPI Processes. Wait for
the job to finish. As you can see in the qsub.sh script the jobs are
run on a single node: opteron-wn10.grid.pub.ro. Once Extrae is
installed on all the nodes in the Opteron queue you can just send the
job to any node in the queue, not just a subset of nodes.
After the job finishes running:

./merge_mpits.sh
This will merge the .mpits file in a single .prv file which you can
load into Paraver.
8.3.5 Visualization with Paraver

The svn repo contains a 64-bit version of Paraver. If your OS is 32-bit please download an
appropriate version from here: http://www.bsc.es/ssl/apps/performanceTools/
From the root folder of the svn repo:
cd wxparaver64/bin/
export PARAVER_HOME=../
./wxparaver
File --> Load Trace
Load the previously generated .prv file
File --> Load Configuration
Load one of the configurations from intro2paraver_MPI/cfgs/. Double clicking
the pretty colored output will make a window pop up with information
regarding the output (what each color represents). A .doc file is located
in the intro2paraver_MPI folder which explains Paraver usage with the
provided configurations.
A larger number of configurations (258 possible configurations)
exists in the wxparaver64/cfgs folder.
8.3.6 Do it yourself tracing on the NCIT Cluster

In order to trace a new C program you need to take the following files from the acoustic
sample:
extrae.xml
load_modules.sh
Makefile
merge_mpits.sh
qsub.sh
run_ldpreload.sh
and copy them to your source folder. The Makefile and the run_ldpreload.sh script should
be changed accordingly to match your source file hierarchy and the name of the binary. The
changes that need to be made are minor (location of C source files, name of the binary, linking
with extra libraries).
8.3.7 Observations

For some of the events data won't be collected because support is missing in the kernel.
Patches for the perfctr syscall should be added to the kernel to collect hardware counter data.
This won't be a problem on newer kernels (local testing) since with kernels >= 2.6.32 PAPI
uses the perf events infrastructure to collect hardware counter data. More information on the
perfctr patches can be found here:
https://ncit-cluster.grid.pub.ro/trac/HPC2011/wiki/Dimemas.
A more in-depth user guide for Extrae can be found here:
http://www.bsc.es/ssl/apps/performanceTools/files/docs/extrae-userguide.pdf.
It covers a number of aspects in detail. Of interest are customizing the extrae.xml file and
different ways of using Extrae to obtain trace data.
Warning: The XML files we provided log all the events and will generate a lot of output.
The acoustic example we provided has a fairly short run time, but for long running jobs a
significant amount of data will be collected. We recommend setting up the trace files on
LustreFS and not on NFS. Also, we recommend customizing the XML file so that only a
subset of events will be logged. In order to limit the amount of data being logged and the
number of events being handled please consult the Extrae User Guide.
Limitation: The files we provided handle tracing MPI programs and not OpenMP programs
or hybrid MPI - OpenMP programs. Although the example we provided is a hybrid program
we have taken out the OpenMP part by setting the number of OpenMP threads to 1 in
the input file. Future work on this project should also add scripts and makefiles for tracing
OpenMP and hybrid programs.

8.4 Scalasca

Scalasca is a performance toolset that has been specifically designed to analyze parallel
application execution behavior on large-scale systems. It offers an incremental performance
analysis procedure that integrates runtime summaries with in-depth studies of concurrent
behavior via event tracing, adopting a strategy of successively refined measurement configurations. Distinctive features are its ability to identify wait states in applications with very
large numbers of processes and to combine these with efficiently summarized local measurements.
Scalasca is a software tool that supports the performance optimization of parallel programs
by measuring and analyzing their runtime behavior. Scalasca supports two different analysis
modes that rely on either profiling or event tracing. In profiling mode, Scalasca generates aggregate performance metrics for individual function call paths, which are useful to identify the
most resource-intensive parts of the program and to analyze process-local metrics such as those
derived from hardware counters. In tracing mode it records individual performance-relevant
events, allowing the automatic identification of call paths that exhibit wait states.
8.4.1 Installing Scalasca

Before installing Scalasca some prerequisites need to be satisfied:
GNU make
Qt version at least 4.2 (qmake)
Cube3 (performance report visual explorer) - can be downloaded from the same site
(www.scalasca.org)
fortran77 / fortran95 (gfortran will suffice)
After that, you can run the following commands using root privileges:
./configure -prefix=/opt/tools/scalasca-1.3.2
make
make install
./configure -prefix=/opt/tools/cube-3.3.1
make
make install
8.4.2 Running experiments

Running on the cluster:

module load utilities/scalasca-1.4.1-gcc-4.6.3

Insert in the "Makefile" that compiles the test application the following command:
"scalasca -instrument mpicc -O3 -lm myprog.o myprog.c -o myprog".
In the "mprun.sh" script file you have to put in the MODULE section the command "compile
After that, all you have to do is to run mprun.sh. After the end of the execution, data is
collected in a folder called epik_[applicationName]_[numProcesses].
Runtime measurement collection and analysis
The Scalasca measurement collection and analysis nexus accessed through the scalasca -analyze
command integrates the following three steps:
Instrumentation
Runtime measurement and collection of data
Analysis and interpretation
Instrumentation
First of all, in order to run proling experiments using Scalasca, application that use MPI
or OpenMP (or both) must have their code modied before execution. This modication is
done using Scalasca and it consists of inserting some specic measurement calls for important
points (events) of the applications runtime.
All the necessary instrumentation is automatic and the user, OpenMP and MPI functions
are handled by the Scalasca instrumenter, which is called using the scalasca -instrument
command. All the compile and link commands for the modules of the application containing
OpenMP and/or MPI code must be prexed by the scalasca -instrument command (this also
needs to be added in the Makele of the application). An example for the command use is:
scalasca -instrument mpicc myprog.c -o myprog
Although generally more convenient, automatic function instrumentation may result in too
many and/or too disruptive measurements, which can be addressed with selective instrumentation and measurement ltering. On supercomputing systems, users usually have to submit
their jobs to a batch system and are not allowed to start parallel jobs directly. Therefore, the
call to the scalasca command has to be provided within a batch script, which will be scheduled
for execution when the required resources are available. The syntax of the batch script diers
between the dierent scheduling systems. However, common to every batch script format is a
passage where all shell commands can be placed that will be executed.

Runtime measurement and collection of data

This stage follows compilation and instrumentation and it is responsible for managing
the configuration and processing of performance experiments. The tool used in this stage,
referred to by its creators as the Scalasca measurement collection and analysis nexus - SCAN, is
responsible for several features:
- measurement configuration - configures metrics; filters uninteresting functions, methods
and subroutines; supports selective event generation.
- application execution - using the specified application launcher (e.g. mpiexec or mpirun
for MPI implementations of instrumented executables)
- collection of data - stores data for later analysis in a folder named epik_applicationName_numberOfProcesses
This step is done by running the command scalasca -analyze followed by the application
executable launcher (if one is needed - as is the case with MPI) together with its arguments
and flags, the target executable and the target's arguments. An example of the use of this
command is: scalasca -analyze mpirun -np 4 myprog
Post-processing is done the first time that an archive is examined, before launching the
CUBE3 report viewer. If the scalasca -examine command is executed on an already processed experiment archive, or with a CUBE file specified as argument, the viewer is launched
immediately.
An example of using this command is:
scalasca -examine epik_myprog_4x0_sum
Analysis and interpretation
The results of the previous phase are saved, as mentioned, in a folder (by default, as a subfolder of the experiment folder) named epik_applicationName_numberOfProcesses, which is
the report for the previously analyzed experiment. This report needs post-processing before
any results can be visualized and studied, and this is only done the first time the archive
is examined, by using the command scalasca -examine. If the scalasca -examine command
is executed on an already processed experiment archive, or with a CUBE file specified as
argument, the viewer is launched immediately.
A short textual score report can be obtained without launching the viewer: scalasca
-examine -s epik_title. This score report comes from the cube3_score utility and provides a
breakdown of the different types of region included in the measurement and their estimated
associated trace buffer capacity requirements, aggregate trace size (total_tbc) and largest
process trace size (max_tbc), which can be used to specify an appropriate ELG_BUFFER_SIZE
for a subsequent trace measurement. No post-processing is performed in this case, so that
only a subset of Scalasca analyses and metrics may be shown.
Using Cube3
CUBE3 is a generic user interface for presenting and browsing performance and debugging
information from parallel applications. The CUBE3 main window consists of three panels
containing tree displays or alternate graphical views of analysis reports. The left panel shows
performance properties of the execution, the middle pane shows the call-tree or a flat profile of
the application, and the right tree either shows the system hierarchy consisting of machines,
compute nodes, processes, and threads, or a topological view of the application's processes and
threads. All tree nodes are labeled with a metric value and a colored box which can help
identify hotspots. The metric value color is determined from the proportion of the total (root)
value or some other specified reference value.
A click on a performance property or a call path selects the corresponding node. This has
the effect that the metric value held by this node (such as execution time) will be further
broken down into its constituents. That is, after selecting a performance property, the middle
panel shows its distribution across the call tree. After selecting a call path (i.e., a node in
the call tree), the system tree shows the distribution of the performance property in that call
path across the system locations. A click on the icon left of a node in each tree expands or
collapses that node. By expanding or collapsing nodes in each of the three trees, the analysis
results can be viewed on different levels of granularity.
During trace collection, information about the application's execution behavior is recorded
in so-called event streams. The number of events in the streams determines the size of the
buffer required to hold the stream in memory. To minimize the amount of memory required,
and to reduce the time to flush the event buffers to disk, only the most relevant function
calls should be instrumented. When the complete event stream is larger than the memory
buffer, it has to be flushed to disk during application runtime. This flush impacts application
performance, as flushing is not coordinated between processes, and runtime imbalances are
induced into the measurement. The Scalasca measurement system uses a default value of 10
MB per process or thread for the event trace: when this is not adequate it can be adjusted to
minimize or eliminate flushing of the internal buffers. However, if too large a value is specified
for the buffers, the application may be left with insufficient memory to run, or run adversely
with paging to disk. Larger traces also require more disk space (at least temporarily, until
analysis is complete), and are correspondingly slower to write to and read back from disk.
Often it is more appropriate to reduce the size of the trace (e.g., by specifying a shorter
execution, or more selective instrumentation and measurement), than to increase the buffer
size.
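As a sketch (buffer value and program name assumed), the trace buffer could be enlarged through the measurement environment before a trace experiment, for example:

$ export ELG_BUFFER_SIZE=50000000
$ scalasca -analyze -t mpirun -np 4 myprog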
Conclusions
Debugging a parallel application is a difficult task and tools like Scalasca are very useful
whenever the behavior of running applications isn't the one we expected. Our general
approach is to first observe parallel execution behavior on a coarse-grained level and then to
successively refine the measurement focus as new performance knowledge becomes available.
Future enhancements will aim at both further improving the functionality and scalability of
the SCALASCA toolset. Completing support for OpenMP and the missing features of MPI
to eliminate the need for sequential trace analysis is a primary development objective. Using
more flexible measurement control, we are striving to offer more targeted trace collection mechanisms, reducing memory and disk space requirements while retaining the value of trace-based
in-depth analysis. In addition, while the current measurement and trace analysis mechanisms
are already very powerful in terms of the number of application processes they support, we are
working on optimized data management and workflows that will allow us to master even larger
configurations. These might include truly parallel identifier unification, trace analysis without
file I/O, and using parallel I/O to write analysis reports. Since parallel simulations are often
iterative in nature, and individual iterations can differ in their performance characteristics,
another focus of our research is therefore to study the temporal evolution of the performance
behavior as a computation progresses.

9 Application Software and Program Libraries

9.1 Automatically Tuned Linear Algebra Software (ATLAS)

The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building
blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar,
vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and
the Level 3 BLAS perform matrix-matrix operations. Because the BLAS are efficient, portable,
and widely available, they are commonly used in the development of high quality linear algebra
software, LAPACK for example.
The ATLAS (Automatically Tuned Linear Algebra Software) project is an ongoing research
effort that provides C and Fortran77 interfaces to a portably efficient BLAS implementation, as
well as a few routines from LAPACK (http://math-atlas.sourceforge.net/).
9.1.1 Using ATLAS

To initialize the environment use:


module load blas/atlas-9.11 sunstudio12.1 (compiled with gcc)
module load blas/atlas-9.11 sunstudio12.1 (compiled with sun)
To use the level 1-3 functions available in ATLAS, see the function prototypes in cblas.h and use
them in your code. For compiling you should specify the necessary library files.
Example:
gcc -lcblas -latlas example.c
cc -lcblas -latlas example.c
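As an illustrative sketch (not one of the cluster examples), a small program calling the CBLAS matrix-matrix multiply could look like this, compiled exactly as above:

#include <stdio.h>
#include <cblas.h>

int main(void) {
    /* C = 1.0*A*B + 0.0*C for 2x2 row-major matrices */
    double A[4] = {1, 2, 3, 4};
    double B[4] = {5, 6, 7, 8};
    double C[4] = {0, 0, 0, 0};
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);
    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
    return 0;
}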
The version compiled with gcc is recommended. It is almost never a good idea to change
the C compiler used to compile ATLAS's generated double precision (real and complex) code or the C
compiler used to compile ATLAS's generated single precision (real and complex) code, and it is only
very rarely a good idea to change the C compiler used to compile all other double precision
routines or the C compiler used to compile all other single precision routines. For ATLAS
3.8.0, all architectural defaults are set using gcc 4.2 only (the one exception is MIPS/IRIX,
where SGI's compiler is used). In most cases, switching these compilers will get you worse
performance and accuracy, even when you are absolutely sure it is a better compiler and flag
combination!


9.1.2 Performance

9.2 MKL - Intel Math Kernel Library

Intel Math Kernel Library (Intel MKL) is a library of highly optimized, extensively threaded
math routines for science, engineering, and financial applications that require maximum
performance. Core math functions include BLAS, LAPACK, ScaLAPACK, Sparse Solvers,
Fast Fourier Transforms, Vector Math, and more. Offering performance optimizations for
current and next-generation Intel processors, it includes improved integration with Microsoft
Visual Studio, Eclipse, and XCode. Intel MKL allows for full integration of the Intel
Compatibility OpenMP run-time library for greater Windows/Linux cross-platform
compatibility.
9.2.1 Using MKL

To initialize the environment use:


module load blas/mkl-10.2
To compile an example that uses MKL functions with gcc, you should specify the necessary
libraries.
Example:
gcc -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm example.c
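As a hedged illustration (a sketch; the matrix values are arbitrary), the following example.c
multiplies two 2x2 matrices with the Level 3 routine cblas_dgemm through MKL's CBLAS
interface and can be built with the gcc line above:

/* example.c - small DGEMM sketch using MKL's CBLAS interface */
#include <stdio.h>
#include <mkl_cblas.h>

int main(void)
{
    /* C = alpha*A*B + beta*C, with 2x2 row-major matrices */
    double A[4] = {1.0, 2.0, 3.0, 4.0};
    double B[4] = {5.0, 6.0, 7.0, 8.0};
    double C[4] = {0.0, 0.0, 0.0, 0.0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,    /* M, N, K */
                1.0, A, 2,  /* alpha, A, lda */
                B, 2,       /* B, ldb */
                0.0, C, 2); /* beta, C, ldc */

    printf("C = [ %g %g ; %g %g ]\n", C[0], C[1], C[2], C[3]);
    return 0;
}

The expected output is C = [ 19 22 ; 43 50 ].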
9.2.2 Performance

9.3 ATLAS vs MKL - Level 1, 2, 3 functions

The BLAS functions were tested for all 3 levels and the results are shown only for Level 3.
To summarize the performance tests: ATLAS loses for Level 1 BLAS, tends to beat MKL for
Level 2 BLAS, and varies between quite a bit slower and quite a bit faster than MKL for
Level 3 BLAS, depending on problem size and data type.
ATLAS's present Level 1 gets its optimization mainly from the compiler. This gives MKL
two huge advantages: MKL can use the SSE prefetch instructions to speed up pretty much
all Level 1 operations. The second advantage is in how ABS() is done. ABS() *should* be a
1-cycle operation, since you can just mask off the sign bit. However, you cannot portably do
bit operations on floats in ANSI C, so ATLAS has to use an if-type construct instead. This
spells absolute doom for the performance of NRM2, ASUM and AMAX.
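To illustrate the point, here is a small sketch of the two approaches (an illustration of the
general idea only, not ATLAS's actual source code): the portable ANSI C version needs a
comparison, while the branch-free version clears the IEEE 754 sign bit directly.

#include <stdint.h>

/* Portable ANSI C absolute value: needs an if-type construct (a branch). */
static double abs_branch(double x)
{
    return (x < 0.0) ? -x : x;
}

/* Branch-free variant: mask off the sign bit of the IEEE 754 representation.
   Assumes 64-bit IEEE doubles; not strictly portable ANSI C. */
static double abs_mask(double x)
{
    union { double d; uint64_t u; } v;
    v.d = x;
    v.u &= ~((uint64_t)1 << 63); /* clear the sign bit */
    return v.d;
}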
For Level 2 and 3, ATLAS has its usual advantage of leveraging basic kernels to the
maximum. This means that all Level 3 operations follow the performance of GEMM, and
Level 2 operations follow GER or GEMV. MKL has the usual disadvantage of optimizing all
these routines separately, leading to widely varying performance.

9.4 Scilab

Scilab is a programming language associated with a rich collection of numerical algorithms
covering many aspects of scientific computing problems. From the software point of view,
Scilab is an interpreted language. This generally allows faster development, because the user
directly accesses a high-level language with a rich set of features provided by the library. The
Scilab language is meant to be extended, so that user-defined data types can be defined, with
possibly overloaded operations. Scilab users can develop their own modules so that they can
solve their particular problems. The Scilab language allows other languages such as Fortran
and C to be dynamically compiled and linked: this way, external libraries can be used as if
they were a part of Scilab's built-in features.
From the scientific point of view, Scilab comes with many features. At the very beginning of
Scilab, features were focused on linear algebra. But, rapidly, the number of features extended
to cover many areas of scientific computing. The following is a short list of its capabilities:
Linear algebra, sparse matrices,
Polynomials and rational functions,
Interpolation, approximation,
Linear, quadratic and non-linear optimization,
Ordinary Differential Equation solver and Differential Algebraic Equations solver,
Classic and robust control, Linear Matrix Inequality optimization, Differentiable and
non-differentiable optimization,
Signal processing,
Statistics.
Scilab provides many graphics features, including a set of plotting functions, which allow the
creation of 2D and 3D plots as well as user interfaces. The Xcos environment provides a
hybrid dynamic systems modeler and simulator.
9.4.1 Source code and compilation

The source code for Scilab can be found at:


via git protocol: git clone git://git.scilab.org/scilab
via http protocol: git clone http://git.scilab.org/scilab.git
Compiling from source code To build Scilab from the source code, we issued the following
commands:
module load compilers/gcc-4.6.0
module load java/jdk1.6.0_23-64bit
module load blas/atlas-9.11_gcc
export PATH=$PATH:/export/home/ncit-cluster/username/scilab-req/apache-ant-1.8.2/bin
./configure \
--without-gui \
--without-hdf5 \
--disable-build-localisation \
--with-libxml2=/export/home/ncit-cluster/username/scilab-req/libxml2-2.7.8 \
--with-pcre=/export/home/ncit-cluster/username/scilab-req/pcre-8.20 \
--with-lapack-library=/export/home/ncit-cluster/username/scilab-req/lapack-3.4.0 \
--with-umfpack-library=/export/home/ncit-cluster/username/scilab-req/UMFPACK/Lib \
--with-umfpack-include=/export/home/ncit-cluster/username/scilab-req/UMFPACK/Include

9.4.2 Using Scilab

In this section, we make our first steps with Scilab and present some simple tasks we can
perform with the interpreter. There are several ways of using Scilab, and the following
paragraphs present three methods:
using the console in the interactive mode
using the exec function against a file
using batch processing
The console The first way is to use Scilab interactively, by typing commands in the console,
analyzing the results and continuing this process until the final result is computed. This
document is designed so that the Scilab examples which are printed here can be copied into
the console. The goal is that the reader can experiment with Scilab's behavior by himself.
This is indeed a good way of understanding the behavior of the program and, most of the
time, it allows a quick and smooth way of performing the desired computation. In the
following example, the function disp is used in the interactive mode to print out the string
"Hello World!".
-->s=" Hello World !"
s =
Hello World !
-->disp (s)
Hello World !
In the previous session, we did not type the characters -->, which form the prompt and are
managed by Scilab. We only typed the statement s="Hello World!" with our keyboard and
then hit the <Enter> key. Scilab's answer is s = and Hello World!. Then we typed disp(s)
and Scilab's answer is Hello World!.
When we edit a command, we can use the keyboard as with a regular editor. We can use the
left and right arrow keys in order to move the cursor on the line, and use the <Backspace>
and <Suppr> keys in order to fix errors in the text. In order to get access to previously
executed commands, use the up arrow key. This allows us to browse the previous commands
by using the up and down arrow keys.
The <Tab> key provides a very convenient completion feature. In the following session, we
type the statement disp in the console.
-->disp
The editor can be accessed from the menu of the console, under the Applications > Editor
menu, or from the console, as presented in the following session.
--> editor ()
This editor allows managing several files at the same time. There are many features worth
mentioning in this editor. The most commonly used features are under the Execute menu.
Load into Scilab executes the statements in the current file, as if we did a copy and paste.
This implies that the statements which do not end with the semicolon ; character will
produce an output in the console.

Evaluate Selection executes the statements which are currently selected.
Execute File Into Scilab executes the file, as if we used the exec function. The results
which are produced in the console are only those which are associated with printing
functions, such as disp for example.
We can also select a few lines in the script, right click (or Cmd+Click under Mac), and get
the context menu. The Edit menu provides a very interesting feature, commonly known as a
pretty printer in most languages. This is the Edit > Correct Indentation feature, which
automatically indents the current selection. This feature is extremely convenient, as it allows
us to format algorithms, so that the if, for and other structured blocks are easy to analyze.
The editor provides fast access to the inline help. Indeed, assume that we have selected the
disp statement, as presented in figure 7. When we right-click in the editor, we get the
context menu, where the Help about disp entry allows us to open the help page associated
with the disp function.
The graphics system in Scilab version 5 has been updated so that many components are now
based on Java. This has a number of advantages, including the possibility to manage docking
windows.
The docking system uses Flexdock, an open-source project providing a Swing docking
framework. Assume that we have both the console and the editor opened in our environment.
It might be annoying to manage two windows, because one may hide the other, so that we
constantly have to move them around in order to actually see what happens. The Flexdock
system allows us to drag and drop the editor into the console, so that we finally have only one
window, with several sub-windows. All Scilab windows are dockable, including the console,
the editor, the help and the plotting windows.
In order to dock one window into another window, we must drag and drop the source window
into the target window. To do this, we left-click on the title bar of the docking window.
Before releasing the click, let us move the mouse over the target window and notice that a
window surrounded by dotted lines is displayed. This phantom window indicates the location
of the future docked window. We can choose this location, which can be on the top, the
bottom, the left or the right of the target window. Once we have chosen the target location,
we release the click, which finally moves the source window into the target window. We can
also release the source window over the target window, which creates tabs.
Using exec When several commands are to be executed, it may be more convenient to write
these statements into a file with the Scilab editor. To execute the commands located in such
a file, the exec function can be used, followed by the name of the script. This file generally
has the extension .sce or .sci, depending on its content: files having the .sci extension contain
Scilab functions, and executing them loads the functions into the Scilab environment (but
does not execute them); files having the .sce extension contain both Scilab functions and
executable statements. Executing a .sce file generally has an effect such as computing several
variables and displaying the results in the console, creating 2D plots, reading or writing into
a file, etc.
Assume that the content of the file myscript.sce is the following.
disp("Hello World !")
In the Scilab console, we can use the exec function to execute the content of this script.
-->exec (" myscript .sce")
-->disp (" Hello World !")
Hello World !
In practical situations, such as debugging a complicated algorithm, the interactive mode
is used most of the time with a sequence of calls to the exec and disp functions.

Batch processing Another way of using Scilab is from the command line. Several command
line options are available. Whatever the operating system is, the binaries are located in the
directory scilab-5.2.0/bin. Command line options must be appended to the binary for the
specific platform, as described below. The -nw option disables the display of the console.
The -nwni option launches the non-graphics mode: in this mode, the console is not displayed
and plotting functions are disabled (using them will generate an error).
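For example, assuming the -f option of the Scilab launcher (which executes a given script at
start-up) is available in your Scilab version, the myscript.sce file from the exec example
above could be run in batch mode with:

scilab-5.2.0/bin/scilab -nwni -f myscript.sce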
9.4.3 Basic elements of the language

In this section, we present the basic features of the language, that is, we show how to create
a real variable, and what elementary mathematical functions can be applied to a real
variable. If Scilab provided only these features, it would only be a super desktop calculator.
Fortunately, it is a lot more, and this is the subject of the remaining sections, where we will
show how to manage other types of variables, that is booleans, complex numbers, integers
and strings. It may seem strange at first, but it is worth stating right from the start: in
Scilab, everything is a matrix. To be more accurate, we should write: all real, complex,
boolean, integer, string and polynomial variables are matrices. Lists and other complex data
structures (such as tlists and mlists) are not matrices (but can contain matrices). These
complex data structures will not be presented in this document. This is why we could begin
by presenting matrices. Still, we choose to present basic data types first, because Scilab
matrices are in fact a special organization of these basic building blocks.
Creating real variables In this section, we create real variables and perform simple
operations with them. Scilab is an interpreted language, which implies that there is no need
to declare a variable before using it. Variables are created at the moment they are first set.
In the following example, we create and set the real variable x to 1 and perform a
multiplication on this variable. In Scilab, the = operator means that we want to set the
variable on the left hand side to the value associated with the right hand side (it is not the
comparison operator, whose syntax is the == operator).
-->x=1
x = 1.
-->x = x * 2
x = 2.
The value of the variable is displayed each time a statement is executed. That behavior can
be suppressed if the line ends with the semicolon ; character, as in the following example.
-->y=1;
-->y=y*2;
Elementary mathematical functions In the following example, we use the cos and sin
functions:
-->x = cos (2)
x = - 0.4161468
-->y = sin (2)
y = 0.9092974
-->x^2+y^2
ans = 1.

Complex Numbers Scilab provides complex numbers, which are stored as pairs of floating
point numbers. The predefined variable %i represents the mathematical imaginary number i,
which satisfies i^2 = -1. All elementary functions presented before, such as sin for example,
are overloaded for complex numbers. This means that, if their input argument is a complex
number, the output is a complex number. Figure 17 presents functions which allow the
management of complex numbers. In the following example, we set the variable x to 1 + i
and perform several basic operations on it, such as retrieving its real and imaginary parts.
Notice how the single quote operator, denoted by ', is used to compute the conjugate of a
complex number. We finally check that the equality (1 + i)(1 - i) = 1 - i^2 = 2 is verified by
Scilab (here x = 1 + i and y = 1 - i):
-->x*y
ans = 2.
Strings Strings can be stored in variables, provided that they are delimited by double
quotes. The concatenation operation is provided by the + operator. In the following Scilab
session, we define two strings and then concatenate them with the + operator.
-->x = "foo"
x = foo
-->y = "bar"
y = bar
-->x+y
ans = foobar
Dynamic type of variables When we create and manage variables, Scilab allows the type of
a variable to be changed dynamically. This means that we can create a real value, and then
store a string in the same variable, as presented in the following session.
-->x=1
x =
1.
-->x+1
ans =
2.
-->x="foo"
x =
foo
-->x+"bar"
ans =
foobar
We emphasize here that Scilab is not a typed language, that is, we do not have to declare
the type of a variable before setting its content. Moreover, the type of a variable can change
during the life of the variable.

9.5 Deal.II

9.5.1 Introduction

Deal.II is a C++ program library targeted at the computational solution of partial
differential equations using adaptive finite elements. It uses state-of-the-art programming
techniques to offer you a modern interface to the complex data structures and algorithms
required.

9.5.2 Description

The main aim of deal.II is to enable rapid development of modern finite element codes, using
among other aspects adaptive meshes and a wide array of tool classes often used in finite
element programs. Writing such programs is a non-trivial task, and successful programs tend
to become very large and complex. We believe that this is best done using a program library
that takes care of the details of grid handling and refinement, handling of degrees of freedom,
input of meshes and output of results in graphics formats, and the like. Likewise, support for
several space dimensions at once is included in such a way that programs can be written
independently of the space dimension, without unreasonable penalties on run-time and
memory consumption.
9.5.3 Installation

The first step is to get the library package:


wget http://www.dealii.org/download/deal.II-7.1.0.tar.gz

9.5.4 Unpacking

The library comes in a tar.gz archive that we must unzip with the following commands:
gunzip deal.II-X.Y.Z.tar.gz
tar xf deal.II-X.Y.Z.tar

9.5.5 Configuration

The library has a configuration script that we must run before the installation.
./configure creates the file deal.II/common/Make.global_options,
which remembers paths and configuration options. You can call:

make -j16 target-name

to let make call multiple instances of the compiler (in this case sixteen).
You can give several flags to ./configure:
--enable-shared = saves disk space, link time and start-up time, so this is the default
--enable-threads = the default is to use multiple threads
--enable-mpi = if MPI-capable compilers exist and indeed support MPI, then this also
switches on support for MPI in the library
--with-petsc=DIR and --with-petsc-arch=ARCH switches to ./configure can be used to
override the values of PETSC_DIR and PETSC_ARCH, or to set them if these environment
variables are not set at all
--with-metis-include, --with-metis-libs.
In order to configure the installation with the complete features of the deal.II library, we
first have to install some other required libraries:
PETSC (Portable, Extensible Toolkit for Scientific Computation)
ATLAS (Automatically Tuned Linear Algebra Software), used for compiling the PETSC
library
Metis (Graph Partitioning, Mesh Partitioning, Matrix Reordering), which provides various
methods to partition graphs

Environment libraries We have several OpenMPI implementations. To use the one suitable
for all the libraries, we must load the appropriate module.
Loading GCC
module load compilers/gcc-4.6.0

Loading MPI
module load mpi/openmpi-1.5.3_gcc-4.6.0

Installing PETSC In order to compile PETSC we must first have a version of ATLAS that
we can use. We have chosen PETSC version petsc-3.2-p5:
wget http://ftp.mcs.anl.gov/pub/petsc/release-snapshots/petsc-3.2-p5.tar.gz
gunzip petsc-3.2-p5.tar.gz
tar -xof petsc-3.2-p5.tar
cd petsc-3.2-p5

First we need to get the ATLAS library package and unpack it:
wget atlas3.9.51.tar.bz2
bunzip2 atlas3.9.51.tar.bz2
tar -xof atlas3.9.51.tar
mv ATLAS ATLAS3.9.51
cd ATLAS3.9.51

Now we need to configure the ATLAS library for the current machine, by creating the
working directories for the installation:
mkdir Linux_C2D64SSE3
cd Linux_C2D64SSE3
mkdir /export/home/ncit-cluster/stud/g/george.neagoe/hpc/lib/atlas

Running the configuration script of the ATLAS library:


../configure -b 64 -D c -DPentiumCPS=2400
--prefix=/export/home/ncit-cluster/stud/g/george.neagoe/hpc/lib/atlas
-Fa alg -fPIC

We must use the flags -Fa alg -fPIC because of an error that we encountered when
compiling the PETSC library:
liblapack.a(dgeqrf.o):
relocation R_X86_64_32 against a local symbol
can not be used when making a shared object;
recompile with -fPIC

The script must have an output like the following one:

cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/
..//makes/Make.l3thr src/threads/blas/level3/Makefile
cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/
..//makes/Make.l2thr src/threads/blas/level2/Makefile
cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/
..//makes/Make.l3ptblas src/pthreads/blas/level3/Makefile
cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/
..//makes/Make.dummy src/pthreads/blas/level2/Makefile
cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/
..//makes/Make.dummy src/pthreads/blas/level1/Makefile
cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/
..//makes/Make.miptblas src/pthreads/misc/Makefile
cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/
..//makes/Make.pkl3 src/blas/pklevel3/Makefile
cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/
..//makes/Make.gpmm src/blas/pklevel3/gpmm/Makefile
cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/
..//makes/Make.sprk src/blas/pklevel3/sprk/Makefile
cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/
..//makes/Make.l3 src/blas/level3/Makefile
cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/
..//makes/Make.l3aux src/blas/level3/rblas/Makefile
cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/
..//makes/Make.l3kern src/blas/level3/kernel/Makefile
cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/
..//makes/atlas_trsmNB.h include/.
cp /export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3/
..//CONFIG/ARCHS/Makefile ARCHS/.
make[2]: warning: Clock skew detected. Your build may be incomplete.
make[2]: Leaving directory
/export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3
make[1]: warning: Clock skew detected. Your build may be incomplete.
make[1]: Leaving directory
/export/home/ncit-cluster/stud/g/george.neagoe/hpc/ATLAS3.9.51/Linux_C2D64SSE3
make: warning: Clock skew detected. Your build may be incomplete.
DONE configure

Building the ATLAS library after the configuration:

make build
make check
make time
make install

Configuring PETSC:
cd petsc-3.2-p5/
./configure
--with-blas-lapack-dir=
/export/home/ncit-cluster/stud/g/george.neagoe/hpc/lib/atlas/lib
--with-mpi-dir=/opt/libs/openmpi/openmpi-1.5.3_gcc-4.6.0
--with-debugging=1
--with-shared-libraries=1

The configuration script will have an output like the following one:
xxx=========================================================================xxx
Configure stage complete. Now build PETSc libraries with (legacy build):
make PETSC_DIR=/export/home/ncit-cluster/stud/g/george.neagoe/hpc/petsc-3.2-p5
PETSC_ARCH=arch-linux2-c-debug all
or (experimental with python):
PETSC_DIR=/export/home/ncit-cluster/stud/g/george.neagoe/hpc/petsc-3.2-p5
PETSC_ARCH=arch-linux2-c-debug ./config/builder.py
xxx=========================================================================xxx

Then we can make the build of the PETSC library:


make all


Output:
Completed building libraries
=========================================
making shared libraries in
/export/home/ncit-cluster/stud/g/george.neagoe/
hpc/petsc-3.2-p5/arch-linux2-c-debug/lib
building libpetsc.so

In order to compile deal.II with the freshly compiled PETSC library, we must set some
environment variables, and also LD_LIBRARY_PATH.
PETSc configuration parameters: PETSC_DIR: this variable should point to the location of
the PETSc installation that is used. Multiple PETSc versions can coexist on the same
file-system. By changing the PETSC_DIR value, one can switch between these installed
versions of PETSc.
PETSC_ARCH: this variable gives a name to a configuration/build. Configure uses this
value to store the generated config makefiles in ${PETSC_DIR}/${PETSC_ARCH}/conf,
and make uses this value to determine the location of these makefiles [which in turn help in
locating the correct include and library files]. Thus one can install multiple variants of the
PETSc libraries, by providing different PETSC_ARCH values to each configure build. Then
one can switch between these variants of the libraries by switching the PETSC_ARCH value
used.
If configure doesn't find a PETSC_ARCH value [either in an environment variable or a
command line option], it automatically generates a default value and uses it. Also, if make
doesn't find a PETSC_ARCH environment variable, it defaults to the value used by the last
successful invocation of configure.
We must define the parameters in bash with the following commands before running the
configuration script:
PETSC_ARCH=arch-linux2-c-debug;
export PETSC_ARCH
PETSC_DIR=/export/home/ncit-cluster/stud/g/george.neagoe/hpc/petsc-3.2-p5;
export PETSC_DIR
export LD_LIBRARY_PATH=
/export/home/ncit-cluster/stud/g/george.neagoe/hpc/petsc-3.2-p5/
arch-linux2-c-debug/lib:${LD_LIBRARY_PATH}:
echo $LD_LIBRARY_PATH

After the PETSC library was complete, we encountered another problem when we tried to
run the parallelized examples from the deal.II library:
Exception on processing:
--------------------------------------------------------
An error occurred in line <98> of file
</export/home/ncit-cluster/stud/g/george.neagoe/hpc/deal.II_v2/
source/lac/sparsity_tools.cc>
in function void dealii::SparsityTools::partition
(const dealii::SparsityPattern&, unsigned int, std::vector<unsigned int>&)
The violated condition was: false
The name and call sequence of the exception was:
ExcMETISNotInstalled()
Additional Information: (none)

So, as we can notice, the problem was that the METIS library was not installed.

Installing METIS In order to generate partitionings of triangulations, we have functions
that call the METIS library. METIS is a library that provides various methods to partition
graphs, which we use to define which cell belongs to which part of a triangulation. The main
point in using METIS is to generate partitions so that the interfaces between cell blocks are
as small as possible. This data can, in turn, be used to distribute degrees of freedom onto
different processors when using PETSc and/or SLEPc in parallel mode.
As with PETSc and SLEPc, the use of METIS is optional. If you wish to use it, you can do
so by having a METIS installation around at the time of calling ./configure, by either
setting the METIS_DIR environment variable denoting the path to the METIS library, or
using the --with-metis flag. If METIS was installed as part of /usr or /opt, instead of local
directories in a home directory for example, you can use the configure switches
--with-metis-include and --with-metis-libs.
On some systems, when using shared libraries for deal.II, you may get warnings of the kind
"libmetis.a(pmetis.o): relocation R_X86_64_32 against a local symbol can not be used when
making a shared object; recompile with -fPIC" when linking. This can be avoided by
recompiling METIS with -fPIC as a compiler flag.
METIS is not needed when using p4est to parallelize programs, see below.
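Putting the pieces together, a deal.II ./configure invocation using the switches described in
this section might look like the following sketch (the PETSc variables are the ones exported
above; the METIS paths are placeholders that must be adapted to your own installation):

./configure \
  --enable-mpi \
  --with-petsc=$PETSC_DIR \
  --with-petsc-arch=$PETSC_ARCH \
  --with-metis-libs=/path/to/metis/lib \
  --with-metis-include=/path/to/metis/include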
9.5.6 Running Examples

The programs are in the examples/ directory of your local deal.II installation. After
compiling the library itself, if you go into one of the tutorial directories, you can compile the
program by typing make, and run it using make run. The latter command also compiles the
program if that has not already been done.
Example 1
cd /deal.II/examples/step-1
ls -l
total 1324
drwxr-xr-x 2 george.neagoe studcs   4096 Oct  9 22:42 doc
-rw-r--r-- 1 george.neagoe studcs   5615 Sep 21 18:12 Makefile
-rw-r--r-- 1 george.neagoe studcs 168998 Nov  8 03:44 Makefile.dep
-rwxr-xr-x 1 george.neagoe studcs 469880 Nov  8 03:44 step-1
-rw-r--r-- 1 george.neagoe studcs  18200 May 17 07:34 step-1.cc
-rw-r--r-- 1 george.neagoe studcs 666928 Nov  8 03:44 step-1.g.o
./step-1
ls -l
-rw-r--r-- 1 george.neagoe studcs  29469 Nov  8 04:05 grid-1.eps
-rw-r--r-- 1 george.neagoe studcs 129457 Nov  8 04:05 grid-2.eps

Example 2
cd /deal.II/examples/step-2
./step-2
ls -l
-rw-r--r-- 1 george.neagoe studcs 91942 Nov 8 04:16 sparsity_pattern.1
-rw-r--r-- 1 george.neagoe studcs 92316 Nov 8 04:16 sparsity_pattern.2

For viewing the 2D results we need to use gnuplot


gnuplot
Terminal type set to x11
gnuplot> set style data points

Figure 1: grid-1.eps

Figure 2: grid-2.eps

Figure 3: sparsity_pattern.1, gnuplot

Figure 4: sparsity_pattern.2, gnuplot

Example 3
cd /deal.II/examples/step-3
./step-3
ls -l
-rw-r--r-- 1 george.neagoe studcs 96288 Nov 8 04:30 solution.gpl
gnuplot
gnuplot> set style data lines
gnuplot> splot solution.gpl

Or with hidden 3D:
gnuplot> set hidden3d
gnuplot> splot solution.gpl

Example 17
cd step-17/
PETSC_ARCH=arch-linux2-c-debug;
export PETSC_ARCH
PETSC_DIR=/export/home/ncit-cluster/stud/g/george.neagoe/hpc/petsc-3.2-p5;
export PETSC_DIR
export LD_LIBRARY_PATH=/export/home/ncit-cluster/stud/g/george.neagoe/
hpc/petsc-3.2-p5/arch-linux2-c-debug/lib:${LD_LIBRARY_PATH}:
echo $LD_LIBRARY_PATH
mpirun -np 8 ./step-17
ls -l
-rw-r--r-- 1 george.neagoe studcs    9333 Jan 10 00:24 solution-0.gmv
-rw-r--r-- 1 george.neagoe studcs   19306 Jan 10 00:24 solution-1.gmv
-rw-r--r-- 1 george.neagoe studcs   38936 Jan 10 00:24 solution-2.gmv
-rw-r--r-- 1 george.neagoe studcs   77726 Jan 10 00:24 solution-3.gmv
-rw-r--r-- 1 george.neagoe studcs  153871 Jan 10 00:24 solution-4.gmv
-rw-r--r-- 1 george.neagoe studcs  299514 Jan 10 00:24 solution-5.gmv
-rw-r--r-- 1 george.neagoe studcs  588328 Jan 10 00:24 solution-6.gmv
-rw-r--r-- 1 george.neagoe studcs 1146705 Jan 10 00:24 solution-7.gmv
-rw-r--r-- 1 george.neagoe studcs 2228944 Jan 10 00:24 solution-8.gmv
-rw-r--r-- 1 george.neagoe studcs 4278926 Jan 10 00:24 solution-9.gmv

The results are meshes: *.gmv (general mesh viewer) files. Some of the result files of other
programs' output (*.vtk) can be viewed using the ParaView module, only on fep.grid.pub.ro.
module load tools/ParaView-3.8.1
paraview

After this we can view the file using the ParaView menu. For starting the application, we
need to make sure that we have connected to the cluster with X11 port forwarding. Example:
ssh -X george.neagoe@fep.grid.pub.ro


Figure 5: solution.gpl, gnuplot, 3D normal

Figure 6: solution.gpl, gnuplot, hidden3d

