
Achieving High Performance Computing

CHAPTER 1 INTRODUCTION
1.1. Parallel Programming Paradigm
In the 1980s it was believed that computer performance was best improved by creating faster and more efficient processors. This idea was challenged by parallel processing, which in essence means linking together two or more computers to jointly solve a computational problem. Since the early 1990s there has been an increasing trend to move away from expensive and specialized proprietary parallel supercomputers (vector supercomputers and massively parallel processors) towards networks of computers (PCs, workstations, SMPs). Among the driving forces that have enabled this transition has been the rapid improvement in the availability of commodity high-performance components for PCs, workstations, and networks. These technologies are making a networked cluster of computers an appealing vehicle for cost-effective parallel processing, and this is consequently leading to low-cost commodity supercomputing. Scalable computing clusters, ranging from clusters of (homogeneous or heterogeneous) PCs or workstations to SMPs, are rapidly becoming the standard platforms for high-performance and large-scale computing. The main attraction of such systems is that they are built from affordable, low-cost commodity hardware (such as Pentium PCs), fast LANs, and standard software components such as UNIX and MPI. These systems are scalable, i.e., they can be tuned to the available budget and computational needs, and they allow efficient execution of both demanding sequential and parallel applications.

1.2. Overview
We intend to present some of the main motivations for the widespread use of clusters in high-performance parallel computing. In the next section, we discuss a generic architecture of a cluster computer and a grid computer; the rest of the chapter focuses on the Message Passing Interface, strategies for writing parallel programs, and the two main approaches to parallelism (implicit and explicit). We briefly summarize the whole spectrum of choices for exploiting parallel processing: message-passing libraries, distributed shared memory, and object-oriented programming.


However, the main focus of this chapter is the identification and introduction of parallel programming paradigms in existing applications such as OpenFOAM. This approach presents some interesting advantages, for example the reuse of code, higher flexibility, and increased productivity of the parallel program developer.

1.3. Grid Network
Grid networking services are best presented within the context of the Grid and its architectural principles. The Grid is a flexible, distributed, information technology environment that enables multiple services to be created with a significant degree of independence from the specific attributes of the underlying support infrastructure. Advanced architectural infrastructure design increasingly revolves around the creation and delivery of multiple ubiquitous digital services. A major goal of information technology designers is to provide an environment within which it is possible to present any form of information on any device at any location. The Grid is an infrastructure that strongly complements the era of ubiquitous digital information and services. These environments are designed to support services not as discrete infrastructure components, but as modular resources that can be integrated into specialized blends of capabilities to create multiple additional, highly customizable services. The Grid also allows such services to be designed and implemented by diverse, distributed communities, independently of centralized processes. Grid architecture represents an innovation that is advancing efforts to achieve these goals. Early Grid infrastructure was developed to support data- and compute-intensive science projects. For example, the high-energy physics community was an early adopter of Grid technology. This community must acquire extremely high volumes of data from specialized instruments at key locations in different countries. It must gather, distribute, and analyze those large volumes of data as a collaborative initiative with thousands of colleagues around the world.


1.4. Message Passing Interface
Cluster message-passing libraries allow efficient parallel programs to be written for distributed-memory systems. These libraries provide routines to initiate and configure the messaging environment as well as to send and receive packets of data. Currently, the most popular high-level message-passing system for scientific and engineering applications is MPI (Message Passing Interface), defined by the MPI Forum. There are several implementations of MPI, including versions for networks of workstations, clusters of personal computers, distributed-memory multiprocessors, and shared-memory machines. Almost every hardware vendor supports MPI. This gives the user a comfortable feeling, since an MPI program can be executed on almost all existing computing platforms without the need to rewrite it from scratch. The goals of portability and of architecture and network transparency have been achieved with low-level communication libraries such as MPI. These libraries provide an interface for C and Fortran, and additional support in the form of graphical tools. However, such message-passing systems are still considered low-level because most tasks of the parallelization are left to the application programmer. When writing parallel applications using message passing, the programmer still has to develop a significant amount of software to manage some of the tasks of the parallelization, such as the communication and synchronization between processes, data partitioning and distribution, mapping of processes onto processors, and input/output of data structures. If the application programmer has no special support for these tasks, it becomes difficult to exploit parallel computing widely. The ease-of-use goal is not accomplished with a bare message-passing system, and hence additional support is required.
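To make these tasks concrete, the following is a minimal sketch of an MPI program in C (an illustration written for this report, not code taken from any of the applications discussed; the process counts and values are arbitrary). It shows what remains in the programmer's hands: initializing the messaging environment, identifying processes by rank, and moving data with explicit send and receive calls.

    /* hello_mpi.c - minimal point-to-point MPI example (illustrative sketch) */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size, value;
        MPI_Status status;

        MPI_Init(&argc, &argv);                 /* initialize the messaging environment */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank (id) of this process            */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes            */

        if (rank == 0) {
            value = 42;
            /* the master sends one integer to every other process */
            for (int dest = 1; dest < size; dest++)
                MPI_Send(&value, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
            printf("process 0 sent %d to %d workers\n", value, size - 1);
        } else {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("process %d received %d\n", rank, value);
        }

        MPI_Finalize();
        return 0;
    }

A typical build and run would be mpicc hello_mpi.c -o hello_mpi followed by mpiexec -n 4 ./hello_mpi.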


1.5. OpenFOAM
The OpenFOAM (Open Field Operation and Manipulation) CFD Toolbox is a free, open-source CFD software package produced by a commercial company, OpenCFD Ltd. It has a large user base across most areas of engineering and science, from both commercial and academic organisations. OpenFOAM has an extensive range of features to solve anything from complex fluid flows involving chemical reactions, turbulence and heat transfer, to solid dynamics and electromagnetics. The core technology of OpenFOAM is a flexible set of efficient C++ modules. These are used to build a wealth of: solvers, to simulate specific problems in engineering mechanics; utilities, to perform pre- and post-processing tasks ranging from simple data manipulations to visualisation and mesh processing; and libraries, to create toolboxes that are accessible to the solvers and utilities, such as libraries of physical models. OpenFOAM is supplied with numerous pre-configured solvers, utilities and libraries and so can be used like any typical simulation package. However, it is open, not only in terms of source code, but also in its structure and hierarchical design, so that its solvers, utilities and libraries are fully extensible. OpenFOAM uses finite volume numerics to solve systems of partial differential equations ascribed on any 3D unstructured mesh of polyhedral cells. The fluid flow solvers are developed within a robust, implicit, pressure-velocity, iterative solution framework, although alternative techniques are applied to other continuum mechanics solvers. One of the strengths of OpenFOAM is that new solvers and utilities can be created by its users with some prerequisite knowledge of the underlying method, physics and programming techniques involved. OpenFOAM is supplied with pre- and post-processing environments. The interfaces to the pre- and post-processing are themselves OpenFOAM utilities, thereby ensuring consistent data handling across all environments. The overall structure of OpenFOAM is shown in Figure 1.1.



1.6. Case Studies
1.6.1. Dense Matrix Multiplication
Dense matrix multiplication is a core operation in scientific computing and has been a topic of interest for computer scientists for over forty years. Theoretical computer scientists have refined the time bounds of the problem, and the focus of implementations has shifted from the serial to the parallel computing model. The lower bound for multiplying two square dense matrices of size n by n (henceforth referred to as matrix multiplication) has been known for some time to be Ω(n²), since every scalar element of the matrices must be examined. Until 1968, no improvements to the naive algorithm were known.

1.6.2. Computational Fluid Dynamics
One of the CFD (Computational Fluid Dynamics) codes developed for the conjugate heat transfer problem is used for this case study. The term conjugate heat transfer refers to a heat transfer process involving an interaction of heat conduction within a solid body with free, forced, or mixed convection from its surface to a fluid (or to its surface from a fluid) flowing over it. An accurate analysis of such heat transfer problems necessitates coupling the problem of conduction in the solid with that of convection in the fluid by satisfying the conditions of continuity in temperature and heat flux at the solid-fluid interface. There are many engineering and practical applications in which conjugate heat transfer occurs. One such area of application is the thermal design of a fuel element of a nuclear reactor. The energy released due to fission in the fuel element is first conducted to its lateral surface, and is in turn dissipated to the coolant flowing over it so as to maintain the temperature anywhere in the fuel element well within its allowable limit. If this energy is not removed fast enough, the fuel elements and other components may heat up so much that eventually a part of the core may melt. In fact, the limit to the power at which a reactor can be operated is set by the heat transfer capacity of the coolant.
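For reference, the continuity conditions at the solid-fluid interface mentioned earlier in this section can be written explicitly. This is the standard conjugate heat transfer formulation, stated here for clarity rather than taken from the code:

    T_s = T_f        and        k_s (dT/dn)_s = k_f (dT/dn)_f        at the interface,

where the subscripts s and f denote the solid and fluid sides, k is the thermal conductivity, and n is the direction normal to the interface.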


Therefore, knowledge of the temperature field in the fuel element and of the flow and thermal fields in the coolant is needed in order to predict its thermal performance.

1.6.3. OpenFOAM
Fluid dynamics is a field of science which studies the physical laws governing the flow of fluids under various conditions. Great effort has gone into understanding the governing laws and the nature of fluids themselves, resulting in a complex yet theoretically strong field of research.



CHAPTER 2 TESTBED SETUP


2.1. Globus Toolkit
Globus is a community of users and developers who collaborate on the use and development of open source software, and associated documentation, for distributed computing and resource federation. The middleware software itself, the Globus Toolkit, is a set of libraries and programs that address common problems that occur when building distributed system services and applications. Globus is also the infrastructure that supports this community: code repositories, email lists, a problem-tracking system, and so forth, all accessible at globus.org. The software itself provides a variety of components and capabilities, including: a set of service implementations focused on infrastructure management; tools for building new Web services in Java, C, and Python; a powerful standards-based security infrastructure; client APIs (in different languages) and command-line programs for accessing these services and capabilities; and detailed documentation on these components, their interfaces, and how they can be used to build applications. GT4 makes extensive use of Web services mechanisms to define its interfaces and structure its components. Web services provide flexible, extensible, and widely adopted XML-based mechanisms for describing, discovering, and invoking network services; in addition, their document-oriented protocols are well suited to the loosely coupled interactions that many argue are preferable for robust distributed systems. These mechanisms facilitate the development of service-oriented architectures: systems and applications structured as communicating services, in which service interfaces are described, operations invoked, and access secured, all in uniform ways.


Figure 2 illustrates various aspects of GT4 architecture.

2.1.1. Prerequisites
The following packages need to be pre-installed on the system:
- jdk-1_5_0_03-linux-i586.bin
- apache-ant-1.6.4-bin.tar
- gt4.2.0-all-source-installer.tar.gz

Java Installation:
[root@pace ~]# cd /usr/local/


[root@pace local]# rpm -q zlib-devel
[root@pace local]# ./jdk-1_5_0_03-linux-i586.bin
[root@pace local]# vi /etc/profile
Add the following lines:
#GRID ENVIRONMENT VARIABLE SETTINGS
JAVA_HOME=/usr/local/jdk1.5.0_03
PATH=$JAVA_HOME/bin:$PATH
CLASSPATH=$CLASSPATH:$JAVA_HOME/lib/tools.jar
export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE INPUTRC CLASSPATH JAVA_HOME
Note: Most Linux distributions come with a Java runtime pre-installed, but Globus needs the Java distribution from Sun, so check the vendor of the installed Java.

Apache Ant Installation:
[root@pace]# tar -xvf /home/vkuser/gt4/software/apache-ant-1.6.4-bin.tar
[root@pace]# mv apache-ant-1.6.4 ant-1.6.4
[root@pace ant-1.6.4]# vi /etc/profile
Add the following lines:
ANT_HOME=/usr/local/ant-1.6.4
PATH=$ANT_HOME/bin:$JAVA_HOME/bin:$PATH
Note: Ant and Java are needed for compiling the Globus source code. We used Fedora 10, which met all the requirements, and installed Ant and Java as above. The installation steps may differ between versions; refer to the installation guide included with each package.


Globus Installation:
Create the globus user account:
[root@pace]# adduser globus
[root@pace]# passwd globus
Copy the file gt4.2.0-all-source-installer.tar.gz into /usr/local and untar it:
[root@pace]$ tar xzf gt4.2.0-all-source-installer.tar.gz
Change the ownership to the globus user:
[root@pace]# chown globus:globus gt4.2.0-all-source-installer.tar.gz
A directory will now have been created (e.g. gt4.2.0-all-source-installer); go into that directory, then configure, compile and install:
[root@pace]# ./configure
[root@pace]# make
[root@pace]# make install
[root@pace]# chown -R globus:globus /usr/local/globus-4.2.0/
Note: Before starting the installation process, change the hostname of your system. The default hostname (localhost.localdomain) will create problems during certificate generation.
Now, as the root user:
[root@pace local]# vi /etc/profile
Add the following lines:
GLOBUS_LOCATION=/usr/local/globus-4.2.0
PATH=$ANT_HOME/bin:$JAVA_HOME/bin:$LAM_HOME/bin:$LAM_HOME/sbin:$PATH:$GLOBUS_LOCATION/bin:$GLOBUS_LOCATION/sbin
export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE INPUTRC CLASSPATH GLOBUS_LOCATION


2.1.2. Setting up the first machine
2.1.2.1. SimpleCA configuration:
[globus@pace gt4.2.0-all-source-installer]$ source $GLOBUS_LOCATION/etc/globus-user-env.sh
[globus@pace ~]$ $GLOBUS_LOCATION/setup/globus/setup-simple-ca
The following output is displayed on the terminal:
The unique subject name for this CA is:
cn=Globus Simple CA, ou=simpleCA-pace.grid, ou=GlobusTest, o=Grid
Do you want to keep this as the CA subject (y/n) [y]: y
Enter the email of the CA (this is the email where certificate requests will be sent to be signed by the CA): guser01@pace.grid
The CA certificate has an expiration date. Keep in mind that once the CA certificate has expired, all the certificates signed by that CA become invalid. A CA should regenerate the CA certificate and start re-issuing ca-setup packages before the actual CA certificate expires. This can be done by re-running this setup script.
Enter the number of DAYS the CA certificate should last before it expires. [default: 5 years (1825 days)]: <enter>
Enter PEM pass phrase: xxxxxx
Verifying - Enter PEM pass phrase: xxxxxx
setup-ssl-utils: Complete
[root@pace ~]# $GLOBUS_LOCATION/setup/globus_simple_ca_116a21a8_setup/setup-gsi -default


Running the above command produces the following output:
setup-gsi: Configuring GSI security
Making /etc/grid-security...
mkdir /etc/grid-security
Making trusted certs directory: /etc/grid-security/certificates/
mkdir /etc/grid-security/certificates/
Installing /etc/grid-security/certificates//grid-security.conf.116a21a8...
Running grid-security-config...
Installing Globus CA certificate into trusted CA certificate directory...
Installing Globus CA signing policy into trusted CA certificate directory...
setup-gsi: Complete

[root@pace ~]# source $GLOBUS_LOCATION/etc/globus-user-env.sh
[root@pace ~]# grid-cert-request -host `hostname`
[root@pace ~]# exit
[globus@pace ~]$ grid-ca-sign -in /etc/grid-security/hostcert_request.pem -out hostsigned.pem

To sign the request please enter the password for the CA key: xxxxxx
The new signed certificate is at: /home/globus/.globus/simpleCA//newcerts/01.pem
[root@pace ~]# cp /home/globus/hostsigned.pem /etc/grid-security/hostcert.pem
cp: overwrite `/etc/grid-security/hostcert.pem'? y
[root@pace ~]# cd /etc/grid-security/
[root@pace grid-security]# cp hostcert.pem containercert.pem
[root@pace grid-security]# cp hostkey.pem containerkey.pem
[root@pace grid-security]# chown globus:globus container*.pem

[root@pace grid-security]# exit
Now we will get a user certificate for guser01.
[globus@pace ~]$ su - guser01
[guser01@pace ~]$ source $GLOBUS_LOCATION/etc/globus-user-env.sh
[guser01@pace ~]$ grid-cert-request
Generating a 1024 bit RSA private key
..........++++++
............++++++
writing new private key to '/home/guser01/.globus/userkey.pem'
Enter PEM pass phrase: xxxxxx
Verifying - Enter PEM pass phrase: xxxxxx
[guser01@pace ~]$ cp /home/guser01/.globus/usercert_request.pem /tmp/request.pem
[globus@pace ~]$ cp /tmp/request.pem /home/globus
[globus@pace ~]$ grid-ca-sign -in request.pem -out signed.pem
To sign the request please enter the password for the CA key: xxxxxx
The new signed certificate is at: /home/globus/.globus/simpleCA//newcerts/02.pem
[globus@pace ~]$ cp signed.pem /tmp/
[globus@pace ~]$ su - guser01
[guser01@pace ~]$ cp /tmp/signed.pem ~/.globus/usercert.pem
[guser01@pace ~]$ grid-cert-info -subject
/O=Grid/OU=GlobusTest/OU=simpleCA-pace.grid/OU=grid/CN=grid user #01
[root@pace ~]# vi /etc/grid-security/grid-mapfile
Add the following line:


"/O=Grid/OU=GlobusTest/OU=simpleCA-pace.grid/OU=grid/CN=grid user #01" guser01

Environment variable settings for credentials:
[root@pace ~]# vi /etc/profile
Add the following lines:
GRID_SECURITY_DIR=/etc/grid-security
GRIDMAP=/etc/grid-security/grid-mapfile
export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE INPUTRC CLASSPATH GLOBUS_LOCATION JAVA_HOME GRIDMAP GRID_SECURITY_DIR
Validate certificate setup:

Note: login as guser01
[root@pace ~]# openssl verify -CApath /etc/grid-security/certificates -purpose sslserver /etc/grid-security/hostcert.pem
/etc/grid-security/hostcert.pem: OK

2.1.2.2. Setting up GridFTP
[root@pace ~]# vim /etc/xinetd.d/gridftp
Add the following lines:
service gsiftp
{
    instances       = 100
    socket_type     = stream
    wait            = no
    user            = root
    env             += GLOBUS_LOCATION=/usr/local/globus-4.2.0
    env             += LD_LIBRARY_PATH=/usr/local/globus-4.2.0/lib
    server          = /usr/local/globus-4.2.0/sbin/globus-gridftp-server
    server_args     = -i
    log_on_success  += DURATION
    nice            = 10
    disable         = no
}


[root@pace ~]# vim /etc/services
Add the following lines at the bottom of the file:
# Local services
gsiftp    2811/tcp

[root@mitgrid ~]# /etc/init.d/xinetd reload
Reloading configuration:                                   [  OK  ]
[root@mitgrid ~]# netstat -an | grep 2811
tcp        0      0 0.0.0.0:2811        0.0.0.0:*        LISTEN
Note: Now the GridFTP server is waiting for a request, so we will run a client and transfer a file.
Testing:
[guser01@pace ~]$ grid-proxy-init -verify -debug
User Cert File: /home/guser01/.globus/usercert.pem
User Key File: /home/guser01/.globus/userkey.pem
Trusted CA Cert Dir: /etc/grid-security/certificates
Output File: /tmp/x509up_u502
Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA-mitgrid.grid/OU=grid/CN=grid user #01
Enter GRID pass phrase for this identity: guser01

Creating proxy .............++++++++++++ ..++++++++++++ Done
Proxy Verify OK
Your proxy is valid until: Sun Jan 29 01:12:48 2006
[guser01@mitgrid ~]$ globus-url-copy gsiftp://mitgrid.grid/etc/group file:///tmp/guser01.test.copy
[guser01@mitgrid ~]$ diff /tmp/guser01.test.copy /etc/group
Okay, so the GridFTP server works.
Starting the web services container configuration:
Now we will set up an /etc/init.d entry for the web services container.
Note: login as globus
[globus@mitgrid ~]$ vim $GLOBUS_LOCATION/start-stop
Add the following lines:
#! /bin/sh
set -e
export GLOBUS_OPTIONS="-Xms256M -Xmx512M"
. $GLOBUS_LOCATION/etc/globus-user-env.sh
cd $GLOBUS_LOCATION
case "$1" in
    start)
        $GLOBUS_LOCATION/sbin/globus-start-container-detached -p 8443
        ;;
    stop)
        $GLOBUS_LOCATION/sbin/globus-stop-container-detached
        ;;
    *)
        echo "Usage: globus {start|stop}" >&2
        exit 1
        ;;
esac
exit 0

[globus@mitgrid ~]$ chmod +x $GLOBUS_LOCATION/start-stop
Now, as root, we will create an /etc/init.d script to call the globus user's start-stop script:
Note: login as root
[root@mitgrid ~]# vim /etc/init.d/globus-4.2.0
Add the following lines:
#!/bin/sh -e
case "$1" in
    start)
        su - globus /usr/local/globus-4.2.0/start-stop start
        ;;
    stop)
        su - globus /usr/local/globus-4.2.0/start-stop stop
        ;;
    restart)
        $0 stop
        sleep 1
        $0 start
        ;;
    *)
        printf "Usage: $0 {start|stop|restart}\n" >&2
        exit 1
        ;;
esac
exit 0


[root@pace ~]# chmod +x /etc/init.d/globus-4.2.0
[root@pace ~]# /etc/init.d/globus-4.2.0 start
Starting Globus container. PID: 19051

2.1.2.3. Grid Resource Allocation and Management (GRAM)
Now that we have GridFTP and RFT working, we can set up GRAM for resource management. First we have to set up sudo so that the globus user can start jobs as a different user.
[root@pace ~]# visudo
Add the following lines at the bottom of the file (it is linked with /etc/sudoers):
#Grid variable settings by VK@MITGRID
globus ALL=(guser01) NOPASSWD: /usr/local/globus-4.2.0/libexec/globus-gridmap-and-execute -g /etc/grid-security/grid-mapfile /usr/local/globus-4.2.0/libexec/globus-job-manager-script.pl *
globus ALL=(guser01) NOPASSWD: /usr/local/globus-4.2.0/libexec/globus-gridmap-and-execute -g /etc/grid-security/grid-mapfile /usr/local/globus-4.2.0/libexec/globus-gram-local-proxy-tool *

Note: login as guser01
[guser01@pace ~]$ globusrun-ws -submit -c /bin/true
Submitting job...Done.
Job ID: uuid:a9378900-8fed-11da-a691-000ffe3b1003
Termination time: 01/29/2006 11:03 GMT
Current job state: Active
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
[guser01@mitgrid ~]$


[guser01@mitgrid ~]$ echo $?
0

MyProxy Server Setup and Configuration:
In order to create a MyProxy server, we will first turn the pace.grid machine into a MyProxy server by following these instructions.
Note: Login as root
[root@pace ~]# cp $GLOBUS_LOCATION/etc/myproxy-server.config /etc
[root@pace ~]# vim /etc/myproxy-server.config
Just uncomment the following lines.
Before modification:
#
# Complete Sample Policy
#
# The following lines define a sample policy that enables all
# myproxy-server features. See below for more examples.
#accepted_credentials "*"
#authorized_retrievers "*"
#default_retrievers "*"
#authorized_renewers "*"
#default_renewers "none"
#authorized_key_retrievers "*"
#default_key_retrievers "none"
After modification:
#
# Complete Sample Policy
#
# The following lines define a sample policy that enables all
# myproxy-server features. See below for more examples.


accepted_credentials "*"
authorized_retrievers "*"
default_retrievers "*"
authorized_renewers "*"
default_renewers "none"
authorized_key_retrievers "*"
default_key_retrievers "none"

[root@pace ~]# cat $GLOBUS_LOCATION/share/myproxy/etc.services.modifications >> /etc/services
[root@mitgrid ~]# tail /etc/services
asp              27374/udp    # Address Search Protocol
tfido            60177/tcp    # Ifmail
tfido            60177/udp    # Ifmail
fido             60179/tcp    # Ifmail
fido             60179/udp    # Ifmail
# Local services
gsiftp           2811/tcp
myproxy-server   7512/tcp     # Myproxy server

[root@pace ~]# cp $GLOBUS_LOCATION/share/myproxy/etc.xinetd.myproxy /etc/xinetd.d/myproxy

[root@pace ~]# vim /etc/xinetd.d/myproxy
Modify the following lines:
service myproxy-server
{
    socket_type  = stream
    protocol     = tcp
    wait         = no
    user         = root
    server       = /usr/local/globus-4.2.0/sbin/myproxy-server
    env          = GLOBUS_LOCATION=/usr/local/globus-4.2.0 LD_LIBRARY_PATH=/usr/local/globus-4.2.0/lib
    disable      = no
}


[root@pace ~]# /etc/init.d/xinetd reload
Reloading configuration:                                   [  OK  ]
[root@pace ~]# netstat -an | grep 7512
tcp        0      0 0.0.0.0:7512        0.0.0.0:*        LISTEN


Note: Login as guser01@pace.grid
[guser01@pace ~]$ grid-proxy-destroy
[guser01@pace ~]$ grid-proxy-info
ERROR: Couldn't find a valid proxy. Use -debug for further information.
Note: Instead of grid-proxy-init we use the MyProxy server.
[guser01@mitgrid ~]$ myproxy-init -s mitgrid
Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA-mitgrid.grid/OU=grid/CN=grid user #01
Enter GRID pass phrase for this identity: guser01
Creating proxy ........................................... Done
Proxy Verify OK
Your proxy is valid until: Fri Feb 10 15:44:40 2006
Enter MyProxy pass phrase: globus


Verifying - Enter MyProxy pass phrase: globus
A proxy valid for 168 hours (7.0 days) for user guser01 now exists on mitgrid.
[guser01@pace ~]$ myproxy-logon -s pace.grid
Enter MyProxy pass phrase: guser01
A proxy has been received for user guser01 in /tmp/x509up_u503.

2.1.3. Setting up the second machine
Install the Globus Toolkit (follow the steps specified in the prerequisites and the first-machine setup).
Installation of CA packages:
To install the CA packages, log in to the CA host as the globus user, invoke the setup-simple-ca script, and answer the prompts as appropriate:
[globus@ca]$ $GLOBUS_LOCATION/setup/globus/setup-simple-ca
WARNING: GPT_LOCATION not set, assuming:
GPT_LOCATION=/usr/local/globus-4.2.0
Certificate Authority Setup
This script will setup a Certificate Authority for signing Globus users certificates. It will also generate a simple CA package that can be distributed to the users of the CA. The CA information about the certificates it distributes will be kept in:
/home/globus/.globus/simpleCA/
/usr/local/globus-4.0.0/setup/globus/setup-simple-ca: line 250: test: res: integer expression expected
The unique subject name for this CA is:
cn=Globus Simple CA, ou=simpleCA-ca.redbook.ibm.com, ou=GlobusTest, o=Grid
Do you want to keep this as the CA subject (y/n) [y]: y


Enter the email of the CA (this is the email where certificate requests will be sent to be signed by the CA): (type mail address) globus@ca.redbook.ibm.com
The CA certificate has an expiration date. Keep in mind that once the CA certificate has expired, all the certificates signed by that CA become invalid. A CA should regenerate the CA certificate and start re-issuing ca-setup packages before the actual CA certificate expires. This can be done by re-running this setup script.
Enter the number of DAYS the CA certificate should last before it expires. [default: 5 years (1825 days)]: (type the number of days) 1825
Enter PEM pass phrase: (type ca certificate pass phrase)
Verifying - Enter PEM pass phrase: (type ca certificate pass phrase)
...(unrelated information omitted)
Set up security in each grid node. After performing the steps above, a package file has been created that needs to be used on the other nodes, as described in this section. In order to use certificates from this CA in other grid nodes, you need to copy and install the CA setup package to each grid node.
1. Log in to a grid node as the globus user and obtain the CA setup package from the CA host. Then run the setup commands for configuration:
[globus@hosta]$ scp globus@ca:/home/globus/.globus/simpleCA/globus_simple_ca_(ca_hash)_setup-0.18.tar.gz .
[globus@hosta]$ $GLOBUS_LOCATION/sbin/gpt-build globus_simple_ca_(ca_hash)_setup-0.18.tar.gz gcc32dbg
[globus@hosta]$ $GLOBUS_LOCATION/sbin/gpt-postinstall
Note: A CA setup package is generated when you run the setup-simple-ca command. Keep in mind that the name of the CA setup package includes a unique CA hash. As the root user, submit the commands to configure the CA settings in each grid node. This script creates the /etc/grid-security directory, which contains the configuration files for security.
Configure the CA in each grid node:

[root@hosta]# $GLOBUS_LOCATION/setup/globus_simple_ca_[ca_hash]_setup/setup-gsi -default
Note: For the setup of the CA host, you do not need to run the setup-gsi script. This script creates a directory that contains the configuration files for security. The CA host does not need this directory, because these configuration files are for the servers and users who use the CA.
In order to use some of the services provided by Globus Toolkit 4, such as GridFTP, you need to have a CA-signed host certificate and host key in the appropriate directory. As the root user, request a host certificate with the command:
[root@pace]# grid-cert-request -host `hostname`
Copy or send the /etc/grid-security/hostcert_request.pem file to the CA host. On the CA host, as the globus user, sign the host certificate by using the grid-ca-sign command:
[globus@ca]$ grid-ca-sign -in hostcert_request.pem -out hostcert.pem
To sign the request please enter the password for the CA key: (type CA passphrase)
The new signed certificate is at: /home/globus/.globus/simpleCA//newcerts/01.pem
Copy the hostcert.pem back to the /etc/grid-security/ directory on the grid node.
In order to use the grid environment, a grid user needs to have a CA-signed user certificate and user key in the user's directory. As a user (auser1 on hosta), request a user certificate with the command:
[auser1@pace1]$ grid-cert-request
Enter your name, e.g., John Smith: grid user 1 (type the grid user name)
A certificate request and private key is being created. You will be asked to enter a PEM pass phrase. This pass phrase is akin to your account password, and is used to protect your key file. If you forget your pass phrase, you will need to obtain a new certificate.


Generating a 1024 bit RSA private key
.....................................++++++
...++++++
writing new private key to '/home/auser1/.globus/userkey.pem'
Enter PEM pass phrase: (type pass phrase for grid user)
Verifying - Enter PEM pass phrase: (retype pass phrase for grid user)
...(unrelated information omitted)
Copy or send the (userhome)/.globus/usercert_request.pem file to the CA host. On the CA host, as the globus user, sign the user certificate by using the grid-ca-sign command:
[globus@pace]$ grid-ca-sign -in usercert_request.pem -out usercert.pem
To sign the request please enter the password for the CA key:
The new signed certificate is at: /home/globus/.globus/simpleCA//newcerts/02.pem
Copy the created usercert.pem to the (userhome)/.globus/ directory on the grid node.
Test the user certificate by typing grid-proxy-init -debug -verify as the auser1 user. With this command, you can see the location of the user certificate and key, the CA's certificate directory, a distinguished name for the user, and the expiration time. After you successfully execute grid-proxy-init, you have been authenticated and are ready to use the grid environment.
[auser1@pace1]$ grid-proxy-init -debug -verify
User Cert File: /home/auser1/.globus/usercert.pem
User Key File: /home/auser1/.globus/userkey.pem
Trusted CA Cert Dir: /etc/grid-security/certificates
Output File: /tmp/x509up_u511
Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA-ca.redbook.ibm.com/OU=redbook.ibm.com/CN=grid user 1


Enter GRID pass phrase for this identity:
Creating proxy .........++++++++++++ .................++++++++++++ Done
Proxy Verify OK
Your proxy is valid until: Thu Jun 9 22:16:28 200
Note: You may copy these user certificates to other grid nodes in order to access each grid node as a single grid user. But you may not copy a host certificate and a host key; a host certificate needs to be created in each grid node.
Set the mapping information between a grid user and a local user. Globus Toolkit 4 requires a mapping between an authenticated grid user and a local user. In order to map a user, you need to get the distinguished name of the grid user and map it to a local user. Get the distinguished name by invoking the grid-cert-info command:
[auser1@pace1]$ grid-cert-info -subject -f /home/auser1/.globus/usercert.pem
/O=Grid/OU=GlobusTest/OU=simpleCA-ca.redbook.ibm.com/OU=redbook.ibm.com/CN=grid user 1

As the root user, map the local user name to the distinguished name by using the grid-mapfile-add-entry command:
[root@pace1]# grid-mapfile-add-entry -dn \
"/O=Grid/OU=GlobusTest/OU=simpleCA-ca.redbook.ibm.com/OU=redbook.ibm.com/CN=grid user 1" -ln auser1
Modifying /etc/grid-security/grid-mapfile ...
/etc/grid-security/grid-mapfile does not exist... Attempting to create /etc/grid-security/grid-mapfile
New entry:
"/O=Grid/OU=GlobusTest/OU=simpleCA-ca.redbook.ibm.com/OU=redbook.ibm.com/CN=grid user 1" auser1



Note: The grid-mapfile-add-entry command creates and adds an entry to /etc/grid-security/grid-mapfile. You can also add an entry manually by adding a line to this file. In order to see the mapping information, look at /etc/grid-security/grid-mapfile.
Example of /etc/grid-security/grid-mapfile:
"/O=Grid/OU=GlobusTest/OU=simpleCA-ca.redbook.ibm.com/OU=redbook.ibm.com/CN=grid user 1" auser1
For setting up the Java WS Core container, GridFTP and the MyProxy server, follow the steps specified for the first machine.
Submitting the grid-proxy-init command:
Note: Login as auser1@pace.grid
[auser1@pace ~]$ grid-proxy-destroy
[auser1@pace ~]$ grid-proxy-info
ERROR: Couldn't find a valid proxy. Use -debug for further information.
Note: Instead of grid-proxy-init we use the MyProxy server.
[auser1@mitgrid ~]$ myproxy-init -s mitgrid
Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA-mitgrid.grid/OU=grid/CN=grid auser #1
Enter GRID pass phrase for this identity: auser01
Creating proxy ........................................... Done
Proxy Verify OK
Your proxy is valid until: Fri Feb 10 15:44:40 2006
Enter MyProxy pass phrase: globus
Verifying - Enter MyProxy pass phrase: globus

A proxy valid for 168 hours (7.0 days) for user guser01 now exists on mitgrid.
[auser1@pace ~]$ myproxy-logon -s pace.grid
Enter MyProxy pass phrase: auser1
A proxy has been received for user guser01 in /tmp/x509up_u503.

2.2. Message Passing Interface


2.2.1. Setting up MPICH
Here are the steps from obtaining MPICH2 through running your own parallel program on multiple machines.
1. Unpack the tar file for MPICH2, i.e. mpich2.tar.gz.
2. Choose an installation directory (the default is /usr/local/bin). It will be most convenient if this directory is shared by all of the machines where you intend to run processes. If not, you will have to duplicate it on the other machines after installation.
3. Choose a build directory. Building will proceed much faster if your build directory is on a file system local to the machine on which the configuration and compilation steps are executed. It is preferable that this also be separate from the source directory, so that the source directories remain clean and can be reused to build other copies on other machines.
4. Configure, build, and install MPICH2 using the following commands, specifying the installation directory and running the configure script in the source directory:
./configure --prefix=/home/you/mpich2-install
make
make install
5. Add the bin subdirectory of the installation directory to your path:
export PATH=/home/you/mpich2-install/bin:$PATH



6. For security reasons, MPD looks in your home directory for a file named .mpd.conf containing the line secretword=<secretword>, where <secretword> is a string known only to yourself. It should not be your normal Unix password. Set the file permissions so it is readable and writable only by you:
cd $HOME
touch .mpd.conf
chmod 600 .mpd.conf
Then use an editor to place a line like:
secretword=mr45-j9z
into the file (of course, use a different secret word than mr45-j9z). If you are the super user, create the file as /etc/mpd.conf instead, as root.
7. The first sanity check consists of bringing up a ring of one MPD on the local machine, testing one MPD command, and bringing the ring down:
mpd &
mpdtrace
mpdallexit
The output of mpdtrace should be the hostname of the machine you are running on. The mpdallexit causes the mpd daemon to exit.
8. The next sanity check is to run a non-MPI program using the daemon:
mpd &
mpiexec -n 1 /bin/hostname
mpdallexit
This should print the name of the machine you are running on.


9. Now we will bring up a ring of mpds on a set of machines. Create a file consisting of a list of machine names, one per line, and name this file mpd.hosts. These hostnames will be used as targets for ssh or rsh, so include full domain names if necessary. Boot the ring with mpdboot -n <number of hosts> -f mpd.hosts, then check whether all the hosts you listed in mpd.hosts appear in the output of mpdtrace; if so, move on to the next step.

10. Test the ring you have just created. This requires ssh login without a password. First log in on A as user a and generate a pair of authentication keys. Do not enter a passphrase:
a@A:~> ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/a/.ssh/id_rsa):
Created directory '/home/a/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/a/.ssh/id_rsa.
Your public key has been saved in /home/a/.ssh/id_rsa.pub.
The key fingerprint is:
3e:4f:05:79:3a:9f:96:7c:3b:ad:e9:58:37:bc:37:e4 a@A
Now use ssh to create a directory ~/.ssh as user b on B (the directory may already exist, which is fine):
a@A:~> ssh b@B mkdir -p .ssh
b@B's password:
Finally, append a's new public key to b@B:.ssh/authorized_keys and enter b's password one last time:
a@A:~> cat .ssh/id_rsa.pub | ssh b@B 'cat >> .ssh/authorized_keys'
b@B's password:
From now on you can log into B as b from A as a without a password:


a@A:~> ssh b@B hostname

11. Test that the ring can run a multiprocess job:
mpiexec -n <number> hostname
The number of processes need not match the number of hosts in the ring; if there are more, they will wrap around. You can see the effect of this by getting rank labels on the stdout:
mpiexec -l -n 30 hostname
You probably didn't have to give the full pathname of the hostname command because it is in your path. If not, use the full pathname:
mpiexec -l -n 30 /bin/hostname
12. Now we will run an MPI job, using the mpiexec command as specified:
mpiexec -n 5 examples/cpi
The number of processes need not match the number of hosts. The cpi example will tell you which hosts it is running on. By default, the processes are launched one after the other on the hosts in the mpd ring, so it is not necessary to specify hosts when running a job with mpiexec.
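For orientation, the cpi program bundled with MPICH2 estimates pi by numerical integration. The following is a condensed C sketch in the same spirit, written for this report as an illustration rather than being the distributed cpi source: each process sums a strided share of the intervals and MPI_Reduce combines the partial sums on rank 0.

    /* mypi.c - cpi-style pi estimation (illustrative sketch) */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size, i, n = 100000;          /* number of intervals */
        double h, x, sum = 0.0, mypi, pi;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        h = 1.0 / (double)n;
        /* each process integrates 4/(1+x^2) over a strided subset of intervals */
        for (i = rank + 1; i <= n; i += size) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x * x);
        }
        mypi = h * sum;

        /* combine the partial results on rank 0 */
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("pi is approximately %.16f\n", pi);

        MPI_Finalize();
        return 0;
    }

Compiled with mpicc mypi.c -o mypi, it can be launched exactly like cpi above, e.g. mpiexec -n 5 ./mypi.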


Troubleshooting:
It can be rather tricky to configure one or more hosts in such a way that they adequately support client-server applications like mpd. In particular, each host must not only know its own name, but must identify itself correctly to other hosts when necessary. Further, certain information must be readily accessible to each host; for example, each host must be able to map another host's name to its IP address. In this section, we will walk slowly through a series of steps that will help to ensure success in running mpds on a single host or on a large cluster. If you can ssh from each machine to itself, and from each machine to each other machine in your set (and back), then you probably have an adequate environment for mpd. However, there may still be problems. For example, if you are blocking all ports except the ports used by ssh/sshd, then mpd will still fail to operate correctly. To begin using mpd, the sequence of steps that we recommend is this:
1. Get one mpd working alone on a first test node.
2. Get one mpd working alone on a second test node.
3. Get two new mpds to work together on the two test nodes.
Following these steps:
1. Install MPICH2, and thus mpd.
2. Make sure the MPICH2 bin directory is in your path. Below, we will refer to it as MPDDIR.
3. Run a first mpd (alone on a first node). As mentioned above, mpd uses client-server communications to perform its work. So, before running an mpd, let us run a simpler program (mpdcheck) to verify that these communications are likely to be successful. Even on hosts where communications are well supported, sometimes there are problems associated with hostname resolution, etc., so it is worth the effort to proceed a bit slowly. Below, we assume that you have installed mpd and have it in your path. Select a test node; let us call it n1, and log in to n1. First, we will run mpdcheck as a server and as a client. To run it as a server, get into a window with a command line and run this:
n1 $ mpdcheck -s
server listening at INADDR_ANY on: n1 1234
Now, run the client side (in another window if convenient) and see if it can find the server and communicate. Be sure to use the same hostname and port number printed by the server (above: n1 1234):
n1 $ mpdcheck -c n1 1234
server has conn on <socket._socketobject object at 0x40200f2c>

from (192.168.1.1, 1234)
server successfully recvd msg from client: hello_from_client_to_server
client successfully recvd ack from server: ack_from_server_to_client
If the experiment failed, you have some network or machine configuration problem which will also be a problem later when you try to use mpd. If the experiment succeeded, then you should be ready to try mpd on this one host. To start an mpd, you will use the mpd command. To run parallel programs, you will use the mpiexec program. All mpd commands accept the -h or --help arguments, e.g.:
n1 $ mpd --help
n1 $ mpiexec --help
Try a few tests:
n1 $ mpd &
n1 $ mpiexec -n 1 /bin/hostname
n1 $ mpiexec -l -n 4 /bin/hostname
n1 $ mpiexec -n 2 PATH_TO_MPICH2_EXAMPLES/cpi
where PATH_TO_MPICH2_EXAMPLES is the path to the mpich2-1.0.3/examples directory. To terminate the mpd:
n1 $ mpdallexit
Run a second mpd (alone on a second node). To verify that things are fine on a second host (say n2), log in to n2 and perform the same set of tests that you did on n1. Make sure that you use mpdallexit to terminate the mpd so you will be ready for further tests.
Run a ring of two mpds on two hosts. Before running a ring of mpds on n1 and n2, we will again use mpdcheck, but this time between the two machines. We do this because the two nodes may have trouble locating each other or communicating between them, and it is easier to check this out with the smaller program.



First, we will make sure that a server on n1 can service a client from n2. On n1:
n1 $ mpdcheck -s
which will print a hostname (hopefully n1) and a port number (say 3333 here). On n2:
n2 $ mpdcheck -c n1 3333
Second, we will make sure that a server on n2 can service a client from n1. On n2:
n2 $ mpdcheck -s
which will print a hostname (hopefully n2) and a port number (say 7777 here). On n1:
n1 $ mpdcheck -c n2 7777
If both checks succeed, start an mpd on n1 and find out which port it is listening on:
n1 $ mpd &
n1 $ mpdtrace -l
The output will look something like n1_6789 (192.168.1.1). The 6789 is the port that the mpd is listening on for connections from other mpds wishing to enter the ring. We will use that port in a moment to get an mpd from n2 into the ring. The value in parentheses should be the IP address of n1. On n2:
n2 $ mpd -h n1 -p 6789 &
where 6789 is the listening port on n1 (from mpdtrace above). Now try:
n2 $ mpdtrace -l
You should see both mpds in the ring. To run some programs in parallel:
n1 $ mpiexec -n 2 /bin/hostname
n1 $ mpiexec -n 4 /bin/hostname
n1 $ mpiexec -l -n 4 /bin/hostname
n1 $ mpiexec -l -n 4 PATH_TO_MPICH2_EXAMPLES/cpi
To bring down the ring of mpds:
n1 $ mpdallexit
If the output from any of mpdcheck, mpd, or mpdboot leads you to believe that one or more of your hosts are having trouble communicating due to firewall issues, we can offer a few simple suggestions. If the problems are due to an enterprise firewall computer, then we can only point you to your local network admin for assistance. In other cases, there are a few quick


things that you can try to see if there are some common protections in place that may be causing your problems. Deactivate all firewalls in the running-services window.

2.3. OpenFOAM
System requirements:
OpenFOAM is developed and tested on Linux, but should work with other POSIX systems. To check your system setup, execute the foamSystemCheck script in the bin/ directory of the OpenFOAM installation. Here is the output you should get:
[open@sham OpenFOAM-1.6]$ foamSystemCheck
Checking basic system...
-----------------------------------------------------------------------
Shell:   /bin/bash
Host:    sham.globus
OS:      Linux version 2.6.27.5-117.fc10.i686
User:    open

System check: PASS
==================
Continue OpenFOAM installation.

Installation: Download and unpack the files in the $HOME/OpenFOAM directory as described in: http://www.OpenFOAM.org/download.html


The environment variable settings are contained in files in the etc/ directory of the OpenFOAM release, e.g. in $HOME/OpenFOAM/OpenFOAM-1.6/etc/. Source the etc/bashrc file by adding the following line to the end of your $HOME/.bashrc file:
. $HOME/OpenFOAM/OpenFOAM-1.6/etc/bashrc
Then update the environment variables by sourcing the $HOME/.bashrc file, typing in the terminal:
. $HOME/.bashrc
Testing the installation:
To check your installation setup, execute the foamInstallationTest script (in the bin/ directory of the OpenFOAM installation). If no problems are reported, proceed to getting started with OpenFOAM; otherwise, go back and check that you have installed the software correctly.
Getting started:
Create a project directory within the $HOME/OpenFOAM directory named <USER>-1.6 (e.g. 'chris-1.6' for user chris and OpenFOAM version 1.6) and create a directory named 'run' within it, e.g. by typing:
mkdir -p $FOAM_RUN
Copy the 'tutorials' examples directory in the OpenFOAM distribution to the 'run' directory. If the OpenFOAM environment variables are set correctly, the following command will do this:
cp -r $WM_PROJECT_DIR/tutorials $FOAM_RUN
Run the first example case of incompressible laminar flow in a cavity:
cd $FOAM_RUN/tutorials/incompressible/icoFoam/cavity


blockMesh
icoFoam
paraFoam

CHAPTER 3 CASE STUDIES

3.1. Case Study 1: Dense Matrix Multiplication


One way to implement the matrix multiplication algorithm is to allocate one processor to compute each row of the resultant matrix C; the whole of matrix B and one row of elements of A are then needed by each processor. Using the master-slave approach, these elements are sent from the master processor to the selected slave processors. Results are then collected back from each of the slaves and displayed by the master.
Steps taken to parallelize:
1) An MPICH code for matrix multiplication is written.
2) MPD must be running on all the nodes. Start the daemons "by hand" as follows:
mpd &          # starts the local daemon
mpdtrace -l    # makes the local daemon print its host and port in the form <host>_<port>
Then log into each of the other machines, put the install/bin directory in your path, and do:
mpd -h <hostname> -p <port> &
where the hostname and port belong to the original mpd that was started. From each machine, after starting its mpd, mpdtrace is used to see which machines are in the ring.


3) The execution command is given on the master node. The MPI job is run using the mpiexec command:
mpiexec -n <number of processes> <executable>
mpiexec -n 5 ./cpi
4) The system monitor is checked to verify that all the nodes are being utilized.
5) The result from each slave is given back to the master node.
Implementing:
There exist many ways of implementing matrix multiplication, and finding an efficient implementation remains a challenge for the programming community. A sequential code is first written using the ordinary matrix multiplication algorithm.

for (i = 0; i < n; i++) {             /* rows of the result        */
    for (j = 0; j < n; j++) {         /* columns of the result     */
        mult[i][j] = 0;
        for (k = 0; k < n; k++) {     /* inner (dot-product) loop  */
            mult[i][j] += m1[i][k] * m2[k][j];
        }
    }
}

This algorithm requires n³ multiplications and n³ additions, leading to a sequential time complexity of O(n³). Parallel matrix multiplication is usually based upon the direct sequential matrix multiplication algorithm. Even a superficial look at the sequential code reveals that the computation in each iteration of the outer two loops is not dependent upon any other iteration, and each instance of the inner loop could be executed in parallel. Theoretically, with p = n² processors (one per element of the result), we can expect a parallel time complexity of O(n), and this is easily obtainable.

Direct implementation:



One way to implement the matrix multiplication algorithm is to allocate one processor to compute each column of the resultant matrix C; the whole of matrix A and one column of elements of B are then needed by each processor. Using the master-slave approach, these elements are sent from the master processor to the selected slave processors. Results are then collected back from each of the slaves and displayed by the master.
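A minimal sketch of this master-slave decomposition in C with MPI is shown below. It is an illustration of the approach described above, not the exact program that produced the timings reported next; it distributes blocks of rows rather than columns (the equivalent variant that suits C's row-major storage), assumes N is divisible by the number of processes, and omits error handling.

    /* matmul_mpi.c - row-block master/worker matrix multiply (illustrative sketch) */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000                               /* assumed divisible by the process count */

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int rows = N / size;                     /* rows of A (and C) handled per process */
        double *a = NULL, *c = NULL;
        double *b     = malloc((size_t)N * N * sizeof(double));
        double *a_loc = malloc((size_t)rows * N * sizeof(double));
        double *c_loc = malloc((size_t)rows * N * sizeof(double));

        if (rank == 0) {                         /* the master fills A and B */
            a = malloc((size_t)N * N * sizeof(double));
            c = malloc((size_t)N * N * sizeof(double));
            for (int i = 0; i < N * N; i++) { a[i] = 1.0; b[i] = 1.0; }
        }

        MPI_Bcast(b, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);          /* every process gets B       */
        MPI_Scatter(a, rows * N, MPI_DOUBLE,                         /* each gets a row block of A */
                    a_loc, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        for (int i = 0; i < rows; i++)                               /* compute the local block of C */
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += a_loc[i * N + k] * b[k * N + j];
                c_loc[i * N + j] = s;
            }

        MPI_Gather(c_loc, rows * N, MPI_DOUBLE,                      /* the master collects C */
                   c, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("C[0][0] = %f\n", c[0]);
        MPI_Finalize();
        return 0;
    }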

Observations:
Performance of matrix multiplication using MPICH2:


Matrix dimension 1000x1000:
No. of cores      1        2        3        4        5
Time (seconds)    13.90    8.33     5.86     5.07     6.89

Matrix dimension 2000x2000:
No. of cores      1        2        3        4        5
Time (seconds)    108.17   64.67    45.56    36.54    51.62

Matrix dimension 3000x3000:
No. of cores      1        2        3        4        5
Time (seconds)    392.9    220.19   156.81   123.36   180.29
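From these timings the speedup S(p) = T(1)/T(p) and the parallel efficiency E(p) = S(p)/p follow directly. For example, for the 2000x2000 case on 4 cores, S(4) = 108.17/36.54 ≈ 2.96 and E(4) ≈ 0.74, whereas on 5 cores S(5) = 108.17/51.62 ≈ 2.10, i.e. adding the fifth process lowers both the speedup and the efficiency.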



The graph below shows the experimentally observed run times for the different matrix dimensions versus the number of processes.

Remarks:
1. Execution time increases once the number of spawned processes passes a certain critical number.
2. Time spent in the communication described above sometimes adds significant overhead.
3. If the work is divided among a number of processes matched to the machines and cores available, performance improves.
4. For example, with 2 machines of 2 cores each, performance improves as the number of processes goes from 1 to 4.
5. If the number of processes is not matched to the machines and cores available, performance decreases slightly.


6. For example, with 2 machines of 2 cores each, performance increases for 1 to 4 processes, but dividing the work among 5 processes slightly decreases performance, as shown in the graph.
7. To achieve high performance, the work should therefore be divided into a number of processes that matches the machines and cores available.

3.2. Case Study 2: Computational Fluid Dynamics


1. The problem involves the interaction of heat conduction within a solid body with convection from its surface to a fluid flowing over it.
2. Applications include the thermal design of a fuel element of a nuclear reactor.
3. Software was developed to study the conjugate heat transfer problem associated with a rectangular nuclear fuel element washed by an upward-moving coolant, employing the stream function-vorticity formulation. The equations governing the steady, two-dimensional flow and thermal fields in the coolant are solved simultaneously with the steady, two-dimensional heat conduction equation in the solid.

Pre-analysis - Profiling:



gprof, GNU's open-source profiler, was used to profile the code. Its output, which includes a flat profile and a call graph, was also used for code comprehension.

Flat profile:



Call graph:


The following observations were made.
1. Values for the different input parameters were hard-coded into the code. Each change of a parameter value necessitated a recompilation of the code.

2. For the given set of parameters present in the code, the observed run time was about 23 minutes. The run time could become much larger (ranging from a few hours to days) with changed parameters. The large run time discouraged experimenting with a range of computational grid sizes that might have produced results at a finer resolution. Additionally, ranges of parameter values that might have given more insight into the physics of the problem could not be studied for the same reason.


3. Multiple output files were used, and data was written to a large set of them. File handling could have been done more efficiently, with a positive impact on the total execution time.

4. As the program was executed serially, the execution of a few functions was delayed even though the parameters or values required to compute them were already available. A similar


case was that of functions being called only after the complete execution of loops, although no data or control dependencies existed between the loops and the functions.
5. There was an excessive and sometimes unnecessary use of global variables.

6. Some loops were identified that could have been combined to bring down the size of the code.

7. None of the functions took input parameters or returned values; global variables were used in place of both.

8. Some of the defined functions exhibited nearly identical functionality, differing in only a few statements.

Analyzing Data Dependencies: The call graph was used to understand the initial working of the code, after which the data dependencies were analyzed at a coarse, function-level granularity. More specifically, the following data dependencies were identified (a short sketch illustrating them follows the list).
1. Flow dependence - if a variable modified in one function is subsequently used by another function, the two must execute in that order.
2. Anti-dependence - if a variable used by one function is later modified by another function, the order of the two functions cannot be interchanged.
3. Output dependence - if two functions produce or write to the same output variable, they are output dependent and their order cannot be changed.

4. I/O dependence - this dependence between two functions occurs when the same file is read and written by both of them.
Based on the identified dependencies, the code was statically restructured for a theoretical execution on a multiprocessor system. Initial analysis indicated a reduction of the execution time to about 13 minutes, a theoretical speedup of about 1.7, ignoring communication overhead.
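The following hypothetical fragment (the function and variable names are invented, not taken from the CFD code) illustrates how the first three kinds of dependence constrain the order of function calls:

    // Hypothetical illustration of dependence types between functions.
    #include <cstdio>

    double q = 0.0;                                     // global, as in the original code style

    void computeFlux() { q = 3.14; }                    // writes q
    void useFlux()     { std::printf("q = %f\n", q); }  // reads q
    void resetFlux()   { q = 0.0; }                     // writes q again

    int main() {
        computeFlux();  // flow dependence: useFlux() needs the value written here
        useFlux();      // anti-dependence: resetFlux() must not run before this read
        resetFlux();    // output dependence: the last write to q must stay last
        // An I/O dependence (not shown) arises when two functions read and write
        // the same file; their order must likewise be preserved.
        return 0;
    }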

PRE-PARALLELIZATION EXERCISES: Based on the analysis, the following tasks were identified as prerequisites to the parallelization step:
1. The code needs to be changed to read parameters from the command line or from an external input file. This would allow the code to be executed unchanged for different parameter values, without recompilation.
2. The number of output files should be reduced. If individual files are genuinely required, the output sequence needs further analysis; otherwise the data written to them can be combined into a smaller set.
3. Functions identified as having no data dependencies between them are good candidates for parallel execution. Their execution-time profiles and the computation-to-communication ratio need further study to confirm that parallelization will indeed produce a speedup.
4. The code needs to be rewritten to reduce the use of global variables. This may involve changing all or most of the function signatures to accept input parameters and return results (a small sketch of this refactoring follows the list), and may also involve creating more efficient data structures for passing parameters between functions.

5. Many functions can be eliminated by rewriting them to combine the functionality of two or more functions. This would considerably reduce the code size and result in more compact, better-written code. However, code repeated in different places has the advantage that each copy can be customized to the part of the program in which it appears; combining similar portions into a single generalized function, while offering other advantages, removes this one. The trade-off needs to be weighed before performing this exercise.
6. Loops that are temporally close need to be studied, along with their indices, to see whether they can be combined. In addition to reducing the code size, this would reduce the parallelization effort, since only a single loop would need to be analyzed.
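The refactoring suggested in item 4 is sketched below with invented names (the real code's variables and functions differ); the "before" style hides the data flow in a global array, while the "after" style makes it explicit through parameters and return values:

    // Hypothetical refactoring sketch; names are illustrative only.
    #include <cstddef>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    // Before: results exchanged through a global array, hiding the data flow.
    double temperatureGlobal[100];
    void initTemperatureGlobal() {
        for (int i = 0; i < 100; ++i) temperatureGlobal[i] = 300.0;
    }

    // After: the same functionality with explicit parameters and a return value,
    // which makes dependencies visible and the functions easier to run in parallel.
    std::vector<double> initField(std::size_t n, double value) {
        return std::vector<double>(n, value);
    }

    double average(const std::vector<double>& field) {
        return std::accumulate(field.begin(), field.end(), 0.0) /
               static_cast<double>(field.size());
    }

    int main() {
        initTemperatureGlobal();                        // old style
        std::vector<double> t = initField(100, 300.0);  // refactored style
        std::printf("average = %f\n", average(t));
        return 0;
    }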

Flowchart

Flowchart for the order of execution of the program

The above flowchart shows the execution flow of the functions in the given CFD problem. There are 31 functions in total, some dependent and some independent. A single rectangular box with a number inside indicates the number of functions that must be executed sequentially; a double rectangular box with a number inside indicates the number of functions that can be executed independently, i.e. in parallel.

The problem took over 15 minutes to execute sequentially. After the analysis showed that parts of the code could be parallelized, a theoretical speedup of 1.71 was obtained by charging each group of functions that can execute in parallel only the maximum time taken among them; the estimate is sketched below.
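As a hedged sketch of how that figure is obtained (the individual function timings come from the gprof profile and are not reproduced here), the estimated parallel time keeps the strictly sequential portion and, for each group $g$ of mutually independent functions, only the largest time in the group:

    S_{\text{theoretical}} = \frac{T_{\text{sequential run}}}{T_{\text{serial-only functions}} + \sum_{g} \max_{i \in g} T_i} \approx 1.71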

3.3 Case Study 3: OpenFOAM


bubbleFoam is one of the many cases in the OpenFOAM application, and it is the one we study here. Before starting with this case, some modifications are needed. First, to generate profiling output, the -pg option has to be added to the C and C++ compiler rules in the directory

/home/open/OpenFOAM/OpenFOAM-1.6/wmake/rules/linuxGcc

After these modifications, the bubbleFoam solver is recompiled by running the wmake command in the directory

/home/open/OpenFOAM/OpenFOAM-1.6/applications/solvers/multiphase/bubbleFoam

The case itself is located in

/home/open/OpenFOAM/OpenFOAM-1.6/tutorials/multiphase/bubbleFoam/bubbleColumn

and is run with blockMesh followed by bubbleFoam. The case ran for about 8 minutes. To reduce the execution time, some observation was needed, so the gprof profiler was used to profile the case. From the profile graph we observed that the following functions were taking the most time:

H(), which is located in the file fvMatrix.C, and

Foam::tmp<Foam::fvMatrix<Foam::Vector<double> > > Foam::fvm::div<Foam::Vector<double> >(Foam::GeometricField<double, Foam::fvsPatchField, Foam::surfaceMesh> const&, Foam::GeometricField<Foam::Vector<double>, Foam::fvPatchField, Foam::volMesh>&)

H() was called about 40,000 times, and the div function also accounted for a large share of the run time.

Running in Parallel: This case was also run in parallel, decomposing the domain into 4 meshes. There is a dictionary associated with decomposePar, named decomposeParDict, which is located in the system directory of the tutorial case; as with many utilities, a default dictionary can also be found in the directory of the source code of the specific utility, i.e. in $FOAM_UTILITIES/parallelProcessing/decomposePar.

The first entry is numberOfSubdomains, which specifies the number of subdomains into which the case will be decomposed, usually corresponding to the number of processors available for the case. The method of decomposition used here is simple, and the corresponding simpleCoeffs sub-dictionary is edited according to the following criteria. The domain is split into pieces, or subdomains, in the x, y and z directions, the number of subdomains in each direction being given by the vector n. As this geometry is two-dimensional, the third direction, z, cannot be split, hence nz must equal 1. The nx and ny components of n split the domain in the x and y directions and must be chosen so that nx ny nz equals numberOfSubdomains (here nx ny = 4, since nz = 1). It is beneficial to keep the number of cell faces adjoining the subdomains to a minimum, so for a square geometry the split between the x and y directions should be fairly even. The delta keyword should be set to 0.001.

For example, to run on 4 processors, set numberOfSubdomains to 4 and n = (2 2 1); a minimal sketch of the resulting dictionary is shown below. When running decomposePar, the screen messages show that the decomposition is distributed fairly evenly between the processors.

The user has a choice of several decomposition methods, specified by the method keyword; those relevant here are:
simple - simple geometric decomposition, in which the domain is split into pieces by direction, e.g. 2 pieces in the x direction, 1 in y, etc.
hierarchical - hierarchical geometric decomposition, the same as simple except that the user specifies the order in which the directional splits are done, e.g. first in the y direction, then in the x direction, etc.
manual - manual decomposition, where the user directly specifies the allocation of each cell to a particular processor.
For each method there is a set of coefficients, specified in a sub-dictionary of decomposeParDict named <method>Coeffs, as shown in the dictionary listing. The decomposePar utility is then executed in the normal manner by typing decomposePar.
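A minimal sketch of the relevant decomposeParDict entries for this example is shown below (only the keywords discussed above are included; the dictionary shipped with the tutorial contains further entries, such as coefficients for the other methods):

    numberOfSubdomains 4;

    method          simple;

    simpleCoeffs
    {
        n           (2 2 1);
        delta       0.001;
    }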

On completion, a set of subdirectories will have been created in the case directory, one for each processor. The directories are named processorN, where N = 0, 1, ... represents a processor number; each contains a time directory, containing the decomposed field descriptions, and a constant/polyMesh directory containing the decomposed mesh description.

Running a decomposed case: A decomposed OpenFOAM case is run in parallel using the openMPI implementation of MPI. openMPI can be run on a local multiprocessor machine very simply, but when running on machines across a network, a file must be created that contains the host names of the machines. The file can be given any name and located at any path; in the following description we refer to such a file by the generic name, including full path, <machines>. The <machines> file contains the names of the machines, listed one per line. The names must correspond to fully resolved hostnames in the /etc/hosts file of the machine on which openMPI is run, and the list must contain the name of the machine running openMPI. Where a machine node contains more than one processor, the node name may be followed by the entry cpu=n, where n is the number of processors openMPI should run on that node.

For example, suppose a user wishes to run openMPI from machine1 on the following machines: machine1; machine2, which has 2 processors; and machine3. The <machines> file would contain:

machine1
machine2 cpu=2
machine3

An application is run in parallel using mpirun:

mpirun --hostfile <machines> -np <nProcs> <foamExec> <otherArgs> -parallel > log &

where <nProcs> is the number of processors, <foamExec> is the executable, e.g. icoFoam, and the output is redirected to a file named log. For example, if icoFoam is run on 4 nodes, specified in a file named machines, on the cavity tutorial in the $FOAM_RUN/tutorials/incompressible/icoFoam directory, then the following command should be executed:

mpirun --hostfile machines -np 4 icoFoam -parallel > log &

In our case, the solver ran for about 991 seconds when run on the 4 decomposed meshes.

Chapter 4 Conclusion & Future Work

The parallelization process is not easy, as it requires the application to be studied thoroughly. Many applications are written from a sequential point of view because they are easier to write and test that way. Porting matrix multiplication to the MPI cluster helped us study and understand how to parallelize a sequential program, since it was used as a benchmark. The CFD code for conjugate heat transfer could not be ported to the MPI or grid cluster because the program is poorly structured; the most efficient way to parallelize it would be to rewrite the code. Nevertheless, we observed that optimizing the code by eliminating dependencies, removing unnecessary references, and applying some careful programming can produce better performance. OpenFOAM is written in a highly object-oriented style of C++, which is difficult to understand. It is a long-lived code base that has undergone continuous enhancement, with a separate group involved in maintaining it. Because the code is already standardized and optimized, it poses problems for further parallelization.

Future work can generalize the parallelization process: port the bubbleFoam case to the MPI cluster for generating a single mesh, study how to port applications to a grid cluster, and port the OpenFOAM application to it. A thorough study of the OpenFOAM code will be the next major step in parallelizing this application. We tried to parallelize part of the case and ran into errors; even though the case did not run correctly after being modified to run in parallel, we did manage to parallelize some of it. With a little more time, the desired results should be achievable.

Chapter 5 References

[1] Joseph D. Sloan. High Performance Linux Clusters with OSCAR, Rocks, openMosix & MPI.
[2] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
[3] Luis Ferreira, Viktors Berstis, Jonathan Armstrong, Mike Kendzierski, Andreas Neukoetter, Masanobu Takagi, Richa Bing-Wo, Adeeb Amir, Ryo Murakawa, Olegario Hernandez, James Magowan, and Norbert Bieberstein. Introduction to Grid Computing with Globus. IBM Redbooks. www.redbooks.ibm.com/redbooks/pdfs/sg246895.pdf
[4] Waseem Ahmed, Ramis M. K., Shamsheer Ahmed, Suma Bhat, and Mohammed Isham. Pre-Parallelization Exercises in Budget-Constrained HPC Projects: A Case Study in CFD. P. A. College of Engineering, Mangalore, India.
[5] www.annauniv.edu/care/soft.htm
[6] www.mcs.anl.gov/mpi/mpich1
[7] www.openfoam.com/docs
[8] www.nus.edu.sg/demo2a.html
[9] www.linuxproblem.org/art_9.html
