
Parallel and Distributed Programming on Low Latency Clusters


B.Sc. (Politecnico di Torino) 2007


Submitted as a partial fulfillment of the requirements

for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Chicago, 2010

Chicago, Illinois
To my mother,

without whose continuous love

and support I would never have made it.


I want to thank all my family, my mother Silvana, my grandmother Nenna and my dear

Tanino who help me with love and support every day of my life.

Then I would like to thank all the faculty members that assisted me with this project, in

particular professor Bartolomeo Montrucchio and professor Carlo Ragusa for all the time spent

with me trying to make the software run, and researcher Fabio Freschi for giving me useful

suggestions during development.

Finally I would like to thank all my friends who were near me during these years, Alberto Grand, whose patience and kindness towards me are really extraordinary, and Salvatore Campione, who is an encouraging model for my studies.

V. G.



1 INTRODUCTION
1.1 Evolution of parallel and distributed systems
1.2 Computer architecture classification
1.3 Thesis Contents

2 BACKGROUND
2.1 Parallel and distributed application developing
2.2 Technological requirements
2.2.1 SMP processors
2.2.2 Multithreading
2.2.3 GPGPU computing
2.2.4 NUMA machines
2.2.5 Clusters
2.3 Scientific software advance

3 TECHNOLOGY
3.1 Parallel applications with OpenMP
3.1.1 Amdahl’s Law
3.1.2 Benchmarking
    Sequential program with OpenMP enhancements
    OpenMP schedulers performance
    Static Scheduler
    Dynamic Scheduler
    Guided Scheduler
    OpenMP enhancement results
3.2 Infiniband
3.3 Distributed execution with MPI
3.3.1 MPI over Infiniband
3.3.2 Benchmarks
    Single message over Infiniband with MPI
    Multiple messages over Infiniband with MPI
    Latency

4 ALGORITHM
4.1 Overview
4.2 Code Flowchart
4.3 Test Case
4.4 Profiling
4.5 Compiler optimizations
4.5.1 Native switch
4.5.2 Loop unrolling
4.5.3 IEEE compliance
4.5.4 Library Striping

5 IMPLEMENTATION
5.1 General Scheme
5.2 Hardware Support
5.3 Applied Directives
5.3.1 MPI Layer
5.3.2 DO directive
5.3.3 REDUCTION directive
5.3.4 Avoiding data dependency
5.4 Results
5.4.1 Reduced test case
5.4.2 Final test case

6 CONCLUSION

APPENDICES
Appendix A
Appendix B
Appendix C
Appendix D

CITED LITERATURE

VITA




RATIONS

III PARTIAL RESULTS

IV FINAL RESULTS

V FUNCTION RESULTS



1 Approach levels for parallelization
2 Classification scheme of computer architectures
3 Image showing the tree splitting procedure of a sequential task
4 Graph plotting of the theoretical curve from Amdahl’s Law
5 Graph plotting of Amdahl’s Law for multiprocessors
6 Performance overview of an OpenMP threaded program
7 OpenMP static scheduler performance chart
8 OpenMP dynamic scheduler performance chart
9 OpenMP guided scheduler performance chart
10 OpenMP scheduler overview
11 Time v. size for a single message
12 Time v. size for 1024 consecutive messages
13 Flowchart of the main functions implemented in the code
14 Standard problem #4 representation
15 S state field representation
16 Call graph scheme of the target software
17 Implementation scheme overview


API Application Programming Interface

SMP Symmetric Multi-Processing

OpenMP Open Multi-Processing

MPI Message Passing Interface

IPC Inter Process Communication

PML Point-to-point Messaging Layer

BTL Byte Transfer Layer

SISD Single Instruction Single Data

SIMD Single Instruction Multiple Data

MISD Multiple Instructions Single Data

MIMD Multiple Instructions Multiple Data

SPMD Single Program Multiple Data

MPMD Multiple Program Multiple Data

SSE Streaming SIMD Extensions

SSSE3 Supplemental Streaming SIMD Extensions 3

UMA Uniform Memory Access

NUMA Non-Uniform Memory Access


GPU Graphics Processing Units

GPGPU General-Purpose computing on Graphics Processing Units

ECC Error-Correcting Code

LLG Landau-Lifshitz-Gilbert equation


The goal of this thesis is to increase the performance and data throughput of Sally3D, an electromagnetic field analyzer and micromagnetic modeler for nanomagnets, developed at “Politecnico di Torino” by the Electrical Engineering department.

This target has been achieved by means of open standards, such as OpenMP and MPI, that offer a robust parallel programming paradigm and an efficient message passing API; in order to reduce latency in message passing between the two machines, a point-to-point Infiniband link has been set up.

The results show that an 80% speed improvement can be achieved thanks to optimized code, OpenMP multithreading and MPI communication. The hardware used consists of two computers, each with two quad-core Intel Xeon processors running at 2.5 GHz, supplied with 32 GB of RAM and a 20 Gb/s Infiniband network card.



1.1 Evolution of parallel and distributed systems

Until some decades ago computer applications were written in a sequential style in which

the instructions were executed in a fixed order; the programs relied on a single processing unit

and the throughput was dependent on the processor speed.

Nowadays, however, the technological trend is to control processor frequency and voltage in order to consume less power and generate less heat, and on such architectures sequential programming alone is no longer effective. For this reason a new execution paradigm has been exploited: parallel programming.

Parallel computing is the simultaneous execution of operations at different levels. The most widely used forms of parallelism are bit-level (increasing the bit size of words), instruction-level (exploiting instruction pipelines in processor architectures), loop-level (distributing data-independent instructions in a loop among different cores) and task-level (distributing complete threads among the cores).

In order to run parallel applications, hardware support must be present. There are many kinds of parallel-oriented computers: multi-core machines (a single processor with many processing units), symmetric multiprocessing machines (with more than one, possibly multicore, processor), clusters and grids (closely coupled computers connected by high-end networks) and, finally, graphics processing units, which are used for general purpose computation and are suited to linear and array operations.

On the other hand, parallel applications bring drawbacks at different levels: manually programming threads and concurrent processes is a difficult task, as data dependency must be carefully handled, and a poor programming style may lead to performance degradation. Moreover, a parallel environment introduces several problems of its own, such as deadlock or starvation, in which execution cannot continue due to resource dependency conflicts.

Subsequently there has been an increasing research effort to circumvent the difficulties of parallel programming by having the compiler parallelize the code automatically. However, complete automatic parallelization is a very complex operation requiring computational power that has not yet been reached; for this reason several other approaches have been proposed.


A quite simple and somewhat effective technique is loop unrolling, activated by the proper compiler optimizations: instead of translating a loop into a sequence of operations followed by a jump, the compiler transforms the loop into a completely sequential program, avoiding many jumps and pipeline flushes. This is quite beneficial for pipelined processors, which incur a high overhead for jump operations, but the code size increases in proportion to the dimension of the loop and there is still exponential complexity in unrolling very large cycles.
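As a minimal illustration of this transformation, the following Python sketch unrolls a summation loop by hand with an unroll factor of four. It only shows the shape of the rewritten loop: a real compiler performs the rewrite on machine code, and the function names and the factor of four are illustrative assumptions, not taken from the target software.

```python
def sum_rolled(values):
    """Reference loop: one addition and one loop test/jump per element."""
    total = 0.0
    for i in range(len(values)):
        total += values[i]
    return total


def sum_unrolled_by_4(values):
    """Same computation with the loop body replicated four times.

    The loop test/jump is executed once every four elements, at the
    price of a larger loop body (code size grows with the unroll factor).
    """
    n = len(values)
    total = 0.0
    i = 0
    # Main unrolled loop: handles groups of four elements.
    while i + 4 <= n:
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
        i += 4
    # Remainder loop: handles the last n % 4 elements.
    while i < n:
        total += values[i]
        i += 1
    return total


data = [float(k) for k in range(10)]
print(sum_rolled(data), sum_unrolled_by_4(data))
```

Both functions add the elements in the same order, so they return the same value; only the number of loop tests differs.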

A more effective approach was introduced a few years ago, in which programmers insert hints as compiler directives: in this way it is possible to mark sections of code that can be safely parallelized, exploiting the full capabilities of multicore processors. This methodology demands more involvement than loop unrolling, as it requires deeper knowledge of the program and of the dependencies between variables; however, even a limited insertion of compiler directives has a major effect on parallelization and program throughput.

Figure 1 shows the different parallelization methodologies and how deeply each one intervenes in the program. As may seem obvious, full parallelization is best achieved when it is set as a goal during program design, but it is also possible to adapt the project at different stages of development, each requiring an action of different difficulty.

Figure 1. Approach levels for parallelization


1.2 Computer architecture classification

As soon as parallel computation theory began to gain popularity, there was a shift in computer architecture design and a precise classification was needed. Starting from a single processor model operating on a single data stream, it became possible to consider single or multiple instruction streams operating on single or multiple data streams; the resulting classes are gathered in Flynn’s taxonomy:

SISD computers are traditional machines with a single processor operating on a single instruction stream and a single data stream, often stored in a single memory. This is the oldest architecture design and was the leading model in computer markets until a decade ago, when the first MMX extension was added to Intel processors.

SIMD is the general modern architecture commonly found in current processors in the form of SSE, Altivec and VIS (Visual Instruction Set, a technology present in SPARC processors) instructions, among others; most recently GPUs have started to exceed this paradigm with emphasis on vector parallelization. Multimedia operations are the prime beneficiaries of this approach, as well as cryptography and data

MISD architecture is uncommon, as there is no performance benefit from this design, but it is often found in mission critical applications, in which a dependable system must be developed. As a matter of fact, operating on a single data stream with multiple identical instructions may lead to error detection and error correction by means of hardware and time redundancy.

MIMD systems are suited for computer clusters in which a shared or distributed memory is

used; processors may function asynchronously and independently. Parallelism is achieved

because at any time computers may be executing different instructions on different data.

Figure 2. Classification scheme of computer architectures


The MIMD class can be further subdivided by extending the concept of “instruction” to the notion of “program”:

SPMD multiple processors execute the same program at the same time, but at independent points in the code and on different data;

MPMD an implementation of a client/server model in which a master feeds the other nodes with data and coordinates the workload distribution; each node thus executes a different set of programs on different data and reports its result to the master.
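The SPMD style can be sketched in a few lines of Python. Every "process" runs the same function and uses its own rank to select the data it works on. The names rank and size follow the usual MPI convention, while the block partitioning and the serial loop standing in for truly concurrent processes are assumptions made for illustration only.

```python
def block_range(rank, size, n_items):
    """Return the [start, stop) slice of n_items owned by process `rank`.

    Items are split into `size` contiguous blocks whose lengths differ
    by at most one (the first n_items % size ranks get one extra item).
    """
    base, extra = divmod(n_items, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def spmd_program(rank, size, data):
    """The single program: same code on every rank, different data."""
    start, stop = block_range(rank, size, len(data))
    return sum(x * x for x in data[start:stop])


data = list(range(10))
size = 4
# Serial stand-in for `size` processes running concurrently.
partial_sums = [spmd_program(rank, size, data) for rank in range(size)]
total = sum(partial_sums)
print(partial_sums, total)
```

In an MPMD arrangement, by contrast, the master and the workers would run different functions instead of sharing spmd_program.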

1.3 Thesis Contents

This thesis describes how to apply parallelization directives at these levels to a completely serial numerical program, in order to increase computational performance in a distributed and parallel environment. For this reason a MIMD system will be exploited.

The program consists of an equation solver written in FORTRAN, suited to electromagnetic field analysis, with high plotting resolution. Since the program is already provided, it is not possible to abstract to a very high level methodology; for this reason OpenMP has been selected as the parallelization technology, since it offers a set of compiler directives to spread sequential sections of code over every core of the machine.

As for the distributed part of the algorithm, two technologies have been adopted: MPI and Infiniband. MPI is a high level API for performing Inter Process Communication on the same machine or across different nodes, and it is available for many different programming languages (even for those which do not have IPC capabilities of their own). Infiniband, on the other hand, was chosen for its outstanding performance in sending small quantities of data with very little latency.


After this introduction, the document presents a general background and previous work regarding parallel application methodologies, followed by a thorough description of the technologies used in this research. Then the main algorithm of the program is outlined, showing the critical points at which a performance increase may be achieved through parallelization or distribution; finally some results are presented, tracing the throughput growth of the program with OpenMP and MPI directives.



2.1 Parallel and distributed application developing

Historically, parallel and distributed computing has been considered to be “the high end of

computing”, and has been used to model difficult scientific and engineering problems found in

the real world. Some examples (source: Livermore Computing Center):

• Atmosphere, Earth, Environment;

• Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics;

• Bioscience, Biotechnology, Genetics;

• Chemistry, Molecular Sciences;

• Geology, Seismology;

• Mechanical Engineering - from prosthetics to spacecraft;

• Electrical Engineering, Circuit Design, Microelectronics;

• Computer Science, Mathematics;

• processing of large amounts of data in sophisticated ways such as:

– Databases, data mining;

– Oil exploration;

– Web search engines, web based business services;


– Medical imaging and diagnosis;

– Pharmaceutical design;

– Management of national and multi-national corporations;

– Financial and economic modeling;

– Advanced graphics and virtual reality, particularly in the entertainment industry;

– Networked video and multi-media technologies;

– Collaborative work environments.

2.2 Technological requirements

2.2.1 SMP processors

As the demand for performance increases and the cost of microprocessors continues to drop, the single processor model has been abandoned in favor of an SMP organization. An SMP architecture refers to a computer system composed of multiple processors connected to a single shared memory and to a shared I/O controller.

Operating system support is necessary to enable this feature. Moreover, programs have to be rewritten, or at least reconsidered, in order to access every available resource. For this reason compiler software has been continuously improved, in an effort to simplify program parallelization for developers.

Resorting to an SMP architecture can bring many advantages (1):


1. Performance – the workload can be spread among more processors, running different tasks in parallel; moreover, interrupt management can affect only one processor at a time, avoiding process suspension and pipeline stalls;

2. Incremental Growth – adding further processors increases performance even more, up to a certain extent;

3. Scaling – vendors can offer more systems with different SMP configurations;

4. Transparency – the operating system hides SMP management from the user, as it handles thread scheduling and process synchronization;

5. Availability – it is possible to set up the processor to execute the same instruction

on all the symmetric processors, being able to sustain hardware failures (sort of MISD


2.2.2 Multithreading

Multithreading is a technique to exploit thread-level parallelism; the unit of execution becomes a single thread of the program in memory. Once again, this feature has to be enabled in software, through operating system support (2).

It is possible to increase execution parallelism by using one of the following implementations:

interleaved multithreading (fine-grained) at every clock cycle the processor switches execution from one thread to another, skipping any thread that is not ready (blocked by a data dependency or by memory latency);

blocked multithreading (coarse-grained) instructions of one thread are executed continuously until an event such as a cache miss causes a delay; in that case execution is switched to another thread;

simultaneous multithreading (SMT or Hyperthreading) instructions from multiple threads

are simultaneously executed, exploiting intrinsic parallelism of the execution units of the


chip multithreading one or more processors are simulated on the physical chip, each handling a separate set of threads; in this way pipeline execution is much simplified.

The Simultaneous Multithreading technique has been implemented in most modern processors, as it has shown the best performance benefits in a variety of applications during testing.

2.2.3 GPGPU computing

General-purpose computing on graphics processing units refers to a technique that allows general purpose execution on the processors present in modern video cards (namely, GPUs). This methodology makes it possible to exploit the GPU computing power, usually reserved for computer graphics, for almost any kind of operation; since the graphics processing unit is composed of many array processors, using a GPGPU programming language enables automatic streaming execution.

Applications that especially benefit from streaming execution are multimedia-related, such as digital signal processing (for audio/video or image manipulation), but there are also many implementations of computer clusters, physics simulators, mathematical solvers and raytracing done with GPGPU. Moreover there is older array-based software that receives a positive impact from this rather new technology, like cryptography, DNA folding, neural networks and medical


2.2.4 NUMA machines

While general purpose processors adopt uniform memory access (UMA), it is not uncommon to find systems whose memory access time is not uniform and depends on the position of the processor (NUMA, non-uniform memory access).

NUMA machines are usually physically distributed but logically shared, meaning that one

node can directly access memory of another node and that not all processors have equal access

time to all memories; a software layer is often needed to guarantee program access and workload


Memory is mapped like a global address space, merging the linked SMP memory; this feature

provides a user-friendly programming perspective to memory as data sharing between tasks is

both fast and uniform due to the proximity of memory to CPUs.

However, scalability between memory and CPUs is lacking, because adding more CPUs can geometrically increase traffic on the shared memory-CPU path. Moreover, synchronization constructs must be implemented to ensure “correct” access to global memory. One final disadvantage is that it is becoming increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors.

2.2.5 Clusters

A cluster is an alternative or an addition to symmetric multiprocessing for achieving high performance; a “cluster” can be defined as a group of computers interconnected through some network interface, working together as a unified computing resource.

It is possible to create large clusters that far outperform any standalone machine, with the advantage that it is relatively easy to add new components, even in small increments; both clusters and SMP systems provide a configuration for high performance applications, and each has its advantages and disadvantages.

For example, an SMP system is easier to manage and has fewer problems running single-processor software, while clusters require an in-depth program revision, with load balancing and work distribution; on the other hand, clusters dominate in final performance and offer more solutions for availability.

Clusters are historically divided into:

High-availability clusters for improving the availability offered by the cluster itself; they usually exploit redundancy so that when one node fails, it can be immediately replaced by a spare one (active or passive standby);

Load-balancing clusters with the primary purpose of evenly distributing the workload of a given task or service among the rest of the cluster;

Compute clusters used for computational activity, rather than services; nodes are tightly coupled and computation usually involves a considerable amount of communication; programs can usually be ported to this environment easily through simple instruction routines (e.g. MPI);

Grid computing similar to compute clusters, but focused more on the final computational throughput than on workload distribution and tightly coupled jobs; computation consists of many independent jobs which do not have to share data during the computation


2.3 Scientific software advance

Using parallelization technologies such as OpenMP and MPI is not new in scientific software. As a matter of fact it is normal to find quite a number of projects that exploit those


For example, the Folding@Home project, from Stanford University’s chemistry department, is currently the most powerful distributed computing cluster and is being developed using an MPI layer between its nodes; likewise, many entries in the TOP500 list (a project ranking and detailing the 500 most powerful known computer systems in the world), such as Pleiades and Ranger, use Infiniband as the connection link among the clusters.

As for electromagnetic field analyzers, there has been some previous work with OpenMP: (3) and (4) describe a possible implementation for hybrid solvers, but the software they address has different solving and modeling routines. The proposed work does not rely on the standard FEM approach, but adopts a Finite Formulation of a nonlinear magnetostatic algorithm, which can be safely parallelized and distributed; see (5) for more information.



3.1 Parallel applications with OpenMP

OpenMP is an application programming interface (API) that offers a set of compiler directives, library routines and environment variables to enable shared memory multiprocessing for C, C++ and FORTRAN programs.

OpenMP stands for Open Multi-Processing and is implemented in many open source and commercial compilers, such as the Intel C++ and FORTRAN compilers (icc and ifort) and the GNU Compiler Collection (gcc). Among the key factors in its popularity are the ease of handling threads and shared variables and the simplicity of porting programs to a multithreaded scheme with very little code change; moreover, OpenMP enables parallel execution control for languages that usually lack multithreading and synchronization primitives, such as FORTRAN.

With this technology the main program forks a set number of parallel threads which carry out a task, dividing the workload among different cores; by default every thread executes its section of code independently. After the parallel job has run, the threads are joined back into the main (or master) thread, which resumes normal sequential execution; in this way the sequence of program execution can be divided into a tree-like structure (as shown in Figure 3).


Figure 3. Image showing the tree splitting procedure of a sequential task
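The fork-join structure described above can be sketched with Python's standard concurrent.futures module. This is not OpenMP: a thread pool is swapped in purely to show the sequence (sequential part, fork, independent work, join, sequential part), and in CPython such threads do not speed up CPU-bound code. The function names and the choice of four workers are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def work(chunk):
    """Task carried out independently by each worker thread."""
    return sum(x * x for x in chunk)


def fork_join_sum_of_squares(values, n_threads=4):
    # Sequential part: the master thread prepares one chunk per worker.
    chunks = [values[t::n_threads] for t in range(n_threads)]
    # Fork: the pool runs work() on every chunk in separate threads.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partial = list(pool.map(work, chunks))
    # Join: leaving the `with` block waits for all threads; the master
    # thread then resumes sequential execution and combines the results.
    return sum(partial)


result = fork_join_sum_of_squares(list(range(1000)))
print(result)
```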

OpenMP exploits preprocessor directives for thread creation and synchronization, workload distribution and sharing, and data and function management, while remaining compatible with compilers that do not support it, which simply ignore the directives. In order to prevent data corruption due to overlapping threads, every variable used in a parallel section should be given an explicit visibility scope, either shared or private.

One directive is particularly suited to loop parallelization, as it offers fine-grained control over thread scheduling and over the distribution of loop iterations among the thread pool. Other directives directly manage thread interaction and synchronization objects (critical regions and variable locking).


However, it is important to clarify that using OpenMP on an N-processor machine does not reduce the execution time by a factor of N. There are several reasons for this:

• Symmetric Multi Processor computers have increased computational power, but memory bandwidth does not scale proportionally with the number of processors (or cores); performance degrades especially when the shared memory bandwidth is saturated and data transfer slows down;

• synchronization overhead, critical region management, context switch costs and load balancing among the threads may reduce the final speedup;

• not every portion of the code can actually be parallelized;

• the theoretical limit imposed by Amdahl’s Law, which bounds the maximum speedup of parallel applications, still holds.

3.1.1 Amdahl’s Law

Amdahl’s Law is a method used for finding the maximum speed improvement in parallel

computing environments. The speedup highly depends on the size of the parallelizable code


The formula states that the potential speedup of the program directly depends on the fraction of code P that can be parallelized:

$$\text{speedup} = \frac{1}{1 - P} \qquad (3.1)$$

Basically, if none of the code can be parallelized, P = 0 and the speedup is 1 (no speedup); if all of the code can be parallelized, P = 1 and the speedup is infinite (in theory); if 50% of the code can be parallelized, the maximum speedup is 2, meaning the code will run twice as fast, and so on. Figure 4 shows the theoretical speedup curve with infinite processors.

Figure 4. Graph plotting of the theoretical curve from Amdahl’s Law

When the number of processors is finite and part of the code cannot be parallelized, the relationship can be updated to

$$\text{speedup} = \frac{1}{\frac{P}{N} + S}$$

where N is the number of processors, P the portion of parallelizable code and S the portion of serial code (corresponding to 1 − P).
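The two relationships above can be checked numerically with a short Python sketch. The function name and the use of None to stand for an infinite number of processors are illustrative choices, not part of the original work.

```python
def amdahl_speedup(p, n=None):
    """Theoretical speedup for a parallelizable fraction p (0 <= p <= 1).

    n is the number of processors; n=None models infinitely many
    processors, i.e. equation 3.1.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    serial = 1.0 - p                        # S in the text
    parallel = 0.0 if n is None else p / n  # P / N in the text
    denominator = serial + parallel
    return float("inf") if denominator == 0.0 else 1.0 / denominator


# Speedup of a 95% parallelizable program on a growing number of processors.
for n in (2, 8, 64, None):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

With p = 0.5 and no processor limit the function returns 2, matching the "twice as fast" case discussed in the text.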

Figure 5 shows a set of examples with different parallelizable fractions over a variable number of processors. It shows not only that a 95% parallelizable program has a maximum speed improvement on the order of 20x, however many processors are available, but also that a highly sequential program cannot achieve any acceleration


Figure 5. Graph plotting of Amdahl’s Law for multiprocessors
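The 20x ceiling quoted for a 95% parallelizable program follows from letting the number of processors grow without bound in the multiprocessor relationship; as a worked check, with S = 1 − P:

```latex
\lim_{N \to \infty} \frac{1}{\frac{P}{N} + S} = \frac{1}{S} = \frac{1}{1 - P},
\qquad
P = 0.95 \;\Rightarrow\; \text{speedup}_{\max} = \frac{1}{0.05} = 20 .
```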


3.1.2 Benchmarking

In order to understand the possible benefit of using OpenMP, some tests have been run to find the best possible configuration of the number of threads and the thread size. A simple test program was used, with a long and complex loop containing some processor intensive operations (mainly mathematical operations like power and square root). The particular case of an “interesting” loop has been chosen because it shows, simply enough, the effort/benefit ratio of OpenMP.

The two main configuration variables that characterized the benchmarks were the scheduler type and the chunk size, plus the total number of threads involved in the program. The chunk size is a positive integer representing the number of iterations each thread handles at a time, while the scheduler type may be:

STATIC loop iterations are divided into chunks of a fixed number of iterations, which are assigned to the threads in advance;

DYNAMIC loop iterations are divided into chunks of a fixed number of iterations, which are then dynamically assigned to a thread whenever it completes its previous chunk;

GUIDED the chunk size is recomputed in proportion to the number of iterations still unassigned, so that chunks start large and shrink as the loop proceeds.
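The difference between the three schedulers can be made concrete with a small Python sketch that only computes which iteration ranges each policy would hand out; no threads are started. The function names are hypothetical, and the guided formula (remaining iterations divided by the number of threads, rounded up) is one common formulation, assumed here because the exact rule is left to the OpenMP implementation.

```python
def static_chunks(n_iter, n_threads, chunk):
    """STATIC: fixed-size chunks dealt to threads round-robin, in advance."""
    plan = {t: [] for t in range(n_threads)}
    for k, start in enumerate(range(0, n_iter, chunk)):
        plan[k % n_threads].append((start, min(start + chunk, n_iter)))
    return plan


def dynamic_chunks(n_iter, chunk):
    """DYNAMIC: the same fixed-size chunks, but kept in a queue; a thread
    takes the next one whenever it finishes its current chunk."""
    return [(s, min(s + chunk, n_iter)) for s in range(0, n_iter, chunk)]


def guided_chunks(n_iter, n_threads, min_chunk):
    """GUIDED: each chunk is sized from the iterations still unassigned
    (here remaining / n_threads, rounded up), never below min_chunk
    except for the final leftover chunk."""
    chunks, start = [], 0
    while start < n_iter:
        remaining = n_iter - start
        size = max(-(-remaining // n_threads), min_chunk)  # ceiling division
        size = min(size, remaining)
        chunks.append((start, start + size))
        start += size
    return chunks


static_plan = static_chunks(10, 2, 3)
dynamic_queue = dynamic_chunks(10, 3)
guided_sizes = [stop - start for start, stop in guided_chunks(100, 4, 5)]
print(static_plan)
print(dynamic_queue)
print(guided_sizes)
```

The guided sizes shrink from a quarter of the loop down to the minimum chunk, which is why that policy tolerates small chunk values better than the other two.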

Two further scheduler types are auto and runtime: with auto the choice is delegated to the compiler and the runtime system, while with runtime the scheduler is selected when the program runs, through the environment (the OMP_SCHEDULE variable). As might be expected, guided-scheduled threads work best with very small chunk sizes (with respect to the total number of iterations), as the scheduling algorithm is more efficient when it can control the pool of threads as a whole, while static and dynamic scheduling prefer a medium chunk size.

Beware that setting a fixed number of threads may reduce the total performance of the application. For this very reason the thread number in the main program has been left at its default value.

The test program partially emulates some computationally intensive routines of the target software; the main loop is composed of several mathematical functions that are known to stress the processor and require a long CPU time to be carried out. The code is reported in Appendix B.

Sequential program with OpenMP enhancements

In this first test the program is run with an increasing number of threads, going beyond the eight physical cores actually present. All three scheduling algorithms are evaluated. The value in the first column (one thread) may safely be taken as the reference for the program without OpenMP optimizations.

There is a large impact when a second thread is added (50% time reduction), after which the execution time tends asymptotically to a limit value, in agreement with Amdahl's Law. It is interesting to notice that the three schedulers perform in the same range of values and that the best performance is achieved in the region of 8-9 threads (given the eight-core machines used). Beyond this value all the schedulers, static and dynamic in particular, suffer from excessive context switches and interference from the operating system preemption mechanism.

Figure 6. Performance overview of an OpenMP threaded program

OpenMP schedulers performance

Having evaluated the performance with different numbers of threads, the three available schedulers are now compared; moreover, for each scheduler different orders of magnitude of chunk size are tested.

Static Scheduler

The static scheduler works as expected (Figure 7), showing a very good performance increase in the region of 7-8 threads with a chunk value of 10-100. It is interesting to notice that for very high chunk sizes OpenMP cannot reduce the execution time, and this holds for every type of scheduler; the reason lies in how OpenMP manages iterations: when the chunk is as large as the loop itself, all iterations are assigned to a single thread and therefore there is no benefit.

Figure 7. OpenMP static scheduler performance chart
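The effect of an oversized chunk can be reproduced with a small model. The sketch below is written in Python rather than Fortran and only imitates the round-robin dealing of chunks that the static schedule performs; `static_schedule` is a hypothetical helper, not part of OpenMP or of the test program.

```python
def static_schedule(n_iter, n_threads, chunk):
    """Return, for each thread, the list of iteration indices it executes
    when chunks of `chunk` iterations are dealt round-robin."""
    work = [[] for _ in range(n_threads)]
    for chunk_id, start in enumerate(range(0, n_iter, chunk)):
        owner = chunk_id % n_threads  # chunks go to threads in turn
        work[owner].extend(range(start, min(start + chunk, n_iter)))
    return work


# Medium chunk: the 16 iterations are spread evenly over 4 threads.
print([len(w) for w in static_schedule(16, 4, 2)])
# Chunk larger than the loop: a single thread receives everything.
print([len(w) for w in static_schedule(16, 4, 100)])
```

In the second call three of the four threads stay idle, which matches the absence of any speedup observed for very large chunk values.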

Dynamic Scheduler

Because of its dynamic behavior, the dynamic scheduler shows rather irregular results across configurations. For example, as shown in Figure 8, large chunks combined with a small number of threads even add overhead, while small chunks stay close to the average value regardless of the number of threads.

Even with this disparity, however, the best execution time reduction is achieved in the region of 7-9 threads by chunks of medium order.

Figure 8. OpenMP dynamic scheduler performance chart
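Part of this behavior can be illustrated with an idealized model in which every iteration costs the same and scheduling itself is free (both are assumptions of this sketch, so the overhead seen in the measurements does not appear). Each chunk is handed to the thread that becomes free first; `dynamic_makespan` is a hypothetical name.

```python
import heapq


def dynamic_makespan(n_iter, n_threads, chunk, cost=1.0):
    """Total time to finish n_iter iterations when chunks are handed, in
    order, to whichever thread becomes free first."""
    finish_times = [0.0] * n_threads
    heapq.heapify(finish_times)
    for start in range(0, n_iter, chunk):
        size = min(chunk, n_iter - start)
        earliest = heapq.heappop(finish_times)  # first thread to be free
        heapq.heappush(finish_times, earliest + size * cost)
    return max(finish_times)


print(dynamic_makespan(100, 3, 40))  # large chunks, few threads: unbalanced
print(dynamic_makespan(100, 3, 1))   # unit chunks: close to the ideal 100/3
```

Large chunks leave some threads waiting for the last one to finish, while very small chunks balance the load; in a real run the latter pay for it with more scheduling operations, which this model ignores.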

Guided Scheduler

The final scheduler presented here is the most straightforward and the best performing, thanks to the more advanced algorithm of guided scheduling. For a chunk size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads, decreasing to 1. For a chunk size with value k (greater than 1), the size of each chunk is determined in the same way, with the restriction that the chunks do not contain fewer than k iterations (except for the last chunk to be assigned, which may have fewer than k iterations) (7).

As anticipated, this algorithm works best with very small chunks, as it can then operate without interference, and always in the 8-9 threads region.

Figure 9. OpenMP guided scheduler performance chart
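The rule quoted from (7) can be turned into a short Python sketch. The proportionality constant is left to the implementation by that description; the sketch assumes it is 1 and rounds up, so actual OpenMP runtimes may produce different sizes. `guided_chunks` is a hypothetical name.

```python
import math


def guided_chunks(n_iter, n_threads, k=1):
    """Sequence of chunk sizes under the assumed rule
    size = max(k, ceil(unassigned / n_threads))."""
    sizes = []
    remaining = n_iter
    while remaining > 0:
        size = max(k, math.ceil(remaining / n_threads))
        size = min(size, remaining)  # the last chunk may be shorter than k
        sizes.append(size)
        remaining -= size
    return sizes


print(guided_chunks(20, 4))       # sizes shrink down to 1
print(guided_chunks(20, 4, k=3))  # sizes never drop below 3, except the last
```

A small k lets the chunk size keep shrinking towards the end of the loop, which is what allows the threads to finish at nearly the same time.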

OpenMP enhancement results

This last section summarizes the global results from the point of view of the scheduler. As reference value, the maximum execution time reduction has been selected for each chunk size of each scheduling algorithm; all these results come from the 7-9 threads region.

The test run shows that the best performing scheduler is the guided scheduler with a chunk size of the order of units, and for this reason it has been chosen as the default scheduler in all the OpenMP directives inserted.

Figure 10. OpenMP scheduler overview


3.2 Infiniband

Infiniband is the union of two competing transport designs, Next Generation I/O from Intel,

Microsoft and Sun, and Future I/O from Compaq, IBM and Hewlett-Packard. It has become

the de facto standard for high speed cluster interconnection, outperforming Ethernet in both

transfer rate and latency.

This technology implements a modern interconnection link using point-to-point bidirectional serial transfer, supporting several signaling rates. It is used in high-performance computing both for high-speed connection between processors and peripherals and for low-latency networking.

The standard transmission rate is 2.5 Gbit/s, but double and quad data rates currently achieve 5 Gbit/s and 10 Gbit/s respectively. Moreover, it is possible to aggregate links in groups of 4 or 12, enabling even higher transfer speeds (up to 120 Gbit/s). However, it is important to state that an encoding with information redundancy (8b/10b) is adopted for transmitted data: every 10 bits sent carry only 8 bits of useful information, reducing the useful data transmission rate. Table I summarizes the effective data rates of the various configurations.

Most notably, there is no standard programming interface for the device; only a set of functions (referred to as verbs) must be present, leaving the implementation to the vendors. The most commonly accepted implementation is provided by the OpenFabrics Alliance. Being a transport layer, Infiniband can carry many protocols, from TCP/IP to OpenIB (described in section 3.3.1).




useful data   Single Data Rate   Double Data Rate   Quad Data Rate
1X            2 Gbit/s           4 Gbit/s           8 Gbit/s
4X            8 Gbit/s           16 Gbit/s          32 Gbit/s
12X           24 Gbit/s          48 Gbit/s          96 Gbit/s

raw data      Single Data Rate   Double Data Rate   Quad Data Rate
1X            2.5 Gbit/s         5 Gbit/s           10 Gbit/s
4X            10 Gbit/s          20 Gbit/s          40 Gbit/s
12X           30 Gbit/s          60 Gbit/s          120 Gbit/s
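The figures in the table follow from two numbers given in the text: the 2.5 Gbit/s base signaling rate and the 8 useful bits carried by every 10 transmitted. A minimal Python sketch, with hypothetical function names, reproduces them.

```python
BASE_RATE_GBPS = 2.5  # single data rate, one lane


def raw_rate_gbps(lanes, rate_multiplier):
    """Raw signaling rate: lanes is 1, 4 or 12; rate_multiplier is
    1 (single), 2 (double) or 4 (quad data rate)."""
    return BASE_RATE_GBPS * lanes * rate_multiplier


def useful_rate_gbps(lanes, rate_multiplier):
    """Useful data rate: only 8 bits out of every 10 sent carry data."""
    return raw_rate_gbps(lanes, rate_multiplier) * 8 / 10


for lanes in (1, 4, 12):
    print(lanes, [useful_rate_gbps(lanes, m) for m in (1, 2, 4)])
```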

3.3 Distributed execution with MPI

MPI is a high-level, language-independent API used both for parallel computing and for one-to-one, one-to-many and many-to-many inter-process communication (IPC). It has become the de facto standard for process communication despite the lack of sponsorship by any association. It was originally developed by William Gropp and Ewing Lusk, among others.

This API is used in high-performance computing for its scalability, portability and performance, as it implements a distributed memory system based on message passing with very few directives. It usually resides on level 5 of the OSI model but, as there is no strict constraint on this point, there are many implementations that offer different transport, network and data link layers.

MPI is available for many programming languages including C, C++, FORTRAN and Java; sometimes implementations benefit from the host language, for example using object-oriented programming in C++ and Java, and from the hardware they run on. Among the most widespread libraries are OpenMPI, MPICH2 and MVAPICH2, which differ only in threading support, network availability (e.g. Ethernet or Infiniband) and hardware support.


3.3.1 MPI over Infiniband

One of the most widely used environments for MPI is Infiniband; thanks to its low latency, a small packet sent through the link incurs far less overhead than it would over Ethernet, for example. Setting up a distributed system of this kind requires additional software for managing the Infiniband subnet (OpenSM) and for handling the transport layer (OpenIB).

The modularity of MPI and Infiniband allows different configurations, and it is common to transmit packets with either Infiniband or a TCP/IP stack. This is possible because the transport layer of MPI is handled by two components (among others): the Point-to-Point Messaging Layer (PML) and the Byte Transfer Layer (BTL). The PML abstracts the communication mechanism with buffers, synchronization points and acknowledgment messages; the BTL, on the other hand, translates the byte messages into the network layer byte sequence. OpenIB is a BTL protocol for sending messages over Infiniband.

Subsequently the functions (or verbs) available in the Infiniband drivers are invoked and

control is moved from user space to kernel space, where the message is finally sent across the

network link.

This seemingly complex structure reduces code complexity and increases interoperability and maintainability across different implementations.


3.3.2 Benchmarks

As was done for OpenMP, some tests were also performed on the MPI installation and on the Infiniband setup, to check that the machine configuration was correct and that the devices were running at full speed. The program makes heavy use of the MPI_Send and MPI_Recv calls and uses a timing function with millisecond resolution. A warm-up phase (exchanging some messages between the nodes) proved necessary before any measurement, because the whole structure of MPI plus Infiniband must first be activated.

Single message over Infiniband with MPI

In this test the transfer time of messages over Infiniband with MPI directives is evaluated; message size increases quadratically and time is measured with millisecond precision. Data is displayed on a semi-logarithmic scale so that the whole curve can be shown.

Two different MPI implementations are compared, and it is possible to notice that OpenMPI outperforms MVAPICH for small and large quantities of data, but is slower for medium-sized messages. With MVAPICH it is not possible to send more than 2 GB of data, due to implementation limits; OpenMPI does not have this limit, but on the other hand it shows a start-up delay of about 3.5 seconds before programs start executing (not recorded in this test).

Other MPI implementations exist, most notably MPICH and LAM/MPI, from which MVAPICH and OpenMPI respectively derive, but they lack support for Infiniband; any packet transmitted would revert to plain TCP/IP.


Figure 11. Time v. size for a single message

Multiple messages over Infiniband with MPI

Using the same structure as above, this test measures time versus size with multiple messages (1024 messages exchanged for each tested size). The results are similar to the previous case.

Figure 12. Time v. size for 1024 consecutive messages

Latency

One final test was run to determine the expected latency in message passing; this was achieved by sending a 0-length packet using some of the data types available in MPI. However, due to the modularity of the MPI over Infiniband structure, the MPI initialization overhead must be removed: for this reason the same test is repeated both on a single machine and on the two machines, and the difference between the two timings is taken.

The latency measured with this method is 8 µs, which is compatible with the Infiniband board specifications. The complete table of results follows.



Test type     Time (µs)

Single node   26
Two nodes     34
Latency       8
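The latency row is obtained by subtraction: the single-node timing is taken as the software overhead of MPI, and what remains of the two-node timing is attributed to the link. A one-function Python sketch (with a hypothetical name) makes the arithmetic explicit.

```python
def link_latency_us(single_node_us, two_nodes_us):
    """Latency attributed to the interconnect, once the MPI software
    overhead measured on a single node has been removed."""
    return two_nodes_us - single_node_us


print(link_latency_us(26, 34))
```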


4.1 Overview

The target application is a suite of programs called Sally3d. It has been ported from a VMS system to standard FORTRAN, with a standard makefile in place of terminal scripts, and it can be compiled on any UNIX-based operating system.

The software is designed for electromagnetic field analysis and micromagnetic modeling of nanomagnets; for this purpose, magnetization dynamics in nanomagnets is described by the Landau-Lifshitz-Gilbert (LLG) equation, which governs the gyromagnetic precession of the magnetization vector field around the so-called micromagnetic effective field.

The effective field takes phenomenologically into account the interactions occurring in magnetic materials, both short-range (exchange, anisotropy) and long-range (magnetostatics, Zeeman). Magnetization dynamics in a ferromagnetic body is described by the following LLG equation:

$$\frac{\partial m}{\partial t} = -m \times \left( h_{\mathrm{eff}}[m] - \alpha \frac{\partial m}{\partial t} \right), \tag{4.1}$$

where $m = m(r, t)$ is the magnetization vector field normalized to the saturation magnetization $M_s$, time is measured in units of $(\gamma M_s)^{-1}$ ($\gamma$ is the absolute value of the gyromagnetic ratio), $\alpha$ is the dimensionless damping parameter, and $h_{\mathrm{eff}}[m(r, t)]$ is the effective field operator, which can be obtained as the variational derivative of the free energy functional:

$$h_{\mathrm{eff}}[m] = -\frac{\delta g_L[m]}{\delta m}, \tag{4.2}$$

$$g_L[m] = \frac{1}{V_\Omega} \int_\Omega \left[ \frac{l_{ex}^2}{2} |\nabla m|^2 - \frac{1}{2} h_m \cdot m + \varphi(m) - h_a \cdot m \right] dV, \tag{4.3}$$

where $\varphi(m)$ is the anisotropy energy density, $l_{ex} = \sqrt{2A / (\mu_0 M_s^2)}$ is the exchange length ($A$ is the exchange constant and $\mu_0$ the vacuum permeability), $h_m$ and $h_a$ are the demagnetizing and applied fields, respectively, and $V_\Omega$ is the body volume.

In addition, the homogeneous Neumann boundary condition $\partial m / \partial n = 0$ is imposed at the body surface. In order to obtain a spatially discretized version of Equation 4.1, a partition of the region $\Omega$ into $N$ cells $\Omega_k$ with volume $V_k$ is considered, and it is assumed that the cells are small enough that the vector fields $m(r, t)$ and $h_{\mathrm{eff}}[m(r, t)]$ can be considered spatially uniform within each cell. The symbols $m_k(t)$ and $h_{\mathrm{eff},k}$ denote the vectors associated with the generic k-th cell. Besides the cell vectors, the mesh vector $m = (m_1, \ldots, m_N)^T \in \mathbb{R}^{3N}$, containing the whole collection of cell vectors, is defined.

Now it is possible to write the discretized LLG equation in the following form, which consists of a system of ordinary differential equations:

$$\frac{d m_k}{dt} = -m_k \times h_{\mathrm{eff},k}[m] + \alpha\, m_k \times \frac{d m_k}{dt}, \tag{4.4}$$

where $m_k$ is the average magnetization of the k-th cell. It is worth noting that the effective field in the k-th cell depends on the magnetization of the whole cell collection due to the magnetostatic interaction, namely $h_{\mathrm{eff},k} = h_{\mathrm{eff},k}[m]$. The numerical solution of Equation 4.4 provides the time evolution of the magnetization.
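Equation 4.4 is implicit, because the time derivative appears on both sides. For a unit-length cell vector it can be solved algebraically for the derivative, which gives the explicit form (1 + α²) dm/dt = −m × h − α m × (m × h). The Python sketch below evaluates this explicit form for one cell and checks that it satisfies Equation 4.4. It is only an illustration for a single cell with a fixed field: all function names are hypothetical, and it is not the scheme used in the solver.

```python
def cross(a, b):
    """Cross product of two 3-component vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def llg_rhs(m, h, alpha):
    """Explicit dm/dt for a unit vector m in the effective field h."""
    mxh = cross(m, h)
    mxmxh = cross(m, mxh)
    scale = 1.0 / (1.0 + alpha * alpha)
    return tuple(-scale * (mxh[i] + alpha * mxmxh[i]) for i in range(3))


def implicit_residual(m, h, alpha):
    """dm/dt + m x h - alpha * m x dm/dt; zero when Equation 4.4 holds."""
    dm = llg_rhs(m, h, alpha)
    mxh = cross(m, h)
    mxdm = cross(m, dm)
    return tuple(dm[i] + mxh[i] - alpha * mxdm[i] for i in range(3))


m = (0.6, 0.0, 0.8)  # unit-length magnetization, arbitrary test values
h = (0.0, 1.0, 0.0)  # arbitrary effective field
print(llg_rhs(m, h, 0.02))
print(implicit_residual(m, h, 0.02))
```

The derivative is orthogonal to m, so the length of the magnetization vector is preserved by the continuous dynamics.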

4.2 Code Flowchart

The kernel of the micromagnetic solver integrates the spatially discretized LLG equation over time. At every time step, the next value of the magnetization vector is computed by collecting the different field elements; this operation is performed by the GILBERT routine and is reported in Figure 13. The equation is a nonlinear differential equation, whose solution is obtained through the Newton-Raphson approximation method; this is performed by the GINT function.

The section of code which has been parallelized and distributed (outlined in yellow in the next figure) implements the magnetostatic and anisotropic field solvers; the part that combines the different field elements has also been updated with OpenMP and MPI directives. This development scheme was chosen on the grounds that the real computational bottleneck lay mainly in the magnetostatic solver and partially in the anisotropic one.


Figure 13. Flowchart of the main functions implemented in the code


4.3 Test Case

In order to carefully analyze the performance of the program and to identify the possible

parallelization points, as well as to obtain useful data, a particular test was prepared. The test

case is the fourth standard problem of micromagnetics, proposed by Bob McMichael, Roger

Koch and Thomas Schrefl.

Quoting (8), the problem focuses on dynamic aspects of micromagnetic computations. The initial state is an equilibrium s-state (Figure 15), which is obtained after applying and slowly reducing to zero a saturating field along the [1,1,1] direction. Fields of magnitude sufficient to reverse the magnetization of the rectangle are applied to this initial state, and the time evolution of the magnetization is examined as the system moves towards equilibrium in the new fields. The problem will be run for two different applied fields.

At t = 0 one field will be applied to the equilibrium s-state: the field is composed of µ0Hx = −24.6 mT, µ0Hy = 4.3 mT, µ0Hz = 0.0 mT (corresponding to approximately 25 mT, directed 170 degrees counterclockwise from the positive x axis).
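The magnitude and direction quoted in parentheses follow directly from the two in-plane components. A short Python check, with a hypothetical function name:

```python
import math


def field_polar(hx_mt, hy_mt):
    """Magnitude (mT) and counterclockwise angle from the positive x axis
    (degrees) of an in-plane field given by its components in mT."""
    magnitude = math.hypot(hx_mt, hy_mt)
    angle = math.degrees(math.atan2(hy_mt, hx_mt))
    return magnitude, angle


magnitude, angle = field_polar(-24.6, 4.3)
print(round(magnitude, 2), round(angle, 1))
```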

Figure 14. Standard problem #4 representation


The problem was chosen so that resolving the dynamics should be easier for the 170 degree

applied field than for the 190 degree applied field. Preliminary simulations reveal that, in the

case of the field applied at 170 degrees, the magnetization in the center of the rectangle rotates

in the same direction as at the ends during reversal. In the 190 degree case, however, the center

rotates the opposite direction as the ends resulting in a more complicated reversal. The field

amplitudes were chosen to be about 1.5 times the coercivity in each case.

Figure 15. S state field representation

4.4 Profiling

Thanks to the standardization of the program code, it was possible to exploit the gprof utility, available in the GNU toolchain. This utility provides procedure-level timing information with reasonable resolution, as well as a complete call graph view for identifying the most computationally expensive functions.

According to the profiler, whose call graph is reported in Figure 16, the following functions were the most time consuming:


• calc_intmudua

• curledge and its caller calc_hdmg_tet

• calc_mudua

• campo_effettivo

Most of the software is composed of very small routines that are called with very high frequency, and are thus very difficult to optimize and to measure (in fact they are not even reported in profiler reports); only the functions noted above have an observable impact on the overall execution time.


4.5 Compiler optimizations

Once again, due to the porting operation that has been performed, several compiler opti-

mizations became available and were subsequently added in order to increase the throughput

of the program. Most of the additions have been chosen following official gcc documentation

and manual pages.

4.5.1 Native switch

The key to optimization lies in the native machine capabilities; to activate at once all the features of a given architecture and of a given processor, it is necessary to set -march=native. In this way all processor-specific instructions can be accessed and all floating point capabilities fully exploited, by selecting the right processor architecture and the available SSE flags. Moreover, the floating point instructions are specifically set to use any SSE extension (-mfpmath=sse, enabled by default).


Figure 16. Call graph scheme of the target software


A similar optimization is also achieved in the Intel FORTRAN Compiler with the -axS -xS flags.


4.5.2 Loop unrolling

Among the loop transformation techniques, loop unrolling has achieved wide success in compiler theory. Its goal is to increase the execution speed of the program at the expense of code size. Loop unrolling is performed by reducing (if not eliminating) the number of “end of loop” tests; in this way the number of jumps and conditional branches decreases and, the larger code size notwithstanding, the number of cache hits increases (in big caches).

This optimization is pulled in by the -O3 flag.
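The transformation itself can be shown by hand. The Python sketch below unrolls a summation loop by a factor of four; it only illustrates the shape of the rewritten loop (a compiler applies it to the generated machine code, and no speed gain should be expected from doing this in Python). The function names are hypothetical.

```python
def sum_squares_rolled(values):
    """Reference loop: one end-of-loop test per element."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_squares_unrolled(values):
    """Same computation unrolled by 4: one end-of-loop test every four
    elements, plus a short remainder loop for the leftover elements."""
    total = 0.0
    n = len(values)
    main = n - n % 4
    for i in range(0, main, 4):
        total += values[i] * values[i]
        total += values[i + 1] * values[i + 1]
        total += values[i + 2] * values[i + 2]
        total += values[i + 3] * values[i + 3]
    for i in range(main, n):  # remainder, fewer than 4 iterations
        total += values[i] * values[i]
    return total


data = [float(i) for i in range(10)]
print(sum_squares_rolled(data), sum_squares_unrolled(data))
```

The additions are performed in the same order in both versions, so the results are identical; the body is simply repeated, which is where the growth in code size comes from.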

4.5.3 IEEE compliance

Due to the highly mathematical nature of the software, the -ffast-math flag has been added: this flag activates a set of optimizations that allow some general speedups by discarding some return codes and by skipping some redundant operations (such as ignoring the sign of zero or not considering NaN and ±Inf values).

The main drawback of this optimization is that it is no longer possible to guarantee compliance with the IEEE, ISO and/or ANSI rules that specify arithmetic compatibility, exceptions and operand order in floating point operations.
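Why operand order matters can be seen with three ordinary double-precision numbers. The Python sketch below (Python floats are IEEE doubles) shows that regrouping a sum changes the last bit of the result, and that a negative zero is distinguishable from a positive one; these are the kinds of guarantees a compiler is allowed to disregard under -ffast-math. It says nothing about what the compiler actually does with any given expression.

```python
import math

a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # evaluation order written in the source
right = a + (b + c)  # a reordering that relaxed math rules would permit

print(left, right, left == right)

# The sign of zero is observable under IEEE rules.
negative_zero = -0.0
print(negative_zero == 0.0, math.copysign(1.0, negative_zero))
```

For a long accumulation such as those in the field solvers, differences of this kind can build up, which is the price paid for the extra speed.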

4.5.4 Library Striping

One final type of optimization has been inserted at linking time. The following options

try to decrease load time for library functions, modifying the executable header (ELF in this

context) and symbol handling (9). These options must be passed with the -Wl flag so that the

compiler can forward them to the linker.

More specifically, the -O1 switch works as follows: as symbols are inserted in the ELF header, they are stored in hash tables; the default configuration keeps the hash keys small, resolving collisions by string comparison. This optimization shifts the balance towards short hash chains, increasing hash key length and header size, but actually reducing symbol lookup time.



5.1 General Scheme

Analyzing the functions of 4.4 over several profiling sessions, a common pattern has been found.

Every function contained one or more loops, carrying out a large number of instructions over arrays and matrices. For this reason a general plan was decided, summed up in Figure 17.

As a first step, the standard sequential loop is parallelized to fully exploit all the eight cores each single machine can offer. By setting up proper lists of shared and private variables, the loop is divided among a given number of OpenMP threads, each of which carries out a portion of the iterations; as soon as a thread finishes its portion it is assigned a new one, until the whole loop is completed.

The second step in this strategy is to split the loop into two distinct and equal parts before exploiting OpenMP. Each part is submitted to a node of the cluster and executed separately; at the end of the loop, data is exchanged with MPI and merged, so that the two machines can continue working on complete arrays. Thanks to Infiniband, the latency of the exchanged data sets is reduced to a minimum.


Even though OpenMP requires few software modifications, some updates have been carried out in order to obtain the maximum possible throughput from the software, mainly by removing portions of redundant code.

Figure 17. Implementation scheme overview

It should be noted, however, that the software is not embarrassingly parallel; a number of modifications to the software were needed in order to apply parallelization and distributed computing. The synchronization object most used is the implicit blocking offered by the send() and recv() mechanism: since data is exchanged between the two machines symmetrically, neither machine can continue until the other is ready to exchange data. In other sections of the code, synchronization was achieved with native OpenMP directives, as shown in 5.3.4.

5.2 Hardware Support

The hardware selected for implementing the cluster consists of two computers, each supplied with:


• two quad core Intel Xeon E5420 running at 2.5 GHz frequency, with 6 MB of L2 cache;

• an Intel Server Board S5000PSLSATAR motherboard;

• 32 GB of ECC DDR2 RAM;

• one Infiniband card from Mellanox, model ConnectX IB MHGH28-XTC DDR HCA PCI-e

2.0 x8 Memory Free.

The two machines are connected together by an end-to-end Infiniband link, running at full speed as the cards are mounted in the PCI Express x16 v1.1 slot. The focus in building these computers has been to search for low-cost components that could enable high performance computing.


5.3 Applied Directives

In this section some example code is extracted from the source of the program and discussed.


5.3.1 MPI Layer

The following sections of code show some sample “header” and “epilogue” MPI code that splits the array and merges it back. The header part analyzes the rank variable, which differs for every node of the MPI cluster: inside the if clause the array range is defined by setting the start_INDEX and end_INDEX variables (which represent the beginning and the end of the range). So the first node works on the first half of the array and the second node on the second half, allowing both machines to operate on separate data subsets.

Some preprocessor directives have been inserted in order to maintain compatibility with non-MPI systems.


if (rank .eq. 0) then

start_INDEX = 1


else if (rank .eq. 1) then

start_INDEX = ( NEDGE/2 ) + 1




start_INDEX = 1



DO M=start_INDEX,end_INDEX
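The index arithmetic of the header can be restated in a few lines of Python. This is a sketch of the two-node split described in the text, using 1-based inclusive bounds as in the FORTRAN loop; `split_range` is a hypothetical helper and the treatment of an odd NEDGE (the extra element goes to the second node) is an assumption.

```python
def split_range(nedge, rank):
    """Return the 1-based inclusive (start, end) of the loop range
    handled by the node with the given rank (0 or 1)."""
    half = nedge // 2
    if rank == 0:
        return 1, half        # first half of the array
    return half + 1, nedge    # second half of the array


for rank in (0, 1):
    print(rank, split_range(10, rank))
```

Without MPI a single process would simply take the whole range from 1 to NEDGE, which is the role of the fallback branch selected by the preprocessor.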


After the loop has terminated, the array on which the iterations worked must be synchronized on both nodes; this is done with a pair of MPI_SEND and MPI_RECV calls. The rank variable is checked again to tell which portions of the array must be updated.


tag = 1


if (rank .eq. 0) then

dest = 1

source = 1


& source, tag, MPI_COMM_WORLD, stat, err)

call MPI_SEND(BINTMU, NEDGE/2, MPI_REAL8, dest, tag,


else if (rank .eq. 1) then

dest = 0

source = 0


& dest, tag, MPI_COMM_WORLD, err)

call MPI_RECV(BINTMU, NEDGE/2, MPI_REAL8, source, tag,


& MPI_COMM_WORLD, stat,err)
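The exchange pattern can be imitated on a single machine. In the Python sketch below two threads stand in for the two nodes and two queues stand in for the MPI channel; these substitutions, the helper name `run_two_nodes` and the dummy loop body are all assumptions of the illustration, and no MPI library is involved. As in the listing, rank 0 receives before sending and rank 1 sends before receiving, so the two sides never wait on each other at the same time.

```python
import queue
import threading


def run_two_nodes(nedge):
    """Each simulated node fills its half of the array, then the halves
    are exchanged so that both nodes end up with the complete array."""
    half = nedge // 2
    inbox = [queue.Queue(), queue.Queue()]  # inbox[r]: messages for rank r
    final = [None, None]

    def node(rank):
        other = 1 - rank
        lo, hi = (0, half) if rank == 0 else (half, nedge)
        data = [0.0] * nedge
        for i in range(lo, hi):
            data[i] = float(i + 1)  # stand-in for the real loop body
        mine = (lo, data[lo:hi])
        if rank == 0:
            start, values = inbox[rank].get()  # receive first (blocks)
            inbox[other].put(mine)             # then send
        else:
            inbox[other].put(mine)             # send first
            start, values = inbox[rank].get()  # then receive (blocks)
        data[start:start + len(values)] = values
        final[rank] = data

    threads = [threading.Thread(target=node, args=(r,)) for r in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return final


print(run_two_nodes(6))
```

The blocking receive is what provides the implicit synchronization: neither side proceeds past the exchange until the other half has arrived.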



5.3.2 DO directive

The DO directive is the most common in this configuration. It requires a list of shared and private variables: for the latter, a separate memory location is allocated for each thread.

Workload is distributed according to the selected scheduler, as described in 3.1.




DO I=start_INDEX,end_INDEX









5.3.3 REDUCTION directive

One of the possible benefits of parallelization is to exploit a mathematical property of operations such as addition and subtraction: since changing the order of the terms doesn't change the result, the reduction directive allows loop iterations to execute out of order and the final value to be computed at the end of the loop.

Without this directive the target variable could have suffered from various synchronization problems, as reading and writing a shared location from several threads doesn't guarantee a correct result.








DO K=1,3








Unfortunately this option is available for non-array variables only, so it has been applied only a few times.
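The idea behind the reduction can be sketched in Python: every worker accumulates a private partial result over its own share of the iterations, and the partial results are combined once at the end, so no shared variable is updated concurrently. The thread pool, the function names and the interleaved split of the data are choices made for this illustration, not a description of how OpenMP implements the clause.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(values):
    """Private accumulation performed by one worker."""
    total = 0.0
    for v in values:
        total += v
    return total


def reduce_sum(values, n_threads=4):
    """Sum `values` by combining per-worker partial sums."""
    slices = [values[t::n_threads] for t in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(partial_sum, slices))
    return sum(partials)  # final combination, done once


print(reduce_sum([float(i) for i in range(1, 101)]))
```

Because the terms are regrouped, a floating point reduction may differ in the last digits from the sequential sum; with the integer-valued data used here the result is exact.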

5.3.4 Avoiding data dependency

One of the main problems of OpenMP, and of parallel programming in general, is data dependency, which is usually resolved by modifying the algorithm structure or by means of synchronization constructs.

In order to avoid inserting a critical region (corresponding to a CRITICAL or ATOMIC OpenMP directive) for shared constructs, which could have negatively affected performance, an array with self data references has been converted into a matrix indexed by the working thread number; in this way the accesses to each array element are decoupled, as only one single thread can work on a given row at any time.


!$OMP& SHARED [...]

#ifdef _OPENMP

INUM_TH = omp_get_num_threads()



DO L=1,6





#ifdef _OPENMP

INUM = omp_get_thread_num()+1


INUM = 1








At the end of the operation, the original array is rebuilt with a simple loop over the number of generated threads (stored in INUM_TH).


#ifdef _OPENMP






#ifdef _OPENMP
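The technique of this section can be summarized with a Python sketch: instead of letting several threads add into the same array (which would require a critical region), each thread adds into its own row of a matrix, and the rows are summed into the final array afterwards. The names `scatter_add`, `targets` and `values`, and the way iterations are dealt to threads, are hypothetical and serve only the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter_add(targets, values, size, n_threads=4):
    """Accumulate values[j] into position targets[j] of an array of
    length `size`, using one private row per thread."""
    rows = [[0.0] * size for _ in range(n_threads)]

    def worker(tid):
        # Thread `tid` writes only to rows[tid]: no two threads ever
        # touch the same memory location.
        for j in range(tid, len(targets), n_threads):
            rows[tid][targets[j]] += values[j]

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(worker, range(n_threads)))

    # Rebuild the original array with a loop over the thread rows.
    return [sum(rows[t][i] for t in range(n_threads)) for i in range(size)]


print(scatter_add([0, 1, 0, 2, 1, 0], [1.0] * 6, size=3))
```

The cost of the approach is the extra memory for one row per thread and the final merge loop, which is usually small compared with the serialization a critical region would impose.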




5.4 Results

5.4.1 Reduced test case

During development the test case was run to understand whether the current implementation was providing good results. The simulation had a duration of only 8 ps and was composed of just 1000 elements (see Figure 14), but it was already possible to notice some good improvements to the software. Further work was done after these results were produced.

Table III summarizes the total execution time in seconds; in the table the label OMP stands for OpenMP, MPI for OpenMPI over Infiniband and OPT for optimizations, while for each field a * stands for enabled and a - for disabled.

The execution time of the software drops by 87.5% (from 1062 s to 133 s) going from the old configuration to the new optimized MPI over Infiniband plus OpenMP environment.

Not surprisingly, the most effective contribution comes from the optimizations: the ability to access all the SSE extensions, combined with loop unrolling (see 4.5), already adds some SIMD execution to the software.




OMP   MPI   OPT   seconds

*     *     *     133
*     *     -     400
*     -     *     186
*     -     -     487
-     *     *     200
-     *     -     792
-     -     *     246
-     -     -     1062
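The percentages discussed for this table can be recomputed from its rows. The Python sketch below uses only the timings reported above; the function names are hypothetical.

```python
def time_reduction(before_s, after_s):
    """Fraction of the original execution time that has been removed."""
    return (before_s - after_s) / before_s


def speedup(before_s, after_s):
    """How many times faster the new configuration runs."""
    return before_s / after_s


baseline = 1062  # no OpenMP, no MPI, no optimizations
print(round(100 * time_reduction(baseline, 133), 1), round(speedup(baseline, 133), 1))
# Optimizations alone (row "- - *").
print(round(100 * time_reduction(baseline, 246), 1), round(speedup(baseline, 246), 1))
```

The second line quantifies how much of the gain comes from the compiler optimizations alone.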

However, it is important to take into consideration what the targets of this project were. It is true that the code most demanding for the processor has been duly parallelized, but the software is composed of a high number of other functions that are either strictly serial or of very short duration. The sections that have been parallelized and distributed have received a speed boost, but the final software performance suffers from the presence of serial code and from the high number of small functions.

This also explains why the optimizations bring such an improvement, as they affect all the software without distinction. So a more sensible comparison can only be made if the optimization element is kept constant.

5.4.2 Final test case

With the analysis of the previous data, it was possible to understand what really needed to be measured and improved, so development continued focusing on the new ratio. In the end, when all the most computationally expensive functions had been addressed, it was possible to launch the final test case with the same characteristics as before and to obtain the following results.




OMP MPI seconds

* * 59
* - 129
- * 174
- - 249

The total speed improvement due to the OpenMP and MPI elements alone corresponds to a 76% reduction in execution time (from 249 s to 59 s). This is a very good result, because not only is it comparable to the speedup introduced by the optimizations, but it also outperforms the results obtained with the Intel FORTRAN compiler v10 (obtained through other tests) by roughly 23%.

Looking at the contribution of the single functions in more detail, it is possible to see the effect of OpenMP and MPI over Infiniband with no overhead from the other routines.

The following table shows the actual impact of the technologies used to increase the throughput of the software.




Function Name     Normal   OpenMP   MPI      OpenMP+MPI

calc_intmudua     24.5 s   4.7 s    14.4 s   2.8 s
calc_hdmg_tet     16.9 s   3.0 s    10.8 s   1.7 s
calc_mudua        12.1 s   1.9 s    7.0 s    1.1 s
campo_effettivo   17.7 s   4.5 s    9.9 s    2.3 s
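The speedup of each function under each configuration can be derived from the table by dividing the sequential time by the measured one. The Python sketch below does this for the four rows; the function names are written with underscores, which is an assumption about their original spelling, and `speedups` is a hypothetical helper.

```python
# Timings in seconds: (Normal, OpenMP, MPI, OpenMP+MPI), from the table.
timings = {
    "calc_intmudua": (24.5, 4.7, 14.4, 2.8),
    "calc_hdmg_tet": (16.9, 3.0, 10.8, 1.7),
    "calc_mudua": (12.1, 1.9, 7.0, 1.1),
    "campo_effettivo": (17.7, 4.5, 9.9, 2.3),
}


def speedups(normal, openmp, mpi, both):
    """Speedup factors relative to the sequential (Normal) timing."""
    return normal / openmp, normal / mpi, normal / both


for name, row in timings.items():
    print(name, [round(s, 2) for s in speedups(*row)])
```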

Looking at the OpenMP column, there is an aggressive reduction, by a factor of roughly 4x to 6x on the eight cores of a machine: this is a very good result, as it means that the code was able to exploit the available processors well, with little overhead and no synchronization problems.

As for MPI, on the other hand, the speed improvement approaches a factor of 2x (between 1.6x and 1.8x); this is sensible, as the code was split almost exactly in two, so the execution time is expected to drop to about half. It is interesting to notice that the effect carries over when merging MPI with OpenMP. Thanks to the Infiniband channel used, communication time is negligible, and so only the small MPI overhead can influence execution.


In this thesis it has been demonstrated that, to achieve the best results, a complete review of the software must be undertaken. Highly serialized software, written by closely following a mathematical model, must be reorganized to allow better parallelization.

However, there are technologies that can have a direct impact on performance, in particular OpenMP and MPI. With very little software modification and simple code analysis, it has been possible to introduce a significant improvement in the overall execution time. Furthermore, the standard, clean and stable environment of the GCC suite gave access to important optimization controls that improved performance where OpenMP or MPI had not been applied.

This project still shows significant room for improvement. First of all, algorithm optimizations are necessary to obtain high performance; secondly, it could be possible to take advantage of FORTRAN library functions for otherwise long routines, even more so for the high number of small operations repeated several times. Thirdly, software analysis must continue in order to extract precise timing information from profiling and to identify the other computationally expensive functions that could receive a significant improvement from the inclusion of OpenMP and MPI directives.

Finally, thanks to the high scalability of cluster systems, it should be fairly easy and very convenient to add new elements that can contribute to the computation; in fact it would be possible to connect more components to the cluster using an Infiniband switch, at the sole cost of some increased latency. Moreover, because the middleware relies on open standards, OpenMP and MPI, porting the software to other architectures and expanding its routines to use further nodes of the cluster should not be considered a complex task.



Appendix A


Introduced by Intel in its line of Pentium III processors, the Streaming SIMD Extensions (SSE) allow for SIMD (single instruction, multiple data) execution. While older processors could only process one data element per instruction, SIMD technology allows instructions to handle multiple data elements, making processing much quicker.
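As an illustration of one instruction handling several data elements, here is a minimal sketch in C that adds two float arrays four elements at a time with SSE intrinsics. The function name add_floats is hypothetical, and the scalar fallback is an addition of this sketch so that it also builds on targets without SSE.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* Adds two float arrays element by element: out[i] = x[i] + y[i]. */
void add_floats(const float *x, const float *y, float *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* One SSE addition works on four packed floats held in a 128-bit register. */
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);   /* unaligned load of 4 floats */
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(out + i, _mm_add_ps(vx, vy));
    }
#endif
    /* Scalar loop: handles the tail, or everything when SSE is unavailable. */
    for (; i < n; i++)
        out[i] = x[i] + y[i];
}
```

The unaligned load and store variants are used so that the caller does not have to guarantee 16-byte alignment of the arrays.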

SSE’s use of SIMD technology allows data processing in applications such as 3D graphics to benefit greatly from the availability of extended floating point registers. In contrast to the preceding MMX technology, SSE registers have an increased width, allowing more bits to be stored and processed per instruction. Initially eight new 128-bit registers known as XMM0 through XMM7 were added; SSE2 extends MMX instructions to operate on XMM registers, allowing the programmer to completely avoid the eight 64-bit MMX registers “aliased” on the original floating point register stack.

More precisely SSE2 adds new mathematical instructions for double-precision (64-bit) float-

ing point and also extends MMX instructions to operate on 128-bit XMM registers. SSE integer

instructions introduced with later extensions would still operate on 64-bit MMX registers be-

cause the new XMM registers require operating system support (this behavior changed only

with SSE4 onward). SSE2 enables the programmer to perform SIMD math of virtually any

type (from 8-bit integer to 64-bit float) entirely with the XMM vector-register file, without the

need to touch the obsolete MMX/FPU registers.
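The two SSE2 additions mentioned above, double-precision arithmetic and integer operations on XMM registers, can be sketched as follows. The function names are hypothetical, and the scalar loops are a fallback added by this sketch for targets without SSE2.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics: __m128d, __m128i, ... */
#endif

/* out[i] = x[i] * y[i], two 64-bit doubles per XMM register. */
void mul_doubles(const double *x, const double *y, double *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);
        __m128d vy = _mm_loadu_pd(y + i);
        _mm_storeu_pd(out + i, _mm_mul_pd(vx, vy));
    }
#endif
    for (; i < n; i++)          /* tail, or everything without SSE2 */
        out[i] = x[i] * y[i];
}

/* out[i] = x[i] + y[i], four 32-bit integers per XMM register. */
void add_int32(const int32_t *x, const int32_t *y, int32_t *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 4 <= n; i += 4) {
        __m128i vx = _mm_loadu_si128((const __m128i *)(x + i));
        __m128i vy = _mm_loadu_si128((const __m128i *)(y + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(vx, vy));
    }
#endif
    for (; i < n; i++)          /* tail, or everything without SSE2 */
        out[i] = x[i] + y[i];
}
```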


SSE3, SSSE3 and SSE4 are further revisions to the architecture and introduce new operating conditions (column access to registers), new instructions (that can act on 64-bit MMX or 128-bit XMM registers and simplify the implementation of DSP and 3D code) and conversion utilities that avoid pipeline stalls.

In a multi-tasking environment, the Streaming SIMD Extensions require support from the

operating system: the SIMD registers must be handled properly by the operating system’s

context switching code. When the system switches control from one process to another, the old

process’s SIMD registers must be saved away, and the saved values of the new process’s SIMD

registers must be loaded into the processor. The Pentium III processor prohibits programs from

using the Streaming SIMD Extensions unless the operating system tells the processor at system

startup time that it is aware of the SIMD registers, and will manage them properly.

Appendix B


The test program has been designed to simulate some computationally intensive routines of the target software; in the main loop many mathematical functions are executed over a set of arrays, without creating data dependencies between the iterations. Statistics are printed at the beginning and at the end of the program; the total execution time is obtained with the function gettimeofday(), and the loop is repeated ten times to obtain a more reliable average:

    /* Declarations and initialization of N, chunk and the arrays are omitted. */
    for (u = 1; u <= 32; u++) {            /* sweep over the number of threads */
        printf("%d threads\t", u);
        omp_set_num_threads(u);            /* assumed: this call is missing from the listing */
        totaltime = 0;

        for (t = 0; t < 10; t++) {         /* ten timed repetitions */
            gettimeofday (&timing_start, NULL);

            #pragma omp parallel for \
                shared (a, b, c, d, e, f, g, h, chunk) private (i) \
                schedule (guided, chunk)
            for (i = 0; i < N; i++) {
                c[i] = pow (sqrt (a[i] * b[i] / (b[i] + a[i]) ), 3);
                d[i] = sqrt (c[i] * b[i] / (c[i] + pow (a[i], 4) ) );
                e[i] = pow (pow (c[i], d[i]), pow (d[i], c[i]) );
                f[i] = sqrt (pow (a[i], c[i] + b[i]) );
                h[i] = a[i] * b[i] * c[i] * d[i] * e[i] * f[i] * g[i] * h[i];
            }

            gettimeofday (&timing_end, NULL);

            /* elapsed time of this repetition, in microseconds */
            totaltime += (timing_end.tv_sec - timing_start.tv_sec) * 1000000
                       + (timing_end.tv_usec - timing_start.tv_usec);
        }

        printf("%d\n", totaltime / 10);    /* average time per repetition */
    }
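The timing arithmetic of the listing can be isolated in a small helper. This is only a sketch: the names elapsed_usec and average_usec are hypothetical, and a 64-bit accumulator is used here (an assumption, since the type of totaltime is not shown) so that long runs cannot overflow.

```c
#include <stddef.h>
#include <sys/time.h>

/* Microseconds elapsed between two gettimeofday() samples. */
long long elapsed_usec(struct timeval start, struct timeval end)
{
    return (long long)(end.tv_sec - start.tv_sec) * 1000000LL
         + (long long)(end.tv_usec - start.tv_usec);
}

/* Runs work() `repeats` times (repeats > 0) and returns the average
   duration of one run in microseconds. */
long long average_usec(void (*work)(void), int repeats)
{
    struct timeval t0, t1;
    long long total = 0;

    for (int r = 0; r < repeats; r++) {
        gettimeofday(&t0, NULL);
        work();
        gettimeofday(&t1, NULL);
        total += elapsed_usec(t0, t1);
    }
    return total / repeats;
}
```

The microsecond difference may be negative when the seconds field has advanced; adding it to the scaled seconds difference still gives the correct total.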


Appendix C


The general OpenMP directive begins with !$OMP, which marks the start of an OpenMP construct; most directives wrap a block of code between an opening and a closing statement, such as:

!$OMP directive [clause ...]


!$OMP END directive

The first directive must be PARALLEL, which wraps the code section that must be executed in parallel, and it is closed by the corresponding END PARALLEL. This directive accepts any combination of the following clauses:

IF (condition) parallel execution is activated only when condition is met;

PRIVATE (list) list of private variables;

SHARED (list) list of shared variables;

DEFAULT (type) type of visibility for variables not listed before;

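FIRSTPRIVATE (list) list of private variables that are automatically initialized with the value the original variable had before the region;

REDUCTION (operator: list) performs an out-of-order reduction of kind operator on the variables in list: each thread accumulates a private copy, and the copies are combined when the region ends;

COPYIN (list) copies the master thread's values of the THREADPRIVATE variables in list to the other threads;

NUM_THREADS (num) statically sets the number of threads to generate.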
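The clauses listed above have direct counterparts in the C binding of OpenMP, where the directive is written as #pragma omp parallel and the block is delimited by braces instead of END PARALLEL. The following is a minimal sketch; the function scaled_sum, the threshold of 1000 elements and the limit of 4 threads are all hypothetical choices made for illustration.

```c
/* Returns scale * (data[0] + ... + data[n-1]). */
double scaled_sum(const double *data, int n, double scale)
{
    double total = 0.0;
    int i;

    /* IF: go parallel only for large inputs; NUM_THREADS: request 4 threads;
       DEFAULT(none): every variable must be listed explicitly;
       FIRSTPRIVATE: each thread starts with the caller's value of scale;
       REDUCTION: per-thread partial sums are added together at the end. */
    #pragma omp parallel if(n > 1000) num_threads(4) default(none) \
            shared(data, n) private(i) firstprivate(scale) reduction(+:total)
    {
        #pragma omp for
        for (i = 0; i < n; i++)
            total += scale * data[i];
    }
    return total;
}
```

With GCC the pragmas take effect when compiling with -fopenmp; without that flag they are ignored and the function runs serially with the same result.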

Another important OpenMP directive is DO, which specifies that the next loop can be executed in parallel by the thread team. Its clauses are:

SCHEDULE (type [, chunk]) describes how iterations of the loop are divided (in chunks) among the threads;

ORDERED performs iteration in order, sequential style;

PRIVATE (list) list of private variables;

FIRSTPRIVATE (list) list of private variables that are automatically initialized;

LASTPRIVATE (list) list of private variables whose value from the last iteration is copied back to the original variable when the loop ends;

SHARED (list) list of shared variables;

REDUCTION (operator | intrinsic : list) performs an out-of-order operation of kind operator (or intrinsic function) on the variable list;

COLLAPSE (n) collapses the n nested loops that follow into a single iteration space.
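In the C binding the DO directive is spelled for. A minimal sketch of a work-shared loop using SCHEDULE and LASTPRIVATE follows; the function name squares and its parameters are hypothetical.

```c
/* Fills out[i] = i * i for i in [0, n) and returns the value computed
   by the last iteration. Requires n > 0 and chunk > 0. */
long squares(long *out, int n, int chunk)
{
    long last = 0;
    int i;

    /* SCHEDULE(guided, chunk): iterations are handed out in shrinking
       blocks of at least `chunk` iterations.
       LASTPRIVATE: after the loop, `last` holds the value assigned in
       iteration n-1, whichever thread executed it. */
    #pragma omp parallel for schedule(guided, chunk) lastprivate(last)
    for (i = 0; i < n; i++) {
        last = (long)i * i;
        out[i] = last;
    }
    return last;
}
```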

Other parallelizing directives that don’t require any particular clause configuration are:

• SECTIONS: statically splits the code into sections which are assigned each to a single thread

in the pool;

• WORKSHARE: divides the execution of the enclosed code block into separate units of work;

• TASK: defines an explicit task, which may be executed by the encountering thread, or

deferred for execution by any other thread in the team.
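As an illustration of SECTIONS, the sketch below runs two independent scans of the same array, each in its own section. It uses the C binding of OpenMP; the function name min_max is hypothetical. WORKSHARE has no C counterpart and is not shown.

```c
/* Finds the minimum and maximum of v[0..n-1] (n > 0). */
void min_max(const int *v, int n, int *min_out, int *max_out)
{
    int lo = v[0], hi = v[0];

    #pragma omp parallel sections
    {
        #pragma omp section
        {   /* executed by one thread: only this section writes lo */
            for (int i = 1; i < n; i++)
                if (v[i] < lo) lo = v[i];
        }
        #pragma omp section
        {   /* may run concurrently on another thread: only it writes hi */
            for (int i = 1; i < n; i++)
                if (v[i] > hi) hi = v[i];
        }
    }
    *min_out = lo;
    *max_out = hi;
}
```

Since each section writes a different variable, no synchronization is needed between them.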


As for synchronization management, the following directives are available; they do not need any other clause either:

• MASTER: specifies a region of code that is executed only by the master thread;

• CRITICAL: identifies a critical region that only one thread at a time can enter;

• BARRIER: implements a synchronization point where execution is stopped until all threads are ready to continue;

• ATOMIC: defines a single-instruction critical region, in which a memory location is updated atomically by all the threads.
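The four synchronization directives can be combined as in the following sketch, written with the C binding of OpenMP. The function count_even is hypothetical; it counts the same events twice, once with ATOMIC and once with CRITICAL, only to show both forms side by side.

```c
/* Counts the even values in v[0..n-1] with two different protections. */
void count_even(const int *v, int n, int *atomic_count, int *critical_count)
{
    int a = 0, c = 0;

    #pragma omp parallel
    {
        /* nowait removes the implicit barrier at the end of the loop */
        #pragma omp for nowait
        for (int i = 0; i < n; i++) {
            if (v[i] % 2 == 0) {
                #pragma omp atomic
                a++;                 /* single atomic update */

                #pragma omp critical
                {
                    c++;             /* one thread at a time in this block */
                }
            }
        }

        /* wait until every thread has finished counting */
        #pragma omp barrier

        #pragma omp master
        {                            /* only the master thread stores the results */
            *atomic_count = a;
            *critical_count = c;
        }
    }
}
```

ATOMIC is usually the cheaper choice for a single update such as a counter, while CRITICAL can protect an arbitrary block of statements.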

Finally, it is possible to use some OpenMP-related functions to further adapt the software to a multiprogrammed system; this set of routines may be used for a variety of applications, such as obtaining information from single threads, configuring the number of threads, getting environment data (like the number of processors), locking variables, timing and so on. For example:


• OMP_SET_NUM_THREADS(): sets the number of threads that must be started;

• OMP_GET_NUM_THREADS(): returns the number of threads of the parallel region;

• OMP_GET_THREAD_NUM(): returns the number identifying a single thread in the pool;

• OMP_GET_THREAD_LIMIT(): returns the maximum number of OpenMP threads available to a program;

• OMP_GET_NUM_PROCS(): returns the number of processors that are available to the program;

• OMP_INIT_LOCK(): initializes a lock on the variable, setting the lock to “unset”;

• OMP_DESTROY_LOCK(): eliminates the lock on the variable;

• OMP_SET_LOCK(): sets the lock on the given variable, waiting until it becomes available;

• OMP_UNSET_LOCK(): unsets the lock on the given variable;

• OMP_TEST_LOCK(): tries to set the lock on the given variable without waiting, and reports whether it succeeded.
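The C binding exposes the same routines in lowercase through omp.h. The sketch below exercises the thread-query and lock routines; the function sum_thread_ids is hypothetical, and the serial stand-ins in the #else branch are an assumption of this sketch, added only so that it also builds when OpenMP is not enabled.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (not part of OpenMP) used when OpenMP is disabled. */
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t *l)    { *l = 0; }
static void omp_set_lock(omp_lock_t *l)     { *l = 1; }
static void omp_unset_lock(omp_lock_t *l)   { *l = 0; }
static void omp_destroy_lock(omp_lock_t *l) { (void)l; }
static int  omp_get_thread_num(void)        { return 0; }
static int  omp_get_num_threads(void)       { return 1; }
static void omp_set_num_threads(int n)      { (void)n; }
#endif

/* Every thread adds (its id + 1) to a shared total under a lock.
   For a team of T threads the result is T * (T + 1) / 2; the actual
   team size, which may be smaller than requested, goes to *team_size. */
int sum_thread_ids(int requested_threads, int *team_size)
{
    omp_lock_t lock;
    int total = 0;

    omp_set_num_threads(requested_threads);
    omp_init_lock(&lock);
    *team_size = 1;

    #pragma omp parallel
    {
        int id = omp_get_thread_num();      /* 0 .. T-1 */

        omp_set_lock(&lock);                /* blocks until the lock is free */
        total += id + 1;
        *team_size = omp_get_num_threads();
        omp_unset_lock(&lock);
    }

    omp_destroy_lock(&lock);
    return total;
}
```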


Appendix D


MPI routines are added to a standard FORTRAN program by including the mpif.h header file. After this, the MPI layer must be initialized with MPI_INIT() before using any MPI-related function, and it must be closed with MPI_FINALIZE() before ending the program.

By using:

call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)

the program becomes aware of running in an MPI environment, as these functions save the number (rank) of the current instance of the program in the rank variable, and the total number of instances created in numtasks, respectively.
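A common use of rank and numtasks is to decide which share of the work each instance performs. The sketch below shows one way to compute such a split in plain C; the function block_range and the block distribution it implements are assumptions made for illustration, not the scheme used by the target software, and no MPI call is involved.

```c
/* Computes the half-open range [*first, *last) of the n loop iterations
   assigned to instance `rank` out of `numtasks` instances.
   The ranges are contiguous and their sizes differ by at most one. */
void block_range(int rank, int numtasks, int n, int *first, int *last)
{
    int base  = n / numtasks;      /* iterations every instance receives */
    int extra = n % numtasks;      /* the first `extra` ranks get one more */

    *first = rank * base + (rank < extra ? rank : extra);
    *last  = *first + base + (rank < extra ? 1 : 0);
}
```

With two instances and ten iterations, rank 0 receives the range [0, 5) and rank 1 the range [5, 10).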

It is now possible to use the point-to-point communication routines, which are available in many variants (blocking, synchronous, non-blocking, buffered), all modeled on the following two:

call MPI_SEND(buffer, quantity, type, destination, tag, MPI_COMM_WORLD, ierr)

call MPI_RECV(buffer, quantity, type, source, tag, MPI_COMM_WORLD, stat, ierr)

The syntax of the other variants is similar. The arguments are:

buffer : represents either the data that has to be sent or the memory location in which the received data must be stored;

quantity : tells how much data of type is sent in the message;

type : sets one of the MPI data types for the transfer;

destination/source : the number of the instance of the program that has to receive or send the buffer, respectively;

tag : identifies the message number that must be sent or received;

MPI_COMM_WORLD : the predefined communicator, which includes all the instances of the program;

stat : represents the status of the transfer;

ierr : is the error variable in case communication fails.

MPI also allows for collective communication (a sort of “multicasting”) by means of functions such as:

• MPI_BCAST(): a message is sent to all the nodes;

• MPI_SCATTER(): a message is split and sent to all the nodes;

• MPI_GATHER(): a message is received from all the nodes;

• MPI_ALLTOALL(): each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group.

These functions require information about the data buffers of both the sender and the receiver.

In order to compile an MPI-enabled program, it is not possible to call the compiler directly; it is necessary to resort to the wrapper of the MPI distribution, which correctly sets paths and libraries. A special wrapper with its own syntax must also be used for launching executables.

In case of OpenMPI, the MPI implementation selected for this project, the compiler is

called mpif90 while the launching wrapper is mpirun; this software must be called specifying

the number of instances of the program to run (-np) and the list of hosts that have to execute

it (-host). So for example in a two-machine cluster environment in which each node has to

execute an instance of the program, the correct syntax is

$ mpirun -np 2 -host host1,host2 program [args]

It is possible to share some environment variables among the nodes with the -x switch; this is required for an OpenMP+MPI system as the number of threads depends on the value of OMP_NUM_THREADS. The resulting command line instruction is:

$ mpirun -np 2 -host host1,host2 -x OMP_NUM_THREADS program [args]



NAME: Vittorio Giovara

EDUCATION: B.Sc. equiv., Computer Engineering,

Politecnico di Torino, Turin, Italy, 2007

M.Sc. equiv., Computer Engineering,

Politecnico di Torino, Turin, Italy, 2009, under the advising of
professors Bartolomeo Montrucchio and Carlo Ragusa

Master of Science in Electrical and Computer Engineering, University

of Illinois at Chicago, Chicago, Illinois, 2009, under the advising of
professors Bartolomeo Montrucchio and Zhichun Zhu

HONORS: PROFICIENCY Certificate in English, Cambridge University,

Turin, Italy, 2004

BTP certification, XX Winter Olympics, TOBO, Turin, Italy, 2006

TOP-UIC Fellowship, Politecnico di Torino, Turin, Italy, 2008

PROFESSIONAL: Project manager for GLE-MiPS, a VHDL description for

processor architecture, focusing on the educational implementation,

Developer of Hedgewars, a strategy game, managing the Mac OS X

and iPhone versions, http://www.hedgewars.org

Editor and author for ProjectSymphony, a collection of academic

essays and homework reports publicly available,