
Department of Electrical and Computer Engineering

Faculty of Engineering

EC 4461 - Computer Structures


Laboratory 1 Report
Part 1:
a) Describe in a few sentences the purpose of the block of lines 31-34 and line 52 in the
code

Line 31 - MPI_Comm comm;

This declares comm as a variable of type MPI_Comm, i.e. a communicator. A communicator identifies the group of processes that may take part in communication, in the same way that declaring a variable as an int fixes its type.

Line 32 - MPI_Init (NULL, NULL);

This initializes the MPI execution environment; it must be called before any other MPI function. Its prototype is
int MPI_Init (int *argc, char ***argv)
▪ argc - Pointer to the number of command-line arguments
▪ argv - Pointer to the argument vector
Passing NULL for both is allowed when the command-line arguments are not needed.

Line 33 - comm = MPI_COMM_WORLD;

MPI_COMM_WORLD is the default communicator and represents all processes available at program startup. Assigning it to comm means that subsequent calls obtain process information from this communicator.

Line 34 - MPI_Comm_size (comm, &comm_sz);

Determines the size of the group associated with the communicator comm and stores it in the variable comm_sz.
▪ comm - Communicator
▪ comm_sz - Number of processes in the given communicator

Line 35 - MPI_Comm_rank (comm, &my_rank);

Determines the rank of the calling process within the communicator comm and stores it in the variable my_rank.
▪ comm - Communicator
▪ my_rank - Rank of the calling process in the given communicator

Line 52 - MPI_Finalize ();

Finalizes the MPI execution environment and releases the resources MPI was using; no MPI calls may be made after it.
b) Replace the commented pseudo code in line 68 of the code with a single MPI
call to broadcast the user supplied input number_of_tosses to all processes in the
communicator.

The arguments of the broadcast call are:

▪ number_of_tosses – The buffer, i.e. the user-supplied variable being broadcast
▪ 1 – Number of items sent (a single value, not an array)
▪ MPI_LONG_LONG – Data type of the buffer
▪ 0 – Rank of the root process (the sender)
▪ MPI_COMM_WORLD – Communicator
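Putting these arguments together, the single call is as follows (a sketch, assuming number_of_tosses is declared as long long in the lab code):

MPI_Bcast (&number_of_tosses, 1, MPI_LONG_LONG, 0, MPI_COMM_WORLD);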

c) Replace the comments in lines 81-82 in the code respectively by generating (uniformly distributed) random numbers x and y for the coordinates of the random dart tosses between -1 and +1. Now replace line 84 of the code to compute the distance of each dart from the center origin and store the result in the variable distance_squared. Finally replace line 88 of the code to increment the count if the darts fall within the unit circle. Hint: use random () and RAND_MAX, then scale and shift, and finally test and increment.
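A minimal sketch of the replaced lines, assuming x, y and distance_squared are declared as double and number_in_circle is the local hit counter:

/* lines 81-82: scale random() from [0, RAND_MAX] to [0, 2], then shift to [-1, +1] */
x = 2.0 * random () / (double) RAND_MAX - 1.0;
y = 2.0 * random () / (double) RAND_MAX - 1.0;
/* line 84: squared distance from the origin (no sqrt needed for the test) */
distance_squared = x * x + y * y;
/* line 88: the dart lands inside the unit circle when x^2 + y^2 <= 1 */
if (distance_squared <= 1.0) number_in_circle++;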

d) Replace the commented pseudo code in line 45 of the code with a single MPI call to compute the global sum of all 'successful' dart tosses, ensuring that the result is stored in the variable number_in_circle in process 0.
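A sketch of the call, assuming the per-process count is held in a separate variable (here hypothetically named local_in_circle) so that the global result can land in number_in_circle on process 0:

MPI_Reduce (&local_in_circle, &number_in_circle, 1, MPI_LONG_LONG, MPI_SUM, 0, comm);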
e) Compile and run your code with 10^9 tosses. Use four processes first. Then use one. Verify the results are correct. Is it faster with four? If so, why, if not, why not? (Consider how many physical cores, or hyper-threaded ones, your system has.) How many tosses do you need to obtain a stable estimate accurate to five decimal places?

Pi estimate with 4 processes = 3.141557

Pi estimate with 1 process = 3.141604

The run with four processes took less time, so increasing the number of processes decreased the execution time. The real value of Pi to eight decimal places is 3.14159265, so the estimate using four processes was accurate to four decimal places, while the estimate using a single process was accurate only to three decimal places. It was found that 2×10^10 tosses are needed to obtain a stable Pi estimate accurate to five decimal places.
Part 2:
a) Explain in detail how the function Global_sum coded at the bottom of the code file
works.
Specifically what are the variables partner, my_sum and bitmask used for?

The Global_sum function uses a tree-structured global sum to combine the randomly generated values held by the given number of processes. It takes four parameters:
- int my_int => The process's local value, an arbitrary positive integer
- int my_rank => The rank of the current process
- int comm_sz => The total number of processes
- MPI_Comm comm => The communicator

Within the function, my_sum holds the running partial sum, bitmask selects the current level of the tree (and hence the partner), and partner is the rank of the process exchanged with at that level.

E.g. calculation of the partner rank: consider a 4-process execution with my_rank equal to 2. The rank of the partner is 3, calculated by an XOR operation as below:
 my_rank = 2 (010)
 bitmask = 1 (001)
 partner = my_rank ^ bitmask
= 010 ^ 001
= 011
= 3

The function then checks whether my_rank is lower than the partner rank and whether the partner rank is lower than the total number of processes; if so, it executes an MPI_Recv, which takes the value sent by the partner and adds it to the my_sum variable, and then shifts the bitmask left for the next level of the tree. If instead my_rank is higher than the partner rank, the value of my_sum is sent to the partner process, which performs the summation.

b) Hence, replace the commented pseudo code in lines 76 and 81 each with one MPI call to receive
and send respectively so that received data is stored in recvtemp and sent data is from my_sum.
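A sketch of the whole function with the two calls in place, assuming the data is summed as int and using the variable names from the report (the actual lab code may differ in details):

int Global_sum (int my_int, int my_rank, int comm_sz, MPI_Comm comm) {
    int my_sum = my_int;
    int recvtemp;
    int bitmask = 1;
    while (bitmask < comm_sz) {
        int partner = my_rank ^ bitmask;
        if (my_rank < partner) {
            if (partner < comm_sz) {
                /* line 76: receive the partner's partial sum and fold it in */
                MPI_Recv (&recvtemp, 1, MPI_INT, partner, 0, comm, MPI_STATUS_IGNORE);
                my_sum += recvtemp;
            }
            bitmask <<= 1;   /* move up one level of the tree */
        } else {
            /* line 81: hand the partial sum to the partner and drop out */
            MPI_Send (&my_sum, 1, MPI_INT, partner, 0, comm);
            break;
        }
    }
    return my_sum;   /* the complete sum is valid on process 0 */
}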
c) Describe in a few sentences the purpose and execution of the if statement in lines 38-49 in
the code.

The if statement is executed only in the root process (my_rank == 0). Pointers are used to store and manage the addresses of dynamically allocated blocks of memory, which hold data objects or arrays of objects. If a fixed-size array were used instead, it might be too small or larger than required, so the memory is allocated dynamically: the int pointer all_ints receives the address returned by malloc, which allocates the requested number of bytes (sizeof (int) * comm_sz) and returns a pointer to the first byte of the allocated space. A gather call then makes each process send its my_int back to process 0, which stores all the summands in the array all_ints. All the values stored in all_ints are printed, the total is displayed using the return value of the Global_sum function, and finally the dynamically allocated memory is freed.

d) Hence, replace the commented pseudo code in lines 40 and 48 each with one MPI call so that we
can output our data. Do not use more than one MPI call for each comment line replaced.
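A sketch of the two calls in context, assuming the structure described in c) above and the variable names from the report; on non-root ranks the receive arguments of MPI_Gather are ignored, so NULL may be passed:

sum = Global_sum (my_int, my_rank, comm_sz, comm);   /* called by every process */
if (my_rank == 0) {
    int *all_ints = malloc (sizeof (int) * comm_sz);
    /* line 40: root collects one my_int from every process into all_ints */
    MPI_Gather (&my_int, 1, MPI_INT, all_ints, 1, MPI_INT, 0, comm);
    for (int i = 0; i < comm_sz; i++)
        printf ("%d ", all_ints[i]);
    printf ("\nSum = %d\n", sum);
    free (all_ints);
} else {
    /* line 48: matching call on the other ranks; receive arguments unused */
    MPI_Gather (&my_int, 1, MPI_INT, NULL, 0, MPI_INT, 0, comm);
}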

e) Compile and run your code with 7, 33 and 101 processes. Verify by other means the results
are correct.

The outputs for 7, 33 and 101 processes were compared with sums calculated by other means, and the results turned out to be the same.
f) Explain how you would use MPI with a single function call to compute this global sum. Only state
the code you would execute. There is no need to add this to the code but you could use this to
verify your results.

Here the MPI reduce function can be used, similar to Part 1: it takes the integer value held in each process, computes the sum in the root process and stores the final value in a buffer there.
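A sketch of the single call, assuming the result buffer on process 0 is an int named sum:

MPI_Reduce (&my_int, &sum, 1, MPI_INT, MPI_SUM, 0, comm);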
Part 3:
a) Explain briefly the input and output of the function Floor_log coded at the bottom of
the code file.

The Floor_log function takes comm_sz, the total number of processes belonging to the communicator, as its input. Using a while loop it computes the largest power of two that is less than or equal to the number of processes and returns that value (f1 in the code). The result is stored in the int variable named 'floor_log_p' and is used to drive the butterfly-structured global sum.
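A minimal sketch consistent with the description above (the names f1 and comm_sz are taken from the report; the actual lab code may differ, e.g. it may return the exponent rather than the power itself):

int Floor_log (int comm_sz) {
    int f1 = 1;
    while (2 * f1 <= comm_sz)
        f1 *= 2;        /* largest power of two <= comm_sz */
    return f1;
}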

b) Hence,
i. Replace the commented pseudo code in lines 81 and 85 each with one MPI
call respectively in order to send the value in my_sum to its partner and to
receive the value from the matching partner in recvtemp.

ii. Replace the commented pseudo code in line 94 with some MPI code making
the current process sends to and receive from its partner such that sent data is
from my_sum and received data is stored in recvtemp while ensuring that
your code will not deadlock.

iii. Replace the commented pseudo code in lines 102 and 106 each with one MPI
call respectively in order to receive the sent value in my_sum and send the
value my_sum to its partner.
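Possible replacements for i.-iii. above, assuming int data and the variable names my_sum, recvtemp and partner from the report:

/* i. lines 81 and 85: plain send and matching receive */
MPI_Send (&my_sum, 1, MPI_INT, partner, 0, comm);
MPI_Recv (&recvtemp, 1, MPI_INT, partner, 0, comm, MPI_STATUS_IGNORE);

/* ii. line 94: combined exchange; MPI_Sendrecv cannot deadlock the way
   two blocking sends posted against each other could */
MPI_Sendrecv (&my_sum, 1, MPI_INT, partner, 0,
              &recvtemp, 1, MPI_INT, partner, 0,
              comm, MPI_STATUS_IGNORE);

/* iii. lines 102 and 106: receive first, then send back */
MPI_Recv (&recvtemp, 1, MPI_INT, partner, 0, comm, MPI_STATUS_IGNORE);
MPI_Send (&my_sum, 1, MPI_INT, partner, 0, comm);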
c) Describe in a few sentences the purpose and execution of the if statement in lines 39-
55 in the code.

The if statement is executed only in the root process (my_rank == 0). Dynamically allocated memory is used through the int array pointers 'all_ints' and 'sum_proc', each sized according to 'comm_sz'. Two gather calls are then made: the first gathers each process's my_int and sends it back to process 0, which stores all the summands in the array all_ints; the second gathers each process's sum and sends it back to process 0, which stores each process's sum in the array sum_proc. After the 'Ints being summed' label all the gathered integer values are displayed, and after the 'Sums on the processes' label the per-process sums are displayed. Finally, all the dynamically allocated memory is freed.

d) Hence, replace the commented pseudo code in lines 41 and 53 and lines 46 and 54
each with one MPI call (but matching pairs) to output the data. Do not use more than
one MPI call for each comment line replaced.
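A sketch of the two matching gather pairs, under the same assumptions as in Part 2 (report variable names; receive arguments ignored on non-root ranks):

if (my_rank == 0) {
    int *all_ints = malloc (sizeof (int) * comm_sz);
    int *sum_proc = malloc (sizeof (int) * comm_sz);
    /* lines 41 and 46: root side of the two gathers */
    MPI_Gather (&my_int, 1, MPI_INT, all_ints, 1, MPI_INT, 0, comm);
    MPI_Gather (&sum, 1, MPI_INT, sum_proc, 1, MPI_INT, 0, comm);
    /* ... print all_ints and sum_proc, then free both arrays ... */
} else {
    /* lines 53 and 54: matching calls on the other ranks */
    MPI_Gather (&my_int, 1, MPI_INT, NULL, 0, MPI_INT, 0, comm);
    MPI_Gather (&sum, 1, MPI_INT, NULL, 0, MPI_INT, 0, comm);
}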
e) Compile and run your code with 7, 33 and 101 processes. Verify by other means the
results are correct.

The outputs for 7, 33 and 101 processes were compared with sums calculated by other means, and the results turned out to be the same.

f. Explain how you would use MPI with a single function call to compute this global sum
as a butterfly. Only state the code you would execute. There is no need to add this to
the code.

MPI_Allreduce can be used here: it sums the my_int variable from every process and, matching the butterfly structure, leaves the resulting total in the sum variable of every process rather than only in a single root.
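A sketch of the single call, under the same naming assumptions as before:

MPI_Allreduce (&my_int, &sum, 1, MPI_INT, MPI_SUM, comm);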
Part 4
a) Briefly explain the functionality of the C++ function clock () and macro
CLOCKS_PER_SEC, and how you would use these to take performance measurements.

The clock () function returns the processor time consumed by the program. The value returned is expressed in clock ticks, which are units of time of a constant but system-specific length, related by CLOCKS_PER_SEC clock ticks per second. The epoch used as a reference by clock varies between systems, but it is related to the program execution (generally its launch). To calculate the actual processing time of a program, the value returned by clock is compared to a value returned by a previous call to the same function.
The macro CLOCKS_PER_SEC expands to an expression representing the number of clock ticks per second, as returned by the function clock. Dividing a difference of clock-tick counts by this expression yields the number of seconds.
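A minimal sketch of such a measurement, assuming the section to be timed sits between the two calls:

#include <stdio.h>
#include <time.h>

int main (void) {
    clock_t start = clock ();
    /* ... code being timed ... */
    clock_t finish = clock ();
    /* convert the tick difference to seconds */
    double cpu_seconds = (double) (finish - start) / CLOCKS_PER_SEC;
    printf ("CPU time used: %f s\n", cpu_seconds);
    return 0;
}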

b) Similarly briefly explain the functionality of the MPI call MPI_Wtime () and how you would use it for measuring performance.

MPI_Wtime () returns the time in seconds since an arbitrary time in the past. The "time in the past" is guaranteed not to change during the life of the process. The user is responsible for converting large numbers of seconds to other units if they are preferred. The times returned are local to the node that called the function; there is no requirement that different nodes return "the same time."

#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main (int argc, char *argv[]) {
    double t1, t2;
    MPI_Init (&argc, &argv);
    t1 = MPI_Wtime ();
    sleep (1);                      /* work being timed */
    t2 = MPI_Wtime ();
    printf ("Elapsed time: %f s\n", t2 - t1);
    MPI_Finalize ();
    return 0;
}
c) Discuss briefly the difference between the two ways above for measuring
performance.

clock ()
The clock function measures CPU time, not real (wall-clock) time, which limits its usefulness here. On most implementations the resolution is poor, for example 1/100 of a second; CLOCKS_PER_SEC is only the scale, not the resolution. With typical values of CLOCKS_PER_SEC (UNIX standards require it to be 1 million, for example), clock will overflow in a matter of minutes on 32-bit systems, after which it returns -1, so it gives only a rough idea of the elapsed time. The returned value must be divided by CLOCKS_PER_SEC to obtain the result in seconds.

MPI_Wtime ()
MPI_Wtime gives the real (wall-clock) time elapsed on each processor and is intended to be high-resolution. The attribute MPI_WTIME_IS_GLOBAL indicates whether the clocks are synchronized across processes. The function is more portable than the clock () function explained earlier because it returns seconds, not "ticks", and it carries no unnecessary baggage. MPI_Wtime gives the "current time on this processor", which is quite different from CPU time: if the program sleeps for one minute, MPI_Wtime moves 60 seconds forward, whereas clock (except on Windows) would be unchanged.

d) Examine the code for the function ping-pong. Why does the if statement reverse the
order of the MPI send and receive calls for processes 0 and 1?

The ping operation involves sending a message from process 0 to process 1, and the pong operation involves sending a message back from process 1 to process 0. Therefore, if process 0 runs this section of the program, the message must first be sent from process 0 to process 1, while at the same time process 1 must receive the message sent by process 0. Similarly, if process 1 runs this section of the program, a message must be sent back from process 1 to process 0, while at the same time process 0 must receive the message sent by process 1. For these reasons the if statement reverses the order of the MPI send and receive calls for processes 0 and 1.
e) State what message is being sent for the ping-pong and what range of message sizes
are being timed.

Message - Composed of a set of A’s (ex: AAA…A)


Participants - From process 0 to process 1
During - Ping operation
Message - Composed of a set of B’s (ex: BBB…B)
Participants - From process 1 to process 0
During - Pong operation
Range - 0 - 131072

f) Hence replace the commented pseudo code in lines 120 and 129 each with one line of
code to calculate and return the value for the total elapsed time for the series of ping-
pongs with clock().

Given in the sketch below.
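A sketch of the clock ()-based timing (lines 120 and 129), assuming the series of ping-pongs sits between the two statements, start is a clock_t variable and elapsed is a double:

start = clock ();                                         /* line 120 */
/* ... series of ping-pongs ... */
elapsed = (double) (clock () - start) / CLOCKS_PER_SEC;   /* line 129 */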

g) Similarly replace the commented pseudo code in lines 118 and 127 each with one line of code to calculate and return the value of the total elapsed time for the series of ping-pongs with MPI_Wtime ().
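A sketch of the MPI_Wtime ()-based timing (lines 118 and 127), under the same assumptions with start declared as a double:

start = MPI_Wtime ();                /* line 118 */
/* ... series of ping-pongs ... */
elapsed = MPI_Wtime () - start;      /* line 127 */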
h) Compile and run your code for both methods. Measurements are produced for various message sizes. Now plot and briefly discuss your results. Are the results for clock () reliable? What about MPI_Wtime ()?

Measurements were taken first with CLOCK defined, then repeated with MPI_Wtime () defined. The values from both runs are tabulated below and plotted in the graph that follows.

Length of the message   Using clock () - Series 1 (s)   Using MPI_Wtime () - Series 2 (s)
0                       2.00E-03                        6.14E-03
1                       3.00E-03                        6.58E-03
2                       3.00E-03                        6.89E-03
4                       2.50E-03                        7.16E-03
8                       2.50E-03                        6.35E-03
16                      3.00E-03                        5.87E-03
32                      2.50E-03                        6.01E-03
64                      3.00E-03                        6.25E-03
128                     2.50E-03                        7.50E-03
256                     3.00E-03                        6.69E-03
512                     2.50E-03                        6.35E-03
1024                    3.00E-03                        5.59E-03
2048                    2.50E-03                        6.56E-03
4096                    3.00E-03                        6.15E-03
8192                    3.00E-03                        5.91E-03
16384                   3.00E-03                        6.53E-03
32768                   2.00E-03                        8.03E-03
65536                   9.50E-03                        1.92E-02
131072                  8.50E-03                        1.77E-02

[Figure: Lab 1 - Part 4. Elapsed time (s) against message length; Series 1 = clock (), Series 2 = MPI_Wtime ().]

The difference between the values from the two methods is due to the number of cores assigned to the virtual machine, which has only one of the CPU's two cores. It is therefore concluded that MPI_Wtime () is more reliable than clock ().
Part 5
a. Consider the MPI code provided which is functioning and simply prints one
message per process. Unfortunately, the code produces a slightly different output
each time it is run. Modify the code such that the same message as in the original
code is printed for each process but ensure that they appear in exactly the same order
whenever the code is run. The only constraint is to do so with the minimum possible
changes. Explain your solution.

Original line:
MPI_Recv (msg, MAX_STRING, MPI_CHAR, MPI_ANY_SOURCE, 0, comm, MPI_STATUS_IGNORE);

The fourth parameter of MPI_Recv () was changed from MPI_ANY_SOURCE to src. The changed line:
MPI_Recv (msg, MAX_STRING, MPI_CHAR, src, 0, comm, MPI_STATUS_IGNORE);

Initially the code was written to receive data from any process, in no particular order, by setting the 4th argument of the receive function to MPI_ANY_SOURCE. The code was changed to receive msg from the rank given by the src variable, which identifies the process to receive data from; src is driven by a for loop that increments it one by one from 1 up to comm_sz - 1, so that the messages are printed in ascending order of process rank.
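A sketch of the receiving side after the change, assuming a greeting-style program (the names msg, MAX_STRING and the message text are assumptions about the lab code):

if (my_rank != 0) {
    sprintf (msg, "Greetings from process %d of %d!", my_rank, comm_sz);
    MPI_Send (msg, strlen (msg) + 1, MPI_CHAR, 0, 0, comm);
} else {
    printf ("Greetings from process %d of %d!\n", my_rank, comm_sz);
    /* receive in fixed ascending order of src, not MPI_ANY_SOURCE */
    for (int src = 1; src < comm_sz; src++) {
        MPI_Recv (msg, MAX_STRING, MPI_CHAR, src, 0, comm, MPI_STATUS_IGNORE);
        printf ("%s\n", msg);
    }
}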
