
IPASJ International Journal of Computer Science (IIJCS)

A Publisher for Research Motivation ........

Volume 2, Issue 3, March 2014

Web Site: http://www.ipasj.org/IIJCS/IIJCS.htm Email: editoriijcs@ipasj.org ISSN 2321-5992

A Review on the Use of MPI in Parallel Algorithms


Gurhans Singh Randhawa, Anil Kumar

Dept. of CSE, Guru Nanak Dev University, Amritsar (Pb.), India

Abstract
This paper presents a literature review of parallel algorithms. Parallel computing is becoming more popular by the day and is now used successfully in many complex and time-consuming applications worldwide. It operates on the principle that large problems can often be divided into smaller ones, which are then solved concurrently to save time by taking advantage of non-local resources and overcoming memory constraints. One aim of such work is to form a cluster-oriented parallel computing architecture for MPI-based applications that demonstrates the performance gains and losses achieved through parallel processing with MPI. The main aim of this paper is to evaluate the existing research on parallel algorithms.

Keywords: Parallel algorithms, MPI, Data Parallelism, Functional Parallelism.

1. INTRODUCTION
1.1 Overview: The history of parallel computing can be traced back to a tablet dated around 100 BC [online available: www.cse.iitd.ernet.in]. The tablet had three calculating positions capable of operating simultaneously. From this we can infer that these multiple positions were aimed at providing either reliability or high-speed computation through parallelism. Parallel computing operates on the principle that large problems can be divided into smaller ones, which are then solved concurrently to save time by taking advantage of non-local resources and overcoming memory constraints.

Figure 1 Execution of Instructions in Parallel Computing

The figure shows that a computational problem can be solved using multiple processors. A problem is broken into discrete parts that can be solved concurrently, and each part is further broken down into a series of instructions. The instructions from each part then execute simultaneously on different processors. Different forms of parallelism are used in parallel computing:

Bit-level: Bit-level parallelism is a form of parallel computing based on increasing the processor word size [7]. Increasing the word size reduces the number of instructions the processor must execute in order to perform an operation on variables whose sizes are greater than the length of the word.

Instruction level: Instruction-level parallelism is a measure of how many of the operations in a computer program can be performed simultaneously [7]. The potential overlap among instructions is called instruction-level parallelism. There are two approaches to it, hardware and software: the hardware level exploits dynamic parallelism, whereas the software level relies on static parallelism.

Task parallelism: Task parallelism (also known as function parallelism and control parallelism) is a form of parallelization of computer code across multiple processors in parallel computing environments [7]. Task parallelism focuses on distributing execution processes (threads) across different parallel computing nodes.


Table 1 Difference between Data and Functional Parallelism

Data Parallelism:
a) The same operations are executed in parallel for the elements of large data structures, e.g. arrays.
b) Tasks are the operations on each individual element or on subsets of the elements.
c) Whether tasks are of the same length or of variable length depends on the application; quite a few applications have tasks of the same length.

Functional Parallelism:
a) Entirely different calculations can be performed concurrently on either the same or different data.
b) The tasks are usually specified via different functions or different code regions.
c) Tasks are of different lengths in most cases.
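As a rough illustration of this distinction (a sketch of our own, not code from the surveyed papers), the following C/OpenMP fragment applies the same operation to every element of an array (data parallelism) and then runs two unrelated computations concurrently in separate sections (functional parallelism):

#include <stdio.h>
#include <omp.h>

#define N 8

int main(void) {
    int a[N], sum = 0, max;

    /* Data parallelism: the same operation (squaring) is applied to every
       element, and the loop iterations are divided among the threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = i * i;
    }

    /* Functional parallelism: two entirely different computations (a sum
       and a maximum) run concurrently in separate sections. */
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            for (int i = 0; i < N; i++) sum += a[i];
        }
        #pragma omp section
        {
            max = a[0];
            for (int i = 1; i < N; i++) if (a[i] > max) max = a[i];
        }
    }

    printf("sum = %d, max = %d\n", sum, max);
    return 0;
}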

1.1.1 Libraries: Message-passing libraries have made it possible to map parallel algorithms onto parallel computing platforms in a portable way [8]. Parallel Virtual Machine (PVM) and the Message Passing Interface (MPI) have been the most successful of such libraries and are the most widely used tools for parallel programming; freely available versions of each exist. Parallel Virtual Machine is a software package that allows a heterogeneous collection of workstations to function as a single high-performance virtual machine; it runs a daemon on every computer that makes up the virtual machine. The Message Passing Interface focuses on message passing and can be used to write portable parallel programs.

1.1.2 Parallel vs. Distributed Computing: Parallel and distributed computing are easily confused. In parallel computing there is a shared memory system [online available: www.wavecoltech.com]: a multiprocessor architecture in which all processors share a single bus and memory unit. In distributed computing the memory system is distributed: each processor has its own memory and can also access the memory of other processors. It is a multicomputer architecture in which the processors are connected via a network. Distributed computing is the more complex system; parallel computing is comparatively simpler.
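For concreteness, a minimal MPI program (an illustrative sketch of our own, not code taken from the works reviewed here) has each process discover its rank and the size of the communicator:

/* Minimal MPI program: every process reports its rank.
   Typically compiled with an MPI wrapper compiler, e.g. mpicc hello.c -o hello,
   and launched with e.g. mpirun -np 4 ./hello. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut the runtime down */
    return 0;
}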

Figure 2 Distributed vs. Parallel Computing

1.2 Memory Architecture: The different kinds of memory architecture in parallel computing are as follows.

Shared Memory: Shared-memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as a global address space [online available: www.computing.llnl.gov]. Multiple processors can operate independently but share the same memory resources. Changes in a memory location effected by one processor are visible to all other processors. Shared-memory machines are classified as UMA and NUMA, based upon memory access times.

Figure 3 Shared Memory Architecture


Uniform memory access (UMA): It is represented by Symmetric Multiprocessor machines which have identical processors. All processors have equal access, and equal access times, to memory. Such a machine is sometimes called CC-UMA (Cache Coherent UMA), meaning that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.

Non-uniform memory access (NUMA): This system is made by physically linking two or more Symmetric Multiprocessors. One symmetric multiprocessor can directly access the memory of another, but the processors do not all have equal access times to all memories; memory access across the link is slower. If cache coherency is maintained, the system may also be called CC-NUMA (Cache Coherent NUMA).

1.2.1 Distributed Memory: Like shared-memory systems, distributed-memory systems vary widely but share a common characteristic [online available: www.computing.llnl.gov]: they require a communication network to connect inter-processor memory. The processors have their own local memory, and memory addresses in one processor do not map to another processor, so there is no concept of a global address space across all processors. Because each processor has its own local memory, it operates independently; changes it makes to its local memory have no effect on the memory of other processors, so the concept of cache coherency does not apply. When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility.

Figure 4 Distributed Memory Architecture
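Because there is no global address space, every transfer must be stated explicitly by the programmer. A hedged sketch of such an exchange between two processes (illustrative only, assuming at least two MPI processes are launched) is:

/* Sketch of explicit data movement in a distributed-memory system:
   rank 0 sends an array to rank 1, which receives it into its own
   local memory. Assumes at least two processes are launched. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, data[4] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < 4; i++) data[i] = i + 1;
        /* the programmer states explicitly what is sent and to whom */
        MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d %d %d %d\n",
               data[0], data[1], data[2], data[3]);
    }

    MPI_Finalize();
    return 0;
}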

1.2.2 Hybrid Memory: The largest and fastest computers in the world today employ both shared- and distributed-memory architectures; this type is known as a hybrid memory architecture [online available: www.computing.llnl.gov]. The shared-memory component can be a shared-memory machine and/or graphics processing units. The distributed-memory component is the networking of multiple shared-memory/GPU machines, which know only about their own memory, not the memory on another machine. Therefore, network communications are required to move data from one machine to another.
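A hybrid program therefore typically combines message passing between machines with shared-memory threading inside each machine. The following sketch is our own illustration (the requested thread-support level is an assumption; actual support varies by MPI implementation) and combines MPI with OpenMP:

/* Hybrid sketch: MPI between nodes, OpenMP threads inside each node. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    int provided, rank;

    /* MPI_THREAD_FUNNELED: only the thread that called MPI_Init_thread
       will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* shared memory inside the process, message passing between processes */
        printf("process %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}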

Figure 5 Hybrid Memory Architecture

1.3 Why Parallel Computing: The development of parallel computing is influenced by many factors [online available: www.cse.iitd.ernet.in]. The prominent among them include the following:
a) Computational requirements are ever increasing, both in scientific and in business computing. Technical computing problems that require high-speed computational power arise in the life sciences, aerospace, geographical information systems, mechanical design and analysis, etc.
b) Sequential architectures are reaching physical limitations, as they are constrained by the speed of light and the laws of thermodynamics. The speed at which sequential CPUs can operate is reaching a saturation point, and hence an alternative way to obtain high computational speed is to connect multiple CPUs.
c) Hardware improvements such as pipelining and superscalar execution are not scalable and require sophisticated compiler technology, which is difficult to develop.


d) Vector processing works well for some kinds of problems; it is suitable mainly for scientific problems and is not useful for other areas such as databases.
e) Significant developments in networking technology are paving the way for heterogeneous computing.

1.4 Existing Frameworks:

1.4.1 MPI: The Message Passing Interface specification defines both the syntax and semantics of a core message-passing library that is useful to a wide range of users and can be implemented on a wide range of massively parallel processor platforms [7]. In an MPI-based parallel computing architecture, the master-slave computing paradigm is commonly used. The master monitors the progress and reports the time taken to solve the problem, taking into account the time spent in breaking the problem into subtasks and combining the results, along with the communication delays. The slaves accept sub-problems from the master, find the solutions and send them back to the master.

1.4.2 PVM: The development of Parallel Virtual Machine started in the summer of 1989 at Oak Ridge National Laboratory (ORNL) [online available: www.netlib.org]. The main design objective was the notion of a "virtual machine": a set of heterogeneous hosts connected by a network that appears logically to the user as a single large parallel computer, or parallel virtual machine. PVM aimed to provide a portable heterogeneous environment for using clusters of machines, communicating over TCP/IP sockets, as a parallel computer. Because of its focus on socket-based communication between loosely coupled systems, PVM places a greater emphasis on providing a distributed computing environment and on handling communication failures.

1.4.3 MPI-2: MPI-2 added new functionality: dynamic process management, one-sided communication, cooperative I/O, C++ bindings, extended collective operations, and miscellaneous other features [7]. It is an improved version of MPI-1.

1.4.4 OpenMP: OpenMP has emerged as the standard for shared-memory parallel programming [7]. The OpenMP application program interface provides programmers with a simple way to develop parallel applications for shared-memory parallel computers. It is based on compiler directives and can reuse the serial code. An OpenMP program begins with a single thread, the master thread, which runs sequentially until the first parallel task is encountered. The master thread then creates a team of parallel threads to execute the parallel task. When the team threads complete the instructions in the parallel task, they synchronize and join, leaving only the master thread.
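A minimal sketch of this fork-join behaviour (illustrative only, not the OpenMP reference example) is:

/* Fork-join sketch: the master thread runs alone until the parallel
   region, where a team of threads is created; the threads synchronize
   at the end of the region and only the master continues. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("master thread executing the serial part\n");

    #pragma omp parallel            /* fork: a team of threads is created */
    {
        printf("thread %d working on the parallel task\n",
               omp_get_thread_num());
    }                               /* join: implicit barrier, team retires */

    printf("master thread continues after the join\n");
    return 0;
}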

Figure 6 OpenMP Architecture

1.4.5 MPICH2: MPICH2 includes an all-new implementation of both MPI-1 and MPI-2. In MPICH2 [7], the collective routines are significantly faster and have much lower communication overhead than in the classic MPI and MPICH versions.
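As a point of reference for what a collective routine does, the sketch below (our own illustration, not benchmark code from [7]) sums one partial value from every process at the root in a single MPI_Reduce call:

/* Each process contributes a partial value; MPI_Reduce combines them
   at the root process in a single collective call. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, partial, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    partial = rank + 1;   /* stand-in for a locally computed result */

    /* combine all partial values with a sum, leaving the answer at rank 0 */
    MPI_Reduce(&partial, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes = %d\n", size, total);

    MPI_Finalize();
    return 0;
}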

2. LITERATURE REVIEW
2.1 Related Work: Zheng et al. (2009) presented an approach based on idle time windows and a particle swarm optimization algorithm to solve the dynamic scheduling of multiple tasks for a hybrid flow-shop [1]. The idea of the idle time window is introduced, and then the dynamic updating rules for the sets of idle time windows are explained in detail. With the sets of idle time windows of machines as constraints, a mathematical model is presented for the dynamic scheduling of multiple tasks for a hybrid flow-shop, and the approach satisfies this scheduling problem.


Matsuoka et al. (2009) discussed that many applications will run both on CPUs and on accelerators, but with varying speed and power consumption characteristics, and proposed a task scheduling scheme that optimizes the overall energy consumption of the system [2]. The model captures the task arrangement in terms of the scheduling makespan and the energy consumed for each scheduling decision. It defines an acceleration factor to normalize the effect of acceleration for each task; the acceleration factor represents how much a task can be accelerated if it is executed on an accelerator rather than on a CPU.

Hoffmann et al. (2010) discussed that the shift to multicore processors demands efficient parallel programming on a diversity of architectures, including homogeneous and heterogeneous chip multiprocessors [3]. Task-parallel programming is one approach that maps well to chip multiprocessors: the programmer focuses on identifying parallel tasks within an application, while a runtime system takes care of managing, scheduling, and balancing the tasks among a number of processors or cores. Heterogeneous chip multiprocessors, such as the Cell Broadband Engine, present new challenges to task-parallel programming and the corresponding runtime systems. A new library is presented which is based on task pools for dynamic task scheduling and load balancing on Cell processors.

Arnold and Fettweis (2011) introduced a failure-aware dynamic task scheduling approach for unreliable heterogeneous MPSoCs [4]. Global and local errors are sporadically injected into the system, and two dynamic task scheduling modes are introduced to compensate for these errors, one for each error injection method. Error-free processing elements are favored while faulty ones are isolated. In case of an error, the erroneous task is detected and dynamically compensated to guarantee error-free execution. Different applications are used to prove the feasibility, and the failure-aware dynamic task scheduling approach assures error-free execution of all applications.

Girault et al. (2012) presented a new tri-criteria scheduling heuristic for scheduling data-flow graphs of operations onto parallel heterogeneous architectures [5]. The three criteria are: first, the minimization of the schedule length, crucial for real-time systems; second, the maximization of the system reliability, crucial for dependable systems; and third, the minimization of energy consumption, crucial for autonomous systems. The algorithm is a list scheduling heuristic and uses the active replication of operations to improve the reliability while minimizing the schedule length, the global system failure rate, and the power consumption.

Wang et al. (2012) discussed that smartphones face a grand challenge in extending their battery life to sustain an increasing level of processing demand while being subject to a miniaturized form factor [6]. Dynamic voltage scaling has emerged as a critical technique for power management by lowering the supply voltage and frequency of processors. The energy-aware dynamic task scheduling algorithm is used to minimize the total energy consumption of smartphones while satisfying stringent time constraints for applications. The algorithm uses the results from a static scheduling algorithm and aggressively reduces energy consumption on the fly; it can significantly reduce energy consumption for smartphones.

Kumar et al. (2013) discussed that parallel computing operates on the principle that large problems can often be divided into smaller ones, which are then solved concurrently to save time (wall clock time) by taking advantage of non-local resources and overcoming memory constraints [7]. The main aim is to form a cluster-oriented parallel computing architecture for Message Passing Interface based applications which demonstrates the performance gains and losses achieved through parallel processing using the Message Passing Interface. The parallel execution based on the Message Passing Interface demonstrates the time taken to solve the problem in serial execution and also the communication overhead involved in parallel computation.

Sagar et al. (2013) said that the architecture for the Message Passing Interface and the Parallel Virtual Machine is based on the master-slave computing paradigm [8]. The master monitors the progress and reports the time taken to solve the problem, taking into account the time spent in breaking the problem into sub-tasks and combining the results, along with the communication delays. The slaves accept sub-problems from the master, find the solutions and send them back to the master. The performance dependency of parallel and serial execution on RAM, using the Message Passing Interface and the Parallel Virtual Machine, shows the speed of the system.

Wang et al. (2013) said that with the wide-scale adoption of multicores in mainstream computing, parallel programs rarely execute in isolation and have to share the platform with other applications that compete for resources [9]. If the external workload is not considered when mapping a program, performance drops significantly. An

Volume 2 Issue 3 March 2014

Page 11

IPASJ International Journal of Computer Science (IIJCS)


A Publisher for Research Motivation ........

Volume 2, Issue 3, March 2014

Web Site: http://www.ipasj.org/IIJCS/IIJCS.htm Email: editoriijcs@ipasj.org ISSN 2321-5992

automatic approach that combines compile-time knowledge of the program with dynamic runtime workload information is used to determine the best adaptive mapping of programs to the available resources. This aims to maximize the performance of the target program with minimum impact on the external workloads.

2.2 Comparison

Ref. [7] Kumar et al.
Technique: Message Passing Interface
Feature: Computational time reduction. Serial execution is faster for smaller input sizes, whereas parallel execution is far better than serial execution when the size of the input is very large.

Ref. [8] Sagar et al.
Technique: Comparison of MPI and PVM using a cluster-based parallel computing architecture
Feature: The matrix multiplication problem is solved serially and in parallel under MPI and PVM, which shows that MPI is faster than PVM.

Ref. [3] Hoffmann et al.
Technique: Task parallel programming
Feature: Parallel processing elements may become a bottleneck in applications where granularity cannot be controlled easily; a task pool for dynamic task scheduling removes this problem by using a number of Synergistic Processing Elements (SPEs) in parallel.

Ref. [4] Arnold and Fettweis
Technique: Failure-aware dynamic task scheduling
Feature: Error-free execution, feasible. A system model was developed which allows an injection of global and local errors. The first simulates a disconnection of a processing element from the central scheduling unit; the second changes the content of the local memories of a processing element.

Ref. [5] Girault et al.
Technique: Scheduling of real-time embedded systems
Feature: A scheduling heuristic is designed to minimize the schedule length, the global system failure rate and the power consumption. It uses active replication of the operations and the data dependencies to increase the reliability, and uses dynamic voltage scaling to lower the power consumption.

Ref. [6] Wang et al.
Technique: Dynamic voltage scaling
Feature: Lowering energy consumption. An energy-aware dynamic task scheduling (EDTS) algorithm minimizes the total energy consumption of smartphones while satisfying stringent time constraints for applications.

Ref. [1] Zheng et al.
Technique: Idle time windows and a particle swarm optimization algorithm
Feature: Satisfies the dynamic scheduling of multiple tasks for a hybrid flow-shop. With sets of idle time windows as constraints, a mathematical programming model is presented for the dynamic scheduling of multiple tasks for a hybrid flow-shop; an approach based on the particle swarm optimization algorithm is then presented to solve the problem.

Ref. [9] Wang et al.
Technique: Adaptive mapping of parallelism
Feature: A predictive-modeling-based approach to determine the best mapping of an application in the presence of dynamic external workloads. This approach brings together the static compiler knowledge of the program and dynamic runtime information to reconfigure and optimize an application in a dynamic environment and increase its performance under external load.

Ref. [2] Matsuoka et al.
Technique: Power-aware dynamic task scheduling
Feature: The task scheduling, which can be either static or dynamic, uses a parameter called the acceleration factor that represents how much each task can be accelerated if it is executed on an accelerator rather than on a CPU. It optimizes the overall energy consumption of the system.

3. CONCLUSION AND FUTURE WORK


Optimizing algorithms is a critical issue for researchers. Even though we now have much faster computers, we demand more speed. As it is not possible to keep adding processing units to a given chip or circuit, researchers have turned to converting sequential algorithms into parallel ones to improve their speed, so optimizing the running time is the main motivation of this research work. In the near future we will focus on running serial algorithms in a parallel fashion to properly utilize the benefits of multicore systems. Since parallel computing comes with some potential overheads, we will also focus on reducing these overheads. It is known in advance that serial algorithms are faster than parallel ones when the job size is small, so we will try to find this threshold point in order to successfully achieve the benefits of parallel algorithms while preventing them from becoming the bottleneck of the system.

References
[1] Ling-Li, Zeng, Zou Feng-Xing, Gao Zheng, and Xu Xiao-Hong. "Dynamic scheduling of multi-task for hybrid flow-shop based on idle time windows." In Control and Decision Conference, 2009. CCDC'09. Chinese, pp. 5654-5658. IEEE, 2009.
[2] Hamano, Tomoaki, Toshio Endo, and Satoshi Matsuoka. "Power-aware dynamic task scheduling for heterogeneous accelerated clusters." In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pp. 1-8. IEEE, 2009.
[3] Hoffmann, Ralf, Andreas Prell, and Thomas Rauber. "Dynamic task scheduling and load balancing on cell processors." In Parallel, Distributed and Network-Based Processing (PDP), 2010 18th Euromicro International Conference on, pp. 205-212. IEEE, 2010.
[4] Arnold, Oliver, and Gerhard Fettweis. "Resilient dynamic task scheduling for unreliable heterogeneous MPSoCs." In Semiconductor Conference Dresden (SCD), 2011, pp. 1-4. IEEE, 2011.
[5] Assayad, Ismail, Alain Girault, and Hamoudi Kalla. "Scheduling of real-time embedded systems under reliability and power constraints." In Complex Systems (ICCS), 2012 International Conference on, pp. 1-6. IEEE, 2012.
[6] Qiu, Meikang, Zhi Chen, Laurence T. Yang, Xiao Qin, and Bin Wang. "Towards power-efficient smartphones by energy-aware dynamic task scheduling." In High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on, pp. 1466-1472. IEEE, 2012.
[7] Nanjesh, B. R., K. S. V. Kumar, C. K. Madhu, and G. H. Kumar. "MPI based cluster computing for performance evaluation of parallel applications." In Information & Communication Technologies (ICT), 2013 IEEE Conference on, pp. 1123-1128. IEEE, 2013.
[8] Sagar, Bharat Bhushan, and Nanjesh B. R. "Performance evaluation and comparison of MPI and PVM using a cluster based parallel computing architecture."
[9] Emani, Murali Krishna, Zheng Wang, and Michael F. P. O'Boyle. "Smart, adaptive mapping of parallelism in the presence of external workload." In Code Generation and Optimization (CGO), 2013 IEEE/ACM International Symposium on, pp. 1-10. IEEE, 2013.


[10] Elts, Ekaterina. "Comparative analysis of PVM and MPI for the development of physical applications on parallel clusters." Saint-Petersburg State University (2004).
[11] Introduction to Parallel Computing [Online Available: http://www.cse.iitd.ernet.in].
[12] Parallel Computer Memory Architecture [Online Available: http://www.computing.llnl.gov].
[13] History of Parallel Virtual Machine [Online Available: http://www.netlib.org].

AUTHOR
Gurhans Singh Randhawa received the B.Tech degree in Information Technology from Punjab Technical University and is pursuing the M.Tech degree in Computer Science & Engineering at Guru Nanak Dev University, Amritsar, Punjab (India).

Anil Kumar received the B.Tech and M.Tech degrees in Computer Science & Engineering and Information Technology, respectively, from Guru Nanak Dev University, Amritsar, Punjab (India), and is pursuing a Ph.D. in Parallel Computing. He has 14 years of teaching experience and is currently an Assistant Professor at GND University.

