
Cluster Computing 4, 145-156, 2001. © 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.

A Cluster Operating System Supporting Parallel Computing

School of Computing and Mathematics, Deakin University, Geelong, Victoria 3217, Australia

Abstract. The single factor limiting the harnessing of the enormous computing power of clusters for parallel computing is the lack of appropriate software. Present cluster operating systems are not built to support parallel computing: they do not provide services to manage parallelism. The cluster operating environments that are used to assist the execution of parallel applications do not provide support for both the Message Passing (MP) and Distributed Shared Memory (DSM) paradigms; these are only offered as separate components implemented at the user level as libraries and independent servers. Because of these poor operating systems, users must deal with the individual computers of a cluster rather than seeing the cluster as a single powerful computer: a Single System Image of the cluster is not offered to users. There is a need for an operating system for clusters. We claim and demonstrate that it is possible to develop a cluster operating system that is able to efficiently manage parallelism, support Message Passing and DSM, and offer a Single System Image. In order to substantiate this claim, the first version of a cluster operating system, called GENESIS, that manages parallelism and offers the Single System Image has been developed.

Keywords: operating systems, clusters, parallel processing, parallelism management, Single System Image

1. Introduction

The trend in parallel computing is to move away from highly specialised and expensive supercomputers to much cheaper, general purpose computer systems, called clusters, that consist of commodity components such as PCs or workstations connected by fast networks. Parallel processing requires the harnessing of, and synchronised access to, these distributed resources. Traditionally, this problem has been left to application programmers to solve, as little (to no) parallelism management support has been provided by the underlying operating systems. Managing the available parallelism means managing parallel processes and computational resources in order to achieve high performance, make programming and use of the parallel system easy, and use computational resources efficiently. Lack of parallelism management is one of the major obstacles to making parallel processing, in particular that performed on clusters, part of the computing mainstream [10,20,27].

Parallelism management in parallel programming has been neglected and left to the application programmer [10]. Currently, programmers must deal not only with programming the communication and coordination of parallel processes to achieve the correct execution of an application, but also with the problems of initiating and controlling the execution on the cluster [10]. Moreover, the currently available operating systems run on clusters do not offer the Single System Image: users do not see a cluster as a single powerful computer.

The Message Passing (MP) and DSM communication paradigms, which are used to develop parallel applications, have advantages and disadvantages. The former is fast but
This work was partly supported by ARC Grant 0504003157. Currently with Hewlett-Packard Laboratories, Palo Alto, California.

difficult to use; the latter is easy to use but demonstrates reduced performance. However, in the majority of execution environments, programmers of parallel applications do not have the opportunity to choose between these two paradigms. Furthermore, communication paradigms and the systems supporting them are treated independently of the operating system, rather than being incorporated into a comprehensive operating system, even though they manage system resources.

The first aim of this paper is to propose a new cluster operating system that: allows the achievement of high performance parallel processing on clusters; supports execution on a cluster of both message passing and shared memory based parallel applications; relieves the programmer of the error-prone and time-consuming work of allocating processes to computers, managing interprocess communication and synchronising processes; provides a Single System Image to all programmers; makes the whole cluster based parallel system easy to use; and allows the efficient use of resources. The second aim is to introduce and discuss the architecture and services of a cluster operating system, called GENESIS, that allow it to achieve the above specified goals. The third aim is to report on the performance of parallel processing of an application executed in the GENESIS environment, and to show how the Single System Image feature provides transparency and makes the whole operating system easy to use.

2. Towards an operating system for clusters

This section introduces the logical level architecture and services of an operating system that provides parallelism management, supports programmers by offering Message Passing and DSM interfaces, and offers a Single System Image, in particular transparency.

2.1. Operating system requirements

When designing a cluster operating system a number of important requirements must be met:

Performance: users must experience excellent performance from the system, comparable or close to the performance of supercomputers.

Ease of programming: programmers should be provided with the option of either Message Passing or DSM to be employed in their parallel applications. The chosen method should also be free of syntactic and semantic complexities.

Ease of use: programmers should be provided with an environment where parallel applications can be executed with little or no input regarding which computers to use, how to create/duplicate parallel processes and how to support process coordination.

Single System Image: the operating system should provide a Single System Image of the cluster so that it looks like a single, powerful computer to each user.

Transparency: programmers must be unaware of the location on which parallel applications are executed, and therefore unaware of the actual location of communicating parallel processes (developed using either Message Passing or DSM).

High availability: the operating system should be able to handle additions, removal and reorganisation of system resources automatically, as easily as in a centralised system, by adaptively setting up a virtual computer that changes dynamically.

2.2. Single System Image: parallel processing transparency

A cluster operating system must provide a Single System Image of the whole cluster to all users. This can be achieved if the concept of transparency [12] is employed. The following dimensions of parallel execution transparency should be offered:

Location transparency: the whole cluster looks like a single powerful computer rather than a set of connected computers.

Process relation transparency: parent/child relationships are maintained over local and remote computers of the virtual machine.

Execution transparency: workload is balanced dynamically and adaptively. When provided transparently, the role of the programmer is simplified and the performance benefits of load balancing are obtained.

Device transparency: access to devices (such as the screen, terminal, I/O ports and files) is global and location independent.

2.3. Services for parallelism management and Single System Image

A variety of services are required in order to support the execution of applications on clusters, to: achieve high performance of parallel processing of applications and use resources efficiently; make application programming easy by relieving the programmer of the burden of coding functions that deal with parallelism management and the execution environment; make the execution of parallel applications and the use of the parallel computing system easy; and offer a Single System Image of the computing environment to all programmers.

These services depend on the features of clusters, parallel processing attributes and the computational model of parallelism of an application. For both the SPMD and MPMD models of parallelism the needed services are as follows [10]: the establishment of a virtual machine; parallel process instantiation; allocation of processes to the computers of the virtual machine; data distribution; coordination of parallel processes; and dynamic load balancing of parallel processes. The following additional services are needed to support the MPMD model of parallelism [10]: instantiation of shared data on the computers of the virtual machine; initialisation of synchronisation variables; and synchronisation of the parallel processes of the application [13].

The provision of these services depends on services that are responsible for exploiting and hiding distribution (e.g., interprocess communication) and managing basic system resources: processes, processors and memory devices [12]. This leads to two major groups of services of the proposed operating system for clusters: distributed services for transparent communication and management of basic system resources; and parallelism management and Single System Image offer services (figure 1).
Some individual services of the cluster operating system must be provided in a distributed manner (e.g., management of basic system resources, allocation of processes to computers of a cluster), whereas other services could be provided in either a centralised or distributed manner (e.g., dynamic load balancing).

2.4. Architecture of the cluster operating system

Three basic logical levels have been proposed for an operating system for clusters: the management level, which is responsible for computer and communication resources and for hiding distribution; the parallelism management and Single System Image offer level, which is responsible for managing parallel processing and computational resources and providing complete transparency; and the programmer interface level, which is responsible for providing the Message Passing and DSM interfaces to programmers.

Figure 1. Logical architecture and services of a cluster operating system supporting parallel computing.

An operating system that supports parallel processing on, and offers a Single System Image of, a cluster can be developed using one of two main approaches: middleware, at the application level; and underware, at the kernel level. The middleware approach, used in systems such as PVM [3], Beowulf [24], NOW [1] and TreadMarks [19], has followed a line where communication and execution services are implemented outside the operating system and form a clearly separated environment, as shown in figure 2. The systems developed based on the middleware approach exploit a set of libraries, daemons and administrative commands, and provide either message passing or DSM programming environments. A number of serious problems exist with the middleware approach; in particular, most of the solutions used for DSM and PVM are performance driven, and little work has been done on making them programmer friendly.

Figure 2. Architecture commonly used in other systems.

Although building an environment on top of an existing operating system increases the portability of such systems, we claim here that these problems have occurred because the underlying operating systems were not specifically designed and built to support parallel processing over clusters. Taking into consideration the problems of the middleware architecture and the aims of our research, we propose to develop the needed services at the kernel level.

An operating system that is capable of managing parallelism for applications executing on clusters and offering a Single System Image of the whole cluster should [11]: possess the features of a distributed operating system, in order to deal with distributed resources and their management and to hide distribution [12]; and provide efficient services which allow the transparent and dynamic establishment of a virtual parallel computer, the instantiation of parallel child processes, the distribution of data and the coordination of execution. Thus, a cluster operating system that supports parallel processing is made up of an enhanced subset of the basic part of a distributed operating system, a parallelism management and Single System Image system, and a programming and execution environment, as shown in figure 3.

3. The architecture of the GENESIS system

The logical structure of GENESIS follows the structure shown in section 2.4. The GENESIS architecture, including the microkernel and the relevant kernel servers, is shown in figure 4. The GENESIS system comprises the following servers (called managers in the GENESIS system):

Resource Discovery Manager is responsible for the establishment of the virtual machine upon a cluster, based on information about idle and/or lightly loaded computers and their resources (processor model, memory size, etc.);



Space Manager coordinates and manages logical regions of memory resources (called Spaces) of a computer, which includes protection and the sharing of memory through DSM;

Process Manager manages the processes that are created in GENESIS. In particular, it manipulates the process queues and deals with parent processes waiting for child processes to exit. It cooperates with other kernel servers, for instance with the Migration Manager, to transfer a process state during migration, and with the Execution Manager, to set up a process state when a process is created or duplicated;

Network Manager provides an implementation of the Reliable Datagram Protocol (RRDP) and the IP protocol, used by the IPC Manager to reliably transmit messages to remote computers;
Figure 3. The logical structure of the cluster operating system.

Global Scheduler, currently implemented as a centralised server, is responsible for the placement of processes, on creation or migration, on the computers that make up the GENESIS virtual parallel machine;

Execution Manager creates a process from a file and duplicates a process, either heavy-weight or medium-weight;

Migration Manager coordinates the relocation of either an active process or a set of processes on one computer to another computer or a set of computers, respectively, in the GENESIS virtual parallel computer;

Interprocess Communication (IPC) Manager is responsible, when a message is addressed either to a process that does not exist locally or to a group of processes, for the delivery of that message to these processes;

File/Cache Manager manages files of the GENESIS system and applications; and

Drivers provide a transparent unified interface to physical devices such as serial ports, keyboards, video screens and disks.

The following section describes the individual services and servers of GENESIS which provide parallelism management and offer the Single System Image. The kernel servers and services are presented in [7].

4. Parallelism management and Single System Image servers and services

This section presents the GENESIS servers and services that support parallelism management and allow the system to offer the Single System Image of a cluster used to run parallel applications, and shows how parallelism management is effectively and transparently provided by the operating system.

Figure 4. The GENESIS architecture.

4.1. Establishment of a virtual machine

The Resource Discovery Manager [22] plays a key role in the establishment of the virtual machine upon a cluster. This server identifies idle and/or lightly loaded computers and their resources (processor model, memory size, etc.). Furthermore, the Resource Discovery Manager of each computer identified as a potential component of the virtual machine collects (in cooperation with the Data Collection Manager) both computational load and communication patterns for each process (if any) executing on a given computer. This server provides the Global Scheduler with event based summaries of computational load, communication patterns and computer resources. This information can be provided per process, per computer, or averaged over an entire cluster domain.

The GENESIS virtual machine changes dynamically in time as some computers become overloaded and cannot be



used as a part of the execution environment for a given parallel application; become idle or lightly loaded and can become a component of the virtual machine; or may be out of order for an unspecified period of time.

4.2. Global Scheduling

The role of the Global Scheduler in the GENESIS system is that of a high level decision making server, which embodies the policy level decisions of which processes should be mapped to which computers. The Global Scheduler therefore relies on the mechanisms of the single, multiple and group process creation and duplication services, supported by single or group process migration, to enact the mapping decisions made.

The GENESIS Global Scheduler maps parallel processes to computers of a virtual machine. This server combines static allocation and dynamic load balancing components which allow the system: to provide initial mapping by finding the best locations, at a given time, for parallel processes of the application to be created remotely or for locally created processes to be moved to selected computers; to react to large fluctuations in system load (using dynamic load balancing); and to handle the case when system load remains steadily high (static allocation) when new processes have to be created.

The GENESIS system ensures that the location of the child process(es) is hidden from the parent process and users, and that all interactions (communication and coordination) between the parent and child process(es) are completely transparent. The Global Scheduler also ensures that the load remains balanced during the execution of the parallel application by providing dynamic load balancing services, which employ process migration to move executing processes from overloaded computers to idle or lightly loaded computers.

4.3. GENESIS PVM

GENESIS provides transparent communication services of standard Message Passing and PVM as its integral components.
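The paper does not give the Global Scheduler's concrete placement algorithm at this level of description. As an illustration only (hypothetical function names, not GENESIS code), a scheduler combining the two components described above might place new processes on the least-loaded computer and trigger migration only when load fluctuations become large:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch of a placement policy consistent with the
 * description above: static allocation picks the least-loaded
 * computer, and dynamic load balancing reacts only when the spread
 * between the most- and least-loaded computers exceeds a threshold.
 * All names are hypothetical, not GENESIS interfaces. */

/* Static allocation: index of the least-loaded computer. */
static size_t least_loaded(const double load[], size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (load[i] < load[best])
            best = i;
    return best;
}

/* Dynamic load balancing trigger: migrate only on large fluctuations. */
static int should_migrate(const double load[], size_t n, double threshold) {
    size_t lo = least_loaded(load, n), hi = 0;
    for (size_t i = 1; i < n; i++)
        if (load[i] > load[hi])
            hi = i;
    return load[hi] - load[lo] > threshold;
}
```

A real scheduler would also weigh communication patterns and computer resources, as the Resource Discovery Manager's summaries allow; this sketch shows only the load component.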
PVM has been ported to GENESIS as it allows the exploitation of an advanced message passing based parallel environment [25]. The PVM communication is transparently provided by a service that is simply a mapping of the standard PVM services onto the GENESIS communication services, and it benefits from additional services which are not provided by operating systems such as Unix or Windows. The functionality that the PVM server provides in Unix systems has been effectively substituted with services provided by GENESIS [25], as shown in figure 5. In this PVM server-free environment, PVM processes communicate directly with each other, significantly improving the performance of IPC. Processes are managed across the entire cluster, with newly booted computers automatically exploited and heavily loaded computers being removed from the virtual machine. This is much better than Unix PVM, where the user is responsible for selecting the individual computers that make up the virtual machine.

Figure 5. Architecture of PVM on GENESIS.

Removing the server from the PVM model also improves the reliability of PVM applications. Under Unix PVM, each task depends on the local server for all interactions with other tasks. If a server were to crash, the tasks under its control would not be able to continue execution independently. Under GENESIS PVM, this dependency does not exist. This vulnerability in Unix PVM extends further: if a slave server loses contact with the master server, it shuts down, terminating all of the tasks under its control. This event cannot occur in GENESIS PVM.

4.4. Distributed shared memory

We decided to embody DSM within the operating system in order to create a transparent, easy to use and easy to program environment, and to achieve high execution performance of parallel applications. Since DSM is essentially a memory management function, the Space Manager is the server into which the DSM system was integrated. This implies that the programmer is able to use the shared memory as though it were physically shared; hence, the transparency requirement is met. Furthermore, because the DSM system is in the operating system itself and is able to use the low level operating system functions, the efficiency requirement can be met.

In order to support memory sharing in a cluster, which employs message passing to allow processes to communicate, the DSM system is supported by the IPC Manager. However, the support provided to the DSM system by this server is invisible to application programmers. Furthermore, because DSM parallel processes must be properly managed, including their creation, synchronisation when sharing a memory object, and co-ordination of their execution, the Process Manager supports DSM system activities. The



placement of the DSM system in GENESIS and its interaction with the basic servers are shown in figure 6.

Figure 6. DSM system integrated into the Space Manager in GENESIS.

When an application using DSM starts to execute, the parent process initialises the DSM system with a single primitive function. This function creates a shared memory region and a set of processes. The latter operation is performed by the Execution Managers of remote computers, selected by the Global Scheduler based on the system load information from the Resource Discovery Manager.

The granularity of the shared memory object is an important issue in the design of a DSM system. As the memory unit of the GENESIS space is a page, it follows that the most appropriate object of sharing for the DSM system is a page.

4.4.1. Memory consistency and synchronisation

The GENESIS DSM system employs the release consistency model (the memory is made consistent only when a critical region is exited), which is implemented using the write-update model [26]. Synchronisation of processes that share memory takes the form of semaphore type synchronisation for mutual exclusion. The semaphore is owned by the Space Manager on a particular computer, which implies that gaining ownership of the semaphore is still mutually exclusive when more than one DSM process exists on the same computer.

Barriers are used in GENESIS to co-ordinate executing processes. Processes block at a barrier until all processes have reached the same barrier; the processes then all continue execution. Barriers are also controlled by the Space Manager, but the management of the barriers is centralised on one of the computers in the cluster.

4.4.2. Barriers in update-based DSM in GENESIS

In GENESIS, the barriers take the form of an array of their names. In code written for DSM, as in code written for execution on any shared memory system, a barrier is required at the start and end of execution. Barriers can also be used throughout the execution whenever it is necessary that processes have completed a particular phase of the computation before the start of the next phase. The GENESIS barriers not only synchronise the processes but also serve as points for memory update operations. Entry to a barrier has the same effect on the shared memory as a signal(sem[0]), because the first operation carried out by the barrier primitive is to make the shared memory consistent. Exiting from a barrier has the same effect as a wait(sem[0]), because the protection on the shared memory is changed to read-only.

4.5. GENESIS process creation

To design a process creation service which allows the achievement of the aims of this research (see section 1), a number of requirements need to be addressed, including:

Multiple creation of processes: it must be possible to concurrently create many instances of a process on a single computer and/or over many computers;

Scalability: the proposed service, in order to take full advantage of available parallelism, must be scalable to many computers; and

Complete transparency: the proposed service must hide from the user (programmer) the location of all resources, processes and management elements.

The GENESIS concurrent process creation service supports three forms of process creation: Single, Multiple and Group [16].

4.5.1. Single and multiple process creation

The single process creation service is similar to the services found within traditional systems supporting parallel processes, and requires the executable image to be downloaded from disk for each parallel process to be created. To achieve this, the Execution Manager, after it has received the process creation request, approaches the Global Scheduler and in response receives the location of the computer on which the process should be created. To create n processes, this operation must be repeated n times.

The multiple process creation service supports the concurrent instantiation of a number of processes on a given computer through one creation call. This service combines memory sharing services to improve memory utilisation and to decrease the overall creation time. Although this satisfies the first requirement of being able to create multiple instances of a process, when many computers are involved in the multiple process creation, each computer must be addressed in a sequential manner. Therefore, the executable image of the parallel child process must be downloaded separately for each computer involved in the multiple process creation. This does not satisfy the scalability requirement.

4.5.2. Group process creation

Group process creation combines multiple process creation and group communication to allow multiple parallel processes to be concurrently created on many computers, with a single executable image downloaded from the file server using group communication. In this case, the Global Scheduler provides n locations where processes should be created concurrently. Therefore, the time to download the image for n computers is the same as that required for one computer, and on each individual computer the multiple process creation method (employing memory sharing techniques) is used to improve memory utilisation.

4.6. GENESIS process duplication

In GENESIS, parallel processes can also be instantiated on the selected computers of the virtual machine by employing process duplication, supported by process migration.

4.6.1. Single local and remote process duplication

Duplication of a process is invoked when the Execution Manager receives a twin request from a parent process. As with the process creation mechanism, the Execution Manager notifies the Global Scheduler of the twin event and receives the location of the computer on which the twinned child process should be placed. When a computer remote to the parent process is returned as the location on which the child process should be duplicated, further operations must be carried out to instantiate the child process on the remote computer. These operations employ process migration.

4.6.2. Multiple local and remote process duplication

The GENESIS support for multiple duplication of processes, both locally and remotely, is provided as an enhancement of the single process duplication service. As with the local duplication of a single child process, the local duplication of multiple processes does not require the migration of a process. The Process Manager and Space Manager are requested to duplicate multiple copies of the process entries and memory spaces. If the Global Scheduler informs the Execution Manager to duplicate the child processes on many remote computers, then the remote multiple duplication must be performed for each selected remote computer. As with the single process duplication case, only the remote multiple duplication requires the migration of the parent process. This migration would need to be performed for each remote computer involved in the duplication.

4.6.3. Group remote process duplication

When more than one remote computer is involved in the process duplication, the overall performance of the instantiation decreases. This is due to the same reason as in the initial creation method: each remote computer is required to be contacted sequentially, thus forcing an individual process migration to each remote computer. The solution developed here, as with group process creation, also relies upon a group migration service. In the group process duplication service, the Execution Managers involved in the process migration each join a group and use collective communication to allow the single parent process to be migrated to all remote computers involved in the group duplication.

4.7. Process migration

Single and group process migration services are required to support process duplication and dynamic load balancing, provided by the Execution Manager and Global Scheduler, respectively.

4.7.1. Basic process migration

The Global Scheduler and Execution Manager access the Migration Manager through two library calls, migrate() and cancel(). The Migration Manager was designed to clearly separate policy from mechanism. Therefore, the Migration Manager acts as the coordinator for the migration of the various resources which combine to form a process. The responsibility of migrating the physical resources, such as memory, process entries and communication resources, is left to the Space Manager, Process Manager and IPC Manager, respectively.

4.7.2. Group process migration

Group process migration is an enhancement of the basic process migration. Two possible methods exist for implementing the group migration service to support parallelism management and offer the Single System Image. The first method involves the group migration of the parent process to multiple remote computers and its duplication into the appropriate number of child processes. This method has a major benefit in that a single migration is required for any number of processes which are to be duplicated remotely. The second method involves the migration of multiple child processes (with identical memory) to multiple remote computers. In the second method, although the memory of each of the child processes is initially identical (parallel processes of an SPMD program), each process has a unique name. The Migration Manager uses the process name as the fundamental control object for a given migration. This method has the major disadvantage that when multiple child processes must be duplicated remotely, multiple single process migration operations must be performed. Thus, the total time to duplicate many child processes remotely is considerable, taking into account the accumulated time to individually migrate each child process. Therefore, the first method was chosen to be implemented, since migrating a single process with a single process name to multiple computers does not require extensive modification to the single process migration code, as would be the case for group migration of multiple child parallel processes with multiple names.

The development of the group migration service involved modifying the single communication between the peer Migration Managers, Process Managers, Space Managers and IPC Managers to that of group communication. This allows each server to migrate its respective resource to multiple destination computers in a single message using group communication. The source Migration Manager maintains a table of participating destination computers, which is updated on receipt of acknowledgements from each respective remote Migration Manager. When all remote Migration Managers have responded, the group migration is successful.

After the successful group migration of the parent process to the remote computers, a copy of the parent process exists on all participating computers. The situation with multiple copies of the parent process on multiple computers is maintained until all duplications have been completed. The group migration is then cancelled, deleting all remote copies of the parent process and leaving the single, original parent process on the source computer (in contrast to the single migration service). Once cancelled, the original parent process on the source computer is allowed to continue execution.

5. The GENESIS parallel processing programming interface

The parallel programming environment of GENESIS has been designed and developed to provide both message passing and DSM, as shown in figure 3. The message passing paradigm exploits basic GENESIS interprocess communication concepts and offers standard Message Passing and Remote Procedure Call primitives. The PVM programming environment is transparently provided by a service that is simply a mapping of the standard PVM services onto the GENESIS communication services, and it benefits from additional services which are not provided by operating systems such as Unix or Windows. The shared memory paradigm is provided in GENESIS by the DSM system and offers the standard primitives used in physically shared memory systems.

There are two groups of primitives available to programmers to develop parallel programs. The communication and coordination service primitives for the Message Passing, PVM and DSM based programs are listed in table 1. It can be seen that only two basic Message Passing primitives are provided: those which allow a message to be sent and received. Similar primitives are provided in the PVM oriented services, with the addition of buffer management primitives which allow messages to be packed and unpacked as required.
The RPC based style of application development is supported by a third primitive, call(). The DSM primitives, due to the shared memory model of communication, only require read, write and synchronisation primitives, such as wait() and signal(), to support locking of a memory region. All three paradigms take advantage of a barrier() to ensure that a set of parallel processes reaches the same point within a program before continuing. The base send(), recv() and call() primitives are provided as system calls into the GENESIS microkernel, whereas every other primitive presented in this table is implemented as a library call which is ultimately mapped down to the GENESIS send() and recv() calls.

Table 1
GENESIS communication primitives.

  Message passing    PVM              DSM
  send()             pvm_send()       read access
  recv()             pvm_recv()       write access
  call()             pvm_pkbuf()      wait()
  barrier()          pvm_unpkbuf()    signal()
                     pvm_barrier()    barrier()

Table 2
GENESIS execution primitives.

  Message passing    PVM              DSM
  proc_ncreate()     pvm_spawn()      proc_ncreate()
  proc_exit()        pvm_exit()       proc_exit()

Table 2 lists the execution primitives used by the Message Passing, PVM and DSM based programs. In general, only execution services to support the instantiation and termination of parallel processes are required. For the Message Passing and DSM based programs, the same primitives are used and include proc_ncreate(), which allows a set of n child processes to be created concurrently from an executable image located on disk, and proc_exit(), which allows a process to terminate its execution (process duplication primitives are presented in another report). These primitives are provided as general GENESIS library routines which are mapped to send() and recv() calls to the Execution Manager. The PVM execution primitives, like the equivalent PVM communication primitives, are implemented as library calls which map down to the base proc_ncreate() and proc_exit() primitives [17].

6. Performance of standard parallel applications

The GENESIS system was tested on a cluster composed of 19 Sun 3/50 workstations interconnected by a shared 10 Mb/s Ethernet network. One of the workstations within the cluster is dedicated as a File Server and the remaining 18 workstations are used for normal user computation. The number of workstations used in the experiments was varied from 1 to 18. Each experiment was performed twenty times and the average result is presented.

6.1. The influence of process instantiation on execution performance

The key focus of the experiments performed relates to the influence that the process creation based instantiation method has on the overall execution time of the application. The computation performed by each child has been simplified to a number of iterations of a loop. The decision to use equal partitioning of the data was made so that performance variations due to data imbalances were removed. In these experiments the dynamic load balancing component of the Global Scheduler was enabled. As the optimal performance of an SPMD parallel program is achieved when the number of child processes equals the number of workstations [15], the speed-up values presented in figure 7 are for n processes on n workstations (where n is 1, 2, 4, 8 and 12), for each of the 5, 25 and 50 s of work for a sequential process.

Figure 7. Speed-up performance using process creation (5, 25 and 50 s work loads).

The results show that the group creation method outperforms the other creation methods, with multiple creation providing better results than the single process creation method. From the work-load perspective, the higher the work-load value the better the speed-up obtained. This can be attributed to the fact that the actual process creation time (ranging from 0.2 to 4.5 s) is a significant component of the total execution time and therefore a considerable overhead.

6.2. Performance of a standard parallel application

In this section we report on the execution performance of the Successive Over-Relaxation (SOR) application used to solve partial differential equations. The Message Passing, PVM and DSM programs were developed from algorithms devised for the research reported in [21]. The performance of the SOR parallel program executed on the GENESIS cluster, using the Message Passing, PVM and DSM communication services, is shown in figure 8.
Figure 8. SOR speed-up results.

The results presented were obtained on an array of 128 × 128 elements with 10 iterations used to calculate the resultant array. These results show that the Message Passing based program provides considerable performance improvements over the PVM and DSM versions.



7. Easy-to-use and program environment

The behaviour of the system described in the previous sections shows that the location of the remote computers of the cluster is selected automatically by the Global Scheduler. Neither the communicating processes nor users need to know the location of the processes. Programming of parallel applications to be executed on the GENESIS cluster has been made very easy. A parallel application requires only one primitive to create and execute a set of parallel processes, even though each creation request involves a sequence of operations. In other systems, the programming of this sequence and its operations has been the responsibility of the programmer. The operation of, and interaction between, the kernel servers used to implement the group creation service in a transparent manner is presented in figure 9. This figure clearly shows that the underlying services required to complete the concurrent process creation are completely transparent to the parent process and thus to the programmer.

8. Related systems

8.1. Message passing based systems

PVM [3] is an execution environment that runs on top of a variety of computers with the aim of providing parallel applications with a method of accessing the resources of the cluster. The PVM system was developed as a set of cooperating server processes and a suite of specialised libraries that provide the programmer with a set of consistent primitives for parallel process communication, execution and synchronisation. Two projects [23] and [5] to develop the next generation of PVM concentrated only on the migration of processes amongst processors. HARNESS [8] will not provide transparency: the programmer will be forced to deal with the identification and declaration of available computers, and with mapping processes to these computers. Load imbalance is neglected, which degrades the overall performance of executing applications.

8.2. Distributed shared memory

Much of the effort in DSM research has concentrated on improving the performance of these systems, rather than on reaching a more integrated DSM design [18]. Munin and TreadMarks are the best known DSM systems. Munin is an object based DSM system that uses multiple consistency protocols and runs on top of the V operating system. When using Munin the programmer must label different variables according to the consistency protocol they require [4]. TreadMarks is a DSM system implemented on top of Unix. It uses a variation of release consistency called lazy release consistency, in which the propagation of updates is delayed until the next request for access to the critical region is made, and then only the memory of the process entering the critical region is updated. Unapplied updates must be stored until they have been applied to all copies of the memory or garbage collection has been carried out [6]. Both Munin and TreadMarks have neglected parallelism management.
The initialization stage of an application, in both systems, requires programmers to define the number of computers to be used, create a process on each of these computers, initialize the shared data on each computer and create the synchronization barriers. This manual approach places a significant load upon programmers, and leads to load imbalance and performance degradation.

8.3. Execution environments

An improvement on the common approach of the PVM and MPI systems executing on top of an existing operating system is the enhancement of the underlying operating system, exploited by the Beowulf [24], NOW [1] and MOSIX [2] systems. The Beowulf system exploits a distributed process space (BPROC) [14] in order to manage parallel processes. Processes can be started on remote computers only if the logon operation into each remote computer has completed successfully, and starting processes in Beowulf is done via rsh, sequentially, although these two features are hidden from the user by PVM and MPI. Another weakness of Beowulf is that it does not address resource allocation or load balancing. The NOW system is based on the Solaris operating system from Sun Microsystems and combines specialised libraries and server processes with enhancements to the kernel itself, in the form of scheduling and communication kernel modules. The enhancements to the operating system have taken the form of a global operating system layer (GLUnix) providing network wide process, file and virtual memory management [9]. The only parallelism management services of the NOW (Network of Workstations) system [1] are those that allow a sequential process to be initiated on any computer of a cluster and those that support the semi-transparent start of parallel processes on multiple nodes (how those nodes are selected is not shown), barriers, co-scheduling, and the MPI standard. The current version of the MOSIX system is the result of extending and modifying the Linux operating system to produce a fully distributed operating system. The MOSIX system provides enhanced and transparent communication and scheduling services within the kernel, and employs PVM to provide the high level parallelism support. A number of problems exist with MOSIX from the parallel processing point of view. Firstly, MOSIX provides dynamic load balancing and load collection, but PVM is relied upon to perform the initial placement of processes. Secondly, although communication is transparent, all remote communication must be handled through the originating computer, causing a bottleneck once a process is migrated away to another computer. A common and serious problem exists with the PVM, Beowulf, NOW and MOSIX systems.

parent_code() {
    /* Issue call to the Execution Manager to create the child processes */
    process_create(GROUP_CREATE, n, child_prog)
}

The request to group create multiple processes is received by the Execution Manager through the par_initialise() call:
- The parent Execution Manager which received the request contacts the Global Scheduler to select the set of computers W = {w1, ..., wm} to be used and the number of child processes N = {n1, ..., nm} to be created on each computer, where m <= n and n = n1 + ... + nm.
- The parent Execution Manager forwards a request to create the respective number of child processes nj to each of the Execution Managers on wj, using group communication.
- For each computer wj, j = 1, ..., m, in the computer set W:
    - the Execution Manager contacts the Process Manager to allocate nj new processes;
    - the Execution Manager contacts the Space Manager to create memory for one child process.
- The parent Execution Manager contacts the File Server to request that the image of the child process be sent to the group of Execution Managers.
- On each wj:
    - the Execution Manager receives the child process executable image and populates the memory of the first child;
    - the Execution Manager contacts the Space Manager to duplicate nj - 1 copies of the memory for the remaining children;
    - if wj is remote, the Execution Manager returns to the Execution Manager on the parent computer the names and details of the nj children created on computer wj.
- The parent Execution Manager contacts the Process Manager to link the n newly created children to the list of children instantiated by the parent.

Figure 9. Group process creation service.
Each of these systems supports only the creation of single processes. Although the primitives provided by each of these systems enable the user to request that multiple processes be created, internally they are created only one at a time. This problem is the result of the reliance each of these systems has on an underlying network operating system, which was designed and implemented to support only the creation of single processes.

9. Conclusions

In this paper we have identified the requirements for a cluster operating system. The GENESIS operating system has been designed and developed to meet these needs: efficient and transparent execution of parallel applications on clusters; good utilization of the computational resources of a cluster; and provision of the Single System Image in order to make a cluster easy to use and to relieve programmers from activities that are operating system oriented. These operating system activities are: the selection of computers for the execution of their parallel programs; setting up a virtual machine; mapping processes to a virtual computer; process instantiation using process creation and duplication supported by process migration; and load balancing at the instantiation of a process and also during its execution. GENESIS is a comprehensive system that provides programmers with an environment which allows them to use two forms of interprocess communication service, Message Passing and Shared Memory, within a single programming and execution environment. Message passing can be used either as raw Message Passing or through PVM, and shared memory is used through the DSM system. These services are integrated into the GENESIS operating and resource management system, which has been built from scratch. The DSM system has been incorporated into the memory management system, making it an integral feature of the operating system. The GENESIS solution avoids many of the problems faced by current systems and provides the user/programmer with an easy to use, powerful system that manages the resources of a cluster and the parallel processes of an application, which may communicate using message passing, DSM, or both. The implementation of GENESIS demonstrates that the approach to and design of cluster operating systems supporting parallel execution and offering full transparency to programmers are feasible. Programmers do not have to be involved in parallelism management. Indeed, parallelism can be efficiently and transparently managed if a proper operating system is employed. The performance study shows that the system offers good performance.

References

[1] T. Anderson, D. Culler and D. Patterson, A case for networks of workstations: NOW, IEEE Micro (February 1995) 54–64.
[2] A. Barak and O. La'adan, The MOSIX multicomputer operating system for high performance cluster computing, Journal of Future Generation Computer Systems 13(4–5) (1998) 361–372.
[3] A. Beguelin, J. Dongarra, A. Geist, R. Manchek, S. Otto and J. Walpole, PVM: Experiences, current status and future directions, Oregon Graduate Institute of Science and Technology, Technical Report CSE-94-015 (April 1994).
[4] J. Carter, Efficient distributed shared memory based on multi-protocol release consistency, Ph.D. Thesis, Rice University (September 1993).
[5] J. Casas, D. Clark, P. Galbiati, R. Konuru, S. Otto, R. Prouty and J. Walpole, MIST: PVM with transparent migration and checkpointing, in: Proc. of the 3rd PVM Users' Group Meeting, Pittsburgh (May 1995).
[6] Concurrent Programming with TreadMarks, ParallelTools, L.L.C. (1994).
[7] D. De Paoli, A. Goscinski, M. Hobbs and G. Wickham, The RHODOS microkernel, kernel servers and their cooperation, in: Proc. IEEE 1st International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP-95), Brisbane (April 1995) pp. 345–354.
[8] J. Dongarra, A. Geist, J.A. Kohl, P.M. Papadopoulos and V. Sunderam, HARNESS: Heterogeneous Adaptable Reconfigurable NEtworked SystemS, Oak Ridge National Laboratory, Oak Ridge, TN (March 1998).
[9] D. Ghormley, D. Petrou, S. Rodrigues, A. Vahdat and T. Anderson, GLUnix: a global layer Unix for a network of workstations, Software Practice and Experience 28(9) (1998) 929–961.



[10] A. Goscinski, Parallel processing on clusters of workstations, in: Networks – The Next Millennium (World Scientific, 1997).
[11] A. Goscinski, Towards an operating system managing parallelism of computing on clusters of workstations, Future Generation Computer Systems 17 (2000) 293–314.
[12] A. Goscinski, Distributed Operating Systems: The Logical Design (Addison-Wesley, 1991).
[13] A. Goscinski and J. Silcock, An easy to program and use DSM environment, in: Proc. 10th IASTED International Conference on Parallel and Distributed Computing and Systems, Las Vegas, Nevada (October 1998).
[14] E. Hendriks, BPROC: Beowulf distributed process space, http://www.beowulf.org/software/bproc.html (April 1999).
[15] M. Hobbs, The management of SPMD based parallel processing on clusters of workstations, Ph.D. Thesis, Deakin University (August 1998).
[16] M. Hobbs and A. Goscinski, A concurrent process creation service to support SPMD based parallel processing on COWs, Concurrency: Practice and Experience 11(13) (1999) 803–821.
[17] M. Hobbs and A. Goscinski, The RHODOS remote process creation facility supporting parallel execution on distributed systems, Journal of High Performance Computing 3(1) (1996) 23–30.
[18] L. Iftode and J.P. Singh, Shared virtual memory: progress and challenges, Technical Report TR-552-97, Department of Computer Science, Princeton University (October 1997).
[19] P. Keleher, Lazy release consistency for distributed shared memory, Ph.D. Thesis, Rice University (1994).
[20] T. Lewis, Supercomputers ain't so super, Computer 27(11) (1994).
[21] H. Lu, Message passing versus distributed shared memory on networks of workstations, Ph.D. Thesis, Rice University (May 1995).
[22] Y. Ni and A. Goscinski, Trader cooperation to enable object sharing among users of homogeneous distributed systems, Computer Communications 17(3) (1994).
[23] S. Otto, Processor virtualization and migration for PVM, in: Proc. 2nd Workshop on Environments and Tools for Parallel Scientific Computing, Townsend, TN (May 1994).
[24] D. Ridge, D. Becker, P. Merkey and T. Sterling, Beowulf: Harnessing the power of parallelism in a Pile-of-PCs, in: Proc. IEEE Aerospace (1997).
[25] J. Rough, A. Goscinski and D. De Paoli, PVM on the RHODOS distributed operating system, in: Proc. 4th European PVM/MPI Users' Group Meeting, Recent Advances in Parallel Virtual Machine and Message Passing Interface, Cracow, Poland (November 1997) pp. 208–215.
[26] J. Silcock and A. Goscinski, Update-based distributed shared memory integrated into RHODOS memory management, in: Proc. 3rd International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP-97), Melbourne (December 1997) pp. 239–252.
[27] D. Talia, Parallel computation still not ready for the mainstream, Communications of the ACM 40(7) (1997).

Andrzej M. Goscinski is a chair professor of computing at Deakin University. He received his M.Sc., Ph.D. and D.Sc. from the Staszic University of Mining and Metallurgy, Krakow, Poland. Dr. Goscinski is recognized as one of the leading researchers in distributed systems, distributed operating systems and parallel processing on clusters. The results of his research have been published in international refereed journals and conference proceedings and presented at specialized conferences. In 1997, Dr. Goscinski and his research group initiated a study into the design and development of a cluster operating system supporting parallelism management and offering a single system image. The first version of this system has been in use since the end of 1998. Currently, Dr. Goscinski is carrying out research into global computing, based on distributed, networked and parallel systems, to support the information economy, in particular, electronic commerce and knowledge acquisition and management. E-mail: ang@deakin.edu.au

Mick Hobbs. Photo and biography not available at time of publication. E-mail: mick@deakin.edu.au

Jackie Silcock received a B.Sc. degree from the University of Cape Town, South Africa. She completed her Ph.D. at Deakin University, Australia. Her Ph.D. was on the design and implementation of a user-friendly and efficient Distributed Shared Memory system integrated into a distributed operating system. Her current research areas are parallel processing and scheduling on clusters of workstations, and Distributed Shared Memory. She teaches undergraduate courses in distributed systems, networks, systems analysis and information systems. E-mail: jackie@deakin.edu.au