
12th IEEE International Workshop on Future Trends of Distributed Computing Systems

APRIX: A Master-Slave Operating System Architecture for Multiprocessor Embedded Systems


Jimin Kim, Computer Engineering Department, College of Information and Communications, Hanyang University, Korea, jmkim@rtcc.hanyang.ac.kr
Minsoo Ryu, Computer Engineering Department, College of Information and Communications, Hanyang University, Korea, msryu@hanyang.ac.kr

Abstract
The recent emergence of heterogeneous chip multiprocessors requires a different operating system organization from the usual SMP (symmetric multiprocessing) organization. Although the SMP organization has been widely adopted in modern multiprocessor operating systems, it is restricted to homogeneous processors with a global shared memory. The master-slave organization, on the other hand, has little dependency upon the underlying hardware architecture and thus has great potential to cope with heterogeneous multiprocessors. This motivated us to reexamine the master-slave approach. In this paper, we attempt to address real-time and performance issues associated with the master-slave approach. Specifically, we first describe our previous design of a master-slave architecture, called APRIX. We then present an improved communication mechanism between the master and slaves, which allows the master to provide priority-based system call services to slave kernels and also improves the overall multiprocessing performance.

This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) KRF-2005-041-D00625. Corresponding author: msryu@hanyang.ac.kr.

1071-0485/08 $25.00 © 2008 IEEE DOI 10.1109/FTDCS.2008.34

1. Introduction
The recent emergence of heterogeneous chip multiprocessors such as Philips Nexperia, TI OMAP, ST Nomadik, Qualcomm MSM, and STI Cell [11] requires a different operating system organization from the usual SMP (symmetric multiprocessing) organization [7, 2, 6]. Although the SMP organization has been widely adopted in modern multiprocessor operating systems, it is restricted to homogeneous processors with a global shared memory, so that all processors run a single copy of the SMP kernel located in the shared memory. On the other hand, the master-slave organization [3] has little dependency upon the underlying hardware architecture since it merely assigns different roles to different processors. In a usual master-slave organization, one processor is designated as the master, which can execute in supervisor mode, while the other processors are designated as slaves, which execute only in user mode. This organization does not assume any specific hardware architecture and is thus applicable to a wide range of hardware architectures, from symmetric multiprocessors with UMA (uniform memory access) to asymmetric multiprocessors with NUMA (non-uniform memory access).

Although the master-slave organization has great potential to cope with heterogeneous multiprocessors, it has been studied less than the SMP approach because of serious concerns about performance and the difficulty of software development. One concern is that the master processor would become a performance bottleneck as the number of slaves increases, thus offering lower scalability than the SMP organization. However, most modern heterogeneous multiprocessor systems of practical interest are developed with a specific purpose in mind, such as multimedia processing and/or signal processing, rather than massively parallel computation, and thus usually contain fewer than about eight CPUs. The Cell processor [11] is a good example: it contains one Power processor as a master and eight synergistic processors as slaves. Furthermore, target applications like multimedia and signal processing are inherently compute-intensive and require few of the master's supervisor-mode services. This implies that the master-slave organization can be used without serious performance degradation. Note that a common bus-based SMP organization also has limited scalability due to bus contention, lock contention, and cache coherency. It is usually accepted that even the bus-based SMP organization limits the number of processors to about 8 or 16 CPUs.

The other concern raised by the master-slave organization is the difficulty of software development. In a master-slave setting, the target software needs to be divided into several parts, and each part must be programmed separately on each processor. Communication between the separate parts can be accomplished by explicit message passing mechanisms. The SMP organization, on the other hand, provides the same programming model as the multithreaded uniprocessor programming model. Communication is also easy since a shared address space is available between threads. Here, we note that the difficulty of programming in the master-slave organization is mainly caused by the underlying hardware architecture, not by the organization style. Programs must be developed separately because of processor heterogeneity. Communication must be done via message passing if a shared address space is not available. It is not difficult to imagine a master-slave organization on top of homogeneous multiprocessors with UMA, as presented in the seminal work [3], which may be able to provide the same programming model as the uniprocessor programming model.

Recently, we developed a prototype master-slave operating system, called APRIX (Asymmetric Parallel RealtIme KernelS), for multiprocessor real-time systems [9]. APRIX has been constructed by converting a uniprocessor RTOS kernel into two cooperative kernels, i.e., a master kernel and a slave kernel. The master kernel is responsible for allocating and scheduling application tasks onto slaves and for providing the remote system call (RSC) services requested by slaves. The slave kernel executes allocated tasks in user mode and requests RSC services when needed. In the previous work, we designed and implemented APRIX with an emphasis on structural design issues. In this paper, we attempt to address real-time and performance issues in more detail. Specifically, we focus on the RSC mechanism, which plays a central role in master-slave communication and is thus a significant factor in the system's overall performance. First, we discuss how to correctly design the RSC mechanism with common operating system architectures in mind. Second, we identify the problem of priority inversion and describe a priority-based RSC mechanism. Finally, we present several options to increase the overall multiprocessing performance, which also help mitigate the problem of the master being a bottleneck.

1.1. Related Work

From an operating systems perspective, there exist two major classes of operating system architectures for multiprocessor systems. Symmetric multiprocessing (SMP) kernels are the most widely used operating system architecture. All processors run a single copy of the SMP kernel that resides in shared memory. Since all processors share the code and data of the SMP kernel, synchronization is a key issue for the SMP architecture. Early implementations of SMP kernels relied on coarse-grained locking to simplify the synchronization problem, but almost all modern SMP kernels are now based on fine-grained locking to achieve maximum parallelism [7, 2, 6]. Note that modern SMP kernels have evolved from monolithic uniprocessor kernels [10, 1, 2, 4, 8]. Since several processors can execute simultaneously in the kernel and may access the same kernel data structures, sophisticated synchronization mechanisms were required in adapting uniprocessor kernels to SMP versions.

The other class of operating system architectures is the master-slave architecture [3]. In [3], Goble and Marsh implemented a master-slave system on a dual-processor VAX 11/780, where the master is responsible for handling all system calls and interrupts. Slaves execute processes in user mode and send a request to the master when a process makes a system call. Recent work on the master-slave approach has been reported in [5]. Kagstrom et al. attempted to provide multiprocessor support without modifying the original uniprocessor kernel. Their idea was to create and run two threads for each application, a bootstrap thread and an application thread. The application thread runs the application on the slave, and the bootstrap thread runs on the master awaiting a system call request. On receiving a request, the bootstrap thread invokes the requested system call on behalf of the application thread.

The remainder of this paper is organized as follows. Section 2 gives an overview of APRIX. Section 3 presents the improved design of the RSC mechanism. Section 4 describes another prototype implementation of APRIX on TI DaVinci and experimental results. Section 5 concludes this paper with future work.

2. Overview of APRIX
APRIX is based on a simple master-slave model that is similar to that of [3]. The master kernel has two basic functions: assigning application tasks to slaves and providing kernel services to slaves. Every slave, on the other hand, has the simple function of executing application tasks in user mode. Note that the master kernel itself is also able to run application tasks.

2.1. APRIX Architecture


APRIX provides global priority-based scheduling. The master has a global scheduler that selects the highest-priority tasks from a single global run queue and assigns them to slave processors. Each slave has a local run queue that stores the runnable tasks assigned to it by the global scheduler, and it performs priority-based scheduling on that queue.
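The two-level scheduling described above can be sketched as follows. This is a minimal illustration, not the APRIX implementation: the class and method names are hypothetical, a smaller number is assumed to mean a higher priority, and handing one task to each slave per dispatch round is an assumed policy that the paper does not specify.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    priority: int                       # assumption: smaller number = higher priority
    name: str = field(compare=False)    # the name does not take part in ordering


class GlobalScheduler:
    """Master-side scheduler with a single global run queue (hypothetical API)."""

    def __init__(self, num_slaves):
        self.global_queue = []                                # min-heap of Task
        self.local_queues = [[] for _ in range(num_slaves)]   # one heap per slave

    def submit(self, task):
        """Place a runnable task on the global run queue."""
        heapq.heappush(self.global_queue, task)

    def dispatch(self):
        """Hand the highest-priority tasks to the slaves, one task per slave."""
        for local in self.local_queues:
            if not self.global_queue:
                break
            heapq.heappush(local, heapq.heappop(self.global_queue))

    def run_next(self, slave_id):
        """Slave-side local scheduling: pick the best task in the local queue."""
        local = self.local_queues[slave_id]
        return heapq.heappop(local) if local else None
```

With two slaves and three submitted tasks, one dispatch round moves the two highest-priority tasks to the local queues and leaves the third on the global queue.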


The master kernel has been constructed by incorporating additional components into our uniprocessor kernel, called QURIX [9]. The additional components are a global task scheduler and a kernel service handler. Figure 1 (A) shows the master kernel's structure, where shaded boxes represent added components and white boxes represent native components. Since the slave kernel requires only a minimal set of functions, many of the original components have been removed from the uniprocessor kernel. The remaining components include the local scheduler and task manager. The slave kernel also contains two additional components, a local dispatcher and a kernel service proxy, which are responsible for handling the master's task assignment commands and managing remote invocation, respectively. Figure 1 (B) shows the resulting structure of the slave kernel.

2.2. Interactions between Master and Slave Kernels


There are two types of interactions between the master and slaves. The first type is associated with task assignment and scheduling. When the global scheduler in the master decides to assign a task to a slave, it initiates communication by sending a message to the slave. On receiving the message, the slave's local dispatcher interprets the message and immediately requests the thread manager to create a new task. The local dispatcher has another function: it informs the master when a scheduling event occurs. For instance, if a task completes, the local dispatcher informs the master of this event.

The second type of interaction is carried out for remote invocation of kernel mode operations. When a task on a slave invokes a kernel mode operation such as a system call, the kernel service proxy in the slave initiates communication by sending a message to the master. The kernel service handler in the master then invokes the requested operation and returns the results to the kernel service proxy.

To enable these interactions between the master and slaves, both kernels should incorporate a common component, called the inter-processor communication component. Since the implementation of this component depends mostly upon the underlying hardware architecture and communication mechanisms, it is desirable to place the component at the lowest layer of the kernel structure, such as the HAL (hardware abstraction layer). The inter-processor communication component should provide a minimal set of message passing interfaces, including send and receive. These interfaces may be implemented using an interrupt mechanism when inter-processor interrupts are supported by the hardware, or a polling mechanism when global shared memory is available.
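As an illustration of the send and receive interfaces in the polling case, the sketch below models one direction of a shared-memory message queue as a single-producer, single-consumer ring buffer. The class name, the message layout, and the non-blocking return conventions are assumptions made for this example; the paper does not describe the actual queue layout.

```python
class Mailbox:
    """Single-producer, single-consumer ring buffer standing in for a
    shared-memory message queue (hypothetical layout)."""

    def __init__(self, capacity):
        # One slot always stays empty so that "full" and "empty" differ.
        self.slots = [None] * (capacity + 1)
        self.head = 0   # next slot to read, advanced only by the receiver
        self.tail = 0   # next slot to write, advanced only by the sender

    def send(self, message):
        """Store a message; return False instead of blocking when full."""
        nxt = (self.tail + 1) % len(self.slots)
        if nxt == self.head:
            return False
        self.slots[self.tail] = message
        self.tail = nxt
        return True

    def receive(self):
        """Polling receive: return None when no message is pending."""
        if self.head == self.tail:
            return None
        message = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        return message
```

A master-to-slave and a slave-to-master mailbox together would carry both interaction types. On real hardware, the index updates would additionally need memory barriers or uncached accesses, which this sketch omits.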


3. Remote System Call Service Mechanism


When an application task running on a slave kernel requests a kernel mode service, the slave kernel redirects the request to the master kernel, which is responsible for system call services. This remote system call (RSC) service mechanism can be realized in a way similar to the traditional remote procedure call (RPC). Note that remote invocation has been well studied in the fields of middleware and distributed operating systems. However, we need to reexamine this issue in depth since it has a significant impact on performance in the master-slave architecture.

Figure 1. The structures of APRIX: (A) master kernel and (B) slave kernel.

3.1. Basic Remote System Call Service


The RSC service is supported by the kernel service proxy in the slave and the kernel service handler in the master. The kernel service proxy can be viewed as a stub procedure, which converts the operation name and parameters into a message and sends the message to the master. When the kernel service handler receives the message, it invokes the requested operation, packs the result into a message, and sends it to the slave. The kernel service proxy in turn receives the message and returns the result to the caller task.
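The stub behaviour described above can be sketched as a marshal, dispatch, unmarshal round trip. Everything named here is hypothetical: the JSON message format, the function names, and the two toy operations in the call table are chosen only to keep the example self-contained, and a direct function call stands in for the two inter-processor message transfers.

```python
import json

# Hypothetical system call table on the master; the entries are illustrative.
SYSCALLS = {
    "add": lambda a, b: a + b,
    "strlen": lambda s: len(s),
}


def proxy_marshal(op, *args):
    """Slave side: pack the operation name and parameters into a message."""
    return json.dumps({"op": op, "args": list(args)}).encode()


def handler_service(request):
    """Master side: unpack the request, invoke the operation, pack the result."""
    msg = json.loads(request.decode())
    result = SYSCALLS[msg["op"]](*msg["args"])
    return json.dumps({"result": result}).encode()


def proxy_unmarshal(reply):
    """Slave side: unpack the reply so the result can go back to the caller."""
    return json.loads(reply.decode())["result"]


def remote_system_call(op, *args):
    """Full round trip; a direct call replaces the two message transfers."""
    return proxy_unmarshal(handler_service(proxy_marshal(op, *args)))
```

A caller task would see only `remote_system_call("add", 2, 3)`, which keeps the remote nature of the call hidden behind the proxy, as with a conventional RPC stub.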

3.2. Priority-based Remote System Call


A straightforward approach to the RSC service mechanism is to implement the kernel service handler within an interrupt handler. However, this approach has two serious drawbacks. First, it can only be used for non-blocking system call operations that can be executed safely in an interrupt handler. Second, it may lead to indefinite blocking of the task that was running when the interrupt occurred, since interrupt handlers always have higher priorities than application tasks. This blocking is more serious if the interrupt has been generated by a task that has a lower priority than the current task. We refer to this specific situation as priority inversion.

A more desirable approach is to execute the kernel service handler in a task context. This permits the requested operation to block during its execution. Note that this approach may be slower since it requires a context switch out of the interrupt handler and may require synchronization with other tasks running in the kernel.

The problem of priority inversion can then be solved by introducing a priority scheme. Every RSC request is assigned the same priority as that of the requesting task. When the kernel service handler is invoked, it services all the pending requests that have higher priorities than the current task. The lower-priority requests are serviced later, when the handler gets control again. The handler gets control in two cases: when a new RSC request arrives, or when the processor switches from kernel mode to user mode. In this way, we can ensure that high-priority tasks are never preempted by requests originating from low-priority tasks.
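The priority scheme can be illustrated with a small sketch of the handler's selection rule. The class and method names are hypothetical, a smaller number is assumed to mean a higher priority, and FIFO ordering among requests of equal priority is an added assumption that the paper does not state.

```python
import heapq
import itertools


class PriorityRscHandler:
    """Services pending RSC requests in priority order (hypothetical sketch).
    Assumption: a smaller number means a higher priority."""

    def __init__(self):
        self.pending = []               # min-heap of (priority, seq, request)
        self.seq = itertools.count()    # keeps FIFO order among equal priorities

    def post(self, priority, request):
        """Queue a request tagged with the priority of the requesting task."""
        heapq.heappush(self.pending, (priority, next(self.seq), request))

    def run(self, current_priority):
        """Service every pending request that outranks the current master task;
        lower-priority requests stay queued until a later invocation."""
        serviced = []
        while self.pending and self.pending[0][0] < current_priority:
            _, _, request = heapq.heappop(self.pending)
            serviced.append(request)
        return serviced
```

Calling `run` with the priority of the task currently executing on the master returns only the requests that are allowed to preempt it; the remaining requests are picked up when the handler runs again with a lower-priority current task.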

3.3. Performance Optimization through Multiple RSC Threads


Although the above RSC approach can provide priority-based services, a serious performance problem remains. When an RSC operation blocks, the master kernel cannot make progress even though there exist pending RSC requests that have lower priorities than the current request. The reason is that there exists only a single thread of control within the kernel. This problem can therefore be solved by creating more than one thread of control dedicated to providing RSC services. The extended version of APRIX provides a pool of RSC threads that are solely intended for providing RSC services. APRIX creates the same number of RSC threads as the number of slave processors. Note that there can be different choices for the number of RSC threads. One approach would be to use the number of application tasks in the entire system as the number of RSC threads. However, this would increase the overhead of creating and scheduling RSC threads, thus significantly degrading the overall performance. On the other hand, if the number is too small, the probability that the master stalls on a blocked RSC operation increases. A reasonable choice seems to be the number of slave processors, which represents the actual degree of physical parallelism.

4. Implementation and Experimental Results


The first version of APRIX was implemented on MPSoC-II, a four-processor general-purpose prototyping board for developing MPSoC-based electronic designs [9]. The major components of the prototyping board include four ARM926EJ-S processors, 1 MB of shared memory, and 64 MB of local memory for each processor. Recently, we also ported APRIX to the DaVinci TMS320DM644x Digital Media System-on-Chip (DMSoC), which consists of an ARM926EJ-S CPU core and a C64x+ DSP CPU core. With this implementation, we measured communication costs with the ARM core running at 296 MHz and the DSP core running at 594 MHz. The overheads of requesting a system call and returning a result are given in Table 1.

Operation                                                  Ave. (ns)  Max. (ns)  Min. (ns)
Slave's writing to system call service queue                     777       1184        740
Master's reading from system call service queue                 1628       1813       1591
Master's writing to system call service return queue             444        703        370
Slave's reading from system call service return queue           1776       2072       1665

Table 1. Overheads of requesting and servicing a remote system call.
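As a rough reading of Table 1, the four queue operations can be added up to estimate the communication overhead of one remote system call. This is only a back-of-the-envelope sketch under two assumptions that the paper does not state: the four steps run strictly one after another, and the time spent executing the requested operation on the master is excluded. The summed maximum is an upper bound on the communication cost, not a measured value.

```python
# Table 1 values in nanoseconds: (average, maximum, minimum) per step.
STEPS = {
    "slave writes request queue": (777, 1184, 740),
    "master reads request queue": (1628, 1813, 1591),
    "master writes return queue": (444, 703, 370),
    "slave reads return queue": (1776, 2072, 1665),
}


def round_trip_ns():
    """Sum each column over the four steps of one remote system call."""
    avg, worst, best = (sum(column) for column in zip(*STEPS.values()))
    return {"avg": avg, "max": worst, "min": best}
```

Under these assumptions, the average communication overhead is 777 + 1628 + 444 + 1776 = 4625 ns, or about 4.6 microseconds per call, with the column sums for the maximum and minimum giving 5772 ns and 4366 ns.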


5. Conclusion
In this paper, we presented a master-slave operating system architecture for heterogeneous multiprocessors. Specifically, we improved our initial design of APRIX with an emphasis on real-time and performance issues. We have not yet thoroughly evaluated the proposed techniques, since the implementation was still in progress at the time of writing. We will continue to implement the techniques described in Sections 2 and 3 and report more detailed results in the near future.

References
[1] M. J. Bach. Design of the UNIX Operating System. Prentice Hall, 1986.
[2] R. Clark, J. O'Quin, and T. Weaver. Symmetric multiprocessing for the AIX operating system. In Proceedings of the 40th IEEE Computer Society International Conference, pages 110-115, 1996.
[3] G. H. Goble and M. H. Marsh. A dual processor VAX 11/780. In Proceedings of the 9th Annual Symposium on Computer Architecture, pages 291-298, 1982.
[4] M. D. Janssens, J. K. Annot, and A. J. Van De Goor. Adapting UNIX for a multiprocessor environment. Communications of the ACM, 29:895-901, 1986.
[5] S. Kagstrom, H. Grahn, and L. Lundberg. Experiences from implementing multiprocessor support for an industrial operating system kernel. In Proceedings of the 11th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, pages 365-368, 2005.
[6] S. Kleiman, J. Voll, J. Eykholt, A. Shivalingiah, and D. Williams. Symmetric multiprocessing in Solaris 2.0. In Proceedings of the Thirty-Seventh International Conference on COMPCON, pages 181-186, 1992.
[7] G. Lehey. Improving the FreeBSD SMP implementation. In Proceedings of the FREENIX Track: 2001 USENIX Annual Technical Conference, pages 155-164, 2001.
[8] C. H. Russell and P. J. Waterman. Variations on UNIX for parallel-processing computers. Communications of the ACM, 30:1048-1055, 1987.
[9] M. Seo, H. S. Kim, J. C. Maeng, J. Kim, and M. Ryu. An effective design of master-slave operating system architecture for multiprocessor embedded systems. In Proceedings of the 12th Asia-Pacific Computer Systems Architecture Conference, pages 114-125, 2006.
[10] U. Vahalia. UNIX Internals: The New Frontiers. Prentice Hall, 1996.
[11] W. Wolf. The future of multiprocessor systems-on-chips. In Proceedings of the 41st Annual Conference on Design Automation, pages 681-685, 2004.

