
Improving Robustness of Real-Time Operating Systems (RTOS) Services Related to Soft-Errors

M.H Neishaburi, Masoud Daneshtalab, Mohammad Reza Kakoee, Saeed Safari
University of Tehran, Iran
mhnisha@cad.ece.ut.ac.ir; {kakoee, safari}@cad.ece.ut.ac.ir; m.daneshtalab@ece.ut.ac.ir

Abstract
Nowadays, a growing number of critical applications with stringent real-time constraints run on top of a Real-Time Operating System (RTOS). The services provided by an RTOS are subject to faults that affect both the functional and the timing behavior of the tasks running on it. In this paper, we evaluate and analyze the robustness of RTOS services against soft-errors in two RTOS architectures, SW-RTOS and HW/SW-RTOS. Based on the experimental results, we propose an architecture that provides more robust services with respect to soft-errors.

RTOS users desire predictable response time at an affordable cost, and Hardware/Software Real-Time Operating Systems (HW/SW-RTOS) appeared in response to this demand. This paper analyzes the impact of soft-errors on real-time systems running applications under a purely software RTOS versus a HW/SW-RTOS. The proposed model is used to evaluate the robustness of services such as scheduling, synchronization, time management, memory management and inter-process communication in a software-based RTOS and in a HW/SW-RTOS. Experimental results show that the HW/SW-RTOS provides more robust services with respect to soft-errors than the purely software-based RTOS.

General Terms
Reliability, Verification.

Keywords
Software Real-Time Operating System (SW-RTOS), Hardware/Software Real-Time Operating System (HW/SW-RTOS), Soft-Error.

1-4244-1031-2/07/$25.00 ©2007 IEEE

1. Introduction
Up to now, the silicon device industry has remained loyal to Moore's Law. Feature size and other aspects of the fabrication process of integrated circuits have improved constantly. The delay and size of transistors have decreased tremendously compared to a couple of years ago. Unfortunately, though not unexpectedly, these improvements have not come for free. With the gradual shift towards the sub-micron era, many new problems and challenges must be addressed; otherwise the prophecy of Moore's Law cannot hold and the demise of CMOS technology will be seen within a couple of years. The sub-micron effects beyond 10 nm will be so fundamental that they need answers at every level of the design hierarchy, bottom-up, from the device to the system. The system designer must keep in mind that, regardless of the efforts of the device engineer, faults and single event upsets are inevitable. The system should therefore be designed to be robust against all kinds of soft-errors and faults.

This new aspect of design becomes critical in hard real-time systems, in which the output must be valid by the required deadline and the reliability of the system is as important as its functional accuracy. Failure to meet a deadline, or a crash of the application, may result in catastrophic events. There are many such applications, e.g. life-supporting instruments, aerospace equipment and traffic control.

System designers have previously introduced several innovations to deal with these problems. These solutions were mostly concerned with designing a robust application. For example, the method presented in [8] includes an additional application that checks the other applications in their workspace memory. In [9], the old idea of system replication is proposed. Finally, in [10], [11] the researchers were concerned with designing a robust scheduling algorithm. As stated in [7], these techniques are not enough. By injecting faults into the system, the authors of [7] found that a soft error may cause a failure in the multi-tasking process of an RTOS. This fault may propagate to the application level and thus defeat all the envisaged fault-tolerance mechanisms. As mentioned before, this could endanger valuable assets and lives. Hence, we conclude that a robust, fault-tolerant RTOS is necessary.

In [1], the authors suggested that by HW/SW partitioning of the operating system and moving some RTOS functionality (such as task synchronization and scheduling) to hardware, much faster execution may be obtained. This improvement comes at a cost of only 13K gates. Although much research has been done on this topic [2-6], there is still no commercial RTOS that takes advantage of this feature. The authors hoped that, with the advent of the fast inter-chip communication introduced by SoCs, the remaining hurdles would be overcome.

The main contributions of this paper are as follows. First, we analyze and evaluate the effect of soft-errors on the services provided by a purely software-based RTOS (SW-RTOS) and by a Hardware/Software RTOS (HW/SW-RTOS). Second, we propose an RTOS architecture that provides more effective and robust services with respect to soft-errors.

The rest of the paper is organized as follows: Section 2 introduces basic preliminaries and definitions, Section 3 explains the experimental framework used in this research, Section 4 presents our experimental results, and Section 5 concludes the work.

2. PRELIMINARY
A Real-Time Operating System ("RTOS") provides an "abstraction layer" that hides the hardware details of the processor (or set of processors) from the software layer. In providing this abstraction layer, the RTOS kernel supplies four main types of basic services to application software; Figure 1 shows these services.

Figure 1: Basic Services Provided by a Real-Time Operating System Kernel
(Diagram blocks: Task Management; InterProcess Communication (IPC) & Synchronization; Time Management (Timer); Dynamic Memory Allocation.)

Task Management is the most basic category of kernel services. This module provides services such as task creation, task scheduling and priority assignment. With the help of these services, software developers can partition their design into a number of separate pieces of software, each of which handles a distinct topic, a distinct goal, and perhaps its own real-time deadline. Each such piece of software is called a "task." Furthermore, the task scheduler controls the execution of application software tasks and can make them run in a very timely and responsive fashion.

The second category of kernel services shown in Figure 1 is InterProcess (InterTask) Communication and Synchronization. Software developers should be familiar with these services in order to pass information from one process (task) to another. These services make it possible for tasks to pass information from one to another without danger of that information ever being damaged. They also make it possible for tasks to coordinate, so that they can productively cooperate with one another.

Because of the stringent timing requirements of most real-time applications that run under an RTOS, most RTOS kernels provide some basic Timer services, such as task delays and time-outs.

Dynamic Memory Allocation services are another category often provided by RTOS kernels. This category of services allows tasks to "borrow" blocks of RAM for temporary use in application software. Often these blocks of memory are then passed from task to task, as a means of quickly communicating large amounts of data between tasks.

Many non-real-time operating systems also provide similar kernel services. The key difference between general-computing operating systems and real-time operating systems is the need for deterministic timing behavior in the real-time operating systems. Formally, deterministic timing means that operating system services consume only known and expected amounts of time.

3. EXPERIMENTAL FRAMEWORK
In this section we present our approach to both designing a HW/SW-RTOS and injecting faults into the proposed model.

3.1. Software Real-Time Operating System (SW-RTOS)
In our framework we use eCos (embedded Configurable operating system) as the purely software-based RTOS (SW-RTOS). eCos is an open source, royalty-free real-time operating system intended for embedded systems and applications [12]. The highly configurable nature of eCos allows the operating system to be customized to precise application requirements, delivering the best possible run-time performance and an optimized hardware resource footprint. A thriving net community has grown up around the operating system, ensuring ongoing technical innovation and wide platform support. eCos was designed for devices with memory footprints in the tens to hundreds of kilobytes, or with real-time requirements. It can be used on hardware that does not have enough RAM to support embedded Linux, which currently requires a minimum of about 2 MB of RAM, not including application and service requirements [12].


Figure 2: Software Real-Time Operating System (SW-RTOS) versus HW/SW-RTOS
(Two block diagrams, SW-RTOS and HW/SW-RTOS. Both show an application level with a SW part (tasks Ts1, Ts2, Ts3) and a HW part (HW1, HW2) above an RTOS level containing the RTOS kernel, context switch, scheduler and MMU. Further labels recoverable from the diagrams: Data Exchanger, Buffers, Hardware Scheduler, SoCLock cache, and the signals callRTOS and nextTask. MMU = Memory Management Unit.)

3.2. Hardware/Software Real-Time Operating System (HW/SW-RTOS)


RTOS users desire predictable response time at an affordable cost. To fulfill this, many researchers have investigated various approaches to ensure RTOS predictability. One active approach is to use a hardware mechanism, because (i) hardware is far more predictable than a software implementation of the same algorithm, (ii) thanks to the decreasing cost of hardware, the financial requirements of a real-time System on Chip can be met [1], and (iii) it can increase system performance, since moving RTOS functionality from software to dedicated hardware reduces the CPU load. The idea of a hardware operating system that moves scheduling and inter-process communication from the software kernel to hardware has been proposed in previous work [1]. In the proposed HW/SW-RTOS implementation, however, we replaced the POSIX support of the eCos operating system with dedicated data-exchanging mechanisms. The scheduler is also replaced. In the original tasks, only the communication part has to be adapted in order to communicate with the hardware part of the proposed HW/SW-RTOS; the other parts of the original task remain unchanged. The user can continue to use the same POSIX-based API without having to know about the presence of the HW/SW-RTOS in the final implementation.

Inter-process communication is usually done through semaphores or mutexes. In our proposed HW/SW-RTOS we directly update the memory locations used to implement semaphores and mutexes. Because inter-process communication is carried out by generating standard bus transactions, it is done more efficiently in the HW/SW-RTOS than in the SW-RTOS implementation. Figure 2 compares the standard software RTOS (top) with our proposed HW/SW-RTOS architecture (bottom). As shown in Figure 2, the operating system in the proposed HW/SW-RTOS is composed of three parts: (i) the Scheduling unit, (ii) the DataExchanger unit and (iii) the Context switching unit.

3.2.1. Scheduling unit


The scheduling unit must inform the CPU of the identifier of the next software task. In our proposed HW/SW-RTOS we use a hardware Weighted-Round-Robin (WRR) algorithm to implement this unit. At each scheduling cycle the task pointer is incremented to the next task, and the DataExchanger unit informs the scheduling unit about the state of that task (executable or blocked). When the scheduling unit finds an executable task, it informs the CPU by issuing a hardware interrupt. Otherwise, the task pointer is incremented until an executable task is found. In some circumstances there are several executable tasks. In this case the WRR algorithm, which is implemented in hardware through a priority encoder, helps to avoid starvation by selecting the task that has not executed recently.
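The pointer-based scan described above can be illustrated with a small software model. This is only a sketch under simplifying assumptions: the paper does not give the weights or the priority-encoder logic, so the model below is a plain (unweighted) round-robin scan, and the function name `next_task` and the boolean state list are illustrative, not part of the proposed hardware.

```python
from typing import Optional, Sequence


def next_task(executable: Sequence[bool], pointer: int) -> Optional[int]:
    """Return the index of the next executable task after `pointer`.

    The scan starts at the task following `pointer` and wraps around, so the
    task that ran most recently is considered last. Returns None when every
    task is blocked.
    """
    n = len(executable)
    for step in range(1, n + 1):
        candidate = (pointer + step) % n
        if executable[candidate]:
            return candidate  # the hardware unit would raise an interrupt here
    return None


# Three tasks, task 1 is blocked; task 0 ran last.
state = [True, False, True]
first = next_task(state, pointer=0)       # skips the blocked task 1
second = next_task(state, pointer=first)  # wraps around to task 0
```

Because the scan always starts just after the task that ran last, no executable task can be skipped indefinitely, which is the starvation-avoidance property the text attributes to the hardware scheduler.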


3.2.2. DataExchanger
The DataExchanger unit uses buffers to pass data between tasks. When a task wants to send data to another task, it gives the DataExchanger unit the identifier of the destination task and the value to be sent. The DataExchanger then manages its internal buffers to guarantee that the value reaches the specified task. Conversely, when a task needs data provided by another task, it gives the DataExchanger unit the identifier of the source task. The DataExchanger then blocks the waiting task and calls the scheduling unit to send the identifier of a schedulable task to the CPU.
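The send and receive behavior described above can be modelled in a few lines. The sketch below is a software toy, not the hardware design: the class name, the one-FIFO-per-task-pair buffer layout and the `blocked` table are assumptions made for illustration, since the paper does not specify the internal buffer organization.

```python
from collections import defaultdict, deque


class DataExchanger:
    """Toy model: one FIFO buffer per (source, destination) task pair."""

    def __init__(self):
        self.buffers = defaultdict(deque)  # (src, dst) -> queued values
        self.blocked = {}                  # dst -> src it is waiting for

    def send(self, src, dst, value):
        """Queue `value` for `dst` and wake `dst` if it waits on `src`."""
        self.buffers[(src, dst)].append(value)
        if self.blocked.get(dst) == src:
            del self.blocked[dst]

    def receive(self, dst, src):
        """Return the next value from `src`, or None after blocking `dst`."""
        queue = self.buffers[(src, dst)]
        if queue:
            return queue.popleft()
        self.blocked[dst] = src  # a scheduler would now pick another task
        return None

    def is_executable(self, task):
        """Tell the scheduler whether `task` may run."""
        return task not in self.blocked


dx = DataExchanger()
empty = dx.receive(dst="T2", src="T1")   # nothing buffered yet: T2 blocks
was_blocked = not dx.is_executable("T2")
dx.send(src="T1", dst="T2", value=42)    # T1 produces, T2 is woken up
value = dx.receive(dst="T2", src="T1")
```

The `is_executable` query plays the role of the task-state information that the DataExchanger passes to the scheduling unit.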

3.2.3. System on Chip Lock Cache unit (SoCLC)
A typical context switch consists of three steps [1]:
1. Pushing all CPU registers onto the stack of the current task.
2. Selecting (scheduling) the next task to be run.
3. Popping all CPU registers from the stack of the next task.
Steps 1 and 3 cannot be done by hardware in general-purpose CPUs, because all CPU registers must be stored to or restored from memory by the CPU itself. Instead, the scheduling unit is called in step 2 to tell the CPU which task to execute next.

3.3. Fault Injection Environment (FIE)
The fault injection technique was initially proposed in [13], but to our knowledge the literature contains no fault injection mechanism that supports fault injection in a SW-RTOS as well as in a HW/SW-RTOS. Before describing the main features of the adopted fault injection mechanism, it is worth noting that faults consist of single bit-flips in the CPU registers only. Therefore, in order to identify the sensitive components of the studied real-time operating systems, the fault injection module must have access to the CPU registers so that it can inject faults during the execution of applications. The adopted system architecture is simulated with an Instruction Set Simulator (ISS) [14]. The fault injection tool uses the temporal breakpoint feature of the ISS to inject faults by software means. Once a temporal breakpoint is reached, global execution is suspended and the ISS activates a Fault Injection Environment (FIE) that comprises two modules [6], as shown in Figure 3:
The fault generator calculates when and where a fault will be injected. The FIE can inject faults into CPU registers while the main services of the SW-RTOS (eCos kernel) and of the HW/SW-RTOS kernel are active.
The fault tracer collects information about the services that are currently executing in the RTOS; the SW-RTOS and the HW/SW-RTOS inform this part of the FIE about the kind of services that are active.

Figure 3: Fault Injection Environment
(Diagram: the Fault Generator and Fault Tracer of the FIE connected to two simulated systems, SW-RTOS and HW/SW-RTOS, each with a CPU, a memory (MEM) and tasks TS1, TS2, TS3; the connections are labelled "Injection", "Fault" and "activate".)

3.4. Characteristics of our Bench Mark
To perform our experiments, we created multitask applications classified into six groups according to the mechanisms through which they communicate and synchronize (see Figure 4). These groups exploit the most important services offered by the eCos real-time kernel. Before performing any fault injection experiment, we carefully studied and tested these applications in an environment without soft-errors to verify that all tasks meet their deadlines and produce correct output results. To account for real-time constraints, all tasks are considered critical (they must complete their execution before their deadlines and produce logically correct responses) and perform some useful computations [7].
Group1: tasks T1, T2, T3 and T4 share global variables using a semaphore; this technique can be used for access to a critical section and for synchronization.
Group2: tasks T1 and T2 communicate through message queues. T1 (transmitter) sends the results of its computations into the message queue QM. T2 (receiver) reads the messages from T1 and uses them in its own context.
Group3: tasks T1 and T2 communicate through a mailbox that can store a single message. T1 sends a message periodically into the mailbox, while T2, the receiver, consumes the message and uses it in its future operations.
Group4: tasks T1, T2 and T3 access a global variable protected by a mutual exclusion semaphore (mutex).
Group5: tasks T1, T2, T3 and T4 access a global variable using semaphore1 (sem1), while tasks T5, T6, T7 and T1 access a global variable using semaphore2 (sem2).
Group6: tasks T1 and T2 access a global variable using a mutex; each of them that gains access to the global variable sends the results of its computations into the message queue (QM), and finally task T3 receives its message from the message queue (QM).
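The fault model of Section 3.3, a single bit-flip in one CPU register at a chosen point of execution, can be made concrete with a short sketch. Everything specific here is an assumption for illustration: the 16-bit register width, the register names, the `step` field standing in for a temporal breakpoint, and the helper names are hypothetical and are not taken from the authors' tool or from the ISS.

```python
import random


def flip_bit(value: int, bit: int, width: int = 16) -> int:
    """Return `value` with one bit inverted, truncated to `width` bits."""
    if not 0 <= bit < width:
        raise ValueError("bit index outside register width")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def generate_fault(register_names, width, max_step, rng):
    """Pick when (step) and where (register, bit) to inject a fault."""
    return {
        "step": rng.randrange(max_step),  # stands in for a temporal breakpoint
        "register": rng.choice(list(register_names)),
        "bit": rng.randrange(width),
    }


def inject(registers, fault, width=16):
    """Apply the single bit-flip described by `fault` to a register file."""
    name = fault["register"]
    registers[name] = flip_bit(registers[name], fault["bit"], width)


# Hypothetical 16-bit register file; names and values are illustrative only.
registers = {"A": 0x0001, "X": 0x00F0, "SP": 0x3FFE, "PC": 0x4000}
before = dict(registers)

rng = random.Random(0)  # fixed seed so a campaign can be replayed
fault = generate_fault(registers.keys(), width=16, max_step=1000, rng=rng)
inject(registers, fault)

changed = [r for r in registers if registers[r] != before[r]]
```

Separating `generate_fault` from `inject` mirrors the split between deciding when and where a fault occurs and actually applying it; recording the generated fault alongside the active service would correspond to the tracing role described in the text.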

Figure 4: Characteristic of our bench mark
(Diagram of the six task groups. Group1: T1 to T4 sharing a global variable through a semaphore. Group2: T1 and T2 linked by the message queue QM. Group3: T1 and T2 linked by a message box. Group4: T1 to T3 sharing a global variable through a mutex. Group5: T1 to T4 on global variable 1 with Sem1, and T5, T6, T7 and T1 on global variable 2 with Sem2. Group6: T1 and T2 sharing a global variable through a mutex and feeding T3 through QM.)

4. Experimental Results
This section describes and analyzes the results obtained, to provide evidence of the consequences of soft-errors for a real-time application. Transient faults may cause several kinds of malfunction when the services of the real-time kernel are corrupted. These malfunctions are classified as follows:
Safe: no visible effect on system functionality.
Application failure: a class of faults with some effect at the application level. This class can be subdivided into (i) incorrect output results, (ii) real-time problems and (iii) process hanging (the system continues working but some processes stop their operations).
Application exception: one or more application tasks trigger an exception routine (e.g. illegal instruction, division by zero, etc.).
System crash: the system stops functioning.

4.1. Fault Injection Results
To evaluate eCos (SW-RTOS) and our proposed HW/SW-RTOS, and to assess the reliability and the vulnerability factor (VF) of each OS service, we applied the following fault injection rules:
During the execution of SW-RTOS and HW/SW-RTOS services, faults were randomly generated by the Fault Injection module and then injected into the CPU registers.
Operating system services such as task creation and task termination are not subjected to fault injection: during these services the Fault Injection module is idle.
The Fault Injection module is activated by a signal from the HW/SW-RTOS, through the data-exchanging mechanism, while services are in progress.

The impact of soft-errors on the different services provided by eCos (SW-RTOS) and by the eCos-based HW/SW-RTOS is illustrated in Figure 5 and Figure 6, respectively. The X axes of these figures show the classes of fault consequences defined at the beginning of Section 4, while the value axis (Y) shows their frequency of occurrence. Each group of services of eCos (SW-RTOS) and of the HW/SW-RTOS is depicted by its own column bar; for instance, the consequences of faults that affect services belonging to the synchronization group are shown by a bar of a single color. On average, 42.4% of faults have no visible effect on system behavior in the SW-RTOS, compared with 57.8% in the HW/SW-RTOS. Application failures account for 21.2% of the total in the SW-RTOS, and this fraction improves to 16.6% in the HW/SW-RTOS. Regarding system crashes, we see a 15% improvement in robustness against soft-errors.


Figure 5: Effect of Soft-Error in SW-RTOS
(Bar chart. X axis: Safe, Application Failure, System Crash, Exception. Y axis: frequency of occurrence, ticks 0 to 50. Series: Memory Management, Synchronization, Task Management, Time Management, Scheduler.)

Figure 7: Robustness of HW/SW-RTOS versus SW-RTOS
(Bar chart. X axis: Memory Management, Synchronization, Task Management, Time Management, Scheduler. Y axis ticks: 34 to 46.)

A remarkable feature of our results, apparent from Figures 5 and 6, is that all services provided by the HW/SW-RTOS are more robust than the same services provided by eCos (SW-RTOS). Figure 7 shows the effectiveness of the HW/SW-RTOS services in terms of reliability with respect to soft-errors.
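As a quick check on the averages reported in Section 4.1, the gap between the two architectures can be computed directly. The snippet below uses only the four percentages quoted in the text; the dictionary keys and the helper name `compare` are illustrative.

```python
# Average outcome shares (percent of injected faults) quoted in Section 4.1.
sw_rtos = {"safe": 42.4, "application_failure": 21.2}
hw_sw_rtos = {"safe": 57.8, "application_failure": 16.6}


def compare(before: float, after: float) -> tuple:
    """Return (difference in percentage points, relative change in percent)."""
    diff = after - before
    return round(diff, 1), round(100.0 * diff / before, 1)


safe_points, safe_relative = compare(sw_rtos["safe"], hw_sw_rtos["safe"])
fail_points, fail_relative = compare(
    sw_rtos["application_failure"], hw_sw_rtos["application_failure"]
)

print(f"safe: {safe_points:+} points ({safe_relative:+}% relative)")
print(f"application failure: {fail_points:+} points ({fail_relative:+}% relative)")
```

In absolute terms, the share of faults with no visible effect rises by 15.4 percentage points and the share of application failures falls by 4.6 percentage points when moving from the SW-RTOS to the HW/SW-RTOS.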

Figure 8 shows the hardware overhead of the different units of the HW/SW-RTOS. As the figure shows, the HW/SW-RTOS implementation imposes a hardware overhead of 12830 gates.

Figure 8: Hardware overhead of different HW/SW-RTOS units
(Bar chart. X axis: DataExchanger Unit, Scheduling Unit, TOTAL. Y axis: Number of Gates, ticks 0 to 14000.)

Figure 6: Effects of Soft-Error in HW/SW-RTOS

Services related to both synchronization and time management are considerably improved, as shown in Figure 7. These improvements can be attributed to the dedicated hardware synchronization part of our HW/SW-RTOS.

5. Conclusion
Real-time applications with safety-critical constraints are often based on real-time operating systems. Real-time operating systems are subject to faults that affect both the correctness of logical results and the timing of task responses. Hardware Real-Time Operating Systems (HW/RTOS) appeared in order to provide predictable response time at an affordable cost. In this paper, we analyzed the impact of soft-errors on real-time applications running under an RTOS implemented in HW/SW. Our experimental results show that soft-errors occurring in a real-time operating system (in either the SW or the HW kernel) have a major impact on system behavior. Moreover, it was found that all groups of eCos services have the same sensitivity profile. The experimental results also show that HW/SW-RTOS services are more robust against soft-errors than SW-RTOS services. In particular, the experiments show a considerable improvement in the robustness of the synchronization services provided by the HW/SW-RTOS compared with the SW-RTOS, due to the dedicated synchronization hardware.

6. Acknowledgment
The authors wish to acknowledge Iran Telecommunication Research Center (ITRC) for the partial financial support during the course of this research.

7. References
[1] S. Chandra, F. Regazzoni, and M. Lajolo, "Hardware/Software Partitioning of Operating Systems: a Behavioral Synthesis Approach," in Proc. GLSVLSI'06, pp. 324-329.
[2] V. J. Mooney and D. M. Blough, "A hardware-software real-time operating system framework for SoCs," IEEE Des. Test, vol. 19, no. 6, pp. 44-51, 2002.
[3] T. Nakano, A. Utama, M. Itabashi, A. Shiomi, and M. Imai, "Hardware implementation of a real-time operating system," in Proceedings of the 12th TRON Project International Symposium, pp. 34-42, 1995.
[4] Morton and W. M. Loucks, "A HW/SW kernel for SoC designs," in SAC '04: Proceedings of the 2004 ACM Symposium on Applied Computing, (New York, NY, USA), pp. 869-875, ACM Press, 2004.
[5] K. Baskaran, W. Jigang, and T. Srikanthan, "A Hardware Operating System based Approach for Run-time Reconfigurable Platform of Embedded Devices," in 6th Real Time Linux Workshop, (Singapore), Nov 3-5 2004.
[6] L. Lindh and F. Stanischewski, "Fastchart-idea and implementation," in ICCD, pp. 401-404, 1991.
[7] N. Ignat, B. Nicolescu, Y. Savaria, G. Nicolescu, "Soft-Error Classification and Impact Analysis on Real-Time Operating Systems," DATE 2006.
[8] Ph. Shirvani, R. Saxena, E.J. McCluskey, "Software implemented EDAC protection against SEUs," IEEE Transactions on Reliability, Vol. 49, No. 3, Sept. 2000.
[9] V. Izosimov, P. Pop, P. Eles, Z. Peng, "Design optimization of time- and cost-constrained fault-tolerant distributed embedded systems," Design, Automation and Test in Europe, Munich, Germany, 7-11 March 2005, pp. 864-869.
[10] S. Ghosh, R. Melhem, D. Mosse, J. Sarma, "Fault-tolerant Rate Monotonic Scheduling," Journal of Real-Time Systems, vol. 15, No. 2, September 1998.
[11] P. Mejia-Alvarez, D. Mosse, "A responsiveness approach for scheduling fault-recovery in real-time systems," 5th Real-Time Technology and Applications Symposium, 2-4 June 1999, pp. 4
[12] http://en.wikipedia.org/wiki/eCos
[13] B. Nicolescu, N. Ignat, Y. Savaria, G. Nicolescu, "Sensitivity of Real-Time Operating Systems to Transient Faults: A case study for MicroC kernel," IEEE Radiation and its Effects on Components and Systems, Cap d'Agde, France, Sept. 19-23, 2005.
[14] "Motorola HC12 CPU awareness and true-time simulation," Metrowerks Corp., 2004.
