SHUM-uCOS: A RTOS Using Multi-task Model to Reduce Migration Cost between SW/HW Tasks
Bo Zhou, Weidong Qiu, Yan Chen, Chenglian Peng
Department of Computer and Information Technology, Fudan University, Shanghai, China
allenzhou@xasumail.corn
Abstract
The design of embedded systems has become more complex than ever, and design quality depends more and more on the cooperation of multidisciplinary design teams: hardware engineers and software engineers in general. However, due to the lack of a uniform programming model and system components for these different teams, the cost of migrating a function model from software to hardware is high. Such migrations are nevertheless necessary in the hardware-software partitioning of embedded systems, especially in prototype designs. To cope with this problem, we adopt a uniform multi-task model and implement an RTOS (Real-Time Operating System), called SHUM-uCOS, which handles hardware functions in the same way as software tasks. This RTOS uses uCOSII as its prototype, and traces and manages the states of reconfigurable resources (FPGAs), which allows hardware tasks to execute in a true multitasking manner. Moreover, SHUM-uCOS also defines a standard hardware-task interface, which supports a share-bus protocol. Experiments show that SHUM-uCOS can shorten the migration time from software implementations to hardware implementations while improving performance.
1. Introduction
Embedded systems have expanded considerably in the last few years. With advances in silicon technology, more powerful devices (e.g., higher-frequency CPUs and larger memories) are available. At the same time, design complexity has increased dramatically, and design quality depends more on the effective cooperation of multidisciplinary design teams: hardware engineers and software engineers in general. But how should the designer decide where to place the work-dividing line between software engineers and hardware engineers? This is a well-known unsolved problem in embedded systems, called hardware-software partitioning. Currently, the dividing line is drawn by hand. An experienced system analyst would attempt to let hardware engineers implement the time-consuming components, thus maximizing execution speed. To determine which part is the performance bottleneck, we often need several product prototypes with different hardware-software dividing lines that realize the same functions in different ways. We then find the proper boundary between software and hardware by comparison. This procedure involves many migrations between software and hardware. However, due to the lack of a uniform programming model and system components for these different implementation methods, the cost of migrating a function implementation from software to hardware is normally high. Even a small task migration needs excessive modification, because it involves both design teams.
Recent developments in configurable devices, however, have increasingly blurred the traditional line between hardware and software. Using this exciting characteristic, it seems that we can reduce the migration cost greatly. An operating system is a reasonable place for the solution because it is the traditional boundary between hardware and software. Although commercial RTOSs available for popular embedded processors provide significant reductions in design time, they typically do not take advantage of the intrinsic parallelism of hardware tasks, probably because FPGAs and ASICs have historically been treated as hardware accelerators, for which the operating system provides only device drivers. To cope with this problem, we have adopted a uniform multi-task (thread) model and implemented an RTOS with the uCOSII [1] RTOS as its prototype, called Software Hardware Uniform Management uCOS (SHUM-uCOS). The basic concept of the multi-thread model was first discussed in [2], where it is proposed for hybrid chips containing both CPU and FPGA components in one chip.
We extend this model to embedded system designs composed of a host processor and several reconfigurable devices. This programming model allows hardware tasks on reconfigurable devices to execute in a truly parallel multitasking manner, which are organized like software
Authorized licensed use limited to: Bharat University. Downloaded on August 09,2010 at 08:25:53 UTC from IEEE Xplore. Restrictions apply.
The 9th International Conference on Computer Supported Cooperative Work in Design Proceedings.
tasks, and substantially decreases the migration time for a task from SW implementation to HW implementation.
Sumanth Donthi [3] classifies FPGAs into two categories. If only a portion of the chip is modified and the remaining logic operates normally without any disruption, then it is partially reconfigurable. If the whole chip is modified at once, with a total loss of the previous configuration and the state of the flip-flops, then it is fully reconfigurable. The main functions of SHUM-uCOS are task and resource management. Several recent publications deal with task and resource management problems, e.g., [4] and [5], and especially with the problem of finding placements for hardware tasks on a reconfigurable surface, e.g., [6] and [7]. However, their discussions focus mainly on partially reconfigurable FPGAs. Little attention seems to have been paid in operating systems to fully reconfigurable FPGAs, which currently hold a large share of the FPGA market. SHUM-uCOS deals with these devices and uses a preconfiguration table to increase the utilization of reconfigurable resources.
2. SHUM-uCOS Framework
SHUM-uCOS is an extended version of uCOSII that widens its management range by adding extra functions. It preserves most of the data structures of uCOSII, as well as its priority-based scheduling policy.
When dealing with software tasks only, SHUM-uCOS is almost the same as uCOSII. When hardware tasks are involved, SHUM-uCOS adopts the uniform multi-task model to manage them, as shown in Figure 1. The whole model is divided into three parts: the CPU, the hardware-task manager and the reconfigurable devices. The software tasks execute on the CPU and the hardware tasks run on the FPGAs. The software part of SHUM-uCOS includes the software task interface, the task scheduler and the resource manager. The hardware part of SHUM-uCOS is called the hardware-task manager; it is usually implemented in the FPGAs and includes the communication controller, the standard hardware-task interface, the configuration interface and the hardware-task configuration controller. In detail, SHUM-uCOS is composed of the following parts. Software task interface: a set of API functions. Designers interact with the operating system through these functions by calling system services, e.g., creating semaphores and mutexes. The hardware task preconfiguration table: to reduce the configuration cost at runtime, we obtain the configuration sequences of the configurable devices by analyzing the task graph statically. The scheduler uses this data to configure devices before the hardware tasks run. Scheduler: the core of the RTOS. It is responsible
Figure 1. SHUM-uCOS framework
for managing the states of tasks (HW and SW) and handling synchronous and asynchronous events, such as the scheduling of software tasks, the configuration of hardware tasks, and the synchronization between tasks. Resource manager: because hardware tasks are created and deleted dynamically, the usage of the reconfigurable resources changes constantly. The resource manager traces and records these changes, providing the scheduler with the information it needs to configure hardware tasks. Communication controller: this module handles the low-level communication details and translates commands into binary signals according to the application, e.g., the number of hardware tasks. Hardware-task configuration database: this database contains all the hardware-task configuration data, which is synthesized ahead of time. Hardware-task configuration controller: after receiving a configuration command from the scheduler, the controller retrieves the corresponding configuration data from the database and configures the devices. Because the workload is light, a 4-bit or 8-bit microcontroller can be used as the configuration controller. Hardware task interface: it supplies the communication controller with the standard signals and protocols. Hardware task implementation: it includes all the function modules in the FPGAs, which are described in Section 3.3.
each task group is smaller than that of the reconfigurable device in which the task group will be placed. 2. From the temporal point of view, we must schedule the task groups so that they need the minimum number of reconfigurable devices. Both the grouping and the scheduling of a DAG are NP-complete problems [8][9][10]. Paper [9] discusses the task-group partitioning problem in detail and proposes two algorithms: a level based partitioning algorithm and a clustering based partitioning algorithm. The former mainly exposes the parallelism hidden in the graph nodes, while the latter aims to decrease the communication overhead, i.e., the number of terminal edges resulting from partitioning. In the multiprocessor field, there has already been much discussion of how to extract parallelism by analyzing the task graph statically, and numerous methods have been proposed, such as the MCP algorithm and the DCP algorithm [10]. With the above in mind, the basic idea for generating the preconfiguration table is as follows: first, divide the hardware tasks into task groups that fit into the reconfigurable devices; then view the configuration procedures as tasks with deadlines. Finally, we obtain the preconfiguration table by scheduling these tasks. The following steps describe this procedure in detail (an example is shown in Figure 2).
1. Remove the software-task nodes from the original task graph G1 to obtain a task graph G2 containing only hardware tasks. The precedence relations between hardware tasks in G2 must be kept the same as in G1. For example, Figure 2(a) contains three tasks, T4, T8 and T12, where each task depends on the one before it, and T8 is the only software task. Thus, we must remove T8 and keep T12 dependent on T4 in Figure 2(b).
2. Replace each hardware task node Ti with a configuration task node Ci; the deadline of Ci equals the arrival time of Ti minus the configuration time.
3. According to the level based partitioning algorithm [9], get the task groups under the area constraint.
4. Merge each task group into one configuration node.
5. Use the DCP algorithm [10] to schedule the configuration nodes; the result is the preconfiguration table.
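Steps 1 and 2 of the procedure can be illustrated with a minimal C sketch. The `TaskGraph` layout, the adjacency-matrix representation, and the function names are assumptions made for illustration; the paper does not specify how SHUM-uCOS stores the task graph. Removing a software node reconnects each of its predecessors to each of its successors, so precedence between the remaining hardware tasks is preserved.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TASKS 16

/* Hypothetical task-graph representation (not taken from the paper). */
typedef struct {
    int  n;                            /* number of tasks                  */
    bool is_hw[MAX_TASKS];             /* true for hardware tasks          */
    bool edge[MAX_TASKS][MAX_TASKS];   /* edge[i][j]: task j depends on i  */
    int  arrival[MAX_TASKS];           /* arrival time of each task        */
    int  config_time[MAX_TASKS];       /* time needed to configure a task  */
} TaskGraph;

/* Step 1: remove every software node (G1 -> G2). Each predecessor of a
 * removed node is linked to each successor, then the node's edges are cut. */
void remove_sw_nodes(TaskGraph *g)
{
    assert(g->n <= MAX_TASKS);
    for (int k = 0; k < g->n; k++) {
        if (g->is_hw[k])
            continue;
        for (int i = 0; i < g->n; i++) {
            if (!g->edge[i][k])
                continue;
            for (int j = 0; j < g->n; j++)
                if (g->edge[k][j])
                    g->edge[i][j] = true;   /* bypass the software node */
        }
        for (int i = 0; i < g->n; i++) {
            g->edge[i][k] = false;
            g->edge[k][i] = false;
        }
    }
}

/* Step 2: deadline of configuration node Ci = arrival(Ti) - configuration time. */
int config_deadline(const TaskGraph *g, int i)
{
    return g->arrival[i] - g->config_time[i];
}
```

For the chain T4, T8, T12 with T8 as the only software task, calling `remove_sw_nodes` leaves a single edge from T4 to T12, matching the example in step 1.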
To reduce the cost of resource configuration, when a resource no longer contains any active task, the scheduler sets its state to preconfiguration instead of putting it into the blank state directly. When a preconfiguration miss subsequently occurs, the resource is moved to the blank state. This approach inserts a preconfiguration state between the used state and the blank state when recycling reconfigurable resources, which makes resource recycling work much like a cache for memory. As a result, it improves the preconfiguration efficiency.
The procedure demonstrated in Figure 2 generates the preconfiguration table, but there is no guarantee that the result is optimal. However, optimization is not the focus of SHUM-uCOS. In any case, the reconfigurations of the reconfigurable devices are always beneficial to the execution of hardware tasks.
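The three-state recycling scheme described above can be sketched as a small state machine in C. The type and function names (`ResState`, `Resource`, `res_acquire`, `res_task_exit`) and the `group_id` bookkeeping are hypothetical; the sketch only models the state transitions and stands in for the real device configuration with a field assignment.

```c
#include <assert.h>

/* Hypothetical resource states: blank, preconfigured (idle but still
 * holding its configuration), and used. */
typedef enum { RES_BLANK, RES_PRECONFIG, RES_USED } ResState;

typedef struct {
    ResState state;
    int      group_id;      /* task group currently loaded, -1 if none */
    int      active_tasks;  /* hardware tasks currently running on it  */
} Resource;

/* Called when a hardware task on r terminates. An idle resource keeps its
 * configuration and enters the preconfiguration state, not the blank state. */
void res_task_exit(Resource *r)
{
    assert(r->state == RES_USED && r->active_tasks > 0);
    if (--r->active_tasks == 0)
        r->state = RES_PRECONFIG;
}

/* Called when a task of group `group_id` is about to run on r.
 * Returns 1 on a preconfiguration hit (no reconfiguration needed) and 0 on
 * a miss. On a miss the caller must pass an idle resource. */
int res_acquire(Resource *r, int group_id)
{
    int hit = (r->state != RES_BLANK && r->group_id == group_id);
    if (!hit) {
        assert(r->active_tasks == 0);
        r->state = RES_BLANK;      /* miss: the old configuration is dropped */
        r->group_id = group_id;    /* stands in for the real configuration   */
    }
    r->state = RES_USED;
    r->active_tasks++;
    return hit;
}
```

A resource that is released and then requested again by the same task group returns a hit, which is the cache-like behaviour the preconfiguration state is meant to provide.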
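To make the grouping step more concrete, the following C sketch shows a simplified greedy grouping by graph level under an area constraint. It is not the level based partitioning algorithm of [9]; it is an illustrative stand-in that assumes tasks are indexed in topological order and that each task has a known area. All names (`compute_levels`, `group_by_level`, `MAX_HW`) are hypothetical, and, in line with the remark above, the result is not guaranteed to be optimal.

```c
#include <assert.h>

#define MAX_HW 16

/* level[j] = length of the longest dependency chain ending at task j.
 * edge[i][j] != 0 means task j depends on task i. Assumes i < j for every
 * edge (tasks indexed in topological order), a simplifying assumption. */
void compute_levels(int n, int edge[MAX_HW][MAX_HW], int level[MAX_HW])
{
    assert(n <= MAX_HW);
    for (int j = 0; j < n; j++) {
        level[j] = 0;
        for (int i = 0; i < j; i++)
            if (edge[i][j] && level[i] + 1 > level[j])
                level[j] = level[i] + 1;
    }
}

/* Greedy grouping: walk the levels in order and open a new group whenever
 * the level changes or the next task would exceed the device capacity.
 * Writes group[i] for every task and returns the number of groups. */
int group_by_level(int n, const int level[MAX_HW], const int area[MAX_HW],
                   int capacity, int group[MAX_HW])
{
    int groups = 0, max_level = 0;
    for (int i = 0; i < n; i++)
        if (level[i] > max_level)
            max_level = level[i];

    for (int l = 0; l <= max_level; l++) {
        int used = capacity + 1;            /* forces a new group per level */
        for (int i = 0; i < n; i++) {
            if (level[i] != l)
                continue;
            assert(area[i] <= capacity);    /* a task must fit on one device */
            if (used + area[i] > capacity) {
                groups++;
                used = 0;
            }
            used += area[i];
            group[i] = groups - 1;
        }
    }
    return groups;
}
```

Tasks on the same level have no dependencies on each other, so placing them in one group exposes their parallelism, while the capacity check enforces the area constraint.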
queue and message-box queue support for the hardware task at present. The hardware-task implementation uses a share-bus protocol: the hardware task can access the main memory whenever the shared bus is available. If the main memory uses standard memory timing, no timing-conversion layer is needed. Four parts compose the primitive operation layer in the standard hardware task implementation. The data pathway: connected to the main memory, it allows the function entity to access the data stored in memory. The control pathway: connected to the DMA (direct memory access) signals of the CPU or the bus arbiter, it handles bus requests and releases. The initialization pathway: connected to the hardware-task controller, it is used to initialize the internal registers of the primitive layer after a hardware task is created. Hardware state controller: the core of the primitive operation layer. It interprets CPU commands, controls the hardware task state and reports the task status. The primitive layer has no local registers or memory; all data is stored in the main memory. Each hardware task has a Task Interface Control Block (TICB) data structure that defines its control registers, which are mapped into the main memory.

    typedef struct os_ticb {
        INT32U Receive_Cmd;   // command received from CPU
        INT32U Send_Req;      // request sent to CPU
        INT32U Return_Code;   // the result code
        INT32U Param_Reg;     // command parameter
        INT32U Pointer_Reg;   // the pointer to data frame
        INT32U Len_Reg;       // the length of data frame
    } OS_TICB;

After the hardware task is created, the task ID and the start address of the TICB are saved into registers of the hardware task. These parameters may change at runtime. Only with the start address of the TICB can the task state controller access the data in memory. When a command needs to be sent to a hardware task, the CPU first writes the command into the memory location of the Receive_Cmd parameter in the TICB, then sets Cmd_Aquire in the TICB to tell the hardware task that there is a new command. Finally, the hardware task requests the bus and obtains the data. When a hardware task needs a CPU service, it writes the service type into the memory location of the Send_Req parameter in the TICB, then raises an interrupt to notify the CPU. Finally, according to the Send_Req parameter in the TICB, the CPU selects the proper service function.
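The two handshakes described above can be simulated in plain C to show the order of operations. This is a sketch under assumptions: `INT32U` is mapped to `uint32_t` as a stand-in for the uCOSII type, the `HwTaskPort` wrapper and its `cmd_aquire` and `irq` flags are hypothetical models of the Cmd_Aquire signal and the interrupt line, writing Param_Reg together with the command is an assumption, and the bus request is reduced to a comment.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t INT32U;   /* stand-in for the uCOSII type of the same name */

typedef struct os_ticb {
    INT32U Receive_Cmd;    /* command received from CPU      */
    INT32U Send_Req;       /* request sent to CPU            */
    INT32U Return_Code;    /* the result code                */
    INT32U Param_Reg;      /* command parameter              */
    INT32U Pointer_Reg;    /* the pointer to data frame      */
    INT32U Len_Reg;        /* the length of data frame       */
} OS_TICB;

/* Hypothetical model: the TICB plus the two notification signals. */
typedef struct {
    OS_TICB ticb;
    int     cmd_aquire;    /* set by the CPU: a new command is pending      */
    int     irq;           /* set by the hardware task: service requested   */
} HwTaskPort;

/* CPU side: write the command into the TICB, then signal the hardware task. */
void cpu_send_cmd(HwTaskPort *p, INT32U cmd, INT32U param)
{
    p->ticb.Receive_Cmd = cmd;
    p->ticb.Param_Reg   = param;
    p->cmd_aquire       = 1;
}

/* Hardware side (simulated): fetch a pending command. Returns 1 if one was
 * read. On real hardware this read would follow a bus request. */
int hw_poll_cmd(HwTaskPort *p, INT32U *cmd, INT32U *param)
{
    if (!p->cmd_aquire)
        return 0;
    *cmd   = p->ticb.Receive_Cmd;
    *param = p->ticb.Param_Reg;
    p->cmd_aquire = 0;
    return 1;
}

/* Hardware side (simulated): write the service type, then raise the interrupt. */
void hw_request_service(HwTaskPort *p, INT32U service)
{
    p->ticb.Send_Req = service;
    p->irq = 1;
}

/* CPU side: body of the interrupt handler. Returns the service type to
 * dispatch, or 0 if no request is pending. */
INT32U cpu_handle_irq(HwTaskPort *p)
{
    if (!p->irq)
        return 0;
    p->irq = 0;
    return p->ticb.Send_Req;
}
```

Because all control registers live in main memory, both sides only need the start address of the TICB, which is why the sketch passes a single pointer to every function.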
[Table columns: SHUM-uCOS | uCOS | Remark (HWT: Hardware task); table body not recoverable]
[Table 2: lost-frame ratio of the software-task and hardware-task implementations; values not recoverable]
Table 2 shows that the lost-frame ratio decreases dramatically after the compression (decompression) task migrates from software to hardware. Any migration from SW to HW is apt to increase system performance, and the more important the migrated function is, the greater the benefit. With SHUM-uCOS, however, such migrations become more natural and affect the other parts of the system less. In this case study, we changed only 13 locations to migrate the compression/decompression functions from software to hardware successfully, which exceeded our expectations.
5. Conclusion
We implemented an RTOS based on the multi-task model. The aim of this approach is to provide a uniform platform for both software and hardware engineers and to reduce the migration cost in embedded system design, where migration is a time-consuming step in the whole design flow. SHUM-uCOS traces and manages the states of reconfigurable resources (FPGAs), allowing hardware tasks to execute in a true multitasking manner. The Rhealstone benchmarks show that SHUM-uCOS has almost the same performance as uCOSII when dealing with software tasks only. Furthermore, it can also handle hardware tasks. The thirteen modifications in our VoIP case study show that SHUM-uCOS can greatly shorten the migration time while improving performance.
References
[1] Jean J. Labrosse. MicroC/OS-II: The Real-Time Kernel, Second Edition. CMP Books, 2002.
[2] David Andrews and Douglas Niehaus. Programming Models for Hybrid FPGA-CPU Computational Components: A Missing Link. IEEE Micro, Volume 24, Issue 4, July-Aug. 2004, pp. 42-53.
[3] S. Donthi and R. L. Haggard. A Survey of Dynamically Reconfigurable FPGA Devices. Proceedings of the 35th Southeastern Symposium on System Theory, 16-18 March 2003, pp. 422-426.
[4] O. Diessel, H. ElGindy, M. Middendorf, H. Schmeck, and B. Schmidt. Dynamic scheduling of tasks on partially reconfigurable FPGAs. IEE Proceedings on Computers and Digital Techniques, volume 147, pp. 181-188, May 2000.
[5] Katherine Compton, James Cooley, Stephen Knol, and Scott Hauck. Configuration Relocation and Defragmentation for Reconfigurable Computing. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM). IEEE CS Press, April 2003.
[6] Kiarash Bazargan, Ryan Kastner, and Majid Sarrafzadeh. Fast Template Placement for Reconfigurable Computing Systems. IEEE Design and Test of Computers, volume 17, pp. 68-83, 2000.
[7] Herbert Walder, Christoph Steiger, and Marco Platzner. Fast Online Task Placement on FPGAs: Free Space Partitioning and 2D-Hashing. In Proceedings of the 10th Reconfigurable Architectures Workshop (RAW). IEEE CS Press, April 2003.
[8] Thomas H. Cormen and Charles E. Leiserson. Introduction to Algorithms. The MIT Press, 2001, pp. 1043-1054.
[9] Karthikeya M. Gajjala Purna and Dinesh Bhatia. Temporal Partitioning and Scheduling Data Flow Graphs for Reconfigurable Computers. IEEE Transactions on Computers, 1999, pp. 579-590.
[10] Y. K. Kwok and I. Ahmad. Dynamic critical-path scheduling: An effective technique for allocating task graphs to multiprocessors. IEEE Trans. on Parallel and Distributed Systems, 1996, 7(5): 506-521.
[11] Rabindra P. Kar. Implementing the Rhealstone Real-time Benchmark. Dr. Dobb's Journal, Sep. 1990.
[12] Altera Corporation. Cyclone Programmable Logic Device Family Datasheet, 2003, http://www.altera.com.
[13] Peng Cheng-lian and Zhou Bo. SOPC design and practice using NOS. Beijing, Tsinghua Press, 2004.