
2009 ACM/IEEE/SCS 23rd Workshop on Principles of Advanced and Distributed Simulation

Parallelizing SystemC Kernel for Fast Hardware Simulation on SMP Machines


Ezudheen P.
National Institute of Technology Calicut
ezudheen@gmail.com

Priya Chandran
National Institute of Technology Calicut
priya@nitc.ac.in

Joy Chandra
Intel Corporation
joy.chandra@intel.com

Biju Puthur Simon
Intel Corporation
biju.p.simon@intel.com

Deepak Ravi
Intel Corporation
deepak.ravi@intel.com

Keywords: SystemC, TLM, SoC, SMP, OSCI, Core affinity

1087-4097/09 $25.00 © 2009 IEEE. DOI 10.1109/PADS.2009.25

Abstract

SystemC is a system-level modeling language and simulation framework that facilitates the design and verification of processors at different levels of abstraction. SystemC has recently become a popular choice among designers of both System-on-Chip (SoC) and embedded processors, owing to its adaptability at both the cycle and transaction levels and its ability to model concurrent processes. However, the single-threaded simulation kernel inherent to SystemC prevents it from exploiting the computing power of symmetric multiprocessing (SMP) machines to speed up hardware simulation. We present a parallel SystemC simulation kernel, implemented using parallel programming techniques, that leverages the parallel execution capabilities of multi-core machines to accelerate hardware simulation. We discuss the mechanism we use for mapping parallel SystemC modules onto different cores. Finally, we report the performance of the parallelized SystemC kernel on a linear pipelined performance model and on a pipelined performance model tailored to exhibit the behavior of a real-world simulation. Our results demonstrate that the performance improvement obtained by using the parallelized kernel for these models is significant, and that it grows with the design complexity of the simulated design and with the number of cores in the machine running the simulation.

1. Introduction

Simulation of processors at various stages of the design process is the technique popularly adopted for verification and validation of processor designs. Simulation time is a crucial factor affecting the time-to-market and, in turn, the commercial success of the designed processor. The complexity of new System-on-Chip (SoC) and embedded processor designs tends to inflate simulation times, contributing a significant increase to their time-to-market. One reason for large simulation times is that system-level modeling languages and simulation platforms are typically single-threaded, preventing them from utilizing the parallel computing infrastructure inherent in symmetric multiprocessing architectures such as multi-core processors.

Task-level parallelism can be exploited by running several possible experiments in parallel on different machines, but the presence of task-level dependencies in processor simulation limits the scope of such techniques. Parallelizing the hardware simulation kernel, in contrast, increases the speed of individual simulation runs. In this paper, we present a scheme for parallelizing the simulation kernel and report the performance benefits of the parallelized version over the conventional one on multi-core machines.

The rest of this paper is organized as follows. Section 2 presents an overview of existing attempts at parallelizing SystemC, together with a description of the OSCI (Open SystemC Initiative) SystemC-2.2.0 scheduler. Section 3 describes our parallel kernel, including an overview of the techniques we use to parallelize the OSCI SystemC-2.2.0 scheduler. We also describe the strategy of setting core affinity for simulation modules and the manual grouping of SystemC modules, which resists speed-up degradation as the number of cores increases. Section 4 presents our experimental setup for the performance analysis of parallelized SystemC on a pipelined performance model in which the number of modules and the amount of computation inside each module can be varied. We also demonstrate the performance improvement for a parallelized benchmark SystemC simulation tailored to exhibit the behavior of a real graphics hardware simulation. We conclude with a note on future work in this area in Section 5.

2. Background
Chopard et al. [4] present a functional parallel kernel which runs multiple copies of the SystemC scheduler on a large number of inexpensive machines. The partitioning of the hardware design is manual and has to be fairly well balanced; it requires the definition of a module hierarchy and has the additional drawback of a high communication overhead. In [5], a SystemC Distribution Library is introduced for geographical distribution of an arbitrary number of SystemC simulations, and it can also be adapted for multi-core architectures. However, it supports only distributed functional and approximately-timed TLM (Transaction Level Modeling) simulation, and cannot be used for cycle-level simulations. Heterogeneity in the SystemC language and simulation framework is introduced in [9] by implementing an SDF (Synchronous Data Flow) kernel extension in SystemC; efficiency is gained through the concurrency of the SDF model. A fast SystemC engine is proposed in [10] by introducing a new scheduling strategy which combines features of SystemC's dynamic scheduling with static scheduling, but it requires fixing a unique execution order for SystemC processes. A method to map VHDL models to PDES (Parallel Discrete Event Simulation) is discussed in [8], and [7] introduces the architecture of a parallel Verilog simulator; however, parallel HDL simulations are communication intensive [7]. The static partitioning algorithm for parallel VHDL simulation in [12] places tightly connected parts on different processors, which does not scale to a large number of cores because inter-process communication delays are added to the simulation time. Low-level schemes in general do not scale well to complex designs and are not suitable for cycle-level exploration. Module-based parallel simulation objects can provide a comparatively better computation-to-communication ratio. In our work, we explore the benefits of applying parallel programming techniques to the OSCI (Open SystemC Initiative) SystemC-2.2.0 scheduler.

Figure 1. SystemC Scheduler

2.1. OSCI SystemC Kernel


A SystemC description of a design, and the generated simulator, have two parts: the structural part and the behavioral part. The structural part describes the way components (hardware and software) are connected to each other; this translates into modules connected via channels [1, 4] in the simulator. The behavioral part influences the simulation of the design by controlling the notion of processes [1, 4]. Execution of processes is driven by simulation events such as the positive edge of a clock, a channel update, and so on.

Figure 1 shows the flowchart of the OSCI SystemC-2.2.0 scheduler. All processes (the executable code corresponding to components) are initially made runnable, i.e., put into the ready-to-execute state. All runnable processes are then executed in sequence. If any immediate event occurs, all processes sensitive to that event are made runnable. Execution of runnable processes and the triggering (processing) of immediate events constitute the evaluation phase, which continues until no more runnable processes exist.

After the evaluation phase, the scheduler executes all update requests; this stage is called the updation phase. Update requests are generated in the evaluation phase when a process writes a new value into a channel. Channel updation can generate delta events and timed events. This evaluation-updation paradigm means that processes see new values not immediately, but before the next delta cycle [5].

After completing the updation phase, the scheduler moves to the delta notification phase if any delta notifications exist; otherwise, it moves to the timed event notification phase. In the delta notification phase all active delta events are triggered, which makes all processes sensitive to these events runnable. The whole sequence from the evaluation phase to the delta notification phase is called a delta cycle, and every delta cycle advances the simulation by one delta time. The scheduler returns to the evaluation phase after the delta notification phase.

In the timed event notification phase, simulation time is advanced to the earliest timed event, and all processes sensitive to the current timed events are made runnable. After the timed event notification phase, the simulation moves back to the evaluation phase. The simulation terminates when no timed events exist. In the next section, we present our design for a parallel SystemC kernel, obtained by parallelizing the scheduler of OSCI SystemC-2.2.0.

Figure 2. Parallel SystemC scheduler

3. Parallel SystemC Kernel

Parallelization of the SystemC kernel exploits a SystemC semantic rule: within a delta cycle, the execution order of the processes is not predefined [1, 4]. OSCI SystemC executes only one process at any time, although the hardware supports execution of concurrent processes [5]. We modify the existing SystemC scheduler to execute multiple runnable processes at a time. To ensure that concurrent processes use coherent inputs and outputs, the scheduler first runs all processes that are ready to execute, and only then updates their changes on the channels. Figure 2 illustrates our design of the parallel SystemC scheduler, which we describe next.

3.1. Parallel SystemC scheduler

We modified the scheduler to execute all runnable processes in parallel, so that the time taken to execute the entire set of runnable processes can be reduced roughly linearly with the number of cores available on the machine. Sequential execution of runnable processes contributes a significant part of the serial execution time, close to 99% for a complex and large hardware simulation.

All runnable processes are pushed into an array at the beginning of the evaluation phase (see Figure 2). The scheduler creates a multiple-execution environment, which maintains the state information of the currently executing runnable processes across multiple threads. The threads are executed in parallel on multiple CPUs, and a unique process execution environment is assigned to every CPU. Each thread can execute runnable processes and trigger immediate notifications independently of the other threads. A set of runnable processes, called a chunk, is allocated to a thread. The mapping of threads to CPUs can be done in different ways, depending on the parallelization technique; we describe chunk divisions and parallelization techniques in Section 3.2.

After evaluation of all the runnable processes, the scheduler moves to the updation phase if no runnable processes exist; otherwise the evaluation phase is restarted. The updation phase and subsequent phases are common to both the serial and parallel SystemC schedulers.
3.1.1 Synchronization issues

All runnable processes are pushed into an array before the parallel execution starts; hence, no synchronization mechanisms are needed at the initialization stage of the parallel SystemC scheduler. SystemC semantics allow multiple processes to read from a channel, but only one process to write into a channel; this process is termed the owner of the channel [1]. However, when multiple runnable processes execute in parallel and create update requests at the same instant, several processes may try to push update requests into the update queue simultaneously, creating the potential for data inconsistency. We have implemented adequate synchronization mechanisms to prevent this. The update queues are implemented as linked lists, and we made the updating of the queue header (linking a new node into the update queue) a critical section using the OpenMP compiler directive #pragma omp critical. Similar synchronization mechanisms avoid data inconsistency when multiple SystemC processes trigger immediate notifications and create new runnable processes at the same instant.

3.2. Parallelization Techniques

3.2.1 Work Sharing

The master thread creates a number of threads equal to the number of chunks. In a work-sharing algorithm [11], whenever such threads are created to evaluate the runnable processes in the array, the created threads migrate to other processors. The reasoning behind this heuristic is that work gets distributed to under-utilized processors; this is a push-model multi-threading technique. The chunk sizes can be chosen in one of the following ways.

1. Fixed-size chunks: Interleaving runnable processes in the array are grouped into multiple chunks of fixed size (e.g., four runnable processes per chunk). The entire array of runnable processes is treated as a set of chunks, and each chunk is assigned to a different thread.

2. Static load balancing: The chunk size is equal to n/t, where n and t are the number of runnable processes in the array and the number of CPUs, respectively. Static allocation assigns n/t contiguous runnable processes to each thread.

3.2.2 Work Stealing

The work-stealing algorithm [3], in contrast, is a pull-model multi-threading technique. In this method, the under-utilized processors take the initiative to steal threads from processors that have more than one ready thread [2]. In work stealing, the chunk size determines the number of runnable processes assigned to every thread.

3.2.3 Manual Grouping

In the manual grouping strategy, the user manually groups modules or methods to create groups of runnable processes. Each runnable process of a group executes on the same core in every clock cycle, i.e., for the entire simulation time. This improves locality of reference, reducing L1 and L2 cache misses and memory contention; the improved memory performance results in a higher speed-up of computation.

We added a new API to the SystemC kernel, void set_core_affinity(int), which can be used to set the affinity of a module or method to a particular core. All modules and methods having the same core affinity are treated as a group. The grouping of modules or methods works as follows: the sub-modules and methods of a module are automatically placed in the same group as the module itself; but if a particular method inside a module is added to a particular group, only that method is added to that group. A user can add another method of the same module to a different group if desired.

To implement the manual grouping scheme, we create an array of vectors for storing runnable processes. Each vector corresponds to a group, and each vector is mapped to a different core using parallel threads. The #pragma omp parallel compiler directive creates the parallel threads, and the thread number is used to map vectors to cores. Before evaluation, each vector associated with a core stores the runnable processes assigned to that core. During evaluation, the processes of each group execute serially on their own core, independent of the other groups.

We next describe our experimental set-up for evaluating the various parallelization schemes in the parallel kernel, and present the performance results.

4. Performance Analysis of Parallel SystemC Kernel

The number of modules in the performance model can be varied, and each module has a process sensitive to a common clock. The process can model a variable amount of computation: it reads values from its input channels and writes their sum to its output ports.

This performance model provides the flexibility to abstract the behavior of any real hardware simulation by tailoring the connections among modules, the module activations, and the computation within modules. We analyze the speed-up at different magnitudes of hardware complexity by varying the number of modules and the computation inside each module. The results help to identify lower bounds for achieving speed-up in parallel simulation, expose parallelization bottlenecks of existing simulations, predict speed-up for existing simulations, and guide the design of simulations that achieve better speed-up.

4.1. Experimental Set-up

4.1.1 Compiler

We used the gcc-4.3.1 compiler and included the parallel/algorithms.h, parallel/settings.h, and omp.h header files, with the following global parameter settings:

Work sharing chunk size = 8
Work stealing chunk size = 8
Minimum number of elements for parallel operation = 16
Algorithm strategy = forced parallel
4.1.2 Hardware

We used an SMP machine with four quad-core processors (Intel Tigerton). The Intel quad-core Xeon 7300-series "Tigerton" is a four-socket processor (packaged in Socket 604) consisting of two dual-core Core2-architecture silicon chips on a single ceramic module.

CPU speed: 2.93 GHz
L1 cache (Harvard architecture): (4x) 32 KB instruction cache; 32 KB data cache
L2 cache: (2x) 2 MB
L3 cache: 3 GB
FSB: 1066 MHz
4.1.3 Computations

Computation is generated by executing bursts of asm volatile("nop") instructions, which are invariant under all compiler optimizations. We assume that the execution of a thousand such instructions takes on the order of one microsecond of CPU time; the actual computation time depends on the underlying hardware.

4.2. Performance Models Used

We use a pipelined performance model as the system to be simulated. Figure 3 illustrates our performance model.

Figure 3. Performance Model

4.3. Results

We first assess the effect of design complexity, represented by the number of modules and the frequency of computations in the performance model, on the speed-up. The speed-up reported is with reference to execution on a single-core machine. The parallelization strategy used is manual partitioning, on a 16-core machine.

Figure 4. Effect of Design Complexity (number of modules) on Speed-up

Figure 4 illustrates the effect of varying the number of modules at a fixed computation frequency. We observe that the speed-up is almost linear in log(number of modules) until it approaches the number of cores. The figure also shows the lower bound on computation frequency per module for achieving a speed-up.

Figure 5. Effect of Design Complexity (computations) on Speed-up

Figure 5 demonstrates the speed-up for a set of performance models obtained by varying the computations per module over a set of different module counts, using the parallel SystemC kernel with manual partitioning on a 16-core SMP machine. The lower bound on the number of modules for achieving a speed-up greater than one can also be noted: simulations whose average computation per module exceeds 10 microseconds exhibit significant speed-up.

Figure 6. Effect of Parallelization Techniques on Speed-up
Figure 6 demonstrates the speed-up for a performance model with 128 modules and 40 microseconds of computation per module on the parallel SystemC kernel, using different parallelization techniques on a varying number of cores. The parallel SystemC kernel implementation using manual grouping shows a considerable speed-up improvement over the other parallelization techniques when the number of cores is more than eight. The performance of the work-stealing algorithm degrades with an increasing number of cores, due to its high multi-threading overhead. Except for manual grouping, all the schemes show a degradation with increasing number of cores after an initial improvement. We observe that the speed-up achieved by parallel simulation depends on the number of cores, the underlying hardware complexity, the average computation per module, the number of modules per core, and the overhead per thread.

Figure 7 demonstrates the CPU utilization for a performance model with 128 modules and 40 microseconds of computation per module, using the parallel SystemC kernel with different parallelization techniques on a varying number of cores. CPU utilization decreases as the number of cores increases, for all parallelization techniques, and the difference in CPU utilization among the techniques becomes prominent as the number of cores increases.
Figure 8 demonstrates the difference between the speed-up achieved and the increase in CPU utilization for a performance model with 128 modules and 40 microseconds of computation per module, using the parallel SystemC kernel with manual partitioning on a varying number of cores. The speed-up achieved is higher than the increase in CPU utilization.

When a problem with a data-set of size d (represented in our experiment by d modules) is solved by n CPUs, the data-set is evenly distributed among the n CPUs:

Size of the data-set per CPU for serial execution = d
Size of the data-set per CPU for parallel execution = d/n

Since every CPU has a separate L1 cache, the L2 cache is shared between two CPUs, and core affinity is set for the modules, L1 cache misses (per cache) ideally reduce by a factor of n and L2 cache misses by a factor of n/2. Hence, the time saved on servicing cache misses contributes to the increase in speed-up for parallel simulation by manual partitioning. The efficiency (performance per watt) of parallel SystemC simulation depends on the ratio between the speed-up achieved and the increase in CPU utilization.

4.4. Performance Analysis on a Graphics Benchmark

We corroborate the above results with a performance analysis on a multi-dimensional pipelined performance model which exhibits the behavior of real graphics hardware on a benchmark program. The simulation model has 70 modules; around 25 modules are active in every delta clock cycle, and each module performs a varying computation of between 1 and 100 microseconds. The connections among the modules, and the varying module activation and computation within the modules, are made exactly as in the graphics hardware simulation.

Figure 7. CPU Utilization (%) with different Parallelization Techniques

Figure 8. Difference between Speed-up and CPU Utilization

Figure 9. Speedup for Graphics Hardware Simulation

Figure 9 shows the speed-up for the above benchmark simulation using the parallel SystemC kernel with different parallelization techniques on 8 and 16 cores of an SMP machine. Since the computation per core decreases as the number of cores increases, only parallelization by manual grouping and static load balancing show higher speed-up with the larger number of cores.

5. Conclusion and Future Work

We have developed a parallel SystemC simulation kernel using different parallelization techniques and evaluated the performance of simulations on the parallel kernel running on multi-core machines. Our experimental results demonstrate that hardware simulation based on parallel SystemC is faster than hardware simulation based on serial SystemC, provided that one fourth of the total computation across all modules (representing the hardware complexity of the simulated design) is larger than the overhead per thread. We infer that parallelization benefits the simulation of complex designs. Among the parallelization techniques, manual grouping of modules shows a considerable improvement in speed-up over the others, especially when the number of cores is more than eight.

As automatic grouping is preferable to manual grouping, especially for larger designs, we propose that graph partitioning techniques [2, 6] could be used to develop algorithms for partitioning the modules for assignment to different cores.

References

[1] Approved IEEE Draft Standard SystemC Language Reference Manual (superseded by 1666-2005). IEEE Std P1666/D2.1.1, 2005.
[2] K. Andreev and H. Racke. Balanced Graph Partitioning. 39(6):929-939, November 2006.
[3] R. D. Blumofe and C. E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. In Proceedings of the Thirty-fifth Annual Symposium on Foundations of Computer Science (FOCS), pages 356-368, 1994.
[4] B. Chopard, P. Combes, and J. Zory. A Conservative Approach to SystemC Parallelization. In Proceedings of the Workshop on Scientific Computing in Electronics Engineering, pages 653-660, May 2006.
[5] K. Huang, I. Bacivarov, F. Hugelshofer, and L. Thiele. Scalably Distributed SystemC Simulation for Embedded Applications. In Proceedings of the International Symposium on Industrial Embedded Systems (SIES 2008), pages 271-274, June 2008.
[6] G. Karypis and V. Kumar. Parallel Multilevel k-way Partitioning Scheme for Irregular Graphs. In Proceedings of the 1996 ACM/IEEE Conference on Supercomputing, pages 35-35, 1996.
[7] T. Li, Y. Guo, and S.-K. Li. Design and Implementation of a Parallel Verilog Simulator: PVSim. In Proceedings of the Seventeenth International Conference on VLSI Design, pages 329-334, 2004.
[8] E. Naroska. Parallel VHDL Simulation. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE 98), pages 159-165, Washington, DC, USA, 1998. IEEE Computer Society.
[9] H. D. Patel and S. K. Shukla. Towards a Heterogeneous Simulation Kernel for System Level Models: A SystemC Kernel for Synchronous Data Flow Models. In Proceedings of the IEEE Computer Society Annual Symposium on VLSI, pages 241-242, February 2004.
[10] D. G. Perez, G. Mouchard, and O. Temam. A New Optimized Implementation of the SystemC Engine using Acyclic Scheduling. In Proceedings of the Conference on Design, Automation and Test in Europe, pages 552-557, February 2004.
[11] M. J. Quinn. Parallel Programming in C with MPI and OpenMP. McGraw-Hill Science Engineering, first edition, 2003.
[12] W. Yue, J. Ling, Y. Hong-Bin, and L. Zong-Tian. A New Partitioning Scheme of Parallel VHDL Simulation. In Proceedings of the Conference on High Density Microsystem Design and Packaging and Component Failure Analysis, pages 1-4, June 2005.