
directCell: Hybrid systems with tightly coupled accelerators

H. Penner
U. Bacher
J. Kunigk
C. Rund
H. J. Schick
The Cell Broadband Engine*** (Cell/B.E.) processor is a hybrid
IBM PowerPC* processor. In blade servers and PCI Express**
card systems, it has been used primarily in a server context, with
Linux** as the operating system. Because neither Linux as an
operating system nor a PowerPC processor-based architecture is
the preferred choice for all applications, some installations use the
Cell/B.E. processor in a coupled hybrid environment, which has
implications for the complexity of systems management, the
programming model, and performance. In the directCell approach,
we use the Cell/B.E. processor as a processing device connected to
a host via a PCI Express link using direct memory access and
memory-mapped I/O (input/output). The Cell/B.E. processor
functions as a processor and is perceived by the host like a device
while maintaining the native Cell/B.E. processor programming
approach. We describe the problems with the current practice that
led us to the directCell approach. We explain the challenges in
programming, execution control, and operation on the accelerators
that were faced during the design and implementation of a
prototype and present solutions to overcome them. We also provide
an outlook on where the directCell approach promises to better
solve customer problems.

Motivation and overview

Multicore technology is a predominant trend in the information technology (IT) industry: Many-core and heterogeneous multicore systems are becoming common. This is true not only for the Cell Broadband Engine*** (Cell/B.E.) processor, which initiated this trend, but also for concepts such as NVIDIA CUDA** [1] and Intel Larrabee [2].

The Cell/B.E. processor and its successor, the IBM PowerXCell* 8i processor, are known for their superior computing performance and power efficiency. They provide more flexibility and easier programmability compared with graphics processing devices or field programmable arrays, alleviating the challenge of adapting software to accelerator technology. Yet, it is still difficult in some scenarios to fully leverage these performance capabilities with Cell/B.E. processor chips alone. The main inhibitors are lack of applications for the IBM Power Architecture* and the performance characteristics of the IBM PowerPC* core implementation in the Cell/B.E. processor. For mixed workloads in environments for which the Cell/B.E. processor is not suitable, such as Microsoft Windows**, various hybrid topologies consisting of Cell/B.E. processors and non-Cell/B.E. processor-based components can be appropriate, depending on the workload. Adequate system structures such as the integration of accelerators into other architectures have to be employed to reduce management and software overhead. The most significant factors to be considered in the design of Cell/B.E. processor-based system complexes are the memory and communication architectures. The following sections describe the properties of the Cell/B.E. Architecture (CBEA) and their consequences for system design.


Memory hierarchy of the CBEA

The CBEA [3] contains various types of processor cores on one chip: The IBM PowerPC processor element (PPE) is compatible with the POWER* instruction set architecture (ISA) but employs a lightweight implementation with limited branch prediction and in-order execution. The eight synergistic processor elements (SPEs) are independent cores with a different ISA optimized for computation-intensive workloads. The SPE implementation on the chip targets efficiency, using in-order execution and no dynamic branch prediction. Code running on the PPE can access main memory through the on-chip memory controller. In contrast, the SPE instruction set allows access only to a very fast but small memory area, the local store (LS), that is local to the core. The LSs on the SPEs are not coherent to any other entity on the chip or in the system but can request atomicity of memory operations with respect to concurrent access to the same memory location. For communication and dataflow, the SPEs are able to access main memory and I/O (input/output) memory using direct memory access (DMA).

Loosely coupled systems

Loosely coupled systems cannot access the memories of other nodes. Most clusters in the area of high-performance computing (HPC) follow this system design, including the first petaflops supercomputer [4]. If the application permits, these systems scale very well. Interconnectivity is usually attained by a high-speed network infrastructure such as InfiniBand**. The data distribution across the nodes is coarse grained to keep the network overhead for synchronization and data exchange low compared with the data processing tasks. The predominant programming model for loosely coupled systems is the Message Passing Interface (MPI) [5], potentially enriched by the hybrid IBM Accelerator Library Framework (ALF) [6] and hybrid Data Communication and Synchronization (DaCS) [7] libraries to enable the Cell/B.E. processor in mixed-architecture nodes. Another model based on remote procedure calls is IBM Dynamic Application Virtualization [8].

Tightly coupled systems

Tightly coupled systems are usually smaller-scale systems consisting of few computing entities. In our context, a single host component, such as an x86 processor-based system, exploits one or several accelerator devices. Accelerators can range from field-programmable gate arrays (FPGAs) over graphics processing units (GPUs) to Cell/B.E. processor-based accelerator boards, each offering different performance and programmability attributes. Accelerators can be connected in a memory-coherent or noncoherent fashion. Interconnects for memory-coherent complexes include Elastic Interface or Common Systems Interconnect. HyperTransport** can be used in memory-coherent and PCI Express** (PCIe**) can be used in noncoherent memory setups. It is a common feature of tightly coupled systems that they are able to access the single main memory.

Applications for tightly coupled systems usually run on one host part of the system (e.g., an x86 processor), while the other parts of the system (accelerators such as PowerXCell 8i processor-based entities or GPUs) will perform computation-intensive tasks on data in main memory. The tightly coupled approach makes deployment easier because only one host system with accelerator add-ons has to be maintained rather than a cluster of nodes. Depending on the type of the tightly coupled accelerator, the approach can be applied well to non-HPC applications and allows for efficient and fine-grained dataflow between components, as latency and bandwidth are better than in loosely coupled systems. Programming models include OpenMP** [9], the ALF [6] and DaCS [7] libraries, CUDA, the Apple OpenCL** stack [10], and FPGA programming kits. Compared with FPGA and GPU approaches, Cell/B.E. processor accelerators, with their fast and large memories, offer the highest flexibility and generality while maintaining an easy-to-use programming model.

High-level system description

This section describes the directCell architecture. It reflects on properties of acceleration cores and gives insight on the design principles of efficiently attaching the accelerator node.

General-purpose processor compared with SPEs on the Cell/B.E. processor

The remarkable performance of the Cell/B.E. processor is achieved mainly through the SPEs and their design principles. The lack of a supervisor mode, interrupts, and virtual memory addressing keeps the cores small, resulting in a high number of cores per chip. High computational performance is achieved through single-instruction, multiple-data (SIMD) capabilities. In directCell, the SPE cores are available for application code only, while the PPE is not used for user applications. This effectively makes the directCell accelerator an SPE multicore entity. Unlike limited acceleration concepts such as GPUs or FPGAs, the SPE cores are still able to execute general-purpose code, for example, that generated by an ordinary C compiler.

In standalone Cell/B.E. processor-based systems, OS (operating system)-related tasks are handled on the PPE. The task of setting up SPEs and starting program codes on them is controlled from outside, using a set of memory-mapped registers. All of these registers are

accessible through memory accesses, and no special instructions have to be used on the PowerPC core of the Cell/B.E. processor. Therefore, control tasks can be performed by any entity having memory access to the Cell/B.E. processor-side memory bus.
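A minimal sketch of what such outside control can look like follows. The register names (SPU_NPC, SPU_RunCntl) are taken from the CBEA problem-state definition, but the offsets and the helper function are illustrative assumptions, not the actual directCell code.

```c
#include <stdint.h>

/* Offsets into an SPE problem-state MMIO area; the register names follow
 * the CBEA, the offset values here are illustrative only. */
#define SPU_NPC_OFFSET      0x3004u
#define SPU_RUNCNTL_OFFSET  0x401Cu
#define SPU_RUNCNTL_RUN     0x1u

static inline void mmio_write32(volatile uint8_t *base, uint32_t off, uint32_t val)
{
    /* A plain store suffices; no special instructions are required. */
    *(volatile uint32_t *)(base + off) = val;
}

/* Start an SPE whose problem-state area is mapped at 'spe_ps': set the next
 * program counter, then flip the run-control bit. Any entity with a mapping
 * of this MMIO range (the PPE, or a host across PCIe) can issue these stores. */
void spe_start(volatile uint8_t *spe_ps, uint32_t entry_point)
{
    mmio_write32(spe_ps, SPU_NPC_OFFSET, entry_point);
    mmio_write32(spe_ps, SPU_RUNCNTL_OFFSET, SPU_RUNCNTL_RUN);
}
```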
Coupling as device and not as system

In most cases, porting Windows applications to the Power Architecture is not a viable option. Instead, identifying computational hotspots and porting these to the Cell/B.E. processor-based platform offers a way to speed up the application. As the SPE cores are used for acceleration, the port has to target the SPE instruction set.

Cell/B.E. processor-based systems use PCIe as a high-speed I/O link. PCIe is designed for point-to-point short-distance links. This communication model does not scale to many endpoints. However, it provides best-of-breed bandwidth and latency characteristics and supports memory-mapped I/O (MMIO) and DMA operations. Because the Cell/B.E. processor resources are accessible via MMIO, it is possible to directly integrate it into the address space of other systems through PCIe following the tightly coupled accelerator concept.

Communication overhead in this PCIe-based setup can be kept minimal. The directCell approach takes advantage of the low latencies of PCIe and makes the SPEs available for existing applications as an acceleration device instead of a separate system. To export a Cell/B.E. processor-based system as an acceleration device on the PCIe level, it must be configured as a PCIe endpoint device, an approach currently used by all hybrid setups involving the Cell/B.E. processor, including the first petaflops computer [4]. The directCell solution evolves from the endpoint mode and uses the Cell/B.E. processor as a device at the host system software level. We propose to relocate all control from the PPE to the host system processor except for the bare minimum needed for accelerator operation.

For directCell, we further propose to shift all application code executed on the PPE in a standalone configuration over to the host processor. This includes scheduling of SPE threads and SPE memory management and also applies to the PPE application code. Thus, the host controls all program flow on the accelerator device. None of the Cell/B.E. processor resources are hidden in our model. More importantly, nothing on the accelerator device obstructs the host from controlling it. In contrast to a distributed hybrid system approach that relies on an accelerator-resident OS to control all of its memory, we consider an accelerator-resident OS superfluous. The host OS runs the main application and is, therefore, in charge of all of the hardware resources that are needed for its execution.

In distributed hybrid systems, the host OS needs to coordinate data-transfer operations with the accelerator OS. MPI [5], which was designed for distributed systems, features a sophisticated protocol to establish zero-copy remote DMA (RDMA) transfers that are vital to hybrid system performance. Without using an OS on the accelerator, the host can act independently, rendering much more flexibility to the host system programmer and saving additional runtime overhead.

In order to support this execution model from the host system software, the accelerator hardware should provide the following mechanisms:

- Endpoint functionality.
- MMIO control of processor cores, including next instruction pointer, translation lookaside buffer (TLB), machine state register (MSR), and interrupt acknowledgment.
- Interrupt facilities.
- Optionally, PCIe coherency.

Providing these mechanisms in a processor design yields enormous rewards with regard to system configurations, especially in hybrid configurations.

The directCell architecture applies to Cell/B.E. processor-based blades such as the IBM BladeCenter* QS22 server [11] with specific firmware code to operate in PCIe endpoint mode and the IBM PowerXCell accelerator board (PXCAB) offering [12].

Applications and programming

Performance and flexibility are two essential and generally opposing criteria when evaluating computer architectures. On the performance side, specially designed circuits, digital signal processors, and FPGAs provide extremely high computing performance but are lacking generality and programmability. Regarding flexibility, general-purpose processors allow for a wide application spectrum but lack the performance of highly optimized systems. Cell/B.E. processors follow a tradeoff and provide high computing performance while still offering a good degree of flexibility and ease of programming.

The tightly coupled hybrid approach proposed with directCell in this paper promises to capture the advantages of both extremes by combining Cell/B.E. processor-based accelerators for performance and x86 systems for wide applicability. Two aspects apply to programmability for these hybrid systems: first, the complexity to program the accelerator (i.e., the SPEs), and second, the integration of the tightly coupled accelerator attachment into the main application running on the host system (i.e., the model to exploit the SPEs from the x86 system).

Standard programming framework of the Cell/B.E. processor

The SPE Runtime Management Library (libspe2) [13], developed by IBM, Sony, and Toshiba, defines an OS-independent interface to run code on the SPEs. It provides basic functions to control dataflow and communication from the PPE. Cell/B.E. processor applications use this library to start the execution of code on the SPEs using a threading model while the PPE runs control and I/O code. The code on the PPE is usually responsible for shuffling data in and out of the system and for controlling execution on the SPEs. In most cases, code on the SPEs pulls data from main memory into the LSs so that the SPEs can work on this data. Synchronization between the SPEs and between the PPE and SPEs can take place through a mailbox mechanism.
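A minimal sketch of this flow follows. It uses the public libspe2 calls; the name of the embedded program handle and the wrapper function are our own, and under directCell the same code would simply run on the x86 host instead of the PPE.

```c
#include <libspe2.h>

/* 'simulation_spu' stands for an SPE ELF image embedded into the
 * application; the name is hypothetical. */
extern spe_program_handle_t simulation_spu;

int run_on_one_spe(void *argp, void *envp)
{
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_stop_info_t stop_info;

    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    if (ctx == NULL)
        return -1;

    /* Load the embedded SPE image and run it; argp/envp typically point to
     * a control block in main memory that the SPE then fetches via DMA. */
    if (spe_program_load(ctx, &simulation_spu) == 0)
        spe_context_run(ctx, &entry, 0, argp, envp, &stop_info);

    spe_context_destroy(ctx);
    return 0;
}
```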
Programming tightly coupled accelerators

The programming model and software framework for tightly coupled accelerators such as the Cell/B.E. processor can be structured in several ways. One way, which resembles the loosely coupled systems approach [14], is to add a communication layer between the two standalone systems, the host and the Cell/B.E. processor-based system. These additional layers increase the burden on software developers, who have to deal with an explicit communication model instead of a programming library.

A different model is used for directCell that replaces parts of the PPE functionality with host processor code. The PPE does not run application-specific code; instead, code on the host processor assumes the functions that were previously performed by the PPE. This allows applications to fully leverage the advantages of a tightly coupled accelerator model. In our implementation of directCell, the libspe2 library is ported to the x86 architecture, which allows control of SPE code execution and synchronization between an x86 system and SPE entities. This reduces the complexity of programming the directCell system to the level of complexity required to natively program the Cell/B.E. processor. Existing Cell/B.E. processor-aware code can continue to run unchanged or with limited changes in this environment, tapping a large number of kernels running on SPEs for x86 environments. Furthermore, existing Windows or Linux** applications can be extended by libspe2 calls to leverage SPE acceleration, opening up the x86 environment and other application environments to Cell/B.E. processor acceleration.

Implications of the tightly coupled accelerator structure

The directCell approach has several implications on programming and applications:

- Most of the application code runs on x86. Any code that previously ran on the PPE now has to be built for and run on the x86 platform.
- The libspe2 application programming interface (API) to control SPEs is moved to x86, i.e., libspe2 is an x86 library.
- SPEs can perform DMA transfers between the SPE LS and the x86 main memory as used by the main application (see the sketch following this list). DMA to the northbound memory that is local to the Cell/B.E. processor accelerator systems is still possible for SPEs. For transfers between the x86 memory and SPEs, the different endianness characteristics of the architectures need to be considered. (Endianness refers to whether bytes are stored in the order of most significant to the least significant or the reverse.) SIMD shuffle byte instructions can be used for efficient data conversion on the SPEs and may have to be added as a conversion layer, depending on the type of data exchanged between host and accelerator.
- The PCIe interface between the Cell/B.E. processor and the x86 components does not permit atomic DMA operations, a restriction that applies to all accelerators attached through PCIe. Atomic DMA operations on the Cell/B.E. processor local memory are still possible. This constraint requires code changes to existing applications that use atomic DMA operations for synchronization.
- Memory local to the Cell/B.E. processors can be used by SPE code for buffering; because of its proximity to the Cell/B.E. processor chips, it has better performance characteristics than the remote x86 memory. However, adding this mid-layer of memory between the x86 main memory and the SPE LSs adds to application complexity. A good way of using the memory local to the Cell/B.E. processors might be to use it to prefetch and cache large chunks of host system data. The complexity of prefetching and loading data on-demand from the Cell/B.E. processor local memory can typically be encapsulated in a few functions.
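The following SPE-side sketch shows the kind of DMA the third item refers to: pulling a buffer from main memory (host-resident or accelerator-local) into the LS and writing results back with the MFC intrinsics of the Cell/B.E. SDK. The buffer size, tag number, and function name are illustrative; transfers must obey the usual MFC alignment rules.

```c
#include <stdint.h>
#include <spu_mfcio.h>

#define TAG 3

static char ls_buffer[16384] __attribute__((aligned(128)));

/* 'ea_in'/'ea_out' are 64-bit effective addresses handed to the SPE
 * program, for example via argp; they may point into x86 host memory. */
void fetch_and_return(uint64_t ea_in, uint64_t ea_out, unsigned int size)
{
    mfc_get(ls_buffer, ea_in, size, TAG, 0, 0);   /* main memory -> LS */
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();                    /* wait for completion */

    /* ... compute on ls_buffer ... */

    mfc_put(ls_buffer, ea_out, size, TAG, 0, 0);  /* LS -> main memory */
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();
}
```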
Structure of directCell

The directCell model can be applied to Cell/B.E. processor-based systems only by means of software. Several software components are needed on the host and accelerator system side. In the following sections, we introduce the components and how they interface with each other.

Libspe2.dll

The basic interface for controlling SPE applications in standalone Cell/B.E. processor-based systems is libspe2

[13]. To conform to this widely adopted programming interface, and to be able to reuse a large amount of existing code, directCell provides a Windows port via a Windows DLL (dynamic link library) that provides such functions as spe_context_create() and spe_context_run(). The libspe2 routines create a directCell device instance and send I/O control commands processed by the device driver.

The libspe2 used in standalone system configurations interfaces with the SPE hardware resources through the synergistic processor unit (SPU) file system (spufs) [15] in the Linux kernel. Spufs offers two system calls and a virtual file system for creating SPE threads that the Linux kernel schedules to run on the SPE. In the Windows implementation, the spe_context_create() libspe2 call allocates a user-space representation of the SPE program and also a kernel-space device instance that is accessed by a device handle in user space. The function spe_context_run() invokes a corresponding I/O control (ioctl) that passes the device handle and a pointer to the user-space buffer as parameters.
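A minimal user-space sketch of this call path is shown below. The device path, the IOCTL code, and the request structure are placeholders (the real values are defined by the directCell driver); the Win32 calls themselves are standard.

```c
#include <windows.h>

#define DIRECTCELL_DEVICE_PATH  L"\\\\.\\directCell0"           /* assumed */
#define IOCTL_SPE_RUN  CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, \
                                METHOD_BUFFERED, FILE_ANY_ACCESS) /* assumed */

struct spe_run_request {          /* hypothetical ioctl payload */
    void        *context;         /* user-space SPE context buffer */
    unsigned int entry;           /* SPE entry point */
};

BOOL run_spe_context(struct spe_run_request *req)
{
    HANDLE dev = CreateFileW(DIRECTCELL_DEVICE_PATH,
                             GENERIC_READ | GENERIC_WRITE, 0, NULL,
                             OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (dev == INVALID_HANDLE_VALUE)
        return FALSE;

    OVERLAPPED ov = {0};
    ov.hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);

    DWORD bytes = 0;
    /* Nonblocking ioctl: the driver queues the request and starts the SPE. */
    DeviceIoControl(dev, IOCTL_SPE_RUN, req, sizeof(*req), NULL, 0, &bytes, &ov);

    /* Wait for completion via the asynchronous I/O wait routines. */
    GetOverlappedResult(dev, &ov, &bytes, TRUE);

    CloseHandle(ov.hEvent);
    CloseHandle(dev);
    return TRUE;
}
```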
SPE binaries are embedded in a given Windows application as executable and linking format (ELF) files. In our Windows implementation, the ELF files are integrated into accelerated applications as ELF resources. The Windows API provides standard resource-handling functions that are used to locate and extract a previously embedded file at runtime. The libspe2 port also implements ELF loading functionality for processing the embedded SPU program to pass it to the spe_context_run() call.
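For illustration, locating such an embedded SPE ELF image with the standard Win32 resource functions could look as follows; the resource type and name are assumptions, the API calls are not.

```c
#include <windows.h>

/* Return a pointer to the raw bytes of an SPE ELF image embedded in the
 * running executable as a resource of the (assumed) type "SPE_ELF". */
const void *find_embedded_spe_elf(const wchar_t *name, DWORD *size)
{
    HMODULE self = GetModuleHandleW(NULL);
    HRSRC   res  = FindResourceW(self, name, L"SPE_ELF");
    if (res == NULL)
        return NULL;

    HGLOBAL handle = LoadResource(self, res);
    if (handle == NULL)
        return NULL;

    *size = SizeofResource(self, res);
    return LockResource(handle);
}
```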
Device driver

Our implementation of directCell features a sophisticated device driver that provides a software representation of all of the Cell/B.E. processor facilities on a host system. Accelerated application code is loaded, executed, and unloaded by operating the device driver interfaces. From a conceptual level, this closely corresponds to the well-established spufs. The idea to span the spufs interface across a high-speed interconnect has been pursued by Heinig et al. as rspufs [16]. Regarding the user-space representation, spufs is based on the UNIX** file concept by representing SPE hardware through a virtual file system. In the case of Windows, the representation of the hardware in user space is dominated by a different model. In Windows, files, volumes, and devices are represented by file objects, but the concept of imposing a directory structure for device control is not widely supported. Windows allows the creation of device interface classes that are associated with a globally unique identifier (GUID) and a set of system services that it implements. A device interface class is instantiated by creating a file handle with the GUID. This handle is used to invoke system services. The granularity of this scheme centers around the idea of a single device node as found in a Linux /dev tree. The fine-grain control gained by subdividing the device details into a file system hierarchy such as spufs is opposed by a large variety of system services, each tailored for a specific purpose on a distinct device class. Device drivers that support I/O operations implement a relatively small set of I/O functions via an I/O queue. The driver I/O queue registers callbacks, most importantly read, write, I/O-control, stop, and resume. The actual communication between a user-space application and a device takes place through an appropriate device service that invokes the Windows I/O manager, which forms an I/O request packet, puts it into the I/O queue of the correct driver, and notifies the driver via the previously registered callbacks.

The device driver can satisfy some of the function requirements of higher level libraries such as libspe2 by using existing system services. The command spe_context_create() generates an SPE context by invoking the CreateFile() system service using the Windows Cell/B.E. processor device handle. This will return a handle to a kernel representation of an SPE context that is similar to the representation returned by an spu_create() system call in spufs. The spu_run() function is implemented as an ioctl that operates on the previously created SPE handle. The ioctl is nonblocking, although upper-level software should wait for completion via designated wait routines for asynchronous I/O. The ioctl initiates a direct DMA transfer of the SPE context from host user-space memory to the LS of the SPU. All data transfers in directCell follow this zero-copy paradigm.

Regarding the mechanisms to access the Cell/B.E. processor hardware facilities, the device driver must perform additional steps compared to spufs. These additional steps are due mainly to the conversion between the Cell/B.E. processor address space and the host system address space, as described in the following sections.

Cell/B.E. processor firmware layer

The standard product firmware of a Cell/B.E. processor-based system is supplemented by a minimal software layer that takes care of additional initialization tasks after the endpoint-aware standard firmware has finished the basic boot. Most importantly, this layer provides accelerator runtime management and control. The initialization tasks comprise the setup of the address translation mechanism of the SPEs, the reporting of available SPEs to the device driver before accelerator usage, and the setup of the I/O memory management unit (I/O MMU). The runtime management and control functionality centers around handling of SPE events such as page faults and completion signals. This handling is

vital for the execution of SPE binaries. Additional runtime debug and monitoring facilities for the SPEs complete the firmware layer functionality. The event handler and the debug monitor each run on separate PPE hardware threads. In our current implementation, the debug monitor provides a simple facility for probing memory locations, memory-mapped resources, and interfaces via the serial console. In future implementations, the debug console I/O could be redirected via PCIe to a console application that runs on the host.

Accelerator operation

While directCell maintains compatibility with the common programming models for the CBEA [17], it introduces a new runtime model. Figure 1 shows and the following paragraphs describe in detail how directCell delivers efficient workload acceleration and makes additional overhead imposed by a Cell/B.E. processor-resident OS superfluous. The numbers in parentheses below refer to the numbers in red in the figure.

Figure 1. In directCell, an SPE program is started on the host processor by directly programming an SPE via the PCIe link between the host processor southbridge and that of the Cell/B.E. processor. Shown are the most important directCell software components and their lifecycle in the directCell hybrid system configuration.

Consider a user-space application on x86 that needs to run an SPE program via a PCIe-attached directCell device. Through our libspe2 implementation, it interfaces with the device driver by using the standard Windows API functions to allocate an SPE context (a representation of the LS, general-purpose registers, and problem or privileged state). The SPE binary is retrieved from the executable of the main application and loaded into the SPE context. The device driver implements a standard Windows function to support opening a device representation of the directCell system and running the previously created SPE context (1). The spe_context_run() method is implemented as an ioctl. Upon invocation, the driver picks a free physical SPE on the remote system and orchestrates the transfer of the SPE context to that system via MMIO operations (2). For the actual transfer of data, it is crucial that only DMA is used to maximize throughput and minimize latency because the data has to traverse the PCIe bus. MMIO is used only as a control channel of the host system into the Cell/B.E. processor-based system. This concept of loading comprises a small bootstrap loader as described

in Reference [18] to achieve this. The loader is transferred to the SPE in a single 16-KB DMA operation that also receives and contains as parameters the origin address of the target SPE binary and its size. The loader transfers the contents of the SPE register array into an internal data structure and initializes the registers before filling the LS with the binary SPE code via DMA (3). The loader will stop and signal completion to the PPE via an external interrupt. This interrupt is forwarded to the device driver by the PPE (4), which runs a tiny firmware layer as part of the only software prevailing on the PPE (as introduced in the previous section).

While the SPE binary code executes, it generates mapping faults (4) whenever a reference to an unmapped host-resident address is encountered. This applies, for instance, to parameter data that may reside in a control block on the host. The corresponding memory mapping is established by the device driver via MMIO (5), and the host-resident data is then transferred via DMA (6). When the SPE stops execution, the host is signaled once again via interrupt and initiates a context switch and unload procedure, as described in Reference [9]. Similar to the previous steps, this phase involves unloader binary code that facilitates a swift transfer of the context data to the host memory via the SPE DMA engines.

Memory addressing model

One of the concepts of the directCell approach is to define a global address mapping that allows the SPEs to access host memory and the host to access the MMIO-mapped registers of the Cell/B.E. processor, the LS, and local main memory. The mechanism that achieves this is based on using the memory management unit (MMU) of the host, the MMU and I/O MMU of the Cell/B.E. processor, and the PCIe implementation features, which we show in the rest of this section.

The BAR setup section introduces how MMIO requests from the host pass the PCIe bus, and the section on inbound mappings shows how those are translated to Cell/B.E. internal addresses. These internal addresses are further translated by the I/O MMU to PowerPC real addresses (RAs).

The I/O MMU defines two translation ranges:

1. Requests directed to the SPE MMIO space are furnished with a predefined displacement by the southbridge (I/O chipset). The I/O MMU translates these displaced addresses to the base address of the SPE MMIO spaces within the Cell/B.E. processor addressing domain and preserves offsets to navigate within this range.
2. Requests directed to Cell/B.E. processor memory are translated 1:1 by the I/O MMU, also preserving the offset for internal navigation.
Figure 2 shows a simplified model for MMIO communication. The MMIO transfer is triggered by writing to virtual addresses (VAs) on the host that are translated by the host MMU to RAs that, in turn, map to the host PCI bus. The MMIO requests are delivered as PCI memory cycles to the accelerator PCIe core, where they are further redirected to the interface between the southbridge and the Cell/B.E. processor, as described below in the section on inbound mappings. At that point, the I/O MMU maps the requests to the corresponding units on the Cell/B.E. processor.

The SPEs transfer data via DMA to or from the accelerator system upon remote initiation by the host via MMIO. The target of the data-transfer operations may be local Cell/B.E. processor addresses (LS or local main memory) or remote host memory. The host addresses must be mapped to the I/O range of the Cell/B.E. processor; hence, accesses to these regions do not satisfy the coherency requirements that apply for local main memory. The device driver is responsible for locking the host memory regions to prevent swapping. For future enhancements of the directCell model, we encourage the introduction of a mechanism on the host system to explicitly request local main memory buffering to improve performance and coherency considerations. Further enhancements could provide a software-managed mechanism to prefetch and cache host-resident application buffers into local main memory [19].

The memory mappings are established through direct updates of the SPE TLBs whenever the SPE encounters an unmapped address. This handling is initiated by a page fault, as described in the section on interrupt management. In our current implementation of directCell, the host device driver performs this management via MMIO. The current implementation features a page size of 1 MB; hence, a single SPE can work on 256 MB of application data buffered in local memory or host-resident memory at once without the need of a page table. This area can be increased if larger common page sizes are available on the host and the accelerator. To introduce hardware management of the TLBs to directCell, the firmware layer would need to be augmented to manage a page table in the local main memory.
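To make the granularity concrete: with 1-MB pages, the 256 MB of directly addressable data stated above corresponds to 256 resident translations per SPE. A sketch of the bookkeeping follows; the entry layout is purely illustrative and not the hardware format.

```c
#include <stdint.h>

#define PAGE_SHIFT   20u                    /* 1-MB pages                     */
#define PAGE_SIZE    (1u << PAGE_SHIFT)
#define PAGE_MASK    ((uint64_t)PAGE_SIZE - 1)
#define NUM_ENTRIES  256u                   /* 256 x 1 MB = 256 MB reachable  */

/* On an SPE page fault for effective address 'ea', the host driver installs
 * a translation for the enclosing 1-MB page via MMIO. */
struct tlb_entry {
    uint64_t ea_page;   /* ea & ~PAGE_MASK                      */
    uint64_t ra_page;   /* host-resident or local real address  */
};

static inline uint64_t page_base(uint64_t ea)   { return ea & ~PAGE_MASK; }
static inline uint64_t page_offset(uint64_t ea) { return ea & PAGE_MASK;  }
```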
The flow of DMA operations is shown in Figure 2. The SPE views a DMA operation as a transfer from one VA to another. When reading data from the host system (blue arrows), the target address translates to an LS address, and the source address translates to an RA in the I/O domain, which points to the outbound memory region of the accelerator PCIe core. The mapping of PCIe bus addresses to host RAs is likely to involve another inbound translation step; however, this is transparent to

the accelerator system and entirely dependent on the host OS, which directCell does not interfere with.

Figure 2. The various levels of addressing translation and indirection between the host and the accelerator in directCell: (a) a control channel implemented by means of MMIO; (b) a data-transfer channel implemented via DMA. (VA: virtual address; RA: real address.)

PCIe

PCIe was developed as the next-generation bus system that replaces the older PCI and PCI-X** standards. In contrast to PCI and PCI-X, PCIe uses a serial communication protocol. The electrical and bit layer is similar to the InfiniBand protocol, which also uses 8b10b coding for the bit layer [20]. The entire PCIe protocol stack is handled in hardware. The programmer does not need to build any queues or messages to transfer actual data. Everything is done directly with the processor load and store instructions, which move data from and to the bus. It is possible to map a memory region from the remote side to the local memory map. This mapping is software compatible with the older PCI or PCI-X standards. Because of this, a mapping can be created so that the local processor can access remote memory directly via load or store operations, and effectively it does not need any software libraries to use this high-speed communication channel.

PCIe is used as a point-to-point interconnection between devices and a system. This allows centralizing the traffic routing and resource management and enables quality of service. Thus, a PCIe switch can prioritize packets. For a multimedia system, it could result in fewer dropped frames and lower audio latency. PCIe itself is based on serial links. A link is a dual-simplex connection using two pairs of wires. One pair is used for transmitting data and one is used for receiving data. The two pairs are known as a PCIe lane. A PCIe link consists of multiple lanes, for example, x1, x2, x4, x8, x16, or x32.

PCIe is a good communication protocol for high-bandwidth and low-latency devices. For example, a PCIe 2.0 lane can transmit 500 MB/s per direction, which yields 16 GB/s per direction for an x32 link configuration. A very compelling possibility of the PCIe external cabling specification is the option to interconnect an external accelerator module via a cable to a corresponding host system.

PCIe endpoint configuration

PCIe defines point-to-point links between an endpoint and a root-complex device. In a standalone server

configuration such as the QS22 BladeServer, the southbridge of the Cell/B.E. processor-based system is configured as a root complex to support add-in daughter cards. For directCell, the host system southbridge configuration is unchanged and runs as a root complex, whereas the Cell/B.E. processor-based system southbridge is configured to run in endpoint mode. The configuration is done by the service processors of the two systems prior to booting. From the host system, the Cell/B.E. processor-based system is seen as a PCIe device, with significant device identification, which allows the BIOS (basic I/O system) and OS to configure the MMIO ranges and load the appropriate device drivers. The scalability of this model of endpoint coupling is limited only by how much I/O memory the BIOS can map.

PCIe inbound mapping

In order to remotely access the Cell/B.E. processor-based system, the device driver and the firmware layer configure the Cell/B.E. processor-based system southbridge to correctly route inbound PCIe requests to the Cell/B.E. processor and the local main memory. This inbound mapping involves PCI base address registers (BARs) on the bus level and PCI inbound mappings (PIMs) on the southbridge level. Both concepts are introduced in the following sections.

Figure 3. Overview of the inbound addressing mechanisms of the Cell/B.E. southbridge. Inbound PCIe requests are mapped to distinct ranges within the southbridge internal address space, which in turn are mapped to specific units of the Cell/B.E. processor. The address mappings can be changed at runtime.

BAR setup

The PCI Standard [21] defines the BARs for device configuration by firmware or software as a means to assign PCI bus address ranges to RAs of the host system. The BARs of the PCIe core in the southbridge are configured by PCIe config cycles of the host system BIOS during boot. After the configuration, it is possible to access internal resources of the accelerator system through simple accesses to host RAs. This makes it possible to remotely control any Cell/B.E. processor-related resources such as the MMIO register space, SPU resources, and all available local main memory. However, to achieve full control of the accelerator system, an additional step is required, which is introduced in the section on inbound mappings.

The directCell implementation uses three BAR configurations to create a memory map of the device resources into the address space of the host system:

1. BARs 0 and 1, forming one 64-bit BAR, provide a 128-MB region of prefetchable memory used to access all of the accelerator main memory (Figure 3).
2. A 32-bit BAR 2 is mapped to the internal resources of the endpoint configuration itself, making it possible to change mappings from the host. This BAR is 32 KB in size.
3. BARs 5 and 6, forming one 64-bit BAR, are used to map the LS, the privileged state, and the problem state registers of all SPEs of the Cell/B.E. processor.

Figure 3 is an overview of this concept. We can see how the driver can conveniently target different BARs for directly reaching the corresponding hardware resources.
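As an illustration of how a host driver can keep these windows apart, a sketch follows; the sizes match the description above, while the structure fields and the helper are our own naming, not the directCell driver's.

```c
#include <stdint.h>

struct directcell_bars {
    volatile uint8_t *accel_mem;   /* BARs 0/1: 128-MB movable window into
                                      accelerator main memory               */
    volatile uint8_t *endpoint;    /* BAR 2: 32-KB endpoint configuration,
                                      used to reprogram the inbound maps    */
    volatile uint8_t *spe_area;    /* BARs 5/6: LS, problem-state, and
                                      privileged-state registers of all SPEs */
};

#define ACCEL_WINDOW_SIZE  (128u << 20)

/* Read a 32-bit word from accelerator main memory at 'offset', assuming the
 * 128-MB window currently covers that offset. Moving the window itself is
 * done through the PIM registers reached via BAR 2 (not shown here). */
static inline uint32_t read_accel_mem(struct directcell_bars *b, uint32_t offset)
{
    return *(volatile uint32_t *)(b->accel_mem + (offset % ACCEL_WINDOW_SIZE));
}
```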

Inbound mappings

The Cell/B.E. processor companion chip, the southbridge, introduces PIM registers to provide internal address translation capabilities. PIMs apply a displacement within the southbridge internal bus address space, as shown in Figure 3.

For all BARs, corresponding PIM registers exist to map addresses on the PCIe bus to the southbridge internal bus [22] addresses connected to the southbridge-to-Cell/B.E. processor interface.

The PIM register translation for BAR 2 is set up by the accelerator firmware to refer to the MMIO-accessible configuration area of the PCIe core. The PIMs for BARs 0 and 1 and 5 and 6 are adjusted by the host device driver during runtime. In order to do so, the device driver uses the inbound mapping for BAR 2. The resources of the Cell/B.E. processor are made accessible to a PPE space address by the I/O MMU setup (described previously in the section on PCIe and depicted in the lower part of Figure 3).

The PIM register translation for BARs 0 and 1 is changed during the runtime of host applications according to the needs of the driver. The driver provides a movable window across the complete main memory of the accelerator, though at a given point in time, only 128 MB of contiguous accelerator memory space is available (also shown in Figure 3).

Interrupt management

The asynchronous execution of SPE programs requires an interrupt handling infrastructure for directCell. SPE programs use interrupts to signal such events as mapping faults, errors, or program completions.

Several layers are involved in handling SPE interrupts. An external interrupt is routed to the interrupt controller on the PPE, where it is handled by an interrupt handler in the PPE-resident firmware layer. This handler determines which SPU triggered the interrupt and directly triggers an interrupt to the PCIe core on the southbridge, which is in turn handled by the host device driver.

Three types of SPE interrupts are handled by directCell:

1. SPU external interrupt: page-fault exception handling, DMA error, program completion.
2. Internal mailbox synchronization interrupt.
3. Southbridge DMA engine interrupt.

The inclusion of the PPE in the interrupt-handling scheme is necessary because the provided hardware does not allow the direct routing of the SPE interrupt to the PCIe bus. We propose to include the ability to freely reroute interrupts to different targets in future processors and I/O devices.
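The forwarding path described above can be summarized in a small sketch. The accessor functions are placeholders for the actual register reads and the doorbell write, which are not detailed in this paper; only the control flow follows the text.

```c
#include <stdint.h>

enum spe_event { SPE_PAGE_FAULT, SPE_DMA_ERROR, SPE_PROGRAM_COMPLETE };

/* Placeholder accessors standing in for the real MMIO register accesses. */
extern int  read_spu_interrupt_source(void);                  /* which SPU signaled? */
extern enum spe_event read_spu_event(int spu);                 /* what happened?      */
extern void raise_pcie_interrupt(int spu, enum spe_event ev);  /* doorbell to host    */

void ppe_external_interrupt_handler(void)
{
    int spu = read_spu_interrupt_source();
    enum spe_event ev = read_spu_event(spu);

    /* The firmware does not resolve the event itself; it forwards it so the
     * host device driver can update TLBs, collect results, or report errors. */
    raise_pcie_interrupt(spu, ev);
}
```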
Performance evaluation

In this section, first results of the directCell implementation are presented. The results are sectioned into a discussion on general throughput and latency measurements for specific directCell hardware configurations and a discussion of an exemplary application that we apply to our implementation of directCell. The latter section elaborates on the effort of porting the application to directCell as well as application performance data.

General measurements

These tests aim to determine the throughput and latency characteristics of the host and accelerator interconnect in two distinct directCell hardware configurations:

1. A BladeCenter server-based configuration: an IBM QS22 blade with an IBM HS21 blade. The interconnect is PCIe Gen1, x4.
2. A PowerXCell processor-based accelerator board (PCIe add-in card) connected to a standard x86-based advanced technology extended (ATX) personal computer. The interconnect is PCIe Gen1, x16.

All measurements are performed by a small firmware kernel running on the PPE of the Cell/B.E. processor-based system. The kernel runs on the PowerXCell PPE in real addressing mode, programs the SPE memory flow controller DMA facilities via MMIO, and transfers LS contents to the host memory and back. Because the MMIO operations are significantly slower than programming the DMA directly from the SPE, the methodology is not ideal; however, each test determines the additional latency of all involved MMIO operations and subtracts it from the overall duration.

Table 1 lists the throughput from the SPE LS to x86-based host memory. The values shown were measured with two SPEs submitting requests; however, a single SPE will saturate the PCIe bus in both system configurations.

Table 1. Throughput from SPE local store to x86-based host memory.
  Configuration                            Throughput (MB/s)
  QS22 at 3.2 GHz, SMP, PCIe x4, reads     719.22
  QS22 at 3.2 GHz, SMP, PCIe x4, writes    859.44
  PXCAB at 2.8 GHz, PCIe x16, reads        2,333.95
  PXCAB at 2.8 GHz, PCIe x16, writes       2,658.60

With regard to latency, we conducted several measurements to determine the overhead of the I/O path. We regard the hop from the PowerXCell processor to the

southbridge separately from the overall access to the host memory. This is possible by providing a PowerXCell processor address that routes the request to the southbridge in a way that it gets immediately routed back to the PowerXCell processor by the southbridge. One of these loop-back tests refers back to the LS; a second accesses the Cell/B.E. processor-resident DDR2 (double-data-rate 2) memory. The results for both tests are shown in Figure 4, along with results for end-to-end data transfers from LS to the x86 host memory.

Figure 4. Latency of 16-KB SPE transfers to various targets (using the processor timebase). The chart compares, in nanoseconds, the SPE-to-southbridge loop back to local memory, the SPE-to-southbridge loop back to the SPE, and the SPE transfer over PCIe to host memory for PXCAB 2.8-GHz and QS22 3.2-GHz SMP reads and writes.

Application porting

As an exemplary use case of directCell, we have ported a Cell/B.E. processor-based application from the financial services domain for European option trading. The underlying software is a straightforward implementation of a Monte Carlo pricing simulation. A derivative of this application has been part of an IBM showcase at the Securities Industry and Financial Markets Association Technology Management Conference and Exhibit [23]. The computing-intensive tasks of the application lie in the acquisition of random integer numbers and Moro's inversion of those [24]. The standalone Linux-based version of the application transfers and starts one or more SPE threads that will in turn read in a control block from main memory before starting the simulation. The control block contains such parameters as the number of simulations that should be performed, the simulated span of time, and the space to return result data. After completion, the control block is transferred back to main memory.

The development of the application port involved porting the PPE code to the x86 architecture and Windows as well as providing for endianness conversion of the application control block in the SPE program. Of the PPE code, 16% of all lines are reused verbatim. The remaining code underwent mostly syntax changes to conform to the data types and library function signatures of Windows. The control flow of the application was not changed; all lines of SPE code are reused verbatim. Some additional 85 lines of code were needed to perform endianness conversion of the control block data. The conversion code makes use of the spu_shuffle instruction with a set of predefined byte-swap patterns. With adequate programming language support, such a code block can be easily generated by a compiler, even for complex data structures. The application port was performed on a Microsoft Windows Server** 2008-based implementation of directCell.
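A sketch of such a conversion step on the SPE follows, assuming the control block consists of 32-bit fields; the pattern shown is one example of a predefined byte-swap pattern, and the function wrapper is ours.

```c
#include <spu_intrinsics.h>

/* Byte-swap four 32-bit words between big- and little-endian order in one
 * shuffle. The pattern reverses the bytes within each 4-byte element. */
static const vector unsigned char swap32 = {
     3,  2,  1,  0,   7,  6,  5,  4,
    11, 10,  9,  8,  15, 14, 13, 12
};

static inline vector unsigned int byteswap_u32(vector unsigned int v)
{
    return spu_shuffle(v, v, swap32);
}
```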
Application results

We compare the application performance on a standalone native Cell/B.E. processor running Linux on a BladeCenter QS22 server as well as a directCell configuration of the same system with a BladeCenter HS21 server and a directCell PXCAB configuration connected to an AMD64 system. The most important application parameters are set to 10 timesteps and 1 million simulations per run per SPE.

Figure 5. Overall application performance, in millions of simulations per second, for one SPE and for the average over eight SPEs on the standalone QS22, directCell QS22, and directCell PXCAB configurations.

Figure 5 shows that the number of simulations per second for the QS22-based directCell configuration is

only marginally lower than the number achieved in the nonhybrid standalone system. The overall performance for the PXCAB-based directCell configuration is linear to the lower processor frequency of PXCAB compared with QS22.

We also evaluated the duration of distinct libspe2 function calls. Under Windows Server 2008, these measurements were conducted by using the QueryPerformanceCounter function. Under PowerPC Linux, the gettimeofday function was used to acquire the durations.
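For reference, a timing harness of this kind on the Windows host could look as follows, assuming the libspe2 port described earlier is available; the choice of the timed call and the conversion to microseconds are ours.

```c
#include <windows.h>
#include <libspe2.h>

/* Time a single spe_context_create()/spe_context_destroy() pair in
 * microseconds using the high-resolution performance counter. */
double time_spe_context_create(void)
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&start);
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    QueryPerformanceCounter(&stop);

    spe_context_destroy(ctx);
    return (double)(stop.QuadPart - start.QuadPart) * 1e6 / (double)freq.QuadPart;
}
```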
The results shown for host-resident function calls in Figure 6 reveal that the host processors used in the directCell model perform certain tasks faster than the PPE of the PowerXCell processor. Also, the frequency of the function calls has a significant impact on their average duration.

Figure 6. Duration of host-resident libspe2 functions. The chart shows the duration, in microseconds, of spe_context_create, spe_program_load, and spe_context_destroy for one SPE and for the average over eight SPEs on the standalone QS22, directCell QS22, and directCell PXCAB configurations.

Finally, we measured the duration of handling SPE page and segment faults in directCell. For this, we used a PCIe analyzer that triggers first upon interrupt assertion on the bus and triggers again upon the bus cycle that updates the TLB of the SPEs with a new memory mapping. This method omits the time that the interrupt and the resulting MMIO cycle need to propagate through the Cell/B.E. processor-based system, but the orders of magnitude in terms of time for the handling of the page fault on the bus and the host system handling allow us to neglect this, as shown in Table 2.

Table 2. SPE memory fault measurements.
  Interrupt type       Duration (µs)
  SPE segment fault    15.984
  SPE page fault       20.56

Conclusion and outlook

We have presented directCell, an approach to attach a Cell/B.E. processor-based accelerator to a Windows-based x86 host system using PCIe connectivity. This configuration taps today's software environment while the native Cell/B.E. processor programming model is maintained, leveraging existing code for the CBEA.

The tight coupling of the accelerator and its low communication latency allows for a broad range of applications that go beyond HPC and streaming workloads while offering an integrated and manageable system setup. The viability of this approach has been demonstrated in a prototype. The PPE portion of the Cell/B.E. processor runs a slim firmware layer implementing initialization, memory management, and interrupt handling code, granting the x86 host system full access to the SPEs.

Further research should be undertaken to optimize memory management and the work split between host control software and the Cell/B.E. processor PPE firmware components. Exploitation of the memory local to the Cell/B.E. processor-based accelerator can improve application performance transparently. Efforts to evaluate scalability across several accelerator engines as well as building an environment for efficient development can lead to broader use of directCell.

*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.

**Trademark, service mark, or registered trademark of NVIDIA Corporation, Microsoft Corporation, InfiniBand Trade Association, HyperTransport Technology Consortium, PCI Special Interest Group, OpenMP Architecture Review Board Corporation, Apple, Inc., Linus Torvalds, or The Open Group in the United States, other countries, or both.

***Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc., in the United States, other countries, or both and is used under license therefrom.

References
1. CUDA Zone, NVIDIA Corporation; see http://www.nvidia.com/object/cuda_home.html#.
2. L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, et al., "Larrabee: A Many-Core x86 Architecture for Visual Computing," ACM Trans. Graph. 27, No. 3, 1-15 (2008).
3. IBM Corporation, IBM Cell Broadband Engine Technology; see http://www-03.ibm.com/technology/cell/index.html.
4. IBM Corporation (June 9, 2008), Roadrunner Smashes the Petaflop Barrier. Press release; see http://www-03.ibm.com/press/us/en/pressrelease/24405.wss.
Cell/B.E. processor-based accelerator to a Windows- press/us/en/pressrelease/24405.wss.

5. "MPI: A Message-Passing Interface Standard Version 2.1," Message Passing Interface Forum, 2008; see http://www.mpi-forum.org/docs/mpi21-report.pdf.
6. IBM Corporation, Accelerated Library Framework Programmer's Guide and API Reference; see http://publib.boulder.ibm.com/infocenter/systems/scope/syssw/topic/eiccn/alf/ALF_Prog_Guide_API_v3.1.pdf.
7. IBM Corporation, Data Communication and Synchronization Library Programmer's Guide and API Reference; see http://publib.boulder.ibm.com/infocenter/systems/scope/syssw/topic/eicck/dacs/DaCS_Prog_Guide_API_v3.1.pdf.
8. IBM Corporation, IBM Dynamic Application Virtualization; see http://www.alphaworks.ibm.com/tech/dav.
9. OpenMP Architecture Review Board, OpenMP Application Programming Interface Version 3.0, May 2008; see http://www.openmp.org/mp-documents/spec30.pdf.
10. N. Trevett, "OpenCL Heterogeneous Parallel Programming," Khronos Group; see http://www.khronos.org/developers/library/2008_siggraph_bof_opengl/OpenCL%20and%20OpenGL%20SIGGRAPH%20BOF%20Aug08.pdf.
11. IBM Corporation, IBM BladeCenter QS22; see ftp://ftp.software.ibm.com/common/ssi/pm/sp/n/bld03019usen/BLD03019USEN.PDF/.
12. IBM Corporation, Cell/B.E. Technology-Based Systems; see http://www-03.ibm.com/technology/cell/systems.html.
13. IBM Corporation, SPE Runtime Management Library Version 2.2; see http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/1DFEF31B3211112587257242007883F3/$file/SPE_Runtime_Management_API_v2.2.pdf.
14. K. Koch, "Roadrunner Platform Overview," Los Alamos National Laboratory; see http://www.lanl.gov/orgs/hpc/roadrunner/pdfs/Koch%20-%20Roadrunner%20Overview/RR%20Seminar%20-%20System%20Overview.pdf.
15. A. Bergmann, "Spufs: The Cell Synergistic Processing Unit as a Virtual File System"; see http://www-128.ibm.com/developerworks/power/library/pa-cell/.
16. A. Heinig, R. Oertel, J. Strunk, W. Rehm, and H. J. Schick, "Generalizing the SPUFS Concept–A Case Study Towards a Common Accelerator Interface"; see http://private.ecit.qub.ac.uk/MSRC/Wednesday_Abstracts/Heinig_Chemnitz.pdf.
17. IBM Corporation, Software Development Kit for Multicore Acceleration Version 3.1, SDK Quick Start Guide; see http://publib.boulder.ibm.com/infocenter/systems/scope/syssw/topic/eicce/eiccesdkquickstart.pdf.
18. IBM Corporation, Cell Broadband Engine Programming Handbook; see http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/9F820A5FFA3ECE8C8725716A0062585F.
19. IBM Corporation, Cell Broadband Engine SDK Example Libraries: Example Library API Reference, Version 3.1; see http://publib.boulder.ibm.com/infocenter/systems/scope/syssw/topic/eicce/SDK_Example_Library_API_v3.1.pdf.
20. PCI-SIG, PCI Express Base 2.0 Specification Revision 1.1, March 28, 2005; see http://www.pcisig.com/specifications/pciexpress/base2/.
21. PCI-SIG, PCI Conventional Specification 3.0 & 2.3: An Evolution of the Conventional PCI Local Bus Specification, February 4, 2004; see http://www.pcisig.com/specifications/conventional/.
22. IBM Corporation, CoreConnect Bus Architecture; see http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/852569B20050FF7785256991004DB5D9/$file/crcon_pb.pdf.
23. J. Easton, I. Meents, O. Stephan, H. Zisgen, and S. Kato, "Porting Financial Markets Applications to the Cell Broadband Engine Architecture," IBM Corporation; see http://www-03.ibm.com/industries/financialservices/doc/content/bin/fss_applications_cell_broadband.pdf.
24. IBM Corporation, Moro's Inversion Example; see http://publib.boulder.ibm.com/infocenter/systems/scope/syssw/topic/eiccr/mc/examples/moro_inversion.html.

Received September 22, 2008; accepted for publication January 14, 2009

Hartmut Penner  IBM Systems and Technology Group, IBM Boeblingen Research and Development GmbH, Schoenaicher Strasse 220, 71032 Boeblingen, Germany (hpenner@de.ibm.com). Mr. Penner is a Senior Technical Staff Member responsible for the firmware architecture of Cell/B.E. processor-based systems. He holds an M.S. degree in computer science from the University of Kaiserslautern, Germany. During his career at IBM, he has been involved in many different fields, working on Linux, the GNU compiler collection, and various firmware stacks of IBM servers.

Utz Bacher  IBM Systems and Technology Group, IBM Boeblingen Research and Development GmbH, Schoenaicher Strasse 220, 71032 Boeblingen, Germany (utz.bacher@de.ibm.com). Mr. Bacher is an architect for hybrid systems. He earned his B.S. degree in information technology from the University of Cooperative Education in Stuttgart, Germany. He developed high-speed networking code for Linux on System z*. From 2004 to 2007, he led the Linux on Cell/B.E. processor kernel development groups in Germany, Australia, and Brazil. Since 2007, he has been responsible for designing the system software structure of future IBM machines.

Jan Kunigk  IBM Systems and Technology Group, IBM Boeblingen Research and Development GmbH, Schoenaicher Strasse 220, 71032 Boeblingen, Germany (jkunigk@de.ibm.com). Mr. Kunigk joined IBM in 2005 as a software engineer after receiving his B.S. degree in applied computer science from the University of Cooperative Education in Mannheim, Germany. Prior to his involvement in hybrid systems and the directCell prototype, he worked on firmware and memory initialization of various Cell/B.E. processor-based systems.

Christian Rund  IBM Systems and Technology Group, IBM Boeblingen Research and Development GmbH, Schoenaicher Strasse 220, 71032 Boeblingen, Germany (christian.rund@de.ibm.com). Mr. Rund received his M.S. degree in computer science from the University of Stuttgart, Germany. He joined IBM in 2001 as a Research and Development Engineer for the IBM zSeries* Fibre Channel Protocol channel. He has recently been involved in firmware and software development for the Cell/B.E. processor and the directCell prototype hybrid system.

Heiko Joerg Schick  IBM Systems and Technology Group, IBM Boeblingen Research and Development GmbH, Schoenaicher Strasse 220, 71032 Boeblingen, Germany (schickhj@de.ibm.com). Mr. Schick earned his M.S. degree in communications and software engineering at the University of Applied Sciences in Albstadt-Sigmaringen and graduated in 2004. Currently, he is the firmware lead of the Quantum Chromodynamics Parallel Computing on the Cell Broadband Engine (QPACE) project, which is a supercomputer mission of IBM and European universities and research centers to do quantum chromodynamics parallel computing on the Cell Broadband Engine Architecture.
