

Acceleration of EM-Based 3D CT Reconstruction Using FPGA

Young-kyu Choi, Student Member, IEEE, and Jason Cong, Fellow, IEEE

Abstract: Reducing radiation doses is one of the key concerns in computed tomography (CT) based 3D reconstruction. Although iterative methods such as the expectation maximization (EM) algorithm can be used to address this issue, applying this algorithm in practice is difficult due to the long execution time. Our goal is to decrease this long execution time to an order of a few minutes, so that low-dose 3D reconstruction can be performed even in time-critical events. In this paper we introduce a novel parallel scheme that takes advantage of the numerous block RAMs on field-programmable gate arrays (FPGAs). Also, an external memory bandwidth reduction strategy is presented to reuse both the sinogram and the voxel intensity. Moreover, a customized processing engine based on the FPGA is presented to increase overall throughput while reducing the logic consumption. Finally, a hardware and software flow is proposed to quickly construct a design for various CT machines. The complete reconstruction system is implemented on an FPGA-based server-class node. Experiments on actual patient data show that a 26.9x speedup can be achieved over a 16-thread multicore CPU implementation.

Index Terms: Accelerator architectures, computed tomography, field-programmable gate arrays, parallel architectures, ray tracing.

Manuscript received May 21, 2015; revised August 04, 2015; accepted August 06, 2015. This work was supported in part by the Center for Domain-Specific Computing under the National Science Foundation (NSF) Grant CCF-0926127 and in part by NSF Grant CCF-0903541. It was also supported in part by C-FAR, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA. This paper was recommended by Associate Editor E. Lam.

The authors are with the Computer Science Department, University of California, Los Angeles, Los Angeles, CA 90095 USA (e-mail: ykchoi@cs.ucla.edu; cong@cs.ucla.edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TBCAS.2015.2471813

I. INTRODUCTION

COMPUTED tomography (CT) is a commonly used methodology that produces 3D images of patients. It allows doctors to non-invasively diagnose various medical problems such as tumors, internal bleeding, and complex fractures. CT has an advantage over other imaging techniques in that it generally requires a shorter acquisition time than magnetic resonance imaging (e.g., [1]) and generally provides better image quality than ultrasound (e.g., [2]). However, high radiation exposure from CT scans raises serious concerns about safety. This has triggered the development of low-dose compressive sensing-based CT algorithms. Instead of traditional algorithms such as filtered back projection (FBP) [3], iterative algorithms such as expectation maximization (EM) [4], [5] are used to obtain a quality image with considerably less radiation exposure. Moreover, the iterative algorithm has a better noise reduction property and offers better spatial resolution by using a more natural geometric model than FBP [6].

However, these advantages are achieved at the cost of much extended computation time [7]. This problem occurs because of the iterative nature of EM. For example, the literature reports 14 minutes and 1.5-2 hours of CPU computation while processing 64 and 500 projections of 128 x 128 and 736 x 64 pixels, respectively [8], [9]. This becomes more problematic with actual patient images, which require larger input data for accurate measurement. Our testset, which has 15,022 projections of 672 x 16 pixels, requires 3.3 hours of computation with 16 threads on an Intel Xeon E5-2420 CPU. However, our requirement is to complete the reconstruction in under two minutes, as part of the low-dose adaptive reconstruction flow (Section II-B), so that doctors can give immediate on-site feedback to the patient. The performance gap between the multicore CPU result and our goal is approximately 100 times.

In order to reduce the execution time of CT reconstruction, some systems perform the computation on the cloud [10], [11]. However, this may raise concerns about private health information being transmitted through the web. Also, the system reliability may be compromised due to sporadic network problems. Instead, we envision a supercomputer-in-a-box, where the computing node is located next to a CT machine and reliably produces a high-resolution image. To meet the high computational demand, such nodes can be constructed using a hardware accelerator such as a graphics processing unit (GPU) or field-programmable gate array (FPGA).

Most previous work on GPUs and FPGAs takes advantage of abundant parallel-working cores with efficient work distribution methods [12]-[14] and thread divergence avoidance strategies [15], [16]. However, they are mostly based on FBP. When implementing iterative CT with accelerators, several challenges still exist.

One challenge is the high data transfer requirement. Previous research has shown that a bottleneck will be reached where adding more computing elements no longer improves the performance [9]. The reason is that CT reconstruction is a memory-bounded application where the performance is dominated by the system memory bandwidth [9], [17]. For example, an implementation based on [9] transfers 974 GB of data per iteration to DRAM for real patient image reconstruction. To solve this bandwidth problem, previous GPU and FPGA acceleration papers proposed bit-width optimization [8], [18], locality improvement [17], [19], prefetching and masking [9], and data movement reduction [20]. However, the improvement was limited to 1.3x-3.1x. To achieve our two-minute goal, a more radical algorithmic and architectural improvement is needed.


Fig. 1. CT scan trajectory. The parameters are from the M.125000 dataset. Modified from [23].

One potential for a large increase in the system bandwidth can be found in the FPGA architecture. Although the FPGA platform has a slower clock frequency (100 MHz versus 500-1000 MHz) and less DDR bandwidth than the GPU, it can customize the computation and the data communication. For example, it contains many block RAMs (BRAMs) that can be used to perform a large amount of local data transfer. However, BRAM usage alone is not very effective unless DRAM can supply the data to BRAM without stalling.

Another important challenge is to achieve a high degree of parallelization while efficiently resolving access conflicts to local memory. Problems such as race conditions and bank conflicts occur when several rays try to update a voxel at the same time. Unlike most other FBP-based acceleration work that achieves high parallelism with point-like voxels, our work incorporates volumetric voxels for a realistic geometric model. However, this complicates the design with serialization and a non-stencil access pattern. Chen et al. proposed parallelizing rays that are sufficiently apart [22], but the large distance limits the amount of parallelism. A new approach is needed that can avoid the access conflict while still achieving large parallelism with minimal overhead.

In this paper we present a complete and working CT reconstruction system implemented on a server-class node with FPGA coprocessors. It incorporates several FPGA-friendly techniques for acceleration. The contributions of this paper include:

• Ray-driven voxel-tile parallel approach: This approach exploits the computational simplicity of the ray-driven approach, while taking advantage of both ray and voxel data reuse. Both the race condition and the bank conflict problems are completely removed. Moreover, it can easily increase the degree of parallelism with adjustment of the tile shape. A strategy for the tile shape decision is also presented.

• Variable throughput matching optimization: We provide strategies to increase the performance for designs that have a variable and disproportionate throughput rate between processing elements (PEs). In particular, the logic consumption is reduced by exploiting the low complexity of the frequently computed parts of the ray-driven approach. The logic usage is further reduced by using small-granularity PEs and module reuse. This leads to overall throughput improvement with module duplication.

• Offline memory analysis for irregular access patterns in the ray-driven approach: To efficiently tile the voxels for irregular memory access, an offline memory analysis technique is proposed. It exploits the input data independence property of the CT machine. Also, a compact storage format is presented.

• Customized PE architecture: We present a design that achieves high throughput and logic reuse for each PE.

• Design flow for rapid hardware-software design: A flow is presented to generate hardware and software for various CT parameters.

This work is an extension of our previous conference publication [23]. The previous work mainly focused on exploiting the data reuse to reduce off-chip access. In this work we added the parallelization scheme, the offline memory analysis technique, the variable throughput optimization, and the automated design flow. Due to the new optimizations, this work is 5.2 times faster than [23] on the same dataset.

The remainder of this paper is organized as follows: Section II provides the background and analysis of the algorithm and the platform. It also provides a brief analysis of the problem and highlights the challenges for constructing an efficient design. Section III explains the ray-driven, voxel-tile parallel approach and the offline memory analysis technique. The design flow, details of the customized PE design, and throughput optimization strategies are provided in Section IV. The experimental results can be found in Section V.

II. BACKGROUND AND PROBLEM ANALYSIS

A. CT Scanner Geometry and Dataset Description

In a CT scan, as shown in Fig. 1, radiation from the source penetrates the patient's body and gets measured by an array of detectors. The source and the detector array proceed in a helical fashion, where the pair move in a circular direction in the x-y plane and in a lateral direction along the z-axis.

In our clinical setting, the CT scan was acquired using a helical cone beam protocol on a Siemens Somatom Sensation 16 scanner. The focus-detector distance (FDD) and the focus-isocenter distance (FID) (both shown in Fig. 1) are 1,040 mm and 570 mm, respectively. The circular rotation rate of the source and detector array is 1,160 samples per rotation.

The lateral movement of the source and detector per sample is 4.8 μm. The detector array consists of 672 channels and 16 rows.

For the dataset, we use CT images (labeled M.125000) acquired from an actual patient at our institution using a protocol approved by the institution's review board (IRB). To simulate a reduced radiation dose, we used 15,022 samples, which is 25% of the originally obtained data. The size of the reconstructed volume is 512 (N_x) by 512 (N_y) by 372 (N_z) voxels. The size of a voxel is 0.77 mm in each dimension.

B. Adaptive Flow for Lung Cancer Screening

Our aim is to construct an automatic lung cancer screening system. First, the patient goes through a low-dose CT scan. Then the result is converted into 3D images using a reconstruction algorithm. Next, the reconstructed images are processed by our computer-aided cancer nodule detection algorithm [24], [25]. If a suspicious nodule is found, a second full-dose CT scan is selectively performed to obtain a high-quality image around the target image slices.

Even with 25% low-dose scans, experiments show that the nodule detection sensitivity is maintained at 100% using the compressive sensing techniques in [26]. This leads to significant dose reduction for patients, with or without cancer. For example, for a patient that has a suspicious cancer nodule that spans across 37 slices (2.8 cm), the dose reduction is approximately 65%.

In this paper we concentrate on acceleration of the 3D reconstruction, which is the most time-consuming part of our cancer screening system.

C. 3D Reconstruction Algorithm and Problem Analysis

1) Expectation Maximization (EM) for 3D Reconstruction: The objective of 3D CT reconstruction is to recover the object image intensity f from g, which is the measurement obtained from the CT scan. The problem can be described as solving the linear equation g = Af, where A is the matrix that contains how much each voxel of an object will be projected onto the detector array.

EM is an iterative method of solving the above problem by progressively refining the object, finding the most likely intensity given the CT scan observation. The overall algorithm is shown in Fig. 2. The first step of EM is the forward projection, which obtains the projection of the estimated image. Using the notation in [26], this can be described as ĝ^(k) = A f^(k), where f^(k) is the image intensity in the k-th iteration, and ĝ^(k) is the projection result, which is also called the sinogram. After the sinogram is compared with the input observation (g), it is back-projected onto the image in the backward projection step: b_j = Σ_i a_ij (g_i / ĝ_i^(k)), where a_ij is an element of A, and i and j are the indices in the sinogram and image domains, respectively. Finally, the previous estimate of the image intensity f_j^(k) is updated to f_j^(k+1) using the equation

    f_j^{(k+1)} = \frac{f_j^{(k)}}{s_j} \sum_i a_{ij} \frac{g_i}{\hat{g}_i^{(k)}}        (1)

where s_j = Σ_i a_ij is the normalization constant. These operations are repeatedly performed until the estimated object image converges.

Fig. 2. EM-based iterative reconstruction process, assuming that the ray tracing method has been used for the forward and the backward projection.
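To make the data flow of these steps concrete, the following C sketch outlines one EM iteration. It is an illustration only, not the authors' implementation: forward_project() and backward_project() stand in for the ray tracing routines of Section II-C2, and N_RAY, N_VOX, and the EPS guard are assumed names and values.

#include <stddef.h>

#define N_RAY (15022L * 672 * 16)      /* rays in the test dataset          */
#define N_VOX (512L * 512 * 372)       /* voxels in the test dataset        */
#define EPS   1e-8f                    /* guard against division by zero    */

/* Assumed ray-tracing routines (Section II-C2); definitions not shown. */
void forward_project(const float *f, float *g_est);    /* g_est = A f             */
void backward_project(const float *ratio, float *bp);  /* bp_j = sum_i a_ij r_i   */

/* One EM iteration: forward projection, comparison with the observation,
 * backward projection, and the multiplicative update of Eq. (1). */
void em_iteration(const float *g, float *g_est, float *ratio,
                  float *f, float *bp, const float *s)
{
    forward_project(f, g_est);
    for (long i = 0; i < N_RAY; i++)
        ratio[i] = g[i] / (g_est[i] + EPS);
    backward_project(ratio, bp);
    for (long j = 0; j < N_VOX; j++)
        f[j] *= bp[j] / (s[j] + EPS);  /* s_j = sum_i a_ij (normalization) */
}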
After these EM steps, total variation (TV) regularization [21] is performed on the image to remove noise and blurring. The reader is referred to [26] for details of this process.

TABLE I
PROFILING RESULT USING 16 THREADS ON INTEL XEON E5-2420

The profiling result using 16 threads on an Intel Xeon E5-2420 is shown in Table I. It suggests that the forward projection and the backward projection should be the focus of FPGA acceleration. We have decided to accelerate only EM and exclude TV, since the portion of TV is very small.

2) Forward and Backward Projection: The computation pattern for forward and backward projection is described in Fig. 3. Although the equation looks simple, the challenge lies in efficiently finding the intersection length of a ray onto a voxel (a_ij). It is difficult to precompute A and store it to memory, because it requires 243 GB even in a sparse matrix form. Such a large amount of data is needed because of the large number of rays (15,022 x 672 x 16) and the image size (512 x 512 x 372).

Rather than storing a matrix, two types of efficient methods are commonly used for forward and backward projections: the ray-driven [27], [28] and voxel-driven [29], [30] approaches.

The ray-driven approach computes the integral Σ_j a_ij f_j along the line of all rays; a_ij is the intersection length, and f_j is the image intensity. The voxel-driven approach takes the opposite direction, by updating all sinogram values that are projected from a particular voxel.

Since the radiation from the source spreads out to the detectors in a cone-beam shape (Fig. 1), several rays may try to access the same voxels near the source. This does not become a problem when using the ray-driven approach in the forward projection, because the voxel intensity is read-only. In the backward projection, however, race conditions may exist because the voxel intensity will be read and written. Note that the voxel-driven approach has the opposite characteristics, i.e., an access conflict may exist in the forward projection because several voxels may update the same sinogram value, but no conflict happens in the backward projection, because each voxel only reads from several sinogram values.

Fig. 3. Memory access and computation pattern of forward and backward projection.

In terms of computational complexity, the frequently used computation in the ray-driven approach is much simpler. As can be seen in Fig. 3 and Algorithm 1, the voxel position and the intersection length can be implemented with a few comparators and adders. For the voxel-driven approach, the weights of the detectors projected from an image voxel are computed by finding the ratio between the source-to-voxel and the source-to-detector distance [29], [30]. This magnification factor computation requires frequent division and multiplication. It can be simplified by using a look-up table [30], but that requires additional memory access. More complex computation leads to larger FPGA logic resource consumption. This has a negative impact on performance, since module duplication is heavily used to increase the throughput in our design (Section IV-D). Thus, we have decided to use the ray-driven approach.

3) Review of Ray-Driven Approach: The computational kernel of our algorithm is derived from the ray-driven approach proposed by Zhao and Reader [27]. Their method, also referred to as the ray tracing algorithm, is an enhanced variant of the widely used Siddon's algorithm [28].

Algorithm 1: Intersection length and coordinate computation for ray tracing.

Fig. 4. Computation for ray geometry data, intersection length, and voxel coordinate. The pseudocode is given in Algorithm 1 and Algorithm 2. Modified from [27].

The main advantage of this algorithm is the on-the-fly construction of the intersection length and the position of a ray by exploiting the repetitive nature of the traversal. Consider a ray traversing through a 2D grid of voxels, as in Fig. 4. The intersection length of the currently traversed voxel (shaded grey) and the coordinate of the next traversed voxel can be obtained using the update equation in Algorithm 1. α_cur is the distance of the ray that has been traversed so far, and α_x and α_y are the distances the ray will have traversed if it next passes through the x-grid or y-grid, respectively. Δα_x and Δα_y are the ray distances between two voxels in each dimension. This equation can be extended to the 3D case by adding the z direction.
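The 2D update can be sketched in C as follows. This is a simplified rendition for illustration; the exact pseudocode is Algorithm 1 in [27], and the struct layout and names here are assumptions.

/* Incremental 2D ray traversal in the style of Algorithm 1 (Zhao-Reader /
 * Siddon).  alpha_* are parametric distances along the ray; multiplying by
 * the ray length converts them to physical units. */
typedef struct {
    float alpha_cur;          /* distance traversed so far                  */
    float alpha_x, alpha_y;   /* distance at the next x-grid / y-grid plane */
    float d_alpha_x, d_alpha_y;  /* distance between two grid crossings     */
    int   ix, iy;             /* current voxel coordinate                   */
    int   step_x, step_y;     /* +1 or -1 depending on the ray direction    */
} ray2d_t;

/* One traversal step: returns the intersection length of the current voxel
 * and advances the voxel coordinate, using only compares and adds. */
static inline float trace_step(ray2d_t *r)
{
    float len;
    if (r->alpha_x < r->alpha_y) {        /* ray crosses an x-grid next     */
        len          = r->alpha_x - r->alpha_cur;
        r->alpha_cur = r->alpha_x;
        r->alpha_x  += r->d_alpha_x;
        r->ix       += r->step_x;
    } else {                              /* ray crosses a y-grid next      */
        len          = r->alpha_y - r->alpha_cur;
        r->alpha_cur = r->alpha_y;
        r->alpha_y  += r->d_alpha_y;
        r->iy       += r->step_y;
    }
    return len;
}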
Algorithm 2: Ray geometry data computation. This has been shortened for brevity. The complete computation and its meaning can be found in [27].

The ray geometry computation produces the set of parameters needed for ray tracing: the increments Δα in each dimension, the initial α value in each dimension, the initial traversal distance α_cur, and the initial voxel coordinate. Compared to the ray tracing, this requires more resources because it implements the complex functions shown in Algorithm 2. The source and detector positions can be obtained by multiplying their distance from the center (FID, FDD) by the sine/cosine of the rotation angle and adding the z-direction movement. This requires four sine/cosine units and four multipliers for implementation. Then the α values at the image boundary are computed and compared to obtain the entry point of the ray. Next, the remaining parameters, Δα and the initial voxel coordinate, are obtained as shown in Algorithm 2. After resource sharing, this requires one floating-point division unit and about 15 additional multipliers. Note that although it is expensive in terms of resource consumption, its overall portion becomes small by exploiting the difference in the module throughput, as will be explained in Section IV-D.
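As an illustration of the kind of computation involved (and of why it needs sine/cosine units and multipliers), the C sketch below computes approximate source and detector endpoints for one ray. The fan/row geometry, the parameter names, and everything except FID/FDD and the detector dimensions are simplifications, not the authors' Algorithm 2.

#include <math.h>

#define PI_F   3.14159265358979f
#define FID    570.0f      /* focus-isocenter distance [mm]  */
#define FDD   1040.0f      /* focus-detector distance  [mm]  */
#define N_ROT 1160         /* samples per rotation           */

typedef struct { float x, y, z; } vec3_t;

/* Approximate endpoints of the ray from the source to detector (ch, row)
 * at helical sample p.  fan_pitch, row_pitch, and dz are illustrative
 * parameters (channel fan angle, row spacing, lateral movement/sample). */
void ray_endpoints(int p, int ch, int row,
                   float fan_pitch, float row_pitch, float dz,
                   vec3_t *src, vec3_t *det)
{
    float theta = 2.0f * PI_F * (float)p / (float)N_ROT;  /* gantry angle  */
    float zpos  = dz * (float)p;                          /* helical offset */

    src->x = FID * cosf(theta);            /* two sin/cos + two multipliers */
    src->y = FID * sinf(theta);
    src->z = zpos;

    /* Channel ch is offset from the central ray by a fan angle; row is
     * offset along z.  Two more sin/cos + two more multipliers. */
    float fan_ang = (ch - 672 / 2) * fan_pitch;
    det->x = src->x - FDD * cosf(theta + fan_ang);
    det->y = src->y - FDD * sinf(theta + fan_ang);
    det->z = zpos + (row - 16 / 2) * row_pitch;
}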
The disadvantage of this algorithm is that it serializes the ray tracing computation, which interferes with achieving massive parallelism for GPU and FPGA implementation. Also, the voxel coordinate changes irregularly in all x, y, z directions, which causes frequent memory bank conflicts. Moreover, this algorithm has non-stencil access patterns, which makes it hard for compilers to analyze.

D. Convey HC-1ex Platform

This work has been implemented and verified using the Convey HC-1ex platform. The platform includes an Intel Xeon 5138 host CPU and four Xilinx Virtex-6 LX760 FPGAs [31]. This platform was selected because of the high coprocessor memory bandwidth and the efficient scatter-gather memory access capability [32] of the coprocessor memory controller. Also, the cache-coherent shared memory space eases development. The coprocessor memory has 16 dual in-line modules. With the memory controller running at 300 MHz, it has a theoretical coprocessor memory bandwidth of 80 GB/s. Each FPGA provides 16 FIFO-based DRAM channels, and these channels are connected to the DRAM in a full crossbar.

III. RAY-DRIVEN VOXEL-TILE PARALLELISM

A. Tile-Based Parallelism for FPGA

Fig. 5. Proposed ray-driven voxel-tile parallel approach. Each ray tracing PE will have a tile of voxels in its BRAM, and process one ray at a time.

Our initial experiment demonstrated that an implementation similar to [9] transfers 974 GB of data per iteration for our test dataset. This puts heavy pressure on the DRAM bandwidth. A solution to this problem can be found in the abundant BRAMs present in the FPGA. We can take advantage of the large internal bandwidth provided by the 5,760 independently working BRAM modules in the four FPGAs.

To use BRAM efficiently, a number of challenges must be considered. The first problem is to decide which data is stored in the BRAM. In our design, the voxel intensity is stored in the internal BRAM. This is because of the inherent data reuse rate in the ray-driven approach. When each ray passes through multiple voxels, the intermediate accumulated sinogram (forward projection) or the sinogram used for updating (backward projection) can be stored as a local variable and referenced without accessing external storage. Since each ray traverses through an average of 1,004 voxels in our dataset, the data access ratio between the voxel intensity and the ray data (ray geometry parameters and sinogram) is 1,004 to 1. Since this means that voxels require higher bandwidth, they are stored in the BRAM.

The second aspect that should be considered is the data reuse rate. If BRAM accesses the DRAM frequently, the DRAM bandwidth will become the bottleneck. As a result, BRAM will often stall waiting for data to arrive. This suggests that the data reuse rate in BRAM should be high. This can be achieved by grouping a contiguous set of voxels. Such voxel tiling benefits from the ray traversal pattern; since a ray is likely to traverse through nearby voxels, a tile of nearby voxels can share the data from a similar set of rays. Thus, the ray data reuse rate is increased. Our previous work showed that the voxel reuse rate can be up to 831 [23]. Since the sinogram reuse was achieved with a ray-driven approach, our algorithm can exploit both types of reuse. This leads to a large reduction in off-chip memory transfer.

Last, we should be able to extract enough parallelism to take advantage of the available BRAMs. As mentioned in Section II-C2, a race condition and bank conflict may occur when several rays try to update the same voxel. This becomes more problematic with Zhao and Reader's algorithm [27] since it introduces serialization into the ray tracing. For the race condition problem, an atomic operation can be used to update the voxels, but it will degrade the performance significantly. Chen et al. propose parallelizing sufficiently spaced rays [22], but this limits the amount of parallelism and may cause frequent memory bank conflicts.

As a solution to the above problems, we propose a new parallelization scheme which can be classified as a ray-driven voxel-tile parallel approach. Each PE will have access to a BRAM that stores a tile of voxels. One PE will be able to perform just one ray tracing at a time, but instead, multiple PEs work on different tiles of voxels to obtain high parallelism, as shown in Fig. 5. After processing all rays in a tile, the PE will proceed to the next voxel tile. This approach can also be viewed as exploiting the parallelism by dividing a ray into multiple segments and performing ray tracing starting in each voxel tile. Parallelizing voxels while still using a ray-driven approach may seem counter-intuitive at first, but this is designed to exploit both the low computational complexity of a ray tracing method and the high BRAM system bandwidth of voxel tiling. Note that the voxel tiling idea itself was proposed previously with FBP on GPU [12], [13], but it was not in the context of EM, nor was it combined with ray tracing as in our scheme.
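The structure of the scheme can be sketched as follows, written as multithreaded C purely to show the dependence pattern; on the FPGA each iteration of the tile loop corresponds to a ray tracing PE with its own BRAM. The tile/ray-list types and the helpers load_tile() and trace_in_tile() are illustrative assumptions, and only the forward-projection direction is shown.

#include <omp.h>

#define TILE_X 64
#define TILE_Y 64                      /* 64 x 64 x 1 tile (Section III-C)  */

typedef struct { int x0, y0, z0; } tile_t;                /* tile origin    */
typedef struct { const long *ray_id; int count; } ray_list_t;

/* Assumed helpers: copy a tile of voxels into local memory, and trace the
 * segment of one ray that lies inside the tile (returns its contribution). */
void  load_tile(const float *image, const tile_t *t, float *local_vox);
float trace_in_tile(const tile_t *t, const float *local_vox, long ray_id);

void forward_project_tiled(const tile_t *tiles, int n_tiles,
                           const ray_list_t *list, const float *image,
                           float *partial_sino, long n_rays)
{
    #pragma omp parallel for schedule(dynamic)
    for (int t = 0; t < n_tiles; t++) {
        float local_vox[TILE_X * TILE_Y];            /* BRAM tile copy      */
        load_tile(image, &tiles[t], local_vox);      /* voxels read once    */

        /* Only the rays recorded for this tile by the offline analyzer
         * (Section III-B) are traced; each PE handles one ray at a time,
         * so no race condition or bank conflict occurs on local_vox. */
        float *my_sino = partial_sino + (long)omp_get_thread_num() * n_rays;
        for (int r = 0; r < list[t].count; r++) {
            long id = list[t].ray_id[r];
            my_sino[id] += trace_in_tile(&tiles[t], local_vox, id);
        }
    }
    /* the per-PE partial sinograms are combined afterwards (Section III-D) */
}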

The advantage of the proposed parallel approach is that we can easily adjust the parallel factor by changing the size of the voxel tile (the decision process will be discussed in Sections III-C and IV-D). Another advantage is that a race condition cannot occur between rays since each PE computes one ray at a time. Moreover, bank conflicts are removed since each PE has dedicated BRAM access.

The proposed method, however, creates several new problems. First, the irregular access pattern of ray tracing makes it difficult to analyze which rays access a voxel tile. Second, ray data has to be fetched several times. Third, access conflicts exist in the ray sinogram. The solutions to these problems are discussed in the following subsections.

Note that it is also possible to reuse rays in the voxel-driven approach, as was presented in [33]. In their design, parallelism is achieved by using multiple transaxial projection modules.

B. Offline Analysis for Irregular Memory Access

One of the most complex problems of the voxel-tile-parallel scheme is to find the rays that pass through a certain tile. The non-stencil, irregular access pattern of ray tracing makes it difficult for compilers to analyze the memory access pattern. An exhaustive search also cannot be used, since very few among all rays access a tile; for example, only 0.03% of rays traverse through a tile of size 64 x 64 x 1. It is possible to project a voxel to the sinogram and find the rays by computation [15], but this consumes extra logic resources.

As a solution, we propose an offline, profiling-based approach. A software-based memory access analyzer runs offline and records the list of rays that access certain tiles. This is possible because of a special property of CT reconstruction. In a typical CT usage scenario, the CT patient will change, which will cause the input dataset to vary. The memory access pattern of the ray traversal, on the other hand, will typically not change, since it depends only on the CT machine parameters. In other words, the memory access pattern of the CT machine is independent of the input data; it is the same across multiple patients. This allows us to perform an offline analysis and devise an optimized design.

Fig. 6. Storage format that approximates the detector channel and row for each projection.

However, simply storing the list of individual rays per tile is not very efficient, since it requires 5.6 GB. Instead, we again take advantage of the CT machine characteristics. Since each radiation source has rays spreading out in a cone-beam shape (Fig. 1), it is likely that nearby rays will access the same voxel tile. This allows the list of rays to be compacted by storing only a continuous range of rays. Also, since a voxel tile has a rectangular shape, projecting the voxel tile back to the detector array produces a list of detectors that is close to a rectangular shape. Thus, it becomes possible to approximate the ray list with a rectangular bounding box. This is expressed as (highest/lowest detector channel #, highest/lowest row #), as shown in Fig. 6. The ray bounding box expression is stored per projection per voxel tile. The proposed method of generating the list of rays requires 58 MB of memory access, which is a large improvement over the original list.

Note that a method for representing the ray traversal path in the voxel domain has been proposed in [33]. A clever run-length encoding is used to reduce the size of the representation.
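A minimal sketch of the bounding-box record and of how the offline analyzer could fill it is shown below; the field widths, names, and update routine are illustrative, not the authors' exact storage layout.

#include <stdint.h>

/* One record per (projection, voxel tile) pair that the offline run finds
 * to be intersected: only the detector bounding box is kept instead of an
 * explicit list of rays. */
typedef struct {
    uint16_t ch_lo, ch_hi;     /* lowest / highest detector channel (0-671) */
    uint8_t  row_lo, row_hi;   /* lowest / highest detector row     (0-15)  */
} ray_bbox_t;

/* Called from the software ray tracer for every (tile, detector) hit:
 * widens the box so that it covers all rays of this projection that touch
 * the tile.  Boxes start out "empty" (lo > hi). */
static void widen_bbox(ray_bbox_t *b, int ch, int row)
{
    if (ch  < b->ch_lo)  b->ch_lo  = (uint16_t)ch;
    if (ch  > b->ch_hi)  b->ch_hi  = (uint16_t)ch;
    if (row < b->row_lo) b->row_lo = (uint8_t)row;
    if (row > b->row_hi) b->row_hi = (uint8_t)row;
}

At run time, a PE would then simply iterate over the ch_lo..ch_hi and row_lo..row_hi ranges stored for its tile (Section IV-C4).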
TABLE II
AVERAGE RAY ACCESSES PER VOXEL WITH VARYING GROUP OF RAYS

C. Tile Shape Decision

As long as enough parallelism is guaranteed, the voxel tile should be made as large as possible. The reason is that the sinogram has to be accessed again when a ray traverses to a different voxel tile. However, the limited size of the BRAM (12.7 MB) forces a decision on the shape and size of the data stored in the internal memory.

The shape of the tile should be decided by the trajectory of the CT scan. Table II shows the average number of accessed rays that traverse through a voxel. Among a group of rays in the 672 channels x 16 detector rows (a single projection), each voxel is accessed only 2.3 times on average. The reason for such a small number of accesses is that the radiation emission spreads out in a cone shape (Fig. 1), and very few rays access the same voxel when far from the source. On the other hand, there is an almost linear increase in the average number of accessed rays when the ray group is expanded to consecutive projections (Table II). The reason is that the detector and source movement along the z-axis is slow compared to the circular rotation rate. Thus, it is likely that rays of nearby projections intersect almost identical x-y planes. The difference between the rays of nearby projections is in the penetration angle, which only changes the order of voxel access but not the actual list of voxels accessed. This suggests that, for the voxels, it is advantageous to use a flat x-y surface tile rather than a cube tile to maximize the reuse between adjacent rays.

In addition to the available BRAM size, the voxel tile size is also affected by the available logic resources and the external memory bandwidth. As will be discussed in Section IV-D, we can place 256 ray tracing PEs into the platform. Assuming about 4.5 MB (35% of the total BRAM resource) has been assigned for ray tracing, we can use a voxel tile size of 64 x 64 x 1, which is the most flat shape configuration.
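As a rough check of this budget (assuming, for illustration, one 4-byte intensity word per voxel; the fixed-point formats of Table V may differ):

    256 PEs x (64 x 64 x 1) voxels x 4 B  =  4 MB   <=  4.5 MB budget
    256 PEs x (128 x 128 x 1) voxels x 4 B = 16 MB  >  12.7 MB total BRAM

i.e., under this assumption a flatter 128 x 128 x 1 tile could not be given to all 256 PEs at once.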

TABLE III
BRAM SIZE AND DRAM TRANSFER WITH VARIOUS TILE SIZES

The relationship between the required external memory bandwidth and the BRAM size is shown in Table III. Since the voxels are accessed once, only a small amount of the memory transfer is used for the voxels, and the rest is for the sinogram. The results in the table confirm that the bandwidth requirement is low when a large amount of internal memory is available. Also, they show that when the BRAM size is fixed, a flat-shape tile (e.g., 128 x 128 x 1) has more DRAM transfer savings than a cube-like tile (e.g., 64 x 64 x 4).

D. Synchronization

Although the voxel access conflict was removed by adopting voxel tile parallelism, a sinogram access conflict has been newly introduced in the forward projection. To solve this problem, we decided to use the barrier synchronization method. Each PE will update its sinogram value to a memory location that is different from other PEs. This does not mean the memory allocation will be increased by the number of available PEs, because a ray traverses through a limited number of voxel tiles. After performing an offline analysis to find all voxel tiles that rays pass into, adjacent voxel tiles are assigned to the same PE to reduce the number of PEs that might access the same sinogram. Experiments show that the DRAM memory allocation size for the sinogram increases by eight times for our dataset. This does not have a severe negative impact on the performance because the amount of data transfer remains the same for the ray tracing PE.

E. Comparison

TABLE IV
COMPARISON OF DRAM DATA TRANSACTION (PER ITERATION)

The DRAM data transaction comparison between a design that only reuses ray data (e.g., [9]) and the proposed design is presented in Table IV. Reusing only the sinogram requires large off-chip memory bandwidth to transfer the voxel intensity. In the proposed scheme, the voxels in a tile are transferred only once, because they are reused among several rays. At the same time, the ray data is still being reused, because ray tracing is performed within a tile. Although the sinogram data access has increased 8.6 times for a 64 x 64 x 1 tile size, this has little effect on the overall data transfer, since its portion was small compared to the voxel transfer. Considering the combined value of sinogram and intensity, there is a 55.3x reduction in external memory access.

Table IV also shows that there is little difference in the combined amount of BRAM and DRAM access. However, the performance drastically improves, because the FPGA has a much higher BRAM bandwidth than DRAM bandwidth.

Fig. 7. Software and hardware design flow.

IV. DESIGN FLOW AND ARCHITECTURE

A. Overall Design Flow

Our design is based on the assumption that even though the patient changes, the CT scan parameters will not vary much. This allows engineers to optimize the design offline (Section III-B). A possible usage scenario is that the CT machine provides some limited number of FPGA-acceleration settings that are most frequently used. Radiologists can choose one of these settings and quickly obtain the reconstructed result.

The overall design flow is shown in Fig. 7. The software analyzer records the memory access pattern into a file and provides guidelines on the hardware architecture. The hardware design, written in C, is then fed into the Vivado HLS 2013.1 tool to generate an RTL design. The RTL files are then combined with the host and DRAM interface RTL. Next, the bitstream is created using the Convey development tool chain, which is based on the Xilinx ISE tool. The host software performs the necessary pre- and post-processing, such as reading the ray list information or reading and writing the image file.

B. Software Analyzer

The software analyzer is responsible for generating two types of information: the list of rays per voxel tile and the hardware design parameters. The range of rays is easily obtained by measuring the minimum and the maximum of the radiation detectors per projection for each voxel tile (Section III-B). The hardware design parameters include the tile size, the shape, and the module duplication information. The reason these have to be extracted is that the memory access pattern will change as the CT parameters change.

These parameters are decided based on the measured throughput of the ray tracing PE, the number of rays passing into a voxel tile, and the number of DRAM accesses (e.g., Table III), as will be explained in more detail in the next subsection. Note that the analyzer has been implemented in software because it is easier to make modifications, and also because its execution time is not very constrained.

Fig. 8. Mapping between the algorithm and the architecture for the proposed system.

C. Accelerator Architecture

1) Overall Architecture: The overall architecture and the mapping from the algorithm are shown in Fig. 8. It shows three types of customized PEs. The ray tracing of the forward and backward projections is mapped to the ray tracing PE. The ray geometry PE generates the parameters needed for ray tracing. The sinogram comparison and the voxel update parts are mapped to the comparison/update PE.

General-purpose modules have been extracted from the customized PEs. In Fig. 8, C represents the loop iterator generator for both the tiles and the rays. D represents the DRAM access unit. If the numbers of PEs do not match, a network logic, denoted N, is used to either gather or distribute the data. It employs round-robin scheduling to allocate similar transfer rates to the receiving PEs.

2) Accelerator Design Principles: We employed several principles for high-performance accelerator design. The first principle is to divide PEs as finely as possible for aggressive pipelining. For example, even small modules such as the data distribution logic, the loop iterator generator, and the DRAM access logic were detached from the ray tracing and ray geometry computation PEs. This allows us to take advantage of the different throughput rates of the PEs, because logic area can be saved by having multiple low-throughput PEs share access to high-throughput PEs. This leads to throughput optimization using module duplication, as will be explained in Section IV-D.

The second principle is module reuse. If multiple functions can be mapped to the same PE, it saves area and enables more module duplication. For example, the ray tracing of both the forward and backward projections is mapped to the same PE.

The third principle is to use FIFO logic for inter-module communication. Since many modules are designed as high-throughput streaming modules, FIFOs simplify the data transfer process.

Finally, we concentrated on pipelining different tasks. The overall execution time can be reduced not only by replicating one type of PE, but also by overlapping the computation of different types of PEs. The advantage is that it allows local communication between directly connected PEs and removes the data storage time to memory. Note that such task parallelism is hard to use efficiently on GPUs and multicore CPUs, and gives an inherent advantage to FPGA-based approaches.

3) Ray Tracing PE: The ray tracing PE is responsible for accumulating (forward projection) or updating (backward projection) the intensity of voxels, as shown in Fig. 3. Due to the similarity in their computation patterns, they can be combined into one PE, as shown in Fig. 9. An operation control bit is used to select either the forward or the backward projection mode.

Fig. 9. Ray tracing architecture that supports both forward and backward projection. The pseudocode for the intersection length and the voxel coordinate generator can be found in Algorithm 1.

The computation unit is tightly coupled with the internal BRAM for high bandwidth. To achieve a throughput of one voxel update per cycle, each updated voxel should be written to BRAM while the next voxel is being read. This requires a two-port BRAM. Note that there is no race condition from reading and writing the same address in BRAM, because a ray is guaranteed to move by one coordinate in each cycle. Also, the external DRAM channel is shared among several ray tracing PEs, because a PE's BRAM does not need to access the off-chip memory frequently (Section III-A).

Even though PEs share the same DRAM channel, they were designed to work independently of other PEs. A different design choice would be to share the ray parameters between nearby PEs by interconnecting the PEs in a grid structure. The advantage of this scheme is that the sinogram and the ray parameters can be reused and passed to the adjacent PEs. However, an implementation of this design showed that the overall performance was degraded by about 2x compared to the independent version. The reason is that although the DRAM bandwidth had been reduced, each PE had to wait for other PEs to pass the rays. This increased the idling cycles of the PEs. Thus, the PEs were made to work independently to increase the utilization ratio and obtain better overall performance.
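A behavioral C sketch of such a combined PE is given below. It illustrates the structure in Fig. 9 rather than the authors' HLS code: ray_t, in_tile(), voxel_index(), and trace_step() are assumed stand-ins for the Algorithm 1 logic, and the tile array models the BRAM-resident voxel tile.

#define FWD 0
#define BWD 1
#define TILE_VOX (64 * 64)                 /* 64 x 64 x 1 tile              */

typedef struct ray_state ray_t;            /* per-ray state produced by the
                                              ray geometry parameter PE     */
int   in_tile(const ray_t *r);             /* ray still inside this tile?   */
int   voxel_index(const ray_t *r);         /* tile-local voxel index        */
float trace_step(ray_t *r);                /* one Algorithm-1 update;
                                              returns the intersection len  */

/* One ray through one tile.  In forward mode the PE accumulates the partial
 * line integral; in backward mode it adds the back-projected value into the
 * voxel buffer.  With a dual-port BRAM (one read, one write per cycle) and
 * one voxel step per cycle, the read-modify-write causes no hazard. */
float trace_ray_pe(int mode, ray_t *ray, float tile[TILE_VOX], float bp_val)
{
    float acc = 0.0f;
    while (in_tile(ray)) {
        int   v   = voxel_index(ray);
        float len = trace_step(ray);
        if (mode == FWD)
            acc += len * tile[v];          /* read-only voxel access        */
        else
            tile[v] += len * bp_val;       /* read-modify-write update      */
    }
    return acc;                            /* meaningful in forward mode    */
}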
4) Ray Geometry Parameter PE: The ray geometry parameter PE reads the tile and ray range information from the DRAM (Section III-B). Next, in parallel with the high-latency parameter computation in Algorithm 2, the sinogram of the ray is prefetched from the DRAM. This information is fused with the ray geometry parameters described in Section II-C3 and passed to the ray tracing PE.

With help from the Xilinx Vivado high-level synthesis (HLS) tool, this module was redesigned into a pipelined version that has a throughput of one ray per cycle. Since the tool only supported an initiation interval (II) of 4 for certain floating-point operations, we converted the whole design into a fixed-point design. The word lengths of the variables are shown in Table V. Note that the fixed-point conversion also helps reduce the logic consumption.

TABLE V
WORD LENGTH OF VARIABLES

Another challenge was the large logic consumption of the sine and cosine operations. To solve this problem, we implemented a fixed-point, table-lookup-based cosine/sine design. Using the symmetry of the cosine/sine values, angle folding with a period of pi/2 was performed to reduce the table size.
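A C sketch of such a quarter-wave (angle-folded) lookup is shown below; the table depth, the Q1.15 output format, and all names are illustrative choices rather than the values used in the actual design, and half-LSB phase offsets are ignored.

#include <stdint.h>

#define QW_BITS 10
#define QW_SIZE (1 << QW_BITS)             /* entries for one quadrant      */

/* Quarter-wave table: sin(x) for x in [0, pi/2), Q1.15, generated offline. */
extern const int16_t sin_qw[QW_SIZE];

/* angle is a 12-bit index over a full turn: 2 quadrant bits + QW_BITS. */
int16_t sin_fx(uint16_t angle)
{
    uint16_t quad = (angle >> QW_BITS) & 0x3;
    uint16_t idx  =  angle & (QW_SIZE - 1);
    switch (quad) {
    case 0:  return  sin_qw[idx];                       /* sin(x)           */
    case 1:  return  sin_qw[QW_SIZE - 1 - idx];         /* sin(pi - x)      */
    case 2:  return (int16_t)-sin_qw[idx];              /* -sin(x)          */
    default: return (int16_t)-sin_qw[QW_SIZE - 1 - idx];/* -sin(pi - x)     */
    }
}

int16_t cos_fx(uint16_t angle)             /* cos(x) = sin(x + pi/2)        */
{
    return sin_fx((uint16_t)((angle + QW_SIZE) & (4 * QW_SIZE - 1)));
}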
5) Comparison & Update PE: The comparison computation (the division g_i / ĝ_i^(k)) and the update computation (Eq. (1)) have a similar form, and can be combined into the same PE to share the expensive division operator. Since both operators are completely parallel, all data are consecutively read from and written to the DRAM in a streaming fashion. The comparison computation processes 4.8 GB [= 15,022 x 672 x 16 x 8¹ x 4 B] of sinogram data, and the update computation processes 0.36 GB [= 512 x 512 x 372 x 4 B] of voxel data per iteration. Since the amount of transferred data is small compared to the ray tracing and ray geometry computation, we only placed 16 comparison/update PEs.

¹ Memory allocation overhead for synchronization (Section III-D).

D. Variable Throughput Optimization With Module Duplication

Since the ray parameters are reused during the ray traversal, there is a large throughput difference between the ray tracing PE and the ray geometry parameter PE. For the ray tracing PE, the throughput depends on the number of voxels traversed in a tile: for a 64 x 64 x 1 tile, the II can be as little as 1 cycle, or as large as the number of voxels the ray crosses in the tile. The ray geometry PE, on the other hand, can generate one ray's data per cycle. Such disproportionate and variable PE throughput presents a challenge for an optimized design. Note that the presented problem is different from most previous work on throughput and module duplication (e.g., [34], [35]), which assumes a fixed data consumption ratio between producer and consumer PEs.

To compensate for the difference, the ray tracing PEs are further duplicated. The duplication factor of the PEs can be determined by a resource-constrained optimization problem where the execution time is minimized subject to lookup table (LUT), flip-flop (FF), BRAM, multiplier (DSP), and BRAM/DRAM bandwidth constraints.

TABLE VI
LOGIC/MEMORY CONSTRAINT & CONSUMPTION PER PE

The resource constraints are shown in Table VI. We set the available resource as 40% of the total LUTs and FFs, and 70% of the total DSPs and BRAMs of the four Virtex-6 LX760 FPGAs. This is based on our experience that the Xilinx placement and routing tool will have difficulty if the resource consumption rate is too high. Moreover, the Convey DRAM controller occupies logic resources. The bandwidth of DRAM is set to 40 GB/s, which is the practical upper limit of DRAM bandwidth obtained in our test of the Convey memory controller system.

Next, the resource consumption of the PEs needs to be measured. The logic resource consumption is easily obtained from Vivado HLS, as shown in Table VI. The problem is the throughput measurement, which varies for each ray. To solve this problem, the analyzer software first collected statistics on the average throughput rate of a ray tracing PE, which was found to be 58.8 cycles per ray.

Next, we attached a large FIFO to each ray tracing PE. The effect is that of averaging the diverse throughput to the expected value.

The average ratio between the ray throughput of the ray geometry PE and that of the ray tracing PE is 1 : 58.8. To have some safety margin against the varying throughput, we decreased the ratio to 1 : 16. Based on Table VI, it becomes possible to place 17 ray geometry parameter PEs and 264 ray tracing PEs. For regularity of the design, we decided to place 16 ray geometry parameter PEs and 256 ray tracing PEs.
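A back-of-the-envelope reading of these numbers (an illustration, not the formal resource-constrained optimization):

    16 tracing PEs / 58.8 cycles per ray ≈ 0.27 rays/cycle demanded
                                         <  1 ray/cycle per geometry PE
    16 geometry PEs x 16 = 256 ray tracing PEs in total

i.e., one geometry PE feeding 16 tracing PEs retains roughly a 3.7x margin over the average demand, which absorbs the ray-to-ray throughput variation.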
As a result, even though a single ray geometry PE consumes 6 times more LUTs than a single ray tracing PE, it occupies 37% less LUT area after duplication. This is because the duplication factor of the ray geometry PE is small. It can be observed that the logic saved by sharing the high-throughput PEs increased performance because it allowed more duplication of the low-throughput PEs.

Our proposed optimization shows how the performance can be improved when the more frequently occurring computation has lower complexity. The ray tracing PE performs the frequent intersection length and voxel coordinate computation (Algorithm 1), which only requires addition. Thus, it can be heavily duplicated. The ray geometry parameter PE has more complex computation and thus consumes more logic, but less duplication is needed because only one such computation is performed per ray in a voxel tile. This confirms that, with the proposed optimization, it is possible to improve the performance of the ray-driven approach by exploiting the simple computation pattern of the ray traversal.

V. EXPERIMENTAL RESULTS

A. Performance

TABLE VII
BREAKDOWN OF EXECUTION TIME

1) Performance Breakdown: The performance breakdown is shown in Table VII. Each iteration of the reconstruction takes 8.9 seconds; thus it takes 7.4 minutes to complete the reconstruction with 50 iterations. We also present the rate of voxel updates as Giga updates per second (GUPS) for easier performance comparison. The performance was measured on an actual execution using the Convey HC-1ex platform.

Forward projection takes an amount of time similar to backward projection although it requires about 2x the DRAM bandwidth (Table IV). This suggests that the DRAM bandwidth is no longer the bottleneck.

The reason that the compare and update part has a small GUPS is that only a quarter of Convey's DRAM channels was utilized to reduce the number of PEs to 16. This has little negative impact on the overall performance because its portion is still relatively small.

Note that if TV is added to the system, it will take 1.0 seconds of computation per iteration, assuming that 16 threads on the Xeon E5-2420 are used. Also, an additional 0.8 seconds per iteration is needed to send 372 MB of image data back and forth between the FPGA and the CPU through the front-side bus.

TABLE VIII
EXECUTION TIME COMPARISON WITH VARIOUS ACCELERATOR PLATFORMS THAT EXPLOIT THE RAY PARALLELISM (UNIT: MINUTES, 50 ITERATIONS)

2) Effect of the Proposed Parallel Approach: To find the effect of the proposed parallel approach, we implemented the reconstruction algorithm using three different accelerators: multicore CPU, GPU, and FPGA. These versions exploit the ray parallelism but not the proposed voxel data reuse. All versions use the same CT parameters and the same input data as the proposed FPGA version. For the multicore CPU, we used the Xeon E5-2420, which is based on a newer process technology than the FPGA platform, with 16 threads. For the GPU, three Tesla M2070 GPUs were used to approximately match the number of Virtex-6 FPGAs used; a total of 9216 threads were used. For the FPGA, we used the same Convey platform but incorporated the optimization in [9] for comparison. Note that the CPU and FPGA versions keep enough distance between the threads to resolve the access conflict, as suggested by [22], but the GPU version uses atomic operations to avoid a race condition among thousands of threads.

Table VIII shows that the proposed algorithm on FPGA has 26.9x, 16.8x, and 4.6x speedup over the ray-parallel implementations on the multicore CPU, FPGA, and GPU, respectively. The performance gain is considerable. Even though the Convey platform has a smaller DRAM bandwidth and a slower clock frequency than the CPU and GPU platforms, it has higher performance because of the reduced external memory transfer with data reuse, task pipelining, and the customized circuitry.

One possible question could be whether other accelerators can also benefit from the proposed techniques.

TABLE IX
SCALABILITY TEST AND PROJECTION OF EXECUTION TIME ON CONVEY HC-1EX PLATFORM

Somewhat counter-intuitively, experiments showed that, compared to Table VIII, the proposed scheme degrades the performance on the multicore CPU and GPU. The difference can be found in the FPGA's ability to use task parallelism and make local data transfers between PEs. As mentioned in Section II-C3, the ray geometry parameters, which involve complex operations like cosine and division, have to be recomputed each time a ray traverses through different voxel tiles. This increases the computational burden on the CPU and GPU. On the other hand, the FPGA can efficiently compute these parameters in parallel with the ray tracing. This suggests that, combined with an efficient data transfer scheme and a high degree of task parallelism, the FPGA may outperform the GPU implementation despite having a disadvantage in clock frequency and DRAM bandwidth.

3) Scalability Test: Table IX shows the measured execution time with 1, 2, and 4 Virtex-6 LX760 FPGAs. It shows that the performance scales almost linearly with the resources used. This is because the throughput almost scales with the number of PEs used, provided that the system can satisfy the required DRAM bandwidth. Since the Convey HC-1ex has an effective bandwidth of about 40 GB/s, we estimate that it can support up to about 3,500 ray tracing PEs. The table also contains the estimated result assuming the Convey HC-1ex platform were updated with the latest Xilinx Virtex-7 2000T or UltraScale VU13P FPGAs [31], which contain 2.6x and 3.4x more LUTs than the LX760 FPGA, respectively. It shows that the two-minute goal can be achieved if the FPGAs are replaced with these latest products.

TABLE X
TEST RESULT USING VARIOUS CT PARAMETERS

4) Performance With Various CT Parameters: The design flow was validated using various CT scan parameters, as shown in Table X. In addition to the M.125000 image that was mainly used for the experiments, we tested another patient image, labeled R0960B, with 25% (5,865) to 100% (23,464) of the projections. Note that the offline analyzer has to be executed once for each setting since the memory access pattern changes. The GUPS figure shows that the performance is maintained in the other settings as well.

B. Resource and Power Consumption

TABLE XI
RESOURCE CONSUMPTION (4 FPGAS)

The resource consumption is shown in Table XI. Placing multiple copies of the PEs led to about a 60% utilization ratio of the LUTs. This complicated placement and routing, and as a result, the clock frequency had to be reduced to 100 MHz.

Since the LUT utilization is very high, it would be difficult to place more PEs into the design. Thus, it can be inferred that the next step for achieving further improvement would be to find a suitable way of simplifying the design of the ray tracing PE. Since the BRAM utilization is also high, such a modification would lead to a smaller tile size per PE, but it would allow us to increase the BRAM bandwidth and use more BRAMs independently.

The power consumption of the Convey system was measured with a power meter. The system static power consumption had a somewhat high value of 577 W, since the Convey server is not very power efficient. In the FPGA execution state, however, the system consumed just 16 W of extra dynamic power. There was little power variation between the different modes of execution (i.e., forward projection or backward projection).

C. Accuracy

Some slices (512 x 512 x 1) of the final reconstructed image are shown in Fig. 10.

The original CPU software version uses single-precision floating-point variables, whereas the FPGA version uses various word lengths for the internal variables. The conversion from floating-point to fixed-point representation results in some quantization error. Following a similar approach to that used in [36] to obtain the quantitative difference, we measured the signal-to-noise ratio (SNR), root-mean-squared error (RMSE), universal quality index (UQI), and correlation coefficient (CC). Since the ground truth is not known, the 50th-iteration output of the floating-point software version is assumed to be the ground truth. After 50 iterations, the SNR, RMSE, UQI, and CC of the FPGA output image for the M.125000 dataset suggest that the difference is small enough for most purposes.

TABLE XII
COMPARISON WITH OTHER ACCELERATION WORK ON 3D CT RECONSTRUCTION

D. Discussion of Other Related Work and Comparison


Previous work on accelerating CT reconstruction has pro-
posed efcient parallelization techniques. Scherl et al. use a
master-slave approach for a deeper pipeline and present com-
parisons among multi-core CPU, Cell, GPU, and FPGA [12].
Okitsu et al. explain how to efciently divide the workload into
multiple GPUs [13]. Xu and Mueller show how to exploit var-
ious specialized graphic hardware for CT reconstruction [14].
Chou et al. propose a divergence and bank conict-free algo-
rithm for GPU with data sampling and reordering [15]. Xiao
et al. present techniques to avoid thread divergence in a GPU
by converting conditional operations into arithmetic operations
[16].
Also, some work has proposed data transfer reduction tech-
niques. Chidlow and Mller proposed a bit-splitting technique
over different color channels to maintain the quality of the result
with reduced data bits [8]. Baer and Kachelrie discovered that
Fig. 10. Reconstructed lung image from M.125000 testset. (a) Slice 100.
(b) Slice 200. (c) Slice 300; R0960B testset (25% projection). (d) Slice 50. using a half-precision oating point can reduce the data transfer
(e) Slice 100. (f) Slice 150. on the Xeon Phi architecture [18]. Gac et al. described a loop
reordering method to improve the temporal and spatial access
locality [19]. Zhou et al. report that batching rays similar in di-
rection helps improve the memory efciency [17]. Yan et al.
proposed a cyclic data copy method to reduce the data transfer
from renderer to texture memory [20].
Most of the acceleration work described above is based on
FBP. The acceleration work on iterative algorithms can be found
in [8], [9], [22], [33], [37][39]. The latest work is [9], [37],
which uses the same ray tracing algorithm and the same plat-
form as this paper. In their work, they propose a hybrid solu-
tion of GPU and FPGA for energy efciency. They also present
a data prefetching scheme to decrease the read latency and a
sparse access technique to reduce the data transfer. Another
Fig. 11. Reconstructed image difference in SNR between oating-point CPU, recent work is [33], which is based on a separable footprint
GPU and xed-point FPGA, compared to the image output from the 50th
iteration of the oating-point CPU version. (SF) algorithm. They present a method that reuses rays in the
voxel-driven approach and a method that reduces the represen-
tation size of the ray traversal path. Also, a high throughput is
Table XII shows the performance comparison with other FPGA and GPU implementations. Direct comparison is difficult since other work uses different platforms, images, and algorithms. Thus, the voxel update rate (GUPS) is used for approximate comparison.
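Since GUPS is the normalizing metric of Table XII, a brief note on how it is computed may help: it is simply the number of voxel updates divided by the reconstruction time, scaled by 10^9. The sketch below is our own illustration; the convention for what counts as one update (e.g., whether the forward and backward projections are counted separately) is an assumption, and the numbers in main() are purely hypothetical.

```cpp
// Hedged sketch of the voxel update rate (GUPS) used for rough comparison.
// The update-count convention is an assumption, not taken from the paper.
#include <cstdint>
#include <iostream>

double gups(std::uint64_t voxel_updates, double seconds) {
    return static_cast<double>(voxel_updates) / seconds / 1e9;
}

int main() {
    // Hypothetical example: a 512 x 512 x 512 volume updated once per
    // iteration for 50 iterations, reconstructed in 120 seconds.
    const std::uint64_t updates = 512ULL * 512ULL * 512ULL * 50ULL;
    std::cout << gups(updates, 120.0) << " GUPS\n";
    return 0;
}
```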
The table shows that our work has superior performance over previous work [22], [23], [33], [37] based on iterative algorithms.

Compared to [37], which uses the same EM algorithm, our proposed implementation has higher parallelism (256 versus 64) and a slower clock frequency (100 MHz versus 150 MHz). The main differences are the new parallel scheme, the data reuse approach, and the throughput-optimized architecture, which together account for about a 16× difference in performance.
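As a back-of-the-envelope check (our own arithmetic, not a figure reported in the paper), raw parallelism and clock frequency alone explain only a small part of the gap:

\[
\frac{256}{64} \times \frac{100\ \mathrm{MHz}}{150\ \mathrm{MHz}} \approx 2.7\times .
\]

The remainder of the reported difference is therefore attributable to the parallel scheme, the data reuse approach, and the throughput-optimized PE architecture rather than to raw resource count or clock rate.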
Note that our work even has better GUPS performance than most previous FBP-based work. FBP uses point voxels, which allows the weight computation to be simplified to a multiplication by cosine/sine values and an interpolation. The ray tracing algorithm used in this paper is more compute-intensive and harder to parallelize, since it serially computes the intersections of each ray with the volumetric voxel grid.
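To make this contrast concrete, the sketch below is our own illustration of a Siddon-style traversal in the spirit of [27], [28]; it is not the PE code used in this work. A single ray is forward-projected by marching from one voxel-boundary crossing to the next, and each step depends on the parametric position produced by the previous one.

```cpp
// Hedged illustration (not the PE implementation from this paper): forward
// projection of one ray through a volumetric voxel grid, marching from one
// voxel-boundary crossing to the next in the spirit of Siddon-style ray
// tracing [27], [28]. Voxels are unit cubes; 'dir' is assumed normalized.
#include <algorithm>
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

double forward_project_ray(const std::vector<float>& vol, int nx, int ny, int nz,
                           Vec3 src, Vec3 dir, double tMin, double tMax) {
    const double eps = 1e-12;
    double line_integral = 0.0;
    double tCurr = tMin;
    while (tCurr < tMax) {
        // Voxel index at a point just past the current boundary crossing.
        const double tMid = tCurr + eps;
        const int ix = static_cast<int>(std::floor(src.x + tMid * dir.x));
        const int iy = static_cast<int>(std::floor(src.y + tMid * dir.y));
        const int iz = static_cast<int>(std::floor(src.z + tMid * dir.z));
        if (ix < 0 || iy < 0 || iz < 0 || ix >= nx || iy >= ny || iz >= nz) break;

        // Parametric distance to the next voxel boundary along each axis.
        auto step = [&](double p, double d, int i) {
            if (std::fabs(d) < eps) return tMax;              // ray parallel to this axis
            const double boundary = (d > 0.0) ? i + 1.0 : static_cast<double>(i);
            return (boundary - p) / d;
        };
        const double dt = std::min({ step(src.x + tCurr * dir.x, dir.x, ix),
                                     step(src.y + tCurr * dir.y, dir.y, iy),
                                     step(src.z + tCurr * dir.z, dir.z, iz) });
        const double tNext = std::min(tCurr + dt, tMax);
        if (tNext <= tCurr) break;                            // numerical guard

        // Weight = chord length of the ray inside the current voxel.
        line_integral += (tNext - tCurr) * vol[(iz * ny + iy) * nx + ix];
        tCurr = tNext;  // loop-carried dependence: the next step needs this value
    }
    return line_integral;
}
```

The loop-carried dependence on tCurr is the serial chain referred to above; FBP's point-voxel weighting has no such dependence, which is one reason it maps more easily onto wide SIMD or GPU hardware.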
VI. CONCLUSIONS
This paper describes the acceleration of the EM algorithm on the Convey FPGA platform. We have shown that parallelizing ray-driven voxel tiles can solve the access conflict problem and increase the effective memory bandwidth. Also, reusing the sinogram and voxel data can largely reduce external memory access. Optimization based on offline analysis can be performed to exploit the fact that the memory access pattern is independent of the input patient data. Moreover, a customized PE architecture and optimization strategies were presented to increase the system throughput. The experimental results suggest that the combination of an efficient local data transfer scheme and task pipelining can lead to a large speedup in FPGA-based designs. The proposed system can outperform other parallel platforms and achieve a 26.9× speedup over the 16-thread multicore CPU implementation.
ACKNOWLEDGMENT
The authors would like to thank Dr. W. Hsu (UCLA Radiology) and Dr. M. Yan (UCLA Mathematics) for the C code, Dr. Y. Zou (Arista Networks) for the FPGA code [9], and D. Wu (UCLA Computer Science) for the GPU code of the CT reconstruction. They would also like to thank J. Wheeler (UCLA Computer Science) for proofreading this manuscript. They are grateful to Dr. A. Bui (UCLA Radiology) and Dr. W. Hsu for sharing the chest CT dataset.
REFERENCES
[1] B. Li, E. Weber, and S. Crozier, "A MRI rotary phased array head coil," IEEE Trans. Biomed. Circuits Syst., vol. 7, no. 4, pp. 548–556, 2013.
[2] J. Um et al., "A single-chip 32-channel analog beamformer with 4-ns delay resolution and 768-ns maximum delay range for ultrasound medical imaging with a linear array transducer," IEEE Trans. Biomed. Circuits Syst., vol. 9, no. 1, pp. 138–151, 2015.
[3] L. Feldkamp, L. Davis, and J. Kress, "Practical cone beam algorithm," J. Opt. Soc. Amer. A, vol. 1, no. 6, pp. 612–619, 1984.
[4] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Stat. Soc., Series B, vol. 39, no. 1, pp. 1–38, 1977.
[5] L. Shepp and Y. Vardi, "Maximum likelihood reconstruction for emission tomography," IEEE Trans. Med. Imag., vol. 1, no. 2, pp. 113–122, 1982.
[6] R. Nelson, S. Feuerlein, and D. Boll, "New iterative reconstruction techniques for cardiovascular computed tomography: How do they work, and what are the advantages and disadvantages?," J. Cardiovascular Computed Tomogr., vol. 5, no. 5, pp. 286–292, 2011.
[7] M. Beister, D. Kolditz, and W. Kalender, "Iterative reconstruction methods in X-ray CT," Phys. Med., vol. 28, no. 2, pp. 94–108, 2012.
[8] K. Chidlow and T. Möller, "Rapid emission tomography reconstruction," in Proc. Eurographics/IEEE TVCG Workshop Volume Graph., 2003, pp. 15–26.
[9] J. Chen et al., "A hybrid architecture for compressive sensing 3-D CT reconstruction," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 2, no. 3, pp. 616–625, 2012.
[10] B. Meng, G. Pratx, and L. Xing, "Ultrafast and scalable cone-beam CT reconstruction using MapReduce in a cloud computing environment," Med. Phys., vol. 38, no. 12, pp. 6603–6609, 2011.
[11] J. Rosen, J. Wu, J. Fessler, and T. Wenisch, "Iterative helical CT reconstruction in the cloud for ten dollars in five minutes," in Proc. Int. Meeting Fully 3D Image Recon. in Rad. and Nucl. Med., 2013, pp. 241–244.
[12] H. Scherl et al., "Evaluation of state-of-the-art hardware architectures for fast cone-beam CT reconstruction," Parallel Comput., vol. 38, pp. 111–124, 2011.
[13] Y. Okitsu, F. Ino, and K. Hagihara, "High-performance cone beam reconstruction using CUDA compatible GPUs," Parallel Comput., vol. 36, no. 2, pp. 129–141, 2010.
[14] F. Xu and K. Mueller, "Real-time 3D computed tomographic reconstruction using commodity graphics hardware," Phys. Med. Biol., vol. 52, pp. 3405–3417, 2007.
[15] C. Chou, Y. Chuo, Y. Hung, and W. Wang, "A fast forward projection using multithreads for multirays on GPUs in medical image reconstruction," Med. Phys., vol. 38, no. 7, pp. 4052–4065, 2011.
[16] K. Xiao, D. Chen, X. Hu, and B. Zhou, "Efficient implementation of the 3D-DDA ray traversal algorithm on GPU and its application in radiation dose calculation," Med. Phys., vol. 39, no. 12, pp. 7619–7625, 2012.
[17] B. Zhou, X. Hu, and D. Zhen, "Memory-efficient volume ray tracing on GPU for radiotherapy," in Proc. IEEE Symp. Applicat. Specific Processors, 2011, pp. 46–51.
[18] M. Baer and M. Kachelrieß, "High performance parallel beam and perspective cone-beam backprojection for CT image reconstruction on pre-production Intel Xeon Phi," in Proc. Int. Meeting Fully 3D Image Recon. in Rad. and Nucl. Med., 2013, pp. 233–236.
[19] N. Gac, S. Mancini, M. Desvignes, and D. Houzet, "High speed 3-D tomography on CPU, GPU, and FPGA," EURASIP J. Embed. Syst., vol. 2008, pp. 1–12, 2008.
[20] G. Yan, J. Tian, S. Zhu, Y. Dai, and C. Qin, "Fast cone-beam CT image reconstruction using GPU hardware," J. X-Ray Sci. Technol., vol. 16, pp. 225–234, 2008.
[21] L. Rudin, S. Osher, and E. Fatemi, "Nonlinear total variation based noise removal algorithms," Physica D, vol. 60, pp. 259–268, 1992.
[22] J. Chen et al., "EM+TV for reconstruction of cone-beam CT with curved detectors using GPU," in Proc. Int. Meeting Fully 3D Image Recon. in Rad. and Nucl. Med., 2011, pp. 363–366.
[23] Y. Choi, J. Cong, and D. Wu, "FPGA implementation of EM algorithm for 3-D CT reconstruction," in Proc. 22nd IEEE Int. Symp. Field-Programmable Custom Comput. Mach., 2014, pp. 157–160.
[24] N. Duggan et al., "A technique for lung nodule candidate detection in CT using global minimization methods," Energy Minim. Methods Comput. Vis. Pattern Recognit., pp. 478–491, 2015.
[25] S. Shen, A. Bui, J. Cong, and W. Hsu, "An automated lung segmentation approach using bidirectional chain codes to improve nodule detection accuracy," Comput. Biol. Med., vol. 57, pp. 139–149, 2015.
[26] M. Yan and L. Vese, "Expectation maximization and total variation-based model for computed tomography reconstruction from undersampled data," Proc. SPIE Med. Imag., pp. 79612X-1–8, 2011.
[27] H. Zhao and A. J. Reader, "Fast ray-tracing technique to calculate line integral paths in voxel arrays," in Proc. IEEE Nucl. Sci. Symp. Conf. Rec., 2003, vol. 4, pp. 2808–2812.
[28] R. Siddon, "Fast calculation of the exact radiological path for a three-dimensional CT array," Med. Phys., vol. 12, pp. 252–255, 1985.
[29] B. D. Man and S. Basu, "Distance-driven projection and backprojection in three dimensions," Phys. Med. Biol., vol. 49, no. 11, pp. 2463–2475, 2004.
[30] A. Ziegler, T. Köhler, T. Nielsen, and R. Proksa, "Efficient projection and backprojection scheme for spherically symmetric basis functions in divergent beam geometry," Med. Phys., vol. 32, no. 12, pp. 4653–4663, 2006.
[31] All Programmable FPGAs and 3D ICs, Xilinx, 2015. [Online]. Available: http://www.xilinx.com/products/silicon-devices/fpga.html
[32] J. Bakos, "High-performance heterogeneous computing with the Convey HC-1," IEEE Comput. Sci. Eng., vol. 12, no. 6, pp. 80–87, 2010.
[33] J. Kim, J. Fessler, and Z. Zhang, "Forward-projection architecture for fast iterative image reconstruction in X-ray CT," IEEE Trans. Signal Process., vol. 60, no. 10, pp. 5508–5518, 2012.
[34] J. Cong et al., "Combining module selection and replication for throughput-driven streaming programs," in Proc. Conf. Design, Autom. Test in Europe, 2012, pp. 1018–1023.
[35] J. Cong, M. Huang, and P. Zhang, "Combining computation and communication optimizations in system synthesis for streaming applications," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2014, pp. 213–222.
[36] X. Li and S. Luo, "A compressed sensing-based iterative algorithm for CT reconstruction and its possible application to phase contrast imaging," Biomed. Eng. Online, vol. 10, 2011.
[37] J. Chen, J. Cong, M. Yan, and Y. Zou, "FPGA-accelerated 3D reconstruction using compressive sensing," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2012, pp. 163–165.
[38] B. Keck et al., "GPU-accelerated SART reconstruction using the CUDA programming environment," Proc. SPIE Med. Imag., pp. 72582B-1–9, 2009.
[39] F. Xu and K. Mueller, "Accelerating popular tomographic reconstruction algorithms on commodity PC graphics hardware," IEEE Trans. Nucl. Sci., vol. 52, no. 3, pp. 654–663, Jun. 2005.

Young-kyu Choi (S'14) received the B.S. and M.S. degrees in electrical engineering from Seoul National University, Seoul, Korea. Currently, he is working toward the Ph.D. degree in computer science under the guidance of Prof. Jason Cong at the University of California, Los Angeles, Los Angeles, CA, USA. He has worked at LG Electronics (2008–2011) and Inha University, Incheon, Korea (2011–2012). His research interests include the characterization and implementation of FPGA-friendly applications.

Jason Cong (F'00) received the B.S. degree from Peking University, Beijing, China, in 1985, and the M.S. and Ph.D. degrees from the University of Illinois at Urbana-Champaign, Champaign, IL, USA, in 1987 and 1990, respectively, all in computer science.

At the University of California, Los Angeles (UCLA), Los Angeles, CA, USA, he is a Chancellor's Professor in the Computer Science Department and holds a joint appointment in the Electrical Engineering Department. He is the Director of the Center for Domain-Specific Computing (CDSC), Co-Director of the UCLA/Peking University Joint Research Institute in Science and Engineering, and the Director of the VLSI Architecture, Synthesis, and Technology (VAST) Laboratory. Additionally, he served as Chair of the UCLA Computer Science Department (2005–2008). His research interests include energy-efficient computing, customized computing for big-data applications, synthesis of VLSI circuits and systems, and highly scalable algorithms. He has authored more than 400 publications in these areas.

Dr. Cong has received 10 best paper awards, two 10-Year Most Influential Paper Awards (from ICCAD'14 and ASPDAC'15), and the 2011 ACM/IEEE A. Richard Newton Technical Impact Award in Electronic Design Automation. He was elected an ACM Fellow in 2008. He was the recipient of the 2010 IEEE Circuits and Systems (CAS) Society Technical Achievement Award "For seminal contributions to electronic design automation, especially in FPGA synthesis, VLSI interconnect optimization, and physical design automation."
