Abstract—Reducing radiation doses is one of the key concerns in computed tomography (CT) based 3D reconstruction. Although iterative methods such as the expectation maximization (EM) algorithm can be used to address this issue, applying this algorithm in practice is difficult due to the long execution time. Our goal is to decrease this long execution time to an order of a few minutes, so that low-dose 3D reconstruction can be performed even in time-critical events. In this paper we introduce a novel parallel scheme that takes advantage of the numerous block RAMs on field-programmable gate arrays (FPGAs). Also, an external memory bandwidth reduction strategy is presented to reuse both the sinogram and the voxel intensity. Moreover, a customized processing engine based on the FPGA is presented to increase overall throughput while reducing the logic consumption. Finally, a hardware and software flow is proposed to quickly construct a design for various CT machines. The complete reconstruction system is implemented on an FPGA-based server-class node. Experiments on actual patient data show that a 26.9× speedup can be achieved over a 16-thread multicore CPU implementation.

Index Terms—Accelerator architectures, computed tomography, field-programmable gate arrays, parallel architectures, ray tracing.

I. INTRODUCTION

EM has a better noise reduction property and offers better spatial resolution by using a more natural geometric model than FBP [6]. However, these advantages are achieved at the cost of much extended computation time [7]. This problem occurs because of the iterative nature of EM. For example, the literature reports 14 minutes and 1.52 hours of CPU computation while processing 64 and 500 projections of 128 × 128 and 736 × 64 pixels, respectively [8], [9]. This becomes more problematic with actual patient images, which require larger input data for accurate measurement. Our testset, which has 15,022 projections of 672 × 16 pixels, requires 3.3 hours of computation with 16 threads on an Intel Xeon E5-2420 CPU. However, our requirement is to do reconstruction in under two minutes, as part of the low-dose adaptive reconstruction flow (Section II-B), so that doctors can give immediate on-site feedback to the patient. The performance gap between the multicore CPU result and our goal is approximately 100 times.

In order to reduce the execution time of CT reconstruction, some systems perform computation on the cloud [10], [11]. However, this may raise concerns about private health information being transmitted through the web. Also, the system reliability may be compromised due to sporadic network
Fig. 1. CT scan trajectory. The parameters are from the M.125000 dataset. Modified from [23].
One potential for a large increase in the system bandwidth can be found in the FPGA architecture. Although the FPGA platform has a slower clock frequency (100 MHz versus 500–1000 MHz) and less DDR bandwidth than the GPU, it can customize the computation and the data communication. For example, it contains many block RAMs (BRAMs) that can be used to perform a large amount of local data transfer. However, BRAM usage alone is not very effective unless DRAM can supply the data to BRAM without stalling.

Another important challenge is to achieve a high degree of parallelization while efficiently resolving access conflicts to local memory. Problems such as race conditions and bank conflicts occur when several rays try to update a voxel at the same time. Unlike most other FBP-based acceleration work that achieves high parallelism with point-like voxels, our work incorporates volumetric voxels for a realistic geometric model. However, this complicates the design with serialization and a non-stencil access pattern. Chen et al. proposed parallelizing rays that are sufficiently far apart [22], but the large distance limits the amount of parallelism. A new approach is needed that can avoid the access conflict while still achieving large parallelism with minimal overhead.

In this paper we present a complete and working CT reconstruction system implemented on a server-class node with FPGA coprocessors. It incorporates several FPGA-friendly techniques for acceleration. The contributions of this paper include:

• Ray-driven voxel-tile parallel approach: This approach exploits the computational simplicity of the ray-driven approach, while taking advantage of both ray and voxel data reuse. Both the race condition and the bank conflict problems are completely removed. Moreover, it can easily increase the degree of parallelism with adjustment of the tile shape. A strategy for the tile shape decision is also presented.
• Variable throughput matching optimization: We provide strategies to increase the performance for designs that have a variable and disproportionate throughput rate between processing elements (PEs). In particular, the logic consumption is reduced by exploiting the low complexity of frequently computed parts of the ray-driven approach. The logic usage is further reduced by using small-granularity PEs and module reuse. This leads to overall throughput improvement with module duplication.
• Offline memory analysis for irregular access patterns in the ray-driven approach: To efficiently tile the voxels for irregular memory access, an offline memory analysis technique is proposed. It exploits the input data independence property of the CT machine. Also, a compact storage format is presented.
• Customized PE architecture: We present a design that achieves high throughput and logic reuse for each PE.
• Design flow for rapid hardware-software design: A flow is presented to generate hardware and software for various CT parameters.

This work is an extension of our previous conference publication [23]. The previous work mainly focused on exploiting data reuse to reduce off-chip access. In this work we added the parallelization scheme, the offline memory analysis technique, the variable throughput optimization, and the automated design flow. Due to the new optimizations, this work is 5.2 times faster than [23] on the same dataset.

The remainder of this paper is organized as follows: Section II provides the background and analysis of the algorithm and the platform. It also provides a brief analysis of the problem and highlights the challenges in constructing an efficient design. Section III explains the ray-driven, voxel-tile parallel approach and the offline memory analysis technique. The design flow, details of the customized PE design, and throughput optimization strategies are provided in Section IV. The experimental results can be found in Section V.

II. BACKGROUND AND PROBLEM ANALYSIS

A. CT Scanner Geometry and Dataset Description

In a CT scan, as shown in Fig. 1, radiation from the source penetrates the patient's body and gets measured by an array of detectors. The source and the detector array proceed in a helical fashion, where the pair move in a circular direction in the x-y plane and in a lateral direction along the z-axis.

In our clinical setting, the CT scan was acquired using a helical cone beam protocol on a Siemens Somatom Sensation 16 scanner. The focus-detector distance (FDD) and the focus-isocenter distance (FID) (both shown in Fig. 1) are 1,040 mm and 570 mm, respectively. The circular rotation rate of the source and detector array is 1,160 samples per rotation. The lateral movement per sample of the source and detector
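To make the scan geometry concrete, the following minimal C sketch (our illustration, not code from the paper) computes the source position for a given projection sample from the parameters above. Since the value of the per-sample lateral advance is cut off in the text, it is treated here as an assumed placeholder constant DZ.

#include <math.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define FID   570.0   /* focus-isocenter distance in mm (from the text)   */
#define N_ROT 1160    /* samples per full rotation (from the text)        */
#define DZ    0.1     /* lateral advance per sample, mm -- placeholder;   */
                      /* the actual value is truncated in the text above  */

/* Source position for projection sample i on the helical trajectory:
 * circular motion in the x-y plane, linear motion along the z-axis. */
static void source_position(long i, double pos[3])
{
    double theta = 2.0 * M_PI * (double)i / N_ROT;  /* rotation angle */
    pos[0] = FID * cos(theta);   /* x */
    pos[1] = FID * sin(theta);   /* y */
    pos[2] = (double)i * DZ;     /* z advances linearly with each sample */
}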
D. Synchronization

Although voxel access conflict was removed by adopting voxel-tile parallelism, sinogram access conflict has been newly introduced in the forward projection. To solve this problem, we decided to use the barrier synchronization method. Each PE will update its sinogram value in a memory location that is different from those of other PEs. This does not mean that the memory allocation increases by the number of available PEs, because a ray traverses through a limited number of voxel tiles. After performing an offline analysis that finds all voxel tiles that rays pass into, adjacent voxel tiles are assigned to the same PE to reduce the number of PEs that might access the same sinogram. Experiments show that the DRAM memory allocation size for the sinogram increases by eight times for our dataset. This does not have a severe negative impact on the performance because the amount of data transfer remains the same for the ray tracing PE.
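As an illustration of this scheme (a sketch under our own naming, not the authors' code), each PE accumulates its forward-projection results into a private sinogram region, and the partial copies are merged after a barrier:

#include <string.h>

#define N_COPY 8      /* private sinogram copies; the text reports an     */
                      /* eight-fold growth in sinogram allocation         */
#define N_SINO 4096   /* sinogram entries per projection -- placeholder   */

static float sino_priv[N_COPY][N_SINO]; /* one private region per PE group */
static float sino[N_SINO];              /* merged sinogram                 */

/* Forward projection: PE group 'g' adds its ray sum without touching
 * the regions of other groups, so no two PEs write the same address. */
static void pe_accumulate(int g, int det, float ray_sum)
{
    sino_priv[g][det] += ray_sum;
}

/* After a barrier, the private copies are reduced into one sinogram. */
static void merge_sinogram(void)
{
    memset(sino, 0, sizeof(sino));
    for (int g = 0; g < N_COPY; g++)
        for (int d = 0; d < N_SINO; d++)
            sino[d] += sino_priv[g][d];
}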
E. Comparison

The DRAM data transaction comparison between a design that only reuses ray data (e.g., [9]) and the proposed design is presented in Table IV. Reusing only the sinogram requires a large off-chip memory bandwidth to transfer the voxel intensity. In the proposed scheme, the voxels in a tile are transferred only once, because they are reused among several rays. At the same time, the ray data is still being reused, because ray tracing is performed within a tile. Although the sinogram data access has increased 8.6 times for a 64 × 64 × 1 tile size, this has little effect on the overall data transfer, since its portion was small compared to the voxel transfer. Considering the combined value of sinogram and intensity, there is a 55.3× reduction in external memory access.

Table IV also shows that there is little difference in the combined amount of BRAM and DRAM access. However, the performance drastically improves, because the FPGA has a much higher BRAM bandwidth than DRAM bandwidth.

Fig. 7. Software and hardware design flow.

IV. DESIGN FLOW AND ARCHITECTURE

A. Overall Design Flow

Our design is based on the assumption that even though the patient changes, the CT scan parameters will not vary much. This allows engineers to optimize the design offline (Section III-B). A possible usage scenario is that the CT machine will provide some limited number of FPGA-acceleration settings that are most frequently used. Radiologists can choose one of these settings and quickly obtain the reconstructed result.

The overall design flow is shown in Fig. 7. The software analyzer records the memory access pattern into a file and provides guidelines on the hardware architecture. The hardware design, written in C, is then fed into the Vivado HLS 2013.1 tool to generate an RTL design. The RTL files are then combined with the host and DRAM interface RTL. Next, the bitstream is created using the Convey development tool chain, which is based on the Xilinx ISE tool. The host software performs the necessary pre- and post-processing, such as reading the ray list information or reading and writing the image file.

B. Software Analyzer

The software analyzer is responsible for generating two types of information: the list of rays per voxel tile and the hardware design parameters. The range of rays is easily obtained by measuring the minimum and the maximum of the radiation detectors per projection for each voxel tile (Section III-B). Hardware design parameters include the tile size and shape and the module duplication information. The reason this has to be extracted is that the memory access pattern will change as the CT parameters change. These parameters are decided based on the measured throughput of the ray tracing PE, the number of rays passing into a voxel tile, and the number of DRAM accesses (e.g., Table III), as will be explained in more detail in the next subsection. Note that the analyzer has been implemented in software because it is easier to make modifications, and also because its execution time is not very constrained.
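A minimal sketch of the ray-range computation described above (our illustration; the geometry test ray_hits_tile() is a hypothetical name): for each voxel tile and projection, the analyzer records the minimum and maximum detector indices of the rays that touch the tile, which yields the compact per-tile ray list.

#include <limits.h>
#include <stdio.h>

#define N_PROJ 15022       /* projections in our dataset (Section I)      */
#define N_DET  (672 * 16)  /* detector elements, flattened -- simplified  */

/* Hypothetical geometry predicate: does the ray of projection p hitting
 * detector d pass through voxel tile t? It depends only on the CT scan
 * parameters, not on the patient data, so it can be evaluated offline. */
extern int ray_hits_tile(int p, int d, int t);

/* Per-tile, per-projection detector range [lo, hi]; rays outside this
 * range never touch the tile and are skipped at run time. */
void analyze_tile(int t, FILE *out)
{
    for (int p = 0; p < N_PROJ; p++) {
        int lo = INT_MAX, hi = INT_MIN;
        for (int d = 0; d < N_DET; d++) {
            if (ray_hits_tile(p, d, t)) {
                if (d < lo) lo = d;
                if (d > hi) hi = d;
            }
        }
        if (lo <= hi)  /* store only projections whose rays hit the tile */
            fprintf(out, "%d %d %d %d\n", t, p, lo, hi);
    }
}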
Fig. 8. Mapping between the algorithm and the architecture for the proposed system.

C. Accelerator Architecture

1) Overall Architecture: The overall architecture and the mapping from the algorithm are shown in Fig. 8. It shows three types of customized PEs. Ray tracing of the forward and backward projections is mapped to the ray tracing PE. The ray geometry PE generates the parameters needed for ray tracing. The sinogram comparison and the voxel update parts are mapped to the comparison/update PE.

General-purpose modules have been extracted from the customized PEs. In Fig. 8, C represents the loop iterator generator for both the tiles and the rays. D represents the DRAM access unit. If the numbers of PEs do not match, a network logic, denoted N, is used to either gather or distribute the data. It employs round-robin scheduling to allocate similar transfer rates to the receiving PEs.
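The following sketch (our own, simplified) shows the round-robin policy of the network logic N: a distributor hands successive data items to the receiving PEs in cyclic order, so each receiver sees roughly the same transfer rate.

#define N_RECV 16   /* receiving PEs attached to one network node N;
                       the count here is illustrative */

/* Hypothetical blocking push into the FIFO of one receiving PE. */
extern void fifo_push(int pe, float data);

/* Round-robin distribution: item k goes to PE (k mod N_RECV), which
 * spreads the stream evenly across the receivers. */
void distribute(const float *stream, long n)
{
    int next = 0;
    for (long k = 0; k < n; k++) {
        fifo_push(next, stream[k]);
        next = (next + 1) % N_RECV;
    }
}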
2) Accelerator Design Principles: We employed several principles for high-performance accelerator design. The first principle is to divide PEs as finely as possible for aggressive pipelining. For example, even small modules such as the data distribution logic, the loop iterator generator, and the DRAM access logic were detached from the ray tracing and ray geometry computation PEs. This allows us to take advantage of the different throughput rates of the PEs, because logic area can be saved by having multiple low-throughput PEs share access to high-throughput PEs. This leads to throughput optimization using module duplication, as will be explained in Section IV-D.

The second principle is module reuse. If multiple functions can be mapped to the same PE, it can save area and enable more module duplication. For example, the ray tracings from both the forward and backward projections are mapped to the same PE.

The third principle is to use FIFO logic for inter-module communication. Since many modules are designed as high-throughput streaming modules, FIFOs simplify the data transfer process.

Finally, we concentrated on pipelining different tasks. The overall execution time can be reduced not only by enumerating one type of PE, but also by overlapping the computation of different types of PEs. The advantage is that it allows local communication between directly connected PEs and removes the data storage time to memory. Note that such task parallelism is hard to use efficiently on GPUs and multicore CPUs, and gives an inherent advantage to FPGA-based approaches.

3) Ray Tracing PE: The ray tracing PE is responsible for accumulating (forward projection) or updating (backward projection) the intensity of voxels, as shown in Fig. 3. Due to the similarity in their computation patterns, they can be combined into one PE, as shown in Fig. 9. An operation control bit is used to select either the forward or the backward projection mode.
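Since the hardware design is written in C for Vivado HLS, the mode selection might look like the following sketch (ours, not the paper's source); the body shares one accumulator datapath, and the control bit picks the projection direction. The exact EM update factors are simplified here.

/* HLS-style C sketch of the shared ray tracing PE step.
 * mode = 0: forward projection, accumulate voxel intensity into the
 *           running ray sum;
 * mode = 1: backward projection, fold the per-ray correction factor
 *           back into the voxel.
 * 'weight' is the intersection length of the ray with the voxel. */
static void trace_step(int mode, float weight, float correction,
                       float *voxel, float *ray_sum)
{
#pragma HLS INLINE
    if (mode == 0)
        *ray_sum += weight * (*voxel);  /* forward: voxels -> ray sum   */
    else
        *voxel += weight * correction;  /* backward: ray data -> voxels */
}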
TABLE V
WORD LENGTH OF VARIABLES

TABLE VI
LOGIC/MEMORY CONSTRAINT & CONSUMPTION PER PE

tracing PE. The effect is that of averaging the diverse throughput to the expected value.

To have some safety margin against the varying throughput, we decreased the measured average ratio between the ray throughput of the ray geometry PE and that of the ray tracing PE to 1 : 16. Based on Table VI, it then becomes possible to place 17 ray geometry parameter PEs and 264 ray tracing PEs. For regularity of the design, we decided to place 16 ray geometry parameter PEs and 256 ray tracing PEs.

As a result, even though a single ray geometry PE consumes 6 times more LUTs than a single ray tracing PE, it occupies 37% less LUT after duplication. This is because the duplication factor of the ray geometry PE is small. It can be observed that the logic saved from sharing high-throughput PEs has increased performance because it allowed more duplication of low-throughput PEs.
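To make the area trade-off concrete (our arithmetic, with the LUT cost of one ray tracing PE normalized to a unit $u$; the absolute counts are in Table VI):

\[
\mathrm{LUT}_{\mathrm{geom}} = 16 \times 6u = 96u,
\qquad
\mathrm{LUT}_{\mathrm{trace}} = 256 \times u = 256u,
\]

so even though each ray geometry PE is six times larger, the 16 duplicated ray geometry PEs together consume far fewer LUTs than the 256 ray tracing PEs.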
Our proposed optimization shows how the performance can be improved if the more frequently occurring computation has lower complexity. The ray tracing PE has frequent incremental computations (Algorithm 1) that only require addition. Thus, it can be heavily duplicated. The ray geometry parameter PE has more complex computation and thus consumes more logic, but less duplication is needed because only one computation is performed per ray in a voxel tile. Thus, it confirms that with the proposed optimization, it is possible to improve the performance of the ray-driven approach by exploiting the simple computation pattern of the ray traversal.

V. EXPERIMENTAL RESULTS

A. Performance

TABLE VII
BREAKDOWN OF EXECUTION TIME

1) Performance Breakdown: The performance breakdown is shown in Table VII. Each iteration of the reconstruction takes 8.9 seconds; thus it takes 7.4 minutes to complete the reconstruction with 50 iterations. We also present the rate of voxel updates in Giga updates per second (GUPS) for easier performance comparison. The performance was measured on an actual execution using the Convey HC-1ex platform.
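Here GUPS is simply the voxel-update rate; written out (our restatement of the definition above):

\[
\mathrm{GUPS} = \frac{N_{\mathrm{voxel\ updates}}}{t_{\mathrm{exec}} \times 10^{9}},
\]

with $t_{\mathrm{exec}}$ the execution time in seconds.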
Forward projection takes an amount of time similar to backward projection, although it requires about 2× the DRAM bandwidth (Table IV). This suggests that the DRAM bandwidth is no longer the bottleneck.

The reason that the compare and update part has a small GUPS is that only a quarter of Convey's DRAM channels was utilized, to reduce the number of PEs to 16. It has little negative impact on the overall performance because its portion of the total is still relatively small.

Note that if TV is added to the system, it will take 1.0 seconds of computation per iteration, assuming that 16 threads on the Xeon E5-2420 are used. Also, an additional 0.8 seconds per iteration is needed to send 372 MB of image back and forth between the FPGA and the CPU through the front-side bus.

TABLE VIII
EXECUTION TIME COMPARISON WITH VARIOUS ACCELERATOR PLATFORMS THAT EXPLOIT THE RAY PARALLELISM (UNIT: MINUTES, 50 ITERATIONS)

2) Effect of the Proposed Parallel Approach: To find the effect of the proposed parallel approach, we implemented the reconstruction algorithm using three different accelerators: multicore CPU, GPU, and FPGA. These versions exploit the ray parallelism but not the proposed voxel data reuse. All versions use the same CT parameters and the same input data as the proposed FPGA version. For the multicore CPU, we used the Xeon E5-2420, which is based on a newer process technology than the FPGA platform. We used 16 threads. For the GPU, three Tesla M2070 GPUs were used to approximately match the number of Virtex-6 FPGAs used. A total of 9216 threads were used. For the FPGA, we used the same Convey platform but incorporated the optimization in [9] for comparison. Note that the CPU and FPGA versions keep enough distance between the threads to resolve access conflicts, as suggested by [22], but the GPU version uses atomic operations to avoid a race condition among thousands of threads.

Table VIII shows that the proposed algorithm on FPGA has 26.9×, 16.8×, and 4.6× speedups over the ray-parallel implementations on the multicore CPU, FPGA, and GPU, respectively. The performance gain is considerable. Even though the Convey platform has smaller DRAM bandwidth and a 28× and 5.8× slower clock frequency than the CPU and GPU platforms, it has higher performance because of the reduced external memory transfer with data reuse, task pipelining, and the customized circuitry.

One possible question could be whether other accelerators can also benefit from the proposed techniques. Somewhat counter-intuitively, experiments showed that, compared to
TABLE IX
SCALABILITY TEST AND PROJECTION OF EXECUTION TIME ON CONVEY HC-1EX PLATFORM

TABLE X
TEST RESULT USING VARIOUS CT PARAMETERS

TABLE XI
RESOURCE CONSUMPTION (4 FPGAS)

TABLE XII
COMPARISON WITH OTHER ACCELERATION WORK ON 3D CT RECONSTRUCTION
gorithms. Compared to [37], which uses the same EM algorithm, our proposed implementation has higher parallelism (256 versus 64) and a slower clock frequency (100 MHz versus 150 MHz). The main difference is the new parallel scheme, the data reuse approach, and the throughput-optimized architecture, which account for about a 16× difference in performance.

Note that our work even has better GUPS performance than most previous FBP-based work. FBP uses point voxels, which allows the weight computation to be simplified with a multiplication of cosine/sine values and interpolation. The ray tracing algorithm used in this paper is more compute-intensive and harder to parallelize, since it serially computes the intersection of rays onto a volumetric voxel.
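To illustrate why that traversal serializes, here is a Siddon-style stepping sketch in our own notation (not the paper's Algorithm 1, see [28]): each step depends on the previous boundary crossing, and the per-voxel work is a comparison plus a few additions, which is also why the ray tracing PE can be duplicated so heavily.

#include <math.h>

/* Walk the voxel grid along a ray and report the intersection length
 * of each crossed voxel. tx/ty/tz are the parametric values at the
 * next x/y/z boundary; dtx/dty/dtz are the parametric widths of one
 * voxel along each axis; sx/sy/sz are the index steps (+1 or -1). */
static void trace_ray(double tx, double ty, double tz,
                      double dtx, double dty, double dtz,
                      int ix, int iy, int iz,   /* entry voxel index */
                      int sx, int sy, int sz,
                      double t_end,
                      void (*visit)(int ix, int iy, int iz, double len))
{
    double t = 0.0;                 /* parametric position on the ray */
    while (t < t_end) {
        if (tx <= ty && tx <= tz) {          /* x-boundary is nearest */
            visit(ix, iy, iz, fmin(tx, t_end) - t);
            t = tx; tx += dtx; ix += sx;
        } else if (ty <= tz) {               /* y-boundary is nearest */
            visit(ix, iy, iz, fmin(ty, t_end) - t);
            t = ty; ty += dty; iy += sy;
        } else {                             /* z-boundary is nearest */
            visit(ix, iy, iz, fmin(tz, t_end) - t);
            t = tz; tz += dtz; iz += sz;
        }
    }
}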
VI. CONCLUSIONS

This paper describes the acceleration of the EM algorithm on the Convey FPGA platform. We have shown that parallelizing ray-driven voxel tiles can solve the access conflict problem and increase memory bandwidth. Also, reusing the sinogram and the voxels can largely reduce external memory access. Optimization based on offline analysis can be performed to exploit the fact that the memory access is independent of the input patient data. Moreover, a customized PE architecture and optimization strategies were presented to increase the system throughput. The experimental results suggest that the combination of an efficient local data transfer scheme and task pipelining can lead to a large speedup in FPGA-based design. The proposed system can outperform other parallel platforms and achieve a 26.9× speedup over the 16-thread multicore CPU implementation.

ACKNOWLEDGMENT

The authors would like to thank Dr. W. Hsu (UCLA Radiology) and Dr. M. Yan (UCLA Mathematics) for the C code, Dr. Y. Zou (Arista Networks) for the FPGA code [9], and D. Wu (UCLA Computer Science) for the GPU code of the CT reconstruction. They would also like to thank J. Wheeler (UCLA Computer Science) for proofreading this manuscript. They are grateful to Dr. A. Bui (UCLA Radiology) and Dr. W. Hsu for sharing the chest CT dataset.

REFERENCES

[1] B. Li, E. Weber, and S. Crozier, "A MRI rotary phased array head coil," IEEE Trans. Biomed. Circuits Syst., vol. 7, no. 4, pp. 548-556, 2013.
[2] J. Um et al., "A single-chip 32-channel analog beamformer with 4-ns delay resolution and 768-ns maximum delay range for ultrasound medical imaging with a linear array transducer," IEEE Trans. Biomed. Circuits Syst., vol. 9, no. 1, pp. 138-151, 2015.
[3] L. Feldkamp, L. Davis, and J. Kress, "Practical cone beam algorithm," J. Opt. Soc. Amer. A, vol. 1, no. 6, pp. 612-619, 1984.
[4] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Stat. Soc., Series B, vol. 39, no. 1, pp. 1-38, 1977.
[5] L. Shepp and Y. Vardi, "Maximum likelihood reconstruction for emission tomography," IEEE Trans. Med. Imag., vol. 1, no. 2, pp. 113-122, 1982.
[6] R. Nelson, S. Feuerlein, and D. Boll, "New iterative reconstruction techniques for cardiovascular computed tomography: How do they work, and what are the advantages and disadvantages?," J. Cardiovascular Computed Tomogr., vol. 5, no. 5, pp. 286-292, 2011.
[7] M. Beister, D. Kolditz, and W. Kalender, "Iterative reconstruction methods in X-ray CT," Phys. Med., vol. 28, no. 2, pp. 94-108, 2012.
[8] K. Chidlow and T. Möller, "Rapid emission tomography reconstruction," in Proc. Eurographics/IEEE TVCG Workshop Volume Graph., 2003, pp. 15-26.
[9] J. Chen et al., "A hybrid architecture for compressive sensing 3-D CT reconstruction," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 2, no. 3, pp. 616-625, 2012.
[10] B. Meng, G. Pratx, and L. Xing, "Ultrafast and scalable cone-beam CT reconstruction using MapReduce in a cloud computing environment," Med. Phys., vol. 38, no. 12, pp. 6603-6609, 2011.
[11] J. Rosen, J. Wu, J. Fessler, and T. Wenisch, "Iterative helical CT reconstruction in the cloud for ten dollars in five minutes," in Proc. Int. Meeting Fully 3D Image Recon. in Rad. and Nucl. Med., 2013, pp. 241-244.
[12] H. Scherl et al., "Evaluation of state-of-the-art hardware architectures for fast cone-beam CT reconstruction," Parallel Comput., vol. 38, pp. 111-124, 2011.
[13] Y. Okitsu, F. Ino, and K. Hagihara, "High-performance cone beam reconstruction using CUDA compatible GPUs," Parallel Comput., vol. 36, no. 2, pp. 129-141, 2010.
[14] F. Xu and K. Mueller, "Real-time 3D computed tomographic reconstruction using commodity graphics hardware," Phys. Med. Biol., vol. 52, pp. 3405-3417, 2007.
[15] C. Chou, Y. Chuo, Y. Hung, and W. Wang, "A fast forward projection using multithreads for multirays on GPUs in medical image reconstruction," Med. Phys., vol. 38, no. 7, pp. 4052-4065, 2011.
[16] K. Xiao, D. Chen, X. Hu, and B. Zhou, "Efficient implementation of the 3D-DDA ray traversal algorithm on GPU and its application in radiation dose calculation," Med. Phys., vol. 39, no. 12, pp. 7619-7625, 2012.
[17] B. Zhou, X. Hu, and D. Zhen, "Memory-efficient volume ray tracing on GPU for radiotherapy," in Proc. IEEE Symp. Applicat. Specific Processors, 2011, pp. 46-51.
[18] M. Baer and M. Kachelrieß, "High performance parallel beam and perspective cone-beam backprojection for CT image reconstruction on pre-production Intel Xeon Phi," in Proc. Int. Meeting Fully 3D Image Recon. in Rad. and Nucl. Med., 2013, pp. 233-236.
[19] N. Gac, S. Mancini, M. Desvignes, and D. Houzet, "High speed 3-D tomography on CPU, GPU, and FPGA," EURASIP J. Embed. Syst., vol. 2008, pp. 1-12, 2008.
[20] G. Yan, J. Tian, S. Zhu, Y. Dai, and C. Qin, "Fast cone-beam CT image reconstruction using GPU hardware," J. X-Ray Sci. Technol., vol. 16, pp. 225-234, 2008.
[21] L. Rudin, S. Osher, and E. Fatemi, "Nonlinear total variation based noise removal algorithms," Physica D, vol. 60, pp. 259-268, 1992.
[22] J. Chen et al., "EM+TV for reconstruction of cone-beam CT with curved detectors using GPU," in Proc. Int. Meeting Fully 3D Image Recon. in Rad. and Nucl. Med., 2011, pp. 363-366.
[23] Y. Choi, J. Cong, and D. Wu, "FPGA implementation of EM algorithm for 3-D CT reconstruction," in Proc. 22nd IEEE Int. Symp. Field-Programmable Custom Comput. Mach., 2014, pp. 157-160.
[24] N. Duggan et al., "A technique for lung nodule candidate detection in CT using global minimization methods," Energy Minim. Methods Comput. Vis. Pattern Recognit., pp. 478-491, 2015.
[25] S. Shen, A. Bui, J. Cong, and W. Hsu, "An automated lung segmentation approach using bidirectional chain codes to improve nodule detection accuracy," Comput. Biol. Med., vol. 57, pp. 139-149, 2015.
[26] M. Yan and L. Vese, "Expectation maximization and total variation-based model for computed tomography reconstruction from undersampled data," Proc. SPIE Med. Imag., pp. 79612X-1-8, 2011.
[27] H. Zhao and A. J. Reader, "Fast ray-tracing technique to calculate line integral paths in voxel arrays," in Proc. IEEE Nucl. Sci. Symp. Conf. Rec., 2003, vol. 4, pp. 2808-2812.
[28] R. Siddon, "Fast calculation of the exact radiological path for a three-dimensional CT array," Med. Phys., vol. 12, pp. 252-255, 1985.
[29] B. D. Man and S. Basu, "Distance-driven projection and backprojection in three dimensions," Phys. Med. Biol., vol. 49, no. 11, pp. 2463-2475, 2004.
[30] A. Ziegler, T. Köhler, T. Nielsen, and R. Proksa, "Efficient projection and backprojection scheme for spherically symmetric basis functions in divergent beam geometry," Med. Phys., vol. 32, no. 12, pp. 4653-4663, 2006.
[31] All Programmable FPGAs and 3D ICs, Xilinx, 2015. [Online]. Available: http://www.xilinx.com/products/silicon-devices/fpga.html
[32] J. Bakos, "High-performance heterogeneous computing with the Convey HC-1," IEEE Comput. Sci. Eng., vol. 12, no. 6, pp. 80-87, 2010.
[33] J. Kim, J. Fessler, and Z. Zhang, "Forward-projection architecture for fast iterative image reconstruction in X-ray CT," IEEE Trans. Signal Process., vol. 60, no. 10, pp. 5508-5518, 2012.
[34] J. Cong et al., "Combining module selection and replication for throughput-driven streaming programs," in Proc. Conf. Design, Autom. Test in Europe, 2012, pp. 1018-1023.
[35] J. Cong, M. Huang, and P. Zhang, "Combining computation and communication optimizations in system synthesis for streaming applications," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2014, pp. 213-222.
[36] X. Li and S. Luo, "A compressed sensing-based iterative algorithm for CT reconstruction and its possible application to phase contrast imaging," Biomed. Eng. Online, vol. 10, 2011.
[37] J. Chen, J. Cong, M. Yan, and Y. Zou, "FPGA-accelerated 3D reconstruction using compressive sensing," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2012, pp. 163-165.
[38] B. Keck et al., "GPU-accelerated SART reconstruction using the CUDA programming environment," Proc. SPIE Med. Imag., pp. 72582B-1-9, 2009.
[39] F. Xu and K. Mueller, "Accelerating popular tomographic reconstruction algorithms on commodity PC graphics hardware," IEEE Trans. Nucl. Sci., vol. 52, no. 3, pp. 654-663, Jun. 2005.

Young-kyu Choi (S'14) received the B.S. and M.S. degrees in electrical engineering from Seoul National University, Seoul, Korea. Currently, he is working toward the Ph.D. degree in computer science under the guidance of Prof. Jason Cong at the University of California, Los Angeles, Los Angeles, CA, USA. He has worked at LG Electronics (2008-2011) and Inha University, Incheon, Korea (2011-2012). His research interests include characterization and implementation of FPGA-friendly applications.

Jason Cong (F'00) received the B.S. degree from Peking University, Beijing, China, in 1985, and the M.S. and Ph.D. degrees from the University of Illinois at Urbana-Champaign, Champaign, IL, USA, in 1987 and 1990, respectively, all in computer science. At the University of California, Los Angeles (UCLA), Los Angeles, CA, USA, he is a Chancellor's Professor in the Computer Science Department and holds a joint appointment in the Electrical Engineering Department. He is the Director of the Center for Domain-Specific Computing (CDSC), Codirector of the UCLA/Peking University Joint Research Institute in Science and Engineering, and the Director of the VLSI Architecture, Synthesis, and Technology (VAST) Laboratory. Additionally, he served as Chair of the UCLA Computer Science Department (2005-2008). His research interests include energy-efficient computing, customized computing for big-data applications, synthesis of VLSI circuits and systems, and highly scalable algorithms. He has authored more than 400 publications in these areas. Dr. Cong has received 10 best paper awards, two 10-Year Most Influential Paper Awards (from ICCAD'14 and ASPDAC'15), and the 2011 ACM/IEEE A. Richard Newton Technical Impact Award in Electronic Design Automation. He was elected an ACM Fellow in 2008. He was the recipient of the 2010 IEEE Circuits and Systems (CAS) Society Technical Achievement Award "For seminal contributions to electronic design automation, especially in FPGA synthesis, VLSI interconnect optimization, and physical design automation."