Académique Documents
Professionnel Documents
Culture Documents
1, JANUARY/FEBRUARY 2015
AbstractThe next generation genome sequencing problem with short (long) reads is an emerging field in numerous scientific and big
data research domains. However, data sizes and ease of access for scientific researchers are growing and most current methodologies
rely on one acceleration approach and so cannot meet the requirements imposed by explosive data scales and complexities. In this
paper, we propose a novel FPGA-based acceleration solution with MapReduce framework on multiple hardware accelerators. The
combination of hardware acceleration and MapReduce execution flow could greatly accelerate the task of aligning short length reads
to a known reference genome. To evaluate the performance and other metrics, we conducted a theoretical speedup analysis on a
MapReduce programming platform, which demonstrates that our proposed architecture have efficient potential to improve the speedup
for large scale genome sequencing applications. Also, as a practical study, we have built a hardware prototype on the real Xilinx
FPGA chip. Significant metrics on speedup, sensitivity, mapping quality, error rate, and hardware cost are evaluated, respectively.
Experimental results demonstrate that the proposed platform could efficiently accelerate the next generation sequencing problem with
satisfactory accuracy and acceptable hardware cost.
1 INTRODUCTION
running on a cluster of processors. However, NGS improve- construct a hardware prototype subsystem. The
ments are moving the bottleneck of the genome sequencing experimental results with hardware speedup, sen-
workflow from the sequencing phase to the short read map- sitivity, quality, error rate and hardware utiliza-
ping software. In particular, using MapReduce framework tion is presented and analyzed.
to accelerate the short read mapping process by exploiting The structure of this paper is organized as below. We
the parallelism of the task has been proven to be an efficient present the motivation and the related state-of-the-art
way as CloudBurst [5] has shown. However, CloudBurst studies and in Section 2, then Section 3 illustrates the
will not start until the millions of short reads have been architecture and FPGA based MapReduce programming
imported into the system. This would be time-consuming, framework. We also detail the execution flow of short read
and any updates to one of the short reads require the Map- mapping in Section 3. Thereafter, we explore the hot spots
Reduce tasks to be re-run which will dramatically degrade of the genome sequencing application in Section 4. A theo-
the response time. retical study for both timing complexity and speedup anal-
Furthermore, along with the rapid development of genome ysis is also described in Section 4. In Section 5, we present
technology, in the foreseeable future, the commercial genome the hardware prototype and detail the experimental results
sequencing service could be provided to the research commu- on the real FPGA hardware platform. Finally, Section 6
nity. The requirements for scientific researchers to access the concludes the paper and introduces some future works.
genome sequencing services provided by supercomputers
would be a driven force to the genome technology in the 2 RELATED WORK AND MOTIVATIONS
future. To tackle this problem, a high performance genome
sequencing engine that can response in a reasonable response There are quite a lot of successful short read mapping soft-
time is becoming highly recommended and favorable. ware tools that address the problem of processing the enor-
Up to now, it has been popular and necessary to address mous amount of data produced by the next-generation
the computational requirements for NGS using efficient sequencing machines. In this section, we first present the
hardware field programmable gate array (FPGA)-based motivation, and then analyze the state-of-the-art related
acceleration. For example, the recent work in [6], [7] stream work of short read mapping algorithm, MapReduce accel-
a good perspective of the reference genome through a sys- eration frameworks and heterogeneous accelerating
tem doing exact matching of the short reads. However, the approaches respectively.
It didnt fit problem from TC-FPGA community predic-
tions [8] illustrates that the helpless fact the FPGA based 2.1 Motivation
heterogeneous computing platform is preferable to deal Although it has been only less than five years since the
with computing-intensive applications, while tremendous emergence of the short read mapping problem of genome
amount of data may cause inevitable failure if only the sequencing, it is clearly an area of intense research activity.
hardware based acceleration is employed in the FPGA devi- It is widely acknowledged this field to be one of the pioneer-
ces. Instead, they should be processed in a more parallel ing application domains that benefit from innovative ideas.
way, such as MapReduce [9] framework. During the past Due to the I/O and computing intensive requirements of
few years, there has been some creditable literature like state-of-the-art DNA sequencing problem, we anticipate
FPMR [10] that integrates high level MapReduce parallel that DNA sequencing programs will be located in super-
framework into FPGA based hardware prototype. How- computers but in a manner that provides pervasive and
ever, in spite of the state-of-the-art hardware implementa- ubiquitous services to scientific researchers or even public
tion achieving satisfying speedups, to our best knowledge, users. In this scenario, system performance assessment
few of them have considered FPGA based MapReduce pro- needs to consider more metrics such as response time, scal-
gramming framework really applicable to this typical large ability and mismatching rate, rather than the raw parallel
amount data-intensive computing domains. speedup oriented, as is illustrated in most of the published
To address the above problems, in this paper we present work. In a cloud service model, it is likely that the service
a heterogeneous cloud framework with MapReduce and need to begin operation before all the required data is
multiple hardware execution engines on FPGA to accelerate completely transferred via internet.
the genome sequencing problems. The contribution of this To the best of our knowledge, this approach targeting
paper is claimed as below: NGS has not yet been fully studied. In this paper, we pres-
ent a novel approach for solving NGS problems using a
1) We propose a novel architecture to accelerate read cloud based paradigm. We apply cluster computing
mapping processing thread when facing large approach to FPGA based multiple acceleration engines into
amount of requests with FPGA based MapReduce. a sound framework, and discuss architecture in a service-
2) A distributed MapReduce framework is presented to oriented manner. We also apply MapReduce programming
dispatch the task into multiple FPGA accelerators. model into the hybrid system to manipulate FPGA based
The Map and Reduce process is based on RMAP accelerators as function units, which could largely facilitate
sequencing algorithm. And inside each FPGA based the programmers to improve the performance.
accelerator, we implement BWA-SW algorithm ker-
nel in hardware that is connected to a local micropro- 2.2 Current Short Read Mapping Algorithms
cessor to speedup the local alignment process. In this section, we explore the state-of-the-art short read
3) We conduct a theoretical analysis on the MapRe- mapping solutions, in particular into two main algorithmic
duce framework, and for each FPGA chip, we categories, as is inherited from [1]. The first category of
168 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 12, NO. 1, JANUARY/FEBRUARY 2015
TABLE 1
Summary for Accelerations for Genome Sequencing
solution is based upon a block sorting data compression match the input stream against a single string, one match-
algorithm called the Burrows-Wheeler transform (BWT) ing unit is required per string, and the hardware system is
[11]. This solution uses the FMindex [12] to efficiently store composed of a linear array of matching units. The methods
information required to traverse a suffix tree for a reference presented in [32] and [33] are based on pre-decoded charac-
sequence. These solutions can quickly find a set of matching ters with hardwired logic circuits. The system is optimized
locations in a reference genome for short reads that match with respect to the given set of signatures and the character-
the reference genome with a very limited number of differ- istics of the FPGA devices. The FPGA is reconfigured when
ences. However, the running time of this class of algorithm there are changes to the signature set. However, the long
is exponential with respect to the allowed number of differ- latency required in offline generation of the optimized
ences; therefore BWT-based algorithms tend to be less sensi- hardwired circuits is considered as a major disadvantage in
tive than other creditable solutions. Bowtie [2] and BWA [3] a network intrusion detection system that demands fast
are examples programs based on this algorithmic approach. responses to hostile conditions. For the sake of FPGA accel-
The second category of solution leverages the fact that erators, many of the proposed hardware solutions are
individual genomes differ only slightly, meaning it is likely based on the well-known Aho-Corasick (AC) algorithm
that some shorter sub-sequences of a short read will exactly [34], where the system is modeled as a deterministic finite
match the reference genome. This technique is called seed- automaton. The AC algorithm solves the string matching
and-extend and these shorter sub-sequences are called problem in time linearly proportional to the length of the
seeds. For example, for a 30 bp read to map to a reference input stream. However, the memory requirement is not fea-
with only one difference, there must regardless of where sible in a straightforward hardware implementation. In
the difference occurs. In fact, if a we align an m bp read particular, [35] presents a pipelined processing approach to
with at most k differences, there must exist one exact align- the implementation of AC algorithm. However, these
ment of m/(k 1) consecutive bases [13]. An index of the approaches are not appropriate to be implemented in hard-
reference genome is compiled first, which maps every seed ware due to the complexity and timing overheads.
that occurs in the reference genome to the locations where
they occur. To align a short read, all the seeds in the read 2.3 Accelerating Approaches
are looked up in the index, which yields a set of candidate Along with the novel short read mapping algorithms, multi-
alignment locations (CALs). The short read is then scored ple attempts have also been conducted to accelerate short
against the reference at each of these CALs using the Smith- read mapping in diverse high performance computing tech-
Waterman [14] string-matching algorithm. The location niques. Graphics processing units (GPUs), MapReduce and
with the highest score is chosen as the alignment location FPGA based hardware are the most widely used accelera-
for a short read. For example, BLAST [4] uses a hash table of tion engines. Table 1 listed most of the state-of-the-art litera-
all fixed length k-mers in the reference to find seeds, and a tures for type, reference, performance metric, utilized
banded version of the Smith-Waterman algorithm to com- algorithm, and special features. In particular, the devoted
pute high scoring gapped alignments. RMAP [15] uses a researches can be divided into following categories:
hash table of non-overlapping k-mers of length m/(k 1)
in the reads to find seeds. 2.3.1 GPU Based Accelerators
Based on the string match problem abstractions, plenty GPUs have recently been used as a mature approach for
of traditional approaches those tackle high-speed similarity several bioinformatics applications, especially for sequence
analysis and indexing exploration can be adopted, such as alignment, one of the most significant research areas. For
[26], [27], [28], [29], [30]. Baker and Prasanna [31] presented example, [16] presents a similarity using Smith-Waterman
a hardware implementation of the Knuth-Morris-Pratt algorithm in GPU accelerators, using compute unified
(KMP) algorithm. Since the KMP algorithm is designed to device architecture (CUDA) programming engines.
WANG ET AL.: HETEROGENEOUS CLOUD FRAMEWORK FOR BIG DATA GENOME SEQUENCING 169
MUMmerGPU [17] is an open-source parallel pairwise local but the hardware engines is implemented on a sole FPGA
sequence alignment program that runs on commodity board. Chen et al. [39] presents a novel FPGA-based archi-
GPUs in common workstations. Based on this research, [18] tecture which could address the problem with a bounded
uses features a stackless depth-first-search print kernel with number of PEs to realize any lengths of systolic array. It is
massive GPU data layout configurations to improve register mainly based on the idea of the banded Smith-Waterman
footprint and conclude higher occupancy. Liu et al. [19] but with a key distinguish that it reuses the PEs which are
makes new contributions to Smith-Waterman protein data- beyond the boundary.
base searches using compute unified device architecture. A Previous efforts doing short read mapping using FPGAs
parallel Smith-Waterman algorithm has been proposed to have achieved at most an order of magnitude improvement
further optimize the performance based on the single- compared to software tools. Also, previous solutions are not
instruction-multiple-thread (SIMT) abstraction. convincible to produce a system that is well-suited to large
scale long read mapping and full genome sequencing.
2.3.2 FPGA Based Accelerations Finally, current literatures only uses one unique method on
Nevertheless, numerous attempts to accelerate short read FPGA to accelerate the short read mapping problem, there-
mapping on FPGAs tried to use a brute-force approach to fore it could make the most advantage of the parallel com-
compare short sequences in parallel to a reference puting and hardware acceleration based techniques.
genome. For example, the work in [6], [7] streams the ref-
erence genome through a system doing exact matching of 2.3.3 MapReduce Based Acceleration Frameworks
the short reads. Knodel et al. [6] demonstrate a greater Besides the above heterogeneous accelerating engines, Map-
sensitivity to genetic variations in the short reads than Reduce [9] is an alternative software framework developed
Bowtie and MAQ, but the mapping speed was approxi- and used by Google to support parallel-distributed execu-
mately the same as that of Bowtie. Also, this system dem- tion of data intensive applications. Utilizing MapReduce
onstrated mapping short reads to only chromosome 1 of programming framework, CloudBurst [5] is a new credit-
the human genome. Fernandez et al. [7] demonstrate able parallel read-mapping algorithm optimized for map-
between 1.6x and 4x speedup versus RMAP [15] for reads ping next-generation sequence data to the human genome
with between 0 and 3 differences. This implementation and other reference genomes, for use in a variety of biologi-
was for the full human genome. cal analyses including SNP discovery, genotyping and per-
Some recently papers conducing DNA short read sonal genomics. It is modelled after the short read-mapping
sequencing problem using FPGA based reconfigurable com- program RMAP [15], and reports either all alignments or
puting acceleration engines are proposed in [1], [20], [21], the unambiguous best alignment for each read with any
[22], [23], [24], [25]. Of these approaches, [20] proposes a sys- number of mismatches or differences. This level of sensitiv-
tolic custom computation on FPGA to implement the read ity could be prohibitively time consuming, but CloudBurst
mapping on a massively parallel architecture. This literature uses the open-source Hadoop implementation of MapRe-
enables the implementation of thousands of parallel search duce to parallelize execution using multiple compute nodes,
engines on a single FPGA device. Wang et al. [36] shares the which means it can deal with larger amount of short reads
idea of genome sequencing using MapReduce on FPGA simultaneously. Similarly, [40] is based on the pre-process
with multiple hardware accelerators, and further presents of the reference genomes and iterative MapReduce jobs for
the Big data genome sequencing on Zynq based clusters aligning the continuous incoming reads.
[37]. Moreover, the authors in [21] proposed a CPU-FPGA To sum up, even the MapReduce framework could dis-
heterogeneous architecture for accelerating a short reads seminate the read mapping to multiple computing machines,
mapping algorithm, which was built upon the concept of but the execution kernel itself is quite inefficient, so the
hash-index with several optimizations that reorder hash throughput of the entire system is still worth pursuing.
table accesses and compress empty hash buckets. The
authors in [22] applies pre-filtering of the kind commonly 3. SHORT READ MAPPING ON HETEROGENEOUS
used in BLAST to perform the initial all-pairs alignments. In CLOUD FRAMEWORK
[1] the authors proposed a scalable FPGA-based solution to
the short read mapping problem in DNA sequencing, which As the MapReduce framework has been successfully inte-
greatly accelerates the task of aligning short length reads to grated into FPGA based researches [10], in this section, we
a known reference genome. Not only the first stage of pro- demonstrate the system architecture of a hybrid heteroge-
gressive alignment in multiple sequence alignment problem, neous system utilizing both software and field programma-
the third stage of progressive alignment on reconfigurable ble gate array-based hardware for real time short reads
hardware in [23]. Chen et al. [24] introduces a hybrid system mapping service. We will model it as a specific-domain
for short read mapping utilizing both software and field pro- search problem that has been studied well for the past
grammable gate array-based hardware. Zhang et al. [25] decades.
presents a implementation of the Smith-Waterman algo-
rithm for both DNA and protein sequences on the platform. 3.1 High Level Architecture Framework
The paper introduces a multistage processing element After sequencing DNA, researchers often map the reads to a
design and a pipelined control mechanism with uneven reference genome to find the locations where each read
stage latencies to improve the performance and decrease of occurs. The read-mapping algorithm reports one or more
the on-chip SRAM usage. Recently, [38] uses FPGA-based alignments for each read within a scoring threshold, com-
system to accelerate the next generation long read mapping, monly expressed as the minimal acceptable significance of
170 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 12, NO. 1, JANUARY/FEBRUARY 2015
Fig. 3. Reference genome lookup table. Index means this occurrences Fig. 4. Hbase table after users short read requests written.
position in the reference genome 1, L-C means the left elements of this
occur; L-R means the right elements of this occur; The size of L-C and to organize this way is we can make sure each reference
L-R should be m-sk (k is the maximum number of differences or mis-
takes; s is seed length, m is minimum length of the reads.
genome information was stored in one physical server,
which would accelerate computing a lot.
Considering the data structure of reference genome
immediately after the map function. Combiners are not possi- lookup table structure, we utilize the LUT based slice regis-
ble in every application because they evaluate on a subset of ters within FPGA development board to allocate block
the values for a given key, but when possible, reduce the parameters, as it is the natural way to access the L-C, Index
amount of data processed in the shuffle and reduce phases. and L-R values in hardware acceleration. However, due to
In the k-mer counting example, the combiner emits a partial the hardware resources constraints, we can not integrate
sum from the subset of 1 s it evaluates, and the reduce func- all the Lookup Table together, therefore they are imple-
tion sums over the list of partial sums. mented as private memory access for different hardware
The MapReduce programming framework has been suc- accelerators.
cessfully introduced with respect to the FPGA based acceler- When a short read reaches the buffer, it would be written
ations. For the sake of massive data processing requirement, into the Hbase table immediately to form a table like Fig. 4
we use RMAP algorithm to align the short reads. RMAP shows. Its the same table with Fig. 3 except the extra read
algorithm is based on seed-and-extend approach, which is columns that were added into the corresponding reference
composed of two phases: Compare Phase and Seed-Extend genome column-family. After writing into this table, each
Phase [15]. Of the two phases, Compare Phase is basically read should begin to align itself with the corresponding ref-
considered as a mathematical string matching problem, erence genomes. This processing is not complex but still
which in this scenario is in charge of locating the seed string time costing. To accelerate this procedure, we revise the
in the reference genome sequence, therefore it is regarded as Hbase update operations: whenever there is an update in
the time-consuming part. On the other hand, Seed-Extend specific column-family (which in our case is read column-
Phase will be processed after the seed are indexed, and the family), there would be a new task allocated to do the align-
extensions in both sides of the seed string are to be com- ing job. Due to localization of the same row and the same
pared to explore whether a mismatch (e.g. genome muta- column-family, this task would not need communication,
tion) appears. Due to that the length of extended string is which will be faster with the plain MapReduce algorithm.
always much shorter than the reference itself, hence the tim- Beside the tasks executed when request arrives, we have
ing complexity of Seed-Extend Phase can be ignored with a threshold on the short read buffer to judge whether the sys-
respect to the large amount of the reference sequence. More- tem load (means the short reads requests per minutes) was
over, for the sake of the area and hardware constraints of too high. In the situation that system load is high, each read
FPGA chip areas, it is not feasible to implement all the hard- will not be processed by individual task, instead, after writ-
ware logic within limited hardware look-up-tables (LUTs) ing the request reads into the Hbase table, we will schedule
and slices registers, consequently we put our concentrations an global MapReduce task and generate the align results at
on the hardware accelerations engines of the Comparing the same time.
Phase, implementing hardware IP cores on FPGA for RMAP The scheduling and generation process is divided into
string matching applications. three stages:
To design the RMAP algorithm into MapReduce pro- 1) Generate lookup table. At the very beginning of our sys-
gramming framework, both the map and shuffle stages are tem, we got numerous reference genomes and we will
employed in string compare operations, while the reduce schedule MapReduce tasks to generate the seed lookup table,
stage is in charge of seed-extend operations. which will be used to handle user requests. Fig. 5 shows an
At the very beginning of our system, we use MapReduce example of how the map, shuffle, and reduce function
tasks to build a lookup table for all the seed from the refer- works in generating lookup table.
ence genomes like Fig. 3. Each seed may appear multiple The map function emits k-mers of length s as seeds from
times in one reference genome, so we make every reference all the reference sequences. The shuffle phase collects k-
genome as a column-family in Hbase table and each occur- mers shared between the reference sequences. Finally, the
rence of a given key is a column in the family. The benefits reduce function write the same seed into the Hbase table.
172 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 12, NO. 1, JANUARY/FEBRUARY 2015
TABLE 2
Parameters Used to Analyze Speedup
Fig. 10. Speedup of different data sets with different SA interval threshold s and z-best strategy threshold z. approximately 10,000,000 bp reads with
lengths of 100, 200, 500, 1,000, and 2,000 bp are simulated. Genomes mismatches and gaps at the error rate of 2 percent are inserted in all these
data sets. Reads sets in subfigure (a), (b), and (c) are the same.
176 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 12, NO. 1, JANUARY/FEBRUARY 2015
TABLE 3
Hardware Utilization on FPGA (500 bp)
TABLE 4
Power and Thermal Information (2,048 bp)
Furthermore, we also measured the power and thermal [7] E. Fernandez, W. Najjar, E. Harris, and S. Lonardi, Exploration of
short reads genome mapping in hardware, in Proc. Int. Conf. Field
information in Table 4. The total power of a single accelera- Programmable Logic Appl., 2010, pp. 360363.
tor is only 1.339 W, which includes a quiescent power at [8] FCCM 2011 Top 10 Predictions for FCCMs in 2016. (2011).
1.319 W (98.6 percent), and a dynamic power at 0.019 W (1.4 [Online] Available: http://www.fccm.org/2012/Previous.html
percent). Meanwhile the max ambient and the junction tem- [9] D. Jeffrey and G. Sanjay, MapReduce: Simplified data processing
on large clusters, Commun. ACM, vol. 51, no. 1, pp. 107113, 2008.
perature are 77.1 and 57.9 , respectively. [10] Y. Shan, B. Wang, J. Yan, Y. Wang, N. Xu, and H. Yang, FPMR:
MapReduce framework on FPGA, in Proc. 18th Annu. ACM/
SIGDA Int. Symp. Field Programmable Gate Arrays, Monterey, CA,
6 CONCLUSIONS AND FUTURE WORK USA, , 2010, pp. 93102.
[11] M. Burrows and D. J. Wheeler, A Block-Sorting Lossless Data Com-
This paper has proposed a general heterogeneous cloud pression Algorithm. Palo Alto, CA, USA: Digital Equipment Corpo-
framework for next genome sequencing with multiple hard- ration, 1994, p. 124.
ware accelerators on FPGAs. We utilize an extended Map- [12] P. Ferragina and G. Manzini, Opportunistic data structures with
Reduce distribution framework with multiple hardware applications, in Proc. 41st Annu. Symp. Found. Comput. Sci., 2000,
pp. 390398.
accelerators on FPGA. By extending a distributed process- [13] R. A. Baeza-Yates and C. H. Perleberg, Fast and practical approx-
ing technique with hardware accelerators, this approach imate string matching, in Proc. 3rd Annu. Symp. Combinatorial Pat-
could bring significant speedup for genome sequencing tern Matching. 1992.
[14] T. F. Smith and M. S. Waterman, Identification of common molec-
alignment process. We have presented results both from a ular subsequences, J. Mol. Biol., vol. 147, no. 1, pp. 195197, 1981.
theoretical analysis and real hardware platform on Xilinx [15] A. D. Smith, Z. Xuan, and M. Q. Zhang, Using quality scores and
FPGA development board. The experimental results reveal longer reads improves accuracy of Solexa read mapping, BMC
that each hardware accelerator could achieve up to 2.73x Bioinformat., vol. 9, no. 128, pp. 14712105, 2008.
[16] S. A Manavski and G. Valle, A compatible GPU cards as efficient
speedup as well as it only occupies less than 1 percent of hardware accelerators for Smith-Waterman sequence alignment,
the FPGA chip resources. BMC Bioinformat., vol. 9, no. Suppl. 2, p. S10, 2008.
Given the promising preliminary results illustrated [17] M. C Schatz, C. Trapnell, A. L Delcher, and A. Varshney, High-
in this paper, there exist various directions for future throughput sequence alignment using graphics processing units,
BMC Bioinformat., vol. 8, p. 474, 2007.
developments. Additional engineering effort needs to be [18] C. Trapnell and M. C. Schatz, Optimizing data intensive GPGPU
applied to assess the scalability comparison to aforemen- computations for DNA sequence alignment, Parallel Comput.,
tioned cutting-edge approaches. Also a comparative study vol. 35, nos. 8/9, pp. 429440, 2009.
[19] Y. Liu, B. Schmidt, and D. L. Maskell, CUDASW2.0:
between the short read mapping applications in high per- Enhanced Smith-Waterman protein database search on CUDA-
formance GPU based supercomputers based on Hadoop enabled GPUs based on SIMT and virtualized SIMD abstractions,
clusters and detailed analysis of the feasibility, cost and BMC Res. Notes, vol. 3, p. 93, 2010.
overheads is working in progress. Furthermore, from the [20] T. B. Preusser, O. Knodel, and R. G. Spallek, Short-read mapping
by a systolic custom FPGA computation, in Proc. IEEE 20th
technical perspective, if different reads contain the same Annu. Int. Symp. Field-Programmable Custom Comput. Mach., 2012,
seeds, they could be cached in a fast table to response pp. 169176.
with high efficiency. [21] W. Tang, W. Wang, B. Duan, C. Zhang, G. Tan, P. Zhang, and N.
Sun, Accelerating millions of short reads mapping on a heteroge-
neous architecture with FPGA accelerator, in Proc. IEEE 20th
ACKNOWLEDGMENTS Annu. Int. Symp. Field-Programmable Custom Comput. Mach., 2012,
pp. 184187.
This work was supported by the National Science Founda- [22] A. Mahram and M. C. Herbordt, FMSA: FPGA-Accelerated clus-
tion of China under grants (No. 61379040, No. 61272131, taIW-based multiple sequence alignment through pipelined pre-
No. 61202053, No. 61222204, No. 61221062), Jiangsu Provin- filtering, in Proc. IEEE 20th Annu. Int. Symp. Field-Programmable
Custom Comput., Mach., 2012, pp. 177183.
cial Natural Science Foundation (No. SBK201240198), and [23] S. Lloyd and Q. O Snell, Accelerated large-scale multiple
the Strategic Priority Research Program of CAS sequence alignment, BMC Bioinformat., vol. 12, p. 466, 2011.
(No. XDA06010403). The authors deeply appreciate many [24] Y. Chen, B. Schmidt, and D. L. Maskell, Accelerating short read
reviewers for their insightful comments and suggestions. mapping on an FPGA (abstract only), in Proc. ACM/SIGDA Int.
Symp. Field Programmable Gate Arrays, Monterey, California CA,
Professor Xi Li is the corresponding author. USA, 2012, p. 265.
[25] P. Zhang, G. Tan, and G. R. Gao, Implementation of the Smith-
REFERENCES Waterman algorithm on a reconfigurable supercomputing
platform, in Proc. 1st Int. Workshop High-Performance Reconfigura-
[1] C. B. Olson, M. Kim, C. Clauson, B. Kogon, C. Ebeling, S. Hauck, ble Comput. Technol. Appl., Reno, NV, USA, , 2007, pp. 3948.
and W. L. Ruzzo, Hardware acceleration of short read mapping, [26] Q. Wang and V. K. Prasanna, Multi-core architecture on FPGA
in Proc. IEEE 20th Annu. Int. Symp. Field-Programmable Custom for large dictionary string matching, in Proc. 17th IEEE Symp.
Comput. Mach., 2012, pp. 161168. Field Programmable Custom Comput. Mach., 2009, pp. 96103.
[2] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, Ultrafast [27] I. Sourdis and D. Pnevmatikatos, Fast, large-scale string match
and memory-efficient alignment of short DNA sequences to the for a 10 Gbps FPGA-based network intrusion detection system,
human genome, Genome Biol., vol. 10, no. 3, p. R25, 2009. in Proc. 13th Int. Conf. Field Programmable Logic Appl., 2003,
[3] H. Li and R. Durbin, Fast and accurate short read alignment with pp. 880889.
Burrows-Wheeler transform, Bioinformatics, vol. 25, no. 14, [28] T. V. Court and M. C. Herbordt, Families of FPGA-based algo-
pp. 17541760, 2009. rithms for approximate string matching, in Proc. 15th IEEE Int.
[4] N. Homer, B. Merriman, and S. F. Nelson, BFAST: An alignment Conf. Appl.-Specific Syst., Archit. Process., 2004, pp. 354364.
tool for large scale genome resequencing, PLoS ONE, vol. 4, [29] M. Aldwairi, T. Conte, and P. Franzon, Configurable string
no. 11, p. e7767, 2009. matching hardware for speeding up intrusion detection, ACM
[5] M. C. Schatz, CloudBurst: Highly sensitive read mapping with SIGARCH Comput. Archit. News, vol. 33, no. 1, pp. 99107, 2005.
MapReduce, Bioinformatics, vol. 25, no. 11, pp. 13631369, 2009. [30] D. Pao, W. Lin, and B. Liu, Pipelined architecture for multi-string
[6] O. Knodel, T. B. Preusser, and R. G. Spallek, Next-generation mas- matching, IEEE Comput. Archit. Lett., vol. 7, no. 2, pp. 3336, Jul.
sively parallel short-read mapping on FPGAs, in Proc. IEEE Int. Dec. 2008.
Conf. Appl.-Specific Syst., Archit. Processors, 2011, pp. 195201.
178 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 12, NO. 1, JANUARY/FEBRUARY 2015
[31] Z. K. Baker and V. K. Prasanna, A computationally efficient Xi Li is a professor and the vice dean in the
engine for flexible intrusion detection, IEEE Trans. Very Large School of Software Engineering, University of
Scale Integr. Syst., vol. 13, no. 10, pp. 11791189, Oct. 2005. Science and Technology of China, where he
[32] Z. K. Baker and V. K. Prasanna, Automatic synthesis of efficient directs the research programs in Embedded Sys-
intrusion detection systems on FPGAs, IEEE Trans. Dependable tem Lab, examining various aspects of embed-
Secure Comput., vol. 3, no. 4, pp. 289300, Oct.Nov. 2006. ded system with the focus on performance,
[33] V. Dimopoulos, I. Papaefstathiou, and D. Pnevmatikatos, A availability, flexibility, and energy efficiency. He
memory-efficient reconfigurable aho-corasick FSM implementa- has lead several national key projects of China,
tion for intrusion detection systems, in Proc. Int. Conf. Embedded several national 863 projects and NSFC projects.
Comput. Syst.: Archit., Model. Simul., 2007, pp. 186193. He is a member of the IEEE, ACM, and also a
[34] A. V. Aho and M. J. Corasick, Efficient string matching: An aid to senior member of the CCF.
bibliographic search, Commun. ACM, vol. 18, no. 6, pp. 333340,
1975.
[35] D. Pao, W. Lin, and B. Liu, A memory-efficient pipelined imple- Peng Chen received the BS degree in computer
mentation of the aho-corasick string-matching algorithm, ACM science from the University of Science and Tech-
Trans. Archit. Code Optim., vol. 7, no. 2, pp. 127, 2010. nology of China in 2010. He is currently working
[36] C. Wang, X. Li, X. Zhou, J. Martin, and R. C. C. Cheung, Genome toward the PhD degree in the Embedded System
sequencing using mapreduce on FPGA with multiple hardware Lab in Suzhou Institute of University of Science
accelerators, in Proc. ACM/SIGDA Int. Symp. Field Programmable and Technology of China, Suzhou, China. His
Gate Arrays, 2013, p. 266. research interests include multiprocessor system
[37] C. Wang, X. Li, X. Zhou, Y. Chen, and R. C. C. Cheung, Big data on chip, reconfigurable computing, and big data
genome sequencing on Zynq based clusters in 2014, in Proc. oriented heterogeneous platforms. He is a stu-
ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2014, dent member of the IEEE and CCF.
p. 247.
[38] P. Chen, C. Wang, X. Li, and X. Zhou, Accelerating the next gen-
eration long read mapping with the FPGA-based system, IEEE/ Aili Wang is a lecture in the School of Software
ACM Trans. Comput. Biol. Bioinformat., vol. PP, no. 99, p. 1, 2014. Engineering, University of Science and Technol-
[39] P. Chen, C. Wang, X. Li, and X. Zhou, Hardware acceleration for ogy of China. She serves as the guest editor of
the banded Smith-Waterman algorithm with the cycled systolic Applied Soft Computing, and International Jour-
array, in Proc. Int. Conf. Field-Programmable Technol., 2013, pp. 480 nal of Parallel Programming. She is also a
481. reviewer for the International Journal of Electron-
[40] D. Dai, X. Li, C. Wang, and X. Zhou, Cloud based short read ics. She has published more than 10 International
mapping service, in Proc. IEEE Int. Conf. Cluster Comput., 2012, journal and conference articles in the areas of
pp. 601604. software engineering, operating systems, and
[41] Apache. HBase - Apache HBase Home. (2014). [Online]. Avail- distributed computing systems.
able: http://hbase.apache.org/
[42] Apache. Hadoop. (2014). [Online]. Available: http://hadoop. Xuehai Zhou is the executive dean in School of
apache.org/ Software Engineering, University of Science and
[43] HDFS Architecture Guide. (2014). [Online]. Available: http:// Technology of China, and a professor in the
hadoop.apache.org/docs/r1.2.1/hdfs_design.html School of Computer Science. He serves as a
[44] H. Li and R. Durbin, Fast and accurate long-read alignment with general secretary of steering committee of Com-
Burrows-Wheeler transform, Bioinformatics, vol. 26, no. 5, puter College Fundamental Lessons, and techni-
pp. 589595, 2010. cal committee of Open Systems, China
[45] H. Li, J. Ruan, and R. Durbin, Mapping short DNA sequencing Computer Federation. He has lead many national
reads and calling variants using mapping quality scores, Genome 863 projects and NSFC projects. He has
Res., vol. 18, no. 11, pp. 18511858, 2008. published more 100 International journal and
[46] B. Harris, A. C. Jacob, J. M. Lancaster, J. Buhler, and R. D. Cham- conference articles in the areas of software
berlain, A banded smith-waterman FPGA accelerator for mer- engineering, operating systems, and distributed computing systems.
cury BLASTP, in Proc. Int. Conf. Field Programmable Logic Appl., He is a member of the IEEE and ACM, and also a senior member of
2007, pp. 765769. the CCF.
[47] X. Guo, H. Wang, and V. Devabhaktuni, A systolic array-based
FPGA parallel architecture for the BLAST algorithm, ISRN Bioin-
format., vol. 2012, p. 11, Article ID 195658, 2012.
Hong Yu received the BS degree from Tsinghua
Chao Wang (M11) received the BS and PhD University in 2006 and the PhD degree in bioin-
degrees from the University of Science and Tech- formatics from the Institute of Genetics and
nology of China, in 2006 and 2011, respectively, Developmental Biology, Chinese Academy of
both in computer science. He is an associate Sciences, in 2011. He is an assistant professor
professor with Embedded System Lab in the Suz- with the National Center for Plant Gene
hou Institute of University of Science and Tech- Research, Institute of Genetics and Developmen-
nology of China, Suzhou, China. His research tal Biology, Chinese Academy of Sciences. He
interests include big data and reconfigurable has published more than 10 International journal
computing. He is now an editor board member of and conference articles in the research areas of
IET CDT MICPRO, IJHPSA. He also serves as bioinformatics on scientific journals, including
the publicity cochair of the ISPA 2014 and nature, genome research, bioinformatics, and BMC systems biology.
HiPEAC 2015. He is a member of the IEEE, ACM, and CCF.
" For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.