
Automatic Generation of Efficient Accelerators for Reconfigurable Hardware

David Koeplinger, Stanford University, dkoeplin@stanford.edu
Raghu Prabhakar, Stanford University, raghup17@stanford.edu
Yaqi Zhang, Stanford University, yaqiz@stanford.edu
Christina Delimitrou, Stanford University and Cornell University, cdel@stanford.edu
Christos Kozyrakis, Stanford University and EPFL, kozyraki@stanford.edu
Kunle Olukotun, Stanford University, kunle@stanford.edu

Abstract—Acceleration in the form of customized datapaths offers large performance and energy improvements over general purpose processors. Reconfigurable fabrics such as FPGAs are gaining popularity for use in implementing application-specific accelerators, thereby increasing the importance of having good high-level FPGA design tools. However, current tools for targeting FPGAs offer inadequate support for high-level programming, resource estimation, and rapid and automatic design space exploration.

We describe a design framework that addresses these challenges. We introduce a new representation of hardware using parameterized templates that captures locality and parallelism information at multiple levels of nesting. This representation is designed to be automatically generated from high-level languages based on parallel patterns. We describe a hybrid area estimation technique which uses template-level models and design-level artificial neural networks to account for effects from hardware place-and-route tools, including routing overheads, register and block RAM duplication, and LUT packing. Our runtime estimation accounts for off-chip memory accesses. We use our estimation capabilities to rapidly explore a large space of designs across tile sizes, parallelization factors, and optional coarse-grained pipelining, all at multiple loop levels. We show that estimates average 4.8% error for logic resources, 6.1% error for runtimes, and are 279 to 6533 times faster than a commercial high-level synthesis tool. We compare the best-performing designs to optimized CPU code running on a server-grade 6 core processor and show speedups of up to 16.7×.

I. INTRODUCTION

Over the past few years, the computing landscape has seen a paradigm shift towards specialized architectures [1, 2, 3, 4]. Customized accelerators, implemented as application specific integrated circuits (ASICs), efficiently perform key kernel computations within larger applications to achieve orders of magnitude improvements in performance and energy efficiency compared to programmable processors [5]. However, such improvements typically require sacrificing flexibility. Once fabricated, an ASIC's custom datapath can no longer be modified to meet new requirements. ASICs also have high non-recurring engineering (NRE) costs associated with manufacturing.

Reconfigurable fabrics such as field-programmable gate arrays (FPGAs) offer a promising alternative to ASIC-based accelerators due to their reconfigurability and customizability, even if these benefits come at a price [6]. FPGAs are increasingly gaining traction in industry as mainstream accelerators. Microsoft [7] and Baidu [8] have successfully deployed FPGA-based accelerators in a commercial setting to accelerate web search and deep neural networks. Intel [9] is actively working on integrating an FPGA with a processor on a heterogeneous motherboard. The acquisition of Altera by Intel, and new startups [10] working on using FPGAs as accelerators for datacenters, suggest that future systems will incorporate reconfigurable logic into their design. As a result, FPGAs will play a crucial role in the space of customizable accelerators over the next few years. This places a greater importance on FPGA programmability and automated tools to generate efficient designs that maximize application performance on a given reconfigurable substrate.

Designing an efficient accelerator architecture involves balancing compute with on-chip and off-chip memory bandwidth requirements to avoid resource bottlenecks. This is irrespective of whether the accelerator is implemented as an ASIC or on an FPGA. The process involves navigating a large multi-dimensional design space with application-level, architectural, and microarchitectural parameters. The optimal set of parameters depends on the inherent parallelism and data locality in the application, as well as the available hardware resources. The design space for FPGAs is further complicated by the heterogeneous nature of available FPGA resources, which include units such as look-up tables (LUTs), block RAMs (BRAMs), flip flops (FFs) and digital signal processing units (DSPs). As a result, hardware accelerator design is an inherently iterative process which involves exploring a large design space for even moderately complex accelerators. Exhaustive or manual exploration of this space would be impractical for all but the simplest of designs, suggesting that efficient FPGA accelerator design requires support from high-level tools for rapid modeling and design space exploration.
Unfortunately, current FPGA design tools support relatively primitive programming interfaces that require extensive manual effort [11]. Designing an FPGA accelerator typically involves an architecture description in RTL or a C-based high-level language coupled with ad-hoc tools to explore the design space, potentially involving multiple logic synthesis runs. The long turn-around times of logic synthesis tools, which are on the order of several hours per design, make it infeasible to include them in the iterative design process. High-level synthesis tools [12, 13, 14] raise the level of abstraction and provide some support for pre-place-and-route analysis. However, high-level synthesis tools do not capture many important design points, such as coarse-grained pipelining and nested parallelism, and generally use simple memory models that do not involve modeling off-chip memory accesses [15].

This paper presents a practical framework for automatic generation of efficient FPGA accelerators. Figure 1 describes our overall system architecture. The input to our system is an application described in a high-level language using high-level parallel patterns like map, reduce, filter, and groupBy [16]. Parallel patterns serve the dual purpose of raising the level of abstraction for the programmer [17, 19] and providing richer semantic information to the compiler [20]. These constructs are then automatically lowered into our hardware definition language (HDL), which explicitly captures information on parallelism, locality, and memory access pattern at all levels of nesting using parameterizable architectural templates (Step 1 in Figure 1). Step 1 performs high-level optimizations like loop fusion and tiling transformations. The output of Step 1 is a tiled representation of the input design expressed in our HDL. Note that tiling here includes both loop and data tiling.

We characterize each template and construct resource models to provide quick and accurate estimates of cycle count and FPGA resource utilization for a given set of design parameters, including modeling off-chip memory transfers. The estimators guide a design space exploration phase (Steps 2–4 in Figure 1) which navigates a large space to produce a set of optimized parameters. Finally, we integrate hardware generation into the design flow so that optimized designs can be automatically generated, synthesized and run on a real FPGA (Steps 5–7 in Figure 1). We synthesize hardware by automatically generating MaxJ, a low-level Java-based hardware generation language from Maxeler Technologies [21]. Step 1 in Figure 1 has been described in previous work [22]. Steps 2–7 are the focus of this paper.

We make the following contributions in this paper:
• We define an intermediate representation called the Delite Hardware Definition Language, or DHDL. DHDL defines a set of parameterizable architectural templates to describe hardware. Templates capture specific types of memory accesses and parallel compute patterns, and can be composed to naturally express parallelism at multiple levels in the design. The templates are designed such that applications expressed using high-level parallel patterns can be mapped to these templates in a straightforward way.
• We provide quick estimates of cycle count and FPGA area usage for designs expressed in DHDL. Estimates take into account available off-chip memory bandwidth and on-chip resources for datapath and routing, as well as effects from low-level optimizations like LUT packing and logic duplication.
• We study the space of designs described by tiling sizes, parallelization factors, and coarse-grained pipelining. This space is larger than previous work because we study more design dimensions than what is possible using state-of-the-art HLS tools. This in turn allows our system to find better design points in terms of performance and performance-per-area than previously possible.
• We evaluate the quality of our estimators and generated designs on a variety of compute-bound and memory-bound benchmarks from the machine learning and data analytics domains. We show that our runtime estimates are within 6.1% and area estimates are within 4.8% of post place-and-route reports provided by FPGA vendor tools, making this a useful tool for design space exploration. We evaluate the performance of the generated designs compared to optimized multi-core CPU implementations running on a server-grade processor and achieve speedups of up to 16.7×.

The remainder of this paper is structured as follows. Section II outlines the requirements of good automated FPGA design tools and reviews related work in this domain. Section III describes the DHDL language and provides insights into how this representation enables larger design space exploration and accurate estimation. Section IV describes our modeling methodology for DHDL templates. Section V discusses the evaluation of our approach for absolute accuracy, design exploration efficiency, and performance of generated designs compared to a multi-core processor.

II. BACKGROUND AND RELATED WORK

A primary requirement for good accelerator design tools is the ability to capture and represent design points along all important dimensions. Specifically, design tools must be able to capture application-level parameters (e.g., input sizes, bitwidth, data layout), architectural parameters (parallelism factors, buffer sizes, banking factors, pipelining levels, off-chip memory streams) and microarchitectural parameters (e.g., on-chip memory word width). Having a representation rich in parallelism information allows for more accurate estimations, thorough design space exploration, and efficient code generation.
[Figure 1. System Overview. Parallel patterns are lowered through (1) high-level optimizations into DHDL; (2) the DHDL compiler, (3) the estimator, and (4) the design space exploration (DSE) pass exchange parameter lists and estimates to produce a transformed DHDL design; (5) code generation emits MaxJ, followed by (6) bitstream generation and (7) FPGA configuration.]

In addition to application characteristics, both heterogeneity within FPGAs and low-level optimizations done by logic synthesis tools have a significant impact on required design resources. FPGA resource utilization does not just depend on the compute and memory operations in a given design; a non-trivial amount of resources is typically used to establish static routing connections to move data between two points, often rendering them unavailable for "real" operations. In addition, low-level logic synthesis tools often perform optimizations like LUT-packing or logic duplication for signal fanout reduction that alter resource usage. Off-chip memory communication requires FPGA resources to implement various queues and control logic. Such effects from low-level tools must be factored into the design tools to provide accurate estimates of design resource requirements.

A good FPGA design tool should have the following features:
• Representation: The tool must internally represent hardware using a general and parameterizable representation. This representation must preserve information regarding data locality, memory access pattern and parallelism in the input at all levels of nesting. Such a representation must be target-agnostic and should be targetable from high-level language constructs.
• Estimation: The tool must quickly analyze a design in the above representation and estimate metrics such as cycle counts and FPGA resource requirements for a target FPGA.
• DSE: The tool must be able to leverage the estimators to prune the large design search space, walk the space of designs, and find the Pareto-optimal surface.
• Generation: The tool must be able to automatically generate hardware which can then be synthesized and run on the target FPGA. Without this feature, hardware would typically be generated using separate toolchains for estimation and generation, which makes accurate estimation much harder.

Previous work on generating FPGA accelerators has focused on various aspects of the points mentioned above. Here we provide an overview of this work.

High-level synthesis (HLS) tools such as LegUp [13] and Vivado HLS [23] (previously AutoPilot [12]) synthesize hardware from C. These tools provide estimates of the cycle count, area and power consumption along with hardware generation. However, imperative design descriptions place a greater burden on the compiler to discover parallelism, pipeline structure and memory access patterns. The absence of explicit parallelism often leads to conservative compiler analyses producing sub-optimal designs. While some tools allow users to provide compiler hints in the form of directives or pragmas in the source code, this approach fails to capture key points in the design space. For example, consider Figure 2, which shows the Gaussian discriminant analysis (GDA) kernel.

L1:  for (int i=0; i<R; i++) {
       #pragma HLS PIPELINE II=1
L11:   for (int j=0; j<C; j++) {
         sub[j] = y[i] ? x[i][j]-mu0[j] : x[i][j]-mu1[j];
       }
L121:  for (int j1=0; j1<C; j1++) {
L122:    for (int j2=0; j2<C; j2++) {
           sigma[j1][j2] += sub[j1]*sub[j2];
         }
       }
     }

Figure 2. GDA for high-level synthesis.

All loops in this kernel are parallel loops. One set of valid design points would be to implement L1 as a coarse-grained pipeline with L11 and L121 as its stages. Commercial HLS tools support limited coarse-grained pipelining, but with several restrictions. For example, the DATAFLOW directive in Vivado HLS enables users to describe coarse-grained pipelines. However, the directive does not support arbitrarily nested coarse-grained pipelines, multiple producers and consumers between stages, or coarse-grained pipelining within a finite loop scope [24], as required for the outer loop in Figure 2. In addition, compile times for HLS can be long for large designs due to the complications that arise during scheduling. Previous studies [15] point out other similar issues. Such limitations restrict the capability of HLS tools to explore more complex design spaces.
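To make the desired design point concrete, the sketch below expresses loop L1 as a coarse-grained pipeline whose two stages are L11 and the L121/L122 loop nest, in the style of the DHDL templates introduced in Section III and Figure 4. This is purely illustrative: the buffer names (xT, yT, mu0T, mu1T, sigmaT, subT) are assumed on-chip memories, and the syntax is extrapolated from Figure 4 rather than any HLS tool.

  // Illustrative sketch only; see Section III for the actual DHDL constructs.
  MetaPipe(R by 1) { i =>
    val subT = BRAM[Float](C)
    Pipe(C by 1) { j =>                  // stage 1: loop L11
      subT(j) = xT(i,j) - (yT(i) ? mu1T(j) :: mu0T(j))
    }
    Pipe(C by 1, C by 1) { (j1,j2) =>    // stage 2: loops L121/L122
      sigmaT(j1,j2) = sigmaT(j1,j2) + subT(j1) * subT(j2)
    }
  }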
Pouchet et al. [25] explore combining HLS with polyhedral analysis to optimize input designs for locality and use estimates from HLS tools to drive design space exploration. While this captures a larger design space than previous work by including tile sizes, the approach is limited to the capabilities of the HLS tools and to benchmarks that have strictly affine data accesses. This paper improves upon previous work by modeling tiling parameters in addition to other design points like coarse-grained pipelining of imperfectly nested loops, which is not supported by HLS tools, as well as data-dependent accesses, which are not supported by polyhedral analysis. Chen et al. [26] describe a simultaneous resource allocation and binding algorithm and perform design space exploration using a high-level power estimator. They characterize area usage of primitives and fit linear models to derive estimation functions. However, this study does not consider higher level design parameters or nested parallelism as part of the design space. We perform characterization of primitive operations as well as other coarse-grained templates, which enables us to estimate resource usage for much more complex accelerators. CMOST [18] is a C-to-FPGA framework that uses task-level modeling to exploit multi-level parallelism. While CMOST uses simple analytical models, this paper uses a mixture of analytical and machine learning models that enables much more fine-grained and accurate estimation of FPGA resource utilization.

Aladdin [15] is a pre-RTL estimation tool for ASIC accelerator design. Aladdin uses a dynamic data dependence graph (DDDG) as input and estimates the latency, area, and power of a variety of designs. However, using a DDDG limits the tool's ability to discover nested parallelism and infer coarse-grained pipeline structures that require double buffering, especially with complex memory accesses in patterns like filters or groupBys. Also, Aladdin is focused on ASIC designs while our work focuses on FPGA accelerators, which have a different set of challenges, as outlined above.

Other related work [27, 28, 29, 30, 31] explores various ideas, from analytical to empirical models, for estimating latency and area of designs in high-level languages. However, these approaches do not consider complex applications with nested parallelism. Also, previous work either ignores memory or has a relatively simple model for memory. This paper handles both on-chip and off-chip memory accesses with varying, data-dependent memory access patterns.

III. DHDL

In this section, we describe the Delite Hardware Definition Language, or DHDL. DHDL is an intermediate language for describing hardware datapaths. A DHDL program describes a dataflow graph consisting of various kinds of nodes connected to each other by data dependencies. Each node in a DHDL program corresponds to one of the supported architectural templates listed in Table I. DHDL is represented in-memory as a parameterized, hierarchical dataflow graph. DHDL is a good hardware representation to aid in design space exploration for the following reasons:
• Templates in DHDL capture parallelism, locality, and access pattern information at multiple levels. This dramatically simplifies coarse-grained pipelining and enables us to explicitly capture and represent a large space of designs which other tools cannot capture, as shown in Figure 2.
• Every template is parameterized. A specific hardware design point is instantiated from a DHDL description by instantiating all the templates in the design with concrete parameter values passed to the program. DHDL heavily uses metaprogramming, so these values are passed in as arguments to the DHDL program. The generated design instance is represented internally as a graph that can be analyzed to provide estimates of metrics such as area and cycle count. The parameters used to create the design instance can be automatically generated by a design space exploration tool.
DHDL is implemented as an embedded domain-specific language in Scala, thereby leveraging Scala's language features and type system.

A. Generating DHDL from parallel patterns

A fundamental design goal of DHDL is that it should be automatically generated from high-level languages which express computation using parallel patterns such as map, reduce, zip, groupBy and filter. Previous work has shown that parallel patterns can be used to improve both productivity and performance by using patterns as a basis for high-level languages [16, 19] and sophisticated compiler frameworks [20]. The templates in DHDL are inspired by these well-known parallel patterns. This makes it possible to define explicit rules to generate DHDL for each parallel pattern mentioned above. Previous work [22] has proposed compilation techniques to automatically generate hardware designs from parallel patterns using a similar template-based approach. We use these techniques to automatically generate DHDL from parallel patterns.
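As an illustration of this mapping (not taken from the paper), the sketch below lowers a simple map pattern, out = x.map(e => a * e), onto the templates of Table I using the DHDL syntax of Figure 4. The tile size and the Pipe parallelization factor are left as parameters for design space exploration; the accumulator-free MetaPipe form and the names N, a, and tileSize are assumptions made for illustration.

  // Illustrative lowering sketch; syntax follows Figure 4, parameters are placeholders.
  val x   = OffChipMem[Float](N)
  val out = OffChipMem[Float](N)
  Sequential {
    MetaPipe(N by tileSize) { i =>       // one tile per iteration; stages overlap across iterations
      val xT   = BRAM[Float](tileSize)
      val outT = BRAM[Float](tileSize)
      xT := x(i::i+tileSize)             // TileLd: load one input tile
      Pipe(tileSize by 1) { j =>         // pipelined (and replicable) map body
        outT(j) = a * xT(j)
      }
      out(i::i+tileSize) := outT         // TileSt: store the result tile
    }
  }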
Group | Template | Description | Design Parameters
Primitive Nodes | +, -, *, /, <, >, mux | Basic arithmetic, logic, and control operations | Vector width, Type
Primitive Nodes | Ld, St | Load and store from on-chip memory | Vector width, Bank stride
Memories | OffChipMem | N-dimensional off-chip memory array | Dimensions, Type
Memories | BRAM | On-chip scratchpad memory | Dimensions, Word width, Double buffering, Vector width, Banks, Interleaving scheme, Type
Memories | Priority Queue | Hardware sorting queue | Double buffering, Depth, Type
Memories | Reg | Non-pipeline register | Double buffering, Vector width
Controllers | Counter | Counter chain used to produce loop iterators | Vector width
Controllers | Pipe | Hardware pipeline of primitive operations; typically used to represent bodies of innermost loops | Parallelization factor, Pattern
Controllers | Sequential | Non-pipelined, sequential execution of multiple stages | Parallelization factor, Pattern
Controllers | Parallel | Fork-join style parallel container with synchronizing barrier | (none)
Controllers | MetaPipe | Coarse-grained pipeline with asynchronous handshaking signals across stages | Parallelization factor, Pattern
Memory Command Generators | TileLd | Load a tile of data from an off-chip array | Tile dimensions, Word width, Parallelization factor
Memory Command Generators | TileSt | Store a tile of data to an off-chip array | Tile dimensions, Word width, Parallelization factor

Table I. Description of templates in DHDL and supported parameters for each template.

B. Language constructs

A hardware datapath is described in DHDL using various nodes connected to each other by their data dependencies. DHDL also supports variable bit-width fixed-point types, variable precision floating point types, and associated type checking. Every node that either produces or stores data has an associated type. Table I describes the hardware templates and associated parameters supported in DHDL. There are four types of nodes:

1) Primitive Nodes: Primitive nodes correspond to basic operations, such as arithmetic and logic tasks, and multiplexers. Some complex multi-cycle operations such as abs, sqrt and log are also supported as primitive nodes. Every primitive node represents a vector computation; a "vector width" parameter defines the number of parallel instances of each node. Scalar operations are thus special cases where the associated vector width is 1.

2) Memories: DHDL distinguishes between on-chip buffers and off-chip memory regions by representing them explicitly using separate nodes. This is used to capture on-chip and off-chip accesses, which have different access times and resource requirements. OffChipMem represents an N-dimensional region of memory stored in off-chip DRAM. BRAM, Priority Queue and Reg correspond to different types of on-chip buffers specialized for different kinds of computation. OffChipMems are accessed using nodes called memory command generators, while on-chip buffers are accessed using primitive Ld (load) and St (store) nodes. The banking factor for a BRAM node is automatically calculated using the vector widths and access patterns of all the Ld and St nodes accessing it such that the required memory bandwidth can be met.

3) Controllers: Several controller templates are supported in DHDL to capture imperfectly nested loops and parallelism at multiple nesting levels. Parallel patterns in input designs are represented using one of the Pipe, MetaPipe, or Sequential controllers with an associated Counter node. Each of these controllers is associated with a parallelization factor and the parallel pattern from which it was generated, which is used when replicating nodes for parallelization. For example, nodes associated with the map pattern are replicated and connected in parallel, whereas nodes associated with the reduce pattern are replicated and connected as a balanced tree. Pipe is a dataflow pipeline which consists purely of primitive nodes. It typically represents the innermost bodies of parallel loops that are traditionally converted to pipelines using software pipelining techniques. MetaPipe represents a coarse-grained pipeline in which each stage is another controller node. MetaPipe orchestrates the execution of its stages in a pipelined fashion using asynchronous handshaking signals, and is thereby able to tolerate variations in the execution times of each stage. Communication buffers used in between stages are converted to double buffers. Sequential represents unpipelined execution of a chain of controller nodes. Parallel is a container that executes multiple controller nodes in parallel with an implicit barrier at the end of execution. Counter is a simple chain of counters required to generate loop iterators. Counter has an associated vector width so that multiple successive iterators can be produced in parallel. This vector width is typically equal to the parallelization factor of the Pipe, MetaPipe, or Sequential it is associated with.

4) Memory Command Generators: OffChipMems in DHDL are accessed at the granularity of tiles, where a tile is a regular N-dimensional region of memory. Previous work [25, 22] has shown the importance of tiling transformations to maximize locality and generate efficient hardware. Accesses to OffChipMems are explicitly captured in DHDL using special TileLd (tile load) and TileSt (tile store) controllers. Each TileLd and TileSt node instantiates data and command queues to interface with the memory controller, and contains control logic to generate memory commands.

C. Code Example: GDA in DHDL

Figure 4 shows GDA written in DHDL, complete with off-chip memory transfers. The hardware described is depicted pictorially in Figure 3. Note that the design captures nested parallelism with two levels of MetaPipes whose stages are separated by double buffers. Each bubble in Figure 3 denotes parameters that apply to the template it points to. Some of the parameters, like the number of banks for a BRAM, are omitted as they are automatically inferred based on parallelization factors.
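As an illustration of this inference, one simple policy consistent with the description in Section III-B (an assumption made for illustration, not the paper's exact rule) is to give each BRAM as many banks as the widest vector access touching it, so that all lanes of a vectorized Ld or St can proceed in the same cycle:

  // Assumed banking policy, for illustration only.
  def inferBanks(accessVectorWidths: Seq[Int]): Int =
    if (accessVectorWidths.isEmpty) 1 else accessVectorWidths.max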
[Figure 3: block diagram of the parameterized GDA design. TileLd units load mu0, mu1, y, and x into on-chip buffers mu0T, mu1T, yT, and xT; Pipe P1 produces subT, Pipe P2 accumulates sigT, and a TileSt unit writes sigma back off chip, all nested inside MetaPipe C2 and MetaPipe M1. Parallelism factors: M1Par, M2Par, P1Par, P2Par. Tile sizes: muSize, inTileSize. MetaPipe toggles: M1toggle, M2toggle.]

Figure 3. Parameterized GDA design described in Figure 4. Bubbles denote parameters that apply to the template each points to. Note that some parameters (e.g. M1toggle) apply to more than one template (e.g. M1, xT, and yT) but have not been shown for clarity.
val x     = OffChipMem[Float](R, C)
val y     = OffChipMem[Bit](R)
val mu0   = OffChipMem[Float](C)
val mu1   = OffChipMem[Float](C)
val sigma = OffChipMem[Float](C, C)

Sequential {
  val mu0T = BRAM[Float](muSize)
  val mu1T = BRAM[Float](muSize)
  Parallel {
    mu0T := mu0(0::muSize) // Load mu0
    mu1T := mu1(0::muSize) // Load mu1
  }

  val sigT = BRAM[Float](muSize, muSize)
  MetaPipe(rows by inTileSize, sigT) { r =>
    val yT = BRAM[Bit](inTileSize)
    val xT = BRAM[Float](inTileSize, muSize)
    Parallel {
      // Load one tile of x and y
      xT := x(r::r+inTileSize, 0::muSize)
      yT := y(r::r+inTileSize)
    }

    val sigmaBlk = BRAM[Float](muSize, muSize)
    MetaPipe(inTileSize by 1, sigmaBlk) { rr =>
      val subT      = BRAM[Float](muSize)
      val sigmaTile = BRAM[Float](muSize, muSize)
      Pipe(muSize by 1) { cc =>
        // Subtract the class mean selected by the label
        val sub = yT(rr) ? mu1T(cc) :: mu0T(cc)
        subT(cc) = xT(rr,cc) - sub
      }
      Pipe(muSize by 1, muSize by 1) { (ii,jj) =>
        // Outer product of the mean-subtracted row
        sigmaTile(ii,jj) = subT(ii) * subT(jj)
      }
      sigmaTile
    }{_+_}
    sigmaBlk
  }{_+_}

  // Store the accumulated result
  sigma(0::muSize, 0::muSize) := sigT
}

Figure 4. GDA in DHDL.

The design is parameterized using three kinds of parameters: parallelism factors controlling the number of parallel iterations, tile sizes corresponding to on-chip buffer sizes, and MetaPipe toggles which control whether an outer loop should be implemented as a Sequential or a MetaPipe. The MetaPipe toggle parameters also control whether the buffers internal to the MetaPipe should be double-buffered. Note that by supplying different parameters, different design points implementing GDA can be automatically generated from the same DHDL source code. In comparison to the design in Figure 3, the high-level synthesis specification in Figure 2 cannot capture the design points where either M1toggle or M2toggle is set to true. Also, it is challenging to generate multiple design points using the input in Figure 2 without extensively modifying the source code. We explore these design spaces with various parallelism factors, tile sizes, and toggles in detail in Section V. We generate parameters for the design in Figure 3 automatically using a design space exploration tool, and each proposed design is analyzed to estimate FPGA resource utilization and cycle counts.

IV. MODELING AND ESTIMATION

In this section, we describe our modeling methodology. Our models account for the various design parameters for each DHDL template, as listed in Table I, as well as optimizations done by low-level logic synthesis tools, in order to accurately estimate resource usage.

A. Modeling Considerations

The resource requirements of a given application implemented on an FPGA depend both on the target device and on the toolchain. Heterogeneity in the FPGA fabric, use of FPGA resources for routing, and other low-level optimizations performed by logic synthesis tools often have a significant impact on the total resource consumption of a design. Since these factors reflect the physical layout of computation on the device after placement and routing, they are not captured directly in the application's dataflow graph. We identify and account for the following factors:

LUT and register packing: Basic compute units in FPGAs are typically composed of a lookup table (LUT) and a small number of single-bit multiplexers, registers, and full adders. Modern FPGA LUTs support up to 8-input binary functions but are often implemented using a pair of smaller LUTs [32, 33].
When these LUTs can be configured and used independently, vendor placement tools attempt to "pack" multiple small functions into a single 8-input unit. LUT packing can have a significant impact on design resource requirements. In our experiments, we are able to pack about 80% of the functions in each design in pairs, decreasing the number of used LUTs by about 40%.

Routing Resources: Logic synthesis tools require a significant amount of resources to establish static routing connections between two design points (e.g., a multiplier and a block RAM) which fit in the path's clock period. While FPGAs have dedicated routing resources, logic synthesis tools may have the option to use LUTs for routing. These LUTs may then be unavailable for "real" compute. In our designs, "route-through" LUTs typically account for about 10% of the total number of used LUTs.

Logic duplication: Logic synthesis tools often duplicate resources such as block RAMs and registers to avoid routing congestion and to decrease fanout. While duplicated registers typically encompass around 5% of the total number of registers required in our designs, we found that block RAM duplication can increase RAM utilization by 10 to 100%, depending on the complexity of the design.

Unavailable resources: FPGA resources are typically organized in a hierarchy, such as Altera's Logic Array Block structure (10 LUTs) and Xilinx's Slice structure (4 LUTs). Such organizations impose mapping constraints which can lead to resources that are rendered unusable. In our experiments, the number of unusable LUTs made up only about 4% of the design's total LUT usage.

B. Methodology

In order to model the runtime and resource requirements of DHDL designs, we first need an estimate of the area requirements and propagation delay of every DHDL template. Area requirements include the number of digital signal processing units (DSPs), device block RAMs, LUTs, and registers that each template requires. To facilitate LUT packing estimation, we split template LUT resource requirements into the number of "packable" and "unpackable" LUTs required. We obtain characterization data by synthesizing multiple instances of each template instantiated for combinations of its parameters as given in Table I. Using this data, we create analytical models of each DHDL template's resource requirements and cycle counts for a predefined fabric clock. The area and cycle count of controller templates are modeled as functions of the latencies of the nodes contained within them. The total cycle count of a MetaPipe, for example, is modeled using the recursive function

  cycles(MetaPipe) = (N - 1) * max{ cycles(n) | n in nodes } + sum over n in nodes of cycles(n)

where N is the number of iterations of the MetaPipe and nodes is the set of nodes contained in the MetaPipe.
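As a hypothetical worked example of this model: a MetaPipe with N = 16 iterations and three stages whose bodies take 100, 60, and 40 cycles would be estimated at (16 - 1) * 100 + (100 + 60 + 40) = 1700 cycles, compared to 16 * 200 = 3200 cycles if the same stages executed as an unpipelined Sequential.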
Most templates require about six synthesized designs to characterize their resource and area usage as a function of their parameters. Note that these models include estimates of off-chip memory access latency as a function of the number and length of memory commands, as well as contention due to competing accessors. Since template models are application-independent, each needs to be characterized only once for a given target device and logic synthesis toolchain. The synthesis times required to model templates can therefore be amortized over many applications.

Using these models, we run a pair of analysis passes over the application's DHDL intermediate representation to estimate design cycle counts and area requirements.

1) Cycle Count Estimation: In the first analysis pass, we estimate the total runtime of the design on the FPGA. Since the DHDL intermediate representation is hierarchical in nature, this pass is done recursively. The total runtime of MetaPipe and Sequential nodes is calculated by first determining the runtime of all controller nodes contained within them. The total propagation delay of a single iteration of a Pipe is the length of the body's critical path, calculated using a depth first search of the body's subgraph and the propagation delays of all primitive nodes within the graph. Input dataset sizes, given as user annotations in the high-level program, are used by the analysis pass along with tiling factors to determine the iteration counts for each controller template. Iteration counts are then used to calculate the total runtime of the respective controller nodes.

2) Area Estimation: Since the FPGA resource utilization of a design is sensitive to factors that are not directly captured in the design's dataflow graph, we adopt a hybrid approach in our area analysis.

We first estimate the area of the DHDL design by counting the resource requirements of each node using their pre-characterized area models. In Pipe bodies, we also estimate the resources required for delaying signals. This is done by recursively calculating the propagation delay of every path to each node using depth first search. Paths with slack relative to the critical path to a node require delay resources proportional to their width (in bits) multiplied by the slack. Delays over a synthesis tool-specific threshold are modeled as block RAMs; otherwise, they are modeled as registers. Note that this estimation assumes ASAP scheduling.

We model LUT routing usage, register duplication, and unavailable LUTs using a set of small artificial neural networks implemented using the Encog machine learning library [34]. Each network has three fully connected layers with eleven input nodes, six hidden layer nodes, and a single output node. We chose three-layer neural networks as they have been proven capable of fitting a wide range of function classes with arbitrary precision, including polynomial functions of any order [35]. One network is trained for each factor on a common set of 200 design samples with varying levels of resource usage to give a representative sampling of the space.
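A minimal sketch of the kind of three-layer network used here (11 inputs, 6 hidden nodes, 1 output), assuming already-trained weights are available: the paper uses the Encog library, and the sigmoid hidden activation and linear output below are assumptions made for illustration.

  object NetSketch {
    private def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))
    private def dot(w: Seq[Double], x: Seq[Double]): Double =
      w.zip(x).map { case (a, b) => a * b }.sum

    /** Forward pass of an 11-6-1 fully connected network, e.g. predicting
      * routing LUTs from eleven raw resource counts of a design. */
    def predict(inputs: Seq[Double],                             // 11 raw resource counts
                hiddenW: Seq[Seq[Double]], hiddenB: Seq[Double], // 6 x 11 weights, 6 biases
                outW: Seq[Double], outB: Double): Double = {
      val hidden = hiddenW.zip(hiddenB).map { case (w, b) => sigmoid(b + dot(w, inputs)) }
      outB + dot(outW, hidden)
    }
  }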
Choosing the network parameters that yield the lowest model error is typically challenging, but in our experiments we found that above four nodes in the hidden layer, the exact number of hidden layer nodes made little difference. Duplicated block RAMs are estimated as a linear function of the number of routing LUTs, as we found that this gave the best estimate of design routing complexity in practice. This linear function was fit using the same data used to train the neural networks. Like the template models, these neural networks are application-independent and only need to be trained once for a given target device and toolchain.

We use the raw resource counts as inputs to each of our neural networks to obtain global estimates for routing LUTs, duplicated registers, and unavailable LUTs. We estimate the number of duplicated block RAMs using the routing LUTs. These estimates are then added to the raw resource counts to obtain a pre-packing resource estimate. For the purposes of LUT packing, we assume routing LUTs are always packable.

Lastly, we model LUT packing using the simple assumption that all packable LUTs will be packed. The target device in our experiments supports pairwise LUT packing, so we estimate the number of compute units used for logic as the number of unpackable LUTs plus the number of packable LUTs divided by two. We assume that each compute unit will use two registers on average. We model any registers unaccounted for by logic compute units as requiring compute units with two registers each. This gives us the final estimation for LUTs, DSPs, and BRAM.

C. Design space exploration

Our design space exploration tool uses the resource and cycle count estimates to explore the space of designs described by the parameters in Table I. As we are dealing with large design spaces on the order of millions of points even for small benchmarks, we prune invalid and suboptimal points in the search space using a few simple heuristics:
• Parallelization factors considered are integer divisors of the respective iteration counts. We use this pruning strategy because non-divisor factors create edge cases which require additional modulus operations. These operations can significantly increase the latency and area of address calculation, typically making them poor design parameter choices [36].
• Tile sizes considered are divisors of the dimensions of the annotated data size. Similar to parallelization factors, tile sizes with edge cases are usually suboptimal as they increase load and store area and latency with additional indexing logic.
• Automatic banking of on-chip memories eliminates the number of memory banks as an independent variable. This prunes a large set of suboptimal design points where on-chip memory bandwidth requirements do not match the amount of parallelization.
• The total size of each local memory is limited to a fixed maximum value.

These heuristics define a "legal" subspace of the total design space. In our experiments, we randomly generate estimates for up to 75,000 legal points to give a representative view of the entire design space. We immediately discard illegal points.
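The following sketch shows one way such a pruned enumeration could be implemented (illustrative only; the divisor heuristics and the 75,000-point cap follow the text above, while the parameter names, single-dimension data size, and local-memory bound are assumptions):

  // Sketch of pruned design-point enumeration; not the paper's implementation.
  object DesignSpaceSketch {
    // Integer divisors of n, used for both tile sizes and parallelization factors.
    def divisors(n: Int): Seq[Int] = (1 to n).filter(n % _ == 0)

    case class Point(tileSize: Int, innerPar: Int, outerPar: Int, metaPipe: Boolean)

    def legalPoints(dataSize: Int, maxLocalMem: Int): Seq[Point] =
      for {
        tile  <- divisors(dataSize) if tile <= maxLocalMem // tile sizes divide the data size, bounded by local memory
        inner <- divisors(tile)                            // inner parallelization divides the tile iteration count
        outer <- divisors(dataSize / tile)                 // outer parallelization divides the number of tiles
        mp    <- Seq(true, false)                          // MetaPipe vs. Sequential toggle
      } yield Point(tile, inner, outer, mp)

    def main(args: Array[String]): Unit = {
      val sample = scala.util.Random.shuffle(legalPoints(dataSize = 38400, maxLocalMem = 4096)).take(75000)
      println(s"Estimating ${sample.size} randomly sampled legal points")
    }
  }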
V. EVALUATION

We evaluate the accuracy of the estimations described in Section IV. We use our models to study the space of designs on benchmarks from the data analytics, machine learning, and financial analytics domains. We then evaluate the speed of our design space exploration against a commercial high-level synthesis tool. Finally, we evaluate the performance of our Pareto-optimal points by comparing their execution times with optimized multi-core CPU implementations on a server-grade processor.

A. Experimental Setup

Table II lists the benchmarks we use in our evaluation along with the corresponding input dataset sizes. Dotproduct, outerprod, and gemm are common linear algebra kernels. Tpchq6 is a data analytics application that streams through a collection of records and performs a reduction on records filtered by a condition. Blackscholes is a financial analytics application that implements Black-Scholes option pricing. Gda and kmeans are commonly used machine learning kernels for data classification and clustering, respectively. All benchmarks operate on single-precision floating point numbers, except in certain cases where the benchmark requires integer or boolean values as inputs. For the purposes of this paper, all benchmarks were written in DHDL by hand but are equivalent to what could be generated automatically from higher level DSLs.

Benchmark | Description | Dataset Size
dotproduct | Vector dot product | 187,200,000
outerprod | Vector outer product | 38,400 x 38,400
gemm | Tiled matrix multiplication | 1536 x 1536
tpchq6 | TPC-H Query 6 | N=18,720,000
blackscholes | Black-Scholes-Merton model | N=9,995,328
gda | Gaussian discriminant analysis | R=360,000, D=96
kmeans | k-Means clustering | #points=960,000, k=8, dim=384

Table II. Evaluation benchmarks.

We implement the DHDL compiler framework in Scala. The DHDL compiler generates hardware by emitting MaxJ, a low-level Java-based hardware generation language. Each generated design is synthesized and run on an Altera 28nm Stratix V FPGA on a Max4 MAIA board at a fabric clock frequency of 150MHz. The MAIA board interfaces with an Intel CPU via PCIe. The board has 48GB of dedicated off-chip DDR3 DRAM with a peak bandwidth of 76.8GB/s. In practice, our maximum memory bandwidth is 37.5GB/s, as our on-chip memory clock is limited to 400MHz. We leverage Maxeler's runtime to manage communication and data movement between the host CPU and the MAIA board. Execution time is measured starting from when the FPGA design is started (after input has been copied to FPGA DRAM) and stopped after the design finishes execution (before output is copied to CPU DRAM). We report execution time as the average of 20 runs to eliminate noise from design initialization time and off-chip memory latencies. The FPGA resource utilization numbers reported are from the post place-and-route report generated by Altera's logic synthesis tools.

B. Evaluation of estimator

We first evaluate the absolute accuracy of our modeling approach. We select five Pareto points generated from our design space exploration for each of our benchmarks. We then generate and synthesize hardware for each design and run it on the FPGA. We compare our area estimates to post place-and-route reports generated by Altera's toolchain. We then run the design on the FPGA and compare the estimated runtime to the observed runtime. Note that runtime includes off-chip memory accesses from the FPGA to its DRAM. Table III summarizes the errors averaged across all selected Pareto points for each benchmark.

Benchmark | ALMs | DSPs | BRAM | Runtime
dotproduct | 1.7% | 0.0% | 13.1% | 2.8%
outerprod | 4.4% | 29.7% | 12.8% | 1.3%
gemm | 12.7% | 11.4% | 17.4% | 18.4%
tpchq6 | 2.3% | 0.0% | 5.4% | 3.1%
blackscholes | 5.3% | 5.3% | 7.0% | 3.4%
gda | 5.2% | 6.2% | 8.4% | 6.7%
kmeans | 2.0% | 0.0% | 21.9% | 7.0%
Average | 4.8% | 7.5% | 12.3% | 6.1%

Table III. Average absolute error for resource usage and runtime.

Our area estimates have an average error of 4.8% for ALMs, 7.5% for DSPs, and 12.3% for BRAMs, while our runtime estimation error averages 6.1%. Our highest error occurs in predicting DSPs for outerprod, where we over-predict DSP usage by 29.7% on average. However, we found that errors above 10% for DSP usage only occur for designs which use less than 2% of the total DSPs available on the device. As our benchmarks are limited by other resources (typically ALMs or BRAM), the relative error for DSPs is more sensitive to low-level fluctuations and noise. We observe that our DSP estimates preserve the absolute ordering of resource utilization. Hence, this error does not affect the quality of the designs found during design space exploration, and it improves with increased resource utilization.

Of our estimated metrics, BRAM estimates have the highest average error over all benchmarks. These errors are primarily from block RAM duplication done by the placement and routing tool. In designing our models, we found that BRAM duplication is inherently noisy, as more complex machine learning models failed to achieve better estimates than a simple linear fit. Our linear model provides a rough estimate of design complexity and routing requirements, but it does not provide a complete picture of when and how often the synthesis tool will decide to duplicate BRAMs. However, like the DSP estimates, we find that our BRAM estimates track actual usage and preserve ordering across designs, making them usable for design space exploration and relative design comparisons.

Gemm has the highest overall error of any benchmark. We found that this is due to low-level hardware optimizations like floating point multiply-add fusion, fusion of floating point reduction trees, and BRAM coalescing, which Maxeler's compiler performs automatically and which we use heuristics to predict. Since we do not have explicit control over these optimizations, it is possible to mispredict when they will occur. The gemm benchmark is exceptionally sensitive to these errors. However, as with the other errors, we found that this error does not detract from the model's ability to guide design space exploration as long as the possibility of this error is accounted for.

C. Design space exploration

1) Pareto-optimality analysis: In this section we show the Pareto-optimal curves of each benchmark derived from our estimators. Figure 5 shows the design space scatter plots for all benchmarks in Table II. A design point is considered invalid if its requirement for at least one type of resource exceeds the maximum available amount on the target device. Pareto-optimal designs along the dimensions of execution time and ALM utilization are highlighted for each benchmark across all three resource plots. We now analyze each benchmark in detail.

Dotproduct (Figure 5 A,B,C) is a memory-bound benchmark. Peak performance is reached by balancing tile loads and computation. Inner and outer loop parallelization allows us to quickly reach close to the input bandwidth. Runtimes of designs with MetaPipes then slowly decrease as parallelization increases, once the dominant stage becomes the dot product reduction tree. In dotproduct, designs with MetaPipe consume fewer resources than those with Sequential for the same performance. Sequentials require larger tile sizes and more parallelism to match MetaPipe performance.
[Figure 5: design space scatter plots, one row per benchmark (DotProduct: panels A-C; OuterProduct: D-F; GEMM: G-I; TPCHQ6: J-L; BlackScholes: M-O; GDA: P-R; Kmeans: S-U) and one column each for ALM, DSP, and BRAM usage (% of maximum). Each plot marks invalid designs, valid designs, Pareto points, and synthesized designs.]

Figure 5. Results of design space exploration. The horizontal axis shows estimated ALM, DSP, and BRAM usage. The vertical axis shows runtime in cycles, given in log scale (base 10).
Outerprod (Figure 5 D,E,F) is both a BRAM-bound and memory-bound benchmark. For 2N inputs, the total BRAM requirement is 2N + N^2 to store the input and output tiles, meaning the BRAM requirement increases quadratically with the input tile size. The highest performing designs for outer product do not use MetaPipes to overlap loading and storing of tiles. This is because the overhead due to main memory contention from overlapping tile loads and stores turns out to be higher than the cost of executing each stage sequentially.

Gemm (Figure 5 G,H,I) contains a lot of temporal and spatial locality. From Figure 5(I), Pareto-optimal designs for gemm occupy almost all BRAM resources on the board. Intuitively, this is because good designs for gemm maximize locality by retaining large, two-dimensional chunks of data in on-chip memory.

Tpchq6 (Figure 5 J,K,L) exhibits behavior typical of memory-intensive applications. Performance reaches a maximum threshold with increased tile size because of overlapping memory access and compute.

Blackscholes (Figure 5 M,N,O) streams through multiple large arrays and performs complex floating point computations on the input data. Points along the same vertical bar in Figure 5(M) share the same inner loop parallelization factor. Increasing parallelization improves performance by increasing utilization of the available off-chip memory bandwidth. Our model suggests that increasing the inner loop parallelization would continue to scale performance until a parallelization factor of 16, around which point blackscholes would become memory bound. Because not enough compute resources are available to implement a parallelization factor of 16, blackscholes is ALM bound.

Gda (Figure 5 P,Q,R) possesses higher degrees of spatial locality. Because of this, gda exhibits compute-bound behavior, where execution time decreases steadily with increased resource utilization, as seen in Figure 5(P). The critical resource is again BRAM. This is because BRAM usage increases with parallelization due to the creation of more banks with fewer words per bank, which can cause under-utilization of the capacity of individual BRAMs.

Kmeans (Figure 5 S,T,U) is bound by the number of ALMs. The critical path in this application is the distance computation comparing an input point to each centroid. The number of floating point operations needed to keep up with main memory bandwidth is therefore proportional to K x D, where D is the number of dimensions in one point. The performance of kmeans is therefore limited by the number of ALMs on the FPGA, as not enough are available to perform all K x D operations in parallel. Like gda, kmeans is also limited by BRAMs due to under-utilization of BRAM capacity with increased banking factors.

From our experiments, we observe that capturing parallelism at multiple levels using MetaPipes enables us to generate efficient designs. In addition, effective management of on-chip BRAM resources is critical to good designs, as BRAM resources are the limiting factor for performance scaling in most of our benchmarks.

2) Speed of exploration: We compare the speed of our estimation and design space exploration with Vivado HLS [23], a commercial high-level synthesis tool from Xilinx. Our evaluation uses the GDA example in Figure 2 as input to the high-level synthesis tool, and the GDA design in Figure 3 as input to our design space exploration tool. Design parameters for the high-level synthesis tool are the unrolling factors. We also include a pipeline directive toggle for each loop in the design. For DHDL, we vary all design parameters specified in Figure 4. Speed is measured by comparing the average estimation time per point over 250 design points for each tool. In our experiments, our analysis takes 5 to 29 milliseconds per design depending on the size of the application's intermediate representation. Analysis of GDA takes 17 milliseconds per design.

Our approach | Vivado HLS restricted* | Vivado HLS full
0.017s / design | 4.75s / design | 111.06s / design
* The Vivado HLS restricted design space ignores outer loop pipelining.

Table IV. Average estimation time per design point.

Table IV shows a comparison between estimation speeds from our toolchain and Vivado HLS. The "restricted" column refers to the average time spent per design over points whose outer loop (L1 in Figure 2) is not pipelined with a pipeline directive. The "full" version refers to all design points, where 30 of the 250 points have a pipeline directive to enable outer loop pipelining. We observe the following:
• Our estimation tool is 279x faster than the "restricted" space exploration, and 6533x faster than the "full" space exploration.
• Compared to Vivado HLS, our estimation time is not sensitive to design parameter inputs. Estimation time for Vivado HLS increases dramatically when the outer loop is pipelined in GDA because the tool completely unrolls all inner loops before pipelining the outer loop. This creates a large graph that complicates scheduling. Our approach does not suffer from this limitation because we explicitly capture pipelines in parameterized templates such as Pipe and MetaPipe, thereby capturing outer loop pipelining more naturally.

D. Comparison with CPU

To evaluate the quality of the generated Pareto-optimal designs, we compare the best FPGA execution times with optimized CPU implementations of all benchmarks. CPU comparison numbers were obtained by running C++ versions of the benchmarks in Table II on a 6-core Intel 32nm Xeon E5-2630 processor clocked at 2.30GHz, with a 15MB LLC and a maximum main memory bandwidth of 42.6GB/s. Each CPU benchmark is run with 6 threads. For gemm, we compare to multi-threaded OpenBLAS [37].
The rest of the CPU implementations were generated from 20
OptiML [16], a machine learning DSL which generates high 16.73

performance, multi-threaded C++ comparable to, or better 15


than, manually optimized code. CPU execution times are
obtained by measuring the core computation kernel averaged

Speedup
10
over 10 runs. Figure 6 shows the the speedups of all our
benchmarks normalized to the execution time on the CPU. 4.55
5
Both dotproduct and outerprod are streaming, memory- 2.42
1.07 1.11 1.15
intensive benchmarks. For dotproduct, we see a speedup of 0.1
1.07×, roughly the same performance as the CPU. In outer- 0
prod, we see a speedup of 2.4×. We associate this speedup
with overhead of multithreaded setup and synchronization.
However, the CPU outerprod implementation can likely Figure 6. Normalized speedups of most performant FPGA design points
be improved further to match the FPGA’s performance. over multi-core CPU implementations.
Ultimately, we would not expect a significant difference in
performance on either of the benchmarks as the memory
bandwidth of the two architectures is roughly the same. In VI. C ONCLUSION
the case of outerprod, both architectures should be equally
In this paper, we describe a practical framework that can
capable of exploiting spatial locality as the vector sizes are
generate efficient FPGA designs automatically from a high-
far smaller than local memory sizes.
level description based on parallel patterns. We introduce
We observe a significant slowdown by about 10× for DHDL, a new parameterizable hardware definition language
gemm. By taking advantage of architecture-specific tiling that describes designs using templates such as MetaPipe
techniques at multiple memories of the memory hierarchy with which we capture a larger design space than previous
and by vectorizing floating point operations, the OpenBLAS work. We describe our hybrid area estimation technique and
implementation can achieve a total of about 89 GFLOPs. evaluate our approach extensively on various benchmarks
Our FPGA does not have enough resources to achieve from the data analytics, financial analytics and machine
that performance on single precision floating point values. learning domains. We show an average area estimation
However, larger FPGAs with more compute capacity or, error of 4.8% and average runtime estimation error of 6.1%
more recently, direct hardware support for floating point over all the benchmarks. We perform a detailed study for
operations have been shown to be capable of much higher each benchmark on the space of designs described by tile
floating point performance than this. sizes, parallelism factors, and coarse-grained pipelining and
The tpchq6 benchmark achieves a speedup of 1.11× measure their effects on the utilization of different types
in spite of having an access pattern that streams through of FPGA resources. We show that our exploration tool
multiple large arrays. This is because tpchq6 consists of is 279 to 6533 times faster than a commercial high-level
data-dependent branches which cause frequent stalls in the synthesis tool. Finally, we show that the Pareto-optimal
frontend of the processor’s pipeline. On the FPGA, such designs we discover can achieve a speedup of up to 16.7×
branches are implemented using simple multiplexers which over optimized multi-core CPU implementations running on
do not create stalls or bubbles in the dataflow pipeline. a commodity server processor.
Given the appropriate tile sizes, this shows that memory-
intensive benchmarks like tpchq6 that have branches can be ACKNOWLEDGMENTS
accelerated on an FPGA. The authors thank Maxeler Technologies for their assis-
Blackscholes achieves a speedup of 16.7×. The core compute kernel of blackscholes is amenable to deep pipelining. While the blackscholes benchmark is compute bound on the CPU [38], FPGAs can exploit higher levels of instruction-level parallelism than CPUs via deep pipelines. Our blackscholes design benefits from this pipeline parallelism.
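The kernel is essentially a long, branch-free chain of floating-point operations per option, exactly the shape that maps well onto a deep pipeline. A sketch of the per-option computation (standard Black-Scholes call pricing with the usual cumulative-normal approximation; not code from the paper):

    // Cumulative normal distribution via the Abramowitz-Stegun polynomial approximation.
    def cnd(x: Double): Double = {
      val a1 = 0.319381530; val a2 = -0.356563782; val a3 = 1.781477937
      val a4 = -1.821255978; val a5 = 1.330274429
      val l = math.abs(x)
      val k = 1.0 / (1.0 + 0.2316419 * l)
      val poly = k * (a1 + k * (a2 + k * (a3 + k * (a4 + k * a5))))
      val w = 1.0 - math.exp(-l * l / 2.0) / math.sqrt(2.0 * math.Pi) * poly
      if (x < 0) 1.0 - w else w
    }

    // One option: every operation below becomes one or more pipeline stages,
    // so a new option can enter the datapath every cycle.
    def callPrice(s: Double, k: Double, r: Double, sigma: Double, t: Double): Double = {
      val d1 = (math.log(s / k) + (r + 0.5 * sigma * sigma) * t) / (sigma * math.sqrt(t))
      val d2 = d1 - sigma * math.sqrt(t)
      s * cnd(d1) - k * math.exp(-r * t) * cnd(d2)
    }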
The gda and kmeans benchmarks achieve speedups of 4.5× and 1.15×, respectively. Both benchmarks have nested levels of parallelism, which are captured using MetaPipes. By exploiting pipeline parallelism and taking advantage of locality within these two applications, our generated designs are able to achieve a modest speedup over the multi-core CPU.
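As a concrete picture of this nesting (plain Scala, purely illustrative of the loop structure rather than the DHDL templates themselves), the kmeans assignment step has an outer parallel loop over points and an inner reduction over centroids:

    def assignAndAccumulate(points: Array[Array[Float]],
                            centroids: Array[Array[Float]]): (Array[Array[Float]], Array[Int]) = {
      val k = centroids.length
      val d = centroids(0).length
      val sums = Array.fill(k, d)(0.0f)   // per-centroid coordinate sums
      val counts = Array.fill(k)(0)       // per-centroid point counts
      for (p <- points) {                          // outer level: independent work per point
        val nearest = (0 until k).minBy { c =>     // inner level: reduction over centroids
          var dist = 0.0f
          for (j <- 0 until d) { val diff = p(j) - centroids(c)(j); dist += diff * diff }
          dist
        }
        for (j <- 0 until d) sums(nearest)(j) += p(j)
        counts(nearest) += 1
      }
      (sums, counts)
    }

A MetaPipe-style template can overlap these levels, so the inner reduction for one point can execute while the accumulation for the previous point completes.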
VI. CONCLUSION

In this paper, we describe a practical framework that can generate efficient FPGA designs automatically from a high-level description based on parallel patterns. We introduce DHDL, a new parameterizable hardware definition language that describes designs using templates such as MetaPipe, with which we capture a larger design space than previous work. We describe our hybrid area estimation technique and evaluate our approach extensively on various benchmarks from the data analytics, financial analytics, and machine learning domains. We show an average area estimation error of 4.8% and an average runtime estimation error of 6.1% over all the benchmarks. We perform a detailed study for each benchmark on the space of designs described by tile sizes, parallelism factors, and coarse-grained pipelining, and measure their effects on the utilization of different types of FPGA resources. We show that our exploration tool is 279 to 6533 times faster than a commercial high-level synthesis tool. Finally, we show that the Pareto-optimal designs we discover can achieve a speedup of up to 16.7× over optimized multi-core CPU implementations running on a commodity server processor.

ACKNOWLEDGMENTS

The authors thank Maxeler Technologies for their assistance with this paper, and the reviewers for their suggestions. This work is supported by DARPA Contract-Air Force FA8750-12-2-0335; Army Contract AHPCRC W911NF-07-2-0027-1; NSF Grants IIS-1247701, CCF-1111943, CCF-1337375, and SHF-1408911; and the Stanford PPL affiliates program, Pervasive Parallelism Lab: Oracle, AMD, Huawei, Intel, NVIDIA, SAP Labs. The authors acknowledge additional support from Oracle. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.

REFERENCES

[1] W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, and M. A. Horowitz, "Convolution engine: Balancing efficiency & flexibility in specialized computing," in Proceedings of the 40th Annual International Symposium on Computer Architecture, ser. ISCA, 2013, pp. 24–35.
[2] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Teman, X. Feng, X. Zhou, and Y. Chen, "PuDianNao: A polyvalent machine learning accelerator," in Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS, 2015, pp. 369–381.
[3] L. Wu, A. Lottarini, T. K. Paine, M. A. Kim, and K. A. Ross, "Q100: The architecture and design of a database processing unit," in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS, 2014, pp. 255–268.
[4] J. Casper and K. Olukotun, "Hardware acceleration of database operations," in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA, 2014, pp. 151–160.
[5] R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. C. Lee, S. Richardson, C. Kozyrakis, and M. Horowitz, "Understanding sources of inefficiency in general-purpose chips," in Proceedings of the 37th Annual International Symposium on Computer Architecture, ser. ISCA, 2010, pp. 37–47.
[6] I. Kuon and J. Rose, "Measuring the gap between FPGAs and ASICs," in Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays, ser. FPGA, 2006, pp. 21–30.
[7] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, M. Haselman, S. Hauck, S. Heil, A. Hormati, J.-Y. Kim, S. Lanka, J. Larus, E. Peterson, S. Pope, A. Smith, J. Thong, P. Y. Xiao, and D. Burger, "A reconfigurable fabric for accelerating large-scale datacenter services," in Proceedings of the 41st Annual International Symposium on Computer Architecture, ser. ISCA, 2014, pp. 13–24.
[8] J. Ouyang, S. Lin, W. Qi, Y. Wang, B. Yu, and S. Jiang, "SDA: Software-defined accelerator for large-scale DNN systems," in Hot Chips 26, 2014.
[9] P. K. Gupta, "Xeon+FPGA platform for the data center," http://www.ece.cmu.edu/~calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf, 2015.
[10] "Falcon computing," http://falcon-computing.com/, 2015.
[11] D. Bacon, R. Rabbah, and S. Shukla, "FPGA programming for the masses," Queue, vol. 11, no. 2, Feb. 2013.
[12] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang, "High-level synthesis for FPGAs: From prototyping to deployment," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, no. 4, pp. 473–491, 2011.
[13] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, T. Czajkowski, S. D. Brown, and J. H. Anderson, "LegUp: An open-source high-level synthesis tool for FPGA-based processor/accelerator systems," TECS, vol. 13, no. 2, p. 24, 2013.
[14] J. Auerbach, D. F. Bacon, P. Cheng, and R. Rabbah, "Lime: A Java-compatible and synthesizable language for heterogeneous architectures," in Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, ser. OOPSLA, 2010, pp. 89–108.
[15] Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks, "Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures," in Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, 2014, pp. 97–108.
[16] A. K. Sujeeth, H. Lee, K. J. Brown, H. Chafi, M. Wu, A. R. Atreya, K. Olukotun, T. Rompf, and M. Odersky, "OptiML: An implicitly parallel domain specific language for machine learning," in ICML, 2011.
[17] A. K. Sujeeth, T. Rompf, K. J. Brown, H. Lee, H. Chafi, V. Popic, M. Wu, A. Prokopec, V. Jovanovic, M. Odersky, and K. Olukotun, "Composition and reuse with compiled domain-specific languages," in European Conference on Object Oriented Programming, 2013.
[18] P. Zhang, M. Huang, B. Xiao, H. Huang, and J. Cong, "CMOST: A system-level FPGA compilation framework," in DAC, 2015.
[19] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, "Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines," in Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI, 2013, pp. 519–530.
[20] A. K. Sujeeth, K. J. Brown, H. Lee, T. Rompf, H. Chafi, M. Odersky, and K. Olukotun, "Delite: A compiler architecture for performance-oriented embedded domain-specific languages," ACM Transactions on Embedded Computing Systems (TECS), July 2014.
[21] Maxeler Technologies, "MaxCompiler white paper," 2011.
[22] R. Prabhakar, D. Koeplinger, K. J. Brown, H. Lee, C. De Sa, C. Kozyrakis, and K. Olukotun, "Generating configurable hardware from parallel patterns," in Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS, 2016, pp. 651–665.
[23] "Vivado high-level synthesis," http://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html.
[24] "Vivado design suite 2015.1 user guide: High-level synthesis."
[25] L.-N. Pouchet, P. Zhang, P. Sadayappan, and J. Cong, "Polyhedral-based data reuse optimization for configurable computing," in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ser. FPGA, 2013, pp. 29–38.
[26] D. Chen, J. Cong, Y. Fan, and Z. Zhang, "High-level power estimation and low-power design space exploration for FPGAs," in Design Automation Conference, 2007. ASP-DAC '07. Asia and South Pacific, Jan. 2007, pp. 529–534.
[27] L. Deng, K. Sobti, Y. Zhang, and C. Chakrabarti, "Accurate area, time and power models for FPGA-based implementations," J. Signal Process. Syst., vol. 63, no. 1, pp. 39–50, Apr. 2011.
[28] S. Bilavarn, G. Gogniat, J.-L. Philippe, and L. Bossuet, "Design space pruning through early estimations of area/delay tradeoffs for FPGA implementations," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 10, pp. 1950–1968, 2006.
[29] A. Nayak, M. Haldar, A. Choudhary, and P. Banerjee, "Accurate area and delay estimators for FPGAs," in Design, Automation and Test in Europe Conference and Exhibition, Proceedings, 2002, pp. 862–869.
[30] R. Enzler, T. Jeger, D. Cottet, and G. Tröster, "High-level area and performance estimation of hardware building blocks on FPGAs," in Field-Programmable Logic and Applications: The Roadmap to Reconfigurable Computing. Springer, 2000, pp. 525–534.
[31] P. Bjuréus, M. Millberg, and A. Jantsch, "FPGA resource and timing estimation from Matlab execution traces," in Proceedings of the Tenth International Symposium on Hardware/Software Codesign, 2002.
[32] "Stratix device handbook," https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/hb/stratix-v/stx5_core.pdf.
[33] "Xilinx 7 series FPGAs configurable logic block user guide," http://www.xilinx.com/support/documentation/user_guides/ug474_7Series_CLB.pdf, 2014.
[34] J. Heaton, "Encog: Library of interchangeable machine learning models for Java and C#," Journal of Machine Learning Research, vol. 16, pp. 1243–1247, 2015. [Online]. Available: http://jmlr.org/papers/v16/heaton15a.html
[35] F. Scarselli and A. C. Tsoi, "Universal approximation using feedforward neural networks: A survey of some existing methods, and some new results," Neural Networks, vol. 11, no. 1, pp. 15–37, 1998. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S089360809700097X
[36] J. Cong, B. Liu, R. Prabhakar, and P. Zhang, "A study on the impact of compiler optimizations on high-level synthesis," in Languages and Compilers for Parallel Computing, ser. Lecture Notes in Computer Science, H. Kasahara and K. Kimura, Eds. Springer Berlin Heidelberg, 2013, vol. 7760, pp. 143–157. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-37658-0_10
[37] "OpenBLAS," http://www.openblas.net/, 2016.
[38] C. Bienia, "Benchmarking modern multiprocessors," Ph.D. dissertation, Princeton University, January 2011.