Mitigating Memory-induced Dark Silicon in Many-Accelerator Architectures

Dionysios Diamantopoulos, Sotirios Xydis, Kostas Siozios, and Dimitrios Soudris

Abstract: Many-Accelerator (MA) systems have been introduced as a promising architectural paradigm that can boost performance and improve the power efficiency of general-purpose computing platforms. In this paper, we focus on the problem of resource under-utilization, i.e. Dark Silicon, in FPGA-based MA platforms. We show that, apart from the typically expected peak power budget, on-chip memory resources form a severe under-utilization factor in MA platforms, leading to up to 75% dark silicon. Recognizing that static memory allocation, the de-facto mechanism supported by modern design techniques and synthesis tools, forms the main source of memory-induced Dark Silicon, we introduce a novel framework that extends conventional High-Level Synthesis (HLS) with dynamic memory management (DMM) features, enabling accelerators to dynamically adapt their allocated memory to the runtime memory requirements, thus maximizing the overall accelerator count through effective sharing of the FPGA's memory resources. We show that our technique delivers significant gains in FPGA accelerator density, i.e. 3.8× on average, and application throughput gains of up to 3.1× and 21.4× for shared and private memory accelerators, respectively.

Index Terms: many-accelerator architectures, dynamic memory management, high-level synthesis.

1 INTRODUCTION
Breaking the exascale barrier has been recently identified as the next big challenge in computing systems. Several studies [1], [2] showed that reaching this goal requires a design paradigm shift towards more aggressive hardware/software co-design solutions at the architecture and technology level. Recently, many-accelerator heterogeneous architectures have been proposed to overcome the utilization/power wall [3], [4], [5]. For instance, Microsoft Corp. showed that such many-accelerator systems on reconfigurable fabrics can accelerate portions of large-scale software [6], delivering 95% improvements in servers' throughput.

In this work we investigate the impact of the memory-intensive nature of many-accelerator systems on the scalability potential of MA architectures. A recent survey of eleven publicly available accelerators reveals that an average of 69% of accelerator area is consumed by memory [7]. Rapid starvation of the available on-chip memory leads to severe resource under-utilization of the FPGA, similar to the Dark Silicon concept of future many-core chips. In fact, modern FPGA CAD tools (at both the RTL and HLS level) allow only static memory allocation, which dictates the reservation of the maximum memory that an accelerator needs for the entire execution window. While static allocation works fine for a limited number of accelerators, it does not scale to a many-accelerator design paradigm. Figure 1 shows an exemplary study of the memory-induced Dark Silicon, considering the resource demands of the Kmeans clustering algorithm¹ on the Virtex Ultrascale XVCU190 FPGA device² when scaling the number of parallel accelerators. Figure 1 is annotated with two threshold values, regarding the maximum resource count and the maximum power budget (device TDP=125°C), respectively. Considering an ambient temperature of 50°C, the power-induced Dark Silicon manifests itself at an allocation scenario of 105 accelerators consuming around 20 Watts. As shown, memory-induced Dark Silicon poses a stricter constraint on accelerator count, i.e. up to 2.5× fewer accelerators, meaning that memory resources starve faster than power. BRAM memory is the resource that saturates faster than the rest of the FPGA resource types (FFs, LUTs, DSPs), thus forming the main limiting factor for higher accelerator densities as well as generating large fractions of under-utilized resources.

Fig. 1. Accelerator scalability analysis of the Kmeans clustering algorithm: Ai accelerators = [1:128], Np points = 2×10^4, Pk clusters = 3.

In this paper, we target to alleviate the aforementioned memory-induced Dark Silicon problem by proposing the elimination of the pessimistic memory allocation forced by static approaches. The main contribution of our proposal is the introduction of a novel HLS-based design framework for many-accelerator platforms that adopts a dynamically allocated run-time memory model. The focus of the paper is to show how DMM can be used for mitigating the memory-induced resource under-utilization problem in MA systems, rather than to give a detailed hardware description of the implemented DMM mechanisms. In typical HLS with static memory allocation, if the accelerators' memory requirements exceed the available on-chip memory resources, the design becomes un-synthesizable, i.e. the designer should degrade system characteristics to match the available resources. The proposed solution allows high accelerator densities by alleviating the resource under-utilization inefficiencies mainly induced by the static memory allocation strategies used in modern HLS tools.

¹ The same behavior is observed for the overall set of the evaluated applications, which exhibit diverse resource utilization features.
² The FPGA with the highest on-chip block RAM, total size: 132.9 Mb.

Fig. 2. Proposed architectural template for memory-efficient many-accelerator FPGA-based systems.

We show that applications with both dynamically and statically allocated data can benefit from the proposed techniques, with the prerequisite of utilizing the proposed malloc/free interface for performing data allocation. For static applications this can be performed through a straightforward source-code modification.

Following the above discussion, we introduce the DMM-HLS framework, which (i) extends typical HLS with DMM mechanisms and (ii) provides an HLS malloc/free API that enables statically allocated memory to be transformed into dynamically allocated memory. We extensively evaluated the effectiveness of the proposed DMM-HLS framework over several many-accelerator architectures for representative applications of emerging computing domains. We show that DMM-HLS delivers more scalable MA platform configurations, with an average 3.8× increase in accelerator count in comparison to MA systems designed using state-of-the-art HLS. Better scalability also leads to significant throughput gains under both private and shared memory model configurations, 21.4× and 3.1× on average, respectively.
2 DMM-HLS FOR MANY-ACCELERATOR FPGAS

Design-time: Figure 2 shows a typical FPGA-based MA system. It includes i) the processor subsystem, executing the application control flow, and ii) the accelerators subsystem, holding the computationally intensive kernels of the application. The accelerators are designed and synthesized through Vivado-HLS. The on-chip memory resources (BRAMs) are managed through the DMM-HLS framework. DMM-HLS supports the description of accelerators with both statically and dynamically allocated data stored in BRAMs. It exposes a DMM API composed of two main function calls, similar to the glibc malloc/free API, for memory allocation and deallocation:
    void* HlsMalloc(size_t size, uint heap_id)
    void  HlsFree(void *ptr, uint heap_id)

where size is the requested allocation size in bytes, heap_id is the identification number of the heap on which the allocation shall occur, and *ptr is the pointer to be freed. A partitioning of an accelerator's data into dynamic or static memory objects can be performed by the designer after analysing the application's memory access traces. Without loss of generality, in this paper we adopt a data partitioning scheme in which global-scope data structures, i.e. the data structures that accelerators operate on upon invocation, are allocated as dynamic data, while accelerators' internal data structures, e.g. local register files, are allocated as static objects.
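As a minimal usage sketch (ours, not taken from the paper), the following kernel follows the partitioning scheme above: the data it operates on is allocated dynamically on its bound heap, while a small internal register file stays static. The kernel name, the sizes, the use of unsigned int for the API's uint, and the -1 "heap full" sentinel (borrowed from the listing later in this section) are illustrative assumptions.

    #include <stddef.h>

    void *HlsMalloc(size_t size, unsigned int heap_id);  /* DMM-HLS API, as above */
    void  HlsFree(void *ptr, unsigned int heap_id);

    int sum_of_squares_acc(const int *in, int n, unsigned int heap_id)
    {
        /* global-scope working data: dynamically allocated on the bound heap */
        int *tmp = (int *)HlsMalloc((size_t)n * sizeof(int), heap_id);
        if (tmp == (int *)-1)
            return -1;                     /* heap currently full; caller may retry */

        int acc[4] = {0, 0, 0, 0};         /* internal register file: static object */

        for (int i = 0; i < n; i++)
            tmp[i] = in[i] * in[i];
        for (int i = 0; i < n; i++)
            acc[i % 4] += tmp[i];

        HlsFree(tmp, heap_id);             /* return the BRAM space to the heap */
        return acc[0] + acc[1] + acc[2] + acc[3];
    }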

Fig. 3. Schedule and memory footprint for 4 MMUL accelerators with differing workloads utilizing the DMM-HLS technique. FPGA platform: 90 BRAMs. Total MMUL memory request: 113 BRAMs. The multi-heap DMM-HLS offers parallel and overlapped execution of accelerators, improving performance over static memory allocation.

The DMM-HLS framework implements techniques from multi-threaded dynamic memory management [8], [9]. It supports parallel memory access paths by grouping BRAM modules into memory banks, named heaps (Fig. 2). Each heap implements its own allocator, consisting of two major hardware components: i) the free-list memory structure holding the freed and allocated memory blocks, and ii) the fit allocation algorithm that searches over the free-list and allocates memory in a first-fit manner. The maximum number of heaps controls the supported memory-level parallelism of the dynamically allocated data. More than one accelerator can be bound to a specific memory heap for allocating data. The increased accelerator parallelism, in combination with the overlapped execution offered by multi-heap DMM configurations, delivers significant throughput gains when tasks of variable workload are co-scheduled on the FPGA. Figure 3 shows such a scenario, i.e. accelerator parallelism and overlapping, for the execution of 4 MMUL applications with differing workload characteristics on an FPGA with a maximum of 90 BRAMs, assuming a 2-heap DMM. It can be easily verified that the proposed solution delivers throughput gains of 42.8% over the static HLS solution that serializes task execution.
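The hardware allocator itself is not detailed in the paper; as a rough software model of the free-list/first-fit behaviour described above (a bitmap with one bit per heap word, searched in first-fit order), one might write something like the sketch below. The names, the heap capacity and the per-block length bookkeeping are our assumptions, not the paper's implementation.

    #include <stdint.h>
    #include <string.h>

    #define HEAP_WORDS 1024                 /* assumed heap capacity, in heap words */

    typedef struct {
        uint8_t  used[HEAP_WORDS];          /* free-list as a bitmap: 1 = word allocated */
        uint32_t len[HEAP_WORDS];           /* block length recorded at its first word   */
    } heap_model_t;

    /* First-fit: return the start index of the first contiguous run of `words`
       free words, or -1 if no such run exists (request fragmentation). */
    static int first_fit_alloc(heap_model_t *h, uint32_t words)
    {
        uint32_t run = 0;
        for (uint32_t i = 0; i < HEAP_WORDS; i++) {
            run = h->used[i] ? 0 : run + 1;
            if (run == words) {
                uint32_t start = i + 1 - words;
                memset(&h->used[start], 1, words);   /* mark the block as allocated */
                h->len[start] = words;
                return (int)start;
            }
        }
        return -1;
    }

    static void first_fit_free(heap_model_t *h, uint32_t start)
    {
        memset(&h->used[start], 0, h->len[start]);   /* clear the block's bitmap bits */
        h->len[start] = 0;
    }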
Run-time: There are three major runtime issues related to DMM in many-accelerator systems, i.e. memory fragmentation, memory coherency and memory access conflicts. In many-accelerator systems, we recognize two fragmentation types, alignment and request fragmentation. Alignment fragmentation accounts for the extra bytes reserved for keeping every allocation padded to the heap word length L_Hi, including the allocation size and header meta-data information. As long as the size of DMM requests (malloc/free) is a multiple of L_Hi, the alignment fragmentation is zero. Request fragmentation refers to the situation in which a memory request skips freed memory blocks to find a contiguous memory space equivalent to the size of the request. In the worst-case scenario the request cannot be served, even if there are memory blocks available in the heap that could serve it if merged. Request fragmentation is strongly dependent on the memory allocation patterns of each accelerator. In the case of a homogeneous many-accelerator system, i.e. each accelerator allocates the same memory size, request fragmentation is zero. Memory coherency problems are inherently eliminated in DMM-HLS, since every accelerator has its own memory space and thus no other accelerator may access it. However, memory access conflicts may become a performance bottleneck in case a large set of accelerators shares the same heap. As previously mentioned, DMM-HLS supports multiple-heap configurations that relax the pressure on the dynamically allocated memory space.
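Under our reading of the alignment fragmentation definition above (every allocation padded up to a multiple of the heap word length L_Hi), the per-request waste can be expressed as in the short sketch below; the byte-level formulation is an illustration, not the paper's exact bookkeeping.

    #include <stdint.h>

    /* Padding added to a request so that it occupies a multiple of L_Hi bytes;
       zero whenever the requested size is already a multiple of L_Hi. */
    static uint32_t alignment_frag(uint32_t request_bytes, uint32_t l_hi_bytes)
    {
        uint32_t rem = request_bytes % l_hi_bytes;
        return rem ? l_hi_bytes - rem : 0;
    }
    /* e.g. with l_hi_bytes = 8: a 30-byte request wastes 2 bytes, a 32-byte request wastes 0. */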
DMM-HLS framework setup: The employed flow for the evaluation of DMM-HLS is based on Xilinx Vivado-HLS, a state-of-the-art and industrial-strength HLS tool, targeting the Virtex Ultrascale XVCU190 device. We evaluated the efficiency of the proposed DMM-HLS framework considering many-accelerator architectures targeting emerging application domains, e.g. artificial intelligence, scientific computing and enterprise computing. Specifically, we used six applications (Table 1) found in the Phoenix MapReduce framework for shared-memory systems [10]. We considered a set of 1000 tasks to be mapped onto each examined MA system. Task memory size requirements are derived from a normal distribution N(MaxSize/2, MaxSize), where MaxSize is defined in the last column of Table 1. The tasks are inserted into the MA system at the same time and scheduled whenever enough memory is available. In case of a tie, tasks with larger memory size requests are prioritized.

For static applications, the original code is source-to-source transformed to a dynamically allocated one using the function calls of the proposed DMM-HLS API. The transformed code, augmented with the DMM-HLS function calls, is synthesized into an RTL implementation through the back-end of the Vivado HLS tool. An exemplary scenario of the code patterns triggering code transformations is given in the following listing:
    /* Original Code */
    int IN[10];
    int OUT[10];
    foo(IN, &OUT);

    /* Transformed Code for DMM-HLS */
    const unsigned int T = 500;   /* check period */
    int *IN, *OUT;
    while ((IN  = HlsMalloc(10, 0)) == (int *)-1) { HlsSleep(T); }
    while ((OUT = HlsMalloc(10, 0)) == (int *)-1) { HlsSleep(T); }
    foo(IN, &OUT);
    HlsFree(IN, 0);
    HlsFree(OUT, 0);
Regarding the execution stalling of an accelerator when there is no available free memory in the heap, we utilize a while-loop wrapper around the HlsMalloc call. If free memory is available in a specific memory chunk of the heap, the allocator returns the first address of this memory chunk; otherwise, the allocator returns -1. Using the while-loop wrapper, we force the stalling of the accelerator at the HlsMalloc call. In the meantime, other accelerators may execute in parallel and eventually free some memory. Until then, the stalled accelerator periodically (e.g. every T=500 cycles) checks for available free space through the while-loop wrapper.
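One way to package the retry loop from the transformed listing into a reusable helper is sketched below. HlsSleep, the -1 sentinel and the check period T follow the paper's listing; the helper name and its signature are our own illustrative choices, not part of the DMM-HLS API.

    #include <stddef.h>

    void *HlsMalloc(size_t size, unsigned int heap_id);   /* DMM-HLS API */
    void  HlsSleep(unsigned int cycles);                   /* as used in the listing above */

    /* Stall (busy-wait with periodic checks) until the requested space
       becomes available in the given heap, then return its first address. */
    static void *HlsMallocBlocking(size_t size, unsigned int heap_id,
                                   unsigned int check_period)
    {
        void *p;
        while ((p = HlsMalloc(size, heap_id)) == (void *)-1)  /* -1 = heap currently full */
            HlsSleep(check_period);   /* other accelerators may free memory meanwhile */
        return p;
    }
    /* usage matching the listing: int *IN = (int *)HlsMallocBlocking(10, 0, 500); */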
3 EVALUATION

We evaluate the practical performance improvements delivered by the proposed DMM-HLS framework in a twofold manner. First, we evaluate accelerator density, i.e. the number of accelerators that can be programmed onto the FPGA simultaneously. Figure 4(a) depicts the accelerator gain/loss of DMM-HLS compared to static allocation. The bars express the min-max number of accelerators that can be loaded using 1 to 16 heaps. As shown, the proposed DMM-HLS framework is able to deliver many-accelerator architectures with 3.8× more accelerators on average (up to 9.7× for Histogram) compared to static allocation. The highest gains in accelerator density come from the single-heap configuration, which incurs the least possible overhead in the resources consumed by the DM manager. As configurations with more heaps are adopted, the extra resources needed to implement the corresponding allocators decrease the maximum number of accelerators, e.g. the instantiation of 16 heaps delivers an average gain of 1.7× in accelerator density.

We note that simply increasing the number of accelerators does not imply performance gains in a straightforward manner. Figure 4(b) shows the number of accelerators at which the system exhibits maximal performance in terms of throughput (calculated as the workload size, in Mbytes, over the latency of a kernel to process it, in us), as well as the throughput normalized over static allocation for all employed applications. The observed variations in throughput and accelerator density originate from the differing resource requirements and workload characteristics. We consider two use-case scenarios: the loaded accelerators are initialized with data from i) individual memory space, i.e. private memory, and ii) shared memory, i.e. the case that accelerators work on the same data, e.g. finding different strings in the same document with the String Match algorithm. The results report the configuration (denoted as accelerator number : heap number on top of every column) that delivers maximal throughput for each application. We measured an average throughput increase of 21.4× with private memory initialization and 3.1× with shared memory initialization over the static allocation of conventional Vivado-HLS.

Figure 4(c) shows aggregated, averaged results in terms of number of accelerators, resource usage and throughput. We examine both static allocation and DM allocation with several heaps, ranging from 1 up to 32. The left y-axis refers to the normalized resources; every stacked vertical bar contains the cumulative percentage of the four studied resources (BRAMs, DSPs, FFs and LUTs), so the theoretical maximum of the left y-axis in Fig. 4(c) is 400%. The average accelerator density (i.e. maximum number of deployed accelerators) for each examined configuration is reported on top of the stacked bar. The right y-axis refers to the throughput normalized over static allocation, which is highlighted as a dashed horizontal line. Following the same trend as throughput, energy efficiency is also reported in terms of Giga-Operations Per Second/Watt (GOPS/W). While the single-heap configuration delivers the highest number of accelerators (3.8×), it exhibits low throughput since memory accesses are executed sequentially. By adding more heaps, the overall system exhibits higher throughput, up to the point that the extra resources of the multi-heap allocators cause a decrease in the number of accelerators. Area overheads are due to the heap allocator modules and the extended interface of the accelerators. For single-heap implementations, an average area overhead of 0.3% FFs and 1.2% LUTs is reported with respect to the static implementations.

TABLE 1
Applications Characterization

Application Domain      | Kernel                | Description                                      | Parameters
Image Processing        | Histogram             | Determine frequency of RGB channels in an image. | M_size = 640×480 pixels
Scientific Computing    | Matrix Multiplication | Dense integer matrix multiplication.             | M_size = 100×100
Enterprise Computing    | String Match          | Search a file of keys for an encrypted word.     | N_filekeys = 307,200, M_words = 4
Artificial Intelligence | Linear Regression     | Compute the best-fit line for a set of points.   | N_points = 100,000
Artificial Intelligence | PCA                   | Principal component analysis on a matrix.        | M_size = 250×250
Artificial Intelligence | Kmeans                | Clustering of 3-D data points into groups.       | N_points = 20,000, P_clusters = 10

Fig. 4. Comparison of a) accelerator density, b) system throughput, and c) resource breakdown versus throughput and energy trade-offs.

This overhead scales up to +10.7% FFs and +55.2% LUTs for the case of 32 heaps. However, considering both the DMM infrastructure and the accelerators, the on-chip memory utilization is around 60%, i.e. resources other than memory exceed the maximum limit, while the rest of the FPGA resources increase on average by a factor of 6.3, 17.6 and 29.7 for DSPs, FFs and LUTs, respectively.

4 RELATED WORK

Prior art has investigated specialized hardware, i.e. architecture templates for many-accelerator systems, as a response to the dark silicon era. In [7], the authors propose a many-accelerator memory organization that only statically shares the address space between active accelerators. Similarly, in [11] a many-accelerator architectural template is proposed that enables the reuse of accelerator memory resources by adopting a non-uniform cache architecture (NUCA) scheme. Both [7] and [11] target ASIC-like many-accelerator systems and mainly focus on the performance implications of the memory subsystem. Regarding HLS prior art, to the best of our knowledge only [12], [13] and [14] have studied dynamic memory allocation for high-level synthesis. However, they do not target many-accelerator systems, and thus provide no support for mitigating the memory-induced under-utilization problem.

5 CONCLUSIONS

This paper targeted the scalability issues of modern many-accelerator FPGA systems. We showed that on-chip memory resources impose severe bottlenecks on the maximum number of deployed accelerators, leading to large resource under-utilization in modern FPGA devices. We proposed the incorporation of dynamic memory management during HLS to alleviate the resource under-utilization inefficiencies mainly induced by the static memory allocation strategies used in state-of-the-art HLS tools. The proposed approach has been extensively evaluated over real-life many-accelerator architectures targeting emerging applications, showing that its adoption delivers significant gains in accelerator density and throughput.

REFERENCES

[1] J. Shalf, D. Quinlan, and C. Janssen, "Rethinking hardware-software codesign for exascale systems," Computer, Nov. 2011.
[2] Q. Zhu, B. Akin, H. Sumbul, F. Sadi, J. Hoe, L. Pileggi, and F. Franchetti, "A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing," in 3D Systems Integration Conference (3DIC), 2013 IEEE International, Oct. 2013.
[3] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor, "Conservation cores: Reducing the energy of mature computations," SIGARCH Comput. Archit. News, vol. 38, no. 1, pp. 205-218, Mar. 2010.
[4] Y.-T. Chen, J. Cong, M. Ghodrat, M. Huang, C. Liu, B. Xiao, and Y. Zou, "Accelerator-rich CMPs: From concept to real hardware," in Computer Design (ICCD), 2013 IEEE 31st International Conference on, 2013.
[5] J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, and G. Reinman, "Architecture support for domain-specific accelerator-rich CMPs," ACM Trans. Embed. Comput. Syst., vol. 13, no. 4s, Apr. 2014.
[6] A. Putnam, A. Caulfield, E. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, M. Haselman, S. Hauck, S. Heil, A. Hormati, J.-Y. Kim, S. Lanka, J. Larus, E. Peterson, S. Pope, A. Smith, J. Thong, P. Y. Xiao, and D. Burger, "A reconfigurable fabric for accelerating large-scale datacenter services," in 41st ISCA, June 2014.
[7] M. J. Lyons, M. Hempstead, G.-Y. Wei, and D. Brooks, "The accelerator store: A shared memory framework for accelerator-based systems," ACM Trans. Archit. Code Optim., vol. 8, Jan. 2012.
[8] S. Xydis, A. Bartzas, I. Anagnostopoulos, D. Soudris, and K. Z. Pekmestzi, "Custom multi-threaded dynamic memory management for multiprocessor system-on-chip platforms," in ICSAMOS, 2010, pp. 102-109.
[9] Y. Sade, M. Sagiv, and R. Shaham, "Optimizing C multithreaded memory management using thread-local storage," in 14th International Conference on Compiler Construction, ser. CC '05, 2005.
[10] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for multi-core and multiprocessor systems," in HPCA, IEEE 13th Int. Symp. on, Feb. 2007.
[11] E. Cota, P. Mantovani, M. Petracca, M. Casu, and L. Carloni, "Accelerator memory reuse in the dark silicon era," IEEE Computer Architecture Letters, vol. 99, no. RapidPosts, p. 1, 2012.
[12] L. Semeria and G. De Micheli, "SpC: Synthesis of pointers in C, application of pointer analysis to the behavioral synthesis from C," in Computer-Aided Design, 1998. ICCAD 98. Digest of Technical Papers, 1998 IEEE/ACM International Conference on, Nov. 1998.
[13] L. Semeria, K. Sato, and G. De Micheli, "Resolution of dynamic memory allocation and pointers for the behavioral synthesis from C," in Proc. of the Conf. on DATE, New York, NY, USA: ACM, 2000.
[14] M. Shalan and V. J. Mooney, "A dynamic memory management unit for embedded real-time system-on-a-chip," in Proceedings of the 2000 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, ser. CASES '00, ACM, 2000.
