Customizable Computing:
from Single-Chips to Datacenters
Jason Cong
Chancellor's Professor, UCLA
Director, Center for Domain-Specific Computing
cong@cs.ucla.edu
http://cadlab.cs.ucla.edu/~cong
[Figure: power density (W/cm²), on a log scale from 1 to 1000, versus process generation from 1.5 µm down to 0.07 µm for i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4; the trend passes the hot-plate level and heads toward nuclear-reactor and rocket-nozzle levels. Parallelization and customization are marked as the two responses.]
Why now?
Previous architectures are device/transistor limited
Von Neumann architecture allows maximum device reuse
One pipeline serves all functions, fully utilized
Future architectures
Plenty of transistors, but power/energy limited (dark silicon)
Customization and specialization for maximum energy efficiency
A story of specialization
Intel CEO Brian Krzanich noted, "The acquisition will couple Intel's leading-edge products and manufacturing process with Altera's leading field-programmable gate array (or FPGA) technology." He further stated, "The combination is expected to enable new classes of products that meet customer needs in the data center and Internet of Things market segments."
Levels of Customization
Single-chip level
Node level
Data center level
Chip-Level Customization:
Accelerator-Rich Architectures (ARA)
[Figure: ARA features: hybrid L2 cache with STT-RAM + SRAM and buffer in NUCA [ISLPED 12][DATE 12]; adaptive L1 cache + SPM/buffer [ISLPED 11]; RF-interconnects; accelerator-in-memory (AIM) DIMMs [JESTCS 12].]
Today's ARA: address translation using the IOMMU with an IOTLB (e.g., 32 entries).
[Figure: cores (each with a TLB and MMU) and accelerators (each with an accelerator datapath and scratchpad memory) share an interconnect; accelerator accesses go through the IOMMU and IOTLB at the memory interface to main memory.]
#2 High Page Walk Latency: on an IOTLB miss, 4 main memory accesses are required to walk the page table.
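To make that cost concrete, here is a minimal C sketch of a 4-level page walk in the style of an x86-64 radix page table; it is an illustration only (not the paper's simulator code), and the memory model and entry layout are simplifying assumptions.

#include <stdint.h>

/* Stand-in for main memory: each dram_read() models one DRAM access. */
static uint64_t dram[1 << 20];
static uint64_t dram_read(uint64_t byte_addr) { return dram[(byte_addr / 8) % (1 << 20)]; }

/* On an IOTLB miss, the walker reads one table entry per level
 * (L4 -> L3 -> L2 -> L1): 4 main-memory accesses before the translation
 * is known, unless upper-level entries hit in MMU caches. */
static uint64_t walk_page_table(uint64_t table_root, uint64_t vaddr)
{
    uint64_t entry = table_root;
    for (int level = 4; level >= 1; level--) {
        uint64_t index = (vaddr >> (12 + 9 * (level - 1))) & 0x1FF;  /* 9 index bits per level */
        entry = dram_read((entry & ~0xFFFULL) + index * sizeof(uint64_t));
    }
    return (entry & ~0xFFFULL) | (vaddr & 0xFFF);  /* frame | page offset */
}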
[Figure: TLB miss behavior of BlackScholes.]
Multiple data tiles (from the same set of pages) are often mapped to different accelerator instances for parallelism.
Thus, a shared TLB can be very helpful.
[Figure: each accelerator has a private TLB; misses go to a shared TLB, which connects to the IOMMU.]
[Figure 11: page walk reduction compared to the IOMMU baseline.]
The shared TLB adds access latency, but we find that the benefit of reduced page walks completely outweighs the added latency for the current configuration. Much larger TLB sizes or more sharers could benefit from a banked design, but such use cases are not well established.
Correctness issues: in addition to the TLB shootdown support in the private TLBs, the shared TLB also needs to be checked for invalidation, because under a mostly-inclusive policy entries that were previously brought in can be present in both levels.
Common practice: a 4-level walk (L4, L3, L2, L1 entries, then the page offset).
Upper-level entries: common entries are cached in MMU caches.
L1 entries: fetching a data cache line of entries has a prefetching effect.
Levels of Customization
Single-chip level
Node level
Data center level
[Figure: coprocessor system organization: Intel I/O subsystem, Application Engines (AEs), Memory Controller Hub (MCH), and memory.]
Equally important is the translation function that the CAPP and PSL provide for the accelerator. The
accelerator uses the same virtual memory addressing space as the core application that enables it. The
CAPP and PSL handle all virtual-to-physical memory translations, simplifying the programming model and
freeing the accelerator to do the number crunching directly on the data it receives.
In addition, the PSL contains a 256 KB resident cache on behalf of the accelerator. Based on the needs
of the algorithm, the accelerator can direct the use of the cache via the type of memory accesses
(reads/writes) as cacheable or noncacheable. Fundamentally, the rich set of commands that the
accelerator can send to the PSL reflects all of the types of memory accesses available to the POWER8
cores themselves (such as reads, reads with intent-to-modify, reservations, locks, writes, and writes to
highest point of coherency).
The accelerator runs hand-in-hand with an application running on the core as shown in Figure 2:
Dedicated Process Model.
Convey coprocessor: FPGA-based, with shared cache-coherent memory.
xPilot (UCLA 2006) -> AutoPilot (AutoESL) -> Vivado HLS (Xilinx 2011-)
[Figure: AutoPilot ESL synthesis flow: a design specification in C/C++/SystemC goes through compilation & elaboration and platform-based C-to-RTL synthesis, guided by a common testbench, user constraints, a platform characterization library, and timing/power/layout constraints, and targets FPGA or ASIC blocks.]
Optimizations: loop restructuring, code generation, data layout optimization, inter-module optimization, array partitioning, module selection/replication, communication optimization, module-level scheduling, and data reuse (several of these are illustrated in the HLS sketch below).
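As a concrete illustration (not code from the talk), the sketch below shows how a few of these optimizations, namely array partitioning, loop pipelining, and data reuse through an on-chip buffer, are commonly expressed as HLS pragmas on C code; the kernel, names, and factors are assumptions chosen for illustration.

#define N 128

/* Illustrative HLS kernel: names and sizes are made up for this sketch. */
void scale_accum(const int in[N], int out[N], int coeff)
{
    int buf[N];  /* on-chip buffer for data reuse */
#pragma HLS ARRAY_PARTITION variable=buf cyclic factor=4 dim=1

    /* Data reuse: bring the data on chip once instead of re-reading external memory. */
copy_in:
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        buf[i] = in[i];
    }

    /* Pipelined compute loop; cyclic partitioning lets buf[i] and buf[i-1]
       be read in the same cycle. */
compute:
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        out[i] = buf[i] * coeff + (i > 0 ? buf[i - 1] : 0);
    }
}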
[Figure: system compilation flow: program analysis produces a task graph; system optimization takes user directives, the platform spec., a hardware model, and design parameters; system generation uses module templates and system IP templates (C/RTL) to produce OpenCL and an on-board executable HW/SW implementation.]
[Figure: CNN structure: an input image passes through successive feature maps to an output category; each convolution layer turns an input feature map into an output feature map, and max-pooling is optional.]
Programmability
Network definitions (& layer parameters) change according to the application
CNNs are getting deeper; more accuracy is achieved with deeper networks
Network parameters vary across layers, with many user-definable parameters
Hardware has limited programmability
Re-synthesize a bitstream for each network? Not preferred
Re-program the FPGA for each layer? NO!
[Chart: classification error (%) and network depth by year: 2010: 28.2 and 2011: 25.8 with shallow networks; 2012: 16.4 (8 layers); 2013: 11.7 (8 layers); 2014: 7.3 (19 layers) and 6.7 (22 layers); 2015: 3.57 (152 layers).]
[Figure: deep learning framework landscape.
Industry (TensorFlow, Caffe, D4J, etc.): stability, scale & speed, data integration, relatively fixed models.
Research (Torch, Theano, Neon): flexible, fast iteration, debuggable, relatively bare-bones.
Caffe framework stack: network definitions, then layer definitions (CONV, FCN, POOL, ReLU), then CPU_forward (MKL, BLAS) and GPU_forward (cuDNN); the implementations are quite different across layers and networks.]
[Code figure: a tiled loop nest, with nested tile loops wrapped around the convolution computation.]
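The loop nest behind that figure can be sketched as below; this is a minimal illustration in the style of tiled CNN accelerators, and the dimensions, tile sizes, and stride-1/no-padding simplifications are assumptions rather than values from the talk.

#define M  64   /* output feature maps */
#define N  64   /* input feature maps  */
#define R  32   /* output rows         */
#define C  32   /* output columns      */
#define K   3   /* kernel size         */
#define Tm 16   /* tile sizes (chosen to divide the dimensions above) */
#define Tn 16
#define Tr  8
#define Tc  8

/* Assumes output[][][] is zero-initialized by the caller. */
void conv_layer(float output[M][R][C],
                const float input[N][R + K - 1][C + K - 1],
                const float weights[M][N][K][K])
{
    for (int to = 0; to < M; to += Tm)            /* tile loop: output maps  */
    for (int ti = 0; ti < N; ti += Tn)            /* tile loop: input maps   */
    for (int row = 0; row < R; row += Tr)         /* tile loop: output rows  */
    for (int col = 0; col < C; col += Tc)         /* tile loop: output cols  */
        /* On-chip computation over one tile (unrolled/pipelined on the FPGA). */
        for (int oo = to; oo < to + Tm; oo++)
        for (int ii = ti; ii < ti + Tn; ii++)
        for (int r = row; r < row + Tr; r++)
        for (int c = col; c < col + Tc; c++)
        for (int kr = 0; kr < K; kr++)
        for (int kc = 0; kc < K; kc++)
            output[oo][r][c] += weights[oo][ii][kr][kc] * input[ii][r + kr][c + kc];
}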
Automation Flow to
Unified Representation
[Figure: FPGA accelerator system: the host connects over PCIe to the customized accelerator on the FPGA; on-board DRAM holds an instruction region and a feature-map region.]
Performance
Convolution layer & Fully connected layer
[Figure: a fully connected layer (input vectors, weights, output vectors) mapped onto the convolution engine, matching the convolution's input feature map, weights, and output feature map.]
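The point of the mapping is that a fully connected layer is the degenerate case of the convolution loop nest, with 1x1 feature maps and a 1x1 kernel, so the same engine and the same tiled loop structure can serve both. A minimal sketch, with made-up dimension names:

#define FC_OUT 256   /* output neurons (play the role of output feature maps) */
#define FC_IN  512   /* input neurons  (play the role of input feature maps)  */

/* out[m] = sum_n W[m][n] * in[n] is the convolution loop nest with
   R = C = 1 and K = 1, so the same MAC datapath can execute it. */
void fc_as_conv(float out[FC_OUT],
                const float W[FC_OUT][FC_IN],
                const float in[FC_IN])
{
    for (int m = 0; m < FC_OUT; m++) {
        out[m] = 0.0f;
        for (int n = 0; n < FC_IN; n++)
            out[m] += W[m][n] * in[n];
    }
}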
Challenge 1: Programmability
1. The network changes according to the application
2. Parameters vary across layers
Addressed in SW by an automation flow covering the CONV, FCN, POOL, and ReLU layers.
Challenge 2: Performance
1. Reuse the same HW for both CONV and FCN
2. Deal with both computation-bound and bandwidth-bound kernels (see the sketch after this list)
Addressed in HW through a unified representation with computation optimization and bandwidth optimization.
Challenge 3: Scalability
1. Demand to scale the accelerator to larger devices
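For Challenge 2, the balance between computation and bandwidth can be made explicit with a back-of-the-envelope computation-to-communication ratio per tile. The sketch below follows the tiled loop nest shown earlier; the formula and names are our simplification, not the talk's exact model.

/* Operations and off-chip traffic for one tile of the tiled convolution:
   a high ratio means compute-bound, a low ratio means bandwidth-bound. */
double ctc_ratio(int Tm, int Tn, int Tr, int Tc, int K)
{
    double ops   = 2.0 * Tm * Tn * Tr * Tc * K * K;             /* MACs, counted as 2 ops */
    double in_w  = (double)Tn * (Tr + K - 1) * (Tc + K - 1);    /* input tile words       */
    double wgt_w = (double)Tm * Tn * K * K;                     /* weight tile words      */
    double out_w = (double)Tm * Tr * Tc;                        /* output tile words      */
    return ops / (in_w + wgt_w + out_w);                        /* ops per word moved     */
}

Convolution layers have large ratios and tend to be computation-bound, while fully connected layers (K = 1 matrix-vector products) stay near a constant ratio of about 2 and are bandwidth-bound, which is why both computation optimization and bandwidth optimization are needed.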
Levels of Customization
Single-chip level
Node level
Data center level
ISCA'10, ASPLOS'12
Mismatch between workloads and processor designs; modern processors are over-provisioned.
Trade-off
FPGA
Efficient
Results
Normalized to the reference Xeon performance.
[Chart: normalized execution time and normalized energy for 8X ARM and 8X ATOM clusters on the MLlib benchmarks KM and LR; the values range from roughly 3x to 11x the Xeon baseline.]
Average power consumption: Xeon 175 W/node, Atom 30 W/node, ARM 10 W/node (embedded CPU).
Results
[Chart: normalized execution time and normalized energy (relative to the reference Xeon) for 8X ZYNQ, 8X ARM, and 8X ATOM clusters on KM and LR; the ARM and Atom clusters land at roughly 3x to 11x the Xeon baseline, while the 8X ZYNQ results fall roughly between 0.43 and 1.06.]
Data set: MNIST, 700K samples, 784 features, 10 labels.
Inefficiencies in Small-core
Slower
Cluster setup: 1 master/driver, 22 workers, and 1 file server, connected by one 10 GbE switch.
Each node:
1. Two Xeon processors
2. One FPGA PCIe card (Alpha Data)
3. 64 GB RAM
4. 10 GbE NIC
Experimental Results
Normalized to the reference Xeon performance.
[Chart: normalized execution time and normalized energy for 1X XEON+AD (Alpha Data FPGA card) and 8X ZYNQ; both configurations land roughly between 0.33 and 1.06 relative to the Xeon baseline.]
Performance and energy efficiency (normalized to Big-Core = 1.0):
                    Performance     Energy Efficiency
Big-Core + FPGA     Best (2.5)      Best (2.6)
Small-Core + FPGA   Better (1.2)    Best (1.9)
Big-Core            Good (1.0)      Good (1.0)
Small-Core          Bad (0.25)      Bad (0.24)
Accelerator designer: How do I install my accelerators?
Challenges: 1. query the available platforms; 2. express HW-specific requirements.
Cloud service provider: How do applications acquire accelerator resources?
Challenges: 1. heterogeneous platforms; 2. accelerator locality.
[Figure: accelerator management in YARN: the client talks to the RM (Resource Manager), which is extended with a GAM (global accelerator manager) that tracks accelerator status; each node runs an NM (Node Manager) and a NAM (Node Accelerator Manager) alongside its containers and the AM (Application Master), with FPGAs attached to the nodes. The NAM provides local accelerator service management and JVM-to-ACC communication optimization.]
Request accelerators: use the acc_id as a label.
[Figure: the user application requests containers with ACC labels; the GAM allocates the corresponding nodes to the application; inside a container, the application invokes the ACC through the Node ACC Manager (which holds the ACC info), passing input data and receiving output data from accelerators implemented on FPGA or GPU.]
[Table: applications evaluated, including Kmeans (KM) and Compression (COMP).]
[Charts: Kmeans task time and application time (seconds) for 4Nx12T, 8Nx12T, and 12Nx12T CPU clusters versus a 4Nx12T FPGA cluster, with the time broken down into scheduler, compute, and shuffle.]
User applications in MapReduce/Spark/Hadoop + Java/C/C++/OpenMP
Merlin Compiler and Kestrel Runtime Libraries (customize & virtualize) on top of FCS ACC Engines
The only solution for FPGA customization and virtualization for datacenter acceleration!
Merlin Compiler
Input: C/C++ with pragmas, an OpenMP-like high-level programming model
Compiler: source-to-source optimizations and OpenCL generation
Output: optimized OpenCL, compiled by the vendor OpenCL backend (Altera/Xilinx) into system executables
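A small sketch of what such input can look like follows. The OpenMP-like pragma style is what the slide describes, but the exact directive spellings below (#pragma ACCEL ...) are assumptions for illustration, not taken from Merlin documentation.

#define LEN 1024

/* Illustrative only: plain C plus high-level pragmas; the compiler is expected
   to perform the source-to-source optimization and OpenCL generation. */
#pragma ACCEL kernel
void vadd(const float a[LEN], const float b[LEN], float c[LEN])
{
#pragma ACCEL pipeline
    for (int i = 0; i < LEN; i += 8) {
#pragma ACCEL parallel factor=8
        for (int j = 0; j < 8; j++)
            c[i + j] = a[i + j] + b[i + j];
    }
}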
[Table: execution times for Blackscholes, Denoise, LogisticRegr, MatMult, NAMD, Normal, and TwoNN under the Merlin Compiler, the initial OpenCL, and manually optimized OpenCL (some entries NA); the average row reports relative speedups of 1x, 21x, and 1.3x.]
Concluding Remarks
Compilation support
Runtime management
NSF Expeditions in Computing and C-FAR Center under the STARnet Program
Industry support: Baidu, Falcon Computing, Fujitsu Labs, Huawei, Intel, Mentor Graphics
CDSC faculty: Aberle (UCLA), Palsberg (UCLA), Baraniuk (Rice), Potkonjak (UCLA), Bui (UCLA), Reinman (UCLA), Chang (UCLA), Sadayappan (Ohio State), Cheng (UCSB), Sarkar (Associate Director, Rice), Cong (Director, UCLA), Vese (UCLA)
Dr. Peng Li (UCLA), Hao Yu (UCLA), Yuting Chen (UCLA/Google), Zhenman Fang (UCLA), Yuxin Wang (PKU/Falcon), Hui Huang (UCLA/Google), Di Wu (UCLA/Falcon), Yi Zou (UCLA/Arista), Muhuan Huang (UCLA/Falcon), Bingjun Xiao (UCLA/Google), Wei Zuo (UIUC)