Cong

10/13/16
Customizable Computing:
from Single-Chips to Datacenters
Jason Cong
Chancellors Professor, UCLA
Director, Center for Domain-Specific Computing
cong@cs.ucla.edu
http://cadlab.cs.ucla.edu/~cong
Our Research Focus: Customization and Specialization

for Energy Efficient Computing
suns surface
1000
Power doubles every 4 years
rocket
nozzle
nuclear reactor
100
Parallelization
Customization
Watts/cm2
Pentium 4
Pentium III
hot plate
10
Pentium II
Adapt the architecture to

application domain
Pentium Pro
Pentium
i386
1
1.5
i486
1.0
0.7
0.5
0.35
0.25
0.18
0.13
0.10
0.07
Based on Fred Pollack (Intel) and Michael Taylor (UCSD)

2
10/13/16
2009 NSF Expeditions in Computing Project
Overview of Our Approach -- Customized Computing with

Accelerator-Rich Architectures
Extensive
use of dedicated and composable accelerators
Most computations are carried on accelerators not on processors!
A fundamental departure from von Neumann architecture
Why now?
Previous architectures are device/transistor limited
Von Neumann architecture allows maximum device reuse
One pipeline serves all functions, fully utilized
Future architectures
Plenty of transistors, but power/energy limited (dark silicon)
Customization and specialization for maximum energy efficiency
A story of specialization
4
10/13/16
Lessons from Nature:

Human Brain and Advance of Civilization
High power efficiency (20W) of human brain comes from specialization

Different region responsible for different functions
Remarkable advancement of civilization also from specialization

More advanced societies have higher degree of specialization
Intels $16.7B Acquisition of Altera
Intel CEO Brian Krzanich noted, The acquisition will couple Intels leadingedge products and manufacturing process with Alteras leading fieldprogrammable gate array (or FPGA) technology. He further stated, The
combination is expected to enable new classes of products that meet
customer needs in the data center and Internet of Things market segments.
FALCON CONFIDENTIAL
10/13/16
Levels of Customization
Single-chip
level
Require new processor designs, e.g. using fixed-function or

composable accelerators
Server
node level
Host CPU + FPGA via PCI-e or QPI connections

Data
center level
Clusters of heterogeneous computing nodes
Chip-Level Customization:
Accelerator-Rich Architectures (ARA)
Hybrid L2 Cache with
STTRAM + SRAM
Buffer in NUCA
[ISLPED 12]
[DATE 12]
Adaptive L1 Cache
+ SPM/Buffer
[ISLPED 11]
RF-interconnects
Now the full-system ARA simulator

PARADE [ICCAD 15] is open source
DIMM
AIM DIMM
[DAC 12, ISLPED 12, DAC 14]
DIMM
AIM DIMM
ARC, CHARM, CAMEL
DIMM
AIM DIMM
Accelerator in
Memory
DIMM
AIM DIMM
[HPCA 08 best paper]

Hybrid NoC [DAC 15]
[JESTCS 12]
RF-interconnects improves DRAM BW
10/13/16
Latest ARA Work: Unified Address

Translation Support [HPCA 17]
Inefficiency in Todays ARA Address Translation

Core
Core
TLB
TLB
MMU
MMU
Accel
Accel
IOMMU
Interconnect
Main Memory
#1 Inefficient TLB Support.

Figure
3: A detailed look at an
TLBs are not specialized to provide
withlow-latency
system and
IOMMU
(baseline)
capture page
locality
Accel Datapath
Scratchpad Memory
Todays ARA
address translation
using IOMMU with
IOTLB (e.g. 32-entries)
Memory Interface
IOTLB
IOMMU IOTL
To Memory SystemIOMMU only achieves

12% performance of ideal
address translation
accelerator connecting
dress translation.
#2 High Page Walk Latency.
The
rest
ofmiss,
this4 main
paper
is organized as follows. SecOn an
IOTLB
memory
tion accesses
2 characterizes
address
translation behaviors of
are required to walk page table.
customized accelerators to motivate our design. Section 3 explains our simulation methodology and workloads. Section 4 details the design and evaluation of the
proposed architectural support for address translation.
Section 5 discusses more use cases. Section 6 summarizes related work and Section 7 concludes the paper.
2. CHARACTERIZATION OF
Figure 4: TLB miss behav
10
Figure 5: TLB miss trace of a

BlackScholes
10/13/16
Characteristic #1: Performance can be Highly

Sensitive to Address Translation Latency
On average, a translation latency
of 64 cycles can only achieves
70% of the ideal performance.
For the most sensitive app LPCIP,
it achieves only 40% performance.
Must provide efficient TLB design and page walk design

11
Characteristic #2: Regular Bulk Transfer of

Consecutive Data
TLB miss
behavior of
BlackScholes
Such regular access behavior presents opportunities for relatively

simple designs in supporting address translation
12
10/13/16
Characteristic #3: Impact of Data Tiling

Breaking a Page into Multiple Accelerators
Each plane is
a single page
Example: rectangular tiling on a 32 *

32 * 32 data array into 16 * 16 *16 tiles.
31
Each tile accesses 16 -pages
and can be mapped to a different
accelerator for parallel processing
Multiple data tiles (from the same set of pages) are often mapped to
different accelerator instances for parallelism.
Thus, a shared TLB can be very helpful
13
Our Two-Level TLB Design

Accel
TLB
Accel
TLB
Accel
Accel
TLB
TLB
32-entry private TLB

512-entry shared TLB
Utilization wall limits the number of
simultaneously powered accelerators
Shared TLB
To IOMMU
Figure 10: The structure of a shared level-two TLB

miss-under-miss targeting individual applications. We
leave these for future work.
4.3
A Shared Level-Two TLB
Figure 10 depicts the basic structure of our two-level

TLB design, and illustrates two orthogonal benefits provided from adding a shared TLB. First, the requested
entry is previously inserted to the shared TLB by a
request from the same accelerator. This is the case
when the private TLB size is not sufficient to capture the reuse distance so that the requested entry is
evicted from the private TLB earlier. Specifically, medical imaging benchmarks would benefit from having a
large shared TLB. Second, the requested entry is previously inserted to the shared TLB by a request from
another accelerator. This case is common when data
tile size is smaller than a memory page and neighboring tiles within a memory page are mapped to dierent
accelerators (illustrated in Section 2.3). Once an entry
is brought into the shared TLB by one accelerator, it is
immediately available to other accelerators, leading to
shared TLB hits. Requests that also miss in the shared
TLB need to access IOMMU for page table walking.
Still only achieves half
Figure the
11:ideal
Page
walk reduction compared to the
performance
IOMMU baseline3
=> need to improve page
latency.walker
However,
we find the benefit completely outdesign
weighs the added access latency for the current configuration. Much larger TLB sizes or more sharers (we will
discuss this in Section 5.1) could benefit from a banked
design, but such use cases are not well established.
Correctness issues. In addition to the TLB shootdown support in private TLBs, the shared TLB also
needs to be checked for invalidation. The reason is
in a mostly-inclusive policy, entries 14
that are previously
brought in can be present in both levels.
4.3.2
Evaluation
In contrast to private TLBs where low access latency

is the key, the shared TLB mainly aims to reduce the
number of page walks in two ways: (1) providing a
larger capacity to capture the page locality for applications which is difficult to achieve in private TLBs without sacrificing access latency, (2) reducing TLB misses
on common virtual pages by enabling translation sharing between concurrent accelerators.
10/13/16
Page Walker Design Alternatives

#1 Improve the IOMMU design to reduce page walk latency
Need to design a more complex IOMMU, e.g., GPU MMU with parallel
page walker [HPCA 14]
#2 Leverage host core MMU that launches accelerators

Very simple but very efficient because:
64-bit virtual address

in 4-level page walk
Host core has MMU

cache and data cache
Common practice:
host core is idle after

accelerator offloading,
leaving MMU cache and
data cache warmed up
L4
L3
L2
Upper-level entries:
Common entries are
cached in MMU caches
L1
Page offset
L1 entries:
Data cache line has
prefetching effects
One cache line potentially holds

entries to 8 consecutive pages
15
Final Proposal: Two-Level TLB + Host Page Walk [HPCA 17]
On average: 7.6X speedup over nave IOMMU

design, only 6.4% gap between ideal translation
16
10/13/16
Single-chip
level

Server
node level

Data
center level
Clusters of heterogeneous computing nodes
17
Modern CPU-FPGA Platforms

White Paper
Advance
Coherent Accelerator Processor Interface (CAPI) for POWER8 Systems
Figure 1: CAPI Hardware Ecosystem
Commodity Intel Server

Intel
Xeon
Processor
Intel I/O
Subsystem
Intel
Convey FPGA-based coprocessor

Application
Engine Hub
(AEH)
Application Engines
(AEs)
Memory
Controller
Hub (MCH)
Memory
Standard Intel x86-64

Server
x86-64 Linux
Equally important is the translation function that the CAPP and PSL provide for the accelerator. The
accelerator uses the same virtual memory addressing space as the core application that enables it. The
CAPP and PSL handle all virtual-to-physical memory translations, simplifying the programming model and
freeing the accelerator to do the number crunching directly on the data it receives.
Direct
Data
Port
In addition, the PSL contains a 256 KB resident cache on behalf of the accelerator. Based on the needs
of the algorithm, the accelerator can direct the use of the cache via the type of memory accesses
(reads/write) as cacheable or noncacheable. Fundamentally, the rich set of commands that the
accelerator can send to the PSL reflects all of the types of memory accesses available to the POWER8
cores themselves (such as reads, reads with intent-to-modify, reservations, locks, writes, and writes to
highest point of coherency).
The accelerator runs hand-in-hand with an application running on the core as shown in Figure 2:
Dedicated Process Model.
Memory
Convey coprocessor
FPGA-based
Shared cache-coherent memory
29 September 2014
Page 6
10/13/16
C/C++ Based Synthesis for Accelerator Design
xPilot (UCLA 2006) -> AutoPilot (AutoESL) -> Vivado HLS (Xilinx 2011-)
Design Specification
C/C++/SystemC
Compilation &
Elaboration
AutoPilotTM
Code transformation & opt
Behavioral & Communication

Synthesis and Optimizations
RTL HDLs &

RTL SystemC
ESL Synthesis
Common Testbench
Simulation, Verification, and Prototyping
User Constraints
Platform-based C to RTL
synthesis
Synthesize pure ANSI-C and C+

+, GCC-compatible compilation
flow
Full support of IEEE-754 floating

point data types & operations
Platform
Characterization
Library
Timing/Power/Layout
Constraints
FPGA
or ASIC blocks
Efficiently handle bit-accurate

fixed-point arithmetic
SDC-based scheduling
Automatic memory partitioning
QoR matches or exceeds manual

RTL for many designs
Developed by AutoESL, acquired by Xilinx in Jan. 2011

19
Vivado HLS Become the Most Widely Used HLS System
20
10
10/13/16
Recent Research - Optimizations Beyond HLS

Input Code(C/C++)
Program Analysis
Loop Structure
Optimization
Improving Polyhedral Code Generation

for High-Level Synthesis CODES-ISSS14
Best Paper Award
Loop
Restructuring
Code Generation
Data Layout
Optimization
Inter-Module
Optimization
Polyhedral-Based Data Reuse

Optimization for Configurable Computing
FPGA13 Best Paper Award
Array Partitioning
Theory and Algorithm for Generalized

Memory Partitioning in High-Level
Synthesis, FPGA14
Module Selection/
replication
An optimal microarchitecture for stencil

computation acceleration based on nonuniform partitioning of data reuse buffers
(DAC14)
Communication
Optimization
Module-level
Scheduling
Combining Computation with

Communication Optimization in System
Synthesis for Streaming Applications,
FPGA14
Data Reuse
21
CMOST: Fully Automated Compilation and Mapping Flow

[DAC 2015]
Applica.on: C/C++/OpenMP4.0
User Direc.ves
PlaIorm Spec.
System Op3miza3on
Program Analysis
Task graph
hardware
model
design parameters
OpenCL
design analysis/impl. report
System Genera3on
Module templates,
System IP templates
(C/RTL)
On-board
executable HW/SW
Retargetable and op.mized

OpenCL source code
11
10/13/16
Further Raise the Level of Abstraction:

Use of Domain-Specific Languages
Example: Caffeine -- HW/SW Co-Optimization
for Deep Learning [ICCAD 16]
23
Convolutional Neural Network (CNN)

feature maps
feature maps
feature maps
feature maps
input image
output
category
Inference: A feedforward computation

Convolutional layer
Max-pooling is optional
Input feature map
Output
feature map
24
12
10/13/16
Challenges of Using FPGA for Deep Learning
Programmability
Network definitions (& layer parameters) change according to application
CNNs are getting deeper, more precision achieved with deeper network
Network parameters vary across layers, many user definable parameters
Hardware has limited programmability
Re-synthesize bitstream for each network? Not preferred
Re-program FPGA for each layer? NO!
30
28.2
152 layers
160
25.8
140
25
120
20
100
16.4
15
80
11.7
19 layers
10
shallow
8 layers
8 layers
22 layers
7.3
6.7
60
40
3.57
20
0
2010
2011
2012
2013
2014
2015
25
2016
Challenges of Using FPGA for Deep Learning (1)

Programming
model
D4J etc.
TensorFlow
Caffe
Industry:
Stability
scale & speed
Data integration
Relatively fixed
Torch
Theano
Neon
Caffe framework
Network definitions
Layer definitions
CONV
FCN
POOL
ReLU
MKL, BLAS
cuDNN
CPU_forward
GPU_forward
Research:
Flexible
Fast iteration
Debuggable
Relatively bare bone
Quite different for
various layers, &
various networks
26
13
10/13/16
Convolution Optimization on FPGA

(Tile loop)
1 for(row=0; row<R; row+=Tr) {

2 for(col=0; col<C; col+=Tc) {
3
for(to=0; to<M; to+=Tm) {
4
for(ti=0; ti<N; ti+=Tn) {
(Tile loop)
(Tile loop)
(Tile loop)
Off-chip Data Transfer: Memory Access Optimization

On-chip Data: Computation Optimization
5
6
7
8
9
10
for(trr=row; trr<min(row+Tr, R); trr++) {

(Point loop)
for(tcc=col; tcc<min(tcc+Tc, C); tcc++) {
(Point loop)
for(too=to; too<min(to+Tm, M); too++) {
(Point loop)
for(tii=ti; tii<(ti+Tn, N); tii++) {
(Point loop)
for(i=0; i<K; i++) {
(Point loop)
for(j=0; j<K; j++) {
(Point loop)
output_fm[to][row][col] +=
weights[to][ti][i][j]*input_fm[ti][S*row+i][S*col+j];
}}}}}}
27
}}}}
Feedforward Processing on CPU+FPGA Platform
Layer-by-layer feedforward computation
No CPU interruptions are required until final prediction

CPU
Applications (with CNNs)

High-level Network Description
Automation Flow to
Unified Representation
PCIe
FPGA
Customized Accelerator

DRAM
Instruction
Region
Weight & bias

Region
Feature map
Region
28
14
10/13/16
Challenges of Using FPGA for Deep Learning (2)
Performance
Convolution layer & Fully connected layer
Convolution layer -- Compute Intensive

Fully Connected layer -- Communication Intensive
Hardware has limited bandwidth resource
Convolution layer -- extensive previous study
Fully connected layer -- How to maximize bandwidth utilization?
CONV + FCN -- How to re-use hardware for both kernels?
1
29
Mapping Fully-Connected Layer to Convolution

Input-Major Mapping
+
Weights
Input Vectors
Output Vectors
Convolution
Input feature map
Convolution
Weights
Convolution
Output feature map
30
15
10/13/16
Mapping Fully-Connected Layer to Convolution

Weight-Major Mapping
+
Weights
Input Vectors
Output Vectors
Convolution
Input feature map
Convolution
Weights
Convolution
Output feature map
31
Our Solution: Caffeine [ICCAD 16]

Software/Hardware Co-designed FPGA-based deep learning library
CNN Network definitions

Challenge 1: Programmability
1. Network changes according to applications
2. Parameter varies across layers
Automation flow
SW
CONV

FCN

POOL
ReLU

Challenge 2: Performance
1. Re-use same HW for both CONV&FCN
2. Deal with both computation & bandwidth
bound kernels
Uniformed Representation
FPGA-based Accelerator Design
HW
Computation
Optimization
Bandwidth
Optimization
Challenge 3: Scalability
1. Demand to scale accelerator to larger device
1. Portable HLS implementation

2. Scalable systolic array PE design
32
16
10/13/16
33
n Single-chip
level

n Server
node level

n Data
center level
Clusters of heterogeneous computing node [DAC 16]

How about programming at data center level? [SoCC 16]
34
17
10/13/16
Data Center Energy Consumption is a Big Deal

In 2013, U.S. data centers consumed an estimated 91
billion kilowatt-hours of electricity, projected to increase to
roughly 140 billion kilowatt-hours annually by 2020
50 large power plants (500-megawatt coal-fired)
$13 billion annually
100 million metric tons of carbon pollution per year.
https://www.nrdc.org/resources/americas-data-centers-consuming-and-wasting-growingamounts-energy)
35
Extensive Efforts on Improving Datacenter

Energy Efficiency
Understand
the scale-out workloads
ISCA10, ASPLOS12
Mismatch between workloads and processor designs;
Modern processors are over-provisioning
Trade-off
of big-core vs. small-core
ISCA10: Web-search on small-core with better energy-efficiency

Baidu taps Mavell for ARM storage server SoC
36
18
10/13/16
Datacenter Level Integration at Microsoft
FPGA
A. Putnam, A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, ISCA2014
37
Focus of Our Study

n Evaluation
of different integration options of

heterogeneous technologies in datacenters
n Efficient
programming and runtime support for

heterogeneous datacenters
38
19
10/13/16
LR: logistic regression

KM: k-mean clustering
Results
Normalized to reference Xeon
performance
5.26
3.13
7.8
Atom: Intel D2500 1.8GHz

8XA9
ARM
8X ATOM
ARM:
in Zynq 800MHz
Average
power consumption
(Entire system including DRAM and HDD)
Xeon: 175W/node
3.1
Benchmarks (MLLib)
KM
5.21
Xeon: Intel E5 2620

12 Core CPU 2.40GHz
4.88
784 Features, 10 Labels
6.86
MNIST 700K Samples
LR
Baselines
NORMALIZED
EXECUTION TIME
Data set
NORMALIZED
ENERGY
10.97
Small-core on Compute-intensive Workloads
Atom: 30W/node
8X ARM
8X ATOM
ARM:
10W/node (embedded
CPU)
39
Small Cores Alone Are Not Efficient!
40
20
10/13/16
Small Core + ACC: FARM

Boost
Small-core Performance with FPGA
- 8 Xilinx ZC706 boards

- 24-port Ethernet switch
- ~100W power
41
1.06
7.8
5.26
3.13
0.69
8X ZYNQ
8X ARM
8X ATOM
0.66
0.43
Normalized to reference singlenode Xeon performance
8X ATOM
3.1
Results
8X ARM
KM
5.21
LR
4.88
Power consumption (averaged)

Atom: 30W/node
ARM: 10W/node
6.86
Data set
MNIST 700K Samples
NORMALIZED
EXECUTION TIME
Setup
NORMALIZED
ENERGY
10.97
Small-core with FPGA Performance
8X ZYNQ
42
21
10/13/16
Small Cores + FPGAs Are More Interesting!
43
Inefficiencies in Small-core
Slower
core and memory clock
Task scheduling is slow

JVM-to-FPGA data transfer is slow
Limited
DRAM size and Ethernet bandwidth
Slow data shuffling between nodes

Another
option: Big-core + FPGA
44
22
10/13/16
Big-Core + ACC: CDSC FPGA-Enabled Cluster

A
24-node cluster with FPGA-based accelerators
Run on top of Spark and Hadoop (HDFS)

Alpha Data board:
1. Virtex-7 FPGA
2. 16GB on-board
RAM
1 master /
driver
1 10GbE switch
Each node:
1. Two Xeon processors
2. One FPGA PCIe card
(Alpha Data)
3. 64 GB RAM
4. 10GBE NIC
22 workers
1 file server
45
Experimental Results
NORMALIZED
ENERGY
1.06
0.69
0.5
1X XEON+AD
8X ZYNQ
0.66
Normalized to reference
Xeon performance
1X XEON+AD
0.43
Results
0.56
MNIST 700K Samples

0.38
Data set
0.33
setup
NORMALIZED
EXECUTION TIME
Experimental
8X ZYNQ
46
23
10/13/16
Overall Evaluation Results

Based
on two machine learning workloads
Normalized performance (speedup), and energy efficiency

(performance/W) relative to big-core solutions
Performance
EnergyEfficiency
Big-Core+FPGA
Best | 2.5
Best | 2.6
Small-Core+FPGA
Better | 1.2
Best | 1.9
Big-Core
Good | 1.0
Good | 1.0
Small-Core
Bad | 0.25
Bad | 0.24
47
More is Needed for

Data Center Level Deployment
48
24
10/13/16
Deploying Accelerators in Datacenters

How to program with
your accelerators?
Big data application
developer
Accelerator designer
How to install my
accelerators?
How to acquire
accelerator resource ?
Challenges:
1. Query available
platforms
1. Heterogeneous platforms
Cloud service provider
2. Express HW-specific
2. Accelerator locality
requirements
Challenges:
49
Blaze: Accelerator-as-a-Service [SoCC 16]

GPU
FPGA
Client
RM
GAM
Accelerator status
GAM
NAM
NM
NAM
AM
NM
NAM
Container
Container
Global Accelerator Manager

Accelerator locality aware scheduling
RM: Resource
Manager
Node Accelerator Manager
NM: Node Manager
Local accelerator service management, AM: Application
JVM-to-ACC communication optimization Master
50
25
10/13/16
Blaze Deployment Flow

Register Accelerators
Interface to add
accelerator service to
corresponding nodes
Request Accelerators
Use acc_id as label
User Application
ACC Labels
ACC Invoke
Input data
Output data
GAM allocates
corresponding nodes
to applications
Containers
Global ACC Manager

Container Info
ACC Info
Node ACC
Manager
FPGA
GPU
ACC
51
Programming Efforts Reduction Using Blaze
Applica4ons
Lines of code reduction for

Accelerator Management
Logistic Regression (LR)
325 0
Kmeans (KM)
364 0
Compression (COMP)
360 0
Genome Sequence Alignment

(GSA) [HotCloud16]
896 0
52
26
10/13/16
Overall Performance with Accelerators

(Integrated with Blaze)
Logis4c Regression
200
Task time
Time (sec)
Time (sec)
150
150
App time
100
50
8Nx12T
CPU
150
Task time
60
0
Kmeans
App time
90
60
30
scheduler
30
4Nx12T
FPGA
Time (sec)
120
Time (sec)
12Nx12T
CPU
compute
90
0
4Nx12T
CPU
shue
120
4Nx12T 4Nx12T
CPU
FPGA
200
shue
150
compute
100
scheduler
50
0
0
4Nx12T
CPU
8Nx12T
CPU
12Nx12T
CPU
data load and

preprocessing
4Nx12T 4Nx12T
CPU
FPGA
4Nx12T
FPGA
data load and

preprocessing
1 server with FPGA oers the same throughput of 3 servers
53
Falcon Computing Solutions, Inc.

http://www.falcon-computing.com
ACC: accelerator
User Applications in
MapReduce/Spark/Hadoop
+
Java/C/C++/OpenMP
Merlin
Compiler
FCS
ACC Engines
Overall Computing Solutions

ACC
Models
ACC
Kestrel
Runtime
Libraries
The only solution of FPGA customization and
virtualization for Datacenter acceleration!
Customize &
Virtualize
Copyright 2016 Falcon Computing Solutions
54
27
10/13/16
Merlin Compiler
C/C++ with pragmas
C-based design flow
OpenMP-like high-level
programming model
Automatic optimizations for
Compiler
Source-to-source
Optimizations
productivity and QoR
OpenCL Generation
Same input for multi-vendors and

multi-platforms
Optimized OpenCL
OpenCL backend
(Altera/Xilinx)
System executables
55
Sample Compilation Results

Design
Merlin
Compiler
Initial
OpenCL
Manual Optimized
OpenCL
Blackschole
Denoise
0.34ms
0.08s
11ms
3.8s
NA
NA
LogisticRegr
MatMult
NAMD
Normal
TwoNN
94ms
0.8ms
26ms
4ms
1.23s
3.7s
1.9ms
51ms
52ms
1.70s
94ms
0.8ms
26ms
10ms
NA
1x
21x
1.3x
Average
56
28
10/13/16
Concluding Remarks
New era of computing

Accelerator-centric computing
Need efficient support for customization and specialization
Customization at all levels

Chip-level
Server node level
Data center level
Data center level customization holds great promise

Thats where workload aggregates
Software is the key

Programming models
Hadoop/MapReduce or SPARK (+ C/C++), OpenMP, OpenCL,,
Compilation support
Runtime management
57
Acknowledgements NSF and C-FAR
NSF Expeditions in Computing and C-FAR Center under the STARnet Program
Industry support: Baidu, Falcon Computing, Fujitsu Labs, Huawei, Intel, Mentor Graphics
CDSC faculty:
Aberle
(UCLA)
Palsberg
(UCLA)
Baraniuk
(Rice)
Potkonjak
(UCLA)
Bui
(UCLA)
Reinman
(UCLA)
Chang
(UCLA)
Sadayappan
(Ohio-State)
Cheng
(UCSB)
Sarkar
(Associate Dir)
(Rice)
Cong (Director)
(UCLA)
Vese
(UCLA)
58
29
10/13/16
Postdocs, Graduate Students, and Collaborators
Prof. Deming Chen

(UIUC/ADSC)
Dr. Peng Li
(UCLA)
Hao Yu
(UCLA)
Yuting Chen
(UCLA/Google)
Prof. Louis-Nol Pouchet

(UCLA/colostate)
Zhenman Fang
(UCLA)
Yuxin Wang
(PKU/Falcon)
Dr. Peng Zhang

(UCLA/Falcon)
Hui Huang
(UCLA/Google)
Di Wu
(UCLA/Falcon)
Yi Zou (UCLA/
Arista)
Muhuan Huang
(UCLA/Falcon)
Bingjun Xiao
(UCLA/Google)
Wei Zuo
(UIUC)
59
60
30

Cong

Transféré par

Informations du document

Titre original

Copyright

Formats disponibles

Partager ce document

Partager ou intégrer le document

Options de partage

Avez-vous trouvé ce document utile ?

Ce contenu est-il inapproprié ?

Droits d'auteur :

Formats disponibles

Cong

Transféré par

Droits d'auteur :

Formats disponibles

10/13/16

Our Research Focus: Customization and Specialization

Power doubles every 4 years

Adapt the architecture to

Based on Fred Pollack (Intel) and Michael Taylor (UCSD)

2009 NSF Expeditions in Computing Project

Overview of Our Approach -- Customized Computing with

use of dedicated and composable accelerators

Most computations are carried on accelerators not on processors!

A fundamental departure from von Neumann architecture

Lessons from Nature:

High power efficiency (20W) of human brain comes from specialization

Remarkable advancement of civilization also from specialization

Intels $16.7B Acquisition of Altera

Require new processor designs, e.g. using fixed-function or

Host CPU + FPGA via PCI-e or QPI connections

Clusters of heterogeneous computing nodes

Now the full-system ARA simulator

[DAC 12, ISLPED 12, DAC 14]

ARC, CHARM, CAMEL

[HPCA 08 best paper]

RF-interconnects improves DRAM BW

Latest ARA Work: Unified Address

Inefficiency in Todays ARA Address Translation

#1 Inefficient TLB Support.

To Memory SystemIOMMU only achieves

Figure 4: TLB miss behav

Figure 5: TLB miss trace of a

Characteristic #1: Performance can be Highly

Must provide efficient TLB design and page walk design

Characteristic #2: Regular Bulk Transfer of

Such regular access behavior presents opportunities for relatively

Characteristic #3: Impact of Data Tiling

Example: rectangular tiling on a 32 *

Our Two-Level TLB Design

32-entry private TLB

Figure 10: The structure of a shared level-two TLB

A Shared Level-Two TLB

Figure 10 depicts the basic structure of our two-level

Still only achieves half

=> need to improve page

In contrast to private TLBs where low access latency

Page Walker Design Alternatives

#2 Leverage host core MMU that launches accelerators

64-bit virtual address

Host core has MMU

host core is idle after

One cache line potentially holds

Final Proposal: Two-Level TLB + Host Page Walk [HPCA 17]

On average: 7.6X speedup over nave IOMMU

Require new processor designs, e.g. using fixed-function or

Host CPU + FPGA via PCI-e or QPI connections

Clusters of heterogeneous computing nodes

Modern CPU-FPGA Platforms

Coherent Accelerator Processor Interface (CAPI) for POWER8 Systems

Figure 1: CAPI Hardware Ecosystem

Commodity Intel Server

Convey FPGA-based coprocessor

Standard Intel x86-64

C/C++ Based Synthesis for Accelerator Design

Code transformation & opt

Behavioral & Communication

RTL HDLs &