
Customizable Computing:
from Single-Chips to Datacenters
Jason Cong
Chancellor's Professor, UCLA
Director, Center for Domain-Specific Computing

cong@cs.ucla.edu
http://cadlab.cs.ucla.edu/~cong

Our Research Focus: Customization and Specialization for Energy-Efficient Computing
[Figure: power density (Watts/cm2) of Intel processors from the i386 and i486 through the Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4, plotted across process generations (1.5 um down to 0.07 um). Power doubles every 4 years, passing the power density of a hot plate and heading toward that of a nuclear reactor, a rocket nozzle, and the sun's surface. Two ways out: parallelization and customization, i.e., adapt the architecture to the application domain. Based on Fred Pollack (Intel) and Michael Taylor (UCSD).]



2009 NSF Expeditions in Computing Project

Overview of Our Approach: Customized Computing with Accelerator-Rich Architectures

Extensive use of dedicated and composable accelerators
- Most computations are carried out on accelerators, not on processors!
- A fundamental departure from the von Neumann architecture

Why now?
- Previous architectures were device/transistor limited
  - The von Neumann architecture allows maximum device reuse: one pipeline serves all functions, fully utilized
- Future architectures have plenty of transistors, but are power/energy limited (dark silicon)
  - Customization and specialization for maximum energy efficiency

A story of specialization

Lessons from Nature: Human Brain and Advance of Civilization

- The high power efficiency (20 W) of the human brain comes from specialization: different regions are responsible for different functions
- The remarkable advancement of civilization also comes from specialization: more advanced societies have a higher degree of specialization

Intel's $16.7B Acquisition of Altera

Intel CEO Brian Krzanich noted, "The acquisition will couple Intel's leading-edge products and manufacturing process with Altera's leading field-programmable gate array (FPGA) technology." He further stated, "The combination is expected to enable new classes of products that meet customer needs in the data center and Internet of Things market segments."


Levels of Customization

- Single-chip level: requires new processor designs, e.g., using fixed-function or composable accelerators
- Server node level: host CPU + FPGA via PCIe or QPI connections
- Data center level: clusters of heterogeneous computing nodes

Chip-Level Customization: Accelerator-Rich Architectures (ARA)

[Figure: ARA chip overview, annotated with the components studied in our prior work]
- Hybrid L2 cache with STT-RAM + SRAM [ISLPED 12]
- Buffer in NUCA [DATE 12]
- Adaptive L1 cache + SPM/buffer [ISLPED 11]
- RF-interconnects improve DRAM bandwidth [HPCA 08 best paper]
- Accelerator-in-Memory (AIM) DIMMs [JESTCS 12]
- ARC, CHARM, CAMEL composable-accelerator architectures [DAC 12, ISLPED 12, DAC 14]
- Hybrid NoC [DAC 15]

The full-system ARA simulator PARADE [ICCAD 15] is now open source.


Latest ARA Work: Unified Address Translation Support [HPCA 17]

Inefficiency in Today's ARA Address Translation

[Figure: baseline system. Each core has its own TLB and MMU; the accelerators (each with a datapath, scratchpad memory, and memory interface) share a single IOMMU behind the interconnect to main memory.]

#1 Inefficient TLB support
- TLBs are not specialized to provide low latency and to capture page locality
- Today's ARAs translate addresses through the IOMMU with a small IOTLB (e.g., 32 entries)
- This IOMMU-only baseline achieves just 12% of the performance of ideal address translation
#2 High page walk latency
- On an IOTLB miss, 4 main memory accesses are required to walk the page table


Characteristic #1: Performance Can Be Highly Sensitive to Address Translation Latency

- On average, a translation latency of 64 cycles achieves only 70% of the ideal performance
- The most sensitive application, LPCIP, achieves only 40% of the ideal performance

=> Must provide an efficient TLB design and an efficient page walk design



Characteristic #2: Regular Bulk Transfer of Consecutive Data

[Figure: TLB miss trace of BlackScholes, showing regular misses over consecutive pages]

Such regular access behavior presents opportunities for relatively simple designs in supporting address translation.


Characteristic #3: Impact of Data Tiling (Breaking a Page Across Multiple Accelerators)

Example: rectangular tiling of a 32 x 32 x 32 data array into 16 x 16 x 16 tiles. Each 32 x 32 plane is a single page, so each tile accesses 16 pages and can be mapped to a different accelerator for parallel processing.

Multiple data tiles (from the same set of pages) are often mapped to different accelerator instances for parallelism. Thus, a shared TLB can be very helpful.

Our Two-Level TLB Design

[Figure: each accelerator has a private TLB; all private TLBs are backed by one shared TLB, which in turn goes to the IOMMU]

- 32-entry private TLB per accelerator
- 512-entry shared TLB
- The utilization wall limits the number of simultaneously powered accelerators
From the paper's Section 4.3 (Figure 10 shows the structure of the shared level-two TLB): adding a shared TLB provides two orthogonal benefits. First, the requested entry may have been inserted into the shared TLB earlier by a request from the same accelerator. This is the case when the private TLB is too small to capture the reuse distance, so the requested entry was evicted from the private TLB earlier; the medical imaging benchmarks in particular benefit from a large shared TLB. Second, the requested entry may have been inserted by a request from another accelerator. This case is common when the data tile size is smaller than a memory page and neighboring tiles within a page are mapped to different accelerators (Characteristic #3 above). Once an entry is brought into the shared TLB by one accelerator, it is immediately available to the others, leading to shared-TLB hits. Requests that also miss in the shared TLB go to the IOMMU for a page table walk.

Result: the two-level TLB still achieves only about half of the ideal performance (Figure 11: page walk reduction compared to the IOMMU baseline) => we need to improve the page walker design.

The shared TLB adds access latency, but its benefit completely outweighs that cost for the current configuration. Much larger TLB sizes or more sharers could benefit from a banked design, but such use cases are not well established. Correctness: in addition to the TLB shootdown support in the private TLBs, the shared TLB must also be checked on invalidation, because under a mostly-inclusive policy an entry can be present in both levels.

Evaluation: in contrast to the private TLBs, where low access latency is the key, the shared TLB mainly aims to reduce the number of page walks in two ways: (1) it provides a larger capacity to capture page locality, which is difficult to achieve in private TLBs without sacrificing access latency, and (2) it reduces TLB misses on common virtual pages by enabling translation sharing between concurrent accelerators.


Page Walker Design Alternatives

#1 Improve the IOMMU design to reduce page walk latency
- Requires a more complex IOMMU, e.g., a GPU MMU with parallel page walkers [HPCA 14]

#2 Leverage the MMU of the host core that launches the accelerators
- Very simple but very efficient, because:
  - A 64-bit virtual address requires a 4-level page walk
  - The host core has an MMU cache and a data cache
  - Common practice: the host core is idle after accelerator offloading, leaving its MMU cache and data cache warmed up

[Figure: a 4-level page walk over the L4, L3, L2, and L1 levels plus the page offset. Upper-level entries: common entries are cached in the MMU caches. L1 entries: the data cache line has a prefetching effect, since one cache line potentially holds the entries for 8 consecutive pages]
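The arithmetic behind both observations, assuming standard x86-64 paging parameters (4 KB pages, 8-byte PTEs, 64-byte cache lines; these constants are common practice rather than slide content):

\[
\text{48-bit VA} \;=\; \underbrace{9+9+9+9}_{\text{L4, L3, L2, L1 indices}} \;+\; \underbrace{12}_{\text{page offset}}
\;\Rightarrow\; \text{4 memory accesses per page walk}
\]
\[
\frac{64~\text{B cache line}}{8~\text{B per PTE}} \;=\; 8~\text{PTEs per line}
\;\Rightarrow\; 8 \times 4~\text{KB} = 32~\text{KB of consecutive virtual pages covered per line}
\]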

Final Proposal: Two-Level TLB + Host Page Walk [HPCA 17]

On average: 7.6X speedup over the naive IOMMU design, with only a 6.4% gap from ideal address translation.


Levels of Customization

- Single-chip level: requires new processor designs, e.g., using fixed-function or composable accelerators
- Server node level: host CPU + FPGA via PCIe or QPI connections
- Data center level: clusters of heterogeneous computing nodes

Modern CPU-FPGA Platforms

Convey FPGA-based coprocessor (shared cache-coherent memory):
[Figure: a commodity Intel server (Intel Xeon processor, Intel I/O subsystem, memory controller hub (MCH), memory) on a standard Intel x86-64 server running x86-64 Linux, coupled with the Convey coprocessor's application engine hub (AEH) and application engines (AEs)]

IBM Coherent Accelerator Processor Interface (CAPI) for POWER8 systems (from the IBM white paper, "Figure 1: CAPI Hardware Ecosystem"):

"Equally important is the translation function that the CAPP and PSL provide for the accelerator. The accelerator uses the same virtual memory addressing space as the core application that enables it. The CAPP and PSL handle all virtual-to-physical memory translations, simplifying the programming model and freeing the accelerator to do the number crunching directly on the data it receives.

In addition, the PSL contains a 256 KB resident cache on behalf of the accelerator. Based on the needs of the algorithm, the accelerator can direct the use of the cache via the type of memory accesses (reads/writes) as cacheable or noncacheable. Fundamentally, the rich set of commands that the accelerator can send to the PSL reflects all of the types of memory accesses available to the POWER8 cores themselves (such as reads, reads with intent-to-modify, reservations, locks, writes, and writes to highest point of coherency).

The accelerator runs hand-in-hand with an application running on the core, as shown in Figure 2: Dedicated Process Model."


C/C++ Based Synthesis for Accelerator Design

xPilot (UCLA 2006) -> AutoPilot (AutoESL) -> Vivado HLS (Xilinx 2011-)

[Figure: AutoPilot flow. The design specification (C/C++/SystemC) and user constraints go through compilation & elaboration, code transformation & optimization, and behavioral & communication synthesis and optimizations, guided by a platform characterization library and timing/power/layout constraints; the output is RTL HDLs & RTL SystemC plus a common testbench for simulation, verification, and prototyping on FPGA or ASIC blocks]

- Platform-based C-to-RTL synthesis
- Synthesizes pure ANSI C and C++; GCC-compatible compilation flow
- Full support of IEEE-754 floating-point data types & operations
- Efficiently handles bit-accurate fixed-point arithmetic
- SDC-based scheduling
- Automatic memory partitioning
- QoR matches or exceeds manual RTL for many designs
- Developed by AutoESL, acquired by Xilinx in Jan. 2011
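As an illustration of what this C-based flow consumes, here is a minimal sketch of an HLS input kernel. The pragmas follow Vivado HLS directive syntax, but the kernel and the specific factors are invented for this example rather than taken from the slides:

#include <stdio.h>

#define N 1024

/* Dot-product kernel as HLS input: the tool maps the loop to a
 * pipelined datapath instead of instructions on a processor. */
float dot(const float a[N], const float b[N]) {
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=4
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=4
    float sum = 0.0f;
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1   /* request one new iteration per clock */
#pragma HLS UNROLL factor=4 /* 4 multiply-adds per pipeline stage  */
        sum += a[i] * b[i];
    }
    return sum;
}

int main(void) {   /* software testbench; the pragmas are ignored by GCC */
    static float a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }
    printf("dot = %f\n", dot(a, b));   /* expect 2048.0 */
    return 0;
}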

Vivado HLS Has Become the Most Widely Used HLS System


Recent Research: Optimizations Beyond HLS

[Figure: source-to-source flow. Input code (C/C++) -> program analysis -> loop structure optimization (loop restructuring) -> data layout optimization -> inter-module optimization -> code generation, with supporting steps for array partitioning, data reuse, module selection/replication, communication optimization, and module-level scheduling]

- Loop restructuring: "Improving Polyhedral Code Generation for High-Level Synthesis," CODES+ISSS 14, Best Paper Award
- Data reuse: "Polyhedral-Based Data Reuse Optimization for Configurable Computing," FPGA 13, Best Paper Award
- Array partitioning: "Theory and Algorithm for Generalized Memory Partitioning in High-Level Synthesis," FPGA 14
- Data reuse buffers: "An Optimal Microarchitecture for Stencil Computation Acceleration Based on Non-Uniform Partitioning of Data Reuse Buffers," DAC 14
- Communication optimization: "Combining Computation with Communication Optimization in System Synthesis for Streaming Applications," FPGA 14

CMOST: Fully Automated Compilation and Mapping Flow [DAC 2015]

[Figure: CMOST flow. The application (C/C++/OpenMP 4.0), user directives, and a platform spec feed system optimization (program analysis producing a task graph, hardware model, and design parameters) and system generation (module templates and system IP templates in C/RTL); the outputs are OpenCL, design analysis/implementation reports, and an on-board executable HW/SW system]

- Retargetable and optimized OpenCL source code

Further Raise the Level of Abstraction: Use of Domain-Specific Languages

Example: Caffeine, HW/SW co-optimization for deep learning [ICCAD 16]

Convolutional Neural Network (CNN)

[Figure: CNN pipeline from the input image through successive feature maps to the output category]

- Inference: a feedforward computation
- Convolutional layer: computes output feature maps from input feature maps; max-pooling is optional
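For reference, the convolutional layer computes the following, for M output feature maps, N input feature maps, K x K kernels, and stride S (this matches the loop nest shown later in the talk):

\[
\mathrm{output\_fm}[o][r][c] \;=\; \sum_{i=0}^{N-1}\sum_{p=0}^{K-1}\sum_{q=0}^{K-1}
\mathrm{weights}[o][i][p][q]\cdot \mathrm{input\_fm}[i][S\,r+p][S\,c+q],
\qquad 0 \le o < M,\; 0 \le r < R,\; 0 \le c < C.
\]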


Challenges of Using FPGA for Deep Learning

Programmability
- Network definitions (& layer parameters) change according to the application
- CNNs are getting deeper; higher accuracy is achieved with deeper networks
- Network parameters vary across layers, with many user-definable parameters
- Hardware has limited programmability
  - Re-synthesize a bitstream for each network? Not preferred
  - Re-program the FPGA for each layer? NO!

[Figure: ImageNet top-5 classification error vs. year, with network depth. 2010: 28.2% and 2011: 25.8% (shallow); 2012: 16.4% (8 layers); 2013: 11.7% (8 layers); 2014: 7.3% (19 layers) and 6.7% (22 layers); 2015: 3.57% (152 layers)]

Challenges of Using FPGA for Deep Learning (1)

Programming model:
[Figure: deep learning frameworks (Caffe, TensorFlow, Torch, Theano, Neon, D4J, etc.). The Caffe framework stacks network definitions over layer definitions (CONV, FCN, POOL, ReLU) over backends: MKL/BLAS for CPU_forward and cuDNN for GPU_forward]

- Industry: stability, scale & speed, data integration; workloads are relatively fixed
- Research: flexible, fast iteration, debuggable; relatively bare bone
- The computation is quite different across layers and across networks

Convolution Optimization on FPGA

Tile loops drive the off-chip data transfer (memory access optimization); point loops operate on on-chip data (computation optimization):

for (row = 0; row < R; row += Tr) {                       // tile loop
  for (col = 0; col < C; col += Tc) {                     // tile loop
    for (to = 0; to < M; to += Tm) {                      // tile loop
      for (ti = 0; ti < N; ti += Tn) {                    // tile loop
        for (trr = row; trr < min(row + Tr, R); trr++) {          // point loop
          for (tcc = col; tcc < min(col + Tc, C); tcc++) {        // point loop
            for (too = to; too < min(to + Tm, M); too++) {        // point loop
              for (tii = ti; tii < min(ti + Tn, N); tii++) {      // point loop
                for (i = 0; i < K; i++) {                         // point loop
                  for (j = 0; j < K; j++) {                       // point loop
                    output_fm[too][trr][tcc] +=
                      weights[too][tii][i][j] * input_fm[tii][S*trr + i][S*tcc + j];
}}}}}}}}}}
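The tile sizes directly set the on-chip buffer footprint and the total work. A sketch of the usual accounting, assuming the standard sliding-window input layout (these formulas are background, not printed on the slide):

\[
B_{\mathrm{out}} = T_m\,T_r\,T_c, \qquad
B_{\mathrm{wgt}} = T_m\,T_n\,K^2, \qquad
B_{\mathrm{in}} = T_n\,\bigl(S(T_r-1)+K\bigr)\bigl(S(T_c-1)+K\bigr)
\]
\[
\#\mathrm{MAC} = R\,C\,M\,N\,K^2 \quad\text{(multiply-accumulates per layer, independent of the tiling)}
\]

Choosing (Tm, Tn, Tr, Tc) is then a trade-off between these buffer sizes and the off-chip traffic incurred by the tile loops.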

Feedforward Processing on a CPU+FPGA Platform

- Layer-by-layer feedforward computation
- No CPU intervention is required until the final prediction

[Figure: the CPU runs the applications (with CNNs) and a high-level network description, which an automation flow lowers to a unified representation; across PCIe, the FPGA runs the customized accelerator against DRAM regions for instructions, weights & biases, and feature maps]

Challenges of Using FPGA for Deep Learning (2)

Performance
- Convolutional layers are compute intensive; fully connected layers are communication intensive
- Hardware has limited bandwidth resources
- Convolutional layers: extensive previous study
- Fully connected layers: how to maximize bandwidth utilization?
- CONV + FCN: how to re-use the same hardware for both kernels?

Mapping the Fully-Connected Layer to Convolution: Input-Major Mapping

[Figure: FCN weights applied to input vectors produce output vectors; under input-major mapping, the input vectors become the convolution input feature maps, the FCN weights become the convolution weights, and the results are the convolution output feature maps]


Mapping the Fully-Connected Layer to Convolution: Weight-Major Mapping

[Figure: the same FCN computation, but under weight-major mapping the FCN weights become the convolution input feature maps and the input vectors become the convolution weights, again yielding the output vectors as convolution output feature maps]
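The identity underlying both mappings: a fully connected layer is a matrix-vector product, which is a degenerate convolution with 1 x 1 feature maps and 1 x 1 kernels (batching input vectors fills out the feature-map dimensions; this equation is background, not from the slides):

\[
y[o] = \sum_{i=0}^{N-1} W[o][i]\,x[i]
\quad\Longleftrightarrow\quad
\mathrm{output\_fm}[o][0][0] = \sum_{i=0}^{N-1} \mathrm{weights}[o][i][0][0]\cdot\mathrm{input\_fm}[i][0][0]
\qquad (R = C = K = S = 1)
\]

The two mappings differ only in which FCN operand is cast as the convolution's feature maps and which as its weights, which changes the data reuse pattern and hence the achievable bandwidth utilization.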

Our Solution: Caffeine [ICCAD 16]

A software/hardware co-designed FPGA-based deep learning library:

- Challenge 1, programmability (the network changes according to the application; parameters vary across layers): an automation flow from CNN network definitions (CONV, FCN, POOL, ReLU) to a uniformed representation (SW)
- Challenge 2, performance (re-use the same HW for both CONV & FCN; handle both computation-bound and bandwidth-bound kernels): an FPGA-based accelerator design with computation optimization and bandwidth optimization (HW)
- Challenge 3, scalability (demand to scale the accelerator to larger devices): a portable HLS implementation and a scalable systolic-array PE design


Levels of Customization

- Single-chip level: requires new processor designs, e.g., using fixed-function or composable accelerators
- Server node level: host CPU + FPGA via PCIe or QPI connections
- Data center level: clusters of heterogeneous computing nodes [DAC 16]

How about programming at the data center level? [SoCC 16]


Data Center Energy Consumption is a Big Deal

In 2013, U.S. data centers consumed an estimated 91 billion kilowatt-hours of electricity, projected to increase to roughly 140 billion kilowatt-hours annually by 2020. That is the output of 50 large power plants (500-megawatt, coal-fired), a cost of $13 billion annually, and 100 million metric tons of carbon pollution per year.

(Source: https://www.nrdc.org/resources/americas-data-centers-consuming-and-wasting-growing-amounts-energy)

Extensive Efforts on Improving Datacenter Energy Efficiency

- Understand the scale-out workloads [ISCA 10, ASPLOS 12]
  - Mismatch between workloads and processor designs; modern processors are over-provisioned
- Trade-off of big cores vs. small cores
  - ISCA 10: web search on small cores with better energy efficiency
  - Baidu taps Marvell for an ARM storage-server SoC

Datacenter Level Integration at Microsoft

[Figure: FPGAs integrated into datacenter servers]

A. Putnam et al., "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services," ISCA 2014

Focus of Our Study

- Evaluation of different integration options of heterogeneous technologies in datacenters
- Efficient programming and runtime support for heterogeneous datacenters

Small Cores on Compute-Intensive Workloads

- Benchmarks (Spark MLlib): LR (logistic regression), KM (k-means clustering)
- Data set: MNIST, 700K samples, 784 features, 10 labels
- Baselines (average power consumption is for the entire system, including DRAM and HDD):
  - Xeon: Intel E5-2620, 12-core CPU at 2.40 GHz; 175 W/node
  - 8X ATOM: Intel D2500 at 1.8 GHz; 30 W/node
  - 8X ARM: A9 in Zynq at 800 MHz; 10 W/node (embedded CPU)

[Figure: execution time and energy for LR and KM on the 8X ARM and 8X ATOM clusters, normalized to reference Xeon performance. The small-core clusters land between 3.1x and 10.97x worse than the Xeon baseline (bars: 3.1, 3.13, 4.88, 5.21, 5.26, 6.86, 7.8, 10.97)]

Small Cores Alone Are Not Efficient!


Small Core + ACC: FARM

Boost small-core performance with FPGA:
- 8 Xilinx ZC706 boards
- 24-port Ethernet switch
- ~100 W total power

Small Cores with FPGA: Performance

- Setup and data set as before (MNIST, 700K samples, 784 features, 10 labels)
- Average power: Atom 30 W/node, ARM 10 W/node

[Figure: execution time and energy for LR and KM, normalized to reference single-node Xeon performance. With FPGA acceleration, the 8X ZYNQ cluster reaches 0.43 to 1.06 of the Xeon baseline, versus 3.1 to 10.97 for the 8X ARM and 8X ATOM clusters without FPGAs]

Small Cores + FPGAs Are More Interesting!


Inefficiencies in Small Cores

- Slower core and memory clocks
  - Task scheduling is slow
  - JVM-to-FPGA data transfer is slow
- Limited DRAM size and Ethernet bandwidth
  - Slow data shuffling between nodes

Another option: big core + FPGA

Big Core + ACC: CDSC FPGA-Enabled Cluster

A 24-node cluster with FPGA-based accelerators, running on top of Spark and Hadoop (HDFS):
- 1 master/driver, 22 workers, 1 file server, 1 10GbE switch
- Each node: two Xeon processors, one FPGA PCIe card (Alpha Data), 64 GB RAM, 10GbE NIC
- Alpha Data board: Virtex-7 FPGA with 16 GB of on-board RAM

Experimental Results

[Figure: execution time and energy for LR and KM, normalized to reference Xeon performance, with the same setup and data set (MNIST, 700K samples, 784 features, 10 labels). A single Xeon node with an Alpha Data FPGA (1X XEON+AD) reaches 0.33 to 0.56 of the Xeon baseline, beating the 8X ZYNQ cluster's 0.43 to 1.06]

Overall Evaluation Results

Based on two machine learning workloads; normalized performance (speedup) and energy efficiency (performance/W) relative to the big-core solution:

                    Performance     Energy Efficiency
Big-Core+FPGA       Best   | 2.5    Best | 2.6
Small-Core+FPGA     Better | 1.2    Best | 1.9
Big-Core            Good   | 1.0    Good | 1.0
Small-Core          Bad    | 0.25   Bad  | 0.24

More is Needed for Data Center Level Deployment

Deploying Accelerators in Datacenters

- Big data application developers: how do I program with your accelerators?
  Challenges: (1) querying the available platforms, (2) expressing HW-specific requirements
- Accelerator designers: how do I install my accelerators?
- Cloud service providers: how do I acquire accelerator resources?
  Challenges: (1) heterogeneous platforms, (2) accelerator locality

Blaze: Accelerator-as-a-Service [SoCC 16]

[Figure: a client submits to the resource manager (RM), which hosts the Global Accelerator Manager (GAM) and tracks accelerator status; each node manager (NM) hosts a Node Accelerator Manager (NAM) that fronts the node's GPU/FPGA accelerators and serves the application containers of the application master (AM)]

- Global Accelerator Manager (GAM): accelerator-locality-aware scheduling
- Node Accelerator Manager (NAM): local accelerator service management and JVM-to-ACC communication optimization

(RM: Resource Manager; NM: Node Manager; AM: Application Master)

Blaze Deployment Flow

- Register accelerators: an interface adds an accelerator service (FPGA or GPU ACC) to the corresponding nodes through the Node ACC Manager, which reports ACC info to the Global ACC Manager
- Request accelerators: the user application uses the acc_id as a label (ACC labels, ACC invoke, input data, output data)
- The Global ACC Manager uses the container info to allocate containers on the corresponding nodes to the applications

Programming Effort Reduction Using Blaze

Lines of code for accelerator management, before Blaze vs. with Blaze:

Application                                    Before   With Blaze
Logistic Regression (LR)                       325      0
Kmeans (KM)                                    364      0
Compression (COMP)                             360      0
Genome Sequence Alignment (GSA) [HotCloud16]   896      0

Overall Performance with Accelerators (Integrated with Blaze)

[Figure: Logistic Regression and Kmeans, task time and app time in seconds, broken down into data load and preprocessing, shuffle, compute, and scheduler, for CPU configurations of 4, 8, and 12 nodes x 12 threads (4Nx12T, 8Nx12T, 12Nx12T CPU) versus the 4Nx12T FPGA configuration]

One server with an FPGA offers the same throughput as three servers.

Falcon Computing Solutions, Inc.
http://www.falcon-computing.com

Overall computing solutions (ACC: accelerator):

[Figure: user applications in MapReduce/Spark/Hadoop plus Java/C/C++/OpenMP are compiled by the Merlin Compiler into FCS ACC engines; together with ACC models and ACC libraries, they run on the Kestrel Runtime, which customizes and virtualizes the accelerators]

The only solution offering FPGA customization and virtualization for datacenter acceleration!

Merlin Compiler

- C-based design flow: C/C++ with pragmas, an OpenMP-like high-level programming model
- Automatic source-to-source optimizations for productivity and QoR
- OpenCL generation: the same input serves multiple vendors and platforms through OpenCL backends (Altera/Xilinx), producing optimized OpenCL and system executables
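For flavor, "C/C++ with pragmas" in such an OpenMP-like model looks roughly as follows. This is a hypothetical sketch using a plain OpenMP directive, since the slide does not show Merlin's actual pragma names:

#include <stdio.h>

#define N 4096

/* saxpy kernel annotated for an OpenMP-like accelerator flow: a
 * source-to-source compiler can map the parallel loop to hardware PEs
 * and generate the OpenCL host/kernel code automatically. */
void saxpy(float a, const float x[N], float y[N]) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] += a * x[i];
}

int main(void) {
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(3.0f, x, y);
    printf("y[0] = %f\n", y[0]);   /* expect 5.0 */
    return 0;
}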

Sample Compilation Results

Design         Merlin Compiler   Initial OpenCL   Manually Optimized OpenCL
Blackscholes   0.34 ms           11 ms            NA
Denoise        0.08 s            3.8 s            NA
LogisticRegr   94 ms             3.7 s            94 ms
MatMult        0.8 ms            1.9 ms           0.8 ms
NAMD           26 ms             51 ms            26 ms
Normal         4 ms              52 ms            10 ms
TwoNN          1.23 s            1.70 s           NA
Average        1x                21x              1.3x   (runtime normalized to Merlin)

Concluding Remarks

- A new era of computing: accelerator-centric computing, which needs efficient support for customization and specialization
- Customization at all levels: chip level, server node level, data center level
- Data center level customization holds great promise: that's where the workload aggregates
- Software is the key:
  - Programming models: Hadoop/MapReduce or Spark (+ C/C++), OpenMP, OpenCL, ...
  - Compilation support
  - Runtime management

Acknowledgements: NSF and C-FAR

- NSF Expeditions in Computing and the C-FAR Center under the STARnet Program
- Industry support: Baidu, Falcon Computing, Fujitsu Labs, Huawei, Intel, Mentor Graphics

CDSC faculty: Aberle (UCLA), Baraniuk (Rice), Bui (UCLA), Chang (UCLA), Cheng (UCSB), Cong (Director, UCLA), Palsberg (UCLA), Potkonjak (UCLA), Reinman (UCLA), Sadayappan (Ohio State), Sarkar (Associate Director, Rice), Vese (UCLA)

Postdocs, Graduate Students, and Collaborators

Prof. Deming Chen (UIUC/ADSC), Yuting Chen (UCLA/Google), Zhenman Fang (UCLA), Hui Huang (UCLA/Google), Muhuan Huang (UCLA/Falcon), Dr. Peng Li (UCLA), Prof. Louis-Noël Pouchet (UCLA/Colorado State), Yuxin Wang (PKU/Falcon), Di Wu (UCLA/Falcon), Bingjun Xiao (UCLA/Google), Hao Yu (UCLA), Dr. Peng Zhang (UCLA/Falcon), Yi Zou (UCLA/Arista), Wei Zuo (UIUC)
