
System Acceleration with FPGA using the Altera OpenCL SDK

Accelerating your IDEA with FPGA

A Complete Solutions Portfolio


Powering Your Innovation

CPLDs: lowest cost, lowest power
FPGAs: cost/power balance, SoC & transceivers
FPGAs: mid-range FPGAs, SoC & transceivers
FPGAs: optimized for high bandwidth
PowerSoCs: high-efficiency power management

Resources
  Embedded soft and hard processors
  Design software
  Development kits
  Intellectual property (IP)

Industrial, Computing, Enterprise

Industry Challenges

A variety of applications are becoming bottlenecked by scalable performance requirements
  E.g. object detection and recognition, image tracking and processing, cryptography, cloud, search engines, deep packet inspection, etc.
Overloading CPU capabilities
  Frequencies are capped
  Processors keep adding more cores
  Need to coordinate all the cores and manage data
Product life cycles are long
  GPU lifespan is short
  GPUs require re-optimization and regression testing between generations
  Support agreements for GPUs are costly
Power dissipation of CPUs and GPUs limits system size
Maintaining coherency throughout a scalable system

OpenCL and FPGAs Address These Challenges

Power efficient acceleration
  Typically 1/5 the power of a GPU and orders of magnitude more performance per watt than a CPU
FPGA lifecycle of over 15 years
  GPU lifespan is short
  GPUs require re-optimization and testing between generations
  FPGA OpenCL code is retargeted to future devices without modification
Our OpenCL flow abstracts away the FPGA hardware flow
  Puts the FPGA into software engineers' hands
Our OpenCL SDK allows for streaming IO channels and kernel channels
  Data movement without host involvement
  Low latency data transmission to the accelerator
Shared virtual memory
  IBM CAPI and Intel QPI

Efficiency via Specialization

[Figure: efficiency of GPUs, FPGAs, and ASICs. Source: Bob Brodersen, Berkeley Wireless group]

Application Development Paradigm

[Figure: developer populations: ASIC designers, FPGA programmers, parallel programmers, and standard CPU programmers]

OpenCL expands the number of application developers

More SW Engineering Resources than HW?

1000:1 software engineers to FPGA designers
Software engineers are not used to long compile times

OpenCL Solves This!

Our OpenCL flow abstracts away the FPGA hardware flow, bringing the FPGA to low level software programmers
  Software developers write, optimize and debug in their familiar software environment
  Quartus is run behind the scenes
  Emulator and profiler are software development tools
  Long compile times are pushed to the end of the flow
OpenCL optimization doesn't require a board
  Allows software to drive board requirements (.xml file)

OpenCL On FPGAs Fit Into All Markets

Data processing algorithms across:
  Automotive/Industrial (pedestrian detection, motion estimation)
  Military/Government (crypto, image detection)
  Networking (DPI, SDN, NFV)
  Computer & Storage (HPC, financial, data compression)
  Medical (diagnostic image processing, bioinformatics)
  Broadcast, Consumer (video image processing)

OpenCL and FPGA Acceleration in the News

IBM and Altera Collaborate on OpenCL
  "IBM's collaboration with Altera on OpenCL and support of the IBM Power architecture with the Altera SDK for OpenCL can bring more innovation to address Big Data and cloud computing challenges," said Tom Rosamilia, senior vice president, IBM Systems

Intel Reveals FPGA and Xeon in One Socket
  "That allows end users that have applications that can benefit from acceleration to load their IP and accelerate that algorithm on that FPGA as an offload," explained the vice president of Intel's data center group, Diane Bryant

Search Engine Gets Help From FPGA
  "Altera was really interested in helping with the development; the resources they were willing to throw our way were more significant than those from Xilinx." (Microsoft engineering manager)

Baidu and Altera Demonstrate Faster Image Classification
  Altera Corp. and Baidu, China's largest online search engine, are collaborating on using FPGAs and convolutional neural network (CNN) algorithms for deep learning applications.

Xilinx Announces SDAccel Development Environment for OpenCL
  Delivering up to 25X better performance/watt to the data center

Demo

Exploring FPGA world

Circuit

In the Beginning: TTL Logic Design

Basic logic functions available on separate chips
  NAND, OR, multiplexers, flip-flops, etc.
Made famous by the Texas Instruments 7400 device family
Design choices often determined by cost and available device inventory

Epic of Digital computer

Computer on a chip

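Why Programmable Logic?

[Figure: board-level system built from a CPU, DSPs, flash, RAM, and separate I/O devices (generic I/O, video I/O, Ethernet, DDR/QDR, transceivers for PCIe, SDI, ...), alongside a version in which an FPGA takes over the CPU, DSP, and I/O functions]

Solution: Replace External Devices with Programmable Logic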

Programmable Logic is Found Everywhere!

Consumer/Automotive
  Entertainment: broadband, audio/video, video display
  Automotive: navigation, entertainment
Test, Measurement, & Medical
  Instrumentation: medical, test equipment, manufacturing
Communications/Broadcast
  Wireless: cellular, basestations, wireless LAN
  Networking: switches, routers
  Wireline: optical, metro, access
  Broadcast: studio, satellite, broadcasting
Military & Industrial
  Military: secure comm., radar, guidance and control
  Security & Energy Management: card readers, control systems, ATM
Computer & Storage
  Computers: servers, mainframe
  Storage: RAID, SAN
  Office Automation: copiers, printers, MFP

FPGA Architecture

Massive Parallelism
  Millions of logic elements
  Thousands of 20Kb memory blocks
  Thousands of Variable Precision DSP blocks
  Dozens of High-speed transceivers
Hardware-centric
  VHDL/Verilog
  Synthesis
  Place&Route

[Figure: array of logic elements joined by programmable routing switches, surrounded by I/O]

FPGA Logic Elements

FPGAs use multiple input LUTs

[Figure: a two-input LUT (inputs A, B) built from SRAM cells and a multiplexer, and a logic element with multiple-input LUTs, registers Reg0 to Reg3, address inputs Addr0 and Addr1, and carry and shift chains (Carry IN/OUT, Shift IN/OUT)]

Multiple Families
  Cyclone: cost and power balanced
  Arria: midrange
  Stratix: high bandwidth

Typical FPGA Design Entry


always @(a or b or c or d or sel)
begin
  case (sel)
    2'b00: mux_out = a;
    2'b01: mux_out = b;
    2'b10: mux_out = c;
    2'b11: mux_out = d;
  endcase
end

[Figure: 4-to-1 multiplexer with data inputs a, b, c, d, select input sel, and output mux_out]

Hardware circuits are described using Hardware Description Languages (HDLs) such as VHDL or Verilog
A designer must describe the behavior of the algorithm in enough detail to produce a low-level digital circuit from it
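
For readers coming from software, the multiplexer described above can be modelled as an ordinary C function. This is only a behavioral sketch: the name mux4 and the use of int for the data inputs are assumptions made for illustration, and unlike the HDL version it says nothing about the circuit that results.

```c
/* Behavioral model of a 4-to-1 multiplexer.
 * Only the two low bits of sel are used, matching a 2-bit select input. */
int mux4(unsigned sel, int a, int b, int c, int d)
{
    switch (sel & 0x3u) {
    case 0:  return a;
    case 1:  return b;
    case 2:  return c;
    default: return d;   /* sel == 3 */
    }
}
```

Calling mux4(2, a, b, c, d) returns c, just as sel = 2'b10 routes c to mux_out.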

Digital Design with TTL Logic

[Figure: truth table for inputs A, B, C, D and output X, and the corresponding Karnaugh map with rows AB and columns CD, both ordered 00, 01, 11, 10]

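Digital Design with TTL Logic (cont.)

Logic expression (sum of products)

X = AB + CD + BD + BC + AD + AC

Final logic implementation (NAND of NANDs, where ' denotes complement)

X = ((AB)' (CD)' (BD)' (BC)' (AD)' (AC)')'

[Figure: implementation with 7400 NAND gates, a 7430 NAND gate, and 7474 flip-flops (PRE, CLR, Q), powered from VCC]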
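
The step from the sum-of-products expression to a NAND-only implementation relies on De Morgan's law. The sketch below checks that equivalence exhaustively, assuming the second expression on the slide is the NAND of the six NANDed product terms (the complement bars did not survive extraction). The function names are illustrative.

```c
#include <stdbool.h>

/* Sum-of-products form: X = AB + CD + BD + BC + AD + AC */
bool x_sum_of_products(bool a, bool b, bool c, bool d)
{
    return (a && b) || (c && d) || (b && d) || (b && c) || (a && d) || (a && c);
}

/* Two-input NAND, the function of one gate in a 7400 package. */
static bool nand2(bool p, bool q)
{
    return !(p && q);
}

/* NAND-NAND form: each product term becomes a two-input NAND, and a single
 * wide NAND (the role a 7430 can play) combines the six results. */
bool x_nand_nand(bool a, bool b, bool c, bool d)
{
    bool t0 = nand2(a, b);
    bool t1 = nand2(c, d);
    bool t2 = nand2(b, d);
    bool t3 = nand2(b, c);
    bool t4 = nand2(a, d);
    bool t5 = nand2(a, c);
    return !(t0 && t1 && t2 && t3 && t4 && t5);
}

/* Number of input combinations (out of 16) on which the two forms disagree. */
int count_mismatches(void)
{
    int mismatches = 0;
    for (unsigned i = 0; i < 16; i++) {
        bool a = i & 1u, b = (i >> 1) & 1u, c = (i >> 2) & 1u, d = (i >> 3) & 1u;
        if (x_sum_of_products(a, b, c, d) != x_nand_nand(a, b, c, d))
            mismatches++;
    }
    return mismatches;
}
```

Both forms are true exactly when at least two of the four inputs are high.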

From TTL to Programmable Logic

General features of logic implementations
  Sum of products (AND-OR gates; combinatorial logic)
  Stored results (registered outputs)
  Wired together
What if
  Logic functions were fixed (like TTL), but combined into a single device?
  Wiring (routing) connections could be controlled (programmed) somehow?

Field Programmable Gate Array (FPGA)

LABs arranged in an array
Row and column programmable interconnect
Interconnect may span all or part of the array

[Figure: array of LABs with row interconnect, column interconnect, and segmented interconnects]

CPLD LABs vs. FPGA LABs

FPGA LABs made up of logic elements (LEs) instead of product terms and macrocells
Easier to create complex functions through LE cascading

Lookup Tables (LUTs)

Replaces product term array
Combinational functions created with programmed tables (cascaded multiplexers)
LUT inputs (A, B, C, D) are mux select lines
Programmed levels (EEPROM or SRAM) hold the truth table

Example (' denotes complement): X = AB + A'B'C'D' + A'B'CD
Programmed levels for DCBA = 1111 down to 0000: 1001 1000 1000 1001 = 0x9889
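
A LUT can be modelled in software as a 16-bit constant indexed by the four inputs. The sketch below assumes the example function is X = AB + A'B'C'D' + A'B'CD and that table bit i holds the output for input index i = 8D + 4C + 2B + A. Under that assumed ordering the function yields the mask 0x9889 shown on the slide. The function names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Evaluate a 4-input LUT: bit i of mask is the output for index i = 8D + 4C + 2B + A. */
bool lut4_eval(uint16_t mask, bool a, bool b, bool c, bool d)
{
    unsigned index = ((unsigned)d << 3) | ((unsigned)c << 2) |
                     ((unsigned)b << 1) | (unsigned)a;
    return (mask >> index) & 1u;
}

/* Reference function: X = AB + A'B'C'D' + A'B'CD */
bool x_reference(bool a, bool b, bool c, bool d)
{
    return (a && b) || (!a && !b && !c && !d) || (!a && !b && c && d);
}

/* "Program" a LUT by tabulating a function over all 16 input combinations. */
uint16_t lut4_program(bool (*f)(bool, bool, bool, bool))
{
    uint16_t mask = 0;
    for (unsigned i = 0; i < 16; i++) {
        bool a = i & 1u, b = (i >> 1) & 1u, c = (i >> 2) & 1u, d = (i >> 3) & 1u;
        if (f(a, b, c, d))
            mask |= (uint16_t)(1u << i);
    }
    return mask;
}
```

Changing the function only changes the stored constant, not the evaluation logic, which is the property that makes the hardware reprogrammable.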

Programmable register

Configure for D, T, JK, or SR flip-flop operation
Clock typically driven by global clock
Asynchronous control through other logic or I/O
Feedback into LUT
Bypass register or LUT

Carry and Register Chains

Chain carry bits between LEs
Register outputs can chain to other LE registers in LAB to form LUT-independent shift registers

Register Packing

Separate outputs from LUT and register create two outputs from one LE
Saves device resources

LABs and LEs: A Closer Look

[Figure: logic element containing LUT & carry logic followed by a register]

Adaptive Logic Modules (ALM)

Based on LE, but includes dedicated resources & adaptive LUT (ALUT)
Improves performance and resource utilization

[Figure: ALM with eight inputs feeding an adaptive LUT, two adders, and two registers]

FPGA Routing

All device resources can feed into or be fed by any routing in device
Differing fixed lengths to adjust for timing
Scales linearly as density increases
Local interconnect
  Connects between LEs or ALMs within a LAB
  Can include direct connections between adjacent LABs
Row and column interconnect
  Fixed length routing segments
  Span a number of LABs or entire device

FPGA I/O Elements

Advanced programmable logic blocks connect directly to row or column interconnect
Control available I/O features
  Input/output/bidirectional
  Multiple I/O standards
  Differential signaling
  Current drive strength
  Slew rate
  On-chip termination/pull-up resistors
  Clamping diodes for PCI bus use
  Open drain/tri-state
  etc.

Typical I/O Element Logic

[Figure: I/O element with output enable control, output path, and input path connected to the device pin]

Other Typical FPGA Features

Replace some LABs with dedicated functional hardware blocks
Memory blocks
  Create on-board memory structures to support design
    Single/dual-port RAM
    ROM
    Shift registers or FIFO buffers
  Initialize RAM or ROM contents on power-on
  Memory LABs (MLABs)
Embedded multipliers
  Useful for DSP
  High-performance multiply/add/accumulate operations
High-speed transceivers

Typical PLD Design Flow

Design Specification

Schematic entry/RTL coding/Qsys/DSP Builder/OpenCL
- Behavioral or structural description of design

RTL simulation
- Functional simulation
- Verify logic model & data flow

Synthesis (Mapping)
- Translate design into device specific primitives
- Optimization to meet required area & performance constraints
- Quartus II synthesis or 3rd party synthesis tools
- Result: Post-synthesis netlist

Place & route (Fitting)
- Map primitives to specific locations inside target technology with reference to area & performance constraints
- Specify routing resources to be used
- Quartus II Fitter
- Result: Post-fit netlist

Timing analysis (TimeQuest Timing Analyzer)
- Verify performance specifications were met
- Static timing analysis

Gate level simulation (optional)
- Simulation with timing delays taken into account
- Verify design will work in target technology

PC board simulation & test
- Simulate board design
- Program & test device on board
- Use SignalTap II Logic Analyzer or other on-chip tools for debugging

The Key to Performance

Maximize Throughput: more operations per second
  Parallelism
    Pipelining
    Instructions
    Processes
    Loop unrolling
    Duplication (SPMD)
    Multi-threading (SMT)
Minimize Latency: quick data access
  Memory Access
    Avoid transfer/copy
    Work in local memory instead of shared memory
    Coalesce accesses

FPGA Acceleration

Efficient Power
  Custom processors
  Optimized for task
    Small soft scalar processor
    Larger vector processor
    Hardware pipeline
  Dedicated local memory
Acceleration
  Multiple engines for SMT
  Replication for SPMD
  Pipelining for more throughput

[Figure: FPGA containing several processor and memory pairs, surrounded by I/O]

Interface Protocols IP

Extensive support of protocol interfaces across market segments (wireline, general, wireless, video)

Ethernet, Interlaken, XAUI, OTN/OTL, PCI Express, CPRI, SDI, SerialLite, RapidIO, DisplayPort, QPI*, JESD204B, HDMI, SATA/SAS*, SFI-S, RXAUI/DXAUI*, SONET/SDH*, HiGig*, Fibre Channel*, InfiniBand*

* Partner IP. The rest of the IP is developed and supported by Altera

Memory Interfaces IP

Broad support of memory interfaces

DDR SDRAM: DDR4, DDR3, LPDDR3
RLDRAM: RLDRAM 3, RLDRAM 2
QDR SRAM: QDR II+ Xtreme, QDR II / II+
Serial Memory: HMC, GCI (MoSys)*

* Partner IP. The rest of the IP is developed and supported by Altera

SoC Solution

Host processor and FPGA accelerator in one package
  Lower cost
  Power efficient
  Real-time system

Hard Processor System (HPS)
  Two ARM Cortex-A9 cores, each with NEON / FPU and L1 cache
  L2 cache, 64-KB RAM
  JTAG debug / trace (1), timers (x6), DMA (8 channels)
  QSPI flash control, NAND flash (1) (2), SD / SDIO / MMC (1)
  Ethernet (x2) (1), USB OTG (x2) (1), GPIO, I2C (x4), SPI (x2), CAN (x2), UART (x2)
  Shared multiport DDR SDRAM controller (2)
  HPS I/Os
  HPS to FPGA and FPGA to HPS bridges, FPGA config

FPGA
  28LP process
  8-input ALMs
  Variable-precision DSP
  M10K memory and 640-bit MLABs
  fPLLs
  Hard PCIe
  Hard multiport DDR SDRAM controller (2)
  3-, 5-, 6-, and 10-Gbps transceivers
  FPGA general purpose I/Os

Notes:
(1) Integrated direct memory access (DMA)
(2) Integrated ECC

FPGA Sweet Spot


Low
Latency
CPU

++
(branch
prediction)

SMT
+

SIMD
-

Integer/
Bit

SPMD
-

Floating
Point

Power

++

++

++

++

Control Compute Intensive


DSP

Datapath Compute Intensive


GPU

(deep
cache)

++

Array Processing Intensive


FPGA

+
(flexible
local
memory)

++

++

Bit Manipulation Intensive

45

Programming Language Offerings

Target: GPU, Multi-Core CPU
  Scope: System (heterogeneous platform)
  Designer: Programmer
  Design flow: CUDA/OpenCL
  Design activity: Task parallelism, data parallelism
  Design constraints: Throughput/latency, power efficiency
  Hardware knowledge: None (coding style guidelines)
Target: DSP/Embedded
  Scope: Device
  Designer: Embedded programmer
  Design flow: Code Composer Studio (TI C)
  Design activity: Real time function acceleration
  Design constraints: Real time execution, cost
  Hardware knowledge: Limited (macro architecture, bandwidth level)
Target: FPGA
  Scope: IP block
  Designer: Hardware designer
  Design flow: Quartus II (Verilog/VHDL)
  Design activity: IP design and integration
  Design constraints: Clock frequency, resource utilization, interface requirements, power
  Hardware knowledge: Yes (protocol-level, timing closure, micro architecture)

Status labels on the slide: Today, Today, HLS, PoC

HLS vs OpenCL Positioning

OpenCL
  Targets CPU, GPU and FPGAs
  Target user is software developer
  Implements FPGA in software development flow
  Performance is determined by resources allocated
  Host required
HLS
  Targets FPGA
  Target user is FPGA designer
  Implements FPGA in traditional FPGA development flow
  Performance is defined and the amount of resources needed to achieve it is reported
  Host not required

Altera OpenCL SDK

Altera SDK for OpenCL Competitive Differentiator

Altera's SDK for OpenCL has proven to be a powerful solution for many vendors
  Won the design tool and development software Elektra award in Europe
  Won Ultimate Product of the Year for 2014
Actively being used today:
  "I was extremely happy to get a great performance with such low effort. I was so impressed with how powerful the Altera tool was!"
  Senior Engineer, Altera OpenCL Customer

First Conformant OpenCL Solution for FPGAs!!!


OpenCL v1.0 specification
  >8500 programs tested
Supports ARM host
  CV and AV SoC

http://www.khronos.org/conformance/adopters/conformant-companies
http://www.khronos.org/conformance/adopters/conformant-products

Heterogeneous Platform Model

OpenCL platform model: a host with host memory, connected to one or more (compute) devices; each device has global memory and contains compute units made up of processing elements
Example platform: an x86 host connected to the devices over PCIe

OpenCL Use Model: Abstracting the FPGA away

OpenCL Accelerator Code

__kernel void sum(__global float *a,
                  __global float *b,
                  __global float *y)
{
    int gid = get_global_id(0);
    y[gid] = a[gid] + b[gid];
}

Host Code

main() {
    read_data( ... );
    manipulate( ... );
    clEnqueueWriteBuffer( ... );
    clEnqueueNDRangeKernel(..., sum, ...);
    clEnqueueReadBuffer( ... );
    display_result( ... );
}

Build flow
  Host code -> standard gcc compiler -> EXE (runs on the host)
  Accelerator code -> Altera Offline Compiler -> Verilog -> Quartus II -> AOCX (runs on the accelerator)
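
To make the data-parallel semantics of the sum kernel concrete, the sketch below emulates an NDRange launch on the host with a plain loop: each loop iteration plays the role of one work-item, and the loop index stands in for get_global_id(0). It does not use the OpenCL API. The names sum_work_item and emulate_ndrange_sum are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Body of the sum kernel for a single work-item; gid stands in for get_global_id(0). */
static void sum_work_item(size_t gid, const float *a, const float *b, float *y)
{
    y[gid] = a[gid] + b[gid];
}

/* Hypothetical host-side stand-in for a one-dimensional NDRange launch:
 * runs the kernel body once for every global ID in [0, global_size). */
void emulate_ndrange_sum(size_t global_size, const float *a, const float *b, float *y)
{
    for (size_t gid = 0; gid < global_size; ++gid)
        sum_work_item(gid, a, b, y);
}
```

Because no work-item reads another work-item's output, the iterations can run in any order or all at once, which is what lets the compiler build a pipeline for them.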

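OpenCL Programming Model

[Figure: host.c includes opencl.h and is compiled with gcc; it sets up the platform, context, device, and queue, then the program, kernel, and buffers, and launches the kernel through the driver. device.cl is compiled with aoc. The example application acquires, computes, and visualizes data.]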

Use Model: clCreateProgramWithBinary


/* Read the precompiled kernel binary (.aocx) into memory */
fp = fopen("file.aocx", "rb");
fseek(fp, 0, SEEK_END);
lengths[0] = ftell(fp);            /* size of the binary in bytes */
binaries[0] = (unsigned char*)malloc(sizeof(unsigned char)*lengths[0]);
rewind(fp);
fread(binaries[0], lengths[0], 1, fp);
fclose(fp);
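
The slide code omits error handling. A more defensive version of the same file-loading step might look like the sketch below, which uses only the C standard library. The helper name load_binary_file is hypothetical, and treating an empty file as a failure is a choice made for this sketch. The buffer and length it returns are what would then be handed to clCreateProgramWithBinary.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read a whole file into a heap buffer.
 * Returns NULL on any failure (including an empty file); on success stores
 * the size in *length, and the caller is responsible for freeing the buffer. */
unsigned char *load_binary_file(const char *path, size_t *length)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    unsigned char *data = NULL;
    long size = -1;
    if (fseek(fp, 0, SEEK_END) == 0)
        size = ftell(fp);
    if (size > 0 && fseek(fp, 0, SEEK_SET) == 0) {
        data = malloc((size_t)size);
        if (data != NULL && fread(data, 1, (size_t)size, fp) != (size_t)size) {
            free(data);          /* short read: treat as failure */
            data = NULL;
        }
    }
    fclose(fp);

    if (data != NULL)
        *length = (size_t)size;
    return data;
}
```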

[Figure: host API flow through OpenCL.h. clGetPlatformIDs (cl_platform_id) -> clGetDeviceIDs (cl_device_id) -> clCreateContext (cl_context) -> clCreateCommandQueue (cl_command_queue) -> clCreateProgramWithBinary (cl_program) -> clBuildProgram -> clCreateKernel (cl_kernel) -> clEnqueueNDRangeKernel. The .cl file is turned into the .aocx by the offline compiler, and host.c loads that .aocx as the program binary.]

Legend: CL file, OpenCL program, bitstream

The Only Custom Accelerator Solution: Platforms

[Figure: accelerator board in two parts. IO infrastructure (prebuilt BSP, made with standard HDL tools by an FPGA developer): two DDR3 memory interfaces, four QDRII memory interfaces, two 10Gb MAC/UOE data interfaces to the 10G network, and a PCIe gen2x8 host interface. OpenCL domain (built with the Altera OpenCL Compiler): kernel IP blocks and interconnect.]
Altera Reference Platforms

High Performance Computing (HPC)
  Requirement: compute power/memory bandwidth
  Global memory: large amount of DDR
  IO channels: none (minimize IP overhead)
  Reference design: option pricing
Network Enabled
  Requirement: low latency
  Global memory: DDR and QDRII+
  IO channels: 2x10GbE (MAC/UOE)
  Reference design: OPRA (streaming), trading (with global memory access)

[Figure: architecture of both platforms: a Stratix V FPGA with DDR3, DMA, PCIe, and the OpenCL kernels, a CPLD bridge with flash, and a host software stack of OpenCL API, HAL, UMD, and KMD; the network enabled platform adds 10G UDP interfaces]

SoC Reference Platforms

HPS block removes the complexities of the BSP creation
Coherency between host and accelerator

[Figure: HPS with HPS DDR3, connected to the FPGA through the H2F/F2H, LWH2F, and F2S bridges (labels: 32-bit, 50 MHz, CSR); the FPGA side holds the OpenCL kernels, FPGA memory, and scratch DDR3, with a camera on the DVI input and a monitor on the DVO output]

OpenCL Platforms Page contains CV SoC devkit platform user's guide

Altera Network Enabled Reference Platform for OpenCL

Software layer: reference design made up of host.c (C/C++ API) and device.cl (OpenCL C), plus the compiler
Hardware layer: reference platform s5_hft (S5PH-Q) on the reference board
Host: 64-bit RHEL 6.4 or Windows 7

Guaranteed Timing Flow

Inputs to AOC: kernel.cl, Boardspec.xml, and the post-fit QXP partition (PCIe, UniPHY, DMA, ...)
AOC runs synthesis / P&R / STA on the OpenCL kernels ONLY
Meets timing?
  Yes: DONE!
  No: reconfigure the kernel PLL, re-run STA with the new PLL value, then DONE!

Heterogeneous Memory Support

Memories with different characteristics
  Sequential access: DDR
  Random access: QDR
  Low latency: on-chip
  MoSys: efficient
  HMC: high capacity
Combine different memories
  Attribute-based
  Automatic

__kernel void foo(
    __global uint *data
    __attribute__((buffer_location("QDR")))
) {
    ...
    foo(data[i]);
    ...
}

[Figure: host with host memory and host IO connected through the device interface to a device with compute units (CU), Global Memory1, Global Memory2, and IO]

Channels Advantage

[Figure: the reference board (DDR3 and QDRII memory interfaces, 10Gb network interfaces, host interface, CvP update, interconnect, and OpenCL kernels) shown twice, comparing standard OpenCL with the Altera vendor extension that adds IO and kernel channels]

External Function Interface

Ability to quickly integrate IPs in OpenCL
Support Avalon Streaming based IPs

RTL Module (myMod.v)

module myMod (input clock,
              input din,
              ... )
  ...
endmodule

OpenCL Program (proc_element.cl)

extern myfunc( ... );
void kernel proc_element() {
    ...
    x = myfunc(a, b, ... );
    ...
}

EFI Spec (myMod.xml)

<EFI_SPEC>
  <FUNCTION name="myfunc" module="myMod" >
    ...
    <FILE name="myMod.v" >
    ...
</EFI_SPEC>

Output of the Altera OpenCL Compiler (proc_element.v)

module proc_element(
    input clock,
    input ... ,
    ... )
  ...
  myMod mod_inst(
      .clock(clock),
      .din( ... ), ... );
  ...
endmodule

Kernel Development Flow

Modify kernel.cl, then iterate through progressively slower tools:
  x86 Emulator (sec)
  Optimization Report (min)
  Prototype (min)
  Profiler (hours)
Questions checked along the way: Functional bugs? Stall-free pipeline? Memory coalesced? Hardware performance met?
DONE!

x86 emulator

Beta v14.1

Enables functional debug of kernel code on an x86 system
  Prototype support to allow users to run kernels on an x86 platform
  Debug support for Altera vendor-specific features such as channels
Supports
  OpenCL syntax
  Channels
  Printf

kernel void accel() {
    ...
    gid = get_global_id(0);
    out[gid] = proc(data[gid]);
    ...
}

[Figure: the kernel goes through the x86 kernel compiler and is run as ./kernel_tb]

Example: Load to Store dependency

1  kernel void prefixsum( global int* restrict A, unsigned N ) {
2    for ( unsigned i = 1 ; i < N ; i++ ) {
3      int a = A[i-1];
4      A[i] += a;
5    }
6  }

=======================================================================
|                     *** Optimization Report ***                     |
=======================================================================
| Kernel: prefixsum                                          | Ln.Col |
=======================================================================
| Loop for.body                                              | 2.25   |
|   Pipelined execution inferred.                            |        |
|   Successive iterations launched every 321 cycles due to:  |        |
|                                                            |        |
|     Memory dependency on Load Operation from:              | 3.21   |
|       Store Operation                                      | 4.7    |
|   Largest Critical Path Contributors:                      |        |
|     49%: Load Operation                                    | 3.21   |
|     49%: Store Operation                                   | 4.7    |
=======================================================================

Slide callouts: "Relative cost of global memory to local computation"; "True fix requires restructuring the code"
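
One possible restructuring, shown here as plain C rather than as a kernel, carries the running total in a local variable so that no iteration has to load a value the previous iteration has just stored. This is a sketch of the general idea only; the slide does not show the fix, and the launch interval the offline compiler would achieve for it is not stated. The function names are illustrative.

```c
/* Form from the slide: iteration i loads A[i-1], which iteration i-1 has just
 * stored, so the loop carries a dependency through memory. */
void prefixsum_memory_dependency(int *A, unsigned N)
{
    for (unsigned i = 1; i < N; i++) {
        int a = A[i - 1];
        A[i] += a;
    }
}

/* Restructured form: the running total lives in a local variable, so each
 * iteration loads and stores only A[i] and the carried value stays on chip. */
void prefixsum_running_total(int *A, unsigned N)
{
    if (N == 0)
        return;                  /* nothing to read when the array is empty */
    int running = A[0];
    for (unsigned i = 1; i < N; i++) {
        running += A[i];
        A[i] = running;
    }
}
```

Both functions produce the same prefix sums; only the location of the loop-carried value differs.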

Example: Accumulating a value

1  kernel void test( global float* restrict input,
2                    global float* restrict output, unsigned N )
3  {
4    float mul = 1.0f;
5    for ( unsigned i = 0; i < N; i++ ) {
6      mul *= input[ i ];
7    }
8    *output = mul;
9  }

=======================================================================
|                     *** Optimization Report ***                     |
=======================================================================
| Kernel: test                                               | Ln.Col |
=======================================================================
| Loop for.body                                              | 5.24   |
|   Pipelined execution inferred.                            |        |
|   Successive iterations launched every 3 cycles due to:    |        |
|                                                            |        |
|     Data dependency on variable mul                        | 4.10   |
|   Largest Critical Path Contributor:                       |        |
|     100%: Fmul Operation                                   | 6.7    |
=======================================================================
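
A common way to relax a loop-carried dependency like the one on mul is to spread the work over several independent partial results and combine them at the end. The sketch below shows the idea in plain C. The constant PARTIALS and the function names are assumptions made for illustration, and the slide does not say which fix, if any, was applied. Reassociating floating-point multiplication can change rounding, so results may differ slightly from the serial loop.

```c
#define PARTIALS 4   /* assumed: at least the 3-cycle interval the report shows */

/* Serial form from the slide: every multiply waits for the previous one. */
float product_serial(const float *input, unsigned n)
{
    float mul = 1.0f;
    for (unsigned i = 0; i < n; i++)
        mul *= input[i];
    return mul;
}

/* Interleaved form: consecutive iterations update different partial products,
 * so an update to one partial does not wait on the update just before it. */
float product_interleaved(const float *input, unsigned n)
{
    float partial[PARTIALS];
    for (unsigned k = 0; k < PARTIALS; k++)
        partial[k] = 1.0f;

    for (unsigned i = 0; i < n; i++)
        partial[i % PARTIALS] *= input[i];

    float result = 1.0f;
    for (unsigned k = 0; k < PARTIALS; k++)
        result *= partial[k];    /* short final reduction */
    return result;
}
```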

Rapid Prototyping

Beta v14.1

Increases productivity during application development
  Uses a library of pre-compiled templates to skip Quartus II compilation
  Can test small versions of the final design on hardware very quickly
Ability to generate custom templates based on user kernels
  Tailors the Rapid Prototyping Template Library to the user

[Figure: standard flow: user program -> OpenCL compiler (aoc) -> Quartus II -> HW implementation, ~hours. Prototype flow: user program -> OpenCL compiler (aoc march=prototype) -> configuration template library -> HW implementation, ~minutes.]

Profiler

BETA v14.1

Instrument the pipeline with performance counters and profiling logic
Transfer the profiling information to the host via PCIe link
Reports bottlenecks, bandwidth, saturation, and pipeline occupancy

kernel void accel() {
    ...
    gid = get_global_id(0);
    out[gid] = a[gid]+b[gid];
    ...
}

[Figure: kernel pipeline of two load units and a store unit, instrumented with counters exposed through memory mapped registers]

OpenCL Host Library & Run Time Environment (RTE)

Host library improvements:
  Lower CPU usage
  Improved scalability
  Lower memory footprint
  Faster run time

SDK & Run Time Environment:


OS

SDK (needs ACDS)

RTE

Installer

Installer

Linux (RHEL) x86-64

Installer, RPM

Installer, RPM

Linux (RHEL) Power

RPM

Linux (custom) CV
SoC

Tarball

Windows x86-64

71

Installable Client Driver

BETA v14.1

[Figure: host.c (built against opencl.h) calls clGetPlatformIDs; the ICD dispatches to the installed vendor libraries, here nVidiaOpenCL and AlteraOpenCL]

Vendor libraries are registered with the ICD:
  Windows: DWORD value named <library>.dll under HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors
  Linux: file /etc/OpenCL/vendors/<vendor>.icd containing <library>.so

Altera Client Driver

BETA v14.1

[Figure: the same flow extended with clGetDeviceIDs and the Altera Client Driver (ACD) beneath AlteraOpenCL]

OpenCL + FPGA Key Benefits

Faster development vs. traditional FPGA design flow
  Puts the FPGA in the software developer's hands
  Familiar C-based development flow
Higher performance/watt vs. CPU/GPGPU
  Implement exactly what you need
  Pipeline parallel structures
  Custom interconnect converging with data processing cores
Lower power vs. CPU/GPGPU
  Core frequency lower: 200-250 MHz vs. 1 GHz
  Turn off unused logic
  Up to 1/5 the power
Portability & obsolescence free
  Code can transfer between different HW accelerators (CPU, GPGPU, FPGA, etc.)
  Code ports seamlessly to new generations of the FPGA
  FPGA life cycle considerably longer than that of CPUs or GPGPUs

Additional Resources

Altera SDK for OpenCL Design Flow

Set Up: Getting Started Guide (document)
  Install Quartus II v13.1 with Altera SDK for OpenCL
  Install C compiler or development environment
  Obtain and set up license from the Self Service Licensing Center
  Install the FPGA (OpenCL) board: aocl install
Design: Programming Guide (document)
  Develop kernel code and compile on CPU/GPU for functional correctness
  Build, compile & link the host application (Visual Studio/GCC)
  Compile the OpenCL kernel with Altera Offline Compiler (aoc)
  Run the application
Optimize: Best Practices (document)
  Optimize kernel for FPGA hardware

Additional Altera OpenCL Collateral

White papers on OpenCL
OpenCL online demos
OpenCL design examples
Instructor-led training
  Parallel Computing with OpenCL Workshop by Altera (1 day)
  Optimization of OpenCL for Altera FPGAs Training by Altera (1 day)
Online training
  Introduction to Parallel Computing with OpenCL
  Writing OpenCL Programs for Altera FPGAs
  Running OpenCL on Altera FPGAs
  Single-Threaded vs. Multi-Threaded Kernels
  Building Custom Platforms for Altera SDK for OpenCL
OpenCL board partners page

Application Benchmarking

Case Study: GZIP Compression

Compared with the hand-coded Verilog design, the OpenCL version
  Was 10% slower
  Used 12% more resources
  Had 3x faster development time
Altera summer intern ported and optimized the GZIP algorithm in a little more than a month
An industry-leading company's FPGA engineer coded it in Verilog in 3 months
Much lower design effort and design time

Results

CHREC/Univ of Florida

OpenCL vs. VHDL performance table

Apps.   Stratix 4               Predicted Stratix 5     Stratix 5
        Frames/sec  Max freq.   Frames/sec  Max freq.   Frames/sec  Max freq.
Sobel   475         170         909         300         870         300
Canny   470         170         890         300         823         309
SURF    392         170         870         300         804         283

Performance: OpenCL performance vs. VHDL performance, < 10 % overhead

OpenCL vs. VHDL productivity table

Apps.                  VHDL development time   OpenCL development time
Sobel, Canny, & SURF   6 months                1 month

Conclusions

Avoid productivity challenges of HDL
  6x increase in productivity
OpenCL offers familiar C environment
  Develop fully pipelined kernels
  Minimum performance cost

Case Study: Image Classification

Deep learning algorithm
  Convolutional neural network
  Based on Hinton's CNN
Early results on Stratix V
  2X perf./power vs. GPGPU, despite soft floating point
  8+ simultaneous kernels vs. 2 on GPGPU
  Exploiting OpenCL channels between kernels
A10 expectations
  Hard floating point
  Better density and frequency
  ~4X performance/watt vs. SV

Hinton's CNN Algorithm


The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000
images per class. There are 50000 training images and 10000 test images.
Here are the classes in the dataset, as well as 10 random images from each:
airplane
automobile
bird
cat
deer
dog
frog
horse
ship
truck

AES Encryption

Encryption/decryption
  256-bit key
  Counter (CTR) method
Advantage FPGA
  Integer arithmetic
  Coarse grain bit operations
  Complex decision making

Results

Platform                             Power (W)   Performance (GB/s)   Efficiency (MB/s/W)
E5503 Xeon Processor (single core)   est 80      0.01                 0.125
AMD Radeon HD 7970                   est 100     0.33                 3.3
PCIe385 A7 Accelerator               25          5.20                 208
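
The efficiency column in the AES results is throughput divided by power. The short sketch below reproduces that arithmetic, assuming 1 GB = 1000 MB, which is what the slide's figures imply. The function name is illustrative.

```c
/* Efficiency in MB/s per watt, from throughput in GB/s and power in watts.
 * Assumes 1 GB = 1000 MB. */
double efficiency_mb_per_s_per_watt(double gigabytes_per_s, double watts)
{
    return gigabytes_per_s * 1000.0 / watts;
}
```

For the FPGA accelerator row, 5.20 GB/s at 25 W works out to 208 MB/s/W, matching the slide.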

Multi-Asset Barrier Option Pricing

Monte-Carlo simulation
  No closed form solution possible
  High quality random number generator required
  Billions of simulations required
Used GPU vendor's example code
Advantage FPGA
  Complex control flow
Optimizations
  Channels, loop pipelining

Results

Platform               Power (W)   Performance (Bsims/s)   Efficiency (Msims/s/W)
W3690 Xeon Processor   130         .032                    0.0025
nVidia Kepler20        212         10.1                    48
Bittware S5-PCIe-HQ    45          12.0                    266

Document Filtering

Unstructured data analytics
  Bloom filter
Advantage FPGA
  Integer arithmetic
  Flexible memory configuration

Results

Platform                 Power (W)   Performance (MTs)   Efficiency (MTs/W)
W3690 Xeon Processor     130         2070                15.92
nVidia Tesla C2075       215         3240                15.07
PCIe385 A7 Accelerator   25          3602                144.08

Consumer (Japan)
Image Processing

pxy

Adaptive weighted images

c1 ij d1 xy c1 ( i1) j d 2 xy c2 ij d xy c2 ( i1) j d 2 xy
W

Advantage FPGA

Integer Arithmetic

Results

Platform                 Power (W)   Performance (FPS)   Efficiency (FPS/W)
W3565 Xeon Processor     est 130     0.05                .0004
nVidia Quadro 4000       est 150     2.94                .0200
PCIe385 A7 Accelerator   21          4.29                .2040

Smith-Waterman

Sequence alignment
  Scoring matrix
Advantage FPGA
  Integer arithmetic
  SMT streaming

Results

Platform                 Power (W)   Performance (MCUPS)   Efficiency (MCUPS/W)
W3565 Xeon Processor     140         40                    .29
nVidia K20               225         704                   3.13
PCIe385 A7 Accelerator   25          32596                 1303.00

Multi Function Printer

Image processing
  RGB output of raster scanner converted to CMYK colorants for printing
Advantage FPGA
  SoC solution
  IO and kernel channels
  Heterogeneous memory accesses
Goal: 50 PPM at A4/letter size
Results
  >40X improvement over C based algorithm on ARM only
    No NEON coprocessor used
  C6 speed grade part improved 20% to 128 PPM

Suricata: IDS/IPS Implementation (Cybersecurity)

[Figure: ingress network path (2x 10 Gbps Ethernet IO) feeding STD packet analysis (IDS) and DPI packet analysis, each with its own rules memory (QDR or DDR), followed by traffic control and packet manipulation (IPS) and the egress network path (2x 10 Gbps Ethernet IO, mirror), with IDS/IPS management on the host]

Decoder Kernel (autorun)
  Stream in encoded packets (aoclReadChannel)
  Unpack single streams
  Stream out decoded packets (aoclWriteChannel)
Packet Analysis Kernel - IDS (task)
  Stream in decoded packets and store in local memory (aoclReadChannel)
  Parallel regex with STD rules in global memory (heterogeneous memory support)
  Write results to global memory
  Stream out decoded packets (aoclWriteChannel)
Packet Analysis Kernel - DPI (task)
  Stream in decoded packets and store in local memory (aoclReadChannel)
  Parallel regex with DPI rules in global memory (heterogeneous memory support)
  Write results to global memory
  Stream out decoded packets (aoclWriteChannel)
Host IDS/IPS Management
  Read results from global memory and log
  Decide to modify or delete packets
Packet Manipulation Kernel - IPS (task)
  Stream in decoded packets (aoclReadChannel)
  Read and process decision from the host
  Stream out decoded packets (aoclWriteChannel)
Encoder Kernel (autorun)
  Stream in decoded packets (aoclReadChannel)
  Repack multiple streams
  Stream out encoded packets (aoclWriteChannel)

Haplotype Caller (Pair-HMM)

Smith-Waterman-like algorithm
  Uses hidden Markov models to compare gene sequences
  3 stages: Assembler, Pair-HMM (70%), Traversal + Genotyping
  Floating point (SP + DP)
C++ code starting point (from Java)
Whole genome takes 7.6 days!

Results

Platform             Runtime (ms)
Java (gatk 2.8)      10,800
Intel Xeon E5-1650   138
nVidia Tesla K40     70
Nallatech SV-D8      15.5

Sobel Filter

Fundamental image filter algorithm
  Used commonly in industrial and automotive applications
Sliding window based design pattern
  Same shift register structure, except in two dimensions

[Figure: line-buffer shift register; pixels enter at one end, and window taps (labelled A, E) sit at offsets such as WIDTH-1, WIDTH-9, WIDTH*2-9, WIDTH*3, WIDTH*3-9, WIDTH*4-1, and WIDTH*4-9]

Task Implementation and Results

Altera OpenCL (on our design example website: http://www.altera.com/support/examples/opencl/sobel-filter.html)

// COLS (image width) and the channels in_channel and out_channel
// are assumed to be declared elsewhere in the file
__kernel void sobel(int iters)
{
    // Coefficients
    int Gx[3][3] = {{-1,-2,-1},{0,0,0},{1,2,1}};
    int Gy[3][3] = {{-1,0,1},{-2,0,2},{-1,0,1}};

    int rows[2 * COLS + 3]; // line buffer

    int count = 0;
    while (count != iters) {
        // Shift the line buffer
        #pragma unroll
        for (int i = COLS * 2 + 2; i > 0; --i) {
            rows[i] = rows[i - 1];
        }
        rows[0] = read_channel_altera(in_channel);

        // Apply both 3x3 masks to the current window
        int x_dir = 0;
        int y_dir = 0;
        #pragma unroll
        for (int i = 0; i < 3; ++i) {
            #pragma unroll
            for (int j = 0; j < 3; ++j) {
                x_dir += rows[i * COLS + j] * Gx[i][j];
                y_dir += rows[i * COLS + j] * Gy[i][j];
            }
        }
        int edge_weight = abs(x_dir) + abs(y_dir);
        write_channel_altera(out_channel, edge_weight);
        ++count;
    }
}
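
The line-buffer pattern in the kernel can be exercised on a host with a plain C model, which is handy for checking the indexing before any FPGA compile. In the sketch below, arrays stand in for the two channels, COLS is fixed at 8, and the line buffer starts at zero. All three are assumptions of this model rather than details from the slide, and the function name sobel_stream is illustrative.

```c
#include <stdlib.h>

#define COLS 8   /* assumed image width for this model */

/* Host-side model of the streaming kernel: pixels_in stands in for in_channel
 * and edges_out for out_channel; one edge weight is produced per input pixel. */
void sobel_stream(const int *pixels_in, int *edges_out, int iters)
{
    static const int Gx[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};
    static const int Gy[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    int rows[2 * COLS + 3] = {0};   /* line buffer, zeroed in this model */

    for (int count = 0; count < iters; ++count) {
        /* Shift the line buffer by one pixel and insert the new one. */
        for (int i = 2 * COLS + 2; i > 0; --i)
            rows[i] = rows[i - 1];
        rows[0] = pixels_in[count];

        /* Apply both 3x3 masks to the taps at rows[i * COLS + j]. */
        int x_dir = 0;
        int y_dir = 0;
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 3; ++j) {
                x_dir += rows[i * COLS + j] * Gx[i][j];
                y_dir += rows[i * COLS + j] * Gy[i][j];
            }
        }
        edges_out[count] = abs(x_dir) + abs(y_dir);
    }
}
```

On a constant image the output settles to zero once the line buffer has filled, since both masks sum to zero; the first outputs are nonzero only because the buffer still holds its initial zeros.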

Device      Resolution   FPS
Cyclone V   1080p        60
Stratix V   1080p        135
