Winter School at Snu Altera

System Acceleration with FPGA using the Altera OpenCL SDK
Accelerating your IDEA with FPGA
A Complete Solutions Portfolio
CPLDs: lowest cost, lowest power
FPGAs: cost/power balance, SoC & transceivers
FPGAs: mid-range, SoC & transceivers
FPGAs: optimized for high bandwidth
PowerSoCs: high-efficiency power management
Resources: embedded soft and hard processors, design software, development kits, intellectual property (IP)
Industrial, Computing, Enterprise
Industry Challenges
A variety of applications are becoming bottlenecked by scalable performance requirements
E.g. object detection and recognition, image tracking and processing, cryptography, cloud, search engines, deep packet inspection, etc.
CPU capabilities are overloaded
Frequencies are capped
Processors keep adding more cores
Need to coordinate all the cores and manage data
Product life cycles are long
GPU lifespan is short
Requires re-optimization and regression testing between generations
Support agreements for GPUs are costly
Power dissipation of CPUs and GPUs limits system size
Maintaining coherency throughout a scalable system
OpenCL and FPGAs Address These Challenges
Power-efficient acceleration
Typically 1/5 the power of a GPU and orders of magnitude more performance per watt than a CPU
FPGA lifecycle is over 15 years
GPU lifespan is short
Requires re-optimization and testing between generations
FPGA OpenCL code can be retargeted to future devices without modification
Our OpenCL flow abstracts away the FPGA hardware flow
Puts the FPGA into software engineers' hands
Our OpenCL SDK allows for streaming I/O channels and kernel channels
Data movement without host involvement
Low-latency data transmission to the accelerator
Shared virtual memory
IBM CAPI and Intel QPI
Efficiency via Specialization
(Figure: efficiency of GPUs, FPGAs and ASICs. Source: Bob Brodersen, Berkeley Wireless group)
Application Development Paradigm
(Figure: developer populations by target: ASIC, FPGA programmers, parallel programmers, standard CPU programmers)
OpenCL expands the number of application developers
More SW Engineering Resources than HW?
1000:1 software engineers to FPGA designers
Software engineers are not used to long compile times
OpenCL Solves This!
Our OpenCL flow abstracts away the FPGA hardware flow, bringing the FPGA to low-level software programmers
Software developers write, optimize and debug in their familiar software environment
Quartus is run behind the scenes
Emulator and profiler are software development tools
Long compile times are pushed to the end
OpenCL optimization doesn't require a board
Allows software to drive board requirements (.xml file)
OpenCL on FPGAs Fits Into All Markets
Data processing algorithms appear in:
Automotive/Industrial (pedestrian detection, motion estimation)
Military/Government (crypto, image detection)
Networking (DPI, SDN, NFV)
Computer & Storage (HPC, financial, data compression)
Medical (diagnostic image processing, bioinformatics)
Broadcast, Consumer (video image processing)
OpenCL and FPGA Acceleration in the News
IBM and Altera Collaborate on OpenCL
"IBM's collaboration with Altera on OpenCL and support of the IBM Power architecture with the Altera SDK for OpenCL can bring more innovation to address Big Data and cloud computing challenges," said Tom Rosamilia, senior vice president, IBM Systems
Intel Reveals FPGA and Xeon in One Socket
"That allows end users that have applications that can benefit from acceleration to load their IP and accelerate that algorithm on that FPGA as an offload," explained the vice president of Intel's data center group, Diane Bryant
Search Engine Gets Help From FPGA
"Altera was really interested in helping with the development; the resources they were willing to throw our way were more significant than those from Xilinx" (Microsoft engineering manager)
Baidu and Altera Demonstrate Faster Image Classification
Altera Corp. and Baidu, China's largest online search engine, are collaborating on using FPGAs and convolutional neural network (CNN) algorithms for deep learning applications.
Xilinx Announces SDAccel Development Environment for OpenCL
Delivering Up to 25X Better Performance/Watt to the Data Center
Demo
Exploring FPGA world
Circuit
In the Beginning: TTL Logic Design
Basic logic functions available on separate chips
NAND, OR, multiplexers, flip-flops, etc.
Made famous by the Texas Instruments 7400 device family
Design choices often determined by cost and available device inventory
Epic of Digital computer
Computer on a chip
Why Programmable Logic?
(Figure: CPU, DSPs, flash, RAM, video I/O, Ethernet, DDR/QDR, XCVR (PCIe, SDI, ...) and generic I/O devices around an FPGA)
Solution: Replace External Devices with Programmable Logic
Programmable Logic is Found Everywhere!
Market segments: Consumer; Automotive; Test, Measurement, & Medical; Communications; Broadcast; Military & Industrial; Computer & Storage
Entertainment: broadband, audio/video, video display
Instrumentation: medical, test equipment, manufacturing
Wireless: cellular, basestations, wireless LAN
Military: secure comm., radar, guidance and control
Computers: servers, mainframe
Automotive: navigation, entertainment
Networking: switches, routers
Security & Energy Management: card readers, control systems, ATM
Wireline: optical, metro, access
Broadcast: studio, satellite, broadcasting
Storage: RAID, SAN
Office Automation: copiers, printers, MFP
FPGA Architecture
Massive parallelism
Millions of logic elements
Thousands of 20Kb memory blocks
Thousands of variable-precision DSP blocks
Dozens of high-speed transceivers
Hardware-centric
VHDL/Verilog
Synthesis
Place & route
(Figure: array of logic elements joined by programmable routing switches and surrounded by I/O)
FPGA Logic Elements
FPGAs use multiple-input LUTs
Multiple families
Cyclone: cost and power balanced
Arria: midrange
Stratix: high bandwidth
(Figure: a LUT built from SRAM cells selected by inputs A and B; a logic block with multiple-input LUTs, address inputs Addr0 and Addr1, registers Reg0 to Reg3, and shift and carry in/out chains)
Typical FPGA Design Entry
always @(a or b or c or d or sel)
begin
  case (sel)
    2'b00: mux_out = a;
    2'b01: mux_out = b;
    2'b10: mux_out = c;
    2'b11: mux_out = d;
  endcase
end
(Figure: 4-to-1 multiplexer with inputs a, b, c, d, select sel and output mux_out)
Hardware circuits are described using Hardware Description Languages (HDL) such as VHDL or Verilog
A designer must describe the behavior of the algorithm to create a low-level digital circuit from it
Digital Design with TTL Logic
Truth table (inputs A, B, C, D; output X)
Karnaugh map (rows AB, columns CD, each ordered 00, 01, 11, 10)
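Digital Design with TTL Logic (cont.)
Logic expression (sum of products)
X = AB + CD + BD + BC + AD + AC
Final logic implementation (NAND-NAND form, ' denotes complement)
X = ((AB)' (CD)' (BD)' (BC)' (AD)' (AC)')'
(Figure: 7400 NAND gates feeding a 7430 NAND gate, with 7474 flip-flops whose PRE and CLR inputs are tied to VCC)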
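The step from the sum-of-products expression to a NAND-only implementation relies on De Morgan's law. The following is a minimal sketch in C, not taken from the slides, that checks the two forms agree on all 16 input combinations. The function names are hypothetical, and the wide NAND is assumed to use only six of its inputs.

```c
#include <stdbool.h>

/* 2-input NAND, the function provided by one gate of a 7400. */
static bool nand2(bool p, bool q) { return !(p && q); }

/* Sum-of-products form: X = AB + CD + BD + BC + AD + AC. */
bool x_sum_of_products(bool a, bool b, bool c, bool d)
{
    return (a && b) || (c && d) || (b && d) || (b && c) || (a && d) || (a && c);
}

/* NAND-NAND form: six 2-input NANDs feed one wide NAND.
 * The wide NAND is NOT(AND of all six intermediate terms). */
bool x_nand_nand(bool a, bool b, bool c, bool d)
{
    bool t0 = nand2(a, b), t1 = nand2(c, d), t2 = nand2(b, d);
    bool t3 = nand2(b, c), t4 = nand2(a, d), t5 = nand2(a, c);
    return !(t0 && t1 && t2 && t3 && t4 && t5);
}

/* Number of input combinations (out of 16) on which both forms agree. */
int count_agreements(void)
{
    int agree = 0;
    for (int v = 0; v < 16; ++v) {
        bool a = v & 8, b = v & 4, c = v & 2, d = v & 1;
        if (x_sum_of_products(a, b, c, d) == x_nand_nand(a, b, c, d))
            ++agree;
    }
    return agree;
}
```

Calling count_agreements() should return 16, which is the exhaustive check a truth table would give.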
From TTL to Programmable Logic
General features of logic implementations
Sum of products (AND-OR gates; combinatorial logic)
Stored results (registered outputs)
Wired together
What if...
Logic functions were fixed (like TTL), but combined into a single device?
Wiring (routing) connections could be controlled (programmed) somehow?
Field Programmable Gate Array (FPGA)
LABs arranged in an array
Row and column programmable interconnect
Interconnect may span all or part of the array
(Figure: array of LABs with row interconnect, column interconnect and segmented interconnects)
CPLD LABs vs. FPGA LABs
FPGA LABs are made up of logic elements (LEs) instead of product terms and macrocells
Easier to create complex functions through LE cascading
Lookup Tables (LUTs)
Replace the product term array
Combinational functions are created with programmed tables (cascaded multiplexers)
LUT inputs are mux select lines
Programmed levels (EEPROM or SRAM)
Example with inputs A, B, C, D (' denotes complement):
X = AB + A'B'C'D' + A'B'CD
Programmed mask = 0x9889 (A is the least significant select line)
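A LUT is a small memory addressed by its inputs, which can be sketched in a few lines of C. The example below is illustrative only and assumes that the slide's expression is X = AB + A'B'C'D' + A'B'CD (the overbars appear to have been lost in extraction) and that A is address bit 0 and D is address bit 3. Under those assumptions the computed mask matches the slide's value of 0x9889. The function names lut4 and build_mask are hypothetical.

```c
#include <stdint.h>

/* Evaluate a 4-input LUT: the inputs form a 4-bit address (A is bit 0,
 * D is bit 3) that selects one bit of the 16-bit programmed mask. */
int lut4(uint16_t mask, int a, int b, int c, int d)
{
    unsigned addr = (unsigned)(a & 1) | ((unsigned)(b & 1) << 1) |
                    ((unsigned)(c & 1) << 2) | ((unsigned)(d & 1) << 3);
    return (mask >> addr) & 1;
}

/* Build the mask for X = AB + A'B'C'D' + A'B'CD by enumerating
 * all 16 input combinations, as a synthesis tool would. */
uint16_t build_mask(void)
{
    uint16_t mask = 0;
    for (unsigned addr = 0; addr < 16; ++addr) {
        int a = addr & 1, b = (addr >> 1) & 1;
        int c = (addr >> 2) & 1, d = (addr >> 3) & 1;
        int x = (a && b) || (!a && !b && !c && !d) || (!a && !b && c && d);
        if (x)
            mask |= (uint16_t)(1u << addr);
    }
    return mask;
}
```

Changing the logic function only changes the mask; the lookup hardware modeled by lut4 stays the same, which is what makes the LUT programmable.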
Programmable register
Configure for D, T, JK, or SR flip-flop operation
Clock typically driven by global clock
Asynchronous control through other logic or I/O
Feedback into LUT
Bypass register or LUT
Carry and Register Chains
Chain carry bits between LEs
Register outputs can chain to other LE registers in the LAB to form LUT-independent shift registers
Register Packing
Separate outputs from LUT and register create two outputs from one LE
Saves device resources
LABs and LEs: A Closer Look
(Figure: each LE contains LUT & carry logic followed by a register)
Adaptive Logic Modules (ALM)
Based on the LE, but includes dedicated resources & an adaptive LUT (ALUT)
Improves performance and resource utilization
(Figure: ALM with 8 inputs feeding an adaptive LUT, two adders and two registers)
FPGA Routing
All device resources can feed into or be fed by any routing in the device
Differing fixed lengths to adjust for timing
Scales linearly as density increases
Local interconnect
Connects between LEs or ALMs within a LAB
Can include direct connections between adjacent LABs
Row and column interconnect
Fixed-length routing segments
Span a number of LABs or the entire device
FPGA I/O Elements
Advanced programmable logic blocks connect directly to row or column interconnect
Control available I/O features
Input/output/bidirectional
Multiple I/O standards
Differential signaling
Current drive strength
Slew rate
On-chip termination/pull-up resistors
Clamping diodes for PCI bus use
Open drain/tri-state
etc.
Typical I/O Element Logic
(Figure: device pin with output enable control, output path and input path)
Other Typical FPGA Features
Replace some LABs with dedicated functional hardware blocks
Memory blocks
Create on-board memory structures to support design
Single/dual-port RAM
ROM
Shift registers or FIFO buffers
Initialize RAM or ROM contents on power-on
Memory LABs (MLABs)
Embedded multipliers
Useful for DSP
High-performance multiply/add/accumulate operations
High-speed transceivers
Typical PLD Design Flow
Design specification
Design entry: schematic entry, RTL coding, Qsys, DSP Builder or OpenCL
- Behavioral or structural description of design
RTL simulation
- Functional simulation
- Verify logic model & data flow
Synthesis (mapping)
- Translate design into device-specific primitives (LE, M4K/M9K, M512, I/O)
- Optimization to meet required area & performance constraints
- Quartus II synthesis or third-party synthesis tools
- Result: post-synthesis netlist
Place & route (fitting)
- Map primitives to specific locations inside target technology with reference to area & performance constraints
- Specify routing resources to be used
- Quartus II Fitter
- Result: post-fit netlist
Typical PLD Design Flow (cont.)
Timing analysis (TimeQuest Timing Analyzer)
- Verify performance specifications were met
- Static timing analysis
Gate-level simulation (optional)
- Simulation with timing delays taken into account
- Verify design will work in target technology
PC board simulation & test
- Simulate board design
- Program & test device on board
- Use SignalTap II Logic Analyzer or other on-chip tools for debugging
The Key to Performance
Maximize throughput: more operations per second
Parallelism
Pipelining
Instructions
Processes
Loop unrolling
Duplication (SPMD)
Multi-threading (SMT)
Minimize latency: quick data access
Memory access
Avoid transfer/copy
Work in local memory instead of shared memory
Coalesce accesses
FPGA Acceleration
Efficient power
Custom processors optimized for the task
Small soft scalar processor
Larger vector processor
Hardware pipeline
Dedicated local memory
Acceleration
Multiple engines for SMT
Replication for SPMD
Pipelining for more throughput
(Figure: several processors, each with its own memory, surrounded by I/O)
Interface Protocols IP
Wireline: Ethernet, Interlaken, XAUI, OTN/OTL, SFI-S, RXAUI/DXAUI*, SONET/SDH*, HiGig*
General: PCI Express, SerialLite, QPI*, SATA/SAS*, Fibre Channel*, InfiniBand*
Wireless: CPRI, RapidIO, JESD204B
Video: SDI, DisplayPort, HDMI
Extensive support of protocol interfaces across market segments
* Partner IP. The rest of the IP is developed and supported by Altera
Memory Interfaces IP
DDR SDRAM: DDR4, DDR3, LPDDR3
RLDRAM: RLDRAM 3, RLDRAM 2
QDR SRAM: QDR II+ Xtreme, QDR II / II+
Serial Memory: HMC, GCI (MoSys)*
Broad support of memory interfaces
* Partner IP. The rest of the IP is developed and supported by Altera
SoC Solution
Host processor and FPGA accelerator in one package
Lower cost
Power efficient
Real-time system
Hard Processor System (HPS)
Two ARM Cortex-A9 cores, each with NEON / FPU and L1 cache; L2 cache; 64-KB RAM
JTAG debug / trace (1), SD / SDIO / MMC (1), timers (x6), QSPI flash control, NAND flash (1) (2)
USB OTG (x2) (1), Ethernet (x2) (1), GPIO, I2C (x4), SPI (x2), CAN (x2), UART (x2), DMA (8 channels)
Shared multiport DDR SDRAM controller (2)
HPS-to-FPGA and FPGA-to-HPS bridges, FPGA configuration
HPS I/Os
FPGA (acceleration)
28LP process
8-input ALMs
Variable-precision DSP
M10K memory and 640-bit MLABs
fPLLs
Hard PCIe
Hard multiport DDR SDRAM controller (2)
3-, 5-, 6-, and 10-Gbps transceivers
FPGA general-purpose I/Os
Notes:
(1) Integrated direct memory access (DMA)
(2) Integrated ECC
FPGA Sweet Spot
(Table: CPU, DSP, GPU and FPGA rated with ++, + or - on low latency, SMT, SIMD, integer/bit, SPMD, floating point and power)
CPU (branch prediction): control compute intensive
DSP: datapath compute intensive
GPU (deep cache): array processing intensive
FPGA (flexible local memory): bit manipulation intensive
Programming Language Offerings
Target: GPU / multi-core CPU | DSP / embedded | FPGA
Scope: system (heterogeneous platform) | device | IP block
Designer: programmer | embedded programmer | hardware designer
Design flow: CUDA/OpenCL | Code Composer Studio (TI C) | Quartus II (Verilog/VHDL)
Design activity: task parallelism, data parallelism | real-time function acceleration | IP design and integration
Design constraints: throughput/latency, power efficiency | real-time execution, cost | clock frequency, resource utilization, interface requirements, power
Hardware knowledge: none (coding style guidelines) | limited (macro architecture, bandwidth level) | yes (protocol-level, timing closure, micro architecture)
(Figure labels: Today, Today, HLS PoC)
HLS vs OpenCL Positioning
OpenCL
Targets CPU, GPU and FPGAs
Target user is a software developer
Implements the FPGA in a software development flow
Performance is determined by the resources allocated
Host required
HLS
Targets FPGA
Target user is an FPGA designer
Implements the FPGA in the traditional FPGA development flow
Performance is defined, and the amount of resources needed to achieve it is reported
Host not required
Altera OpenCL SDK
Altera SDK for OpenCL Competitive Differentiator
Altera's SDK for OpenCL has proven to be a powerful solution for many vendors
Won the design tool and development software Elektra award in Europe
Won Ultimate Product of the Year for 2014
Actively being used today:
"I was extremely happy to get a great performance with such low effort. I was so impressed with how powerful the Altera tool was!"
(Senior Engineer, Altera OpenCL customer)
First Conformant OpenCL Solution for FPGAs!!!
OpenCL v1.0 specification
>8500 programs tested
Supports ARM host: CV and AV SoC
http://www.khronos.org/conformance/adopters/conformant-companies
http://www.khronos.org/conformance/adopters/conformant-products
Heterogeneous Platform Model
OpenCL platform model: a host with host memory connected to one or more (compute) devices; each device has global memory and compute units made up of processing elements
Example platform: an x86 host connected to the devices over PCIe
OpenCL Use Model: Abstracting the FPGA away
OpenCL Accelerator Code
__kernel void sum(__global float *a,
                  __global float *b,
                  __global float *y)
{
    int gid = get_global_id(0);
    y[gid] = a[gid] + b[gid];
}
Host Code
main() {
    read_data( ... );
    manipulate( ... );
    clEnqueueWriteBuffer( ... );
    clEnqueueNDRangeKernel( ..., sum, ... );
    clEnqueueReadBuffer( ... );
    display_result( ... );
}
Accelerator: Altera Offline Compiler -> Verilog -> Quartus II -> AOCX
Host: standard gcc compiler -> EXE
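To make the NDRange launch concrete, here is a minimal host-only sketch in plain C of what the sum kernel computes. It is an illustration, not part of the SDK: the loop stands in for the launch, gid stands in for get_global_id(0), and the names sum_work_item and run_sum_ndrange are hypothetical.

```c
#include <stddef.h>

/* Body of the `sum` kernel for a single work-item. */
static void sum_work_item(size_t gid, const float *a, const float *b, float *y)
{
    y[gid] = a[gid] + b[gid];
}

/* Sequential stand-in for a 1-D NDRange launch: one call per work-item.
 * On a real device the work-items may be pipelined or run in parallel. */
void run_sum_ndrange(size_t global_size, const float *a, const float *b, float *y)
{
    for (size_t gid = 0; gid < global_size; ++gid)
        sum_work_item(gid, a, b, y);
}
```

Because each work-item touches only its own index, the iterations are independent, which is what lets the offline compiler turn the kernel body into a hardware pipeline.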
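OpenCL Programming Model
Host: host.c is compiled with gcc against opencl.h and the driver
Host objects: Platform, Device, Context, Queue, Program, Kernel, Buffer, Launch
Application stages: Acquire, Compute, Visualize
Device: device.cl is compiled with aoc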
Use Model: clCreateProgramWithBinary
/* Read the precompiled kernel binary into memory */
fp = fopen("file.aocx", "rb");
fseek(fp, 0, SEEK_END);
lengths[0] = ftell(fp);
binaries[0] = (unsigned char*)malloc(sizeof(unsigned char) * lengths[0]);
rewind(fp);
fread(binaries[0], lengths[0], 1, fp);
fclose(fp);
(Figure: host API call sequence from OpenCL.h)
clGetPlatformIDs returns the platform
clGetDeviceIDs returns the device
clCreateContext returns cl_context
clCreateCommandQueue returns cl_command_queue
The offline compiler turns the CL file (.cl) into the OpenCL program bitstream (.aocx)
clCreateProgramWithBinary loads the .aocx and returns cl_program
clBuildProgram
clCreateKernel returns cl_kernel
clEnqueueNDRangeKernel launches the kernel from host.c
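The file-reading fragment on this slide omits error handling. A more defensive version might look like the sketch below; it uses only the C standard library, and the helper name load_binary is hypothetical rather than part of the Altera SDK or OpenCL.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read a whole file into a malloc'd buffer. Returns NULL on any failure;
 * on success stores the byte count in *length. The caller frees the buffer. */
unsigned char *load_binary(const char *path, size_t *length)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    unsigned char *data = NULL;
    long size = -1;
    if (fseek(fp, 0, SEEK_END) == 0)
        size = ftell(fp);
    if (size > 0 && fseek(fp, 0, SEEK_SET) == 0) {
        data = malloc((size_t)size);
        if (data != NULL && fread(data, 1, (size_t)size, fp) != (size_t)size) {
            free(data);   /* short read: treat as failure */
            data = NULL;
        }
    }
    fclose(fp);
    if (data != NULL)
        *length = (size_t)size;
    return data;
}
```

The returned pointer and length would then fill the binaries and lengths arrays passed to clCreateProgramWithBinary, as in the slide's fragment.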
The Only Custom Accelerator Solution: Platforms
IO infrastructure: prebuilt BSP, built with standard HDL tools by an FPGA developer
DDR3 memory interfaces (x2)
QDRII memory interfaces
10Gb MAC/UOE data interfaces (x2) to the 10G network
PCIe gen2 x8 host interface
OpenCL domain: built with the Altera OpenCL Compiler
Kernel IP blocks joined by the interconnect
Altera Reference Platforms
                   High Performance Computing (HPC)   Network Enabled
Requirement        Compute power / memory bandwidth   Low latency
Global memory      Large amount of DDR                DDR and QDRII+
IO channels        None (minimize IP overhead)        2x10GbE (MAC/UOE)
Reference design   Option Pricing                     OPRA (streaming), Trading (with global memory access)
Architecture (both): host stack of OpenCL API, HAL, UMD and KMD; PCIe with DMA into a Stratix V FPGA running the OpenCL kernels; 2x DDR3; CPLD bridge and flash. The network enabled platform adds 2x 10G UDP.
SoC Reference Platforms
HPS block removes the complexities of BSP creation
Coherency between host and accelerator
(Figure: HPS with HPS DDR3 connected to the FPGA through the H2F/F2H, LWH2F (32-bit, 50 MHz) and F2S bridges; the FPGA holds the CSR, scratch memory, FPGA DDR3 memory and the OpenCL kernels, with DVI camera input and DVO monitor output)
The OpenCL Platforms page contains the CV SoC devkit platform user's guide
Altera Network Enabled Reference Platform for OpenCL
Reference design, software layer: host.c (C/C++ API) and device.cl (OpenCL C), built with the compiler
Reference platform, hardware layer: host and device
Host: 64-bit RHEL 6.4 or Windows 7
Device: s5_hft (S5PH-Q) reference board
Guaranteed Timing Flow
Inputs to AOC: kernel.cl, boardspec.xml, post-fit QXP partition (PCIe, UniPHY, DMA, ...)
Synthesis / P&R / STA on the OpenCL kernels ONLY
Meet timing?
Yes: DONE!
No: reconfigure the kernel PLL and re-run STA with the new PLL value
Heterogeneous Memory Support
(Figure: host and host memory reach the device through the device interface; inside the device, compute units (CU) connect to I/O, on-chip memory and two global memories)
Memories with different characteristics
Global Memory 1 (DDR): sequential access
Global Memory 2 (QDR): random access, low latency
MoSys: efficient
HMC: high capacity
Combine different memories
Attribute-based or automatic
__kernel void foo(
    global uint *data
        __attribute__((buffer_location("QDR")))
) {
    foo(data[i]);
}
Channels Advantage
Standard OpenCL
(Figure: OpenCL kernels reach the DDR3 and QDRII memory interfaces, the 10Gb network interfaces and the host interface through the interconnect; CvP update)
Altera Vendor Extension: IO and Kernel Channels
(Figure: the same platform with OpenCL kernels connected to the 10Gb interfaces and to each other through channels; CvP update)
External Function Interface
Ability to quickly integrate IPs in OpenCL
Supports Avalon Streaming based IPs
RTL module (myMod.v):
module myMod (input clock,
              input din,
              ...);
endmodule
OpenCL program (proc_element.cl):
extern myfunc();
void kernel proc_element() {
    x = myfunc(a, b, ...);
}
EFI spec (myMod.xml):
<EFI_SPEC>
  <FUNCTION name="myfunc" module="myMod">
    <FILE name="myMod.v" />
    ...
  </FUNCTION>
</EFI_SPEC>
Altera OpenCL Compiler output (proc_element.v):
module proc_element(input clock,
                    input ...,
                    ...);
  myMod mod_inst(.clock(clock),
                 .din(...), ...);
endmodule
Kernel Development Flow
Modify kernel.cl, then check with, in order of turnaround time:
x86 Emulator (sec)
Optimization Report (min)
Prototype (min)
Profiler (hours)
Questions asked along the way: Functional bugs? Stall-free pipeline? Memory coalesced? Hardware performance met?
DONE!
x86 emulator (Beta v14.1)
Enables functional debug of kernel code on an x86 system
Prototype support to allow users to run kernels on an x86 platform
Debug support for Altera vendor-specific extensions such as channels
kernel void accel() {
    gid = get_global_id(0);
    out[gid] = proc(data[gid]);
}
Supports OpenCL syntax, channels and printf
(Figure: the x86 kernel compiler produces ./kernel_tb, which runs on the host)
Example: Load to Store dependency
1  kernel void prefixsum( global int* restrict A, unsigned N ) {
2    for ( unsigned i = 1 ; i < N ; i++ ) {
3      int a = A[i-1];
4      A[i] += a;
5    }
6  }
==============================================================================
|                      *** Optimization Report ***                           |
==============================================================================
| Kernel: prefixsum                                                 | Ln.Col |
==============================================================================
| Loop for.body                                                     | 2.25   |
|   Pipelined execution inferred.                                   |        |
|   Successive iterations launched every 321 cycles due to:         |        |
|     Memory dependency on Load Operation from:                     | 3.21   |
|       Store Operation                                             | 4.7    |
|   Largest Critical Path Contributors:                             |        |
|     49%: Load Operation                                           | 3.21   |
|     49%: Store Operation                                          | 4.7    |
==============================================================================
Annotation: relative cost of global memory to local computation
Annotation: true fix requires restructuring the code
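The report says a true fix requires restructuring the code. One possible restructuring, sketched here in plain C rather than taken from the slides, keeps the running total in a local variable so that no iteration has to load what the previous iteration just stored. The function names are hypothetical, and the achievable iteration interval on hardware is not something this sketch can show.

```c
/* Original form: every iteration re-loads the value stored by the previous one. */
void prefixsum_reload(int *A, unsigned N)
{
    for (unsigned i = 1; i < N; i++) {
        int a = A[i - 1];
        A[i] += a;
    }
}

/* Restructured: the running total lives in a local variable, so the loop
 * carries a register-to-register add instead of a store-to-load round trip. */
void prefixsum_running(int *A, unsigned N)
{
    if (N == 0)
        return;
    int running = A[0];
    for (unsigned i = 1; i < N; i++) {
        running += A[i];
        A[i] = running;
    }
}
```

Both functions produce the same array; only the dependency the loop carries from one iteration to the next differs.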
Example: Accumulating a value
1  kernel void test( global float* restrict input,
2                    global float* restrict output, unsigned N )
3  {
4    float mul = 1.0f;
5    for ( unsigned i = 0; i < N; i++ ) {
6      mul *= input[ i ];
7    }
8    *output = mul;
9  }
==================================================================================
|                        *** Optimization Report ***                             |
==================================================================================
| Kernel: test                                                          | Ln.Col |
==================================================================================
| Loop for.body                                                         | 5.24   |
|   Pipelined execution inferred.                                       |        |
|   Successive iterations launched every 3 cycles due to:               |        |
|     Data dependency on variable mul                                   | 4.10   |
|   Largest Critical Path Contributor:                                  |        |
|     100%: Fmul Operation                                              | 6.7    |
==================================================================================
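The dependency on mul arises because each multiply must wait for the previous result. A common way to relax such a dependency, sketched below in plain C as an assumption rather than the slide's own fix, is to interleave several independent partial products and combine them at the end. LANES is a hypothetical value, and because floating-point multiplication is not associative the result may differ in the last bits from the serial version.

```c
#define LANES 4  /* hypothetical: chosen to cover the multiplier latency */

/* Single accumulator: each multiply waits for the previous result. */
float product_serial(const float *input, unsigned n)
{
    float mul = 1.0f;
    for (unsigned i = 0; i < n; i++)
        mul *= input[i];
    return mul;
}

/* Interleaved accumulators: iteration i updates lane i % LANES, so
 * consecutive iterations work on independent values. */
float product_interleaved(const float *input, unsigned n)
{
    float lane[LANES];
    for (unsigned k = 0; k < LANES; k++)
        lane[k] = 1.0f;
    for (unsigned i = 0; i < n; i++)
        lane[i % LANES] *= input[i];

    /* Combine the partial products once, outside the hot loop. */
    float mul = 1.0f;
    for (unsigned k = 0; k < LANES; k++)
        mul *= lane[k];
    return mul;
}
```

The trade-off is extra registers for the lanes and a short reduction at the end in exchange for a shorter loop-carried dependency.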
Rapid Prototyping (Beta v14.1)
Increases productivity during application development
Uses a library of pre-compiled templates to skip Quartus II compilation
Can test small versions of the final design on hardware very quickly
Standard flow: user program -> OpenCL compiler (aoc) -> Quartus II -> HW implementation, ~hours
Prototype flow: user program -> OpenCL compiler (aoc march=prototype) -> configuration template library -> HW implementation, ~minutes
Ability to generate custom templates based on user kernels
Tailors the Rapid Prototyping Template Library to the user
Profiler (BETA v14.1)
Instruments the pipeline with performance counters and profiling logic
Transfers the profiling information to the host via the PCIe link
kernel void accel() {
    gid = get_global_id(0);
    out[gid] = a[gid] + b[gid];
}
(Figure: kernel pipeline with two loads and one store, each instrumented and read back through memory-mapped registers)
Profiler (BETA v14.1)
Reports bottlenecks, bandwidth, saturation and pipeline occupancy
OpenCL Host Library & Run Time Environment (RTE)
Host library improvements:
Lower CPU usage
Improved scalability
Lower memory footprint
Faster run time
SDK & Run Time Environment:
OS                      SDK (needs ACDS)   RTE
Windows x86-64          Installer          Installer
Linux (RHEL) x86-64     Installer, RPM     Installer, RPM
Linux (RHEL) Power      -                  RPM
Linux (custom) CV SoC   -                  Tarball
Installable Client Driver (BETA v14.1)
host.c includes opencl.h and calls clGetPlatformIDs; the ICD dispatches to a vendor library such as nVidiaOpenCL or AlteraOpenCL
Application stages: Acquire, Compute, Visualize; kernels in device.cl
Windows: HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors holds a DWORD value for each <library>.dll
Linux: /etc/OpenCL/vendors/<vendor>.icd names the <library>.so
Altera Client Driver (BETA v14.1)
(Figure: the same call path with clGetDeviceIDs added; the ACD sits beneath AlteraOpenCL)
OpenCL + FPGA Key Benefits
Faster development vs. traditional FPGA design flow
Puts the FPGA in the software developer's hands
Familiar C-based development flow
Higher performance/watt vs. CPU/GPGPU
Implement exactly what you need
Pipeline parallel structures
Custom interconnect converging with data processing cores
Lower power vs. CPU/GPGPU
Core frequency is lower: 200-250 MHz vs. 1 GHz
Turn off unused logic
Up to 1/5 the power
Portability & obsolescence free
Code can transfer between different HW accelerators (CPU, GPGPU, FPGA, etc.)
Code ports seamlessly to new generations of the FPGA
FPGA life cycle is considerably longer than that of CPUs or GPGPUs
Additional Resources
Altera SDK for OpenCL Design Flow
Set Up: Getting Started Guide (document)
Install Quartus II v13.1 with the Altera SDK for OpenCL
Install a C compiler or development environment
Obtain and set up a license from the Self Service Licensing Center
Install the FPGA (OpenCL) board: aocl install
Design: Programming Guide (document)
Develop kernel code and compile on CPU/GPU for functional correctness
Build, compile & link the host application (Visual Studio/GCC)
Compile the OpenCL kernel with the Altera Offline Compiler (aoc)
Run the application
Optimize: Best Practices (document)
Optimize the kernel for FPGA hardware
Additional Altera OpenCL Collateral
White papers on OpenCL
OpenCL online demos
OpenCL design examples
Instructor-led training
Parallel Computing with OpenCL Workshop by Altera (1 day)
Optimization of OpenCL for Altera FPGAs Training by Altera (1 day)
Online training
Introduction to Parallel Computing with OpenCL
Writing OpenCL Programs for Altera FPGAs
Running OpenCL on Altera FPGAs
Single-Threaded vs. Multi-Threaded Kernels
Building Custom Platforms for Altera SDK for OpenCL
OpenCL board partners page
Application Benchmarking
Case Study: GZIP Compression
OpenCL was 10% slower
12% more resources
3x faster development time
An Altera summer intern ported and optimized the GZIP algorithm in a little more than a month
An industry-leading company's FPGA engineer coded it in Verilog in 3 months
Much lower design effort and design time
Results: CHREC/Univ. of Florida
OpenCL vs. VHDL performance table
Apps.    Stratix 4                Predicted Stratix 5      Stratix 5
         Frames/sec   Max freq.   Frames/sec   Max freq.   Frames/sec   Max freq.
Sobel    475          170         909          300         870          300
Canny    470          170         890          300         823          309
SURF     392          170         870          300         804          283
OpenCL vs. VHDL productivity table
Apps.                  VHDL development time   OpenCL development time
Sobel, Canny, & SURF   6 months                1 month
Conclusions
Avoid productivity challenges of HDL
6x increase in productivity
OpenCL offers a familiar C environment
Develop fully pipelined kernels
Minimum performance cost
(Figure: performance: OpenCL performance vs. VHDL performance, < 10% overhead; productivity: OpenCL vs. VHDL development time)
Case Study: Image Classification
Deep learning algorithm
Convolutional neural network
Based on Hinton's CNN
Early results on Stratix V
2X perf./power vs. GPGPU despite soft floating point
8+ simultaneous kernels vs. 2 on GPGPU
Exploiting OpenCL channels between kernels
A10 expectations
Hard floating point
Better density and frequency
~4X performance/watt vs. SV
Hinton's CNN Algorithm
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
Classes in the dataset: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
AES Encryption
Encryption/decryption
256-bit key
Counter (CTR) method
Advantage FPGA
Integer arithmetic
Coarse grain bit operations
Complex decision making
Results
Platform                             Power (W)   Performance (GB/s)   Efficiency (MB/s/W)
E5503 Xeon Processor (single core)   est 80      0.01                 0.125
AMD Radeon HD 7970                   est 100     0.33                 3.3
PCIe385 A7 Accelerator               25          5.20                 208
Multi-Asset Barrier Option Pricing
Monte-Carlo simulation
No closed form solution possible
High quality random number generator required
Billions of simulations required
Used GPU vendor's example code
Advantage FPGA
Complex control flow
Optimizations
Channels, loop pipelining
Results
Platform                Power (W)   Performance (Bsims/s)   Efficiency (Msims/s/W)
W3690 Xeon Processor    130         .032                    0.0025
nVidia Kepler20         212         10.1                    48
Bittware S5-PCIe-HQ     45          12.0                    266
Document Filtering
Unstructured data analytics
Bloom filter
Advantage FPGA
Integer arithmetic
Flexible memory configuration
Results
Platform                  Power (W)   Performance (MTs)   Efficiency (MTs/W)
(platform name missing)   130         2070                15.92
nVidia Tesla C2075        215         3240                15.07
(platform name missing)   25          3602                144.08
Consumer (Japan)
Image processing
Adaptive weighted images
(Equation: output pixel p_xy as a combination of coefficients c1 and c2 at positions ij and (i+1)j with weights d1_xy and d2_xy, normalized by W)
Advantage FPGA
Integer arithmetic
Results
Platform                  Power (W)   Performance (FPS)   Efficiency (FPS/W)
(platform name missing)   est 130     0.05                .0004
nVidia Quadro 4000        est 150     2.94                .0200
(platform name missing)   21          4.29                .2040
Smith-Waterman
Sequence alignment
Scoring matrix
Advantage FPGA
Integer arithmetic
SMT streaming
Results
Platform                  Power (W)   Performance (MCUPS)   Efficiency (MCUPS/W)
(platform name missing)   140         40                    .29
nVidia K20                225         704                   3.13
(platform name missing)   25          32596                 1303.00
Multi Function Printer
Image processing
RGB output of raster scanner converted to CMYK colorants for printing
Advantage FPGA
SoC solution
IO and kernel channels
Heterogeneous memory accesses
Goal: 50 PPM at A4/letter size
Results
>40X improvement over C-based algorithm on ARM only
No NEON coprocessor used
C6 speed grade part improved 20% to 128 PPM
Suricata: IDS/IPS Implementation (Cybersecurity)
(Figure: ingress network path, 2x 10 Gbps ETH IO -> STD PKT processing -> IDS PKT analysis and DPI PKT analysis, each with its rules memory (QDR or DDR) -> IDS/IPS management -> traffic control and IPS PKT manipulation -> egress network path, 2x 10 Gbps ETH IO, with mirror)
Decoder Kernel (autorun)
Stream in encoded packets (aoclReadChannel)
Unpack single streams (aoclWriteChannel)
Packet Analysis Kernel - IDS (task)
Stream in decoded packets and store in local memory (aoclReadChannel)
Parallel regex with STD rules in global memory (heterogeneous memory support)
Write results to global memory
Stream out decoded packets (aoclWriteChannel)
Packet Analysis Kernel - DPI (task)
Stream in decoded packets and store in local memory (aoclReadChannel)
Parallel regex with DPI rules in global memory (heterogeneous memory support)
Write results to global memory (aoclWriteChannel)
Host IDS/IPS Management
Read results from global memory and log
Decide to modify or delete packets
Packet Manipulation Kernel - IPS (task)
Stream in decoded packets (aoclReadChannel)
Read and process decision from the host (aoclWriteChannel)
Encoder Kernel (autorun)
Stream in decoded packets (aoclReadChannel)
Repack multiple streams
Stream out encoded packets (aoclWriteChannel)
Haplotype Caller (Pair-HMM)
Smith-Waterman-like algorithm
Uses hidden Markov models to compare gene sequences
3 stages: Assembler, Pair-HMM (70%), Traversal + Genotyping
Floating point (SP + DP)
C++ code starting point (from Java)
Whole genome takes 7.6 days!
Results
Platform             Runtime (ms)
Java (gatk 2.8)      10,800
Intel Xeon E5-1650   138
nVidia Tesla K40     70
Nallatech SV-D8      15.5
Sobel Filter
Fundamental image filter algorithm
Used commonly in industrial and automotive applications
Sliding window based design pattern
Same shift register structure, except in two dimensions
(Figure: line buffer with pixels entering at one end and window taps at offsets such as WIDTH-1, WIDTH-9, WIDTH*2-9, WIDTH*3, WIDTH*3-9, WIDTH*4-1 and WIDTH*4-9)
Task Implementation and Results
Altera OpenCL
__kernel void sobel(int iters)
{
    // Coefficients
    int Gx[3][3] = {{-1,-2,-1},{0,0,0},{1,2,1}};
    int Gy[3][3] = {{-1,0,1},{-2,0,2},{-1,0,1}};
    int rows[2 * COLS + 3]; // line buffer

    int count = 0;
    while (count != iters) {
        // Shift the line buffer
        #pragma unroll
        for (int i = COLS * 2 + 2; i > 0; --i) {
            rows[i] = rows[i - 1];
        }
        rows[0] = read_channel_altera(in_channel);
        int x_dir = 0;
        int y_dir = 0;
        #pragma unroll
        for (int i = 0; i < 3; ++i) {
            #pragma unroll
            for (int j = 0; j < 3; ++j) {
                x_dir += rows[i * COLS + j] * Gx[i][j];
                y_dir += rows[i * COLS + j] * Gy[i][j];
            }
        }
        int edge_weight = abs(x_dir) + abs(y_dir);
        write_channel_altera(out_channel, edge_weight);
        ++count;
    }
}
On our design example website: http://www.altera.com/support/examples/opencl/sobel-filter.html
Device      Resolution   FPS
Cyclone V   1080p        60
Stratix V   1080p        135
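For checking the kernel's arithmetic without an FPGA, the same line-buffer logic can be modeled in plain C. The sketch below is an assumption-laden stand-in, not the Altera design example: arrays replace the input and output channels, COLS is a hypothetical small width, the line buffer is zero-filled at start, and the name sobel_model is made up.

```c
#include <stdlib.h>

#define COLS 8  /* hypothetical image width for the model */

static const int Gx[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};
static const int Gy[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};

/* Software model of the streaming kernel: arrays replace the channels.
 * Consumes `iters` pixels from `in` and writes one edge weight per pixel. */
void sobel_model(const int *in, int *out, int iters)
{
    int rows[2 * COLS + 3] = {0};   /* line buffer, zero-filled in this model */

    for (int count = 0; count < iters; ++count) {
        /* Shift the line buffer by one and insert the new pixel. */
        for (int i = COLS * 2 + 2; i > 0; --i)
            rows[i] = rows[i - 1];
        rows[0] = in[count];

        /* Apply both 3x3 masks to the window taps in the line buffer. */
        int x_dir = 0, y_dir = 0;
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j) {
                x_dir += rows[i * COLS + j] * Gx[i][j];
                y_dir += rows[i * COLS + j] * Gy[i][j];
            }
        out[count] = abs(x_dir) + abs(y_dir);
    }
}
```

A flat image is a convenient sanity check: once the buffer has filled with a constant value, both gradients cancel and the edge weight is zero.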
