Powering Your Innovation
CPLDs: Lowest Cost, Lowest Power
FPGAs: Cost/Power Balance, SoC & Transceivers
Mid-range FPGAs: SoC & Transceivers
FPGAs: Optimized for High Bandwidth
PowerSoCs: High-efficiency Power Management
RESOURCES
Embedded Soft and Hard Processors
Design Software
Development Kits
Intellectual Property (IP)
Industrial
Computing
Enterprise
Industry Challenges
A variety of applications are becoming bottlenecked by scalable performance requirements
Typically 1/5 the power of a GPU and orders of magnitude more performance per watt than a CPU
FPGAs
GPUs
ASICs
FPGA Programmers
Parallel Programmers
OpenCL expands the number of application developers
environment
Quartus is run behind the scenes
Emulator and profiler are software development tools
Pushing long compile times to the end
Military/Government (Crypto, Image Detection)
Data Processing Algorithms
Networking (DPI, SDN, NFV)
Medical (Diagnostic Image Processing, BioInformatics)
Broadcast, Consumer (Video image processing)
Demo
Circuit
Computer on a chip
I/O
Generic
Flash
CPU
RAM
Video I/O
Ethernet
DDR,QDR
I/O
XVCR
PCIe, SDI,..
I/O
I/O
FPGA
CPU
DSP
DSP
Programmable Logic is Found Everywhere!
Test,
Consumer
Automotive
Measurement,
& Medical
Communications
Broadcast
Computer &
Storage
Entertainment
Instrumentation
Wireless
Military
Computers
Broadband
Audio/video
Video display
Medical
Test equipment
Manufacturing
Cellular
Basestations
Wireless LAN
Secure comm.
Radar
Guidance and control
Servers
Mainframe
Automotive
Networking
Navigation
Entertainment
Switches
Routers
Security &
Energy Management
Wireline
Optical
Metro
Access
Broadcast
Studio
Satellite
Broadcasting
Military &
Industrial
Card readers
Control systems
ATM
Storage
RAID
SAN
Office
Automation
Copiers
Printers
MFP
FPGA Architecture
Massive Parallelism
I/O
blocks
Thousands of Variable Precision
I/O
DSP blocks
I/O
Dozens of High-speed
transceivers
Hardware-centric
Programmable
Routing Switch
VHDL/Verilog
Synthesis
Logic
Element
Place&Route
I/O
LUT
A
B
Out
SRAM
Cell
Out
0
1
SRAM
cell
Shift IN
Carry IN
A B
Addr0
Multiple Families
Reg0
Multiple
Input LUT
Multiple
Input LUT
Reg1
Addr1
Reg2
Reg3
Shift OUT
Carry OUT
a
b
c
d
sel
mux_out
Truth table
A B C D X
CD
00
01
11
10
00
01
11
10
AB
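Sum of Products
X = AB + CD + BD + BC + AD + AC
NAND-only form: $X = \overline{\overline{AB}\cdot\overline{CD}\cdot\overline{BD}\cdot\overline{BC}\cdot\overline{AD}\cdot\overline{AC}}$
[Schematic: 7400 NAND gates feeding a 7430 NAND gate and a 7474 flip-flop (D, Q, PRE, CLR); inputs A, B, C, D; supply VCC]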
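The sum-of-products expression X = AB + CD + BD + BC + AD + AC and its NAND-only realization (two-input 7400 NANDs feeding the wide 7430 NAND) should compute the same function by De Morgan's law. The sketch below checks that over all 16 input combinations. The function names are illustrative, and tying the unused 7430 inputs high is an assumption about the schematic.

```c
#include <assert.h>

/* Sum-of-products form: X = AB + CD + BD + BC + AD + AC. */
unsigned sop(unsigned a, unsigned b, unsigned c, unsigned d)
{
    assert(a <= 1 && b <= 1 && c <= 1 && d <= 1);
    return (a & b) | (c & d) | (b & d) | (b & c) | (a & d) | (a & c);
}

/* One two-input NAND gate (one quarter of a 7400). */
unsigned nand2(unsigned x, unsigned y)
{
    return !(x & y);
}

/* NAND-NAND form: six product terms through 2-input NANDs, then one wide NAND.
 * Assumption: the two unused inputs of the 8-input gate are tied to logic 1. */
unsigned nand_nand(unsigned a, unsigned b, unsigned c, unsigned d)
{
    assert(a <= 1 && b <= 1 && c <= 1 && d <= 1);
    unsigned all = nand2(a, b) & nand2(c, d) & nand2(b, d)
                 & nand2(b, c) & nand2(a, d) & nand2(a, c);
    return !all;
}

/* Count input combinations where the two forms disagree. */
int count_mismatches(void)
{
    int mismatches = 0;
    for (unsigned v = 0; v < 16; ++v) {
        unsigned a = v & 1u, b = (v >> 1) & 1u, c = (v >> 2) & 1u, d = (v >> 3) & 1u;
        if (sop(a, b, c, d) != nand_nand(a, b, c, d))
            ++mismatches;
    }
    return mismatches;
}
```

Calling count_mismatches() from a small test program is enough to confirm the equivalence; the function is 1 whenever at least two of the four inputs are 1.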
What if
Logic functions were fixed (like TTL), but combined into a single device?
Wiring (routing) connections could be controlled (programmed) somehow?
Segmented
interconnects
1 0 01 10 00
10 0 0 1 0
01
= x9889
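A look-up table stores one output bit per input combination in its SRAM cells, so a 4-input LUT is fully described by a 16-bit mask such as the x9889 value shown. A minimal C sketch of that idea follows. The function names, and the choice of input A as the least-significant address bit, are illustrative assumptions rather than anything stated on the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Evaluate a 4-input LUT: the inputs form an address into 16 SRAM bits.
 * Assumption: input a is address bit 0 and input d is address bit 3. */
unsigned lut4_eval(uint16_t mask, unsigned a, unsigned b, unsigned c, unsigned d)
{
    assert(a <= 1 && b <= 1 && c <= 1 && d <= 1);
    unsigned addr = (d << 3) | (c << 2) | (b << 1) | a;
    return (mask >> addr) & 1u;
}

/* "Program" a LUT: tabulate any 4-input Boolean function into a mask. */
uint16_t lut4_program(unsigned (*fn)(unsigned, unsigned, unsigned, unsigned))
{
    uint16_t mask = 0;
    assert(fn != 0);
    for (unsigned addr = 0; addr < 16; ++addr) {
        unsigned bit = fn(addr & 1u, (addr >> 1) & 1u,
                          (addr >> 2) & 1u, (addr >> 3) & 1u) & 1u;
        mask |= (uint16_t)(bit << addr);
    }
    return mask;
}

/* Example function to tabulate: 4-input XOR (odd parity). */
unsigned xor4(unsigned a, unsigned b, unsigned c, unsigned d)
{
    return a ^ b ^ c ^ d;
}
```

The same hardware implements any 4-input function; only the stored mask changes, which is what makes the logic programmable.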
Programmable register
Configure for D, T, JK, or SR flip-flop operation
Clock typically driven by global clock
Asynchronous control through other logic or I/O
Feedback into LUT
Bypass register or LUT
Register Packing
Separate outputs from LUT and register create two outputs from one LE
Saves device resources
LUT &
carry logic
Register
Adaptive Logic Modules (ALM)
Based on LE, but includes dedicated resources & adaptive LUT (ALUT)
Improves performance and resource utilization
ALM
1
ALM Inputs
2
3
Adder
Reg
4
5
6
7
8
Adaptive
LUT
Adder
Reg
FPGA Routing
All device resources can feed into or be fed by any routing in device
Differing fixed lengths to adjust for timing
Scales linearly as density increases
Local interconnect
Connects between LEs or ALMs within a LAB
Can include direct connections between adjacent LABs
Input/output/bidirectional
Differential signaling
Slew rate
Open drain/tri-state
etc.
output enable
control
device pin
output
path
input path
Embedded multipliers
High-speed transceivers
Schematic entry/RTL coding/Qsys/DSP Builder/OpenCL
- Behavioral or structural description of design
RTL simulation
- Functional simulation
- Verify logic model & data flow
LE
M4K/M9K
M512
I/O
Synthesis (Mapping)
- Translate design into device specific primitives
- Optimization to meet required area & performance constraints
- Quartus II synthesis or 3rd party synthesis tools
- Result: Post-synthesis netlist
Quick Data
Access
Parallelism
Memory Access
Pipelining
Instructions
Processes
Loop unrolling
Duplication (SPMD)
Multi-threading (SMT)
Avoid transfer/copy
Work in local memory instead of shared memory
Coalesce accesses
FPGA Acceleration
Efficient Power
Custom processors
Acceleration
Memory
Processor
Processor
Memory
Processor
Memory
Processor
Memory
Processor
Memory
I/O
I/O
I/O
Interface Protocols IP
Wireline
Ethernet
Interlaken
XAUI
OTN/OTL
General
Wireless
Video
PCI Express
CPRI
SDI
SerialLite
RapidIO
DisplayPort
QPI*
JESD204B
HDMI
SATA/SAS*
SFI-S
RXAUI/DXAUI*
SONET/SDH*
HiGig*
Extensive support of protocol interfaces across market segments
* Partner IP. The rest of the IP is developed and supported by Altera
Fibre Channel*
InfiniBand*
Memory Interfaces IP
DDR SDRAM
RLDRAM
QDR SRAM
Serial Memory
DDR4
RLDRAM 3
HMC
DDR3
RLDRAM 2
QDR II / II+
GCI (MoSys)*
LPDDR3
SoC Solution
ARM Cortex-A9
NEON / FPU
L1 Cache
ARM
L2 Cache
64-KB
RAM
JTAG
Debug /
Trace (1)
SD /
SDIO/
MMC (1)
Timers
(x6)
HPS to
FPGA
QSPI
Flash
Control
NAND
Flash
(1) (2)
FPGA
acceleration
Hard Multiport
DDR
Multiport
DDR
Multiport
DDRSDRAM
SDRAM
(2)
SDRAM
Controller
Controller
Controller
USB
OTG
(x2) (1)
(x2) (1)
GPIO
I2C
(x4)
SPI
(x2)
CAN
(x2)
DMA
(8
Channels)
FPGA
to HPS
UART
(x2)
FPGA
Config
28LP process
8-input ALMs
Variable-precision DSP
M10K memory and 640-bit MLABs
fPLLs
Hard
PCIe
PCIe
Notes:
(1) Integrated direct memory access (DMA)
(2) Integrated ECC
Ethernet
Lower cost
Power efficient
Real-time system
ARM Cortex-A9
NEON / FPU
L1 Cache
HPS I/Os
Cortex-A9
Host processor and FPGA accelerator in one package
++
(branch
prediction)
SMT
+
SIMD
-
Integer/
Bit
SPMD
-
Floating
Point
Power
++
++
++
++
(deep
cache)
++
+
(flexible
local
memory)
++
++
Target
GPU
Multi-Core CPU
DSP/
Embedded
FPGA
System
(Heterogeneous Platform)
Device
IP Block
Programmer
Embedded Programmer
Hardware Designer
CUDA/OpenCL
Quartus II (Verilog/VHDL)
Task Parallelism
Data Parallelism
Throughput/Latency
Power Efficiency
Scope
Designer
Design Flow
Design Activity
Design Constraints
Hardware Knowledge
Today
Today
HLS
Clock Frequency
Resource Utilization
Interface Requirements
Power
PoC
Targets FPGA
Host Required
CV and AV SoC
http://www.khronos.org/conformance/adopters/conformant-companies
http://www.khronos.org/conformance/adopters/conformant-products
Host
Memory
(Compute) Device
Host
Global
Memory
Example
Platform
PCIe
x86
Compute Unit
Processing
Element
Host
Memory
Host
Global
Memory
Example
Platform
PCIe
x86
Device
Device
Host Code
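main() {
    read_data( ... );
    manipulate( ... );
    clEnqueueWriteBuffer( ... );              /* copy input from host to device memory */
    clEnqueueNDRangeKernel( ..., sum, ... );  /* launch the kernel "sum" on the device */
    clEnqueueReadBuffer( ... );               /* copy results back to the host */
    display_result( ... );
}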
Standard
gcc
Compiler
Altera
Offline
Compiler
EXE
AOCX
Verilog
Quartus II
Accelerator
Host
Platform
Context
Device
Queue
gcc
opencl.h
Driver
Acquire
Compute
Visualize
Program
Kernel
Buffer
Launch
device.cl
aoc
OpenCL.h
API
.cl
clGetPlatformIDs
cl_platform
clGetDeviceIDs
Program (exe)
const char**
const char**
const char**
clCreateProgramWithBinary
cl_device
Program (exe)
cl_program
kernel
Offline
Compiler
clCreateContext
clBuildProgram
cl_context
clCreateCommandQueue
exe
Kernel (src)
exe
Kernel (src)
cl_command_queue
clEnqueueNDRangeKernel
.aocx
clCreateKernel
exe
host.c
cl_program
cl_kernel
CL File
OpenCL Program
Bitstream
DDR
QDR
QDR
QDR
QDR
10G
Network
Host
Kernel
IP
IO Infrastructure
Built with
Altera
OpenCL
Compiler
OpenCL Domain
Interconnect
DDR
Prebuilt
BSP with
standard HDL
Tools by FPGA
Developer
Kernel
IP
Low Latency
Compute Power/
Memory Bandwidth
UMD
KMD
Architecture
CPLD
Bridge
FLASH
DDR3 DDR3
OpenCL
API
HAL
10G
UDP
OpenCL
API
HAL
CPLD
Stratix V FPGA
UMD
DMA
(OpenCL Kernels)
PCIe
KMD
10G
UDP
Requirement
Network Enabled
Stratix V FPGA
DDR3 DDR3
DMA
(OpenCL Kernels)
PCIe
Global
Memory
IO Channels
2x10GbE (MAC/UOE)
Reference
Design
OPRA (Streaming)
Trading (with global memory access)
Option Pricing
32-bit, 50 MHz
H2F/F2H
HPS
DDR3
HPS
LWH2F
F2S
CSR
Scratch
DDR3
FPGA
Memory
OpenCL
Kernels
DVI
Camera
DVO
Monitor
host.c
OpenCL C
device.cl
Reference
Design
Compiler
Software Layer
Hardware Layer
Reference
Platform
Host
Device
64-bit
RHEL 6.4
Windows 7
s5_hft (S5PH-Q)
Reference
Board
Boardspec.xml
AOC
Synthesis / P&R / STA on the OpenCL kernels ONLY
Meet
Timing
Yes
No
Reconfig kernel PLL
Re-run STA with the
new PLL value
DONE!
Host
Memory
Host
IO
CU
Sequential Access
Global
Memory1
DDR
QDR
Random Access
Global
Memory2
On-Chip
IO
Low
Latency
__kernel void foo(
    __global uint *data
        __attribute__((buffer_location("QDR")))
) {
    foo(data[i]);
}
MoSys Efficient
Attribute-based
Automatic
Channels Advantage
Standard OpenCL
QDR
QDR
QDR
QDR
DDR
DDR
CvP Update
QDR
QDR
QDR
OpenCL
Kernels
OpenCL
Kernels
QDR
DDR3
Interface
DDR3
Interface
QDRII
Interface
QDRII
Interface
QDRII
Interface
QDRII
Interface
10G
Network
10Gb
Interface
10Gb
Interface
10G
Network
10Gb
Interface
10Gb
Interface
Host
Host
Interface
Host
Host
Interface
CvP Update
Interconnect
DDR
DDR3
Interface
DDR3
Interface
QDRII
Interface
QDRII
Interface
QDRII
Interface
QDRII
Interface
Interconnect
DDR
OpenCL
Kernels
OpenCL
Kernels
RTL
Module
endmodule
proc_element.v
proc_element.cl
OpenCL
Program
extern myfunc();
void kernel proc_element() {
    x = myfunc(a, b, ...);
}
myMod.xml
EFI Spec
<EFI_SPEC>
  <FUNCTION name="myfunc" module="myMod">
    ...
  </FUNCTION>
</EFI_SPEC>
Altera
OpenCL
Compiler
module proc_element(
    input clock,
    input ...,
    ... );
    myMod mod_inst(
        .clock(clock),
        .din(...), ... );
endmodule
Hardware
performance
met?
Prototype (min)
Profiler (hours)
DONE!
Functional Bugs?
Stall-free pipeline?
Memory coalesced?
x86 emulator
Beta v14.1
gid = get_global_id(0);
out[gid] = proc(data[gid]);
Supports
OpenCL syntax
Channels
Printf
x86
Kernel
Compiler
./kernel_tb
Running
==============================================================================
|                       *** Optimization Report ***                          |
==============================================================================
| Kernel: prefixsum                                                 | Ln.Col |
==============================================================================
| Loop for.body                                                     | 2.25   |
|   Pipelined execution inferred.                                   |        |
|   Successive iterations launched every 321 cycles due to:         |        |
|       Memory dependency on Load Operation from:                   | 3.21   |
|           Store Operation                                         | 4.7    |
|   Largest Critical Path Contributors:                             |        |
|       49%: Load Operation                                         | 3.21   |
|       49%: Store Operation                                        | 4.7    |
==============================================================================
Callout: Relative cost of global memory to local computation
Callout: True fix requires restructuring the code
==============================================================================
|                       *** Optimization Report ***                          |
==============================================================================
| Kernel: test                                                      | Ln.Col |
==============================================================================
| Loop for.body                                                     | 5.24   |
|   Pipelined execution inferred.                                   |        |
|   Successive iterations launched every 3 cycles due to:           |        |
|       Data dependency on variable mul                             | 4.10   |
|   Largest Critical Path Contributor:                              |        |
|       100%: Fmul Operation                                        | 6.7    |
==============================================================================
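The two reports differ mainly in how often a new loop iteration can be launched (every 321 cycles versus every 3 cycles). A rough way to see what that means for run time is the usual pipelined-loop model: the first iteration finishes after the pipeline latency, and each later iteration starts one launch interval after the previous one. The sketch below encodes that model. The function name is illustrative, and the latency is a hypothetical parameter since the reports do not state one.

```c
#include <assert.h>

/* Estimated cycles for a pipelined loop:
 *   latency + interval * (iterations - 1)
 * interval: cycles between successive iteration launches (from the report)
 * latency:  cycles for one iteration to pass through the pipeline
 *           (hypothetical input; not given in the reports) */
unsigned long long pipelined_loop_cycles(unsigned long long iterations,
                                         unsigned long long interval,
                                         unsigned long long latency)
{
    assert(interval >= 1);
    if (iterations == 0)
        return 0;
    return latency + interval * (iterations - 1);
}
```

For long loops the launch interval dominates, so under this model a loop launching every 321 cycles takes roughly 100 times longer than one launching every 3 cycles, whatever the latency.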
Rapid Prototyping
Beta v14.1
User Program
Quartus II
aoc
~ hours
OpenCL
Compiler
aoc march=prototype
HW
Implementation
~minutes
Profiler
BETA v14.1
gid = get_global_id(0);
out[gid] = a[gid]+b[gid];
Load
Load
Store
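The kernel fragment above performs two loads and one store per work-item. To make the execution model concrete, here is a plain C stand-in that runs the same body once per global ID in a sequential loop. This is only an emulation sketch: the function names are made up for illustration, no OpenCL API is involved, and on the FPGA the work-items would flow through a pipeline instead of a loop.

```c
#include <assert.h>
#include <stddef.h>

/* Body of one work-item; gid plays the role of get_global_id(0). */
void vadd_work_item(size_t gid, const float *a, const float *b, float *out)
{
    out[gid] = a[gid] + b[gid];   /* two loads, one store */
}

/* Sequential stand-in for a one-dimensional NDRange launch. */
void vadd_ndrange(size_t global_size, const float *a, const float *b, float *out)
{
    assert(a && b && out);
    for (size_t gid = 0; gid < global_size; ++gid)
        vadd_work_item(gid, a, b, out);
}
```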
Memory Mapped
Registers
Profiler
BETA v14.1
RTE
Installer
Installer
Installer, RPM
Installer, RPM
RPM
Linux (custom) CV
SoC
Tarball
Windows x86-64
BETA v14.1
host.c
clGetPlatformIDs
nVidiaOpenCL
ICD
opencl.h
Acquire
AlteraOpenCL
Compute
Visualize
device.cl
Windows: HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors, value <library>.dll (DWORD)
Linux: /etc/OpenCL/vendors/<vendor>.icd, containing <library>.so
BETA v14.1
host.c
clGetPlatformIDs
nVidiaOpenCL
ICD
opencl.h
clGetDeviceIDs
Acquire
AlteraOpenCL
Compute
Visualize
device.cl
ACD
Code can transfer between different HW accelerators (CPU, GPGPU, FPGA, etc)
Additional Resources
Install C Compiler or Development Environment
Design
Compile the OpenCL kernel with Altera Offline Compiler (aoc)
Run the application
Optimize
Online training
Application Benchmarking
10% slower
12% more resources
3x faster development time
An Altera summer intern ported and optimized the GZIP algorithm in a little more than a month
An industry-leading company's FPGA engineer coded the Verilog version in 3 months
Much lower design effort and design time
Results
CHREC/Univ of Florida
Conclusions
Sobel,
Canny,
&
SURF
6 months
1 month
Apps.   Stratix 4                Predicted Stratix 5      Stratix 5
        Frames/sec   Max freq.   Frames/sec   Max freq.   Frames/sec   Max freq.
Sobel   475          170         909          300         870          300
Canny   470          170         890          300         823          309
SURF    392          170         870          300         804          283
6x increase in productivity
OpenCL
development
time
OpenCL
performance
VHDL performance
< 10 % overhead
Performance
Productivity
8+ simultaneous kernels vs. 2 on GPGPU
A10 Expectations
~4X performance/watt vs. SV
AES Encryption
Encryption/decryption
256bit key
Advantage FPGA
Integer arithmetic
Results
Platform                  Power (W)   Performance (GB/s)   Efficiency (MB/s/W)
(platform not named)      est 80      0.01                 0.125
(platform not named)      est 100     0.33                 3.3
PCIe385 A7 Accelerator    25          5.20                 208
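The efficiency column in the AES results is throughput divided by power. The small helper below reproduces that arithmetic for the listed figures; the function name is illustrative, and the conversion assumes 1 GB = 1000 MB, which is what the table's own numbers imply.

```c
#include <assert.h>

/* Efficiency in MB/s per watt from throughput in GB/s and power in W.
 * Assumption: 1 GB = 1000 MB, matching the figures in the results table. */
double efficiency_mb_s_per_w(double gb_per_s, double watts)
{
    assert(watts > 0.0);
    return gb_per_s * 1000.0 / watts;
}
```

For example, 5.20 GB/s at 25 W gives 208 MB/s/W, and 0.33 GB/s at an estimated 100 W gives 3.3 MB/s/W.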
Optimizations
Results
Platform                  Power (W)   Performance (Bsims/s)   Efficiency (Msims/s/W)
(platform not named)      130         0.032                   0.0025
nVidia Kepler20           212         10.1                    48
Bittware S5-PCIe-HQ       45          12.0                    266
Document Filtering
Unstructured data analytics
Bloom Filter
Advantage FPGA
Integer Arithmetic
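A Bloom filter suits this kind of workload because membership tests need only integer hashing and single-bit reads. The sketch below is a generic software Bloom filter, not the design benchmarked on this slide: the filter size, the number of hash probes, the FNV-1a hash, the second hash basis and all function names are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS   1024u  /* filter size in bits (illustrative) */
#define BLOOM_HASHES 3u     /* probes per key (illustrative) */

typedef struct { uint8_t bits[BLOOM_BITS / 8]; } bloom_t;

/* FNV-1a 32-bit hash with a selectable starting basis. */
static uint32_t fnv1a(const char *s, uint32_t basis)
{
    uint32_t h = basis;
    for (; *s; ++s) {
        h ^= (uint8_t)*s;
        h *= 16777619u;
    }
    return h;
}

/* i-th probe position via double hashing: (h1 + i*h2) mod BLOOM_BITS. */
static uint32_t bloom_index(const char *key, uint32_t i)
{
    uint32_t h1 = fnv1a(key, 2166136261u);       /* standard FNV offset basis */
    uint32_t h2 = fnv1a(key, 0x9747b28cu) | 1u;  /* arbitrary second basis */
    return (h1 + i * h2) % BLOOM_BITS;
}

void bloom_init(bloom_t *f)
{
    assert(f);
    memset(f->bits, 0, sizeof f->bits);
}

void bloom_add(bloom_t *f, const char *key)
{
    assert(f && key);
    for (uint32_t i = 0; i < BLOOM_HASHES; ++i) {
        uint32_t idx = bloom_index(key, i);
        f->bits[idx / 8] |= (uint8_t)(1u << (idx % 8));
    }
}

/* Returns 0 if the key was definitely never added, 1 if it may have been. */
int bloom_maybe_contains(const bloom_t *f, const char *key)
{
    assert(f && key);
    for (uint32_t i = 0; i < BLOOM_HASHES; ++i) {
        uint32_t idx = bloom_index(key, i);
        if (!(f->bits[idx / 8] & (1u << (idx % 8))))
            return 0;
    }
    return 1;
}
```

A filter like this can report false positives but never false negatives, so a document that matches still needs an exact check afterwards.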
Results
Platform                  Power (W)   Performance (MTs)   Efficiency (MTs/W)
(platform not named)      130         2070                15.92
(platform not named)      215         3240                15.07
PCIe385 A7 Accelerator    25          3602                144.08
Consumer (Japan)
Image Processing
pxy
c1 ij d1 xy c1 ( i1) j d 2 xy c2 ij d xy c2 ( i1) j d 2 xy
W
Advantage FPGA
Integer Arithmetic
Results
Platform                  Power (W)   Performance (FPS)   Efficiency (FPS/W)
(platform not named)      est 130     0.05                0.0004
(platform not named)      est 150     2.94                0.0200
PCIe385 A7 Accelerator    21          4.29                0.2040
Smith-Waterman
Sequence Alignment
Scoring Matrix
Advantage FPGA
Integer Arithmetic
SMT Streaming
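Smith-Waterman fills a scoring matrix in which each cell depends only on its upper, left and upper-left neighbours, all computed with integer arithmetic. The C sketch below computes the best local-alignment score with a linear gap penalty. The scoring values (match +2, mismatch -1, gap -1) and the function name are illustrative assumptions; the slide does not state the parameters used in the benchmark.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SW_MATCH     2   /* illustrative scoring parameters */
#define SW_MISMATCH (-1)
#define SW_GAP      (-1)

/* Best local-alignment score of a and b (linear gap penalty).
 * Only two rows of the scoring matrix are kept in memory. */
int sw_score(const char *a, const char *b)
{
    assert(a && b);
    size_t n = strlen(a), m = strlen(b);
    int *prev = calloc(m + 1, sizeof *prev);
    int *curr = calloc(m + 1, sizeof *curr);
    assert(prev && curr);

    int best = 0;
    for (size_t i = 1; i <= n; ++i) {
        curr[0] = 0;
        for (size_t j = 1; j <= m; ++j) {
            int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? SW_MATCH : SW_MISMATCH);
            int up   = prev[j] + SW_GAP;
            int left = curr[j - 1] + SW_GAP;
            int h = 0;                    /* local alignment never drops below 0 */
            if (diag > h) h = diag;
            if (up   > h) h = up;
            if (left > h) h = left;
            curr[j] = h;
            if (h > best) best = h;
        }
        int *tmp = prev; prev = curr; curr = tmp;   /* reuse the two rows */
    }
    free(prev);
    free(curr);
    return best;
}
```

Cells on the same anti-diagonal are independent of each other, which is the property a streaming hardware implementation can exploit.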
Results
Platform                  Power (W)   Performance (MCUPS)   Efficiency (MCUPS/W)
(platform not named)      140         40                    0.29
nVidia K20                225         704                   3.13
PCIe385 A7 Accelerator    25          32596                 1303.00
Image Processing
Advantage FPGA
SoC Solution
IO and Kernel Channels
Heterogeneous memory accesses
2x 10 Gbps
ETH
ETH
IO
IO
Ingress
Network Path
STD
PKT
IDS PKT
Processing
Analysis
DPIPKT
PKT
DPI
Processing
Analysis
DPI Rules
Memory
STD Rules
Memory
IDS/IPS
MGMT
(QDR or DDR)
(QDR or DDR)
Traffic
IPS
PKT
Control
Manipulation
ETH
ETH
IO
IO
2x 10 Gbps
Mirror for
Egress
Network Path
Runtime (ms)
10,800
138
70
Nallatech SV-D8
15.5
Sobel Filter
Fundamental image filter algorithm
Used commonly in industrial and automotive applications
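For reference, a Sobel filter estimates the horizontal and vertical intensity gradients of each pixel from its 3x3 neighbourhood. The C sketch below is a direct, unoptimized version on an 8-bit grayscale image. It is not the Altera example design, which may be organized differently (the WIDTH-based labels on the slide suggest a sliding window over a line buffer). The function name, the |Gx| + |Gy| magnitude approximation, clamping to 255 and zeroed borders are all assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Direct 3x3 Sobel edge filter on an 8-bit grayscale image.
 * Output is |Gx| + |Gy| clamped to 255; border pixels are set to 0. */
void sobel(const uint8_t *in, uint8_t *out, int width, int height)
{
    assert(in != NULL && out != NULL && width >= 3 && height >= 3);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            if (x == 0 || y == 0 || x == width - 1 || y == height - 1) {
                out[y * width + x] = 0;
                continue;
            }
            const uint8_t *r0 = in + (y - 1) * width + x;  /* row above */
            const uint8_t *r1 = in + y * width + x;        /* current row */
            const uint8_t *r2 = in + (y + 1) * width + x;  /* row below */

            /* Horizontal gradient: right column minus left column. */
            int gx = (r0[1] + 2 * r1[1] + r2[1]) - (r0[-1] + 2 * r1[-1] + r2[-1]);
            /* Vertical gradient: bottom row minus top row. */
            int gy = (r2[-1] + 2 * r2[0] + r2[1]) - (r0[-1] + 2 * r0[0] + r0[1]);

            int mag = abs(gx) + abs(gy);
            out[y * width + x] = (uint8_t)(mag > 255 ? 255 : mag);
        }
    }
}
```

On an image with a sharp vertical edge, pixels next to the edge saturate at 255 while pixels in flat regions produce 0.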
WIDTH*3
WIDTH*4-9
A
WIDTH*3-9
E
WIDTH*2-9
WIDTH-9
WIDTH-1
http://www.altera.com/support/examples/opencl/sobel-filter.html
Device      Resolution   FPS
Cyclone V   1080p        60
Stratix V   1080p        135