Chapter9 Intro FPGA

Introduction to FPGA
Technology, Devices and Tools
FPGA Devices & Technology
World of Integrated Circuits
Full-Custom ASICs
Semi-Custom ASICs
User Programmable
PLD
FPGA
ASIC: Application-Specific Integrated Circuit

Designed all the way from behavioral description to physical layout
Designs must be sent for expensive and time-consuming fabrication in a semiconductor foundry
FPGA: Field-Programmable Gate Array

Small development overhead
No NRE (non-recurring engineering) costs
Quick time to market
No minimum order quantity
Reprogrammable
How can we make programmable logic?
One-time programmable
  Fuses (destroy internal links with current)
  Anti-fuses (grow internal links)
  PROM
Reprogrammable
  EPROM
  EEPROM
  Flash
  SRAM (volatile)
What is an FPGA?
[Figure: FPGA block diagram showing configurable logic blocks, block RAMs, and I/O blocks]
Which Way to Go?

ASICs
  High performance
  Low power
  Low cost in high volumes

FPGAs
  Off-the-shelf
  Low development cost
  Short time to market
  Reconfigurability
Other FPGA Advantages
The manufacturing cycle for an ASIC is very costly and lengthy, and engages lots of manpower
Mistakes not detected at design time have a large impact on development time and cost
FPGAs are perfect for rapid prototyping of digital circuits
Easy upgrades, as in the case of software
Unique applications
  Reconfigurable computing
Major FPGA Vendors

SRAM-based FPGAs
  Xilinx, Inc.
  Share over 60% of the market
  Altera Corp.
  Atmel
  Lattice Semiconductor
Flash & antifuse FPGAs
  Actel Corp.
  QuickLogic Corp.
XILINX
Xilinx
Primary products: FPGAs and the associated CAD software
  Programmable Logic Devices
  ISE Alliance and Foundation Series Design Software
Main headquarters in San Jose, CA
Fabless* semiconductor and software company
  UMC (Taiwan)
  Seiko Epson (Japan)
  TSMC (Taiwan)
*Xilinx acquired an equity stake in UMC in 1996
Xilinx FPGA Families
Old families
  XC3000, XC4000, XC5200
  Old 0.5 µm, 0.35 µm and 0.25 µm technology. Not recommended for modern designs.

High-performance families
  Virtex (0.22 µm)
  Virtex-E, Virtex-EM (0.18 µm)
  Virtex-II, Virtex-II PRO (0.13 µm)

Low-cost family
  Spartan/XL, derived from XC4000
  Spartan-II, derived from Virtex
  Spartan-IIE, derived from Virtex-E
  Spartan-3
Basic Spartan-II FPGA Block Diagram
CLB Structure
[Figure: CLB with two slices. Each slice contains two look-up tables (inputs G1-G4 and F1-F4), carry and control logic between CIN and COUT, and two storage elements, with CLK and CE inputs.]
Each slice has 2 LUT-FF pairs with associated carry logic
Two 3-state buffers (BUFT) associated with each CLB, accessible by all CLB outputs
CLB Slice Structure
Each slice contains two sets of the following:
Four-input LUT
  Any 4-input logic function, or
  16-bit x 1 sync RAM, or
  16-bit shift register
Carry & Control
  Fast arithmetic logic
  Multiplier logic
  Multiplexer logic
Storage element
  Latch or flip-flop
  Set and reset
  True or inverted inputs
  Sync. or async. control
LUT (Look-Up Table) Functionality

x1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
x2 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
x3 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
x4 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
y  1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0

[Figure: LUT with inputs x1, x2, x3, x4 and output y. In the table above, y depends only on x1 and x2: y = NOT (x1 AND x2).]

x1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
x2 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
x3 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
x4 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
y  0 1 0 0 0 1 0 1 0 1 0 0 1 1 0 0

Look-up tables are the primary elements for logic implementation
Each LUT can implement any function of 4 inputs
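The idea that a LUT is just a 16-entry truth table addressed by its inputs can be sketched in a few lines of Python. This is only a behavioral illustration: the helper name `make_lut` and the choice of x1 as the most significant address bit are assumptions made here to match the row order of the truth tables, not part of any vendor tool.

```python
def make_lut(truth_table):
    """Return a function modelling a 4-input LUT.

    truth_table lists the 16 output bits in the order used in the tables:
    (x1, x2, x3, x4) counts up from 0000 to 1111, with x1 taken as the
    most significant address bit (an assumption of this sketch).
    """
    if len(truth_table) != 16:
        raise ValueError("a 4-input LUT stores exactly 16 bits")
    bits = tuple(truth_table)

    def lut(x1, x2, x3, x4):
        # The four inputs form the address of the stored bit.
        address = (x1 << 3) | (x2 << 2) | (x3 << 1) | x4
        return bits[address]

    return lut


# Output row y of the first truth table.
nand_like = make_lut([1] * 12 + [0] * 4)
# Output row y of the second truth table.
other = make_lut([0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0])

print(nand_like(1, 1, 0, 1))  # 0, because x1 and x2 are both 1
print(other(0, 1, 0, 1))      # 1, row 0101 of the second table
```

Changing the function implemented by the LUT only means loading a different list of 16 bits; the lookup logic itself stays the same.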
5-Input Functions implemented using two LUTs

One CLB slice can implement any function of 5 inputs
The logic function is partitioned between two LUTs
The F5 multiplexer selects the LUT
[Figure: slice with two LUTs (each usable as LUT, ROM, or RAM) fed by F1-F4, and the F5 multiplexer controlled by BX]
5-Input Functions implemented using two LUTs

X5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
X4 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
X3 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
X2 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
X1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
Y  0 1 0 0 1 1 0 0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 0

[Figure: two LUTs whose outputs are multiplexed onto OUT]
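The partitioning of a 5-input function between two LUTs can be sketched as follows, using the Y values of the 5-input truth table. The choice of X5 as the multiplexer select and the names `LUT_LOW`, `LUT_HIGH`, `lut4`, and `function5` are assumptions made for this illustration.

```python
# Y values of the 5-input truth table, rows 00000 .. 11111.
Y = [0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
     0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0]

# Assumption: X5 drives the multiplexer select, so each LUT holds one half.
LUT_LOW = Y[:16]    # contents used when X5 = 0
LUT_HIGH = Y[16:]   # contents used when X5 = 1


def lut4(table, x4, x3, x2, x1):
    """Look up one bit in a 16-entry table addressed by four inputs."""
    return table[(x4 << 3) | (x3 << 2) | (x2 << 1) | x1]


def function5(x5, x4, x3, x2, x1):
    """Evaluate the 5-input function with two 4-input LUTs and a 2:1 mux."""
    low = lut4(LUT_LOW, x4, x3, x2, x1)
    high = lut4(LUT_HIGH, x4, x3, x2, x1)
    return high if x5 else low   # plays the role of the F5 multiplexer


print(function5(1, 0, 1, 1, 1))  # 1, row 10111 of the table
```

Both LUTs see the same four inputs; only the fifth input decides which LUT output reaches the slice output.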
Dedicated Expansion Multiplexers
MUXF5 combines 2 LUTs to create
  Any 5-input function (LUT5)
  Or selected functions up to 9 inputs
  Or a 4x1 multiplexer
MUXF6 combines 2 slices to form
  Any 6-input function (LUT6)
  Or selected functions up to 19 inputs
  Or an 8x1 multiplexer
Dedicated muxes are faster and more space efficient
[Figure: CLB with two slices; MUXF5 combines the two LUTs of each slice and MUXF6 combines the two slices]
Distributed RAM
CLB LUT configurable as Distributed RAM
  A LUT equals 16x1 RAM
  Implements single- and dual-port RAMs
  Cascade LUTs to increase RAM size
Synchronous write
Synchronous/asynchronous read
  Accompanying flip-flops used for synchronous read
[Figure: LUTs configured as one of the following RAM primitives]
RAM16X1S: D, WE, WCLK, A0-A3, O
RAM32X1S: D, WE, WCLK, A0-A4, O
RAM16X2S: D0, D1, WE, WCLK, A0-A3, O0, O1
RAM16X1D: D, WE, WCLK, A0-A3, DPRA0-DPRA3, SPO, DPO
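A behavioral sketch of a LUT used as a 16 x 1 RAM, with synchronous write and asynchronous read, and of two such RAMs cascaded into a 32 x 1 RAM. The class and function names are hypothetical and chosen for illustration; they model the behavior described here, not the vendor primitives themselves.

```python
class Ram16x1S:
    """Behavioral sketch of a 16 x 1 single-port LUT RAM."""

    def __init__(self):
        self.cells = [0] * 16

    def read(self, addr):
        # Asynchronous read: the output follows the address immediately.
        return self.cells[addr & 0xF]

    def clock(self, addr, d, we):
        # Synchronous write: a cell changes only on a clock edge with WE high.
        if we:
            self.cells[addr & 0xF] = d & 1


def cascade_read(lower, upper, addr5):
    """Read a 32 x 1 RAM built from two 16 x 1 RAMs; bit 4 picks the RAM."""
    ram = upper if (addr5 >> 4) & 1 else lower
    return ram.read(addr5 & 0xF)


ram = Ram16x1S()
ram.clock(addr=5, d=1, we=1)   # write a 1 to address 5 on the clock edge
print(ram.read(5))             # 1
```

A synchronous read would be obtained by registering the value returned by `read` in the accompanying flip-flop on the next clock edge.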
Shift Register
Each LUT can be configured as a shift register
  Serial in, serial out
Dynamically addressable delay up to 16 cycles
  For programmable pipeline
Cascade for greater cycle delays
Use CLB flip-flops to add depth
[Figure: LUT configured as a shift register with inputs IN, CE, and CLK, a chain of D flip-flop stages, a DEPTH[3:0] tap select, and output OUT]
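The dynamically addressable delay can be modelled as a 16-stage shift register with a selectable tap. This is a minimal behavioral sketch; the class name and the convention that a depth value of 0 selects a one-cycle delay are assumptions of this example.

```python
class ShiftRegisterLut:
    """Behavioral sketch of a LUT used as a 16-bit shift register."""

    def __init__(self):
        self.bits = [0] * 16   # bits[0] is the most recently shifted-in bit

    def clock(self, data_in, ce=1):
        # On a clock edge with CE high, every stage moves one position on.
        if ce:
            self.bits = [data_in & 1] + self.bits[:-1]

    def out(self, depth):
        # depth models DEPTH[3:0]; value n taps the bit delayed by n + 1 cycles.
        return self.bits[depth & 0xF]


sr = ShiftRegisterLut()
for bit in [1, 0, 0, 0, 0]:
    sr.clock(bit)
print(sr.out(4))  # 1, the bit shifted in five clock cycles earlier
print(sr.out(3))  # 0
```

Changing `depth` while the register keeps shifting changes the delay without reloading the contents, which is what makes the delay "dynamically addressable".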
Shift Register
Register-rich FPGA
  Allows for addition of pipeline stages to increase throughput
Data paths must be balanced to keep desired functionality
[Figure: two parallel 64-bit data paths. One passes through Operation A (4 cycles) and Operation B (8 cycles), 12 cycles in total; the other passes through Operation C (3 cycles), leaving a 9-cycle imbalance.]
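The balancing arithmetic from the figure (12 cycles on one path, 3 on the other, 64-bit buses) can be written out as a small calculation. The function names are hypothetical, and the resource comparison assumes one 16-deep LUT shift register per bit versus one flip-flop per bit per cycle.

```python
def balancing_delay(path_a_cycles, path_b_cycles):
    """Cycles of delay to insert on the shorter path."""
    return abs(path_a_cycles - path_b_cycles)


def delay_cost(bus_width, delay_cycles, lut_depth=16):
    """Resources needed to delay a bus: LUT shift registers vs. flip-flops."""
    luts_per_bit = -(-delay_cycles // lut_depth)   # ceiling division
    return {"luts": bus_width * luts_per_bit,
            "flip_flops": bus_width * delay_cycles}


imbalance = balancing_delay(4 + 8, 3)
print(imbalance)                  # 9
print(delay_cost(64, imbalance))  # {'luts': 64, 'flip_flops': 576}
```

Under these assumptions, delaying the 64-bit bus by 9 cycles takes 64 LUTs in shift-register mode, compared with 576 flip-flops.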

Fast Carry Logic
Increases efficiency and performance of adders, subtractors, accumulators, comparators, and counters
Carry logic is independent of normal logic and routing resources
Each CLB contains separate logic and routing for the fast generation of sum and carry signals
[Figure: slice with carry and control logic between CIN and COUT, and the carry logic routing through the CLBs from LSB to MSB]
Accessing Carry Logic
All major synthesis tools can infer carry logic for arithmetic functions

Addition (SUM <= A + B)
Subtraction (DIFF <= A - B)
Comparators (if A < B then)
Counters (count <= count + 1)
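What the inferred carry logic computes for an addition can be sketched bit by bit. The structure below (a propagate signal from the LUT, a carry multiplexer, and an XOR for the sum bit) is one common way such chains are built and is an assumption of this sketch, as is the function name.

```python
def carry_chain_add(a, b, width, carry_in=0):
    """Add two unsigned integers bit by bit along a carry chain.

    Per bit: the LUT computes the propagate signal p = a_i xor b_i, a carry
    multiplexer forwards the incoming carry when p is 1 and a_i otherwise,
    and an XOR gate forms the sum bit from p and the incoming carry.
    """
    carry = carry_in
    total = 0
    for i in range(width):             # LSB first, as in the carry chain
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        p = a_i ^ b_i                  # LUT output
        total |= (p ^ carry) << i      # sum bit
        carry = carry if p else a_i    # carry multiplexer
    return total, carry


print(carry_chain_add(9, 7, width=4))  # (0, 1): 9 + 7 = 16 overflows 4 bits
```

The per-bit work outside the LUT is only a multiplexer and an XOR, which is why a dedicated chain is faster than routing carries through general-purpose logic.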
Block RAM
Spartan-II True Dual-Port Block RAM
[Figure: block RAM with independent Port A and Port B]
Most efficient memory implementation
  Dedicated blocks of memory
Ideal for most memory requirements
  4 to 14 memory blocks
  4096 bits per block
  Use multiple blocks for larger memories
Builds both single and true dual-port RAMs
Dual Port Block RAM
Dual-Port Bus Flexibility

RAMB4_S4_S16
Port A (1K deep, 4 bits wide): WEA, ENA, RSTA, CLKA, ADDRA[9:0], DIA[3:0], DOA[3:0]
Port B (256 deep, 16 bits wide): WEB, ENB, RSTB, CLKB, ADDRB[7:0], DIB[15:0], DOB[15:0]
Each port can be configured with a different data bus width
Provides easy data width conversion without any additional logic
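The width conversion can be illustrated with a model of one 4096-bit block seen through a 4-bit port and a 16-bit port. The class name is hypothetical, and the bit mapping used here (word `addr` of a port covers bits `addr * width` upward, lowest bit first) is an assumption of this sketch; check the device documentation for the actual ordering.

```python
class DualWidthRam:
    """4096-bit memory seen through a 4-bit port A and a 16-bit port B."""

    def __init__(self):
        self.bits = [0] * 4096

    def _write(self, addr, width, value):
        # Assumed mapping: word addr occupies bits addr*width .. addr*width+width-1.
        for i in range(width):
            self.bits[addr * width + i] = (value >> i) & 1

    def _read(self, addr, width):
        return sum(self.bits[addr * width + i] << i for i in range(width))

    def write_a(self, addr, nibble):      # ADDRA[9:0], DIA[3:0]
        self._write(addr & 0x3FF, 4, nibble)

    def read_a(self, addr):               # ADDRA[9:0], DOA[3:0]
        return self._read(addr & 0x3FF, 4)

    def write_b(self, addr, word):        # ADDRB[7:0], DIB[15:0]
        self._write(addr & 0xFF, 16, word)

    def read_b(self, addr):               # ADDRB[7:0], DOB[15:0]
        return self._read(addr & 0xFF, 16)


ram = DualWidthRam()
for addr, nibble in enumerate([0x1, 0x2, 0x3, 0x4]):
    ram.write_a(addr, nibble)
print(hex(ram.read_b(0)))  # 0x4321
```

Four 4-bit writes on port A appear as one 16-bit word on port B, with no extra logic between the ports.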
Two Independent Single-Port RAMs

RAMB4_S1_S1
Port A (2K deep, 1 bit wide), address driven by VCC, ADDR[10:0]: WEA, ENA, RSTA, CLKA, ADDRA[10:0], DIA[0], DOA[0]
Port B (2K deep, 1 bit wide), address driven by GND, ADDR[10:0]: WEB, ENB, RSTB, CLKB, ADDRB[10:0], DIB[0], DOB[0]
Added advantage of True Dual-Port
  No wasted RAM bits
Can split a dual-port 4K RAM into two single-port 2K RAMs
  Simultaneous independent access to each RAM
To access the lower RAM
  Tie the MSB address bit to logic low
To access the upper RAM
  Tie the MSB address bit to logic high
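Tying the MSB of each port's address to a constant can be sketched as follows: one 4096 x 1 memory behaves as two independent 2K x 1 RAMs. The class and method names are hypothetical; following the figure, port A is assumed to have its MSB tied high (VCC) and port B tied low (GND).

```python
class SplitBlockRam:
    """One 4096 x 1 block RAM used as two independent 2K x 1 RAMs."""

    def __init__(self):
        self.cells = [0] * 4096

    # Port A: address MSB tied to logic high, so it reaches the upper 2K.
    def write_upper(self, addr, bit):
        self.cells[(1 << 11) | (addr & 0x7FF)] = bit & 1

    def read_upper(self, addr):
        return self.cells[(1 << 11) | (addr & 0x7FF)]

    # Port B: address MSB tied to logic low, so it reaches the lower 2K.
    def write_lower(self, addr, bit):
        self.cells[addr & 0x7FF] = bit & 1

    def read_lower(self, addr):
        return self.cells[addr & 0x7FF]


ram = SplitBlockRam()
ram.write_upper(100, 1)
print(ram.read_upper(100), ram.read_lower(100))  # 1 0
```

Because the two ports can never address the same cell, each half can be read and written without regard to the other.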
I/O Banking
Basic I/O Block Structure

[Figure: IOB with three flip-flops, each with D, Q, clock enable (EC), and set/reset (SR): a three-state control flip-flop, an output-path flip-flop, and an input-path flip-flop. The input path provides both a direct input and a registered input.]
IOB Functionality
The IOB provides the interface between the package pins and the CLBs
Each IOB can work as a uni- or bi-directional I/O
Outputs can be forced into high impedance
Inputs and outputs can be registered
  Advised for high-performance I/O
Inputs can be delayed
Routing Resources
[Figure: array of CLBs interconnected through programmable switch matrices (PSMs)]
Clock Distribution
FPGA Nomenclature
ALTERA
Device Families & Tools
Logic Element: FLEX10K
Logic Array Block: FLEX10K
FLEX10K Architecture
Stratix Architecture
Stratix Device Family
Feature                             | EP1S10     | EP1S20         | EP1S25         | EP1S30    | EP1S40     | EP1S60     | EP1S80       | EP1S120
Logic Elements (LEs)                | 10,570     | 18,460         | 25,660         | 32,470    | 41,250     | 57,120     | 79,040       | 114,140
M512 RAM Blocks (512 Bits + Parity) | 94         | 194            | 224            | 295       | 384        | 574        | 767          | 1,118
M4K RAM Blocks (4 Kbits + Parity)   | 60         | 82             | 138            | 171       | 183        | 292        | 364          | 520
M-RAM Blocks (512 Kbits + Parity)   | 1          | 2              | 2              | 4         | 4          | 6          | 9            | 12
Total RAM Bits                      | 920,448    | 1,669,248      | 1,944,576      | 3,317,184 | 3,423,744  | 5,215,104  | 7,427,520    | 10,118,016
DSP Blocks                          | 6          | 10             | 10             | 12        | 14         | 18         | 22           | 28
Embedded Multipliers                | 48         | 80             | 80             | 96        | 112        | 144        | 176          | 224
PLLs                                | 6          | 6              | 6              | 10        | 12         | 12         | 12           | 12
Maximum User I/O Pins               | 426        | 586            | 706            | 726       | 822        | 1,022      | 1,238        | 1,314
Engineering Sample Availability     | Now        | Use Production | Use Production | N/A       | Now        | N/A        | Now          | 2003
Production Device Availability      | March 2003 | Now            | Now            | Now       | March 2003 | April 2003 | January 2003 | 2003
FPGA Technology Roadmap
Year             | 1995 | 1996 | 1997 | 2000 | 2003 | 2004 ?
Technology (µm)  | 0.6  | 0.35 | 0.25 | 0.18 | 0.13 | 0.07
Gate count       | 25K  | 100K | 250K | 1M   | *    |
Transistor count | 3.5M | 12M  | 23M  | 75M  | 430M | 1B
*2003: 100K LC, 8 Mb RAM, 400 18x18 multipliers. Note: Xilinx Virtex-II Pro XC2VP100 (9/16/2003)
Advanced architecture of modern FPGAs
More guts
Additional components

RAM blocks
Dedicated multipliers
Tri-state buffers
Transceivers
Processor cores
DSP blocks
Dedicated Arithmetic Blocks
[Figure: dedicated arithmetic blocks from QuickLogic, Altera, and Xilinx]
Processor Cores
PowerPC in Virtex-II Pro

Embedded 300+ MHz Harvard architecture core
Low power consumption: 0.9 mW/MHz
Five-stage data path pipeline
Hardware multiply/divide unit
Thirty-two 32-bit general-purpose registers
16 KB two-way set-associative instruction cache
16 KB two-way set-associative data cache
Memory Management Unit (MMU)
  64-entry unified Translation Look-aside Buffers (TLB)
  Variable page sizes (1 KB to 16 MB)
Dedicated On-Chip Memory (OCM) interface
Supports IBM CoreConnect bus architecture
Debug and trace support
Timer facilities
ARM in Excalibur
Industry-standard ARM922T 32-bit RISC processor core operating at up to 200 MHz
ARMv4T instruction set with Thumb extensions
Memory management unit (MMU) included for real-time operating system (RTOS) support
Harvard cache architecture with 64-way set-associative separate 8-Kbyte instruction and 8-Kbyte data caches
Embedded programmable on-chip peripherals
  ETM9 embedded trace module to assist software debugging
  Flexible interrupt controller
  Universal asynchronous receiver/transmitter (UART)
  General-purpose timer
  Watchdog timer
FPGA Tools
Design process (1)

Specification (Lab Experiments)
Design and implement a simple unit that speeds up encryption with an RC5-like cipher with a fixed key set on an 8031 microcontroller. Unlike in Experiment 5, this time your unit has to be able to perform the encryption algorithm by itself, executing 32 rounds.

VHDL description (Your Source Files)
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_unsigned.all;

entity RC5_core is
    port(
        clock, reset, encr_decr: in std_logic;
        data_input: in std_logic_vector(31 downto 0);
        data_output: out std_logic_vector(31 downto 0);
        out_full: in std_logic;
        key_input: in std_logic_vector(31 downto 0);
        key_read: out std_logic   -- no semicolon after the last port
    );
end RC5_core;
Functional simulation
Synthesis
Post-synthesis simulation
Design process (2)

Implementation
Timing simulation
Configuration
On-chip testing
Active-HDL
Simulation Tools
Synthesis Tools
Logic Synthesis
VHDL description
architecture MLU_DATAFLOW of MLU is
    signal A1: STD_LOGIC;
    signal B1: STD_LOGIC;
    signal Y1: STD_LOGIC;
    signal MUX_0, MUX_1, MUX_2, MUX_3: STD_LOGIC;
begin
    A1 <= A when (NEG_A='0') else not A;
    B1 <= B when (NEG_B='0') else not B;
    Y  <= Y1 when (NEG_Y='0') else not Y1;

    MUX_0 <= A1 and B1;
    MUX_1 <= A1 or B1;
    MUX_2 <= A1 xor B1;
    MUX_3 <= A1 xnor B1;

    with (L1 & L0) select
        Y1 <= MUX_0 when "00",
              MUX_1 when "01",
              MUX_2 when "10",
              MUX_3 when others;
end MLU_DATAFLOW;
Circuit netlist
[Figure: circuit netlist synthesized from the VHDL description]
Features of synthesis tools

Interpret RTL code
Produce a synthesized circuit netlist in the standard EDIF format
Give preliminary performance estimates
Some can display circuit schematics corresponding to the EDIF netlist
Implementation
After synthesis, the entire implementation process is performed by FPGA vendor tools
  Xilinx ISE Foundation 6.2i
  Altera Quartus II 4.0
  3rd-party tools for the Alliance version
Circuit Compilation
1. Technology Mapping
2. Placement
Assign a logical LUT to a physical location.
3. Routing
Select wire segments and switches for interconnection.
Routing Example
[Figure: routing example on an FPGA with programmable connections]
Static Timing Analyzer
Performs static analysis of the circuit performance
Reports critical paths with all sources of delays
Determines maximum clock frequency
Static Timing Analysis
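Critical path: the longest path from outputs of registers to inputs of registers
[Figure: register-to-register path from in to out, clocked by clk, with combinational logic of delay t_P,logic between the flip-flops]
t_Critical = t_P,FF + t_P,logic + t_S,FF
where t_P,FF is the flip-flop propagation delay, t_P,logic is the delay of the combinational logic, and t_S,FF is the flip-flop setup time
Min. clock period = length of the critical path
Max. clock frequency = 1 / min. clock period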
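The relation between the critical path and the maximum clock frequency can be checked with a short calculation. The function names and the delay values are hypothetical, chosen only to illustrate the formula; a real static timing analyzer extracts them from the placed and routed design.

```python
def critical_path_ns(t_p_ff, t_p_logic, t_s_ff):
    """Register-to-register delay: t_P,FF + t_P,logic + t_S,FF (in ns)."""
    return t_p_ff + t_p_logic + t_s_ff


def max_clock_mhz(paths):
    """paths: iterable of (t_p_ff, t_p_logic, t_s_ff) tuples in ns."""
    # The minimum clock period equals the length of the longest path.
    min_period_ns = max(critical_path_ns(*p) for p in paths)
    return 1000.0 / min_period_ns


# Hypothetical delays in nanoseconds, chosen only for illustration.
paths = [(1.0, 5.0, 1.0), (1.0, 8.0, 1.0), (1.0, 3.5, 1.0)]
print(max_clock_mhz(paths))  # 100.0
```

Only the slowest register-to-register path matters: shortening any other path leaves the maximum clock frequency unchanged.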
Configuration
Once a design is implemented, you must create a file that the FPGA can understand
This file is called a bit stream: a BIT file (.bit extension)
The BIT file can be downloaded directly to the FPGA, or can be converted into a PROM file which stores the programming information
