• system-on-chip (SOC)
– processors: become components in a system
• SOC covers many topics
– processor: pipelined, superscalar, VLIW, array, vector
– storage: cache, embedded and external memory
– interconnect: buses, network-on-chip
– impact: time, area, power, reliability, configurability
– customisability: specialized processors, reconfiguration
– productivity/tools: model, explore, re-use, synthesise, verify
– examples: crypto, graphics, media, network, comm, security
– future: autonomous SOC, self-optimising/verifying design
• our focus
– overview, processor, memory
wl 2019 11.1
iPhone SOC
(die photo: Processor and I/O blocks)
AMD’s Barcelona Multicore Processor
(die photo)
• 4 out-of-order cores (Core 1–4)
• 3 levels of caches: 512KB L2 per core, 2MB shared L3 cache
• integrated Northbridge
• 65nm technology
Sequential and parallel machines
• basic single stream processors
– pipelined: overlap operations in basic sequential
– superscalar: transparent concurrency
– VLIW: compiler-generated concurrency
• multiple streams, multiple functional units
– array processors
– vector processors
• multiprocessors
Pipelined processor
(timing diagram: instructions #1–#4 each pass through the stages
IF ID AG DF EX WB; each instruction starts one cycle after the
previous one, so the stages overlap in time)
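The overlap in the diagram gives the usual idealised cycle count: the first instruction takes one cycle per stage, and each later instruction retires one cycle after its predecessor. A minimal sketch (function name is mine, no hazards or stalls assumed):

```python
def pipeline_cycles(n_instructions, n_stages):
    # Ideal k-stage pipeline: first instruction takes k cycles,
    # each subsequent instruction completes one cycle later.
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

# Four instructions through the 6-stage IF-ID-AG-DF-EX-WB pipeline:
print(pipeline_cycles(4, 6))                     # -> 9 cycles
print(round(4 * 6 / pipeline_cycles(4, 6), 2))   # speedup vs unpipelined -> 2.67
```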
Superscalar and VLIW processors
(timing diagram: instructions #1–#6 pass through the stages
IF ID AG DF EX WB, issued two at a time, so pairs of instructions
proceed through the pipeline in lockstep)
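An idealised w-way issue machine, as in the diagram above, drains instructions in groups of w; the cycle count follows directly (function name is mine, perfect instruction-level parallelism assumed):

```python
import math

def issue_cycles(n_instructions, n_stages, issue_width):
    # With w-way issue, the last group of w instructions starts at
    # cycle ceil(n/w) - 1 and then takes n_stages cycles to complete.
    if n_instructions == 0:
        return 0
    return n_stages + math.ceil(n_instructions / issue_width) - 1

print(issue_cycles(6, 6, 2))  # 2-way issue, 6 instructions -> 8 cycles
print(issue_cycles(6, 6, 1))  # scalar pipeline for comparison -> 11 cycles
```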
Superscalar
Array processors
(diagram: one instruction issued to all PEs)
• perform op if condition = mask
• operand can come from neighbour
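The two bullets above can be sketched in a few lines: every PE receives the same broadcast instruction, executes it only where its mask bit is set, and may take an operand from a neighbouring PE (here, hypothetically, the right neighbour with wraparound):

```python
def masked_add_neighbour(data, mask):
    # Each PE i computes data[i] + data[i+1] (wrapping) iff mask[i] is set;
    # masked-off PEs leave their element unchanged.
    n = len(data)
    result = list(data)
    for i in range(n):
        if mask[i]:
            neighbour = data[(i + 1) % n]
            result[i] = data[i] + neighbour
    return result

print(masked_add_neighbour([1, 2, 3, 4], [1, 0, 1, 0]))  # -> [3, 2, 7, 4]
```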
Vector processors
• vector registers, eg 8 sets x 64 elements x 64 bits
• vector instructions: VR3 = VR2 VOP VR1
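A vector instruction VR3 = VR2 VOP VR1 applies one operation elementwise across whole vector registers; a sketch over the 64-element registers mentioned above (function name is mine):

```python
def vop(vr2, vr1, op):
    # Elementwise vector operation: one instruction, 64 element results.
    assert len(vr2) == len(vr1)
    return [op(a, b) for a, b in zip(vr2, vr1)]

vr1 = list(range(64))   # 64-element vector register
vr2 = [2] * 64
vr3 = vop(vr2, vr1, lambda a, b: a * b)   # VR3 = VR2 * VR1
print(vr3[:4])  # -> [0, 2, 4, 6]
```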
Memory addressing: three levels
User view of memory: addressing
• a program: process address (offset + base + index)
– virtual address: from page address and process/user id
• segment table: process base and bound (for each process)
– system address: process base + page address
• pages: active localities in main/real memory
– virtual address: page table lookup to physical address
– page miss: virtual pages not in page table
• TLB (translation look-aside buffer): recent translations
– TLB entry: a (virtual, id) pair and its corresponding real address
• a few hashed virtual address bits address TLB entries
– if (virtual, id) matches a TLB entry, then use its translation
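The TLB behaviour described above, in miniature: translations are keyed by (virtual page, process id); a hit reuses a recent translation, a miss falls back to the page table and fills an entry. A minimal sketch (class name, FIFO eviction and capacity are my assumptions):

```python
class TLB:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = {}  # (virtual page, process id) -> real frame

    def translate(self, vpn, pid, page_table):
        key = (vpn, pid)
        if key in self.entries:           # TLB hit: recent translation reused
            return self.entries[key]
        frame = page_table[key]           # miss: full page-table lookup
        if len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))  # evict oldest entry
        self.entries[key] = frame
        return frame

page_table = {(0, 7): 42, (1, 7): 13}
tlb = TLB()
print(tlb.translate(0, 7, page_table))  # -> 42 (miss, entry filled)
print(tlb.translate(0, 7, page_table))  # -> 42 (hit)
```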
TLB and Paging: address translation
(diagram: virtual address → TLB (recent translations), or segment table
(find process) → process base → system address → page table (find page)
→ physical address)
SOC interconnect
• interconnecting multiple active agents requires
– bandwidth: capacity to transmit information (bps)
– protocol: logic for non-interfering message transmission
• bus
– AMBA (Adv. Microcontroller Bus Architecture) from ARM,
widely used for SOC
– bus performance: can determine system performance
• network on chip
– array of switches
– statically switched: eg mesh
– dynamically switched: eg crossbar
– adopted in the latest FPGAs to support AI and 5G
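For the statically switched mesh mentioned above, a common cost measure is the hop count under dimension-ordered (XY) routing: travel fully in x, then in y. A minimal sketch (function name is mine):

```python
def xy_hops(src, dst):
    # XY routing on a mesh of switches: |dx| hops east/west, then |dy| north/south.
    (sx, sy), (dx, dy) = src, dst
    return abs(dx - sx) + abs(dy - sy)

print(xy_hops((0, 0), (3, 2)))  # -> 5 hops across the switch array
```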
Adaptive Compute Acceleration Platform
(diagram: compute acceleration by Scalar Engines, Adaptable Engines and
AI Engines; adaptive platform with development tools, hardware/software
libraries, and diverse workloads swapped in the run-time stack within
milliseconds)
Source: Xilinx
Versal Architecture (7nm technology)
• Scalar Engines: platform control, edge compute
• Adaptable Engines: 2X compute density, diverse DSP workloads
• AI Engines: AI compute
• Network-on-Chip: guaranteed bandwidth, enables software programmability
Source: Xilinx
Platform Management Controller
• bringing the platform to life & keeping it safe & secure
• boot & configuration done in 10s of milliseconds

AI Engines
(diagram: array of AI cores with local memories)
• up to 400 AI Engines per device
• high bandwidth extensible memory
• deterministic and efficient: 8 times compute density
• applications: signal processing, computer vision,
  artificial intelligence (CNN, LSTM/MLP)
Source: Xilinx
AI Engine: tile-based architecture
(diagram of one tile)
• non-blocking interconnect: up to 200+ GB/s bandwidth per tile,
  connecting to PS, PL and I/O
• local memory: multi-bank implementation, shared across neighbour cores
• ISA-based vector processor: software programmable (e.g., C/C++),
  with AI and 5G vector extensions
• data mover
Source: Xilinx
AI Engine: processor core
(block diagram)
• scalar unit: 32-bit scalar RISC processor, scalar ALU, scalar register
  file, non-linear functions
• vector unit: 512-bit SIMD datapath, fixed-point vector unit,
  floating-point vector unit, vector register file
• instruction fetch & decode unit; 3 AGUs; load units A and B; store unit
Source: Xilinx
AI Engine: processor core
• Program Memory per Tile
– 16KB, 128-bit wide, 1K word deep, single port
– Instruction Compression, ECC protection + Reporting
• 32KB data memory per Tile
– 8 single-port banks, 256-bit wide, 128 words deep
– 5 cycle access latency
– Error detection (parity) + reporting
• Independent DMA per tile, 2D strided access to north, south, east, west
• 3 AGUs, 2 load, 1 store
• 32-bit scalar RISC
– w/ 32x32 scalar multiplier
– sin/cos, square root, inv square-root
• 512-bit vector fixed point unit
• Single Precision floating point vector unit
Source: Xilinx
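The tile data memory figures above check out: 8 banks × 256-bit (32-byte) words × 128 words = 32KB. A sketch of how a byte address could map onto (bank, word, byte), assuming word-interleaved banks, which the slide does not specify:

```python
WORD_BYTES = 32      # 256-bit wide banks
BANKS = 8
WORDS_PER_BANK = 128  # 8 * 32B * 128 = 32KB total

def decode(addr):
    # Split a byte address into (bank, word-in-bank, byte-in-word),
    # interleaving consecutive words across the 8 banks.
    assert 0 <= addr < BANKS * WORDS_PER_BANK * WORD_BYTES
    word_index, byte = divmod(addr, WORD_BYTES)
    bank, word = word_index % BANKS, word_index // BANKS
    return bank, word, byte

print(decode(0))   # -> (0, 0, 0)
print(decode(33))  # byte 1 of word 1 -> (1, 0, 1)
```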
Multi-precision support
(table: AI data types and signal processing data types)
Source: Xilinx
Design cost: product economics
• increasingly, product cost is determined by
– design costs, including verification
– not the marginal cost to produce
• manage complexity in die technology by
– engineering effort
– engineering cleverness
• design effort: often dictated by product volume
(graph: design time and effort vs basic physical tradeoffs;
the balance point depends on n, the number of units)
Design complexity
(graph: processors)
Cost: product program vs engineering
(charts: cost breakdowns)
• product cost: engineering costs, manufacturing costs,
  marketing/sales/administration
• engineering costs: labor costs (chip design, software, CAD support),
  fixed project costs (mask costs, CAD programs, capital equipment)
Example: two scenarios
• fixed costs Kf, support costs 0.1 × function(n), and variable costs
Kv × n, so total cost(n) = Kf + 0.1 × function(n) + Kv × n
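The cost model above can be amortised over volume to compare scenarios; here is a sketch with hypothetical numbers (an ASIC-like scenario with high fixed cost, and an FPGA-like one with high variable cost; all constants and the linear support function are my assumptions, not the slide's):

```python
def unit_cost(n, kf, kv, support):
    # Per-unit cost: (fixed + 0.1 * support(n) + variable * n) / n
    total = kf + 0.1 * support(n) + kv * n
    return total / n

asic = lambda n: unit_cost(n, kf=1_000_000, kv=10, support=lambda m: m)
fpga = lambda n: unit_cost(n, kf=10_000, kv=100, support=lambda m: m)

for n in (100, 100_000):
    print(n, round(asic(n), 2), round(fpga(n), 2))
# at low volume the FPGA-like scenario is cheaper per unit;
# at high volume the ASIC-like one wins -- the balance point depends on n
```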
More recent: higher NRE
(graph: non-recurring engineering costs, 1995 vs 2015)
IP: Intellectual Property
Summary
• physical customisation: FPGA vs ASIC
• customisation techniques
– parametric description: pointwise, pointfree descriptions
– patterns of composition: series, parallel, chain, row, grid
– transformations: retiming, slowdown, state machines
• system-on-chip (SOC)
– processors, memory, interconnect, design costs, IP
• why exciting?
– foundation of everything else in computing: theory + practice
– Microsoft adopts FPGAs in data centres; Intel bought Altera
– you can be part of it: projects, internship, research, start-up…
Answers to Unassessed Coursework 6
1. rdl_1 R = snd [-]⁻¹ ; R
   rdl_{n+1} R = snd apr_n⁻¹ ; rsh ; fst (rdl_n R) ; R
2. P0 = rdl_n Pcell ; π1
   <<s,x>, a> Pcell <sx+a, x>
3. rdl_n R = row_n (R ; π2⁻¹) ; π2
   P1 = loop (row_n Pcell1 ; fst map_n D) ; π1
   <<s,x>, a> Pcell1 <a, <sx+a, x>>
4. loop (row_n R) = (loop R)ⁿ
   Proof: induction on n
   (see www.doc.ic.ac.uk/~wl/papers/scp90.pdf)
   P1 = P2 ; [D,D]⁻ⁿ
   P2 = (loop (Pcell1 ; [D,[D,D]]))ⁿ