
SOC architecture and design

• system-on-chip (SOC)
– processors: become components in a system
• SOC covers many topics
– processor: pipelined, superscalar, VLIW, array, vector
– storage: cache, embedded and external memory
– interconnect: buses, network-on-chip
– impact: time, area, power, reliability, configurability
– customisability: specialized processors, reconfiguration
– productivity/tools: model, explore, re-use, synthesise, verify
– examples: crypto, graphics, media, network, comm, security
– future: autonomous SOC, self-optimising/verifying design
• our focus
– overview, processor, memory
iPhone SOC
[Figure: die photo of an iPhone SOC: a 1 GHz ARM Cortex-A8 processor surrounded by memory and I/O blocks]

Source: UC Berkeley
Basic system-on-chip model

[Figure: block diagram of a basic system-on-chip model]
AMD’s Barcelona Multicore Processor
• 4 out-of-order cores
• 1.9 GHz clock rate
• 65nm technology
• 3 levels of caches: 512KB L2 per core, 2MB shared L3 cache
• integrated Northbridge

[Figure: die layout: Cores 1-4, each with its own 512KB L2; 2MB shared L3 cache; Northbridge]

http://www.techwarelabs.com/reviews/processors/barcelona/


SOC vs processors on chip
• with lots of transistors, designs move in 2 ways:
– complete system on a chip
– multi-core processors with lots of cache

                 System on chip          Processors on chip
processor        multiple, simple,       few, complex,
                 heterogeneous           homogeneous
cache            one level, small        2-3 levels, extensive
memory           embedded, on chip       very large, off chip
functionality    special purpose         general purpose
interconnect     wide, high bandwidth    often through cache
power, cost      both low                both high
operation        largely stand-alone     need other chips
Processor types: overview
Processor type   Architecture / implementation approach
SIMD             single instruction applied to multiple functional units
Vector           single instruction applied to multiple pipelined registers
VLIW             multiple instructions issued each cycle under compiler control
Superscalar      multiple instructions issued each cycle under hardware control
Sequential and parallel machines
• basic single stream processors
– pipelined: overlap operations in a basic sequential machine
– superscalar: transparent concurrency
– VLIW: compiler-generated concurrency
• multiple streams, multiple functional units
– array processors
– vector processors
• multiprocessors

Pipelined processor

Instruction #1   IF ID AG DF EX WB
Instruction #2      IF ID AG DF EX WB
Instruction #3         IF ID AG DF EX WB
Instruction #4            IF ID AG DF EX WB
                 ---------------- time ---------------->

(stages: IF = instruction fetch, ID = instruction decode, AG = address generation, DF = data fetch, EX = execute, WB = write back)
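A worked timing equation (my addition, not on the slide): a k-stage pipeline completes N instructions in k + (N - 1) cycles, against N × k cycles when instructions run back-to-back without overlap, so

    \text{speedup} = \frac{Nk}{k + N - 1} \to k \quad \text{as } N \to \infty

For the 6-stage pipeline above, 4 instructions finish in 6 + 3 = 9 cycles instead of 24.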
Superscalar and VLIW processors
Instruction #1   IF ID AG DF EX WB
Instruction #2   IF ID AG DF EX WB
Instruction #3      IF ID AG DF EX WB
Instruction #4      IF ID AG DF EX WB
Instruction #5         IF ID AG DF EX WB
Instruction #6         IF ID AG DF EX WB
                 ---------------- time ---------------->

(two instructions issued each cycle)
Superscalar
– hardware for parallelism control between instructions

VLIW
– concurrency fixed by the compiler, so no parallelism-control hardware is needed
Array processors
• perform op if condition = mask
• operand can come from neighbour

instruction format:   mask  op  dest  sr1  sr2

[Figure: one instruction issued to all of n PEs (processing elements), each with its own memory and neighbour communications]
Vector processors
• vector registers, eg 8 sets x 64 elements x 64 bits
• vector instructions: VR3 = VR2 VOP VR1
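
To make the notation concrete, here is a minimal C sketch (my own, not from the slides) of what one vector instruction VR3 = VR2 VOP VR1 does, with VOP = add and the 64-element x 64-bit registers mentioned above:

    #include <stdint.h>

    #define VLEN 64                     /* elements per vector register */
    typedef int64_t vreg[VLEN];         /* one 64 x 64-bit register     */

    /* a single vector-add instruction has the effect of this whole
       loop; hardware streams the elements through a pipelined adder  */
    void vadd(vreg vr3, const vreg vr2, const vreg vr1) {
        for (int i = 0; i < VLEN; i++)
            vr3[i] = vr2[i] + vr1[i];
    }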

Memory addressing: three levels

[Figure: the three levels of addressing (virtual, system, physical); each segment contains pages for a program/process]
User view of memory: addressing
• a program: process address (offset + base + index)
– virtual address: formed from page address and process/user id
• segment table: base and bound for each process
– system address: process base + page address
• pages: active localities in main/real memory
– virtual address: translated by page table lookup to a physical address
– page miss: virtual page not in the page table
• TLB (translation look-aside buffer): holds recent translations
– each TLB entry pairs a (virtual, id) address with the corresponding real address
• a few hashed virtual address bits select a TLB entry
– if the (virtual, id) pair matches that entry, use its translation
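
A minimal C sketch (my own, not from the slides) of this TLB lookup: a few hashed virtual-address bits select an entry; if its (virtual, id) tag matches, the cached translation is used, otherwise the page table must be walked. The sizes are illustrative assumptions.

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_ENTRIES 64             /* assumed TLB size            */
    #define PAGE_BITS   12             /* assumed 4KB pages           */

    struct tlb_entry {
        uint64_t vpage;                /* virtual page number (tag)   */
        uint32_t id;                   /* process/user id (tag)       */
        uint64_t ppage;                /* cached physical page number */
        bool     valid;
    };

    struct tlb_entry tlb[TLB_ENTRIES];

    /* returns true and sets *paddr on a TLB hit */
    bool tlb_lookup(uint64_t vaddr, uint32_t id, uint64_t *paddr) {
        uint64_t vpage  = vaddr >> PAGE_BITS;
        uint64_t offset = vaddr & ((1u << PAGE_BITS) - 1);
        struct tlb_entry *e = &tlb[vpage % TLB_ENTRIES];  /* hashed bits */
        if (e->valid && e->vpage == vpage && e->id == id) {
            *paddr = (e->ppage << PAGE_BITS) | offset;    /* hit */
            return true;
        }
        return false;                  /* miss: walk the page table */
    }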
TLB and paging: address translation

[Figure: a Virtual Address is looked up in the TLB (recent translations) while the segment table (find process) supplies the process base; process base + page address gives the System Address, which the page table (find page) maps to the Physical Address]
SOC interconnect
• interconnecting multiple active agents requires
– bandwidth: capacity to transmit information (bps)
– protocol: logic for non-interfering message transmission
• bus
– AMBA (Adv. Microcontroller Bus Architecture) from ARM,
widely used for SOC
– bus performance: can determine system performance
• network on chip
– array of switches
– statically switched: eg mesh
– dynamically switched: eg crossbar
– adopted in the latest FPGAs to support AI and 5G
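
A minimal C sketch (my own, not from the slides) of how a statically switched mesh can route a message: the common XY scheme walks a packet along the x dimension first, then along y, one switch per hop. The coordinates and the route() interface are illustrative assumptions.

    #include <stdio.h>

    struct node { int x, y; };

    /* print the switches a packet visits from src to dst (XY routing) */
    static void route(struct node src, struct node dst) {
        struct node cur = src;
        while (cur.x != dst.x) {              /* travel along x first */
            cur.x += (dst.x > cur.x) ? 1 : -1;
            printf("-> switch (%d,%d)\n", cur.x, cur.y);
        }
        while (cur.y != dst.y) {              /* then along y         */
            cur.y += (dst.y > cur.y) ? 1 : -1;
            printf("-> switch (%d,%d)\n", cur.x, cur.y);
        }
    }

    int main(void) {
        route((struct node){0, 0}, (struct node){2, 1});
        return 0;
    }

Fixing the dimension order makes the route deterministic, which is one reason statically switched meshes are simple to build and free of routing deadlock.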
Adaptive Compute Acceleration
[Figure: Xilinx ACAP overview]
• compute acceleration: Scalar Engines, Adaptable Engines, AI Engines
• adaptive platform: development tools, hardware/software libraries, run-time stack
– adapts to diverse workloads in milliseconds
– future-proof for new algorithms
– software programmable silicon infrastructure
• enabling data scientists, software developers, hardware developers

Source: Xilinx
Versal Architecture (7nm technology)
• Scalar Engines: platform control; edge compute
• Adaptable Engines: 2X compute density
• AI Engines: AI compute; diverse DSP workloads
• Network-on-Chip: guaranteed bandwidth; enables software programmability
• Programmable I/O: any interface or sensor; includes 4.2Gb/s MIPI
• DDR Memory: 3200-DDR4, 4200-LPDDR4; 2X bandwidth/pin
• PCIe & CCIX: 2X PCIe & DMA bandwidth; cache-coherent interface to accelerators
• Transceivers: broad range, 25G → 112G; 58G in mainstream devices

Source: Xilinx
Platform Management Controller
Bringing the platform to life & keeping it safe & secure
• boot & configuration
– boots the platform in 10s of milliseconds (any engine first)
– 8 times faster dynamic reconfiguration
– advanced power & thermal management
• security, safety & reliability enclave
– hardware root of trust
– cryptographic acceleration, confidentiality
– enhanced diagnostics, system monitoring, anti-tamper
– error mitigation, detection, management for safety
• integrated platform interfaces & high speed debug
– integrated flash, system & debug interfaces
– high-speed non-invasive, chip-wide debug

Source: Xilinx
Introducing the AI Engine

• 1GHz+ multi-precision vector processor, software programmable
• high bandwidth extensible memory
• deterministic and efficient
• up to 400 AI Engines per device
• 8 times compute density
• 40% lower power consumption
• application domains: artificial intelligence (CNN, LSTM/MLP), signal processing, computer vision

[Figure: array of AI cores with interleaved memory blocks]

Source: Xilinx
AI Engine: tile-based architecture
• non-blocking interconnect
– up to 200+ GB/s bandwidth per tile
– connects the tile to the PS, PL and I/O
• local memory
– multi-bank implementation
– shared across neighbor cores
• ISA-based vector processor
– software programmable (e.g., C/C++)
– AI and 5G vector extensions
• data mover
– non-neighbor data communication
– integrated synchronization primitives
• cascade interface
– partial results to next core

Source: Xilinx
AI Engine: processor core

[Figure: core datapath: instruction fetch & decode unit; 32-bit scalar RISC processor (scalar register file, scalar ALU, non-linear functions); vector processor (vector register file, fixed-point vector unit, floating-point vector unit, 512-bit SIMD datapath); 3 AGUs driving load unit A, load unit B and a store unit; stream interface; local, shareable memory interface]

• 32KB local, 128KB addressable memory
• highly parallel
– instruction parallelism: VLIW, 7+ operations / clock cycle
  (2 vector loads / 1 multiply / 1 store; 2 scalar ops / stream access)
– data parallelism: SIMD, multiple vector lanes
  (vector datapath with 8 / 16 / 32-bit & SPFP operands)
• up to 128 MACs / clock cycle per core (INT8)

Source: Xilinx
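A quick sanity check on the MAC figure (my arithmetic, not from the slides): a 512-bit SIMD datapath holds 64 INT8 lanes, so 128 MACs/cycle implies two 8-bit MACs per lane per cycle; at the quoted 1 GHz and up to 400 engines, a device peaks around 400 × 128 × 10^9 ≈ 5.1 × 10^13 INT8 MACs/s.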
AI Engine: processor core
• program memory per tile
– 16KB, 128-bit wide, 1K words deep, single port
– instruction compression; ECC protection + reporting
• 32KB data memory per tile
– 8 single-port banks, 256-bit wide, 128 words deep
– 5 cycle access latency
– error detection (parity) + reporting
• independent DMA per tile, 2D strided access to north, south, east, west
• 3 AGUs: 2 load, 1 store
• 32-bit scalar RISC processor
– with 32x32 scalar multiplier
– sin/cos, square root, inverse square root
• 512-bit fixed-point vector unit
• single-precision floating-point vector unit
Source: Xilinx
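A quick consistency check on these figures (my arithmetic, not from the slides): 32KB / 8 banks = 4KB per bank, and 4KB / 32B (256-bit) words = 128 words deep; likewise 16KB / 16B (128-bit) words = 1K words of program memory.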
Multi-precision support
[Figure: supported AI data types and signal processing data types]

Source: Xilinx
Design cost: product economics
• increasingly product cost determined by
– design costs, including verification
– not marginal cost to produce
• manage complexity in die technology by
– engineering effort
– engineering cleverness
• design effort often dictated by product volume

[Figure: design time and effort vs basic physical tradeoffs; the balance point depends on n, the number of units]
Design complexity

[Figure: design complexity of processors]
Cost: product program vs engineering
[Figure: product cost broken down into fixed project costs (engineering costs: chip design, verify & test, software, CAD support; CAD programs; mask costs; capital equipment) and variable costs (labor costs; manufacturing costs; marketing, sales, administration)]
Example: two scenarios
• fixed costs Kf, support costs 0.1 × f(n), and variable costs Kv × n, so
  total cost(n) = Kf + 0.1 × f(n) + Kv × n
• design gets more complex, while production costs decrease
– Kf increases while Kv decreases
– at the same price, higher volumes are needed to break even
• compared with 1995, in 2015
– Kf increased by 10 times
– Kv decreased by the same amount
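
To see the break-even effect numerically, here is a small C sketch (my own; all numbers are illustrative, not the actual 1995/2015 figures) that solves price × n = cost(n) under the simplifying assumption that support costs are 0.1 × Kf:

    #include <stdio.h>

    int main(void) {
        double price = 20.0;                    /* assumed unit price     */
        double Kf = 1e6, Kv = 10.0;             /* 1995-like scenario     */
        for (int scenario = 0; scenario < 2; scenario++) {
            /* break even when price * n = 1.1 * Kf + Kv * n             */
            double n = 1.1 * Kf / (price - Kv);
            printf("Kf = %8.0f, Kv = %5.2f -> break-even at %.0f units\n",
                   Kf, Kv, n);
            Kf *= 10.0;                         /* 2015: Kf up 10 times   */
            Kv /= 10.0;                         /*       Kv down 10 times */
        }
        return 0;
    }

With these numbers the break-even volume grows from roughly 110,000 to roughly 580,000 units, illustrating why the same price needs much higher volume as Kf dominates.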

More recent: higher NRE
[Figure: NRE (non-recurring engineering) costs in 2015 vs 1995]
IP: Intellectual Property

[Figure: intellectual property (IP) blocks in SOC design]
Summary
• physical customisation: FPGA vs ASIC
• customisation techniques
– parametric description: pointwise, pointfree descriptions
– patterns of composition: series, parallel, chain, row, grid
– transformations: retiming, slowdown, state machines
• system-on-chip (SOC)
– processors, memory, interconnect, design costs, IP
• why exciting?
– foundation of everything else in computing: theory + practice
– Microsoft adopts FPGAs in data centres; Intel bought Altera
– you can be part of it: projects, internship, research, start-up…
Answers to Unassessed Coursework 6
1. rdl_1 R = snd [-]^{-1} ; R
   rdl_{n+1} R = snd apr_n^{-1} ; rsh ; fst (rdl_n R) ; R
2. P0 = rdl_n Pcell ; π1
   <<s,x>, a> Pcell <sx+a, x>
3. rdl_n R = row_n (R ; π2^{-1}) ; π2
   P1 = loop (row_n Pcell1 ; fst (map_n D)) ; π1
   <<s,x>, a> Pcell1 <a, <sx+a, x>>
4. loop (row_n R) = (loop R)^n
   Proof: induction on n
   (see www.doc.ic.ac.uk/~wl/papers/scp90.pdf)
   P1 = P2 ; [D,D]^{-n}
   P2 = (loop (Pcell1 ; [D,[D,D]]))^n
