
SOC architecture and design

• system-on-chip (SOC)
– processors: become components in a system
• SOC covers many topics
– processor: pipelined, superscalar, VLIW, array, vector
– storage: cache, embedded and external memory
– interconnect: buses, network-on-chip
– impact: time, area, power, reliability, configurability
– customisability: specialized processors, reconfiguration
– productivity/tools: model, explore, re-use, synthesise, verify
– examples: crypto, graphics, media, network, comm, security
– future: autonomous SOC, self-optimising/verifying design
• our focus
– overview, processor, memory
iPhone SOC
[Figure: die photo of an iPhone SOC: a 1 GHz ARM Cortex-A8 processor surrounded by memory and I/O blocks]

Source: UC Berkeley
Basic system-on-chip model

[Figure: block diagram of a basic system-on-chip model]
AMD’s Barcelona Multicore Processor
• 4 out-of-order cores
• 1.9 GHz clock rate
• 65nm technology
• 3 levels of caches: 512KB L2 per core, 2MB shared L3 cache
• integrated Northbridge

[Figure: die layout: Cores 1-4, each with its own 512KB L2; 2MB shared L3 cache; Northbridge]

http://www.techwarelabs.com/reviews/processors/barcelona/


SOC vs processors on chip
• with lots of transistors, designs move in 2 ways:
– complete system on a chip
– multi-core processors with lots of cache

                 System on chip          Processors on chip
processor        multiple, simple,       few, complex,
                 heterogeneous           homogeneous
cache            one level, small        2-3 levels, extensive
memory           embedded, on chip       very large, off chip
functionality    special purpose         general purpose
interconnect     wide, high bandwidth    often through cache
power, cost      both low                both high
operation        largely stand-alone     need other chips
Processor types: overview
Processor type   Architecture / implementation approach
SIMD             single instruction applied to multiple functional units
Vector           single instruction applied to multiple pipelined registers
VLIW             multiple instructions issued each cycle under compiler control
Superscalar      multiple instructions issued each cycle under hardware control
Sequential and parallel machines
• basic single stream processors
– pipelined: overlap operations in a basic sequential machine
– superscalar: transparent concurrency
– VLIW: compiler-generated concurrency
• multiple streams, multiple functional units
– array processors
– vector processors
• multiprocessors

Pipelined processor

Instruction #1   IF ID AG DF EX WB
Instruction #2      IF ID AG DF EX WB
Instruction #3         IF ID AG DF EX WB
Instruction #4            IF ID AG DF EX WB
                 ---------------- time ---------------->

(stages: IF = instruction fetch, ID = instruction decode, AG = address generation, DF = data fetch, EX = execute, WB = write back)
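A worked timing equation (my addition, not on the slide): a k-stage pipeline completes N instructions in k + (N - 1) cycles, against N × k cycles when instructions run back-to-back without overlap, so

    \text{speedup} = \frac{Nk}{k + N - 1} \to k \quad \text{as } N \to \infty

For the 6-stage pipeline above, 4 instructions finish in 6 + 3 = 9 cycles instead of 24.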
Superscalar and VLIW processors
Instruction #1   IF ID AG DF EX WB
Instruction #2   IF ID AG DF EX WB
Instruction #3      IF ID AG DF EX WB
Instruction #4      IF ID AG DF EX WB
Instruction #5         IF ID AG DF EX WB
Instruction #6         IF ID AG DF EX WB
                 ---------------- time ---------------->

(two instructions issued each cycle)
Superscalar
– hardware for parallelism control between instructions

VLIW
– concurrency fixed by the compiler, so no parallelism-control hardware is needed
Array processors
• perform op if condition = mask
• operand can come from neighbour

instruction format:   mask  op  dest  sr1  sr2

[Figure: one instruction issued to all of n PEs (processing elements), each with its own memory and neighbour communications]
Vector processors
• vector registers, eg 8 sets x 64 elements x 64 bits
• vector instructions: VR3 = VR2 VOP VR1
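
To make the notation concrete, here is a minimal C sketch (my own, not from the slides) of what one vector instruction VR3 = VR2 VOP VR1 does, with VOP = add and the 64-element x 64-bit registers mentioned above:

    #include <stdint.h>

    #define VLEN 64                     /* elements per vector register */
    typedef int64_t vreg[VLEN];         /* one 64 x 64-bit register     */

    /* a single vector-add instruction has the effect of this whole
       loop; hardware streams the elements through a pipelined adder  */
    void vadd(vreg vr3, const vreg vr2, const vreg vr1) {
        for (int i = 0; i < VLEN; i++)
            vr3[i] = vr2[i] + vr1[i];
    }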

Memory addressing: three levels

[Figure: the three levels of addressing (virtual, system, physical); each segment contains pages for a program/process]
User view of memory: addressing
• a program: process address (offset + base + index)
– virtual address: formed from page address and process/user id
• segment table: base and bound for each process
– system address: process base + page address
• pages: active localities in main/real memory
– virtual address: translated by page table lookup to a physical address
– page miss: virtual page not in the page table
• TLB (translation look-aside buffer): holds recent translations
– each TLB entry pairs a (virtual, id) address with the corresponding real address
• a few hashed virtual address bits select a TLB entry
– if the (virtual, id) pair matches that entry, use its translation
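
A minimal C sketch (my own, not from the slides) of this TLB lookup: a few hashed virtual-address bits select an entry; if its (virtual, id) tag matches, the cached translation is used, otherwise the page table must be walked. The sizes are illustrative assumptions.

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_ENTRIES 64             /* assumed TLB size            */
    #define PAGE_BITS   12             /* assumed 4KB pages           */

    struct tlb_entry {
        uint64_t vpage;                /* virtual page number (tag)   */
        uint32_t id;                   /* process/user id (tag)       */
        uint64_t ppage;                /* cached physical page number */
        bool     valid;
    };

    struct tlb_entry tlb[TLB_ENTRIES];

    /* returns true and sets *paddr on a TLB hit */
    bool tlb_lookup(uint64_t vaddr, uint32_t id, uint64_t *paddr) {
        uint64_t vpage  = vaddr >> PAGE_BITS;
        uint64_t offset = vaddr & ((1u << PAGE_BITS) - 1);
        struct tlb_entry *e = &tlb[vpage % TLB_ENTRIES];  /* hashed bits */
        if (e->valid && e->vpage == vpage && e->id == id) {
            *paddr = (e->ppage << PAGE_BITS) | offset;    /* hit */
            return true;
        }
        return false;                  /* miss: walk the page table */
    }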
TLB and paging: address translation

[Figure: a Virtual Address is looked up in the TLB (recent translations) while the segment table (find process) supplies the process base; process base + page address gives the System Address, which the page table (find page) maps to the Physical Address]
SOC interconnect
• interconnecting multiple active agents requires
– bandwidth: capacity to transmit information (bps)
– protocol: logic for non-interfering message transmission
• bus
– AMBA (Adv. Microcontroller Bus Architecture) from ARM,
widely used for SOC
– bus performance: can determine system performance
• network on chip
– array of switches
– statically switched: eg mesh
– dynamically switched: eg crossbar
– adopted in the latest FPGAs to support AI and 5G
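
A minimal C sketch (my own, not from the slides) of how a statically switched mesh can route a message: the common XY scheme walks a packet along the x dimension first, then along y, one switch per hop. The coordinates and the route() interface are illustrative assumptions.

    #include <stdio.h>

    struct node { int x, y; };

    /* print the switches a packet visits from src to dst (XY routing) */
    static void route(struct node src, struct node dst) {
        struct node cur = src;
        while (cur.x != dst.x) {              /* travel along x first */
            cur.x += (dst.x > cur.x) ? 1 : -1;
            printf("-> switch (%d,%d)\n", cur.x, cur.y);
        }
        while (cur.y != dst.y) {              /* then along y         */
            cur.y += (dst.y > cur.y) ? 1 : -1;
            printf("-> switch (%d,%d)\n", cur.x, cur.y);
        }
    }

    int main(void) {
        route((struct node){0, 0}, (struct node){2, 1});
        return 0;
    }

Fixing the dimension order makes the route deterministic, which is one reason statically switched meshes are simple to build and free of routing deadlock.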
Adaptive Compute Acceleration
[Figure: Xilinx ACAP overview]
• compute acceleration: Scalar Engines, Adaptable Engines, AI Engines
• adaptive platform: development tools, hardware/software libraries, run-time stack
– adapts to diverse workloads in milliseconds
– future-proof for new algorithms
– software programmable silicon infrastructure
• enabling data scientists, software developers, hardware developers

Source: Xilinx
Versal Architecture (7nm technology)
• Scalar Engines: platform control; edge compute
• Adaptable Engines: 2X compute density
• AI Engines: AI compute; diverse DSP workloads
• Network-on-Chip: guaranteed bandwidth; enables software programmability
• Programmable I/O: any interface or sensor; includes 4.2Gb/s MIPI
• DDR Memory: 3200-DDR4, 4200-LPDDR4; 2X bandwidth/pin
• PCIe & CCIX: 2X PCIe & DMA bandwidth; cache-coherent interface to accelerators
• Transceivers: broad range, 25G → 112G; 58G in mainstream devices

Source: Xilinx
Platform Management Controller
Bringing the platform to life & keeping it safe & secure
• boot & configuration
– boots the platform in 10s of milliseconds (any engine first)
– 8 times faster dynamic reconfiguration
– advanced power & thermal management
• security, safety & reliability enclave
– hardware root of trust
– cryptographic acceleration, confidentiality
– enhanced diagnostics, system monitoring, anti-tamper
– error mitigation, detection, management for safety
• integrated platform interfaces & high speed debug
– integrated flash, system & debug interfaces
– high-speed non-invasive, chip-wide debug

Source: Xilinx
Introducing the AI Engine

• 1GHz+ multi-precision vector processor, software programmable
• high bandwidth extensible memory
• deterministic and efficient
• up to 400 AI Engines per device
• 8 times compute density
• 40% lower power consumption
• application domains: artificial intelligence (CNN, LSTM/MLP), signal processing, computer vision

[Figure: array of AI cores with interleaved memory blocks]

Source: Xilinx
AI Engine: tile-based architecture
• non-blocking interconnect
– up to 200+ GB/s bandwidth per tile
– connects the tile to the PS, PL and I/O
• local memory
– multi-bank implementation
– shared across neighbor cores
• ISA-based vector processor
– software programmable (e.g., C/C++)
– AI and 5G vector extensions
• data mover
– non-neighbor data communication
– integrated synchronization primitives
• cascade interface
– partial results to next core

Source: Xilinx
AI Engine: processor core

[Figure: core datapath: instruction fetch & decode unit; 32-bit scalar RISC processor (scalar register file, scalar ALU, non-linear functions); vector processor (vector register file, fixed-point vector unit, floating-point vector unit, 512-bit SIMD datapath); 3 AGUs driving load unit A, load unit B and a store unit; stream interface; local, shareable memory interface]

• 32KB local, 128KB addressable memory
• highly parallel
– instruction parallelism: VLIW, 7+ operations / clock cycle
  (2 vector loads / 1 multiply / 1 store; 2 scalar ops / stream access)
– data parallelism: SIMD, multiple vector lanes
  (vector datapath with 8 / 16 / 32-bit & SPFP operands)
• up to 128 MACs / clock cycle per core (INT8)

Source: Xilinx
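A quick sanity check on the MAC figure (my arithmetic, not from the slides): a 512-bit SIMD datapath holds 64 INT8 lanes, so 128 MACs/cycle implies two 8-bit MACs per lane per cycle; at the quoted 1 GHz and up to 400 engines, a device peaks around 400 × 128 × 10^9 ≈ 5.1 × 10^13 INT8 MACs/s.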
AI Engine: processor core
• program memory per tile
– 16KB, 128-bit wide, 1K words deep, single port
– instruction compression; ECC protection + reporting
• 32KB data memory per tile
– 8 single-port banks, 256-bit wide, 128 words deep
– 5 cycle access latency
– error detection (parity) + reporting
• independent DMA per tile, 2D strided access to north, south, east, west
• 3 AGUs: 2 load, 1 store
• 32-bit scalar RISC processor
– with 32x32 scalar multiplier
– sin/cos, square root, inverse square root
• 512-bit fixed-point vector unit
• single-precision floating-point vector unit
Source: Xilinx
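A quick consistency check on these figures (my arithmetic, not from the slides): 32KB / 8 banks = 4KB per bank, and 4KB / 32B (256-bit) words = 128 words deep; likewise 16KB / 16B (128-bit) words = 1K words of program memory.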
Multi-precision support
[Figure: supported AI data types and signal processing data types]

Source: Xilinx
Design cost: product economics
• increasingly product cost determined by
– design costs, including verification
– not marginal cost to produce
• manage complexity in die technology by
– engineering effort
– engineering cleverness
• design effort often dictated by product volume

[Figure: design time and effort vs basic physical tradeoffs; the balance point depends on n, the number of units]
Design complexity

[Figure: design complexity of processors]
Cost: product program vs engineering
[Figure: product cost broken down into fixed project costs (engineering costs: chip design, verify & test, software, CAD support; CAD programs; mask costs; capital equipment) and variable costs (labor costs; manufacturing costs; marketing, sales, administration)]
Example: two scenarios
• fixed costs Kf, support costs 0.1 × f(n), and variable costs Kv × n, so
  total cost(n) = Kf + 0.1 × f(n) + Kv × n
• design gets more complex, while production costs decrease
– Kf increases while Kv decreases
– at the same price, higher volumes are needed to break even
• compared with 1995, in 2015
– Kf increased by 10 times
– Kv decreased by the same amount
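
To see the break-even effect numerically, here is a small C sketch (my own; all numbers are illustrative, not the actual 1995/2015 figures) that solves price × n = cost(n) under the simplifying assumption that support costs are 0.1 × Kf:

    #include <stdio.h>

    int main(void) {
        double price = 20.0;                    /* assumed unit price     */
        double Kf = 1e6, Kv = 10.0;             /* 1995-like scenario     */
        for (int scenario = 0; scenario < 2; scenario++) {
            /* break even when price * n = 1.1 * Kf + Kv * n             */
            double n = 1.1 * Kf / (price - Kv);
            printf("Kf = %8.0f, Kv = %5.2f -> break-even at %.0f units\n",
                   Kf, Kv, n);
            Kf *= 10.0;                         /* 2015: Kf up 10 times   */
            Kv /= 10.0;                         /*       Kv down 10 times */
        }
        return 0;
    }

With these numbers the break-even volume grows from roughly 110,000 to roughly 580,000 units, illustrating why the same price needs much higher volume as Kf dominates.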

More recent: higher NRE
[Figure: NRE (non-recurring engineering) costs in 2015 vs 1995]
IP: Intellectual Property

[Figure: intellectual property (IP) blocks in SOC design]
Summary
• physical customisation: FPGA vs ASIC
• customisation techniques
– parametric description: pointwise, pointfree descriptions
– patterns of composition: series, parallel, chain, row, grid
– transformations: retiming, slowdown, state machines
• system-on-chip (SOC)
– processors, memory, interconnect, design costs, IP
• why exciting?
– foundation of everything else in computing: theory + practice
– Microsoft adopts FPGAs in data centres; Intel bought Altera
– you can be part of it: projects, internship, research, start-up…
Answers to Unassessed Coursework 6
1. rdl_1 R = snd [-]^{-1} ; R
   rdl_{n+1} R = snd apr_n^{-1} ; rsh ; fst (rdl_n R) ; R
2. P0 = rdl_n Pcell ; π1
   <<s,x>, a> Pcell <sx+a, x>
3. rdl_n R = row_n (R ; π2^{-1}) ; π2
   P1 = loop (row_n Pcell1 ; fst (map_n D)) ; π1
   <<s,x>, a> Pcell1 <a, <sx+a, x>>
4. loop (row_n R) = (loop R)^n
   Proof: induction on n
   (see www.doc.ic.ac.uk/~wl/papers/scp90.pdf)
   P1 = P2 ; [D,D]^{-n}
   P2 = (loop (Pcell1 ; [D,[D,D]]))^n
