
Vector IRAM: A Microprocessor Architecture for Media Processing

Christoforos E. Kozyrakis
kozyraki@cs.berkeley.edu
CS252 Graduate Computer Architecture February 10, 2000

Outline
Motivation for IRAM
  technology trends
  design trends
  application trends

Vector IRAM
  instruction set
  prototype architecture
  performance

2/10/2000

C.E. Kozyrakis, U.C. Berkeley

Page 2

Processor-DRAM Gap (latency)

[Figure: processor vs. DRAM performance over time, 1980 to 2000, log scale from 1 to 1000. CPU performance (Moore's Law) improves 60%/yr; DRAM performance improves 7%/yr; the processor-memory performance gap grows 50%/year.]
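
The growth rates on this slide can be cross-checked with a short calculation. The following is a minimal sketch, assuming both curves compound annually at the quoted rates (60% for processors, 7% for DRAM); the function names are illustrative and do not come from the slides.

```python
def gap_growth_per_year(cpu_rate=0.60, dram_rate=0.07):
    """Annual growth factor of the processor/DRAM performance ratio."""
    return (1.0 + cpu_rate) / (1.0 + dram_rate)


def gap_after(years, cpu_rate=0.60, dram_rate=0.07):
    """Processor/DRAM performance ratio after `years`, starting from 1.0."""
    return gap_growth_per_year(cpu_rate, dram_rate) ** years


if __name__ == "__main__":
    # 1.60 / 1.07 is roughly 1.495, i.e. the gap grows about 50% per year.
    print(f"gap grows by {100 * (gap_growth_per_year() - 1):.1f}% per year")
    for years in (5, 10, 20):
        print(f"after {years:2d} years: {gap_after(years):.0f}x")
```

Under these assumptions the ratio of the two rates, not their difference, gives the roughly 50% annual growth of the gap.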

Processor-DRAM Tax
[Figure: bar chart of transistor counts in millions (axis 0 to 30), split into logic and memory, for Intel PIII Xeon, MIPS R12000, HP PA-8500, Sun Ultra-2, PowerPC G4, IBM Power3, AMD Athlon, and Alpha 21264.]


Power Consumption
[Figure: scatter plot of performance (Spec95FP, 0 to 60) against power (W, 0 to 80) for Alpha 21264, AMD Athlon, IBM Power3, PowerPC G4, Sun Ultra-2, HP PA-8500, MIPS R12000, and Intel PIII Xeon.]


Other Design Challenges


Interconnect scaling problems
  multiple cycles to go across the chip
  difficult to achieve single cycle result forwarding
  need to add extra pipeline stages at the cost of power, complexity, branch and load-use latency

Design complexity of high-end CPUs
  4 to 5 years from scratch to chips for new superscalar architectures
  >100 engineers
  >50% of resources to design verification


Complexity Vs. Performance Gains


                          R5000      R10000        R10K/R5K
Clock Rate                200 MHz    195 MHz       1.0x
On-Chip Caches            32K/32K    32K/32K       1.0x
Instructions/Cycle        1 (+ FP)   4             4.0x
Pipe stages               5          5-7           1.2x
Model                     In-order   Out-of-order  ---
Die Size (mm^2)           84         298           3.5x
  without cache, TLB      32         205           6.3x
Development (man years)   60         300           5.0x
SPECint_base95            5.7        8.8           1.6x


Future microprocessor applications


Multimedia applications
  image/video processing, voice/pattern recognition, 3D graphics, animation, digital music, encryption, etc.
  narrow data types, streaming data, real-time requirements

Mobile and embedded environments
  notebooks, PDAs, digital cameras, cellular phones, pagers, game consoles, cars, etc.
  small devices, limited chip-count, limited power/energy budget

Significantly different environment from the desktop/workstation model


Requirements on microprocessors (1)


High performance for multimedia:
  real-time performance guarantees
  support for continuous media data-types
  exploit fine-grain parallelism
  exploit coarse-grain parallelism
  exploit high instruction reference locality
  code density
  high memory bandwidth


Average vs. real time performance ...


[Figure: distribution of performance over inputs for competing designs. Y-axis: fraction of inputs (0% to 45%); x-axis: performance, from worst case to best case.]

Which one is the best?
  by statistical average: C
  by real-time (worst-case) performance: A
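
The distinction between average and real-time performance can be sketched in a few lines. The scores below are made up purely for illustration (they are not taken from the figure); only the labels A and C and the conclusion come from the slide.

```python
from statistics import mean

# Hypothetical per-input performance scores (higher is better).
# These numbers are invented for illustration only.
scores = {
    "A": [8, 9, 9, 10, 10, 10],    # narrow spread, solid worst case
    "C": [2, 12, 14, 15, 15, 16],  # higher average, poor worst case
}


def best_by(metric):
    """Return the design whose score list maximizes the given metric."""
    return max(scores, key=lambda name: metric(scores[name]))


best_average = best_by(mean)     # what an average-case benchmark rewards
best_worst_case = best_by(min)   # what a real-time deadline requires

if __name__ == "__main__":
    print("best by statistical average:", best_average)
    print("best by worst case:", best_worst_case)
```

With these illustrative numbers, the design that wins on average is not the one that wins on worst-case behavior, which is the point the slide makes about real-time workloads.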

Requirements on microprocessors (2)


Low power and energy consumption
  energy efficiency for long battery life
  power efficiency for system cost reduction (cooling system, packaging, etc.)

Design scalability
  performance scalability
  physical design scalability
  design complexity, verification complexity
  immunity to interconnect scaling problems
  locality of interconnect, tolerance to latency

System-on-a-chip (SoC)
  highly integrated system
  low system chip-count

The IRAM vision statement


Microprocessor & DRAM on a single chip:
  on-chip memory latency 5-10X, bandwidth 50-100X
  improve energy efficiency 2X-4X (no off-chip bus)
  serial I/O 5-10X vs. buses
  smaller board area/volume
  adjustable memory size/width

[Figure: a conventional system (processor with caches and L2 cache built in a logic fab, connected over buses to separate DRAM and I/O) compared with IRAM (processor, DRAM, and I/O on one chip built in a DRAM fab).]

Vector IRAM
Vector processing
  high-performance for media processing
  low power/energy for processor control
  modularity, low complexity
  scalability
  well understood software development

Embedded DRAM
  high bandwidth for vector processing
  low power/energy for memory accesses
  modularity, scalability
  small system size


IRAM ISA summary


Full vector instruction set with
  32 vector registers, 32 vector flag registers
  support for multiple data types (64b, 32b, 16b, 8b)
  support for strided and indexed memory accesses
  support for auto-increment addressing
  support for DSP operations (multiply-add, saturation, etc.)
  support for conditional execution
  support for software speculation
  support for fast reductions and butterfly permutations
  support for virtual memory
  restartable arithmetic (FP & integer) exceptions

Implemented as a coprocessor extension to MIPS64 ISA (coprocessor 2)
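
Three of the features listed above (strided access, indexed access, and conditional execution under a flag register) can be modeled in plain Python. This is a behavioral sketch only: the function names are hypothetical, addresses are counted in elements rather than bytes, and nothing here reflects the actual instruction encodings or mnemonics.

```python
def vload_strided(mem, base, stride, vl):
    """Strided load: element i comes from mem[base + i * stride]."""
    return [mem[base + i * stride] for i in range(vl)]


def vload_indexed(mem, base, index_vec):
    """Indexed (gather) load: element i comes from mem[base + index_vec[i]]."""
    return [mem[base + idx] for idx in index_vec]


def vadd_masked(dst, a, b, flags):
    """Conditional execution: write a[i] + b[i] only where flags[i] is set."""
    return [x + y if f else d for d, x, y, f in zip(dst, a, b, flags)]


mem = list(range(100, 132))                                # toy memory
col = vload_strided(mem, base=0, stride=4, vl=4)           # every 4th element
gat = vload_indexed(mem, base=8, index_vec=[3, 0, 5, 1])   # gather
flags = [v > 110 for v in gat]                             # 1-bit flag per element
res = vadd_masked([0] * 4, col, gat, flags)                # masked add

if __name__ == "__main__":
    print(col, gat, flags, res)
```

Elements whose flag bit is clear keep their old destination value, which is how a vector machine can execute the body of an if-statement without branching.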



Vector architectural state


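[Figure: vector architectural state, organized as $vlr virtual processors (VP0, VP1, ..., VP$vlr-1).]

General purpose registers (32): vr0, vr1, ..., vr31; element width $vpw
Flag registers (32): vf0, vf1, ..., vf31; 1b per element
Scalar registers: vs0, vs1, ..., vs31; 64b
Control registers: vcr0, vcr1, ..., vcr31; 64b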


Fixed-point Multiply-add
[Figure: two-step datapath. Step 1 (multiply, shift right, round): x and y, each n/2 bits wide, are multiplied; the n-bit product is shifted right and rounded. Step 2 (add, saturate): the result is added to z (n bits) and saturated to n bits.]

Multiply halves & shift instruction provides support for any fixed-point format
Precision is equal to the datatype width; multiplier inputs have half the width
Uniform, simple support for all datatypes
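
The multiply, shift, round, add, saturate sequence described above can be sketched as follows. This is an illustrative model, not the VIRAM definition: it assumes signed two's-complement operands, round-to-nearest implemented by adding half of the discarded range before shifting, and a caller-supplied shift amount. The names `fixed_madd` and `saturate` are hypothetical.

```python
def saturate(value, n):
    """Clamp value to the signed n-bit range."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    return max(lo, min(hi, value))


def fixed_madd(x, y, z, n=16, shift=0):
    """Compute sat(z + round((x * y) >> shift)) for an n-bit datatype.

    x and y must fit in n/2 signed bits, so their product fits in n bits.
    Rounding mode (add half, then shift) is an assumption of this sketch.
    """
    half = n // 2
    lo, hi = -(1 << (half - 1)), (1 << (half - 1)) - 1
    if not (lo <= x <= hi and lo <= y <= hi):
        raise ValueError("multiplier inputs must fit in n/2 signed bits")
    product = x * y
    if shift > 0:
        product = (product + (1 << (shift - 1))) >> shift
    return saturate(z + product, n)


if __name__ == "__main__":
    print(fixed_madd(100, 100, 0, n=16, shift=4))     # scaled product
    print(fixed_madd(127, 127, 32767, n=16, shift=0))  # saturates high
```

Because the shift amount is a parameter, the same operation serves any fixed-point format: the caller picks the shift that matches the position of the binary point.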

VIRAM-1 prototype


Design Overview
64b MIPS scalar core
  coprocessor interface
  16KB I/D caches

Memory system
  8 2MByte eDRAM banks
  single sub-bank per bank
  256-bit synchronous interface, separate I/O signals
  20ns cycle time, 6.6ns column access
  crossbar interconnect for 12.8 GB/sec per direction
  no caches

Vector unit
  8KByte vector register file
  support for 64b, 32b, and 16b data-types
  2 arithmetic (1 FP), 2 flag processing, 1 load-store units
  4 64-bit datapaths per unit
  DRAM latency included in vector pipeline
  4 addresses/cycle for strided/indexed accesses
  2-level TLB

Network interface
  user-level message passing
  dedicated DMA engines
  4 100MByte/s links

Vector Unit Pipeline Structure


Single-issue, in-order pipeline
  each instruction can specify up to 128 operations and occupy a functional unit for 8 cycles

DRAM latency is included in the execution pipeline (delayed pipeline)
  deep pipeline design, but no caches needed to avoid stalls
  worst case DRAM latency does not cause pipeline stalls

Address decoupling buffer
  buffers memory addresses in the presence of conflicts (indexed/strided accesses)
  memory conflicts do not stall pipeline
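
The figures of 128 operations and 8 cycles are consistent with the register file and datapath sizes quoted in the design overview. The sketch below checks that arithmetic; it assumes the 8KByte register file is divided evenly among the 32 vector registers and that narrow elements are packed side by side into each 64-bit datapath. Both assumptions are mine, and the function names are illustrative.

```python
REGISTER_FILE_BYTES = 8 * 1024   # "8KByte vector register file"
NUM_VECTOR_REGISTERS = 32        # from the ISA summary
DATAPATHS_PER_UNIT = 4           # "4 64-bit datapaths per unit"
DATAPATH_BITS = 64


def max_vector_length(element_bits):
    """Elements per vector register, assuming an even split of the file."""
    bits_per_register = REGISTER_FILE_BYTES * 8 // NUM_VECTOR_REGISTERS
    return bits_per_register // element_bits


def occupancy_cycles(element_bits):
    """Cycles one full-length instruction occupies a functional unit,
    assuming narrow elements are packed into the 64-bit datapaths."""
    elements_per_cycle = DATAPATHS_PER_UNIT * (DATAPATH_BITS // element_bits)
    return max_vector_length(element_bits) // elements_per_cycle


if __name__ == "__main__":
    for bits in (64, 32, 16):
        print(bits, max_vector_length(bits), occupancy_cycles(bits))
```

Under these assumptions a 16-bit instruction covers 128 elements, and every supported element width keeps a functional unit busy for 8 cycles.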


Non-Delayed Pipeline
[Figure: non-delayed pipeline. Scalar stages F D X M W issue the sequence vld, vadd, vst, vld, vadd, vst, ... to the VLOAD, VALU (VR X1 X2 ... XN VW), and VSTORE pipelines. DRAM latency: >=20ns. Long load -> ALU RAW hazard.]

Load -> ALU exposes full DRAM latency (long)

Tolerating Memory Latency: Delayed Pipeline

[Figure: delayed pipeline. Scalar stages F D X M W issue the sequence vld, vadd, vst, vld, vadd, vst, ... to the VLOAD, VALU (VR X1 ... XN VW), and VSTORE pipelines, with DELAY stages added to the VALU and VSTORE pipelines. DRAM latency: >20ns. Load -> ALU RAW hazard.]

Load -> ALU sees functional unit latency (short)
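
The effect of the delay stages on a load followed by a dependent arithmetic instruction can be illustrated with a deliberately simple timing model. It assumes a 200 MHz clock and a 20 ns DRAM access (the lower bound quoted on the slides), and it ignores chaining, bank conflicts, and the individual pipeline stages; the function names are hypothetical.

```python
import math


def dram_latency_cycles(latency_ns=20.0, clock_mhz=200.0):
    """DRAM access latency expressed in whole clock cycles."""
    cycle_ns = 1000.0 / clock_mhz
    return math.ceil(latency_ns / cycle_ns)


def load_use_stall(mem_cycles, alu_delay_stages):
    """Cycles a dependent vadd stalls behind a vld.

    Delaying the ALU pipeline by `alu_delay_stages` hides that many
    cycles of memory latency from the load -> ALU dependence.
    """
    return max(0, mem_cycles - alu_delay_stages)


if __name__ == "__main__":
    mem = dram_latency_cycles()
    print("non-delayed stall:", load_use_stall(mem, alu_delay_stages=0))
    print("delayed stall:", load_use_stall(mem, alu_delay_stages=mem))
```

In this model the non-delayed pipeline exposes the full memory latency to the dependent instruction, while a delay matched to the memory latency removes the stall at the cost of a deeper pipeline.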

Clustered VLSI Design


[Figure: clustered layout with a 64b control block and four identical vector lanes, 256b in total. Each lane contains: Xbar I/F, Integer Datapath 0, Vector Registers, Flag Regs. & Datapath, FP Datapaths, Integer Datapath 1, Xbar I/F.]

VIRAM-1 Floorplan
[Figure: floorplan. DRAM banks 0, 2, 4, and 6 along one edge and DRAM banks 1, 3, 5, and 7 along the opposite edge; between them, the network interface (NI), the MIPS core, vector lanes 0 to 3, and the control block (CTL).]


Prototype Summary
Technology:
  0.18um eDRAM CMOS process (IBM)
  6 layers of copper interconnect
  1.2V and 1.8V power supply

Memory: 16 MBytes
Clock frequency: 200MHz
Power: 2 W for vector unit and memory
Transistor count: ~140 million
Peak performance:
  GOPS with multiply-add: 3.2 (64b), 6.4 (32b), 12.8 (16b)
  GOPS without multiply-add: 1.6 (64b), 3.2 (32b), 6.4 (16b)
  GFLOPS: 1.6 (32b)
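
The peak figures above can be reproduced from the clock rate and the datapath counts given in the design overview. The sketch below is a reconstruction under my own assumptions: 2 arithmetic units (1 of them FP-capable), 4 64-bit datapaths per unit, narrow elements packed into each datapath, and a multiply-add counted as two operations. The function name is illustrative.

```python
CLOCK_HZ = 200e6       # "Clock frequency: 200MHz"
LANES = 4              # "4 64-bit datapaths per unit"
DATAPATH_BITS = 64
ARITH_UNITS = 2        # "2 arithmetic (1 FP)"
FP_UNITS = 1


def peak_gops(element_bits, units=ARITH_UNITS, multiply_add=False):
    """Peak giga-operations per second for a given element width."""
    ops_per_cycle = units * LANES * (DATAPATH_BITS // element_bits)
    if multiply_add:
        ops_per_cycle *= 2   # assumption: multiply-add counts as 2 ops
    return ops_per_cycle * CLOCK_HZ / 1e9


if __name__ == "__main__":
    for bits in (64, 32, 16):
        print(bits, peak_gops(bits), peak_gops(bits, multiply_add=True))
    print("GFLOPS (32b):", peak_gops(32, units=FP_UNITS))
```

With these assumptions the model matches every peak number on the slide, including the single FP unit giving 1.6 GFLOPS at 32 bits.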


Kernels Performance
Kernel                Peak Perf.    Sustained Perf.   % of Peak
Image Composition     6.4 GOPS      6.40 GOPS         100.0%
iDCT                  6.4 GOPS      1.97 GOPS         30.7%
Color Conversion      3.2 GOPS      3.07 GOPS         96.0%
Image Convolution     3.2 GOPS      3.16 GOPS         98.7%
Integer MV Multiply   3.2 GOPS      2.77 GOPS         86.5%
Integer VM Multiply   3.2 GOPS      3.00 GOPS         93.7%
FP MV Multiply        1.6 GFLOPS    1.40 GFLOPS       87.5%
FP VM Multiply        1.6 GFLOPS    1.59 GFLOPS       99.6%
AVERAGE                                               86.6%

Note: simulations did not include memory optimizations (address decoupling, small strides optimizations, address hashing), or fixed-point multiply-add integer datapaths
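
The efficiency column can be recomputed from the peak and sustained figures. A minimal sketch using only the numbers from this slide; because the sustained figures are themselves rounded, a few recomputed percentages differ from the slide in the last digit.

```python
# (kernel, peak, sustained) in GOPS or GFLOPS, as listed on the slide.
kernels = [
    ("Image Composition", 6.4, 6.40),
    ("iDCT", 6.4, 1.97),
    ("Color Conversion", 3.2, 3.07),
    ("Image Convolution", 3.2, 3.16),
    ("Integer MV Multiply", 3.2, 2.77),
    ("Integer VM Multiply", 3.2, 3.00),
    ("FP MV Multiply", 1.6, 1.40),
    ("FP VM Multiply", 1.6, 1.59),
]


def percent_of_peak(peak, sustained):
    """Sustained performance as a percentage of peak."""
    return 100.0 * sustained / peak


percentages = {name: percent_of_peak(p, s) for name, p, s in kernels}
average = sum(percentages.values()) / len(percentages)

if __name__ == "__main__":
    for name, pct in percentages.items():
        print(f"{name:22s}{pct:6.1f}%")
    print(f"{'AVERAGE':22s}{average:6.1f}%")
```

The average here is the unweighted mean of the per-kernel percentages, which is the reading that reproduces the slide's 86.6%.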

Comparisons
Kernels: Image Composition, iDCT, Color Conversion, Image Convolution

VIRAM:      0.13   1.18   0.78   5.49
MMX:        3.75 (3.2x)    8.00 (10.2x)   5.49 (4.5x)
VIS:        2.22 (17.0x)   6.19 (5.1x)
TMS320C82:  5.70 (7.6x)    6.50 (5.3x)

All numbers in cycles/pixel

MMX, VIS, and TMS results assume all data in L1 cache


FFT Performance
[Figure: FFT execution time in microseconds (0 to 200) against size in number of FFT points (128, 256, 512, 1024), with one series for fixed point (16 bit) and one for floating point (32 bit).]

Labeled data points:
  Pentium/200: 151 us
  TMS320C67x: 124 us
  PPC604e: 87 us
  TigerSHARC: 41 us
  VIRAM: 37 us
  CRI Pulsar: 27.9 us
  Wildstar: 25 us
  CRI Pathfinder-1: 22.3 us

Note: simulations performed with unscheduled fixed-point code

Motion Estimation Performance

Size             VIRAM-1 (cycles)       MMX (cycles)
QCIF (176x144)   7.1 x 10^6 (4.6x)      3.3 x 10^7
CIF (352x288)    2.8 x 10^7 (5.0x)      1.4 x 10^8

Note: MMX results assume all data in L1 cache


Overall Performance of H.263

Sequence   Bit rate        Encoding speed
Akiyo      12.95 kbit/s    23.5 fps
Mom        16.25 kbit/s    22.7 fps
Hall       20.47 kbit/s    22.7 fps
Foreman    65.52 kbit/s    20.9 fps

Average encoding speed for H.263 on VIRAM for standard MPEG test sequences, using exhaustive search for motion estimation and LLM for DCT.

Note: simulations did not include memory optimizations (address decoupling, small strides optimizations, address hashing), or fixed-point multiply-add integer datapaths


Summary Class Project Suggestions


Architecture comparisons & applications
  information retrieval
  signal processing apps
  neural nets training

Multimedia application analysis
  operand reuse patterns
  branch behavior
  data/value locality and memory access patterns

Low power/energy architectures
  energy-exposed ISA design
  compilation for low energy
  speculation use for power reduction
