
Processor Requirements needed to optimize DSP performance

M. R. Smith, Electrical and Computer Engineering, University of Calgary, Alberta, Canada smithmr@ucalgary.ca
2000/03/05

To be tackled today

Characteristics of DSP algorithms
Specialized handling of
    Multiplication
    Division (21K has no division instruction)
ENCM515 Reference Material
    How RISCy Is DSP, IEEE Micro (Jan-10)
    Simply Signal Processing (Jan-40)
    Fast Scaling, CCI (Apr-10)
    Saturation Arithmetic (Apr-20)

2000/03/05

ENCM515 -- Characteristics needed in DSP processors Copyright smithmr@ucalgary.ca


DSP Algorithms

DSP algorithms require specialized features on processors
Processors are a compromise
    speed, cost, silicon
As a designer, when have you found a compromise that meets your requirements?
As a consultant, you may have to add DSP characteristics, or a DSP coprocessor, to an existing system

FIR

Multiply/Addition intensive
Sum operation with high precision -- overflow considerations
Long simple loop
Online operation -- infinite amount of data
Store coefficients on-chip for fast access
Complex domain arithmetic
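The multiply/accumulate character of the FIR can be sketched in C as below. This is an illustrative sketch only: the function name `fir_sample`, the 16-bit data width and the 64-bit accumulator are assumptions chosen to show why the running sum needs more bits than the operands, not a description of any particular processor.

```c
#include <stddef.h>
#include <stdint.h>

/* One FIR output sample: sum over k of coeff[k] * delay[k].
   Assumed widths: 16-bit data, 32-bit products, 64-bit accumulator,
   so the running sum has guard bits against overflow. */
int64_t fir_sample(const int16_t *coeff, const int16_t *delay, size_t ntaps)
{
    int64_t acc = 0;
    for (size_t k = 0; k < ntaps; k++) {
        /* one multiply-accumulate (MAC) per tap */
        acc += (int32_t)coeff[k] * (int32_t)delay[k];
    }
    return acc;
}
```

With four taps of 32767 times 32767 the sum already exceeds the 32-bit signed range, which is the overflow consideration noted above.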

IIR-1

Interrelated and order-dependent multiplications and additions
Small number of delays -- via register moves?
Short loop -- few instructions in the loop, which makes it difficult to optimize
Precision -- very important because of feedback
Multiple stages -- i.e. IIR follows IIR etc.

IIR-2 LDI

Short complicated loop
Many intermediate values
Pipeline issues because of interdependence


FFT

Complex variables (A and B) and fixed coefficients (W)
Address calculations complex
Memory accesses numerous
Multiplications and additions
Need for fast access to many registers, address pointers, constants, variables

Fast instruction cycle -- needed

DSP chips -- two-cycle instructions (on top of FETCH/DECODE) during which the processor performs many parallel operations
    More recent technology -- 1 clock cycle
Many processors take 6 to 32 cycles to handle MULT, FMULT, FDIV or even FADD
Make processor highly pipelined -- pipeline must be started and then kept full
    FIR (easy to pipeline)
    IIR (hard to pipeline)
    FFT (challenging to pipeline)



Loop Overhead -- must be minimized

Use specialized hardware
    specialized decrement-and-branch instructions occurring in a single cycle
    instruction cached with counter
    superscalar operations
    delayed branches
    hardware loop control
Use specialized software techniques
    loop unrolling
    down-counting loops



Memory operations -- Many of them


Data/instruction and data/data conflicts
Data caches
    Will also have external data memory banks
Harvard architecture
branch target caches
multi-ported memory
register pre-forwarding -- avoid stalls while trying to write back result of ALU operation only to re-access the same register
large register banks -- avoid memory ops associated with just-calculated values

Precision -- high but without speed loss


FIR -- accumulated value can grow big
IIR -- recursive use of a value
External Memory bus width
Internal Memory bus width
Data width of registers and ALU
Saturation arithmetic


Saturation Arithmetic

For full discussion see 21K SHARC user manual and also "Being Assertive with your processor" (APR-20)
Internal register 80 bits but external buses only 32 wide
    0xFFFF F0000001 00000000
        stored as F0000001
    0xFFFF 00000001 00000000
        stored as 00000001 (normal math)
        stored as 80000000 (saturation)
Can be good solution (FIR) or bad solution (IIR) to the problem of overflow
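The difference between "normal math" and saturation when a wide result is narrowed to 32 bits can be sketched as follows. This is a generic illustration using a 64-bit accumulator; the names `saturate32` and `wrap32` are hypothetical, and the code does not model the 80-bit register layout of the 21K.

```c
#include <stdint.h>

/* Narrow a wide accumulator to 32 bits with saturation:
   out-of-range values are clamped to the nearest representable value. */
int32_t saturate32(int64_t acc)
{
    if (acc > INT32_MAX) return INT32_MAX;   /* 0x7FFFFFFF */
    if (acc < INT32_MIN) return INT32_MIN;   /* 0x80000000 */
    return (int32_t)acc;
}

/* Narrow by keeping only the low 32 bits ("normal math").
   The final conversion assumes a two's complement target. */
int32_t wrap32(int64_t acc)
{
    return (int32_t)(uint32_t)acc;
}
```

For a large negative accumulator such as -4294967295, `wrap32` returns 1 (wrong sign and magnitude) while `saturate32` returns the most negative 32-bit value.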



Complex arithmetic -- frequency domain operations

Need to fetch real and imaginary parts at different times during the algorithm
Need fast access to adjacent memory locations -- burst memory
Need for many internal registers to temporarily store real/imaginary components (FFT butterfly and last year's exams)
Duplication of resources -- was custom, but now consider the 21160

TigerSHARC ADSP-21160 Core Architecture


[Figure: ADSP-21160 core block diagram -- program sequencer, timer, flags, JTAG test and emulation, 32 x 48 cache memory, DAG 1 and DAG 2 (8 x 4 x 32 each), PMA and DMA address buses (32 bits), PMD and DMD data buses (64 bits) with bus connect, and two identical processing elements, each with a 16 x 40 register file, a floating- and fixed-point multiplier with fixed-point accumulator, a floating-point and fixed-point ALU, and a 32-bit barrel shifter]


Address calculations -- frequent

Complex addressing modes -- take many clock cycles
Use pointers and autoincrement rather than calculating pointer + offset
    need many address-related registers
    address calculations compete with ALU calculations
    group instructions within program
        e.g. read and store often use the same or similar addresses, so don't recalculate the addresses

Specialized addressing modes


standard memory access
premodify
postmodify
circular buffers (modulo arithmetic on the address registers)
bit-reverse addressing
structure handling
auto-increment with size accounted for
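Two of the modes listed above, circular buffering and bit-reverse addressing, can be written out in software to show what dedicated address hardware does for free. This is a minimal sketch; the function names `circ_next` and `bit_reverse` are hypothetical, and the precondition on `step` is an assumption made to keep the wrap to a single subtraction.

```c
#include <stddef.h>

/* Post-modify with modulo wrap: the index that follows `index`
   after stepping by `step` in a buffer of `length` entries.
   Assumes 0 <= index < length and 0 <= step <= length. */
size_t circ_next(size_t index, size_t step, size_t length)
{
    index += step;
    if (index >= length)
        index -= length;          /* wrap back to the start of the buffer */
    return index;
}

/* Reverse the low `bits` bits of `index`, as used to reorder
   FFT input or output data. */
unsigned bit_reverse(unsigned index, unsigned bits)
{
    unsigned r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (index & 1u);   /* move lowest bit of index into r */
        index >>= 1;
    }
    return r;
}
```

In software each access costs a compare, a conditional subtract or a bit loop; a processor with these addressing modes performs the same update in parallel with the data access.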

Key issue -- ease of development


Microcontrollers -- onboard peripherals
Host communication
Multiprocessor communications
Simulators
    Multi-processor operations
Application notes
Good working environment
Compatibility with previous processor versions -- legacy code (an advantage and a disadvantage)


Multiplication Extensive algorithms

Off-chip multipliers have big bottlenecks
    Get and then give instruction to multiplier
    Get and then give first, second data to multiplier
    Wait till cooked, and then get value
Newer chips have on-board multiplication or intelligent co-processors (F-LINE exceptions)
Many chips do multiplication using specialized techniques introduced by optimizing compiler


Smart Multiplication through optimizing compiler techniques

29K RISC FMULT execution takes 6 cycles + fetch
16-bit x 16-bit INTEGER multiplication on 68K CISC takes 70 cycles regardless of operands
Use adds and shifts instead since these take less time -- easy with integer, but floats?
    What are the equivalent operations on the 21K? Discussed in an earlier lecture on Quirks and SHARCs

Smart Integer 68k Multiplication

Multiplication by 2, 4, 8, 16
    Achieved by shifting 1, 2, 3 or 4 times (done in 6 + 2n cycles on 68K)
D2 = D0 * 19
    MOVE.W D0, D2
    ASL.W #4, D2      ; D2 = D0 * 16
    ADD.W D0, D2      ; D2 = D0 * 17
    ASL.W #1, D0      ; D0 = D0 * 2
    ADD.W D0, D2      ; D2 = original D0 * 19
    (29 cycles compared to 70)
Watch out for overflow; may need conversion to 32 bits (SSI, SSF on some processors -- not only 21k)
Waste of time if have single-cycle multipliers (21k?)
Careful because multiplication results may end in special register
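The same shift-and-add sequence can be written in C to check the arithmetic. This is a sketch only: the function name `times19` is hypothetical, and the widening to 32 bits is an assumption added to sidestep the overflow warning above rather than part of the original 16-bit code.

```c
#include <stdint.h>

/* x * 19 = x * 16 + x + x * 2, following the shift/add steps one by one.
   Unsigned arithmetic is used for the shifts because left-shifting a
   negative signed value is undefined in C; the final conversion back
   to signed assumes a two's complement target. */
int32_t times19(int16_t x)
{
    uint32_t d0 = (uint32_t)(int32_t)x;   /* sign-extend to 32 bits */
    uint32_t d2 = d0 << 4;                /* D2 = D0 * 16 */
    d2 += d0;                             /* D2 = D0 * 17 */
    d0 <<= 1;                             /* D0 = D0 * 2  */
    d2 += d0;                             /* D2 = original D0 * 19 */
    return (int32_t)d2;
}
```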

Multiplication Extensive algorithms

Highly pipelined, therefore complex instruction interdependence
    R0 = R1 * R2
    R3 = R4 * R5
  BUT
    R0 = R1 * R2
    R3 = R0 * R5      <- delay dependency
Need automated tools to schedule instructions
Need multiple destinations (registers) for multiplier result
Multiply and Accumulate (MAC) instruction
    Super-scalar operations even on a simpler processor
    Cause problems in short loops
    Many types of MACs needed
Not all processors have the 21061 single-cycle multiplication operation
See "In the AM29050 a FIR-bearing animal" (FEB-80 in class notes)

Typically need Normalization of result

N point DFT
    Result = DFT (Input) ; 0 <= n < N
N point inverse DFT
    Result = IDFT (Input) / N ; 0 <= n < N
Division is typically done by the equivalent of repeated subtraction -- 150 cycles on 68K
    /* assumes Numerator >= 0 and Denom > 0 */
    result = 0;
    do {
        Numerator = Numerator - Denom;
        result++;
    } while (Numerator >= 0);   /* >= 0 so an exact multiple is still counted */
    result--;


Special shift-subtract tricks speed operations
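One common form of the shift-subtract idea is restoring division, sketched below for unsigned 16-bit operands. This is an assumed illustration of the technique, not the specific routine the lecture has in mind; the name `shift_subtract_div` and the remainder output parameter are hypothetical.

```c
#include <stdint.h>

/* Restoring (shift-subtract) division for unsigned 16-bit operands.
   One shift, one compare and at most one subtract per quotient bit,
   so the cost is fixed at 16 steps instead of growing with the quotient.
   Assumes denom != 0. */
uint16_t shift_subtract_div(uint16_t numer, uint16_t denom, uint16_t *rem_out)
{
    uint32_t rem = 0;
    uint16_t quot = 0;
    for (int bit = 15; bit >= 0; bit--) {
        rem = (rem << 1) | ((numer >> bit) & 1u);   /* bring down next bit */
        quot = (uint16_t)(quot << 1);
        if (rem >= denom) {
            rem -= denom;                           /* subtraction succeeds */
            quot = (uint16_t)(quot | 1u);           /* so this quotient bit is 1 */
        }
    }
    if (rem_out)
        *rem_out = (uint16_t)rem;
    return quot;
}
```

Repeated subtraction needs as many passes as the size of the quotient; this version always needs exactly one pass per bit.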



Smart Integer Division

Division by 2, 4, 8, 16
    unsigned    LSR #1, D0
    signed      ASR #1, D0
Need to propagate (or not propagate) the sign bit
    Unsigned    original = 0x80 (128)     final = 0x40 (64)
    Signed      original = 0x80 (-128)    final = 0xC0 (-64)
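The unsigned and signed cases of division by a power of two can be mimicked in C on 8-bit values. This is a minimal sketch with hypothetical helper names `lsr8` and `asr8`; the sign propagation is written out by hand because right-shifting a negative signed value is implementation-defined in C. It assumes 0 <= n <= 7 and a two's complement target.

```c
#include <stdint.h>

/* Unsigned divide by 2^n: logical shift right, zeros enter at the top. */
uint8_t lsr8(uint8_t value, unsigned n)
{
    return (uint8_t)(value >> n);
}

/* Signed divide by 2^n: arithmetic shift right, the sign bit is copied
   into the vacated positions. */
int8_t asr8(int8_t value, unsigned n)
{
    uint8_t bits = (uint8_t)value;
    uint8_t shifted = (uint8_t)(bits >> n);
    if (bits & 0x80u)
        shifted |= (uint8_t)(0xFFu << (8 - n));   /* fill with sign bits */
    return (int8_t)shifted;
}
```

Note that the arithmetic shift rounds toward minus infinity, so -7 shifted right by one gives -4, whereas C integer division gives -3.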

Floating Point Division

The FDIV on 29K takes 15 cycles
There is no FDIV on the 21K -- use recursion!!
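One way to divide without a divide instruction is to refine a reciprocal estimate and then multiply. The sketch below assumes that "recursion" here means such an iterative refinement and uses the Newton-Raphson update x <- x * (2 - d * x); the function names, the caller-supplied `seed` and the iteration count are all hypothetical choices for illustration.

```c
/* Approximate 1/d by Newton-Raphson refinement: x <- x * (2 - d * x).
   `seed` is a rough first guess for 1/d; each pass roughly doubles
   the number of correct bits. Assumes d != 0 and a seed close enough
   to converge (0 < d * seed < 2). */
float reciprocal(float d, float seed, int iterations)
{
    float x = seed;
    for (int i = 0; i < iterations; i++) {
        x = x * (2.0f - d * x);   /* only multiplies and a subtract */
    }
    return x;
}

/* Division built from the reciprocal: n / d = n * (1 / d). */
float divide(float n, float d, float seed, int iterations)
{
    return n * reciprocal(d, seed, iterations);
}
```

Each pass uses only multiplication and subtraction, which is why the approach suits a processor with a fast multiplier.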


Why is floating point so difficult?


Number      Internal representation
1.0         0x3F 80 00 00
32.0        0x42 00 00 00
31.98125    0x41 FF D9 9A
1023.4      0x44 7F D9 9A

31.98125 = 1023.4 / 32 = 1023.4 / 2^5



Why is floating point so difficult?

"Fast scaling Routine for Floating-point RISC and DSP processors" (APR-10)
Floating Point Format
    bit 31    bits 30-23    bits 22-0
    S         bexp          frac

Floating point number K


K = (-1)^s x 1.frac x 2^(bexp - 127)

1.0 = 0x1.0 x 2^0 = (-1)^0 x 0x1.0000 x 2^(127 - 127)


Floating point number K


K = (-1)^s x 1.frac x 2^(bexp - 127)

10.0 = 0xA.0 = %1010.0 = %1.0100 x 2^3 (0x1.4 x 2^3)
     = (-1)^0 x 0x1.4000 x 2^(130 - 127)


IEEE Std. 754, 1985


Number      Internal representation    s    bexp    frac
1.0         0x3F 80 00 00              0    0x7F    0x00 00 00
32.0        0x42 00 00 00              0    0x84    0x00 00 00
31.98125    0x41 FF D9 9A              0    0x83    0x7F D9 9A
1023.4      0x44 7F D9 9A              0    0x88    0x7F D9 9A

1.frac -- only fractional part is stored
Remember JAMES BOND helped by M (Smith)
The ONE is remembered and not stored
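The s, bexp and frac columns can be reproduced by pulling the fields out of the bit pattern. This is a minimal sketch that assumes `float` is the 32-bit IEEE 754 single-precision format; the struct and function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

struct fp_fields {
    uint32_t s;      /* sign, bit 31 */
    uint32_t bexp;   /* biased exponent, bits 30-23 */
    uint32_t frac;   /* stored fraction, bits 22-0 (the leading ONE is not stored) */
};

/* Split a single-precision value into its three fields. */
struct fp_fields fp_unpack(float k)
{
    uint32_t bits;
    memcpy(&bits, &k, sizeof bits);   /* reinterpret the float as raw bits */
    struct fp_fields f;
    f.s    = bits >> 31;
    f.bexp = (bits >> 23) & 0xFFu;
    f.frac = bits & 0x7FFFFFu;
    return f;
}
```

Unpacking 1023.4 and 31.98125 gives the same frac field and biased exponents that differ by 5, which is what makes the fast division by 32 possible.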

Fast floating pt division possible


Number      Internal representation    s    bexp    frac
1.0         0x3F 80 00 00              0    0x7F    0x00 00 00
32.0        0x42 00 00 00              0    0x84    0x00 00 00    BEXP DIFF = 5
31.98125    0x41 FF D9 9A              0    0x83    0x7F D9 9A
1023.4      0x44 7F D9 9A              0    0x88    0x7F D9 9A    BEXP DIFF = 5

K = K / -1
    -- flip the sign bit with XOR instruction
K = K / N where N = 2^p
    -- decrease bexp: bexp = bexp - 5 (p = 5)


Fast Floating Point Division by 32 Doing it

29K -- FP# K is in gr96


Setting up the power
    CONST BEXPchange, 5
Setting up the bexp-diff
    SLL BEXPchange, BEXPchange, 23
result = K / 32
    SUB result, K, BEXPchange      <- REPEATED
Note -- when processing a large array -- only the last step is needed for every number (inside the loop)
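The three steps above translate directly into integer operations on the float's bit pattern. The sketch below assumes 32-bit IEEE 754 `float`; the function name `fast_div_pow2` is hypothetical, and the routine deliberately carries no range check.

```c
#include <stdint.h>
#include <string.h>

/* Divide k by 2^p by subtracting p from the biased exponent field.
   No range check: the result is wrong for 0.0, for values too small
   for the exponent to absorb the change, and for NaN or infinity. */
float fast_div_pow2(float k, unsigned p)
{
    uint32_t bits;
    memcpy(&bits, &k, sizeof bits);
    uint32_t bexp_change = (uint32_t)p << 23;   /* set up once, outside any loop */
    bits -= bexp_change;                        /* integer subtract on the bit pattern */
    float result;
    memcpy(&result, &bits, sizeof result);
    return result;
}
```

For 1023.4 and p = 5 the subtraction turns 0x447FD99A into 0x41FFD99A, the pattern listed for 31.98125.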


Fast Floating Point Division by FP M when M is known to be 2^p

Setting up the bexp-diff
    F0 = 1.0
    R0 = R8 - R0      // NOTE integer operation
result = K / 32
    R4 = R4 - R0
Works because
    F8 = 32.0 (0x42000000)
    F0 = 1.0 (0x3F800000)
    R8 - R0 = 0x02800000, which is 5 shifted left by 23

PROBLEMS?

Try to do 0 / 32
Get a large negative number

Number              s    bexp    frac
0.0                 0    0x00    0x00 00 00
subtract            0    0x05    0x00 00 00
-2.126 * 10^37      1    0xFB    0x00 00 00

If dividing by 2^p -- problems if number is smaller than 2^(p-127)
    Must be overcome on many processors
    Non-issue on 21k, which has single-cycle multiplication -- for division, calculate the reciprocal and then multiply
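A guarded version of the exponent-subtraction trick can be sketched as below. It assumes 32-bit IEEE 754 `float`; the name `safe_div_pow2` is hypothetical, and returning 0.0 whenever the biased exponent is not larger than p is an assumed policy (it also flushes results that would otherwise be denormal). NaN and infinity are not handled.

```c
#include <stdint.h>
#include <string.h>

/* Divide k by 2^p via the exponent field, returning 0.0 when the biased
   exponent is too small to absorb the change (this covers 0.0 itself). */
float safe_div_pow2(float k, unsigned p)
{
    uint32_t bits;
    memcpy(&bits, &k, sizeof bits);
    uint32_t bexp = (bits >> 23) & 0xFFu;
    if (bexp <= p)
        return 0.0f;                 /* subtraction would borrow into the sign bit */
    bits -= (uint32_t)p << 23;       /* safe: exponent field stays positive */
    float result;
    memcpy(&result, &bits, sizeof result);
    return result;
}
```

Without the test, 0x00000000 minus 0x02800000 gives 0xFD800000, the large negative number shown in the table; the price of the guard is an extra compare on every value.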

Must guarantee result

68K, 29K, MIPS and 21k problems
    ADD.W R0, R1
    ADD gr96, gr97, gr98
Every addition (subtraction) result has the possibility of being out of range -- overflow. Must be tested.
68K solution
    ADD.W R0, R1
    BVS Somewhere      <- Test takes cycles
29K and MIPS solution
    Special instructions -- ADDU and ADDS
21k solution is what?



Specialized coding techniques
e.g. 29k has the ability of throwing SWI as part of compare (ASSERT)
Test for FP number too small from previous special Division operation

68K code
            CMP.L   #toosmall, D0
            BGE     okay
            MOVE.L  #0, D0
            BRA     continue
okay:       SUB.L   #b_exp, D0
continue:                                      <- EXTRA cycles always executed

            ASGE    TRAP#, temp, BEXPchange    <- Only compare for 29k
            SUB     gr96, gr96, BEXPchange     <- Not in a delay slot?
where
TOOSMALL:   CONST   gr96, 0
            RTI

Extra code only executed in the special case that it is needed



Specialized conditional instructions on 21k

21K -- F4 contains the FP value -- need F4/32

    R0 = 5
    R0 = ASHIFT R0 BY 23
    F1 = minimum value ( 2^(5-127) )
    F2 = ABS F4
    COMP (F2, F1)
    IF GE R4 = R4 - R0 ELSE R4 = R4 - R4      <- NO DELAY
Can't use
    ELSE R4 = 0
as this is not a compute operation but uses a 32-bit constant.



LIES -- ALL LIES


    IF GE R4 = R4 - R0 ELSE R4 = R4 - R4
This is not a legal instruction either!!
    COMPUTE instructions take 22 bits to describe
    IF JUMP/CALL ELSE R4 = R4 - R4 is allowed
Useless approach anyway, since there are better ways on 21k to do repeated division by a constant.

Processors compared

IEEE Micro Magazine Special Feature 1992
DSP
    TMS320C25, 030
    DSP56000/1, DSP96002 (Motorola)
RISC
    i860 (Intel)
    MC88100 (Motorola)
    SPARC (Sparc Consortium NOT Sun)
    Am29050
Ideal -- SMITH CRISP



CRISP -- triple pun as well

Comprehensive RISC -- Predicted 1992


Harvard architecture
MAC (rather than Super-Scalar instructions)
Ability to do X = R+S, Y = R-S operations
many registers for address/values
FP as well as integer capability
Bit-reverse addressing
Peripherals with DMA
Low power standby
High precision -- double precision
Efficient pipeline with parallel completion of many operations (dual-ported memory and register banks)

Comparisons -- 1


FIR/IIR


FFT -- Radix 2 and Radix 4


Requirements for perfect DSP

Fast instruction cycle -- different from high clock speed
Cycle time adjustable according to instruction type
Fast hardware multiplier
Floating point for easier algorithm design
High precision, implying wide data buses for memory, internal processor transfers, registers and on-board processing units


Requirements for perfect DSP

Several data buses available to reduce bus conflict transfer overhead
Harvard architecture and/or instruction cache to avoid instruction and data-fetch clashes
Duplicate resources for parallel computation of real and imaginary components of complex numbers
Dedicated hardware required for address calculations to avoid APU clash with main algorithm


Requirements for perfect DSP

Extensive temporary registers to reduce unwanted fetches of continually used data
    Or single cycle, highly parallel, memory operations
Fast and reliable, easily programmed, developed and upgraded
Inexpensive and easy to develop peripherals
High level of customer support
Inexpensive to purchase
Lower power consumption with a standby mode


Tackled today

Characteristics of DSP algorithms
Specialized handling of
    Multiplication
    Division (21K has no division instruction)
ENCM515 Reference Material
    How RISCy Is DSP, IEEE Micro (Jan-10)
    Simply Signal Processing (Jan-40)
    Fast Scaling, CCI (Apr-10)
    Saturation Arithmetic (Apr-20)
