
CODE OPTIMIZATION

I. General methodology

Start with C code.

Compile with the -g option and no optimization. (If using 'C67x floating-point
operations, select the -mv6710 option in the compiler so that it uses the
floating-point hardware of the DSP.) If performance is satisfactory, do not continue.

Compile with -g and the -o0 or -o1 optimization option. Check performance.

Compile with -g and the -o3 and -pm optimization options. Use -pm when you have no
ASM code (there are ways to use -pm with ASM code, but we are not going into that level of detail).

(-g: enables source-level symbolic debugging; -pm: combines all C source files before
compiling, i.e. program-level optimization; -gs: interlists C statements into the assembly
listing)

Modify the C code.

Use intrinsics, pragmas, word-wide optimization, loop unrolling (to remove branches),
and compiler feedback.
Continue to use the compiler optimization levels.

Linear ASM

Use profiling to identify the functions or sections that need further optimization.


Write these sections in linear asm.

Use word-wide optimization.

ASM

Write time-critical sections in asm.

Use word-wide optimization, parallel instructions, removal of NOPs.

Hand-optimize using a dependency graph and a scheduling table.

II. C-code modifications

A. Intrinsics: special functions that map directly to inlined 'C67x instructions. The compiler
inserts the corresponding assembly language statements directly into the program.

Intrinsics are specified with a leading underscore ( _ ) and are treated like a function.

Ex: int c;
    short a, b;
    c = _mpy(a, b);

This is equivalent to c = a*b, BUT it is done at the assembly language level with a single MPY instruction.
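As a further illustration (a minimal sketch only; the function name sat_dot is ours, and it assumes
the standard 'C6000 _sadd intrinsic, which maps to the saturating SADD instruction), an intrinsic
can also be used inside a loop so that the accumulator saturates instead of wrapping around on
overflow:

int sat_dot(short *x, short *y, int n)      /* illustrative sketch, not from the original example */
{
    int sum = 0, i;
    for (i = 0; i < n; i++)
        sum = _sadd(sum, _mpy(x[i], y[i])); /* saturating accumulation of the 16x16 products */
    return sum;
}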

in CCS: Help > Contents > TMS320C6000 Programmer's Guide > Optimizing C/C++ Code >
Refining C/C++ Code > Using Intrinsics

B. Pragma directives provide information to the compiler or tell it to do something.

#pragma MUST_ITERATE (min, max, factor)


Write the pragma in the C code immediately before the loop. It tells the compiler the minimum
and maximum trip counts and that the trip count is a multiple of the given factor, which helps in
loop optimization.
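For example (a minimal sketch; the function name vector_add and the bounds 8/256/4 are ours,
chosen only for illustration), the pragma is placed directly before the loop it describes:

void vector_add(short *a, short *b, short *c, int n)
{
    int i;
    /* loop runs at least 8 and at most 256 times; trip count is a multiple of 4 */
    #pragma MUST_ITERATE(8, 256, 4)
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}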

C. Word-wide optimization: if you have 16-bit (short) data, access two values at once and then
work with 32-bit data. This is also called packed data processing.

1. Use casts and intrinsics

int dot_product(short *x, short *y, short z)   // x, y point to short variables
{
    int *w_x = (int *)x;                       // cast to point to a word
    int *w_y = (int *)y;
    int sum1 = 0, sum2 = 0, i;
    for (i = 0; i < z/2; i++)
    {
        sum1 += _mpy(w_x[i], w_y[i]);          // mpy 16 LSBs
        sum2 += _mpyh(w_x[i], w_y[i]);         // mpy 16 MSBs
    }
    return (sum1 + sum2);
}

This is equivalent to summing the products of the even-indexed elements (i = 0, 2, 4, ...),
separately summing the products of the odd-indexed elements, and then adding the two sums.
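In plain C, the packed version above computes the same result as the following reference sketch
(the name dot_product_ref is ours; the even/odd split assumes z is even and the little-endian
layout of the 'C6x, where the 16 LSBs of each loaded word hold the lower-addressed element):

int dot_product_ref(short *x, short *y, short z)
{
    int sum1 = 0, sum2 = 0, i;
    for (i = 0; i < z; i += 2)
    {
        sum1 += x[i] * y[i];         /* even-indexed products (16 LSBs of each word) */
        sum2 += x[i + 1] * y[i + 1]; /* odd-indexed products (16 MSBs of each word)  */
    }
    return sum1 + sum2;
}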

2. Use the _nassert intrinsic to specify that 16-bit short arrays are aligned on a 32-bit (word)
boundary.

int dot_product(short *x, short *y, short z)
{
    int sum = 0, i;

    _nassert(((int)(x) & 0x3) == 0);
    _nassert(((int)(y) & 0x3) == 0);

    #pragma MUST_ITERATE(20, , 4); // must be an even number of times in order to split in half for optimization
    for (i = 0; i < z; i++) sum += x[i] * y[i];

    return sum;
}

_nassert tells the compiler that the pointers x and y are on a word boundary (the 2 LSBs of the
address are 0). The compiler then knows that it can optimize the following loop with MPY and
MPYH.

_nassert is useful in that you do not need to rewrite code – you simply add the _nassert lines.

D. Unrolling the Loop – a technique in which you expand small loops so that more iterations of the
loop appear in your code. The purpose is to let more instructions operate in parallel (a short
unrolled-by-2 sketch follows the MUST_ITERATE example below).

Compiler will do this automatically if it knows:


1) The trip count
2) The trip count is a multiple of 2

Can do this through MUST_ITERATE:


#pragma MUST_ITERATE (10, ,2);
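As a sketch of what the transformation produces (illustrative only; the name dot_unrolled is ours,
and it assumes an even trip count), a dot-product loop unrolled by a factor of 2 looks like this:

int dot_unrolled(short *x, short *y, int n)  /* n assumed even */
{
    int sum = 0, i;
    for (i = 0; i < n; i += 2)
    {
        sum += x[i] * y[i];         /* original iteration i     */
        sum += x[i + 1] * y[i + 1]; /* original iteration i + 1 */
    }
    return sum;
}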

E. Compiler Feedback – when you compile, the compiler can report information that helps you
decide what to optimize and how to optimize it.

III. Linear ASM Optimization – Word-wide optimization


A. Example: Load word and operate on 16-bit components/parts, dot product

        .global _dotp

_dotp   .cproc  x, y
        .reg    sum, sum0, sum1, cntr
        .reg    a, b, p, p1

        MVK     50,cntr         ; cntr = 100/2
        ZERO    sum0            ; multiply result = 0
        ZERO    sum1            ; multiply result = 0

LOOP:   .trip   50
        LDW     *x++,a          ; load a & a+1 from memory
        LDW     *y++,b          ; load b & b+1 from memory
        MPY     a,b,p           ; a * b
        MPYH    a,b,p1          ; a+1 * b+1
        ADD     p,sum0,sum0     ; sum0 += (a * b)
        ADD     p1,sum1,sum1    ; sum1 += (a+1 * b+1)
[cntr]  SUB     cntr,1,cntr     ; decrement loop counter
[cntr]  B       LOOP            ; branch to loop

        ADD     sum0,sum1,sum   ; compute final result

        .return sum
        .endproc

B. Example: Load double word and work on 32-bit words for floating point dot product

        .global _dotp

_dotp   .cproc  x, y
        .reg    sum, sum0, sum1, cntr
        .reg    a0:a1, b0:b1, p, p1     ; note the register-pair format for a double word

        MVK     50,cntr         ; cntr = 100/2
        ZERO    sum0            ; multiply result = 0
        ZERO    sum1            ; multiply result = 0

LOOP:   .trip   50
        LDDW    *x++,a0:a1      ; load a & a+1 from memory
        LDDW    *y++,b0:b1      ; load b & b+1 from memory
        MPYSP   a0,b0,p         ; a * b
        MPYSP   a1,b1,p1        ; a+1 * b+1
        ADDSP   p,sum0,sum0     ; sum0 += (a * b)
        ADDSP   p1,sum1,sum1    ; sum1 += (a+1 * b+1)
[cntr]  SUB     cntr,1,cntr     ; decrement loop counter
[cntr]  B       LOOP            ; branch to loop

        ADDSP   sum0,sum1,sum   ; compute final result

        .return sum
        .endproc

IV. Assembly language code


A. Word-wide optimization, parallel instructions, and reduction of NOPs

Consider the fixed-point dot product example with no optimization

        MVK   .S1   100, A1     ; set up loop counter
        ZERO  .L1   A7          ; zero out accumulator
LOOP
        LDH   .D1   *A4++,A2    ; load ai from memory
        LDH   .D1   *A3++,A5    ; load bi from memory
        NOP   4                 ; delay slots for LDH
        MPY   .M1   A2,A5,A6    ; ai * bi
        NOP                     ; delay slot for MPY
        ADD   .L1   A6,A7,A7    ; sum += (ai * bi)
        SUB   .S1   A1,1,A1     ; decrement loop counter
[A1]    B     .S2   LOOP        ; branch to loop
        NOP   5                 ; delay slots for branch
        ; Branch occurs here

This takes 16 cycles for each iteration. Since there are 100 numbers in each array, this
results in 1600 cycles + 2 cycles for the MVK and ZERO.

Optimize with parallel instructions and by reducing NOPs:

If we use both .D units for the loads, the two loads can be issued in parallel.
Also, the SUB and B do not depend on the other instructions. By moving them up, we can
remove the NOP 5 line and reduce the number of cycles. The B instruction should occur no
earlier than 5 cycles before the ADD.

        MVK   .S1   100, A1     ; set up loop counter
||      ZERO  .L1   A7          ; zero out accumulator
LOOP
        LDH   .D1   *A4++,A2    ; load ai from memory
||      LDH   .D2   *B4++,B2    ; load bi from memory
        SUB   .S1   A1,1,A1     ; decrement loop counter
[A1]    B     .S2   LOOP        ; branch to loop
        NOP   2                 ; delay slots for LDH
        MPY   .M1X  A2,B2,A6    ; ai * bi (the X denotes that a cross path is used)
        NOP                     ; delay slot for MPY
        ADD   .L1   A6,A7,A7    ; sum += (ai * bi)
        ; Branch occurs here

This loop requires 8 cycles per iteration. For 100 numbers in each array, this function
requires 8*100 + 1 (for initialization) cycles.
We've gone from 1602 to 801 cycles for the function.

Let's do word-wide optimization now:


The loop in linear ASM is:

        LDW    *x++,a           ; load a & a+1 from memory
        LDW    *y++,b           ; load b & b+1 from memory
        MPY    a,b,p            ; a * b
        MPYH   a,b,p1           ; a+1 * b+1
        ADD    p,sum0,sum0      ; sum0 += (a * b)
        ADD    p1,sum1,sum1     ; sum1 += (a+1 * b+1)
[cntr]  SUB    cntr,1,cntr      ; decrement loop counter
[cntr]  B      LOOP             ; branch to loop

So the optimized code is now:

        MVK   .S1   50,A1       ; set up loop counter
||      ZERO  .L1   A7          ; zero out sum0 accumulator
||      ZERO  .L2   B7          ; zero out sum1 accumulator
LOOP
        LDW   .D1   *A4++,A2    ; load a & a+1 from memory
||      LDW   .D2   *B4++,B2    ; load b & b+1 from memory
        SUB   .S1   A1,1,A1     ; decrement loop counter
[A1]    B     .S1   LOOP        ; branch to loop
        NOP   2
        MPY   .M1X  A2,B2,A6    ; a * b (note the same registers can be used)
||      MPYH  .M2X  A2,B2,B6    ; a+1 * b+1
        NOP
        ADD   .L1   A6,A7,A7    ; sum0 += (a * b)
||      ADD   .L2   B6,B7,B7    ; sum1 += (a+1 * b+1)
        ; Branch occurs here

This loop takes 8 cycles per iteration, but now we only have 50 iterations. The total cycle count
is then 8*50 + 2 (for initialization).
We've gone from 1602 > 801 > 402 cycles.

B. Software Pipelining - a technique used to schedule instructions from a loop so that multiple
iterations execute in parallel. The parallel resources on the C6x make it possible to initiate a
new loop iteration before previous iterations finish. The goal of software pipelining is to start a
new loop iteration as soon as possible. (This works for linear assembly or hand-written assembly.)

1. Development of the idea (ignoring NOPs for simplicity)


Consider:

loop:   ldh
     || ldh
        mpy
        add

Cycle  .D1   .D2   .M1   .M2   .L1   .L2   .S1   .S2

  1    ldh   ldh
  2                mpy
  3                            add
  4    ldh   ldh
  5                mpy
  6                            add
  7    ldh   ldh
  8                mpy
  9                            add

We can see that many of the resources are not used. With pipelining, we can have the following:

Cycle  .D1   .D2   .M1   .M2   .L1   .L2   .S1   .S2

  1    ldh   ldh
  2    ldh   ldh   mpy
  3    ldh   ldh   mpy         add
  4    ldh   ldh   mpy         add
  5    ldh   ldh   mpy         add
  6                mpy         add
  7                            add

There are three stages to pipelined code: the prolog, the loop kernel (cycles 3-5 above), and the
epilog. The prolog consists of the instructions needed to build up to the loop kernel; the epilog
consists of the instructions needed to complete all remaining loop iterations. When a single-cycle
loop kernel is set up, the entire loop body is executed in one cycle as one parallel instruction
using the maximum number of functional units.

2. Method - three steps are needed to produce hand-coded software-pipelined code from
an assembly loop:
1) drawing a dependency graph
2) setting up a scheduling table
3) deriving the pipelined code from the table

3. Dependency Graph

Let’s consider the linear assembly program:


loop    .trip 10                ; exactly 10 iterations through the loop
        LDH   *x++,a            ; pointer into x array -> a
        LDH   *y++,b            ; put an element of y into b
        MPY   a,b,prod          ; prod = x * y
        ADD   prod,sum,sum      ; sum of products -> sum
        SUB   count,1,count     ; decrement counter
[count] B     loop

First, draw a node and a path for each instruction and write the number of cycles needed to
complete the instruction. Each circle corresponds to an instruction; inside the circle is the
variable written to by the instruction (its dst).

[Dependency graph: LDH -> a (5 cycles) and LDH -> b (5 cycles) both feed MPY -> prod (2 cycles),
which feeds ADD -> sum (1 cycle); SUB -> count (1 cycle) feeds B -> loop (6 cycles).]

Second, assign a functional unit to each instruction and separate the A and B data paths
so that the maximum number of units is used and the workload between the two paths is as
equal as possible.
[Dependency graph with units assigned: a (LDH, .D1, 5 cycles) and b (LDH, .D2, 5 cycles) ->
prod (MPY, .M1, 2 cycles) -> sum (ADD, .L1, 1 cycle); count (SUB, .L2, 1 cycle) -> loop (B, .S2, 6 cycles).]

4. Scheduling Table

Identify the longest path. This determines how long the table will be.

For our example, the longest path is 8 (LDH 5 + MPY 2 + ADD 1). This means that 7 prolog
columns are needed before entering the loop kernel.

The longest path starts with LDH, so we start with that in cycle 1 and repeat it in every cycle
through the loop-kernel column (cycle 8). The MPY must come 5 cycles after the loads (1 cycle +
4 delay slots), so the first MPY is in cycle 6; we need 8 of these for the 8 loads. The ADD must
come 2 cycles after the MPY (1 cycle + 1 delay slot), so the first ADD is in cycle 8; there should
be 8 of these as well.

The branch is scheduled by counting back 5 cycles from the loop kernel (cycle 8), so the first B
is in cycle 3. The SUB must come 1 cycle before the branch, so the first SUB is in cycle 2.

Cycle  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
.D1    LD  LD  LD  LD  LD  LD  LD  LD
.D2    LD  LD  LD  LD  LD  LD  LD  LD
.L1                                LD  AD  AD  AD  AD  AD  AD  AD
.L2        SU  SU  SU  SU  SU  SU  SU
.S1
.S2            B   B   B   B   B   B
.M1                    MP  MP  MP  MP  MP  MP  MP  MP
.M2

5. Pipelined Code

Write code from the table. Two ways:

a. The number of times you go through the loop kernel is equal to the number of dot
products you want to do minus the number of prolog cycles.

Ex. If we have 10 numbers to multiply together and add, then the number of loop
kernel iterations is 10-7 = 3, i.e. the loop count must be set to 3.

Assembly code (notice no NOPs are needed):

Cycle 1:
        LDH   .D1   *A4++,A2
||      LDH   .D2   *B4++,B2

Cycle 2:
        LDH   .D1   *A4++,A2
||      LDH   .D2   *B4++,B2
|| [B1] SUB   .L2   B1,1,B1     ; need [B1] for the loop kernel

Cycles 3-5:
        LDH   .D1   *A4++,A2
||      LDH   .D2   *B4++,B2
|| [B1] SUB   .L2   B1,1,B1
|| [B1] B     .S2   loop

Cycles 6-7:
        LDH   .D1   *A4++,A2
||      LDH   .D2   *B4++,B2
|| [B1] SUB   .L2   B1,1,B1
|| [B1] B     .S2   loop
||      MPY   .M1X  A2,B2,A3

Cycles 8-10 (loop kernel):
        LDH   .D1   *A4++,A2
||      LDH   .D2   *B4++,B2
|| [B1] SUB   .L2   B1,1,B1
|| [B1] B     .S2   loop
||      MPY   .M1X  A2,B2,A3
||      ADD   .L1   A3,A7,A7

Cycles 11-15:
        MPY   .M1X  A2,B2,A3
||      ADD   .L1   A3,A7,A7

Cycles 16-17:
        ADD   .L1   A3,A7,A7

Now it takes 17 cycles to complete the loop.

Could optimize further by doing word-wide optimization and software pipelining.

b. An alternative is to get rid of the epilog cycles and let the loop count equal the number
of dot products you want to do.

The result is a program with smaller code size (it performs some unnecessary loads) but
the same number of cycles.

C. Summary

For fixed-point dot-product:

No optimization: 16 cycles/loop * 10 numbers = 160 cycles

Reduced NOPs & parallel instructions: 8 cycles/loop * 10 numbers = 80 cycles

Word-wide optimization: 8 cycles/loop * 5 times through loop = 40 cycles

SW pipelining: 17 cycles
