
CODE OPTIMIZATION

I. General methodology

Start with C code.

Compile with the -g option and no optimization. (If using 'C67x floating-point
operations, select the -mv6710 option in the compiler so that it uses the
floating-point hardware of the DSP.) If performance is satisfactory, do not continue.

Compile with -g and the -o0 or -o1 optimization option. Check performance.

Compile with -g and the -o3 and -pm optimization options. Use -pm when you have no
ASM code (there are ways to use -pm with ASM code, but we are not going into that level of detail).

(-g: enables source-level symbolic debugging; -pm: combines all C source files before
compiling, i.e. program-level optimization; -gs: interlists C statements into the assembly
listing)

Modify the C code.

Use intrinsics, pragmas, word-wide optimization, loop unrolling (to remove branches),
and compiler feedback.
Continue to use the compiler optimization levels.

Linear ASM

Use profiling to identify the functions or sections that need further optimization.


Write these sections in linear asm.

Use word-wide optimization.

ASM

Write time-critical sections in asm.

Use word-wide optimization, parallel instructions, removal of NOPs.

Hand-optimize using a dependency graph and a scheduling table.

II. C-code modifications

A. Intrinsics: special functions that map directly to inlined 'C67x instructions. The compiler
inserts the corresponding assembly language statements directly into the program.

Intrinsics are specified with a leading underscore ( _ ) and are treated like a function.

Ex: int c;
    short a, b;
    c = _mpy(a, b);

This is equivalent to c = a*b, BUT it is done at the assembly language level with a single MPY instruction.
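As a further illustration (a minimal sketch only; the function name sat_dot is ours, and it assumes
the standard 'C6000 _sadd intrinsic, which maps to the saturating SADD instruction), an intrinsic
can also be used inside a loop so that the accumulator saturates instead of wrapping around on
overflow:

int sat_dot(short *x, short *y, int n)      /* illustrative sketch, not from the original example */
{
    int sum = 0, i;
    for (i = 0; i < n; i++)
        sum = _sadd(sum, _mpy(x[i], y[i])); /* saturating accumulation of the 16x16 products */
    return sum;
}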

in CCS: Help > Contents > TMS320C6000 Programmer's Guide > Optimizing C/C++ Code >
Refining C/C++ Code > Using Intrinsics

B. Pragma directives provide information to the compiler or tell it to do something.

#pragma MUST_ITERATE (min, max, factor)


Write the pragma in the C code immediately before the loop. It tells the compiler the minimum
and maximum trip counts and that the trip count is a multiple of the given factor, which helps in
loop optimization.
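For example (a minimal sketch; the function name vector_add and the bounds 8/256/4 are ours,
chosen only for illustration), the pragma is placed directly before the loop it describes:

void vector_add(short *a, short *b, short *c, int n)
{
    int i;
    /* loop runs at least 8 and at most 256 times; trip count is a multiple of 4 */
    #pragma MUST_ITERATE(8, 256, 4)
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}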

C. Word-wide optimization: if you have 16-bit (short) data, access two values at once and then
work with 32-bit data. This is also called packed data processing.

1. Use casts and intrinsics

int dot_product(short *x, short *y, short z)   // x, y point to short variables
{
    int *w_x = (int *)x;                       // cast to point to a word
    int *w_y = (int *)y;
    int sum1 = 0, sum2 = 0, i;
    for (i = 0; i < z/2; i++)
    {
        sum1 += _mpy(w_x[i], w_y[i]);          // mpy 16 LSBs
        sum2 += _mpyh(w_x[i], w_y[i]);         // mpy 16 MSBs
    }
    return (sum1 + sum2);
}

This is equivalent to summing the products of the even-indexed elements (i = 0, 2, 4, ...),
separately summing the products of the odd-indexed elements, and then adding the two sums.
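In plain C, the packed version above computes the same result as the following reference sketch
(the name dot_product_ref is ours; the even/odd split assumes z is even and the little-endian
layout of the 'C6x, where the 16 LSBs of each loaded word hold the lower-addressed element):

int dot_product_ref(short *x, short *y, short z)
{
    int sum1 = 0, sum2 = 0, i;
    for (i = 0; i < z; i += 2)
    {
        sum1 += x[i] * y[i];         /* even-indexed products (16 LSBs of each word) */
        sum2 += x[i + 1] * y[i + 1]; /* odd-indexed products (16 MSBs of each word)  */
    }
    return sum1 + sum2;
}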

2. Use the _nassert intrinsic to specify that 16-bit short arrays are aligned on a 32-bit (word)
boundary.

int dot_product(short *x, short *y, short z)
{
    int sum = 0, i;

    _nassert(((int)(x) & 0x3) == 0);
    _nassert(((int)(y) & 0x3) == 0);

    #pragma MUST_ITERATE(20, , 4); // must be an even number of times in order to split in half for optimization
    for (i = 0; i < z; i++) sum += x[i] * y[i];

    return sum;
}

_nassert tells the compiler that the pointers x and y are on a word boundary (the 2 LSBs of the
address are 0). The compiler then knows that it can optimize the following loop with MPY and
MPYH.

_nassert is useful in that you do not need to rewrite code – you simply add the _nassert lines.

D. Unrolling the Loop – a technique in which you expand small loops so that more iterations of the
loop appear in your code. The purpose is to let more instructions operate in parallel (a short
unrolled-by-2 sketch follows the MUST_ITERATE example below).

Compiler will do this automatically if it knows:


1) The trip count
2) The trip count is a multiple of 2

Can do this through MUST_ITERATE:


#pragma MUST_ITERATE (10, ,2);
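As a sketch of what the transformation produces (illustrative only; the name dot_unrolled is ours,
and it assumes an even trip count), a dot-product loop unrolled by a factor of 2 looks like this:

int dot_unrolled(short *x, short *y, int n)  /* n assumed even */
{
    int sum = 0, i;
    for (i = 0; i < n; i += 2)
    {
        sum += x[i] * y[i];         /* original iteration i     */
        sum += x[i + 1] * y[i + 1]; /* original iteration i + 1 */
    }
    return sum;
}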

E. Compiler Feedback – when you compile, the compiler can report information that helps you
decide what to optimize and how to optimize it.

III. Linear ASM Optimization – Word-wide optimization


A. Example: Load word and operate on 16-bit components/parts, dot product

        .global _dotp

_dotp   .cproc  x, y
        .reg    sum, sum0, sum1, cntr
        .reg    a, b, p, p1

        MVK     50,cntr         ; cntr = 100/2
        ZERO    sum0            ; multiply result = 0
        ZERO    sum1            ; multiply result = 0

LOOP:   .trip   50
        LDW     *x++,a          ; load a & a+1 from memory
        LDW     *y++,b          ; load b & b+1 from memory
        MPY     a,b,p           ; a * b
        MPYH    a,b,p1          ; a+1 * b+1
        ADD     p,sum0,sum0     ; sum0 += (a * b)
        ADD     p1,sum1,sum1    ; sum1 += (a+1 * b+1)
[cntr]  SUB     cntr,1,cntr     ; decrement loop counter
[cntr]  B       LOOP            ; branch to loop

        ADD     sum0,sum1,sum   ; compute final result

        .return sum
        .endproc

B. Example: Load double word and work on 32-bit words for floating point dot product

        .global _dotp

_dotp   .cproc  x, y
        .reg    sum, sum0, sum1, cntr
        .reg    a0:a1, b0:b1, p, p1     ; note the register-pair format for a double word

        MVK     50,cntr         ; cntr = 100/2
        ZERO    sum0            ; multiply result = 0
        ZERO    sum1            ; multiply result = 0

LOOP:   .trip   50
        LDDW    *x++,a0:a1      ; load a & a+1 from memory
        LDDW    *y++,b0:b1      ; load b & b+1 from memory
        MPYSP   a0,b0,p         ; a * b
        MPYSP   a1,b1,p1        ; a+1 * b+1
        ADDSP   p,sum0,sum0     ; sum0 += (a * b)
        ADDSP   p1,sum1,sum1    ; sum1 += (a+1 * b+1)
[cntr]  SUB     cntr,1,cntr     ; decrement loop counter
[cntr]  B       LOOP            ; branch to loop

        ADDSP   sum0,sum1,sum   ; compute final result

        .return sum
        .endproc

IV. Assembly language code


A. Word-wide optimization, parallel instructions, and reduction of NOPs

Consider the fixed-point dot product example with no optimization

        MVK   .S1   100, A1     ; set up loop counter
        ZERO  .L1   A7          ; zero out accumulator
LOOP
        LDH   .D1   *A4++,A2    ; load ai from memory
        LDH   .D1   *A3++,A5    ; load bi from memory
        NOP   4                 ; delay slots for LDH
        MPY   .M1   A2,A5,A6    ; ai * bi
        NOP                     ; delay slot for MPY
        ADD   .L1   A6,A7,A7    ; sum += (ai * bi)
        SUB   .S1   A1,1,A1     ; decrement loop counter
[A1]    B     .S2   LOOP        ; branch to loop
        NOP   5                 ; delay slots for branch
        ; Branch occurs here

This takes 16 cycles for each iteration. Since there are 100 numbers in each array, this
results in 1600 cycles + 2 cycles for the MVK and ZERO.

Optimize with parallel instructions and by reducing NOPs:

If we use both .D units for the loads, the two loads can be issued in parallel.
Also, the SUB and B do not depend on the other instructions. By moving them up, we can
remove the NOP 5 line and reduce the number of cycles. The B instruction should occur no
earlier than 5 cycles before the ADD.

        MVK   .S1   100, A1     ; set up loop counter
||      ZERO  .L1   A7          ; zero out accumulator
LOOP
        LDH   .D1   *A4++,A2    ; load ai from memory
||      LDH   .D2   *B4++,B2    ; load bi from memory
        SUB   .S1   A1,1,A1     ; decrement loop counter
[A1]    B     .S2   LOOP        ; branch to loop
        NOP   2                 ; delay slots for LDH
        MPY   .M1X  A2,B2,A6    ; ai * bi (the X denotes that a cross path is used)
        NOP                     ; delay slot for MPY
        ADD   .L1   A6,A7,A7    ; sum += (ai * bi)
        ; Branch occurs here

This loop requires 8 cycles per iteration. For 100 numbers in each array, this function
requires 8*100 + 1 (for initialization) cycles.
We've gone from 1602 to 801 cycles for the function.

Let's do word-wide optimization now:


The loop in linear ASM is:

        LDW    *x++,a           ; load a & a+1 from memory
        LDW    *y++,b           ; load b & b+1 from memory
        MPY    a,b,p            ; a * b
        MPYH   a,b,p1           ; a+1 * b+1
        ADD    p,sum0,sum0      ; sum0 += (a * b)
        ADD    p1,sum1,sum1     ; sum1 += (a+1 * b+1)
[cntr]  SUB    cntr,1,cntr      ; decrement loop counter
[cntr]  B      LOOP             ; branch to loop

So the optimized code is now:

        MVK   .S1   50,A1       ; set up loop counter
||      ZERO  .L1   A7          ; zero out sum0 accumulator
||      ZERO  .L2   B7          ; zero out sum1 accumulator
LOOP
        LDW   .D1   *A4++,A2    ; load a & a+1 from memory
||      LDW   .D2   *B4++,B2    ; load b & b+1 from memory
        SUB   .S1   A1,1,A1     ; decrement loop counter
[A1]    B     .S1   LOOP        ; branch to loop
        NOP   2
        MPY   .M1X  A2,B2,A6    ; a * b (note the same registers can be used)
||      MPYH  .M2X  A2,B2,B6    ; a+1 * b+1
        NOP
        ADD   .L1   A6,A7,A7    ; sum0 += (a * b)
||      ADD   .L2   B6,B7,B7    ; sum1 += (a+1 * b+1)
        ; Branch occurs here

This loop takes 8 cycles per iteration, but now we only have 50 iterations. The total cycle count
is then 8*50 + 2 (for initialization).
We've gone from 1602 > 801 > 402 cycles.

B. Software Pipelining - a technique used to schedule instructions from a loop so that multiple
iterations execute in parallel. The parallel resources on the C6x make it possible to initiate a
new loop iteration before previous iterations finish. The goal of software pipelining is to start a
new loop iteration as soon as possible. (This works for linear assembly or hand-written assembly.)

1. Development of the idea (ignoring NOPs for simplicity)


Consider:

loop:   ldh
     || ldh
        mpy
        add

Cycle  .D1   .D2   .M1   .M2   .L1   .L2   .S1   .S2

  1    ldh   ldh
  2                mpy
  3                            add
  4    ldh   ldh
  5                mpy
  6                            add
  7    ldh   ldh
  8                mpy
  9                            add

We can see that many of the resources are not used. With pipelining, we can have the following:

Cycle  .D1   .D2   .M1   .M2   .L1   .L2   .S1   .S2

  1    ldh   ldh
  2    ldh   ldh   mpy
  3    ldh   ldh   mpy         add
  4    ldh   ldh   mpy         add
  5    ldh   ldh   mpy         add
  6                mpy         add
  7                            add

There are three stages to pipelined code: the prolog, the loop kernel (cycles 3-5 above), and the
epilog. The prolog consists of the instructions needed to build up to the loop kernel; the epilog
consists of the instructions needed to complete all remaining loop iterations. When a single-cycle
loop kernel is set up, the entire loop body is executed in one cycle as one parallel instruction
using the maximum number of functional units.

2. Method - three steps are needed to produce hand-coded software-pipelined code from
an assembly loop:
1) drawing a dependency graph
2) setting up a scheduling table
3) deriving the pipelined code from the table

3. Dependency Graph

Let’s consider the linear assembly program:


loop    .trip 10                ; exactly 10 iterations through the loop
        LDH   *x++,a            ; pointer into x array -> a
        LDH   *y++,b            ; put an element of y into b
        MPY   a,b,prod          ; prod = x * y
        ADD   prod,sum,sum      ; sum of products -> sum
        SUB   count,1,count     ; decrement counter
[count] B     loop

First, draw a node and a path for each instruction and write the number of cycles needed to
complete the instruction. Each circle corresponds to an instruction; inside the circle is the
variable written to by the instruction (its dst).

[Dependency graph: LDH -> a (5 cycles) and LDH -> b (5 cycles) both feed MPY -> prod (2 cycles),
which feeds ADD -> sum (1 cycle); SUB -> count (1 cycle) feeds B -> loop (6 cycles).]

Second, assign a functional unit to each instruction and separate the A and B data paths
so that the maximum number of units is used and the workload between the two paths is as
equal as possible.
[Dependency graph with units assigned: a (LDH, .D1, 5 cycles) and b (LDH, .D2, 5 cycles) ->
prod (MPY, .M1, 2 cycles) -> sum (ADD, .L1, 1 cycle); count (SUB, .L2, 1 cycle) -> loop (B, .S2, 6 cycles).]

4. Scheduling Table

Identify the longest path. This determines how long the table will be.

For our example, the longest path is 8 (LDH 5 + MPY 2 + ADD 1). This means that 7 prolog
columns are needed before entering the loop kernel.

The longest path starts with LDH, so we start with that in cycle 1 and repeat it in every cycle
through the loop-kernel column (cycle 8). The MPY must come 5 cycles after the loads (1 cycle +
4 delay slots), so the first MPY is in cycle 6; we need 8 of these for the 8 loads. The ADD must
come 2 cycles after the MPY (1 cycle + 1 delay slot), so the first ADD is in cycle 8; there should
be 8 of these as well.

The branch is scheduled by counting back 5 cycles from the loop kernel (cycle 8), so the first B
is in cycle 3. The SUB must come 1 cycle before the branch, so the first SUB is in cycle 2.

Cycle  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
.D1    LD  LD  LD  LD  LD  LD  LD  LD
.D2    LD  LD  LD  LD  LD  LD  LD  LD
.L1                                LD  AD  AD  AD  AD  AD  AD  AD
.L2        SU  SU  SU  SU  SU  SU  SU
.S1
.S2            B   B   B   B   B   B
.M1                    MP  MP  MP  MP  MP  MP  MP  MP
.M2

5. Pipelined Code

Write code from the table. Two ways:

a. The number of times you go through the loop kernel is equal to the number of dot
products you want to do minus the number of prolog cycles.

Ex. If we have 10 numbers to multiply together and add, then the number of loop
kernel iterations is 10-7 = 3, i.e. the loop count must be set to 3.

Assembly code (notice no NOPs are needed):

Cycle 1:
        LDH   .D1   *A4++,A2
||      LDH   .D2   *B4++,B2

Cycle 2:
        LDH   .D1   *A4++,A2
||      LDH   .D2   *B4++,B2
|| [B1] SUB   .L2   B1,1,B1     ; need [B1] for the loop kernel

Cycles 3-5:
        LDH   .D1   *A4++,A2
||      LDH   .D2   *B4++,B2
|| [B1] SUB   .L2   B1,1,B1
|| [B1] B     .S2   loop

Cycles 6-7:
        LDH   .D1   *A4++,A2
||      LDH   .D2   *B4++,B2
|| [B1] SUB   .L2   B1,1,B1
|| [B1] B     .S2   loop
||      MPY   .M1X  A2,B2,A3

Cycles 8-10 (loop kernel):
        LDH   .D1   *A4++,A2
||      LDH   .D2   *B4++,B2
|| [B1] SUB   .L2   B1,1,B1
|| [B1] B     .S2   loop
||      MPY   .M1X  A2,B2,A3
||      ADD   .L1   A3,A7,A7

Cycles 11-15:
        MPY   .M1X  A2,B2,A3
||      ADD   .L1   A3,A7,A7

Cycles 16-17:
        ADD   .L1   A3,A7,A7

Now it takes 17 cycles to complete the loop.

Could optimize further by doing word-wide optimization and software pipelining.

b. An alternative is to get rid of the epilog cycles and let the loop count equal the number
of dot products you want to do.

The result is a program with smaller code size (it performs some unnecessary loads) but
the same number of cycles.

C. Summary

For fixed-point dot-product:

No optimization: 16 cycles/loop * 10 numbers = 160 cycles

Reduced NOPs & parallel instructions: 8 cycles/loop * 10 numbers = 80 cycles

Word-wide optimization: 8 cycles/loop * 5 times through loop = 40 cycles

SW pipelining: 17 cycles
