I. General methodology
1. Compile with the -g option and no optimization. (If using 'C67x floating-point
operations, select the -mv6710 compiler option so that the compiler uses the
floating-point hardware of the DSP.) If performance is satisfactory, do not continue.
2. Compile with the -g, -o3, and -pm options. Use -pm only when the project has no
ASM code (there are ways to use -pm with ASM code, but we are not going into that
level of detail).
(-g: enables source-level symbolic debugging; -pm: combines all C source files before
compiling, for program-level optimization; -gs: interlists C statements into the
assembly listing)
3. Modify the C code.
Use intrinsics, pragmas, word-wide optimization, loop unrolling (to remove branches),
and compiler feedback.
Use optimization levels.
4. Linear ASM
5. ASM
A. Intrinsics: special functions that map directly to inline 'C67x instructions. The compiler
inserts the corresponding assembly language statements directly into the program.
Intrinsics are specified with a leading underscore ( _ ) and are called like functions.
Ex: int c;
    short a, b;
    c = _mpy(a, b);
In CCS: Help > Contents > TMS320C6000 Programmer's Guide > Optimizing C/C++ Code >
Refining C/C++ Code > Using Intrinsics
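To make the meaning of _mpy concrete, here is a minimal host-side sketch of what the intrinsic computes: a signed multiply of the 16 least significant bits of each operand. The name mpy_emul is a hypothetical stand-in for illustration only; it is not a TI function, and on the DSP you would call _mpy directly.

```c
#include <stdint.h>

/* Hypothetical host-side stand-in for the _mpy intrinsic:
   signed multiply of the 16 LSBs of each 32-bit operand.
   Assumes the usual two's-complement narrowing to int16_t. */
static int32_t mpy_emul(int32_t src1, int32_t src2)
{
    int16_t lo1 = (int16_t)((uint32_t)src1 & 0xFFFFu);
    int16_t lo2 = (int16_t)((uint32_t)src2 & 0xFFFFu);
    return (int32_t)lo1 * (int32_t)lo2;
}
```

With short operands, as in the example above, the result is simply the 32-bit product a*b; with 32-bit operands, the upper halves are ignored.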
C. Word-wide optimization: If you have 16-bit short data, access two values at once and then
work with 32-bit data. This is also called packed data processing.
int dot_product(short *x, short *y, short z) // x, y point to short arrays; z is the element count
{
    int *w_x = (int *)x; // cast to point to a 32-bit word
    int *w_y = (int *)y;
    int sum1 = 0, sum2 = 0, i;
    for (i = 0; i < z/2; i++)
    {
        sum1 += _mpy(w_x[i], w_y[i]);  // multiply the 16 LSBs
        sum2 += _mpyh(w_x[i], w_y[i]); // multiply the 16 MSBs
    }
    return (sum1 + sum2);
}
This is equivalent to summing the products of the even-indexed elements (i = 0, 2, 4, ...),
summing the products of the odd-indexed elements, and then adding the two sums.
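The packed approach can be checked on a host machine with a sketch like the one below. It is an illustration, not the DSP code: mpy_emul and mpyh_emul are hypothetical stand-ins for the _mpy and _mpyh intrinsics, memcpy replaces the pointer cast so that the example does not depend on alignment, and the element count is assumed to be even.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for _mpy / _mpyh (signed 16x16 multiplies). */
static int32_t mpy_emul(uint32_t a, uint32_t b)
{
    return (int32_t)(int16_t)(a & 0xFFFFu) * (int32_t)(int16_t)(b & 0xFFFFu);
}

static int32_t mpyh_emul(uint32_t a, uint32_t b)
{
    return (int32_t)(int16_t)(a >> 16) * (int32_t)(int16_t)(b >> 16);
}

/* Reference version: one 16-bit element per iteration. */
int dot_product_plain(const int16_t *x, const int16_t *y, int n)
{
    int sum = 0, i;
    for (i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Packed version: one 32-bit word (two elements) per iteration.
   Assumes n is even. */
int dot_product_packed(const int16_t *x, const int16_t *y, int n)
{
    int sum1 = 0, sum2 = 0, i;
    for (i = 0; i < n / 2; i++) {
        uint32_t wx, wy;
        memcpy(&wx, x + 2 * i, sizeof wx); /* load two shorts as one word */
        memcpy(&wy, y + 2 * i, sizeof wy);
        sum1 += mpy_emul(wx, wy);  /* products of the lower halves */
        sum2 += mpyh_emul(wx, wy); /* products of the upper halves */
    }
    return sum1 + sum2;
}
```

Both versions return the same value because each half of a word from x is always paired with the matching half of the word from y, regardless of byte order.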
2. Use _nassert intrinsic to specify that 16-bit short arrays are aligned on a 32-bit (word)
boundary.
return sum;
}
_nassert tells the compiler that the pointers x and y are on a word boundary (the 2 LSBs
of each address are 0). The compiler then knows that it can optimize the following loop
with mpy and mpyh.
_nassert is useful in that you do not need to rewrite the code; you simply add the
_nassert line.
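A minimal sketch of the _nassert approach follows, written so that it also builds on a host compiler. The function name dot_product_aligned and the fallback macro are assumptions made for illustration: on a TI compiler (detected here through its predefined __TI_COMPILER_VERSION__ macro) _nassert is the real intrinsic, and elsewhere the sketch defines it as a no-op.

```c
#include <stdint.h>

/* Assumption for host builds only: make _nassert a no-op when the
   TI compiler, which provides it as an intrinsic, is not in use. */
#ifndef __TI_COMPILER_VERSION__
#define _nassert(cond) ((void)0)
#endif

int dot_product_aligned(const short *x, const short *y, int n)
{
    int sum = 0, i;

    /* Promise the compiler that both arrays start on a 32-bit boundary,
       i.e. the two LSBs of each address are 0. */
    _nassert(((uintptr_t)x & 0x3) == 0);
    _nassert(((uintptr_t)y & 0x3) == 0);

    /* The loop body itself is unchanged, plain C. */
    for (i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

The promise is not checked by the compiler, so the caller remains responsible for actually aligning the arrays.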
D. Unrolling the Loop – a technique where you expand small loops so that more iterations of the
loop appear in your code. The purpose is to allow more instructions to execute in parallel.
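As a sketch of what unrolling looks like in C, the example below unrolls a dot-product loop by a factor of two. The function names are hypothetical, and the unrolled version assumes an even element count. The two accumulators are independent, which is what gives the compiler room to schedule the two multiply-accumulates in parallel.

```c
/* Original loop: one multiply-accumulate and one branch per element. */
int dot_rolled(const short *x, const short *y, int n)
{
    int sum = 0, i;
    for (i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Unrolled by two: half as many branches, two independent accumulators.
   Assumes n is even. */
int dot_unrolled2(const short *x, const short *y, int n)
{
    int sum0 = 0, sum1 = 0, i;
    for (i = 0; i < n; i += 2) {
        sum0 += x[i] * y[i];         /* even-indexed elements */
        sum1 += x[i + 1] * y[i + 1]; /* odd-indexed elements */
    }
    return sum0 + sum1;
}
```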
E. Compiler Feedback – when you compile, you can get information from the compiler that
helps you decide what to optimize and how to optimize it.
.global _dotp
_dotp .cproc x, y
.reg sum, sum0, sum1, cntr
.reg a, b, p, p1
LOOP: .trip 50
.return sum
.endproc
B. Example: Load double word and work on 32-bit words for floating point dot product
.global _dotp
_dotp .cproc x, y
LOOP:   .trip 50
        LDDW    *x++,a0:a1      ; load a and a+1 from memory
        LDDW    *y++,b0:b1      ; load b and b+1 from memory
        MPYSP   a0,b0,p         ; a * b
        MPYSP   a1,b1,p1        ; (a+1) * (b+1)
        ADDSP   p,sum0,sum0     ; sum0 += (a * b)
        ADDSP   p1,sum1,sum1    ; sum1 += ((a+1) * (b+1))
[cntr]  SUB     cntr,1,cntr     ; decrement loop counter
[cntr]  B       LOOP            ; branch to loop
.return sum
.endproc
This takes 16 cycles for each iteration. Since there are 100 numbers in each array, this
results in 1600 cycles + 2 cycles for the MVK and ZERO.
This loop requires 8 cycles per iteration. For 100 numbers in each array, this function
requires 8*100 + 1 (for initialization) cycles.
We've gone from 1602 to 801 cycles for the function.
This loop takes 8 cycles per iteration, but now we only have 50 iterations. The total cycles
are then 8*50 + 2 (for initialization).
We've gone from 1602 to 801 to 402 cycles.
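The cycle counts quoted above all follow the same formula: cycles per iteration times the number of iterations, plus the initialization cycles. A small sketch (the function name loop_cycles is hypothetical) reproduces the three totals from the per-iteration figures stated in the text:

```c
/* Total cycles for a non-pipelined loop:
   cycles per iteration * number of iterations + initialization cycles. */
int loop_cycles(int cycles_per_iter, int iterations, int init_cycles)
{
    return cycles_per_iter * iterations + init_cycles;
}
```

For example, loop_cycles(16, 100, 2) gives 1602, loop_cycles(8, 100, 1) gives 801, and loop_cycles(8, 50, 2) gives 402.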
B. Software Pipelining - a technique used to schedule instructions from a loop so that multiple
iterations execute in parallel. The parallel resources on the C6x make it possible to initiate a
new loop iteration before previous iterations finish. The goal of software pipelining is to start a
new loop iteration as soon as possible. (This works for linear assembly or assembly.)
Without pipelining, many of the resources are not used. With pipelining we can have the following:
There are three stages in pipelined code: prolog, loop kernel (gray area), and epilog.
The prolog contains the instructions needed to build up to the loop kernel. The epilog contains
the instructions needed to complete all loop iterations. When a single-cycle loop kernel is set
up, each pass through the loop executes in one cycle as a single parallel instruction that uses
the maximum number of functional units.
2. Method - three steps are needed to produce hand-coded software-pipelined code from
an assembly loop:
1) drawing a dependency graph
2) setting up a scheduling table
3) deriving the pipelined code from the table
3. Dependency Graph
First, you draw a node and path for each instruction and write the number of cycles to
complete the instruction. Each circle corresponds to an instruction. Inside the circle is
the variable written to by the instruction, dst.
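[Figure: dependency graph for the dot-product loop]
Instruction   Writes (dst)   Cycles   Result feeds
LDH           a              5        MPY
LDH           b              5        MPY
MPY           prod           2        ADD
ADD           sum            1        ADD (next iteration)
SUB           count          1        SUB (next iteration), B
B             loop           6        branch back to loop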
Second, you assign a functional unit to each instruction and separate the A and B data paths
so that the maximum number of units is used and the workload on the two paths is as equal
as possible.
[Figure: dependency graph with functional units assigned]
Instruction   Writes (dst)   Cycles   Unit
LDH           a              5        .D1
LDH           b              5        .D2
MPY           prod           2        .M1
ADD           sum            1        .L1
SUB           count          1        .L2
B             loop           6        .S2
4. Scheduling Table
Identify the longest path. This affects how long the table will be.
For our example, the longest path is 8 cycles (5 for LDH + 2 for MPY + 1 for ADD). This
means that 7 prolog columns are needed before entering the loop kernel.
The longest path starts with LDH, so we start with that in cycle 1. The loads are repeated in
every cycle through the loop kernel column (cycle 8). The MPY must come 5 cycles after the
loads (1 + 4). We should have 8 of these for the 8 loads. The ADD must come 2 cycles after
the MPY (1 + 1). There should be 8 of these as well.
The branch is scheduled by counting back 5 cycles from the loop kernel. The SUB
must come 1 cycle before the branch.
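The placement rules above can be written out as a short calculation. The sketch below is an illustration under stated assumptions: the struct and function names are hypothetical, and the latencies are the ones given in the dependency graph (LDH 5, MPY 2, ADD 1, B 6).

```c
/* First issue cycle of each instruction in the scheduling table. */
struct schedule {
    int first_ld;      /* first load                        */
    int first_mpy;     /* first multiply                    */
    int first_add;     /* first add                         */
    int kernel_cycle;  /* column of the loop kernel         */
    int prolog_cycles; /* columns before the loop kernel    */
    int first_b;       /* first branch                      */
    int first_sub;     /* first loop-counter decrement      */
};

struct schedule dot_schedule(int ld_cycles, int mpy_cycles,
                             int add_cycles, int branch_cycles)
{
    struct schedule s;
    s.first_ld = 1;                                      /* longest path starts here  */
    s.first_mpy = s.first_ld + ld_cycles;                /* 1 + 5 = 6                 */
    s.first_add = s.first_mpy + mpy_cycles;              /* 6 + 2 = 8                 */
    s.kernel_cycle = ld_cycles + mpy_cycles + add_cycles; /* longest path = 8         */
    s.prolog_cycles = s.kernel_cycle - 1;                /* 7                         */
    s.first_b = s.kernel_cycle - (branch_cycles - 1);    /* count back 5: 8 - 5 = 3   */
    s.first_sub = s.first_b - 1;                         /* 1 cycle before the branch */
    return s;
}
```

Calling dot_schedule(5, 2, 1, 6) places the first LDH in cycle 1, SUB in cycle 2, B in cycle 3, MPY in cycle 6, and ADD in cycle 8, which is where those instructions first appear in the scheduling table.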
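Unit  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
.D1   LD  LD  LD  LD  LD  LD  LD  LD  --  --  --  --  --  --  --
.D2   LD  LD  LD  LD  LD  LD  LD  LD  --  --  --  --  --  --  --
.L1   --  --  --  --  --  --  --  AD  AD  AD  AD  AD  AD  AD  AD
.L2   --  SU  SU  SU  SU  SU  SU  SU  --  --  --  --  --  --  --
.S1   --  --  --  --  --  --  --  --  --  --  --  --  --  --  --
.S2   --  --  B   B   B   B   B   B   --  --  --  --  --  --  --
.M1   --  --  --  --  --  MP  MP  MP  MP  MP  MP  MP  MP  --  --
.M2   --  --  --  --  --  --  --  --  --  --  --  --  --  --  --
(Cycles 1-7: prolog; cycle 8: loop kernel; cycles 9-15: epilog; -- marks an idle unit.)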
5. Pipelined Code
a. The number of times you go through the loop kernel is equal to the number of dot
products that you want to do minus the number of prolog cycles.
Ex. If we have 10 numbers to multiply together and add, then the number of loop
kernel iterations is 10 - 7 = 3, i.e. the loop count must be set to 3.
Cycle 1:
Cycle 3-5:
Cycle 6-7:
Cycle 8, 9, 10
Cycle 11-15:
Cycle 16-17:
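The cycle groups listed above can be derived from the scheduling table. The sketch below is a simplified model, not the hand-written assembly: it tracks only LDH, MPY, and ADD (SUB and B are left out), the function names are hypothetical, and the issue offsets come from the latencies in the dependency graph. It assumes at least 8 products.

```c
/* Issue offsets, in cycles, relative to the load of the same iteration:
   the MPY issues 5 cycles after the LDH, the ADD 2 cycles after the MPY. */
enum { LD_OFFSET = 0, MPY_OFFSET = 5, ADD_OFFSET = 7 };

/* Returns 1 if the instruction with the given offset issues in `cycle`
   (1-based) for some iteration 0 .. n-1, and 0 otherwise. */
int issues_in_cycle(int cycle, int offset, int n)
{
    int k = cycle - 1 - offset; /* iteration that would issue it now */
    return k >= 0 && k < n;
}

/* Total cycles: one new iteration per cycle, plus the 7 cycles it
   takes the last iteration to reach its ADD. */
int pipelined_cycles(int n)
{
    return n + ADD_OFFSET;
}

/* Passes through the loop kernel: products minus prolog cycles. */
int kernel_passes(int n)
{
    return n - ADD_OFFSET;
}
```

For 10 products this model gives 17 cycles and 3 kernel passes: loads alone in cycles 1-5, loads and multiplies in cycles 6-7, all three instructions in cycles 8-10, multiplies and adds in cycles 11-15, and adds alone in cycles 16-17.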
b. An alternative is to get rid of the epilog cycles and let the loop count equal the number
of dot products you want to do.
The result is a program with smaller code size that performs some unnecessary loads
but takes the same number of cycles.
C. Summary
SW pipelining: 17 cycles