Pål-Kristian Engstad
pal_engstad@naughtydog.com
March 5, 2010
These slides are used internally at Naughty Dog to introduce new programmers
to our SPU programming methods. Due to popular interest, we are now
making them public. Note that some of the tools we use have not been
released to the public, but there exist many alternatives out there that
do similar things.
The first set of slides introduces most of the SPU assembly instructions. Please
read these carefully before reading the second set. Those slides go through a
made-up example showing how one can improve performance drastically by
knowing the hardware and by employing a technique called software
pipelining.
In these slides, we will go through all of the assembly instructions that exist on
the SPU, giving you a quick introduction to the power of the SPUs.
The instruction set can be divided into classes, where the instructions in the
same class issue on the same pipeline (even or odd) and have the same latency
(how long it takes for the result to be ready):
The syntax here indicates that for each of the 4 32-bit floating point values in
the register, the operation in the comment is executed.
- No broadcast versions.
- No dot-products or cross-products.
- No “fnma” instruction.
Example:
The FX class of instructions all have latency of just 2 cycles and all have a
throughput of 1 cycle. These are even instructions.
There are quite a few of them, and we can further divide them into:
The integer arithmetic operations “add” and “subtract from” work on either 4
words at a time or 8 half-words at a time.
Notice the subtract from semantics. This is different from the floating point
subtract (fs) semantic. We think this was mainly due to the additional power
of the immediate forms.
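To make the operand order concrete, here is a minimal Python sketch of the "subtract from" semantics on four word lanes. The function names sf, sfi and to_signed are just illustrative models (they are not part of any tool), and the immediate is treated as a plain integer rather than an encoded field.

```python
MASK32 = 0xFFFFFFFF  # each word lane is 32 bits wide

def sf(j, k):
    """Model of `sf i, j, k`: per word, i = k - j (note the operand order)."""
    return [(b - a) & MASK32 for a, b in zip(j, k)]

def sfi(j, imm):
    """Model of `sfi i, j, imm`: per word, i = imm - j."""
    return [(imm - a) & MASK32 for a in j]

def to_signed(word):
    """Interpret a 32-bit lane as a two's-complement integer."""
    return word - (1 << 32) if word & 0x80000000 else word

j = [1, 2, 3, 4]
k = [10, 20, 30, 40]
difference = sf(j, k)                         # k - j in every lane
negated = [to_signed(w) for w in sfi(j, 0)]   # 0 - j: a one-instruction negate
```

The last line hints at why the immediate form is handy: subtracting a register from an immediate gives operations such as negation in a single instruction.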
The SPU has some instructions that enable us to quickly set up register
values. These immediate loads are also 2-cycle FX instructions:
Example:
and i, j, k ; i = j & k
nand i, j, k ; i = ~(j & k)
andc i, j, k ; i = j & ~k
or i, j, k ; i = j | k
nor i, j, k ; i = ~(j | k)
orc i, j, k ; i = j | ~k
xor i, j, k ; i = j ^ k
eqv i, j, k ; i = ~(j ^ k), i.e. bitwise equivalence
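The less common logical operations are easy to get wrong, so here is a small Python sketch that models them on a single 32-bit word. The function names simply mirror the mnemonics; note in particular that eqv is a bitwise operation and not a comparison.

```python
MASK32 = 0xFFFFFFFF  # keep results inside one 32-bit word

def nand(j, k):
    return ~(j & k) & MASK32

def andc(j, k):
    return (j & ~k) & MASK32   # AND with the complement of k

def nor(j, k):
    return ~(j | k) & MASK32

def orc(j, k):
    return (j | ~k) & MASK32   # OR with the complement of k

def eqv(j, k):
    return ~(j ^ k) & MASK32   # 1 wherever the bits of j and k agree

j, k = 0b1100, 0b1010
low_nibbles = {name: op(j, k) & 0xF
               for name, op in [("nand", nand), ("andc", andc),
                                ("nor", nor), ("orc", orc), ("eqv", eqv)]}
```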
TRUE = 0xFF
FALSE = 0x00
TRUE = 0xFFFF
FALSE = 0x0000
TRUE = 0xFFFF_FFFF
FALSE = 0x0000_0000
TRUE = 0xFFFF_FFFF
FALSE = 0x0000_0000
This very important operation selects bits from j and k depending on the bits
in the l register. It fits well with the comparison functions given previously.
selb i, j, k, l ; per bit: i = (l == 0) ? j : k
Notice that where a bit in l is 0, the corresponding bit is taken from j;
otherwise it is taken from k.
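As an illustration of how a comparison mask and a select combine into branch-free code, here is a minimal Python sketch. Both selb and greater_than_mask are simplified models written for this example; the latter is a hypothetical helper that follows the TRUE = 0xFFFF_FFFF / FALSE = 0x0000_0000 convention for word compares.

```python
MASK32 = 0xFFFFFFFF

def selb(j, k, l):
    """Per bit: take the bit from j where l is 0, from k where l is 1."""
    return [((a & ~m) | (b & m)) & MASK32 for a, b, m in zip(j, k, l)]

def greater_than_mask(j, k):
    """Hypothetical stand-in for an unsigned word compare: all ones where j > k."""
    return [MASK32 if a > b else 0 for a, b in zip(j, k)]

j = [1, 8, 3, 9]
k = [5, 2, 7, 4]
# Per-lane maximum without any branches: pick k wherever k > j.
lane_max = selb(j, k, greater_than_mask(k, j))

# The select really is bitwise: a partial mask mixes bits from both inputs.
mixed = selb([0xAAAA5555], [0x12345678], [0x0000FFFF])
```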
Notice that there is an independent shift amount for each of the shlh and shl
versions, i.e., this is truly SIMD!
; Assume r0 = ( 1, 2, 4, 8 )
; r1 = ( 1, 2, 3, 4 )
shl r2, r0, r1
; Now r2 = ( 1<<1, 2<<2, 4<<3, 8<<4 )
;        = ( 2, 8, 32, 128 )
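The per-lane shift can be checked with a few lines of Python. This is only a sketch of the word version; it assumes, as we understand the instruction set, that just the low 6 bits of each shift amount are used and that counts above 31 produce zero.

```python
MASK32 = 0xFFFFFFFF

def shl(j, k):
    """Model of `shl i, j, k`: every word lane has its own shift amount."""
    result = []
    for value, amount in zip(j, k):
        count = amount & 0x3F  # assumption: only the low 6 bits are used
        result.append((value << count) & MASK32 if count < 32 else 0)
    return result

r0 = [1, 2, 4, 8]
r1 = [1, 2, 3, 4]
r2 = shl(r0, r1)
```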
Notice here that the shift amounts need to be negative in order to produce a
proper right shift. This is because the operation is actually a rotate left
followed by a mask.
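A small Python sketch of the word-wise rotate-and-mask shows why the amounts are negated. It assumes, to the best of our reading of the instruction set, that the effective count is the negated amount modulo 64 and that counts of 32 or more give zero; the name rotm here is just a model of the instruction.

```python
MASK32 = 0xFFFFFFFF

def rotm(j, k):
    """Model of `rotm i, j, k`: logical right shift by the negated amount."""
    result = []
    for value, amount in zip(j, k):
        count = (-amount) & 0x3F  # assumption: (0 - amount) modulo 64
        result.append((value & MASK32) >> count if count < 32 else 0)
    return result

shifted = rotm([128, 32, 8, 0x80000000], [-1, -2, -3, -31])
# A positive amount does not shift right by that amount at all:
surprise = rotm([128], [1])
```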
The load/store operations are odd instructions that work on the 256 kB local
memory. They have a latency of 6 cycles, but the hardware has short-cuts in
place so that you can read a written value immediately after the store. Do note:
- Memory wraps around, so you can never access memory outside the local
  store (LS).
- You can only load and store a whole quadword, so if you need to modify a
  part, you need to load the quadword value, merge in the modified part
  into the value and store the whole quadword back.
- Addresses are in units of bytes, unlike the VU’s on the PS2.
- The load/store operations will use the value in the preferred word of the
  address register, i.e.: the first word.
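The points above can be sketched in Python as a load, merge and store sequence on a byte array standing in for the local store. All names here (local_store, load_quadword, store_quadword, store_word) are illustrative only, and the sketch assumes that the low 4 bits of an address are ignored so that accesses are always quadword-aligned.

```python
LS_SIZE = 256 * 1024              # size of the local store in bytes
local_store = bytearray(LS_SIZE)

def quadword_base(address):
    """Wrap the address into the LS, then align it down to 16 bytes (assumed)."""
    return (address % LS_SIZE) & ~0xF

def load_quadword(address):
    base = quadword_base(address)
    return bytearray(local_store[base:base + 16])

def store_quadword(address, quadword):
    if len(quadword) != 16:
        raise ValueError("a quadword is exactly 16 bytes")
    base = quadword_base(address)
    local_store[base:base + 16] = quadword

def store_word(address, value):
    """Write one 32-bit word: load the quadword, merge the word, store it back."""
    quadword = load_quadword(address)
    offset = address & 0xC        # byte offset of the word inside the quadword
    quadword[offset:offset + 4] = value.to_bytes(4, "big")
    store_quadword(address, quadword)

store_word(0x24, 0xDEADBEEF)
```

On the real hardware the merge step is typically a shuffle; the Python slice assignment above only stands in for it.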
The shuffle operations all have 4 cycle latency and they are odd instructions.
Most of the instructions in this class deal with the whole quadword:
The ordering of bytes, half-words and words within the quadword is shown
below. Notice that this is big-endian, not little-endian:
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | a | b | c | d | e | f |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|   0   |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       0       |       1       |       2       |       3       |
+---------------+---------------+---------------+---------------+
The shuffle byte instruction shufb takes three inputs: two source registers r0
and r1, and a shuffle mask msk. Each byte d.b[n] of the output register d is
found by applying the following logic to the corresponding mask byte
x = msk.b[n]:
- if x in 0x00 .. 0x7f:
    - if (x & 0x10) == 0x00, then d.b[n] = r0.b[x & 0x0f].
    - if (x & 0x10) == 0x10, then d.b[n] = r1.b[x & 0x0f].
- if x in 0x80 .. 0xbf: d.b[n] = 0x00
- if x in 0xc0 .. 0xdf: d.b[n] = 0xff
- if x in 0xe0 .. 0xff: d.b[n] = 0x80
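The rules above translate almost line for line into Python. The following is a minimal model for experimenting with masks, not a description of how the hardware implements the instruction; registers are represented as 16-byte bytes objects, and x is the mask byte msk.b[n].

```python
def shufb(r0, r1, msk):
    """Model of `shufb d, r0, r1, msk` on 16-byte values."""
    result = bytearray(16)
    for n, x in enumerate(msk):
        if x < 0x80:
            source = r1 if x & 0x10 else r0   # bit 4 picks the source register
            result[n] = source[x & 0x0F]      # low 4 bits pick the byte
        elif x < 0xC0:
            result[n] = 0x00
        elif x < 0xE0:
            result[n] = 0xFF
        else:
            result[n] = 0x80
    return bytes(result)

r0 = bytes(range(0x00, 0x10))
r1 = bytes(range(0xA0, 0xB0))
msk = bytes([0x03, 0x02, 0x01, 0x00,    # first word of r0, byte-reversed
             0x10, 0x11, 0x12, 0x13,    # first word of r1
             0x80, 0xC0, 0xE0, 0x1F,    # constants 0x00, 0xFF, 0x80, then r1.b[15]
             0x0F, 0x80, 0x80, 0x80])   # r0.b[15], then zeros
d = shufb(r0, r1, msk)
```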
Previously, we mentioned that the SPU has no broadcast ability, but with a
single shufb instruction we can broadcast one word into all words. We can
create the shuffle masks using instructions directly, or else we could simply load
them using an LS class instruction.
Using these masks, we can quickly create registers with all x’s, y’s, z’s or w’s:
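Here is a minimal Python sketch of such broadcast masks. The helper names broadcast_mask and shufb_select_only are made up for this example, and the shuffle model is deliberately reduced to the byte-selecting case (mask bytes below 0x20), which is all a broadcast needs.

```python
import struct

def broadcast_mask(word_index):
    """Mask that copies word `word_index` of the first source into all four words."""
    return bytes(4 * word_index + byte for _ in range(4) for byte in range(4))

def shufb_select_only(r0, r1, msk):
    """Reduced shufb model: handles only byte selection, not the constants."""
    both = r0 + r1
    return bytes(both[x & 0x1F] for x in msk)

# A vector (x, y, z, w) stored big-endian, as on the SPU.
v = struct.pack(">4f", 1.0, 2.0, 3.0, 4.0)
all_x = struct.unpack(">4f", shufb_select_only(v, v, broadcast_mask(0)))
all_y = struct.unpack(">4f", shufb_select_only(v, v, broadcast_mask(1)))
all_w = struct.unpack(">4f", shufb_select_only(v, v, broadcast_mask(3)))
```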
Because the shuffle instruction is so useful, our “frontend” tool supports quick
creation of shuffle masks. Using the .dshuf directive, we create shuffle masks
that obey the following rules.
We can create finite state-machines, piping input into one end of the
quad-word, while spitting out the result from another (e.g. the preferred
word). Here’s an example of such a “delay machine”:
Ditto for right shifts, though as with the WS class, we call them rotates with
mask and use negative shift amounts:
These instructions are designed to expand a small number of bits into many bits
(each input bit is replicated across a whole element), and they are good for use
with the selb operation.
Example:
fsmbi selABCd, 0x000f ; make select mask to get XYZ from first arg
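To see what the example mask looks like, here is a minimal Python sketch of the byte version with an immediate. The functions fsmbi and selb_bytes are simplified models; the sketch relies on the leftmost immediate bit controlling byte 0, which matches the mask name selABCd in the example above.

```python
def fsmbi(imm16):
    """Model of `fsmbi`: immediate bit n (leftmost first) becomes byte n, 0x00 or 0xFF."""
    return bytes(0xFF if imm16 & (0x8000 >> n) else 0x00 for n in range(16))

def selb_bytes(j, k, l):
    """Bitwise select on 16-byte values: bits from j where l is 0, from k where l is 1."""
    return bytes((a & ~m & 0xFF) | (b & m) for a, b, m in zip(j, k, l))

sel_ABCd = fsmbi(0x000F)          # last word all ones, the rest zero
first = b"XXXXYYYYZZZZWWWW"
second = b"aaaabbbbccccdddd"
merged = selb_bytes(first, second, sel_ABCd)
```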
These are the opposite of the form select mask instructions, and can be used to
quickly pack results from comparison operators into compact bytes or
half-words. They all gather the rightmost bit from each element of the source
register and pack it into a single bit in the target.
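The relationship to the select-mask instructions can be sketched in Python for the byte version. The function gbb below is a simplified model that returns the gathered 16 bits as a plain integer (on the hardware they end up in the preferred word), and fsmbi is the matching model of the byte select-mask form, included only to show the round trip.

```python
def gbb(source):
    """Model of gather-bits-from-bytes: rightmost bit of each byte, byte 0 first."""
    packed = 0
    for byte in source:
        packed = (packed << 1) | (byte & 1)
    return packed

def fsmbi(imm16):
    """Model of the byte select mask: immediate bit n becomes byte n."""
    return bytes(0xFF if imm16 & (0x8000 >> n) else 0x00 for n in range(16))

# A byte-compare result (TRUE = 0xFF, FALSE = 0x00) packed into 16 bits.
compare_result = bytes([0xFF, 0x00, 0xFF, 0x00] + [0x00] * 12)
packed = gbb(compare_result)
round_trip = gbb(fsmbi(0x1234))
```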
The hardware supports two fast (4-cycle) estimate instructions that calculate
the reciprocal recip(x) = 1/x, or the reciprocal square root rsqrt(x) = 1/√x.
These instructions work in conjunction with the “fi” instruction that we’ll later
explain in detail. After the interpolation instruction, results are accurate to a
precision of 12 bits, which is about half the floating-point precision of 23
bits. In order to improve the accuracy, one must perform another refinement
step, such as a Newton-Raphson iteration.
Do note that:
$$\mathrm{sqrt}(x) = \sqrt{x} = \frac{\sqrt{x}\,\sqrt{x}}{\sqrt{x}} = |x| \cdot \frac{1}{\sqrt{x}} = x \cdot \mathrm{rsqrt}(x),$$
since x ≥ 0, so there is no need for a separate square-root function.
frest a, x
fi b, x, a ; b = 1/x, good to 12 bits precision
fnms c, b, x, one ; c = 1 - b*x (the remaining error)
fma b, c, b, b ; b = b + b*c, good to 24 bits precision
;
frsqest a, x
fi b, x, a ; b = 1/sqrt(x), good to 12 bits precision
fm c, b, x ; c = b*x (b and a can share register)
fm d, b, onehalf ; d = b/2 (c and x can share register)
fnms c, c, b, one ; c = 1 - x*b*b (the remaining error)
fma b, d, c, b ; b = b + (b/2)*c, good to 24 bits precision
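The effect of the refinement steps can be tried out in Python. This is only a numerical sketch: the hypothetical helper truncate_to_bits fakes a roughly 12-bit estimate by chopping the mantissa of an exact result (it does not model frest, frsqest or fi), and the arithmetic runs in double precision rather than in SPU single precision.

```python
import math

def truncate_to_bits(value, bits=12):
    """Keep only `bits` mantissa bits of a positive value (stand-in for the estimate)."""
    mantissa, exponent = math.frexp(value)   # value = mantissa * 2**exponent
    scale = 1 << bits
    return math.ldexp(math.floor(mantissa * scale) / scale, exponent)

def refine_recip(x, b):
    c = 1.0 - b * x      # fnms c, b, x, one
    return c * b + b     # fma  b, c, b, b

def refine_rsqrt(x, b):
    c = b * x            # fm   c, b, x
    d = b * 0.5          # fm   d, b, onehalf
    c = 1.0 - c * b      # fnms c, c, b, one
    return d * c + b     # fma  b, d, c, b

x = 3.0
rough_recip = truncate_to_bits(1.0 / x)
better_recip = refine_recip(x, rough_recip)
rough_rsqrt = truncate_to_bits(1.0 / math.sqrt(x))
better_rsqrt = refine_rsqrt(x, rough_rsqrt)
sqrt_via_rsqrt = x * better_rsqrt   # sqrt(x) = x * rsqrt(x)
```

Each step roughly squares the relative error, which is why a single step takes an estimate of about 12 bits to nearly full single precision.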
Or Across
orx i, j ; i.w[0] = ( j.w[0] | j.w[1] | j.w[2] | j.w[3] );
; i.w[1] = i.w[2] = i.w[3] = 0
Also, please note that these instructions saturate to the min and max values of
their precision.
Branches on the SPU are costly. If a branch is taken, and it has not been
predicted, there is an 18-cycle penalty so that the chip can restart the pipe.
There is no penalty for falling through a non-predicted branch. However, if you
have predicted a branch, and it is not taken, then there is also an 18-cycle
penalty. Branches and branch hints are all odd instructions.
Note: This is one of the reasons why diverging control-paths are so difficult to
optimize for.
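The rules above make it easy to estimate what a branch costs on average. The following Python sketch computes the expected penalty per execution from the probability that the branch is taken; the function name is made up for this example, and it assumes that a hint, when present, is issued early enough to take full effect.

```python
BRANCH_PENALTY_CYCLES = 18

def expected_branch_penalty(probability_taken, hinted):
    """Average penalty in cycles for one execution of a branch."""
    if not 0.0 <= probability_taken <= 1.0:
        raise ValueError("probability_taken must be between 0 and 1")
    if hinted:
        # A hinted branch only costs cycles when it falls through.
        return (1.0 - probability_taken) * BRANCH_PENALTY_CYCLES
    # An unhinted branch only costs cycles when it is taken.
    return probability_taken * BRANCH_PENALTY_CYCLES

loop_branch_unhinted = expected_branch_penalty(0.9, hinted=False)
loop_branch_hinted = expected_branch_penalty(0.9, hinted=True)
coin_flip = expected_branch_penalty(0.5, hinted=True)
```

For a branch that goes either way with equal probability the hint makes no difference on average, which is one way to see why diverging control paths are hard to optimize.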
Branch Relative
br brTo ; goto label address
Branch Relative and Set Link
brsl i, brTo ; gosub label address, i.w[0] = return address, (*)
Branch Indirect
bi i ; goto i.w[0]
Branch Indirect and Set Link
bisl i, j ; gosub j.w[0], i.w[0] = return address, (*)
BRanch Absolute
bra brTo ; goto brTo
BRanch Absolute and Set Link
brasl i, brTo ; gosub label address, i.w[0] = return address (*)
(*): These instructions have a 4 cycle latency for the return register. Note:
The bi instructions have enable/disable interrupt versions, e.g.: bie, bid,
bisle, bisld.
Branch on Zero
brz i, brTo ; branch if i.w[0] == 0
Branch on Not Zero
brnz i, brTo ; branch if i.w[0] != 0
Branch on Zero Halfword
brhz i, brTo ; branch if i.h[1] == 0
Branch on Not Zero Halfword
brhnz i, brTo ; branch if i.h[1] != 0
Interrupt RETurn
iret i ; Return from interrupt
Interrupt RETurn and Disable
iretd i ; Return from interrupt, disable interrupts
Interrupt RETurn and Enable
irete i ; Return from interrupt, enable interrupts
Branch Indirect and Set Link if External Data
bisled i, j ; gosub j if channel 0 is non-zero
If you know the most likely (or only) outcome for a branch, you can make sure
the branch is penalty free as long as the hint occurs at least 15 cycles before
the branch is taken. If the hint occurs later, there still may be a benefit, since
the penalty is lowered. However, if the hint arrives less than 4 cycles before the
branch, there is no benefit.
Please note that there also turns out to be a hardware bug w.r.t. the hbr
instructions. One cannot hint a branch whose target lies forward of the branch
and within the same 64-byte block as the branch.
We will explain these in further talks, but for completeness we’ve included
them here. They are all odd instructions with a latency of 6. Note that the
latency may actually be much higher if channels are not ready.
DP instructions have a latency of 13 and are even. However, they will stall
pipelining for 6 cycles (that is, all currently executing instructions are
halted) while the instruction is executed. Therefore, we do not recommend using
double precision at all!