The FFT Via Matrix Factorizations: A Key To Designing High Performance Implementations

The FFT
Via Matrix Factorizations

A Key to Designing High Performance Implementations
Charles Van Loan

Department of Computer Science
Cornell University
A High Level Perspective...
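Blocking For Performance
\[
A = \begin{bmatrix}
A_{11} & A_{12} & \cdots & A_{1q} \\
A_{21} & A_{22} & \cdots & A_{2q} \\
\vdots & \vdots & \ddots & \vdots \\
A_{p1} & A_{p2} & \cdots & A_{pq}
\end{bmatrix}
\]
Block row sizes (top to bottom): n_1, n_2, ..., n_q. Block column sizes (left to right): n_1, n_2, ..., n_q.
A well-known strategy for high-performance Ax = b and Ax = λx solvers.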
Factoring for Performance
One way to execute a matrix-vector product
y = Fn x
when Fn = At · · · A2 A1 is as follows:
y = x
for k = 1:t
    y = Ak y
end
A different factorization Fn = Ãt̃ · · · Ã1 would yield a different algorithm.
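The loop above can be sketched in Python as follows. This is only an illustration with small dense factors; the names `matvec` and `apply_factors` and the 2-by-2 example matrices are assumptions made here, not part of the slides.

```python
def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def apply_factors(factors, x):
    """Compute (A_t ... A_2 A_1) x, where factors = [A_1, A_2, ..., A_t]."""
    y = list(x)
    for A in factors:          # A_1 is applied first, A_t last
        y = matvec(A, y)
    return y

# Hypothetical 2-by-2 factors, chosen only for illustration.
A1 = [[1, 1], [1, -1]]
A2 = [[2, 0], [0, 3]]
y = apply_factors([A1, A2], [5, 7])   # computes A2 (A1 x)
```

The product matrix is never formed; each pass costs only as much as one multiplication by a single factor, which is the point of factoring when the factors are sparse.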
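The Discrete Fourier Transform (n = 8)
\[
y = F_8 x =
\begin{bmatrix}
\omega_8^{0} & \omega_8^{0} & \omega_8^{0} & \omega_8^{0} & \omega_8^{0} & \omega_8^{0} & \omega_8^{0} & \omega_8^{0} \\
\omega_8^{0} & \omega_8^{1} & \omega_8^{2} & \omega_8^{3} & \omega_8^{4} & \omega_8^{5} & \omega_8^{6} & \omega_8^{7} \\
\omega_8^{0} & \omega_8^{2} & \omega_8^{4} & \omega_8^{6} & \omega_8^{8} & \omega_8^{10} & \omega_8^{12} & \omega_8^{14} \\
\omega_8^{0} & \omega_8^{3} & \omega_8^{6} & \omega_8^{9} & \omega_8^{12} & \omega_8^{15} & \omega_8^{18} & \omega_8^{21} \\
\omega_8^{0} & \omega_8^{4} & \omega_8^{8} & \omega_8^{12} & \omega_8^{16} & \omega_8^{20} & \omega_8^{24} & \omega_8^{28} \\
\omega_8^{0} & \omega_8^{5} & \omega_8^{10} & \omega_8^{15} & \omega_8^{20} & \omega_8^{25} & \omega_8^{30} & \omega_8^{35} \\
\omega_8^{0} & \omega_8^{6} & \omega_8^{12} & \omega_8^{18} & \omega_8^{24} & \omega_8^{30} & \omega_8^{36} & \omega_8^{42} \\
\omega_8^{0} & \omega_8^{7} & \omega_8^{14} & \omega_8^{21} & \omega_8^{28} & \omega_8^{35} & \omega_8^{42} & \omega_8^{49}
\end{bmatrix} x
\]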
ω8 = cos(2π/8) − i · sin(2π/8)
The DFT Matrix In General...
If ωn = cos(2π/n) − i · sin(2π/n) then
\[
[F_n]_{pq} = \omega_n^{pq}
= \bigl(\cos(2\pi/n) - i\sin(2\pi/n)\bigr)^{pq}
= \cos(2pq\pi/n) - i\sin(2pq\pi/n)
\]
Fact:
\[
F_n^H F_n = n I_n
\]
Thus, Fn/√n is unitary.
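As a quick numerical check of the definition and of the fact above, the following Python sketch builds Fn entry by entry and forms Fn^H Fn. The helper names `dft_matrix` and `gram` are assumptions introduced for this example, and the dense O(n^2) construction is for illustration only.

```python
import cmath

def dft_matrix(n):
    """n-by-n DFT matrix with entries w^(p*q), w = exp(-2*pi*i/n), 0-based p and q."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (p * q) for q in range(n)] for p in range(n)]

def gram(F):
    """Return F^H F for a square list-of-rows matrix F."""
    n = len(F)
    return [[sum(F[k][p].conjugate() * F[k][q] for k in range(n))
             for q in range(n)] for p in range(n)]

F8 = dft_matrix(8)
G = gram(F8)
# Largest deviation of F^H F from 8 * I.
max_err = max(abs(G[p][q] - (8 if p == q else 0))
              for p in range(8) for q in range(8))
```

Up to rounding error, `max_err` is zero, which is the statement Fn^H Fn = n In for n = 8.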
Data Sparse Matrices
An n-by-n matrix A is data sparse if it can be represented with many fewer than n^2 numbers.
Example 1.
A has lots of zeros. (“Traditional Sparse”)
Example 2.
A is Toeplitz...
\[
A = \begin{bmatrix}
a & b & c & d \\
e & a & b & c \\
f & e & a & b \\
g & f & e & a
\end{bmatrix}
\]
More Examples of Data Sparse Matrices
A is a Kronecker Product B ⊗ C, e.g.,
\[
A = \begin{bmatrix} b_{11}C & b_{12}C \\ b_{21}C & b_{22}C \end{bmatrix}
\]
If B ∈ IR^(m1×m1) and C ∈ IR^(m2×m2) then A = B ⊗ C has m1^2 m2^2 entries but is parameterized by just m1^2 + m2^2 numbers.
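A small Python sketch of the Kronecker product makes the entry count concrete. The function name `kron` and the example matrices are assumptions chosen for illustration.

```python
def kron(B, C):
    """Kronecker product of list-of-rows matrices: block (i, j) of the result is B[i][j] * C."""
    return [[b * c for b in brow for c in crow] for brow in B for crow in C]

B = [[1, 2], [3, 4]]
C = [[0, 1], [1, 0]]
A = kron(B, C)

stored = len(B) * len(B[0]) + len(C) * len(C[0])   # numbers needed to describe A
entries = len(A) * len(A[0])                        # numbers in A itself
```

Here `stored` is 8 while `entries` is 16; the gap grows quickly with the sizes of B and C.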
Extreme Data Sparsity
\[
A = \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} \sum_{\ell=1}^{n}
S(i,j,k,\ell) \cdot \underbrace{(2\text{-by-}2) \otimes \cdots \otimes (2\text{-by-}2)}_{d \text{ times}}
\]
A is 2^d-by-2^d but is parameterized by O(dn^4) numbers.

Factorization of Fn
The DFT matrix can be factored into a short product of sparse matrices, e.g.,
F1024 = A10 · · · A2 A1 P1024
where each A-matrix has 2 nonzeros per row and P1024 is a permutation.
From Factorization to Algorithm
If n = 2^10 and
Fn = A10 · · · A2 A1 Pn
then
y = Pn x
for k = 1:10
    y = Ak y      ← 2n flops
end
computes y = Fn x and requires O(n log n) flops.

Recursive Block Structure
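\[
F_8(:, [\,0\ 2\ 4\ 6\ 1\ 3\ 5\ 7\,]) =
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & \omega_8 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & \omega_8^2 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & \omega_8^3 \\
1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & -\omega_8 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & -\omega_8^2 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & -\omega_8^3
\end{bmatrix}
\begin{bmatrix} F_4 & 0 \\ 0 & F_4 \end{bmatrix}
\]
Fn/2 “shows up” when you permute the columns of Fn so that the even-indexed columns (0, 2, 4, ... in the 0-based indexing used here) come first.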
Recursion...
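We build an 8-point DFT from two 4-point DFTs...
\[
F_8 x =
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & \omega_8 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & \omega_8^2 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & \omega_8^3 \\
1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & -\omega_8 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & -\omega_8^2 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & -\omega_8^3
\end{bmatrix}
\begin{bmatrix} F_4\, x(0{:}2{:}7) \\ F_4\, x(1{:}2{:}7) \end{bmatrix}
\]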
Radix-2 FFT: Recursive Implementation
function y = fft(x, n)
if n = 1
    y = x
else
    m = n/2; ω = exp(−2πi/n)
    Ω = diag(1, ω, . . . , ω^(m−1))
    zT = fft(x(0:2:n − 1), m)
    zB = Ω · fft(x(1:2:n − 1), m)
    y = [ Im  Im ; Im  −Im ] [ zT ; zB ]
end
Overall: 5n log n flops.
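The recursive procedure translates almost line for line into Python. The sketch below assumes the input length is a power of two; the names `fft_recursive` and `dft_naive` are chosen here, and the O(n^2) reference transform is included only to check the result.

```python
import cmath

def fft_recursive(x):
    """Radix-2 recursive FFT; len(x) is assumed to be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    m = n // 2
    z_top = fft_recursive(x[0::2])                 # F_m x(0:2:n-1)
    z_odd = fft_recursive(x[1::2])                 # F_m x(1:2:n-1)
    # z_bot = Omega * z_odd with Omega = diag(1, w, ..., w^(m-1)), w = exp(-2*pi*i/n)
    z_bot = [cmath.exp(-2j * cmath.pi * k / n) * z_odd[k] for k in range(m)]
    # y = [I I; I -I] [z_top; z_bot]
    return ([z_top[k] + z_bot[k] for k in range(m)] +
            [z_top[k] - z_bot[k] for k in range(m)])

def dft_naive(x):
    """Direct O(n^2) evaluation of y = F_n x, used as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

x = [1, 2, 3, 4, 0, -1, 2, 5]
err = max(abs(a - b) for a, b in zip(fft_recursive(x), dft_naive(x)))
```

Recomputing the twiddle factors with `cmath.exp` at every level keeps the sketch short; a tuned implementation would precompute them.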
The Divide-and-Conquer Picture
[Tree diagram for n = 16: each node (a:s:15) splits into (a:2s:15) and (a+s:2s:15).]
Level 0: (0:1:15)
Level 1: (0:2:15) (1:2:15)
Level 2: (0:4:15) (2:4:15) (1:4:15) (3:4:15)
Level 3: (0:8:15) (4:8:15) (2:8:15) (6:8:15) (1:8:15) (5:8:15) (3:8:15) (7:8:15)
Leaves:  [0] [8] [4] [12] [2] [10] [6] [14] [1] [9] [5] [13] [3] [11] [7] [15]
Towards a Nonrecursive Implementation
The Radix-2 Factorization...
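If n = 2m and
Ωm = diag(1, ωn, . . . , ωn^(m−1)),
then
\[
F_n \Pi_n =
\begin{bmatrix} F_m & \Omega_m F_m \\ F_m & -\Omega_m F_m \end{bmatrix}
=
\begin{bmatrix} I_m & \Omega_m \\ I_m & -\Omega_m \end{bmatrix} (I_2 \otimes F_m),
\]
where Πn = In(:, [0:2:n−1 1:2:n−1]).
Note:
\[
I_2 \otimes F_m = \begin{bmatrix} F_m & 0 \\ 0 & F_m \end{bmatrix}.
\]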
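The radix-2 splitting can be checked numerically for a small n. The Python sketch below forms both sides for n = 8 with dense matrices; all helper names (`dft_matrix`, `matmul`, `kron`, `butterfly_matrix`) are assumptions introduced for this check.

```python
import cmath

def dft_matrix(n):
    return [[cmath.exp(-2j * cmath.pi * p * q / n) for q in range(n)] for p in range(n)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def kron(B, C):
    return [[b * c for b in brow for c in crow] for brow in B for crow in C]

def butterfly_matrix(n):
    """[[I_m, Omega_m], [I_m, -Omega_m]], Omega_m = diag(1, w_n, ..., w_n^(m-1)), m = n/2."""
    m = n // 2
    B = [[0j] * n for _ in range(n)]
    for k in range(m):
        w = cmath.exp(-2j * cmath.pi * k / n)
        B[k][k], B[k][m + k] = 1, w
        B[m + k][k], B[m + k][m + k] = 1, -w
    return B

n = 8
Fn = dft_matrix(n)
cols = list(range(0, n, 2)) + list(range(1, n, 2))
lhs = [[Fn[p][c] for c in cols] for p in range(n)]        # F_n Pi_n = F_n(:, cols)
rhs = matmul(butterfly_matrix(n), kron([[1, 0], [0, 1]], dft_matrix(n // 2)))
err = max(abs(lhs[p][q] - rhs[p][q]) for p in range(n) for q in range(n))
```

The two sides agree to rounding error, so `err` is tiny; the same check works for any even n.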
The Cooley-Tukey Factorization
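n = 2^t
Fn = At · · · A1 Pn
Pn = the n-by-n “bit reversal” permutation matrix
\[
A_q = I_r \otimes \begin{bmatrix} I_{L/2} & \Omega_{L/2} \\ I_{L/2} & -\Omega_{L/2} \end{bmatrix},
\qquad L = 2^q,\quad r = n/L
\]
\[
\Omega_{L/2} = \mathrm{diag}(1, \omega_L, \ldots, \omega_L^{L/2-1}),
\qquad \omega_L = \exp(-2\pi i/L)
\]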
The Bit Reversal Permutation
[Tree diagram for n = 16: each node (a:s:15) splits into (a:2s:15) and (a+s:2s:15). The leaves, read left to right, give the bit-reversed ordering.]
Level 0: (0:1:15)
Level 1: (0:2:15) (1:2:15)
Level 2: (0:4:15) (2:4:15) (1:4:15) (3:4:15)
Level 3: (0:8:15) (4:8:15) (2:8:15) (6:8:15) (1:8:15) (5:8:15) (3:8:15) (7:8:15)
Leaves:  [0] [8] [4] [12] [2] [10] [6] [14] [1] [9] [5] [13] [3] [11] [7] [15]
Bit Reversal
Entry      Binary index    Bits reversed    Result
x(0)   =   x(0000)    →    x(0000)    =    x(0)
x(1)   =   x(0001)    →    x(1000)    =    x(8)
x(2)   =   x(0010)    →    x(0100)    =    x(4)
x(3)   =   x(0011)    →    x(1100)    =    x(12)
x(4)   =   x(0100)    →    x(0010)    =    x(2)
x(5)   =   x(0101)    →    x(1010)    =    x(10)
x(6)   =   x(0110)    →    x(0110)    =    x(6)
x(7)   =   x(0111)    →    x(1110)    =    x(14)
x(8)   =   x(1000)    →    x(0001)    =    x(1)
x(9)   =   x(1001)    →    x(1001)    =    x(9)
x(10)  =   x(1010)    →    x(0101)    =    x(5)
x(11)  =   x(1011)    →    x(1101)    =    x(13)
x(12)  =   x(1100)    →    x(0011)    =    x(3)
x(13)  =   x(1101)    →    x(1011)    =    x(11)
x(14)  =   x(1110)    →    x(0111)    =    x(7)
x(15)  =   x(1111)    →    x(1111)    =    x(15)
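The table can be reproduced with a few lines of Python. This is a straightforward sketch rather than a tuned permutation routine; the function names are assumptions made here, and n is assumed to be a power of two.

```python
def bit_reverse(k, t):
    """Reverse the t-bit binary representation of k."""
    r = 0
    for _ in range(t):
        r = (r << 1) | (k & 1)   # move the lowest bit of k onto the end of r
        k >>= 1
    return r

def bit_reversal_permute(x):
    """Return the vector whose k-th entry is x[bit_reverse(k)]; len(x) = 2^t assumed."""
    n = len(x)
    t = n.bit_length() - 1
    return [x[bit_reverse(k, t)] for k in range(n)]

order = bit_reversal_permute(list(range(16)))
```

Because reversing the bits twice returns the original index, applying `bit_reversal_permute` twice restores the input.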
Butterfly Operations
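This matrix is block diagonal...
\[
A_q = I_r \otimes \begin{bmatrix} I_{L/2} & \Omega_{L/2} \\ I_{L/2} & -\Omega_{L/2} \end{bmatrix},
\qquad L = 2^q,\quad r = n/L
\]
r copies of things like this (nonzero pattern shown for L = 8):
\[
\begin{bmatrix}
1 &   &   &   & \times &   &   &   \\
  & 1 &   &   &   & \times &   &   \\
  &   & 1 &   &   &   & \times &   \\
  &   &   & 1 &   &   &   & \times \\
1 &   &   &   & \times &   &   &   \\
  & 1 &   &   &   & \times &   &   \\
  &   & 1 &   &   &   & \times &   \\
  &   &   & 1 &   &   &   & \times
\end{bmatrix}
\]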
At the Scalar Level...
[Butterfly diagram: inputs a and b, with b weighted by ω; outputs a + ωb (top) and a − ωb (bottom).]
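Putting the bit reversal and the scalar butterflies together gives a nonrecursive radix-2 FFT that follows Fn = At · · · A1 Pn factor by factor. The Python sketch below assumes n is a power of two; the function names and the naive reference transform are assumptions introduced for the example.

```python
import cmath

def bit_reverse(k, t):
    r = 0
    for _ in range(t):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r

def fft_cooley_tukey(x):
    """y = A_t ... A_1 P_n x for n = len(x) = 2^t (assumed)."""
    n = len(x)
    t = n.bit_length() - 1
    y = [complex(x[bit_reverse(k, t)]) for k in range(n)]    # y = P_n x
    for q in range(1, t + 1):                                 # y = A_q y
        L = 2 ** q
        half = L // 2
        for start in range(0, n, L):                          # r = n/L diagonal blocks
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / L)         # w_L^j
                a = y[start + j]
                b = w * y[start + j + half]
                y[start + j] = a + b                          # scalar butterfly
                y[start + j + half] = a - b
    return y

def dft_naive(x):
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

x = [3, 1, 4, 1, 5, 9, 2, 6]
err = max(abs(a - b) for a, b in zip(fft_cooley_tukey(x), dft_naive(x)))
```

Each pass over `q` touches every entry once with one complex multiply and two complex adds per butterfly, which matches the "2 nonzeros per row" structure of the A-matrices.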
Signal Flow Graph (n = 8)
[Figure: signal flow graph for the n = 8 radix-2 FFT. Inputs enter in bit-reversed order x0, x4, x2, x6, x1, x5, x3, x7; outputs leave in natural order y0, ..., y7. Three butterfly stages: the first uses the weight ω8^0, the second uses ω8^0 and ω8^2, and the third uses ω8^0, ω8^1, ω8^2, ω8^3.]
The Transposed Stockham Factorization
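If n = 2^t, then
Fn = St · · · S2 S1,
where for q = 1:t the factor Sq = Aq Γq−1 is defined by
\[
\begin{aligned}
A_q &= I_r \otimes B_L, & L &= 2^q,\quad r = n/L, \\
\Gamma_{q-1} &= \Pi_{r_*} \otimes I_{L_*}, & L_* &= L/2,\quad r_* = 2r, \\
B_L &= \begin{bmatrix} I_{L_*} & \Omega_{L_*} \\ I_{L_*} & -\Omega_{L_*} \end{bmatrix}, \\
\Omega_{L_*} &= \mathrm{diag}(1, \omega_L, \ldots, \omega_L^{L_*-1}).
\end{aligned}
\]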

Perfect Shuffle
\[
(\Pi_4 \otimes I_2)
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}
=
\begin{bmatrix} x_0 \\ x_1 \\ x_4 \\ x_5 \\ x_2 \\ x_3 \\ x_6 \\ x_7 \end{bmatrix}
\]
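The action of Π4 ⊗ I2 on a vector can be sketched without forming any matrix. In the Python snippet below, `apply_block_permutation` is a hypothetical helper that applies In(:, cols) ⊗ I_block, with Π4 taken to be I4(:, [0 2 1 3]) as in the definition of Πn given earlier.

```python
def apply_block_permutation(cols, block, x):
    """Apply (I(:, cols) kron I_block) to x: block j of x is moved to block cols[j]."""
    out = [None] * len(x)
    for j, c in enumerate(cols):
        out[c * block:(c + 1) * block] = x[j * block:(j + 1) * block]
    return out

x = ["x%d" % k for k in range(8)]
shuffled = apply_block_permutation([0, 2, 1, 3], 2, x)   # (Pi_4 kron I_2) x
```

Moving whole blocks of length L∗ at a time is what makes this kind of permutation friendlier to memory systems than an element-wise bit reversal.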
Cooley-Tukey Array Interpretation
Step q:
[Figure: columns 2k and 2k+1 of an L∗-by-r∗ array (L∗ = 2^(q−1), r∗ = n/L∗) are combined to produce column k of an L-by-r array (L = 2^q, r = n/L).]
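Reshaping
\[
x = \begin{bmatrix} \times \\ \times \\ \times \\ \times \\ \times \\ \times \\ \times \\ \times \end{bmatrix}
\;\rightarrow\;
x_{2 \times 4} = \begin{bmatrix} \times & \times & \times & \times \\ \times & \times & \times & \times \end{bmatrix}
\]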
Transposed Stockham Array Interp
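[Figure: step q of the transposed Stockham algorithm, x^(q) = Sq x^(q−1).]
Before the step:
\[
x^{(q-1)}_{L_* \times r_*} = F_{L_*}\, x^{T}_{r_* \times L_*}, \qquad L_* = 2^{q-1},\quad r_* = n/L_*
\]
with columns k and k + r marked as the inputs.
After the step:
\[
x^{(q)}_{L \times r} = F_{L}\, x^{T}_{r \times L}, \qquad L = 2^{q},\quad r = n/L
\]
with column k marked as the output.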
2 × 2 × 2 Basic Radix-2 Versions
Store intermediate DFTs by row or column.
Intermediate DFTs adjacent or not.
How the two butterfly loops are ordered.
\[
x = \left( I_r \otimes \begin{bmatrix} I_{L/2} & \Omega_{L/2} \\ I_{L/2} & -\Omega_{L/2} \end{bmatrix} \right) x,
\qquad L = 2^q,\quad r = n/L
\]
The Gentleman-Sande Idea
It can be shown that Fn^T = Fn and so if
Fn = At · · · A1 Pn^T
then
Fn = Fn^T = Pn A1^T · · · At^T
and we can compute y = Fn x as follows...
y = x
for k = t:−1:1
    y = Ak^T y
end
y = Pn y
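A Python sketch of this transposed ordering follows. It relies on Aq^T = Ir ⊗ [I I; Ω −Ω], so each butterfly forms u + v and ω(u − v); the function names are assumptions made here and n is assumed to be a power of two.

```python
import cmath

def bit_reverse(k, t):
    r = 0
    for _ in range(t):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r

def fft_gentleman_sande(x):
    """y = P_n A_1^T ... A_t^T x for n = len(x) = 2^t (assumed)."""
    n = len(x)
    t = n.bit_length() - 1
    y = [complex(v) for v in x]
    for q in range(t, 0, -1):                      # apply A_t^T first, A_1^T last
        L = 2 ** q
        half = L // 2
        for start in range(0, n, L):
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / L)
                u, v = y[start + j], y[start + j + half]
                y[start + j] = u + v
                y[start + j + half] = w * (u - v)
    return [y[bit_reverse(k, t)] for k in range(n)]   # y = P_n y

def dft_naive(x):
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

x = [2, 7, 1, 8, 2, 8, 1, 8]
err = max(abs(a - b) for a, b in zip(fft_gentleman_sande(x), dft_naive(x)))
```

Here the permutation comes last, so the butterflies read the input in natural order and the output is unscrambled at the end.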
Convolution and Other Applications
From “problem space” to “DFT space” via
for k = t:−1:1
    x = Ak^T x
end
x = Pn x
Do your thing in DFT space. Then inverse transform back to problem space via
x = Pn^T x
for k = 1:t
    x = Āk x
end
x = x/n
where Āk is the complex conjugate of Ak, since the inverse of Fn is F̄n/n.
Can avoid the Pn ops by working in “scrambled” DFT space.
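The "scrambled" idea can be illustrated with circular convolution, where the work in DFT space is a pointwise product and therefore indifferent to the ordering of the entries. The Python sketch below is one possible arrangement, not the slides' implementation: it assumes power-of-two lengths, uses conjugated twiddle factors on the way back (the inverse of Fn is its conjugate divided by n), and all function names are chosen here.

```python
import cmath

def to_scrambled_dft(x):
    """Apply A_1^T ... A_t^T (no final P_n): F_n x in bit-reversed order."""
    n = len(x)
    y = [complex(v) for v in x]
    L = n
    while L >= 2:                                   # q = t, ..., 1 with L = 2^q
        half = L // 2
        for start in range(0, n, L):
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / L)
                u, v = y[start + j], y[start + j + half]
                y[start + j] = u + v
                y[start + j + half] = w * (u - v)
        L //= 2
    return y

def from_scrambled_dft(y):
    """Apply conj(A_t) ... conj(A_1) and divide by n (no initial P_n^T)."""
    n = len(y)
    x = list(y)
    L = 2
    while L <= n:                                   # q = 1, ..., t with L = 2^q
        half = L // 2
        for start in range(0, n, L):
            for j in range(half):
                w = cmath.exp(2j * cmath.pi * j / L)   # conjugated twiddle factor
                a, b = x[start + j], w * x[start + j + half]
                x[start + j] = a + b
                x[start + j + half] = a - b
        L *= 2
    return [v / n for v in x]

def circular_convolve(a, b):
    fa, fb = to_scrambled_dft(a), to_scrambled_dft(b)
    return from_scrambled_dft([u * v for u, v in zip(fa, fb)])

c = circular_convolve([1, 2, 3, 4], [0, 1, 0, 0])   # cyclic shift of the first vector
```

No bit-reversal permutation is ever applied: both transforms are scrambled the same way, and the inverse butterflies undo that scrambling implicitly.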

Radix-4
Can combine four quarter-length DFTs to produce a single full-length DFT:
\[
v =
\begin{bmatrix}
I & I & I & I \\
I & -iI & -I & iI \\
I & -I & I & -I \\
I & iI & -I & -iI
\end{bmatrix}
\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix}
=
\begin{bmatrix}
(a + c) + (b + d) \\
(a - c) - i(b - d) \\
(a + c) - (b + d) \\
(a - c) + i(b - d)
\end{bmatrix}
\]
The radix-4 butterfly.
Better re-use of data.
Fewer flops. Radix-4 FFT is 4.25n log n (instead of 5n log n).
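The saving comes from sharing the four sums and differences and from the fact that multiplying by ±i needs no real multiplications. A scalar Python sketch follows; in the slide a, b, c, d are vectors, and the function name is an assumption made here.

```python
def radix4_butterfly(a, b, c, d):
    """Return the four outputs of the radix-4 butterfly for scalar inputs."""
    s0 = a + c            # shared by outputs 0 and 2
    s1 = a - c            # shared by outputs 1 and 3
    s2 = b + d
    s3 = b - d
    return (s0 + s2, s1 - 1j * s3, s0 - s2, s1 + 1j * s3)

v = radix4_butterfly(1, 2, 3, 4)
```

With scalar inputs and no twiddle scaling, the butterfly is exactly a 4-point DFT, which gives an easy way to check it.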
Mixed Radix
[Tree diagram: a length-96 DFT splits into four length-24 DFTs, and each length-24 DFT splits into three length-8 DFTs (96 = 4 · 24, 24 = 3 · 8).]
Multiple DFTs
Given: n1-by-n2 matrix X.
Multicolumn DFT Problem...
X ← Fn1 X
Multirow DFT Problem...
X ← XFn2
Blocked Multiple DFTs
X ← Fn1 X becomes
[ X1 | X2 | · · · | Xp ] ← [ Fn1 X1 | Fn1 X2 | · · · | Fn1 Xp ]
The 4-Step Framework
A matrix reshaping of the x ← Fn x operation when n = n1 n2:
1. x_{n1×n2} ← x_{n1×n2} Fn2                          (multiple row DFT)
2. x_{n1×n2} ← Fn(0:n1 − 1, 0:n2 − 1) .∗ x_{n1×n2}    (pointwise multiply)
3. x_{n2×n1} ← x_{n1×n2}^T                            (transpose)
4. x_{n2×n1} ← x_{n2×n1} Fn1                          (multiple row DFT)
Can be arranged so communication is concentrated in the transpose step.
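The four steps can be traced in a short Python sketch. It assumes that x_{n1×n2} denotes the vector x stored column by column in an n1-by-n2 array, and it uses a naive DFT for the row transforms; the function names are assumptions introduced for the example.

```python
import cmath

def dft(x):
    """Naive DFT, standing in for any length-len(x) FFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def four_step_dft(x, n1, n2):
    n = n1 * n2
    # View x as an n1-by-n2 array stored by columns: X[a][b] = x[a + n1*b].
    X = [[x[a + n1 * b] for b in range(n2)] for a in range(n1)]
    # Step 1: X <- X F_n2, i.e. a DFT of every row (F_n2 is symmetric).
    X = [dft(row) for row in X]
    # Step 2: pointwise multiply by F_n(0:n1-1, 0:n2-1), whose (a, c) entry is w_n^(a*c).
    X = [[cmath.exp(-2j * cmath.pi * a * c / n) * X[a][c] for c in range(n2)]
         for a in range(n1)]
    # Step 3: transpose to an n2-by-n1 array.
    Z = [[X[a][c] for a in range(n1)] for c in range(n2)]
    # Step 4: Z <- Z F_n1, again a DFT of every row.
    Z = [dft(row) for row in Z]
    # Read the n2-by-n1 result back by columns: y[c + n2*d] = Z[c][d].
    return [Z[c][d] for d in range(n1) for c in range(n2)]

x = list(range(12))
err = max(abs(a - b) for a, b in zip(four_step_dft(x, 3, 4), dft(x)))
```

Steps 1, 2 and 4 act on rows independently, so in a distributed setting only the transpose in step 3 needs to move data between processors.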
Distributed Transpose: Example
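Initial:
\[
X = \begin{bmatrix}
X_{00} & X_{01} & X_{02} & X_{03} \\
X_{10} & X_{11} & X_{12} & X_{13} \\
X_{20} & X_{21} & X_{22} & X_{23} \\
X_{30} & X_{31} & X_{32} & X_{33}
\end{bmatrix}.
\]
Transpose each block:
\[
X \leftarrow \begin{bmatrix}
X_{00}^T & X_{01}^T & X_{02}^T & X_{03}^T \\
X_{10}^T & X_{11}^T & X_{12}^T & X_{13}^T \\
X_{20}^T & X_{21}^T & X_{22}^T & X_{23}^T \\
X_{30}^T & X_{31}^T & X_{32}^T & X_{33}^T
\end{bmatrix}.
\]
Now regard as 2-by-2 and block transpose each block:
\[
X \leftarrow \begin{bmatrix}
X_{00}^T & X_{10}^T & X_{02}^T & X_{12}^T \\
X_{01}^T & X_{11}^T & X_{03}^T & X_{13}^T \\
X_{20}^T & X_{30}^T & X_{22}^T & X_{32}^T \\
X_{21}^T & X_{31}^T & X_{23}^T & X_{33}^T
\end{bmatrix}.
\]
Now do a 2-by-2 block transpose:
\[
X \leftarrow \begin{bmatrix}
X_{00}^T & X_{10}^T & X_{20}^T & X_{30}^T \\
X_{01}^T & X_{11}^T & X_{21}^T & X_{31}^T \\
X_{02}^T & X_{12}^T & X_{22}^T & X_{32}^T \\
X_{03}^T & X_{13}^T & X_{23}^T & X_{33}^T
\end{bmatrix}.
\]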
Factorization and Transpose
x_{n×m} ← x_{m×n}^T
corresponds to
x ← P(m, n) x
where P(m, n) is a perfect shuffle permutation, e.g.,
P(3, 4) = I12(:, [0 3 6 9 1 4 7 10 2 5 8 11])
Different multi-pass transposition algorithms correspond to different factorizations of P(m, n).
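The index vector in the P(3, 4) example can be generated directly. The Python sketch below assumes column-by-column storage and treats the permutation as a gather, y[k] = x[p[k]]; whether that gather is written as P or as its transpose depends on the matrix convention in use, and the function name is an assumption made here.

```python
def transpose_indices(m, n):
    """Indices p such that y[k] = x[p[k]] turns an m-by-n column-major array into its transpose."""
    # Entry (i, j) of the m-by-n array is x[i + m*j]; it becomes entry (j, i)
    # of the n-by-m array, stored at position j + n*i.
    return [i + m * j for i in range(m) for j in range(n)]

p = transpose_indices(3, 4)
x = list("abcdefghijkl")           # a 3-by-4 array stored by columns
y = [x[k] for k in p]              # its 4-by-3 transpose, stored by columns
```

Applying the gather for (m, n) and then the one for (n, m) returns the original vector, as a transposition followed by its reverse should.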
Two-Dimensional FFTs
If X is an n1-by-n2 matrix then its 2D DFT is
X ← Fn1 X Fn2
Option 1.
X ← Fn1 X
X ← X Fn2
Option 2. Assume n1 = n2 and Fn1 = At · · · A1.
for q = 1:t
    X ← Aq X Aq^T
end
Intermingling the column and row butterfly computations can result in better locality.
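Option 1 can be sketched in Python as a pass over the columns followed by a pass over the rows. A naive one-dimensional DFT stands in for an FFT here, and the function names are assumptions made for the example.

```python
import cmath

def dft(x):
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def dft2(X):
    """2D DFT of a list-of-rows matrix: X <- F_n1 X, then X <- X F_n2."""
    n1, n2 = len(X), len(X[0])
    # X <- F_n1 X: transform every column.
    cols = [dft([X[i][j] for i in range(n1)]) for j in range(n2)]
    Y = [[cols[j][i] for j in range(n2)] for i in range(n1)]
    # X <- X F_n2: transform every row (F_n2 is symmetric).
    return [dft(row) for row in Y]

delta = [[1, 0, 0, 0], [0, 0, 0, 0]]       # unit impulse at position (0, 0)
D = dft2(delta)
```

The transform of a unit impulse at the origin is the all-ones matrix, which gives a simple sanity check on the ordering of the two passes.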
3-Dimensional DFTs
Given X(1:n1, 1:n2, 1:n3), apply the DFT in each of the three dimensions.
If
x = reshape(X(1:n1, 1:n2, 1:n3), n1n2n3, 1)
then the problem is to compute
x ← (Fn3 ⊗ Fn2 ⊗ Fn1) x
i.e.,
x ← (In3 ⊗ In2 ⊗ Fn1) x
x ← (In3 ⊗ Fn2 ⊗ In1) x
x ← (Fn3 ⊗ In2 ⊗ In1) x
d-Dimensional DFTs
Sample for d = 5:
µ    Array                       Operation
1    X(α1, α2, α3, α4, α5)       Fn1
     X(α2, α3, α4, α5, α1)       Π^T_{n1,n}
2    X(α2, α3, α4, α5, α1)       Fn2
     X(α3, α4, α5, α1, α2)       Π^T_{n2,n}
3    X(α3, α4, α5, α1, α2)       Fn3
     X(α4, α5, α1, α2, α3)       Π^T_{n3,n}
4    X(α4, α5, α1, α2, α3)       Fn4
     X(α5, α1, α2, α3, α4)       Π^T_{n4,n}
5    X(α5, α1, α2, α3, α4)       Fn5
     X(α1, α2, α3, α4, α5)       Π^T_{n5,n}
Intermingling of component DFTs and tensor transpositions.

References
FFTW: http://www.fftw.org
C. Van Loan (1992). Computational Frameworks for the Fast Fourier Transform, SIAM Publications, Philadelphia, PA.
