
The FFT

Via Matrix Factorizations


A Key to Designing High Performance Implementations

Charles Van Loan


Department of Computer Science
Cornell University
A High Level Perspective...
Blocking For Performance

 
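\[
A = \begin{bmatrix}
A_{11} & A_{12} & \cdots & A_{1q} \\
A_{21} & A_{22} & \cdots & A_{2q} \\
\vdots & \vdots & \ddots & \vdots \\
A_{p1} & A_{p2} & \cdots & A_{pq}
\end{bmatrix}
\]

Block $A_{ij}$ has $n_i$ rows and $n_j$ columns.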

A well-known strategy for high-performance Ax = b and Ax = λx solvers.
Factoring for Performance

One way to execute a matrix-vector product

\[ y = F_n x \]

when $F_n = A_t \cdots A_2 A_1$ is as follows:

    y = x
    for k = 1:t
        y = A_k y
    end

A different factorization $F_n = \tilde{A}_{\tilde{t}} \cdots \tilde{A}_1$ would yield a different algorithm.
The Discrete Fourier Transform (n = 8)

 
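\[
y = F_8 x =
\begin{bmatrix}
\omega_8^{0} & \omega_8^{0} & \omega_8^{0} & \omega_8^{0} & \omega_8^{0} & \omega_8^{0} & \omega_8^{0} & \omega_8^{0} \\
\omega_8^{0} & \omega_8^{1} & \omega_8^{2} & \omega_8^{3} & \omega_8^{4} & \omega_8^{5} & \omega_8^{6} & \omega_8^{7} \\
\omega_8^{0} & \omega_8^{2} & \omega_8^{4} & \omega_8^{6} & \omega_8^{8} & \omega_8^{10} & \omega_8^{12} & \omega_8^{14} \\
\omega_8^{0} & \omega_8^{3} & \omega_8^{6} & \omega_8^{9} & \omega_8^{12} & \omega_8^{15} & \omega_8^{18} & \omega_8^{21} \\
\omega_8^{0} & \omega_8^{4} & \omega_8^{8} & \omega_8^{12} & \omega_8^{16} & \omega_8^{20} & \omega_8^{24} & \omega_8^{28} \\
\omega_8^{0} & \omega_8^{5} & \omega_8^{10} & \omega_8^{15} & \omega_8^{20} & \omega_8^{25} & \omega_8^{30} & \omega_8^{35} \\
\omega_8^{0} & \omega_8^{6} & \omega_8^{12} & \omega_8^{18} & \omega_8^{24} & \omega_8^{30} & \omega_8^{36} & \omega_8^{42} \\
\omega_8^{0} & \omega_8^{7} & \omega_8^{14} & \omega_8^{21} & \omega_8^{28} & \omega_8^{35} & \omega_8^{42} & \omega_8^{49}
\end{bmatrix} x
\]

\[ \omega_8 = \cos(2\pi/8) - i \sin(2\pi/8) \]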
The DFT Matrix In General...

If $\omega_n = \cos(2\pi/n) - i \sin(2\pi/n)$ then

\[
[F_n]_{pq} = \omega_n^{pq}
= \left(\cos(2\pi/n) - i \sin(2\pi/n)\right)^{pq}
= \cos(2pq\pi/n) - i \sin(2pq\pi/n)
\]

Fact:

\[ F_n^H F_n = n I_n \]

Thus, $F_n/\sqrt{n}$ is unitary.
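The definition and the fact above can be checked numerically. The following is a minimal sketch in plain Python; the helper names `dft_matrix` and `gram` are illustrative choices, not part of the slides.

```python
import cmath

def dft_matrix(n):
    """Return F_n as a list of rows, with [F_n]_{pq} = omega_n^{pq}."""
    omega = cmath.exp(-2j * cmath.pi / n)  # cos(2*pi/n) - i*sin(2*pi/n)
    return [[omega ** (p * q) for q in range(n)] for p in range(n)]

def gram(F):
    """Return F^H F for a square matrix F stored as a list of rows."""
    n = len(F)
    return [[sum(F[k][p].conjugate() * F[k][q] for k in range(n))
             for q in range(n)] for p in range(n)]

F8 = dft_matrix(8)
G = gram(F8)
# Largest deviation of F_8^H F_8 from 8*I_8; expected to be at rounding level.
err = max(abs(G[p][q] - (8 if p == q else 0))
          for p in range(8) for q in range(8))
```

Forming $F_n$ explicitly costs $O(n^2)$ storage and work, which is exactly what the factorizations in the later slides avoid.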
Data Sparse Matrices

An n-by-n matrix A is data sparse if it can be represented with many fewer than $n^2$ numbers.

Example 1.
A has lots of zeros. (“Traditional Sparse”)

Example 2.
A is Toeplitz...
 
\[
A = \begin{bmatrix}
a & b & c & d \\
e & a & b & c \\
f & e & a & b \\
g & f & e & a
\end{bmatrix}
\]
More Examples of Data Sparse Matrices

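A is a Kronecker product $B \otimes C$, e.g.,

\[
A = \begin{bmatrix} b_{11} C & b_{12} C \\ b_{21} C & b_{22} C \end{bmatrix}
\]

If $B \in \mathbb{R}^{m_1 \times m_1}$ and $C \in \mathbb{R}^{m_2 \times m_2}$ then $A = B \otimes C$ has $m_1^2 m_2^2$ entries but is parameterized by just $m_1^2 + m_2^2$ numbers.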
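As a small illustration of the entry count versus the parameter count, here is a sketch of the Kronecker product on matrices stored as lists of rows. The function name `kron` and the sample matrices are assumptions made for this example.

```python
def kron(B, C):
    """Kronecker product B (x) C: block (i, j) of the result is B[i][j] * C."""
    return [[b * c for b in brow for c in crow]
            for brow in B for crow in C]

B = [[1, 2], [3, 4]]
C = [[0, 1], [1, 0]]
A = kron(B, C)
# A == [[0, 1, 0, 2], [1, 0, 2, 0], [0, 3, 0, 4], [3, 0, 4, 0]]

entries = len(A) * len(A[0])                               # m1^2 * m2^2 = 16
parameters = len(B) * len(B[0]) + len(C) * len(C[0])       # m1^2 + m2^2 = 8
```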
Extreme Data Sparsity

\[
A = \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} \sum_{\ell=1}^{n}
S(i, j, k, \ell) \cdot
\underbrace{(2\text{-by-}2) \otimes \cdots \otimes (2\text{-by-}2)}_{d \text{ times}}
\]

A is $2^d$-by-$2^d$ but is parameterized by $O(dn^4)$ numbers.


Factorization of Fn

The DFT matrix can be factored into a short product of sparse matrices, e.g.,

\[ F_{1024} = A_{10} \cdots A_2 A_1 P_{1024} \]

where each A-matrix has 2 nonzeros per row and $P_{1024}$ is a permutation.
From Factorization to Algorithm

If $n = 2^{10}$ and

\[ F_n = A_{10} \cdots A_2 A_1 P_n \]

then

    y = P_n x
    for k = 1:10
        y = A_k y      ← 2n flops
    end

computes $y = F_n x$ and requires $O(n \log n)$ flops.


Recursive Block Structure

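\[
F_8(:, [\,0\ 2\ 4\ 6\ 1\ 3\ 5\ 7\,]) =
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & \omega_8 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & \omega_8^2 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & \omega_8^3 \\
1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & -\omega_8 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & -\omega_8^2 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & -\omega_8^3
\end{bmatrix}
\begin{bmatrix} F_4 & 0 \\ 0 & F_4 \end{bmatrix}
\]

$F_{n/2}$ "shows up" when you permute the columns of $F_n$ so that the even-indexed columns (0, 2, 4, ...) come first.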
Recursion...

We build an 8-point DFT from two 4-point DFTs...


 
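\[
F_8 x =
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & \omega_8 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & \omega_8^2 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & \omega_8^3 \\
1 & 0 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & -\omega_8 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & -\omega_8^2 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & -\omega_8^3
\end{bmatrix}
\begin{bmatrix} F_4\, x(0:2:7) \\ F_4\, x(1:2:7) \end{bmatrix}
\]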
Radix-2 FFT: Recursive Implementation

function y = fft(x, n)
    if n = 1
        y = x
    else
        m = n/2;  ω = exp(−2πi/n)
        Ω = diag(1, ω, ..., ω^(m−1))
        zT = fft(x(0:2:n−1), m)
        zB = Ω · fft(x(1:2:n−1), m)
        y = [ I_m  I_m ; I_m  −I_m ] [ zT ; zB ]
    end

Overall: 5n log₂ n flops.
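The recursion above translates almost line for line into Python. This is a minimal sketch that assumes the input length is a power of two; the name `fft_recursive` is chosen here for illustration.

```python
import cmath

def fft_recursive(x):
    """Radix-2 recursive FFT; len(x) is assumed to be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    m = n // 2
    z_top = fft_recursive(x[0::2])   # DFT of the even-indexed entries
    z_bot = fft_recursive(x[1::2])   # DFT of the odd-indexed entries
    y = [0j] * n
    for k in range(m):
        # Omega scales the bottom half-length DFT by omega_n^k
        w = cmath.exp(-2j * cmath.pi * k / n) * z_bot[k]
        y[k] = z_top[k] + w          # [ I_m   I_m ] part
        y[k + m] = z_top[k] - w      # [ I_m  -I_m ] part
    return y

y = fft_recursive([0, 1, 0, 0])      # approximately [1, -1j, -1, 1j]
```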
The Divide-and-Conquer Picture

[Tree diagram; each node splits into two children, listed level by level:]

(0:1:15)
(0:2:15)  (1:2:15)
(0:4:15) (2:4:15)  (1:4:15) (3:4:15)
(0:8:15) (4:8:15) (2:8:15) (6:8:15)  (1:8:15) (5:8:15) (3:8:15) (7:8:15)
[0] [8] [4] [12] [2] [10] [6] [14] [1] [9] [5] [13] [3] [11] [7] [15]
Towards a Nonrecursive Implementation

The Radix-2 Factorization...

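If $n = 2m$ and

\[ \Omega_m = \mathrm{diag}(1, \omega_n, \ldots, \omega_n^{m-1}), \]

then

\[
F_n \Pi_n =
\begin{bmatrix} F_m & \Omega_m F_m \\ F_m & -\Omega_m F_m \end{bmatrix}
=
\begin{bmatrix} I_m & \Omega_m \\ I_m & -\Omega_m \end{bmatrix}
(I_2 \otimes F_m),
\]

where $\Pi_n = I_n(:, [\,0:2:n-1 \;\; 1:2:n-1\,])$.

Note:

\[ I_2 \otimes F_m = \begin{bmatrix} F_m & 0 \\ 0 & F_m \end{bmatrix}. \]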
The Cooley-Tukey Factorization

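\[ n = 2^t \]

\[ F_n = A_t \cdots A_1 P_n \]

$P_n$ = the n-by-n "bit reversal" permutation matrix

\[
A_q = I_r \otimes
\begin{bmatrix} I_{L/2} & \Omega_{L/2} \\ I_{L/2} & -\Omega_{L/2} \end{bmatrix},
\qquad L = 2^q, \quad r = n/L
\]

\[
\Omega_{L/2} = \mathrm{diag}(1, \omega_L, \ldots, \omega_L^{L/2-1}),
\qquad \omega_L = \exp(-2\pi i/L)
\]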
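One way to read the factorization as an algorithm is sketched below: apply the bit-reversal permutation, then apply the factors $A_1, \ldots, A_t$ in turn, each as $r = n/L$ independent butterfly blocks. The names `bit_reverse` and `fft_cooley_tukey` are illustrative, and the code assumes n is a power of two.

```python
import cmath

def bit_reverse(k, t):
    """Reverse the t-bit binary representation of k."""
    r = 0
    for _ in range(t):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r

def fft_cooley_tukey(x):
    """Nonrecursive radix-2 FFT: y = A_t ... A_1 P_n x, with n = 2**t."""
    n = len(x)
    t = n.bit_length() - 1
    y = [x[bit_reverse(k, t)] for k in range(n)]      # y = P_n x
    for q in range(1, t + 1):                         # y = A_q y
        L = 2 ** q
        half = L // 2
        for start in range(0, n, L):                  # r = n/L diagonal blocks
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / L)  # omega_L^j
                a = y[start + j]
                b = w * y[start + j + half]
                y[start + j] = a + b
                y[start + j + half] = a - b
    return y

y = fft_cooley_tukey([0, 1, 0, 0])   # approximately [1, -1j, -1, 1j]
```

Recomputing each weight with `cmath.exp` keeps the sketch short; a tuned implementation would typically precompute or reuse the weights.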
The Bit Reversal Permutation

[Tree diagram; each node splits into two children, listed level by level. The leaves appear in bit-reversed order:]

(0:1:15)
(0:2:15)  (1:2:15)
(0:4:15) (2:4:15)  (1:4:15) (3:4:15)
(0:8:15) (4:8:15) (2:8:15) (6:8:15)  (1:8:15) (5:8:15) (3:8:15) (7:8:15)
[0] [8] [4] [12] [2] [10] [6] [14] [1] [9] [5] [13] [3] [11] [7] [15]
Bit Reversal
       
Index     Binary        Reversed      Result
x(0)   =  x(0000)   →   x(0000)   =   x(0)
x(1)   =  x(0001)   →   x(1000)   =   x(8)
x(2)   =  x(0010)   →   x(0100)   =   x(4)
x(3)   =  x(0011)   →   x(1100)   =   x(12)
x(4)   =  x(0100)   →   x(0010)   =   x(2)
x(5)   =  x(0101)   →   x(1010)   =   x(10)
x(6)   =  x(0110)   →   x(0110)   =   x(6)
x(7)   =  x(0111)   →   x(1110)   =   x(14)
x(8)   =  x(1000)   →   x(0001)   =   x(1)
x(9)   =  x(1001)   →   x(1001)   =   x(9)
x(10)  =  x(1010)   →   x(0101)   =   x(5)
x(11)  =  x(1011)   →   x(1101)   =   x(13)
x(12)  =  x(1100)   →   x(0011)   =   x(3)
x(13)  =  x(1101)   →   x(1011)   =   x(11)
x(14)  =  x(1110)   →   x(0111)   =   x(7)
x(15)  =  x(1111)   →   x(1111)   =   x(15)
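The table can be reproduced by reversing each index's binary string, exactly as the middle columns suggest. This is a small sketch; `bit_reversal_permutation` is a name introduced here, and a string-based reversal is chosen for clarity rather than speed.

```python
def bit_reversal_permutation(t):
    """Return the list p with p[k] = the t-bit reversal of k, for n = 2**t."""
    n = 1 << t
    width = "0{}b".format(t)     # e.g. "04b" pads k to 4 binary digits
    return [int(format(k, width)[::-1], 2) for k in range(n)]

p = bit_reversal_permutation(4)
# p == [0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15]
```

Applying the permutation twice returns every index to its original position, which is why $P_n$ is symmetric.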
Butterfly Operations
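This matrix is block diagonal...

\[
A_q = I_r \otimes
\begin{bmatrix} I_{L/2} & \Omega_{L/2} \\ I_{L/2} & -\Omega_{L/2} \end{bmatrix},
\qquad L = 2^q, \quad r = n/L
\]

r copies of things like this:

\[
\begin{bmatrix}
1 & & & & \times & & & \\
& 1 & & & & \times & & \\
& & 1 & & & & \times & \\
& & & 1 & & & & \times \\
1 & & & & \times & & & \\
& 1 & & & & \times & & \\
& & 1 & & & & \times & \\
& & & 1 & & & & \times
\end{bmatrix}
\]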
At the Scalar Level...

[Butterfly diagram: inputs a and b, with b weighted by ω, produce the outputs a + ωb and a − ωb.]
Signal Flow Graph (n = 8)

[Figure: signal flow graph for n = 8. The inputs enter in bit-reversed order (x0, x4, x2, x6, x1, x5, x3, x7) and the outputs leave in natural order (y0, ..., y7). The butterflies carry the weights $\omega_8^0$, $\omega_8^1$, $\omega_8^2$ and $\omega_8^3$.]
The Transposed Stockham Factorization

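If $n = 2^t$, then

\[ F_n = S_t \cdots S_2 S_1, \]

where for q = 1:t the factor $S_q = A_q \Gamma_{q-1}$ is defined by

\[ A_q = I_r \otimes B_L, \qquad L = 2^q, \quad r = n/L, \]

\[ \Gamma_{q-1} = \Pi_{r_*} \otimes I_{L_*}, \qquad L_* = L/2, \quad r_* = 2r, \]

\[ B_L = \begin{bmatrix} I_{L_*} & \Omega_{L_*} \\ I_{L_*} & -\Omega_{L_*} \end{bmatrix}, \]

\[ \Omega_{L_*} = \mathrm{diag}(1, \omega_L, \ldots, \omega_L^{L_*-1}). \]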
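A sketch of how these factors might be applied is given below. It treats the working vector as an $L_*$-by-$r_*$ array stored by columns, pairs column k with column k + r (the effect of $\Gamma_{q-1}$), and applies $B_L$ to each pair. The name `fft_stockham` and the use of a second work array `z` are choices made for this example; n is assumed to be a power of two.

```python
import cmath

def fft_stockham(x):
    """Transposed Stockham FFT sketch: no bit-reversal pass is needed."""
    n = len(x)
    y = list(x)
    L = 1
    while L < n:
        Ls, L = L, 2 * L             # L* = L/2
        r = n // L                   # r* = 2r columns before this step
        z = [0j] * n
        for k in range(r):
            for j in range(Ls):
                w = cmath.exp(-2j * cmath.pi * j / L)   # omega_L^j
                a = y[k * Ls + j]                # column k of the old array
                b = w * y[(k + r) * Ls + j]      # column k + r of the old array
                z[k * L + j] = a + b             # top half of new column k
                z[k * L + j + Ls] = a - b        # bottom half of new column k
        y = z
    return y

y = fft_stockham([0, 1, 0, 0])   # approximately [1, -1j, -1, 1j]
```

The price of avoiding the bit-reversal permutation is the extra work array, since each pass reads from one buffer and writes to another.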


Perfect Shuffle

   
\[
(\Pi_4 \otimes I_2)
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}
=
\begin{bmatrix} x_0 \\ x_1 \\ x_4 \\ x_5 \\ x_2 \\ x_3 \\ x_6 \\ x_7 \end{bmatrix}
\]
Cooley-Tukey Array Interpretation

Step q:

[Figure: columns 2k and 2k+1 of the $L_*$-by-$r_*$ array ($L_* = 2^{q-1}$, $r_* = n/L_*$) are combined into column k of the L-by-r array ($L = 2^q$, $r = n/L$).]
Reshaping

 
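\[
x = \begin{bmatrix} \times \\ \times \\ \times \\ \times \\ \times \\ \times \\ \times \\ \times \end{bmatrix}
\;\rightarrow\;
x_{2 \times 4} =
\begin{bmatrix} \times & \times & \times & \times \\ \times & \times & \times & \times \end{bmatrix}
\]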
Transposed Stockham Array Interp

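[Figure: columns k and k+r of the $L_*$-by-$r_*$ array are combined into column k of the L-by-r array.]

\[
x^{(q-1)}_{L_* \times r_*} = F_{L_*} x^T_{r_* \times L_*},
\qquad L_* = 2^{q-1}, \quad r_* = n/L_*
\]

\[ x^{(q)} = S_q x^{(q-1)} \]

\[
x^{(q)}_{L \times r} = F_L x^T_{r \times L},
\qquad L = 2^q, \quad r = n/L
\]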
2 × 2 × 2 Basic Radix-2 Versions

Store intermediate DFTs by row or column

Intermediate DFTs adjacent or not.

How the two butterfly loops are ordered.

\[
x = \left( I_r \otimes
\begin{bmatrix} I_{L/2} & \Omega_{L/2} \\ I_{L/2} & -\Omega_{L/2} \end{bmatrix}
\right) x,
\qquad L = 2^q, \quad r = n/L
\]
The Gentleman-Sande Idea

It can be shown that $F_n^T = F_n$ and so if

\[ F_n = A_t \cdots A_1 P_n^T \]

then

\[ F_n = F_n^T = P_n A_1^T \cdots A_t^T \]

and we can compute $y = F_n x$ as follows:

    y = x
    for k = t:−1:1
        y = A_k^T y
    end
    y = P_n y
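The transposed factors lead to a butterfly of the form (a + b, ω(a − b)), with the bit-reversal permutation applied last instead of first. The following is a minimal sketch under the assumption that n is a power of two; `fft_gentleman_sande` and `bit_reverse` are names introduced for illustration.

```python
import cmath

def bit_reverse(k, t):
    """Reverse the t-bit binary representation of k."""
    r = 0
    for _ in range(t):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r

def fft_gentleman_sande(x):
    """FFT via y = P_n A_1^T ... A_t^T x, with n = 2**t."""
    n = len(x)
    t = n.bit_length() - 1
    y = list(x)
    for q in range(t, 0, -1):                    # y = A_q^T y, q = t, ..., 1
        L = 2 ** q
        half = L // 2
        for start in range(0, n, L):
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / L)   # omega_L^j
                a = y[start + j]
                b = y[start + j + half]
                y[start + j] = a + b
                y[start + j + half] = w * (a - b)
    return [y[bit_reverse(k, t)] for k in range(n)]      # y = P_n y

y = fft_gentleman_sande([0, 1, 0, 0])   # approximately [1, -1j, -1, 1j]
```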
Convolution and Other Applications

From "problem space" to "DFT space" via

    for k = t:−1:1
        x = A_k^T x
    end
    x = P_n x

Do your thing in DFT space. Then inverse transform back to problem space via

    x = P_n^T x
    for k = 1:t
        x = Ā_k x
    end
    x = x/n

Here Ā_k denotes the complex conjugate of A_k, because $F_n^{-1} = \bar{F}_n / n$.

Can avoid the P_n ops by working in "scrambled" DFT space.
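As an illustration of working in scrambled DFT space, the sketch below computes a circular convolution without any bit-reversal pass: both inputs are transformed with the transposed factors only, multiplied pointwise in scrambled order, and transformed back with the conjugated factors. The helper names `_stage` and `circular_convolution` are hypothetical, and n is assumed to be a power of two.

```python
import cmath

def _stage(y, q, sign, transposed):
    """Apply A_q (or A_q^T) in place; sign=-1 uses omega_L, sign=+1 its conjugate."""
    n = len(y)
    L = 2 ** q
    half = L // 2
    for start in range(0, n, L):
        for j in range(half):
            w = cmath.exp(sign * 2j * cmath.pi * j / L)
            a, b = y[start + j], y[start + j + half]
            if transposed:
                y[start + j], y[start + j + half] = a + b, w * (a - b)
            else:
                y[start + j], y[start + j + half] = a + w * b, a - w * b

def circular_convolution(x, h):
    """Circular convolution of x and h via scrambled-order DFTs."""
    n = len(x)
    t = n.bit_length() - 1
    u, v = list(x), list(h)
    for q in range(t, 0, -1):            # forward transforms, P_n omitted
        _stage(u, q, -1, True)
        _stage(v, q, -1, True)
    w = [a * b for a, b in zip(u, v)]    # pointwise product, scrambled order
    for q in range(1, t + 1):            # inverse transform, P_n^T omitted
        _stage(w, q, +1, False)
    return [c / n for c in w]

c = circular_convolution([1, 2, 3, 4], [0, 1, 0, 0])   # approximately [4, 1, 2, 3]
```

The pointwise product does not care about the ordering as long as both operands are scrambled the same way, which is why the two permutations cancel.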


Radix-4

Can combine four quarter-length DFTs to produce a single full-length DFT:

\[
v =
\begin{bmatrix}
I & I & I & I \\
I & -iI & -I & iI \\
I & -I & I & -I \\
I & iI & -I & -iI
\end{bmatrix}
\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix}
=
\begin{bmatrix}
(a + c) + (b + d) \\
(a - c) - i(b - d) \\
(a + c) - (b + d) \\
(a - c) + i(b - d)
\end{bmatrix}
\]

The radix-4 butterfly.

Better re-use of data.
Fewer flops. Radix-4 FFT is 4.25n log₂ n (instead of 5n log₂ n).
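The right-hand side of the butterfly shows where the savings come from: the sums and differences a ± c and b ± d are each formed once and reused. Below is a scalar sketch (the slide's a, b, c, d are vectors, to which the same formulas apply entry by entry); `radix4_butterfly` is a name chosen here.

```python
def radix4_butterfly(a, b, c, d):
    """Return the four outputs of the radix-4 butterfly for scalar inputs."""
    s, t = a + c, a - c          # shared by outputs 0/2 and 1/3 respectively
    u, v = b + d, 1j * (b - d)   # multiplying by i is just a swap and a sign flip
    return s + u, t - v, s - u, t + v

v = radix4_butterfly(1, 2, 3, 4)
# v == (10, -2+2j, -2, -2-2j), the 4-point DFT of [1, 2, 3, 4]
```

With scalar blocks the butterfly matrix is $F_4$ itself, so the output can be compared directly with a 4-point DFT.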
Mixed Radix

[Tree diagram: a length-96 DFT splits into four length-24 DFTs, each of which splits into three length-8 DFTs.]

96
24   24   24   24
8 8 8   8 8 8   8 8 8   8 8 8
Multiple DFTs

Given: n1-by-n2 matrix X.

Multicolumn DFT Problem...

X ← Fn1 X

Multirow DFT Problem...

X ← XFn2
Blocked Multiple DFTs

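$X \leftarrow F_{n_1} X$ becomes

\[
\begin{bmatrix} X_1 \mid X_2 \mid \cdots \mid X_p \end{bmatrix}
\leftarrow
\begin{bmatrix} F_{n_1} X_1 \mid F_{n_1} X_2 \mid \cdots \mid F_{n_1} X_p \end{bmatrix}
\]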
The 4-Step Framework

A matrix reshaping of the x ← Fnx operation when n = n1n2:

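\[
\begin{array}{ll}
x_{n_1 \times n_2} \leftarrow x_{n_1 \times n_2} F_{n_2} & \text{Multiple row DFT} \\
x_{n_1 \times n_2} \leftarrow F_n(0:n_1-1,\ 0:n_2-1) \mathbin{.\!*} x_{n_1 \times n_2} & \text{Pointwise multiply} \\
x_{n_2 \times n_1} \leftarrow x_{n_1 \times n_2}^T & \text{Transpose} \\
x_{n_2 \times n_1} \leftarrow x_{n_2 \times n_1} F_{n_1} & \text{Multiple row DFT}
\end{array}
\]

Can be arranged so communication is concentrated in the transpose step.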
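The four steps can be traced in a short sketch. It assumes that $x_{n_1 \times n_2}$ denotes x reshaped column by column, so entry (a, b) is x[a + b·n1], and it uses a naive `dft` helper for the row transforms; both `dft` and `four_step_dft` are names introduced for this example.

```python
import cmath

def dft(v):
    """Naive O(n^2) DFT, used here for the short row transforms."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def four_step_dft(x, n1, n2):
    """Compute F_n x for n = n1*n2 using the four-step framework."""
    n = n1 * n2
    # View x as an n1-by-n2 array stored by columns: X[a][b] = x[a + b*n1]
    X = [[x[a + b * n1] for b in range(n2)] for a in range(n1)]
    # Step 1: multiple row DFT, X <- X F_{n2}
    X = [dft(row) for row in X]
    # Step 2: pointwise multiply by F_n(0:n1-1, 0:n2-1)
    X = [[X[a][c] * cmath.exp(-2j * cmath.pi * a * c / n) for c in range(n2)]
         for a in range(n1)]
    # Step 3: transpose to an n2-by-n1 array
    Z = [[X[a][c] for a in range(n1)] for c in range(n2)]
    # Step 4: multiple row DFT, Z <- Z F_{n1}
    Z = [dft(row) for row in Z]
    # Read the n2-by-n1 result back by columns: y[c + d*n2] = Z[c][d]
    y = [0j] * n
    for c in range(n2):
        for d in range(n1):
            y[c + d * n2] = Z[c][d]
    return y

x = list(range(12))
y = four_step_dft(x, 3, 4)   # agrees with dft(x) up to rounding
```

In a distributed setting steps 1, 2 and 4 act on rows independently, so only the transpose in step 3 requires data to move between processors.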
Distributed Transpose: Example

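Initial:

\[
X = \begin{bmatrix}
X_{00} & X_{01} & X_{02} & X_{03} \\
X_{10} & X_{11} & X_{12} & X_{13} \\
X_{20} & X_{21} & X_{22} & X_{23} \\
X_{30} & X_{31} & X_{32} & X_{33}
\end{bmatrix}.
\]

Transpose each block:

\[
X \leftarrow \begin{bmatrix}
X_{00}^T & X_{01}^T & X_{02}^T & X_{03}^T \\
X_{10}^T & X_{11}^T & X_{12}^T & X_{13}^T \\
X_{20}^T & X_{21}^T & X_{22}^T & X_{23}^T \\
X_{30}^T & X_{31}^T & X_{32}^T & X_{33}^T
\end{bmatrix}.
\]

Now regard as 2-by-2 and block transpose each block:

\[
X \leftarrow \begin{bmatrix}
X_{00}^T & X_{10}^T & X_{02}^T & X_{12}^T \\
X_{01}^T & X_{11}^T & X_{03}^T & X_{13}^T \\
X_{20}^T & X_{30}^T & X_{22}^T & X_{32}^T \\
X_{21}^T & X_{31}^T & X_{23}^T & X_{33}^T
\end{bmatrix}.
\]

Now do a 2-by-2 block transpose:

\[
X \leftarrow \begin{bmatrix}
X_{00}^T & X_{10}^T & X_{20}^T & X_{30}^T \\
X_{01}^T & X_{11}^T & X_{21}^T & X_{31}^T \\
X_{02}^T & X_{12}^T & X_{22}^T & X_{32}^T \\
X_{03}^T & X_{13}^T & X_{23}^T & X_{33}^T
\end{bmatrix}.
\]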
Factorization and Transpose

\[ x_{n \times m} \leftarrow x_{m \times n}^T \]

corresponds to

\[ x \leftarrow P(m, n)\, x \]

where P(m, n) is a perfect shuffle permutation, e.g.,

\[ P(3, 4) = I_{12}(:, [\,0\ 3\ 6\ 9\ 1\ 4\ 7\ 10\ 2\ 5\ 8\ 11\,]) \]

Different multi-pass transposition algorithms correspond to different factorizations of P(m, n).
Two-Dimensional FFTs

If X is an $n_1$-by-$n_2$ matrix then its 2D DFT is

\[ X \leftarrow F_{n_1} X F_{n_2} \]

Option 1.

    X ← F_{n1} X
    X ← X F_{n2}

Option 2. Assume $n_1 = n_2$ and $F_{n_1} = A_t \cdots A_1$.

    for q = 1:t
        X ← A_q X A_q^T
    end

Intermingling the column and row butterfly computations can result in better locality.
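Option 1 can be sketched as a column pass followed by a row pass. The example stores X as a list of rows and uses a naive `dft` helper in place of a fast transform; the names `dft` and `dft2` are illustrative.

```python
import cmath

def dft(v):
    """Naive O(n^2) DFT of a sequence."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def dft2(X):
    """2D DFT of an n1-by-n2 matrix: X <- F_{n1} X, then X <- X F_{n2}."""
    n1, n2 = len(X), len(X[0])
    # Multicolumn DFT: transform each column of X
    cols = [dft([X[i][j] for i in range(n1)]) for j in range(n2)]
    Y = [[cols[j][i] for j in range(n2)] for i in range(n1)]
    # Multirow DFT: transform each row of the result
    return [dft(row) for row in Y]

Y = dft2([[1, 2], [3, 4]])   # approximately [[10, -2], [-4, 0]]
```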
3-Dimensional DFTs

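Given X(1:n1, 1:n2, 1:n3), apply the DFT in each of the three dimensions.

If

    x = reshape(X(1:n1, 1:n2, 1:n3), n1n2n3, 1)

then the problem is to compute

\[ x \leftarrow (F_{n_3} \otimes F_{n_2} \otimes F_{n_1})\, x \]

i.e.,

\[
\begin{aligned}
x &\leftarrow (I_{n_3} \otimes I_{n_2} \otimes F_{n_1})\, x \\
x &\leftarrow (I_{n_3} \otimes F_{n_2} \otimes I_{n_1})\, x \\
x &\leftarrow (F_{n_3} \otimes I_{n_2} \otimes I_{n_1})\, x
\end{aligned}
\]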
d-Dimensional DFTs

Sample for d = 5:
µ     Array                      Operation
1     X(α1, α2, α3, α4, α5)      F_{n1}
      X(α2, α3, α4, α5, α1)      Π^T_{n1,n}
2     X(α2, α3, α4, α5, α1)      F_{n2}
      X(α3, α4, α5, α1, α2)      Π^T_{n2,n}
3     X(α3, α4, α5, α1, α2)      F_{n3}
      X(α4, α5, α1, α2, α3)      Π^T_{n3,n}
4     X(α4, α5, α1, α2, α3)      F_{n4}
      X(α5, α1, α2, α3, α4)      Π^T_{n4,n}
5     X(α5, α1, α2, α3, α4)      F_{n5}
      X(α1, α2, α3, α4, α5)      Π^T_{n5,n}

Intermingling of component DFTs and tensor transpositions.


References

FFTW: http://www.fftw.org

C. Van Loan (1992). Computational Frameworks for the Fast Fourier Transform, SIAM Publications, Philadelphia, PA.
