
Part III

Parallel FFT Algorithms


Chapter 17

Parallelizing the FFTs: Preliminaries on Data Mapping
The discussion in Chapters 4 to 9 has focused on providing a unified algorithmic treatment of the NR, RN, and NN variants for implementing the sequential radix-2 FFTs. These variants offer the options of providing the input time series, and receiving the output frequencies, in either natural ordering or bit-reversed ordering. Regardless of which of the three choices is made, one can use either the DIF FFT algorithm or the DIT FFT algorithm.
The block diagram in Figure 17.1 depicts these various options.
Figure 17.1 Top level design chart for implementing the sequential FFT.


A key step in parallelizing the FFT on multiprocessor computers concerns the mapping of array addresses to processors. Figure 17.2 depicts such a process. Recall from the previous chapters that each of the NR, RN, and NN algorithms can be completely specified using the n-bit binary address of a representative element. In this chapter, this binary address notation will be used to facilitate the mapping of array locations to multiple processors, to aid in the description and classification of the many known parallel FFT algorithms, and to help in the development of new ones.
Figure 17.2 Top level design chart for implementing the parallel FFT.

17.1 Mapping Data to Processors

Multiprocessors fall into two general categories: shared-memory multiprocessors and local-memory (or distributed-memory) multiprocessors. As their names imply, they are distinguished by whether each processor can directly access the entire available memory, or whether the memory is partitioned into portions that are private to each processor.
For shared-memory architectures, the main challenge in parallelizing a sequential algorithm is to subdivide the computation among the processors in such a way that the load is balanced and memory conflicts are kept low. For FFT algorithms, this is a relatively simple task.
In terms of algorithm design, local-memory machines impose the additional burden of requiring that the data, as well as the computation, be partitioned. In addition to identifying parallelism in the computation and assigning computational tasks to individual processors, one must distribute the data associated with the computation among the processors and communicate it among them as necessary. The challenge is to do this in such a way that each processor has the data it needs in its local memory at the time that it needs it, and the amount of communication required among the processors during the computation is kept acceptably low.
A useful way to define different partitionings is to associate array locations with processors as follows. Since each location in an N = 2^n element array has an n-bit binary address, and each of the P = 2^d processors can be identified by a unique d-bit binary ID number, a class of partitionings can be specified by designating d consecutive bits from the n-bit address as the processor ID number, as shown in Figure 17.3 for an example with N = 32 and P = 4.
This class of mappings is referred to as the generalized Cyclic Block Mapping with blocksize = 2^i for i = 0, 1, ..., n - d. The n - d + 1 cyclic block mappings for n = 5 and d = 2 are illustrated in Figure 17.4, where the array locations mapped to processor P0 are shaded to highlight the cyclic nature of the mapping at the various block sizes.
Figure 17.3 Mapping array locations to processors.
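As a concrete illustration, the owner of location m under the cyclic block mapping with blocksize 2^b is simply the d-bit field floor(m / 2^b) mod 2^d. The C sketch below (a minimal illustration written for this discussion; the helper name cbm_pid is ours, not from the text) prints the owner of every array location for each of the n - d + 1 mappings in the N = 32, P = 4 example of Figures 17.3 and 17.4.

    #include <stdio.h>

    /* Processor ID of array location m under the cyclic block mapping
       with blocksize 2^b and P = 2^d processors: take the d consecutive
       address bits i_{b+d-1} ... i_b, i.e. pid = floor(m / 2^b) mod 2^d.
       (Illustrative helper; the name cbm_pid is not from the text.)   */
    static unsigned cbm_pid(unsigned m, unsigned b, unsigned d)
    {
        return (m >> b) & ((1u << d) - 1u);
    }

    int main(void)
    {
        unsigned n = 5, d = 2;                    /* N = 32, P = 4      */
        for (unsigned b = 0; b <= n - d; b++) {   /* n - d + 1 mappings */
            printf("blocksize %2u:", 1u << b);
            for (unsigned m = 0; m < (1u << n); m++)
                printf(" %u", cbm_pid(m, b, d));
            printf("\n");
        }
        return 0;
    }

Running this shows blocksize 1 cycling the owners 0, 1, 2, 3, 0, 1, ... across the array, and blocksize 8 producing four consecutive blocks, matching the shaded patterns of Figure 17.4.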

17.2 Properties of Cyclic Block Mappings

The important properties of the class of cyclic block mappings are listed below. It is assumed that N = 2^n, P = 2^d, and element x_m is stored in a[m], where 0 ≤ m ≤ N - 1 and the binary representation of m is i_{n-1} i_{n-2} ... i_0. The initial n-bit global array address used in the mapping is thus i_{n-1} i_{n-2} ... i_0. Although natural ordering is assumed in this chapter so that the concept of data mapping can be introduced in a straightforward manner, the notation is readily adapted to other initial orderings, and those cases will be dealt with as they arise in the following chapters.

Figure 17.4 The n - d + 1 cyclic block mappings for N = 2^n = 32 and P = 2^d = 4. (From E. Chu and A. George [28], Linear Algebra and its Applications, 284:95-124, 1998. With permission.)
• Property 1. Each cyclic block mapping is defined by designating i_k i_{k-1} ... i_{k-d+1} as the processor ID number, where k = n - 1, n - 2, ..., d - 1. There are thus n - d + 1 different mappings.
• Property 2. The block size is 2^{k-d+1} for each k defined in Property 1. Each mapping in this class can thus be uniquely identified by its block size.
• Property 3. When the left-most d bits are taken as the processor ID number, the block size is equal to N/P, and one has the standard block mapping, which is also known as the consecutive data mapping.
• Property 4. When the right-most d bits are taken as the processor ID number, the block size is equal to one, and one has the standard cyclic mapping.
• Property 5. Each processor is always assigned N/P locations in total, i.e., this class of mappings ensures even data distribution.

• Property 6. In parallelizing any one of the four unordered in-place FFTs, each processor can always compute the butterflies involving the N/P local data points independently, because these data correspond to array locations spanned by the n - d address bits marked below:

    i_{n-1} ... i_{k+1} | designated d bits for processor ID | i_{k-d} ... i_0.
• Property 7. To compute the butterflies involving the address bits used to define the processor ID number, data can always be exchanged between two processors whose ID numbers differ in exactly one bit, although these exchanges can involve either N/P or (1/2)(N/P) data points, and they may or may not be pipelined. These design issues will be addressed in the subsequent chapters; a short sketch of the local-versus-remote distinction follows this list.
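To see Properties 6 and 7 at work, note that a radix-2 butterfly pairs two addresses that differ in exactly one bit t. The sketch below (again a minimal illustration under the stated assumptions, not code from the text) classifies each of the n stages as local or as requiring an exchange with the partner whose ID differs in one bit, for the blocksize-2 mapping with N = 32 and P = 4.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed setting: N = 2^5, P = 2^2, blocksize 2^1, so the
           processor ID occupies address bits i_2 i_1.                */
        unsigned n = 5, d = 2, b = 1;
        for (unsigned t = 0; t < n; t++) {
            if (t >= b && t < b + d)
                /* Bit t is an ID bit: the two butterfly operands live
                   on processors whose IDs differ in exactly one bit,
                   so the stage needs an exchange (Property 7).       */
                printf("stage on bit %u: exchange with partner pid ^ %u\n",
                       t, 1u << (t - b));
            else
                /* Bit t is one of the n - d local address bits: both
                   operands are in the same processor's memory
                   (Property 6).                                      */
                printf("stage on bit %u: local butterflies\n", t);
        }
        return 0;
    }

Exactly d of the n stages require communication; the choice of blocksize determines which stages communicate, not how many.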
In view of properties 5, 6, and 7, it is not surprising that many mappings used in
the literature for parallelizing the in-place FFTs belong to the class of Cyclic Block
Mappings (CBMs). This class of mappings was also used in parallelizing the ordered
FFTs, although in a less straightforward manner.

17.3 Examples of CBM Mappings and Parallel FFTs

CBM mappings have received considerable study in the literature dealing with parallelizing FFTs [23, 36, 46, 56, 59, 90, 95, 104, 107]. These works vary in the choice of blocksize, in whether DIF or DIT transforms are used, in whether the input and/or output is in unordered (reverse-binary) or natural order, and so on. All of these treatments will be brought into a common framework in subsequent chapters.
To give an overview, some examples are cited in Table 17.1. Observe that each CBM mapping is identified by its unique blocksize. The perfect shuffle scheme [90] was already discussed in Section 10.2.2; the other parallel FFT algorithms cited in Table 17.1 are reviewed in the specified sections of Chapter 21, and the underlying techniques can be found in the specified sections of Chapters 19 and 20.

Table 17.1 Examples of cyclic block mappings (CBMs) and parallel FFTs using P = 2^d processors. Each entry gives the source (with the sections where it is reviewed), the CBM blocksize, the FFT variant, and the target machine.

Stone [90], 1971 (Sec. 10.2.2): blocksize 1 (cyclic); radix-2 DIT_{NR}; perfect shuffle network (with P = N).

Jamieson, Mueller & Siegel [56], 1986 (Sec. 20.1.2 & 21.2.2): blocksize 2^k for k = 0 (cyclic); radix-2 DIF_{NR}; SIMD system.

Walton [107], 1986 (Sec. 20.2 & 21.2.1): blocksize N/P = 2^{n-d} (consecutive); radix-2 DIT_{RN}; 32-node Ametek hypercube.

Swarztrauber [95], 1987 (Sec. 20.1.2 & 21.2.4): blocksize N/P = 2^{n-d} (consecutive); radix-2 DIF_{NR} plus intermediate reordering cycles; hypercube (not implemented).

Chamberlain [23], 1988 (Sec. 19.2.1 & 21.1.1): blocksize N/P = 2^{n-d} (consecutive); radix-2 DIF_{NR} and its inverse; 64-node Intel iPSC hypercube and linear array (via reflected-binary Gray codes).

Tong & Swarztrauber [104], 1991 (Sec. 20.1.2 & 21.2.4): blocksize 2^k for k = n - d (consecutive) and k = 0 (cyclic); radix-2 DIF_{NR} plus intermediate reordering cycles; CM-2 hypercube (16K 1-bit processors).

Johnsson & Krawitz [59], 1992 (Sec. 19.2.3 & 21.1.4): blocksize N/P = 2^{n-d} (consecutive); radix-2 DIF_{NR} and DIT_{NR}; 2048-processor CM-200 (with Boolean cube network).

Dubey, Zubair & Grosch [36], 1994 (Sec. 20.1.2 & 21.2.3): blocksize 2^k for k = 0, 1, ..., n - d; radix-2 DIF_{NR} plus an ad hoc rearrangement phase; 64-node Intel iPSC/860 hypercube.

Fabbretti et al. [46], 1996 (Sec. 19.2.3 & 21.1.3): blocksize N/P = 2^{n-d} (consecutive); radix-4 DIT_{NR} plus local split-radix; 128-node nCUBE2 hypercube.

Vous aimerez peut-être aussi