Parallel FFT Algorithms: © 2000 by CRC Press LLC

Part III
Parallel FFT Algorithms
2000 by CRC Press LLC
Chapter 17
Parallelizing the FFTs:

Preliminaries on Data
Mapping
The discussion in Chapters 4 to 9 has focused on providing a unied algorithmic treatment of the NR, RN and NN variants for implementing the sequential radix-2 FFTs.
These variations give the options of providing the input time series, and receiving the
output frequencies, in either the natural ordering or the bit-reversed ordering. Regardless of which of the three choices is made, one can choose to use either the DIF
FFT algorithm or the DIT FFT algorithm.
The block diagram in Figure 17.1 depicts these various options.
Figure 17.1 Top level design chart for implementing the sequential FFT.
A key step in parallelizing the FFT on multiprocessor computers concerns the mapping of array addresses to processors. Figure 17.2 depicts such a process. Recall from
the previous chapters that each of the NR, RN, and NN algorithms can be completely
specied using the n-bit binary address of a representative element. In this chapter,
this binary address notation will be used to facilitate the mapping of array locations
to multiple processors, to aid in the description and classication of the many known
parallel FFT algorithms, and to help in the development of new ones.
Figure 17.2 Top level design chart for implementing the parallel FFT.
17.1
Mapping Data to Processors
Multiprocessors fall into two general categories: shared-memory multiprocessors and

local-memory (or distributed-memory) multiprocessors. As their names imply, they
are distinguished by whether each processor can directly access the entire memory
available, or whether the memory is partitioned into portions which are private to each
processor.
For shared-memory architectures, the main challenge in parallelizing a sequential
algorithm is to subdivide the computation among the processor in such a way that the
load is balanced, and memory conicts are kept low. For FFT algorithms, this is a
relatively simple task.
In terms of the design of algorithms, local-memory machines impose the additional
burden of requiring that the data, as well as the computation, be partitioned. In addition to identifying parallelism in the computation, and assigning computational tasks
to individual processors, the data associated with the computation must be distributed
among the processors, and communicated among them as necessary. The challenge is
to do this in such a way that each processor has the data it needs in its local memory
at the time that it needs it, and the amount of communication required among the
processors during the computation is kept acceptably low.
A useful way to dene dierent partitionings is to associate each processor with a
data item as follows. Since each location in a N = 2n element array has a n-bit binary
address, and P = 2d processors can each be identied by a unique d-bit binary ID
number, a class of partitionings can be specied by designating d consecutive bits from
the n-bit address as the processor ID number as shown in Figure 17.3 for an example
with N = 32 and P = 4.
This class of mappings is referred to as the generalized Cyclic Block Mapping
with blocksize = 2i for i = 0, 1, , n d. The n d + 1 cyclic block mappings for
n = 5 and d = 2 are illustrated in Figure 17.4, where the array locations mapped to
processor P0 are shaded to highlight the cyclic nature with various block sizes.
Figure 17.3 Mapping array locations to processors.
17.2
Properties of Cyclic Block Mappings
The important properties of the class of cyclic block mappings are listed below. It is
assumed that N = 2n , P = 2d , and element xm is stored in a[m], where 0 m N 1
Figure 17.4 The n d + 1 cyclic block mappings for N = 2n = 32 and P = 2d = 4.
From E. Chu and A. George [28], Linear Algebra and its Applications, 284:95124, 1998. With
permission.
and the binary representation of m = in1 in2 i0 . The initial n-bit global array
address used in the mapping is thus in1 in2 i0 . Although a natural ordering is
assumed in this chapter so that the concept of data mapping can be introduced in a
straightforward manner, the notations are readily adapted to other initial orderings,
and those cases will be dealt with when they arise in the following chapters.
Property 1. Each cyclic block mapping is dened by designating ik ik1 ikd+1
as processor ID number, where k = n 1, n 2, , d 1. There are thus
n d + 1 dierent mappings.
Property 2. The block size is 2kd+1 for each k dened in property 1. Each
mapping in this class can thus be uniquely identied by its block size.
Property 3. When the left-most d bits are taken as the processor ID number, the
block size is equal to N
P , and one has the standard block mapping, which is also
known as consecutive data mapping.
Property 4. When the right-most d bits are taken as the processor ID number,
the block size is equal to one, and one has the standard cyclic mapping.
Property 5. Each processor is always assigned
of mappings ensures even data distribution.
N
P
locations in total, i.e., this class
Property 6. In parallelizing any one of the four unordered in-place FFTs, each
processor can always compute the butteries involving the N
P local data points
independently, because these data correspond to array locations corresponding
to the n d address bits specied by the braces below:
$
$ %& '
%&
'
in1 ik+1| designated d-bit for Processor ID |ikd i0 .
Property 7. To compute the butteries involving the address bits used to dene
the processor ID number, data can always be exchanged between two processors
with ID numbers dierent in exactly one bit, although these exchanges can

1 N
involve either N
data points and they may or may not be pipelined.
P or 2 P
These design issues will be addressed in the subsequent chapters.
In view of properties 5, 6, and 7, it is not surprising that many mappings used in
the literature for parallelizing the in-place FFTs belong to the class of Cyclic Block
Mappings (CBMs). This class of mappings was also used in parallelizing the ordered
FFTs, although in a less straightforward manner.
17.3
Examples of CBM Mappings and Parallel FFTs
CBM mappings have received considerable study in the literature dealing with parallelizing FFTs [23, 36, 46, 56, 59, 90, 95, 104, 107]. These works vary in the choice of the
blocksize, whether DIF or DIT transforms are used, and whether the input and/or output is in unordered (reverse-binary) or in natural order, and so on. All these treatments
will be brought into a common framework in subsequent chapters.
To give an overview, some examples are cited in Table 17.1. Observe that each CBM
mapping is identied by its unique blocksize. The perfect shue scheme [90] was already discussed in Section 10.2.2, the other parallel FFT algorithms cited in Table 17.1
are reviewed in the specied sections in Chapter 21, and the underlying techniques can
be found in the specied sections in Chapters 19 and 20.
Table 17.1 Examples of cyclic block mappings (CBMs) and parallel FFTs.
Examples of Parallel FFTs using
Stone [90], 1971

(Sec. 10.2.2)
9'1
Jamiulon, Mueller
2' for k = 0
k Siegel [56], 1986
(cyclic)
(Sec. 20.1.2 k 21.2.2)
Walton [107],1986
g = 2"-d
(Sec. 20.2 k 21.2.1) (consecutive)
N -2"-d
Swarztrauber
P
[95], 1987
(consecutive)
(Sec. 20.1.2 k 21.2.4)
E
P = 2"-d
Chamberlain
[23], 1988
(consecutive)
(Sec. 19.2.1 k 21.1.1)
Tong k Swarztrauber 2' for k = n - d
[104], 1991
(consecutive) and
(Sec.20.1.2 k 21.2.4) k = 0 (cyclic)
Johnsson kKrawitz
$ = 2"-d
[59], 1992
(consecutive)
[Sec. 19.2.3 k 21.1.4)
P = 2' Proceesore
radix-2 DITNR
radix-2 DIFNR
Perfect Shuffle
(with P = N)
SIMD System with
P = $ for
k slog,
1s
radix-2 DITRN
32-node Ametek
Hypercube
radix-2 DIFNR
Hypercube
plus intermediate
(not implemented)
reordering dcycla
radix-2 DIFNR
64-node Intel iPSC
k Its Inverse
Hypercube k Linear Array ( v i
Reflected-Binary Gray-Codes)
radix-2 DIFNR
CM-2 Hypercube
plus intermediate
reordering Ccycles
(16K 1-bit proceseors)
radix-2 D I F w
2048-processor CM-200
and DITNR
(with Boolean Cube Network)
2' for k = 0, 1,
radix-2 DIFNR
Dubey, Zubair k
Grosch [36], 1994
- - -n,- d.
plus an ad-hoc
[Sec. 20.1.2 k 21.2.3)
rearrangement phase
64-node Intel iPSC/860

Hypercube
g = 2"-d - 4'
128-node nCUBE2
Hypercube
Fabbretti et. al.

[461,1996
'Sec. 19.2.3 k 21.1.31
radix-4 DITNR
(consecutive) plus local split-radix

Parallel FFT Algorithms: © 2000 by CRC Press LLC

Transféré par

Informations du document

Description originale:

Titre original

Copyright

Formats disponibles

Partager ce document

Partager ou intégrer le document

Options de partage

Avez-vous trouvé ce document utile ?

Ce contenu est-il inapproprié ?

Droits d'auteur :

Formats disponibles

Parallel FFT Algorithms: © 2000 by CRC Press LLC

Transféré par

Droits d'auteur :

Formats disponibles

Part III

Parallel FFT Algorithms

2000 by CRC Press LLC

Parallelizing the FFTs:

2000 by CRC Press LLC

Mapping Data to Processors

Multiprocessors fall into two general categories: shared-memory multiprocessors and

2000 by CRC Press LLC

Properties of Cyclic Block Mappings

2000 by CRC Press LLC

Figure 17.4 The n d + 1 cyclic block mappings for N = 2n = 32 and P = 2d = 4.

2000 by CRC Press LLC

locations in total, i.e., this class

Examples of CBM Mappings and Parallel FFTs

2000 by CRC Press LLC

Stone [90], 1971

(16K 1-bit proceseors)

(with Boolean Cube Network)

64-node Intel iPSC/860

Fabbretti et. al.

2000 by CRC Press LLC

Vous aimerez peut-être aussi