Chapter 17
A key step in parallelizing the FFT on multiprocessor computers concerns the mapping of array addresses to processors. Figure 17.2 depicts such a process. Recall from
the previous chapters that each of the NR, RN, and NN algorithms can be completely
specified using the n-bit binary address of a representative element. In this chapter,
this binary address notation will be used to facilitate the mapping of array locations
to multiple processors, to aid in the description and classification of the many known
parallel FFT algorithms, and to help in the development of new ones.
Figure 17.2 Top level design chart for implementing the parallel FFT.
17.1
The array must be partitioned among the processors in such a way that each processor has the data it needs in its local memory at the time that it needs it, while the amount of communication required among the processors during the computation is kept acceptably low.
A useful way to define different partitionings is to associate each data item with a processor as follows. Since each location in an N = 2^n element array has an n-bit binary address, and each of P = 2^d processors can be identified by a unique d-bit binary ID number, a class of partitionings can be specified by designating d consecutive bits from the n-bit address as the processor ID number, as shown in Figure 17.3 for an example with N = 32 and P = 4.
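As a concrete sketch of this designation (the helper name `proc_id` is my own, not from the text), the processor ID can be read off the n-bit address by shifting and masking:

```python
def proc_id(m, n, d, k):
    """Processor ID for array location m under the cyclic block mapping
    that designates bits i_k ... i_{k-d+1} of the n-bit address as the
    d-bit processor ID (d - 1 <= k <= n - 1)."""
    return (m >> (k - d + 1)) & ((1 << d) - 1)

# Example with N = 32 (n = 5) and P = 4 (d = 2), as in Figure 17.3.
m = 0b11001                      # array location 25
print(proc_id(m, 5, 2, 4))       # leftmost 2 bits i4 i3 = 11 -> processor 3
print(proc_id(m, 5, 2, 1))       # rightmost 2 bits i1 i0 = 01 -> processor 1
```

Choosing k = n-1 or k = d-1 recovers the two familiar extremes discussed below (block and cyclic mapping).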
This class of mappings is referred to as the generalized cyclic block mapping with block size 2^i for i = 0, 1, ..., n-d. The n-d+1 cyclic block mappings for n = 5 and d = 2 are illustrated in Figure 17.4, where the array locations mapped to processor P0 are shaded to highlight the cyclic nature with various block sizes.
Figure 17.3 Mapping array locations to processors.
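The n-d+1 mappings for n = 5 and d = 2 can also be enumerated directly. This sketch (mine, not from the text) prints the locations assigned to processor P0 for each block size, mirroring the shading in Figure 17.4:

```python
n, d = 5, 2
N, P = 2 ** n, 2 ** d            # 32 array locations, 4 processors

# One cyclic block mapping per block size 2**i, i = 0, ..., n - d.
for i in range(n - d + 1):
    block = 2 ** i
    # Location m goes to processor (m // block) mod P.
    p0 = [m for m in range(N) if (m // block) % P == 0]
    print(f"block size {block:2d}: P0 holds {p0}")
```

With block size 1 (cyclic) P0 holds every fourth location; with block size 8 (consecutive) it holds the first eight.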
17.2
The important properties of the class of cyclic block mappings are listed below. It is assumed that N = 2^n, P = 2^d, and element x_m is stored in a[m], where 0 ≤ m ≤ N-1 and the binary representation of m is i_{n-1} i_{n-2} ... i_0. The initial n-bit global array address used in the mapping is thus i_{n-1} i_{n-2} ... i_0. Although a natural ordering is assumed in this chapter so that the concept of data mapping can be introduced in a straightforward manner, the notation is readily adapted to other initial orderings, and those cases will be dealt with when they arise in the following chapters.
From E. Chu and A. George [28], Linear Algebra and its Applications, 284:95-124, 1998. With permission.
Property 1. Each cyclic block mapping is defined by designating i_k i_{k-1} ... i_{k-d+1} as the processor ID number, where k = n-1, n-2, ..., d-1. There are thus n-d+1 different mappings.
Property 2. The block size is 2^{k-d+1} for each k defined in Property 1. Each mapping in this class can thus be uniquely identified by its block size.
Property 3. When the leftmost d bits are taken as the processor ID number, the block size is equal to N/P, and one has the standard block mapping, which is also known as consecutive data mapping.
Property 4. When the rightmost d bits are taken as the processor ID number, the block size is equal to one, and one has the standard cyclic mapping.
Property 5. Each processor is always assigned N/P data points, i.e., this class of mappings ensures even data distribution.
Property 6. In parallelizing any one of the four unordered in-place FFTs, each processor can always compute the butterflies involving the N/P local data points independently, because these data correspond to array locations spanned by the n-d address bits outside the designated field shown below:

i_{n-1} ... i_{k+1} | designated d bits for processor ID | i_{k-d} ... i_0.
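This locality can be checked by brute force. In the sketch below (my own, with n = 5 and d = 2 assumed), a butterfly at bit position j pairs locations m and m XOR 2^j; the pair stays on one processor exactly when j falls outside the designated ID field:

```python
n, d = 5, 2
N, P = 2 ** n, 2 ** d

for k in range(d - 1, n):            # one mapping per k = d-1, ..., n-1
    shift = k - d + 1                # ID field occupies bits shift .. k

    def pid(m):
        return (m >> shift) & (P - 1)

    for j in range(n):
        # A butterfly on bit j pairs locations m and m XOR 2**j.
        local = all(pid(m) == pid(m ^ (1 << j)) for m in range(N))
        # The pair is local iff bit j lies outside the ID field.
        assert local == (not shift <= j <= k)

print("locality confirmed for all", n - d + 1, "mappings")
```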
Property 7. To compute the butterflies involving the address bits used to define the processor ID number, data can always be exchanged between two processors whose ID numbers differ in exactly one bit, although these exchanges can involve either N/P or N/(2P) data points, and they may or may not be pipelined.
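The exchange partner in Property 7 is simply an XOR on one ID bit. A minimal sketch (the function name is my own):

```python
def partner(pid, b):
    """ID of the processor to exchange with when the butterfly falls on
    ID bit b (0 <= b < d): flip exactly that one bit of the d-bit ID."""
    return pid ^ (1 << b)

d = 2                                # P = 4 processors
for pid in range(1 << d):
    print(pid, "exchanges with", [partner(pid, b) for b in range(d)])
# Each pairing is symmetric: partner(partner(p, b), b) == p.
```

On a hypercube network, processors whose IDs differ in exactly one bit are directly connected, which is one reason these mappings pair naturally with the hypercube machines cited in Table 17.1.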
These design issues will be addressed in the subsequent chapters.
In view of properties 5, 6, and 7, it is not surprising that many mappings used in
the literature for parallelizing the in-place FFTs belong to the class of Cyclic Block
Mappings (CBMs). This class of mappings was also used in parallelizing the ordered
FFTs, although in a less straightforward manner.
17.3
CBM mappings have received considerable study in the literature dealing with parallelizing FFTs [23, 36, 46, 56, 59, 90, 95, 104, 107]. These works vary in the choice of block size, in whether DIF or DIT transforms are used, in whether the input and/or output is in unordered (reverse-binary) or in natural order, and so on. All these treatments will be brought into a common framework in subsequent chapters.
To give an overview, some examples are cited in Table 17.1. Observe that each CBM mapping is identified by its unique block size. The perfect shuffle scheme [90] was already discussed in Section 10.2.2; the other parallel FFT algorithms cited in Table 17.1 are reviewed in the specified sections of Chapter 21, and the underlying techniques can be found in the specified sections of Chapters 19 and 20.
Table 17.1 Examples of cyclic block mappings (CBMs) and parallel FFTs using P = 2^d processors.

Perfect shuffle scheme [90] (Sec. 10.2.2): cyclic mapping with P = N; radix-2 DITNR.
Jamieson, Mueller & Siegel [56], 1986 (Sec. 20.1.2 & 21.2.2): block size 2^k for k = 0 (cyclic); radix-2 DIFNR on a SIMD system.
Walton [107], 1986 (Sec. 20.2 & 21.2.1): block size 2^{n-d} (consecutive); radix-2 DITRN on a 32-node Ametek hypercube.
Swarztrauber [95], 1987 (Sec. 20.1.2 & 21.2.4): block size N/P = 2^{n-d} (consecutive); radix-2 DIFNR plus intermediate reordering cycles on a hypercube (not implemented).
Chamberlain [23], 1988 (Sec. 19.2.1 & 21.1.1): block size N/P = 2^{n-d} (consecutive); radix-2 DIFNR on a 64-node Intel iPSC hypercube.
Tong & Swarztrauber [104], 1991 (Sec. 20.1.2 & 21.2.4): block size 2^k for k = n-d (consecutive) and k = 0 (cyclic); radix-2 DIFNR and its inverse on a hypercube and a linear array (via reflected-binary Gray codes).
Johnsson & Krawitz [59], 1992 (Sec. 19.2.3 & 21.1.4): block size 2^{n-d} (consecutive); radix-2 DIFNR and DITNR plus intermediate reordering cycles on a CM-2 hypercube and a 2048-processor CM-200.
Dubey, Zubair & Grosch [36], 1994 (Sec. 20.1.2 & 21.2.3): block size 2^k for k = 0, 1, ..., n-d; radix-2 DIFNR plus an ad-hoc rearrangement phase.
Block size 2^{n-d} (consecutive) plus local split-radix: radix-4 DITNR on a 128-node nCUBE2 hypercube.