Abstract-This paper addresses rate-optimal compile-time multiprocessor scheduling of iterative data-flow programs suitable for real-time signal processing applications. Recursions or loops in these programs lead to an inherent lower bound on the achievable iteration period, referred to as the iteration bound. A multiprocessor schedule is rate-optimal if the iteration period equals the iteration bound. In general, it may not be possible to schedule a specified iterative data-flow program rate-optimally. The retiming transformation may improve the iteration period of a schedule, but cannot guarantee the schedule to be rate-optimal.
Systematic unfolding of iterative data-flow programs is proposed, and properties of unfolded data-flow programs are studied. Unfolding increases the number of tasks in a program, unravels the hidden concurrency in iterative data-flow programs, and can reduce the iteration period. We introduce a special class of iterative data-flow programs, referred to as perfect-rate programs. Each loop in these programs has a single register (a register is also referred to as a delay in the signal processing literature). Perfect-rate programs have the important property that they can always be scheduled rate-optimally (requiring no retiming or unfolding transformation). We show that unfolding by an optimum unfolding factor transforms any arbitrary program to an equivalent perfect-rate program, which can then be scheduled rate-optimally. This optimum unfolding factor for any arbitrary program is the least common multiple of the numbers of registers (or delays) in all loops, and is independent of the node execution times. An upper bound on the number of processors for rate-optimal scheduling is also given.
I. INTRODUCTION
The DATA-FLOW model of program description clearly exhibits concurrency in multiprocessor implementations, and has been widely used for many years [1]-[10]. This paper is concerned with periodic scheduling of a class of iterative, static, coarse-grain data-flow programs suitable for real-time signal processing applications. In particular, we are concerned with compile-time construction of rate-optimal schedules of such programs.
Much research has been done in preemptive and nonpreemptive scheduling of general-purpose real-time systems for single and multiple processor systems [11]-[16]. Scheduling of real-time systems using static and dynamic scheduling approaches has been considered. A good survey of current scheduling approaches is given in [11].
Manuscript received June 20, 1988; revised June 23, 1990. This work was
supported in part by an IBM Graduate Fellowship and by the National Science
Foundation under Contract MIP-8908586.
K. K. Parhi is with the Department of Electrical Engineering, University of
Minnesota, Minneapolis, MN 55455.
D. G. Messerschmitt is with the Department of Electrical Engineering and
Computer Sciences, University of California, Berkeley, CA 94720.
IEEE Log Number 9040681.
Although the notion of iteration bound has existed for quite some time, it was so far not known whether the iteration bound could always be achieved for any arbitrary data-flow program (even when infinite processors are available and a large amount of time is allowed to construct multiprocessor schedules). This paper proves that it is always possible to achieve the iteration bound (that is, construct a rate-optimal schedule) for arbitrary static data-flow programs (using the program unfolding transformation). Using unfolding, we prove that it is always possible to construct fully-static rate-optimal schedules for any arbitrary signal processing program (past research had concluded that rate-optimal schedules cannot always be constructed in a fully-static manner, but can always be constructed in a cyclo-static manner [17]-[22]).
Traditional static multiprocessor scheduling techniques for
static data-flow programs are nonoverlapped, and use critical path
methods (CPM) [26], [27]. Nonoverlapped schedules minimize
execution time of the program over a single iteration of the
algorithm, and rarely lead to rate-optimal schedules (because
these schedules do not consider overlap of successive iterations).
It is possible to improve the iteration period of a nonoverlapped
schedule using the retiming technique [28], [29] (which was first
used in the context of minimizing clock period of synchronous
digital systems). The retiming technique redistributes the loop
registers, and creates new precedence relations and new schedules. Retiming can improve the iteration period of a program,
but cannot guarantee a schedule to be rate-optimal. Improving
schedules using the retiming transformation is addressed in
Section V.
Rate-optimal, periodic multiprocessor schedules of iterative data-flow programs can be constructed by exploiting overlap of successive iterations. Overlap of successive iterations exploits precedence constraints among different iterations. We introduce a formal and systematic approach to exploiting these constraints using the program unfolding transformation [9], [10]. Section VI of this paper presents the unfolding transformation, and studies properties of unfolded data-flow programs. If a data-flow program is unfolded by a factor J, then the unfolded data-flow program (referred to as the J-unfolded data-flow program) describes J successive iterations of the original data-flow program.
In Section VII, we introduce a class of programs referred to as perfect-rate data-flow programs [9], [10]. A data-flow program is said to be perfect-rate if all the loops in the program have one and only one register (storage element). We prove that perfect-rate programs can always be scheduled rate-optimally (thus the name perfect-rate). Next we show that it is possible to transform any arbitrary data-flow program to an equivalent perfect-rate program using optimum unfolding (which can then be scheduled rate-optimally). The optimum unfolding factor is shown to be equal to the least common multiple of the numbers of loop registers in the data-flow program. Rate-optimal scheduling of arbitrary data-flow programs via optimum unfolding is addressed in Section VIII. An upper bound on the number of processors is also given in Section VIII.
II. ITERATIVE DATA-FLOW PROGRAM MODEL
This paper is concerned with nonterminating, iterative data-flow programs, which are useful in signal processing applications. These programs are described by data-flow program graphs (DFGs), where nodes represent tasks to be executed, and directed arcs represent communication among nodes. Nonterminating programs process infinite time series and produce infinite time series.
Fig. 1. A simple nonterminating data-flow program. The task A represents an addition operation.
Program 1:
for {n = 1 to ∞} {
    z(n) = x(n) + y(n) }.
Program 1 operates on the infinite time series {x(n)} and {y(n)}, and computes the infinite time series {z(n)}. Each execution of the loop in the nonterminating program is referred to as an iteration. The time required to perform each iteration is referred to as the iteration period of the program. For example, the iteration period in Program 1 corresponds to the time required for a single addition operation. The program graph corresponding to Program 1 is described by a single node with two input arcs (which represent inputs) and one output arc (which represents the output),
and is shown in Fig. 1. In data-flow terminology, each node is
assumed to consume a single token from each of its incoming
arcs, and produce a single token on each of its output arcs. (Note
that in signal processing literature, token means a sample,
and iteration period is often referred to as sample period, to
indicate that all the time series are sampled periodically with a
period equal to the iteration or sample period). Program 1 has
the characteristic that no token needs to be stored anywhere.
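Program 1's behavior can be sketched directly. The rendering below is purely illustrative (the paper uses no particular programming language), with finite input lists standing in for the infinite time series:

```python
def program1(x, y):
    """One addition per iteration: z(n) = x(n) + y(n).

    No token is ever stored: each iteration consumes a single token
    from each input arc and produces a single token on the output arc."""
    for xn, yn in zip(x, y):      # conceptually, n = 1 to infinity
        yield xn + yn

print(list(program1([1, 2, 3], [10, 20, 30])))   # [11, 22, 33]
```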
Many signal processing programs require samples or tokens to
be stored in a register or a latch (to be used in future iterations).
As an example of such a program, consider Program 2.
Program 2:
Fig. 3. (a) A simple program with two loops. The loop L1 contains one register and the loop L2 contains two registers. The execution time for each node is assumed to be one unit. (b) Acyclic precedence graph obtained by deleting the arcs with registers.
Multiprocessor periodic schedules of static data-flow programs can be nonoverlapped or overlapped. The nonoverlapped
schedules primarily use critical path methods. The overlapped
schedules can be fully-static or cyclo-static. This section summarizes characteristics of these schedules.
A. Precedence Constraints
(Registers are referred to as delays in the signal processing literature.) In this paper, the notation iD associated with an arc represents i registers. The token produced by node C is consumed by node D in the same iteration and does not need to be stored. The registers are initialized with initial conditions. In Program 2, bc(-1) and bc(0) are initially stored in the two registers. In the remainder of the paper, we will omit the system input and output arcs (since these do not affect scheduling of the tasks).
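Program 2's listing falls in a damaged page region, so the sketch below shows only the generic pattern described above: a token produced on an arc carrying two registers is consumed two iterations later, with the register contents (playing the role of bc(-1) and bc(0)) supplied as initial conditions. The function and names are our own illustrative assumptions, not the paper's:

```python
from collections import deque

def delayed_by_two(tokens, init):
    """Pass tokens through an arc carrying two registers (2D).

    `init` holds the two initial conditions; the token produced in
    iteration n is consumed in iteration n + 2."""
    regs = deque(init, maxlen=2)       # the two registers
    for t in tokens:
        yield regs[0]                  # consume the oldest stored token
        regs.append(t)                 # store the newly produced token

print(list(delayed_by_two([1, 2, 3, 4], init=[-10, -20])))
# [-10, -20, 1, 2]
```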
Programs 1 and 2 have the common feature that they do not
have any feedback or loop. Many signal processing programs
are recursive in nature, and contain feedback. It is this class of
programs that we are mostly concerned with in this paper, since
the loops in program graphs impose an inherent lower bound
on the achievable iteration period. We assume the recursive
programs to be computable, i.e., all loops in the DFG contain one
or more registers. Consider Program 3, which contains recursion
or feedback (see Fig. 3).
B. Transitivity
Transitivity is associated with arcs. An arc U → V in a DFG is said to be transitive if there exists a path U → V, and the number of registers of the path U → V and of the arc U → V are the same. (A path A → B → C → D is also referred to as the path A → D, and the sum of the numbers of registers in arcs A → B, B → C, and C → D is referred to as the number of registers in the path A → D.) For example, in the DFG in Fig. 4(a), there are three transitive arcs. The arc C → E and the path C → E (which is the same as C → D → E) both contain no registers, and the arc C → E is a transitive arc. The arc C → D implies that node D can be invoked after node C is executed. The arc D → E implies that node E can be invoked after the execution of node D is complete. These two precedence constraints automatically satisfy the precedence constraint imposed by the arc C → E. Thus, deletion of the transitive arc C → E will not alter the required precedence constraints.
The arc A → E with two registers is transitive, because the path A → E also has two registers. The path A → E with two registers implies that the execution of the (n - 2)nd iteration of A is complete before the nth iteration of E is invoked, which is also satisfied by the arc A → E. Thus, deletion of the transitive arc A → E does not change the inter-iteration precedence constraints. Similarly, the arc E → G is a transitive arc (because this arc and the path E → G contain one register each), and can be deleted. In this paper, we will assume that the first step in the construction of a multiprocessor schedule is to delete all the transitive arcs from the DFG.
We can extend the notion of transitivity further. If an arc U → V contains i registers, and a path U → V contains fewer than i registers, then the arc U → V is an extended transitive arc. This arc can also be deleted without affecting any precedence constraints.
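The transitive-arc deletion step can be sketched as follows. This is a minimal sketch under our own graph representation (an arc list with register counts); it tests for a path through an intermediate node with no more registers than the arc itself, and ignores the subtlety that such a path might itself traverse the arc being tested:

```python
def delete_transitive_arcs(arcs):
    """arcs: list of (u, v, d) with d = register count of arc u -> v.
    Drops (extended) transitive arcs: an arc is deleted when a path
    u -> v through an intermediate node carries no more registers."""
    nodes = sorted({x for u, v, _ in arcs for x in (u, v)})
    INF = float("inf")
    dist = {(a, b): INF for a in nodes for b in nodes}
    for u, v, d in arcs:                  # parallel arcs: keep the minimum
        dist[u, v] = min(dist[u, v], d)
    for w in nodes:                       # Floyd-Warshall on register counts
        for a in nodes:
            for b in nodes:
                dist[a, b] = min(dist[a, b], dist[a, w] + dist[w, b])
    kept = []
    for u, v, d in arcs:
        via = min((dist[u, w] + dist[w, v] for w in nodes
                   if w not in (u, v)), default=INF)
        if via <= d:                      # cheaper-or-equal path exists:
            continue                      # the arc is transitive; drop it
        kept.append((u, v, d))
    return kept

# The C -> E case of Fig. 4(a): arc and path C -> D -> E both carry
# zero registers, so the arc C -> E is deleted.
print(delete_transitive_arcs([("C", "D", 0), ("D", "E", 0), ("C", "E", 0)]))
```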
D. Overlapped Schedules
A schedule is said to be overlapped, if any task of iteration
(n + 1) is scheduled before all tasks of iteration n have been
executed. The schedules in Fig. 5(b) and (c) correspond to
overlapped schedules. In these schedules, D2 is scheduled before
execution of B1 is complete, and any two consecutive iterations always overlap. Overlapped schedules exploit inter-iteration
precedence constraints (in addition to intra-iteration precedence),
E. Fully-Static Schedules
A periodic multiprocessor schedule is said to be fully-static if all the iterations of any given task are scheduled in the same processor [9], [10]. The nonoverlapped multiprocessor schedule in Fig. 5(a) is fully-static. Note that all the iterations of D (i.e., D1, D2, D3, etc.) are scheduled in processor P1 with a time displacement of 3 units (the time displacement is the same as the iteration period). The
overlapped periodic schedule in Fig. 5(b) is also fully-static (with
time displacement of 2 units). The schedule of a single iteration is
replicated with a time displacement equal to the iteration period.
F. Cyclo-Static Schedules

The cyclo-static periodic multiprocessor schedules were introduced by Schwartz and Barnwell [17]-[22]. These schedules are characterized by a time displacement as well as a processor displacement. In cyclo-static schedules, if iteration n of some task A is scheduled in processor Pk at time t, then iteration (n + 1) of task A is scheduled in processor P((k + K) modulo N) at time (t + T), where T is the time displacement (or iteration period), N is the total number of processors, and K is the processor displacement. The overlapped periodic schedule in Fig. 5(c) is a cyclo-static schedule with iteration period 2 and a processor displacement of 1. The task D1 is scheduled in P1 at time 0, D2 is scheduled in P2 (note that (1 + 1) modulo 2 is considered as 2) at time 2, D3 is scheduled in P1 (since (2 + 1) modulo 2 is 1) at time 4, etc. A cyclo-static schedule reduces to a fully-static schedule if the processor displacement is 0. This paper is not concerned with construction of general cyclo-static schedules; we only consider construction of fully-static schedules.

Fig. 6. Multiprocessor schedule of the nonrecursive data-flow program in Fig. 2. (a) A two-processor schedule with iteration period of 2 units. (b) A four-processor schedule with iteration period of 1 unit. (c) An eight-processor schedule with iteration period of 1/2 unit.
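The cyclo-static placement rule above is a one-line computation; a sketch (with processors numbered from 0 rather than P1, P2, which is our convention, not the paper's):

```python
def cyclostatic_slot(p0, t0, n, K, N, T):
    """Processor and start time of iteration n of a task whose
    iteration 0 runs on processor p0 at time t0, under a cyclo-static
    schedule with processor displacement K, N processors, and time
    displacement T."""
    return (p0 + n * K) % N, t0 + n * T

# Task D of Fig. 5(c): D1 on the first processor at time 0,
# with K = 1, N = 2, T = 2.
for n in range(3):
    print(cyclostatic_slot(0, 0, n, K=1, N=2, T=2))
# (0, 0)  then  (1, 2)  then  (0, 4)
```

A processor displacement of K = 0 reproduces the fully-static case: every iteration stays on processor p0.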
IV. ITERATION BOUND AND RATE-OPTIMAL SCHEDULES

This section reviews the notion of iteration bound in data-flow programs with feedback loops. Any program with no feedback loop can be scheduled with arbitrary concurrency (i.e., with arbitrarily shorter iteration period) by using a larger number of processors. Consider the program described by the DFG in Fig. 2. Fig. 6(a), (b), and (c), respectively, show periodic overlapped schedules for this DFG when 2, 4, and 8 processors are available. Note that the iteration periods of these schedules are, respectively, 2, 1, and 1/2 units of time (recall that all task execution times were chosen to be 1 unit in this example). This suggests that, by increasing the number of processors, we can reduce the iteration period arbitrarily for any program with no loop. However, this is not true for programs with loops or feedback; loops in programs impose an inherent lower bound on the iteration period, referred to as the iteration bound [2], [8]-[10], [23]-[25]. Periodic schedules are said to be rate-optimal if the iteration period is the same as the iteration bound. One can never achieve an iteration period less than this bound even when infinite processors are available. Although the iteration bound has been established for some time, it has so far not been shown that rate-optimal schedules can always be constructed. The objective of this paper is to show that one can always construct rate-optimal schedules for static data-flow programs. In the remainder of this paper, we do not consider the nodes of the program which do not belong to any loop (these can be scheduled using a postprocessor). We assume the nodes belonging to no loop can be deleted for now (since these can be scheduled with an arbitrarily smaller iteration period; more discussion on this is given in Section VIII).

The iteration period bound in any data-flow program with feedback loops is given by

T∞ = max_l (T_l / D_l),   (4.1)

where the maximum is taken over all loops l in the DFG, T_l is the sum of the execution times associated with all the nodes in loop l, and D_l is the number of registers in loop l. The bound imposed on the iteration period due to the lth loop is described by the inequality

T_l ≤ D_l T∞.   (4.2)

This inequality is referred to as the loop bound inequality, and T_l/D_l is the loop bound of the lth loop. The loop l0 for which T_l0/D_l0 is maximum is referred to as the critical loop, and the loop bound inequality reduces to a strict equality for this loop. The difference between the iteration bound and the loop bound is referred to as the slack time of a loop. The slack time of the critical loop is zero. The more the slack time, the less critical is the loop.

All existing scheduling papers take the maximum of the quantity in (4.1) and all the node execution times as the minimum achievable iteration period. In our definition, we do not consider the ceiling or the maximum of node execution times. We can achieve iteration periods smaller than the maximum node execution time.
Example 4.1: Consider the DFG in Fig. 7(a), which is described by Program 4.

Program 4:

The bounds imposed on the iteration period by the two loops are given by

t_a + t_b ≤ 3 T∞,   (4.3a)
t_b ≤ T∞,   (4.3b)

and T∞ is given by

T∞ = max [ (t_a + t_b)/3, t_b ].   (4.4)

Fig. 7. (a) A DFG with two loops. The node execution times of nodes A and B are 10 and 2 units. (b) An overlapped rate-optimal periodic schedule with iteration period of 4 units.
For loop L1, the total computation time is 12 units, and the number of loop registers is 3. The bound for loop L1 is 4. For loop L2, the computation time is 2, and there is a single loop register. The bound for loop L2 is 2. The iteration bound for the program is 4 (which is the maximum of the two loop bounds). The slack times are, respectively, 0 and 2 units for loops L1 and L2. For this program, any multiprocessor schedule which achieves an iteration period of 4 units is rate-optimal. Fig. 7(b) shows a rate-optimal fully-static overlapped schedule. Note that this periodic schedule is fully-static with respect to three iterations. This should be clear from the fact that A1 and A4 are scheduled in the first processor with a time displacement of 12 units. The iteration period is 4 units (which is less than the execution time of A), since three iterations can be scheduled in 12 units. Also note that a CPM nonoverlapped schedule would require an iteration period of 10 units (which is the time to execute task A). How we constructed the schedule in Fig. 7(b) is postponed until Section VI (see Example 6.1). □
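The loop-bound arithmetic of Example 4.1 is easily checked in code; the helper below is our own sketch, using exact rationals so that fractional bounds are not rounded:

```python
from fractions import Fraction

def iteration_bound(loops):
    """T_inf = max_l (T_l / D_l), with loops given as
    (total computation time T_l, register count D_l) pairs."""
    return max(Fraction(T, D) for T, D in loops)

# Example 4.1: L1 has T = 12, D = 3; L2 has T = 2, D = 1.
loops = [(12, 3), (2, 1)]
T_inf = iteration_bound(loops)                        # 4
slacks = [T_inf - Fraction(T, D) for T, D in loops]   # 0 and 2 units
print(T_inf, slacks)
```

The critical loop is the one with zero slack (L1 here), matching the text.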
The loop bound inequality for the full loop A → B → C → D → E → F → G → A in Fig. 4(a) is given by

t_a + t_b + t_c + t_d + t_e + t_f + t_g ≤ 3 T∞.   (4.5a)

The loop bound inequality for the loop containing the transitive arcs A → E and E → G is given by

t_a + t_e + t_g ≤ 3 T∞,   (4.5b)

and the loops containing the other transitive arcs lead to

t_a + t_e + t_f + t_g ≤ 3 T∞,   (4.5c)
t_a + t_b + t_c + t_e + t_f + t_g ≤ 3 T∞,   (4.5d)
t_a + t_b + t_c + t_e + t_g ≤ 3 T∞.   (4.5e)

Since node execution times are nonnegative, each of (4.5b)-(4.5e) is implied by (4.5a); the transitive arcs contribute no additional loop bounds.

Example 4.2: Consider the DFG in Fig. 8(a) with two loops. The program described by this DFG is Program 5.

Program 5:

Program 5 has two loops, the loop A → B → C → A (denoted L1) and the loop A → B → A (denoted L2); the numbers of loop registers are, respectively, 2 and 1. The bounds imposed on the iteration period by the two loops are, respectively, given by

t_a + t_b + t_c ≤ 2 T∞,
t_a + t_b ≤ T∞.
Fig. 8. (a) A DFG with two loops. The node computation times of A, B, and C are 10, 20, and 40 units, respectively. The iteration bound is 35 units, and loop L1 is the critical loop. (b) Acyclic precedence graph. (c) A nonoverlapped schedule for one iteration with iteration period of 60 units. The periodic schedule is constructed by replicating this schedule with time displacement of 60 units.
Program 6:

t_a + t_b + t_c + t_d ≤ T∞,   (4.6a)
t_a + t_c + t_d ≤ 2 T∞.   (4.6b)

Program 7:

t_b + t_d + t_e ≤ T∞,   (4.7a)
t_b + t_c + t_e ≤ 2 T∞.   (4.7b)
V. RETIMING IN DATA-FLOW PROGRAMS
Program 8:
Initial condition: bc(1) = fbc[fab[ca(0)]].
for {n = 1 to ∞} {
Fig. 10. (a) A retimed equivalent program of the DFG of Fig. 8(a). (b) Precedence graph of the retimed DFG. (c) A nonoverlapped schedule for one iteration with iteration period of 40 units.

Program 9:
In Program 9, the initial conditions are modified to ba(1) = fba[ab(0)] and bc(1) = fbc[ab(0)]
to preserve the input-output behavior of the system. The precedence graph and the nonoverlapped periodic schedule corresponding to the retimed DFG in Fig. 10(a) are respectively shown in Fig. 10(b) and (c). The iteration period of the retimed DFG is 40 units, which is 5 units greater than the iteration bound, but 20 units less than that of the schedule of Fig. 8(c). □
The retiming transformation can reduce the iteration period in a programmable multiprocessor implementation, but cannot guarantee a rate-optimal schedule.
VI. DATA-FLOW PROGRAM UNFOLDING
Nonoverlapped multiprocessor schedules constructed from precedence graphs of DFGs exploit intra-iteration precedence relations, and fail to exploit inter-iteration precedence. One can reduce the iteration period of multiprocessor schedules by exploiting the inter-iteration precedence constraints. The program unfolding transformation exploits the inter-iteration precedence constraints, and can lead to rate-optimal schedules (assuming availability of a large number of processors and complete interconnection among the multiple processors) [9], [10]. This section studies systematic unfolding, and properties of unfolded data-flow programs.
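The construction details of this section fall partly in a damaged page region. The sketch below therefore states the standard J-unfolding rule from the unfolding literature, which is consistent with the properties studied here (it preserves the total register count of each loop, and with J equal to the least common multiple it splits a loop with K registers into K loops, as used in the proof of Theorem 8.2):

```python
def unfold(arcs, J):
    """J-unfold a DFG given as a list of (u, v, i) arcs, where i is
    the register count of arc u -> v.

    Node u becomes copies (u, 0), ..., (u, J-1); the arc u -> v with
    i registers yields, for each k, an arc (u, k) -> (v, (k + i) % J)
    carrying (k + i) // J registers."""
    return [((u, k), (v, (k + i) % J), (k + i) // J)
            for u, v, i in arcs for k in range(J)]

# A loop A -> B -> A whose arc B -> A carries two registers, J = 2:
# it splits into two loops, each with a single register (perfect-rate).
for arc in unfold([("A", "B", 0), ("B", "A", 2)], 2):
    print(arc)
```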
A. Construction of Unfolded Programs
An unfolded DFG with an unfolding factor J describes J consecutive iterations of the original program.
Program 10:

Property 6.3: If the ith loop of the J-unfolded DFG has total computation time T_i' and register count D_i', then

T_i' ≤ J T∞ D_i'   (6.1)

must hold.
Proof: Let T∞' be the iteration bound of the unfolded DFG. Then T_i' ≤ T∞' D_i' must hold. But T∞' = J T∞ (due to Property 6.2), and hence (6.1) must hold. □
Property 6.4: Any loop bound relation of the type (6.1) in the unfolded DFG can be obtained either by multiplying a loop bound relation in the original DFG by a constant, or by taking linear additive combinations of the loop bound relations in the original DFG such that the right side of the inequality is a multiple of J.
Proof: The right side of the loop bound for any loop in the unfolded DFG must be a multiple of J (when expressed in terms of T∞, the iteration bound of the original program). Assume that the ith loop of the original DFG has the bound T_i ≤ D_i T∞. Any linear additive combination of one or more loop bounds in the original DFG, which corresponds to a loop bound in the unfolded DFG, must be of the form

Σ_i a_i T_i ≤ (Σ_i a_i D_i) T∞,

where the a_i are nonnegative integers and Σ_i a_i D_i is a multiple of J. □
Fig. 12. (a) An equivalent 2-unfolded program of the program in Fig. 8(a). (b) Precedence graph. (c) Rate-optimal periodic schedule with iteration period of 35 units.

The loop bounds of the original DFG are

t_a + t_b ≤ T∞,
t_a + t_b + t_c ≤ 2 T∞,

and the corresponding loop bounds of the 2-unfolded DFG are

2 t_a + 2 t_b ≤ T∞',
t_a + t_b + t_c ≤ T∞'.
VII. PERFECT-RATE DATA-FLOW PROGRAMS
In this section, we introduce the notion of perfect-rate data-flow programs, and show that these can always be scheduled in a fully-static and rate-optimal manner.
Definition 7.1: Any data-flow program with one register in
each loop is referred to as a perfect-rate data-flow program;
a DFG describing a perfect-rate program is referred to as a
perfect-rate DFG.
A. Scheduling of Perfect-Rate Data-Flow Programs
The DFG shown in Fig. 14(a) is an example of a perfect-rate graph. This DFG has one initial node (node D), one terminal node (node E), and three loops, and all the loops are critical (assuming unit execution time for each node or task). The iteration bound for this DFG is 3 units of time (u.t.). The precedence graph for the DFG is shown in Fig. 14(b), and the length of the critical path is 5 u.t. (any nonoverlapped CPM schedule would require an iteration period of 5 u.t.). A rate-optimal overlapped schedule is shown in Fig. 14(c). Note that the DFG did not need to be retimed or unfolded to obtain this rate-optimal schedule.
Fig. 15. Several retimed versions of the perfect-rate program of Fig. 14(a),
and corresponding rate-optimal overlapped schedules.
Theorem 7.1: For any perfect-rate graph, we can construct
fully-static rate-optimal schedules requiring no retiming or unfolding transformation.
Proof: The nodes of the critical loop can be scheduled contiguously, requiring a period equal to the critical loop computation time, i.e., the iteration bound. This schedule can be replicated over successive iterations with no gap at all in the same processor, with a time displacement equal to the iteration bound. For each gap in the scheduling of nodes (of noncritical loops), there exists a path with longer computation time. This implies that the sum of the computation time and the gap time of any loop cannot exceed the critical loop computation time (the iteration bound), and therefore Algorithm 7.1 results in a rate-optimal schedule. The schedule of the single iteration can be replicated with zero processor displacement and with a time displacement equal to the iteration bound, and hence the schedule is fully-static. □
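The replication argument in the proof can be illustrated with a toy sketch. Algorithm 7.1 itself is not reproduced in this excerpt, so the iteration-0 start times below are assumed for illustration; the point is only that a fully-static schedule places iteration n of every task on the same processor, displaced by n times the iteration bound:

```python
def fully_static(start_times, T_inf, n):
    """Start times of iteration n under a fully-static schedule:
    every iteration of a task stays on the same processor, with a
    time displacement equal to the iteration bound T_inf."""
    return {task: t0 + n * T_inf for task, t0 in start_times.items()}

# Critical loop A -> B -> C with unit-time tasks (iteration bound 3),
# scheduled contiguously on one processor with no gaps:
iter0 = {"A": 0, "B": 1, "C": 2}
print(fully_static(iter0, 3, 2))   # {'A': 6, 'B': 7, 'C': 8}
```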
Theorem 7.2: The number of loops in a perfect-rate graph
represents an upper bound on the number of processors to
schedule the recursive nodes in a fully-static and rate-optimal
manner.
Proof: The upper bound on the number of processors required by Algorithm 7.1 equals the number of loops in the program graph. This equals the upper bound on the number of processors for a fully-static, rate-optimal schedule. □
Remark 7.5: Note that this upper bound on the number of
processors is independent of the execution times of the nodes in
the program graph, and can be determined by only considering
the topology of the program graph.
Fig. 19. (a) A program graph with three loops. The execution times of nodes A, B, C, D, and E are, respectively, 20, 5, 10, 10, and 2 units. (b) Acyclic precedence graph.
DFGs would contain only one register. The unfolded DFG then corresponds to a perfect-rate DFG (by definition). Note that this unfolding factor is independent of the execution times of the nodes in a DFG. □
Example 7.1: Consider the unfolding of the DFG in Fig. 7(a). The two loops in Fig. 7(a) contain one and three registers, and the least common multiple is three. Fig. 11(a) shows the 3-unfolded DFG, which is perfect-rate. □
Example 7.2: The numbers of loop registers in the DFG in Fig. 8(a) are one and two, and the least common multiple is two. Fig. 12(a) shows a 2-unfolded DFG, which is perfect-rate. □
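The optimum unfolding factors in Examples 7.1 and 7.2 are simple least-common-multiple computations; a sketch (the function name is ours):

```python
from math import gcd
from functools import reduce

def optimum_unfolding_factor(register_counts):
    """Least common multiple of the loop register counts; note that
    it is independent of the node execution times."""
    return reduce(lambda a, b: a * b // gcd(a, b), register_counts, 1)

print(optimum_unfolding_factor([1, 3]))   # Example 7.1: 3
print(optimum_unfolding_factor([1, 2]))   # Example 7.2: 2
```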
VIII. FULLY-STATIC RATE-OPTIMAL SCHEDULING
This section uses the results of Sections VI and VII and proves that the tasks of any DFG can be scheduled rate-optimally in a fully-static manner.
One might be led to believe that we can always achieve a rate-optimal schedule by using an unfolding factor equal to the number of registers in the critical loop and then retiming the unfolded DFG. This is because the critical loop in the equivalent unfolded DFG would contain a single register, and the critical loop could be scheduled contiguously. However, this conjecture is not true; we disprove it with a counterexample.
Consider the DFG in Fig. 19, where the loop register counts are 2 and 3, respectively. The execution times of nodes A, B, C, D, and E in Fig. 19(a) are, respectively, 20, 5, 10, 10, and 2 units; the iteration bound is 16, and corresponds to the critical loop L1. The precedence graph of the DFG is shown in Fig. 19(b), and the length of the critical path (or equivalently the iteration period for this DFG) is 20 units. Since the number of registers in the critical loop is 2, we construct an equivalent unfolded DFG with J = 2, as shown in Fig. 20(a). The precedence graph for the unfolded DFG is shown in Fig. 20(b), and leads to an iteration period of 20 units; the iteration bound is thus not achieved with this unfolding factor.
Fig. 20. (a) A 2-unfolded equivalent program of the program in Fig. 19(a).
(b) The acyclic precedence graph.
Remark 8.1: It may be possible to obtain rate-optimal schedules with an unfolding factor less than the least common multiple. As an example, the DFG in Fig. 3(a) is scheduled rate-optimally in a fully-static manner [see the schedule in Fig. 5(b)] with no unfolding (the optimum unfolding factor for this example is 2). However, it is possible to schedule the DFG in Fig. 3(a) in a rate-optimal, fully-static manner with unfolding factor 2 for all possible node execution times. If we change the execution time of node D to 3 units (making the loop D → A → B → D critical, with an iteration bound of 2.5), then one cannot obtain a rate-optimal fully-static schedule without unfolding.
The optimum unfolding factor can be used to construct rate-optimal schedules only when no node outside every loop has an execution time greater than J_o T∞ (where J_o is the optimum unfolding factor). The iteration bound (for scheduling of J_o iterations) of the J_o-unfolded DFG is J_o T∞. If any node (not belonging to any loop) has an execution time greater than J_o T∞, then the DFG needs to be unfolded by a multiple of the least common multiple of the numbers of registers in all loops of the DFG (say K J_o), such that the execution time of every node in the DFG is less than K J_o T∞. Observe that unfolding with factor K J_o also reduces the DFG to an unfolded DFG which is perfect-rate (since unfolding a perfect-rate DFG always results in a perfect-rate DFG).
Now we obtain an upper bound on the number of processors
for rate-optimal scheduling of arbitrary data-flow programs.
Theorem 8.2: Recursive nodes (i.e., nodes belonging to one
or more loops) of any DFG can be scheduled in a rate-optimal
fully-static manner by using at most P processors, where P is the
sum of the register counts in all the loops in the original DFG.
Proof: Since the unfolding factor is the least common multiple of the register counts in all the loops, each loop with K registers transforms to K distinct loops in the unfolded DFG. Thus, the upper bound on the number of distinct loops in the unfolded DFG (which is perfect-rate) is equal to the sum of the register counts in all the loops of the original DFG. By Theorem 7.2, this is the upper bound on the number of processors needed to schedule all the recursive nodes in a fully-static and rate-optimal manner. □
Fig. 22. (a) A 6-unfolded equivalent program of the program in Fig. 19(a). The unfolded program is perfect-rate. (b) The acyclic precedence graph.
This unfolded program can be scheduled rate-optimally.
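The processor bound of Theorem 8.2 depends only on the graph topology; a minimal sketch (the function name is ours):

```python
def processor_upper_bound(register_counts):
    """Upper bound (Theorem 8.2) on the number of processors for
    rate-optimal fully-static scheduling of the recursive nodes:
    the sum of the register counts over all loops of the DFG."""
    return sum(register_counts)

# Fig. 19: loop register counts 2 and 3, so at most 5 processors are
# needed for the recursive nodes of the 6-unfolded (perfect-rate) DFG.
print(processor_upper_bound([2, 3]))   # 5
```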
IX. CONCLUSION
We have shown that the inter-iteration precedence constraints can be exploited by unfolding data-flow signal processing programs. Furthermore, unfolding by the optimum unfolding factor transforms a data-flow program to an equivalent perfect-rate data-flow program, which maximally exploits the inter-iteration precedence constraints. Unfolding any data-flow program beyond the optimum unfolding factor does not lead to any further improvement. The major result obtained in this paper is that the optimum unfolding factor can always be used to construct rate-optimal fully-static periodic schedules of data-flow programs for multiprocessor implementations (assuming availability of a large number of processors and complete interconnection).
While this paper has contributed to our theoretical understanding of multiprocessor implementation of iterative data-flow signal processing programs, construction of efficient multiprocessor schedules under practical constraints remains a topic of further research.
ACKNOWLEDGMENT
The authors are indebted to the three reviewers for their careful review of the paper. Their numerous constructive suggestions and criticisms improved the clarity of presentation. Thanks are also due to E. A. Lee, G. C. Sih, and D. A. Schwartz for many useful discussions.
REFERENCES
[1] R. M. Karp and R. E. Miller, "Properties of a model for parallel computations: Determinacy, termination, and queueing," SIAM J. Appl. Math., vol. 14, no. 6, pp. 1390-1411, Nov. 1966.
[2] R. Reiter, "Scheduling parallel computations," J. ACM, vol. 15, no. 4, pp. 590-599, Oct. 1968.
[3] R. E. Crochiere and A. V. Oppenheim, "Analysis of linear digital networks," Proc. IEEE, vol. 63, no. 4, pp. 581-595, Apr. 1975.
[4] J. B. Dennis, "Data flow supercomputers," IEEE Comput. Mag., pp. 48-56, Nov. 1980.
[5] A. L. Davis and R. M. Keller, "Data flow program graphs," IEEE Comput. Mag., vol. 15, no. 2, pp. 26-41, Feb. 1982.
[6] W. B. Ackerman, "Data flow languages," IEEE Comput. Mag., vol. 15, no. 2, pp. 15-25, Feb. 1982.
[7] E. A. Lee and D. G. Messerschmitt, "Static scheduling of synchronous data flow programs for digital signal processing," IEEE Trans. Comput., vol. C-36, no. 1, pp. 24-35, Jan. 1987.
[8] S. Y. Kung, P. S. Lewis, and S. C. Lo, "Performance analysis and optimization of VLSI data flow arrays," J. Parallel Distributed Comput., vol. 4, pp. 592-618, 1987.
[9] K. K. Parhi and D. G. Messerschmitt, "Rate-optimal fully-static multiprocessor scheduling of data-flow signal processing programs," in Proc. 1989 IEEE Int. Symp. Circuits Syst., Portland, OR, May 1989.
[10] K. K. Parhi and D. G. Messerschmitt, "Fully-static rate-optimal scheduling of iterative data-flow programs via optimum unfolding," in Proc. 1989 Int. Conf. Parallel Processing, St. Charles, IL, Aug. 1989.
[11] S. C. Cheng, J. A. Stankovic, and K. Ramamritham, "Scheduling algorithms for hard real-time systems-A brief survey," in Hard Real-Time Systems Tutorial, J. A. Stankovic, Ed. New York: IEEE Computer Society Press, 1988, pp. 150-173.
[12] C. L. Liu and J. W. Layland, "Scheduling algorithms for multiprogramming in a hard real-time environment," J. ACM, vol. 20, no. 1, pp. 46-61, Jan. 1973.