

IEEE TRANSACTIONS ON COMPUTERS, VOL. 40, NO. 2, FEBRUARY 1991

Static Rate-Optimal Scheduling of Iterative Data-Flow Programs via Optimum Unfolding
Keshab K. Parhi, Member, IEEE, and David G. Messerschmitt, Fellow, IEEE

Abstract-This paper addresses rate-optimal compile-time multiprocessor scheduling of iterative data-flow programs suitable for real-time signal processing applications. Recursions or loops in these programs lead to an inherent lower bound on the achievable iteration period, referred to as the iteration bound. A multiprocessor schedule is rate-optimal if the iteration period equals the iteration bound. In general, it may not be possible to schedule a specified iterative data-flow program rate-optimally. The retiming transformation may improve the iteration period of a schedule, but cannot guarantee the schedule to be rate-optimal.
Systematic unfolding of iterative data-flow programs is proposed, and properties of unfolded data-flow programs are studied. Unfolding increases the number of tasks in a program, unravels the hidden concurrency in iterative data-flow programs, and can reduce the iteration period. We introduce a special class of iterative data-flow programs, referred to as perfect-rate programs. Each loop in these programs has a single register (a register is also referred to as a delay in the signal processing literature). Perfect-rate programs have the important property that they can always be scheduled rate-optimally (requiring no retiming or unfolding transformation). We show that unfolding any arbitrary program by an optimum unfolding factor transforms it into an equivalent perfect-rate program, which can then be scheduled rate-optimally. This optimum unfolding factor for any arbitrary program is the least common multiple of the number of registers (or delays) in all loops, and is independent of the node execution times. An upper bound on the number of processors for rate-optimal scheduling is also given.

Index Terms- Fully-static rate-optimal schedules; iteration bound; intra-iteration and inter-iteration precedence constraints; processor bounds; optimum unfolding; perfect-rate data-flow programs; periodic, deterministic, nonpreemptive multiprocessor scheduling; program unfolding; retiming; real-time signal and image processing; static data-flow programming.

I. INTRODUCTION
The data-flow model of program description clearly exhibits concurrency in multiprocessor implementations, and has been widely used for many years [1]-[10]. This paper is concerned with periodic scheduling of a class of iterative, static, coarse-grain data-flow programs, suitable for real-time signal processing applications. In particular, we are concerned with compile-time construction of rate-optimal schedules of such programs.
Much research has been done in preemptive and nonpreemptive scheduling of general-purpose real-time systems for single and multiple processor systems [11]-[16]. Scheduling of real-time systems using static and dynamic scheduling approaches has been considered. A good survey of current scheduling approaches is described in [11].

Manuscript received June 20, 1988; revised June 23, 1990. This work was
supported in part by an IBM Graduate Fellowship and by the National Science
Foundation under Contract MIP-8908586.
K. K. Parhi is with the Department of Electrical Engineering, University of
Minnesota, Minneapolis, MN 55455.
D. G. Messerschmitt is with the Department of Electrical Engineering and
Computer Sciences, University of California, Berkeley, CA 94720.
IEEE Log Number 9040681.

This paper is concerned with multiprocessor scheduling of signal processing data-flow programs, described by data-flow graphs (DFGs). The nodes in a DFG represent tasks or computations, and arcs represent communications. The arcs in signal
processing DFGs have registers (or delays) associated with them
(the number of registers can be any nonnegative integer). An
arc with no register describes precedence between tasks within
an iteration, and an arc with registers describes precedence
between tasks of different iterations. The registers associated with
arcs differentiate signal processing DFGs from other classes of
DFGs. Not much research has been devoted to task scheduling
taking into account precedence between tasks of consecutive
iterations.
First we describe some general assumptions. We assume that no preemption is allowed. In general-purpose computing systems, task execution times are often assumed to be stochastic. We assume the node execution times to be deterministic (the execution time of different nodes can be any arbitrary number, and need not be the same in general). This paper addresses static or compile-time scheduling, as opposed to dynamic or run-time scheduling [14]. Often some hard real-time systems tolerate partial results, i.e., these real-time systems can accept an intermediate result (which is not quite precise) [15]. We assume that all tasks run to completion and are computed precisely. We are concerned with periodic tasks, which are executed repetitively. Execution of all tasks of the program once is referred to as an iteration (a more precise definition of iteration is given in Section II). The programs may contain feedback loops with an arbitrary number of storage elements or registers (inside the loops). Each loop in the program must have at least a single register or storage unit (otherwise the program would be noncomputable). Section II describes the details of the program model considered in this paper. Periodic schedules can be nonoverlapped or overlapped; fully-static [9], [10], or cyclo-static [17]-[22]. Section III describes these characteristics of periodic schedules. This paper is concerned with construction of rate-optimal schedules, which are fully-static.
Signal processing data-flow programs are nonterminating in nature; these programs process infinite time series and produce infinite time series. Signal processing programs with recursion (or feedback) have a fundamental lower bound on the iteration period, referred to as the iteration bound [2], [8]-[10], [23]-[25]. We can never achieve an iteration period less than the iteration bound, even when infinite processors are available. The notion of iteration bound is reviewed in Section IV. A multiprocessor schedule is said to be rate-optimal if the actual iteration period equals the iteration bound. This paper considers rate-optimal scheduling of iterative data-flow programs, and assumes availability of a large number of processors (an upper bound on the required number of processors is given in Section VIII of this paper), which are completely interconnected. Although



the notion of iteration bound has existed for quite some time, it was so far not known if the iteration bound could always be achieved for any arbitrary data-flow program (even when infinite processors are available and a large amount of time is allowed to construct multiprocessor schedules). This paper proves that it is always possible to achieve the iteration bound (that is, construct a rate-optimal schedule) for arbitrary static data-flow programs (using the program unfolding transformation). Using unfolding, we prove that it is always possible to construct fully-static rate-optimal schedules for any arbitrary signal processing program (past research had concluded that rate-optimal schedules cannot always be constructed in a fully-static manner, but can always be constructed in a cyclo-static manner [17]-[22]).
Traditional static multiprocessor scheduling techniques for
static data-flow programs are nonoverlapped, and use critical path
methods (CPM) [26], [27]. Nonoverlapped schedules minimize
execution time of the program over a single iteration of the
algorithm, and rarely lead to rate-optimal schedules (because
these schedules do not consider overlap of successive iterations).
It is possible to improve the iteration period of a nonoverlapped
schedule using the retiming technique [28], [29] (which was first
used in the context of minimizing clock period of synchronous
digital systems). The retiming technique redistributes the loop
registers, and creates new precedence relations and new schedules. Retiming can improve the iteration period of a program,
but cannot guarantee a schedule to be rate-optimal. Improving
schedules using the retiming transformation is addressed in
Section V.
Rate-optimal, periodic multiprocessor schedules of iterative data-flow programs can be constructed by exploiting successive overlap of different iterations. Overlap of successive iterations exploits precedence constraints among different iterations. We introduce a formal and systematic approach to exploiting these constraints using the program unfolding transformation [9], [10]. Section VI of this paper presents the unfolding transformation, and studies properties of unfolded data-flow programs. If a data-flow program is unfolded by a factor J, then the unfolded data-flow program (referred to as the J-unfolded data-flow program) describes J successive iterations of the original data-flow program.
In Section VII, we introduce the notion of a class of programs referred to as perfect-rate data-flow programs [9], [10]. A data-flow program is said to be perfect-rate if all the loops in the program have one and only one (storage element or) register. We prove that perfect-rate programs can always be scheduled rate-optimally (thus the name perfect-rate). Next we show that it is possible to transform any arbitrary data-flow program to an equivalent perfect-rate program using optimum unfolding (which can then be scheduled rate-optimally). The optimum unfolding factor is shown to be equal to the least common multiple of the number of loop registers in the data-flow program. Rate-optimal scheduling of arbitrary data-flow programs via optimum unfolding is addressed in Section VIII. An upper bound on the number of processors is also given in Section VIII.
II. ITERATIVE DATA-FLOW PROGRAM MODEL

This paper is concerned with nonterminating, iterative, data-flow programs, which are useful in signal processing applications. These programs are described by data-flow program graphs (DFGs), where nodes represent tasks to be executed, and directed arcs represent communication among nodes. Nonterminating programs process infinite time series and produce infinite


Fig. 1. A simple nonterminating data-flow program. The task A represents an addition operation.

time series; these programs execute identical tasks repetitively in


a periodic fashion.
A simple example of a nonterminating program is illustrated
in Program 1.
Program 1:

for {n = 1 to ∞} {
    z(n) = x(n) + y(n) }.

Program 1 operates on the infinite time series {x(n)} and {y(n)}, and computes the infinite time series {z(n)}. Each execution of the loop in the nonterminating program is referred to as an iteration.
The time required to perform each iteration is referred to as
the iteration period of the program. For example, the iteration
period in Program 1 corresponds to the time required for a single
addition operation. The program graph corresponding to Program
1 is described by a single node with two input arcs (which
represent inputs) and one output arc (which represents the output)
and is shown in Fig. 1. In data-flow terminology, each node is
assumed to consume a single token from each of its incoming
arcs, and produce a single token on each of its output arcs. (Note
that in signal processing literature, token means a sample,
and iteration period is often referred to as sample period, to
indicate that all the time series are sampled periodically with a
period equal to the iteration or sample period). Program 1 has
the characteristic that no token needs to be stored anywhere.
Many signal processing programs require samples or tokens to
be stored in a register or a latch (to be used in future iterations).
As an example of such a program, consider Program 2.
Program 2:

The program graph corresponding to Program 2 is shown in Fig. 2. The tokens produced by node U and consumed by node V constitute the time series {uv(n)}. Note that a time series is associated with arcs, and computations are associated with nodes. The notation fuv(.) represents the function to be evaluated for computation of the time series {uv(n)}. For simplicity, we assume the starting index of the iteration to be 1. In Fig. 2, the tokens produced by node A at iteration n are consumed by node B at iteration n. However, the token produced by node B at iteration n is consumed by node C at iteration (n + 2), i.e., the output token of node B is stored for two cycles before being consumed by node C. The symbol 2D on arc B → C in the program graph represents two registers or buffers (for storage of the token produced by node B). (The registers are frequently referred to as delays in the signal processing literature.)


Fig. 2. A data-flow program with two registers in arc B → C. The token produced by node B is consumed by node C after two iterations. All the node execution times are assumed to be unity.

In this paper, the notation iD associated with an arc represents i registers. The token produced by node C is consumed by node D in the same iteration and does not need to be stored. The registers are initialized with initial conditions. In Program 2, bc(-1) and bc(0) are initially stored in the two registers. In the remainder of the paper, we will omit the system input and output arcs (since these do not affect scheduling of the tasks).

Programs 1 and 2 have the common feature that they do not have any feedback or loop. Many signal processing programs are recursive in nature, and contain feedback. It is this class of programs that we are mostly concerned with in this paper, since the loops in program graphs impose an inherent lower bound on the achievable iteration period. We assume the recursive programs to be computable, i.e., all loops in the DFG contain one or more registers. Consider Program 3, which contains recursion or feedback (see Fig. 3).

Program 3 has two loops, the loop B → C → B (denoted L1), and the loop A → B → D → A (denoted L2). In Fig. 3(a), the two loops, respectively, contain 1 and 2 registers.

The arcs with registers in the program graph model presented in this paper have special characteristics. An arc U → V with i registers implies that the execution of the nth iteration of node V consumes tokens produced by node U in the (n - i)th iteration, and possibly other past iterations, i.e., (n - i - j)th iterations (where j is positive). As an example, if the nth execution of node V uses tokens produced by node U in the (n - 2), (n - 4), and (n - 5)th iterations, this would be modeled by a single arc U → V with two registers associated with the arc. One interpretation of this model is that the tokens produced by node U before the (n - i)th iteration can be internally stored inside the node V. Note that this nature of data-flow computation differs from the traditional data-flow program model.

The data-flow programs used in this paper are assumed to be coarse-grained in nature. We assume that splitting of a node into two or more nodes, and merging of two or more nodes into one node, are not allowed. In general, splitting of a node into two or more nodes may make the program finer-grained and may improve the granularity (and hence concurrency) of the program. On the other hand, merging of two or more nodes into a single node may make the program more coarse-grained, and may hide some of the available concurrency.

III. PERIODIC MULTIPROCESSOR SCHEDULES

Fig. 3. (a) A simple program with two loops. The loop L1 contains one register and the loop L2 contains two registers. The execution time for each node is assumed to be one unit. (b) Acyclic precedence graph obtained by deleting the arcs with registers.

Multiprocessor periodic schedules of static data-flow programs can be nonoverlapped or overlapped. The nonoverlapped schedules primarily use critical path methods. The overlapped schedules can be fully-static or cyclo-static. This section summarizes characteristics of these schedules.

A. Precedence Constraints

An arc U → V with no register implies that the nth iteration


of node V can only be scheduled after execution of the nth
iteration of node U is complete. In other words, arcs with
no registers impose precedence constraints within an iteration
(referred to as intra-iteration precedence). If we delete all the
arcs with registers in a DFG, then the new graph contains all
the intra-iteration precedence constraints of the program graph.
This precedence graph is used for construction of multiprocessor
schedules. The precedence graph is always acyclic (since this is
obtained by deleting arcs with registers), and does not contain any
inter-iteration precedence (see paragraph below). As an example,
Fig. 3(b) contains the precedence graph of the DFG of Fig. 3(a).
The double arrow in Fig. 3(b) represents the critical path of the
precedence graph (this convention is followed throughout the
paper). Note that if the number of registers in any arc in a DFG
is changed from one positive value to another positive value,
then the precedence graph remains unaltered.
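As an illustration of this construction, the short Python sketch below (the list-of-arcs representation and the names are ours, not the paper's) encodes the arcs of Fig. 2 together with their register counts and derives the intra-iteration precedence graph by deleting every arc that carries a register.

# A minimal sketch: a DFG as a list of arcs (source, destination, registers).
# The arcs follow Fig. 2: A -> B with no register, B -> C with two registers
# (the 2D arc), and C -> D with no register.
dfg = [("A", "B", 0), ("B", "C", 2), ("C", "D", 0)]

def precedence_graph(arcs):
    # Intra-iteration precedence graph: keep only the arcs with no register.
    return [(u, v) for (u, v, regs) in arcs if regs == 0]

print(precedence_graph(dfg))   # [('A', 'B'), ('C', 'D')]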
An arc U → V with i registers in a DFG implies that the nth iteration of node V consumes the token produced by node U in the (n - i)th iteration. A similar argument applies to self-loops also, i.e., a self-loop at node U with i registers implies the nth iteration of node U consumes the token produced by the (n - i)th iteration of node U. Arcs with registers represent precedence constraints among iterations (referred to as inter-iteration precedence). Construction of multiprocessor schedules using the precedence graph only cannot exploit inter-iteration precedence (or inter-iteration concurrency) and may not lead to schedules with minimum iteration period.
If there is a directed path from node U to node V in the precedence graph, then node V is a successor of node U, and node U is a predecessor of node V. If there is a directed arc U → V in the precedence graph, then node V is an immediate successor of node U, and node U is an immediate predecessor of node V.




Fig. 4. (a) Illustration of transitivity associated with arcs. (b) Illustration of extended transitivity associated with the arcs.

A node U in the precedence graph is said to be an initial node


if it has no predecessors, and a node V in the precedence graph
is said to be a terminal node if it has no successors. Note that
an initial node has registers in all its input arcs, and a terminal
node has registers in all its outgoing arcs.

B. Transitivity
Transitivity is associated with arcs. An arc U → V in a DFG is said to be transitive if there exists a path U → V, and the number of registers of the path U → V and the arc U → V are the same. (A path A → B → C → D is also referred to as the path A → D, and the sum of the number of registers in arcs A → B, B → C, and C → D is referred to as the number of registers in the path A → D.) For example, in the DFG in Fig. 4(a), there are three transitive arcs. The arc C → E and the path C → E (which is the same as C → D → E) both contain no registers, and the arc C → E is a transitive arc. The arc C → D implies that node D can be invoked after node C is executed. The arc D → E implies that the node E can be invoked after the execution of node D is complete. These two precedence constraints automatically satisfy the precedence constraint imposed by the arc C → E. Thus, deletion of the transitive arc C → E will not alter the required precedence constraints.
The arc A → E with two registers is transitive, because the path A → E also has two registers. The path A → E with two registers implies that the execution of the (n - 2)nd iteration of A is complete before the nth iteration of E is invoked, which is also satisfied by the arc A → E. Thus, deletion of the transitive arc A → E does not change the inter-iteration precedence constraints. Similarly, the arc E → G is a transitive arc (because this arc and the path E → G contain one register each), and can be deleted. In this paper, we will assume that the first step in the construction of a multiprocessor schedule is to delete all the transitive arcs from the DFG.
We can extend the notion of transitivity further. If an arc U → V contains i registers, and a path U → V contains fewer than i registers, then the arc U → V is an extended transitive arc. This arc can also be deleted without affecting any precedence constraints.

Fig. 5. Two-processor periodic schedule of the data-flow program in Fig. 3.


(a) Nonoverlapped schedule with iteration period 3 units. This schedule is
fully-static. (b) An overlapped schedule with iteration period of 2 units. This
schedule is also fully-static. (c) An overlapped schedule with iteration period
of 2 units. This schedule is cyclo-static.

This is because the path U → V imposes a tighter precedence constraint than the arc U → V. As an example, arcs A → C and C → E in Fig. 4(b) satisfy extended transitivity. We assume extended transitive arcs are also deleted before schedules are constructed. Some more discussion on this is given in Section IV.
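One way to mechanize the deletion of transitive and extended transitive arcs is sketched below (illustrative Python, assuming at most one arc between any ordered pair of nodes): an arc is dropped whenever some other path between its endpoints carries no more registers than the arc itself, which covers both cases described above.

import heapq
from collections import defaultdict

def min_registers(arcs, src, dst):
    # Fewest registers on any path src -> dst (Dijkstra over register counts).
    adj = defaultdict(list)
    for u, v, r in arcs:
        adj[u].append((v, r))
    best, heap = {src: 0}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > best.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            if d + w < best.get(v, float("inf")):
                best[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return float("inf")

def delete_transitive_arcs(arcs):
    # Keep an arc only if every alternative path needs strictly more registers.
    kept = []
    for k, (u, v, r) in enumerate(arcs):
        others = arcs[:k] + arcs[k + 1:]
        if min_registers(others, u, v) > r:
            kept.append((u, v, r))
    return kept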
C. Nonoverlapped Schedules
Multiprocessor schedules are said to be nonoverlapped if the execution of the (n + 1)st iteration begins after completion of all tasks of the nth iteration (i.e., the executions of tasks in any two consecutive iterations do not overlap). These schedules optimize the performance of a single iteration of the program, and repeat the schedule periodically. The minimum iteration period that can be achieved with a nonoverlapped schedule is the computation time associated with the critical path of the precedence graph. As an example, Fig. 5(a) shows the nonoverlapped multiprocessor schedule of the DFG in Fig. 3(a); this schedule is obtained using the precedence graph of Fig. 3(b). In this schedule, the nth iteration of task T is denoted Tn; for example, D2 denotes the execution of the second iteration of task D. The iteration period of the nonoverlapped schedule is 3 units. Note that after a single iteration is scheduled, the entire periodic schedule is obtained by simply repeating the schedule with a time displacement equal to the iteration period. Fig. 5(a) shows the nonoverlapped schedule for the first three iterations.

D. Overlapped Schedules
A schedule is said to be overlapped, if any task of iteration
(n + 1) is scheduled before all tasks of iteration n have been
executed. The schedules in Fig. 5(b) and (c) correspond to
overlapped schedules. In these schedules, D2 is scheduled before
execution of B1 is complete, and any two consecutive iterations always overlap. Overlapped schedules exploit inter-iteration
precedence constraints (in addition to intra-iteration precedence),


and exploit the repetitive nature of periodic schedules (one may


note the similarity between overlapped scheduling and hardware
pipelining). These schedules can lead to shorter iteration period.
Note that the iteration periods in Fig. 5(b) and (c) are two units,
as opposed to three in the nonoverlapped schedule in Fig. 5(a).


E. Fully-Static Schedules
A periodic multiprocessor schedule is said to be fully-static if all the iterations of some task are scheduled in the same processor [9], [10]. The nonoverlapped multiprocessor schedule in Fig. 5(a) is fully-static. Note that all the iterations of D (i.e., D1, D2, D3, etc.) are scheduled in processor P1 with time displacement 3 units (the time displacement is the same as the iteration period). The overlapped periodic schedule in Fig. 5(b) is also fully-static (with time displacement of 2 units). The schedule of a single iteration is replicated with a time displacement equal to the iteration period.

Fig. 6. Multiprocessor schedule of the nonrecursive data-flow program in Fig. 2. (a) A two-processor schedule with iteration period of 2 units. (b) A four-processor schedule with iteration period of 1 unit. (c) An eight-processor schedule with iteration period of 1/2 unit.

F. Cyclo-Static Schedules
The cyclo-static periodic multiprocessor schedules were introduced by Schwartz and Barnwell [17]-[22]. These schedules are characterized by a time displacement as well as a processor displacement. In cyclo-static schedules, if the iteration n of some task A is scheduled in processor Pk at time t, then the iteration (n + 1) of task A is scheduled in processor P((k + K) modulo N) at time (t + T), where T is the time displacement (or iteration period), N is the total number of processors, and K is the processor displacement. The overlapped periodic schedule in Fig. 5(c) is a cyclo-static schedule with iteration period 2 and a processor displacement of 1. The task D1 is scheduled in P1 at time 0, D2 is scheduled in P2 (note that (1 + 1) modulo 2 is considered as 2) at time 2, D3 is scheduled in P1 (since (2 + 1) modulo 2 is 1) at time 4, etc. A cyclo-static schedule reduces to a fully-static schedule if the processor displacement is 0. This paper is not concerned with construction of general cyclo-static schedules; we only consider construction of fully-static schedules.
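The cyclo-static placement rule quoted above can be written down directly; the small Python sketch below (function and argument names are ours) reproduces the placement of D1, D2, and D3 in Fig. 5(c).

def cyclostatic_slot(n, k1, t1, K, N, T):
    # Processor index (1..N) and start time of iteration n (n = 1, 2, ...),
    # given that iteration 1 runs on processor k1 at time t1.
    k = ((k1 - 1 + (n - 1) * K) % N) + 1
    t = t1 + (n - 1) * T
    return k, t

# Fig. 5(c): D1 on P1 at time 0, with N = 2 processors, K = 1, T = 2.
print([cyclostatic_slot(n, 1, 0, 1, 2, 2) for n in (1, 2, 3)])
# [(1, 0), (2, 2), (1, 4)]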
IV. ITERATION BOUND AND RATE-OPTIMAL SCHEDULES

This section reviews the notion of iteration bound in data-flow programs with feedback loops. Any program with no feedback loop can be scheduled with arbitrary concurrency (i.e., with an arbitrarily shorter iteration period) by using a larger number of processors. Consider the program described by the DFG in Fig. 2. Fig. 6(a), (b), and (c), respectively, show periodic overlapped schedules for this DFG when 2, 4, and 8 processors are available. Note that the iteration periods of these schedules are, respectively, 2, 1, and 1/2 units of time (recall all task execution times were chosen to be 1 unit in this example). This suggests that by increasing the number of processors, we can reduce the iteration period arbitrarily for any program with no loop. However, this is not true for programs with loops or feedback; loops in programs impose an inherent lower bound on the iteration period, referred to as the iteration bound [2], [8]-[10], [23]-[25]. Periodic schedules are said to be rate-optimal if the iteration period is the same as the iteration bound. One can never achieve an iteration period less than this bound even when infinite processors are available. Although the iteration bound has been established for some time, it has so far not been shown that rate-optimal schedules can always be constructed. The objective of this paper is to show that one can always construct rate-optimal schedules for static data-flow programs. In the remainder of this paper, we do not consider the nodes of the program which do not belong to any loop (these can be scheduled using a postprocessor). We assume the nodes belonging to no loop can be deleted for now (since these can be scheduled with an arbitrarily smaller iteration period; some more discussion on this is in Section VIII).

The iteration period bound in any data-flow program with feedback loops is given by

    T∞ = Max_l [T_l / D_l]     (4.1)

where the maximum is taken over all loops l in the DFG, T_l is the sum of the execution times associated with all the nodes in loop l, and D_l is the number of registers in loop l. The bound imposed on the iteration period due to the lth loop is described by the inequality

    T_l ≤ D_l T∞.     (4.2)

This inequality is referred to as the loop bound inequality, and T_l/D_l is the loop bound of the lth loop. The loop l_0 for which T_l0/D_l0 is maximum is referred to as the critical loop, and the loop bound inequality reduces to a strict equality for this loop. The difference between the iteration bound and the loop bound is referred to as the slack time of a loop. The slack time of the critical loop is zero. The more the slack time, the less critical is the loop.

All existing scheduling papers take the maximum of the quantity in (4.1) and all the node execution times as the minimum achievable iteration period. In our definition, we do not consider the ceiling or the maximum of node execution times. We can


always overlap successive iterations to get an iteration period


less than the execution time of some node in the program! The
iteration bound can also be less than the execution time of a path
in the precedence graph of the data-flow program. These make
the notion of iteration bound important, and we show that it is
always possible to achieve an iteration period equal to the bound
given by (4.1).
Example 4.1: Consider the DFG in Fig. 7(a) with two loops, A → B → A (call this loop L1) and the self-loop B → B (call this loop L2). This DFG describes Program 4.

Program 4:

Initial conditions: ba(0), bb(0), ab(-1), and ab(0).
for {n = 1 to ∞} {
    ab(n) = fab[ba(n - 1)]
    ba(n) = fba[ab(n - 2), bb(n - 1)]
    bb(n) = fbb[ab(n - 2), bb(n - 1)] }.

The loop bounds for this program are

    ta + tb ≤ 3T∞     (4.3a)
    tb ≤ T∞     (4.3b)

where ta and tb represent the computation times associated with nodes A and B. The iteration bound T∞ is given by

    T∞ = Max[(ta + tb)/3, tb].

Fig. 7. (a) A DFG with two loops. The node execution times of nodes A and B are 10 and 2 units. (b) An overlapped rate-optimal periodic schedule with iteration period of 4 units.

For loop L1, the total computation time is 12 units, and the number of loop registers is 3. The bound for loop L1 is 4. For loop L2, the computation time is 2, and there is a single loop register. The bound for loop L2 is 2. The iteration bound for the program is 4 (which is the maximum of the two loop bounds). The slack times are, respectively, 0 and 2 units for loops L1 and L2. For this program, any multiprocessor schedule which achieves an iteration period of 4 units is rate-optimal. Fig. 7(b) shows a rate-optimal fully-static overlapped schedule. Note that this periodic schedule is fully-static with respect to three iterations. This should be clear from the fact that A1 and A4 are scheduled in the first processor with a time displacement of 12 units. The iteration period is 4 units (which is less than the execution time of A), since three iterations can be scheduled in 12 units. Also note that a CPM nonoverlapped schedule would require an iteration period of 10 units (which is the time to execute task A). How we constructed the schedule in Fig. 7(b) is postponed until Section VI (see Example 6.1).
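As a concrete check of (4.1) and (4.2), the following small Python sketch (the data structures and names are ours) computes the loop bounds, the iteration bound, and the slack times for the DFG of Fig. 7(a) used in Example 4.1.

# Node execution times and loops of Fig. 7(a): t_A = 10, t_B = 2;
# loop L1 = A -> B -> A with three registers, loop L2 = B -> B with one.
exec_time = {"A": 10, "B": 2}
loops = [({"A", "B"}, 3), ({"B"}, 1)]

def loop_bound(nodes, registers):
    return sum(exec_time[u] for u in nodes) / registers

bounds = [loop_bound(nodes, regs) for nodes, regs in loops]
iteration_bound = max(bounds)                    # 4.0; L1 is the critical loop
slacks = [iteration_bound - b for b in bounds]   # [0.0, 2.0]
print(iteration_bound, slacks)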
Example 4.2: Consider the DFG in Fig. 8(a) with two loops. The program described by this DFG is Program 5.

Program 5:

Program 5 has two loops, the loop A → B → C → A (denoted L1) and the loop A → B → A (denoted L2); the number of loop registers is, respectively, 2 and 1. The bounds imposed on the iteration period by the two loops are, respectively, given by

    ta + tb + tc ≤ 2T∞     (4.4a)
    ta + tb ≤ T∞     (4.4b)

where ta, tb, and tc, respectively, represent the computation times associated with the nodes A, B, and C. The iteration bound is given by

    T∞ = Max[(ta + tb + tc)/2, ta + tb].

For ta = 10, tb = 20, and tc = 40, the iteration bound is 35 units. The loop L1 here is the critical loop. Fig. 8(b) shows the precedence graph, and Fig. 8(c) shows a nonoverlapped, periodic schedule with iteration period of 60 units.

In Section III-B, it was stated that the transitive and extended transitive arcs impose identical or looser precedence constraints. It is clear that deletion of these arcs does not alter the iteration bound. Any loop formed by a transitive arc has less loop computation time (than the loop containing the path associated with the transitive arc) and has the same number of loop registers (since transitive arcs and associated paths contain identical numbers of registers). Thus, the loop bound imposed by a loop containing a transitive arc is less than that of the loop containing the associated path. We illustrate these observations using some examples.

Example 4.3: Consider the loop A → B → C → D → E → F → G → A of Fig. 4(a). The loop bound inequality for this loop is given by

    ta + tb + tc + td + te + tf + tg ≤ 3T∞.     (4.5a)

The loop bound inequality for the loop containing transitive arcs A → E and E → G is given by

    ta + te + tg ≤ 3T∞.     (4.5b)

The inequality for the loop containing the transitive arc A → E only is given by (4.5c), the loop containing the transitive arc C → E only is given by (4.5d), and the loop containing transitive arcs C → E and E → G is given by (4.5e).

    ta + te + tf + tg ≤ 3T∞.     (4.5c)
    ta + tb + tc + te + tf + tg ≤ 3T∞.     (4.5d)
    ta + tb + tc + te + tg ≤ 3T∞.     (4.5e)

It is easy to verify that if (4.5a) is satisfied, then the inequalities (4.5b)-(4.5e) are automatically satisfied. In words, the loop bound inequalities for loops containing transitive arcs are automatically satisfied if the inequality for the loop containing the path associated with the transitive arcs is satisfied.


Fig. 8. (a) A DFG with two loops. The node computation times of A, B, and C are 10, 20, and 40 units, respectively. The iteration bound is 35 units, and loop L1 is the critical loop. (b) Acyclic precedence graph. (c) A nonoverlapped schedule for one iteration with iteration period of 60 units. The periodic schedule is constructed by replicating this schedule with time displacement of 60 units.

We assume all transitive arcs are first deleted (thus no loop in


the program graph contains a transitive arc).
Now consider extended transitive arcs. The loop containing
the extended transitive arc has less loop computation time and
a larger number of registers, so this loop bound is much less
than the loop containing the path associated with the extended
transitive arc.
Example 4.4: Consider the inequality for the loop A → B → C → D [in Fig. 4(b)]

    ta + tb + tc + td ≤ T∞.     (4.6a)

The loop inequality containing the extended transitive arc A → C is given by

    ta + tc + td ≤ 2T∞.     (4.6b)

It is clear that if (4.6a) is satisfied, then (4.6b) is automatically satisfied. As another example, consider the inequalities for the loops B → C → D → E and B → C → E, respectively, given by (4.7a) and (4.7b).

    tb + tc + td + te ≤ T∞.     (4.7a)
    tb + tc + te ≤ 2T∞.     (4.7b)

The relation (4.7b) is automatically satisfied if (4.7a) is satisfied.

We assume the extended transitive arcs are first deleted, and no loop in the program graph contains any transitive arc.

V. RETIMING IN DATA-FLOW PROGRAMS

The retiming technique was first proposed by Leiserson, Rose, and Saxe to reduce the clock period in synchronous digital systems [28]. Retiming involves moving around the registers in the DFG such that the total number of registers in any loop remains unaltered, and the input-output behavior of the system is preserved (see [28]). Unlike pipelining, retiming does not increase the latency of the system. The retiming transformation can also reduce the iteration period of multiprocessor schedules of data-flow programs.

Removal of a fixed number of registers from all the incoming arcs of any node, and addition of the same number of registers to all the outgoing arcs of the same node, is an example of a valid local retiming applied to the node [28], [29] (this is also referred to as a cutset transformation). Any local retiming operation can be performed at a node only if all of the incoming arcs have registers associated with them, i.e., this node is an initial node in the corresponding precedence graph. Retiming changes the initial conditions of the data-flow programs, and yields new schedules by altering the initial nodes of the acyclic precedence graph. If i registers are removed from all the incoming arcs and i registers are added to all the outgoing arcs of a node, then the iteration indexes of the time series produced by that node are increased by i. Any valid global retiming operation can always be described as a combination of such local retiming operations. Formally, if R_i (i = 1 to N) are local retiming operations, and R is a valid global retiming operation, then R = Σ_{i=1}^{N} p_i R_i, where all p_i are positive. In our notation, p_i R_i denotes removal of p_i registers from all incoming arcs of the ith node and addition of p_i registers to all outgoing arcs of the ith node. The notation R1 + R2 implies that two local retiming operations R1 and R2 have been performed (in any order).

Example 5.1: Fig. 9(a) shows a DFG, and Fig. 9(b) shows an example of a retimed DFG with retiming applied locally at node A. Fig. 9(c) shows another retimed DFG with retiming applied locally at node B of the DFG in Fig. 9(b). Note that the global retiming from Fig. 9(a) to Fig. 9(c) is described by a combination of the two local retiming transformations. The three programs in Fig. 9(a), (b), and (c) are described by Programs 6, 7, and 8. In these programs, the initial conditions are different.

Program 6:

Program 7:

Program 8:
Initial condition: bc(1) = fbc[fab[ca(0)]].
for {n = 1 to ∞} {


Fig. 9. Illustration of retiming in data-flow programs. The data-flow program in (a) is retimed locally at node A to obtain the retimed data-flow program in (b). The program in (b) is further retimed locally at node B to get the retimed program in (c).


Note that the iteration indexes of the tokens produced by node


A are increased by 1 in Program 7, and the iteration indexes of
the tokens produced by nodes A and B are increased by 1 in
Program 8.
Since the retiming operation preserves the number of registers
in a loop [28], it also preserves the iteration bound of the DFG
(since the loop computation times and the loop bounds remain
unaltered). The retiming operation can change the total number
of registers in a DFG. This can be simply illustrated for a local
retiming operation at a node whose number of outgoing arcs
is different from its number of incoming arcs. Retiming can
improve the iteration period of nonoverlapped multiprocessor
schedules. This is illustrated by the following example.
Example 5.2: Fig. 10(a) shows a retimed version of the DFG of Fig. 8(a), obtained by performing the retiming operation locally at node B (i.e., by removing the register from the input arc of node B and adding one register to the two outgoing arcs of node B). The number of registers in each loop remains unaltered (but the total number of registers in the DFG has changed). Furthermore, node B in the retimed DFG computes the time series ba(n + 1) and bc(n + 1) (at time index n), and ba(1) and bc(1) are now initial conditions. The equivalent retimed program is described by Program 9.

Program 9:

Initial conditions: ba(1), bc(1), and ca(0).

Fig. 10. (a) A retimed equivalent program of the DFG of Fig. 8(a). (b) Precedence graph of the retimed DFG. (c) A nonoverlapped schedule for one iteration with iteration period of 40 units.

The two new initial conditions are precomputed using

    ba(1) = fba[ab(0)],     bc(1) = fbc[ab(0)],

to preserve the input-output behavior of the system. The precedence graph and the nonoverlapped periodic schedule corresponding to the retimed DFG in Fig. 10(a) are respectively shown in Fig. 10(b) and (c). The iteration period of the retimed DFG is 40 units, which is 5 units greater than the iteration bound, but 20 units less than the schedule of Fig. 8(c).
Retiming transformation can reduce the iteration period in a
programmable multiprocessor implementation, but cannot guarantee a rate-optimal schedule.
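A minimal Python sketch of the local retiming (cutset) operation is given below; the dictionary encoding is ours, and the register placement follows the description of Fig. 8(a) implied by Example 5.2 (one register on arc A → B and one on arc C → A).

# Arcs as {(source, destination): registers}.
dfg = {("A", "B"): 1, ("B", "A"): 0, ("B", "C"): 0, ("C", "A"): 1}

def retime_node(arcs, node, i=1):
    # Remove i registers from every incoming arc of 'node' and add i registers
    # to every outgoing arc; valid only if all incoming arcs carry >= i registers.
    incoming = [a for a in arcs if a[1] == node]
    outgoing = [a for a in arcs if a[0] == node]
    if any(arcs[a] < i for a in incoming):
        raise ValueError("local retiming at node %s is not valid" % node)
    new = dict(arcs)
    for a in incoming:
        new[a] -= i
    for a in outgoing:
        new[a] += i
    return new

print(retime_node(dfg, "B"))
# {('A', 'B'): 0, ('B', 'A'): 1, ('B', 'C'): 1, ('C', 'A'): 1}
# Loop register counts (1 and 2) are unchanged, as noted in Example 5.2.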

VI. DATA-FLOW PROGRAM UNFOLDING
Nonoverlapped multiprocessor schedules constructed from precedence graphs of DFGs exploit intra-iteration precedence relations, and fail to exploit inter-iteration precedence. One can reduce the iteration period of multiprocessor schedules by exploiting the inter-iteration precedence constraints. The program unfolding transformation exploits the inter-iteration precedence constraints, and can lead to rate-optimal schedules (assuming availability of a large number of processors and complete interconnection among the multiple processors) [9], [10]. This section studies systematic unfolding, and properties of unfolded data-flow programs.
A. Construction of Unfolded Programs
An unfolded DFG with an unfolding factor J describes J
consecutive iterations of the original program. The unfolded


DFG with unfolding factor J is referred to as the J-unfolded DFG, and the corresponding program is referred to as the J-unfolded program. The numbers of nodes and arcs in the J-unfolded DFG are, respectively, JN and JE, where N and E, respectively, represent the number of nodes and arcs in the original DFG. Each node U in the original DFG is replaced by J nodes U1 through UJ in the J-unfolded DFG. The node U1 executes the iterations 1, (J + 1), (2J + 1) in the first, second, and third cycles, respectively. In general, the node Uj executes the (nJ - J + j)th iteration of task U in cycle n. Consider an arc U → V in the original graph; the node U computes the time series uv(n) at cycle n. In the J-unfolded DFG, the node Uj computes uv(nJ - 2J + j) at cycle (n - 1), and uv(nJ - J + j) at cycle n. Thus, if we input uv(nJ - J + j) to a register, the output of the register is uv(nJ - 2J + j). In other words, each register in the J-unfolded DFG is J-slow [30], [31], [28], [29], i.e., these registers output tokens produced J iterations earlier when referred to the original program. Unlike in retiming, the initial conditions are not altered in the program unfolding transformation. Furthermore, unfolding preserves the intra-iteration as well as inter-iteration precedence constraints, and does not introduce any new precedence constraints.
The unfolded DFG can be constructed very easily from the original DFG. Before we describe the general unfolding algorithm, we illustrate unfolding of the DFGs of Fig. 7(a) and Fig. 8(a) as examples.
Example 6.1: Consider unfolding of the DFG in Fig. 7(a) by a factor of 3. The 3-unfolded DFG executes three consecutive iterations, and is obtained by replacing n by (3n - 2), (3n - 1), and 3n in the original program. The 3-unfolded program is described by Program 10 and is shown in Fig. 11.

Fig. 11. (a) A 3-unfolded equivalent data-flow program of the program in Fig. 7. (b) Acyclic precedence graph.

Program 10:

The precedence graph of the 3-unfolded DFG is shown in Fig. 11(b), and the corresponding overlapped, fully-static, periodic schedule is shown in Fig. 7(b) (note that the schedule is periodic with respect to the execution of three iterations). Furthermore, the registers in arcs B3 → A1, A2 → B1, A3 → B2, and B3 → B1 are, respectively, initialized by ba(0), ab(-1), ab(0), and bb(0).

Example 6.2: Consider 2-unfolding of the DFG in Fig. 8(a). The 2-unfolded program (Program 11) is obtained by replacing n in the original program by (2n - 1) and 2n, and is shown in Fig. 12(a).

Program 11:

The acyclic precedence graph for the 2-unfolded DFG is shown in Fig. 12(b). The corresponding two-processor periodic schedule for this program is shown in Fig. 12(c), and requires 70 units (for execution of two iterations), or equivalently, an iteration period of 35 units (25 units smaller than the schedule of Fig. 8(c), and equal to the iteration bound of the DFG). This schedule is rate-optimal.

Now we present the general unfolding algorithm below.

Unfolding Algorithm:
Step 1: For each node U in the original DFG, draw J corresponding nodes, and label them U1, U2, ..., and UJ.
Step 2: For each arc U → V in the original DFG containing no register, draw arcs Uk → Vk (k = 1 to J) with no register.
Step 3: For each arc U → V in the original DFG with i registers, perform Step 3a or Step 3b.
Step 3a: If i < J (the number of arc registers is less than the unfolding factor): Draw arcs U(q-i) → Vq (for q = i + 1 to J) with no arc register. Draw arcs U(J-i+q) → Vq (for q = 1 to i) with a single register in each arc (since U(J-i+q) is used for V(J+q), which is executed by node Vq after one cycle).
Step 3b: If i ≥ J (the number of arc registers is greater than or equal to the unfolding factor): Draw arcs U(J - ((i - q) mod J)) → Vq with ⌈(i - q + 1)/J⌉ registers (for q = 1 to J). (Note that execution of Vq consumes the result of an invocation of U produced ⌈(i - q + 1)/J⌉ cycles earlier.)
The notation ⌈x⌉ represents the ceiling function of x (which is the smallest integer greater than or equal to x). The unfolding operation can be carried out in linear time with respect to the number of arcs in the DFG.
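To make the index manipulations in Steps 3a and 3b concrete, here is a minimal Python sketch (the function name, the tuple representation, and the string node labels are ours, not the paper's); it reproduces the 6-unfolding of the single-node loop of Fig. 13(a) discussed later in Example 6.4, and can be checked against the register-conservation property (Property 6.1).

import math

def unfold(arcs, J):
    # arcs: list of (source, destination, registers) describing the original DFG.
    # Returns the arc list of the J-unfolded DFG (Steps 1-3; Step 2 is the
    # special case i = 0 of Step 3a).
    unfolded = []
    for (u, v, i) in arcs:
        for q in range(1, J + 1):
            if i < J and q > i:
                # Step 3a: arc U(q-i) -> Vq with no register
                unfolded.append((f"{u}{q - i}", f"{v}{q}", 0))
            elif i < J:
                # Step 3a: arc U(J-i+q) -> Vq with a single register
                unfolded.append((f"{u}{J - i + q}", f"{v}{q}", 1))
            else:
                # Step 3b: arc U(J - ((i-q) mod J)) -> Vq with ceil((i-q+1)/J) registers
                p = J - ((i - q) % J)
                unfolded.append((f"{u}{p}", f"{v}{q}", math.ceil((i - q + 1) / J)))
    return unfolded

# A single-node loop A -> A with two registers, unfolded by J = 6, yields two
# single-register loops (A1 -> A3 -> A5 -> A1 and A2 -> A4 -> A6 -> A2).
for arc in unfold([("A", "A", 2)], 6):
    print(arc)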


Fig. 12. (a) An equivalent 2-unfolded program of the program in Fig. 8(a). (b) Precedence graph. (c) Rate-optimal periodic schedule with iteration period of 35 units.

B. Properties of Unfolded DFGs

Property 6.1: The unfolding transformation preserves the number of registers in the DFG.
Proof: If an arc in the original DFG contains no register, then the corresponding J arcs in the unfolded DFG also contain no registers (from Step 2 of the unfolding algorithm). If an arc contains i registers and i < J, then the number of registers in the J corresponding arcs in the unfolded DFG equals i (from Step 3a of the unfolding algorithm). If an arc contains i registers and i ≥ J, then the number of registers in the J corresponding arcs equals the sum of ⌈(i - q + 1)/J⌉ over q = 1 to J, which is again i, from Step 3b of the unfolding algorithm.
Intuitively, the registers represent initial states (or stored tokens) at the beginning of the execution. These tokens activate the invocations, and are updated each cycle. Let DT denote the total number of registers in the DFG. Then each iteration of the DFG updates DT tokens to be used during the next iteration. The completion of each iteration updates DT tokens, and the completion of execution of the Jth iteration of the unfolded DFG must also update DT tokens to be used for the next J iterations of the unfolded DFG. Unlike retiming, unfolding does not alter the number of registers in a DFG.

Property 6.2: The iteration bound associated with a J-unfolded DFG is JT∞, where T∞ is the iteration bound of the original DFG. The unfolded DFG schedules J consecutive iterations of the original DFG. Since the minimum achievable iteration period is T∞, the minimum achievable iteration period for J iterations is JT∞.

Property 6.3: Let T_i' and D_i', respectively, correspond to the loop computation time and the number of registers associated with the ith loop in the J-unfolded DFG. Then

    T_i' ≤ JT∞ D_i'     (6.1)

must hold.
Proof: Let T∞' be the iteration bound of the unfolded DFG. Then T_i' ≤ T∞' D_i' must hold. But T∞' = JT∞ (due to Property 6.2), and hence (6.1) must hold.

Property 6.4: Any loop bound relation of the type (6.1) in the unfolded DFG can be obtained either by multiplying a loop bound relation in the original DFG by a constant, or by taking linear additive combinations of the loop bound relations in the original DFG so that the right side of the inequality is a multiple of J.
Proof: The right side of the loop bound for any loop in the unfolded DFG must be a multiple of J (when expressed in terms of T∞, the iteration bound of the original program). Assume that the ith loop of the original DFG has a bound T_i ≤ D_i T∞. Any linear additive combination of one or more loop bounds in the original DFG, which corresponds to a loop bound in the unfolded DFG, must be of the form

    a_1 T_1 + a_2 T_2 + ... + a_L T_L ≤ (a_1 D_1 + a_2 D_2 + ... + a_L D_L) T∞     (6.2)

where L is the number of loops of the original DFG, and a_1 D_1 + ... + a_L D_L is divisible by J. Any loop bound relation in the unfolded DFG which is not of the form (6.2) would imply an entirely new loop bound in the original DFG. But this is not possible, since unfolding creates no new precedence constraints or loop bounds.

Note that all linear additive combinations of the loop bound relations of the original DFG of the form (6.2) may not correspond to a loop bound in the unfolded DFG. Now we discuss four special cases of (6.2) in the context of a single loop bound relation. Let the loop bound of some loop in the original DFG be T ≤ D T∞, where T is the loop computation time, D is the loop register count, and T∞ is the iteration bound of the original DFG. The iteration bound of the unfolded DFG is T∞' = JT∞.
Case I: J divisible by D: Let us assume J = QD, where Q is an integer. Then a loop bound of the J-unfolded DFG will be of the form QT ≤ (JT∞), or QT ≤ T∞'. This implies that one loop of the unfolded DFG will contain Q instances of the nodes (of the loop of the original DFG) and a single register. Since the unfolded DFG contains J = QD instances of each node and D registers (due to register conservation, Property 6.1), it must contain D distinct loops with a single register in each loop.
Case II: D divisible by J: Assume PJ = D. The loop bound of the unfolded DFG is of the form T ≤ P T∞'. The unfolded DFG contains P registers in the loop, and J distinct loops.
Case III: D and J coprime: For this case, a loop bound in the unfolded DFG is of the form JT ≤ D(JT∞), or JT ≤ D T∞'. The unfolded DFG contains one distinct loop with D registers.
Case IV: General Case: Assume PJ = QD, where P and Q are coprime. The loop bound of the unfolded DFG is of the form QT ≤ P T∞'. The unfolded DFG contains J/Q = D/P distinct loops with P registers in each loop.
Example 6.3: Consider the DFG of Fig. 8(a) and its unfolded DFG of Fig. 12(a). The loop bounds in the original DFG are

    ta + tb ≤ T∞,     ta + tb + tc ≤ 2T∞.


The loop bounds of the 2-unfolded DFG are

    2ta + 2tb ≤ T∞',     ta + tb + tc ≤ T∞'

which are linear additive combinations of the original loop bound inequalities (note that T∞' = 2T∞).
Property 6.5: Any linear additive combination of noncritical loop bound inequalities of the original DFG can never correspond to a critical loop bound of the unfolded DFG.
Proof: Consider K noncritical loops L1, L2, ..., and LK. The loop bounds of these loops can be written as Tk < T∞ Dk (for k = 1 to K), where Tk and Dk are the computation time and register count of the kth loop. The linear additive combination p_1 T_1 + ... + p_K T_K < T∞ (p_1 D_1 + ... + p_K D_K) can never represent a critical loop bound, since the linear combination represents a strict inequality and the critical loop bound is always represented by a strict equality.
Corollary 6.5.1: A critical loop bound equality in the unfolded DFG corresponds to a linear additive combination of critical loop bound equalities of the original DFG. However, any linear additive combination of the loop bound equalities of the original DFG may not correspond to a critical loop bound equality of the unfolded DFG.
Corollary 6.5.2: A linear additive combination of a critical loop bound and noncritical loop bounds cannot represent a critical loop bound. (This is because the critical loop bound represents a strict equality, noncritical loop bounds represent strict inequalities, and the combination, therefore, represents a strict inequality.)
Fig. 13. (a) A program with one loop and two registers. (b) A 6-unfolded program with two loops, each loop containing a single register and three instances of each node.

Property 6.6: Any loop in the original DFG with D loop registers leads to D distinct loops in the unfolded DFG for an unfolding factor of KD (which is a multiple of the number of loop registers). Each of these distinct loops contains a single register, and K instances of each node belonging to the loop in the original DFG. (This follows from special Case I of Property 6.4.)
Example 6.4: Consider the simple DFG in Fig. 13(a) and its 6-unfolded DFG in Fig. 13(b). The original DFG has two registers in the loop, and the unfolded DFG has two distinct loops with a single register in each loop. Each loop in the 6-unfolded DFG contains three instances of the nodes of the original loop. The reader can verify this property for the DFGs in Fig. 7(a) and Fig. 8(a), and the respective unfolded DFGs in Fig. 11(a) and Fig. 12(a).
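As stated in the Abstract (and developed further in Section VIII), the optimum unfolding factor is the least common multiple of the loop register counts and is independent of the node execution times; a short computation of it is sketched below (illustrative Python, names ours).

from math import gcd
from functools import reduce

def optimum_unfolding_factor(loop_register_counts):
    # Least common multiple of the register counts of all loops.
    return reduce(lambda a, b: a * b // gcd(a, b), loop_register_counts, 1)

print(optimum_unfolding_factor([3, 1]))   # Fig. 7(a): J = 3, as used in Example 6.1
print(optimum_unfolding_factor([2, 1]))   # Fig. 8(a): J = 2, as used in Example 6.2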

VII. PERFECT-RATE DATA-FLOW PROGRAMS

In this section, we introduce the notion of perfect-rate data-flow programs, and show that these can always be scheduled in a fully-static and rate-optimal manner.
Definition 7.1: Any data-flow program with one register in each loop is referred to as a perfect-rate data-flow program; a DFG describing a perfect-rate program is referred to as a perfect-rate DFG.
A. Scheduling of Perfect-Rate Data-Flow Programs
The DFG shown in Fig. 14(a) is an example of a perfect-rate
graph. This DFG has one initial node (node D ) , one terminal node
(node E), and three loops, and all the loops are critical (assuming
unit execution time for each node or task). The iteration bound
for this DFG is 3 units of time (u.t.). The precedence graph for
the DFG is shown in Fig. 14(b), and the length of the critical
path is 5 u.t. (any nonoverlapped CPM schedule would require an
iteration period of 5 u.t.). A rate-optimal overlapped schedule is
shown in Fig. 14(c). Note that the DFG did not need to be retimed

Fig. 14. (a) A perfect-rate program. (b) Acyclic precedence graph.


(c) Overlapped periodic schedule with iteration period of 3 units.

or unfolded to obtain a rate-optimal schedule. In fact, perfect-rate programs have the property that they directly lead to


Fig. 15. Several retimed versions of the perfect-rate program of Fig. 14(a),
and corresponding rate-optimal overlapped schedules.

rate-optimal fully-static schedules without requiring any retiming or unfolding, and it is this property that makes the notion of perfect-rate programs useful and important. The schedule in Fig. 14(c)
achieves an iteration period of 3 (which is rate-optimal). Fig. 15
shows several retimed versions of the DFG in Fig. 14(a), and the
corresponding rate-optimal fully-static schedules. Even though
all schedules in Figs. 14 and 15 are rate-optimal, the schedule in
Fig. 15(c) is the only nonoverlapped rate-optimal schedule.
The above discussion leads to the following question: is it
always possible to retime perfect-rate data-flow programs such
that we can always construct rate-optimal nonoverlapped periodic
schedules? The answer is no. Fig. 16(a) shows a perfect-rate
program graph, and Fig. 16(b) shows the precedence graph. The
iteration bound of the program is 60 units. The critical path
computation time is 66 units [see Fig. 16(b)], and the iteration
period of a nonoverlapped schedule would correspond to 66 units.
The reader can verify that no retiming of this perfect-rate program
can lead to a nonoverlapped periodic schedule with iteration
period less than 66 units.
On the other hand, it is always possible to use register-constrained retiming to obtain a nonoverlapped, periodic, rate-optimal schedule for a perfect-rate DFG, if the execution times of all the nodes of the DFG are equal. In register-constrained retiming, we force a node to have a single register (or no register) in each of its incoming (outgoing) arcs. In other words, if one incoming (outgoing) arc of a node has a single register, then all other incoming (outgoing) arcs must also have a single register.
then that node corresponds to an initial (or terminal) node in the
precedence graph. If the computation time in any path in the
precedence graph is less than or equal to the iteration bound,
then we can achieve a nonoverlapped rate-optimal schedule. A
nonoverlapped schedule is not feasible, if there exists a path (in
the precedence graph) with computation time greater than the

Fig. 16. (a) A perfect-rate program. The execution times of nodes A, B , C,


D , E , F, G , H , I , and J are, respectively, 10, 5, 3, 11, 6, 9, 10, 6, 42, and 35
units. The iteration bound is 60, and all three loops are critical. No retiming
can achieve a nonoverlapped schedule with iteration period less than 66 units.
(b) Acyclic precedence graph.

Fig. 17. Illustration of register-constrained retimed overlapped schedules of


perfect-rate programs with equal node execution times.

iteration bound. Such a path (with no path register in the DFG)


must consist of some nodes of one loop and some nodes of
another interacting loop (because no loop execution time can be
greater than the iteration bound by definition). With no loss of
generality, consider a loop A + B --+ C + D + E --+ F + A
and a path A + B --+ C iD + E -+ G + H (see Fig. 17).
Consider the interacting loop I + C --+ D + E + G --+
H + I. Assume node A has a register at all its inputs, then
node C cannot have registers at its input arcs, and E cannot
have registers at its outgoing arcs, because doing so would imply
either the DFG is not be perfect-rate or the register constraint is
violated. Register-constrained retiming would move the register
of the interacting loop from the arc H + Z to G --+ H, and that
would result in a nonoverlapped rate-optimal schedule. A more
detailed treatment of the register-constrained retiming is beyond
the scope of this paper.
The retimed DFG in Fig. 15(c) satisfies the register constraint
and leads to nonoverlapped rate-optimal schedules, whereas
the retimed DFGs in Fig. 14 and Fig. 15(a) and (b) do not
satisfy the register constraint and do not lead to nonoverlapped rate-optimal schedules. Fig. 18(a) shows another example of a register-constrained retimed perfect-rate DFG, which can be scheduled rate-optimally in a nonoverlapped manner. Fig. 18(b) shows a retimed version, which does not satisfy the register constraint and cannot be scheduled rate-optimally in a nonoverlapped manner. Another retimed version in Fig. 18(c) satisfies the register constraint, and can be scheduled rate-optimally in a nonoverlapped manner.

Fig. 18. Rate-optimal nonoverlapped periodic schedules can be obtained with register-constrained retiming for perfect-rate programs with equal node execution times. The program in (a) satisfies the register constraint and leads to a nonoverlapped rate-optimal schedule. The retimed program in (b) does not satisfy the register constraint and leads to a non-rate-optimal nonoverlapped schedule. The retimed program in (c) satisfies the register constraint and leads to rate-optimal nonoverlapped periodic schedules.

In a coarse-grain DFG, the node execution times can be
arbitrary, and it is not possible to satisfy the register constraint
of having registers in all incoming or outgoing arcs of a node.
This is because one loop may have many interacting loops, and
satisfying the register constraint may require many registers in a
loop, which is not possible for perfect-rate DFGs. For example,
in Fig. 16, the register constraint cannot be satisfied, since loop
L1 has two interacting loops L2 and L3. It is not possible to simultaneously satisfy the register constraint for nodes A and G. If the DFG were fine-grain with equal execution times, then node J in
Fig. 16 would be split into several nodes, and register-constrained
retiming would then be feasible.
Definition 7.2: A schedule of a list of Q nodes N1 → N2 → ··· → NQ is said to be contiguous if the nodes are scheduled
in that order without any intermediate gap or idle time. Note
that any node in this list can be scheduled in any processor in a
multiprocessor implementation. A schedule of a list of nodes is
said to be noncontiguous, if it is not contiguous.
Now consider the following algorithm for scheduling of the
recursive nodes of the perfect-rate DFG (recall that the scheduling of
the nodes not belonging to any loop is not considered here).

Algorithm 7.1: First, we remove all the transitive and extended transitive arcs from the perfect-rate graph. The loops are then
ordered, and scheduled according to the decreasing order of their
loop computation times (i.e., decreasing order of criticalness or
increasing order of loop slack time). The nodes in each loop are
also ordered to form a list with the node containing the loop
register at its input arc as the leading node of the list, and the
other nodes are placed so as to satisfy the precedence constraints.
A separate processor is assigned for scheduling of each loop.
First, the nodes of the critical loop are scheduled contiguously
in processor P1 (if there is more than one critical loop, they
are ordered among themselves at random). Then, the nodes of
the next critical loop are scheduled in processor P2 such that
the schedules completed so far are preserved. In other words, if
some of the nodes of this loop also belong to the critical loop
(and therefore have already been scheduled in processor P1), then
their schedule should remain unaltered. This process is repeated
until scheduling of all the loops is complete. If at any step, all
the nodes of a loop have already been scheduled (because all
the nodes of the loop belong to one or more critical loops), then
simply proceed to the next loop.
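A minimal sketch of Algorithm 7.1 is given below. The function name, data structures, and example DFG are ours, not the paper's; the sketch assumes the loop lists are already ordered with the register-carrying node first and that transitive arcs have been removed, and it omits the skewing and post-processing issues discussed in the remarks that follow:

```python
def schedule_perfect_rate_loops(loops, exec_time):
    """Simplified sketch of Algorithm 7.1. 'loops' is a list of node lists,
    each ordered so that the node with the loop register on its input arc
    comes first; 'exec_time' maps node -> execution time."""
    # Order loops by decreasing loop computation time (decreasing criticalness).
    loops = sorted(loops, key=lambda lp: sum(exec_time[v] for v in lp), reverse=True)
    start = {}        # node -> start time, shared across all loops
    schedule = []     # (processor index, [(node, start, finish), ...]) per loop
    for proc, loop in enumerate(loops):
        if all(v in start for v in loop):
            continue  # every node already fixed by more critical loops
        t = 0
        row = []
        for v in loop:
            if v in start:
                # Node already placed on an earlier processor: keep its slot
                # and let the remaining nodes of this loop follow it.
                t = max(t, start[v] + exec_time[v])
                continue
            start[v] = t           # schedule contiguously after the previous node
            row.append((v, t, t + exec_time[v]))
            t += exec_time[v]
        schedule.append((proc, row))
    return schedule

# Hypothetical perfect-rate DFG: a critical loop (A, B, C) and an interacting
# loop (D, B, C) that shares nodes B and C with the critical loop.
times = {"A": 2, "B": 1, "C": 1, "D": 1}
print(schedule_perfect_rate_loops([["A", "B", "C"], ["D", "B", "C"]], times))
```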
Some remarks on Algorithm 7.1 are now stated.
Remark 7.1: Note that we do not assume the scheduling of the
leading nodes of all the loops to begin at the same time unit.
In other words, the scheduling of the leading nodes of the loops
can be skewed in time and can take place in different processors.
This skewed scheduling separates consecutive iterations of the
DFG by a nonvertical boundary, and leads to overlapped
schedules. Also note that it may be possible to merge the tasks
of two or more processors to reduce the number of processors in
a postprocessing step, but we do not consider this here.
Remark 7.2: The complexity of Algorithm 7.1 is at most
exponential in the number of arcs with registers (since the
complexity to order all the loops in a DFG in terms of their
criticalness can be at most exponential). For many practical cases,
however, the number of loops formed by an arc with a single
register can be limited, and the complexity would be much less
than exponential.
Remark 7.3: Nodes in any loop of a perfect-rate DFG are
scheduled noncontiguously (i.e., with intermediate gaps) if and
only if a path consisting of nodes of this loop has an associated
parallel path with a longer path computation time.
Proof: In Algorithm 7.1, nodes of all loops are ordered to form a list with the node containing the loop register as its leading node. A path corresponding to the loop refers to a set of connected nodes which are members of this list. To prove the if portion, consider the path N1 → N2 → N4 corresponding to some loop, and its associated parallel path N1 → N3 → N4, and assume the computation time of N2 to be shorter than that of N3. This results in a gap in the scheduling of the nodes of this loop, since N4 can be scheduled only after the execution of N3 is complete. To prove the only if portion, assume that there exists some gap between the completion of N2 and the invocation of N3 in the scheduling of some path ··· → N1 → N2 → N3 → N4 → ··· (call this path P1) corresponding to a loop. This would occur if the invocation of N3 is constrained by the completion of another node (say N5). Denote the path ··· → N5 → N3 → N4 → ··· as P2. If N3 were the only common node between P1 and P2 (i.e., if P1 and P2 had no node in common except N3), then the schedule of the nodes preceding N3 in P1 (··· → N1 → N2) could be right-shifted so that the completions of N2 and N5 coincide. However, the existence of the gap in the schedule implies that the paths P1 and P2 have at least one more node in common (which is a predecessor of N3). This implies the existence of an associated parallel path. This associated parallel path has a longer computation time.
Note that two parallel paths in a perfect-rate DFG must have
the same number of registers (which can be either 1 or 0). If
this were not the case, the two loops containing the two parallel
paths would have different numbers of loop registers, and the
DFG would be imperfect.
Remark 7.4: A contiguous scheduling of the nodes of the
critical loop of a perfect-rate DFG is admissible.
Proof: From remark 7.3, two parallel paths with different
path computation times lead to a noncontiguous schedule, and
the nodes of the path with shorter path computation time are
scheduled with an intermediate gap. Since the paths of the
critical loop have the largest path computation time, they can
be scheduled with no intermediate gap. Nodes scheduled with
gaps in the schedule must belong to loops which are noncritical.

Theorem 7.1: For any perfect-rate graph, we can construct
fully-static rate-optimal schedules requiring no retiming or unfolding transformation.
Proof: The nodes of the critical loop can be scheduled contiguously requiring a period equal to the critical loop computation
time or the iteration bound. This schedule can be replicated over
successive iterations with no gap at all in the same processor
with a time displacement equal to the iteration bound. For each
gap in the scheduling of nodes (of noncritical loops), there exists
a path with longer computation time. This implies that the sum
of the computation time and the gap time of any loop cannot
exceed the critical loop computation time or the iteration bound,
and therefore Algorithm 7.1 results in a rate-optimal schedule.
The schedule of the single iteration can be replicated with zero
processor displacement and with a time displacement equal to
the iteration bound, and hence the schedule is fully-static.
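As a one-line illustration of the fully-static property used in this proof (zero processor displacement, time displacement equal to the iteration bound); the function and the numbers below are illustrative only:

```python
def fully_static_start(start_iter0, T_inf, k):
    """Iteration k of a node starts k*T_inf later than iteration 0,
    on the same processor (fully-static schedule)."""
    return start_iter0 + k * T_inf

# A node scheduled at time 2 in iteration 0, with iteration bound 3:
print([fully_static_start(2, 3, k) for k in range(4)])   # [2, 5, 8, 11]
```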
Theorem 7.2: The number of loops in a perfect-rate graph
represents an upper bound on the number of processors to
schedule the recursive nodes in a fully-static and rate-optimal
manner.
Proof: The upper bound on the number of processors required
by Algorithm 7.1 equals the number of loops in the program
graph. This equals the upper bound on the number of processors
for a fully-static, rate-optimal schedule.
Remark 7.5: Note that this upper bound on the number of
processors is independent of the execution times of the nodes in
the program graph, and can be determined by only considering
the topology of the program graph.

B. Reducing Arbitrary Data-Flow Programs to Perfect-Rate Programs
Any arbitrary data-flow program can be reduced to equivalent
perfect-rate data-flow programs. This reduction can then be used
to obtain rate-optimal fully-static schedules of arbitrary data-flow
programs (see Section VIII).
Theorem 7.3: If an arbitrary DFG is unfolded by the least
common multiple of the number of loop registers in the original
DFG, the unfolded DFG is a perfect-rate DFG.
Proof: From Property 6.6, if the unfolding factor is a
multiple of the number of registers in a particular loop, then the corresponding loops in the
unfolded DFG contain only a single register. If the unfolding
factor is the least common multiple of the number of loop
registers in all the loops (which is therefore a multiple of the
number of registers in each loop), then all loops in the unfolded DFG would contain only one register. The unfolded DFG then corresponds to a perfect-rate DFG (by definition). Note that this unfolding factor is independent of the execution times of the nodes in the DFG.

Fig. 19. (a) A program graph with three loops. The execution times of nodes A, B, C, D, and E are, respectively, 20, 5, 10, 10, and 2 units. (b) Acyclic precedence graph.

Example 7.1: Consider the unfolding of the DFG in Fig. 7(a).
The two loops in Fig. 7(a) contain one and three registers, and the
least common multiple is three. Fig. 11(a) shows the 3-unfolded DFG, which is perfect-rate.
Example 7.2: The number of loop registers in the DFG in Fig. 8(a) are one and two, and the least common multiple is two. Fig. 12(a) shows a 2-unfolded DFG, which is perfect-rate.
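These examples can be reproduced with a few lines of code; the register counts below are the ones quoted in Examples 7.1 and 7.2 and, later, for the DFG of Fig. 19 (the function name is ours):

```python
from math import lcm   # Python 3.9+ for math.lcm

def optimum_unfolding_factor(loop_registers):
    """Optimum unfolding factor: least common multiple of the register
    (delay) counts of all loops in the DFG."""
    return lcm(*loop_registers)

print(optimum_unfolding_factor([1, 3]))   # Example 7.1: unfold by 3
print(optimum_unfolding_factor([1, 2]))   # Example 7.2: unfold by 2
print(optimum_unfolding_factor([2, 3]))   # DFG of Fig. 19: unfold by 6
```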

VIII. FULLY-STATIC RATE-OPTIMAL SCHEDULING
This section uses the results of Sections VI and VII and proves
that the tasks of any DFG can be scheduled rate-optimally in a
fully-static manner.
One might be led to believe that we can always achieve a rate-optimal schedule using an unfolding factor equal to the number
of registers in the critical loop and then retiming the unfolded
DFG. This is because the critical loop in the equivalent unfolded
DFG would contain a single register, and the critical loop can
be scheduled contiguously. However, this conjecture is not true!
This conjecture is disproved using a counterexample.
Consider the DFG example in Fig. 19, where the loop register counts in the DFG are 2 and 3, respectively. The execution times of nodes A, B, C, D, and E in Fig. 19(a) are,
respectively, 20, 5, 10, 10, and 2, and the iteration bound is 16,
and corresponds to the critical loop L1. The precedence relation
of the DFG is shown in Fig. 19(b), and the length of the critical
path (or equivalently the iteration period for this DFG) is 20
units. Since the number of registers in the critical loop is 2, we
construct an equivalent unfolded DFG with J = 2 as shown in
Fig. 20(a). The precedence graph for the unfolded DFG is shown
in Fig. 20(b), and leads to an iteration period of 20 units.

Fig. 20. (a) A 2-unfolded equivalent program of the program in Fig. 19(a). (b) The acyclic precedence graph.

Fig. 21. (a) An equivalent retimed program of the unfolded program in Fig. 20(a). This program is obtained by performing local retiming operations at nodes A1 and D1. (b) The acyclic precedence graph.

We can improve the iteration period by retiming the unfolded DFG (since this unfolded DFG is not perfect-rate). Fig. 21(a) shows
the retimed version of the unfolded DFG, and Fig. 21(b) shows
the corresponding precedence relation. From the critical path in
the unfolded DFG, we observe that the cycle time corresponds
to 35 units, or equivalently the iteration period is 35/2 = 17.5
units (which is greater than the bound by 1.5 units). This is the
minimum iteration period that can be achieved with an unfolding
factor of 2.
In Section VII, we established that unfolding by the least
common multiple of the number of registers in the loops can
reduce the data-flow program to a perfect-rate program, which
can then be scheduled rate-optimally. In the example DFG of
Fig. 19(a), the least common multiple is 6. The 6-unfolded DFG
is shown in Fig. 22(a), and one can verify that it is indeed perfect-rate. The precedence graph of this unfolded DFG is shown in
Fig. 22(b). The length of the critical path is 96 (for scheduling
of six iterations), which corresponds to an iteration period of 16
units, equal to the iteration bound. Now we formally prove that
we can always construct rate-optimal schedules for arbitrary data-flow programs using program unfolding with the optimum unfolding factor.
Theorem 8.1: Any unfolded DFG with an unfolding factor
equal to the least common multiple of the register counts in all
the loops can be scheduled rate-optimally.
Proof: An unfolded DFG with unfolding factor equal to the
least common multiple of the number of registers in all loops corresponds to a perfect-rate program graph (due to Property 6.6).
Since all perfect-rate programs can always be scheduled rate-optimally (due to Theorem 7.1), the unfolded DFG can also be
scheduled rate-optimally.
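For concreteness, the unfolding transformation itself can be sketched as follows. This follows the standard delay-redistribution rule for J-unfolding (an arc u → v carrying w registers becomes the arcs u_k → v_((k+w) mod J) carrying floor((k+w)/J) registers, for k = 0, ..., J-1); the code is a sketch in our own notation, not a restatement of Property 6.6:

```python
def unfold(arcs, J):
    """Construct the J-unfolded DFG. 'arcs' is a list of (u, v, registers)
    triples; returns ((u, k), (v, m), registers) triples, where (u, k)
    denotes the k-th copy of node u in the unfolded graph."""
    unfolded = []
    for (u, v, w) in arcs:
        for k in range(J):
            unfolded.append(((u, k), (v, (k + w) % J), (k + w) // J))
    return unfolded

# A loop with 3 registers, unfolded by J = 3: three loops, one register each.
for arc in unfold([("A", "A", 3)], 3):
    print(arc)
# (('A', 0), ('A', 0), 1)
# (('A', 1), ('A', 1), 1)
# (('A', 2), ('A', 2), 1)
```

Applying this rule to a loop with D registers and an unfolding factor that is a multiple of D produces loops containing exactly one register each, which is the perfect-rate property exploited above.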

Remark 8.1: It may be possible to obtain rate-optimal schedules with an unfolding factor less than the least common multiple.
As an example, the DFG in Fig. 3(a) is scheduled rate-optimally
in a fully-static manner [see schedule in Fig. 5(b)] with no
unfolding (the optimum unfolding factor for this example is 2).
However, it is possible to schedule the DFG in Fig. 3(a) in a rate-optimal, fully-static manner with unfolding factor 2 for all possible node execution times. If we change the execution time of the node D to 3 units (that would make the loop D → A → B → D
critical and the iteration bound would be 2.5), then one cannot
obtain a rate-optimal fully-static schedule without unfolding.
The optimum unfolding factor can be used to construct rate-optimal schedules only when no node not belonging to any loop has an execution time greater than JoT∞ (where Jo is the optimum unfolding factor and T∞ is the iteration bound). The iteration bound (for scheduling of Jo iterations) of the Jo-unfolded DFG is JoT∞. If any node (not belonging to any loop) has an execution time greater than JoT∞, then the DFG needs to be unfolded by a multiple of the least common multiple of the number of registers in all loops of the DFG (say KJo), such that the execution time of all nodes in the DFG is less than KJoT∞. Observe that any unfolding with unfolding factor KJo also reduces the DFG to an unfolded DFG,
which is perfect-rate (since unfolding a perfect-rate DFG always
results in a perfect-rate DFG).
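A small arithmetic illustration of this choice of K (the helper function and all numbers below are hypothetical, not taken from any figure):

```python
from math import lcm

def unfolding_factor_with_nonloop_nodes(loop_registers, t_max_nonloop, T_inf):
    """Smallest multiple K*Jo of the optimum unfolding factor Jo such that
    K*Jo*T_inf exceeds the largest execution time of any node outside the
    loops (hypothetical helper, not from the paper)."""
    Jo = lcm(*loop_registers)
    K = 1
    while K * Jo * T_inf <= t_max_nonloop:
        K += 1
    return K * Jo

# Loops with 1 and 2 registers (Jo = 2), iteration bound 5, and a node outside
# the loops with execution time 23: unfold by K*Jo = 6 so that 6*5 = 30 > 23.
print(unfolding_factor_with_nonloop_nodes([1, 2], 23, 5))   # 6
```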
Now we obtain an upper bound on the number of processors
for rate-optimal scheduling of arbitrary data-flow programs.
Theorem 8.2: Recursive nodes (i.e., nodes belonging to one
or more loops) of any DFG can be scheduled in a rate-optimal
fully-static manner by using at most P processors, where P is the
sum of the register counts in all the loops in the original DFG.
Proof: Since the unfolding factor is the least common
multiple of the register counts in all the loops, each loop with
K registers transforms to K distinct loops in the unfolded DFG.
Thus, the upper bound on the number of distinct loops in the
unfolded DFG (which is perfect-rate) is equal to the sum of
the register counts in all the loops of the original DFG. This
is the upper bound on the number of processors to schedule all

the recursive nodes, since the upper bound on the number of processors to achieve a rate-optimal fully-static schedule in a perfect-rate program equals the number of loops.

Fig. 22. (a) A 6-unfolded equivalent program of the program in Fig. 19(a). The unfolded program is perfect-rate. (b) The acyclic precedence graph. This unfolded program can be scheduled rate-optimally.

Example 8.1: Fig. 22(b) shows the precedence graph of the
unfolded DFG of Fig. 22(a). From the precedence graph, it is
clear that a rate-optimal fully-static schedule can be achieved
with 4 processors. The upper bound on the number of processors
for this example is 5, since the 2 loops in the original DFG
contain, respectively, 2 and 3 loop registers.
Remark 8.2: The upper bound on the number of processors is
determined purely by the topology of the program graph (that is
by the nodes, arcs, and the registers associated with arcs). This
processor bound does not depend upon the execution times of
the nodes. It may be possible to obtain tighter bounds which
account for the node execution times (this remains a problem to
be addressed in the future).

IX. CONCLUSION
We have shown that the inter-iteration precedence constraints
can be exploited by unfolding the data-flow signal processing
programs. Furthermore, unfolding by the optimum unfolding factor transforms a data-flow program to an equivalent perfect-rate
data-flow program, which maximally exploits the inter-iteration
precedence constraints. Unfolding any data-flow program beyond
the optimum unfolding factor does not lead to any further
improvement. The major result obtained in this paper is that the
optimum unfolding factor can always be used to construct rate-optimal fully-static periodic schedules of data-flow programs for
multiprocessor implementations (assuming availability of a large
number of processors and complete interconnection).
While this paper has contributed to our theoretical understanding of multiprocessor implementation of iterative data-flow signal
processing programs, construction of efficient multiprocessor schedules for practical situations still remains a difficult task (since these problems are NP-complete [32]). Construction of
minimum-time multiprocessor schedules for a fixed number of
processors (with or without fixed interconnection), and construction of multiprocessor schedules for fixed iteration period
with minimum number of processors (with or without fixed
interconnection) are problems of more practical interest. Future
efforts need to be directed towards construction of multiprocessor
schedules for these problems. Furthermore, the optimum unfolding factor and the processor bounds for rate-optimal scheduling
obtained in this paper are independent of node execution times.
It may be possible to obtain a lower optimum unfolding factor
and a tighter processor bound by accounting for node
execution times; these issues require further study. This paper
has mostly been concerned with scheduling of nodes which
belong to one or more loops. Scheduling of an entire program
(which considers nodes belonging to loop(s) and nodes not
belonging to any loop) is also of practical interest, and needs
to be studied in the future. Furthermore, a combination of algorithm transformations [33] and heuristic scheduling approaches [34],
[35] can be explored for practical multiprocessor scheduling
of signal processing programs, and for code generation using
programmable multiprocessor DSP processors.
In this paper, we have used the systematic unfolding approach
to unravel hidden concurrency in data-flow signal processing
programs. In [36]-[38], we have also used systematic unfolding to transform bit-serial signal processing architectures to digit-serial ones.

ACKNOWLEDGMENT
The authors are indebted to all three reviewers for their
careful review of the paper. Their numerous constructive suggestions and criticisms improved the clarity of presentation of
the contents of this paper. Thanks are also due to E.A. Lee,
G. C. Sih, and D.A. Schwartz for many useful discussions.

REFERENCES
[1] R. M. Karp and R. E. Miller, "Properties of a model for parallel computations: Determinacy, termination, and queueing," SIAM J. Appl. Math., vol. 14, no. 6, pp. 1390-1411, Nov. 1966.
[2] R. Reiter, "Scheduling parallel computations," J. ACM, vol. 15, no. 4, pp. 590-599, Oct. 1968.
[3] R. E. Crochiere and A. V. Oppenheim, "Analysis of linear digital networks," Proc. IEEE, vol. 63, no. 4, pp. 581-595, Apr. 1975.
[4] J. B. Dennis, "Data flow supercomputers," IEEE Comput. Mag., pp. 48-56, Nov. 1980.
[5] A. L. Davis and R. M. Keller, "Data flow program graphs," IEEE Comput. Mag., vol. 15, no. 2, pp. 26-41, Feb. 1982.
[6] W. B. Ackerman, "Data flow languages," IEEE Comput. Mag., vol. 15, no. 2, pp. 15-25, Feb. 1982.
[7] E. A. Lee and D. G. Messerschmitt, "Static scheduling of synchronous data flow programs for digital signal processing," IEEE Trans. Comput., vol. C-36, no. 1, pp. 24-35, Jan. 1987.
[8] S. Y. Kung, P. S. Lewis, and S. C. Lo, "Performance analysis and optimization of VLSI data flow arrays," J. Parallel Distributed Comput., vol. 4, pp. 592-618, 1987.
[9] K. K. Parhi and D. G. Messerschmitt, "Rate-optimal fully-static multiprocessor scheduling of data-flow signal processing programs," in Proc. 1989 IEEE Int. Symp. Circuits Syst., Portland, OR, May 1989.
[10] ---, "Fully-static rate-optimal scheduling of iterative data-flow programs via optimum unfolding," in Proc. 1989 Int. Conf. Parallel Processing, St. Charles, IL, Aug. 1989.
[11] S. C. Cheng, J. A. Stankovic, and K. Ramamritham, "Scheduling algorithms for hard real-time systems - A brief survey," in Hard Real-Time Systems Tutorial, J. A. Stankovic, Ed. New York: IEEE Computer Society Press, 1988, pp. 150-173.
[12] C. L. Liu and J. W. Layland, "Scheduling algorithms for multiprogramming in a hard real-time environment," J. ACM, vol. 20, pp. 46-61, 1973.
[13] E. L. Lawler and C. U. Martel, "Scheduling periodically occurring tasks on multiple processors," Inform. Processing Lett., vol. 12, pp. 9-12, 1981.
[14] W. Zhao, K. Ramamritham, and J. A. Stankovic, "Preemptive scheduling under time and resource constraints," IEEE Trans. Comput., pp. 949-960, Aug. 1987.
[15] J. W. S. Liu et al., "Scheduling real-time, periodic jobs using imprecise results," in Proc. IEEE Real-Time Syst. Symp., San Jose, CA, Dec. 1987, pp. 252-260.
[16] E. G. Coffman, Jr., Ed., Computer and Job-Shop Scheduling Theory. New York: Wiley, 1976.
[17] D. A. Schwartz and T. P. Barnwell, III, "A graph theoretic technique for the generation of systolic implementations for shift invariant flow graphs," in Proc. ICASSP-84, San Diego, CA, Mar. 1984.
[18] D. A. Schwartz, "Synchronous multiprocessor realizations of shift invariant flow graphs," Ph.D. dissertation, Georgia Inst. Technol., Tech. Rep. DSPL-85-2, July 1985.
[19] D. A. Schwartz and T. P. Barnwell, III, "Cyclostatic multiprocessor scheduling for the optimal implementation of shift invariant flow graphs," in Proc. ICASSP-85, Tampa, FL, Mar. 1985.
[20] D. A. Schwartz, "Cyclo-static realizations: Loop unrolling and CPM, optimal multiprocessor scheduling," in Proc. 1987 Princeton Workshop Algorithms, Architecture, and Technology Issues in Models of Concurrent Computations, Sept. 30-Oct. 1, 1987.
[21] S. H. Lee and T. P. Barnwell, III, "Optimal multiprocessor implementation from a serial algorithm specification," in Proc. ICASSP-88, New York, NY, Apr. 1988, pp. 1694-1697.
[22] H. Forren and D. A. Schwartz, "Transforming periodic synchronous multiprocessor programs," in Proc. ICASSP-87, Dallas, TX, Apr. 1987, pp. 1406-1409.
[23] M. Renfors and Y. Neuvo, "The maximum sampling rate of digital filters under hardware speed constraints," IEEE Trans. Circuits Syst., vol. CAS-28, no. 3, pp. 196-202, Mar. 1981.
[24] T. P. Barnwell, III and C. J. M. Hodges, "Optimal implementation of signal flow graphs on synchronous multiprocessors," in Proc. 1982 Int. Conf. Parallel Processing, Bellaire, MI, Aug. 1982.
[25] C. V. Ramamoorthy and G. S. Ho, "Performance evaluation of asynchronous concurrent systems using Petri nets," IEEE Trans. Software Eng., vol. SE-6, no. 5, pp. 440-449, Sept. 1980.
[26] J. P. Brafman, J. Szczupak, and S. K. Mitra, "An approach to implementation of digital filters using microprocessors," IEEE Trans. Acoust., Speech, Signal Processing, vol. 26, no. 5, pp. 442-446, Oct. 1978.
[27] J. Zeman and G. S. Moschytz, "Systematic design and programming of signal processors using project management techniques," IEEE Trans. Acoust., Speech, Signal Processing, vol. 31, no. 6, pp. 1536-1549, Dec. 1983.
[28] C. E. Leiserson, F. Rose, and J. Saxe, "Optimizing synchronous circuitry by retiming," in Proc. Third Caltech Conf. VLSI, Pasadena, CA, Mar. 1983, pp. 87-116.
[29] S. Y. Kung, "On supercomputing with systolic/wavefront array processors," Proc. IEEE, vol. 72, no. 7, July 1984.
[30] K. K. Parhi and D. G. Messerschmitt, "Concurrent cellular VLSI adaptive filter architectures," IEEE Trans. Circuits Syst., pp. 1141-1152, Oct. 1987.
[31] ---, "Pipelining and parallelism in recursive digital filters, Part II," IEEE Trans. Acoust., Speech, Signal Processing, pp. 1118-1135, July 1989.
[32] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco, CA: Freeman, 1979.
[33] K. K. Parhi, "Algorithm transformations for concurrent processors," Proc. IEEE, vol. 77, no. 12, pp. 1879-1895, Dec. 1989.
[34] T. C. Hu, "Parallel sequencing and assembly line problems," Oper. Res., vol. 9, pp. 841-848, 1961.
[35] M. C. McFarland et al., "High level synthesis of digital systems," Proc. IEEE, Feb. 1990.
[36] K. K. Parhi, "Nibble-serial arithmetic processor designs via unfolding," in Proc. IEEE Int. Symp. Circuits Syst., Portland, OR, May 1989, pp. 635-640.
[37] C.-Y. Wang and K. K. Parhi, "Digit-serial DSP architectures," in Proc. IEEE Conf. Application Specific Array Processors, Princeton, NJ, Sept. 1990.
[38] K. K. Parhi, "A systematic approach for design of digit-serial signal processing architectures," IEEE Trans. Circuits Syst., to be published.

PARHI AND MESSERSCHMR STATIC RATJZ-OPTIMAL. SCHEDULING OF PROGRAMS

195

Keshab K. Parhi received the B.Tech. (Honors) degree from the Indian Institute of Technology, Kharagpur, in 1982, the M.S.E.E. degree from the University of Pennsylvania, Philadelphia, in 1984, and the Ph.D. degree from the University of California, Berkeley, in 1988. He held short-term positions at the AT&T Bell Laboratories, Holmdel, NJ, the IBM T. J. Watson Research Center, Yorktown Heights, NY, and the Tata Engineering and Locomotive Company, Jamshedpur, India. He is currently an Assistant Professor of Electrical Engineering at the University of Minnesota, Minneapolis. His research interests include concurrent algorithm and architecture designs for communications, signal and image processing systems, digital integrated circuits, VLSI digital filters, design of dedicated architectures, and multiprocessor task scheduling in programmable software systems. He has published over 40 papers in these areas.
Dr. Parhi received the 1987 Eliahu Jury Award for excellence in systems research, the 1987 Demetri Angelakos Award for altruistic activities afforded fellow graduate students, the U.C. Regents Fellowship, the IBM Graduate Fellowship at the University of California, Berkeley, the 1989 Research Initiation Award of the National Science Foundation, and the 1991 IEEE Browder Thompson Prize Paper Award for his paper on algorithm transformations published in the PROCEEDINGS OF THE IEEE. He is a member of the VLSI-systems and applications technical committee of the IEEE Circuits and Systems Society, a member of the VLSI technical committee of the IEEE Signal Processing and Computer Societies, is currently an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, and is a member of Eta Kappa Nu and the Association for Computing Machinery.

David G. Messerschmitt (S'65-M'68-SM'78-F'83) received the B.S. degree from the University of Colorado in 1967, and the M.S. and Ph.D. degrees from the University of Michigan in 1968 and 1971, respectively. He is a Professor of Electrical Engineering and Computer Sciences at the University of California, Berkeley. Prior to 1977 he was at Bell Laboratories, Holmdel, NJ. Current research interests include applications of digital signal processing, digital communications (subscriber loop, fiber optics, and in VLSI and digital systems), architectural approaches to dedicated-hardware digital signal processing (especially video compression applications), video applications of broad-band packet networks, and computer-aided design of communications and signal processing systems. He has published over 110 papers, is co-author of two books, and has 10 patents. He also serves as a consultant to a number of companies.
Dr. Messerschmitt is a member of the National Academy of Engineering. He has served as Editor for Transmission of the TRANSACTIONS ON COMMUNICATIONS, and as a member of the Board of Governors of the Communications Society. He has also organized and participated in a number of short courses and seminars devoted to continuing engineering education.
