
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 12, NO. 8, AUGUST 1993
SALSA: A New Approach to Scheduling with Timing Constraints
John A. Nestor, Senior Member, IEEE, and Ganesh Krishnamoorthy
Abstract: This paper describes a new approach to the scheduling problem in high-level synthesis that meets timing constraints while attempting to minimize hardware resource costs.
The approach is based on a modified control/data flow graph
(CDFG) representation called SALSA. SALSA provides a simple move set that allows alternative schedules to be quickly explored while maintaining timing constraints. It is shown that
this move set is complete in that any legal schedule can be
reached using some sequence of move applications. In addition,
SALSA provides support for scheduling with conditionals,
loops, and subroutines. Scheduling with SALSA is performed
in two steps. First, an initial schedule that meets timing constraints is generated using a constraint solution algorithm
adapted from layout compaction. Second, the schedule is improved using the SALSA move set under control of a simulated
annealing algorithm. Results show the scheduler's ability to find
good schedules that meet timing constraints in reasonable execution times.
I. INTRODUCTION
The goal of high-level synthesis [1] is to translate a procedural specification of behavior into a register-transfer design that implements that behavior. Most approaches to high-level synthesis use a control/data flow
graph (CDFG) as an intermediate representation of the
behavioral specification and break the synthesis problem
into two subtasks: scheduling, which assigns CDFG nodes
representing operators to control steps, and allocation,
which assigns CDFG nodes representing operators and
edges representing data values to hardware (e.g., ALUs,
registers, and interconnections) to realize a datapath.
Scheduling is a particularly important part of this process
for two reasons. First, it fixes requirements for the various hardware resources used during allocation. Second,
and equally important, it fixes the relative timing of operators and thus the satisfaction of timing constraints [2],
[3]. Timing constraints are important because they allow
designers to specify both desired performance and interface information [2], [4].
Manuscript received February 5, 1991; revised October 15, 1992. This
work was supported in part by NSF Grant MIP-9010406 and the IIT Education & Research Initiative Fund. This paper was recommended by Associate Editor A. Parker.
J. Nestor is with the Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago IL 60616.
G. Krishnamoorthy is with Mentor Graphics Corp., 15 Independence
Blvd., Warren NJ 07059.
IEEE Log Number 9207276.

Fig. 1 illustrates the scheduling problem using a typical
CDFG, which is a directed graph in which nodes represent operators and edges represent ordering dependencies
between operators. Source and sink nodes represent the
beginning and end of activities in the graph. Edges between nodes represent three different types of ordering dependencies. Data edges represent the flow of data from
one operator to another, implying an ordering relationship
because the data must be computed before it is used. Control edges represent ordering relationships associated with
control operations such as conditionals. Timing edges [2],
[3] represent timing constraints between two operators that
must be satisfied in a correct design. A timing constraint
specifies the required relative timing between two operators. Minimum timing constraints specify a lower bound
on the relative timing between operators, while maximum
timing constraints specify an upper bound. All of these
dependencies imply an ordering in which the first operator
precedes the second operator in execution.
Scheduling assigns each operator node to a control step
that represents the controller state in which this operator
will execute. Fig. 1 illustrates a typical schedule by displaying control step boundaries as horizontal lines. Since
scheduling fixes the order in which operators will be implemented in the design, it must perform this task in a way
that meets all dependencies specified by edges in the
graph. When performed before hardware allocation, this
ordering sets a lower bound on the resources required to
implement the CDFG in hardware. Functional unit requirements are determined by the maximum number of
operators of each type (i.e., adders, ALUs, etc.) that are
scheduled in the same control step. Register requirements
are determined by the maximum number of values that are
live at the end of each control step, as represented by data
edges in the CDFG that cross control step boundaries. For
example, the schedule in Fig. 1 requires at least two
adders, one multiplier, and four registers. A weighted sum
of these requirements can be used as a cost function to
estimate schedule quality. Some schedulers also include
an estimate of interconnection requirements based on the
total number of data transfers in each control step [5].
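The resource lower bounds described above can be illustrated with a short sketch. It assumes a schedule given as a mapping from operator names to control steps and a list of producer/consumer data edges; the function names, the four-operator example, and the cost weights are hypothetical and are not taken from Fig. 1.

```python
from collections import Counter

def resource_requirements(schedule, op_type, data_edges):
    """Return (functional units per type, registers) implied by a schedule.

    schedule:   dict node -> control step
    op_type:    dict node -> operator type, e.g. "add" or "mul"
    data_edges: list of (producer, consumer) pairs
    """
    # Functional units: the busiest control step for each operator type.
    per_step = Counter((op_type[n], step) for n, step in schedule.items())
    fu = {}
    for (kind, _step), count in per_step.items():
        fu[kind] = max(fu.get(kind, 0), count)

    # Registers: a value is live at the end of step s if it was produced
    # at or before s and a consumer is scheduled after s.
    last_use = {}
    for prod, cons in data_edges:
        last_use[prod] = max(last_use.get(prod, 0), schedule[cons])
    live = Counter()
    for prod, last in last_use.items():
        for s in range(schedule[prod], last):
            live[s] += 1
    registers = max(live.values(), default=0)
    return fu, registers

def cost(fu, registers, fu_weight, reg_weight):
    """Weighted sum of resource requirements (weights are illustrative)."""
    return sum(fu_weight[k] * n for k, n in fu.items()) + reg_weight * registers

# Hypothetical example: two adds in step 1, a multiply, then a final add.
schedule = {"a": 1, "b": 1, "c": 2, "d": 3}
op_type = {"a": "add", "b": "add", "c": "mul", "d": "add"}
data_edges = [("a", "c"), ("b", "c"), ("a", "d"), ("c", "d")]
fu, regs = resource_requirements(schedule, op_type, data_edges)
print(fu, regs, cost(fu, regs, {"add": 1, "mul": 4}, 1))
```

In this example the two adders are forced by step 1, and two values are live across each of the first two step boundaries. Primary inputs and outputs are ignored for brevity.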
Early approaches to scheduling used the simple as-soon-as-possible (ASAP) or as-late-as-possible
(ALAP) algorithms [6], [7] to minimize schedule length
while ignoring hardware costs and timing constraints.
More recently, a large number of approaches have been
developed that attempt to minimize hardware costs in the
resulting allocation. Approaches that minimize hardware
cost can be broken down into constructive approaches using greedy heuristics, iterative transformational approaches such as simulated annealing, and exact approaches such as integer linear programming.

Fig. 1. A control/data flow graph (CDFG).
By convention, edges in all figures are directed top-to-bottom unless an
arrowhead indicates otherwise.
0278-0070/93$03.00 © 1993 IEEE

Greedy heuristics attempt to minimize resource costs
but do not guarantee that an optimal schedule will be
found. Examples of greedy approaches include fast, simple heuristics such as list scheduling [8]-[12] and more
complex (and more effective) heuristics such as force-directed scheduling [5]. Greedy heuristics suffer the shortcoming that they can be trapped in local minima in the
cost function and so may not find the globally best schedule.
Transformational approaches alter an existing schedule
to find new schedules and search for low-cost schedules.
Iterative probabilistic approaches such as simulated annealing [13] employ a small set of transformations that
are applied to randomly selected parts of the schedule. A
probabilistic acceptance function allows controlled hill-climbing to escape local minima. Simulated annealing has
been employed in combined scheduling and allocation
schemes [14], [15]. Simulated evolution is a related approach in which a group of scheduled operators is probabilistically selected, ripped up (unscheduled) and then
rescheduled using a greedy algorithm [16]. Since transformational approaches require a large number of move
applications to properly explore the schedule space, they
typically exhibit large execution times.
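As an illustration of a probabilistic acceptance function, the sketch below uses the standard Metropolis criterion with a geometric cooling schedule. The cooling parameters, the function names, and the toy integer problem are assumptions chosen for illustration; they are not the annealing schedule used by any of the cited schedulers.

```python
import math
import random

def accept(delta_cost, temperature, rng=random):
    """Metropolis rule: always accept improvements, sometimes accept worse."""
    if delta_cost <= 0:
        return True
    return rng.random() < math.exp(-delta_cost / temperature)

def anneal(state, cost, propose, t_start=10.0, t_end=0.01, alpha=0.9,
           moves_per_temp=100, seed=0):
    """Generic annealing loop; returns the best state encountered."""
    rng = random.Random(seed)
    best = current = state
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            candidate = propose(current, rng)
            if accept(cost(candidate) - cost(current), t, rng):
                current = candidate
                if cost(current) < cost(best):
                    best = current
        t *= alpha  # geometric cooling (an assumed schedule)
    return best

# Toy problem: minimize (v - 7)^2 over the integers using +/-1 moves.
start = 0
best = anneal(start, lambda v: (v - 7) ** 2,
              lambda v, rng: v + rng.choice((-1, 1)))
print(best)
```

At high temperature most uphill moves are accepted, which permits escape from local minima; as the temperature falls the loop behaves increasingly like greedy descent.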
Exact approaches using techniques such as integer linear programming (ILP) (e.g., [17], [18]) guarantee a
globally optimum schedule but can have large execution
times. Such approaches represent scheduling decisions as
a set of decision variables (one for each possible assignment of an operator to a control step) and a set of constraints that must be satisfied to guarantee a legal schedule. ILP can then be used to find the optimal schedule
with respect to a given cost function. While the worst-case execution time of these approaches is exponential,
special characteristics of the scheduling problem can be
exploited to reduce runtime [18], making such techniques
practical for fairly large problems. However, the execution
time of these approaches depends on the number of variables and constraints, which can grow quite large for
large problems, especially as schedule length is increased
for a given CDFG.
Only a few scheduling approaches attempt to meet arbitrary timing constraints while minimizing resource
costs. Constructive heuristics include modified list-scheduling [9], [2] heuristics that attempt to meet timing constraints during scheduling, and force-directed scheduling,
which uses local timing constraints [5] to limit the time
frames of control steps into which an operator may be
scheduled. In another approach, timing constraints have
been included in an ILP formulation [18].
Other approaches to scheduling consider timing constraints separately from allocation and so do not attempt
to minimize resource costs. For example, Hayati and Parker [19] consider timing constraints as part of the controller generation problem after scheduling and allocation
are complete. Path-based scheduling [20] treats timing
constraints and functional-unit constraints as intervals on
a serialized control-flow graph (CFG) and derives a controller with a minimum number of control states by scheduling each path in the CFG separately.
It has recently been recognized that constraint solution
algorithms drawn from layout compaction [21] can be
used to find an ASAP-like schedule that meets timing constraints. This approach was used by Borriello [22] in the
synthesis of asynchronous interface transducers. A similar approach has also been used to satisfy constraints between loop iterations [12]. More recently, Ku and De
Micheli [23] have applied constraint solution to the problem of scheduling with timing constraints after allocation
has been performed. This technique, called relative
scheduling, has the added feature that it can guarantee
constraint satisfaction in the presence of unknown and unbounded delays using a specialized controller scheme.
Since relative scheduling is performed after allocation, resource costs are not considered.
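The idea of constraint solution can be sketched as a longest-path computation over a constraint graph, in the style of Bellman-Ford relaxation. This is a generic formulation given under stated assumptions (every node reachable from a source node named "src", control steps numbered from 1); it does not reproduce the algorithm of any cited work, and the example constraints are hypothetical.

```python
def asap_schedule(nodes, edges, source="src"):
    """Earliest control step for each node under x_j >= x_i + w constraints.

    edges: list of (i, j, w). A minimum constraint of m steps from i to j is
    (i, j, m); a maximum constraint of M steps is the backward edge (j, i, -M).
    Returns None when the constraints are inconsistent (positive cycle).
    """
    x = {n: None for n in nodes}
    x[source] = 0
    for _ in range(len(nodes)):
        changed = False
        for i, j, w in edges:
            if x[i] is not None and (x[j] is None or x[j] < x[i] + w):
                x[j] = x[i] + w  # push j later to satisfy the constraint
                changed = True
        if not changed:
            return x
    return None  # still changing after |V| passes: no legal schedule

nodes = ["src", "a", "b", "c"]
edges = [
    ("src", "a", 1),   # a no earlier than control step 1
    ("a", "b", 1),     # data dependency a -> b
    ("a", "c", 3),     # minimum constraint: time(a, c) >= 3
    ("b", "c", 1),     # data dependency b -> c
    ("c", "b", -1),    # maximum constraint: time(b, c) <= 1
]
x = asap_schedule(nodes, edges)
print(x)  # {'src': 0, 'a': 1, 'b': 3, 'c': 4}
```

Here the maximum constraint pulls operator b later than its data dependency alone would require. Adding a contradictory constraint such as time(a, c) <= 2 creates a positive cycle, and the function returns None.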
This paper describes a new transformational approach
to scheduling with timing constraints. The key to this approach is a simple set of moves, that is, transformations that alter
an existing schedule by rescheduling individual operators
or in some cases multiple operators. An ASAP-like schedule that meets all timing constraints is first generated using constraint solution techniques similar to [23]. However, unlike [23], this schedule is generated before
allocation and is used as a starting point from which to
search for other schedules that are lower in cost. This is
accomplished by applying a sequence of moves that alter
the schedule under the control of a simulated annealing
algorithm. Moves are applied only when the resulting
schedule will be legal with respect to all ordering and timing constraints. It is shown in the paper that any legal
schedule can be reached from any other legal schedule by
the application of some sequence of these moves.
Since the moves that alter the schedule are applied many
times, a modified CDFG representation called SALSA is
used to represent a scheduled CDFG and speed up the
tasks of checking for move legality, move application,
and evaluation of the cost function after move application. The SALSA representation also provides support for
conditional operations, mutual exclusion in both functional units and registers, and accurate representation of
storage requirements in loops and subroutines. The annealing approach allows the scheduler to avoid local minima that can trap greedy algorithms.
The key contribution of this work is the development
of a conceptually simple transformational approach to
scheduling with timing constraints that uses simulated annealing with a small move set to find high-quality schedules of data-dominated CDFGs. Results show that this
can be accomplished using reasonable amounts of CPU
time, even when schedule length is substantially longer
than minimum schedule length. Constraint solution provides an effective way to find an initial schedule that meets
timing constraints, and these constraints are preserved by
the move set. Analysis showing that all legal schedules
can be reached using the move set lends intuitive support
to the effectiveness of the approach and provides new insight into the structure of the scheduling problem. Finally, support for conditionals, subroutines, and loops allows the application of this approach to large, structured
designs.
The remainder of this paper is organized as follows:
Section II describes the notation and some key concepts
that will be used in the paper. Section III introduces the
SALSA representation and the move set that is used to
explore alternative schedules. In addition, it discusses the
completeness of the move set and describes support for
conditionals, loops, and subroutines. Section IV discusses the techniques used for initial schedule generation
and schedule improvement using the SALSA representation. Section V describes the implementation and presents
scheduling results for a number of examples.

II. PRELIMINARIES
A CDFG can be represented by a directed graph $G(V, E)$, where $V$ represents the set of nodes of the graph and
$E$ represents the set of edges between nodes. The set of
nodes includes a source node $v_{src}$, a sink node $v_{sink}$, and operator nodes $v_1$-$v_n$, which represent the operations in the
high-level specification. The term $v_i.delay$ denotes the
combinational delay of an operator node.
A schedule of length $L$ is an ordered $n$-tuple
$$x = (x_1, x_2, \ldots, x_i, \ldots, x_n)$$
where each $x_i$ is an integer $1 \le x_i \le L$ that represents the
control step in which node $v_i$ is scheduled. While not explicitly included in the tuple above, in all schedules of
length $L$ source node $v_{src}$ is always scheduled in control
step 0 (i.e., $x_{src} = 0$) and sink node $v_{sink}$ is always scheduled in control step $L + 1$ (i.e., $x_{sink} = L + 1$).
Each edge $e_{ij}$ in $E$ represents a data, control, or timing
dependency between nodes $v_i$ and $v_j$. Two edge weights
may be associated with each edge to represent the spacing
requirement associated with the dependency. Edge weight
$e_{ij}.min$ denotes the minimum allowable spacing in control
steps of nodes $v_i$ and $v_j$. Simple ordering and data flow
constraints are represented by this weight with a value of 0 or 1,
while minimum timing constraints are
weighted with the constraint value in control steps. Edge
weight $e_{ij}.max$ denotes the maximum allowable spacing in
control steps between nodes $v_i$ and $v_j$ as specified by a
maximum timing constraint.
Each edge represents an inequality relationship between the scheduled values of the nodes that must be satisfied in a legal schedule. For example, a minimum timing constraint $time(v_i, v_j) \ge e_{ij}.min$ in a schedule $x$
represents an inequality:
$$x_j \ge x_i + e_{ij}.min$$
while a maximum timing constraint $time(v_i, v_j) \le e_{ij}.max$ represents an inequality:
$$x_j \le x_i + e_{ij}.max.$$
Minimum and maximum constraints on an edge $e_{ij}$ are
sometimes separated into two edges $e_{ij}$ and $e_{ji}$, where forward edge $e_{ij}$ represents the minimum constraint and
backward edge $e_{ji}$ represents the maximum constraint
[23]. In this formulation, all constraints can be expressed
in a uniform way as inequalities of the form:
$$x_j \ge x_i + w_{ij},$$
where
$$w_{ij} = \begin{cases} e_{ij}.min & \text{for a minimum constraint (forward edge)} \\ -e_{ji}.max & \text{for a maximum constraint (backward edge).} \end{cases}$$
A schedule $x$ of length $L$ is legal if it satisfies all of the
constraint inequalities specified by the edges of the CDFG
and every node $v_i$ is scheduled in the range of control
steps $1 \le x_i \le L$. Since the CDFG by definition includes
ordering edges from the source node and to the sink node,
this second requirement will be satisfied whenever $x_{src} = 0$ and $x_{sink} = L + 1$.
The slack $s_{ij}(x)$ of a constraint $e_{ij}$ in schedule $x$ represents the amount by which the spacing between the scheduled positions of
nodes $v_i$ and $v_j$ can be decreased (increased) without violating the minimum (maximum) constraint represented
by $e_{ij}$. It is defined as
$$s_{ij}(x) = x_j - x_i - w_{ij}.$$
Note that in a legal schedule, $s_{ij}(x) \ge 0$ for every constraint $e_{ij}$ since every inequality must be satisfied.
It is sometimes useful to think of a schedule $x$ of $n$ nodes
as a point in an $n$-dimensional schedule space. Each constraint inequality defines a legal half-space within the
schedule space that satisfies that particular constraint.
Fig. 2. The schedule space. (a) Region of legal schedules for two operators. (b) Schedules x = (1, 3), y = (1, 4), and z = (2, 5).
Since a legal schedule must satisfy all constraints, a region of legal schedules is defined by the intersection of
all such half-spaces. Since each half-space is convex, it
is easily shown that the region resulting from the intersection of half-spaces is also convex [24].
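The slack and legality definitions translate directly into code. The sketch below stores each constraint in the uniform form as a triple (i, j, w) meaning x_j >= x_i + w; the three-operator example and the schedule length of 5 are assumptions made for illustration.

```python
def slack(x, edge):
    """s_ij(x) = x_j - x_i - w_ij for a constraint x_j >= x_i + w_ij."""
    i, j, w = edge
    return x[j] - x[i] - w

def is_legal(x, edges, length):
    """A schedule is legal if every step is in 1..L and no slack is negative."""
    in_range = all(1 <= step <= length for step in x.values())
    return in_range and all(slack(x, e) >= 0 for e in edges)

edges = [
    ("v1", "v2", 1),    # minimum constraint: time(v1, v2) >= 1
    ("v2", "v1", -3),   # maximum constraint: time(v1, v2) <= 3 (backward edge)
    ("v2", "v3", 1),    # data dependency v2 -> v3
]
x = {"v1": 1, "v2": 3, "v3": 4}
print([slack(x, e) for e in edges])   # [1, 1, 0]
print(is_legal(x, edges, length=5))   # True
print(is_legal({"v1": 1, "v2": 5, "v3": 5}, edges, length=5))  # False
```

The last schedule is rejected because both the maximum constraint and the data dependency have negative slack.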
To illustrate the concept of schedule space, Fig. 2(a)
shows the two-dimensional schedule space that results
given two operators under the constraints:
$$time(v_1, v_2) \ge 1 \text{ step AND } time(v_1, v_2) \le 3 \text{ steps.}$$
The inequalities implied by these constraints combine with
constraints on schedule length to form a trapezoidal region in the schedule space. Fig. 2(b) shows the schedules
of three of the nine points contained in this region: schedule $x = (1, 3)$, $y = (1, 4)$, and $z = (2, 5)$. For each point
in the schedule space, an adjacent point can be reached
by changing the schedule of a single operator node by one
control step. Schedule $y$ can be reached from schedule $x$
in this manner by rescheduling node $v_2$. Diagonally adjacent points in the schedule space can be reached by
changing the scheduling of two operators by one control
step. Schedule $z$ can be reached from schedule $y$ in this
manner by rescheduling both nodes $v_1$ and $v_2$.
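The region in this example can be enumerated directly. The sketch below assumes a schedule length of L = 5, which is the value consistent with a nine-point region that contains z = (2, 5); the helper name `step_type` is hypothetical.

```python
from itertools import product

L = 5  # assumed schedule length; z = (2, 5) needs at least five steps

# All (x1, x2) with 1 <= x1, x2 <= L and 1 <= x2 - x1 <= 3.
region = [(x1, x2)
          for x1, x2 in product(range(1, L + 1), repeat=2)
          if 1 <= x2 - x1 <= 3]

def step_type(p, q):
    """Classify the single-step relationship between two schedules."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if max(diffs) != 1:
        return "not adjacent"  # identical, or some operator moves 2+ steps
    return "adjacent" if sum(diffs) == 1 else "diagonal"

x, y, z = (1, 3), (1, 4), (2, 5)
print(len(region))        # 9
print(step_type(x, y))    # adjacent
print(step_type(y, z))    # diagonal
```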
It can also be useful to quantify the amount by which
the scheduled position of an operator node $v_i$ varies between two schedules $x$ and $y$. This is denoted by
$$d_i(x, y) = y_i - x_i.$$
This specifies the distance between the two schedules for
node $v_i$. Similarly, the total distance between two schedules $x$ and $y$ for all nodes is denoted by
$$D(x, y) = \sum_{i=1}^{n} |d_i(x, y)|.$$
This value is equivalent to the rectilinear distance between points in the schedule space. For example, in Fig.
2, $D(x, y) = 1$, $D(y, z) = 2$, and $D(x, z) = 3$.

III. THE SALSA SCHEDULE REPRESENTATION
The SALSA representation supports a transformational
approach to scheduling by describing a scheduled CDFG
and providing a set of simple moves that transform a
schedule $x$ into a new schedule $x'$. Repeated application
of these moves provides a means to search the schedule
space that can be guided by the cost of each new schedule
encountered. Since such a transformational approach requires many move applications and many evaluations of
the cost function, it is important to make these moves fast
and simple. In addition, it is important to quickly test
whether the application of a move will result in a legal
schedule. SALSA supports these needs using an explicit
representation of slack in the constraints of a scheduled
CDFG.
This section describes the SALSA representation and
move set. In addition, it shows that the move set is complete in that any legal schedule can be reached from any
other legal schedule by the application of some sequence
of moves from the move set. Finally, it describes additional considerations, including support for conditional
execution, loops, and subroutines.

3.1. Slack Nodes

SALSA explicitly represents slack in a scheduled
CDFG using a new class of nodes known as slack nodes.
Slack nodes are inserted in data, control, and timing dependency edges between operator nodes to represent slack
in an existing schedule, and each slack node explicitly
represents one step of slack. Thus in some schedule $x$ with
node $v_i$ scheduled in step $x_i$ and node $v_j$ scheduled in step
$x_j$ and a constraint weight $w_{ij}$, the edge $e_{ij}$ will contain
$s_{ij}(x) = x_j - x_i - w_{ij}$ slack nodes. Maximum timing constraints are represented as backward edges with slack
nodes inserted in the same way.
For data edges, slack nodes explicitly represent the need
for storage of a data value during each control step which
is crossed by the edge. Each such data slack node is
considered to be scheduled into one of the control steps
crossed by the edge, as shown in Fig. 3. Using this representation, register costs can be calculated locally in a
control step by examining only nodes scheduled in that
control step: operator nodes that create a new value, and
slack nodes that represent the storage of a previously created value. For example, Fig. 3 shows the SALSA graph
for one schedule of a simple CDFG. In the first step, two
operator nodes produce values that are used in later control steps. In addition, two slack nodes represent storage
of previously created values that are used in later control
steps. Thus a total of four registers are required for this
control step. This is the maximum number of registers
required over all control steps, so this schedule will require a minimum of four registers. When a simple transformation changes only part of a scheduled CDFG, the
local nature of these calculations can be exploited to speed
the calculation of register costs.

Fig. 3. The SALSA CDFG representation.
Fig. 4. Simple moves M1 and M2.
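A minimal sketch of this local register calculation follows. It assumes data edges of weight 1, places the slack nodes of an edge in the control steps strictly between producer and consumer, and counts one register per distinct value per step, so slack nodes on fanout edges of the same producer are counted once. That sharing rule and the four-operator example are assumptions, not details given in the text above.

```python
from collections import defaultdict

def slack_node_steps(x, edge):
    """Control steps holding the slack nodes of a data edge (i, j, w)."""
    i, j, w = edge
    count = x[j] - x[i] - w              # s_ij(x) slack nodes on this edge
    return [x[i] + k for k in range(1, count + 1)]

def registers_per_step(x, data_edges):
    """Registers needed in each control step, computed step by step."""
    holders = defaultdict(set)           # step -> producers with a stored value
    for i, j, w in data_edges:
        if x[j] > x[i]:
            holders[x[i]].add(i)         # operator node creates the value
        for step in slack_node_steps(x, (i, j, w)):
            holders[step].add(i)         # slack node keeps the value stored
    return {step: len(p) for step, p in holders.items()}

x = {"a": 1, "b": 1, "c": 2, "d": 4}
data_edges = [("a", "c", 1), ("b", "c", 1), ("a", "d", 1), ("c", "d", 1)]
print(slack_node_steps(x, ("a", "d", 1)))   # [2, 3]
per_step = registers_per_step(x, data_edges)
print(per_step)                             # {1: 2, 2: 2, 3: 2}
print(max(per_step.values()))               # 2
```

Because each step's count depends only on the nodes scheduled in that step, a move that touches one operator only requires recomputing the counts for the steps it affects.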
3.2. The Move Set

An important property of slack nodes is that an operator
can be rescheduled in an adjacent control step while still
satisfying constraints if all of its predecessor or successor
nodes are slack nodes. Furthermore, this rescheduling can
be accomplished by local rearrangement of the operator
node and adjacent slack nodes. These properties can be
exploited by defining a simple set of moves that alter a
schedule by rescheduling one or more operator nodes in
adjacent steps. SALSA provides four such moves, M1-M4. Each of these moves can be applied to a target operator only when it is legal, that is, when the schedule that results from the move does not violate any data, control, or
timing dependencies. Move legality is easily determined
before performing the move by checking that all dependency edges in the direction of the move contain slack
nodes.
Simple moves M1 and M2 alter a schedule by moving
a single operator node $v_i$ to an adjacent control step. M1
and M2 are defined as follows:

M1: Move an operator node $v_i$ from its current control
step to the preceding control step. M1 transforms a
schedule $x$ into a new schedule $x' = (x_1, x_2, \ldots, x_i - 1, \ldots, x_n)$.
M1 is legal when all predecessor
edges of $v_i$ are connected to slack nodes. Applying
M1 removes one slack node from each predecessor
edge, reschedules the operator, and adds one slack
node to each successor edge, as shown in Fig. 4(a).
M2: Move an operator node $v_i$ from its current control
step to the following control step. M2 transforms a
schedule $x$ into a new schedule $x' = (x_1, x_2, \ldots, x_i + 1, \ldots, x_n)$.
M2 is legal only when all successor
edges of $v_i$ are connected to slack nodes. Applying
M2 removes one slack node from each successor
edge, reschedules the operator, and adds one
slack node to each predecessor edge, as shown in Fig.
4(b).

The application of a simple move transforms a schedule
$x$ into a new schedule $x'$ that is immediately adjacent in
the schedule space. For example, in Fig. 2 schedule $y$ can
be reached from schedule $x$ by applying move M2 to node
$v_2$. Repeated application of simple moves allows the exploration of schedule space using very simple transformations.
Chaining [11] is supported by a minor extension which
allows moves with non-slack predecessors (successors)
when delay permits. This corresponds to a slight relaxation of ordering constraints (i.e., $x_j \ge x_i + 1$ becomes $x_j \ge x_i$)
when the estimated combinational delay of the
chained nodes does not exceed the clock period. Chaining
of a single node is accomplished using modified versions
of M1 and M2 that take this calculation into account. A
more powerful recursive chaining move is also useful; this
move recursively moves predecessor or successor
nodes if they are already chained with the target node and
would block the completion of a simple chaining move.
Chaining also affects initial schedule generation and is
discussed further in Section 4.1. Multicycling (scheduling
operators into multiple control steps) and multicycling
with pipelined functional units are supported directly and
require no special handling.
Because simple moves require only local changes to a
SALSA graph, the cost of an individual move application
is low. However, several moves may be required to make
significant changes to the schedule. This is especially true
when a move is blocked by a chain of constraints with no
slack. For example, Fig. 5(a) shows a schedule in which
operator $v_1$ cannot be moved to its preceding control step
because there is no slack in its predecessor edge. However, if the preceding + operator were moved to a previous control step, slack would be present, allowing it to
move.
To overcome this problem, two more powerful shoving
moves are defined that are similar in concept to the
shove-aside transformations used in some routers [25].
A shoving move recursively moves any predecessor or
successor operators that are blocking a simple move of
the target operator, thus rescheduling several operators at
once at an added expense in CPU time. The shoving
moves are defined as follows:

M3: Shove an operator node $v_i$ from its current control
step to the preceding control step. This move is accomplished in two steps: (1) if any predecessor
nodes are operator nodes that would block moving
$v_i$, recursively apply M3 to these nodes to shove
them into preceding control steps; (2) move $v_i$ to
the preceding control step as in move M1. As each
operator node is moved, slack nodes are removed
from predecessor edges and added to successor
edges as appropriate. When applied to an operator
node $v_i$, M3 transforms a schedule $x$ with a chain of
non-slack predecessor constraints $(e_{jk}, \ldots, e_{li})$ into
a new schedule $x' = (x_1, x_2, \ldots, x_i - 1, \ldots, x_j - 1, \ldots, x_k - 1, \ldots, x_l - 1, \ldots, x_n)$. M3 is
always legal unless a chain of dependent operator
nodes extends to the source node, which represents
the beginning of the schedule. In this case, completing the move would result in an illegal schedule.
Fig. 5 shows an example of move M3.
M4: Shove an operator node $v_i$ from its current control
step to the following control step. This move is accomplished in two steps: (1) if any successor nodes
are operator nodes that would block moving $v_i$, recursively apply M4 to these nodes to shove them
into following control steps; (2) move $v_i$ to the following control step as in move M2. As each node is
moved, slack nodes are removed from successor
edges and added to predecessor edges as appropriate. When applied to an operator node $v_i$, M4 transforms a schedule $x$ with a chain of non-slack successor constraints $(e_{ij}, \ldots, e_{kl})$ into a new schedule
$x' = (x_1, x_2, \ldots, x_i + 1, \ldots, x_j + 1, \ldots, x_k + 1, \ldots, x_l + 1, \ldots, x_n)$. M4 is always legal
unless a chain of dependent operators extends to the
sink node, which represents the end of the schedule.
In this case, completing the move would result in an
illegal schedule. Fig. 6 shows an example of move
M4.

Fig. 5. Move M3 (shove up).
Fig. 6. Move M4 (shove down).
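The four moves can be sketched at the level of the schedule vector, without building the slack-node graph. The sketch below is a simplification under stated assumptions: constraints are triples (i, j, w) meaning x_j >= x_i + w, an edge "contains slack nodes" when its slack is positive, and the source and sink are modeled by the bounds 1..L rather than by explicit edges. The function names are hypothetical. The second example ties two operators together with a minimum and a maximum constraint of equal value, so that only a shoving move can reschedule them.

```python
def slack(x, edge):
    i, j, w = edge
    return x[j] - x[i] - w

def simple_move(x, edges, length, node, direction):
    """M1 (direction=-1) or M2 (direction=+1); new schedule, or None if illegal."""
    # Edges in the direction of the move must all contain slack.
    blocking = [e for e in edges
                if (e[0] if direction > 0 else e[1]) == node and slack(x, e) == 0]
    target = x[node] + direction
    if blocking or not 1 <= target <= length:
        return None
    return {**x, node: target}

def shove_move(x, edges, length, node, direction):
    """M3 (direction=-1) or M4 (direction=+1); new schedule, or None if illegal."""
    marked, stack = {node}, [node]       # marking also terminates on cycles
    while stack:
        n = stack.pop()
        for i, j, w in edges:
            tail, head = (i, j) if direction > 0 else (j, i)
            if tail == n and slack(x, (i, j, w)) == 0 and head not in marked:
                marked.add(head)         # blocking operator must move too
                stack.append(head)
    moved = {n: x[n] + direction for n in marked}
    if not all(1 <= step <= length for step in moved.values()):
        return None                      # chain reaches the source or sink
    return {**x, **moved}

# A data edge with one step of slack: the simple move M2 succeeds.
print(simple_move({"a": 1, "b": 3}, [("a", "b", 1)], 5, "a", +1))

# time(v1, v2) >= 2 AND time(v1, v2) <= 2: a zero-slack cycle.
edges = [("v1", "v2", 2), ("v2", "v1", -2)]
x = {"v1": 1, "v2": 3}
print(simple_move(x, edges, 5, "v1", +1))   # None
print(shove_move(x, edges, 5, "v1", +1))    # {'v1': 2, 'v2': 4}
```

A simple move is the special case of a shove in which no neighbor needs to be marked. The set of marked nodes plays the role of cycle detection, so the recursion terminates even when the zero-slack constraints form a loop.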
Shoving moves are important when minimum and maximum timing constraints are combined to create fixed-time
constraints that specify an exact spacing between
two or more operator nodes. For example, Fig. 7 shows
the schedule space and two possible schedules of two
nodes under a fixed-time constraint consisting of constraints
$time(v_1, v_2) \ge 2$ steps AND $time(v_1, v_2) \le 2$
steps. These constraints form a cycle of non-slack constraints and there is never slack in the constraints on the
cycle. For this reason, simple moves cannot be used to
alter the schedule. However, shoving moves can be applied to reschedule all operators on such a non-slack cycle
simultaneously.2 For example, in Fig. 7 simple moves
cannot be used because each legal schedule has no immediately adjacent legal schedules. However, shoving
moves can be used to reschedule both operators simultaneously (e.g., M4 transforms schedule $v$ into schedule
$w$). This corresponds to a diagonal move in the schedule
space.

Fig. 7. Operators under a fixed-time constraint.
2 To support fixed-time constraints, shoving moves must detect cycles
when recursively shoving predecessor or successor nodes. This can be accomplished using a simple marking scheme.

3.3. Completeness of the Move Set

Since moves M1-M4 are proposed for use in searching
the schedule space, it is important to show that this move
set is complete, i.e., that all legal schedules in the schedule space can be reached using these moves. This can be
demonstrated by showing that given two arbitrary legal
schedules $x$ and $y$, a sequence of moves M1-M4 can be
applied to operator nodes in the CDFG that will transform
$x$ into a sequence of legal schedules that are successively
closer to schedule $y$ until schedule $y$ is found. To accomplish this, we first present a useful property of legal
schedules.
Lemma 1: Let $x$ and $y$ be two legal schedules of length
$L$ which differ in the scheduled position of at least one
node $v_i$ (i.e., $y_i \ne x_i$). If the scheduled position of node
$v_i$ is greater in schedule $y$ than in schedule $x$ (i.e., $y_i > x_i$)
and there is a successor constraint edge $e_{ij}$ between $v_i$
and some node $v_j$ with no slack in schedule $x$, then the
scheduled position of node $v_j$ must also be greater in
schedule $y$ than in schedule $x$ (i.e., $y_j > x_j$). Similarly, if
the scheduled position of node $v_i$ is smaller in schedule $y$
than in schedule $x$ (i.e., $y_i < x_i$) and there is a predecessor
constraint edge $e_{ji}$ between some node $v_j$ and $v_i$ with no
slack in schedule $x$, then the scheduled position of node
$v_j$ must also be smaller in schedule $y$ than in schedule $x$
(i.e., $y_j < x_j$).
Proof: Consider the case where $y_i > x_i$. Since there
is no slack in schedule $x$ for constraint $e_{ij}$ (i.e., $s_{ij}(x) = 0$),
we can replace the constraint inequality for $e_{ij}$ in schedule $x$ with an equality relationship:
$$x_j = x_i + w_{ij}.$$
In addition, since $y$ is a legal schedule, the constraint inequality for the same constraint $e_{ij}$ must hold in schedule
$y$:
$$y_j \ge y_i + w_{ij}.$$
Since $y_i > x_i$, this inequality can only be satisfied if $y_j > x_j$.
The proof of the case where $y_i < x_i$ is similar and is
omitted. □
This result can be used to show that given two schedules $x$ and $y$ with distance $D(x, y)$, there is always a move
that will create a new schedule $x'$ that is closer to $y$ than
the original schedule $x$.
Lemma 2: Let $x$ and $y$ be two legal schedules of length
$L$ that differ in the scheduled position of at least one node
$v_i$ (i.e., $y_i \ne x_i$). If the scheduled position of node $v_i$ is
greater in schedule $y$ than in schedule $x$ (i.e., $y_i > x_i$),
then there exists a legal schedule $x'$ that can be reached
from $x$ through the application of move M2 or M4 to node
$v_i$ such that $D(x', y) < D(x, y)$. Similarly, if the scheduled position of node $v_i$ is smaller in schedule $y$ than in
schedule $x$ (i.e., $y_i < x_i$), then there exists a legal schedule
$x'$ that can be reached from $x$ through the application of
move M1 or M3 to operator $v_i$ such that $D(x', y) < D(x, y)$.
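The statement of Lemma 2 suggests a simple walk from one legal schedule to another, which the following sketch illustrates on a hypothetical three-operator example. It works on the schedule vector only: `shove` moves the target node together with every node reachable from it through zero-slack constraints in the direction of the move, so a simple move (M1 or M2) is the special case in which no other node is dragged along. The names, constraints, and schedules are assumptions chosen for illustration.

```python
def distance(x, y):
    """D(x, y): rectilinear distance between two schedules."""
    return sum(abs(y[n] - x[n]) for n in x)

def shove(x, edges, node, direction):
    """Move node one step, dragging along nodes tied to it by zero slack."""
    marked, stack = {node}, [node]
    while stack:
        n = stack.pop()
        for i, j, w in edges:            # each edge means x_j >= x_i + w
            tail, head = (i, j) if direction > 0 else (j, i)
            if tail == n and x[j] - x[i] - w == 0 and head not in marked:
                marked.add(head)
                stack.append(head)
    return {**x, **{n: x[n] + direction for n in marked}}

def walk(x, y, edges):
    """Schedules visited while moving from x to y one move at a time."""
    path = [x]
    while x != y:
        node = next(n for n in x if x[n] != y[n])
        x = shove(x, edges, node, 1 if y[node] > x[node] else -1)
        path.append(x)
    return path

edges = [("v1", "v2", 1), ("v2", "v1", -3), ("v2", "v3", 1)]
x = {"v1": 1, "v2": 2, "v3": 3}
y = {"v1": 3, "v2": 6, "v3": 7}
path = walk(x, y, edges)
distances = [distance(p, y) for p in path]
print(distances)   # [10, 7, 4, 2, 0]
```

The first two moves shove all three operators at once because every successor constraint has zero slack; the last two shove only v2 and v3. The distance to y falls strictly at every step, as the lemma asserts.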
Proof: Consider first the case where the scheduled position of node $v_i$ is greater in schedule y than in schedule x. In this case, applying move M2 or M4 to schedule x will create a new schedule x' in which node $v_i$ is one step closer to its scheduled position in schedule y. We must show that all constraints that involve $v_i$ will be satisfied in the new schedule x' and that x' is closer to y than the original schedule x. Constraints associated with predecessor edges of $v_i$ need not be considered, since the inequalities that they represent will be satisfied both before and after the move. For successor edges, we consider three cases, illustrated in Fig. 8:
Case 1-M2 succeeds: If every successor constraint $e_{ij}$ has slack $s_{ij}(x) \ge 1$, then node $v_i$ can be moved forward one control step using simple move M2 to create a new schedule x' as shown in Fig. 8(a). All constraints will still be satisfied in x' but with reduced slack values. Further, since $y_i > x_i$ and $x'_i = x_i + 1$, $|d_i(x', y)| = |d_i(x, y)| - 1$ and so $D(x', y) < D(x, y)$.
Case 2-M4 succeeds: If one or more successor constraints have zero slack, then node $v_i$ may still be moved forward one control step using shoving move M4 to create a new schedule x', as shown in Fig. 8(b). Move M4 will complete successfully if it can be recursively applied to the successors of each successor node. These recursive applications will move forward all nodes $\{v_i, v_j, v_k, \ldots, v_r\}$ that lie on one or more paths of non-slack successor constraints starting with node $v_i$, provided that each path is terminated by a constraint with slack or else forms a cycle of zero-slack constraints. Since move M4 simultaneously moves forward all nodes on the path, constraints on the path remain satisfied after the move is completed. Further, since each edge in the path is a constraint with no slack and $y_i > x_i$, then by Lemma 1, $y_j > x_j, y_k > x_k, \ldots, y_r > x_r$. Thus for each node on the path $|d_i(x', y)| = |d_i(x, y)| - 1$, $|d_j(x', y)| = |d_j(x, y)| - 1$, $|d_k(x', y)| = |d_k(x, y)| - 1$, $\ldots$, $|d_r(x', y)| = |d_r(x, y)| - 1$, and so $D(x', y) < D(x, y)$ for any number of nodes that are moved using M4.
Fig. 8. Considerations when moving an operator. (a) Case 1: M2 succeeds. (b) Case 2: M4 succeeds. (c) Case 3: M2, M4 fail.
Case 3-M2 and M4 fail: Move M2 will only succeed when all successor constraints contain slack. However, move M4 will always succeed unless there is a path of non-slack constraints involving nodes $\{v_i, v_j, v_k, \ldots, v_r\}$ that is not terminated by a slack node. This can only occur when node $v_r$ has a non-slack successor constraint with the sink node $v_{sink}$, as shown in Fig. 8(c). However, this situation cannot occur because both x and y are legal schedules. To demonstrate this assertion, assume that a path of non-slack constraints extends from node $v_i$ to the sink node, as shown in Fig. 8(c). In this case, since $y_i > x_i$, by Lemma 1, $y_j > x_j, y_k > x_k, \ldots, y_r > x_r$ and $y_{sink} > x_{sink}$. Since by definition the two schedules are legal only if $y_{sink} = x_{sink} = L + 1$, this cannot occur when both schedules are legal. Since Case 3 cannot occur, Cases 1 and 2 show that there is always a move that will create a new schedule x' such that $D(x', y) < D(x, y)$.
The proof for the case where the scheduled position of node $v_i$ is smaller in schedule y than in schedule x is similar and is omitted here. □
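The distance argument of Lemma 2 can be illustrated with a small sketch. It assumes that D(x, y) is the sum over all nodes of |y_i - x_i| and that the slack of a minimum constraint with weight w_ij is x_j - x_i - w_ij. The function names and the dictionary encoding of a schedule are hypothetical and are not part of SALSA; the sketch also ignores the schedule-length bound of Case 3 and maximum constraints.

```python
def distance(x, y):
    """Assumed distance D(x, y): sum of per-node differences |y_i - x_i|."""
    return sum(abs(y[v] - x[v]) for v in x)


def slack(x, edge):
    """Slack of a minimum constraint (i, j, w): x_j - x_i - w."""
    i, j, w = edge
    return x[j] - x[i] - w


def shove_down(x, edges, node):
    """Sketch of M2/M4: move `node` one step later and recursively shove
    every successor reached through a zero-slack constraint.
    The `moved` set doubles as a marking scheme that stops on cycles.
    Returns a new schedule; the input schedule is not modified."""
    moved = set()

    def visit(v):
        if v in moved:
            return
        moved.add(v)
        for (i, j, w) in edges:
            if i == v and slack(x, (i, j, w)) == 0:
                visit(j)

    visit(node)
    new = dict(x)
    for v in moved:
        new[v] += 1
    return new


# Hypothetical three-node chain a -> b -> c with unit weights.
x = {"a": 0, "b": 1, "c": 3}
y = {"a": 1, "b": 2, "c": 3}
edges = [("a", "b", 1), ("b", "c", 1)]
x_new = shove_down(x, edges, "a")   # shoves b along with a, leaves c alone
print(distance(x, y), distance(x_new, y))
```

In this toy case the constraint between a and b has zero slack, so the move behaves like M4 and reduces the distance by two in a single application.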
Given the result of Lemma 2, we can show that it is possible to reach any schedule y from any other schedule x.
Theorem 1: Let x and y be two legal schedules of length L that differ in the scheduled position of at least one node $v_i$ (i.e., $y_i \ne x_i$). Then there exists a sequence of no more than D(x, y) applications of moves M1-M4 to selected nodes of the CDFG that will transform schedule x into schedule y.
Proof: By Lemma 2 there is always a move that will transform schedule x into a new schedule that is closer to schedule y. Applying such a move will create a new schedule $x^1$ such that $D(x^1, y) < D(x, y)$. Similarly, there is always a move that will transform schedule $x^1$ into a new schedule $x^2$ such that $D(x^2, y) < D(x^1, y)$. This process can be continued, creating a sequence of intermediate schedules $x^1, x^2, \ldots, x^r$ that are successively closer to y until finally $D(x^r, y) = 0$, and so $x^r$ is equivalent to y. If each intermediate schedule is created using a simple move M1 or M2, then each intermediate schedule reduces the distance from schedule y by one. In this case exactly D(x, y) moves are required to transform schedule x into schedule y. Since shoving moves reduce the distance of an intermediate schedule from y by more than one, any shoving moves in the sequence reduce the number of moves required to reach schedule y. Thus no more than D(x, y) moves are required to transform schedule x into schedule y. □
This result is important because it shows that using moves M1-M4 the region of legal schedules can be fully explored from any legal starting schedule. Thus the choice of a particular starting schedule cannot preclude the exploration of some set of schedules in the region. Further, the most direct path between any two legal schedules lies within this region, suggesting that illegal configurations are not needed or desirable when searching the schedule space. It is important to note that the number of legal schedules grows exponentially with the number of operator nodes and thus exhaustive exploration is prohibitively expensive. This motivates the use of a probabilistic algorithm such as simulated annealing to guide the exploration of the schedule space.
3.4. Variable-Length Schedules
A straightforward extension to the SALSA representation allows it to explore schedules with different lengths. As defined in Section II, the sink node $v_{sink}$ of a CDFG with schedule length L is assigned to control step L + 1. Varying the length of the schedule therefore corresponds to varying the scheduled position of $v_{sink}$. This can be accomplished by applying the moves defined in Section 3.2 to the sink node as well as the other nodes in the graph. For example, an M1 (move up) move can be successfully applied to $v_{sink}$ when the final step of the schedule contains only slack nodes. This has the effect of shortening the schedule by one control step. On the other hand, applying move M2 to $v_{sink}$ will lengthen the schedule if it is less than a user-specified upper bound. Shoving move M4 (shove down) must also be redefined slightly since it can always complete by lengthening the schedule if necessary. This move now fails only when the upper bound on schedule length would be violated. Since there is always some sequence of moves that will create a maximum-length schedule, the analysis of move-set completeness in the previous section still holds.
3.5. Cost Estimation in SALSA
SALSA normally uses a cost function that is a weighted sum of register and functional unit requirements. When variable-length schedules are specified, an additional weighted term is added to the cost function to account for schedule length. Weights are user-specified to allow tradeoffs between resources of different types and schedule length.
Functional unit and register costs are computed by counting the requirements in each control step and taking the maximum of these values over all control steps. As in other approaches, functional unit costs are calculated by counting the number of similar operator nodes of each type. Register costs are determined by counting the number of operator nodes that produce data values used in later steps and the number of slack nodes that represent previously stored values, as described in Section 3.1.
A full calculation of functional unit and register costs over all control steps in a CDFG is expensive. However, the local nature of simple moves M1 and M2 allows these changes to be calculated incrementally in the following fashion: For each resource type (functional unit and register), the control steps which contain the maximum demand for the resource are retained in a critical step list. Each simple move M1 or M2 affects two control steps, which we will refer to as the source step (from which the operator is removed) and the destination step (to which the operator is added).3 Adding an operator to the destination step raises the demand for resources in that step. If this value is less than the current maximum demand, no action is taken. If it is equal to the current maximum, then it is added to the list of critical steps for that resource. If it is greater than the current maximum, the current critical steps are removed and the destination step becomes the new critical step. Removing an operator from the source step lowers the demand for resources in that step. If the step is currently the only critical step for a particular resource, then the overall cost is lowered and the critical steps must be recalculated.
Shoving moves M3 and M4 are implemented using repeated applications of M1 and M2. When these moves are applied, the incremental cost adjustment must be recalculated for each operator that is moved, adding to the expense of shoving moves.
3 Note that this is true for both single-cycle and multicycle operators because each move reschedules an operator in an adjacent control step.
3.6. Conditionals, Subroutines, and Loops
SALSA represents conditional activities using an approach similar to [26], [27], as shown in Fig. 9. A list of input conditions is attached to each operation that represents the conditions under which it is activated. Each row of this list represents a set of input conditions encoded as 0, 1, or X (either 1 or 0). The universal condition (XXX) is attached to unconditional operations, signifying that they always execute. Since the input conditions are based on other values in the CDFG, data edges are added to conditional operators to represent the use of these values in conditional execution. These edges maintain proper sequencing and account for storage requirements. Conditional operators that produce data values require an added multiplexer operator to select the proper value based on the tested condition. Control operators that change control flow (e.g., restarting a loop) require no multiplexer operator but may require added control edges to maintain proper sequencing.
Fig. 9. Conditional execution.
Mutual exclusion between two operator nodes is detected by taking the intersection of their condition lists. If
this intersection is empty, then they are mutually exclusive and can share the same functional unit. Condition
lists are also attached to slack nodes, allowing the detection of mutual exclusion in value storage. If the condition
lists of any pair of nodes (either slack or operator) do not
intersect, then the values that they produce are mutually
exclusive and can share the same register.
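The intersection test on condition lists can be sketched as follows. The encoding of each row as a string over the characters 0, 1, and X follows the description above; the function names are hypothetical, and all rows are assumed to have the same length.

```python
def rows_intersect(row_a, row_b):
    """Two condition rows overlap unless some input is 0 in one row and 1 in the other."""
    return all(a == b or a == "X" or b == "X" for a, b in zip(row_a, row_b))


def mutually_exclusive(conds_a, conds_b):
    """Two nodes are mutually exclusive when no pair of rows from their condition lists overlaps."""
    return not any(rows_intersect(ra, rb) for ra in conds_a for rb in conds_b)


# Operators on opposite branches of the first condition never execute together.
print(mutually_exclusive(["1X0"], ["0XX"]))
# An unconditional operator (XXX) overlaps with every other operator.
print(mutually_exclusive(["XXX"], ["10X"]))
```

Under these assumptions the first pair could share a functional unit, while the second pair could not.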
Subroutines are an important tool for structuring the
control and data flow of behavioral descriptions and synthesized designs. Depending on the designer's intent,
subroutines in a behavioral description may be synthesized in a number of different ways, each with different
advantages: First, subroutines may be implemented directly in the controller program. This approach assumes
a single thread of control, and multiple instances of a subroutine are implemented directly as subroutine calls in the
control program. This has the advantage of allowing datapath hardware to be shared between subroutines and
calling routines. Second, subroutines may be treated as
separate structural entities that are synthesized independently. In this case, multiple instances of a subroutine may
be implemented either as a single datapath or as several
datapaths. This has the advantage of allowing hierarchy
and parallelism. Finally, subroutines may be eliminated
altogether by expanding them into calling routines. This
approach has the advantage of simplifying the control
structure but increases the size of the calling routines.
The SALSA representation directly supports only the
first approach to implementing subroutines. This approach allows datapath resources to be shared between
subroutines and calling routines. However, the remaining
approaches can still be implemented using behavioral-level transformations [8] such as process formation and
inline expansion to alter the structure of the behavioral
description.
This representation of subroutines is implemented using a method similar to the CMU Value Trace [8] with
extensions that support accurate register cost calculation
during subroutine execution. In this approach, subroutines are represented as separate graphs that will be implemented by the same datapath and controller after
scheduling and allocation. Each graph is scheduled into a
separate sequence of control steps which will be bound to
the same datapath during allocation.
CALL operator nodes represent the activation of subroutine graphs. Each CALL node represents a transfer of control from a control step in the calling context to the sequence of control steps in the subroutine graph. In addition, it represents the transfer of data values to the subroutine from the calling context by data edges into the CALL node and corresponding edges out of the source node of the subroutine graph. Similarly, it represents the transfer of data values from the subroutine to the calling context at the end of the subroutine by data edges into the sink node of the subroutine graph and corresponding edges out of the CALL node. Fig. 10 shows an example of a subroutine graph and two CALL nodes that transfer control to that graph.
Fig. 10. Subroutine graph and CALL nodes.
During scheduling, SALSA allows moves M1-M4 to
be applied to every subroutine graph as well as the graph
representing the main program. This allows scheduling
tradeoffs to be considered simultaneously for the entire
design. However, when subroutines are present register
cost estimation must account not only for local storage
requirements in subroutine graphs but also the storage requirements of the calling routines. For example, in Fig.
10 there are two CALL nodes that activate subroutine
graph X with different storage requirements. During the
first call, values A and B require storage during the execution of the subroutine. During the second call, value C
requires storage. These values must be considered live
when calculating the register cost of the scheduled subroutine. The SALSA representation describes these values explicitly by adding data edges between the source
and sink node of the subroutine graph, as shown in Fig.
11. Slack nodes in these edges explicitly represent storage
requirements in each control step but do not imply any
additional scheduling slack. Because only one call to the
subroutine may be active at a time, values from different
calling contexts are mutually exclusive with respect to
each other. SALSA represents this mutual exclusion by
creating a unique bit vector for each calling context and
adding this vector to the condition list of each slack operator.
As in the Value Trace, loops are treated as a special
case of subroutines. Each loop is represented using a separate graph. Loop execution is initiated using a CALL
operator, and new iterations of the loop are initiated using
a RESTART operator that feeds data values back to the
beginning of the loop. WAIT operations that are used for
external synchronization are implemented in the same way
using simple loops of one control step.
Fig. 11. Subroutine graph with added data edges.
IV. SCHEDULING WITH SALSA
The previous section discussed the SALSA representation and how alternative schedules can be explored using the SALSA move set. Given a schedule which meets all timing and ordering constraints, the application of a legal move to an operator in the schedule will result in a new schedule that meets the same constraints. However, an initial schedule that meets timing constraints must first be created before this exploration process can proceed. Following initial schedule creation, some method must be used to guide the exploration process. This section describes the techniques used to accomplish these tasks.
4.1. Initial Schedule Generation
The initial scheduling phase takes a traditional CDFG as input, finds a schedule that meets all timing constraints, and adds slack operators to form a SALSA graph. The schedule can be either a minimum-length schedule or a schedule of length specified by the user. To find the schedule, it uses an iterative algorithm adapted from layout compaction [28], [29]. This algorithm is similar to the relative scheduling algorithm of [23], but is performed before allocation and does not support unbounded delays.
In one-dimensional layout compaction, objects to be compacted are treated as nodes in a directed constraint graph with a single source and sink node. Edges represent relative positioning (e.g., object A is to the left of object B). Edge weights represent spacing constraints between objects (e.g., the distance between the center of objects A and B must be greater than X). The problem of constraint solution is to find an assignment of objects to locations that meets all spacing constraints and minimizes the overall layout size. Compaction research [21], [29] has shown that when a constraint graph contains both minimum and maximum constraints it can be solved in O(V * K) execution time, where V refers to the number of nodes and K refers to the number of maximum constraints. Additional algorithms allow the determination of whether a graph contains contradictory constraints [29].
It is straightforward to apply constraint solution techniques to the problem of finding a schedule in a CDFG. The CDFG becomes a constraint graph in which edges are weighted to represent timing constraints expressed in control steps. Data and control edges are weighted to guarantee proper operation ordering, and timing edges are weighted to represent constraint values. Fig. 12 shows a constraint solution algorithm for scheduling which is patterned after the constraint solution algorithm of Burns and Newton [28], [29] but is extended to deal with chaining.

    constraint_solution() {
        /* start with all ops scheduled in step 0 */
        for ( every node vi in G(V,E) ) xi = 0;
        for ( each successor vi of source vs ) enqueue ( vi );
        /* check constraints on nodes in queue */
        while ( queue is not empty ) {
            vj = dequeue();
            lower_bound = 0;
            upper_bound = 0;
            /* process minimum constraints and dependencies on predecessors */
            for ( each predecessor edge eij of vj ) {
                if ( chaining enabled && eij is not a timing constraint ) {
                    comb_delay = longest_comb_delay ( vi ) + vj.delay;
                    if ( comb_delay <= clock_period )
                        lower_bound = max ( lower_bound, xi );
                    else
                        lower_bound = max ( lower_bound, xi + 1 );
                }
                else lower_bound = max ( lower_bound, xi + eij.min );
            }
            /* process maximum constraints on successors */
            for ( each successor edge ejk of vj )
                upper_bound = max ( upper_bound, xk - ejk.max );
            /* reschedule vj if necessary and enqueue constrained nodes */
            new_step = max ( xj, max ( lower_bound, upper_bound ) );
            if ( new_step != xj ) {
                xj = new_step;
                for ( each predecessor edge eij of vj )
                    if ( eij.max represents a valid max. constraint ) enqueue ( vi );
                for ( each successor edge ejk of vj )
                    if ( ejk.min represents a valid min. constraint ) enqueue ( vk );
            }
        }
    }

Fig. 12. Constraint solution algorithm for initial schedule generation.
The algorithm operates by initially scheduling all operators in control step 0 and then iteratively correcting
constraint violations by moving operators to later control
steps. Operators that may violate constraints are placed in
a queue for processing. The outer loop of the algorithm
removes operators from the queue one at a time and tests
for constraint violations. It first tests for the violation of
any minimum and ordering constraints on predecessor operators. If a minimum constraint is violated, it can be corrected by moving the node to a later control step. It then
tests for violation of maximum timing constraints on successor nodes. If a maximum timing constraint is violated,
it can also be corrected by moving the operator to a later
control step. Since moving the operator can cause violations in other constraints, operators connected to potentially violated constraints are placed on the queue for later
processing. The process iterates until the queue is empty;
this represents a schedule where all constraints have been
met.
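The queue-based procedure described above can be sketched in runnable form as follows. This is an illustration under stated assumptions, not the SALSA implementation: chaining is omitted, each edge is a hypothetical tuple (i, j, min, max) meaning x_j - x_i >= min and x_j - x_i <= max (either bound may be None), and the queue is seeded with every node rather than only the successors of the source so that the result does not depend on the source edge weights. The iteration cap is also an assumption, added so that contradictory constraints raise an error instead of looping forever.

```python
from collections import deque


def constraint_solution(nodes, edges, max_iters=10_000):
    """Return {node: control step} meeting all min/max constraints, as early as possible."""
    x = {v: 0 for v in nodes}                 # start with everything in step 0
    preds = {v: [] for v in nodes}
    succs = {v: [] for v in nodes}
    for edge in edges:
        i, j, _, _ = edge
        succs[i].append(edge)
        preds[j].append(edge)

    queue = deque(nodes)                      # assumption: seed with all nodes
    iterations = 0
    while queue:
        iterations += 1
        if iterations > max_iters:
            raise ValueError("no schedule found; constraints may be contradictory")
        vj = queue.popleft()
        lower = 0
        upper = 0
        # Minimum constraints and dependencies on predecessors.
        for (i, _, min_w, _) in preds[vj]:
            if min_w is not None:
                lower = max(lower, x[i] + min_w)
        # Maximum constraints on successors: x_j must be at least x_k - max.
        for (_, k, _, max_w) in succs[vj]:
            if max_w is not None:
                upper = max(upper, x[k] - max_w)
        new_step = max(x[vj], lower, upper)
        if new_step != x[vj]:
            x[vj] = new_step
            # Moving vj later may violate constraints on its neighbours.
            for (i, _, _, max_w) in preds[vj]:
                if max_w is not None:
                    queue.append(i)
            for (_, k, min_w, _) in succs[vj]:
                if min_w is not None:
                    queue.append(k)
    return x


# Hypothetical graph: b must follow a by at least 2 steps and c by exactly 1 step.
nodes = ["src", "a", "b", "c", "sink"]
edges = [
    ("src", "a", 0, None),
    ("src", "c", 0, None),
    ("a", "b", 2, None),
    ("c", "b", 1, 1),
    ("b", "sink", 1, None),
]
print(constraint_solution(nodes, edges))
```

In this example the maximum constraint on the edge from c to b pulls c forward to step 1 once b has been pushed to step 2, which is the behaviour the text describes for maximum timing constraints.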
Chaining [11] is supported during initial schedule generation using a slight modification to the constraint solution algorithm. When chaining is enabled, an operator that depends on the data output of another operator may be placed in the same control step if the estimated combinational delay of the cascaded operators does not exceed the clock period. If this value is exceeded, the second operator is placed in the following control step.
The schedule that results from this algorithm is equivalent to an as-soon-as-possible (ASAP) [1] schedule that is adjusted to meet all timing constraints. The constraint graph can also be solved in reverse order, starting at the sink node with a given number of steps. This schedule is equivalent to an as-late-as-possible (ALAP) schedule that meets all timing constraints. These schedules can be used in the same way that ASAP and ALAP schedules are used to determine operator time frames [5], i.e., ranges of control steps in which an operator may be scheduled. When scheduling in a minimum number of control steps, operators that are scheduled into the same control step in both schedules are critical path [30], [11] operators that cannot be placed in any other control step in schedules of the given length.
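As a small illustration of how the two schedules are combined, the sketch below derives time frames and critical path operators from hypothetical ASAP and ALAP results. The dictionaries and function names are assumptions made for this example only. The last function counts the total number of candidate positions over all operators, which is the quantity reported as estimated problem size in Section V.

```python
def time_frames(asap, alap):
    """Map each operator to its (earliest, latest) control step."""
    return {op: (asap[op], alap[op]) for op in asap}


def critical_operators(asap, alap):
    """Operators whose ASAP and ALAP steps coincide cannot be moved."""
    return sorted(op for op in asap if asap[op] == alap[op])


def total_positions(asap, alap):
    """Total number of control steps in which operators may be placed."""
    return sum(alap[op] - asap[op] + 1 for op in asap)


# Hypothetical results of forward and reverse constraint solution.
asap = {"add1": 1, "mul1": 2, "add2": 1}
alap = {"add1": 1, "mul1": 2, "add2": 3}
print(time_frames(asap, alap))
print(critical_operators(asap, alap))
print(total_positions(asap, alap))
```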
When multiple graphs are present that represent loops
and subroutines, the initialization part of the algorithm
must be modified so that each subroutine is assigned to a
unique set of control steps. This task is straightforward.
Maintaining timing constraints in the presence of calls to
subroutines and loops is more complicated. When the execution time of a subroutine or loop is known exactly, call
operators can be assigned a delay that represents the
execution time of the subroutine or loop in control steps.
This operator is then scheduled into dummy control steps
that represent the time spent during the execution of the
loop or subroutine. In this case, constraint solution can
be used as before to find a schedule that meets all timing
constraints. After constraint solution is completed, the
dummy control steps are removed. To guarantee that normal operators are not scheduled into the dummy control
steps, normal operators must be constrained to either precede or follow call operators by adding edges to the
CDFG.
When the execution time of a loop or subroutine is not
known, then a timing constraint that crosses the call
operation (i.e., the constraint is between one operator that
precedes the call and one operator that follows the call)
cannot be satisfied in all circumstances. However, if a
lower bound on execution time is known then solving the
graph assuming the minimum number of control steps will
result in a schedule that meets any minimum time constraints that cross a call. Similarly, solving the graph assuming the maximum number of control steps will result
in a schedule that meets any maximum timing constraints
that cross a call. However this approach will not work
when both minimum and maximum constraints cross a
call; this remains an area for future research.
4.2. Schedule Improvement

Schedule improvement is implemented using simulated
annealing. The configuration space of the annealing problem is the set of legal schedules for a CDFG. The move
set consists of moves M1-M4. In terms of the schedule
space, the configuration space corresponds to the region
of legal schedules. The application of a simple move M1
or M2 reschedules an operator in an adjacent control step,
corresponding to a move to an adjacent point in the schedule space. The application of a shoving move M3 or M4
reschedules multiple operators into adjacent control steps,
corresponding to a move to a point in the schedule space
that differs by one control step in multiple dimensions.
More global moves which make larger changes to a

schedule (e.g., move an operator more than one control
step) were also considered, but experiments showed no
improvement over using the basic move set. Repeated application of the move set under the control of simulated
annealing corresponds to a search of several different
points in the schedule space. Individual schedules may be
visited more than once when a move is rejected or reversed by a later move or sequence of moves.
Illegal configurations are often used in annealing implementations, particularly in module placement [31].
However, illegal configurations are not supported in
SALSA because the completeness of the move set guarantees that any legal schedule can be reached from any
starting schedule. Further, because the region of legal
schedules is convex, the shortest path between two schedules also lies within the region of legal schedules. Any
path of schedules that includes illegal configurations is
longer. This is not true in module placement, where module overlap constraints result in a region of legal configurations that is not convex and the shortest path between
two configurations is likely to be through a sequence of
illegal (overlapping) configurations.
Constraint solution is used as discussed earlier to create
an initial legal schedule. If a minimum-length schedule is
specified by the user then critical path operators are identified at this time also. Since critical path operators can
only be scheduled in one position, they are excluded from
consideration for move applications.
Simulated annealing is implemented in a straightforward manner [13] using a cost value C (the weighted sum of resource requirements described in Section 3.5) and temperature control parameter T. The temperature parameter is set to an initial temperature $T_0$ which is gradually lowered. At each temperature, several move attempts are made. During each attempt, a move and operator are selected at random and the move is tested for legality with the selected operator. If illegal, the attempt is discarded without applying the move. If legal, then the move is applied and the change in cost $\Delta C$ is calculated. A negative value of $\Delta C$ reflects an improved configuration. These downhill moves are always accepted. A positive value of $\Delta C$ reflects an inferior configuration. These uphill moves are accepted with a probability:
$$e^{-\Delta C / T}.$$
This acceptance probability allows acceptance of uphill

moves and an escape from locally optimal points in the
design space. Rejected moves are reversed by applying
the equivalent move in the opposite direction.
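The acceptance rule can be sketched as follows. The function name and the injectable random source are assumptions made for illustration; a move with zero cost change is accepted here because the acceptance probability evaluates to one.

```python
import math
import random


def accept_move(delta_c, temperature, rng=random):
    """Always accept downhill moves; accept uphill moves with probability exp(-delta_c / T)."""
    if delta_c <= 0:
        return True
    return rng.random() < math.exp(-delta_c / temperature)


# At a high temperature most uphill moves are accepted; at a low one almost none are.
random.seed(0)
hot = sum(accept_move(1.0, 10.0) for _ in range(1000))
cold = sum(accept_move(1.0, 0.1) for _ in range(1000))
print(hot, cold)
```

A caller that receives False would undo the move by applying the equivalent move in the opposite direction, as described above.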
Temperature is controlled by an adaptive cooling
schedule that is an adaptation of [32]. It calculates initial
temperature, temperature changes, and equilibrium conditions based on statistics gathered from a number of
moves made before annealing begins. Move attempts are made at each temperature until either equilibrium is detected or an upper bound is reached (a weighted sum of the number of off-critical operators and the schedule length). The schedule terminates when there is no change in cost over a number of successive temperatures (typically 3).
Move selection is biased towards moves and operators
that are likely to reduce schedule cost. This is accomplished by selecting operators from the critical steps of
resources that contribute to the current cost. During this
selection a resource is chosen with probability proportional to the relative resource cost if the demand on this
resource exceeds a preset threshold (typically the lower
bound on functional units of this type). An operator is
then chosen at random from a critical step of the selected
resource.
Moving operators out of these critical steps reduces the
demand on the control step and tends to reduce schedule
cost. However, it is still useful to attempt moves on operators in other control steps since this may indirectly provide opportunities to reduce the cost. For this reason,
some of the selected operators (typically 10-15%) are instead chosen at random from the list of all non-critical
operators without regard to current control step.
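Move selection draws operators from the critical steps of each resource, which are maintained by the incremental cost update of Section 3.5. A minimal sketch of that bookkeeping for a single resource type follows. The class and method names are hypothetical, and demand is simplified to one unit per operator.

```python
class ResourceDemand:
    """Tracks per-step demand for one resource type and its critical steps."""

    def __init__(self, demand):
        self.demand = dict(demand)            # control step -> number of users
        self._recompute()

    def _recompute(self):
        # Full recalculation: cost is the maximum demand over all steps.
        self.cost = max(self.demand.values(), default=0)
        self.critical = {s for s, d in self.demand.items() if d == self.cost}

    def add(self, step):
        """Add an operator to the destination step."""
        self.demand[step] = self.demand.get(step, 0) + 1
        d = self.demand[step]
        if d == self.cost:
            self.critical.add(step)           # ties the current maximum
        elif d > self.cost:
            self.cost = d                     # new, higher maximum
            self.critical = {step}

    def remove(self, step):
        """Remove an operator from the source step."""
        self.demand[step] -= 1
        if step in self.critical:
            self.critical.discard(step)
            if not self.critical:             # it was the only critical step
                self._recompute()

    def move(self, source, destination):
        """Apply a simple move (M1 or M2) between adjacent steps."""
        self.remove(source)
        self.add(destination)


adders = ResourceDemand({1: 2, 2: 1, 3: 2})
adders.move(1, 2)                             # step 1 drops out, step 2 ties the maximum
print(adders.cost, sorted(adders.critical))
```

Only the removal of the last critical step triggers a scan of all control steps; every other case is handled in constant time, which is the point of the incremental scheme.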
After an operator is selected, a simple move M1 or M2
is selected at random and attempted. If this attempt fails
and chaining is allowed, then a simple chaining move is
attempted in the same direction. If this second attempt
fails, then either a shoving move or a recursive chaining
move is attempted. If all attempts fail, a final move is
applied occasionally (typically 0.1% of all cases) that returns the search to the best schedule found so far.
V. IMPLEMENTATION AND RESULTS
The SALSA representation and scheduler have been
implemented in about 4300 lines of C, including the initial schedule generation and schedule improvement
phases. A separate translator has also been developed that
reads CMU Value Trace files from the System Architect's
Workbench [8] and translates these files into the SALSA
representation.
The SALSA scheduler has been tested with a number
of examples. Results from these examples are summarized in Tables I-III. These examples include some small
examples previously used in the literature, a control-dominated benchmark example, and two larger data-dominated examples. Each table lists schedule length, resource
requirements, estimated problem size, and CPU seconds
of execution time for each annealing run (CPU times were
measured on a Sun SparcStation IPC with 24-Mb memory).
When evaluating scheduling speed, it is important to
recognize that the complexity of the scheduling problem
grows both with the number of operators in the CDFG and
also the length of the schedule. The estimated problem
size entry in Tables I-III attempts to estimate this complexity as the total number of scheduled positions that each
operator may be assigned. This value is equal to the number of variables required to represent the scheduling problem in an ILP formulation [17], [18].
TABLE I
VARIOUS EXAMPLES
FU
FU
Example
Steps
+/-
MAHA
MAHA (chained)
TMPCTL
RCVR
8
4
15
37
2
4
2
1
FU
Other
Reg
CPU
(sec)
7
9
10
2
18
42
10
64
Table I summarizes results for three examples. The

MAHA code sequence example [30] shows the operation of chaining in a small data-dominated example. The
first schedule was created without chaining, while the second schedule was created with chaining enabled. In this
case the use of chaining created a schedule with fewer
control steps at the expense of added functional units. The
TMPCTL temperature controller [9] example is a simple example that illustrates the interaction between scheduling and timing constraints. In each of these cases, the
quality of results matches those reported previously.
The RCVR example is used as a control-dominated example that is part of the I8251 high-level synthesis benchmark [33]. While the scheduler reduces resource demands
as much as possible, control-oriented approaches such as
path-based scheduling [20] give better results for this example, especially in terms of number of control states.
Improving performance in this area will be an important
area of future work.
The Fifth-Order Elliptic Wave Filter benchmark [5],
[33] has been intensively studied. It consists of 34 operators (8 multiply by constant, 26 addition). Table II summarizes results for this example for a number of schedule
lengths under three different sets of assumptions. In the
first set of results, chaining is not allowed, adder delay is
assumed to be one clock cycle, and multiplier delay is
assumed to be two clock cycles. Non-pipelined multipliers are used in this case. In the second set of results,
the same set of assumptions is used but now pipelined
multipliers are used with a latency of one clock cycle. In
the third set of results, chaining is allowed. In each case,
execution time of annealing increases as schedule length
increases, but at a slower rate than the increase in estimated problem size.
The first two sets of results can be compared to several
results in the literature (e.g., [ 5 ] , [9], [ll], [14]-[18]). In
each of these cases, SALSA finds schedules that match
the cost of the best schedules found by other researchers,
including several that are known to be 0ptima1.~Fewer
results are available for chaining schedules. Camposano
[20] reports 9 and 13 step schedules that were found using
path-based scheduling with functional unit constraints on
a serially-ordered CDFG. Similar schedules were found
by SALSA and are shown in Table II. In addition,
SALSA found a 26-step schedule requiring one multiplier
⁴ Register requirements may be reduced by one in some cases if the input
is assumed to be stored in a dedicated input register [18].
TABLE II
FIFTH-ORDER ELLIPTIC WAVE FILTER

Schedule Characteristics    Steps  Mult. FU  +/- FU  Reg  Prob. Size  CPU Sec
Non-Pipelined Multipliers,     17         3       3   10          38       13
No chaining                    18         2       3   10          96       55
                               21         1       2   10         198       53
                               28         1       1   10         436       34
Pipelined Multipliers          17         2       3   10          38       12
                               18         1       3   10          96       57
                               19         1       2   10         130       76
                               28         1       1   10         436       45
Chaining                        9         1       3   11         205       45
                               13         1       2   11         440       46
                               26         1       1   11         882       70
TABLE III
DISCRETE COSINE TRANSFORM EXAMPLE

Schedule Characteristics    Steps  Mult. FU  +/- FU  Reg  Prob. Size  CPU Sec
Non-Pipelined Multipliers,     10         4       4   15         240       57
No chaining                    14         3       3   13         432       52
                               18         2       3   16         624       63
                               19         2       2   17         672       81
                               34         1       2   15        1392      122
                               35         1       1   15        1440      137
Pipelined Multipliers          10         3       4   12         240       96
                               11         2       4   13         288       74
                               13         2       3   14         384       78
                               19         1       3   14         672       88
                               20         1       2   14         720       99
                               33         1       1   16        1344      132
Chaining                        7         3       5   15         384       41
                                8         2       4   15         432       42
                               11         2       3   15         578       44
                               16         1       3   14         816       50
                               17         1       2   15         864       52
                               32         1       1   16        1584       91
and one functional unit. When comparing these approaches, it is important to note that the quality of the
schedule found by path-based scheduling depends on the initial serial ordering of nodes; some orderings result in
longer schedules. In contrast, SALSA requires no such ordering and minimizes both functional unit and register
requirements.
When examining the quality of schedules with chaining, it is interesting to compare the functional unit
requirements with the absolute lower bounds for resource requirements derived in [36]. This bound predicts that the
number of functional units of each type can be no smaller than the number of operators of each type divided by the
number of control steps. In each of the three chained EWF schedules, multiplier and adder costs are equal to the
absolute lower bound. This demonstrates that, in contrast to our initial experience with the small MAHA example,
chaining often makes it possible to find low-cost schedules using a smaller number of control steps than other
approaches. These opportunities come at the expense of a more complex scheduling problem; estimated problem
sizes for chained versions of the EWF example are much larger than for unchained versions due to the larger time
frames that result from chaining.
The discrete cosine transform (DCT) was used to show the behavior of SALSA with larger examples. The DCT
is used extensively in image coding and compression, and has been implemented in hardware for special-purpose
image processors (e.g., [37]). Fig. 13 shows the CDFG of an 8-point DCT patterned after the implementation
described in [37]. It consists of 48 operators (16 multiply by constant, 25 add, and 7 subtract). Unlike the EWF
example, which has a relatively long minimum schedule length (17 steps in the unchained case), the DCT has a
short minimum schedule length (7 steps in the unchained case). This substantially increases the difficulty of finding
schedules that contain a reasonable number of functional units.
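The absolute lower bound discussed above for chained schedules can be reproduced with a few lines of arithmetic. This is a minimal sketch under stated assumptions rather than the formulation of [36]: it assumes each functional unit executes at most one operator per control step and rounds the quotient up because functional unit counts are integers. The function name `fu_lower_bound` and the dictionaries are illustrative; the operator counts and schedule data are taken from the text and from the chaining rows of Table II.

```python
import math


def fu_lower_bound(num_operators, num_steps):
    """Smallest number of functional units that can execute num_operators
    in num_steps control steps, assuming (for illustration) that one unit
    executes at most one operator per step."""
    return math.ceil(num_operators / num_steps)


# Operator counts from the text: the EWF has 8 multiplies and 26 additions.
EWF_OPS = {"mult": 8, "add": 26}

# Chained EWF schedules from Table II: steps -> (multiplier FUs, adder FUs).
EWF_CHAINED = {9: (1, 3), 13: (1, 2), 26: (1, 1)}

for steps, reported in EWF_CHAINED.items():
    bound = (fu_lower_bound(EWF_OPS["mult"], steps),
             fu_lower_bound(EWF_OPS["add"], steps))
    print(steps, reported, bound, reported == bound)
```

For all three chained EWF schedules the reported functional unit counts coincide with the computed bound, which is consistent with the statement in the text.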
TABLE IV
COMPARISON OF EXECUTION TIMES FOR EWF EXAMPLES

Scheduler          # CSTEPS  CPU Time     Machine Type
SALSA              17-21     13s-55s      Sun SPARC IPC
SA [14]            17        4m           DEC VAX 8650
FDS [5]            17-21     2m-6m        Xerox 1108
Extended FDS [36]  17-21     2s-3m        Apollo DN10000
ILP [17]           17-21     0.26s-34.5s  DEC VAX 8800
OASIC [18]         17, 18    30s, 4m      Intel 386
OASIC (FU Only)    19        36s          Intel 386
Fig. 13. CDFG for DCT example.
Table III summarizes scheduling results for this example under the same scheduling conditions used for the
EWF example: non-pipelined multipliers, pipelined multipliers, and chaining. In addition, it was assumed that
add and subtract operators would be implemented by ALU functional units that can perform both operations. As in
the EWF example, pipelined multipliers allow a substantial reduction in functional unit costs. However, as in the
EWF example, chaining again provides the best way to find low cost schedules using a small number of control
steps. Schedules that were produced using chaining match the absolute lower bound for functional units in 8, 11, and
32 steps. The scheduler was not able to produce a 16 step schedule at the absolute lower bound (1 multiplier and 2
adders). However, it found this result in a 17-step schedule.
Execution times for the DCT show that execution times grow at a reasonable rate as schedule length increases.
However, we have found that while SALSA consistently finds the best schedules for small examples such as the
EWF in a single annealing run, it does not always do so for larger examples such as the DCT. When this occurs,
multiple runs can be used to further improve the schedule at the expense of additional CPU time.
Table IV summarizes execution times for SALSA with the EWF example compared to those of a number of
previous approaches. Because these measurements were made on processors of widely varying speed, it is difficult
to use these results to make accurate comparisons. However, some conclusions can be drawn from these results.
First, SALSA shows a clear advantage over the simulated annealing approach of [14], which simultaneously
considers scheduling, operator allocation, and estimated interconnect cost. We believe that this advantage is due not
only to the reduced problem scope (i.e., scheduling only), but is also due to the fact that SALSA's efficient
representation and move set allows configurations to be explored very quickly.
While SALSA appears to have a clear advantage over execution times of Force-Directed Scheduling [5], [34],
the discrepancy in processor speed for the two sets of measurements is large enough to render comparison almost
meaningless. However, when compared to results from an extended FDS algorithm [38], there is still an
advantage even though a faster processor was used. More importantly, analysis of the FDS algorithm [38] has shown
that execution time grows as the square of schedule length. In contrast, while it is difficult to characterize the
execution time of a probabilistic algorithm, this time is related to the maximum number of move attempts at each
temperature. In SALSA, this value grows linearly with respect to schedule length.
Results for the ILP approach of [17] are given for non-pipelined multipliers in 17-21 control steps. These
execution times are smaller than those of SALSA but grow rapidly with increasing schedule length. An extension of
this work [35] adds constraints to support chaining and pipelined functional units. Execution times are not
available for these features, but for chaining the number of added constraints grows exponentially with the depth of
chaining allowed. Results for the OASIC IP approach to scheduling and allocation [18] are given for 17 and 18-step
schedules with pipelined multipliers. This approach uses more CPU time than the SALSA approach, but
includes consideration of interconnect cost and allocation. Execution times are greatly reduced when only functional
unit cost is considered, as shown in the final entry of Table IV.
ILP and IP approaches are very attractive since an optimal solution is guaranteed. These recent results show
that when schedule lengths are close to minimum schedule lengths, execution times are quite good. However, in
cases where schedule length is substantially longer than the minimum length or when chaining is used, the number
of variables in the problem formulation grows rapidly, as shown in Tables II and III. Since the execution times of
these approaches can be expected to grow rapidly with the number of variables, we believe that heuristic approaches
like SALSA will be competitive for a large class of practical synthesis problems.
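The growth in problem size with schedule length can be read directly from the tables. The sketch below is illustrative only: it uses the "Prob. Size" column of Table III (non-pipelined, unchained DCT rows) and computes the increase per added control step between consecutive rows. Treating this column as a proxy for the number of variables in an ILP formulation is an assumption, and the name `size_growth_per_step` is hypothetical.

```python
# (steps, Prob. Size) for the non-pipelined, unchained DCT rows of Table III.
DCT_ROWS = [(10, 240), (14, 432), (18, 624), (19, 672), (34, 1392), (35, 1440)]


def size_growth_per_step(rows):
    """Increase in estimated problem size per added control step,
    computed between consecutive (steps, size) rows."""
    return [(s1 - s0) / (t1 - t0)
            for (t0, s0), (t1, s1) in zip(rows, rows[1:])]


print(size_growth_per_step(DCT_ROWS))
# prints [48.0, 48.0, 48.0, 48.0, 48.0]
```

For these rows each added control step increases the estimated problem size by 48, which equals the number of operators in the DCT example, so lengthening the schedule enlarges the problem in proportion to the size of the CDFG.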
VI. CONCLUSION
This paper has described a new approach to scheduling
with timing constraints that minimizes resource costs. A
specialized representation and move set provide a way to
quickly explore scheduling alternatives after an initial
schedule is found using constraint solution. Simulated annealing provides an effective way to implement this exploration and yields good results in reasonable execution
times, especially when chaining is used and when schedule lengths are substantially longer than minimum schedule lengths. Proof that all legal schedules may be reached
using the move set provides confidence that the schedule
space can be thoroughly explored during annealing. In addition, it provides new insight into the scheduling problem that may be useful in other approaches. Future work
will concentrate on improving schedule quality for control-dominated examples, improving annealing performance on large examples, and extending the approach to
include support for interconnections, allocation, and more
general timing constraints.
ACKNOWLEDGMENT
The authors would like to thank R. Cloutier and the
anonymous reviewers for their suggestions for improving
this paper, M. McFarland and K. Vissers for helpful discussions concerning scheduling, and R. Rutenbar for helpful discussions concerning simulated annealing.
REFERENCES
[1] M. McFarland, A. Parker, and R. Camposano, "The high-level synthesis of digital systems," Proc. IEEE, vol. 78, Feb. 1990.
[2] J. Nestor and D. Thomas, "Behavioral synthesis with interfaces," in Proc. ICCAD-86, pp. 112-115, Nov. 1986.
[3] R. Camposano and A. Kunzmann, "Considering timing constraints in synthesis from a behavioral description," in Proc. ICCD, pp. 6-9, Oct. 1986.
[4] G. Borriello and R. Katz, "Synthesis and optimization of interface transducer logic," in Proc. ICCAD-87, pp. 274-277, Nov. 1987.
[5] P. Paulin and J. Knight, "Force-directed scheduling for behavioral synthesis of ASICs," IEEE Trans. Computer-Aided Design, vol. 8, pp. 661-678, June 1989.
[6] C. Hitchcock and D. Thomas, "A method of automatic data path synthesis," in Proc. 20th DAC, pp. 484-489, June 1983.
[7] C. Tseng and D. Siewiorek, "Automated synthesis of data paths in digital systems," IEEE Trans. Computer-Aided Design, vol. CAD-5, pp. 379-395, July 1986.
[8] D. Thomas, E. Lagnese, R. Walker, J. Nestor, J. Rajan, and R. Blackburn, Algorithmic and Register-Transfer Level Synthesis: The System Architect's Workbench. New York: Kluwer Academic, 1990.
[9] E. Girczyc and J. Knight, "An ADA to standard cell hardware compiler based on graph grammars and scheduling," in Proc. ICCD, pp. 726-731, Oct. 1984.
[10] M. McFarland and T. Kowalski, "Incorporating bottom-up design into high-level synthesis," IEEE Trans. Computer-Aided Design, vol. 8, pp. 938-950, Sept. 1990.
[11] B. Pangrle and D. Gajski, "Slicer: A state synthesizer for intelligent silicon compilation," in Proc. ICCD-87, Oct. 1987.
[12] G. Goossens, J. Vandewalle, and H. De Man, "Loop optimization in register-transfer scheduling for DSP-systems," in Proc. 26th DAC, pp. 826-831, June 1989.
[13] S. Kirkpatrick, C. Gelatt, and M. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671-680, May 1983.
[14] S. Devadas and A. R. Newton, "Algorithms for hardware allocation in data path synthesis," IEEE Trans. Computer-Aided Design, vol. 8, pp. 768-781, July 1989.
[15] M. Quayle and L. Grover, "Pipelined and non-pipelined data path synthesis using simulated annealing," Progress in Computer Aided VLSI Design, vol. 4, Feb. 1990.
[16] T. Ly and J. Mowchenko, "Applying simulated evolution to scheduling in high-level synthesis," in Proc. IEEE 33rd Midwest Symp. on Circuits and Systems, 1990.
[17] J. Lee, "A new integer linear programming formulation for the scheduling problem in data path synthesis," in Proc. ICCAD, pp. 20-23, Nov. 1989.
[18] C. Gebotys and M. Elmasry, "A global optimization approach for architectural synthesis," in Proc. 28th DAC, pp. 2-7, June 1991.
[19] S. Hayati and A. Parker, "Automatic production of controller specification from control and timing behavioral descriptions," in Proc. 26th DAC, pp. 75-80, June 1989.
[20] R. Camposano, "Path-based scheduling," IEEE Trans. Computer-Aided Design, vol. 10, Jan. 1991.
[21] Y. Liao and C. Wong, "An algorithm to compact a VLSI symbolic layout with mixed constraints," IEEE Trans. Computer-Aided Design, vol. CAD-2, pp. 62-69, Apr. 1983.
[22] G. Borriello, "A new interface specification methodology and its application to transducer synthesis," Ph.D. dissertation, Univ. of California at Berkeley, May 1988.
[23] D. Ku and G. De Micheli, "Relative scheduling under timing constraints," in Proc. 27th DAC, pp. 59-64, June 1990.
[24] C. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity. Englewood Cliffs, NJ: Prentice-Hall, 1982.
[25] M. Lorenzetti and D. Baeder, "Routing," in Physical Design Automation of VLSI Systems, B. Preas and M. Lorenzetti, Eds. Menlo Park, CA: Benjamin-Cummings, 1988.
[26] C. Tseng, R. Wei, S. Rothweiler, M. Tong, and A. Bose, "Bridge: A versatile behavioral synthesis system," in Proc. 25th DAC, pp. 415-420, June 1988.
[27] K. Wakabayashi and T. Yoshimura, "A resource sharing and control synthesis method for conditional branches," in Proc. ICCAD-89, pp. 62-65, Nov. 1989.
[28] J. Burns and A. R. Newton, "SPARCS: A new constraint-based IC symbolic layout spacer," in Proc. CICC, pp. 534-539, May 1986.
[29] A. R. Newton, "Symbolic layout and procedural design," in Design Systems for VLSI Circuits. Dordrecht, The Netherlands: Martinus Nijhoff, 1987, pp. 65-112.
[30] A. Parker, J. Pizarro, and M. Mlinar, "MAHA: A program for datapath synthesis," in Proc. 22nd DAC, pp. 461-466, July 1986.
[31] R. Rutenbar, "Simulated annealing algorithms: An overview," IEEE Circuits Devices Mag., vol. 6, no. 1, Jan. 1989.
[32] M. Huang, R. Romeo, and A. Sangiovanni-Vincentelli, "An efficient general cooling schedule for simulated annealing," in Proc. ICCAD-86, pp. 381-384, Nov. 1986.
[33] G. Borriello and E. Detjens, "High-level synthesis: Current status and future directions," in Proc. 25th DAC, pp. 477-482, June 1988.
[34] P. Paulin and J. Knight, "Scheduling and binding algorithms for high-level synthesis," in Proc. 26th DAC, pp. 1-6, June 1989.
[35] C. Hwang, J. Lee, and Y. Hsu, "A formal approach to the scheduling problem in high-level synthesis," IEEE Trans. Computer-Aided Design, vol. 8, pp. 464-475, Apr. 1991.
[36] J. Rabaey, C. Chu, P. Hoang, and M. Potkonjak, "Fast prototyping of datapath-intensive architectures," IEEE Design Test, June 1991.
[37] R. Woudsma et al., "One-dimensional linear picture transformer," U.S. Patent 4 881 192.
[38] W. Verhaegh, E. Aarts, J. Korst, and P. Lippens, "Improved force-directed scheduling," in Proc. EDAC 91, pp. 430-435, Feb. 1991.
John A. Nestor (S'78-M'87-SM'91) received the B.E.E. degree from Georgia Institute of Technology in 1979 and the M.S.E.E. and Ph.D. degrees
from Carnegie Mellon University, Pittsburgh, PA,
in 1981 and 1987, respectively.
Currently he is an Associate Professor of Electrical and Computer Engineering at Illinois Institute of Technology. His research interests include
high-level synthesis, visual hardware description
languages, and VLSI systems design.
Dr. Nestor received a Best Paper Award at the
22nd International Workshop on Microprogramming and Microarchitecture
in 1989 and a NSF Research Initiation Award in 1990. He is a member of
Eta Kappa Nu, Tau Beta Pi, and Sigma Xi.
Ganesh Krishnamoorthy received the B.S.E.E., M.S.E.E., and Ph.D. degrees from Illinois Institute of Technology in 1985, 1987, and 1992, respectively.
He is currently a Custom Engineer at Mentor
Graphics in Warren, NJ. His research interests include layout compaction and high-level synthesis.
Dr. Krishnamoorthy is a member of Eta Kappa
Nu and Tau Beta Pi.
