Abstract: This paper describes a new approach to the scheduling problem in high-level synthesis that meets timing constraints while attempting to minimize hardware resource costs.
The approach is based on a modified control/data flow graph
(CDFG) representation called SALSA. SALSA provides a simple move set that allows alternative schedules to be quickly explored while maintaining timing constraints. It is shown that
this move set is complete in that any legal schedule can be
reached using some sequence of move applications. In addition,
SALSA provides support for scheduling with conditionals,
loops, and subroutines. Scheduling with SALSA is performed
in two steps. First, an initial schedule that meets timing constraints is generated using a constraint solution algorithm
adapted from layout compaction. Second, the schedule is improved using the SALSA move set under control of a simulated
annealing algorithm. Results show the scheduler's ability to find
good schedules which meet timing constraints in reasonable execution times.
I. INTRODUCTION
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 12, NO. 8, AUGUST 1993
[Figure residue: CDFG fragment with a timing constraint, time(x1, x2) ≤ 3 steps, and a sink node; edge weights w are labeled in terms of e_ij.min and e_ji.]
TIMING CONSTRAINTS
$$x_j \ge x_i + e_{ij}.\mathit{min}$$
$$x_j \le x_i + e_{ij}.\mathit{max}$$
$$x_j \ge x_i + w_{ij}$$
Note that in a legal schedule, $\mathit{slack}(e_{ij}) \ge 0$ for every constraint $e_{ij}$ since every inequality must be satisfied.
It is sometimes useful to think of a schedule x of n nodes
as a point in an n-dimensional schedule space. Each constraint inequality defines a legal half-space within the
schedule space that satisfies that particular constraint.
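The half-space view can be made concrete in a few lines of Python. This is a minimal sketch, not code from the paper: it assumes each constraint is stored as a hypothetical triple (i, j, w) meaning x_j ≥ x_i + w, and it assumes slack is the amount by which that inequality is satisfied. The function names are illustrative only.

```python
def slack(x, constraint):
    """Slack of one constraint (i, j, w), assumed to mean x[j] >= x[i] + w."""
    i, j, w = constraint
    return x[j] - x[i] - w


def is_legal(x, constraints):
    """A schedule is legal when it lies inside every constraint's half-space."""
    return all(slack(x, c) >= 0 for c in constraints)


# Two operators: v1 runs at least 1 step and at most 3 steps after v0.
# The maximum constraint x1 <= x0 + 3 is rewritten as x0 >= x1 - 3.
constraints = [(0, 1, 1), (1, 0, -3)]
legal = is_legal([0, 2], constraints)     # inside both half-spaces
illegal = is_legal([0, 5], constraints)   # violates the maximum constraint
```

Storing a maximum constraint as a reversed edge with a negative weight keeps every constraint in the same form, so one test covers both kinds.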
[Fig. 2 residue: example schedules x = (1, 3), y = (1, 4), and z = (2, 5) plotted in the two-dimensional schedule space with axes x1 and x2.]
Since a legal schedule must satisfy all constraints, a region of legal schedules is defined by the intersection of
all such half-spaces. Since each half-space is convex, it
is easily shown that the region resulting from the intersection of half-spaces is also convex [24].
To illustrate the concept of schedule space, Fig. 2(a)
shows the two-dimensional schedule space that results
given two operators under the constraints:
time(v1, v2) ≥ 1 step
AND
NESTOR AND KRISHNAMOORTHY: SALSA: A NEW APPROACH TO SCHEDULING WITH TIMING CONSTRAINTS
[Figure residue: SALSA graph fragment showing slack nodes, operators, a timing constraint, and the sink node.]
required over all control steps, so this schedule will require a minimum of four registers. When a simple transformation changes only part of a scheduled CDFG, the
local nature of these calculations can be exploited to speed
the calculation of register costs.
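The register count described here, the largest number of values live in any one control step, can be sketched as follows. The lifetime representation (half-open intervals of control steps) and the function name are assumptions made for illustration; the paper's own data structures are not shown in this excerpt.

```python
def min_registers(lifetimes, num_steps):
    """Maximum number of simultaneously live values over all control steps.

    Each lifetime is a hypothetical half-open interval (start, end): the value
    occupies a register in control steps start, start + 1, ..., end - 1.
    """
    live = [0] * num_steps
    for start, end in lifetimes:
        for step in range(start, end):
            live[step] += 1
    return max(live, default=0)


# Five values in a five-step schedule; four of them overlap in step 2.
lifetimes = [(0, 3), (1, 4), (1, 2), (2, 5), (2, 3)]
needed = min_registers(lifetimes, num_steps=5)
```

Because each lifetime touches only the steps it spans, a move that changes one lifetime only requires recounting those steps, which is the locality the text refers to.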
[Figure residue: schedule-space illustration of a move (M3) under the constraint x2 ≤ x1 + 2; horizontal axis x1.]
when recursively shoving predecessor or successor nodes. This can be accomplished using a simple marking scheme.
In addition, since y is a legal schedule, the constraint inequality for the same constraint $e_{ij}$ must hold in schedule y:
$$y_j \ge y_i + w_{ij}$$
Since $y_i$ [...]
[Figure residue: panel (c), a case at node v_k in which moves M2 and M4 fail.]
closer to y until finally D(x', y) = 0, and so x' is equivalent to y. If each intermediate schedule is created using
a simple move M1 or M2, then each intermediate schedule
reduces the distance from schedule y by one. In this case
exactly D(x, y) moves are required to transform schedule
x into schedule y. Since shoving moves reduce the distance of an intermediate schedule from y by more than
one, any shoving moves reduce the number of moves
required to reach schedule y. Thus no more than D(x, y)
moves are required to transform schedule x into schedule y.
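The counting argument can be illustrated with a small sketch. It assumes that D(x, y) is the sum over operators of the step difference between the two schedules, which is consistent with the statement that a simple move changes the distance by one but is not defined in this excerpt. Constraints are hypothetical triples (i, j, w) meaning x_j ≥ x_i + w. Only single-step moves are modeled; shoving moves are not implemented, so the function gives up when one would be needed.

```python
def is_legal(x, constraints):
    """Every constraint (i, j, w) is assumed to mean x[j] >= x[i] + w."""
    return all(x[j] >= x[i] + w for i, j, w in constraints)


def distance(x, y):
    """Assumed form of D(x, y): total steps by which the operators differ."""
    return sum(abs(a - b) for a, b in zip(x, y))


def walk(x, y, constraints):
    """Move one operator one step toward y at a time, staying legal throughout.

    Returns the list of schedules visited, or None if no legal single-step
    move exists (the case where a shoving move would be needed).
    """
    current = list(x)
    path = [tuple(current)]
    while distance(current, y) > 0:
        for k in range(len(current)):
            if current[k] == y[k]:
                continue
            step = 1 if y[k] > current[k] else -1
            current[k] += step
            if is_legal(current, constraints):
                path.append(tuple(current))
                break
            current[k] -= step  # undo the illegal move and try another operator
        else:
            return None
    return path


constraints = [(0, 1, 1), (1, 0, -3)]   # x1 >= x0 + 1 and x1 <= x0 + 3
path = walk((0, 1), (2, 4), constraints)
moves_used = len(path) - 1

# With x1 fixed at exactly x0 + 1, neither operator can move alone.
stuck = walk((0, 1), (1, 2), [(0, 1, 1), (1, 0, -1)])
```

In the first case the walk uses exactly D(x, y) single-step moves. The second case returns None, which is the situation the shoving moves are meant to handle.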
[Figure residue: subroutine X and its calling context; a conditional selecting between Value A and Value B.]
constraint_solution() {
    /* start with all ops scheduled in step 0 */
    for (every node vi in G(V,E)) xi = 0;
    for (each successor vi of source vs) enqueue(vi);
    /* check constraints on nodes in queue */
    while (queue is not empty) {
        vj = dequeue();
        lower_bound = 0;
        upper_bound = 0;
        /* process minimum constraints and dependencies on predecessors */
        for (each predecessor edge eij of vj) {
            if (chaining_enabled && eij is not a timing constraint) {
                comb_delay = longest_comb_delay(vi) + vj.delay;
                if (comb_delay <= clock_period)
                    lower_bound = max(lower_bound, xi);
                else
IV. SCHEDULING WITH SALSA
The previous section discussed the SALSA representation and how alternative schedules can be explored using the SALSA move set. Given a schedule which meets
all timing and ordering constraints, the application of a
legal move to an operator in the schedule will result in a
new schedule that meets the same constraints. However,
an initial schedule that meets timing constraints must first
be created before this exploration process can proceed.
Following initial schedule creation, some method must be
used to guide the exploration process. This section describes the techniques used to accomplish these tasks.
4.1. Initial Schedule Generation
The initial scheduling phase takes a traditional CDFG
as input, finds a schedule that meets all timing constraints, and adds slack operators to form a SALSA graph.
The schedule can be either a minimum-length schedule or
a schedule of length specified by the user. To find the
schedule, it uses an iterative algorithm adapted from layout compaction [28], [29]. This algorithm is similar to the
relative scheduling algorithm of [23], but is performed
before allocation and does not support unbounded delays.
In one-dimensional layout compaction, objects to be
compacted are treated as nodes in a directed constraint
graph with a single source and sink node. Edges represent
relative positioning (e.g., object A is to the left of object
B). Edge weights represent spacing constraints between
objects (e.g., the distance between the center of objects
A and B must be greater than X). The problem of constraint solution is to find an assignment of objects to locations that meets all spacing constraints and minimizes
the overall layout size. Compaction research [21], [29]
has shown that when a constraint graph contains both
minimum and maximum constraints it can be solved in
O(V * K) execution time, where V refers to the number
of nodes and K refers to the number of maximum constraints.
Additional algorithms allow the determination of whether
a graph contains contradictory constraints [29].
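As a rough illustration of constraint solution on such a graph, the sketch below uses a generic Bellman-Ford style longest-path relaxation. It is not the O(V * K) algorithm of [29] (it runs in time proportional to nodes times edges) and the edge encoding is an assumption: an edge (i, j, w) means position j is at least position i plus w, and a maximum constraint is stored as a reversed edge with a negative weight.

```python
def solve_constraints(num_nodes, edges):
    """Earliest positions satisfying x[j] >= x[i] + w for every edge (i, j, w).

    A maximum constraint x[j] <= x[i] + m is encoded as the edge (j, i, -m).
    Returns None when the constraints are contradictory (positive cycle).
    """
    x = [0] * num_nodes           # start with every node at position 0
    for _ in range(num_nodes):    # num_nodes passes suffice if a solution exists
        changed = False
        for i, j, w in edges:
            if x[i] + w > x[j]:
                x[j] = x[i] + w   # push node j right to satisfy the edge
                changed = True
        if not changed:
            return x
    return None                   # still changing: constraints cannot all hold


# Chain 0 -> 1 -> 2 with unit spacing; node 2 at most 3 after node 0.
consistent = solve_constraints(3, [(0, 1, 1), (1, 2, 1), (2, 0, -3)])

# Same chain, but node 2 must be at most 1 after node 0: contradictory.
contradictory = solve_constraints(3, [(0, 1, 1), (1, 2, 1), (2, 0, -1)])
```

The same encoding carries over to scheduling when positions are read as control steps and edge weights as minimum or maximum step separations.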
It is straightforward to apply constraint solution techniques to the problem of finding a schedule in a CDFG.
The CDFG becomes a constraint graph in which edges are
weighted to represent timing constraints expressed in
control steps. Data and control edges are weighted to
guarantee proper operation ordering, and timing edges are
weighted to represent constraint values. Fig. 12 shows a
constraint solution algorithm for scheduling which is pat-
$= e^{-(\Delta C / T)}$.
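The surviving expression matches the usual simulated annealing acceptance rule. The sketch below assumes that standard rule: ΔC is the cost change caused by a move and T is the current temperature, improvements are always accepted, and a cost increase is accepted with probability exp(-ΔC / T). The function name and the `rng` parameter are illustrative, not taken from the paper.

```python
import math
import random


def accept_move(delta_cost, temperature, rng=random):
    """Metropolis-style test for one proposed move.

    Moves that do not increase cost are always accepted; otherwise the move
    is accepted with probability exp(-delta_cost / temperature).
    """
    if delta_cost <= 0:
        return True
    return rng.random() < math.exp(-delta_cost / temperature)


# At a high temperature most uphill moves pass; at a low one almost none do.
p_hot = math.exp(-2.0 / 10.0)
p_cold = math.exp(-2.0 / 0.1)
```

Passing the random source as a parameter makes the test repeatable, for example with a seeded `random.Random` instance.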
ory).
When evaluating scheduling speed, it is important to
recognize that the complexity of the scheduling problem
grows both with the number of operators in the CDFG and
also the length of the schedule. The estimated problem
size entry in Tables I-III attempts to estimate this complexity as the total number of scheduled positions that each
operator may be assigned. This value is equal to the number of variables required to represent the scheduling problem in an ILP formulation [17], [18].
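One way to compute such an estimate is sketched below. It assumes the positions available to an operator are the control steps between its earliest (ASAP) and latest (ALAP) possible step, inclusive; the paper does not spell out the computation in this excerpt, and the names are hypothetical.

```python
def problem_size(asap, alap):
    """Total number of (operator, control step) positions.

    asap[k] and alap[k] are the earliest and latest steps operator k may
    occupy; each operator contributes alap[k] - asap[k] + 1 positions.
    """
    return sum(late - early + 1 for early, late in zip(asap, alap))


# Three operators with time frames of width 3, 2, and 3.
size = problem_size(asap=[0, 0, 1], alap=[2, 1, 3])
```

Lengthening the schedule widens every time frame, so the estimate grows with both the operator count and the schedule length, as the text notes.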
TABLE I
VARIOUS
EXAMPLES
FU
FU
Example
Steps
+/-
MAHA
MAHA (chained)
TMPCTL
RCVR
8
4
15
37
2
4
2
1
FU
Other
Reg
CPU
(sec)
7
9
10
2
18
42
10
64
TABLE II
FIFTH-ORDER ELLIPTIC WAVE FILTER

Schedule Characteristics
Steps   FU (*)   FU (+/-)   Reg   Prob. Size   CPU (sec)

Non-Pipelined Multipliers, No Chaining
17      3        3          10    38           13
18      2        3          10    96           55
21      1        2          10    198          53
28      1        1          10    436          34

Pipelined Multipliers
17      2        3          10    38           12
18      1        3          10    96           57
19      1        2          10    130          76
28      1        1          10    436          45

Chaining
9       1        3          11    205          45
13      1        2          11    440          46
26      1        1          11    882          70
TABLE III
DISCRETE COSINE TRANSFORM EXAMPLE

Schedule Characteristics
Steps   FU (*)   FU (+/-)   Reg   Prob. Size   CPU (sec)

Non-Pipelined Multipliers, No Chaining
10      4        4          15    240          57
14      3        3          13    432          52
18      2        3          16    624          63
19      2        2          17    672          81
34      1        2          15    1392         122
35      1        1          15    1440         137

Pipelined Multipliers
10      3        4          12    240          96
11      2        4          13    288          74
13      2        3          14    384          78
19      1        3          14    672          88
20      1        2          14    720          99
33      1        1          16    1344         132

Chaining
7       3        5          15    384          41
8       2        4          15    432          42
11      2        3          15    578          44
16      1        3          14    816          50
17      1        2          15    864          52
32      1        1          16    1584         91
and one functional unit. When comparing these approaches, it is important to note that the quality of the schedule found by path-based scheduling depends on the initial serial ordering of nodes; some orderings result in longer schedules. In contrast, SALSA requires no such ordering and minimizes both functional unit and register requirements.

When examining the quality of schedules with chaining, it is interesting to compare the functional unit requirements with the absolute lower bounds for resource requirements derived in [36]. This bound predicts that the number of functional units of each type can be no smaller than the number of operators of each type divided by the number of control steps. In each of the three chained EWF schedules, multiplier and adder costs are equal to the absolute lower bound. This demonstrates that in contrast to our initial experience with the small MAHA example, chaining often makes it possible to find low-cost schedules using a smaller number of control steps than other approaches. These opportunities come at the expense of a more complex scheduling problem; estimated problem sizes for chained versions of the EWF example are much larger than unchained approaches due to the larger time frames that result from chaining.

The discrete cosine transform (DCT) was used to show the behavior of SALSA with larger examples. The DCT is used extensively in image coding and compression, and has been implemented in hardware for special-purpose image processors (e.g., [37]). Fig. 13 shows the CDFG of an 8-point DCT patterned after the implementation described in [37]. It consists of 48 operators (16 multiply by constant, 25 add, and 7 subtract). Unlike the EWF example which has a relatively long minimum schedule length (17 steps in the unchained case), the DCT has a short minimum schedule length (7 steps in the unchained case). This substantially increases the difficulty of finding schedules that contain a reasonable number of functional units.
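The lower bound described above is simple to evaluate. The sketch below assumes the quotient is rounded up, since a functional unit count must be an integer, and that every operator occupies its unit for a single control step; the function name is hypothetical. The operator counts are the DCT figures given in the text (16 multiplies, and 25 adds plus 7 subtracts sharing ALUs).

```python
def fu_lower_bound(num_ops, num_steps):
    """Smallest possible number of units: ceil(num_ops / num_steps)."""
    return -(-num_ops // num_steps)   # integer ceiling division


DCT_MULTIPLIES = 16
DCT_ALU_OPS = 25 + 7   # adds and subtracts share ALU functional units

bounds = {
    steps: (fu_lower_bound(DCT_MULTIPLIES, steps),
            fu_lower_bound(DCT_ALU_OPS, steps))
    for steps in (8, 11, 16, 32)
}
```

For a 16-step schedule this gives 1 multiplier and 2 ALUs, the same bound the text quotes when discussing the DCT results.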
TABLE IV
COMPARISON OF EXECUTION TIMES FOR EWF EXAMPLES

Scheduler            # CSTEPS   CPU Time      Machine Type
SALSA                17-21      13s-55s       Sun Sparc IPC
SA [14]              17         4m            DEC VAX 8650
FDS [5]              17-21      2m-6m         Xerox 1108
Extended FDS [36]    17-21      2s-3m         Apollo DN10000
ILP [17]             17-21      0.26s-34.5s   DEC VAX 8800
OASIC [18]           17, 18     30s, 4m       Intel 386
OASIC (FU Only)      19         36s           Intel 386
Fig. 13. CDFG for DCT example.

Table III summarizes scheduling results for this example under the same scheduling conditions used for the EWF example: non-pipelined multipliers, pipelined multipliers, and chaining. In addition, it was assumed that add and subtract operators would be implemented by ALU functional units that can perform both operations. As in the EWF example, pipelined multipliers allow a substantial reduction in functional unit costs. However, as in the EWF example, chaining again provides the best way to find low cost schedules using a small number of control steps. Schedules that were produced using chaining match the absolute lower bound for functional units in 8, 11, and 32 steps. The scheduler was not able to produce a 16 step schedule at the absolute lower bound (1 multiplier and 2 adders). However, it found this result in a 17-step schedule.

Execution times for the DCT show that execution times grow at a reasonable rate as schedule length increases. However, we have found that while SALSA consistently finds the best schedules for small examples such as the EWF in a single annealing run, it does not always do so for larger examples such as the DCT. When this occurs, multiple runs can be used to further improve the schedule at the expense of additional CPU time.

Table IV summarizes execution times for SALSA with the EWF example compared to those of a number of previous approaches. Because these measurements were made on processors of widely varying speed, it is difficult to use these results to make accurate comparisons. However, some conclusions can be drawn from these results. First, SALSA shows a clear advantage over the simulated annealing approach of [14], which simultaneously considers scheduling, operator allocation, and estimated interconnect cost. We believe that this advantage is due not only to the reduced problem scope (i.e., scheduling only), but is also due to the fact that SALSA's efficient representation and move set allows configurations to be explored very quickly.

While SALSA appears to have a clear advantage over execution times of Force-Directed Scheduling [5], [34], the discrepancy in processor speed for the two sets of measurements is large enough to render comparison almost meaningless. However, when compared to results from an extended FDS algorithm [38], there is still an advantage even though a faster processor was used. More importantly, analysis of the FDS algorithm [38] has shown that execution time grows as the square of schedule length. In contrast, while it is difficult to characterize the execution time of a probabilistic algorithm, this time is related to the maximum number of move attempts at each temperature. In SALSA, this value grows linearly with respect to schedule length.

Results for the ILP approach of [17] are given for non-pipelined multipliers in 17-21 control steps. These execution times are smaller than those of SALSA but grow rapidly with increasing schedule length. An extension of this work [35] adds constraints to support chaining and pipelined functional units. Execution times are not available for these features, but for chaining the number of added constraints grows exponentially with the depth of chaining allowed. Results for the OASIC IP approach to scheduling and allocation [18] are given for 17 and 18-step schedules with pipelined multipliers. This approach uses more CPU time than the SALSA approach, but includes consideration of interconnect cost and allocation. Execution times are greatly reduced when only functional unit cost is considered, as shown in the final entry of Table IV.

ILP and IP approaches are very attractive since an optimal solution is guaranteed. These recent results show that when schedule lengths are close to minimum schedule lengths, execution times are quite good. However, in cases where schedule length is substantially longer than the minimum length or when chaining is used, the number of variables in the problem formulation grows rapidly, as shown in Tables II and III. Since the execution times of
VI. CONCLUSION
This paper has described a new approach to scheduling
with timing constraints that minimizes resource costs. A
specialized representation and move set provide a way to
quickly explore scheduling alternatives after an initial
schedule is found using constraint solution. Simulated annealing provides an effective way to implement this exploration and yields good results in reasonable execution
times, especially when chaining is used and when schedule lengths are substantially longer than minimum schedule lengths. Proof that all legal schedules may be reached
using the move set provides confidence that the schedule
space can be thoroughly explored during annealing. In addition, it provides new insight into the scheduling problem that may be useful in other approaches. Future work
will concentrate on improving schedule quality for control-dominated examples, improving annealing performance on large examples, and extending the approach to
include support for interconnections, allocation, and more
general timing constraints.
ACKNOWLEDGMENT
The authors would like to thank R. Cloutier and the
anonymous reviewers for their suggestions for improving
this paper, M. McFarland and K. Vissers for helpful discussions concerning scheduling and R. Rutenbar for helpful discussions concerning simulated annealing.
REFERENCES
[1] M. McFarland, A. Parker, and R. Camposano, "The high-level synthesis of digital systems," Proc. IEEE, vol. 78, Feb. 1990.
[2] J. Nestor and D. Thomas, "Behavioral synthesis with interfaces," in Proc. ICCAD-86, pp. 112-115, Nov. 1986.
[3] R. Camposano and A. Kunzmann, "Considering timing constraints in synthesis from a behavioral description," in Proc. ICCD, pp. 6-9, Oct. 1986.
[4] G. Borriello and R. Katz, "Synthesis and optimization of interface transducer logic," in Proc. ICCAD-87, pp. 274-277, Nov. 1987.
[5] P. Paulin and J. Knight, "Force-directed scheduling for behavioral synthesis of ASICs," IEEE Trans. Computer-Aided Design, vol. 8, pp. 661-678, June 1989.
[6] C. Hitchcock and D. Thomas, "A method of automatic data path synthesis," in Proc. 20th DAC, pp. 484-489, June 1983.
[7] C. Tseng and D. Siewiorek, "Automated synthesis of data paths in digital systems," IEEE Trans. Computer-Aided Design, vol. CAD-5, pp. 379-395, July 1986.
[8] D. Thomas, E. Lagnese, R. Walker, J. Nestor, J. Rajan, and R. Blackburn, Algorithmic and Register-Transfer Level Synthesis: The System Architect's Workbench. New York: Kluwer Academic, 1990.
[9] E. Girczyc and J. Knight, "An ADA to standard cell hardware compiler based on graph grammars and scheduling," in Proc. ICCD, pp. 726-731, Oct. 1984.
[10] M. McFarland and T. Kowalski, Incorporating bottom-up design