Isabelle Hurbain, Corinne Ancourt†, François Irigoin
École des mines de Paris, F-77300 Fontainebleau, FRANCE

Michel Barreteau, Nicolas Museux
THALES Research & Technology, F-91767 Orsay, FRANCE

Frédéric Pasquier
THOMSON R&D France, F-35576 Cesson-Sévigné, FRANCE
[Figure 3: the constraint solver and its models; surviving labels: "Constraint Solver", "Communication I/O", "comm events"]
For each loop, the tiling partitions the iteration set and distributes it along three dimensions: (1) a cyclic temporal dimension, (2) a "processor" dimension, and (3) a local dimension which exploits the local memory.

The tiling is formally defined in [12] and summarized here. Let I be the loop nest iteration set (with n loops) contained in Z^n, defined by I = {0, ..., b1 − 1} × ··· × {0, ..., bn − 1}, where bk ∈ N* for 1 ≤ k ≤ n. Let P and L be n × n square diagonal integer matrices with non-null determinant. Then, for each point i of I, there exists one and only one triplet (c, p, l) of points of I satisfying Equation 1.

d(c) = α.c + β    (4)

where c is the index of a computation block, α a line vector of the same dimension as c, "." the standard scalar product, and β an integer. The solver selects values for all α and β.

4.3. Data flow dependence

If a block c′ uses a value defined by a block c, then c must be executed first. This is called a data flow dependence. If there is a data flow dependence between two computation blocks c and c′, then any legal schedule meets the precedence constraint of Equation 5, which forces c to be scheduled before c′.
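The linear schedule of Equation 4 and the precedence requirement of Section 4.3 can be sketched as follows. The values of α, β and the block indices below are hypothetical hand-picked values for illustration, not output of the actual solver.

```python
# Sketch of the linear schedule of Equation 4, d(c) = alpha . c + beta,
# and of the data-flow precedence requirement of Section 4.3.

def schedule_date(alpha, c, beta):
    """Logical date of computation block c (Equation 4)."""
    return sum(a * ci for a, ci in zip(alpha, c)) + beta

alpha, beta = (4, 1), 0              # hypothetical scheduling parameters
producer, consumer = (0, 1), (1, 0)  # consumer uses a value defined by producer

# A legal schedule executes the producer block strictly before the consumer.
assert schedule_date(alpha, producer, beta) < schedule_date(alpha, consumer, beta)
```

In the real tool, the solver searches for α and β such that this precedence holds for every data flow dependence at once.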
4.4. Memory capacity

A capacitive memory model is used. As all processors execute the same code, the required memory size is the same for each processor. Each task is allocated a private input buffer, but all tasks share a common output buffer. As soon as they are computed, results are sent to all the input buffers of the tasks that use them. The output buffer is sized to fit the output of any task and to support a flip-flop mechanism used to overlap computations and communications.

The size of a task input buffer is the sum of the spaces required for each argument. Its capacity is the minimum obtained with four possible schemes:

1. the full array size;

2. the volume of the hypercube accessed by all cycles;

3. the volume of references accessed per cycle, multiplied by the maximal number of live iterations, which is relevant when overlapping occurs; liveness information is carried by the dependences and the schedule;

4. the number of referenced accesses per cycle, multiplied by the maximal number of live iterations, useful in case of non-contiguous accesses.

The memory constraints are linked to the partitioning, dependence and scheduling parameters.

5. A Concurrent Constraint Logic Programming Model (CCLP)

While some constraints can be translated into linear inequations and solved by classical linear programming algorithms, others, such as resource constraints, require non-linear expressions. Solving both kinds of constraints requires the combination of integer programming and search. We use the Constraint Logic Programming approach.

CCLP handles linear and non-linear constraints and yields, through the concurrent propagation of constraints over all models, solutions satisfying the global problem. Figure 3 illustrates our models. The models are linked by variables that appear concurrently in different models. For example, Constraint 6 links the tiling and architectural models: the number of processors required by the tiling must be smaller than the number of processors available, for any task k.

ProcessorNumber ≥ max_k ( ∏_{i=1}^{n} P^k_{i,i} )    (6)

As another example, the computation block index c appears:

• in the tiling model, as block index of the tiled loop nest (Equation 1);

• in the dependence model, in the precedence relations (Equation 5);

• in the scheduling model (Equation 4).

During the resolution process, the models communicate their partial information (value intervals) about their variables to the others.

The CCLP system builds a solution space on a model-per-model basis. The global search looks for partial solutions in the different concurrent models. Only relevant information is propagated between models. Several global heuristics are used to improve the resolution; e.g., schedule choices are driven by computing the shortest path in the data-flow graph.

6. Case study

To show the characteristics of our tool, we present in the next sections various mappings of the application obtained for three different optimization criteria. These results are successively taken into account to size the architecture:

• first, the cost criterion (Section 7) gives the cheapest circuit configuration able to execute the application, whether or not it meets the real-time constraint;

• second, the memory minimization criterion (Section 8) gives a lower bound for the local memory capacity per processor;

• finally, the execution time criterion (Sections 9 and 10), together with constraints chosen from the previous results, provides solutions that fit both the architectural and the real-time constraints.

We choose a target architecture having up to 16 processors, each with a local memory of 128 or 256 bytes. These parameters are entered as architectural constraints in APOTRES. A pipelined multiply-add is used, and the task durations estimated by PIPS [9], the optimizing compiler used as a front-end to APOTRES, are respectively 1, 6, 6, 1 and 1 cycles.

7. Cost minimization

The cost of SoCs is key for industrial exploitation. It depends on the surface of a single processor, on the surface of a memory unit, and on the number of processors and memory units required by the application.
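As a toy illustration of this cost model: since the surface of a processor outweighs that of a memory unit, a surface-based cost drives the search toward few processors. All surface figures and candidate configurations below are hypothetical, and APOTRES performs this minimization through its constraint solver rather than by enumeration.

```python
# Toy sketch of the SoC cost model of Section 7: cost grows with the number
# of processors and memory units. Surface figures are hypothetical.

PROC_SURFACE = 10.0  # hypothetical surface of one processor (arbitrary units)
MEM_SURFACE = 1.0    # hypothetical surface of one memory unit

def soc_cost(n_processors, n_memory_units):
    """Surface-based cost of a candidate configuration."""
    return n_processors * PROC_SURFACE + n_memory_units * MEM_SURFACE

# Because a processor outweighs a memory unit, minimizing cost selects
# the configuration with the fewest processors.
candidates = [(1, 4), (4, 4), (16, 16)]  # hypothetical (processors, memory units)
best = min(candidates, key=lambda pm: soc_cost(*pm))
assert best == (1, 4)
```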
The cost of a processor outweighs that of the memory, so cost minimization induces solutions where the processor number is minimal.

Without the memory constraint, APOTRES selects a one-processor target machine, as could be expected. Table 1 presents the solution, which uses 203 memory bytes. It takes advantage of the target architecture and pipelines up to 8 or 13 local iterations per computation block.

Table 1. Tilings for cost minimization, 256b
Tasks               T0  T1  T2  T3  T4
# processors         1   1   1   1   1
# blocks            13  13   8   8   8
# local iterations  13   8   8   8   8

If a constraint on the memory capacity per processor such as #mem ≤ 128 is added, APOTRES chooses the Table 2 mapping on four processors, using 117 bytes each. The reduction of the pipelined local iterations per computation block decreases the data liveness and thus the memory used. In particular, 56 elements of Array H have to be stored between the two orthogonal convolutions T1 and T2, instead of 104 for the first solution.

Table 2. Tilings for cost minimization, 128b
Tasks               T0  T1  T2  T3  T4
# processors         1   4   4   4   4
# blocks            13  13   8  16  16
# local iterations  13   2   2   1   1

APOTRES provides an as-soon-as-possible schedule, represented in Figure 4. From the α and β scheduling parameters (Section 4.2), the schedules can be expressed using regular expressions: (C0 C1)^5 (C_ALL)^8 and (C0 C1)^6 C_ALL (C0 C1 C3 C4 C_ALL)^3 (C3 C4 C2 C3 C4)^4 C3 C4. In order to shorten the scheduling formulations, Ci corresponds to the computation block with li pipelined local iterations of Task Ti, i.e. Ci = Ti^li, and C_ALL stands for C0 C1 C2 C3 C4.

The logical durations of the schedules in Figure 4 are respectively 1355 and 519 cycles, while the memory capacities per processor are 203 and 117 bytes.

8. Memory constraint minimization

In order to size the architecture, we wish to know the minimum memory required to execute the application on the target architecture having up to 16 processors. The best solution found by APOTRES uses only 46 bytes on the 16 processors. The tiling chosen is shown in Table 3.

Table 3. Tilings with memory minimization
Tasks               T0  T1  T2  T3  T4
# processors        13  13  16  16  16
# blocks             7   4   4   4   4
# local iterations   2   2   1   1   1

To minimize data storage during the execution of data-flow dependent tasks, computation blocks have only one local iteration, except for Tasks T0 and T1. Look at the T1 code in Figure 2: it uses six array elements produced by Task T0. Pipelining two contiguous local iterations of T0 and T1 implies only one additional element of storage, and all 16 processors can be used to exploit the parallelism available in Tasks T2, T3 and T4.

The execution time is 117 cycles. Computations are again scheduled according to an as-soon-as-possible schedule: (C0)^3 (C0 C_ALL)^2 (C1 C2 C3 C4)^2.

9. Execution time under memory constraint

Multimedia applications often must meet real-time constraints. Here we wish to set the execution time to a value strictly less than the 500 cycles found in one previous solution in Section 7. Furthermore, the results of Sections 7 and 8 make processors having from 64 to 128 bytes good candidates for our application. Two cases are successively studied, with memory sizes of 128 and 64 bytes.

To reduce the execution time, APOTRES has to maximize the number of processors. Table 4 presents a solution that takes advantage of the available parallelism: processors and software pipelines.

Table 4. Tilings for time optimization, 128b
Tasks               T0  T1  T2  T3  T4
# processors        13  13  16  16  16
# blocks             1   1   1   1   1
# local iterations  13   8   4   4   4

The schedule is not interleaved: C0 C1 C2 C3 C4. The execution time, computed by the solver, is 98 cycles and the memory capacity required per processor is 72 bytes.

When the memory size is limited to 64 bytes, the application cannot be executed on the machine without additional partitioning. APOTRES finds the solution in Table 5 and its related schedule: C0^13 C1 C2 C3 C4. This solution uses 110 cycles to execute the application and requires 62 bytes per processor. It actually does not differ from the previous one, except that Task T0 has been tiled. Now, at each iteration, Task T0 writes only one array
[Figure 4: two as-soon-as-possible schedule charts for Tasks T0 to T4; only the task axis labels survive extraction]
Table 5. Tilings for time optimization, 64b
Tasks               T0  T1  T2  T3  T4
# processors        13  13  16  16  16
# blocks            13   1   1   1   1
# local iterations   1   8   4   4   4

Table 7. Tilings with time optimization, 4 proc
Tasks               T0  T1  T2  T3  T4
# processors         4   4   4   4   4
# blocks             7   7   4   2   2
# local iterations   7   4   4   8   8
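The tilings in Tables 5 and 7 can be cross-checked against the per-task iteration counts: for each task, (# processors) × (# blocks) × (# local iterations) must cover the task's iteration set. The iteration counts used below (169 for T0, 104 for T1, 64 for T2 to T4) are inferred from the tables themselves, not stated explicitly in this section.

```python
# Cross-check of Tables 5 and 7: the tile product must cover each task's
# iteration count. Iteration counts are inferred from the tables
# (e.g. Table 5 tiles T0 as 13 x 13 x 1 = 169 iterations).

ITERATIONS = {"T0": 169, "T1": 104, "T2": 64, "T3": 64, "T4": 64}  # inferred

TABLE_5 = {  # time optimization, 64b: (# processors, # blocks, # local iterations)
    "T0": (13, 13, 1), "T1": (13, 1, 8), "T2": (16, 1, 4),
    "T3": (16, 1, 4), "T4": (16, 1, 4),
}
TABLE_7 = {  # time optimization, 4 processors
    "T0": (4, 7, 7), "T1": (4, 7, 4), "T2": (4, 4, 4),
    "T3": (4, 2, 8), "T4": (4, 2, 8),
}

for tiling in (TABLE_5, TABLE_7):
    for task, (procs, blocks, local) in tiling.items():
        # Padding may make a tiling cover slightly more than the iteration set.
        assert procs * blocks * local >= ITERATIONS[task]
```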
Table 8. Comparative table for the 7 solutions
Sol.  Criterion   Constraint  # proc  Time (cycles)  Mem/proc (bytes)  Total mem (bytes)  Efficiency
3     Cost        mem ≤ 128        4            519               117                468        0.65
4     Exec. time  # proc = 4       4            367               125                500        0.92
5     Memory      mem ≤ 128       16            117                46                736        0.72

Schedules:
3: (C0 C1)^6 C_ALL (C0 C1 C3 C4 C_ALL)^3 (C3 C4 C2 C3 C4)^4 C3 C4
4: (C0 C1)^3 (C0 C1 C2 C_ALL)^2
5: (C0)^3 (C0 C_ALL)^2 (C1 C2 C3 C4)^2
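The two unlabeled numeric columns of this comparative table are consistent with total memory = (# proc) × (memory per processor) and efficiency = T_seq / (# proc × T_par), taking the 1355-cycle one-processor schedule of Section 7 as the sequential reference. This column interpretation is inferred from the numbers, not stated in the surviving text; a quick consistency check:

```python
# Consistency check of the derived columns of the comparative table.
# Inferred interpretation: total memory = processors x bytes per processor,
# efficiency = T_SEQ / (processors x time), with T_SEQ = 1355 cycles,
# the one-processor duration reported in Section 7.

T_SEQ = 1355

rows = [  # (processors, time in cycles, bytes/processor, total bytes, efficiency)
    (4, 519, 117, 468, 0.65),
    (4, 367, 125, 500, 0.92),
    (16, 117, 46, 736, 0.72),
]

for procs, time, mem, total_mem, eff in rows:
    assert procs * mem == total_mem
    assert abs(T_SEQ / (procs * time) - eff) < 0.005  # rounded to two digits
```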