
A Case Study of Design Space Exploration
for Embedded Multimedia Applications on SoCs ∗

Isabelle Hurbain, Corinne Ancourt†, François Irigoin
École des mines de Paris, F-77300 Fontainebleau, FRANCE

Michel Barreteau, Nicolas Museux
THALES Research & Technology, F-91767 Orsay, FRANCE

Frédéric Pasquier
THOMSON R&D France, F-35576 Cesson-Sévigné, FRANCE

Abstract

Embedded real-time multimedia applications usually imply data parallel processing. SIMD processors embedded in SoCs are cost-effective to exploit the underlying parallelism. However, programming applications for SIMD targets requires data placement and operation scheduling, which are NP-complete problems.

In this paper we show how our tool (based on concurrent constraint programming) can be used to explore the design space of a kernel of the H.264 standard (video compression). Different cost functions are considered (e.g. execution time, memory occupancy, chip cost...) to derive different source codes from the same functional specification. Future work includes model refinement as well as full code generation for rapid prototyping of such hardware- and software-intensive systems.

1. Introduction

Embedded real-time multimedia applications usually imply data parallel processing. Indeed, image processing involves vectors or matrices of pixels, for instance. Moreover, these applications are often (statically) predictable and well-structured, in such a way that the potential parallelism can be extracted at compile time. More than 80% of the execution time is spent in loop nests that define and use arrays. Efficiently performing these regular computations is obviously a key issue.

SIMD (Single Instruction Multiple Data) architectures are well suited for this multimedia application domain because they naturally take advantage of stream-oriented parallelism. Even if, architecturally speaking, this technology can be seen as an old-fashioned one, it remains easily scalable and can be exploited in addition to other techniques. Such SIMD units are now embedded in SoCs and GPPs (General Purpose Processors), e.g. the AltiVec technology from Motorola [4] and the Streaming SIMD Extensions from Intel [1].

Programming such domain-specific applications for these targets requires data placement and operation scheduling, which are NP-complete problems. But constraint technology [6] enables the efficient exploration of the combinatorial space of tilings and schedules. A set of cost functions lets the user obtain automatically a variety of solutions, close to expert-given ones, that bound the valid design space.

This article briefly presents our APOTRES tool and its underlying constraint technology [5, 11], and focuses on a case study to show how the design space is explored. The next section describes the application and the architecture. Section 4 introduces the constraint-based models, such as tiling and scheduling, which define the solution space. Section 5 presents our Concurrent Constraint Logic Programming approach. Then several optimization criteria are considered, such as machine size (Section 7), memory capacity (Section 8), and execution time with technological or real-time constraints (Sections 9 and 10). Results for the different cost functions are compared in Section 11.

2. Application description
∗ This work is funded by the RNTL DREAM-UP project. URL: http://www.telecom.gouv.fr/rntl/projet/Posters-PDF/RNTL-Poster-DREAM-UP-Thomson.pdf
† Corresponding author: ancourt@cri.ensmp.fr

The new standard H.264 [13] provides advanced video coding techniques and has been adopted by many organizations and companies. It increases the compression ratio by a factor of two in comparison to MPEG-2 and thus reduces the bit rate needed to transfer images of the same quality. However, H.264 kernels require more computation levels and their implementation should be adapted to various architectures.

We use the fractional sample interpolation of H.264 [13] as a running example. Interpolation is used to determine intensity values at non-integer pixel positions to perform motion compensation. In Figure 1 the grey squares represent existing pixels. Pixel i is the desired output. A vertical convolution produces pixels such as cc, dd, h, m, ee, ff. A horizontal one produces j from these pixels. Finally, i is derived from h and j.

    float X[0:12][0:12], Y[0:7][0:7], H[0:7][0:12],
          K[0:7][0:7], OUTPUT[0:7][0:7],
          COEF1[0:5], COEF2[0:5];

    for(i0=0; i0<13; i0++)
      for(j0=0; j0<13; j0++)
        X[i0][j0] = input();

    for(i1=0; i1<8; i1++)
      for(j1=0; j1<13; j1++)
        H[i1][j1] = conv(X[i1:i1+5][j1], COEF1[0:5]);

    for(i2=0; i2<8; i2++)
      for(j2=0; j2<8; j2++)
        K[i2][j2] = conv(H[i2][j2:j2+5], COEF2[0:5]);

    for(i3=0; i3<8; i3++)
      for(j3=0; j3<8; j3++)
        Y[i3][j3] = K[i3][j3] + H[i3][j3];

    for(i4=0; i4<8; i4++)
      for(j4=0; j4<8; j4++)
        OUTPUT[i4][j4] = Y[i4][j4];

Figure 2. Interpolation code

Figure 1. Fractional sample interpolation

For our application we compute the 8×8 matrix of pixels i with the following equations:

    H[i, j] = conv(X[i − 2 : i + 3, j])
    K[i, j] = conv(H[i, j − 2 : j + 3])
    Y[i, j] = K[i, j] + H[i, j]

where X is the 13×13 input matrix, H an 8×13 intermediary matrix, K an 8×8 intermediary matrix, Y the 8×8 output matrix, and conv a 6-tap convolution filter.

2.1. Application code

The interpolation code is expressed in a pseudo-C language in Figure 2. It is derived from the equations defining H, K and Y; the code is composed of parallel loops in single-assignment form. Each loop nest implements a task that reads one or more multidimensional data arrays and updates one different array. The call function arguments represent all the array elements read by the function. The loop nest tasks are called T0, T1, T2, T3 and T4.

3. The global approach

The goal is to find a mapping of the application onto the machine that satisfies the programming model and the architectural constraints. The global optimization problem is decomposed into many subproblems: memory minimization, task tiling and scheduling, communication overlapping, respect of dependencies, latency minimization, ..., each being a full optimization problem in itself. Only a concurrent, modular model approach can meet all our needs simultaneously.

Figure 3 illustrates our global approach. A Constraint Logic Programming approach is used (Section 5). The application, the architecture, the memory, the task scheduling and tiling, the communication, the data-flow dependencies and the latency are modeled with linear and non-linear constraints. These models are introduced in Section 4 and represent the core of APOTRES.

The tool takes as input the application and the architectural parameters. The processor number, the local and global memory sizes, and the pipeline depth are given in a specification file. The information coming from the application is automatically extracted using the PIPS compiler. The task durations are estimated using [8] and the array elements referenced by each task are evaluated by array region analysis [14]. These pieces of information are taken into account and propagated, together with deduced information, from model to model. During the resolution the models exchange range information about their variables. The cost function, selected among execution time, memory, architecture cost and latency, guides the resolution through a specific heuristics. A first solution satisfying all the constraints is found.

[Figure 3: block diagram of the concurrent constraint approach. A constraint solver coordinates the Memory, Scheduling, Architecture, Tiling, Time, Application and Communication models. Architectural inputs (memory sizes, processor number, pipeline depth) and application inputs (task durations, array volumes, data-flow dependencies, referenced array elements) feed the models; the resolution, guided by the selected cost function and heuristics, produces solutions.]

Figure 3. The Concurrent Constraint Programming Approach

The solving process is automatic.

There is never a bad solution, since any solution produced meets all the constraints. To obtain legal results, the tool designer has to approximate the architecture and the solution design by proper models. The introduction of a new architectural scheme implies the development of a new set of constraints.

4. Models

As explained above, the mapping of applications onto the architecture uses different interacting models. The main ones, used concurrently by the solver, are introduced in this section.

4.1. Tiling

For each loop, the tiling partitions the iteration set and distributes it along three dimensions: (1) a cyclic temporal dimension, (2) a "processor" dimension, and (3) a local dimension which exploits the local memory.

The tiling is formally defined in [12] and summarized here. Let I be the loop nest iteration set (with n loops) contained in Zⁿ, defined by I = {0, ..., b1 − 1} × · · · × {0, ..., bn − 1} where, for 1 ≤ k ≤ n, bk ∈ N∗. Let P and L be n × n square diagonal integer matrices with non-null determinant. Then for each point i of I, there exists one and only one triplet (c, p, l) of points such that:

    i = LPc + Lp + l    (1)

with

    ∀l, 0 ≤ L⁻¹l < 1    (2)
    ∀p, 0 ≤ P⁻¹p < 1    (3)
    ∀i ∈ I, 0 ≤ i < b
    det(L) ≠ 0
    det(P) ≠ 0 and P diagonal up to a permutation

The associated triplet (c, p, l) can be interpreted as follows: at a logical time c, each processor p runs l iterations. P and L define a tiling of the iteration domain. The solver must find the numerical values of their elements.

4.2. Scheduling

Schedules are computed with respect to tilings. For each loop nest, an affine function represents the schedule. A schedule is legal if it respects the data-flow dependence constraints. A schedule function is

    d(c) = α · c + β    (4)

where c is the index of a computation block, α a line vector of the same dimension as c, "·" the standard scalar product, and β an integer. The solver selects values for all α and β.

4.3. Data flow dependence

If a block c′ uses a value defined by a block c, then c must be executed first. This is called a data-flow dependence. If there is a data-flow dependence between two computation blocks c and c′, then any legal schedule meets the constraint

    d(c) < d(c′)    (5)

4.4. Memory capacity

A capacitive memory model is used. As all processors execute the same code, the required memory size is the same for each processor. Each task is allocated a private input buffer, but all tasks share a common output buffer. As soon as they are computed, results are sent to all input buffers of the tasks using them. The output buffer is sized to fit the output of any task and to support a flip-flop mechanism used to overlap computations and communications.

The size of a task input buffer is the sum of the spaces required for each argument. Its capacity is the minimum obtained with four possible schemes:

1. the full array size;

2. the volume of the hypercube accessed by all cycles;

3. the volume of references accessed per cycle, multiplied by the maximal number of live iterations, which is relevant when overlapping occurs; liveness information is carried by the dependencies and the schedule;

4. the number of referenced accesses per cycle, multiplied by the maximal number of live iterations, useful in case of non-contiguous accesses.

The memory constraints are linked to the partitioning, dependence and scheduling parameters.

5. A Concurrent Constraint Logic Programming Model (CCLP)

While some constraints can be translated into linear inequations and solved by classical linear programming algorithms, others, like resource constraints, require non-linear expressions. Solving techniques for both kinds of constraints require the combination of integer programming and search. We use the Constraint Logic Programming approach.

CCLP handles linear and non-linear constraints and yields, through the concurrent propagation of constraints over all models, solutions satisfying the global problem. Figure 3 illustrates our models. The models are linked by variables that appear concurrently in different models. For example, Constraint 6 links the tiling and architectural models: the number of processors required by the tiling of any task k must be smaller than the number of processors available.

    ProcessorNumber ≥ max_k ( ∏_{i=1..n} Pᵏ_{i,i} )    (6)

As another example, the computation block index c appears:

• in the tiling model, as block index of the tiled loop nest (Equation 1);

• in the dependence model, in the precedence relations (Equation 5);

• in the scheduling model (Equation 4).

During the resolution process, the models communicate their partial information (value intervals) about their variables to the others.

The CCLP system builds a solution space on a model-per-model basis. The global search looks for partial solutions in the different concurrent models. Only relevant information is propagated between models. Several global heuristics are used to improve the resolution; e.g., schedule choices are driven by computing the shortest path in the data-flow graph.

6. Case study

To show the characteristics of our tool, we present in the next sections various mappings of the application obtained for three different optimization criteria. These results are successively taken into account to size the architecture:

• Firstly, the cost criterion (Section 7) gives the cheapest circuit configuration able to execute the application, meeting or not the real-time constraint;

• Secondly, the memory minimization criterion (Section 8) gives a lower bound for the local memory capacity per processor;

• Finally, the execution time criterion (Sections 9 and 10), together with constraints chosen from the previous results, provides solutions that fit both the architectural and the real-time constraints.

We choose a target architecture having up to 16 processors, each with a local memory of 128 or 256 bytes. These parameters are entered as architectural constraints in APOTRES.

A pipelined multiply-add is used, and the task durations estimated by PIPS [9], the optimizing compiler used as a front-end to APOTRES, are respectively 1, 6, 6, 1 and 1 cycles.

7. Cost minimization

The cost of SoCs is key for industrial exploitation. It depends on the surface of a single processor, on the surface of a memory unit, and on the number of processors and memory units required by the application.

The cost of a processor outweighs that of the memory. So, cost minimization induces solutions where the processor number is minimal.

Without the memory constraint, APOTRES selects a one-processor target machine, as could be expected. Table 1 presents the solution, using 203 memory bytes. It takes advantage of the target architecture and pipelines up to 8 or 13 local iterations per computation block.

Table 1. Tilings for cost minimization, 256b
Tasks               T0  T1  T2  T3  T4
# processors         1   1   1   1   1
# blocks            13  13   8   8   8
# local iterations  13   8   8   8   8

If a constraint on the memory capacity per processor such as #mem ≤ 128 is added, APOTRES chooses the Table 2 mapping on four processors, using 117 bytes each. The reduction of the pipelined local iterations per computation block decreases the data liveness and thus the memory used. In particular, 56 elements of Array H have to be stored between the two orthogonal convolutions T1 and T2, instead of 104 for the first solution.

Table 2. Tilings for cost minimization, 128b
Tasks               T0  T1  T2  T3  T4
# processors         1   4   4   4   4
# blocks            13  13   8  16  16
# local iterations  13   2   2   1   1

APOTRES provides an as-soon-as-possible schedule, represented in Figure 4. From the α and β scheduling parameters (Section 4.2), the schedules can be expressed using regular expressions: (C0 C1)^5 (CALL)^8 and (C0 C1)^6 CALL (C0 C1 C3 C4 CALL)^3 (C3 C4 C2 C3 C4)^4 C3 C4. In order to shorten the scheduling formulations, Ci corresponds to the computation block with li pipelined local iterations of Task Ti: Ci = Ti^li, and CALL is used for C0 C1 C2 C3 C4.

The logical durations of the schedules in Figure 4 are respectively 1355 and 519 cycles, while the memory capacities per processor are 203 and 117 bytes.

8. Memory constraint minimization

In order to size the architecture, we wish to know the minimum memory required to execute the application on the target architecture having up to 16 processors.

The best solution found by APOTRES uses only 46 bytes on the 16 processors. The tiling chosen is shown in Table 3.

Table 3. Tilings with memory minimization
Tasks               T0  T1  T2  T3  T4
# processors        13  13  16  16  16
# blocks             7   4   4   4   4
# local iterations   2   2   1   1   1

To minimize data storage during the execution of data-flow dependent tasks, computation blocks have only one local iteration, except for Tasks T0 and T1. Look at the T1 code in Figure 2: it uses six array elements produced by Task T0. Pipelining two contiguous local iterations of T0 and T1 implies only one additional element of storage, and all 16 processors can be used to exploit the parallelism available in tasks T2, T3 and T4.

The execution time is 117 cycles. Computations are again scheduled according to an as-soon-as-possible schedule: (C0)^3 (C0 CALL)^2 (C1 C2 C3 C4)^2.

9. Execution time under memory constraint

Multimedia applications often must meet real-time constraints. Here we wish to set the execution time to a value strictly less than the 500 cycles found in one previous solution in Section 7. Furthermore, the results of Sections 7 and 8 make processors having from 64 to 128 bytes good candidates for our application. Two cases are successively studied, with memory sizes of 128 and 64 bytes.

To reduce the execution time, APOTRES has to maximize the number of processors. Table 4 presents a solution that takes advantage of the available parallelism: processors and software pipelines.

Table 4. Tilings for time optimization, 128b
Tasks               T0  T1  T2  T3  T4
# processors        13  13  16  16  16
# blocks             1   1   1   1   1
# local iterations  13   8   4   4   4

The schedule is not interleaved: C0 C1 C2 C3 C4. The execution time, computed by the solver, is 98 cycles and the memory capacity required per processor is 72 bytes.

When the memory size is limited to 64 bytes, the application cannot be executed on the machine without additional partitioning. APOTRES finds the solution in Table 5 and its related schedule: C0^13 C1 C2 C3 C4.

This solution uses 110 cycles to execute the application and requires 62 bytes per processor. It actually does not differ from the previous one, except that Task T0 has been tiled. Now, at each iteration, Task T0 writes only one array

[Figure 4: two Gantt-style charts of the schedules of tasks T0 to T4; top: schedule for local memory < 256 bytes; bottom: schedule for local memory < 128 bytes.]

Figure 4. Schedules for cost minimization

Table 5. Tilings for time optimization, 64b
Tasks               T0  T1  T2  T3  T4
# processors        13  13  16  16  16
# blocks            13   1   1   1   1
# local iterations   1   8   4   4   4

element instead of 13. The output buffer size is smaller, and that is enough to fit the tightened memory constraint.

10. Execution time under processor constraint

In order to compare the solutions' efficiencies, the results have to be computed for the same optimization criterion. Here we briefly give the mappings of the application on one and four processors, using the execution time as cost function.

On one processor, the memory capacity required is 241 bytes and the execution time is 1343 cycles.

Table 6. Tilings with time optimization, 1 proc
Tasks               T0  T1  T2  T3  T4
# processors         1   1   1   1   1
# blocks            13  13   4   4   4
# local iterations  13   8  16  16  16

On four processors, the memory capacity required per processor is 125 bytes and the execution time is 367 cycles.

Table 7. Tilings with time optimization, 4 proc
Tasks               T0  T1  T2  T3  T4
# processors         4   4   4   4   4
# blocks             7   7   4   2   2
# local iterations   7   4   4   8   8

11. Discussion

Table 8 summarizes the different solutions according to the optimization function used and the additional constraints, hardware or real-time requirements.

Typically, to minimize the silicon area of a SoC, the two main components to take into account are the number of processors and the memory size of each processor. These two criteria yield the solution with 1 processor and 203 bytes of local memory. Unfortunately, this solution executes too slowly for a real-time embedded application such as video decoding. So the number of cycles should also be considered. Furthermore, decreasing the number of cycles required to execute an application allows one to decrease the SoC frequency, and thus to reduce the SoC surface by tightening the layers. Tradeoffs should be made between speed and cost. Solutions 4 and 6 are efficient: they satisfy both the application and the architectural constraints.

12. Related work

Using constraint programming to solve such mapping and scheduling problems can be seen as an automatic approach. Manual approaches may be preferred by designers who want to use their own design strategy and control the whole process. Mapping in GEDAE [2] is performed manually by allocating tasks to hardware resources, and scheduling is based on dynamic heuristics. But architectural features such as the memory hierarchy or the communication paths are not shown explicitly and are managed by the tool itself. This restricts in practice the scope of usable architectures. The SPEAR [10] Design Environment removes this limita-

Table 8. Comparative table for the 7 solutions

    Optimization  Additional   # procs  Duration  Local   Total   Efficiency  Schedule
    Function      Constraints           (cycles)  memory  memory
 1  Cost          mem ≤ 256       1     1355      203     203     0.97        (C0 C1)^5 (CALL)^8
 2  Exec. Time    #proc = 1       1     1343      241     241     1           (C0 C1)^5 (C0 C1 CALL)^4
 3  Cost          mem ≤ 128       4     519       117     468     0.65        (C0 C1)^6 CALL (C0 C1 C3 C4 CALL)^3 (C3 C4 C2 C3 C4)^4 C3 C4
 4  Exec. Time    #proc = 4       4     367       125     500     0.92        (C0 C1)^3 (C0 C1 C2 CALL)^2
 5  Memory        mem ≤ 128      16     117       46      736     0.72        (C0)^3 (C0 CALL)^2 (C1 C2 C3 C4)^2
 6  Exec. Time    mem ≤ 128      16     98        72      1152    0.86        C0 C1 C2 C3 C4
 7  Exec. Time    mem ≤ 64       16     110       62      992     0.76        C0^13 C1 C2 C3 C4

tion by making an open architectural dimension available to PTOLEMY II [3]. This enables rapid exploration of the design space. These tools make the architecture model and the design choices explicit to guide the design space exploration.

Our work is unique because it takes into account simultaneously: the mapping of the complete application, with scheduling and sizing using tiling; the application memory requirements; and the operational constraints.

13. Conclusion

This article shows how a multimedia application can be rapidly mapped onto a SIMD architecture. Our mapping tool is able to explore the tiling and scheduling spaces within the combinatorial space of solutions according to different criteria. It enables finding in a few minutes the best trade-off depending on the embedded real-time constraints and the target cost.

APOTRES is connected to PIPS, a tool that automatically analyzes and transforms codes written in Fortran. Another potential input is ANSI-C code whose functional results can be checked (by THOMSON). Hence our prototyping chain is nearly seamless, in the sense that a multimedia code can be parallelized from any standard specification (sometimes not parallel at all) translated "à la Fortran", or from a sequential C code.

To make APOTRES more useful, some improvements are planned in two areas: data communication and code generation (control, allocation, communication). Communications have to be modeled with respect to the communication resources. APOTRES generates integer values which are interpreted as mapping directives. We are currently studying control code generation using CLooG [7], which generates efficient C control code from a description of iteration domains and schedules.

References

[1] http://developer.intel.com/design/pentium4/manuals/index_new.htm.
[2] http://www.gedae.com.
[3] http://www.ptolemy.eecs.berkeley.edu/ptolemyii.
[4] http://www.simdtech.org/altivec.
[5] C. Ancourt, D. Barthou, C. Guettier, F. Irigoin, B. Jeannet, J. Jourdan, and J. Mattioli. Automatic data mapping of signal processing applications. In IEEE International Conference on Application-Specific Systems, Architectures and Processors, pages 350–362, 1997.
[6] K. R. Apt. Principles of Constraint Programming. Cambridge University Press, 2003.
[7] C. Bastoul. Efficient code generation for automatic parallelization and optimization. In ISPDC'2, IEEE International Symposium on Parallel and Distributed Computing, pages 23–30, Ljubljana, October 2003.
[8] B. Creusillet and F. Irigoin. Interprocedural array region analyses. In Lecture Notes in Computer Science – Languages and Compilers for Parallel Computing, pages 46–60, August 1995.
[9] F. Irigoin, P. Jouvelot, and R. Triolet. Semantical interprocedural parallelization: an overview of the PIPS project. In ACM International Conference on Supercomputing, ICS'91, Cologne, Germany, June 1991.
[10] E. Lenormand and G. Edelin. An industrial perspective: Pragmatic high-end signal processing environment at Thales. 2003.
[11] J. Mattioli, N. Museux, J. Jourdan, P. Savéant, and S. de Givry. A constraint optimization framework for mapping a digital signal processing application onto a parallel architecture. In Principles and Practice of Constraint Programming, 2000.
[12] N. Museux. Aide au placement d'applications de traitement du signal sur machines parallèles multi-SPMD. PhD thesis, École des Mines de Paris, 2001.
[13] T. Wiegand, G. Sullivan, and A. Luthra. ITU-T Rec. H.264 — ISO/IEC 14496-10 AVC, Final Draft. Technical report, Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, May 2003.
[14] L. Zhou. Statical and Dynamical Analysis of Program Complexity. PhD thesis, Université P. et M. Curie, 1994.
