
110 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 1, JANUARY 2011
Program Phase-Aware Dynamic Voltage Scaling
Under Variable Computational Workload and
Memory Stall Environment
Jungsoo Kim, Student Member, IEEE, Sungjoo Yoo, Member, IEEE, and Chong-Min Kyung, Fellow, IEEE
Abstract—Most complex software programs are characterized by program phase behavior and runtime distribution. The dynamism of these two characteristics often makes design-time workload prediction difficult and inefficient. In particular, memory stall time, whose variation is significant in memory-bound applications, has mostly been neglected or handled in a too simplistic manner in previous works. In this paper, we present a novel online dynamic voltage and frequency scaling (DVFS) method which takes into account both program phase behavior and the runtime distribution of memory stall time, as well as computational workload. The online DVFS problem is addressed in two ways: intraphase workload prediction and program phase detection. The intraphase workload prediction predicts the workload based on the runtime distribution of computational workload and memory stall time in the current program phase. The program phase detection identifies to which program phase the current instant belongs and then obtains the predicted workload corresponding to the detected program phase, which is used to set voltage and frequency during the program phase. The proposed method considers leakage power consumption as well as dynamic power consumption by a temperature-aware combined V_dd/V_bb scaling. Compared to a conventional method, experimental results show that the proposed method provides up to 34.6% and 17.3% energy reduction for two multimedia applications, MPEG4 and H.264 decoders, respectively.
Index Terms—Dynamic voltage and frequency scaling (DVFS), energy optimization, memory stall, phase, runtime distribution.
I. Introduction
Dynamic voltage and frequency scaling (DVFS) is one of the most effective methods for lowering energy consumption. DVFS is used to suppress the leakage energy by a dynamic control of supply voltage (V_dd) and body bias voltage (V_bb). Accurate prediction of the remaining workload (hereafter, workload prediction) plays a central role in DVFS, where the frequency level of the processor is set as the ratio of remaining workload to time-to-deadline.

Manuscript received March 15, 2010; accepted July 27, 2010. Date of current version December 17, 2010. This work was supported in part by the National Research Foundation of Korea Grant funded by the Korean Government, under Grant 2010-0000823, and the Brain Korea 21 Project, the School of Information Technology, Korea Advanced Institute of Science and Technology in 2010. This paper was recommended by Associate Editor H.-H. S. Lee.
J. Kim and C.-M. Kyung are with the Korea Advanced Institute of Science and Technology, Daejeon 305-701, South Korea (e-mail: jungsoo.kim83@gmail.com; kyung@ee.kaist.ac.kr).
S. Yoo is with the Pohang University of Science and Technology, Pohang 790-784, South Korea (e-mail: sungjoo.yoo@postech.ac.kr).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCAD.2010.2068630
The workload of a software program varies due to data dependency (e.g., loop counts), control dependency (e.g., if/else, switch/case statements), and architectural dependency [e.g., cache hit/miss, translation lookaside buffer (TLB) hit/miss, and so on]. To tackle the workload variation, extensive works have been proposed [9]–[13], [19] assuming that workload (i.e., the elapsed number of clock cycles seen by the processor) is invariant to processor frequency scaling. However, the assumption is not appropriate for applications having significant memory accesses. Fig. 1(a) shows the distribution of per-frame workload of an MPEG4 decoder at two different frequency levels, i.e., 1 and 2 GHz. It is obtained from decoding 3000 frames of a 1920x800 movie clip (an excerpt from Dark Knight) on an LG XNOTE LW25 laptop.¹ As shown in Fig. 1(a), the workload increases as processor frequency increases. This is due to the processor stall cycles spent while waiting for data from external memory (e.g., SDRAM, SSD, and so on). For example, when the memory access time is 100 ns, each off-chip memory access takes 100 and 200 processor clock cycles at 1 GHz and 2 GHz, respectively. Since the memory access time, called memory stall time, is invariant to processor clock frequency, the number of processor clock cycles spent for memory access grows as the clock frequency increases.
To consider memory stall time in clock frequency scaling, [4]–[6] present DVFS methods which set the clock frequency of the processor based on the decomposition of the whole workload into two clock frequency-invariant workloads: computational and memory stall workloads. Computational workload is the number of clock cycles spent for instruction execution, and memory stall workload corresponds to memory stall time. Based on the decomposed workloads, previous methods set the clock frequency, f, as f = w_comp / (t_d^R - t_stall), where w_comp and t_stall represent the average (or worst-case) computational workload and memory stall time, respectively, and t_d^R is the time-to-deadline.
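As a minimal sketch, the frequency-setting rule f = w_comp / (t_d^R - t_stall) can be written as follows; the workload and timing numbers are illustrative placeholders, not values from the paper.

```python
def set_frequency(w_comp, t_stall, t_deadline):
    """Memory stall-aware frequency setting: f = w_comp / (t_d^R - t_stall).

    w_comp     : predicted computational workload (cycles)
    t_stall    : predicted memory stall time (seconds, frequency-invariant)
    t_deadline : remaining time-to-deadline (seconds)
    """
    t_compute = t_deadline - t_stall   # time left for pure computation
    if t_compute <= 0:
        raise ValueError("infeasible: stall time exceeds the time budget")
    return w_comp / t_compute          # required frequency in Hz

# e.g., 30 Mcycles of computation, 10 ms of stall, 40 ms frame deadline
f = set_frequency(30e6, 10e-3, 40e-3)
print(f / 1e6, "MHz")   # 30e6 cycles / 30 ms -> 1000.0 MHz
```

Note that lowering t_stall in this rule raises the computed frequency; the point of the decomposition is that t_stall does not shrink when f grows.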
Generally, computational workload and memory stall time
have distributions as shown in Fig. 1(b) and (c). Fig. 1(b)
shows the distribution of computational workload caused by
data, control, and architectural dependency. Distribution of
¹The LG XNOTE LW25 laptop consists of a 2 GHz Intel Core2Duo T7200 processor with 128 KB L1 instruction and data caches, a 4 MB shared L2 cache, and 667 MHz 2 GB DDR2 SDRAM.
0278-0070/$26.00 © 2010 IEEE
Fig. 1. Per-frame profile results of the MPEG4 decoder when decoding the Dark Knight movie clip. (a) Total workload at 1 and 2 GHz. (b) Computational workload. (c) Memory stall time. (d) Phase behavior in the per-frame workload. (e) Runtime distributions of three representative phases.
memory stall workload shown in Fig. 1(c) results mostly from L2 cache hits/misses, page hits/misses, and interference (e.g., memory access scheduling [1]) in accessing DRAM. As the distribution of memory stall workload becomes significant, previous DVFS methods based on the average (or worst-case) memory stall workload become inappropriate for reducing energy consumption.
Long-running software programs are mostly characterized by nonstationary phase behavior [14], [15]. For example, multimedia programs (e.g., MPEG4 and H.264 CODECs) have distinct time durations whose workload characteristics (e.g., mean, standard deviation, and maximum value of runtime) are clearly different from those of other time durations. We call each such distinct time duration a program phase [14], [15]. A formal definition of program phase will be given later in Section VIII. Fig. 1(d) exemplifies the program phase behavior of the MPEG4 decoder when decoding the first 1000 frames of the movie clip excerpted from Dark Knight. The x-axis and the left-hand side y-axis represent the frame index and per-frame decoding cycles, respectively. The right-hand side y-axis represents the program phase index. Note that the program phase index does not correspond to the required performance level of the corresponding program phase in this example. As shown in Fig. 1(d), the entire time for decoding the 1000 frames is classified into nine program phases, and, within a program phase, the per-frame decoding cycle count has a runtime distribution. Fig. 1(e) shows the runtime distributions of three representative program phases out of the nine to illustrate that the runtime distribution within each program phase can be wide.
A. Our Approach
Our observation of the runtime characteristics of software programs suggests that, as shown in Fig. 1, the program workload has two characteristics: nonstationary program phase behavior and a runtime distribution (even within a program phase) of computational workload and memory stall time. Based on these observations, this paper presents an online DVFS method that tackles both characteristics of the program workload in order to minimize the average energy consumption of the software program. We address the online DVFS problem in two ways: intraphase workload prediction and program phase detection. The intraphase workload prediction predicts workloads based on the runtime distribution of computational workload and memory stall time in the current program phase. The program phase detection identifies to which program phase the current instant belongs and then obtains the intraphase workload prediction of the corresponding program phase, which is used to set voltage and frequency during the program phase.
Leakage power consumption often dominates total power consumption, especially at high temperature. Our method tackles leakage power consumption with a temperature-aware combined V_dd/V_bb scaling. During runtime, based on temperature readings as well as the runtime distribution, the online method selects an appropriate pair of V_dd and V_bb corresponding to the frequency level from the solution table (which was prepared at design time).
This paper is organized as follows. Section II reviews related works. Section III gives preliminaries on our energy model and profiling method. Section IV presents the problem definition and solution overview, followed by the analytical formulation of our problem in Section V. Sections VI and VII explain the proposed runtime distribution-aware DVFS. Section VIII presents the program phase detection method. Section IX reports experimental results, followed by the conclusion in Section X.
II. Related Works
There are a number of methods for workload prediction for online DVFS based on the (weighted) average, maximum, or most frequent workload, or on finding a repeated pattern among the N most recent workloads [2]. Recently, a control theory-based workload prediction method was proposed to accurately capture the transient behavior of workload [3]. To exploit memory stall time, [4] and [5] present memory stall time-aware DVFS for soft real-time intertask DVFS which lowers
the clock frequency by an amount proportional to the average
ratio of external memory access per instruction to clock
cycle per instruction. However, these memory stall time-aware
DVFS methods are based on average memory stall time and
do not exploit the workload distribution and nonstationary
program phase behavior.
The runtime distribution of computational workload (in most cases, assuming a constant memory stall time) has been studied mostly in intratask DVFS methods, where the performance level is set dynamically during the execution of a task. There are several intratask DVFS methods where workload is predicted based on program execution paths, e.g., the worst-case execution path [7], the average-case execution path [8], and a virtual execution path based on the probability of branch invocation [9]. [10] presents an analytic workload prediction method which minimizes the statistical average dynamic energy consumption. [11] presents a numerical solution for combined V_dd/V_bb scaling to tackle leakage energy. [12] and [13] present a DVFS method, called accelerating frequency schedules, which considers the per-task runtime distribution for a set of independent tasks. All the works mentioned above assume a constant memory stall time and a single program phase.
The program phase concept has been one of the hottest research issues because it opens new opportunities for performance optimization, e.g., program phase-aware dynamic adaptation of cache architecture [14], [15]. Various methods have been proposed to characterize program phase behavior. Among them, a vector of the average execution cycles of basic blocks, called the basic block vector (BBV), is most widely used. By characterizing a program phase with a BBV, one can apply the program phase concept to DVFS as in [16] and [17]. A new program phase is detected when two BBVs are significantly different, e.g., when the Hamming distance between two BBVs is larger than a predefined threshold value. Because there are a large number of basic blocks in typical software applications, program phase detection utilizing the BBV is usually impractical. Thus, the key issue is to reduce the dimensionality of the BBV by identifying a subset of basic blocks to represent the program phase behavior. A random linear projection method is described in [14] and [15] to reduce the effort of exploring all combinations of basic blocks to identify the subset. In this paper, we present a program phase detection scheme suitable for the DVFS purpose, based on vectors of predicted workloads for coarse-grained code sections (instead of using BBVs), as explained in Section VIII. In addition, unlike existing phase-based DVFS methods, our method exploits the runtime distribution within each program phase to better predict the remaining workload.
Several online DVFS methods have been presented to utilize dynamic program behavior for further energy saving. [18] presents a workload prediction method utilizing the Kalman filter, which captures time-varying workload characteristics by adaptively reducing the prediction error via feedback. In [19], we presented an online workload prediction method which minimizes both dynamic and leakage energy consumption by exploiting the program phase behavior and the runtime distribution of computational cycles within each program phase. Based on the assumption that memory stall time does not vary much during runtime, its distribution is not considered there; memory stall time is simply accounted for as an integral (nonseparable) part of the total runtime of the software program. However, in memory-bound applications, where memory stall time becomes a significant portion of the total program runtime, the distribution of memory stall time needs to be exploited to achieve further energy reduction.
Compared to the method which sets voltage and frequency based on the average computational workload and memory stall time during program runs [4], our method has three distinctive features. First, our approach exploits the runtime distribution of both computational cycles and memory stall time, while only the average values are assumed in [4]. Second, we exploit program phase detection to achieve maximal reduction of energy consumption, while [4] utilizes the average workload of the whole program without the notion of program phase. Third, in our method, workload prediction is done in a temperature-adaptive manner to tackle the dependence of leakage energy on temperature, while the temperature dependence is ignored in [4].
III. Preliminaries
A. Processor Energy Model
Energy consumption per cycle (e) consists of switching (e_s) and leakage (e_l) components. Additionally, in the deep submicron regime, e_l is further divided into subthreshold (e_l^sub), gate (e_l^gate), and junction (e_l^junc) leakage energies. Putting them all together, we can express the total energy consumption per cycle as follows [20], [21]:
e ≈ C_eff V_dd^2 + N_g f^{-1} [V_dd K_1 exp(K_2 V_dd) exp(K_3 V_bb) + V_dd K_4 exp(K_5 V_dd) + |V_bb| I_j]    (1)
where C_eff and N_g are the effective capacitance and the effective number of gates of the target processor, respectively. K_1, K_2, K_3, K_4, K_5, and I_j are process-dependent curve-fitting parameter sets for e_l^sub, e_l^gate, and e_l^junc, respectively. In particular, the values of K_1, K_2, and K_3 are functions of the operating temperature (T), since e_l^sub increases exponentially as the operating temperature increases. According to the BSIM4 model and [21], the temperature dependence of the parameters (K_1, K_2, and K_3) is modeled as follows:
K_1(T) = (T / T_ref)^2 exp[(K_6 / T_ref)(1 - T_ref / T)] K_1(T_ref)    (2)

K_2(T) = (T_ref / T) K_2(T_ref)    (3)

K_3(T) = (T_ref / T) K_3(T_ref)    (4)

where T_ref is the reference temperature and K_6 is a curve-fitting parameter. Thus, K_1, K_2, and K_3 at temperature T can be obtained from their values at T_ref using the relationships in (2)–(4).
Since the temperature-aware energy model in (1)–(4) is too complicated to be used in our optimization, we adopted a simplified energy model of combined V_dd/V_bb scaling to approximate the energy consumption per cycle at each temperature T as follows:

e(f, T) ≈ a_s(T) f^{b_s(T)} + a_l(T) f^{b_l(T)} + c(T).    (5)
TABLE I
Energy Fitting Parameters for Approximating the Processor Energy Consumption to the Accurate Estimation Obtained from PTscalar with the BPTM High-k/Metal Gate 32 nm HP Model at Different Temperatures, Along with the Corresponding (Maximum and Average) Errors

Temperature (°C) | a_s      | b_s | a_l      | b_l  | c    | Maximum [Avg] Error (%)
25               | 1.2x10^-1 | 1.3 | 4.6x10^-9 | 20.5 | 0.11 | 2.8 [0.9]
50               | 1.2x10^-1 | 1.3 | 2.0x10^-7 | 16.6 | 0.12 | 1.4 [0.5]
75               | 1.2x10^-1 | 1.3 | 2.2x10^-6 | 14.2 | 0.14 | 1.4 [0.4]
100              | 1.2x10^-1 | 1.3 | 1.4x10^-5 | 12.4 | 0.15 | 1.7 [0.7]
where a_s(T), b_s(T) and a_l(T), b_l(T) are sets of curve-fitting parameters which model the frequency-dependent portions of e_s(T) and e_l(T), respectively. c(T) is a curve-fitting parameter corresponding to the frequency-independent energy portion of e(f, T). Table I shows examples of fitting parameters which approximate the processor energy consumption obtained from PTscalar [21] and Cacti 5.3 [22] with the energy model, i.e., (1)–(4), for the Berkeley predictive technology model (BPTM) high-k/metal gate 32 nm HP model [23] at 25 °C, 50 °C, 75 °C, and 100 °C. In the modeling, we configured a target processor in PTscalar as the best-effort estimate of a Core 2-class microarchitecture using the parameters presented in [24]. As Table I shows, the simplified energy model tracks the original energy model within 2.8% maximum error across all operating temperatures. Note that the fitting parameters for modeling switching energy consumption, i.e., a_s and b_s, are unchanged as temperature varies because switching energy consumption is temperature invariant.
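The simplified model (5) can be evaluated directly with the Table I parameters, as sketched below; we assume f is expressed in GHz so that the fitted constants apply as listed (a units assumption on our part), and the energy values are in the fitted model's own units.

```python
# Table I fitting parameters (a_s, b_s, a_l, b_l, c) per temperature (deg C).
TABLE_I = {
    25:  (0.12, 1.3, 4.6e-9, 20.5, 0.11),
    50:  (0.12, 1.3, 2.0e-7, 16.6, 0.12),
    75:  (0.12, 1.3, 2.2e-6, 14.2, 0.14),
    100: (0.12, 1.3, 1.4e-5, 12.4, 0.15),
}

def energy_per_cycle(f_ghz, temp_c):
    """Simplified model (5): e(f,T) = a_s f^b_s + a_l f^b_l + c."""
    a_s, b_s, a_l, b_l, c = TABLE_I[temp_c]
    return a_s * f_ghz**b_s + a_l * f_ghz**b_l + c

# At a fixed frequency, the leakage terms grow with temperature while the
# switching term (a_s f^b_s) stays constant, as the text notes.
for t in (25, 50, 75, 100):
    print(t, "degC:", round(energy_per_cycle(2.0, t), 4))
```

The shrinking b_l across rows reflects that a single power-law term must fit an increasingly leakage-dominated curve at higher temperatures.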
Processor energy consumption depends on the type of instructions executed in the pipeline path [25]. To simply consider the energy dependence on instructions, we classify processor operation into two states: the computational state, for executing instructions, and the memory stall state, mostly spent waiting for data from memory. When a processor is in the memory stall state, switching energy consumption can be suppressed using clock gating, while leakage energy consumption is almost the same as in the computational state. The reduction ratio of switching energy, called the clock gating fraction and denoting the fraction of switching energy remaining in the clock-gated circuit, is modeled as a constant κ (0.1 in our experiments). Thus, the energy consumption per clock cycle in each processor state can be calculated as follows:

e_comp = a_s f^{b_s} + a_l f^{b_l} + c    (6)

e_stall = κ a_s f^{b_s} + a_l f^{b_l} + c    (7)

where e_comp and e_stall represent the energy consumption per cycle in the computational and memory stall states, respectively. Given a desired frequency level (f), one can always find a pair of V_dd and V_bb that gives the minimum energy consumption per cycle using the combined V_dd/V_bb scaling [11].
B. Runtime Workload Profiling
The total number of processor execution cycles, x, can be expressed as the sum of the number of clock cycles for executing instructions in the processor, x_comp, and the number of stall cycles for accessing external memory, x_stall, which is expressed as a function of the memory stall time, t_stall, and the frequency, f, as follows:

x = x_comp + x_stall = x_comp + f · t_stall.    (8)

Fig. 2. Memory stall time vs. number of L2 cache misses, as approximated by a straight line.
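The decomposition in (8) can be checked against the 100 ns example from the Introduction; the computational cycle count below is an arbitrary placeholder.

```python
def total_cycles(x_comp, t_stall_s, f_hz):
    """Eq. (8): x = x_comp + x_stall = x_comp + f * t_stall."""
    return x_comp + f_hz * t_stall_s

# 10,000 off-chip accesses at 100 ns each -> 1 ms of stall time, i.e.,
# 100 cycles per access at 1 GHz and 200 cycles per access at 2 GHz.
t_stall = 10_000 * 100e-9
print(total_cycles(5_000_000, t_stall, 1e9))   # stall adds 1M cycles
print(total_cycles(5_000_000, t_stall, 2e9))   # stall adds 2M cycles
```

Only the x_stall term scales with f; x_comp and t_stall are the frequency-invariant quantities the method profiles.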
For the decomposition of processor cycles into the two processor clock frequency-invariant components, i.e., x_comp and t_stall, during program runs, we adopt an online profiling method which uses performance counters in the processor, as presented in [4]. We model t_stall using only the number of last-level cache misses (N_L2miss; in our experiment, L2 is the last-level cache). The rationale for modeling t_stall only with N_L2miss is twofold. First, the effect of last-level cache misses dominates the others (TLB misses, interrupts, and so on) according to our experiment. Second, the number of events simultaneously monitored in a processor is usually limited (in our experimental platform, two events). In our model, t_stall is expressed as follows:

t_stall = a_p N_L2miss + b_p    (9)

where a_p and b_p are fitting parameters. Fig. 2 illustrates that (9) (solid line) tracks quite well the measured memory stall time (dots) when running the H.264 decoder program in FFMPEG [29].
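The fit in (9) is an ordinary least-squares line. A minimal sketch with synthetic counter readings (the per-miss latency and offset below are made up, not measured) is:

```python
def fit_stall_model(n_misses, t_stalls):
    """Least-squares fit of t_stall = a_p * N_L2miss + b_p, as in (9)."""
    n = len(n_misses)
    mean_x = sum(n_misses) / n
    mean_y = sum(t_stalls) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(n_misses, t_stalls))
    var = sum((x - mean_x) ** 2 for x in n_misses)
    a_p = cov / var                 # slope: seconds of stall per L2 miss
    b_p = mean_y - a_p * mean_x     # intercept: fixed stall overhead
    return a_p, b_p

# Synthetic profile: ~120 ns per L2 miss plus a 50 us fixed component
misses = [1000, 2000, 5000, 10000]
times = [120e-9 * m + 50e-6 for m in misses]
a_p, b_p = fit_stall_model(misses, times)
print(a_p, b_p)   # recovers the slope and intercept of the synthetic data
```

In a deployed profiler the (N_L2miss, t_stall) samples would come from the hardware performance counters rather than a synthetic generator.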
In a typical software program, the x_comp and t_stall obtained from running a code section are correlated with each other. This is because t_stall of a code section is proportional to the number of external memory references, which is highly correlated with the number of executed memory instructions (e.g., load and store) in the code section, while x_comp of a code section depends on the type and number of executed instructions, including memory instructions. To consider the correlation between computational cycles (x_comp) and memory stall time (t_stall), we model the distribution of x_comp and t_stall of a code section using a joint probability density function (PDF), as shown in Fig. 3. During runtime, the joint PDF is obtained as follows. After the execution of a code section, t_stall is obtained from (9). Then, from (8), x_comp is calculated from x and t_stall. The probability of occurrence of a pair of x_comp and t_stall is defined as the ratio of the number of occurrences of the pair to the total number of executions of the code section.
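One way to maintain such an empirical joint PDF online is a normalized two-dimensional histogram over quantized (x_comp, t_stall) pairs; the bin widths below are arbitrary choices for illustration, not parameters from the paper.

```python
from collections import Counter

class JointPDF:
    """Empirical joint PDF of (x_comp, t_stall) for one code section."""

    def __init__(self, cycle_bin=100_000, stall_bin=100e-6):
        self.counts = Counter()
        self.total = 0
        self.cycle_bin, self.stall_bin = cycle_bin, stall_bin

    def _key(self, x_comp, t_stall):
        return (int(x_comp // self.cycle_bin), int(t_stall // self.stall_bin))

    def update(self, x, t_stall, f_hz):
        """Record one execution: derive x_comp from (8), then count the pair."""
        x_comp = x - f_hz * t_stall
        self.counts[self._key(x_comp, t_stall)] += 1
        self.total += 1

    def prob(self, x_comp, t_stall):
        """Occurrences of the (binned) pair over total executions."""
        return self.counts[self._key(x_comp, t_stall)] / self.total

pdf = JointPDF()
pdf.update(x=6_000_000, t_stall=1e-3, f_hz=1e9)   # x_comp = 5.0M cycles
pdf.update(x=6_100_000, t_stall=1e-3, f_hz=1e9)   # x_comp = 5.1M cycles
print(pdf.prob(5_000_000, 1e-3))
```

The per-bin counts also suffice to update the running mean, standard deviation, and skewness that the prediction step in Section IV consumes.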
Fig. 3. Joint PDF with respect to computational workload (x_comp) and memory stall time (t_stall).
Fig. 4. Solution inputs. (a) Software program (or source code) partitioned into program regions. (b) Energy model (in terms of energy-per-cycle) as a function of frequency. (c) f-v table storing the energy-optimal pairs (V_dd, V_bb) for N frequency levels.
IV. Problem Definition and Solution Overview
Fig. 4 illustrates the three types of input required by the proposed procedure. Fig. 4(a) shows a software program partitioned into program regions, each shown as a box. A program region is defined as a code section with an associated voltage/frequency setting. The partitioning can be performed manually by a designer or via an automatic tool [26] based on the execution cycles of code sections obtained by a priori simulation of the software program. The ith program region is denoted as n_i, while the first and the last program regions are called the root (n_root) and leaf (n_leaf) program regions, respectively. In this paper, we focus on a software program which periodically runs from n_root to n_leaf at every time interval. At the start of a program region, voltage/frequency is set and maintained until the end of the program region. At the end of a program region, computational cycles and memory stall time are profiled. Then, as explained in Section III-B, the joint PDF of computational cycles and memory stall time is updated as shown in Fig. 3. Fig. 4(b) shows an energy model (more specifically, energy-per-cycle vs. frequency). Fig. 4(c) shows a pre-characterized table, called the f-v table, in which the energy-optimal pair (V_dd, V_bb) is stored for each frequency level (f). When the frequency is scaled, V_dd and V_bb are adjusted to the corresponding levels stored in the table. Note that, due to the dependency of leakage energy on temperature, the energy-optimal values of (V_dd, V_bb) corresponding to f vary depending on the operating temperature. Therefore, we prepare an f-v table for each of a set of quantized temperature levels.
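The f-v table can be sketched as a nested lookup keyed by quantized temperature and frequency level; all voltage values below are illustrative placeholders, not characterized data.

```python
# f-v table: per quantized temperature level, the energy-optimal
# (V_dd, V_bb) pair for each frequency level. Placeholder values only.
FV_TABLE = {
    25: {1.0e9: (0.80, -0.30), 1.5e9: (0.90, -0.20), 2.0e9: (1.00, -0.10)},
    75: {1.0e9: (0.78, -0.40), 1.5e9: (0.88, -0.30), 2.0e9: (0.98, -0.20)},
}

def lookup_fv(f_hz, temp_c):
    """Return the stored energy-optimal (V_dd, V_bb) for the requested f."""
    # snap the temperature reading to the nearest characterized level
    t_q = min(FV_TABLE, key=lambda t: abs(t - temp_c))
    # snap the requested frequency to the nearest available level
    f_q = min(FV_TABLE[t_q], key=lambda f: abs(f - f_hz))
    return FV_TABLE[t_q][f_q]

print(lookup_fv(1.4e9, 60))   # resolves to the 75 degC, 1.5 GHz entry
```

Preparing the table at design time keeps the runtime cost of a voltage decision to two nearest-level lookups.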
Algorithm 1: Overall flow
1: if (end of n_i) then
2:   Online profiling and calculation of statistics (Section III-B)
3:   if (n_i == n_leaf) then
4:     iter++
5:     if ((iter % PHASE_UNIT) == 0) then
6:       for each region from n_leaf to n_root do
7:         Workload prediction for each energy component (Section VI)
8:       end for
9:       Program phase detection (Section VIII)
10:    end if
11:  end if
12: else if (start of n_i) then
13:   Finding workload of n_i based on coordination (Section VII)
14:   Voltage/frequency scaling with feasibility check
15: end if
Given the three inputs in Fig. 4, we find the energy-optimal workload prediction, w_i^opt, of each program region during program execution. Algorithm 1 shows the overall flow of the proposed method. The proposed method is largely divided into the workload prediction step (lines 1–11) and the voltage/frequency (v/f) setting step (lines 12–15), which are invoked at the end and the start of every program region, respectively.

In the workload prediction step, we profile the runtime information, i.e., x_i^comp and t_i^stall, and update the statistical parameters of the runtime distributions, e.g., the mean, standard deviation, and skewness of x_i^comp and t_i^stall (lines 1–2). After the completion of the leaf program region, the number of program runs, iter, is increased (line 4). At every PHASE_UNIT program runs (line 5), where PHASE_UNIT is a predefined number of program runs (e.g., 20-frame decoding in MPEG4), we perform the workload prediction and program phase detection by utilizing the profiled runtime information and its statistical parameters (lines 5–10). The periodic workload prediction is performed in the reverse order of program flow as presented in [10], [11], and [19], i.e., from the end (n_leaf) to the beginning (n_root) of the program (lines 6–8). As will be explained in Sections V and VI, in this step we find local-optimal workload predictions of n_i, each of which minimizes a single energy component instead of the total energy.² By utilizing the local-optimal workload predictions, program phase detection is performed to identify which program phase the current instant belongs to (line 9).

In the v/f setting step (lines 12–15), which is performed at the start of each program region, a process called coordination determines the energy-optimal global workload prediction, w_i^opt, from the combination of the local-optimal workload predictions of the detected program phase (line 13). Based on w_i^opt, we set voltage/frequency while satisfying the hard real-time constraint (line 14).
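Algorithm 1 can be sketched as an event-driven loop; the prediction, detection, and coordination routines are left as stubs standing in for the procedures of Sections VI–VIII, and PHASE_UNIT is set to the 20-run example from the text.

```python
PHASE_UNIT = 20   # program runs between phase-detection invocations

class PhaseAwareDVFS:
    """Skeleton of Algorithm 1: callbacks at region boundaries."""

    def __init__(self, regions):
        self.regions = regions     # ordered list; regions[0] is the root
        self.iter = 0              # completed program runs

    def on_region_end(self, i, profile):
        self.update_statistics(i, profile)           # Section III-B
        if i == len(self.regions) - 1:               # leaf region finished
            self.iter += 1
            if self.iter % PHASE_UNIT == 0:
                for j in reversed(range(len(self.regions))):  # leaf -> root
                    self.predict_local_optima(j)     # Section VI
                self.detect_phase()                  # Section VIII

    def on_region_start(self, i, t_remaining):
        w_opt = self.coordinate(i, t_remaining)      # Section VII
        self.scale_voltage_frequency(w_opt, t_remaining)  # feasibility check

    # stubs for the routines described in the text
    def update_statistics(self, i, profile): pass
    def predict_local_optima(self, j): pass
    def detect_phase(self): pass
    def coordinate(self, i, t_remaining): return 0
    def scale_voltage_frequency(self, w, t): pass
```

The split mirrors the text: statistics and phase work happen at region ends, while the (cheap) coordination and v/f change happen at region starts.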
V. Analytical Formulation of Memory Stall Time-Aware DVFS
Assume that a program is partitioned into two program regions, n_i and n_{i+1}, and that each program region has a distinct computational cycle count and memory stall time. The energy model presented in Section III-A is used. The total energy consumption for running the two program regions, E_i, is calculated as follows:

E_i = E_i^comp + E_i^stall    (10)

where E_i^comp and E_i^stall represent the energy consumption for running the computational workload and the memory stall workload, respectively.

²In this paper, total energy consumption is calculated as the sum of the five independent energy components shown in (11).
E_i^comp and E_i^stall each consist of three independent energy components: frequency-dependent switching energy (Es_i^comp and Es_i^stall), frequency-dependent leakage energy (El_i^comp and El_i^stall), and frequency-independent energy, called base energy (Eb_i^comp and Eb_i^stall, where Eb_i = Eb_i^comp + Eb_i^stall). Thus, E_i is expressed as follows:

E_i = (Es_i^comp + El_i^comp) + (Es_i^stall + El_i^stall) + Eb_i.    (11)
Using (6)–(8), the five energy components in (11) are expressed as follows:

Es_i^comp = a_s f_i^{b_s} x_i^comp + a_s f_{i+1}^{b_s} x_{i+1}^comp    (12)

El_i^comp = a_l f_i^{b_l} x_i^comp + a_l f_{i+1}^{b_l} x_{i+1}^comp    (13)

Es_i^stall = κ (a_s f_i^{b_s} f_i t_i^stall + a_s f_{i+1}^{b_s} f_{i+1} t_{i+1}^stall)    (14)

El_i^stall = a_l f_i^{b_l} f_i t_i^stall + a_l f_{i+1}^{b_l} f_{i+1} t_{i+1}^stall    (15)

Eb_i = c (x_i^comp + x_{i+1}^comp + f_i t_i^stall + f_{i+1} t_{i+1}^stall).    (16)
The frequency of each program region, f_i and f_{i+1}, can be expressed as the ratio of the remaining computational workload prediction (w_i and w_{i+1}) to the predicted remaining time available for running the computational workload, i.e., the total remaining time-to-deadline (t_i^R and t_{i+1}^R) minus the remaining memory stall time prediction (s_i and s_{i+1}), as shown in

f_i = w_i / (t_i^R - s_i)    (17)

f_{i+1} = w_{i+1} / (t_{i+1}^R - s_{i+1}).    (18)
t_{i+1}^R in (18) is expressed as follows:

t_{i+1}^R = t_i^R - x_i^comp / f_i - t_i^stall.    (19)
By replacing f_i and t_{i+1}^R with (17) and (19), f_{i+1} in (18) is rearranged as follows:

f_{i+1} = w_{i+1} / (α_i (t_i^R - s_i))    (20)

where

α_i = 1 - x_i^comp / w_i - Δt_i^stall / (t_i^R - s_i)    (21)

Δt_i^stall = (t_i^stall + s_{i+1}) - s_i.    (22)
When the memory stall times of n_i and n_{i+1}, i.e., t_i^stall and t_{i+1}^stall, are unit functions, the remaining memory stall time prediction is set to the sum of the memory stall times of the remaining program regions, i.e., s_i = t_i^stall + t_{i+1}^stall. In the same manner, s_{i+1} is set to t_{i+1}^stall because n_{i+1} is the leaf, i.e., last, program region in this case. Therefore, Δt_i^stall in (22) becomes zero, and thereby α_i is independent of t_i^R. Since we perform workload prediction from the leaf to the root program region as presented in [10], w_{i+1} is already known as w_{i+1}^opt when calculating w_i. With (17)–(20), (12)–(16) can be expressed as functions of w_i and t_i^R.
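As a sanity check, the rearrangement in (19)–(22) reproduces the direct frequency expression (18); all workload and timing numbers below are arbitrary placeholders.

```python
# Check that f_{i+1} = w_{i+1} / (alpha_i (t_i^R - s_i))  [eq. (20)]
# equals    f_{i+1} = w_{i+1} / (t_{i+1}^R - s_{i+1})     [eq. (18)]
# with t_{i+1}^R from (19) and alpha_i, dt_stall from (21)-(22).

w_i, w_i1 = 3.0e7, 2.0e7    # remaining computational workload predictions
x_comp_i = 2.4e7            # computational cycles run in region n_i
t_stall_i = 2e-3            # memory stall time of n_i (s)
t_stall_i1 = 1e-3           # memory stall time of n_{i+1} (s)
t_R_i = 40e-3               # remaining time-to-deadline at n_i (s)

s_i = t_stall_i + t_stall_i1    # remaining stall-time prediction at n_i
s_i1 = t_stall_i1               # n_{i+1} is the leaf region

f_i = w_i / (t_R_i - s_i)                     # (17)
t_R_i1 = t_R_i - x_comp_i / f_i - t_stall_i   # (19)
f_direct = w_i1 / (t_R_i1 - s_i1)             # (18)

dt_stall = (t_stall_i + s_i1) - s_i           # (22): zero in this case
alpha_i = 1 - x_comp_i / w_i - dt_stall / (t_R_i - s_i)   # (21)
f_rearr = w_i1 / (alpha_i * (t_R_i - s_i))    # (20)

print(f_direct, f_rearr)   # the two expressions agree
```

With dt_stall = 0, alpha_i depends only on x_comp_i / w_i, which is what makes the local-optimal predictions of Section VI computable independently of t_i^R.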
Since E_i is continuous and convex with respect to w_i, the energy-optimal prediction of the computational workload, w_i^opt, can be obtained by finding the point which satisfies the following relation:

∂E_i/∂w_i = ∂Es_i^comp/∂w_i + ∂El_i^comp/∂w_i + ∂Es_i^stall/∂w_i + ∂El_i^stall/∂w_i + ∂Eb_i/∂w_i = 0.    (23)
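Because E_i is convex in w_i, the stationary point of (23) can be located by a derivative-free search such as ternary search; the quadratic stand-in below only illustrates the search and is not the actual energy expression.

```python
def argmin_convex(energy, lo, hi, tol=1e-6):
    """Ternary search for the minimizer of a convex function on [lo, hi],
    i.e., the point where dE/dw = 0 as in eq. (23)."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if energy(m1) < energy(m2):
            hi = m2     # minimizer lies left of m2
        else:
            lo = m1     # minimizer lies right of m1
    return (lo + hi) / 2

# Stand-in convex energy curve with its minimum at w = 2.0
w_opt = argmin_convex(lambda w: (w - 2.0) ** 2 + 1.0, 0.0, 10.0)
print(round(w_opt, 3))
```

Each iteration shrinks the interval by a factor of 2/3, so the cost is logarithmic in the requested tolerance, which is still too slow to repeat for every t_i^R value; this motivates the local-optima-plus-coordination decomposition that follows.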
Since the total energy consumption, E_i, is a function of w_i as well as t_i^R, the w_i^opt satisfying (23) varies with respect to t_i^R. In other words, w_i^opt has to be found for every t_i^R. Because t_i^R has a wide range of values, performing a workload prediction for every value of t_i^R is unrealistic. Therefore, we previously proposed a solution which performs a workload prediction for a set of quantized levels of t_i^R [28]. However, it still requires many workload predictions, since more energy savings are obtained as t_i^R is quantized into a larger number of levels. Thus, the method causes a large runtime overhead if it is applied as an online solution while maintaining its effectiveness (according to our experiment, the runtime overhead is 3.4 times larger than the pure runtime for the H.264 decoder when t_i^R is quantized into 30 levels).
To reduce the runtime overhead of finding an energy-optimal workload prediction, we propose a workload prediction method which finds w_i^opt in two steps: 1) workload prediction which minimizes each energy component, called local-optimal workload prediction (Section VI), and 2) coordination of the local-optimal workload predictions to obtain the global workload prediction w_i^opt (Section VII). A local-optimal workload prediction finds the workload prediction which minimizes one of the five energy components in (11) by adjusting voltage/frequency based on that workload prediction. For example, voltage/frequency scaling based on the local-optimal workload prediction of Es_i^comp minimizes only the energy consumption of Es_i^comp. It can be obtained by finding the point which equates the corresponding single derivative in (23) to zero, i.e., ∂Es_i^comp/∂w_i = 0. Note that a local-optimal workload prediction can be calculated independently of t_i^R, because α_i in (21) is independent of t_i^R (Δt_i^stall = 0).
A coordination of the local-optimal workload predictions finds the workload prediction which minimizes E_i by utilizing the five local-optimal workload predictions. When the derivative of one energy component with respect to w_i dominates the others in (23), the workload prediction which satisfies (23) can be obtained by finding the point where the derivative of the dominant energy component becomes zero. For instance, when ∂Es_i^comp/∂w_i dominates the others, w_i^opt is simply set to ws_i^comp. When there are multiple dominant energy components, we need to coordinate them so as to find the workload prediction with the lower total energy consumption. Finding the workload prediction satisfying (23) directly requires a numerical solution whose complexity is too high to be applied during runtime, as presented in [28]. In this paper, we present an efficient approach to coordinate local-optimal workload
116 IEEE TRANSACTION ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 1, JANUARY 2011
Fig. 5. Three cases. (a) Case 1: unit functions for both x
comp
i
and t
stall
i
. (b)
Case 2: runtime distribution for x
comp
i
and unit function of t
stall
i
. (c) Case 3:
runtime distributions for both x
comp
i
and t
stall
i
.
predictions in a runtime-adaptive manner. We nd w
opt
i
at the
start of n
i
through the coordination of local-optimal workload
predictions.
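The two-step structure above can be sketched as a small online routine. This is a minimal sketch with hypothetical hooks: `coordinate` stands in for the coordination of Section VII, and `set_vf` and `execute_region` for the platform's DVFS and execution interfaces, none of which are specified at this level in the paper.

```python
# Minimal sketch of the two-step online flow. The expensive step (the five
# t_R-independent local-optimal predictions) is assumed precomputed once per
# program phase; only the cheap coordination runs at each region start.
def run_region(t_r, s_i, phase_predictions, coordinate, set_vf, execute_region):
    w_opt = coordinate(phase_predictions)   # step 2: coordination (Section VII)
    f = w_opt / (t_r - s_i)                 # frequency for the predicted workload
    set_vf(f)                               # apply the voltage/frequency pair
    return execute_region()
```

Only the coordination is on the per-region critical path; this separation is what keeps the online overhead low.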
VI. Workload Prediction for Minimizing Energy Consumption of a Single Energy Component

In this section, we assume that a program is partitioned into two consecutive program regions, n_i and n_{i+1}, and present a method which finds a local-optimal workload prediction while exploiting the runtime distributions of both computational workload and memory stall time. As Fig. 5 shows, we explain the local-optimal workload prediction method for three different cases of J_i, the joint PDF of x_i^comp and t_i^stall. Case 1: J_i is given as a unit function while J_{i+1} is a general function, as shown in Fig. 5(a). Case 2: x_i^comp alone has a runtime distribution while t_i^stall is a unit function, as shown in Fig. 5(b). Case 3: both x_i^comp and t_i^stall have runtime distributions, as shown in Fig. 5(c).
A. Case 1: Both x_i^comp and t_i^stall Have Unit Functions

In this subsection, we explain the case where the joint PDF of n_i is given as a unit function, as shown in Fig. 5(a). We define ws_i^comp, wl_i^comp, ws_i^stall, wl_i^stall, and wb_i as the local-optimal workload predictions which minimize Es_i^comp, El_i^comp, Es_i^stall, El_i^stall, and Eb_i, respectively. Given the joint PDFs J_i and J_{i+1}, the average switching energy consumption for running the computational workload, i.e., Es̄_i^comp, is calculated as the sum of Es_i^comp with respect to J_i and J_{i+1} as follows:

  Es̄_i^comp = Σ Es_i^comp · J_i · J_{i+1}
            = (a_s / (t_i^R − s_i)^{b_s}) · [ w_i^{b_s} · x̄_i^comp + (ws_{i+1}^comp)^{b_s} · x̄_{i+1}^comp / (1 − x̄_i^comp/w_i)^{b_s} ]   (24)
where x̄_i^comp and x̄_{i+1}^comp represent the averages of x_i^comp and x_{i+1}^comp, respectively. Note that x_i^comp is fixed at x̄_i^comp since J_i is a unit function in this case. w_{i+1}^opt is replaced by ws_{i+1}^comp since we perform the local-optimal workload prediction for Es_i^comp. Since Es̄_i^comp is continuous and convex in w_i, ws_i^comp can be obtained by finding the point which satisfies

  ∂Es̄_i^comp/∂w_i = (a_s b_s w_i^{b_s−1} / (t_i^R − s_i)^{b_s}) · [ x̄_i^comp − (ws_{i+1}^comp)^{b_s} · x̄_{i+1}^comp · x̄_i^comp / (w_i − x̄_i^comp)^{b_s+1} ] = 0.   (25)

By rearranging (25) with respect to w_i, we can express ws_i^comp in a closed-form expression as follows:

  ws_i^comp = x̄_i^comp + [ (ws_{i+1}^comp)^{b_s} · x̄_{i+1}^comp ]^{1/(b_s+1)} = x̄_i^comp + w̃s_{i+1}^comp.   (26)
Equation (26) shows that ws_i^comp consists of two components: 1) the workload of the ith program region, i.e., x̄_i^comp, and 2) w̃s_{i+1}^comp = ((ws_{i+1}^comp)^{b_s} · x̄_{i+1}^comp)^{1/(b_s+1)}, called the effective remaining workload of n_{i+1} with respect to Es_i^comp, corresponding to the portion of workload remaining after program region n_i. Fig. 5(a) illustrates the calculation of ws_i^comp presented in (26), where J_i and J_{i+1} are replaced by their representative workloads, i.e., x̄_i^comp and w̃s_{i+1}^comp, respectively. In the same way, wl_i^comp, ws_i^stall, wl_i^stall, and wb_i can be expressed as follows:

  wl_i^comp = x̄_i^comp + [ (wl_{i+1}^comp)^{b_l} · x̄_{i+1}^comp ]^{1/(b_l+1)} = x̄_i^comp + w̃l_{i+1}^comp   (27)
  ws_i^stall = x̄_i^comp + [ (x̄_i^comp / t̄_i^stall) · (ws_{i+1}^stall)^{b_s+1} · t̄_{i+1}^stall ]^{1/(b_s+2)} = x̄_i^comp + w̃s_{i+1}^stall   (28)

  wl_i^stall = x̄_i^comp + [ (x̄_i^comp / t̄_i^stall) · (wl_{i+1}^stall)^{b_l+1} · t̄_{i+1}^stall ]^{1/(b_l+2)} = x̄_i^comp + w̃l_{i+1}^stall   (29)

  wb_i = x̄_i^comp + [ (x̄_i^comp / t̄_i^stall) · wb_{i+1} · t̄_{i+1}^stall ]^{1/2} = x̄_i^comp + w̃b_{i+1}   (30)
where w̃l_{i+1}^comp, w̃s_{i+1}^stall, w̃l_{i+1}^stall, and w̃b_{i+1} are the effective remaining workloads of n_{i+1} with respect to El_i^comp, Es_i^stall, El_i^stall, and Eb_i, respectively. Since a local-optimal workload prediction can be calculated by simply summing the effective remaining workloads of the program regions, as shown in (26)–(30), it can be obtained during program runs with negligible runtime overhead.³ If the software program consists of a cascade of program regions with conditional branches, we can still calculate the effective remaining workload of a program region in a manner similar to [10].

³The runtime overhead of the local-optimal workload prediction is presented in Table VI.
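Under stated assumptions (a linear cascade of regions, with the leaf region predicting only its own mean workload), the closed-form recursion (26) can be evaluated in a single backward pass. A sketch:

```python
# Sketch of the Case-1 closed-form recursion (26): each region's local-optimal
# prediction is its own mean workload plus the "effective remaining workload"
# of the next region, computed backward from the leaf.
def effective_remaining_ws_comp(ws_next, x_next_mean, b_s):
    # w~s_{i+1}^comp = ((ws_{i+1}^comp)^{b_s} * x_mean_{i+1})^(1/(b_s+1)), cf. (26)
    return (ws_next ** b_s * x_next_mean) ** (1.0 / (b_s + 1))

def ws_comp_chain(x_means, b_s):
    """Backward pass over a cascade of regions: returns ws_i^comp per region."""
    ws = [0.0] * len(x_means)
    ws[-1] = x_means[-1]  # leaf region: no remaining workload (assumption)
    for i in range(len(x_means) - 2, -1, -1):
        ws[i] = x_means[i] + effective_remaining_ws_comp(ws[i + 1], x_means[i + 1], b_s)
    return ws
```

With b_s = 1 and two identical regions of mean 4, the chain gives [8.0, 4.0], i.e., the root's prediction is its own mean plus the geometric-mean-style effective workload of its successor.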
B. Case 2: x_i^comp Has a Runtime Distribution and t_i^stall Has a Unit Function

In this subsection, we explain the case where x_i^comp has a runtime distribution while t_i^stall is still assumed to be a unit function, as shown in Fig. 5(b). In this case, the average energy consumption of the computational workload is expressed as follows:

  Es̄_i^comp = Σ Es_i^comp · J_i · J_{i+1}
            = (a_s / (t_i^R − s_i)^{b_s}) · [ w_i^{b_s} · x̄_i^comp + (ws_{i+1}^comp)^{b_s} · x̄_{i+1}^comp · Σ_{j=1}^{N_c} p_i^comp(j) / (1 − x_i^comp(j)/w_i)^{b_s} ]   (31)
where N_c is the number of quantized levels of x_i^comp in its PDF, and p_i^comp(j) represents the probability of x_i^comp falling into the jth quantized level. Note that, in this case where t_i^stall is given as a unit function, the joint PDF (J_i) of x_i^comp and t_i^stall is the same as the PDF of x_i^comp at the given t_i^stall, i.e., p_i^comp. ws_i^comp can be obtained by finding the w_i which satisfies the following relation:

  ∂Es̄_i^comp/∂w_i = (a_s b_s w_i^{b_s−1} / (t_i^R − s_i)^{b_s}) · [ x̄_i^comp − (ws_{i+1}^comp)^{b_s} · x̄_{i+1}^comp · Σ_{j=1}^{N_c} x_i^comp(j) · p_i^comp(j) / (w_i − x_i^comp(j))^{b_s+1} ] = 0.   (32)
Note that, in this case, ws_i^comp can still be obtained independently of t_i^R without any quality loss. However, contrary to (26), no explicit closed form exists for ws_i^comp. Thus, ws_i^comp can be obtained only through a numerical solution approach, as presented in [11], which is too time-consuming for runtime application. Instead, inspired by (26), we model the solution ws_i^comp as follows:

  ws_i^comp = x̃s_i^comp + w̃s_{i+1}^comp   (33)

where x̃s_i^comp is the effective workload of program region n_i for Es_i^comp, and w̃s_{i+1}^comp is obtained in the same way as presented in (26). From our observation that the energy-optimal workload prediction tends to have a value near the average and depends on the runtime distribution, we model x̃s_i^comp as follows:

  x̃s_i^comp = (1 + αs_i^comp) · x̄_i^comp   (34)

where αs_i^comp is a parameter which represents the ratio of the distance between x̃s_i^comp and x̄_i^comp to x̄_i^comp. We calculate x̃s_i^comp by exploiting a pre-characterization of solutions. First, we prepare a lookup table LUT_{αs}^comp for αs_i^comp during design time and perform a table lookup to obtain αs_i^comp during runtime. αs_i^comp depends on the shape of the runtime distribution. Thus, we derived the indexes of LUT_{αs}^comp as follows:

1) Index 1: σ_i^comp/x̄_i^comp, the standard deviation (σ_i^comp) normalized with respect to the mean of n_i (x̄_i^comp);
2) Index 2: g_i^comp, the skewness of x_i^comp;
3) Index 3: x̄_i^comp / w̃s_{i+1}^comp, the ratio of the mean of n_i (x̄_i^comp) to the effective remaining workload of n_{i+1} (w̃s_{i+1}^comp).
The rationale for choosing the three indexes is as follows. By substituting ws_i^comp with (33) and (34), (32) is rearranged as follows:

  x̄_i^comp − (w̃s_{i+1}^comp)^{b_s+1} · Σ_{j=1}^{N_c} x_i^comp(j) · p_i^comp(j) / ((1 + αs_i^comp) · x̄_i^comp + w̃s_{i+1}^comp − x_i^comp(j))^{b_s+1} = 0.   (35)

Fig. 6. αs^comp as a function of (a) Index 1: σ_i^comp/x̄_i^comp and Index 3: x̄_i^comp/w̃s_{i+1}^comp, and (b) Index 2: skewness (g_i^comp), at 75 °C.
Note that the optimal αs_i^comp can be obtained by finding a point which satisfies (35). As shown in (35), αs_i^comp depends on x̄_i^comp, w̃s_{i+1}^comp (Index 3), and the PDF of x_i^comp, i.e., x_i^comp(j) and p_i^comp(j), which is modeled as a skewed normal distribution in this paper, since the PDF usually does not follow a clean normal distribution.⁴ The skewed normal distribution is characterized by three parameters: x̄_i^comp, σ_i^comp, and g_i^comp (Index 1 and Index 2).
Fig. 6 shows αs_i^comp as a function of the three indexes above. As shown in Fig. 6(a), αs_i^comp increases as the distribution of x_i^comp widens, i.e., as σ_i^comp/x̄_i^comp increases, and as the workload of n_i relative to the effective remaining workload of n_{i+1}, i.e., x̄_i^comp/w̃s_{i+1}^comp, increases. It also increases as the skewness of the PDF, g_i^comp, moves to the right (g_i^comp > 0), as Fig. 6(b) shows.

In the same way, wl_i^comp, ws_i^stall, wl_i^stall, and wb_i can also be calculated by finding αl_i^comp, αs_i^stall, αl_i^stall, and αb_i from LUT_{αl}^comp, LUT_{αs}^stall, LUT_{αl}^stall, and LUT_{αb}, respectively. Note that the parameters αs_i^comp through αb_i can be obtained by performing a table lookup with the statistical parameters (e.g., mean, standard deviation, and skewness) and the effective workload of n_{i+1}. Thus, a local-optimal workload prediction which exploits the runtime distribution of the computational workload can be found with negligible runtime overhead.
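A sketch of how the runtime table lookup might look. The grid layout and helper names are assumptions; the paper specifies only the three indexes and that α is pre-characterized at design time.

```python
import bisect

# Sketch of the runtime LUT lookup of Section VI-B (names hypothetical).
# Design time: alpha is pre-characterized over a 3-D grid of the indexes
# (sigma/mean, skewness g, mean / effective-remaining-workload ratio).
# Runtime: each index is snapped to the nearest grid step and the table read.
def lut_index(grid, value):
    """Clamp value onto the closest point of a sorted 1-D grid."""
    pos = bisect.bisect_left(grid, value)
    if pos == 0:
        return 0
    if pos == len(grid):
        return len(grid) - 1
    return pos if grid[pos] - value < value - grid[pos - 1] else pos - 1

def lookup_alpha(lut, grids, sigma_over_mean, skewness, mean_over_wtilde):
    i1 = lut_index(grids[0], sigma_over_mean)   # Index 1
    i2 = lut_index(grids[1], skewness)          # Index 2
    i3 = lut_index(grids[2], mean_over_wtilde)  # Index 3
    return lut[i1][i2][i3]

def effective_workload(x_mean, alpha):
    # x~s_i^comp = (1 + alpha) * x_mean, cf. (34)
    return (1.0 + alpha) * x_mean
```

The lookup itself is O(log N) per index, which is consistent with the paper's claim of negligible runtime overhead compared with a numerical solve of (35).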
C. Case 3: Both x_i^comp and t_i^stall Have Runtime Distributions

When both x_i^comp and t_i^stall have runtime distributions, as shown in Fig. 5(c), the average switching energy consumption for running the computational workload, i.e., Es̄_i^comp, can be calculated as the sum of Es_i^comp with respect to the joint PDFs J_i and J_{i+1} as follows:

  Es̄_i^comp = Σ Es_i^comp · J_i · J_{i+1} = (a_s / (t_i^R − s_i)^{b_s}) · [ w_i^{b_s} · x̄_i^comp + Zs_i^comp ]   (36)

⁴Note that a more accurate workload prediction can be performed with additional effort, as presented in [19], where the PDF of x_i^comp is modeled as a multimodal distribution with each mode given as a skewed normal distribution. Although more energy savings can be obtained from the multimodal modeling, in this paper, we simply approximate the PDF as a single-mode skewed normal distribution in order to reduce the runtime overhead. However, the method can easily be extended to the multimodal case [19].
where

  Zs_i^comp = (ws_{i+1}^comp)^{b_s} · x̄_{i+1}^comp · Σ_{j=1}^{N_c} Σ_{k=1}^{N_s} J_i(j, k) / (θ_i(j, k))^{b_s}   (37)

where θ_i(j, k) and J_i(j, k) denote θ_i in (21) and the probability that (x_i^comp, t_i^stall) falls into the (j, k)th quantized level, respectively. Since we set the predicted remaining memory stall time (s_i) to the sum of the averages of t_i^stall and t_{i+1}^stall, t̄_i^stall [defined in (22)] in θ_i is no longer zero. Due to the nonzero t̄_i^stall, the local-optimal workload prediction becomes a function of t_i^R. To reduce the solution complexity, we approximate the calculation of Zs_i^comp in (37) as follows:
  Zs_i^comp ≈ βs_i^comp · (ws_{i+1}^comp)^{b_s} · x̄_{i+1}^comp · Σ_{j=1}^{N_c} p_i^comp(j) / (1 − x_i^comp(j)/w_i)^{b_s}   (38)

where

  βs_i^comp = Σ_{j=1}^{N_c} Σ_{k=1}^{N_s} [ (1 − x̄_i^comp/w_i) / θ_i(j, k) ]^{b_s} · J_i(j, k).   (39)
As shown in (39), βs_i^comp depends on w_i and s_i, because θ_i in (21) is a function of w_i and s_i. Note that (w_i, s_i) will be calculated at the end of the current program phase using the joint PDFs (J_i and J_{i+1}) profiled during the time period of the current program phase. To simplify the interdependence between βs_i^comp and (w_i, s_i), we approximate the calculation of βs_i^comp by replacing (w_i, s_i) with (ws_i^comp, s_i) of the current program phase. By substituting (38) with the approximated βs_i^comp, we can rearrange (36) as follows:

  Es̄_i^comp ≈ (a_s / (t_i^R − s_i)^{b_s}) · [ w_i^{b_s} · x̄_i^comp + βs_i^comp · (ws_{i+1}^comp)^{b_s} · x̄_{i+1}^comp · Σ_{j=1}^{N_c} p_i^comp(j) / (1 − x_i^comp(j)/w_i)^{b_s} ].   (40)
Note that (40) is the same as (31), except for βs_i^comp. Therefore, in a similar way to (32) and (33), we can express ws_i^comp, which minimizes Es̄_i^comp, as follows:

  ws_i^comp = x̃s_i^comp + w̃s_{i+1}^comp   (41)

where

  w̃s_{i+1}^comp = [ βs_i^comp · (ws_{i+1}^comp)^{b_s} · x̄_{i+1}^comp ]^{1/(b_s+1)}.   (42)

Compared to the calculation of ws_i^comp in Case 1 and Case 2 in Fig. 5(a) and (b), the only difference is that (βs_i^comp)^{1/(b_s+1)} is multiplied into the calculation of the effective remaining workload of n_{i+1}, i.e., w̃s_{i+1}^comp.⁵ wl_i^comp, ws_i^stall, wl_i^stall, and wb_i can also be calculated in the same way.

⁵Note that when the memory stall has no distribution, i.e., t̄_i^stall = 0, βs_i^comp becomes 1, and thus w̃s_{i+1}^comp becomes the same as in Case 1 and Case 2.
Fig. 7. Hierarchical coordination to obtain the global workload prediction, w_i^opt, where C1–C4 represent coordination steps.
TABLE II
Threshold Parameters Used in Coordination

Coordination Step | Threshold Parameter | Condition
C1 | fs^comp  | ∂(a_s f^{b_s})/∂f ≥ ε_c · ∂(a_l f^{b_l})/∂f
   | fl^comp  | ∂(a_l f^{b_l})/∂f ≥ ε_c · ∂(a_s f^{b_s})/∂f
C2 | fb^stall | ∂(c·f)/∂f ≥ ε_c · ∂(a_l f^{b_l+1})/∂f
   | fl^stall | ∂(a_l f^{b_l+1})/∂f ≥ ε_c · ∂(c·f)/∂f
C3 | fs^stall | ∂(a_s f^{b_s+1})/∂f ≥ ε_c · ∂(a_l f^{b_l+1} + c·f)/∂f
   | fL^stall | ∂(a_l f^{b_l+1} + c·f)/∂f ≥ ε_c · ∂(a_s f^{b_s+1})/∂f

ε_c: user-defined threshold value.
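Since the conditions in Table II compare sensitivities of power-law fits, each threshold frequency has a closed form when the two sides are equated. A sketch for C1, under the assumption b_l > b_s (leakage sensitivity grows faster with frequency, per Section VII-1); the coefficient names follow Table II, while the solver itself is not specified in the paper:

```python
# Sketch: closed-form C1 threshold frequencies, assuming power-law fits
# e_s = a_s * f**b_s and e_l = a_l * f**b_l with b_l > b_s, and a
# user-defined threshold eps_c as in Table II.
def f_threshold(a_num, b_num, a_den, b_den, eps_c):
    """Frequency where d(a_num f^b_num)/df = eps_c * d(a_den f^b_den)/df."""
    # a_num*b_num*f^(b_num-1) = eps_c*a_den*b_den*f^(b_den-1)
    return (a_num * b_num / (eps_c * a_den * b_den)) ** (1.0 / (b_den - b_num))

def c1_thresholds(a_s, b_s, a_l, b_l, eps_c):
    fs_comp = f_threshold(a_s, b_s, a_l, b_l, eps_c)  # below: switching-dominant
    fl_comp = f_threshold(a_l, b_l, a_s, b_s, eps_c)  # above: leakage-dominant
    return fs_comp, fl_comp
```

With ε_c > 1 the two thresholds bracket a non-empty intermediate region (fs^comp < fl^comp); with ε_c = 1 they coincide at the crossover frequency.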
Note that we perform the most time-consuming part of the workload prediction, i.e., finding the parameters αs_i^comp through αb_i with respect to the runtime distribution, in a design-time step, and store the resulting parameters in LUTs. Thus, we can drastically reduce the runtime overhead of finding a workload prediction while accurately accounting for the influence of the runtime distribution, because we only access the LUTs during runtime. However, this requires additional memory space to store the pre-characterized data. The runtime and area overheads are presented in Section IX-C.
VII. Frequency Selection Based on Coordination

In this section, we present a method called coordination to find the global workload prediction of n_i (w_i^opt) based on the local-optimal workload predictions, i.e., ws_i^comp, wl_i^comp, ws_i^stall, wl_i^stall, and wb_i. As (23) shows, the workload prediction which minimizes the average total energy consumption at a given t_i^R varies according to the sensitivity of each energy component with respect to w_i, i.e., ∂Es_i^comp/∂w_i, ∂El_i^comp/∂w_i, ∂Es_i^stall/∂w_i, ∂El_i^stall/∂w_i, and ∂Eb_i/∂w_i in (23).
Since the coordination of workload predictions is performed online, it needs to be done with low overhead. To achieve this goal, we present a simple hierarchical method which finds w_i^opt from the local-optimal workload predictions (independent of t_i^R), as shown in Fig. 7. First, we obtain the workload prediction for each workload type: for the computational workload (w_i^comp) through a coordination step called C1, and for the memory stall workload (w_i^stall) through coordination steps called C2 and C3. Then, we find w_i^opt from w_i^comp and w_i^stall through a coordination step called C4.
Fig. 8. Linear coordination of (a) C1: ws_i^comp and wl_i^comp to find w_i^comp, and (b) C4: w_i^comp and w_i^stall to find w_i^opt.
1) Coordination for w_i^comp (C1): The workload prediction for the computational workload, w_i^comp, represents the prediction which minimizes E_i^comp, i.e., the sum of Es_i^comp and El_i^comp. Therefore, w_i^comp depends on ws_i^comp and wl_i^comp. In this coordination, we utilize the fact that El_i^comp has an exponential dependency on frequency in combined V_dd/V_bb scaling. The rationale is as follows. In the low-frequency region, a high reverse body bias voltage can be applied, suppressing leakage energy consumption thanks to the high V_th. As frequency increases, |V_bb| is decreased to enable higher clock frequency operation by reducing V_th, which drastically increases leakage energy consumption.
In combined V_dd/V_bb scaling, the increase of switching energy consumption with respect to frequency, i.e., ∂e_s/∂f, dominates that of leakage energy consumption in the lower frequency region, while the increase of leakage energy consumption, i.e., ∂e_l/∂f, dominates in the relatively high frequency region [27]. Therefore, when most operating frequencies fall into the range where the sensitivity of switching energy consumption is much larger than that of leakage energy consumption, i.e., ∂e_s/∂f ≫ ∂e_l/∂f, w_i^comp approaches ws_i^comp, because switching energy consumption is the major contributor in this frequency region. On the other hand, when the operating frequency is within the region where ∂e_s/∂f ≪ ∂e_l/∂f, w_i^comp approaches wl_i^comp.
We partition the frequency range into three regions: switching energy-dominant, leakage energy-dominant, and intermediate. The partition is done with two threshold frequencies, fs^comp and fl^comp. The frequency range below fs^comp (above fl^comp) is called the switching (leakage) energy-dominant region, while the range between the two threshold frequencies is called the intermediate region. Each energy component has two threshold frequencies, as shown in Table II. In order to identify which frequency partition the current program region belongs to, we introduce a simple evaluation metric, f_i^eval, as the upper bound of the operating frequency in the remaining program regions from n_i to n_leaf:

  f_i^eval = WCEC_i^comp(k) / (t_i^R − WCET_i^stall(k)).   (43)
In (43), WCEC_i^comp(k) and WCET_i^stall(k) represent the remaining worst-case execution cycles of the computational workload and the remaining worst-case memory stall time from n_i to n_leaf when the current program phase is the kth program phase, respectively. The solid line in Fig. 8(a) illustrates a linear coordination method to find w_i^comp by utilizing f_i^eval. When f_i^eval is lower than the threshold value fs^comp (in the second row of Table II, where ε_c is set to 5.0 in our experiment), we set w_i^comp to ws_i^comp, because the remaining program regions will operate within the switching energy-dominant frequency region. When f_i^eval is higher than the threshold value fl^comp (in the third row of Table II), we set w_i^comp to wl_i^comp. In the last case, i.e., fs^comp < f_i^eval < fl^comp, we set w_i^comp in proportion to the ratio of (f_i^eval − fs^comp) to (fl^comp − fs^comp) using a linear interpolation function L() defined as follows:

  L(X_lower, X_upper, Y_lower, Y_upper, X_eval) = ((X_eval − X_lower) / (X_upper − X_lower)) · (Y_upper − Y_lower) + Y_lower.   (44)

By applying X_lower = fs^comp, X_upper = fl^comp, Y_lower = ws_i^comp, Y_upper = wl_i^comp, and X_eval = f_i^eval, we obtain w_i^comp as the output of the function L().
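The interpolation function L() of (44) and the C1 region test can be sketched directly (the clamping to the two dominant regions follows the description above):

```python
# Sketch of (44) and coordination step C1.
def lerp(x_lower, x_upper, y_lower, y_upper, x_eval):
    # L() of (44)
    return (x_eval - x_lower) / (x_upper - x_lower) * (y_upper - y_lower) + y_lower

def coordinate_c1(ws_comp, wl_comp, fs_comp, fl_comp, f_eval):
    if f_eval <= fs_comp:
        return ws_comp    # switching energy-dominant region
    if f_eval >= fl_comp:
        return wl_comp    # leakage energy-dominant region
    return lerp(fs_comp, fl_comp, ws_comp, wl_comp, f_eval)  # intermediate
```

The same routine serves C2 and C3 by substituting the corresponding prediction pairs and threshold frequencies from Table II.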
2) Coordination for w_i^stall (C2 and C3): The workload prediction for memory stall, w_i^stall, represents the prediction which minimizes E_i^stall. Since E_i^stall depends on Eb_i as well as Es_i^stall and El_i^stall, w_i^stall is derived from wb_i as well as ws_i^stall and wl_i^stall. To obtain w_i^stall by coordinating the three local-optimal workload predictions, we perform the coordination in two steps, as shown in Fig. 7. First, we find wL_i^stall by coordinating wl_i^stall and wb_i (C2), both of which are related to leakage energy consumption. Then, we find w_i^stall by coordinating ws_i^stall and wL_i^stall (C3). Note that the coordination for wL_i^stall can be done in the same way as for w_i^comp, shown in Fig. 8(a), by simply substituting (ws_i^comp, wl_i^comp) with (wl_i^stall, wb_i) and (fs^comp, fl^comp) with (fl^stall, fb^stall), where fl^stall and fb^stall are the threshold values defined in the fourth and fifth rows of Table II, respectively. In the same way, the coordination for w_i^stall can also be done by substituting the corresponding workload predictions and threshold values, i.e., fs^stall and fL^stall in Table II.
3) Coordination for w_i^opt (C4): The last step of the coordination obtains w_i^opt from w_i^comp and w_i^stall. In CPU-bound applications, w_i^opt approaches w_i^comp since E_i^comp dominates E_i^stall. On the contrary, in memory-bound applications, w_i^stall contributes more to w_i^opt. We calculate the maximum memory-boundedness of the remaining program regions from n_i to n_leaf, denoted by μ_i, as the ratio of the worst-case remaining memory stall cycles from n_i at f_i^eval (43) to the worst-case remaining computational cycles, i.e., μ_i = f_i^eval · WCET_i^stall(k) / WCEC_i^comp(k). Fig. 8(b) illustrates the linear coordination method to find w_i^opt by utilizing μ_i. As μ_i becomes larger (smaller), the remaining work is characterized as more memory-bound (CPU-bound). When μ_i is smaller than a threshold value called μ^comp (0.5 in our experiment), we regard the remaining workload as CPU-bound and set w_i^opt to w_i^comp. On the other hand, if μ_i is larger than a threshold value called μ^stall (= 1/μ^comp in our experiment), we set w_i^opt to w_i^stall since the remaining work is memory-bound. In the intermediate case, i.e., μ^comp < μ_i < μ^stall, we set w_i^opt in proportion to the ratio of (μ_i − μ^comp) to (μ^stall − μ^comp) using (44).
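Coordination step C4 can be sketched with the same interpolation idea as (44); the default thresholds below follow the experimental settings (μ^comp = 0.5, μ^stall = 1/μ^comp = 2):

```python
# Sketch of coordination step C4: blend w_comp and w_stall by the
# memory-boundedness metric mu_i of the remaining program regions.
def memory_boundedness(f_eval, wcet_stall, wcec_comp):
    # mu_i = f_eval * WCET_stall / WCEC_comp
    return f_eval * wcet_stall / wcec_comp

def coordinate_c4(w_comp, w_stall, mu, mu_comp=0.5, mu_stall=2.0):
    if mu <= mu_comp:
        return w_comp     # remaining work is CPU-bound
    if mu >= mu_stall:
        return w_stall    # remaining work is memory-bound
    frac = (mu - mu_comp) / (mu_stall - mu_comp)
    return frac * (w_stall - w_comp) + w_comp   # linear interpolation, cf. (44)
```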
After w_i^opt is obtained, the voltage/frequency is set to f_i = w_i^opt/(t_i^R − s_i), where t_i^R is measured at the start of each program region. When setting the voltage/frequency, we check whether the performance level satisfies the given deadline constraint even if the worst-case execution time occurs after the frequency is set, which is called a feasibility check. More details are explained in [10] and [27].
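A sketch of the final frequency selection with a feasibility check. The assumptions here are that frequencies come from a discrete set (as in Section IX) and that the feasibility floor is the worst-case remaining demand of (43); the papers [10], [27] referenced above define the exact check.

```python
# Sketch of frequency selection with a feasibility check (assumptions noted
# in the lead-in; not the exact procedure of [10], [27]).
def pick_frequency(w_opt, t_r, s_i, wcec_comp, wcet_stall, levels):
    target = w_opt / (t_r - s_i)            # f_i = w_i^opt / (t_i^R - s_i)
    floor = wcec_comp / (t_r - wcet_stall)  # worst case must still meet the deadline
    need = max(target, floor)
    ok = [f for f in sorted(levels) if f >= need]
    return ok[0] if ok else max(levels)     # fall back to the highest level
```

The floor term is what keeps an optimistic prediction from violating the real-time constraint if the worst case materializes.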
VIII. Program Phase Detection

A program phase, especially in terms of computational cycles and memory stall time, is characterized over a PHASE_UNIT (as defined in Algorithm 1) by a salient difference in computational cycles and memory stall time. Conventionally, a program phase is characterized by utilizing only the average execution cycles of basic blocks, without exploiting the runtime distributions of computational cycles and memory stall time [14], [15]. To exploit the runtime distributions in characterizing a program phase, we define a new program phase vector consisting of the five local-optimal workload predictions for each program region. Note that the local-optimal workload predictions reflect the correlation as well as the runtime distributions of both computational cycles and memory stall time. Thus, a set of local-optimal workload predictions becomes a good indicator which represents the joint PDF of each program region. The program phase vector of the kth program phase is defined as follows:

  W^(k) = [Ws^comp(k), Wl^comp(k), Ws^stall(k), Wl^stall(k), Wb^(k)]^T   (45)
where

  Ws^comp(k) = [ws_root^comp(k), ..., ws_i^comp(k), ..., ws_leaf^comp(k)]   (46)
  Wl^comp(k) = [wl_root^comp(k), ..., wl_i^comp(k), ..., wl_leaf^comp(k)]   (47)
  Ws^stall(k) = [ws_root^stall(k), ..., ws_i^stall(k), ..., ws_leaf^stall(k)]   (48)
  Wl^stall(k) = [wl_root^stall(k), ..., wl_i^stall(k), ..., wl_leaf^stall(k)]   (49)
  Wb^(k) = [wb_root^(k), ..., wb_i^(k), ..., wb_leaf^(k)].   (50)
Periodically, i.e., every PHASE_UNIT (set to the time for decoding 20 frames in our experiments), we check whether the program phase has changed. This is evaluated by calculating the Hamming distance between the program phase vector of the current period and that of the current program phase. When the Hamming distance is greater than a threshold called δ_p (set to 10% of the magnitude of the current program phase vector in our experiments), we determine that the program phase has changed, and then check whether there is any previously stored program phase whose Hamming distance from the program phase vector of the current period is within the threshold δ_p. If so, we reuse the local-optimal workload predictions of the matched previous phase as those of the new phase to set the voltage/frequency. If there is no previous phase satisfying the condition, we store the newly detected program phase and use its local-optimal workload predictions to set the voltage/frequency until the next program phase detection.
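The detection loop can be sketched as follows. This is a simplified sketch: the phase vector is assumed flattened from the five prediction vectors of (45), the element-wise absolute distance stands in for the paper's Hamming-style distance, and matching is done against all stored phases rather than the current phase first.

```python
# Sketch of the phase check of Section VIII (simplifications noted above).
def vec_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))  # element-wise distance

def magnitude(a):
    return sum(abs(x) for x in a)

def detect_phase(current_vec, stored_phases, rel_threshold=0.10):
    """Return (phase index, stored_phases); stores a new phase when no
    previous phase is within the relative threshold (10% in the paper)."""
    for idx, phase_vec in enumerate(stored_phases):
        if vec_distance(current_vec, phase_vec) <= rel_threshold * magnitude(current_vec):
            return idx, stored_phases
    stored_phases.append(list(current_vec))
    return len(stored_phases) - 1, stored_phases
```

Reusing a matched phase's local-optimal predictions is what lets the method avoid re-running the prediction step when a previously seen phase recurs.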
IX. Experimental Results

A. Setup

In our experiments, we used two real-life multimedia programs, the MPEG4 and H.264 decoders in FFMPEG [29]. We applied two picture sets for the decoding. First, we used, in total, 4200 frames of 1920×1080 video consisting of eight test pictures: Rush Hour (500 frames), Station2 (300 frames), Sunflower (500 frames), Tractor (690 frames), SnowMnt (570 frames), InToTree (500 frames), ControlledBurn (570 frames), and TouchdownPass (500 frames) in [30]. Second, we used 3000 frames of a 1920×800 movie clip (excerpted from Dark Knight). We inserted nine voltage/frequency setting points in each program: seven for macroblock decoding and two for file write operations for the decoded image. We performed profiling with PAPI [31] running on an LG XNOTE with Linux 2.6.3.

We performed experiments at 25 °C, 50 °C, 75 °C, and 100 °C. We calculated the energy consumption using the processor energy model with combined V_dd/V_bb scaling shown in Section III-A. The parameters in (1)–(4) of the processor energy model were obtained from PTscalar [21] and Cacti 5.3 with the BPTM high-k/metal gate 32 nm HP model. We used seven discrete frequency levels from 333 MHz to 2.333 GHz with a 333 MHz step size. We set 20 μs as the time overhead for switching voltage/frequency levels and calculated the energy overhead using the model presented in [7].

We compared the following four methods.

1) RT-CM-AVG [4]: runtime DVFS method based on the average ratio of memory stall time to computational cycles (baseline).
2) RT-C-DIST [19]: runtime DVFS method which exploits only the PDF of computational cycles.
3) DT-CM-DIST [28]: design-time DVFS method which exploits the joint PDF of computational cycles and memory stall time.
4) RT-CM-DIST: runtime version of DT-CM-DIST (proposed).

We modified the original RT-CM-AVG [4], which runs intertask DVFS without a real-time constraint, such that it supports intratask DVFS with a real-time constraint. In running DT-CM-DIST [28], we performed a workload prediction with respect to 20 quantized levels of remaining time, i.e., bins, using the joint PDF of the first 100 frames at design time.
B. Energy Savings

Table III(a) and (b) shows the comparison of energy consumption for the MPEG4 and H.264 decoders, respectively, at 75 °C. The first column shows the names of the test pictures. Columns 2, 3, and 4 represent the energy consumption of each DVFS method normalized with respect to that of RT-CM-AVG.

Compared with RT-CM-AVG [4], our method, RT-CM-DIST, offers 5.1–34.6% and 4.5–17.3% energy savings for the MPEG4 and H.264 decoders, respectively. Fig. 9 shows the statistics of the frequency levels used when running SnowMnt on the MPEG4 decoder. As Fig. 9 shows, RT-CM-AVG uses the lowest frequency level, i.e., 333 MHz, more frequently than the other two methods. This in turn forces frequent use of high frequency levels, i.e., levels above 2.00 GHz where energy consumption drastically increases with frequency, in order to meet the real-time constraint. However, by considering the runtime distribution, RT-CM-DIST uses such energy-costly high frequency levels less often, because the distribution-aware workload prediction is more conservative than the average-based method.

TABLE III
Comparison of Energy Consumption for Test Pictures at 75 °C: (a) MPEG4 (20 Frames/s) and (b) H.264 Decoder (12 Frames/s)

(a)
Image          | RT-C-DIST [19] | DT-CM-DIST [28] | RT-CM-DIST (Proposed)
Rush Hour      | 1.08 | 0.83 | 0.79
Station2       | 1.34 | 0.97 | 0.95
Sunflower      | 0.99 | 0.76 | 0.74
Tractor        | 1.01 | 0.78 | 0.75
SnowMnt        | 1.02 | 0.90 | 0.81
InToTree       | 0.97 | 0.79 | 0.71
ControlledBurn | 0.88 | 0.67 | 0.65
TouchdownPass  | 1.15 | 0.91 | 0.86
Average        | 1.05 | 0.83 | 0.78

(b)
Image          | RT-C-DIST [19] | DT-CM-DIST [28] | RT-CM-DIST (Proposed)
Rush Hour      | 1.11 | 0.94 | 0.93
Station2       | 1.05 | 0.90 | 0.83
Sunflower      | 1.09 | 0.97 | 0.88
Tractor        | 1.18 | 1.00 | 0.96
SnowMnt        | 1.03 | 1.03 | 0.84
InToTree       | 1.14 | 1.00 | 0.93
ControlledBurn | 1.07 | 0.94 | 0.88
TouchdownPass  | 1.10 | 0.99 | 0.94
Average        | 1.10 | 0.97 | 0.90
Table IV shows energy savings for one of the test pictures, SnowMnt, at four temperatures: 25 °C, 50 °C, 75 °C, and 100 °C. As the table shows, more energy savings are achieved as temperature increases. This is because the energy penalty caused by frequent use of high frequency levels becomes more pronounced at higher temperatures, since leakage energy consumption increases exponentially with temperature. By considering the temperature dependency of leakage energy consumption, RT-CM-DIST sets the voltage/frequency so as to use high frequency levels less frequently as temperature increases, while RT-CM-AVG does not account for the temperature increase. Note that, in most cases, the MPEG4 decoder yields more energy savings than the H.264 decoder. This is because, as Fig. 10 shows, the distribution of memory-boundedness (defined as the ratio of memory stall time to computational cycles) of MPEG4 is wider than that of H.264 in terms of the Max/Avg and Max/Min ratios.

Compared with RT-C-DIST [19], which exploits only the distribution of computational cycles at runtime, RT-CM-DIST provides 20.8–28.9% and 15.1–21.0% further energy savings for the MPEG4 and H.264 decoders, respectively. The amount of further energy savings represents the effectiveness of considering the distribution of memory stall time as well as the correlation between computational cycles and memory stall time, i.e., their joint PDF. RT-C-DIST regards the whole number of clock cycles, profiled at the end of every program region, as the computational cycles. Thus, RT-C-DIST cannot consider the joint PDF of computational cycles and memory stall time. As a consequence, it sets frequency levels higher than the required levels, as shown in Fig. 9.

Fig. 9. Statistics of used frequency levels in MPEG4 for decoding SnowMnt.

Fig. 10. Distribution of memory boundedness in (a) MPEG4 and (b) H.264 decoder.

TABLE IV
Comparison of Energy Consumption for SnowMnt at Four Temperature Levels

           Temp (°C) | RT-C-DIST [19] | DT-CM-DIST [28] | RT-CM-DIST (Proposed)
MPEG4 dec.   25  | 1.15 | 0.94 | 0.86
             50  | 1.10 | 0.92 | 0.84
             75  | 1.02 | 0.90 | 0.81
             100 | 0.96 | 0.89 | 0.79
H.264 dec.   25  | 1.06 | 1.02 | 0.91
             50  | 1.05 | 1.02 | 0.88
             75  | 1.03 | 1.03 | 0.84
             100 | 1.01 | 1.03 | 0.80
In Table III, compared with DT-CM-DIST, which exploits
runtime distributions of both computational and memory stall
workload in design time, RT-CM-DIST provides 2.110.2%
and 1.218.1% further energy savings for MPEG4 and H.264
122 IEEE TRANSACTION ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 1, JANUARY 2011
TABLE V
Comparison of Energy Savings for DarkKnight at 75 C
RT-C-DIST DT-CM-DIST RT-CM-DIST
[19] [28] (Proposed)
MPEG4 dec. 1.26 1.20 0.89
H.264 dec. 1.16 1.16 0.89
TABLE VI
Summary of Runtime Overhead
Source of Runtime Overhead Amount
Local-optimal workload prediction 40 40052 400 cycles
Coordination 27204780 cycles
Feasibility check 4073560 cycles
decoders, respectively. The largest energy savings are obtained with SnowMnt for both the MPEG4 and H.264 decoders, since SnowMnt has distinctive program phase behavior. Because DT-CM-DIST finds the optimal workload using the first 100 frames (a design-time fixed training input), which is totally different from the remaining frames (a runtime-varying input), it cannot provide proper voltage and frequency settings.
To further investigate the effectiveness of considering complex program phase behavior, we performed another experiment using 3000 frames of a movie clip. Program phase behavior is more clearly observed in this clip, whose scenes change rapidly. Table V shows the normalized energy consumption at 75 °C when decoding the movie clip from Dark Knight with the MPEG4 and H.264 decoders, respectively. RT-CM-DIST outperforms DT-CM-DIST by up to 26.3% and 23.3% for the MPEG4 and H.264 decoders, respectively. This is because the movie clip exhibits complex program phase behavior due to frequent scene changes, as Fig. 1(d) shows.
C. Overhead
1) Runtime Overhead: We measured the runtime overhead
of the proposed online method, i.e., RT-CM-DIST, using
PAPI [31]. The proposed method consists of three parts: local-optimal workload prediction, coordination, and feasibility check. Table VI shows the runtime overhead of the proposed method. The local-optimal workload prediction for a program region consumes 40 400–52 400 clock cycles when PHASE_UNIT is set to 20 frames. Note that the local-optimal workload prediction is performed once every PHASE_UNIT. The coordination and feasibility check, which are performed at the start of every program region, take 2720–4780 and 407–3560 clock cycles, respectively. The total runtime overhead in Table VI amounts to 0.38% and 0.25% of the average execution cycles for the MPEG4 and H.264 decoders, respectively.
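As a back-of-the-envelope sketch of how these per-invocation figures combine, the worst-case overhead per PHASE_UNIT can be tallied from Table VI. The number of program regions per frame below is a hypothetical placeholder (it depends on how the decoder is partitioned), not a measured value:

```python
# Worst-case per-invocation overhead cycles from Table VI.
PREDICTION_CYCLES = 52_400    # local-optimal workload prediction (once per PHASE_UNIT)
COORDINATION_CYCLES = 4_780   # coordination, at the start of every program region
FEASIBILITY_CYCLES = 3_560    # feasibility check, at the start of every program region

FRAMES_PER_PHASE_UNIT = 20    # PHASE_UNIT setting used in the experiments
REGIONS_PER_FRAME = 4         # hypothetical: depends on the application partitioning

regions = FRAMES_PER_PHASE_UNIT * REGIONS_PER_FRAME
worst_case_cycles = PREDICTION_CYCLES + regions * (COORDINATION_CYCLES + FEASIBILITY_CYCLES)
print(worst_case_cycles)  # 719600 cycles per PHASE_UNIT under these assumptions
```

Dividing such a tally by the measured execution cycles of the same 20 frames yields the overhead fractions reported above.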
2) Memory Overhead of LUTs: As explained in Section VI-B, the presented method requires three temperature-independent LUTs, i.e., LUT^s_comp, LUT^s_stall, and LUT_b, and two temperature-dependent LUTs, i.e., LUT^l_comp and LUT^l_stall.
The LUTs incur memory overhead, which largely depends on the number of steps (scales) in the indexes of the LUTs. The more steps are used, the more accurate the workload prediction becomes, at the cost of a higher memory overhead. In our implementation, we built each LUT with the ratio of standard deviation to mean (Index 1) ranging from 0.05 to 0.30 with a 0.05 step size, skewness (Index 2) ranging from −1.00 to 1.00 with a 0.10 step size, and the ratio of the mean to the effective remaining workload of the remaining program regions (Index 3) ranging from 0.10 to 1.00 with a 0.10 step size. Therefore, 1140 (= 19 × 6 × 10) entries are required for each LUT, where 8 bits are assigned to each entry. Thus, about 1 kB of memory space is required per LUT. The area overhead can be further reduced by trimming and compressing entries. The temperature-dependent LUTs are built for the four temperatures used in our experiment, i.e., 25, 50, 75, and 100 °C. The total area overhead amounts to 11 kB [= (3 × 1 kB) + 4 × (2 × 1 kB)].
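The LUT sizing above can be reproduced with a short sketch. The step counts are taken directly from the text (the 19 skewness steps follow the paper's stated entry count of 1140):

```python
# Number of quantization steps per LUT index, as stated in the text.
INDEX1_STEPS = 6    # std-dev/mean ratio: 0.05 to 0.30, step 0.05
INDEX2_STEPS = 19   # skewness steps implied by the stated entry count
INDEX3_STEPS = 10   # mean/remaining-workload ratio: 0.10 to 1.00, step 0.10
BITS_PER_ENTRY = 8

entries_per_lut = INDEX1_STEPS * INDEX2_STEPS * INDEX3_STEPS
lut_kbytes = entries_per_lut * BITS_PER_ENTRY / 8 / 1024   # ~1.1 kB, quoted as "about 1 kB"

# Three temperature-independent LUTs, plus two temperature-dependent LUTs
# replicated at each of the four experimental temperatures (25, 50, 75, 100 °C),
# each rounded to 1 kB as in the text.
total_kbytes = 3 * 1 + 4 * (2 * 1)
print(entries_per_lut, total_kbytes)  # 1140 11
```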
X. Conclusion
In this paper, we presented a novel online DVFS method which exploits the distributions of both computational workload and memory stall workload during program runs under combined Vdd/Vbb scaling. To reduce the complexity of our previous design-time solution [28], we presented a DVFS method consisting of two steps: local-optimal workload prediction and coordination. In the local-optimal workload prediction step, we periodically calculated five local-optimal workload predictions, each of which minimized a single energy component under the joint PDF of computational cycle and memory stall time, which is profiled during runtime. To further reduce the runtime overhead, we prepared tables pre-characterized at design time based on the analytical formulation; during runtime, we utilized them to find the local-optimal workloads. In the coordination step, we found the global workload prediction by coordinating the five local-optimal workload predictions. Experimental results show that the proposed method offers up to 34.6% and 17.3% energy savings for the MPEG4 and H.264 decoders, respectively, compared with the existing method [4].
References
[1] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, "Memory access scheduling," in Proc. ISCA, 2000, pp. 128–138.
[2] K. Govil, E. Chan, and H. Wasserman, "Comparing algorithms for dynamic speed-setting of a low-power CPU," in Proc. MOBICOM, 1995, pp. 13–25.
[3] Y. Gu and S. Chakraborty, "Control theory-based DVS for interactive 3-D games," in Proc. DAC, 2008, pp. 740–745.
[4] K. Choi, R. Soma, and M. Pedram, "Fine-grained dynamic voltage and frequency scaling for precise energy and performance tradeoff based on the ratio of off-chip access to on-chip computation times," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 1, pp. 18–28, Jan. 2005.
[5] W.-Y. Liang, S.-C. Chen, Y.-L. Chang, and J.-P. Fang, "Memory-aware dynamic voltage and frequency prediction for portable devices," in Proc. RTCSA, 2008, pp. 229–236.
[6] G. Dhiman and T. S. Rosing, "System-level power management using online learning," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no. 5, pp. 676–689, May 2009.
[7] A. Azevedo, I. Issenin, R. Cornea, R. Gupta, N. Dutt, A. Veidenbaum, and A. Nicolau, "Profile-based dynamic voltage scheduling using program checkpoints," in Proc. DATE, 2002, pp. 168–175.
[8] D. Shin and J. Kim, "Optimizing intra-task voltage scheduling using data flow analysis," in Proc. ASPDAC, 2005, pp. 703–708.
[9] J. Seo, T. Kim, and J. Lee, "Optimal intratask dynamic voltage-scaling technique and its practical extensions," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 25, no. 1, pp. 47–57, Jan. 2006.
[10] S. Hong, S. Yoo, H. Jin, K.-M. Choi, J.-T. Kong, and S.-K. Eo, "Runtime distribution-aware dynamic voltage scaling," in Proc. ICCAD, 2006, pp. 587–594.
[11] S. Hong, S. Yoo, B. Bin, K.-M. Choi, S.-K. Eo, and T. Kim, "Dynamic voltage scaling of supply and body bias exploiting software runtime distribution," in Proc. DATE, 2008, pp. 242–247.
[12] J. R. Lorch and A. J. Smith, "Improving dynamic voltage scaling algorithm with PACE," ACM SIGMETRICS Perform. Eval. Rev., vol. 29, no. 1, pp. 50–61, Jun. 2001.
[13] C. Xian and Y.-H. Lu, "Dynamic voltage scaling for multitasking real-time systems with uncertain execution time," in Proc. GLSVLSI, 2006, pp. 392–397.
[14] T. Sherwood, E. Perelman, G. Hamerly, S. Sair, and B. Calder, "Discovering and exploiting program phases," IEEE Micro, vol. 23, no. 6, pp. 84–93, Nov. 2003.
[15] T. Sherwood, S. Sair, and B. Calder, "Phase tracking and prediction," in Proc. ISCA, 2003, pp. 336–347.
[16] Q. Wu, M. Martonosi, D. W. Clark, V. J. Reddi, D. Connors, Y. Wu, J. Lee, and D. Brooks, "A dynamic compilation framework for controlling microprocessor energy and performance," in Proc. MICRO, 2005, pp. 271–282.
[17] C. Isci, G. Contreras, and M. Martonosi, "Live, runtime phase monitoring and prediction on real systems with application to dynamic power management," in Proc. MICRO, 2006, pp. 359–370.
[18] S.-Y. Bang, K. Bang, S. Yoon, and E.-Y. Chung, "Run-time adaptive workload estimation for dynamic voltage scaling," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no. 9, pp. 1334–1347, Sep. 2009.
[19] J. Kim, S. Yoo, and C.-M. Kyung, "Program phase and runtime distribution-aware online DVFS for combined Vdd/Vbb scaling," in Proc. DATE, 2009, pp. 417–422.
[20] T. Mudge, K. Flautner, D. Blaauw, and S. M. Martin, "Combined dynamic voltage scaling and adaptive body biasing for lower power microprocessors under dynamic workloads," in Proc. ICCAD, 2002, pp. 721–725.
[21] W. Liao, L. He, and K. M. Lepak, "Temperature and supply voltage aware performance and power modeling at microarchitecture level," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 7, pp. 1042–1053, Jul. 2005.
[22] CACTI 5.3 [Online]. Available: http://www.hpl.hp.com/research/cacti
[23] BPTM High-k/Metal Gate 32 nm High Performance Model [Online]. Available: http://www.eas.asu.edu/ptm
[24] K. Puttaswamy and G. H. Loh, "Thermal herding: Microarchitecture techniques for controlling hotspots in high-performance 3-D-integrated processors," in Proc. HPCA, 2007, pp. 193–204.
[25] N. Kavvadias, P. Neofotistos, S. Nikolaidis, C. A. Kosmatopoulos, and T. Laopoulos, "Measurement analysis of the software-related power consumption in microprocessors," IEEE Trans. Instrum. Meas., vol. 53, no. 4, pp. 1106–1112, Aug. 2004.
[26] S. Oh, J. Kim, S. Kim, and C.-M. Kyung, "Task partitioning algorithm for intra-task dynamic voltage scaling," in Proc. ISCAS, 2008, pp. 1228–1231.
[27] J. Kim, S. Oh, S. Yoo, and C.-M. Kyung, "An analytical dynamic scaling of supply voltage and body bias based on parallelism-aware workload and runtime distribution," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no. 4, pp. 568–581, Apr. 2009.
[28] J. Kim, Y. Lee, S. Yoo, and C.-M. Kyung, "An analytical dynamic scaling of supply voltage and body bias exploiting memory stall time variation," in Proc. ASPDAC, 2010, pp. 575–580.
[29] FFMPEG [Online]. Available: http://www.ffmpeg.org
[30] VQEG [Online]. Available: ftp://vqeg.its.bldrdoc.gov
[31] PAPI [Online]. Available: http://icl.cs.utk.edu/papi
Jungsoo Kim (S'06) received the B.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2005, and completed the unified M.S.–Ph.D. course in the Department of Electrical Engineering and Computer Science, KAIST, in 2010.
Since 2010, he has been in a post-doctoral position with KAIST. His current research interests include dynamic power and thermal management, multiprocessor system-on-a-chip design, and low-power wireless surveillance system design.
Sungjoo Yoo (M'00) received the B.S., M.S., and Ph.D. degrees in electronics engineering from Seoul National University, Seoul, South Korea, in 1992, 1995, and 2000, respectively.
He was a Researcher with the TIMA Laboratory, Grenoble, France, from 2000 to 2004, and a Senior and Principal Engineer with Samsung Electronics, Seoul, from 2004 to 2008. Since 2008, he has been with the Pohang University of Science and Technology, Pohang, South Korea. His current research interests include dynamic power and thermal management, on-chip networks, multithreaded software and architecture, and fault tolerance of solid-state disks.
Chong-Min Kyung (S'76–M'81–SM'99–F'08) received the B.S. degree in electronics engineering from Seoul National University, Seoul, South Korea, in 1975, and the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 1977 and 1981, respectively.
From April 1981 to January 1983, he was with Bell Telephone Laboratories, Murray Hill, NJ, in a post-doctoral position. Since he joined KAIST in 1983, he has been working on system-on-a-chip design and verification methodology, and on processor and graphics architectures for high-speed and/or low-power applications, including mobile video codecs. He was a Visiting Professor with the University of Karlsruhe, Karlsruhe, Germany, in 1989, as an Alexander von Humboldt Fellow, and a Visiting Professor with the University of Tokyo, Tokyo, Japan, from January 1985 to February 1985, with the Technical University of Munich, Munich, Germany, from July 1994 to August 1994, with Waseda University, Tokyo, from 2002 to 2005, with the University of Auckland, Auckland, New Zealand, from February 2004 to February 2005, and with Chuo University, Tokyo, from July 2005 to August 2005.
Dr. Kyung is the Director of the Integrated Circuit Design Education Center, Daejeon, established in 1995 to promote integrated circuit (IC) design education in Korean universities through computer-aided design environment setup and chip fabrication services. He is also the Director of the SoC Initiative for Ubiquity and Mobility Research Center, established to promote academia/industry collaboration in SoC design-related areas. From 1993 to 1994, he served as an Asian Representative on the International Conference on Computer-Aided Design Executive Committee. He received the Most Excellent Design Award and the Special Feature Award from the University Design Contest at ASP-DAC 1997 and 1998, respectively. He received Best Paper Awards at the 36th DAC, New Orleans, LA, the 10th International Conference on Signal Processing Application and Technology, Orlando, FL, in September 1999, and the 1999 International Conference on Computer Design, Austin, TX. He was the General Chair of the Asian Solid-State Circuits Conference 2007 and ASP-DAC 2008. In 2000, he received the National Medal from the Korean Government for his contributions to research and education in IC design. He is a member of the National Academy of Engineering Korea and the Korean Academy of Science and Technology, and is a Hynix Chair Professor with KAIST.