Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI: 10.1109/TVLSI.2017.2784807
Vector Processing-Aware
Advanced Clock-Gating Techniques for
Low Power Fused Multiply-Add
Ivan Ratković, Oscar Palomar, Milan Stanić, Osman Unsal, Adrian Cristal, and Mateo Valero
Abstract—The need for power efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need re-tailoring for the mobile market that they are entering now. The floating-point fused multiply-add unit, being a functional unit with high power consumption, deserves special attention. Although clock-gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially when considering the active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector fused multiply-add units (VFU). These techniques ensure power savings without jeopardizing the timing. We evaluate the proposed techniques using both synthetic and "real world" application-based benchmarking. Using vector masking and vector multi-lane-aware clock-gating, we report power reductions of up to 52%, assuming an active VFU operating at peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector floating-point instructions. Finally, when evaluating all techniques together using "real world" benchmarking, the power reductions are up to 80%. Additionally, in accordance with processor design trends, we perform this research in a fully parameterizable and automated fashion.
Index Terms—Digital Arithmetic, Clock-Gating, Low Power, Vector Processors, Fused Multiply-Add, Methodologies.
1 INTRODUCTION
also applicable to scalar architectures but especially beneficial to vector processors (e.g. gating internal blocks depending on the values of input data), and (3) ideas that are already used in other

[Figure: vector registers V1 and V2 feeding the FMA units, with the result written to V3; one lane processes the even elements (62, 60, ..., 2, 0) and the other the odd elements (63, 61, ..., 3, 1).]
TABLE 1: IEEE754 single and double precision formats. The number of bits for each field is shown (bit ranges are in square brackets, 0 = least-significant bit).

                            Sign     Exponent     Fraction
Single Precision (32 bits)  1 [31]   8 [30-23]    23 [22-0]
Double Precision (64 bits)  1 [63]   11 [62-52]   52 [51-0]

TABLE 2: A classification of the proposed techniques using two criteria: (1) Vector Processing-Specific or -Beneficial and (2) operating mode (Active or Idle).

         VP-specific                   VP-beneficial
Active   MaskCG, ScalarCG, ImplCG      InputCG
Idle     n/a                           IdleCG
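The bit layout in Table 1 can be checked directly in Python; this small helper (our illustration, unrelated to the paper's framework) slices a 64-bit double into the fields exactly as given above:

```python
import struct

def fp64_fields(x: float) -> dict:
    """Split a double into Table 1's fields: sign [63], exponent [62-52], fraction [51-0]."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # raw IEEE754 bit pattern
    return {
        "sign": bits >> 63,
        "exponent": (bits >> 52) & 0x7FF,    # 11 bits, bias 1023
        "fraction": bits & ((1 << 52) - 1),  # 52 bits
    }

# e.g. -1.0 -> sign 1, exponent 1023 (biased), fraction 0
```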
[Figure: FMA datapath annotated with the proposed clock-gating techniques. The input registers for operands A, B, and C (with rounding mode, op_sign, and Enable signals) are controlled by InputCG, MaskCG, ScalarCG, ImplCG, and IdleCG; the mantissa datapath comprises the ALIGNER, MULTIPLIER (Wallace), EXPONENT ADDER AND INCREMENTER, LEADING ZERO ANTICIPATION, CONTROL, NORMALIZER, and ROUNDING submodules.]
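The techniques annotated in the datapath are applied through latch-based integrated clock-gating (ICG) cells (Section 5 notes that these come from the standard-cell library). As a behavioral sketch of why such a cell is glitch-free (the enable is only sampled while the clock is low), here is an illustrative Python model, not the paper's RTL:

```python
class ClockGateICG:
    """Behavioral model of a latch-based integrated clock-gating cell."""
    def __init__(self):
        self.latched_en = 0

    def tick(self, clk: int, enable: int) -> int:
        if clk == 0:                      # latch is transparent while clk is low,
            self.latched_en = enable      # so enable cannot change mid-high-phase
        return clk & self.latched_en      # gated clock output

icg = ClockGateICG()
wave = [(0, 1), (1, 1), (0, 0), (1, 0)]          # (clk, enable) half-periods
gated = [icg.tick(clk, en) for clk, en in wave]  # -> [0, 1, 0, 0]
```

Dropping the enable while the clock is low suppresses the next high phase entirely, which is the behavior the synthesis-library ICG cells provide.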
TABLE 3: Types of instructions for which ScalarCG and ImplCG are applicable. Capital letters indicate vector operands. The third and fourth columns show the corresponding VS and instruction type signals, respectively.

Operation    Vector Instruction                                    VS[0..2]  INSTYP
A ∗ b + C    FPFMAVSV - a multiplicand is a scalar                 101       0
A ∗ B + c    FPFMAVVS - the addend is a scalar                     110       0
A ∗ b + c    FPFMAVSS - a multiplicand and the addend are scalars  100       0
A ∗ B + 0    FPMULV - the addend c is a scalar (0)                 110       1
A ∗ 1 + C    FPADDV - the multiplicand b is a scalar (1)           101       1
A ∗/+ b/c    FPMULVS/FPADDVS - 2 out of 3 operands are scalars5    100       1

[Fig. 3 residue: FMA datapath with input registers (Cmantissa, Bmantissa, Amantissa), ALIGNER, MULTIPLIER (Wallace), adder with leading zero anticipation, and MUXes controlled by the FPADDV/FPMULV signals, feeding the rest of the FMA datapath.]

Fig. 3: Gated FMA submodules during FPMULV (dark grey) and FPADDV (light grey) instructions in the case of ImplCG.
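The control encoding in Table 3 can be transcribed into a small lookup for cross-checking; this dictionary is an illustration, not the paper's decode logic, and simply maps each instruction to its (VS[0..2], INSTYP) pair from the table:

```python
# Transcription of Table 3: instruction -> (VS[0..2], INSTYP).
VS_INSTYP = {
    "FPFMAVSV": ("101", 0),  # a multiplicand is a scalar
    "FPFMAVVS": ("110", 0),  # the addend is a scalar
    "FPFMAVSS": ("100", 0),  # a multiplicand and the addend are scalars
    "FPMULV":   ("110", 1),  # A*B+0: the addend c is a scalar (0)
    "FPADDV":   ("101", 1),  # A*1+C: the multiplicand b is a scalar (1)
    "FPMULVS":  ("100", 1),  # 2 out of 3 operands are scalars
    "FPADDVS":  ("100", 1),
}
```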
TABLE 4: InputCG - conditions under which a hardware block of mantissa arithmetic computations and corresponding input …

[Figure: framework input parameters: app-uBench, vector (Vec.) parameters, uKernel parameters, and FMAgen parameters.]

(MVL and the number of vector lanes (nL)). Using these inputs, VectorSim generates data and timing traces for the vector floating-point operations. We use two kinds of uBenchs (both explained in Section 5.2):

• param-uBenchs are generated by feeding the parameterizable microkernel uKernel with its parameters.
• app-uBenchs are manually vectorized kernels extracted from applications.

The synthesizable Verilog netlists are generated using FMAgen (described in Section 5.3). The output of the architectural-level simulations, together with the FMAgen parameters, is transformed into Verilog test benchmarks (tBenchs) using tBenchGen, a tool that we developed. The most important inputs are: data and time traces from VectorSim, vector and FMAgen parameters (explained in Section 5.3), the clock cycle, and the tBench length expressed in the number of Verilog test vectors. As output, it provides a tBench for each lane separately, together with its profiling report.

Afterwards, we use Cadence RTL Compiler (RC) to obtain synthesized mapped netlists and to perform static timing analysis (STA) of the VFU. All the designs are synthesized for the minimum clock period that provides a safe slack on the critical path. We optimize non-critical-path logic for leakage using high-VTH cells. We provide the synthesized Verilog netlists together with the physical layout information to the Cadence Encounter Digital Implementation System (EDI) to obtain placed and routed designs and to perform STA again. Additionally, the most critical paths are verified with Cadence Spectre.

In order to verify the designs and extract the resulting switching activity information (written into Value Change Dump (vcd) files), we simulate each VFU in Cadence NCSim for each matching tBench with back-annotated delays using standard delay format (sdf) files. Afterwards, we perform precise power estimation using EDI PowerSim.

All designs are implemented using the mentioned low power and low leakage TSMC40LP library for typical operating conditions (Vdd = 1.1 V, temp = 25°C). Since practically all existing vector processors are developed using standard cells [29], we selected this approach in our research. We use latch-based integrated clock-gating cells from the cell library. We set the tools to meet timing constraints while prioritizing power over area. All the optimizations in all the tools are applied using high effort.

An initial target core density of 70% is selected as a sensible balance between timing improvement and area shrinking for the wide set of design parameters that we use. We experimentally found that, in the place-and-route (PnR) stage, a density below 70% in general provides negligibly faster timing at a non-negligible area overhead, while densities higher than 70% can noticeably spoil timing, as the tool suffers from the lack of free space for optimizations and routing. Additionally, initial densities below 70% sometimes even cause DRC errors and density (congestion) violations.

5.1 VectorSim

We built VectorSim based on the vector architecture library (VALib) and the SimpleVector simulator [11], developed in our group. VALib is a library that implements vector instructions and allows rapid manual vectorization and characterization of applications. SimpleVector is a simple and very fast trace-based simulator which helps to estimate the performance of a vector processor. We took advantage of the fact that both tools have been designed to be easily extended with new instructions or implementation alternatives. Therefore, we modified them to satisfy our research goals and to enable their integration in our exploration frameworks. Among other upgrades, we added a set of vector floating-point FMA instructions to VectorSim. The high-level VectorSim configuration is presented in Table 5.

TABLE 5: High-level VectorSim configuration.

Execution:            32-bit in-order vector core; decoupled vector and scalar core; vector chaining.
Vector Register File: MVL: 16, 64, or 128 elements; number of registers: 8.
Vector FU:            nL: 1, 2, or 4; 1 32-bit arithmetic logic unit (ALU), 1 64-bit VFU; VFU latency (nS): 1-4.
Memory System:        L2 (direct access to L2, shared with scalar core): 1 MB, 4-way, 128 B cache line, hit latency: 7 ns, miss latency: 70 ns; 1 load unit, 1 store unit.

We set up VectorSim to model a decoupled 32-bit vector machine with support for 64-bit floating point. The decoupled execution model assumes separate in-order vector and scalar execution units [14, 30, 31, 32]. They share instruction fetch and decode, and have separate issue logic and functional units, allowing in that way independent scalar execution. In-order execution is common in low power processors due to its simplicity (e.g. some of the ARM Cortex-A architectures: Cortex-A7, A8, A32, A35, and A53 [33]). It is more efficient in vector than in scalar processing, as in vector architectures the drawbacks of in-order execution are diminished, especially if the vectors are long. Additionally, we model chaining (the vector equivalent of data forwarding) and dead time elimination (allowing the ALU to be reused immediately after the current instruction).

The vector execution engine is organized as nL identical vector lanes. The possible values of the number of lanes (nL) are 1, 2, and 4. In our experiments, we do not examine more lanes, as that would not fit well within a low power core budget7. Moreover, the values that we choose are typical in vector processor design [29]. Each lane has a slice of the vector register file, a slice of the vector mask file, 1 vector integer ALU, 1 VFU, and a private TLB. There is no communication across lanes, except for gather/scatter, reduction, and compress instructions. In addition to the vector ALU, each lane also includes 1 logic unit that handles logic operations, shifting, and rotating. We assume that division is done in software, since it is rare in vectorizable applications and the hardware support is costly. Additionally, a control unit is needed to detect hazards, both from conflicts for the functional units (structural hazards) and from conflicts for register accesses (data hazards).

5.2 Benchmarking

This section explains the two benchmarking methods that we employ for an in-depth evaluation of the proposed techniques. The first method has as its goal to stress each of the techniques separately, while the second tests all the techniques simultaneously and provides results for "real world" applications.

5.2.1 Fully Parameterizable Kernel - uKernel

We generate different param-uBenchs using the same uKernel. It is a variant of the DAXPY loop: D = A∗B+C. The inputs are random values unless specified otherwise. uKernel parameters (Table 6)

7. The total number of functional units per core is in accordance with many other low power processors [33, 34, 35].
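The uKernel above is a DAXPY variant, D = A∗B+C. A minimal Python sketch of that idea follows; it is our illustration only: the real uKernel exposes many more parameters (cf. Table 6), and the element-to-lane interleaving shown is our assumption.

```python
import random

def ukernel_traces(MVL=16, nL=2, seed=0):
    """Generate one vector's worth of FMA inputs/results, split across nL lanes."""
    rng = random.Random(seed)
    A = [rng.uniform(-1.0, 1.0) for _ in range(MVL)]
    B = [rng.uniform(-1.0, 1.0) for _ in range(MVL)]
    C = [rng.uniform(-1.0, 1.0) for _ in range(MVL)]
    D = [a * b + c for a, b, c in zip(A, B, C)]  # D = A*B + C per element
    lanes = [D[i::nL] for i in range(nL)]        # assumed element interleaving
    return D, lanes
```

With MVL=16 and nL=4 each lane processes 4 elements, matching the "effective vector length per lane" later discussed in Section 6.3.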
However, compared to the rest of the FMA submodules, their delay impact is fairly small. Additionally, we debug timing and apply retiming10 when necessary. Therefore, the timing cost of the added circuitry is almost negligible, especially in the case of the 4-stage VFU (the most relevant one).

Since we target low power, we do not incorporate any speculative hardware for improving performance; thus, no energy is wasted on precomputed results that get discarded.

6 EVALUATION

This section presents an evaluation of the presented vector processing-aware clock-gating proposals in terms of power savings (S) and area efficiency. Regarding power measurements, we first evaluate each technique separately using the benchmarking method from Section 5.2.1, and afterwards we evaluate combined scenarios using the method explained in Section 5.2.2.

VFU designs with 1, 2, 3, and 4 stages are synthesized and run at 0.45, 0.85, 1.1, and 1.3 GHz respectively. We assume a NoCG VFU as the baseline. Its power in the case of a 2-lane VFU is 15.6, 30.9, 44.9, and 59.2 mW for 1, 2, 3, and 4 stages respectively.

We observe that the static (leakage) power is practically negligible compared to the dynamic power. For NoCG it is around 0.01% of total power on average. The leakage is highest for IdleCG, where it is up to 1%. It is practically negligible due to the following reasons: (1) arithmetic topologies produce high switching activity (high dynamic power), (2) the technology that we use has low leakage, and (3) we optimize non-critical-path logic for leakage using high-VTH cells. Although leakage is negligible when considering active operating modes (i.e. the execution inside a vector kernel), when the execution is outside a vector kernel (i.e. when the vector core is inactive) the leakage might be additionally suppressed via power gating.11 However, power gating is out of the scope of this research since we target lowering power during active operating modes with no performance loss.

10. Critical path optimization by adjusting the position of the flip-flops.
11. The gate signal in this case could be generated from a vector kill instruction KILLV (similar to the VRIP instruction in the Cray X1 instruction set [41]).

We focus on the 4-stage results as they are the most important from the processor design perspective. Nonetheless, the 1-stage results are presented as a reference; in most cases they have the highest overhead in terms of power and area across all nS. For the sake of simplicity, in the rest of this section we typically omit results for the 2- and 3-stage designs, but we observe that these results regularly scale between the results for 1 and 4 stages.

6.1 Area Efficiency

Table 9 shows the area efficiency of the proposed techniques. The area of a NoCG 2-lane VFU configuration is 36191, 38060, 40693, and 43419 µm2 for 1, 2, 3, and 4 stages respectively. The area overhead is in some cases higher than expected because: (1) during synthesis, we prioritized timing and power over area to assure power savings without spoiling timing, and (2) Chisel-generated Verilog code is sometimes less area-efficient than equivalent manually written Verilog [39]. However, we observe that this overhead has a strong decreasing trend as nS increases.

6.2 Per Technique Power Analysis

Figure 5 and Table 8 reveal the results for each of the presented vector processing-aware clock-gating proposals separately, in terms of power savings for the 4- and 1-stage VFU. In these experiments we set MVL and the number of vector lanes (nL) to 64 and 2 respectively.

TABLE 8: Evaluation of power savings for ScalarCG and ImplCG depending on the instruction type against the baseline (NoCG). The INSTYP, ADD/MUL, MULS, and ADDS uKernel parameters are varied in these experiments to test all the instructions separately, while the rest of the parameters are zero.

Vector Instruction  nS  S_ScalarCG (%)  S_ImplCG (%)
FPFMAV              1   −34.35          −38.27
FPFMAV              4   1.87            0.39
FPFMAVSV            1   −10.11          −12.73
FPFMAVSV            4   16.47           14.55
FPFMAVVS            1   −31.47          −33.19
FPFMAVVS            4   4.07            4.74
FPFMAVSS            1   −6.77           −7.77
FPFMAVSS            4   18.90           18.83
FPADDV              1   −10.70          65.07
FPADDV              4   16.54           42.42
FPADDVS             1   −6.72           70.30
FPADDVS             4   18.92           46.16
FPMULV              1   −31.36          −22.81
FPMULV              4   4.12            9.00
FPMULVS             1   −6.69           2.52
FPMULVS             4   18.95           23.04

We observe that in most of the cases the savings grow with nS, as more pipeline stages enable a finer granularity of clock-gating. Due to its higher practical importance, in the rest of the discussion we focus on the 4-stage results.

Figure 5 shows the results for MaskCG, InputCG, and IdleCG:

• MaskCG. Due to its simplicity, this technique comes with practically no overhead, and the savings are between 8% and 52% depending on p_m. The saving attainable when p_m=1 (S=52%) is the maximum possible power reduction for an active 4-stage VFU.

• InputCG. In order to isolate the savings for each of the mentioned subtechniques (Table 4), we test them all separately by asserting the probabilities p_inf, p_NaN, and p_0 (Table 6) on the operands. In InputCG∞, InputCGNaN, and InputCGmul0, the corresponding probability affects operand A, while in InputCGadd0 it affects operand C, the addend.

The maximum saving of 48.3% is available when p_inf or p_NaN is 1 (InputCG∞ and InputCGNaN). The same savings are available when an operand is NaN or ∞, as in both cases the same hardware is clock-gated. The minimum probability p_NaN or p_inf (of any operand) necessary for saving power is the spot where the savings graph crosses the probability axis (16%).

When considering InputCGmul0, the maximum saving is 40%, and the minimum probability p_0 (of any multiplicand) necessary for saving power is 18.5%.

A much lower maximum saving (2.3%) is available when the addend is zero (InputCGadd0), as the adder consumes much less power than the multiplier (around 5 times less on average). However, by combining these scenarios at the same time (which is reasonable to assume in a real application case), higher savings would be available. Therefore, even though the savings associated with detecting a zero addend and clock-gating the adder and the corresponding aligner and input registers are not enough to justify its existence by itself, supporting this case improves the overall savings of the complete InputCG technique when a real, combined scenario is considered. Since it shares some hardware with the other InputCG subtechniques, the overhead of adding it is less than the saving it
[Fig. 5 plots: savings (%) vs. probability (0.0 to 0.9), two panels (a) and (b); legend pairs: (IdleCG, IR), (MaskCG, p_m), (InputCG, A p_inf), (InputCG, A p_NaN), (InputCG, A p_0), (InputCG, C p_0).]

Fig. 5: Evaluation of power savings over the baseline (NoCG), as a function of uKernel parameters for IdleCG, MaskCG, and InputCG, for the 4-stage (a) and 1-stage (b) 2-lane VFU. For each graph, only one uKernel probability parameter from Table 6 is assumed to be variable, while the other parameters are zero. This is indicated in the legend with (technique, probability) pairs.
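The InputCG trigger conditions evaluated in Fig. 5 (an operand that is NaN, ∞, or zero) can be summarized behaviorally. The following Python sketch is our illustration of the detection conditions only; the actual conditions, and the aligner/input registers gated along with each block, are those of Table 4:

```python
import math

# Illustrative model (ours) of the InputCG detection conditions: special
# operand values let whole mantissa-arithmetic blocks be clock-gated.
def inputcg_gates(a: float, b: float, c: float) -> dict:
    """FMA computes a*b + c; a and b are multiplicands, c is the addend."""
    special = lambda x: math.isnan(x) or math.isinf(x)
    return {
        # NaN/infinity anywhere: the result is special, mantissa datapath idles
        "gate_on_special": special(a) or special(b) or special(c),
        # zero multiplicand: the multiplier's result is known to be zero
        "gate_multiplier": a == 0.0 or b == 0.0,
        # zero addend: the adder (and its aligner) can be gated
        "gate_adder": c == 0.0,
    }
```

For example, inputcg_gates(float('inf'), 1.0, 0.0) flags both the special-operand case and the zero-addend case, the kind of overlap discussed in Section 6.3.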
TABLE 9: Area efficiency of the proposed clock-gating techniques against the baseline (NoCG) for nS = 1 and 4.

CG type             IdleCG/MaskCG  ScalarCG  ImplCG  StageCG  InputCG  ALLCG
Ratio (%), nS = 1   98.1           153.8     184.6   98.1     129.0    185.4
Ratio (%), nS = 4   96.3           104.6     125.4   96.9     107.3    148.0
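The relative numbers in Tables 8 and 9 can be turned back into absolute values using the NoCG baselines quoted in Section 6 for the 2-lane, 4-stage VFU (59.2 mW, 43419 µm²); the helpers below are our arithmetic illustration only:

```python
NOCG_POWER_MW = 59.2   # NoCG 2-lane, 4-stage VFU power (Section 6)
NOCG_AREA_UM2 = 43419  # NoCG 2-lane, 4-stage VFU area (Section 6.1)

def power_after_savings(s_pct: float) -> float:
    """Absolute power left after a saving of s_pct percent (negative = overhead)."""
    return NOCG_POWER_MW * (1.0 - s_pct / 100.0)

def area_from_ratio(ratio_pct: float) -> float:
    """Absolute area implied by a Table 9 area ratio."""
    return NOCG_AREA_UM2 * ratio_pct / 100.0
```

The 52% MaskCG ceiling thus leaves about 28.4 mW, and ImplCG's 125.4% area ratio corresponds to roughly 54450 µm².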
can achieve.

The power overhead of the added hardware can be identified in the case when the probability is 0, i.e. when InputCG is never active. The cost is a bit higher than expected taking into account the amount of additional logic that we include (detection, bypassing, and clock-gating logic). In line with our discussion of the area results, Chisel-generated designs sometimes suffer from unexpected overhead, and our initial experiments confirm it. However, we observe that it decreases significantly as nS increases; thus, we expect this overhead to be negligible for high nS.

• IdleCG. This technique provides savings between 8% and 95.3%. Although it uses the same hardware as MaskCG, the savings are higher on average because here the input data is stable, while in MaskCG it is not, thus incurring some switching even though the first pipeline stage is gated. The saving attainable when IR=1 (95.3%) is the maximum possible power reduction when the VFU is idle.

The results for ScalarCG and ImplCG are shown in Table 8:

• ScalarCG. Savings are available for all the combinations of the INSTYP, ADD/MUL, MULS, and ADDS uKernel parameters, i.e. for all vector floating-point instructions. Not surprisingly, as the internal multiplier dissipates more power than the adder, latching the multiplicands' mantissas provides more savings than latching the addend's mantissa (FPFMAVSV and FPADDV vs. FPFMAVVS and FPMULV).

• ImplCG. This technique significantly improves the savings for the FPADDV/FPMULV instructions compared to ScalarCG. The highest savings, and the largest improvements compared to ScalarCG, are available for the FPADDV and FPADDVS instructions, as in these cases the multiplier is gated. As for ScalarCG, there are savings for all vector floating-point instructions. Incidentally, we can observe that for FPADDV and FPADDVS the power savings are higher for the 1- than for the 4-stage VFU. The reason is that the mantissa multiplier contributes a higher percentage of the total power when only one pipeline stage is present.

6.3 Application-Based Combined Power Evaluation

Table 10 shows the evaluation of the presented clock-gating proposals as well as workload profiling results for all combinations of app-uBench and vector parameters (MVL and nL), while Figures 6, 7, and 8 visualize the key metrics from the table. For the reasons explained above, we focus on VFUs with 4 stages. The results shown in the table and the figures are:
• S - the power savings percentage,
• Active Execution / Idle Execution - active/idle VFU execution, i.e. the percentage of app-uBench execution time that the VFU is active/idle,
• IdleCG, ScalarCG, ImplCG, InputCG and InputCG subtechniques efficiency - the percentage of execution time that the corresponding technique is used,
• AllCG efficiency - the percentage of execution time that any clock-gating technique is used,
• ActiveCG efficiency - the percentage of execution time that any active clock-gating technique is used, and
• ActiveExeCG efficiency - the percentage of active VFU execution during which any active clock-gating technique is used. Therefore, it disregards idle cycles and considers only the cycles in which
there is a vector instruction executing in the unit, rather than the total execution time like in ActiveCG.

We show the profiling results as well in order to understand the VFU behavior and where the savings come from. Data for MaskCG, InputCGNaN, and InputCGinf are not present in Table 10 as they are 0%. The reason is that the selected app-uBenchs do not have vector mask instructions. Also, none of the input values are NaNs or infinities. However, abundant vector mask instructions can be found in any vector workload that has conditional execution [42], so in such workloads we can expect fair savings as a result of the MaskCG technique. Regarding NaNs and infinities, for some other applications and/or input data sets their occurrence might be more common; thus, the benefit of the InputCGNaN and InputCGinf subtechniques would be visible. Common cases of NaN and infinity processing are explained in [21].

Figure 6 and Table 10 reveal that the AllCG efficiency is very high, i.e. clock-gating is often used for almost 100% of the total execution time. This is a consequence of the fact that the proposed clock-gating techniques are used during both idle and active VFU execution (i.e. both ActiveCG and IdleCG are used). Due to this very high AllCG efficiency, the power savings are also fairly high. We observe that power savings are available for practically all the combinations of app-uBenchs and vector parameters. The highest savings are obtained for the computer vision app-uBenchs (Facerec and Disparity) and are between 60% and 80%. The only case in which the techniques do not result in savings is K-Means with MVL=16 and nL=4. There are two reasons for that: (1) the clock-gating efficiency (i.e. the percentage of execution time that any clock-gating technique is used) is not high, and (2) with nL=4 and MVL=16 the effective vector length per lane is 4, which makes ImplCG (the most used technique in this case) less fruitful, since it is used for only 3 consecutive cycles per vector on each lane (which as a result causes more switching activity in the clock-gating and bypassing logic). Additionally, from Table 10 we observe that the presented novel ideas/approaches (ActiveCG) provide significant savings in addition to the standard one (IdleCG).

Figure 7 shows that the ratio of active and idle VFU execution varies across app-uBenchs and vector parameters, and explains its nature for each combination of parameters. There are situations in which the VFU is most of the time active, and vice versa. However, we can notice a trend that vector processors with an MVL of 128 have their VFU active (busy) most of the time, as with longer vectors the effects of cache misses are diminished. IdleCG is used whenever the VFU is idle; thus, the IdleCG efficiency inside these idle periods is 100%. When considering active VFU execution, the efficiency (ActiveExeCG) varies across the app-uBenchs and vector parameters and is shown in Figure 8. As we can observe from the figure, a very high percentage of the time at least one of the active clock-gating techniques is used, and depending on the benchmark it goes up to 100%. Table 10 shows that in all these cases the used techniques are some variants of ImplCG12 and InputCG. Also, there are cases when these techniques overlap.

ActiveCG techniques can arbitrarily overlap, and there are two potential kinds of overlaps. The first group of overlaps refers to the cases when the techniques jointly produce higher savings than each technique separately. This happens when the techniques target different hardware. For example, when we have a zero addend inside an FPADDV instruction, InputCGadd0 gates the mantissa adder and the aligner, while ImplCG gates the input register of the A operand and the mantissa multiplier. The second group of overlaps happens when one technique gates just part of the hardware that another technique gates. In these cases, the savings are equal to the savings of the technique that has the larger scope. For example, if the corresponding bit in the VMR is '0' and the current instruction is FPFMAVVS, the savings are going to be equal to the savings achieved by MaskCG alone.

12. As explained before, ScalarCG is integrated in ImplCG.

7 CONCLUSIONS

In this research, we extensively identify, propose, and evaluate the most suitable clock-gating techniques for a vector FMA unit (VFU), considering peak performance and focusing on the active operating mode. We propose techniques that are either (1) completely novel ideas to lower the power of the VFU using active clock-gating (e.g. vector instructions with a scalar operand or vector masking) or (2) ideas that exist in some form in scalar architectures and that we extend to achieve more savings by taking advantage of vector processing characteristics. We find that each of the proposed optimizations achieves power reductions while maintaining performance. As a consequence of this fact, sometimes an area increase is observed.

An in-depth evaluation is performed, and each of the techniques is evaluated separately as well as combined with the other techniques. For this evaluation, both synthetic and real application-based benchmarks are employed. We considered a variety of benchmarks with different behavior to assure a fair evaluation and general conclusions.

In the case of an active 4-stage vector fused multiply-add unit (VFMA) with 2 lanes operating at peak performance, power savings of up to 52% are available when using a single technique. Regarding the vector instruction-dependent techniques that we propose, we observe savings for all floating-point vector instructions.

Testing all the techniques together and using real application benchmarks (especially the computer vision ones) reveals fairly high power reductions that go up to 80%. The clock-gating efficiency (the percentage of time that some of the proposed techniques are used) is quite high, often close to 100%. When considering the efficiency of only the active clock-gating techniques, this number is usually between 70% and 100%. We observe that these novel ideas/approaches (applied when the VFU is active) provide significant savings in addition to the standard ones (idle VFU). Moreover, we notice the trend that the savings for the proposed techniques rise with the number of pipeline stages.

We performed this research in a fully parameterizable, scalable, and automated manner using simulators and tools at many levels. Although we target the floating-point FMA, as the major consumer among all functional units, similar low power techniques as well as the framework could be re-tailored for other vector functional units. We would also like to stress that the combination of Chisel-based generators and state-of-the-art synthesis and PnR tools is a powerful tool for flexible hardware generation with Verilog-like quality of results.

REFERENCES

[1] I. Ratković, O. Palomar, M. Stanić, O. Unsal, A. Cristal, and M. Valero, "A Fully Parameterizable Low Power Design of Vector Fused Multiply-Add Using Active Clock-Gating
TABLE 10: Power savings and clock-gating statistics. All the results are expressed in percentages.

app-uBench  MVL  nL  S       AllCG   ActiveExe  ActiveCG  ActiveExeCG  IdleExe/IdleCG  ScalarCG/ImplCG  InputCG  InputCGmul0  InputCGadd0
Facerec     16   1   70.32   98.96   16.68      15.63     93.74        83.32           15.63            7.50     0.00         7.50
Facerec     16   2   73.99   98.83   9.36       8.19      87.49        90.64           8.19             4.21     0.00         4.21
Facerec     16   4   76.48   98.77   4.91       3.68      74.99        95.09           3.68             2.21     0.00         2.21
Facerec     128  1   59.84   99.65   44.33      43.98     99.21        55.67           43.98            19.90    0.00         19.90
Facerec     128  2   65.50   99.55   28.81      28.35     98.43        71.19           28.35            12.93    0.00         12.93
Facerec     128  4   71.04   99.47   16.83      16.30     96.87        83.17           16.30            7.55     0.00         7.55
Sphinx3     16   1   37.54   84.96   49.48      34.43     69.59        50.52           32.06            17.67    2.37         15.29
Sphinx3     16   2   38.61   85.05   43.04      28.10     65.27        56.96           26.03            15.37    2.07         13.30
Sphinx3     16   4   46.13   87.83   28.07      15.90     56.63        71.93           14.55            10.02    1.35         8.68
Sphinx3     128  1   30.55   82.25   69.67      51.92     74.53        30.33           48.31            25.27    3.61         21.66
Sphinx3     128  2   22.09   76.98   88.49      65.47     73.98        11.51           60.89            32.10    4.58         27.52
Sphinx3     128  4   23.43   77.70   82.26      59.96     72.89        17.74           55.70            29.84    4.26         25.58
Disparity   16   1   75.55   97.85   38.41      36.26     94.41        61.59           36.26            25.90    0.00         25.90
Disparity   16   2   74.20   97.10   24.38      21.49     88.11        75.62           21.49            16.44    0.00         16.44
Disparity   16   4   73.17   96.69   13.51      10.20     75.52        86.49           10.20            9.11     0.00         9.11
Disparity   128  1   79.69   100.00  81.38      81.38     100.00       18.62           82.60            60.81    0.00         60.81
Disparity   128  2   78.90   100.00  76.24      76.24     100.00       23.76           76.77            56.97    0.00         56.97
Disparity   128  4   77.71   99.45   61.60      61.05     99.10        38.40           61.05            46.03    0.00         46.03
K-Means     16   1   19.56   60.22   76.26      36.49     47.84        23.74           35.87            0.61     0.00         0.61
K-Means     16   2   4.61    47.97   94.05      42.02     44.68        5.95            41.27            0.76     0.00         0.76
K-Means     16   4   −11.24  45.00   88.93      33.93     38.15        11.07           33.21            0.71     0.00         0.71
K-Means     128  1   24.62   60.53   96.44      56.97     59.07        3.56            50.46            6.51     0.00         6.51
K-Means     128  2   22.91   58.65   100.00     58.65     58.65        0.00            51.90            6.75     0.00         6.75
K-Means     128  4   20.77   57.81   100.00     57.81     57.81        0.00            51.05            6.75     0.00         6.75
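Under our reading of the column definitions in Section 6.3, the Table 10 columns are mutually consistent: ActiveCG and IdleCG add up to AllCG, and ActiveExeCG is ActiveCG rescaled to active execution time only. A quick check on the first Facerec row (an illustration of how to read the table, not part of the paper's tooling):

```python
# First Facerec row of Table 10 (MVL=16, nL=1), values in percent.
row = {"ActiveExe": 16.68, "ActiveCG": 15.63, "ActiveExeCG": 93.74,
       "IdleCG": 83.32, "AllCG": 98.96}

all_cg = row["ActiveCG"] + row["IdleCG"]                    # ~= AllCG column
active_exe_cg = 100.0 * row["ActiveCG"] / row["ActiveExe"]  # ~= ActiveExeCG column
# both agree with the tabulated values within rounding
```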
[Fig. 6 bar chart: Power Savings (S) and AllCG Efficiency (y-axis -20% to 100%) for each (nL, MVL, app-uBench) combination of Facerec, Sphinx3, Disparity, and K-Means.]
Techniques," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2016, pp. 362–367.
[2] K. Asanović, "Vector Microprocessors," PhD thesis, UC Berkeley, 1998.
[3] Y. Lee, R. Avizienis, A. Bishara, R. Xia, D. Lockhart, C. Batten, and K. Asanović, "Exploring the Tradeoffs Between Programmability and Efficiency in Data-Parallel Accelerators," in Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA), 2011, pp. 129–140.
[4] B. Zimmer, Y. Lee, A. Puggelli, J. Kwak, R. Jevtić, B. Keller, S. Bailey, M. Blagojević, P.-F. Chiu, H.-P. Le et al., "A RISC-V Vector Processor With Simultaneous-Switching Switched-Capacitor DC–DC Converters in 28 nm FDSOI," IEEE Journal of Solid-State Circuits, vol. 51, no. 4, pp. 930–942, 2016.
[5] R. Espasa, M. Valero, and J. E. Smith, "Vector Architectures: Past, Present and Future," in Proceedings of the 12th International Conference on Supercomputing (ICS), 1998, pp. 425–432.
[6] H. Li, S. Bhunia, Y. Chen, T. Vijaykumar, and K. Roy, "Deterministic Clock Gating for Microprocessor Power Reduction," in Proceedings of the Ninth International Symposium on High-Performance Computer Architecture (HPCA), 2003,
[Figs. 7 and 8 bar charts: Normalized Active VFU Execution Time and ActiveExeCG Efficiency for each (nL, MVL, app-uBench) combination of Facerec, Sphinx3, Disparity, and K-Means.]
Ivan Ratković received the B.S. and M.S. degrees in Electrical Engineering and Computer Science from the University of Belgrade, Serbia, and the Ph.D. degree in Computer Architecture from the Polytechnic University of Catalonia, Spain. He is currently a CPU R&D engineer at Esperanto Technologies and Semidynamics. In the past, Dr. Ratković was involved with the BSC-Microsoft Research Center, the Barcelona Supercomputing Center, and the Berkeley Wireless Research Center. His research interests include low-power design, computer architecture, vector and SIMD processors, digital arithmetic, VLSI design flows, and embedded systems.

Oscar Palomar received the B.S. degree in Computer Science and the Ph.D. degree in Computer Architecture from the Polytechnic University of Catalonia (UPC), Spain. He joined the Computer Architecture for Parallel Paradigms research group at the Barcelona Supercomputing Center as a postdoc, where he led research on vector architectures and worked on the hardware architecture work package of the ParaDIME project. He was granted a Newton International Fellowship from the Royal Society, joining the Advanced Processor Technologies research group at the University of Manchester. His current research interests are computer vision algorithms, FPGA acceleration, low-power computer architectures, and vector processors.