
PROCESSOR POWER REDUCTION
Wing-Shan (Emily) Chan
wech592@cse.unsw.edu.au
18th May 2004
Supervisor: Annie Guo
huig@cse.unsw.edu.au

Table of Contents

Abstract
1. Introduction
2. Review of Prior Work
   2.1 Prior Related Work
       2.1.1 Bai et al. [1]
       2.1.2 Maro et al. [4]
       2.1.3 Bahar et al. [5]
3. Proposal
   3.1 Introduction
   3.2 Design
       3.2.1 Floating Point and Integer Clusters
       3.2.2 Ready and Non-Ready FIFOs
   3.3 Implementation
       3.3.1 Performance Monitors
       3.3.2 Power Estimates and Tools
   3.4 Alternate Proposal
   3.5 Schedule
4. Conclusion
5. References

Abstract
Power dissipation has become a vital issue in the design of modern computer
architectures. In this paper, I examine a number of previous studies on processor
power saving techniques and analyze the limitations of each. Based on these
studies, I propose a solution towards the end of the paper.

1. Introduction
For years, research focused solely on ways to maximize processor performance.
These techniques include pipelining, superscalar architectures, caches, branch
prediction, and different Instruction Set Architectures. With these creative ideas
implemented, processor performance has increased dramatically. However, only in
the last decade have researchers realized the importance of including power
consumption in the design phase of a processor. There are two main reasons for
this change: (1) portable digital devices with embedded high-end microprocessors
are becoming popular, yet limited battery life forms their major bottleneck; and
(2) current high-end microprocessors are believed to be reaching the limits of
conventional air cooling techniques, and some alternative cooling methods may
themselves become power-hungry in order to keep a system at the right
temperature. As a consequence, power has become an essential consideration for
both portable and non-portable system designs.
An important observation [1] is that different applications vary in their
degrees of instruction-level parallelism (ILP), branch behavior and memory access
patterns. As a consequence, the available data path resources are not utilized
optimally by all applications, or even at all times within a particular
application. Wall shows that the degree of ILP within a single application varies
by a factor of three [2]. It is therefore important for a system to adapt to
changes in the usage of available resources and to adjust its configuration
according to current needs. Motivated by this observation, studies have
previously been conducted on different components of a processor, such as caches,
branch predictors and functional units, and these approaches all share the
ultimate goal of reducing processor power consumption. Much recent research has
emphasized the effect that issue logic has on the total power dissipation in
out-of-order superscalar processors. A typical example is the Alpha 21264
processor, in which the issue logic is responsible for 46% of the total power
according to [3].
In my proposed scheme, I divide some components of the pipeline organization,
mainly the ones that lie geometrically in the middle of the organization, into
two clusters. One cluster is dedicated to Floating Point instructions while the
other is for Integer instructions. Each cluster consists of an issue queue, a
copy of the register file, corresponding functional units and hardware
performance monitors. The two clusters share the same data cache, instruction
fetch/rename unit and commit unit. I then divide the issue queue in each cluster
into two major parts: the ready and non-ready instruction queues. Non-ready
instructions are those that have at least one operand pending. Within each part,
the sub-queue is further partitioned into several sets (FIFOs). Only the head of
each FIFO is visible to the request and selection/arbiter logic, resulting in
in-order issue of instructions within each FIFO.
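The FIFO-partitioned issue queue described above can be sketched in a few lines. The dispatch-steering policy and the dictionary-based instruction records here are illustrative assumptions for the sketch, not details taken from the actual design:

```python
from collections import deque

class PartitionedIssueQueue:
    """Issue queue split into several FIFOs; only each FIFO's head is
    visible to the select/arbiter logic, so issue within a FIFO is
    strictly in-order.  Sizes and field names are illustrative."""

    def __init__(self, num_fifos, fifo_size):
        self.fifo_size = fifo_size
        self.fifos = [deque() for _ in range(num_fifos)]

    def dispatch(self, instr):
        # Steer the instruction into the first FIFO with a free slot.
        for f in self.fifos:
            if len(f) < self.fifo_size:
                f.append(instr)
                return True
        return False          # queue full: dispatch stalls

    def visible_heads(self):
        # The request/selection logic only ever examines these entries.
        return [f[0] for f in self.fifos if f]

    def issue_ready(self):
        # Issue every FIFO head whose operands are ready (at most one
        # instruction per FIFO per cycle, preserving in-order issue).
        issued = []
        for f in self.fifos:
            if f and f[0]["ready"]:
                issued.append(f.popleft())
        return issued
```

Because the arbiter sees only the heads, its comparison logic scales with the number of FIFOs rather than the number of queue entries, which is the source of the power saving claimed for this organization.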
My contribution is to show the potential for power saving through dynamically
reconfiguring the issue logic as well as the functional units. According to
feedback from the hardware performance monitors, I dynamically modify the size
and number of the existing FIFOs. Note that the reconfiguration of each cluster
can be carried out independently of the other, and this independence holds
between the ready and non-ready queues as well. In addition, I also modify the
number of available functional units on-the-fly based on the feedback provided
by the hardware performance monitors.
The layout of the rest of the paper is as follows. Section 2 discusses
previously taken approaches. Section 3 explains my proposed scheme in detail and
also presents my implementation schedule. Section 4 offers conclusions.

2. Review of Prior Work


Different researchers have attempted to tackle the problem from different
angles. I briefly present some of the interesting solutions here. Olson et al.
[6] suggested placing two completely separate processor cores side by side. This
arrangement allows the processor to operate in one of two modes: a
High-Performance mode and a Low-Power mode. Nevertheless, a significant amount
of time is required to switch between modes, as this operation involves
transferring the contents of one processor core to the other. Yuan et al. [7]
presented a middleware framework coordinating processor and power resource
management (PPRM) which dynamically adjusts the processor speed and power
consumption according to the system workload, the processor status and the power
availability. Yuan et al., however, addressed this solution specifically for
multimedia applications only.
Olson et al. attempted to solve the problem at a relatively low level while
Yuan et al. tackled the issue at a very high level. In this study, I plan to
solve the problem at the architecture level. My proposed solution is to modify
the architecture of the processor dynamically at run-time to achieve power
savings. In the next section, I discuss a number of prior related studies and
analyze how effective each of them is at power saving.

2.1 Prior Related Work


2.1.1 Bai et al. [1]
Bai et al. [1] proposed two schemes for implementing a dynamically
reconfigurable mixed in-order/out-of-order issue queue for power-aware
processors. In both schemes the issue queue is partitioned into several smaller
sets (FIFOs) and only the heads of these FIFOs are visible to the request and
selection/arbitration logic. This visibility property forces the FIFOs to issue
instructions in-order, which in turn simplifies and reduces the power
consumption of the request and selection/arbitration logic.
In Scheme#1, a FIFO is completely disabled when feedback from the hardware
performance monitors indicates that the FIFOs are underutilized. The hardware
performance monitors used here were implemented by Maro et al. in [4]. These
monitors are mainly composed of simple counters and comparators; hence, the
power consumed by these components can be neglected [5]. Below is a picture
showing an example of possible operating modes of the issue queue under
Scheme#1.

Figure 1 An example showing possible modes of the processor for Scheme#1.
Graph redrawn based on [1].
One major drawback of this approach is that it limits the exposure of ILP by
shrinking the total size of the issue queue. This is a serious disadvantage for
Floating Point benchmarks, which exhibit a large degree of ILP; therefore,
Floating Point benchmarks are not tested with Scheme#1 in the paper. Across the
Integer benchmarks, Scheme#1 produces a best average result of 27.6% power
saving with only 3.7% performance degradation compared to the base case (i.e. a
configuration that remains at the starting state throughout the run time of an
application).
In Scheme#2, the size of the issue queue remains the same at all times; this
ensures maximized exposure of ILP. Both the size and the number of FIFOs are
modified simultaneously under this scheme, which makes the issue queue appear
smaller to the request and selection/arbitration logic. If the hardware
performance monitors indicate that the processor is struggling to achieve high
performance, the size of the FIFOs is decreased while the number of existing
FIFOs is increased. The following graph presents an example of the issue queue
configurations in different operating modes:

Figure 2 An example showing possible modes of the processor for Scheme#2.
Graph redrawn based on [1].
Experimental results have shown that Scheme#2 is a rather effective
power-saving mechanism: in the best average case it achieves a 27.3% power
saving with only 2.7% performance degradation. This time, Floating Point
benchmarks were also tested during the experiment.
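A minimal sketch of a Scheme#2-style resizing rule follows. It assumes a simple halve/double policy that keeps total capacity constant; the paper's exact resizing steps and mode boundaries are not reproduced here:

```python
def reconfigure_scheme2(fifo_size, num_fifos, low_perf):
    """Scheme#2-style resize: total queue capacity stays constant, but
    the FIFO partitioning changes.  When performance is low, FIFOs are
    made smaller and more numerous (more heads visible to the select
    logic); otherwise the reverse.  Halve/double steps are an assumed
    policy for this sketch."""
    total = fifo_size * num_fifos
    if low_perf and fifo_size > 1:
        fifo_size //= 2        # smaller FIFOs ...
        num_fifos *= 2         # ... but more of them
    elif not low_perf and num_fifos > 1:
        fifo_size *= 2
        num_fifos //= 2
    assert fifo_size * num_fifos == total   # ILP exposure preserved
    return fifo_size, num_fifos
```

The invariant in the assertion is the essential difference from Scheme#1: the queue never shrinks, so the ILP window is never reduced.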


Despite the fact that this scheme works relatively effectively for both Integer
and Floating Point benchmarks, some potential enhancements could be carried out
in order to gain better control over different types of applications. For
instance, the processor could handle the two types of instruction separately,
resulting in greater flexibility to adapt to changes in the resource needs of
each instruction type.
In addition, further power saving can be achieved by restricting the broadcast
of a just-computed result to the non-ready instructions only. Applications with
a large degree of ILP will benefit the most from this restriction.
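The broadcast restriction can be illustrated as follows. The tag-set representation of pending operands is an assumption made for this sketch:

```python
def broadcast_result(tag, nonready_queue):
    """Wake-up broadcast limited to the non-ready queue: instructions
    already marked ready never need to be re-checked, so their compare
    entries are not driven.  Each instruction is modeled as a dict with
    a set of pending operand tags (an illustrative layout)."""
    woken = []
    for instr in nonready_queue:
        instr["pending"].discard(tag)   # the result satisfies this operand
        if not instr["pending"]:
            woken.append(instr)         # now eligible for the ready queue
    for instr in woken:
        nonready_queue.remove(instr)
    return woken
```

In a high-ILP program most of the queue is ready at any time, so skipping the ready entries removes most of the per-broadcast comparison activity.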
2.1.2 Maro et al. [4]
Maro et al. [4] implemented a hardware performance monitoring mechanism whose
monitors provide feedback on whether or not to disable parts of the Integer
and/or Floating Point pipelines during runtime in order to save power. The basic
multi-pipelined processor is shown below:

Figure 3 Pipeline Organization of the processor. Graph taken from [4].

For each cluster, a maximum of four instructions can be issued each cycle.
In this work, a number of low-power operating modes are defined:

Figure 4 Possible Operating Modes for the processor. Table taken from [4].

Entering and exiting these modes depends on feedback from the hardware
performance monitors, while exiting also depends on trigger events such as
data/instruction cache misses and floating point activity.
This approach provides greater flexibility in handling instructions according
to their types. Nevertheless, shrinking the overall size of the issue queue when
entering some of the operating modes limits the exposure of ILP; Scheme#1
described in Section 2.1.1 has the same negative effect, as both attempt to
alter the total size of the issue queue during run-time. Moreover, the select
and wake-up logic has no way to distinguish between ready and non-ready
instructions; the system therefore becomes very power inefficient, in that the
selection and wake-up signals of all entries in the issue queue must be updated
every cycle even when an instruction is not ready to be issued.
2.1.3 Bahar et al. [5]
Bahar et al. [5] proposed a technique called Pipeline Balancing (PLB) which
allows disabling a cluster, or part of a cluster, of functional units by varying
the issue width. The pipeline organization of the 8-wide issue processor is
shown below:

Figure 5 Pipeline Organization of the processor. Graph taken from [5].

Bahar et al. implemented PLB with two possible issue widths when operating in
low-power mode: 4-wide issue and 6-wide issue. In the 6-wide configuration, 2
Floating Point functional units are disabled, resulting in an unbalanced machine;
while in the 4-wide configuration, a cluster of functional units is disabled
(shown as the shaded area in the graph above). Two basic triggers are
implemented in both modes: issue IPC (IIPC) and Floating Point IPC (FP IPC);
both are measured by hardware performance monitors [5]. More power is saved in
the 4-wide configuration, but the performance penalty of spuriously entering the
4-wide mode can be great. An extra trigger, mode history, is included to prevent
spurious entry into the 4-wide mode: no transition into the 4-wide mode is
allowed unless the conditions for the two basic triggers are satisfied for two
consecutive sampling windows.
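The mode-history trigger can be sketched as a small predicate. The function name and window bookkeeping are illustrative, not taken from [5]:

```python
def plb_downsize_allowed(history, iipc_low, fp_ipc_low, windows_required=2):
    """Mode-history trigger in the style of [5]: the 4-wide mode may be
    entered only after the basic triggers (low issue IPC and low FP IPC)
    have held for `windows_required` consecutive sampling windows.
    `history` records one boolean per window, newest last."""
    history.append(iipc_low and fp_ipc_low)
    recent = history[-windows_required:]
    return len(recent) == windows_required and all(recent)
```

A single window in which either trigger fails breaks the run, so a transient dip in FP activity cannot by itself push the machine into the 4-wide mode.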
The state machine for enabling/disabling PLB is shown below; EC4w, DC4w, EC6w
and DC6w are the enabling and disabling conditions for the 4-wide and 6-wide
modes respectively:


Figure 6 State machine of the processor. Graph taken from [5].


And the conditions are listed as follows:

Figure 7 Entering and Exiting Conditions for each state of the processor.
Table taken from [5].
It is important to ensure that these threshold values allow the system to
respond to changes in program needs effectively and efficiently. For example, a
program with a burst of floating point instructions during some portion of its
execution will suffer if the processor fails to restore normal mode (rebalancing
the structure) promptly; performance then suffers mainly due to the unbalanced
structure of the data path.


One major drawback of this approach is again the inability to distinguish
between ready and non-ready instructions (as in the previous approach),
resulting in unnecessary waste of power in the selection and wake-up logic.
Moreover, this scheme only allows disabling of Floating Point functional
units; this limits the flexibility of the system in adapting to different types
of applications and therefore does not maximize the potential power saving.

3. Proposal
3.1 Introduction
Bai et al. [1] state that a good design strategy should be flexible enough to
dynamically reconfigure available resources according to a program's needs.
This statement became the Golden Rule during the design phase of my strategy.
In my proposed scheme, I try to provide as much flexibility as possible for the
system to adapt to changes in a program's needs, so that it reacts effectively
and efficiently to fulfill the ultimate goal of saving processor power.

3.2 Design
The fundamental configuration of my proposed scheme is as follows:

Parameter                                  Configuration
Issue Queue size                           128 entries
Machine Width                              8-wide fetch, issue, commit
                                           (4-wide for each cluster)
Functional Units                           8 Integer FU, 4 Floating Point FU
                                           and 4 Memory ports
Size of issue queue in each cluster        64 entries
Size of ready queue in each cluster        8 entries
Size of non-ready queue in each cluster    56 entries

Figure 8 Configurations of my proposed processor


Note that only the relevant parameters are shown at this stage; a much more
detailed description will be provided in the next report. Also, the
configuration of each parameter is only an estimate; they are subject to change
due to implementation issues.
The basic idea of my proposal lies in further dividing the FIFO architecture
of [1] into smaller components according to the type and status of the
instructions.
3.2.1 Floating Point and Integer Clusters
I divide the middle part of the pipeline organization in [1] into two clusters.
One cluster is dedicated to Integer operations while the other is for Floating
Point operations. Each cluster consists of its own issue queue (FIFOs), reorder
buffer (ROB), a copy of the register file, hardware performance monitors and
corresponding functional units. Both clusters share the same instruction
fetch/rename unit, data cache and commit unit. Note that this is only a
preliminary design; the architecture described above may change according to
issues arising during implementation. Below is a graph showing the proposed new
pipeline structure:

Figure 9 Pipeline Organization of my proposed scheme


The major advantage of this new design over the original one in [1] is that it
provides the system with greater flexibility when handling different types of
benchmarks. In Integer benchmarks, for example, there are hardly any Floating
Point instructions; the Floating Point functional units would therefore consume
unnecessary power while contributing nothing to overall performance. With this
modified structure, the system will definitely have more control over the power
consumption of various types of applications.
3.2.2 Ready and Non-Ready FIFOs
I further divide the issue queue in each cluster into two components: (1) a
ready instruction queue and (2) a non-ready instruction queue. The purpose of
this division is to restrict the broadcasting of a just-computed result to only
those instructions that are non-ready. As a result, extra power saving is
achieved, especially for applications exhibiting high ILP. The graph below shows
the structure of the newly proposed issue queue within a cluster:

Figure 10 Internal architecture of an issue queue in a cluster


At this stage, I plan to allow reconfigurations of the two components,
according to feedback from the hardware performance monitors [1, 4] and other
trigger events [5], to occur independently of each other. The major reason for
this is to simplify the implementation as well as to provide more flexibility in
dynamic reconfiguration of the issue queue; however, further investigation may
be carried out into how different combinations of the configurations of each
component affect system performance.


An important property is that the total number of entries across the two
component queues remains the same at all times for all applications. This
diminishes the negative effect of limiting the exposure of ILP by shrinking the
issue queue size.
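This invariant, fixed capacity with variable partitioning, can be expressed with a small hypothetical helper that recomputes a component queue's FIFO layout; the function and its even-split policy are assumptions made for illustration:

```python
def fifo_layout(total_entries, num_fifos):
    """Split a component queue's fixed capacity into `num_fifos` equal
    FIFOs.  Reconfiguration only changes the partitioning; the capacity,
    and hence the ILP window, never shrinks.  Hypothetical helper."""
    assert total_entries % num_fifos == 0, "capacity must divide evenly"
    return [total_entries // num_fifos] * num_fifos
```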
In addition to the above modifications, I also propose to monitor the usage of
the functional units. More power can be saved by disabling functional units that
are not being utilized optimally. This again is implemented independently in
each cluster, i.e. disabling a Floating Point functional unit will not affect
the operation of the Integer cluster.

3.3 Implementation
3.3.1 Performance Monitors
Similar to [1, 4, 5], reconfigurations of the system are carried out according
to feedback from the hardware performance monitors. As stated before, these
monitors are mainly composed of simple counters and comparators; therefore, the
power they consume can be neglected [5]. The cycle window [1] that I set at this
stage is either 512 or 1024 cycles. I will further investigate the feasibility
and effects of having different cycle window sizes for different monitors; this
would make the system more flexible in responding to feedback from different
monitors. I implement the following hardware performance monitors in the system:

Monitoring IPC for each cluster separately:
If either the FP issue IPC or the Integer issue IPC is low during the current
cycle window, this may indicate that the ILP in the program is low and the
cluster may therefore be switched to a lower power consuming mode.
Monitoring Ready Instructions:
If the occupancy of the ready queue of a cluster is high, this may imply that
the ILP in the application is high and the ready queue may therefore be switched
to a higher-performance mode.
Detecting Variations in IPC:
If the issue and commit rates vary significantly within a cluster, this may
indicate a high branch misprediction rate. The number of FIFOs in the cluster
can be reduced in order to restrict the issue rate and indirectly limit the
number of mispredicted-branch instructions issued.
Performance Degradation:
If the IPC drop between two consecutive sampling windows exceeds a threshold
value within a cluster, the cluster is restored to a higher-performance mode.
Issue Queue Usage:
Low issue queue occupancy rates for both components (Ready and Non-Ready
queues) may indicate a potential to reduce the number of FIFOs as the issue queue is
being underutilized.


Functional Unit Usage [4]:
A simple shift register provides a relatively inexpensive way to monitor
whether the program is underutilizing the available resources over time. When
the percentage of busy functional units falls below a certain threshold, a 1 is
shifted into the register. At any given cycle, if the number of 1s present in
the register is greater than some pre-defined threshold, the program is said to
be underutilizing the resources, and some functional units can be disabled. The
assumption made is that if recent history indicates an underutilization of
functional units, then few resources will be needed in the following cycles as
well.
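The shift-register monitor just described can be sketched as follows; the depth and threshold values are placeholders to be tuned experimentally, not values from [4]:

```python
from collections import deque

class FUUsageMonitor:
    """Functional-unit usage monitor in the style of [4]: each cycle a 1
    is shifted in when the fraction of busy FUs falls below
    `busy_threshold`; when the register holds more than
    `history_threshold` ones, the cluster is flagged as underutilized.
    All parameter values are illustrative placeholders."""

    def __init__(self, depth=16, busy_threshold=0.5, history_threshold=12):
        self.bits = deque([0] * depth, maxlen=depth)   # fixed-depth shift register
        self.busy_threshold = busy_threshold
        self.history_threshold = history_threshold

    def tick(self, busy_fus, total_fus):
        # One shift per cycle: 1 means "this cycle looked underutilized".
        self.bits.append(1 if busy_fus / total_fus < self.busy_threshold else 0)

    def underutilized(self):
        # Population count over the register history.
        return sum(self.bits) > self.history_threshold
```

In hardware this reduces to a shift register plus a population-count comparator, which is why its own power cost can be neglected.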
Issue Attempts [4]:
The total number of issue attempts made by each ready instruction before a
functional unit becomes available for its execution is counted. Whenever an
instruction is prevented from issuing due to the lack of resources, its counter
is incremented. When the total count for all ready (but not yet issued)
instructions reaches a certain threshold, the cluster should restore some of its
functional units.
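A sketch of this trigger, with a hypothetical instruction record layout and a placeholder threshold value:

```python
def should_restore_fus(ready_instrs, restore_threshold=32):
    """Issue-attempts trigger in the style of [4]: every ready
    instruction denied issue for lack of a functional unit has its
    counter incremented; once the counters of all still-unissued ready
    instructions sum past the threshold, the cluster should re-enable
    some functional units.  Field names and threshold are assumptions."""
    for instr in ready_instrs:
        if instr["denied_this_cycle"]:
            instr["attempts"] += 1
    total = sum(i["attempts"] for i in ready_instrs)
    return total >= restore_threshold
```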
The threshold values for the monitoring techniques listed above are not
specified in this report, since experiments will have to be carried out to
determine appropriate values. The list may also be modified due to any arising
implementation issues.

3.3.2 Power Estimates and Tools
I estimate the total power savings of my processor based on the power
estimates for the Alpha 21264 processor in [3]. More details of how the power
values are estimated will be provided in the next report.
The simulator that I use in this study is derived from the SIMPLESCALAR tool
suite. Nevertheless, modifications will have to be carried out in order to
better model my processor. One modification is to split the Register Update Unit
(RUU) of SIMPLESCALAR into a reorder buffer (ROB) and an issue queue (IQ) for
each cluster [1].
If possible, I may also run the experiments using the Wattch tool, an
extension to SIMPLESCALAR that includes power estimation.
The benchmarks used in this study are the same as those in [1]. This allows me
to compare the performance of my processor with the one implemented in [1].

3.4 Alternate Proposal
I also propose a very similar alternate approach in this study. It adopts
basically the same architecture as the one described above, with the exception
that Integer and Floating Point instructions are no longer handled separately;
therefore, within a cluster there exist both Integer and Floating Point
functional units. This scheme will help me analyze how the separation of the
Integer and Floating Point clusters impacts overall performance. Due to the
similarity, the details of this alternate approach are not covered here; more
information will be provided in the final report.
3.5 Schedule

DESIGN PHASE (3 WEEKS)
Consolidate design, including determining parameter values for the processor architecture
Eliminate any uncertainty in design through research
Modify design when necessary
Document any changes with reasons

IMPLEMENTATION PHASE STAGE 1 (9 WEEKS)
Modify the out-of-order simulator in SIMPLESCALAR according to the design documents
Modify design when necessary
Document any issues arising and solutions proposed
Document any changes with reasons
Verify correct functioning of the processor

IMPLEMENTATION PHASE STAGE 2 (3 WEEKS)
Implement the hardware performance monitors
Add or modify existing monitors in design documents when necessary
Test and determine the initial threshold values
Document any changes with reasons
Document any issues arising and solutions proposed

OPTIONAL PHASE
Implement the alternate approach proposed in the design document
Implement the proposed processor using the Wattch tool
Investigate the effect of changing the issue width while reconfiguring the functional units

TESTING PHASE (6 WEEKS)
Run the simulator using different combinations of performance monitors (with different threshold values) for each benchmark
Modify the processor architecture when necessary
Investigate if the design fails and try to solve the problem(s) encountered if possible
Document any changes with reasons
Document any issues arising and solutions proposed
Document reasons for failure and/or solutions proposed

REPORT PREPARATION PHASE (1 WEEK)
Finalize experiments
Comment on and analyze the results
Document the analysis and comments made

FINAL DOCUMENTATION PHASE (2 WEEKS)
Summarize all the documents from the entire project
Write up the final report
Prepare for the final presentation

4. Conclusion
Due to rapidly rising awareness of the importance of including power issues in
the design phase of processors, much research has been carried out. This prior
work has presented many ways to achieve power savings while minimizing the
impact on overall performance. Based on these previous studies, I propose a
strategy focusing mainly on the issue logic design as well as the usage of
functional units. The aim of my study is to show that power consumption can be
reduced by dynamically reconfiguring the internal structure of my processor
according to different sources of feedback, and that my proposed processor will
maximize its flexibility in responding to different programs' needs, thereby
fulfilling the Golden Rule stated in [1].


5. References
[1] Yu Bai and R. Iris Bahar. A Dynamically Reconfigurable Mixed
    In-Order/Out-of-Order Issue Queue for Power-Aware Microprocessors.
    Division of Engineering, Brown University.

[2] D. W. Wall. Limits of instruction-level parallelism. In Proceedings of the
    International Conference on Architectural Support for Programming Languages
    and Operating Systems (ASPLOS), November 1991.

[3] K. Wilcox and S. Manne. Alpha processors: A history of power issues and a
    look to the future. In Cool-Chips Tutorial, November 1999. Held in
    conjunction with the 32nd International Symposium on Microarchitecture.

[4] R. Maro, Y. Bai, and R. I. Bahar. Dynamically reconfiguring processor
    resources to reduce power consumption in high-performance processors. In
    Workshop on Power-Aware Computer Systems, November 2000. Held in conjunction
    with the International Conference on Architectural Support for Programming
    Languages and Operating Systems (ASPLOS).

[5] R. I. Bahar and S. Manne. Power and energy reduction via pipeline
    balancing. In Proceedings of the 28th International Symposium on Computer
    Architecture, July 2001.

[6] Elwin Olson and Andrew Menard. Issue Logic and Power/Performance
    Tradeoffs. Laboratory for Computer Science, Massachusetts Institute of
    Technology.

[7] Wanghong Yuan and Klara Nahrstedt. A Middleware Framework Coordinating
    Processor/Power Resource Management for Multimedia Applications. Department
    of Computer Science, University of Illinois at Urbana-Champaign.
