
Clock gating: Smart use ensures smart returns

By Anubhav Srivastava and Neha Srivastava, Freescale Semiconductor - December 4, 2009

In current SOC design, clock gating is one of the most effective and most basic power-saving techniques for reducing dynamic functional power throughout the chip. Clock gating is done broadly at two different design-flow levels. At the RT level, you introduce clock gating into the architecture of the design. This clock gating switches off the clock to a particular IP depending on whether that IP is active or inactive. At the synthesis stage, synthesis tools insert clock-gating cells automatically at a fine-grained level, according to the clock-gating constraints the user provides to the tool. These synthesis constraints include the minimum and maximum number of registers in a register bank to be driven by a particular type of clock-gating cell.

This article targets the common erroneous practices that designers may follow while implementing clock gating in SOCs. It details the problems that arise from these errors and the methods to counter them early in the design flow.

Synthesis-time clock gating

Usually designers decide the clock-gating strategy during synthesis. At this time they must decide on the type of clock-gating cells to be used. A number of clock-gating cells exist in libraries. For example:

- Clock-gating cells with latch or flip-flop implementation
- Clock-gating cells with postscan or prescan control
- Clock-gating cells with buffered clock and nonbuffered clock
- Symmetrical and nonsymmetrical clock-gating cells
- Clock-gating cells with different threshold voltages, if you are using multi-Vt cell synthesis
- Clock-gating cells with synchronous or asynchronous reset pins

Many of the points here are design-dependent, but ignoring factors such as choice of symmetrical cells, cells with a nonbuffered clock, or the different threshold voltages of cells can be dangerous. A brief description of each issue follows.

Clock-gating cells sit directly in the clock path, so the best choice is a cell with the same delay for the rising and falling edges of the clock; that is, a symmetrical clock-gating cell.

Also, in most designs, you build the clock tree by balancing clocks in the same domain with the minimum possible skew. A nonbuffered clock-gating cell is preferred because it has less cell delay than a buffered cell, and you can address transition requirements while fixing design-rule violations. This approach generally consumes fewer buffers than using buffered clock-gating cells.

The third factor concerns the choice of threshold voltage of clock-gating cells. If your design requires a trade-off between leakage power and timing, we recommend analyzing the number of extra clock buffers required for clock balancing with high-Vt clock-gating cells. If the design has multiple levels of clock-gating cells from the clock source to the leaf flop level, and there are many timing paths between the gated and ungated levels, then a low-Vt cell is the better choice. It improves timing and reduces on-chip variation, all without much degradation in power, because low gating-cell latency means fewer clock buffers are needed.

The designer must also think about the minimum and maximum number of flops driven per clock-gating cell. The answer to this question is a bit tricky. A clock-gating cell is inserted to reduce power consumption. Suppose we insert a clock-gating cell to gate the clock of a module comprising one or two flops. The area overhead and the power overhead (both dynamic and leakage) would then be much more than the power saved. There is also an upper limit, set by the drive strength of the clock-gating cell, beyond which a large buffer is required to maintain the output slew. So the safest minimum number of flops you can gate with one clock gate is 3 or 4 in a register bank.
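The break-even reasoning behind the minimum bank size can be sketched as follows. This is only an illustration under simple assumptions: gating saves each flop's clock-pin power for the fraction of time the bank is idle, and costs a fixed overhead for the clock-gating cell. The function names and all power figures are hypothetical, not library data; only the default upper limit of 64 flops comes from the discussion here.

```python
def net_saving_nw(n_flops, flop_clock_power_nw, idle_percent, gate_overhead_nw):
    """Net power saved (nW) by gating a bank of n_flops.

    A negative result means the clock-gating cell costs more than it saves.
    All inputs are hypothetical illustration values.
    """
    saved_nw = n_flops * flop_clock_power_nw * idle_percent / 100
    return saved_nw - gate_overhead_nw


def min_flops_worth_gating(flop_clock_power_nw, idle_percent,
                           gate_overhead_nw, max_flops=64):
    """Smallest bank size with a positive net saving, or None if there is none."""
    for n_flops in range(1, max_flops + 1):
        saving = net_saving_nw(n_flops, flop_clock_power_nw,
                               idle_percent, gate_overhead_nw)
        if saving > 0:
            return n_flops
    return None


if __name__ == "__main__":
    # Hypothetical: 1000 nW of clock power per flop, bank idle 50% of the time,
    # 1400 nW of dynamic plus leakage overhead for the clock-gating cell.
    print(min_flops_worth_gating(1000, 50, 1400))
```

With these made-up numbers, two flops lose power and three flops break even in favor of gating. The sketch ignores area and the drive-strength limit that sets the upper bound.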
The usual maximum is 32 or 64 flops per clock-gating cell.

The designer must also select the clock-gating test signals. Normal scan testing requires bypassing the clock-gating cells in the design, because we do not want any gating of clocks during scan. We also require a shift-enable signal to check the logic generating the enable of the clock-gating cell. So all the clock-gating cells usually have a test control signal. In addition, we need to flatten the synthesis-inserted clock-gating cells at the time of logic equivalence checking between the RTL and the gate-level netlist, as these cells do not exist in the RTL. For this we constrain the clock-gating test signals. Because of this requirement, the test-logic designer has to ensure separate test signals for RTL and synthesis-inserted clock-gating cells.

Then there is the question of which modules should be gated. The designer decides explicitly on RTL clock gating, whereas at the time of synthesis, the synthesis tool decides the insertion of clock-gating cells on the basis of two factors: switching activity and observable don't-care conditions. But sometimes the designer can judge the exact activity at the architectural level itself. For example, if a particular module is always active and its standby time is almost negligible, then there is no need to insert synthesis-level clock-gating cells in it.

There is a catch here, though. Normally the assumption is that if we don't put automated synthesis-level clock-gating cells in a module, we save area, since the cell-instance count of that module decreases. But the savings depend entirely on the type of RTL used. Synthesis tools place clock-gating cells in the design by replacing the muxes in front of the register banks. So when we remove these clock-gating cells, the muxes take their original place. For each clock-gating cell removed, the number of muxes coming back into the picture equals the number of flops that the clock-gating cell was gating. Hence, we lose both area and power.

Further, there are some critical modules, for example test-compression-logic generation modules or clock generation modules, where clock-gating cell insertion can affect the functionality. So the decision about which modules should allow or suppress synthesis clock gating requires careful investigation.

Some short case studies

Now let's take a look at a couple of short case studies, in which we describe a common design practice that leads to trouble and suggest an alternative.

Clock-gating cells normally have an enable pin on which setup and hold checks are done with respect to the clock pin. Usually, until we run clock-tree synthesis, the timing violations at these clock-gating cells are not visible because the clock path is treated as ideal. So we in effect assume that the clock on the flop that launches the enable signal is coincident with the clock coming into the clock-gating cell.
But as soon as we build the clock tree, the violations at these clock-gating cells pop up. The main reason is that the clock of the enable-launching flop is not balanced with the clock going into the gating cell, but with the clocks entering the flops at the fanout of the clock-gating cell. This difference results in a skew whose minimum value equals the delay of a clock-gating cell. With the information added during clock-tree synthesis, new unoptimized paths become visible. To avoid this problem, we overconstrain these paths at synthesis time, and also follow up at the global physical synthesis level by putting extra uncertainties or latencies at the clock-gating cells. Hold violations are not critical here because the skew is negative.
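The effect on the enable path can be illustrated with a small slack calculation. This is a minimal sketch, assuming a single-cycle setup check in which the clock reaches the gating cell earlier than it reaches the enable-launching flop. The function name, the 2000-ps period, the data delay, and the setup time are hypothetical; only the 200-ps gating-cell delay is a figure quoted in this article.

```python
def enable_setup_slack_ps(period_ps, data_delay_ps, setup_ps, gate_clock_lead_ps=0):
    """Setup slack (ps) at the enable pin of a clock-gating cell.

    gate_clock_lead_ps is how much earlier the clock arrives at the gating
    cell than at the flop launching the enable (0 with ideal clocks).
    """
    return period_ps - gate_clock_lead_ps - data_delay_ps - setup_ps


if __name__ == "__main__":
    # Hypothetical path: 2000 ps period, 1700 ps clock-to-q plus logic delay,
    # 100 ps setup requirement at the enable pin.
    pre_cts = enable_setup_slack_ps(2000, 1700, 100)

    # After clock-tree synthesis the gating cell sees the clock earlier by at
    # least its own delay (200 ps) plus, here, 150 ps of assumed buffering.
    post_cts = enable_setup_slack_ps(2000, 1700, 100, gate_clock_lead_ps=350)

    print(pre_cts, post_cts)
```

A path that appears to meet timing with ideal clocks can therefore fail once the tree is built, which is why the paths are overconstrained before clock-tree synthesis.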

Here is another example. In our designs we normally implement a hierarchical clock-gating technique that puts multiple levels of clock-gating cells in the design. The multiple levels also exist because clocks from the various sources (PLL, external oscillator, dividers) get distributed throughout the chip via a clock-distribution network that generates gated and nongated clocks for all individual IPs.

As we mentioned earlier, there are at least two levels of clock-gating insertion in any design. The first is at the RTL stage at the module level, and the second is at the synthesis stage in the form of register-bank-level insertion. So in the end, we normally end up with three to four levels of clock gating in our designs. Every clock-gating cell has around 200 ps of delay (in 90-nm technology), so for flops behind four levels of gating, the clock-gating cells alone contribute approximately 800 ps of clock latency. Now if we have two separate clock domains communicating with each other, say the bus clock and the CPU clock, and both have four levels of clock-gating cells, then the clock-gating cells contribute 1600 ps of uncommon path. If we now take a 10% derating factor for on-chip variation on both launch and capture, that adds 320 ps of on-chip-variation penalty to the slack of register-to-register paths.
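The arithmetic above can be reproduced with a few lines. This sketch follows the article's figures and assumes the 320 ps results from applying the 10% derate to the full 1600 ps of uncommon path once for launch and once for capture. If the derate were instead applied to each 800-ps branch separately, the result would be 160 ps. The function names are hypothetical.

```python
def gating_latency_ps(levels, cell_delay_ps=200):
    """Clock latency contributed by a chain of clock-gating cells."""
    return levels * cell_delay_ps


def uncommon_path_ps(launch_levels, capture_levels, cell_delay_ps=200):
    """Uncommon clock path contributed by gating cells in two domains."""
    return (gating_latency_ps(launch_levels, cell_delay_ps)
            + gating_latency_ps(capture_levels, cell_delay_ps))


def ocv_penalty_ps(uncommon_ps, derate_percent=10):
    """OCV penalty as counted in the article: the derate is applied to the
    whole uncommon path for both launch and capture (an assumption)."""
    return 2 * uncommon_ps * derate_percent // 100


if __name__ == "__main__":
    four_levels = uncommon_path_ps(4, 4)
    print(gating_latency_ps(4), four_levels, ocv_penalty_ps(four_levels))
    # For comparison: a single gating level in each domain.
    print(ocv_penalty_ps(uncommon_path_ps(1, 1)))
```

Under the same assumptions, reducing each domain to one gating level cuts the uncommon path to 400 ps and the penalty to 80 ps.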

We propose to restructure this multilevel clock gating into a single-level linear clock-gating structure. We can do this easily by ANDing the enables of the clock-gating cells and providing clocks to all the clock-gating cells in a clock path from a single source.
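A small zero-delay model can check that ANDing the enables preserves the gated clock's behavior. This is a sketch under stated assumptions: each clock-gating cell is modeled as latch-based (the enable is captured while the cell's clock input is low), the clock is sampled once per half cycle, and all delays are ignored. The class and function names are hypothetical, and the model says nothing about the timing benefit itself.

```python
from itertools import product


class ClockGate:
    """Latch-based clock-gating cell: the enable passes while clk is low."""

    def __init__(self):
        self.latched_enable = 0

    def step(self, clk, enable):
        if clk == 0:
            self.latched_enable = enable  # latch is transparent
        return clk & self.latched_enable


def cascaded_gclk(samples):
    """Gated clock from one gating cell per enable, chained in series."""
    gates = [ClockGate() for _ in samples[0][1]]
    out = []
    for clk, enables in samples:
        for gate, enable in zip(gates, enables):
            clk = gate.step(clk, enable)  # output clocks the next level
        out.append(clk)
    return out


def single_level_gclk(samples):
    """Gated clock from one cell driven by the AND of all enables."""
    gate = ClockGate()
    return [gate.step(clk, int(all(enables))) for clk, enables in samples]


def structures_match(levels=3, half_cycles=5):
    """Exhaustively compare both structures over every enable pattern."""
    enable_choices = list(product((0, 1), repeat=levels))
    for pattern in product(enable_choices, repeat=half_cycles):
        samples = [(i % 2, enables) for i, enables in enumerate(pattern)]
        if cascaded_gclk(samples) != single_level_gclk(samples):
            return False
    return True


if __name__ == "__main__":
    print(structures_match())
```

In this idealized model the two structures produce identical gated clocks. In a real design the combined enable must still meet setup and hold at the single gating cell.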

Module-level clock gating can appear intuitively obvious. And when synthesis tools insert clock gating automatically, there is the temptation to just trust the tool. But innocence can lead to misfortune. We have pointed out some important issues that you must consider when you are thinking through your clock-gating strategies, and we have illustrated our points with a couple of real-life examples. We hope this saves you some time and trouble on your next design.
