
Circuits and Technology for Digital's StrongARM and ALPHA Microprocessors
Daniel W. Dobberpuhl
Digital Semiconductor Palo Alto Design Center

Abstract
Since the introduction of the first ALPHA microprocessor in 1992, Digital has
maintained leadership in absolute CPU performance. During the past year,
Digital's StrongARM processor has also achieved a leadership position as
the fastest CPU capable of operating from a single AA battery cell. Some of
the key techniques used to achieve this performance are described in this
invited paper.

Introduction
Digital has introduced two all-new microprocessors in the last year, the StrongARM 110 [1]
and the Alpha 21264 [2]. Although intended for radically different markets and applications,
each represents the state of the art in CMOS design and technology for CPU chips in its
respective market. StrongARM is intended for low-cost embedded applications requiring high
performance and low power, while the 21264 is a high-end processor delivering the highest
general-purpose CPU performance in the world. Power dissipation for the 21264 can be as high
as 72W at an operating frequency of 600MHz, whereas StrongARM dissipates about 350mW
at an operating frequency of 160MHz. Despite these very different operating characteristics, the
circuit techniques, design methodology, and physical technology underlying both designs are very
similar. This paper focuses on some of the elements that are critical to each chip's technical
success in meeting its design objectives.

Description of the 21264


As shown in Table 1, the 21264 is a 0.35µm CMOS design, designed to operate at 600MHz
under worst-case conditions. A nominal supply of 2V is chosen to limit power to 72W, but the
design and process can operate reliably up to 2.3V. The physical design is 16.7 by 18.8
millimeters, occupying 314 square millimeters. The transistor count, excluding 250nF of gate
decoupling capacitance, is 15.2 million, with approximately 6 million dedicated to non-cache
logic. Unlike previous Alpha designs, an integrated PLL performs frequency synthesis and
synchronizes I/O. Also, the design is fully static, enabling dynamic power management, IDDQ
testing, and single-step debug. The chip is packaged in a 587-pin IPGA with 389 signal pins.
I/O signaling is 2.0V HSTL-like. The 0.35µm CMOS technology utilized in the design, called
CMOS-6, is essentially the same as that used in StrongARM. The most notable exceptions are the
addition of two reference planes, RP1 and RP2.
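The headline figures above can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the original design flow: the variable names are illustrative, and the average supply current, energy per cycle, and power density are derived here from the quoted numbers and are not stated in the paper.

```python
# Cross-check of the 21264 figures quoted in the text.
DIE_WIDTH_MM = 16.7     # from the text
DIE_HEIGHT_MM = 18.8    # from the text
POWER_W = 72.0          # from the text
VDD_V = 2.0             # nominal supply, from the text
FREQ_HZ = 600e6         # from the text

# Die area from the stated dimensions (the text rounds this to 314 mm^2).
die_area_mm2 = DIE_WIDTH_MM * DIE_HEIGHT_MM

# Derived quantities (assumptions of this sketch, not quoted in the paper).
avg_supply_current_a = POWER_W / VDD_V                 # I = P / V
energy_per_cycle_nj = POWER_W / FREQ_HZ * 1e9          # E = P / f
power_density_w_per_cm2 = POWER_W / (die_area_mm2 / 100.0)

print(f"die area          : {die_area_mm2:.1f} mm^2")
print(f"average current   : {avg_supply_current_a:.1f} A")
print(f"energy per cycle  : {energy_per_cycle_nj:.1f} nJ")
print(f"power density     : {power_density_w_per_cm2:.1f} W/cm^2")
```

An average current of tens of amperes at a 2V supply gives a sense of why supply inductance and decoupling receive so much attention in this design.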

0-8186-7913-1/97 $10.00 © 1997 IEEE

Implementation
- Nominal Vdd: 2.0V
- 600MHz (est.) @ 1.9V & 100C
- 72 Watts (est.) @ 2.0V & 600MHz
- 16.7mm x 18.8mm (314mm²)
- 15.2 million transistors (~6M in non-cache logic)
- 250nF of gate decoupling capacitance
- Integrated PLL (designed by CSI...)
- Pseudo static design
  - Dynamic power management
  - IDDQ testing & single-step debug
- 587 pin IPGA [389 signal pins]
- I/O signaling is 2.0V HSTL-like

CMOS Technology
- Feature size: 0.35µm
- Leffective: 0.25µm
- Tox: 6.0nm
- VTN / VTP: 0.35V / -0.35V
- Salicide: Cobalt-Disilicide
- Substrate: P-EPI with N-Well
- M1 (AlCu): 5.7kÅ, 1.225µm pitch
- M2 (AlCu): 5.7kÅ, 1.225µm pitch
- RP1 (AlCu): 14.8kÅ, Vss plane
- M3 (AlCu): 14.8kÅ, 2.80µm pitch
- M4 (AlCu): 14.8kÅ, 2.80µm pitch
- RP2 (AlCu): 14.8kÅ, Vdd plane
Table 1. 21264 Implementation and Technology Summary

21264 Clocking
Unlike earlier Alpha designs, the 21264 does not implement a single distributed clock. Instead,
GCLK is the root of a hierarchy of buffered and conditioned clocks. Most of the clock load is
isolated from GCLK by double buffering the load onto large buffered or conditioned section
clocks. The buffered section clocks, identified by region in the die photo, are also the basis for
additional, smaller buffered or conditioned clocks. The large unmarked areas utilize buffered or
conditioned clocks derived directly from GCLK.
There are several advantages to this method: First, each conditioned clock presents an
opportunity to save power. Second, circuit designers can take advantage of multiple clocks. For
example, a phase path can be extended by initiating it with GCLK and terminating it with a
delayed section clock. Finally, with the clock drivers closer to their loads, skew and metal usage
are reduced. We estimate at least a 3X reduction in total clock Metal 3 and 4 usage saving

[Figure 1: clock hierarchy diagram. The Global Clock (GCLK) drives buffered and conditioned section clocks, which in turn drive buffered and conditioned local clocks; further buffered and conditioned local clocks are derived directly from GCLK. Advantages listed: power saved in idle functional units; more clocking options; reduced section clock skew; Metal 3 & 4 usage reduced by > 3X.]
Figure 1. 21264 Clock Hierarchy
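The power benefit of conditioned clocks in such a hierarchy can be illustrated with a toy model. This is a sketch under stated assumptions, not the 21264 clock network: the node names, the tree shape, and the relative load values are all hypothetical. It only shows that gating a conditioned section clock removes its entire subtree from the switched clock load.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClockNode:
    name: str
    load: float                  # relative clock load at this node (made-up units)
    enabled: bool = True         # False models a conditioned clock that is gated off
    children: List["ClockNode"] = field(default_factory=list)


def switched_load(node: ClockNode) -> float:
    """Load that toggles this cycle; a gated node silences its whole subtree."""
    if not node.enabled:
        return 0.0
    return node.load + sum(switched_load(child) for child in node.children)


def build_tree(unit_a_active: bool) -> ClockNode:
    # Hypothetical two-section hierarchy under GCLK.
    local_a = ClockNode("unit_a_local_cond", load=30.0)
    section_a = ClockNode("unit_a_section_cond", load=10.0,
                          enabled=unit_a_active, children=[local_a])
    local_b = ClockNode("unit_b_local_buf", load=40.0)
    section_b = ClockNode("unit_b_section_buf", load=10.0, children=[local_b])
    return ClockNode("GCLK", load=10.0, children=[section_a, section_b])


busy_load = switched_load(build_tree(unit_a_active=True))
idle_load = switched_load(build_tree(unit_a_active=False))
print(f"unit A active: {busy_load}, unit A idle: {idle_load}")
```

With these made-up loads, gating the section clock of the idle unit removes 40 of 100 load units; the real saving depends on the actual load distribution and on how often each unit is idle.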


Given the clocking hierarchy and timing methodology, the ability to control and predict, in
relative terms, minimum and maximum path delays is essential in accurately predicting cycle
time and ensuring functionality. Table 2 lists a sampling of issues which can introduce
unexpected timing errors or variations into a design. Some are easily handled within the design;
others can be partially accounted for, but require some built-in margin; the remainder are
managed solely through the addition of margin.

[Table 2: sources of timing variation, grouped under the column headings Design, CAD Verification, and Process, with each entry marked as managed by design, by design and margin, or by margin. Legible entries include: logical worst case; data dependent; crosstalk (coupling capacitance); resistance modeling; inductance; coarse linear gate and diode model; accurate Spice and diode; clock duty cycle; parasitic capacitance; temperature (spatial); Vdd and Vss (spatial and temporal); poly CDs; metal CDs; dielectric thickness; inter-layer; dielectric constant; metal thickness; metal cross section; device; PMOS/NMOS ratio; diffusion.]
Table 2. Management of Timing Variation

21264 Power Reference Planes


On-chip inductance is becoming significant in deep sub-micron CMOS. Failure to manage
inductive noise can result in increased transmission delay, data or program dependent delay,
overshoot, undershoot, and crosstalk. Furthermore, the EDA infrastructure for analyzing such
phenomena in VLSI designs is mostly non-existent. One possible solution adopts reference
planes to manage inductive effects so that they may be neglected to first order. In the 21264, a
Vss plane is embedded between Metals 2 and 3, inductively and capacitively isolating Metals 1
and 2 from Metals 3 and 4. A Vdd plane is inserted above Metal 4. Together, they minimize
on-chip inductance and reduce the range of inductive coupling by ensuring a local current return
path for every signal. As power planes, the additional metal layers also minimize supply
inductance and DC supply collapse to the chip core. The total reduction in noise enables more
aggressive circuit and device designs improving performance without sacrificing functionality or
reliability.
An example illustrates some benefits of the dual reference plane scheme. Referring to Figure 2,
on the left, consider a 4-metal stack with one plane on top, and on the right, the same stack
with an additional plane inserted between Metals 2 and 3. In both cases, Metals 1 and 3 are
routed orthogonally to Metals 2 and 4, with only parallel traces inductively coupled. Resistance
and Inductance, including LR coupling, are extracted in the frequency domain, with full skin
effects modeled. From these results, a time-domain model is constructed, linked into SPICE and
used in conjunction with a standard capacitive model. Note that a simple RC model would yield
identical simulation results for both cases.

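[Figure 2: cross sections of two metal stacks above the substrate. Left: four metal layers with a single reference plane on top. Right: the same stack with an additional reference plane inserted between Metals 2 and 3.]
Figure 2. Reference Plane Example
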
Simulation results for 3.5mm Metal 1 and Metal 3 buses are shown in Figure 3, with simulation
details chosen to maximize noise. Because the range of inductive coupling exceeds the range of
capacitive coupling on silicon, wide buses are chosen to maximize total aggressors. Nearest
neighbor Metal 3 aggressors, for which capacitive coupling dominates, are switched in direction
opposite to the Metal 1 bus and the remainder of the Metal 3 bus. On the left, significant
crosstalk is observed between Metals 1 and 3. On the right, this crosstalk is largely screened by
the additional, embedded plane. Peak inductive noise is reduced in the two plane scheme to the
point where simulated waveforms reasonably approximate standard RC behavior. Without
inductance management, peak noise can pose an unreasonable functionality or reliability risk in
some applications.

Figure 3. Reference Plane Simulation Results

21264 Wirebond Attached Chip Capacitor


The 21264 design is a complex, conditionally clocked, superscalar microprocessor. As a
consequence, it may exhibit data and program dependencies resulting in large variations in
supply current from cycle to cycle. At 600MHz, worst-case delta Idd, over a few cycles, is
estimated at 25A. Such variations may excite the tank circuit formed by on-chip capacitance and
package inductance. The resulting resonant noise may be the dominant component of power
supply noise on chip. To address this supply noise, a 1µF, 2cm² Wirebond Attached Chip
Capacitor (or WACC), implemented as a P-type accumulation-mode MOS capacitor, is placed on
top of, and directly bonded to, the microprocessor die as shown. This silicon capacitor, with
integrated series resistance to dampen power supply resonance, helps control supply impedance
over a wide frequency band.
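A first-order calculation shows why the added capacitance matters. The sketch below is an illustration under stated assumptions, not the analysis used for the 21264: it treats the supply as an ideal LC tank, takes the WACC as 1µF (consistent with the 5X increase in near-chip decoupling over the 250nF on chip), and uses a placeholder package inductance, `L_PACKAGE_H`, which is hypothetical and not given in the paper.

```python
import math

F_CLK_HZ = 600e6        # operating frequency, from the text
DELTA_I_A = 25.0        # worst-case delta Idd, from the text
C_ON_CHIP_F = 250e-9    # gate decoupling capacitance, from Table 1
C_WACC_F = 1e-6         # WACC capacitance
L_PACKAGE_H = 50e-12    # ASSUMPTION: placeholder effective package inductance


def droop_per_cycle(delta_i: float, c_total: float, f_clk: float) -> float:
    """Bound on supply droop if a current step is fed from C alone for one cycle.

    dV = dI * T / C with T = 1 / f_clk.
    """
    return delta_i / (f_clk * c_total)


def resonant_frequency(l: float, c: float) -> float:
    """Natural frequency of an ideal LC tank: 1 / (2 * pi * sqrt(L * C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l * c))


c_with_wacc = C_ON_CHIP_F + C_WACC_F

droop_on_chip = droop_per_cycle(DELTA_I_A, C_ON_CHIP_F, F_CLK_HZ)
droop_with_wacc = droop_per_cycle(DELTA_I_A, c_with_wacc, F_CLK_HZ)

f_res_on_chip = resonant_frequency(L_PACKAGE_H, C_ON_CHIP_F)
f_res_with_wacc = resonant_frequency(L_PACKAGE_H, c_with_wacc)

print(f"droop per cycle, on-chip C only : {droop_on_chip * 1e3:.0f} mV")
print(f"droop per cycle, with WACC      : {droop_with_wacc * 1e3:.0f} mV")
print(f"tank resonance, on-chip C only  : {f_res_on_chip / 1e6:.1f} MHz")
print(f"tank resonance, with WACC       : {f_res_with_wacc / 1e6:.1f} MHz")
```

Under these assumptions the per-cycle droop bound falls by the same 5X as the capacitance rises, and the tank resonance moves lower by a factor of the square root of 5. The ideal tank ignores the engineered series resistance, which is what damps the resonance in the actual part.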

[Figure 4: the WACC bonded on top of the microprocessor die in a 587-pin IPGA package with 389 signal pins and 198 Vdd/Vss pins, connected by 389 signal bondwires and by Vdd/Vss bondwires.
Problem statement: dIdd = 25A (est.) @ Vdd = 2V, 600MHz; management of resonant supply noise is necessary.
Solution: supply a decoupling "device" (WACC).
- 1µF, 2cm² WACC increases "near-chip" decoupling by 5X
- Supply resonance damped with engineered series resistance on WACC
- 160 Vdd/Vss bondwire pairs & Vdd/Vss planes on the WACC minimize inductance]
Figure 4. Wirebond Attached Chip Capacitor

Description of the StrongARM 110 Chip


The SA-110 is the first example of this new generation of very high performance embedded
processors [3]. Developed by UK-based ARM Ltd. approximately ten years ago, the ARM
microprocessor architecture is exclusively focused on low-cost and low-power applications. It
is used extensively in consumer-oriented applications such as mobile phones, PDAs, organizers,
and video games.
The StrongARM chip is a RISC microprocessor designed for low-power, low-cost
applications. It implements the ARM Version 4 instruction set and is bus compatible with the
ARM 610, 710 and 810. The chip interface runs at 3.3V, but the internal power supplies can
vary from 1.5 to 2.2V. With an internal supply of 1.5V, it runs at 160MHz to deliver 183
Dhrystone MIPS while dissipating less than 500mW. At an internal supply of 2.0V, the chip will
run at 215MHz and dissipate under 1.0W. The chip contains 2.5 million transistors, 90% of
which are in the two 16KB caches. It is fabricated in a 0.35µm three-metal CMOS process with
0.35V thresholds and 0.25µm effective channel lengths. The chip is about 50 square millimeters and is
packaged in a 144-pin plastic TQFP package.

StrongARM 110
Function
- Implements ARM Version 4 instruction set
- Bus compatible with ARM 610, 710 and 810
Performance
- 160MHz @ 1.5V -> 183 Dhrystone MIPS at < 0.5W
- 215MHz @ 2.0V -> 245 Dhrystone MIPS at < 1.0W
- 3.3V pin bus
Process and Package Technology
- 2.5 million transistors fabricated in 0.35µm
  3 metal CMOS with 0.35V Vt and 0.25µm Leff
- Die size: 7.8mm x 6.4mm -> 50mm²
- 144 pin plastic TQFP
Table 3. StrongARM Characteristics
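The two operating points in Table 3 can be compared on an efficiency basis. The sketch below is illustrative only: the MIPS-per-watt and MIPS-per-MHz figures are derived here and are not quoted in the paper. Because the table gives power as an upper bound, the MIPS-per-watt values are lower bounds.

```python
# Operating points from Table 3: (frequency MHz, internal Vdd V, Dhrystone MIPS, max power W)
operating_points = [
    (160, 1.5, 183, 0.5),
    (215, 2.0, 245, 1.0),
]

results = []
for freq_mhz, vdd_v, dmips, max_power_w in operating_points:
    mips_per_watt = dmips / max_power_w   # lower bound, since power is an upper bound
    mips_per_mhz = dmips / freq_mhz       # architectural efficiency, voltage independent
    results.append((vdd_v, mips_per_watt, mips_per_mhz))
    print(f"Vdd={vdd_v}V: >= {mips_per_watt:.0f} MIPS/W, {mips_per_mhz:.2f} MIPS/MHz")
```

The MIPS-per-MHz figure is essentially the same at both points, as expected for the same microarchitecture, while the MIPS-per-watt bound is markedly better at the lower supply voltage.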

Power Reduction in StrongARM


The overall approach to reducing operating power was heavily focused on lowering Vdd.
CMOS-6 technology prescribes a maximum operating voltage of 2.5 volts and features a thin
gate oxide of 60Å and low Vt's of ±350mV.
Figure 5 shows the energy-delay product as a function of Vdd, allowing Vt to vary with Vdd
but restricted to no less than 350mV. Other curves include frequency, energy and power
assuming operation at the maximum frequency allowed by a particular Vdd. 350mV was chosen
as a practical limit balancing leakage against operating power.
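The leakage side of this trade-off can be illustrated with the standard first-order subthreshold model, in which leakage current grows by one decade for every subthreshold swing S of threshold reduction. This is a sketch under assumptions, not data from the paper: the swing value `SUBTHRESHOLD_SWING_V` is a typical textbook figure chosen for illustration, and the function name is hypothetical.

```python
SUBTHRESHOLD_SWING_V = 0.085   # ASSUMPTION: typical swing per decade, not from the paper
VT_REFERENCE_V = 0.35          # threshold floor chosen in the text


def leakage_ratio(vt_new: float, vt_ref: float = VT_REFERENCE_V,
                  swing: float = SUBTHRESHOLD_SWING_V) -> float:
    """Relative subthreshold leakage versus the reference threshold.

    First-order model: I_leak is proportional to 10 ** (-Vt / S).
    """
    return 10.0 ** ((vt_ref - vt_new) / swing)


for vt in (0.35, 0.30, 0.25, 0.20):
    print(f"Vt = {vt:.2f} V -> leakage x{leakage_ratio(vt):.1f} relative to 350mV")
```

Under this model, each 100mV taken off the threshold costs more than an order of magnitude in standby current, which is consistent with treating 350mV as a floor for a battery-powered part.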
Three regions of operation are defined based on the percentage change in frequency per mV
change in Vdd. Within the range of 1 to 2V, the slope is approximately 0.5%/mV. Above 2V, the
slope falls to <0.25%/mV, leading to diminishing returns for higher voltages. Below 1V, the
slope is greater than 1%/mV, leading to dramatic reductions in performance as Vdd is further
reduced.
Allowing for tolerances in Vdd and Vt, it was decided to restrict design points to the range of
1.5-2.0V nominal Vdd. A power dissipation of 500mW maximum at 1.5V for battery-powered
operation was targeted, and the 2.0V power and frequency were allowed to be free variables.

[Figure 5: energy-delay product, frequency, energy, and power plotted against Vdd]

StrongARM Clocking
Alpha microprocessors have shown that for high performance devices, clocking can easily
dominate the dynamic power budget. For the 21064 [3], clock power represented about 65% of
total power dissipation. It is also true that efficient clocking is one of the keys to high
performance. We considered various approaches to reducing clock power and settled on one
which we considered to be relatively low risk with reasonable speed efficiency. By changing from
a flow-through latch scheme to a unique edge-triggered configuration, we estimated that the clock
load could be reduced by half compared to Alpha. Another benefit, whose value was hard to
estimate, was the lack of spurious transitions on the latch outputs compared to flow-through. The
delay performance of this latch is quite good and comparable to the two flow-through latches it
replaces. However, with this scheme we gave up the opportunity to slosh the delay budget across
latch boundaries.
Note that with this latch, only three transistors receive the clock signal, and the two PMOS
precharges can be minimum sized as well. However, it is also true that there is an internal
precharge on every clock cycle regardless of previous or next state. Nonetheless, this latch is
very power efficient. Conditional clocking was also included in the methodology. In previous
Alpha designs, the chips were dynamic, so conditional clocking implied a significant penalty. In
StrongARM, the applications require fully static operation, allowing the clock to be stopped
indefinitely. Analysis shows that without conditional clocks, the power dissipation might have
been as much as 45% higher.
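The point about spurious transitions can be made concrete with a small behavioral model. This is a toy sketch, not a model of the circuit in Figure 6: the function names and the input waveforms are hypothetical, time is a sequence of discrete samples, and the data input is made to glitch while the clock is high.

```python
def count_transitions(samples):
    """Number of value changes between consecutive samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a != b)


def flow_through_latch(clk, data, initial=0):
    """Transparent while clk is 1: the output follows data, glitches included."""
    out, q = [], initial
    for c, d in zip(clk, data):
        if c == 1:
            q = d
        out.append(q)
    return out


def edge_triggered(clk, data, initial=0):
    """Samples data only on a 0 -> 1 clock transition and holds it otherwise."""
    out, q, prev_c = [], initial, 0
    for c, d in zip(clk, data):
        if prev_c == 0 and c == 1:
            q = d
        prev_c = c
        out.append(q)
    return out


# Hypothetical waveforms: two clock pulses, data glitching during each high phase.
clk = [0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0]
data = [1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0]

latch_out = flow_through_latch(clk, data)
edge_out = edge_triggered(clk, data)

print("flow-through transitions  :", count_transitions(latch_out))
print("edge-triggered transitions:", count_transitions(edge_out))
```

For these waveforms the flow-through output toggles six times and the edge-triggered output twice. Every extra output transition also switches the downstream logic, which is why the benefit was real but hard to estimate in advance.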

[Figure 6: schematic of the edge-triggered latch, with inputs IN-H and IN-L, clock CLK, and supplies Vdd and Vss]
Figure 6. Edge-Triggered Latch

StrongARM vs ALPHA Power Dissipation


Table 4 compares the power dissipation of StrongARM to the Alpha 21064. In fact, this is
exactly how we did the initial feasibility evaluation and came up with the target power and
frequency objectives. Starting with the 21064, which dissipates 26 watts at 200MHz, first and
foremost, we have Vdd reduction from 3.45V to 1.5V. This cuts the power by a factor of 5.3.
Next, we reduce the functionality. As compared to a 21064, the most obvious sections missing
are the FPU and the branch history table. Less obvious but very important is reduced control
complexity. This is a fairly simple machine and we worked hard to keep it so. We estimated
that the reduced functionality would cut power by a factor of 3.
Third, we scale the process down from the 0.75µm of the 21064 to the current 0.35µm
process. We estimate that this results in a power reduction of about a factor of 2.
Now we reduce the clock power. The clocking methodology described earlier reduces the
clock power by a factor of two. Since the clock power was about 65% of the total power on the
21064, this results in a reduction of about 1.3.
And finally, we cut the frequency from 200MHz to 160MHz which drops the power by 1.25 and
we're left with a final power of 500mW.

Power Reduction Factors

Start with Alpha 21064: 200MHz @ 3.45V, Power = 26W
Vdd reduction:     Power reduction = 5.3x  -> 4.9W
Reduce functions:  Power reduction = 3x    -> 1.6W
Scale process:     Power reduction = 2x    -> 0.8W
Clock load:        Power reduction = 1.3x  -> 0.6W
Clock rate:        Power reduction = 1.25x -> 0.5W
Table 4. StrongARM Power Reduction Compared to ALPHA 21064.
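The chain of reductions in Table 4 can be replayed step by step. The sketch below simply divides the 26W starting point by each factor quoted in the table; the step labels are paraphrased, and the two cross-checks at the end (the square-law Vdd factor and the frequency ratio) are derived here from the quoted voltages and frequencies.

```python
# Reduction factors as quoted in Table 4.
steps = [
    ("Vdd reduction 3.45V -> 1.5V", 5.3),
    ("Reduced functionality", 3.0),
    ("Process scaling 0.75um -> 0.35um", 2.0),
    ("Clock load", 1.3),
    ("Clock rate 200MHz -> 160MHz", 1.25),
]

power_w = 26.0   # Alpha 21064 at 200MHz and 3.45V
trace = []
for label, factor in steps:
    power_w /= factor
    trace.append((label, round(power_w, 1)))
    print(f"{label:<34} /{factor:<5} -> {power_w:.1f} W")

# Cross-checks derived from the quoted operating points.
vdd_factor = (3.45 / 1.5) ** 2    # switching power scales with Vdd squared
freq_factor = 200.0 / 160.0       # switching power scales linearly with frequency
print(f"Vdd square-law factor: {vdd_factor:.2f}, frequency factor: {freq_factor:.2f}")
```

The running totals reproduce the right-hand column of the table, and the square-law Vdd factor of about 5.29 agrees with the quoted 5.3.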

Conclusion
Some of the key elements of the latest ALPHA and StrongARM microprocessors have been
presented in this invited paper. These circuits and technologies have contributed to making the
fastest and most power-efficient microprocessors in existence. The field of VLSI design changes
rapidly, and these advances will soon be commonplace, to be replaced by newer improvements.
Certainly new techniques and considerations will become critical as processor speeds break the
1GHz barrier in the near future.

References
[1] Montanaro, J., et al., A 160MHz 32b 0.5W CMOS RISC Microprocessor, ISSCC Digest
of Technical Papers, pp. 214-215, Feb. 1996.
[2] Gieseke, B., et al., A 600MHz Superscalar RISC Microprocessor with Out-Of-Order
Execution, ISSCC Digest of Technical Papers, pp. 176-177, Feb. 1997.
[3] Dobberpuhl, D., et al., A 200MHz 64b Dual-Issue CMOS Microprocessor, IEEE Journal
of Solid State Circuits, vol. 27, no. 11, Nov. 1992.
