Abstract
Since the introduction of the first Alpha microprocessor in 1992, Digital has
maintained leadership in absolute CPU performance. During the past year,
Digital's StrongARM™ processor has also achieved a leadership position as
the fastest CPU capable of operating from a single AA battery cell. Some of
the key techniques used to achieve this performance are described in this
invited paper.
Introduction
Digital has introduced two all-new microprocessors in the last year, the StrongARM™ 110 [1]
and the Alpha 21264 [2]. Although intended for radically different markets and applications,
each represents the state of the art in CMOS design and technology for CPU chips in its
respective market. StrongARM™ is intended for low-cost embedded applications requiring high
performance and low power, while the 21264 is a high-end processor delivering the highest
general-purpose CPU performance in the world. Power dissipation for the 21264 can be as high
as 72W at an operating frequency of 600MHz, whereas StrongARM™ dissipates about 350mW
at an operating frequency of 160MHz. Despite these very different operating characteristics, the
circuit techniques, design methodology, and physical technology underlying both designs are very
similar. This paper will focus on some of the elements that are critical to each chip's technical
success at meeting its design objectives.
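To put the two operating points side by side, the power and frequency figures quoted above can be reduced to an average energy per clock cycle. This is only a back-of-the-envelope sketch: the function name is illustrative, and energy per cycle is a rough normalization, since the two chips do very different amounts of work in a cycle.

```python
def energy_per_cycle_nj(power_w: float, freq_hz: float) -> float:
    """Average energy dissipated per clock cycle, in nanojoules."""
    return power_w / freq_hz * 1e9

alpha_21264 = energy_per_cycle_nj(72.0, 600e6)   # 72 W at 600 MHz
strongarm = energy_per_cycle_nj(0.350, 160e6)    # 350 mW at 160 MHz

print(f"21264:     {alpha_21264:.1f} nJ/cycle")
print(f"StrongARM: {strongarm:.2f} nJ/cycle")
print(f"ratio:     {alpha_21264 / strongarm:.0f}x")
```

The gap of well over an order of magnitude is what makes the shared circuit techniques and methodology described below notable.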
21264 Implementation
- Nominal Vdd: 2.0V
- 600MHz (est.) @ 1.9V & 100C
- Pseudo-static design
- Dynamic power m...
- IDDQ testing & single st...
- 587 pin IPGA [389 signal pins]
- I/O signaling is 2.0V HSTL-like
CMOS Technology
- Feature size: 0.35µm
- Leffective: 0.25µm
- Tox: 6.0nm
- VTN / VTP: 0.35V / -0.35V
- Salicide: Cobalt-Disilicide
- Substrate: P-EPI with N-Well
- M1 (AlCu): 5.7kÅ, 1.225µm pitch
- M2 (AlCu): 5.7kÅ, 1.225µm pitch
- RP1 (AlCu): 14.8kÅ, Vss plane
- M3 (AlCu): 14.8kÅ, 2.80µm pitch
- M4 (AlCu): 14.8kÅ, 2.80µm pitch
- RP2 (AlCu): 14.8kÅ, Vdd plane
21264 Clocking
Unlike earlier Alpha designs, the 21264 does not implement a single distributed clock. Instead,
GCLK is the root of a hierarchy of buffered and conditioned clocks. Most of the clock load is
isolated from GCLK by double buffering the load onto large buffered or conditioned section
clocks. The buffered section clocks, identified by region in the die photo, are also the basis for
additional, smaller buffered or conditioned clocks. The large unmarked areas utilize buffered or
conditioned clocks derived directly from GCLK.
There are several advantages to this method: First, each conditioned clock presents an
opportunity to save power. Second, circuit designers can take advantage of multiple clocks. For
example, a phase path can be extended by initiating it with GCLK and terminating it with a
delayed section clock. Finally, with the clock drivers closer to their loads, skew and metal usage
are reduced. We estimate at least a 3X reduction in total clock Metal 3 and 4 usage.
Advantages
- Power saved in idle functional units
- More clocking options
- Reduced section clock skew
- Buffer & section clock rec...
- Metal 3 & 4 usage reduced by > 3X
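The power argument for conditioned clocks can be illustrated with a toy model. It assumes that the dynamic power of each clock section scales as C·Vdd²·f, multiplied by the fraction of cycles in which the conditioned clock actually toggles. The section names, capacitances, and enable fractions below are invented for illustration and are not 21264 data.

```python
from dataclasses import dataclass

@dataclass
class ClockSection:
    name: str
    load_pf: float          # clock load in pF (hypothetical value)
    enable_fraction: float  # fraction of cycles the conditioned clock toggles

def clock_power_w(sections, vdd, freq_hz, conditioned=True):
    """Sum C * Vdd^2 * f over sections, scaled by activity when conditioned."""
    total = 0.0
    for s in sections:
        activity = s.enable_fraction if conditioned else 1.0
        total += s.load_pf * 1e-12 * vdd ** 2 * freq_hz * activity
    return total

# Hypothetical sections; only Vdd and frequency come from the text.
sections = [
    ClockSection("integer", 400.0, 0.9),
    ClockSection("floating_point", 300.0, 0.2),
    ClockSection("memory", 500.0, 0.6),
]

free_running = clock_power_w(sections, vdd=2.0, freq_hz=600e6, conditioned=False)
gated = clock_power_w(sections, vdd=2.0, freq_hz=600e6, conditioned=True)
print(f"free-running: {free_running:.2f} W, conditioned: {gated:.2f} W")
```

In this model the saving comes entirely from sections that are idle much of the time, which is the first advantage listed above.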
[Figure: sources of variation considered in design and CAD verification, including data-dependent crosstalk coupling (capacitance, resistance, inductance), clock duty cycle, temperature, Vdd and Vss variation, and process parameters such as poly and metal CDs, dielectric thickness and constant, metal thickness and cross section, and PMOS/NMOS ratio]
In Figure 2, on the left, consider a 4-metal stack with one plane on top, and on the right, the same stack
with an additional plane inserted between Metals 2 and 3. In both cases, Metals 1 and 3 are
routed orthogonally to Metals 2 and 4, with only parallel traces inductively coupled. Resistance
and Inductance, including LR coupling, are extracted in the frequency domain, with full skin
effects modeled. From these results, a time-domain model is constructed, linked into SPICE and
used in conjunction with a standard capacitive model. Note that a simple RC model would yield
identical simulation results for both cases.
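Why skin effect has to be modeled at these frequencies can be seen from the standard skin-depth formula, δ = sqrt(ρ / (π f μ0)). The sketch below evaluates it at a few frequencies. The resistivity is an assumed textbook value for bulk aluminum (the AlCu alloy used in the process will differ somewhat), and the frequencies above 600MHz are illustrative stand-ins for edge-rate harmonics.

```python
import math

MU_0 = 4e-7 * math.pi   # permeability of free space, H/m
RHO_AL = 2.65e-8        # ohm*m, approximate bulk aluminum (assumption)

def skin_depth_um(freq_hz: float, resistivity: float = RHO_AL) -> float:
    """Skin depth in micrometers for a non-magnetic conductor."""
    return math.sqrt(resistivity / (math.pi * freq_hz * MU_0)) * 1e6

for f in (600e6, 3e9, 10e9):
    print(f"{f / 1e9:5.1f} GHz: skin depth ~ {skin_depth_um(f):.2f} um")
```

At the clock fundamental the computed depth is a few microns, but fast edges carry energy at much higher frequencies, where the depth shrinks toward the scale of on-chip wire dimensions and the effective resistance becomes frequency dependent.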
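[Figure 2: 4-metal stack with one plane on top (left) and with an additional plane between Metals 2 and 3 (right), over the substrate]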
Simulation results for 3.5mm Metal 1 and Metal 3 busses are shown in Figure 3, with simulation
details chosen to maximize noise. Because the range of inductive coupling exceeds the range of
capacitive coupling on silicon, wide busses are chosen to maximize total aggressors.
Nearest-neighbor Metal 3 aggressors, for which capacitive coupling dominates, are switched in the
direction opposite to the Metal 1 bus and the remainder of the Metal 3 bus. On the left, significant
crosstalk is observed between Metals 1 and 3. On the right, this crosstalk is largely screened by
the additional, embedded plane. Peak inductive noise is reduced in the two plane scheme to the
point where simulated waveforms reasonably approximate standard RC behavior. Without
inductance management, peak noise can pose an unreasonable functionality or reliability risk in
some applications.
587 IPGA
Problem Statement
- ΔIdd = 25A (est.) @ Vdd = 2V, 600MHz
- Management of resonant supply noise is necessary
StrongARM 110
Function
- Implements ARM Version 4 instruction set
- Bus compatible with ARM 610, 710 and 810
Performance
- 160MHz @ 1.5V -> 183 Dhrystone MIPS at < 0.5W
- 215MHz @ 2.0V -> 245 Dhrystone MIPS at < 1.0W
- 3.3V pin bus
Process and Package Technology
- 2.5 million transistors fabricated in 0.35µm 3-metal CMOS with 0.35V VT and 0.25µm LEFF
- Die size: 7.8mm x 6.4mm -> 50mm2
- 144 pin plastic TQFP
Table 3 StrongARM™ Characteristics
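The two performance points in the table can be compared on efficiency. The short sketch below derives MIPS per MHz and MIPS per watt, treating the quoted power figures as ceilings, so the MIPS/W values are lower bounds. The function and variable names are illustrative.

```python
# (clock MHz, Dhrystone MIPS, quoted power ceiling in W) from the table above
OPERATING_POINTS = [
    (160, 183, 0.5),
    (215, 245, 1.0),
]

def efficiency(mhz, mips, watts):
    """Return (MIPS per MHz, MIPS per watt) for one operating point."""
    return mips / mhz, mips / watts

for mhz, mips, watts in OPERATING_POINTS:
    per_mhz, per_watt = efficiency(mhz, mips, watts)
    # Power is quoted as an upper bound, so MIPS/W is a lower bound.
    print(f"{mhz} MHz: {per_mhz:.2f} MIPS/MHz, > {per_watt:.0f} MIPS/W")
```

MIPS per MHz is essentially the same at both points, as expected for the same core, while the lower-voltage point delivers more MIPS per watt.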
StrongARM™ Clocking
Alpha microprocessors have shown that for high performance devices, clocking can easily
dominate the dynamic power budget. For the 21064 [3], clock power represented about 65% of
total power dissipation. It is also true that efficient clocking is one of the keys to high
performance. We considered various approaches to reducing clock power and settled on one
which we considered to be relatively low risk with reasonable speed efficiency. By changing from
a flow-through latch scheme to a unique edge-triggered configuration, we estimated that the clock
load could be reduced by half compared to Alpha. Another benefit, whose value was hard to
estimate, was the lack of spurious transitions on the latch outputs compared to flow-through. The
delay performance of this latch is quite good and comparable to the two flow-through latches it
replaces. However, with this scheme we gave up the opportunity to slosh the delay budget across
latch boundaries.
Note that with this latch, only three transistors receive the clock signal, and the two PMOS
precharges can be minimum sized as well. However, it is also true that there is an internal
precharge on every clock cycle regardless of previous or next state. Nonetheless, this latch is
very power efficient. Conditional clocking was also included in the methodology. In previous
Alpha designs, the chips were dynamic, so conditional clocking implied a significant penalty. In
StrongARM™, the applications require fully static operation, allowing the clock to be stopped
indefinitely. Analysis shows that without conditional clocks, the power dissipation might have
been as much as 45% higher.
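The effect of these two measures on total power can be sketched with simple arithmetic. The model assumes clock power is proportional to clock load (P ~ C·V²·f) and that non-clock power is unchanged. The 65% clock share is the 21064 figure quoted above, borrowed purely for illustration; the source does not give StrongARM's clock share, and the function name is hypothetical.

```python
def relative_power(clock_fraction: float, clock_load_scale: float) -> float:
    """Total power relative to baseline when only the clock load is scaled.

    Assumes clock power is proportional to clock load and that
    non-clock power stays the same.
    """
    return (1.0 - clock_fraction) + clock_fraction * clock_load_scale

# Illustrative only: 21064 clock share (65%) with the clock load halved.
halved = relative_power(clock_fraction=0.65, clock_load_scale=0.5)
print(f"relative total power with half the clock load: {halved:.3f}")

# "As much as 45% higher without conditional clocks" means the conditioned
# design draws as little as 1 / 1.45 of the unconditioned power.
conditioned_share = 1.0 / 1.45
print(f"conditioned / unconditioned power: {conditioned_share:.2f}")
```

Under these assumptions, halving the clock load on a clock-dominated design removes roughly a third of total power, comparable in size to the saving attributed to conditional clocking.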
[Figure: StrongARM™ edge-triggered latch, with inputs IN-H and IN-L, clock CLK, and supplies Vdd and Vss]
References
[1] Montanaro, J., et al., A 160MHz 32b 0.5W CMOS RISC Microprocessor, ISSCC Digest
of Technical Papers, pp. 214-215, Feb. 1996.
[2] Gieseke, B., et al., A 600MHz Superscalar RISC Microprocessor with Out-Of-Order
Execution, ISSCC Digest of Technical Papers, pp. 176-177, Feb. 1997.
[3] Dobberpuhl, D., et al., A 200MHz 64b Dual-Issue CMOS Microprocessor, IEEE Journal
of Solid-State Circuits, vol. 27, no. 11, Nov. 1992.