Académique Documents
Professionnel Documents
Culture Documents
Robert Horst
Sandy Metz
April 1984
ABSTRACT
The TXP processor was introduced in 1983 as the the fastest member of
the compatible Tandem NonStop Processor family. The TXP obtains its
by Robert Horst and Sandra Metz. Tandem Computers Inc., Cupertino, Calif.
.
SPECIAL DATA PATH
sors linked with fiber-optic cables, to
a worldwide network of up to 4,080
processors.
Many of the problems in designing ....
z
?1
the TXP processor had already been ..:..:
solved in the NonStop II processor '"
a: CACHE a: t;;
Wz
.... W
'::0
{==.----..
.... u
~
and system design. The NonStop II
extended the instruction set of the
a:
NonStop 1 + system to handle 32-bit ::t:
W
DECODE EXECUTE
INSTRUCTION INSTRUCTION
STORE INSTRUCTION 6 9
COMPUTE FETCH
INSTRUCTION ADDRESS INSTRUCTION
DECODE
INSTRUCTION
(a)
lbl
1l.. LD_A_D ..L... A_D_D_1 I_.;.... S_TO_R_E I
2. Plpelined. The instruction pipeline of the NonStop TXP system allows parts of several instructions to be processed simultaneously (a)-nine
cycles are required to execute three typical instructions. Without pipelining (b). 24 clock cycles would be required.
3
Hardware-performance monitor helps optimize design
While new architectural concepts were being developed for four-way), block size, and replacement algorithm. Because
the TXP system, a hardware-performance monitor was built the data is taken in real-time and reduced on-line, the hit-
to record measurements of the software-compatible Non- ratio measurements are much more accurate than the
Stop II processor. Xplor consists of two large Wire-Wrap traditional technique. in which short address traces are
boards plus a small board to interface to the processor recorded on tape for later analysis. This is especially impor-
under test. It has approximately 800 Schottky TTL compo- tant in transaction processing. since a large amount of
nents and took more than two years to develop. process switching takes place; some individual transactions
This general-purpose tool is capable of capturing 64 bits can last several seconds, during which millions of memory
of data every 100 nanoseconds and reducing that data to references take place.
usable form. The 256 kilobits of internal memory can be Once the measurement methods were working, Xplor
configured in many different word lengths to record. for was attached to an eight-processor NonStop II system. A
instance, a 64-bit count of 4,096 different events, a 32-bit typical transaction-processing benchmark was brought up
count of 8,192 different events, or a single flag for 256-K on the system, and transactions then were generated by
events. In addition, Xplor has programmable state ma- another system, running software that simulated users at a
chines with which data can be captured based on complex number of terminals. At that point, histogram and cache
sequences of events; it includes hardware for the emula- measurements were taken for several of the central pro-
tion of various cache organizations. cessing units.
Two different Xplor configurations were developed to The results of the histogram measurements helped de-
gather data for the TXPprocessor. The first was an instruc- termine some of the data-path widths and organizations for
tion histogram measurement that records the frequency the TXP processor. Once the most frequently executed
with which each instruction occurs, the percentage of time instructions were known, the design was modified to pro-
spent in each instruction, and the average number of code vide more hardware support for them. Since the measure-
and data reads and writes performed by each instruction. ments distinguished different paths through some instruc-
The data is recorded in 64-bit counters, so in effect an tions, tradeoffs could be made in the microcode to make the
unlimited amount of real-time data can be taken before the frequent cases faster.
counters overflow. The results of the cache measurements brought about
The second Xplor configuration monitors memory ad- some major changes in the original cache organization. In
dresses and emulates the tag store of a cache. Hit ratios for one measurement, the hit ratio went from 97% for the
many different cache organizations can be determined by original cache to 99% for the final one, for an overall CPU
varying the effective cache size, associativity (one-, two, or performance gain of over 15%.
but would require three if the extract operations could store of 8-K 4O-bit words and a horizontal control store
not be done simultaneously. of 4-K 84-bit words. The vertical control store controls
The dual 16-bit data paths tend to require fewer cycles the first stage of the microinstruction pipeline and in-
than a single 32-bit path when manipulating byte and t 6- cludes a field that addresses the horizontal control store,
bit quantities and slightly more cycles when manipulat- whose fields control the pipeline's second stage. Lines of
ing 32-bit quantities. A 32-bit add takes two cycles rather microcode that require the same or similar horizontal
than one, but the other data path is free to use the two controls can share horizontal-control-store entries.
cycles to perform either another 32-bit operation or two Unlike microprocessor-based systems that have micro-
16-bit operations. code fixed in read-only memory, the NonStop TXP sys-
tem microcode is implemented in RAM, so it can be
Time disadvantage
changed along with normal software updates and new
The time disadvantage in performing a single 32-bit performance-enhancing instructions can be added.
operation is partially offset by the cycle-time advantage The NonStop TXP processor uses three-stage pipelining
for 16- versus 32-bit arithmetic (32-bit arithmetic re- for both macro- and microinstructions. Figure 2 illus-
quires more time for carry propagation). Measurements trates the operation of the macroinstruction pipeline for a
of transaction-processing applications have shown that sequence of three instructions. The first is a load instruc-
the frequencies of 32-bit arithmetic are insignificant rela- tion that loads a word into the hardware stack. The
tive to data-movement and byte-manipulation instruc- second is an add immediate instruction that adds a con-
tions. which are handled more efficiently by the dual stant to a register on the hardware stack, and the third is
data paths than by a single 32-bit data path. Most in- a final store, which stores the result in memory.
structions have enough parallelism to let the microcode With no pipelining, this sequence would require 24
make effective use of both data paths. (8 + 7 + 9) clock cycles to execute, but because the pre-
To control the large amount of parallelism in the fetch and part of the execution of each instruction can be
NonStop TXP system processor, a wide control-store overlapped with previous instructions, the actual execu-
word is required. The effective width of the control store tion time is just 9 (3 + 2 + 4) clock cycles. Because in-
is over 100 bits. To reduce the number of RAMs required, structions are pipelined, the TXP processor can execute
the control store is divided between a vertical control its fastest instructions in just two clock cycles (167 ns),
4
the cache on misses and for storing through to memory,
FROM DATA PATHS
is handled by a separate page table cache that holds
mapping information for as many as 2,048 pages of 2-K
bytes each (Fig. 3).
~
I::::::::=:.-----~r~=~
ADDRESS =
CACHE
IF YES,
HIT
A cache memory by itself does not necessarily boost a
processor's performance significantly. It is of little use for
~ CACHE TAG?
o
o
the cache to provide instructions and data at a higher
« rate than the rest of the CPU can process. In the TXP
-'
« processor, the cache's performance was tuned to provide
::>
>-
a:: instructions and data at a rate consistent with the en-
~
CD
..,
N
I----------r--'PAGE·TABLE CACHE ADDRESS =
TAG?
hancements to instruction processing provided by in-
creased pipelining and parallelism.
IF YES,
TAG PAGE· 32 bits and more
1.0-_ _.... TABLE
PHYSICAL
ADDRESS
CACHE The two concerns related to a system's word length
HIT
2,048 ENTRIES are capability and performance. The NonStop TXP sys-
16·BIT BUSES TO/FROM DATA PATHS
tem has 32-bit virtual addressing built into the hardware,
so is capable of addressing a gigabyte of virtual memory.
MEMORY In addition, the TXP processor can manipulate 32 bits of
2-16 MEGABYTES
data at a time through its dual 16-bit data paths. Thus
the 32-bit NonStop TXP system has the additional advan-
tage of being able to run software that was originally
written for the 16-bit NonStop II system; both systems
have been provided with instructions that can operate on
64 BITS 8-, 16-, 32-, and 64-bit data types.
In transaction processing, measurements of instruction
frequencies show that data-movement instructions (loads,
3. Memory access. The simple but extensive organization of the TXP stores, and moves) occur much more frequently than 32-
cache provides an average hit ratio of over 96%. With a cache hit, the bit arithmetic instructions. For this reason, the NonStop
data is read out of the cache in 83 nanoseconds. When the data TXP system is optimized to handle data movement by
requested is not in cache, a cache miss results and the 64-bit-wide providing 64-bit access to main memory and 32-bit buses
access to memory speeds the cache refill. and address registers to make memory addressing as
efficient as possible.
and it can execute load and branch instructions, which The NonStop TXP processor was implemented on four
are frequently used, in only three clock cycles (250 ns). large pc boards using high-speed FAST logic, PALS, and
Each NonStop TXP processor has a 64-K-byte cache high-speed static RAMs. The CPU's logical and physical
that holds both data and code. A l6-processor NonStop partitioning was carefully controlled to ensure that the
TXP system has a full megabyte of cache memory. To machine's basic cycle time would not be slowed by long
determine the organization of the cache, a number of propagation delays. The four CPU boards are:
measurements were performed on a NonStop II system • SQ: containing the control store and sequencing logic.
using a specially designed hardware monitor (see "Hard- • cc: containing the I/O channel and various special
ware-performance monitor helps optimize design," modules.
p. 149). The measurements showed that higher cache hit • IP: holding the main data paths and cache.
ratios resulted with a large, simple cache (directly • MC: providing the memory interface, barrel shifter,
mapped) than with a smaller, more complex cache (orga- and interprocessor bus interface.
nized as two- or four-way associative). Typical hit ratios Each CPU module also has from one to four memory
for transaction processing on the NonStop TXP system boards. On the initial release, each memory board con-
are in the range of 96% to 99%. tains 2 megabytes of error-correcting memory imple-
mented with 64-K dynamic RAMs. A 16-prt)cessor Non-
Cache miss Stop TXP system can therefore contain up to 128
Cache misses are handled in a firmware subroutine megabytes of physical memory.
rather than by the usual method of adding a special state The NonStop TXP system was designed to be easy to
machine and dedicated data paths for handling a miss. manufacture and efficient to test. Data and control reg-
Because of the large savings in cache hardware, the isters were implemented with shift registers configured
cache can reside on the same board as the primary data into several serial-scan strings. The scan strings are of
paths; keeping these functions proximal reduces wiring value in isolating failures in field-replaceable units. This
delays and contributes to the fast 83.3-ns cycle time. serial access to registers also makes board testing much
The cache is addressed by the 32-bit virtual address faster and more efficient because the tester can directly
rather than by the physical address, thus eliminating the observe and control many control points. A single ~us
extra virtual-to-physical translation step that would oth- tom tester was designed for all four CPU boards and for
erwise be required for every memory reference. The vir- the memory-array board as well.
tual-to-physical translation, which is needed for refilling The NonStop TXP system is the first product to be
5
MIPS and transactions per second
Determining relative performance among computer sys- prediction ofactual system performance.
tems has never been an easy task. The often-quoted Each of the various measurementtechniques has pitfalls.
milfions-ofooinstructions-per-second rate is intended as a The MIPS rate is perhaps the least accurate way to com-
way to compare basic central-processing-unit-hardware pare systems. One reason is that there is no easy way to
performance. Comparisons are also made on· the· basis of relate the power of one instruction set to another. In
benchmarks. CPU-intensive benchmarks measurethe per- addition, vendors vary in the way they measure MIPS: some
formance of the CPU hardware and compiler; more exten· use it for· the· speed· of the fastest instructions, others
sive benchmarks measure the entire system perfor: measure the speed of the most frequentlyexecuted.instruc-
mance-including the hardware, compiler, operating lions. and still others measure the speed ot a "typical" mix
system, and data-base-management system. In general, of instructions. According to these· definitions, each Non-
the more extensive benchmarks give a more accurate Stop TXP processor is 6, 4, or 2 MIPS, respectively.
developed using Tandem's proprietary computer-aided- Service ran an extensive benchmark in 1981 to detennine
design system. The CAD system's capabilities for logic which system to select for a large on-line medical-infor-
entry, logic simulation, and automated pc-board routing mation system. 2 In that study, a 15-processor Tandem
were instrumental in reducing the design time. While NonStop system running a 1981 version of Tandem's
most high-perfonnance CPUs require four to five years Encompass DBM system perfonned the benchmark at a
to develop, the NonStop TXP processor took just 2~ rate of 4.5 transactions/so An International Business Ma-
years-six months to complete a written specification, chines Corp. System 370/168-3 running version 3 of the
one year to construct a working prototype, and another Adabas DBM system perfonned the same benchmark at 2
year to reach volume production. transactions/s.
This benchmark gives a data point for comparisons
Performance measurement
between Tandem and IBM systems. A 15-processor Non-
Some simple benchmark programs have recently be- Stop system perfonns the Public Health Service bench-
come popular in measuring perfonnance (see "MIPS and mark 2.25 times as fast as an IBM 370/168-3. Though it
transactions per second," p. above). One is the Puzzle would be desirable to compare the TXP system directly to
benchmark, which is a CPU-intensive program to solve a one of IBM'S newest systems, such as the IBM 4381-2, no
three-dimensional puzzle. Execution times for Puzzle can competitive benchmarks have been published. However,
vary widely for the same machine, depending on whether comparisons of the MIPS rate of different processors with-
the program accesses arrays through subscripts or point- in a single family are fairly accurate and can be used to
ers and whether frequently used variables are assigned to extrapolate to newer systems.
registers. Versions of the Puzzle benchmark with pointers According to market research performed by the
and registers were used to compare relative performance Gartner Group,! the IBM 4381-2 is rated at 2.7 MIPS,
for a TXP processor. compared with the older IBM 370/168-3's 2.4 MIPS rat-
Puzzle was written in TAL (transaction application lan- ing-a ratio of 1.125 : 1. Company tests have shown the
guage, the company's system-programming language); NonStop TXP to have a MIPS rate approximately three
the execution time, using a single TXP processor, was times that of the NonStop processor. The extrapolation
measured at 1.67 s. This compares with 4 s on a VAX- of the Public Health Service benchmark performance to
11/780 for Puzzle written in C. 1 Because Puzzle does not the two newer systems is shown in Table 3.
measure such system features as support for virtual mem- Unlike many shared-memory multiprocessor systems,
ory, I/O bandwith, and the ability to do fast context Tandem systems provide linear growth in transaction-
switching, a standard benchmark for comparing transac- processing power as the system expands. A single system
tion-processing systems is still needed. can include up to 16 processors, and clusters with as
One transaction-processing benchmark has been devel- many as 224 NonStop TXP processors may be configured
oped by a third party, however. The U. S. Public Health wi.th Tandem's fiber-optic link. Clusters with up to 60
processors are currently in operation, and their users
TABLE 3' TANDEM VERSUS IBM PERFORMANCE COMPARISONS have verified the linear-performance growth within a
U.S. Public Health USPHS benchmark: cluster of this si.ze.
Service benchmark: extrapolated results· The largest IBM mainframe today is the IBM 3084,
results (transactions (transactions per
per secondl secondl which is rated at approximately 23 MIPS. Extrapolation
IBM 370/168-3 2 - from the benchmark data suggests that the performance
Tandem NonStop
of a cluster of 224 TXP processors is on the order of 10
'5'processor svstem 4.5 - times as powerful as IBM's top-of-the-line 3084
processor. 0
IBM 4381-2 - 2.25
Reference.
Tandem NonStop 'Malcolm A. Gieser. Judith Bayard. and DaVid D. Lang. "Benchmarking for the Best."
TXP 3'processor system - 2.7 Datamation, May 198~.
6
Distributed by
Af'TANDEMCOMPUTERS
Corporate Information Center
19333 Vallco Parkway MS3-07
Cupertino, CA 95014-2599