
13 Krypton for AMD - The 5K86

The legendary Krypton which helped Superman to save the world was the force behind the
name of AMD’s newest processor, the 5K86, which has been developed entirely by AMD. Its
technology lies (as is the case for the Cyrix 6x86) between the Pentium and the following generation.
From the outside, it is compatible with the Pentium P54C, and its internal registers and other
Pentium-specific features, such as the extensions of the virtual 8086 mode, are much more closely
oriented to the Intel original than is the case for the 6x86. Nevertheless, behind the facade which
represents the lowest programming level hide many advanced and pure RISC principles. For
example, the decoder of the 5K86 (the heart which determines most of the performance) transforms
all x86 CISC instructions into a more or less complex set of simple RISC operations (so-called
ROPs). They are then executed internally in parallel in one single clock cycle. The key
features of the 5K86 are as follows:
- two integer pipelines,
- on-chip FPU,
- out-of-order execution and completion of instructions,
- speculative program execution,
- register renaming,
- dynamic branch prediction,
- data forwarding,
- 16 kbyte code cache,
- 8 kbyte data cache,
- pages with 4 kbytes and 4 Mbytes,
- Pentium socket and register-compatible.

13.1 Pins and Signals

The 5K86 comes in a P54C-compatible SPGA package with 296 pins. Because the 5K86 supports
neither dual processing nor performance monitoring and does not integrate an on-chip APIC,
some signals are missing. They are listed below together with their coordinates
(compare Figure 11.2):

- APIC: PICCLK (H34), PICD0/DPEN (J33), PICD1/APICEN (L35);
- dual processing: CPUTYP (Q35), PBGNT (AD4), PBREQ (AE3), PHIT (AA3), PHITM (AC3);
- performance monitoring: PM0/BP0 (Q3), PM1/BP1 (R4).

Of course, no local interrupts LINT0 and LINT1 are possible; the corresponding terminals are
always configured as INTR and NMI. There are no additional terminals to those already present
on the Pentium; all signals have the same meanings.

13.2 Internal Structure

Figure 13.1 shows a block diagram of the 5K86. It has separate code (16 kbytes) and data caches
(8 kbytes) with write-back strategy. In the caches there are linear tags for the TLB; the physical
tags are located in the memory management unit. This means that the TLB operates with linear
addresses (before paging) and physical addresses (after paging). Instructions are transferred
from the code cache into an instruction queue which distributes them to four parallel decoder
units (fast path/µ-code). Finally, the instructions are executed in a total of two ALUs, one FPU
(with an internally extended 82-bit format), one branch unit and two load/store units. The write
buffer comprises four entries and buffers write accesses to memory; the reorder buffer with its
16 entries carries out a correct completion of instructions which update the register file. These
units are connected by means of two pipelines.

[Figure 13.1: The internal structure of the AMD5K86. Blocks shown: 64-bit data bus and 32-bit
address bus, bus interface, memory management unit MMU with physical tags, byte queue,
fast path and µ-code path, ALU I, ALU II, FPU, branch unit, load/store I and load/store II.]

13.2.1 Prefetching and Predecoding


The prefetch and predecode unit, which is located behind the code cache, fetches, in the case of
a code cache miss, a complete 32-byte cache line into a prefetch cache and simultaneously
predecodes the data on the fly to relieve the following decode stages. For example, the predecode
stage determines the length of an instruction (which for an x86 instruction can be between
1 and 15 bytes), whether a certain byte is the first, last or an intermediate byte of an x86
instruction, as well as the number of 5K86-internal RISC operations which this complex CISC
instruction needs for execution in the processor. This information is stored in the code
cache together with the instruction bytes. Later, the instruction fetch stage of the pipelines reads
all this information.

13.2.2 Integer Pipelines and ROPs

The u- and v-pipelines in the 5K86 comprise six stages each, that is, the AMD5K86 is, like the
Pentium, a so-called superscalar. Figure 13.2 shows the individual stages.

[Figure 13.2: The pipeline stages of the 5K86. Recoverable labels: In-order Processing,
Out-of-order, Retirement, Result k-10, Result k-9.]

The first prefetch or instruction fetch stage IF fetches up to 16 instruction bytes together with
the predecode information from the code cache or (in the case of a cache miss) from the prefetch
cache. Additionally, it carries out a dynamic branch prediction by means of the branch target
buffer BTB. According to AMD, the prediction achieves a hit rate between 70% and 85% of all
conditional branches. The maximum depth of the code which is executed speculatively afterwards
is three levels.

The predecode unit is formed in the AMD5K86 decoder together with the two decode stages D1
and D2 (more specifically, the 5K86 moves the x86 instruction bytes in two pipelines with the
stages D1 and D2 through the decoder). This decoder is the heart of the AMD processor and
divides the x86 CISC instructions into small and quickly executable RISC instructions, the
so-called RISC operations (ROPs). Simple CISC instructions (MOV, ADD, SHIFT, etc.) are converted
by the fast-path converters into one to three ROPs, complex instructions by microcode (µ-code).
The four converters transfer up to four ROPs per cycle to the execution units. Thus, for example,
two CISC instructions which are converted into two ROPs each may be executed in parallel in
both pipes. Only these ROPs (!) and not the original CISC instructions are guided in parallel
through the pipelines. This simplifies the pipeline structure significantly. However, the division
into ROPs must be performed very efficiently. This is done by the two decoding stages D1 and
D2. The RISC operations implement more or less another instruction level which is located below
the x86 architecture. If one compares this strategy with a pure CISC processor (the i386, for
example) then the division of an x86 CISC instruction into ROPs corresponds to the execution
of several microencoded instructions on the i386. But the ROPs are much more powerful than
micro-instructions (they comprise a set of micro-instructions executed in parallel) and, moreover,
hardwired, that is, they need not be read from a microprogram memory.

The first decoding stage D1 assembles the instruction bytes with their predecode information
into the 16-byte queue and generates the internal ROPs. The second decode stage D2 then
assembles register and immediate operands and addresses the reorder buffer ROB to provide
all operands. In the execution stage EX the ROPs are transferred to the execution units (ALU
I + II, FPU, branch, load/store I + II) and executed. Here, in-order execution terminates because
the ROPs may have different execution times or produce interlocks. Therefore, an out-of-order
completion of the instructions occurs. The result stage RS performs data forwarding, writes the
corresponding entries into the reorder buffer ROB (to implement register renaming and speculative
program execution) and verifies the predicted branch. The retire stage WB writes the x86
architecture registers after, for example, one level of the speculative program execution has been
verified. Additionally, it switches data write accesses which await verification in the write buffer
through to the data cache (or the memory subsystem). Note that all write accesses to the memory
or I/O address space are carried out exactly in the programmed order. The AMD5K86 performs,
like the Pentium, strongly ordered write accesses.

The retiring in the retire stage WB can occur a variable number of clock cycles after the result
stage without affecting the pipeline throughput. The 5K86 can retire a maximum of four ROPs
in the retire stage and thus "catch up".

There is another processor, the NexGen 586, which implements such a transformation of CISC
instructions into RISC operations. But because it has not achieved any importance in the market-
place, it will not be discussed in detail here.

The following lists the RISC strategies which are implemented by the AMD5K86 to avoid
pipeline stalls and to enhance performance. You may find details concerning these strategies in
Section 12.2.4 regarding the Cyrix 6x86:

- register renaming,
- dynamic branch prediction and speculative instruction execution,
- result forwarding,
- operand forwarding,
- data bypassing,
- out-of-order completion of instructions.

13.2.3 Instruction Pairing

The transformation of the x86 instructions into ROPs naturally also affects instruction pairing on
the x86 level. But, particularly because of the two powerful ALUs, this is less limited than on
the Pentium. ALU instructions occur frequently and may inhibit pairing. Of course, ROPs assigned
to the CISC instructions can be executed in parallel only if they do not use the same execution
unit simultaneously (for example, if they do not both contain an explicit or implicit FPU or
branch instruction). Therefore two ALUs and two load/store units are provided, because x86
instructions often include explicit (MOV reg, mem; explicit: mem) or implicit (RET; implicit:
ESP) memory accesses because of their extensive addressing schemes. Table 13.1 lists possible
pairings for both ALUs. The load/store units are symmetrical; corresponding instructions
can always be paired.

Instruction         ALU I   ALU II

Addition            yes     yes
Subtraction         yes     yes
Multiplication      yes     yes
Division            yes     no
Logical function    yes     yes
Compare             yes     yes
Packed BCD          yes     no
Unpacked BCD        yes     no
ADDC, SUBB          yes     no
Shift               no      yes

Table 13.1: Instruction pairings for the ALUs
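
The pairing rule expressed by Table 13.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names ALU_SUPPORT and can_pair are hypothetical, and the model looks solely at which ALU accepts which operation class. It ignores operand dependencies and all other execution units.

```python
# Which ALU accepts which operation class, transcribed from Table 13.1.
# Tuple layout: (ALU I, ALU II). All names here are illustrative only.
ALU_SUPPORT = {
    "addition": (True, True),
    "subtraction": (True, True),
    "multiplication": (True, True),
    "division": (True, False),
    "logical": (True, True),
    "compare": (True, True),
    "packed_bcd": (True, False),
    "unpacked_bcd": (True, False),
    "addc_subb": (True, False),
    "shift": (False, True),
}


def can_pair(op_a: str, op_b: str) -> bool:
    """Return True if the two operations can occupy different ALUs at once."""
    a_on_1, a_on_2 = ALU_SUPPORT[op_a]
    b_on_1, b_on_2 = ALU_SUPPORT[op_b]
    # Either op_a goes to ALU I and op_b to ALU II, or the other way round.
    return (a_on_1 and b_on_2) or (a_on_2 and b_on_1)
```

In this simplified model a division and a shift can be paired, whereas two shifts, or a division together with a BCD operation, compete for the same ALU.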

Of course, the microencoded part of the instruction converter performs an implicit instruction
pairing to execute complex x86 instructions (for example, string instructions with REP prefix or
exceptions) as efficiently as possible.

13.2.4 Floating-point Unit

The floating-point unit of the AMD5K86 comprises two pipelines for adding and multiplying
values as well as a detection pipeline; in total three pipelines. The FPU ROPs are always dispatched
in pairs to the FPU. The first ROP contains the low-order half of the two FPU operands
X and Y, and the second ROP the high-order half of X and Y as well as the opcode. The
operands of both ROPs are combined and converted into an internal 82-bit format (for comparison:
the IEEE norm uses the 80-bit temporary real format). This enhances accuracy; however, the
conversion consumes one clock cycle. Only if the operands come directly from the FPU pipeline
by data forwarding is this conversion unnecessary and the dead cycle avoided. Most of the
frequently used and simple instructions (for example, FADD) are hardwired (fast-path encoded); all
complex instructions are microencoded (as is the case for the Pentium) because they usually
consist of a more or less lengthy series of FP additions and multiplications.

13.2.5 On-chip Caches

The 5K86 has a 16 kbyte code cache and an 8 kbyte data cache. Thus the code cache has twice
the capacity of that of the Pentium. Another significant difference is that both caches have linear
tags as well as physical tags (in the MMU). The advantage of linear tags is that no paging
conversion of linear into physical addresses is required. The 5K86 therefore saves one clock cycle
when accessing the caches compared to the Pentium which implements only physical tags. Only
in the case of cache misses and snoop cycles are the physical tags accessed. In the course of a
cache line fill the requested &byte double-word is always loaded.

The code cache has a 4-way set-associative organization with 512 lines comprising 32 bytes each.
Additionally, it supports 16-byte split line accesses where the first 8 bytes are located in one
cache line and the second 8 bytes in the following cache line. The two 8-byte units are
combined into one single 16-byte unit. Every instruction byte in the cache is assigned five
predecode bits. Because the instruction cache is read-only, only two states of the MESI protocol
(valid, invalid) are implemented.

The data cache is also organized as a 4-way set-associative cache with a cache line size of
32 bytes. Additionally, it is divided into four banks, that is, it has 4-way interleaving which
refers to 4-byte borders (the first bank comprises bytes 0-3 and 16-19 of a cache line; the second
bank, bytes 4-7 and 20-23; the third bank, bytes 8-11 and 24-27; and the fourth bank, bytes 12-15
and 28-31). With two access ports up to two data accesses can occur in parallel if the accesses
refer to different banks. To maintain cache coherency with other caches in the system, the data
cache supports all MESI states. Note that the MESI states are stored in the physical tags. Because
linear tags concern read accesses, MESI states are not required there.
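
The cache geometry and the bank interleaving described above can be checked with a short Python sketch. The function names (cache_geometry, bank_of, parallel_access_possible) are hypothetical, and the only condition modelled is the one stated in the text, namely that two accesses must refer to different banks. Any further restrictions of the real hardware are not covered.

```python
LINE_SIZE = 32   # bytes per cache line (both caches)
WAYS = 4         # 4-way set-associative (both caches)
BANK_WIDTH = 4   # interleaving refers to 4-byte borders
BANKS = 4        # the data cache is divided into four banks


def cache_geometry(size_bytes: int) -> tuple[int, int]:
    """Return (number of lines, number of sets) for a cache of the given size."""
    lines = size_bytes // LINE_SIZE
    sets = lines // WAYS
    return lines, sets


def bank_of(address: int) -> int:
    """Return the data cache bank (0..3) that holds the addressed byte."""
    offset_in_line = address % LINE_SIZE
    return (offset_in_line // BANK_WIDTH) % BANKS


def parallel_access_possible(addr_a: int, addr_b: int) -> bool:
    """Two accesses can proceed in parallel only if they hit different banks."""
    return bank_of(addr_a) != bank_of(addr_b)
```

For the 16 kbyte code cache this yields the 512 lines named in the text; bytes 0 and 16 of a line fall into the same bank and therefore cannot be accessed in parallel in this model, whereas bytes 0 and 4 can.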

13.2.6 Extensions and Model-specific Registers of the AMD5K86

Although the AMD5K86 is much more Pentium-compatible than the 6x86, some differences
occur in the control register CR4 and the model-specific registers.

Control Register CR4 and Global Pages

The control register CR4 controls the extensions for the virtual 8086 mode, the debug support
and paging, as is the case for the Pentium. Figure 13.3 shows the structure of the AMD5K86
control register CR4.

[Figure 13.3: Control register CR4 of the AMD5K86 (bits 31..0) with the flags explained below.]

GPE: Global Page Extension
     1=global pages enabled                        0=global pages disabled
MCE: Machine Check Enable
     1=machine check exception enabled             0=machine check exception disabled
PSE: Page Size Extension
     1=4 Mbyte pages                               0=4 kbyte pages
DE:  Debugging Extension
     1=I/O breakpoints enabled                     0=breakpoints only for memory addresses
TSD: Time Stamp Disable
     1=RDTSC only for CPL=0                        0=RDTSC also for CPL=3..1
PVI: Protected Mode Virtual Interrupts
     1=virtual interrupt flags in protected mode   0=no virtual interrupt flags
VME: Virtual 8086 Mode Extension
     1=virtual interrupt flags in virtual 8086 mode   0=no virtual interrupt flags
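
As a rough illustration, the flags listed above can be extracted from a CR4 value as follows. Note the assumption: the bit positions are not legible in this copy of the figure, so the sketch assumes the Pentium CR4 layout (VME=0, PVI=1, TSD=2, DE=3, PSE=4, MCE=6) with GPE at bit 7. The names CR4_FLAGS, decode_cr4 and set_flag are hypothetical helpers.

```python
# Assumed bit positions: Pentium CR4 layout with GPE placed at bit 7.
# These positions are an assumption, not read from the figure.
CR4_FLAGS = {
    "VME": 0,  # virtual 8086 mode extension
    "PVI": 1,  # protected mode virtual interrupts
    "TSD": 2,  # time stamp disable
    "DE": 3,   # debugging extension
    "PSE": 4,  # page size extension
    "MCE": 6,  # machine check enable
    "GPE": 7,  # global page extension
}


def decode_cr4(value: int) -> dict[str, bool]:
    """Return the state of every known CR4 flag for a raw 32-bit value."""
    return {name: bool((value >> bit) & 1) for name, bit in CR4_FLAGS.items()}


def set_flag(value: int, name: str) -> int:
    """Return the CR4 value with the named flag set."""
    return value | (1 << CR4_FLAGS[name])
```

With these assumed positions, set_flag(0, "GPE") produces the value that would enable global pages while leaving all other extensions switched off.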

Only the GPE bit has been newly implemented; the other bits remain unchanged and have
already been discussed in Section 11.3.1 in connection with the Pentium. By setting the GPE bit you
may enable the global page function of the 5K86. Every time the control register CR3 is reloaded
(that is, a new page directory base address is written) the processor invalidates the TLBs for the
4 kbyte and 4 Mbyte pages. This is the case, for example, for a task switch. But if a page is valid
for several tasks (as is often true for operating system functions or the video RAM) then invalidation
is superfluous and time-consuming because a TLB miss always occurs now for this page.
But this again reloads that entry which was already present in the TLB; therefore it is advantageous
that certain pages are defined as global (in a similar way, for example, you may define
certain variables in a C program as global so that they are visible for all subroutines). To do that
you must first set the GPE bit in the control register CR4.

The definition of a page as global is slightly different for 4 kbyte and 4 Mbyte pages. Figure 13.4
shows the structure of the page directory and page table entries with the new G bit.

[Figure: three 32-bit entry layouts are shown: the page directory entry for 4 kbyte pages, the
page directory entry for 4 Mbyte pages (page frame address bits 31..22, the bits below it down to
the AVAIL field set to 0) and the page table entry (page frame address bits 31..12). Each entry
carries the AVAIL field and the G bit in addition to the flags explained below.]

Page Frame Address: address bits 31..12 and 31..22 of the page frame
AVAIL: available for operating system
G:   Global
     0=no global page                  1=global page
PS:  Page Size
     0=4 kbyte                         1=4 Mbyte
D:   Dirty
     0=page not yet written            1=page already written
PCD: Page Cache Disable
     0=page cachable                   1=page not cachable
PWT: Page Write-Through
     0=page uses write-back strategy   1=page uses write-through strategy
A:   Accessed
     0=page not yet accessed           1=page already accessed
U/S: User/Supervisor Access Level
     0=supervisor (CPL=0..2)           1=user (CPL=3)
R/W: Read/Write
     0=page is read-only               1=page may be written
P:   Page Present
     0=page swapped                    1=page in memory

Figure 13.4: Page table entry for a global page.

- 4 kbyte pages: you must set the G bit in the page directory entry as well as the page table entry;
- 4 Mbyte pages: you must set the G bit in the page directory entry.

Note that 4 Mbyte pages can occur only in a page directory entry. Afterwards you must (of
course) still load the CR3 register with the base address of the page directory.

Array Access Registers AAR and Hardware Configuration Registers HWCR

Of all the model-specific registers of the Pentium (see Table 11.3), the AMD5K86 implements
only the following:

- Register 00h: machine check address register;
- Register 01h: machine check type register;
- Register 10h: time stamp counter.

However, two new model-specific registers have been implemented which are shown in Table
13.2.

Model-specific register               MSR   Application

Machine check address register        00h   physical address of error-causing bus cycle
Machine check type register           01h   type of error-causing bus cycle
Time stamp counter                    10h   read/write of the internal 64-bit counter
Array access register AAR             82h   checking code and data cache, TLBs
Hardware configuration register HCR   83h   debug function control

Table 13.2: Model-specific registers of the AMD5K86

The AAR serves for checking the code and data caches as well as the TLBs for 4 kbyte and
4 Mbyte pages. It therefore carries out functions which, on the Pentium, are done by the model-specific
registers TR2-TR7 (MSR numbers 04h-09h). The hardware configuration register HCR
enables or disables various functional units of the AMD5K86. Its structure is shown in Figure
13.5.

[Figure 13.5: Structure of the hardware configuration register HCR. The high-order bits are
reserved; the remaining fields are explained below.]

DDC:  Disable Data Cache
      1=data cache disabled          0=data cache enabled
DIC:  Disable Instruction Cache
      1=code cache disabled          0=code cache enabled
DBP:  Disable Branch Prediction
      1=branch prediction disabled   0=branch prediction enabled
DC:   Debug Control
      000b=disabled   001b=branch tracing enabled   100b=enable probe mode for debug traps
DSPC: Disable Stopping Processor Clocks
      1=stopping disabled            0=stopping enabled

You can disable the data cache by setting the DDC bit, and the code cache by setting the DIC bit.
This disables the L1 cache of the AMD5K86. You will find this option in many BIOSes. Similarly,
you can disable the branch prediction by setting DBP. The three DC bits serve for controlling
various debug functions. If DC = 000b then all functions of the HCR register are disabled, that
is, the values of the other bits have no effect, and the AMD5K86 operates normally. A value of
001b enables branch trace messages; a value of 100b enables the probe mode when a debug trap
occurs. This allows the values at the signal outputs of the debug port to be checked in a serial
manner. All other values for DC are reserved. Finally, DSPC controls the stoppage of the
processor-internal clock signal in the case of halt and stop-grant states. A value of "1" disables that
stoppage.
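
The meaning of the three DC bits can be captured in a small lookup, shown here as a Python sketch. The function name describe_dc is hypothetical; the 3-bit field value is passed in directly, so no assumption about the position of DC within the register is made.

```python
# Defined encodings of the 3-bit DC field, as given in the text.
DC_MODES = {
    0b000: "debug functions disabled",
    0b001: "branch trace messages enabled",
    0b100: "probe mode on debug trap enabled",
}


def describe_dc(dc: int) -> str:
    """Return the meaning of a DC field value; undefined encodings are reserved."""
    if not 0 <= dc <= 0b111:
        raise ValueError("DC is a 3-bit field")
    return DC_MODES.get(dc, "reserved")
```

Any value other than the three defined encodings is reported as reserved, which mirrors the statement that all other values for DC are reserved.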

13.3 Pentium Compatibility

The AMD5K86 is nearly 100% compatible with the Pentium; deviations are indicated in the
corresponding sections. Note that compatibility refers only to the implemented registers and
their structure as well as the physical bus interface (signals and pins) to the outside. The internal
structure and the operating principles (that is, the "black box" between instruction and data
input on one side and the result output on the other) are, however, completely different.

13.3.1 Operating Modes

Real mode, protected mode and virtual 8086 mode are completely Pentium-compatible. The
behaviour in system management mode is also the same as that of the Pentium. Only undefined
or reserved sections in the register dump may differ. For paging the only difference is the
implementation of global pages on the AMD5K86. Apart from this, the AMD5K86 is also 100%
Pentium-compatible with this function.

13.3.2 Test Functions

The AMD5K86 also implements various test functions in the same way as the Pentium. Some are
initialized in the course of a reset. Table 13.3 shows the signal levels which must be activated
for a certain test mode.

INIT   FLUSH   FRCMC   Issued test

1                      built-in self-test BIST
0              0       functional redundancy checking
0      0               output float test mode

Table 13.3: AMD5K86 tests

The aspects in which they differ will be discussed in the following.

BIST

The built-in self-test checks the complete internal hardware (code and data cache, PLAs, microcode
ROM, TLBs) and writes an error code into register EAX. The BIST is carried out when the INIT
signal is active at the falling edge of reset. An EAX value of 00000000h means that no error has
occurred. Table 13.4 lists all BIST error codes; a set bit indicates the corresponding error.

Bit     Error type

31..9   reserved (=00h)
8       data path
7       code cache data entry
6       code cache linear tags
5       data cache linear tags
4       PLA
3       microcode ROM
2       data cache data entry
1       code cache physical tag
0       data cache physical tag

Table 13.4: BIST error code
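
Decoding the BIST result in EAX according to Table 13.4 might look as follows in Python. The names BIST_ERRORS and decode_bist are hypothetical, and the assignment of bit 1 to the code cache physical tag is inferred from the order of the table rows, because the bit number is missing in this copy.

```python
# Bit number -> error type, following Table 13.4.
# Bit 1 is inferred from the row order (its number is not legible).
BIST_ERRORS = {
    8: "data path",
    7: "code cache data entry",
    6: "code cache linear tags",
    5: "data cache linear tags",
    4: "PLA",
    3: "microcode ROM",
    2: "data cache data entry",
    1: "code cache physical tag",
    0: "data cache physical tag",
}


def decode_bist(eax: int) -> list[str]:
    """Return the error types whose bits are set, lowest bit first."""
    return [name for bit, name in sorted(BIST_ERRORS.items()) if (eax >> bit) & 1]
```

An EAX value of 00000000h yields an empty list, that is, no error has been detected.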
Note that the AMD5K86 starts operation independently of the BIST result, that is, the system
boots. The BIST can also be started through the test access port with the RUNBIST instruction
(see Section 9.85).

Output Float Test Mode

The Pentium calls this test mode the tristate test mode. Apart from this both operating modes
are the same.

Functional Redundancy Checking

This test mode is also the same as the functional redundancy checking of the Pentium. Details
are discussed in Section 11.3.9.

Branch Trace Cycles

In the same manner as the Pentium, the 5K86 drives a branch trace message special cycle (BE7-BE0
= 11011111) when the value of DC in the hardware configuration register HWCR is equal
to 001b (on the Pentium you must set the ETE bit in test register TR12) and the processor
performs a branch. But the signal encodings differ. The following shows the branch trace special
cycle of the AMD5K86:

First cycle:
A31        0 (first cycle)
A30..A29   undefined
A28        undefined
A27..A20   0
A19..A4    CS selector of the address where the branch originates
A3         0
           EIP of the address where the branch originates