Prasoon - 64 Bit Computing

Introduction
Why 64-Bit Computing?
The question of why we need 64-bit computing is often asked but rarely answered in
a satisfactory manner. There are good reasons for the confusion surrounding the
question.
So first of all, let's look at the users who need 64-bit addressing
and 64-bit calculations today:
• Users of CAD, design systems, and simulators need more than 4 GB of RAM.
Although there are ways around this limit (for example, Intel PAE), they
cost performance. For example, the Xeon processors support a 36-bit
addressing mode in which they can address up to 64 GB of RAM. The idea of this
support is that the RAM is divided into segments, and an address consists of
a segment number and a location inside the segment. This approach
causes almost 30% performance loss in memory operations. Besides,
programming is much simpler and more convenient with a flat memory model
in the 64-bit address space: because the address space is so large, a location has a
simple address that is processed in one pass. Many design offices use quite
expensive workstations built on RISC processors, where 64-bit addressing
and large memory sizes have been in use for a long time.
• Database users. Any big company has a huge database, and being able to extend
the maximum memory size and to address data directly in the database
is very important. Although in special modes the 32-bit IA-32 architecture
can address up to 64 GB of memory, a transition to the flat memory model in the
64-bit space is much more advantageous in terms of speed and ease of
programming.
• Scientific calculations. Memory size, a flat memory model, and no limit on the size of
processed data are the key factors here. Besides, some algorithms take a much
simpler form in a 64-bit representation.
• Cryptography and security applications benefit greatly from 64-bit
integer calculations.
What is 64-bit computing?
The labels "16-bit," "32-bit" or "64-bit," when applied to a microprocessor,
characterize the processor's data stream. Although you may have heard the term
"64-bit code," this designates code that operates on 64-bit data.
In more specific terms, the labels "64-bit," "32-bit," etc. designate the number of bits
that each of the processor's general-purpose registers (GPRs) can hold. So when
someone uses the term "64-bit processor," what they mean is "a processor with GPRs
that store 64-bit numbers." And in the same vein, a "64-bit instruction" is an
instruction that operates on 64-bit numbers.
In the diagram above black boxes are code, white boxes are data, and gray boxes are
results. The instruction and code "sizes" are not to be taken literally, since they're
intended to convey a general feel for what it means to "widen" a processor from 32
bits to 64 bits.
Not all the data either in memory, the cache, or the registers is 64-bit data. Rather,
the data sizes are mixed, with 64 bits being the widest.
Note that in the 64-bit CPU pictured above, the width of the code stream has not
changed; the same-sized opcode could theoretically represent an instruction that
operates on 32-bit numbers or an instruction that operates on 64-bit numbers,
depending on what the opcode's default data size is. On the other hand, the width of
the data stream has doubled. In order to accommodate the wider data stream, the
sizes of the processor's registers and the sizes of the internal data paths that feed
those registers must be doubled.
Now let's take a look at two programming models, one for a 32-bit processor and
another for a 64-bit processor.
The registers in the 64-bit CPU pictured above are twice as wide as those in the 32-
bit CPU, but the size of the instruction register (IR) that holds the currently executing
instruction is the same in both processors. Again, the data stream has doubled in
size, but the instruction stream has not. Finally, the program counter (PC) has also
doubled in size.
For the simple processor pictured above, the two types of data that it can process are
integer data and address data. Ultimately, addresses are really just integers that
designate a memory address, so address data is just a special type of integer data.
Hence, both data types are stored in the GPRs and both integer and address
calculations are done by the ALU.
Many modern processors support two additional data types: floating-point data and
vector data. Each of these two data types has its own set of registers and its own
execution unit(s). The following table compares all four data types in 32-bit and 64-
bit processors:
Data Type        Register Type  Execution Unit  x86 width (bits)  x86-64 width (bits)
Integer          GPR            ALU             32                64
Address          GPR            ALU or AGU      32                64
Floating Point*  FPR            FPU             64                64
Vector           VR             VPU             128               128
*x87 uses 80-bit registers to do double-precision floating-point. The floats themselves are 64-bit, but the
processor converts them to an internal, 80-bit format for increased precision when doing computations.
The table above shows that the difference the move to 64 bits makes is in the integer
and address hardware. The floating-point and vector hardware stays the same.
Now that we know what 64-bit computing is, let's take a look at the benefits of
increased integer and data sizes.
Dynamic range
The main thing that a wider integer gives you is increased dynamic range.
In the base-10 number system to which we're all accustomed, you can represent a
maximum of ten integers (0 to 9) with a single digit. This is because base-10 has ten
different symbols with which to represent numbers. To represent more than ten
integers you need to add another digit, using a combination of two symbols chosen
from among the set of ten to represent any one of 100 integers (00 to 99). The
general formula that you can use to compute the number of integers (dynamic range,
or DR) that you can represent with an n-digit base-ten number is:
DR = 10^n
So a 1-digit number gives you 10^1 = 10 possible integers, a 2-digit number 10^2 =
100 integers, a 3-digit number 10^3 = 1000 integers, and so on.
The base-2, or "binary," number system that computers use has only two symbols
with which to represent integers: 0 and 1. Thus, a single-digit binary number allows
you to represent only two integers, 0 and 1. With a two-digit (or "2-bit") binary, you
can represent four integers by combining the two symbols (0 and 1) in any of the
following four ways:
00 = 0
01 = 1
10 = 2
11 = 3
Similarly, a 3-bit binary number gives you eight possible combinations, which you
can use to represent eight different integers. As you increase the number of bits, you
increase the number of integers you can represent. In general, n bits will allow you to
represent 2^n integers in binary. So a 4-bit binary number can represent 2^4 or 16
integers, an 8-bit number gives you 2^8 = 256 integers, and so on.
So in moving from a 32-bit GPR to a 64-bit GPR, the range of integers that a
processor can manipulate goes from 2^32 = 4.3e9 to 2^64 = 1.8e19. The dynamic range,
then, increases by a factor of 4.3 billion. Thus a 64-bit integer can represent a much
larger range of numbers than a 32-bit integer.
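The dynamic-range arithmetic above can be checked with a few lines of Python. This is only a sketch of the formula DR = 2^n; the function name dynamic_range is made up for illustration.

```python
def dynamic_range(bits: int) -> int:
    """Number of distinct unsigned integers that fit in `bits` binary digits."""
    return 2 ** bits


dr32 = dynamic_range(32)
dr64 = dynamic_range(64)

# Going from 32 to 64 bits multiplies the range by 2^64 / 2^32 = 2^32.
growth = dr64 // dr32

print(f"32-bit range: {dr32:.1e}")
print(f"64-bit range: {dr64:.1e}")
print(f"growth factor: {growth:.1e}")
```

Python integers have arbitrary precision, so the 64-bit value is computed exactly here, regardless of the machine the script runs on.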
The benefits of increased dynamic range, or how the existing 64-bit computing
market uses 64-bit integers
Since addresses are just special-purpose integers, an ALU and register
combination that can handle more possible integer values can also handle that many
more possible addresses. With all the recent press coverage that 64-bit architectures
have garnered, it's fairly common knowledge that a 32-bit processor can address at
most 4GB of memory. (Remember our 2^32 = 4.3 billion number? That 4.3 billion
bytes is about 4GB.) A 64-bit architecture could, by contrast, theoretically address up
to 18 million terabytes.
So, what do you do with over 4GB of memory? Well, caching a very
large database in it is a start. Back-end servers for mammoth databases are one
place where 64 bits have long been a requirement, so it's no surprise to see
upcoming 64-bit offerings billed as capable database platforms.
On the media and content creation side of things, folks who work with very large 2D
image files also appreciate the extra RAM. A related and very interesting
application domain where large amounts of memory come in handy is in simulation
and modeling. Under this heading you could put various CAD tools and 3D rendering
programs, as well as things like weather and scientific simulations, and even real-
time 3D games. Though the current crop of 3D games wouldn't benefit from greater
than 4GB of RAM, it is quite possible that we'll see a game that benefits from greater
than 4GB RAM within the next five years.
Some applications, mostly in the realm of scientific computing (MATLAB,
Mathematica, MAPLE, etc.) and simulations, require 64-bit integers because they
work with numbers outside the dynamic range of 32-bit integers. When the result of
a calculation exceeds the range of possible integer values, you get a situation called
either overflow (i.e. the result was greater than the highest positive integer) or
underflow (i.e. the result was less than the lowest negative integer). When this
happens, the number you get in the register isn't the right answer. There's a bit in
the x86's processor status word, the overflow flag, that allows you to check whether
an integer operation has just exceeded the processor's dynamic range, so you know
that the result is bogus. Such situations are rare in integer applications.
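To make the overflow case concrete, here is a minimal Python sketch that models a signed 32-bit addition. It is an illustration under stated assumptions, not a description of any particular processor: the helper add_int32 and its returned flag are hypothetical stand-ins for the wrapped register value and the status bit described above.

```python
from typing import Tuple

INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1


def add_int32(a: int, b: int) -> Tuple[int, bool]:
    """Add two signed 32-bit values the way a 32-bit register would hold them.

    Returns (wrapped_result, overflow), where `overflow` models the status bit
    that tells the program the wrapped result is not the true sum.
    """
    true_sum = a + b
    overflow = not (INT32_MIN <= true_sum <= INT32_MAX)
    # Wrap the true sum into the signed 32-bit range (two's complement).
    wrapped = (true_sum + 2 ** 31) % 2 ** 32 - 2 ** 31
    return wrapped, overflow


# One past the largest positive value wraps around to the most negative one.
result, overflowed = add_int32(INT32_MAX, 1)
print(result, overflowed)
```

A program that checks the flag after each operation can fall back to a wider integer type when the 32-bit result is not trustworthy.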
Programmers who run into integer overflow or underflow problems on a 32-bit
platform do have the option of using a 64-bit integer construct provided by a higher
level language like C. In such cases, the compiler uses two registers per integer, one
for each half of the integer, to do 64-bit calculations in 32-bit hardware. This has
obvious performance drawbacks, making it less desirable than a true 64-bit integer
implementation.
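The two-register approach can be sketched as follows. This is a simplified model, assuming unsigned values and a plain add-then-add-with-carry sequence; the helper names split64, join64 and add64_on_32bit are invented for this example.

```python
MASK32 = 0xFFFFFFFF


def split64(x):
    """Split a 64-bit value into (low, high) 32-bit halves."""
    return x & MASK32, (x >> 32) & MASK32


def join64(lo, hi):
    """Recombine two 32-bit halves into one 64-bit value."""
    return (hi << 32) | lo


def add64_on_32bit(a_lo, a_hi, b_lo, b_hi):
    """64-bit addition done as two 32-bit additions.

    Step 1 adds the low halves and records the carry out of bit 31.
    Step 2 adds the high halves plus that carry.
    """
    lo_sum = a_lo + b_lo
    carry = lo_sum >> 32            # 1 if the low half overflowed, else 0
    lo = lo_sum & MASK32
    hi = (a_hi + b_hi + carry) & MASK32
    return lo, hi


a = 0x00000001FFFFFFFF
b = 0x0000000000000001
total = join64(*add64_on_32bit(*split64(a), *split64(b)))
print(hex(total))
```

Even this simple addition needs two dependent operations where a 64-bit ALU needs one, which is where the performance drawback comes from.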
Finally, there is another application domain for which 64-bit integers can offer real
benefits: cryptography. Most popular encryption schemes rely on the multiplication
and factoring of very large integers and the larger the integers the more secure the
encryption.
64-bit integer code runs slowly on a 32-bit machine, due to the fact that the 64-bit
computations have to be split apart and processed as two separate 32-bit
computations. So you could say that there's a performance penalty for running 64-bit
integer code on a 32-bit machine; this penalty is absent when running the same code
on a 64-bit machine, since the computation doesn't have to be split in two. The take-
home point here is that only applications that require and use 64-bit integers will see
a performance increase on 64-bit hardware that is due solely to a 64-bit processor's
wider registers and increased dynamic range.
64 bit Architectures
Let’s discuss the 64-bit architectures from the two leading processor manufacturers,
AMD and Intel (AMD’s Opteron and Intel’s Itanium).
Intel 64-bit architecture (IA-64)
IA-64 uses a technique called VLIW, which stands for “Very Long Instruction
Word”. Processors that use this technique access memory by transferring long
program words, with several instructions packed into each word. In the case of
IA-64, three instructions are packed into each 128-bit bundle. As each instruction has
41 bits, there are 5 bits left over, which indicate the kinds of instructions that
were packed. Figure 1 shows the instruction packaging scheme. This packaging
reduces the number of memory accesses and leaves to the compiler the task of grouping
the instructions to get the best out of the architecture.
Figure 1: Instruction packaging used in the IA-64 architecture.
As already mentioned, the 5-bit field, called the “template”, indicates the
kinds of instructions that are packed. Those 5 bits allow 32 possible kinds of
packaging, which are in fact reduced to 24, since 8 are not used. Each instruction
uses one of the CPU's execution unit types, which are listed below and can be
identified in the figure given below.
Unit I - integer operations
Unit F - floating-point operations
Unit M - memory access
Unit B - branches
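The 128-bit packaging described above (three 41-bit instructions plus a 5-bit field) can be illustrated with a small packing routine. This is a sketch: the function names are hypothetical, and placing the 5-bit field in the lowest bits is an assumption made for the example.

```python
FIELD_BITS = 5      # 5-bit field describing the kinds of instructions packed
SLOT_BITS = 41      # each instruction occupies 41 bits
SLOT_MASK = (1 << SLOT_BITS) - 1
FIELD_MASK = (1 << FIELD_BITS) - 1

assert FIELD_BITS + 3 * SLOT_BITS == 128


def pack_bundle(field, slots):
    """Pack a 5-bit field and three 41-bit instructions into one 128-bit word."""
    if not 0 <= field <= FIELD_MASK:
        raise ValueError("field must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = field
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (FIELD_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Recover the 5-bit field and the three instructions from a bundle."""
    field = bundle & FIELD_MASK
    slots = [
        (bundle >> (FIELD_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(3)
    ]
    return field, slots


bundle = pack_bundle(0b10000, [0x1AAAAAAAAAA, 0x0123456789A, 0x1FFFFFFFFFF])
print(unpack_bundle(bundle))
```

One memory access fetches the whole bundle, so the processor receives three instructions and the information about which units they need at the same time.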
The processor that Intel designed to execute these instructions, called
Itanium, is versatile and promises performance through the simultaneous
(parallel) execution of up to 6 instructions. The figure below shows a block diagram
of this architecture, which uses a 10-stage pipeline.
Block diagram of the Itanium CPU (IA-64 architecture).
The basic structural unit of the Itanium looks like the picture above. According to
Intel, the data bus can sustain a data rate of 2.1GB/sec. The Itanium processor contains
4 integer ALUs, 4 multimedia ALUs, 2 AGUs, 3 branch units and 4 FPUs for
floating-point arithmetic. The processor is theoretically capable of
performing 20 operations in one clock cycle by loading 16 operands and evaluating 4
ALU operations. This figure should not be confused with the number of
instructions that can be issued in one clock cycle, namely six. The instructions are
retrieved from memory and are bundled by a process called bundle rotation; this
prepares the execution of parallel instructions at the hardware level. The instructions
are fetched from the cache speculatively. All this is implemented with the help of 128
floating-point registers, 128 integer registers and 8 branch registers, all of which
explicitly support 64 bits.
The IA-64 architecture goes by the acronym EPIC, which stands for “Explicitly Parallel
Instruction Computing”. With this name, Intel means that the compiler is
primarily responsible for identifying and exposing the parallelism present in the
instructions to be executed. EPIC combines three concepts: speculation,
predication and explicit parallelism.
Next, we will briefly look at each of them.
Explicit parallelism:
Instruction-level parallelism (ILP) is the ability to execute multiple instructions
at the same time. As we have seen, the IA-64 architecture allows independent
instructions to be packed for parallel execution and, in each clock period, can
handle multiple bundles. Thanks to the large number of parallel resources, including
the many registers and multiple execution units, the compiler can manage and
schedule the parallel computation. Compilers for
traditional architectures are limited in how much they can speculate, because there is
not always a way to be sure that the speculation will be handled correctly by the
processor. The IA-64 architecture allows the compiler to exploit speculative
information without sacrificing the correct execution of an application.
The IA-64 architecture has mechanisms, namely instruction templates, branch hints
and cache hints, that allow the compiler to pass to the processor information
obtained at compile time. That information minimizes the penalties
that come from branches and cache misses.
Speculation:
The Itanium can load instructions and data onto the CPU before they're actually
needed or even if they prove not to be needed, effectively using the processor itself
as a cache. Presumably, this early loading is done when the processor is otherwise
idle. The advantage gained by speculation limits the effects of memory latency by
allowing loading of data before it is needed, thus making it ready to go the moment
the processor can use it.
There are two kinds of speculation: data and control. With speculation, the
compiler moves an operation earlier so that its latency (time spent) is removed
from the critical path. Speculation is a way of letting the compiler keep
slow operations from spoiling the parallelism of the instructions. Control speculation
is the execution of an operation before the branch that precedes it. Data
speculation, on the other hand, is the execution of a memory load before a store that
precedes it and with which it may conflict.
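Data speculation can be pictured with a toy model: a load is hoisted above a store, the machine remembers which address was loaded early, and a later check redoes the load only if an intervening store touched that address. The class and method names below are hypothetical, and the model ignores everything except the conflict check.

```python
class ToyMachine:
    """Toy model of a load that is moved ahead of a possibly conflicting store."""

    def __init__(self):
        self.memory = {}
        self.tracked = {}   # register name -> address that was loaded early

    def advanced_load(self, reg, addr):
        """Load early and remember the address so conflicts can be detected."""
        self.tracked[reg] = addr
        return self.memory.get(addr, 0)

    def store(self, addr, value):
        """Store a value and invalidate any early load from the same address."""
        self.memory[addr] = value
        for reg in [r for r, a in self.tracked.items() if a == addr]:
            del self.tracked[reg]

    def check_load(self, reg, addr, speculative_value):
        """Keep the early value if still valid; otherwise load again."""
        if self.tracked.pop(reg, None) == addr:
            return speculative_value
        return self.memory.get(addr, 0)


m = ToyMachine()
m.memory[100] = 7

# No conflict: the store goes to a different address, the early value stands.
early = m.advanced_load("r1", 100)
m.store(200, 9)
no_conflict = m.check_load("r1", 100, early)

# Conflict: the store hits the loaded address, so the check reloads.
early = m.advanced_load("r1", 100)
m.store(100, 42)
conflict = m.check_load("r1", 100, early)

print(no_conflict, conflict)
```

In the common case where the store does not conflict, the load latency has already been paid before the value is needed.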
Speculation Benefits:
• Reduces impact of memory latency.
• Performance improvement of 79% when combined with predication.
• Greatest improvement to code with many cache accesses: large databases and
operating systems.
• Scheduling flexibility enables new levels of performance headroom.
Predication:
Branch prediction is used in today's processors. However, much processor
time is spent on calculations for branches that turn out not to be needed.
Predication is a compiler-based technique that reduces the dependence on
predicting which code branches will actually be taken, thus limiting unneeded
calculations.
With predication, the compiler marks all the paths of a conditional
branch with predicates. The paths are then sent for execution in parallel, but only
the results of the necessary one are kept. It is therefore possible to start executing
instructions even before the conditional branch has been resolved. Besides removing
branches by means of predicates, the IA-64 architecture has a series of mechanisms
that should reduce branch mispredictions and the cost incurred when a misprediction
happens.
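The idea of replacing a branch with predicated operations can be sketched in Python. This is an illustrative model only: the function names are made up, and the loop that commits a result stands in for hardware that discards the result of any instruction whose predicate is false.

```python
def branchy_abs_diff(a, b):
    """Conventional version: the processor must guess which path is taken."""
    if a > b:
        return a - b
    else:
        return b - a


def predicated_abs_diff(a, b):
    """Predicated version: both paths are computed, one result is committed."""
    # A single compare produces a predicate and its complement.
    p_true = a > b
    p_false = not p_true

    # Both paths are evaluated up front, with no branch between them.
    guarded_results = [(p_true, a - b), (p_false, b - a)]

    # Only the result whose predicate is set is kept; the other is dropped.
    result = None
    for predicate, value in guarded_results:
        if predicate:
            result = value
    return result


print(predicated_abs_diff(3, 10), predicated_abs_diff(10, 3))
```

The predicated version does some work that is thrown away, but there is no branch left to mispredict.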
Predication Benefits:
• Reduces branches and mispredict penalties.
• Parallel compares further reduce critical paths.
• Greatly improves code with hard-to-predict branches.
• Large server apps: capacity limited.
• Sorting, data mining: large database apps.
• Data compression.
• Traditional architectures’ “bolt-on” approach can’t efficiently approximate predication.
• Cmove: 39% more instructions, 30% lower performance.
• Instructions must all be speculative.
The IA-64 architecture has a great number of registers. There are 128 integer
registers, 128 floating-point registers, 64 predicate registers of 1 bit, and many other
registers for configuration, management and monitoring of the CPU’s performance.
Rotating Registers
On top of the frames, there's register rotation, a feature that helps loop unrolling
more than parameter passing. With rotation, Itanium can shift up to 96 of its
general-purpose registers (the first 32 are still fixed and global) by one or more
apparent positions. Why? So that iterative loops that hammer on the same register(s)
time after time can all be dispatched and executed at once without stepping on
each other. Each instance of the loop actually targets different physical registers,
allowing them all to be in flight at once.
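A simplified model of the rotation shows how the same register name can land on a different physical register in each loop iteration. The sketch assumes 32 fixed registers followed by 96 rotating ones, as described above; the name physical_register, the rotation-base variable and the choice to step the base down by one per iteration are assumptions of this example.

```python
NUM_STATIC = 32      # registers 0-31 are fixed and global
NUM_ROTATING = 96    # registers 32-127 can rotate


def physical_register(logical, rotation_base):
    """Map a register name used by the program to a physical register."""
    if not 0 <= logical < NUM_STATIC + NUM_ROTATING:
        raise ValueError("register number out of range")
    if logical < NUM_STATIC:
        return logical                      # fixed registers never move
    offset = (logical - NUM_STATIC + rotation_base) % NUM_ROTATING
    return NUM_STATIC + offset


# Three iterations of a loop that always writes "register 32".
rotation_base = 0
targets = []
for _ in range(3):
    targets.append(physical_register(32, rotation_base))
    rotation_base = (rotation_base - 1) % NUM_ROTATING  # rotate by one

print(targets)
```

Because each iteration writes a different physical register, a later iteration does not overwrite a value that an earlier one is still using.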
If this sounds a lot like register renaming, it is. Itanium's register-rotation feature is
less generic than all-purpose register renaming like Athlon's, so it's easier to
implement and faster to execute. Chip-wide register renaming like Athlon's adds gobs
of multiplexers, adders, and routing, one of the big drawbacks of a massively out-of-
order machine. On a smaller scale, ARM used this trick with its ill-fated Piccolo DSP
coprocessor. At the high end, Cydrome also used this technique, a favorite feature
that Cydrome alumnus and Itanium team member Bob Rau apparently brought with
him.
So IA-64 has two levels of indirection for its own registers: the logical-to-virtual
mapping of the frames and the virtual-to-physical mapping of the rotation. All this
means that programs usually aren't accessing the physical registers they think they
are, but that's nothing new to high-end microprocessors. Arcane as it seems, this
method still uses less hardware trickery than the full register renaming of Athlon,
Pentium III, or P4.
Intel promises compatibility with 32-bit (IA-32) software. Such software should run
without any change, since the operating system and the firmware have features for
that. It should be possible to run software in real mode (16 bits), protected mode (32
bits) and virtual 8086 mode (16 bits). This means that the CPU will be able to operate
in IA-64 mode or IA-32 mode. There are special instructions to go from one mode to
the other, as shown in Figure 3.
Figure 3: Model of instruction sets transition.
The three instructions that make the transition between the instruction sets are:
• JMPE (IA-32): jumps to a 64-bit instruction and changes to IA-64 mode;
• br.ia (IA-64): branches to a 32-bit instruction and changes to IA-32 mode;
• rfi (IA-64): returns from an interruption; the return goes to either IA-32
mode or IA-64 mode, depending on the mode in effect at the moment when
the interruption was invoked.
In addition, interruptions switch to IA-64 mode, which allows all interruption
conditions to be handled.
Athlon 64 and AMD's 64-bit technology
64-bit architecture
Introduction:
To get a first idea of how the 64-bit architecture works and how it differs
significantly from a 32-bit implementation, it is useful to consider one definition first:
"A 64-bit processor is a microprocessor with a word size of 64 bits, a
requirement for memory and data intensive applications such as computer-aided
design (CAD) applications, database management systems, technical and scientific
applications, and high-performance servers. 64-bit computer architecture provides
higher performance than 32-bit architecture by handling twice as many bits of
information in the same clock cycle."
The key points of this definition give a
rough idea that one can now process not only 2^32 = 4294967296 basic units of
information, but 2^64 = 18446744073709551616 units. The numbers are quite
impressive and show that the architecture level has to be updated accordingly.
Several companies have implemented 64-bit processors, but the
two main ones are AMD and Intel. Other enterprises certainly have their place
in the development of 64-bit processors, too, but the mainstream market is going to
see mostly the products from AMD and Intel. It is therefore reasonable to explain how
those two companies designed their 64-bit processors; only
details need to be considered in translating the two specific layouts and
implementations to the general concept. There are considerable differences in how the
two companies chose to make 32-bit programs work with the 64-bit architecture, and
those differences will be outlined in the 32-bit part of this document. The following
part outlines the structure of a "pure" 64-bit architectural level. Not much
public information is available about the physical structure of current 64-bit
processors, because neither AMD nor Intel wants to provide crucial information to its
rival on the processor market. It is therefore useful to focus on the instruction set
architecture (ISA) and the general differences between a 32-bit processor and the
new 64-bit one.
With the successful introduction of the Opteron processor, AMD completed one half of
its forecast entry into the 64-bit processing world. Based on an evolution of the
x86 instruction set used by current 32-bit processors made by Intel and AMD, the
Opteron is targeted at the high to mid-range server and workstation market.
The second processor released under the AMD64 architecture will be the Athlon 64,
formerly known as 'ClawHammer,' which aims to bring 64-bit computing power to
the desktop and mobile markets. The Athlon 64 will be a slightly hobbled version of
the Opteron and, with its built-in compatibility with current software and operating
systems, will attempt to bridge the gap between 32-bit and 64-bit computing
environments.
We will focus on the Athlon 64 and what it will offer to home users and PC
enthusiasts, as well as covering the important details of the AMD64 platform. The
Opteron and the Athlon 64 share an identical base architecture.
AMD has positioned the Opteron as the solution to many system needs, with the
primary goal of providing a 64-bit physical architecture while supplying high-end
performance for both 64- and 32-bit software. This translates into architectural
advantages such as 64-bit data and address pathways, upgraded physical and virtual
memory addressing, and a true 64-bit internal design.
The other main innovation has been to move key Northbridge functions from the
system chipset directly into the Opteron core. These include a memory controller,
multiprocessing control, and data flow, along with a bridge to peripheral data traffic.
Traditional Southbridge and AGP components are still present in the Opteron
architecture, but AMD's eighth-generation processor has absconded with the main
performance and CPU-centric duties.
Opteron Microarchitecture
The Opteron core resembles the basic design of the Athlon XP, but the move to a 64-
bit architecture has brought some inherent advantages. Both the Opteron and Athlon
XP contain a few similar features, such as 64K apiece of Level 1 data and instruction
cache and three apiece of integer and floating-point units, but there have been some
noted improvements elsewhere. In terms of basic features, the Opteron includes a
full 1MB of Level 2 cache on the inside, along with an integrated heat spreader and
new Socket 940 packaging on the outside.
Looking a bit deeper, AMD has improved on its seventh-generation design in other
ways. A processor's registers are like miniature cache areas where crucial data is
stored and retrieved; the Opteron features eight more general-purpose registers, and
these have been extended to 64 bits. AMD has also added eight 128-bit Streaming
SIMD Extension (SSE) registers for multimedia instructions, as well as compatibility
with the SSE2 instructions that premiered in Intel's Pentium 4.
The chip's translation look-aside buffers are larger and offer lower latencies than
those of the Athlon XP. Branch prediction is also enhanced, including an increase to
16K bimodal/history counters, or four times the level found on the Athlon XP.
This last note is important, because in order to provide higher frequencies and better
scalability, AMD has extended the Opteron pipelines. The Opteron features a 12-
stage integer operation pipeline (versus 10 stages for the Athlon XP) and a 17-stage
floating-point operation pipeline (versus 15 for the Athlon XP). While this pays
dividends on higher potential clock speeds, it also incurs a risk of increased prediction
misses, so AMD has adjusted the architecture to provide even higher pipeline
efficiencies than the Athlon XP.
The Opteron also has built-in core logic to support multiprocessor systems without
the need for a Northbridge chip. Internal CPU data traffic is all routed through a
crossbar (XBAR) communications architecture, which shuttles command and data
information between the CPU, memory controller, and three HyperTransport links.
This is a huge technological leap for multiprocessor workstation and server designs,
as it provides a true standard for OEMs to work with, and takes the Northbridge
component out of the equation.
Dual-Channel Memory, More Or Less
The AMD Opteron includes an integrated memory controller, capable of supporting
DDR200 through DDR333 speeds and a maximum of eight DIMM memory modules
per processor. The controller provides up to 5.3GB/sec of memory bandwidth (with
333MHz DDR), yielding higher memory performance, lower memory latencies, and
performance levels that can scale to processor frequencies.
Since each CPU has its own memory controller, memory bandwidth will also scale in
multiprocessor systems. For example, a 2-way Opteron workstation will yield
10.6GB/sec of memory bandwidth, while a 4-way Opteron server will double this
again to an incredible 21.3GB/sec, along with supporting up to 32 DDR DIMMs.
The Opteron's integrated memory controller has been referred to as a dual-channel
design, but this isn't the exact truth. It certainly delivers double the bandwidth of a
single-channel controller, but does so by taking two 64-bit DDR modules and viewing
them as a single 128-bit DIMM with a corresponding 128-bit data path. This is similar
to the design of Intel's dual-channel DDR chipsets such as the E7205 and 875P, but
different from the true dual-channel memory architecture of the NVIDIA nForce2.
This is actually a smart call when it comes to building an integrated memory
controller, as for all intents and purposes, the bandwidth and performance are
equivalent, but the 128-bit memory bus is more streamlined. In the Opteron
architecture, there is no need for an arbiter chip to handle traffic along the dual
physical memory channels, and no requirement for extra controller hardware. Of
course, due to the "single-channel 128-bit" memory architecture, the pairs of DDR
modules must be matched in size, speed, and chip-count, though not necessarily in
manufacturer.
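The bandwidth figures quoted in this section follow from simple arithmetic: bytes per transfer times transfers per second, times the number of processors. The sketch below assumes the 128-bit data path described above and treats DDR333 as 333 million transfers per second; the function name is made up for illustration, and the result is a theoretical peak rather than a measured value.

```python
def peak_bandwidth_gb_s(bus_bits, mega_transfers_per_sec, cpus=1):
    """Theoretical peak memory bandwidth in GB/sec (1 GB = 1000 MB here)."""
    bytes_per_transfer = bus_bits // 8
    mb_per_sec = bytes_per_transfer * mega_transfers_per_sec * cpus
    return mb_per_sec / 1000


for cpus in (1, 2, 4):
    print(cpus, "CPU(s):", peak_bandwidth_gb_s(128, 333, cpus), "GB/sec")
```

Since every processor brings its own controller, the peak scales linearly with the processor count, which matches the 2-way and 4-way figures given earlier up to rounding.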
AMD's 64-bit platform
To access an area in the computer's physical memory (RAM) to store or retrieve data,
the processor needs the address of that location, which is an integer number
representing one byte of memory storage.
Suddenly, having 64-bit registers makes sense: while a 32-bit processor can
access up to 4.3 billion memory addresses (2^32) for a total of about 4GB of physical
memory, a 64-bit processor could conceivably access over 18 exabytes of physical
memory. This is the one area that clearly shows why 64-bit processors are the future
of computing, as demanding applications such as databases have long been scraping
against the 4GB memory ceiling.
If you are a business with a database of a terabyte or more of information, 64-bit
processors look pretty good right now.
Formerly known as x86-64, the AMD64 architecture is AMD's method of
implementing 64-bit processors.
AMD64 is massively different from Intel's approach to 64-bit processors as seen in
its Itanium line. While Intel used a completely different architecture for the Itanium
chips, forcing software developers either to learn a new architecture in order to
program for them or to use emulation, which slows down performance, AMD decided
simply to extend the existing x86 architecture (the foundation of all PCs since Intel
developed the 8086 processor in 1978) to accommodate 64-bit registers as mentioned above.
There are several advantages to this. First, obviously, reworking code for AMD 64-bit
processors should be considerably easier, since the basis is the same. Secondly, the
AMD64-based Opteron and Athlon 64 are fully compatible with 32-bit applications.
A system based on either of these processors can use a 32-bit operating system and
software without a hitch, providing a stress free upgrade path for businesses and
opening up the desktop market to 64-bit processors, and more specifically, AMD's
Athlon 64.
AMD accomplishes this by enabling the AMD64 processors to run in one of two
modes, Legacy mode and Long mode. Legacy mode removes all 64-bit support and
enables the processor to run strictly in 32-bit mode, necessary for running most
current operating systems, including Windows. Long mode comprises two
submodes, Compatibility mode and 64-bit mode.
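The mode hierarchy can be written down as a small decision function. This is a sketch of the classification only: the two boolean inputs (whether long mode is active, and whether the currently running code is 64-bit code) and all names are illustrative choices for this example, not an API.

```python
from enum import Enum


class Mode(Enum):
    LEGACY = "Legacy mode"
    COMPATIBILITY = "Long mode: Compatibility mode"
    SIXTY_FOUR_BIT = "Long mode: 64-bit mode"


def operating_mode(long_mode_active, running_64bit_code):
    """Classify the processor's operating mode from two conditions."""
    if not long_mode_active:
        # Without long mode the processor behaves as a 32-bit x86 chip.
        return Mode.LEGACY
    if running_64bit_code:
        return Mode.SIXTY_FOUR_BIT
    # A 64-bit operating system running an unmodified 32-bit application.
    return Mode.COMPATIBILITY


for lma in (False, True):
    for code64 in (False, True):
        print(lma, code64, "->", operating_mode(lma, code64).value)
```

Under a 64-bit operating system the submode can differ from one application to the next, which is what lets old 32-bit programs and new 64-bit programs share one machine.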
Compatibility mode is designed for a 64-bit operating system, such as Microsoft's
impending 64-bit versions of XP and Server 2003 (due late this year or early in the
next), running 32-bit software such as current databases. The advantage of this is
that each 32-bit application, though still limited by the 4GB memory limit, can have
all of that 4GB to itself with no overhead for the operating system, since that will use
64-bit addressing and can thus access additional memory space.
This provides some improved performance for demanding 32-bit apps before they are
ported over to 64-bit. 64-bit mode is intended for a pure 64-bit environment,
operating system and software, and offers one huge advantage.....
AMD - Instruction Set Architecture:
The instructions are organized into the following basic groups (see the AMD manual
again, pages 38/39):
1. General Purpose Instructions: The basic integer instructions, which are used
nearly everywhere. They are often referred to as the x86 instruction set and are
easily illustrated by examples like integer addition, moves, loads, stores, shifts
and so on.
2. 128-Bit Media Instructions: Named after their primary application, these
instructions operate on vectors of packed data (e.g. for video, scientific
applications, games, etc.). Moreover, they operate in parallel: a single
instruction processes multiple data elements at once (SIMD). These
instructions are designed for speed in one special field of applications and
are therefore not suited to general-purpose tasks.
3. 64-Bit Media Instructions: Also SIMD instructions, and not much different in
use from the 128-bit instructions.
4. Floating Point Instructions: As the general purpose instructions only work on
integers, these instructions provide a suitable tool for floating point operations.
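To make the idea of the media instructions more concrete, the following Python sketch imitates what one packed integer addition does: a 128-bit value is treated as four independent 32-bit lanes that are added in one step. The helper names pack, unpack and packed_add are our own, and the code is a software model only, not a description of how the hardware is built.

```python
LANES = 4                          # four 32-bit lanes in one 128-bit value
LANE_BITS = 32
LANE_MASK = (1 << LANE_BITS) - 1


def pack(lanes):
    """Pack four unsigned 32-bit values into one 128-bit integer (lane 0 lowest)."""
    value = 0
    for i, lane in enumerate(lanes):
        value |= (lane & LANE_MASK) << (i * LANE_BITS)
    return value


def unpack(value):
    """Split a 128-bit integer back into its four 32-bit lanes."""
    return [(value >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def packed_add(a, b):
    """Add lane by lane; each lane wraps on its own, no carry crosses lanes."""
    return pack([(x + y) & LANE_MASK for x, y in zip(unpack(a), unpack(b))])


a = pack([1, 2, 3, 0xFFFFFFFF])
b = pack([10, 20, 30, 1])
print(unpack(packed_add(a, b)))    # [11, 22, 33, 0]
```

The last lane shows the difference from an ordinary 128-bit addition: the overflow of 0xFFFFFFFF + 1 stays inside its lane instead of carrying into a neighbour.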
When LMA (Long Mode Active) is set, which is normally arranged by the operating
system, the processor's full 64-bit capabilities become available. This is the stage
we would like to call "pure" 64-bit mode, and such a mode can be recognized in both
architectures, the one described here from AMD and the Intel IA-64 described elsewhere
in this article. For the following part of the analysis we assume that LMA is set and
the processor is in "pure" 64-bit mode, which is not to be confused with legacy mode
or long mode's compatibility mode; those are features to support the transition from
32-bit machines and software to the new architecture, and they belong to the 32-bit
discussion rather than here. The default operand size in 64-bit mode is 32 bits,
while the default address size is 64 bits. The REX prefix, a new instruction prefix
that also gives access to the 8 new GPRs R8-R15, specifies whether an instruction
accepts this default or extends its operands to 64 bits (the 64-bit registers are
widened versions of the existing 32-bit registers). This means that some opcodes had
to be redefined to make room for the prefix. Nevertheless, these are only minor
changes and most of the opcode map is carried over from a 32-bit processor. The
memory is a single flat address space starting at address 0 and extending linearly
over the 64-bit virtual address range. The operating system can specify several levels
of data access/protection for the address space. The base addresses of the segment
registers used for ordinary memory accesses are treated as 0, so segmentation is
effectively switched off; of the segment registers, only FS and GS can still supply
a non-zero base. This is a real simplification compared to 32-bit processing and all
the compatibility modes offered by AMD. Architecturally it is just plain memory
addressing from 0 to 2^64 - 1 without any special cases, although actual processors
may implement fewer address bits. This concept shows on the micro level what the goal
of the complete architecture is: the search for more simplicity, more raw computing
power and preparation for large amounts of data. Another cornerstone of this path is
that virtual addresses are translated into physical memory by paging alone; since
segmentation adds no offset, paging can be performed on the virtual address directly.
The bytes themselves are stored in little-endian order, and this applies to data as
well as to the immediates and displacements inside instructions. The instructions do
not really "change" in the sense that a structural redesign has happened. The size of
the operands is the crucial factor. Consider for example this instruction:
48 B8 1234567812345678. The 48 is a REX prefix that selects 64-bit operands. The
opcode B8 is also used in the 32-bit architecture, and the remaining part is just an
8-byte immediate value: we are computing with a 64-bit processor.
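The byte layout of such an instruction can be reproduced with a short Python sketch. In AMD64 terms the leading 48 is a REX prefix with its W bit set, B8 is the opcode for moving an immediate into the accumulator, and the immediate follows in little-endian order. The function names decode_rex and encode_mov_rax_imm64 and the immediate value used below are chosen for illustration only.

```python
import struct


def decode_rex(byte):
    """Split a REX prefix (bit pattern 0100WRXB) into its four flag bits."""
    if byte >> 4 != 0b0100:
        raise ValueError("not a REX prefix")
    return {
        "W": (byte >> 3) & 1,   # 1 selects a 64-bit operand size
        "R": (byte >> 2) & 1,   # the remaining bits extend register numbers
        "X": (byte >> 1) & 1,
        "B": byte & 1,
    }


def encode_mov_rax_imm64(value):
    """Encode MOV RAX, imm64: REX.W (48), opcode B8, 8-byte little-endian immediate."""
    return bytes([0x48, 0xB8]) + struct.pack("<Q", value)


print(decode_rex(0x48))
# {'W': 1, 'R': 0, 'X': 0, 'B': 0}
print(encode_mov_rax_imm64(0x1122334455667788).hex(" "))
# 48 b8 88 77 66 55 44 33 22 11
```

Note how the immediate appears byte-reversed in the encoded instruction; this is the little-endian ordering at work. Without the 48 prefix, the same opcode B8 takes only a 4-byte immediate and loads the 32-bit register.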
There exist five addressing modes:
• Absolute Address: given as a displacement from the segment base (for 64-bit mode just 0)
• Instruction-Relative Address: relative to the instruction pointer (IP), i.e. the
program counter (PC)
• Stack Address: using the stack pointer
• String Addresses
• Mod R/M Address
And again one realizes that there are no real differences in structure compared to
non-64-bit ISAs. The PC, the stack and absolute addressing just carry over with
more bits. The RIP (the 64-bit instruction pointer / program counter) keeps its
function, but in 64-bit mode it can also serve as the base for relative addressing,
which provides a more efficient way to directly access code and data near the current
instruction. This is one reason why a significant increase in speed is claimed for the
AMD 64-bit architecture - direct access relative to the program code.
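How an instruction-relative address is formed can be shown with a small calculation. The sketch below assumes the usual rule that the displacement is a signed 32-bit value counted from the address of the next instruction; the function name rip_relative_target and the example addresses are hypothetical.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def rip_relative_target(instr_address, instr_length, disp_bytes):
    """Return next-instruction address plus signed 32-bit displacement."""
    (disp,) = struct.unpack("<i", disp_bytes)    # little-endian, signed
    next_rip = instr_address + instr_length      # RIP already points past the instruction
    return (next_rip + disp) & MASK64            # wrap to 64 bits


# Hypothetical 7-byte instruction at 0x401000 with displacement -16 (F0 FF FF FF).
print(hex(rip_relative_target(0x401000, 7, bytes.fromhex("f0ffffff"))))
# 0x400ff7
```

Because the target depends only on the distance from the instruction itself, the same code bytes work wherever the program is loaded, which is what makes this addressing form convenient for position-independent code.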
For absolute addressing it gets even easier due to the common standard base 0. The
same holds for pointers in general. As the segment registers no longer contribute to
the address, the concept of far pointers, which store a segment address in addition to
the usual address, is no longer needed, as the memory is just one linear chunk. Near
pointers are enough, and for 64-bit applications on the AMD architecture one can return
to the general term pointer, as it is obvious that it can only point into one data
segment. Immediates and displacements in general remain at most 32 bits in size but
are sign-extended to 64 bits where needed.
This finishes the broad outline of the instruction set architecture for AMD based on
the document mentioned above. Their philosophy to keep it simple and easy becomes
apparent, but this is only true for AMD, not for 64-bit processors in general. Other
designs might demand more sophisticated instruction sets and might not build upon
established concepts. One does have to know some further technical details, which we
do not emphasize here: the new registers must be taken into account, and therefore the
number of possible combinations for addressing and declaring operands correctly rises,
but the level of complexity does not rise significantly for AMD. Outlining the new
instructions for every new register would be tedious and cumbersome work, and would
only be valid for the ISA of AMD.
Memory Controllers and Hypertransport
Both the Opteron and the Athlon 64 contain 8 extra registers useable only in 64-bit
mode, which should increase application performance significantly.
One of the largest problems in modern computer design is the presence of
bottlenecks, or areas of low performance which slow an otherwise fast system down.
In most modern computers, data intended for the video and main memory needs to
be passed to and through the Northbridge chip on the motherboard, and data from
other sources like USB connections, PCI slots or hard drives must pass through the
Southbridge chip, then the Northbridge.
With the amount of information that needs to be squeezed through the various data
buses into the processor to be operated on, bottlenecks inevitably develop, where the
processor is waiting for the necessary bits to be delivered by the I/O subsystem
feeding it.
As processors get consistently faster every few months, while data bus
breakthroughs are irregular, the issue perpetuates itself.
AMD has attempted to get around this constant problem by equipping its 64-bit
processors with two advantages: internal DDR memory controllers and
HyperTransport links. AMD has built the memory controller (normally a part of the
motherboard to which the processor is attached) directly into their Opteron and
Athlon 64 CPUs.
As you can imagine, this considerably reduces the time it takes the processor
to access memory. Data still needs to travel between the processor and the
physical memory, but communication with the controller that arranges the data flow
no longer needs to pass outside the processor, reducing the number of computing
cycles lost while waiting for the memory.
Another benefit is the fact that memory traffic no longer needs to run between the
processor and the Northbridge chip on the motherboard, which traditionally provides
the memory controller, reducing bottlenecks. The second part of the package is
support for HyperTransport input/output technology.
HyperTransport™ technology
HyperTransport™ technology is a high-speed, low latency, point-to-point link
designed to increase the communication speed between integrated circuits in
computers, servers, embedded systems, and networking and telecommunications
equipment up to 48 times faster than some existing technologies.
HyperTransport™ technology helps reduce the number of buses in a system, which
can reduce system bottlenecks and enable today's faster microprocessors to use
system memory more efficiently in high-end multiprocessor systems.
HyperTransport™ technology is designed to:
• Provide significantly more bandwidth than current technologies
• Use low-latency responses and low pin counts
• Maintain compatibility with legacy PC buses while being extensible to new
SNA (Systems Network Architecture) buses
• Appear transparent to operating systems and offer little impact on peripheral drivers
Conclusion
With this article and the previous one, which cover the 64-bit architectures by Intel
and AMD, we have finished our discussion of the processors for the beginning of the
millennium. In addition, it is important to mention that there already are computers
running 64-bit versions of Windows and Linux. Now, more than performance, our
biggest concern is compatibility with our present programs. We really have to
verify how far those 64-bit architectures are compatible with our 32- or 16-bit
programs. We hope to have the answer to this question in less than a year. To finish
this part on 64-bit CPUs, it is very good to see how the two companies compete in the
market for high performance processors. This grants us access to even cheaper and
better computers.
To conclude, we would like to comment on the great room that still exists for the
evolution of electronics and consequently for the evolution of computers. More
important than the creation of supercomputers, this new age will see the
pervasiveness of computers. It will be the time of invisible computers. They will be
present in nearly all modern devices. At the moment they inhabit our TV sets,
microwave ovens, cars, watches, stereos, DVD players, etc. In the near future, they
will invade the refrigerator, the toaster, the air-conditioner and all everyday
appliances. We have gone beyond the cheap electronics age and we are entering the
cheap intelligence age.
References
References for this part are basically placed in the appropriate positions - this list gives an
overview:
- Search390.com:
http://search390.techtarget.com/sDefinition/0,,sid10_gci498697,00.html
- Hammer Review A1-Electronics:
http://www.a1-electronics.co.uk/AMD_Section/CPUs/Hammer_Review_pg2.shtml
- Article X86-64 Hardware site: http://www.hardwaresite.net/x86-64.html
- AMD Developer's Manual X86-64:
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24592.pdf
- Article IA-64 Hardware site: http://www.hardwaresite.net/ia64.html
- Presentation IA-64:
http://www.eg.bucknell.edu/~bsprunt/comp_arch/intel/ia64_tutorial.pdf
- Software Developer's Manual Itanium:
http://developer.intel.com/design/itanium/manuals/245317.pdf
- Hardware Developer's Manual Itanium:
http://developer.intel.com/design/itanium/downloads/248701.htm
- AMD Opteron video:
http://www.amd.com/us-en/assets/content_type/DigitalMedia/AMD_Opteron.wmv
- Article 64-bit computing: c't 12/99 page 28
- Basic notations, definitions and concepts are taken from "Computer
Organization and Design", Hennessy and Patterson
