
(6.1)
Central Processing Unit Architecture
Architecture overview
Machine organization
von Neumann
Speeding up CPU operations
multiple registers
pipelining
superscalar and VLIW
CISC vs. RISC
(6.2)
Computer Architecture
Major components of a computer
Central Processing Unit (CPU)
memory
peripheral devices
Architecture is concerned with
internal structures of each
interconnections
speed and width
relative speeds of components
Want maximum execution speed
Balance is often critical issue
(6.3)
Computer Architecture (continued)
CPU
performs arithmetic and logical operations
synchronous operation
may be considered at two levels
instruction set architecture (how the machine looks to a programmer)
detailed hardware design
(6.4)
Computer Architecture (continued)
Memory
stores programs and data
organized as
bit
byte = 8 bits (smallest addressable location)
word = 4 bytes (typically; machine dependent)
instructions consist of operation codes and
addresses
[Instruction formats: 3-address (oprn | addr 1 | addr 2 | addr 3), 2-address (oprn | addr 1 | addr 2), 1-address (oprn | addr 1)]
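A minimal C sketch (not from the slides) showing that the byte is the unit sizes are counted in, while word size is machine dependent:

    #include <stdio.h>

    int main(void) {
        /* sizeof reports sizes in bytes, the smallest addressable unit */
        printf("char: %zu byte\n",  sizeof(char));  /* always 1 */
        printf("int:  %zu bytes\n", sizeof(int));   /* typically 4, i.e., one word */
        printf("long: %zu bytes\n", sizeof(long));  /* machine dependent: 4 or 8 */
        return 0;
    }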
(6.5)
Computer Architecture (continued)
Numeric data representations
integer (exact representation)
sign-magnitude
2's complement
to negate: invert all bits (change 0s to 1s and 1s to 0s), then add 1
floating point (approximate representation)
scientific notation: 0.3481 x 10^6
inherently imprecise
IEEE Standard 754-1985
[Formats: sign-magnitude = (s | magnitude); floating point = (s | exp | significand)]
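A small C sketch (illustrative values, not from the slides) of both representations: 2's-complement negation and the IEEE 754 single-precision fields:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        /* 2's complement: invert every bit, then add 1 */
        int8_t x = 42;
        int8_t neg = (int8_t)(~x + 1);            /* same bit pattern as -42 */
        printf("-42 as a 2's complement byte: 0x%02X\n", (uint8_t)neg);

        /* IEEE 754 single precision: 1 sign, 8 exponent, 23 significand bits */
        float f = 348100.0f;                      /* 0.3481 x 10^6 */
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);           /* view the raw bit pattern */
        printf("sign=%u exp=%u significand=0x%06X\n",
               (unsigned)(bits >> 31), (unsigned)((bits >> 23) & 0xFF),
               (unsigned)(bits & 0x7FFFFF));
        return 0;
    }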
(6.6)
Simple Machine Organization
Institute for Advanced Study (IAS) machine (1947)
von Neumann machine
ALU performs transfers between memory and
I/O devices
note two instructions per memory word
[Diagram: Main Memory, Input-Output Equipment, Arithmetic-Logic Unit, Program Control Unit. 40-bit word format: op code (bits 0-7) | address (bits 8-19) | op code (bits 20-27) | address (bits 28-39)]
(6.7)
Simple Machine Organization (continued)
ALU does arithmetic and logical comparisons
AC = accumulator; holds results
MQ = multiplier-quotient register; holds second portion of
long results
MBR = memory buffer register; holds data while
operation executes
(6.8)
Simple Machine Organization (continued)
Program control determines what computer does
based on instruction read from memory
MAR = memory address register; holds address of
memory cell to be read
PC = program counter; holds address of next instruction
to be read
IR = instruction register; holds instruction being
executed
IBR = instruction buffer register; holds right half of instruction word read from memory
(6.9)
Simple Machine Organization (continued)
Machine operates on fetch-execute cycle
Fetch
PC -> MAR
read M(MAR) into MBR
copy left and right instructions into IR and IBR
Execute
address part of IR -> MAR
read M(MAR) into MBR
execute opcode
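A toy C sketch of the cycle above for a hypothetical one-address machine (the 8-bit opcode/address layout and the opcode names are invented for illustration):

    #include <stdio.h>
    #include <stdint.h>

    enum { OP_LOAD, OP_ADD, OP_STORE, OP_HALT };

    int main(void) {
        uint16_t M[16] = {
            (OP_LOAD  << 8) | 10,   /* AC  = M[10] */
            (OP_ADD   << 8) | 11,   /* AC += M[11] */
            (OP_STORE << 8) | 12,   /* M[12] = AC  */
            (OP_HALT  << 8),
            [10] = 7, [11] = 5
        };
        uint16_t PC = 0, MAR, MBR, IR, AC = 0;

        for (;;) {
            MAR = PC;                       /* fetch: PC -> MAR           */
            MBR = M[MAR];                   /* read M(MAR) into MBR       */
            IR  = MBR;                      /* instruction into IR        */
            PC++;
            uint16_t op = IR >> 8, addr = IR & 0xFF;
            if (op == OP_HALT) break;
            MAR = addr;                     /* execute: addr of IR -> MAR */
            MBR = M[MAR];                   /* read M(MAR) into MBR       */
            switch (op) {                   /* execute opcode             */
            case OP_LOAD:  AC = MBR;        break;
            case OP_ADD:   AC = AC + MBR;   break;
            case OP_STORE: M[addr] = AC;    break;
            }
        }
        printf("M[12] = %u\n", M[12]);      /* 7 + 5 = 12 */
        return 0;
    }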
(6.10)
Simple Machine Organization (continued)
(6.11)
Architecture Families
Before mid-60s, every new machine had a
different instruction set architecture
programs from previous generation didn't run on
new machine
cost of replacing software became too large
IBM System/360 created family concept
single instruction set architecture
wide range of price and performance with same
software
Performance improvements based on different
detailed implementations
memory path width (1 byte to 8 bytes)
faster, more complex CPU design
greater I/O throughput and overlap
Software compatibility now a major issue
partially offset by high level language (HLL) software
(6.12)
Architecture Families
(6.13)
Multiple Register Machines
Initially, machines had only a few registers
2 to 8 or 16 common
registers more expensive than memory
Most instructions operated between memory
locations
results had to start from and end up in memory
so fewer (although more complex) instructions
means smaller programs and (supposedly)
faster execution
fewer instructions and data to move between
memory and ALU
But registers are much faster than memory
roughly 30 times faster
(6.14)
Multiple Register Machines (continued)
Also, many operands are reused within a
short time
waste time loading operand again the next
time it's needed
Depending on mix of instructions and
operand use, having many registers may
lead to less traffic to memory and faster
execution
Most modern machines use a multiple
register architecture
maximum is about 512; 32 integer and
32 floating point registers are common
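A short C sketch (hypothetical function) of operand reuse: the running sum stays in a register across iterations instead of going back to memory each time:

    /* With enough registers, 'sum' and 'i' live in registers for the
       whole loop; only a[i] has to come from memory each iteration. */
    long sum_array(const long *a, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }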
(6.15)
Pipelining
One way to speed up CPU is to increase
clock rate
limitations on how fast clock can run to
complete instruction
Another way is to execute more than one
instruction at one time
(6.16)
Pipelining
Pipelining breaks instruction execution down
into several stages
put registers between stages to buffer data
and control
execute one instruction
as the first enters the second stage, start the
second instruction, etc.
speedup same as number of stages as long as
pipe is full
(6.17)
Pipelining (continued)
Consider an example with 6 stages
FI = fetch instruction
DI = decode instruction
CO = calculate location of operand
FO = fetch operand
EI = execute instruction
WO = write operand (store result)
(6.18)
Pipelining Example
Executes 9 instructions in 14 cycles rather than 54 for
sequential execution
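The counts follow from a simple formula: with k stages, the first instruction takes k cycles and each later one finishes one cycle behind the previous. A small sketch (helper names invented):

    /* k-stage pipeline, n instructions */
    unsigned pipelined_cycles(unsigned k, unsigned n)  {
        return k + (n - 1);     /* 6 + (9 - 1) = 14 */
    }
    unsigned sequential_cycles(unsigned k, unsigned n) {
        return k * n;           /* 6 * 9 = 54 */
    }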
(6.19)
Pipelining (continued)
Hazards to pipelining
conditional jump
instruction 3 branches to instruction 15
pipeline must be flushed and restarted
later instruction needs operand being
calculated by instruction still in pipeline
pipeline stalls until result ready
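A tiny C illustration (invented example) of the second hazard: the later statement cannot fetch its operand until the earlier one writes its result:

    #include <stdio.h>

    int main(void) {
        int b = 3, c = 4;
        int a = b + c;      /* result 'a' written in the WO stage    */
        int d = a * 2;      /* FO stage needs 'a' -> pipeline stalls */
        printf("%d\n", d);
        return 0;
    }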
(6.20)
Pipelining Problem Example
Is this really a problem?
(6.21)
Real-life Problem
Not all instructions execute in one clock
cycle
floating point takes longer than integer
fp divide takes longer than fp multiply which
takes longer than fp add
typical values (in clock cycles)
integer add/subtract: 1
memory reference: 1
fp add: 2 (make 2 stages)
fp (or integer) multiply: 6 (make 2 stages)
fp (or integer) divide: 15
Break floating point unit into a sub-pipeline
execute up to 6 instructions at once
(6.22)
Pipelining (continued)
This is not simple to implement
note all 6 instructions could finish at the same
time!!
(6.23)
More Speedup
Pipelined machines issue one instruction
each clock cycle
how to speed up CPU even more?
Issue more than one instruction per clock
cycle
(6.24)
Superscalar Architectures
Superscalar machines issue a variable
number of instructions each clock cycle, up
to some maximum
instructions must satisfy some criteria of
independence
simple choice is maximum of one fp and one
integer instruction per clock
need separate execution paths for each
possible simultaneous instruction issue
compiled code from non-superscalar
implementation of same architecture runs
unchanged, but slower
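A small C illustration (invented example, assuming one integer and one fp path) of two adjacent instructions that satisfy the independence criterion and could issue in the same cycle:

    #include <stdio.h>

    int main(void) {
        int    a = 2,   b = 3;
        double y = 1.5, z = 4.0;
        int    i = a + b;   /* integer unit                          */
        double x = y * z;   /* fp unit; no dependence on i, so both  */
                            /* could be issued in the same cycle     */
        printf("%d %.1f\n", i, x);
        return 0;
    }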
(6.25)
Superscalar Example
Each instruction path may be pipelined
[Timing diagram: clock cycles 0-8, one integer and one fp instruction issued together each cycle]
(6.26)
Superscalar Problem
Instruction-level parallelism
what if two successive instructions can't be
executed in parallel?
data dependencies, or two instructions of slow
type
Design machine to increase multiple
execution opportunities
(6.27)
VLIW Architectures
Very Long Instruction Word (VLIW)
architectures store several simple instructions
in one long instruction fetched from memory
number and type are fixed
e.g., 2 memory reference, 2 floating point, one
integer
need one functional unit for each possible
instruction
2 fp units, 1 integer unit, 2 MBRs
all run synchronized
each instruction is stored in a single word
requires wider memory communication paths
many instructions may be empty, meaning
wasted code space
(6.28)
VLIW Example
Memory Ref 1     Memory Ref 2     FP 1            FP 2            Integer
LD F0,0(R1)      LD F6,8(R1)
LD F10,16(R1)    LD F14,24(R1)                                    SB R1,R1,#48
LD F18,32(R1)    LD F22,40(R1)    AD F4,F0,F2     AD F8,F6,F2
LD F26,48(R1)                     AD F12,F10,F2   AD F16,F14,F2
(6.29)
Instruction Level Parallelism
Success of superscalar and VLIW machines
depends on number of instructions that occur
together that can be issued in parallel
no dependencies
no branches
Compilers can help create parallelism
Speculation techniques try to overcome
branch problems
assume branch is taken
execute instructions but don't let them store
results until status of branch is known
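A toy C sketch (invented analogy, not a hardware description) of that idea: compute past the branch into a temporary and commit the result only once the branch outcome is known:

    #include <stdio.h>

    int main(void) {
        int x = 10, y = 0;
        int predicted_taken = 1;        /* predict: branch taken          */
        int speculative     = x * 2;    /* execute ahead, hold the result */
        int actually_taken  = (x > 5);  /* branch resolves later          */
        if (actually_taken == predicted_taken)
            y = speculative;            /* prediction correct: commit     */
        /* otherwise: squash 'speculative' and refetch the other path */
        printf("%d\n", y);
        return 0;
    }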
(6.30)
CISC vs. RISC
CISC = Complex Instruction Set Computer
RISC = Reduced Instruction Set Computer
(6.31)
CISC vs. RISC (continued)
Historically, machines tend to add features
over time
instruction opcodes
IBM 70X, 70X0 series went from 24 opcodes to
185 in 10 years
over the same period, performance increased 30 times
addressing modes
special purpose registers
Motivations are to
improve efficiency, since complex instructions
can be implemented in hardware and
execute faster
make life easier for compiler writers
support more complex higher-level languages
(6.32)
CISC vs. RISC
Examination of actual code indicated many
of these features were not used
RISC advocates proposed
simple, limited instruction set
large number of general purpose registers
and mostly register operations
optimized instruction pipeline
Benefits should include
faster execution of instructions commonly
used
faster design and implementation
(6.33)
CISC vs. RISC
Comparing some architectures

                  Year   Number of      Instr. size   Addressing   General
                         instructions   (bytes)       modes        registers
IBM 370/168       1973   208            2 - 6         4            16
VAX 11/780        1978   303            2 - 57        22           16
Intel 80486       1989   235            1 - 11        11           8
Motorola 88000    1988   51             4             3            32
MIPS R4000        1991   94             4             1            32
IBM RS/6000       1990   184            4             2            32
(6.34)
CISC vs. RISC
Which approach is right?
Typically, RISC takes about 1/5 the design
time
but CISC designs have adopted RISC techniques