Académique Documents
Professionnel Documents
Culture Documents
Edition
The Hardware/Software Interface
THE PROCESSOR
“In major matters no details are small.”
-French Proverb
§4.1 Introduction
Introduction
•CPU performance factors
•Instruction count
• Determined by ISA and compiler
•CPI and Cycle time
• Determined by CPU hardware
•We will examine two LEGv8
implementations
•A simplified version
•A more realistic pipelined version
• Combinational element
• Operate on data
• Output is a function of input
A
Y
B
◼ Arithmetic/Logic Unit
◼ Multiplexer ◼ Y = F(A, B)
◼ Y = S ? I1 : I0
A
I0 M
u Y ALU Y
I1 x
B
S F
Chapter 4 — The Processor — 9
Sequential Elements
• Register: stores data in a circuit
• Uses a clock signal to determine when to update the stored value
• Positive edge-triggered: update when Clk changes from 0 to 1
Clk
D Q
D
Clk
Q
Clk
D Q Write
Write D
Clk
Q
Increment by
4 for next
32-bit instruction
register
Sign-bit wire
replicated
Chapter 4 — The Processor — 18
Composing the Elements
•First-cut
data path does an instruction in one
clock cycle
• Each datapath element can only do one function at a
time
• Hence, we need separate instruction and data
memories
•Use multiplexers where alternate data sources
are used for different instructions
ALU
opcode ALUOp Operation Opcode field ALU function control
LDUR 00 load register XXXXXXXXXXX add 0010
STUR 00 store register XXXXXXXXXXX add 0010
CBZ 01 compare and XXXXXXXXXXX pass input b 0111
branch on zero
R-type 10 add 100000 add 0010
subtract 100010 subtract 0110
AND 100100 AND 0000
ORR 100101 OR 0001
ALUSrc The second ALU operand is from The second ALU input are the sign
the second register file output extended lower bits of the instrunction
PCSrc The PC is replaced by the output The PC is replaced by the output of the
of PC+4 branch target calculator
MemRead N/A Data memory contents designated by
the address input are put on the read
output
MemWrite N/A Data memory contents designated by
the address input are replaced by the
read input
MemtoReg The of the register Write data The of the register Write data input
input comes from the ALU comes
Chapter 4 — The Processor from the memory
— 27
R-Type Instruction
Jump 2 address
31:26 25:0
•Pipelined laundry:
overlapping execution
•Efficiency increases with
load ◼ Four loads:
•Single task time is not ◼ Speedup
reduced = 8/3.5 = 2.3
•Improved Throughput ◼ Non-stop:
•Parallelism improves ◼ Speedup
performance = 2n/0.5n + 1.5 ≈ 4
= number of stages
• Pipelining
improves performance by increasing instruction
throughput
• Executes multiple instructions in parallel
• Each instruction has the same latency
• Subject to hazards
• Structure, data, control
MEM
WB
Right-to-left
flow leads to
hazards
Chapter 4 — The Processor — 54
How to execute the pipeline
•Multiple instructions executed simultaneously
•Need to save values
•How to solve this problem in laundry?
Wrong
register
number
• Traditional form
ForwardB = 10 EX/MEM The second ALU operand is forwarded from the prior
ALU result.
ForwardB = 01 MEM/WB The second ALU operand is forwarded from data
memory or an
earlier ALU result.
• MEM hazard
• if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 31)
and not(EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 31)
and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRn1))
and (MEM/WB.RegisterRd = ID/EX.RegisterRn1)) ForwardA = 01
• if (MEM/WB.RegWrite
and (MEM/WB.RegisterRd ≠ 31)
and not(EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 31)
and (EX/MEM.RegisterRd ≠ ID/EX.RegisterRm2))
and (MEM/WB.RegisterRd = ID/EX.RegisterRm2)) ForwardB = 01
Stall inserted
here
Flush these
instructions
(Set control
values to 0)
PC
Chapter 4 — The Processor — 88
Reducing Branch Delay
•Move hardware to determine outcome to ID
stage
• Target address adder
• Register comparator
•Example: branch taken
36: SUB X10, X4, X8
40: CBZ X1, X3, 8
44: AND X12, X2, X5
48: ORR X13, X2, X6
52: ADD X14, X4, X2
56: SUB X15, X6, X7
...
72: LDUR X4, [X7,#50]
•Exception
• Arises within the CPU
• e.g., undefined opcode, overflow, underflow, syscall, hardware malfunction
…
•Interrupt
• From an external I/O controller
•Otherwise
• Terminate program
• Report error using EPC, cause, …
• Dynamic speculation
• Can buffer exceptions until instruction completion (which may
not occur)
•Load-use hazard
• Still one cycle use latency, but now two instructions
•More aggressive scheduling required
•Example
LDUR X0, [X21,#20]
ADD X1, X0, X2
SUB X23,X23,X3
ANDI X5, X23,#20
• Can start sub while ADD is waiting for LDUI
Hold pending
operands