
Computer Organization & Architecture

BITS Pilani
Pilani Campus

Virendra Singh Shekhawat Department of Computer Science and Information Systems

BITS Pilani
Pilani Campus

Module-1 (Lecture-1)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Computer Organization and Architecture, Functional and Structural View of Computer System, Brief History of Computers, Evolution of Intel x86 Architecture

Computer Architecture and Organization-1


Architecture comprises those attributes visible to the programmer that have a direct impact on the logical execution of a program
Instruction set, number of bits used for data types, I/O mechanisms, memory addressing techniques; e.g. the x86 architecture, the IBM System/360 architecture

Organization is the implementation of the computer system in terms of the interconnection of its functional units


Control signals
Interfaces between computer and peripherals
Memory technology
3
BITS Pilani, Pilani Campus

Computer Architecture and Organization-2


An architecture question:
Is there a multiply/divide instruction available?

An organization question:
Is multiplication implemented by separate hardware, or is it done by repeated addition?


How to Describe Computer System?


A computer is a complex system!
It contains millions of electronic components

Function is the operation of individual components as part of the structure
Structure is the way in which components are related to each other

Functional View of Computer


Computer Operations


Structural View of a Computer


Top level: the Computer, connected to Peripherals and Communication lines
Inside the computer: Central Processing Unit, Main Memory, I/O, Systems Interconnection



Structural View of the CPU


Top level: Computer, with I/O, Memory, System Bus, and CPU
Inside the CPU: Registers, Arithmetic and Logic Unit, Internal CPU Interconnection, Control Unit

Brief History of Computers


ENIAC was the first general-purpose electronic digital computer
Built by John Mauchly and J. Presper Eckert, completed in 1946

Von Neumann machine (1946), known as the IAS computer
Based on the stored-program concept

PDP-1 Computer (1957)
Developed by Digital Equipment Corporation (DEC)
First step towards minicomputers

IBM System/360 (1964)
First planned family of computers

Computer Generations
Vacuum tube (1946-1957) & Transistor (1958-1964)
Integrated Circuits:
Small scale integration - 1965 onwards
Up to 100 devices on a chip

Medium scale integration - to 1971


100 - 3,000 devices on a chip

Large scale integration - 1971-1977


3,000 - 100,000 devices on a chip

Very large scale integration - 1978 -1991


100,000 - 100,000,000 devices on a chip

Ultra large scale integration - 1991 onwards


Over 100,000,000 devices on a chip

x86 Evolution-1
1971 - 4004
First microprocessor (4-bit)
All CPU components on a single chip
Followed in 1972 by the 8008 (an 8-bit processor)
Both designed for specific applications

8080
First general-purpose microprocessor
Processes/moves 8-bit data at a time
Used in the first personal computer, the Altair

x86 Evolution-2
8086
Much more powerful (16-bit data)
Instruction cache: pre-fetches a few instructions
The 8088 (8-bit external bus) was used in the first IBM PC

80286
16 MBytes of memory addressable
Up from 1 MB (in the 8086)

80386
32 bit processor with multitasking support

x86 Evolution-3
80486
Sophisticated, powerful cache and instruction pipelining
Built-in math co-processor

Pentium
Superscalar: multiple instructions executed in parallel

Pentium Pro
Increased superscalar organization
Aggressive register renaming
Branch prediction and data flow analysis

x86 Evolution-4
Pentium II
MMX technology: graphics, video & audio processing

Pentium III
Additional floating point instructions for 3D graphics

Pentium 4
Further floating point and multimedia enhancements

Itanium Series
64 bit with Hardware enhancements to increase speed

What's next?
Multi-core architectures

Summary
Architecture vs. Organization
Functional and Structural View of a Computer
History of Computers
Intel Architecture Evolution


Review Questions
Differentiate between computer organization and architecture.
What are the four main functions of a computer?
What are the basic structural components of a computer?
Describe the computer generations in brief.
What was the first general-purpose microprocessor?

Thank You!



Module-1 (Lecture-2)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Concept of Computer Program and Instruction, Internal Structure of CPU, Instruction Execution Cycle With and Without Interrupt

Addition of Two Numbers


Start
Load R1, A
Load R2, B
R3 = R1 + R2
Store C, R3
End of program


What is a Computer Program?


A logical group of instructions/statements to achieve some specific task is called a program
An instruction is a sequence of steps
In each step, an arithmetic or logical operation or a data movement is done
For each operation, one or more control signals are activated

What is required from CPU?


Fetch instructions
Interpret instructions
Fetch data
Process data
Write/store data

To do these tasks, the processor needs:
Temporary storage
An interconnection structure between the various components (i.e. registers, ALU, memory, I/O)
An ALU


CPU and Interconnections


Internal Structure of CPU


Instruction Cycle
Two steps:
Fetch
Execute


Fetch Cycle
The Program Counter (PC) holds the address of the next instruction to fetch
The processor fetches the instruction from the memory location pointed to by the PC
The PC is then incremented
Unless told otherwise (e.g. by a jump)

The instruction is loaded into the Instruction Register (IR)
The processor interprets the instruction and performs the required actions

Execute Cycle
Data transfer between CPU and main memory
Data transfer between CPU and I/O module
Some arithmetic or logical operation on data
Alteration of the sequence of operations
e.g. jump

Combination of above


How is an Instruction Represented?


Example of Program Execution


<Opcode> = <Operation>:
0001 = Load AC from memory
0010 = Store AC to memory
0101 = Add to AC from memory

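The three opcodes above are enough to run Stallings' classic three-instruction fragment (program at 300h, operands at 940h and 941h). Below is a minimal sketch of the fetch-execute loop in Python; the dictionary-as-memory layout and the halt-on-unknown-address rule are our own simplifying assumptions:

```python
# Toy simulator for the hypothetical accumulator machine in the slides:
# 16-bit instructions = 4-bit opcode + 12-bit address.
LOAD, STORE, ADD = 0b0001, 0b0010, 0b0101

def run(memory, pc):
    ac = 0
    while pc in memory:            # halt when PC leaves the program (assumption)
        ir = memory[pc]            # fetch: instruction into IR
        pc += 1                    # increment PC
        opcode, addr = ir >> 12, ir & 0xFFF
        if opcode == LOAD:         # 0001: load AC from memory
            ac = memory[addr]
        elif opcode == ADD:        # 0101: add memory word to AC
            ac = (ac + memory[addr]) & 0xFFFF
        elif opcode == STORE:      # 0010: store AC to memory
            memory[addr] = ac
        else:
            break                  # unknown opcode: halt
    return memory

mem = {
    0x300: 0x1940,  # LOAD  AC, [940]
    0x301: 0x5941,  # ADD   AC, [941]
    0x302: 0x2941,  # STORE [941], AC
    0x940: 0x0003,
    0x941: 0x0002,
}
run(mem, 0x300)
print(hex(mem[0x941]))  # 0x5, i.e. 3 + 2
```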

Instruction Cycle State Diagram


Interrupts
A mechanism by which other modules (e.g. I/O) may interrupt the normal sequence of processing. Interrupt classes:
Program
e.g. generated by arithmetic overflow or division by zero

Timer
Generated by internal processor timer Used in pre-emptive multi-tasking

I/O
From I/O controller

Hardware failure
e.g. Memory parity error

Interrupt Cycle


Instruction Cycle State Diagram with Interrupts


Summary
Concept of Computer Program and Instruction
Internal Structure of CPU
Relationship between CPU register sizes and main memory
Instruction Execution Cycle
Interrupt and Interrupt Cycle
Instruction Execution Cycle with Interrupt

Review Questions
Which of the instruction cycle states require main memory access?
What is context switching?
Assume the main memory address space is 2^16 and each location contains 1 byte of data. Find the minimum possible size required for registers such as the PC, MAR, MBR, and IR.
Repeat the above question for an address space of 2^32 and an addressability of 16 bits. Also calculate the total size of main memory in GBytes.

Thank You!



Module-1 (Lecture-3)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings, Computer Organization and Design , 4th Ed. By Patterson] Topics: Computer Performance Assessment, CPI, MIPS Rate, Benchmark Programs

Computer Performance
Performance is one of the key parameters used to:
Evaluate processor hardware
Measure requirements for new systems

When we say one computer has better performance than another, what does it mean?
What is the criterion for performance?
Response time (for a single computer), throughput (for a data center)

Clock Rate
Operations performed by the processor are governed by the system clock, the fundamental level of processor speed measurement
The clock is generated by a quartz crystal


A clock cycle is the basic unit of time in which operations are executed. The clock rate (clock cycles per second, in MHz or GHz) is the inverse of the clock cycle time (the clock period)


Is Clock Rate Enough?


As we know, instruction execution takes several discrete steps
Fetching, decoding, ALU operation, fetching data, etc.
An instruction takes multiple clock cycles to execute

Different instructions take different numbers of cycles to execute
LOAD, ADD, SUB, JUMP, etc.

Thus clock speed doesn't tell the whole story!



Performance: Application Specific


Performance here means how a processor performs when executing a given application
Application performance depends upon:
Speed of the processor
Instruction set
Choice of language
Efficiency of the compiler
Programming skill of the user

CPU Performance
To maximize performance, we need to minimize execution time
performance = 1 / execution_time
If X is n times faster than Y, then

performance_X / performance_Y = execution_time_Y / execution_time_X = n


Cycles Per Instruction (CPI)


For any given processor, the number of cycles required varies for different types of instructions
e.g. load, store, branch, add, mul, etc.

Hence CPI is not a constant value for a processor
We need to calculate an average CPI for the processor


CPU Performance and its Factors


Program execution time: T = Ic x CPI x t
where t = cycle time and Ic = number of instructions in the program

Instruction execution also requires memory access

T = Ic x [p + (m x k)] x t
where p = processor cycles per instruction, m = memory references per instruction,
and k = ratio of memory cycle time to processor cycle time

These performance factors are influenced by
the ISA, the compiler, the processor implementation, and the memory hierarchy

MIPS Rate
A common measure of performance:
Millions of Instructions Per Second (MIPS) rate = Ic / (T x 10^6)
This can also be written as f / (CPI x 10^6)


Example
Processor speed = 400 MHz
Four types of instructions, with CPI 1, 2, 4, 8 respectively
Instruction mix of 60%, 18%, 12%, 10% respectively
What is the MIPS rate?

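One way to work the example above (the variable names are our own): weight each instruction class's CPI by its frequency to get the average CPI, then apply MIPS = f / (CPI x 10^6):

```python
# Sketch of the MIPS-rate calculation for the example above.
clock_hz = 400e6                        # 400 MHz processor
cpi = [1, 2, 4, 8]                      # cycles per instruction, by type
mix = [0.60, 0.18, 0.12, 0.10]          # instruction mix (fractions)

avg_cpi = sum(c * f for c, f in zip(cpi, mix))   # weighted average CPI
mips = clock_hz / (avg_cpi * 1e6)
print(round(avg_cpi, 2), round(mips, 1))  # 2.24 cycles/instr, ~178.6 MIPS
```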

Limitation of MIPS Rate


The MIPS rate (instruction execution rate) is also inadequate as a measure of CPU performance. Why?
Because of differences between ISAs
Ex.: executing a high-level language statement A = B + C (where A, B and C are in memory) may need different numbers of low-level instructions on different ISAs


Benchmark Programs
A benchmark is a collection of programs that provides a representative test of a computer in a particular application area
e.g. the SPEC (System Performance Evaluation Corporation) benchmark suites
SPEC CPU 2006 is used for measuring the performance of computation-based applications


Summary
Computer Performance Assessment
Performance Factors: Execution Time = No. of Instructions in program x Clock cycles per instruction x Clock cycle time
MIPS Rate
Benchmark Programs

Review Questions
Consider two implementations of the same ISA. Computer A has a clock cycle time of 250 ns and a CPI of 2 for a program. Computer B has a clock cycle time of 500 ns and a CPI of 1.2 for the same program. Which computer is faster for this program, and by how much?
A program runs on computer A, with a 2 GHz clock, in 10 seconds. Another computer, B, with a 4 GHz clock, runs this program in 6 seconds. To accomplish this, computer B requires P times as many clock cycles as computer A to run the program. Find the value of P.
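For the first review question, a quick sanity check (our own sketch, not part of the slides): time per instruction is CPI x clock cycle time, so the machine with the smaller product is faster.

```python
# Per-instruction time = CPI x cycle time; smaller is faster.
time_a = 2 * 250e-9     # computer A: CPI 2, 250 ns cycle -> 500 ns/instr
time_b = 1.2 * 500e-9   # computer B: CPI 1.2, 500 ns cycle -> 600 ns/instr
print(time_b / time_a)  # A is faster, by a factor of 1.2
```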

Thank You!



Module-1 (Lecture-4)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Computer System Modules and Interconnections; Concept of BUS: Types, Arbitration, Timing; PCI BUS Example

Interconnection
All the units must be connected
Different types of connection for different types of unit

Memory connections:
Receives and sends data
Receives addresses (of locations)
Receives control signals


CPU Connections
Reads instructions and data
Writes out data (after processing)
Sends control signals to other units
Receives (& acts on) interrupts


I/O Module Connection


Similar to memory from the computer's viewpoint
Input/Output:
Receives data from peripheral/computer and sends data to computer/peripheral
Receives/sends control signals from/to computer/peripheral
e.g. spin disk, interrupt
Receives addresses from the computer
e.g. a port number to identify the peripheral

What is a Bus?
A communication pathway connecting two or more devices
What do buses look like?
Parallel lines on circuit boards
Ribbon cables
Strip connectors on motherboards
Sets of wires

A bus usually broadcasts the data



Data Bus and Address Bus


The Data Bus carries data
Remember that there is no difference between data and instructions at this level
Width is a key determinant of performance: 8, 16, 32, 64 bits

The Address Bus identifies the source or destination of data
e.g. the CPU needs to read an instruction (data) from a given location in memory

Address bus width determines the maximum memory capacity of the system
e.g. the 8080, with a 16-bit address bus, has a 64K address space (the 8086's address bus is 20 bits, giving 1 MB)

Control Bus
Control and timing information:
Memory read/write signals
Interrupt requests
Clock signals

Bus Interconnection Scheme


Single Bus Problems


Lots of devices on one bus leads to:
More propagation delay, due to increased bus length
Co-ordination of bus use can adversely affect performance
Once aggregate data transfer approaches bus capacity, the bus may become a bottleneck

Most systems use multiple buses to overcome these problems



Traditional Multiple Bus Architecture


High Performance Bus


Bus Design Issues: Bus Types


Dedicated
Separate data & address lines

Multiplexed
Shared lines
An "address valid" or "data valid" control line
Advantage: fewer lines
Disadvantages: more complex control, reduced ultimate performance

Bus Arbitration
More than one module may control the bus
e.g. CPU and DMA controller
Only one module may control the bus at a time
Arbitration may be:

Centralized
A single hardware device controls bus access

Distributed
Each module may claim the bus
Control logic on all modules

Centralized BUS Arbiter


Timing: Co-ordination of events on bus


Synchronous
Events are determined by clock signals, synchronized on the leading edge of the clock
All devices can read the clock line
Usually a single cycle per event

Asynchronous
The occurrence of one event on the bus follows, and depends on, the occurrence of a previous event
Events on the bus are not synchronized with a clock

Synchronous Timing Diagram


Asynchronous Timing - Read Operation


Asynchronous Timing - Write Operation


Advantages Over Synchronous Bus


Synchronization of sender and receiver clocks is not needed
Delays are accommodated (allowing a mixture of slow and fast devices)
More flexibility and reliability
Is there any drawback of an asynchronous bus?
Yes: a malfunctioning device can stall the handshake

PCI Read Timing Diagram


a) The transaction begins when the master asserts FRAME
b) The target device recognizes its address on the AD lines
c) The master ceases driving the AD bus; the initiator asserts IRDY to indicate it is ready for data
d) The target asserts DEVSEL to indicate it has recognized its address
e) The master reads the data at the beginning of the 4th cycle and changes the byte enable lines as needed
f) The target deasserts TRDY (it needs time to prepare the next block of data)
g) The target places the 3rd data item on the bus in the 6th cycle, but the master is not ready (deasserts IRDY)
h) The master deasserts FRAME to indicate the last data transfer
i) The master deasserts IRDY (the bus goes to the idle state); the target deasserts TRDY and DEVSEL

Exercise-1
Consider a 64-bit microprocessor with 64-bit instructions composed of two fields: the first two bytes contain the opcode, and the remainder the immediate operand or an operand address.
What is the impact on system speed if the microprocessor bus has:
a) a 64-bit local address bus and a 32-bit local data bus?
b) a 32-bit local address bus and a 32-bit local data bus?
c) What is the maximum directly addressable memory?

Exercise-2
Consider a 32-bit microprocessor with a 16-bit external data bus, driven by an 8 MHz input clock. The microprocessor has a bus cycle whose minimum duration equals four input clock cycles.
What is the maximum data transfer rate across the bus in bytes/sec?
To increase performance, should we (a) make the external data bus 32 bits, or (b) double the external clock frequency supplied to the processor? Which is the better option? Explain.
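A sketch of the first part of Exercise-2 (variable names are ours): one bus cycle takes four input clocks and moves 16 bits (2 bytes).

```python
# Maximum data transfer rate: 2 bytes per minimum-length bus cycle.
clock_hz = 8e6                 # 8 MHz input clock
bus_cycle_s = 4 / clock_hz     # minimum bus cycle = 4 input clock cycles
rate = 2 / bus_cycle_s         # 16 bits = 2 bytes per bus cycle
print(rate)                    # 4,000,000 bytes/sec
# Either widening the data bus to 32 bits or doubling the clock doubles this
# in principle; doubling the clock is harder (memory must keep up).
```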

Summary
Bus as an Interconnection Structure
Bus Design Issues
Synchronous and Asynchronous Bus Operations
PCI Bus Operation


Thank You!


Computer Arithmetic
BITS Pilani
Pilani Campus

S Mohan


Computer Arithmetic
Number Representations
Integer (fixed-point) representation
Floating-point representation

Arithmetic
Integer arithmetic
Floating-point arithmetic


Arithmetic & Logic Unit


Does the calculations
Everything else in the computer is there to service this unit
Handles integers
May handle floating point (real) numbers
There may be a separate FPU (math co-processor)
The FPU may be on-chip (486DX and later)

Number Representation-1
Unsigned representation
We only have 0 & 1 to represent everything
Positive numbers are stored in binary
e.g. 41 = 00101001

No minus sign
Scope is limited!


Number Representation-2
Sign-Magnitude

The left-most bit is the sign bit: 0 means positive, 1 means negative
+18 = 00010010
-18 = 10010010
Problems:
Arithmetic does not work as we would want
One bit pattern is wasted (both +0 and -0 exist)

Number Representation-3
Two's Complement
The most significant bit is treated as a sign bit: 0 for positive and 1 for negative
Positive numbers are represented the same as in sign-magnitude representation
Negative numbers are represented in 2's-complement form
Using n bits, the range of numbers is

-2^(n-1) to 2^(n-1) - 1
Arithmetic works as we want!

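A minimal sketch of the representation in Python, using integer masking to mimic an n-bit register (the helper names are our own):

```python
# n-bit two's-complement encode/decode.
def to_twos(x, n=8):
    return x & ((1 << n) - 1)          # wrap value into n bits

def from_twos(b, n=8):
    # if the MSB (sign bit) is set, the value is b - 2^n
    return b - (1 << n) if b >> (n - 1) else b

print(format(to_twos(-18), '08b'))   # 11101110
print(from_twos(0b11101110))         # -18
print(from_twos(0b01111111), from_twos(0b10000000))  # 127 -128 (range edges)
```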

Conversion Between Lengths


Positive numbers: pack with leading zeros
+18 = 00010010
+18 = 00000000 00010010

Negative numbers: pack with leading ones (in two's complement)
-18 = 11101110
-18 = 11111111 11101110

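Sign extension can be sketched directly on the bit patterns (a hypothetical helper, assuming two's-complement inputs):

```python
def sign_extend(b, n, m):
    """Extend an n-bit two's-complement value b to m bits."""
    if b >> (n - 1):                          # negative: pack with leading ones
        b |= ((1 << (m - n)) - 1) << n
    return b                                  # positive: leading zeros (no-op)

print(format(sign_extend(0b00010010, 8, 16), '016b'))  # +18 extended
print(format(sign_extend(0b11101110, 8, 16), '016b'))  # -18 extended
```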

Addition and Subtraction


Normal binary addition
Take the two's complement of the subtrahend and add it to the minuend
i.e. a - b = a + (-b)

Hardware implementation is simple
Only addition and complement circuits are required

Monitor the sign bits to detect overflow



Hardware for Addition and Subtraction


Multiplication
Complex!
Work out the partial product for each digit
Take care with place value (column)
Add the partial products
Think about the hardware implementation: what changes are required in the above manual approach for computerization?
Keep a running addition of the partial products
A few registers should be used
For each 1 in the multiplier, an add and shift-right operation is required
For each 0, only a shift operation is required
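The add-and-shift scheme above can be sketched in a few lines of Python (a software illustration, not the register-level circuit):

```python
# Unsigned shift-and-add multiplication: for each 1 bit of the multiplier,
# add the (progressively shifted) multiplicand into a running product.
def shift_add_mul(multiplicand, multiplier):
    product = 0
    while multiplier:
        if multiplier & 1:           # multiplier bit is 1: add, then shift
            product += multiplicand
        multiplicand <<= 1           # bit is 0: shift only
        multiplier >>= 1
    return product

print(shift_add_mul(13, 11))  # 143
```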

Unsigned Binary Multiplication


Example


Unsigned Binary Multiplication


Multiplying Negative Numbers


Will the unsigned scheme work? Try (+5) * (-3) = ???
Solution 1:
Convert to positive numbers
Multiply as in the unsigned method
If the signs were different, take the 2's complement of the result

Solution 2:
Booth's Algorithm

Booth's Algorithm Principle-1

The multiplicand M is unchanged
Based upon recoding the multiplier Q to a recoded value R
Each digit can assume negative as well as positive and zero values
Known as Signed-Digit (SD) encoding

Booth's Algorithm Principle-2

Booth's algorithm is called "skipping over ones"
A string of 1s is replaced by 0s:
For example: 30 = 0011110
= 32 - 2 = 0100000 - 0000010
In the recoded (SD) form: 0 1 0 0 0 -1 0

Booth's Algorithm Principle-3

Booth recoding procedure:
Working from LSB to MSB, retain each 0 until a 1 is reached
When a 1 is encountered, insert -1 at that position and complement all the succeeding 1s until a 0 is encountered
Replace that 0 with 1 and continue
When multiplying by a -1 digit, the 2's complement of the multiplicand is added

Flow Chart: Booth's Algorithm


Example: Booth's Multiplication

M = 0111 (7), multiplier = 1101 (-3):

A    Q    Q-1  Action
0000 1101 0    initial values
1001 1101 0    A <- A - M
1100 1110 1    shift
0011 1110 1    A <- A + M
0001 1111 0    shift
1010 1111 0    A <- A - M
1101 0111 1    shift
1110 1011 1    shift

Result (A,Q) = 1110 1011 = -21
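The A/Q/Q-1 trace above can be reproduced with a short Python sketch of Booth's algorithm on 4-bit operands (the masking idiom for registers is our own):

```python
# Booth's algorithm: examine (Q0, Q-1), add/subtract M, then arithmetic
# shift right the combined A,Q,Q-1 register pair.
def booth_mul(m, q, n=4):
    a, q_1 = 0, 0
    mask = (1 << n) - 1
    for _ in range(n):
        pair = (q & 1, q_1)
        if pair == (1, 0):
            a = (a - m) & mask               # A <- A - M
        elif pair == (0, 1):
            a = (a + m) & mask               # A <- A + M
        # arithmetic shift right across A, Q, Q-1
        q_1 = q & 1
        q = ((q >> 1) | ((a & 1) << (n - 1))) & mask
        a = ((a >> 1) | (a & (1 << (n - 1)))) & mask
    result = (a << n) | q                    # 2n-bit two's-complement product
    return result - (1 << 2 * n) if result >> (2 * n - 1) else result

print(booth_mul(0b0111, 0b1101))  # -21, i.e. 7 x (-3)
```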


Division of Unsigned Binary Integers

More complex than multiplication
Negative numbers are really bad!
Based on long division, e.g. 10010011 / 1011 (147 / 11):

                00001101    Quotient
    Divisor  ----------
       1011 | 10010011      Dividend
                1011
                ------
                001110      Partial
                1011        remainders
                ------
                001111
                1011
                ------
                100         Remainder


Algorithm
The bits of the dividend are examined from left to right, until the set of bits examined represents a number greater than or equal to the divisor
Until this event occurs, 0s are placed in the quotient
When the event occurs, a 1 is placed in the quotient and the divisor is subtracted from the partial dividend
This continues in a cyclic pattern
The process stops when all the bits of the dividend are exhausted
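The left-to-right scan described above can be sketched in Python (a software illustration of the restoring scheme, not the register-level flowchart):

```python
# Unsigned long division: shift dividend bits into a partial remainder,
# subtracting the divisor whenever it fits and recording quotient bits.
def unsigned_div(dividend, divisor, n=8):
    rem, quo = 0, 0
    for i in range(n - 1, -1, -1):
        rem = (rem << 1) | ((dividend >> i) & 1)   # bring down the next bit
        quo <<= 1
        if rem >= divisor:                          # divisor fits: quotient bit 1
            rem -= divisor
            quo |= 1
    return quo, rem

print(unsigned_div(0b10010011, 0b1011))  # (13, 4): 147 / 11
```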

Flowchart for Unsigned Binary Division


Division Algorithm for Signed Integers


1. Load the divisor into the M register and the dividend into the A,Q registers
2. Shift A,Q left 1 bit position
3. If M and A have the same signs, do A = A - M; otherwise A = A + M
4. a) If the preceding operation is successful (the sign of A is unchanged) or A = 0, then set Q0 = 1
   b) If the operation is unsuccessful and A <> 0, then set Q0 = 0 and restore the previous value of A
5. Repeat steps 2 through 4 as many times as there are bit positions in Q
6. The remainder is in A and the quotient is in Q

Floating Point
We need a way to represent:
numbers with fractions, e.g., 3.1416
very small numbers (in absolute value), e.g., .00000000023
very large numbers (in absolute value), e.g., 3.15576 * 10^46

Representation:
scientific: sign, exponent, significand, with a binary point:
(-1)^sign * significand * 2^exponent, e.g., 101.001101 * 2^111001

More bits for the significand gives more accuracy; more bits for the exponent increases range
If 1 <= significand < 10_two (= 2_ten), then the number is normalized; the exception is the number 0, which is normalized to significand 0
E.g., 101.001101 * 2^111001 = 1.01001101 * 2^111011 (normalized)


IEEE 754 Floating-point Standard


IEEE 754 floating point standard:
single precision: one word
bit 31: sign
bits 30 to 23: 8-bit exponent
bits 22 to 0: 23-bit significand

double precision: two words
bit 31: sign
bits 30 to 20: 11-bit exponent
bits 19 to 0: upper 20 bits of the 52-bit significand
bits 31 to 0 (second word): lower 32 bits of the 52-bit significand


IEEE 754 Floating Point Representation


Single precision: 4 bytes
Double precision: 8 bytes
Extended double: 10 bytes
Quadruple precision: 16 bytes


IEEE 754 Floating-point Standard


The sign bit is 0 for positive numbers, 1 for negative numbers
The number is assumed normalized, and the leading 1 bit of the significand (left of the binary point) is assumed and not shown (for non-zero numbers)
e.g., significand 1.1001 is represented as 1001; the exception is the number 0, which is represented as all 0s
for other numbers: value = (-1)^sign * (1 + significand) * 2^(exponent value)

The exponent is biased to make sorting easier
all 0s is the smallest exponent, all 1s is the largest
the bias is 127 for single precision and 1023 for double precision
the exponent field minus the bias equals the exponent value
therefore, for non-0 numbers: value = (-1)^sign * (1 + significand) * 2^(exponent - bias)


IEEE 754 Floating-point Standard


Special treatment of 0:
if the exponent is all 0s and the significand is all 0s, then the value is 0 (the sign bit may be 0 or 1)

Example: represent -0.75ten in IEEE 754 single precision
decimal: -0.75 = -3/4 = -3/2^2
binary: -11/100 = -.11 = -1.1 x 2^-1
IEEE single precision floating point exponent = bias + exponent value = 127 + (-1) = 126ten = 01111110two
IEEE single precision: 1 01111110 10000000000000000000000
(sign | exponent | significand)

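The example can be cross-checked against a real IEEE 754 implementation using Python's struct module (our own sanity check, not part of the slides):

```python
# Pack -0.75 as a big-endian IEEE 754 single and inspect the bit pattern.
import struct

bits, = struct.unpack('>I', struct.pack('>f', -0.75))
print(format(bits, '032b'))
# sign 1, biased exponent 01111110 (126), fraction 100...0
```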

Floating-Point Example
What number is represented by the single-precision float
1 10000001 01000000000000000000000?
S = 1
Fraction = 01000...00_2
Exponent = 10000001_2 = 129

x = (-1)^1 * (1 + 0.01_2) * 2^(129 - 127)
  = (-1) * 1.25 * 2^2
  = -5.0
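The decoding can be confirmed with struct as well (assuming the bit pattern above is the 32-bit word 0xC0A00000):

```python
# Reinterpret the 32-bit pattern 0xC0A00000 as an IEEE 754 single.
import struct

value, = struct.unpack('>f', (0xC0A00000).to_bytes(4, 'big'))
print(value)  # -5.0
```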

IEEE 754 Standard Encoding


Single Precision        Double Precision        Object Represented
Exponent  Fraction      Exponent  Fraction
0         0             0         0             0 (zero)
0         Non-zero      0         Non-zero      Denormalized number
1-254     Anything      1-2046    Anything      Floating-point number
255       0             2047      0             Infinity
255       Non-zero      2047      Non-zero      NaN (Not a Number)

NaN: (infinity - infinity), or 0/0

Denormalized number = (-1)^sign * 0.f * 2^(1-bias)



Single-Precision Range
Exponents 00000000 and 11111111 are reserved
Smallest value:
Exponent: 00000001, actual exponent = 1 - 127 = -126
Fraction: 000...00, significand = 1.0
± 1.0 × 2^-126 ≈ ± 1.2 × 10^-38

Largest value:
Exponent: 11111110, actual exponent = 254 - 127 = +127
Fraction: 111...11, significand ≈ 2.0
± 2.0 × 2^+127 ≈ ± 3.4 × 10^+38

Double-Precision Range
Exponents 00000000000 and 11111111111 are reserved
Smallest value:
Exponent: 00000000001, actual exponent = 1 - 1023 = -1022
Fraction: 000...00, significand = 1.0
± 1.0 × 2^-1022 ≈ ± 2.2 × 10^-308

Largest value:
Exponent: 11111111110, actual exponent = 2046 - 1023 = +1023
Fraction: 111...11, significand ≈ 2.0
± 2.0 × 2^+1023 ≈ ± 1.8 × 10^+308

Floating point addition


Make both exponents the same
Find the number with the smaller one
Shift its mantissa to the right until the exponents match
Must include the implicit 1 (the 1 in 1.M)

Add the mantissas
Choose the larger exponent
Put the result in normalized form
Shift the mantissa left or right until it is in the form 1.M
Adjust the exponent accordingly

Handle overflow or underflow if necessary
Round
Renormalize if necessary, if rounding produced an unnormalized result

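The steps above can be sketched with Python's math.frexp/ldexp, which expose a float's mantissa (in [0.5, 1)) and exponent; here ldexp stands in for the normalize/round hardware (an illustration under that assumption, not the 1.M circuit):

```python
import math

def fp_add(x, y):
    mx, ex = math.frexp(x)             # x = mx * 2**ex
    my, ey = math.frexp(y)
    e = max(ex, ey)                    # 1. choose the larger exponent
    mx /= 2 ** (e - ex)                #    shift the smaller mantissa right
    my /= 2 ** (e - ey)
    m = mx + my                        # 2. add the mantissas
    return math.ldexp(m, e)            # 3. ldexp renormalizes the result

print(fp_add(1.5, 2.25))  # 3.75
```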

Floating point addition


Algorithm


Floating point addition example


Initial values
1 00000001 S E 000001100 M

0 00000011 S E

010000111 M


Floating point addition example


Identify smaller E and calculate E difference
1 00000001 S E 000001100 M

difference = 2
0 00000011 S E 010000111 M


Floating point addition example


Shift smaller M right by E difference
1 00000011 S E 010000011 M

0 00000011 S E

010000111 M


Floating point addition example


Add mantissas
1 00000011 S E 010000011 M

0 00000011 S E

010000111 M

-0.010000011 + 1.010000111 = 1.000000100  (operand 1 is negative, S = 1)


0 S E 000000100 M

Floating point addition example


Normalize the result by shifting (already normalized)
0 00000011
S E

000000100
M


Floating point addition example


Final answer
1 00000001 S E 000001100 M

0 00000011 S E

010000111 M

0 00000011 S E

000000100 M

Hardware design
determine smaller exponent

Floating point addition


Hardware design

Floating point addition

shift mantissa of smaller number right by exponent difference


Hardware design

Floating point addition

add mantissas


Hardware design

Floating point addition

normalize result by shifting mantissa of result


Hardware design

Floating point addition

round result


Hardware design

Floating point addition

renormalize if necessary

FP Adder Hardware
Much more complex than an integer adder
Doing it in one clock cycle would take too long
Much longer than integer operations
A slower clock would penalize all instructions

An FP adder usually takes several cycles
It can be pipelined


Floating point multiply


Add the exponents and subtract the bias from the sum
Example: (5+127) + (2+127) - 127 = 7+127

Multiply the mantissas
Put the result in normalized form
Shift the mantissa left or right until it is in the form 1.M
Adjust the exponent accordingly

Handle overflow or underflow if necessary
Round
Renormalize if necessary, if rounding produced an unnormalized result
Set S=0 if the signs of both operands are the same, S=1 otherwise
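The biased-exponent bookkeeping in step 1 can be checked with a tiny sketch (variable names are ours): adding two biased exponents double-counts the bias, so it is subtracted once.

```python
# Exponent handling for FP multiply with a biased representation.
BIAS = 127
e1, e2 = 5 + BIAS, 2 + BIAS       # biased exponents of the operands
e_prod = e1 + e2 - BIAS           # subtract the double-counted bias
print(e_prod, e_prod - BIAS)      # 134 (biased), 7 (actual exponent)
```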

Floating point multiply


Algorithm


Floating point multiply example


Initial values
1 00000111 S E 100000000 M

-1.5 x 2^(7-127)

0 11100000 S E

100000000 M

1.5 x 2^(224-127)


Floating point multiply example


Add exponents
1 00000111 S E 100000000 M

-1.5 x 2^(7-127)

0 11100000 S E

100000000 M

1.5 x 2^(224-127)

00000111 +11100000 = 11100111 (231)


Floating point multiply example


Subtract bias
1 00000111 S E 100000000 M

-1.5 x 2^(7-127)

0 11100000 S E

100000000 M

1.5 x 2^(224-127)

11100111 (231) 01111111 (127) = 01101000 (104)

01101000 S E M

Floating point multiply example


Multiply the mantissas
1 00000111 S E 100000000 M

-1.5 x 2^(7-127)

0 11100000 S E

100000000 M

1.5 x 2^(224-127)

1.1000 x 1.1000 = 10.01000

01101000 S E M

Floating point multiply example


Normalize by shifting 1.M right one position and adding one to E
1 00000111 S E 100000000 M

-1.5 x 2^(7-127)

0 11100000 S E

100000000 M

1.5 x 2^(224-127)

10.01000 => 1.001000

01101001 S E

001000 M

Floating point multiply example


Set S=1 since signs are different
1 00000111 S E 100000000 M

-1.5 x 2^(7-127)

0 11100000 S E

100000000 M

1.5 x 2^(224-127)

1 01101001 S E

001000 M

-1.125 x 2^(105-127)


FP Arithmetic Hardware
An FP multiplier is of similar complexity to an FP adder
But it uses a multiplier for the significands instead of an adder

FP arithmetic hardware usually does:
Addition, subtraction, multiplication, division, reciprocal, square root
FP <-> integer conversion

Operations usually take several cycles
They can be pipelined
BITS Pilani, Pilani Campus

Floating Point Complexities

In addition to overflow we can have underflow (number too small). Accuracy is a problem with both, because we have only a finite number of bits to represent numbers that may actually require arbitrarily many bits:
Limited precision -> rounding -> rounding error
IEEE 754 keeps two extra bits, guard and round
Four rounding modes
A positive number divided by zero yields infinity
Zero divided by zero yields NaN (not a number)

other complexities

Implementing the standard can be tricky


BITS Pilani, Pilani Campus

Rounding
FP arithmetic operations may produce a result with more digits than can be represented in 1.M
The result must be rounded to fit into the available number of M positions
Tradeoff: hardware cost and speed (keeping extra bits) versus accumulated rounding error
BITS Pilani, Pilani Campus

Rounding
Guard and round digits for intermediate addition:
2.56 x 10^0 + 2.34 x 10^2 = 0.0256 x 10^2 + 2.34 x 10^2 = 2.3656 x 10^2
  5: guard digit, 6: round digit
  00-49: round down, 51-99: round up, 50: tie-break
Result: 2.37 x 10^2
Without guard and round digits:
0.02 x 10^2 + 2.34 x 10^2 = 2.36 x 10^2

BITS Pilani, Pilani Campus
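Python's decimal module makes the effect easy to reproduce: the intermediate sum is computed with its extra digits before the final rounding to the working precision. This is a sketch mirroring the slide's decimal example at 3 significant digits:

```python
from decimal import Decimal, getcontext

getcontext().prec = 3                  # work with 3 significant digits

# With guard/round digits: the exact intermediate 236.56 is kept, then
# the result is rounded once to 3 digits.
with_guard = Decimal('2.56') + Decimal('234')

# Without guard/round digits: the smaller operand is truncated to 0.02E2
# during alignment, so the carry from .56 is lost.
without_guard = Decimal('0.02E2') + Decimal('2.34E2')

print(with_guard, without_guard)   # 237 236
```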

Rounding
In binary, an extra bit of 1 is halfway in between the two possible representations
1.001 (1.125) is halfway between 1.00 (1) and 1.01 (1.25) 1.101 (1.625) is halfway between 1.10 (1.5) and 1.11 (1.75)

BITS Pilani, Pilani Campus

IEEE 754 rounding modes


Truncate
1.00100 -> 1.00

Round up to the next value


1.00100 -> 1.01

Round down to the previous value


1.00100 -> 1.00

Round-to-nearest-even
Rounds to the even value (the one with an LSB of 0)
1.00100 -> 1.00
1.01100 -> 1.10
Produces zero average bias
BITS Pilani, Pilani Campus
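Round-to-nearest-even is also what Python's built-in round() does for halfway cases ("banker's rounding"), which makes the zero-average-bias behavior easy to observe:

```python
# Exact halfway cases go to the neighbor whose last digit is even:
print(round(0.5), round(1.5), round(2.5))   # 0 2 2

# Float arithmetic itself rounds to nearest-even on every operation;
# 0.1 and 0.2 have no exact binary form, so the rounded sum differs
# from the rounded literal 0.3:
print(0.1 + 0.2 == 0.3)                     # False
```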

Instruction Set Architecture


BITS Pilani
Pilani Campus

S Mohan

BITS Pilani, Pilani Campus

8086 Intel Microprocessor Architecture

142
BITS Pilani, Pilani Campus

8086 Memory
Memory is byte-addressable. The original 8086 had a 20-bit address bus and could address just 1 MB of main memory; this mode of operation is called Real Addressing Mode. Newer CPUs can access 64 GB of main memory using 36-bit addresses. A word in the 8086 world is 16 bits; a 32-bit quantity is called a double word.
143
BITS Pilani, Pilani Campus
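In real mode the 20-bit physical address is formed by shifting the 16-bit segment value left 4 bits (multiplying by 16) and adding the 16-bit offset. A one-line sketch, using the DS=1120h, SI=2145h values from a later example slide:

```python
def phys(seg, off):
    # real-mode 8086: physical = segment * 16 + offset, truncated to 20 bits
    return ((seg << 4) + off) & 0xFFFFF

print(hex(phys(0x1120, 0x2145)))   # 0x13345
```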

Microprocessor Registers
Visible registers
Addressable during application programming

Invisible registers
Addressable in system programming

8086, 8088 and 80286 have a 16-bit internal architecture
80386 through Pentium 4 have a 32-bit internal architecture

144
BITS Pilani, Pilani Campus

Multipurpose Registers
AX- Accumulator
Addressable as AX, AH, or AL Mainly used for multiplication, division

BX- Base Index


Sometimes holds the offset address of a location

CX Count
Holds count for instructions

DX- Data
Holds a part of the result
145
BITS Pilani, Pilani Campus

Multipurpose Registers(2)
BP- Base Pointer
Points to a memory location for memory data transfers.

DI- Destination Index


Addresses string destination data for the string instructions.

SI- Source Index


Addresses string source data for the string instructions.
146
BITS Pilani, Pilani Campus

Special Purpose Registers


IP- Instruction Pointer
Addresses the next instruction

SP- Stack Pointer


Addresses area of memory called stack

FLAGS
For controlling the microprocessor operation Not modified for data transfer or program control operation

147
BITS Pilani, Pilani Campus

Segment Registers[1]
CS-Code Segment
Contains programs and procedures. Defines the starting address of a section of memory; in real mode, it defines the start of a 64 KB section of memory

DS- Data Segment


Contains data used by the program Length is limited to 64Kb in 8086-80286

148
First Semester 2010-2011

BITS Pilani, Pilani Campus

Segment Registers[2]
ES- Extra Segment
An additional data segment used by string instructions

SS- Stack
Memory used for the stack

149

BITS Pilani, Pilani Campus

What is an Instruction Set?


The complete collection of instructions that are understood by a CPU
Machine code (binary): each instruction has a unique bit pattern
For human consumption a symbolic representation (assembly code) is used
e.g. ADD, SUB, LOAD

Operands can be represented in this way


ADD A, B

BITS Pilani, Pilani Campus

Elements of an Instruction
Operation code (Op code)
Do this!

Source Operand reference


To this!

Result/Destination Operand reference


Put the answer here!

Next Instruction Reference


When you have done that, do this...
BITS Pilani, Pilani Campus

8086 Assembly Instruction Format


General Instruction Format
LABEL: OPCODE OPERAND(s) ; COMMENT
Examples:
Next: MOV AL,BL      ; Transfers BL contents to AL
      MOV AL,12h     ; Transfers 12h to AL
      MOV AL,[1234h] ; Transfers one byte of data from the memory location given by [DS+1234h] to AL
Labels and comments are optional!!!

OPERAND(s) may be
Part of the instruction
Reside in registers of the processor
Reside in a memory location

BITS Pilani, Pilani Campus

Instruction Set Architecture


BITS Pilani
Pilani Campus

S Mohan

BITS Pilani, Pilani Campus

Addressing Modes
An instruction must contain information about how to get its operands
This is called the ADDRESSING MODE: essentially it tells where the operands are and how to get them

Question
Why do we need various addressing modes, i.e. various ways to represent operands?
BITS Pilani, Pilani Campus

Register Addressing
Instruction Opcode Register Address R

Registers
Example MOV AL,BL ADD AX, BX

Operand

BITS Pilani, Pilani Campus

Immediate Addressing
Operand is part of instruction No memory reference to fetch operand Ex:
MOV AX,23h ADD CL, 44h MOV AL, A MOV BL, 11001100B

BITS Pilani, Pilani Campus

Direct Addressing
Instruction Opcode
Examples: MOV CX, DATA ADD CL, TEMP MOV AL,[1234h]

Address A

Memory

Operand
Single memory reference to access data. So no extra calculations required to get effective address!!!

BITS Pilani, Pilani Campus

Register Indirect Addressing


Instruction
Opcode Register Address R
MOV AX,[BX or DI or SI]

Memory

Registers

Pointer to Operand

Operand

BITS Pilani, Pilani Campus

Example: Register Indirect Addressing


Assume DS=1120h, SI=2145h and AX=1234h. What will be the contents of the memory location(s) modified by the following instruction?
MOV [SI], AX
Solution: memory location pointed to by SI = 11200h + 2145h = 13345h
The 8086 is little-endian, so 13345h will contain 34h (AL) and 13346h will contain 12h (AH)

BITS Pilani, Pilani Campus
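The byte ordering in this example can be checked with Python's struct module: a little-endian 16-bit store places the low byte at the lower address, exactly as the 8086 does for MOV [SI],AX:

```python
import struct

# Little-endian word store of AX = 1234h: the low byte (AL = 34h) goes to
# the lower address, the high byte (AH = 12h) to the next one.
print(struct.pack('<H', 0x1234))   # b'\x34\x12'
```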

Register Indirect Addressing More


MOV AL, [DI]
It is clear that this is a byte-sized move (AL is one byte)

MOV [BP], 10h


Ambiguous!!! does it address a byte, word, or doubleword sized memory location?

In an assembler, use
MOV BYTE PTR [DI],10H

BYTE PTR, WORD PTR


Directives used only with instructions that address memory locations through a pointer or index register with immediate data.
BITS Pilani, Pilani Campus

Base Plus Index Addressing


Similar to indirect addressing
Uses one base register (BX or BP), which holds the beginning location of an array, and one index register (DI or SI), which holds the relative position of an element in the array

BITS Pilani, Pilani Campus

Register or Based Relative Addressing


Similar to base-plus-index addressing
Data in a segment of memory are addressed by adding a displacement to the contents of a base or an index register (BP, BX, DI, SI)
Ex. MOV AL, [BX+100h]

BITS Pilani, Pilani Campus

Base Relative Plus Index Addressing


Similar to base-plus-index addressing, with the addition of a displacement to calculate the effective address
Can be used to address a 2D array of data
Ex. MOV AL, [BX+SI+100H]; MOV AX, FILE[BX][DI]

BITS Pilani, Pilani Campus

Example:
.data
FILE EQU THIS BYTE
recA DB 15 dup(?)    ;15 bytes for rec A
recB DB 20 dup(?)    ;20 bytes for rec B
.code
.startup
MOV BX, OFFSET recA  ;address record A
MOV DI, 0            ;address element 0
MOV AL, FILE[BX+DI]
.exit
end

BITS Pilani, Pilani Campus

Pentium Addressing Modes


Virtual or effective address is the offset into a segment
Starting address plus offset gives the linear address
This goes through page translation if paging is enabled

Nine addressing modes are available:

Immediate
Register operand
Displacement
Base
Base with displacement
Scaled index with displacement
Base with index and displacement
Base with scaled index and displacement
Relative
BITS Pilani, Pilani Campus

Design Decisions
Operation repertoire
How many ops, What can they do and How complex are they?

Instruction formats
Length of op code field Number of addresses

Registers
Number of CPU registers available Which operations can be performed on which registers?

Addressing modes Supported


BITS Pilani, Pilani Campus

How many Addresses in an Instruction


More addresses
More complex (powerful?) instructions More registers Inter-register operations are quicker Fewer instructions per program

Fewer addresses
Less complex (powerful?) instructions More instructions per program Faster fetch/execution of instructions
BITS Pilani, Pilani Campus

Example: Number of Addresses


Program to execute
Y = (A-B)/(C+D*E)

THREE-address:
SUB Y,A,B
MUL T,D,E
ADD T,T,C
DIV Y,Y,T

TWO-address:
MOV Y,A
SUB Y,B
MOV T,D
MUL T,E
ADD T,C
DIV Y,T

ONE-address (accumulator):
LOAD D
MUL E
ADD C
STOR Y
LOAD A
SUB B
DIV Y
STOR Y
BITS Pilani, Pilani Campus
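The ONE-address program above can be checked with a tiny accumulator-machine interpreter (my own sketch; the mnemonics follow the slide, the sample operand values are assumptions for illustration):

```python
def run(program, mem):
    """Interpret a one-address (accumulator) program over a memory dict."""
    acc = 0
    for op, arg in program:
        if   op == 'LOAD': acc = mem[arg]
        elif op == 'STOR': mem[arg] = acc
        elif op == 'ADD':  acc += mem[arg]
        elif op == 'SUB':  acc -= mem[arg]
        elif op == 'MUL':  acc *= mem[arg]
        elif op == 'DIV':  acc /= mem[arg]
    return mem

prog = [('LOAD','D'), ('MUL','E'), ('ADD','C'), ('STOR','Y'),
        ('LOAD','A'), ('SUB','B'), ('DIV','Y'), ('STOR','Y')]

m = run(prog, {'A': 10, 'B': 4, 'C': 2, 'D': 2, 'E': 2})
print(m['Y'])   # (10-4)/(2+2*2) = 1.0
```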

Instruction Set Architecture


BITS Pilani
Pilani Campus

S Mohan

BITS Pilani, Pilani Campus


Types of Operand
Addresses
Unsigned integers

Numbers
Binary fixed point/Binary floating point

Characters
ASCII

Logical Data
Bits or flags, Bit level manipulation of data

BITS Pilani, Pilani Campus

Assembly to Machine Conversion in 8086


OPCODE D W MOD REG R/M

Byte one contains


OPCODE (6 bits)
Specifies the operation to be performed. Ex. ADD, SUB, MOV

Register direction (D) bit


Tells whether the register operand in the REG field of byte two is the source or the destination
D=1: data flows to the REG field from R/M (REG is the destination)
D=0: data flows from the REG field to R/M (REG is the source)

Data size (W) bit


Specifies whether the operation will be performed on 8 bits or 16 bits
W=0: 8 bits; W=1: 16 bits
BITS Pilani, Pilani Campus

Assembly to Machine Conversion in 8086


OPCODE D W MOD REG R/M

Register field (REG): 3 bits


To identify the register for the first operand

Mode field (MOD): 2 bits Register/Memory field (R/M): 3 bits


2 bit MOD field and 3 bit R/M field together specify the second operand

BITS Pilani, Pilani Campus

Assembly to machine Conversion in 8086


MOD field:
00  Memory mode, no displacement
01  Memory mode, 8-bit displacement
10  Memory mode, 16-bit displacement
11  Register mode, no displacement

Example: MOV BL, AL -- machine code?
OPCODE = 100010
D = 1 (BL is destination)
W = 0 (8 bits)
MOD = 11 (register mode)
REG = 011 (code for BL)
R/M = 000 (code for AL)

Register codes (MOD = 11):
R/M   000  001  010  011  100  101  110  111
W=0   AL   CL   DL   BL   AH   CH   DH   BH
W=1   AX   CX   DX   BX   SP   BP   SI   DI

Effective address calculation:
R/M   MOD=00          MOD=01      MOD=10
000   BX+SI           BX+SI+D8    BX+SI+D16
001   BX+DI           BX+DI+D8    BX+DI+D16
010   BP+SI           BP+SI+D8    BP+SI+D16
011   BP+DI           BP+DI+D8    BP+DI+D16
100   SI              SI+D8       SI+D16
101   DI              DI+D8       DI+D16
110   Direct Address  BP+D8       BP+D16
111   BX              BX+D8       BX+D16
BITS Pilani, Pilani Campus
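The MOV BL,AL encoding worked out above can be assembled programmatically from its fields; the two-byte layout is OPCODE(6) D(1) W(1) | MOD(2) REG(3) R/M(3):

```python
def encode(opcode, d, w, mod, reg, rm):
    """Pack the 8086 two-byte register-form instruction from its fields."""
    byte1 = (opcode << 2) | (d << 1) | w
    byte2 = (mod << 6) | (reg << 3) | rm
    return bytes([byte1, byte2])

# MOV BL,AL: OPCODE=100010, D=1 (REG is destination), W=0 (byte),
# MOD=11 (register mode), REG=011 (BL), R/M=000 (AL)
code = encode(0b100010, 1, 0, 0b11, 0b011, 0b000)
print(code.hex())   # 8ad8 -- the actual 8086 encoding of MOV BL,AL
```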

Exercise
Show the machine code for the following assembly instructions (use the MOD, register, and effective-address tables from the previous slide):
MOV [DI], AL
ADD [BX+SI+1234h], AX
BITS Pilani, Pilani Campus


Assembly Program Example


Copy the contents of a block of memory (16 bytes) starting at location 10200h to another block of memory starting at 10100h

      MOV AX, 1000h
      MOV DS, AX
      MOV SI, 200h
      MOV DI, 100h
      MOV CX, 10h
REPT: MOV AH, [SI]
      MOV [DI], AH
      INC SI
      INC DI
      DEC CX
      JNZ REPT
BITS Pilani, Pilani Campus
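The copy loop can be mirrored in Python to see what it does to memory (a sketch: memory is modeled as a dict of byte cells, and the source-block contents are my own sample data, since the slide does not specify them):

```python
# Source block of 16 bytes at physical address 10200h (sample data 0..15).
mem = {0x10200 + i: i for i in range(16)}

ds, si, di, cx = 0x1000, 0x200, 0x100, 0x10    # MOV AX,1000h / MOV DS,AX / ...
while cx:                                       # REPT: ... JNZ REPT
    mem[(ds << 4) + di] = mem[(ds << 4) + si]   # MOV AH,[SI] / MOV [DI],AH
    si += 1; di += 1; cx -= 1                   # INC SI / INC DI / DEC CX

print(mem[0x10100], mem[0x1010F])   # 0 15: block copied to 10100h
```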

8086 Instruction Types


Data movement instructions
Arithmetic: add, subtract, increment, decrement, convert byte/word and compare
Logic: AND, OR, exclusive OR, shift/rotate and test
String manipulation: load, store, move, compare and scan for byte/word
Control transfer: conditional, unconditional, call subroutine and return from subroutine
Input/Output instructions
Other: setting/clearing flag bits, stack operations, software interrupts, etc.

BITS Pilani, Pilani Campus

80x86 Instruction Set (Data Movement Instructions)

Moving data/addresses between registers and memory:
MOV
PUSH/POP
LDS, LES, LSS
LEA

BITS Pilani, Pilani Campus

80x86 Instruction Set (String Instructions)


LODS, STOS, MOVS, INS, OUTS
The direction flag (D) and the DI, SI registers are closely associated with these instructions
D=0 indicates auto-increment; D=1 indicates auto-decrement
The CLD and STD instructions clear or set the D flag

BITS Pilani, Pilani Campus

80x86 Instruction Set (Arithmetic/Logical)


Arithmetic Instructions
ADD, SUB, CMP, MUL, DIV

Logical Instructions
AND, OR, TEST, XOR, NOT, NEG

Shift and Rotate instructions


ROL, RCL etc

String Comparisons Instructions


SCAS (String Scan)

Compares AL register with byte block of memory


CMPS (Compare strings)

Compares two sections of memory data

BITS Pilani, Pilani Campus

80x86 Instruction Set (Program Control)


Unconditional jump
JMP instruction

Conditional jump
JNC-> jump no carry JNO->jump if no overflow JE-> jump if equal JNE->jump if not equal

BITS Pilani, Pilani Campus

x86 Instruction Set Summary


(Data Transfer)

IN     ;Input
LAHF   ;Load AH from Flags
LDS    ;Load pointer to DS
LEA    ;Load EA to register
LES    ;Load pointer to ES
LODS   ;Load memory at SI into AX
MOV    ;Move
MOVS   ;Move memory at SI to DI
OUT    ;Output
POP    ;Pop
POPF   ;Pop Flags
PUSH   ;Push
PUSHF  ;Push Flags
SAHF   ;Store AH into Flags
SCAS   ;Scan memory at DI compared to AX
STOS   ;Store AX into memory at DI
XCHG   ;Exchange
XLAT   ;Translate byte to AL

BITS Pilani, Pilani Campus

x86 Instruction Set Summary


(Arithmetic/Logical)

AAA    ;ASCII Adjust for Add in AX
AAD    ;ASCII Adjust for Divide in AX
AAM    ;ASCII Adjust for Multiply in AX
AAS    ;ASCII Adjust for Subtract in AX
ADC    ;Add with Carry
ADD    ;Add
AND    ;Logical AND
CMC    ;Complement Carry
CMP    ;Compare
CMPS   ;Compare memory at SI and DI
CWD    ;Convert Word to Double in AX DX,AX
DAA    ;Decimal Adjust for Add in AX
DAS    ;Decimal Adjust for Subtract in AX
DEC    ;Decrement
DIV    ;Divide (unsigned) in AX(,DX)
IDIV   ;Divide (signed) in AX(,DX)
MUL    ;Multiply (unsigned) in AX(,DX)
IMUL   ;Multiply (signed) in AX(,DX)
INC    ;Increment
BITS Pilani, Pilani Campus

x86 Instruction Set Summary (Arithmetic/Logical Cont.)

NEG      ;Negate
NOT      ;Logical NOT
OR       ;Logical inclusive OR
RCL      ;Rotate through Carry Left
RCR      ;Rotate through Carry Right
ROL      ;Rotate Left
ROR      ;Rotate Right
SAR      ;Shift Arithmetic Right
SBB      ;Subtract with Borrow
SCAS     ;Scan memory at DI compared to AX
SHL/SAL  ;Shift logical/Arithmetic Left
SHR      ;Shift logical Right
SUB      ;Subtract
TEST     ;AND function to flags
XLAT     ;Translate byte to AL
XOR      ;Logical Exclusive OR

BITS Pilani, Pilani Campus

x86 Instruction Set Summary


(Control/Branch)

JNLE/JG          ;Jump on Not Less or Equal/Greater
JNO              ;Jump on Not Overflow
JNP/JPO          ;Jump on Not Parity/Parity Odd
JNS              ;Jump on Not Sign
JO               ;Jump on Overflow
JP/JPE           ;Jump on Parity/Parity Even
JS               ;Jump on Sign
LOOP             ;Loop CX times
LOOPNZ/LOOPNE    ;Loop while Not Zero/Not Equal
LOOPZ/LOOPE      ;Loop while Zero/Equal
NOP              ;No Operation (= XCHG AX,AX)
REP/REPNE/REPNZ  ;Repeat/Repeat Not Equal/Not Zero
REPE/REPZ        ;Repeat Equal/Zero
RET              ;Return from call
SEG              ;Segment register
STC              ;Set Carry
STD              ;Set Direction
STI              ;Set Interrupt
TEST             ;AND function to flags
BITS Pilani, Pilani Campus

x86 Instruction Set Summary


(Control/Branch Cont.)
CALL      ;Call
CLC       ;Clear Carry
CLD       ;Clear Direction
CLI       ;Clear Interrupt
HLT       ;Halt
INT       ;Interrupt
INTO      ;Interrupt on Overflow
IRET      ;Interrupt Return
JB/JNAE   ;Jump on Below/Not Above or Equal
JBE/JNA   ;Jump on Below or Equal/Not Above
JCXZ      ;Jump on CX Zero
JE/JZ     ;Jump on Equal/Zero
JL/JNGE   ;Jump on Less/Not Greater or Equal
JLE/JNG   ;Jump on Less or Equal/Not Greater
JMP       ;Unconditional Jump
JNB/JAE   ;Jump on Not Below/Above or Equal
JNBE/JA   ;Jump on Not Below or Equal/Above
JNE/JNZ   ;Jump on Not Equal/Not Zero
JNL/JGE   ;Jump on Not Less/Greater or Equal
BITS Pilani, Pilani Campus

Quiz
Q. The instruction set architecture for a simple computer must support access to 64 KB of byte-addressable memory space and eight 16-bit general-purpose CPU registers.
a) If the computer has three-operand machine language instructions that operate on the contents of two different CPU registers to produce a result that is stored in a third register, how many bits are required in the instruction format for addressing registers?

BITS Pilani, Pilani Campus

b) If all instructions are to be 16 bits long, how many op codes are available for the threeoperand, register operation instructions described above (neglecting, for the moment, any other types of instructions that might be required)?

BITS Pilani, Pilani Campus

Computer Organization & Architecture


BITS Pilani
Pilani Campus

Virendra Singh Shekhawat Department of Computer Science and Information Systems

BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Module-4 (Lecture-1)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Computer Memory Hierarchy and its characteristics, Cache Memory, Cache Mapping Function : Direct Mapping
BITS Pilani, Pilani Campus

Computer Memory Hierarchy

Source: Null, Linda and Lobur, Julia (2003). Computer Organization and Architecture (p. 236). Sudbury, MA: Jones and Bartlett Publishers

191
BITS Pilani, Pilani Campus

Memory Hierarchy
Computer memory exhibits the widest range of:
Physical type: semiconductor (RAM), magnetic (disk), optical (CD)
Physical characteristics: volatile/non-volatile, erasable/non-erasable
Organization: physical arrangement of bits
Performance: access time (latency), transfer time, memory cycle time
192
BITS Pilani, Pilani Campus

Memory System Characteristics-1


Location
CPU, Internal and External

Capacity
Number of Words/Bytes

Unit of transfer
Internal
Usually governed by data bus width

External
Usually a block which is much larger than a word

Addressable unit
Smallest location which can be uniquely addressed Word or byte internally and Cluster on disks
193
BITS Pilani, Pilani Campus

Memory System Characteristics-2


Access Methods
Sequential
Start at the beginning and read through in order Access time for a record is location dependent e.g. Magnetic Tape

Direct
Individual blocks have unique address based on physical location Access is by jumping to vicinity plus sequential search Access time depends on location and previous location e.g. Hard Disk
194
BITS Pilani, Pilani Campus

Memory System Characteristics-3


Access Methods
Random
Wired-in addressing mechanism Individual addresses identify locations exactly

Associative
Data is located based on a portion of its contents rather than its address Called as Content Addressable Memory (CAM)

Note: Access time is independent of location or previous access for Random and Associative accesses

195

BITS Pilani, Pilani Campus

Mismatch of CPU and Main Memory Speed

196
BITS Pilani, Pilani Campus

Locality of References
Processor generates the main memory references
To fetch instructions and data

Usually, memory references tend to cluster
e.g. loops, subroutines, operations on tables and arrays

Clustering of memory references is called locality of reference
Spatial: accesses tend to be clustered in the address space
Temporal: tendency to access recently accessed memory locations
197
BITS Pilani, Pilani Campus

Cache Memory
Small amount of fast memory called Cache memory sits between normal main memory and CPU

198
BITS Pilani, Pilani Campus

Cache Design Parameters


Total Cache Size
Small enough to reduce the cost and should be large enough to reduce the average access time

Block Size Number of Caches Replacement Algorithm Write Policy Mapping Function
199
BITS Pilani, Pilani Campus

Cache memory: Mapping Function


Why we need mapping?
Cache memory is smaller than main memory
The processor doesn't need to know about the existence of the cache!!!

The correspondence between the main memory blocks (group of words) and in the cache lines is specified by a mapping function
200
BITS Pilani, Pilani Campus

Direct Mapping
Each block of main memory maps to only one cache line
i.e. if a block is in cache, it must be in one specific place

Mapping Function jth Block of the main memory maps to ith cache line
i = j modulo M (M = number of cache lines)

201
BITS Pilani, Pilani Campus
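The mapping i = j modulo M, together with the resulting tag/line/offset address split, can be sketched in a few lines of Python (the cache parameters here are illustrative assumptions, not from a specific slide):

```python
LINE_SIZE, NUM_LINES = 64, 1024    # assumed: 64-byte lines, M = 1024 lines

def split(addr):
    """Split a byte address into (tag, line, offset) for a direct-mapped cache."""
    offset = addr % LINE_SIZE
    block = addr // LINE_SIZE       # main-memory block number j
    line = block % NUM_LINES        # i = j modulo M
    tag = block // NUM_LINES        # distinguishes blocks sharing a line
    return tag, line, offset

# Blocks whose numbers differ by a multiple of M map to the same line
# (and so can evict each other): 0x10000/64 = 1024 = M.
print(split(0x00000), split(0x10000))   # (0, 0, 0) (1, 0, 0)
```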

Mapping Function from Main to Cache Memory

Cache line   Main memory blocks held
0            0, m, 2m, 3m, ..., 2^s - m
1            1, m+1, 2m+1, ..., 2^s - m + 1
...
m-1          m-1, 2m-1, 3m-1, ..., 2^s - 1
202
BITS Pilani, Pilani Campus

Direct Mapping Function Example

203
BITS Pilani, Pilani Campus

Direct Mapping Cache Operation

204
BITS Pilani, Pilani Campus

Direct Mapping pros & cons


Simple Inexpensive Fixed location for given block
If a program accesses 2 blocks that map to the same line repeatedly, cache misses are very high

205
BITS Pilani, Pilani Campus

Exercises
Show the address split up for the direct cache mapping function. Cache and main memory details are as follows:
Cache size is 128K Byte Cache line size is 8 Bytes Main memory size is 16M Bytes Main Memory is Byte addressable

Will main memory addresses x234560 and x374562 map to same cache line?
206
BITS Pilani, Pilani Campus

Review Questions
Why tag bits are necessary to store along with the data bits in cache line?
Hint: How to identify the two blocks of main memory mapped to same cache line

How many tag comparisons are required to check the presence of the word requested by the processor in the direct mapped cache? Justify your answer.
Hint: Block to line mapping is fixed
207
BITS Pilani, Pilani Campus

Summary
Memory Hierarchy in Computer System Concept of Locality of Reference Cache Memory Cache Mapping Function: Direct Mapping

208
BITS Pilani, Pilani Campus

Thank You!

209
BITS Pilani, Pilani Campus

Computer Organization & Architecture


BITS Pilani
Pilani Campus

Virendra Singh Shekhawat Department of Computer Science and Information Systems

BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Module-4 (Lecture-2)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Cache Mapping Functions: Associative, Set Associative , Cache Line Replacement Algorithms
BITS Pilani, Pilani Campus

Direct Mapping Example


Cache Size = 8KB (Data) Main memory address is 32-bit Block Size/Line Size = 64 Bytes

212
BITS Pilani, Pilani Campus

Problem With the Direct Mapping

213
BITS Pilani, Pilani Campus

Solution to Fixed Mapping


Due to the fixed mapping from a main memory block to a cache line, cache utilization becomes low
Can we devise a mapping function such that no line is replaced while an empty line remains in the cache?
Any block of main memory can map to any line of the cache memory!
214
BITS Pilani, Pilani Campus

Fully Associative Mapping

215
BITS Pilani, Pilani Campus

Fully Associative Mapping


A main memory block can load into any line of cache memory Memory address is interpreted as tag and word Tag uniquely identifies block of main memory

216
BITS Pilani, Pilani Campus

Address Split up: Fully Associative Mapping


Cache Size = 2K words of data Main memory Size = 64K words Block Size/Line Size = 4 words

217
BITS Pilani, Pilani Campus

Fully Associative Cache Organization

218
BITS Pilani, Pilani Campus

Set Associative Mapping


Cache is divided into a number of sets
Each set contains a fixed number of lines
A given block maps to any line in a given set
e.g. 2 lines per set: 2-way associative mapping, so a given block can be in one of 2 lines in the set

219
BITS Pilani, Pilani Campus

Address Split up: 2-way Set Associative Mapping


Cache Size = 2K words Main memory Size = 64K words Block Size/Line Size = 4 words

220
BITS Pilani, Pilani Campus
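The address split for this 2-way configuration can be computed directly from the parameters on the slide (a sketch; all quantities are word-addressed, as in the slide):

```python
import math

# 2-way set-associative example: 2K-word cache, 64K-word main memory,
# 4-word lines.
CACHE_WORDS, MEM_WORDS, LINE_WORDS, WAYS = 2048, 65536, 4, 2

addr_bits = int(math.log2(MEM_WORDS))         # 16-bit word address
word_bits = int(math.log2(LINE_WORDS))        # 2 bits select the word in a line
num_sets = CACHE_WORDS // LINE_WORDS // WAYS  # 512 lines / 2 ways = 256 sets
set_bits = int(math.log2(num_sets))           # 8 bits select the set
tag_bits = addr_bits - set_bits - word_bits   # 6 bits of tag remain

print(tag_bits, set_bits, word_bits)   # 6 8 2
```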

K-Way Set Associative Cache Organization

221
BITS Pilani, Pilani Campus

Cache Line Replacement


When all lines hold valid data and a new block is to be brought from main memory into the cache:
We need to replace one of the valid (filled) cache lines
How to decide the victim? Using replacement algorithms
Ex. LRU, LFU, FIFO, Random etc.

222
BITS Pilani, Pilani Campus

Replacement Algorithms
No choice in direct mapping, because each block maps to only one line!!!
Least Recently Used (LRU)
Replace the block that has gone unreferenced for the longest time

First in first out (FIFO)


Replace the block that has been in cache since longest time

Least frequently used (LFU)


Replace block which has had fewest hits

Random
Replace any block randomly
223
BITS Pilani, Pilani Campus
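A fully associative cache with LRU replacement takes only a few lines to simulate (my own sketch: an OrderedDict tracks recency, most recently used at the end). With a cyclic reference pattern one block longer than the cache, LRU misses on every access:

```python
from collections import OrderedDict

def simulate(refs, num_lines):
    """Return (final cache contents in LRU order, total misses)."""
    cache, misses = OrderedDict(), 0
    for block in refs:
        if block in cache:
            cache.move_to_end(block)       # hit: mark most recently used
        else:
            misses += 1
            if len(cache) == num_lines:
                cache.popitem(last=False)  # evict the least recently used
            cache[block] = True
    return list(cache), misses

print(simulate([0, 1, 2, 3, 4, 0], 4))   # ([2, 3, 4, 0], 6): every access misses
```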

Varying Associativity over Cache Size


[Figure: hit ratio (0.0 to 1.0) vs. cache size (1k to 1M bytes) for direct-mapped and 2-, 4-, 8- and 16-way set-associative caches]
224
BITS Pilani, Pilani Campus

Exercise-1
Consider a fully associative cache memory having 4 lines. The main memory references generated by the processor are 0,1,2,3,4,0,1,2,3,4. Show the final cache entries by assuming LRU replacement policy.

225
BITS Pilani, Pilani Campus

Exercise-2
A 4-Way set associative cache has a block size of four 16-bit words. The cache can accommodate a total of 4K such words. The main memory size is 128K words. How the processor's addresses are interpreted.

226
BITS Pilani, Pilani Campus

Summary
Cache to Main Memory Mapping
Fully Associative and Set Associative

Cache Line Replacement Algorithms


LRU FIFO LFU Random
227
BITS Pilani, Pilani Campus

Thank You!

228
BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Module-4 (Lecture-3)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Cache Performance Measurement , Cache Miss Types, Hit Ratio, Write Policy, Multilevel Caches
BITS Pilani, Pilani Campus

Cache Performance
The more processor references found in the cache, the better the performance
If a reference is found in the cache it is a HIT, otherwise a MISS
A penalty is associated with each MISS that occurs
More hits reduce the average access time

230
BITS Pilani, Pilani Campus

Where do misses come from?


Compulsory: Initially the cache is empty (or holds no valid data), so the first access to a block is always a miss. Also called cold-start misses or first-reference misses.

Capacity: If the cache is not large enough to hold all the blocks needed during execution of a program, frequent misses will occur. Capacity misses are those that occur regardless of associativity or block size.

Conflict: If the block-placement strategy is set-associative or direct-mapped, conflict misses (in addition to compulsory and capacity misses) occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses.
231
BITS Pilani, Pilani Campus

Operation of Two Level Memory System-1


Hit Ratio(H)
Ratio of number of references found in the higher level memory (M1) (i.e. cache memory) to the total references

Average Access time


Ts =H*T1 + (1-H)*(T1+T2) T1=Access time of M1 (Cache) T2=Access time of M2 (Main Memory)
232
BITS Pilani, Pilani Campus

Operation of Two Level Memory System-2


Cost
Cs = (C1S1+C2S2)/(S1+S2)

Access Efficiency (T1/Ts)

Measure of how close average access time is to M1 access time On chip cache access time is about 25 to 50 times faster than main memory Off chip cache access time is about 5 to 15 times faster than main memory access time
233
BITS Pilani, Pilani Campus

Hit ratio Calculation Example


Cache access time is 80 ns Main memory access time is 1000 ns Average access time is 180 ns What is Hit ratio?

234
BITS Pilani, Pilani Campus
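Rearranging the two-level access-time formula Ts = H*T1 + (1-H)*(T1+T2) gives H = 1 - (Ts - T1)/T2, which solves this example directly:

```python
# Slide's numbers: cache 80 ns, main memory 1000 ns, average 180 ns.
T1, T2, Ts = 80, 1000, 180

# Ts = H*T1 + (1-H)*(T1+T2)  =>  Ts = T1 + (1-H)*T2  =>  H = 1 - (Ts-T1)/T2
H = 1 - (Ts - T1) / T2
print(H)   # 0.9
```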

Main Memory Inconsistency


What happens when data is modified in cache? Must not overwrite a cache block unless main memory is up to date Multiple CPUs may have individual caches

235
BITS Pilani, Pilani Campus

Write Through Policy


All writes go to main memory as well as cache memory Multiple CPUs have to monitor main memory traffic to keep local (to CPU) cache up to date Implications
Lots of traffic and Slows down writes

236
BITS Pilani, Pilani Campus

Write Back Policy


Updates are initially made in the cache only
An update (dirty) bit for the cache slot is set when the update occurs
If a block is to be replaced, it is written back to main memory only if the update/dirty bit is set
Implications
Other caches can get out of sync
I/O must access main memory through the cache

237
BITS Pilani, Pilani Campus

Hit Ratio vs. Cache Line Size


On each miss, not only the desired word but a number of adjacent words are retrieved
Increased block size will increase the hit ratio at first
Due to the principle of locality

Hit ratio decreases as the block becomes even bigger
Probability of using the newly fetched information becomes less than the probability of reusing the replaced data

Larger blocks
Reduce the number of blocks that fit in the cache
Data may be overwritten shortly after being fetched
Each additional word is less local, so less likely to be needed

No definitive optimum value has been found; 8 to 64 bytes seems reasonable

238
BITS Pilani, Pilani Campus

Multi Level Caches


On-chip cache improves performance. Why???
Will more than one level of cache improve performance???
The simplest organization is two levels: on-chip cache (L1) and external cache (L2)
In many cache designs a separate bus is used to transfer data to an off-chip (L2) cache
Nowadays L2 cache is also available on chip, thanks to shrinking processor feature sizes
So now we have one more level of cache, i.e. an off-chip L3 cache
BITS Pilani, Pilani Campus

Multilevel Cache Performance

240
BITS Pilani, Pilani Campus

Unified Vs. Split Caches


Earlier, the same cache was used for data as well as instructions, i.e. a unified cache
Now we have separate caches for data and instructions, i.e. a split cache
Advantages of a unified cache
It balances the load between data and instructions automatically

Advantages of a split cache
Useful in parallel instruction execution
Eliminates contention between the instruction fetch/decode unit and the execution unit
E.g. superscalar machines: Pentium and PowerPC
241
BITS Pilani, Pilani Campus

Caches and External Connections in P-3 Processor


[Figure: Pentium III cache hierarchy: processing units with split L1 instruction and data caches, a bus interface unit, a dedicated cache bus to the L2 cache, and a system bus to main memory and I/O]

L1 data and instruction caches are 16 KB each
Data cache: 4-way set associative; instruction cache: 2-way set associative
L2 is 512 KB, 2-way set associative
242
BITS Pilani, Pilani Campus

Cache Memory Evolution


[Table: L1/L2/L3 cache sizes by processor and year of introduction, from the IBM 360/85 (1968, 16 to 32 KB L1) through the PDP-11/70, VAX 11/780, IBM mainframes, Intel 80486/Pentium/Pentium 4, PowerPC 601 to G4, Itanium/Itanium 2 and IBM POWER5 (2003, 64 KB L1, 1.9 MB L2, 36 MB L3): L1 grows from tens of KB to about 64 KB, while L2 and L3 caches appear and grow to tens of MB]
243
BITS Pilani, Pilani Campus

Review Questions
For a larger block size, the hit ratio initially increases and then drops. Why?

Why does a multilevel cache system give better performance than a single-level one?
How are conflict misses different from capacity misses?
244
BITS Pilani, Pilani Campus

Summary
Cache Performance
Hit Ratio
Average Access Time
Access Efficiency
Write Policy
Line Size
Multiple Levels of Cache
Unified Cache and Split Cache
245
BITS Pilani, Pilani Campus

Thank You!

246
BITS Pilani, Pilani Campus

Computer Organization & Architecture


BITS Pilani
Pilani Campus

Virendra Singh Shekhawat Department of Computer Science and Information Systems

BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Module-5 (Lecture-1)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Internal memory, Semiconductor Memory Types, RAM, ROM, Memory Chip Organization, Memory Module Organization, Flash Memory
BITS Pilani, Pilani Campus

Basic Concepts
Data transfer between the processor and the memory takes place through the two registers
MAR and MBR or MDR

Memory speed measurement

Memory Access Time
Memory Cycle Time
Minimum time delay required between the initiation of two successive memory operations

Memory cycle time is usually slightly longer than the access time!
Memory cycle time for semiconductor memories ranges from 10 to 100 ns
249
BITS Pilani, Pilani Campus

Connection: Memory to Processor

250
BITS Pilani, Pilani Campus

Semiconductor Memory Types


Memory Type                          Category            Erasure                    Write Mechanism  Volatility
Random-access memory (RAM)           Read-write memory   Electrically, byte-level   Electrically     Volatile
Read-only memory (ROM)               Read-only memory    Not possible               Masks            Nonvolatile
Programmable ROM (PROM)              Read-only memory    Not possible               Electrically     Nonvolatile
Erasable PROM (EPROM)                Read-mostly memory  UV light, chip-level       Electrically     Nonvolatile
Electrically Erasable PROM (EEPROM)  Read-mostly memory  Electrically, byte-level   Electrically     Nonvolatile
Flash memory                         Read-mostly memory  Electrically, block-level  Electrically     Nonvolatile
251
BITS Pilani, Pilani Campus

Static RAM
Memories that consist of circuits capable of retaining their state as long as power is applied
Bits stored as on/off switches
Complex construction, so larger per bit and more expensive

252
BITS Pilani, Pilani Campus

SRAM Cell Organization


Transistor arrangement gives a stable logic state
State 1
C1 high, C2 low, T1 T4 off, T2 T3 on

State 0
C2 high, C1 low, T2 T3 off, T1 T4 on

Address line transistors are T5 and T6

253
BITS Pilani, Pilani Campus

Dynamic RAM
Bits stored as charge in capacitors; charge leaks, so refreshing is needed even when powered
Simpler construction and smaller per bit, so less expensive
Address line active when a bit is read or written
Transistor switch closed (current flows)

Slower operations, used for main memory

254
BITS Pilani, Pilani Campus

Memory Chip Organization


[Figure: internal organization of a 16 x 8 memory chip. Address inputs A0 to A3 feed an address decoder that selects one of the word lines W0 to W15; each word line enables a row of flip-flop (FF) memory cells holding bits b7 ... b1 b0; Sense/Write circuits, controlled by R/W and CS, connect the selected row to the data input/output lines b7 ... b1 b0.]

Fig. Ref. Computer Organization 5th Ed. by Hamacher

255
BITS Pilani, Pilani Campus

Organization of a 1K 1 Memory Chip


[Figure: organization of a 1K x 1 memory chip. The 10-bit address splits into a 5-bit row address, decoded by a 5-bit decoder to select word lines W0 to W31 of a 32 x 32 memory cell array, and a 5-bit column address driving a 32-to-1 output multiplexer and input demultiplexer; Sense/Write circuitry is controlled by R/W and CS, with a single data input/output line.]

External connections required = 15

Fig. Ref. Computer Organization 5th Ed. by Hamacher

256
BITS Pilani, Pilani Campus

Memory Organization Issues


A 16-Mbit chip can be organized as 1M of 16-bit words
36 pins required for address and data (20 + 16), plus 4 pins (R/W, CS, PS, G)

It can be organized as 4K x (512 x 8), i.e. 2M x 8

29 pins required for address and data (21 + 8), plus 4 pins

It can be organized as a 2048 x 2048 x 4-bit array

Row address and column address can be multiplexed
11 pins to address (2^11 = 2048) + 4 data pins + 4 control pins
Adding one more address pin doubles the range of both row and column values, so x4 capacity
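The pin counts above can be cross-checked with a small sketch. The four control pins and the halving of multiplexed address lines follow the slide's assumptions; real packages also need power and ground pins, which are ignored here.

```python
import math

def pins_required(words, bits_per_word, control_pins=4, multiplexed=False):
    # Address pins + data pins + control pins (power/ground ignored).
    # control_pins=4 matches the slide's examples (e.g. R/W, CS, PS, G).
    addr = math.ceil(math.log2(words))
    if multiplexed:
        addr = math.ceil(addr / 2)  # row/column address sent in two halves
    return addr + bits_per_word + control_pins

print(pins_required(1 << 20, 16))                   # 1M x 16: 20 + 16 + 4 = 40
print(pins_required(1 << 21, 8))                    # 2M x 8:  21 + 8 + 4 = 33
print(pins_required(1 << 22, 4, multiplexed=True))  # 2048 x 2048 x 4: 11 + 4 + 4 = 19
```

The same helper answers the 4-Mbit exercise later in this module by plugging in the two organizations.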
257
BITS Pilani, Pilani Campus

16 Mbit DRAM Organization

258
BITS Pilani, Pilani Campus

DIMM and SIMM Memory Modules

259
BITS Pilani, Pilani Campus

256KB Module Organization

260
BITS Pilani, Pilani Campus

1MByte Module Organization

261
BITS Pilani, Pilani Campus

Synchronous DRAM (SDRAM)


Access is synchronized with an external clock
Address is presented to the RAM
RAM finds the data (the CPU waits in conventional DRAM)
Since SDRAM moves data in time with the system clock, the CPU knows when the data will be ready
The CPU does not have to wait; it can do something else

262
BITS Pilani, Pilani Campus

DDR SDRAM Read Timing

263
BITS Pilani, Pilani Campus

Flash memory
Similar technology to EEPROM, i.e. a single transistor controlled by trapped charge
A single cell can be read, but writing is on a block basis and previous contents are erased
Greater density, lower cost per bit, and lower power consumption
Used in MP3 players, cell phones, digital cameras
Larger flash memory modules are called Flash Drives (i.e. Solid State Storage Devices)
264
BITS Pilani, Pilani Campus

Exercises
How does memory organization impact the data transfer rate?
Number of pins required for a 4-Mbit memory chip:
Organized as 4M x 1
Organized as 1M x 4

265
BITS Pilani, Pilani Campus

Summary

266
BITS Pilani, Pilani Campus

Thank You!

267
BITS Pilani, Pilani Campus

Computer Organization & Architecture


BITS Pilani
Pilani Campus

Virendra Singh Shekhawat Department of Computer Science and Information Systems

BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Module-5 (Lecture-2)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: External Memory, Magnetic Disk, Disk Characteristics and Performance Parameters
BITS Pilani, Pilani Campus

External memory
Semiconductor memory cannot be used to store large amounts of information or data
Due to its high per-bit cost!

Large storage requirements are fulfilled by

Magnetic disks, optical disks and magnetic tapes
Called secondary storage

270
BITS Pilani, Pilani Campus

Disk Connection to the System Bus


Processor Main Memory System Bus

Disk Controller

Disk Drive

271
BITS Pilani, Pilani Campus

Magnetic Disk Structure


Disk substrate (non-magnetizable material) coated with magnetizable material (e.g. iron oxide, i.e. rust)
Advantages of a glass substrate over aluminium:
Improved surface uniformity
Reduction in surface defects
Reduced read/write errors

Lower fly heights Better stiffness Better shock/damage resistance


272
BITS Pilani, Pilani Campus

Magnetic Disk

273
BITS Pilani, Pilani Campus

Data Organization on Disk


Concentric rings called tracks
Gaps between tracks
Same number of bits per track
Constant angular velocity

Tracks divided into sectors
Minimum block size is one sector
Disk rotates at constant angular velocity
Gives pie-shaped sectors

Individual tracks and sectors are addressable


274
BITS Pilani, Pilani Campus

Multi Zone Recording Disks

Disk storage capacity in a CAV system is limited by the maximum recording density that can be achieved on the innermost track
275
BITS Pilani, Pilani Campus

Read and Write Mechanisms-1


Recording & retrieval of data via a conductive coil called a head
May be a single read/write head or separate ones
During read/write the head is stationary while the platter rotates
Write:
Electricity flowing through the coil produces a magnetic field
Electric pulses are sent to the head
A magnetic pattern is recorded on the surface below

276
BITS Pilani, Pilani Campus

Read and Write Mechanisms-2


Read (traditional):
A magnetic field moving relative to a coil produces a current in the coil
The same coil is used for read and write (floppy disk)

Read (contemporary):
Separate read head, close to the write head
Partially shielded magneto-resistive (MR) sensor
Electrical resistance depends on the direction of the magnetic field
Resistance changes are detected as voltage signals
Allows high-frequency operation
Higher storage density and speed
277
BITS Pilani, Pilani Campus

Disk Characteristics
Fixed (rare) or movable head
Fixed head
One read/write head per track, mounted on a fixed rigid arm

Movable head
One read/write head per side mounted on a movable arm

Removable or fixed disk
Single or double (usually) sided
Head mechanism


Contact (Floppy), Fixed gap, Flying (Winchester)

Single or multiple platter


278
BITS Pilani, Pilani Campus

Multiple Platters Tracks and Cylinders


C y l i n d e r

279
BITS Pilani, Pilani Campus

Capacity
Vendors express capacity in units of gigabytes (GB), where 1 GB = 10^9 bytes
Capacity is determined by these technology factors:
Recording density (bits/in): number of bits that can be squeezed into a 1-inch segment of a track
Track density (tracks/in): number of tracks that can be squeezed into a 1-inch radial segment
Areal density (bits/in^2): product of recording density and track density

Modern disks partition tracks into disjoint subsets called recording zones
280
BITS Pilani, Pilani Campus

Computing Disk Capacity


Capacity =(# bytes/sector) x (avg. # sectors/track) x (# tracks/surface) x (# surfaces/platter) x (# platters/disk)

Example:
512 bytes/sector, 300 sectors/track (average)
20,000 tracks/surface, 2 surfaces/platter
5 platters/disk
Capacity = 512 x 300 x 20,000 x 2 x 5 = 30.72 GB
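The capacity formula above can be checked directly with a minimal sketch:

```python
def disk_capacity(bytes_per_sector, sectors_per_track, tracks_per_surface,
                  surfaces_per_platter, platters):
    # Straight product of the five factors in the slide's formula
    return (bytes_per_sector * sectors_per_track * tracks_per_surface
            * surfaces_per_platter * platters)

cap = disk_capacity(512, 300, 20_000, 2, 5)
print(cap / 10**9, "GB")   # 30.72 GB (vendor GB = 10^9 bytes)
```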
281
BITS Pilani, Pilani Campus

Disk Performance Parameters


Seek time (Ts)
Time required to position the head on the desired track

Rotational delay
Time required for the desired sector to rotate under the R/W head

Transfer time
The total average access time is: Ta = Ts + 1/(2r) + b/(rN)
Here Ts is the average seek time
r is the rotation speed in revolutions per second
b is the number of bytes to be transferred
N is the number of bytes on a track
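The formula Ta = Ts + 1/(2r) + b/(rN) can be evaluated numerically. A sketch, with the result converted to milliseconds and the example's parameters (4 ms seek, 15,000 rpm, one 512-byte sector, 500 sectors of 512 bytes per track):

```python
def avg_access_time_ms(seek_ms, rpm, bytes_to_transfer, bytes_per_track):
    # Ta = Ts + 1/(2r) + b/(rN), returned in milliseconds
    r = rpm / 60.0                       # revolutions per second
    rotational_delay = 1 / (2 * r)       # average: half a revolution (seconds)
    transfer = bytes_to_transfer / (r * bytes_per_track)  # seconds
    return seek_ms + 1000 * (rotational_delay + transfer)

print(avg_access_time_ms(4, 15_000, 512, 500 * 512))  # 4 + 2 + 0.008 = 6.008 ms
```

Exercise-2 later on gets 6.005 ms because it uses a fixed 100 MB/s transfer rate instead of the track-based b/(rN) term; both readings are consistent with the formula.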
282
BITS Pilani, Pilani Campus

Example
Average seek time = 4 ms
Rotation speed = 15,000 rpm
512 bytes per sector
No. of sectors per track = 500
Want to read a file consisting of 2000 sectors. Calculate the time to read the entire file if the file is stored sequentially.
4 + 2 + 4 = 10 ms to read the first track (500 sectors)
Time required to read the remaining 3 tracks is 3 x (2 + 4) = 18 ms
Total time is 28 ms
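The worked example can be reproduced with a sketch that, as the solution does, charges each track an average rotational delay plus one full revolution (4 ms at 15,000 rpm) to read, and the seek only once:

```python
def sequential_read_ms(seek_ms, rpm, sectors, sectors_per_track):
    r = rpm / 60.0
    rev_ms = 1000 / r            # one full revolution = time to read a track
    rot_ms = rev_ms / 2          # average rotational delay
    tracks = sectors / sectors_per_track
    # first track pays the seek; every track pays rotational delay + full read
    return seek_ms + tracks * (rot_ms + rev_ms)

print(sequential_read_ms(4, 15_000, 2000, 500))   # 28.0 ms
```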
283
BITS Pilani, Pilani Campus

Exercise-1
If file is stored randomly.
2000 sectors are randomly scattered

What will be the total time required to read the file now?

284
BITS Pilani, Pilani Campus

Exercise-2
What is the average time to read or write a 512-byte sector on a disk rotating at 15,000 rpm? The average seek time is 4 ms, and the transfer rate is 100 MB/sec.
Sol: 4 + 2 + 0.005 = 6.005 ms (seek + average half-rotation + 512 bytes at 100 MB/s)

285
BITS Pilani, Pilani Campus

Summary
Secondary Storage in Computers Magnetic Disk
Disk Characteristics Data organization on disk Read/Write Mechanism Access time

286
BITS Pilani, Pilani Campus

Thank You!

287
BITS Pilani, Pilani Campus

Computer Organization & Architecture


BITS Pilani
Pilani Campus

Virendra Singh Shekhawat Department of Computer Science and Information Systems

BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Module-5 (Lecture-3)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Redundant Array of Independent Disks (RAID)
BITS Pilani, Pilani Campus

Performance Improvement in Secondary Storage


In general, multiple components improve performance
Similarly, multiple disks should reduce access time?
Arrays of disks operate independently and in parallel

Justification
With multiple disks, separate I/O requests can be handled in parallel
A single I/O request can be executed in parallel if the requested data is distributed across multiple disks

Researchers @ University of California, Berkeley proposed RAID (1988)


290
BITS Pilani, Pilani Campus

RAID
Redundant Array of Independent Disks
Seven levels in common use
Not a hierarchy
Characteristics:
1. Set of physical disks viewed as single logical drive by operating system 2. Data distributed across physical drives 3. Can use redundant capacity to store parity information
291
BITS Pilani, Pilani Campus

Data Mapping in RAID 0

No redundancy
Data striped across all disks
Round-robin striping


292
BITS Pilani, Pilani Campus

RAID 0
Increased Speed
Multiple data requests are probably not on the same disk
Disks seek in parallel
A set of data is likely to be striped across multiple disks

Draw Backs:
Not a "true" RAID because it is NOT fault-tolerant
The failure of just one drive will result in all data in the array being lost
293
BITS Pilani, Pilani Campus

RAID 1

Mirrored disks
Data is striped across disks
2 copies of each stripe on separate disks
Read from either; write to both
294
BITS Pilani, Pilani Campus

RAID 1
Recovery is simple
Swap the faulty disk & re-mirror
No down time

Drawbacks
Highest disk overhead of all RAID types; expensive
Every write must be done on two disks
295
BITS Pilani, Pilani Campus

Data Mapping in RAID 2

Lots of redundancy
Expensive; worthwhile only when disk errors are frequent

296
BITS Pilani, Pilani Campus

RAID 2 (Not in use Now!)


Uses a parallel access technique
Very small strips
An error-correcting code is calculated across corresponding bits on each data disk
Multiple parity disks store the Hamming-code error correction in corresponding positions
Question: How many redundant disks are required?
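One way to answer the slide's question: a single-error-correcting Hamming code over m data disks needs the smallest r such that 2^r >= m + r + 1 check disks. A sketch:

```python
def hamming_check_disks(data_disks):
    # Smallest r with 2**r >= data_disks + r + 1
    # (single-error-correcting Hamming code across corresponding bits)
    r = 1
    while 2 ** r < data_disks + r + 1:
        r += 1
    return r

for m in (4, 8, 16, 32):
    print(m, "data disks ->", hamming_check_disks(m), "check disks")
# 4 -> 3, 8 -> 4, 16 -> 5, 32 -> 6
```

The relative overhead shrinks as the array grows, but it is still far higher than the single parity disk of RAID 3, which is why RAID 2 fell out of use.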
297
BITS Pilani, Pilani Campus

Data Mapping in RAID 3

298
BITS Pilani, Pilani Campus

RAID 3
Similar to RAID 2
Only one redundant disk, no matter how large the array
Simple parity bit for each set of corresponding bits
Data on a failed drive can be reconstructed from the surviving data and the parity information

Question: Can achieve very high transfer rates. How?
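The parity-based reconstruction mentioned above can be illustrated with bytewise XOR. An illustrative sketch (not RAID controller code): because XOR is its own inverse, XOR-ing the surviving blocks with the parity block regenerates the lost block.

```python
import functools, operator

def parity(blocks):
    # bitwise XOR across corresponding bytes of each (equal-length) block
    return bytes(functools.reduce(operator.xor, col) for col in zip(*blocks))

data = [b"\x12\x34", b"\xab\xcd", b"\x0f\xf0"]
p = parity(data)

# Simulate losing disk 1: XOR of the survivors and the parity block
recovered = parity([data[0], data[2], p])
print(recovered == data[1])   # True
```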

299
BITS Pilani, Pilani Campus

RAID 4
Makes use of independent access with block-level striping
Good for a high I/O request rate due to large strips
Bit-by-bit parity calculated across corresponding strips on each disk
Parity stored on a dedicated parity disk
Drawback???

300
BITS Pilani, Pilani Campus

RAID 5
Round-robin allocation of the parity strip
Avoids the RAID 4 bottleneck at the parity disk
Commonly used in network servers
Drawbacks
Disk failure has a medium impact on throughput
Difficult to rebuild in the event of a disk failure (compared to RAID level 1)

301
BITS Pilani, Pilani Campus

RAID 6
Two parity calculations
Stored in separate blocks on different disks
High data availability
Three disks would need to fail for data loss
Significant write penalty

Drawback
Controller overhead to compute parity is very high

302
BITS Pilani, Pilani Campus

Nesting of RAID Levels: RAID(1+0)


RAID 1 (mirror) arrays are built first, then combined to form a RAID 0 (stripe) array
Provides high levels of:
I/O performance
Data redundancy
Disk fault tolerance

303
BITS Pilani, Pilani Campus

Nesting of RAID Levels: RAID(0+1)


RAID 0 (stripe) arrays are built first, then combined to form a RAID 1 (mirror) array
Provides high levels of I/O performance and data redundancy
Slightly less fault tolerance than 1+0

304
BITS Pilani, Pilani Campus

Review Questions
In the context of RAID, what is the difference between parallel access and independent access?
What is the role of strip size in RAID 0 in achieving a high I/O request rate and a high data transfer capacity?

305
BITS Pilani, Pilani Campus

Summary

Reliability and performance improvement for secondary storage
Comparative analysis of RAID levels

306
BITS Pilani, Pilani Campus

Thank You!

307
BITS Pilani, Pilani Campus

Computer Organization & Architecture


BITS Pilani
Pilani Campus

Virendra Singh Shekhawat Department of Computer Science and Information Systems

BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Module-6 (Lecture-1)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: I/O Modules: Function and Structure I/O Methods: Programmed and Interrupt Driven I/O
BITS Pilani, Pilani Campus

Input /Output
The I/O system provides the interface to the outside world
I/O operations are accomplished through a wide variety of external devices
Used to transfer data between the computer and the external world, e.g. keyboard, monitor, disk drive, printer
An external device is attached to the computer by an interface circuit called an I/O module
310
BITS Pilani, Pilani Campus

Why I/O Module?


Wide variety of peripherals
Impractical to incorporate the necessary logic within processor

Delivering different amounts of data


At different speeds, different formats and word length

Speed mismatch between processor/memory and I/O devices


Some I/O devices are slower than processor and memory while some are faster than processor and memory
311
BITS Pilani, Pilani Campus

Generic Model of I/O Module

312
BITS Pilani, Pilani Campus

I/O Module Functions-1


Control & Timing
To coordinate the flow of traffic between internal resources and external devices
Control signals: Read, Write, Ready, Busy, Error, etc.

Processor and Device Communication


It must communicate with the processor and the external device before transferring the data
Processor communication involves
Command decoding, Data, Status reporting, Address recognition
313
BITS Pilani, Pilani Campus

I/O Module Functions-2


Data Buffering
To overcome the speed mismatch between CPU or memory and device

Error Detection
Mechanical, electrical malfunctioning and transmission e.g. paper jam, disk bad sector
314
BITS Pilani, Pilani Campus

General I/O Module Structure


CPU checks the I/O module's device status
I/O module returns status
If ready, CPU requests data transfer
I/O module gets data from the device
I/O module transfers data to the CPU

315
BITS Pilani, Pilani Campus

I/O Methods

Programmed Interrupt driven Direct Memory Access (DMA)

316
BITS Pilani, Pilani Campus

Programmed I/O
CPU has direct control over I/O
Sensing status
Read/write commands
Transferring data

CPU waits for the I/O module to complete the operation
Commands:


Control - telling module what to do Test - check status Read/Write

Wastes CPU time!!! Why?


317
BITS Pilani, Pilani Campus

Addressing I/O Devices


Memory mapped I/O
Devices and memory share an address space
I/O looks just like memory read/write
No special commands for I/O
Large selection of memory access commands available

Isolated I/O
Separate address spaces
Need I/O or memory select lines
Special commands for I/O

318
BITS Pilani, Pilani Campus

Memory Mapped and Isolated I/O

319
BITS Pilani, Pilani Campus

Interrupts
An interrupt suspends what the processor is currently doing. Classes of interrupts:
Program
e.g. overflow, division by zero, segmentation fault

Timer
Generated by internal processor timer Used in pre-emptive multi-tasking

I/O
from I/O controller

Hardware failure
e.g. memory parity error
320
BITS Pilani, Pilani Campus

Interrupt Driven I/O


CPU issues a read command
I/O module gets data from the peripheral whilst the CPU does other work
I/O module interrupts the CPU once the data is ready
When interrupted, the CPU:
Saves context (register values)
Processes the interrupt
Fetches the data & stores it

321
BITS Pilani, Pilani Campus

Program Flow Control

322
BITS Pilani, Pilani Campus

Device Identification-1
How does processor determine which device issued the interrupt?
Multiple Interrupt Lines
Impractical to provide one line to each I/O device as number of devices increase

Software Poll
Processor branches to an ISR
The ISR polls each I/O module to identify which module caused the interrupt
Time-consuming process
323
BITS Pilani, Pilani Campus

Device Identification-2
Hardware Poll
All I/O modules share a common interrupt request line
The interrupt ACK line is daisy chained through the modules
The requesting module places the address of the I/O module on the data bus; this is called a vector
The technique is called vectored interrupt

324
BITS Pilani, Pilani Campus

Multiple Interrupts - Sequential


Processor will ignore further interrupts whilst processing one interrupt
Interrupts remain pending and are checked after the first interrupt has been processed

325
BITS Pilani, Pilani Campus

Multiple Interrupts: Priority


Low-priority interrupts can be interrupted by higher-priority interrupts
When the higher-priority interrupt has been processed, the processor returns to the previous interrupt

326
BITS Pilani, Pilani Campus

Review Questions
When a device interrupt occurs, how does the processor determine which device issued the interrupt?
What is context switching?
How is programmed I/O different from interrupt-driven I/O?
How is software polling different from the daisy chain method of handling multiple interrupts?
327
BITS Pilani, Pilani Campus

Summary
I/O modules: Functions and Structure I/O methods
Programmed I/O
Device addressing methods

Interrupt Driven I/O


Interrupt handling and processing Device Identification

328
BITS Pilani, Pilani Campus

Thank You!

329
BITS Pilani, Pilani Campus

Computer Organization & Architecture


BITS Pilani
Pilani Campus

Virendra Singh Shekhawat Department of Computer Science and Information Systems

BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Module-6 (Lecture-2)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Direct Memory Access (DMA), DMA Configurations
BITS Pilani, Pilani Campus

Drawbacks of Interrupt Driven and Programmed I/O


Require active CPU intervention
The transfer rate is limited by the speed with which the processor can test and service the device
The CPU is tied up in managing the I/O transfer

Using programmed I/O the processor can move data at a higher rate, at the cost of doing nothing else
Interrupt-driven I/O frees up the processor at the expense of the I/O transfer rate
332
BITS Pilani, Pilani Campus

Direct Memory Access (DMA)


A special control unit can be provided to allow transfer of a block of data directly between an external device and main memory, with no continuous intervention by the processor
This control circuit can be part of the I/O device interface and is called a DMA controller
Good for moving large volumes of data

333
BITS Pilani, Pilani Campus

DMA Module and Operation


CPU tells the DMA controller:
Read/Write
Device address
Starting address of the memory block for the data
Amount of data to be transferred

CPU carries on with other work
DMA controller deals with the transfer
DMA controller sends an interrupt when finished
334
BITS Pilani, Pilani Campus

DMA Transfer Method: Cycle Stealing


Memory accesses by the processor and by the DMA controller are interwoven
The DMA controller takes over the bus for a cycle
Transfer of one word of data
Not an interrupt:
the CPU does not switch context

The CPU is suspended just before it accesses the bus

i.e. before an operand or data fetch or a data write

Slows down the CPU, but not as much as the CPU doing the transfer itself
335
BITS Pilani, Pilani Campus

DMA and Interrupt Breakpoints During an Instruction Cycle

336
BITS Pilani, Pilani Campus

DMA Transfer Method: Burst Mode


The DMA controller is given exclusive access to main memory to transfer a block of data without interruption
Most DMA controllers incorporate a data storage buffer
The controller reads a block of data from main memory in burst mode and stores it in its input buffer
The data is then transferred to the desired I/O device
337
BITS Pilani, Pilani Campus

DMA Configurations (1)

Single bus, detached DMA controller
Each transfer uses the bus twice
I/O to DMA, then DMA to memory

CPU is suspended twice!


338
BITS Pilani, Pilani Campus

DMA Configurations (2)

Single bus, integrated DMA controller
Controller may support more than one device
Each transfer uses the bus once
CPU is suspended once!
339
BITS Pilani, Pilani Campus

DMA Configurations (3)

Separate I/O bus
Bus supports all DMA-enabled devices
Each transfer uses the system bus once
CPU is suspended once
340
BITS Pilani, Pilani Campus

Intel 8237A DMA Controller

341
BITS Pilani, Pilani Campus

Exercise-1
A processor and an I/O device are connected to main memory via a shared bus of width one word.
The processor can execute 10^6 instructions/sec. An average instruction requires 5 machine cycles, three of which use the memory bus. A memory read/write requires one machine cycle.
The processor is executing background programs that require 95% of its instruction execution rate but no I/O instructions.
Assume a processor cycle equals one bus cycle, and that the I/O device is to be used to transfer very large blocks of data.
Answer the following:
(a) If programmed I/O is used and each one-word transfer requires the processor to execute two instructions, what is the maximum I/O transfer rate?
(b) Estimate the rate if DMA is used.

342
BITS Pilani, Pilani Campus

Solution
The processor can devote only 5% of its time to I/O. Thus the maximum I/O instruction execution rate is 10^6 x 0.05 = 50,000 instructions per second. At two instructions per word, the I/O transfer rate is therefore 25,000 words/second.
The number of machine cycles per second available for DMA control is 10^6 (0.05 x 5 + 0.95 x 2) = 2.15 x 10^6: all 5 cycles of the idle 5%, plus the 2 non-bus cycles of each background instruction. If we assume that the DMA module can use all of these cycles, and ignore any setup or status-checking time, then this value is the maximum I/O transfer rate.
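The two answers can be reproduced numerically. A sketch following the solution's assumptions:

```python
instr_rate = 10**6          # instructions per second

# (a) Programmed I/O: 5% of instruction slots, 2 instructions per word
pio_words_per_sec = instr_rate * 0.05 / 2
print(pio_words_per_sec)    # 25000.0 words/second

# (b) DMA: the idle 5% contributes all 5 cycles per instruction slot;
# each background instruction leaves 2 of its 5 cycles off the bus
dma_cycles_per_sec = instr_rate * (0.05 * 5 + 0.95 * 2)
print(dma_cycles_per_sec)   # ~2150000, the maximum DMA transfer rate
```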

343
BITS Pilani, Pilani Campus

Exercise-2
A DMA module is transferring characters to memory using cycle stealing, from a device transmitting at 9600 bps. The processor is fetching instructions at the rate of 10^6 instructions per second. By how much will the processor be slowed down due to the DMA activity?
Sol:
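A sketch of the standard calculation, assuming 8 bits per character and one stolen bus cycle per character transferred:

```python
bps = 9600
char_rate = bps / 8            # 1200 characters -> 1200 stolen cycles per second
cpu_rate = 10**6               # instruction fetches per second
slowdown = char_rate / cpu_rate
print(f"{slowdown:.4%}")       # 0.1200% slowdown
```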

344
BITS Pilani, Pilani Campus

Review Questions
In general, DMA access to main memory is given higher priority than CPU access to main memory. Why?
How is cycle stealing for DMA different from the context switching used in interrupt-driven I/O?
Why is DMA better than interrupt-driven I/O for large data transfers?
345
BITS Pilani, Pilani Campus

Summary
Limitations of Interrupt-Driven and Programmed I/O
DMA Structure and Function
DMA Configurations
Exercises

346
BITS Pilani, Pilani Campus

Thank You!

347
BITS Pilani, Pilani Campus

Computer Organization & Architecture


BITS Pilani
Pilani Campus

Virendra Singh Shekhawat Department of Computer Science and Information Systems

BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Module-7 (Lecture-1)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: RISC Vs. CISC Architectures, Instruction Pipeline, Six Stage Instruction Pipeline, Pipeline Performance
BITS Pilani, Pilani Campus

Instruction Execution Characteristics


High Level Language (HLL) programs mostly comprise
Assignment statements (e.g. A = B)
Conditional statements (e.g. IF, LOOP)
Procedure call/return (e.g. functions in C)

Key for the ISA designer
These statements should be supported in an optimal fashion
350
BITS Pilani, Pilani Campus

RISC Architecture
Earlier computers had small and simple instruction sets. Why???
In the early 1980s, designers recommended that computers use fewer instructions with simple constructs so they can be executed much faster without accessing memory as often.

351
BITS Pilani, Pilani Campus

RISC Characteristics
Relatively few instructions
128 or less

Relatively few addressing modes


Memory access is limited to LOAD and STORE instructions

All operations done within the registers of the CPU


Use of overlapped register windows for optimization

Fixed Length (4 Bytes), easily decoded instruction format


Instruction execution time consistent

Hardwired control

352
BITS Pilani, Pilani Campus

RISC Processors
MIPS R4000
First commercially available RISC processor
Supports thirty-two 64-bit registers
128 KB of high-speed cache

SPARC (Sun)
Based on Berkeley RISC model

PowerPC (IBM) ARM processor family Apple iPods


353
BITS Pilani, Pilani Campus

CISC Architecture
Later, both the complexity and the number of instructions increased. Why???
To simplify compilation

The trend into computer hardware complexity was influenced by various factors, such as
To provide support for more customer applications
Adding instructions that facilitate the translation from high level language into machine language programs
A single machine instruction for each high level language statement
Ex: VAX, IBM/370, Intel x86 based processors, Motorola 68000 series
354
BITS Pilani, Pilani Campus

CISC Characteristics
A large number of instructions
Some instructions for special tasks, used infrequently
A large variety of addressing modes (5 to 20)
Variable length instruction formats
Instructions that manipulate operands in memory

However, it soon became apparent that a complex instruction set has a number of disadvantages
These include a complex instruction decoding scheme, an increased size of the control unit, and increased logic delays

355

BITS Pilani, Pilani Campus

RISC and CISC Comparison


Computer performance equation
CPU Time = (instructions/program) x (avg. cycles/instruction) x (seconds/cycle)

Example CISC program:

mov ax, 20
mov bx, 5
mul ax, bx

Example RISC program:

       mov ax, 0
       mov bx, 20
       mov cx, 5
again: add ax, bx
       loop again

356
BITS Pilani, Pilani Campus

RISC vs CISC Performance Summary


The CISC approach attempts to minimize the number of instructions per program by sacrificing the number of cycles per instruction. RISC does the opposite, reducing the cycles per instruction at the cost of the number of instructions per program.
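The performance equation above can be exercised with illustrative numbers. The instruction counts, CPI values and clock rate below are hypothetical, chosen only to show the trade-off the slide describes:

```python
def cpu_time(instructions, cpi, clock_hz):
    # CPU Time = (instructions/program) x (cycles/instruction) x (seconds/cycle)
    return instructions * cpi / clock_hz

# Hypothetical: a CISC multiply in 3 instructions at 8 cycles each versus a
# RISC loop of 13 instructions at 1.5 cycles each, both at 100 MHz
print(cpu_time(3, 8, 100e6))     # 2.4e-07 s
print(cpu_time(13, 1.5, 100e6))  # 1.95e-07 s
```

Either side can win: the product of the three factors, not any single one, decides performance.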
357
BITS Pilani, Pilani Campus

RISC Vs. CISC Controversy


Now the general trend in computer architecture and organization has been toward increasing processor complexity
More instructions
More addressing modes
More registers
No definitive test set of programs exists to compare the two architectures
Performance varies with the programs

Most commercially available machines are a mixture of RISC and CISC characteristics
358
BITS Pilani, Pilani Campus

Instruction Pipelining

Pipelining
What is pipelining?
Six-stage pipeline example
Performance issues

359
BITS Pilani, Pilani Campus

Instruction Pipelining
Similar to the use of an assembly line in a manufacturing plant
New inputs are accepted at one end before previously accepted inputs appear as outputs at the other end
The same concept can be applied to instruction execution

360
BITS Pilani, Pilani Campus

Revisit Instruction Cycle

361
BITS Pilani, Pilani Campus

Two Stage Instruction Pipeline

Execution usually does not access main memory. The next instruction can be fetched during execution of the current instruction. This is called instruction prefetch.
362
BITS Pilani, Pilani Campus

Two Stage Pipeline: Hardware Organization

363
BITS Pilani, Pilani Campus

Two Stage Pipeline Facts


But not doubled!!!
The fetch stage is usually shorter than the execution stage
Prefetch more than one instruction?

Any jump or branch means that prefetched instructions are not the required instructions
These factors reduce the potential effectiveness of the two-stage pipeline
How about going for more stages?
364
BITS Pilani, Pilani Campus

Instruction Execution Stages


Six stages during Instruction execution
Fetch Instruction (FI)
Decode Instruction (DI)
Calculate Operands (i.e. EAs) (CO)
Fetch Operands (FO)
Execute Instruction (EI)
Write Operand (WO)

Overlap these operations!!!


365
BITS Pilani, Pilani Campus

Timing Diagram for Instruction Pipeline

366
BITS Pilani, Pilani Campus

The Effect of a Conditional Branch on Instruction Pipeline

367
BITS Pilani, Pilani Campus

Pipeline Performance
Total time required to execute n instructions for a pipeline with k stages and cycle time t
Tk,n = [k+(n-1)]t

Speedup factor
Sk = nk/[k+(n-1)]

How does the value of k influence the speedup???


The larger the number of pipeline stages (k), the greater the potential for speedup.

However, as a practical matter, the potential gains of additional stages are countered by increases in cost and in delays between stages.
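The two formulas can be evaluated directly. A sketch, with t in arbitrary time units:

```python
def pipeline_time(k, n, t):
    # T(k, n) = [k + (n - 1)] * t : time to run n instructions through k stages
    return (k + n - 1) * t

def speedup(k, n):
    # S(k) = n*k / [k + (n - 1)] : relative to non-pipelined k*t per instruction
    return n * k / (k + n - 1)

print(pipeline_time(6, 100, 10))  # 1050 time units
print(round(speedup(6, 100), 2))  # 5.71; S(k) approaches k as n grows
```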
368
BITS Pilani, Pilani Campus

Speedup Factors: Instruction Pipelining

369
BITS Pilani, Pilani Campus

Review Questions
A program takes 600 ns to execute on a non-pipelined processor. Suppose we need to run 100 programs of the same type on a six-stage pipelined processor with a clock cycle of 10 ns. What is the speedup ratio of the pipeline?
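A sketch of one common reading of the first question, treating each of the 100 programs as an item flowing through the 6-stage pipeline with a 10 ns clock:

```python
non_pipelined = 100 * 600            # 60,000 ns for 100 programs
pipelined = (6 + 100 - 1) * 10       # [k + (n - 1)] * t = 1050 ns
print(non_pipelined / pipelined)     # ~57.14
```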

What is branch penalty?

370
BITS Pilani, Pilani Campus

Summary
RISC and CISC Architectures
Instruction Pipeline
Pipeline Performance Factors

371
BITS Pilani, Pilani Campus

Thank You!

372
BITS Pilani, Pilani Campus

Computer Organization & Architecture


BITS Pilani
Pilani Campus

Virendra Singh Shekhawat Department of Computer Science and Information Systems

BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Module-7 (Lecture-2)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Pipeline Hazards: Data Hazards, Resource Hazards, Control Hazards, RISC Pipeline

Four Stage Pipeline


Fetch: Read the instruction from memory
Decode: Decode the instruction and fetch the source operand(s)
Execute: Perform the ALU operation
Write: Store the result in the destination location
Note:
Each stage in the pipeline is expected to complete its operation in one clock cycle; hence the clock period should be sufficiently long to complete the task being performed in any stage.
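The note above has a simple numerical consequence (the stage delays below are made-up values, not from the slides): the clock period is set by the slowest stage, not by the average.

```python
# The clock period must accommodate the slowest stage (made-up delays, ns).
stage_delays = {"Fetch": 2.0, "Decode": 1.5, "Execute": 3.0, "Write": 1.0}

cycle_time = max(stage_delays.values())   # 3.0 ns: the slowest stage wins
k = len(stage_delays)

def pipelined_time(n, cycle=cycle_time, stages=k):
    """Time to run n instructions through the 4-stage pipeline."""
    return (stages + (n - 1)) * cycle

print(cycle_time, pipelined_time(100))  # → 3.0 309.0
```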

Stalls/Hazards in Pipeline
Any condition that causes the pipeline to stall is called a hazard.
Data Hazards
Any condition in which either the source or the destination operands of an instruction are not available at the expected time in the pipeline.


Data Hazards
Read After Write (RAW) Hazard
I1: R1 = R1 + R2
I2: R3 = R1 + 3


Solution for RAW Dependency


Hardware Solution
Operand Forwarding

Software Solution
Inserting NOP (No-Operation) instructions between I1 and I2
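The software solution can be sketched as follows (a hypothetical helper; the instruction encoding is assumed, not from the slides). In a four-stage pipeline without forwarding, operands are read in Decode and written in Write, so a dependent instruction needs two NOPs after its producer:

```python
# Sketch: insert NOPs to resolve a RAW hazard in a 4-stage (F, D, E, W)
# pipeline without operand forwarding. Operands are read in D and written
# in W, so a dependent instruction needs 2 NOPs after its producer.
NOP = ("nop", None, ())

def insert_nops(program, distance=2):
    """program: list of (name, dest_reg, (src_regs...)). Returns padded list."""
    out = []
    for instr in program:
        _, _, srcs = instr
        # Check how recently any source register was written.
        for back in range(1, distance + 1):
            if len(out) >= back:
                prev_dest = out[-back][1]
                if prev_dest is not None and prev_dest in srcs:
                    out.extend([NOP] * (distance - back + 1))
                    break
        out.append(instr)
    return out

prog = [("add", "R1", ("R1", "R2")),   # I1: R1 = R1 + R2
        ("add", "R3", ("R1",))]        # I2: R3 = R1 + 3  (RAW on R1)
padded = insert_nops(prog)
print([p[0] for p in padded])  # → ['add', 'nop', 'nop', 'add']
```

With operand forwarding the same pair would need no NOPs, which is why the hardware solution is preferred.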


Resource Hazards
One instruction may need to access memory as part of the Write stage while another instruction is being fetched.
If instructions and data are in the same cache unit, there is a conflict. This is called a structural conflict or resource hazard.


Instruction/Control Hazards
If the required instruction is not available, the pipeline stalls. Examples: a cache miss on instruction fetch, a branch instruction. These are called control hazards or instruction hazards.


Example: Instruction Hazard


Cycle →   1   2   3   4   5   6   7   8
I1        F1  D1  E1  W1
I2 (Br)       F2  D2  E2
I3                F3  D3  X
I4                    F4  X
Ik                        Fk  Dk  Ek  Wk

Conditional branch? Unconditional branch?



Handling Unconditional Branch


The branch will always be taken, so compute the branch target address as soon as possible
Needs extra hardware to do it at the Decode stage (e.g., in the four-stage pipeline)

Many processors employ a fetch unit that can fetch instructions before they are needed


Handling Conditional Branches-1


The decision to branch can't be made until that instruction has executed.

Multiple Streams
Have two pipelines
Prefetch each branch into a separate pipeline
Use the appropriate pipeline

Drawbacks
Leads to bus & register contention
Multiple branches lead to further pipelines being needed
Used in IBM 370/168

Handling Conditional Branches-2


Pre-fetch Branch Target
Studies suggest that conditional branches are taken more than 50% of the time, so prefetching the branch target gives better results
Used by IBM 360/91

Loop Buffer
Very fast memory maintained by the fetch stage of the pipeline; keeps the n most recently fetched instructions, in sequence
Check the buffer before fetching from memory
Very good for small loops and jumps
Used in CRAY-1

Review Questions
Consider the following sequence of instructions:
I1: add R1, R0, #20 (R1 ← R0 + 20)
I2: mul R2, R3, #2 (R2 ← R3 * 2)
I3: and R4, R1, R2 (R4 ← R1 and R2)
I4: add R5, R4, R2 (R5 ← R4 + R2)
These instructions are executed on a computer that has the four-stage pipeline (Fetch, Decode, Execute, Write) discussed in this class. Assume that every stage of every instruction requires one cycle, except the Execute stage of the multiply instruction, which requires two cycles. Draw a diagram describing the operation performed by each pipeline stage during each clock cycle. Show the stalls in the pipeline, if any.

Solution

I1: add R1, R0, #20 (R1 ← R0 + 20)
I2: mul R2, R3, #2 (R2 ← R3 * 2)
I3: and R4, R1, R2 (R4 ← R1 and R2)
I4: add R5, R4, R2 (R5 ← R4 + R2)


Prediction Based Algorithms for Handling Conditional Branches


Predict never taken
Assume that jump will not happen. Always fetch next instruction

Predict always taken


Assume that jump will happen. Always fetch target instruction

Predict by opcode
Some instructions are more likely to result in a jump than others. Can get up to 75% success

Taken/Not taken switch


Based on previous history.

1-Bit Branch Predictor


Two Bit Branch Predictor
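The figures for the one-bit and two-bit predictors are not reproduced in this text, but the two-bit scheme can be sketched as a saturating counter (a minimal illustration; the state encoding below is an assumption): two states predict "not taken", two predict "taken", and a single mispredict cannot flip a strong prediction, unlike the 1-bit version which flips on every mispredict.

```python
# Sketch of a 2-bit saturating-counter branch predictor: states 0-1 predict
# "not taken", states 2-3 predict "taken"; the counter moves one step per
# actual outcome, so one anomalous branch cannot flip a strong prediction.
class TwoBitPredictor:
    def __init__(self, state=0):
        self.state = state  # 0 = strong NT, 1 = weak NT, 2 = weak T, 3 = strong T

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True, True, True, False, True, True]  # loop-like branch history
hits = 0
for actual in outcomes:
    hits += (p.predict() == actual)
    p.update(actual)
print(hits)  # → 3: the lone not-taken outcome costs one miss, not two
```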


Delayed Branch
Do not take jump until you have to Rearrange instructions Example
Original:
LOOP  Shift_left R1
      Decrement  R2
      Brp        LOOP
NEXT  Add        R1,R3

Reordered (branch delay slot filled):
LOOP  Decrement  R2
      Brp        LOOP
      Shift_left R1
NEXT  Add        R1,R3

The location following a branch instruction is called the branch delay slot. The objective is to utilize the slot by placing a useful instruction in it.

RISC Pipelining
Most instructions are register-to-register, with two pipeline stages:
I: Instruction fetch
E: Execute (ALU operation with register input and output)

Load and store need three stages:
I: Instruction fetch
E: Execute (calculate memory address)
D: Memory (register-to-memory or memory-to-register operation)

RISC Pipeline Example


Four Stage RISC Pipelining


The E stage usually involves an ALU operation and may take longer, so it can be divided into two stages:
E1: Register file read
E2: ALU operation and register write


Summary
Pipeline Hazards
Data Hazards
Resource Hazards
Instruction Hazards
Branch Prediction Algorithms
RISC Pipeline as an Example


Thank You!



Module-7 (Lecture-3)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Instruction Level and Machine Level Parallelism, Superscalar Pipeline and Superscalar Processor Design Elements, Hazards and Solutions

Introduction
A scalar pipeline improves performance
By overlapping the various instruction stages
But it is unable to fully exploit Instruction Level Parallelism (ILP)!

The term superscalar was coined (1987) for designs that exploit the ILP of a program


What is Superscalar?
Multiple independent instruction pipelines are used to execute instructions independently and concurrently
Common instructions (arithmetic, load/store, conditional branch) are initiated simultaneously and executed independently
Improves the performance of the execution of instructions that operate on scalar quantities
Equally applicable to RISC & CISC, but in practice usually RISC

Super Scalar vs. Super Pipeline


Super Pipeline
Many pipeline stages need less than half a clock cycle
Doubling the internal clock speed gets two tasks done per external clock cycle


Limitations
The superscalar approach depends on the ability to execute multiple instructions in parallel
How to maximize the ILP
Compiler-based optimization
Hardware techniques

Limited by the following dependencies or hazards
Instruction / procedural hazards
Resource hazards
Data hazards (RAW, WAR, WAW)

Procedural/Instruction Hazards
Cannot execute instructions after a branch in parallel with instructions before the branch
Also, if the instruction length is not fixed, instructions have to be (partially) decoded to find out how many fetches are needed
This prevents simultaneous fetches

So superscalar techniques are most applicable to a RISC or RISC-like architecture



Resource Conflict
Two or more instructions requiring access to the same resource at the same time
e.g. two arithmetic instructions

Solution
Duplicate resources, e.g. have two arithmetic units


Data Hazards/Conflicts
Example Program
I1: R3 = R3 + R5
I2: R4 = R3 + 1
I3: R3 = R5 + 1
I4: R7 = R3 + R4

Identify the hazards among the above instructions:

I1 and I2?
I3 and I4?
I2 and I3?
I1 and I3?
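The pairwise checks can be mechanized (a sketch with an assumed (dest, sources) encoding of the four instructions above):

```python
# Sketch: classify data hazards between an earlier instruction i and a
# later instruction j, each given as (dest, set_of_sources).
def hazards(i, j):
    """Return the set of hazard types from i (earlier) to j (later)."""
    di, si = i
    dj, sj = j
    found = set()
    if di in sj:
        found.add("RAW")   # j reads what i writes (true dependency)
    if dj in si:
        found.add("WAR")   # j writes what i reads (anti-dependency)
    if di == dj:
        found.add("WAW")   # both write the same register (output dependency)
    return found

I1 = ("R3", {"R3", "R5"})   # R3 = R3 + R5
I2 = ("R4", {"R3"})         # R4 = R3 + 1
I3 = ("R3", {"R5"})         # R3 = R5 + 1
I4 = ("R7", {"R3", "R4"})   # R7 = R3 + R4

print(sorted(hazards(I1, I2)))  # → ['RAW']
print(sorted(hazards(I3, I4)))  # → ['RAW']
print(sorted(hazards(I2, I3)))  # → ['WAR']
print(sorted(hazards(I1, I3)))  # → ['WAR', 'WAW']
```

Note that I1 and I3 conflict in two ways: I3 overwrites R3 (WAW) and does so before I1's read would be safe out of order (WAR).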

Effect of Hazards/ Dependencies


Superscalar Processor Design Issues


Instruction-Level Parallelism
If instructions in a sequence are independent, their execution can be overlapped
The degree of instruction-level parallelism is governed by data and procedural dependencies

Machine Parallelism
The ability to take advantage of instruction-level parallelism
Governed by the number of parallel pipelines

Note
A program may not have enough instruction-level parallelism to take full advantage of machine parallelism

Instruction Issue Policy


Instruction Issue
Refers to the process of initiating instruction execution in the processor's functional units
Occurs when an instruction moves from the decode stage to the first execute stage

Instruction Issue Policy
Refers to the protocol used to issue instructions
The processor looks ahead to locate instructions that can be brought into the pipeline and executed

Ordering issues
The order in which instructions are fetched, executed, and change the contents of registers and memory

Issue Policies
In-order issue with in-order completion
In-order issue with out-of-order completion
Out-of-order issue with out-of-order completion

Example: Issue Policies-1


Assumptions
A superscalar pipeline capable of fetching and decoding two instructions at a time
The next two instructions must wait until the pair of decode pipeline stages has cleared
Three separate functional units (two integer arithmetic and one floating-point arithmetic)
Two instances of the write-back pipeline stage


Example: Issue Policies-2


Constraints
I1 requires two cycles to execute
I3 and I4 conflict for the same functional unit
I5 depends on the value produced by I4
I5 and I6 conflict for a functional unit


In-Order Issue In-Order Completion


In-Order Issue Out-of-Order Completion


It improves the performance of instructions that require multiple cycles
E.g., I2 is allowed to run to completion prior to I1; as a result, I3 completes earlier, saving one cycle (better than the previous policy)


Out-of-Order Issue Out-of-Order Completion


Decouples the decode stage from the execute stage by introducing a buffer called the instruction window
Since instructions have already been decoded, the processor can look ahead


Register Renaming
When out-of-order techniques are used
WAW and WAR hazards occur because register contents may not reflect the correct ordering of the program
WAW and WAR dependencies are known as storage conflicts

Compilers use optimized register allocation, which leads to more storage conflicts
Register Renaming is used to deal with this
Similar to resource duplication

Register Renaming
Original:
R3 := R3 + R5  (I1)
R4 := R3 + 1   (I2)
R3 := R5 + 1   (I3)
R7 := R3 + R4  (I4)

After renaming:
R3b := R3a + R5a  (I1)
R4b := R3b + 1    (I2)
R3c := R5a + 1    (I3)
R7a := R3c + R4b  (I4)

The same original register reference in several different instructions may refer to different actual registers, if different values are needed

Superscalar Execution


Machine Parallelism
To enhance the performance of superscalar processors:
Duplication of resources
Out-of-order instruction issue
Register renaming


Review Questions
How is instruction-level parallelism different from machine parallelism?
Identify the various data hazards among the following instructions:
I1: R1 = 20
I2: R1 = R3 + R4
I3: R2 = R4 - 10
I4: R4 = R1 + R2
I5: R2 = R1 + 40

Summary
Instruction-level Parallelism vs. Machine-level Parallelism
Superscalar Pipelines
Performance Limitation Factors
Superscalar Processor Design Issues
Register Renaming
Eliminates WAR and WAW hazards


Thank You!

