
Computer Organization & Architecture

BITS Pilani
Pilani Campus

Virendra Singh Shekhawat Department of Computer Science and Information Systems

BITS Pilani
Pilani Campus

Module-1 (Lecture-1)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Computer Organization and Architecture, Functional and Structural View of Computer System, Brief History of Computers, Evolution of Intel x86 Architecture

Computer Architecture and Organization-1


Architecture comprises those attributes visible to the programmer that have a direct impact on the logical execution of a program
Instruction set, number of bits used for data types, I/O mechanisms, memory addressing techniques; e.g. the x86 architecture, the IBM System/360 architecture

Organization is the implementation of the computer system in terms of the interconnection of its functional units


Control signals
Interfaces between computer and peripherals
Memory technology
3
BITS Pilani, Pilani Campus

Computer Architecture and Organization-2


An architecture question:
Is there a multiply/divide instruction available?

An organization question:
Is multiplication implemented by separate hardware, or is it done by repeated addition?


How to Describe Computer System?


A computer is a complex system!
It contains millions of electronic components

Function is the operation of individual components as part of the structure
Structure is the way in which components are related to each other

Functional View of Computer


Computer Operations


Structural View of a Computer


Top level: the Computer, connected to Peripherals and Communication lines
Inside the computer: Central Processing Unit, Main Memory, I/O, Systems Interconnection



Structural View of the CPU


Top level: Computer, with I/O, Memory, System Bus, and CPU
Inside the CPU: Registers, Arithmetic and Logic Unit, Internal CPU Interconnection, Control Unit

Brief History of Computers


ENIAC was the first general-purpose electronic digital computer
Built by John Mauchly and J. Presper Eckert, completed in 1946

Von Neumann machine (1946), known as the IAS computer
Based on the stored-program concept

PDP-1 Computer (1957)
Developed by Digital Equipment Corporation (DEC)
First step towards minicomputers

IBM System/360 (1964)
First planned family of computers

Computer Generations
Vacuum tube (1946-1957) & Transistor (1958-1964)
Integrated Circuits:
Small scale integration - 1965 onwards
Up to 100 devices on a chip

Medium scale integration - to 1971


100 - 3,000 devices on a chip

Large scale integration - 1971-1977


3,000 - 100,000 devices on a chip

Very large scale integration - 1978 -1991


100,000 - 100,000,000 devices on a chip

Ultra large scale integration - 1991 onwards


Over 100,000,000 devices on a chip

x86 Evolution-1
1971 - 4004
First microprocessor (4-bit)
All CPU components on a single chip
Followed in 1972 by the 8008 (an 8-bit processor)
Both designed for specific applications

8080
First general-purpose microprocessor
Processes/moves 8-bit data at a time
Used in the first personal computer, the Altair

x86 Evolution-2
8086
Much more powerful (16-bit data)
Instruction cache: pre-fetches a few instructions
The 8088 (8-bit external bus) was used in the first IBM PC

80286
16 MBytes of memory addressable
Up from 1 MB (in the 8086)

80386
32 bit processor with multitasking support

x86 Evolution-3
80486
Sophisticated, powerful cache and instruction pipelining
Built-in math co-processor

Pentium
Superscalar: multiple instructions executed in parallel

Pentium Pro
Increased superscalar organization
Aggressive register renaming
Branch prediction and data flow analysis

x86 Evolution-4
Pentium II
MMX technology: graphics, video & audio processing

Pentium III
Additional floating point instructions for 3D graphics

Pentium 4
Further floating point and multimedia enhancements

Itanium Series
64 bit with Hardware enhancements to increase speed

What's next?
Multi-core architectures

Summary
Architecture vs. Organization
Functional and Structural View of a Computer
History of Computers
Intel Architecture Evolution


Review Questions
Differentiate between computer organization and architecture.
What are the four main functions of a computer?
What are the basic structural components of a computer?
Describe the computer generations in brief.
What was the first general-purpose microprocessor?

Thank You!



Module-1 (Lecture-2)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Concept of Computer Program and Instruction, Internal Structure of CPU, Instruction Execution Cycle With and Without Interrupt

Addition of Two Numbers


Start
Load R1, A
Load R2, B
R3 = R1 + R2
Store C, R3
End of program


What is a Computer Program?


A logical group of instructions/statements to achieve some specific task is called a program
An instruction is a sequence of steps
In each step, an arithmetic or logical operation or a data movement is done
For each operation, one or more control signals are activated

What is required from CPU?


Fetch instructions
Interpret instructions
Fetch data
Process data
Write/store data

To do these tasks, the processor needs:
Temporary storage
An interconnection structure between the various components (i.e. registers, ALU, memory, I/O)
An ALU


CPU and Interconnections


Internal Structure of CPU


Instruction Cycle
Two steps:
Fetch
Execute


Fetch Cycle
The Program Counter (PC) holds the address of the next instruction to fetch
The processor fetches the instruction from the memory location pointed to by the PC
The PC is then incremented
Unless told otherwise (e.g. by a jump)

The instruction is loaded into the Instruction Register (IR)
The processor interprets the instruction and performs the required actions

Execute Cycle
Data transfer between CPU and main memory
Data transfer between CPU and I/O module
Some arithmetic or logical operation on data
Alteration of the sequence of operations
e.g. jump

Combination of above


How is an Instruction Represented?


Example of Program Execution


<Opcode> = <Operation>:
0001 = Load AC from memory
0010 = Store AC to memory
0101 = Add to AC from memory

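The three opcodes above are enough to run Stallings' classic three-instruction fragment (program at 300h, operands at 940h and 941h). Below is a minimal sketch of the fetch-execute loop in Python; the dictionary-as-memory layout and the halt-on-unknown-address rule are our own simplifying assumptions:

```python
# Toy simulator for the hypothetical accumulator machine in the slides:
# 16-bit instructions = 4-bit opcode + 12-bit address.
LOAD, STORE, ADD = 0b0001, 0b0010, 0b0101

def run(memory, pc):
    ac = 0
    while pc in memory:            # halt when PC leaves the program (assumption)
        ir = memory[pc]            # fetch: instruction into IR
        pc += 1                    # increment PC
        opcode, addr = ir >> 12, ir & 0xFFF
        if opcode == LOAD:         # 0001: load AC from memory
            ac = memory[addr]
        elif opcode == ADD:        # 0101: add memory word to AC
            ac = (ac + memory[addr]) & 0xFFFF
        elif opcode == STORE:      # 0010: store AC to memory
            memory[addr] = ac
        else:
            break                  # unknown opcode: halt
    return memory

mem = {
    0x300: 0x1940,  # LOAD  AC, [940]
    0x301: 0x5941,  # ADD   AC, [941]
    0x302: 0x2941,  # STORE [941], AC
    0x940: 0x0003,
    0x941: 0x0002,
}
run(mem, 0x300)
print(hex(mem[0x941]))  # 0x5, i.e. 3 + 2
```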

Instruction Cycle State Diagram


Interrupts
A mechanism by which other modules (e.g. I/O) may interrupt the normal sequence of processing. Interrupt classes:
Program
e.g. generated by arithmetic overflow or division by zero

Timer
Generated by internal processor timer Used in pre-emptive multi-tasking

I/O
From I/O controller

Hardware failure
e.g. Memory parity error

Interrupt Cycle


Instruction Cycle State Diagram with Interrupts


Summary
Concept of Computer Program and Instruction
Internal Structure of CPU
Relationship between CPU register sizes and main memory
Instruction Execution Cycle
Interrupt and Interrupt Cycle
Instruction Execution Cycle with Interrupt

Review Questions
Which of the instruction cycle states require main memory access?
What is context switching?
Assume the main memory address space is 2^16 and each location contains 1 byte of data. Find the minimum possible size required for registers such as the PC, MAR, MBR, and IR.
Repeat the above question for an address space of 2^32 and an addressability of 16 bits. Also calculate the total size of main memory in GBytes.

Thank You!



Module-1 (Lecture-3)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings, Computer Organization and Design , 4th Ed. By Patterson] Topics: Computer Performance Assessment, CPI, MIPS Rate, Benchmark Programs

Computer Performance
Performance is one of the key parameters used to:
Evaluate processor hardware
Measure requirements for new systems

When we say one computer has better performance than another, what does it mean?
What is the criterion for performance?
Response time (for a single computer), throughput (for a data center)

Clock Rate
Operations performed by the processor are governed by the system clock, the fundamental level of processor speed measurement
The clock is generated by a quartz crystal


A clock cycle is the basic unit of time in which operations are executed. The clock rate (clock cycles per second, in MHz or GHz) is the inverse of the clock cycle time (the clock period)


Is Clock Rate Enough?


As we know, instruction execution takes several discrete steps
Fetching, decoding, ALU operation, fetching data, etc.
An instruction takes multiple clock cycles to execute

Different instructions take different numbers of cycles to execute
LOAD, ADD, SUB, JUMP, etc.

Thus clock speed doesn't tell the whole story!



Performance: Application Specific


Performance here means how a processor performs when executing a given application
Application performance depends upon:
Speed of the processor
Instruction set
Choice of language
Efficiency of the compiler
Programming skill of the user

CPU Performance
To maximize performance, we need to minimize execution time
performance = 1 / execution_time
If X is n times faster than Y, then

performance_X / performance_Y = execution_time_Y / execution_time_X = n


Cycles Per Instruction (CPI)


For any given processor, the number of cycles required varies for different types of instructions
e.g. load, store, branch, add, mul, etc.

Hence CPI is not a constant value for a processor
We need to calculate an average CPI for the processor


CPU Performance and its Factors


Program execution time: T = Ic x CPI x t
where t = cycle time and Ic = number of instructions in the program

Instruction execution also requires memory access

T = Ic x [p + (m x k)] x t
where p = processor cycles per instruction, m = memory references per instruction,
and k = ratio of memory cycle time to processor cycle time

These performance factors are influenced by
the ISA, the compiler, the processor implementation, and the memory hierarchy

MIPS Rate
A common measure of performance:
Millions of Instructions Per Second (MIPS) rate = Ic / (T x 10^6)
This can also be written as f / (CPI x 10^6)


Example
Processor speed = 400 MHz
Four types of instructions, with CPI 1, 2, 4, 8 respectively
Instruction mix of 60%, 18%, 12%, 10% respectively
What is the MIPS rate?

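One way to work the example above (the variable names are our own): weight each instruction class's CPI by its frequency to get the average CPI, then apply MIPS = f / (CPI x 10^6):

```python
# Sketch of the MIPS-rate calculation for the example above.
clock_hz = 400e6                        # 400 MHz processor
cpi = [1, 2, 4, 8]                      # cycles per instruction, by type
mix = [0.60, 0.18, 0.12, 0.10]          # instruction mix (fractions)

avg_cpi = sum(c * f for c, f in zip(cpi, mix))   # weighted average CPI
mips = clock_hz / (avg_cpi * 1e6)
print(round(avg_cpi, 2), round(mips, 1))  # 2.24 cycles/instr, ~178.6 MIPS
```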

Limitation of MIPS Rate


The MIPS rate (instruction execution rate) is also inadequate as a measure of CPU performance. Why?
Because of differences between ISAs
Ex.: executing a high-level language statement A = B + C (where A, B and C are in memory) may need different numbers of low-level instructions on different ISAs


Benchmark Programs
A benchmark is a collection of programs that provides a representative test of a computer in a particular application area
e.g. the SPEC (System Performance Evaluation Corporation) benchmark suites
SPEC CPU 2006 is used for measuring the performance of computation-based applications


Summary
Computer Performance Assessment
Performance Factors: Execution Time = No. of Instructions in program x Clock cycles per instruction x Clock cycle time
MIPS Rate
Benchmark Programs

Review Questions
Consider two implementations of the same ISA. Computer A has a clock cycle time of 250 ns and a CPI of 2 for a program. Computer B has a clock cycle time of 500 ns and a CPI of 1.2 for the same program. Which computer is faster for this program, and by how much?
A program runs on computer A, with a 2 GHz clock, in 10 seconds. Another computer, B, with a 4 GHz clock, runs this program in 6 seconds. To accomplish this, computer B requires P times as many clock cycles as computer A to run the program. Find the value of P.
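For the first review question, a quick sanity check (our own sketch, not part of the slides): time per instruction is CPI x clock cycle time, so the machine with the smaller product is faster.

```python
# Per-instruction time = CPI x cycle time; smaller is faster.
time_a = 2 * 250e-9     # computer A: CPI 2, 250 ns cycle -> 500 ns/instr
time_b = 1.2 * 500e-9   # computer B: CPI 1.2, 500 ns cycle -> 600 ns/instr
print(time_b / time_a)  # A is faster, by a factor of 1.2
```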

Thank You!



Module-1 (Lecture-4)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Computer System Modules and Interconnections; Concept of BUS: Types, Arbitration, Timing; PCI BUS Example

Interconnection
All the units must be connected
Different types of connection for different types of unit

Memory connections:
Receives and sends data
Receives addresses (of locations)
Receives control signals


CPU Connections
Reads instructions and data
Writes out data (after processing)
Sends control signals to other units
Receives (& acts on) interrupts


I/O Module Connection


Similar to memory from the computer's viewpoint
Input/Output:
Receives data from peripheral/computer and sends data to computer/peripheral
Receives/sends control signals from/to computer/peripheral
e.g. spin disk, interrupt
Receives addresses from the computer
e.g. a port number to identify the peripheral

What is a Bus?
A communication pathway connecting two or more devices
What do buses look like?
Parallel lines on circuit boards
Ribbon cables
Strip connectors on motherboards
Sets of wires

A bus usually broadcasts the data



Data Bus and Address Bus


The Data Bus carries data
Remember that there is no difference between data and instructions at this level
Width is a key determinant of performance: 8, 16, 32, 64 bits

The Address Bus identifies the source or destination of data
e.g. the CPU needs to read an instruction (data) from a given location in memory

Address bus width determines the maximum memory capacity of the system
e.g. the 8080, with a 16-bit address bus, has a 64K address space (the 8086's address bus is 20 bits, giving 1 MB)

Control Bus
Control and timing information:
Memory read/write signals
Interrupt requests
Clock signals

Bus Interconnection Scheme


Single Bus Problems


Lots of devices on one bus leads to:
More propagation delay, due to increased bus length
Co-ordination of bus use can adversely affect performance
Once aggregate data transfer approaches bus capacity, the bus may become a bottleneck

Most systems use multiple buses to overcome these problems



Traditional Multiple Bus Architecture


High Performance Bus


Bus Design Issues: Bus Types


Dedicated
Separate data & address lines

Multiplexed
Shared lines
An "address valid" or "data valid" control line
Advantage: fewer lines
Disadvantages: more complex control, reduced ultimate performance

Bus Arbitration
More than one module may control the bus
e.g. CPU and DMA controller
Only one module may control the bus at a time
Arbitration may be:

Centralized
A single hardware device controls bus access

Distributed
Each module may claim the bus
Control logic on all modules

Centralized BUS Arbiter


Timing: Co-ordination of events on bus


Synchronous
Events are determined by clock signals, synchronized on the leading edge of the clock
All devices can read the clock line
Usually a single cycle per event

Asynchronous
The occurrence of one event on the bus follows, and depends on, the occurrence of a previous event
Events on the bus are not synchronized with a clock

Synchronous Timing Diagram


Asynchronous Timing - Read Operation


Asynchronous Timing - Write Operation


Advantages Over Synchronous Bus


Synchronization of sender and receiver clocks is not needed
Delays are accommodated (allowing a mixture of slow and fast devices)
More flexibility and reliability
Is there any drawback of an asynchronous bus?
Yes: a malfunctioning device can stall the handshake

PCI Read Timing Diagram


a) The transaction begins when the master asserts FRAME
b) The target device recognizes its address on the AD lines
c) The master ceases driving the AD bus; the initiator asserts IRDY to indicate it is ready for data
d) The target asserts DEVSEL to indicate it has recognized its address
e) The master reads the data at the beginning of the 4th cycle and changes the byte enable lines as needed
f) The target deasserts TRDY (it needs time to prepare the next block of data)
g) The target places the 3rd data item on the bus in the 6th cycle, but the master is not ready (deasserts IRDY)
h) The master deasserts FRAME to indicate the last data transfer
i) The master deasserts IRDY (the bus goes to the idle state); the target deasserts TRDY and DEVSEL

Exercise-1
Consider a 64-bit microprocessor with 64-bit instructions composed of two fields: the first two bytes contain the opcode, and the remainder the immediate operand or an operand address.
What is the impact on system speed if the microprocessor bus has:
a) a 64-bit local address bus and a 32-bit local data bus?
b) a 32-bit local address bus and a 32-bit local data bus?
c) What is the maximum directly addressable memory?

Exercise-2
Consider a 32-bit microprocessor with a 16-bit external data bus, driven by an 8 MHz input clock. The microprocessor has a bus cycle whose minimum duration equals four input clock cycles.
What is the maximum data transfer rate across the bus in bytes/sec?
To increase performance, should we (a) make the external data bus 32 bits, or (b) double the external clock frequency supplied to the processor? Which is the better option? Explain.
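A sketch of the first part of Exercise-2 (variable names are ours): one bus cycle takes four input clocks and moves 16 bits (2 bytes).

```python
# Maximum data transfer rate: 2 bytes per minimum-length bus cycle.
clock_hz = 8e6                 # 8 MHz input clock
bus_cycle_s = 4 / clock_hz     # minimum bus cycle = 4 input clock cycles
rate = 2 / bus_cycle_s         # 16 bits = 2 bytes per bus cycle
print(rate)                    # 4,000,000 bytes/sec
# Either widening the data bus to 32 bits or doubling the clock doubles this
# in principle; doubling the clock is harder (memory must keep up).
```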

Summary
Bus as an Interconnection Structure
Bus Design Issues
Synchronous and Asynchronous Bus Operations
PCI Bus Operation


Thank You!


Computer Arithmetic
BITS Pilani
Pilani Campus

S Mohan


Computer Arithmetic
Number Representations
Integer (fixed-point) representation
Floating-point representation

Arithmetic
Integer arithmetic
Floating-point arithmetic


Arithmetic & Logic Unit


Does the calculations
Everything else in the computer is there to service this unit
Handles integers
May handle floating point (real) numbers
There may be a separate FPU (math co-processor)
The FPU may be on-chip (486DX and later)

Number Representation-1
Unsigned representation
We only have 0 & 1 to represent everything
Positive numbers are stored in binary
e.g. 41 = 00101001

No minus sign
Scope is limited!


Number Representation-2
Sign-Magnitude

The left-most bit is the sign bit: 0 means positive, 1 means negative
+18 = 00010010
-18 = 10010010
Problems:
Arithmetic does not work as we would want
One bit pattern is wasted (both +0 and -0 exist)

Number Representation-3
Two's Complement
The most significant bit is treated as a sign bit: 0 for positive and 1 for negative
Positive numbers are represented the same as in sign-magnitude representation
Negative numbers are represented in 2's-complement form
Using n bits, the range of numbers is

-2^(n-1) to 2^(n-1) - 1
Arithmetic works as we want!

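A minimal sketch of the representation in Python, using integer masking to mimic an n-bit register (the helper names are our own):

```python
# n-bit two's-complement encode/decode.
def to_twos(x, n=8):
    return x & ((1 << n) - 1)          # wrap value into n bits

def from_twos(b, n=8):
    # if the MSB (sign bit) is set, the value is b - 2^n
    return b - (1 << n) if b >> (n - 1) else b

print(format(to_twos(-18), '08b'))   # 11101110
print(from_twos(0b11101110))         # -18
print(from_twos(0b01111111), from_twos(0b10000000))  # 127 -128 (range edges)
```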

Conversion Between Lengths


Positive numbers: pack with leading zeros
+18 = 00010010
+18 = 00000000 00010010

Negative numbers: pack with leading ones (in two's complement)
-18 = 11101110
-18 = 11111111 11101110

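Sign extension can be sketched directly on the bit patterns (a hypothetical helper, assuming two's-complement inputs):

```python
def sign_extend(b, n, m):
    """Extend an n-bit two's-complement value b to m bits."""
    if b >> (n - 1):                          # negative: pack with leading ones
        b |= ((1 << (m - n)) - 1) << n
    return b                                  # positive: leading zeros (no-op)

print(format(sign_extend(0b00010010, 8, 16), '016b'))  # +18 extended
print(format(sign_extend(0b11101110, 8, 16), '016b'))  # -18 extended
```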

Addition and Subtraction


Normal binary addition
Take the two's complement of the subtrahend and add it to the minuend
i.e. a - b = a + (-b)

Hardware implementation is simple
Only addition and complement circuits are required

Monitor the sign bits to detect overflow



Hardware for Addition and Subtraction


Multiplication
Complex!
Work out the partial product for each digit
Take care with place value (column)
Add the partial products
Think about the hardware implementation: what changes are required in the above manual approach for computerization?
Keep a running addition of the partial products
A few registers should be used
For each 1 in the multiplier, an add and shift-right operation is required
For each 0, only a shift operation is required
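The add-and-shift scheme above can be sketched in a few lines of Python (a software illustration, not the register-level circuit):

```python
# Unsigned shift-and-add multiplication: for each 1 bit of the multiplier,
# add the (progressively shifted) multiplicand into a running product.
def shift_add_mul(multiplicand, multiplier):
    product = 0
    while multiplier:
        if multiplier & 1:           # multiplier bit is 1: add, then shift
            product += multiplicand
        multiplicand <<= 1           # bit is 0: shift only
        multiplier >>= 1
    return product

print(shift_add_mul(13, 11))  # 143
```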

Unsigned Binary Multiplication


Example


Unsigned Binary Multiplication


Multiplying Negative Numbers


Will the unsigned scheme work? Try (+5) * (-3) = ???
Solution 1:
Convert to positive numbers
Multiply as in the unsigned method
If the signs were different, take the 2's complement of the result

Solution 2:
Booth's Algorithm

Booth's Algorithm Principle-1

The multiplicand M is unchanged
Based upon recoding the multiplier Q to a recoded value R
Each digit can assume negative as well as positive and zero values
Known as Signed-Digit (SD) encoding

Booth's Algorithm Principle-2

Booth's algorithm is called "skipping over ones"
A string of 1s is replaced by 0s:
For example: 30 = 0011110
= 32 - 2 = 0100000 - 0000010
In the recoded (SD) form: 0 1 0 0 0 -1 0

Booth's Algorithm Principle-3

Booth recoding procedure:
Working from LSB to MSB, retain each 0 until a 1 is reached
When a 1 is encountered, insert -1 at that position and complement all the succeeding 1s until a 0 is encountered
Replace that 0 with 1 and continue
When multiplying by a -1 digit, the 2's complement of the multiplicand is added

Flow Chart: Booth's Algorithm


Example: Booth's Multiplication

M = 0111 (7), multiplier = 1101 (-3):

A    Q    Q-1  Action
0000 1101 0    initial values
1001 1101 0    A <- A - M
1100 1110 1    shift
0011 1110 1    A <- A + M
0001 1111 0    shift
1010 1111 0    A <- A - M
1101 0111 1    shift
1110 1011 1    shift

Result (A,Q) = 1110 1011 = -21
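The A/Q/Q-1 trace above can be reproduced with a short Python sketch of Booth's algorithm on 4-bit operands (the masking idiom for registers is our own):

```python
# Booth's algorithm: examine (Q0, Q-1), add/subtract M, then arithmetic
# shift right the combined A,Q,Q-1 register pair.
def booth_mul(m, q, n=4):
    a, q_1 = 0, 0
    mask = (1 << n) - 1
    for _ in range(n):
        pair = (q & 1, q_1)
        if pair == (1, 0):
            a = (a - m) & mask               # A <- A - M
        elif pair == (0, 1):
            a = (a + m) & mask               # A <- A + M
        # arithmetic shift right across A, Q, Q-1
        q_1 = q & 1
        q = ((q >> 1) | ((a & 1) << (n - 1))) & mask
        a = ((a >> 1) | (a & (1 << (n - 1)))) & mask
    result = (a << n) | q                    # 2n-bit two's-complement product
    return result - (1 << 2 * n) if result >> (2 * n - 1) else result

print(booth_mul(0b0111, 0b1101))  # -21, i.e. 7 x (-3)
```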


Division of Unsigned Binary Integers

More complex than multiplication
Negative numbers are really bad!
Based on long division, e.g. 10010011 / 1011 (147 / 11):

                00001101    Quotient
    Divisor  ----------
       1011 | 10010011      Dividend
                1011
                ------
                001110      Partial
                1011        remainders
                ------
                001111
                1011
                ------
                100         Remainder


Algorithm
The bits of the dividend are examined from left to right, until the set of bits examined represents a number greater than or equal to the divisor
Until this event occurs, 0s are placed in the quotient
When the event occurs, a 1 is placed in the quotient and the divisor is subtracted from the partial dividend
This continues in a cyclic pattern
The process stops when all the bits of the dividend are exhausted
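The left-to-right scan described above can be sketched in Python (a software illustration of the restoring scheme, not the register-level flowchart):

```python
# Unsigned long division: shift dividend bits into a partial remainder,
# subtracting the divisor whenever it fits and recording quotient bits.
def unsigned_div(dividend, divisor, n=8):
    rem, quo = 0, 0
    for i in range(n - 1, -1, -1):
        rem = (rem << 1) | ((dividend >> i) & 1)   # bring down the next bit
        quo <<= 1
        if rem >= divisor:                          # divisor fits: quotient bit 1
            rem -= divisor
            quo |= 1
    return quo, rem

print(unsigned_div(0b10010011, 0b1011))  # (13, 4): 147 / 11
```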

Flowchart for Unsigned Binary Division


Division Algorithm for Signed Integers


1. Load the divisor into the M register and the dividend into the A,Q registers
2. Shift A,Q left 1 bit position
3. If M and A have the same signs, do A = A - M; otherwise A = A + M
4. a) If the preceding operation is successful (the sign of A is unchanged) or A = 0, then set Q0 = 1
   b) If the operation is unsuccessful and A <> 0, then set Q0 = 0 and restore the previous value of A
5. Repeat steps 2 through 4 as many times as there are bit positions in Q
6. The remainder is in A and the quotient is in Q

Floating Point
We need a way to represent:
numbers with fractions, e.g., 3.1416
very small numbers (in absolute value), e.g., .00000000023
very large numbers (in absolute value), e.g., 3.15576 * 10^46

Representation:
scientific: sign, exponent, significand, with a binary point:
(-1)^sign * significand * 2^exponent, e.g., 101.001101 * 2^111001

More bits for the significand gives more accuracy; more bits for the exponent increases range
If 1 <= significand < 10_two (= 2_ten), then the number is normalized; the exception is the number 0, which is normalized to significand 0
E.g., 101.001101 * 2^111001 = 1.01001101 * 2^111011 (normalized)


IEEE 754 Floating-point Standard


IEEE 754 floating point standard:
single precision: one word
bit 31: sign
bits 30 to 23: 8-bit exponent
bits 22 to 0: 23-bit significand

double precision: two words
bit 31: sign
bits 30 to 20: 11-bit exponent
bits 19 to 0: upper 20 bits of the 52-bit significand
bits 31 to 0 (second word): lower 32 bits of the 52-bit significand


IEEE 754 Floating Point Representation


Single precision: 4 bytes
Double precision: 8 bytes
Extended double: 10 bytes
Quadruple precision: 16 bytes


IEEE 754 Floating-point Standard


The sign bit is 0 for positive numbers, 1 for negative numbers
The number is assumed normalized, and the leading 1 bit of the significand (left of the binary point) is assumed and not shown (for non-zero numbers)
e.g., significand 1.1001 is represented as 1001; the exception is the number 0, which is represented as all 0s
for other numbers: value = (-1)^sign * (1 + significand) * 2^(exponent value)

The exponent is biased to make sorting easier
all 0s is the smallest exponent, all 1s is the largest
the bias is 127 for single precision and 1023 for double precision
the exponent field minus the bias equals the exponent value
therefore, for non-0 numbers: value = (-1)^sign * (1 + significand) * 2^(exponent - bias)


IEEE 754 Floating-point Standard


Special treatment of 0:
if the exponent is all 0s and the significand is all 0s, then the value is 0 (the sign bit may be 0 or 1)

Example: represent -0.75ten in IEEE 754 single precision
decimal: -0.75 = -3/4 = -3/2^2
binary: -11/100 = -.11 = -1.1 x 2^-1
IEEE single precision floating point exponent = bias + exponent value = 127 + (-1) = 126ten = 01111110two
IEEE single precision: 1 01111110 10000000000000000000000
(sign | exponent | significand)

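The example can be cross-checked against a real IEEE 754 implementation using Python's struct module (our own sanity check, not part of the slides):

```python
# Pack -0.75 as a big-endian IEEE 754 single and inspect the bit pattern.
import struct

bits, = struct.unpack('>I', struct.pack('>f', -0.75))
print(format(bits, '032b'))
# sign 1, biased exponent 01111110 (126), fraction 100...0
```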

Floating-Point Example
What number is represented by the single-precision float
1 10000001 01000000000000000000000?
S = 1
Fraction = 01000...00_2
Exponent = 10000001_2 = 129

x = (-1)^1 * (1 + 0.01_2) * 2^(129 - 127)
  = (-1) * 1.25 * 2^2
  = -5.0
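The decoding can be confirmed with struct as well (assuming the bit pattern above is the 32-bit word 0xC0A00000):

```python
# Reinterpret the 32-bit pattern 0xC0A00000 as an IEEE 754 single.
import struct

value, = struct.unpack('>f', (0xC0A00000).to_bytes(4, 'big'))
print(value)  # -5.0
```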

IEEE 754 Standard Encoding


Single Precision        Double Precision        Object Represented
Exponent  Fraction      Exponent  Fraction
0         0             0         0             0 (zero)
0         Non-zero      0         Non-zero      Denormalized number
1-254     Anything      1-2046    Anything      Floating-point number
255       0             2047      0             Infinity
255       Non-zero      2047      Non-zero      NaN (Not a Number)

NaN: (infinity - infinity), or 0/0

Denormalized number = (-1)^sign * 0.f * 2^(1-bias)



Single-Precision Range
Exponents 00000000 and 11111111 are reserved
Smallest value:
Exponent: 00000001, actual exponent = 1 - 127 = -126
Fraction: 000...00, significand = 1.0
± 1.0 × 2^-126 ≈ ± 1.2 × 10^-38

Largest value:
Exponent: 11111110, actual exponent = 254 - 127 = +127
Fraction: 111...11, significand ≈ 2.0
± 2.0 × 2^+127 ≈ ± 3.4 × 10^+38

Double-Precision Range
Exponents 00000000000 and 11111111111 are reserved
Smallest value:
Exponent: 00000000001, actual exponent = 1 - 1023 = -1022
Fraction: 000...00, significand = 1.0
± 1.0 × 2^-1022 ≈ ± 2.2 × 10^-308

Largest value:
Exponent: 11111111110, actual exponent = 2046 - 1023 = +1023
Fraction: 111...11, significand ≈ 2.0
± 2.0 × 2^+1023 ≈ ± 1.8 × 10^+308

Floating point addition


Make both exponents the same
Find the number with the smaller one
Shift its mantissa to the right until the exponents match
Must include the implicit 1 (the 1 in 1.M)

Add the mantissas
Choose the larger exponent
Put the result in normalized form
Shift the mantissa left or right until it is in the form 1.M
Adjust the exponent accordingly

Handle overflow or underflow if necessary
Round
Renormalize if necessary, if rounding produced an unnormalized result

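The steps above can be sketched with Python's math.frexp/ldexp, which expose a float's mantissa (in [0.5, 1)) and exponent; here ldexp stands in for the normalize/round hardware (an illustration under that assumption, not the 1.M circuit):

```python
import math

def fp_add(x, y):
    mx, ex = math.frexp(x)             # x = mx * 2**ex
    my, ey = math.frexp(y)
    e = max(ex, ey)                    # 1. choose the larger exponent
    mx /= 2 ** (e - ex)                #    shift the smaller mantissa right
    my /= 2 ** (e - ey)
    m = mx + my                        # 2. add the mantissas
    return math.ldexp(m, e)            # 3. ldexp renormalizes the result

print(fp_add(1.5, 2.25))  # 3.75
```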

Floating point addition


Algorithm


Floating point addition example


Initial values
1 00000001 S E 000001100 M

0 00000011 S E

010000111 M


Floating point addition example


Identify smaller E and calculate E difference
1 00000001 S E 000001100 M

difference = 2
0 00000011 S E 010000111 M


Floating point addition example


Shift smaller M right by E difference
1 00000011 S E 010000011 M

0 00000011 S E

010000111 M


Floating point addition example


Add mantissas
1 00000011 S E 010000011 M

0 00000011 S E

010000111 M

-0.010000011 + 1.010000111 = 1.000000100  (operand 1 is negative, S = 1)


0 S E 000000100 M

Floating point addition example


Normalize the result by shifting (already normalized)
0 00000011
S E

000000100
M


Floating point addition example


Final answer
1 00000001 S E 000001100 M

0 00000011 S E

010000111 M

0 00000011 S E

000000100 M

Hardware design
determine smaller exponent

Floating point addition


Hardware design

Floating point addition

shift mantissa of smaller number right by exponent difference


Hardware design

Floating point addition

add mantissas


Hardware design

Floating point addition

normalize result by shifting mantissa of result


Hardware design

Floating point addition

round result


Hardware design

Floating point addition

renormalize if necessary

FP Adder Hardware
Much more complex than an integer adder
Doing it in one clock cycle would take too long
Much longer than integer operations
A slower clock would penalize all instructions

An FP adder usually takes several cycles
It can be pipelined


Floating point multiply


Add the exponents and subtract the bias from the sum
Example: (5+127) + (2+127) - 127 = 7+127

Multiply the mantissas
Put the result in normalized form
Shift the mantissa left or right until it is in the form 1.M
Adjust the exponent accordingly

Handle overflow or underflow if necessary
Round
Renormalize if necessary, if rounding produced an unnormalized result
Set S=0 if the signs of both operands are the same, S=1 otherwise
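The biased-exponent bookkeeping in step 1 can be checked with a tiny sketch (variable names are ours): adding two biased exponents double-counts the bias, so it is subtracted once.

```python
# Exponent handling for FP multiply with a biased representation.
BIAS = 127
e1, e2 = 5 + BIAS, 2 + BIAS       # biased exponents of the operands
e_prod = e1 + e2 - BIAS           # subtract the double-counted bias
print(e_prod, e_prod - BIAS)      # 134 (biased), 7 (actual exponent)
```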

Floating point multiply


Algorithm


Floating point multiply example


Initial values
1 00000111 S E 100000000 M

-1.5 x 2^(7-127)

0 11100000 S E

100000000 M

1.5 x 2^(224-127)


Floating point multiply example


Add exponents
1 00000111 S E 100000000 M

-1.5 x 2^(7-127)

0 11100000 S E

100000000 M

1.5 x 2^(224-127)

00000111 +11100000 = 11100111 (231)


Floating point multiply example


Subtract bias
1 00000111 S E 100000000 M

-1.5 x 2^(7-127)

0 11100000 S E

100000000 M

1.5 x 2^(224-127)

11100111 (231) 01111111 (127) = 01101000 (104)

01101000 S E M

Floating point multiply example


Multiply the mantissas
1 00000111 S E 100000000 M

-1.5 x 2^(7-127)

0 11100000 S E

100000000 M

1.5 x 2^(224-127)

1.1000 x 1.1000 = 10.01000

01101000 S E M

Floating point multiply example


Normalize by shifting 1.M right one position and adding one to E
1 00000111 S E 100000000 M

-1.5 x 2^(7-127)

0 11100000 S E

100000000 M

1.5 x 2^(224-127)

10.01000 => 1.001000

01101001 S E

001000 M

Floating point multiply example


Set S=1 since signs are different
1 00000111 S E 100000000 M

-1.5 x 2^(7-127)

0 11100000 S E

100000000 M

1.5 x 2^(224-127)

1 01101001 S E

001000 M

-1.125 x 2^(105-127)


FP Arithmetic Hardware
An FP multiplier is of similar complexity to an FP adder
But it uses a multiplier for the significands instead of an adder

FP arithmetic hardware usually does:
Addition, subtraction, multiplication, division, reciprocal, square root
FP <-> integer conversion

Operations usually take several cycles
They can be pipelined
BITS Pilani, Pilani Campus

Floating Point Complexities

In addition to overflow we can have underflow (number too small). Accuracy is a problem with both, because we have only a finite number of bits to represent numbers that may actually require arbitrarily many bits:
Limited precision -> rounding -> rounding error
IEEE 754 keeps two extra bits, guard and round
Four rounding modes
A positive number divided by zero yields infinity
Zero divided by zero yields NaN (not a number)

other complexities

Implementing the standard can be tricky


BITS Pilani, Pilani Campus

Rounding
FP arithmetic operations may produce a result with more digits than can be represented in 1.M
The result must be rounded to fit into the available number of M positions
Tradeoff: hardware cost and speed (keeping extra bits) versus accumulated rounding error
BITS Pilani, Pilani Campus

Rounding
Guard and round digits for intermediate addition:
2.56 x 10^0 + 2.34 x 10^2 = 0.0256 x 10^2 + 2.34 x 10^2 = 2.3656 x 10^2
  5: guard digit, 6: round digit
  00-49: round down, 51-99: round up, 50: tie-break
Result: 2.37 x 10^2
Without guard and round digits:
0.02 x 10^2 + 2.34 x 10^2 = 2.36 x 10^2

BITS Pilani, Pilani Campus
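Python's decimal module makes the effect easy to reproduce: the intermediate sum is computed with its extra digits before the final rounding to the working precision. This is a sketch mirroring the slide's decimal example at 3 significant digits:

```python
from decimal import Decimal, getcontext

getcontext().prec = 3                  # work with 3 significant digits

# With guard/round digits: the exact intermediate 236.56 is kept, then
# the result is rounded once to 3 digits.
with_guard = Decimal('2.56') + Decimal('234')

# Without guard/round digits: the smaller operand is truncated to 0.02E2
# during alignment, so the carry from .56 is lost.
without_guard = Decimal('0.02E2') + Decimal('2.34E2')

print(with_guard, without_guard)   # 237 236
```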

Rounding
In binary, an extra bit of 1 is halfway in between the two possible representations
1.001 (1.125) is halfway between 1.00 (1) and 1.01 (1.25) 1.101 (1.625) is halfway between 1.10 (1.5) and 1.11 (1.75)

BITS Pilani, Pilani Campus

IEEE 754 rounding modes


Truncate
1.00100 -> 1.00

Round up to the next value


1.00100 -> 1.01

Round down to the previous value


1.00100 -> 1.00

Round-to-nearest-even
Rounds to the even value (the one with an LSB of 0)
1.00100 -> 1.00
1.01100 -> 1.10
Produces zero average bias
BITS Pilani, Pilani Campus
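Round-to-nearest-even is also what Python's built-in round() does for halfway cases ("banker's rounding"), which makes the zero-average-bias behavior easy to observe:

```python
# Exact halfway cases go to the neighbor whose last digit is even:
print(round(0.5), round(1.5), round(2.5))   # 0 2 2

# Float arithmetic itself rounds to nearest-even on every operation;
# 0.1 and 0.2 have no exact binary form, so the rounded sum differs
# from the rounded literal 0.3:
print(0.1 + 0.2 == 0.3)                     # False
```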

Instruction Set Architecture


BITS Pilani
Pilani Campus

S Mohan

BITS Pilani, Pilani Campus

8086 Intel Microprocessor Architecture

142
BITS Pilani, Pilani Campus

8086 Memory
Memory is byte-addressable. The original 8086 had a 20-bit address bus and could address just 1 MB of main memory; this mode of operation is called Real Addressing Mode. Newer CPUs can access 64 GB of main memory using 36-bit addresses. A word in the 8086 world is 16 bits; a 32-bit quantity is called a double word.
143
BITS Pilani, Pilani Campus
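In real mode the 20-bit physical address is formed by shifting the 16-bit segment value left 4 bits (multiplying by 16) and adding the 16-bit offset. A one-line sketch, using the DS=1120h, SI=2145h values from a later example slide:

```python
def phys(seg, off):
    # real-mode 8086: physical = segment * 16 + offset, truncated to 20 bits
    return ((seg << 4) + off) & 0xFFFFF

print(hex(phys(0x1120, 0x2145)))   # 0x13345
```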

Microprocessor Registers
Visible registers
Addressable during application programming

Invisible registers
Addressable in system programming

8086, 8088 and 80286 have a 16-bit internal architecture
80386 through Pentium 4 have a 32-bit internal architecture

144
BITS Pilani, Pilani Campus

Multipurpose Registers
AX- Accumulator
Addressable as AX, AH, or AL Mainly used for multiplication, division

BX- Base Index


Sometimes holds the offset address of a location

CX Count
Holds count for instructions

DX- Data
Holds a part of the result
145
BITS Pilani, Pilani Campus

Multipurpose Registers(2)
BP- Base Pointer
Points to a memory location for memory data transfers.

DI- Destination Index


Addresses string destination data for the string instructions.

SI- Source Index


Addresses string source data for the string instructions.
146
BITS Pilani, Pilani Campus

Special Purpose Registers


IP- Instruction Pointer
Addresses the next instruction

SP- Stack Pointer


Addresses area of memory called stack

FLAGS
For controlling the microprocessor operation Not modified for data transfer or program control operation

147
BITS Pilani, Pilani Campus

Segment Registers[1]
CS-Code Segment
Contains programs and procedures. Defines the starting address of a section of memory; in real mode, it defines the start of a 64 KB section of memory

DS- Data Segment


Contains data used by the program Length is limited to 64Kb in 8086-80286

148
First Semester 2010-2011

BITS Pilani, Pilani Campus

Segment Registers[2]
ES- Extra Segment
An additional data segment used by string instructions

SS- Stack
Memory used for the stack

149

BITS Pilani, Pilani Campus

What is an Instruction Set?


The complete collection of instructions that are understood by a CPU
Machine code (binary): each instruction has a unique bit pattern
For human consumption a symbolic representation (assembly code) is used
e.g. ADD, SUB, LOAD

Operands can be represented in this way


ADD A, B

BITS Pilani, Pilani Campus

Elements of an Instruction
Operation code (Op code)
Do this!

Source Operand reference


To this!

Result/Destination Operand reference


Put the answer here!

Next Instruction Reference


When you have done that, do this...
BITS Pilani, Pilani Campus

8086 Assembly Instruction Format


General Instruction Format
LABEL: OPCODE OPERAND(s) ; COMMENT
Examples:
Next: MOV AL,BL      ; Transfers BL contents to AL
      MOV AL,12h     ; Transfers 12h to AL
      MOV AL,[1234h] ; Transfers one byte of data from the memory location given by [DS+1234h] to AL
Labels and comments are optional!!!

OPERAND(s) may be
Part of the instruction
Reside in registers of the processor
Reside in a memory location

BITS Pilani, Pilani Campus

Instruction Set Architecture


BITS Pilani
Pilani Campus

S Mohan

BITS Pilani, Pilani Campus

Addressing Modes
An instruction must contain information about how to get its operands
This is called the ADDRESSING MODE: essentially it tells where the operands are and how to get them

Question
Why do we need various addressing modes, i.e. various ways to represent operands?
BITS Pilani, Pilani Campus

Register Addressing
Instruction Opcode Register Address R

Registers
Example MOV AL,BL ADD AX, BX

Operand

BITS Pilani, Pilani Campus

Immediate Addressing
Operand is part of instruction No memory reference to fetch operand Ex:
MOV AX,23h ADD CL, 44h MOV AL, A MOV BL, 11001100B

BITS Pilani, Pilani Campus

Direct Addressing
Instruction Opcode
Examples: MOV CX, DATA ADD CL, TEMP MOV AL,[1234h]

Address A

Memory

Operand
Single memory reference to access data. So no extra calculations required to get effective address!!!

BITS Pilani, Pilani Campus

Register Indirect Addressing


Instruction
Opcode Register Address R
MOV AX,[BX or DI or SI]

Memory

Registers

Pointer to Operand

Operand

BITS Pilani, Pilani Campus

Example: Register Indirect Addressing


Assume DS=1120h, SI=2145h and AX=1234h. What will be the contents of the memory location(s) modified by the following instruction?
MOV [SI], AX
Solution: memory location pointed to by SI = 11200h + 2145h = 13345h
The 8086 is little-endian, so 13345h will contain 34h (AL) and 13346h will contain 12h (AH)

BITS Pilani, Pilani Campus
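The byte ordering in this example can be checked with Python's struct module: a little-endian 16-bit store places the low byte at the lower address, exactly as the 8086 does for MOV [SI],AX:

```python
import struct

# Little-endian word store of AX = 1234h: the low byte (AL = 34h) goes to
# the lower address, the high byte (AH = 12h) to the next one.
print(struct.pack('<H', 0x1234))   # b'\x34\x12'
```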

Register Indirect Addressing More


MOV AL, [DI]
It is clear that this is a byte-sized move (AL is one byte)

MOV [BP], 10h


Ambiguous!!! does it address a byte, word, or doubleword sized memory location?

In an assembler, use
MOV BYTE PTR [DI],10H

BYTE PTR, WORD PTR


Directives used only with instructions that address memory locations through a pointer or index register with immediate data.
BITS Pilani, Pilani Campus

Base Plus Index Addressing


Similar to indirect addressing
Uses one base register (BX or BP), which holds the beginning location of an array, and one index register (DI or SI), which holds the relative position of an element in the array

BITS Pilani, Pilani Campus

Register or Based Relative Addressing


Similar to base-plus-index addressing
Data in a segment of memory are addressed by adding a displacement to the contents of a base or an index register (BP, BX, DI, SI)
Ex. MOV AL, [BX+100h]

BITS Pilani, Pilani Campus

Base Relative Plus Index Addressing


Similar to base-plus-index addressing, with the addition of a displacement to calculate the effective address
Can be used to address a 2D array of data
Ex. MOV AL, [BX+SI+100H]; MOV AX, FILE[BX][DI]

BITS Pilani, Pilani Campus

Example:
.data
FILE EQU THIS BYTE
recA DB 15 dup(?)    ;15 bytes for rec A
recB DB 20 dup(?)    ;20 bytes for rec B
.code
.startup
MOV BX, OFFSET recA  ;address record A
MOV DI, 0            ;address element 0
MOV AL, FILE[BX+DI]
.exit
end

BITS Pilani, Pilani Campus

Pentium Addressing Modes


Virtual or effective address is the offset into a segment
Starting address plus offset gives the linear address
This goes through page translation if paging is enabled

Nine addressing modes are available:

Immediate
Register operand
Displacement
Base
Base with displacement
Scaled index with displacement
Base with index and displacement
Base with scaled index and displacement
Relative
BITS Pilani, Pilani Campus

Design Decisions
Operation repertoire
How many ops, What can they do and How complex are they?

Instruction formats
Length of op code field Number of addresses

Registers
Number of CPU registers available Which operations can be performed on which registers?

Addressing modes Supported


BITS Pilani, Pilani Campus

How many Addresses in an Instruction


More addresses
More complex (powerful?) instructions More registers Inter-register operations are quicker Fewer instructions per program

Fewer addresses
Less complex (powerful?) instructions More instructions per program Faster fetch/execution of instructions
BITS Pilani, Pilani Campus

Example: Number of Addresses


Program to execute
Y = (A-B)/(C+D*E)

THREE-address:
SUB Y,A,B
MUL T,D,E
ADD T,T,C
DIV Y,Y,T

TWO-address:
MOV Y,A
SUB Y,B
MOV T,D
MUL T,E
ADD T,C
DIV Y,T

ONE-address (accumulator):
LOAD D
MUL E
ADD C
STOR Y
LOAD A
SUB B
DIV Y
STOR Y
BITS Pilani, Pilani Campus
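The ONE-address program above can be checked with a tiny accumulator-machine interpreter (my own sketch; the mnemonics follow the slide, the sample operand values are assumptions for illustration):

```python
def run(program, mem):
    """Interpret a one-address (accumulator) program over a memory dict."""
    acc = 0
    for op, arg in program:
        if   op == 'LOAD': acc = mem[arg]
        elif op == 'STOR': mem[arg] = acc
        elif op == 'ADD':  acc += mem[arg]
        elif op == 'SUB':  acc -= mem[arg]
        elif op == 'MUL':  acc *= mem[arg]
        elif op == 'DIV':  acc /= mem[arg]
    return mem

prog = [('LOAD','D'), ('MUL','E'), ('ADD','C'), ('STOR','Y'),
        ('LOAD','A'), ('SUB','B'), ('DIV','Y'), ('STOR','Y')]

m = run(prog, {'A': 10, 'B': 4, 'C': 2, 'D': 2, 'E': 2})
print(m['Y'])   # (10-4)/(2+2*2) = 1.0
```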

Instruction Set Architecture


BITS Pilani
Pilani Campus

S Mohan

BITS Pilani, Pilani Campus


Types of Operand
Addresses
Unsigned integers

Numbers
Binary fixed point/Binary floating point

Characters
ASCII

Logical Data
Bits or flags, Bit level manipulation of data

BITS Pilani, Pilani Campus

Assembly to Machine Conversion in 8086


OPCODE D W MOD REG R/M

Byte one contains


OPCODE (6 bits)
Specifies the operation to be performed. Ex. ADD, SUB, MOV

Register direction (D) bit


Tells whether the register operand in the REG field of byte two is the source or the destination
D=1: data flows to the REG field from R/M (REG is the destination)
D=0: data flows from the REG field to R/M (REG is the source)

Data size (W) bit


Specifies whether the operation will be performed on 8 bits or 16 bits
W=0: 8 bits; W=1: 16 bits
BITS Pilani, Pilani Campus

Assembly to Machine Conversion in 8086


OPCODE D W MOD REG R/M

Register field (REG): 3 bits


To identify the register for the first operand

Mode field (MOD): 2 bits Register/Memory field (R/M): 3 bits


2 bit MOD field and 3 bit R/M field together specify the second operand

BITS Pilani, Pilani Campus

Assembly to machine Conversion in 8086


MOD field:
00  Memory mode, no displacement
01  Memory mode, 8-bit displacement
10  Memory mode, 16-bit displacement
11  Register mode, no displacement

Example: MOV BL, AL -- machine code?
OPCODE = 100010
D = 1 (BL is destination)
W = 0 (8 bits)
MOD = 11 (register mode)
REG = 011 (code for BL)
R/M = 000 (code for AL)

Register codes (MOD = 11):
R/M   000  001  010  011  100  101  110  111
W=0   AL   CL   DL   BL   AH   CH   DH   BH
W=1   AX   CX   DX   BX   SP   BP   SI   DI

Effective address calculation:
R/M   MOD=00          MOD=01      MOD=10
000   BX+SI           BX+SI+D8    BX+SI+D16
001   BX+DI           BX+DI+D8    BX+DI+D16
010   BP+SI           BP+SI+D8    BP+SI+D16
011   BP+DI           BP+DI+D8    BP+DI+D16
100   SI              SI+D8       SI+D16
101   DI              DI+D8       DI+D16
110   Direct Address  BP+D8       BP+D16
111   BX              BX+D8       BX+D16
BITS Pilani, Pilani Campus
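The MOV BL,AL encoding worked out above can be assembled programmatically from its fields; the two-byte layout is OPCODE(6) D(1) W(1) | MOD(2) REG(3) R/M(3):

```python
def encode(opcode, d, w, mod, reg, rm):
    """Pack the 8086 two-byte register-form instruction from its fields."""
    byte1 = (opcode << 2) | (d << 1) | w
    byte2 = (mod << 6) | (reg << 3) | rm
    return bytes([byte1, byte2])

# MOV BL,AL: OPCODE=100010, D=1 (REG is destination), W=0 (byte),
# MOD=11 (register mode), REG=011 (BL), R/M=000 (AL)
code = encode(0b100010, 1, 0, 0b11, 0b011, 0b000)
print(code.hex())   # 8ad8 -- the actual 8086 encoding of MOV BL,AL
```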

Exercise
Show the machine code for the following assembly instructions (use the MOD, register, and effective-address tables from the previous slide):
MOV [DI], AL
ADD [BX+SI+1234h], AX
BITS Pilani, Pilani Campus


Assembly Program Example


Copy the contents of a block of memory (16 bytes) starting at location 10200h to another block of memory starting at 10100h

      MOV AX, 1000h
      MOV DS, AX
      MOV SI, 200h
      MOV DI, 100h
      MOV CX, 10h
REPT: MOV AH, [SI]
      MOV [DI], AH
      INC SI
      INC DI
      DEC CX
      JNZ REPT
BITS Pilani, Pilani Campus
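The copy loop can be mirrored in Python to see what it does to memory (a sketch: memory is modeled as a dict of byte cells, and the source-block contents are my own sample data, since the slide does not specify them):

```python
# Source block of 16 bytes at physical address 10200h (sample data 0..15).
mem = {0x10200 + i: i for i in range(16)}

ds, si, di, cx = 0x1000, 0x200, 0x100, 0x10    # MOV AX,1000h / MOV DS,AX / ...
while cx:                                       # REPT: ... JNZ REPT
    mem[(ds << 4) + di] = mem[(ds << 4) + si]   # MOV AH,[SI] / MOV [DI],AH
    si += 1; di += 1; cx -= 1                   # INC SI / INC DI / DEC CX

print(mem[0x10100], mem[0x1010F])   # 0 15: block copied to 10100h
```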

8086 Instruction Types


Data movement instructions
Arithmetic: add, subtract, increment, decrement, convert byte/word and compare
Logic: AND, OR, exclusive OR, shift/rotate and test
String manipulation: load, store, move, compare and scan for byte/word
Control transfer: conditional, unconditional, call subroutine and return from subroutine
Input/Output instructions
Other: setting/clearing flag bits, stack operations, software interrupts, etc.

BITS Pilani, Pilani Campus

80x86 Instruction Set (Data Movement Instructions)

Moving data/addresses between registers and memory:
MOV
PUSH/POP
LDS, LES, LSS
LEA

BITS Pilani, Pilani Campus

80x86 Instruction Set (String Instructions)


LODS, STOS, MOVS, INS, OUTS
The direction flag (D) and the DI, SI registers are closely associated with these instructions
D=0 indicates auto-increment; D=1 indicates auto-decrement
The CLD and STD instructions clear or set the D flag

BITS Pilani, Pilani Campus

80x86 Instruction Set (Arithmetic/Logical)


Arithmetic Instructions
ADD, SUB, CMP, MUL, DIV

Logical Instructions
AND, OR, TEST, XOR, NOT, NEG

Shift and Rotate instructions


ROL, RCL etc

String Comparisons Instructions


SCAS (String Scan)

Compares AL register with byte block of memory


CMPS (Compare strings)

Compares two sections of memory data

BITS Pilani, Pilani Campus

80x86 Instruction Set (Program Control)


Unconditional jump
JMP instruction

Conditional jump
JNC-> jump no carry JNO->jump if no overflow JE-> jump if equal JNE->jump if not equal

BITS Pilani, Pilani Campus

x86 Instruction Set Summary


(Data Transfer)

IN     ;Input
LAHF   ;Load AH from Flags
LDS    ;Load pointer to DS
LEA    ;Load EA to register
LES    ;Load pointer to ES
LODS   ;Load memory at SI into AX
MOV    ;Move
MOVS   ;Move memory at SI to DI
OUT    ;Output
POP    ;Pop
POPF   ;Pop Flags
PUSH   ;Push
PUSHF  ;Push Flags
SAHF   ;Store AH into Flags
SCAS   ;Scan memory at DI compared to AX
STOS   ;Store AX into memory at DI
XCHG   ;Exchange
XLAT   ;Translate byte to AL

BITS Pilani, Pilani Campus

x86 Instruction Set Summary


(Arithmetic/Logical)

AAA    ;ASCII Adjust for Add in AX
AAD    ;ASCII Adjust for Divide in AX
AAM    ;ASCII Adjust for Multiply in AX
AAS    ;ASCII Adjust for Subtract in AX
ADC    ;Add with Carry
ADD    ;Add
AND    ;Logical AND
CMC    ;Complement Carry
CMP    ;Compare
CMPS   ;Compare memory at SI and DI
CWD    ;Convert Word to Double in AX DX,AX
DAA    ;Decimal Adjust for Add in AX
DAS    ;Decimal Adjust for Subtract in AX
DEC    ;Decrement
DIV    ;Divide (unsigned) in AX(,DX)
IDIV   ;Divide (signed) in AX(,DX)
MUL    ;Multiply (unsigned) in AX(,DX)
IMUL   ;Multiply (signed) in AX(,DX)
INC    ;Increment
BITS Pilani, Pilani Campus

x86 Instruction Set Summary (Arithmetic/Logical Cont.)

NEG      ;Negate
NOT      ;Logical NOT
OR       ;Logical inclusive OR
RCL      ;Rotate through Carry Left
RCR      ;Rotate through Carry Right
ROL      ;Rotate Left
ROR      ;Rotate Right
SAR      ;Shift Arithmetic Right
SBB      ;Subtract with Borrow
SCAS     ;Scan memory at DI compared to AX
SHL/SAL  ;Shift logical/Arithmetic Left
SHR      ;Shift logical Right
SUB      ;Subtract
TEST     ;AND function to flags
XLAT     ;Translate byte to AL
XOR      ;Logical Exclusive OR

BITS Pilani, Pilani Campus

x86 Instruction Set Summary


(Control/Branch)

JNLE/JG          ;Jump on Not Less or Equal/Greater
JNO              ;Jump on Not Overflow
JNP/JPO          ;Jump on Not Parity/Parity Odd
JNS              ;Jump on Not Sign
JO               ;Jump on Overflow
JP/JPE           ;Jump on Parity/Parity Even
JS               ;Jump on Sign
LOOP             ;Loop CX times
LOOPNZ/LOOPNE    ;Loop while Not Zero/Not Equal
LOOPZ/LOOPE      ;Loop while Zero/Equal
NOP              ;No Operation (= XCHG AX,AX)
REP/REPNE/REPNZ  ;Repeat/Repeat Not Equal/Not Zero
REPE/REPZ        ;Repeat Equal/Zero
RET              ;Return from call
SEG              ;Segment register
STC              ;Set Carry
STD              ;Set Direction
STI              ;Set Interrupt
TEST             ;AND function to flags
BITS Pilani, Pilani Campus

x86 Instruction Set Summary


(Control/Branch Cont.)
CALL      ;Call
CLC       ;Clear Carry
CLD       ;Clear Direction
CLI       ;Clear Interrupt
HLT       ;Halt
INT       ;Interrupt
INTO      ;Interrupt on Overflow
IRET      ;Interrupt Return
JB/JNAE   ;Jump on Below/Not Above or Equal
JBE/JNA   ;Jump on Below or Equal/Not Above
JCXZ      ;Jump on CX Zero
JE/JZ     ;Jump on Equal/Zero
JL/JNGE   ;Jump on Less/Not Greater or Equal
JLE/JNG   ;Jump on Less or Equal/Not Greater
JMP       ;Unconditional Jump
JNB/JAE   ;Jump on Not Below/Above or Equal
JNBE/JA   ;Jump on Not Below or Equal/Above
JNE/JNZ   ;Jump on Not Equal/Not Zero
JNL/JGE   ;Jump on Not Less/Greater or Equal
BITS Pilani, Pilani Campus

Quiz
Q. The instruction set architecture for a simple computer must support access to 64 KB of byte-addressable memory space and eight 16-bit general-purpose CPU registers.
a) If the computer has three-operand machine language instructions that operate on the contents of two different CPU registers to produce a result that is stored in a third register, how many bits are required in the instruction format for addressing registers?

BITS Pilani, Pilani Campus

b) If all instructions are to be 16 bits long, how many op codes are available for the threeoperand, register operation instructions described above (neglecting, for the moment, any other types of instructions that might be required)?

BITS Pilani, Pilani Campus

Computer Organization & Architecture


BITS Pilani
Pilani Campus

Virendra Singh Shekhawat Department of Computer Science and Information Systems

BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Module-4 (Lecture-1)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Computer Memory Hierarchy and its characteristics, Cache Memory, Cache Mapping Function : Direct Mapping
BITS Pilani, Pilani Campus

Computer Memory Hierarchy

Source: Null, Linda and Lobur, Julia (2003). Computer Organization and Architecture (p. 236). Sudbury, MA: Jones and Bartlett Publishers

191
BITS Pilani, Pilani Campus

Memory Hierarchy
Computer memory exhibits the widest range of:
Physical type: semiconductor (RAM), magnetic (disk), optical (CD)
Physical characteristics: volatile/non-volatile, erasable/non-erasable
Organization: physical arrangement of bits
Performance: access time (latency), transfer time, memory cycle time
192
BITS Pilani, Pilani Campus

Memory System Characteristics-1


Location
CPU, Internal and External

Capacity
Number of Words/Bytes

Unit of transfer
Internal
Usually governed by data bus width

External
Usually a block which is much larger than a word

Addressable unit
Smallest location which can be uniquely addressed Word or byte internally and Cluster on disks
193
BITS Pilani, Pilani Campus

Memory System Characteristics-2


Access Methods
Sequential
Start at the beginning and read through in order Access time for a record is location dependent e.g. Magnetic Tape

Direct
Individual blocks have unique address based on physical location Access is by jumping to vicinity plus sequential search Access time depends on location and previous location e.g. Hard Disk
194
BITS Pilani, Pilani Campus

Memory System Characteristics-3


Access Methods
Random
Wired-in addressing mechanism Individual addresses identify locations exactly

Associative
Data is located based on a portion of its contents rather than its address Called as Content Addressable Memory (CAM)

Note: Access time is independent of location or previous access for Random and Associative accesses

195

BITS Pilani, Pilani Campus

Mismatch of CPU and Main Memory Speed

196
BITS Pilani, Pilani Campus

Locality of References
Processor generates the main memory references
To fetch instructions and data

Usually, memory references tend to cluster
e.g. loops, subroutines, operations on tables and arrays

Clustering of memory references is called locality of reference
Spatial: accesses tend to be clustered in the address space
Temporal: tendency to access recently accessed memory locations
197
BITS Pilani, Pilani Campus

Cache Memory
Small amount of fast memory called Cache memory sits between normal main memory and CPU

198
BITS Pilani, Pilani Campus

Cache Design Parameters


Total Cache Size
Small enough to reduce the cost and should be large enough to reduce the average access time

Block Size Number of Caches Replacement Algorithm Write Policy Mapping Function
199
BITS Pilani, Pilani Campus

Cache memory: Mapping Function


Why we need mapping?
Cache memory is smaller than main memory
The processor doesn't need to know about the existence of the cache!!!

The correspondence between the main memory blocks (group of words) and in the cache lines is specified by a mapping function
200
BITS Pilani, Pilani Campus

Direct Mapping
Each block of main memory maps to only one cache line
i.e. if a block is in cache, it must be in one specific place

Mapping Function jth Block of the main memory maps to ith cache line
i = j modulo M (M = number of cache lines)

201
BITS Pilani, Pilani Campus
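The mapping i = j modulo M, together with the resulting tag/line/offset address split, can be sketched in a few lines of Python (the cache parameters here are illustrative assumptions, not from a specific slide):

```python
LINE_SIZE, NUM_LINES = 64, 1024    # assumed: 64-byte lines, M = 1024 lines

def split(addr):
    """Split a byte address into (tag, line, offset) for a direct-mapped cache."""
    offset = addr % LINE_SIZE
    block = addr // LINE_SIZE       # main-memory block number j
    line = block % NUM_LINES        # i = j modulo M
    tag = block // NUM_LINES        # distinguishes blocks sharing a line
    return tag, line, offset

# Blocks whose numbers differ by a multiple of M map to the same line
# (and so can evict each other): 0x10000/64 = 1024 = M.
print(split(0x00000), split(0x10000))   # (0, 0, 0) (1, 0, 0)
```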

Mapping Function from Main to Cache Memory

Cache line   Main memory blocks held
0            0, m, 2m, 3m, ..., 2^s - m
1            1, m+1, 2m+1, ..., 2^s - m + 1
...
m-1          m-1, 2m-1, 3m-1, ..., 2^s - 1
202
BITS Pilani, Pilani Campus

Direct Mapping Function Example

203
BITS Pilani, Pilani Campus

Direct Mapping Cache Operation

204
BITS Pilani, Pilani Campus

Direct Mapping pros & cons


Simple Inexpensive Fixed location for given block
If a program accesses 2 blocks that map to the same line repeatedly, cache misses are very high

205
BITS Pilani, Pilani Campus

Exercises
Show the address split up for the direct cache mapping function. Cache and main memory details are as follows:
Cache size is 128K Byte Cache line size is 8 Bytes Main memory size is 16M Bytes Main Memory is Byte addressable

Will main memory addresses x234560 and x374562 map to same cache line?
206
BITS Pilani, Pilani Campus

Review Questions
Why tag bits are necessary to store along with the data bits in cache line?
Hint: How to identify the two blocks of main memory mapped to same cache line

How many tag comparisons are required to check the presence of the word requested by the processor in the direct mapped cache? Justify your answer.
Hint: Block to line mapping is fixed
207
BITS Pilani, Pilani Campus

Summary
Memory Hierarchy in Computer System Concept of Locality of Reference Cache Memory Cache Mapping Function: Direct Mapping

208
BITS Pilani, Pilani Campus

Thank You!

209
BITS Pilani, Pilani Campus

Computer Organization & Architecture


BITS Pilani
Pilani Campus

Virendra Singh Shekhawat Department of Computer Science and Information Systems

BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Module-4 (Lecture-2)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Cache Mapping Functions: Associative, Set Associative , Cache Line Replacement Algorithms
BITS Pilani, Pilani Campus

Direct Mapping Example


Cache Size = 8KB (Data) Main memory address is 32-bit Block Size/Line Size = 64 Bytes

212
BITS Pilani, Pilani Campus

Problem With the Direct Mapping

213
BITS Pilani, Pilani Campus

Solution to Fixed Mapping


Due to the fixed mapping from a main memory block to a cache line, cache utilization becomes low
Can we devise a mapping function such that no line is replaced while an empty line remains in the cache?
Any block of main memory can map to any line of the cache memory!
214
BITS Pilani, Pilani Campus

Fully Associative Mapping

215
BITS Pilani, Pilani Campus

Fully Associative Mapping


A main memory block can load into any line of cache memory Memory address is interpreted as tag and word Tag uniquely identifies block of main memory

216
BITS Pilani, Pilani Campus

Address Split up: Fully Associative Mapping


Cache Size = 2K words of data Main memory Size = 64K words Block Size/Line Size = 4 words

217
BITS Pilani, Pilani Campus

Fully Associative Cache Organization

218
BITS Pilani, Pilani Campus

Set Associative Mapping


Cache is divided into a number of sets
Each set contains a fixed number of lines
A given block maps to any line in a given set
e.g. 2 lines per set: 2-way associative mapping, so a given block can be in one of 2 lines in the set

219
BITS Pilani, Pilani Campus

Address Split up: 2-way Set Associative Mapping


Cache Size = 2K words Main memory Size = 64K words Block Size/Line Size = 4 words

220
BITS Pilani, Pilani Campus
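The address split for this 2-way configuration can be computed directly from the parameters on the slide (a sketch; all quantities are word-addressed, as in the slide):

```python
import math

# 2-way set-associative example: 2K-word cache, 64K-word main memory,
# 4-word lines.
CACHE_WORDS, MEM_WORDS, LINE_WORDS, WAYS = 2048, 65536, 4, 2

addr_bits = int(math.log2(MEM_WORDS))         # 16-bit word address
word_bits = int(math.log2(LINE_WORDS))        # 2 bits select the word in a line
num_sets = CACHE_WORDS // LINE_WORDS // WAYS  # 512 lines / 2 ways = 256 sets
set_bits = int(math.log2(num_sets))           # 8 bits select the set
tag_bits = addr_bits - set_bits - word_bits   # 6 bits of tag remain

print(tag_bits, set_bits, word_bits)   # 6 8 2
```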

K-Way Set Associative Cache Organization

221
BITS Pilani, Pilani Campus

Cache Line Replacement


When all lines hold valid data and a new block is to be brought from main memory into the cache:
We need to replace one of the valid (filled) cache lines
How to decide the victim? Using replacement algorithms
Ex. LRU, LFU, FIFO, Random etc.

222
BITS Pilani, Pilani Campus

Replacement Algorithms
No choice in direct mapping, because each block maps to only one line!!!
Least Recently Used (LRU)
Replace the block that has gone unreferenced for the longest time

First in first out (FIFO)


Replace the block that has been in cache since longest time

Least frequently used (LFU)


Replace block which has had fewest hits

Random
Replace any block randomly
223
BITS Pilani, Pilani Campus
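A fully associative cache with LRU replacement takes only a few lines to simulate (my own sketch: an OrderedDict tracks recency, most recently used at the end). With a cyclic reference pattern one block longer than the cache, LRU misses on every access:

```python
from collections import OrderedDict

def simulate(refs, num_lines):
    """Return (final cache contents in LRU order, total misses)."""
    cache, misses = OrderedDict(), 0
    for block in refs:
        if block in cache:
            cache.move_to_end(block)       # hit: mark most recently used
        else:
            misses += 1
            if len(cache) == num_lines:
                cache.popitem(last=False)  # evict the least recently used
            cache[block] = True
    return list(cache), misses

print(simulate([0, 1, 2, 3, 4, 0], 4))   # ([2, 3, 4, 0], 6): every access misses
```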

Varying Associativity over Cache Size


[Figure: hit ratio (0.0 to 1.0) vs. cache size (1k to 1M bytes) for direct-mapped and 2-, 4-, 8- and 16-way set-associative caches]
224
BITS Pilani, Pilani Campus

Exercise-1
Consider a fully associative cache memory having 4 lines. The main memory references generated by the processor are 0,1,2,3,4,0,1,2,3,4. Show the final cache entries by assuming LRU replacement policy.

225
BITS Pilani, Pilani Campus

Exercise-2
A 4-Way set associative cache has a block size of four 16-bit words. The cache can accommodate a total of 4K such words. The main memory size is 128K words. How the processor's addresses are interpreted.

226
BITS Pilani, Pilani Campus

Summary
Cache to Main Memory Mapping
Fully Associative and Set Associative

Cache Line Replacement Algorithms


LRU FIFO LFU Random
227
BITS Pilani, Pilani Campus

Thank You!

228
BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Module-4 (Lecture-3)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Cache Performance Measurement , Cache Miss Types, Hit Ratio, Write Policy, Multilevel Caches
BITS Pilani, Pilani Campus

Cache Performance
The more processor references found in the cache, the better the performance
If a reference is found in the cache it is a HIT, otherwise a MISS
A penalty is associated with each MISS that occurs
More hits reduce the average access time

230
BITS Pilani, Pilani Campus

Where do misses come from?


Compulsory: Initially the cache is empty (or holds no valid data), so the first access to a block is always a miss. Also called cold-start misses or first-reference misses.

Capacity: If the cache is not large enough to hold all the blocks needed during execution of a program, frequent misses will occur. Capacity misses are those that occur regardless of associativity or block size.

Conflict: If the block-placement strategy is set-associative or direct-mapped, conflict misses (in addition to compulsory and capacity misses) occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses.
231
BITS Pilani, Pilani Campus

Operation of Two Level Memory System-1


Hit Ratio(H)
Ratio of number of references found in the higher level memory (M1) (i.e. cache memory) to the total references

Average Access time


Ts =H*T1 + (1-H)*(T1+T2) T1=Access time of M1 (Cache) T2=Access time of M2 (Main Memory)
232
BITS Pilani, Pilani Campus

Operation of Two Level Memory System-2


Cost
Cs = (C1S1+C2S2)/(S1+S2)

Access Efficiency (T1/Ts)

Measure of how close average access time is to M1 access time On chip cache access time is about 25 to 50 times faster than main memory Off chip cache access time is about 5 to 15 times faster than main memory access time
233
BITS Pilani, Pilani Campus

Hit ratio Calculation Example


Cache access time is 80 ns Main memory access time is 1000 ns Average access time is 180 ns What is Hit ratio?

234
BITS Pilani, Pilani Campus
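Rearranging the two-level access-time formula Ts = H*T1 + (1-H)*(T1+T2) gives H = 1 - (Ts - T1)/T2, which solves this example directly:

```python
# Slide's numbers: cache 80 ns, main memory 1000 ns, average 180 ns.
T1, T2, Ts = 80, 1000, 180

# Ts = H*T1 + (1-H)*(T1+T2)  =>  Ts = T1 + (1-H)*T2  =>  H = 1 - (Ts-T1)/T2
H = 1 - (Ts - T1) / T2
print(H)   # 0.9
```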

Main Memory Inconsistency


What happens when data is modified in cache? Must not overwrite a cache block unless main memory is up to date Multiple CPUs may have individual caches

235
BITS Pilani, Pilani Campus

Write Through Policy


All writes go to main memory as well as cache memory Multiple CPUs have to monitor main memory traffic to keep local (to CPU) cache up to date Implications
Lots of traffic and Slows down writes

236
BITS Pilani, Pilani Campus

Write Back Policy


Updates are initially made in the cache only
An update (dirty) bit for the cache slot is set when the update occurs
If a block is to be replaced, it is written back to main memory only if the update/dirty bit is set
Implications
Other caches can get out of sync
I/O must access main memory through the cache

237
BITS Pilani, Pilani Campus

Hit Ratio vs. Cache Line Size


On each miss, not only the desired word but a number of adjacent words are retrieved
Increased block size will increase the hit ratio at first
Due to the principle of locality

Hit ratio decreases as the block becomes even bigger
Probability of using the newly fetched information becomes less than the probability of reusing the replaced data

Larger blocks
Reduce the number of blocks that fit in the cache
Data may be overwritten shortly after being fetched
Each additional word is less local, so less likely to be needed

No definitive optimum value has been found; 8 to 64 bytes seems reasonable

238
BITS Pilani, Pilani Campus

Multi Level Caches


On-chip cache improves performance. Why???
Will more than one level of cache improve performance???
The simplest organization is two levels: on-chip cache (L1) and external cache (L2)
In many cache designs a separate bus is used to transfer data to an off-chip (L2) cache
Nowadays L2 cache is also available on chip, thanks to shrinking processor feature sizes
So now we have one more level of cache, i.e. an off-chip L3 cache
BITS Pilani, Pilani Campus

Multilevel Cache Performance

240
BITS Pilani, Pilani Campus

Unified Vs. Split Caches


Earlier, the same cache was used for data as well as instructions, i.e. a unified cache
Now we have separate caches for data and instructions, i.e. a split cache
Advantages of a unified cache
It balances the load between data and instructions automatically

Advantages of a split cache
Useful in parallel instruction execution
Eliminates contention between the instruction fetch/decode unit and the execution unit
E.g. superscalar machines: Pentium and PowerPC
241
BITS Pilani, Pilani Campus

Caches and External Connections in P-3 Processor


[Figure: Pentium III cache hierarchy: processing units with split L1 instruction and data caches, a bus interface unit, a dedicated cache bus to the L2 cache, and a system bus to main memory and I/O]

L1 data and instruction caches are 16 KB each
Data cache: 4-way set associative; instruction cache: 2-way set associative
L2 is 512 KB, 2-way set associative
242
BITS Pilani, Pilani Campus

Cache Memory Evolution


[Table: L1/L2/L3 cache sizes by processor and year of introduction, from the IBM 360/85 (1968, 16 to 32 KB L1) through the PDP-11/70, VAX 11/780, IBM mainframes, Intel 80486/Pentium/Pentium 4, PowerPC 601 to G4, Itanium/Itanium 2 and IBM POWER5 (2003, 64 KB L1, 1.9 MB L2, 36 MB L3): L1 grows from tens of KB to about 64 KB, while L2 and L3 caches appear and grow to tens of MB]
243
BITS Pilani, Pilani Campus

Review Questions
For a larger block size, the hit ratio initially increases and then drops. Why?

Why does a multilevel cache system give better performance than a single-level one?
How are conflict misses different from capacity misses?
244
BITS Pilani, Pilani Campus

Summary
Cache Performance
Hit Ratio
Average Access Time
Access Efficiency
Write Policy
Line Size
Multiple Levels of Cache
Unified Cache and Split Cache
245
BITS Pilani, Pilani Campus

Thank You!

246
BITS Pilani, Pilani Campus

Computer Organization & Architecture


BITS Pilani
Pilani Campus

Virendra Singh Shekhawat Department of Computer Science and Information Systems

BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Module-5 (Lecture-1)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Internal memory, Semiconductor Memory Types, RAM, ROM, Memory Chip Organization, Memory Module Organization, Flash Memory
BITS Pilani, Pilani Campus

Basic Concepts
Data transfer between the processor and the memory takes place through the two registers
MAR and MBR or MDR

Memory speed measurement

Memory Access Time
Memory Cycle Time
Minimum time delay required between the initiation of two successive memory operations

Memory cycle time is usually slightly longer than the access time!
Memory cycle time for semiconductor memories ranges from 10 to 100 ns
249
BITS Pilani, Pilani Campus

Connection: Memory to Processor

250
BITS Pilani, Pilani Campus

Semiconductor Memory Types


Memory Type                          Category            Erasure                    Write Mechanism  Volatility
Random-access memory (RAM)           Read-write memory   Electrically, byte-level   Electrically     Volatile
Read-only memory (ROM)               Read-only memory    Not possible               Masks            Nonvolatile
Programmable ROM (PROM)              Read-only memory    Not possible               Electrically     Nonvolatile
Erasable PROM (EPROM)                Read-mostly memory  UV light, chip-level       Electrically     Nonvolatile
Electrically Erasable PROM (EEPROM)  Read-mostly memory  Electrically, byte-level   Electrically     Nonvolatile
Flash memory                         Read-mostly memory  Electrically, block-level  Electrically     Nonvolatile
251
BITS Pilani, Pilani Campus

Static RAM
Memories that consist of circuits capable of retaining their state as long as power is applied
Bits stored as on/off switches
Complex construction, so larger per bit and more expensive

252
BITS Pilani, Pilani Campus

SRAM Cell Organization


Transistor arrangement gives a stable logic state
State 1
C1 high, C2 low, T1 T4 off, T2 T3 on

State 0
C2 high, C1 low, T2 T3 off, T1 T4 on

Address line transistors are T5 and T6

253
BITS Pilani, Pilani Campus

Dynamic RAM
Bits stored as charge in capacitors; charge leaks, so refreshing is needed even when powered
Simpler construction and smaller per bit, so less expensive
Address line active when a bit is read or written
Transistor switch closed (current flows)

Slower operations, used for main memory

254
BITS Pilani, Pilani Campus

Memory Chip Organization


[Figure: internal organization of a 16 x 8 memory chip. Address inputs A0 to A3 feed an address decoder that selects one of the word lines W0 to W15; each word line enables a row of flip-flop (FF) memory cells holding bits b7 ... b1 b0; Sense/Write circuits, controlled by R/W and CS, connect the selected row to the data input/output lines b7 ... b1 b0.]

Fig. Ref. Computer Organization 5th Ed. by Hamacher

255
BITS Pilani, Pilani Campus

Organization of a 1K 1 Memory Chip


[Figure: organization of a 1K x 1 memory chip. The 10-bit address splits into a 5-bit row address, decoded by a 5-bit decoder to select word lines W0 to W31 of a 32 x 32 memory cell array, and a 5-bit column address driving a 32-to-1 output multiplexer and input demultiplexer; Sense/Write circuitry is controlled by R/W and CS, with a single data input/output line.]

External connections required = 15

Fig. Ref. Computer Organization 5th Ed. by Hamacher

256
BITS Pilani, Pilani Campus

Memory Organization Issues


A 16-Mbit chip can be organized as 1M of 16-bit words
36 pins required for address and data (20 + 16), plus 4 pins (R/W, CS, PS, G)

It can be organized as 4K x (512 x 8), i.e. 2M x 8

29 pins required for address and data (21 + 8), plus 4 pins

It can be organized as a 2048 x 2048 x 4-bit array

Row address and column address can be multiplexed
11 pins to address (2^11 = 2048) + 4 data pins + 4 control pins
Adding one more address pin doubles the range of both row and column values, so x4 capacity
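The pin counts above can be cross-checked with a small sketch. The four control pins and the halving of multiplexed address lines follow the slide's assumptions; real packages also need power and ground pins, which are ignored here.

```python
import math

def pins_required(words, bits_per_word, control_pins=4, multiplexed=False):
    # Address pins + data pins + control pins (power/ground ignored).
    # control_pins=4 matches the slide's examples (e.g. R/W, CS, PS, G).
    addr = math.ceil(math.log2(words))
    if multiplexed:
        addr = math.ceil(addr / 2)  # row/column address sent in two halves
    return addr + bits_per_word + control_pins

print(pins_required(1 << 20, 16))                   # 1M x 16: 20 + 16 + 4 = 40
print(pins_required(1 << 21, 8))                    # 2M x 8:  21 + 8 + 4 = 33
print(pins_required(1 << 22, 4, multiplexed=True))  # 2048 x 2048 x 4: 11 + 4 + 4 = 19
```

The same helper answers the 4-Mbit exercise later in this module by plugging in the two organizations.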
257
BITS Pilani, Pilani Campus

16 Mbit DRAM Organization

258
BITS Pilani, Pilani Campus

DIMM and SIMM Memory Modules

259
BITS Pilani, Pilani Campus

256KB Module Organization

260
BITS Pilani, Pilani Campus

1MByte Module Organization

261
BITS Pilani, Pilani Campus

Synchronous DRAM (SDRAM)


Access is synchronized with an external clock
Address is presented to the RAM
RAM finds the data (the CPU waits in conventional DRAM)
Since SDRAM moves data in time with the system clock, the CPU knows when the data will be ready
The CPU does not have to wait; it can do something else

262
BITS Pilani, Pilani Campus

DDR SDRAM Read Timing

263
BITS Pilani, Pilani Campus

Flash memory
Similar technology to EEPROM, i.e. a single transistor controlled by trapped charge
A single cell can be read, but writing is on a block basis and previous contents are erased
Greater density, lower cost per bit, and lower power consumption
Used in MP3 players, cell phones, digital cameras
Larger flash memory modules are called Flash Drives (i.e. Solid State Storage Devices)
264
BITS Pilani, Pilani Campus

Exercises
How does memory organization impact the data transfer rate?
Number of pins required for a 4-Mbit memory chip:
Organized as 4M x 1
Organized as 1M x 4

265
BITS Pilani, Pilani Campus

Summary

266
BITS Pilani, Pilani Campus

Thank You!

267
BITS Pilani, Pilani Campus

Computer Organization & Architecture


BITS Pilani
Pilani Campus

Virendra Singh Shekhawat Department of Computer Science and Information Systems

BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Module-5 (Lecture-2)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: External Memory, Magnetic Disk, Disk Characteristics and Performance Parameters
BITS Pilani, Pilani Campus

External memory
Semiconductor memory cannot be used to store large amounts of information or data
Due to its high per-bit cost!

Large storage requirements are fulfilled by

Magnetic disks, optical disks and magnetic tapes
Called secondary storage

270
BITS Pilani, Pilani Campus

Disk Connection to the System Bus


Processor Main Memory System Bus

Disk Controller

Disk Drive

271
BITS Pilani, Pilani Campus

Magnetic Disk Structure


Disk substrate (non-magnetizable material) coated with magnetizable material (e.g. iron oxide, i.e. rust)
Advantages of a glass substrate over aluminium:
Improved surface uniformity
Reduction in surface defects
Reduced read/write errors

Lower fly heights Better stiffness Better shock/damage resistance


272
BITS Pilani, Pilani Campus

Magnetic Disk

273
BITS Pilani, Pilani Campus

Data Organization on Disk


Concentric rings called tracks
Gaps between tracks
Same number of bits per track
Constant angular velocity

Tracks divided into sectors
Minimum block size is one sector
Disk rotates at constant angular velocity
Gives pie-shaped sectors

Individual tracks and sectors are addressable


274
BITS Pilani, Pilani Campus

Multi Zone Recording Disks

Disk storage capacity in a CAV system is limited by the maximum recording density that can be achieved on the innermost track
275
BITS Pilani, Pilani Campus

Read and Write Mechanisms-1


Recording & retrieval of data via a conductive coil called a head
May be a single read/write head or separate ones
During read/write the head is stationary while the platter rotates
Write:
Electricity flowing through the coil produces a magnetic field
Electric pulses are sent to the head
A magnetic pattern is recorded on the surface below

276
BITS Pilani, Pilani Campus

Read and Write Mechanisms-2


Read (traditional):
A magnetic field moving relative to a coil produces a current in the coil
The same coil is used for read and write (floppy disk)

Read (contemporary):
Separate read head, close to the write head
Partially shielded magneto-resistive (MR) sensor
Electrical resistance depends on the direction of the magnetic field
Resistance changes are detected as voltage signals
Allows high-frequency operation
Higher storage density and speed
277
BITS Pilani, Pilani Campus

Disk Characteristics
Fixed (rare) or movable head
Fixed head
One read/write head per track, mounted on a fixed rigid arm

Movable head
One read/write head per side mounted on a movable arm

Removable or fixed disk
Single or double (usually) sided
Head mechanism


Contact (Floppy), Fixed gap, Flying (Winchester)

Single or multiple platter


278
BITS Pilani, Pilani Campus

Multiple Platters Tracks and Cylinders


C y l i n d e r

279
BITS Pilani, Pilani Campus

Capacity
Vendors express capacity in units of gigabytes (GB), where 1 GB = 10^9 bytes
Capacity is determined by these technology factors:
Recording density (bits/in): number of bits that can be squeezed into a 1-inch segment of a track
Track density (tracks/in): number of tracks that can be squeezed into a 1-inch radial segment
Areal density (bits/in^2): product of recording density and track density

Modern disks partition tracks into disjoint subsets called recording zones
280
BITS Pilani, Pilani Campus

Computing Disk Capacity


Capacity =(# bytes/sector) x (avg. # sectors/track) x (# tracks/surface) x (# surfaces/platter) x (# platters/disk)

Example:
512 bytes/sector, 300 sectors/track (average)
20,000 tracks/surface, 2 surfaces/platter
5 platters/disk
Capacity = 512 x 300 x 20,000 x 2 x 5 = 30.72 GB
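The capacity formula above can be checked directly with a minimal sketch:

```python
def disk_capacity(bytes_per_sector, sectors_per_track, tracks_per_surface,
                  surfaces_per_platter, platters):
    # Straight product of the five factors in the slide's formula
    return (bytes_per_sector * sectors_per_track * tracks_per_surface
            * surfaces_per_platter * platters)

cap = disk_capacity(512, 300, 20_000, 2, 5)
print(cap / 10**9, "GB")   # 30.72 GB (vendor GB = 10^9 bytes)
```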
281
BITS Pilani, Pilani Campus

Disk Performance Parameters


Seek time (Ts)
Time required to position the head on the desired track

Rotational delay
Time required for the desired sector to rotate under the R/W head

Transfer time
The total average access time is: Ta = Ts + 1/(2r) + b/(rN)
Here Ts is the average seek time
r is the rotation speed in revolutions per second
b is the number of bytes to be transferred
N is the number of bytes on a track
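The formula Ta = Ts + 1/(2r) + b/(rN) can be evaluated numerically. A sketch, with the result converted to milliseconds and the example's parameters (4 ms seek, 15,000 rpm, one 512-byte sector, 500 sectors of 512 bytes per track):

```python
def avg_access_time_ms(seek_ms, rpm, bytes_to_transfer, bytes_per_track):
    # Ta = Ts + 1/(2r) + b/(rN), returned in milliseconds
    r = rpm / 60.0                       # revolutions per second
    rotational_delay = 1 / (2 * r)       # average: half a revolution (seconds)
    transfer = bytes_to_transfer / (r * bytes_per_track)  # seconds
    return seek_ms + 1000 * (rotational_delay + transfer)

print(avg_access_time_ms(4, 15_000, 512, 500 * 512))  # 4 + 2 + 0.008 = 6.008 ms
```

Exercise-2 later on gets 6.005 ms because it uses a fixed 100 MB/s transfer rate instead of the track-based b/(rN) term; both readings are consistent with the formula.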
282
BITS Pilani, Pilani Campus

Example
Average seek time = 4 ms
Rotation speed = 15,000 rpm
512 bytes per sector
No. of sectors per track = 500
Want to read a file consisting of 2000 sectors. Calculate the time to read the entire file if the file is stored sequentially.
4 + 2 + 4 = 10 ms to read the first track (500 sectors)
Time required to read the remaining 3 tracks is 3 x (2 + 4) = 18 ms
Total time is 28 ms
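The worked example can be reproduced with a sketch that, as the solution does, charges each track an average rotational delay plus one full revolution (4 ms at 15,000 rpm) to read, and the seek only once:

```python
def sequential_read_ms(seek_ms, rpm, sectors, sectors_per_track):
    r = rpm / 60.0
    rev_ms = 1000 / r            # one full revolution = time to read a track
    rot_ms = rev_ms / 2          # average rotational delay
    tracks = sectors / sectors_per_track
    # first track pays the seek; every track pays rotational delay + full read
    return seek_ms + tracks * (rot_ms + rev_ms)

print(sequential_read_ms(4, 15_000, 2000, 500))   # 28.0 ms
```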
283
BITS Pilani, Pilani Campus

Exercise-1
If file is stored randomly.
2000 sectors are randomly scattered

What will be the total time required to read the file now?

284
BITS Pilani, Pilani Campus

Exercise-2
What is the average time to read or write a 512-byte sector on a disk rotating at 15,000 rpm? The average seek time is 4 ms, and the transfer rate is 100 MB/sec.
Sol: 4 + 2 + 0.005 = 6.005 ms (seek + average half-rotation + 512 bytes at 100 MB/s)

285
BITS Pilani, Pilani Campus

Summary
Secondary Storage in Computers Magnetic Disk
Disk Characteristics Data organization on disk Read/Write Mechanism Access time

286
BITS Pilani, Pilani Campus

Thank You!

287
BITS Pilani, Pilani Campus

Computer Organization & Architecture


BITS Pilani
Pilani Campus

Virendra Singh Shekhawat Department of Computer Science and Information Systems

BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Module-5 (Lecture-3)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Redundant Array of Independent Disks (RAID)
BITS Pilani, Pilani Campus

Performance Improvement in Secondary Storage


In general, multiple components improve performance
Similarly, multiple disks should reduce access time?
Arrays of disks operate independently and in parallel

Justification
With multiple disks, separate I/O requests can be handled in parallel
A single I/O request can be executed in parallel if the requested data is distributed across multiple disks

Researchers @ University of California, Berkeley proposed RAID (1988)


290
BITS Pilani, Pilani Campus

RAID
Redundant Array of Independent Disks
Seven levels in common use
Not a hierarchy
Characteristics:
1. Set of physical disks viewed as single logical drive by operating system 2. Data distributed across physical drives 3. Can use redundant capacity to store parity information
291
BITS Pilani, Pilani Campus

Data Mapping in RAID 0

No redundancy
Data striped across all disks
Round-robin striping


292
BITS Pilani, Pilani Campus

RAID 0
Increased Speed
Multiple data requests are probably not on the same disk
Disks seek in parallel
A set of data is likely to be striped across multiple disks

Draw Backs:
Not a "true" RAID because it is NOT fault-tolerant
The failure of just one drive will result in all data in the array being lost
293
BITS Pilani, Pilani Campus

RAID 1

Mirrored disks
Data is striped across disks
2 copies of each stripe on separate disks
Read from either; write to both
294
BITS Pilani, Pilani Campus

RAID 1
Recovery is simple
Swap the faulty disk & re-mirror
No down time

Drawbacks
Highest disk overhead of all RAID types; expensive
Every write must be done on two disks
295
BITS Pilani, Pilani Campus

Data Mapping in RAID 2

Lots of redundancy
Expensive; worthwhile only when disk errors are frequent

296
BITS Pilani, Pilani Campus

RAID 2 (Not in use Now!)


Uses a parallel access technique
Very small strips
An error-correcting code is calculated across corresponding bits on each data disk
Multiple parity disks store the Hamming-code error correction in corresponding positions
Question: How many redundant disks are required?
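One way to answer the slide's question: a single-error-correcting Hamming code over m data disks needs the smallest r such that 2^r >= m + r + 1 check disks. A sketch:

```python
def hamming_check_disks(data_disks):
    # Smallest r with 2**r >= data_disks + r + 1
    # (single-error-correcting Hamming code across corresponding bits)
    r = 1
    while 2 ** r < data_disks + r + 1:
        r += 1
    return r

for m in (4, 8, 16, 32):
    print(m, "data disks ->", hamming_check_disks(m), "check disks")
# 4 -> 3, 8 -> 4, 16 -> 5, 32 -> 6
```

The relative overhead shrinks as the array grows, but it is still far higher than the single parity disk of RAID 3, which is why RAID 2 fell out of use.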
297
BITS Pilani, Pilani Campus

Data Mapping in RAID 3

298
BITS Pilani, Pilani Campus

RAID 3
Similar to RAID 2
Only one redundant disk, no matter how large the array
Simple parity bit for each set of corresponding bits
Data on a failed drive can be reconstructed from the surviving data and the parity information

Question: Can achieve very high transfer rates. How?
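The parity-based reconstruction mentioned above can be illustrated with bytewise XOR. An illustrative sketch (not RAID controller code): because XOR is its own inverse, XOR-ing the surviving blocks with the parity block regenerates the lost block.

```python
import functools, operator

def parity(blocks):
    # bitwise XOR across corresponding bytes of each (equal-length) block
    return bytes(functools.reduce(operator.xor, col) for col in zip(*blocks))

data = [b"\x12\x34", b"\xab\xcd", b"\x0f\xf0"]
p = parity(data)

# Simulate losing disk 1: XOR of the survivors and the parity block
recovered = parity([data[0], data[2], p])
print(recovered == data[1])   # True
```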

299
BITS Pilani, Pilani Campus

RAID 4
Makes use of independent access with block-level striping
Good for a high I/O request rate due to large strips
Bit-by-bit parity calculated across corresponding strips on each disk
Parity stored on a dedicated parity disk
Drawback???

300
BITS Pilani, Pilani Campus

RAID 5
Round-robin allocation of the parity strip
Avoids the RAID 4 bottleneck at the parity disk
Commonly used in network servers
Drawbacks
Disk failure has a medium impact on throughput
Difficult to rebuild in the event of a disk failure (compared to RAID level 1)

301
BITS Pilani, Pilani Campus

RAID 6
Two parity calculations
Stored in separate blocks on different disks
High data availability
Three disks would need to fail for data loss
Significant write penalty

Drawback
Controller overhead to compute parity is very high

302
BITS Pilani, Pilani Campus

Nesting of RAID Levels: RAID(1+0)


RAID 1 (mirror) arrays are built first, then combined to form a RAID 0 (stripe) array
Provides high levels of:
I/O performance
Data redundancy
Disk fault tolerance

303
BITS Pilani, Pilani Campus

Nesting of RAID Levels: RAID(0+1)


RAID 0 (stripe) arrays are built first, then combined to form a RAID 1 (mirror) array
Provides high levels of I/O performance and data redundancy
Slightly less fault tolerance than 1+0

304
BITS Pilani, Pilani Campus

Review Questions
In the context of RAID, what is the difference between parallel access and independent access?
What is the role of strip size in RAID 0 in achieving a high I/O request rate and a high data transfer capacity?

305
BITS Pilani, Pilani Campus

Summary

Reliability and performance improvement for secondary storage
Comparative analysis of RAID levels

306
BITS Pilani, Pilani Campus

Thank You!

307
BITS Pilani, Pilani Campus

Computer Organization & Architecture


BITS Pilani
Pilani Campus

Virendra Singh Shekhawat Department of Computer Science and Information Systems

BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Module-6 (Lecture-1)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: I/O Modules: Function and Structure I/O Methods: Programmed and Interrupt Driven I/O
BITS Pilani, Pilani Campus

Input /Output
The I/O system provides the interface to the outside world
I/O operations are accomplished through a wide variety of external devices
Used to transfer data between the computer and the external world, e.g. keyboard, monitor, disk drive, printer
An external device is attached to the computer by an interface circuit called an I/O module
310
BITS Pilani, Pilani Campus

Why I/O Module?


Wide variety of peripherals
Impractical to incorporate the necessary logic within processor

Delivering different amounts of data


At different speeds, different formats and word length

Speed mismatch between processor/memory and I/O devices


Some I/O devices are slower than processor and memory while some are faster than processor and memory
311
BITS Pilani, Pilani Campus

Generic Model of I/O Module

312
BITS Pilani, Pilani Campus

I/O Module Functions-1


Control & Timing
To coordinate the flow of traffic between internal resources and external devices
Control signals: Read, Write, Ready, Busy, Error, etc.

Processor and Device Communication


It must communicate with the processor and the external device before transferring the data
Processor communication involves
Command decoding, Data, Status reporting, Address recognition
313
BITS Pilani, Pilani Campus

I/O Module Functions-2


Data Buffering
To overcome the speed mismatch between CPU or memory and device

Error Detection
Mechanical, electrical malfunctioning and transmission e.g. paper jam, disk bad sector
314
BITS Pilani, Pilani Campus

General I/O Module Structure


CPU checks the I/O module's device status
I/O module returns status
If ready, CPU requests data transfer
I/O module gets data from the device
I/O module transfers data to the CPU

315
BITS Pilani, Pilani Campus

I/O Methods

Programmed Interrupt driven Direct Memory Access (DMA)

316
BITS Pilani, Pilani Campus

Programmed I/O
CPU has direct control over I/O
Sensing status
Read/write commands
Transferring data

CPU waits for the I/O module to complete the operation
Commands:


Control - telling module what to do Test - check status Read/Write

Wastes CPU time!!! Why?


317
BITS Pilani, Pilani Campus

Addressing I/O Devices


Memory mapped I/O
Devices and memory share an address space
I/O looks just like memory read/write
No special commands for I/O
Large selection of memory access commands available

Isolated I/O
Separate address spaces
Need I/O or memory select lines
Special commands for I/O

318
BITS Pilani, Pilani Campus

Memory Mapped and Isolated I/O

319
BITS Pilani, Pilani Campus

Interrupts
An interrupt suspends what the processor is currently doing. Classes of interrupts:
Program
e.g. overflow, division by zero, segmentation fault

Timer
Generated by internal processor timer Used in pre-emptive multi-tasking

I/O
from I/O controller

Hardware failure
e.g. memory parity error
320
BITS Pilani, Pilani Campus

Interrupt Driven I/O


CPU issues a read command
I/O module gets data from the peripheral whilst the CPU does other work
I/O module interrupts the CPU once the data is ready
When interrupted, the CPU:
Saves context (register values)
Processes the interrupt
Fetches the data & stores it

321
BITS Pilani, Pilani Campus

Program Flow Control

322
BITS Pilani, Pilani Campus

Device Identification-1
How does processor determine which device issued the interrupt?
Multiple Interrupt Lines
Impractical to provide one line to each I/O device as number of devices increase

Software Poll
Processor branches to an ISR
The ISR polls each I/O module to identify which module caused the interrupt
Time-consuming process
323
BITS Pilani, Pilani Campus

Device Identification-2
Hardware Poll
All I/O modules share a common interrupt request line
The interrupt ACK line is daisy chained through the modules
The requesting module places the address of the I/O module on the data bus; this is called a vector
The technique is called vectored interrupt

324
BITS Pilani, Pilani Campus

Multiple Interrupts - Sequential


Processor will ignore further interrupts whilst processing one interrupt
Interrupts remain pending and are checked after the first interrupt has been processed

325
BITS Pilani, Pilani Campus

Multiple Interrupts: Priority


Low-priority interrupts can be interrupted by higher-priority interrupts
When the higher-priority interrupt has been processed, the processor returns to the previous interrupt

326
BITS Pilani, Pilani Campus

Review Questions
When a device interrupt occurs, how does the processor determine which device issued the interrupt?
What is context switching?
How is programmed I/O different from interrupt-driven I/O?
How is software polling different from the daisy chain method of handling multiple interrupts?
327
BITS Pilani, Pilani Campus

Summary
I/O modules: Functions and Structure I/O methods
Programmed I/O
Device addressing methods

Interrupt Driven I/O


Interrupt handling and processing Device Identification

328
BITS Pilani, Pilani Campus

Thank You!

329
BITS Pilani, Pilani Campus

Computer Organization & Architecture


BITS Pilani
Pilani Campus

Virendra Singh Shekhawat Department of Computer Science and Information Systems

BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Module-6 (Lecture-2)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Direct Memory Access (DMA), DMA Configurations
BITS Pilani, Pilani Campus

Drawbacks of Interrupt Driven and Programmed I/O


Require active CPU intervention
The transfer rate is limited by the speed with which the processor can test and service the device
The CPU is tied up in managing the I/O transfer

Using programmed I/O the processor can move data at a higher rate, at the cost of doing nothing else
Interrupt-driven I/O frees up the processor at the expense of the I/O transfer rate
332
BITS Pilani, Pilani Campus

Direct Memory Access (DMA)


A special control unit can be provided to allow transfer of a block of data directly between an external device and main memory, with no continuous intervention by the processor
This control circuit can be part of the I/O device interface and is called a DMA controller
Good for moving large volumes of data

333
BITS Pilani, Pilani Campus

DMA Module and Operation


CPU tells the DMA controller:
Read/Write
Device address
Starting address of the memory block for the data
Amount of data to be transferred

CPU carries on with other work
DMA controller deals with the transfer
DMA controller sends an interrupt when finished
334
BITS Pilani, Pilani Campus

DMA Transfer Method: Cycle Stealing


Memory accesses by the processor and by the DMA controller are interwoven
The DMA controller takes over the bus for a cycle
Transfer of one word of data
Not an interrupt:
the CPU does not switch context

The CPU is suspended just before it accesses the bus

i.e. before an operand or data fetch or a data write

Slows down the CPU, but not as much as the CPU doing the transfer itself
335
BITS Pilani, Pilani Campus

DMA and Interrupt Breakpoints During an Instruction Cycle

336
BITS Pilani, Pilani Campus

DMA Transfer Method: Burst Mode


The DMA controller is given exclusive access to main memory to transfer a block of data without interruption
Most DMA controllers incorporate a data storage buffer
The controller reads a block of data from main memory in burst mode and stores it in its input buffer
The data is then transferred to the desired I/O device
337
BITS Pilani, Pilani Campus

DMA Configurations (1)

Single bus, detached DMA controller
Each transfer uses the bus twice
I/O to DMA, then DMA to memory

CPU is suspended twice!


338
BITS Pilani, Pilani Campus

DMA Configurations (2)

Single bus, integrated DMA controller
Controller may support more than one device
Each transfer uses the bus once
CPU is suspended once!
339
BITS Pilani, Pilani Campus

DMA Configurations (3)

Separate I/O bus
Bus supports all DMA-enabled devices
Each transfer uses the system bus once
CPU is suspended once
340
BITS Pilani, Pilani Campus

Intel 8237A DMA Controller

341
BITS Pilani, Pilani Campus

Exercise-1
A processor and an I/O device are connected to main memory via a shared bus of width one word.
The processor can execute 10^6 instructions/sec. An average instruction requires 5 machine cycles, three of which use the memory bus. A memory read/write requires one machine cycle.
The processor is executing background programs that require 95% of its instruction execution rate but no I/O instructions.
Assume a processor cycle equals one bus cycle, and that the I/O device is to be used to transfer very large blocks of data.
Answer the following:
(a) If programmed I/O is used and each one-word transfer requires the processor to execute two instructions, what is the maximum I/O transfer rate?
(b) Estimate the rate if DMA is used.

342
BITS Pilani, Pilani Campus

Solution
The processor can devote only 5% of its time to I/O. Thus the maximum I/O instruction execution rate is 10^6 x 0.05 = 50,000 instructions per second. At two instructions per word, the I/O transfer rate is therefore 25,000 words/second.
The number of machine cycles per second available for DMA control is 10^6 (0.05 x 5 + 0.95 x 2) = 2.15 x 10^6: all 5 cycles of the idle 5%, plus the 2 non-bus cycles of each background instruction. If we assume that the DMA module can use all of these cycles, and ignore any setup or status-checking time, then this value is the maximum I/O transfer rate.
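The two answers can be reproduced numerically. A sketch following the solution's assumptions:

```python
instr_rate = 10**6          # instructions per second

# (a) Programmed I/O: 5% of instruction slots, 2 instructions per word
pio_words_per_sec = instr_rate * 0.05 / 2
print(pio_words_per_sec)    # 25000.0 words/second

# (b) DMA: the idle 5% contributes all 5 cycles per instruction slot;
# each background instruction leaves 2 of its 5 cycles off the bus
dma_cycles_per_sec = instr_rate * (0.05 * 5 + 0.95 * 2)
print(dma_cycles_per_sec)   # ~2150000, the maximum DMA transfer rate
```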

343
BITS Pilani, Pilani Campus

Exercise-2
A DMA module is transferring characters to memory using cycle stealing, from a device transmitting at 9600 bps. The processor is fetching instructions at the rate of 10^6 instructions per second. By how much will the processor be slowed down due to the DMA activity?
Sol:
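A sketch of the standard calculation, assuming 8 bits per character and one stolen bus cycle per character transferred:

```python
bps = 9600
char_rate = bps / 8            # 1200 characters -> 1200 stolen cycles per second
cpu_rate = 10**6               # instruction fetches per second
slowdown = char_rate / cpu_rate
print(f"{slowdown:.4%}")       # 0.1200% slowdown
```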

344
BITS Pilani, Pilani Campus

Review Questions
In general, DMA access to main memory is given higher priority than CPU access to main memory. Why?
How is cycle stealing for DMA different from the context switching used in interrupt-driven I/O?
Why is DMA better than interrupt-driven I/O for large data transfers?
345
BITS Pilani, Pilani Campus

Summary
Limitations of Interrupt-Driven and Programmed I/O
DMA Structure and Function
DMA Configurations
Exercises

346
BITS Pilani, Pilani Campus

Thank You!

347
BITS Pilani, Pilani Campus

Computer Organization & Architecture


BITS Pilani
Pilani Campus

Virendra Singh Shekhawat Department of Computer Science and Information Systems

BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Module-7 (Lecture-1)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: RISC Vs. CISC Architectures, Instruction Pipeline, Six Stage Instruction Pipeline, Pipeline Performance
BITS Pilani, Pilani Campus

Instruction Execution Characteristics


High Level Language (HLL) programs mostly comprise
Assignment statements (e.g. A = B)
Conditional statements (e.g. IF, LOOP)
Procedure call/return (e.g. functions in C)

Key for the ISA designer
These statements should be supported in an optimal fashion
350
BITS Pilani, Pilani Campus

RISC Architecture
Earlier computers had small and simple instruction sets. Why???
In the early 1980s, designers recommended that computers use fewer instructions with simple constructs so they can be executed much faster without accessing memory as often.

351
BITS Pilani, Pilani Campus

RISC Characteristics
Relatively few instructions
128 or less

Relatively few addressing modes


Memory access is limited to LOAD and STORE instructions

All operations done within the registers of the CPU


Use of overlapped register windows for optimization

Fixed Length (4 Bytes), easily decoded instruction format


Instruction execution time consistent

Hardwired control

352
BITS Pilani, Pilani Campus

RISC Processors
MIPS R4000
First commercially available RISC processor
Supports thirty-two 64-bit registers
128 KB of high-speed cache

SPARC (Sun)
Based on Berkeley RISC model

PowerPC (IBM) ARM processor family Apple iPods


353
BITS Pilani, Pilani Campus

CISC Architecture
Later, both the complexity and the number of instructions increased. Why???
To simplify compilation

The trend into computer hardware complexity was influenced by various factors, such as
To provide support for more customer applications
Adding instructions that facilitate the translation from high level language into machine language programs
A single machine instruction for each high level language statement
Ex: VAX, IBM/370, Intel x86 based processors, Motorola 68000 series
354
BITS Pilani, Pilani Campus

CISC Characteristics
A large number of instructions
Some instructions for special tasks, used infrequently
A large variety of addressing modes (5 to 20)
Variable length instruction formats
Instructions that manipulate operands in memory

However, it soon became apparent that a complex instruction set has a number of disadvantages
These include a complex instruction decoding scheme, an increased size of the control unit, and increased logic delays

355

BITS Pilani, Pilani Campus

RISC and CISC Comparison


Computer performance equation
CPU Time = (instructions/program) x (avg. cycles/instruction) x (seconds/cycle)

Example CISC program:

mov ax, 20
mov bx, 5
mul ax, bx

Example RISC program:

       mov ax, 0
       mov bx, 20
       mov cx, 5
again: add ax, bx
       loop again

356
BITS Pilani, Pilani Campus

RISC vs CISC Performance Summary


The CISC approach attempts to minimize the number of instructions per program by sacrificing the number of cycles per instruction. RISC does the opposite, reducing the cycles per instruction at the cost of the number of instructions per program.
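The performance equation above can be exercised with illustrative numbers. The instruction counts, CPI values and clock rate below are hypothetical, chosen only to show the trade-off the slide describes:

```python
def cpu_time(instructions, cpi, clock_hz):
    # CPU Time = (instructions/program) x (cycles/instruction) x (seconds/cycle)
    return instructions * cpi / clock_hz

# Hypothetical: a CISC multiply in 3 instructions at 8 cycles each versus a
# RISC loop of 13 instructions at 1.5 cycles each, both at 100 MHz
print(cpu_time(3, 8, 100e6))     # 2.4e-07 s
print(cpu_time(13, 1.5, 100e6))  # 1.95e-07 s
```

Either side can win: the product of the three factors, not any single one, decides performance.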
357
BITS Pilani, Pilani Campus

RISC Vs. CISC Controversy


Now the general trend in computer architecture and organization has been toward increasing processor complexity
More instructions
More addressing modes
More registers
No definitive test set of programs exists to compare the two architectures
Performance varies with the programs

Most commercially available machines are a mixture of RISC and CISC characteristics
358
BITS Pilani, Pilani Campus

Instruction Pipelining

Pipelining
What is pipelining?
Six-stage pipeline example
Performance issues

359
BITS Pilani, Pilani Campus

Instruction Pipelining
Similar to the use of an assembly line in a manufacturing plant
New inputs are accepted at one end before previously accepted inputs appear as outputs at the other end
The same concept can be applied to instruction execution

360
BITS Pilani, Pilani Campus

Revisit Instruction Cycle

361
BITS Pilani, Pilani Campus

Two Stage Instruction Pipeline

Execution usually does not access main memory. The next instruction can be fetched during execution of the current instruction. This is called instruction prefetch.
362
BITS Pilani, Pilani Campus

Two Stage Pipeline: Hardware Organization

363
BITS Pilani, Pilani Campus

Two Stage Pipeline Facts


But not doubled!!!
The fetch stage is usually shorter than the execution stage
Prefetch more than one instruction?

Any jump or branch means that prefetched instructions are not the required instructions
These factors reduce the potential effectiveness of the two-stage pipeline
How about going for more stages?
364
BITS Pilani, Pilani Campus

Instruction Execution Stages


Six stages during Instruction execution
Fetch Instruction (FI)
Decode Instruction (DI)
Calculate Operands (i.e. EAs) (CO)
Fetch Operands (FO)
Execute Instruction (EI)
Write Operand (WO)

Overlap these operations!!!


365
BITS Pilani, Pilani Campus

Timing Diagram for Instruction Pipeline

366
BITS Pilani, Pilani Campus

The Effect of a Conditional Branch on Instruction Pipeline

367
BITS Pilani, Pilani Campus

Pipeline Performance
Total time required to execute n instructions for a pipeline with k stages and cycle time t
Tk,n = [k+(n-1)]t

Speedup factor
Sk = nk/[k+(n-1)]

How does the value of k influence the speedup???


The larger the number of pipeline stages (k), the greater the potential for speedup.

However, as a practical matter, the potential gains of additional stages are countered by increases in cost and in delays between stages.
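The two formulas can be evaluated directly. A sketch, with t in arbitrary time units:

```python
def pipeline_time(k, n, t):
    # T(k, n) = [k + (n - 1)] * t : time to run n instructions through k stages
    return (k + n - 1) * t

def speedup(k, n):
    # S(k) = n*k / [k + (n - 1)] : relative to non-pipelined k*t per instruction
    return n * k / (k + n - 1)

print(pipeline_time(6, 100, 10))  # 1050 time units
print(round(speedup(6, 100), 2))  # 5.71; S(k) approaches k as n grows
```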
368
BITS Pilani, Pilani Campus

Speedup Factors: Instruction Pipelining

369
BITS Pilani, Pilani Campus

Review Questions
A program takes 600 ns to execute on a non-pipelined processor. Suppose we need to run 100 programs of the same type on a six-stage pipelined processor with a clock cycle of 10 ns. What is the speedup ratio of the pipeline?
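A sketch of one common reading of the first question, treating each of the 100 programs as an item flowing through the 6-stage pipeline with a 10 ns clock:

```python
non_pipelined = 100 * 600            # 60,000 ns for 100 programs
pipelined = (6 + 100 - 1) * 10       # [k + (n - 1)] * t = 1050 ns
print(non_pipelined / pipelined)     # ~57.14
```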

What is branch penalty?

370
BITS Pilani, Pilani Campus

Summary
RISC and CISC Architectures
Instruction Pipeline
Pipeline Performance Factors

371
BITS Pilani, Pilani Campus

Thank You!

372
BITS Pilani, Pilani Campus

Computer Organization & Architecture


BITS Pilani
Pilani Campus

Virendra Singh Shekhawat Department of Computer Science and Information Systems

BITS Pilani, Pilani Campus

BITS Pilani
Pilani Campus

Module-7 (Lecture-2)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Pipeline Hazards: Data Hazards, Resource Hazards, Control Hazards, RISC Pipeline

Four Stage Pipeline


Fetch: Read the instruction from memory
Decode: Decode the instruction and fetch the source operand(s)
Execute: Perform the ALU operation
Write: Store the result in the destination location
Note:
Each stage in the pipeline is expected to complete its operation in one clock cycle; hence the clock period should be sufficiently long to complete the task being performed in any stage.
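The note above has a simple numerical consequence (the stage delays below are made-up values, not from the slides): the clock period is set by the slowest stage, not by the average.

```python
# The clock period must accommodate the slowest stage (made-up delays, ns).
stage_delays = {"Fetch": 2.0, "Decode": 1.5, "Execute": 3.0, "Write": 1.0}

cycle_time = max(stage_delays.values())   # 3.0 ns: the slowest stage wins
k = len(stage_delays)

def pipelined_time(n, cycle=cycle_time, stages=k):
    """Time to run n instructions through the 4-stage pipeline."""
    return (stages + (n - 1)) * cycle

print(cycle_time, pipelined_time(100))  # → 3.0 309.0
```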

Stalls/Hazards in Pipeline
Any condition that causes the pipeline to stall is called a hazard.
Data Hazards
Any condition in which either the source or the destination operands of an instruction are not available at the expected time in the pipeline.


Data Hazards
Read After Write (RAW) Hazard
I1: R1 = R1 + R2
I2: R3 = R1 + 3


Solution for RAW Dependency


Hardware Solution
Operand Forwarding

Software Solution
Inserting NOP (No-Operation) instructions between I1 and I2
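The software solution can be sketched as follows (a hypothetical helper; the instruction encoding is assumed, not from the slides). In a four-stage pipeline without forwarding, operands are read in Decode and written in Write, so a dependent instruction needs two NOPs after its producer:

```python
# Sketch: insert NOPs to resolve a RAW hazard in a 4-stage (F, D, E, W)
# pipeline without operand forwarding. Operands are read in D and written
# in W, so a dependent instruction needs 2 NOPs after its producer.
NOP = ("nop", None, ())

def insert_nops(program, distance=2):
    """program: list of (name, dest_reg, (src_regs...)). Returns padded list."""
    out = []
    for instr in program:
        _, _, srcs = instr
        # Check how recently any source register was written.
        for back in range(1, distance + 1):
            if len(out) >= back:
                prev_dest = out[-back][1]
                if prev_dest is not None and prev_dest in srcs:
                    out.extend([NOP] * (distance - back + 1))
                    break
        out.append(instr)
    return out

prog = [("add", "R1", ("R1", "R2")),   # I1: R1 = R1 + R2
        ("add", "R3", ("R1",))]        # I2: R3 = R1 + 3  (RAW on R1)
padded = insert_nops(prog)
print([p[0] for p in padded])  # → ['add', 'nop', 'nop', 'add']
```

With operand forwarding the same pair would need no NOPs, which is why the hardware solution is preferred.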


Resource Hazards
One instruction may need to access memory as part of the Write stage while another instruction is being fetched.
If instructions and data are in the same cache unit, there is a conflict. This is called a structural conflict or resource hazard.


Instruction/Control Hazards
If the required instruction is not available, the pipeline stalls. Examples: a cache miss on instruction fetch, a branch instruction. These are called control hazards or instruction hazards.


Example: Instruction Hazard


Cycle →   1   2   3   4   5   6   7   8
I1        F1  D1  E1  W1
I2 (Br)       F2  D2  E2
I3                F3  D3  X
I4                    F4  X
Ik                        Fk  Dk  Ek  Wk

Conditional branch? Unconditional branch?



Handling Unconditional Branch


The branch will always be taken, so compute the branch target address as soon as possible
Needs extra hardware to do it at the Decode stage (e.g., in the four-stage pipeline)

Many processors employ a fetch unit that can fetch instructions before they are needed


Handling Conditional Branches-1


The decision to branch can't be made until that instruction has executed.

Multiple Streams
Have two pipelines
Prefetch each branch into a separate pipeline
Use the appropriate pipeline

Drawbacks
Leads to bus & register contention
Multiple branches lead to further pipelines being needed
Used in IBM 370/168

Handling Conditional Branches-2


Pre-fetch Branch Target
Studies suggest that conditional branches are taken more than 50% of the time, so prefetching the branch target gives better results
Used by IBM 360/91

Loop Buffer
Very fast memory maintained by the fetch stage of the pipeline; keeps the n most recently fetched instructions, in sequence
Check the buffer before fetching from memory
Very good for small loops and jumps
Used in CRAY-1

Review Questions
Consider the following sequence of instructions:
I1: add R1, R0, #20 (R1 ← R0 + 20)
I2: mul R2, R3, #2 (R2 ← R3 * 2)
I3: and R4, R1, R2 (R4 ← R1 and R2)
I4: add R5, R4, R2 (R5 ← R4 + R2)
These instructions are executed on a computer that has the four-stage pipeline (Fetch, Decode, Execute, Write) discussed in this class. Assume that every stage of every instruction requires one cycle, except the Execute stage of the multiply instruction, which requires two cycles. Draw a diagram describing the operation performed by each pipeline stage during each clock cycle. Show the stalls in the pipeline, if any.

Solution

I1: add R1, R0, #20 (R1 ← R0 + 20)
I2: mul R2, R3, #2 (R2 ← R3 * 2)
I3: and R4, R1, R2 (R4 ← R1 and R2)
I4: add R5, R4, R2 (R5 ← R4 + R2)


Prediction Based Algorithms for Handling Conditional Branches


Predict never taken
Assume that jump will not happen. Always fetch next instruction

Predict always taken


Assume that jump will happen. Always fetch target instruction

Predict by opcode
Some instructions are more likely to result in a jump than others. Can get up to 75% success

Taken/Not taken switch


Based on previous history.

1-Bit Branch Predictor


Two Bit Branch Predictor
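The figures for the one-bit and two-bit predictors are not reproduced in this text, but the two-bit scheme can be sketched as a saturating counter (a minimal illustration; the state encoding below is an assumption): two states predict "not taken", two predict "taken", and a single mispredict cannot flip a strong prediction, unlike the 1-bit version which flips on every mispredict.

```python
# Sketch of a 2-bit saturating-counter branch predictor: states 0-1 predict
# "not taken", states 2-3 predict "taken"; the counter moves one step per
# actual outcome, so one anomalous branch cannot flip a strong prediction.
class TwoBitPredictor:
    def __init__(self, state=0):
        self.state = state  # 0 = strong NT, 1 = weak NT, 2 = weak T, 3 = strong T

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True, True, True, False, True, True]  # loop-like branch history
hits = 0
for actual in outcomes:
    hits += (p.predict() == actual)
    p.update(actual)
print(hits)  # → 3: the lone not-taken outcome costs one miss, not two
```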


Delayed Branch
Do not take jump until you have to Rearrange instructions Example
Original:
LOOP  Shift_left R1
      Decrement  R2
      Brp        LOOP
NEXT  Add        R1,R3

Reordered (branch delay slot filled):
LOOP  Decrement  R2
      Brp        LOOP
      Shift_left R1
NEXT  Add        R1,R3

The location following a branch instruction is called the branch delay slot. The objective is to utilize the slot by placing a useful instruction in it.

RISC Pipelining
Most instructions are register-to-register, with two pipeline stages:
I: Instruction fetch
E: Execute (ALU operation with register input and output)

Load and store need three stages:
I: Instruction fetch
E: Execute (calculate memory address)
D: Memory (register-to-memory or memory-to-register operation)

RISC Pipeline Example


Four Stage RISC Pipelining


The E stage usually involves an ALU operation and may take longer, so it can be divided into two stages:
E1: Register file read
E2: ALU operation and register write


Summary
Pipeline Hazards
Data Hazards
Resource Hazards
Instruction Hazards
Branch Prediction Algorithms
RISC Pipeline as an Example


Thank You!



Module-7 (Lecture-3)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Instruction Level and Machine Level Parallelism, Superscalar Pipeline and Superscalar Processor Design Elements, Hazards and Solutions

Introduction
A scalar pipeline improves performance
By overlapping the various instruction stages
But it is unable to fully exploit Instruction Level Parallelism (ILP)!

The term superscalar was coined (1987) for designs that exploit the ILP of a program


What is Superscalar?
Multiple independent instruction pipelines are used to execute instructions independently and concurrently
Common instructions (arithmetic, load/store, conditional branch) are initiated simultaneously and executed independently
Improves the performance of the execution of instructions that operate on scalar quantities
Equally applicable to RISC & CISC, but in practice usually RISC

Super Scalar vs. Super Pipeline


Super Pipeline
Many pipeline stages need less than half a clock cycle
Doubling the internal clock speed gets two tasks done per external clock cycle


Limitations
The superscalar approach depends on the ability to execute multiple instructions in parallel
How to maximize the ILP
Compiler-based optimization
Hardware techniques

Limited by the following dependencies or hazards
Instruction / procedural hazards
Resource hazards
Data hazards (RAW, WAR, WAW)

Procedural/Instruction Hazards
Cannot execute instructions after a branch in parallel with instructions before the branch
Also, if the instruction length is not fixed, instructions have to be (partially) decoded to find out how many fetches are needed
This prevents simultaneous fetches

So superscalar techniques are most applicable to a RISC or RISC-like architecture



Resource Conflict
Two or more instructions requiring access to the same resource at the same time
e.g. two arithmetic instructions

Solution
Duplicate resources, e.g. have two arithmetic units


Data Hazards/Conflicts
Example Program
I1: R3 = R3 + R5
I2: R4 = R3 + 1
I3: R3 = R5 + 1
I4: R7 = R3 + R4

Identify the hazards among the above instructions:

I1 and I2?
I3 and I4?
I2 and I3?
I1 and I3?
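The pairwise checks can be mechanized (a sketch with an assumed (dest, sources) encoding of the four instructions above):

```python
# Sketch: classify data hazards between an earlier instruction i and a
# later instruction j, each given as (dest, set_of_sources).
def hazards(i, j):
    """Return the set of hazard types from i (earlier) to j (later)."""
    di, si = i
    dj, sj = j
    found = set()
    if di in sj:
        found.add("RAW")   # j reads what i writes (true dependency)
    if dj in si:
        found.add("WAR")   # j writes what i reads (anti-dependency)
    if di == dj:
        found.add("WAW")   # both write the same register (output dependency)
    return found

I1 = ("R3", {"R3", "R5"})   # R3 = R3 + R5
I2 = ("R4", {"R3"})         # R4 = R3 + 1
I3 = ("R3", {"R5"})         # R3 = R5 + 1
I4 = ("R7", {"R3", "R4"})   # R7 = R3 + R4

print(sorted(hazards(I1, I2)))  # → ['RAW']
print(sorted(hazards(I3, I4)))  # → ['RAW']
print(sorted(hazards(I2, I3)))  # → ['WAR']
print(sorted(hazards(I1, I3)))  # → ['WAR', 'WAW']
```

Note that I1 and I3 conflict in two ways: I3 overwrites R3 (WAW) and does so before I1's read would be safe out of order (WAR).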

Effect of Hazards/ Dependencies


Superscalar Processor Design Issues


Instruction-Level Parallelism
If instructions in a sequence are independent, their execution can be overlapped
The degree of instruction-level parallelism is governed by data and procedural dependencies

Machine Parallelism
The ability to take advantage of instruction-level parallelism
Governed by the number of parallel pipelines

Note
A program may not have enough instruction-level parallelism to take full advantage of machine parallelism

Instruction Issue Policy


Instruction Issue
Refers to the process of initiating instruction execution in the processor's functional units
Occurs when an instruction moves from the decode stage to the first execute stage

Instruction Issue Policy
Refers to the protocol used to issue instructions
The processor looks ahead to locate instructions that can be brought into the pipeline and executed

Ordering issues
The order in which instructions are fetched, executed, and change the contents of registers and memory

Issue Policies
In-order issue with in-order completion
In-order issue with out-of-order completion
Out-of-order issue with out-of-order completion

Example: Issue Policies-1


Assumptions
A superscalar pipeline capable of fetching and decoding two instructions at a time
The next two instructions must wait until the pair of decode pipeline stages has cleared
Three separate functional units (two integer arithmetic and one floating-point arithmetic)
Two instances of the write-back pipeline stage


Example: Issue Policies-2


Constraints
I1 requires two cycles to execute
I3 and I4 conflict for the same functional unit
I5 depends on the value produced by I4
I5 and I6 conflict for a functional unit


In-Order Issue In-Order Completion


In-Order Issue Out-of-Order Completion


It improves the performance of instructions that require multiple cycles
E.g., I2 is allowed to run to completion prior to I1; as a result, I3 completes earlier, saving one cycle (better than the previous policy)


Out-of-Order Issue Out-of-Order Completion


Decouples the decode stage from the execute stage by introducing a buffer called the instruction window
Since instructions have already been decoded, the processor can look ahead


Register Renaming
When out-of-order techniques are used
WAW and WAR hazards occur because register contents may not reflect the correct ordering of the program
WAW and WAR dependencies are known as storage conflicts

Compilers use optimized register allocation, which leads to more storage conflicts
Register Renaming is used to deal with this
Similar to resource duplication

Register Renaming
Original:
R3 := R3 + R5  (I1)
R4 := R3 + 1   (I2)
R3 := R5 + 1   (I3)
R7 := R3 + R4  (I4)

After renaming:
R3b := R3a + R5a  (I1)
R4b := R3b + 1    (I2)
R3c := R5a + 1    (I3)
R7a := R3c + R4b  (I4)

The same original register reference in several different instructions may refer to different actual registers, if different values are needed

Superscalar Execution


Machine Parallelism
To enhance the performance of superscalar processors:
Duplication of resources
Out-of-order instruction issue
Register renaming


Review Questions
How is instruction-level parallelism different from machine parallelism?
Identify the various data hazards among the following instructions:
I1: R1 = 20
I2: R1 = R3 + R4
I3: R2 = R4 - 10
I4: R4 = R1 + R2
I5: R2 = R1 + 40

Summary
Instruction-level Parallelism vs. Machine-level Parallelism
Superscalar Pipelines
Performance Limitation Factors
Superscalar Processor Design Issues
Register Renaming
Eliminates WAR and WAW hazards


Thank You!

