BITS Pilani
Pilani Campus
Module-1 (Lecture-1)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Computer Organization and Architecture, Functional and Structural View of Computer System, Brief History of Computers, Evolution of Intel x86 Architecture
Organization Question?
Is multiplication implemented by separate hardware or is it done by repeated addition?
BITS Pilani, Pilani Campus
Function is the operation of individual components as part of the structure
Structure is the way in which components are related to each other
Computer Operations
Peripherals
Main Memory
Systems Interconnection
Registers
Memory
Control Unit
Computer Generations
Vacuum tube (1946-1957) and transistor (1958-1964)
Integrated circuits
Small scale integration - 1965 on
Up to 100 devices on a chip
x86 Evolution-1
1971 - 4004
First 4-bit microprocessor
All CPU components on a single chip
Followed in 1972 by the 8008 (8-bit processor)
Both designed for specific applications
8080
First general-purpose microprocessor
Processes and moves 8 bits of data at a time
Used in the first personal computer, the Altair
x86 Evolution-2
8086
Much more powerful (16-bit data)
Instruction cache, prefetches a few instructions
8088 (8-bit external bus) used in the first IBM PC
80286
16 MB of addressable memory
Up from 1 MB (in the 8086)
80386
32 bit processor with multitasking support
x86 Evolution-3
80486
Sophisticated, powerful cache and instruction pipelining
Built-in maths co-processor
Pentium
Superscalar
Multiple instructions executed in parallel
Pentium Pro
Increased superscalar organization
Aggressive register renaming
Branch prediction and data flow analysis
x86 Evolution-4
Pentium II
MMX technology, graphics, video & audio processing
Pentium III
Additional floating point instructions for 3D graphics
Pentium 4
Further floating point and multimedia enhancements
Itanium Series
64 bit with Hardware enhancements to increase speed
What's next?
Multi core architectures
Summary
Architecture vs. Organization
Functional and Structural View of a Computer
History of Computers
Intel Architecture Evolution
Review Questions
Differentiate between computer organization and architecture.
What are the four main functions of a computer?
What are the basic structural components of a computer?
Describe the computer generations in brief.
What was the first general-purpose microprocessor?
Thank You!
Module-1 (Lecture-2)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Concept of Computer Program and Instruction, Internal Structure of CPU, Instruction Execution Cycle With and Without Interrupt
To do these tasks, the processor needs:
Temporary storage
An interconnection structure between the various components (i.e. registers, ALU, memory, I/O)
ALU
Instruction Cycle
Two steps:
Fetch Execute
Fetch Cycle
Program Counter (PC) holds the address of the next instruction to fetch
Processor fetches the instruction from the memory location pointed to by the PC
Increment PC
Unless told otherwise
Instruction is loaded into the Instruction Register (IR)
Processor interprets the instruction and performs the required actions
Execute Cycle
Data transfer between CPU and main memory
Data transfer between CPU and I/O module
Some arithmetic or logical operation on data
Alteration of the sequence of operations
e.g. jump
Combination of the above
Interrupts
Mechanism by which other modules (e.g. I/O) may interrupt the normal sequence of processing
Program
Timer
Generated by the internal processor timer
Used in pre-emptive multitasking
I/O
From I/O controller
Hardware failure
e.g. Memory parity error
Interrupt Cycle
Summary
Concept of Computer Program and Instruction
Internal Structure of CPU
Relationship between CPU register sizes and main memory
Instruction Execution Cycle
Interrupt and Interrupt Cycle
Instruction Execution Cycle with Interrupt
Review Questions
Which of the instruction cycle state(s) require main memory access?
What is context switching?
Assume the main memory address space is 2^16 and each location contains 1 byte of data. Find the minimum possible size required for registers such as PC, MAR, MBR, and IR.
Repeat the above question for an address space of 2^32 and an addressability of 16 bits. Also calculate the total size of main memory in GBytes.
Thank You!
Module-1 (Lecture-3)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings, Computer Organization and Design , 4th Ed. By Patterson] Topics: Computer Performance Assessment, CPI, MIPS Rate, Benchmark Programs
Computer Performance
Performance is one of the key parameters used to
Evaluate processor hardware
Measure requirements for new systems
When we say one computer has better performance than another, what does it mean?
Criterion for the performance? Response Time (Single computer), Throughput (Data center)
Clock Rate
Operations performed by the processor are governed by the system clock (the fundamental level of processor speed measurement)
Generated by a quartz crystal
A clock cycle is the basic unit of time to execute one operation.
Clock rate (clock cycles per second, in MHz or GHz) is the inverse of the clock cycle time (clock period)
CPU Performance
To maximize performance, we need to minimize execution time
performance = 1 / execution_time
If X is n times faster than Y, then
performance_X / performance_Y = execution_time_Y / execution_time_X = n
Hence CPI is not a constant value for a processor
An average CPI needs to be calculated for the processor
MIPS Rate
Common measure of performance
Millions of Instructions Per Second (MIPS) rate = Ic / (T x 10^6)
This can be written as: f / (CPI x 10^6)
Example
Processor speed = 400 MHz
Four types of instructions with CPI 1, 2, 4, 8 respectively
Instruction mix of 60%, 18%, 12%, 10% respectively
What is the MIPS rate?
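The example can be worked through as a short calculation. This is a minimal sketch: the function names average_cpi and mips_rate are illustrative and not from the slides, and the average CPI is taken as the mix-weighted sum of the per-type CPIs.

```python
def average_cpi(cpis, mix):
    """Weighted average CPI: sum of CPI_i times the fraction of type-i instructions."""
    return sum(c * m for c, m in zip(cpis, mix))


def mips_rate(clock_hz, cpi):
    """MIPS rate = f / (CPI x 10^6)."""
    return clock_hz / (cpi * 1e6)


cpi = average_cpi([1, 2, 4, 8], [0.60, 0.18, 0.12, 0.10])
rate = mips_rate(400e6, cpi)
print(f"average CPI = {cpi:.2f}, MIPS rate = {rate:.1f}")
```

With the numbers from the slide the average CPI comes to 2.24, so the MIPS rate is 400 / 2.24, roughly 178.6.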
Benchmark Programs
A benchmark is a collection of programs that provides a representative test of a computer in a particular application area
e.g. SPEC (Standard Performance Evaluation Corporation) benchmark suites
SPEC CPU2006 is used for measuring the performance of computation-intensive applications
Summary
Computer Performance Assessment
Performance Factors
Execution Time = No. of instructions in program x Clock cycles per instruction x Clock cycle time
MIPS Rate
Benchmark Programs
Review Questions
Consider two implementations of the same ISA. Computer A has a clock cycle time of 250 ns and a CPI of 2 for a program. Computer B has a clock cycle time of 500 ns and a CPI of 1.2 for the same program. Which computer is faster for this program, and by how much?
A program runs on computer A with a 2 GHz clock in 10 seconds. Another computer B with a 4 GHz clock runs this program in 6 seconds. To accomplish this, computer B will require P times as many clock cycles as computer A to run the program. Find the value of P.
Thank You!
Module-1 (Lecture-4)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Computer System Modules and Interconnections; Concept of BUS: Types, Arbitration, Timing; PCI BUS Example
Interconnection
All the units must be connected
Different types of connection for different types of unit
Memory Connections
Receives and sends data
Receives addresses (of locations)
Receives control signals
CPU Connections
Reads instructions and data
Writes out data (after processing)
Sends control signals to other units
Receives (and acts on) interrupts
What is a Bus?
A communication pathway connecting two or more devices
What do buses look like?
Parallel lines on circuit boards
Ribbon cables
Strip connectors on motherboards
Sets of wires
Control Bus
Control and timing information
Memory read/write signal
Interrupt request
Clock signals
Multiplexed
Shared lines
Address valid or data valid control line
Advantage: fewer lines
Disadvantages
More complex control
Reduced ultimate performance
Bus Arbitration
More than one module controlling the bus
e.g. CPU and DMA controller
Only one module may control the bus at a time
Arbitration may be centralized or distributed
Centralized
Single hardware device controlling bus access
Distributed
Each module may claim the bus
Control logic on all modules
Asynchronous
The occurrence of one event on the bus follows and depends on the occurrence of a previous event
Events on the bus are not synchronized with the clock
a) Transaction begins by asserting FRAME
b) Target device will recognize its address on AD
c) Master ceases driving the AD bus. Initiator asserts IRDY to indicate it is ready for data
d) Target asserts DEVSEL to indicate it has recognized its address
e) Master reads the data at the beginning of the 4th cycle and changes the byte enable lines as needed
f) Target deasserts TRDY (needs time to transfer the next block of data)
g) Target places the 3rd data item at the 6th cycle but the master is not ready (deasserts IRDY)
h) Master deasserts FRAME as this is the last data transfer
i) Master deasserts IRDY (bus is in idle state); target deasserts TRDY and DEVSEL
Exercise-1
Consider a 64-bit microprocessor having 64-bit instructions composed of two fields. The first two bytes contain the opcode and the remainder contains the immediate operand or an operand address.
What is the impact on the system speed if the microprocessor bus has
A) A 64-bit local address bus and a 32-bit local data bus
B) A 32-bit local address bus and a 32-bit local data bus
C) What is the maximum directly addressable memory?
Exercise-2
Consider a 32-bit microprocessor with a 16-bit external data bus, driven by an 8 MHz input clock. The microprocessor has a bus cycle whose minimum duration equals four input clock cycles.
What is the maximum data transfer rate across the bus in bytes/sec?
To increase the performance, either:
Make its external data bus 32 bits
Double the external clock frequency supplied to the processor
Which is the better option? Explain.
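The peak transfer rate in this exercise follows from the bus cycle time and the bus width. The sketch below is only an illustration under the assumption that one bus cycle moves one full bus width of data; the function name max_transfer_rate is hypothetical.

```python
def max_transfer_rate(clock_hz, clocks_per_bus_cycle, bus_width_bits):
    """Peak bytes per second, assuming one bus-width transfer per bus cycle."""
    bus_cycles_per_second = clock_hz / clocks_per_bus_cycle
    return bus_cycles_per_second * bus_width_bits / 8


base = max_transfer_rate(8e6, 4, 16)     # original configuration
wider = max_transfer_rate(8e6, 4, 32)    # option 1: 32-bit data bus
faster = max_transfer_rate(16e6, 4, 16)  # option 2: doubled clock
print(base, wider, faster)
```

Under this assumption both options double the peak rate, so the comparison asked for in the exercise has to rest on other factors such as cost and the changes each option requires elsewhere in the system.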
Summary
Bus as Interconnection Structure
Bus Design Issues
Synchronous and Asynchronous Bus Operations
PCI Bus Operation
Thank You!
Computer Arithmetic
S Mohan
Computer Arithmetic
Number Representations
Integer or Fixed Point Representation
Floating Point Representation
Arithmetic
Integer Arithmetic
Floating Point Arithmetic
Number Representation-1
Unsigned Representation
Only have 0 and 1 to represent everything
Positive numbers stored in binary
e.g. 41 = 00101001
Number Representation-2
Sign-Magnitude
Leftmost bit is the sign bit
0 means positive
1 means negative
+18 = 00010010
-18 = 10010010
Problems
Arithmetic does not work as we want!
One pattern is wasted (zero has two representations)!
Number Representation-3
Two's Complement
Most significant bit is treated as a sign bit
0 for positive and 1 for negative
Positive numbers are represented the same as in sign-magnitude representation
Negative numbers are represented in 2's complement form
By using n bits the range of numbers is
-2^(n-1) to 2^(n-1) - 1
Arithmetic works as we want!
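A small sketch of the two's complement representation and its range, for illustration only. The helper names to_twos_complement and from_twos_complement are hypothetical; the +18/-18 values reuse the sign-magnitude example so the two encodings can be compared.

```python
def to_twos_complement(value, n_bits):
    """Encode a signed integer as an n-bit two's complement bit string."""
    lowest = -(1 << (n_bits - 1))
    highest = (1 << (n_bits - 1)) - 1
    if not lowest <= value <= highest:
        raise ValueError(f"{value} does not fit in {n_bits} bits")
    # Masking a negative Python int yields its two's complement pattern.
    return format(value & ((1 << n_bits) - 1), f"0{n_bits}b")


def from_twos_complement(bits):
    """Decode an MSB-first two's complement bit string."""
    raw = int(bits, 2)
    # A leading 1 means the pattern represents raw - 2^n.
    return raw - (1 << len(bits)) if bits[0] == "1" else raw


print(to_twos_complement(18, 8), to_twos_complement(-18, 8))
print(from_twos_complement("10000000"), from_twos_complement("01111111"))
```

Adding the patterns for +18 and -18 as plain unsigned numbers and discarding the carry out gives 00000000, which is the sense in which the arithmetic "works".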
Multiplication
Complex!
Work out the partial product for each digit
Take care with place value (column)
Add the partial products
Think about hardware implementation
What changes are required in the above manual approach for computerization?
Running addition on the partial products
Few registers should be used
For each 1 in the multiplier, an add and a shift-right operation are required
For each 0, only a shift operation is required
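The add-and-shift-right scheme described above can be sketched in software. This is a minimal illustration for unsigned operands; the register names C (carry), A (accumulator), Q (multiplier) and M (multiplicand) follow the common textbook convention and are an assumption here, not taken from the slide.

```python
def shift_add_multiply(multiplicand, multiplier, n_bits):
    """Unsigned multiply using only add and shift-right on C, A, Q registers."""
    mask = (1 << n_bits) - 1
    m = multiplicand & mask
    q = multiplier & mask
    a = 0  # accumulator for the running sum of partial products
    c = 0  # carry out of the n-bit addition
    for _ in range(n_bits):
        if q & 1:  # multiplier bit is 1: add, then shift
            total = a + m
            c = total >> n_bits
            a = total & mask
        # Shift C, A, Q right by one bit as a single combined register.
        q = (q >> 1) | ((a & 1) << (n_bits - 1))
        a = (a >> 1) | (c << (n_bits - 1))
        c = 0
    return (a << n_bits) | q  # 2n-bit product held in A and Q


print(format(shift_add_multiply(0b1011, 0b1101, 4), "08b"))
```

For 1011 x 1101 (11 x 13) the A and Q registers end up holding 1000 1111, which is 143.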
Example
Solution 2
Booth's Algorithm
Multiplicand M unchanged
Based upon recoding the multiplier Q to a recoded value R
Each digit can assume negative as well as positive and zero values
Known as Signed Digit (SD) encoding
Scanning the multiplier from the least significant bit: when a 1 is encountered, insert -1 at that position and complement all the succeeding 1s until a 0 is encountered
Replace that 0 with 1 and continue
While multiplying with -1, the 2's complement of the multiplicand is taken
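The recoding can be illustrated with a few lines of code. This sketch uses the equivalent per-bit formulation r_i = q_(i-1) - q_i with an implicit q_(-1) = 0, which is an assumption about how the slide's scanning rule is usually stated; the function names are hypothetical.

```python
def booth_recode(multiplier_bits):
    """Recode an MSB-first bit string into MSB-first signed digits in {-1, 0, 1}."""
    digits = []
    previous = 0  # implicit bit to the right of the LSB (Q-1)
    for bit in reversed(multiplier_bits):  # scan from the LSB
        current = int(bit)
        digits.append(previous - current)
        previous = current
    return digits[::-1]


def signed_digit_value(digits):
    """Value of an MSB-first signed-digit number."""
    value = 0
    for digit in digits:
        value = value * 2 + digit
    return value


print(booth_recode("0111"), signed_digit_value(booth_recode("0111")))
print(booth_recode("1001"), signed_digit_value(booth_recode("1001")))
```

The multiplier 0111 recodes to 1 0 0 -1 (8 - 1 = 7), so three additions are replaced by one addition and one subtraction. The recoded digits of 1001 evaluate to -7, its value as a 4-bit two's complement number.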
More complex than multiplication
Negative numbers are really bad!
Based on long division
                 00001101   Quotient
Divisor   1011 ) 10010011   Dividend
                  1011
                 001110     Partial remainders
                   1011
                   001111
                     1011
                      100   Remainder
Algorithm
Bits of the dividend are examined from left to right until the set of bits examined represents a number greater than or equal to the divisor
Until this event occurs, 0s are placed in the quotient
When the event occurs, a 1 is placed in the quotient and the divisor is subtracted from the partial dividend
This continues in a cyclic pattern
The process stops when all the bits of the dividend are exhausted
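The algorithm above translates almost line for line into code. This is a minimal sketch for unsigned operands; the function name long_divide is illustrative, and the example values (10010011 divided by 1011) are the ones used in the long-division figure.

```python
def long_divide(dividend_bits, divisor):
    """Binary long division on an MSB-first bit string; returns (quotient_bits, remainder)."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    quotient_bits = []
    partial = 0  # partial dividend examined so far
    for bit in dividend_bits:
        partial = (partial << 1) | int(bit)  # bring down the next bit
        if partial >= divisor:
            quotient_bits.append("1")
            partial -= divisor  # leaves the partial remainder
        else:
            quotient_bits.append("0")
    return "".join(quotient_bits), partial


quotient, remainder = long_divide("10010011", 0b1011)
print(quotient, format(remainder, "b"))
```

For 147 divided by 11 this gives quotient 00001101 (13) and remainder 100 (4).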
Floating Point
We need a way to represent
numbers with fractions, e.g., 3.1416
Representation:
scientific: sign, exponent, significand form:
binary point
more bits for significand gives more accuracy
more bits for exponent increases range
if 1 <= significand < 10_two (= 2_ten) then the number is normalized, except for the number 0, which is normalized to significand 0
E.g., 101.001101 x 2^111001 = 1.01001101 x 2^111011 (normalized)
decimal: -0.75 = -3/4 = -3/2^2
binary: -11/100 = -.11 = -1.1 x 2^-1
IEEE single precision floating point exponent = bias + exponent value = 127 + (-1) = 126_ten = 01111110_two
IEEE single precision: 1 01111110 10000000000000000000000
sign exponent significand
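The encoding can be checked against the machine's own single-precision format. This is a small sketch using Python's standard struct module; the helper name float_to_fields is hypothetical, and -0.75 is used because the bit pattern on the slide has its sign bit set.

```python
import struct


def float_to_fields(x):
    """Return (sign, exponent, fraction) bit strings of x as an IEEE 754 single."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret the 4 bytes as an integer
    bits = format(raw, "032b")
    return bits[0], bits[1:9], bits[9:]


sign, exponent, fraction = float_to_fields(-0.75)
print(sign, exponent, fraction)
print(int(exponent, 2) - 127)  # unbiased exponent
```

The fraction field holds only the bits after the binary point of 1.1, since the leading 1 is implicit.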
Floating-Point Example
What number is represented by the single-precision float 11000000101000...00
S = 1
Fraction = 01000...00_two
Exponent = 10000001_two = 129
x = (-1)^1 x (1 + .01_two) x 2^(129 - 127) = -1.25 x 4 = -5.0
Single precision          Double precision          Object represented
Exponent   Fraction       Exponent   Fraction
1-254      Anything       1-2046     Anything       Floating-point number
255        0              2047       0              Infinity
255        Non-zero       2047       Non-zero       NaN (Not a Number)
Single-Precision Range
Exponents 00000000 and 11111111 reserved
Smallest value
Exponent: 00000001, so actual exponent = 1 - 127 = -126
Fraction: 000...00, so significand = 1.0
±1.0 x 2^-126 ≈ ±1.2 x 10^-38
Largest value
Exponent: 11111110, so actual exponent = 254 - 127 = +127
Fraction: 111...11, so significand ≈ 2.0
±2.0 x 2^+127 ≈ ±3.4 x 10^+38
Double-Precision Range
Exponents 0000...00 and 1111...11 reserved
Smallest value
Exponent: 00000000001, so actual exponent = 1 - 1023 = -1022
Fraction: 000...00, so significand = 1.0
±1.0 x 2^-1022 ≈ ±2.2 x 10^-308
Largest value
Exponent: 11111111110, so actual exponent = 2046 - 1023 = +1023
Fraction: 111...11, so significand ≈ 2.0
±2.0 x 2^+1023 ≈ ±1.8 x 10^+308
Add the mantissas
Choose the largest exponent
Put the result in normalized form
Shift mantissa left or right until in form 1.M
Adjust exponent accordingly
Handle overflow or underflow if necessary
Round
Renormalize if necessary if rounding produced an unnormalized result
[Figure: floating-point addition example, repeated across slides; fields shown are S = 0, E = 00000011, M = 010000111, a second M field 000000100, and exponent difference = 2]
Hardware design
determine smaller exponent
add mantissas
round result
renormalize if necessary
FP Adder Hardware
Much more complex than integer adder
Doing it in one clock cycle would take too long
Much longer than integer operations
Slower clock would penalize all instructions
Handle overflow or underflow if necessary
Round
Renormalize if necessary if rounding produced an unnormalized result
Set S=0 if signs of both operands are the same, S=1 otherwise
[Figure: floating-point multiplication example, repeated across slides. Operands: -1.5 x 2^(7-127) and 1.5 x 2^(224-127), the latter shown as S = 0, E = 11100000, M = 100000000. Intermediate exponent E = 01101000; after normalization E = 01101001, M = 001000, S = 1. Result: -1.125 x 2^(105-127)]
FP Arithmetic Hardware
FP multiplier is of similar complexity to FP adder
But uses a multiplier for significands instead of an adder
In addition to overflow we can have underflow (number too small)
Accuracy is the problem with both overflow and underflow because we have only a finite number of bits to represent numbers that may actually require arbitrarily many bits
limited precision -> rounding -> rounding error
IEEE 754 keeps two extra bits, guard and round
four rounding modes
positive divided by zero yields infinity
zero divided by zero yields not a number
other complexities
Rounding
FP arithmetic operations may produce a result with more digits than can be represented in 1.M
The result must be rounded to fit into the available number of M positions
Tradeoff of hardware cost (keeping extra bits) and speed versus accumulated rounding error
Rounding
Guard, Round bits for intermediate addition
2.56 x 10^0 + 2.34 x 10^2 = 0.0256 x 10^2 + 2.34 x 10^2 = 2.3656 x 10^2
5: guard bit, 6: round bit
00~49: round down, 51~99: round up, 50: tie-break
Result: 2.37 x 10^2
Without guard and round bit
0.02 x 10^2 + 2.34 x 10^2 = 2.36 x 10^2
Rounding
In binary, an extra bit of 1 is halfway in between the two possible representations
1.001 (1.125) is halfway between 1.00 (1) and 1.01 (1.25)
1.101 (1.625) is halfway between 1.10 (1.5) and 1.11 (1.75)
Round-to-nearest-even
Rounds to the even value (the one with an LSB of 0)
1.00100 -> 1.00
1.01100 -> 1.10
Produces zero average bias
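The tie-breaking rule can be made concrete with exact rational arithmetic. This is an illustrative sketch, not a description of the hardware; the function name round_nearest_even and the choice of Python's fractions module are assumptions made for the example.

```python
from fractions import Fraction


def round_nearest_even(value, frac_bits):
    """Round value to frac_bits binary fraction bits, breaking ties toward an even LSB."""
    scaled = Fraction(value) * (1 << frac_bits)
    lower = scaled.numerator // scaled.denominator  # floor of the scaled value
    remainder = scaled - lower
    half = Fraction(1, 2)
    # Round up when past the halfway point, or exactly halfway with an odd LSB.
    if remainder > half or (remainder == half and lower % 2 == 1):
        lower += 1
    return Fraction(lower, 1 << frac_bits)


print(round_nearest_even(0b100100 / 32, 2))  # 1.00100 -> 1.00
print(round_nearest_even(0b101100 / 32, 2))  # 1.01100 -> 1.10
```

Both inputs are exact ties: the first rounds down to 1.00 and the second rounds up to 1.10, so neither direction is favoured on average.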
8086 Memory
Memory is byte-addressable.
The original 8086 had a 20-bit address bus that could address just 1 MB of main memory; this is called Real Addressing Mode.
Newer CPUs can access 64 GB of main memory, using 36-bit addresses.
A word in the 8086 world is 16 bits.
A 32-bit quantity is called a double word.
Microprocessor Registers
Visible registers
Addressable during application programming
Invisible registers
Addressable in system programming
8086, 8088, 80286 contain a 16-bit internal architecture
80386 through Pentium 4 contain a 32-bit internal architecture
Multipurpose Registers
AX- Accumulator
Addressable as AX, AH, or AL
Mainly used for multiplication and division
CX- Count
Holds count for instructions
DX- Data
Holds a part of the result
Multipurpose Registers(2)
BP- Base Pointer
Points to a memory location for memory data transfers.
FLAGS
For controlling the microprocessor operation
Not modified by data transfer or program control operations
Segment Registers[1]
CS-Code Segment
Contains programs and procedures
Defines the starting address of the section of memory holding code
In real mode, it defines the start of a 64 KB section of memory
First Semester 2010-2011
Segment Registers[2]
ES- Extra Segment
An additional data segment used by string instructions
SS- Stack
Memory used for the stack
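How a segment register and a 16-bit offset combine into a 20-bit real-mode address can be sketched as follows. The shift-by-four rule is the standard 8086 real-mode calculation and is not spelled out on the slides; the function name physical_address is hypothetical.

```python
def physical_address(segment, offset):
    """8086 real mode: 20-bit address = segment x 16 + offset, wrapping at 1 MB."""
    return ((segment << 4) + offset) & 0xFFFFF


start = physical_address(0x1000, 0x0000)  # first byte of the segment
end = physical_address(0x1000, 0xFFFF)    # last byte reachable with a 16-bit offset
print(hex(start), hex(end), end - start + 1)
```

A 16-bit offset spans 65536 bytes, which is why each segment register marks the start of a 64 KB section of the 1 MB address space.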
Elements of an Instruction
Operation code (Op code)
Do this!
OPERAND(s) may be
Part of the instruction
Reside in registers of the processor
Reside in a memory location
Addressing Modes
An instruction must contain information about:
How to get the operands? This is called the ADDRESSING MODE
Essentially it tells where the operands are available and how to get them
Question
Why do we need various addressing modes or various ways to represent operands?
Register Addressing
[Diagram: the instruction holds an opcode and a register address R; R selects the register that contains the operand]
Example:
MOV AL,BL
ADD AX,BX
Immediate Addressing
Operand is part of the instruction
No memory reference to fetch the operand
Ex:
MOV AX,23h
ADD CL,44h
MOV AL,'A'
MOV BL,11001100B
Direct Addressing
[Diagram: the instruction holds an opcode and an address A; A selects the memory location that contains the operand]
Examples:
MOV CX,DATA
ADD CL,TEMP
MOV AL,[1234h]
Single memory reference to access data, so no extra calculations are required to get the effective address
[Diagram: a register holds a pointer to the operand, which resides in memory]
In an assembler, use
MOV BYTE PTR [DI],10H
Example:
.Data
FILE EQU THIS BYTE
recA DB 15 dup(?) ;15 bytes for rec A
recB DB 20 dup(?) ;20 bytes for rec B
.code
.startup
MOV BX,OFFSET recA ;address record A
MOV DI,0 ;address element 0
MOV AL,FILE[BX+DI]
.exit
end
Design Decisions
Operation repertoire
How many ops, What can they do and How complex are they?
Instruction formats
Length of op code field
Number of addresses
Registers
Number of CPU registers available
Which operations can be performed on which registers?
Fewer addresses
Less complex (powerful?) instructions
More instructions per program
Faster fetch/execution of instructions
Both programs compute Y = (A - B) / (C + D x E)
TWO Address
MOV Y,A
SUB Y,B
MOV T,D
MUL T,E
ADD T,C
DIV Y,T
ONE Address
LOAD D
MUL E
ADD C
STOR Y
LOAD A
SUB B
DIV Y
STOR Y
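To see what the one-address sequence does, it can be run on a toy accumulator machine. This interpreter is a hypothetical sketch written for this example, with made-up operand values; only the eight-instruction program itself comes from the slide.

```python
def run_one_address(program, memory):
    """Execute a one-address program; every instruction implicitly uses the accumulator."""
    acc = 0
    for op, name in program:
        if op == "LOAD":
            acc = memory[name]
        elif op == "STOR":
            memory[name] = acc
        elif op == "ADD":
            acc += memory[name]
        elif op == "SUB":
            acc -= memory[name]
        elif op == "MUL":
            acc *= memory[name]
        elif op == "DIV":
            acc /= memory[name]
        else:
            raise ValueError(f"unknown opcode {op}")
    return memory


program = [("LOAD", "D"), ("MUL", "E"), ("ADD", "C"), ("STOR", "Y"),
           ("LOAD", "A"), ("SUB", "B"), ("DIV", "Y"), ("STOR", "Y")]
memory = run_one_address(program, {"A": 20, "B": 2, "C": 3, "D": 2, "E": 3})
print(memory["Y"])
```

With these values the denominator C + D x E is 9 and the result is (20 - 2) / 9 = 2.0. The one-address version needs eight instructions where the two-address version needs six, which matches the "more instructions per program" trade-off for fewer addresses.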
Types of Operand
Addresses
Unsigned integers
Numbers
Binary fixed point/Binary floating point
Characters
ASCII
Logical Data
Bits or flags, Bit level manipulation of data
Exercise
Show the machine code for the following assembly instructions
MOV [DI], AL
ADD [BX+SI+1234h], AX
MOD  Explanation
00   Memory mode, no displacement
01   Memory mode, 8-bit displacement
10   Memory mode, 16-bit displacement
11   Register mode
R/M  MOD=11, W=0  MOD=11, W=1  MOD=00          MOD=01
000  AL           AX           BX+SI           BX+SI+D8
001  CL           CX           BX+DI           BX+DI+D8
010  DL           DX           BP+SI           BP+SI+D8
011  BL           BX           BP+DI           BP+DI+D8
100  AH           SP           SI              SI+D8
101  CH           BP           DI              DI+D8
110  DH           SI           Direct address  BP+D8
111  BH           DI           BX              BX+D8
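The exercise can be checked by assembling the MOD-REG-R/M byte by hand. This is a sketch under stated assumptions: the opcode bytes 88h (MOV r/m8, r8) and 01h (ADD r/m16, r16) come from the standard 8086 opcode map rather than from these slides, and the helper names modrm and encode are hypothetical.

```python
REG = {"AL": 0b000, "AX": 0b000}          # REG field codes used in this exercise
RM = {"BX+SI": 0b000, "DI": 0b101}        # R/M field codes for memory operands


def modrm(mod, reg, rm):
    """Pack the MOD (2 bits), REG (3 bits) and R/M (3 bits) fields into one byte."""
    return (mod << 6) | (reg << 3) | rm


def encode(opcode, mod, reg, rm, displacement=b""):
    """Opcode byte, MOD-REG-R/M byte, then any displacement (little-endian)."""
    return bytes([opcode, modrm(mod, reg, rm)]) + displacement


# MOV [DI], AL: MOD=00 (no displacement), REG=AL, R/M=[DI]
mov_code = encode(0x88, 0b00, REG["AL"], RM["DI"])
# ADD [BX+SI+1234h], AX: MOD=10 (16-bit displacement), REG=AX, R/M=[BX+SI]
add_code = encode(0x01, 0b10, REG["AX"], RM["BX+SI"], (0x1234).to_bytes(2, "little"))
print(mov_code.hex(" "), "|", add_code.hex(" "))
```

Under these assumptions the two instructions encode as 88 05 and 01 80 34 12, with the 16-bit displacement stored low byte first.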
REPT:
80x86 Instruction Set (Data Movement Instructions)
Moving data/address from either register to memory or memory to register
MOV
PUSH/POP
LDS, LES, LSS
LEA
Logical Instructions
AND, OR, TEST, XOR, NOT, NEG
Conditional jump
JNC -> jump if no carry
JNO -> jump if no overflow
JE -> jump if equal
JNE -> jump if not equal
IN ;Input
LAHF ;Load AH from Flags
LDS ;Load pointer to DS
LEA ;Load EA to register
LES ;Load pointer to ES
LODS ;Load memory at SI into AX
MOV ;Move
MOVS ;Move memory at SI to DI
OUT ;Output
POP ;Pop
POPF ;Pop Flags
PUSH ;Push
PUSHF ;Push Flags
SAHF ;Store AH into Flags
SCAS ;Scan memory at DI compared to AX
STOS ;Store AX into memory at DI
XCHG ;Exchange
XLAT ;Translate byte to AL
AAA ;ASCII Adjust for Add in AX
AAD ;ASCII Adjust for Divide in AX
AAM ;ASCII Adjust for Multiply in AX
AAS ;ASCII Adjust for Subtract in AX
ADC ;Add with Carry
ADD ;Add
AND ;Logical AND
CMC ;Complement Carry
CMP ;Compare
CMPS ;Compare memory at SI and DI
CWD ;Convert Word to Double in AX -> DX,AX
DAA ;Decimal Adjust for Add in AX
DAS ;Decimal Adjust for Subtract in AX
DEC ;Decrement
DIV ;Divide (unsigned) in AX(,DX)
IDIV ;Divide (signed) in AX(,DX)
MUL ;Multiply (unsigned) in AX(,DX)
IMUL ;Multiply (signed) in AX(,DX)
INC ;Increment
NEG ;Negate
NOT ;Logical NOT
OR ;Logical inclusive OR
RCL ;Rotate through Carry Left
RCR ;Rotate through Carry Right
ROL ;Rotate Left
ROR ;Rotate Right
SAR ;Shift Arithmetic Right
SBB ;Subtract with Borrow
SCAS ;Scan memory at DI compared to AX
SHL/SAL ;Shift logical/Arithmetic Left
SHR ;Shift logical Right
SUB ;Subtract
TEST ;AND function to flags
XLAT ;Translate byte to AL
XOR ;Logical Exclusive OR
JNLE/JG ;Jump on Not Less or Equal/Greater
JNO ;Jump on Not Overflow
JNP/JPO ;Jump on Not Parity/Parity Odd
JNS ;Jump on Not Sign
JO ;Jump on Overflow
JP/JPE ;Jump on Parity/Parity Even
JS ;Jump on Sign
LOOP ;Loop CX times
LOOPNZ/LOOPNE ;Loop while Not Zero/Not Equal
LOOPZ/LOOPE ;Loop while Zero/Equal
NOP ;No Operation (= XCHG AX,AX)
REP/REPNE/REPNZ ;Repeat/Repeat Not Equal/Not Zero
REPE/REPZ ;Repeat Equal/Zero
RET ;Return from call
SEG ;Segment register
STC ;Set Carry
STD ;Set Direction
STI ;Set Interrupt
TEST ;AND function to flags
Quiz
Q. The instruction set architecture for a simple computer must support access to 64 KB of byte-addressable memory space and eight 16-bit general-purpose CPU registers.
a) If the computer has three-operand machine language instructions that operate on the contents of two different CPU registers to produce a result that is stored in a third register, how many bits are required in the instruction format for addressing registers?
b) If all instructions are to be 16 bits long, how many op codes are available for the three-operand, register operation instructions described above (neglecting, for the moment, any other types of instructions that might be required)?
Module-4 (Lecture-1)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Computer Memory Hierarchy and its characteristics, Cache Memory, Cache Mapping Function : Direct Mapping
Source: Null, Linda and Lobur, Julia (2003). Computer Organization and Architecture (p. 236). Sudbury, MA: Jones and Bartlett Publishers
Memory Hierarchy
Computer memory exhibits the widest range of
Physical Type: Semiconductor (RAM), Magnetic (Disk), Optical (CD)
Physical Characteristics: Volatile/Non-volatile, Erasable/Non-erasable
Organization: Physical arrangement of bits
Performance: Access time (Latency), Transfer time, Memory cycle time
Capacity
Number of Words/Bytes
Unit of transfer
Internal
Usually governed by data bus width
External
Usually a block which is much larger than a word
Addressable unit
Smallest location which can be uniquely addressed
Word or byte internally and cluster on disks
Direct
Individual blocks have a unique address based on physical location
Access is by jumping to the vicinity plus a sequential search
Access time depends on location and previous location
e.g. Hard Disk
Associative
Data is located based on a portion of its contents rather than its address
Called Content Addressable Memory (CAM)
Note: Access time is independent of location or previous access for Random and Associative accesses
Locality of References
Processor generates the main memory references
To fetch instructions and data
Cache Memory
Small amount of fast memory called Cache memory sits between normal main memory and CPU
Block Size
Number of Caches
Replacement Algorithm
Write Policy
Mapping Function
The correspondence between the main memory blocks (groups of words) and the cache lines is specified by a mapping function
Direct Mapping
Each block of main memory maps to only one cache line
i.e. if a block is in cache, it must be in one specific place
Mapping Function
The jth block of main memory maps to the ith cache line
i = j modulo M (M = number of cache lines)
Cache line   Main memory blocks assigned
0            0, m, 2m, 3m, ..., 2^s - m
1            1, m+1, 2m+1, ..., 2^s - m + 1
...
m-1          m-1, 2m-1, 3m-1, ..., 2^s - 1
Exercises
Show the address split up for the direct cache mapping function. Cache and main memory details are as follows:
Cache size is 128 KBytes
Cache line size is 8 Bytes
Main memory size is 16 MBytes
Main memory is byte-addressable
Will main memory addresses x234560 and x374562 map to the same cache line?
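The address split for a direct-mapped cache can be sketched as below. This is an illustration that assumes all sizes are powers of two and reads the two addresses as hexadecimal; the function name split_direct_mapped is hypothetical.

```python
def split_direct_mapped(address, cache_bytes, line_bytes, memory_bytes):
    """Return ((tag_bits, line_bits, offset_bits), (tag, line, offset)) for an address."""
    offset_bits = (line_bytes - 1).bit_length()               # byte within the line
    line_bits = (cache_bytes // line_bytes - 1).bit_length()  # which cache line
    address_bits = (memory_bytes - 1).bit_length()
    tag_bits = address_bits - line_bits - offset_bits
    offset = address & (line_bytes - 1)
    line = (address >> offset_bits) & ((1 << line_bits) - 1)
    tag = address >> (offset_bits + line_bits)
    return (tag_bits, line_bits, offset_bits), (tag, line, offset)


KB, MB = 1024, 1024 * 1024
widths, first = split_direct_mapped(0x234560, 128 * KB, 8, 16 * MB)
_, second = split_direct_mapped(0x374562, 128 * KB, 8, 16 * MB)
print(widths, first, second)
```

With these parameters the 24-bit address splits into a 7-bit tag, a 14-bit line number and a 3-bit byte offset. Both addresses select the same line and differ only in their tags, so they cannot both be resident at once.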
Review Questions
Why is it necessary to store tag bits along with the data bits in a cache line?
Hint: How can two blocks of main memory mapped to the same cache line be told apart?
How many tag comparisons are required to check for the presence of the word requested by the processor in a direct-mapped cache? Justify your answer.
Hint: Block-to-line mapping is fixed
Summary
Memory Hierarchy in Computer System
Concept of Locality of Reference
Cache Memory
Cache Mapping Function: Direct Mapping
Thank You!
Module-4 (Lecture-2)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Cache Mapping Functions: Associative, Set Associative , Cache Line Replacement Algorithms
Replacement Algorithms
No choice in direct mapping because each block maps to only one line
Least Recently Used (LRU)
The block that has gone longest without being referenced is replaced
Random
Replace any block randomly
[Figure: plot with hit ratio on the vertical axis, ticks from 0.1 to 0.9]
Exercise-1
Consider a fully associative cache memory having 4 lines. The main memory references generated by the processor are 0,1,2,3,4,0,1,2,3,4. Show the final cache entries by assuming LRU replacement policy.
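The exercise can be checked with a short simulation. This is a minimal sketch of LRU replacement for a fully associative cache using Python's OrderedDict; the function name simulate_lru is illustrative.

```python
from collections import OrderedDict


def simulate_lru(references, n_lines):
    """Return (cached blocks ordered from least to most recently used, hit count)."""
    cache = OrderedDict()
    hits = 0
    for block in references:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            if len(cache) == n_lines:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return list(cache), hits


contents, hits = simulate_lru([0, 1, 2, 3, 4, 0, 1, 2, 3, 4], 4)
print(contents, hits)
```

The cache ends up holding blocks 1, 2, 3 and 4 with no hits at all: each reference evicts exactly the block that will be needed next, a known worst case for LRU on cyclic access patterns one block larger than the cache.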
Exercise-2
A 4-way set-associative cache has a block size of four 16-bit words. The cache can accommodate a total of 4K such words. The main memory size is 128K words. How are the processor's addresses interpreted?
Summary
Cache to Main Memory Mapping
Fully Associative and Set Associative
Thank You!
Module-4 (Lecture-3)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Cache Performance Measurement , Cache Miss Types, Hit Ratio, Write Policy, Multilevel Caches
Cache Performance
The more processor references are found in the cache, the better the performance
If a reference is found in the cache it is a HIT, otherwise a MISS
A penalty is associated with each MISS
More hits reduce the average access time
Conflict: If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses.
Measure of how close the average access time is to the M1 access time
On-chip cache access is about 25 to 50 times faster than main memory access
Off-chip cache access is about 5 to 15 times faster than main memory access
Larger blocks
Reduce the number of blocks that fit in the cache
Data is overwritten shortly after being fetched
Each additional word is less local, so less likely to be needed
L1 instruction cache
L1 data cache
L1 data and instruction cache size: 16 KB
Data cache: 4-way set associative
Instruction cache: 2-way set associative
L2 cache
Main memory
Input/Output
Review Questions
For a larger block size, the hit ratio initially increases and later drops. Why?
Why does a multilevel cache system give better performance than a single-level one?
How do conflict misses differ from capacity misses?
Summary
Cache Performance
Hit Ratio
Average Access Time
Access Efficiency
Write Policy
Line Size
Multiple Levels of Cache
Unified Cache and Split Cache
Module-5 (Lecture-1)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Internal memory, Semiconductor Memory Types, RAM, ROM, Memory Chip Organization, Memory Module Organization, Flash Memory
Basic Concepts
Data transfer between the processor and the memory takes place through two registers
MAR and MBR (also called MDR)
Memory cycle time is usually slightly longer than the access time
Memory cycle time for semiconductor memories ranges from 10 to 100 ns
[Table residue: semiconductor memory types. Recoverable cells:]
Category: Read-only memory
Erasure: Not possible; UV light, chip-level; Electrically, byte-level; Electrically, block-level
Write mechanism: Electrically; Masks
Volatility: Volatile; Nonvolatile
Static RAM
Memory built from circuits that retain their state as long as power is applied
Bits stored as on/off switches
Complex construction, so larger per bit and more expensive
State 0
C2 high, C1 low, T2 T3 off, T1 T4 on
Dynamic RAM
Bits stored as charge in capacitors
Charges leak, so refreshing is needed even when powered
Simpler construction and smaller per bit, so less expensive
Address line is active when a bit is read or written
Transistor switch closed (current flows)
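[Figure: memory chip organization. Address decoder with inputs A0-A3 drives word lines up to W15; rows of flip-flop (FF) memory cells; Sense/Write circuits with R/W and CS inputs; data lines b1, b0]
[Figure: memory chip with a 10-bit address, R/W and CS inputs, and data input/output]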
Flash memory
Similar technology to EEPROM, i.e. a single transistor controlled by trapped charge
A single cell can be read, but writing is done on a block basis and the previous contents are erased
Greater density, lower cost per bit, and lower power consumption
Used in MP3 players, cell phones, digital cameras
Larger flash memory modules are called Flash Drives (i.e. Solid State Storage Devices)
Exercises
How does memory organization affect the data transfer rate?
Number of pins required for a 4 Mbit memory chip:
Organized as 4M x 1
Organized as 1M x 4
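A rough sketch for comparing the two organizations. It counts only address and data lines and assumes a non-multiplexed address bus; control pins (such as R/W and CS), power, and ground are deliberately left out, so the totals are lower bounds rather than full pin counts. The function name is hypothetical.

```python
def address_and_data_lines(words, bits_per_word):
    """Return (address_lines, data_lines) for a words x bits_per_word chip.

    Assumes one address line per address bit (no row/column multiplexing).
    """
    address_lines = (words - 1).bit_length()  # bits needed to select a word
    return address_lines, bits_per_word

M = 2 ** 20
print(address_and_data_lines(4 * M, 1))  # (22, 1)
print(address_and_data_lines(1 * M, 4))  # (20, 4)
```

The 1M x 4 organization needs fewer address lines but delivers four bits per access, which is why organization affects the transfer rate.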
Summary
Module-5 (Lecture-2)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: External Memory, Magnetic Disk, Disk Characteristics and Performance Parameters
External memory
Semiconductor memory cannot be used to store large amounts of information or data
Due to its high cost per bit
Disk Controller
Disk Drive
Magnetic Disk
Tracks are divided into sectors
Minimum block size is one sector
Disk rotates at constant angular velocity (CAV)
Gives pie-shaped sectors
Disk storage capacity in a CAV system is limited by the maximum recording density that can be achieved on the innermost track
Read (contemporary)
Separate read head, close to the write head
Partially shielded magnetoresistive (MR) sensor
Electrical resistance depends on the direction of the magnetic field
Resistance changes are detected as voltage signals
Allows high-frequency operation
Higher storage density and speed
Disk Characteristics
Fixed (rare) or movable head
Fixed head
One read/write head per track, mounted on a fixed rigid arm
Movable head
One read/write head per side mounted on a movable arm
Capacity
Vendors express capacity in units of gigabytes (GB), where 1 GB = 10^9 bytes
Capacity is determined by these technology factors:
Recording density (bits/in): number of bits that can be squeezed into a 1 inch segment of a track.
Track density (tracks/in): number of tracks that can be squeezed into a 1 inch radial segment.
Areal density (bits/in^2): product of recording density and track density.
Modern disks partition tracks into disjoint subsets called recording zones
Example:
512 bytes/sector, 300 sectors/track (average)
20,000 tracks/surface, 2 surfaces/platter
5 platters/disk
Capacity = 512 x 300 x 20,000 x 2 x 5 bytes = 30.72 GB
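The capacity calculation above can be reproduced in a few lines. This is only a sketch; the function name `disk_capacity_bytes` is illustrative, and 1 GB is taken as 10^9 bytes, as stated on the slide.

```python
def disk_capacity_bytes(bytes_per_sector, sectors_per_track,
                        tracks_per_surface, surfaces_per_platter, platters):
    """Total capacity as the product of the geometry parameters."""
    return (bytes_per_sector * sectors_per_track * tracks_per_surface
            * surfaces_per_platter * platters)

capacity = disk_capacity_bytes(512, 300, 20_000, 2, 5)
print(capacity)           # 30720000000
print(capacity / 10**9)   # 30.72 (GB, with 1 GB = 10^9 bytes)
```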
Rotational delay
Transfer time
The total average access time is: T_a = T_s + 1/(2r) + b/(rN)
Here
T_s is the average seek time
r is the rotation speed in revolutions per second
b is the number of bytes to be transferred
N is the number of bytes on a track
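The access time formula can be written as a small function. This sketch works in milliseconds rather than seconds, so r is converted to revolutions per millisecond; the function name and the sample parameters (4 ms seek, 15,000 rpm, one 512-byte sector on a 500-sector track) are illustrative assumptions.

```python
def average_access_time_ms(seek_ms, rpm, bytes_to_transfer, bytes_per_track):
    """Seek time + average rotational delay + transfer time, in ms."""
    r = rpm / 60_000                              # revolutions per millisecond
    rotational_delay = 1 / (2 * r)                # half a revolution on average
    transfer_time = bytes_to_transfer / (r * bytes_per_track)
    return seek_ms + rotational_delay + transfer_time

# One 512-byte sector on a track holding 500 such sectors
t = average_access_time_ms(4, 15_000, 512, 500 * 512)
print(round(t, 3))  # 6.008
```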
Example
Average seek time = 4 ms
Rotation speed = 15,000 rpm
512 bytes per sector
Number of sectors per track = 500
We want to read a file consisting of 2000 sectors, stored sequentially. Calculate the time to read the entire file.
At 15,000 rpm one rotation takes 4 ms, so the average rotational delay is 2 ms
4 + 2 + 4 = 10 ms to read the first 500 sectors (1 track)
Time required to read the remaining 3 tracks is 3 x (2 + 4) = 18 ms (no further seek is needed)
Total time is 28 ms
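The sequential example can be checked with the sketch below. It assumes, as the worked example does, that the file fills whole adjacent tracks, that only the first track needs a seek, and that every track costs an average rotational delay plus one full rotation to read. The function name is hypothetical.

```python
def sequential_read_ms(seek_ms, rpm, sectors_per_track, file_sectors):
    """Time to read a file stored on adjacent tracks, in ms.

    Assumes file_sectors is a whole multiple of sectors_per_track.
    """
    rotation_ms = 60_000 / rpm                  # one full revolution
    rotational_delay = rotation_ms / 2          # average wait for the first sector
    tracks = file_sectors // sectors_per_track
    first_track = seek_ms + rotational_delay + rotation_ms
    other_tracks = (tracks - 1) * (rotational_delay + rotation_ms)
    return first_track + other_tracks

print(sequential_read_ms(4, 15_000, 500, 2000))  # 28.0
```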
Exercise-1
Suppose the file is stored randomly:
The 2000 sectors are scattered at random over the disk
What is the total time required to read the file now?
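One way to approach the random case is sketched below. It assumes the same disk as the sequential example (4 ms average seek, 15,000 rpm, 500 sectors per track) and that every sector costs a fresh seek, an average rotational delay, and the time to read one sector. The function name is illustrative, and this is one reasonable reading of the exercise rather than an official solution.

```python
def random_read_ms(seek_ms, rpm, sectors_per_track, file_sectors):
    """Time to read sectors scattered at random over the disk, in ms."""
    rotation_ms = 60_000 / rpm
    per_sector = (seek_ms                            # seek to the track
                  + rotation_ms / 2                  # average rotational delay
                  + rotation_ms / sectors_per_track) # read one sector
    return file_sectors * per_sector

total = random_read_ms(4, 15_000, 500, 2000)
print(round(total, 1))  # 12016.0 ms, i.e. about 12 seconds
```

Compared with 28 ms for the sequential layout, the scattered layout is dominated by seek time and rotational delay.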
Exercise-2
What is the average time to read or write a 512-byte sector on a disk rotating at 15,000 rpm? The average seek time is 4 ms, and the transfer rate is 100 MB/sec.
Sol: 4 + 2 + 0.005 = 6.005 ms
Summary
Secondary Storage in Computers
Magnetic Disk
Disk Characteristics
Data organization on disk
Read/Write Mechanism
Access time
Module-5 (Lecture-3)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Redundant Array of Independent Disks (RAID)
Justification
With multiple disks, separate I/O requests can be handled in parallel
A single I/O request can be executed in parallel, if the requested data is distributed across multiple disks
RAID
Redundant Array of Independent Disks
Seven levels in common use
Not a hierarchy
Characteristics
1. Set of physical disks viewed as a single logical drive by the operating system
2. Data distributed across physical drives
3. Can use redundant capacity to store parity information
RAID 0
Increased Speed
Multiple data requests are probably not on the same disk
Disks seek in parallel
A set of data is likely to be striped across multiple disks
Drawbacks:
Not a "True" RAID because it is NOT fault-tolerant
The failure of just one drive will result in all data in the array being lost
RAID 1
Mirrored disks
Data is striped across disks
2 copies of each stripe on separate disks
Read from either and write to both
RAID 1
Recovery is simple
Swap faulty disk and re-mirror
No down time
Drawback
Highest disk overhead of all RAID types
Expensive
Every write must be done on two disks
RAID 3
Similar to RAID 2
Only one redundant disk, no matter how large the array
Simple parity bit for each set of corresponding bits
Data on a failed drive can be reconstructed from the surviving data and the parity information
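The reconstruction idea can be illustrated with XOR parity. This is a simplified sketch: it XORs whole byte strings, which is equivalent to keeping one parity bit per set of corresponding bits, and the names `parity_strip` and `reconstruct`, as well as the sample data, are made up for illustration.

```python
from functools import reduce

def parity_strip(strips):
    """XOR corresponding bytes of equally sized strips."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))

def reconstruct(surviving_strips, parity):
    """Rebuild the one missing strip from the survivors and the parity."""
    return parity_strip(list(surviving_strips) + [parity])

data = [b"\x0f\xa0", b"\x33\x0c", b"\x55\xff"]   # strips on three data disks
parity = parity_strip(data)                      # stored on the parity disk

# Suppose the second disk fails: XOR of everything else restores it
recovered = reconstruct([data[0], data[2]], parity)
print(recovered == data[1])  # True
```

Because XOR is its own inverse, the same routine computes the parity and recovers any single lost strip.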
RAID 4
Makes use of independent access with block-level striping
Good for high I/O request rates due to large strips
Bit-by-bit parity is calculated across corresponding strips on each data disk
Parity is stored on a dedicated parity disk
Drawback???
RAID 5
Round-robin allocation for the parity strip
Avoids the RAID 4 bottleneck at the parity disk
Commonly used in network servers
Drawback
Disk failure has a medium impact on throughput
Difficult to rebuild in the event of a disk failure (as compared to RAID level 1)
RAID 6
Two parity calculations
Stored in separate blocks on different disks
High data availability
Three disks need to fail for data loss
Significant write penalty
Drawback
Controller overhead to compute parity is very high
Review Questions
In the context of RAID, what is the difference between parallel access and independent access?
What is the role of strip size in RAID 0 in achieving a high I/O request rate and a high data transfer capacity?
Summary
Reliability and performance improvement for secondary storage
Comparative analysis of RAID levels
Module-6 (Lecture-1)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: I/O Modules: Function and Structure I/O Methods: Programmed and Interrupt Driven I/O
Input /Output
The I/O system provides an interface to the outside world
I/O operations are accomplished through a wide variety of external devices
Used to transfer data between the computer and the external world, e.g. keyboard, monitor, disk drive, printer
An external device is attached to the computer by an interface circuit called an I/O module
Error Detection
Mechanical and electrical malfunctions and transmission errors, e.g. paper jam, bad disk sector
I/O Methods
Programmed I/O
CPU has direct control over I/O
Sensing status
Read/write commands
Transferring data
Isolated I/O
Separate address spaces
Need I/O or memory select lines
Special commands for I/O
Interrupts
An event that interrupts what the processor is doing
Program
e.g. overflow, division by zero, segmentation fault
Timer
Generated by internal processor timer Used in pre-emptive multi-tasking
I/O
from I/O controller
Hardware failure
e.g. memory parity error
Device Identification-1
How does processor determine which device issued the interrupt?
Multiple Interrupt Lines
Impractical to provide one line to each I/O device as number of devices increase
Software Poll
Processor branches to an ISR
ISR polls each I/O module to identify which module caused the interrupt
Time-consuming process
Device Identification-2
Hardware Poll
All I/O modules share a common interrupt request line
Interrupt ACK line is daisy chained through the modules
The requesting module places its address on the data bus; this word is called a vector
Technique is called vectored interrupt
Review Questions
When a device interrupt occurs, how does the processor determine which device issued the interrupt?
What is context switching?
How is programmed I/O different from interrupt-driven I/O?
How is software polling different from the daisy chain method of handling multiple interrupts?
Summary
I/O modules: Functions and Structure
I/O methods
Programmed I/O
Device addressing methods
Module-6 (Lecture-2)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Direct Memory Access (DMA), DMA Configurations
With programmed I/O, the processor can move data at a higher rate, at the cost of doing nothing else
Interrupt-driven I/O frees up the processor at the expense of the I/O transfer rate
CPU carries on with other work
DMA controller deals with the transfer
DMA controller sends an interrupt when finished
Slows down CPU but not as much as CPU itself doing transfer
Single Bus, Detached DMA controller
Each transfer uses the bus twice
I/O to DMA, then DMA to memory
Single Bus, Integrated DMA controller
Controller may support more than one device
Each transfer uses the bus once
CPU is suspended once
Separate I/O Bus
Bus supports all DMA-enabled devices
Each transfer uses the system bus once
CPU is suspended once
Exercise-1
A processor and an I/O device are connected to main memory via a shared bus with a width of one word. The processor can execute 10^6 instructions/sec. An average instruction requires 5 machine cycles, three of which use the memory bus. A memory read or write requires one machine cycle. The processor is executing background programs that require 95% of its instruction execution rate but no I/O instructions. Assume one processor cycle equals one bus cycle, and that the I/O device is used to transfer very large blocks of data. Answer the following:
If programmed I/O is used and each one-word transfer requires the processor to execute two instructions, what is the maximum I/O transfer rate?
Estimate the rate if DMA is used.
Solution
Programmed I/O: The processor can devote only 5% of its time to I/O. Thus the maximum I/O instruction execution rate is 10^6 x 0.05 = 50,000 instructions per second. Since each word needs two instructions, the I/O transfer rate is 25,000 words/second.
DMA: The number of machine cycles per second available for DMA is 10^6 x (0.05 x 5 + 0.95 x 2) = 2.15 x 10^6. If we assume that the DMA module can use all of these cycles, and ignore any setup or status-checking time, then this value is the maximum I/O transfer rate in words/second.
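The arithmetic in this solution can be reproduced as follows. The sketch uses integer percentages to avoid floating-point rounding, and the constant names are illustrative. The DMA estimate assumes, as the solution does, that DMA can use every bus cycle the processor leaves free: all 5 cycles of each idle instruction slot and the 2 non-bus cycles of each background instruction.

```python
INSTRUCTIONS_PER_SEC = 10**6
CYCLES_PER_INSTRUCTION = 5
BUS_CYCLES_PER_INSTRUCTION = 3
BACKGROUND_PERCENT = 95          # share of instruction rate used by background work
INSTRUCTIONS_PER_WORD = 2        # programmed I/O cost of one word transfer

idle_percent = 100 - BACKGROUND_PERCENT

# Programmed I/O: only the idle 5% of instruction slots can run I/O code
io_instructions = INSTRUCTIONS_PER_SEC * idle_percent // 100
programmed_io_rate = io_instructions // INSTRUCTIONS_PER_WORD

# DMA: free bus cycles per "average" instruction slot, scaled by the rate
free_cycles_x100 = (idle_percent * CYCLES_PER_INSTRUCTION
                    + BACKGROUND_PERCENT
                    * (CYCLES_PER_INSTRUCTION - BUS_CYCLES_PER_INSTRUCTION))
dma_rate = INSTRUCTIONS_PER_SEC * free_cycles_x100 // 100

print(programmed_io_rate)  # 25000 words/second
print(dma_rate)            # 2150000 words/second
```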
Exercise-2
A DMA module is transferring characters to memory using cycle stealing, from a device transmitting at 9600 bps. The processor is fetching instructions at the rate of 10^6 instructions per second. By how much will the processor be slowed down due to the DMA activity?
Sol:
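One possible way to estimate the slowdown is sketched below. The assumptions are not stated in the exercise and are introduced here only for illustration: characters are 8 bits wide, each character transfer steals exactly one memory cycle, and each stolen cycle delays one instruction fetch. The function name is hypothetical.

```python
def dma_slowdown_percent(device_bps, bits_per_char, instructions_per_sec):
    """Percentage of instruction fetches delayed by cycle stealing.

    Assumes one stolen cycle per character and one delayed fetch per
    stolen cycle.
    """
    chars_per_sec = device_bps / bits_per_char
    return 100 * chars_per_sec / instructions_per_sec

slowdown = dma_slowdown_percent(9600, 8, 10**6)
print(round(slowdown, 2))  # 0.12 (percent)
```

Under these assumptions the device delivers 1200 characters per second, so about 1200 of the 10^6 fetches per second are delayed.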
Review Questions
In general, DMA access to main memory is given higher priority than CPU access to main memory. Why?
How is cycle stealing for DMA different from the context switching used in interrupt-driven I/O?
Why is DMA better than interrupt-driven I/O for large data transfers?
Summary
Limitations of Interrupt Driven and Programmed I/O
DMA Structure and Function
DMA Configurations
Exercises
Module-7 (Lecture-1)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: RISC Vs. CISC Architectures, Instruction Pipeline, Six Stage Instruction Pipeline, Pipeline Performance
Key for the ISA designer
These statements should be supported in an optimal fashion
RISC Architecture
Earlier computers had small and simple instruction sets. Why???
In the early 1980s, designers recommended that computers use fewer instructions with simple constructs so they can be executed much faster without accessing memory as often.
RISC Characteristics
Relatively few instructions
128 or less
Hardwired control
RISC Processors
MIPS R4000
From MIPS, one of the first vendors of commercially available RISC processors
Supports thirty-two 64-bit registers
128 KB of high-speed cache
SPARC (Sun)
Based on Berkeley RISC model
CISC Architecture
Later, both the complexity and the number of instructions increased. Why???
To simplify compilation
The trend toward greater hardware complexity was influenced by various factors, such as
Providing support for more customer applications
Adding instructions that facilitate the translation from high-level language into machine language programs
A single machine instruction for each high-level language statement
Ex: VAX computer, IBM/370 computers, Intel x86 based processors, Motorola 68000 series
CISC Characteristics
A large number of instructions
Some instructions for special tasks, used infrequently
A large variety of addressing modes (5 to 20)
Variable-length instruction formats
Instructions that manipulate operands in memory
However, it soon became apparent that a complex instruction set has a number of disadvantages. These include a complex instruction decoding scheme, an increased size of the control unit, and increased logic delays.
Example RISC Program
        mov ax, 0
        mov bx, 20
        mov cx, 5
again:  add ax, bx
        loop again
Most commercially available machines are a mixture of RISC and CISC characteristics
Instruction Pipelining
Pipelining
What is pipelining?
Six-stage pipeline example
Performance issues
Instruction Pipelining
Similar to the use of an assembly line in a manufacturing plant
New inputs are accepted at one end before previously accepted inputs appear as outputs at the other end
The same concept can be applied to instruction execution
Execution usually does not access main memory
The next instruction can be fetched during execution of the current instruction
Called instruction prefetch
Any jump or branch means that the prefetched instructions are not the required instructions
These factors reduce the potential effectiveness of the two-stage pipeline
How about going for more stages?
Pipeline Performance
Total time required to execute n instructions on a pipeline with k stages and cycle time t
T_{k,n} = [k + (n - 1)] t
Speedup factor
S_k = T_{1,n} / T_{k,n} = nk / [k + (n - 1)], where T_{1,n} = nkt is the time without pipelining
However, as a practical matter, the potential gains of additional stages are countered by increases in cost and by delays between stages.
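The two formulas can be tried out numerically. This is a sketch with illustrative values (a 6-stage pipeline, a 10 ns cycle, 100 instructions); the function names are hypothetical.

```python
def pipeline_time(k, n, t):
    """Time for n instructions on a k-stage pipeline with cycle time t."""
    return (k + (n - 1)) * t

def speedup(k, n):
    """Speedup over a non-pipelined machine that needs k*t per instruction."""
    return (n * k) / (k + (n - 1))

print(pipeline_time(6, 100, 10))      # 1050 (ns)
print(round(speedup(6, 100), 3))      # 5.714
print(round(speedup(6, 10**6), 3))    # 6.0, the limit as n grows is k
```

The first instruction takes k cycles to emerge; after that, one instruction completes per cycle, which is why the speedup approaches the number of stages for long instruction streams.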
Review Questions
A program takes 600 ns to execute on a non-pipelined processor. Suppose we need to run 100 programs of the same type on a six-stage pipelined processor with a clock period of 10 ns. What is the speedup ratio of the pipeline?
Summary
RISC and CISC Architectures
Instruction Pipeline
Pipeline Performance Factors
Module-7 (Lecture-2)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Pipeline Hazards: Data Hazards, Resource Hazards, Control Hazards, RISC Pipeline
Stalls/Hazards in Pipeline
Any condition that causes the pipeline to stall is called a hazard
Data Hazards
Any condition in which either the source or the destination operands of an instruction are not available at the expected time in the pipeline
Data Hazards
Read After Write Hazard
I1: R1=R1+R2
I2: R3=R1+3
Software Solution
Inserting NOP (No-Operation) instructions between I1 and I2
Resource Hazards
One instruction may need to access memory as part of the Write stage while another instruction is being Fetched
If instructions and data are in the same cache unit, there is a conflict
Called a structural conflict, resource conflict, or resource hazard
Instruction/Control Hazards
Occur if the required instruction is not available
Example: cache miss on instruction fetch, branch instruction
Called control hazards or instruction hazards
[Pipeline timing diagram; one column per clock cycle, X marks a discarded instruction]
I1        F1 D1 E1 W1
I2 (Br)      F2 D2 E2
I3              F3 D3 X
I4                 F4 X
Ik                    Fk Dk Ek Wk
Many processors employ fetch unit that can fetch instructions before they are needed
Multiple Streams
Have two pipelines
Prefetch each branch into a separate pipeline
Use the appropriate pipeline
Drawbacks
Leads to bus and register contention
Multiple branches lead to further pipelines being needed
Used in IBM 370/168
Loop Buffer
Very fast memory maintained by the fetch stage of the pipeline
Keeps the n most recently fetched instructions, in sequence
Check the buffer before fetching from memory
Very good for small loops or jumps
Used in CRAY-1
Review Questions
Consider the following sequence of instructions:
I1: add R1, R0, #20 (R1 <- R0 + 20)
I2: mul R2, R3, #2 (R2 <- R3 * 2)
I3: and R4, R1, R2 (R4 <- R1 and R2)
I4: add R5, R4, R2 (R5 <- R4 + R2)
These instructions are executed in a computer that has the four-stage pipeline (Fetch, Decode, Execute, Write) discussed in this class. Assume that every stage of every instruction requires one cycle, except the Execute stage of the multiply instruction, which requires two cycles. Draw a diagram to describe the operation being performed by each pipeline stage during each clock cycle. Show the stalls in the pipeline, if any.
Solution
I1: add R1, R0, #20 (R1 <- R0 + 20)
I2: mul R2, R3, #2 (R2 <- R3 * 2)
I3: and R4, R1, R2 (R4 <- R1 and R2)
Predict by opcode
Some instructions are more likely to result in a jump than others
Can get up to 75% success
Delayed Branch
Do not take the jump until you have to
Rearrange instructions
Example
LOOP  Shift_left  R1
      Decrement   R2
      Brp         LOOP
      Add         R1,R3
Reordered Instructions
LOOP  Decrement   R2
      Brp         LOOP
      Shift_left  R1
NEXT  Add         R1,R3
The location following a branch instruction is called a branch delay slot
The objective is to fill the slot with a useful instruction
RISC Pipelining
Most instructions are register to register
I: Instruction fetch
E: Execute
ALU operation with register input and output
D: Memory
Register to memory or memory to register operation
Module-7 (Lecture-3)
[Ref. Computer Organization and Architecture , 8th Ed. by William Stallings] Topics: Instruction Level and Machine Level Parallelism, Superscalar Pipeline and Superscalar Processor Design Elements, Hazards and Solutions
Introduction
Scalar pipeline improves performance
By overlapping various instruction stages
Unable to exploit Instruction Level Parallelism (ILP) completely
What is Superscalar?
Multiple independent instruction pipelines are used to execute instructions independently and concurrently
Common instructions (arithmetic, load/store, conditional branch) are initiated simultaneously and executed independently
Improves the performance of instructions that operate on scalar quantities
Equally applicable to RISC and CISC, but in practice usually RISC
Limitations
Superscalar approach depends on the ability to execute multiple instructions in parallel
How to maximize the ILP
Compiler-based optimization
Hardware techniques
Procedural/Instruction Hazards
Cannot execute instructions after a branch in parallel with instructions before a branch
Also, if instruction length is not fixed, instructions have to be decoded (partially) to find out how many fetches are needed
This prevents simultaneous fetches
Resource Conflict
Two or more instructions requiring access to the same resource at the same time
e.g. two arithmetic instructions
Solution
Can duplicate resources. e.g. have two arithmetic units
Data Hazards/Conflicts
Example Program
R3 = R3 + R5; (I1)
R4 = R3 + 1; (I2)
R3 = R5 + 1; (I3)
R7 = R3 + R4; (I4)
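The dependencies in this example program can be listed mechanically. The sketch below is one simple way to do it, not a description of what any real processor does: each instruction is encoded as a (destination, sources) pair, and a dependency on a register is reported between two instructions only if no instruction in between overwrites that register. The function name `find_hazards` is hypothetical.

```python
def find_hazards(program):
    """Return a list of (kind, earlier, later, register) tuples.

    program: list of (dest, sources) pairs; instructions are numbered from 1.
    """
    hazards = []
    for i, (dest_i, srcs_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            dest_j, srcs_j = program[j]
            # Registers overwritten strictly between instructions i and j
            between = {program[k][0] for k in range(i + 1, j)}
            if dest_i in srcs_j and dest_i not in between:
                hazards.append(("RAW", i + 1, j + 1, dest_i))  # true dependency
            if dest_j in srcs_i and dest_j not in between:
                hazards.append(("WAR", i + 1, j + 1, dest_j))  # antidependency
            if dest_i == dest_j and dest_i not in between:
                hazards.append(("WAW", i + 1, j + 1, dest_i))  # output dependency
    return hazards

program = [
    ("R3", ["R3", "R5"]),  # I1: R3 = R3 + R5
    ("R4", ["R3"]),        # I2: R4 = R3 + 1
    ("R3", ["R5"]),        # I3: R3 = R5 + 1
    ("R7", ["R3", "R4"]),  # I4: R7 = R3 + R4
]
for hazard in find_hazards(program):
    print(hazard)
```

For this program the sketch reports RAW dependencies I1 to I2 (R3), I2 to I4 (R4) and I3 to I4 (R3), WAR dependencies I1 to I3 and I2 to I3 (R3), and a WAW dependency I1 to I3 (R3).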
Machine Parallelism
Ability to take advantage of instruction level parallelism
Governed by the number of parallel pipelines
Note
A program may not have enough instruction level parallelism to take full advantage of machine parallelism
Ordering issues
Order in which instructions are fetched, executed, and change the contents of registers and memory
Issue Policies
In-order issue with in-order completion
In-order issue with out-of-order completion
Out-of-order issue with out-of-order completion
Register Renaming
When out of order techniques are used
WAW and WAR hazards occur because register contents may not reflect the correct ordering from the program
WAW and WAR dependencies are known as storage conflicts
Compilers use optimized register allocation, which leads to more storage conflicts
Register renaming is used to deal with this
Similar to resource duplication
Register Renaming
Original program
R3 := R3 + R5; (I1)
R4 := R3 + 1; (I2)
R3 := R5 + 1; (I3)
R7 := R3 + R4; (I4)
After renaming
R3b := R3a + R5a (I1)
R4b := R3b + 1 (I2)
R3c := R5a + 1 (I3)
R7a := R3c + R4b (I4)
Inference
The same original register reference in several different instructions may refer to different actual registers, if different values are needed
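The renaming idea can be sketched in a few lines. This is an illustration only: real processors allocate physical registers from a free pool, whereas this sketch simply appends a letter suffix and bumps it on every write. With that naming rule the first write to R7 comes out as R7b; the exact suffix is immaterial as long as the name is fresh. The function name and the program encoding are hypothetical.

```python
from string import ascii_lowercase

def rename_registers(program):
    """Rename registers so that every write gets a fresh name.

    program: list of (dest, sources); sources not starting with "R" are
    treated as constants. A register that has not been written yet is
    version 'a'.
    """
    version = {}  # register -> index of its current version

    def current(reg):
        return reg + ascii_lowercase[version.setdefault(reg, 0)]

    renamed = []
    for dest, sources in program:
        # Sources read the current versions, before the destination is bumped
        new_sources = [current(s) if s.startswith("R") else s for s in sources]
        version[dest] = version.get(dest, 0) + 1   # allocate a fresh version
        renamed.append((dest + ascii_lowercase[version[dest]], new_sources))
    return renamed

program = [
    ("R3", ["R3", "R5"]),  # I1
    ("R4", ["R3", "1"]),   # I2
    ("R3", ["R5", "1"]),   # I3
    ("R7", ["R3", "R4"]),  # I4
]
for dest, sources in rename_registers(program):
    print(dest, ":=", " + ".join(sources))
```

After renaming, I3 writes R3c while I2 still reads R3b, so the WAR and WAW conflicts on R3 disappear and only the true (RAW) dependencies remain.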
Superscalar Execution
Machine Parallelism
To enhance the performance of the Super Scalar Processors
Duplication of resources
Out-of-order instruction issue
Register renaming
Review Questions
How is instruction level parallelism different from machine parallelism?
Identify the various data hazards among the following instructions:
I1: R1=20
I2: R1=R3+R4
I3: R2=R4-10
I4: R4=R1+R2
I5: R2=R1+40
Summary
Instruction level parallelism vs. machine level parallelism
Superscalar pipelines
Performance limitation factors
Superscalar processor design issues
Register renaming
Eliminates WAR and WAW hazards