
CS220: Introduction to Computer Organization
2011-12 Ist Semester
Bluespec SystemVerilog (TM) - I

Amey Karkare
karkare@cse.iitk.ac.in
Department of CSE, IIT Kanpur

Hardware Description

Why do we need them?
  To verify a design
  To programmatically generate the circuit (as a collection of gates), aka synthesis

What is important?
  The designs are based on digital circuits.
  Digital signals have values 0, 1 (as input).
  Output values are 0, 1, Z.
    Z is a state rather than a value.
  For simulation/verification, it is required to catch errors due to non-assignments:
    X state for the output, X value for the input.

A Half Adder

Inputs: a, b
Outputs: sum, cout
Combinatorial circuit

typedef struct { Bit#(1) sum; Bit#(1) cout; } ResultT
   deriving (Bits, Eq);

interface Ha_IFC;
   method ResultT halfAdd(Bit#(1) a, Bit#(1) b);
endinterface

module mkHa(Ha_IFC);
   method ResultT halfAdd(Bit#(1) a, Bit#(1) b);
      return (ResultT { sum: a^b, cout: a&b });
   endmethod
endmodule

Combinational Structures and Basic Types

Expressions are combinational circuits
  constants, variables, operators and function applications
Basic Types
  Strong typing
  Types describe values (independent of wires and storage elements)
Variables: declaration, assignment, control structures
  The non-procedural or combinational view of variable assignment
  Think HW, don't think simulation!
Functions are simply parameterized combinational circuits
  Function application simply connects a parameterized combinational circuit to actual inputs
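A function application in BSV simply wires a parameterized combinational circuit to its actual inputs. As a minimal sketch (not from the slides; the names fullAdd and FResultT are illustrative only), a full adder can be built by chaining the half-adder logic twice inside one pure function:

typedef struct { Bit#(1) sum; Bit#(1) cout; } FResultT
   deriving (Bits, Eq);

function FResultT fullAdd(Bit#(1) a, Bit#(1) b, Bit#(1) cin);
   Bit#(1) s1 = a ^ b;       // first half-adder sum
   Bit#(1) c1 = a & b;       // first half-adder carry
   Bit#(1) s  = s1 ^ cin;    // second half-adder sum
   Bit#(1) c2 = s1 & cin;    // second half-adder carry
   return FResultT { sum: s, cout: c1 | c2 };
endfunction

Applying fullAdd(x, y, c) anywhere in a design instantiates exactly this combinational logic with x, y and c as its inputs.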
Expressions

Built from variables, constants, operators and function applications
Describe data-flow in a combinatorial circuit
Variables are just names for the wires!

a = 10;
b = 12;
c = 42;
dsq = b*b - 4*a*c;
d = sqrt(dsq);

Basic Types

Types play a central role in BSV
Types are described with Type Expressions
  Simple type expressions are just identifiers
  Complex type expressions are built from simple type expressions using type constructors.
In general, type identifiers begin with an uppercase letter
  Exceptions: int and bit, for compatibility with Verilog

Strong Typing

Every variable and expression has a type
The compiler checks that constructs in the language are applied correctly according to types:
  Operators'/functions' arguments are of the correct type
  Assignment is to the correct type
  Modules' parameters are of the correct type
  Modules' interface is of the correct type
In case of mismatch, it issues an error message

Strong Typing

No automatic sign- or zero-extension; no automatic truncation
But you don't have to tediously calculate the amount of extension or truncation; the compiler will do it for you

bit [31:0] x;

x = signExtend (25'h9BEEF);

x = zeroExtend (25'h9BEEF);

x = { 0, 25'h9BEEF };         // same as zeroExtend

x = zeroExtend (39'h9BEEF);   // error: input too wide

x = truncate (37'h9BEEF);
x = truncate (25'h9BEEF);     // error: input too narrow
Variables

Every variable has a type


type var [= init], var [= init], ...;
int x, y = 23, z;
Bool b;
z = y + 2;
x = z * z;
b = (x >= 23);
A variable is not an updatable container
An assignment does not update a container
Registers and state elements are modules
A variable is just a name for an expression



CS220: Introduction to Computer Organization
2011-12 Ist Semester
Bluespec SystemVerilog (TM) - II

Amey Karkare
karkare@cse.iitk.ac.in
Department of CSE, IIT Kanpur

Variables

Every variable has a type
type var [= init], var [= init], ...;

int x, y = 23, z;
Bool b;
z = y + 2;
x = z * z;
b = (x >= 23);

A variable is not an updatable container
  An assignment does not update a container
  Registers and state elements are modules
A variable is just a name for an expression

Variable Assignments

Repeated assignment is just a notation for incrementally building up expressions
Every assignment creates a new copy of the variable.
  The new copy is available from the point of assignment.
Think hardware, not simulation or software!

Variable Assignments

int a = 10;
a = a + 1;
a = a * a;
b = a + c;

This is treated as if written as:

int a = 10;
a1 = a + 1;
a2 = a1 * a1;
b = a2 + c;

Variable Assignments

int a = 10;
if (b) a = a + 1;
else   a = a * a;
c = a + 3;

This is treated as if written as:

int a = 10;
a1 = a + 1;
a2 = a * a;
a3 = MUX(b, a1, a2);   // b ? a1 : a2
c = a3 + 3;

Variable Assignments

int a = 10;
for (int k = 2; k < 5; k = k+1)
   a = a + k;

This is treated as if written as:

int a = 10;
int k = 2;
a1 = a + k;
k1 = k + 1;    // 3
a2 = a1 + k1;
k2 = k1 + 1;   // 4
a3 = a2 + k2;
k3 = k2 + 1;   // 5

Types

An important property of BSV's strong typing:
Any expression is guaranteed by BSV's type-checking rules to represent a pure (combinational) value:
  It cannot allocate any state
  It cannot update any state
  except if its type contains either of the following two special types (to be described later):
    Action
    ActionValue
Hence, any such expression can be freely shared or replicated without changing behavior
The BSV compiler exploits this to perform aggressive common subexpression elimination (CSE) optimization

Types: typedefs and enums

Type synonyms defined with typedef
  typedef existingType NewType;
  typedef int PackID;
  typedef bit[63:0] Data;
enum defines a set of scalar values with symbolic names
  typedef enum { Id, ... , Id } NewType deriving (Bits, Eq);
  typedef enum { Red, Green, Blue } Color deriving (Bits, Eq);
  typedef enum { Start, End, Error } State deriving (Bits, Eq);
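As a small illustrative sketch (not from the slides), an enum value can be used like any other value in a pure combinational function; deriving (Eq) is what makes the == comparison available:

typedef enum { Start, End, Error } State deriving (Bits, Eq);   // enum from the slide above, repeated so the sketch stands alone

function Bool isError(State s);
   return (s == Error);      // pure combinational comparison
endfunction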
Types: struct

A struct is a composite type.
A struct value is a collection of values, each of which has a particular type and is identified by a member or field name.

typedef struct { type id; ... ; type id; } NewType
   deriving (Bits, Eq);

typedef enum { Load, Store, LoadLock, StoreCond }
   Command deriving (Bits, Eq);
typedef struct {
   Command command;
   bit [31:0] addr;
   bit [63:0] data;
} BusRequest deriving (Bits, Eq);

Using let for struct initialization

It is often convenient to use let for declaring and initializing a struct value:

let req = BusRequest {
   command: Load,
   addr: baseAddr + 32'h16,
   data: 64'h9BEEF };

The compiler deduces req to have type BusRequest
lets are not allowed in global scope

Each struct type is distinct

Each defined struct type is a new type
  Distinct from all others
  Even though they may happen to have members with the same types and same names

typedef struct {int a; Bool b;} Foo deriving (Bits, Eq);
typedef struct {int a; Bool b;} Bar deriving (Bits, Eq);
typedef Bar Baz;   // type alias or synonym

Foo x;
Baz y;
Bar z;
x = y;      // type-checking error
x.a = y.a;  // ok
x.b = y.b;  // ok
y = z;      // ok

Parameterized types

Many types have some other types associated with them in some orthogonal (independent) way.
  With each array type, we associate the type of each item contained in the array
  With each memory type, we associate the type of addresses and the type of data
  With each register file type, we associate the type of register names and register data
SystemVerilog introduces a notation for this:
  Type #( Type, ..., Type )

Mem #(Addr, Data)
RegFile #(RegName, RegData)
Client #(Request, Response)   // yields Requests, accepts Responses
Server #(Request, Response)   // accepts Requests, yields Responses
Numeric (Size) Types

In BSV, some type parameters are numeric and indicate something about the size of each value of that type.

Type           Meaning                                Example
Bit#(n)        Bit-vector of width n                  Bit#(132) vect1 = 132'd30;  // bit[131:0]
               (NB: bit [n-1:0] = Bit#(n))
UInt#(n)       Unsigned integers of width n           UInt#(4) vect2 = 4'o1;
Int#(n)        Signed integers of width n             Int#(16) vect3 = 16'hFF00;
               (NB: int = Int#(32))
Vector#(n,t)   Vector of n elements, each of type t   Vector#(3, Bool) vect4;
                                                      Vector#(14, Int#(32)) vect5;
                                                      Vector#(16, Tuple2#(Bool, Bit#(8))) vect6;

Advanced Types

Maybe#(t)
  A value of some type t with an accompanying valid bit which says whether the value is meaningful or not

Maybe#(int) m1 = tagged Valid 23;   // valid bit True, value 23
Maybe#(int) m2 = tagged Invalid;    // valid bit False, value unspecified

m2 = m1;   // This sets m2 to Valid 23

// Some functions
Bool b = isValid (m2);        // b == valid bit of m2
int d = fromMaybe (34, m2);   // d = value of m2 if valid, else 34

Advanced Types

Tagged Unions
  Types that need tags to identify one of a set of possible types
  Maybe is a particular case of a tagged union: Valid/Invalid

// Define a tagged union
typedef union tagged {
   struct {
      Bit#(8) header;
      Bit#(8) payload;
      Bit#(8) trailer;
   } LongFrame;
   struct {
      Bit#(10) payload;
      Bit#(8) trailer;
   } ShortFrame;
   Bit#(10) Symbol;
} BusTraffic;

// Write a tagged union member
BusTraffic busElem1, busElem2;

busElem1 = tagged Symbol 10'd6;

busElem2 = tagged ShortFrame {
   payload: 10'd2,
   trailer: 8'd1
};


CS220: Introduction to Computer Organization
2011-12 Ist Semester
Bluespec SystemVerilog (TM) - III

Amey Karkare
karkare@cse.iitk.ac.in
Department of CSE, IIT Kanpur

Getting Started With BSV
A Sample Design

GOAL: Design of a Divider hardware
  Two inputs: a is the dividend, b is the divisor
  Output: The result of the division
    Valid bit to check for the division-by-0 exception

Package

In a good design, code is organized into packages
A package is a collection of related definitions
Only one package per file
  the file name for a package Tb must be Tb.bsv
  (i.e. file name is <package-name>.bsv)
To use functionality of other packages, packages need to be imported explicitly
  Using the import command
Package structure does not relate to hardware structure
  hardware structure is related to the module hierarchy.

Package

file: Divider.bsv

package Divider;
   ...
   ...
endpackage: Divider

Interface

An interface provides a means to group wires into bundles
  These wires have specified uses, described by methods
An interface consists of several (related) methods
An interface declaration can have several instances
  Different instances of the same interface can define the same member method differently.
  However, the static structure (names of wires, names of methods and their parameters) remains the same.
Interface

file: Divider.bsv

package Divider;

   interface Div_Ifc;
      method Maybe#(int) divide(int a, int b);
   endinterface: Div_Ifc
   ...
endpackage: Divider

Module

Modules are the building blocks of the design
A module consists of three things:
  state
  rules that operate on state
  an interface to the outside world
A module definition specifies a scheme that can be instantiated multiple times.

Module

file: Divider.bsv

package Divider;

   interface Div_Ifc; ... endinterface

   module mkDiv(Div_Ifc);
      method Maybe#(int) divide(int a, int b);
         Maybe#(int) res;
         if (b == 0) res = tagged Invalid;
         else        res = tagged Valid (a / b);
         return res;
      endmethod
   endmodule: mkDiv
   ...
endpackage: Divider

Module

package Divider;

   interface Div_Ifc; ... endinterface
   module mkDiv(Div_Ifc); ... endmodule

   module mkDivider(Empty);
      Maybe#(int) r1, r2;
      Div_Ifc mdiv <- mkDiv;
      r1 = mdiv.divide(128, 64);
      r2 = mdiv.divide(128, 0);
      rule theDivisor;
         $display("Division by non zero (128 / 64) is: %b:%0d",
                  isValid(r1), fromMaybe(0, r1));
         $display("Division by zero (128 / 0) is: %b:0x%x",
                  isValid(r2), fromMaybe(32'hbad1, r2));
         $finish(0);
      endrule
   endmodule: mkDivider

endpackage: Divider
Running the design

# BSV to Verilog generation
bsc -verilog Divider.bsv

# Verilog to code generation
bsc -vsim iverilog -e mkDivider -o mkDivider_v \
    mkDivider.v

# Code simulation
./mkDivider_v

Result:

Division by non zero (128 / 64) is: 1:2
Division by zero (128 / 0) is: 0:0xbad1


CS220: Introduction to Computer Organization
2011-12 Ist Semester
Bluespec SystemVerilog (TM) - IV

Amey Karkare
karkare@cse.iitk.ac.in
Department of CSE, IIT Kanpur

Registers

Registers are modules
  They need to be instantiated explicitly
  They're at the leaves of the module hierarchy
Like all modules, registers have interfaces
Register interfaces are parameterized types

Reg #(Int#(32))   // interface to a register that contains Int#(32)
Reg #(Bit#(16))   // interface to a register containing Bit#(16) values
Reg #(State)      // interface to a register containing a State value
Reg #(Request)    // interface to a register containing a Request value

Registers are strongly-typed

Register Modules

Instantiating registers follows standard module instantiation syntax
The mkReg() module instantiates a register with a given reset value
  The initial value must, of course, have the correct type for the type of the register (else type-checking error)
The mkRegU module instantiates a register with an unspecified reset value

Reg #(Int#(32)) r1 <- mkReg (0);   // synchronously reset to 0
Reg #(Bit#(16)) r2 <- mkRegU;      // unspecified initial value
Reg #(Request)  r4 <- mkReg(Request { op: Load, addr: 0 });

Writing and reading registers

The Reg#() interface presents the methods to write to and read from a register:

interface Reg#(type t);
   method Action _write(t a);
   method t _read;
endinterface

Any register module defines the _write method as "store the value" and the _read method as "return the value".
Writing and reading registers

For ease of use:
  a register on the left of an assignment is treated as a write operation
  a register on the right side of an assignment is treated as a read operation

Reg #(Int#(32)) r1 <- mkReg(0);
...
r1 <= 23;
// is equivalent to r1._write(23);
...
let a = r1;
// is equivalent to let a = r1._read;

Rule

All behavior in BSV is expressed using Rules
A rule is a declarative specification of a state transition
A rule is an Action guarded by a Boolean condition
Syntax:

rule ruleName [( condition )];
   actions
endrule [: ruleName]

Over-simplified analogy to a Verilog always block:

always @(posedge CLK)
   if (cond) begin
      actions
   end

This analogy does not always hold.
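Putting a register and a rule together, here is a minimal sketch (my own example, not from the slides; the module name mkCountTen and the bound of 10 are arbitrary):

module mkCountTen (Empty);
   Reg#(Bit#(4)) count <- mkReg(0);   // state: a 4-bit register, reset to 0

   rule tick (count < 10);            // rule condition: fire while count < 10
      count <= count + 1;
   endrule

   rule done (count == 10);           // fires once the bound is reached
      $display("count reached %0d", count);
      $finish(0);
   endrule
endmodule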

Rules Fire

Rules don't control clocks
  They only generate enable logic and muxing
You define all modules and state
No clock or reset?
  Those are wired directly to state elements.

When does a rule fire?

Every cycle!
  Unless you tell it not to (rule conditions)
  Unless a child module tells it not to (ready conditions)
  Unless a more important rule needs to fire instead (conflicts)
Firing means it is enabled and selected for this cycle.
Rules always try to fire unless you or another rule tells it not to.

Rule Conditions

Rule conditions are arbitrary expressions of type Bool
Such expressions are purely combinational
  This is guaranteed by BSV's type-checking rules
If this condition is false, the rule doesn't fire
No condition means default True

rule rule1 (state == TRANSFER);
   ... rule actions ...
endrule : rule1

rule go (trigger == 1 && state != IDLE);
   ... rule actions ...
endrule : go

Ready Conditions

Methods have ready signals
  Classic examples: FIFOs
    fifo.enq() ready when !full
    fifo.first() ready when !empty
  Ready can be always True
Ready signals are specified for each method in the defining module
A rule doesn't fire unless all ready conditions are true

rule rule1;
   // fifo.notFull;
   fifo.enq( 10 );
   ...
endrule

rule rule2;
   // fifo.notEmpty;
   if (cond)
      let x = fifo.first();
   ...
endrule
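A hedged sketch of ready conditions in action (my own example, assuming the standard FIFO library package): neither rule below has an explicit condition, yet produce can only fire when the FIFO is not full and consume only when it is not empty, because enq/first/deq carry those implicit ready conditions.

import FIFO::*;

module mkFifoDemo (Empty);
   FIFO#(int) fifo <- mkFIFO;
   Reg#(int)  n    <- mkReg(0);

   rule produce;                 // implicitly requires: fifo not full
      fifo.enq(n);
      n <= n + 1;
   endrule

   rule consume;                 // implicitly requires: fifo not empty
      let x = fifo.first();
      fifo.deq();
      $display("got %0d", x);
   endrule
endmodule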

Conflicting Rules

A rule may not fire because it conflicts with other rules
  The compiler will warn you
But what is a conflict?

rule rule1;
   x <= x + 1;
endrule

rule rule2;
   x <= x + 2;
endrule

Rule Actions

The simplest rule action is a register assignment:
  r <= ... expression ...
BSV uses <=, non-blocking assignment, which is a write action to a register
A rule body can contain multiple actions
There is no sequencing of actions in a rule body: all the actions happen simultaneously, in parallel
The following two rules are equivalent:

rule rule1 (st == T);
   valuea <= valuea + 1;
   mult   <= valuea * 2;
endrule : rule1

rule rule2 (st == T);
   mult   <= valuea * 2;
   valuea <= valuea + 1;
endrule : rule2

Parallel Composable Rule Actions

Because there is no sequencing of actions in a rule body (all the actions happen simultaneously, in parallel), it is meaningless to put conflicting actions in the same rule, as in the examples below:

rule rule3 (...);
   valuea <= valuea + 1;
   valuea <= valuea + 2;
   ...
endrule : rule3

rule rule4 (...);
   fifo.enq (23);
   fifo.enq (34);
   ...
endrule : rule4

The compiler will flag such errors
The term "parallel composable" is used for actions that can be in the same rule body

Resource conflicts in Rule Actions

A rule also cannot contain resource conflicts, even if they are combinational
  E.g., trying simultaneously to read two different registers in a register file using a single read port
The compiler will flag such errors

rule rule5 (...);
   let x = regFile.sel1(5);
   let y = regFile.sel1(7);
   ...
endrule : rule5
CS220: Introduction to Computer Organization
2011-12 Ist Semester
Welcome

Amey Karkare
karkare@cse.iitk.ac.in
Department of CSE, IIT Kanpur

Basics

Required Background
  Programming (native compilation techniques)
    Pick up C/C++ if not known already.
  Knowledge of digital gates, flip-flops, latches, counters etc.
  Binary number system and operations

What will be covered?
  Programming using Assembly Language
  Circuit descriptions in Bluespec Verilog (BSV)
  Computer Organization

Course Structure

Lectures:
  Monday, Tuesday, Thursday; 10:00 - 11:00 AM, CS101
Labs:
  Monday to Friday; 3:00 - 5:00 PM
  Groups of 3 / 4 students
  New language for design descriptions: BlueSpec Verilog (BSV)
  HW programming using FPGAs

Course Policies

One midsem exam   25%
One endsem exam   40%
Labs              25%
Assignments       10%

Tentative. May change as the course progresses.
Course will be heavy
Text

No text books
  Follow the lecture notes very closely
Reference Books
  BlueSpec Verilog:
    Rishiyur S. Nikhil and Kathy Czeck: BSV by Example
    [www.bluespec.com/forum/download.php?id=140]
  Assembly Language:
    Online notes, http://linuxassembly.org/
  Computer Organization:
    Patterson, Hennessy: Computer Organization and Design
    Hamacher et al.: Computer Organization
    Tanenbaum: Structured Computer Organization
    Parhami: Computer Architecture
    Stallings: Computer Organization and Architecture

Review of Digital Circuits

Basic Gates
  AND, OR, NAND, NOR, XOR, XNOR, NOT, BUF
Memory Elements
  Latch, Flip Flops
  Registers
  Level Sensitive, Edge Triggered
  Input/output behaviour (Parallel or Serial): SISO (Delay Line), SIPO, PISO, PIPO
Basic Circuits
  Multiplexers, Counters, ALU, Adders, Multipliers

Combinatorial Gates

always @(a or b)
   y = a & b;

OR

assign y = a & b;

Memory Elements

Latches
  always @( LE or D )
     if (LE == 1) Q = D;
Flip Flops
  always @( posedge CLK )
     Q = D;

Latches are level sensitive while flip flops are edge triggered.
A Computer

Computer is a machine
  Stored program model
  Sequential order of execution
CPU
Memory: Program and data storage
Disk: File storage (passive data storage)
I/O

The machine is electronic

As long as power is supplied, the processor keeps executing instructions

Fetch-Execute-Store Model

Processor reads an instruction from memory
  Instruction: a sequence of bits, understood and operated upon by the processor
Processor interprets these bits and operates on data
  Stored in internal registers or in memory
Result of the execution is stored in memory/registers.
Processor continues fetching the next instruction.
Program:
  Collection of instructions that are fetched and executed one after another

Fetch-Execute-Store Model

Processor takes a program only as a sequence of bits
  Machine Language
We understand programs in various forms
  Assembly language: one-to-one correspondence with the machine language
  High level languages:
    Translators are needed to convert to machine language (compilers)
  Programs for a virtual machine:
    Languages such as Java compile code to a virtual machine, which is then interpreted by a tool.
CS220: Introduction to Computer Organization
2011-12 Ist Semester
Introduction

Amey Karkare
karkare@cse.iitk.ac.in
Department of CSE, IIT Kanpur

Anatomy of a program

High level language (HLL)
  Variables, expressions, objects, loops etc.
A program in HLL is translated to machine language
  High level constructs are mapped to machine instructions:
    Variables mapped to memory/registers
    Expression evaluation to reading and operating on the operands
    Loops etc. to program control instructions.
  The conversion is a several-step process.

Generating Machine Code

Compilers generate a machine program that is not ready to execute yet.
  Programs are organized in several files.
  Each file is compiled separately.
  Compilation results in object files.
  Programs use several libraries.
The linker links all compiled files to generate a single machine program
  Not ready to execute yet, but all references to other functions are resolved.
The loader loads an executable file into memory and causes the CPU to execute it (OS functionality)
  The machine code in memory is ready to execute.
A flavor of a machine program

Machine programs are hard to understand
We use assembly language to write them
Assembly language is processor specific
  Pentium processor assembly language is different from Sparc, ARM etc.
  Specific to the machine architectures:
    Registers, operations etc.
An assembly language program that prints "hello world" on stdout:

#include <asm/unistd.h>
#include <syscall.h>
#define STDOUT 1
.data
hello:
    .ascii "hello world\n"
helloend:
.text
.globl _start
_start:
    movl $(SYS_write),%eax        // SYS_write = 4
    movl $(STDOUT),%ebx           // fd
    movl $hello,%ecx              // buf
    movl $(helloend-hello),%edx   // count
    int  $0x80

    movl $(SYS_exit),%eax
    xorl %ebx,%ebx
    int  $0x80
    ret
CS220: Introduction to Computer Organization
2011-12 Ist Semester
Introduction - II

Amey Karkare
karkare@cse.iitk.ac.in
Department of CSE, IIT Kanpur

Story is different for Java

Java programs do not compile to the machine code of the native processor.
Java provides a virtual machine
  Java programs are compiled to this machine.
  The VM language is Java Bytecode.
When run on a real machine, the Java VM is simulated.
What is coming up next?

Introduction to Hardware Description Language
Representation of data
Assembly Language programming
Building blocks for ALU
CPU structure
Memory Interface
I/O architecture
  Bus organizations
  I/O techniques
    Polling, Interrupts, DMA
  I/O devices
Processor Cache (if time permits)

Hardware Description

Why do we need them?
  To verify a design
  To programmatically generate the circuit (as a collection of gates), aka synthesis
What is important?
  The designs are based on digital circuits.
  Digital signals have values 0, 1 (as input).
  Output values are 0, 1, Z.
    Z is a state rather than a value
  For simulation/verification, it is required to catch errors due to non-assignments.
    X state for the output, X value for the input.

CS220: Introduction to Computer Organization
2011-12 Ist Semester
Representation of Numbers

Amey Karkare
karkare@cse.iitk.ac.in
Department of CSE, IIT Kanpur

Numbers

There are only 10 types of people in this world:
  Those who understand Binary
  And those who don't.
If you don't get the joke, refresh your concepts of Binary numbers.

Binary Number System

The binary system is as natural to a digital machine as the decimal system is to a human.
  Man has 10 fingers; learns counting using them.
  Digital systems have two logic states
    0/1 or True/False or On/Off
A number in a number system is represented using the digits of that system
  v = d_(n-1) d_(n-2) ... d_2 d_1 d_0
  Decimal digits are 0 to 9.
  Binary digits are 0 and 1.
Maximum value: b^n - 1, where b : base, n : number of digits.

Number System

Consider the number 100011.
  100011 can represent a number in any base >= 2.
  The value will be different in different bases.
When we talk about a number (say 35), we really talk about its value. The value of a given number is the same irrespective of the base.
Confused?
  The confusion arises because we always have the decimal base in mind.
For writing it (on paper) we need an unambiguous representation.
  b100011 is h23 or o43 or d35
  o100011 is h8009 or d32777

Representation on a Computer

Computers have a finite and fixed amount of storage for data types.
  An n-bit data type to contain an integer can store only n bits of information.
Let's first consider unsigned (non-negative) numbers only.
  In n bits, the maximum value that can be stored is 2^n - 1
  Minimum value: 0
Typical values of n are powers of 2: 8, 16, 32, 64 etc.

Signed Numbers

The sign can be positive or negative.
For the decimal number system, we use separate symbols (+ or -) to indicate the sign.
  The + sign is implicit most of the time.
In the binary system, one bit is needed for the sign.
  For example, say 0 for positive numbers and 1 for negative numbers.
  Thus, 5 is 0101 in binary, and -5 is 1101.
But the story is more complex here.

Signed Numbers

Sign-magnitude representation
  Sign as a bit (0: positive, 1: negative)
  Magnitude as an unsigned number
    in n-1 bits, if a total of n bits are available.
  Maximum value: 2^(n-1) - 1, minimum value: -(2^(n-1) - 1)
  Two separate representations for 0: +0 and -0.
2's complement number system
  Easy for computations. No separate signed addition/subtraction needed.

2's Complement Number System

If n bits are available, represent the positive numbers as unsigned numbers.
Negative numbers are first added to 2^n, and then the resulting positive number is represented.
  2^n is the same as 0 if only n bits are used to represent the number.
Say n = 8, so 2^n = 256.
  Representation of 25: 00011001
  Representation of -25 = representation of 231 (= 256 - 25): 11100111
Max value: 2^(n-1) - 1. Min value: -2^(n-1).
  Proof: Practice Exercise.

CS220: Introduction to Computer Organization
2011-12 Ist Semester
Representation of Numbers - II

Amey Karkare
karkare@cse.iitk.ac.in
Department of CSE, IIT Kanpur

2's Complement Numbers

To compute the representation of a negative number faster:
  find the representation of its absolute value in n bits,
  then invert all bits of this number and add 1 to it.
For example, -10 in 8 bits:
  10 in 8 bits is 00001010.
  Invert all bits to get: 11110101.
  Add 1 to it to get: 11110110.

2's Complement Makes Arithmetic Easy

For addition A + B: add the bit patterns of the two numbers irrespective of whether the numbers are positive or negative.
For subtraction A - B: use the addition A + (-B). -B can be found by 2's complement.
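The invert-and-add-1 recipe can be written directly as a small BSV function; this is an illustrative sketch (not from the slides), using pack/unpack to move between the signed value and its bit pattern:

function Int#(8) negate2c(Int#(8) x);
   Bit#(8) b = pack(x);      // view the 2's-complement bit pattern
   return unpack(~b + 1);    // invert all bits, add 1
endfunction

For x = 10 (00001010) this returns 11110110, i.e. -10, matching the worked example above.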
Adder and Subtractor

[Figure: n-bit ADDER with operand inputs A and B, an Add(0)/Subtract(1) control line, carry-in C_in and carry-out C_out.]

Representation of Characters

Characters are coded.
All I/O devices can only print characters.
  To print an integer, its corresponding character-array representation first needs to be found and printed.
  The printf function does this for us when we use %d.
ASCII coding
  American Standard Code for Information Interchange.
  Use man ascii for details.

Representation of Characters

Internationalization of scripts demanded more space for characters.
Unicode character set.
  U+ followed by a hexadecimal number.
  U+0000 to U+007F for English (ASCII equivalent).
  Indian scripts are U+0900 to U+0D7F: Devanagari, Bengali, Gurumukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam.

Real Numbers

Not all real numbers can be represented in computers
  By definition, between any two distinct real numbers there are infinitely many real numbers.
First we need to understand the binary representation of real numbers.
Real Numbers

Binary Point

  1110 . 1101
  (Integral Part . Fractional Part)

The binary point is analogous to the decimal point.
  It differentiates between the binary integral part and the binary fractional part.
  The decimal point differentiates between the decimal integral part and the decimal fractional part.
Positional value of each binary digit:
  2^i for the integral part
  1/2^i for the fractional part

Real Numbers: Fixed Point

Assume that we have n bits to represent a real number.
If the position of the binary point is fixed (say 4 locations from the right):
  Value = 2^3 + 2^2 + 2^1 + 2^-1 + 2^-2 + 2^-4 = 14.8125
  We do not need to store any information about the binary point.
  Bit pattern 11101101 is an 8-bit fixed point representation with the implied binary point at the fourth bit from the right (1110.1101).
  This bit pattern is for 14.8125 in decimal.
  The same bit pattern, when treated as an unsigned number, indicates 237.
In general the bit pattern for a fixed point number f is the same as the bit pattern for the integer f * 2^i, where i is the position of the implied binary point.
Fixed point numbers can be signed or unsigned.
Precision: 2^-i
Range: 2^-i times the range of the corresponding integer.
CS220: Introduction to Computer Organization
2011-12 Ist Semester
Floating Point Numbers

Amey Karkare
karkare@cse.iitk.ac.in
Department of CSE, IIT Kanpur

Fixed Point Numbers

Advantage:
  Simple format: same as the integer format
  Simple algorithms:
    Add/Subtract: same as integer add/subtract
    Multiply/Divide: integer multiply/divide followed by a bit shift
Disadvantage:
  If the data set is large, one ends up with a huge data structure.
  For example, if a variable needs to be able to store the mass of a planet or the mass of a person to reasonable accuracy, it will have a large number of bits for the integer part.

Scientific Representation

Numbers in scientific representation are stored with a significand and an exponent.
  Mass of an electron: 9.1 x 10^-28 g
  Mass of a person: 6.0 x 10^4 g
In standard (normalized) form, the significand is in [1.0, 10).
Zero can not be written as a number in normalized form.

Scientific Numbers in Binary

Exponents are powers of 2.
Significands are binary.
In normalized form, the significand is in [1.0, 2).
  The integer part is always 1.xxxx
1.0011E+100 represents 1.0011 x 2^100 (exponent in binary, i.e. 2^4), which is b10011, or 19.

Floating Point Numbers

Exponents can be positive or negative.
  Exponents are stored as biased integers.
  If e is the exponent, then e + b is stored as an unsigned number for the exponent.
  b is the bias.
Numbers are normalized.
  The significand is of the form 1.xxxx...
  The leading "1." need not be stored!

IEEE754 Single Precision Numbers

[Bit layout: bit 31 = S (sign); bits 30-23 = Exponent (excess-127); bits 22-0 = Mantissa]

Used by most programming languages as the float data type.
32 bits wide
Includes
  Sign: sign of the number
  Mantissa: significand without the leading "1."
  Exponent: biased by 127, i.e. excess-127.

IEEE754 Single Precision Numbers

Consider -24.75 in decimal
  In binary: -11000.11
  Scientific binary: -1.100011 x 2^4
  Sign: 1 (negative)
  Exponent: 4 + 127 = 131, or 10000011 in binary
  Mantissa: 100011000000000000000000
IEEE754 bit pattern: 1 10000011 100011000000000000000000
In hex: 'hC1C60000
Double Precision Numbers

64 bits in size.
  Sign: 1 bit
  Exponent: 11 bits, excess-1023
  Mantissa: 52 bits

Problems with Representation

Can not represent 0.0 in this way.
  There is no "1." prefix in its representation.
Real number computations also need +∞ and -∞.

Special Cases

+0.0 or -0.0:
  Sign: 0 (for +), 1 (for -).
  Exponent and Mantissa: all zeroes.
  The bit pattern for integer 0 is the same as the bit pattern for float +0.0.
+∞ / -∞:
  Sign: 0 (for +), 1 (for -).
  Exponent: all 1s (0xFF for single precision, 0x7FF for double precision).
  Mantissa: all zeroes.

Special Cases

Not-a-Number (NaN)
  Exponent: all 1s.
  Mantissa: other than all zeroes.
Denormalized numbers:
  Needed to store small intermediate results.
  Exponent: all zeroes.
  Mantissa: other than all zeroes.
  Assumed to be: 0.m x 2^(1-b).
Limits

Max normalized value:
  Exponent: 0xFE (single precision, SP) or 0x7FE (double precision, DP)
    i.e. 2^127 (SP), 2^1023 (DP)
  Mantissa: 111...111
    Significand: close to 2 with the assumed "1." prefix.
  Max value close to 2^128 (SP), 2^1024 (DP)
Min positive normalized value:
  Exponent: 0x01, i.e. 2^-126 (SP), 2^-1022 (DP)
  Mantissa: 000...000, i.e. significand 1.0
  Min value: 2^-126 (SP), 2^-1022 (DP)
Precision: 2^(e-23) (SP), 2^(e-52) (DP)
  Precision is relative to the number (exponent = e)

Limits for Denormalized Numbers

Min (positive) value:
  Mantissa: 000...001
  Exponent: 000...000
  2^-149 for SP, 2^-1074 for DP
Max (positive) value:
  Mantissa: 111...111 (0.111...111, just below 1.0)
  Exponent: 000...000
  Close to 2^-126 (SP), 2^-1022 (DP)
CS220: Introduction to Computer Organization
2011-12 Ist Semester
Arithmetic Operations - I

Amey Karkare
karkare@cse.iitk.ac.in
Department of CSE, IIT Kanpur

Addition and Subtraction

Assume that the numbers are represented using the 2's complement representation scheme.
If two numbers a and b are to be added:
  Use an n-bit adder to add them.
  Output: n-bit Sum + Carry_out
For computing a - b, use a + (-b).
  -b can be computed by inverting all bits and adding 1.
  In a combined adder/subtractor, XOR gates are used to invert.
  Carry_in is set to 1 to add the 1.

Adder and Subtractor

[Figure: n-bit ADDER with operand inputs A and B, an Add/Subtract control line, carry-in C_in and carry-out C_out.]

Overflow/Underflow

When the result can not be represented in the allocated number of bits:
  Result > Max that can be represented: Overflow.
  Result < Min that can be represented: Underflow.
Unsigned number representation using n bits
  Overflow when result > 2^n - 1.
  Underflow when result < 0.
Signed number representation using n bits
  Overflow when result > 2^(n-1) - 1.
  Underflow when result < -2^(n-1).
Addition of Unsigned Numbers: Overflow/Underflow

n = 4 bits
To compute: 4 + 8
  4'd4 = 4'b0100
  4'd8 = 4'b1000
  Sum  = 0 1100 (Carry_out = 0: no overflow)

n = 4 bits
To compute: 9 + 8
  4'd9 = 4'b1001
  4'd8 = 4'b1000
  Sum  = 1 0001 (Carry_out = 1: indicates overflow in the addition)

Addition will not underflow in the case of unsigned numbers.

Subtraction of Unsigned Numbers: Overflow/Underflow

n = 4 bits
To compute: 5 - 6, i.e. 5 + ~6 + 1 (Carry_in = 1)
  4'd5                  = 0101
  4'd6 bitwise inverted = 1001
  Sum                   = 0 1111 (Carry_out = 0)

n = 4 bits
To compute: 7 - 6, i.e. 7 + ~6 + 1 (Carry_in = 1)
  4'd7                  = 0111
  4'd6 bitwise inverted = 1001
  Sum                   = 1 0001 (Carry_out = 1)

Carry_out = 0 indicates underflow in the subtraction.
Subtraction will not overflow in the case of unsigned numbers.

Addition of Signed Numbers: Overflow/Underflow

While adding a positive number to a negative number
  No overflow or underflow can occur.
Overflow can occur when two positive numbers are added and the result is out of range.
  After addition, the result will appear negative.
Underflow can occur when two negative numbers are added and the result is out of range.
  After addition, the result will appear positive.

Signed Addition

Overflow condition (C = A + B)
  Sign of A (S_A) = 0, S_B = 0, S_C = 1.
  S_C can become 1 only when the carry into the sign bit is 1. In that case the carry out of the sign bit is 0.
Underflow condition (C = A + B)
  S_A = 1, S_B = 1, S_C = 0.
  S_C can become 0 only when the carry into the sign bit is 0. In that case the carry out of the sign bit is 1.
Overflow/Underflow = (Carry_in != Carry_out)
  Carry_in: carry into the sign bit.
  Carry_out: carry out of the sign bit.

Subtraction of Signed Numbers

Operation: C = A + (-B)
Overflow:  S_A = 0, S_B = 1, S_C = 1
Underflow: S_A = 1, S_B = 0, S_C = 0
Overflow/Underflow condition = (Carry_in == Carry_out)
  Carry_in: carry into the sign bit.
  Carry_out: carry out of the sign bit.

Overflow/Underflow

If the carry out in the subtraction is inverted (to reflect the borrow), the overflow/underflow conditions become independent of add/subtract.
Unsigned overflow/underflow:
  Carry out of the adder = 1.
Signed overflow/underflow:
  Carry into the sign bit != carry out of the sign bit.
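As a sketch (my own example, not from the slides), signed-overflow detection for an 8-bit addition can be written with the equivalent "same operand signs, different result sign" test, which is another way of stating the carry-in vs. carry-out-of-the-sign-bit condition above:

function Bool addOverflows(Int#(8) x, Int#(8) y);
   Bit#(8) a = pack(x);
   Bit#(8) b = pack(y);
   Bit#(8) s = a + b;                       // n-bit sum, carry-out discarded
   // overflow/underflow iff both operands have the same sign
   // and the sum's sign differs from it
   return (a[7] == b[7]) && (s[7] != a[7]);
endfunction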

Shift Operation

Shift left operation: <<
  Shift the bit string left by k bit positions.
  Bits on the left are shifted out (disappear).
  Incoming bits on the right are set to 0.
  Left shift by k bits is equivalent to multiplying by 2^k.

  Example: left shift by 2 bits
    before: 0 1 0 1 1 0 1 0 1
    after:  0 1 1 0 1 0 1 0 0
    (outgoing bits on the left, incoming bits on the right)

Shift Operation

Right shift: >>
  Unsigned shift and signed shift are different.
    In unsigned shifts, incoming bits on the left are set to 0.
    In signed shifts, incoming bits on the left are kept the same as the original sign bit.
  Outgoing bits on the right are just dropped.
  Right shift by k bits is equivalent to dividing by 2^k.
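A small sketch of the difference (my own example; it relies on BSV's usual convention that >> on UInt# is a logical shift and on Int# an arithmetic shift):

function Tuple2#(UInt#(8), Int#(8)) shiftDemo();
   UInt#(8) u = 240;        // 1111_0000, unsigned
   Int#(8)  s = -16;        // 1111_0000 in 2's complement
   UInt#(8) u2 = u >> 2;    // 0011_1100 = 60: zeros shifted in
   Int#(8)  s2 = s >> 2;    // 1111_1100 = -4: sign bit replicated
   return tuple2(u2, s2);
endfunction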
Carry Lookahead (CLA) Adder

1-bit full adder equations:
  Cout = (A + B)·C + A·B
       = (A ⊕ B)·C + A·B
  S = (A ⊕ B) ⊕ C
Let us create two signals:
  p = A ⊕ B   (aka Propagate signal)
  g = A·B     (aka Generate signal)
Then:
  Cout = p·C + g
  S = p ⊕ C

Carry Lookahead (CLA) Adder

In an n-bit adder,
  let A_i, B_i and C_i be the input bits and carry in,
  and let S_i and C_(i+1) be the sum and carry out.

C1 = p0·C0 + g0
C2 = p1·C1 + g1
   = p1·p0·C0 + p1·g0 + g1
C3 = p2·C2 + g2
   = p2·p1·p0·C0 + p2·p1·g0 + p2·g1 + g2
C4 = p3·C3 + g3
   = p3·p2·p1·p0·C0 + p3·p2·p1·g0 + p3·p2·g1 + p3·g2 + g3

C4 can be computed faster this way than through ripple carry.
The CLA adder is faster than the ripple carry adder:
  Delay O(log n) vs. O(n)
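The carry equations translate directly into a pure combinational function; this is an illustrative sketch (not from the slides) for a 4-bit slice, returning the carries {C4, C3, C2, C1}:

function Bit#(4) claCarries(Bit#(4) a, Bit#(4) b, Bit#(1) c0);
   Bit#(4) p = a ^ b;                    // propagate signals
   Bit#(4) g = a & b;                    // generate signals
   Bit#(1) c1 = (p[0] & c0) | g[0];
   Bit#(1) c2 = (p[1] & c1) | g[1];
   Bit#(1) c3 = (p[2] & c2) | g[2];
   Bit#(1) c4 = (p[3] & c3) | g[3];
   return {c4, c3, c2, c1};
endfunction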
CS220: Introduction to Computer Organization
2011-12 Ist Semester
Arithmetic Operations - III

Amey Karkare
karkare@cse.iitk.ac.in
Department of CSE, IIT Kanpur

Booth's Multiplication

Let's consider a few bit patterns of the multiplier:
100110:
  2^5 + 2^2 + 2^1
  Also the same as 2^6 - 2^5 + 2^3 - 2^1
  -2^5 + 2^3 - 2^1 in the case of signed numbers
111100111:
  2^8 + 2^7 + 2^6 + 2^5 + 2^2 + 2^1 + 2^0
  Also the same as 2^9 - 2^5 + 2^3 - 2^0
  -2^5 + 2^3 - 2^0 in the case of signed numbers

Booth's Multiplication

The bit at position -1 is assumed to be 0.
A sequence of 1s can be written as 2^x - 2^y
  y is the index of the least significant 1
  (x - y) is the length of the run

Booth's Multiplication Step

Look at two bits at a time. At the i-th iteration (i in 0 ... n-1), bits i and i-1 of the multiplier are looked at.

i   i-1   Description & Action
0   0     Middle of a run of 0s.
          No partial product is generated.
0   1     End of a run of 1s.
          Product = Product + 2^i * multiplicand
1   1     Middle of a run of 1s.
          No partial product is generated.
1   0     Start of a run of 1s.
          Product = Product - 2^i * multiplicand
Division

Divisor: n bits
Dividend: 2n bits
Quotient: 2n bits
Remainder: n bits

Division of Binary Numbers

[Worked long division: 100101 divided by divisor 101 gives quotient 111 and remainder 10.]

Restoring Bit Algorithm

Division Hardware: First Version

[Figure: Divisor (2n bits), Subtractor (2n bits), shifting Quotient register (2n bits), Remainder register (2n bits), Overflow flag and Control Unit.]

Divider in BSV rules

// i/p:   Bit#(16) a, b; Bool start
// o/p:   Bit#(16) q, r; Bool rdy
// temps: ctr (counts up to 16)
//        Bit#(32) a_copy
//        Bit#(32) b_copy
//        Bit#(32) diff

rdy = (ctr == 0)
rem = b_copy[15:0]

rule init (rdy && start);
   ctr <= 16; q <= 0;
   b_copy <= {16'd0, b};
   a_copy <= {1'b0, a, 15'd0};
endrule

rule div (!rdy);
   diff <= a_copy - b_copy;
   q[15:1] <= q << 1;
   if (diff[31] == 0) begin
      b_copy <= diff;
      q[0] <= 1'b1;
   end
   else
      q[0] <= 1'b0;
   a_copy <= a_copy >> 1;
   ctr <= ctr - 1;
endrule
Division Hardware: Second Version

[Figure: Divisor (n bits), Subtractor (n bits), a combined shifting register Remainder HI (n bits) / Remainder LO (n bits), Overflow flag and Control Unit.]

Upper bits (HI) contain the Remainder, and lower bits (LO) contain the Quotient at the end of the operation.

Division Algorithm

rdy = (ctr == 0);

rule div (!rdy);
   if ( (R[31:16] - b) > 0 )
      R <= truncate({(R[31:16] - b), R[15:1], 1'b1});
   else
      R <= { R[31:1], 1'b0};
   ctr <= ctr - 1;
endrule

Example

Divide 0110 by 0010, n = 4
Initial Divisor: 0010, Remainder: 0000 0110
R: Remainder[7:4], D: Divisor

#  Step                      Divisor   Remainder
1. R < D, shift 0 into R      0010     0000 1100
2. R < D, shift 0 into R      0010     0001 1000
3. R < D, shift 0 into R      0010     0011 0000
4. R >= D, R = R - D          0010     0001 0000
   shift 1 into R             0010     0010 0001
5. R >= D, R = R - D          0010     0000 0001
   shift 1 into R             0010     0000 0011

Combinatorial Division Hardware

for (i = 0; i <= 16; i++)
   if ( R[31:16] > b )
      R <= truncate({ (R[31:16] - b), R[15:1], 1'b1 });
   else
      R <= { R[31:1], 1'b0 };
CS 220
Introduction to Computer Organization
2011-12 Ist Semester

CPU - I
Amey Karkare (karkare@cse.iitk.ac.in)
Central Processing Unit (CPU)

CPU executes instructions
  Sequential execution model
    Instructions executed one after another.
    Next instruction to be executed is known implicitly (the next stored instruction) or explicitly (by jump or call instructions)
CPU is a sequential circuit with
  Registers (register file)
  ALU (arithmetic and logic unit)
  Memory interface (to read and write data)
  Branch unit
Building Register File: Register

[Figure: an n-bit register with data input D, clock Clk, enable en and output Q; timing diagram showing Q following D on clock edges.]

rule doReg;
   if (en) Q <= D;
endrule

// equivalently, with the enable as a rule condition:
rule doReg (en);
   Q <= D;
endrule

Building Register File: Bussing

[Figure: two n-bit registers sharing data input D and driving a common output Q through read enables rd0/rd1; write enables wr0/wr1 select which register is written.]

// WRITE OPERATION
rule write;
   if (wr0) Q0 <= D;
   if (wr1) Q1 <= D;
endrule

// READ OPERATION
Q = rd0 ? Q0 : ( rd1 ? Q1 : Z );
Register File

[Figure: four registers R0-R3 with write enables wr0-wr3 and read enables rd0-rd3, driven by a write decoder (WE, WA) and a read decoder (RA).]

// A BSV sketch of the slide's fragmentary code: the interface and the
// 4 x 8-bit register array are assumptions filled in to make it complete.
import Vector::*;

interface Ifc_RFile;
   method Bit#(8) readReg(Bit#(2) raddr);
   method Action writeReg(Bool we, Bit#(2) waddr, Bit#(8) din);
endinterface

module mkRFile(Ifc_RFile);
   Vector#(4, Reg#(Bit#(8))) regArray <- replicateM(mkReg(0));   // Regs[0:3]

   method Bit#(8) readReg(Bit#(2) raddr);
      return regArray[raddr];
   endmethod

   method Action writeReg(Bool we, Bit#(2) waddr, Bit#(8) din);
      if (we) regArray[waddr] <= din;
   endmethod
endmodule
Multiple Buses

[Figure: register file with one write port (RD, WE, D, Clk) and two read ports (RA -> A bus, RB -> B bus); each register drives both read buses through separate RdA/RdB enables.]

Adding ALU to Register File

[Figure: register file (RD, RA, RB, WE, Clk, D) feeding an ALU controlled by a Function Code; the ALU output returns to the register file's D input.]
Instruction

Instructions are stored in the memory.
  Fetched one at a time
Instructions (simple ones) provide
  Function Code (aka opcode)
  Addresses of Source Registers
  Address of the destination register
  Write Enable
Clock is the system clock

Instruction Fetch

At each instruction cycle,
  Instruction is read from memory.
  Memory address is provided by a register
    Program Counter (PC)
  After a fetch, PC is incremented by the size of the instruction.
  Instruction read is stored in the instruction register (IR)
  Current instruction in the IR is executed.

CS 220
Introduction to Computer Organization
2011-12 Ist Semester

CPU - II
Amey Karkare (karkare@cse.iitk.ac.in)

Instruction Fetch

Let's say all our instructions are fixed size (32 bits, or 4 bytes wide).
Thus the PC needs to be incremented by four after each instruction fetch.
If the address space is 32 bits, the width of the PC is 32 bits.
  In reality, the last two bits of the PC can always be 0 (we need just a 30-bit register).
  The PC register is incremented by 1 after each instruction fetch.
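A hedged BSV sketch of this fetch step (my own example, not from the slides; IMem is a hypothetical combinational instruction-memory interface):

interface IMem;
   method Bit#(32) read(Bit#(32) addr);   // hypothetical combinational read
endinterface

module mkFetch#(IMem imem)(Empty);
   Reg#(Bit#(30)) pc <- mkReg(0);    // word-aligned PC: low two address bits are always 0
   Reg#(Bit#(32)) ir <- mkRegU;      // instruction register

   rule fetch;
      ir <= imem.read({pc, 2'b00});  // fetched word latched into the IR
      pc <= pc + 1;                  // PC incremented by 1 (i.e. by 4 bytes)
   endrule
endmodule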
Memory Interface

The memory interface includes the following:
  Address: address of the location to read/write
  Size: number of bytes to read/write
  Control: Read or Write
  DataIn: data to memory (for writing)
  DataOut: data from memory (for reading)

Instruction fetch does not write anything in the memory:
  Address: {PC, 2'b00}
  Size: 4 bytes
  Control: Read
  DataIn: Don't care
  DataOut: provided by the memory. Gets latched in the IR.

Instruction Fetch

[Figure: the 30-bit PC register, with two zeros appended, forms the memory address; the memory's Dataout is latched into the IR, an adder increments the PC by 1, and the Read control is asserted.]

Instruction Execution

[Figure: register file (RD, RA, RB, WE, Clk, D) feeding an ALU controlled by the Function Code.]
Instructions are incomplete yet

How do we perform i = i + 5?
  Assume i is in a register.
  Need to be able to add a constant
  The constant has to be part of the instruction:
    Immediate constant.
  Need to change the data path

Instruction Formats

R Type instructions
  | OpCode (0) | RA | RB | RD | 0 | Func code |
   31          25   20   15   10  5          0

R-I type instructions
  | OpCode (<>0) | RA | RD | Immediate Constant |
   31            25   20   15                  0

Instruction Execution

[Figure: register file and ALU; IR[15:11]/IR[20:15] select registers, an RI/R mux chooses between register data and the sign-extended IR[15:0] as the ALU's second operand; Function Code from the IR; WE = 1.]

Instruction Formats

R Type instructions (triadic)
  | OpCode (0) | RS1 | RS2 | RD | 0 | Func code |
   31          25    20    15   10  5          0

R-I type instructions (triadic)
  | OpCode (<>0) | RS1 | RD | Immediate Constant |
   31            25    20   15                  0
Instructions

Let's assume that R0 is always 0. Any write into it is ignored.
  A read always provides 0.
R type triadic instructions
  ADD, SUB, AND, OR, XOR
  Shift and rotate instructions: SLL, SRL, SRA, ROL, ROR
  Comparison instructions: SLT, SGT, SLE, SGE, UGT, ULT, ULE, UGE.
R-I type triadic instructions
  ADDI, SUBI, ORI, ANDI, XORI, SLLI, SRLI, SRAI, SLTI, SGTI, SLEI, SGEI, ULTI, UGTI, ULEI, UGEI, LHI (to load half of a register).
LDSB, LDSW, LDUB, LDUW, LDL, STB, STW, STL instructions
  Load signed byte, word; unsigned byte, word; long; store byte, word, long
  Memory address = offset + R[RS1]. For a load, destination = R[RD]. For a store, the data in R[RD] is stored in memory.
More instructions to come later.
Instruction Formats

R Type instructions (dyadic)
  | OpCode | RS1 | 0 | Offset from next PC |
   31      25    20  15                   0

J type instructions
  | OpCode (<>0) | Offset from next instruction |
   31            25                            0

Instructions

R type dyadic instructions
  BNEZ, BEQZ (Branch if RS1 <> 0 or = 0)
    If (cond) then PC = PC + offset (Remember: PC is a 30-bit register)
  JR, JALR (Jump based on register)
    Target = RS1 & 0xFFFFFFFC + 4*offset; PC = target >> 2
    JALR also saves {PC, 2'b00} in R31.
J type instructions
  J, JAL (jump and link): jump to the new target
    Target = PC + 4*offset. Offset is a signed number. PC = Target >> 2
    JAL additionally saves {PC, 2'b00} in R31
ALU functions

The ALU provides the following functions:
  Addition (ADD) and (ADD4). ADD4 is used for the BNEZ, BEQZ, J, JR, JAL, JALR instructions
  Subtraction (SUB)
  Left shift (LS), arithmetic right shift (RSA), logical right shift (RSL), by a count provided by the second operand (only five bits are used)
  Bit-wise AND (AND), bit-wise OR (OR), bit-wise XOR (XOR)
  Comparison:
    Signed greater than (SGT), signed less than (SLT), signed less than or equal to (SLE), signed greater than or equal to (SGE)
    Unsigned greater than (UGT), unsigned less than (ULT), unsigned less than or equal to (ULE), unsigned greater than or equal to (UGE)
  Data mixing (output = {IR[15:0], RS1[15:0]}) (DM)
CS 220
Introduction to Computer Organization
2011-12 Ist Semester

CPU - III
Amey Karkare (karkare@cse.iitk.ac.in)
Instruction Fetch
Read Control

PC Convert
(30 to 32 bits Memory
bit)

Clock

Add IR
1

Instruction
Datapath for R type triadic
instructions
ALU Control
D1out
RD Register
file RS1
Din (32regs)

D2out
Clock
RS2

WE
Datapath for R and R-I type
triadic instructions (non load
store)
IR[20:16] IR[5:0]
IR[15:11] ALU Control Combinatorial
circuit. IR[31:26]
RDSel
D1out
RD Register
file RS1
Din (32regs)

D2out
Clock
RS2
Sign IR[15:0]
Extension to
32 bits
WE Oprnd2Sel
Adding Load-Store data path

[Block diagram: the ALU output also serves as the data-memory address; DinSel selects
the register-file write data between the ALU result and the memory data (via the DCnvt
conversion block); D2out supplies the data to the memory for stores.]
Adding JR, BEQZ, BNEZ

[Block diagram: a zero detector on D1out produces RS1is0; Oprnd1Sel chooses the first
ALU operand between the register file and the PC; the PC path appends 00 to form a
byte address and removes the 2 bits again on the way back; NextPC selects between
PC+1 and the ALU result as the next PC.]
Adding J, JAL and JALR

[Block diagram: register 31 is added as a possible destination (RDSel value RD31);
DinSel1 selects between the ALU result and the memory data, and DinSel2 selects between
that result and the PC (return address for JAL/JALR); ExtnCntl chooses between 16-bit
and 26-bit sign extension of IR[15:0]/IR[25:0]; the PC path appends 00 as before.]

SDLX Data Path

[Figure: the complete SDLX datapath combining all of the pieces above.]

CS220: Introduction to Computer Organization


2011-12 Ist Semester
CPU - IV

Amey Karkare
karkare@cse.iitk.ac.in
Department of CSE, IIT Kanpur
karkare, CSE, IITK CS220, CPU 1/9 karkare, CSE, IITK CS220, CPU 2/9

Control Signals

RDSel: 2 bits. Selector for RD.
  Possible values: RD31, RD20-16, RD15-11
DinSel1: Bit. Selector for Din source.
  Values: DinALU, DinMEM
DinSel2: Bit. Selector for Din source.
  Values: DinPC, DinS1
WE: Write Enable.
  Values: Yes, No
NextPC: Source for next PC.
  Values: PCALU, PCPlus1

Control Signals

Oprnd1Sel: RegFile or PC
Oprnd2Sel: RegFile or IR
ExtnCntl: EXT16 or EXT26
ALU Control
karkare, CSE, IITK CS220, CPU 3/9 karkare, CSE, IITK CS220, CPU 4/9
Control Signals

Inst       RDSel    DinSel1  DinSel2  WE   NextPC   Oprnd1Sel  Oprnd2Sel  ExtnCntl  ALU
R Triadic  IR15-11  DinALU   DinS1    Yes  PCPlus1  RegFile    RegFile    X         IR[4:0]
ADDI       IR20-16  DinALU   DinS1    Yes  PCPlus1  RegFile    IR         EXT16     ADD
LHI        IR20-16  DinALU   DinS1    Yes  PCPlus1  RegFile    IR         EXT16     DM
LDSB       IR20-16  DinMEM   DinS1    Yes  PCPlus1  RegFile    IR         EXT16     ADD
LDUB       IR20-16  DinMEM   DinS1    Yes  PCPlus1  RegFile    IR         EXT16     ADD
LDUW       IR20-16  DinMEM   DinS1    Yes  PCPlus1  RegFile    IR         EXT16     ADD
LDL        IR20-16  DinMEM   DinS1    Yes  PCPlus1  RegFile    IR         EXT16     ADD
STB        X        X        X        No   PCPlus1  RegFile    IR         EXT16     ADD
BNEZ       X        X        X        No   ****     PC         IR         EXT16     ADD4
JR         X        X        X        No   PCALU    RegFile    IR         EXT16     ADD4
JALR       RD31     X        DinPC    Yes  PCALU    RegFile    IR         EXT16     ADD4
J          X        X        X        No   PCALU    PC         IR         EXT26     ADD4
JAL        RD31     X        DinPC    Yes  PCALU    PC         IR         EXT26     ADD4

(****: for BNEZ, NextPC depends on the outcome of the zero test.)

Control Signals

For Load instructions, data from memory is to be converted.
DCnvt control to be used.
  Values: ByteSExtend, WordSExtend, Byte0Pad, Word0Pad, None
  LDSB: DCnvt = ByteSExtend
  LDUB: DCnvt = Byte0Pad
  LDSW: DCnvt = WordSExtend
  LDUW: DCnvt = Word0Pad
  LDL:  DCnvt = None
  Other instructions: DCnvt is Don't Care.
karkare, CSE, IITK CS220, CPU 5/9 karkare, CSE, IITK CS220, CPU 6/9

Harvard vs. Princeton Architectures

Princeton architecture
  Single memory for program and data.
Harvard architecture
  Separate memories for program and data.
What we covered is a Harvard architecture for the SDLX processor.
If we have a single memory:
  Only one operation can take place at a time.
  Instruction fetch OR data read/write.

Memory Interface

Must be possible to delay an instruction fetch if there is a data read/write request.
In order to do this:
  Load IR with a NOP instruction (like ADD r0, r0, r0). Do not change the PC.
For aesthetic reasons, the NOP instruction is all 0s.
  ADD function code is 0.
  Loading IR with NOP is the same as clearing the register to 0s.
karkare, CSE, IITK CS220, CPU 7/9 karkare, CSE, IITK CS220, CPU 8/9
Branch Instructions

When a branch instruction is being executed


PC contains the address of the instruction next to the branch
instruction (BNEZ, BEQZ, J, JR, JAL, JALR)
What do we do with the instruction just fetched?
One solution: To set WE = NO and no update to PC register.
Other solution: Let the instruction execute.
Delayed branch semantics
Compilers can generate a NOP after each branch
Or even reorder instructions, preserving the meaning of the
program

karkare, CSE, IITK CS220, CPU 9/9


CS220: Introduction to Computer Organization
2011-12 Ist Semester

CPU - V

Amey Karkare
karkare@cse.iitk.ac.in
Department of CSE, IIT Kanpur

Branch Instructions

When a branch instruction is being executed, PC contains the address of the instruction
next to the branch instruction (BNEZ, BEQZ, J, JR, JAL, JALR).
What do we do with the instruction just fetched?
  One solution: set WE = No and do not update the PC register.
  Other solution: let the instruction execute.
    Delayed branch semantics
    Compilers can generate a NOP after each branch
    Or even reorder instructions, preserving the meaning of the program
karkare, CSE, IITK CS220, CPU 1/7 karkare, CSE, IITK CS220, CPU 2/7

Solution 1: NOP

At cycle i,
  If the instruction executed in cycle i-1 resulted in NextPC = PCALU
  (i.e. the branch was taken),
  the instruction at cycle i must not be executed.
  Clear the IR register so that a NOP is loaded.

Solution 1

100: SUB 0, R5, R5
104: JAL 224        ; absolute target
108: ADD ...
...
224: STL R31, 0(R30)
228: SUB R30, 4, R30

Original architecture:  IR: SUB  JAL  ADD  STL     PC: 104  108  224  228
Modified architecture:  IR: SUB  JAL  NOP  STL     PC: 104  108  224  228
(After the taken JAL, the fetched ADD is replaced: the IR is cleared to a NOP while
the PC is updated to the target.)
karkare, CSE, IITK CS220, CPU 3/7 karkare, CSE, IITK CS220, CPU 4/7
NOP Feeding

NOP fed if the current instruction is
  LDxx or STxx
  J, JR, JAL, JALR
  BNEZ if (RS1is0 == 0)
  BEQZ if (RS1is0 == 1)
On average, every 6-7th instruction in a program is a branch kind of instruction.
During the execution of LDxx and STxx instructions, no instruction is read from memory.
But for branch instructions, an instruction is already read from memory.
  Execute it, possibly to improve performance.
  Delayed Branch semantics.

Solution 2: Delayed Branch

The instruction stored after a branch is such that the meaning of the program is not
changed.
Solution can be implemented by the Compiler/Assembler.
  Naive approach: put a NOP after each branch instruction.
  Smart approach: put a meaningful instruction after the branch.
    Requires reordering the instructions.

karkare, CSE, IITK CS220, CPU 5/7 karkare, CSE, IITK CS220, CPU 6/7

Example of Delayed Slot Filling

Program (Desired Semantics)   Program (Code in memory)   Program (Alternate)
SUB 0, R5, R5                 JAL X                      SUB 0, R5, R5
JAL X                         SUB 0, R5, R5              JAL X
ADD ...                       ADD ...                    SUB R30, 4, R30
...                           ...                        ADD ...
X: SUB R30, 4, R30            X: SUB R30, 4, R30         ...
...                           ...                        X: ...

(Code in memory) Possible only if a branch-independent instruction can be found in the
code before the branch.
(Alternate) Possible only if a branch-dependent instruction can be found in the code
after the branch.

karkare, CSE, IITK CS220, CPU 7/7


Acknowledgement
CS220: Introduction to Computer Organization
2011-12 Ist Semester

Assembly - I The material in the following slides is based on the following


references:
Spim documentation (Appendix A) from the third edition of
Amey Karkare Computer Organization and Design: The Hardware/Software
karkare@cse.iitk.ac.in Interface by Patterson & Hennessy
Department of CSE, IIT Kanpur MIPS Assembly Language Programming: CS50 Discussion and
Project Book by Daniel J Ellard

karkare, CSE, IITK CS220, Assembly 1/5 karkare, CSE, IITK CS220, Assembly 2/5

Why Assembly / High Level Languages?

Instructions as Binary Strings
  Easy for computers to understand.
  Natural and efficient to manipulate.
Humans have difficulty in understanding binary strings.
  Humans read/write symbols (words) much better than long sequences of digits.
How can humans communicate with computers effectively?
  Humans read/write strings of symbols (programs in Assembly/High level languages).
  Automated tools process these programs and convert them to binary strings
  (object code/executable code).
  Computers read and execute binary strings.

Typical Flow: High Level Language Program

  xx1.c --compiler--> xx1.s --assembler--> xx1.o --+
  xx2.c --compiler--> xx2.s --assembler--> xx2.o --+-- linker --> executable
  ...                                      ...     |
  xxN.c --compiler--> xxN.s --assembler--> xxN.o --+
  Program Libraries -------------------------------+

karkare, CSE, IITK CS220, Assembly 3/5 karkare, CSE, IITK CS220, Assembly 4/5
Typical Flow: Assembly Program

  xx1.s --assembler--> xx1.o --+
  xx2.s --assembler--> xx2.o --+-- linker --> executable
  ...                  ...     |
  xxN.s --assembler--> xxN.o --+
  Program Libraries -----------+

karkare, CSE, IITK CS220, Assembly 5/5


Acknowledgement
CS220: Introduction to Computer Organization
2011-12 Ist Semester

Assembly - II The material in the following slides is based on the following


references:
Spim documentation (Appendix A) from the third edition of
Amey Karkare Computer Organization and Design: The Hardware/Software
karkare@cse.iitk.ac.in Interface by Patterson & Hennessy
Department of CSE, IIT Kanpur MIPS Assembly Language Programming: CS50 Discussion and
Project Book by Daniel J Ellard

karkare, CSE, IITK CS220, Assembly 1/10 karkare, CSE, IITK CS220, Assembly 2/10

Typical Flow: Assembly Program

  (same assembler/linker flow as above)

When to use Assembly Programs

A new architecture has come up.
  Non-availability of tools for high level languages for the platform.
Boot-strapping
  To develop high level language tools in assembly.
  In reality, this goes down to the assembly - object code level as well.

karkare, CSE, IITK CS220, Assembly 3/10 karkare, CSE, IITK CS220, Assembly 4/10
When to use Assembly Programs

When the requirements are critical.
  Time to respond.
  Size of code.
  Architecture specific operations.
  Fine grain control over execution is required.
To avoid surprises
  Unexpected compiler optimizations.
  Unwanted re-ordering of instructions.

When to use Assembly Programs

Commercial applications use a Hybrid Approach.
  Most of the code is in a high level language.
  Resource/Time critical code is written in assembly.

karkare, CSE, IITK CS220, Assembly 5/10 karkare, CSE, IITK CS220, Assembly 6/10

Drawbacks of Assembly Programming

Assembly programs are machine specific.
  Need to be re-written for a different architecture.
  Cannot automatically make use of advances in architecture.
Assembly programs are longer.
  Expansion factor of more than 3 compared to the same program written in a high
  level language.
Low productivity of programmers.
  Empirical studies have shown that programmers write nearly the same number of lines
  of code per day, irrespective of the language.
  Productivity goes down by X if X is the expansion factor.

Drawbacks of Assembly Programming

Programs are difficult to understand.
  More bugs.
  Difficult to maintain.
There are no strictly enforced rules to program in assembly.
  No type checking.
  No scopes.
  No fixed rules for parameter passing.
While guidelines/conventions exist, it is up to the programmer to follow them or
ignore them.

karkare, CSE, IITK CS220, Assembly 7/10 karkare, CSE, IITK CS220, Assembly 8/10
Assembler

Assembler translates an assembly program into object code.
Object code contains binary machine instructions,
  plus some bookkeeping information.
Object code is not the same as executable code.
  Object code can contain references to external symbols (say, a call to the printf
  function).
  Only local symbols are visible while assembling a file.
  Another tool, called the linker, resolves cross-file dependencies and produces
  executable code.

Linker

Resolves external references among files (cross-file references).
Searches program libraries to find and link library routines used by the program.
Determines the memory locations that code from each function will occupy and relocates
its instructions by adjusting absolute references.
If linking is successful, the output of the linker is the executable file that is ready
to execute.

karkare, CSE, IITK CS220, Assembly 9/10 karkare, CSE, IITK CS220, Assembly 10/10
Acknowledgement
CS220: Introduction to Computer Organization
2011-12 Ist Semester

Assembly - III The material in the following slides is based on the following
references:
Spim documentation (Appendix A) from the third edition of
Amey Karkare Computer Organization and Design: The Hardware/Software
karkare@cse.iitk.ac.in Interface by Patterson & Hennessy
Department of CSE, IIT Kanpur MIPS Assembly Language Programming: CS50 Discussion and
Project Book by Daniel J Ellard

karkare, CSE, IITK CS220, Assembly 1/7 karkare, CSE, IITK CS220, Assembly 2/7

MIPS Architecture

Register architecture.
  Arithmetic and logic operations involve only registers or immediate constants.
Load-store architecture.
  Data is loaded from memory into registers or stored to memory from registers.
  No direct manipulation of memory contents.

MIPS Register Set

32 registers, each 32 bits wide.
Register 0 is hardwired to contain value 0 all the time.
Remaining 31 registers are general purpose registers.
  Theoretically, these registers can be used interchangeably.
  General purpose register 31 is used as the link register for jump and link
  instructions.
However, MIPS programmers have developed a set of conventions to use these registers.
  These calling conventions are maintained by the tool-chain software, but they are
  not enforced by the hardware.

karkare, CSE, IITK CS220, Assembly 3/7 karkare, CSE, IITK CS220, Assembly 4/7
MIPS Register Set

Register Number  Name    Common Usage
0                zero    Hardwired value 0. Any writes to this register are ignored.
1                at      Assembler temporary.
2-3              v0-v1   Function result registers.
4-7              a0-a3   Function argument registers that hold the first four arguments.
8-15, 24-25      t0-t9   Temporary registers.
16-23            s0-s7   Saved registers to use freely.
26-27            k0-k1   Reserved for use by the operating system kernel and for
                         exception return.
28               gp      Global pointer.
29               sp      Stack pointer.
30               fp      Frame pointer.
31               ra      Return address register.

MIPS Instruction Set

Arithmetic and Logic Instructions
  add(u), sub(u), mul(u), div(u), abs, ...
  rol, ror, sll, srl, sra, and, or, not, ...
Comparison Instructions
  seq, sne, sge(u), sgt(u), sle(u), slt(u)
Branch and Jump Instructions
  b, beq, bne, bge(u), beqz, bnez, bgezal, ...
  j, jr, jal, jalr
karkare, CSE, IITK CS220, Assembly 5/7 karkare, CSE, IITK CS220, Assembly 6/7

MIPS Instruction Set

Load, Store and Data Movement


la, lb(u), lh(u), lw, lwl, lwr, li, . . .
sb, sh, sw
ulh(u), ulw, ush, usw, swl, swr (unaligned load/store)
move, mfhi, mflo, mthi, mtlo
Exception Handling
rfe, syscall, break, nop
Refer to MIPS documentation for details of individual instructions.

karkare, CSE, IITK CS220, Assembly 7/7


Acknowledgement
CS220: Introduction to Computer Organization
2011-12 Ist Semester

Assembly - IV The material in the following slides is based on the following


references:
Spim documentation (Appendix A) from the third edition of
Amey Karkare Computer Organization and Design: The Hardware/Software
karkare@cse.iitk.ac.in Interface by Patterson & Hennessy
Department of CSE, IIT Kanpur MIPS Assembly Language Programming: CS50 Discussion and
Project Book by Daniel J Ellard

karkare, CSE, IITK CS220, Assembly 1/10 karkare, CSE, IITK CS220, Assembly 2/10

MIPS Addressing Modes

Bare Machine: imm(reg)
  imm: Immediate Constant
  reg: Register containing address
  Address computation: Contents of reg + imm constant
Virtual Machine:

  Format           Address Computation
  (reg)            Contents of reg
  imm              Immediate constant
  imm(reg)         Immediate + contents of reg
  label            Address of label
  label ± imm        Address of label ± Immediate
  label ± imm(reg)   Address of label ± (Immediate + contents of reg)

MIPS System Calls

Some frequently used system calls:

  Service        Code   Arguments                     Result
  print_int      1      a0
  print_string   4      a0
  read_int       5                                    v0
  read_string    8      a0 (address), a1 (length)
  sbrk           9      a0 (amount)                   v0
  exit           10

Refer to MIPS documentation for remaining system calls and the details of system calls.

karkare, CSE, IITK CS220, Assembly 3/10 karkare, CSE, IITK CS220, Assembly 4/10
Example: Iterative Factorial Program

The program is developed incrementally on the slides; the complete version:

# fact.asm:
# Compute the factorial of a given
# number and print the result
#
# $t0: holds the input
# $t1: holds the result: Factorial($t0)

        .text
        .globl main
main:
        # read the number...
        # first the query
        la   $a0, query_msg
        li   $v0, 4
        syscall
        # now read the input
        li   $v0, 5
        syscall
        # and store in $t0
        move $t0, $v0

        # store the base value in $t1
        li   $t1, 1
fact:
        blez $t0, done
        mul  $t1, $t1, $t0
        sub  $t0, $t0, 1
        b    fact

done:   # print the result
        # first the message
        la   $a0, result_msg
        li   $v0, 4
        syscall
        # then the value
        move $a0, $t1
        li   $v0, 1
        syscall
        # then newline
        la   $a0, nl_msg
        li   $v0, 4
        syscall

        # exit
        li   $v0, 10
        syscall

        .data
query_msg:  .asciiz "Input? "
result_msg: .asciiz "The factorial is: "
nl_msg:     .asciiz "\n"
Acknowledgement
CS220: Introduction to Computer Organization
2011-12 Ist Semester

Assembly - V The material in the following slides is based on the following


references:
Spim documentation (Appendix A) from the third edition of
Amey Karkare Computer Organization and Design: The Hardware/Software
karkare@cse.iitk.ac.in Interface by Patterson & Hennessy
Department of CSE, IIT Kanpur MIPS Assembly Language Programming: CS50 Discussion and
Project Book by Daniel J Ellard

karkare, CSE, IITK CS220, Assembly 1/12 karkare, CSE, IITK CS220, Assembly 2/12

Procedure Calls

In a high level language like C, the compiler provides several useful abstractions,
for e.g.:
  Mapping of actuals to formals.
  Allocation and initialization of temporary local storage.
  Each invocation of a procedure gets its own copy of local variables.
    Required to support recursion.

Procedure Calls

In assembly language, the programmer has to explicitly maintain most of the procedure's
environment (local variables, actual to formal mapping, return value, return address
etc.).
Use of stack to store the environment for each procedure.
When procedure A calls procedure B, the programmer has to write code to:
  save A's environment on the stack
  jump to B
  when B returns, restore A's environment from the stack

karkare, CSE, IITK CS220, Assembly 3/12 karkare, CSE, IITK CS220, Assembly 4/12
Layout of Memory

[Figure: from high memory addresses to low: Stack (growing downwards), Dynamic data,
Static data (together the data segment), Text segment, Reserved.]

Layout of Stack Frame

[Figure: arguments #5, #6, ... sit just above the frame; $fp points at the first word
of the frame, below which are the saved registers and then the local variables; $sp
points at the last word of the frame, at the lower memory address.]

$fp points to the first word in the active procedure's stack frame.
$sp points to the last word in the active procedure's stack frame.
Arguments 1...4 are stored in $a0...$a3.

karkare, CSE, IITK CS220, Assembly 5/12 karkare, CSE, IITK CS220, Assembly 6/12

Building Stack Frames

No fixed sequence.
  Only conventions.
  Caller and Callee must agree on the sequence of steps.
Calling sequences
  Consist of a Call sequence and a Return sequence.
A possible sequence of steps is detailed next.

Placement of Calling Sequences

Call Sequence
  Immediately before the caller invokes the callee.
  Just as the callee starts executing.
Return Sequence
  Immediately before the callee returns.
  Just as control reaches the caller.

karkare, CSE, IITK CS220, Assembly 7/12 karkare, CSE, IITK CS220, Assembly 8/12
Call Sequence: Caller

Save Caller-Saved Registers
  As the called procedure is free to use $a0...$a3 and $t0...$t9, the caller must store
  them on the stack before the call, if they are required after the call.
Pass Arguments
  First four arguments are copied to $a0...$a3.
  Remaining arguments are stored on the stack. These appear just above the callee's
  stack frame.
Jump to Callee
  Execute the JAL instruction.
  Stores the return address in $ra.

Call Sequence: Callee

Allocate memory for the current callee invocation on the stack (stack frame of callee).
Save Callee-Saved Registers
  If callee code can change any of the registers $s0...$s7, $fp, $ra, it needs to store
  them in the stack frame before changing them.
  Caller expects to see these registers unmodified.
Establish new values for $fp, $sp.

karkare, CSE, IITK CS220, Assembly 9/12 karkare, CSE, IITK CS220, Assembly 10/12

Return Sequence: Callee

If a value is to be returned, put it in $v0.
Restore $s0...$s7, $fp, $ra (callee saved registers).
Restore $sp (pop out the callee's stack frame).
Return by jumping to the $ra address.

Return Sequence: Caller

Restore caller saved registers.
Clean up (pop) the arguments on the stack, if passed that way.
Move the return value from $v0 to the appropriate register.

karkare, CSE, IITK CS220, Assembly 11/12 karkare, CSE, IITK CS220, Assembly 12/12
CS220: Introduction to Computer Organization
2011-12 Ist Semester

Memory - I

Amey Karkare
karkare@cse.iitk.ac.in
Department of CSE, IIT Kanpur

Technologies

Writable vs. non-writable
  ROM (Read only memory)
  RAM (Read-write memory, but known as Random Access Memory)
Volatility
  Does memory retain values when powered off?
  ROM, NVRAM
Technology
  ROM: Data masked at fabrication time
  PROM: Once programmable (specialized programmers)
  EPROM: Erasable and then programmable
  EEPROM: Electrically (selectively) erasable
  Flash: Sector oriented
RAM, Static vs. Dynamic
  Dynamic: Slow but cheap. DRAM, EDORAM, SDRAM, DDR-RAM

karkare, CSE, IITK CS220, Memory 1/6 karkare, CSE, IITK CS220, Memory 2/6

Memory Addressability

Memory is a big array of bytes.
  n-bit address, 8-bit data: 2^n locations.
Not the entire address space needs to have storage attached.
Processors need to read multi-byte data structures.
  Short (2 bytes), int (4 bytes), double (8 bytes) etc.
Should the access be one byte at a time?
  Issue of the bus interface!

Memory Interface: Byte-wide Bus

[Figure: a memory block with an n-bit address, 8-bit data and CS, RD, WR control inputs.]

When CS = 1
  If RD = 1, data is read from the selected memory location and presented on the data
  lines.
  If WR = 1, data is written into the selected location.
When CS = 0
  Data lines are tristated, and no operation is carried out in memory.
When RD, WR, CS = 0, 0, 1
  Same as when CS = 0.

karkare, CSE, IITK CS220, Memory 3/6 karkare, CSE, IITK CS220, Memory 4/6
Multi-byte wide Bus

[Figure: two byte-wide memory blocks A and B of 2^n bytes each behind a wrapper; the
processor presents an (n+1)-bit address, CS, RD, WR and Size; the wrapper drives
CSA/RDA/WRA and CSB/RDB/WRB and combines Data A and Data B onto a 16-bit data bus.]

Total size = 2^(n+1) bytes.
Address needed from the processor: n+1 bits.
Each memory block requires just n bits of address.
Need a Wrapper on this organization to provide 8-bit transfers and 16-bit transfers.

Multi-byte wide Bus

AddressA = AddressB = Address[n:1]  (bit 0 of the address is not used by the blocks)
Size = 0: 8-bit transfer.  Size = 1: 16-bit transfer.
If Size = 1 and Address[0] = 1: Error.
CSA = 1 if CS = 1 and Address[0] = 0.
RDA = RD if Address[0] = 0, 0 otherwise.
WRA = WR if Address[0] = 0, 0 otherwise.
CSB = CS if Address[0] = 1 or Size = 1.
RDB = RD if Address[0] = 1 or Size = 1.
WRB = WR if Address[0] = 1 or Size = 1.

karkare, CSE, IITK CS220, Memory 5/6 karkare, CSE, IITK CS220, Memory 6/6
CS220: Introduction to Computer Organization
2011-12 Ist Semester

Memory - II

Amey Karkare
karkare@cse.iitk.ac.in
Department of CSE, IIT Kanpur

Multi-byte wide Bus

(recap of the two-block organization and wrapper logic from the previous lecture)

karkare, CSE, IITK CS220, Memory 1/11 karkare, CSE, IITK CS220, Memory 2/11

A Problem ...

On the 16-bit data bus (data[15:0]):
  DataA appears on data[7:0].
  DataB appears on data[15:8].
For 16-bit transfers
  data appears correctly on data[15:0].
For 8-bit transfers
  If even address, data appears on data[7:0].
  If odd address, data appears on data[15:8].
  This is not correct!
The story is complicated for 32-bit buses.

Data Switch is needed

[Figure: DataA and DataB feed a Data Switch controlled by A[0] and Size.]

if (Address[0] == 0) // even address
    Data[7:0] = DataA;

if (Address[0] == 1) // odd address
    Data[7:0] = DataB;

if (Size == 1) // 16-bit transfer
    Data[15:8] = DataB;
karkare, CSE, IITK CS220, Memory 3/11 karkare, CSE, IITK CS220, Memory 4/11
32-bit Buses

Most systems have a 32-bit bus.
  Support 8-, 16-, and 32-bit data transfers.
  Example: SDLX memory.
For a 4-byte wide memory system,
  There will be 4 memory lanes.
  Address for each lane would be BusAddress[31:2].
  Last 2 bits are unused.
BusAddress[1:0] and Size determine which lanes to enable.

32-bit Memory System

BusAddress[1:0]  Size      EN0 EN1 EN2 EN3
00               Byte      1   0   0   0
00               Halfword  1   1   0   0
00               Word      1   1   1   1
01               Byte      0   1   0   0
01               Halfword  Unaligned Access
01               Word      Unaligned Access
10               Byte      0   0   1   0
10               Halfword  0   0   1   1
10               Word      Unaligned Access
11               Byte      0   0   0   1
11               Halfword  Unaligned Access
11               Word      Unaligned Access

CSi for lane i derived as (ENi & CS)
karkare, CSE, IITK CS220, Memory 5/11 karkare, CSE, IITK CS220, Memory 6/11
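A small C sketch of the lane-enable logic in the table above; the Size encoding below is assumed for illustration, since the slides do not fix one.

#include <stdbool.h>
#include <stdint.h>

typedef enum { SZ_BYTE = 1, SZ_HALF = 2, SZ_WORD = 4 } Size;  /* size in bytes */

/* Fills en[0..3]; returns false (all enables cleared) for an unaligned access. */
static bool lane_enables(uint32_t bus_addr, Size size, bool en[4])
{
    uint32_t lane = bus_addr & 0x3;          /* BusAddress[1:0] */
    for (int i = 0; i < 4; i++) en[i] = false;

    if ((size == SZ_HALF && (lane & 1)) ||   /* halfword must be 2-byte aligned */
        (size == SZ_WORD && lane))           /* word must be 4-byte aligned     */
        return false;                        /* Unaligned Access                */

    for (uint32_t i = lane; i < lane + (uint32_t)size; i++)
        en[i] = true;                        /* ENi; CSi = ENi & CS             */
    return true;
}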

Bus Switch

The bus switch mechanism is based on BusAddress[1:0] and Size.
Most systems have a wrapper and switch as described here.

Memory Hierarchy

Three axes of memory
  Size of memory
  Speed of memory
  Cost of memory
Faster memories are expensive.
  SRAM speeds are 5-15 ns.
  SRAM within the processor chip (used for internal cache) is faster (about 5 ns).
  SRAM cost is about $150/MB.
Cheaper memories are slow.
  DRAM cost is about $2/MB; access time is 40-100 ns.
Slow storage is even cheaper.
  Disk speeds: 10-40 ms/sector. Cost is $5/GB.
  Flash disks (USB sticks): 1-5 ms. Cost is $50/GB.
  CD cost is $0.25/GB.
karkare, CSE, IITK CS220, Memory 7/11 karkare, CSE, IITK CS220, Memory 8/11
Pyramid of Storage

[Pyramid: closest to the CPU are the Registers, then Cache, Main Memory, Disk (online
storage) and Magnetic Tapes (offline storage); size grows and cost per byte falls as
we move away from the CPU.]

Cache

Literal meaning: a hidden storage, typically used for armory.
Cache is a fast and small memory between Main Memory and CPU.
Hidden: no instruction operates on the cache.
Idea: frequently used data is copied to cache (also written as $$$).

karkare, CSE, IITK CS220, Memory 9/11 karkare, CSE, IITK CS220, Memory 10/11

Principles of Locality

Spatial Locality
  If a memory location M is accessed now, there is a high probability that a location
  near M will be accessed within a small time.
  In other words, if a memory location is referenced, it is likely that nearby items
  will be referenced in the near future.
Temporal Locality
  If a memory location M is accessed now, it is likely that M will be accessed again
  within a small time.
  In other words, if a memory location is referenced, it is likely that it will be
  referenced again in the near future.
Programs exhibit locality.

karkare, CSE, IITK CS220, Memory 11/11


CS220: Introduction to Computer Organization
2011-12 Ist Semester

Memory - III

Amey Karkare
karkare@cse.iitk.ac.in
Department of CSE, IIT Kanpur

Principles of Locality

(recap: programs exhibit spatial and temporal locality, as defined in the previous
lecture)

karkare, CSE, IITK CS220, Memory 1/11 karkare, CSE, IITK CS220, Memory 2/11

Cache

Inclusion Property: Cache stores only a subset of the memory image (one that is likely
to be used).
At the time of a memory access, first search in the cache.
  If found, read from the cache.
  If not found, go and read from memory.
If the time to access a cache location is tc, the time to access a memory location is
tm, and h is the hit ratio in the cache, then

  average access time = h*tc + (1-h)*(tc + tm)
                      = tc + (1-h)*tm

If the word is fetched in parallel from Cache and Memory (requires complex circuitry):

  average access time = h*tc + (1-h)*tm

Engineering Issues

Need to perform search.
  Store tags. Tags are attached to the cache to indicate which address in main memory.
Tag storage is extra baggage. Need to reduce it.
  Use blocks. Also helpful in exploiting spatial locality.
Need to speed up search.
  Search sequentially, one tag at a time: not fast enough.
  All tags can be searched in parallel: requires large hardware.
  Direct mapping between cache and memory block: no search needed, just verify that
  the address is correct; large number of conflicts.
  Search multiple tags in parallel: usually two or three.
Usually two or three.

karkare, CSE, IITK CS220, Memory 3/11 karkare, CSE, IITK CS220, Memory 4/11
Blocking

Let the size of the blocks be 2^x bytes.
  Typically 32 bytes is a common choice.
The address within the block is x bits wide.
Of the n-bit processor address, the upper n-x bits give the block number and the lower
x bits give the address within the block.

Blocking

[Figure: main memory (addresses 0 ... 2^n - 1) divided into blocks of K bytes; the
cache holds lines 0 ... C-1, each line consisting of a tag and a block of K bytes.]
karkare, CSE, IITK CS220, Memory 5/11 karkare, CSE, IITK CS220, Memory 6/11

Cache's View of Memory

n-bit address lines: 2^n bytes of memory.
Cache views memory as an array of M blocks.
  M = 2^n / K

Caching

Cache stores fixed length blocks of K bytes.
A block of memory in the cache is referred to as a line.
  K is the line size.
Cache size of C blocks, where C << M (much, much less than).
Each line includes a tag that identifies the block being stored.
  The tag is usually the upper portion of the memory address.
karkare, CSE, IITK CS220, Memory 7/11 karkare, CSE, IITK CS220, Memory 8/11
Multilevel Caching: Unified Cache

Multilevel Caching: Non-unified Cache

[Figures showing unified and split (instruction/data) multilevel cache organizations.]

karkare, CSE, IITK CS220, Memory 9/11 karkare, CSE, IITK CS220, Memory 10/11

Some Observations

There are many more reads than writes.
  All instructions are read.
  Variables in the program are read and written.
About 30% of instructions are load/store.
Even among the data computations, many computations are dyadic:
  c = a + b
Assuming 60% of load/stores are loads:
  Number of reads for 100 instructions: 100 + 30 * 0.60 = 118.
  Number of writes: 30 * 0.40 = 12.
Only about 10% of the whole memory traffic is writes.

karkare, CSE, IITK CS220, Memory 11/11


CS220: Introduction to Computer Organization
2011-12 Ist Semester

Memory - IV

Amey Karkare
karkare@cse.iitk.ac.in
Department of CSE, IIT Kanpur

Cache Jargon

Block: A set of contiguous address locations of some size.
  A cache block is also called a cache line.
Replacement policy
  Rules for creating space for a new block in the cache.
  LRU, FIFO, LFU, random ...
Cache Hit/Cache Miss
  Hit: the requested address exists in the cache.
  Miss: otherwise.
Associativity
  Number of possible tags where a physical block may be found.
  Fully associative, k-way associative.
Write Policy
  Write through, Write back.

karkare, CSE, IITK CS220, Memory 1/12 karkare, CSE, IITK CS220, Memory 2/12

Cache Mapping Algorithms: Example

Main Memory
  16-bit address.
  4K blocks of 16 bytes each.
  Total 64K bytes.
Cache
  128 blocks of 16 bytes each.
  Total 2048 (2K) bytes.

Direct Mapping

Block B of main memory mapped to block B % 128 of cache.
  Blocks 0, 128, 256, ... mapped to cache location 0.
  Blocks 1, 129, 257, ... mapped to cache location 1, and so on.
  Total 32 memory blocks mapped per cache location.
Contention may occur even if the cache is not full.
  Trivial replacement algorithm: replace the existing block.
16-bit memory address divided in three parts:
  Tag (5 bits) | Location # (7 bits) | Offset (4 bits)
  Lower 4 bits of the address are the offset within the block (2^4 = 16).
  Middle 7 bits of the address point to a particular location in the cache (2^7 = 128).
  Upper 5 bits of the address are compared with the tag (2^5 = 32).
karkare, CSE, IITK CS220, Memory 3/12 karkare, CSE, IITK CS220, Memory 4/12
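For the example above, the address split can be written directly in C (the names below are illustrative):

#include <stdint.h>

/* Direct-mapped example: 16-bit address, 16-byte blocks, 128 cache locations. */
#define OFFSET_BITS 4            /* 2^4 = 16 bytes per block   */
#define LINE_BITS   7            /* 2^7 = 128 cache locations  */

typedef struct { uint32_t tag, line, offset; } DirectMap;

static DirectMap split(uint16_t addr)
{
    DirectMap d;
    d.offset = addr & ((1u << OFFSET_BITS) - 1);                  /* addr[3:0]   */
    d.line   = (addr >> OFFSET_BITS) & ((1u << LINE_BITS) - 1);   /* addr[10:4]  */
    d.tag    = addr >> (OFFSET_BITS + LINE_BITS);                 /* addr[15:11] */
    return d;
}
/* Equivalently: block number B = addr >> 4, and d.line == B % 128. */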
Direct Mapping: Hardware Implementation

[Figure: the location field of the address indexes the cache directly; the stored tag
is compared with the tag field of the address to detect a hit.]

Fully Associative Mapping

A memory block can be placed into any cache block location.
  Space in the cache can be utilized efficiently.
  A new block replaces an old block only when the cache is full.
All the tags are searched in parallel for the desired block.
  Very high hardware cost for search.
16-bit memory address divided in two parts:
  Tag (12 bits) | Offset (4 bits)
  Lower 4 bits of the address are the offset within the block (2^4 = 16).
  Upper 12 bits of the address are compared with the tag (2^12 = 4096 = 4K).

karkare, CSE, IITK CS220, Memory 5/12 karkare, CSE, IITK CS220, Memory 6/12

Set Associative Mapping

A combination of direct- and fully associative- mapping.
Blocks in the cache are grouped into sets.
k-way associative: each set contains k blocks.
  k = 1: direct mapping.
  k = # of cache blocks: fully associative.
A block in main memory can be placed in any block of a specific set.
Contention is reduced w.r.t. direct mapping.
Hardware cost is reduced w.r.t. fully associative mapping.

Set Associative Mapping

16-bit memory address divided in three parts. Example: 2-way set associative.
  Total 128/2 = 64 sets. Total 64 blocks mapped per set.
  Tag (6 bits) | Set # (6 bits) | Offset (4 bits)
  Lower 4 bits of the address are the offset within the block (2^4 = 16).
  Middle 6 bits of the address specify the set (2^6 = 64).
  Upper 6 bits of the address are compared with the tag (2^6 = 64).

karkare, CSE, IITK CS220, Memory 7/12 karkare, CSE, IITK CS220, Memory 8/12
Set Associative Mapping: Hardware Implementation

[Figure: the set field indexes the cache; the k tags of that set are compared in
parallel with the tag field of the address.]

Replacement Policies

The cache controller needs to implement policies about:
  When do we replace? (easy to answer!)
  How do we replace?
Replacement: at the time of conflict.
  All k blocks in the target set are full.
  Which of the k blocks to replace?
  FIFO, LRU, random.
  May require extra information in the tag.
Schemes are easier for small k (1 or 2).
karkare, CSE, IITK CS220, Memory 9/12 karkare, CSE, IITK CS220, Memory 10/12

Extra Information in TAG

For the FIFO scheme, we need counters.
  When a block is replaced, the counters for all other blocks in the same set are
  incremented.
  For the block brought in, the counter is set to 0.
For the LRU scheme, we need reference registers.
  When a block is referred to, a 1 is inserted in the ref bits of the referred block
  and a 0 in all other blocks.
In case of FIFO, the replacement candidate is the block with the highest counter value.
In case of LRU, the replacement candidate is chosen using the ref bits with the most
zeros at the end.
For k = 2, we need just one bit of history/reference information per block.
For k = 1 (direct mapping) we need no extra information.

Other Issues

What do we do if the cache block is not in sync with memory?
  The cache block is more current.
  Need to store this information.
Dirty (D) bit per block.
  Set when a write occurs in the block.
  Reset when a new block is brought in.
  If the dirty bit is set, write the block to memory before reading another block.
Valid (V) bit
  Whether the contents of a cache block are valid or not.
karkare, CSE, IITK CS220, Memory 11/12 karkare, CSE, IITK CS220, Memory 12/12
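A minimal C sketch of the k = 2 case, assuming a single bit of history per set (the structure and names are illustrative, not a particular controller design):

#include <stdbool.h>

typedef struct {
    unsigned tag[2];
    bool     valid[2];
    int      last_used;     /* way referenced most recently: the single history bit */
} Set2;

/* Returns the way that holds 'tag', filling/replacing a way on a miss. */
static int access_set(Set2 *s, unsigned tag)
{
    for (int w = 0; w < 2; w++)
        if (s->valid[w] && s->tag[w] == tag) {   /* hit */
            s->last_used = w;
            return w;
        }
    int victim = 1 - s->last_used;               /* LRU way = the one not used last */
    s->tag[victim]   = tag;                      /* (write back first if dirty)      */
    s->valid[victim] = true;
    s->last_used = victim;
    return victim;
}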
Acknowledgement
CS220: Introduction to Computer Organization
2011-12 Ist Semester

Input Output - I
The material in the following slides is based on the following
references:
Amey Karkare
Computer Organization and Architecture and its companion
karkare@cse.iitk.ac.in
slides by William Stallings
Department of CSE, IIT Kanpur

karkare, CSE, IITK CS220, I/O 1/11 karkare, CSE, IITK CS220, I/O 2/11

I/O Issues Generic I/O Module Structure

Wide variety of peripherals.


Different methods of operation.
Delivering different amount of data.
Data transfer rate typically much slower than the memory or CPU.
Different data formats or word lengths.
Inefficient to hook peripheral directly to the CPU/system
bus.
I/O modules
Interface to CPU and memory
via system bus or switch.
Interface to one or more peripherals.
via tailored data links.

karkare, CSE, IITK CS220, I/O 3/11 karkare, CSE, IITK CS220, I/O 4/11
External Devices External Devices

Human readable
Day-to-day activities
Monitor, Printer, Projector, Keyboard.
Machine readable
Monitoring and control.
Hard disks, Sensors, Actuators.
Communication
To communicate with remote devices.
Modem, Network interface cards.

karkare, CSE, IITK CS220, I/O 5/11 karkare, CSE, IITK CS220, I/O 6/11

Functions of I/O Modules

Control and Timing
Processor Communication
Device communication
Data buffering
Error detection

I/O Steps

Simplified scenario:
  CPU checks I/O module device status.
  I/O module returns status.
  If ready, CPU requests data transfer.
  I/O module gets data from device.
  I/O module transfers data to CPU.
Variations for output, DMA, etc.

karkare, CSE, IITK CS220, I/O 7/11 karkare, CSE, IITK CS220, I/O 8/11
I/O Module Block Diagram I/O Module Decisions

Hide or reveal device properties to CPU.


Support multiple or single device.
Control device functions or leave it for CPU.
Also operating system level decisions.
Unix treats everything it can as file.

karkare, CSE, IITK CS220, I/O 9/11 karkare, CSE, IITK CS220, I/O 10/11

Input Output Techniques

Programmed I/O.
Interrupt Driven I/O.
Direct Memory Access (DMA).

karkare, CSE, IITK CS220, I/O 11/11


Acknowledgement
CS220: Introduction to Computer Organization
2011-12 Ist Semester

Input Output - II
The material in the following slides is based on the following
references:
Amey Karkare
Computer Organization and Architecture and its companion
karkare@cse.iitk.ac.in
slides by William Stallings
Department of CSE, IIT Kanpur

karkare, CSE, IITK CS220, I/O 1/16 karkare, CSE, IITK CS220, I/O 2/16

Input Output Techniques

Programmed I/O.
Interrupt Driven I/O.
Direct Memory Access (DMA).

Programmed I/O

CPU has direct control over I/O.
  Sensing status.
  Read/write commands.
  Transferring data.
CPU waits for the I/O module to complete the operation.
Advantage: Simple to implement.
Problem: Wastes CPU time.

karkare, CSE, IITK CS220, I/O 3/16 karkare, CSE, IITK CS220, I/O 4/16
Programmed I/O: Steps

CPU requests I/O operation.
I/O module performs operation.
I/O module sets status bits.
CPU checks status bits periodically.
  I/O module does not inform/interrupt CPU directly.
  CPU may wait or come back later.

Programmed I/O: Commands

CPU issues address.
  Specifies I/O module and device.
CPU issues command:
  Control: activate a peripheral and tell it what to do. e.g. spin up disk.
  Test: test various status conditions associated with an I/O module and its
  peripherals. Powered on? Error?
  Read: causes the I/O module to obtain an item of data from the peripheral and place
  it into an internal register.
  Write: causes the I/O module to take a unit of data from the data bus and transmit
  it to the peripheral.

karkare, CSE, IITK CS220, I/O 5/16 karkare, CSE, IITK CS220, I/O 6/16

Programmed I/O: Addressing I/O Devices

Data transfer is very much like memory access.
  From the CPU viewpoint.
Each device has a unique identifier.
CPU commands contain a reference to the identifier (address).

Programmed I/O: I/O Mapping

Memory Mapped I/O
  Single address space for both memory and I/O devices.
  I/O module registers are treated as memory addresses.
  Same machine instructions used to access both memory and I/O devices.
  Shared address, data and control lines.
  Advantage: Allows for more efficient programming.
  Disadvantage: Uses up valuable memory address space.
Isolated I/O
  Separate address space for memory and I/O devices.
  Need I/O or memory select lines.
  Small number of special I/O commands.

karkare, CSE, IITK CS220, I/O 7/16 karkare, CSE, IITK CS220, I/O 8/16
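A minimal C sketch of memory-mapped programmed (polled) I/O; the register addresses and the READY bit below are made up for illustration, a real device defines its own.

#include <stdint.h>

#define DEV_STATUS (*(volatile uint32_t *)0xFFFF0000u)  /* hypothetical status reg */
#define DEV_DATA   (*(volatile uint32_t *)0xFFFF0004u)  /* hypothetical data reg   */
#define STATUS_READY 0x1u

static uint8_t read_byte_polled(void)
{
    while ((DEV_STATUS & STATUS_READY) == 0)
        ;                            /* busy wait: this is the wasted CPU time      */
    return (uint8_t)DEV_DATA;        /* ordinary load; same instructions as memory  */
}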
Interrupt Driven I/O

Overcomes long periods of CPU waiting.
The processor does not have to repeatedly check the I/O module status.

Interrupt Driven I/O: I/O Module's View

Receive a READ command from the processor.
Read data from the desired peripheral into the data register.
Interrupt the processor.
Wait until data is requested by the processor.
Place data on the data bus when requested.

karkare, CSE, IITK CS220, I/O 9/16 karkare, CSE, IITK CS220, I/O 10/16

Interrupt Driven I/O: Processor's View

Issue a READ command.
Perform some useful work.
Keep checking for interrupts at the end of every instruction cycle.
When interrupted (by the I/O module), save the current context.
Read the data from the I/O module and save it in memory.
Restore the saved context and resume execution.

Interrupts

Interrupts are external signals to the CPU.
When an interrupt is raised,
  The CPU executes an interrupt service routine (ISR).
CPU needs to identify:
  Type/source of interrupt.
  Program to execute.
  Mode of execution.
  Will other interrupts be honored or not?
After the ISR, CPU registers and flags are to be restored such that the interrupt is
transparent.

karkare, CSE, IITK CS220, I/O 11/16 karkare, CSE, IITK CS220, I/O 12/16
Interrupts: Design Issues

Many devices; each can raise an interrupt.
  CPU must have provision for supporting all such interrupts.
  But CPU can only have a fixed, small number (1 or 2) of interrupt inputs.
Interrupt Controllers
  Multiplex multiple interrupts into a single one.
Interrupt Service Routine (ISR)
  Must not make any assumptions on its inputs (registers).
  Must not modify any system state (must save and restore registers).

ISR structure

Save registers.
Identify source of interrupt.
  Can even be done by the interrupt controller!
Execute interrupt code specific to the source.
Restore registers.
Return to unknown caller.

karkare, CSE, IITK CS220, I/O 13/16 karkare, CSE, IITK CS220, I/O 14/16

Interrupts: More issues

When should the CPU recognize an interrupt?
  Interrupts are recognized only between two instructions.
  Otherwise there will be an issue of consistency.
How to deal with multiple interrupts?
  Between two interrupt recognition points in time, multiple interrupts may happen.
  Interrupt handler may get interrupted.
  Which one to handle first?
When to give the next?
  Interrupt acknowledge signal (IACK).
  IACK is used to decide when to give the next interrupt.

Interrupt Importance

Some interrupts are more important than others.
Consider a device that gives an interrupt every 20 ms.
  Can be used as a timekeeper for the system.
Consider another device, such as a keyboard, that gives an interrupt when a key is
pressed.
Which interrupt is important?
Which should be handled first if both of them occur simultaneously?

karkare, CSE, IITK CS220, I/O 15/16 karkare, CSE, IITK CS220, I/O 16/16
Acknowledgement
CS220: Introduction to Computer Organization
2011-12 Ist Semester

Input Output - III


The material in the following slides is based on the following
references:
Amey Karkare
Computer Organization and Architecture and its companion
karkare@cse.iitk.ac.in
slides by William Stallings
Department of CSE, IIT Kanpur

karkare, CSE, IITK CS220, I/O 1/16 karkare, CSE, IITK CS220, I/O 2/16

Interrupt Importance

(recap: a 20 ms timer interrupt vs. a keyboard interrupt; which is important, and which
should be handled first if both occur simultaneously?)

Interrupt Priority

Several factors to consider:
  The interactivity of the device.
  Time required to handle the interrupt.
Larger number of interrupt generators:
  Keyboard, mouse, hard-disk, clock, memory error, system-parity-error ...
How to identify the device?
How to assign priorities?

karkare, CSE, IITK CS220, I/O 3/16 karkare, CSE, IITK CS220, I/O 4/16
Device identification

Multiple interrupt lines - each line may have multiple I/O modules.
Software poll - poll each I/O module.
  Separate command line - TESTI/O.
  Processor reads the status register of I/O modules in priority order.
  Time consuming.

Device identification

Daisy Chaining
  Common interrupt request line; any I/O module can raise it.
  Interrupt Acknowledge sent down a chain.
  The responsible I/O module places a vector on the bus.
  CPU uses the vector to identify the handler routine.

karkare, CSE, IITK CS220, I/O 5/16 karkare, CSE, IITK CS220, I/O 6/16

Daisy Chain

[Figure: the CPU and all devices share the address and data bus; each device sits
behind a Daisy Chain Control block; interrupt requests propagate towards the CPU
(intr_next -> intr_prev) while the acknowledge propagates away from it
(iack_prev -> iack_next), each block tapping off intr_dev / iack_dev for its device.]

Daisy Chain Control

intr_prev = intr_dev || intr_next;

if (intr_dev) {
    iack_dev  = iack_prev;
    iack_next = 0;
} else {
    iack_dev  = 0;
    iack_next = iack_prev;
}

karkare, CSE, IITK CS220, I/O 7/16 karkare, CSE, IITK CS220, I/O 8/16
Daisy Chain

[Figure: the same daisy-chain arrangement of Daisy Chain Control blocks and devices on
the address and data bus as on the previous slide.]

Device identification

Bus Arbitration
  I/O module must claim the bus before raising an interrupt.
  I/O module first gains control of the bus.
  I/O module sends the interrupt request.
  The processor acknowledges the interrupt request.
  I/O module places its vector on the data lines.

karkare, CSE, IITK CS220, I/O 9/16 karkare, CSE, IITK CS220, I/O 10/16

Multiple Interrupts

The techniques of device identification are also used to assign priorities.
  Multiple interrupt lines: each interrupt line has a priority; the processor picks
  the line with higher priority.
  Software polling: polling order determines priority.
  Daisy chain: order of the chain determines priority.
  Bus arbitration: arbitration scheme determines priority.

Interrupt Controller

Most real computers use interrupt controllers.
  For e.g., the 8259 is used by x86 based systems.
  Incorporated in the chip-set.
[Figure: devices register their interrupt lines with the interrupt controller on the
bus; the controller presents a single interrupt / iack pair to the CPU.]

karkare, CSE, IITK CS220, I/O 11/16 karkare, CSE, IITK CS220, I/O 12/16
Interrupt Sources

External Devices.
  Keyboard, Disk, Timer etc.
Exceptions (due to execution of instructions).
  Divide by zero (floating point exception).
  MMU errors (page faults, access permission related).
Software Interrupts (instructions that cause interrupts).
  Mechanism to make a system call.
  syscall in MIPS.
  int 0x80 in Linux.

Interrupt Processing

Each interrupt has an associated integer called the interrupt type.
  Interrupt types range between 0 and 255.
  Some are reserved by the processor.
ISR addresses are determined using the interrupt type.
An interrupt vector table stores the addresses of the ISRs.
Upon an interrupt, the ISR address is picked up and control is transferred to that code.
karkare, CSE, IITK CS220, I/O 13/16 karkare, CSE, IITK CS220, I/O 14/16
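A minimal C sketch of dispatching through an interrupt vector table; this is illustrative only, since real processors fix the table format and how the type is obtained (e.g. from the 8259 during the acknowledge protocol).

#include <stdint.h>

#define NUM_TYPES 256

typedef void (*isr_t)(void);
static isr_t ivt[NUM_TYPES];                    /* interrupt vector table */

static void default_isr(void) { /* ignore unexpected interrupts */ }

void ivt_init(void)
{
    for (int i = 0; i < NUM_TYPES; i++) ivt[i] = default_isr;
}

void register_isr(uint8_t type, isr_t handler) { ivt[type] = handler; }

/* Called with the interrupt type supplied by the hardware/interrupt controller;
   registers are assumed to have been saved before this point and restored after. */
void dispatch_interrupt(uint8_t type) { ivt[type](); }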

Determining Interrupt Type

For a software interrupt, the interrupt type is part of the instruction.
  int 0x80: 0x80 is the interrupt type.
For exceptions, interrupt types are fixed by the processor.
  Type 11: Segment not present.
  Type 12: Stack fault.
  Type 6: Undefined instruction.
For hardware interrupts
  The interrupt controller (e.g. 8259) provides the interrupt type on the bus during
  the interrupt acknowledgement protocol.
  Interrupts can be masked individually: not delivered to the processor.

Direct Memory Access (DMA)

Programmed I/O
  Slow.
  Busy wait.
Interrupt Driven I/O
  Efficient compared to Programmed I/O, but for each data transfer a number of
  instructions are executed.
  The processor is tied up doing the I/O.
  Especially troublesome with large amounts of I/O.

karkare, CSE, IITK CS220, I/O 15/16 karkare, CSE, IITK CS220, I/O 16/16
Acknowledgement
CS220: Introduction to Computer Organization
2011-12 Ist Semester

Input Output - IV
The material in the following slides is based on the following
references:
Amey Karkare
Computer Organization and Architecture and its companion
karkare@cse.iitk.ac.in
slides by William Stallings
Department of CSE, IIT Kanpur

karkare, CSE, IITK CS220, I/O 1/14 karkare, CSE, IITK CS220, I/O 2/14

Direct Memory Access (DMA)

(recap: the drawbacks of Programmed I/O and Interrupt Driven I/O listed above motivate
DMA)

DMA

DMA module on the system bus used to mimic the processor.
  DMA module only uses the system bus when the processor does not need it.
  Requires a multi-master bus.
A dedicated bus master (DMA controller) performs the I/O.
  Read from device and write to memory, or vice-versa.
Multiple masters contending for the bus.
  One of them must win, while others must wait (Arbitration).
  Who should win? Priority.

karkare, CSE, IITK CS220, I/O 3/14 karkare, CSE, IITK CS220, I/O 4/14
DMA Transfer Scenario

CPU tells the DMA controller:
  Read/Write.
  Device address.
  Starting address of the memory block for data.
  Amount of data to be transferred.
CPU carries on with other work.
DMA controller deals with the transfer.
  DMA controller arbitrates for the bus and gets mastership of the bus.
  It then transfers one byte from the I/O location to memory (or the other way).

DMA Controller

[Figure: Data Count, Data Register and Address Register attached to the data and
address lines; control logic with DMA Request, DMA Ack, Interrupt, Read and Write
lines.]

karkare, CSE, IITK CS220, I/O 5/14 karkare, CSE, IITK CS220, I/O 6/14

DMA Controller

DMA Controller (DMAC) works in two modes:
  Slave mode.
  Master mode.
In the slave mode, the DMAC is like an I/O device. The processor can write data into
its registers.
  Registers included for size of data transfer, starting address, direction etc.
  A command register and a status register to manage the transfer.
  The "start the transfer" command makes the DMAC wait for the device and then
  initiate the I/O.
  At the end of the transfer, the DMAC can interrupt the CPU.

DMA Transfer Modes

Cycle stealing.
Burst Mode.

karkare, CSE, IITK CS220, I/O 7/14 karkare, CSE, IITK CS220, I/O 8/14
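A sketch of how a driver might program such a DMAC in slave mode, following the transfer scenario above; the register layout, addresses and command bits are entirely hypothetical.

#include <stdint.h>

#define DMAC_ADDR   (*(volatile uint32_t *)0xFFFF1000u)  /* starting memory address */
#define DMAC_COUNT  (*(volatile uint32_t *)0xFFFF1004u)  /* amount of data (bytes)  */
#define DMAC_CMD    (*(volatile uint32_t *)0xFFFF1008u)  /* command register        */
#define DMAC_STATUS (*(volatile uint32_t *)0xFFFF100Cu)  /* status register         */
#define CMD_READ_DEV 0x1u    /* direction: device -> memory */
#define CMD_START    0x2u

void start_dma_read(uint32_t mem_addr, uint32_t nbytes)
{
    /* Slave-mode accesses: the CPU writes the DMAC's registers like any I/O device. */
    DMAC_ADDR  = mem_addr;
    DMAC_COUNT = nbytes;
    DMAC_CMD   = CMD_READ_DEV | CMD_START;
    /* The CPU now carries on with other work; the DMAC becomes bus master, moves the
       data, and raises an interrupt when DMAC_STATUS indicates completion. */
}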
Cycle Stealing

DMA controller takes over the bus for a cycle.
  Transfer of one word of data.
Not an interrupt to the CPU.
  CPU does not switch context.
CPU suspended just before it accesses the bus, i.e. before an operand or data fetch or
a data write.
Slows down the CPU, but not as much as the CPU doing the transfer itself.

Burst Mode

DMA controller takes over the bus for a large number of cycles.
A block of data is transferred before returning bus control.
Choice between burst mode and cycle stealing depends on:
  The speed at which data is arriving relative to the bus bandwidth, and
  Whether a particular application will allow the CPU to be locked off the bus for the
  duration of one block transfer.

karkare, CSE, IITK CS220, I/O 9/14 karkare, CSE, IITK CS220, I/O 10/14

DMA Configurations

Single bus, detached DMA.
Single bus, integrated DMA-I/O.
I/O Bus.

Single Bus, Detached DMA

[Figure: Processor, DMA module, I/O modules and Memory all attached to one bus.]

Single Bus, Integrated DMA

[Figure: the I/O modules hang off the DMA module(s), which sit on the system bus along
with the Processor and Memory.]

I/O Bus

[Figure: the DMA module bridges the system bus (Processor, Memory) and a separate I/O
bus to which the I/O modules are attached.]

karkare, CSE, IITK CS220, I/O 13/14 karkare, CSE, IITK CS220, I/O 14/14
Acknowledgement
CS220: Introduction to Computer Organization
2011-12 Ist Semester

System Buses
The material in the following slides is based on the following
references:
Amey Karkare
Computer Organization and Architecture and its companion
karkare@cse.iitk.ac.in
slides by William Stallings
Department of CSE, IIT Kanpur

karkare, CSE, IITK CS220, Buses 1/19 karkare, CSE, IITK CS220, Buses 2/19

Computer Modules

[Figure: the major modules (CPU, memory, I/O) and the addresses, data and control
signals exchanged between them.]

Memory Connection

Receives addresses (locations).
Receives and sends data.
Receives control signals.
  RD/WR/Size.
karkare, CSE, IITK CS220, Buses 3/19 karkare, CSE, IITK CS220, Buses 4/19
I/O Module Connection

Receive address from processor.
Receive control signals from processor.
Send control signals to peripheral.
Output
  Receive data from processor.
  Send data to peripheral.
Input
  Receive data from peripheral.
  Send data to processor.
Send interrupt signal to processor (control).

CPU Connection

From the CPU's viewpoint, similar to memory.
Read instructions and data.
Write data (after processing).
Send control signals to other units.
Receive and process interrupts.
Need to connect all modules together!

karkare, CSE, IITK CS220, Buses 5/19 karkare, CSE, IITK CS220, Buses 6/19

Bus

Communication pathway connecting two or more devices.
Shared transmission medium.
  Usually broadcast: a signal transmitted by any one device is available to all others.
Consists of multiple lines grouped together.
  Each line carries signals representing binary 0 and 1.
  Example: an 8-bit unit of data transmitted together over 8 bus lines.

Buses

Computer systems contain a number of different buses connecting various components at
various levels of hierarchy.
Control/Address/Data bus
System bus
  A bus that connects major components.
  E.g. processor, memory, I/O.

karkare, CSE, IITK CS220, Buses 7/19 karkare, CSE, IITK CS220, Buses 8/19
Data Bus

Carries data.
  There is no difference between data and instruction at this level.
Width is a key determinant of performance.
  8, 16, 32, 64 bit.

Address Bus

Identifies the source or destination of data.
  e.g. the CPU needs to read an instruction (data) from a given location in memory.
Bus width determines the maximum memory capacity of the system.
  e.g. the 8080 has a 16-bit address bus giving a 64K address space.

karkare, CSE, IITK CS220, Buses 9/19 karkare, CSE, IITK CS220, Buses 10/19

Control Bus

Control and Timing Information.
  RD/WR
  Interrupt request/IACK
  Clock signals

Single vs Multiple Buses

Lots of devices on one bus leads to:
  Propagation delays: long data paths mean that coordination of bus use can adversely
  affect performance.
  Problems when the aggregate data transfer approaches bus capacity.
Use of multiple buses to overcome these problems.

karkare, CSE, IITK CS220, Buses 11/19 karkare, CSE, IITK CS220, Buses 12/19
Traditional Bus

High Performance Bus

[Figures contrasting a traditional bus organization with a high-performance bus
organization.]

karkare, CSE, IITK CS220, Buses 13/19 karkare, CSE, IITK CS220, Buses 14/19

Bus Types

Dedicated
  Separate data and address lines.
Multiplexed
  Shared lines.
  Address valid or data valid control lines.
  Advantage - fewer lines.
  Disadvantages
    More complex control.
    Performance penalty.

Bus Arbitration

More than one module controlling the bus.
  e.g. CPU and DMA controller.
Only one module may control the bus at one time.
Arbitration may be centralised or distributed.

karkare, CSE, IITK CS220, Buses 15/19 karkare, CSE, IITK CS220, Buses 16/19
Arbitration Techniques

Centralised Arbitration
  Single hardware device controlling bus access.
    Bus controller.
    Arbiter.
  May be part of the CPU or separate.
Distributed Arbitration
  Control logic on all modules.

Other Bus Characteristics

Topology
  Broadcast bus, Point to Point, Shared
Communication Data Width
  Serial vs. Parallel
Duplexity
  Simplex (Non-duplex), Half Duplex, Full Duplex.
Synchronicity
  Asynchronous vs. synchronous.

karkare, CSE, IITK CS220, Buses 17/19 karkare, CSE, IITK CS220, Buses 18/19

Examples of Buses

Parallel Printer Port bus is a 25 pin parallel data bus.


RS232 is a serial bus (one bit of data), full duplex.
IDE, ATA, PS2, USB etc. are all I/O buses.

karkare, CSE, IITK CS220, Buses 19/19


CS1252 COMPUTER ORGANIZATION AND ARCHITECTURE
(Common to CSE and IT)
L T P C
3 1 0 4
UNIT I BASIC STRUCTURE OF COMPUTERS 9

Functional units - Basic operational concepts - Bus structures - Performance and
metrics - Instructions and instruction sequencing - Hardware / Software interface -
Instruction set architecture - Addressing modes - RISC - CISC - ALU design - Fixed
point and floating point operations.

UNIT II BASIC PROCESSING UNIT 9

Fundamental concepts - Execution of a complete instruction - Multiple bus
organization - Hardwired control - Micro programmed control - Nano programming.

UNIT III PIPELINING 9

Basic concepts - Data hazards - Instruction hazards - Influence on instruction sets -
Data path and control considerations - Performance considerations - Exception handling.

UNIT IV MEMORY SYSTEM 9

Basic concepts - Semiconductor RAM - ROM - Speed, Size and cost - Cache memories -
Improving cache performance - Virtual memory - Memory management requirements -
Associative memories - Secondary storage devices.

UNIT V I/O ORGANIZATION 9

Accessing I/O devices - Programmed I/O - Interrupts - Direct memory access - Buses -
Interface Circuits - Standard I/O interfaces (PCI, SCSI, and USB) - I/O Devices and
processors.
L: 45 T: 15 Total: 60
TEXT BOOKS
1. Carl Hamacher, Zvonko Vranesic and Safwat Zaky, Computer Organization, 5th
Edition, Tata Mc-Graw Hill, 2002.
2. Heuring, V.P. and Jordan, H.F., Computer Systems Design and Architecture, 2nd
Edition, Pearson Education, 2004.

REFERENCES
1. Patterson, D. A., and Hennessy, J.L., Computer Organization and Design:The
Hardware/Software Interface, 3rd Edition, Elsevier, 2005.
2. William Stallings, Computer Organization and Architecture Designing for
Performance, 6th Edition, Pearson Education, 2003.
3. Hayes, J.P., Computer Architecture and Organization, 3rd Edition, Tata Mc-Graw
Hill, 1998.
UNIT I

BASIC STRUCTURE OF COMPUTERS

Functional units

Basic operational concepts

Bus structures

Performance and metrics

Instructions and instruction sequencing

Hardware

Software interface

Instruction set architecture

Addressing modes

RISC

CISC

ALU design

Fixed point and floating point operations


BASIC STRUCTURE OF COMPUTERS:
Computer Organization:

It refers to the operational units and their interconnections that realize the
architectural specifications.
It describes the function and design of the various units of a digital computer that
store and process information.

Computer hardware:

Consists of electronic circuits, displays, magnetic and optical storage media,


electromechanical equipment and communication facilities.

Computer Architecture:

It is concerned with the structure and behaviour of the computer.


It includes the information formats, the instruction set and techniques for
addressing memory.

Functional Units

A computer consists of 5 main parts.


Input
Memory
Arithmetic and logic
Output
Control unit

Functional units of a Computer


Input unit accepts coded information from human operators, from
electromechanical devices such as keyboards, or from other computers
over digital communication lines.
The information received is either stored in the computer's memory for
later reference or immediately used by the arithmetic and logic circuitry to
perform the desired operations.
The processing steps are determined by a program stored in the memory.
Finally the results are sent back to the outside world through the output
unit.
All of these actions are coordinated by the control unit.
The list of instructions that performs a task is called a program.
Usually the program is stored in the memory.
The processor then fetches the instructions that make up the program from
the memory one after another and performs the desired operations.

1.1 Input Unit:


Computers accept coded information through input units, which read the
data.
Whenever a key is pressed, the corresponding letter or digit is
automatically translated into its corresponding binary code and transmitted
over a cable to either the memory or the processor.
Some input devices are
Joysticks
Trackballs
Mice
Microphones (capture audio input, which is sampled and converted
into digital codes for storage and processing).
1.2.Memory Unit:
It stores the programs and data.
There are 2 types of storage classes
Primary
Secondary
Primary Storage:
It is a fast memory that operates at electronic speeds.
Programs must be stored in the memory while they are
being executed.
The memory contains a large number of semiconductor storage
cells.
Each cell carries 1 bit of information.
The cells are processed in groups of fixed size called
words.
To provide easy access to any word in the memory, a distinct
address is associated with each word location.
Addresses are numbers that identify successive locations.
The number of bits in each word is called the word length.
The word length ranges from 16 to 64 bits.
There are 3 types of memory. They are

RAM(Random Access Memory)


Cache memory
Main Memory

RAM:
Memory in which any location can be reached in a short and fixed amount of time
after specifying its address is called RAM.
The time required to access one word is called the memory access time.

Cache Memory:

The small, fast RAM units are called cache. They are tightly coupled with the
processor to achieve high performance.

Main Memory:
The largest and the slowest unit is called the main memory.

1.3. ALU:
Most computer operations are executed in ALU.
Consider an example:
Suppose 2 numbers located in memory are to be added. They are brought
into the processor and the actual addition is carried out by the ALU. The sum may then
be stored in the memory or retained in the processor for immediate use.
Access time to registers is faster than access time to the fastest cache unit in
memory.
1.4. Output Unit:
Its function is to send the processed results to the outside world, e.g. a printer.
Printers are capable of printing 10,000 lines per minute, but this is still
slow compared to the processor.

1.5. Control Unit:


The operations of the input unit, output unit and ALU are coordinated by the
control unit.
The control unit is the Nerve centre that sends control signals to other
units and senses their states.
Data transfers between the processor and the memory are also controlled
by the control unit through timing signals.
The operations of a computer are:
The computer accepts information in the form of programs and
data through an input unit and stores it in the memory.
Information stored in the memory is fetched, under program
control into an arithmetic and logic unit, where it is processed.
Processed information leaves the computer through an output unit.
All activities inside the machine are directed by the control unit.

BASIC OPERATIONAL CONCEPTS:

The data/operands are stored in memory.


The individual instruction are brought from the memory to the processor, which
executes the specified operation.

Eg 1: Add LOCA, R0

The instruction is fetched from memory and the operand at location LOCA is fetched. It is then
added to the contents of R0, and the resulting sum is stored in register R0.

Eg:2
Load LOC A, R1

Transfer the contents of memory location A to the register R1.

Eg:3
Add R1 ,R0

Add the contents of Register R1 & R0 and places the sum into R0.
Fig:Connection between Processor and Main Memory

Instruction Register(IR)
Program Counter(PC)
Memory Address Register(MAR)
Memory Data Register(MDR)

Instruction Register (IR):

It holds the instruction that is currently being executed.


It generates the timing signals.

Program Counter (PC):

It contains the memory address of the next instruction to be fetched for execution.

Memory Address Register (MAR):

It holds the address of the location to be accessed.

Memory Data Register (MDR):

It contains the data to be written into or read out of the addressed location.
MAR and MDR facilitate communication with the memory.

Operation Steps:

The program resides in memory. Execution starts when the PC points to the first
instruction of the program.
The contents of the PC are transferred to the MAR and a Read control signal is issued to the memory.
The memory loads the addressed word into the MDR. The contents of the MDR are then transferred to the
instruction register. The instruction is ready to be decoded and executed.

Interrupt:

Normal execution of the program may be pre-empted if some device requires


urgent servicing.
E.g., a monitoring device in a computer-controlled industrial process may detect a
dangerous condition.
In order to deal with the situation immediately, the device raises an interrupt signal and the
normal execution of the current program is interrupted.
The processor provides the requested service called the Interrupt Service
Routine(ISR).

The ISR saves the internal state of the processor in memory before servicing the
interrupt, because the interrupt may alter the state of the processor.
When ISR is completed, the state of the processor is restored and the interrupted
program may continue its execution.
BUS STRUCTURES:

A group of lines that serves as the connection path to several devices is called a bus.
A bus consists of a set of lines or wires, each line carrying one bit.
The lines carry data, address or control signals.

There are 2 types of Bus structures. They are


Single Bus Structure
Multiple Bus Structure

3.1.Single Bus Structure:

It allows only one transfer at a time.


It costs low.
It is flexible for attaching peripheral devices.
Its Performance is low.

3.2.Multiple Bus Structure:

It allows two or more transfers at a time.


It costs high.
It provides concurrency in operation.
Its Performance is high.

Devices connected to the bus                    Speed


Electro-mechanical devices
(keyboard, printer)                             Slow

Magnetic / optical disk                         High

Memory & processing units                       Very high

A buffer register, when connected to the bus, holds the information during a transfer.
The buffer register prevents the high-speed processor from being locked to a slow I/O
device during a sequence of data transfers.
SOFTWARE:

System software is a collection of programs that are executed as needed to perform
functions such as:

Receiving & Interpreting user commands.


Entering & editing application program and storing them as files in secondary
Storage devices.
Managing the storage and retrieval of files in Secondary Storage devices.
Running the standard application such as word processor, games, and
spreadsheets with data supplied by the user.
Controlling I/O units to receive input information and produce output results.
Translating programs from source form prepared by the user into object form.
Linking and running user-written application programs with existing standard
library routines.

Software is of 2 types.They are

Application program
System program
Application Program:

It is written in a high level programming language (C, C++, Java, Fortran).


The programmer using high level language need not know the details of machine
program instruction.

System Program:(Compiler,Text Editor,File)


Compiler:

It translates the high level language program into the machine language program.

Text Editor:

It is used for entering & editing the application program.

System software Component ->OS(OPERATING SYSTEM)


Operating System :

It is a large program or a collection of routines that is used to control the sharing


of and interaction among various computer units.

Functions of OS:

Assign resources to individual application program.


Assign memory and magnetic disk space to program and data files.
Move the data between the memory and disk units.
Handles I/O operation.

Fig:User Program and OS routine sharing of the process

Steps:

1. The first step is to transfer the file into memory.


2. When the transfer is completed, the execution of the program starts.
3. During time period t0 to t1 , an OS routine initiates loading the application
program from disk to memory, wait until the transfer is complete and then passes the
execution control to the application program & print the results.
4. Similar action takes place during t2 to t3 and t4 to t5.
5. At t5, Operating System may load and execute another application program.
6. Similarly during t0 to t1, the Operating System can arrange to print the
previous program's results while the current program is being executed.
7. The pattern of managing the concurrent execution of the several application
programs to make the best possible use of computer resources is called the multi-
programming or multi-tasking.
PERFORMANCE:

For best performance, it is necessary to design the compiler, machine instruction


set and hardware in a co-ordinate way.

Elapsed Time - the total time required to execute the program is called the elapsed time.
It depends on all the units in the computer system.
Processor Time - the period in which the processor is active is called the processor time.
It depends on the hardware involved in the execution of the instruction.

Fig: The Processor Cache

A Program will be executed faster if the movement of instruction and data


between the main memory and the processor is minimized, which is achieved by using
the Cache.

Processor clock:

Clock - the processor circuits are controlled by a timing signal called a clock.


Clock Cycle - the clock defines a regular time interval called the clock cycle.

Clock Rate, R = 1/P

where P is the length of one clock cycle.

Basic Performance Equation:

T = (N*S)/R

where, T - performance parameter


R - clock rate in cycles/sec
N - actual number of instructions executed
S - average number of basic steps needed to execute one machine instruction.
To achieve high performance,

N and S should be reduced and R should be increased.
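As a rough illustration of the basic performance equation, the short C sketch below plugs made-up values of N, S and R into T = (N*S)/R; the numbers are assumptions, not taken from the text.

#include <stdio.h>

int main(void) {
    double N = 100e6;   /* assumed: 100 million instructions executed      */
    double S = 4.0;     /* assumed: 4 basic steps per instruction, average */
    double R = 500e6;   /* assumed: 500 MHz clock rate (cycles per second) */

    double T = (N * S) / R;   /* program execution time in seconds */
    printf("Execution time T = %.3f s\n", T);   /* prints 0.800 s */
    return 0;
}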

Pipelining and Superscalar operation:

Pipelining - a substantial improvement in performance can be achieved by overlapping


the execution of successive instructions, using a technique called pipelining.

Superscalar Execution - it is possible to start the execution of several instructions in


every clock cycle, i.e. several instructions can be executed in parallel by creating
parallel paths. This mode of operation is called superscalar execution.

Clock Rate:

There are 2 possibilities to increase the clock rate(R).They are,

Improving the integrated circuit (IC) technology makes logic circuits faster.


Reducing the amount of processing done in one basic step also helps to reduce the
clock period P.

Instruction Set:CISC AND RISC:

Complex instructions combined with pipelining would achieve the best
performance, but it is much easier to implement efficient pipelining in processors
with simple instruction sets.

RISC (Reduced Instruction Set Computer): the design of the instruction set
of a processor with simple instructions.

CISC (Complex Instruction Set Computer): the design of the instruction set
of a processor with complex instructions.

A high level language program is translated by the compiler into machine
instructions.
Functions of Compiler:

The compiler rearranges the program instructions to achieve better performance.


The high quality compiler must be closely linked to the processor architecture to
reduce the total number of clock cycles.

Performance Measurement:

The Performance Measure is the time it takes a computer to execute a given


benchmark.
A non-profit organization called SPEC (System Performance Evaluation
Corporation) selects and publishes representative application programs.

SPEC rating = (Running time on the reference computer) / (Running time on the computer under test)

The overall SPEC rating for the computer is the geometric mean of the individual ratings:

SPEC rating = ( SPEC1 * SPEC2 * ... * SPECn )^(1/n)

where n is the number of programs in the suite and SPECi is the rating
for program i in the suite.
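Since the overall rating is a geometric mean, it can be computed as in the small C sketch below; the per-program ratings are made-up numbers, used only to show the calculation.

#include <math.h>
#include <stdio.h>

int main(void) {
    double spec[] = { 12.0, 8.0, 15.0, 10.0 };   /* assumed per-program ratings */
    int n = sizeof(spec) / sizeof(spec[0]);

    double product = 1.0;
    for (int i = 0; i < n; i++)
        product *= spec[i];

    double overall = pow(product, 1.0 / n);      /* geometric mean of n ratings */
    printf("Overall SPEC rating = %.2f\n", overall);
    return 0;
}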

INSTRUCTION AND INSTRUCTION SEQUENCING

A computer must have instruction capable of performing the following operations. They
are,

Data transfer between memory and processor register.


Arithmetic and logical operations on data.
Program sequencing and control.
I/O transfer.

Register Transfer Notation:


The possible locations in which transfer of information occurs are,
Memory Location
Processor register
Registers in I/O sub-system.

Location: Memory. Hardware binary address examples: LOC, PLACE, A, VAR2.
Eg: R1 <- [LOC] - the contents of memory location LOC are transferred to processor register R1.

Location: Processor registers. Examples: R0, R1, ...
Eg: R3 <- [R1] + [R2] - add the contents of registers R1 and R2 and place their sum into
register R3. This is called Register Transfer Notation.

Location: I/O registers. Examples: DATAIN, DATAOUT - provide status information.

Assembly Language Notation:

Assembly language format          Description

Move LOC,R1                       Transfers the contents of memory location LOC to the
                                  processor register R1.

Add R1,R2,R3                      Adds the contents of registers R1 and R2 and places their
                                  sum into register R3.

Basic Instruction Types:


Instruction type   Syntax                                   Eg          Description

Three Address      Operation Source1,Source2,Destination    Add A,B,C   Add the values of variables
                                                                         A and B and place the result
                                                                         into C.
Two Address        Operation Source,Destination             Add A,B     Add the values of A and B
                                                                         and place the result into B.
One Address        Operation Operand                        Add B       The contents of the
                                                                         accumulator are added to the
                                                                         contents of B.

Instruction Execution and Straightline Sequencing:


Instruction Execution:
There are 2 phases for Instruction Execution. They are,
Instruction Fetch
Instruction Execution
Instruction Fetch:
The instruction is fetched from the memory location whose address is in the PC. It is then
placed in the IR.
Instruction Execution:
Instruction in IR is examined to determine whose operation is to be performed.
Program execution Steps:

To begin executing a program, the address of first instruction must be placed in


PC.
The processor control circuits use the information in the PC to fetch and execute
instructions one at a time in the order of increasing addresses.
This is called straight line sequencing. During the execution of each
instruction, the PC is incremented by 4 to point to the address of the next instruction.

Fig: Program Execution

Branching:

The Address of the memory locations containing the n numbers are symbolically
given as NUM1,NUM2..NUMn.
Separate Add instruction is used to add each number to the contents of register
R0.
After all the numbers have been added,the result is placed in memory location
SUM.

Fig:Straight Line Sequencing Program for adding n numbers

Using loop to add n numbers:

The number of entries in the list, n, is stored in memory location M. Register R1 is


used as a counter to determine the number of times the loop is executed.
The contents of location M are loaded into register R1 at the beginning of the program.
The loop starts at location LOOP and ends at the instruction Branch>0. During each
pass, the address of the next list entry is determined and the entry is fetched and
added to R0.

Decrement R1
It reduces the contents of R1 by 1 each time through the loop.

Branch >0 Loop

A conditional branch instruction causes a branch only if a specified condition is


satisfied.

Fig:Using loop to add n numbers:
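For readers more familiar with high-level code, the loop above corresponds roughly to the C sketch below; the array and its contents are assumptions chosen only to mirror the roles of R0, R1 and memory location M.

#include <stdio.h>

int main(void) {
    int num[] = { 3, 7, 1, 9, 4 };            /* NUM1 .. NUMn                         */
    int n = sizeof(num) / sizeof(num[0]);     /* the count stored in location M       */

    int sum = 0;                              /* register R0, cleared before the loop */
    for (int count = n; count > 0; count--)   /* Decrement R1; Branch>0 LOOP          */
        sum += num[n - count];                /* fetch next entry and add it to R0    */

    printf("SUM = %d\n", sum);                /* result stored in location SUM        */
    return 0;
}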

Conditional Codes:

The results of various operations, for use by subsequent conditional branch instructions,


are recorded in individual bits, often called condition code flags.

Commonly used flags:


N (Negative) - set to 1 if the result is negative, otherwise cleared to 0.
Z (Zero) - set to 1 if the result is 0, otherwise cleared to 0.
V (Overflow) - set to 1 if arithmetic overflow occurs, otherwise cleared to 0.
C (Carry) - set to 1 if a carry-out results from the operation, otherwise cleared to 0.

ADDRESSING MODES

The different ways in which the location of an operand is specified in an instruction


are called addressing modes.

Generic Addressing Modes:

Immediate mode
Register mode
Absolute mode
Indirect mode
Index mode
Base with index
Base with index and offset
Relative mode
Auto-increment mode
Auto-decrement mode

Implementation of Variables and Constants:

Variables:

The value can be changed as needed using the appropriate instructions.


There are 2 accessing modes to access the variables. They are

Register Mode
Absolute Mode

Register Mode:

The operand is the contents of the processor register.


The name(address) of the register is given in the instruction.

Absolute Mode(Direct Mode):

The operand is in a memory location.


The address of this location is given explicitly in the instruction.
Eg: MOVE LOC,R2

The above instruction uses the register and absolute mode.


The processor register is the temporary storage where the data in the register are accessed
using register mode.
The absolute mode can represent global variables in the program.

Mode Assembler Syntax Addressing Function

Register mode Ri EA=Ri


Absolute mode LOC EA=LOC

Where EA-Effective Address

Constants:

Address and data constants can be represented in assembly language using Immediate
Mode.

Immediate Mode.

The operand is given explicitly in the instruction.

Eg: Move 200immediate, R0

It places the value 200 in register R0. The immediate mode is used to specify the value
of a source operand.

In assembly language, writing the subscript 'immediate' is not practical, so the # symbol is used instead.


It can be re-written as

Move #200,R0

Assembly Syntax: Addressing Function

Immediate #value Operand =value

Indirection and Pointers:

The instruction does not give the operand or its address explicitly. Instead, it provides
information from which the address of the operand can be determined. This address
is called the effective address (EA) of the operand.
Indirect Mode:

The effective address of the operand is the contents of a register .


We denote indirection by placing the name of the register or the memory address given in the
instruction in parentheses.

Fig: Indirect addressing - through a memory location, Add (A),R0, and through a register, Add (R1),R0.

The address of an operand (B) is stored in register R1. If we want this operand, we can get it
through register R1 (indirection).

The register or memory location that contains the address of an operand is called a pointer.

Mode Assembler Syntax Addressing Function

Indirect Ri , LOC EA=[Ri] or EA=[LOC]

Indexing and Arrays:

Index Mode:

The effective address of an operand is generated by adding a constant value to the


contents of a register.
The register used may be either a special-purpose or a general-purpose register.
We indicate the index mode symbolically as,

X(Ri)

Where X denotes the constant value contained in the instruction


Ri It is the name of the register involved.

The Effective Address of the operand is,

EA=X + [Ri]
The index register R1 contains the address of a new location and the value of X defines
an offset(also called a displacement).

To find the operand:

First read the contents of register R1 (say 1000).

Add the offset 20 to this value to get the effective address.

1000 + 20 = 1020

Alternatively, the constant X may refer to a memory address and the contents of the index register
define the offset to the operand.
In either case, one of the two values is given explicitly in the instruction and the other is stored
in a register.

Eg: Add 20(R1), R2    EA = 1000 + 20 = 1020

Index Mode Assembler Syntax Addressing Function

Index X(Ri) EA=[Ri]+X


Base with Index (Ri,Rj) EA=[Ri]+[Rj]
Base with Index and offset X(Ri,Rj) EA=[Ri]+[Rj] +X
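The following C sketch simply evaluates the addressing functions in the table above for made-up register contents; it is illustrative only.

#include <stdio.h>

int main(void) {
    int Ri = 1000;   /* assumed contents of index register Ri      */
    int Rj = 40;     /* assumed contents of base register Rj       */
    int X  = 20;     /* constant (offset) given in the instruction */

    int ea_index      = Ri + X;        /* Index:                      EA = [Ri] + X        */
    int ea_base_index = Ri + Rj;       /* Base with index:            EA = [Ri] + [Rj]     */
    int ea_base_off   = Ri + Rj + X;   /* Base with index and offset: EA = [Ri] + [Rj] + X */

    printf("%d %d %d\n", ea_index, ea_base_index, ea_base_off);   /* 1020 1040 1060 */
    return 0;
}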

Relative Addressing:

It is the same as the index mode, except that the program counter (PC) is used in place of a
general-purpose register.

Relative Mode:

The Effective Address is determined by the Index mode using the PC in place of
the general purpose register (gpr).
This mode can be used to access data operands, but its most common use is to
specify the target address in branch instructions, e.g. Branch>0 LOOP.
This causes program execution to go to the branch target location, identified
by the name LOOP, if the branch condition is satisfied.

Mode Assembler Syntax Addressing Function

Relative X(PC) EA=[PC]+X

Additional Modes:

There are two additional modes. They are


Auto-increment mode
Auto-decrement mode

Auto-increment mode:

The Effective Address of the operand is the contents of a register in the


instruction.
After accessing the operand, the contents of this register are automatically
incremented to point to the next item in the list.

Mode Assembler syntax Addressing Function

Auto-increment (Ri)+ EA=[Ri];


Increment Ri
Auto-decrement mode:

The Effective Address of the operand is the contents of a register in the


instruction.
After accessing the operand, the contents of this register are automatically
decremented to point to the next item in the list.

Mode Assembler Syntax Addressing Function

Auto-decrement -(Ri) EA=[Ri];


Decrement Ri
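Auto-increment and auto-decrement behave much like post-increment and pre-decrement on a C pointer, as in the illustrative sketch below (the list contents are assumptions).

#include <stdio.h>

int main(void) {
    int list[] = { 10, 20, 30 };
    int *Ri = list;            /* Ri holds the address of the first item        */

    int a = *Ri++;             /* (Ri)+ : use [Ri] as the EA, then increment Ri */
    int b = *Ri++;             /* fetches the next item in the list             */

    int *Rj = &list[3];        /* address just past the end of the list         */
    int c = *--Rj;             /* -(Rj) : decrement Rj first, then use it as EA */

    printf("%d %d %d\n", a, b, c);   /* 10 20 30 */
    return 0;
}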

CISC
Pronounced "sisk", CISC stands for Complex Instruction Set Computer. Most PCs use CPUs
based on this architecture. For instance, Intel and AMD CPUs are based on CISC
architectures.
Typically CISC chips have a large amount of different and complex instructions. The
philosophy behind it is that hardware is always faster than software, therefore one should
make a powerful instruction set, which provides programmers with assembly instructions
to do a lot with short programs.
In general, CISC chips are relatively slow (compared to RISC chips) per instruction, but
use fewer instructions than RISC.

RISC
Pronounced "risk", RISC stands for Reduced Instruction Set Computer. RISC chips evolved
around the mid-1980s as a reaction to CISC chips. The philosophy behind it is that almost
no one uses complex assembly language instructions as used by CISC, and people mostly
use compilers which never use complex instructions. Apple for instance uses RISC chips.
Therefore fewer, simpler and faster instructions would be better, than the large, complex
and slower CISC instructions. However, more instructions are needed to accomplish a
task.
Another advantage of RISC is that - in theory - because of the simpler instructions,
RISC chips require fewer transistors, which makes them easier to design and cheaper to
produce.
Finally, it's easier to write powerful optimised compilers, since fewer instructions exist.

RISC vs CISC
There is still considerable controversy among experts about which architecture is better.
Some say that RISC is cheaper and faster and is therefore the architecture of the future.
Others note that by making the hardware simpler, RISC puts a greater burden on the
software. Software needs to become more complex. Software developers need to write
more lines for the same tasks.
Therefore they argue that RISC is not the architecture of the future, since conventional
CISC chips are becoming faster and cheaper anyway.
RISC has now existed more than 10 years and hasn't been able to kick CISC out of the
market. If we forget about the embedded market and mainly look at the market for PC's,
workstations and servers, at least 75% of the processors are based on the CISC
architecture. Most of them follow the x86 standard (Intel, AMD, etc.), but even in the mainframe
territory CISC is dominant via the IBM/390 chip. Looks like CISC is here to stay
Is RISC then really not better? The answer isn't quite that simple. RISC and CISC
architectures are becoming more and more alike. Many of today's RISC chips support
just as many instructions as yesterday's CISC chips. The PowerPC 601, for example,
supports more instructions than the Pentium. Yet the 601 is considered a RISC chip,
while the Pentium is definitely CISC. Furthermore, today's CISC chips use many
techniques formerly associated with RISC chips.

ALU Design

In computing an arithmetic logic unit (ALU) is a digital circuit that performs arithmetic
and logical operations. The ALU is a fundamental building block of the central
processing unit (CPU) of a computer, and even the simplest microprocessors contain one
for purposes such as maintaining timers. The processors found inside modern CPUs and
graphics processing units (GPUs) accommodate very powerful and very complex ALUs;
a single component may contain a number of ALUs.

Mathematician John von Neumann proposed the ALU concept in 1945, when he wrote a
report on the foundations for a new computer called the EDVAC. Research into ALUs
remains an important part of computer science, falling under Arithmetic and logic
structures in the ACM Computing Classification System

Numerical systems

An ALU must process numbers using the same format as the rest of the digital circuit.
The format of modern processors is almost always the two's complement binary number
representation. Early computers used a wide variety of number systems, including ones'
complement, two's complement, sign-magnitude format, and even true decimal systems,
with ten tubes per digit.
ALUs for each one of these numeric systems had different designs, and that influenced
the current preference for two's complement, as this is the representation that makes it
easier for the ALUs to calculate additions and subtractions.

The ones' complement and Two's complement number systems allow for subtraction to
be accomplished by adding the negative of a number in a very simple way which negates
the need for specialized circuits to do subtraction; however, calculating the negative in
Two's complement requires adding a one to the low order bit and propagating the carry.
An alternative way to do two's complement subtraction of A-B is to present a 1 to the carry
input of the adder and use ~B rather than B as the second input.
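The C sketch below demonstrates this trick on 8-bit values: the subtraction A - B is carried out with the same adder as A + B, using ~B and a forced carry-in of 1. The operand values are arbitrary.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t A = 23, B = 9;
    uint8_t carry_in = 1;                                    /* forced 1 for subtraction */
    uint8_t diff = (uint8_t)(A + (uint8_t)~B + carry_in);    /* A + ~B + 1 = A - B       */

    printf("%u\n", diff);                                    /* prints 14                */
    return 0;
}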

Practical overview

Most of a processor's operations are performed by one or more ALUs. An ALU loads
data from input registers, an external Control Unit then tells the ALU what operation to
perform on that data, and then the ALU stores its result into an output register. Other
mechanisms move data between these registers and memory.

Simple operations

Fig: A simple example arithmetic logic unit (2-bit ALU) that does AND, OR, XOR, and
addition. (A minimal code sketch follows the list of operations below.)

Most ALUs can perform the following operations:

Integer arithmetic operations (addition, subtraction, and sometimes multiplication


and division, though this is more expensive)
Bitwise logic operations (AND, NOT, OR, XOR)
Bit-shifting operations (shifting or rotating a word by a specified number of bits
to the left or right, with or without sign extension). Shifts can be interpreted as
multiplications by 2 and divisions by 2.
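The sketch below is a minimal, made-up model (not a particular textbook circuit) of such an ALU: a 2-bit operation code from the control unit selects AND, OR, XOR or ADD.

#include <stdint.h>
#include <stdio.h>

enum { ALU_AND, ALU_OR, ALU_XOR, ALU_ADD };   /* hypothetical operation encoding */

uint8_t alu(uint8_t a, uint8_t b, int op) {
    switch (op) {
    case ALU_AND: return a & b;
    case ALU_OR:  return a | b;
    case ALU_XOR: return a ^ b;
    default:      return (uint8_t)(a + b);    /* ALU_ADD */
    }
}

int main(void) {
    printf("%u %u %u %u\n",
           alu(3, 1, ALU_AND), alu(3, 1, ALU_OR),
           alu(3, 1, ALU_XOR), alu(3, 1, ALU_ADD));   /* prints 1 3 2 4 */
    return 0;
}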

Complex operations

Engineers can design an Arithmetic Logic Unit to calculate any operation. The more
complex the operation, the more expensive the ALU is, the more space it uses in the
processor, the more power it dissipates. Therefore, engineers compromise. They make the
ALU powerful enough to make the processor fast, but yet not so complex as to become
prohibitive. For example, computing the square root of a number might use:

1. Calculation in a single clock Design an extraordinarily complex ALU that


calculates the square root of any number in a single step.
2. Calculation pipeline Design a very complex ALU that calculates the square root
of any number in several steps. The intermediate results go through a series of
circuits arranged like a factory production line. The ALU can accept new numbers
to calculate even before having finished the previous ones. The ALU can now
produce numbers as fast as a single-clock ALU, although the results start to flow
out of the ALU only after an initial delay.
3. Iterative calculation - design a complex ALU that calculates the square root
through several steps. This usually relies on control from a complex control unit
with built-in microcode.
4. Co-processor Design a simple ALU in the processor, and sell a separate
specialized and costly processor that the customer can install just beside this one,
and implements one of the options above.
5. Software libraries Tell the programmers that there is no co-processor and there is
no emulation, so they will have to write their own algorithms to calculate square
roots by software.
6. Software emulation Emulate the existence of the co-processor, that is, whenever
a program attempts to perform the square root calculation, make the processor
check if there is a co-processor present and use it if there is one; if there isn't one,
interrupt the processing of the program and invoke the operating system to
perform the square root calculation through some software algorithm.

The options above go from the fastest and most expensive one to the slowest and least
expensive one. Therefore, while even the simplest computer can calculate the most
complicated formula, the simplest computers will usually take a long time doing that
because of the several steps for calculating the formula.

Powerful processors like the Intel Core and AMD64 implement option #1 for several
simple operations, #2 for the most common complex operations and #3 for the extremely
complex operations.

Inputs and outputs

The inputs to the ALU are the data to be operated on (called operands) and a code from
the control unit indicating which operation to perform. Its output is the result of the
computation.

In many designs the ALU also takes or generates as inputs or outputs a set of condition
codes from or to a status register. These codes are used to indicate cases such as carry-in
or carry-out, overflow, divide-by-zero, etc.

ALUs vs. FPUs

A Floating Point Unit also performs arithmetic operations between two values, but they
do so for numbers in floating point representation, which is much more complicated than
the two's complement representation used in a typical ALU. In order to do these
calculations, a FPU has several complex circuits built-in, including some internal ALUs.

In modern practice, engineers typically refer to the ALU as the circuit that performs
integer arithmetic operations (like two's complement and BCD). Circuits that calculate
more complex formats like floating point, complex numbers, etc. usually receive a more
specific name such as FPU.

FIXED POINT NUMBER AND OPERATION

In computing, a fixed-point number representation is a real data type for a number that
has a fixed number of digits after (and sometimes also before) the radix point (e.g., after
the decimal point '.' in English decimal notation). Fixed-point number representation can
be compared to the more complicated (and more computationally demanding) floating
point number representation.

Fixed-point numbers are useful for representing fractional values, usually in base 2 or
base 10, when the executing processor has no floating point unit (FPU) or if fixed-point
provides improved performance or accuracy for the application at hand. Most low-cost
embedded microprocessors and microcontrollers do not have an FPU.

Representation

A value of a fixed-point data type is essentially an integer that is scaled by a specific


factor determined by the type. For example, the value 1.23 can be represented as 1230 in
a fixed-point data type with scaling factor of 1/1000, and the value 1230000 can be
represented as 1230 with a scaling factor of 1000. Unlike floating-point data types, the
scaling factor is the same for all values of the same type, and does not change during the
entire computation.

The scaling factor is usually a power of 10 (for human convenience) or a power of 2 (for
computational efficiency). However, other scaling factors may be used occasionally, e.g.
a time value in hours may be represented as a fixed-point type with a scale factor of
1/3600 to obtain values with one-second accuracy.

The maximum value of a fixed-point type is simply the largest value that can be
represented in the underlying integer type, multiplied by the scaling factor; and similarly
for the minimum value. For example, consider a fixed-point type represented as a binary
integer with b bits in two's complement format, with a scaling factor of 1/2f (that is, the
last f bits are fraction bits): the minimum representable value is 2b-1/2f and the maximum
value is (2b-1-1)/2f.
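The C sketch below illustrates these points with an assumed 16-bit type having f = 8 fraction bits (scaling factor 1/256); the numbers are examples, not part of the text.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    int f = 8;
    double scale = 1.0 / (1 << f);              /* scaling factor 1/2^f = 1/256      */

    int16_t raw = (int16_t)(1.23 * (1 << f));   /* 1.23 stored as the integer 314    */
    printf("stored = %d, value = %f\n", raw, raw * scale);

    double min = -(double)(1 << 15) * scale;    /* -2^(b-1)/2^f      = -128.0        */
    double max = ((1 << 15) - 1) * scale;       /* (2^(b-1) - 1)/2^f =  127.99609375 */
    printf("range = [%f, %f]\n", min, max);
    return 0;
}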

Operations

To convert a number from a fixed point type with scaling factor R to another type with
scaling factor S, the underlying integer must be multiplied by R and divided by S; that is,
multiplied by the ratio R/S. Thus, for example, to convert the value 1.23 = 123/100 from a
type with scaling factor R=1/100 to one with scaling factor S=1/1000, the underlying
integer 123 must be multiplied by (1/100)/(1/1000) = 10, yielding the representation
1230/1000. If S does not divide R (in particular, if the new scaling factor S is larger than the
original R), the new integer will have to be rounded. The rounding rules and methods are
usually part of the language's specification.

To add or subtract two values the same fixed-point type, it is sufficient to add or subtract
the underlying integers, and keep their common scaling factor. The result can be exactly
represented in the same type, as long as no overflow occurs (i.e. provided that the sum of
the two integers fits in the underlying integer type.) If the numbers have different fixed-
point types, with different scaling factors, then one of them must be converted to the
other before the sum.

To multiply two fixed-point numbers, it suffices to multiply the two underlying integers,
and assume that the scaling factor of the result is the product of their scaling factors. This
operation involves no rounding. For example, multiplying the numbers 123 scaled by
1/1000 (0.123) and 25 scaled by 1/10 (2.5) yields the integer 123 x 25 = 3075 scaled by
(1/1000) x (1/10) = 1/10000, that is 3075/10000 = 0.3075. If the two operands belong to
the same fixed-point type, and the result is also to be represented in that type, then the
product of the two integers must be explicitly multiplied by the common scaling factor; in
this case the result may have to be rounded, and overflow may occur. For example, if the
common scaling factor is 1/100, multiplying 1.23 by 0.25 entails multiplying 123 by 25
to yield 3075 with an intermediate scaling factor of 1/10000. This then must be multiplied
by 1/100 to yield either 31 (0.31) or 30 (0.30), depending on the rounding method used,
to result in a final scale factor of 1/100.

To divide two fixed-point numbers, one takes the integer quotient of their underlying
integers, and assumes that the scaling factor is the quotient of their scaling factors. The
first division involves rounding in general. For example, division of 3456 scaled by 1/100
(34.56) by 1234 scaled by 1/1000 (1.234) yields the integer 3456/1234 = 3 (rounded)
with scale factor (1/100)/(1/1000) = 10, that is, 30. One can obtain a more accurate result
by first converting the dividend to a more precise type: in the same example, converting
3456 scaled by 1/100 (34.56) to 3456000 scaled by 1/100000, before dividing by 1234
scaled by 1/1000 (1.234), would yield 3456000/1234 = 2801 (rounded) with scaling
factor (1/100000)/(1/1000) = 1/100, that is 28.01 (instead of 30). If both operands and
the desired result are represented in the same fixed-point type, then the quotient of the
two integers must be explicitly divided by the common scaling factor.
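The multiplication example above (1.23 x 0.25 with a common scaling factor of 1/100) can be traced with the small C sketch below; it is illustrative only.

#include <stdio.h>

int main(void) {
    long a = 123;                       /* 1.23 with scaling factor 1/100              */
    long b = 25;                        /* 0.25 with scaling factor 1/100              */

    long wide = a * b;                  /* 3075, intermediate scaling factor 1/10000   */
    long truncated = wide / 100;        /* rescale to 1/100 by truncation: 30 (0.30)   */
    long rounded = (wide + 50) / 100;   /* rescale with rounding to nearest: 31 (0.31) */

    printf("truncated = %ld, rounded = %ld\n", truncated, rounded);
    return 0;
}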

Precision loss and overflow

Because fixed point operations can produce results that have more bits than the operands
there is possibility for information loss. For instance, the result of fixed point
multiplication could potentially have as many bits as the sum of the number of bits in the
two operands. In order to fit the result into the same number of bits as the operands, the
answer must be rounded or truncated. If this is the case, the choice of which bits to keep
is very important. When multiplying two fixed point numbers with the same format, for
instance with I integer bits, and Q fractional bits, the answer could have up to 2I integer
bits, and 2Q fractional bits.
For simplicity, fixed-point multiply procedures use the same result format as the
operands. This has the effect of keeping the middle bits; the I-number of least significant
integer bits, and the Q-number of most significant fractional bits. Fractional bits lost
below this value represent a precision loss which is common in fractional multiplication.
If any integer bits are lost, however, the value will be radically inaccurate.

Some operations, like divide, often have built-in result limiting so that any positive
overflow results in the largest possible number that can be represented by the current
format. Likewise, negative overflow results in the largest negative number represented by
the current format. This built in limiting is often referred to as saturation.

Some processors support a hardware overflow flag that can generate an exception on the
occurrence of an overflow, but it is usually too late to salvage the proper result at this
point.

FLOATING POINT NUMBERS & OPERATIONS

Floating point Representation:

To represent fractional binary numbers, it is necessary to consider the binary point. If the


binary point is assumed to be to the right of the sign bit, we can represent fractional binary
numbers as given below,

B = b0 * 2^0 + b(-1) * 2^(-1) + b(-2) * 2^(-2) + ... + b(-(n-1)) * 2^(-(n-1))

With this fractional number system, we can represent the fractional numbers in the
following range:

-1 <= F <= +1 - 2^(-(n-1))

The binary point is said to be float and the numbers are called floating point
numbers.

The position of binary point in floating point numbers is variable and hence
numbers must be represented in the specific manner is referred to as floating point
representation.

The floating point representation has three fields.They are,


Sign
Significant digits and
Exponent.
Eg: 111101.100011 = 1.11101100011 * 2^5

Where,

2^5 is the scaling factor and 5 is the exponent.
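The text does not fix a particular format, but the widely used IEEE 754 single-precision format is one example of a sign / exponent / significand encoding; the C sketch below (an illustration, not part of the syllabus material) extracts the three fields from a float.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    float x = -6.5f;                         /* -1.101 x 2^2 in binary         */
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);          /* reinterpret the bit pattern    */

    uint32_t sign     = bits >> 31;          /* 1 sign bit                     */
    uint32_t exponent = (bits >> 23) & 0xFF; /* 8 exponent bits, biased by 127 */
    uint32_t fraction = bits & 0x7FFFFF;     /* 23 bits of the significand     */

    printf("sign=%u exponent=%u (unbiased %d) fraction=0x%06X\n",
           sign, exponent, (int)exponent - 127, fraction);
    return 0;
}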


UNIT I
PART A

1.Difference between Computer architecture and computer organization.


2. Discuss the various functions of a computer.
3.List out the various instruction type.
4.What is addressing mode.
5.Difference between Stack and Queue.
6.What are the different types of Assembler directive?
7. A computer executes 3 instructions at a clock rate of 6 cycles/sec, performing each
instruction in 4 steps. Calculate the performance.
8.What is pipelining?
9.What is program Counter?
10. What is a bus? Explain its types.
11. Give an example each of zero-address, one-address, two-address, and three-address
instructions.
12. Which data structures can be best supported using (a) indirect addressing mode (b)
indexed addressing mode?

PART B

1.Explain the functional Units of a Computer.


2.What is Addressing Mode?Explain the different types of addressing mode.
3.What is Instruction Type?Write short notes on various instruction formats.
4.What is byte addressability?Explain the various types.
5.Write short notes on:
a. IR b..MAR
c.MDR d.PC
e.Interrupt f.Pipelining and Superscalar operation

6. Explain in detail the different types of instructions that are supported in a typical
processor.
UNIT II

BASIC PROCESSING UNIT

Fundamental concepts

Execution of a complete instruction

Multiple bus organization

Hardwired control

Micro programmed control

Nano programming.
Basic fundamental concepts
Some fundamental concepts
The primary function of a processor unit is to execute a
sequence of instructions stored in a memory, which is
external to the processor unit.
The sequence of operations involved in processing an
instruction constitutes an instruction cycle, which can be
subdivided into 3 major phases:

1. Fetch cycle
2. Decode cycle
3. Execute cycle
Basic instruction cycle

To perform the fetch, decode and execute cycles, the processor unit has to perform a set
of operations called micro-operations.
The single bus organization of the processor unit shows how the building blocks of the
processor unit are organised and how they are interconnected.
They can be organised in a variety of ways, in which the arithmetic and logic unit
and all processor registers are connected through a single common bus.
It also shows the external memory bus connected to memory address(MAR) and
data register(MDR).
Single Bus Organisation of processor

The registers Y,Z and Temp are used only by the processor unit for temporary
storage during the execution of some instructions.
These registers are never used for storing data generated by one instruction for
later use by another instruction.
The programmer cannot access these registers.
The IR and the instruction decoder are integral parts of the control circuitry in the
processing unit.
All other registers and the ALU are used for storing and manipulating data.
The data registers, ALU and the interconnecting bus is referred to as data path.
Register R0 through R(n-1) are the processor registers.
The number and use of these register vary considerably from processor to
processor.
These registers include general purpose registers and special purpose registers
such as stack pointer, index registers and pointers.
There are 2 options provided for the A input of the ALU.
The multiplexer(MUX) is used to select one of the two inputs.
It selects either output of Y register or a constant number as an A input for the
ALU according to the status of the select input.
It selects the output of Y when the select input is 1 (select Y) and it selects a constant
number when the select input is 0 (select C) as the A input of the ALU.
The constant number is used to increment the contents of the program counter.
For the execution of various instructions processor has to perform one or more of
the following basic operations:

a) Transfer a word of data from one processor register to another or to


the ALU.
b) perform the arithmetic or logic operations on the data from the
processor registers and store the result in a processor register.
c) Fetch a word of data from specified memory location and load them
into a processor register.
d) Store a word of data from a processor register into a specified memory
location.

1. Register Transfers
Each register has input and output gating and these gates are controlled by
corresponding control signals.
Fig: Input and Output Gating for the Registers
The input and output gates are nothing but the electronic switches which can be
controlled by the control signals.
When signal is 1, the switch is ON and when the signal is 0, the switch is OFF.

Implementation of input and output gates of a 4 bit register

Consider that we have transfer data from register R1 to R2


It can be done by,
a. Activate the output enable signal of R1, R1out = 1. This places the
contents of R1 on the common bus.
b. Activate the input enable signal of R2, R2in = 1. This loads data from
the common bus into register R2.
One-bit register
The edge triggered D flip-flop which stores the one-bit data is connected to the
common bus through tri-state switches.
Input D is connected through input tri-state switch and output Q is connected
through output tri-state switch.
The control signal Rin enables the input tri-state switch and the data from common
bus is loaded into the D flip-flop in synchronisation with clock input when Rin is
active.
It is implemented using AND gate .
The control signal Rout is activated to load data from Q output of the D flip-flop
on to the common bus by enabling the output tri-state switch.
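The gating idea can be mimicked in software, as in the toy C sketch below (my own illustration, not a circuit from the text): Rout places a register's contents on a shared bus variable, and Rin copies the bus into a register.

#include <stdint.h>
#include <stdio.h>

static uint32_t bus;   /* the common processor bus */

/* drive the bus from a register when its "out" gate signal is 1 */
static void gate_out(uint32_t reg, int out) { if (out) bus = reg; }

/* load a register from the bus when its "in" gate signal is 1 */
static void gate_in(uint32_t *reg, int in)  { if (in)  *reg = bus; }

int main(void) {
    uint32_t R1 = 0xABCD, R2 = 0;

    gate_out(R1, 1);   /* R1out = 1: contents of R1 placed on the common bus */
    gate_in(&R2, 1);   /* R2in  = 1: bus contents loaded into R2             */

    printf("R2 = 0x%X\n", R2);   /* prints 0xABCD */
    return 0;
}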

2. Performing an arithmetic or logic operation

ALU performs arithmetic and logic operations.


It is a combinational circuit that has no internal memory.
It has 2 inputs A and B and one output.
Its A input gets the operand from the output of the multiplexer and its B input
gets the operand directly from the bus.
The result produced by the ALU is stored temporarily in register Z.
Let us find the sequence of operations required to subtract the contents of register R2
from register R1 and store the result in register R3.

This sequence can be followed as:


a) R1out, Yin
b) R2out, SelectY, Sub, Zin
c) Zout, R3in

Step 1: contents from register R1 are loaded into register Y.


Step 2: contents from Y and from register R2 are applied to the A and B inputs of ALU,
subtraction is performed and result is stored in the Z register.
Step 3: The contents of Z register is stored in the R3 register.

3. Fetching a word from memory


To fetch a word of data from memory the processor gives the address of
the memory location where the data is stored on the address bus and
activates the Read operation.
The processor loads the required address in MAR, whose output is
connected to the address lines of the memory bus.
At the same time processor sends the Read signal of memory control bus
to indicate the Read operation.
When the requested data is received from the memory its stored into the
MDR, from where it can be transferred to other processor registers.
4. Storing a word in memory
To write a word in memory location processor has to load the address of
the desired memory location in the MAR, load the data to be written in
memory, in MDR and activate write operation.
Assume that we have to execute instruction Move(R2), R1.
This instruction copies the contents of register R1 into the memory whose
location is specified by the contents of register R2.
The actions needed to execute this instruction are as follows:
a) MAR [R2]
b) MDR [R1]
c) Activate the control signal to perform the write operation.

The various control signals which are necessary to activate to


perform given actions in each step.
a) R2out, MARin
b) R1out, MDRinP
c) MARout, MDRoutM,Write

Timing diagram for MOVE(R2), R1 instruction (Memory write operation)


The MDR register has 4 control signals:
MDRinP & MDRoutP control the connection to the internal processor data
bus and signals MDRinM & MDRoutM control the connection to the memory
Data bus.
MAR register has 2 control signals.
Signal MARin controls the connection to the internal processor address bus
and signal MARout controls the connection to the memory address bus.
Control signals read and write from the processor controls the operation
Read and Write respectively.
The address of the memory word to be read is in register R2, and the word from that location is
to be loaded into register R3.
This is indicated by the instruction MOVE R3, (R2).

The actions needed to execute this instruction are as follows:

a) MAR [R2]
b) Activate the control signal to perform the Read operation
c) Load MDR from the memory bus
d) R3 [MDR]

Various control signals which are necessary to activate to perform given actions in each
step:
a) R2out, MARin
b) MARout, MDRinM, Read
c) MDRoutP,R3in
EXECUTION OF A COMPLETE INSTRUCTION
Let us find the complete control sequence for execution of the instruction Add
R1,(R2) for the single bus processor.
o This instruction adds the contents of register R1 and the contents of memory
location specified by register R2 and stores results in the register R1.
o To execute this instruction it is necessary to perform the following actions:
1. Fetch the instruction
2. Fetch the operand from memory location pointed by R2.
3. Perform the addition
4. Store the results in R1.
The sequence of control steps required to perform these operations for the single
bus architecture are as follows;
1. PCout, MARin, Yin, SelectC, Add, Zin
2. Zout, PCin, MARout, MDRinM, Read
3. MDRoutP, IRin
4. R2out, MARin
5. R1out, Yin, MARout, MDRinM, Read
6. MDRoutP, SelectY, Add, Zin
7. Zout, R1in
(i) Step1, the instruction fetch operation is initiated by loading the controls of the PC into
the MAR.
PC contents are also loaded into register Y and added constant number by
activating select C input of multiplexer and add input of the ALU.
By activating Zin signal result is stored in the register Z
(ii) Step 2: the contents of register Z are transferred to the PC register by activating the Zout and
PCin signals.
This completes the PC increment operation and the PC will now point to the next
instruction.
In the same step (step 2), the MARout, MDRinM and Read signals are activated.
Due to the MARout signal, memory gets the address, and after receiving the Read signal
and activation of the MDRinM signal, it loads the contents of the specified location into
the MDR register.
(iii) Step 3 contents of MDR register are transferred to the instruction register(IR) of the
processor.
The step 1 through 3 constitute the instruction fetch phase.
At the beginning of step 4, the instruction decoder interprets the contents of the
IR.
This enables the control circuitry to activate the control signals for steps 4 through
7, which constitute the execution phase.
(iv) Step 4, the contents of register R2 are transferred to register MAR by activating R2out
and MAR in signals.
(v) Step 5, the contents of register R1 are transferred to register Y by activating R1out and
Yin signals. In the same step, MARout, MDRinM and Read signals are activated.
Due to MARout signal, memory gets the address and after receiving read signal
and activation of MDRinM signal it loads the contents of specified location into
MDR register.
(vi) Step 6 MDRoutP, select Y, Add and Zin signals are activated to perform addition of
contents of register Y and the contents of MDR. The result is stored in the register Z.
(vii) Step 7, the contents of register Z are transferred to register R1 by activating Zout and
R1in signals.

Branch Instruction

The branch instruction loads the branch target address in PC so that PC will fetch
the next instruction from the branch target address.
The branch target address is usually obtained by adding the offset in the contents
of PC. The offset is specified within the instruction.
The control sequence for unconditional branch instruction is as follows:
1. PCout, MARin, Yin, SelectC, Add, Zin
2. Zout, PCin, MARout, MDRinM, Read
3. MDRoutP,IRin
4. PCout,Yin
5. Offset_field_Of_IRout,SelectY,Add,Zin
6. Zout,PCin

First 3 steps are same as in the previous example.


Step 4: The contents of PC are transferred to register Y by activating PCout and Yin
signals.
Step 5: The contents of PC and the offset field of IR register are added and result is saved
in register Z by activating corresponding signals.
Step 6: The contents of register Z are transferred to PC by activating Zout and PC in
signals.
Multiple Bus Organisation:
With a single bus, only one data word can be transferred over the bus in a clock cycle.
This increases the steps required to complete the execution of the instruction.
To reduce the number of steps needed to execute instructions, most commercial
process provide multiple internal paths that enable several transfer to take place in
parallel.
3 buses are used to connect registers and the ALU of the processor.
All general purpose registers are shown by a single block called the register file.
There are 3 ports: one input port and two output ports.
So it is possible to access the data of three registers in one clock cycle; a value can
be loaded into one register from bus C and data from two registers can be read onto
bus A and bus B.
Buses A and B are used to transfer the source operands to the A and B inputs of
the ALU.
After performing arithmetic or logic operation result is transferred to the
destination operand over bus C.
To increment the contents of PC after execution of each instruction to fetch the
next instruction, separate unit is provided. This unit is known as incrementer.
Incrementer increments the contents of PC accordingly to the length of the
instruction so that it can point to next instruction in the sequence.
The incrementer eliminates the need of multiplexer connected at the A input of
ALU.
Let us consider the execution of the instruction Add R1,R2,R3.
This instruction adds the contents of registers R2 and the contents of register R3
and stores the result in R1.
With 3 bus organization control steps required for execution of instruction Add
R1,R2,R3 are as follows:
1. PCout, MARin
2. MARout, MDRinM, Read
3. MDRoutP,IRin
4. R2out,R3out,Add,R1in
Step 1: The contents of PC are transferred to MAR through bus B.
Step 2: The instruction code from the addressed memory location is read into
MDR.
Step 3: The instruction code is transferred from MDR to IR register. At the
beginning of step 4, the instruction decoder interprets the contents of the IR.
This enables the control circuitry to activate the control signals for step 4,
which constitute the execution phase.
Step 4: two operands from register R2 and register R3 are made available at A and
B inputs of ALU through bus A and bus B.
These two inputs are added by activation of Add signal and result is stored
in R1 through bus C.
Hardwired Control
The control units use fixed logic circuits to interpret instructions and
generate control signals from them.
The fixed logic circuit block includes combinational circuit that generates the
required control outputs for decoding and encoding functions.
Instruction decoder
It decodes the instruction loaded in the IR.
If the IR is an 8-bit register, then the instruction decoder generates 2^8 (256) output lines, one for
each instruction.
According to code in the IR, only one line amongst all output lines of decoder goes
high (set to 1 and all other lines are set to 0).
Step decoder
It provides a separate signal line for each step, or time slot, in a control sequence.
Encoder
It gets in the input from instruction decoder, step decoder, external inputs and
condition codes.
It uses all these inputs to generate the individual control signals.
After execution of each instruction end signal is generated this resets control step
counter and make it ready for generation of control step for next instruction.
The encoder circuit implements the following logic function to generate Yin
Yin = T1 + T5.ADD + T4.BRANCH + ...
The Yin signal is asserted during time interval T1 for all instructions, during T5 for
an ADD instruction, during T4 for an unconditional branch instruction, and so on.
As another example, the logic function to generate the Zout signal can be given by
Zout = T2 + T7.ADD + T6.BRANCH + ...
The Zout signal is asserted during time interval T2 of all instructions, during T7 for an
ADD instruction, during T6 for an unconditional branch instruction, and so on.
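Read as Boolean functions of the step-decoder outputs and the decoded instruction lines, the two expressions above can be sketched in C as follows (purely illustrative; T[i] stands for time slot Ti).

#include <stdbool.h>
#include <stdio.h>

/* Yin = T1 + T5.ADD + T4.BRANCH + ... */
bool yin(const bool T[8], bool add, bool branch) {
    return T[1] || (T[5] && add) || (T[4] && branch);
}

/* Zout = T2 + T7.ADD + T6.BRANCH + ... */
bool zout(const bool T[8], bool add, bool branch) {
    return T[2] || (T[7] && add) || (T[6] && branch);
}

int main(void) {
    bool T[8] = { false };
    T[5] = true;                       /* current time slot is T5                */
    printf("Yin=%d Zout=%d\n",
           yin(T, true, false),        /* ADD instruction: Yin asserted in T5    */
           zout(T, true, false));      /* Zout not asserted in T5, so prints 1 0 */
    return 0;
}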

A Complete processor

It consists of
Instruction unit
Integer unit
Floating-point unit
Instruction cache
Data cache
Bus interface unit
Main memory module
Input/Output module.
Instruction unit - fetches instructions from an instruction cache or from the main
memory when the desired instructions are not available in the cache.
Integer unit - processes integer data.
Floating-point unit - processes floating point data.
Data cache - the integer and floating-point units get their data from the data cache.
The 80486 processor has 8-kbytes single cache for both instruction and data whereas
the Pentium processor has two separate 8 kbytes caches for instruction and data.
The processor provides bus interface unit to control the interface of processor to
system bus, main memory module and input/output module.

Microprogrammed Control
Every instruction in a processor is implemented by a sequence of one or more
sets of concurrent micro operations.
Each micro operation is associated with a specific set of control lines which,
when activated, causes that micro operation to take place.
Since the number of instructions and control lines is often in the hundreds, the
complexity of hardwired control unit is very high.
Thus, it is costly and difficult to design. The hardwired control unit is relatively
inflexible because it is difficult to change the design, if one wishes to correct design error
or modify the instruction set.
Microprogramming is a method of control unit design in which the control signals are stored,
in the form of microinstructions, in a memory called the control memory (CM).
The control signals to be activated at any time are specified by a microinstruction,
which is fetched from CM.
A sequence of one or more micro-operations designed to control a specific
operation, such as addition or multiplication, is called a microprogram.
The microprograms for all instructions are stored in the control memory.

The address where these microinstructions are stored in the CM is generated by the
microprogram sequencer (also called the microprogram controller).
The microprogram sequencer generates the address for microinstruction according
to the instruction stored in the IR.
The microprogrammed control unit consists of:
- control memory
- control address register
- microinstruction register
- microprogram sequencer

The components of control unit work together as follows:


The control address register holds the address of the next
microinstruction to be read.
When the address is available in the control address register, the
sequencer issues a READ command to the control memory.
After issue of READ command, the word from the addressed
location is read into the microinstruction register.
Now the content of the micro instruction register generates
control signals and next address information for the sequencer.
The sequencer loads a new address into the control address
register based on the next address information.

Advantages of Microprogrammed control


It simplifies the design of the control unit. Thus it is both cheaper and less error-prone
to implement.
Control functions are implemented in software rather than hardware.
The design process is orderly and systematic
More flexible, can be changed to accommodate new system specifications or to
correct the design errors quickly and cheaply.
Complex function such as floating point arithmetic can be realized efficiently.
Disadvantages
A microprogrammed control unit is somewhat slower than the hardwired control
unit, because time is required to access the microinstructions from CM.
The flexibility is achieved at some extra hardware cost due to the control memory
and its access circuitry.

Microinstruction

A simple way to structure microinstructions is to assign one bit position to each
control signal required in the CPU.

Grouping of control signals


Grouping technique is used to reduce the number of bits in the microinstruction.
Gating signals: IN and OUT signals
Control signals: Read,Write, clear A, Set carry in, continue operation, end, etc.
ALU signals: Add, Sub,etc;

There are 46 signals and hence each microinstruction will have 46 bits.
It is not at all necessary to use all 46 bits for every microinstruction because by using
grouping of control signals we minimize number of bits for microinstruction.
Ways to reduce the number of bits in a microinstruction:
Most signals are not needed simultaneously.
Many signals are mutually exclusive
e.g. only one function of ALU can be activated at a time.
A source for data transfers must be unique which means that it should not be
possible to get the contents of two different registers on to the bus at the same
time.
Read and Write signals to the memory cannot be activated simultaneously.
The 46 control signals can be grouped into 7 different groups.
The total number of bits after grouping is 17; therefore, the 46-bit
microinstruction is reduced to a 17-bit microinstruction.
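A minimal sketch of the idea, assuming hypothetical group sizes: each group of mutually exclusive signals can be encoded in a small binary field instead of one wire per signal.

# Sketch: grouping mutually exclusive control signals.  A group with n
# distinct codes (one of which may simply mean "no action") needs only
# ceil(log2(n)) bits instead of n one-hot wires.  The group sizes below
# are hypothetical, chosen only to show how 46 signals can shrink to 17 bits.
from math import ceil, log2

def field_width(codes):
    return ceil(log2(codes))

groups = [16, 8, 8, 4, 4, 4, 2]           # hypothetical: 7 groups, 46 codes total
widths = [field_width(n) for n in groups]
print(sum(groups), "signals ->", sum(widths), "bits")   # 46 signals -> 17 bits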
Techniques of grouping of control signals

The grouping of control signals can be done either by a technique called
vertical organisation or by a technique called horizontal organisation.
Vertical organisation
Highly encoded schemes, in which compact codes specify only a small number
of control functions in each microinstruction, are referred to as vertical organisation.

Horizontal organisation
The minimally encoded scheme, in which many resources can be controlled with a
single microinstruction, is called horizontal organisation.

Comparison between horizontal and vertical organisation


S.No  Horizontal                                         Vertical
1     Long formats                                       Short formats
2     Ability to express a high degree of parallelism    Limited ability to express parallel microoperations
3     Little encoding of the control information         Considerable encoding of the control information
4     Useful when higher operating speed is desired      Slower operating speeds

Advantages of vertical and horizontal organization

1. The vertical approach significantly reduces the amount of parallel hardware required to
handle the execution of microinstructions.
2. Fewer bits are required in the microinstruction.
3. The horizontal organisation approach is suitable when operating speed of computer is a
critical factor and where the machine structure allows parallel usage of a number of
resources.

Disadvantages
The vertical approach results in slower operating speed.

Microprogram sequencing
The task of microprogram sequencing is done by microprogram sequencer.
2 important factors must be considered while designing the microprogram
sequencer:
a) The size of the microinstruction
b) The address generation time.
The size of the microinstruction should be minimum so that the size of control memory
required to store microinstructions is also less.
This reduces the cost of control memory.
With less address generation time, a microinstruction can be executed in less time, resulting
in better throughput.
During execution of a microprogram the address of the next microinstruction to be
executed has 3 sources:
i. Determined by instruction register
ii. Next sequential address
iii. Branch
Microinstructions can be shared using microinstruction branching.
Consider instruction ADD src, Rdst.
The instruction adds the source operand to the contents of register Rdst and places
the sum in Rdst, the destination register.
Let us assume that the source operand can be specified in the following addressing
modes:
a) Indexed
b) Autoincrement
c) Autodecrement
d) Register indirect
e) Register direct

Each box in the flowchart corresponds to a microinstruction that controls the transfers
and operations indicated within the box.
The microinstruction is located at the address indicated by the number above the upper
right-hand corner of the box.
During the execution of the microinstruction, the branching takes place at point A.
The branching address is determined by the addressing mode used in the instruction.
Techniques for modification or generation of branch addresses
i. Bit-ORing
The branch address is determined by ORing a particular bit or bits
with the current address of the microinstruction.
Eg: If the current address is 170 and the branch address is 172, then the
branch address can be generated by ORing 02 (bit 1) with the
current address (see the sketch after this list).
ii. Using condition variables
It is used to modify the contents CM address register directly, thus
eliminating whole or in part the need for branch addresses in
microinstructions.
Eg: Let the condition variable CY indicate the occurrence of a carry when CY = 1,
and no carry when CY = 0.
Suppose that we want to execute a SKIP_ON_CARRY
microinstruction.
It can be done by logically connecting CY to the count-enable
input of the microprogram counter at an appropriate point in the microinstruction cycle.
This allows the carry condition to increment the microprogram counter an extra time, thus
performing the desired skip operation.
iii. Wide-Branch Addressing
Generating branch addresses becomes more difficult as the number of branches
increases. In such situations a programmable logic array (PLA) can be used to generate
the required branch addresses. This simple and inexpensive way of generating branch
addresses is known as wide-branch addressing. The opcode of a machine instruction is
translated into the starting address of the corresponding microroutine. This is achieved by
connecting the opcode bits of the instruction register as inputs to the PLA, which acts as
a decoder. The output of the PLA is the address of the desired microroutine.
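Assuming the addresses 170 and 172 in the Bit-ORing example above are octal (a common convention for microinstruction addresses), the computation can be checked numerically:

# Bit-ORing: the branch address is obtained by ORing a bit into the
# current microinstruction address.  Addresses are assumed to be octal.
current = 0o170
branch  = current | 0o002      # set bit 1
print(oct(branch))             # 0o172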
Comparison between Hardwired and Microprogrammed Control

Attribute                                   Hardwired Control                       Microprogrammed Control
Speed                                       Fast                                    Slow
Control functions                           Implemented in hardware                 Implemented in software
Flexibility                                 Not flexible; redesign is required to   More flexible; can accommodate new
                                            accommodate new system specifications   system specifications or new
                                            or new instructions                     instructions
Ability to handle large/complex             Difficult                               Easier
instruction sets
Ability to support operating systems        Very difficult                          Easy
and diagnostic features
Design process                              Complicated                             Orderly and systematic
Applications                                Mostly RISC microprocessors             Mainframes, some microprocessors
Instruction set size                        Usually under 100 instructions          Usually over 100 instructions
ROM size                                    -                                       2K to 10K of 20-400 bit microinstructions
Chip area efficiency                        Uses least area                         Uses more area
UNIT III

PIPELINING

Basic concepts

Data hazards

Instruction hazards

Influence on instruction sets

Data path and control considerations

Performance considerations

Exception handling
Basic Concepts
It is a particularly effective way of organizing concurrent activity in a computer
system.
Sequential execution

Hardware organization
An inter-stage storage buffer, B1, is needed to hold the information being passed from
one stage to the next.
New information is loaded into this buffer at the end of each clock cycle.
F Fetch: read the instruction from the memory
D Decode: decode the instruction and fetch the source operand(s)
E Execute: perform the operation specified by the instruction
W Write: store the result in the destination location

Pipelined execution
In the first clock cycle, the fetch unit fetches an instruction I1 (step F1) and stores
it in buffer B1 at the end of the clock cycle.
In the second clock cycle the instruction fetch unit proceeds with the fetch
operation for instruction I2 (step F2).
Meanwhile, the execution unit performs the operation specified by instruction
I1,which is available to it in buffer B1 (step E1).
By the end of the second clock cycle, the execution of instruction I1 is completed
and instruction I2 is available.
Instruction I2 is stored in B1, replacing I1, which is no longer needed.
Step E2 is performed by the execution unit during the third clock cycle, while
instruction I3 is being fetched by the fetch unit.
Four instructions are in progress at any given time.
So it needs four distinct hardware units.

These units must be capable of performing their tasks simultaneously and without
interfering with one another.
Information is passed from one unit to the next through a storage buffer.
During clock cycle 4, the information in the buffers is as follows:
Buffer B1 holds instruction I3, which was fetched in cycle 3 and is being decoded
by the instruction-decoding unit.
Buffer B2 holds both the source operands for instruction I2 and the specifications
of the operation to be performed. This is the information produced by the
decoding hardware in cycle 3.
- The buffer also holds the information needed for the write step of
instruction I2(step W2).
- Even though it is not needed by stage E, this information must be
passed on to stage W in the following clock cycle to enable that stage
to perform the required write operation.
Buffer B3 holds the results produced by the execution unit and the destination
information for instruction I1.
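For illustration only, the overlap of the four stages in the ideal (stall-free) case can be sketched with a tiny schedule generator; the stage names follow the F/D/E/W convention used above.

# Sketch: ideal 4-stage pipeline schedule (no stalls).  Instruction i
# (0-based) occupies stage s (0=F, 1=D, 2=E, 3=W) in clock cycle i + s + 1.
STAGES = ["F", "D", "E", "W"]

def schedule(num_instructions):
    rows = []
    for i in range(num_instructions):
        row = {i + s + 1: f"{STAGES[s]}{i + 1}" for s in range(4)}
        rows.append(row)
    return rows

for i, row in enumerate(schedule(3), start=1):
    cells = [row.get(c, "  ") for c in range(1, 7)]
    print(f"I{i}: " + " ".join(cells))
# I1: F1 D1 E1 W1
# I2:    F2 D2 E2 W2
# I3:       F3 D3 E3 W3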

Pipeline performance
The speedup obtained from pipelining is proportional to the number of pipeline stages.
For variety of reasons, one of the pipeline stages may not be able to complete
its processing task for a given instruction in the time allotted.
For eg, stage E in the four-stage pipeline is responsible for arithmetic and
logic operations and one clock cycle is assigned for this task.
Although this may be sufficient for most operations some operations such as
divide may require more time to complete.
Instruction I2 requires 3 cycles to complete from cycle 4 through cycle 6.
Thus in cycles 5 and 6 the write stage must be told to do nothing, because it
has no data to work with.
Meanwhile, the information in buffer B2 must remain intact until the execute
stage has completed its operation.
This means that stage 2 and in turn stage1 are blocked from accepting new
instructions because the information in B1 cannot be overwritten.
Thus steps D4 and F5 must be postponed.

The pipeline may also be stalled because of a delay in the availability of an
instruction.
For example, this may be a result of a miss in the cache, requiring the
instruction to be fetched from the main memory.
Such hazards are often called control hazards or instruction hazards.
Instruction execution steps in successive clock cycles (clock cycles 1 to 8):

I1:  F1  D1  E1  W1
I2:  F2  D2  E2  W2
I3:  F3  D3  E3  W3

Function performed by each processor stage in successive clock cycles:

Clock cycle:  1    2    3     4     5     6     7     8     9
F: Fetch      F1   F2   F2    F2    F2
D: Decode          D1   idle  idle  idle  D2    D3
E: Execute               E1   idle  idle  idle  E2    E3
W: Write                      W1    idle  idle  idle  W2    W3

Instruction I1 is fetched from the cache in cycle 1, and its execution proceeds
normally.
Thus in cycles 5 and 6, the write stage must be told to do nothing, because it has
no data to work with.
Meanwhile, the information in buffer B2 must remain intact until the execute
stage has completed its operation.
This means that stage 2, and in turn stage 1, are blocked from accepting new
instructions because the information in B1 cannot be overwritten.
Thus steps D4 and F4 must be postponed.
Pipelined operation is said to have been stalled for two clock cycles.
Normal pipelined operation resumes in cycle 7.

Hazard:
Any condition that causes the pipeline to stall is called a hazard.
Data Hazard

A data hazard is any condition in which either the source or the destination
operands of an instruction are not available at the time expected in the pipeline. As a
result, some operation has to be delayed and the pipeline stalls.

b) Pipeline performance
For a variety of reasons, one of the pipeline stages may not be able to complete its
processing task for a given instruction in the time allotted.
For example, stage E in the 4-stage pipeline is responsible for arithmetic and logic
operations, and one clock cycle is assigned for this task.
Although this may be sufficient for most operations, some operations, such as
divide, may require more time to complete.
Effect of an execution operation taking more than one clock cycle:
The operation specified in instruction I2 requires three cycles to complete,
from cycle 4 through cycle 6.

Pipe Lining:

1. Basic Concepts:
Pipelining is a particularly effective way of organizing concurrent activity in a
computer system.

Instruction pipeline:
The fetch, decode and execute cycles for several instructions are performed
simultaneously to reduce overall processing time. This process is referred to as
instruction pipelining.
Consider 4 stage process
F Fetch : read the instruction from the memory
D Decode: decode the instruction and fetch the source operand(s)
E Execute: perform the operation specified by the instruction.
W Write: Store the results in the destination location.

a) Instruction execution divided into 4 steps:

Influence on Instruction sets
1. Addressing modes

Many processors provide various combinations of addressing modes


such as index,indirect,auto increment, auto decrement and so on.
We can classify these addressing modes as simple addressing modes and complex
addressing modes.
A complex address mode may require several accesses to the memory to reach the
named operand while simple addressing mode requires only one access to the memory to
reach the named operand.
Consider the instruction Load R1,(X(R0)).
This instruction uses a complex addressing mode because it requires
two memory accesses: one to read location X+[R0] and another to read the operand at location [X+[R0]].
If R1 is a source operand in the next instruction that instruction has to be stalled
for two cycles.

If we want to perform the same operation using simple addressing mode
instructions, we need to execute three instructions:
ADD R1,R0,#X
LOAD R1,(R1)
LOAD R1,(R1)
The ADD instruction performs the operation R1 <- [R0]+X.
The two load instructions fetch the address and then the operand from the
memory.
This sequence of instructions takes exactly the same number of clock cycles as
the single load instruction having a complex addressing mode.
The above example indicates that in a pipelined processor, complex addressing
modes do not necessarily lead to faster execution.
They do, however, reduce the number of instructions needed to perform a particular task and
hence reduce the program space needed in the main memory.
Disadvantages
a) Long execution times of complex addressing mode instructions may cause
pipeline to stall.
b) They require more complex hardware to decode and execute them.
c) They are not convenient for compilers to work with.
Due to above reasons the complex addressing mode instructions are not
suitable for pipelined execution.
These addressing modes are avoided in modern processors.
Features of modern processor addressing modes
a) They access operand from the memory in only one cycle.
b) Only load and store instructions are provided to access memory.
c) The addressing modes used do not have side effects.(when a location other than
one explicitly named in an instruction as a destination operand is affected, the
instruction is said to have a side effect).
2. Condition codes
Most of the processors store condition code flags in their status register.
The conditional branch instruction checks the condition code flag and the flow of
program execution is decided.
When the data dependency or branch exists, the pipeline may stall.
A special optimizing compiler is designed which reorders the instructions in the
program which is to be executed.
This avoids the stalling the pipeline, the compiler takes the care to obtain the correct
output while reordering the instructions.
To handle condition codes,
a) The condition code flags should be affected by as few instructions as possible.
It provides flexibility in reordering instructions.
Otherwise, the dependency produced by the condition code flags reduces the
flexibility available to the compiler to reorder instructions.
b) The compiler should know in which instructions of a program the
condition codes are affected and in which they remain unaffected.

Datapath and control considerations

When a single bus is used in a processor, only one data word can be transferred
over the bus in a clock cycle.
This increases the number of steps required to complete the execution of an instruction.
To reduce the number of steps needed to execute instructions, most commercial
processors provide multiple internal paths that enable several transfers to take place in
parallel.

Modified 3 bus structure of the processor for pipelined execution


3 buses are used to connect registers and the ALU of the processor.
All general purpose registers are connected by a single block called register file.

It has 3 ports:
- One input port
- Two output ports
It is thus possible to access data in three registers in one clock cycle: a value can be loaded into
one register from bus C, and data from two registers can be placed on bus A and bus B
respectively.

Buses A and B are used to transfer the source operands to the A and B inputs of
the ALU.
After performing arithmetic or logic operation result is transferred to the
destination operand over bus C.
To increment the contents of the PC after execution of each instruction, so as to fetch the
next instruction, a separate unit known as the incrementer is provided.

The incrementer increments the contents of the PC according to the length of the
instruction, so that it points to the next instruction in the sequence.
The processor uses separate instruction and data caches.
They use separate address and data connections to the processor.
Two versions of MAR register are needed IMAR for accessing the instruction
cache and DMAR for accessing the data cache.
The PC is connected directly to the IMAR.
This allows transferring the contents of PC to IMAR and performing ALU
operation simultaneously.
The data address in DMAR can be obtained directly from the register file or from
the ALU.
It supports the register indirect and indexed addressing modes.
The read and write operations use separate MDR registers.
When load and store operations are to be performed, the data can be transferred
directly between these registers and the register file; it is not required to pass this data
through the ALU.
Two buffer registers at the input and one at the output of the ALU are used.
The instruction queue gets loaded from instruction cache.
The output of the queue is connected to the instruction decoder input and output
of the decoder is connected to the control signal pipeline.
The processor can perform the following operations independently,
1. Reading an instruction from the instruction cache.
2. Incrementing the PC.
3. Decoding an instruction by instruction decoder.
4. Reading from and writing into the data cache.
5. Reading the contents of up to two registers from the register file.
6. Writing into one register in the register file.
7. Performing an ALU operation.

Superscalar Operation

Several instructions can be executed concurrently because of pipelining.


However, these instructions in the pipeline are in different stages of execution
such as fetch, decode, ALU operation.
When one instruction is in the fetch phase, another instruction is completing its
execution, in the absence of hazards.
Thus, at most one instruction completes execution in each clock cycle.
This means the maximum throughput of the processor is one instruction per
clock cycle when pipelining is used.
Processors reach performance levels greater than one instruction per cycle by
fetching, decoding and executing several instructions concurrently.
This mode of operation is called superscalar.
A processor capable of executing more than one instruction per clock cycle in parallel is known as a
superscalar processor.
A superscalar processor has multiple execution units (E-units), each of which
is usually pipelined, so that they constitute a set of independent instruction
pipelines.
Its program control unit (PCU) is capable of fetching and decoding several
instructions concurrently.
It can issue multiple instructions simultaneously.
If the processor can issue up to k instructions simultaneously to the various E-
units, then k is called instruction issue degree.
It can be six or more using current technology.

A processor has two execution units,


i) Integer unit (used for integer operations)
ii) Floating point unit (used for floating point operations)

The instruction fetch unit can read two instructions simultaneously and stores
them temporarily in the instruction queue.
The queue operates on the principle first in first out (FIFO).
In each clock cycle, the dispatch unit can take two instructions from the
instruction queue and decodes them.
If these instructions are such that one is integer and another is floating-point with
no hazards, then both instructions are dispatched in the same clock cycle.
The compiler is designed to avoid hazards by proper selection and ordering of the
instructions.
It is assumed that the floating point unit takes three clock cycles and integer unit
takes one clock cycle to complete execution.
1. The instruction fetch unit fetches instructions I1 and I2 in clock cycle 1.
2. I1 and I2 enter the dispatch unit in cycle 2. The fetch unit fetches next two
instructions, I3 and I4 during the same cycle.
3. The floating point unit takes three clock cycles to complete the floating point
operation(Execution: EA,EB,EC) specified in I1. The integer unit completes
execution of I2 in one clock cycle(clock cycle 3). Also, instructions I3 and I4
enter the dispatch unit in cycle 3.
4. We have assumed the floating point unit as a three-stage pipeline. So it can
accept a new instruction for execution in each clock cycle. So during clock cycle
4, though the execution of I1 is in progress, it accepts I3 for the execution. The
integer unit can accept a new instruction for execution because instruction I2 has
entered to the write stage. Thus, instructions I3 and I4 are dispatched in cycle 4.
Instruction I1 is still in the execution phase, but in the second stage of the internal
pipeline in floating-point unit.
5. The instructions complete execution with no hazards as shown in further clock
cycles.
Out-of-order Execution

The dispatch unit dispatches the instructions in the order in which they appear in
the program. But their execution may be completed in the different order.
For example, execution of I2 is completed before the complete execution of I1.
Thus the execution of the instructions is completed out of order.
If there is a data dependency among the instructions, the execution of the
instructions is delayed.
For example, if the execution of I2 needs the results of execution of I1, I2 will be
delayed.
If such dependencies are handled correctly, the execution of the instructions will
not be delayed.
There are two causes of exceptions:
- a bus error (during an operand fetch)
- an illegal operation (e.g. divide by zero)
2 types of exceptions,
- Imprecise exceptions
- Precise exceptions

Imprecise exception

Consider the pipeline timing, the result of operation of I2 is written into the
register file in cycle 4.
If instruction I1 causes an exception and succeeding instructions are permitted to
complete execution, then the processor is said to have imprecise exceptions. Because of
the exception by I1, program execution is in an inconsistent state.

Precise exception

With imprecise exceptions, a consistent state is not guaranteed when an exception
occurs.
To guarantee a consistent state when an exception occurs, the results of the execution
of instructions must be written into the destination locations strictly in program order.
The result of execution of I2 is written to the destination in cycle 4 (W2).
But the result of I1 is written to the destination in cycle 6 (W1).
So step W2 has to be delayed until cycle 6.
The integer execution unit has to retain the result of execution of I2.
So it cannot accept instruction I4 until cycle 6.
Thus with precise exceptions, if an exception occurs during an instruction, the succeeding
instructions, which may have been partially executed, are discarded.
If an external interrupt is received, the dispatch unit stops reading new instructions from
the instruction queue.
The instructions which are placed in the instruction queue are discarded.
The processor first completes the pending execution of the instructions to completion.
The consistent state of the processor and all its registers is achieved.
Then the processor starts the interrupt handling process.
Execution completion
The results of the instructions are first stored in temporary registers and
later transferred to the permanent registers in correct program order.
Thus two write operations, TW and W respectively, are carried out.
The step W is called the commitment step, because the final result gets stored into the
permanent register during this step.
Even if an exception occurs, the results of succeeding instructions are only in
temporary registers and hence can be safely discarded.
Register renaming
Giving the name of a permanent register to the temporary register that holds its
pending contents is called register renaming.
For example, if I2 uses R4 as a destination register, then the temporary register
used in step TW2 is also referred to as R4 during cycles 6 and 7; that temporary
register is used only for instructions that follow I2 in program order.
If an earlier instruction such as I1 needs to read R4 in cycle 6 or 7, it accesses the
permanent register R4, which still contains the data not yet modified by I2.
Commitment Unit
It is a special control unit needed to guarantee in-order commitment when out-of-
order execution is allowed.
It uses a special queue called the reorder buffer, which holds the instructions
that are to be committed next.
Retired instructions
The instructions are entered in the reorder buffer (queue) of the commitment unit
strictly in program order.
When the instruction from this queue is executed completely, the results obtained
from it are copied from temporary registers to the permanent registers and
instruction is removed from the queue.
The resources which were assigned to the instruction are released.
At this stage, the instruction is said to have been retired.
The instructions are retired in program order though they may complete execution
out of order.
That is, all instructions that were dispatched before it must also have been retired.
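A minimal sketch of this in-order retirement, using hypothetical instruction names: instructions may complete out of order, but they leave the reorder buffer only from the head, and only after everything ahead of them has been retired.

# Sketch: in-order retirement from a reorder buffer.  Instructions enter
# in program order; an instruction is retired only when it has completed
# AND everything ahead of it has already been retired.
from collections import deque

rob = deque(["I1", "I2", "I3"])          # program order
completed = set()

def complete(instr):
    completed.add(instr)
    retired = []
    while rob and rob[0] in completed:   # retire only from the head
        retired.append(rob.popleft())
    return retired

print(complete("I2"))   # []          -- I2 done, but I1 not yet retired
print(complete("I1"))   # ['I1', 'I2']
print(complete("I3"))   # ['I3']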

Dispatch operation

Each instruction requires the resources for its execution.


The dispatch unit first checks the availability of the required resources and then only
dispatches the instructions for execution.
These resources include a temporary register, a location in the reorder buffer, an appropriate
execution unit, and so on.

Deadlock
Consider two units U1 and U2 that use shared resources.
U2 needs the completion of the task assigned to unit U1 in order to complete its own task.
If unit U2 is holding a resource that is also required by unit U1, neither unit can
complete the task assigned to it.
Unit U1 remains waiting for the needed resource, while unit U2 waits for the completion
of the task by unit U1 before it can release that resource. Such a situation is called a deadlock.
UNIT IV

MEMORY SYSTEM

Basic concepts

Semiconductor RAM

ROM

Speed ,Size and cost

Cache memories

Improving cache performance

Virtual memory

Memory management requirements

Associative memories

Secondary storage devices

Performance considerations


BASIC CONCEPTS
The maximum size of the memory that can be used in any computer is determined
by the addressing scheme.

Address width      Memory locations

16 bits            2^16 = 64K
32 bits            2^32 = 4G (Giga)
40 bits            2^40 = 1T (Tera)
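A quick numeric check of the 2^k relationship between address width and addressable locations:

# Number of addressable locations for a k-bit address is 2**k.
for k in (16, 32, 40):
    print(f"{k}-bit address -> 2^{k} = {2**k:,} locations")
# 16-bit -> 65,536 (64K); 32-bit -> 4,294,967,296 (4G); 40-bit -> 1,099,511,627,776 (1T)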

Fig: Connection of Memory to Processor:

If the MAR is k bits long and the MDR is n bits long, then the memory may contain up to
2^k addressable locations, and n bits of data are transferred between the
memory and the processor.
This transfer takes place over the processor bus.
The processor bus has,

Address Line
Data Line
Control Line (R/W, MFC Memory Function Completed)

The control line is used for co-ordinating data transfer.


The processor reads the data from the memory by loading the address of the
required memory location into MAR and setting the R/W line to 1.
The memory responds by placing the data from the addressed location onto the
data lines and confirms this action by asserting MFC signal.
Upon receipt of MFC signal, the processor loads the data onto the data lines into
MDR register.
The processor writes the data into a memory location by loading the address of
this location into MAR, loading the data into MDR, and setting the R/W line to 0.

Memory Access Time - the time that elapses between the initiation of an
operation and the completion of that operation.
Memory Cycle Time - the minimum time delay required between the
initiation of two successive memory operations.

RAM (Random Access Memory):


In a RAM, any location can be accessed for a Read/Write operation in a fixed
amount of time that is independent of the location's address.

Cache Memory:

It is a small, fast memory that is inserted between the larger slower main memory
and the processor.
It holds the currently active segments of a program and their data.

Virtual memory:

The address generated by the processor does not directly specify the physical
locations in the memory.
The address generated by the processor is referred to as a virtual / logical address.
The virtual address space is mapped onto the physical memory where data are
actually stored.
The mapping function is implemented by a special memory control circuit, often
called the memory management unit.
Only the active portion of the address space is mapped into locations in the
physical memory.
The remaining virtual addresses are mapped onto the bulk storage devices used,
which are usually magnetic disk.
As the active portion of the virtual address space changes during program
execution, the memory management unit changes the mapping function and
transfers the data between disk and memory.
Thus, during every memory cycle, an address-processing mechanism determines
whether the addressed information is in the physical memory unit.
If it is, then the proper word is accessed and execution proceeds.
If it is not, a page of words containing the desired word is transferred from disk to
memory.
This page displaces some page in the memory that is currently inactive.

SEMI CONDUCTOR RAM MEMORIES:


Semiconductor memories are available in a wide range of speeds.
Their cycle times range from 100 ns to 10 ns.
INTERNAL ORGANIZATION OF MEMORY CHIPS:

Memory cells are usually organized in the form of an array, in which each cell is
capable of storing one bit of information.
Each row of cells constitutes a memory word, and all cells of a row are connected
to a common line called the word line.
The cells in each column are connected to Sense / Write circuit by two bit lines.
The Sense / Write circuits are connected to data input or output lines of the chip.
During a write operation, the sense / write circuit receive input information and
store it in the cells of the selected word.

Fig: Organization of bit cells in a memory chip

The data input and data output of each Sense/Write circuit are connected to a single
bidirectional data line that can be connected to a data bus of the computer.

R/W - specifies the required operation.

CS - Chip Select input, selects a given chip in a multi-chip memory system.

Bit organization          External connections required for address, data and control lines
128 (16 x 8)              14
1024 (128 x 8, i.e. 1K)   19
Static Memories:

Memories that consists of circuits capable of retaining their state as long as power is
applied are known as static memory.

Fig:Static RAM cell

Two inverters are cross-connected to form a latch.

The latch is connected to two bit lines by transistors T1 and T2.
These transistors act as switches that can be opened or closed under the control of
the word line.
When the word line is at ground level, the transistors are turned off and the latch
retains its state.

Read Operation:

In order to read the state of the SRAM cell, the word line is activated to close
switches T1 and T2.
If the cell is in state 1, the signal on bit line b is high and the signal on bit line
b' is low. Thus b and b' are complements of each other.
The Sense/Write circuit at the end of the bit lines monitors the states of b and b' and sets
the output accordingly.
Write Operation:

The state of the cell is set by placing the appropriate value on bit line b and its
complement on b', and then activating the word line. This forces the cell into the
corresponding state.
The required signals on the bit lines are generated by the Sense/Write circuit.
Fig:CMOS cell (Complementary Metal oxide Semi Conductor):
Transistor pairs (T3, T5) and (T4, T6) form the inverters in the latch.
In state 1, the voltage at point X is high, with T3 and T6 on and T4 and T5 off.
Thus, if T1 and T2 are turned on (closed), bit lines b and b' will have high and low
signals respectively.
The CMOS cell requires a 5 V (in older versions) or 3.3 V (in newer versions) power
supply voltage.
The continuous power is needed for the cell to retain its state
Merit :

It has low power consumption, because current flows in the cell only when the
cell is being accessed.
Static RAMs can be accessed quickly; their access time is a few nanoseconds.

Demerit:

SRAMs are said to be volatile memories because their contents are lost when the
power is interrupted.

Asynchronous DRAMS:-

Less expensive RAMs can be implemented if simpler cells are used; such cells
cannot retain their state indefinitely. Hence they are called Dynamic RAMs
(DRAMs).
The information is stored in a dynamic memory cell in the form of a charge on a
capacitor, and this charge can be maintained only for tens of milliseconds.
The contents must be periodically refreshed by restoring the
capacitor charge to its full value.

Fig:A single transistor dynamic Memory cell

In order to store information in the cell, the transistor T is turned on & the
appropriate voltage is applied to the bit line, which charges the capacitor.
After the transistor is turned off, the capacitor begins to discharge, which is caused
by the capacitor's own leakage resistance.
Hence the information stored in the cell can be retrieved correctly only before the
charge on the capacitor drops below the threshold value.
During a read operation, the transistor is turned on & a sense amplifier
connected to the bit line detects whether the charge on the capacitor is above the
threshold value.

If charge on capacitor > threshold value -> Bit line will have logic value 1.
If charge on capacitor < threshold value -> Bit line will set to logic value 0.

Fig:Internal organization of a 2M X 8 dynamic Memory chip.

DESCRIPTION:

The 4096 cells in each row are divided into 512 groups of 8.
A 21-bit address is needed to access a byte in the memory: 12 bits to select a row, and 9
bits to specify the group of 8 bits in the selected row.

A20-9 - row address of a byte.

A8-0 - column address of a byte.

During Read/ Write operation ,the row address is applied first. It is loaded into the
row address latch in response to a signal pulse on Row Address Strobe(RAS)
input of the chip.
When a Read operation is initiated, all cells on the selected row are read and
refreshed.
Shortly after the row address is loaded, the column address is applied to the
address pins and loaded into the column address latch under control of the Column Address Strobe (CAS) signal.
The information in this latch is decoded and the appropriate group of 8
Sense/Write circuits are selected.
R/W =1(read operation)The output values of the selected circuits are
transferred to the data lines D0 - D7.
R/W =0(write operation)The information on D0 - D7 are transferred to the
selected circuits.
RAS and CAS are active low, so they cause the latching of addresses when they
change from high to low. For this reason they are usually written with overbars, as RAS and CAS.
To ensure that the contents of a DRAM are maintained, each row of cells must
be accessed periodically.
Refresh operation usually perform this function automatically.
A specialized memory controller circuit provides the necessary control signals
RAS & CAS, that govern the timing.
The processor must take into account the delay in the response of the memory.
Such memories are referred to as Asynchronous DRAMs.

Fast Page Mode:

Transferring the bytes in sequential order is achieved by applying a consecutive
sequence of column addresses under the control of successive CAS signals.
This scheme allows a block of data to be transferred at a faster rate. This block-transfer
capability is called Fast Page Mode.
Synchronous DRAM:

Here the operations are directly synchronized with a clock signal.


The address and data connections are buffered by means of registers.
The output of each sense amplifier is connected to a latch.
A Read operation causes the contents of all cells in the selected row to be loaded
in these latches.
Fig:Synchronous DRAM
Data held in the latches that correspond to the selected columns are transferred
into the data output register, thus becoming available on the data output pins.

Fig:Timing Diagram Burst Read of Length 4 in an SDRAM

First ,the row address is latched under control of RAS signal.


The memory typically takes 2 or 3 clock cycles to activate the selected row.
Then the column address is latched under the control of CAS signal.
After a delay of one clock cycle,the first set of data bits is placed on the data lines.
The SDRAM automatically increments the column address to access the next 3
sets of bits in the selected row, which are placed on the data lines in the next 3
clock cycles.

Latency & Bandwidth:

A good indication of performance is given by two parameters.They are,


Latency
Bandwidth
Latency:

It refers to the amount of time it takes to transfer a word of data to or from the
memory.
For a transfer of single word,the latency provides the complete indication of
memory performance.
For a block transfer,the latency denote the time it takes to transfer the first word
of data.
Bandwidth:

It is defined as the number of bits or bytes that can be transferred in one second.
Bandwidth mainly depends upon the speed of access to the stored data & on the
number of bits that can be accessed in parallel.

Double Data Rate SDRAM(DDR-SDRAM):

The standard SDRAM performs all actions on the rising edge of the clock signal.
The double data rate SDRAM transfers data on both edges of the clock (the leading edge
and the trailing edge).
The Bandwidth of DDR-SDRAM is doubled for long burst transfer.
To make it possible to access the data at high rate , the cell array is organized into
two banks.
Each bank can be accessed separately.
Consecutive words of a given block are stored in different banks.
Such interleaving of words allows simultaneous access to two words that are
transferred on successive edge of the clock.

Larger Memories:

Dynamic Memory System:

The physical implementation is done in the form of Memory Modules.


If a large memory is built by placing DRAM chips directly on the main system
printed circuit board that contains the processor (often referred to as the
motherboard), it will occupy a large amount of space on the board.
These packaging considerations have led to the development of larger memory
units known as SIMMs and DIMMs.
SIMM-Single Inline memory Module
DIMM-Dual Inline memory Module

A SIMM or DIMM consists of several memory chips on a separate small board that
plugs vertically into a single socket on the motherboard.

MEMORY SYSTEM CONSIDERATION:

To reduce the number of pins, the dynamic memory chips use multiplexed
address inputs.
The address is divided into two parts:

High-order address bits - select a row in the cell array; they are provided first
and latched into the memory chips under the control of the RAS signal.
Low-order address bits - select a column; they are provided on the same
address pins and latched using the CAS signal.

The Multiplexing of address bit is usually done by Memory Controller Circuit.


Fig:Use of Memory Controller

The Controller accepts a complete address & R/W signal from the processor,
under the control of a Request signal which indicates that a memory access
operation is needed.
The Controller then forwards the row & column portions of the address to the
memory and generates RAS &CAS signals.
It also sends R/W &CS signals to the memory.
The CS signal is usually active low; hence it is written with an overbar, as CS.

Refresh Overhead:

All dynamic memories have to be refreshed.


In a DRAM, the period for refreshing all rows is 16 ms, whereas it is 64 ms in an SDRAM.

Eg: Given a cell array of 8K (8192) rows.

Clock cycles to refresh one row = 4
Clock rate = 133 MHz
Number of cycles to refresh all rows = 8192 x 4 = 32,768
Time needed to refresh all rows = 32,768 / (133 x 10^6)
                                = 246 x 10^-6 s = 0.246 ms
Refresh overhead = 0.246 ms / 64 ms = 0.0038
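The same refresh-overhead calculation, written out with the values from the example above (the 64 ms figure is the SDRAM refresh period mentioned earlier):

# Refresh overhead for the example above: 8192 rows, 4 clock cycles per
# row refresh, 133 MHz clock, all rows must be refreshed every 64 ms.
rows           = 8192
cycles_per_row = 4
clock_hz       = 133e6
refresh_period = 64e-3                             # seconds

refresh_cycles = rows * cycles_per_row             # 32768 cycles
refresh_time   = refresh_cycles / clock_hz         # ~246 microseconds
overhead       = refresh_time / refresh_period     # ~0.0038
print(refresh_cycles, round(refresh_time * 1e6), round(overhead, 4))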

Rambus Memory:

The usage of wide bus is expensive.


Rambus developed the implementation of narrow bus.
Rambus technology is a fast signaling method used to transfer information
between chips.
Instead of using signals that have voltage levels of either 0 or Vsupply to represent
the logical values, the signals consist of much smaller voltage swings around a
reference voltage, Vref.
The reference voltage is about 2 V, and the two logical values are represented by
0.3 V swings above and below Vref.
This type of signalling is generally known as differential signalling.
Rambus provides a complete specification for the design of communication
links(Special Interface circuits) called as Rambus Channel.
Rambus memory has a clock frequency of 400MHZ.
The data are transmitted on both the edges of the clock so that the effective data
transfer rate is 800MHZ.
The circuitry needed to interface to the Rambus channel is included on the
chip.Such chips are known as Rambus DRAMs(RDRAM).
Rambus channel has,

9 data lines (lines 1-8 transfer the data; the 9th line is used for parity checking).


Control line
Power line

A two-channel Rambus has 18 data lines and no separate address lines; such devices are
also called Direct RDRAMs.
Communication between the processor (or some other device that can serve as a
master) and the RDRAM modules (which serve as slaves) is carried out by means of
packets transmitted on the data lines.
There are 3 types of packets.They are,

Request
Acknowledge
Data

READ ONLY MEMORY:


Both SRAM and DRAM chips are volatile,which means that they lose the stored
information if power is turned off.
Many applications require non-volatile memory, which retains the stored
information even if power is turned off.
Eg: Operating system software has to be loaded from disk into memory, which
requires a program that boots the operating system; this boot program must reside in
non-volatile memory.
Non-volatile memory is used in embedded system.
Since the normal operation involves only reading of stored data ,a memory of this
type is called ROM.
Fig:ROM cell

At logic value 0, the transistor (T) is connected to the ground point (P); the
transistor switch is closed and the voltage on the bit line drops nearly to zero.
At logic value 1, the transistor switch is open and the bit line remains at a high voltage.

To read the state of the cell,the word line is activated.


A Sense circuit at the end of the bitline generates the proper output value.

Types of ROM:

Different types of non-volatile memory are,

PROM
EPROM
EEPROM
Flash Memory

PROM:-Programmable ROM:

PROM allows the data to be loaded by the user.


Programmability is achieved by inserting a fuse at point P in a ROM cell.
Before it is programmed, the memory contains all 0s
The user can insert 1s at the required locations by burning out the fuses at these
locations using high-current pulses.
This process is irreversible.

Merit:
It provides flexibility.
It is faster.
It is less expensive because they can be programmed directly by the user.

EPROM:-Erasable reprogrammable ROM:

EPROM allows the stored data to be erased and new data to be loaded.
In an EPROM cell, a connection to ground is always made at P and a special
transistor is used, which has the ability to function either as a normal transistor or
as a disabled transistor that is always turned off.
This transistor can be programmed to behave as a permanently open switch, by
injecting charge into it that becomes trapped inside.
Erasure requires dissipating the charges trapped in the transistor of memory cells.
This can be done by exposing the chip to ultra-violet light, so that EPROM chips
are mounted in packages that have transparent windows.
Merits:
It provides flexibility during the development phase of digital system.
It is capable of retaining the stored information for a long time.

Demerits:
The chip must be physically removed from the circuit for reprogramming and its
entire contents are erased by UV light.

EEPROM:-Electrically Erasable ROM:

Merits:
It can be both programmed and erased electrically.
It allows the erasing of all cell contents selectively.
Demerits:
It requires different voltage for erasing ,writing and reading the stored data.

Flash Memory:

In EEPROM, it is possible to read & write the contents of a single cell.


In Flash device, it is possible to read the contents of a single cell but it is only
possible to write the entire contents of a block.
Prior to writing,the previous contents of the block are erased.
Eg.In MP3 player,the flash memory stores the data that represents sound.
Single flash chips cannot provide sufficient storage capacity for embedded system
application.
There are 2 methods for implementing larger memory modules consisting of
number of chips.They are,
Flash Cards
Flash Drives.
Merits:
Flash drives have greater density which leads to higher capacity & low cost per
bit.
It requires single power supply voltage & consumes less power in their operation.

Flash Cards:
One way of constructing larger module is to mount flash chips on a small card.
Such flash card have standard interface.
The card is simply plugged into a conveniently accessible slot.
Its memory size are of 8,32,64MB.
Eg:A minute of music can be stored in 1MB of memory. Hence 64MB flash cards
can store an hour of music.

Flash Drives:

Larger flash memory module can be developed by replacing the hard disk drive.
The flash drives are designed to fully emulate the hard disk.
The flash drives are solid state electronic devices that have no movable parts.
Merits:
They have shorter seek and access time which results in faster response.
They have low power consumption which makes them attractive for battery
driven application.
They are insensitive to vibration.
Demerit:
The capacity of a flash drive (<1 GB) is less than that of a hard disk (>1 GB).
Flash storage also has a higher cost per bit.
Flash memory deteriorates after it has been written a large number of
times (typically at least 1 million times).

SPEED, SIZE AND COST:

Characteristic   SRAM         DRAM             Magnetic Disk
Speed            Very fast    Slower           Much slower than DRAM
Size             Large        Small            Small
Cost             Expensive    Less expensive   Low price

Magnetic Disk:
A huge amount of cost-effective storage can be provided by magnetic disks. The
main memory can be built with DRAMs, which leaves SRAMs to be used in
smaller units where speed is of the essence.

Memory             Speed                        Size        Cost

Registers          Very high                    Lower       Very low
Primary cache      High                         Lower       Low
Secondary cache    Low                          Low         Low
Main memory        Lower than secondary cache   High        High
Secondary memory   Very low                     Very high   Very high
Fig:Memory Hierarchy

Types of Cache Memory:

The Cache memory is of 2 types.They are,


Primary /Processor Cache(Level1 or L1 cache)
Secondary Cache(Level2 or L2 cache)

Primary cache - it is always located on the processor chip.

Secondary cache - it is placed between the primary cache and the rest of the memory.

The main memory is implemented using dynamic memory components
(SIMM, RIMM, DIMM).
The access time for the main memory is about 10 times longer than the access time
for the L1 cache.

CACHE MEMORIES
The effectiveness of cache mechanism is based on the property of Locality of
reference.
Locality of Reference:
Many instructions in the localized areas of the program are executed repeatedly
during some time period and remainder of the program is accessed relatively
infrequently.
It manifests itself in 2 ways.They are,
Temporal(The recently executed instruction are likely to be executed again
very soon.)
Spatial(The instructions in close proximity to recently executed instruction
are also likely to be executed soon.)
If the active segment of the program is placed in cache memory, then the total
execution time can be reduced significantly.
The term Block refers to the set of contiguous address locations of some size.
The cache line is used to refer to the cache block.
Fig:Use of Cache Memory

The Cache memory stores a reasonable number of blocks at a given time but this
number is small compared to the total number of blocks available in Main
Memory.
The correspondence between main memory block and the block in cache memory
is specified by a mapping function.
The Cache control hardware decide that which block should be removed to create
space for the new block that contains the referenced word.
The collection of rule for making this decision is called the replacement
algorithm.
The cache control circuit determines whether the requested word currently exists
in the cache.
If it exists, then Read/Write operation will take place on appropriate cache
location. In this case Read/Write hit will occur.
In a Read operation, the main memory is not involved.
The write operation can proceed in two ways:

Write-through protocol
Write-back protocol
Write-through protocol:

Here the cache location and the main memory locations are updated
simultaneously.

Write-back protocol:

In this technique, only the cache location is updated, and it is marked as updated with an
associated flag bit called the dirty or modified bit.
The word in the main memory is updated later, when the block containing
this marked word is removed from the cache to make room for a new block.
If the requested word does not currently exist in the cache during a read operation, then a
read miss occurs.
To handle the read miss, the load-through (early restart) approach is used.
Read Miss:
The block of words that contains the requested word is copied from the main memory
into cache.
Load through:
After the entire block is loaded into cache,the particular word requested is
forwarded to the processor.
If the requested word does not exist in the cache during a write operation, then a write
miss occurs.
If the write-through protocol is used, the information is written directly into the main
memory.
If the write-back protocol is used, the block containing the addressed word is first
brought into the cache, and then the desired word in the cache is overwritten with
the new information.

Mapping Function:
Direct Mapping:
It is the simplest technique in which block j of the main memory maps onto block
j modulo 128 of the cache.
Thus whenever one of the main memory blocks 0, 128, 256, ... is loaded into the cache, it
is stored in cache block 0.
Blocks 1, 129, 257, ... are stored in cache block 1, and so on.
The contention may arise when,
When the cache is full
When more than one memory block is mapped onto a given cache block
position.
The contention is resolved by allowing the new blocks to overwrite the currently
resident block.
Placement of block in the cache is determined from memory address.
Fig: Direct Mapped Cache
The memory address is divided into 3 fields.They are,

Low-order 4-bit field (word) - selects one of 16 words in a block.

7-bit cache block field - when a new block enters the cache, these 7 bits determine the cache
position in which the block must be stored.
5-bit tag field - the high-order 5 bits of the memory address of the block are
stored in the 5 tag bits associated with its location in the cache.
As execution proceeds, the high-order 5 bits of the address are compared with the tag
bits associated with that cache location.
If they match, then the desired word is in that block of the cache.
If there is no match, then the block containing the required word must first be read
from the main memory and loaded into the cache.
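A minimal sketch of the address split described above, assuming the 16-bit address and the 4/7/5-bit word/block/tag fields of this example:

# Split a 16-bit memory address into tag / cache-block / word fields for
# the direct-mapped cache described above (16 words per block, 128 blocks).
def split_address(addr):
    word  = addr & 0xF            # low 4 bits: word within the block
    block = (addr >> 4) & 0x7F    # next 7 bits: cache block position
    tag   = (addr >> 11) & 0x1F   # high 5 bits: tag
    return tag, block, word

# Main-memory blocks 0, 128, 256, ... all map to cache block 0:
for mem_block in (0, 128, 256):
    addr = mem_block * 16         # address of word 0 of that block
    print(mem_block, "->", split_address(addr))
# 0 -> (0, 0, 0), 128 -> (1, 0, 0), 256 -> (2, 0, 0)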
Merit:
It is easy to implement.
Demerit:
It is not very flexible.

Associative Mapping:
In this method, the main memory block can be placed into any cache block position.

Fig:Associative Mapped Cache.

12 tag bits are needed to identify a memory block when it is resident in the cache.
The tag bits of an address received from the processor are compared to the tag bits
of each block of the cache to see if the desired block is present. This is called
associative mapping.
It gives complete freedom in choosing the cache location.
A new block that has to be brought into the cache has to replace(eject)an existing
block if the cache is full.
In this method,the memory has to determine whether a given block is in the cache.
A search of this kind is called an associative Search.
Merit:
It is more flexible than direct mapping technique.
Demerit:
Its cost is high.

Set-Associative Mapping:
It is the combination of direct and associative mapping.
The blocks of the cache are grouped into sets and the mapping allows a block of
the main memory to reside in any block of the specified set.
In this case, the cache has two blocks per set, so the memory blocks
0, 64, 128, ..., 4032 map into cache set 0, and they can occupy either of the two
block positions within the set.

6-bit set field - determines which set of the cache may contain the desired block.
6-bit tag field - the tag field of the address is compared to the tags of the two blocks of
the set to check if the desired block is present.

Fig: Set-Associative Mapping:

No. of blocks per set    No. of set-field bits

2                        6
4                        5
8                        4
128                      no set field
A cache that contains 1 block per set corresponds to direct mapping.
A cache that has k blocks per set is called a k-way set-associative cache.
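A small check of the set-field width for this 128-block cache: with k blocks per set there are 128/k sets, so the set field needs log2(128/k) bits, which reproduces the table above.

# Set-field width for a 128-block cache with k blocks per set:
# number of sets = 128 / k, set field = log2(128 / k) bits.
from math import log2

for k in (1, 2, 4, 8, 128):
    sets = 128 // k
    print(f"{k:>3} blocks/set -> {sets:>3} sets -> {int(log2(sets))} set-field bits")
# 1 -> 7 bits (direct mapped), 2 -> 6, 4 -> 5, 8 -> 4, 128 -> 0 (fully associative)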
Each block contains a control bit called a valid bit.
The Valid bit indicates that whether the block contains valid data.
The dirty bit indicates that whether the block has been modified during its cache
residency.
Valid bit=0When power is initially applied to system
Valid bit =1When the block is loaded from main memory at first time.
If a main memory block is updated by a source that bypasses the cache, and a copy of that
block already exists in the cache, then the valid bit of that cache block is cleared to 0.
The problem of keeping the copies of data used by the processor and by DMA consistent
is called the cache coherence problem.
Merit:
The Contention problem of direct mapping is solved by having few choices for
block placement.
The hardware cost is decreased by reducing the size of associative search.

Replacement Algorithm:
In direct mapping, the position of each block is pre-determined and there is no
need of replacement strategy.
In the associative & set-associative methods, the block position is not pre-
determined; i.e., when the cache is full and a new block is brought into the cache,
the cache controller must decide which of the old blocks is to be replaced.
Therefore, when a block is to be over-written, it is sensible to over-write the one
that has gone the longest time without being referenced. This block is called the Least
Recently Used (LRU) block & the technique is called the LRU algorithm.
The cache controller tracks the references to all blocks with the help of a block
counter.
Eg:

Consider 4 blocks/set in a set-associative cache; a 2-bit counter can be used for
each block.
When a hit occurs, the counter of the referenced block is set to 0; counters with
values originally lower than the referenced one are incremented by 1, and all
others remain unchanged.
When a miss occurs & the set is full, the block with the counter value 3 is
removed, the new block is put in its place with its counter set to 0, and the other
block counters are incremented by 1.
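A minimal C sketch of the 2-bit reference counters just described, for one 4-block set
(counter value 3 marks the least recently used block; names and structure here are
illustrative, not from the notes):

#define WAYS 4
static unsigned cnt[WAYS];               /* one 2-bit counter per block (0..3) */

void on_hit(int ref) {                   /* block 'ref' was referenced         */
    for (int b = 0; b < WAYS; b++)
        if (b != ref && cnt[b] < cnt[ref])
            cnt[b]++;                    /* counters originally lower go up    */
    cnt[ref] = 0;                        /* referenced block is most recent    */
}

int on_miss_full_set(void) {             /* set is full: evict counter value 3 */
    int victim = 0;
    for (int b = 0; b < WAYS; b++) {
        if (cnt[b] == 3) victim = b;     /* block to be replaced               */
        else cnt[b]++;                   /* all other counters incremented     */
    }
    cnt[victim] = 0;                     /* new block starts as most recent    */
    return victim;
}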
Merit:
The performance of the LRU algorithm can be improved by introducing a small amount
of randomness in deciding which block is to be over-written.
PERFORMANCE CONSIDERATION:
Two key factors in commercial success are performance & cost, i.e., the best
possible performance at the lowest cost.
A common measure of success is the price/performance ratio.
Performance depends on how fast the machine instructions are brought to the
processor and how fast they are executed.
To allow parallelism in memory accesses (so that several memory modules can be
kept busy at the same time), interleaving is used.

Interleaving:
Fig:Consecutive words in a Module
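A sketch of the two ways of distributing words among memory modules (the module count
k = 4 is an assumption for illustration): taking the module number from the high-order
address bits keeps consecutive words in one module, while taking it from the low-order
bits (interleaving) spreads consecutive words across modules so their accesses can overlap:

#define K_BITS 2                    /* 4 modules (assumed)                     */
#define K      (1u << K_BITS)

/* Consecutive words in a module: high-order bits select the module.           */
unsigned module_high(unsigned addr, unsigned addr_bits) {
    return addr >> (addr_bits - K_BITS);
}

/* Interleaving: low-order bits select the module, so words n, n+1, n+2, ...
   fall in different modules and can be accessed in parallel.                  */
unsigned module_low(unsigned addr) {
    return addr & (K - 1);
}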

VIRTUAL MEMORY:
Techniques that automatically move program and data blocks into the physical
main memory when they are required for execution is called the Virtual
Memory.
The binary address that the processor issues either for instruction or data are
called the virtual / Logical address.
The virtual address is translated into physical address by a combination of
hardware and software components.This kind of address translation is done by
MMU(Memory Management Unit).
When the desired data are in the main memory ,these data are fetched /accessed
immediately.
If the data are not in the main memory,the MMU causes the Operating system to
bring the data into memory from the disk.
Transfer of data between disk and main memory is performed using DMA
scheme.

Fig:Virtual Memory Organisation

Address Translation:

In address translation,all programs and data are composed of fixed length units
called Pages.
The Page consists of a block of words that occupy contiguous locations in the
main memory.
Pages commonly range from 2K to 16K bytes in length.
The virtual-memory mechanism bridges the size and speed gap between the main
memory and secondary storage, and it is implemented largely by software techniques.
Each virtual address generated by the processor contains a virtual page
number (high-order bits) and an offset (low-order bits).
Virtual page number + offset: the page number selects the page, and the offset specifies
the location of a particular byte (or word) within that page.
Page Table:

It contains the information about the main memory address where the page is
stored & the current status of the page.
Page Frame:

An area in the main memory that holds one page is called the page frame.
Page Table Base Register:
It contains the starting address of the page table.

Virtual page number + page table base register: gives the address of the
corresponding entry in the page table. That entry, in turn, gives the starting address of
the page if that page currently resides in memory.
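The arithmetic described above can be sketched in C. The page size (4K bytes) and the
page-table entry size (4 bytes) are assumptions chosen only to make the example concrete:

#include <stdint.h>

#define PAGE_BITS 12u                          /* assumed 4 KB pages              */
#define PTE_SIZE   4u                          /* assumed 4-byte entries          */

/* Address of the page-table entry: page table base + (virtual page number
   scaled by the entry size).                                                     */
uint32_t pte_address(uint32_t ptbr, uint32_t vaddr) {
    uint32_t vpn = vaddr >> PAGE_BITS;         /* virtual page number (high bits) */
    return ptbr + vpn * PTE_SIZE;
}

/* Physical address: page frame from the entry, concatenated with the offset.     */
uint32_t physical_address(uint32_t frame, uint32_t vaddr) {
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);   /* low-order bits         */
    return (frame << PAGE_BITS) | offset;
}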

Control Bits in Page Table:

The control bits specify the status of the page while it is in main memory.
Function:

One control bit indicates the validity of the page, i.e., whether the page is
actually loaded in the main memory.
Another indicates whether the page has been modified during its residency in
the memory; this information is needed to determine whether the page should be
written back to the disk before it is removed from the main memory to make room
for another page.

Fig:Virtual Memory Address Translation

The Page table information is used by MMU for every read & write access.
The Page table is placed in the main memory but a copy of the small portion of
the page table is located within MMU.
This small portion or small cache is called the Translation Lookaside Buffer (TLB).
It consists of the page table entries that correspond to the most
recently accessed pages, and each entry also holds the corresponding virtual address.
Fig:Use of Associative Mapped TLB

When the operating system changes the contents of the page table, it must also
invalidate the corresponding entry in the TLB; a control bit in the TLB is provided
for this purpose.
Given a virtual address,the MMU looks in TLB for the referenced page.
If the page table entry for this page is found in TLB,the physical address is
obtained immediately.
If there is a miss in TLB,then the required entry is obtained from the page table in
the main memory & TLB is updated.
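The hit/miss behaviour of the TLB can be sketched as follows (a small, fully associative
TLB of 16 entries is assumed; the page-table walk is only a stub here):

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 16                         /* assumed TLB size                */
struct tlb_entry { bool valid; uint32_t vpn; uint32_t frame; };
static struct tlb_entry tlb[TLB_ENTRIES];

static uint32_t page_table_walk(uint32_t vpn) {
    /* Slow path: the required entry is obtained from the page table in main
       memory (not modelled here).                                               */
    return vpn;                                /* identity mapping, illustration  */
}

uint32_t translate_vpn(uint32_t vpn) {
    for (int i = 0; i < TLB_ENTRIES; i++)      /* associative search of the TLB   */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return tlb[i].frame;               /* TLB hit                         */
    uint32_t frame = page_table_walk(vpn);     /* TLB miss                        */
    tlb[0].valid = true;                       /* update the TLB (naively)        */
    tlb[0].vpn   = vpn;
    tlb[0].frame = frame;
    return frame;
}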
When a program generates an access request to a page that is not in the main
memory, a Page Fault occurs.
The whole page must be brought from the disk into memory before the access can
proceed.
When it detects a page fault, the MMU asks the operating system to intervene by
raising an exception (interrupt).
Because a long delay occurs while the page transfer takes place, the operating
system suspends the execution of the task that caused the page fault and begins
execution of another task whose pages are in main memory.
When the task resumes, either the interrupted instruction must continue from the
point of interruption or the instruction must be restarted.
If a new page is brought from the disk when the main memory is full, it must
replace one of the resident pages. In that case, the LRU algorithm is typically used,
which removes the least recently used page.
A modified page has to be written back to the disk before it is removed from the
main memory; this corresponds to a write-back policy.

MEMORY MANAGEMENT REQUIREMENTS:


Management routines are part of the operating system.
The virtual address space in which the OS routines are assembled is called the System Space.
The virtual space in which the user application programs reside is called the User
Space.
Each user space has a separate page table.
The MMU uses a page table base register to determine the address of the page table
to be used in the translation process.
Hence, by changing the contents of this register, the OS can switch from one space
to another.
The processor has two states of operation. They are,
User state
Supervisor state.
User State:
In this state,the processor executes the user program.
Supervisor State:
When the processor executes the operating system routines,the processor will be
in supervisor state.
Privileged Instruction:
In the user state, certain machine instructions (privileged instructions) cannot be
executed. Hence a user program is prevented from accessing the page tables of other
user spaces or of the system space.
The control bits in each page table entry can be set to control the access privileges
granted to each program.
i.e., one program may be allowed to read/write a given page, while other
programs may be given only read access.

SECONDARY STORAGE:
Semiconductor memories alone cannot provide all the required storage capacity.
The Secondary storage devices provide larger storage requirements.
Some of the Secondary Storage devices are,
Magnetic Disk
Optical Disk
Magnetic Tapes.

Magnetic Disk:
A magnetic disk system consists of one or more disks mounted on a common spindle.
A thin magnetic film is deposited on each disk, usually on both sides.
The disks are placed in a rotary drive so that the magnetized surfaces move in
close proximity to the read/write heads.
Each head consists of magnetic yoke & magnetizing coil.
Digital information can be stored on the magnetic film by applying the current
pulse of suitable polarity to the magnetizing coil.
Only changes in the magnetic field under the head can be sensed during the Read
operation.
Therefore if the binary states 0 & 1 are represented by two opposite states of
magnetization, a voltage is induced in the head only at 0-1 and at 1-0 transition in
the bit stream.
A long string of consecutive 0s or 1s can be resolved only with the help of a clock,
which is mainly used for synchronization.
Phase encoding or Manchester encoding is a technique for combining the clocking
information with the data.
Manchester encoding describes how such a self-clocking scheme is
implemented.
Fig:Mechanical Structure

The Read/Write heads must be maintained at a very small distance from the
moving disk surfaces in order to achieve high bit densities.
When the disk are moving at their steady state, the air pressure develops between
the disk surfaces & the head & it forces the head away from the surface.
The flexible spring connection between head and its arm mounting permits the
head to fly at the desired distance away from the surface.

Winchester Technology:
Units in which the read/write heads are placed in a sealed, air-filtered enclosure are
said to use Winchester technology.
In such units, the read/write heads can operate closer to the magnetic track surfaces
because the dust particles, which are a problem in unsealed assemblies, are absent.

Merits:

It has a larger capacity for a given physical size.


The data integrity is higher because the storage medium is not exposed to
contaminating elements.
The read/write heads of a disk system are movable.
The disk system has 3 parts.They are,
Disk Platter(Usually called Disk)
Disk Drive(spins the disk & moves Read/write heads)
Disk Controller(controls the operation of the system.)

Fig:Organizing & Accessing the data on disk

Each surface is divided into concentric tracks.


Each track is divided into sectors.
The set of corresponding tracks on all surfaces of a stack of disk form a logical
cylinder.
The data are accessed by specifying the surface number,track number and the
sector number.
The Read/Write operation start at sector boundaries.
Data bits are stored serially on each track.
Each sector usually contains 512 bytes.

Sector header -> contains identification information.


It helps to find the desired sector on the selected track.
ECC (Error checking code)- used to detect and correct errors.
An unformatted disk has no information on its tracks.
The formatting process divides the disk physically into tracks and sectors and this
process may discover some defective sectors on all tracks.
The disk controller keeps a record of such defects.
The disk is divided into logical partitions. They are,
Primary partition
Secondary partition
In the diag, Each track has same number of sectors.
So all tracks have same storage capacity.
Thus the stored information is packed more densely on inner track than on outer
track.
Access time
There are 2 components involved in the time delay between receiving an address
and the beginning of the actual data transfer. They are,
Seek time
Rotational delay / Latency
Seek time: the time required to move the read/write head to the proper track.
Latency: the amount of time that elapses after the head is positioned over the correct
track until the starting position of the addressed sector passes under the read/write head.
Seek time + Latency = Disk access time
Typical disk

One-inch disk: weight = 1 ounce, size comparable to a matchbook,
capacity about 1 GB.
A 3.5-inch disk has the following parameters:
Recording surfaces = 20
Tracks = 15,000 tracks/surface
Sectors = 400 sectors/track
Each sector stores 512 bytes of data
Capacity of formatted disk = 20 x 15000 x 400 x 512 = 60 x 10^9 bytes = 60 GB
Seek time = 3 ms
Platter rotation = 10,000 rev/min
Latency = 3 ms
Internal data transfer rate = 34 MB/s
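A quick check of the figures quoted above (the numbers are the ones in these notes; the
code is only a sketch):

#include <stdio.h>

int main(void) {
    /* Parameters of the typical 3.5-inch disk listed above. */
    long long surfaces = 20, tracks = 15000, sectors = 400, bytes = 512;
    long long capacity = surfaces * tracks * sectors * bytes;
    printf("formatted capacity = %lld bytes (about 60 GB)\n", capacity);

    /* Average rotational latency at 10000 rev/min is half a revolution. */
    double latency_ms = 0.5 * 60000.0 / 10000.0;    /* = 3 ms            */
    double seek_ms    = 3.0;
    printf("access time = seek + latency = %.1f ms\n", seek_ms + latency_ms);
    return 0;
}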
Data Buffer / cache
A disk drive that incorporates the required SCSI circuitry is referred to as a SCSI drive.
The SCSI bus can transfer data at a higher rate than the rate at which data can be
read from the disk tracks.
An efficient method to deal with this possible difference in transfer rate between the
disk and the SCSI bus is to include a data buffer.
This buffer is a semiconductor memory.
The data buffer can also provide a cache mechanism for the disk, i.e., when a read
request arrives at the disk, the controller first checks whether the data are available in the
cache (buffer).
If the data are available in the cache, they can be accessed and placed on the SCSI bus
immediately. If not, the data are retrieved from the disk.
Disk Controller
The disk controller acts as an interface between the disk drive and the system bus.
The disk controller uses the DMA scheme to transfer data between the disk and main
memory.
When the OS initiates a transfer by issuing a Read/Write request, the following
information is loaded into the controller's registers:
Main memory address (address of the first main memory location of the block of
words involved in the transfer)
Disk address (the location of the sector containing the beginning of the desired
block of words)
Word count (the number of words in the block to be transferred).

UNIT V
MEMORY SYSTEM
PART-A

1) Write about the functions of memory organization?


2) Draw the block diagram for memory unit?
3) What are the key characteristics of memories?
4) What are the types of access method?
5) What is meant by memory cycle time?
6) Define transfer rate
7) What are the memory operations?
8) Define throughput
9) What is memory bus?
10) What are the types of memory?
11) Write about RAM
12) What is static RAM?
13) Write short notes on ROM
14) What is error detection and correction code?
15) Explain parity bit?
16) What is hamming error-correcting codes?
17) What is cache memory?
18) What is fully associative mapping?
19) What is block replacement?
20) What is meant by LRU?
21) Explain write strategy?
22) Define hit ratio
23) Explain cache update policies?
24) What is direct mapped cache?
25) What is associative mapping?
26) Explain cache organization?
27) What is cache miss rate?
28) Write about cache directories?
29) What is a write-protect bit?
30) What is multilevel cache?
31) What is virtual memory?
32) What is segmented memory?
33) Write about block access time?
34) What is page mode?
35) Write about row-address strobe (RAS)?
36) What is column-address strobe (CAS)?
37) What is access time myth?
38) How can hits or misses myth?
39) How are memory blocks mapped into cache lines?
40) What is secondary storages?

PART B

1) Explain the memory organization?


2) Write about the following
a. Memory bus
b. Byte storage methods
3) Write about memory hierarchy
4) Explain semi conductor RAM?
5) Write about semi conductor ROM?
6) Explain error detection and correction codes?
UNIT V

I/O ORGANIZATION

Accessing I/O devices

Programmed I/O

Interrupts

Direct memory access

Buses

Interface Circuits

Standard I/O interfaces (PCI, SCSI, and USB)

I/O Devices and processors


ACCESSING I/O DEVICES
A simple arrangement to connect I/O devices to a computer is to use a single bus
structure. It consists of three sets of lines to carry
Address
Data
Control Signals.
When the processor places a particular address on address lines, the devices that
recognize this address responds to the command issued on the control lines.
The processor request either a read or write operation and the requested data are
transferred over the data lines.
When I/O devices & memory share the same address space, the arrangement is called
memory mapped I/O.

Single Bus Structure

Processor Memory

Bus

I/O device 1 .. I/O device n

Eg:-

Move DATAIN, Ro Reads the data from DATAIN then into processor register Ro.
Move Ro, DATAOUT Send the contents of register Ro to location DATAOUT.
DATAIN Input buffer associated with keyboard.
DATAOUT Output data buffer of a display unit / printer.

Fig: I/O Interface for an Input Device

(The interface comprises an address decoder, control circuits, and data & status
registers, connected to the address, data, and control lines of the bus on one side and
to the input device on the other.)
Address Decoder:

It enables the device to recognize its address when the address appears on address
lines.

Data register: holds the data being transferred to or from the processor.
Status register: contains information relevant to the operation of the I/O device.

The address decoder, the data & status registers, and the control circuitry required to
co-ordinate I/O transfers constitute the device's interface circuit.
For an input device, the SIN status flag is used: SIN = 1 when a character is entered
at the keyboard, and SIN = 0 once the character is read by the processor.
For an output device, the SOUT status flag is used in a similar way.

Eg:

DIRQ: interrupt request for the display.
KIRQ: interrupt request for the keyboard.
KEN: keyboard enable.
DEN: display enable.
SIN, SOUT: status flags.

The data from the keyboard are made available in the DATAIN register & the data sent to
the display are stored in DATAOUT register.

Program:
         Move      #LINE, R0       Initialize memory pointer.
WAITK    TestBit   #0, STATUS      Test SIN.
         Branch=0  WAITK           Wait for a character to be entered.
         Move      DATAIN, R1      Read the character.
WAITD    TestBit   #1, STATUS      Test SOUT.
         Branch=0  WAITD           Wait for the display to become ready.
         Move      R1, DATAOUT     Send the character to the display.
         Move      R1, (R0)+       Store the character and advance the pointer.
         Compare   #0D, R1         Check if it is Carriage Return.
         Branch≠0  WAITK           If not, get another character.
         Move      #0A, DATAOUT    Otherwise, send Line Feed.
         Call      PROCESS         Call a subroutine to process the line.
EXPLANATION:
This program reads a line of characters from the keyboard & stores it in a
memory buffer starting at location LINE.
Then it calls the subroutine PROCESS to process the input line.
As each character is read, it is echoed back to the display.
Register R0 is used as a pointer and is updated using the autoincrement mode so that
successive characters are stored in successive memory locations.
Each character is checked to see if it is the carriage return (CR) character, which has
the ASCII code 0D (hex).
If it is, a line feed character (0A hex) is sent to move the cursor one line down on the
display & the subroutine PROCESS is called. Otherwise, the program loops back to
wait for another character from the keyboard.
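The same wait loops can be sketched in C using memory-mapped registers. The register
addresses and bit positions below are hypothetical, chosen only to mirror the program
above; they are not defined anywhere in these notes:

#include <stdint.h>

#define DATAIN   (*(volatile uint8_t *)0x4000)   /* hypothetical addresses       */
#define DATAOUT  (*(volatile uint8_t *)0x4004)
#define STATUS   (*(volatile uint8_t *)0x4008)
#define SIN      0x01            /* bit 0: keyboard has a character              */
#define SOUT     0x02            /* bit 1: display is ready                      */

void read_line(char *line) {
    char c;
    do {
        while (!(STATUS & SIN))  ;        /* WAITK: wait for a key                */
        c = DATAIN;                       /* read the character                   */
        while (!(STATUS & SOUT)) ;        /* WAITD: wait for the display          */
        DATAOUT = c;                      /* echo the character                   */
        *line++ = c;                      /* store it and advance the pointer     */
    } while (c != 0x0D);                  /* repeat until carriage return         */
    while (!(STATUS & SOUT)) ;
    DATAOUT = 0x0A;                       /* send line feed                       */
}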

PROGRAM CONTROLLED I/O


Here the processor repeatedly checks a status flag to achieve the required
synchronization between the processor & the I/O device, i.e., the processor polls the device.

There are 2 other mechanisms to handle I/O operations. They are,

Interrupts (synchronization is achieved by having the I/O device send a special
signal over the bus whenever it is ready for a data transfer operation)
DMA

DMA:

It is a technique used for high-speed I/O devices.
Here, the device interface transfers data directly to or from the memory without
continuous involvement by the processor.
INTERRUPTS
When a program enters a wait loop, it will repeatedly check the device status.
During this period, the processor will not perform any function.
The Interrupt request line will send a hardware signal called the interrupt signal to
the processor.
On receiving this signal, the processor will perform the useful function during the
waiting period.
The routine executed in response to an interrupt request is called Interrupt
Service Routine.
The interrupt resembles the subroutine calls.

Fig:Transfer of control through the use of interrupts

The processor first completes the execution of instruction i. Then it loads the
PC (Program Counter) with the address of the first instruction of the ISR.
After the execution of the ISR, the processor has to come back to instruction i + 1.
Therefore, when an interrupt occurs, the current contents of the PC, which point to
instruction i + 1, are put in temporary storage in a known location.
A Return-from-interrupt instruction at the end of the ISR reloads the PC from that
temporary storage location, causing the execution to resume at instruction i + 1.
When the processor is handling an interrupt, it must inform the device that its
request has been recognized, so that the device can remove its interrupt request signal.
This may be accomplished by a special control signal called the interrupt
acknowledge signal.
The task of saving and restoring the information can be done automatically by the
processor.
The processor saves only the contents of program counter & status register (ie)
it saves only the minimal amount of information to maintain the integrity of the
program execution.
Saving registers also increases the delay between the time an interrupt request is
received and the start of the execution of the ISR. This delay is called the
Interrupt Latency.
Generally, the long interrupt latency in unacceptable.
The concept of interrupts is used in Operating System and in Control
Applications, where processing of certain routines must be accurately timed
relative to external events. This application is also called as real-time processing.

Interrupt Hardware:

Fig:An equivalent circuit for an open drain bus used to implement a common
interrupt request line

A single interrupt request line may be used to serve n devices. All devices are
connected to the line via switches to ground.
To request an interrupt, a device closes its associated switch, and the voltage on the INTR
line drops to 0 (zero).
If all the interrupt request signals (INTR1 to INTRn) are inactive, all switches are
open and the voltage on the INTR line is equal to Vdd.
When devices request interrupts, the value of INTR is the logical OR of the
requests from the individual devices:

INTR = INTR1 + INTR2 + ... + INTRn

INTR is the name of the interrupt request signal on the common line; it is active in the
low-voltage state.

An open-collector (bipolar circuit) or open-drain (MOS circuit) gate is used to drive the
INTR line.
The output of an open-collector (or open-drain) gate is equivalent to a switch to
ground that is open when the gate's input is in the 0 state and closed when the gate's
input is in the 1 state.
Resistor R is called a pull-up resistor because it pulls the line voltage up to the
high-voltage state when the switches are open.
Enabling and Disabling Interrupts:

The arrival of an interrupt request from an external device causes the processor to
suspend the execution of one program & start the execution of another because
the interrupt may alter the sequence of events to be executed.
INTR is active during the execution of Interrupt Service Routine.
There are 3 mechanisms to solve the problem of infinite loop which occurs due to
successive interruptions of active INTR signals.
The following are the typical scenario.

The device raises an interrupt request.


The processor interrupts the program currently being executed.
Interrupts are disabled by changing the control bits in the PS (Processor Status
register).
The device is informed that its request has been recognized & in response, it
deactivates the INTR signal.
Interrupts are enabled & execution of the interrupted program is resumed.
Edge-triggered:

The processor has a special interrupt request line for which the interrupt handling
circuit responds only to the leading edge of the signal. Such a line said to be edge-
triggered.

Handling Multiple Devices:

When several devices requests interrupt at the same time, it raises some questions.
They are.

How can the processor recognize the device requesting an interrupt?


Given that the different devices are likely to require different ISR, how
can the processor obtain the starting address of the appropriate routines in
each case?
Should a device be allowed to interrupt the processor while another
interrupt is being serviced?
How should two or more simultaneous interrupt requests be handled?

Polling Scheme:

If two devices have activated the interrupt request line, the ISR for the selected
device (the first device) is completed & then the second request can be serviced.
The simplest way to identify the interrupting device is to have the ISR poll all the
I/O devices connected to the bus; the first device encountered with its IRQ bit set is
the device to be serviced.
IRQ (Interrupt Request): when a device raises an interrupt request, the IRQ bit in its
status register is set to 1.
Merit:
It is easy to implement.
Demerit:
The time spent for interrogating the IRQ bits of all the devices that may not be
requesting any service.

Vectored Interrupt:

Here the device requesting an interrupt identifies itself to the processor by
sending a special code over the bus & then the processor starts executing the ISR.
The code supplied by the device indicates the starting address of the ISR for
that device.
The code length typically ranges from 4 to 8 bits.
The location pointed to by the interrupting device is used to store the starting
address of the ISR.
The processor reads this address, called the interrupt vector, & loads it into the PC.
The interrupt vector also includes a new value for the Processor Status Register.
When the processor is ready to receive the interrupt vector code, it activate the
interrupt acknowledge (INTA) line.

Interrupt Nesting:
Multiple Priority Scheme:

In multiple level priority scheme, we assign a priority level to the processor that
can be changed under program control.
The priority level of the processor is the priority of the program that is currently
being executed.
The processor accepts interrupts only from devices that have priorities higher than
its own.
At the time the execution of an ISR for some device is started, the priority of the
processor is raised to that of the device.
The action disables interrupts from devices at the same level of priority or lower.

Privileged Instruction:

The processor priority is usually encoded in a few bits of the Processor Status
word. It can be changed by program instructions that write into the PS.
These instructions are called privileged instructions. They can be executed only
when the processor is in supervisor mode.
The processor is in supervisor mode only when executing OS routines.
It switches to the user mode before beginning to execute application program.

Privileged Exception:

A user program cannot accidentally or intentionally change the priority of the
processor & disrupt the system operation.
An attempt to execute a privileged instruction while in user mode leads to a
special type of interrupt called a privileged exception.

Fig: Implementation of Interrupt Priority using individual Interrupt request


acknowledge lines

Each of the interrupt request line is assigned a different priority level.


Interrupt request received over these lines are sent to a priority arbitration circuit
in the processor.
A request is accepted only if it has a higher priority level than that currently
assigned to the processor,

Simultaneous Requests:
Daisy Chain:

The interrupt request line INTR is common to all devices. The interrupt
acknowledge line INTA is connected in a daisy chain fashion such that INTA
signal propagates serially through the devices.
When several devices raise an interrupt request, the INTR line is activated & the
processor responds by setting the INTA line to 1. This signal is received by device 1.
Device 1 passes the signal on to device 2 only if it does not require any service.
If device 1 has a pending request for an interrupt, it blocks the INTA signal &
proceeds to put its identification code on the data lines.
Therefore, the device that is electrically closest to the processor has the highest
priority.

Merits:
It requires fewer wires than the individual connections.

Arrangement of Priority Groups:

Here the devices are organized in groups & each group is connected at a different
priority level.
Within a group, devices are connected in a daisy chain.

Controlling Device Requests:

KEN Keyboard Interrupt Enable


DEN Display Interrupt Enable
KIRQ / DIRQ Keyboard / Display unit requesting an interrupt.

There are two mechanism for controlling interrupt requests.


At the devices end, an interrupt enable bit in a control register determines whether
the device is allowed to generate an interrupt requests.
At the processor end, either an interrupt enable bit in the PS (Processor Status) or
a priority structure determines whether a given interrupt requests will be accepted.

Initiating the Interrupt Process:

Load the starting address of ISR in location INTVEC (vectored interrupt).


Load the address LINE in a memory location PNTR. The ISR will use this
location as a pointer to store the i/p characters in the memory.
Enable the keyboard interrupts by setting bit 2 in register CONTROL to 1.
Enable interrupts in the processor by setting to 1, the IE bit in the processor status
register PS.

Execution of the ISR:

Read the input characters from the keyboard input data register. This will cause
the interface circuits to remove its interrupt requests.
Store the characters in a memory location pointed to by PNTR & increment
PNTR.
When the end of line is reached, disable keyboard interrupt & inform program
main.
Return from interrupt.
Exceptions:

An interrupt is an event that causes the execution of one program to be suspended


and the execution of another program to begin.
The Exception is used to refer to any event that causes an interruption.

Kinds of exception:

Recovery from errors


Debugging
Privileged Exception

Recovery From Errors:


Computers have error-checking code in Main Memory , which allows detection of
errors in the stored data.
If an error occurs, the control hardware detects it and informs the processor by raising
an interrupt.
The processor also interrupts a program if it detects an error or an unusual
condition while executing an instruction, i.e., it suspends the program being
executed and starts an exception service routine.
This routine takes appropriate action to recover from the error.

Debugging:

System software has a program called debugger, which helps to find errors in a
program.
The debugger uses exceptions to provide two important facilities
They are
Trace
Breakpoint

Trace Mode:

When the processor is in trace mode, an exception occurs after the execution of every
instruction, using the debugging program as the exception service routine.
The debugging program can examine the contents of registers, memory locations, etc.
On return from the debugging program, the next instruction in the program being
debugged is executed.
The trace exception is disabled during the execution of the debugging program.

Break point:

Here the program being debugged is interrupted only at specific points selected by
the user.
An instruction called the Trap (or software interrupt) instruction is usually provided for
this purpose.
While debugging, the user may request that program execution be interrupted after
instruction i. When execution reaches that point, the debugging program is entered and
the user can examine the memory and register contents.

Privileged Exception:

To protect the OS of a computer from being corrupted by user programs, certain
instructions can be executed only when the processor is in supervisor mode. These
are called privileged instructions.
When the processor is in user mode, it will not execute such instructions; an attempt
to do so causes a privileged exception. They are executed only when the processor is
in supervisor mode.
DIRECT MEMORY ACCESS
A special control unit may be provided to allow the transfer of a large block of data
at high speed directly between an external device and main memory, without
continuous intervention by the processor. This approach is called DMA.
DMA transfers are performed by a control circuit called the DMA Controller.
To initiate the transfer of a block of words, the processor sends,

Starting address
Number of words in the block
Direction of transfer.
While a block of data is transferred, the DMA controller increments the memory
address for successive words and keeps track of the number of words transferred; it
also informs the processor by raising an interrupt signal when the transfer is complete.
While the DMA transfer is taking place, the program that requested the transfer cannot
continue, but the processor can be used to execute another program.
After the DMA transfer is completed, the processor returns to the program that requested
the transfer.
Fig:Registes in a DMA Interface

(The DMA interface contains three registers: a 32-bit status & control register holding the
R/W, Done, IE and IRQ flags; a starting address register; and a word count register.)
R/W: determines the direction of transfer.
When R/W = 1, the DMA controller reads data from memory to the I/O device.
When R/W = 0, the DMA controller performs a write operation.
Done flag = 1: the controller has completed transferring a block of data and is
ready to receive another command.
IE = 1: causes the controller to raise an interrupt (Interrupt Enable) after it has
completed transferring the block of data.
IRQ = 1: indicates that the controller has requested an interrupt.
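One way the processor might program such a channel, treating the three registers as a
memory-mapped structure, is sketched below. The register layout, the bit positions of the
flags, and the idea that writing the control word starts the transfer are all assumptions
made only for illustration:

#include <stdint.h>

struct dma_channel {                     /* layout assumed for illustration      */
    volatile uint32_t status_ctrl;       /* holds the R/W, Done, IE, IRQ flags   */
    volatile uint32_t start_addr;        /* first main-memory address            */
    volatile uint32_t word_count;        /* number of words to transfer          */
};

#define DMA_RW    (1u << 0)              /* 1 = read from memory to I/O device   */
#define DMA_DONE  (1u << 1)
#define DMA_IE    (1u << 30)             /* interrupt enable (bit position assumed) */
#define DMA_IRQ   (1u << 31)             /* interrupt requested (bit position assumed) */

void start_transfer(struct dma_channel *ch,
                    uint32_t mem_addr, uint32_t nwords, int read_from_memory) {
    ch->start_addr = mem_addr;           /* starting address                     */
    ch->word_count = nwords;             /* number of words in the block         */
    uint32_t ctrl  = DMA_IE;             /* raise an interrupt when done         */
    if (read_from_memory) ctrl |= DMA_RW;
    ch->status_ctrl = ctrl;              /* writing the control word starts the
                                            transfer (assumed behaviour)         */
}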

Fig: Use of DMA controllers in a computer system

A DMA controller connects a high-speed network to the computer bus. The disk
controller, which controls two disks, also has DMA capability and provides two DMA
channels.
To start a DMA transfer of a block of data from main memory to one of the disks,
the program writes the address and word count information into the registers of the
corresponding channel of the disk controller.
When the DMA transfer is completed, this is recorded in the status and control
register of the DMA channel, i.e., the Done bit = IRQ = IE = 1.

Cycle Stealing:

Requests by DMA devices for using the bus have higher priority than
processor requests.
Top priority is given to high-speed peripherals such as,
Disk
High-speed network interface and graphics display device.

Since the processor originates most memory access cycles, the DMA controller
can be said to steal memory cycles from the processor.
This interleaving technique is called cycle stealing.
Burst Mode:
The DMA controller may be given exclusive access to the main memory to
transfer a block of data without interruption. This is known as Burst/Block Mode

Bus Master:
The device that is allowed to initiate data transfers on the bus at any given time is
called the bus master.

Bus Arbitration:

It is the process by which the next device to become the bus master is selected and
the bus mastership is transferred to it.
Types:
There are 2 approaches to bus arbitration. They are,

Centralized arbitration ( A single bus arbiter performs arbitration)


Distributed arbitration (all devices participate in the selection of next bus
master).

Centralized Arbitration:

Here the processor is the bus master and it may grant bus mastership to one of its
DMA controllers.
A DMA controller indicates that it needs to become the bus master by activating
the Bus Request line (BR), which is an open-drain line.
The signal on BR is the logical OR of the bus requests from all devices connected
to it.
When BR is activated, the processor activates the Bus Grant signal (BG1),
indicating to the DMA controllers that they may use the bus when it becomes free.
This signal is connected to all devices using a daisy chain arrangement.
If a DMA controller has requested the bus, it blocks the propagation of the grant signal
to the devices beyond it and indicates to all devices that it is using the bus by activating
an open-collector line, Bus Busy (BBSY).

Fig:A simple arrangement for bus arbitration using a daisy chain


Fig: Sequence of signals during transfer of bus mastership for the devices

The timing diagram shows the sequence of events for the devices connected to the
processor.
DMA controller 2 requests and acquires bus mastership and later releases the bus.
During its tenure as bus master, it may perform one or more data transfers.
After it releases the bus, the processor resumes bus mastership.

Distributed Arbitration:
It means that all devices waiting to use the bus have equal responsibility in carrying out
the arbitration process.

Fig:A distributed arbitration scheme

Each device on the bus is assigned a 4 bit id.


When one or more devices request the bus, they assert the Start-Arbitration signal
& place their 4 bit ID number on four open collector lines, ARB0 to ARB3.
A winner is selected as a result of the interaction among the signals transmitted
over these lines.
The net outcome is that the code on the four lines represents the request that has
the highest ID number.
The drivers are of the open-collector type. Hence, if the input to one driver is equal to 1,
the bus line is in the low-voltage state, regardless of the inputs to the other drivers
connected to the same line.
Eg:
Assume two devices A & B have IDs 5 (0101) and 6 (0110); the code that appears on the
arbitration lines is the OR of the two patterns, 0111.
Each device compares the pattern on the arbitration lines to its own ID, starting
from the MSB.
If it detects a difference at any bit position, it disables its drivers at that bit
position and at all lower-order positions. It does this by placing 0 at the inputs of these
drivers.
In our example, A detects a difference on line ARB1; hence it disables its drivers on
lines ARB1 & ARB0.
This causes the pattern on the arbitration lines to change to 0110, which means that
B has won the contention.
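A small C simulation of this scheme (a sketch only): each requesting device drives the
open-collector lines with its ID, reads back the wired-OR of all drivers, and stops driving
the bit position where it first sees a difference and every lower bit; the pattern that
survives is the highest requesting ID:

#include <stdio.h>

#define NBITS 4
#define ALL   ((1u << NBITS) - 1)

/* id[] holds the 4-bit IDs of the requesting devices (n <= 16). */
unsigned arbitrate(const unsigned *id, int n) {
    unsigned mask[16];                       /* bits each device still drives    */
    for (int d = 0; d < n; d++) mask[d] = ALL;

    unsigned lines = 0, prev;
    do {
        prev  = lines;
        lines = 0;
        for (int d = 0; d < n; d++)          /* wired-OR of all active drivers   */
            lines |= id[d] & mask[d];
        for (int d = 0; d < n; d++)          /* compare from the MSB downwards   */
            for (int bit = NBITS - 1; bit >= 0; bit--)
                if ((lines & (1u << bit)) && !(id[d] & (1u << bit))) {
                    mask[d] &= ~((1u << (bit + 1)) - 1);  /* drop this bit and below */
                    break;
                }
    } while (lines != prev);
    return lines;                            /* equals the highest requesting ID */
}

int main(void) {
    unsigned id[] = { 5, 6 };                /* A = 0101, B = 0110               */
    printf("winning pattern = %u\n", arbitrate(id, 2));   /* prints 6 (0110)     */
    return 0;
}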
Buses
A bus protocol is the set of rules that govern the behavior of various devices
connected to the bus ie, when to place information in the bus, assert control
signals etc.
The bus lines used for transferring data is grouped into 3 types. They are,
Address line
Data line
Control line.

Control signals: specify whether a read or a write operation is to be performed.


They also carry timing information, i.e., they specify the times at which the
processor & I/O devices may place data on the bus or receive data
from the bus.

During data transfer operation, one device plays the role of a Master.
Master device initiates the data transfer by issuing read / write command on the
bus. Hence it is also called as Initiator.
The device addressed by the master is called as Slave / Target.

Types of Buses:
There are 2 types of buses. They are,
Synchronous Bus
Asynchronous Bus.
Synchronous Bus:-

In a synchronous bus, all devices derive timing information from a common clock
line.
Equally spaced pulses on this line define equal time intervals; each interval constitutes
a bus cycle.
During a bus cycle, one data transfer can take place.
The crossing points in the timing diagram indicate the times at which the signal
patterns change.
A signal line in an indeterminate / high-impedance state is represented by an
intermediate level halfway between the low and high signal levels.

Fig:Timing of an input transfer of a synchronous bus.

At time t0, the master places the device address on the address lines & sends an
appropriate command on the control lines.
In this case, the command will indicate an input operation & specify the length of
the operand to be read.
The clock pulse width t1 - t0 must be longer than the maximum propagation delay
between devices connected to the bus.
It should also be long enough to allow the devices to decode the address
& control signals so that the addressed device can respond at time t1.
The slaves take no action and place no data on the bus before t1.

Fig:A detailed timing diagram for the input transfer

The picture shows two views of each signal except the clock.
One view shows the signal as seen by the master & the other as seen by the slave.
The master sends the address & command signals on the rising edge at the
beginning of the clock period (t0). These signals do not actually appear on the bus
until tAM.
Sometime later, at tAS, the signals reach the slave.
The slave decodes the address & at t1, it sends the requested data.
At t2, the master loads the data into its input buffer.
Hence the period t2 - tDM is the setup time for the master's input buffer.
The data must continue to be valid after t2 for a period equal to the hold time
of that buffer.
Demerits:
If a device does not respond, the master has no way of knowing it;
the error will not be detected.

Multiple Cycle Transfer:-

During clock cycle 1, the master sends address & command information on the bus,
requesting a read operation.
The slave receives this information & decodes it.
At the next active edge of the clock, i.e., the beginning of clock cycle 2, it makes a
decision to respond and begins to access the requested data.
The data become ready & are placed on the bus in clock cycle 3.
At the same time, the slave asserts a control signal called Slave-ready.
The master, which has been waiting for this signal, strobes the data into its input
buffer at the end of clock cycle 3.
The bus transfer operation is now complete & the master may send a new address to
start a new transfer in clock cycle 4.
The Slave-ready signal is an acknowledgement from the slave to the master
confirming that valid data have been sent.

Fig:An input transfer using multiple clock cycles

Asynchronous Bus:-

An alternative scheme for controlling data transfers on the bus is based on the use
of a handshake between the master & the slave. The common clock is replaced by
two timing control lines.
They are
Master-ready
Slave-ready.

Fig:Handshake control of data transfer during an input operation

The handshake protocol proceed as follows :


At t0 The master places the address and command information on the bus and
all devices on the bus begin to decode the information
At t1 The master sets the Master ready line to 1 to inform the I/O devices that
the address and command information is ready.

The delay t1 - t0 is intended to allow for any skew that may occur on the bus.
Skew occurs when two signals transmitted simultaneously from one source
arrive at the destination at different times.
Thus, to guarantee that the Master-ready signal does not arrive at any device
ahead of the address and command information, the delay t1 - t0 should be larger
than the maximum possible bus skew.

At t2: The selected slave, having decoded the address and command information,
performs the required input operation by placing the data from its data
register on the data lines. At the same time, it sets the Slave-ready
signal to 1.
At t3: The Slave-ready signal arrives at the master, indicating that the input data are
available on the bus.
At t4: The master removes the address and command information from the bus.
The delay between t3 and t4 is again intended to allow for bus skew.
Erroneous addressing may take place if the address, as seen by some
device on the bus, starts to change while the Master-ready signal is still
equal to 1.
At t5: When the device interface receives the 1-to-0 transition of the Master-
ready signal, it removes the data and the Slave-ready signal from the bus.
This completes the input transfer.
In the output transfer, the master places the output data on the data lines at the same
time that it transmits the address and command information.
The selected slave strobes the data into its output buffer when it receives the Master-
ready signal and indicates this by setting the Slave-ready signal to 1.
The delays from t0 to t1 and from t3 to t4 compensate for bus skew.
A change of state in one signal is followed by a change in the other signal. Hence
this scheme is called a Full Handshake.
It provides a higher degree of flexibility and reliability.

INTERFACE CIRCUITS:
The interface circuits are of two types.They are

Parallel Port
Serial Port

Parallel Port:

The output of the encoder consists of the bits that represent the encoded character
and one signal called Valid, which indicates that a key has been pressed.
This information is sent to the interface circuit, which contains a data
register, DATAIN, and a status flag, SIN.
When a key is pressed, the Valid signal changes from 0 to 1, causing the ASCII
code to be loaded into DATAIN and SIN to be set to 1.
The status flag SIN is set to 0 when the processor reads the contents of the DATAIN
register.
The interface circuit is connected to an asynchronous bus on which transfers are
controlled using the handshake signals Master-ready and Slave-ready.
Serial Port:

A serial port is used to connect the processor to an I/O device that requires data to be
transmitted one bit at a time.
It is capable of communicating in a bit-serial fashion on the device side and in a bit-
parallel fashion on the bus side.

STANDARD I/O INTERFACE


A standard I/O Interface is required to fit the I/O device with an Interface circuit.
The processor bus is the bus defined by the signals on the processor chip itself.
The devices that require a very high speed connection to the processor such as the
main memory, may be connected directly to this bus.
The bridge connects two buses, which translates the signals and protocols of one
bus into another.
The bridge circuit introduces a small delay in data transfer between processor and
the devices.

Fig:Example of a Computer System using different Interface Standards

(The figure shows the processor and main memory on the processor bus, linked by a bridge
to a PCI bus. Additional memory, a SCSI controller, an Ethernet interface, a USB
controller, an ISA interface and video hang off the PCI bus. The SCSI bus connects disk
and CD-ROM controllers (DISK 1, DISK 2, CD-ROM), an IDE disk is on the ISA interface,
and the keyboard and game devices connect through USB.)

We have 3 Bus standards.They are,

PCI (Peripheral Component Inter Connect)


SCSI (Small Computer System Interface)
USB (Universal Serial Bus)

PCI defines an expansion bus on the motherboard.


SCSI and USB are used for connecting additional devices both inside and outside
the computer box.
SCSI bus is a high speed parallel bus intended for devices such as disk and video
display.
USB uses serial transmission to suit the needs of equipment ranging from the
keyboard to game controls to Internet connections.
The IDE (Integrated Drive Electronics) disk is connected through the ISA interface;
the figure also shows a connection to an Ethernet.
PCI:
PCI is developed as a low cost bus that is truly processor independent.
It supports high speed disk, graphics and video devices.
PCI has plug and play capability for connecting I/O devices.
To connect new devices, the user simply connects the device interface board to
the bus.
Data Transfer:

Data are transferred between the cache and the main memory in bursts of several
words, stored in successive memory locations.
When the processor specifies an address and request a read operation from
memory, the memory responds by sending a sequence of data words starting at
that address.
During write operation, the processor sends the address followed by sequence of
data words to be written in successive memory locations.
PCI supports read and write operation.
A read / write operation involving a single word is treated as a burst of length one.
PCI has three address spaces. They are

Memory address space


I/O address space
Configuration address space

I/O address space: intended for use with processors that have a separate I/O
address space.
Configuration address space: intended to give PCI its plug-and-play
capability.
PCI Bridge provides a separate physical connection to main memory.
The master maintains the address information on the bus until data transfer is
completed.
At any time, only one device acts as bus master.
A master is called initiator in PCI which is either processor or DMA.
The addressed device that responds to read and write commands is called a
target.
A complete transfer operation on the bus, involving an address and bust of data is
called a transaction.

Fig:Use of a PCI bus in a Computer system

(The host processor and main memory connect through a PCI bridge to the PCI bus,
which carries a disk, a printer, and an Ethernet interface.)


Data Transfer Signals on PCI Bus:

Name          Function

CLK           33 MHz / 66 MHz clock
FRAME#        Sent by the initiator to indicate the duration of a transaction
AD            32 address / data lines
C/BE#         4 command / byte-enable lines
IRDY#, TRDY#  Initiator Ready and Target Ready signals
DEVSEL#       A response from the device indicating that it has
              recognized its address and is ready for a data transfer
              transaction
IDSEL#        Initialization Device Select

Individual word transfers are called phases.

Fig :Read operation an PCI Bus

In clock cycle 1, the processor asserts FRAME# to indicate the beginning of a
transaction; it sends the address on the AD lines and a command on the C/BE# lines.
Clock cycle 2 is used to turn the AD bus lines around; the processor
removes the address and disconnects its drivers from the AD lines.
The selected target enables its drivers on the AD lines and fetches the requested data to
be placed on the bus.
It asserts DEVSEL# and maintains it in the asserted state until the end of the
transaction.
C/BE# is used to send a bus command in clock cycle 1 and it is used for a different
purpose during the rest of the transaction.
During clock cycle 3, the initiator asserts IRDY#, to indicate that it is ready to
receive data.
If the target has data ready to send, it asserts TRDY#. In our example, the target
sends 3 more words of data in clock cycles 4 to 6.
The initiator uses FRAME# to indicate the duration of the burst; since it reads 4
words, the initiator negates FRAME# during clock cycle 5.
After sending the 4th word, the target disconnects its drivers and negates DEVSEL#
during clock cycle 7.

Fig: A read operation showing the role of IRDY# / TRY#

This example shows a pause in the middle of a transaction.
The first and second words are transferred, and the target sends the 3rd word in cycle 5.
But the initiator is not able to receive it; hence it negates IRDY#.
In response, the target maintains the 3rd data word on the AD lines until IRDY# is
asserted again.
In cycle 6, the initiator asserts IRDY#. But the target is not ready to transfer the
fourth word immediately, hence it negates TRDY# in cycle 7. It then sends the
4th word and asserts TRDY# in cycle 8.
Device Configuration:

Each PCI device has a configuration ROM that stores information about that
device.
The configuration ROMs of all devices are accessible in the configuration
address space.
The initialization software reads these ROMs whenever the system is powered up or reset.
In each case, it determines whether the device is a printer, keyboard, Ethernet
interface or disk controller.
Devices are assigned addresses during the initialization process; each device has an
input signal called IDSEL# (Initialization Device Select), which is connected to one of
the 21 upper address lines AD11 to AD31.
During a configuration operation, the AD line corresponding to the device being
configured is set to 1 and all other upper lines are set to 0.
AD11 - AD31: upper address lines.
AD00 - AD10: lower address lines; they specify the type of operation and are used to
access the contents of the device configuration ROM.
The configuration software scans all 21 locations.
The PCI bus has interrupt request lines.
Each device may request an address in the I/O space or the memory space.
Electrical Characteristics:

The connectors can be plugged only into compatible motherboards. The PCI bus can
operate with either a 5 V or a 3.3 V power supply.
The motherboard may use either signaling system.

SCSI Bus:- (Small Computer System Interface)


SCSI refers to the standard bus which is defined by ANSI (American National
Standard Institute).
The SCSI bus has several options. It may be,

Narrow bus: has 8 data lines & transfers 1 byte at a time.


Wide bus: has 16 data lines & transfers 2 bytes at a time.
Single-Ended (SE) transmission: each signal uses a separate wire.
HVD (High Voltage Differential): uses 5 V (TTL levels).
LVD (Low Voltage Differential): uses 3.3 V.

Because of these various options, a SCSI connector may have 50, 68 or 80 pins.
The data transfer rate ranges from 5 MB/s to 160 MB/s, 320 MB/s and 640 MB/s.
The transfer rate depends on,
Length of the cable
Number of devices connected.
To achieve a high transfer rate, the bus length should be at most 1.6 m for SE signaling
and 12 m for LVD signaling.
The SCSI bus is connected to the processor bus through the SCSI controller.
The data are stored on a disk in blocks called sectors.
Each sector contains several hundreds of bytes. These data will not be stored in
contiguous memory location.
SCSI protocol is designed to retrieve the data in the first sector or any other
selected sectors.
Using SCSI protocol, the burst of data are transferred at high speed.
The controller connected to SCSI bus is of 2 types. They are,
Initiator
Target
Initiator:
It has the ability to select a particular target & to send commands specifying the
operation to be performed.
They are the controllers on the processor side.
Target:
The disk controller operates as a target.
It carries out the commands it receive from the initiator. The initiator establishes a
logical connection with the intended target.
Steps:

Consider the disk read operation, it has the following sequence of events.

The SCSI controller, acting as the initiator, contends for control of the bus; after
winning arbitration, it selects the target controller & hands over control of the bus to it.
The target starts an output operation; in response to this, the initiator sends a
command specifying the required read operation.
The target, realizing that it first needs to perform a disk seek operation, sends a message
to the initiator indicating that it will temporarily suspend the connection between them.
Then it releases the bus.
The target controller sends a command to the disk drive to move the read head to the
first sector involved in the requested read, and reads the data into a data buffer. When it
is ready to begin transferring data to the initiator, the target requests control of the bus.
After it wins arbitration, it reselects the initiator controller, thus restoring the suspended
connection.
The target transfers the contents of the data buffer to the initiator & then suspends
the connection again. Data are transferred either 8 or 16 bits in parallel,
depending on the width of the bus.
The target controller sends a command to the disk drive to perform another seek
operation. Then it transfers the contents of the second disk sector to the initiator. At
the end of this transfer, the logical connection between the two controllers is
terminated.
As the initiator controller receives the data, it stores them into main memory
using the DMA approach.
The SCSI controller sends an interrupt to the processor to inform it that the
requested operation has been completed.

Bus Signals:-
The bus has no address lines.
Instead, it has data lines to identify the bus controllers involved in the selection /
reselection / arbitration process.
For a narrow bus, there are 8 possible controllers, numbered 0 to 7.
For a wide bus, there are 16 controllers.
Once a connection is established between two controllers, there is no further need for
addressing & the data lines are used to carry the data.
SCSI bus signals:

Category Name Function


Data - DB (0) to DB (7) Datalines
- DB(P) Parity bit for data bus.
Phases - BSY Busy
- SEL Selection
Information type - C/D Control / Data
- MSG Message
Handshake - REQ Request
- ACK Acknowledge
Direction of transfer I/O Input / Output
Other - ATN Attention
- RST Reset.

All signal names are preceded by a minus sign.


This indicates that the signals are active, or that a data line is equal to 1, when
they are in the low-voltage state.

Phases in SCSI Bus:-


The phases in SCSI bus operation are,
Arbitration
Selection
Information transfer
Reselection
Arbitration:-
When the BSY signal is in the inactive state, the bus will be free & any controller
can request the use of the bus.
Since several controllers may generate requests at the same time, SCSI uses a
distributed arbitration scheme.
Each controller on the bus is assigned a fixed priority, with controller 7 having the
highest priority.
When BSY becomes active, all controllers that are requesting the bus examine
the data lines & determine whether a higher-priority device is requesting the
bus at the same time.
The controller using the highest-numbered line realizes that it has won the
arbitration process.
At that time, all other controllers disconnect from the bus & wait for BSY to
become inactive again.
Fig:Arbitration and selection on the SCSI bus.Device 6 wins arbitration and select
device 2

Selection:

Here device 6 wins arbitration and it asserts the BSY and DB6 signals.
The selected target controller responds by asserting BSY.
This informs the initiator that the connection it requested has been established.

Reselection:

The connection between the two controllers is re-established, with the target
in control of the bus as required for the data transfer to proceed.

USB Universal Serial Bus

USB supports three speeds of operation. They are,


Low speed (1.5 Mb/s)
Full speed (12 Mb/s)
High speed (480 Mb/s)
The USB has been designed to meet the following key objectives:

It provides a simple, low-cost & easy-to-use interconnection system that overcomes
the difficulties due to the limited number of I/O ports available on a computer.
It accommodates a wide range of data transfer characteristics for I/O devices,
including telephone & Internet connections.
It enhances user convenience through the plug & play mode of operation.
Port Limitation:-
Normally the system has only a limited number of ports.
To add new ports, the user must open the computer box to gain access to the
internal expansion bus & install a new interface card.
The user may also need to know how to configure the device & the software.

Merits of USB:-

USB helps to add many devices to a computer system at any time without opening the
computer box.

Device Characteristics:-
The kinds of devices that may be connected to a computer cover a wide range of
functionality.
The speed, volume & timing constraints associated with data transfer to & from
these devices vary significantly.

Eg 1: Keyboard. Since the event of pressing a key is not synchronized to any other
event in a computer system, the data generated by the keyboard are called asynchronous.
The rate of data generated by the keyboard depends on the speed of the human operator,
which is about 100 bytes/sec.

Eg 2: Microphone attached to a computer system, internally or externally.

The sound picked up by the microphone produces an analog electrical signal, which
must be converted into digital form before it can be handled by the computer.
This is accomplished by sampling the analog signal periodically.
The sampling process yields a continuous stream of digitized samples that arrive
at regular intervals, synchronized with the sampling clock. Such a stream is called
isochronous, i.e., successive events are separated by equal periods of time.
If the sampling rate is s samples/sec, then the maximum frequency captured by the
sampling process is s/2.
A standard rate for digital sound is 44.1 kHz.
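A quick arithmetic check of the statements above (the 2-bytes-per-sample figure in the last line is an assumption, used only to give a sense of the data rate):

rate = 44_100                 # samples per second (standard rate for digital sound)
print(rate / 2)               # 22050.0 -> highest frequency captured is s/2, about 22 kHz
print(rate * 2 / 1000)        # 88.2    -> KB/s for assumed 16-bit (2-byte) mono samples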
Requirements for sampled Voice:-
It is important to maintain precise timing (delay) in the sampling & replay process.
A high degree of jitter (variability in sample timing) is unacceptable.

Eg 3: Data transfer for image & video:-

The transfer of images & video requires a higher bandwidth.


The bandwidth is the total data transfer capacity of a communication channel.
To maintain high picture quality, an image should be represented by about
160 kilobytes & transmitted 30 times per second, for a total bandwidth of about 44 Mb/s.
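Rough arithmetic behind the quoted figure (only the conversion from bytes to bits is added here; the quoted total presumably includes transmission overhead):

image_bits = 160 * 1024 * 8                    # ~160 kilobytes per image, in bits
frames_per_second = 30
print(image_bits * frames_per_second / 1e6)    # ~39.3 Mb/s raw; with overhead this is in the ~44 Mb/s range quoted above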
Plug & Play:-

The main objective of USB is to provide a plug & play capability.
The plug & play feature allows a new device to be connected at any time, while
the system is in operation.
The system should, as sketched below,
Detect the existence of the new device automatically.
Identify the appropriate device driver software.
Establish the appropriate addresses.
Establish the logical connection for communication.
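A minimal sketch of these four steps (all function and field names below are invented for illustration; the notes do not spell out the USB enumeration details):

def on_device_attached(port, descriptor, next_free_address):
    # The dictionary layout here is purely illustrative.
    device = {"port": port, "kind": descriptor}             # 1. detect the new device automatically
    driver = "driver_for_" + descriptor                     # 2. identify the appropriate driver software
    device["address"] = next_free_address                   # 3. establish an appropriate address
    return {"device": device, "driver": driver, "connected": True}   # 4. logical connection established

print(on_device_attached(port=3, descriptor="keyboard", next_free_address=5))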

USB Architecture:-

USB has a serial bus format which satisfies the low-cost & flexibility
requirements.
Clock & data information are encoded together & transmitted as a single signal.
There are no limitations on clock frequency or distance arising from data skew, &
hence it is possible to provide a high data transfer bandwidth by using a high
clock frequency.
To accommodate a large number of devices that can be added / removed at any time,
the USB has a tree structure.

Fig:USB Tree Structure

Each node of the tree has a device called a hub, which acts as an intermediate
control point between the host & the I/O devices.
At the root of the tree, the root hub connects the entire tree to the host computer.
The leaves of the tree are the I/O devices being served.
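A small illustrative sketch of this tree structure (the device names and the Python representation are assumptions, not from the notes), with hubs as internal nodes and I/O devices as leaves:

class Node:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

root_hub = Node("root hub", [
    Node("hub A", [Node("keyboard"), Node("mouse")]),
    Node("hub B", [Node("microphone"), Node("camera")]),
])

def leaves(node):
    # The leaves of the tree are the I/O devices being served.
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in leaves(child)]

print(leaves(root_hub))   # ['keyboard', 'mouse', 'microphone', 'camera']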
UNIT V

PART A

1) What is I/O bus structure?


2) What is shared data and control lines memory mapped I/O?
3) What is asynchronous data transfer?
4) What is meant by strobe?
5) Write about handshaking.
6) What is timing diagram?
7) Define Handshaking.
8) What are the functions of I/O module?
9) What are the I/O techniques?
10) What is programmed I/O?
11) What is interrupt driven I/O?
12) What is DMA?
13) What is meant by DMA break points and interrupt break points?
14) Define Vector Interrupt.
15) Based on what factors will the vector execution time be calculated?
16) What is Interrupt Latency?

PART B

1) Write short notes on programmed I/O and memory mapped I/O.
2) What is DMA? Describe how DMA is used to transfer data?
3) Explain briefly about PCI and USB Bus.
4) Discuss the structure of a distributed shared memory system with various
interconnection schemes and the message passing mechanism.
5) Explain the bus interconnection structure and characteristics of SCSI bus
Standard.
D 276
B.E./B.Tech. DEGREE EXAMINATION, APRIL/MAY 2003.

Fourth Semester

Computer Science and Engineering

CS 238 COMPUTER ARCHITECTURE I

Time : Three hours Maximum : 100 marks


Answer ALL questions.

PART A (10 × 2 = 20 marks)

1. Define Absolute Addressing.

2. What are the different addressing modes?

3. Define underflow.

4. What is meant by bit sliced processor?

5. Define microinstruction.

6. What is Pipelining?

7. How is Cache memory used in reducing the execution time?

8. Define memory interleaving.

9. What is DMA?

10. Define dumb terminal.

PART B (5 × 16 = 80 marks)

11. Describe Von Neumann Architecture in detail. (16)

12. (a) Explain the bit-slice processor and the internal structure of its ALU.

Or
(b) How is floating-point addition implemented? Explain briefly with a neat
diagram.

13. (a) Describe the Microprogrammed control unit based on Wilkes original
design.

Or

(b) Explain the different Hard wired controllers.

14. (a) How are Cache blocks mapped to the main memory module using the direct
mapping method and the Associative Mapping Method?

Or

(b) What is Virtual Memory? Explain the virtual memory address translation.

15. (a) Write short notes on programmed I/O and memory mapped I/O.

Or

(b) What is DMA? Describe how DMA is used to transfer Data from
peripherals.
