
Computer Architecture

Ucb
Q1. Define Parallelism. Discuss various types of parallel processing mechanisms. List out different
parallel processing computers.

Parallelism is said to be achieved when two or more unrelated and independent pieces of code or
modules run simultaneously in a computer system, whether uniprocessor or multiprocessor. In other
words, parallelism is the state of execution in which different pieces of code run in parallel with each
other, or a single piece of code runs in parallel on different sets of data.

Parallel computing is a form of computation in which many calculations are carried out
simultaneously, operating on the principle that large problems can often be divided into smaller ones,
which are then solved concurrently ("in parallel"). There are several different forms of parallel computing:
bit-level, instruction-level, data-level, and task-level parallelism. Parallelism has been employed for
many years, mainly in high-performance computing, but interest in it has grown lately due to the
physical constraints preventing frequency scaling. As power consumption by computers has become a
concern in recent years, parallel computing has become the dominant paradigm in computer
architecture, mainly in the form of multicore processors.

Parallel processing is another method of improving the performance of a system. A uniprocessor system
can achieve parallelism, both within the CPU and within the computer system as a whole, but usually a
multiprocessor system is employed to achieve a great deal of parallelism and thus performance.
However, the design of a multiprocessor system is not a straightforward extension of uniprocessor
design. From data conflicts and coordinating memory accesses by separate processors to data
communication, a number of other issues have to be dealt with in a multiprocessor design. Ideally,
parallel processing makes a program faster because there are more CPUs running it. In practice, it is
difficult to divide a program in such a way that separate CPUs can execute different portions without
interfering with each other. With single-CPU computers, it is possible to perform parallel processing by
connecting the computers in a network; this type of parallel processing requires distributed processing
software. Parallel processing is also called parallel computing. Parallel processing is different from
multitasking, in which a single CPU executes several programs at once.

Parallel computers can be roughly classified according to the level at which the hardware supports
parallelism: multi-core and multi-processor computers have multiple processing elements within a
single machine, while clusters, MPPs, and grids use multiple computers to work on the same task.
Specialized parallel computer architectures are sometimes used alongside traditional processors for
accelerating specific tasks.

Parallel processing can be viewed from various levels of complexity. At the lowest level, parallel and
serial operations are distinguished by the type of registers used: a shift register operates in serial
fashion, one bit at a time, while a register with parallel load operates on all the bits of the word
simultaneously. At a higher level, parallel processing is established by distributing the data among
multiple functional units.

One possible organization separates the execution unit into eight functional units operating in parallel.
The operands in the registers are applied to one of the units depending on the operation specified by
the instruction associated with the operands. The operation performed in each functional unit is
indicated in each block of the diagram. The adder and integer multiplier perform arithmetic operations
on integer numbers. The floating-point operations are separated into three circuits operating in parallel.
The logic, shift, and increment operations can be performed concurrently on different data. All units are
independent of each other, so one number can be shifted while another number is being incremented.
A multifunctional organization is usually associated with a complex control unit to coordinate all the
activities among the various components.

Fig.: Processor with multiple functional units — adder-subtractor, integer multiplier, logic unit, shift
unit, incrementer, and floating-point add-subtract, multiply, and divide units, all fed from the processor
registers and a memory or I/O port.
PARALLEL PROCESSING MODELS

Parallel processing models exist as an abstraction above the hardware and memory architecture. There
are several programming models in common use. Some of them are:

1. Shared memory model
2. Thread model
3. Message passing model
4. Data parallel model

Shared memory model: -

• In the shared-memory programming model, tasks share a common address space, which they read
and write asynchronously.
• Various mechanisms such as locks/semaphores may be used to control access to the shared memory.
• An advantage of this model from the programmer's point of view is that the notion of data
"ownership" is lacking, so there is no need to specify explicitly the communication of data
between tasks. Program development can often be simplified.
• An important disadvantage in terms of performance is that it becomes more difficult to understand
and manage data locality.

Implementation: -

a) On shared memory platforms, the native compilers translate user program variables into actual
memory addresses, which are global.
b) No common distributed-memory platform implementation currently exists.

Thread model: -

In the thread model of parallel processing, a single process can have multiple concurrent execution paths.
Threads are commonly associated with shared memory architectures and operating systems.
Perhaps the simplest analogy that can be used to describe threads is the concept of a single
program that includes a number of subroutines.

Example –
a. The main program a.out is scheduled to run by the native operating system; a.out loads and
acquires all of the necessary system and user resources to run.
b. a.out performs some serial work, and then creates a number of tasks (threads) that can be
scheduled and run by the operating system concurrently.
c. Each thread has local data but also shares the entire resources of a.out. This saves the overhead
associated with replicating program resources for each thread. Each thread also benefits from a
global memory view because it shares the memory space of a.out.
d. A thread's work may best be described as a subroutine within the main program. Any thread can
execute any subroutine at the same time as other threads.
e. Threads communicate with each other through global memory (updating address locations). This
requires synchronization constructs to ensure that more than one thread is not updating the same
global address at any time.
f. Threads can come and go, but a.out remains present to provide the necessary shared resources
until the application has completed.
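A minimal POSIX-threads sketch of the a.out model described above; the array size, thread count and
the work done by each thread are illustrative assumptions, not taken from the text:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4            /* assumed number of worker threads    */
#define N        1000         /* assumed size of the shared array    */

static double data[N];        /* global memory shared by all threads */

/* Each thread works on its own partition of the shared array. */
static void *worker(void *arg)
{
    long id    = (long)arg;            /* thread-local data            */
    long chunk = N / NTHREADS;
    long lo    = id * chunk;
    long hi    = (id == NTHREADS - 1) ? N : lo + chunk;

    for (long i = lo; i < hi; i++)     /* no two threads touch the     */
        data[i] = data[i] * 2.0;       /* same element, so no locking  */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (long i = 0; i < N; i++)       /* serial work done by a.out    */
        data[i] = (double)i;

    for (long t = 0; t < NTHREADS; t++)                /* create threads */
        pthread_create(&tid[t], NULL, worker, (void *)t);

    for (long t = 0; t < NTHREADS; t++)                /* wait for them  */
        pthread_join(tid[t], NULL);

    printf("data[N-1] = %f\n", data[N - 1]);
    return 0;
}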

Implementation: -

From a programming perspective, threads implementations commonly comprise:

a) A library of subroutines that are called from within parallel source code.
b) A set of compiler directives embedded in either serial or parallel source code.
 
Message passing model: -

• A set of tasks use their own local memory during computation; multiple tasks can reside on the
same physical machine as well as across an arbitrary number of machines.
• Tasks exchange data through communications by sending and receiving messages.
• Data transfer usually requires cooperative operations to be performed by each process.

Implementation: -

a. From a programming perspective, message-passing implementations commonly comprise a library
of subroutines that are embedded in source code. The programmer is responsible for determining
all parallelism.
b. For shared memory architectures, Message Passing Interface (MPI) implementations usually don't
use a network for task communications. Instead, they use shared memory (memory copies) for
performance reasons.
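A minimal MPI sketch of the message-passing model, using the standard C bindings; the message
contents and tag are illustrative assumptions:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's id        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of tasks */

    if (rank == 0) {
        /* Task 0 sends a value to every other task (cooperative send). */
        for (int dst = 1; dst < size; dst++) {
            value = 100 + dst;
            MPI_Send(&value, 1, MPI_INT, dst, 0, MPI_COMM_WORLD);
        }
    } else {
        /* Every other task performs the matching receive. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("task %d received %d from task 0\n", rank, value);
    }

    MPI_Finalize();
    return 0;
}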

Data parallel model: -

The data parallel model demonstrates the following characteristics:

• Most of the parallel work focuses on performing operations on a data set. The data set is typically
organized into a common structure, such as an array or cube.
• A set of tasks work collectively on the same data structure; however, each task works on a different
partition of that data structure.
• Tasks perform the same operation on their partition of work, for example "add 4 to every array
element".

• On shared memory architectures, all tasks may have access to the data structure through global
memory. On distributed memory architectures, the data structure is split up and resides as "chunks"
in the local memory of each task.
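A minimal sketch of the data parallel model: one array is partitioned among several tasks and every
task applies the same operation ("add 4 to every array element") to its own chunk. The array size and
task count are illustrative assumptions, and the chunks are independent, so in a real implementation
each one could be handled by a separate task or processor:

#include <stdio.h>

#define N      16   /* assumed size of the shared data set   */
#define NTASKS 4    /* assumed number of data-parallel tasks */

int main(void)
{
    int a[N];
    for (int i = 0; i < N; i++)
        a[i] = i;

    /* Each task owns one contiguous partition of the array and
       performs the same operation on its partition.            */
    for (int task = 0; task < NTASKS; task++) {
        int lo = task * (N / NTASKS);
        int hi = (task == NTASKS - 1) ? N : lo + N / NTASKS;
        for (int i = lo; i < hi; i++)
            a[i] += 4;              /* "add 4 to every array element" */
    }

    for (int i = 0; i < N; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}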

Implementations: -

a) Programming with the data parallel model is usually accomplished by writing a program with data
parallel constructs. The constructs can be calls to a data parallel subroutine library or compiler
directives recognized by a data parallel compiler.
b) Distributed memory implementations of this model usually have the compiler convert the program
into standard code with calls to a message-passing library (usually MPI) to distribute the data to
all the processes. All message passing is done invisibly to the programmer.

 
TYPES OF PARALLELISM

There are the following types of parallelisms:-

1. Bit-level parallelism
2. Instruction-level parallelism
3. Data parallelism
4. Task parallelism

Bit-level parallelism

From the advent of very-large-scale integration (VLSI) computer-chip fabrication technology in the 1970s
until about 1986, speed-up in computer architecture was driven by doubling computer word size—the
amount of information the processor can execute per cycle. Increasing the word size reduces the number
of instructions the processor must execute to perform an operation on variables whose sizes are greater
than the length of the word. For example, where an 8-bit processor must add two 16-bit integers, the
processor must first add the 8 lower-order bits from each integer using the standard addition
instruction, then add the 8 higher-order bits using an add-with-carry instruction and the carry bit from
the lower order addition; thus, an 8-bit processor requires two instructions to complete a single
operation, where a 16-bit processor would be able to complete the operation with a single instruction.
Historically, 4-bit microprocessors were replaced with 8-bit, then 16-bit, then 32-bit microprocessors.
This trend generally came to an end with the introduction of 32-bit processors, which has been a
standard in general-purpose computing for two decades. Not until recently (c. 2003–2004), with the
advent of x86-64 architectures, have 64-bit processors become commonplace.
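A small illustration of the 8-bit example above, modelling the hypothetical 8-bit registers as uint8_t
values: the 16-bit addition is carried out as a low-byte add followed by a high-byte add-with-carry, i.e.
two instructions where a 16-bit processor needs one. The operand values are illustrative assumptions:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t x = 0x1234, y = 0x0FCD;          /* two 16-bit integers       */

    /* What an 8-bit processor must do: two 8-bit operations.              */
    uint8_t xl = x & 0xFF, xh = x >> 8;
    uint8_t yl = y & 0xFF, yh = y >> 8;

    uint8_t  low   = xl + yl;                 /* standard ADD instruction   */
    uint8_t  carry = (uint8_t)(xl + yl) < xl; /* carry out of the low byte  */
    uint8_t  high  = xh + yh + carry;         /* ADD-with-carry instruction */

    uint16_t sum8  = ((uint16_t)high << 8) | low;

    /* What a 16-bit processor does: a single ADD instruction.             */
    uint16_t sum16 = x + y;

    printf("8-bit path: 0x%04X, 16-bit path: 0x%04X\n", sum8, sum16);
    return 0;
}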

Instruction-level parallelism

Fig.: A canonical five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction Decode,
EX = Execute, MEM = Memory access, WB = Register write back).

A computer program is, in essence, a stream of instructions executed by a processor. These instructions
can be re-ordered and combined into groups which are then executed in parallel without changing the
result of the program. This is known as instruction-level parallelism. Advances in instruction-level
parallelism dominated computer architecture from the mid-1980s until the mid-1990s.

Modern processors have multi-stage instruction pipelines. Each stage in the pipeline corresponds to a
different action the processor performs on that instruction in that stage; a processor with an N-stage
pipeline can have up to N different instructions at different stages of completion. The canonical example
of a pipelined processor is a RISC processor, with five stages: instruction fetch, decode, execute, memory
access, and write back. The Pentium 4 processor had a 35-stage pipeline.

Fig.: A five-stage pipelined superscalar processor, capable of issuing two instructions per cycle. It can
have two instructions in each stage of the pipeline, for a total of up to 10 instructions being
simultaneously executed.

In addition to instruction-level parallelism from pipelining, some processors can issue more than one
instruction at a time. These are known as superscalar processors. Instructions can be grouped together
only if there is no data dependency between them. Scoreboarding and the Tomasulo algorithm (which is
similar to scoreboarding but makes use of register renaming) are two of the most common techniques
for implementing out-of-order execution and instruction-level parallelism.

 
Data parallelism

Data parallelism is parallelism inherent in program loops, which focuses on distributing the data across
different computing nodes to be processed in parallel. "Parallelizing loops often leads to similar (not
necessarily identical) operation sequences or functions being performed on elements of a large data
structure." Many scientific and engineering applications exhibit data parallelism.

A loop-carried dependency is the dependence of loop iteration on the output of one or more previous
iterations. Loop-carried dependencies prevent the parallelization of loops. For example, consider the
following pseudo code that computes the first few Fibonacci numbers:

1: PREV2 := 0
2: PREV1 := 1
3: CUR := 1
4: do:
5: CUR := PREV1 + PREV2
6: PREV2 := PREV1
7: PREV1 := CUR
8: while (CUR < 10)
 
This loop cannot be parallelized because CUR depends on values (PREV1 and PREV2) that are computed
in previous iterations. Since each iteration depends on the result of the previous one, the iterations
cannot be performed in parallel. As the size of a problem gets bigger, the amount of data parallelism
available usually does as well.
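For contrast, a loop with no loop-carried dependency, like the sketch below, can have its iterations
distributed across computing nodes, since every iteration reads and writes only its own element. The
array contents and the operation applied are illustrative assumptions:

#include <stdio.h>

#define N 8

int main(void)
{
    double a[N], b[N];
    for (int i = 0; i < N; i++)
        b[i] = (double)i;

    /* Iteration i touches only a[i] and b[i]: no value produced in one
       iteration is consumed by another, so the iterations may run in
       parallel on different nodes or processors.                       */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + 1.0;

    for (int i = 0; i < N; i++)
        printf("%.1f ", a[i]);
    printf("\n");
    return 0;
}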

Task parallelism

Task parallelism is the characteristic of a parallel program that "entirely different calculations can be
performed on either the same or different sets of data". This contrasts with data parallelism, where the
same calculation is performed on the same or different sets of data. Task parallelism does not usually
scale with the size of a problem.

List of parallel computers: Cray-1, Cray-2, Blue Gene/L, ILLIAC IV (SIMD architecture), RIKEN MDGRAPE-3

Q2. What do you mean by Flynn’s Classification for parallel processing?

MICHAEL J. FLYNN’S CLASSIFICATION

There are different ways to classify parallel computers. One of the more widely used classifications, in use
since 1966, is called Flynn's Taxonomy.

• Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be
classified along the two independent dimensions of Instruction and Data. Each of these dimensions
can have only one of two possible states: Single or Multiple.
• The classification is based on the multiplicity of the instruction stream and the data stream in a
computer system. The sequence of instructions read from memory constitutes the instruction stream,
and the data they operate on in the processors constitute the data stream.
• The table below defines the 4 possible classifications according to Flynn.

SISD SIMD
Single Instruction, Single Data Single Instruction, Multiple Data

MISD MIMD
Multiple Instruction, Single Data Multiple Instruction, Multiple Data

 
Single Instruction, Single Data ( SISD )

SISD stands for Single Instruction stream over a Single Data stream.
It represents the organization of a single computer containing a control unit, a processor unit and a
memory unit. Instructions are executed sequentially and the system may or may not have internal
parallel processing capability. Parallel processing in this case may be achieved by means of multiple
functional units or by pipeline processing.

Fig.: SISD architecture — CU —IS→ PU —DS→ MU, with I/O (IS = instruction stream, DS = data stream,
CU = control unit, PU = processing unit, MU = memory unit).

• A serial (non-parallel) computer


• Single instruction: only one instruction stream is being acted on by the
CPU during any one clock cycle
• Single data: only one data stream is being used as input during any one
clock cycle
• Deterministic execution
• This is the oldest and until recently, the most prevalent form of
computer
Examples: most PCs, single CPU workstations and mainframes.

Single Instruction, Multiple Data ( SIMD )

SIMD stands for Single Instruction stream over Multiple Data streams.
SIMD represents an organization that includes many processing units under the supervision of a
common control unit. All processors receive the same instruction from the control unit but operate on
different items of data. The most common example is the execution of a for loop in which the same set
of instructions is executed over different sets of data.

Fig.: SIMD architecture (with distributed memory) — the control unit (CU) broadcasts the instruction
stream (IS) to processing elements PE1 … PEn, each operating through its own data stream (DS) on a
local memory LM1 … LMn; the program is loaded from the host and the data sets are distributed to the
local memories.

 
• A type of parallel computer
• Single instruction: All processing units
execute the same instruction at any given
clock cycle
• Multiple data: Each processing unit can
operate on a different data element
• This type of machine typically has an
instruction dispatcher, a very high-
bandwidth internal network, and a very
large array of very small-capacity
instruction units.
• Best suited for specialized problems
characterized by a high degree of
regularity, such as image processing.
• Synchronous (lockstep) and deterministic
execution.
• Two varieties: Processor Arrays and Vector Pipelines

Examples:
Processor Arrays: Connection Machine CM-2, Maspar MP-1, MP-2
Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820

Multiple Instruction, Single Data ( MISD )

MISD stands for Multiple Instruction streams over a Single Data stream.

An MISD organization consists of N processor units, each working on a different set of instructions but
operating on the same set of data. The output of one unit becomes the input to the next unit. SIMD and
MISD are more suited to special-purpose computation.

Fig.: MISD architecture — a shared memory (program and data) supplies instruction streams (IS) to
control units CU1 … CUn, which drive processing units PU1 … PUn; a single data stream (DS) passes
through the processing units via I/O (IS = instruction stream, CU = control unit).

• A single data stream is fed into multiple processing units.


• Each processing unit operates on the data independently via independent instruction streams.
• Few actual examples of this class of parallel computer have ever existed. One is the experimental
Carnegie-Mellon C.mmp computer (1971).
• Some conceivable uses might be:
multiple frequency filters operating on a single signal stream
multiple cryptography algorithms attempting to crack a single coded message

 
Multiple Instruction, Multiple Data ( MIMD )

MIMD organization implies interaction among the N processors because all memory streams are derived
from the same data space shared by all processors. If the interaction between the processors is high, the
system is called tightly coupled (or a shared-memory multiprocessor); otherwise it is called loosely
coupled (or a networked system). Most multiprocessors fit into this category.

Fig.: MIMD architecture — each control unit CU1 … CUn issues its own instruction stream (IS) to its
processing unit PU1 … PUn, and the processing units exchange data streams (DS) with a shared memory
and I/O (IS = instruction stream, CU = control unit, PU = processing unit).

• Currently the most common type of parallel computer; most modern computers fall into this category.
• Multiple Instruction: every processor may be executing a different instruction stream.
• Multiple Data: every processor may be working with a different data stream.
• Execution can be synchronous or asynchronous, deterministic or non-deterministic.

Examples:

Most current supercomputers, networked parallel computer "grids", and multi-processor SMP
computers, including some types of PCs.

Q3. Discuss various conditions of parallelism. Explain various levels of parallelism needed in designing
parallel programs.

Parallelism appears in various forms in a computing environment. Some of the key areas are
computation models for parallel computing, inter-processor communication in parallel architectures,
and system integration for incorporating parallel systems into general computing environments. All
forms of parallelism can be attributed to levels of parallelism, computation granularity, time and space
complexity, communication latencies, scheduling policies, and load balancing.
Some of the important conditions of parallelism are:-

1. Data and Resource Dependency


2. Hardware and Software Parallelism
3. The Role of Compilers

 
DATA and RESOURCE DEPENDENCY

The ability to execute several program segments in parallel requires each segment to be independent
of the other segments. There are various types of dependencies:

A. Data Dependence
B. Control Dependence
C. Resource Dependence
D. Bernstein’s Condition

Data Dependence

The ordering relationship between statements is indicated by data dependence. These are of five
types as mentioned below:-
a. Flow dependence
b. Anti-dependence
c. Output dependence
d. I/O dependence
e. Unknown dependence

a) Flow dependence
A statement S2 is flow-dependent on statement S1 if an execution path exists from S1 to S2
and if at least one output of S1 feeds in as input to S2. It is denoted by S1 → S2.

S1: Load R1, A    /R1 ← Memory(A)/
S2: Add R2, R1    /R2 ← (R1) + (R2)/
S3: Move R1, R3   /R1 ← (R3)/
S4: Store B, R1   /Memory(B) ← (R1)/

Here S2 is flow-dependent on S1 because the value loaded into R1 by S1 is read by S2.

b) Anti-dependence
Statement S2 is anti-dependent on statement S1 if S2 follows S1 in program order and if the
output of S2 overlaps the input to S1. A direct arrow crossed with a bar, as in S1 ⇸ S2, indicates
that S2 is anti-dependent on S1. In the example above, S3 is anti-dependent on S2 because S3
writes R1 while S2 reads it.

c) Output dependence
Two statements are output-dependent if they produce (write) the same output variable. It is
denoted by an arrow marked with a circle, indicating the output dependence from S1 to S2. In
the example above, S1 and S3 are output-dependent because both write register R1.

d) I-O dependence
Read and write are input output statements. I/O dependence occurs not because the same
variable is involved but because the same file is referenced by both I/O statements.

e) Unknown dependence
The dependence relation between two statements cannot be determined in the following
situations:
• The subscript of a variable is itself subscripted (indirect addressing).
• The subscript does not contain the loop index variable.
• The variable appears more than once, with subscripts having different coefficients of the
loop variable.
• The subscript is nonlinear in the loop index variable.
NOTE: - When one or more of these conditions exist, a conservative assumption is to claim unknown
dependence among the statements involved.

 
Control Dependence

This refers to the situation where the order of execution of statements cannot be determined before
run time. Different paths taken after a conditional branch may introduce or eliminate data
dependence among instructions. Dependence may also exist between operations performed in
successive iterations of a loop. The successive iterations of the following loop are
control-independent:

Do 20, I=1, N
A (I) =C (I)
IF (A (I).LT.0)
A (I) =1
20 continue

The following loop has control dependent iteration:-

Do 40, I=1, N
IF (A (I-1).EQ.0)
A (I) =0
40 continue

• Control dependence often prohibits parallelism from being exploited.

Resource Dependence

Resource dependence is concerned with conflicts in using shared resources, such as integer units,
floating-point units, registers, and memory areas, among parallel events. When the conflicting
resource is an ALU, it is called ALU dependence. When the conflicting resource is work-place storage,
it is called storage dependence; in that case each task must work on independent storage locations or
use protected access to shared writable data. The transformation of a sequentially coded program into
a parallel executable form can be done manually by the programmer using explicit parallelism, or by a
compiler detecting implicit parallelism automatically.
Program partitioning determines whether a given program can be partitioned or split into pieces
that can execute in parallel or follow a certain pre-specified order of execution.

Bernstein’s Condition

Bernstein revealed a set of conditions based on which two processes can execute in parallel. A
process is a software entity corresponding to the abstraction of a program fragment defined at
various processing levels. The input set Ii of a process Pi is the set of all input variables needed
to execute the process. Similarly, the output set Oi consists of all output variables generated after
execution of the process Pi. Consider two processes P1 and P2 with their input sets I1 and I2 and
output sets O1 and O2, respectively. These two processes can execute in parallel, denoted
P1 || P2, if they are independent and do not create confusing results.

Formally, these conditions are stated as follows:

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅        --- (i)
O1 ∩ O2 = ∅

These three conditions are known as Bernstein's conditions.

The input set Ii is also called the read set or the domain of Pi. The output set Oi has been called
the write set or the range of Pi. In terms of data dependencies, Bernstein's conditions simply imply
that two processes can execute in parallel if they are flow-independent, anti-independent, and
output-independent.

 
The parallel execution of two such processes produces the same result regardless of whether they
are executed sequentially, in either order, or in parallel. This is possible only if the output of one
process is not used as input to the other process. In general, a set of processes P1, P2, …, Pk can
execute in parallel if Bernstein's conditions are satisfied on a pair-wise basis; that is, P1 || P2 ||
P3 || … || Pk if and only if Pi || Pj for all i ≠ j.

P1 : C = D * E
P2 : M = G + C
P3 : A = B + C        --- (ii)
P4 : C = L + M
P5 : F = G / E

* In this program each statement requires one step to execute. No pipelining is considered here.

Fig.: A dependence graph showing both data dependence (solid arrows) and resource dependence
(dashed arrows) among the statements P1–P5.

Violation of any one or more of the three conditions in Eq. (i) prohibits parallelism between two
processes. In general, violation of any one or more of the 3n(n−1)/2 Bernstein conditions among n
processes prohibits parallelism collectively or partially. Any statements or processes which depend on
run-time conditions are not transformed to parallel form; these include IF statements and conditional
branches. Recursion also prohibits parallelism. Data dependence, control dependence, and resource
dependence all prevent parallelism from being exploited. The statement-level dependence can be
generalized to higher levels, such as the code segment, subroutine, process, task and program levels.
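A small sketch that applies Bernstein's conditions pair-wise to the five statements P1–P5 of Eq. (ii)
above, representing each statement's input (read) set and output (write) set as bitmasks over the
variables A–M; the bitmask encoding is an illustrative assumption:

#include <stdio.h>

/* One bit per variable: A B C D E F G L M */
enum { A = 1, B = 2, C = 4, D = 8, E = 16, F = 32, G = 64, L = 128, M = 256 };

struct stmt { const char *name; int in, out; };

int main(void)
{
    /* P1: C=D*E   P2: M=G+C   P3: A=B+C   P4: C=L+M   P5: F=G/E */
    struct stmt p[5] = {
        { "P1", D | E, C }, { "P2", G | C, M }, { "P3", B | C, A },
        { "P4", L | M, C }, { "P5", G | E, F }
    };

    for (int i = 0; i < 5; i++)
        for (int j = i + 1; j < 5; j++) {
            /* Bernstein: Ii ∩ Oj = ∅,  Ij ∩ Oi = ∅,  Oi ∩ Oj = ∅ */
            int ok = !(p[i].in & p[j].out) && !(p[j].in & p[i].out)
                     && !(p[i].out & p[j].out);
            printf("%s %s %s\n", p[i].name,
                   ok ? "||" : "not parallel with", p[j].name);
        }
    return 0;
}

Running this marks, for example, P1 || P5 and P2 || P3 as parallel, while P1 and P2 are not parallel
because P2 reads C, which P1 writes.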

 
HARDWARE and SOFTWARE PARALLELISM

Hardware Parallelism

Hardware parallelism is defined by the machine architecture and hardware multiplicity. It is often a
function of cost and performance trade-offs. It displays the resource utilization patterns of
simultaneously executable operations. It can also indicate the peak performance of the processor
resources. Parallelism in a processor is characterized by the number of instruction issues per machine
cycle. If a processor issues k instructions per machine cycle, it is called a k-issue processor. A
multiprocessor system built with n k-issue processors should be able to handle a maximum of n×k
threads of instructions simultaneously.

Software Parallelism

Software parallelism is defined by the control and data dependence of programs. The degree of
parallelism is revealed in the program profile or in the program flow graph. Software parallelism is a
function of the algorithm, programming style and compiler optimization. The program flow graph
displays the patterns of simultaneously executable operations. Parallelism in a program varies during
the execution period. It often limits the sustained performance of the processor.

Fig.: Software parallelism in a two-issue superscalar processor — a program flow graph in which load
operations (L1–L4) feed multiply operations (X1, X2), followed by an add and a subtract producing the
final results.

There are two important types of software parallelism in parallel programming. The first is control
parallelism, which allows two or more operations to be performed simultaneously. The second type has
been called data parallelism, in which almost the same operation is performed over many data elements
by many processors simultaneously.
Control parallelism, appearing in the form of pipelining or multiple functional units, is limited by the
pipeline length and by the multiplicity of functional units.
Data parallelism offers the highest potential for concurrency and is used in both SIMD and MIMD modes
on MPP systems. Synchronization in SIMD data parallelism is handled by the hardware. The mismatch
problem between software parallelism and hardware parallelism can be solved with compilation support.

THE ROLE OF COMPILERS

Compiler techniques are used to exploit hardware features to improve performance. Interaction
between compiler and architecture design is a necessity in modern computer development. Most
existing processors issue one instruction per cycle and provide a few registers. This may cause
excessive spilling of temporary results from the available registers.
There exists a vicious cycle of limited hardware support and the use of a naïve compiler. To break
the cycle, one must design the compiler and the hardware jointly at the same time. Interaction
between the two can lead to a better solution to the mismatch problem between software and
hardware parallelism.
The general guideline is to increase the flexibility in hardware parallelism and to exploit software
parallelism in control-intensive programs. Hardware and software design trade-offs also exist in terms
of cost, complexity, expandability, compatibility and performance. Compiling for multiprocessors is
much more involved than for uniprocessors. Granularity and communication latency play important
roles in the code optimization and scheduling process.
 
Q4. What do you mean by granularity? To design a most efficient and optimal parallel program, which
type of granularity is most suitable and why? Take suitable assumption to justify your answer.

Grain size or granularity is a measure of the amount of computation involved in a software process.
The simplest measure is to count the number of instructions in a grain or program segment. Grain size
determines the basic program segment chosen for parallel processing. Grain sizes are commonly
described as fine, medium and coarse, depending upon the amount of processing involved.

In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.

Coarse: relatively large amounts of computational work are done between communication events.
Fine: relatively small amounts of computational work are done between communication events.

Computation granularity and communication latency are closely related. Periods of computation are
typically separated from periods of communication by synchronization events.

Parallelism is achieved at different program levels, as shown in the figure below, and thus different
grain sizes are required, and considered efficient, at different levels. Let us examine the three grain
sizes:

Fig.: Levels of parallelism and computational grain size — Level 1: jobs or programs (coarse grain);
Level 2: subprograms, job steps or related parts of a program; Level 3: procedures, subroutines, tasks
or co-routines (medium grain); Level 4: non-recursive loops or unfolded iterations; Level 5: instructions
or statements (fine grain). Communication demand, scheduling overhead and the available degree of
parallelism all grow as the grain becomes finer.

Fine-grain Parallelism

a) A typical fine grain contains fewer than 20 instructions (it may range from 2 to 1000 instructions).
b) Abundance of parallelism (if assisted by a good parallelizing compiler).
c) Usually implemented at the instruction level and loop level.
d) Relatively small amounts of computational work are done between communication events.
e) Low computation-to-communication ratio.
f) Facilitates load balancing.
g) Implies high communication overhead and less opportunity for performance enhancement.
h) If granularity is too fine, it is possible that the overhead required for communication and
synchronization between tasks takes longer than the computation.

Medium-grain Parallelism

a) A typical medium grain contains less than 2000 instructions.
b) Relatively high amount of parallelism (if assisted by a good parallelizing compiler as well as the
programmer).
c) Usually implemented at the procedural level and subprogram level.
d) Relatively large amounts of computational work are done between communication events.
e) Often less communication is required.

 
Coarse-grain Parallelism

a) Typical coarse grain may contain tens of thousands of instructions.


b) Relatively high amount of parallelism (if assisted by a good
parallelizing compiler as well as a programmer).
c) Usually implemented at the procedural level and subprogram level.
d) Relatively large amounts of computational work are done between
communication/synchronization events.
e) High computation to communication ratio.
f) Implies more opportunity for performance increase.
g) Harder to load balance efficiently.
h) Relies heavily on effective OS and algorithm.

NOTE:-

Fine grain plays a significant role by increasing the chance of parallelism, but it also adds
inter-processor communication overhead, even though data parallelism is exploited at fine grain
levels on SIMD or MIMD machines.

• The most efficient granularity is dependent on the algorithm and the hardware environment in
which it runs.
• In most cases the overhead associated with communications and synchronization is high relative
to execution speed, so it is advantageous to have coarse granularity.
• Fine-grain parallelism can help reduce overheads due to load imbalance.

Q5. Explain various types of system inter-connect architecture with neat sketch.

In designing multiprocessors and multicomputers with distributed nodes and memory, we have to
overcome the problem of communication needed by a processor node in order to execute instructions
that depend upon the results of other processors. A shared-variable architecture is simple and requires
very little use of an interconnection network, but in a message-passing architecture with distributed
memory and processors we need a good interconnection network.
These network connections are required for connecting processors, memory modules and I/O disk
arrays in a centralized system, or for a distributed network of multicomputer nodes.

Fig.: Interconnection network taxonomy — static networks (1-D, 2-D, hypercube) and dynamic networks
(bus-based: single or multiple bus; switch-based: single-stage, multi-stage, crossbar).

The topology of an interconnection network can be either static or dynamic. Static networks are formed
of point-to-point direct connections which do not change during program execution. Static networks are
used for fixed connections among subsystems of a centralized system or multiple computing nodes of a
distributed system.

Dynamic networks are implemented with switched channels or bus networks, which are dynamically
configured to match the communication demand. Dynamic networks include buses, crossbars, switches
and multi-stage networks, which are often used in shared-memory architectures.

 
PROPERTIES OF INTERCONNECTION NETWORK

1. Network properties and routing


2. Perfect shuffle and exchange
3. Hypercube routing function

NETWORK PROPERTIES & ROUTING

A network is represented by the graph of a finite no of nodes linked by directed or undirected edges.
The no of nodes in the graph is called the network size. The no of (edges links or channels) incident on
a node is called the node degree. The degree of a node is the sum of in-degree and out-degree
channels. The node degree reflects the no. of I/O ports required per node. The diameter D of a network
is the maximum shortest path between any two nodes managed by link traversed. The network
diameter indicates the maximum no. of distinct hops between any two nodes, providing
communication merit.
When the given network is cut into two equal halves, the minimum no. of edges along the cut is called
the channel bisection width (b). In communication each edge corresponds to a channel with w bit
wires. The wire bisection width B=b*w. The parameter B reflects wire density of a network. When B is
fixed, the channel width w=B/b bits. The wire length between nodes affects the signal latency, clock
skewing, or power requirements.
Data routing in multi-computer network is achieved by message passing. Hardware routers are used
to route messages among multiple computer nodes. Data routing functions among PEs include
shifting, rotation, permutation (one-to-one), and broadcast (one-to all), multicast (many-to-many),
personalized communication (one-to-many), shuffle, exchange etc.

• Permutation: The set of all permutations forms a permutation group with respect to a composition
operation. The permutation π = (a, b, c)(d, e) stands for the bijective mapping a → b, b → c, c → a,
d → e and e → d in a circular fashion. A crossbar switch can be used to implement permutation.

PERFECT SHUFFLE & EXCHANGE

Perfect shuffle is a special permutation function for parallel processing applications. To shuffle n = 2^k
objects evenly, one can express each object in the domain by a k-bit binary number
x = (x_{k-1}, …, x_1, x_0). The perfect shuffle maps x to y, where y = (x_{k-2}, …, x_1, x_0, x_{k-1}) is
obtained from x by shifting the bits one position to the left and wrapping the most significant bit
around to the least significant position.

Fig.: (a) Perfect shuffle and (b) inverse perfect shuffle mappings on n = 8 objects, labelled 000
through 111.
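A minimal sketch of the perfect shuffle and its inverse as bit operations on a k-bit index, following the
definition above; choosing k = 3 (n = 8) is an illustrative assumption:

#include <stdio.h>

/* Perfect shuffle: rotate the k-bit address left by one position
   (the most significant bit wraps around to the least significant). */
unsigned shuffle(unsigned x, unsigned k)
{
    unsigned msb = (x >> (k - 1)) & 1u;
    return ((x << 1) | msb) & ((1u << k) - 1u);
}

/* Inverse perfect shuffle: rotate the k-bit address right by one. */
unsigned unshuffle(unsigned x, unsigned k)
{
    unsigned lsb = x & 1u;
    return (x >> 1) | (lsb << (k - 1));
}

int main(void)
{
    unsigned k = 3;                      /* n = 2^k = 8 objects */
    for (unsigned x = 0; x < (1u << k); x++)
        printf("%u -> shuffle %u, inverse %u\n",
               x, shuffle(x, k), unshuffle(x, k));
    return 0;
}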

 
HYPERCUBE ROUTING FUNCTION

A hypercube routing function can be illustrated with a three-dimensional binary cube network. Three
routing functions are defined by the three bits of the node address. For example, data can be exchanged
between adjacent nodes which differ in the least significant bit C0; similar routing functions are defined
by bits C1 and C2.

Fig.: A 3-cube with nodes denoted as C2C1C0 in binary (000 through 111); the three routing functions
are defined by the individual bits of the node address.
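A minimal sketch of dimension-order (e-cube) routing on such an n-cube. This routing scheme, in which
a message successively corrects the address bits in which the current node and the destination differ, is
a common illustration and is not spelled out in the text above:

#include <stdio.h>

/* Route a message from src to dst on an n-cube by flipping, one
   dimension at a time, each bit in which the two addresses differ. */
void ecube_route(unsigned src, unsigned dst, unsigned n)
{
    unsigned node = src;
    printf("%u", node);
    for (unsigned d = 0; d < n; d++) {
        unsigned diff = (node ^ dst) & (1u << d);  /* bit Cd differs?  */
        if (diff) {
            node ^= diff;                          /* move along dim d */
            printf(" -> %u", node);
        }
    }
    printf("\n");
}

int main(void)
{
    ecube_route(0, 7, 3);   /* 000 -> 001 -> 011 -> 111 on the 3-cube */
    ecube_route(5, 2, 3);   /* 101 -> 100 -> 110 -> 010               */
    return 0;
}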

A broadcast is a one-to-all mapping. This can be easily achieved in an SIMD computer using a
broadcast bus extending from the array controller to all PEs. A message-passing multi-computer also
has mechanisms to broadcast messages. Multicast corresponds to a mapping from one subset to
another (many-to-many).
Personalized broadcast sends personalized messages to only selected receivers. It is often treated as a
global operation in a multicomputer.

PERFORMANCE AFFECTING FACTORS:-

The performance of an interconnection network is affected by the following factors:

1. Functionality

This refers to how the networks support data routing, interrupt handling, synchronization,
request/ message combining, and coherence.

2. Network Latency

This refers to the worst-case time delay for a unit message to be transferred through the
network.

3. Bandwidth

This refers to the maximum data transfer rate, in terms of Mbytes/sec transmitted through the
network.

4. Hardware complexity

This refers to implementation cost such as wires, switches, connectors, arbitration and
interface logic.

5. Scalability

This refers to the ability of a network to be modularly expandable with a scalable performance
with increasing machine resources.
 
INTERCONNECTION NETWORK

As mentioned above, there are mainly two types of interconnection, i.e. static and dynamic, which are
discussed below in detail. Let us take an overview of the system-interconnect networks:

1. Static Connection
   a) 1-D
   b) 2-D
   c) HC (hypercube)
2. Dynamic Connection
   a) Bus-based
      • Single
      • Multiple
   b) Switch-based
      • Single-stage
      • Multi-stage
      • Crossbar

STATIC CONNECTION

Static networks use direct links which are fixed once built. This type of network is more suitable for
predictable communication patterns that can be implemented with static connections. There are several
topologies, described below in terms of network parameters.

i. Linear Array

Linear arrays are the simplest connection topology. This is a one-dimensional network in which N nodes
are connected by N−1 links in a line. Internal nodes have degree 2 and the terminal nodes have degree 1.
A linear array allows concurrent use of different sections of the structure by different source and
destination pairs.

ii. Ring and Chordal Ring

A ring is obtained by connecting the two terminal nodes of a linear array with one extra link. A ring can
be unidirectional or bidirectional. It is symmetric with a constant node degree of 2. The diameter is
⌊N/2⌋ for a bidirectional ring and N for a unidirectional ring. By increasing the node degree from 2 to 3
or 4, chordal rings are obtained. The more links added, the higher the node degree and the shorter the
network diameter. The completely connected network (of 16 nodes) has a node degree of 15 with the
shortest possible diameter of 1.

 
iii. Barrel Shifter

It is obtained from the ring by adding extra links from each node to those nodes having a distance
equal to an integer power of 2. This implies that node i is connected to node j if |j − i| = 2^r for some
r = 0, 1, 2, …, n−1, where the network size is N = 2^n. Such a barrel shifter has a node degree of
d = 2n−1 and a diameter D = n/2.

iv. Tree and Star

A binary tree of 31 nodes in five levels is shown in the figure. In general, a k-level completely balanced
binary tree has N = 2^k − 1 nodes. The maximum node degree is 3 and the diameter is 2(k−1). With a
constant node degree, the binary tree is a scalable architecture.

The star is a two-level tree with a high node degree of d = N−1 and a small constant diameter of 2.
The star architecture has been used in systems with a centralized supervisor node.

v. Fat tree

The fat tree is a modification of the conventional binary tree. The channel width of a fat tree increases
as we ascend from the leaves to the root, i.e. the branches get thicker toward the root. This alleviates
the bottleneck problem toward the root that appears in a conventional binary tree.

vi. Mesh and Torus

The mesh network architecture has been implemented in the Illiac IV, MPP, DAP, CM-2 and Intel
Paragon, with variations. In general, a k-dimensional mesh with N = n^k nodes has an interior node
degree of 2k and a network diameter of k(n−1). A variation of the mesh gives the Illiac network
architecture; the Illiac network is topologically equivalent to a chordal ring. An n×n Illiac mesh has a
diameter of d = n−1, which is one half of the diameter of the pure mesh.

The torus has ring connections along each row and along each column of the array. An n×n binary
torus has a node degree of 4 and a diameter of 2⌊n/2⌋. The torus is a symmetric topology.

 

vii. Systolic array

This is a class of multidimensional pipelined array architectures designed for implementing fixed
algorithms. In general, static systolic arrays are pipelined with multidirectional flow of data streams.
With fixed interconnections and synchronous operation, a systolic array matches the communication
structure of the algorithm. It is used in VLSI array processors.

viii. Hyper cube

This is a binary n-cube architecture which has been implemented in the iPSC, nCUBE and CM-2 systems.
In general, an n-cube consists of N = 2^n nodes spanning n dimensions, with two nodes per dimension.
The node degree increases linearly with respect to the dimension, making it difficult to consider the
hypercube a scalable architecture. The node degree of an n-cube equals n, and so does the network
diameter.

Fig.: Binary hypercubes of increasing dimension — d = 0 (a single node), d = 1 (two nodes 0, 1),
d = 2 (four nodes 00–11) and d = 3 (eight nodes 000–111).
ix. Cube-Connected Cycles

This architecture is a modified form of the hypercube. A 3-cube, for example, is modified to form
3-cube-connected cycles (CCC): the idea is to cut off each corner node of the 3-cube and replace it by a
ring (cycle) of 3 nodes. In general, one can construct k-cube-connected cycles from a k-cube by replacing
each vertex of the k-dimensional hypercube by a ring of k nodes; a k-cube is thus transformed into a
k-CCC with k*2^k nodes. A 3-CCC has a diameter of 6, twice that of the original 3-cube. In general, the
network diameter of a k-CCC equals 2k. The constant node degree of 3 is independent of the dimension
of the underlying hypercube.
Consider a hypercube with N = 2^n nodes. A CCC with an equal number of N nodes must be built from a
lower-dimension k-cube such that 2^n = k*2^k for some k < n. The CCC is a better architecture for
building scalable systems if the latency can be tolerated.

Fig.: A 4-dimensional hypercube (d = 4) with 16 nodes labelled 0000 through 1111.

 
k-ary n-Cube networks

Rings, meshes, tori, binary n-cubes (hypercubes) and Omega networks are topologically isomorphic to a
family of k-ary n-cube networks.
The parameter n is the dimension of the cube and k is the radix, i.e. the number of nodes (multiplicity)
along each dimension. These two numbers are related to the number of nodes, N, in the network by
N = k^n. A node in a k-ary n-cube can be identified by an n-digit radix-k address A = a0, a1, a2, …,
a(n−1), where ai represents the node's position in the i-th dimension. All links are assumed
bidirectional; each line in the network represents two communication channels, one in each direction.
Traditionally, low-dimensional k-ary n-cubes are called tori, and high-dimensional binary n-cubes are
called hypercubes. The long end-around connections in a torus can be avoided by folding the network;
in this case all links along each dimension have equal wire length when the multi-dimensional network
is embedded in a plane.

DYNAMIC CONNECTION

For multipurpose or general-purpose applications, dynamic connections are used, which can implement
all communication patterns based on program demands. Switches or arbiters must be used along the
connecting path to provide the dynamic connectivity. Dynamic connection networks include bus
systems, Multi-stage Interconnection Networks (MIN) and crossbar switch networks. The performance is
indicated by the network bandwidth, data transfer rate, network latency and communication patterns
supported.

i. Digital Buses

A bus system is a collection of wires and connectors for data transactions among processors, memory
modules and peripheral devices attached to the bus. The bus is used for only one transaction at a time
between a source and a destination. In case of multiple requests, the bus arbitration logic must be able
to allocate or deallocate the bus, servicing the requests one at a time. For this reason the digital bus has
been called a contention bus or a time-sharing bus among multiple functional modules. The system bus
provides a common communication path between the processors or I/O subsystem and the memory
modules or secondary storage devices. The active or master devices (processors or I/O subsystems)
generate requests to address the memory. The passive or slave devices (i.e., memory or peripherals)
respond to the requests. The common bus is used on a time-sharing basis, and the bus issues include
bus arbitration, interrupt handling, coherence protocols and transaction processing.
 
ii. Switch modules

An a × b switch module has a inputs and b outputs. A binary switch corresponds to a 2 × 2 switch
module in which a = b = 2. In theory, a and b do not have to be equal; in practice, however, they are
often chosen as integer powers of 2, that is, a = b = 2^k for some k ≥ 1.

Several commonly used switch module sizes are 2×2, 4×4 and 8×8. Each input can be connected to one
or more of the outputs. In other words, one-to-one and one-to-many mappings are allowed, but
many-to-one mappings are not allowed due to conflicts at the output terminals. The numbers of
legitimate connection states and permutation connections for switch modules of various sizes are listed
below:

Module size | Legitimate states | Permutation connections
2×2         | 4                 | 2
4×4         | 256               | 24
8×8         | 16,777,216        | 40,320
n×n         | n^n               | n!

iii. Multi-stage Networks

MINs have been used in both MIMD and SIMD computers. A general multistage interconnection network
is built with a number of a×b switches in each stage. Fixed inter-stage connections are used between
the switches in adjacent stages. The switches can be dynamically set to establish the desired
connections between the inputs and outputs.

Fig.: The different settings of a 2×2 switch element (SE) — straight, exchange, upper broadcast and
lower broadcast.

Fig.: Dynamic interconnection networks (multi-stage INs) — an example 8×8 Shuffle-Exchange Network
(SEN) and an 8×8 Banyan network, each connecting inputs 000–111 to outputs 000–111 through stages
of 2×2 switch elements.

 
THE OMEGA NETWORK

The Omega network is one of several connection networks that are used in parallel machines. A small
but typical network illustrates the common attributes of such a network, which include:

• 2^k = N inputs and a like number of outputs.
• Between these are log2(N) stages, each having N/2 exchange elements.
• The inputs are connected to the first stage using a perfect shuffle connection system, and this is
repeated prior to each group of exchange elements; but from the last group to the destination
elements the connections are direct.
An Omega network is typically a (semi-)blocking network. When an exchange element is in use, the
next message or data packet that needs this element must wait.
Since each input has to connect to every possible output, the data has to be directed through exchange
elements where cross-connections may take place. In order to perform these crossovers efficiently, a
perfect shuffle is used between groups of these elements. To visualize this, picture a deck of only 8
cards. Cut exactly in two and counting from 0, one stack will have cards 0-3, the other 4-7. If then
shuffled to perfection, the new order will be 0-4-1-5-2-6-3-7. Mathematically, this places the nth card
as follows:

PS(n) = 2 * n            for n < N/2
PS(n) = 2 * n - N + 1    for n >= N/2

Since N = 8 can be represented with a 3-bit address, it becomes obvious that this formula can be
implemented by a one-bit shift left with wrap-around. For example, let's start with card number 5,
or 101 in binary. Normally, after a shift, the msb or leftmost bit disappears, but we wrap it back to the
right side so that we end up with 011, as per the formula above, a 3. Only if the number is N/2 or more
will it have a 1 in that bit position, which explains the "+ 1" in the formula above.
Now let's route from processor #5 to #3. From the shuffle above, we arrive at switch input 3, but then
what? If we do an Xor (exclusive or) of our source and destination, 101 Xor 011, we end up with 110,
or 6. This tells us that we need to perform a crossover at the first and second switches. Notice that if
we do another Xor of 5 with our result 6, we end up with 3 again; this will always hold.

We know which switch we go to from the PS operation at each step, and the Xor result tells us when we
cross over. But do we cross up, or down? First, you can use the result of the PS operation and see
whether it is odd or even. Also, just check the source bit pattern: a 1 says we go up and a 0 down. If we
do not cross, then reverse the implication: a 0 means we traverse the switch along the top, a 1 along
the bottom. A constructive proof shows that applying the exchange to a single bit after a shift left at
each step will automatically transform the source address into the destination address.

Realize that the shuffle is merely a way of intermingling the connections; it amounts to mere
bookkeeping. Performing the shuffle a second time, as mentioned, will replace the first two bits of 5 and
result in our desired 3. And notice that when a switch is locked in any position, no other data
transmission can take place through it, showing that this is a blocking network.
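A minimal sketch of this destination-tag routing through an 8-input Omega network: before each stage
the address is perfect-shuffled (rotated left), and the 2×2 exchange element is set straight or crossed so
that the low-order bit becomes the next bit of the destination. The printing format and example route
are illustrative choices:

#include <stdio.h>

/* Route src -> dst through a k-stage Omega network on N = 2^k inputs. */
void omega_route(unsigned src, unsigned dst, unsigned k)
{
    unsigned mask = (1u << k) - 1u;
    unsigned addr = src;

    printf("route %u -> %u:", src, dst);
    for (unsigned stage = 0; stage < k; stage++) {
        /* Perfect shuffle: rotate the k-bit address left by one bit.  */
        unsigned shuffled = ((addr << 1) | (addr >> (k - 1))) & mask;
        /* Next destination bit, most significant first.               */
        unsigned want = (dst >> (k - 1 - stage)) & 1u;
        /* Set the exchange element: straight keeps the low bit,
           cross flips it, so that it matches the destination bit.     */
        unsigned out = (shuffled & ~1u) | want;
        printf(" stage %u %s (%u)", stage + 1,
               (shuffled & 1u) == want ? "straight" : "cross", out);
        addr = out;
    }
    printf("\n");
}

int main(void)
{
    omega_route(5, 3, 3);   /* crosses at stages 1 and 2, as in the text */
    return 0;
}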

 
THE CROSS-BAR NETWORK

The highest bandwidth and interconnection capability are provided by the crossbar network.

A crossbar network can be visualized as a single-stage switch network. Each crosspoint switch can
provide a dedicated connection between a (source, destination) pair, and can be set ON or OFF
dynamically upon program demand. To build a shared-memory multiprocessor, one can use a crossbar
network between the processors and the memory modules; this is essentially a memory-access network.
An inter-processor crossbar provides one-to-one permutation connections; therefore an n × n crossbar
connects at most n pairs at a time.

Fig.: An 8×8 crossbar network connecting processors P1–P8.

Network      | Delay    | Cost       | Blocking | Degree of FT
Bus          | O(N)     | O(1)       | Yes      | 0
Multiple-bus | O(mN)    | O(m)       | Yes      | (m-1)
MINs         | O(log N) | O(N log N) | Yes      | 0
Crossbar     | O(1)     | O(N^2)     | No       | 0

 
Q6. Differentiate between scalar, superscalar and vector computer. Differentiate according to their
attributes.

RISC & CISC:-

An instruction set of a computer specifies the primitive commands or machine instructions that a
programmer can use in programming the machine.
The first microprocessors were simple, with very simple instruction sets. Gradually we moved towards
complex instruction sets as hardware cost dropped and software cost went up steadily. The semantic
gap between the hardware and high-level languages widened, so more and more functions were
hardwired into the processor, making instruction sets large and complex. Gradually two architectures
evolved: RISC and CISC.

The complexity of an instruction is judged upon several attributes such as:-

1. Instruction/data formats
2. Addressing modes
3. General purpose registers
4. Opcode specification
5. Flow control mechanism
6. Clock Rate & CPI

CISC

CISC is the abbreviation of Complex Instruction Set Computing. As the name suggests, the instruction
set of a CISC computer is complex, involving more sub-operations per instruction. The general
philosophy of designing a CISC processor is to implement instructions in hardware/firmware, which may
result in shorter program lengths with lower software overhead.
CISC processors have micro-programmed control, with a unified cache for both instructions and data.
Many HLL features are directly implemented in the micro-programmed control memory.
CISC machines have a variable-length instruction format. CISC aims to provide a single machine
instruction for each HLL statement, and it also provides memory-based manipulation of operands. The
CISC architecture poses some problems along with its advantages.

The general features of a CISC architecture processor are:

1. Large number of instructions, typically 120-350.
2. Variable instruction/data formats (1-64 bits per instruction).
3. Large variety of addressing modes (12-24).
4. Large number of memory-reference operations and manipulation of operands in memory.
5. ROM is used to store the micro-programmed control.
6. Uses a unified cache for instructions and data.
7. Some instructions that perform specialized tasks are used infrequently.
8. Small set of general-purpose registers (8-24).

Advantages of CISC architecture:-

1. Allows the CPU to do more work per instruction.
2. Compiler design is simplified.
3. Allows execution efficiency.
4. Shorter program lengths.
5. As a CISC processor borrows portions of its design from predecessor processors of the same
family:
(a) Development cost is less.
(b) Reliability is increased.
6. Also provides backward compatibility, i.e. newer processors are compatible with the rest of the
computer design of their predecessors if they are pin compatible.
7. Also, software can be easily migrated to the new architecture.

 
Disadvantages of CISC architecture:-

1. CISC processors are more complex to design.
2. Control units are complex.
3. Run at a low clock rate.
4. CPI is high.
5. Since the cache is unified for data and instructions, they have to share the same path (more
conflicts).
6. Space on the CPU chip is wasted on micro-programmed complex instructions which are used
less frequently.
7. Low number of GPRs, so more time is spent fetching instructions and data from memory.
8. Pipelining of instructions is not easily possible.
E.g.: Intel i486, Motorola MC 68040

RISC

RISC stands for Reduced Instruction Set Computing. A RISC computer has a smaller number of instructions, which are simple and of uniform length and format. A RISC processor issues one instruction per cycle. This makes RISC programs longer; compiler design becomes complex, but parallelizing the code is easy.
RISC processors have a hardwired control unit. This is more easily implemented, and it allows the processor to run at a higher clock rate with a lower CPI (about 1.5). The small control unit saves space on the chip, which is used to increase the number of GPRs; this reduces operand access time, and pipelining becomes possible in RISC. Separate I-cache and D-cache give instructions and data separate access paths, so there are fewer conflicts and fewer intermediate results need to be stored in memory.

The general features of a RISC processor are:-

1. Small number of instructions (<100)
2. Fewer addressing modes (3-5)
3. Fixed-length data/instruction formats (32 bit)
4. Hardwired control unit
5. Separate I-cache and D-cache
6. Memory access limited to load and store instructions

Advantages of RISC computer architecture:-

1. Fixed-length instruction/data formats can be pipelined efficiently
2. Low CPI
3. High clock rate / high MIPS
4. Separate I- and D-caches result in fewer conflicts
5. Fast data access from cache
6. Not always necessary to store intermediate results in memory
7. A large register file with register windows allows fast context switching and fast data access
8. The entire processor can be implemented on a single VLSI chip
9. Design of an optimizing compiler is easier
10. Control units are easily designed and implemented

Disadvantages

1. Less flexible hardwired control
2. Compiler design is complex
3. Program length is high
4. Allows only a single instruction issue per cycle
5. Software and hardware are not backward compatible
6. Software overhead for complex HLL statements
7. Reliance on a good compiler is more demanding in this case
8. Increase in program length increases instruction traffic
9. Register decoding is more complex
Ex: Intel i860, Motorola M88100
 
SUPER SCALAR PROCESSOR

A scalar processor is designed to issue one instruction per cycle, and only one instruction completion is expected per cycle. A CISC or RISC scalar processor can be improved with a superscalar or vector architecture.
In a superscalar processor multiple instruction pipelines are provided, so multiple instructions are issued and multiple results are generated per cycle.
Thus the effective CPI of a superscalar processor is lower than that of a general scalar processor.
A base scalar processor issues instructions in four phases, i.e. fetch, decode, execute and write-back. In a base scalar processor the instruction issue rate (degree) is 1, whereas in a superscalar processor it is more than 1. In fact superscalar processors are designed to exploit more instruction-level parallelism in user programs. Only independent instructions can be executed in parallel without causing a wait state. The amount of instruction-level parallelism varies widely depending upon the type of code being executed. The instruction issue degree in a superscalar processor typically varies from 2 to 5.
In order to fully utilize a superscalar processor of degree n, n instructions must be executable in parallel and the simple-operation latency should be only one cycle. For a high degree of instruction-level parallelism the processor relies on an optimizing compiler.
A superscalar machine that can issue a fixed-point, a floating-point, a load and a branch instruction all in one cycle achieves the same parallelism as a vector machine. To achieve this parallelism, multiple instruction pipelines are used. The instruction cache supplies instructions for fetch, but the actual number of instructions issued to the various functional units may vary in each cycle. This depends on the data dependences and resource conflicts among the instructions. Multiple functional units are built into the integer and floating-point units.
Multiple data buses exist between the functional units. The integer unit (IU) and floating-point unit (FPU) are generally implemented on a single chip. Registers in each unit are 32 bits wide. The higher clock rate and lower CPI make a superscalar processor outperform a scalar processor.

VECTOR SUPER COMPUTERS

A vector is a set of scalar data items, all of the same type, stored in memory. Usually the vector elements are ordered to have a fixed addressing increment between successive elements, called the stride.
A vector processor is an ensemble of hardware resources, including vector registers, functional pipelines, processing elements and register counters, for performing vector operations.
A vector operation is performed when an arithmetic or logical operation is applied to a vector. A vector instruction involves a large array of operands; in other words, the same operation is performed over a string of data.
Vector processing differs from scalar processing, which operates on one datum or one pair of data at a time.
A vector processor is a co-processor specially designed to execute vector instructions. A vector processor can have a register-to-register architecture or a memory-to-memory architecture, depending on whether vector registers are used to interface memory with the vector functional pipelines or not.
The register-to-register architecture uses shorter instructions and a vector register file.
The memory-to-memory architecture uses memory-based instructions, which are longer in length.
A vector processor takes advantage of unrolled loop-level parallelism. Vector pipelines can be attached to any scalar, superscalar or super-pipelined processor. Dedicated vector pipelines eliminate some software overhead in loop control. Of course, the effectiveness of a vector processor relies on the capability of an optimizing compiler that converts scalar sequential code into vector pipelined code, i.e. performs vectorization.
Vector processing is faster and more efficient than scalar processing. It reduces memory conflicts and adheres to the pipelining principle of continuously delivering one result per clock cycle. Well-vectorized code can easily achieve a speedup of 10 to 20 times over equivalent scalar code.
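As an illustration of the kind of loop a vectorizing compiler targets (the function and parameter names below are hypothetical, and the stride is passed in explicitly), each iteration applies the same operation to elements separated by a constant stride:

#include <stddef.h>

/* Vectorizable code: Y[i] = a*X[i] + Y[i] over n elements accessed with a
 * constant stride. A vectorizing compiler can map this loop onto vector
 * load, multiply, add and store pipelines instead of issuing one scalar
 * operation at a time. */
void scaled_add(double a, const double *x, double *y, size_t n, size_t stride)
{
    for (size_t i = 0; i < n; i++)
        y[i * stride] = a * x[i * stride] + y[i * stride];
}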

 
Q7. What do you mean by cache memory organization? Explain various types of cache mapping with
neat diagram.

Cache memory is constructed using static RAM (SRAM). It is faster than DRAM; its access time is of the order of 10 ns. It is more expensive than main memory. It is transparent to the programmer, since assignment of data to the cache cannot be controlled by the programmer. It is located closest to the processor. The L1 cache is the cache memory incorporated in the processor chip itself; the L2 cache is outside the microprocessor, usually on the board.

The cache is controlled by the MMU or a cache controller, which copies data from physical memory into the cache.

The cache is fast, so data or instructions that have the highest chance of being fetched are kept in the cache ready for the CPU to access. The CPU therefore does not access data directly from main memory, but through the cache.

Caches are implemented at different levels of the memory hierarchy:

I-cache – instruction cache
D-cache – data cache
Unified cache – both data and instructions

The cache uses tags to identify which block of main memory it holds. It is often built from associative memory, since the comparison is done on the basis of the tag contents rather than by address. The CPU supplies the argument (the tag of the required address) together with a mask or key specifying which bits are to be compared.

The cache is divided into blocks (lines); for example, if the cache is 64 Kbytes and a cache block is 4 bytes, there are 16K (2^14) lines. A hit refers to the situation where the required data is found in the cache itself; a miss refers to the situation where it is not. The hit ratio is the fraction (or percentage) of references for which the processor finds the data in the cache. The cache treats memory as a set of blocks.

The cache is addressed using part of the main-memory address.

DIRECT MAPPING

Cache memory transfers are done in terms of blocks. A block frame of the cache corresponds to a block of a main memory page.
This cache organization is based on direct mapping: the n/m memory blocks that are spaced at equal distances map to the same block frame. Block Bj of main memory is mapped directly to cache block frame Bi, where i = j mod m.
This direct-mapping technique is very easy to implement and requires no replacement policy, but it is very rigid.
There is always a unique block frame Bi into which each Bj can be loaded.
 
Salient features of the direct-mapping cache organization:-

Each block of main memory maps to only one cache line
♦ "cache line #" = "main memory block #" % "number of lines in cache"

Main memory addresses are viewed as three fields
♦ The least significant w bits identify a unique word or byte within a block
♦ The most significant s bits specify one of the 2^s blocks of main memory
• Tag field of s-r bits (most significant)
• Line field of r bits – identifies one of the m = 2^r lines of the cache

Direct Mapping Cache Table:-

Cache line    Main memory blocks assigned
0             0, m, 2m, ..., 2^s - m
1             1, m+1, 2m+1, ..., 2^s - m + 1
...
m-1           m-1, 2m-1, 3m-1, ..., 2^s - 1
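A minimal C sketch of the address breakdown described above; the 64 KB cache with 4-byte blocks (16K lines, so w = 2 and r = 14) matches the example figures used earlier, and the address value printed is purely illustrative:

#include <stdint.h>
#include <stdio.h>

#define W_BITS 2                     /* byte within a 4-byte block         */
#define R_BITS 14                    /* selects one of 2^14 = 16K lines    */

int main(void)
{
    uint32_t addr = 0x12345678;
    uint32_t word = addr & ((1u << W_BITS) - 1);
    uint32_t line = (addr >> W_BITS) & ((1u << R_BITS) - 1);
    uint32_t tag  = addr >> (W_BITS + R_BITS);   /* remaining s-r bits      */

    /* "cache line #" = "main memory block #" % "number of lines in cache"  */
    printf("tag=%u line=%u word=%u\n", tag, line, word);
    return 0;
}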

Direct Mapping Cache Organization:-

• A memory address is interpreted as tag, line and word fields
• The line field directly selects one cache line
• The tag stored with that line is compared with the tag field of the address to determine whether the block is in the cache
(Fig: direct-mapping cache organization)

ASSOCIATIVE MAPPING

As is clear from the name, in the associative mapping technique a data block can be placed in any cache block frame. An s-bit tag is needed in each cache block, and the tag of the required block must be compared with the cache tags to locate it. A fully associative search requires the tag to be compared with the tags of all blocks. This associative mapping is flexible, so the locality of reference can be exploited fully: the hit ratio improves and the average access time is reduced. It also offers the greatest flexibility in implementing block-replacement policies for a higher hit ratio.
The fully associative search has one disadvantage, and that is its search hardware cost. Since the incoming tag has to be compared with all stored tags, and a parallel comparison is needed to achieve a fast search, an associative memory is required, which is expensive; thus this type of cache is not widely used.

Salient features of Associative cache mapping organization-

A main memory block can be loaded into any line of the cache
A memory address is interpreted as a tag and a word field
The tag field uniquely identifies a block of main memory
Each cache line’s tag is examined simultaneously to determine if a block is in cache

 
Associative Mapping Cache Organization:

SET-ASSOCIATIVE MAPPING

In the set-associative cache, block frames are grouped into sets, each of which contains k blocks.

This design is a middle way between direct mapping and fully associative mapping. It can give a high performance ratio if designed and implemented properly.
The m cache block frames are divided into v = m/k sets with k blocks per set. The set is identified by a d-bit set number, and the block within the set by an (s - d)-bit tag.
Set-associative mapping has many advantages: the search becomes easier relative to fully associative mapping, since only the k blocks of a single set have to be searched to get the data, and there can be more than one data block in a set. The search is more economical, and the replacement policy can be more flexible and economical.
The set is identified first and then the k blocks within that set are searched, where k is generally taken as 2, 4, 8 or 16, depending upon cost and performance factors as well as cache size.

Salient features of set Associative Mapping cache organization:-

Compromise between direct and associative mapping


Cache divided into v sets
Each set contains k lines
A given block maps into any line in a given set
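A small sketch of a set-associative lookup under assumed parameters (k = 4 ways, 4096 sets, 4-byte blocks); only the k tags of the selected set are compared, as described above:

#include <stdint.h>

#define K_WAYS   4                   /* blocks (lines) per set              */
#define V_SETS   4096                /* number of sets                      */
#define W_BITS   2                   /* byte-within-block bits              */
#define D_BITS   12                  /* log2(V_SETS): set-number bits       */

struct set { uint32_t tag[K_WAYS]; int valid[K_WAYS]; };
static struct set cache[V_SETS];

/* Returns 1 on a hit, 0 on a miss. */
int lookup(uint32_t addr)
{
    uint32_t set = (addr >> W_BITS) & (V_SETS - 1);
    uint32_t tag = addr >> (W_BITS + D_BITS);
    for (int way = 0; way < K_WAYS; way++)
        if (cache[set].valid[way] && cache[set].tag[way] == tag)
            return 1;                /* found in one of the k ways          */
    return 0;                        /* miss: a replacement policy picks a way */
}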

Set Associative Cache Organization:-

 
Q8. What do you mean by virtual memory? Explain memory allocation techniques used in physical
memory to allocate pages of virtual memory with neat sketch.

VIRTUAL MEMORY:-

As the name suggests, virtual memory gives the appearance of memory that does not all exist physically but can be used as if it did. In simple words, the apparent size of the physical memory is greatly increased by the use of virtual memory.

The main memory of a computer is very small compared to what is required in a multiprogramming and multitasking environment, and every program has to reside in physical memory to be able to run. Also, most modern CPUs can address more memory than is generally installed as main memory.
For example, a 32-bit CPU can generate a 32-bit address, which can directly access 4 GB of memory, but we generally don't install 4 GB of main memory. We can take advantage of this and solve the aforesaid problem by using virtual memory.

Virtual memory makes it appear to the CPU that it has more memory than is actually present; this is achieved by backing main memory with disk storage (disk files or a swap disk).
Only the active portions of a program are required to reside in main memory; other, currently irrelevant data and instructions are kept in the disk swap file and retrieved as and when required.
Some mechanism is always needed to translate the virtual addresses generated by the CPU into physical memory addresses. Each process gets its own virtual address space, and at run time these addresses are translated into physical addresses.

Virtual memory is managed by the MMU, and data is swapped into or out of memory on a dynamic basis. It allows multiprogramming, facilitates software portability, and allows users to execute memory-hungry programs with ease.
Virtual memory requires a technique to translate the addresses and a method to implement that translation.

The address mapping is a dynamic operation, which means that every address is translated into a physical address at run time. The table implementation of the address mapping is simplified if the information in the address space and the memory space is divided into groups of fixed size. The physical memory is broken down into groups of equal size called blocks, which may range from 64 to 4096 words each. The term page refers to groups of the address space of the same size.

Let V be the set of virtual addresses generated by a process or program running on a processor, and let M be the set of physical addresses allocated to run this program. A virtual memory system demands an automatic mechanism to implement the mapping ft : V → M ∪ {∅}. This mapping is a function of time, because physical memory is dynamically allocated and de-allocated.
Consider any virtual address v ∈ V. The mapping ft is defined as follows: ft(v) = m if m ∈ M has been allocated to hold the data identified by v (a memory hit), and ft(v) = ∅ if the data for v is missing from M (a memory miss).

In other words, the mapping ft(v) uniquely translates the virtual address v into a physical address m if there is a memory hit in M. When there is a memory miss, the value returned, ft(v) = ∅, signals that the referenced item has not yet been brought into main memory.

 
Private virtual memory:-

In this scheme a private virtual memory space is associated with each processor, and each private virtual space is divided into pages. Virtual pages from different virtual spaces are mapped into the same physical memory, which is shared by all the processors. The advantages of using private virtual memories include the use of a small processor address space (32 bits), protection on each page or on a per-process basis, and the use of private memory maps, which require no locking.

Shared virtual memory:-

This model combines all the virtual address spaces into a single globally shared virtual space. Each processor is given a portion of the shared virtual memory in which to declare its addresses. The advantage of using shared virtual memory is that all addresses are unique. However, each processor must be allowed to generate addresses larger than 32 bits, such as 46 bits for a 64-Tbyte (2^46-byte) address space. The page table must allow shared accesses, so mutual exclusion (locking) is needed to enforce protected access. Segmentation is built on top of the paging system to confine each process to its own address space. Global virtual memory also makes the address-translation process longer.

MEMORY ALLOCATION TECHNIQUES USED IN PHYSICAL MEMORY TO ALLOCATE PAGES OF VIRTUAL MEMORY

PAGING

In the paging technique the whole address space (the logical addresses generated by the CPU, i.e. the virtual memory, as well as the physical memory) is partitioned into contiguous fixed-size blocks called pages. Compilers generally create code such that a page will contain either program instructions or data, but not both. Physical memory is divided into sections called frames, and the page size is equal to the frame size.

(Fig: virtual memory pages mapped through a page table onto physical memory frames; some frames remain unused.)

When the CPU requires data from main memory (because it is not found in the cache), it generates a logical address; the MMU translates this address into a physical address. If the corresponding page is resident in main memory, the data is sent to the cache in the form of a block; if it is not found in main memory, a page fault is generated and the page is swapped in from the swap disk, replacing a page frame if necessary in the process.
The MMU handles all the complexities of finding and replacing the required page with the help of a page table. The page table essentially contains page-frame, virtual-page, valid, count and dirty-bit columns. In case of a page fault the process is suspended and a context switch is made to another process while the missing page is brought into main memory and the page table is updated. The page table provides the mechanism of address translation and mapping from logical to physical addresses.
The page table can also be implemented at multiple levels to extend the page mapping, but this takes more time. Paging also introduces the problem of internal fragmentation.
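A minimal sketch of the page-table translation just described, assuming 4 KB pages and a single-level table; the structure fields (frame, valid, dirty) mirror the page-table columns mentioned above, and all sizes are illustrative:

#include <stdint.h>

#define PAGE_SHIFT 12                           /* 4 KB page => 12 offset bits   */
#define NUM_PAGES  1024

struct pte { uint32_t frame; int valid; int dirty; };   /* one page-table entry  */
static struct pte page_table[NUM_PAGES];

/* Returns the physical address, or -1 to signal a page fault. */
long long translate(uint32_t vaddr)
{
    uint32_t page   = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);

    if (page >= NUM_PAGES || !page_table[page].valid)
        return -1;                              /* page fault: OS swaps the page in */

    /* Translation is a concatenation: frame number followed by the unchanged offset. */
    return ((long long)page_table[page].frame << PAGE_SHIFT) | offset;
}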

TLB ( Translation Look-aside Buffer )

When the MMU maps a logical address to a physical address, it keeps track of present and not-present pages with the help of the page table. Suppose the CPU generates an address whose data it requires: the MMU refers to the page table, finds the frame number, and generates the physical address by concatenating the page offset (for example the last 12 bits) with that frame number. But if addresses are required thousands of times per second, this page-table mechanism takes too much time. For this reason the MMU also maintains another table, called the TLB, generally implemented as an associative memory, holding the translations of recently used pages. It mainly contains page-number, frame-number and valid-bit columns. This speeds up address translation: the frame number is read from the TLB and simply concatenated with the 12-bit offset to form the physical address. If the entry is not found in the TLB, the page table is searched. Every entry in the TLB is also present in the page table, but the reverse is not true. Thus the TLB exploits the locality-of-reference property by storing the recently used translations, speeding up address translation and memory access.
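A hedged sketch of this TLB lookup path: a small, fully associative table of recent page-to-frame translations is searched first, and only on a miss is the page table walked (the entry count and field layout are assumptions for illustration):

#include <stdint.h>

#define TLB_ENTRIES 16
#define PAGE_SHIFT  12

struct tlb_entry { uint32_t page; uint32_t frame; int valid; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns 1 and fills *paddr on a TLB hit; on a miss the caller must walk
 * the page table and then refill one TLB entry. */
int tlb_lookup(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t page   = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);

    for (int i = 0; i < TLB_ENTRIES; i++)            /* associative search      */
        if (tlb[i].valid && tlb[i].page == page) {
            *paddr = (tlb[i].frame << PAGE_SHIFT) | offset;
            return 1;                                /* hit: no page-table walk */
        }
    return 0;                                        /* miss: consult the page table */
}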
 
SEGMENTATION

Segmentation is another technique to implement memory mapping and translation between physical and virtual memory. Programs are divided into segments, which are generally logical divisions such as subroutines, a stack or any other data structure, and which are self-contained units. These segments can invoke each other.

A segment, unlike a page, can vary in size, and the MMU manages a segmented memory differently. It uses a segment table as well as the TLB to do the address translation from logical to physical. A segment can start at any address and can be of variable length. A segmented memory address is divided into two parts: the segment number and the offset. Thus a segmented memory is arranged two-dimensionally: one dimension is the segment and the inner dimension is the offset. The offset addresses within each segment form one dimension of contiguous addresses; the segment addresses are not necessarily contiguous to each other and form the second dimension of the address space.

(Fig: segmentation - the segment number indexes a segment table holding the start address and size of each segment; a fault is raised if the segment is not in memory, an error is generated if offset >= size, and otherwise the segment start address is added to the offset to form the physical address.)

Segmentation carries a disadvantage with it: the start address found in the segment table (or the TLB) has to be added to the offset, which is a heavier operation than the concatenation used in paging. Also, the problem of external fragmentation is introduced by segmentation. Therefore a third way of implementation is used, namely paged segmentation.
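A minimal sketch of segment translation under assumed table sizes, showing the bound check (error if offset >= size) and the addition of base and offset that makes segmentation heavier than the concatenation used in paging:

#include <stdint.h>

#define NUM_SEGMENTS 64

struct segment { uint32_t base; uint32_t size; int valid; };
static struct segment seg_table[NUM_SEGMENTS];

/* Returns the physical address, or -1 on a segment fault or bounds error. */
long long seg_translate(uint32_t seg_no, uint32_t offset)
{
    if (seg_no >= NUM_SEGMENTS || !seg_table[seg_no].valid)
        return -1;                                   /* fault: segment not in memory */
    if (offset >= seg_table[seg_no].size)
        return -1;                                   /* error: offset >= segment size */
    return (long long)seg_table[seg_no].base + offset;   /* addition, not concatenation */
}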

PAGED SEGMENTATION

In this scheme each segment is made up of several fixed-size pages. Each virtual address is divided into three fields: the segment number, the page number, and the offset.

(Fig: paged segmentation - the segment number indexes a segment table whose entry points to a page table; the page number indexes that page table to obtain a frame number, which is concatenated with the offset to form the physical address; faults are raised if the segment or page is not resident.)

This technique offers the advantages of both the paged and the segmented memory techniques. The allocation of segments to physical memory is simpler, since it is no longer necessary to find one contiguous block large enough to hold an entire segment. It is also no longer necessary to store the size of the segment in the segment table; the valid-bit entries in the page table serve this purpose. Also, the heavy addition operation is replaced by concatenation. However, a new lookup table is introduced, although its lookup time is negligible.
INVERTED PAGING

In inverted paging an inverted page-table entry is created for each page frame that has been allocated to users; any virtual page number can be paired with a given physical page number. Inverted page tables are accessed either by an associative search or by the use of a hashing function. The generation of a long virtual address from a short physical address is done with the help of segment registers: the leading 4 bits (S.Reg) of a 32-bit address name a segment register, and that register provides a segment id that replaces the 4-bit S.Reg field to form the long virtual address.
 
Q9. What is back plane bus system? Discuss various issues related to back plane bus system.

BUS
A bus is a parallel set of conductors used to carry data, address and control signals in a computer system. The bus system of a computer operates on a contention basis: several active devices, such as processors, may request use of the bus at the same time, but only one of them can be granted access at a time. The effective bandwidth available to each processor is inversely proportional to the number of processors contending for the bus.

BACKPLANE BUS SYSTEM

A backplane bus interconnects processors, data storage and peripheral devices in a tightly coupled hardware configuration. The bus system must have a communication protocol, a timing protocol and operational rules to ensure data transfer on the bus without disturbing the internal activities of all the devices attached to it. Timing protocols must be established to arbitrate among multiple requests. Signal lines on the backplane are functionally grouped into several buses. Various functional boards are plugged into slots on the backplane, and each slot is provided with one or more connectors for inserting the boards.

Data Transfer Bus (DTB)


Data, address and control lines form the data transfer bus (DTB) in a VME bus. The address lines are used to broadcast the data and device addresses; the number of address lines is proportional to the logarithm of the size of the address space. Address-modifier lines can be used to define special addressing modes. The number of data lines is proportional to the memory word length. The revised VME bus specification has 32 address lines and 32 or 64 data lines; the 32 address lines can be multiplexed to serve as the lower half of the 64-bit data path during data transfer cycles. The DTB control lines are used to indicate read/write, timing control and bus error conditions.

Bus arbitration and control


The process of assigning control of the DTB to a requester is called arbitration. Dedicated lines are reserved to coordinate the arbitration process among several requesters. The requesting end is called a master and the receiving end is called a slave.

Interrupt & synchronization lines


Interrupt lines are used to handle interrupts, which are often prioritized. Dedicated lines may be used to synchronize parallel activities among the processor modules. Utility lines include signals that provide periodic timing (clocking) and coordinate the power-up and power-down sequences of the system.

The backplane is made of signal lines and connectors. A special bus-controller board is used to house the backplane control logic, such as the system clock driver, arbiter, bus timer and power driver. The backplane bus is driven by a digital clock with a fixed cycle time called the bus cycle. The bus cycle is determined by the electrical, mechanical and packaging characteristics of the backplane. Signals travelling on bus lines may experience unequal delays from source to destination. Factors affecting the bus delay include the source's line drivers, the destination's receivers, slot capacitance, line length, and the bus loading effects (the number of boards attached).

To optimize performance, the bus should be designed to minimize the time required for request handling, arbitration, addressing and interrupts, so that most bus cycles are used for useful data transfer operations.

 
FUNCTIONAL MODULES

A functional module is a collection of electronic circuitry that resides on one functional board and works to achieve special bus control functions. The various functional modules are:-

i. Arbiter: an arbiter is a functional module that accepts bus requests from requester modules and grants control of the DTB to one requester at a time.

ii. Bus timer: it measures the time each data transfer takes on the DTB and terminates the DTB cycle if a transfer takes too long.

iii. Interrupter: this module generates an interrupt request and provides status/ID information when an interrupt-handler module requests it.

iv. Location monitor: it monitors data transfers over the DTB. A power monitor watches the status of the power source and signals when it becomes unstable.

v. System clock driver: it provides a clock timing signal on the utility bus. In addition, board interface logic is needed to match the signal-line impedance, the propagation time, and the termination values between the backplane and the plug-in boards.

Q10. Explain the concept of pipelining. Discuss various types of pipeline designs used in computer.

A pipeline is a collection of different stages in which different independent instructions execute in an overlapped mode. The process of dividing a task into sub-tasks and executing them in this overlapped mode is called pipelining.
Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. It decomposes a sequential process into sub-operations, with each sub-process being executed in a special segment that operates concurrently with all the other segments.
A pipeline is analogous to the assembly line of an automobile factory. In an assembly line the assembly of a car is divided into different stages, each of which contributes to the assembly of the car. The stages act in parallel with each other, but obviously on different parts of different cars.
Similarly, a pipeline in a computer consists of different stages operating on different sub-tasks, involving different functional units; these stages act in parallel on the sub-tasks of different instructions or tasks.
A pipeline's efficiency is judged by factors such as the speedup, throughput and efficiency.

The total time required to process n tasks by a k-stage pipeline with a clock cycle of τ is: Tk = [k + (n − 1)] τ

The same work performed without a pipeline takes T1 = n k τ

Therefore the speedup is Sk = T1 / Tk = n k τ / [k + (n − 1)] τ = n k / [k + (n − 1)]

Throughput, the number of tasks completed per unit time, is Hk = n / ([k + (n − 1)] τ)

Efficiency is Ek = Sk / k = n / [k + (n − 1)]

The pipeline should be designed such that each stage takes equal time, i.e. the stages are perfectly balanced. The task should be partitioned so that each stage takes an equal number of clock cycles and no stage becomes a bottleneck in the efficient working of the pipeline.
Thus pipelining is an implementation technique that exploits parallelism among the instructions in a sequential instruction stream.
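The speedup, throughput and efficiency formulas above can be evaluated directly; the sketch below plugs in illustrative values (k = 4 stages, n = 100 tasks, τ = 10 ns):

#include <stdio.h>

int main(void)
{
    double k = 4, n = 100, tau = 10e-9;            /* 4 stages, 100 tasks, 10 ns clock */

    double t1 = n * k * tau;                       /* non-pipelined time               */
    double tk = (k + (n - 1)) * tau;               /* pipelined time                   */
    double speedup    = t1 / tk;                   /* = nk / (k + n - 1)               */
    double throughput = n / tk;                    /* tasks completed per second       */
    double efficiency = speedup / k;               /* = n / (k + n - 1)                */

    printf("Sk=%.2f Hk=%.2e tasks/s Ek=%.2f\n", speedup, throughput, efficiency);
    return 0;
}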

 
Pipelines can be classified in two ways:

Based on the nature of instructions/data: scalar pipelines (instruction and arithmetic pipelines) and vector pipelines.
Based on the nature of execution: linear pipelines (synchronous or asynchronous) and non-linear (dynamic) pipelines.

LINEAR PIPELINE

A linear pipeline processor is a cascade of processing stages that are linearly connected to perform different sub-tasks over a stream of data. A linear pipeline requires proper subdivision of the task into equal sub-divisions.
A linear pipeline processor is constructed with k stages. Operands are fed into the pipeline at the first stage, and the result of stage Si is passed to stage Si+1 for all i = 1, 2, ..., k−1. The final result emerges from the last pipeline stage Sk.

Depending on the control of data flow along the pipeline, linear pipelines are modelled in two categories:-

1. Asynchronous Pipeline
2. Synchronous Pipeline

Asynchronous Pipeline

In this type of pipeline, communication and data flow between stages are carried out with the help of handshaking protocols. When one stage is ready to transmit, it sends a ready signal to the next stage, which, after receiving the data, sends back an acknowledge signal.
Asynchronous pipelines are useful in designing communication channels in message-passing multicomputers where wormhole routing is practiced. An asynchronous pipeline may have a variable throughput rate, and different amounts of delay may be experienced in different stages.

(Fig: an asynchronous pipeline with ready/acknowledge handshaking between successive stages S1, S2, ..., Sk.)

Synchronous Pipeline

In a synchronous pipeline all stages transfer data to the next stage synchronously on the arrival of a clock pulse, through latches (temporary storage) that are used as interfaces between the stages and hold the intermediate results.
The delays in the pipeline stages are designed to be equal, so that a result is produced at the end of the pipeline at the end of each clock cycle. This requires the data items to be independent of each other.

 
NON-LINEAR PIPELINE

In a non-linear pipeline processor the sequence of stages can be reconfigured dynamically to perform different functions at different points of time.

A dynamic pipeline allows feed-forward and feedback connections in addition to the streamline connections.
Function partitioning in a dynamic pipeline becomes quite involved because the pipeline stages are interconnected.

The feed-forward and feedback connections make the scheduling of successive events into the pipeline a non-trivial task. The output of a dynamic pipeline is not necessarily produced from the last stage; in fact, following different dataflow patterns, one can use the same pipeline to evaluate different functions.

The utilization pattern of successive stages in a synchronous pipeline is specified by a reservation table. The reservation table is essentially a space-time diagram depicting the precedence relationships in using the pipeline stages.

SCALAR PIPELINE

A scalar pipeline executes scalar instructions in a pipelined fashion: the different sub-tasks of an instruction, i.e. fetch, decode, execute and write-back, are performed in a pipelined fashion.

Based on the kind of operation that is pipelined, scalar pipelines fall into two categories:-

1. Instruction Pipeline
2. Arithmetic Pipeline

Instruction Pipeline

The instruction pipeline is very similar to an assembly line: each stage performs its task and passes the product to the next stage for further processing, and a finished product appears at the end. As one product advances to the next stage a new product enters the line, so ultimately the n stages of the assembly line are working on n products simultaneously.
The instruction pipeline works in the same way: instructions are processed in an overlapped manner so as to exploit parallelism at the instruction level. The execution cycle of a typical instruction includes four phases: fetch, decode, execute and write-back.

The instruction pipeline processes an instruction the way an assembly line processes a product. The first stage fetches the instruction, the second decodes it, the third executes it, and the fourth stores or writes the result back to memory. This streamlined execution of instructions in an overlapped fashion characterizes the instruction pipeline.

 
A pipeline cycle is defined as the time required for each phase to complete its operation, assuming equal delay at all stages. The basic properties and definitions associated with the instruction pipeline are:

i) Instruction pipeline cycle: the clock period of the instruction pipeline.
ii) Instruction issue latency: the time (in cycles) required between issuing two adjacent instructions.
iii) Instruction issue rate: the number of instructions issued per cycle, also called the degree of a superscalar processor.
iv) Resource conflict: the situation where two or more instructions demand use of the same functional unit at the same time.
v) Simple/complex operation latency: the simple-operation latency is the latency involved in executing simple operations; similarly, the latency involved in executing complex operations is the complex-operation latency. Instruction pipelines are used in scalar, superscalar, super-pipelined and vector supercomputers.
The speedup of a k-stage pipeline is given by: Sn = n T1 / [(n + k − 1) Tk]

where T1 = non-pipelined instruction processing time,
Tk = instruction processing time (per stage) in the k-stage pipeline, and
n = number of instructions.

The pipeline encounters a problem when an instruction fetch requires more than one cycle: this slows down the pipeline. If a cache is present, it has to keep data and instructions separate to avoid conflicts between different stages of the pipeline.
Another problem arises from branch instructions, which cause a jump to an instruction that is not the next one (and may not be available in the pipeline).

Arithmetic Pipeline

As we know, in a pipeline the different stages perform different sub-tasks to produce a final result on an overlapped basis. In an arithmetic pipeline, operations such as addition, multiplication, division and subtraction are done in a pipelined fashion; the different stages of an arithmetic pipeline can be an adder, a multiplier, and so on. Pipelining does not shorten any single operation, but it increases the throughput on the whole.

Add, subtract, multiply and divide are the basic operations; other complex operations such as power calculation, trigonometric functions and floating-point operations are also performed on a pipelined basis in an arithmetic pipeline. Pipelining in scalar arithmetic pipelines is controlled by software loops, while a vector arithmetic pipeline is designed in hardware and controlled by firmware or hardwired control. Vector hardware pipelines are built as add-ons to a scalar processor, driven by a control processor.

Arithmetic pipelines also use shift registers and carry look-ahead logic to execute operations fast. An arithmetic pipeline may be static or dynamic; a dynamic arithmetic pipeline is reconfigurable dynamically according to the arithmetic operation that is required.
Consider the following code:

for (i = 1; i <= 100; i++)          This loop, when executed in a non-pipelined fashion, will take
    A[i] = B[i] * C[i] + D[i];      more time than when done through an arithmetic pipeline.

 

While the first stage is multiplying the two operands B[i] and C[i], the second stage is concurrently adding D[i−1] to the product of B[i−1] and C[i−1].
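A software model of that two-stage overlap (purely illustrative; a real arithmetic pipeline does this in hardware): in each simulated cycle the adder finishes element i−1 using the product latched in the previous cycle, while the multiplier starts on element i:

#include <stdio.h>

#define N 100

int main(void)
{
    double A[N + 1], B[N + 1], C[N + 1], D[N + 1], product = 0.0;

    for (int i = 1; i <= N; i++) { B[i] = i; C[i] = 2 * i; D[i] = 1; }

    for (int cycle = 1; cycle <= N + 1; cycle++) {
        if (cycle > 1)
            A[cycle - 1] = product + D[cycle - 1];   /* stage 2: add D[i-1] to latched product */
        if (cycle <= N)
            product = B[cycle] * C[cycle];           /* stage 1: multiply B[i]*C[i]            */
    }
    printf("A[1]=%.0f A[100]=%.0f\n", A[1], A[100]); /* 3 and 20001                            */
    return 0;
}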

The speedup of an arithmetic pipeline is given as Sn = n T1 / [(n + k − 1) Tk]

where T1 = non-pipelined processing time, Tk = per-stage processing time in the k-stage pipeline, and n = number of operations.

A dynamic arithmetic pipeline consists of feed-forward and feedback connections between different stages, which allow it to perform specialized operations.

For example, a dynamic arithmetic pipeline consisting of 3 stages can perform addition, subtraction and multiplication.

VECTOR PIPELINE

A vector pipeline executes vector instructions in an overlapped mode. Vector pipelines can be attached to any scalar processor, whether it is superscalar, super-pipelined or both. Dedicated vector pipelines eliminate some overhead in loop control; of course, the effectiveness of a vector processor relies on the capability of an optimizing compiler that vectorizes sequential code for vector pipelining.

 
Q11. What do you mean by latency hiding? Discuss various types of latency hiding approaches used in
scalable multiprocessors.

LATENCY HIDING TECHNIQUE

Access to remote memory may significantly increase memory latency. Furthermore, processor speed is increasing at a much faster rate than memory and interconnection-network speed. Thus any scalable multiprocessor or large-scale multicomputer must rely on the use of latency-reducing, latency-tolerating or latency-hiding mechanisms.

There are four types of latency-hiding mechanisms:-

1. Pre-fetching Technique
2. Coherent Caches
3. Release Memory Consistency
4. Multiple Contexts

Pre-fetching Technique

Pre-fetching uses knowledge of the expected misses in a program to move the corresponding data close to the processor before it is actually needed. Pre-fetching can be classified according to whether it is binding or non-binding, and whether it is controlled by hardware or software.
With binding pre-fetching, the value of a later reference (e.g. a register load) is bound at the time the pre-fetch completes; if the location is modified before it is actually used, the bound value becomes stale, so binding pre-fetching may result in a significant loss of performance.

In contrast, non-binding pre-fetching also brings the data close to the processor, but the data remains visible to the cache-coherence protocol and is thus kept consistent until the processor actually reads the value.
Hardware-controlled pre-fetching includes schemes such as long cache lines and instruction look-ahead; instruction look-ahead is limited by branches and the finite look-ahead buffer size.
With software-controlled pre-fetching, explicit pre-fetch instructions are issued.

Advantage:-

The benefits of pre-fetching come from several sources. The most obvious benefit occurs when a pre-fetch is issued early enough in the code that the line is already in the cache by the time it is referenced.
Pre-fetching offers another benefit in multiprocessors that use an ownership-based cache-coherence protocol.

Disadvantage:-

The disadvantages of software control include the extra instruction overhead required to generate the pre-fetches, as well as the need for sophisticated software intervention. The discussion here concentrates on non-binding, software-controlled pre-fetching.
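A sketch of software-controlled, non-binding pre-fetching using the __builtin_prefetch intrinsic available in GCC and Clang (the look-ahead distance is an illustrative tuning parameter; other compilers would need a different intrinsic):

#include <stddef.h>

#define PREFETCH_AHEAD 8

double sum_array(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 1);  /* issue early, read-only hint */
        sum += a[i];    /* the prefetched line stays under the coherence protocol, so the
                           value read here is still consistent (non-binding)               */
    }
    return sum;
}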

Coherent Caches

While the coherence problem is easily solved for small bus-based multiprocessors through the use of snoopy coherence protocols, the problem is much more complicated for large-scale multiprocessors that use a general interconnection network.

Example - Dash experience: the benefit is evaluated when both private and shared read-write data are cacheable, as allowed by the Dash hardware-coherent caches, versus the case where only private data are cacheable. Measurements present a breakdown of the normalized execution times with and without caching of shared data for each of the applications; private data are cached in both cases.

 
Release Memory Consistency

The release consistency (RC) model was introduced by Gharachorloo et al. (1990). Release consistency requires that synchronization accesses in the program be identified and classified as either acquires (e.g. locks) or releases (e.g. unlocks). An acquire is a load operation (which can be part of a read-modify-write) that gains permission to access a set of data, while a release is a write operation that gives away such permission. This information is used to provide flexibility in the buffering and pipelining of accesses between synchronization points.

Advantages:

The main advantage of the release-consistency model is the potential for increased performance by hiding as much write latency as possible.

Disadvantage

The main disadvantage is increased hardware complexity and a more complex programming model.

Multiple Contexts

A conventional single-threaded processor will wait during a remote reference, so we may say it is idle for a period of time L. A multithreaded processor, as modelled here, will suspend the current context and switch to another, so after some fixed number of cycles it will again be busy doing useful work, even though the remote reference is still outstanding. Only if all the contexts are suspended will the processor be idle.
Clearly, the objective is to maximize the fraction of time that the processor is busy, so we will use the efficiency of the processor as our performance index, given by
Efficiency = busy / (busy + switching + idle)
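A trivial sketch applying this efficiency formula to illustrative cycle counts accumulated over some measurement interval:

#include <stdio.h>

int main(void)
{
    double busy = 800000, switching = 50000, idle = 150000;    /* cycles (made-up values) */
    double efficiency = busy / (busy + switching + idle);
    printf("processor efficiency = %.2f\n", efficiency);        /* 0.80 */
    return 0;
}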

Q12. What do you mean by multithreading in multiprocessing system? Discuss the principles of
multithreading.

Multithreading is the execution of multiple threads simultaneously. It demands that the processor be designed to handle multiple contexts simultaneously on a context-switching basis. A multithreaded MPP (massively parallel processor) system is modelled by a network of processor and memory nodes; the distributed memories form a global address space. Four machine parameters are defined below to model the performance of this network.

(Fig: processor and memory nodes connected by an interconnect, with a communication latency on remote accesses.)

Latency (L)

This is the communication latency on a remote memory access. Its value depends upon the following:-

● Network delay
● Cache-miss penalty
● Other cache delays, such as contention in split transactions

The number of threads (N)

This is the number of threads that can be interleaved in each processor. A thread is represented by a context consisting of a program counter, a register set and the required context status words.

The context switching overhead (C)

This refers to the cycles lost in performing a context switch in a processor. This time depends on the switch mechanism and the amount of processor state devoted to maintaining active contexts.
 
Context switching policy:-
Different multithreaded architectures are distinguished by the context-switching policies they adopt. The following types of switching policy are used.

1. Switch on cache miss
This policy corresponds to the case where a context is preempted when it causes a cache miss. In this policy R is the average interval between cache misses (in cycles), and L is the time required to satisfy the miss.
2. Switch on every load
This policy allows switching on every load, independent of whether it will cause a miss or not. In this case R represents the average interval between loads. In the general model the processor is blocked for L cycles after every switch, but in the case of a switch-on-load processor this happens only if the load causes a cache miss.
3. Switch on every instruction
This policy allows switching on every instruction, independent of whether it is a load or not. In other words, it interleaves the instructions from different threads on a cyclic basis. Successive instructions become independent, which benefits pipelined execution. Trace-driven experiments at Stanford showed that cycle-by-cycle interleaving of contexts provides a performance advantage over switching on a cache miss, because the context interleaving can hide pipeline dependences and reduce the context-switch cost.
4. Switch on a block of instructions
This policy switches after a block of instructions; blocks of instructions from different threads are interleaved. This improves the cache hit ratio due to locality.

The interval between switches (R)

This refers to the number of cycles between switches triggered by remote references; it is the inverse of the rate of remote accesses. It reflects a combination of program behaviour and memory-system design.
A multithreaded computation starts with a sequential thread, followed by supervisory scheduling in which the processors begin threads of computation, by inter-computer messages when the computer has a distributed memory, and by synchronization prior to beginning the next unit of parallel work. The communication overhead inherent in distributed-memory structures is usually spread throughout the computation and is possibly completely overlapped. Message-passing overhead in multicomputers can be reduced by specialized hardware operating in parallel with computation.
Massively parallel processors operate asynchronously in a network environment. This asynchrony raises two fundamental latency problems: remote loads and synchronizing loads.
The solution to the asynchrony problem is to multiplex among many threads: when one thread issues a remote-load request, the processor begins working on another thread, and so on. As the inter-node latency increases, more threads are needed to hide it effectively. If thread t1 issues a remote load to thread t2, which also issues a remote load, the responses may not return in the same order. This problem is resolved by associating each remote load and response with an identifier for the appropriate thread. These thread identifiers are referred to as continuations on messages. A large continuation name space should be provided to name an adequate number of threads waiting for remote responses.
A multithreaded processor will suspend the current context and switch to another, so after some fixed number of cycles it will again be busy doing useful work even though the remote reference is outstanding. The basic idea behind a multithreaded machine is to interleave the execution of several contexts in order to dramatically reduce the idle time of the processors without increasing the context-switching time. Multithreaded systems are constructed with multiple-context processors. The objective is to maximize the fraction of time that the processor is busy, so the efficiency of the processor is given by efficiency = busy / (busy + switching + idle), where busy, switching and idle represent the amounts of time, measured over some large interval, spent in the corresponding processor states. The state of the processor is determined by the disposition of the various contexts on the processor. A context cycles during its lifetime through the following states: ready, running, leaving, and blocked. There can be at most one context in the running or leaving state.
A processor is busy if there is a context in the running state. It is switching while making the transition from one context to another. Otherwise, if all contexts are blocked, the processor is idle.
A running context keeps the processor busy until it issues an operation that requires a context switch. The context then spends C cycles in the leaving state, then goes into the blocked state for L cycles, and finally re-enters the ready state. Eventually the processor will choose it again and the cycle will start over.
 
Q13. Write notes on:
(a) Memory hierarchy (d) Data flow computer (i) Parallelism in uniprocessor
(j) Tera multiprocessor system.

Data flow computer

Data flow computers have the potential for exploiting all the parallelism available in a program. Since execution is driven only by the availability of operands at the inputs to the functional units, there is no need for a program counter in the architecture, and its parallelism is limited only by the actual data dependences in the application program. While the data flow concept offers the potential of high performance, the performance of an actual data flow implementation can be restricted by the limited number of functional units, limited memory bandwidth, and the need to associatively match pending operations with available functional units.
There are two types of data flow computers:-

1. Static

A static data flow computer allows at most one token to reside on any one arc. A node is enabled as soon as tokens are present on all of its input arcs and there is no token on any of its output arcs.
2. Dynamic

In a dynamic dataflow architecture, each data token is tagged with a context descriptor, called a tagged token. A node is enabled as soon as tokens with identical tags are present at each of its inputs. Tag matching becomes necessary, therefore special hardware mechanisms are needed to achieve it.

CONCLUSION:-
Since only true data dependences exist in a dataflow graph, no unnecessary sequentialization is forced, and the computer schedules instructions according to the availability of operands. Values (tokens) are passed directly between instructions rather than through shared memory locations. An instruction fires when tokens are found on all of its inputs; it consumes the input tokens, computes an output value based on the input values, and produces tokens on its outputs. No further restriction on instruction ordering is imposed, and no side effects are produced by the execution of instructions in dataflow computers.
Example: the MIT TTDA (Tagged Token Dataflow Architecture).

Tera multiprocessor system

This system consists of 256 processors, 512 memory units, 256 I/O cache units, 256 I/O processors, 4096 interconnection-network nodes, and a clock period of less than 3 ns. It achieves high speed by exploiting parallelism at all levels, from operation-level parallelism within program basic blocks to multi-user time and space sharing. There are no register or memory addressing constraints and only three addressing modes. Condition-code setting is consistent and orthogonal. Because the architecture permits the free exchange of spatial and temporal locality for parallelism, a highly optimizing compiler may work hard at improving locality and trade the parallelism thereby saved for more speed.
The interconnection network of one 256-processor Tera system contains 4096 nodes arranged in a 16x16x16 toroidal mesh. Of the 4096 nodes, 1280 are attached to the resources comprising 256 cache units and 256 I/O processors. The 2816 remaining nodes do not have resources attached but still provide message bandwidth.
Each processor in the Tera computer can execute multiple instruction streams (threads) simultaneously. On every tick of the clock the processor logic selects a thread that is ready to execute and allows it to issue its next instruction. Since instruction interpretation is completely pipelined by the processor and by the network and memories, a new instruction from a different thread may be issued on each tick without interfering with its predecessors. When an instruction finishes, the thread to which it belongs becomes ready to execute its next instruction. Context switching is so rapid that the processor has no time to swap processor state; instead, each thread has its own stream status word (SSW), thirty-two 64-bit general-purpose registers (R0-R31), and eight 64-bit target registers (T0-T7). There are 128 copies of each per processor, that is, 128 SSWs, 4096 general-purpose registers, and 1024 target registers. Program addresses are 32 bits in length.
The Tera architecture uses explicit-dependence look-ahead. Each instruction contains a 3-bit look-ahead field that explicitly specifies how many instructions from this thread will be issued before an instruction that depends on the current one; at most 8 instructions and 24 operations can be concurrently executing from each thread.
 
Parallelism in Uni-processor

A computer system achieves parallelism when it performs two or more tasks simultaneously. In computer design this is generally understood to mean that the tasks are not related to each other; for example, the two micro-operations in FETCH2: DR←M, PC←PC+1 occur in the same instruction cycle, but this is not considered parallel processing. Parallel processing is generally attributed to supercomputers, i.e. systems with many CPUs.
However, a uniprocessor system can also exhibit some sort of parallel processing.

The simplest form of parallelism in a uniprocessor is the parallel transfer of data from one module to another over a parallel bus, rather than in serial fashion.

The second simplest example is the parallel adder, which uses techniques such as carry-look-ahead and carry-save and thereby also exhibits parallelism.

Similarly, the execution of floating-point operations in a co-processor or floating-point unit, alongside parallel operations in the integer unit, also exhibits parallelism in a uniprocessor.

I/O processors, which perform I/O operations while the CPU is executing other instructions, also display parallel behaviour in a uniprocessor system.

Similar to the I/O processor, the DMA controller can also exhibit parallelism if it works in transparent mode rather than cycle-stealing or burst mode: DMA handles the memory transfers while the CPU is busy executing other instructions. This also provides a form of parallelism.

Also, in a uniprocessor system multiple simultaneous memory accesses can be performed. This requires multiple address, data and control buses, one for each simultaneous memory access.

A multiport memory, which has two sets of address, data and control pins, allows two simultaneous data transfers to occur. Thus the DMA controller and the CPU can use the memory concurrently, as long as they are not both writing to the same memory location.

A multithreaded architecture can also show parallel behaviour in a uniprocessor: the processor switches among threads to keep the CPU busy.

Nowadays multi-core CPUs also exhibit parallelism, since the different cores can work simultaneously.

Pipelining can also help to achieve parallelism in a uniprocessor: instructions can be executed in a pipelined way, and arithmetic operations can likewise be performed through pipelining, providing another path to parallelism in a uniprocessor.
 
Q14. Explain the properties of the memory hierarchy. What are the issues related to planning memory capacity?

MEMORY HIERARCHY

The arrangement of memory devices such as registers, caches, main memory, disk devices and tapes in a hierarchical manner, characterized by five parameters and with each level placed according to its proximity to the CPU, is called the memory hierarchy.

The five parameters by which the memory hierarchy units are measured are:-

1. Access time (ti)          ns, ms, min      The time between presenting the address and getting the valid data
2. Memory size (si)          KB, MB, GB, TB   The size of the memory in kilo-, mega-, giga- or tera-bytes
3. Cost per byte (ci)        cents/KB         The cost of building the memory
4. Transfer bandwidth (bi)   MB/s             The rate of data movement between levels
5. Unit of transfer (xi)     bytes/blocks     The grain size of data transfer between adjacent levels

The access time ti refers to the round-trip time from the CPU to the ith-level memory. The memory size si is the number of bytes or words in level i. The cost of the ith-level memory is estimated by the product ci·si. The bandwidth bi refers to the rate at which information is transferred between adjacent levels. The unit of transfer xi refers to the grain size for data transfer between levels i and i+1.
Memory devices at a lower level are faster to access, smaller in size, more expensive per byte, have a higher bandwidth and use a smaller unit of transfer, as compared with those at a higher level. In other words, ti-1 < ti, si-1 < si, ci-1 > ci, bi-1 > bi, and xi-1 < xi, for i = 1, 2, 3 and 4 in the hierarchy, where i = 0 corresponds to the CPU register level.
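A small sketch of the standard effective-access-time calculation across such a hierarchy (the hit ratios and access times below are illustrative, not taken from the text):

#include <stdio.h>

int main(void)
{
    double t[3] = { 10e-9, 100e-9, 10e-3 };   /* ti: access time of level i (cache, main, disk) */
    double h[3] = { 0.95,  0.999,  1.0   };   /* hi: hit ratio at level i                        */

    double teff = 0.0, miss_above = 1.0;
    for (int i = 0; i < 3; i++) {
        teff += miss_above * h[i] * t[i];     /* level i is reached only on misses above it      */
        miss_above *= (1.0 - h[i]);
    }
    printf("effective access time = %.3e s\n", teff);
    return 0;
}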

Registers and Caches

The registers are the internal CPU memory, directly controlled by the CPU after decoding of instructions. Register transfer is conducted at processor speed, in one clock cycle.
Caches are intermediate memories residing between the CPU registers and main memory, used to speed up processing. Caches are controlled by the MMU and are transparent to the programmer.

Main Memory

Main memory is semiconductor memory, larger but slower than the cache, often implemented with RAM chips. Main memory is managed by the MMU in cooperation with the OS, and its size can be expanded.

Disk Drives & Tape Units

Disk drives and tape units are handled by the OS with limited user intervention. They are used for storing information needed in the future, such as programs and data. Magnetic tapes are off-line memory used for backup storage.

Peripheral technology

Besides disk drives and tape units, peripheral devices include printers, plotters, terminals, monitors, graphic displays, optical scanners, image digitizers, output microfilm devices, etc. Some I/O devices are tied to special-purpose or multimedia applications.

 
PROPERTIES OF MEMORY HIERARCHY
Information stored in a memory hierarchy (M1, M2, …, Mn)
satisfies three important properties:
1. Inclusion Property
2. Coherence Property
3. Locality of Reference Property
Inclusion Property
The inclusion property is stated as M1 ⊂ M2 ⊂ M3 ⊂ … ⊂ Mn.
The set inclusion relationship implies that all information
items are originally stored in the outermost level Mn. During
the processing, subsets of Mn are copied into Mn-1. Similarly
subsets of Mn-1 are copied into Mn-2, and so on. If an information
word is found in Mi, then copies of the same word can also be
found in all upper levels Mi+1, Mi+2, …, Mn. The highest level is
the backup storage, where everything is found.
Information transfer between the CPU and cache is in
terms of words. The cache (M1) is divided into cache blocks,
also called cache lines. Blocks are the units of data transfer between the cache and main memory.
The main memory (M2) is divided into pages, each page containing N blocks. Pages are the units
of information transferred between disks and main memory.
Scattered pages are organized as a segment in the disk memory. Data transfer between the
disk and tape unit is handled at the file level.
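These different grain sizes (words within a block, blocks within a page, pages within the address space) can be pictured with a small address-decomposition sketch. The field widths below (4-byte words, 64-byte blocks, 4 KB pages) are assumptions chosen only for the illustration.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed sizes for the illustration: 4-byte words, 64-byte blocks,
         * 4 KB pages; none of these widths come from the text above.        */
        unsigned addr        = 0x12345678u;
        unsigned byte_offset = addr & 0x3u;          /* byte within a word   */
        unsigned word_index  = (addr >> 2) & 0xFu;   /* word within a block  */
        unsigned block_index = (addr >> 6) & 0x3Fu;  /* block within a page  */
        unsigned page_number = addr >> 12;           /* page number          */

        printf("page %u, block %u, word %u, byte %u\n",
               page_number, block_index, word_index, byte_offset);
        return 0;
    }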

Coherence Property
The coherence property requires that copies of the same information item at successive levels
be consistent. If a word is modified in the cache, copies of that word must be updated, immediately or
eventually, at all higher levels. Frequently used information is often found at the lower levels in order
to minimize the effective access time of the memory hierarchy. The two strategies for
maintaining coherence in a memory hierarchy are:
The first method is called write-through (WT), which demands an immediate update in Mi+1 if a
word is modified in Mi, for i = 1, 2, …, n-1.
The second method is write-back (WB), which delays the update in Mi+1 until the word being
modified in Mi is replaced or removed from Mi.
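A minimal sketch of the two write policies is shown below. The tiny direct-mapped cache model (CACHE_LINES, the CacheLine structure, the simplified word-addressed backing store) is invented for this illustration and is not any particular processor's cache; it only shows when the next level of memory gets updated under each policy.

    #include <stdbool.h>
    #include <stdio.h>

    #define CACHE_LINES 16
    #define MEM_WORDS   256          /* simplified word-addressed backing store */

    typedef struct {
        unsigned tag;
        unsigned data;
        bool     valid;
        bool     dirty;              /* meaningful only for write-back */
    } CacheLine;

    static CacheLine cache[CACHE_LINES];
    static unsigned  memory[MEM_WORDS];

    /* Write-through (WT): every write updates the cache and memory at once. */
    void write_through(unsigned addr, unsigned value)
    {
        CacheLine *line = &cache[addr % CACHE_LINES];
        line->tag = addr; line->data = value; line->valid = true;
        memory[addr] = value;               /* immediate update in the next level */
    }

    /* Write-back (WB): memory is updated only when a dirty line is evicted. */
    void write_back(unsigned addr, unsigned value)
    {
        CacheLine *line = &cache[addr % CACHE_LINES];
        if (line->valid && line->dirty && line->tag != addr)
            memory[line->tag] = line->data; /* flush the evicted dirty line */
        line->tag = addr; line->data = value;
        line->valid = true; line->dirty = true; /* update in memory is delayed */
    }

    int main(void)
    {
        write_through(0x10, 111);  /* memory[0x10] holds 111 immediately             */
        write_back(0x20, 222);     /* memory[0x20] still 0 until the line is evicted */
        printf("mem[0x10]=%u  mem[0x20]=%u\n", memory[0x10], memory[0x20]);
        return 0;
    }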

Locality of Reference Property


The memory hierarchy is based on a program behaviour known as locality of reference.
Memory references are generated by the CPU for either instruction or data addresses. There are three
dimensions of the locality property: temporal, spatial and sequential. During the lifetime of a
software process, a number of pages are used dynamically, and the references to these pages vary from
time to time. These memory reference patterns are caused by the following locality properties:

Temporal Locality

The re-referencing of recently referenced items is caused by program constructs such as iterative loops,
process stacks, temporary variables and subroutines. Once a loop is entered or a subroutine is
called, a small code segment will be referenced repeatedly many times. Thus temporal locality
tends to cluster accesses in the recently used areas.
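For example, in the small loop below the variables sum and i, and the instructions of the loop body, are referenced again on every iteration, which is exactly the behaviour temporal locality describes.

    #include <stdio.h>

    int main(void)
    {
        int sum = 0;
        /* 'sum', 'i' and the loop instructions are re-referenced on every
         * iteration: recently used items are used again almost immediately. */
        for (int i = 0; i < 1000; i++)
            sum += i;
        printf("%d\n", sum);
        return 0;
    }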

Spatial Locality

This refers to the tendency for a process to access items whose addresses are near one another.
Program segments, such as routines and macros, tend to be stored in the same neighbourhood
of the memory space.

Sequential Locality

In typical programs, the execution of instructions follows a sequential order. The ratio of in-order
execution to out-of-order execution is roughly 5 to 1 in ordinary programs.
This sequentiality in program behaviour contributes to spatial locality, because sequentially
coded instructions and array elements are often stored in adjacent locations. Each type of
locality affects the design of the memory hierarchy.
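Spatial and sequential locality can be seen in a simple array traversal. In C the rows of a two-dimensional array are stored contiguously, so a row-major scan touches adjacent addresses one after another; the array size below is an arbitrary illustrative choice.

    #include <stdio.h>

    #define N 256

    int main(void)
    {
        static int a[N][N];
        long sum = 0;

        /* Row-major traversal: consecutive iterations touch adjacent
         * memory locations (spatial locality) in sequential order
         * (sequential locality), which is friendly to caches.        */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        printf("%ld\n", sum);
        return 0;
    }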
 