Computer Architecture
Ucb
Q1. Define Parallelism. Discuss various types of parallel processing mechanisms. List different
parallel processing computers.
Parallelism is said to be achieved when two or more unrelated, independent pieces of code or
modules run simultaneously in a computer system, whether uniprocessor or multiprocessor. In other
words, parallelism is the state of execution in which different pieces of code run parallel to each other,
or a single piece of code runs in parallel on different sets of data.
Parallel computing is a form of computation in which many calculations are carried out
simultaneously, operating on the principle that large problems can often be divided into smaller ones,
which are then solved concurrently ("in parallel"). There are several different forms of parallel computing:
bit-level, instruction-level, data-level, and task-level parallelism. Parallelism has been employed for
many years, mainly in high-performance computing, but interest in it has grown lately due to the
physical constraints preventing frequency scaling. As power consumption by computers has become a
concern in recent years, parallel computing has become the dominant paradigm in computer
architecture, mainly in the form of multicore processors.
Parallel processing models exist as an abstraction above hardware and memory architecture. There are
several programming models in common use. Some of them are:
• In the shared-memory programming model, tasks share a common address space, which they read
and write asynchronously.
• Various mechanisms such as locks/semaphores may be used to control access to shared memory.
• An advantage of this model from the programmer's point of view is that the notion of data
"ownership" is lacking, so there is no need to specify explicitly the communication of data
between tasks. Program development can often be simplified.
• An important disadvantage in terms of performance is that it becomes more difficult to understand
and manage data locality.
Implementation: -
a) On shared memory platforms, the native compilers translate user program variables into actual
memory addresses, which are global.
b) No common distributed platform Implementation currently exists.
Thread model: -
In the thread model of parallel processing, a single process can have multiple concurrent execution paths.
Threads are commonly associated with shared-memory architectures and operating systems.
Perhaps the simplest analogy that can be used to describe threads is the concept of a single
program that includes a number of subroutines.
Example –
a. The main program a.out is scheduled to run by the native operating system. a.out loads and
acquires all of the necessary system and user resources to run.
b. a.out performs some serial work, and then creates a number of tasks (threads) that can be
scheduled and run by the operating system concurrently.
c. Each thread has local data but also shares the entire resources of a.out. This saves the overhead
associated with replicating a program's resources for each thread. Each thread also benefits from a
global memory view because it shares the memory space of a.out.
d. A thread's work may best be described as a subroutine within the main program. Any thread can
execute any subroutine at the same time as other threads.
e. Threads communicate with each other through global memory (updating address locations). This
requires synchronization constructs to ensure that no two threads update the same global address
at the same time.
f. Threads can come and go, but a.out remains present to provide the necessary shared resources
until the application has completed.
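As an illustrative sketch of this model (not from the original notes), the following Python snippet plays the role of a.out: it creates several threads that share one global variable, and uses a lock as the synchronization construct described in point (e). The names counter and worker are hypothetical.

```python
import threading

counter = 0                      # global data shared by all threads (the resources of "a.out")
lock = threading.Lock()          # synchronization construct guarding the shared address

def worker(n):
    global counter
    for _ in range(n):
        with lock:               # ensure no two threads update the same global address at once
            counter += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()                    # threads are scheduled concurrently by the OS
for t in threads:
    t.join()                     # "a.out" remains until all threads complete
print(counter)                   # 4000
```

Without the lock, the increments could interleave and the final count could be lost, which is exactly the hazard point (e) warns about.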
Implementation: -
a) A library of sub routines that are called from within parallel source code.
b) A set of compiler directives embedded in either serial or parallel source code.
Message passing model: -
• A set of tasks use their own local memory during computation; multiple tasks can reside on the
same physical machine as well as across an arbitrary number of machines.
• Tasks exchange data by sending and receiving messages.
• Data transfer usually requires cooperative operations to be performed by each process: a send
must have a matching receive.
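A minimal sketch of this model (illustrative, not from the original notes) using Python's multiprocessing module: the child process computes in its own address space and communicates only by putting a message on a queue, which the parent receives cooperatively. The names worker and run are hypothetical.

```python
from multiprocessing import Process, Queue

def worker(q):
    # the task computes using its own local memory...
    partial = sum(range(100))
    q.put(partial)               # ...and communicates only by sending a message

def run():
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()                    # the task could equally run on another machine via sockets/MPI
    result = q.get()             # the matching receive completes the cooperative transfer
    p.join()
    return result

if __name__ == "__main__":
    print(run())                 # 4950
```

Unlike the shared-memory model, no data is visible to the other task until it is explicitly sent.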
Implementation: -
TYPES OF PARALLELISM
1. Bit-level parallelism
2. Instruction-level parallelism
3. Data parallelism
4. Task parallelism
Bit-level parallelism
From the advent of very-large-scale integration (VLSI) computer-chip fabrication technology in the 1970s
until about 1986, speed-up in computer architecture was driven by doubling computer word size—the
amount of information the processor can execute per cycle. Increasing the word size reduces the number
of instructions the processor must execute to perform an operation on variables whose sizes are greater
than the length of the word. For example, where an 8-bit processor must add two 16-bit integers, the
processor must first add the 8 lower-order bits from each integer using the standard addition
instruction, then add the 8 higher-order bits using an add-with-carry instruction and the carry bit from
the lower order addition; thus, an 8-bit processor requires two instructions to complete a single
operation, where a 16-bit processor would be able to complete the operation with a single instruction.
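The two-instruction sequence described above can be mimicked in Python (an illustrative sketch; the function name is hypothetical), restricting each arithmetic step to 8-bit quantities:

```python
def add16_on_8bit(a, b):
    """Add two 16-bit integers the way an 8-bit processor must:
    a standard add on the low bytes, then an add-with-carry on the high bytes."""
    lo = (a & 0xFF) + (b & 0xFF)                 # standard addition instruction (low-order bytes)
    carry = lo >> 8                              # carry bit out of the low-order addition
    hi = (a >> 8) + (b >> 8) + carry             # add-with-carry instruction (high-order bytes)
    return ((hi & 0xFF) << 8) | (lo & 0xFF)      # result truncated to 16 bits

print(hex(add16_on_8bit(0x12FF, 0x0001)))        # 0x1300
```

A 16-bit processor performs the same addition in the single line `a + b`.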
Historically, 4-bit microprocessors were replaced with 8-bit, then 16-bit, then 32-bit microprocessors.
This trend generally came to an end with the introduction of 32-bit processors, which were a
standard in general-purpose computing for two decades. Not until recently (c. 2003-2004), with the
advent of x86-64 architectures, did 64-bit processors become commonplace.
Instruction-level parallelism
Modern processors have multi-stage instruction pipelines. Each stage in the pipeline corresponds to a
different action the processor performs on that instruction in that stage; a processor with an N-stage
pipeline can have up to N different instructions at different stages of completion. The canonical example
of a pipelined processor is a RISC processor, with five stages: instruction fetch, decode, execute, memory
access, and write back. The Pentium 4 processor had a 35-stage pipeline.
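The benefit of an N-stage pipeline follows from a standard timing formula (the function name below is illustrative): once the pipeline is full, one instruction completes per cycle.

```python
def pipeline_time(n_stages, n_instructions):
    """Cycles to run k instructions on an n-stage pipeline with no stalls:
    n cycles to fill the pipeline, then one instruction completes per cycle."""
    return n_stages + n_instructions - 1

# a 5-stage RISC pipeline on 100 instructions, vs. 5 * 100 = 500 cycles unpipelined
print(pipeline_time(5, 100))     # 104
```

The speedup approaches the stage count N as the instruction stream grows long.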
Data parallelism
Data parallelism is parallelism inherent in program loops, which focuses on distributing the data across
different computing nodes to be processed in parallel. "Parallelizing loops often leads to similar (not
necessarily identical) operation sequences or functions being performed on elements of a large data
structure." Many scientific and engineering applications exhibit data parallelism.
A loop-carried dependency is the dependence of loop iteration on the output of one or more previous
iterations. Loop-carried dependencies prevent the parallelization of loops. For example, consider the
following pseudo code that computes the first few Fibonacci numbers:
PREV2 := 0
PREV1 := 1
CUR := 1
do:
    CUR := PREV1 + PREV2
    PREV2 := PREV1
    PREV1 := CUR
while (CUR < 10)
This loop cannot be parallelized because CUR depends on PREV1 and PREV2, which are
computed in each loop iteration. Since each iteration depends on the result of the previous one, the
iterations cannot be performed in parallel. As the size of a problem grows, the amount of data
parallelism available usually grows as well.
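By contrast, a loop with no loop-carried dependency parallelizes directly. In this illustrative Python sketch (names are hypothetical), the same operation is applied to every element, so the elements can be distributed across workers:

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x                 # the same operation, applied to every element

data = list(range(8))
# each iteration is independent (no loop-carried dependency),
# so the iterations can be distributed across computing workers
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, data))
print(results)                   # [0, 1, 4, 9, 16, 25, 36, 49]
```

The result is the same as the sequential loop, regardless of which worker handles which element.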
Task parallelism
Task parallelism is the characteristic of a parallel program that "entirely different calculations can be
performed on either the same or different sets of data". This contrasts with data parallelism, where the
same calculation is performed on the same or different sets of data. Task parallelism does not usually
scale with the size of a problem.
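A small illustrative sketch of task parallelism (the function names are hypothetical): two entirely different calculations run concurrently on the same data.

```python
from concurrent.futures import ThreadPoolExecutor

def word_count(text):            # one calculation...
    return len(text.split())

def char_count(text):            # ...and an entirely different one, on the same data
    return len(text)

text = "task parallelism runs different functions concurrently"
with ThreadPoolExecutor() as pool:
    f1 = pool.submit(word_count, text)
    f2 = pool.submit(char_count, text)
print(f1.result(), f2.result())
```

Note that adding more data does not create more tasks here, which is why task parallelism does not usually scale with problem size.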
List of parallel computers:- Cray-1, Cray-2, Blue Gene/L, Illiac IV (SIMD architecture), RIKEN MDGRAPE-3
There are different ways to classify parallel computers. One of the more widely used classifications, in use
since 1966, is called Flynn's Taxonomy.
• Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be
classified along the two independent dimensions of Instruction and Data. Each of these dimensions
can have only one of two possible states: Single or Multiple.
• Flynn's classification is based on the multiplicity of the instruction stream and the data stream in a
computer system. The sequence of instructions read from memory constitutes the instruction stream,
and the data they operate on in the processors constitutes the data stream.
• The table below defines the 4 possible classifications according to Flynn.
• SISD – Single Instruction, Single Data
• SIMD – Single Instruction, Multiple Data
• MISD – Multiple Instruction, Single Data
• MIMD – Multiple Instruction, Multiple Data
Single Instruction, Single Data ( SISD )
SISD stands for Single Instruction stream over a Single Data stream.
It represents the organization of a single computer containing a control unit, a processor unit and a
memory unit. Instructions are executed sequentially, and the system may or may not have internal
parallel processing capability. Parallel processing in this case may be achieved by means of multiple
functional units or by pipeline processing.
[Figure: SISD organization — the control unit (CU) issues an instruction stream (IS) to the processing
unit (PU), which exchanges a data stream (DS) with the memory unit (MU); I/O attaches to the CU.
IS = Instruction stream, DS = Data stream, CU = Control unit, PU = Processing unit, MU = Memory unit]
Single Instruction, Multiple Data ( SIMD )
SIMD stands for Single Instruction stream over Multiple Data streams.
SIMD represents an organization that includes many processing units under the supervision of a common
control unit. All processors receive the same instruction from the control unit but operate on different
items of data. The most common example is the execution of a for loop in which the same set of
instructions is executed on different sets of data.
[Figure: SIMD organization — the program is loaded from the host into the control unit (CU), which
broadcasts the instruction stream (IS) to processing elements PE1…PEn; each PE operates on a data
stream (DS) from its own local memory LM1…LMn, into which the data sets are loaded.
IS = Instruction stream, DS = Data stream, CU = Control unit, PE = Processing element, LM = Local memory]
• A type of parallel computer.
• Single instruction: all processing units execute the same instruction at any given clock cycle.
• Multiple data: each processing unit can operate on a different data element.
• This type of machine typically has an instruction dispatcher, a very high-bandwidth internal
network, and a very large array of very small-capacity instruction units.
• Best suited for specialized problems characterized by a high degree of regularity, such as image
processing.
• Synchronous (lockstep) and deterministic execution.
• Two varieties: Processor Arrays and Vector Pipelines.
Examples:
Processor Arrays: Connection Machine CM-2, MasPar MP-1, MP-2
Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820
Multiple Instruction, Single Data ( MISD )
[Figure: MISD organization — a shared memory (program and data) supplies instruction streams (IS)
to control units CU1, CU2, …, CUn. IS = Instruction stream, CU = Control unit]
Multiple Instruction, Multiple Data ( MIMD )
MIMD organization implies interaction among the n processors because all memory streams are derived
from the same data space shared by all processors. If the interaction between the processors is high,
the system is called tightly coupled (a shared-memory multiprocessor); otherwise it is called loosely
coupled (a networked system). Most multiprocessors fit into this category.
[Figure: MIMD organization — control units CU1…CUn issue instruction streams (IS) to processing
units PU1…PUn, which exchange data streams (DS) with a shared memory; I/O attaches to the PUs.
IS = Instruction stream, DS = Data stream, CU = Control unit, PU = Processing unit]
Examples:
Q3. Discuss various conditions of parallelism. Explain various levels of parallelism needed in designing
parallel programs.
Parallelism appears in various forms in a computing environment. Some of the key areas are
computation models for parallel computing, inter-processor communication in parallel architectures, and
system integration for incorporating parallel systems into a general computing environment. All forms of
parallelism can be characterized by levels of parallelism, computation granularity, time and space
complexity, communication latencies, scheduling policies, and load balancing.
Some of the important conditions of parallelism are:-
DATA and RESOURCE DEPENDENCY
The ability to execute several program segments in parallel requires each segment to be independent
of the other segments. There are various types of dependencies:-
A. Data Dependence
B. Control Dependence
C. Resource Dependence
D. Bernstein’s Condition
Data Dependence
The ordering relationship between statements is indicated by data dependence. These are of five
types as mentioned below:-
a. Flow dependence
b. Anti-dependence
c. Output dependence
d. I/O dependence
e. Unknown dependence
a) Flow dependence
A statement S2 is flow-dependent on statement S1 if an execution path exists from S1 to S2
and if at least one output of S1 feeds in as input to S2. It is denoted S1 → S2.
b) Anti-dependence
Statement S2 is anti-dependent on statement S1 if S2 follows S1 in program order and if the
output of S2 overlaps the input to S1. A direct arrow crossed with a bar is used to
represent anti-dependence from S1 to S2.
c) Output dependence
Two statements are output-dependent if they produce (write) the same output variable. It is
denoted by a similar arrow notation indicating output dependence from S1 to S2.
d) I/O dependence
Read and write are input/output statements. I/O dependence occurs not because the same
variable is involved but because the same file is referenced by both I/O statements.
e) Unknown dependence
The dependence relation between two statements can’t be determined in the following
situations:-
• The subscript of a variable is itself subscripted (indirect addressing )
• The subscript doesn't contain the loop index variable.
• The variable appears more than once with subscripts having different coefficients of the
loop variable.
• The subscript is nonlinear in the loop index variable.
NOTE: - When one or more of these conditions exists, a conservative assumption is to claim unknown
dependence among the statements involved.
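The first three dependence types can be seen in a short statement sequence (an illustrative sketch; the variable names and values are hypothetical):

```python
B, C = 2, 3
A = B + C        # S1
D = A * 2        # S2: flow-dependent on S1 (S2 reads the A that S1 wrote)
B = D - 1        # S3: anti-dependent on S1 (S3 writes the B that S1 read)
A = D + 4        # S4: output-dependent on S1 (S1 and S4 both write A)
print(A, B, D)   # 14 9 10
```

Reordering any of these statement pairs would change the result, which is exactly why the dependences constrain parallel execution.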
Control Dependence
This refers to the situation where the order of execution of statements can’t be determined before
run time. Different paths taken after a conditional branch may introduce or eliminate data
dependence among instructions. Dependence may also exist between operations performed in
successive iterations of a looping procedure. The successive iterations of the following loop are
control-independent:

Do 20 I = 1, N
    A(I) = C(I)
    IF (A(I) .LT. 0) A(I) = 1
20 Continue

whereas the successive iterations of the following loop are control-dependent, because iteration I
tests a value produced in iteration I-1:

Do 40 I = 1, N
    IF (A(I-1) .EQ. 0) A(I) = 0
40 Continue
Resource Dependence
Resource dependence is concerned with conflicts in using shared resources, such as
integer units, floating-point units, registers and memory areas, among parallel events. When
the conflicting resource is an ALU, it is called ALU dependence. When the conflicting resource is
work-place storage, it is called storage dependence. In the case of storage dependence, each task
must work on independent storage locations or use protected access to shared writable data. The
transformation of a sequentially coded program into a parallel executable form can be done
manually by the programmer using explicit parallelism, or by a compiler detecting implicit
parallelism automatically.
Program partitioning determines whether a given program can be partitioned or split into pieces
that can execute in parallel or must follow a certain pre-specified order of execution.
Bernstein’s Condition
Bernstein revealed a set of conditions under which two processes can execute in parallel. A
process is a software entity corresponding to the abstraction of a program fragment defined at
various processing levels. The input set Ii of a process Pi is the set of all input variables needed
to execute the process. Similarly, the output set Oi consists of all output variables generated after
execution of the process Pi. Consider two processes P1 and P2 with their input sets I1 and I2 and
output sets O1 and O2 respectively. These two processes can execute in parallel, denoted
P1 || P2, if they are independent and do not create confusing results:

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅        --- (i)
O1 ∩ O2 = ∅

These three conditions are known as Bernstein's conditions. The input set Ii is also called the read
set or the domain of Pi. The output set Oi has been called the write set or the range of the process
Pi. In terms of data dependencies, Bernstein's conditions simply imply that two processes can
execute in parallel if they are flow-independent, anti-independent, and output-independent.
The parallel execution of two such processes produces the same result regardless of whether they
are executed sequentially, in any order, or in parallel. This is possible only if the output of one
process is not used as input to the other. In general, a set of processes P1, P2, ……, Pk can execute
in parallel if Bernstein's conditions are satisfied on a pairwise basis; that is, P1 || P2 || P3 || …… || Pk
if and only if Pi || Pj for all i ≠ j.
P1 : C = D * E
P2 : M = G + C
P3 : A = B + C        --- (ii)
P4 : C = L + M
P5 : F = G / E
*In this program each statement requires one step to execute; no pipelining is considered here.
[Figure: a dependence graph showing both data dependence (solid arrows) and resource dependence
(dashed arrows) for this program]
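The pairwise test is mechanical, so it can be sketched directly in Python (illustrative code; the function name is hypothetical) using the input and output sets of the statements in program (ii):

```python
def bernstein_parallel(p1, p2):
    """Each process is (input set Ii, output set Oi). True iff all three of
    I1∩O2, I2∩O1 and O1∩O2 are empty (Bernstein's conditions (i))."""
    (i1, o1), (i2, o2) = p1, p2
    return not (i1 & o2) and not (i2 & o1) and not (o1 & o2)

# input/output sets of some statements in program (ii)
P1 = ({'D', 'E'}, {'C'})   # P1: C = D * E
P2 = ({'G', 'C'}, {'M'})   # P2: M = G + C
P3 = ({'B', 'C'}, {'A'})   # P3: A = B + C
P5 = ({'G', 'E'}, {'F'})   # P5: F = G / E

print(bernstein_parallel(P1, P2))   # False: P2 reads C, which P1 writes (flow dependence)
print(bernstein_parallel(P2, P3))   # True: flow-, anti- and output-independent
print(bernstein_parallel(P1, P5))   # True
```

The False case corresponds to a solid arrow in the dependence graph; every True pair can be scheduled in the same parallel step.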
Violation of any one or more of the three conditions in Eq. (i) prohibits parallelism between two
processes. In general, violation of any one or more of the 3n(n-1)/2 Bernstein conditions
among n processes prohibits parallelism collectively or partially. Statements or processes
which depend on run-time conditions are not transformed to parallel form; these include IF
statements and conditional branches. Recursion also prohibits parallelism. Data dependence,
control dependence, and resource dependence all prevent parallelism from being exploitable.
Statement-level dependence can be generalized to higher levels, such as the code segment,
subroutine, process, task and program levels.
HARDWARE and SOFTWARE PARALLELISM
Hardware Parallelism
Software Parallelism
Compiler techniques are used to exploit hardware features to improve performance. Interaction
between compiler and architecture design is a necessity in modern computer development. Most
existing processors issue one instruction per cycle and provide a few registers. This may cause
excessive spilling of temporary results from the available registers.
There exists a vicious cycle of limited hardware support and the use of a naïve compiler. To break
the cycle, one must design the compiler and the hardware jointly at the same time. Interaction
between the two can lead to a better solution to the mismatch problem between software and
hardware parallelism.
The general guideline is to increase the flexibility in hardware parallelism and to exploit software
parallelism in control-intensive programs. Hardware and software design tradeoffs also exist in terms
of cost, complexity, expandability, compatibility and performance. Compiling for multiprocessors is
much more involved than compiling for uniprocessors. Granularity and communication latency play
important roles in the code optimization and scheduling process.
Q4. What do you mean by granularity? To design the most efficient and optimal parallel program, which
type of granularity is most suitable and why? Take suitable assumptions to justify your answer.
Coarse: relatively large amounts of computational work are done between communication events.
Fine: relatively small amounts of computational work are done between communication events.
Parallelism is achieved at different program levels, as shown below, and thus different grain sizes
are required and considered efficient at different levels. Let us examine the three grain sizes:-
Fine-grain Parallelism
a) Relatively small amounts of computational work are done between communication events.
b) Low computation-to-communication ratio.
c) Usually implemented at the instruction level and loop level.

Medium-grain Parallelism
a) A typical medium grain contains less than 2000 instructions.
b) Relatively high amount of parallelism (if assisted by a good parallelizing compiler as well as a
programmer).
c) Usually implemented at the procedural level and subprogram level.
d) Relatively large amounts of computational work are done between communication events.
e) Often less communication is required.

[Figure: levels of parallelism and computational grain size — jobs or program level (coarse grain);
subprograms, job steps or related parts of a program (medium grain); instructions or statements
(fine grain)]
Coarse-grain Parallelism
NOTE:-
Q5. Explain various types of system inter-connect architecture with neat sketch.
The topology of an interconnection network can be either static or dynamic. Static networks are formed of
point-to-point direct connections which do not change during program execution. Static networks are
used for fixed connections among subsystems of a centralized system or among multiple computing nodes
of a distributed system.
Dynamic networks are implemented with switched channels or bus networks, which are dynamically
configured to match the communication demand. Dynamic networks include buses, crossbars, switches
and multi-stage networks, which are often used in shared-memory architectures.
PROPERTIES OF INTERCONNECTION NETWORK
A network is represented by a graph of a finite number of nodes linked by directed or undirected edges.
The number of nodes in the graph is called the network size. The number of edges (links or channels)
incident on a node is called the node degree; the degree of a node is the sum of its in-degree and
out-degree. The node degree reflects the number of I/O ports required per node. The diameter D of a
network is the maximum shortest path between any two nodes, measured by the number of links
traversed. The network diameter indicates the maximum number of distinct hops between any two
nodes, and is thus a figure of communication merit.
When the given network is cut into two equal halves, the minimum number of edges along the cut is
called the channel bisection width b. Each edge corresponds to a channel with w bit wires, so the wire
bisection width is B = b * w. The parameter B reflects the wire density of a network. When B is
fixed, the channel width is w = B/b bits. The wire length between nodes affects signal latency, clock
skew, and power requirements.
Data routing in a multicomputer network is achieved by message passing. Hardware routers are used
to route messages among multiple computer nodes. Data routing functions among PEs include
shifting, rotation, permutation (one-to-one), broadcast (one-to-all), multicast (many-to-many),
personalized communication (one-to-many), shuffle, exchange, etc.
• Permutation:- The set of all permutations forms a permutation group with respect to a composition
operation. The permutation π = (a, b, c)(d, e) stands for the bijection mapping
a → b, b → c, c → a, d → e, and e → d in a circular fashion. A crossbar switch can be used to
implement a permutation.
Perfect shuffle is a special permutation function for parallel processing applications. To shuffle n = 2^k
objects evenly, one can express each object in the domain by a k-bit binary number
x = (x_{k-1}, ……, x1, x0). The perfect shuffle maps x to y, where y = (x_{k-2}, ……, x1, x0, x_{k-1}) is
obtained from x by shifting 1 bit to the left and wrapping the most significant bit around to the least
significant position.
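The one-bit cyclic left shift can be sketched directly (illustrative code; the function name is hypothetical):

```python
def perfect_shuffle(x, k):
    """Map the k-bit number x to its perfect shuffle: a cyclic left shift by one bit,
    wrapping the most significant bit around to the least significant position."""
    msb = (x >> (k - 1)) & 1                     # bit x_{k-1}, about to wrap around
    return ((x << 1) & ((1 << k) - 1)) | msb     # shift left, keep k bits, append the msb

print(perfect_shuffle(0b101, 3))                 # 3 (101 -> 011)
```

For n = 8 (k = 3) this maps 0,1,2,…,7 to 0,2,4,6,1,3,5,7, interleaving the two halves of the sequence like a perfect card shuffle.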
HYPERCUBE ROUTING FUNCTION
A hypercube routing function can be represented by a three-dimensional binary cube network. Three
routing functions are defined by the 3 bits of the node address; for example, data can be exchanged
between adjacent nodes which differ in the least significant bit C0.
[Figure: a binary 3-cube with nodes 000–111; three routing functions are defined by the three bits of
the node address]
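Routing in an n-cube follows directly from this structure: correct the differing address bits one at a time, each step crossing one cube dimension. An illustrative sketch (the function name is hypothetical):

```python
def hypercube_route(src, dst, n):
    """Route from node src to node dst in an n-cube by flipping, one at a time,
    each bit in which the two addresses differ (routing functions C0..C(n-1))."""
    path = [src]
    cur = src
    for i in range(n):                 # dimension i corresponds to address bit Ci
        if (cur ^ dst) >> i & 1:       # addresses still differ in bit i
            cur ^= 1 << i              # move to the adjacent node across dimension i
            path.append(cur)
    return path

print(hypercube_route(0b000, 0b101, 3))   # [0, 1, 5]
```

The path length equals the number of differing bits (the Hamming distance), so the diameter of the n-cube is n.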
A broadcast is a one-to-all mapping. This can easily be achieved in an SIMD computer using a
broadcast bus extending from the array controller to all PEs. A message-passing multicomputer also
has mechanisms to broadcast messages. Multicast corresponds to a mapping from one subset to
another (many-to-many).
Personalized broadcast sends personalized messages to only selected receivers. It is often treated as a
global operation in a multicomputer.
1. Functionality
This refers to how the network supports data routing, interrupt handling, synchronization,
request/message combining, and coherence.
2. Network Latency
This refers to the worst-case time delay for a unit message to be transferred through the
network.
3. Bandwidth
This refers to the maximum data transfer rate, in terms of Mbytes/sec transmitted through the
network.
4. Hardware complexity
This refers to implementation cost such as wires, switches, connectors, arbitration and
interface logic.
5. Scalability
This refers to the ability of a network to be modularly expandable with a scalable performance
with increasing machine resources.
INTERCONNECTION NETWORK
As mentioned above, there are mainly two types of interconnection, static and dynamic, discussed
below in detail. Let us take an overview of the interconnection network taxonomy:-
1. Static Connection
a) 1-D
b) 2-D
c) Hypercube (HC)
2. Dynamic Connection
STATIC CONNECTION
Static networks use direct links which are fixed once built. This type of network is more suitable
for predictable communication patterns implementable with static connections. Several topologies
are compared below in terms of network parameters.
i. Linear Array
Linear arrays are the simplest connection topology. This is a one-dimensional network in which
N nodes are connected by N-1 links in a line. Internal nodes have degree 2 and the terminal
nodes have degree 1. A linear array allows concurrent use of different sections of the structure
by different source and destination pairs.
ii. Ring and Chordal Ring
A ring is obtained by connecting the two terminal nodes of a linear array with one extra link. A
ring can be unidirectional or bidirectional. It is symmetric with a constant node degree of 2. The
diameter is ⌊N/2⌋ for a bidirectional ring and N-1 for a unidirectional ring. By increasing the node
degree from 2 to 3 or 4, chordal rings are obtained. The more links added, the higher the node
degree and the shorter the network diameter. A completely connected network has a node degree
of N-1 with the shortest possible diameter of 1.
iii. Barrel Shifter
A barrel shifter is obtained from the ring by adding extra links from each node to
those nodes which have a distance equal to an integer power of 2.
This implies that node i is connected to node j if | j - i | = 2^r for
some r = 0, 1, 2, ……, n-1, where the network size is N = 2^n. Such a barrel
shifter has a node degree of d = 2n-1 and a diameter D = n/2.
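The neighbor set of a barrel-shifter node can be enumerated directly (an illustrative sketch; the function name is hypothetical):

```python
def barrel_shifter_neighbors(i, n):
    """Neighbors of node i in a barrel shifter of size N = 2**n:
    the nodes at distance 2**r, r = 0..n-1, in both directions around the ring."""
    N = 1 << n
    forward = {(i + (1 << r)) % N for r in range(n)}
    backward = {(i - (1 << r)) % N for r in range(n)}
    return sorted(forward | backward)

print(barrel_shifter_neighbors(0, 3))   # [1, 2, 4, 6, 7]
```

For n = 3 (N = 8), node 0 has the 2n-1 = 5 neighbors shown (the +4 and -4 links coincide), matching the node-degree formula above.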
iv. Star
The star is a two-level tree with a high node degree of d = N-1 and a small constant diameter of 2.
The star architecture has been used in systems with a centralized supervisor node.
v. Fat tree
vi. Mesh and Torus
Mesh network architecture has been implemented in the Illiac IV, MPP, DAP, CM-2 and Intel
Paragon, with variations. In general, a k-dimensional mesh with N = n^k nodes has an interior node
degree of 2k and a network diameter of k(n-1). A variation of the mesh forms the Illiac network
architecture; the Illiac network is topologically equivalent to a chordal ring. Its n*n mesh has a
diameter d = n-1, which is one half of the diameter of the pure mesh.
The torus has ring connections along each row and along each column of the array. An n*n
binary torus has a node degree of 4 and a diameter of 2⌊n/2⌋. The torus is a symmetric
topology.
k-ary n-Cube networks
DYNAMIC CONNECTION
For multipurpose or general-purpose applications, dynamic connections are used, which can
implement all communication patterns based on program demands. Switches or arbiters must be
used along the connecting path to provide dynamic connectivity. Dynamic connection networks
include bus systems, Multi-stage Interconnection Networks (MIN), and crossbar switch networks.
Their performance is indicated by the network bandwidth, data transfer rate, network latency and
the communication patterns supported.
i. Digital Buses
An a x b switch module has a inputs and b outputs. A binary switch corresponds to a 2 x 2 switch
module in which a = b = 2. In theory, a and b do not have to be equal; however, they are often
chosen as integer powers of 2, that is, a = b = 2^k for some k >= 1.
Several commonly used switch module sizes are 2x2, 4x4, and 8x8. Each input can be connected to
one or more of the outputs. In other words, one-to-one and one-to-many mappings are allowed,
but many-to-one mappings are not allowed due to conflicts at the output terminals. The numbers of
legitimate connection patterns for switch modules of various sizes are listed below:-
Multi-stage Interconnection Networks (MINs)
[Figure: an 8 x 8 multistage interconnection network connecting inputs 000–111 to outputs 000–111
through stages of 2 x 2 switches]
THE OMEGA NETWORK
The Omega network is one of several connection networks used in parallel machines. A small
but typical 8-input network illustrates the common attributes of such a network, which
include:
Since N=8 can be represented with a 3-bit address, it becomes obvious that this formula can be
implemented as a one-bit shift left with wrap-around. For example, let's start with number
5, or 101. Normally, after a shift, the msb or leftmost bit disappears, but we'll wrap it back to the
right side so that we end up with 011 — as per the formula above, a 3. Only if the number is at least
N/2 will it have a 1 in that bit position, which explains the plus 1 in the formula above.
Now let's go from processor #5 to #3. From the process above, we go to switch 3, but
then what? If we take an Xor (exclusive or) of our source and destination, 101 Xor 011, we end
up with 110, or 6. And this tells us that we need to perform a crossover at the first and second
switches. Notice that if we do another Xor with 5 and our result 6, we end up with 3 again. This will
always hold.
We know which switch we go to from the perfect-shuffle operation at each step, and when we cross
over. But do we cross up, or down? First, one can use the result of the shuffle operation and see
whether it is odd or even. Alternatively, just check the source bit pattern: a 1 says we go up and a 0
down, so for source 101 we have up-down-no change. If we do not cross, the implication is reversed:
a 0 means we traverse the switch along the top, a 1 along the bottom. Also see Gita's constructive
proof, which shows that using the Xor operation on a single bit after a shift left at each step will
automatically transform the source into the destination.
Realize that the shift is merely a way of intermingling the connections; it amounts to mere
bookkeeping. Performing the Xor a second time, as mentioned, will reverse the first two bits of 5 and
result in our desired 3. And notice that when a switch is locked in any position, no other data
transmission can take place through it, showing that this is a blocking network.
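The Xor rule described above can be sketched as a small routing function (illustrative code; the function name and the 'cross'/'straight' labels are hypothetical): each 1-bit of src Xor dst, taken msb first, marks a stage whose switch must cross.

```python
def omega_route(src, dst, n_bits):
    """Switch settings for one message in an Omega network with 2**n_bits ports.

    At each of the n_bits stages the address is perfect-shuffled and a 2x2
    switch is traversed; the Xor of source and destination, read msb first,
    tells which stages must cross and which go straight through."""
    diff = src ^ dst
    settings = []
    for i in range(n_bits - 1, -1, -1):          # msb corresponds to the first stage
        settings.append('cross' if (diff >> i) & 1 else 'straight')
    return settings

print(omega_route(5, 3, 3))   # ['cross', 'cross', 'straight']
```

For source 5 and destination 3 this reproduces the example above: 5 Xor 3 = 110, so the first and second switches cross and the third goes straight.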
THE CROSS-BAR NETWORK
The highest bandwidth and interconnection capability are provided by the crossbar network.
A crossbar network can be visualized as a single-stage switch network. Each crosspoint switch can
provide a dedicated connection path between a source-destination pair, and can be set on or off
dynamically upon program demand. To build a shared-memory multiprocessor, one can use a
crossbar network between the processors and the memory modules. This is essentially a
memory-access network. An inter-processor crossbar provides one-to-one permutation connections;
therefore the n x n crossbar connects at most n pairs at a time.
[Figure: an 8 x 8 crossbar network connecting processors P1–P8 through a grid of crosspoint switches]
Q6. Differentiate between scalar, superscalar and vector computer. Differentiate according to their
attributes.
An instruction set of a computer specifies the primitive commands or machine instructions that a
programmer can use in programming the machine.
The first microprocessors were simple, with very simple instruction sets. Gradually we moved towards
complex instruction sets as hardware cost dropped and software cost went up steadily. The semantic
gap between the hardware and high-level languages widened, so more and more functions were
hardwired into the processor, making instruction sets large and complex. Gradually two architectures
evolved, RISC and CISC. They differ in attributes such as:-
1. Instruction/data formats
2. Addressing modes
3. General purpose registers
4. Opcode specification
5. Flow control mechanism
6. Clock Rate & CPI
CISC
CISC is the abbreviation of Complex Instruction Set Computing. As the name suggests, the instruction set of a CISC computer is complex, involving more sub-operations per instruction. The general philosophy of designing a CISC processor is to implement instructions in hardware/firmware, which may result in shorter program length with lower software overhead.
CISC processors have micro-programmed control, with a unified cache for both instructions and data. Many HLL features are directly implemented in the micro-program control memory.
CISC uses a variable-length instruction format and can provide a single machine instruction for each HLL statement. It also provides memory-based operand manipulation. The CISC architecture poses some problems along with its advantages.
Disadvantages of CISC architecture:-
RISC
RISC is the abbreviation of Reduced Instruction Set Computing. A RISC computer has a smaller number of instructions, which are simple and of uniform length and format. A RISC processor issues one instruction per cycle. This makes RISC programs lengthy, and the compiler design becomes complex, but parallelizing the code is easy.
RISC processors have a hardwired control unit. This is more easily implemented, lets the processor run at a higher clock rate, and keeps the CPI low (around 1.5). The small control unit saves chip area, which is used to increase the number of GPRs; this reduces operand access time, and pipelining becomes practical in RISC. Separate I-cache and D-cache allow instruction and data accesses to follow separate paths, giving fewer conflicts, and fewer intermediate results need to be stored.
Disadvantages
A scalar processor is designed to issue one instruction per cycle, and only one instruction completion is expected per cycle. A CISC or RISC scalar processor can be improved with a superscalar or vector architecture.
In a superscalar processor, multiple instruction pipelines are used, multiple instructions are issued, and multiple results are generated per cycle.
Thus the effective CPI of a superscalar processor is lower than that of a scalar processor.
A base scalar processor issues instructions through four phases: Fetch, Decode, Execute and Write-back. In a base scalar processor the instruction issue rate (the issue degree) is 1, whereas in a superscalar processor it is greater than 1. Superscalar processors are designed to exploit more instruction-level parallelism in user programs. Only independent instructions can be executed in parallel without causing a wait state. The amount of instruction-level parallelism varies widely with the type of code being executed; the issue degree in a superscalar processor typically varies from 2 to 5.
In order to fully utilize a superscalar processor of degree n, n instructions must execute in parallel, and the simple-operation latency should be only one cycle. For a high degree of instruction-level parallelism, the processor relies on an optimizing compiler.
A superscalar machine that can issue a fixed-point, floating-point, load and branch instruction all in one cycle achieves the same parallelism as a vector machine. To achieve this parallelism, multiple instruction pipelines are used. The instruction cache supplies instructions for fetch, but the actual number of instructions issued to the various functional units may vary in each cycle, depending on the data dependences and resource conflicts among the instructions. Multiple functional units are built into the integer unit and the floating-point unit, and multiple data paths exist between the functional units. The integer unit (IU) and floating-point unit (FPU) are generally implemented on a single chip, with 32-bit registers in each unit. The high clock rate and low CPI make a superscalar processor outperform a scalar processor.
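The claim that superscalar issue lowers the effective CPI can be sketched numerically. The model below is a hypothetical, ideal one of my own (it ignores dependences, hazards and cache misses): a machine of issue degree m retires up to m independent instructions per cycle, so CPI approaches 1/m.

```python
import math

def effective_cpi(n_instructions, issue_degree):
    """Ideal effective CPI of a superscalar processor of the given
    issue degree: ceil(n/m) cycles for n independent instructions."""
    cycles = math.ceil(n_instructions / issue_degree)
    return cycles / n_instructions
```

A scalar processor (degree 1) gives CPI 1.0 on this model, while a degree-2 superscalar gives 0.5, in line with the text's statement that the superscalar CPI is lower than the scalar one.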
A vector is a set of scalar data items, all of the same type, stored in memory. Usually the vector elements are ordered to have a fixed addressing increment between successive elements, called the stride.
A vector processor is an ensemble of hardware resources, including vector registers, functional pipelines, processing elements and register counters, for performing vector operations.
A vector operation is performed when an arithmetic or logical operation is applied to vectors. A vector instruction involves a large array of operands; in other words, the same operation is performed over a string of data.
Vector processing thus differs from scalar processing, which operates on one datum or one pair of data at a time.
A vector processor is a coprocessor specially designed to perform vector instructions. A vector processor can have a register-to-register architecture or a memory-to-memory architecture, based on whether vector registers are used to interface between memory and the vector functional pipelines.
The register-to-register architecture uses shorter instructions and a vector register file.
The memory-to-memory architecture uses memory-based instructions, which are longer in length.
Vector processors take advantage of unrolled loop-level parallelism. Vector pipelines can be attached to any scalar, superscalar or super-pipelined processor. A dedicated vector pipeline eliminates some software overhead in loop control. Of course, the effectiveness of a vector processor relies on the capability of an optimizing compiler to convert scalar sequential code into vector pipelined code, i.e. to perform vectorization.
Vector processing is faster and more efficient than scalar processing. It reduces memory conflicts, and adheres to the pipelining goal of one result per clock cycle, continuously. A well-vectorized code can easily achieve a speedup of 10 to 20 times over scalar code.
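A minimal Python sketch of the two ideas above: strided vector access and one operation applied over a whole string of data. The function names are illustrative only, not taken from any vector instruction set.

```python
def vector_load(memory, base, stride, length):
    """Gather a vector whose elements sit a fixed stride apart in memory."""
    return [memory[base + i * stride] for i in range(length)]

def vector_multiply_add(B, C, D):
    """One 'vector instruction': the same multiply-add over every element."""
    return [b * c + d for b, c, d in zip(B, C, D)]
```

A scalar processor would execute the multiply-add once per loop iteration; a vector processor performs it over the whole operand string with a single instruction, which is where the quoted 10-20x speedup comes from.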
Q7. What do you mean by cache memory organization? Explain various types of cache mapping with
neat diagram.
Cache memory is constructed using static RAM (SRAM), which is faster than DRAM; its access time is of the order of 10 ns. It is more expensive than main memory. It is transparent to the programmer, since assignment to the cache cannot be controlled by the programmer. It is located closest to the microprocessor: the L1 cache is the cache memory incorporated in the processor chip itself, while the L2 cache usually sits outside the microprocessor on the board.
DIRECT MAPPING
Cache memory transfers are done in units of blocks. A block frame of the cache corresponds to a block of a main memory page.
This cache organization is based on direct mapping of the n/m memory blocks, repeated at equal distances: block Bj in main memory is mapped directly to the cache block frame Bi where i = j mod m.
This direct mapping technique is very easy to implement and requires no replacement policy, but it is very rigid: there is always exactly one block frame Bi into which each Bj can load.
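A hypothetical Python sketch of how a direct-mapped cache interprets an address as tag | index | offset; the field widths in the example are assumptions chosen for illustration.

```python
def split_address(addr, offset_bits, index_bits):
    """Direct mapping: a memory address divides into tag | index | offset.
    The index alone picks the unique frame a block can occupy."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset
```

With a 16-byte block (4 offset bits) and 256 frames (8 index bits), address 0x12345 maps rigidly to frame 0x34 with tag 0x12; a block whose index is 0x34 can live nowhere else, which is the rigidity noted above.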
Salient features of direct mapping cache organization:-
ASSOCIATIVE MAPPING
As is clear from the name, in the associative mapping technique the data can be associated with any block entry. An s-bit tag is needed in each cache block, to be compared in order to locate the block.
The m-way associative search requires the tag to be compared with the tags of all blocks. Associative mapping is flexible and exploits locality of reference well; thus the hit ratio rises and the average access time falls. It also offers the greatest flexibility in implementing block replacement policies for a higher hit ratio.
Fully associative search has one disadvantage, and that is its search hardware cost. Since the address must be compared with every tag, a parallel comparison is needed to achieve a fast search. This requires an associative memory, which is expensive, so fully associative caches are not widely used.
A main memory block can be loaded into any line of the cache
A memory address is interpreted as a tag and a word field
The tag field uniquely identifies a block of main memory
Each cache line’s tag is examined simultaneously to determine if a block is in cache
Associative Mapping Cache Organization:
SET-ASSOCIATIVE MAPPING
In a set-associative cache, block frames are further grouped into sets, each of which can contain k blocks.
This design is a middle way between direct mapping and fully associative mapping, and can give a high performance ratio if designed and implemented properly.
The m cache block frames are divided into v = m/k sets with k blocks per set. The set is identified by a d-bit set number, and the tag by the remaining s-d tag bits.
Set-associative mapping has many advantages. The search is easier than in a fully associative cache, since only the k blocks of one set have to be searched to find the data, and there can be more than one data block per set. The search is more economical, and the replacement policy can be more flexible and economical.
The set is identified first, and then the k blocks in that set are searched; k is generally taken as 2, 4, 8, 16 or 24, depending on cost and performance factors as well as cache size.
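The set-associative lookup can be sketched as follows. The data layout, a list of sets where each set is a list of [valid, tag] pairs, is my own illustration rather than a hardware description.

```python
def lookup(cache, addr, offset_bits, set_bits, k):
    """k-way set-associative lookup: the set number selects one set of
    k frames, and only those k tags are compared against the address."""
    set_no = (addr >> offset_bits) & ((1 << set_bits) - 1)
    tag = addr >> (offset_bits + set_bits)
    for way in range(k):
        valid, stored_tag = cache[set_no][way]
        if valid and stored_tag == tag:
            return True           # hit
    return False                  # miss: replacement policy picks a way
```

Only k comparisons are needed per access instead of one per frame, which is the economy over fully associative search described above.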
Q8. What do you mean by virtual memory? Explain memory allocation techniques used in physical
memory to allocate pages of virtual memory with neat sketch.
VIRTUAL MEMORY:-
Virtual memory, as the name suggests, is memory that does not exist in the physical memory but is used as if it did. In simple words, the apparent size of the physical memory is greatly increased by the use of virtual memory.
The main memory of a computer is very small compared to what we require in a multiprogramming and multitasking environment, yet every program must reside in physical memory to be able to run.
Also, most modern CPUs can address more memory than is generally installed as main memory.
For example, a 32-bit CPU can generate a 32-bit address, which can directly access 4 GB of memory, but we generally don't install 4 GB of main memory. We can take advantage of this and solve the aforesaid problem by using virtual memory.
Virtual memory makes it appear to the CPU that it has more memory than is actually present; this is achieved by backing main memory with disk storage (disk files or a swap disk).
Only the active portions of a program are required to reside in main memory; the other data and instructions are kept in the disk swap file and retrieved as they are needed.
Some mechanism is always needed to translate the virtual address generated by the CPU into a physical memory address. Each process gets its own virtual address space, and at run time its addresses are translated into physical addresses.
Private virtual memory:-
In the private virtual memory model, a virtual address space is associated with each processor, and each private virtual space is divided into pages. Virtual pages from different virtual spaces are mapped into the same physical memory, which is shared by all processors. The advantages of private virtual memories include the use of a small processor address space (32 bits), protection on each page or on a per-process basis, and the use of private memory maps, which require no locking.
Shared virtual memory:-
This model combines all the virtual address spaces into a single globally shared virtual space. Each processor is given a portion of the shared virtual memory in which to declare its addresses. The advantage of shared virtual memory is that all addresses are unique. However, each processor must be allowed to generate addresses larger than 32 bits, such as 46 bits for a 64-Tbyte (2^46-byte) address space. The page table must allow shared accesses, so mutual exclusion (locking) is needed to enforce protected access. Segmentation is built on top of the paging system to confine each process to its own address space. Global virtual memory makes the address translation process even longer.
PAGING
In the paging technique the whole address space (the logical addresses generated by the CPU, i.e. both the physical and the virtual memory) is partitioned into contiguous fixed-size blocks called pages.
Compilers generally create code such that a page contains either program instructions or data, but not both. Physical memory is divided into sections called frames; the page size is equal to the frame size.

Fig:- Pages of the virtual memory mapped through the page table onto frames of the physical memory (unmapped pages leave frames unused)

When the CPU requires data (not found in the cache) from main memory, it generates an address; the MMU translates this address into a physical address. If that address maps to a location in main memory, the data (a page) is sent to the cache in the form of a block. If it is not found in main memory, a page fault is generated, and the data is swapped in from the swap disk, replacing a page frame if necessary.
The MMU handles all the complexity of finding and replacing the required page with the help of a page table. The page table essentially contains page frame, virtual frame, valid, count, and dirty bit columns. In case of a page fault the process is suspended, and a context switch is made to another process while the missing page is brought into main memory and the page table is updated. The page table thus provides the mechanism of address translation and mapping from logical to physical addresses.
A page table can also be implemented at multiple levels to extend the page mapping, but this takes time. Paging also introduces the problem of internal fragmentation.
When the MMU maps a logical address to a physical address, it keeps track of present and not-present pages with the help of the page table. Suppose the CPU generates an address whose data it requires: the MMU refers to the page table, checks the page number, and generates the physical address by concatenating the low 12 bits of the address (the offset) with the frame number. But if an address translation is required thousands of times per second, this page-table walk takes time. The MMU therefore also maintains another table, called the TLB, generally implemented as an associative memory, holding recently used pages. It mainly contains page number, frame number and valid-bit columns. This eases address translation to simply concatenating the 12-bit offset with the frame number and passing the result to physical memory. If an entry is not found in the TLB, the page table is searched. Every entry in the TLB is also present in the page table, but the reverse is not true. Thus the TLB exploits the locality of reference property by storing the recently used addresses, speeding up the translation of addresses and the accessing of memory.
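The TLB-then-page-table translation described above can be sketched in Python. This assumes 4 KB pages (the 12-bit offset mentioned in the text); the dict-based tables are an illustration of the lookup logic, not how an MMU actually stores them.

```python
PAGE_BITS = 12                    # 4 KB pages -> 12-bit offset

def translate(vaddr, tlb, page_table):
    """Translate a virtual address: try the TLB first, then walk the
    page table; concatenate frame number with the 12-bit offset."""
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    frame = tlb.get(vpn)
    if frame is None:             # TLB miss: search the page table
        frame = page_table.get(vpn)
        if frame is None:
            raise LookupError("page fault: OS must swap the page in")
        tlb[vpn] = frame          # every TLB entry is also in the page table
    return (frame << PAGE_BITS) | offset
```

On a hit, translation is a pure concatenation of frame number and offset; only on a TLB miss is the slower page-table search needed, which is exactly the speedup the text attributes to locality of reference.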
SEGMENTATION
A segment is a logically self-contained unit of a program; segments can invoke each other. A segment, unlike a page, can vary in size.
The MMU manages a segmented memory differently. It uses a segment table as well as the TLB to do the address translation from logical to physical. A segment can start at any address, and its length can be of variable size. A segmented memory address is divided into two parts: one is the segment number, and the rest is used as the offset. Thus a segmented memory is arranged as two-dimensional: one dimension is the segment and the inner dimension is the offset. The offset addresses within each segment form one dimension of contiguous addresses; the segment addresses are not necessarily contiguous to each other and form the second dimension of the address space.

Fig:- Segmentation (the segment number selects a start address and size from the segment table; the start address of the segment is added to the offset to give the physical address, with a fault if the segment is not in memory and an error generated if offset >= size)

Segmentation brings a disadvantage with it: the start address found in the segment table or the TLB must be added to the offset, which is a heavier operation than the concatenation used in paging. Segmentation also introduces the problem of external fragmentation. Therefore a third way of implementation is used, namely paged segmentation.
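A minimal sketch of segmented address translation, showing the addition of base and offset and the limit check, in contrast to the concatenation used in paging. The table layout (segment number mapping to a (start, size) pair) is assumed for illustration.

```python
def seg_translate(seg_no, offset, seg_table):
    """Segmented translation: look up (start, size) for the segment,
    raise on a limit violation, otherwise ADD base and offset."""
    start, size = seg_table[seg_no]
    if offset >= size:
        raise MemoryError("error: offset exceeds segment size")
    return start + offset     # addition, heavier than paging's concatenation
```

The addition (rather than bit concatenation) and the per-access limit comparison are the extra costs the text attributes to segmentation.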
PAGED SEGMENTATION
In inverted paging, an inverted page table is created with one entry for each page frame that has been allocated to users. Any virtual page number can be paired with a given physical page number. Inverted page tables are accessed either by an associative search or by the use of a hashing function. The generation of a long virtual address from a short physical address is done with the help of segment registers: the leading 4 bits (S.Reg) of a 32-bit address name a segment register, and the register provides a segment id that replaces the 4-bit S.Reg field to form the long virtual address.
Q9. What is back plane bus system? Discuss various issues related to back plane bus system.
BUS
A bus is a parallel set of conductors used to carry data, address and control signals in a computer system. The bus system of a computer operates on a contention basis: several active devices, such as processors, may request use of the bus at the same time, but only one of them can be granted access at a time. The effective bandwidth available to each processor is inversely proportional to the number of processors contending for the bus.
The backplane is made of signal lines and connectors. A special bus controller board is used to house the backplane control logic, such as the system clock driver, arbiter, bus timer and power driver. The backplane bus is driven by a digital clock with a fixed cycle time called the bus cycle. The bus cycle is determined by the electrical, mechanical and packaging characteristics of the backplane. Signals travelling on bus lines may experience unequal delays from the source to the destination. Factors affecting the bus delay include the source's line drivers, the destination's receivers, slot capacitance, line length, and the bus loading effects (the number of boards attached).
To optimize performance, the bus should be designed to minimize the time required for request handling,
arbitration, addressing, and interrupts so that most bus cycles are used for useful data transfer
operations.
FUNCTIONAL MODULES
A functional module is a collection of electronic circuitry that resides on one functional board and works to achieve special bus control functions. The various functional modules are:-
i. Arbiter: An arbiter is a functional module that accepts bus requests from requester modules and grants control of the DTB to one requester at a time.
ii. Bus timer: It measures the time each data transfer takes on the DTB and terminates the DTB cycle if a transfer takes too long.
iii. Interrupter: This module generates an interrupt request and provides status/ID information when an interrupt handler module requests it.
iv. Location monitor: It monitors data transfers over the DTB. A power monitor watches the status of the power source and signals when it becomes unstable.
v. System clock driver: It provides a clock timing signal on the utility bus. In addition, board interface logic is needed to match the signal line impedance, propagation time, and termination values between the backplane and the plug-in boards.
Q10. Explain the concept of pipelining. Discuss various types of pipeline designs used in computer.
A pipeline is a collection of stages in which different independent instructions can execute in overlapped mode. The process of dividing a task into sub-tasks and executing them in this overlapped mode is called pipelining.
Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. It is a technique of decomposing a sequential process into sub-operations, with each sub-operation executed in a dedicated segment that operates concurrently with all the other segments.
A pipeline is analogous to the assembly line of an automobile factory. In an assembly line the assembly of a car is divided into different stages, each of which contributes to the assembly of the car. Each stage acts in parallel with the others, but obviously on a different part of a different car.
Similarly, a pipeline in a computer consists of different stages which operate on different sub-tasks, involving different functional units, and these stages act in parallel on the sub-tasks of different instructions or tasks.
A pipeline's performance is measured by factors such as the speedup, throughput and efficiency.
The total time required to process n tasks in a k-stage pipeline with a clock cycle of τ time is:

    Tk = [k + (n - 1)] τ

The time on an equivalent non-pipelined processor is T1 = n k τ, so the speedup is:

    Sk = T1 / Tk = n k τ / ([k + (n - 1)] τ) = n k / [k + (n - 1)]

Throughput is defined as the number of tasks completed per unit time:

    Hk = n / ([k + (n - 1)] τ)

Efficiency is the speedup per stage:

    Ek = Sk / k = n / [k + (n - 1)]
A pipeline should be designed so that each stage takes equal time and the stages are perfectly balanced. The instruction should be partitioned so that each stage takes an equal number of clock cycles and no stage becomes a bottleneck in the efficient working of the pipeline.
Thus pipelining is an implementation technique that exploits parallelism among the instructions in a sequential instruction stream.
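The formulas above can be checked with a small Python function; the parameter names are illustrative.

```python
def pipeline_metrics(k, n, tau):
    """Speedup, throughput and efficiency of a k-stage pipeline
    processing n tasks with clock period tau, per the formulas above."""
    t_k = (k + (n - 1)) * tau     # pipelined time Tk
    t_1 = n * k * tau             # non-pipelined time T1
    speedup = t_1 / t_k           # Sk
    throughput = n / t_k          # Hk
    efficiency = speedup / k      # Ek
    return speedup, throughput, efficiency
```

For k = 4 stages and n = 100 tasks, Sk = 400/103, which is about 3.88: as n grows, the speedup approaches the ideal value k, exactly as the expression n k / [k + (n - 1)] predicts.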
Fig:- Pipeline classifications, based on the nature of instruction/data and based on the execution nature
LINEAR PIPELINE
A linear pipeline processor is a cascade of processing stages which are linearly connected to perform a task over a stream of data. A linear pipeline requires proper subdivision of the task into proper and equal divisions.
A linear pipeline is constructed with k processing stages. Operands are fed into the pipeline at the first stage, and each stage Si passes its result to the next stage Si+1 for i = 1, 2, ..., k-1. The final result emerges from the last pipeline stage Sk.
Depending on the control of data flow along the pipeline, we model linear pipelines in two categories:-
1. Asynchronous Pipeline
2. Synchronous Pipeline
Asynchronous Pipeline
In this type of pipelining, communication and dataflow between the stages are carried out with the help of handshaking protocols. When one stage is ready to transmit, it sends a ready signal to the next stage, which, after receiving the data, sends back an acknowledgement signal.
Asynchronous pipelines are useful for designing communication channels in message-passing multicomputers, where pipelined wormhole routing is practised. An asynchronous pipeline may have a variable throughput rate, since different amounts of delay may be experienced in different stages.
Synchronous Pipeline
In a synchronous pipeline, all stages transfer data to the next stage synchronously on the arrival of a clock pulse, through latches (temporary storage) which serve as the interface between stages and store results temporarily.
The delays in the pipeline stages should be equal, so that a result is produced at the end of the pipeline on every clock cycle. This requires the data items to be independent of each other.
NON-LINEAR PIPELINE
In a nonlinear pipeline processor, the sequence of stages can be reconfigured dynamically to perform different functions at different points in time.
A dynamic pipeline allows feed-forward and feedback connections in addition to the streamline connections.
Function partitioning in a dynamic pipeline becomes quite involved because the pipeline stages are interconnected.
The feed-forward and feedback connections make the scheduling of successive events into the pipeline a non-trivial task. The output of the pipeline is not necessarily from the last stage; in fact, by following different dataflow patterns, one can use the same pipeline to evaluate different functions.
The utilization pattern of successive stages in such a pipeline is specified by a reservation table. The reservation table is essentially a space-time diagram depicting the precedence relationships in the use of the pipeline stages.
SCALAR PIPELINE
The scalar pipeline refers to the execution of scalar instructions in pipelined fashion: the different sub-tasks of an instruction, i.e. fetch, decode, execute and write-back, are done in a pipelined manner.
Scalar pipelines fall into two categories:-
1. Instruction Pipeline
2. Arithmetic Pipeline
Instruction Pipeline
The instruction pipeline is very similar to an assembly line, where each stage performs its task and passes the product to the next stage for further processing, and the final product appears at the end. As one product advances to the next stage, a new product enters the line, so ultimately the n stages of the assembly line are working on n products simultaneously.
The instruction pipeline works in the same way: instructions are processed in overlapped mode so as to exploit parallelism at the instruction level. The execution cycle of a typical instruction includes four phases: fetch, decode, execute and write-back.
The instruction pipeline processes an instruction the way the assembly line does. The first stage fetches the instruction, the second decodes it, the third executes it, and the fourth stores or writes the result back to memory. This streamlined, overlapped execution of instructions characterizes the instruction pipeline.
A pipeline cycle is defined as the time required for each phase to complete its operation, assuming equal delay at all stages. These are the basic properties and definitions associated with the instruction pipeline.
The pipeline runs into problems in situations where an instruction fetch requires more than one cycle, which slows the pipeline down. If a cache is present, it has to keep data and instructions separate to avoid conflicts between different stages of the pipeline.
Another problem arises from branch instructions, which cause a jump to an instruction that is not the next one (and may not be in the pipeline).
Arithmetic Pipeline
As we know, the different stages of a pipeline perform different sub-tasks to produce a final result on an overlapped basis. In an arithmetic pipeline, operations like addition, multiplication, division and subtraction are done in pipelined fashion; the different stages of an arithmetic pipeline can be an adder, a multiplier, and so on. Pipelining does not shorten any single operation; it increases the throughput of the operation stream as a whole.
Add, subtract, multiply and divide are the basic operations; other complex operations like power calculation, trigonometric functions and floating-point operations are also performed on a pipelined basis in the arithmetic pipeline. Pipelining in a scalar arithmetic pipeline is controlled by software loops, while a vector arithmetic pipeline is designed in hardware and controlled by firmware or hardwired control. Vector hardware pipelines are built as add-ons to a scalar processor, driven by a control processor.
Arithmetic pipelines also implement shift registers and carry look-ahead to execute instructions fast. An arithmetic pipeline may be static or dynamic; a dynamic arithmetic pipeline is reconfigurable dynamically according to the arithmetic operation to be performed.
Consider the following code:

for (i=1; i<=100; i++)
    A[i] = B[i]*C[i] + D[i];

This loop, when executed in a non-pipelined fashion, will take more time than when done through an arithmetic pipeline.
When the first stage multiplies the two operands B[i] and C[i], the second stage is concurrently adding D[i-1] to the product of B[i-1] and C[i-1].
The speedup of an arithmetic pipeline is given as:

    Sn = n T1 / [(n + k - 1) Tk]

A dynamic arithmetic pipeline consisting of 3 stages can perform addition, subtraction and multiplication.
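The overlap described above, where stage 1 multiplies B[i] and C[i] while stage 2 adds D[i-1] to the previous product, can be simulated in Python. This two-stage simulation is a sketch of my own, not a hardware description.

```python
def fused_multiply_add(B, C, D):
    """Two-stage arithmetic pipeline: each cycle, stage 2 adds D[i-1]
    to the product from the previous cycle while stage 1 multiplies
    B[i]*C[i]. One extra cycle drains the pipeline."""
    n = len(B)
    A = [0.0] * n
    product = None                               # latch between the stages
    for i in range(n + 1):
        if product is not None:                  # stage 2: add
            A[i - 1] = product + D[i - 1]
        product = B[i] * C[i] if i < n else None  # stage 1: multiply
    return A
```

After the one-cycle fill delay, one A[i] result emerges per cycle, which is the throughput gain the speedup formula above quantifies.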
VECTOR PIPELINE
A vector pipeline executes vector instructions in overlapped mode. The vector pipeline can be attached to any scalar processor, whether it is superscalar, super-pipelined or both. Dedicated vector pipelines eliminate some overhead in loop control; of course, the effectiveness of a vector processor relies on the capability of an optimizing compiler that vectorizes sequential code for vector pipelining.
Q11. What do you mean by latency hiding? Discuss various types of latency hiding approaches used in
scalable multiprocessors.
The access of remote memory may significantly increase memory latency. Furthermore, processor speed is increasing at a much faster rate than memory and interconnection-network speed. Thus any scalable multiprocessor or large-scale multicomputer must rely on the use of latency-reducing, -tolerating or -hiding mechanisms, four of which are:
1. Pre-fetching Technique
2. Cache Coherent
3. Release Memory Consistency
4. Multiple Contexts
Pre-fetching Technique
Pre-fetching uses knowledge of the expected misses in a program to move the corresponding data close to the processor before it is actually needed. Pre-fetching schemes can be classified by whether they are binding or non-binding, and whether they are controlled by hardware or software.
With binding pre-fetching, the value of a later reference (e.g. a register load) is bound at the time the pre-fetch completes. Binding pre-fetching may result in a significant loss in performance due to this limitation.
In contrast, non-binding pre-fetching also brings the data close to the processor, but the data remains visible to the cache coherence protocol and is thus kept consistent until the processor actually reads the value.
Hardware-controlled pre-fetching includes schemes such as long cache lines and instruction look-ahead, although instruction look-ahead is limited by branches and the finite look-ahead buffer size.
With software-controlled pre-fetching, explicit pre-fetch instructions are issued.
Advantage:-
The benefits of pre-fetching come from several sources. The most obvious benefit occurs when a pre-fetch is issued early enough in the code that the line is already in the cache by the time it is referenced.
Pre-fetching offers another benefit in multiprocessors that use an ownership-based cache coherence protocol.
Disadvantage:-
The disadvantages of software control include the extra instruction overhead required to generate the pre-fetches, as well as the need for sophisticated software intervention. Studies therefore tend to concentrate on non-binding, software-controlled pre-fetching.
Cache Coherent
While the coherence problem is easily solved for small bus-based multiprocessors through the use of snoopy coherence protocols, the problem is much more complicated for large-scale multiprocessors that use a general interconnection network.
Example - DASH experience: one can evaluate the benefit when both private and shared read-write data are cacheable, as allowed by the DASH hardware-coherent caches, versus the case where only private data are cacheable, by presenting a breakdown of the normalized execution times with and without caching of shared data for each application. Private data are cached in both cases.
Release Memory Consistency
The Release Consistency (RC) model was introduced by Gharachorloo et al. (1990). Release consistency requires that synchronization accesses in the program be identified and classified as either acquires (e.g. locks) or releases (e.g. unlocks). An acquire is a read operation (which can be part of a read-modify-write) that gains permission to access a set of data, while a release is a write operation that gives away such permission. This information is used to provide flexibility in the buffering and pipelining of accesses between synchronization points.
Advantages:
The main advantage of the release consistency model is the potential for increased performance by hiding as much write latency as possible.
Disadvantage:
The main disadvantage is increased hardware complexity and a more complex programming model.
Multiple Contexts
A conventional single-threaded processor will wait during a remote reference, so we may say it is idle for a period of time L. A multithreaded processor, as modeled here, will suspend the current context and switch to another, so that after some fixed number of cycles it will again be busy doing useful work even though the remote reference is outstanding. Only if all the contexts are suspended will the processor be idle.
Clearly, the objective is to maximize the fraction of time that the processor is busy, so we will use the efficiency of the processor as our performance index, given by
Efficiency = busy / (busy + switching + idle)
Q12. What do you mean by multithreading in multiprocessing system? Discuss the principles of
multithreading.
Multithreading allows a processor to hold several thread contexts and switch among them to hide long-latency operations. The principal parameters of the multithreaded machine model are the latency (L), the number of threads (N), the context-switching overhead (C), and the interval between switches (R).
Latency (L)
This is the time needed to service a remote reference. It includes:
● network delay
● cache-miss penalty
● other delays, such as those caused by contention in split transactions
Number of threads (N)
This is the number of threads that can be interleaved in each processor. A thread is represented by a context consisting of a program counter, a register set, and the required context status words.
Context-switching overhead (C)
This refers to the cycles lost in performing context switching in a processor. This time depends on the switch mechanism and the amount of processor state devoted to maintaining active contexts.
Interval between switches (R)
This refers to the cycles between switches triggered by remote references, i.e. the inverse of the rate of remote accesses. It reflects a combination of program behaviour and memory system design.
Context-Switching Policies
Different multithreaded architectures are distinguished by the context-switching policies they adopt, such as switching on every cache miss, on every load, on every instruction, or after a fixed block of instructions.
A multithreaded computation starts with a sequential thread, followed by supervisory scheduling in which the processors begin threads of computation, by inter-computer messages that update variables among the nodes when the computer has a distributed memory, and finally by synchronization prior to beginning the next unit of parallel work. The communication overhead inherent in distributed-memory structures is usually spread throughout the computation and can be partly or completely overlapped with it. Message-passing overhead in multicomputers can be reduced by specialized hardware operating in parallel with computation.
Massively parallel processors operate asynchronously in a network environment. This asynchrony raises two fundamental latency problems: remote loads and synchronizing loads.
The solution to the asynchrony problem is to multiplex among many threads: when one thread issues a remote-load request, the processor begins working on another thread, and so on. As the inter-node latency increases, more threads are needed to hide it effectively. If thread t1 issues a remote load and thread t2 also issues one, the responses may not return in the same order. This problem is resolved by associating each remote load and response with an identifier for the appropriate thread. These thread identifiers are referred to as continuations on messages. A large continuation name space should be provided to name an adequate number of threads waiting for remote responses.
A multithreaded processor will suspend the current context and switch to another, so that after some fixed number of cycles it will again be busy doing useful work, even though the remote reference is outstanding. The basic idea behind a multithreaded machine is to interleave the execution of several contexts in order to dramatically reduce the idle time of the processor without greatly increasing the context-switching time. Multithreaded systems are constructed with multiple-context processors, whereas a conventional processor runs a single thread. The objective is to maximize the fraction of time that the processor is busy, so the efficiency of the processor is given by efficiency = busy / (busy + switching + idle), where busy, switching, and idle represent the amounts of time spent in each state, measured over some large interval. The state of the processor is determined by the disposition of the various contexts on the processor. A context cycles through the following states during its lifetime: ready, running, leaving, and blocked. There can be at most one context in the running or leaving state.
A processor is busy if there is a context in the running state. It is switching while making the transition from one context to another. Otherwise, if all contexts are blocked, the processor is idle.
A running context keeps the processor busy until it issues an operation that requires a context switch. The context then spends C cycles in the leaving state, goes into the blocked state for L cycles, and finally re-enters the ready state. Eventually the processor will choose it again and the cycle will repeat.
Q13. Write notes on:
(a) Memory hierarchy (d) Data flow computer (i) Parallelism in uniprocessor
(j) Tera multiprocessor system.
Data Flow Computer
Data flow computers have the potential for exploiting all the parallelism available in a program. Since execution is driven only by the availability of operands at the inputs to the functional units, there is no need for a program counter in the architecture, and parallelism is limited only by the actual data dependences in the application program. While the data flow concept offers the potential of high performance, the performance of an actual data flow implementation can be restricted by the limited number of functional units, limited memory bandwidth, and the need to associatively match pending operations with available functional units.
There are two types of data flow computers:
1. Static
A static data flow computer allows at most one token to reside on any one arc. A node is enabled as soon as tokens are present on all of its input arcs and there is no token on any of its output arcs.
2. Dynamic
In a dynamic data flow architecture, each data token is tagged with a context descriptor and is called a tagged token. A node is enabled as soon as tokens with identical tags are present at each of its inputs. Tag matching becomes necessary, therefore special hardware mechanisms are needed to achieve tag matching.
CONCLUSION:-
Since only data dependences exist in a data flow graph, no unnecessary sequencing is forced, and the computer schedules instructions according to the availability of operands. Values are carried by tokens rather than fetched from fixed memory locations. An instruction fires when tokens are found on all of its inputs; it consumes the input tokens, computes an output value based on the input values, and produces tokens on its outputs. No further restriction on instruction ordering is imposed, and no side effects are produced by the execution of instructions in data flow computers.
Example: the Tagged-Token Dataflow Architecture (TTDA), developed at MIT.
Tera Multiprocessor System
This system consists of 256 processors, 512 memory units, 256 I/O cache units, 256 I/O processors, and 4096 interconnection network nodes, with a clock period of less than 3 ns. It achieves high speed at every level, from operation-level parallelism within program basic blocks up to multi-user time and space sharing. There are no register or memory addressing constraints and only three addressing modes. Condition-code setting is consistent and orthogonal. The architecture permits the free exchange of spatial and temporal locality for parallelism; a highly optimizing compiler may work hard at improving locality and trade the parallelism thereby saved for more speed.
The interconnection network of a 256-processor Tera system contains 4096 nodes arranged in a 16×16×16 toroidal mesh. Of the 4096 nodes, 1280 are attached to the resources comprising the 256 processors, 512 memory units, 256 I/O cache units, and 256 I/O processors. The 2816 remaining nodes do not have resources attached but still provide message bandwidth.
Each processor in the Tera computer can execute multiple instruction streams (threads) simultaneously. On every tick of the clock, the processor logic selects a thread that is ready to execute and allows it to issue its next instruction. Since instruction interpretation is completely pipelined by the processor and by the network and memories, a new instruction from a different thread may be issued in each tick without interfering with its predecessors. When an instruction finishes, the thread to which it belongs becomes ready to execute the next instruction. Context switching is so rapid that the processor has no time to swap the processor state. The state of each thread consists of: one 64-bit stream status word (SSW); thirty-two 64-bit general-purpose registers (R0-R31); and eight 64-bit target registers (T0-T7). There are 128 copies of each per processor, that is, 128 SSWs, 4096 general-purpose registers, and 1024 target registers. Program addresses are 32 bits in length.
The Tera architecture uses explicit-dependence look-ahead. Each instruction contains a 3-bit look-ahead field that explicitly specifies how many instructions from this thread will be issued before an instruction that depends on the current one; since seven is the maximum look-ahead value, at most 8 instructions and 24 operations can be concurrently executing from each thread.
Parallelism in Uni-processor
A computer system achieves parallelism when it performs two or more tasks simultaneously. In computer design this is generally understood to mean that the tasks are not related to each other. Two micro-operations carried out in the same step of processing a single instruction, for example
FETCH2: DR←M, PC←PC+1
would not be considered parallel processing, since both belong to the same instruction. Parallel processing is generally attributed to supercomputers, i.e. systems with many CPUs.
However, a uniprocessor system can also exhibit some sort of parallel processing, for example through multiple functional units, pipelining within the CPU, overlapped CPU and I/O operations, and multiprogramming.
MEMORY HIERARCHY
The arrangement of memory devices such as registers, caches, main memory, disk drives, and tapes in a hierarchical manner according to five parameters, with each level ranked by its proximity to the CPU, is called the memory hierarchy.
The five parameters on which the memory hierarchy levels are measured are:
1. Access time (ti), in ns/ms/min - the time between presenting the address and getting the valid data.
2. Memory size (si), in KB/MB/GB/TB - the size of the memory in kilo-, mega-, giga-, or terabytes.
3. Cost per byte (ci), in cents/KB - the cost of building the memory.
4. Transfer bandwidth (bi), in MB/s - the rate of data movement between levels.
5. Unit of transfer (xi), in bytes/blocks - the grain size of data transferred between levels.
The access time ti refers to the round-trip time from the CPU to the ith-level memory; the size si is the number of bytes or words in level i. The cost of the ith-level memory is estimated by the product ci·si. The bandwidth bi refers to the rate at which information is transferred between adjacent levels. The unit of transfer xi refers to the grain size for data transfer between levels i and i+1.
Memory devices at a lower level are faster to access, smaller in size, more expensive per byte, have a higher bandwidth, and use a smaller unit of transfer as compared with those at a higher level. In other words, ti-1 < ti, si-1 < si, ci-1 > ci, bi-1 > bi, and xi-1 < xi for i = 1, 2, 3, and 4 in the hierarchy, where i = 0 corresponds to the CPU register level.
Disk Drives and Tape Units
Disk drives and tape units are handled by the OS with limited user intervention. They are used for storing information needed in the future, such as programs and data. Magnetic tapes are off-line memory used for backup storage.
Peripheral Technology
Besides disk drives and tape units, peripheral devices include printers, plotters, terminals, monitors, graphics displays, optical scanners, image digitizers, output microfilm devices, etc. Some I/O devices are tied to special-purpose or multimedia applications.
PROPERTIES OF MEMORY HIERARCHY
Information stored in a memory hierarchy (M1, M2 …Mn)
satisfies three important properties:
1. Inclusion Property
2. Coherence Property
3. Locality of Reference Property
Inclusion Property
The inclusion property is stated as M1⊂M2⊂M3⊂….⊂Mn.
The set-inclusion relationship implies that all information
items are originally stored in the outermost level Mn. During
processing, subsets of Mn are copied into Mn-1; similarly,
subsets of Mn-1 are copied into Mn-2, and so on. If an
information word is found in Mi, then copies of the same word
can also be found in all upper levels Mi+1, Mi+2, ..., Mn. The
highest level is the backup storage, where everything can be found.
Information transfer between the CPU and cache is in
terms of words. The cache (M1) is divided into cache blocks,
also called cache lines. Blocks are the units of data transfer between the cache and main memory.
The main memory (M2) is divided into pages, each page contains N blocks. Pages are the units
of information transferred between disks and main memory.
Scattered pages are organized as a segment in the disk memory. Data transfer between the
disk and tape unit is handled at the file level.
Coherence Property
The coherence property requires that copies of the same information item at successive levels
be consistent. If a word is modified in the cache, copies of that word must be updated, immediately or
eventually, at all higher levels. Frequently used information is often found in the lower levels in order
to minimize the effective access time of the memory hierarchy. There are two strategies for
maintaining coherence in a memory hierarchy:
The first method is called write-through (WT), which demands immediate update in Mi+1 if a
word is modified in Mi, for i = 1, 2, ..., n-1.
The second method is write-back (WB), which delays the update in Mi+1 until the word being
modified in Mi is replaced or removed from Mi.
Temporal Locality
Recently referenced items are likely to be referenced again in the near future. This behaviour
is caused by program constructs such as iterative loops, process stacks, temporary variables,
and subroutines. Once a loop is entered or a subroutine is called, a small code segment will be
referenced repeatedly many times. Thus temporal locality tends to cluster accesses in recently used areas.
Spatial Locality
This refers to the tendency for a process to access items whose addresses are near one another.
Program segments, such as routines and macros, tend to be stored in the same neighbourhood
of the memory space.
Sequential Locality
In typical programs, the execution of instructions follows a sequential order. The ratio of in-order
execution to out-of-order execution is roughly 5 to 1 in ordinary programs.
The sequentiality in program behaviour contributes to spatial locality because sequentially
coded instructions and array elements are often stored in adjacent locations. Each type of
locality affects the design of the memory hierarchy.