
Advanced Computer Architecture

Question 1 - Write notes on any 2 Program Flow mechanisms


Question 2 - Write notes on the following:
Amdahl’s law and efficiency of a system
Utilization of system and quality of parallelism
Redundancy
Question 3 - Define Clock rate, CPI, MIPS rate and Throughput rate
Question 4 - Explain super scalar processors
Question 5 - Explain non linear pipeline processor
Question 6 - Write notes on the following:
Crossbar Switch Network
Multiport Memory
Multistage Network
Question 7 - Explain message passing in detail

Question 1 - Write notes on any 2 Program Flow mechanisms

Conventional machines use a control flow mechanism, in which the order of program
execution is explicitly stated in the user program.
● Control flow machines - In this type of machine, a token of control indicates when a
statement is executed.
● Data flow machines - In this type of machine, an instruction is executed as soon as
its operands become available.
● Reduction machines - In this type of machine, the execution of an instruction is
triggered by the demand for its result.

Comparison of program flow mechanisms:

| Machine model | Control flow | Data flow | Reduction machine |
|---|---|---|---|
| Basic definition | Conventional computation; a token of control indicates when a statement should be executed. | Eager evaluation; statements are executed when all their operands are available. | Lazy evaluation; statements are executed only when their result is required for another computation. |
| Advantages | 1. Full control. 2. Complex data and control structures are easily implemented. | 1. Very high potential for parallelism. 2. High throughput. 3. Free from side effects. | 1. Only required instructions are executed. 2. High degree of parallelism. 3. Easy manipulation of data structures. |
| Disadvantages | 1. Less efficient. 2. Difficult to program. 3. Difficult to prevent run-time errors. | 1. High control overhead. 2. Difficult to manipulate data structures. | 1. Time needed to propagate demand tokens. |
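
The difference between eager (data-driven) and lazy (demand-driven) execution can be sketched in a few lines of Python. This is only an illustration, not a model of any real machine: the expression graph, the node names and the helper functions `data_driven` and `demand_driven` are all assumptions made up for this example.

```python
# Hypothetical expression graph: each node is (operation, input names).
# Leaves are plain numbers. "unused" is never needed to compute "result".
GRAPH = {
    "a": 2,
    "b": 3,
    "sum": ("add", "a", "b"),
    "prod": ("mul", "a", "b"),
    "unused": ("mul", "sum", "sum"),
    "result": ("add", "sum", "prod"),
}
OPS = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y}


def data_driven(graph):
    """Eager: fire every node as soon as all of its operands are available."""
    values = {k: v for k, v in graph.items() if not isinstance(v, tuple)}
    fired = []
    progress = True
    while progress:
        progress = False
        for name, node in graph.items():
            if name in values:
                continue
            op, *inputs = node
            if all(i in values for i in inputs):
                values[name] = OPS[op](*(values[i] for i in inputs))
                fired.append(name)
                progress = True
    return values, fired


def demand_driven(graph, goal):
    """Lazy: evaluate a node only when its result is demanded."""
    values, fired = {}, []

    def demand(name):
        if name not in values:
            node = graph[name]
            if isinstance(node, tuple):
                op, *inputs = node
                values[name] = OPS[op](*(demand(i) for i in inputs))
                fired.append(name)
            else:
                values[name] = node
        return values[name]

    return demand(goal), fired


eager_values, eager_fired = data_driven(GRAPH)
lazy_result, lazy_fired = demand_driven(GRAPH, "result")
print(eager_fired)  # every operation fires, including "unused"
print(lazy_fired)   # only the operations that "result" depends on
```

Both strategies compute the same value for `result`; the lazy version simply never executes the node whose result nobody asks for.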

Question 2 - Write notes on the following:


Amdahl’s law and efficiency of a system
Utilization of system and quality of parallelism
Redundancy
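a. Amdahl’s Law:
Example 1:
Let F be the fraction of the computation affected by an improvement
and S the speedup of that fraction. If an improvement can speed up
30% of the computation, F will be 0.3; if the improvement makes the
affected portion twice as fast, S will be 2. Amdahl's law states that
the overall speedup from applying the improvement will be

1 / ((1 - F) + F/S) = 1 / (0.7 + 0.3/2) = 1 / 0.85 = 1.176 (approximately)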
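
As a quick check of Example 1, Amdahl's law for a single improved portion can be written as a small Python function. The function name `amdahl_speedup` and its parameter names are assumptions chosen for this sketch.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the work is made s times faster."""
    return 1.0 / ((1.0 - f) + f / s)


# Example 1: 30% of the computation is made twice as fast.
print(round(amdahl_speedup(0.3, 2), 4))  # 1.1765
```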

Example 2:
We are given a task which is split up into four parts: F1 = 11%, F2 =
18%, F3 = 23%, F4 = 48%, which add up to 100%. F1 is not sped up,
so S1 = 1 (100%); F2 is sped up 5×, so S2 = 5 (500%); F3 is sped up
20×, so S3 = 20 (2000%); and F4 is sped up 1.6×, so S4 = 1.6 (160%).
Using the formula F1/S1 + F2/S2 + F3/S3 + F4/S4, we find the new
running time is

0.11/1 + 0.18/5 + 0.23/20 + 0.48/1.6 = 0.4575

or a little less than ½ the original running time, which we take to be 1.

Therefore the overall speed boost is 1 / 0.4575 = 2.186, or a little
more than double the original speed, using the formula
(F1/S1 + F2/S2 + F3/S3 + F4/S4)^(-1). Notice how the 20× and 5×
speedups don't have much effect on the overall speed boost and running
time when 11% is not sped up and 48% is sped up by only 1.6×.
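
The multi-part form used in Example 2 can be checked the same way. The sketch below is only an illustration; the function name `multi_part_speedup` and the list-of-pairs input format are assumptions.

```python
def multi_part_speedup(parts):
    """parts: list of (fraction, speedup) pairs whose fractions sum to 1.

    Returns (new running time relative to 1, overall speedup).
    """
    total = sum(f for f, _ in parts)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must add up to 1")
    new_time = sum(f / s for f, s in parts)
    return new_time, 1.0 / new_time


# Example 2: four parts with speedups 1x, 5x, 20x and 1.6x.
new_time, speedup = multi_part_speedup(
    [(0.11, 1), (0.18, 5), (0.23, 20), (0.48, 1.6)]
)
print(round(new_time, 4), round(speedup, 3))  # 0.4575 2.186
```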

System Efficiency

■ Let O(n) be the total number of unit operations performed by an
n-processor system and T(n) be the execution time in unit time steps.
In general, T(n) < O(n) if more than one operation is performed by the
n processors per unit time, where n >= 2. Assume T(1) = O(1) in a
uniprocessor system.

The speedup factor is defined as:

S(n) = T(1)/T(n)

The efficiency of an n-processor system is defined as:

E(n) = S(n)/n = T(1)/(n*T(n))

■ Efficiency is an indication of the actual degree of speedup
achieved compared with the maximum value.
Since 1 <= S(n) <= n, we have 1/n <= E(n) <= 1. [always a fraction]
■ The lowest efficiency corresponds to the case where the entire program is
executed sequentially on a single processor.
■ The maximum efficiency is achieved when all n processors are fully
utilized throughout the execution period.

b. System Utilization
■ System utilization in a parallel computation is defined as below:
U(n) = R(n) * E(n) = O(n)/(n * T(n))
where R(n) = O(n)/O(1) is the redundancy.
■ The system utilization indicates the percentage of resources that was
kept busy during the execution of a parallel program. It is interesting to
note the following relationships:
1/n <= E(n) <= U(n) <= 1
1 <= R(n) <= 1/E(n) <= n

Quality of parallelism
■ The quality of a parallel computation is directly proportional to the
speedup and efficiency and inversely related to the redundancy. Thus we
have:

Q(n) = (S(n) * E(n))/R(n) = T^3(1)/(n*T^2(n)*O(n))

Since E(n) is always a fraction and R(n) is a number between 1 and n, the
quality Q(n) is always bounded by the speed up factor S(n).

c. Redundancy
■ The redundancy in a parallel computation is defined as the ratio of O(n) to
O(1):
R(n) = O(n)/O(1)
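
The five measures defined in this question (speedup, efficiency, redundancy, utilization and quality) can be computed together from n, T(1), T(n) and O(n). The sketch below assumes T(1) = O(1), as stated earlier; the function name `parallel_metrics` and the sample numbers are made up for illustration.

```python
def parallel_metrics(n, t1, tn, on):
    """Return S, E, R, U, Q for an n-processor run, assuming O(1) = T(1)."""
    o1 = t1                 # uniprocessor assumption from the notes
    s = t1 / tn             # speedup     S(n) = T(1)/T(n)
    e = s / n               # efficiency  E(n) = S(n)/n
    r = on / o1             # redundancy  R(n) = O(n)/O(1)
    u = r * e               # utilization U(n) = R(n)*E(n)
    q = s * e / r           # quality     Q(n) = S(n)*E(n)/R(n)
    return {"S": s, "E": e, "R": r, "U": u, "Q": q}


# Hypothetical run: 4 processors, T(1) = 100, T(4) = 40, O(4) = 120.
m = parallel_metrics(n=4, t1=100, tn=40, on=120)
print({k: round(v, 4) for k, v in m.items()})
# {'S': 2.5, 'E': 0.625, 'R': 1.2, 'U': 0.75, 'Q': 1.3021}
```

For these numbers the bounds hold as expected: 1/4 <= E <= U <= 1 and 1 <= R <= 1/E <= 4, and Q stays below S.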

Question 3 - Define Clock rate, CPI, MIPS rate and Throughput rate

i. Clock rate

■ The CPU is driven by a clock with a constant cycle time.
● The cycle time is denoted by T and is measured in nanoseconds.
■ The inverse of the cycle time is the clock rate (f = 1/T).
● f is usually quoted in megahertz; with T in nanoseconds, f = 1000/T MHz.
■ The clock rate is the rate in cycles per second (measured in
hertz), or the frequency of the clock, in any synchronous circuit
such as a central processing unit (CPU).

ii. CPI - Cycles per instruction


iii. MIPS - Millions of instructions per second
iv. Throughput Rate
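
The notes list these four terms without formulas. The sketch below uses the standard textbook relations (CPU time = Ic × CPI / f, MIPS = f / (CPI × 10^6), throughput = f / (Ic × CPI)), where Ic is the instruction count of the program. The function names and the sample figures (a 40 MHz clock, 200,000 instructions, CPI of 2.5) are assumptions chosen only for illustration.

```python
def cpu_time(instr_count, cpi, clock_hz):
    """Seconds needed to run one program: Ic * CPI * cycle time."""
    return instr_count * cpi / clock_hz


def mips_rate(cpi, clock_hz):
    """Millions of instructions executed per second."""
    return clock_hz / (cpi * 1e6)


def throughput(instr_count, cpi, clock_hz):
    """Programs completed per second (CPU throughput)."""
    return clock_hz / (instr_count * cpi)


# Hypothetical machine and workload.
f = 40e6          # clock rate in hertz (40 MHz, cycle time 25 ns)
ic = 200_000      # instructions in the program
cpi = 2.5         # average cycles per instruction

print(cpu_time(ic, cpi, f))    # 0.0125 seconds
print(mips_rate(cpi, f))       # 16.0 MIPS
print(throughput(ic, cpi, f))  # 80.0 programs per second
```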
Question 4 - Explain super scalar processors

■ A superscalar CPU architecture implements a form of parallelism called
instruction level parallelism within a single processor. It therefore allows faster
CPU throughput than would otherwise be possible at a given clock rate.
■ A superscalar processor executes more than one instruction during a clock
cycle by simultaneously dispatching multiple instructions to redundant
functional units on the processor. Each functional unit is not a separate CPU
core but an execution resource within a single CPU such as an arithmetic logic
unit, a bit shifter, or a multiplier.

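■ In Flynn's taxonomy, a single-core superscalar processor is classified as an
SISD processor (Single Instruction stream, Single Data stream), since it
executes one sequential instruction stream; a multi-core superscalar processor
is classified as MIMD (Multiple Instruction streams, Multiple Data streams).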

■ While a superscalar CPU is typically also pipelined, pipelining and superscalar
architecture are considered different performance enhancement techniques.

■ The superscalar technique is traditionally associated with several identifying
characteristics (within a given CPU core):
1. Instructions are issued from a sequential instruction stream
2. CPU hardware dynamically checks for data dependencies between
instructions at run time (versus software checking at compile time)
3. The CPU accepts multiple instructions per clock cycle
Fig: Simple superscalar pipeline. By fetching and dispatching two instructions at
a time, a maximum of two instructions per cycle can be completed.

■ The simplest processors are scalar processors. Each instruction executed
by a scalar processor typically manipulates one or two data items at a
time.

■ In contrast, each instruction executed by a vector processor operates
simultaneously on many data items. An analogy is the difference
between scalar and vector arithmetic.

■ A superscalar processor is, in a sense, a mixture of the two.
Each instruction processes one data item, but there are
multiple redundant functional units within each CPU, so multiple
instructions can be processing separate data items concurrently.
■ Superscalar CPU design emphasizes improving the accuracy of the
instruction dispatcher and allowing it to keep the multiple functional
units in use at all times. This has become increasingly important as
the number of units has increased. While early superscalar CPUs would have
two ALUs and a single FPU, a later design such as the PowerPC 970
includes four ALUs, two FPUs, and two SIMD units. If the dispatcher is
ineffective at keeping all of these units fed with instructions, the
performance of the system will suffer.

■ A superscalar processor usually sustains an execution rate in
excess of one instruction per machine cycle. But merely processing
multiple instructions concurrently does not make an architecture
superscalar, since pipelined, multiprocessor or multi-core architectures
also achieve that, but with different methods.

■ In a superscalar CPU the dispatcher reads instructions from memory
and decides which ones can be run in parallel, dispatching them to
redundant functional units contained inside a single CPU. Therefore
a superscalar processor can be envisioned as having multiple parallel
pipelines, each of which is processing instructions simultaneously
from a single instruction thread.
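
The dispatcher's run-time dependency check can be illustrated with a toy in-order, two-wide issue model. This is a deliberately simplified sketch, not a description of any real CPU: every instruction is assumed to take one cycle, the tuple format `(destination, sources)` and the function `issue_groups` are invented for the example, and only hazards on registers written earlier in the same issue group are checked.

```python
def issue_groups(program, width=2):
    """Group consecutive instructions that can issue in the same cycle.

    An instruction joins the current group only if it neither reads nor
    writes a register written by an earlier instruction in that group.
    """
    groups = []
    i = 0
    while i < len(program):
        group = [program[i]]
        written = {program[i][0]}
        j = i + 1
        while j < len(program) and len(group) < width:
            dest, sources = program[j]
            if dest in written or any(s in written for s in sources):
                break  # dependency found: wait for the next cycle
            group.append(program[j])
            written.add(dest)
            j += 1
        groups.append(group)
        i = j
    return groups


# Hypothetical instruction stream: (destination, (source, source)).
program = [
    ("r1", ("r2", "r3")),
    ("r4", ("r5", "r6")),  # independent of the first, so it pairs with it
    ("r7", ("r1", "r4")),  # needs r1 and r4
    ("r8", ("r7", "r2")),  # needs r7, so it cannot pair with the previous one
]
groups = issue_groups(program)
print(len(groups))                 # 3 cycles for 4 instructions
print(len(program) / len(groups))  # about 1.33 instructions per cycle
```

The last two instructions show the first limitation listed below: when the instruction stream has little intrinsic parallelism, the second issue slot stays empty.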

Limitations:

Available performance improvement from superscalar techniques is
limited by three key areas:
1. The degree of intrinsic parallelism in the instruction stream, i.e.
the limited amount of instruction-level parallelism.
2. The complexity and time cost of the dispatcher and associated
dependency checking logic.
3. Branch instruction processing.

Examples of superscalar processors:

● The P5 Pentium, the first superscalar x86 processor
● Nx586, P6 Pentium Pro and AMD K5
● Cyrix 6x86

Question 5 - Explain non linear pipeline processor


■ A pipeline need not be a simple linear chain of stages. There are instances
where it is useful to have a collection of functional units that can be wired into a
particular pattern of flow, even with loops and skips in the chain. This may allow
more than one function to be computed with the same pipeline.

■ A typical case would be built-in floating-point square root, which chains together
the floating-point adder and multiplier, rather than having separate functional
units for this rarely used operation. Depending upon how the square root
operation operates, it might leave holes in the schedule that would admit
independent floating adds or multiplies.

■ The problem with a nonlinear pipeline is that it is difficult to keep full:
a new function can be started only at times when it will not collide, at any
stage, with the functions already in the pipeline or with itself.

■ Reservation tables show the sequence in which each function uses each
stage. (For example, think of X as being a floating square root, and Y as being a
floating cosine. A simple floating multiply might occupy just S1 and S2 in
sequence.) These diagrams can also denote multiple stages being used in parallel,
or a stage being held for more than one cycle.
■ We determine the next start time for one or the other of the functions by lining up
the diagrams and sliding one with respect to another to see where one can fit into
the open slots.
■ Once an X function has been scheduled, another X function can start after 1, 3 or
6 cycles. A Y function can start after 2 or 4 cycles.
■ Once a Y function has been scheduled, another Y function can start after 1, 3 or
5 cycles. An X function can start after 2 or 4 cycles.
■ After two functions have been scheduled, no more can start until both are
complete.
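
The "slide one table against another" procedure can be sketched for a single function. The reservation table below is an assumption: it is a three-stage, eight-cycle table chosen so that its permissible latencies match the values 1, 3 and 6 quoted above for function X; the original figure is not reproduced in these notes. The helper name `forbidden_latencies` is also made up.

```python
def forbidden_latencies(table):
    """Latencies at which a second initiation would collide with the first.

    table maps each stage name to the list of cycles in which the
    function occupies that stage. Two marks in the same row that are
    k cycles apart make a start-to-start distance of k a collision.
    """
    forbidden = set()
    for cycles in table.values():
        for a in cycles:
            for b in cycles:
                if b > a:
                    forbidden.add(b - a)
    return forbidden


# Assumed reservation table for function X (cycles numbered from 1).
X_TABLE = {"S1": [1, 6, 8], "S2": [2, 4], "S3": [3, 5, 7]}
X_LENGTH = 8  # total cycles taken by one evaluation of X

forbidden = forbidden_latencies(X_TABLE)
permissible = [k for k in range(1, X_LENGTH) if k not in forbidden]
print(sorted(forbidden))  # [2, 4, 5, 7]
print(permissible)        # [1, 3, 6]
```

Any latency of eight cycles or more is trivially collision-free, since the first evaluation has left the pipeline by then.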

Question 6 - Write notes on the following:

i. Crossbar Switch Network
■ A separate path is available for each memory unit.
■ Every processor is connected to each memory module through a cross point
switch.
■ Obviously hardware complexity increases.
■ All processors can send memory requests independently and asynchronously.
■ Each cross point switch in a cross point network can be set open or closed,
providing a point to point connection between the source and destination.
■ On each row of the crossbar mesh multiple switches can be connected
simultaneously.
■ In each column of the crossbar only one switch can be connected at a time.
■ A problem arises when multiple requests are destined for the same memory
module at the same time. In such cases only one request can be serviced at a
time, since at any given time only one switch in that column can be connected.
■ Each cross point must have additional hardware capable of handling
all switching and resolving all conflicts.
■ An arbitration module is used to make the selection based on the priority,
whenever a conflict arises. The acknowledgement signals are sent to indicate the
result of the conflict.
■ A multiplexer module multiplexes the data, address and signal from the
processor.
■ Each cross point requires a large number of connection lines for accommodating
the address, data and control signals.
■ The cross bar switch has the potential for the highest bandwidth and system
efficiency.
■ The maximum number of simultaneous transfers is limited by the number of
memory modules and by the bandwidth and speed of the buses, rather than by the
number of paths available.
■ Because of its complexity and cost, it may not be preferred for large
multiprocessor systems.
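
The column rule and the priority-based arbitration described above can be sketched as follows. This is a simplified model of one arbitration round, not a hardware design: the function name `arbitrate` is made up, and the fixed priority scheme (lower processor number wins) is an assumption, since the notes do not say how priority is assigned.

```python
def arbitrate(requests):
    """Resolve one round of crossbar requests.

    requests maps a processor id to the memory module it wants.
    Each module (column) can serve only one processor at a time;
    on a conflict the lowest-numbered processor wins (assumed priority).
    Returns (granted connections, processors that must retry).
    """
    granted = {}
    blocked = []
    busy_modules = set()
    for proc in sorted(requests):
        module = requests[proc]
        if module in busy_modules:
            blocked.append(proc)      # column already has a closed switch
        else:
            busy_modules.add(module)
            granted[proc] = module    # close the cross point (proc, module)
    return granted, blocked


# Processors 0 and 2 both want memory module 2.
granted, blocked = arbitrate({0: 2, 1: 0, 2: 2, 3: 1})
print(granted)  # {0: 2, 1: 0, 3: 1}
print(blocked)  # [2]
```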

ii. Multiport Memory

■ A multiport memory has multiple ports, connected to multiple paths between
memory and processors.
■ Multiport memory is based on the idea of moving all crosspoint arbitration and
switching functions associated with each memory module into the memory
controller.
■ Thus the memory module becomes more expensive due to added access ports
and associated logic.
iii. Multistage Network

Question 7 - Explain message passing in detail
Message passing in multicomputers

Message Formats
Store and forward routing
Flits and Wormhole Routing

Store and Forward Vs Wormhole


Asynchronous pipelining
Wormhole Node Handshake
