
CS402

Advanced Computer
Architecture
2010

Chapter 1


Adapted from CS 152 Spring 2002 UC Berkeley
Copyright (C) 2001 UCB

Course Instructor
Advanced Computer Architecture, CS402
J. V. Vadavi, Assistant Professor (Sr. Grade),
Department of CSE,
SDMCET, Dharwad.
E-mail: jvvadavi@sdmcet.ac.in

Available in CCF every day from 3.00 pm
to 5.00 pm, except on Mondays.
Intercom number: 8077.
Direct number: 0836-2328077.

Course Information [contd]


Evaluation Procedure & Planned Marks Distribution:
Two Continuous Assessment Tests (CATs), each for 20 marks.
There will be one additional improvement CAT of 20 marks.
10 marks of teacher assessment, based on the quizzes
conducted during the course.
Total internal marks: 50.
End Semester Examination: 100 marks, reduced to 50 marks.
The final course grade is calculated from the internal marks
and the end semester marks.
Course Information [contd]
Book (Required)
Computer Architecture: A Quantitative Approach, 4th Edition,
John L. Hennessy and David A. Patterson, Elsevier.

REFERENCES:
Advanced Computer Architecture: Parallelism, Scalability,
Programmability - Kai Hwang, Tata McGraw-Hill, 2003.
Parallel Computer Architecture: A Hardware/Software
Approach - David E. Culler, Jaswinder Pal Singh,
Anoop Gupta, Morgan Kaufmann, 1999.
Course Information [contd]
Course Webpage
Yet to be designed; any one of you may take the
initiative.
VII Semester CSE email group:
Any one of you may create and operate this
group and check it for any email regarding the course.

Course Overview
[Figure: Booth-encoded multiplier datapath, with a 34-bit ALU, HI and LO registers (16 x 2 bits), a multiplicand register, a Booth encoder, a 34x2 MUX, and control logic]
Topics: Computer Arithmetic; Single-cycle and Multicycle Datapaths
Course Overview [contd]
[Figure: four overlapped instructions, each flowing through the IFetch, Dcd, Exec, Mem, and WB pipeline stages]
Topics: Pipelining; Memory Systems; Performance
What's In It For Me?
An in-depth understanding of the inner workings
of modern computers, their evolution, and the
trade-offs present at the hardware/software boundary.
Insight into which operations are fast or slow, and which
are easy or hard to implement in hardware.

Computer Architecture - Definition
Computer Architecture = ISA + MO

Instruction Set Architecture
What the executing program sees as the underlying hardware
Logical view

Machine Organization
How the hardware implements the ISA
Physical view
Computer Architecture: A Changing Definition
1950s to 1960s: Computer Architecture Course:
Computer Arithmetic
1970s to mid 1980s: Computer Architecture Course:
Instruction Set Design, especially ISA appropriate for compilers
1990s: Computer Architecture Course:
Design of CPU, memory system, I/O system, Multiprocessors,
Networks
2000s: Computer Architecture Course:
Non-von Neumann architectures, reconfigurable computing,
DNA computing, quantum computing ????

Some Examples
Digital Alpha (v1, v3) 1992-97, RIP soon
HP PA-RISC (v1.1, v2.0) 1986-96, RIP soon
Sun SPARC (v8, v9) 1987-95
SGI MIPS (MIPS I, II, III, IV, V) 1986-96
Intel IA-16/32 (8086, 286, 386, 486,
Pentium, MMX, SSE, ...) 1978-1999
IA-64 (Itanium) 1996-now
AMD64/EM64T 2002-now
IBM POWER (PowerPC, ...) 1990-now
Many dead processor architectures live on in
microcontrollers.
The MIPS R3000 ISA (Summary)
Instruction Categories:
Load/Store
Computational
Jump and Branch
Floating Point (coprocessor)
Memory Management
Special

Registers: R0 - R31, PC, HI, LO

3 instruction formats, all 32 bits wide:
R-format: OP | rs | rt | rd | sa | funct
I-format: OP | rs | rt | immediate
J-format: OP | jump target
What is Computer Architecture ?
Coordination of many levels of abstraction:
Application
Operating System
Compiler, Firmware
Instruction Set Architecture
Instr. Set Proc., I/O system
Datapath & Control
Digital Design
Circuit Design
Layout

Under a rapidly changing set of forces:
Design, Measurement, and Evaluation
Impact of changing ISA
In the early 1990s, Apple switched the instruction set
architecture of the Macintosh:
From Motorola 68000-based machines
To the PowerPC architecture
Intel 80x86 family: many implementations
of the same architecture
A program written in 1978 for the 8086 can still run
on the latest Pentium chip, or even a Core i3.

Factors affecting an ISA ???
Technology
Programming Languages
Operating Systems
History
Applications
Cleverness
ISA: Critical Interface
The instruction set is the critical interface between
software and hardware.
Examples: 80x86: 50,000,000 vs. MIPS: 5,500,000 ???
The Big Picture
Processor (Control + Datapath), Memory, Input, Output
Since 1946 all computers have had these 5 components!!!
Example Organization
TI SuperSPARC TMS390Z50 in a Sun SPARCstation 20
[Figure: block diagram. The SuperSPARC chip contains a floating-point unit, an integer unit, an instruction cache, a reference MMU, a data cache, a store buffer, and a bus interface. The MBus module adds an L2 cache and cache controller. An MBus-to-SBus adapter connects the DRAM controller, SBus DMA, SCSI, Ethernet, STDIO (serial, keyboard, mouse, audio, RTC, floppy), and SBus cards.]
Technology Trends
Processor
logic capacity: about 30% per year
clock rate: about 20% per year
Memory
DRAM capacity: about 60% per year (4x every 3 years)
Memory speed: about 10% per year
Cost per bit: improves about 25% per year
Disk
capacity: about 60% per year
Total data in use: growing about 100% every 9 months!
Network Bandwidth
Bandwidth increasing more than 100% per year!
[Figure: log-scale plot of microprocessor logic density (transistors per chip) from 1965 to 2005, tracking the i80x86 family (i4004, i8086, i80286, i80386, i80486, Pentium), M68K, MIPS (R3010, R4400, R10000), and Alpha, rising from about 1,000 to 100,000,000 transistors]

In ~1985 the single-chip processor (32-bit) and the single-board computer emerged.
In the 2002+ timeframe, these may well look like mainframes compared to a single-chip computer (maybe 2 chips).

DRAM chip capacity:
Year   Size
1980   64 Kb
1983   256 Kb
1986   1 Mb
1989   4 Mb
1992   16 Mb
1996   64 Mb
1999   256 Mb
2002   1 Gb
Technology Trends
Smaller feature sizes lead to higher speed and density.
ECE/CS 752; copyright J. E. Smith, 2002 (Univ. of Wisconsin)
Technology Trends
Number of transistors doubles every 18 months
(amended to 24 months)
ECE/CS 752; copyright J. E. Smith, 2002 (Univ. of Wisconsin)
[Figure: log-log plot of bandwidth and latency milestones]

Levels of Representation
High Level Language Program:
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
(Compiler)
Assembly Language Program:
    lw $15, 0($2)
    lw $16, 4($2)
    sw $16, 0($2)
    sw $15, 4($2)
(Assembler)
Machine Language Program:
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111
(Machine Interpretation)
Control Signal Specification:
    ALUOP[0:3] <= InstReg[9:11] & MASK
Execution Cycle
Instruction Fetch: obtain the instruction from program storage.
Instruction Decode: determine the required actions and instruction size.
Operand Fetch: locate and obtain the operand data.
Execute: compute the result value or status.
Result Store: deposit the results in storage for later use.
Next Instruction: determine the successor instruction.
Defining Computer Architecture

The term instruction set architecture (ISA)
refers to the actual, programmer-visible
instruction set.
The ISA serves as the boundary between
the software and the hardware.
There are seven dimensions of an ISA.
Example processors: MIPS and 80x86.

Seven dimensions of an ISA

1. Class of ISA - Nearly all ISAs today are
classified as general-purpose register
architectures, where the operands are either
registers or memory locations.
MIPS has 32 general-purpose and 32 floating-
point registers. What does the 80x86
processor have?
The two popular classes are register-memory ISAs
such as the 80x86 and load-store ISAs such as MIPS.

2. Memory addressing - Both the 80x86 and MIPS
use byte addressing to access memory operands.
MIPS requires that objects be aligned;
the 80x86 does not require alignment, but
accesses are generally faster if operands
are aligned.


3. Addressing modes - Addressing modes
specify the address of a memory object.
MIPS addressing modes are Register, Immediate (for constants), and
Displacement, where a constant offset is added to a register to form the
memory address.
The 80x86 supports those three plus three variations of displacement.

4. Types and sizes of operands - Like most ISAs, MIPS and the 80x86
support operand sizes of 8 bits (ASCII character), 16 bits (Unicode
character or half word), 32 bits (integer or word), and 64 bits (double
word or long integer), plus IEEE 754 floating point in 32 bits (single
precision) and 64 bits (double precision). The 80x86 also supports
80-bit floating point (extended double precision).

[Figure: MIPS instruction formats]
5. Operations - The general categories of
operations are data transfer, arithmetic,
logical, control, and floating point. MIPS is a
simple and easy-to-pipeline instruction set architecture,
and it is representative of the RISC architectures being
used in 2006.

6. Control flow instructions - Virtually all
ISAs support conditional branches,
unconditional jumps, procedure calls, and
returns.
7. Encoding an ISA - There are two basic
choices for encoding: fixed length and
variable length. All MIPS instructions are
32 bits long, which simplifies instruction
decoding. Figure 1.6 shows the MIPS
instruction formats. The 80x86 encoding is
variable length, ranging from 1 to 18 bytes.
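To make the fixed-length case concrete, here is a worked encoding (my example, not from the slides) of one MIPS R-format instruction:

add $t0, $s1, $s2        ; $t0 = $s1 + $s2

field:  op      rs     rt     rd     shamt  funct
value:  0       17     18     8      0      0x20
bits:   000000  10001  10010  01000  00000  100000

All 32 bits together: 0000 0010 0011 0010 0100 0000 0010 0000 = 0x02324020.
Every add instruction, whatever its registers, fits the same 32-bit mold, which is exactly what simplifies decoding.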

[Table: summary of some of the most important functional requirements an architect faces]
[Figure: performance milestones over 20 to 25 years for microprocessors, memory, networks, and disks]

Trends in Power in Integrated Circuits

Power_dynamic = 1/2 x Capacitive load x Voltage^2 x Frequency switched
Energy_dynamic = Capacitive load x Voltage^2

Voltages have dropped from 5V to just over 1V in
20 years.
The capacitive load is a function of the number
of transistors connected to an output and of the
technology, which determines the capacitance of
the wires and the transistors.
Slowing the clock rate reduces power, but not
energy.



Example
Some microprocessors today are
designed to have adjustable voltage, so
a 15% reduction in voltage may result
in a 15% reduction in frequency. What
would be the impact on dynamic power?
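A worked answer (the arithmetic below is mine, not on the slide): since dynamic power scales with the square of voltage and linearly with frequency,

Power_new / Power_old = (0.85 x Voltage)^2 x (0.85 x Frequency) / (Voltage^2 x Frequency)
                      = 0.85^3 = 0.61

so dynamic power drops to about 60% of its original value.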

Reading Assignment
IC manufacturing trends and cost
estimation of ICs are left for self-study.
The Role of Performance
Dependability
As VLSI technology shrinks from 65 nm
to 32 nm and even 22 nm,
how dependable and reliable do these devices remain?
Think of a few critical applications:
stock market servers, missile operations,
spacecraft operations, life-saving biomedical
equipment, etc.
How, then, do we estimate the reliability of these
devices?
Reliability is measured using the availability
and non-availability of the device for service.
We use the term Mean Time To Failure (MTTF) to
measure availability; the reciprocal of MTTF
is the failure rate, often quoted as failures in time (FIT).
Devices may be returned to use by repair.
The time to repair is measured as Mean Time To
Repair (MTTR).
We may simply use the term Mean Time
Between Failures (MTBF), the sum of
MTTF and MTTR.

Module availability may then be
estimated as:
Availability = MTTF / (MTTF + MTTR)
Example 1
Assume a disk subsystem with the following
components and MTTF:
10 disks, each rated at 1,000,000-hour MTTF
1 SCSI controller, 500,000-hour MTTF
1 power supply, 200,000-hour MTTF
1 fan, 200,000-hour MTTF
1 SCSI cable, 1,000,000-hour MTTF
Using the simplifying assumptions that the
lifetimes are exponentially distributed and that
failures are independent, compute the MTTF of
the system as a whole.

Solution
Method: sum the failure rates of the
individual components; that gives the
system failure rate.
Express it in FIT if desired: failures per
billion hours.
The reciprocal of the failure rate is the system MTTF.
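Working the numbers from the example above (the arithmetic is mine, following the book's solution):

Failure rate = 10 x 1/1,000,000 + 1/500,000 + 1/200,000 + 1/200,000 + 1/1,000,000
             = (10 + 2 + 5 + 5 + 1) / 1,000,000
             = 23 / 1,000,000 hours, i.e. 23,000 FIT

MTTF = 1 / Failure rate = 1,000,000 / 23 = about 43,500 hours (just under 5 years)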
Example 2
Disk subsystems often have redundant
power supplies to improve dependability.
Using the components and MTTFs from
above, calculate the reliability of a
redundant power supply. Assume one power
supply is sufficient to run the disk subsystem and that we
are adding one redundant power supply.

Solution
With two power supplies, the mean time until the first
of the pair fails is half that of a single supply:
MTTF_first failure = MTTF_power supply / 2
Since only one power supply is needed at a time, a
failed supply can be repaired while the other carries
the load. The probability that the second supply fails
during the repair window is approximately
MTTR / MTTF_power supply.
Hence the MTTF of the redundant pair is approximately:
MTTF_pair = (MTTF_power supply / 2) / (MTTR / MTTF_power supply)
          = MTTF_power supply^2 / (2 x MTTR)
If we make the approximation that the faulty unit
is repaired within 24 hours:
MTTF_pair = 200,000^2 / (2 x 24) = about 830,000,000 hours!!!
making the redundant pair about 4,150 times more reliable than
a single power supply.


Measuring, Reporting, and
Summarizing Performance
Performance Metrics
Response Time
Delay between the start and end of a task

Throughput
Number of tasks completed per unit time

New: Power/Energy
Energy per task, power
Examples
(Throughput/Performance)
Replace the processor with a faster
version?
3.8 GHz instead of 3.2 GHz

Add an additional processor to a system?
Core Duo instead of P4
Core i3, i5, and i7 are the latest in the desktop
series.
Measuring Performance
Wall-clock time, or Total Execution Time

CPU Time
User Time
System Time

Try the time command on a UNIX system.
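For instance, the UNIX time command reports all three metrics (the program name and timings below are illustrative, not measured):

$ time ./myprogram
real    0m2.13s
user    0m1.85s
sys     0m0.20s

"real" is wall-clock time; "user" is CPU time spent in the program itself; "sys" is CPU time spent in the operating system on the program's behalf.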
Relating the Metrics
Performance = 1 / Execution Time

CPU Execution Time = CPU clock cycles
for the program x Clock cycle time

CPU clock cycles = Instructions for the
program x Average clock cycles per
instruction (CPI)
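A quick illustrative calculation (the numbers are assumed, not from the slides): a program executing 2 x 10^9 instructions with an average CPI of 1.5 on a 1 GHz clock (cycle time 1 ns) takes

CPU time = 2 x 10^9 x 1.5 x 1 ns = 3 seconds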


n = Execution time_Y / Execution time_X
means machine X is n times faster
than machine Y; it is a relative measure!
n = Execution time_Y / Execution time_X
  = (1 / Performance_Y) / (1 / Performance_X)
  = Performance_X / Performance_Y

"the throughput of X is 1.3 times higher than Y"

Benchmarks
Kernels, which are small, key pieces of real
applications;
Toy programs, which are 100-line programs
from beginning programming assignments, such
as quicksort; and
Synthetic benchmarks, which are fake
programs invented to try to match the profile and
behavior of real applications, such as Dhrystone.

SPEC Benchmarks
These are standard programs written
for some reference machine.
When these programs are run on another
machine, we evaluate how much faster
that machine is.
Hence the SPEC ratio is:
SPECRatio_A = ExecTime on Reference Machine / ExecTime on Machine A
Hence SPEC always gives us a weight
in terms of ratios.
Then how do we compare two machines, A and B?


Because a SPECRatio is a ratio rather
than an absolute execution time, the mean
must be computed using the geometric
mean. (Since SPECRatios have no units, comparing
SPECRatios arithmetically is meaningless.)


1. The geometric mean of the ratios is the
same as the ratio of the geometric means.
2. The ratio of the geometric means is equal
to the geometric mean of the performance
ratios, which implies that the choice of the
reference computer is irrelevant.!!

Example
Show that the ratio of the geometric
means is equal to the geometric mean of
the performance ratios, and hence that the
reference computer of the SPECRatio does
not matter at all.
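A sketch of the algebra (mine, following the book's argument): for machines A and B measured on n benchmarks,

GM(SPECRatio_A) / GM(SPECRatio_B)
 = [ Prod_i SPECRatio_A_i ]^(1/n) / [ Prod_i SPECRatio_B_i ]^(1/n)
 = [ Prod_i (SPECRatio_A_i / SPECRatio_B_i) ]^(1/n)
 = [ Prod_i (ExecTime_B_i / ExecTime_A_i) ]^(1/n)      (the reference times cancel)
 = geometric mean of the performance ratios of A over B

Since the reference machine's execution times cancel out of the ratio, the choice of reference computer is indeed irrelevant.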
To characterize variability about the
arithmetic mean, we use the arithmetic
standard deviation (stdev), often denoted σ.
It is defined as:
stdev = sqrt( Sum_i (sample_i - mean)^2 / n )
Using the data in the above figure,
calculate the geometric standard deviation
and the percentage of the results that fall
within a single standard deviation of the
geometric mean. Are the results
compatible with a lognormal distribution?
(The solution to this is on page #37; please refer to it.)
CPU Performance
The Fundamental Law

CPU time = seconds / program
         = (instructions / program) x (cycles / instruction) x (seconds / cycle)

Three components of CPU performance:
Instruction count
CPI
Clock cycle time

Which design levels affect which component:

                         Inst. Count   CPI   Clock
Program                  X
Compiler                 X             X
Inst. Set Architecture   X             X     X
Arch (organization)                    X     X
Physical Design                              X


CPI - Cycles per Instruction
Let F_i be the frequency of type i instructions in a program.
Then, the average CPI is:

CPI = Total cycles / Total instruction count
    = Sum_{i=1..n} (CPI_i x F_i),  where F_i = IC_i / Instruction count

CPU time = Cycle time x Sum_{i=1..n} (CPI_i x IC_i)

Example:
Instruction type   Frequency   Clock cycles
ALU                43%         1
Load               21%         2
Store              12%         2
Branch             24%         2

average CPI = 0.43x1 + 0.21x2 + 0.12x2 + 0.24x2
            = 0.43 + 0.42 + 0.24 + 0.48 = 1.57 cycles/instruction
Example 1
A LOAD/STORE machine has the characteristics shown below. We also
observe that 25% of the ALU operations directly use a loaded value that is
not used again. Thus we hope to improve things by adding new ALU
instructions that have one source operand in memory. The CPI of the new
instructions is 2. The only unpleasant consequence of this change is that
the CPI of branch instructions will increase from 2 to 3. Overall, will CPU
performance increase?
Instruction type Frequency CPI
ALU ops 0.43 1
Loads 0.21 2
Stores 0.12 2
Branches 0.24 2
Example 1 (Solution)
Before the change:
Instruction type   Frequency   CPI
ALU ops            0.43        1
Loads              0.21        2
Stores             0.12        2
Branches           0.24        2

CPI = 0.43x1 + (0.21 + 0.12 + 0.24)x2 = 1.57
CPU time = IC x 1.57 x T   (T = clock cycle time)

After the change, let x be the fraction of the original instructions
replaced by the new reg-mem ops: x = 0.25 x 0.43 = 0.1075,
so the new instruction count is (1 - x) x IC.

Instruction type   Frequency            CPI
ALU ops            (0.43 - x)/(1 - x)   1
Loads              (0.21 - x)/(1 - x)   2
Stores             0.12/(1 - x)         2
Branches           0.24/(1 - x)         3
Reg-mem ops        x/(1 - x)            2

CPI = [ (0.43 - x)x1 + (0.21 - x)x2 + 0.12x2 + 0.24x3 + x x2 ] / (1 - x)
    = 1.7025 / 0.8925 = 1.908
CPU time = (1 - x) x IC x 1.908 x T = 1.703 x IC x T

Since CPU time increases (1.703 > 1.57), the change will not improve performance.
Example 2
A load-store machine has the characteristics shown below. An optimizing
compiler for the machine discards 50% of the ALU operations, although it
cannot reduce loads, stores, or branches. Assuming a 500 MHz (2 ns)
clock, what is the MIPS rating for optimized code versus unoptimized code?
Does the ranking of MIPS agree with the ranking of execution time?
Instruction type Frequency CPI
ALU ops 43% 1
Loads 21% 2
Stores 12% 2
Branches 24% 2
Example 2 (Solution)
Without optimization:
Instruction type   Frequency   CPI
ALU ops            43%         1
Loads              21%         2
Stores             12%         2
Branches           24%         2

CPI = 0.43x1 + (0.21 + 0.12 + 0.24)x2 = 1.57
CPU time = IC x 1.57 x 2 ns = 3.14 x 10^-9 x IC
MIPS rating = 500 MHz / (1.57 x 10^6) = 318.5

With optimization, let x = 0.5 x 0.43 = 0.215 be the fraction of
instructions discarded:
Instruction type   Frequency            CPI
ALU ops            (0.43 - x)/(1 - x)   1
Loads              0.21/(1 - x)         2
Stores             0.12/(1 - x)         2
Branches           0.24/(1 - x)         2

CPI = [ (0.43 - x)x1 + (0.21 + 0.12 + 0.24)x2 ] / (1 - x)
    = 1.355 / 0.785 = 1.73
CPU time = (1 - x) x IC x 1.73 x 2 ns = 2.72 x 10^-9 x IC
MIPS rating = 500 MHz / (1.73 x 10^6) = 289.0

Performance increases, but the MIPS rating decreases!
Example 3
Suppose we have made the following
measurements:
Frequency of FP operations = 25%
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20
Assume there are two design alternatives:
one decreases the CPI of FPSQR to 2, and the
other decreases the average CPI of all FP
operations to 2.5.
Compare these two design alternatives using
the processor performance equation.

Solution
First find the CPI of the original processor:
CPI_original = 0.25 x 4.0 + 0.75 x 1.33 = 2.0

Design 1 (faster FPSQR):
CPI_with new FPSQR = CPI_original - 2% x (CPI_old FPSQR - CPI_new FPSQR)
                   = 2.0 - 0.02 x (20 - 2) = 1.64

Design 2 (faster FP overall):
CPI_new FP = (75% x 1.33) + (25% x 2.5) = 1.62

Overall speedup of Design 2 = CPI_original / CPI_new FP = 2.0 / 1.62 = 1.23
Design 2, improving all FP operations, is slightly better.

Performance of (Blocking) Caches

With no cache misses:
CPU time = CPU clock cycles x Clock cycle time,
where CPU clock cycles = IC x CPI   (IC = instruction count)

With cache misses:
CPU time = (CPU clock cycles + Memory stall cycles) x Clock cycle time

Memory stall cycles = Number of misses x Miss penalty
                    = IC x (Misses / Instruction) x Miss penalty
                    = IC x (Memory references / Instruction)
                         x (Misses / Memory reference) x Miss penalty
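Plugging in illustrative numbers (assumed, not from the slides): with a base CPI of 1.0, 1.5 memory references per instruction, a 2% miss rate, and a 100-cycle miss penalty:

Memory stall cycles per instruction = 1.5 x 0.02 x 100 = 3.0
Effective CPI = 1.0 + 3.0 = 4.0

so cache misses quadruple the CPU time in this case.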
Quantitative Principles of
Computer Design
Key principle: taking advantage of
parallelism is one of the most important
methods for improving performance.
90% of execution time is typically spent in
only 10% of the code!!
Think of bringing parallelism to this 10%
of the code.
Validity of the single processor approach to achieving large scale computing capabilities, G. M. Amdahl,
AFIPS Conference Proceedings, pp. 483-485, April 1967
http://www-inst.eecs.berkeley.edu/~n252/paper/Amdahl.pdf
Amdahl's Law (History, 1967)
Historical context:
Amdahl was demonstrating the continued validity of the single-
processor approach and the weaknesses of the multiple-
processor approach.
The paper contains no mathematical formulation, just arguments and
simulation.
"The nature of this overhead appears to be sequential so that it is
unlikely to be amenable to parallel processing techniques."
"A fairly obvious conclusion which can be drawn at this point is that
the effort expended on achieving high parallel performance rates is
wasted unless it is accompanied by achievements in sequential
processing rates of very nearly the same magnitude."
Nevertheless, it is of widespread applicability in all kinds
of situations.
Speedup
The book shows two forms of the speedup equation:

Speedup_overall = ExTime_new / ExTime_old

Speedup_overall = ExTime_old / ExTime_new

We will use the second, because it yields
speedup factors like 2X.
Amdahl's Law
Pitfall: expecting the improvement of one aspect of a
machine to increase performance by an amount
proportional to the size of the improvement.

The performance gain that can be
obtained by improving some portion of a
computer can be calculated using
Amdahl's Law.
Amdahl's Law states that the performance
improvement to be gained from using
some faster mode of execution is limited
by the fraction of the time the faster mode
can be used.

Amdahl's Law defines speedup.
What is speedup?

Amdahl's Law:
Speedup_overall = ExTime_old / ExTime_new
               = 1 / [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

The best you could ever hope to do:
Speedup_maximum = 1 / (1 - Fraction_enhanced)

Equivalently, for execution time:
ExTime_new = ExTime_old x [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
Two things determine the speedup:
1. The fraction of the computation time in the
original computer that can be converted to
take advantage of the enhancement.
For example, if 20 seconds of the
execution time of a program that takes 60
seconds in total can use an enhancement,
the fraction is 20/60. This value, which we
will call Fraction_enhanced, is always less
than or equal to 1.

2. The improvement gained by the enhanced
execution mode; that is, how much faster the
task would run if the enhanced mode were used
for the entire program.
This value is the time of the original mode over
the time of the enhanced mode. If the enhanced
mode takes, say, 2 seconds for a portion of the
program that takes 5 seconds in the original
mode, the improvement is 5/2. We will call this
value, which is always greater than 1,
Speedup_enhanced.

The execution time using the original computer with the enhanced mode will
be the time spent using the unenhanced portion of the computer plus the
time spent using the enhancement.
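Combining the two illustrative values above (the arithmetic is mine, using the slide's own numbers): with Fraction_enhanced = 20/60 = 1/3 and Speedup_enhanced = 5/2 = 2.5,

ExTime_new = ExTime_old x [ (1 - 1/3) + (1/3) / 2.5 ] = ExTime_old x 0.8

for an overall speedup of 1 / 0.8 = 1.25.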
Example
Suppose that we want to enhance the
processor used for Web serving. The new
processor is 10 times faster on computation
in the Web serving application than the
original processor. Assuming that the original
processor is busy with computation 40% of the time and is
waiting for I/O 60% of the time, what is the overall speedup
gained by incorporating the enhancement?

Solution
Speedup_overall = 1 / [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
               = 1 / [ (1 - 0.4) + 0.4 / 10 ]
               = 1 / 0.64
               = 1.56

It is human nature to be attracted by "10X faster", versus
keeping in perspective that it is just 1.6X faster:
the new CPU is 10X faster, but only 40% of the time is computation;
the server is I/O bound, waiting 60% of the time.
Amdahl's Law for Multiple Tasks

R_avg = 1 / Sum_i (F_i / R_i)

where F_i is the fraction of results generated at rate R_i
(results/second), and R_avg is the average execution rate
(performance).
Note: F_i is the fraction of results generated at this rate,
not the fraction of time spent working at this rate.

"Bottleneckology: Evaluating Supercomputers", Jack Worlton, COMPCON 85, pp. 405-406
Example
30% of results are generated at the rate of 1 MFLOPS,
20% at 10 MFLOPS,
50% at 100 MFLOPS.
What is the average performance in MFLOPS?
What is the bottleneck?
(Bottleneck: the rate that consumes most of the time.)

R_avg = 1 / ( 0.3/1 + 0.2/10 + 0.5/100 )
      = 100 / ( 30 + 2 + 0.5 )
      = 100 / 32.5
      = 3.08 MFLOPS

Fraction of time spent at each rate:
1 MFLOPS:    30 / 32.5  = 92.3%
10 MFLOPS:    2 / 32.5  = 6.2%
100 MFLOPS: 0.5 / 32.5  = 1.5%

The bottleneck is the 1 MFLOPS rate, which consumes 92.3% of the time.
Example
A common transformation required in graphics
processors is square root. Implementations of floating-
point (FP) square root vary significantly in performance,
especially among processors designed for graphics.
Suppose FP square root (FPSQR) is responsible for
20% of the execution time of a critical graphics
benchmark. One proposal is to enhance the FPSQR
hardware and speed up this operation by a factor of 10.
The other alternative is just to try to make all FP
instructions in the graphics processor run faster by a
factor of 1.6; FP instructions are responsible for half of
the execution time for the application. The design team
believes that they can make all FP instructions run 1.6
times faster with the same effort as required for the fast
square root.
Compare these two design alternatives.

Problem analysis:
Design 1
FPSQR is responsible for 20% of the
application execution time.
The FPSQR hardware is sped up by a factor of 10.
Design 2
FP instructions are responsible for 50% of the
application execution time.
All FP instructions run 1.6 times faster.
Which design has the better speedup?
Solution
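Applying Amdahl's Law to each design (this worked solution follows the book's; the plugged-in values come from the problem statement):

Speedup_FPSQR = 1 / [ (1 - 0.2) + 0.2 / 10 ]  = 1 / 0.82   = 1.22
Speedup_FP    = 1 / [ (1 - 0.5) + 0.5 / 1.6 ] = 1 / 0.8125 = 1.23

Improving all FP instructions gives a slightly better speedup, especially since the team believes it requires the same design effort.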
Implications of Amdahls Law
The opportunity for improvement is affected by
how much time the event consumes.
Very high speedup requires making nearly
every case fast.
As stated, Amdahl's Law is valid only if the
system always works in exactly one of the
modes.
Overlap between CPU and I/O operations?
Then Amdahl's Law as given here is not applicable.

Implications of Amdahls Law
Improvements provided by a feature are limited by how often
the feature is used.

The bottleneck is the most promising target for improvement:
make the common case fast.
Infrequent events, even if they consume a lot of time,
will make little difference to performance.
Typical use: change only one parameter of the system, and
compute the effect of this change.
The same program, with the same input data, should
run on the machine in both cases.
Focus on overall performance, not on one
aspect.

Summary
Computer Architecture = Instruction Set Architecture + Machine
Organization
All computers consist of five components:
Processor: (1) datapath and (2) control
(3) Memory
(4) Input devices and (5) Output devices
Not all memory is created equal:
Cache: fast (expensive) memory placed closer to the
processor
Main memory: less expensive memory, of which we can have more
Interfaces are where the problems are: between functional units,
and between the computer and the outside world.
We must design against constraints of performance, power, area, and
cost.


Summary
Performance is in the eye of the beholder:

Seconds/Program =
(Instructions/Program) x (Clock cycles/Instruction) x (Seconds/Clock cycle)

Amdahl's Law: Make the Common Case
Fast
