Welcome To CSCE 4610/5610: Computer Architecture



Outcomes for CSCE 4610



1. Apply metrics to evaluate the performance of modern computer systems.

2. Design processor pipelines to meet specifications.

3. Design simple branch prediction for a pipelined processor.

4. Design out-of-order instruction execution using reservation stations and reorder buffers.

5. Apply simple compiler techniques to improve performance.

6. Gain knowledge about various cache design alternatives.




At the end of the semester you will be asked whether we met these objectives

Review


What is computer architecture?




Instruction Set Architecture




Computer Organization




Micro-architecture




System Architecture




Role of a computer architect


Support functionality




Depends on the type of applications or target market




Desktop, server, scientific, mobile/personal devices




Embedded systems (controllers, etc.)




Understand technology trends




Denser chips




Denser memories




Memory wall




Clock frequencies




Heat dissipation


Support functionality with best performance




Speed performance




Reliability, availability




Power/energy performance




Hard or soft real-time requirements




Issues related to Cost



Cost of an integrated circuit (IC)




Cost of the die (or chip)




Cost of testing




Cost of packaging

Cost of the die depends on the number of dies per wafer and how many of those dies are good (the yield)

Dies per wafer ≈ π × (Wafer diameter / 2)² / (Die area) - π × (Wafer diameter) / sqrt(2 × Die area)




Die yield = (Wafer yield) × 1 / [1 + (Defects per unit area) × (Die area)]^N

N is known as the process-complexity factor; in 2010 its value ranged between 11.5 and 15.5

Example from page 31



Note: Some of the problems from Chapter 1 do not work with this formula

A 30 cm wafer, and we have two different dies: 1.5 cm or 1.0 cm on a side









Dies per wafer:




With 1.5 cm dies (2.25 cm²) we get 270 dies

With 1.0 cm dies (1.0 cm²) we get 640 dies




Defects per unit area determine the yield

Given 0.031 defects per cm² and N = 13.5



If we use 1.5 cm dies: die yield = 0.40, so we get 270 × 0.4 = 108 good chips from a 30 cm wafer

If we use 1.0 cm dies: die yield = 0.66, and we get 640 × 0.66 = 422 good chips
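A minimal Python sketch of both calculations (the function names are mine; the dies-per-wafer expression is the usual wafer-area-over-die-area approximation with an edge correction, which reproduces the 270 and 640 figures above):

```python
import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    # Wafer area divided by die area, minus a correction for the
    # partial dies along the circumference.
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return round(wafer_area / die_area_cm2 - edge_loss)

def die_yield(defects_per_cm2, die_area_cm2, n=13.5, wafer_yield=1.0):
    # Die yield = wafer_yield * 1 / [1 + defects_per_unit_area * die_area]^N
    return wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n

for die_area in (2.25, 1.0):                 # 1.5 cm and 1.0 cm square dies
    dies = dies_per_wafer(30, die_area)      # 30 cm wafer
    y = die_yield(0.031, die_area)           # 0.031 defects/cm^2, N = 13.5
    print(die_area, dies, round(y, 2), int(dies * y))
# -> 2.25 cm^2: 270 dies, yield 0.40, 108 good dies
# -> 1.0  cm^2: 640 dies, yield 0.66, 423 good dies
#    (the slide rounds the yield to 0.66 first and gets 422)
```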





Another example: Problem 1.1 on page 62




Here we simply apply the yield equation

Die yield = (Wafer yield) × 1 / [1 + (Defects per unit area) × (Die area)]^N

We will assume wafer yield to be 100% and N = 13.5





If you use the equation in the textbook, we get Yield = 2.9 × 10^-5

VERY BAD!

The previous edition used the following equation for yield

Die yield = (Wafer yield) × [1 + (Defects per unit area) × (Die area) / α]^(-α)

Assume wafer yield = 100% and α = 4





Now the yield for the Power4 turns out to be more reasonable: 0.36
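A small sketch contrasting the textbook formula with the previous edition's formula. The actual die area and defect density come from the problem statement and are not reproduced here; the 4 cm² / 0.3 defects-per-cm² values below are hypothetical stand-ins, chosen only to show how differently the two models behave on a large die:

```python
def yield_5th_edition(defects_per_cm2, die_area_cm2, n=13.5, wafer_yield=1.0):
    # Die yield = wafer_yield / [1 + defects_per_unit_area * die_area]^N
    return wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n

def yield_4th_edition(defects_per_cm2, die_area_cm2, alpha=4.0, wafer_yield=1.0):
    # Die yield = wafer_yield * [1 + defects_per_unit_area * die_area / alpha]^(-alpha)
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

# Hypothetical large die: 4 cm^2 at 0.3 defects/cm^2 (placeholder values)
print(yield_5th_edition(0.3, 4.0))   # ~2.4e-05, absurdly low (like the slide's 2.9e-05)
print(yield_4th_edition(0.3, 4.0))   # ~0.35, reasonable (like the slide's 0.36)
```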




Why does the Power5 have a lower defect rate?

IBM's process technology is older (a larger feature size in nm), so it is more mature and has fewer manufacturing defects

Power consumed by a processor

Two types of power consumed


Static: consumed even when a hardware component is not active




sometimes called leakage




Dynamic: due to the switching of transistors

Power_dynamic = (1/2) × (Capacitive load) × (Supply voltage)^2 × (Operating frequency)




Example: What happens if the voltage is dropped by 15%, with a proportional (15%) drop in operating frequency?

No change in capacitive load, so the new power is (0.85)^2 × 0.85 = 0.85^3 ≈ 0.61 of the old power

So we reduced the power consumption by about 39%
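The arithmetic behind the 39% figure, as a sketch: dynamic power is proportional to V² × f, so scaling both voltage and frequency by 0.85 scales power by 0.85³.

```python
def dynamic_power(cap_load, voltage, frequency):
    # P_dynamic = 1/2 * C * V^2 * f
    return 0.5 * cap_load * voltage ** 2 * frequency

p_old = dynamic_power(1.0, 1.0, 1.0)      # normalized baseline
p_new = dynamic_power(1.0, 0.85, 0.85)    # 15% lower voltage and frequency
print(1 - p_new / p_old)                  # ~0.39 -> roughly a 39% reduction
```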


Consider another example: problem 1.4





Note: we are using an Intel processor with 2 DRAM chips and a 7200 rpm disk


total power = 66 (for processor) +2*(2.3) (for DRAM) + 7.9 (for disk)




= 78.5 W

However, if the power supply works at only 80% efficiency and needs to deliver 78.5 W, we need a power supply rated for 78.5/0.8 ≈ 98 W

b) The disk is 60% idle (40% busy)

7.9 × 40% + 4.0 × 60% = 5.56 W

c) The 7200 rpm disk can be idle longer because its seek time is shorter:

seek_time(7200) = 75% × seek_time(5400)

Time fractions: seek(7200) + idle(7200) = 100% = seek(5400) + idle(5400)

We need to equate the total-power expressions for the two disks, using the seek and idle powers of each disk given in the table

Note: seek fraction = 1 - (idle fraction)

Solving, you will find that idle(7200) is approximately 29%
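A sketch of parts (a) and (b), using only the component powers quoted above (66 W processor, 2.3 W per DRAM, 7.9 W disk busy, 4.0 W disk idle); the rest of the problem's table is not reproduced here:

```python
cpu_w, dram_w, disk_busy_w, disk_idle_w = 66.0, 2.3, 7.9, 4.0

# (a) total power and the power-supply rating at 80% efficiency
total = cpu_w + 2 * dram_w + disk_busy_w          # 78.5 W
supply = total / 0.8                              # ~98 W rating needed
print(total, supply)

# (b) average disk power when the disk is busy 40% and idle 60% of the time
disk_avg = 0.4 * disk_busy_w + 0.6 * disk_idle_w  # 5.56 W
print(disk_avg)

# (c) would equate the two disks' total-power expressions using the seek/idle
# powers from the problem's table (not reproduced here); it gives ~29% idle.
```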




Note: More hardware means more power consumption


both static and dynamic


The capacitive load is proportional to the number of transistors

Dynamic voltage and frequency scaling


Changing the voltage and clock speed can degrade performance




A better measure may be time*energy product


If we change voltage and frequency in the middle of execution, we lose some time since hardware components need to be resynchronized



Dropping the (supply) voltage reduces power consumption, but the circuit may become more error prone



A lot of work has been done on changing frequencies as well as shutting off components to save power



Globally Asynchronous Locally Synchronous (GALS)


Different units (different stages of a pipeline) run at different clock rates





Another criterion is the amount of silicon area needed




at least for embedded systems

Let us define the area needed for a 1-bit register as 1 rbe (register-bit equivalent).



To build one bit of SRAM we need 0.6 rbe

To build one bit of DRAM we need 0.1 rbe

To build one bit of a direct-mapped cache we need ~0.8 rbe


So we need to decide whether we want SRAM or DRAM or cache or registers



Tradeoff: registers are faster than caches, and caches are faster than DRAM

Logic circuits (like control logic, arithmetic logic) consume more area than memory units

If we can reduce the amount of cache memory needed, we can potentially reduce the area
needed for cache and power consumption




We have explored some ideas: keep the same performance but reduce the area and power consumed by using different cache organizations
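A toy comparison using the rbe figures above; the 32 × 32-bit register file and the 32 KB sizes are hypothetical, chosen only to show the relative areas:

```python
RBE_REGISTER_BIT = 1.0   # area of one register bit (the definition of the rbe)
RBE_SRAM_BIT     = 0.6
RBE_DRAM_BIT     = 0.1
RBE_CACHE_BIT    = 0.8   # direct-mapped cache bit (approximate)

def area_rbe(bits, rbe_per_bit):
    return bits * rbe_per_bit

kb = 8 * 1024  # bits per KB
print(area_rbe(32 * 32, RBE_REGISTER_BIT))   # 32 x 32-bit registers: 1,024 rbe
print(area_rbe(32 * kb, RBE_CACHE_BIT))      # 32 KB direct-mapped cache: ~209,715 rbe
print(area_rbe(32 * kb, RBE_DRAM_BIT))       # 32 KB of DRAM: ~26,214 rbe
```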




Cost of computers


The cost of the CPU chip is only a small fraction of the overall cost of a computer




CPU is 22% of total system cost


This fraction keeps changing as the costs of other components change

The cost of a system must be understood in relation to its selling price

the actual cost of the system is only 25% of the list price

the rest goes to marketing, profits, etc.

So, if the cost of the CPU increases by $1, the system cost increases by about $4.5

The list price will increase by about $18 (4 × $4.5, since the cost is only 25% of the list price)!

So, if we are considering adding new functionality, we need to worry about the impact of the
functionality on cost and price



And the increase should be justified by performance: speed, reliability/availability, or lower power




How do we define the performance of a processor?


Execution time for a program?




Wall clock or CPU time?


User CPU and System CPU Time

For now we will only use user CPU time = (Instruction count) × (CPI) / (Clock rate)

Note: cycle time = 1 / clock rate; a 1 GHz clock means 1 ns per cycle



CPI: Average number of Cycles Per Instruction.


How do we find this?

Consider for example that we collected average frequencies for various instruction types.




ALU operations occur 43% of time and take 1 clock cycle to execute

Load instructions occur 21% of the time and need 2 cycles

Store instructions occur 12% of the time and need 2 cycles

Branch instructions occur 24% of the time and need 2 cycles






How do we get CPI = average cycles per instruction?

CPI = Σ_i (CPI_i × C_i) / (Instruction count), where C_i is the number of executed instructions of type i

The average number of cycles per instruction = 0.43×1 + 0.21×2 + 0.12×2 + 0.24×2 = 1.57 cycles per instruction

Once we have the CPI and the clock speed, we can find the MIPS rating of a processor



If we are using a 1 GHz processor, the MIPS rating is given by

10^9 / (1.57 × 10^6) = 637 MIPS (Million Instructions Per Second)



Execution time = (instruction_count) × (1 / 637 MIPS)

= (instruction_count) × 1.57 × 10^-9 seconds



Remember, the clock speed (or frequency) is inversely related to clock period.
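The same CPI, MIPS, and per-instruction-time arithmetic as a short sketch:

```python
# (fraction of instructions, cycles per instruction) for each class
mix = {"alu": (0.43, 1), "load": (0.21, 2), "store": (0.12, 2), "branch": (0.24, 2)}

cpi = sum(frac * cycles for frac, cycles in mix.values())   # 1.57
clock_hz = 1e9                                              # 1 GHz
mips = clock_hz / (cpi * 1e6)                               # ~637 MIPS
time_per_instr = cpi / clock_hz                             # ~1.57e-9 s per instruction
print(cpi, mips, time_per_instr)
```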




Consider why the MIPS rating can be misleading.


Suppose we have a compiler that can optimize the program


The optimizing compiler can eliminate 50% of the arithmetic (ALU) instructions.



Now let us consider how the equations change. What is the CPI?

Consider for example that we collect average frequencies for various instruction types.

ALU operations occur 21.5% of time and take 1 clock cycle to execute

Load instructions occur 21% of the time and need 2 cycles

Store instructions occur 12% of the time and need 2 cycles

Branch instructions occur 24% of the time and need 2 cycles



But we need to scale these fractions since the total is only 78.5%

So CPI = [(21.5%)×1 + (21%)×2 + (12%)×2 + (24%)×2] / (78.5%)

= 1.73 cycles per instruction: a larger CPI!



MIPS = 10^9 / (1.73 × 10^6) = 578 MIPS



So the computer with the optimizing compiler has a lower MIPS rating!
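A sketch of both computations. It also shows why the rating is misleading: with half the ALU instructions gone, the optimized program has a lower MIPS rating yet takes less total time.

```python
clock_hz = 1e9

def cpi_and_mips(mix):
    # mix: list of (fraction of the ORIGINAL instruction count, cycles)
    cpi = sum(f * c for f, c in mix) / sum(f for f, c in mix)
    return cpi, clock_hz / (cpi * 1e6)

original  = [(0.43, 1), (0.21, 2), (0.12, 2), (0.24, 2)]    # fractions sum to 1.0
optimized = [(0.215, 1), (0.21, 2), (0.12, 2), (0.24, 2)]   # fractions sum to 0.785

cpi_o, mips_o = cpi_and_mips(original)    # 1.57 CPI, ~637 MIPS
cpi_n, mips_n = cpi_and_mips(optimized)   # ~1.73 CPI, ~579 MIPS (578 if CPI is rounded first)

# Execution time relative to the original instruction count IC:
#   original:  IC * 1.0   * cpi_o / clock  -> 1.57 * IC / clock
#   optimized: IC * 0.785 * cpi_n / clock  -> ~1.36 * IC / clock  (faster!)
print(mips_o, mips_n, 0.785 * cpi_n / cpi_o)   # last value ~0.86: about 14% less time
```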




Another Example

Consider two processors with different ways of implementing conditional branches



CPU-A: needs two instructions, a compare and a branch (e.g., SLT R3, R1, R2; BNZ R3, loop)



CPU-B: a single instruction to compare and branch (e.g., BLT R1, R2, Loop)



Branches take 2 cycles and all other instructions take 1 cycle

Frequency of branches = 20%



CPU-A's clock is 25% faster (simpler instructions), so CPU-B's cycle time is 1.25 × CPU-A's

Time on CPU-A = (Instr_Count) × {0.80×1 + 0.20×(2+1)} × (Cycle_Time)

= (Instr_Count) × 1.4 × (Cycle_Time)

Time on CPU-B = (Instr_Count) × (0.8×1 + 0.2×2) × (1.25 × Cycle_Time)

= (Instr_Count) × 1.5 × (Cycle_Time)

CPU-A is faster even though it executes more instructions!
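The same comparison as a sketch, with the instruction count and CPU-A's cycle time normalized to 1:

```python
ic, cycle_time = 1.0, 1.0          # normalized instruction count and CPU-A cycle time
branch_frac = 0.20

# CPU-A: branch costs 2 cycles plus 1 cycle for the separate compare instruction
time_a = ic * (0.80 * 1 + branch_frac * (2 + 1)) * cycle_time        # 1.4
# CPU-B: single compare-and-branch (2 cycles), but a 25% slower clock
time_b = ic * (0.80 * 1 + branch_frac * 2) * (1.25 * cycle_time)     # 1.5
print(time_a, time_b)   # CPU-A wins despite executing more instructions
```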




Many Complex Interactions During Execution.

Pipeline bubbles (stalls, or lost cycles) due to branch instructions


Consider, for example, that on average 50% of all branches are taken and cause 3-cycle stalls (lost cycles)


What is the CPI for branch instructions?


If not taken, CPI = 1


If taken, CPI = 4


Effective CPI for branches = 0.5*1+0.5*4 =2.5

If branches are 20%, total CPI = 80%*1 + 20%*2.5 = 1.3

Cache Misses


Affect only load and store instructions


If no cache miss, say CPI =2


On a cache miss, we may have a CPI of 50

A 5% miss rate leads to 0.95×2 + 0.05×50 = 4.4



Remember the instruction frequencies from a previous example




ALU operations occur 43% of time and take 1 clock cycle to execute

Load instructions occur 21% of the time and need 2 cycles without cache misses

Store instructions occur 12% of the time and need 2 cycles without cache misses

Branch instructions occur 24% of the time and need 2 cycles

But if we have 21% loads and 12% stores, each averaging 4.4 cycles once cache misses are included,

the new CPI = 33%×4.4 + 43%×1 + 24%×2 ≈ 2.36 CPI

compared to 1.57 CPI with no cache misses
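A sketch of the three CPI calculations above (branch stalls, loads/stores with misses, and the combined mix):

```python
# Branches: 50% taken with a 3-cycle stall (CPI 4), 50% not taken (CPI 1)
branch_cpi = 0.5 * 1 + 0.5 * 4                      # 2.5
total_cpi_branches = 0.8 * 1 + 0.2 * branch_cpi     # 1.3

# Loads/stores: CPI 2 on a hit, 50 on a miss, 5% miss rate
mem_cpi = 0.95 * 2 + 0.05 * 50                      # 4.4

# Full mix: 43% ALU (1 cycle), 33% loads+stores (mem_cpi), 24% branches (2 cycles)
mix_cpi = 0.43 * 1 + 0.33 * mem_cpi + 0.24 * 2      # ~2.36 (vs 1.57 with no misses)
print(branch_cpi, total_cpi_branches, mem_cpi, mix_cpi)
```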

How to report performance data?


Execution time for one program


Execution times for all programs


Average execution time across all programs


Weighted average etc.

Assuming n programs:

Arithmetic Mean = (1/n) × Σ_{i=1..n} (Time)_i



Weighted Arithmetic Mean = Σ_{i=1..n} (Weight)_i × (Time)_i

Harmonic Mean = n / Σ_{i=1..n} (1 / (Time)_i)

Let us look at an example. Here we are comparing 3 different computers using 2 programs.

                 Computer A   Computer B   Computer C
Pgm P1                    1           10           20
Pgm P2                 1000          100           20
Total                  1001          110           40

Let us find the weighted arithmetic-average execution times; we will use 3 different sets of weights


W1: P1=50% P2=50%


W2: P1= 90.9%, P2= 9.1%


W3: P1= 99.9%; P2=0.1%




                 Computer A   Computer B   Computer C
Pgm P1                    1           10           20
Pgm P2                 1000          100           20
Total                  1001          110           40
Avg with W1           500.5           55           20
Avg with W2           91.91        18.19           20
Avg with W3               2        10.09           20

So which computer is best?




If we use W1, C is best; with W2, B is best; and with W3, A is best
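A sketch that reproduces the table of weighted averages:

```python
times = {"A": {"P1": 1, "P2": 1000},
         "B": {"P1": 10, "P2": 100},
         "C": {"P1": 20, "P2": 20}}

weights = {"W1": {"P1": 0.5,   "P2": 0.5},
           "W2": {"P1": 0.909, "P2": 0.091},
           "W3": {"P1": 0.999, "P2": 0.001}}

for wname, w in weights.items():
    avg = {m: round(sum(w[p] * t[p] for p in t), 2) for m, t in times.items()}
    print(wname, avg)
# W1: A=500.5, B=55.0,  C=20.0  (C best)
# W2: A=91.91, B=18.19, C=20.0  (B best)
# W3: A=2.0,   B=10.09, C=20.0  (A best)
```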



Can we think of a different way of computing averages?

Relative performance: for each program, use a relative execution time, compared to a standard (reference) computer.

The relative execution times can then be used to compute an arithmetic (or weighted) mean.




We can also compute the Geometric Mean:

Geometric Mean = [ Π_{i=1..n} (Relative execution time)_i ]^(1/n)

Let us look at our example using geometric means. The relative performance of the 3 machines remains the same.

Normalized to A:
                 Computer A   Computer B   Computer C
Pgm P1                    1           10           20
Pgm P2                    1          0.1         0.02
Arithmetic Mean           1         5.05        10.01
Geometric Mean            1            1         0.63

Normalized to B:
                 Computer A   Computer B   Computer C
Pgm P1                  0.1            1            2
Pgm P2                   10            1          0.2
Arithmetic Mean        5.05            1          1.1
Geometric Mean            1            1         0.63

Normalized to C:
                 Computer A   Computer B   Computer C
Pgm P1                 0.05          0.5            1
Pgm P2                   50            5            1
Arithmetic Mean       25.03         2.75            1
Geometric Mean         1.58         1.58            1

With the geometric mean, C is always the best, no matter which machine we normalize to
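A sketch showing why the geometric-mean ranking does not depend on the reference machine:

```python
import math

times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}

def geo_mean(xs):
    return math.prod(xs) ** (1 / len(xs))

for ref in "ABC":
    gm = {m: geo_mean([t / r for t, r in zip(times[m], times[ref])])
          for m in times}
    print(f"normalized to {ref}:", {m: round(v, 2) for m, v in gm.items()})
# The relative ordering is identical in all three cases: C is best, with a
# geometric mean 0.63x that of A and B, whichever machine we normalize to.
```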




Another example. See the table on page 43


Here we are looking at geometric means for the Opteron and the Itanium

A Sun Ultra 5 (UltraSPARC) is used as the reference computer

The Opteron runs about 30% slower (than the Itanium)





What programs to use in evaluating performance?

The programs that will be run in the field

Benchmark programs

Real programs that are common in an application domain




e.g., SPEC benchmarks (SPEC CPU: integer and floating point)




SPECWeb, SPECvirt




bio-informatics




High-performance computing (SPEC OMP)


Program kernels:




e.g., embedded kernels (EEMBC)

NAS Parallel Benchmarks, Livermore Loops


Synthetic program mixes




How to collect performance data using benchmarks?


Actual Measurements and Simulations

If the architecture already exists, run programs and collect data


Need to be careful in collecting data


Instrumentation may skew data


Performance Registers


Software profiling techniques

Or develop simulations.


Detailed simulations


Trace driven simulations


Monte Carlo simulations
