Welcome To CSCE 4610/5610: Computer Architecture



Outcomes for CSCE 4610



1. Apply metrics to evaluate the performance of modern computer systems.

2. Design processor pipelines to meet specifications.

3. Design simple branch prediction for a pipelined processor.

4. Design out-of-order instruction execution using reservation stations and reorder buffers.

5. Apply simple compiler techniques to improve performance.

6. Gain knowledge about various cache design alternatives.




At the end of the semester you will be asked whether we met these objectives

Review


What is computer architecture?




Instruction Set Architecture




Computer Organization




Micro-architecture




System Architecture




Role of a computer architect


Support functionality




Depends on the type of applications or target market




Desktop, server, scientific, mobile/personal devices




Embedded systems (controllers, etc.)




Understand technology trends




Denser chips




Denser memories




Memory wall




Clock frequencies




Heat dissipation


Support functionality with best performance




Speed performance




Reliability, availability




Power/energy performance




Hard or soft real-time requirements




Issues related to Cost



Cost of an integrated circuit (IC)




Cost of the die (or chip)




Cost of testing




Cost of packaging

Cost of the die depends on the number of dies per wafer and how many of those dies are good (the yield)

Dies per wafer ≈ π × (Wafer diameter / 2)² / (Die area) - π × (Wafer diameter) / sqrt(2 × Die area)




Die yield = (Wafer yield) × 1 / [1 + (Defects per unit area) × (Die area)]^N

N is known as the process-complexity factor; in 2010 its value ranged between 11.5 and 15.5

Example from page 31



Note: Some of the problems from Chapter 1 do not work with this formula

A 30 cm wafer, and we have two different dies: 1.5 cm or 1.0 cm on a side









Dies per wafer:




With 1.5 cm dies (2.25 cm²) we get 270 dies

With 1.0 cm dies (1.0 cm²) we get 640 dies




Defects per unit area determine the yield

Given 0.031 defects per cm² and N = 13.5



If we use 1.5 cm dies: die yield = 0.40, so we get 270 × 0.4 = 108 good chips from a 30 cm wafer

If we use 1.0 cm dies: die yield = 0.66, and we get 640 × 0.66 = 422 good chips
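A minimal Python sketch of both calculations (the function names are mine; the dies-per-wafer expression is the usual wafer-area-over-die-area approximation with an edge correction, which reproduces the 270 and 640 figures above):

```python
import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    # Wafer area divided by die area, minus a correction for the
    # partial dies along the circumference.
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return round(wafer_area / die_area_cm2 - edge_loss)

def die_yield(defects_per_cm2, die_area_cm2, n=13.5, wafer_yield=1.0):
    # Die yield = wafer_yield * 1 / [1 + defects_per_unit_area * die_area]^N
    return wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n

for die_area in (2.25, 1.0):                 # 1.5 cm and 1.0 cm square dies
    dies = dies_per_wafer(30, die_area)      # 30 cm wafer
    y = die_yield(0.031, die_area)           # 0.031 defects/cm^2, N = 13.5
    print(die_area, dies, round(y, 2), int(dies * y))
# -> 2.25 cm^2: 270 dies, yield 0.40, 108 good dies
# -> 1.0  cm^2: 640 dies, yield 0.66, 423 good dies
#    (the slide rounds the yield to 0.66 first and gets 422)
```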





Another example: Problem 1.1 on page 62




Here we simply apply the yield equation

Die yield = (Wafer yield) × 1 / [1 + (Defects per unit area) × (Die area)]^N

We will assume wafer yield to be 100% and N = 13.5





If you use the equation in the textbook, we get Yield = 2.9 × 10^-5

VERY BAD!

The previous edition used the following equation for yield

Die yield = (Wafer yield) × [1 + (Defects per unit area) × (Die area) / α]^(-α)

Assume wafer yield = 100% and α = 4





Now the yield for the Power4 turns out to be more reasonable: 0.36
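A small sketch contrasting the textbook formula with the previous edition's formula. The actual die area and defect density come from the problem statement and are not reproduced here; the 4 cm² / 0.3 defects-per-cm² values below are hypothetical stand-ins, chosen only to show how differently the two models behave on a large die:

```python
def yield_5th_edition(defects_per_cm2, die_area_cm2, n=13.5, wafer_yield=1.0):
    # Die yield = wafer_yield / [1 + defects_per_unit_area * die_area]^N
    return wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n

def yield_4th_edition(defects_per_cm2, die_area_cm2, alpha=4.0, wafer_yield=1.0):
    # Die yield = wafer_yield * [1 + defects_per_unit_area * die_area / alpha]^(-alpha)
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

# Hypothetical large die: 4 cm^2 at 0.3 defects/cm^2 (placeholder values)
print(yield_5th_edition(0.3, 4.0))   # ~2.4e-05, absurdly low (like the slide's 2.9e-05)
print(yield_4th_edition(0.3, 4.0))   # ~0.35, reasonable (like the slide's 0.36)
```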




Why does the Power5 have a lower defect rate?

IBM's process technology is older (a larger feature size in nm), so it is more mature and has fewer manufacturing defects

Power consumed by a processor

Two types of power consumed


Static: consumed even when a hardware component is not active




sometimes called leakage




Dynamic: due to the switching of transistors

Power_dynamic = (1/2) × (Capacitive load) × (Supply voltage)^2 × (Operating frequency)




Example: What happens if the voltage is dropped by 15%, with a proportional (15%) drop in operating frequency?

No change in capacitive load, so the new power is (0.85)^2 × 0.85 = 0.85^3 ≈ 0.61 of the old power

So we reduced the power consumption by about 39%
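The arithmetic behind the 39% figure, as a sketch: dynamic power is proportional to V² × f, so scaling both voltage and frequency by 0.85 scales power by 0.85³.

```python
def dynamic_power(cap_load, voltage, frequency):
    # P_dynamic = 1/2 * C * V^2 * f
    return 0.5 * cap_load * voltage ** 2 * frequency

p_old = dynamic_power(1.0, 1.0, 1.0)      # normalized baseline
p_new = dynamic_power(1.0, 0.85, 0.85)    # 15% lower voltage and frequency
print(1 - p_new / p_old)                  # ~0.39 -> roughly a 39% reduction
```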


Consider another example: problem 1.4





Note: we are using an Intel processor with 2 DRAM chips and a 7200 rpm disk


total power = 66 (for processor) +2*(2.3) (for DRAM) + 7.9 (for disk)




= 78.5 W

However, if the power supply works at only 80% efficiency and needs to deliver 78.5 W, we need a power supply rated for 78.5/0.8 ≈ 98 W

b) The disk is 60% idle (40% busy)

7.9 × 40% + 4.0 × 60% = 5.56 W

c) The 7200 rpm disk can be idle longer because its seek time is shorter:

seek_time(7200) = 75% × seek_time(5400)

Time fractions: seek(7200) + idle(7200) = 100% = seek(5400) + idle(5400)

We need to equate the total-power expressions for the two disks, using the seek and idle powers of each disk given in the table

Note: seek fraction = 1 - (idle fraction)

Solving, you will find that idle(7200) is approximately 29%
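A sketch of parts (a) and (b), using only the component powers quoted above (66 W processor, 2.3 W per DRAM, 7.9 W disk busy, 4.0 W disk idle); the rest of the problem's table is not reproduced here:

```python
cpu_w, dram_w, disk_busy_w, disk_idle_w = 66.0, 2.3, 7.9, 4.0

# (a) total power and the power-supply rating at 80% efficiency
total = cpu_w + 2 * dram_w + disk_busy_w          # 78.5 W
supply = total / 0.8                              # ~98 W rating needed
print(total, supply)

# (b) average disk power when the disk is busy 40% and idle 60% of the time
disk_avg = 0.4 * disk_busy_w + 0.6 * disk_idle_w  # 5.56 W
print(disk_avg)

# (c) would equate the two disks' total-power expressions using the seek/idle
# powers from the problem's table (not reproduced here); it gives ~29% idle.
```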




Note: More hardware means more power consumption


both static and dynamic


The capacitive load is proportional to the number of transistors

Dynamic voltage and frequency scaling


Changing the voltage and clock speed can degrade performance




A better measure may be time*energy product


If we change voltage and frequency in the middle of execution, we lose some time since hardware components need to be resynchronized



Dropping the (supply) voltage reduces power consumption, but the circuit may become more error prone



A lot of work has been done on changing frequencies as well as shutting off components to save power



Globally Asynchronous Locally Synchronous (GALS)


Different units (different stages of a pipeline) run at different clock rates





Another criterion is the amount of silicon area needed




at least for embedded systems

Let us define the area needed for a 1-bit register as 1 rbe (register-bit equivalent).



To build one bit of SRAM we need 0.6 rbe

To build one bit of DRAM we need 0.1 rbe

To build one bit of a direct-mapped cache we need ~0.8 rbe


So we need to decide whether we want SRAM or DRAM or cache or registers



Tradeoff: registers are faster than caches, and caches are faster than DRAM

Logic circuits (like control logic, arithmetic logic) consume more area than memory units

If we can reduce the amount of cache memory needed, we can potentially reduce the area
needed for cache and power consumption




We have explored some ideas: keep the same performance but reduce the area and power consumed by using different cache organizations
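A toy comparison using the rbe figures above; the 32 × 32-bit register file and the 32 KB sizes are hypothetical, chosen only to show the relative areas:

```python
RBE_REGISTER_BIT = 1.0   # area of one register bit (the definition of the rbe)
RBE_SRAM_BIT     = 0.6
RBE_DRAM_BIT     = 0.1
RBE_CACHE_BIT    = 0.8   # direct-mapped cache bit (approximate)

def area_rbe(bits, rbe_per_bit):
    return bits * rbe_per_bit

kb = 8 * 1024  # bits per KB
print(area_rbe(32 * 32, RBE_REGISTER_BIT))   # 32 x 32-bit registers: 1,024 rbe
print(area_rbe(32 * kb, RBE_CACHE_BIT))      # 32 KB direct-mapped cache: ~209,715 rbe
print(area_rbe(32 * kb, RBE_DRAM_BIT))       # 32 KB of DRAM: ~26,214 rbe
```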




Cost of computers


The cost of the CPU chip is only a small fraction of the overall cost of a computer




CPU is 22% of total system cost


This fraction keeps changing as the costs of other components change

The cost of a system must be understood in relation to its selling price

the actual cost of the system is only 25% of the list price

the rest goes to marketing, profits, etc.

So, if the cost of the CPU increases by $1, the system cost increases by about $4.5

The list price will increase by about $18 (4 × $4.5, since the cost is only 25% of the list price)!

So, if we are considering adding new functionality, we need to worry about the impact of the
functionality on cost and price



And the increase should be justified by performance: speed, reliability/availability, or lower power




How do we define the performance of a processor?


Execution time for a program?




Wall clock or CPU time?


User CPU and System CPU Time

For now we will only use user CPU time = (Instruction count) × (CPI) / (Clock rate)

Note: cycle time = 1 / clock rate; a 1 GHz clock means 1 ns per cycle



CPI: Average number of Cycles Per Instruction.


How do we find this?

Consider for example that we collected average frequencies for various instruction types.




ALU operations occur 43% of time and take 1 clock cycle to execute

Load instructions occur 21% of the time and need 2 cycles

Store instructions occur 12% of the time and need 2 cycles

Branch instructions occur 24% of the time and need 2 cycles






How do we get CPI = average cycles per instruction?

CPI = Σ_i (CPI_i × C_i) / (Instruction count), where C_i is the number of executed instructions of type i

The average number of cycles per instruction = 0.43×1 + 0.21×2 + 0.12×2 + 0.24×2 = 1.57 cycles per instruction

Once we have the CPI and the clock speed, we can find the MIPS rating of a processor



If we are using a 1 GHz processor, the MIPS rating is given by

10^9 / (1.57 × 10^6) = 637 MIPS (Million Instructions Per Second)



Execution time = (instruction_count) × (1 / 637 MIPS)

= (instruction_count) × 1.57 × 10^-9 seconds



Remember, the clock speed (or frequency) is inversely related to clock period.
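The same CPI, MIPS, and per-instruction-time arithmetic as a short sketch:

```python
# (fraction of instructions, cycles per instruction) for each class
mix = {"alu": (0.43, 1), "load": (0.21, 2), "store": (0.12, 2), "branch": (0.24, 2)}

cpi = sum(frac * cycles for frac, cycles in mix.values())   # 1.57
clock_hz = 1e9                                              # 1 GHz
mips = clock_hz / (cpi * 1e6)                               # ~637 MIPS
time_per_instr = cpi / clock_hz                             # ~1.57e-9 s per instruction
print(cpi, mips, time_per_instr)
```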




Consider why the MIPS rating can be misleading.


Suppose we have a compiler that can optimize the program


The optimizing compiler can eliminate 50% of the arithmetic (ALU) instructions.



Now let us consider how the equations change. What is the CPI?

Consider for example that we collect average frequencies for various instruction types.

ALU operations occur 21.5% of time and take 1 clock cycle to execute

Load instructions occur 21% of the time and need 2 cycles

Store instructions occur 12% of the time and need 2 cycles

Branch instructions occur 24% of the time and need 2 cycles



But we need to scale these fractions since the total is only 78.5%

So CPI = [(21.5%)×1 + (21%)×2 + (12%)×2 + (24%)×2] / (78.5%)

= 1.73 cycles per instruction: a larger CPI!



MIPS = 10^9 / (1.73 × 10^6) = 578 MIPS



So the computer with the optimizing compiler has a lower MIPS rating!
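A sketch of both computations. It also shows why the rating is misleading: with half the ALU instructions gone, the optimized program has a lower MIPS rating yet takes less total time.

```python
clock_hz = 1e9

def cpi_and_mips(mix):
    # mix: list of (fraction of the ORIGINAL instruction count, cycles)
    cpi = sum(f * c for f, c in mix) / sum(f for f, c in mix)
    return cpi, clock_hz / (cpi * 1e6)

original  = [(0.43, 1), (0.21, 2), (0.12, 2), (0.24, 2)]    # fractions sum to 1.0
optimized = [(0.215, 1), (0.21, 2), (0.12, 2), (0.24, 2)]   # fractions sum to 0.785

cpi_o, mips_o = cpi_and_mips(original)    # 1.57 CPI, ~637 MIPS
cpi_n, mips_n = cpi_and_mips(optimized)   # ~1.73 CPI, ~579 MIPS (578 if CPI is rounded first)

# Execution time relative to the original instruction count IC:
#   original:  IC * 1.0   * cpi_o / clock  -> 1.57 * IC / clock
#   optimized: IC * 0.785 * cpi_n / clock  -> ~1.36 * IC / clock  (faster!)
print(mips_o, mips_n, 0.785 * cpi_n / cpi_o)   # last value ~0.86: about 14% less time
```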




Another Example

Consider two processors with different ways of implementing conditional branches



CPU-A: needs two instructions, a compare and a branch (e.g., SLT R3, R1, R2; BNZ R3, loop)



CPU-B: a single instruction to compare and branch (e.g., BLT R1, R2, Loop)



Branches take 2 cycles and all other instructions take 1 cycle

Frequency of branches = 20%



CPU-A's clock is 25% faster (simpler instructions), so CPU-B's cycle time is 1.25 × CPU-A's

Time on CPU-A = (Instr_Count) × {0.80×1 + 0.20×(2+1)} × (Cycle_Time)

= (Instr_Count) × 1.4 × (Cycle_Time)

Time on CPU-B = (Instr_Count) × (0.8×1 + 0.2×2) × (1.25 × Cycle_Time)

= (Instr_Count) × 1.5 × (Cycle_Time)

CPU-A is faster even though it executes more instructions!
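The same comparison as a sketch, with the instruction count and CPU-A's cycle time normalized to 1:

```python
ic, cycle_time = 1.0, 1.0          # normalized instruction count and CPU-A cycle time
branch_frac = 0.20

# CPU-A: branch costs 2 cycles plus 1 cycle for the separate compare instruction
time_a = ic * (0.80 * 1 + branch_frac * (2 + 1)) * cycle_time        # 1.4
# CPU-B: single compare-and-branch (2 cycles), but a 25% slower clock
time_b = ic * (0.80 * 1 + branch_frac * 2) * (1.25 * cycle_time)     # 1.5
print(time_a, time_b)   # CPU-A wins despite executing more instructions
```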




Many Complex Interactions During Execution.

Pipeline bubbles (stalls, or lost cycles) due to branch instructions


Consider, for example, that on average 50% of all branches are taken and cause 3-cycle stalls (lost cycles)


What is the CPI for branch instructions?


If not taken, CPI = 1


If taken, CPI = 4


Effective CPI for branches = 0.5*1+0.5*4 =2.5

If branches are 20%, total CPI = 80%*1 + 20%*2.5 = 1.3

Cache Misses


Affect only load and store instructions


If no cache miss, say CPI =2


On a cache miss, we may have a CPI of 50

A 5% miss rate leads to 0.95×2 + 0.05×50 = 4.4



Remember the instruction frequencies from a previous example




ALU operations occur 43% of time and take 1 clock cycle to execute

Load instructions occur 21% of the time and need 2 cycles without cache misses

Store instructions occur 12% of the time and need 2 cycles without cache misses

Branch instructions occur 24% of the time and need 2 cycles

But if we have 21% loads and 12% stores, each averaging 4.4 cycles once cache misses are included,

the new CPI = 33%×4.4 + 43%×1 + 24%×2 ≈ 2.36 CPI

compared to 1.57 CPI with no cache misses
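A sketch of the three CPI calculations above (branch stalls, loads/stores with misses, and the combined mix):

```python
# Branches: 50% taken with a 3-cycle stall (CPI 4), 50% not taken (CPI 1)
branch_cpi = 0.5 * 1 + 0.5 * 4                      # 2.5
total_cpi_branches = 0.8 * 1 + 0.2 * branch_cpi     # 1.3

# Loads/stores: CPI 2 on a hit, 50 on a miss, 5% miss rate
mem_cpi = 0.95 * 2 + 0.05 * 50                      # 4.4

# Full mix: 43% ALU (1 cycle), 33% loads+stores (mem_cpi), 24% branches (2 cycles)
mix_cpi = 0.43 * 1 + 0.33 * mem_cpi + 0.24 * 2      # ~2.36 (vs 1.57 with no misses)
print(branch_cpi, total_cpi_branches, mem_cpi, mix_cpi)
```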

How to report performance data?


Execution time for one program


Execution times for all programs


Average execution time across all programs


Weighted average etc.

Assuming n programs:

Arithmetic Mean = (1/n) × Σ_{i=1..n} (Time)_i



Weighted Arithmetic Mean = Σ_{i=1..n} (Weight)_i × (Time)_i

Harmonic Mean = n / Σ_{i=1..n} (1 / (Time)_i)

Let us look at an example. Here we are comparing 3 different computers using 2 programs.

                 Computer A   Computer B   Computer C
Pgm P1                    1           10           20
Pgm P2                 1000          100           20
Total                  1001          110           40

Let us find the weighted arithmetic-average execution times; we will use 3 different sets of weights


W1: P1=50% P2=50%


W2: P1= 90.9%, P2= 9.1%


W3: P1= 99.9%; P2=0.1%




                 Computer A   Computer B   Computer C
Pgm P1                    1           10           20
Pgm P2                 1000          100           20
Total                  1001          110           40
Avg with W1           500.5           55           20
Avg with W2           91.91        18.19           20
Avg with W3               2        10.09           20

So which computer is best?




If we use W1, C is best; with W2, B is best; and with W3, A is best
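A sketch that reproduces the table of weighted averages:

```python
times = {"A": {"P1": 1, "P2": 1000},
         "B": {"P1": 10, "P2": 100},
         "C": {"P1": 20, "P2": 20}}

weights = {"W1": {"P1": 0.5,   "P2": 0.5},
           "W2": {"P1": 0.909, "P2": 0.091},
           "W3": {"P1": 0.999, "P2": 0.001}}

for wname, w in weights.items():
    avg = {m: round(sum(w[p] * t[p] for p in t), 2) for m, t in times.items()}
    print(wname, avg)
# W1: A=500.5, B=55.0,  C=20.0  (C best)
# W2: A=91.91, B=18.19, C=20.0  (B best)
# W3: A=2.0,   B=10.09, C=20.0  (A best)
```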



Can we think of a different way of computing averages?

Relative performance: for each program, use a relative execution time, compared to a standard (reference) computer.

The relative execution times can then be used to compute an arithmetic (or weighted) mean.




We can also compute the Geometric Mean:

Geometric Mean = [ Π_{i=1..n} (Relative execution time)_i ]^(1/n)

Let us look at our example using geometric means. The relative performance of the 3 machines remains the same.

Normalized to A:
                 Computer A   Computer B   Computer C
Pgm P1                    1           10           20
Pgm P2                    1          0.1         0.02
Arithmetic Mean           1         5.05        10.01
Geometric Mean            1            1         0.63

Normalized to B:
                 Computer A   Computer B   Computer C
Pgm P1                  0.1            1            2
Pgm P2                   10            1          0.2
Arithmetic Mean        5.05            1          1.1
Geometric Mean            1            1         0.63

Normalized to C:
                 Computer A   Computer B   Computer C
Pgm P1                 0.05          0.5            1
Pgm P2                   50            5            1
Arithmetic Mean       25.03         2.75            1
Geometric Mean         1.58         1.58            1

With the geometric mean, C is always the best, no matter which machine we normalize to
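A sketch showing why the geometric-mean ranking does not depend on the reference machine:

```python
import math

times = {"A": [1, 1000], "B": [10, 100], "C": [20, 20]}

def geo_mean(xs):
    return math.prod(xs) ** (1 / len(xs))

for ref in "ABC":
    gm = {m: geo_mean([t / r for t, r in zip(times[m], times[ref])])
          for m in times}
    print(f"normalized to {ref}:", {m: round(v, 2) for m, v in gm.items()})
# The relative ordering is identical in all three cases: C is best, with a
# geometric mean 0.63x that of A and B, whichever machine we normalize to.
```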




Another example. See the table on page 43


Here we are looking at geometric means for the Opteron and the Itanium

A Sun Ultra 5 (UltraSPARC) is used as the reference computer

The Opteron runs about 30% slower (than the Itanium)





What programs to use in evaluating performance?

The programs that will be run in the field

Benchmark programs

Real programs that are common in an application domain




e.g., SPEC benchmarks (SPEC CPU: integer and floating point)




SPECWeb, SPECvirt




bio-informatics




High-performance computing (SPEC OMP)


Program kernels:




e.g., embedded kernels (EEMBC)

NAS Parallel Benchmarks, Livermore Loops


Synthetic program mixes




How to collect performance data using benchmarks?


Actual Measurements and Simulations

If the architecture already exists, run programs and collect data


Need to be careful in collecting data


Instrumentation may skew data


Performance Registers


Software profiling techniques

Or develop simulations.


Detailed simulations


Trace driven simulations


Monte Carlo simulations
