
Centre for Computer Technology

ICT123 Computer Architecture


Week 11

Parallel Architectures

Contents at a Glance
Review of week 10
Cache Coherence Protocols: Snoopy Protocol, MESI Protocol
Execution Time (MIPS rate)
Latency vs. throughput
Limitations


Flynn's Taxonomy of Parallel Processor Architectures


Taxonomy of Parallel Computers

Flynn's taxonomy of parallel computers.


(Tanenbaum, Structured Computer Organization, Fifth Edition, (c) 2006 Pearson Education)

Parallel Organizations - SISD


Parallel Organizations - SIMD


Parallel Organizations - MISD


Parallel Organizations - MIMD Shared Memory


Parallel Organizations - MIMD Distributed Memory


Tightly Coupled MP Systems

[Figure: tightly coupled multiprocessor. Processors P1 and P2 share a common memory and the I/O modules I-O 1 and I-O 2.]


Loosely Coupled MP Systems


[Figure: loosely coupled multiprocessor. Each processor (P1, P2) has its own memory (M1, M2) and I/O module (I-O 1, I-O 2); the processors communicate over a communications link.]

Multiprogramming and Multiprocessing


Message-Passing Architecture
[Figure: message-passing architecture. Each processor has its own cache and private memory; the processors exchange messages over an interconnection network.]

(CS 284a Lecture, Tuesday, 7 October 1997, John Thornley)


Shared-Memory Architecture
[Figure: shared-memory architecture. Processors 1..N, each with its own cache, access shared memory modules 1..M through an interconnection network.]

(CS 284a Lecture, Tuesday, 7 October 1997, John Thornley)



Parallel Computer Architectures

(a) On-chip parallelism. (b) A coprocessor. (c) A multiprocessor. (d) A multicomputer. (e) A grid.
(Tanenbaum, Structured Computer Organization, Fifth Edition, (c) 2006 Pearson Education)

Shared-Memory Architecture: Cache Coherence(1)


Problem: multiple copies of the same data may reside in different caches and in main memory.
If the processors are allowed to update their own copies (in the cache) freely, the result may be an inconsistent view of memory.
This is the cache coherence problem.
Hence the multiple copies of the data (in the caches) have to be kept identical.

Shared-Memory Architecture: Cache Coherence(2)

Write back
Write operations are made only to the cache; main memory is updated when the corresponding cache line is flushed. This policy can lead to inconsistency.

Write through
All write operations are made to main memory as well as to the cache. This can also cause problems unless the caches monitor memory traffic.
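To make the two policies concrete, here is a minimal sketch, assuming a dict-backed main memory and word-granularity writes (the class and method names are invented for this illustration; real caches operate on whole lines):

class WriteThroughCache:
    def __init__(self):
        self.lines = {}              # address -> cached value

    def write(self, memory, addr, value):
        self.lines[addr] = value     # update the cache line ...
        memory[addr] = value         # ... and main memory on every write


class WriteBackCache:
    def __init__(self):
        self.lines = {}              # address -> cached value
        self.dirty = set()           # lines modified since the last flush

    def write(self, memory, addr, value):
        self.lines[addr] = value     # update only the cache line
        self.dirty.add(addr)         # main memory is now stale

    def flush(self, memory):
        for addr in self.dirty:      # memory catches up only on a flush
            memory[addr] = self.lines[addr]
        self.dirty.clear()

Between a write and the corresponding flush, the write-back cache and main memory disagree: this is exactly the inconsistency window the slide describes.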

Cache Coherence Protocols

The objective is to keep recently used local variables in the cache and let them reside there through numerous reads and writes.
The protocol must maintain consistency of shared variables that sit in multiple caches at the same time.
Cache coherence approaches:
Software solutions
Hardware solutions



Software Solutions (1)


The compiler and operating system deal with the problem.
The overhead of detecting potential problems is transferred from run time to compile time.
Design complexity is transferred from hardware to software.
Compiler-based mechanisms determine which data items may become unsafe for caching and mark them.
The operating system or the hardware then prevents those items from being cached.

Software Solutions (2)

However, software tends to make conservative decisions, which leads to inefficient cache utilization (e.g. preventing any shared data variable from being cached).
A more efficient approach is to analyze the code to determine safe periods for caching shared variables.
The compiler then inserts instructions into the generated code to enforce cache coherence during the critical periods.

Hardware Solution

Generally referred to as cache coherence protocols.
Dynamic recognition of potential problems at run time.
More efficient use of the cache, since the problem is dealt with only when it actually arises.
Transparent to the programmer and the compiler.
Two categories:
Directory protocols
Snoopy protocols


Directory Protocols (1)

Collect and maintain information about copies of data in the caches.
A centralized controller, part of the main memory controller, works with a directory stored in main memory.
When a request is made, the centralized controller checks the directory and issues the necessary commands for data transfer between memory and a cache, or between caches.
The central controller keeps the state information up to date.

Directory Protocols (2)


Creates a central bottleneck.
Overhead of communication between the various cache controllers and the central controller.
Effective in large-scale systems that involve multiple buses or some other complex interconnection scheme.


Snoopy Protocols (1)


Distribute the cache coherence responsibility among the cache controllers.
A cache recognizes that a line it holds is shared.
Updates are announced to the other caches by a broadcast mechanism.
Each cache controller snoops on the network to observe these broadcast notifications and reacts accordingly.


Snoopy Protocols (2)


Suited to bus-based multiprocessors.
Increases bus traffic.
Two approaches:
Write invalidate: multiple readers, but only one writer at a time.
Write update: multiple writers as well as multiple readers.

Write Invalidate
Multiple readers, one writer.
When a write is required, all other cached copies of the line are invalidated.
The writing processor then has exclusive (cheap) access until the line is required by another processor.
Used in Pentium II and PowerPC systems.
The state of every line is marked as modified, exclusive, shared or invalid (MESI).


Write Update

Multiple readers and writers.
The updated word is distributed to all other processors.
Some systems use an adaptive mixture of both approaches.


MESI Protocol

Modified: the line in the cache has been modified (it differs from main memory) and is present only in this cache.
Exclusive: the line in the cache is the same as in main memory and is not present in any other cache.
Shared: the line in the cache is the same as in main memory and may be present in another cache.
Invalid: the line in the cache does not contain valid data.

MESI Cache Line States


MESI State Transition Diagram (1)


MESI State Transition Diagram (2)

Read Miss
The processor initiates a memory read to fetch the line containing the missing address.
The processor raises a signal that alerts all other units to snoop the transaction.
Possible outcomes:
Invalid to shared (another cache holds the line)
Invalid to exclusive (no other cache holds the line)
Modified to shared (in the remote cache that held the modified copy)



MESI State Transition Diagram (3)

Read Hit
The processor simply reads the required item.
The state remains modified, shared or exclusive.

Write Miss
The processor issues a read with intent to modify (RWITM) to fetch the line containing the missing address.
The line is then loaded and marked modified.


MESI State Transition Diagram (4)

Write Hit
The effect depends on the current state of the line in the local cache.
Possible outcomes:
Shared to modified
Exclusive to modified
Modified remains modified
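The transitions above can be summarized as a small state machine. The sketch below is illustrative only: the function names are invented, and bus signalling plus the write-back of dirty data are omitted.

from enum import Enum

class MESI(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

def on_read_miss(other_cache_has_line):
    """Read miss: the fetched line enters the local cache as shared
    if another cache holds it, exclusive otherwise."""
    return MESI.SHARED if other_cache_has_line else MESI.EXCLUSIVE

def on_write_hit(state):
    """Write hit: shared/exclusive lines become modified;
    a modified line stays modified."""
    if state in (MESI.SHARED, MESI.EXCLUSIVE):
        return MESI.MODIFIED
    return state

def on_snooped_write(state):
    """Another cache announced a write to this line (write invalidate):
    our copy becomes invalid."""
    return MESI.INVALID

def on_snooped_read(state):
    """Another cache read this line: a modified or exclusive copy
    drops to shared."""
    if state in (MESI.MODIFIED, MESI.EXCLUSIVE):
        return MESI.SHARED
    return state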


Centre for Computer Technology

Measuring Processor Performance

Performance Marches On ...

But what is performance?



Time versus throughput (1)


Vehicle     Time to Bay Area   Speed     Passengers   Throughput (passenger-miles/h)
Ferrari     3.1 hours          160 mph    2            320
Greyhound   7.7 hours           65 mph   60           3900

Time to do the task from start to finish: execution time, response time, latency.
Tasks per unit time (data movement): throughput, bandwidth.
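The throughput column is just speed times passengers; a quick check of the table's numbers in Python:

def throughput(speed_mph, passengers):
    return speed_mph * passengers      # passenger-miles per hour

print(throughput(160, 2))    # Ferrari:   320  (low latency, low throughput)
print(throughput(65, 60))    # Greyhound: 3900 (high latency, high throughput)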

Time versus throughput (2)


Time is measured in time units per job.
Throughput is measured in jobs per time unit.
But time = 1/throughput may be false:
It takes 4 months to grow a tomato. Can you only grow 3 tomatoes a year?
If you run only one job at a time, time = 1/throughput.


How do you measure Execution Time?


> time foo
... foo's results ...
90.7u 12.9s 2:39 65%
>
(90.7u = user CPU seconds, 12.9s = kernel CPU seconds, 2:39 = wallclock time, 65% = CPU share of wallclock)

user CPU time? (time the CPU spends running your code)
total CPU time (user + kernel)? (includes operating-system code)
wallclock time? (total elapsed time, including time spent waiting for I/O, other users, ...)
The answer depends ...
For measuring processor speed, we can use total CPU time.
If there is no I/O and there are no interrupts, wallclock time may be better:
more precise (microseconds rather than 1/100 second)
can measure individual sections of code
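For comparison, a minimal Python sketch that separates the two measurements (the workload here is an arbitrary placeholder):

import time

t_wall = time.perf_counter()   # wallclock: includes I/O waits, other users
t_cpu  = time.process_time()   # user + kernel CPU time of this process only

total = sum(i * i for i in range(5_000_000))   # the code being measured

print(f"wallclock: {time.perf_counter() - t_wall:.3f} s")
print(f"CPU time:  {time.process_time() - t_cpu:.3f} s")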

Performance

For performance, larger should be better; time is backwards, since a larger execution time is worse.

CPU performance = 1 / total CPU time
System performance = 1 / wallclock time

These terms only make sense if you know what program is being measured, and if the CPU or system works on only one program at a time.
Performance's units, inverse seconds, can be awkward: we can answer "What was the performance?" with "It took 15 seconds", or e.g. "The performance on Linpack was 200 MFLOP/S".
This may all change in the next few years!

A brief study of time


CPU time = CPU cycles executed × cycle time

Every conventional processor has a clock with a fixed cycle time (equivalently, a fixed clock rate):
rate is often measured in MHz = millions of cycles/second
time is often measured in ns (nanoseconds)
X MHz corresponds to 1000/X ns (e.g. 500 MHz corresponds to a 2 ns clock)

CPU cycles = instructions executed × CPI
(CPI = average clock cycles per instruction)

Putting it all together


CPU execution time = instruction count × CPI × clock cycle time
(instructions/program) × (cycles/instruction) × (seconds/cycle) = seconds

One of P&H's "big pictures".

Note: CPI is somewhat artificial (it is computed from the other numbers using this formula), but it is an intuitive and useful concept.
Note: use the dynamic instruction count (number of instructions executed), not the static count (number of instructions in the compiled code).
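As a worked example, here is the formula in runnable form (the instruction count and CPI below are made up for illustration):

def cpu_execution_time(instructions, cpi, clock_mhz):
    """seconds = instructions x (cycles/instruction) x (seconds/cycle)"""
    cycle_time = 1.0 / (clock_mhz * 1e6)    # 500 MHz -> 2 ns per cycle
    return instructions * cpi * cycle_time

# e.g. 10 billion dynamic instructions, CPI 1.5, 500 MHz clock:
print(cpu_execution_time(10e9, 1.5, 500))   # ~30 seconds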

Explaining performance variation


CPU execution time = instruction count × CPI × clock cycle time

The formula explains performance variation between:
same machine, different programs
same program, different machines with the same ISA
same program, different ISAs

Comparing performance (1)


The fundamental question: will computer A run program P faster than computer B?

Compare clock rates?
Will a 1.7 GHz PC be faster than an 867 MHz Mac? Not necessarily: CPI or instruction count may differ.
(see http://www.apple.com/g4/myth, Photoshop benchmark)

Peak MIPS rate? (MIPS = millions of instructions / second)
The PowerPC G4 can execute 4 instructions/cycle (CPI = 1/4) at an 867 MHz clock, for 3468 MIPS peak.
But it doesn't necessarily execute that quickly.

Comparing performance (2)


The fundamental question: will computer A run program P faster than computer B?

Compare actual MIPS rate on program P?
MIPS = 1 / (CPI × cycle time in microseconds)
If the instruction counts are the same, this is OK, e.g. when comparing two implementations of the same ISA.
Otherwise, the actual MIPS rate doesn't answer the question.
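A small sketch of that comparison. The two machines and their CPIs below are hypothetical, and the comparison is only meaningful because the slide's precondition (equal instruction counts, i.e. the same ISA) is assumed to hold:

def actual_mips(clock_mhz, cpi):
    """MIPS = 1 / (CPI x cycle time in microseconds) = clock MHz / CPI."""
    return clock_mhz / cpi

# Same ISA, same program P, hypothetical measured CPIs:
print(actual_mips(1700, 2.0))   # machine A:  850 MIPS
print(actual_mips(867, 0.8))    # machine B: ~1084 MIPS, despite a lower clock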

Comparing performance (3)


The fundamental question: will computer A run program P faster than computer B?

Relative MIPS?
Defined as: how much faster is this computer than a VAX 11/780 (on some benchmark programs)?
If the benchmark is similar to P, this may give the right answer.

What about MFLOP/S?


Millions of FLoating-point OPs per Second (often written MFLOPS)
Peak rate = maximum float ops per cycle / cycle time (in microseconds)
Peak MFLOP/S (like peak MIPS) is useless.
Normalized MFLOP/S uses conventions (e.g. a divide counts as three float ops) so that the flop count of a program is machine-independent.
OK for floating-point-intensive programs.
Still depends on the program: a better MFLOP/S rate on program P doesn't guarantee better performance on program Q.

Relative Performance

"Computer X is r times faster than Y" means
Perf(X) / Perf(Y) = r (i.e. Time(Y) / Time(X) = r)
Note that which quantity goes on top is swapped when you switch between performances and times.

Comparing speeds ...

"times faster than" (or "times as fast as") means there is a multiplicative factor relating the quantities:
X was 3 times faster than Y: speed(X) = 3 × speed(Y)

"percent faster than" implies an additive relationship:
X was 25% faster than Y: speed(X) = (1 + 25/100) × speed(Y)

"percent slower than" implies subtraction:
X was 5% slower than Y: speed(X) = (1 - 5/100) × speed(Y)
(100% slower means it doesn't move at all!)

"times slower than" or "times as slow as" is awkward:
X was 3 times slower than Y means speed(X) = 1/3 × speed(Y)
It hints at having a measure of slowness; I'll mostly avoid using it.

Percentages aren't intuitive!

If X is p% faster than Y, is Y p% slower than X?
X is p% faster: speed(X) = (1 + p/100) × speed(Y), so speed(Y) = 1/(1 + p/100) × speed(X)
Y is p% slower: speed(Y) = (1 - p/100) × speed(X)
No! 1/(1 + p/100) is not (1 - p/100), unless p = 0.

Suppose X is p% faster than Y and Y is q% faster than Z. Is X (p+q)% faster than Z??



Times faster is easier!


X is r times faster than Y: speed(X) = r × speed(Y), i.e. speed(Y) = 1/r × speed(X), so Y is r times slower than X.
X is r times faster than Y, and Y is s times faster than Z: speed(X) = r × speed(Y) = rs × speed(Z), so X is rs times faster than Z.
Advice: convert "% faster" to "times faster", do the calculation, and convert back if needed. Example: change "25% faster" to "5/4 times faster".
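Both pitfalls from the previous slide can be checked numerically (a small illustrative script):

def pct_faster_to_ratio(p):
    """Convert 'p% faster' into a multiplicative speed ratio."""
    return 1 + p / 100

r = pct_faster_to_ratio(25)       # 25% faster == 5/4 times faster

# Is Y then 25% slower than X?  No:
print(1 / r)                      # 0.8  -> Y is 20% slower than X
print(1 - 25 / 100)               # 0.75 -> what '25% slower' would mean

# Chaining: X 50% faster than Y, Y 50% faster than Z:
print(pct_faster_to_ratio(50) ** 2)   # 2.25 -> X is 125% faster than Z, not 100%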

How do you judge computer performance?


Clock speed? No, unless the ISA is the same.
Peak MIPS rate? No.
Relative MIPS, normalized MFLOPS? Sometimes (if the program tested is like yours).
How fast does it execute MY program? The best method!

Benchmarks

It's hard to convince manufacturers to run your program (unless you're a BIG customer).
A benchmark is a set of programs that are representative of a class of problems.
Microbenchmarks measure one feature of the system, e.g. memory accesses or communication speed.
Kernel benchmarks use the most compute-intensive part of applications, e.g. Linpack and the NAS kernel benchmarks (for supercomputers).
Full applications: SPEC = System Performance Evaluation Cooperative (int and float, for Unix workstations); other suites exist for databases, web servers, graphics, ...

Improving Latency

Latency is (ultimately) limited by physics, e.g. the speed of light.
Some improvements are incremental:
smaller transistors shorten distances
to reduce disk access time, make disks rotate faster
Improvements often require new technology:
replace the stagecoach by the pony express or the telegraph
replace DRAM by SRAM
Once upon a time, bipolar and GaAs were much faster than CMOS, but incremental improvements to CMOS have triumphed.

Improving Bandwidth

You can improve bandwidth or throughput by throwing money at the problem: use wider buses, more disks, multiple processors, more functional units ...

Two basic strategies:
Parallelism: duplicate resources. Run multiple tasks on separate hardware.
Pipelining: break the process up into multiple stages and build separate resources for each stage. This reduces the time needed for a single stage, so a new task can start down the pipe every (shorter) timestep.

Pipelining

Old-fashioned washing machine:
Washing/rinsing and spinning are done in the same tub.
Takes 15 minutes (wash/rinse) + 5 minutes (spin).
Time for 1 load: 20 minutes. Time for 10 loads: 200 minutes.

Modern washing machine:
A tub for washing & rinsing (15 minutes) and a separate spinner (10 minutes).
Time for 1 load: 25 minutes. Time for 10 loads: 160 minutes
(25 minutes for the first load, then 15 minutes for each load thereafter).
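A quick sanity check of those numbers (illustrative Python; the stage lengths are taken from the slide):

def old_machine(loads, wash_spin=20):
    """Wash/rinse (15 min) and spin (5 min) in one tub: 20 min per load."""
    return loads * wash_spin

def pipelined(loads, tub=15, spinner=10):
    """Separate tub and spinner: while load i spins, load i+1 washes.
    After the first load, one load finishes every max(tub, spinner) minutes."""
    return (tub + spinner) + (loads - 1) * max(tub, spinner)

print(old_machine(1), old_machine(10))   # 20, 200 minutes
print(pipelined(1), pipelined(10))       # 25, 160 minutes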

Parallelism vs pipelining
Both improve throughput (bandwidth).
Automobiles: more plants vs. an assembly line.
I/O bandwidth: wider buses (e.g. a parallel port) vs. pushing bits onto the bus faster (a serial port).
Memory-to-processor: wider buses vs. a faster transfer rate.
CPU speed:
superscalar: the processor has multiple functional units, so it can execute more than one instruction per cycle
superpipelining: using more steps than the classical 5-stage pipeline
recent microprocessors use both techniques

Latency vs Bandwidth of DRAM


DRAM is much slower than SRAM: perhaps 30 ns vs. 1 ns access time.
But we also hear "SDRAM is much faster than ordinary DRAM", e.g. "RDRAM (from Rambus) is 5 times faster" ...
So are S(R)DRAMs almost as good as SRAM?
(Hint: the faster DRAM variants mainly improve bandwidth, not latency.)

What are limits?

Physics: the speed of light, the size of atoms, heat generated (speed requires energy loss), the capacity of the electromagnetic spectrum (for wireless), ...
Limits with current technology: the size of magnetic domains, chip size (due to defects), lithography, pin count.
New technologies on the horizon: quantum computers, molecular computers, superconductors, optical computers, holographic storage, ...
Fallacy: improvements will stop.
Pitfall: trying to predict more than 5 years into the future.

Summary

The objective of a cache coherence protocol is to keep recently used local variables in the cache and let them reside there through numerous reads and writes.
Processor performance can be measured by the rate at which the processor executes instructions: execution time = instruction count × CPI × cycle time.
A benchmark is a set of programs that are representative of a class of problems.

Reference

Stallings, William, 2003, Computer Organization & Architecture: Designing for Performance, Sixth Edition, Pearson Education, Inc., ISBN 0-13-049307-4.
Mano, M. Morris, Computer System Architecture, Third Edition, Prentice Hall.
Carter, Larry, Measuring Performance, CSE 141, UCSD, Winter 2002.
Tanenbaum, Andrew S., Structured Computer Organization, Fifth Edition, 2006, Pearson Education, Inc., ISBN 0-13-148521-0.
Thornley, John, CS 284a Lecture, Tuesday, 7 October 1997.

Further Reading
Manufacturers' websites
Relevant Special Interest Groups [SIG]
Articles in magazines
IEEE Computer Society Task Force on Cluster Computing website

