
Centre for Computer Technology

ICT123 Computer Architecture


Week 11

Parallel Architectures

Contents at a Glance
Review of week 10
Cache Coherence Protocols: Snoopy Protocol, MESI Protocol
Execution Time (MIPS rate)
Latency vs. throughput
Limitations


Flynn's Taxonomy of Parallel Processor Architectures


Taxonomy of Parallel Computers

Flynn's taxonomy of parallel computers.


(Tanenbaum, Structured Computer Organization, Fifth Edition, (c) 2006 Pearson Education)

Parallel Organizations - SISD


Parallel Organizations - SIMD


Parallel Organizations - MISD


Parallel Organizations - MIMD Shared Memory


Parallel Organizations - MIMD Distributed Memory


Tightly Coupled MP Systems

[Figure: tightly coupled multiprocessor. Processors P1 and P2 share a common memory and the I/O modules I-O 1 and I-O 2.]


Loosely Coupled MP Systems


[Figure: loosely coupled multiprocessor. Each processor (P1, P2) has its own memory (M1, M2) and I/O module (I-O 1, I-O 2); the processors communicate over a communications link.]

Multiprogramming and Multiprocessing


Message-Passing Architecture
[Figure: message-passing architecture. Each processor has its own cache and private memory; the processors exchange messages over an interconnection network.]

(CS 284a Lecture, Tuesday, 7 October 1997, John Thornley)


Shared-Memory Architecture
[Figure: shared-memory architecture. Processors 1..N, each with its own cache, access shared memory modules 1..M through an interconnection network.]

(CS 284a Lecture, Tuesday, 7 October 1997, John Thornley)



Parallel Computer Architectures

(a) On-chip parallelism. (b) A coprocessor. (c) A multiprocessor. (d) A multicomputer. (e) A grid.
(Tanenbaum, Structured Computer Organization, Fifth Edition, (c) 2006 Pearson Education)

Shared-Memory Architecture: Cache Coherence(1)


Problem: multiple copies of the same data may reside in different caches and in main memory.
If the processors are allowed to update their own copies (in the cache) freely, the result may be an inconsistent view of memory.
This is the cache coherence problem.
Hence the multiple copies of the data (in the caches) have to be kept identical.

Shared-Memory Architecture: Cache Coherence(2)

Write back
Write operations are made only to the cache; main memory is updated when the corresponding cache line is flushed. This policy can lead to inconsistency.

Write through
All write operations are made to main memory as well as to the cache. This can also cause problems unless the caches monitor memory traffic.
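To make the two policies concrete, here is a minimal sketch, assuming a dict-backed main memory and word-granularity writes (the class and method names are invented for this illustration; real caches operate on whole lines):

class WriteThroughCache:
    def __init__(self):
        self.lines = {}              # address -> cached value

    def write(self, memory, addr, value):
        self.lines[addr] = value     # update the cache line ...
        memory[addr] = value         # ... and main memory on every write


class WriteBackCache:
    def __init__(self):
        self.lines = {}              # address -> cached value
        self.dirty = set()           # lines modified since the last flush

    def write(self, memory, addr, value):
        self.lines[addr] = value     # update only the cache line
        self.dirty.add(addr)         # main memory is now stale

    def flush(self, memory):
        for addr in self.dirty:      # memory catches up only on a flush
            memory[addr] = self.lines[addr]
        self.dirty.clear()

Between a write and the corresponding flush, the write-back cache and main memory disagree: this is exactly the inconsistency window the slide describes.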

Cache Coherence Protocols

The objective is to keep recently used local variables in the cache and let them reside there through numerous reads and writes.
The protocol must maintain consistency of shared variables that sit in multiple caches at the same time.
Cache coherence approaches:
Software solutions
Hardware solutions



Software Solutions (1)


The compiler and operating system deal with the problem.
The overhead of detecting potential problems is transferred from run time to compile time.
Design complexity is transferred from hardware to software.
Compiler-based mechanisms determine which data items may become unsafe for caching and mark them.
The operating system or the hardware then prevents those items from being cached.

Software Solutions (2)

However, software tends to make conservative decisions, which leads to inefficient cache utilization (e.g. preventing any shared data variable from being cached).
A more efficient approach is to analyze the code to determine safe periods for caching shared variables.
The compiler then inserts instructions into the generated code to enforce cache coherence during the critical periods.

Hardware Solution

Generally referred to as cache coherence protocols.
Dynamic recognition of potential problems at run time.
More efficient use of the cache, since the problem is dealt with only when it actually arises.
Transparent to the programmer and the compiler.
Two categories:
Directory protocols
Snoopy protocols


Directory Protocols (1)

Collect and maintain information about copies of data in the caches.
A centralized controller, part of the main memory controller, works with a directory stored in main memory.
When a request is made, the centralized controller checks the directory and issues the necessary commands for data transfer between memory and a cache, or between caches.
The central controller keeps the state information up to date.

Directory Protocols (2)


Creates a central bottleneck.
Overhead of communication between the various cache controllers and the central controller.
Effective in large-scale systems that involve multiple buses or some other complex interconnection scheme.


Snoopy Protocols (1)


Distribute the cache coherence responsibility among the cache controllers.
A cache recognizes that a line it holds is shared.
Updates are announced to the other caches by a broadcast mechanism.
Each cache controller snoops on the network to observe these broadcast notifications and reacts accordingly.


Snoopy Protocols (2)


Suited to bus-based multiprocessors.
Increases bus traffic.
Two approaches:
Write invalidate: multiple readers, but only one writer at a time.
Write update: multiple writers as well as multiple readers.

Write Invalidate
Multiple readers, one writer.
When a write is required, all other cached copies of the line are invalidated.
The writing processor then has exclusive (cheap) access until the line is required by another processor.
Used in Pentium II and PowerPC systems.
The state of every line is marked as modified, exclusive, shared or invalid (MESI).


Write Update

Multiple readers and writers.
The updated word is distributed to all other processors.
Some systems use an adaptive mixture of both approaches.


MESI Protocol

Modified: the line in the cache has been modified (it differs from main memory) and is present only in this cache.
Exclusive: the line in the cache is the same as in main memory and is not present in any other cache.
Shared: the line in the cache is the same as in main memory and may be present in another cache.
Invalid: the line in the cache does not contain valid data.

MESI Cache Line States


MESI State Transition Diagram (1)


MESI State Transition Diagram (2)

Read Miss
The processor initiates a memory read to fetch the line containing the missing address.
The processor raises a signal that alerts all other units to snoop the transaction.
Possible outcomes:
Invalid to shared (another cache holds the line)
Invalid to exclusive (no other cache holds the line)
Modified to shared (in the remote cache that held the modified copy)



MESI State Transition Diagram (3)

Read Hit
The processor simply reads the required item.
The state remains modified, shared or exclusive.

Write Miss
The processor issues a read with intent to modify (RWITM) to fetch the line containing the missing address.
The line is then loaded and marked modified.


MESI State Transition Diagram (4)

Write Hit
The effect depends on the current state of the line in the local cache.
Possible outcomes:
Shared to modified
Exclusive to modified
Modified remains modified
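The transitions above can be summarized as a small state machine. The sketch below is illustrative only: the function names are invented, and bus signalling plus the write-back of dirty data are omitted.

from enum import Enum

class MESI(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

def on_read_miss(other_cache_has_line):
    """Read miss: the fetched line enters the local cache as shared
    if another cache holds it, exclusive otherwise."""
    return MESI.SHARED if other_cache_has_line else MESI.EXCLUSIVE

def on_write_hit(state):
    """Write hit: shared/exclusive lines become modified;
    a modified line stays modified."""
    if state in (MESI.SHARED, MESI.EXCLUSIVE):
        return MESI.MODIFIED
    return state

def on_snooped_write(state):
    """Another cache announced a write to this line (write invalidate):
    our copy becomes invalid."""
    return MESI.INVALID

def on_snooped_read(state):
    """Another cache read this line: a modified or exclusive copy
    drops to shared."""
    if state in (MESI.MODIFIED, MESI.EXCLUSIVE):
        return MESI.SHARED
    return state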


Centre for Computer Technology

Measuring Processor Performance

Performance Marches On ...

But what is performance?



Time versus throughput (1)


Vehicle     Time to Bay Area   Speed     Passengers   Throughput (passenger-miles/h)
Ferrari     3.1 hours          160 mph    2            320
Greyhound   7.7 hours           65 mph   60           3900

Time to do the task from start to finish: execution time, response time, latency.
Tasks per unit time (data movement): throughput, bandwidth.
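The throughput column is just speed times passengers; a quick check of the table's numbers in Python:

def throughput(speed_mph, passengers):
    return speed_mph * passengers      # passenger-miles per hour

print(throughput(160, 2))    # Ferrari:   320  (low latency, low throughput)
print(throughput(65, 60))    # Greyhound: 3900 (high latency, high throughput)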

Time versus throughput (2)


Time is measured in time units per job.
Throughput is measured in jobs per time unit.
But time = 1/throughput may be false:
It takes 4 months to grow a tomato. Can you only grow 3 tomatoes a year?
If you run only one job at a time, time = 1/throughput.


How do you measure Execution Time?


> time foo
... foo's results ...
90.7u 12.9s 2:39 65%
>
(90.7u = user CPU seconds, 12.9s = kernel CPU seconds, 2:39 = wallclock time, 65% = CPU share of wallclock)

user CPU time? (time the CPU spends running your code)
total CPU time (user + kernel)? (includes operating-system code)
wallclock time? (total elapsed time, including time spent waiting for I/O, other users, ...)
The answer depends ...
For measuring processor speed, we can use total CPU time.
If there is no I/O and there are no interrupts, wallclock time may be better:
more precise (microseconds rather than 1/100 second)
can measure individual sections of code
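For comparison, a minimal Python sketch that separates the two measurements (the workload here is an arbitrary placeholder):

import time

t_wall = time.perf_counter()   # wallclock: includes I/O waits, other users
t_cpu  = time.process_time()   # user + kernel CPU time of this process only

total = sum(i * i for i in range(5_000_000))   # the code being measured

print(f"wallclock: {time.perf_counter() - t_wall:.3f} s")
print(f"CPU time:  {time.process_time() - t_cpu:.3f} s")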

Performance

For performance, larger should be better; time is backwards, since a larger execution time is worse.

CPU performance = 1 / total CPU time
System performance = 1 / wallclock time

These terms only make sense if you know what program is being measured, and if the CPU or system works on only one program at a time.
Performance's units, inverse seconds, can be awkward: we can answer "What was the performance?" with "It took 15 seconds", or e.g. "The performance on Linpack was 200 MFLOP/S".
This may all change in the next few years!

A brief study of time


CPU time = CPU cycles executed × cycle time

Every conventional processor has a clock with a fixed cycle time (equivalently, a fixed clock rate):
rate is often measured in MHz = millions of cycles/second
time is often measured in ns (nanoseconds)
X MHz corresponds to 1000/X ns (e.g. 500 MHz corresponds to a 2 ns clock)

CPU cycles = instructions executed × CPI
(CPI = average clock cycles per instruction)

Putting it all together


CPU execution time = instruction count × CPI × clock cycle time
(instructions/program) × (cycles/instruction) × (seconds/cycle) = seconds

One of P&H's "big pictures".

Note: CPI is somewhat artificial (it is computed from the other numbers using this formula), but it is an intuitive and useful concept.
Note: use the dynamic instruction count (number of instructions executed), not the static count (number of instructions in the compiled code).
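As a worked example, here is the formula in runnable form (the instruction count and CPI below are made up for illustration):

def cpu_execution_time(instructions, cpi, clock_mhz):
    """seconds = instructions x (cycles/instruction) x (seconds/cycle)"""
    cycle_time = 1.0 / (clock_mhz * 1e6)    # 500 MHz -> 2 ns per cycle
    return instructions * cpi * cycle_time

# e.g. 10 billion dynamic instructions, CPI 1.5, 500 MHz clock:
print(cpu_execution_time(10e9, 1.5, 500))   # ~30 seconds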

Explaining performance variation


CPU execution time = instruction count × CPI × clock cycle time

The formula explains performance variation between:
same machine, different programs
same program, different machines with the same ISA
same program, different ISAs

Comparing performance (1)


The fundamental question: will computer A run program P faster than computer B?

Compare clock rates?
Will a 1.7 GHz PC be faster than an 867 MHz Mac? Not necessarily: CPI or instruction count may differ.
(see http://www.apple.com/g4/myth, Photoshop benchmark)

Peak MIPS rate? (MIPS = millions of instructions / second)
The PowerPC G4 can execute 4 instructions/cycle (CPI = 1/4) at an 867 MHz clock, for 3468 MIPS peak.
But it doesn't necessarily execute that quickly.

Comparing performance (2)


The fundamental question: will computer A run program P faster than computer B?

Compare actual MIPS rate on program P?
MIPS = 1 / (CPI × cycle time in microseconds)
If the instruction counts are the same, this is OK, e.g. when comparing two implementations of the same ISA.
Otherwise, the actual MIPS rate doesn't answer the question.
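A small sketch of that comparison. The two machines and their CPIs below are hypothetical, and the comparison is only meaningful because the slide's precondition (equal instruction counts, i.e. the same ISA) is assumed to hold:

def actual_mips(clock_mhz, cpi):
    """MIPS = 1 / (CPI x cycle time in microseconds) = clock MHz / CPI."""
    return clock_mhz / cpi

# Same ISA, same program P, hypothetical measured CPIs:
print(actual_mips(1700, 2.0))   # machine A:  850 MIPS
print(actual_mips(867, 0.8))    # machine B: ~1084 MIPS, despite a lower clock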

Comparing performance (3)


The fundamental question: will computer A run program P faster than computer B?

Relative MIPS?
Defined as: how much faster is this computer than a VAX 11/780 (on some benchmark programs)?
If the benchmark is similar to P, this may give the right answer.

What about MFLOP/S?


Millions of FLoating-point OPs per Second (often written MFLOPS)
Peak rate = maximum float ops per cycle / cycle time (in microseconds)
Peak MFLOP/S (like peak MIPS) is useless.
Normalized MFLOP/S uses conventions (e.g. a divide counts as three float ops) so that the flop count of a program is machine-independent.
OK for floating-point-intensive programs.
Still depends on the program: a better MFLOP/S rate on program P doesn't guarantee better performance on program Q.

Relative Performance

"Computer X is r times faster than Y" means
Perf(X) / Perf(Y) = r (i.e. Time(Y) / Time(X) = r)
Note that which quantity goes on top is swapped when you switch between performances and times.

Comparing speeds ...

"times faster than" (or "times as fast as") means there is a multiplicative factor relating the quantities:
X was 3 times faster than Y: speed(X) = 3 × speed(Y)

"percent faster than" implies an additive relationship:
X was 25% faster than Y: speed(X) = (1 + 25/100) × speed(Y)

"percent slower than" implies subtraction:
X was 5% slower than Y: speed(X) = (1 - 5/100) × speed(Y)
(100% slower means it doesn't move at all!)

"times slower than" or "times as slow as" is awkward:
X was 3 times slower than Y means speed(X) = 1/3 × speed(Y)
It hints at having a measure of slowness; I'll mostly avoid using it.

Percentages aren't intuitive!

If X is p% faster than Y, is Y p% slower than X?
X is p% faster: speed(X) = (1 + p/100) × speed(Y), so speed(Y) = 1/(1 + p/100) × speed(X)
Y is p% slower: speed(Y) = (1 - p/100) × speed(X)
No! 1/(1 + p/100) is not (1 - p/100), unless p = 0.

Suppose X is p% faster than Y and Y is q% faster than Z. Is X (p+q)% faster than Z??



Times faster is easier!


X is r times faster than Y: speed(X) = r × speed(Y), i.e. speed(Y) = 1/r × speed(X), so Y is r times slower than X.
X is r times faster than Y, and Y is s times faster than Z: speed(X) = r × speed(Y) = rs × speed(Z), so X is rs times faster than Z.
Advice: convert "% faster" to "times faster", do the calculation, and convert back if needed. Example: change "25% faster" to "5/4 times faster".
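Both pitfalls from the previous slide can be checked numerically (a small illustrative script):

def pct_faster_to_ratio(p):
    """Convert 'p% faster' into a multiplicative speed ratio."""
    return 1 + p / 100

r = pct_faster_to_ratio(25)       # 25% faster == 5/4 times faster

# Is Y then 25% slower than X?  No:
print(1 / r)                      # 0.8  -> Y is 20% slower than X
print(1 - 25 / 100)               # 0.75 -> what '25% slower' would mean

# Chaining: X 50% faster than Y, Y 50% faster than Z:
print(pct_faster_to_ratio(50) ** 2)   # 2.25 -> X is 125% faster than Z, not 100%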

How do you judge computer performance?


Clock speed? No, unless the ISA is the same.
Peak MIPS rate? No.
Relative MIPS, normalized MFLOPS? Sometimes (if the program tested is like yours).
How fast does it execute MY program? The best method!

Benchmarks

It's hard to convince manufacturers to run your program (unless you're a BIG customer).
A benchmark is a set of programs that are representative of a class of problems.
Microbenchmarks measure one feature of the system, e.g. memory accesses or communication speed.
Kernel benchmarks use the most compute-intensive part of applications, e.g. Linpack and the NAS kernel benchmarks (for supercomputers).
Full applications: SPEC = System Performance Evaluation Cooperative (int and float, for Unix workstations); other suites exist for databases, web servers, graphics, ...

Improving Latency

Latency is (ultimately) limited by physics, e.g. the speed of light.
Some improvements are incremental:
smaller transistors shorten distances
to reduce disk access time, make disks rotate faster
Improvements often require new technology:
replace the stagecoach by the pony express or the telegraph
replace DRAM by SRAM
Once upon a time, bipolar and GaAs were much faster than CMOS, but incremental improvements to CMOS have triumphed.

Improving Bandwidth

You can improve bandwidth or throughput by throwing money at the problem: use wider buses, more disks, multiple processors, more functional units ...

Two basic strategies:
Parallelism: duplicate resources. Run multiple tasks on separate hardware.
Pipelining: break the process up into multiple stages and build separate resources for each stage. This reduces the time needed for a single stage, so a new task can start down the pipe every (shorter) timestep.

Pipelining

Old-fashioned washing machine:
Washing/rinsing and spinning are done in the same tub.
Takes 15 minutes (wash/rinse) + 5 minutes (spin).
Time for 1 load: 20 minutes. Time for 10 loads: 200 minutes.

Modern washing machine:
A tub for washing & rinsing (15 minutes) and a separate spinner (10 minutes).
Time for 1 load: 25 minutes. Time for 10 loads: 160 minutes
(25 minutes for the first load, then 15 minutes for each load thereafter).
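A quick sanity check of those numbers (illustrative Python; the stage lengths are taken from the slide):

def old_machine(loads, wash_spin=20):
    """Wash/rinse (15 min) and spin (5 min) in one tub: 20 min per load."""
    return loads * wash_spin

def pipelined(loads, tub=15, spinner=10):
    """Separate tub and spinner: while load i spins, load i+1 washes.
    After the first load, one load finishes every max(tub, spinner) minutes."""
    return (tub + spinner) + (loads - 1) * max(tub, spinner)

print(old_machine(1), old_machine(10))   # 20, 200 minutes
print(pipelined(1), pipelined(10))       # 25, 160 minutes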

Parallelism vs pipelining
Both improve throughput (bandwidth).
Automobiles: more plants vs. an assembly line.
I/O bandwidth: wider buses (e.g. a parallel port) vs. pushing bits onto the bus faster (a serial port).
Memory-to-processor: wider buses vs. a faster transfer rate.
CPU speed:
superscalar: the processor has multiple functional units, so it can execute more than one instruction per cycle
superpipelining: using more steps than the classical 5-stage pipeline
recent microprocessors use both techniques

Latency vs Bandwidth of DRAM


DRAM is much slower than SRAM: perhaps 30 ns vs. 1 ns access time.
But we also hear "SDRAM is much faster than ordinary DRAM", e.g. "RDRAM (from Rambus) is 5 times faster" ...
So are S(R)DRAMs almost as good as SRAM?
(Hint: the faster DRAM variants mainly improve bandwidth, not latency.)

What are limits?

Physics: the speed of light, the size of atoms, heat generated (speed requires energy loss), the capacity of the electromagnetic spectrum (for wireless), ...
Limits with current technology: the size of magnetic domains, chip size (due to defects), lithography, pin count.
New technologies on the horizon: quantum computers, molecular computers, superconductors, optical computers, holographic storage, ...
Fallacy: improvements will stop.
Pitfall: trying to predict more than 5 years into the future.

Summary

The objective of a cache coherence protocol is to keep recently used local variables in the cache and let them reside there through numerous reads and writes.
Processor performance can be measured by the rate at which the processor executes instructions: execution time = instruction count × CPI × cycle time.
A benchmark is a set of programs that are representative of a class of problems.

Reference

Stallings, William, 2003, Computer Organization & Architecture: Designing for Performance, Sixth Edition, Pearson Education, Inc., ISBN 0-13-049307-4.
Mano, M. Morris, Computer System Architecture, Third Edition, Prentice Hall.
Carter, Larry, Measuring Performance, CSE 141, UCSD, Winter 2002.
Tanenbaum, Andrew S., Structured Computer Organization, Fifth Edition, 2006, Pearson Education, Inc., ISBN 0-13-148521-0.
Thornley, John, CS 284a Lecture, Tuesday, 7 October 1997.

Further Reading
Manufacturers' websites
Relevant Special Interest Groups [SIG]
Articles in magazines
IEEE Computer Society Task Force on Cluster Computing website

