
ECE 6500 Advanced Computer Architecture

Course Materials
Syllabus: Available on the course web site
Course Web Site: http://homepages.wmich.edu/~bazuinb/ECE6500/ECE6500_Sp11.htm
Prof. Kai Hwang Home Page
ID and Acknowledgement Form: Handout, fill in ID, and sign

Why does Dr. Bazuin teach this course? Computer architecture is a necessary part of real-time signal processing, particularly when it comes to wired and wireless communications systems.

To create or extract desired signals for wireless communication, advanced signal processing algorithms and mathematics must be employed. To perform the processing required, digital signal processing techniques must be developed and hosted on real-time signal processing machines. Real-time processing requires parallel computations and computing architectures. Therefore, the ability to define, develop, and program scalable parallel computing machines or computers is critical knowledge when working with advanced wireless systems!

Dr. Bazuin's Biography

Dr. Bazuin graduated magna cum laude with a B.S. in Engineering and Applied Sciences, Intensive Electrical Engineering, from Yale University. He continued his education at Stanford University, receiving his M.S. and Ph.D. in 1982 and 1989 respectively. Dr. Bazuin's graduate work was with the Center for Integrated Electronics in Medicine (CIEM) associated with the Stanford Integrated Circuits Laboratory (ICL) and Center for Integrated Systems (CIS). He defined and developed a custom implantable dimension-measurement telemetry system under the direction of his advisor Dr. James Meindl, currently the director of the Georgia Tech Microelectronics Research Center (MiRC, http://www.mirc.gatech.edu/).

Dr. Bradley J. Bazuin is a tenure-track assistant professor in WMU's Electrical and Computer Engineering Department. Dr. Bazuin entered the academic community in 2000 following over 19 years of full- and part-time industrial experience developing commercial and military communication systems. He has taught a number of undergraduate and graduate courses, has been involved in a range of research projects, and has collaborated on projects with a number of southwestern Michigan companies. Dr. Bazuin has been a co-author on multiple refereed publications, an invited panel discussion member on wireless communications, an invited luncheon speaker on radio frequency identification (RFID), author and co-author of a number of conference papers and presentations, a co-presenter of regional training seminars for Michigan DoD procurement technical assistance centers (PTACs) on RFID, an invited speaker at regional and local users group meetings, and has presented numerous seminars and guest lectures at Western Michigan University.

Dr. Bazuin was employed part-time at ARGOSystems, Inc. in Sunnyvale, CA (now a wholly owned subsidiary of The Boeing Co.) while pursuing his graduate degrees, and full-time after completion. Initially performing digital circuit design, he became involved in digital ASIC design, establishing an ASIC design center, DSP algorithm implementation, and system engineering for a range of direction finding, SIGINT, and COMINT systems for the US government. He left in 1991 for Radix Technologies of Mountain View, CA, a spin-off, where he was responsible for the system engineering and development of a range of advanced spatial, spectral, and temporal signal processing detection and exploitation systems, blind-adaptive anti-jam GPS receivers, LPI communications systems, and, later, commercial wireless local loop communication systems (the initial phase of AT&T's Project Angel).

Research: Semiconductor Physics and Device Characterization, Communications, Microprocessor Applications, Advanced Digital Signal Processing, Adaptive Filters, and Smart Antennas. Roll-to-roll Printed Electronics, Wireless SAW Smart Sensors, Chaotic Communications, Software Radios, RFID, and DSP for 1-D and 2-D signal processing.

Projects:

Notes and material based on K. Hwang and Z. Xu, Scalable Parallel Computing: Technology, Architecture, Programming, McGraw-Hill, New York, NY, 1998. ISBN 0-07-031798-4.

Chapter 1: Scalable Computer Platforms and Models


Computers have always been used to perform mathematical computations at higher rates, whether driven by monetary tax or interest computations, military artillery table generation, or graphical renderings for more visually appealing computer games. To appreciate where we are, it is useful to understand where we have been. And amazingly, some ideas from the past do help us find better approaches for the future.

1.1: Evolution of Computer Architecture

Generation 0 (1642-1945)
Technology and Architecture: Mechanical
Software and Operating System: Hard code, punch cards, Ada Lovelace
Examples: Pascal Calculating Machine, Babbage Analytical Engine

Generation 1 (1946-1956)
Technology and Architecture: Tubes, relays, single-bit CPU, accumulator-based instruction set
Software and Operating System: Machine/assembly language, programs without subroutines
Examples: COLOSSUS, ENIAC, Von Neumann IAS

Generation 2 (1956-1967)
Technology and Architecture: Discrete transistors, core memory, floating-point accelerators, I/O channels
Software and Operating System: Algol and FORTRAN with compilers, batch-processing OS
Examples: IBM 7030, CDC 1604, DEC PDP-1, DEC PDP-8

Generation 3 (1967-1978)
Technology and Architecture: ICs, pipelined CPU, microprogrammed control units
Software and Operating System: C language, multiprogramming, timeshared OS
Examples: IBM 360/370, DEC PDP-11, Intel 8080

Generation 4 (1978-1989)
Technology and Architecture: VLSI, solid-state memory, multiprocessor, vector supercomputer
Software and Operating System: Symmetric multiprocessor, parallelizing compiler, message-passing libraries
Examples: Cray-1, DEC VAX, IBM PC, SUN SPARC, Cray X/MP

Generation 5 (1990- )
Technology and Architecture: ULSI, scalable parallel computers, workstation clusters, Intranet/Internet, superscalar processor
Software and Operating System: WWW, microkernels, JAVA, multithreading, distributed OS
Examples: IBM SP2, SGI Origin 2000


The following computer architecture history is based on: A.S. Tanenbaum, Structured Computer Organization, 4th ed., Prentice Hall, Upper Saddle River, NJ, 1999. ISBN 0-13-095990-1.

Historical Computing Machines


The Zeroth Generation - Mechanical Computers (1642-1945)
- Pascal (1642): the first working mechanical calculator; addition and subtraction for tax collection.
- Von Leibniz (~1670): multiply and divide - the four-function calculator.
- Babbage (~1830): (1) the Difference Engine, a fixed add/subtract algorithm for naval navigation; (2) the Analytical Engine.

Babbage's Analytical Engine

(Figure: Input and Output connect to the Store (Memory) and the Mill (Processor).)

The store: 100 words of 50 decimal digits
The mill: add, subtract, multiply, divide

Instructions were read from punched cards. Instructions included computation and branching! This was the first assembly language that required programming; therefore, there is a first programmer: Ada Augusta Lovelace.

The Early 1900s

Zuse (Germany, 1930s): Automatic calculating machines using relays. The machines were destroyed in WW II.

Atanasoff (Iowa State, 1940): Calculating machine that included capacitors for storage that were refreshed (the beginning of DRAM). Non-operational.
Stibbitz (Bell Labs, 1940): Calculating machine using relays. Demonstrated in 1940.
Aiken (Harvard, 1944): Babbage-inspired machine using relays.


The First Generation - Vacuum Tubes (1945-1955)

World War II drove the requirement for more advanced computational engines: code breaking and artillery range tables.
- COLOSSUS, built by the British to break the ENIGMA cipher: the first electronic digital computer. Alan Turing.
- ENIAC, built by the US, not completed before the end of WW II: 20 registers of 10-digit decimal numbers; programmed with switches and jumpers; 30 tons, 140 kilowatts! John Mauchley.

After WW II, numerous projects were undertaken:
- The EDSAC, the first stored-program computer.
- Von Neumann: the IAS machine, his first after working on ENIAC - the von Neumann machine/architecture.

Von Neumann Machine

A.S. Tanenbaum, Structured Computer Organization, 4th ed., Prentice Hall, Upper Saddle River, NJ, 1999. ISBN 0-13-095990-1.

A binary, stored-program machine with independent data and control paths:
- 4096 words of 40 bits; each word held two 20-bit instructions or one 40-bit signed integer.
- An accumulator machine (40-bit accumulator).
All these aspects were firsts in computer architecture!


International Business Machines - originally the Computing Tabulating Recording Company, 1911 or earlier.

Q. What was the IBM 701?
A. The 701 Electronic Data Processing Machines System, introduced in 1952, was IBM's first commercially available scientific computer and the first IBM machine in which programs were stored in an internal, addressable electronic memory. Using cathode ray tube (Williams tube) memory for speed and flexibility, the 701 could process more than 2,000 multiplications and divisions a second. The arithmetic section contained the memory register, accumulator register and the multiplier-quotient register. Each register had a capacity of 35 bits and sign. The accumulator register also had two extra positions called register overflow positions. The functional machine cycle of the 701 was 12 microseconds; the time required to execute an instruction or a sequence of instructions was an integral multiple of this cycle; 456 microseconds were required for the execution of a multiply or divide instruction. The 701 could execute 33 different operations.
http://www-03.ibm.com/ibm/history/documents/pdf/faq.pdf

Q. What was the IBM 704?
A. The IBM 704 Electronic Data Processing Machine, introduced in 1954, was the first large-scale commercially available computer to employ fully automatic floating point arithmetic commands. It was a large-scale, electronic digital computer used for solving complex scientific, engineering and business problems. Input and output could be binary, decimal, alphabetic or special character code, such as binary coded decimal, which includes decimal, alphabetic and special characters. A key feature of the 704 was FORTRAN (Automatic Formula Translation), which was an advanced program for automatically translating mathematical notation to optimum machine programs. A contemporary IBM publication listed the following features for the 704:
- 32,768, 8,192 or 4,096 words of high-speed magnetic core storage. (A word consists of 36 binary digits, slightly larger than a 10 decimal digit number.) Any word is individually addressable.
- Any word in magnetic core storage can be located and transferred in 12 millionths of a second.
- Single address type stored program controls all operations.
- Internal number system is binary.
- Executes most instructions at a rate of 40,000 per second.
- Built-in instructions provide maximum flexibility with minimum programming.
- A parallel machine, it operates on a full word simultaneously.
- Magnetic tape input-output units permit masses of data to enter and leave the internal memory of the machine at high speed.
http://www-03.ibm.com/ibm/history/documents/pdf/faq.pdf

The Second Generation - Transistors (1955-1965)

The transistor was invented by Bardeen, Brattain, and Shockley at Bell Labs in 1948. (Bell Labs is now a part of Alcatel-Lucent and is almost gone.)

MIT Lincoln Labs (http://en.wikipedia.org/wiki/Lincoln_Labs): The TX-0 was the first transistorized computer, a 16-bit machine.

DEC: founded in 1957 as an outgrowth of the MIT transistorized computer work.
- PDP-1, the first minicomputer (1960): a 4K x 18-bit machine, 5 usec cycle, $120,000. One was given to MIT, where students created a video game.
- PDP-8, the break-out machine for minicomputers (1965): a 12-bit machine, $16,000.

DEC PDP-8

A.S. Tanenbaum, Structured Computer Organization, 4th ed., Prentice Hall, Upper Saddle River, NJ, 1999. ISBN 0-13-095990-1.

The Omnibus, or single computer bus, was a major departure from the von Neumann machine. DEC was known for minicomputers, while IBM built large mainframes for scientific computing. (DEC was bought by Compaq, which then merged with HP.) http://en.wikipedia.org/wiki/Digital_Equipment_Corporation


Other Computer Companies

CDC - Control Data Corporation: Parallel computing was introduced, with up to 10 simultaneous instructions! A key contributor was Cray, later of Cray Computers and supercomputer fame. http://en.wikipedia.org/wiki/Control_Data_Corporation

Burroughs: The B5000 focused on incorporating features to more directly implement a language, ALGOL, and thereby ease the compiler's tasks. Software was identified as a key component of a computer hardware design! Burroughs Corporation merged with Sperry Corporation to form Unisys. http://en.wikipedia.org/wiki/Burroughs_Corporation


The Third Generation - Integrated Circuits (1965-1980)

The Shockley Semiconductor spin-off, Fairchild, focused on putting multiple transistors on a single substrate. Robert Noyce was involved and later became a founder of Intel.

IBM (1964): Combined two older, incompatible series of computers into the IBM 360 family.
- Multiple models from low end (commercial) to high end (scientific).
- Multiprogramming: multiple programs reside in memory simultaneously, allowing time sharing.
- Emulation: could simulate the operations of other computers. Used microprogramming and allowed instructions to be interpreted by the control unit!
- 16 32-bit registers, 8-bit memory bytes, 2^24-byte address space (16 MB).

Initial IBM 360 Family

A.S. Tanenbaum, Structured Computer Organization, 4th ed., Prentice Hall, Upper Saddle River, NJ, 1999. ISBN 0-13-095990-1.

DEC (1970)

The PDP-11, the first personal workstation, allowed research groups and labs to have their own minicomputers.

Microprocessors began to appear: TI, Intel, Motorola. Intel 4004 (1971), Intel 8080 (1974), Intel 8086 (1978).


The Fourth Generation - Very Large Scale Integrated Circuits (1980-1989)

IBM (1981) Personal Computers (PCs):
- Based on the Intel 8088; separate group development; commodity parts.
- Plans were published to allow expansion, and clones emerged.
- Other PCs: Commodore, Amiga, Atari, Apple (Homebrew Computer Club - Jobs and Wozniak).
- Disk operating systems were developed for PCs; Microsoft bought an operating system (DOS) and began to dominate.

MIPS (1985) First commercial RISC processor

Parallel supercomputers continue for high-speed computations: systolic array processing, parallel processing arrays.

Flynn's Taxonomy of Computer Architectures

A.S. Tanenbaum, Structured Computer Organization, 4th ed., Prentice Hall, Upper Saddle River, NJ, 1999. ISBN 0-13-095990-1.

Macintosh

Apple used Xerox PARC concepts for a new line of computers and a superior operating system.


The Fifth Generation - ULSI and ASICs (1990-1996)

Software: Windows and Intel (WinTel) dominate the PC market; computer languages flourish; operating systems expand.
Hardware: Superscalar processors (multiple execution units in the CPU chip); pipelining; out-of-order execution.

Computer systems: networking, distributed computing, cluster computing, servers.

The Sixth Generation - Superscalar/Superpipelined Machines (1997- )

Superscalar - superpipelined:
- INTEL: P6 architectures (Pentium II, Pentium III and Celeron)
- IBM/Motorola: PowerPC architecture

Multithreaded machines: INTEL Pentium IV and Celeron

The Seventh Generation - ???

Multi-core CPUs - multiple CPUs on a single IC: INTEL dual-core and quad-core with multithreading.


Current Trends and Looking into the Future

Moore's Law: "The observation made in 1965 by Gordon Moore, co-founder of Intel, that the number of transistors per square inch on integrated circuits had doubled every year since the integrated circuit was invented. Moore predicted that this trend would continue for the foreseeable future. In subsequent years, the pace slowed down a bit, but data density has doubled approximately every 18 months, and this is the current definition of Moore's Law, which Moore himself has blessed."
from http://www.webopedia.com/TERM/M/Moores_Law.html

Moore's Law for Intel CPUs

http://en.wikipedia.org/wiki/Moore%27s_law


Intel Family of Processors

A.S. Tanenbaum, Structured Computer Organization, 4th ed., Prentice Hall, Upper Saddle River, NJ, 1999. ISBN 0-13-095990-1.

http://en.wikipedia.org/wiki/File:Transistor_Count_and_Moore%27s_Law_-_2008.svg


The Computing Problem

(Figure: the computing problem flows from system or application requirements, through algorithms and data structures supported by the operating system and hardware architecture, to mapping, programming, and binding (compile, link, load) with high-level languages and application software, ending in performance evaluation.)

K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, 1993. ISBN 0-07-031622-8.

Six layers for a computer system development (upper layers are machine independent; lower layers are machine dependent):
- Applications
- Programming Environment
- Languages Supported
- Communication Model
- Addressing Space
- Hardware Architecture

K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, 1993. ISBN 0-07-031622-8.


Scalable Parallel Computer Architectures
Parallel: Exploit multiple simultaneous operations by the computer.
Scalable: Can the architecture be scaled up or down as appropriate for an application?

Scalability Implies:
- Functionality and Performance: Improve functionality and compute time in proportion to the increase (or decrease) in resources.
- Scaling in Cost: Cost changes must be reasonable; for an N-times performance change, can we expect a cost increase of 1, N, N*logN, N^2, etc.?
- Compatibility: Existing components should still be usable with minor changes.

Classes of Computers (The Computer Pyramid)

(Figure: a pyramid ranking computers by cost/performance versus quantity; from the top: Supercomputers, Mainframes, SMP Servers and Clusters, Stand-alone Workstations, Personal Computers. Cost/performance increases toward the top; quantity increases toward the base.)

Computer rankings based on Cost/Performance vs. Quantity


Flynn's Taxonomy of Computer Architectures Revisited

A.S. Tanenbaum, Structured Computer Organization, 4th ed., Prentice Hall, Upper Saddle River, NJ, 1999. ISBN 0-13-095990-1.

Single Instruction Single Data (SISD)

(Figure: a Control Unit (CU) sends one instruction stream (IS) to a single Processing Unit (PU), which exchanges a single data stream (DS) with a Memory Unit (MU); I/O attaches to the processing unit.)

Single Instruction Multiple Data (SIMD)

(Figure: one Control Unit broadcasts a single instruction stream to several Processing Units, each exchanging its own data stream with its own Memory Unit.)

Legend: CU = Control Unit, PU = Processing Unit, MU = Memory Unit, IS = Instruction Stream, DS = Data Stream.


Multiple Instruction Single Data (MISD)

(Figure: several Control Units each issue their own instruction stream to a chain of Processing Units that pass a single data stream from one to the next; the Memory Unit and I/O serve the whole chain.)

Multiple Instruction Multiple Data (MIMD)

(Figure: several Control Unit / Processing Unit pairs, each with its own instruction stream, data stream, and I/O, share a common Memory Unit.)

As a computer architecture, is it parallel and scalable?
Parallel: Exploit multiple simultaneous operations by the computer.
Scalable: Can the architecture be scaled up or down as appropriate for an application?

Dimensions of scalability:
- Resource (machine component)
- Application (problem and machine size)
- Technology (time, space, heterogeneity)

Scalability implies: functionality and performance, scaling in cost, compatibility.


Expanded Taxonomy of Computer Architectures

A.S. Tanenbaum, Structured Computer Organization, 4th ed., Prentice Hall, Upper Saddle River, NJ, 1999. ISBN 0-13-095990-1.


1.2 Dimensions of Scalability


- Resource (machine component)
- Application (problem and machine size)
- Technology (time, space, heterogeneity)

Resource Scalability
Increasing the machine size: number of processors, amount of memory, improved software.
- Size Scalability: Number of processors, add or subtract. Not always a simple process due to communications subsystems and processes (i.e., interconnections, interfaces, communication software). Not always possible due to software (i.e., working with parallelism; how to program or compile code).
- Additional Resources: Memory, cache memory, disk drives, etc. Not always a simple process due to addressing, sharing, coherency, etc.
- Software Scalability: Compiler improvements for parallelism and efficiency; more efficient libraries (math, engineering, sorts, etc.); applications software structured for scalable processing; new operating systems (OS) to use scaled resources; user-friendly programming environment.


Application Scalability
Application programs that are scalable in machine size and problem size.
- Machine Size: At what rate does the application performance change as the machine scales? Getting started, communicating, and coordinating all get less efficient with bigger machines; how much gets wasted?
- Problem Size: Growth in data-set sizes; change in the number of users/tasks as the machine scales. Practical limits may exist for applications.
Systems involve a combination of the machine and the application; performance is not solely dependent upon machine and problem sizes. Don't forget memory, I/O capability, communications, etc.


Technology Scalability
Adapt to changes in technology.
- Generation (or Time) Scalability: Next-generation components (processor, memory, etc.); next-generation OS. Change is inevitable, but will the system need to be replaced? Intel processors have maintained backward compatibility; Motorola PowerPC processors have not.

- Space Scalability: How large and distributed can the system get? Box, rack, room, building, region, international, the WWW.
- Heterogeneity Scalability: Scale with components from different vendors, both hardware and software. Was a design standard used, or is everything customized or vendor specific? Industry standards, open architecture, software portability (e.g., JAVA).


1.3 Parallel Computer Models
Define idealized abstract models of a parallel computer.
A parallel machine model is an abstract parallel computer from the programmer's viewpoint. We will define:
[1.3.3] Abstract machine models (typically used to estimate performance), and
[1.3.4] Physical machine models.

Abstract models are used to characterize the capability of a parallel computer. We hope to capture implicitly the relative cost of parallel computations (cost in dollars and performance).

The simplest model: The Parallel Random Access Machine (PRAM) Model

(Figure: several Processing Units (PUs) connected to a single shared memory.)

Characteristics: MIMD, fine grain, tightly synchronized, zero overhead, shared variable.

Attributes Used to Describe the Models

Five semantic attributes and several performance attributes can be used to characterize a parallel machine and its model.
Semantic attributes: Homogeneity, Synchrony, Interaction Mechanism, Address Space, Memory Model.


(1) Homogeneity
Characterizes how alike the processors are.
- A PRAM with 1 processor is Flynn's SISD (Single Instruction Single Data).
- A PRAM with n processors is typically Flynn's MIMD (Multiple Instruction Multiple Data), but could be Flynn's SIMD (Single Instruction Multiple Data) if all the processors perform the same instruction on a cycle-by-cycle basis.
- A special case is SPMD, Single Program Multiple Data.

(2) Synchrony
Characterizes how tightly synchronized the processors are.
- PRAM is synchronized at the instruction (clock) level, one step at a time.
- For a SIMD machine, synchrony is expected.
- For a MIMD machine, you would expect that synchrony is optional or even undesirable.
Synchrony options or levels: clocks; instructions; asynchronous with a synchronization operation; supersteps (a block of instructions); phases of execution (loosely synchronized phases); asynchronous.


(3) Interaction Mechanism
Characterizes how many processors may interact and how they perform the interaction.
- Shared Variables (data in a memory that is accessible) - shared memory
- Shared Messages (messages used to communicate) - shared nothing

Multiprocessor: an MIMD machine with shared variables.
Multicomputer: an MIMD machine with message passing.

(4) Address Space
Characterizes how the memory addressing space is organized.
- Single Address Space: provided in shared-memory models.
  - UMA: shared memory, single address space, uniform access time.
  - NUMA: shared memory, single address space, non-uniform access time (local and remote memory accesses); a Distributed Shared Memory (DSM) system.
- Multiple Address Spaces: multicomputers.
- Hybrid: a global shared memory with independent local memories (e.g., TMS320C40).


(5) Memory Model
Characterizes how the machine handles shared-memory access conflicts; the desire is for consistency and valid operations within the machine. Consistency rules must be defined.
- EREW - Exclusive Read, Exclusive Write: one access at a time.
- CREW - Concurrent Read, Exclusive Write: multiple read accesses, but only one write.
- CRCW - Concurrent Read, Concurrent Write: a conflict exists in writing, so an access policy is defined to handle conflicts:
  - Common: the same value should be simultaneously written.
  - Arbitrary: pick one and keep going.
  - Minimum: allow the processor with the lowest ID/index.
  - Priority: combine values in a defined way (OR, AND, sum, max).
Memory consistency models have been defined and are presented in Chap. 5.
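A minimal sketch, not from the text, of how the CRCW conflict-resolution policies listed above could be modeled in software; the function and policy names are illustrative only.

```python
# Illustrative sketch: resolving concurrent writes to one PRAM memory cell
# under the CRCW policies listed above.

def resolve_crcw(writes, policy="arbitrary"):
    """writes: list of (processor_id, value) pairs targeting the same cell."""
    if not writes:
        return None
    values = [v for _, v in writes]
    if policy == "common":
        # All processors must agree on the value being written.
        if len(set(values)) != 1:
            raise ValueError("Common-CRCW violation: writers disagree")
        return values[0]
    if policy == "arbitrary":
        # Pick any one writer and keep going.
        return values[0]
    if policy == "minimum":
        # The processor with the lowest ID/index wins.
        return min(writes, key=lambda w: w[0])[1]
    if policy == "priority":
        # Combine the values in a defined way (here: sum).
        return sum(values)
    raise ValueError("unknown policy")

print(resolve_crcw([(3, 5), (1, 9), (7, 5)], policy="minimum"))  # -> 9
```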

ATOMIC Memory Operations: (1) Invisible: Once started, other processes cannot see the intermediate states (2) Finite: It will finish in a finite amount of time

Transaction: an ATOMIC operation that meets the ACID properties:
- Atomicity: all or nothing.
- Consistency: move cleanly from one state to the next.
- Isolation: results are not revealed until committed.
- Durability: once committed, the transaction persists.


High-Level Architecture of MIMD Machines Based on Interaction Methods
- Shared Nothing: multicomputer network
- Shared Disk: multicomputer network with global store
- Shared Memory: multiprocessor system

Shared-Nothing Multicomputer - Message-Passing Communications

(Figure: each computer node contains a processing unit (P), cache (C), memory (M), local disk (HD), and network interface circuitry (NIC); nodes communicate only over the interconnection network.)

Shared-Disk Multicomputer - Message-Passing Communications

(Figure: each computer node contains a processing unit (P), cache (C), memory (M), and network interface circuitry (NIC); the nodes share disks (HD) attached to the interconnection network.)


Shared-Memory Multiprocessor - Shared-Memory Communications

(Figure: processor nodes, each a processing unit (P) with cache (C), connect through the interconnection network to shared memory (M) and shared disk (HD).)

Micro-Architecture vs. Macro-Architecture: Building an MIMD Machine

1. Start with a commodity processor.
2. Build a processing unit shell with appropriate resources (the Processing Unit) - the computer micro-architecture.
3. Build a Computer Node or Processing Node.
4. Construct a parallel processing system - the computer macro-architecture.

Semantic attributes: which ones may apply to the high-level architecture? (How do the semantic attributes relate to the machine model?)
Homogeneity, Synchrony, Interaction Mechanism, Address Space, Memory Model

Physical Machine Models


PVP - Parallel Vector Processor
SMP - Symmetric Multiprocessor
MPP - Massively Parallel Processor
DSM - Distributed Shared Memory Processor
COW - Cluster of Workstations

Figure 1.6, p. 27, Five Physical Parallel Computer Models


(Figure panels:
- PVP - Parallel Vector Processor: vector processors (VP) connected through a crossbar switch to shared memory (SM).
- SMP - Symmetric Multiprocessor: processor/cache nodes (P/C) connected by a bus or crossbar to shared memory (SM).
- MPP - Massively Parallel Processor: nodes of P/C, memory bus (MB), local memory (LM), and NIC on a custom-designed network.
- DSM - Distributed Shared Memory Processor: like the MPP node, with an added cache directory (DIR), on a custom-designed network.
- COW - Cluster of Workstations: complete computers (P/C, MB, memory, bridge, NIC) on a commodity network (Ethernet, ATM, etc.).)



Semantic attributes: which ones may apply to the physical machine models? (How do the semantic attributes relate to the machine model?)
Homogeneity, Synchrony, Interaction Mechanism, Address Space, Memory Model

PVP - Parallel Vector Processor

Custom designed with special data access. May incorporate vector registers or access.
Applications: special-purpose signal processing.

(Figure: vector processors (VP) connected through a crossbar switch to shared memory (SM).)

Homogeneity: Yes, custom processors
Synchrony: Asynchronous or loosely synchronous
Interaction Mechanism: Shared variables
Address Space: Single
Memory Access: Uniform (UMA)
Memory Model: Sequentially consistent


SMP - Symmetric Multiprocessor

Commercial processors, unique custom high-speed interconnections, equal access for I/O and memory.
Applications: databases, on-line transactions, data warehouses.

(Figure: processor/cache nodes (P/C) connected by a bus or crossbar to shared memory (SM).)

Homogeneity: Yes, commodity processors
Synchrony: Asynchronous or loosely synchronous
Interaction Mechanism: Shared variables
Address Space: Single
Memory Access: Uniform (UMA)
Memory Model: Sequentially consistent


MPP - Massively Parallel Processor

Unique computing nodes based on commercial processors. Network interconnections designed for low latency and high bandwidth. Designed to be highly scalable (1,000s of nodes).
Applications: scientific computing and data warehouses.

(Figure: nodes of processor/cache (P/C), memory bus (MB), local memory (LM), and NIC on a custom-designed network.)

Homogeneity: Yes
Synchrony: Asynchronous or loosely synchronous
Interaction Mechanism: Message passing
Address Space: Multiple
Memory Access: NORMA
Memory Model: Data flow


DSM - Distributed Shared Memory Processor

Memory is distributed to the computing or processing nodes, but hardware and software perceive the memory as a single address space; the message-passing network emulates shared variables. Special-purpose hardware/software is involved (a cache directory). Designed to be highly scalable (1,000s of nodes).
Applications: scientific computing and data warehouses.

(Figure: nodes of processor/cache (P/C), memory bus (MB), local memory (LM), cache directory (DIR), and NIC on a custom-designed network.)

Homogeneity: Yes
Synchrony: Asynchronous or loosely synchronous
Interaction Mechanism: Shared variables
Address Space: Single
Memory Access: NUMA
Memory Model: Weak ordering (Ch. 5), supported by the directory


COW - Cluster of Workstations

Each node is a complete workstation, including an OS, plus software (middleware) that supports a single system image (SSI). Nodes may be headless, without any I/O devices (e.g., keyboard, monitor), and may include local hard disks and/or special resources.
Applications: the availability of PCs makes this concept attractive for computing operations.

(Figure: complete computers (P/C, memory bus, memory, bridge, NIC) on a commodity network (Ethernet, ATM, etc.).)

Homogeneity: Maybe (easier if it is)
Synchrony: Asynchronous or loosely synchronous
Interaction Mechanism: Message passing
Address Space: Multiple
Memory Access: NORMA
Memory Model: Data flow


Physical Machine Model Attributes

- Parallel Vector Processor (PVP): UMA, crossbar, shared memory
- Symmetric Multiprocessor (SMP): UMA, crossbar or bus, shared memory, hard to scale
- Massively Parallel Processor (MPP): NORMA, message passing, custom interconnection, classic supercomputers
- Distributed Shared Memory (DSM): NUMA or NORMA, shared memory (hardware or software based), custom interconnections, possible cache directories
- Cluster of Workstations (COW): NORMA, message passing, SSI challenged, commodity processors and interconnection

Describing the physical machine

- Scalable computer architectures: functionality and performance, scaling in cost, compatibility
- Dimensions of scalability: resource scalability, application scalability, technology scalability
- Parallel computer models - semantic attributes: homogeneity, synchrony, interaction mechanism, address space, memory model


Critical concept for MIMD parallel processing


A parallel computer should be, and indeed is, one machine. From the user's and/or programmer's view of one machine, the system as a goal should provide a single system image (SSI).

Figure 1.7, p. 30, Typical Programmer's Architecture of a Cluster of Multiple Computers

(Figure: a programming environment and application run on top of a single-system-image infrastructure, which spans multiple OS nodes connected by a commodity or proprietary interconnect.)


Summary of Physical Machine Models

Table 1.6, p. 26, Semantic Attributes of Parallel Machine Models
(Attributes listed in order: Homogeneity; Synchrony; Interaction Mechanism; Address Space; Access Cost; Memory Model; Example Machines.)

PRAM: MIMD; instruction-level synchronous; shared variable; single; UMA; EREW, CREW or CRCW; theoretical model.
PVP/SMP: MIMD; asynchronous or loosely synchronous; shared variable; single; UMA; sequential consistency; IBM R50, Cray T-90.
DSM: MIMD; asynchronous or loosely synchronous; shared variable; single; NUMA; weak ordering is widely used; Stanford DASH, SGI Origin 2000.
MPP/COW: MIMD; asynchronous or loosely synchronous; message passing; multiple; NORMA; N/A; Cray T3E, Berkeley NOW.


Performance Attributes: (see supplemental web)

Terminology - Notation - Unit
- Machine Size: $n$ - dimensionless
- Clock Rate: $f$ - MHz
- Workload: $W$ - Mflop
- Sequential Execution Time: $T_1$ - sec
- Parallel Execution Time: $T_n$ - sec
- Speed: $P_n = W / T_n$ - Mflop/sec
- Peak Speed: $P_{peak}$ - Mflop/sec
- Speedup: $S_n = T_1 / T_n$ - dimensionless
- Efficiency: $E_n = S_n / n$ - dimensionless
- Utilization: $U_n = P_n / (n \cdot P_{peak})$ - dimensionless
- Startup Time: $t_0$ - usec
- Asymptotic Bandwidth: $r_\infty$ - Mbytes/sec

Derived Performance: nominal time for communications
$T_{message} = t_0 + m / r_\infty$

Communications time is nominally the sum of the communication latency ($t_0$) and the message transfer time, where $m$ is the message length in bytes.
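A small sketch (my own, not from the text) of the derived performance metrics defined above; the function names and example numbers are assumptions for illustration only.

```python
# Derived performance metrics from the table above: speed, speedup,
# efficiency, utilization, and nominal message time.

def speed(W_mflop, T_sec):
    return W_mflop / T_sec                      # Pn = W / Tn, in Mflop/s

def speedup(T1, Tn):
    return T1 / Tn                              # Sn = T1 / Tn

def efficiency(T1, Tn, n):
    return speedup(T1, Tn) / n                  # En = Sn / n

def utilization(W_mflop, Tn, n, P_peak):
    return speed(W_mflop, Tn) / (n * P_peak)    # Un = Pn / (n * Ppeak)

def message_time(t0_us, m_bytes, r_inf_MBps):
    # T_message = t0 + m / r_inf  (latency plus transfer time), in seconds
    return t0_us * 1e-6 + m_bytes / (r_inf_MBps * 1e6)

# Example with made-up numbers: W = 200 Mflop, T1 = 10 s, Tn = 1.4 s on n = 8.
print(speedup(10.0, 1.4), efficiency(10.0, 1.4, 8))
print(utilization(200.0, 1.4, 8, 25.0))        # Ppeak = 25 Mflop/s per processor
print(message_time(t0_us=50, m_bytes=4096, r_inf_MBps=100))
```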


Managing concurrency acquires a central role in developing parallel applications. The basic steps in designing parallel applications are:
Partitioning: The partitioning stage of a design is intended to expose opportunities for parallel execution. Hence, the focus is on defining a large number of small tasks in order to yield what is termed a fine-grained decomposition of a problem.

Communication: The tasks generated by a partition are intended to execute concurrently but cannot, in general, execute independently. The computation to be performed in one task will typically require data associated with another task. Data must then be transferred between tasks so as to allow computation to proceed. This information flow is specified in the communication phase of a design.

Agglomeration (to form or collect into a rounded mass): In the third stage, we move from the abstract toward the concrete. We revisit decisions made in the partitioning and communication phases with a view to obtaining an algorithm that will execute efficiently on some class of parallel computer. In particular, we consider whether it is useful to combine, or agglomerate, tasks identified by the partitioning phase, so as to provide a smaller number of tasks, each of greater size. We also determine whether it is worthwhile to replicate data and/or computation.

Mapping: In the fourth and final stage of the parallel algorithm design process, we specify where each task is to execute. This mapping problem does not arise on uniprocessors or on shared-memory computers that provide automatic task scheduling.

(A small sketch of these four stages follows the source link below.)

From http://en.wikipedia.org/wiki/Multi-core_(computing)
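A toy illustration (my own, not from the cited article) of the four design stages applied to summing an array; the chunking scheme and worker count are arbitrary choices made for the example.

```python
# Partition into tasks, agglomerate them into chunks, map chunks onto worker
# processes, and communicate partial results back for the final combination.
from multiprocessing import Pool

def partial_sum(chunk):          # one agglomerated task: sum a chunk locally
    return sum(chunk)

def parallel_sum(data, n_workers=4):
    # Partitioning + agglomeration: fine-grained tasks (single elements) are
    # combined into n_workers coarse chunks.
    chunks = [data[i::n_workers] for i in range(n_workers)]
    # Mapping: the pool assigns chunks to processes; communication happens when
    # the partial results are returned and combined.
    with Pool(n_workers) as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum(list(range(1_000_000))))
```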


Operations in Parallel Programs

(1) Computation: arithmetic, logic, data transfer, and control flow operations of a traditional sequential machine.
(2) Parallelism: operations needed to manage processes, such as creation and termination, context switching, and grouping.
(3) Interaction: operations needed to communicate and to synchronize processes.

Both explicit and implicit operations must be considered for Parallelism and Interaction.
- Explicit: added instruction calls, additions to sequential operation.
- Implicit: not in the instructions, but performed anyway.

Overhead: operations needed in addition to traditional sequential code execution.

Types of overhead:
- Parallelism Overhead: caused by process management.
- Communication Overhead: caused by processors exchanging information.
- Synchronization Overhead: caused when executing synchronization operations.
- Load Imbalance Overhead: caused when processors are idle while others continue.

Time components:
- $T_{comp}$: computation time, including load imbalance time
- $T_{inter}$: interaction time, including communication and synchronization
- $T_{par}$: parallelism (process management) time

Total processing time:
- SISD machine (traditional sequential code execution): $T_1$
- MIMD machine (parallel processing with overhead): $T_n = T_{comp} + T_{inter} + T_{par}$


1.3.3 Abstract Machine Models

Models used to design and analyze parallel algorithms. Note: they do not necessarily take into account, or care, what physical machine model is used! The abstract models, in order of how close they are to actual execution time, are:

- PRAM: Parallel Random Access Model
- BSP: Bulk Synchronous Parallel Model
- PPM: Phase Parallel Model

Note: algorithms executed on every one of the physical machine models can be estimated based on each of the abstract models.

PRAM - Parallel Random Access Model

PRAM is a first-order parallel processing algorithm model. It is simple, clean, and widely used.

(Figure: several processing units (PUs) connected to a single shared memory. Characteristics: MIMD, fine grain, tightly synchronized, zero overhead, shared variable.)

There are numerous unrealistic assumptions:
- zero communications overhead (shared variable)
- zero synchronization overhead; instruction-level synchrony is assumed
- zero parallelism overhead (ignored)

Computation time includes:
- $T_{comp}$: computation and load imbalance
- Zero interaction time: $T_{inter} = 0$
- Zero parallelism time: $T_{par} = 0$

$T_n = T_{comp} + (T_{inter} = 0) + (T_{par} = 0) = T_{comp}$


BSP - Bulk Synchronous Parallel Model

Improves upon the PRAM by including communications and simple synchronization.

(Figure: processor nodes (P nodes) connected by a communication network. Characteristics: MIMD, variable grain, loosely synchronous supersteps (computation, communication, barrier synchronization), non-zero overhead, message passing or shared variable.)

Computation time includes:
- $T_{comp}$: computation and load imbalance
- $T_{inter}$: communication and simple synchronization
- Zero parallelism time: $T_{par} = 0$

Therefore $T_n = T_{comp} + T_{inter} + (T_{par} = 0) = T_{comp} + T_{inter}$, also written as a summation over supersteps:
$T_n = \sum_{i=1}^{M} \left[ T_{comp}(i) + T_{inter}(i) \right]$

In BSP, algorithms execute in a sequence of supersteps. A superstep consists of:
- computation operations, of at most $w$ cycles,
- communication, of at most $g \cdot h$ cycles (where $h$ is in words), and
- barrier synchronization, of $l$ cycles.
The time of superstep $i$ is $T_n(\text{step } i) = w_i + g \cdot h_i + l$.


Computation Time with Supersteps:

The total time is a summation over supersteps:
$T_n = \sum_i T_n(i) = \sum_i \left( w_i + g \cdot h_i + l \right)$

- Computation time per superstep, $w_i$: the maximum computation time of any processor in the step.
- Communication time per superstep, $g \cdot h_i$: defined based on an h-relation, where each node sends and receives at most $h$ words. The coefficient $g$ is a value that is determined for the machine platform. Note: the communication time does not explicitly include the start-up time, $t_0$.
- Synchronization time per superstep, $l$: the time it takes to send a synchronization message to all processors. The lower bound may be set equal to the time that it takes to broadcast a null message over the network.
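A minimal sketch of the BSP cost sum above; the superstep list, $g$, and $l$ values below are made-up inputs for illustration.

```python
# BSP cost model: each superstep is described by (w_i, h_i), its computation
# cycles and the largest number of words any node sends or receives; g and l
# are machine parameters (assumed known for the target platform).

def bsp_time(supersteps, g, l):
    # Tn = sum_i (w_i + g*h_i + l), in cycles
    return sum(w + g * h + l for (w, h) in supersteps)

# Example with assumed parameters: 4 supersteps, g = 4 cycles/word, l = 50 cycles.
steps = [(1000, 10), (200, 10), (200, 10), (50, 0)]
print(bsp_time(steps, g=4, l=50))
```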


PPM - Phase Parallel Model

Improves upon the BSP by including all reasonable computation time and overhead. A parallel program executes as a sequence of phases; the next phase cannot begin until the current phase is completed. The phase model explicitly includes:
- the parallelism phase (overhead to manage parallel processes),
- the computation phase (processor computations with any statistical variances), and
- the interaction phase (communication, synchronization, and aggregation [reductions or scans]).

Computation times included:
- $T_{comp}$: computation and load imbalance
- $T_{inter}$: communication and synchronization
- $T_{par}$: parallelism time

$T_n = T_{comp} + T_{inter} + T_{par}$, also written as a summation over phases:
$T_n = \sum_{i=1}^{M} \left[ T_{comp}(i) + T_{inter}(i) + T_{par}(i) \right]$

In PPM, algorithms execute in a sequence of phases. Each phase includes all reasonable computation time and overhead.


Computation Time with Phases:

For computation times, processors and cycles need not be uniform. The $n$ processors may have an average time to execute a workload, but there may be a variance term included (e.g., processor clocks vary from machine to machine, so there is an average and a variance about the average clock rate). Where the partitioned phase workload is defined as $W_{inst}(i) = n \cdot \bar{w}_{inst}(i)$, then
$T_{comp}(i) = \left( \bar{w}_{inst}(i) + f_n(n) \right) \cdot t_f$
which allows for the processing rate/time variances between the $n$ processors.

The interaction time may be different for each phase, but the following can be used:
$T_{inter}(m, n) = t_0(n) + \frac{m}{r_\infty(n)} = t_0(n) + m \cdot t_c(n)$
where $m$ is the message length in bytes, and the start-up time and asymptotic bandwidth are functions of the machine size, $n$. The per-byte message time $t_c(n)$ is also defined based on the asymptotic bandwidth.

Then the total time, in its most complex form, may be stated as
$T_n = \sum_i T_n(i) = f\left( w_i, t_f, \sigma, m_i, t_0(n), r_\infty(n), T_{par}(i) \right)$

For a Phase Parallel Model of $n$ processors:

Computation: There is typically load imbalance, and there may be differences in clock rates that cause additional delay. As a result, we can describe computation as
$T_{comp}(i) = (w_i + \sigma) \cdot t_f$, where $W_i = n \cdot w_i$
and load imbalance and other factors are included as the variance term $\sigma$.
- If $\sigma$ is small, the system is nearly balanced.
- If $\sigma$ is large, the system is poorly balanced.
- If $\sigma$ is zero, the system is balanced.
But $\sigma$ may be a statistical or computed parameter that describes variations between the processors performing parallel operations. Example derived value:
$T_{comp}(i) = \left( w_i + 2\sigma \log_2(n) \right) \cdot t_f$
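A sketch of the phase-parallel timing above, assuming the machine parameters ($\sigma$, $t_f$, $t_0$, $r_\infty$) are known and using the example imbalance term $2\sigma\log_2(n)$; all numeric values below are made up.

```python
# Phase Parallel Model timing: Tn = sum over phases of (T_comp + T_inter + T_par).
import math

def t_comp(w_i, sigma, n, t_f):
    # T_comp(i) = (w_i + 2*sigma*log2(n)) * t_f   (example imbalance term)
    return (w_i + 2 * sigma * math.log2(n)) * t_f

def t_inter(m_i, t0, r_inf):
    # T_inter(i) = t0(n) + m_i / r_inf(n)
    return t0 + m_i / r_inf

def phase_parallel_time(phases, sigma, n, t_f, t0, r_inf):
    return sum(t_comp(w, sigma, n, t_f) + t_inter(m, t0, r_inf) + t_par
               for (w, m, t_par) in phases)

phases = [(1.0e6, 4096, 1e-4), (2.0e5, 1024, 1e-4)]   # (w_i, m_i bytes, T_par)
print(phase_parallel_time(phases, sigma=2.0, n=16, t_f=1e-8, t0=5e-5, r_inf=1e8))
```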

Example Application: Vector Multiplication or Inner Product

$s = A^H B$, for $A, B$ of size $N \times 1$. Problem complexity: $fn(N)$.

For a uniprocessor:
- Multiplies: $N$
- Additions: $N - 1$
- Computation time: $T_1 = (N + N - 1) \cdot t_f = (2N - 1) \cdot t_f \approx 2N \cdot t_f$

For a PRAM of $n$ processors:
- Multiplies: $\lceil N/n \rceil$
- Additions: $\left[ \lceil N/n \rceil - 1 \right] + \log_2(n)$
- Computation time: $T_n = \left( 2\lceil N/n \rceil - 1 + \log_2(n) \right) \cdot t_f \approx \left( \frac{2N}{n} + \log_2(n) \right) \cdot t_f$

Speedup:
$S_n = \frac{T_1}{T_n} = \frac{2N \cdot t_f}{\left( \frac{2N}{n} + \log_2(n) \right) \cdot t_f} = \frac{2N}{\frac{2N}{n} + \log_2(n)} = \frac{n}{1 + \frac{n \log_2(n)}{2N}}$

For $N \gg n$: $S_n \approx n$
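A quick numeric check (my own, not from the text) of the PRAM inner-product timing above; $t_f$ cancels in the speedup, so it is omitted.

```python
# PRAM inner-product timing and speedup from the formulas above.
import math

def T1(N):                        # uniprocessor: 2N - 1 flops
    return 2 * N - 1

def Tn(N, n):                     # PRAM: 2*ceil(N/n) - 1 local flops + log2(n) adds
    return 2 * math.ceil(N / n) - 1 + math.log2(n)

def speedup(N, n):
    return T1(N) / Tn(N, n)

for n in (4, 16, 64):
    print(n, round(speedup(N=1_000_000, n=n), 1))   # approaches n for N >> n
```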

Big-O Notation (the order of estimation)

Using rapid order-of estimates to define the size of a problem: quick, but very coarse.
Examples: $A(n) = 4n^4$, so $O(A) \sim n^4$; $B(n) = 8n^3$, so $O(B) \sim n^3$.
Then for $n = 2$: $O(A) > O(B)$, yet $A(2) = B(2)$.

The "oops, this doesn't always work" example:
- $T_A = 7n$, therefore $O(T_A) = n$
- $T_B = \frac{1}{4} n \log_2(n)$, therefore $O(T_B) = n \log_2(n)$
- $T_C = n \log_2(\log_2(n))$, therefore $O(T_C) = n \log_2(\log_2(n))$

Then $O(T_B) > O(T_C) > O(T_A)$.
But when $n = 1024 = 2^{10}$: $T_A = 7168$, $T_B = 2560$, and $T_C = 3401$; therefore $T_A > T_C > T_B$.

Simple analysis can cause significant problems! You may want to use order-of estimates only for back-of-the-envelope estimates or brainstorming guesses.
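A short script reproducing the finite-n comparison above; the rounding of $T_C$ differs slightly from the notes, which truncate to 3401.

```python
# Asymptotic order says TB grows fastest, yet at n = 1024 TB is the smallest.
import math

def TA(n): return 7 * n
def TB(n): return 0.25 * n * math.log2(n)
def TC(n): return n * math.log2(math.log2(n))

n = 1024
print(TA(n), TB(n), round(TC(n)))   # 7168, 2560.0, ~3401
```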


Vector Multiplication or Inner Product for a BSP of $n$ processors (for $n = 8$)

$s = A^H B$, for $A, B$ of size $N \times 1$

Superstep costs (each processor computes a local partial product, then the partial sums are combined pairwise over $\log_2(8) = 3$ additional supersteps):
- Superstep 1: $w = 2\lceil N/n \rceil - 1$, $h = 1$, $l = 1$
- Superstep 2: $w = 1$, $h = 1$, $l = 1$
- Superstep 3: $w = 1$, $h = 1$, $l = 1$
- Superstep 4: $w = 1$, $h = 0$, $l = 0$

Computation time: $T_n = \sum_i T_n(i) = \sum_i \left( w_i + g \cdot h_i + l \right)$
- $T_{comp} = \left[ 2\lceil N/n \rceil - 1 \right] + [1] + [1] + [1]$
- $T_{comm} = g \left( [1] + [1] + [1] + [0] \right)$
- $T_{synch} = l \left( [1] + [1] + [1] + [0] \right)$

Computation time: $T_8 = 2\lceil N/8 \rceil - 1 + \log_2(8) \cdot (1 + g + l)$

Speedup:
$S_n = \frac{T_1}{T_n} = \frac{2N}{\frac{2N}{n} + \log_2(n)(1 + g + l)} = \frac{n}{1 + \frac{n \log_2(n)(1 + g + l)}{2N}}$

The speedup for PRAM was $S_n = \frac{n}{1 + \frac{n \log_2(n)}{2N}}$.

The BSP speedup should always be smaller than the PRAM speedup: $S_{PRAM} > S_{BSP} > S_{PPM}$.
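A small comparison (my own) of the PRAM and BSP speedup expressions derived above for the inner product; the overhead parameters $g$ and $l$ are made-up values.

```python
# PRAM vs. BSP speedup estimates for the inner product.
import math

def s_pram(N, n):
    return n / (1 + n * math.log2(n) / (2 * N))

def s_bsp(N, n, g, l):
    return n / (1 + n * math.log2(n) * (1 + g + l) / (2 * N))

N, n = 100_000, 8
print(round(s_pram(N, n), 3), round(s_bsp(N, n, g=4, l=20), 3))
# The BSP estimate is always below the PRAM estimate, as noted above.
```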


Amdahl's Law (a basic PRAM model speedup)

Assume that an application's program has two types of code: (X) code that can be parallelized, and (Y) code that cannot be parallelized. The total code is $W = X + Y$.

When executed on one processor:
$T_1 = X \cdot t_f + Y \cdot t_f$

When executed on multiple processors (assuming no load imbalance):
$T_n = \frac{X}{n} \cdot t_f + Y \cdot t_f$

The speedup is then
$S_n = \frac{T_1}{T_n} = \frac{X \cdot t_f + Y \cdot t_f}{\frac{X}{n} \cdot t_f + Y \cdot t_f} = \frac{X + Y}{\frac{X}{n} + Y}$

Looking at the proportions of code (dividing through by the total code, $W$):
$S_n = \frac{T_1}{T_n} = \frac{1}{\frac{(X/W)}{n} + (Y/W)}$

For $n \to \infty$:
$S_n = \lim_{n \to \infty} \frac{1}{\frac{(X/W)}{n} + (Y/W)} = \frac{1}{(Y/W)} = \frac{W}{Y}$

Implications:
- Efficiently optimize the code that can be parallelized (X).
- The maximum speedup is bounded based on the percentage of the code that cannot be parallelized (Y/W).
- Therefore, minimize (Y), the amount of code that cannot be parallelized!
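A small sketch of Amdahl's Law as stated above, written in terms of the serial fraction $s = Y/W$ of the workload that cannot be parallelized; the example fraction is arbitrary.

```python
# Amdahl's Law: Sn = 1 / (s + (1 - s)/n); as n -> infinity, Sn -> 1/s.

def amdahl_speedup(s, n):
    return 1.0 / (s + (1.0 - s) / n)

for n in (8, 64, 1024):
    print(n, round(amdahl_speedup(s=0.05, n=n), 2))   # bounded by 1/0.05 = 20
```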


Scalable Design Principles

Principle of Independence: Attempt to make the components of a system independent of one another. Allow independent scaling in hardware, software, algorithm, etc. (Others: machine language, high-level language, application, platform, architecture, algorithm, interfaces, network, network topology.)

Principle of Balanced Design: Minimize performance bottlenecks. Eliminate the slowest component first; allocate acceptable time performance. Amdahl's Law example (first reference in the text). Amdahl's Rule: the processing speed should be balanced against the memory capacity AND the I/O speed (e.g., 1 MIPS : 1 MB : 1 Mbps).

Principle of Design for Scalability: Scalability must be considered as a main objective from the start of the design activity. Overdesign - designed for the future? Backward compatibility - for legacy or scaled-down activity.

Principle of Latency Hiding: Techniques to be described in Chapter 5. How can we hide anything that would slow the processing down?


The 50% Rule

The design is balanced if each of the overhead factors may degrade the performance by a factor of no more than 50%. To evaluate whether a design meets this criterion:
1. Select an appropriate performance factor (speed, efficiency, or utilization).
2. Derive a value for the system performance with no overhead, and define the acceptable performance as 50% of that value when the overhead is included (e.g., 2x the time, 1/2 the speed, 1/2 the utilization).
3. Form an inequality that defines the system performance (with only the overhead of interest non-zero) as greater than the 50% acceptable performance. Manipulate this equation to define an inequality based on the overhead of interest. This bounds the acceptable range for the overhead factor of interest.

Utilization is defined as
$U = \frac{P_n}{n \cdot P_{peak}} = \frac{1}{n \cdot P_{peak}} \cdot \frac{W}{T_n} = \frac{P_1 \cdot T_1}{n \cdot P_{peak} \cdot T_n} = \frac{P_1}{P_{peak}} \cdot \frac{T_1}{n \cdot T_n}$

Therefore, for a specific value of $n$, taking
$T_n = T_{comp} + T_{par} + T_{inter} = \left( w + 2\sigma \log_2(n) \right) t_f + t_0 + w \cdot t_c + t_p$
and letting $T_1 = W \cdot t_f = n \cdot w \cdot t_f$, then
$\frac{n \cdot T_n}{T_1} = \frac{T_n}{w \cdot t_f} = 1 + \frac{2\sigma \log_2(n)}{w} + \frac{t_0 + t_p}{w \cdot t_f} + \frac{t_c}{t_f}$


1.) We can get the equation for utilization as follows:
$U = \frac{P_1 / P_{peak}}{1 + \frac{2\sigma \log_2(n)}{w} + \frac{t_0 + t_p}{w \cdot t_f} + \frac{t_c}{t_f}}$

2.) Performance based on individual overheads (the others forced to zero):
- Computation variance: $U = \frac{P_1 / P_{peak}}{1 + \frac{2\sigma \log_2(n)}{w}}$
- Communication startup time: $U = \frac{P_1 / P_{peak}}{1 + \frac{t_0}{w \cdot t_f}}$
- Communication bandwidth: $U = \frac{P_1 / P_{peak}}{1 + \frac{t_c}{t_f}}$
- Parallelism: $U = \frac{P_1 / P_{peak}}{1 + \frac{t_p}{w \cdot t_f}}$

3.) Form an inequality that defines the system performance (with only the overhead of interest non-zero) as greater than the 50% acceptable performance. Manipulate this equation to define an inequality based on the overhead of interest. This bounds the acceptable range for the overhead factor of interest. For the computation-variance overhead:
$U(\sigma) = \frac{P_1 / P_{peak}}{1 + \frac{2\sigma \log_2(n)}{w}} \geq 50\% \cdot U(0) = 50\% \cdot \frac{P_1}{P_{peak}}$

$1 + \frac{2\sigma \log_2(n)}{w} \leq 2 \quad \Rightarrow \quad \frac{2\sigma \log_2(n)}{w} \leq 1 \quad \Rightarrow \quad w \geq 2\sigma \log_2(n)$

Basic Concept of Clustering

There has been a major technical movement toward cluster computing.
- Cluster Nodes: each node is a complete computer
- Single System Image (SSI): a single computer resource
- Internode Connection: commodity networks
- Enhanced Availability: the percentage of time the system is available is high
- Better Performance: higher-speed processing

Cluster benefits and difficulties: usability, scalability, availability, utilization, performance vs. cost ratio.

Overlapped Design Space of Distributed Systems, Clusters, MPPs and SMPs

(Figure: the design space plotted along axes of node complexity and single system image support; distributed computer systems, clusters, SMPs, and MPPs occupy overlapping regions.)


Comparison of Scalability and Availability for Fault-Tolerant Systems, Clusters, MPPs and SMPs

(Figure: scalable performance versus system availability, with curves for MPPs, clusters, SMPs, and fault-tolerant systems.)

Why are these curves like this?


Table 1.7, p. 33, Comparison of Clusters, MPP, SMP and Distributed Systems

System Characteristic: MPP / SMP / Cluster / Distributed System
- Number of Nodes (N): O(100-1000) / O(10) or less / O(100) or less / O(10-1000)
- Node Complexity: Fine or medium grain / Medium or coarse grain / Medium grain / Wide range
- Internode Communication: Message passing or shared variable for DSM / Shared memory / Message passing / Shared files, RPC, message passing
- Job Scheduling: Single run queue at host / Single run queue / Multiple queues but coordinated / Independent multiple queues
- SSI Support: Partially / Always / Desired / None
- Node OS Copies and Type: N (microkernel) and 1 host OS (monolithic) / One (monolithic) / N (homogeneous desired) / N (heterogeneous)
- Address Space: Multiple (single if DSM) / Single / Multiple / Multiple
- Internode Security: Unnecessary / Unnecessary / Required if exposed / Required
- Ownership: One organization / One organization / One or more organizations / Many organizations
- Network Protocol: Nonstandard / Nonstandard / Standard or nonstandard / Standard
- System Availability: Low to medium / Often low / Highly available or fault tolerant / Medium
- Performance Metric: Throughput and turnaround time / Turnaround time / Throughput and turnaround time / Response time

Many of these characteristics are covered later in the textbook.


