Prof. Kai Hwang's Home Page. ID and Acknowledgement Form: handout; fill in ID and sign.
Why does Dr. Bazuin teach this course? Computer architecture is a necessary part of real-time signal processing, particularly for wired and wireless communication systems.
To create or extract desired signals for wireless communication, advanced signal processing algorithms and mathematics must be employed. To perform the processing required, digital signal processing techniques must be developed and hosted on real-time signal processing machines. Real-time processing requires parallel computations and computing architectures. Therefore, the ability to define, develop, and program scalable parallel computing machines or computers is critical knowledge when working with advanced wireless systems!
Dr. Bazuin's Biography: Dr. Bazuin graduated magna cum laude with a B.S. in Engineering and Applied Sciences, Intensive Electrical Engineering, from Yale University. He continued his education at Stanford University, receiving his M.S. and Ph.D. in 1982 and 1989, respectively. Dr. Bazuin's graduate work was with the Center for Integrated Electronics in Medicine (CIEM), associated with the Stanford Integrated Circuits Laboratory (ICL) and Center for Integrated Systems (CIS). He defined and developed a custom implantable dimension-measurement telemetry system under the direction of his advisor, Dr. James Meindl, currently the director of the Georgia Tech Microelectronics Research Center (MiRC, http://www.mirc.gatech.edu/). Dr. Bradley J. Bazuin is a tenure-track assistant professor in WMU's Electrical and Computer Engineering Department. He entered the academic community in 2000 following over 19 years of full- and part-time industrial experience developing commercial and military communication systems. He has taught a number of undergraduate and graduate courses, has been involved in a range of research projects, and has collaborated on projects with a number of southwestern Michigan companies. He has been a co-author on multiple refereed publications, an invited panel discussion member on wireless communications, an invited luncheon speaker on radio frequency identification (RFID), author and co-author of a number of conference papers and presentations, a co-presenter of regional training seminars on RFID for Michigan DoD procurement technical assistance centers (PTACs), an invited speaker at regional and local users' group meetings, and a presenter of numerous seminars and guest lectures at Western Michigan University. Dr. Bazuin was employed part-time at ARGOSystems, Inc. in Sunnyvale, CA (now a wholly owned subsidiary of The Boeing Co.) while pursuing his graduate degrees, and full-time after completion.
Initially performing digital circuit design, he became involved in digital ASIC design (establishing an ASIC design center), DSP algorithm implementation, and system engineering for a range of direction-finding, SIGINT, and COMINT systems for the US government. He left in 1991 for Radix Technologies of Mountain View, CA, a spin-off, where he was responsible for the system engineering and development of a range of advanced spatial, spectral, and temporal signal processing detection and exploitation systems, blind-adaptive anti-jam GPS receivers, LPI communication systems, and, later, commercial wireless local loop communication systems (the initial phase of AT&T's Project Angel). Research interests: semiconductor physics and device characterization, communications, microprocessor applications, advanced digital signal processing, adaptive filters, smart antennas, roll-to-roll printed electronics, wireless SAW smart sensors, chaotic communications, software radios, RFID, and DSP for 1-D and 2-D signal processing.
Projects:
Notes and material based on K. Hwang and Z. Xu, Scalable Parallel Computing: Technology, Architecture, Programming, McGraw-Hill, New York, NY, 1998. ISBN 0-07-031798-4.
Computer generations (table fragment): 3rd, 1967-1978; 4th, 1978-1989; 5th, 1990-present.
The following computer architecture history is based on: A.S. Tanenbaum, Structured Computer Organization, 4th ed., Prentice Hall, Upper Saddle River, NJ, 1999. ISBN 0-13-095990-1. Historical Computing Machines
The Zeroth Generation: Mechanical Computers (1642-1945)
Pascal (1642): the first working mechanical calculator; addition and subtraction, built for tax collection.
Von Leibniz (ca. 1670): added multiply and divide; the four-function calculator.
Babbage (ca. 1830): (1) the Difference Engine, a fixed add/subtract algorithm for naval navigation; (2) the Analytical Engine, with four components:
Store (Memory)
Mill (Processor)
Input
Output
Instructions were read from punched cards and included computation and branching! This was the first assembly language requiring programming, and therefore gave us the first programmer: Ada Augusta Lovelace.
The Early 1900s
Zuse (Germany, 1930s): automatic calculating machines using relays; the machines were destroyed in WW II.
Atanasoff (Iowa State, 1940): calculating machine that included capacitors for storage that were refreshed (the beginning of DRAM); non-operational.
Stibbitz (Bell Labs, 1940): calculating machine using relays, demonstrated in 1940.
Aiken (Harvard, 1944): Babbage-inspired machine using relays.
The First Generation: Vacuum Tubes (1945-1955)
World War II drove the requirement for more advanced computational engines: code breaking and artillery range tables.
COLOSSUS, built by the British to break the ENIGMA cipher: the first electronic digital computer. Alan Turing.
ENIAC, built by the US but not completed before the end of WW II: 20 registers, 10-digit decimal numbers; programming with switches and jumpers; 30 tons, 140 kilowatts! John Mauchly.
After WW II, numerous projects were undertaken.
Von Neumann: the IAS machine, his first after working on ENIAC.
The EDSAC: the first stored-program computer.
The von Neumann machine/architecture:
A binary, stored-program machine with independent data and control paths: 4096 words of 40 bits, each word holding two 20-bit instructions or one 40-bit signed integer; an accumulator machine (40-bit accumulator). All of these aspects were firsts in computer architecture!
International Business Machines: the Computing-Tabulating-Recording Company, 1911 or earlier.
Q. What was the IBM 701?
A. The 701 Electronic Data Processing Machines System, introduced in 1952, was IBM's first commercially available scientific computer and the first IBM machine in which programs were stored in an internal, addressable electronic memory. Using cathode ray tube (Williams tube) memory for speed and flexibility, the 701 could process more than 2,000 multiplications and divisions a second. The arithmetic section contained the memory register, accumulator register, and multiplier-quotient register. Each register had a capacity of 35 bits and sign. The accumulator register also had two extra positions called register overflow positions. The functional machine cycle of the 701 was 12 microseconds; the time required to execute an instruction or a sequence of instructions was an integral multiple of this cycle, and 456 microseconds were required for the execution of a multiply or divide instruction. The 701 could execute 33 different operations. http://www-03.ibm.com/ibm/history/documents/pdf/faq.pdf
Q. What was the IBM 704?
A. The IBM 704 Electronic Data Processing Machine, introduced in 1954, was the first large-scale commercially available computer to employ fully automatic floating-point arithmetic commands. It was a large-scale electronic digital computer used for solving complex scientific, engineering, and business problems. Input and output could be binary, decimal, alphabetic, or special character code, such as binary coded decimal, which includes decimal, alphabetic, and special characters. A key feature of the 704 was FORTRAN (Automatic Formula Translation), an advanced program for automatically translating mathematical notation to optimum machine programs. A contemporary IBM publication listed the following features for the 704: 32,768, 8,192, or 4,096 words of high-speed magnetic core storage.
(A word consists of 36 binary digits, slightly larger than a 10-decimal-digit number.) Any word is individually addressable. Any word in magnetic core storage can be located and transferred in 12 millionths of a second. A single-address-type stored program controls all operations. The internal number system is binary. It executes most instructions at a rate of 40,000 per second. Built-in instructions provide maximum flexibility with minimum programming. A parallel machine, it operates on a full word simultaneously. Magnetic tape input-output units permit masses of data to enter and leave the internal memory of the machine at high speed. http://www-03.ibm.com/ibm/history/documents/pdf/faq.pdf
The Second Generation: Transistors (1955-1965)
The transistor was invented by Bardeen, Brattain, and Shockley at Bell Labs in 1948. (Bell Labs is now a part of Alcatel-Lucent and is almost gone.)
MIT Lincoln Labs (http://en.wikipedia.org/wiki/Lincoln_Labs): the TX-0, the first transistorized computer, a 16-bit machine.
DEC, founded in 1957 as an outgrowth of MIT's transistorized computer:
The PDP-1, the first minicomputer (1960): a 4K x 18-bit machine, 5 usec cycle, $120,000. One was given to MIT, where students created a video game.
The PDP-8, the break-out machine for minicomputers (1965): a 12-bit machine, $16,000.
DEC PDP-8
The Omnibus, a single computer bus, was a major departure from the von Neumann machine. DEC was known for minicomputers, while IBM built large mainframes for scientific computing. (DEC was bought by Compaq, which then merged with HP.) http://en.wikipedia.org/wiki/Digital_Equipment_Corporation
Other Computer Companies
CDC (Control Data Corporation): parallel computing was introduced, with up to 10 simultaneous instructions! A key contributor was Seymour Cray, later of Cray Computers and supercomputer fame. http://en.wikipedia.org/wiki/Control_Data_Corporation
Burroughs: the B5000 focused on incorporating features to more directly implement a language, ALGOL, and thereby ease the compiler's task. Software was identified as a key component of a computer hardware design! Burroughs Corporation merged with Sperry Corporation to form Unisys. http://en.wikipedia.org/wiki/Burroughs_Corporation
The Third Generation: Integrated Circuits (1965-1980)
The Shockley Semiconductor spin-off Fairchild focused on putting multiple transistors on a single substrate. Robert Noyce was involved and later became a founder of Intel.
IBM (1964): combined two older, incompatible series of computers into the IBM 360 family, with multiple models from the low end (commercial) to the high end (scientific).
Multiprogramming: multiple programs reside in memory simultaneously, allowing time sharing.
Emulation: the 360 could simulate the operations of other computers; it used microprogramming and allowed instructions to be interpreted by the control unit!
16 32-bit registers, 8-bit memory bytes, 2^24-byte address space (16 MB).
Initial IBM 360 Family
DEC (1970): the PDP-11, the first personal workstation, allowed researchers and labs to have their own minicomputers.
Microprocessors began to appear: TI, Intel, Motorola. Intel 4004 (1971), Intel 8080 (1974), Intel 8086 (1978).
The Fourth Generation: Very Large Scale Integrated Circuits (1980-1989)
IBM (1981): Personal Computers (PCs), based on the Intel 8088; a separate group development using commodity parts. Plans were published to allow expansion, and clones emerged.
Commodore, Amiga, Atari, Apple (Homebrew Computer Club): Jobs and Wozniak.
Disk operating systems were developed for PCs; Microsoft bought an operating system and began to dominate.
Parallel supercomputers continue for high-speed computations: systolic array processing, parallel processing arrays.
Macintosh: Apple used Xerox PARC concepts for a new line of computers and a superior operating system.
The Fifth Generation: ULSI and ASICs (1990-1996)
WinTel: Windows (software) and Intel (hardware) dominate the PC market. Computer languages flourish; operating systems expand.
Superscalar processors (multiple execution units in the CPU chip): pipelining, out-of-order execution.
The Sixth Generation: Superscalar/Superpipelined Machines (1997-)
Intel: P6 architectures (Pentium II, Pentium III, and Celeron).
IBM/Motorola: PowerPC architecture.
The Seventh Generation: ??? Multi-core CPUs (multiple CPUs on a single IC). Intel: dual-core and quad-core with multithreading.
Current trends and looking into the future
Moore's Law: the observation made in 1965 by Gordon Moore, co-founder of Intel, that the number of transistors per square inch on integrated circuits had doubled every year since the integrated circuit was invented. Moore predicted that this trend would continue for the foreseeable future. In subsequent years the pace slowed down a bit, but data density has doubled approximately every 18 months, and this is the current definition of Moore's Law, which Moore himself has blessed.
from http://www.webopedia.com/TERM/M/Moores_Law.html
http://en.wikipedia.org/wiki/Moore%27s_law
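As a quick arithmetic illustration of the 18-month doubling figure quoted above, a minimal Python sketch (the function name and starting count are illustrative, not from the text):

```python
def transistor_count(n0, months, doubling_period=18):
    """Projected transistor count after `months`, starting from n0,
    if density doubles every `doubling_period` months (Moore's Law)."""
    return n0 * 2 ** (months / doubling_period)

# Starting from 1,000 transistors, three years (36 months) of doubling
# every 18 months gives two doublings:
projected = transistor_count(1000, 36)   # 4000.0
```

The same formula with a 12-month period reproduces Moore's original 1965 "doubling every year" observation.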
http://en.wikipedia.org/wiki/File:Transistor_Count_and_Moore%27s_Law_-_2008.svg
[Figure: design abstraction levels: application software, high-level languages, mapping to hardware, with performance evaluation spanning the levels. From K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, 1993. ISBN 0-07-031622-8.]
[Figure: layered machine view, from the machine-independent layers (applications; programming environment; languages supported; communication model; addressing space) down to the machine-dependent hardware architecture. Same source.]
Scalable Parallel Computer Architectures
Parallel: exploit multiple simultaneous operations by the computer.
Scalable: can the architecture be scaled up or down as appropriate for an application?
Scalability implies:
Functionality and performance: improve functionality and compute time in proportion to the increase (or decrease) in resources.
Scaling in cost: cost changes must be reasonable; for an N-times performance change, can we expect a cost increase of 1, N, N*log(N), N^2, etc.?
Compatibility: existing components should still be usable with minor changes.
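The cost-growth question can be made concrete with a small sketch comparing the candidate curves (the function and its labels are illustrative, not from the text):

```python
import math

def cost_growth(n):
    """Candidate cost increases for an n-times performance increase."""
    return {
        "constant": 1,                # ideal: more performance at no added cost
        "linear": n,                  # cost grows in proportion to performance
        "n_log_n": n * math.log2(n),  # typical of richer interconnects
        "quadratic": n ** 2,          # e.g. a full crossbar between n nodes
    }

growth = cost_growth(64)
# {'constant': 1, 'linear': 64, 'n_log_n': 384.0, 'quadratic': 4096}
```

Even at 64 processors, the gap between linear and quadratic cost growth is a factor of 64, which is why interconnect choice dominates scaling-cost arguments.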
[Figure: cost/performance comparison of supercomputers and mainframes.]
[Figures: Flynn's classification block diagrams, built from control units (CU), processing units (PU), memory units, I/O, instruction streams (IS), and data streams (DS): SISD (one CU driving one PU, one DS to the memory unit); SIMD (one IS broadcast to multiple PUs, each with its own DS to a memory unit); MIMD (multiple CU/PU pairs, each with its own IS and DS, sharing memory units and I/O).]
As a computer architecture, is it parallel and scalable?
Parallel: exploit multiple simultaneous operations by the computer.
Scalable: can the architecture be scaled up or down as appropriate for an application?
Scalability dimensions: Resource (machine component), Application (problem and machine size), Technology (time, space, heterogeneity).
Resource Scalability: increasing the machine size (number of processors, amount of memory, improved software).
Size scalability: add or subtract processors. Not always a simple process, due to communications subsystems and processes (i.e., interconnections, interfaces, communication software). Not always possible, due to software (i.e., working with parallelism: how to program or compile the code).
Additional resources: memory, cache memory, disk drives, etc. Not always a simple process, due to addressing, sharing, coherency, etc.
Software scalability: compiler improvements for parallelism and efficiency; more efficient libraries (math, engineering, sorts, etc.); applications software structured for scalable processing; new operating systems (OS) to use scaled resources; a user-friendly programming environment.
Application Scalability: application programs that are scalable in machine size and problem size.
Machine size: at what rate does application performance change as the machine scales? Getting started, communicating, and coordinating all get less efficient with bigger machines; how much gets wasted?
Problem size: growth in data set sizes; change in the number of users/tasks as the machine scales. Practical limits may exist for applications.
Systems involve a combination of the machine and the application, not solely the machine and problem sizes. Don't forget memory, I/O capability, communications, etc.
Technology Scalability: adapt to changes in technology.
Generation (or time) scalability: next-generation components (processor, memory, etc.); next-generation OS. Change is inevitable, but will the system need to be replaced? Intel processors have maintained backward compatibility; Motorola PowerPC processors have not.
Space scalability: how large and distributed can the system get? Box, rack, room, building, region, international, the WWW.
Heterogeneity scalability: scale with components from different vendors, both hardware and software. Was a design standard used, or is everything customized or vendor-specific? Industry standards, open architecture, software portability (e.g., Java).
1.3 Parallel Computer Models: define idealized abstract models of a parallel computer.
A parallel machine model is an abstract parallel computer from the programmer's viewpoint. We will define: abstract machine models, typically used to estimate performance [1.3.3], and physical machine models [1.3.4].
Abstract models are used to characterize the capability of a parallel computer. We hope to implicitly capture the relative cost of parallel computations (cost in dollars and in performance).
The simplest model: The Parallel Random Access Machine (PRAM) Model
[Figure: the PRAM model: multiple processing units (PU) connected to a single shared memory.]
(1) Homogeneity: characterizes how alike the processors are.
A PRAM with 1 processor is Flynn's SISD (Single Instruction, Single Data).
A PRAM with n processors is typically Flynn's MIMD (Multiple Instruction, Multiple Data), but could be Flynn's SIMD (Single Instruction, Multiple Data): if all the processors perform the same instruction on a cycle-by-cycle basis, it is SIMD. A special case is SPMD, Single Program Multiple Data.
(2) Synchrony: characterizes how tightly synchronized the processors are. The PRAM is synchronized at the instruction (clock) level, one step at a time. For a SIMD machine, synchrony is expected; for a MIMD machine, you would expect synchrony to be optional or even undesirable.
Synchrony options, or levels:
Clocks
Instructions
Asynchronous with a synchronization operation
Supersteps (a block of instructions)
Phases of execution, loosely synchronized phases
Asynchronous
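The "supersteps" level of synchrony can be sketched with Python threads: each worker computes asynchronously within a superstep and coordinates with the others only at a barrier between supersteps (the worker logic and sizes here are illustrative, not from the text):

```python
import threading

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)
results = [0] * NUM_WORKERS
totals = [0] * NUM_WORKERS

def worker(rank):
    # Superstep 1: purely local computation, no coordination.
    results[rank] = rank * rank
    # Synchronization operation: wait until every worker ends the superstep.
    barrier.wait()
    # Superstep 2: all of `results` is now safely readable by everyone.
    totals[rank] = sum(results)

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# totals == [14, 14, 14, 14]  (0 + 1 + 4 + 9)
```

Without the barrier, a fast worker could read `results` before a slow worker had written its entry; the barrier is the only point of synchrony.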
(3) Interaction Mechanism: characterizes how processors may interact and how they perform the interaction.
Shared variables (data in a memory that is accessible): shared memory.
Messages (used to communicate): shared nothing.
Multiprocessor: an MIMD machine with shared variables.
Multicomputer: an MIMD machine with message passing.
(4) Address Space: characterizes how the memory addressing space is organized.
Single address space: provided in shared-memory models.
UMA: shared memory, single address space, uniform access time.
NUMA: shared memory, single address space, non-uniform access time (local and remote memory accesses); a Distributed Shared Memory (DSM) system.
Multiple address spaces: multicomputers.
Hybrid: a global shared memory with independent local memories (e.g., the TMS320C40).
(5) Memory Model: characterizes how the machine handles shared-memory access conflicts: the desire for consistency, and valid operations within the machine. Consistency rules must be defined.
EREW (Exclusive Read, Exclusive Write): one access at a time.
CREW (Concurrent Read, Exclusive Write): multiple read accesses, but only one write.
CRCW (Concurrent Read, Concurrent Write): a conflict exists in writing, so an access policy is defined to handle conflicts:
Common: the same value must be simultaneously written.
Arbitrary: pick one and keep going.
Minimum: allow the processor with the lowest ID/index.
Priority: combine values in a defined way (OR, AND, sum, max).
Memory consistency models have been defined and are presented in Chap. 5.
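A toy sketch of how a CRCW shared-memory cell could resolve simultaneous writes under the access policies listed above (the function and policy names are illustrative, not from the text):

```python
def resolve_concurrent_write(writes, policy):
    """Resolve simultaneous writes to one shared cell.
    `writes` is a list of (processor_id, value) pairs arriving in one cycle."""
    if policy == "common":
        values = {v for _, v in writes}
        if len(values) != 1:
            raise ValueError("common policy requires identical values")
        return values.pop()
    if policy == "arbitrary":
        return writes[0][1]               # pick one and keep going
    if policy == "minimum":
        return min(writes)[1]             # processor with the lowest ID wins
    if policy == "combine_sum":
        return sum(v for _, v in writes)  # combine values (here, by summing)
    raise ValueError(f"unknown policy: {policy}")

writes = [(2, 7), (0, 3), (1, 5)]
resolve_concurrent_write(writes, "minimum")      # 3  (processor 0 wrote 3)
resolve_concurrent_write(writes, "combine_sum")  # 15
```

The choice of policy changes which algorithms are expressible: combining-write CRCW PRAMs can, for example, sum n values in a single step, while EREW machines cannot.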
Atomic memory operations have two properties:
(1) Invisible: once started, other processes cannot see the intermediate states.
(2) Finite: the operation will finish in a finite amount of time.
A transaction is an atomic operation that meets the following properties: all or nothing (it moves cleanly from one state to the next); results are not revealed until committed; once committed, the transaction persists.
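A minimal Python sketch of why atomicity matters: the lock below makes the read-modify-write indivisible, so no thread observes or overwrites a half-finished update (names and counts are illustrative):

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:        # the increment becomes atomic:
            counter += 1  # intermediate states are invisible to other threads

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter == 40_000: every increment committed, none lost
```

Without the lock, `counter += 1` is a read, an add, and a write; two threads interleaving those steps can lose increments.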
High-Level Architecture of MIMD Machines, Based on Interaction Method:
Shared nothing: a multicomputer network.
Shared disk: a multicomputer network with global store.
Shared memory: a multiprocessor system.
[Figures: shared-nothing and shared-disk multicomputer networks. Each computer node contains a processor (P), cache (C), memory (M), and network interface circuitry (NIC), joined by an interconnection network; in the shared-nothing design each node has its own hard disk (HD), while in the shared-disk design the disks hang off the interconnection network.]
[Figure: a shared-memory multiprocessor system. Processor nodes (processor P with cache C) connect through an interconnection network to shared memory (M) and hard disks (HD); NIC is network interface circuitry.]
Semantic attributes: which ones may apply to the high-level architectures? (How the semantic attributes relate to the machine model.) Homogeneity, Synchrony, Interaction Mechanism, Address Space, Memory Model.
[Figures: interconnects used by the physical machine models: crossbar switches, buses or crossbars to shared memory (SM) modules, and custom-designed networks with NICs.]
Semantic attributes: which ones may apply to the physical machine models? (How the semantic attributes relate to the machine model.) Homogeneity, Synchrony, Interaction Mechanism, Address Space, Memory Model.
PVP (Parallel Vector Processor): [figure: vector processors (VP) connected by a crossbar switch to shared memory (SM) modules]
Homogeneity: yes, custom processors.
Synchrony: asynchronous or loosely synchronous.
Interaction mechanism: shared variables.
Address space: single.
Memory access: uniform (UMA).
Memory model: sequentially consistent.
SMP (Symmetric Multiprocessor): [figure: processor/cache (P/C) nodes connected by a bus or crossbar to shared memory (SM) modules]
Homogeneity: yes, commodity processors.
Synchrony: asynchronous or loosely synchronous.
Interaction mechanism: shared variables.
Address space: single.
Memory access: uniform (UMA).
Memory model: sequentially consistent.
MPP (Massively Parallel Processor): [figure: nodes, each with a processor/cache (P/C), local memory (LM), and NIC on a memory bus (MB), joined by a custom-designed network]
Homogeneity: yes.
Synchrony: asynchronous or loosely synchronous.
Interaction mechanism: message passing.
Address space: multiple.
Memory access: NORMA.
Memory model: data flow.
DSM (Distributed Shared Memory): [figure: nodes, each with a processor/cache (P/C), local memory (LM), directory (DIR), and NIC, joined by a custom-designed network]
Homogeneity: yes.
Synchrony: asynchronous or loosely synchronous.
Interaction mechanism: shared variables.
Address space: single.
Memory access: NUMA.
Memory model: weak ordering (Ch. 5), supported by the directory.
COW (Cluster of Workstations): [figure: complete workstation nodes, each with a processor/cache (P/C), memory (M), and bridge on a memory bus (MB), plus a NIC, joined by a commodity network]
[Figure: cluster software architecture: the programming environment and applications run on a single-system-image infrastructure layered over the individual node operating systems.]
Comparison of the physical machine models (attributes listed in order: homogeneity; synchrony; interaction mechanism; address space; access cost; memory model; example machines):
PVP/SMP: MIMD; asynchronous or loosely synchronous; shared variable; single; UMA; sequential consistency; IBM R50, Cray T-90.
DSM: MIMD; asynchronous or loosely synchronous; shared variable; single; NUMA; weak ordering is widely used; Stanford DASH, SGI Origin 2000.
MPP/COW: MIMD; asynchronous or loosely synchronous; message passing; multiple; NORMA; N/A; Cray T3E, Berkeley NOW.
Performance notation (units in parentheses):
n: machine size, the number of processors (dimensionless).
f: clock rate (MHz).
W: workload (Mflop).
T1: sequential execution time (sec).
Tn: parallel execution time on n processors (sec).
Pn = W / Tn: speed of an n-processor system (Mflop/sec).
Ppeak: peak speed of a single processor (Mflop/sec).
Sn = T1 / Tn: speedup (dimensionless).
En = Sn / n: efficiency (dimensionless).
Un = Pn / (n * Ppeak): utilization (dimensionless).
t0: communication start-up latency (usec).
r_inf: asymptotic communication bandwidth (Mbytes/sec).
Derived performance: the nominal time for communicating an m-byte message is
Tmessage = t0 + m / r_inf,
i.e., the sum of the communication latency (t0) and the transfer time of the m-byte message at the asymptotic bandwidth.
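Collecting the definitions above into code, a small calculator sketch (parameter names follow the notation; the sample numbers are made up):

```python
def parallel_metrics(t1, tn, n, workload, p_peak):
    """Derived measures: speed Pn = W/Tn, speedup Sn = T1/Tn,
    efficiency En = Sn/n, utilization Un = Pn/(n * Ppeak)."""
    pn = workload / tn
    sn = t1 / tn
    return {"Pn": pn, "Sn": sn, "En": sn / n, "Un": pn / (n * p_peak)}

def message_time(t0, m, r_inf):
    """Nominal communication time: latency t0 plus the transfer time of an
    m-byte message at asymptotic bandwidth r_inf (consistent units assumed)."""
    return t0 + m / r_inf

metrics = parallel_metrics(t1=100.0, tn=8.0, n=16, workload=400.0, p_peak=10.0)
# Pn = 50.0, Sn = 12.5, En = 0.78125, Un = 0.3125
t_msg = message_time(t0=10.0, m=1000, r_inf=100.0)   # 20.0
```

Note that efficiency below 1 here comes from the speedup (12.5) falling short of the machine size (16): the 16 processors are not all kept busy.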
Managing concurrency takes on a central role in developing parallel applications. The basic steps in designing parallel applications are:
Partitioning The partitioning stage of a design is intended to expose opportunities for parallel execution. Hence, the focus is on defining a large number of small tasks in order to yield what is termed a fine-grained decomposition of a problem. Communication The tasks generated by a partition are intended to execute concurrently but cannot, in general, execute independently. The computation to be performed in one task will typically require data associated with another task. Data must then be transferred between tasks so as to allow computation to proceed. This information flow is specified in the communication phase of a design. Agglomeration (to form or collect into a rounded mass) In the third stage, we move from the abstract toward the concrete. We revisit decisions made in the partitioning and communication phases with a view to obtaining an algorithm that will execute efficiently on some class of parallel computer. In particular, we consider whether it is useful to combine, or agglomerate, tasks identified by the partitioning phase, so as to provide a smaller number of tasks, each of greater size. We also determine whether it is worthwhile to replicate data and/or computation. Mapping In the fourth and final stage of the parallel algorithm design process, we specify where each task is to execute. This mapping problem does not arise on uniprocessors or on shared-memory computers that provide automatic task scheduling.
From http://en.wikipedia.org/wiki/Multi-core_(computing)
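The four steps can be sketched for a toy problem, summing a large list, using Python threads (the chunk count and sizes are arbitrary illustrations, not from the text):

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000))

# Partitioning: the finest-grained decomposition is one task per element.
# Communication: every task's result must feed a global reduction.
# Agglomeration: one task per element is far too fine; combine the work
# into a small number of larger chunks.
NUM_WORKERS = 4
size = len(data) // NUM_WORKERS
chunks = [data[i * size:(i + 1) * size] for i in range(NUM_WORKERS)]

# Mapping: assign one chunk per worker; only the partial sums travel back
# for the final reduction.
with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    partial_sums = list(pool.map(sum, chunks))

total = sum(partial_sums)   # 499500 == sum(range(1000))
```

Agglomeration is the step that controls the computation-to-communication ratio: four chunks mean only four partial results cross between workers, instead of a thousand.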
Both explicit and implicit operations must be considered for parallelism and interaction.
Explicit: added instruction calls, additions to sequential operation.
Implicit: not in the instructions but performed anyway.
Types of overhead:
Parallelism overhead: caused by process management
Communication overhead: caused by processors exchanging information
Synchronization overhead: caused when executing synchronization operations
Load-imbalance overhead: caused when processors are idle while others continue
The total parallel execution time decomposes as

Tn = Tcomp + Tinter + Tpar

where Tcomp is the computation time, Tinter is the interaction time (including communication and synchronization), and Tpar is the parallelism overhead. T1 denotes the corresponding sequential (uniprocessor) execution time.
SISD machine: traditional sequential code execution (T1). MIMD machine: parallel processing with overhead (Tn).
Parallel Random Access Model (PRAM)
Bulk Synchronous Parallel (BSP) Model
Phase Parallel Model
Note: the time of algorithms executed on any of the physical machine models can be estimated using each of these abstract models.
PRAM
PRAM is a first-order parallel processing algorithm model. It is simple, clean, and widely used.
[Figure: n processing units (PUs) connected to a common shared memory.]
There are numerous unrealistic assumptions:
zero communication overhead (shared variables)
zero synchronization overhead
assumed instruction-level synchrony
zero parallelism overhead (ignored)

The model accounts only for computation time (including load imbalance); interaction time and parallelism time are taken to be zero.
Tn = Tcomp + (0 = Tinter) + (0 = Tpar) = Tcomp
BSP machine properties:
MIMD
Variable grain
Loosely synchronous
Superstep = computation, communication, barrier synchronization
Non-zero overhead
Message passing or shared variable

[Figure: n processor nodes (P/node) connected by a communication network.]
BSP accounts for computation (and load imbalance) and for communication with simple synchronization, but assumes zero parallelism time. Therefore:

Tn = Tcomp + Tinter + (0 = Tpar) = Tcomp + Tinter

Summed over the M supersteps of an algorithm:

Tn = sum over i = 1..M of ( Tcomp(i) + Tinter(i) )

In BSP, algorithms execute in a sequence of supersteps. A superstep consists of computation operations of at most w cycles, communication of at most g·h cycles (where h is in words), and barrier synchronization of l cycles. The time of superstep i is therefore

Tn(step i) = wi + g·hi + l
Computation time per superstep, wi: the maximum computation time of any processor in the step.

Communication time per superstep, g·hi: defined based on an h-relation, where each node sends and receives at most h words. The coefficient g is a value determined for the machine platform. Note: the communication time does not explicitly include the start-up time, t0.

Synchronization time per superstep, l: the time it takes to send a synchronization message to all processors. The lower bound may be set equal to the time it takes to broadcast a null message over the network.
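The superstep cost model can be sketched directly in code. The parameter values below (g, l, N, n) are illustrative assumptions, not values from the text; the example mirrors the inner-product pattern used later (a local compute step followed by log2(n) combining steps).

```python
from math import ceil, log2

def superstep_time(w, h, g, l):
    """BSP superstep cost: T(step) = w + g*h + l (in cycles)."""
    return w + g * h + l

def bsp_total_time(supersteps, g, l):
    """Sum the cost over a list of (w, h) supersteps."""
    return sum(superstep_time(w, h, g, l) for w, h in supersteps)

# Illustrative example: one local step of 2*ceil(N/n) - 1 cycles,
# then log2(n) combining supersteps of one addition each.
N, n = 1024, 8
steps = [(2 * ceil(N / n) - 1, 1)] + [(1, 1)] * int(log2(n))
total = bsp_total_time(steps, g=4, l=20)
```

Note that g and h appear only as the product g·h, which is why the h-relation abstraction works: the model needs only the maximum words sent or received, not the communication pattern.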
For the phase parallel model, all three terms are non-zero:

Tn = Tcomp + Tinter + Tpar

In the phase parallel model, algorithms execute in a sequence of phases. Each phase includes all relevant computation time and overhead.
The computation time per phase allows for the processing rate/time variances between the n processors.

The interaction times may differ from phase to phase, but the following form can be used:

Tinteraction(m, n) = t0(n) + m/r∞(n) = t0(n) + m·tc(n)

where m is the message length in bytes, and the start-up time t0 and asymptotic bandwidth r∞ are functions of the machine size, n. The per-byte message time tc(n) = 1/r∞(n) is also defined from the asymptotic bandwidth.
Then, the total time in its most complex form may be stated as

Tn = sum over phases i of Tn(i) = sum over i of f( wi, tf, mi, t0(n), r∞(n), Tpar(i) )
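The interaction-time formula has two equivalent forms, one in terms of the asymptotic bandwidth r∞ and one in terms of the per-byte time tc. A minimal sketch (function names and the parameter values in the test are assumptions for illustration):

```python
def interaction_time(m, t0, r_inf):
    """Phase interaction time: T(m) = t0 + m / r_inf,
    with m in bytes and r_inf the asymptotic bandwidth (bytes/sec)."""
    return t0 + m / r_inf

def interaction_time_per_byte(m, t0, t_c):
    """Equivalent per-byte form: t_c = 1 / r_inf, so T(m) = t0 + m * t_c."""
    return t0 + m * t_c
```

For short messages the start-up term t0 dominates; for long messages the m/r∞ term dominates, which is why both parameters are needed to characterize a network.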
Operation count f(N), for a uniprocessor:

Multiplies: N
Additions: N − 1
Computation Time:

T1 = ( N + N − 1 )·tf = ( 2N − 1 )·tf ≈ 2N·tf
For n processors:

Tn = ( 2·ceil(N/n) − 1 + log2(n) )·tf ≈ ( 2N/n + log2(n) )·tf

Speedup: Sn = T1 / Tn
Sn = 2N·tf / ( ( 2N/n + log2(n) )·tf ) = 2N / ( 2N/n + log2(n) )

Sn = n / ( 1 + n·log2(n)/(2N) )

For N >> n: Sn ≈ n
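The uniprocessor time, the PRAM parallel time, and the resulting speedup can be checked numerically. A sketch assuming the expressions above (the problem sizes in the usage line are illustrative):

```python
from math import ceil, log2

T_F = 1.0  # per-operation time tf; it cancels in the speedup ratio

def t1(N):
    """Uniprocessor time: (2N - 1) * tf."""
    return (2 * N - 1) * T_F

def tn_pram(N, n):
    """PRAM parallel time: (2*ceil(N/n) - 1 + log2(n)) * tf."""
    return (2 * ceil(N / n) - 1 + log2(n)) * T_F

def speedup_pram(N, n):
    return t1(N) / tn_pram(N, n)
```

For N >> n the speedup approaches n, e.g. speedup_pram(10**6, 8) is within a fraction of a percent of 8.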
Using rapid order-of-magnitude estimates to define the size of a problem. Quick but very coarse. Examples:

A(n) = 4·n^4: O(A) ~ n^4
B(n) = 8·n^3: O(B) ~ n^3

Then for n = 2, A(2) = B(2) = 64; the lower-order function only wins for larger n.

As another example, compare

TA = 7·n
TB = (1/4)·n·log2(n)

Although TA has the lower order, TB < TA whenever log2(n) < 28, i.e., for every n below 2^28.

Simple analysis can cause significant problems! You may want to use order-of estimates only for back-of-the-envelope estimates or brainstorming guesses.
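The pitfall can be verified numerically for the TA/TB pair above. A sketch; the crossover point follows from setting 7·n = n·log2(n)/4, i.e. log2(n) = 28:

```python
from math import log2

def t_a(n):
    return 7 * n               # O(n): the "better" order

def t_b(n):
    return n * log2(n) / 4     # O(n log n): the "worse" order

# Despite its higher order, t_b is faster whenever log2(n) < 28,
# i.e. for every problem size below 2**28.
crossover = 2 ** 28
```

So for any realistic problem size up to about 268 million, the algorithm with the "worse" asymptotic order is actually faster.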
Example, for n = 8: the inner product s = A^H·B, for A, B N×1 vectors.
Superstep costs (n = 8):

Superstep    Tcomp                    Tcomm    Tsynch
1            w = 2·ceil(N/n) − 1      h = 1    l = 1
2            w = 1                    h = 1    l = 1
3            w = 1                    h = 1    l = 1
4            w = 1                    h = 0    l = 0

Computation Time:

Tn = sum over i of Tn(i) = sum over i of ( wi + g·hi + l )

Tcomp = 2·ceil(N/n) − 1 + log2(n)

T8 = ( 2·ceil(N/8) − 1 + log2(8)·(1 + g + l) )·tf
Speedup: Sn = T1 / Tn

Sn = 2N / ( 2N/n + log2(n)·(1 + g + l) ) = n / ( 1 + n·(1 + g + l)·log2(n)/(2N) )

For comparison, the PRAM speedup was Sn = n / ( 1 + n·log2(n)/(2N) ).
The BSP speedup should always be smaller than the PRAM speedup (and the phase parallel speedup smaller still): S_PRAM > S_BSP > S_PP.
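The ordering S_PRAM > S_BSP can be checked numerically from the two speedup expressions. A sketch; the g and l values in the test are illustrative assumptions:

```python
from math import log2

def s_pram(N, n):
    """PRAM speedup (zero overhead): n / (1 + n*log2(n)/(2N))."""
    return n / (1 + n * log2(n) / (2 * N))

def s_bsp(N, n, g, l):
    """BSP speedup: n / (1 + n*(1 + g + l)*log2(n)/(2N))."""
    return n / (1 + n * (1 + g + l) * log2(n) / (2 * N))
```

With any g, l > 0 the BSP denominator is strictly larger, so the BSP speedup is strictly below the PRAM speedup; with g = l = 0 the two expressions coincide.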
Assume that an application's program has two types of code: (X) code that can be parallelized and (Y) code that cannot be parallelized. For the total code, W = X + Y. When executed on one processor:
T1 = X·tf + Y·tf = W·tf

On n processors, only X is sped up:

Tn = ( X/n + Y )·tf

For n → ∞:

Sn = T1/Tn = W / ( X/n + Y ) = 1 / ( (X/W)/n + Y/W ) → 1 / (Y/W) = W/Y
Implications: Efficiently optimize the code that can be parallelized (X) The maximum speedup is bounded based on the percent of the code that cannot be parallelized (Y/W) Therefore, minimize (Y) the amount of code that cannot be parallelized!
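In terms of the serial fraction Y/W, the limit above gives the familiar Amdahl speedup formula. A minimal sketch (the function name and the fractions in the test are invented for illustration):

```python
def amdahl_speedup(n, serial_fraction):
    """Speedup on n processors when serial_fraction = Y/W of the code
    cannot be parallelized: Sn = 1 / (Y/W + (1 - Y/W)/n)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)
```

Even a modest serial fraction caps the speedup hard: with Y/W = 0.1, no number of processors can ever exceed a speedup of 10, which is exactly the W/Y bound derived above.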
Principle of Independence
Attempt to make the components of a system independent of one another. Allow independent scaling in hardware, software, algorithm, etc. (others: machine language, high-level language, application, platform, architecture, algorithm, interfaces, network, network topology).

Principle of Balanced Design
Minimize performance bottlenecks. Eliminate the slowest component first; allocate acceptable time performance. Amdahl's Law example (1st reference in text). Amdahl's Rule: the processing speed should be balanced against the memory capacity AND the I/O speed (e.g., 1 MIPS : 1 MB : 1 Mbps).

Principle of Design for Scalability
Scalability must be considered as a main objective from the start of the design activity. Overdesign: designed for the future? Backward compatibility: for legacy or scaled-down activity.

Principle of Latency Hiding
Techniques to be described in Chapter 5. How can we hide anything that would slow the processing down?
The 50% Rule: the design is balanced if each of the overhead factors may degrade the performance by a factor of no more than 50%. To evaluate whether a design meets this criterion:
1. Select an appropriate performance factor (Speed, Efficiency, or Utilization).
2. Derive a value for the system performance with no overhead and define the acceptable performance as 50% of that value when the overhead is included (e.g., 2× the time, ½× the speed, ½× the utilization).
3. Form an inequality that defines a system performance (with only the overhead of interest non-zero) that is greater than the 50% acceptable performance. Manipulate this equation to define an inequality based on the overhead of interest. This now bounds the acceptable range for the overhead factor of interest.

Utilization is defined as

U = Pn / ( n·Ppeak ) = ( 1/(n·Ppeak) )·( W/Tn ) = ( P1/Ppeak )·( T1/(n·Tn) )

using W = P1·T1.
Letting

T1 = W·tf = n·w·tf

and

Tn = ( w + 2·log(n) )·tf + t0 + w·tc + tp

the overhead ratio becomes

n·Tn / T1 = Tn / ( w·tf ) = 1 + 2·log(n)/w + ( t0 + tp )/( w·tf ) + tc/tf
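The overhead ratio n·Tn/T1 can be evaluated directly. A sketch assuming the expression above, with the logarithm taken as base 2 (the text leaves the base implicit), and with illustrative parameter values in the test:

```python
from math import log2

def overhead_ratio(n, w, t_f, t_0, t_c, t_p):
    """n*Tn/T1 = 1 + 2*log2(n)/w + (t_0 + t_p)/(w*t_f) + t_c/t_f,
    following the Tn and T1 expressions above (log base 2 assumed)."""
    return 1 + 2 * log2(n) / w + (t_0 + t_p) / (w * t_f) + t_c / t_f

def utilization(p1_over_ppeak, ratio):
    """U = (P1/Ppeak) * T1/(n*Tn) = (P1/Ppeak) / ratio."""
    return p1_over_ppeak / ratio
```

With all overheads zero and n = 1 the ratio is exactly 1; each overhead source then adds its own term, which is what makes the per-source 50% analysis below possible.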
1.) Performance factor: Utilization.
2.) With only one overhead source non-zero, the utilization for each source becomes:

Computation variance:

U = ( P1/Ppeak ) / ( 1 + 2·log(n)/w )

Communication bandwidth:

U = ( P1/Ppeak ) / ( 1 + t0/(w·tf) + tc/tf )

Parallelism:

U = ( P1/Ppeak ) / ( 1 + tp/(w·tf) )
3.) Form an inequality that defines a system performance (with only the overhead of interest non-zero) that is greater than the 50% acceptable performance. Manipulate this equation to define an inequality based on the overhead of interest. This now bounds the acceptable range for the overhead factor of interest.
For the computation variance overhead:

U = ( P1/Ppeak ) / ( 1 + 2·log(n)/w ) ≥ 50% · ( P1/Ppeak )

1 + 2·log(n)/w ≤ 2   ⇒   2·log(n)/w ≤ 1   ⇒   w ≥ 2·log(n)
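The w ≥ 2·log(n) bound can be checked in code. A sketch using base-2 logarithms (the function names are invented; only the synchronization overhead term is retained, per step 3):

```python
from math import log2

def sync_utilization_factor(n, w):
    """Utilization relative to P1/Ppeak with only the synchronization
    overhead non-zero: 1 / (1 + 2*log2(n)/w)."""
    return 1.0 / (1.0 + 2.0 * log2(n) / w)

def meets_50_percent_rule(n, w):
    """The 50% rule holds when the factor stays >= 0.5, i.e. w >= 2*log2(n)."""
    return sync_utilization_factor(n, w) >= 0.5
```

For example, with n = 16 processors the grain size must satisfy w ≥ 8; at exactly w = 8 the utilization factor sits right at the 50% boundary.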
Cluster benefits and difficulties:
Usability
Scalability
Availability
Utilization
Performance vs. Cost Ratio
Comparison of Scalability and Availability for: Fault Tolerant Systems, Clusters, MPPs, and SMPs. [Figure: chart positioning these systems by scalable performance and availability.]
Table 1.7, p. 33, Comparison of Clusters, MPP, SMP and Distributed Systems
| System Characteristic | MPP | SMP | Cluster | Distributed System |
| Number of Nodes (N) | O(100-1000) | — | — | O(10-1000) |
| Node Complexity | Fine or medium grain | — | — | Wide range |
| Internode Communication | Message Passing or Shared Variable for DSM | Shared Memory | Message Passing | Shared Files, RPC, Message Passing |
| Job Scheduling | Single run queue at host | — | Multiple queues but coordinated | Independent multiple queues |
| SSI Support | Partially | — | Desired | None |
| Node OS copies and type | N (microkernel) and 1 host OS (monolithic) | — | N (homogeneous desired) | N (heterogeneous) |
| Address Space | Multiple (single if DSM) | — | Multiple | Multiple |
| Internode Security | Unnecessary | — | Required if exposed | Required |
| Ownership | One Organization | — | One or more Organizations | Many Organizations |
| Network Protocol | Nonstandard | — | Standard or Nonstandard | Standard |
| System Availability | Low to Medium | — | Highly Available or Fault Tolerant | Medium |
| Performance Metric | Throughput and turnaround time | — | Throughput and turnaround time | Response Time |