Prof. Kai Hwang's Home Page. ID and Acknowledgement Form: handout; fill in ID and sign.
Why does Dr. Bazuin teach this course? Computer architecture is a necessary part of real-time signal processing, particularly for wired and wireless communication systems.
To create or extract desired signals for wireless communication, advanced signal processing algorithms and mathematics must be employed. To perform the processing required, digital signal processing techniques must be developed and hosted on real-time signal processing machines. Real-time processing requires parallel computations and computing architectures. Therefore, the ability to define, develop, and program scalable parallel computing machines or computers is critical knowledge when working with advanced wireless systems!
Dr. Bazuin's Biography: Dr. Bazuin graduated magna cum laude with a B.S. in Engineering and Applied Sciences, Intensive Electrical Engineering, from Yale University. He continued his education at Stanford University, receiving his M.S. and Ph.D. in 1982 and 1989, respectively. Dr. Bazuin's graduate work was with the Center for Integrated Electronics in Medicine (CIEM), associated with the Stanford Integrated Circuits Laboratory (ICL) and Center for Integrated Systems (CIS). He defined and developed a custom implantable dimension-measurement telemetry system under the direction of his advisor, Dr. James Meindl, currently the director of the Georgia Tech Microelectronics Research Center (MiRC, http://www.mirc.gatech.edu/). Dr. Bradley J. Bazuin is a tenure-track assistant professor in WMU's Electrical and Computer Engineering Department. He entered the academic community in 2000 following over 19 years of full- and part-time industrial experience developing commercial and military communication systems. He has taught a number of undergraduate and graduate courses, has been involved in a range of research projects, and has collaborated on projects with a number of southwestern Michigan companies. He has been a co-author on multiple refereed publications, an invited panel discussion member on wireless communications, an invited luncheon speaker on radio frequency identification (RFID), author and co-author of a number of conference papers and presentations, a co-presenter of regional training seminars on RFID for Michigan DoD procurement technical assistance centers (PTACs), an invited speaker at regional and local users' group meetings, and a presenter of numerous seminars and guest lectures at Western Michigan University. Dr. Bazuin was employed part-time at ARGOSystems, Inc. in Sunnyvale, CA (now a wholly owned subsidiary of The Boeing Co.) while pursuing his graduate degrees, and full-time after completion.
Initially performing digital circuit design, he became involved in digital ASIC design (establishing an ASIC design center), DSP algorithm implementation, and system engineering for a range of direction-finding, SIGINT, and COMINT systems for the US government. He left in 1991 for Radix Technologies of Mountain View, CA, a spin-off, where he was responsible for the system engineering and development of a range of advanced spatial, spectral, and temporal signal processing detection and exploitation systems, blind-adaptive anti-jam GPS receivers, LPI communication systems, and, later, commercial wireless local loop communication systems (the initial phase of AT&T's Project Angel). Research interests: semiconductor physics and device characterization, communications, microprocessor applications, advanced digital signal processing, adaptive filters, smart antennas, roll-to-roll printed electronics, wireless SAW smart sensors, chaotic communications, software radios, RFID, and DSP for 1-D and 2-D signal processing.
Projects:
Notes and material based on K. Hwang and Z. Xu, Scalable Parallel Computing: Technology, Architecture, Programming, McGraw-Hill, New York, NY, 1998. ISBN 0-07-031798-4.
Computer generations (table fragment): 3rd, 1967-1978; 4th, 1978-1989; 5th, 1990-present.
The following computer architecture history is based on: A.S. Tanenbaum, Structured Computer Organization, 4th ed., Prentice Hall, Upper Saddle River, NJ, 1999. ISBN 0-13-095990-1. Historical Computing Machines
The Zeroth Generation: Mechanical Computers (1642-1945)
Pascal (1642): the first working mechanical calculator; addition and subtraction, built for tax collection.
Von Leibniz (ca. 1670): added multiply and divide; the four-function calculator.
Babbage (ca. 1830): (1) the Difference Engine, a fixed add/subtract algorithm for naval navigation; (2) the Analytical Engine, with four components:
Store (Memory)
Mill (Processor)
Input
Output
Instructions were read from punched cards and included computation and branching! This was the first assembly language requiring programming, and therefore gave us the first programmer: Ada Augusta Lovelace.
The Early 1900s
Zuse (Germany, 1930s): automatic calculating machines using relays; the machines were destroyed in WW II.
Atanasoff (Iowa State, 1940): calculating machine that included capacitors for storage that were refreshed (the beginning of DRAM); non-operational.
Stibbitz (Bell Labs, 1940): calculating machine using relays, demonstrated in 1940.
Aiken (Harvard, 1944): Babbage-inspired machine using relays.
The First Generation: Vacuum Tubes (1945-1955)
World War II drove the requirement for more advanced computational engines: code breaking and artillery range tables.
COLOSSUS, built by the British to break the ENIGMA cipher: the first electronic digital computer. Alan Turing.
ENIAC, built by the US but not completed before the end of WW II: 20 registers, 10-digit decimal numbers; programming with switches and jumpers; 30 tons, 140 kilowatts! John Mauchly.
After WW II, numerous projects were undertaken.
Von Neumann: the IAS machine, his first after working on ENIAC.
The EDSAC: the first stored-program computer.
The von Neumann machine/architecture:
A binary, stored-program machine with independent data and control paths: 4096 words of 40 bits, each word holding two 20-bit instructions or one 40-bit signed integer; an accumulator machine (40-bit accumulator). All of these aspects were firsts in computer architecture!
International Business Machines: the Computing-Tabulating-Recording Company, 1911 or earlier.
Q. What was the IBM 701?
A. The 701 Electronic Data Processing Machines System, introduced in 1952, was IBM's first commercially available scientific computer and the first IBM machine in which programs were stored in an internal, addressable electronic memory. Using cathode ray tube (Williams tube) memory for speed and flexibility, the 701 could process more than 2,000 multiplications and divisions a second. The arithmetic section contained the memory register, accumulator register, and multiplier-quotient register. Each register had a capacity of 35 bits and sign. The accumulator register also had two extra positions called register overflow positions. The functional machine cycle of the 701 was 12 microseconds; the time required to execute an instruction or a sequence of instructions was an integral multiple of this cycle, and 456 microseconds were required for the execution of a multiply or divide instruction. The 701 could execute 33 different operations. http://www-03.ibm.com/ibm/history/documents/pdf/faq.pdf
Q. What was the IBM 704?
A. The IBM 704 Electronic Data Processing Machine, introduced in 1954, was the first large-scale commercially available computer to employ fully automatic floating-point arithmetic commands. It was a large-scale electronic digital computer used for solving complex scientific, engineering, and business problems. Input and output could be binary, decimal, alphabetic, or special character code, such as binary coded decimal, which includes decimal, alphabetic, and special characters. A key feature of the 704 was FORTRAN (Automatic Formula Translation), an advanced program for automatically translating mathematical notation to optimum machine programs. A contemporary IBM publication listed the following features for the 704: 32,768, 8,192, or 4,096 words of high-speed magnetic core storage.
(A word consists of 36 binary digits, slightly larger than a 10-decimal-digit number.) Any word is individually addressable. Any word in magnetic core storage can be located and transferred in 12 millionths of a second. A single-address-type stored program controls all operations. The internal number system is binary. It executes most instructions at a rate of 40,000 per second. Built-in instructions provide maximum flexibility with minimum programming. A parallel machine, it operates on a full word simultaneously. Magnetic tape input-output units permit masses of data to enter and leave the internal memory of the machine at high speed. http://www-03.ibm.com/ibm/history/documents/pdf/faq.pdf
The Second Generation: Transistors (1955-1965)
The transistor was invented by Bardeen, Brattain, and Shockley at Bell Labs in 1948. (Bell Labs is now a part of Alcatel-Lucent and is almost gone.)
MIT Lincoln Labs (http://en.wikipedia.org/wiki/Lincoln_Labs): the TX-0, the first transistorized computer, a 16-bit machine.
DEC, founded in 1957 as an outgrowth of MIT's transistorized computer:
The PDP-1, the first minicomputer (1960): a 4K x 18-bit machine, 5 usec cycle, $120,000. One was given to MIT, where students created a video game.
The PDP-8, the break-out machine for minicomputers (1965): a 12-bit machine, $16,000.
DEC PDP-8
The Omnibus, a single computer bus, was a major departure from the von Neumann machine. DEC was known for minicomputers, while IBM built large mainframes for scientific computing. (DEC was bought by Compaq, which then merged with HP.) http://en.wikipedia.org/wiki/Digital_Equipment_Corporation
Other Computer Companies
CDC (Control Data Corporation): parallel computing was introduced, with up to 10 simultaneous instructions! A key contributor was Seymour Cray, later of Cray Computers and supercomputer fame. http://en.wikipedia.org/wiki/Control_Data_Corporation
Burroughs: the B5000 focused on incorporating features to more directly implement a language, ALGOL, and thereby ease the compiler's task. Software was identified as a key component of a computer hardware design! Burroughs Corporation merged with Sperry Corporation to form Unisys. http://en.wikipedia.org/wiki/Burroughs_Corporation
The Third Generation: Integrated Circuits (1965-1980)
The Shockley Semiconductor spin-off Fairchild focused on putting multiple transistors on a single substrate. Robert Noyce was involved and later became a founder of Intel.
IBM (1964): combined two older, incompatible series of computers into the IBM 360 family, with multiple models from the low end (commercial) to the high end (scientific).
Multiprogramming: multiple programs reside in memory simultaneously, allowing time sharing.
Emulation: the 360 could simulate the operations of other computers; it used microprogramming and allowed instructions to be interpreted by the control unit!
16 32-bit registers, 8-bit memory bytes, 2^24-byte address space (16 MB).
Initial IBM 360 Family
DEC (1970): the PDP-11, the first personal workstation, allowed researchers and labs to have their own minicomputers.
Microprocessors began to appear: TI, Intel, Motorola. Intel 4004 (1971), Intel 8080 (1974), Intel 8086 (1978).
The Fourth Generation: Very Large Scale Integrated Circuits (1980-1989)
IBM (1981): Personal Computers (PCs), based on the Intel 8088; a separate group development using commodity parts. Plans were published to allow expansion, and clones emerged.
Commodore, Amiga, Atari, Apple (Homebrew Computer Club): Jobs and Wozniak.
Disk operating systems were developed for PCs; Microsoft bought an operating system and began to dominate.
Parallel supercomputers continue for high-speed computations: systolic array processing, parallel processing arrays.
Macintosh: Apple used Xerox PARC concepts for a new line of computers and a superior operating system.
The Fifth Generation: ULSI and ASICs (1990-1996)
WinTel: Windows (software) and Intel (hardware) dominate the PC market. Computer languages flourish; operating systems expand.
Superscalar processors (multiple execution units in the CPU chip): pipelining, out-of-order execution.
The Sixth Generation: Superscalar/Superpipelined Machines (1997-)
Intel: P6 architectures (Pentium II, Pentium III, and Celeron).
IBM/Motorola: PowerPC architecture.
The Seventh Generation: ??? Multi-core CPUs (multiple CPUs on a single IC). Intel: dual-core and quad-core with multithreading.
Current trends and looking into the future
Moore's Law: the observation made in 1965 by Gordon Moore, co-founder of Intel, that the number of transistors per square inch on integrated circuits had doubled every year since the integrated circuit was invented. Moore predicted that this trend would continue for the foreseeable future. In subsequent years the pace slowed down a bit, but data density has doubled approximately every 18 months, and this is the current definition of Moore's Law, which Moore himself has blessed.
from http://www.webopedia.com/TERM/M/Moores_Law.html
http://en.wikipedia.org/wiki/Moore%27s_law
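As a quick arithmetic illustration of the 18-month doubling figure quoted above, a minimal Python sketch (the function name and starting count are illustrative, not from the text):

```python
def transistor_count(n0, months, doubling_period=18):
    """Projected transistor count after `months`, starting from n0,
    if density doubles every `doubling_period` months (Moore's Law)."""
    return n0 * 2 ** (months / doubling_period)

# Starting from 1,000 transistors, three years (36 months) of doubling
# every 18 months gives two doublings:
projected = transistor_count(1000, 36)   # 4000.0
```

The same formula with a 12-month period reproduces Moore's original 1965 "doubling every year" observation.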
http://en.wikipedia.org/wiki/File:Transistor_Count_and_Moore%27s_Law_-_2008.svg
[Figure: design abstraction levels: application software, high-level languages, mapping to hardware, with performance evaluation spanning the levels. From K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, 1993. ISBN 0-07-031622-8.]
[Figure: layered machine view, from the machine-independent layers (applications; programming environment; languages supported; communication model; addressing space) down to the machine-dependent hardware architecture. Same source.]
Scalable Parallel Computer Architectures
Parallel: exploit multiple simultaneous operations by the computer.
Scalable: can the architecture be scaled up or down as appropriate for an application?
Scalability implies:
Functionality and performance: improve functionality and compute time in proportion to the increase (or decrease) in resources.
Scaling in cost: cost changes must be reasonable; for an N-times performance change, can we expect a cost increase of 1, N, N*log(N), N^2, etc.?
Compatibility: existing components should still be usable with minor changes.
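The cost-growth question can be made concrete with a small sketch comparing the candidate curves (the function and its labels are illustrative, not from the text):

```python
import math

def cost_growth(n):
    """Candidate cost increases for an n-times performance increase."""
    return {
        "constant": 1,                # ideal: more performance at no added cost
        "linear": n,                  # cost grows in proportion to performance
        "n_log_n": n * math.log2(n),  # typical of richer interconnects
        "quadratic": n ** 2,          # e.g. a full crossbar between n nodes
    }

growth = cost_growth(64)
# {'constant': 1, 'linear': 64, 'n_log_n': 384.0, 'quadratic': 4096}
```

Even at 64 processors, the gap between linear and quadratic cost growth is a factor of 64, which is why interconnect choice dominates scaling-cost arguments.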
[Figure: cost/performance comparison of supercomputers and mainframes.]
[Figures: Flynn's classification block diagrams, built from control units (CU), processing units (PU), memory units, I/O, instruction streams (IS), and data streams (DS): SISD (one CU driving one PU, one DS to the memory unit); SIMD (one IS broadcast to multiple PUs, each with its own DS to a memory unit); MIMD (multiple CU/PU pairs, each with its own IS and DS, sharing memory units and I/O).]
As a computer architecture, is it parallel and scalable?
Parallel: exploit multiple simultaneous operations by the computer.
Scalable: can the architecture be scaled up or down as appropriate for an application?
Scalability dimensions: Resource (machine component), Application (problem and machine size), Technology (time, space, heterogeneity).
Resource Scalability: increasing the machine size (number of processors, amount of memory, improved software).
Size scalability: add or subtract processors. Not always a simple process, due to communications subsystems and processes (i.e., interconnections, interfaces, communication software). Not always possible, due to software (i.e., working with parallelism: how to program or compile the code).
Additional resources: memory, cache memory, disk drives, etc. Not always a simple process, due to addressing, sharing, coherency, etc.
Software scalability: compiler improvements for parallelism and efficiency; more efficient libraries (math, engineering, sorts, etc.); applications software structured for scalable processing; new operating systems (OS) to use scaled resources; a user-friendly programming environment.
Application Scalability: application programs that are scalable in machine size and problem size.
Machine size: at what rate does application performance change as the machine scales? Getting started, communicating, and coordinating all get less efficient with bigger machines; how much gets wasted?
Problem size: growth in data set sizes; change in the number of users/tasks as the machine scales. Practical limits may exist for applications.
Systems involve a combination of the machine and the application, not solely the machine and problem sizes. Don't forget memory, I/O capability, communications, etc.
Technology Scalability: adapt to changes in technology.
Generation (or time) scalability: next-generation components (processor, memory, etc.); next-generation OS. Change is inevitable, but will the system need to be replaced? Intel processors have maintained backward compatibility; Motorola PowerPC processors have not.
Space scalability: how large and distributed can the system get? Box, rack, room, building, region, international, the WWW.
Heterogeneity scalability: scale with components from different vendors, both hardware and software. Was a design standard used, or is everything customized or vendor-specific? Industry standards, open architecture, software portability (e.g., Java).
1.3 Parallel Computer Models: define idealized abstract models of a parallel computer.
A parallel machine model is an abstract parallel computer from the programmer's viewpoint. We will define: abstract machine models, typically used to estimate performance [1.3.3], and physical machine models [1.3.4].
Abstract models are used to characterize the capability of a parallel computer. We hope to implicitly capture the relative cost of parallel computations (cost in dollars and in performance).
The simplest model: The Parallel Random Access Machine (PRAM) Model
[Figure: the PRAM model: multiple processing units (PU) connected to a single shared memory.]
(1) Homogeneity: characterizes how alike the processors are.
A PRAM with 1 processor is Flynn's SISD (Single Instruction, Single Data).
A PRAM with n processors is typically Flynn's MIMD (Multiple Instruction, Multiple Data), but could be Flynn's SIMD (Single Instruction, Multiple Data): if all the processors perform the same instruction on a cycle-by-cycle basis, it is SIMD. A special case is SPMD, Single Program Multiple Data.
(2) Synchrony: characterizes how tightly synchronized the processors are. The PRAM is synchronized at the instruction (clock) level, one step at a time. For a SIMD machine, synchrony is expected; for a MIMD machine, you would expect synchrony to be optional or even undesirable.
Synchrony options, or levels:
Clocks
Instructions
Asynchronous with a synchronization operation
Supersteps (a block of instructions)
Phases of execution, loosely synchronized phases
Asynchronous
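The "supersteps" level of synchrony can be sketched with Python threads: each worker computes asynchronously within a superstep and coordinates with the others only at a barrier between supersteps (the worker logic and sizes here are illustrative, not from the text):

```python
import threading

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)
results = [0] * NUM_WORKERS
totals = [0] * NUM_WORKERS

def worker(rank):
    # Superstep 1: purely local computation, no coordination.
    results[rank] = rank * rank
    # Synchronization operation: wait until every worker ends the superstep.
    barrier.wait()
    # Superstep 2: all of `results` is now safely readable by everyone.
    totals[rank] = sum(results)

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# totals == [14, 14, 14, 14]  (0 + 1 + 4 + 9)
```

Without the barrier, a fast worker could read `results` before a slow worker had written its entry; the barrier is the only point of synchrony.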
(3) Interaction Mechanism: characterizes how processors may interact and how they perform the interaction.
Shared variables (data in a memory that is accessible): shared memory.
Messages (used to communicate): shared nothing.
Multiprocessor: an MIMD machine with shared variables.
Multicomputer: an MIMD machine with message passing.
(4) Address Space: characterizes how the memory addressing space is organized.
Single address space: provided in shared-memory models.
UMA: shared memory, single address space, uniform access time.
NUMA: shared memory, single address space, non-uniform access time (local and remote memory accesses); a Distributed Shared Memory (DSM) system.
Multiple address spaces: multicomputers.
Hybrid: a global shared memory with independent local memories (e.g., the TMS320C40).
(5) Memory Model: characterizes how the machine handles shared-memory access conflicts: the desire for consistency, and valid operations within the machine. Consistency rules must be defined.
EREW (Exclusive Read, Exclusive Write): one access at a time.
CREW (Concurrent Read, Exclusive Write): multiple read accesses, but only one write.
CRCW (Concurrent Read, Concurrent Write): a conflict exists in writing, so an access policy is defined to handle conflicts:
Common: the same value must be simultaneously written.
Arbitrary: pick one and keep going.
Minimum: allow the processor with the lowest ID/index.
Priority: combine values in a defined way (OR, AND, sum, max).
Memory consistency models have been defined and are presented in Chap. 5.
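A toy sketch of how a CRCW shared-memory cell could resolve simultaneous writes under the access policies listed above (the function and policy names are illustrative, not from the text):

```python
def resolve_concurrent_write(writes, policy):
    """Resolve simultaneous writes to one shared cell.
    `writes` is a list of (processor_id, value) pairs arriving in one cycle."""
    if policy == "common":
        values = {v for _, v in writes}
        if len(values) != 1:
            raise ValueError("common policy requires identical values")
        return values.pop()
    if policy == "arbitrary":
        return writes[0][1]               # pick one and keep going
    if policy == "minimum":
        return min(writes)[1]             # processor with the lowest ID wins
    if policy == "combine_sum":
        return sum(v for _, v in writes)  # combine values (here, by summing)
    raise ValueError(f"unknown policy: {policy}")

writes = [(2, 7), (0, 3), (1, 5)]
resolve_concurrent_write(writes, "minimum")      # 3  (processor 0 wrote 3)
resolve_concurrent_write(writes, "combine_sum")  # 15
```

The choice of policy changes which algorithms are expressible: combining-write CRCW PRAMs can, for example, sum n values in a single step, while EREW machines cannot.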
Atomic memory operations have two properties:
(1) Invisible: once started, other processes cannot see the intermediate states.
(2) Finite: the operation will finish in a finite amount of time.
A transaction is an atomic operation that meets the following properties: all or nothing (it moves cleanly from one state to the next); results are not revealed until committed; once committed, the transaction persists.
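A minimal Python sketch of why atomicity matters: the lock below makes the read-modify-write indivisible, so no thread observes or overwrites a half-finished update (names and counts are illustrative):

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:        # the increment becomes atomic:
            counter += 1  # intermediate states are invisible to other threads

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter == 40_000: every increment committed, none lost
```

Without the lock, `counter += 1` is a read, an add, and a write; two threads interleaving those steps can lose increments.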
High-Level Architecture of MIMD Machines, Based on Interaction Method:
Shared nothing: a multicomputer network.
Shared disk: a multicomputer network with global store.
Shared memory: a multiprocessor system.
[Figures: shared-nothing and shared-disk multicomputer networks. Each computer node contains a processor (P), cache (C), memory (M), and network interface circuitry (NIC), joined by an interconnection network; in the shared-nothing design each node has its own hard disk (HD), while in the shared-disk design the disks hang off the interconnection network.]
[Figure: a shared-memory multiprocessor system. Processor nodes (processor P with cache C) connect through an interconnection network to shared memory (M) and hard disks (HD); NIC is network interface circuitry.]
Semantic attributes: which ones may apply to the high-level architectures? (How the semantic attributes relate to the machine model.) Homogeneity, Synchrony, Interaction Mechanism, Address Space, Memory Model.
[Figures: interconnects used by the physical machine models: crossbar switches, buses or crossbars to shared memory (SM) modules, and custom-designed networks with NICs.]
Semantic attributes: which ones may apply to the physical machine models? (How the semantic attributes relate to the machine model.) Homogeneity, Synchrony, Interaction Mechanism, Address Space, Memory Model.
PVP (Parallel Vector Processor): [figure: vector processors (VP) connected by a crossbar switch to shared memory (SM) modules]
Homogeneity: yes, custom processors.
Synchrony: asynchronous or loosely synchronous.
Interaction mechanism: shared variables.
Address space: single.
Memory access: uniform (UMA).
Memory model: sequentially consistent.
SMP (Symmetric Multiprocessor): [figure: processor/cache (P/C) nodes connected by a bus or crossbar to shared memory (SM) modules]
Homogeneity: yes, commodity processors.
Synchrony: asynchronous or loosely synchronous.
Interaction mechanism: shared variables.
Address space: single.
Memory access: uniform (UMA).
Memory model: sequentially consistent.
MPP (Massively Parallel Processor): [figure: nodes, each with a processor/cache (P/C), local memory (LM), and NIC on a memory bus (MB), joined by a custom-designed network]
Homogeneity: yes.
Synchrony: asynchronous or loosely synchronous.
Interaction mechanism: message passing.
Address space: multiple.
Memory access: NORMA.
Memory model: data flow.
DSM (Distributed Shared Memory): [figure: nodes, each with a processor/cache (P/C), local memory (LM), directory (DIR), and NIC, joined by a custom-designed network]
Homogeneity: yes.
Synchrony: asynchronous or loosely synchronous.
Interaction mechanism: shared variables.
Address space: single.
Memory access: NUMA.
Memory model: weak ordering (Ch. 5), supported by the directory.
COW (Cluster of Workstations): [figure: complete workstation nodes, each with a processor/cache (P/C), memory (M), and bridge on a memory bus (MB), plus a NIC, joined by a commodity network]
[Figure: cluster software architecture: the programming environment and applications run on a single-system-image infrastructure layered over the individual node operating systems.]
Comparison of the physical machine models (attributes listed in order: homogeneity; synchrony; interaction mechanism; address space; access cost; memory model; example machines):
PVP/SMP: MIMD; asynchronous or loosely synchronous; shared variable; single; UMA; sequential consistency; IBM R50, Cray T-90.
DSM: MIMD; asynchronous or loosely synchronous; shared variable; single; NUMA; weak ordering is widely used; Stanford DASH, SGI Origin 2000.
MPP/COW: MIMD; asynchronous or loosely synchronous; message passing; multiple; NORMA; N/A; Cray T3E, Berkeley NOW.
Performance notation (units in parentheses):
n: machine size, the number of processors (dimensionless).
f: clock rate (MHz).
W: workload (Mflop).
T1: sequential execution time (sec).
Tn: parallel execution time on n processors (sec).
Pn = W / Tn: speed of an n-processor system (Mflop/sec).
Ppeak: peak speed of a single processor (Mflop/sec).
Sn = T1 / Tn: speedup (dimensionless).
En = Sn / n: efficiency (dimensionless).
Un = Pn / (n * Ppeak): utilization (dimensionless).
t0: communication start-up latency (usec).
r_inf: asymptotic communication bandwidth (Mbytes/sec).
Derived performance: the nominal time for communicating an m-byte message is
Tmessage = t0 + m / r_inf,
i.e., the sum of the communication latency (t0) and the transfer time of the m-byte message at the asymptotic bandwidth.
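Collecting the definitions above into code, a small calculator sketch (parameter names follow the notation; the sample numbers are made up):

```python
def parallel_metrics(t1, tn, n, workload, p_peak):
    """Derived measures: speed Pn = W/Tn, speedup Sn = T1/Tn,
    efficiency En = Sn/n, utilization Un = Pn/(n * Ppeak)."""
    pn = workload / tn
    sn = t1 / tn
    return {"Pn": pn, "Sn": sn, "En": sn / n, "Un": pn / (n * p_peak)}

def message_time(t0, m, r_inf):
    """Nominal communication time: latency t0 plus the transfer time of an
    m-byte message at asymptotic bandwidth r_inf (consistent units assumed)."""
    return t0 + m / r_inf

metrics = parallel_metrics(t1=100.0, tn=8.0, n=16, workload=400.0, p_peak=10.0)
# Pn = 50.0, Sn = 12.5, En = 0.78125, Un = 0.3125
t_msg = message_time(t0=10.0, m=1000, r_inf=100.0)   # 20.0
```

Note that efficiency below 1 here comes from the speedup (12.5) falling short of the machine size (16): the 16 processors are not all kept busy.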
Managing concurrency takes on a central role in developing parallel applications. The basic steps in designing parallel applications are:
Partitioning The partitioning stage of a design is intended to expose opportunities for parallel execution. Hence, the focus is on defining a large number of small tasks in order to yield what is termed a fine-grained decomposition of a problem. Communication The tasks generated by a partition are intended to execute concurrently but cannot, in general, execute independently. The computation to be performed in one task will typically require data associated with another task. Data must then be transferred between tasks so as to allow computation to proceed. This information flow is specified in the communication phase of a design. Agglomeration (to form or collect into a rounded mass) In the third stage, we move from the abstract toward the concrete. We revisit decisions made in the partitioning and communication phases with a view to obtaining an algorithm that will execute efficiently on some class of parallel computer. In particular, we consider whether it is useful to combine, or agglomerate, tasks identified by the partitioning phase, so as to provide a smaller number of tasks, each of greater size. We also determine whether it is worthwhile to replicate data and/or computation. Mapping In the fourth and final stage of the parallel algorithm design process, we specify where each task is to execute. This mapping problem does not arise on uniprocessors or on shared-memory computers that provide automatic task scheduling.
From http://en.wikipedia.org/wiki/Multi-core_(computing)
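The four steps can be sketched for a toy problem, summing a large list, using Python threads (the chunk count and sizes are arbitrary illustrations, not from the text):

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000))

# Partitioning: the finest-grained decomposition is one task per element.
# Communication: every task's result must feed a global reduction.
# Agglomeration: one task per element is far too fine; combine the work
# into a small number of larger chunks.
NUM_WORKERS = 4
size = len(data) // NUM_WORKERS
chunks = [data[i * size:(i + 1) * size] for i in range(NUM_WORKERS)]

# Mapping: assign one chunk per worker; only the partial sums travel back
# for the final reduction.
with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    partial_sums = list(pool.map(sum, chunks))

total = sum(partial_sums)   # 499500 == sum(range(1000))
```

Agglomeration is the step that controls the computation-to-communication ratio: four chunks mean only four partial results cross between workers, instead of a thousand.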
Both explicit and implicit operations must be considered for parallelism and interaction.
Explicit: added instruction calls, additions to sequential operation.
Implicit: not in the instructions but performed anyway.
Types of overhead:
Parallelism overhead: caused by process management
Communication overhead: caused by processors exchanging information
Synchronization overhead: caused when executing synchronization operations
Load-imbalance overhead: caused when processors are idle while others continue
The total parallel execution time decomposes as

Tn = Tcomp + Tinter + Tpar

where Tcomp is the computation time, Tinter is the interaction time (including communication and synchronization), and Tpar is the parallelism overhead. T1 denotes the corresponding sequential (uniprocessor) execution time.
SISD machine: traditional sequential code execution (T1). MIMD machine: parallel processing with overhead (Tn).
Parallel Random Access Model (PRAM)
Bulk Synchronous Parallel (BSP) Model
Phase Parallel Model
Note: the time of algorithms executed on any of the physical machine models can be estimated using each of these abstract models.
PRAM
PRAM is a first-order parallel processing algorithm model. It is simple, clean, and widely used.
[Figure: n processing units (PUs) connected to a common shared memory.]
There are numerous unrealistic assumptions:
zero communication overhead (shared variables)
zero synchronization overhead
assumed instruction-level synchrony
zero parallelism overhead (ignored)

The model accounts only for computation time (including load imbalance); interaction time and parallelism time are taken to be zero.
Tn = Tcomp + (0 = Tinter) + (0 = Tpar) = Tcomp
BSP machine properties:
MIMD
Variable grain
Loosely synchronous
Superstep = computation, communication, barrier synchronization
Non-zero overhead
Message passing or shared variable

[Figure: n processor nodes (P/node) connected by a communication network.]
BSP accounts for computation (and load imbalance) and for communication with simple synchronization, but assumes zero parallelism time. Therefore:

Tn = Tcomp + Tinter + (0 = Tpar) = Tcomp + Tinter

Summed over the M supersteps of an algorithm:

Tn = sum over i = 1..M of ( Tcomp(i) + Tinter(i) )

In BSP, algorithms execute in a sequence of supersteps. A superstep consists of computation operations of at most w cycles, communication of at most g·h cycles (where h is in words), and barrier synchronization of l cycles. The time of superstep i is therefore

Tn(step i) = wi + g·hi + l
Computation time per superstep, wi: the maximum computation time of any processor in the step.

Communication time per superstep, g·hi: defined based on an h-relation, where each node sends and receives at most h words. The coefficient g is a value determined for the machine platform. Note: the communication time does not explicitly include the start-up time, t0.

Synchronization time per superstep, l: the time it takes to send a synchronization message to all processors. The lower bound may be set equal to the time it takes to broadcast a null message over the network.
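The superstep cost model can be sketched directly in code. The parameter values below (g, l, N, n) are illustrative assumptions, not values from the text; the example mirrors the inner-product pattern used later (a local compute step followed by log2(n) combining steps).

```python
from math import ceil, log2

def superstep_time(w, h, g, l):
    """BSP superstep cost: T(step) = w + g*h + l (in cycles)."""
    return w + g * h + l

def bsp_total_time(supersteps, g, l):
    """Sum the cost over a list of (w, h) supersteps."""
    return sum(superstep_time(w, h, g, l) for w, h in supersteps)

# Illustrative example: one local step of 2*ceil(N/n) - 1 cycles,
# then log2(n) combining supersteps of one addition each.
N, n = 1024, 8
steps = [(2 * ceil(N / n) - 1, 1)] + [(1, 1)] * int(log2(n))
total = bsp_total_time(steps, g=4, l=20)
```

Note that g and h appear only as the product g·h, which is why the h-relation abstraction works: the model needs only the maximum words sent or received, not the communication pattern.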
For the phase parallel model, all three terms are non-zero:

Tn = Tcomp + Tinter + Tpar

In the phase parallel model, algorithms execute in a sequence of phases. Each phase includes all relevant computation time and overhead.
The computation time per phase allows for the processing rate/time variances between the n processors.

The interaction times may differ from phase to phase, but the following form can be used:

Tinteraction(m, n) = t0(n) + m/r∞(n) = t0(n) + m·tc(n)

where m is the message length in bytes, and the start-up time t0 and asymptotic bandwidth r∞ are functions of the machine size, n. The per-byte message time tc(n) = 1/r∞(n) is also defined from the asymptotic bandwidth.
Then, the total time in its most complex form may be stated as

Tn = sum over phases i of Tn(i) = sum over i of f( wi, tf, mi, t0(n), r∞(n), Tpar(i) )
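The interaction-time formula has two equivalent forms, one in terms of the asymptotic bandwidth r∞ and one in terms of the per-byte time tc. A minimal sketch (function names and the parameter values in the test are assumptions for illustration):

```python
def interaction_time(m, t0, r_inf):
    """Phase interaction time: T(m) = t0 + m / r_inf,
    with m in bytes and r_inf the asymptotic bandwidth (bytes/sec)."""
    return t0 + m / r_inf

def interaction_time_per_byte(m, t0, t_c):
    """Equivalent per-byte form: t_c = 1 / r_inf, so T(m) = t0 + m * t_c."""
    return t0 + m * t_c
```

For short messages the start-up term t0 dominates; for long messages the m/r∞ term dominates, which is why both parameters are needed to characterize a network.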
Operation count f(N), for a uniprocessor:

Multiplies: N
Additions: N − 1
Computation Time:

T1 = ( N + N − 1 )·tf = ( 2N − 1 )·tf ≈ 2N·tf
For n processors:

Tn = ( 2·ceil(N/n) − 1 + log2(n) )·tf ≈ ( 2N/n + log2(n) )·tf

Speedup: Sn = T1 / Tn
Sn = 2N·tf / ( ( 2N/n + log2(n) )·tf ) = 2N / ( 2N/n + log2(n) )

Sn = n / ( 1 + n·log2(n)/(2N) )

For N >> n: Sn ≈ n
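The uniprocessor time, the PRAM parallel time, and the resulting speedup can be checked numerically. A sketch assuming the expressions above (the problem sizes in the usage line are illustrative):

```python
from math import ceil, log2

T_F = 1.0  # per-operation time tf; it cancels in the speedup ratio

def t1(N):
    """Uniprocessor time: (2N - 1) * tf."""
    return (2 * N - 1) * T_F

def tn_pram(N, n):
    """PRAM parallel time: (2*ceil(N/n) - 1 + log2(n)) * tf."""
    return (2 * ceil(N / n) - 1 + log2(n)) * T_F

def speedup_pram(N, n):
    return t1(N) / tn_pram(N, n)
```

For N >> n the speedup approaches n, e.g. speedup_pram(10**6, 8) is within a fraction of a percent of 8.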
Using rapid order-of-magnitude estimates to define the size of a problem. Quick but very coarse. Examples:

A(n) = 4·n^4: O(A) ~ n^4
B(n) = 8·n^3: O(B) ~ n^3

Then for n = 2, A(2) = B(2) = 64; the lower-order function only wins for larger n.

As another example, compare

TA = 7·n
TB = (1/4)·n·log2(n)

Although TA has the lower order, TB < TA whenever log2(n) < 28, i.e., for every n below 2^28.

Simple analysis can cause significant problems! You may want to use order-of estimates only for back-of-the-envelope estimates or brainstorming guesses.
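The pitfall can be verified numerically for the TA/TB pair above. A sketch; the crossover point follows from setting 7·n = n·log2(n)/4, i.e. log2(n) = 28:

```python
from math import log2

def t_a(n):
    return 7 * n               # O(n): the "better" order

def t_b(n):
    return n * log2(n) / 4     # O(n log n): the "worse" order

# Despite its higher order, t_b is faster whenever log2(n) < 28,
# i.e. for every problem size below 2**28.
crossover = 2 ** 28
```

So for any realistic problem size up to about 268 million, the algorithm with the "worse" asymptotic order is actually faster.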
Example, for n = 8: the inner product s = A^H·B, for A, B N×1 vectors.
Superstep costs (n = 8):

Superstep    Tcomp                    Tcomm    Tsynch
1            w = 2·ceil(N/n) − 1      h = 1    l = 1
2            w = 1                    h = 1    l = 1
3            w = 1                    h = 1    l = 1
4            w = 1                    h = 0    l = 0

Computation Time:

Tn = sum over i of Tn(i) = sum over i of ( wi + g·hi + l )

Tcomp = 2·ceil(N/n) − 1 + log2(n)

T8 = ( 2·ceil(N/8) − 1 + log2(8)·(1 + g + l) )·tf
Speedup: Sn = T1 / Tn

Sn = 2N / ( 2N/n + log2(n)·(1 + g + l) ) = n / ( 1 + n·(1 + g + l)·log2(n)/(2N) )

For comparison, the PRAM speedup was Sn = n / ( 1 + n·log2(n)/(2N) ).
The BSP speedup should always be smaller than the PRAM speedup (and the phase parallel speedup smaller still): S_PRAM > S_BSP > S_PP.
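The ordering S_PRAM > S_BSP can be checked numerically from the two speedup expressions. A sketch; the g and l values in the test are illustrative assumptions:

```python
from math import log2

def s_pram(N, n):
    """PRAM speedup (zero overhead): n / (1 + n*log2(n)/(2N))."""
    return n / (1 + n * log2(n) / (2 * N))

def s_bsp(N, n, g, l):
    """BSP speedup: n / (1 + n*(1 + g + l)*log2(n)/(2N))."""
    return n / (1 + n * (1 + g + l) * log2(n) / (2 * N))
```

With any g, l > 0 the BSP denominator is strictly larger, so the BSP speedup is strictly below the PRAM speedup; with g = l = 0 the two expressions coincide.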
Assume that an application's program has two types of code: (X) code that can be parallelized and (Y) code that cannot be parallelized. For the total code, W = X + Y. When executed on one processor:
T1 = X·tf + Y·tf = W·tf

On n processors, only X is sped up:

Tn = ( X/n + Y )·tf

For n → ∞:

Sn = T1/Tn = W / ( X/n + Y ) = 1 / ( (X/W)/n + Y/W ) → 1 / (Y/W) = W/Y
Implications: Efficiently optimize the code that can be parallelized (X) The maximum speedup is bounded based on the percent of the code that cannot be parallelized (Y/W) Therefore, minimize (Y) the amount of code that cannot be parallelized!
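In terms of the serial fraction Y/W, the limit above gives the familiar Amdahl speedup formula. A minimal sketch (the function name and the fractions in the test are invented for illustration):

```python
def amdahl_speedup(n, serial_fraction):
    """Speedup on n processors when serial_fraction = Y/W of the code
    cannot be parallelized: Sn = 1 / (Y/W + (1 - Y/W)/n)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)
```

Even a modest serial fraction caps the speedup hard: with Y/W = 0.1, no number of processors can ever exceed a speedup of 10, which is exactly the W/Y bound derived above.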
Principle of Independence
Attempt to make the components of a system independent of one another. Allow independent scaling in hardware, software, algorithm, etc. (others: machine language, high-level language, application, platform, architecture, algorithm, interfaces, network, network topology).

Principle of Balanced Design
Minimize performance bottlenecks. Eliminate the slowest component first; allocate acceptable time performance. Amdahl's Law example (1st reference in text). Amdahl's Rule: the processing speed should be balanced against the memory capacity AND the I/O speed (e.g., 1 MIPS : 1 MB : 1 Mbps).

Principle of Design for Scalability
Scalability must be considered as a main objective from the start of the design activity. Overdesign: designed for the future? Backward compatibility: for legacy or scaled-down activity.

Principle of Latency Hiding
Techniques to be described in Chapter 5. How can we hide anything that would slow the processing down?
The 50% Rule: the design is balanced if each of the overhead factors may degrade the performance by a factor of no more than 50%. To evaluate whether a design meets this criterion:
1. Select an appropriate performance factor (Speed, Efficiency, or Utilization).
2. Derive a value for the system performance with no overhead and define the acceptable performance as 50% of that value when the overhead is included (e.g., 2× the time, ½× the speed, ½× the utilization).
3. Form an inequality that defines a system performance (with only the overhead of interest non-zero) that is greater than the 50% acceptable performance. Manipulate this equation to define an inequality based on the overhead of interest. This now bounds the acceptable range for the overhead factor of interest.

Utilization is defined as

U = Pn / ( n·Ppeak ) = ( 1/(n·Ppeak) )·( W/Tn ) = ( P1/Ppeak )·( T1/(n·Tn) )

using W = P1·T1.
Letting

T1 = W·tf = n·w·tf

and

Tn = ( w + 2·log(n) )·tf + t0 + w·tc + tp

the overhead ratio becomes

n·Tn / T1 = Tn / ( w·tf ) = 1 + 2·log(n)/w + ( t0 + tp )/( w·tf ) + tc/tf
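The overhead ratio n·Tn/T1 can be evaluated directly. A sketch assuming the expression above, with the logarithm taken as base 2 (the text leaves the base implicit), and with illustrative parameter values in the test:

```python
from math import log2

def overhead_ratio(n, w, t_f, t_0, t_c, t_p):
    """n*Tn/T1 = 1 + 2*log2(n)/w + (t_0 + t_p)/(w*t_f) + t_c/t_f,
    following the Tn and T1 expressions above (log base 2 assumed)."""
    return 1 + 2 * log2(n) / w + (t_0 + t_p) / (w * t_f) + t_c / t_f

def utilization(p1_over_ppeak, ratio):
    """U = (P1/Ppeak) * T1/(n*Tn) = (P1/Ppeak) / ratio."""
    return p1_over_ppeak / ratio
```

With all overheads zero and n = 1 the ratio is exactly 1; each overhead source then adds its own term, which is what makes the per-source 50% analysis below possible.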
1.) Performance factor: Utilization.
2.) With only one overhead source non-zero, the utilization for each source becomes:

Computation variance:

U = ( P1/Ppeak ) / ( 1 + 2·log(n)/w )

Communication bandwidth:

U = ( P1/Ppeak ) / ( 1 + t0/(w·tf) + tc/tf )

Parallelism:

U = ( P1/Ppeak ) / ( 1 + tp/(w·tf) )
3.) Form an inequality that defines a system performance (with only the overhead of interest non-zero) that is greater than the 50% acceptable performance. Manipulate this equation to define an inequality based on the overhead of interest. This now bounds the acceptable range for the overhead factor of interest.
For the computation variance overhead:

U = ( P1/Ppeak ) / ( 1 + 2·log(n)/w ) ≥ 50% · ( P1/Ppeak )

1 + 2·log(n)/w ≤ 2   ⇒   2·log(n)/w ≤ 1   ⇒   w ≥ 2·log(n)
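The w ≥ 2·log(n) bound can be checked in code. A sketch using base-2 logarithms (the function names are invented; only the synchronization overhead term is retained, per step 3):

```python
from math import log2

def sync_utilization_factor(n, w):
    """Utilization relative to P1/Ppeak with only the synchronization
    overhead non-zero: 1 / (1 + 2*log2(n)/w)."""
    return 1.0 / (1.0 + 2.0 * log2(n) / w)

def meets_50_percent_rule(n, w):
    """The 50% rule holds when the factor stays >= 0.5, i.e. w >= 2*log2(n)."""
    return sync_utilization_factor(n, w) >= 0.5
```

For example, with n = 16 processors the grain size must satisfy w ≥ 8; at exactly w = 8 the utilization factor sits right at the 50% boundary.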
Cluster benefits and difficulties:
Usability
Scalability
Availability
Utilization
Performance vs. Cost Ratio
Comparison of Scalability and Availability for: Fault Tolerant Systems, Clusters, MPPs, and SMPs. [Figure: chart positioning these systems by scalable performance and availability.]
Table 1.7, p. 33, Comparison of Clusters, MPP, SMP and Distributed Systems
| System Characteristic | MPP | SMP | Cluster | Distributed System |
| Number of Nodes (N) | O(100-1000) | — | — | O(10-1000) |
| Node Complexity | Fine or medium grain | — | — | Wide range |
| Internode Communication | Message Passing or Shared Variable for DSM | Shared Memory | Message Passing | Shared Files, RPC, Message Passing |
| Job Scheduling | Single run queue at host | — | Multiple queues but coordinated | Independent multiple queues |
| SSI Support | Partially | — | Desired | None |
| Node OS copies and type | N (microkernel) and 1 host OS (monolithic) | — | N (homogeneous desired) | N (heterogeneous) |
| Address Space | Multiple (single if DSM) | — | Multiple | Multiple |
| Internode Security | Unnecessary | — | Required if exposed | Required |
| Ownership | One Organization | — | One or more Organizations | Many Organizations |
| Network Protocol | Nonstandard | — | Standard or Nonstandard | Standard |
| System Availability | Low to Medium | — | Highly Available or Fault Tolerant | Medium |
| Performance Metric | Throughput and turnaround time | — | Throughput and turnaround time | Response Time |