
Parallel Computing

Lecturer: Satinder Pal Singh


E-mail: mail@sviet.ac.in

CS-517 Parallel Efficient Algorithms


Slide 1

Course Contents

Slide 2

Recommended Course Textbooks


M. J. Quinn, Parallel Computing: Theory and Practice, McGraw-Hill, New York, 1994.
T. G. Lewis and H. El-Rewini, Introduction to Parallel Computing, Prentice Hall, New Jersey, 1992.
T. G. Lewis, Parallel Programming: A Machine-Independent Approach, IEEE Computer Society Press, Los Alamitos, 1994.

Slide 3

What is Parallel Computing?


Traditionally, software has been written for serial computation:
- To be run on a single computer having a single Central Processing Unit (CPU);
- A problem is broken into a discrete series of instructions;
- Instructions are executed one after another;
- Only one instruction may execute at any moment in time.

Slide 4

What is Parallel Computing?


In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
- To be run using multiple CPUs;
- A problem is broken into discrete parts that can be solved concurrently;
- Each part is further broken down to a series of instructions;
- Instructions from each part execute simultaneously on different CPUs.

Slide 5

What is Parallel Computing?


The compute resources can include:
- A single computer with multiple processors;
- An arbitrary number of computers connected by a network;
- A combination of both.

The computational problem usually demonstrates characteristics such as the ability to be:
- Broken apart into discrete pieces of work that can be solved simultaneously;
- Execute multiple program instructions at any moment in time;
- Solved in less time with multiple compute resources than with a single compute resource.
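To make the definition concrete, here is a minimal Python sketch (not from the slides) that breaks one computation into discrete parts and runs them simultaneously on multiple CPUs; the function name partial_sum, the worker count and the chunking are illustrative choices.

```python
# Minimal sketch: split one problem (summing squares) into parts that
# run concurrently on several CPU cores.
from concurrent.futures import ProcessPoolExecutor

def partial_sum(bounds):
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    n = 10_000_000
    workers = 4
    step = n // workers
    chunks = [(i * step, n if i == workers - 1 else (i + 1) * step)
              for i in range(workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(partial_sum, chunks))  # parts execute simultaneously
    print(total)
```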
Slide 6

What is Parallel Computing?


The Universe is parallel: parallel computing is an evolution of serial computing that attempts to emulate what has always been the state of affairs in the natural world: many complex, interrelated events happening at the same time, yet within a sequence. For example:
- Galaxy formation
- Planetary movement
- Weather and ocean patterns
- Automobile assembly line

Slide 7

What is Parallel Computing?


Uses for Parallel Computing
Historically, parallel computing has been considered to be "the high end of computing", and has been used to model difficult scientific and engineering problems found in the real world. Some examples:
- Atmosphere, Earth, Environment
- Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
- Bioscience, Biotechnology, Genetics
- Chemistry, Molecular Sciences
- Geology, Seismology
- Mechanical Engineering - from prosthetics to spacecraft
- Electrical Engineering, Circuit Design, Microelectronics
- Computer Science, Mathematics
Slide 8

What is Parallel Computing?


Today, commercial applications provide an equal or greater driving force in the development of faster computers. These applications require the processing of large amounts of data in sophisticated ways. For example:
- Databases, data mining
- Oil exploration
- Web search engines, web based business services
- Medical imaging and diagnosis
- Pharmaceutical design
- Management of national and multi-national corporations
- Financial and economic modeling
- Advanced graphics and virtual reality, particularly in the entertainment industry
- Networked video and multi-media technologies
- Collaborative work environments

Slide 9

Why Use Parallel Computing?

Save time and/or money: In theory, throwing more resources at a task will shorten its time to completion, with potential cost savings. Parallel clusters can be built from cheap, commodity components.

Solve larger problems: Many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer, especially given limited computer memory. For example:
- "Grand Challenge" (en.wikipedia.org/wiki/Grand_Challenge) problems requiring PetaFLOPS and PetaBytes of computing resources.
- Web search engines/databases processing millions of transactions per second.

Provide concurrency: A single compute resource can only do one thing at a time. Multiple computing resources can be doing many things simultaneously. For example, the Access Grid (www.accessgrid.org) provides a global collaboration network where people from around the world can meet and conduct work "virtually".

Slide 10

Slide 11

Flynn's Classical Taxonomy

There are different ways to classify parallel computers. One of the more widely used classifications, in use since 1966, is called Flynn's Taxonomy. Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be classified along the two independent dimensions of Instruction and Data. Each of these dimensions can have only one of two possible states: Single or Multiple. The matrix below defines the 4 possible classifications according to Flynn:

SISD - Single Instruction, Single Data       SIMD - Single Instruction, Multiple Data
MISD - Multiple Instruction, Single Data     MIMD - Multiple Instruction, Multiple Data

Slide 12

Single Instruction, Single Data (SISD): (uniprocessor)
- A serial (non-parallel) computer
- Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
- Single data: only one data stream is being used as input during any one clock cycle
- Deterministic execution
- This is the oldest and, even today, the most common type of computer
- Examples: older generation mainframes, minicomputers and workstations; most modern day PCs

Flynn's Classical Taxonomy

Slide 13

Single Instruction, Multiple Data (SIMD): (array processors)
- A type of parallel computer
- Single instruction: all processing units execute the same instruction at any given clock cycle
- Multiple data: each processing unit can operate on a different data element
- Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing
- Synchronous (lockstep) and deterministic execution
- Two varieties: Processor Arrays and Vector Pipelines
- Examples:
  - Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV
  - Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10
- Most modern computers, particularly those with graphics processor units (GPUs), employ SIMD instructions and execution units.

Slide 14
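As an everyday illustration of the SIMD idea, the hedged sketch below uses NumPy (assumed to be installed): one vectorized operation is applied to many data elements, and on most CPUs the underlying loops are executed with SIMD instructions such as SSE/AVX.

```python
# Illustrative only: one conceptual operation applied to a million data elements.
import numpy as np

a = np.arange(1_000_000, dtype=np.float32)
b = np.ones_like(a)

c = a + b          # "single instruction, multiple data" style element-wise add
print(c[:5])       # [1. 2. 3. 4. 5.]
```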

Flynn's Classical Taxonomy

Multiple Instruction, Single Data (MISD): (not practical)
- A single data stream is fed into multiple processing units
- Each processing unit operates on the data independently via independent instruction streams
- Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971).
- Some conceivable uses might be:
  - multiple frequency filters operating on a single signal stream
  - multiple cryptography algorithms attempting to crack a single coded message

Flynn's Classical Taxonomy

Slide 15

Multiple Instruction, Multiple Data (MIMD): (multiprocessor system)
- Currently, the most common type of parallel computer. Most modern computers fall into this category.
- Multiple instruction: every processor may be executing a different instruction stream
- Multiple data: every processor may be working with a different data stream
- Execution can be synchronous or asynchronous, deterministic or non-deterministic
- Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs
- Note: many MIMD architectures also include SIMD execution sub-components

Slide 16

Flynn's Classical Taxonomy

Shared Memory

General Characteristics:
- Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as global address space.
- Multiple processors can operate independently but share the same memory resources.
- Changes in a memory location effected by one processor are visible to all other processors.
- Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA.

Uniform Memory Access (UMA):
- Most commonly represented today by Symmetric Multiprocessor (SMP) machines
- Identical processors
- Equal access and access times to memory
- Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update.

Slide 17

Parallel Computer Memory Architectures

Parallel Computer Memory Architectures


[Figures: Shared Memory (UMA) and CC-NUMA architectures]

Slide 18

Parallel Computer Memory Architectures


Shared Memory (NUMA)
Advantages:
- Global address space provides a user-friendly programming perspective to memory
- Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs

Disadvantages:
- Primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can geometrically increase traffic on the shared memory-CPU path and, for cache coherent systems, geometrically increase traffic associated with cache/memory management.
- Programmer responsibility for synchronization constructs that ensure "correct" access of global memory.
- Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors.
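A minimal sketch of the shared-memory model and of the programmer's responsibility for synchronization, using Python threads; the names counter and lock and the iteration counts are illustrative, not from the slides.

```python
# All threads see the same counter; the Lock is the synchronization construct
# that keeps concurrent updates to the shared (global) memory correct.
import threading

counter = 0
lock = threading.Lock()
N_INCREMENTS = 100_000

def worker():
    global counter
    for _ in range(N_INCREMENTS):
        with lock:              # "correct" access to global memory
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000; without the lock, updates may be lost
```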

Slide 19

Distributed Memory General Characteristics:

Parallel Computer Memory Architectures

- Like shared memory systems, distributed memory systems vary widely but share a common characteristic: distributed memory systems require a communication network to connect inter-processor memory.
- Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of global address space across all processors.
- Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply.
- When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility.
- The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet.
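The sketch below uses Python's multiprocessing module as a stand-in for a real distributed-memory machine: each process has its own address space, so data is visible to another process only if it is explicitly sent over a communication channel (here a Pipe).

```python
# Distributed-memory idea in miniature: no shared variables, only messages.
from multiprocessing import Process, Pipe

def worker(conn):
    data = conn.recv()          # data arrives only because it was explicitly sent
    conn.send(sum(data))
    conn.close()

if __name__ == "__main__":
    parent_end, child_end = Pipe()
    p = Process(target=worker, args=(child_end,))
    p.start()
    parent_end.send(list(range(10)))   # explicit communication
    print(parent_end.recv())           # 45
    p.join()
```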

Slide 20

Advantages:
- Memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately.
- Each processor can rapidly access its own memory without interference and without the overhead incurred with trying to maintain cache coherency.
- Cost effectiveness: can use commodity, off-the-shelf processors and networking.

Parallel Computer Memory Architectures

Disadvantages:
- The programmer is responsible for many of the details associated with data communication between processors.
- It may be difficult to map existing data structures, based on global memory, to this memory organization.
- Non-uniform memory access times: data residing on a remote node takes longer to access than node-local data.

Slide 21

Hybrid Distributed-Shared Memory

Parallel Computer Memory Architectures

- The largest and fastest computers in the world today employ both shared and distributed memory architectures.
- The shared memory component is usually a cache coherent SMP machine. Processors on a given SMP can address that machine's memory as global.
- The distributed memory component is the networking of multiple SMPs. SMPs know only about their own memory - not the memory on another SMP. Therefore, network communications are required to move data from one SMP to another.
- Current trends seem to indicate that this type of memory architecture will continue to prevail and increase at the high end of computing for the foreseeable future.
- Advantages and disadvantages: whatever is common to both shared and distributed memory architectures.

Slide 22

Although it might not seem apparent, these models are NOT specific to a particular type of machine or memory architecture. In fact, any of these models can (theoretically) be implemented on any underlying hardware. Two examples:

1. Shared memory model on a distributed memory machine: Kendall Square Research (KSR) ALLCACHE approach. Machine memory was physically distributed, but appeared to the user as a single shared memory (global address space). Generically, this approach is referred to as "virtual shared memory". Note: although KSR is no longer in business, there is no reason to suggest that a similar implementation will not be made available by another vendor in the future.

2. Message passing model on a shared memory machine: MPI on SGI Origin. The SGI Origin employed the CC-NUMA type of shared memory architecture, where every task has direct access to global memory. However, the ability to send and receive messages with MPI, as is commonly done over a network of distributed memory machines, was implemented and commonly used.

Slide 23

Parallel Programming Models

What is Parallel Computing? (basic idea)


Consider the problem of stacking (reshelving) a set of library books.
A single worker trying to stack all the books in their proper places cannot accomplish the task faster than a certain rate. We can speed up this process, however, by employing more than one worker.
Slide 24

Solution 1

Assume that books are organized into shelves and that the shelves are grouped into bays. One simple way to assign the task to the workers is:
- to divide the books equally among them;
- each worker stacks the books one at a time.

This division of work may not be the most efficient way to accomplish the task, since each worker may have to walk across the entire library to reach the bays for his or her books.

Slide 25

Solution 2

An alternative way to divide the work is to assign a fixed and disjoint set of bays to each worker (an instance of task partitioning). As before, each worker is assigned an equal number of books, arbitrarily.
- If the worker finds a book that belongs to a bay assigned to him or her, he or she places that book in its assigned spot.
- Otherwise, he or she passes it on to the worker responsible for the bay it belongs to (an instance of communication between tasks).

The second approach requires less effort from individual workers.

Slide 26
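A hedged Python sketch of Solution 2: each worker owns a disjoint set of bays (task partitioning) and hands any book it cannot shelve to the owning worker (communication between tasks). The bay-to-worker mapping, the book list and the counters are invented for illustration.

```python
# Task partitioning with communication: workers own disjoint bays and
# forward misrouted books to the responsible worker.
import random
from collections import defaultdict

NUM_WORKERS = 3
NUM_BAYS = 9
books = [random.randrange(NUM_BAYS) for _ in range(30)]      # each book belongs to a bay

owner = {bay: bay % NUM_WORKERS for bay in range(NUM_BAYS)}  # disjoint bay assignment

# arbitrary, equal initial division of the books
piles = defaultdict(list)
for i, book in enumerate(books):
    piles[i % NUM_WORKERS].append(book)

shelved = defaultdict(list)
hand_offs = 0
for worker, pile in piles.items():
    for book in pile:
        if owner[book] == worker:
            shelved[worker].append(book)       # place it in its own bay
        else:
            shelved[owner[book]].append(book)  # pass it on to the responsible worker
            hand_offs += 1

print(sum(len(v) for v in shelved.values()), "books shelved,", hand_offs, "hand-offs")
```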

Problems are parallelizable to different degrees


For some problems, assigning partitions to other processors might be more time-consuming than performing the processing locally. Other problems may be completely serial.
For example, consider the task of digging a post hole.
Although one person can dig the hole in a certain amount of time, adding more people does not make the digging go faster: the task is inherently serial.

Slide 27

Power of parallel solutions


Pile collection
- Ants/robots with very limited abilities (each can only see its immediate neighbourhood)
- Grid environment containing sticks and robots

Move():
    move randomly
    until the robot sees a stick in its neighbourhood

Collect():
    Move();
    pick up a stick;
    Move();
    put it down;
    Collect();
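A runnable sketch of the stick-collecting behaviour above, assuming a small toroidal grid with randomly scattered sticks and a single robot; the grid size, stick count and step count are arbitrary illustrative parameters.

```python
# Follows the pseudocode literally: wander until a stick is visible in the
# neighbourhood, pick one up, wander again, put it down next to other sticks.
import random

SIZE, STICKS, STEPS = 20, 60, 5000
grid = [[0] * SIZE for _ in range(SIZE)]              # sticks per cell
for _ in range(STICKS):
    grid[random.randrange(SIZE)][random.randrange(SIZE)] += 1

def neighbours(x, y):
    return [((x + dx) % SIZE, (y + dy) % SIZE)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]

x = y = 0
carrying = False
for _ in range(STEPS):
    x, y = random.choice(neighbours(x, y))            # Move(): one random step
    near = [(nx, ny) for nx, ny in neighbours(x, y) if grid[nx][ny] > 0]
    if not carrying and near:
        nx, ny = random.choice(near)                  # pick up a stick it can see
        grid[nx][ny] -= 1
        carrying = True
    elif carrying and near:
        grid[x][y] += 1                               # put it down beside other sticks
        carrying = False

print("largest pile:", max(max(row) for row in grid))
```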
Slide 28

Sorting in nature


Slide 29

Parallel Processing
(Several processing elements working to solve a single problem)

Primary consideration: elapsed time


NOT: throughput, sharing resources, etc.

Downside: complexity (system and algorithm design)

Elapsed Time = computation time + communication time + synchronization time


Slide 30

Design of efficient algorithms

A parallel computer is of little use unless efficient parallel algorithms are available.
The issues in designing parallel algorithms are very different from those in designing their sequential counterparts. A significant amount of work is being done to develop efficient parallel algorithms for a variety of parallel architectures.

Slide 31

Processor Trends
Moore's Law
performance doubles every 18 months

Parallelization within processors


- pipelining
- multiple pipelines
Slide 32

Why Parallel Computing


Practical:
- Moore's Law cannot hold forever
- Problems must be solved immediately
- Cost-effectiveness
- Scalability

Theoretical:
challenging problems

Slide 33

Some Complex Problems


- N-body simulation
- Atmospheric simulation
- Image generation
- Oil exploration
- Financial processing
- Computational biology
Slide 34

Some Complex Problems


N-body simulation
- O(n log n) time
- a galaxy has approx. 10^11 stars
- approx. one year / iteration

Atmospheric simulation
- 3D grid, each element interacts with its neighbors
- 1 x 1 x 1 mile elements: approx. 5 x 10^8 elements
- a 10-day simulation requires approx. 100 days

Slide 35

Some Complex Problems


Image generation
- animation, special effects
- several minutes of video: 50 days of rendering

Oil exploration
- large amounts of seismic data to be processed
- months of sequential exploration
Slide 36

Some Complex Problems


Financial processing
- market prediction, investing
- Cornell Theory Center, Renaissance Tech.

Computational biology
- drug design
- gene sequencing (Celera)
- structure prediction (Proteomics)
Slide 37

Fundamental Issues
- Is the problem amenable to parallelization?
- How to decompose the problem to exploit parallelism?
- What machine architecture should be used?
- What parallel resources are available?
- What kind of speedup is desired?

Slide 38

Two Kinds of Parallelism


Pragmatic
- goal is to speed up a given computation as much as possible
- problem-specific techniques include:
  - overlapping instructions (multiple pipelines)
  - overlapping I/O operations (RAID systems)
  - traditional (asymptotic) parallelism techniques

Slide 39

Two Kinds of Parallelism


Asymptotic
- studies:
  - architectures for general parallel computation
  - parallel algorithms for fundamental problems
  - limits of parallelization
- can be subdivided into three main areas


Slide 40

Asymptotic Parallelism
Models
comparing/evaluating different architectures

Algorithm Design
utilizing a given architecture to solve a given problem

Computational Complexity
classifying problems according to their difficulty
Slide 41

Architecture
Single processor:
- single instruction stream
- single data stream
- von Neumann model

Multiple processors:
Flynn's taxonomy

Slide 42

Flynn's Taxonomy

                            Data Streams
                          One          Many
  Instruction     One     SISD         SIMD
  Streams         Many    MISD         MIMD


Slide 43

Slide 44

Parallel Architectures
Multiple processing elements

Memory:
- shared
- distributed
- hybrid

Control:
- centralized
- distributed
Slide 45

Parallel vs Distributed Computing


Parallel:
several processing elements concurrently solving a single problem

Distributed:
processing elements do not share memory or system clock

Which is the subset of which?


distributed is a subset of parallel
Slide 46

Efficient and optimal parallel algorithms


A parallel algorithm is efficient iff
- it is fast (e.g. polynomial time), and
- the product of the parallel time and the number of processors is close to the time of the best known sequential algorithm.


A parallel algorithm is optimal iff this product is of the same order as the best known sequential time. For example, a parallel sorting algorithm running in O(log n) time on n processors has cost O(n log n), which matches the best sequential sorting time, and is therefore optimal.

Slide 47

Metrics
A measure of relative performance between a multiprocessor system and a single processor system is the speed-up S(p), defined as follows:

S(p) = (execution time using a single processor system) / (execution time using a multiprocessor with p processors)

i.e. Sp = T1 / Tp

Efficiency: Ep = Sp / p

Cost: Cp = p x Tp
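A short worked example of these metrics with invented timings (T1 = 100 s, p = 4, Tp = 30 s):

```python
# Hedged worked example; the timings are made-up numbers, not measurements.
T1 = 100.0        # seconds on a single processor
p, Tp = 4, 30.0   # 4 processors, 30 seconds in parallel

S = T1 / Tp       # speed-up   = 3.33
E = S / p         # efficiency = 0.83 (83%)
C = p * Tp        # cost       = 120 processor-seconds (> T1, so not cost-optimal)
print(S, E, C)
```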
Slide 48

Metrics
Parallel algorithm is cost-optimal:
- parallel cost = sequential time, i.e. Cp = T1 (equivalently Ep = 100%)

Critical when down-scaling:
- a parallel implementation may become slower than sequential
- example: T1 = n^3, Tp = n^2.5 when p = n^2, so Cp = p x Tp = n^4.5

Emulating this algorithm on q < n^2 processors takes roughly Cp / q = n^4.5 / q time, which is worse than the sequential n^3 whenever q < n^1.5.
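A small numerical illustration of the down-scaling problem, using the figures from this slide and assuming the p-processor algorithm is emulated on q processors in roughly Cp / q time:

```python
# Down-scaling example with the slide's figures (illustrative n).
n = 100
T1 = n ** 3                  # sequential time, 1e6
Cp = n ** 4.5                # parallel cost with p = n^2 processors, about 1e9

for q in (n ** 2, n):        # emulate the parallel algorithm on q processors
    Tq = Cp / q              # emulated running time ~ Cp / q
    print(q, Tq, "slower than sequential:", Tq > T1)
# the crossover is at q = n^1.5; below that the parallel code loses to the sequential one
```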
Slide 49

Amdahl's Law

f = fraction of the problem that is inherently sequential

(1 - f) = fraction that can be parallelized

Parallel time with p processors (sequential time normalized to 1):

Tp = f + (1 - f) / p
Slide 50

Speedup with p processors:

Sp = T1 / Tp = 1 / (f + (1 - f) / p)

Part f is computed by a single processor; part (1 - f) is computed by p processors, p > 1.

What kind of speed-up may be achieved?

Basic observation: increasing p cannot speed up part f.

Slide 51

Amdahl's Law

Upper bound on speedup (p = infinity): as p grows, the term (1 - f) / p converges to 0, so

S = 1 / f

Example: f = 2%  =>  S = 1 / 0.02 = 50
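A small sketch that evaluates Amdahl's formula for a growing number of processors, using the f = 2% figure from the example above; it shows the speed-up approaching the 1/f = 50 bound.

```python
# Amdahl's law: speed-up saturates at 1/f no matter how many processors are added.
def speedup(f, p):
    return 1.0 / (f + (1.0 - f) / p)

f = 0.02                          # 2% inherently sequential
for p in (1, 10, 100, 1000, 10**6):
    print(p, round(speedup(f, p), 2))
# tends towards 1 / f = 50 as p grows
```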
Slide 52

The main open question


The basic parallel complexity class is NC. NC is the class of problems computable in polylogarithmic time (O(log^c n) for a constant c) using a polynomial number of processors. P is the class of problems computable sequentially in polynomial time.

The main open question in parallel computation is: NC = P?


Slide 53
