
Parallel Computing

Lecturer: Satinder Pal Singh


E-mail: mail@sviet.ac.in

CS-517 Parallel Efficient Algorithms


Slide 1

Course Contents

Slide 2

Recommended Course Textbooks


M. J. Quinn, Parallel Computing: Theory and Practice, McGraw-Hill, New York, 1994.
T. G. Lewis and H. El-Rewini, Introduction to Parallel Computing, Prentice Hall, New Jersey, 1992.
T. G. Lewis, Parallel Programming: A Machine-Independent Approach, IEEE Computer Society Press, Los Alamitos, 1994.

Slide 3

What is Parallel Computing?


Traditionally, software has been written for serial computation:
- To be run on a single computer having a single Central Processing Unit (CPU);
- A problem is broken into a discrete series of instructions;
- Instructions are executed one after another;
- Only one instruction may execute at any moment in time.

Slide 4

What is Parallel Computing?


In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
- To be run using multiple CPUs;
- A problem is broken into discrete parts that can be solved concurrently;
- Each part is further broken down to a series of instructions;
- Instructions from each part execute simultaneously on different CPUs.

Slide 5

What is Parallel Computing?


The compute resources can include:
- A single computer with multiple processors;
- An arbitrary number of computers connected by a network;
- A combination of both.

The computational problem usually demonstrates characteristics such as the ability to be:
- Broken apart into discrete pieces of work that can be solved simultaneously;
- Execute multiple program instructions at any moment in time;
- Solved in less time with multiple compute resources than with a single compute resource.
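To make the definition concrete, here is a minimal Python sketch (not from the slides) that breaks one computation into discrete parts and runs them simultaneously on multiple CPUs; the function name partial_sum, the worker count and the chunking are illustrative choices.

```python
# Minimal sketch: split one problem (summing squares) into parts that
# run concurrently on several CPU cores.
from concurrent.futures import ProcessPoolExecutor

def partial_sum(bounds):
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    n = 10_000_000
    workers = 4
    step = n // workers
    chunks = [(i * step, n if i == workers - 1 else (i + 1) * step)
              for i in range(workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(partial_sum, chunks))  # parts execute simultaneously
    print(total)
```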
Slide 6

What is Parallel Computing?


The Universe is parallel: parallel computing is an evolution of serial computing that attempts to emulate what has always been the state of affairs in the natural world: many complex, interrelated events happening at the same time, yet within a sequence. For example:
- Galaxy formation
- Planetary movement
- Weather and ocean patterns
- Automobile assembly line

Slide 7

What is Parallel Computing?


Uses for Parallel Computing
Historically, parallel computing has been considered to be "the high end of computing", and has been used to model difficult scientific and engineering problems found in the real world. Some examples:
- Atmosphere, Earth, Environment
- Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
- Bioscience, Biotechnology, Genetics
- Chemistry, Molecular Sciences
- Geology, Seismology
- Mechanical Engineering - from prosthetics to spacecraft
- Electrical Engineering, Circuit Design, Microelectronics
- Computer Science, Mathematics
Slide 8

What is Parallel Computing?


Today, commercial applications provide an equal or greater driving force in the development of faster computers. These applications require the processing of large amounts of data in sophisticated ways. For example:
- Databases, data mining
- Oil exploration
- Web search engines, web based business services
- Medical imaging and diagnosis
- Pharmaceutical design
- Management of national and multi-national corporations
- Financial and economic modeling
- Advanced graphics and virtual reality, particularly in the entertainment industry
- Networked video and multi-media technologies
- Collaborative work environments

Slide 9

Why Use Parallel Computing?

Save time and/or money: In theory, throwing more resources at a task will shorten its time to completion, with potential cost savings. Parallel clusters can be built from cheap, commodity components.

Solve larger problems: Many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer, especially given limited computer memory. For example:
- "Grand Challenge" (en.wikipedia.org/wiki/Grand_Challenge) problems requiring PetaFLOPS and PetaBytes of computing resources.
- Web search engines/databases processing millions of transactions per second.

Provide concurrency: A single compute resource can only do one thing at a time. Multiple computing resources can be doing many things simultaneously. For example, the Access Grid (www.accessgrid.org) provides a global collaboration network where people from around the world can meet and conduct work "virtually".

Slide 10

Slide 11

Flynn's Classical Taxonomy

There are different ways to classify parallel computers. One of the more widely used classifications, in use since 1966, is called Flynn's Taxonomy. Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be classified along the two independent dimensions of Instruction and Data. Each of these dimensions can have only one of two possible states: Single or Multiple. The matrix below defines the 4 possible classifications according to Flynn:

SISD - Single Instruction, Single Data       SIMD - Single Instruction, Multiple Data
MISD - Multiple Instruction, Single Data     MIMD - Multiple Instruction, Multiple Data

Slide 12

Single Instruction, Single Data (SISD): (uniprocessor)
- A serial (non-parallel) computer
- Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
- Single data: only one data stream is being used as input during any one clock cycle
- Deterministic execution
- This is the oldest and, even today, the most common type of computer
- Examples: older generation mainframes, minicomputers and workstations; most modern day PCs

Flynn's Classical Taxonomy

Slide 13

Single Instruction, Multiple Data (SIMD): (array processors)
- A type of parallel computer
- Single instruction: all processing units execute the same instruction at any given clock cycle
- Multiple data: each processing unit can operate on a different data element
- Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing
- Synchronous (lockstep) and deterministic execution
- Two varieties: Processor Arrays and Vector Pipelines
- Examples:
  - Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV
  - Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10
- Most modern computers, particularly those with graphics processor units (GPUs), employ SIMD instructions and execution units.

Slide 14
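As an everyday illustration of the SIMD idea, the hedged sketch below uses NumPy (assumed to be installed): one vectorized operation is applied to many data elements, and on most CPUs the underlying loops are executed with SIMD instructions such as SSE/AVX.

```python
# Illustrative only: one conceptual operation applied to a million data elements.
import numpy as np

a = np.arange(1_000_000, dtype=np.float32)
b = np.ones_like(a)

c = a + b          # "single instruction, multiple data" style element-wise add
print(c[:5])       # [1. 2. 3. 4. 5.]
```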

Flynn's Classical Taxonomy

Multiple Instruction, Single Data (MISD): (not practical)
- A single data stream is fed into multiple processing units
- Each processing unit operates on the data independently via independent instruction streams
- Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971).
- Some conceivable uses might be:
  - multiple frequency filters operating on a single signal stream
  - multiple cryptography algorithms attempting to crack a single coded message

Flynn's Classical Taxonomy

Slide 15

Multiple Instruction, Multiple Data (MIMD): (multiprocessor system)
- Currently, the most common type of parallel computer. Most modern computers fall into this category.
- Multiple instruction: every processor may be executing a different instruction stream
- Multiple data: every processor may be working with a different data stream
- Execution can be synchronous or asynchronous, deterministic or non-deterministic
- Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs
- Note: many MIMD architectures also include SIMD execution sub-components

Slide 16

Flynn's Classical Taxonomy

Shared Memory

General Characteristics:
- Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as global address space.
- Multiple processors can operate independently but share the same memory resources.
- Changes in a memory location effected by one processor are visible to all other processors.
- Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA.

Uniform Memory Access (UMA):
- Most commonly represented today by Symmetric Multiprocessor (SMP) machines
- Identical processors
- Equal access and access times to memory
- Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update.

Slide 17

Parallel Computer Memory Architectures

Parallel Computer Memory Architectures


[Figures: Shared Memory (UMA) and CC-NUMA architectures]

Slide 18

Parallel Computer Memory Architectures


Shared Memory (NUMA)
Advantages:
- Global address space provides a user-friendly programming perspective to memory
- Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs

Disadvantages:
- Primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can geometrically increase traffic on the shared memory-CPU path and, for cache coherent systems, geometrically increase traffic associated with cache/memory management.
- Programmer responsibility for synchronization constructs that ensure "correct" access of global memory.
- Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors.
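A minimal sketch of the shared-memory model and of the programmer's responsibility for synchronization, using Python threads; the names counter and lock and the iteration counts are illustrative, not from the slides.

```python
# All threads see the same counter; the Lock is the synchronization construct
# that keeps concurrent updates to the shared (global) memory correct.
import threading

counter = 0
lock = threading.Lock()
N_INCREMENTS = 100_000

def worker():
    global counter
    for _ in range(N_INCREMENTS):
        with lock:              # "correct" access to global memory
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000; without the lock, updates may be lost
```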

Slide 19

Distributed Memory General Characteristics:

Parallel Computer Memory Architectures

- Like shared memory systems, distributed memory systems vary widely but share a common characteristic: distributed memory systems require a communication network to connect inter-processor memory.
- Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of global address space across all processors.
- Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply.
- When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility.
- The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet.
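The sketch below uses Python's multiprocessing module as a stand-in for a real distributed-memory machine: each process has its own address space, so data is visible to another process only if it is explicitly sent over a communication channel (here a Pipe).

```python
# Distributed-memory idea in miniature: no shared variables, only messages.
from multiprocessing import Process, Pipe

def worker(conn):
    data = conn.recv()          # data arrives only because it was explicitly sent
    conn.send(sum(data))
    conn.close()

if __name__ == "__main__":
    parent_end, child_end = Pipe()
    p = Process(target=worker, args=(child_end,))
    p.start()
    parent_end.send(list(range(10)))   # explicit communication
    print(parent_end.recv())           # 45
    p.join()
```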

Slide 20

Advantages:
- Memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately.
- Each processor can rapidly access its own memory without interference and without the overhead incurred with trying to maintain cache coherency.
- Cost effectiveness: can use commodity, off-the-shelf processors and networking.

Parallel Computer Memory Architectures

Disadvantages:
- The programmer is responsible for many of the details associated with data communication between processors.
- It may be difficult to map existing data structures, based on global memory, to this memory organization.
- Non-uniform memory access times: data residing on a remote node takes longer to access than node-local data.

Slide 21

Hybrid Distributed-Shared Memory

Parallel Computer Memory Architectures

- The largest and fastest computers in the world today employ both shared and distributed memory architectures.
- The shared memory component is usually a cache coherent SMP machine. Processors on a given SMP can address that machine's memory as global.
- The distributed memory component is the networking of multiple SMPs. SMPs know only about their own memory - not the memory on another SMP. Therefore, network communications are required to move data from one SMP to another.
- Current trends seem to indicate that this type of memory architecture will continue to prevail and increase at the high end of computing for the foreseeable future.
- Advantages and disadvantages: whatever is common to both shared and distributed memory architectures.

Slide 22

Although it might not seem apparent, these models are NOT specific to a particular type of machine or memory architecture. In fact, any of these models can (theoretically) be implemented on any underlying hardware. Two examples:

1. Shared memory model on a distributed memory machine: Kendall Square Research (KSR) ALLCACHE approach. Machine memory was physically distributed, but appeared to the user as a single shared memory (global address space). Generically, this approach is referred to as "virtual shared memory". Note: although KSR is no longer in business, there is no reason to suggest that a similar implementation will not be made available by another vendor in the future.

2. Message passing model on a shared memory machine: MPI on SGI Origin. The SGI Origin employed the CC-NUMA type of shared memory architecture, where every task has direct access to global memory. However, the ability to send and receive messages with MPI, as is commonly done over a network of distributed memory machines, was implemented and commonly used.

Slide 23

Parallel Programming Models

What is Parallel Computing? (basic idea)


Consider the problem of stacking (reshelving) a set of library books.
A single worker trying to stack all the books in their proper places cannot accomplish the task faster than a certain rate. We can speed up this process, however, by employing more than one worker.
Slide 24

Solution 1

Assume that books are organized into shelves and that the shelves are grouped into bays. One simple way to assign the task to the workers is:
- to divide the books equally among them;
- each worker stacks the books one at a time.

This division of work may not be the most efficient way to accomplish the task, since each worker may have to walk across the entire library to reach the bays for his or her books.

Slide 25

Solution 2

An alternative way to divide the work is to assign a fixed and disjoint set of bays to each worker (an instance of task partitioning). As before, each worker is assigned an equal number of books, arbitrarily.
- If the worker finds a book that belongs to a bay assigned to him or her, he or she places that book in its assigned spot.
- Otherwise, he or she passes it on to the worker responsible for the bay it belongs to (an instance of communication between tasks).

The second approach requires less effort from individual workers.

Slide 26
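A hedged Python sketch of Solution 2: each worker owns a disjoint set of bays (task partitioning) and hands any book it cannot shelve to the owning worker (communication between tasks). The bay-to-worker mapping, the book list and the counters are invented for illustration.

```python
# Task partitioning with communication: workers own disjoint bays and
# forward misrouted books to the responsible worker.
import random
from collections import defaultdict

NUM_WORKERS = 3
NUM_BAYS = 9
books = [random.randrange(NUM_BAYS) for _ in range(30)]      # each book belongs to a bay

owner = {bay: bay % NUM_WORKERS for bay in range(NUM_BAYS)}  # disjoint bay assignment

# arbitrary, equal initial division of the books
piles = defaultdict(list)
for i, book in enumerate(books):
    piles[i % NUM_WORKERS].append(book)

shelved = defaultdict(list)
hand_offs = 0
for worker, pile in piles.items():
    for book in pile:
        if owner[book] == worker:
            shelved[worker].append(book)       # place it in its own bay
        else:
            shelved[owner[book]].append(book)  # pass it on to the responsible worker
            hand_offs += 1

print(sum(len(v) for v in shelved.values()), "books shelved,", hand_offs, "hand-offs")
```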

Problems are parallelizable to different degrees


For some problems, assigning partitions to other processors might be more time-consuming than performing the processing locally. Other problems may be completely serial.
For example, consider the task of digging a post hole.
Although one person can dig the hole in a certain amount of time, adding more people does not make the digging go faster: the task is inherently serial.

Slide 27

Power of parallel solutions


Pile collection
- Ants/robots with very limited abilities (each can only see its immediate neighbourhood)
- Grid environment containing sticks and robots

Move():
    move randomly
    until the robot sees a stick in its neighbourhood

Collect():
    Move();
    pick up a stick;
    Move();
    put it down;
    Collect();
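A runnable sketch of the stick-collecting behaviour above, assuming a small toroidal grid with randomly scattered sticks and a single robot; the grid size, stick count and step count are arbitrary illustrative parameters.

```python
# Follows the pseudocode literally: wander until a stick is visible in the
# neighbourhood, pick one up, wander again, put it down next to other sticks.
import random

SIZE, STICKS, STEPS = 20, 60, 5000
grid = [[0] * SIZE for _ in range(SIZE)]              # sticks per cell
for _ in range(STICKS):
    grid[random.randrange(SIZE)][random.randrange(SIZE)] += 1

def neighbours(x, y):
    return [((x + dx) % SIZE, (y + dy) % SIZE)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]

x = y = 0
carrying = False
for _ in range(STEPS):
    x, y = random.choice(neighbours(x, y))            # Move(): one random step
    near = [(nx, ny) for nx, ny in neighbours(x, y) if grid[nx][ny] > 0]
    if not carrying and near:
        nx, ny = random.choice(near)                  # pick up a stick it can see
        grid[nx][ny] -= 1
        carrying = True
    elif carrying and near:
        grid[x][y] += 1                               # put it down beside other sticks
        carrying = False

print("largest pile:", max(max(row) for row in grid))
```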
Slide 28

Sorting in nature


Slide 29

Parallel Processing
(Several processing elements working to solve a single problem)

Primary consideration: elapsed time


NOT: throughput, sharing resources, etc.

Downside: complexity (system and algorithm design)

Elapsed Time = computation time + communication time + synchronization time


Slide 30

Design of efficient algorithms

A parallel computer is of little use unless efficient parallel algorithms are available.
The issues in designing parallel algorithms are very different from those in designing their sequential counterparts. A significant amount of work is being done to develop efficient parallel algorithms for a variety of parallel architectures.

Slide 31

Processor Trends
Moore's Law
performance doubles every 18 months

Parallelization within processors


- pipelining
- multiple pipelines
Slide 32

Why Parallel Computing


Practical:
- Moore's Law cannot hold forever
- Problems must be solved immediately
- Cost-effectiveness
- Scalability

Theoretical:
challenging problems

Slide 33

Some Complex Problems


- N-body simulation
- Atmospheric simulation
- Image generation
- Oil exploration
- Financial processing
- Computational biology
Slide 34

Some Complex Problems


N-body simulation
- O(n log n) time
- a galaxy has approx. 10^11 stars
- approx. one year / iteration

Atmospheric simulation
- 3D grid, each element interacts with its neighbors
- 1 x 1 x 1 mile elements: approx. 5 x 10^8 elements
- a 10-day simulation requires approx. 100 days

Slide 35

Some Complex Problems


Image generation
- animation, special effects
- several minutes of video: 50 days of rendering

Oil exploration
- large amounts of seismic data to be processed
- months of sequential exploration
Slide 36

Some Complex Problems


Financial processing
- market prediction, investing
- Cornell Theory Center, Renaissance Tech.

Computational biology
- drug design
- gene sequencing (Celera)
- structure prediction (Proteomics)
Slide 37

Fundamental Issues
- Is the problem amenable to parallelization?
- How to decompose the problem to exploit parallelism?
- What machine architecture should be used?
- What parallel resources are available?
- What kind of speedup is desired?

Slide 38

Two Kinds of Parallelism


Pragmatic
- goal is to speed up a given computation as much as possible
- problem-specific techniques include:
  - overlapping instructions (multiple pipelines)
  - overlapping I/O operations (RAID systems)
  - traditional (asymptotic) parallelism techniques

Slide 39

Two Kinds of Parallelism


Asymptotic
- studies:
  - architectures for general parallel computation
  - parallel algorithms for fundamental problems
  - limits of parallelization
- can be subdivided into three main areas


Slide 40

Asymptotic Parallelism
Models
comparing/evaluating different architectures

Algorithm Design
utilizing a given architecture to solve a given problem

Computational Complexity
classifying problems according to their difficulty
Slide 41

Architecture
Single processor:
- single instruction stream
- single data stream
- von Neumann model

Multiple processors:
Flynn's taxonomy

Slide 42

Flynn's Taxonomy

                            Data Streams
                          One          Many
  Instruction     One     SISD         SIMD
  Streams         Many    MISD         MIMD


Slide 43

Slide 44

Parallel Architectures
Multiple processing elements

Memory:
- shared
- distributed
- hybrid

Control:
- centralized
- distributed
Slide 45

Parallel vs Distributed Computing


Parallel:
several processing elements concurrently solving a single problem

Distributed:
processing elements do not share memory or system clock

Which is the subset of which?


distributed is a subset of parallel
Slide 46

Efficient and optimal parallel algorithms


A parallel algorithm is efficient iff
- it is fast (e.g. polynomial time), and
- the product of the parallel time and the number of processors is close to the time of the best known sequential algorithm.


A parallel algorithm is optimal iff this product is of the same order as the best known sequential time. For example, a parallel sorting algorithm running in O(log n) time on n processors has cost O(n log n), which matches the best sequential sorting time, and is therefore optimal.

Slide 47

Metrics
A measure of relative performance between a multiprocessor system and a single processor system is the speed-up S(p), defined as follows:

S(p) = (execution time using a single processor system) / (execution time using a multiprocessor with p processors)

i.e. Sp = T1 / Tp

Efficiency: Ep = Sp / p

Cost: Cp = p x Tp
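A short worked example of these metrics with invented timings (T1 = 100 s, p = 4, Tp = 30 s):

```python
# Hedged worked example; the timings are made-up numbers, not measurements.
T1 = 100.0        # seconds on a single processor
p, Tp = 4, 30.0   # 4 processors, 30 seconds in parallel

S = T1 / Tp       # speed-up   = 3.33
E = S / p         # efficiency = 0.83 (83%)
C = p * Tp        # cost       = 120 processor-seconds (> T1, so not cost-optimal)
print(S, E, C)
```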
Slide 48

Metrics
Parallel algorithm is cost-optimal:
- parallel cost = sequential time, i.e. Cp = T1 (equivalently Ep = 100%)

Critical when down-scaling:
- a parallel implementation may become slower than sequential
- example: T1 = n^3, Tp = n^2.5 when p = n^2, so Cp = p x Tp = n^4.5

Emulating this algorithm on q < n^2 processors takes roughly Cp / q = n^4.5 / q time, which is worse than the sequential n^3 whenever q < n^1.5.
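A small numerical illustration of the down-scaling problem, using the figures from this slide and assuming the p-processor algorithm is emulated on q processors in roughly Cp / q time:

```python
# Down-scaling example with the slide's figures (illustrative n).
n = 100
T1 = n ** 3                  # sequential time, 1e6
Cp = n ** 4.5                # parallel cost with p = n^2 processors, about 1e9

for q in (n ** 2, n):        # emulate the parallel algorithm on q processors
    Tq = Cp / q              # emulated running time ~ Cp / q
    print(q, Tq, "slower than sequential:", Tq > T1)
# the crossover is at q = n^1.5; below that the parallel code loses to the sequential one
```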
Slide 49

Amdahl's Law

f = fraction of the problem that is inherently sequential

(1 - f) = fraction that can be parallelized

Parallel time with p processors (sequential time normalized to 1):

Tp = f + (1 - f) / p
Slide 50

Speedup with p processors:

Sp = T1 / Tp = 1 / (f + (1 - f) / p)

Part f is computed by a single processor; part (1 - f) is computed by p processors, p > 1.

What kind of speed-up may be achieved?

Basic observation: increasing p cannot speed up part f.

Slide 51

Amdahl's Law

Upper bound on speedup (p = infinity): as p grows, the term (1 - f) / p converges to 0, so

S = 1 / f

Example: f = 2%  =>  S = 1 / 0.02 = 50
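A small sketch that evaluates Amdahl's formula for a growing number of processors, using the f = 2% figure from the example above; it shows the speed-up approaching the 1/f = 50 bound.

```python
# Amdahl's law: speed-up saturates at 1/f no matter how many processors are added.
def speedup(f, p):
    return 1.0 / (f + (1.0 - f) / p)

f = 0.02                          # 2% inherently sequential
for p in (1, 10, 100, 1000, 10**6):
    print(p, round(speedup(f, p), 2))
# tends towards 1 / f = 50 as p grows
```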
Slide 52

The main open question


The basic parallel complexity class is NC. NC is the class of problems computable in polylogarithmic time (O(log^c n) for a constant c) using a polynomial number of processors. P is the class of problems computable sequentially in polynomial time.

The main open question in parallel computation is: NC = P?


Slide 53
