
Unit II

1. Workload Driven Evaluation


1. Scaling workload and machine
2. Key issues in scaling
3. Scaling models and speedup measures
4. Scaling workload parameters
5. Evaluating a real machine
6. Performance isolation
7. Choosing Workload
8. Choosing performance metrics
Evaluation
Microbenchmark - a small program that
stresses a particular machine feature
SPEC (Standard Performance Evaluation Corporation)

Simulation
Simulator program (flexible) - a program that simulates the
design with and without the proposed feature of interest
Run on a number of programs or a multiprogrammed workload
Measures the performance impact
Cost = hardware + design time

Workload needs to be renewed


Benchmark suites are revised
Costly to develop an accurate simulator
Good evaluation leads to good design
Workload-Driven
Evaluation
Evaluating real machines
Evaluating an architectural idea or
trade-offs
For a multiprocessor, the workloads of interest are:
Parallel programs
Multiprogrammed workloads

=> need good metrics of performance


=> need to pick good workloads
=> need to pay attention to scaling
many factors involved
Reasons why workload evaluation
is difficult for multiprocessors
Immaturity of parallel applications
Not easy to find a representative workload

Immaturity of parallel programming

The software model has not stabilized

Sensitivity to behavioral differences


Different decisions in parallelizing a sequential program

New degrees of freedom


Number of processors, extended memory hierarchy, communication
architecture

Limitations of simulation
Resource intensive: requires a lot of memory and time
Scaling Workloads and machines:
Basic measures of multiprocessor
performance
Two performance characteristics
1. Absolute performance
Important to the end user / buyer
Measured as work done per unit time
Input configuration (problem size) fixed up front, or a continuous stream of inputs
Performance = 1 / execution time
Explicit work per unit time is meaningful for the application
Ex: number of transactions serviced per unit time, number of
bonds computed per second
2. Performance improvement due to parallelism
Speedup(p) = Absolute performance on p processors / Absolute performance on 1 processor

Speedup (execution time as performance metric):
Speedup(p) = Time(1 proc) / Time(p procs)

Speedup (operations per second as performance metric):
Speedup(p) = Operations per second (p procs) / Operations per second (1 proc)
Why worry about scaling?
Speedup with a fixed problem size is insufficient:
Problem size too small
overheads due to parallelism dominate
Problem size too large
data too large to fit in memory
artifactual communication
Problem size unchanged
does not reflect realistic usage
Users want a powerful machine to solve larger problems
Problem size increases with machine size
So the problem size should be scaled
Scaling overcomes the size mismatch
The measure of performance is always work per unit time, regardless
of the scaling model
(a) shows the speedup for a small problem size in the Ocean application. The problem size is clearly very
appropriate for a machine with 8 or so processors. At a little beyond 16 processors, the speedup has
saturated, and it is no longer clear that one would run this problem size on so large a machine. And
this is clearly the wrong problem size with which to run on or evaluate a machine of 32 or more processors,
since we'd be better off running it on a machine with 8 or 16 processors!
(b) shows the speedup for the equation solver kernel, illustrating superlinear speedups: by the time 16
processors are used, a processor's working set fits in cache, but with fewer processors it does not fit
and the program performs poorly.
Scaling a machine: making it more or less powerful
Any component can be made
bigger, more sophisticated, or faster:
processor, cache, memory, communication
architecture, I/O system
Machine size: a vector characterizing per-node
capabilities
Scale by adding more identical nodes
Ex: p processors with p x m megabytes of total memory become
k x p processors with k x p x m megabytes of total
memory
Problem size: a specific problem instance, i.e. input
configuration
A vector of input parameters
Ex: the n-by-n grid in Ocean:
V = (n, ε, Δt, T)
n - grid size in each dimension
ε - error tolerance
Δt - temporal resolution (physical time between time
steps)
T - number of time steps
Data set size - amount of storage needed to run the
program on a single processor (e.g. determined by the grid size n)
Memory usage - amount of memory used by the parallel
program, including replication
Key issues in scaling
1. Under what constraints should the problem be
scaled?
Some property must be kept fixed
2. How should the problem be scaled?
The parameters in the vector V (the problem size) must be
changed to meet the chosen constraint
Under What Constraints to
Scale?
Two types of constraints:
1. User-oriented, e.g. particles, rows, transactions, I/Os per processor
2. Resource-oriented, e.g. memory, time

Which is more appropriate depends on the application domain:

User-oriented: easier for the user to think about and change
Resource-oriented: more general, and often more realistic

Resource-oriented scaling models:

Problem constrained (PC)
Memory constrained (MC)
Time constrained (TC)

(TPC: the users' terminals and the size of the database scale with
computing power)

Growth under MC and TC may be hard to predict


Problem Constrained
Scaling
User wants to solve the same problem,
only faster, by using a larger machine
Video compression
Computer graphics
VLSI routing

But limited when evaluating larger
machines
Problem size is fixed; execution time varies

SpeedupPC(p) = Time(1 processor) / Time(p processors)
Time Constrained Scaling
Execution time is kept fixed as the system scales
User has a fixed time to use the machine or to wait for a result
Performance = Work/Time as usual; time is fixed, so
SpeedupTC(p) = Work(p) / Work(1)

How to measure work?

Execution time on a single processor? (thrashing problems)
Should be easy to measure, ideally analytical and intuitive
Should scale linearly with sequential complexity
Otherwise ideal speedup will not be linear in p (e.g. number of rows in a
matrix program)
If no intuitive application measure can be found, as is often true, measure
execution time with an ideal memory system on a uniprocessor (e.g. using
pixie)
Memory Constrained
Scaling
Scale so memory usage per processor stays fixed

Scaled speedup: Time(1) / Time(p) for the scaled-up
problem
Hard to measure Time(1), and inappropriate

SpeedupMC(p) = [Work(p) / Time(p)] x [Time(1) / Work(1)] = Increase in Work / Increase in Time

Can lead to large increases in execution time

If work grows faster than linearly in memory usage
e.g. matrix factorization:
a 10,000-by-10,000 matrix takes 800MB and 1 hour on a uniprocessor.
With 1,000 processors, one can run a 320K-by-320K matrix, but the ideal parallel time
grows to 32 hours!
With 10,000 processors, 100 hours ...
Scaling Summary
Under any scaling rule, the relative structure of
the problem changes with p
PC scaling: the per-processor portion gets smaller
MC and TC scaling: the total problem gets larger

Need to understand hardware/software
interactions with scale

For a given problem, there is often a natural
scaling rule
example: equal error scaling
Scaling workload
parameters
Other parameters of problem size
should be scaled relative to one another
Not an issue under PC, but must be
considered under MC and TC
Ex. (an n-body simulation):
Force calculation accuracy - θ
Physical interval between time steps - Δt
The number of bodies - n
Evaluating a real machine
Types of workload driven
evaluation
Evaluating a real machine
Simple
Organization, granularity, and performance
parameters are fixed
Choosing an appropriate workload
Not constrained by the limitations of software
simulation
Evaluating an architectural idea or tradeoff
We begin with the use of micro-benchmarks
to isolate performance characteristics
Performance isolation using
microbenchmark
As a first step in evaluating a real machine, we might like
to understand its basic performance capabilities; that is,
the performance characteristics of the primitive operations
provided by the programming model,
communication abstraction,
or hardware-software interface.
This is usually done with small, specially written programs
called microbenchmarks
Five types of microbenchmarks are used in parallel
systems, the first three of which are used for uniprocessor
evaluation as well
1. Processing - operations that do not access memory
2. Local memory - organization, latencies, and bandwidths of the
levels of the memory hierarchy
3. I/O - disk reads and writes
4. Communication - message send/receive, remote read/write
5. Synchronization - locks, barriers
The communication and synchronization
microbenchmarks depend on the communication
abstraction or programming model used
For measurement purposes, microbenchmarks are
usually implemented as repeated sets of the primitive
operations, e.g. ten thousand remote reads in a row.
The role of microbenchmarks is to isolate and
understand the performance of basic system
capabilities.
The next step is to evaluate the machine on more
realistic workloads. There are three major axes we must
navigate: the workloads, their problem sizes, and the
number of processors (or the machine size).
The microbenchmark consists of a large number of loads from a local array. The Y-
axis shows the time per load in nanoseconds. The X-axis is the stride between
successive loads in the loop, i.e. the difference in the addresses of the memory
locations being accessed. And the different curves correspond to the size of the
array (ArraySize) being strided through. When ArraySize is less than 8KB, the array
fits in the processor cache so all loads are hits and take 6.67 ns to complete. For
larger arrays, we see the effects of cache misses. The average access time is the
weighted sum of hit and miss time, until there is an inflection when the stride is
longer than a cache block (32 words or 128 bytes) and every reference misses.
The next rise occurs when some references cause page faults, with an inflection
when the stride is large enough (16KB) that every consecutive reference does so.
The final rise is due to conflicts at the memory banks in the 4-bank main memory,
with an inflection at 64K stride when consecutive references hit the same bank.
Types of Workloads (Choosing
workload)
Kernels: These are well-defined parts of real applications, but are
not complete applications themselves. Ex: matrix factorization, FFT,
depth-first tree search
Complete Applications: consist of multiple kernels, and exhibit
higher-level interactions among kernels that an individual kernel
cannot reveal. Ex: ocean simulation, crew scheduling, database
Multiprogrammed Workloads: consist of multiple sequential and
parallel applications running together on the machine. The different
applications may time-share the machine, space-share it, or both,
depending on the operating system's multiprogramming policies
Multiprog. -- Complex Appls -- Kernels -- Microbench.
<- More realistic: higher-level interactions, are what really matters
Easier to understand: controlled, repeatable, basic machine characteristics ->

Each has its place:


Use kernels and microbenchmarks to gain understanding, but applications to
evaluate effectiveness and performance
Representativeness of Application
Domain
If we are performing an evaluation as users looking to procure a machine, and we
know that the machine will be used to run only certain types of applications (say
databases, transaction processing, or ocean simulation), then this part of our job
is easy
If our machine may be used to run a wide range of workloads, or if we are
designers trying to evaluate a machine to learn lessons for the next generation,
we should choose a mix of workloads representative of a wide range of domains:
Scientific
Engineering
Graphics
media processing
information management
Optimization
artificial intelligence
Multiprogrammed
operating system
Coverage of Behavioral Properties

It is important that the workloads we choose, taken together,
stress a range of important performance characteristics. For
example, we should choose workloads:
with low and high communication to computation ratios,
small and large working sets,
regular and irregular access patterns,
and localized and collective communication.
Another important issue is the level of program optimization.
Real parallel programs will not always be highly optimized for
good performance. In particular, there are four important
levels to consider:
1. Algorithmic. The decomposition and assignment of tasks may be less than
optimal, and certain algorithmic enhancements for data locality such as
blocking may not be implemented; for example, strip-oriented versus block-
oriented assignment for a grid computation.
2. Data structuring. The data structures used may not interact optimally with the
architecture, causing artifactual communication; for example, two-dimensional
versus four-dimensional arrays to represent a two-dimensional grid in a shared
address space.
3. Data layout, distribution and alignment. Even if appropriate data structures are
used, they may not be distributed or aligned appropriately to pages or cache
blocks, causing artifactual communication in shared address space systems.
4. Orchestration of communication and synchronization.
Concurrency
Should have enough concurrency to utilize the processors
If load imbalance dominates, there may not be much the machine can do
(Still, it is useful to know which kinds of workloads/configurations don't have enough concurrency)

Algorithmic speedup: useful measure of concurrency/imbalance


Speedup (under scaling model) assuming all memory/communication operations take zero time
Ignores memory system, measures imbalance and extra work
Uses PRAM machine model (Parallel Random Access Machine)
Unrealistic, but widely used for theoretical algorithm development

In general we should isolate performance limitations due to program characteristics
that a machine cannot do much about (concurrency) from those that it can address.

Many efforts have been made to define standard benchmark suites of parallel
applications to facilitate workload-driven architectural evaluation.

For now, let us assume that a parallel program has been chosen as a workload, and
see how we might use it to evaluate a real machine. We first keep the number of
processors fixed, which both simplifies the discussion and also exposes the
important interactions more cleanly. Then, we discuss varying the number of
processors as well.
Workload/Benchmark
Suites
Numerical Aerodynamic Simulation (NAS)
Originally pencil and paper benchmarks

SPLASH/SPLASH-2
Shared address space parallel programs

ParkBench
Message-passing parallel programs

ScaLapack
Message-passing kernels

TPC
Transaction processing

SPEC-HPC
...
Choosing Performance
metrics
1. Absolute performance
User time or Wall clock time
Average or maximum time
2. Performance improvement or speedup
3. Processing rate
Number of computer operations executed per unit time
MFLOPS, MIPS
4. Utilization
Fraction of time the processors are busy executing
5. Problem size
Smallest problem size of a given application that obtains a
specified parallel efficiency
Efficiency = speedup / number of processors
Efficiency-constrained scaling
6. Percentage improvement in performance
Improved performance due to architectural feature
In evaluating how well a machine scales as resources are
added, it is not only how performance increases that
matters but also how cost increases.
1. Absolute Performance
Suppose that execution time is our absolute performance
metric. Time can be measured in different ways
First, there is a choice between user time and wall-clock time:
User time is the time the machine spent executing
code from the particular workload or program in
question, thus excluding system activity and other
programs that might be timesharing the machine
Wall-clock time is the total elapsed time for the
workload (including all intervening activity), as
measured by a clock hanging on the wall
Second, there is the issue of whether to use the average
or the maximum execution time over all processes of the
program.
2. Performance Improvement or Speedup
What the denominator in the speedup ratio
(performance on one processor) should actually
measure. There are four choices:
1. Performance of the parallel program on one
processor of the parallel machine
2. Performance of a sequential implementation of
the same algorithm on one processor of the
parallel machine
3. Performance of the best sequential algorithm
and program for the same problem on one
processor of the parallel machine
4. Performance of the best sequential program on
an agreed-upon standard machine.
3. Processing Rate
A metric that is often quoted to characterize the
performance of machines is the number of computer
operations that they execute per unit time (as opposed to
operations that have meaning at application level, such
as transactions or chemical bonds).
Classic examples are MFLOPS (millions of floating point
operations per second) for numerically intensive
programs and MIPS (millions of instructions per second)
for general programs.
4. Utilization
Architects sometimes measure success by how well (what
fraction of the time) they are able to keep their
processing engines busy executing instructions rather
than stalled due to various overheads.
5. Problem Size
The smallest problem size of a given application that obtains a
specified parallel efficiency, which is defined as speedup divided
by the number of processors.
By keeping parallel efficiency fixed as the number of processors
increases, this in a sense introduces a new scaling model that we
might call efficiency-constrained scaling, and with it a
performance metric: the smallest problem size needed to achieve
that efficiency.
6. Percentage improvement in performance
