
ECE750 Lecture 9: Performance engineering
Todd Veldhuizen, tveldhui@acm.org
Electrical & Computer Engineering, University of Waterloo, Canada
Nov 16, 2007

Part I Memory hierarchy


Memory hierarchy I
So far we have primarily been concerned with the asymptotic efficiency of algorithms. Having chosen an algorithm that is efficient in theory, there are still serious engineering challenges to getting high performance in practice. To obtain decent performance on problems involving nontrivial amounts of data, you need to understand:
1. How the memory hierarchy works;
2. How to take advantage of this.


Historical performance of a typical desktop machine:

    Year                1980   1990   2000
    Clock cycle (ns)     500     50      1
    DRAM latency (ns)    375    100     60
    Disk latency (ms)     87     28      8

Memory hierarchy II


In two decades:
CPU: 500x faster clock cycle.
Main memory: 6x faster.
Disk: 10x faster.

It can now take hundreds of clock cycles to access data in main memory: Memory is the new disk. To mitigate the widening gap between CPU speed and memory access time, an elaborate memory hierarchy has evolved.

Memory hierarchy III


Example memory hierarchy of a desktop machine:
                  L1 cache    L2 cache    Main memory    Disk
    Bandwidth     20 Gb/s     10 Gb/s     2 Gb/s         0.08 Gb/s
    Latency       1 ns        8 ns        200 ns         10^7 ns
    Size          16 kb       1 Mb        2 Gb           400 Gb

Block size (B): 64 bytes (cache line), 1024 bytes (disk block).

Bandwidth is the rate at which data can be transferred in a sustained, bulk manner, scanning through contiguous memory locations. Main memory can supply data at 1/10th the rate of L1 cache. Latency is the amount of time that elapses between a request for data and the start of its arrival. Disk latency is roughly ten million times that of L1 cache. Block size is the chunk size in which data is transferred up to the next-fastest level of the hierarchy.


Memory hierarchy IV
The speed at which a program can run is always determined by some bottleneck: the CPU, main memory, or the disk. Programs where the bottleneck is main memory are called memory bound: most of the execution time is spent waiting for data to arrive from main memory or disk. A desktop machine with, say, a 2 GHz CPU effectively runs much slower if it is memory bound, e.g.:

If L2 cache is the bottleneck: effective speed roughly 1 GHz (about a Cray Y-MP supercomputer, circa 1988).
If main memory is the bottleneck: effective speed roughly 200 MHz (about a Cray 1 supercomputer, circa 1976).
If disk is the bottleneck: effective speed between < 1 MHz (about a 1981 IBM PC) and 80 MHz, depending on access patterns.

Latency and throughput I


Many operations (transfers of data from disk to memory, floating-point vector operations, network communication, etc.) follow a characteristic performance curve: slow for a small number of items, faster for a large number of items.

Let R_∞ be the asymptotic rate achievable (e.g., the bandwidth), and let t_0 be the latency. Then an operation on n items takes time

    t(n) = t_0 + n / R_∞

and the effective rate is

    R(n) = n / (t_0 + n / R_∞)
         = R_∞ / (1 + R_∞ t_0 / n)
         = R_∞ - R_∞^2 t_0 / n + O(1/n^2)

Latency and throughput II


E.g., with R_∞ = 1 and t_0 = 10:
[Plot: effective rate R(n) versus n, for R_∞ = 1 and t_0 = 10; the curve rises from near 0 and approaches the asymptotic rate R_∞ = 1 as n grows toward 200.]

A useful parameter: n_1/2 is the value of n at which half the asymptotic rate is attained: R(n_1/2) = R_∞ / 2. This gives n_1/2 = t_0 R_∞.
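As a concrete illustration, here is a minimal C++ sketch of this latency-throughput model. The parameter values are assumptions chosen to match the disk example on the next slide (R_∞ = 0.08 Gb/s, t_0 = 10 ms), not measurements:

    #include <cstdio>
    #include <initializer_list>

    int main() {
        // Assumed, illustrative parameters: a disk-like device.
        const double R_inf = 0.08e9;   // asymptotic rate, bytes/second
        const double t0    = 10e-3;    // latency, seconds

        // Half of the asymptotic rate is reached at n_1/2 = t_0 * R_inf.
        const double n_half = t0 * R_inf;
        std::printf("n_1/2 = %.0f bytes\n", n_half);   // 800000 bytes

        // Effective rate R(n) = n / (t_0 + n / R_inf) for a few chunk sizes.
        for (double n : {4096.0, 65536.0, n_half, 10 * n_half, 100 * n_half}) {
            double rate = n / (t0 + n / R_inf);
            std::printf("chunk %10.0f bytes -> %6.1f Mb/s (%3.0f%% of R_inf)\n",
                        n, rate / 1e6, 100.0 * rate / R_inf);
        }
    }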


Latency and throughput III


n_1/2 gives an approximate chunk size required to achieve close to the asymptotic performance. For example, a current typical disk has R_∞ = 0.08 Gb/s and t_0 = 10 ms. For these parameters, n_1/2 = 800000 bytes, almost 1 Mb. If you are dealing in chunks substantially smaller than n_1/2, actual performance may be a tiny fraction of R_∞.

Example: performance of a FAXPY operation on a desktop machine. A FAXPY operation looks like this:

    float X[N], Y[N];
    for (int i = 0; i < N; ++i)
        Y[i] = Y[i] + a*X[i];

Latency and throughput IV


(The term FAXPY comes from the BLAS library, a living relic of Fortran 77.) The following graph shows millions of floating point operations per second (Mflops/s) versus the length of the vectors, N. Each plateau corresponds to a level of the memory hierarchy.

Latency and throughput V


[Plot: DAXPY benchmark; Mflops/s versus vector length (log scale), peaking near 800 Mflops/s.]

The first part of the curve follows a typical latency-throughput curve (i.e., R_∞, n_1/2). It reaches a plateau corresponding to the L1 cache. Performance then drops to a second plateau, when the vectors fit in L2 cache. The third plateau is for main memory.

Latency and throughput VI


The abrupt drop at the far right of the graph indicates that the vectors no longer fit in memory, and the operating system is paging data to disk.

Units of memory transfer I


Memory is transferred between levels of the hierarchy in minimal block sizes:


Between CPU and cache: a word, typically 64 or 32 bits.
Between cache and main memory: a cache line, typically 32, 64 or 128 bytes. (The Pentium 4 has 64-byte cache lines.)
Between main memory and disk: a disk block or page, often 512 bytes to 4096 bytes for desktop machines, but sometimes 32 kb-256 kb for serving large media files.

Units of memory transfer II



The memory hierarchy, example: Intel Core Duo. Levels to the left are smaller and faster (decreasing latency); levels to the right are bigger and slower (increasing latency):

    Level         Size    Bandwidth   Latency             Transfer unit
    L1 cache      32 kb   20 Gb/s     3 clocks            word
    L2 cache      2 Mb    2-3 Gb/s    9 clocks            cache line
    Main memory   2 Gb                300 clocks          cache line
    Disk          1 Tb    60 Mb/s     4 000 000 clocks    disk block

The Core Duo has prefetching logic that can detect streams of data coming in from main memory.

Units of memory transfer III


Worst case: random access to a large working set, using only a fraction of the data in each cache line: get 26 Mb/s.
Best case: accessing a sequence of main memory locations sequentially, and using all the data: get 2 Gb/s.
Note: the best case is 80 times faster than the worst case!

Intel Data Prefetch Logic (DPL):
  Detects constant-stride memory accesses.
  After 2 cache lines are retrieved, starts prefetching.
  DPL can run 8 cache lines ahead of requests.
  Strides must be < 128 bytes.
  16 independent prefetch streams (12 ascending, 4 descending).
  Prefetching will not cross page boundaries (4096 bytes).
  Prefetching adjusts to bus traffic.
  Prefetching delivers 1 cache line every 60 clock cycles (about 1 byte/cycle).

Units of memory transfer IV


If your problem can be solved by processing streams of memory, then the limited L1-L2 cache size is irrelevant; can get 2Gb/s streaming in from main memory.


The ideal memory access pattern for good performance:


Appropriate-sized chunks (cache lines, blocks), in contiguous memory locations, e.g., reading a long sequence of disk pages that are stored consecutively.
Doing as much work as possible on a given piece of data before moving on to other memory locations.

Memory layout of arrays and data structures is a crucial performance issue.


Memory layout is largely beyond your control in languages like Java that provide a programming model far removed from the actual machine. We'll look at some examples in C/C++, where memory layout can be controlled at quite a fine level of detail.

Part II Memory layout


Matrix Multiplication I
Example: how to write a matrix-multiplication algorithm that will perform well?
NB: In general, you would be well advised to use a matrix multiplication routine from a high-performance library, and not waste time tuning your own. (But, the principles are worth knowing.)

Here is a naive matrix multiplication:

    for i = 1 to n
      for j = 1 to n
        for k = 1 to n
          C(i,j) = C(i,j) + A(i,k) * B(k,j)

If the matrices are small enough to fit in cache, this may perform acceptably. For large matrices, this is disastrously slow.

Matrix Multiplication II
Assuming C-style (row major) arrays, the layout of the B matrix in memory looks like this:
[Diagram: the B matrix in row-major memory layout; a shaded 64-byte cache line lies along a row of the matrix, while the innermost loop steps down a column.]

Each element of the array is a double that takes 8 bytes. The innermost loop (k) is iterating down a column of the B matrix. The shaded region shows a cache line of 64 bytes; it lies along a row of the matrix.

Matrix Multiplication III


If 64*n is greater than the cache size, each element of B accessed in the innermost loop will bring in an entire cache line from memory (64 bytes), of which only 8 bytes will be used. The remaining 56 bytes will be discarded before they are used. The innermost loop travels over a row of A, so it will travel along (rather than across) cache lines. Each step of the innermost loop will bring in 64 + 8 bytes from memory on average, but only use 16 bytes of it: an efficiency of only 22%. (By transposing the B matrix, we could make it run 4.5 times faster.)

Matrix Multiplication IV
A generally useful strategy is blocking (also called tiling): perform the matrix multiplication over submatrices of size m x m, where m is chosen so that 8m^2 is comfortably less than the cache size.

    A = [ A11 A12 ... ]     B = [ B11 B12 ... ]
        [ A21 A22 ... ]         [ B21 B22 ... ]
        [  .       .  ]         [  .       .  ]

If each submatrix is of size m x m, then a multiplication such as A11*B11 reads 16m^2 bytes from memory, but performs 2m^3 floating point operations. The number of floating point operations per memory access is 2m^3 / 16m^2 = m/8. As m is made bigger, the overhead of accessing main memory can be made very small: the expensive memory access is amortized over a large amount of computation.
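To make the idea concrete, here is a minimal C++ sketch of a blocked multiply (not the slides' own code); the block size BS is an assumed tuning parameter that would in practice be chosen from the cache size, and libraries such as ATLAS tune it automatically:

    #include <algorithm>
    #include <vector>

    // Blocked (tiled) matrix multiply: C += A * B, all n x n, row-major.
    // BS is an assumed tuning parameter; pick it so a few BS x BS blocks fit in cache.
    void blocked_matmul(const std::vector<double>& A, const std::vector<double>& B,
                        std::vector<double>& C, int n, int BS = 64) {
        for (int ii = 0; ii < n; ii += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int jj = 0; jj < n; jj += BS)
                    // Multiply the (ii,kk) block of A by the (kk,jj) block of B,
                    // accumulating into the (ii,jj) block of C; the three blocks
                    // stay cache-resident for the duration of this inner loop nest.
                    for (int i = ii; i < std::min(ii + BS, n); ++i)
                        for (int k = kk; k < std::min(kk + BS, n); ++k) {
                            double a = A[i * n + k];
                            for (int j = jj; j < std::min(jj + BS, n); ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }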


Matrix Multiplication V


Linear algebra libraries (e.g., ATLAS) can automatically tune themselves to a machine's memory hierarchy by choosing appropriate block sizes, unrollings of inner loops, etc. [2, 9].

Example: Linked lists I


Designing data structures that will perform well on memory hierarchies is nontrivial:
Data structures are often built from small pieces (e.g., nodes, vertices) that may be smaller than a cache line.
Manipulation of data structures often entails following pointers unpredictably through memory, so that prefetching hardware is unable to anticipate memory locations that will be needed.
Many techniques developed in the past for data structures on disk (called external memory data structures [1, 8]) are increasingly relevant as techniques for managing data structures in main memory!

Example: Linked lists are useful for maintaining lists of unpredictable size, e.g., queues, hash table chaining, etc. Recall that a simple linked list has a structure such as:

Example: Linked lists II


    struct ListNode {
        float data;
        ListNode* next;
    };

(Here we are storing a list of floating-point values.) The memory layout of this structure in a 64-byte cache line:

[Diagram: a 64-byte cache line containing a single 8-byte ListNode; the rest of the line is unrelated data.]

Each ListNode contains 8 bytes of data. But if it is brought in from main memory, we will read an entire cache line, say 64 bytes of data. Unless we're lucky, those extra 56 bytes contain nothing of value; we only get 4 bytes of actual useful data (the data field) for 64 bytes read from memory, an efficiency of only 4/64 = 6.25%.

Example: Linked lists III


This style of data structure is among the worst possible for reading from main memory:
Iterating through a list means pointer-chasing: following pointers to unpredictable memory locations. Some processors can exploit constant-stride memory access patterns and prefetch data, but are powerless against irregular access patterns.
Each list node accessed brings in 64 bytes of memory, only 4 bytes of which may be useful!

A better layout is to compromise between a linked list and an array: have linked list nodes that contain little arrays, sized so that each list node fills a cache line:

    struct ListNode2 {
        float data[14];
        int count;
        ListNode2* next;
    };

Example: Linked lists IV


This version stores up to 14 pieces of data per node, and uses the count field to keep track of how full it is. The ListNode2 structure is exactly 64 bytes long (assuming a 4-byte int and 4-byte pointers). We get up to 56 bytes of useful data for every 64-byte cache line: about 87.5% efficiency if most of the nodes are full.

The following graph shows benchmark results comparing the performance of an array, a linked list, and linked lists of arrays (sized so that each element of the list fills 1, 4, or 16 cache lines). The operation being timed is to scan through the list and add all the data elements; for the simple linked list this is:

    ListNode* p = first;
    do {
        s += p->data;
        p = p->next;
    } while (p != 0);
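For comparison, a scan over the list-of-arrays layout (ListNode2) might look like the following sketch (variable names are assumptions, not from the slides); the inner loop streams through a small contiguous array, so almost every byte of each fetched cache line is useful:

    // Sum the data in a list of ListNode2 nodes.
    float sum_list2(const ListNode2* first) {
        float s = 0.0f;
        for (const ListNode2* p = first; p != 0; p = p->next)
            for (int i = 0; i < p->count; ++i)   // contiguous 'data' array: one cache line
                s += p->data[i];
        return s;
    }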


Example: Linked lists V


[Plot: access rate (millions of items per second, 0-450) versus number of items in the list (log scale), comparing an array, a linked list, and lists of arrays sized to 1, 4, and 16 cache lines.]

Example: Linked lists VI


[Plot: list benchmark; millions of items per second (0-450) versus number of items (log scale), comparing an array, a linked list (contiguous and permuted node order), and a list of arrays of 4 cache lines (contiguous and permuted node order).]

Example: Linked lists VII

In the L1 cache, the linked list is very fast. But compare performance across the other regions of the memory hierarchy (numbers are millions of items per second):

                   Array   Linked list   Linked list of arrays (1 cache line)
    L1 cache         411           411                                    317
    L2 cache         406            90                                    285
    Main memory      377             7                                     84

In L2 cache, the list-of-arrays is 3 times faster than the linked list; in main memory it is 12 times faster. In main memory, the linked list is 50 times slower than the array. The regular memory access pattern of the array allows the prefetching hardware to predict what memory will be needed in advance, resulting in the array's high performance.

Lessons:

Example: Linked lists VIII

Data structure nodes of size less than one cache line can cause serious performance problems when working out of cache. Performance can sometimes be improved by making fatter nodes that are one or more cache lines long.
Nothing beats accessing a contiguous sequence of memory locations (an array). For this reason, many cache-efficient data structures bottom out to an array of some size at the last level.

Part III Locality


Locality I
Efficient operation of the memory hierarchy rests on two basic strategies:

Caching memory contents, in the hope that memory, once used, is likely to be soon revisited.
Anticipating where the attention of the program is likely to wander, and prefetching those memory contents into cache.

The caches and prefetching strategies of the memory hierarchy are only effective if memory access patterns exhibit some degree of locality of reference.

Temporal locality: if a memory location is accessed at time t, it is more likely to be referenced again at times close to t.
Spatial locality: if memory location x is referenced, one is more likely to access memory locations close to x.

Locality II
Denning's notion of a working set [3, 4]:

Let W(t, τ) be the set of items accessed during the time interval (t - τ, t). This is called the working set (or locality set). Denning's thesis was that programs progress through a series of working sets or locales, and while in a locale do not make (many) references outside of it. Optimal cache management then consists of guaranteeing that the locale is present in high-speed memory when a program needs it. Working sets are a model that programs often adhere to. One of the early triumphs of the concept was explaining why overloaded multiprocess systems ground to a halt (thrashing) instead of degrading gracefully: thrashing occurred when there was not enough room in memory for the working sets of all the processes at the same time.

Locality III

The memory hierarchy has evolved around the concept of a working set, with the result that "programs have small working sets" has gone from being a descriptive statement to a prescription for good performance. Let ω(t, τ) = |W(t, τ)| be the number of distinct items accessed in the interval (t - τ, t), i.e., the size of the working set. If ω(t, τ) grows rapidly as a function of τ, then caching of recently used data will have little effect. If however it levels off, then a high hit rate in the cache can be achieved.
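To make the definition concrete, here is a small illustrative C++ sketch (the trace, the window length, and the symbol ω are assumptions for illustration) that computes the working-set size from a recorded trace of accessed addresses:

    #include <cstddef>
    #include <cstdio>
    #include <set>
    #include <vector>

    // Working-set size: the number of distinct addresses touched in the
    // half-open window [t - tau, t) of an access trace.
    std::size_t working_set_size(const std::vector<std::size_t>& trace,
                                 std::size_t t, std::size_t tau) {
        std::set<std::size_t> distinct;
        std::size_t start = (t > tau) ? t - tau : 0;
        for (std::size_t i = start; i < t && i < trace.size(); ++i)
            distinct.insert(trace[i]);
        return distinct.size();
    }

    int main() {
        // Made-up trace: a program loops over a small locale {1,2,3,4},
        // then moves to a new locale {100,101,102}.
        std::vector<std::size_t> trace = {1,2,3,4, 1,2,3,4, 1,2,3,4,
                                          100,101,102, 100,101,102};
        std::printf("size at t=12, tau=12: %zu\n", working_set_size(trace, 12, 12)); // 4
        std::printf("size at t=18, tau=6:  %zu\n", working_set_size(trace, 18, 6));  // 3
    }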


Locality IV
[Plot: working-set size ω(t, τ) versus window length τ. A curve that keeps growing past the cache size means caching is ineffective; a curve that levels off below the cache size indicates good locality of reference.]

Algorithms can sometimes be manipulated to improve their locality of reference. The matrix multiplication example seen earlier was an example: by multiplying m x m blocks instead of columns, one can do O(m^3) work while reading only O(m^2) memory locations.

Locality V

Good compilers will try to do this automatically for certain simple arrangements of loops, but the general problem is too difficult for automated solutions.

Locality VI


Some standard strategies for increasing locality:


Blocking and tiling: decompose multidimensional data sets into a collection of blocks or tiles. Example:


If a large image is stored on disk column-by-column, then rendering a small region (dotted line) requires retrieving many disk pages (left). If instead the image is broken into tiles, many fewer pages are needed (right).


Locality VII


Iteration-space tiling or pipelining: in certain multistep algorithms, instead of performing one step over the entire structure, one can perform several steps over each local region.


Locality VIII
Space-filling curves: if an algorithm requires traversing some multidimensional space, one can sometimes improve locality by following a space-filling curve, for example a Z-order (Morton) or Hilbert curve (see the sketch below).
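As an illustrative sketch (not from the slides), a Z-order index for a 2-D grid can be computed by interleaving the bits of the row and column indices; visiting cells in increasing Morton order keeps nearby cells close together in memory or on disk:

    #include <cstdint>

    // Interleave the bits of x and y to form a Z-order (Morton) index.
    // Cells with nearby (x, y) coordinates get nearby Morton indices.
    uint32_t morton2d(uint16_t x, uint16_t y) {
        auto spread = [](uint32_t v) {      // insert a 0 bit between each bit of v
            v = (v | (v << 8)) & 0x00FF00FFu;
            v = (v | (v << 4)) & 0x0F0F0F0Fu;
            v = (v | (v << 2)) & 0x33333333u;
            v = (v | (v << 1)) & 0x55555555u;
            return v;
        };
        return spread(x) | (spread(y) << 1);
    }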


For general data sets, a similar effect can be achieved by using multilevel minimum linear arrangement (MultiMinLA).

Locality IX


Graph partitioning: if a dataset lacks a clear geometric structure, one can sometimes still achieve results similar to tiling by partitioning the dataset into regions that have small perimeters relative to the amount of data they contain. There are numerous good algorithms for this, in particular from the VLSI and parallel computing communities [7, 5, 6]. Indeed, the idea of tiling can be regarded as a special case of graph partitioning for very regular graphs.


Locality X
Vertical partitioning: the term comes from databases, but the concept applies to the design of memory layouts also. A simple example is the layout of complex-valued arrays, where each element has a real and an imaginary component. One could store pairs (a, b) for each array element to represent a + bi, or one could have two separate arrays, one for the real and one for the imaginary component. These two layouts have very different performance characteristics! Another common situation where vertical partitioning can apply: suppose a computation proceeds in phases, and different information is needed during each phase. E.g., in a graph algorithm one might have:

Locality XI


    struct Edge {
        Node* from;
        Node* to;
        float weight;   /* Needed in phase one */
        Edge* parent;   /* Phase two: union-find */
    };

Instead of having all the data needed for each phase in one structure, it might be better to break up the Edge object into several structures, one for each phase of the algorithm.
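One possible way to split it (an illustrative sketch, not the slides' own code) is to keep one parallel array per phase, indexed by edge number, so each phase scans only the fields it actually touches:

    #include <vector>

    struct Node;                   // defined elsewhere

    // Phase one only needs the endpoints and the weight.
    struct EdgePhase1 {
        Node* from;
        Node* to;
        float weight;
    };

    // Phase two (union-find) only needs the parent links.
    struct EdgePhase2 {
        int parent;                // index of the parent edge, or -1
    };

    // Parallel arrays, indexed by edge number: phase one streams through
    // 'phase1' without dragging the union-find data through the cache,
    // and vice versa for phase two.
    struct EdgeData {
        std::vector<EdgePhase1> phase1;
        std::vector<EdgePhase2> phase2;
    };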


Faster binary search I


Binary search performs rather poorly in main memory, due to


Non-constant-stride access patterns;
Using only a fraction of each cache line (see the sketch below).
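For reference, the kind of binary search being criticized here looks like the following generic C++ sketch; each probe lands at a hard-to-predict address and uses only one 4-byte key out of the 64-byte cache line it pulls in:

    #include <cstdint>

    // Standard binary search over a sorted array of 4-byte keys.
    // Each iteration probes an unpredictable location, so a probe that misses
    // in cache costs a full cache-line fill for only 4 useful bytes.
    int binary_search(const uint32_t* a, int n, uint32_t key) {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if (a[mid] == key) return mid;
            if (a[mid] < key)  lo = mid + 1;
            else               hi = mid - 1;
        }
        return -1;                 // not found
    }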


Faster binary search II


[Diagram: binary search on N = 65536 uint32 keys, requiring about log(N) - 3 distinct cache lines; the early probes land in main memory, later ones in L2, and the final probes in the L1 cache. Each access retrieves a cache line (64 bytes), only 4 bytes of which are used.]

Faster binary search III


Faster binary search IV


[Diagram: binary search on N = 512, with 8 elements per cache line. Red shows a cache line retrieved, pink shows a data element used.]

Cacheline Tree I

Asymptotically, 4 times faster than binary search in main memory (assuming 16 keys per cache line). In practice, about 7 times faster than STL map access for random lookups.

Essentially a B+-tree where each page is a single cache line. Radix of tree = cache line size / sizeof(Key); e.g., with 4-byte keys, we have a radix-16 tree.
Indices result in a space overhead of 1/(R - 1), where R is the radix; e.g., 1/15 (about 6.7%) when R = 16.
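The slides do not give code for the cacheline tree, but a minimal sketch of the node layout and the per-node step might look like this (the convention that keys[i] is the smallest key under child i, and how children are located, are assumptions made here for illustration):

    #include <cstdint>

    // One node = one 64-byte cache line holding 16 sorted 4-byte keys.
    struct alignas(64) CachelineNode {
        uint32_t keys[16];
    };

    // Choose which of the 16 children to descend into, assuming keys[i] holds
    // the smallest key stored under child i.  All comparisons touch the same
    // cache line, so each level of the tree costs one cache-line fill.
    int child_index(const CachelineNode& node, uint32_t key) {
        int i = 0;
        while (i + 1 < 16 && node.keys[i + 1] <= key)
            ++i;
        return i;
    }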


Cacheline Tree II

[Diagram: a cacheline tree over N = 65536 keys; successive levels index 16, 256, 4096, and 65536 keys. A find requires log(N)/4 cache lines (1/4 of binary search): 4 cache lines to find an element when N = 65536. Space = 273 kb (+6.6%).]

Cacheline tree: benchmarks I



[Plot: VBTree find benchmark; time per find (µs) versus number of elements in the tree (log scale), comparing VBTree (mmap), STL map, CLTree, and binary search.]

Cacheline tree: benchmarks II



[Plot: VBTree find benchmark; performance relative to CLTree versus number of elements in the tree (log scale), comparing VBTree (mmap), STL map, CLTree, and binary search.]

Disk types
Traditional platter/head
High latency for random access: the head has to move into position, and the platter has to rotate to the correct position.
Huge capacities: 1 Tb is quite affordable (about $600).

Solid-state drive (SSD)


A hard-drive form factor, but full of memory chips.
Low-end: 8-64 Gb of flash memory. Standard hard drive connection (e.g., SATA). Random access times can be fairly fast, but so far read/write rates are comparable to hard drives (e.g., 70 Mb/s). Increasingly used in laptops. Wear levelling is used to work around the lifetime problems of flash memory. Prices are at consumer levels ($400), but the cost is about $10/Gb (vs. $0.60/Gb for hard drives).
High-end: up to 1 Tb. Multiple Fibre Channel connections at 4 Gbit/s, random access time < 10 µs, throughput 1-2 Gbytes/s. Cost on the order of $1,000/Gb.

Disk benchmarks


[Plots: disk I/O benchmark (Linux, plato.logicblox.com). Left: throughput (MB/s, 0-70) versus chunk size, for random synchronous reads, sequential synchronous reads, and 2-stream synchronous reads. Right: time to read a chunk (s) versus chunk size, for the same three access patterns.]

Disk benchmarks


[Plot: disk read throughput (MBytes/second, 0-60) versus skip distance (# pages), for page sizes of 8k, 64k, and 256k; throughput falls from the sequential-access rate toward the random-access rate as the skip distance grows.]

Disk benchmarks



[Contour plot: time to complete a read (ms) on kanushu.uwaterloo.ca, as a function of page size (16 kb to 8192 kb) and skip distance (pages), ranging from sequential to random access. Read times range from a few ms for large sequential reads to tens of ms for small random reads.]

Optimizing disk reads I


Random access is very slow: e.g., with 8 kb pages, you obtain only 3% of the possible throughput doing random page reads.
Sequential access to small pages is also slow: you get only 25% of the possible throughput.
Access is fastest for (a) sequential or near-sequential access; (b) big chunks (many pages at once).
If you need some subset of the pages in a locale, it is far more efficient to just read the entire disk region sequentially than to issue a series of individual synchronous read requests.

E.g., suppose we need to read pages #0, 8, 16, 32, 40, 48, 56, 64. With individual, synchronous reads this takes 28 ms.

Optimizing disk reads II


But reading pages 0-64 with a single read (520 kb) would take only 16 ms; just junk (or prioritize for eviction) the superfluous pages.
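A back-of-the-envelope sketch of that trade-off in C++; the latency and bandwidth figures below are assumptions in the spirit of the benchmarks above, not measurements:

    #include <cstdio>

    int main() {
        // Assumed, illustrative disk parameters.
        const double seek_ms   = 3.5;     // latency per synchronous request
        const double mb_per_ms = 0.042;   // ~42 MB/s sustained transfer
        const double page_kb   = 8.0;

        // Option 1: eight individual synchronous reads of one 8 kb page each.
        double individual = 8 * (seek_ms + (page_kb / 1024.0) / mb_per_ms);

        // Option 2: one sequential read of pages 0..64 (520 kb), junking the
        // pages we did not actually need.
        double bulk = seek_ms + (520.0 / 1024.0) / mb_per_ms;

        std::printf("individual reads: %.0f ms\n", individual);   // ~30 ms
        std::printf("one bulk read:    %.0f ms\n", bulk);         // ~16 ms
    }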


Given a disk model (e.g., the contour plot on the previous slide), one can optimally partition a set of pages to be read into efficiently readable chunks using dynamic programming.
To get decent disk performance, one needs to approach paging as an offline rather than an online problem: plan and re-order page requests to get maximal throughput.

RAID I
RAID = Redundant Array of Inexpensive Disks. RAID can serve several purposes:

Overcome throughput limitations by having multiple drives:
    1 disk = 70 Mb/s; 10 disks = 700 Mb/s.

Easy recovery from hard drive crashes: store redundant data on drives, so that errors can be corrected. E.g., with three drives:
    Disk A     Disk B     Disk C = A ⊕ B (XOR)

Then we get twice the throughput of 1 disk, plus if any one of the drives crashes, the data is recoverable. (E.g., if A crashes, then we can use A = B ⊕ C to recover it.)
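A tiny sketch of the parity idea (illustrative only, not a real RAID implementation): the parity block is the byte-wise XOR of the data blocks, and a single lost block is recovered as the XOR of the surviving ones:

    #include <array>
    #include <cassert>
    #include <cstddef>
    #include <cstdint>

    using Block = std::array<uint8_t, 512>;

    // C = A xor B, byte by byte (the parity block stored on disk C).
    Block xor_blocks(const Block& a, const Block& b) {
        Block c{};
        for (std::size_t i = 0; i < c.size(); ++i) c[i] = a[i] ^ b[i];
        return c;
    }

    int main() {
        Block a{}, b{};
        a[0] = 0x12; b[0] = 0x34;          // some sample data
        Block c = xor_blocks(a, b);        // parity written to disk C

        // Disk A "crashes": recover its contents as B xor C.
        Block recovered = xor_blocks(b, c);
        assert(recovered == a);            // A = B xor C
    }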


RAID II
RAID 0: striping

    Disk A   Disk B   Disk C
       0        1        2
       3        4        5
       6        7        8
      ...      ...      ...

RAID 1: mirroring

    Disk A   Disk B   Disk C
       0        0        0
       1        1        1
       2        2        2
      ...      ...      ...

Mirroring duplicates data over multiple disks. Same throughput as a single disk, but fault-tolerant. Better RAID systems let you hot-swap drives.

RAID III


Higher levels of RAID: combinations of striping and error recovery.


External Memory Data Structures I


External memory data structures (also known as out-of-core data structures) are used when the data set is too large to fit in memory, and must be stored on disk. Databases are the primary application. Some good surveys: [1, 8]. Basic concerns:

Disk has very high latency: often ~10^7 clock cycles.
Bandwidth is lower than main memory, perhaps 1/10th to 1/100th the rate.
Block sizes are very large (multiples of 1 kbyte) compared to main memory; to get good throughput with random access, huge blocks are needed (e.g., > 1 Mb).
Only a small fraction of the data can be in memory at once; only a minute fraction can be in cache.
Disk space is cheap (in dollar cost).

External Memory Data Structures II


Basic coping strategies:
High latency implies a very large n_1/2, so use fat data structure nodes that contain a lot of data, stored contiguously in arrays.
The high latency (~10^7 clock cycles) means that a lot of computational resources can be expended to decide on the best strategy for storing and retrieving data:
  External memory data structures often manage their own I/O, rather than relying on the operating system's page management, since they can better predict what pages to prefetch and evict: treat storage management as an offline rather than an online problem.
  Databases expend a lot of effort to find a good strategy for answering a query, since a bad strategy can be disastrously costly.

To improve the rate of data transfer, data compression and parallel I/O (the parallel disk model) are often used.

External Memory Data Structures III


Since disk space is cheap, it can pay off to duplicate data in several different forms, each appropriate for a certain class of queries.

Classic example: the B-tree, a balanced multiway search tree suitable for external memory.

Each node has Θ(B) children, and the tree is kept balanced, so that the height is roughly log_B N.
Supported operations: find, insert, delete, and range search (retrieve all records in an interval [k1, k2]).
When insertions cause a node to overflow, it is split into two half-full nodes, which can cause the parent node to overflow and split, etc.
Typically the branching factor is very large, so that a massive data set can be stored in a B-tree of height 3 or 4, for example.
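A hypothetical node layout for such a tree (field names, sizes, and the branching factor below are illustrative assumptions, not from the slides):

    #include <cstdint>

    // Illustrative B-tree node sized to fit one 4096-byte disk page.
    // With 8-byte keys and 8-byte child page numbers, B is on the order of 250,
    // so a tree of height 3 or 4 already indexes an enormous number of records.
    constexpr int B = 255;                  // assumed branching factor

    struct BTreeNode {
        int32_t num_keys;                   // how many entries of keys[] are in use
        int32_t is_leaf;                    // 1 if this node is a leaf
        int64_t keys[B - 1];                // sorted keys
        int64_t children[B];                // disk page numbers of the children
    };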


Summary: Principles for effective use of the memory hierarchy I


Algorithms should exhibit locality of reference; when possible, rearrange the order in which computations are performed so as to reuse data that will be in cache. The ability of the memory hierarchy to cache and prefetch data depends on predictable access patterns. When working out-of-cache:
Performance is best when a long, contiguous sequence of memory locations is accessed (e.g., scanning an array).
Performance is worst when memory accesses are unpredictable, touching many small pieces of data scattered through memory (e.g., pointer-chasing through linked lists, binary search trees, etc.).

Summary: Principles for effective use of the memory hierarchy II


Design data layouts so that:
Items used together are stored together.
Items not used together are stored apart.
If the nodes of a data structure reside at a level of the memory hierarchy where transfer blocks are of size B, then those nodes should be of size k·B when possible, where k ≥ 1; i.e., for main-memory data structures, use nodes that are several cache lines long, and for data structures on disk, use nodes that are several pages in size.
Maximize the amount of useful information in each block.

Bibliography I
[1] Lars Arge. External memory data structures. In Handbook of Massive Data Sets, pages 313-357. Kluwer Academic Publishers, Norwell, MA, USA, 2002.
[2] Jeff Bilmes, Krste Asanović, Chee-Whye Chin, and Jim Demmel. Optimizing matrix multiply using PHiPAC: a Portable, High-Performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing, Vienna, Austria, July 1997.
[3] Peter J. Denning. The working set model for program behavior. Commun. ACM, 11(5):323-333, 1968.

Bibliography II
[4] Peter J. Denning. The locality principle. In J. Barria, editor, Communication Networks and Computer Systems, pages 43-67. Imperial College Press, 2006.
[5] Josep Díaz, Jordi Petit, and Maria Serna. A survey of graph layout problems. ACM Comput. Surv., 34(3):313-356, 2002.
[6] Ulrich Elsner. Graph partitioning: A survey. Technical Report S393, Technische Universität Chemnitz, 1997.

Bibliography III
[7] Bruce Hendrickson and Robert Leland. A multilevel algorithm for partitioning graphs. In Supercomputing '95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CDROM), page 28, New York, NY, USA, 1995. ACM Press.
[8] Jeffrey Scott Vitter. External memory algorithms and data structures: dealing with massive data. ACM Comput. Surv., 33(2):209-271, 2001.
[9] R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, January 2001.
