1 / 67
2 / 67
Memory hierarchy I
So far we have primarily been concerned with asymptotic efficiency of algorithms. Having chosen an algorithm that is efficient in theory, there are still serious engineering challenges to getting high performance in practice. To obtain decent performance on problems involving nontrivial amounts of data, you need to understand:
1. How the memory hierarchy works;
2. How to take advantage of this.
Historical performance of a typical desktop machine:

Year                1980   1990   2000
Clock cycle (ns)     500     50      1
DRAM latency (ns)    375    100     60
Disk latency (ms)     87     28      8
3 / 67
Memory hierarchy II
In two decades:
CPU: 500x faster clock cycle. Main memory: 6x faster. Disk: 10x faster.
It can now take hundreds of clock cycles to access data in main memory: Memory is the new disk. To mitigate the widening gap between CPU speed and memory access time, an elaborate memory hierarchy has evolved.
4 / 67
Memory hierarchy III

Bandwidth is the rate at which data can be transferred in a sustained, bulk manner, scanning through contiguous memory locations. Main memory can supply data at 1/10th the rate of L1 cache.
Latency is the amount of time that elapses between a request for data and the start of its arrival. Disk is a million times slower than L1 cache.
Block size is the chunk size in which data is transferred up to the next-fastest level of the hierarchy.
5 / 67
Memory hierarchy IV
The speed at which a program can run is always determined by some bottleneck: the CPU, main memory, or the disk. Programs where the bottleneck is main memory are called memory bound: most of the execution time is spent waiting for data to arrive from main memory or disk. A desktop machine with, say, a 2 GHz CPU effectively runs much slower if it is memory bound, e.g.:
If L2 cache is the bottleneck: effective speed about 1 GHz (about a Cray Y-MP supercomputer, circa 1988).
If main memory is the bottleneck: effective speed about 200 MHz (about a Cray 1 supercomputer, circa 1976).
If disk is the bottleneck: effective speed between 1 MHz (about a 1981 IBM PC) and 80 MHz, depending on access patterns.
6 / 67
7 / 67
A useful parameter: n_1/2 is the value at which half the asymptotic rate is attained: R(n_1/2) = R∞/2, which gives n_1/2 = t0 · R∞.
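For completeness, a short derivation, assuming the usual latency-bandwidth model in which transferring n items takes T(n) = t0 + n/R∞ (t0 = latency, R∞ = asymptotic rate; this model is implied but not stated explicitly here):

R(n) = n / T(n) = n / (t0 + n/R∞)
Setting R(n_1/2) = R∞/2 gives n_1/2 / (t0 + n_1/2/R∞) = R∞/2, hence 2·n_1/2 = t0·R∞ + n_1/2, i.e. n_1/2 = t0·R∞.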
8 / 67
Example: performance of a FAXPY on a desktop machine. A FAXPY operation looks like this:

float X[N], Y[N];
for (int i = 0; i < N; ++i)
    Y[i] = Y[i] + a * X[i];
9 / 67
(The term FAXPY comes from the BLAS library, a living relic from Fortran 77.) The following graph shows millions of floating point operations per second (Mflops/s) versus the length of the vectors, N. Each plateau corresponds to a level of the memory hierarchy.
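For concreteness, here is one way such a measurement could be made; the flop count of 2 per element and the use of std::chrono are assumptions of this sketch, and a real benchmark would also have to keep the compiler from optimizing the loop away:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Measure FAXPY throughput in Mflops/s for a given vector length N.
// Each inner-loop iteration does one multiply and one add (2 flops).
double faxpy_mflops(std::size_t N, int repeats = 100) {
    std::vector<float> X(N, 1.0f), Y(N, 2.0f);
    const float a = 3.0f;
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        for (std::size_t i = 0; i < N; ++i)
            Y[i] = Y[i] + a * X[i];
    auto stop = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(stop - start).count();
    return 2.0 * N * repeats / seconds / 1e6;   // Mflops/s
}

int main() {
    for (std::size_t N = 16; N <= (1u << 24); N *= 4)
        std::printf("N = %zu: %.1f Mflops/s\n", N, faxpy_mflops(N));
}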
10 / 67
[Figure: FAXPY performance (Mflops/s) versus vector length.]
The first part of the curve follows a typical latency-throughput curve (i.e., characterized by R∞ and n_1/2). It reaches a plateau corresponding to the L1 cache. Performance then drops to a second plateau, when the vectors fit in L2 cache. The third plateau is for main memory.
11 / 67
The abrupt drop at the far right of the graph indicates that the vectors no longer fit in memory, and the operating system is paging data to disk.
12 / 67
13 / 67
14 / 67
15 / 67
17 / 67
Matrix Multiplication I
Example: how to write a matrix-multiplication algorithm that will perform well?
NB: In general, you would be well advised to use a matrix multiplication routine from a high-performance library, and not waste time tuning your own. (But, the principles are worth knowing.)
Here is a naive matrix multiplication:

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)

If the matrices are small enough to fit in cache, this may perform acceptably. For large matrices, this is disastrously slow.
18 / 67
Matrix Multiplication II
Assuming C-style (row major) arrays, the layout of the B matrix in memory looks like this:
[Figure: memory layout of the B matrix; in row-major order, consecutive addresses run along a row: B(1,1), B(1,2), B(1,3), ...]

Each element of the array is a double that takes 8 bytes. The innermost loop (k) iterates down a column of the B matrix, so consecutive accesses are a full row apart in memory. Light blue shows a cache line of 64 bytes; it lies along a row of the matrix.
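To make the stride concrete, a small sketch of the row-major address arithmetic (the helper function is illustrative):

#include <cstddef>

// Byte offset of element B(k, j) in a row-major n-by-n array of doubles.
// Consecutive values of k (the innermost loop above) are n * 8 bytes apart,
// so each iteration lands on a different cache line once n * 8 >= 64.
std::size_t row_major_offset(std::size_t k, std::size_t j, std::size_t n) {
    return (k * n + j) * sizeof(double);
}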
19 / 67
20 / 67
Matrix Multiplication IV
A generally useful strategy is blocking (also called tiling): perform the matrix multiplication over submatrices of size m × m, where m is chosen so that the blocks fit in cache (a block of doubles occupies 8m² bytes).

A = [ A11 A12 ... ; A21 A22 ... ; ... ]    B = [ B11 B12 ... ; B21 B22 ... ; ... ]

If each submatrix is of size m × m, then a multiplication such as A11·B11 requires 16m² bytes from memory, but performs 2m³ floating point operations. The number of floating point operations per memory access is 2m³ / 16m² = m/8. As m is made bigger, the overhead of accessing main memory can be made very small: the expensive memory access is amortized over a large amount of computation.
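A minimal sketch of the blocked loop structure, assuming row-major storage and that n is a multiple of the block size m (edge handling omitted):

#include <cstddef>

// Blocked (tiled) matrix multiply: C += A * B, all n-by-n, row major.
// m is chosen so that a few m-by-m blocks of doubles fit comfortably in cache.
void matmul_blocked(const double* A, const double* B, double* C,
                    std::size_t n, std::size_t m) {
    for (std::size_t i0 = 0; i0 < n; i0 += m)
        for (std::size_t j0 = 0; j0 < n; j0 += m)
            for (std::size_t k0 = 0; k0 < n; k0 += m)
                // Multiply the m-by-m blocks starting at (i0,k0) and (k0,j0).
                for (std::size_t i = i0; i < i0 + m; ++i)
                    for (std::size_t k = k0; k < k0 + m; ++k) {
                        double a = A[i * n + k];
                        for (std::size_t j = j0; j < j0 + m; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}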
21 / 67
Matrix Multiplication V
Linear algebra libraries (e.g. ATLAS) can automatically tune themselves to a machine's memory hierarchy by choosing appropriate block sizes, unrollings of inner loops, etc. [2, 9].
22 / 67
Example: Linked lists are useful for maintaining lists of unpredictable size, e.g., queues, hash table chaining, etc. Recall that a simple linked list has a structure such as:
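A minimal node definition consistent with the 8-byte nodes described below (assuming 4-byte pointers, i.e. a 32-bit machine):

struct ListNode {
    float data;       // 4 bytes of useful payload
    ListNode* next;   // 4 bytes on a 32-bit machine, for 8 bytes per node
};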
23 / 67
Each ListNode contains 8 bytes of data. But if it is brought in from main memory, we will read an entire cache line, say 64 bytes of data. Unless we're lucky, those extra 56 bytes contain nothing of value; we only get 4 bytes of actually useful data (the data field) for 64 bytes read from memory, an efficiency of only 4/64 = 6.25%.
24 / 67
A better layout is to compromise between a linked list and an array: have linked-list nodes that contain little arrays, sized so that each list node fills a cache line:

struct ListNode2 {
    float data[14];
    int count;
    ListNode2* next;
};
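As a usage sketch, traversing such a list stays within a single cache line for most of each node (the helper below is illustrative):

// Traverse a list of small arrays; within each node the payload is contiguous,
// so most reads hit the cache line that the node occupies.
float sum_list(const ListNode2* head) {
    float total = 0.0f;
    for (const ListNode2* p = head; p != nullptr; p = p->next)
        for (int i = 0; i < p->count; ++i)
            total += p->data[i];
    return total;
}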
25 / 67
26 / 67
[Figure: traversal rate versus number of items for: Array, Linked List, List of arrays (1 cacheline), List of arrays (4 cachelines), List of arrays (16 cachelines).]
27 / 67
[Figure: millions of items/s versus number of items.]
28 / 67
In the L1 cache, the linked list is very fast. But compare performance in other regions of the memory hierarchy (numbers are millions of items per second):

                Array   Linked list   List of arrays (1 cache line)
L1 cache          411           411                             317
L2 cache          406            90                             285
Main memory       377             7                              84

In L2 cache, the list-of-arrays is 3 times faster; in main memory it is 12 times faster. In main memory, the linked list is 50 times slower than the array. The regular memory access pattern for the array allows the memory prefetching hardware to predict what memory will be needed in advance, resulting in the high performance for the array. Lessons:
29 / 67
Data structure nodes of size less than one cache line can cause serious performance problems when working out-of-cache. Performance can sometimes be improved by making fatter nodes that are one or more cache lines long.
Nothing beats accessing a contiguous sequence of memory locations (an array). For this reason, many cache-efficient data structures bottom out to an array of some size at the last level.
30 / 67
31 / 67
Locality I
Efficient operation of the memory hierarchy rests on two basic strategies:

Caching memory contents, in the hope that memory, once used, is likely to be soon revisited.
Anticipating where the attention of the program is likely to wander, and prefetching those memory contents into cache.

The caches and prefetching strategies of the memory hierarchy are only effective if memory access patterns exhibit some degree of locality of reference.

Temporal locality: if a memory location is accessed at time t, it is more likely to be referenced again at times close to t.
Spatial locality: if memory location x is referenced, one is more likely to access memory locations close to x.
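A standard illustration of spatial locality, assuming a row-major n × n array of doubles:

#include <cstddef>

// Row-major traversal: consecutive iterations touch adjacent addresses
// (good spatial locality).
double sum_by_rows(const double* A, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s += A[i * n + j];
    return s;
}

// Column-major traversal of the same array: consecutive iterations are
// n*8 bytes apart, so nearly every access touches a new cache line
// (poor spatial locality).
double sum_by_cols(const double* A, std::size_t n) {
    double s = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            s += A[i * n + j];
    return s;
}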
32 / 67
Locality II
Denning's notion of a working set [3, 4]:

Let W(t, τ) be the set of items accessed during the time interval (t − τ, t). This is called the working set (or locality set). Denning's thesis was that programs progress through a series of working sets or locales, and while in a locale do not make (many) references outside of it. Optimal cache management then consists of guaranteeing that the locale is present in high-speed memory when a program needs it. Working sets are a model that programs often adhere to. One of the early triumphs of the concept was explaining why multiprocess systems, when overloaded, ground to a halt (thrashing) instead of degrading gracefully: thrashing occurred when there was not room in memory for the working sets of all the processes at the same time.
33 / 67
Locality III
The memory hierarchy has evolved around the concept of a working set, with the result that "programs have small working sets" has gone from being a descriptive statement to a prescription for good performance. Let w(t, τ) = |W(t, τ)| be the number of distinct items accessed in the interval (t − τ, t), i.e. the size of the working set. If w(t, τ) grows rapidly as a function of τ, then caching of recently used data will have little effect. If however it levels off, then a high hit rate in the cache can be achieved.
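As an illustration, w(t, τ) could be computed from a recorded address trace along these lines (the trace representation is an assumption of this sketch):

#include <cstddef>
#include <unordered_set>
#include <vector>

// Number of distinct addresses among the tau references preceding time t:
// w(t, tau) = |W(t, tau)|.
std::size_t working_set_size(const std::vector<std::size_t>& trace,
                             std::size_t t, std::size_t tau) {
    std::unordered_set<std::size_t> distinct;
    std::size_t begin = (tau < t) ? t - tau : 0;
    for (std::size_t i = begin; i < t && i < trace.size(); ++i)
        distinct.insert(trace[i]);
    return distinct.size();
}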
34 / 67
Locality IV
[Figure: working-set size w(t, τ) as a function of τ, with the cache size marked; caching is ineffective once w(t, τ) exceeds the cache size.]
Algorithms can sometimes be manipulated to improve their locality of reference. The matrix multiplication seen earlier is one example: by multiplying m × m blocks instead of columns, one can do O(m³) work while reading only O(m²) memory locations.
35 / 67
Locality V
Good compilers will try to do this automatically for certain simple arrangements of loops, but the general problem is too difficult for automated solutions.
36 / 67
Locality VI
If a large image is stored on disk column-by-column, then rendering a small region (dotted line) requires retrieving many disk pages (left). If instead the image is broken into tiles, many fewer pages are needed (right).
37 / 67
Locality VII
Iteration-space tiling or pipelining: in certain multistep algorithms, instead of performing one step over the entire structure, one can perform several steps over each local region.
38 / 67
Locality VIII
Space-filling curves: if an algorithm requires traversing some multidimensional space, one can sometimes improve locality by following a space-filling curve, for example a Z-order (Morton) or Hilbert curve. For general data sets, a similar effect can be achieved by using multilevel minimum linear arrangement (MultiMinLA).
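As one concrete way to linearize a 2-D traversal, a sketch of a Z-order (Morton) index for 16-bit coordinates (this particular curve and routine are illustrative choices):

#include <cstdint>

// Interleave the bits of x and y to get a Z-order (Morton) index.
// Points that are close in 2-D tend to be close in Morton order,
// which improves locality when visiting them in index order.
uint32_t morton2d(uint16_t x, uint16_t y) {
    auto spread = [](uint32_t v) {          // insert a 0 bit between each bit of v
        v = (v | (v << 8)) & 0x00FF00FFu;
        v = (v | (v << 4)) & 0x0F0F0F0Fu;
        v = (v | (v << 2)) & 0x33333333u;
        v = (v | (v << 1)) & 0x55555555u;
        return v;
    };
    return spread(x) | (spread(y) << 1);
}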
39 / 67
Locality IX
Graph partitioning: if a dataset lacks a clear geometric structure, one can sometimes still achieve results similar to tiling by partitioning the dataset into regions that have small perimeters relative to the amount of data they contain. There are numerous good algorithms for this, in particular from the VLSI and parallel computing communities [7, 5, 6]. Indeed, the idea of tiling can be regarded as a special case of graph partitioning for very regular graphs.
40 / 67
Locality X
Vertical partitioning: the term comes from databases, but the concept applies to the design of memory layouts also. A simple example is the layout of complex-valued arrays, where each element has a real and an imaginary component. One could store a pair (a, b) for each array element to represent a + bi, or one could have two separate arrays, one for the real and one for the imaginary components. These two layouts have very different performance characteristics! Another common situation where vertical partitioning can apply: suppose a computation proceeds in phases, and different information is needed during each phase. E.g., in a graph algorithm one might have:
41 / 67
Locality XI
struct Edge {
    Node* from;
    Node* to;
    float weight;    /* Needed in phase one */
    Edge* parent;    /* Phase two: union-find */
};

Instead of having all the data needed for each phase in one structure, it might be better to break up the Edge object into several structures, one for each phase of the algorithm.
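For illustration, the split might look something like this (the struct names are invented); a phase-two pass then reads only the parent pointer for each edge rather than the whole Edge:

struct Phase1Edge {   // read only while weights are being examined
    Node* from;
    Node* to;
    float weight;
};

struct Phase2Edge {   // read only during the union-find phase
    Edge* parent;
};

// The two arrays are kept parallel: Phase1Edge[i] and Phase2Edge[i]
// describe the same logical edge i.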
42 / 67
43 / 67
[Figure: accesses going to the L1 cache, L2, and main memory.]

Each access retrieves a cacheline (64 bytes), of which only 4 bytes are used.
44 / 67
45 / 67
Binary search on N = 512, with 8 elements per cacheline. Red shows a cacheline retrieved, pink shows a data element used.
46 / 67
Cacheline Tree I
Essentially a B+-tree where each page is a single cacheline. Radix of the tree = cacheline size / sizeof(Key); e.g., with 4-byte keys, we have a radix-16 tree.
Asymptotically, 4 times faster than binary search in main memory (assuming 16 keys per cacheline). In practice, 7 times faster than STL map access for random lookups.
Indices result in a space overhead of 1/(R − 1), where R is the radix; e.g. 1/15 (6.7%) when R = 16.
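A sketch of what a single node might look like with 4-byte keys; the exact layout (15 keys plus a count, children located implicitly) is an assumption of this sketch:

#include <cstdint>

// One node = one 64-byte cache line holding up to 15 sorted 4-byte keys.
// With an implicit layout, the 16 children of node i live at indices
// 16*i + 1 .. 16*i + 16, so no child pointers are stored in the node.
struct alignas(64) CachelineNode {
    int32_t num_keys;     // how many of the 15 key slots are in use
    int32_t keys[15];     // sorted; a search selects one of 16 child subtrees
};
static_assert(sizeof(CachelineNode) == 64, "node must fill one cache line");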
47 / 67
[Figure: a cacheline tree; a find touches log(N)/4 cachelines, 1/4 as many as binary search.]
48 / 67
[Figure: time per find (µs) versus number of elements.]
49 / 67
50 / 67
Disk types
Traditional platter/head
High latency for random access: the head has to move into position, and the platter has to rotate to the correct position.
Huge capacities: 1 TB is quite affordable ($600).
51 / 67
Disk benchmarks
[Figures: disk benchmark results as a function of chunk size.]
52 / 67
Disk benchmarks
[Figure: disk throughput (MBytes/second) versus page size, with the sequential-access rate shown for comparison.]
53 / 67
Disk benchmarks
[Figure: contour plot of disk throughput (MBytes/second); this is the disk model referred to on the next slide.]
54 / 67
55 / 67
But, reading pages 0–64 with a single read (520 KB) would only take 16 ms; just junk (or prioritize for eviction) the superfluous pages.
Given a disk model (e.g., the contour plot of the previous slide), one can optimally partition a set of pages to be read into efficiently readable chunks using dynamic programming. One needs to approach paging as an offline rather than online problem to get decent disk performance: plan and re-order page requests to get maximal throughput.
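A sketch of such a dynamic program, under an assumed cost model in which one read of a span of pages costs a fixed seek time plus a per-page transfer time (the cost model and names are assumptions, not from the slides):

#include <cstddef>
#include <vector>

// pages: sorted page numbers we need. Partition them into contiguous runs,
// where run [i..j] is served by one read covering pages[i]..pages[j]
// (including any unwanted pages in between), at cost
//   cost(i, j) = seek_ms + (pages[j] - pages[i] + 1) * transfer_ms_per_page.
// best[k] = minimal total cost of reading the first k requested pages.
double plan_reads(const std::vector<long>& pages,
                  double seek_ms, double transfer_ms_per_page) {
    std::size_t n = pages.size();
    std::vector<double> best(n + 1, 1e300);
    best[0] = 0.0;
    for (std::size_t j = 1; j <= n; ++j)
        for (std::size_t i = 0; i < j; ++i) {
            double cost = seek_ms +
                (pages[j - 1] - pages[i] + 1) * transfer_ms_per_page;
            if (best[i] + cost < best[j])
                best[j] = best[i] + cost;
        }
    return best[n];   // the runs themselves can be recovered by storing each argmin
}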
56 / 67
RAID I
RAID = Redundant Array of Inexpensive Disks. RAID can serve several purposes:
Overcome throughput limitations by having multiple drives:
1 disk = 70 MB/s, 10 disks = 700 MB/s.
Easy recovery from hard drive crashes: store redundant data on drives, so that errors can be corrected. E.g., with three drives:
Disk A, Disk B, Disk C = A ⊕ B (XOR)

Then, one gets twice the throughput of a single disk, plus if any one of the drives crashes, the data is recoverable. (E.g. if A crashes, then one can use A = B ⊕ C to recover it.)
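A toy illustration of the parity arithmetic (block size and helper names are arbitrary):

#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t BLOCK = 4096;
using Block = std::array<uint8_t, BLOCK>;

// Parity block: C[i] = A[i] ^ B[i] for every byte.
Block make_parity(const Block& a, const Block& b) {
    Block c;
    for (std::size_t i = 0; i < BLOCK; ++i) c[i] = a[i] ^ b[i];
    return c;
}

// If disk A is lost, its block is recovered as A[i] = B[i] ^ C[i].
Block recover_a(const Block& b, const Block& c) {
    return make_parity(b, c);   // XOR is its own inverse
}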
57 / 67
RAID II
RAID 0: striping
Disk A: 0 3 6 ...
Disk B: 1 4 7 ...
Disk C: 2 5 8 ...

RAID 1: mirroring
Disk A: 0 1 2 ...
Disk B: 0 1 2 ...
Disk C: 0 1 2 ...
Mirroring duplicates data over multiple disks: same throughput as a single disk, but fault-tolerant. Better RAID systems let you hot-swap drives.
58 / 67
RAID III
59 / 67
60 / 67
To improve the rate of data transfer, data compression and parallel I/O (parallel disk model) are often used.
61 / 67
Classic example: the B-tree, a balanced multiway search tree suitable for external memory.
Each node has Θ(B) children, and the tree is kept balanced, so that the height is roughly log_B N.
Supported operations: find, insert, delete, and range search (retrieve all records in an interval [k1, k2]).
When insertions cause a node to overflow, it is split into two half-full nodes, which can cause the parent node to overflow and split, etc.
Typically the branching factor is very large, so that a massive data set can be stored in a B-tree of height 3 or 4, for example.
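A sketch of a possible node layout; the field names and the particular branching factor are illustrative, chosen so one node is roughly a 4 KB disk page:

constexpr int B = 256;        // branching factor: chosen so one node fills a disk page

struct BTreeNode {
    int  num_keys;            // every non-root node stays at least about half full
    long keys[B - 1];         // sorted separator keys
    long children[B];         // on-disk page numbers of the child nodes (unused in leaves)
};
// sizeof(BTreeNode) is about 4 KB, so each level of the tree costs one page read.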
62 / 67
63 / 67
64 / 67
Bibliography I
[1] Lars Arge. External memory data structures. In Handbook of Massive Data Sets, pages 313–357. Kluwer Academic Publishers, Norwell, MA, USA, 2002.
[2] Jeff Bilmes, Krste Asanović, Chee-Whye Chin, and Jim Demmel. Optimizing matrix multiply using PHiPAC: a Portable, High-Performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing, Vienna, Austria, July 1997.
[3] Peter J. Denning. The working set model for program behavior. Commun. ACM, 11(5):323–333, 1968.
65 / 67
Bibliography II
[4] Peter J. Denning. The locality principle. In J. Barria, editor, Communication Networks and Computer Systems, pages 43–67. Imperial College Press, 2006.
[5] Josep Díaz, Jordi Petit, and Maria Serna. A survey of graph layout problems. ACM Comput. Surv., 34(3):313–356, 2002.
[6] Ulrich Elsner. Graph partitioning: a survey. Technical Report SFB393/97-27, Technische Universität Chemnitz, 1997.
66 / 67
Bibliography III
[7] Bruce Hendrickson and Robert Leland. A multilevel algorithm for partitioning graphs. In Supercomputing '95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CDROM), page 28, New York, NY, USA, 1995. ACM Press.
[8] Jeffrey Scott Vitter. External memory algorithms and data structures: dealing with massive data. ACM Comput. Surv., 33(2):209–271, 2001.
[9] R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, January 2001.
67 / 67