1 / 67
2 / 67
Memory hierarchy I
So far we have primarily been concerned with asymptotic efficiency of algorithms. Having chosen an algorithm that is efficient in theory, there are still serious engineering challenges to getting high performance in practice. To obtain decent performance on problems involving nontrivial amounts of data, you need to understand:
1. How the memory hierarchy works;
2. How to take advantage of this.
Historical performance of a typical desktop machine:

Year                1980   1990   2000
Clock cycle (ns)     500     50      1
DRAM latency (ns)    375    100     60
Disk latency (ms)     87     28      8
3 / 67
Memory hierarchy II
In two decades:
CPU: 500x faster clock cycle. Main memory: 6x faster. Disk: 10x faster.
It can now take hundreds of clock cycles to access data in main memory: Memory is the new disk. To mitigate the widening gap between CPU speed and memory access time, an elaborate memory hierarchy has evolved.
4 / 67
Memory hierarchy III

Bandwidth is the rate at which data can be transferred in a sustained, bulk manner, scanning through contiguous memory locations. Main memory can supply data at 1/10th the rate of L1 cache.
Latency is the amount of time that elapses between a request for data and the start of its arrival. Disk is a million times slower than L1 cache.
Block size is the chunk size in which data is transferred up to the next-fastest level of the hierarchy.
5 / 67
Memory hierarchy IV
The speed at which a program can run is always determined by some bottleneck: the CPU, main memory, or the disk. Programs where the bottleneck is main memory are called memory bound: most of the execution time is spent waiting for data to arrive from main memory or disk. A desktop machine with, say, a 2 GHz CPU effectively runs much slower if it is memory bound, e.g.:
If L2 cache is the bottleneck: effective speed about 1 GHz (about a Cray Y-MP supercomputer, circa 1988).
If main memory is the bottleneck: effective speed about 200 MHz (about a Cray 1 supercomputer, circa 1976).
If disk is the bottleneck: effective speed between 1 MHz (about a 1981 IBM PC) and 80 MHz, depending on access patterns.
6 / 67
7 / 67
A useful parameter: n_1/2 is the value at which half the asymptotic rate is attained: R(n_1/2) = R∞/2, which gives n_1/2 = t0 · R∞.
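For completeness, a short derivation, assuming the usual latency-bandwidth model in which transferring n items takes T(n) = t0 + n/R∞ (t0 = latency, R∞ = asymptotic rate; this model is implied but not stated explicitly here):

R(n) = n / T(n) = n / (t0 + n/R∞)
Setting R(n_1/2) = R∞/2 gives n_1/2 / (t0 + n_1/2/R∞) = R∞/2, hence 2·n_1/2 = t0·R∞ + n_1/2, i.e. n_1/2 = t0·R∞.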
8 / 67
Example: performance of a FAXPY on a desktop machine. A FAXPY operation looks like this:

float X[N], Y[N];
for (int i = 0; i < N; ++i)
    Y[i] = Y[i] + a * X[i];
9 / 67
(The term FAXPY comes from the BLAS library, a living relic from Fortran 77.) The following graph shows millions of floating point operations per second (Mflops/s) versus the length of the vectors, N. Each plateau corresponds to a level of the memory hierarchy.
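For concreteness, here is one way such a measurement could be made; the flop count of 2 per element and the use of std::chrono are assumptions of this sketch, and a real benchmark would also have to keep the compiler from optimizing the loop away:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Measure FAXPY throughput in Mflops/s for a given vector length N.
// Each inner-loop iteration does one multiply and one add (2 flops).
double faxpy_mflops(std::size_t N, int repeats = 100) {
    std::vector<float> X(N, 1.0f), Y(N, 2.0f);
    const float a = 3.0f;
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        for (std::size_t i = 0; i < N; ++i)
            Y[i] = Y[i] + a * X[i];
    auto stop = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(stop - start).count();
    return 2.0 * N * repeats / seconds / 1e6;   // Mflops/s
}

int main() {
    for (std::size_t N = 16; N <= (1u << 24); N *= 4)
        std::printf("N = %zu: %.1f Mflops/s\n", N, faxpy_mflops(N));
}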
10 / 67
[Figure: FAXPY performance (Mflops/s) versus vector length.]
The first part of the curve follows a typical latency-throughput curve (i.e., characterized by R∞ and n_1/2). It reaches a plateau corresponding to the L1 cache. Performance then drops to a second plateau, when the vectors fit in L2 cache. The third plateau is for main memory.
11 / 67
The abrupt drop at the far right of the graph indicates that the vectors no longer fit in memory, and the operating system is paging data to disk.
12 / 67
13 / 67
14 / 67
15 / 67
17 / 67
Matrix Multiplication I
Example: how to write a matrix-multiplication algorithm that will perform well?
NB: In general, you would be well advised to use a matrix multiplication routine from a high-performance library, and not waste time tuning your own. (But, the principles are worth knowing.)
Here is a naive matrix multiplication:

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)

If the matrices are small enough to fit in cache, this may perform acceptably. For large matrices, this is disastrously slow.
18 / 67
Matrix Multiplication II
Assuming C-style (row major) arrays, the layout of the B matrix in memory looks like this:
[Figure: memory layout of the B matrix; in row-major order, consecutive addresses run along a row: B(1,1), B(1,2), B(1,3), ...]

Each element of the array is a double that takes 8 bytes. The innermost loop (k) iterates down a column of the B matrix, so consecutive accesses are a full row apart in memory. Light blue shows a cache line of 64 bytes; it lies along a row of the matrix.
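To make the stride concrete, a small sketch of the row-major address arithmetic (the helper function is illustrative):

#include <cstddef>

// Byte offset of element B(k, j) in a row-major n-by-n array of doubles.
// Consecutive values of k (the innermost loop above) are n * 8 bytes apart,
// so each iteration lands on a different cache line once n * 8 >= 64.
std::size_t row_major_offset(std::size_t k, std::size_t j, std::size_t n) {
    return (k * n + j) * sizeof(double);
}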
19 / 67
20 / 67
Matrix Multiplication IV
A generally useful strategy is blocking (also called tiling): perform the matrix multiplication over submatrices of size m × m, where m is chosen so that the blocks fit in cache (a block of doubles occupies 8m² bytes).

A = [ A11 A12 ... ; A21 A22 ... ; ... ]    B = [ B11 B12 ... ; B21 B22 ... ; ... ]

If each submatrix is of size m × m, then a multiplication such as A11·B11 requires 16m² bytes from memory, but performs 2m³ floating point operations. The number of floating point operations per memory access is 2m³ / 16m² = m/8. As m is made bigger, the overhead of accessing main memory can be made very small: the expensive memory access is amortized over a large amount of computation.
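A minimal sketch of the blocked loop structure, assuming row-major storage and that n is a multiple of the block size m (edge handling omitted):

#include <cstddef>

// Blocked (tiled) matrix multiply: C += A * B, all n-by-n, row major.
// m is chosen so that a few m-by-m blocks of doubles fit comfortably in cache.
void matmul_blocked(const double* A, const double* B, double* C,
                    std::size_t n, std::size_t m) {
    for (std::size_t i0 = 0; i0 < n; i0 += m)
        for (std::size_t j0 = 0; j0 < n; j0 += m)
            for (std::size_t k0 = 0; k0 < n; k0 += m)
                // Multiply the m-by-m blocks starting at (i0,k0) and (k0,j0).
                for (std::size_t i = i0; i < i0 + m; ++i)
                    for (std::size_t k = k0; k < k0 + m; ++k) {
                        double a = A[i * n + k];
                        for (std::size_t j = j0; j < j0 + m; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}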
21 / 67
Matrix Multiplication V
Linear algebra libraries (e.g. ATLAS) can automatically tune themselves to a machine's memory hierarchy by choosing appropriate block sizes, unrollings of inner loops, etc. [2, 9].
22 / 67
Example: Linked lists are useful for maintaining lists of unpredictable size, e.g., queues, hash table chaining, etc. Recall that a simple linked list has a structure such as:
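A minimal node definition consistent with the 8-byte nodes described below (assuming 4-byte pointers, i.e. a 32-bit machine):

struct ListNode {
    float data;       // 4 bytes of useful payload
    ListNode* next;   // 4 bytes on a 32-bit machine, for 8 bytes per node
};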
23 / 67
Each ListNode contains 8 bytes of data. But if it is brought in from main memory, we will read an entire cache line, say 64 bytes of data. Unless we're lucky, those extra 56 bytes contain nothing of value; we only get 4 bytes of actually useful data (the data field) for 64 bytes read from memory, an efficiency of only 4/64 = 6.25%.
24 / 67
A better layout is to compromise between a linked list and an array: have linked-list nodes that contain little arrays, sized so that each list node fills a cache line:

struct ListNode2 {
    float data[14];
    int count;
    ListNode2* next;
};
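As a usage sketch, traversing such a list stays within a single cache line for most of each node (the helper below is illustrative):

// Traverse a list of small arrays; within each node the payload is contiguous,
// so most reads hit the cache line that the node occupies.
float sum_list(const ListNode2* head) {
    float total = 0.0f;
    for (const ListNode2* p = head; p != nullptr; p = p->next)
        for (int i = 0; i < p->count; ++i)
            total += p->data[i];
    return total;
}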
25 / 67
26 / 67
[Figure: traversal rate versus number of items for: Array, Linked List, List of arrays (1 cacheline), List of arrays (4 cachelines), List of arrays (16 cachelines).]
27 / 67
[Figure: millions of items/s versus number of items.]
28 / 67
In the L1 cache, the linked list is very fast. But compare performance in other regions of the memory hierarchy (numbers are millions of items per second):

                Array   Linked list   List of arrays (1 cache line)
L1 cache          411           411                             317
L2 cache          406            90                             285
Main memory       377             7                              84

In L2 cache, the list-of-arrays is 3 times faster; in main memory it is 12 times faster. In main memory, the linked list is 50 times slower than the array. The regular memory access pattern for the array allows the memory prefetching hardware to predict what memory will be needed in advance, resulting in the high performance for the array. Lessons:
29 / 67
Data structure nodes of size less than one cache line can cause serious performance problems when working out-of-cache. Performance can sometimes be improved by making fatter nodes that are one or more cache lines long.
Nothing beats accessing a contiguous sequence of memory locations (an array). For this reason, many cache-efficient data structures bottom out to an array of some size at the last level.
30 / 67
31 / 67
Locality I
Efficient operation of the memory hierarchy rests on two basic strategies:

Caching memory contents, in the hope that memory, once used, is likely to be soon revisited.
Anticipating where the attention of the program is likely to wander, and prefetching those memory contents into cache.

The caches and prefetching strategies of the memory hierarchy are only effective if memory access patterns exhibit some degree of locality of reference.

Temporal locality: if a memory location is accessed at time t, it is more likely to be referenced again at times close to t.
Spatial locality: if memory location x is referenced, one is more likely to access memory locations close to x.
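A standard illustration of spatial locality, assuming a row-major n × n array of doubles:

#include <cstddef>

// Row-major traversal: consecutive iterations touch adjacent addresses
// (good spatial locality).
double sum_by_rows(const double* A, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s += A[i * n + j];
    return s;
}

// Column-major traversal of the same array: consecutive iterations are
// n*8 bytes apart, so nearly every access touches a new cache line
// (poor spatial locality).
double sum_by_cols(const double* A, std::size_t n) {
    double s = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            s += A[i * n + j];
    return s;
}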
32 / 67
Locality II
Denning's notion of a working set [3, 4]:

Let W(t, τ) be the set of items accessed during the time interval (t − τ, t). This is called the working set (or locality set). Denning's thesis was that programs progress through a series of working sets or locales, and while in a locale do not make (many) references outside of it. Optimal cache management then consists of guaranteeing that the locale is present in high-speed memory when a program needs it. Working sets are a model that programs often adhere to. One of the early triumphs of the concept was explaining why multiprocess systems, when overloaded, ground to a halt (thrashing) instead of degrading gracefully: thrashing occurred when there was not room in memory for the working sets of all the processes at the same time.
33 / 67
Locality III
The memory hierarchy has evolved around the concept of a working set, with the result that "programs have small working sets" has gone from being a descriptive statement to a prescription for good performance. Let w(t, τ) = |W(t, τ)| be the number of distinct items accessed in the interval (t − τ, t), i.e. the size of the working set. If w(t, τ) grows rapidly as a function of τ, then caching of recently used data will have little effect. If however it levels off, then a high hit rate in the cache can be achieved.
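As an illustration, w(t, τ) could be computed from a recorded address trace along these lines (the trace representation is an assumption of this sketch):

#include <cstddef>
#include <unordered_set>
#include <vector>

// Number of distinct addresses among the tau references preceding time t:
// w(t, tau) = |W(t, tau)|.
std::size_t working_set_size(const std::vector<std::size_t>& trace,
                             std::size_t t, std::size_t tau) {
    std::unordered_set<std::size_t> distinct;
    std::size_t begin = (tau < t) ? t - tau : 0;
    for (std::size_t i = begin; i < t && i < trace.size(); ++i)
        distinct.insert(trace[i]);
    return distinct.size();
}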
34 / 67
Locality IV
[Figure: working-set size w(t, τ) as a function of τ, with the cache size marked; caching is ineffective once w(t, τ) exceeds the cache size.]
Algorithms can sometimes be manipulated to improve their locality of reference. The matrix multiplication seen earlier is one example: by multiplying m × m blocks instead of columns, one can do O(m³) work while reading only O(m²) memory locations.
35 / 67
Locality V
Good compilers will try to do this automatically for certain simple arrangements of loops, but the general problem is too difficult for automated solutions.
36 / 67
Locality VI
If a large image is stored on disk column-by-column, then rendering a small region (dotted line) requires retrieving many disk pages (left). If instead the image is broken into tiles, many fewer pages are needed (right).
37 / 67
Locality VII
Iteration-space tiling or pipelining: in certain multistep algorithms, instead of performing one step over the entire structure, one can perform several steps over each local region.
38 / 67
Locality VIII
Space-filling curves: if an algorithm requires traversing some multidimensional space, one can sometimes improve locality by following a space-filling curve, for example a Z-order (Morton) or Hilbert curve. For general data sets, a similar effect can be achieved by using multilevel minimum linear arrangement (MultiMinLA).
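As one concrete way to linearize a 2-D traversal, a sketch of a Z-order (Morton) index for 16-bit coordinates (this particular curve and routine are illustrative choices):

#include <cstdint>

// Interleave the bits of x and y to get a Z-order (Morton) index.
// Points that are close in 2-D tend to be close in Morton order,
// which improves locality when visiting them in index order.
uint32_t morton2d(uint16_t x, uint16_t y) {
    auto spread = [](uint32_t v) {          // insert a 0 bit between each bit of v
        v = (v | (v << 8)) & 0x00FF00FFu;
        v = (v | (v << 4)) & 0x0F0F0F0Fu;
        v = (v | (v << 2)) & 0x33333333u;
        v = (v | (v << 1)) & 0x55555555u;
        return v;
    };
    return spread(x) | (spread(y) << 1);
}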
39 / 67
Locality IX
Graph partitioning: if a dataset lacks a clear geometric structure, one can sometimes still achieve results similar to tiling by partitioning the dataset into regions that have small perimeters relative to the amount of data they contain. There are numerous good algorithms for this, in particular from the VLSI and parallel computing communities [7, 5, 6]. Indeed, the idea of tiling can be regarded as a special case of graph partitioning for very regular graphs.
40 / 67
Locality X
Vertical partitioning: the term comes from databases, but the concept applies to the design of memory layouts also. A simple example is the layout of complex-valued arrays, where each element has a real and an imaginary component. One could store a pair (a, b) for each array element to represent a + bi, or one could have two separate arrays, one for the real and one for the imaginary components. These two layouts have very different performance characteristics! Another common situation where vertical partitioning can apply: suppose a computation proceeds in phases, and different information is needed during each phase. E.g., in a graph algorithm one might have:
41 / 67
Locality XI
struct Edge {
    Node* from;
    Node* to;
    float weight;    /* Needed in phase one */
    Edge* parent;    /* Phase two: union-find */
};

Instead of having all the data needed for each phase in one structure, it might be better to break up the Edge object into several structures, one for each phase of the algorithm.
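For illustration, the split might look something like this (the struct names are invented); a phase-two pass then reads only the parent pointer for each edge rather than the whole Edge:

struct Phase1Edge {   // read only while weights are being examined
    Node* from;
    Node* to;
    float weight;
};

struct Phase2Edge {   // read only during the union-find phase
    Edge* parent;
};

// The two arrays are kept parallel: Phase1Edge[i] and Phase2Edge[i]
// describe the same logical edge i.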
42 / 67
43 / 67
[Figure: accesses going to the L1 cache, L2, and main memory.]

Each access retrieves a cacheline (64 bytes), of which only 4 bytes are used.
44 / 67
45 / 67
Binary search on N = 512, with 8 elements per cacheline. Red shows a cacheline retrieved, pink shows a data element used.
46 / 67
Cacheline Tree I
Essentially a B+-tree where each page is a single cacheline. Radix of the tree = cacheline size / sizeof(Key); e.g., with 4-byte keys, we have a radix-16 tree.
Asymptotically, 4 times faster than binary search in main memory (assuming 16 keys per cacheline). In practice, 7 times faster than STL map access for random lookups.
Indices result in a space overhead of 1/(R − 1), where R is the radix; e.g. 1/15 (6.7%) when R = 16.
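A sketch of what a single node might look like with 4-byte keys; the exact layout (15 keys plus a count, children located implicitly) is an assumption of this sketch:

#include <cstdint>

// One node = one 64-byte cache line holding up to 15 sorted 4-byte keys.
// With an implicit layout, the 16 children of node i live at indices
// 16*i + 1 .. 16*i + 16, so no child pointers are stored in the node.
struct alignas(64) CachelineNode {
    int32_t num_keys;     // how many of the 15 key slots are in use
    int32_t keys[15];     // sorted; a search selects one of 16 child subtrees
};
static_assert(sizeof(CachelineNode) == 64, "node must fill one cache line");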
47 / 67
[Figure: a cacheline tree; a find touches log(N)/4 cachelines, 1/4 as many as binary search.]
48 / 67
[Figure: time per find (µs) versus number of elements.]
49 / 67
50 / 67
Disk types
Traditional platter/head
High latency for random access: the head has to move into position, and the platter has to rotate to the correct position.
Huge capacities: 1 TB is quite affordable ($600).
51 / 67
Disk benchmarks
[Figures: disk benchmark results as a function of chunk size.]
52 / 67
Disk benchmarks
[Figure: disk throughput (MBytes/second) versus page size, with the sequential-access rate shown for comparison.]
53 / 67
Disk benchmarks
[Figure: contour plot of disk throughput (MBytes/second); this is the disk model referred to on the next slide.]
54 / 67
55 / 67
But, reading pages 0–64 with a single read (520 KB) would only take 16 ms; just junk (or prioritize for eviction) the superfluous pages.
Given a disk model (e.g., the contour plot of the previous slide), one can optimally partition a set of pages to be read into efficiently readable chunks using dynamic programming. One needs to approach paging as an offline rather than online problem to get decent disk performance: plan and re-order page requests to get maximal throughput.
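A sketch of such a dynamic program, under an assumed cost model in which one read of a span of pages costs a fixed seek time plus a per-page transfer time (the cost model and names are assumptions, not from the slides):

#include <cstddef>
#include <vector>

// pages: sorted page numbers we need. Partition them into contiguous runs,
// where run [i..j] is served by one read covering pages[i]..pages[j]
// (including any unwanted pages in between), at cost
//   cost(i, j) = seek_ms + (pages[j] - pages[i] + 1) * transfer_ms_per_page.
// best[k] = minimal total cost of reading the first k requested pages.
double plan_reads(const std::vector<long>& pages,
                  double seek_ms, double transfer_ms_per_page) {
    std::size_t n = pages.size();
    std::vector<double> best(n + 1, 1e300);
    best[0] = 0.0;
    for (std::size_t j = 1; j <= n; ++j)
        for (std::size_t i = 0; i < j; ++i) {
            double cost = seek_ms +
                (pages[j - 1] - pages[i] + 1) * transfer_ms_per_page;
            if (best[i] + cost < best[j])
                best[j] = best[i] + cost;
        }
    return best[n];   // the runs themselves can be recovered by storing each argmin
}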
56 / 67
RAID I
RAID = Redundant Array of Inexpensive Disks. RAID can serve several purposes:
Overcome throughput limitations by having multiple drives:
1 disk = 70 MB/s, 10 disks = 700 MB/s.
Easy recovery from hard drive crashes: store redundant data on drives, so that errors can be corrected. E.g., with three drives:
Disk A, Disk B, Disk C = A ⊕ B (XOR)

Then, one gets twice the throughput of a single disk, plus if any one of the drives crashes, the data is recoverable. (E.g. if A crashes, then one can use A = B ⊕ C to recover it.)
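A toy illustration of the parity arithmetic (block size and helper names are arbitrary):

#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t BLOCK = 4096;
using Block = std::array<uint8_t, BLOCK>;

// Parity block: C[i] = A[i] ^ B[i] for every byte.
Block make_parity(const Block& a, const Block& b) {
    Block c;
    for (std::size_t i = 0; i < BLOCK; ++i) c[i] = a[i] ^ b[i];
    return c;
}

// If disk A is lost, its block is recovered as A[i] = B[i] ^ C[i].
Block recover_a(const Block& b, const Block& c) {
    return make_parity(b, c);   // XOR is its own inverse
}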
57 / 67
RAID II
RAID 0: striping
Disk A: 0 3 6 ...
Disk B: 1 4 7 ...
Disk C: 2 5 8 ...

RAID 1: mirroring
Disk A: 0 1 2 ...
Disk B: 0 1 2 ...
Disk C: 0 1 2 ...
Mirroring duplicates data over multiple disks: same throughput as a single disk, but fault-tolerant. Better RAID systems let you hot-swap drives.
58 / 67
RAID III
59 / 67
60 / 67
To improve the rate of data transfer, data compression and parallel I/O (parallel disk model) are often used.
61 / 67
Classic example: the B-tree, a balanced multiway search tree suitable for external memory.
Each node has Θ(B) children, and the tree is kept balanced, so that the height is roughly log_B N.
Supported operations: find, insert, delete, and range search (retrieve all records in an interval [k1, k2]).
When insertions cause a node to overflow, it is split into two half-full nodes, which can cause the parent node to overflow and split, etc.
Typically the branching factor is very large, so that a massive data set can be stored in a B-tree of height 3 or 4, for example.
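A sketch of a possible node layout; the field names and the particular branching factor are illustrative, chosen so one node is roughly a 4 KB disk page:

constexpr int B = 256;        // branching factor: chosen so one node fills a disk page

struct BTreeNode {
    int  num_keys;            // every non-root node stays at least about half full
    long keys[B - 1];         // sorted separator keys
    long children[B];         // on-disk page numbers of the child nodes (unused in leaves)
};
// sizeof(BTreeNode) is about 4 KB, so each level of the tree costs one page read.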
62 / 67
63 / 67
64 / 67
Bibliography I
[1] Lars Arge. External memory data structures. In Handbook of Massive Data Sets, pages 313–357. Kluwer Academic Publishers, Norwell, MA, USA, 2002.
[2] Jeff Bilmes, Krste Asanović, Chee-Whye Chin, and Jim Demmel. Optimizing matrix multiply using PHiPAC: a Portable, High-Performance, ANSI C coding methodology. In Proceedings of the International Conference on Supercomputing, Vienna, Austria, July 1997.
[3] Peter J. Denning. The working set model for program behavior. Commun. ACM, 11(5):323–333, 1968.
65 / 67
Bibliography II
[4] Peter J. Denning. The locality principle. In J. Barria, editor, Communication Networks and Computer Systems, pages 43–67. Imperial College Press, 2006.
[5] Josep Díaz, Jordi Petit, and Maria Serna. A survey of graph layout problems. ACM Comput. Surv., 34(3):313–356, 2002.
[6] Ulrich Elsner. Graph partitioning: a survey. Technical Report SFB393/97-27, Technische Universität Chemnitz, 1997.
66 / 67
Bibliography III
[7] Bruce Hendrickson and Robert Leland. A multilevel algorithm for partitioning graphs. In Supercomputing '95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CDROM), page 28, New York, NY, USA, 1995. ACM Press.
[8] Jeffrey Scott Vitter. External memory algorithms and data structures: dealing with massive data. ACM Comput. Surv., 33(2):209–271, 2001.
[9] R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, January 2001.
67 / 67