
Improving Cache Performance/Cache Optimization

The average memory access time is given by

Average memory access time = Hit time + Miss rate × Miss penalty

To improve cache performance we need to decrease the average memory access time (AMAT), and the formula shows that AMAT depends on the hit time, the miss rate and the miss penalty. Therefore the cache optimization techniques used to improve cache performance or reduce power consumption can be organized into the following four categories.
1. Reducing miss penalty

2. Reducing miss rate

3. Reducing miss rate and miss penalty via parallelism

4. Reducing hit time
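As a quick illustration of the formula, the short C program below evaluates AMAT for one set of assumed numbers (the values are made up for the example; they are not taken from the text).

    #include <stdio.h>

    int main(void) {
        /* Assumed example numbers: 1-cycle hit, 5% miss rate, 100-cycle miss penalty */
        double hit_time = 1.0;        /* cycles */
        double miss_rate = 0.05;      /* fraction of accesses that miss */
        double miss_penalty = 100.0;  /* cycles to fetch the block from the next level */

        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %.1f cycles\n", amat);   /* 1 + 0.05 * 100 = 6.0 cycles */
        return 0;
    }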

Reducing miss penalty


There are mainly five optimization techniques to reduce the miss penalty.

1) Multilevel cache

This technique concentrates on the gap between the cache and main memory. To improve cache performance we should either make the cache faster, to keep pace with the speed of the CPU, or make the cache larger, to overcome the widening gap between memory and CPU. This method addresses both concerns: another level of cache is added between the original cache and main memory. The first-level cache can be small enough to match the clock cycle time of a fast CPU, and the second-level cache can be large enough to capture a large number of accesses that would otherwise go to main memory. The average memory access time of a two-level cache is given by

Average memory access time = Hit time(L1) + Miss rate(L1) × Miss penalty(L1)

Miss penalty(L1) = Hit time(L2) + Miss rate(L2) × Miss penalty(L2)

That is, the miss penalty of the first-level cache is the average memory access time of the second-level cache. To avoid ambiguity, two separate miss rates are defined for two-level cache systems: the local miss rate and the global miss rate. The local miss rate is the number of misses in a particular cache divided by the total number of memory accesses to that cache. The local miss rate of the first-level cache is

Miss rate(L1) = (number of misses in L1) / (number of memory accesses to L1)

The local miss rate of the second-level cache is

Miss rate(L2) = (number of misses in L2) / (number of memory accesses to L2)

The global miss rate is the number of misses in a particular cache divided by the total number of memory accesses generated by the CPU. The global miss rate of the first-level cache is Miss rate(L1), but for the second-level cache it is Miss rate(L1) × Miss rate(L2). The local miss rate is larger for the second-level cache because the first-level cache skims the cream of the memory accesses. The global miss rate is the more useful measure, since it indicates what fraction of the memory accesses that leave the CPU go all the way to memory. A natural policy for a memory hierarchy is multilevel inclusion, i.e., all the data found in a higher level are also found in the lower levels. However, a designer may only be able to afford an L2 cache that is slightly bigger than the L1 cache; in that case a significant portion of the L2 cache cannot be spent on a redundant copy of the L1 cache, so L1 data is never kept in L2. Such a multilevel cache follows multilevel exclusion.
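To make the local/global distinction and the two-level formula concrete, the sketch below uses assumed numbers (they are illustrative only, not taken from the text).

    #include <stdio.h>

    int main(void) {
        /* Assumed example numbers for a two-level cache */
        double hit_time_L1 = 1.0,  local_miss_rate_L1 = 0.04;  /* 4% of CPU accesses miss in L1 */
        double hit_time_L2 = 10.0, local_miss_rate_L2 = 0.50;  /* half of those also miss in L2 */
        double mem_penalty = 200.0;                            /* cycles to reach main memory */

        double miss_penalty_L1 = hit_time_L2 + local_miss_rate_L2 * mem_penalty;  /* 110 cycles */
        double amat = hit_time_L1 + local_miss_rate_L1 * miss_penalty_L1;         /* 1 + 0.04 * 110 = 5.4 */
        double global_miss_rate_L2 = local_miss_rate_L1 * local_miss_rate_L2;     /* 2% of accesses reach memory */

        printf("AMAT = %.1f cycles, global L2 miss rate = %.0f%%\n",
               amat, 100.0 * global_miss_rate_L2);
        return 0;
    }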

2) Critical word first and early restart

The CPU normally needs only one word of a block at a time, but blocks may be several words long. We do not want to wait for the full block to be loaded before sending the requested word and restarting the CPU. There are two main strategies to address this.

a) Critical word first / wrapped fetch / requested word first: Request the missed word first from memory and send it to the CPU as soon as it arrives. Let the CPU continue execution while the rest of the words in the block are filled in.

b) Early restart: Fetch the words in normal order, but as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.

These techniques benefit only caches with large block sizes.

3) Give priority to read misses over writes

This optimization, as its name suggests, gives priority to read misses over writes. Let us see how this is done for a write-through and for a write-back cache.

a) Write-through cache: In a write-through cache the information is written into both the cache and main memory. Such caches use a write buffer, which holds the data waiting to be written to main memory. Write buffers complicate memory accesses because they might hold the updated value of a location needed on a read miss. The simple way out is for the read miss to wait until the write buffer is empty. An alternative is to check the contents of the write buffer on a read miss and, if there are no conflicts and the memory system is available, let the read miss continue (a minimal sketch of this check follows below).

b) Write-back cache: When a read miss occurs, it may replace a dirty block. Instead of writing the dirty block back to memory and then reading, we can copy the dirty block to a buffer, read memory first, and then write the buffered block back to memory.
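Returning to the write-through case in (a), the sketch below shows the conflict check on a read miss. The structures, names and buffer size are assumptions made for illustration; they are not part of the original text.

    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES 4   /* assumed write-buffer size */

    typedef struct {
        bool     valid;
        uint64_t block_addr;   /* block-aligned address of the pending write */
    } WriteBufferEntry;

    static WriteBufferEntry write_buffer[WB_ENTRIES];

    /* On a read miss, check whether any valid buffer entry holds the requested
       block. If none does, the read miss may go to memory ahead of the queued
       writes; otherwise the buffer must be drained (or forwarded from) first. */
    bool read_may_bypass_writes(uint64_t block_addr) {
        for (int i = 0; i < WB_ENTRIES; i++) {
            if (write_buffer[i].valid && write_buffer[i].block_addr == block_addr)
                return false;   /* conflict with a pending write */
        }
        return true;            /* no conflict: let the read miss continue */
    }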

4) Merging write buffer

Both write-through and write-back caches use a write buffer. If the write buffer is empty, the data and the full address are written to the buffer; the write is then finished from the CPU's perspective, and the CPU continues working while the buffer writes the word to memory. If the buffer already contains other modified blocks, their addresses can be checked to see whether the address of the new data matches the address of a valid buffer entry. If so, the new data are combined with that entry; this is known as write merging (a sketch of the merging check follows below). If there is no match and the buffer is full, the CPU must wait. Since multiword writes are usually faster than writes performed one word at a time, this optimization uses memory more efficiently and reduces stalls due to the write buffer being full.
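A minimal sketch of the merging check is given below, assuming four entries of four 64-bit words each (to match the example that follows); the structures and names are illustrative only.

    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES      4
    #define WORDS_PER_ENTRY 4                      /* four 64-bit words per entry */
    #define ENTRY_BYTES     (WORDS_PER_ENTRY * 8)  /* 32-byte entries */

    typedef struct {
        bool     used;                    /* entry holds pending writes */
        uint64_t base_addr;               /* entry-aligned address */
        bool     valid[WORDS_PER_ENTRY];  /* one valid bit per word */
        uint64_t data[WORDS_PER_ENTRY];
    } WriteBufferEntry;

    static WriteBufferEntry wb[WB_ENTRIES];

    /* Try to merge a 64-bit store into an existing entry covering the same
       32-byte region. Return true on success; on failure the CPU must either
       allocate a new entry or stall if the buffer is full. */
    bool try_write_merge(uint64_t addr, uint64_t value) {
        uint64_t base = addr & ~(uint64_t)(ENTRY_BYTES - 1);
        int      word = (int)((addr & (ENTRY_BYTES - 1)) / 8);

        for (int i = 0; i < WB_ENTRIES; i++) {
            if (wb[i].used && wb[i].base_addr == base) {
                wb[i].data[word]  = value;   /* combine with the existing entry */
                wb[i].valid[word] = true;
                return true;
            }
        }
        return false;
    }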

Example:

The buffer has four entries, and each entry holds four 64-bit words. The address for each entry is on the left, with valid bits (V) indicating whether the next sequential eight bytes of that entry are occupied. In the second case the four writes are merged into a single buffer entry by write merging; without it, the buffer is full even though three-fourths of each entry is wasted.

5) Victim cache

Another approach to lowering the miss penalty is to remember what was discarded, in case it is needed again. Since the discarded data has already been fetched, it can be reused at small cost. The discarded (replaced) blocks are stored in a small fully associative cache between the cache and its refill path; this is the victim cache. It contains only blocks that were discarded from the cache because of a miss, and it is checked on a miss to see whether it has the desired data before going to the next lower level. If the data is found there, the victim block and the cache block are swapped. The figure shows a victim cache placed in the path between the data cache and main memory.
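A minimal sketch of the victim-cache lookup on a main-cache miss is given below; the structures, names and the four-entry size are assumptions made for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    #define VICTIM_ENTRIES 4   /* small and fully associative */

    typedef struct {
        bool     valid;
        uint64_t tag;          /* block data omitted for brevity */
    } Line;

    static Line victim[VICTIM_ENTRIES];

    /* Called after a miss in the main cache. If the victim cache holds the
       requested block, swap it with the line just evicted from the main cache
       and report a hit; otherwise the request goes to the next lower level. */
    bool victim_lookup_and_swap(uint64_t tag, Line *evicted_line) {
        for (int i = 0; i < VICTIM_ENTRIES; i++) {
            if (victim[i].valid && victim[i].tag == tag) {
                Line tmp      = victim[i];      /* victim block returns to the cache */
                victim[i]     = *evicted_line;  /* evicted block becomes the new victim */
                *evicted_line = tmp;
                return true;
            }
        }
        return false;
    }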

Reducing Miss Rate


This is a classical approach to improving cache behavior, and there are five methods to reduce the miss rate. Before going into the details of these methods, let us see how the misses occurring in a processor are categorized.

Compulsory Misses: The very first access to a block cannot be in the cache, so the block must be brought into the cache. These are known as compulsory misses, cold-start misses or first-reference misses.

Capacity Misses: If the cache cannot hold all the blocks needed during the execution of a program, capacity misses occur, because blocks are discarded and later retrieved.

Conflict Misses: If the block placement strategy is set associative or direct mapped, conflict misses occur, because a block may be discarded and later retrieved when too many blocks map to its set. These misses are also called collision misses or interference misses.

1) Larger Block Size

Larger block sizes reduce compulsory misses. The reduction occurs because of the principle of locality, which has two components, temporal locality and spatial locality; larger blocks take advantage of spatial locality.

On the other hand, conflict misses increase for larger block sizes, since the cache then has fewer blocks. Larger blocks also increase the miss penalty, since more time is required to transfer a block from main memory to the cache. The increase in miss penalty usually outweighs the decrease in miss rate, which makes very large block sizes less favored.

2) Larger Caches

The obvious way to reduce capacity misses is to increase the capacity of the cache. The drawbacks of this method are a longer hit time and higher cost.

3) Higher Associativity

Higher associativity reduces conflict misses, but it comes at the expense of a larger hit (access) time. Another disadvantage of this method is that the hardware complexity increases with associativity.

4) Way Prediction and Pseudo-Associative Caches

a) Way prediction: Higher associativity reduces conflict misses but increases the hit time, because a number of tag comparisons have to be done and a multiplexor is placed in the critical path to select a particular block in the set. An approach that reduces conflict misses and yet maintains the hit speed of a direct-mapped cache is way prediction. In way prediction, extra bits are kept in the cache to predict the way, or block within the set, of the next cache access. This prediction means the multiplexor can be set early to select the desired block, and only a single tag comparison is performed in that clock cycle. A miss results in checking the other blocks for matches in subsequent clock cycles.

b) Pseudo-associative caches: The basic idea is to start with a direct-mapped cache and then check another entry on a miss before going to the next lower level. A typical next location to check is obtained by inverting the high-order index bit. The advantages of this method are a miss rate close to that of a two-way set-associative cache and an access time close to that of a direct-mapped cache.

5) Compiler Optimization

This technique reduces the miss rate without any hardware changes; the reduction comes from optimized software. Code can be rearranged, without affecting correctness, to reduce the miss rate. There are a number of ways to do this; loop interchange and blocking are two such compiler optimizations.

Loop Interchange

In this method the inner and outer loops are exchanged to improve cache performance; a sketch of the two loop orderings is given after this discussion. If the cache is small (much smaller than 5000 numbers), we will have many misses in the inner loop due to replacement. Reordering the loops maximizes the use of the data in a cache block before it is discarded. The original code skips through memory in strides of 100 words, while the revised version accesses all the words in one cache block before going to the next block. (The original code accesses x[0][0] first and then x[1][0], but in memory x[0][1] is stored just after x[0][0]; when the loops are interchanged, misses are reduced by spatial locality.)

Blocking

The basic idea of this method is to do as many operations as possible on a sub-block before accessing the next sub-block from memory. This optimization tries to reduce misses via improved temporal locality. Here we are dealing with multiple arrays, some accessed by rows and some by columns. Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks. The goal is to maximize the accesses to the data loaded into the cache before the data are replaced. The matrix multiplication code sketched below helps motivate the optimization: its two inner loops read all N-by-N elements of z, read the same N elements in a row of y repeatedly, and write one row of N elements of x. The figure gives a snapshot of the accesses to the three arrays; a dark shade indicates a recent access, a light shade an older access, and white means not yet accessed.
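The loop interchange example referred to above is not reproduced in these notes; the sketch below is a standard reconstruction consistent with the 5000-element, 100-word-stride description, with the array shape and function names chosen for illustration.

    /* Original ordering: consecutive iterations of the inner loop are 100
       words apart in memory, so each access may touch a different block. */
    void original_order(int x[5000][100]) {
        for (int j = 0; j < 100; j = j + 1)
            for (int i = 0; i < 5000; i = i + 1)
                x[i][j] = 2 * x[i][j];
    }

    /* Interchanged ordering: all the words in a cache block are used before
       the next block is touched, exploiting spatial locality. */
    void interchanged_order(int x[5000][100]) {
        for (int i = 0; i < 5000; i = i + 1)
            for (int j = 0; j < 100; j = j + 1)
                x[i][j] = 2 * x[i][j];
    }

The matrix multiplication mentioned in the Blocking discussion is likewise reconstructed below as the standard unblocked version, assumed to match the description of the accesses to x, y and z.

    /* Unblocked matrix multiplication: the inner loops sweep a whole row of y
       and the entire matrix z for every element of x that is produced. */
    void matmul(int N, double x[N][N], double y[N][N], double z[N][N]) {
        for (int i = 0; i < N; i = i + 1)
            for (int j = 0; j < N; j = j + 1) {
                double r = 0;
                for (int k = 0; k < N; k = k + 1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = r;
            }
    }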

The number of capacity misses clearly depends on N and the size of the cache. If the cache can hold all three N-by-N matrices, there is no problem. If it can hold one N-by-N matrix and one row of N elements, then at least the i-th row of y and the array z may stay in the cache. With less than that, misses may occur for both x and z. To ensure that the elements being accessed can fit in the cache, the original code is changed to compute on a submatrix of size B by B, as sketched below. The two inner loops now compute in steps of size B rather than over the full length of x and z. B is called the blocking factor. The figure illustrates the accesses to the three arrays using blocking.
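A reconstruction of the blocked version is sketched below (x is assumed to be zero-initialized, since the partial products are now accumulated into it).

    /* Blocked matrix multiplication with blocking factor B: only a B-by-B
       block of z and B elements of a row of y need to stay in the cache. */
    void matmul_blocked(int N, int B, double x[N][N], double y[N][N], double z[N][N]) {
        for (int jj = 0; jj < N; jj = jj + B)
            for (int kk = 0; kk < N; kk = kk + B)
                for (int i = 0; i < N; i = i + 1)
                    for (int j = jj; j < (jj + B < N ? jj + B : N); j = j + 1) {
                        double r = 0;
                        for (int k = kk; k < (kk + B < N ? kk + B : N); k = k + 1)
                            r = r + y[i][k] * z[k][j];
                        x[i][j] = x[i][j] + r;   /* accumulate the partial product */
                    }
    }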

In contrast to the previous code, a smaller number of elements needs to be accessed from memory.

Reducing cache miss penalty or miss rate via parallelism


These techniques overlap the execution of instructions with activity in the memory hierarchy.

1) Nonblocking Cache

For pipelined computers that allow out-of-order completion, the CPU need not stall on a cache miss. For example, the CPU could continue fetching instructions from the instruction cache while waiting for the data cache to return the missing data. A nonblocking cache, or lockup-free cache, increases the benefits of such a scheme by allowing the data cache to continue to supply cache hits during a miss. This "hit under miss" optimization reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the CPU. A further, more complex option is for the cache to lower the effective miss penalty even more by overlapping multiple misses: a "hit under multiple miss" or "miss under miss" optimization. This option is beneficial only if the memory system can service multiple misses.

2) Hardware Prefetching

In this approach instructions or data are prefetched, either directly into the caches or into an external buffer that can be accessed more quickly than main memory. Instruction prefetch is frequently done in hardware outside of the cache. Typically, the processor fetches two blocks on a miss: the requested block and the next consecutive block. The requested block is placed in the instruction cache when it returns, and the prefetched block is placed in an instruction stream buffer. If a requested block is present in the instruction stream buffer, the original cache request is canceled, the block is read from the stream buffer, and the next prefetch request is issued.

3) Compiler-Controlled Prefetching / Software Prefetching

In this method the compiler inserts special prefetch instructions to request data before they are needed. These prefetch instructions should not cause any faults. There are two main types of compiler-controlled prefetch: register prefetch, which loads the value into a register, and cache prefetch, which loads the data into the cache. Prefetching makes sense only if the processor can proceed while the prefetched data are being fetched; that is, the caches do not stall but continue to supply instructions and data while waiting for the prefetched data to return. Consider the loop sketched below: the revised version prefetches a[i][7] to a[i][99], so each element of a is requested seven iterations before it is needed and we pay the miss penalty only for the first seven iterations. This can be improved further by prefetching b as well; refer to the text for the code that prefetches both a and b.
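The prefetching loop itself is not reproduced in these notes; the sketch below is a reconstruction consistent with the description (prefetching a seven iterations ahead only). The array shapes and function names are chosen for illustration, and GCC's __builtin_prefetch merely stands in for the machine's non-faulting cache-prefetch instruction.

    /* Original loop: misses on a (and b) stall the processor. */
    void no_prefetch(double a[3][100], double b[101][3]) {
        for (int i = 0; i < 3; i = i + 1)
            for (int j = 0; j < 100; j = j + 1)
                a[i][j] = b[j][0] * b[j + 1][0];
    }

    /* Revised loop: each element of a is requested seven iterations before it
       is written, so only the first seven iterations of each row pay the full
       miss penalty. Prefetching b as well would remove further misses. */
    void with_prefetch(double a[3][100], double b[101][3]) {
        for (int i = 0; i < 3; i = i + 1)
            for (int j = 0; j < 100; j = j + 1) {
                if (j + 7 < 100)
                    __builtin_prefetch(&a[i][j + 7]);   /* fetches a[i][7] .. a[i][99] early */
                a[i][j] = b[j][0] * b[j + 1][0];
            }
    }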

Reducing Hit Time


1) Small and simple cache

Smaller hardware is usually faster, so the cache should be made as small as possible to reduce the hit time. It is also critical to keep the cache small enough to fit on the same chip as the processor, to avoid the time penalty of going off chip. The second part of this technique is to keep the cache simple, for example by using direct mapping. A main benefit of a direct-mapped cache is that the tag check can be done in parallel with the transmission of the data, which effectively reduces the hit time.

2) Avoiding address translation during indexing of the cache

Some programs are very large and require a large amount of memory; main memory may not be able to hold all the data needed by the program, but the whole data set is not required at any one instant. So we use virtual memory, where main memory acts as a cache for the secondary storage device. To access virtual memory, the program uses virtual addresses, and each virtual address is translated to a physical address before memory is accessed. This translation increases the hit time of the cache. To reduce the hit time, we use the part of the address that is identical in the virtual and physical addresses (the page offset) to index the cache. While the cache is being read using that index, the virtual part of the address is translated, and the tag match is done using physical addresses. This allows the cache read to begin immediately, while the tag comparison still uses physical addresses.

3) Pipelined cache access

The cache access is pipelined so that the different steps of memory accesses can overlap.

4) Trace Cache

A conventional cache stores instructions in static program order. A trace cache is a special instruction cache that captures dynamic instruction sequences. The trace cache is accessed in parallel with the instruction cache. On a hit in the trace cache, the instruction sequence is read into the issue buffer; on a miss, the instruction sequence is read from the instruction cache.

The first time a trace (instruction sequence) is encountered, it is allocated a line or block in the trace cache. The line is filled as the instructions are fetched from the instruction cache. If the same trace is encountered again in the course of executing the program, it will be available in the trace cache. A trace hit occurs only if both the fetch address and the branch predictions match. The figure shows an example of storing instructions in a trace cache.

*************************
