INTRODUCTION
In recent years, energy and power efficiency have become key design objectives for microprocessors, in both the embedded and the general-purpose domains. Although extensive research has focused on improving the performance of prefetching mechanisms, the impact of prefetching techniques on processor energy efficiency has not yet been fully investigated. Both hardware and software techniques have been proposed for data prefetching. Software prefetching techniques normally rely on compiler analyses to insert explicit prefetch instructions into the executable; prefetch instructions are supported by most contemporary microprocessors. Hardware prefetching techniques use additional circuitry to prefetch data based on observed access patterns. In general, hardware prefetching tends to yield better performance than software prefetching for most applications.
In order to achieve both energy efficiency and good performance, we investigate the energy impact of hardware-based data prefetching techniques, explore their energy/performance tradeoffs, and introduce new compiler and hardware techniques to mitigate their energy overhead. Although aggressive hardware prefetching techniques improve performance significantly, in most applications they increase energy consumption by up to 30% compared to the case with no prefetching. In many systems this constitutes more than a 15% increase in chip-wide energy consumption and would likely be unacceptable. Most of the energy overhead due to hardware prefetching comes from the prefetch-hardware-related energy cost and from unnecessary L1 data cache lookups for prefetches that hit in the L1 cache.
In this paper, we propose and evaluate several techniques to reduce the energy overhead of hardware data prefetching:
- A compiler-based selective filtering approach that reduces the number of accesses to the prefetch hardware.
- A compiler-assisted adaptive prefetching mechanism that uses compiler information to selectively apply different hardware prefetching schemes based on predicted memory access patterns.
- A compiler-driven filtering technique using a runtime stride counter, designed to reduce prefetching energy consumption on memory access patterns with very small strides.
- A hardware-based filtering technique applied to further reduce the L1 cache-related energy overhead due to prefetching.
- A Power-Aware pRefetch Engine (PARE) with a new prefetching table and compiler-based location-set analysis that consumes 7 to 11 times less power per access compared to previous approaches.
Together, the proposed techniques can significantly reduce the energy overhead of hardware prefetching, leading to total energy consumption that is comparable to, or even lower than, that of the no-prefetching case. This achieves the twin objectives of high performance and low energy.
CHAPTER 2
DATA PREFETCHING
By any metric, microprocessor performance has increased at a dramatic rate over the past decade. This trend has been sustained by continued architectural innovations and advances in microprocessor fabrication technology. In contrast, main memory dynamic RAM (DRAM) performance has increased at a much more leisurely rate, as shown in Fig 2.1.
Chief among latency reducing techniques is the use of cache memory hierarchies. The static RAM (SRAM) memories used in caches have managed to keep pace with processor memory request rates but continue to be too expensive for a main store technology. Although the use of large cache hierarchies has proven to be effective in reducing the average memory access penalty for programs that show a high degree of locality in their addressing patterns, it is still not uncommon for scientific and other data-intensive programs to spend more than half their run times stalled on memory requests. The large, dense matrix operations that form the basis of many such applications typically exhibit little data reuse and thus may defeat caching strategies. The poor cache utilization of these applications is partially a result of the on-demand memory fetch policy of most caches. This policy fetches data into the cache from main memory only after the processor has requested a word and found it absent from the cache. The situation is illustrated in
Fig 2.2, where computation, including memory references satisfied within the cache hierarchy, is represented by the upper time line, while main memory access time is represented by the lower time line.
Fig 2.2. Execution diagram assuming (a) no prefetching, (b) perfect prefetching, and (c) degraded prefetching.
In this figure, the data blocks associated with memory references r1, r2, and r3 are not found in the cache hierarchy and must therefore be fetched from main memory. Assuming a simple, in-order execution unit, the processor will be stalled while it waits for the corresponding cache block to be fetched. Once the data returns from main memory it is cached and forwarded to the processor where computation may again proceed. Note that this fetch policy will always result in a cache miss for the first access to a cache block, since only previously accessed data are stored in the cache. Such cache misses are known as cold start or compulsory misses. Also, if the referenced data is part of a large array operation, it is likely that the data will be replaced after its use to make room for new array elements being streamed into the cache. When the same data block is needed later, the processor must again bring it in from main memory, incurring the full main memory access latency. This is called a capacity miss. Many of these cache misses can be avoided if we augment the demand fetch policy of the cache with a data prefetch operation. Rather than waiting for a cache miss to perform a memory fetch, data prefetching anticipates
such misses and issues a fetch to the memory system in advance of the actual memory reference. This prefetch proceeds in parallel with processor computation, allowing the memory system time to transfer the desired data from main memory to the cache. Ideally, the prefetch will complete just in time for the processor to access the needed data in the cache without stalling the processor.
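To make the timing tradeoff concrete, the following sketch models the three execution diagrams of Fig 2.2 with a simple cycle-count model; the compute and memory latency values are illustrative assumptions, not measurements from this work.

```python
# Illustrative model of the execution diagrams in Fig 2.2.
# All cycle counts below are assumed values, not measurements.

COMPUTE_CYCLES = 20      # computation between consecutive misses r1, r2, r3
MEMORY_LATENCY = 50      # main-memory access latency per miss

def no_prefetching(num_misses):
    # Each miss stalls the processor for the full memory latency.
    return num_misses * (COMPUTE_CYCLES + MEMORY_LATENCY)

def perfect_prefetching(num_misses):
    # Every fetch overlaps completely with computation, so no stalls remain.
    return num_misses * COMPUTE_CYCLES

def degraded_prefetching(num_misses, lead_time):
    # The prefetch is issued only lead_time cycles before the data is needed,
    # so the processor still stalls for the uncovered part of the latency.
    stall = max(0, MEMORY_LATENCY - lead_time)
    return num_misses * (COMPUTE_CYCLES + stall)

if __name__ == "__main__":
    for label, cycles in [("no prefetching", no_prefetching(3)),
                          ("perfect prefetching", perfect_prefetching(3)),
                          ("degraded prefetching", degraded_prefetching(3, lead_time=30))]:
        print(f"{label}: {cycles} cycles")
```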
CHAPTER 3
In the case of false sharing, this traffic is unnecessary because only the processor executing the store references the word being written. Increasing the cache block size increases the likelihood of two processors sharing data from the same block, and hence false sharing is more likely to arise. Sequential prefetching can take advantage of spatial locality without introducing some of the problems associated with large cache blocks. The simplest sequential prefetching schemes are variations upon the one-block-lookahead (OBL) approach, which initiates a prefetch for block b+1 when block b is accessed. This differs from simply doubling the block size in that the prefetched blocks are treated separately with regard to cache replacement and coherence policies. For example, a large block may contain one word that is frequently referenced and several other words that are not in use. Assuming an LRU replacement policy, the entire block will be retained, even though only a portion of the block's data is actually in use. If this large block is replaced with two smaller blocks, one of them could be evicted to make room for more active data. Similarly, the use of smaller cache blocks reduces the probability that false sharing will occur. OBL implementations differ depending on what type of access to block b initiates the prefetch of b+1. There are several such approaches, of which the prefetch-on-miss and tagged prefetch algorithms are discussed here. The prefetch-on-miss algorithm simply initiates a prefetch for block b+1 whenever an access to block b results in a cache miss. If b+1 is already cached, no memory access is initiated. The tagged prefetch algorithm associates a tag bit with every memory block. This bit is used to detect when a block is demand-fetched or when a prefetched block is referenced for the first time. In either of these cases, the next sequential block is fetched. It was found that tagged prefetching reduces cache miss ratios in a unified (both
instruction and data) cache by between 50% and 90% for a set of trace-driven simulations. Prefetch-on-miss was less than half as effective as tagged prefetching in reducing miss ratios. The reason prefetch-on-miss is less effective is illustrated in Fig 3.1, where the behavior of each algorithm when accessing three contiguous blocks is shown. Here, it can be seen that a strictly sequential access pattern will result in a cache miss for every other cache block when the prefetch-on-miss algorithm is used, but the same access pattern results in only one cache miss when a tagged prefetch algorithm is employed.
Fig 3.1. Three forms of sequential prefetching: (a) Prefetch on miss, (b) tagged prefetch, and (c) sequential prefetching with K = 2
One shortcoming of the OBL schemes is that the prefetch may not be initiated far enough in advance of the actual use to avoid a processor memory stall. A sequential access stream resulting from a tight loop, for example, may not allow sufficient lead time between the use of block b and the request for block b+1. To solve this problem, it is possible to increase the number of blocks prefetched after a demand fetch from one to K, where K is known as the degree of prefetching. Prefetching K > 1 subsequent blocks aids the memory system in staying ahead of rapid processor requests for sequential data blocks. As each prefetched block b is accessed for the first time, the cache is interrogated to check whether blocks b+1, ..., b+K are present in the cache and, if not, the missing blocks are fetched from memory.
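The behavior described above and in Fig 3.1 can be illustrated with a small simulation sketch; the unbounded-set cache model and the strictly sequential block stream are simplifying assumptions made only to count misses.

```python
# Sketch comparing prefetch-on-miss and tagged prefetch (degree K) on a
# sequential block access stream; the cache is modeled as an unbounded set
# purely to count misses, which is an assumption made for clarity.

def prefetch_on_miss(blocks, k=1):
    cache, misses = set(), 0
    for b in blocks:
        if b not in cache:
            misses += 1
            cache.add(b)
            # A miss on block b prefetches blocks b+1 .. b+K.
            cache.update(range(b + 1, b + 1 + k))
    return misses

def tagged_prefetch(blocks, k=1):
    cache, tagged, misses = set(), set(), 0
    for b in blocks:
        if b not in cache:
            misses += 1
            cache.add(b)
            trigger = True           # demand fetch
        else:
            trigger = b in tagged    # first reference to a prefetched block
        if trigger:
            tagged.discard(b)
            for nxt in range(b + 1, b + 1 + k):
                if nxt not in cache:
                    cache.add(nxt)
                    tagged.add(nxt)  # mark as prefetched, not yet referenced
    return misses

stream = list(range(16))             # strictly sequential accesses to 16 blocks
print("prefetch-on-miss misses:", prefetch_on_miss(stream))   # miss every other block
print("tagged prefetch misses:", tagged_prefetch(stream))     # single compulsory miss
```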
where A3 is the predicted value of the observed address a3. Prefetching continues in this way until the predicted address An no longer matches the observed address an.
Fig 3.2. State transition graph for reference prediction table entries.
Note that this approach requires the previous address used by a memory instruction to be stored along with the last detected stride, if any. Recording the reference histories of every memory instruction in the program is clearly impossible. Instead, a separate cache called the reference prediction table (RPT) holds this information for only the most recently used memory instructions. The organization of the RPT is given in Fig 3.2. Table entries contain the address of the memory instruction, the previous address accessed by this instruction, a stride value for those entries that have established a stride, and a state field that records the entry's current state. The state diagram for RPT entries is given in Fig 3.3. The RPT is indexed by the CPU's program counter (PC). When memory instruction mi is executed for the first time, an entry for it is made in the RPT with the state set to initial, signifying that no prefetching is yet initiated for this instruction. If mi is executed again before its RPT entry has been evicted, a stride value is calculated by subtracting the previous address stored in the RPT from the current effective address.
To illustrate the functionality of the RPT, consider the matrix multiply code and associated RPT entries given in Fig 3.4. In this example, only the load instructions for arrays a, b, and c are considered, and it is assumed that the arrays begin at addresses 10000, 20000, and 30000, respectively. For simplicity, one-word cache blocks are also assumed. After the first iteration of the innermost loop, the state of the RPT is as given in Fig 3.4(b), where instruction addresses are represented by their pseudocode mnemonics. Since the RPT does not yet contain entries for these instructions, the stride fields are initialized to zero and each entry is placed in an
initial state. All three references result in a cache miss. After the second iteration, strides are computed as shown in Fig 3.4(c). The entries for the array references to b and c are placed in a transient state because the newly computed strides do not match the previous stride. This state indicates that an instruction's referencing pattern may be in transition, and a tentative prefetch is issued for the block at address (effective address + stride) if it is not already cached. The RPT entry for the reference to array a is placed in a steady state because the previous and current strides match. Since this entry's stride is zero, no prefetching will be issued for this instruction. During the third iteration, the entries for array references b and c move to the steady state when the tentative strides computed in the previous iteration are confirmed. The prefetches issued during the second iteration result in cache hits for the b and c references, provided that a prefetch distance of one is sufficient.
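The following sketch illustrates the RPT behavior walked through in this example; the table organization is simplified to a Python dictionary, and the PC value and stride-4 address stream are hypothetical stand-ins for one of the array references above.

```python
# Minimal sketch of a reference prediction table (RPT), keyed by the PC of a
# load; the three states follow the description in the text above.

INITIAL, TRANSIENT, STEADY = "initial", "transient", "steady"

class RPT:
    def __init__(self):
        self.entries = {}   # pc -> [prev_addr, stride, state]

    def access(self, pc, addr):
        """Record one execution of the load at pc; return a prefetch address or None."""
        if pc not in self.entries:
            self.entries[pc] = [addr, 0, INITIAL]   # no prefetching yet
            return None
        prev_addr, stride, state = self.entries[pc]
        new_stride = addr - prev_addr
        if new_stride == stride:
            state = STEADY                 # stride confirmed
        else:
            state = TRANSIENT              # referencing pattern may be in transition
            stride = new_stride
        self.entries[pc] = [addr, stride, state]
        if stride != 0:
            return addr + stride           # tentative (transient) or confirmed (steady) prefetch
        return None                        # zero stride: no prefetch issued

# Hypothetical stride-4 access stream, e.g. successive word accesses to array b.
rpt = RPT()
for addr in (20000, 20004, 20008, 20012):
    print(rpt.access(pc=0x400, addr=addr))   # None, 20008, 20012, 20016
```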
CHAPTER 4
4.1. Compiler-Based Selective Filtering (CBSF) of Hardware Prefetches
Not all load instructions are useful for prefetching. Some instructions, such as scalar memory accesses, cannot trigger useful prefetches when fed into the prefetcher. The compiler identifies the following memory accesses as not being beneficial to prefetching. Noncritical: memory accesses within a loop or a recursive function are regarded as critical accesses; we can safely filter out the other, noncritical accesses. Scalar: scalar accesses do not contribute to the prefetcher. Only memory accesses to array structures and linked data structures are therefore fed to the prefetcher. This optimization eliminates 8% of all prefetch-table accesses on average.
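A minimal sketch of the CBSF filtering decision is given below; the MemAccess record and its fields are hypothetical stand-ins for the compiler's internal classification of each load, not part of the original design.

```python
# Hedged sketch of the compiler-based selective filtering (CBSF) decision:
# only loads that are critical (inside a loop or recursive function) and that
# access array or linked data structures are forwarded to the prefetch hardware.
# The MemAccess record is a hypothetical compiler IR abstraction.

from dataclasses import dataclass

@dataclass
class MemAccess:
    in_loop_or_recursion: bool   # "critical" accesses
    is_scalar: bool              # scalar vs. array / linked-structure access

def feeds_prefetcher(acc: MemAccess) -> bool:
    if not acc.in_loop_or_recursion:
        return False             # noncritical: filtered out
    if acc.is_scalar:
        return False             # scalars cannot trigger useful prefetches
    return True

accesses = [MemAccess(True, False), MemAccess(True, True), MemAccess(False, False)]
print([feeds_prefetcher(a) for a in accesses])   # [True, False, False]
```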
4.2. Compiler-Assisted Adaptive Hardware Prefetching (CAAP)
CAAP is a filtering approach that helps the prefetch predictor choose which prefetching scheme (dependence-based or stride-based) is appropriate for a given access pattern. One important aspect of the combined approach is that it uses the two techniques independently and prefetches based on the memory access patterns of all memory accesses. Since distinguishing between pointer and non-pointer accesses is difficult during execution, it is accomplished during compilation. Array accesses and pointer accesses are annotated using hints written into the instructions. At runtime, the prefetch engine can identify the hints and apply different prefetching mechanisms. We have found that simply splitting the array and pointer structures is not very effective and hurts the performance speedup (which is a primary goal of prefetching techniques). Instead, we use the following heuristic to decide whether to use stride prefetching or pointer prefetching: memory accesses to an array that does not belong to any larger structure (e.g., fields in a C struct) are fed only into the stride prefetcher; memory accesses to an array that belongs to a larger structure are fed into both the stride and the pointer prefetcher; memory accesses to a linked data structure with no arrays are fed only into the pointer prefetcher; and memory accesses to a linked data structure that contains arrays are fed into both prefetchers.
The above heuristic is able to preserve the performance speedup benefits of the aggressive prefetching scheme. This technique can filter out up to 20% of all the prefetch-table accesses and up to 10% of the extra L1 tag lookups.
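The routing heuristic described above can be summarized by the following sketch; the boolean flags are hypothetical names for the compiler's classification of each access, and the returned set denotes which prefetcher(s) receive the access.

```python
# Sketch of the CAAP routing heuristic; in the actual design the compiler
# would encode the resulting choice as a hint in the load instruction.

def caap_targets(is_array: bool, inside_struct: bool, has_array_fields: bool):
    """Return which prefetchers ('stride', 'pointer') a memory access feeds."""
    if is_array and not inside_struct:
        return {"stride"}                      # plain array access
    if is_array and inside_struct:
        return {"stride", "pointer"}           # array embedded in a larger structure
    if not has_array_fields:
        return {"pointer"}                     # pure linked data structure
    return {"stride", "pointer"}               # linked structure containing arrays

print(caap_targets(is_array=True,  inside_struct=False, has_array_fields=False))
print(caap_targets(is_array=False, inside_struct=False, has_array_fields=True))
```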
During a search operation, the program counter of the load instruction is driven to and compared in parallel with all locations in the CAM array. Depending on the matching tag, one of the wordlines in the SRAM array is selected and read out.
The prefetching engine updates the table for each load instruction and checks whether a steady prefetching relationship has been established. If a steady relation exists, the prefetch address is calculated from that relation and the data stored in the history table, and a prefetch request is issued in the following cycle.
First, because of the program locality property, we do not need the whole 32-bit PC to distinguish between different memory access instructions. If we use only the lower 16 bits of the PC, we can reduce the power consumed by each CAM access by roughly half. Next, we break the whole history table into 16 smaller tables, each containing only 4 entries, as shown in Fig 5.2. Each memory access is directed to one of the smaller tables according to the group number provided by the compiler when it enters the prefetching engine. The prefetching engine updates the information within that group and makes prefetching decisions based solely on the information within the group. Compile-time location set analysis is used to ensure that no information is lost due to the partitioning of memory accesses. The group number can be accommodated in future ISAs that target energy efficiency and can be added easily in VLIW/EPIC types of designs. We also expect that many optimizations that use compiler hints could be combined to reduce the impact on the ISA. The approach can reduce power significantly even with fewer tables (requiring fewer bits in the ISA) and could also be implemented in current ISAs by using some bits from the offset. Embedded ISAs like ARM, which have 4 bits for predication in each instruction, could trade off some (or all) of the predication bits for compiler-inserted hints.
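The table organization just described can be sketched behaviorally as follows; the FIFO replacement policy and the dictionary-valued data field are assumptions made for brevity, while the 16 four-entry tables, the compiler-provided group number, and the 16-bit PC tag follow the text above.

```python
# Behavioral sketch of a PARE-style lookup: the compiler-assigned group number
# selects one of 16 small tables, and only that table's CAM is searched using
# the lower 16 bits of the PC. Table contents are simplified to Python tuples.

NUM_TABLES, ENTRIES_PER_TABLE = 16, 4

class PARETable:
    def __init__(self):
        # Each small table is a list of (tag, data) pairs, at most 4 entries.
        self.tables = [[] for _ in range(NUM_TABLES)]

    def lookup(self, pc: int, group: int):
        tag = pc & 0xFFFF                    # lower 16 bits of the PC
        table = self.tables[group]           # only this small table is activated
        for entry_tag, data in table:        # 4-entry CAM search
            if entry_tag == tag:
                return data
        return None

    def insert(self, pc: int, group: int, data):
        table = self.tables[group]
        if len(table) == ENTRIES_PER_TABLE:
            table.pop(0)                     # evict oldest entry (FIFO, assumed)
        table.append((pc & 0xFFFF, data))

pare = PARETable()
pare.insert(pc=0x00401234, group=3, data={"last_addr": 20000, "stride": 4})
print(pare.lookup(pc=0x00401234, group=3))
```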
In the PARE history table shown in Fig 5.2, only one of the 16 tables is activated during a search operation, based on the group number provided by the compiler. The CAM search is performed only within the activated table, which is a fully associative 4-entry CAM array. The schematic of each small table is shown in Fig 5.3. Each small table consists of a 4 x 16-bit CAM array containing the program counter, a sense amplifier and a valid bit for each CAM row, and the SRAM array on the right, which contains the data. Each CAM cell uses ten transistors, comprising an SRAM cell and a dynamic XOR gate used for comparison. It separates the search bitlines from the write bitlines in order to reduce the capacitance switched during a search operation. For the row sense amplifier, we use a single-ended alpha latch to sense the match line during the search in the CAM array. The activation timing of the sense amplifier was determined for the case where only one bit in the word is in the mismatch state. Each word has a valid bit indicating whether the data stored in the word is to be used in search operations. A match line and a single-ended sense amplifier are associated with each word. A hit/miss signal is also generated: a high state indicates a hit (or multiple hits), and a low state indicates a miss.
Finally, the SRAM array is the memory block that holds the data. Low-power memory designs typically use a six-transistor (6T) SRAM cell. Writes are performed differentially with full-rail voltage swings. The power dissipation for each successful search is the power consumed in the decoder, the CAM search, and the SRAM read. The power consumed in a CAM search includes the power in the match lines and search lines, the sense amplifiers, and the valid bits. The new hardware prefetch table has the following benefits compared to the baseline design: the dynamic power consumption is dramatically reduced because of the partitioning into 16 smaller tables; the CAM cell power is also reduced because we use only the lower 16 bits of the PC instead of the whole 32 bits; and, since each table is very small (4 entries), we do not need a column sense amplifier, which further reduces the total power consumed. However, some overhead is also introduced by the new design. First, we need an address decoder to select one of the 16 tables. Second, the total leakage power increases (in a relative sense only), because while one of the smaller tables is active, the remaining 15 tables are leaking. Fortunately, these overheads are small compared to the savings, so the new PARE design still reduces power substantially overall.
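In symbols, the per-search power decomposition stated above can be written as:

P_{\text{search}} = P_{\text{decoder}} + P_{\text{CAM}} + P_{\text{SRAM read}}, \qquad
P_{\text{CAM}} = P_{\text{match lines}} + P_{\text{search lines}} + P_{\text{sense amps}} + P_{\text{valid bits}}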
Figure 5.4 shows the overall organization of one sub-bank of a CAM-tag cache. Each cache line in the sub-bank has a local tag that compares its contents with the broadcast search tag bits. Each CAM cell is a standard ten-transistor design laid out to be exactly twice the RAM cell area, at 4.32 µm × 9.12 µm = 39.4 µm², as shown in Figure 5.5. The cell contains an SRAM cell and a dynamic XOR gate used for comparison. The match line is precharged high and conditionally discharged on a mismatch. All match lines are ORed together to generate the hit signal. The search bitlines, match lines, and buffers that drive control signals across the tag array are the main consumers of energy in the CAM-tag cache.
To reduce the capacitance switched during a search operation, we separate the search bitlines from the write bitlines. To reduce the energy dissipated on the match lines, they are only precharged to Vdd − Vtn through n-type precharge transistors, and single-ended sense amplifiers are used to regenerate a full-rail match signal. As shown in Figure 5.6, we also break the entire row of tags into two equal partitions so that the worst-case delay of the match-line capacitance discharge is halved. As with RAM-tag designs, we break the cache into sub-banks using low-order address bits and only activate a search within the enabled sub-bank. We can further reduce the energy of
a CAM-tag design by enabling only a smaller number of rows within a sub-bank, effectively reducing the associativity. For example, the StrongARM design has 64 CAM rows (128 RAM rows) in each cache sub-bank but enables only one 32-way subset on each access. Figure 5 shows the layout of a complete 1 KB 32-way set-associative cache sub-bank. This consumes around 10% more area than a 1 KB RAM-tag sub-bank. CAM tags have the property of providing very high associativity within a single sub-bank without having to partition the RAM array. There would be significant area overhead for sense amplifiers and muxes if we were to try to implement a small, highly associative RAM-tag cache sub-bank.
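The sub-banking and row-subset selection described above can be sketched as an address decomposition; the field widths and their ordering below are illustrative assumptions, since they depend on the actual cache geometry.

```python
# Sketch of how low-order address bits select a CAM-tag sub-bank and a row
# subset, so that only part of the CAM is searched per access. All field
# widths here are illustrative assumptions, not taken from a real design.

SUBBANK_BITS = 2      # e.g., 4 sub-banks
SUBSET_BITS  = 1      # e.g., enable one of two 32-row subsets per sub-bank
OFFSET_BITS  = 5      # e.g., 32-byte cache lines

def cam_tag_decompose(addr: int):
    subset  = (addr >> OFFSET_BITS) & ((1 << SUBSET_BITS) - 1)
    subbank = (addr >> (OFFSET_BITS + SUBSET_BITS)) & ((1 << SUBBANK_BITS) - 1)
    tag     = addr >> (OFFSET_BITS + SUBSET_BITS + SUBBANK_BITS)
    return subbank, subset, tag   # 'tag' is searched only in the enabled rows

print(cam_tag_decompose(0x00012F40))
```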
Another advantage of CAM-tag designs is that they simplify the handling of cache stores. In a CAM-tag design, the data RAM word lines are enabled only on a cache hit, so stores can complete in a single cycle. A conventional RAM-tag design has to split the store access across two cycles: the first cycle checks the tag, and the second cycle writes the data storage on a hit. To allow full-speed writes, RAM-tag designs often include a write buffer ahead of the primary cache to avoid stalling on stores, adding additional complexity and energy overhead.
CONCLUSION
This paper explores the energy-efficiency aspects of hardware data-prefetching techniques and proposes several new techniques, including a power-aware prefetch engine (PARE), to make prefetching energy-aware. PARE reduces prefetching-related energy consumption by 7 to 11 times. In conjunction with a net leakage energy reduction due to the performance improvement, this may yield up to 12% less total energy consumption compared to a no-prefetching baseline. While the new techniques may give up a very small fraction of the performance benefit compared to a prefetching scheme with no energy-aware techniques, they still maintain a significant speedup over the no-prefetching baseline, thereby achieving the twin goals of energy efficiency and performance improvement.