INTRODUCTION
In recent years, energy and power efficiency have become key design objectives for microprocessors, in both the embedded and the general-purpose domains. Although extensive research has focused on improving the performance of prefetching mechanisms, the impact of prefetching techniques on processor energy efficiency has not yet been fully investigated. Both hardware and software techniques have been proposed for data prefetching. Software prefetching techniques normally rely on compiler analyses to insert explicit prefetch instructions into the executable; prefetch instructions are supported by most contemporary microprocessors. Hardware prefetching techniques use additional circuitry to prefetch data based on observed access patterns. In general, hardware prefetching tends to yield better performance than software prefetching for most applications.
In order to achieve both energy efficiency and good performance, we investigate the energy impact of hardware-based data prefetching techniques, explore their energy/performance tradeoffs, and introduce new compiler and hardware techniques to mitigate their energy overhead. Although aggressive hardware prefetching techniques improve performance significantly, in most applications they increase energy consumption by up to 30% compared to the case with no prefetching. In many systems this constitutes more than a 15% increase in chip-wide energy consumption and would likely be unacceptable. Most of the energy overhead due to hardware prefetching comes from the prefetch-hardware-related energy cost and from unnecessary L1 data cache lookups for prefetches that hit in the L1 cache.
In this paper, we propose and evaluate several techniques to reduce the energy overhead of hardware data prefetching:
- A compiler-based selective filtering approach that reduces the number of accesses to the prefetch hardware.
- A compiler-assisted adaptive prefetching mechanism that uses compiler information to selectively apply different hardware prefetching schemes based on predicted memory access patterns.
- A compiler-driven filtering technique using a runtime stride counter, designed to reduce prefetching energy consumption on memory access patterns with very small strides.
- A hardware-based filtering technique applied to further reduce the L1 cache-related energy overhead due to prefetching.
- A Power-Aware pRefetch Engine (PARE) with a new prefetching table and compiler-based location-set analysis that consumes 7 to 11 times less power per access compared to previous approaches.
Together, the proposed techniques can significantly reduce the energy overhead of hardware prefetching, leading to total energy consumption that is comparable to, or even lower than, that of the no-prefetching case. This achieves the twin objectives of high performance and low energy.
CHAPTER 2
DATA PREFETCHING
By any metric, microprocessor performance has increased at a dramatic rate over the past decade. This trend has been sustained by continued architectural innovations and advances in microprocessor fabrication technology. In contrast, main memory dynamic RAM (DRAM) performance has increased at a much more leisurely rate, as shown in Fig 2.1.
Chief among latency reducing techniques is the use of cache memory hierarchies. The static RAM (SRAM) memories used in caches have managed to keep pace with processor memory request rates but continue to be too expensive for a main store technology. Although the use of large cache hierarchies has proven to be effective in reducing the average memory access penalty for programs that show a high degree of locality in their addressing patterns, it is still not uncommon for scientific and other data-intensive programs to spend more than half their run times stalled on memory requests. The large, dense matrix operations that form the basis of many such applications typically exhibit little data reuse and thus may defeat caching strategies. The poor cache utilization of these applications is partially a result of the on-demand memory fetch policy of most caches. This policy fetches data into the cache from main memory only after the processor has requested a word and found it absent from the cache. The situation is illustrated in
Fig 2.2, where computation, including memory references satisfied within the cache hierarchy, is represented by the upper time line, while main memory access time is represented by the lower time line.
Fig 2.2. Execution diagram assuming (a) no prefetching, (b) perfect prefetching, and (c) degraded prefetching.
In this figure, the data blocks associated with memory references r1, r2, and r3 are not found in the cache hierarchy and must therefore be fetched from main memory. Assuming a simple, in-order execution unit, the processor will be stalled while it waits for the corresponding cache block to be fetched. Once the data returns from main memory it is cached and forwarded to the processor where computation may again proceed. Note that this fetch policy will always result in a cache miss for the first access to a cache block, since only previously accessed data are stored in the cache. Such cache misses are known as cold start or compulsory misses. Also, if the referenced data is part of a large array operation, it is likely that the data will be replaced after its use to make room for new array elements being streamed into the cache. When the same data block is needed later, the processor must again bring it in from main memory, incurring the full main memory access latency. This is called a capacity miss. Many of these cache misses can be avoided if we augment the demand fetch policy of the cache with a data prefetch operation. Rather than waiting for a cache miss to perform a memory fetch, data prefetching anticipates
such misses and issues a fetch to the memory system in advance of the actual memory reference. This prefetch proceeds in parallel with processor computation, allowing the memory system time to transfer the desired data from main memory to the cache. Ideally, the prefetch will complete just in time for the processor to access the needed data in the cache without stalling the processor.
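To make the timing tradeoff concrete, the following sketch models the three execution diagrams of Fig 2.2 with a simple cycle-count model; the compute and memory latency values are illustrative assumptions, not measurements from this work.

```python
# Illustrative model of the execution diagrams in Fig 2.2.
# All cycle counts below are assumed values, not measurements.

COMPUTE_CYCLES = 20      # computation between consecutive misses r1, r2, r3
MEMORY_LATENCY = 50      # main-memory access latency per miss

def no_prefetching(num_misses):
    # Each miss stalls the processor for the full memory latency.
    return num_misses * (COMPUTE_CYCLES + MEMORY_LATENCY)

def perfect_prefetching(num_misses):
    # Every fetch overlaps completely with computation, so no stalls remain.
    return num_misses * COMPUTE_CYCLES

def degraded_prefetching(num_misses, lead_time):
    # The prefetch is issued only lead_time cycles before the data is needed,
    # so the processor still stalls for the uncovered part of the latency.
    stall = max(0, MEMORY_LATENCY - lead_time)
    return num_misses * (COMPUTE_CYCLES + stall)

if __name__ == "__main__":
    for label, cycles in [("no prefetching", no_prefetching(3)),
                          ("perfect prefetching", perfect_prefetching(3)),
                          ("degraded prefetching", degraded_prefetching(3, lead_time=30))]:
        print(f"{label}: {cycles} cycles")
```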
CHAPTER 3
In the case of false sharing, this traffic is unnecessary because only the processor executing the store references the word being written. Increasing the cache block size increases the likelihood of two processors sharing data from the same block, and hence false sharing is more likely to arise. Sequential prefetching can take advantage of spatial locality without introducing some of the problems associated with large cache blocks. The simplest sequential prefetching schemes are variations upon the one-block-lookahead (OBL) approach, which initiates a prefetch for block b+1 when block b is accessed. This differs from simply doubling the block size in that the prefetched blocks are treated separately with regard to cache replacement and coherence policies. For example, a large block may contain one word that is frequently referenced and several other words that are not in use. Assuming an LRU replacement policy, the entire block will be retained, even though only a portion of the block's data is actually in use. If this large block is replaced with two smaller blocks, one of them could be evicted to make room for more active data. Similarly, the use of smaller cache blocks reduces the probability that false sharing will occur. OBL implementations differ depending on what type of access to block b initiates the prefetch of b+1. There are several such approaches, of which the prefetch-on-miss and tagged prefetch algorithms are discussed here. The prefetch-on-miss algorithm simply initiates a prefetch for block b+1 whenever an access to block b results in a cache miss. If b+1 is already cached, no memory access is initiated. The tagged prefetch algorithm associates a tag bit with every memory block. This bit is used to detect when a block is demand-fetched or when a prefetched block is referenced for the first time. In either of these cases, the next sequential block is fetched. It was found that tagged prefetching reduces cache miss ratios in a unified (both
instruction and data) cache by between 50% and 90% for a set of trace-driven simulations. Prefetch-on-miss was less than half as effective as tagged prefetching in reducing miss ratios. The reason prefetch-on-miss is less effective is illustrated in Fig 3.1, where the behavior of each algorithm when accessing three contiguous blocks is shown. Here, it can be seen that a strictly sequential access pattern will result in a cache miss for every other cache block when the prefetch-on-miss algorithm is used, but the same access pattern results in only one cache miss when a tagged prefetch algorithm is employed.
Fig 3.1. Three forms of sequential prefetching: (a) Prefetch on miss, (b) tagged prefetch, and (c) sequential prefetching with K = 2
One shortcoming of the OBL schemes is that the prefetch may not be initiated far enough in advance of the actual use to avoid a processor memory stall. A sequential access stream resulting from a tight loop, for example, may not allow sufficient lead time between the use of block b and the request for block b+1. To solve this problem, it is possible to increase the number of blocks prefetched after a demand fetch from one to K, where K is known as the degree of prefetching. Prefetching K > 1 subsequent blocks aids the memory system in staying ahead of rapid processor requests for sequential data blocks. As each prefetched block b is accessed for the first time, the cache is interrogated to check whether blocks b+1, ..., b+K are present in the cache and, if not, the missing blocks are fetched from memory.
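The behavior described above and in Fig 3.1 can be illustrated with a small simulation sketch; the unbounded-set cache model and the strictly sequential block stream are simplifying assumptions made only to count misses.

```python
# Sketch comparing prefetch-on-miss and tagged prefetch (degree K) on a
# sequential block access stream; the cache is modeled as an unbounded set
# purely to count misses, which is an assumption made for clarity.

def prefetch_on_miss(blocks, k=1):
    cache, misses = set(), 0
    for b in blocks:
        if b not in cache:
            misses += 1
            cache.add(b)
            # A miss on block b prefetches blocks b+1 .. b+K.
            cache.update(range(b + 1, b + 1 + k))
    return misses

def tagged_prefetch(blocks, k=1):
    cache, tagged, misses = set(), set(), 0
    for b in blocks:
        if b not in cache:
            misses += 1
            cache.add(b)
            trigger = True           # demand fetch
        else:
            trigger = b in tagged    # first reference to a prefetched block
        if trigger:
            tagged.discard(b)
            for nxt in range(b + 1, b + 1 + k):
                if nxt not in cache:
                    cache.add(nxt)
                    tagged.add(nxt)  # mark as prefetched, not yet referenced
    return misses

stream = list(range(16))             # strictly sequential accesses to 16 blocks
print("prefetch-on-miss misses:", prefetch_on_miss(stream))   # miss every other block
print("tagged prefetch misses:", tagged_prefetch(stream))     # single compulsory miss
```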
where A3 is the predicted value of the observed address a3. Prefetching continues in this way until the predicted address An no longer matches the observed address an.
Fig 3.2. State transition graph for reference prediction table entries.
Note that this approach requires the previous address used by a memory instruction to be stored along with the last detected stride, if any. Recording the reference histories of every memory instruction in the program is clearly impossible. Instead, a separate cache called the reference prediction table (RPT) holds this information for only the most recently used memory instructions. The organization of the RPT is given in Fig 3.2. Table entries contain the address of the memory instruction, the previous address accessed by this instruction, a stride value for those entries that have established a stride, and a state field that records the entry's current state. The state diagram for RPT entries is given in Fig 3.3. The RPT is indexed by the CPU's program counter (PC). When memory instruction mi is executed for the first time, an entry for it is made in the RPT with the state set to initial, signifying that no prefetching is yet initiated for this instruction. If mi is executed again before its RPT entry has been evicted, a stride value is calculated by subtracting the previous address stored in the RPT from the current effective address.
To illustrate the functionality of the RPT, consider the matrix multiply code and associated RPT entries given in Fig 3.4. In this example, only the load instructions for arrays a, b, and c are considered, and it is assumed that the arrays begin at addresses 10000, 20000, and 30000, respectively. For simplicity, one-word cache blocks are also assumed. After the first iteration of the innermost loop, the state of the RPT is as given in Fig 3.4(b), where instruction addresses are represented by their pseudocode mnemonics. Since the RPT does not yet contain entries for these instructions, the stride fields are initialized to zero and each entry is placed in an
initial state. All three references result in a cache miss. After the second iteration, strides are computed as shown in Fig 3.4(c). The entries for the array references to b and c are placed in a transient state because the newly computed strides do not match the previous stride. This state indicates that an instruction's referencing pattern may be in transition, and a tentative prefetch is issued for the block at address (effective address + stride) if it is not already cached. The RPT entry for the reference to array a is placed in a steady state because the previous and current strides match. Since this entry's stride is zero, no prefetching will be issued for this instruction. During the third iteration, the entries for array references b and c move to the steady state when the tentative strides computed in the previous iteration are confirmed. The prefetches issued during the second iteration result in cache hits for the b and c references, provided that a prefetch distance of one is sufficient.
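The following sketch illustrates the RPT behavior walked through in this example; the table organization is simplified to a Python dictionary, and the PC value and stride-4 address stream are hypothetical stand-ins for one of the array references above.

```python
# Minimal sketch of a reference prediction table (RPT), keyed by the PC of a
# load; the three states follow the description in the text above.

INITIAL, TRANSIENT, STEADY = "initial", "transient", "steady"

class RPT:
    def __init__(self):
        self.entries = {}   # pc -> [prev_addr, stride, state]

    def access(self, pc, addr):
        """Record one execution of the load at pc; return a prefetch address or None."""
        if pc not in self.entries:
            self.entries[pc] = [addr, 0, INITIAL]   # no prefetching yet
            return None
        prev_addr, stride, state = self.entries[pc]
        new_stride = addr - prev_addr
        if new_stride == stride:
            state = STEADY                 # stride confirmed
        else:
            state = TRANSIENT              # referencing pattern may be in transition
            stride = new_stride
        self.entries[pc] = [addr, stride, state]
        if stride != 0:
            return addr + stride           # tentative (transient) or confirmed (steady) prefetch
        return None                        # zero stride: no prefetch issued

# Hypothetical stride-4 access stream, e.g. successive word accesses to array b.
rpt = RPT()
for addr in (20000, 20004, 20008, 20012):
    print(rpt.access(pc=0x400, addr=addr))   # None, 20008, 20012, 20016
```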
CHAPTER 4
4.1. Compiler-Based Selective Filtering (CBSF) of Hardware Prefetches
Not all load instructions are useful for prefetching. Some instructions, such as scalar memory accesses, cannot trigger useful prefetches when fed into the prefetcher. The compiler identifies the following memory accesses as not being beneficial to prefetching. Noncritical: memory accesses within a loop or a recursive function are regarded as critical accesses; we can safely filter out the other, noncritical accesses. Scalar: scalar accesses do not contribute to the prefetcher. Only memory accesses to array structures and linked data structures are therefore fed to the prefetcher. This optimization eliminates 8% of all prefetch-table accesses on average.
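A minimal sketch of the CBSF filtering decision is given below; the MemAccess record and its fields are hypothetical stand-ins for the compiler's internal classification of each load, not part of the original design.

```python
# Hedged sketch of the compiler-based selective filtering (CBSF) decision:
# only loads that are critical (inside a loop or recursive function) and that
# access array or linked data structures are forwarded to the prefetch hardware.
# The MemAccess record is a hypothetical compiler IR abstraction.

from dataclasses import dataclass

@dataclass
class MemAccess:
    in_loop_or_recursion: bool   # "critical" accesses
    is_scalar: bool              # scalar vs. array / linked-structure access

def feeds_prefetcher(acc: MemAccess) -> bool:
    if not acc.in_loop_or_recursion:
        return False             # noncritical: filtered out
    if acc.is_scalar:
        return False             # scalars cannot trigger useful prefetches
    return True

accesses = [MemAccess(True, False), MemAccess(True, True), MemAccess(False, False)]
print([feeds_prefetcher(a) for a in accesses])   # [True, False, False]
```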
4.2. Compiler-Assisted Adaptive Hardware Prefetching (CAAP)
CAAP is a filtering approach that helps the prefetch predictor choose which prefetching scheme (dependence-based or stride-based) is appropriate for a given access pattern. One important aspect of the combined approach is that it uses the two techniques independently and prefetches based on the memory access patterns of all memory accesses. Since distinguishing between pointer and non-pointer accesses is difficult during execution, it is accomplished during compilation. Array accesses and pointer accesses are annotated using hints written into the instructions. At runtime, the prefetch engine can identify the hints and apply different prefetching mechanisms. We have found that simply splitting the array and pointer structures is not very effective and hurts the performance speedup (which is a primary goal of prefetching techniques). Instead, we use the following heuristic to decide whether to use stride prefetching or pointer prefetching: memory accesses to an array that does not belong to any larger structure (e.g., fields in a C struct) are fed only into the stride prefetcher; memory accesses to an array that belongs to a larger structure are fed into both the stride and the pointer prefetcher; memory accesses to a linked data structure with no arrays are fed only into the pointer prefetcher; and memory accesses to a linked data structure that contains arrays are fed into both prefetchers.
The above heuristic is able to preserve the performance speedup benefits of the aggressive prefetching scheme. This technique can filter out up to 20% of all the prefetch-table accesses and up to 10% of the extra L1 tag lookups.
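The routing heuristic described above can be summarized by the following sketch; the boolean flags are hypothetical names for the compiler's classification of each access, and the returned set denotes which prefetcher(s) receive the access.

```python
# Sketch of the CAAP routing heuristic; in the actual design the compiler
# would encode the resulting choice as a hint in the load instruction.

def caap_targets(is_array: bool, inside_struct: bool, has_array_fields: bool):
    """Return which prefetchers ('stride', 'pointer') a memory access feeds."""
    if is_array and not inside_struct:
        return {"stride"}                      # plain array access
    if is_array and inside_struct:
        return {"stride", "pointer"}           # array embedded in a larger structure
    if not has_array_fields:
        return {"pointer"}                     # pure linked data structure
    return {"stride", "pointer"}               # linked structure containing arrays

print(caap_targets(is_array=True,  inside_struct=False, has_array_fields=False))
print(caap_targets(is_array=False, inside_struct=False, has_array_fields=True))
```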
During a search operation, the program counter of the load instruction is driven to and compared in parallel with all locations in the CAM array. Depending on the matching tag, one of the wordlines in the SRAM array is selected and read out.
The prefetching engine updates the table for each load instruction and checks whether a steady prefetching relationship has been established. If a steady relation exists, the prefetch address is calculated from that relation and the data stored in the history table, and a prefetch request is issued in the following cycle.
First, because of the program locality property, we do not need the whole 32-bit PC to distinguish between different memory access instructions. If we use only the lower 16 bits of the PC, we can reduce the power consumed by each CAM access by roughly half. Next, we break the whole history table into 16 smaller tables, each containing only 4 entries, as shown in Fig 5.2. Each memory access is directed to one of the smaller tables according to the group number provided by the compiler when it enters the prefetching engine. The prefetching engine updates the information within that group and makes prefetching decisions based solely on the information within the group. Compile-time location set analysis is used to ensure that no information is lost due to the partitioning of memory accesses. The group number can be accommodated in future ISAs that target energy efficiency and can be added easily in VLIW/EPIC types of designs. We also expect that many optimizations that use compiler hints could be combined to reduce the impact on the ISA. The approach can reduce power significantly even with fewer tables (requiring fewer bits in the ISA) and could also be implemented in current ISAs by using some bits from the offset. Embedded ISAs like ARM, which have 4 bits for predication in each instruction, could trade off some (or all) of the predication bits for compiler-inserted hints.
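The table organization just described can be sketched behaviorally as follows; the FIFO replacement policy and the dictionary-valued data field are assumptions made for brevity, while the 16 four-entry tables, the compiler-provided group number, and the 16-bit PC tag follow the text above.

```python
# Behavioral sketch of a PARE-style lookup: the compiler-assigned group number
# selects one of 16 small tables, and only that table's CAM is searched using
# the lower 16 bits of the PC. Table contents are simplified to Python tuples.

NUM_TABLES, ENTRIES_PER_TABLE = 16, 4

class PARETable:
    def __init__(self):
        # Each small table is a list of (tag, data) pairs, at most 4 entries.
        self.tables = [[] for _ in range(NUM_TABLES)]

    def lookup(self, pc: int, group: int):
        tag = pc & 0xFFFF                    # lower 16 bits of the PC
        table = self.tables[group]           # only this small table is activated
        for entry_tag, data in table:        # 4-entry CAM search
            if entry_tag == tag:
                return data
        return None

    def insert(self, pc: int, group: int, data):
        table = self.tables[group]
        if len(table) == ENTRIES_PER_TABLE:
            table.pop(0)                     # evict oldest entry (FIFO, assumed)
        table.append((pc & 0xFFFF, data))

pare = PARETable()
pare.insert(pc=0x00401234, group=3, data={"last_addr": 20000, "stride": 4})
print(pare.lookup(pc=0x00401234, group=3))
```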
In the PARE history table shown in Fig 5.2, only one of the 16 tables is activated during a search operation, based on the group number provided by the compiler. The CAM search is performed only within the activated table, which is a fully associative 4-entry CAM array. The schematic of each small table is shown in Fig 5.3. Each small table consists of a 4 x 16-bit CAM array containing the program counter, a sense amplifier and a valid bit for each CAM row, and the SRAM array on the right, which contains the data. Each CAM cell uses ten transistors, comprising an SRAM cell and a dynamic XOR gate used for comparison. It separates the search bitlines from the write bitlines in order to reduce the capacitance switched during a search operation. For the row sense amplifier, we use a single-ended alpha latch to sense the match line during the search in the CAM array. The activation timing of the sense amplifier was determined for the case where only one bit in the word is in the mismatch state. Each word has a valid bit indicating whether the data stored in the word is to be used in search operations. A match line and a single-ended sense amplifier are associated with each word. A hit/miss signal is also generated: a high state indicates a hit (or multiple hits), and a low state indicates a miss.
Finally, the SRAM array is the memory block that holds the data. Low-power memory designs typically use a six-transistor (6T) SRAM cell. Writes are performed differentially with full-rail voltage swings. The power dissipation for each successful search is the power consumed in the decoder, the CAM search, and the SRAM read. The power consumed in a CAM search includes the power in the match lines and search lines, the sense amplifiers, and the valid bits. The new hardware prefetch table has the following benefits compared to the baseline design: the dynamic power consumption is dramatically reduced because of the partitioning into 16 smaller tables; the CAM cell power is also reduced because we use only the lower 16 bits of the PC instead of the whole 32 bits; and, since each table is very small (4 entries), we do not need a column sense amplifier, which further reduces the total power consumed. However, some overhead is also introduced by the new design. First, we need an address decoder to select one of the 16 tables. Second, the total leakage power increases (in a relative sense only), because while one of the smaller tables is active, the remaining 15 tables are leaking. Fortunately, these overheads are small compared to the savings, so the new PARE design still reduces power substantially overall.
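In symbols, the per-search power decomposition stated above can be written as:

P_{\text{search}} = P_{\text{decoder}} + P_{\text{CAM}} + P_{\text{SRAM read}}, \qquad
P_{\text{CAM}} = P_{\text{match lines}} + P_{\text{search lines}} + P_{\text{sense amps}} + P_{\text{valid bits}}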
Figure 5.4 shows the overall organization of one sub-bank of a CAM-tag cache. Each cache line in the sub-bank has a local tag that compares its contents with the broadcast search tag bits. Each CAM cell is a standard ten-transistor design laid out to be exactly twice the RAM cell area, at 4.32 µm × 9.12 µm = 39.4 µm², as shown in Figure 5.5. The cell contains an SRAM cell and a dynamic XOR gate used for comparison. The match line is precharged high and conditionally discharged on a mismatch. All match lines are ORed together to generate the hit signal. The search bitlines, match lines, and buffers that drive control signals across the tag array are the main consumers of energy in the CAM-tag cache.
To reduce the capacitance switched during a search operation, we separate the search bitlines from the write bitlines. To reduce the energy dissipated on the match lines, they are only precharged to Vdd − Vtn through n-type precharge transistors, and single-ended sense amplifiers are used to regenerate a full-rail match signal. As shown in Figure 5.6, we also break the entire row of tags into two equal partitions so that the worst-case delay of the match-line capacitance discharge is halved. As with RAM-tag designs, we break the cache into sub-banks using low-order address bits and only activate a search within the enabled sub-bank. We can further reduce the energy of
a CAM-tag design by enabling only a smaller number of rows within a sub-bank, effectively reducing the associativity. For example, the StrongARM design has 64 CAM rows (128 RAM rows) in each cache sub-bank but enables only one 32-way subset on each access. Figure 5 shows the layout of a complete 1 KB 32-way set-associative cache sub-bank. This consumes around 10% more area than a 1 KB RAM-tag sub-bank. CAM tags have the property of providing very high associativity within a single sub-bank without having to partition the RAM array. There would be significant area overhead for sense amplifiers and muxes if we were to try to implement a small, highly associative RAM-tag cache sub-bank.
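The sub-banking and row-subset selection described above can be sketched as an address decomposition; the field widths and their ordering below are illustrative assumptions, since they depend on the actual cache geometry.

```python
# Sketch of how low-order address bits select a CAM-tag sub-bank and a row
# subset, so that only part of the CAM is searched per access. All field
# widths here are illustrative assumptions, not taken from a real design.

SUBBANK_BITS = 2      # e.g., 4 sub-banks
SUBSET_BITS  = 1      # e.g., enable one of two 32-row subsets per sub-bank
OFFSET_BITS  = 5      # e.g., 32-byte cache lines

def cam_tag_decompose(addr: int):
    subset  = (addr >> OFFSET_BITS) & ((1 << SUBSET_BITS) - 1)
    subbank = (addr >> (OFFSET_BITS + SUBSET_BITS)) & ((1 << SUBBANK_BITS) - 1)
    tag     = addr >> (OFFSET_BITS + SUBSET_BITS + SUBBANK_BITS)
    return subbank, subset, tag   # 'tag' is searched only in the enabled rows

print(cam_tag_decompose(0x00012F40))
```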
Another advantage of CAM-tag designs is that they simplify the handling of cache stores. In a CAM-tag design, the data RAM word lines are enabled only on a cache hit, so stores can complete in a single cycle. A conventional RAM-tag design has to split the store access across two cycles: the first cycle checks the tag, and the second cycle writes the data storage on a hit. To allow full-speed writes, RAM-tag designs often include a write buffer ahead of the primary cache to avoid stalling on stores, adding additional complexity and energy overhead.
CONCLUSION
This paper explores the energy-efficiency aspects of hardware data-prefetching techniques and proposes several new techniques, including a power-aware prefetch engine (PARE), to make prefetching energy-aware. PARE reduces prefetching-related energy consumption by 7 to 11 times. In conjunction with a net leakage energy reduction due to the performance improvement, this may yield up to 12% less total energy consumption compared to a no-prefetching baseline. While the new techniques may give up a very small fraction of the performance benefit compared to a prefetching scheme with no energy-aware techniques, they still maintain a significant speedup over the no-prefetching baseline, thereby achieving the twin goals of energy efficiency and performance improvement.