
A SEMINAR REPORT ON DYNAMIC CACHE MANAGEMENT TECHNIQUE BY ELEKWA JOHN OKORIE ESUT/ 2007/88499 PRESENTED TO THE DEPARTMENT OF COMPUTER

ENGINEERING FACULTY OF ENGINEERING ENUGU STATE UNIVERSITY OF SCIENCE AND TECHNOLOGY (ESUT), ENUGU SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE AWARD OF A BACHELOR OF ENGINEERING (B.ENG) DEGREE IN COMPUTER ENGINEERING

SEPTEMBER, 2012

CERTIFICATION
I, ELEKWA JOHN OKORIE, with the registration number ESUT/2007/88499, in the Department of Computer Engineering, Enugu State University of Science and Technology, Enugu, certify that this seminar was done by me.

--------------------------------Signature

------------------------Date


APPROVAL PAGE
This is to certify that this seminar topic on dynamic cache management technique was approved and carried out by ELEKWA JOHN OKORIE, with Reg. No. ESUT/2007/88499, under strict supervision.

--------------------------------
ENGR. ARIYO IBIYEMI
(Seminar Supervisor)

--------------------
DATE

--------------------------------
MR. IKECHUKWU ONAH
(Head of Department)

--------------------
DATE


DEDICATION
This report is dedicated to God Almighty for his love, favour and protection all this time; to my parents, Mr. Ugonna Alex Okorie and Mrs. Monica Elekwa; and to all those who contributed to making sure that I keep moving ahead, from my wonderful lecturers to my great friends. To my family, for their love, care, prayers and support.


ACKNOWLEDGEMENT
Apart from my own efforts, the success of any seminar depends largely on the encouragement and guidance of many others. I take this opportunity to express my gratitude to the people who have been instrumental in the successful completion of this seminar. I would like to show my greatest appreciation to Dr. Mac Vain Ezedo and his wife; I can't say thank you enough for their tremendous support and help. God bless you all. I am grateful for the guidance and constant support received from all my friends: Uche Ugwoda, Peter Obaro, Matin Ozioko, Oge Raphael, Stone and others. Finally, my lecturers, under whose tutelage I was taught: a grateful God bless you; most especially my supervisor, ENGR. ARIYO IBIYEMI, and my amiable HOD, ENGR. IKECHUKWU ONAH.

TABLE OF CONTENTS

Title page
Certification
Approval Page
Dedication
Acknowledgement
Abstract
Table of Contents

CHAPTER ONE
1.0 Introduction
1.1 Power Trends for Current Microprocessors

CHAPTER TWO
2.0 Working with the L0-Cache
2.1 Pipeline Microarchitecture
2.2 Branch Prediction & Confidence Estimation: A Brief Overview

CHAPTER THREE
3.0 What is Dynamic Cache Management Technique
3.1 Basic Idea of the Dynamic Management Scheme
3.2 Dynamic Techniques for L0-Cache Management
3.3 Simple Method
3.4 Static Method
3.5 Dynamic Confidence Estimation Method
3.6 Restrictive Dynamic Confidence Estimation Method
3.7 Dynamic Distance Estimation Method
3.8 Comparison of Dynamic Techniques

CHAPTER FOUR
4.0 Conclusion

REFERENCES

LIST OF FIGURES

Fig 1: Memory Hierarchy
Fig 2: An Instruction Cache
Fig 3: Levels of Cache
Fig 4: Pipeline Microarchitecture (A)
Fig 5: Pipeline Microarchitecture (B)

CHAPTER ONE
1.0 INTRODUCTION
First of all, what is cache memory?

Cache memory is a fast memory used to hold the most recently accessed data. Cache is pronounced like the word cash. Cache memory is the level of the computer memory hierarchy situated between the processor and main memory. It is a very fast memory that the processor can access much more quickly than main memory (RAM). Cache is relatively small and expensive. Its function is to keep a copy of the data and code (instructions) currently used by the CPU. By using cache memory, wait states are significantly reduced and the work of the processor becomes more effective.

As processor performance continues to grow, and high-performance, wide-issue processors exploit the available instruction-level parallelism, the memory hierarchy must continuously supply instructions and data to the data path to keep the execution rate as high as possible. Very often, the memory hierarchy access latencies dominate the execution time of the program. The very high utilization of the instruction memory hierarchy entails high energy demands for the on-chip I-cache subsystem. In order to reduce the effective energy dissipation per instruction access, the addition of a small extra cache (the L0-cache) is proposed, which serves as the primary cache of the processor and is used to store the most frequently executed portions of the code and subsequently provide them to the pipeline. This approach seeks to manage the L0-cache in a manner that is sensitive to how frequently the executed instructions are accessed. It can exploit the temporal locality of the code and can make decisions on the fly, i.e., while the code executes.

In recent years, power dissipation has become one of the major design concerns for the microprocessor industry. The shrinking device size and the large number of devices packed in a chip die, coupled with high operating frequencies, have led to unacceptably high levels of power dissipation. The problem of wasted power caused by unnecessary activities in various parts of the CPU during code execution has traditionally been ignored in code optimization and architectural design. Higher frequencies and larger transistor counts more than offset the lower voltages and smaller devices, and they result in larger power consumption in the newest members of a processor family.

Figure 1: Memory Hierarchy

Cache is much faster than main memory because it is implemented using SRAM (Static Random Access Memory). The problem with DRAM, which comprises main memory, is that it stores information as charge on capacitors, which have to be refreshed constantly in order to preserve the stored information (because of leakage current). Whenever data is read from a cell, the cell is refreshed. DRAM cells need to be refreshed very frequently, typically every 4 to 16 ms, and this slows down the entire process. SRAM, on the other hand, consists of flip-flops, which stay in their state as long as the power supply is on. (A flip-flop is an electrical circuit composed of transistors and resistors.) Because of this, SRAM need not be refreshed and is over 10 times faster than DRAM. Flip-flops, however, are implemented using more complex circuitry, which makes SRAM much larger and more expensive, limiting its use.

Level one cache memory (called L1 Cache, for Level 1 Cache) is directly integrated into the processor. It is subdivided into two parts:

The first part is the instruction cache, which contains instructions from the RAM that were decoded as they came across the pipeline. The second part is the data cache, which contains data from the RAM and data recently used during processor operations.

Figure 2 - An instruction cache

Level 1 cache can be accessed very rapidly; its access time approaches that of the internal processor registers. Level two cache memory (called L2 Cache, for Level 2 Cache) is located in the same package as the processor. The level two cache is an intermediary between the processor, with its internal cache, and the RAM. It can be accessed more rapidly than the RAM, but less rapidly than the level one cache. Level three cache memory (called L3 Cache, for Level 3 Cache) is located on the motherboard.

1.1 POWER TRENDS FOR CURRENT MICROPROCESSORS

Processor                 Freq (MHz)   Power (W)
DEC 21164                 433          32.5
DEC 21164 (high freq.)    600          45
Pentium Pro               200          28.1
Pentium II                300          41.4

Very often, memory hierarchy access latencies dominate the execution time of the program, and the very high utilization of the instruction memory hierarchy places high energy demands on the on-chip I-cache subsystem. As the power figures above suggest, reducing the effective energy dissipation per instruction access is worthwhile; the proposed extra cache serves as the primary cache of the processor and stores the most frequently executed portions of the code.

CHAPTER TWO
2.0 WORKING WITH THE L0-CACHE

Figure 3: Levels of Cache

Some dynamic techniques are used to manage the L0-cache. The problem that the dynamic techniques seek to solve is how to select the basic blocks to be stored in the L0-cache while the program is being executed. If a block is selected, the CPU will access the L0-cache first; otherwise, it will go directly to the I-cache and bypass the L0-cache. In the case of an L0-cache miss, the CPU is directed to fetch the instructions from the I-cache and to transfer them from the I-cache to the L0-cache. A minimal sketch of this fetch path is given below.
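The following self-contained C sketch illustrates this fetch path. The toy direct-mapped caches, their sizes, and the memory_fetch stand-in are illustrative assumptions rather than details from the report; use_l0 represents the selection decision made by the dynamic techniques of Chapter Three.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy direct-mapped caches; the sizes are illustrative assumptions. */
#define L0_SETS 16
#define IC_SETS 256

typedef struct { uint32_t tag; uint32_t insn; bool valid; } Line;
static Line l0[L0_SETS], icache[IC_SETS];

/* Stand-in for a main-memory access. */
static uint32_t memory_fetch(uint32_t pc) { return pc ^ 0xdeadbeefu; }

/* Fetch one instruction. If the current basic block was selected (use_l0),
 * the CPU accesses the L0-cache first; otherwise it bypasses the L0-cache
 * and goes directly to the I-cache. On an L0-cache miss, the instruction is
 * fetched from the I-cache and also transferred into the L0-cache. */
uint32_t fetch(uint32_t pc, bool use_l0) {
    Line *s0 = &l0[pc % L0_SETS], *s1 = &icache[pc % IC_SETS];
    if (use_l0 && s0->valid && s0->tag == pc)
        return s0->insn;                         /* L0 hit: cheapest access */
    if (!(s1->valid && s1->tag == pc))           /* I-cache miss: go lower  */
        *s1 = (Line){ pc, memory_fetch(pc), true };
    if (use_l0)                                  /* refill the L0-cache     */
        *s0 = (Line){ pc, s1->insn, true };
    return s1->insn;
}

int main(void) {
    printf("0x%08x\n", (unsigned)fetch(0x100, true)); /* L0 miss, fills L0 */
    printf("0x%08x\n", (unsigned)fetch(0x100, true)); /* L0 hit            */
    return 0;
}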

2.1 PIPELINE MICROARCHITECTURE

Figure 4: Pipeline Microarchitecture (A)

Figure 5: Pipeline Microarchitecture (B)

Figure 4 shows the processor pipeline modelled in this research. The pipeline is typical of embedded processors such as StrongARM. There are five stages in the pipeline: fetch, decode, execute, mem and writeback. There is no external branch predictor; all branches are predicted untaken, and there is a two-cycle delay for taken branches. Instructions can be delivered to the pipeline from one of three sources: the line buffer, the I-cache and the DFC (decoded filter cache). There are three ways to determine where to fetch instructions from:
Serial: the sources are accessed one by one in a fixed order;
Parallel: all the sources are accessed in parallel;
Predictive: the access order can be serial with a flexible order, or parallel, based on a prediction.
Serial access results in minimal power because the most power-efficient source is always accessed first, but it also results in the highest performance degradation, because every miss in the first accessed source generates a bubble in the pipeline. On the other hand, parallel access causes no performance degradation, but the I-cache is always accessed and there are no power savings in instruction fetch. Predictive access, if accurate, can have both the power efficiency of serial access and the low performance degradation of parallel access; therefore, it is adopted in this fetch approach. As shown in Figure 4, a predictor decides which source to access first based on the current fetch address. Another functionality of the predictor is pipeline gating: suppose a DFC hit is predicted for the next fetch at cycle N; the fetch stage is disabled at cycle N and the decoded instruction is sent from the DFC to latch 5. Then, at cycle N+1, the decode stage is disabled and the decoded instruction is sent from latch 5 to latch 2. If an instruction is fetched from the I-cache, the hit cache line is also sent to the line buffer, which can then provide instructions for subsequent fetches to the same line. The source-selection policies are sketched below.
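The following C sketch contrasts the serial and predictive policies. The hits() stub (in which only the I-cache ever hits) and the probe counter are illustrative assumptions; the probe count stands in for the energy spent per fetch, and a parallel fetch would simply probe all NUM_SRC sources every cycle.

#include <stdbool.h>
#include <stdio.h>

/* The three instruction sources, cheapest first (DFC = decoded filter cache). */
typedef enum { LINE_BUFFER, DFC, ICACHE, NUM_SRC } Source;
static const char *name[NUM_SRC] = { "line buffer", "DFC", "I-cache" };

/* Stand-in for a real hit check; pretend only the I-cache hits. */
static bool hits(Source s, unsigned pc) { (void)pc; return s == ICACHE; }

/* Serial: probe the sources one by one in a fixed, cheapest-first order.
 * Minimal power, but each miss in an earlier source costs a pipeline bubble. */
static Source fetch_serial(unsigned pc, int *probes) {
    for (Source s = LINE_BUFFER; s < NUM_SRC; s++) {
        (*probes)++;
        if (hits(s, pc)) return s;
    }
    return ICACHE; /* not reached with this stub */
}

/* Predictive: probe the predicted source first and fall back to the others
 * only on a wrong prediction, combining low power with few bubbles. */
static Source fetch_predictive(unsigned pc, Source predicted, int *probes) {
    (*probes)++;
    if (hits(predicted, pc)) return predicted;
    for (Source s = LINE_BUFFER; s < NUM_SRC; s++) {
        if (s == predicted) continue;
        (*probes)++;
        if (hits(s, pc)) return s;
    }
    return ICACHE;
}

int main(void) {
    int p1 = 0, p2 = 0;
    Source a = fetch_serial(0x40, &p1);
    printf("serial:     hit in %s after %d probes\n", name[a], p1);
    Source b = fetch_predictive(0x40, ICACHE, &p2);
    printf("predictive: hit in %s after %d probes\n", name[b], p2);
    return 0;
}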


2.2 BRANCH PREDICTION & CONFIDENCE ESTIMATION: A BRIEF OVERVIEW
2.2.1 Branch Prediction
Branch prediction is an important technique for increasing parallelism in the CPU: the outcome of a conditional branch instruction is predicted as soon as it is decoded. Successful branch prediction mechanisms take advantage of the non-random nature of branch behavior. Most branches are either mostly taken or mostly not taken in the course of program execution. The commonly used branch predictors are:

1. Bimodal branch predictor

The bimodal branch predictor uses a counter to determine the prediction. Each time a branch is taken, the counter is incremented by one, and each time it falls through, it is decremented by one. The prediction is made by looking at the value of the counter: if it is less than a threshold value, the branch is predicted as not taken; otherwise, it is predicted as taken. A sketch with two-bit saturating counters is given below.
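The following is a minimal C sketch of a bimodal predictor; the table size, counter width, and threshold are illustrative assumptions, not values from the report.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 1024
#define MAX_COUNT  3      /* two-bit saturating counter: 0..3 */
#define THRESHOLD  2      /* counter >= THRESHOLD predicts taken */

static uint8_t counters[TABLE_SIZE]; /* all start at 0 (strongly not taken) */

/* Prediction is made by looking at the counter indexed by the branch address. */
bool predict(uint32_t pc) {
    return counters[pc % TABLE_SIZE] >= THRESHOLD;
}

/* Called when the branch resolves: count up on taken, down on fall-through. */
void update(uint32_t pc, bool taken) {
    uint8_t *c = &counters[pc % TABLE_SIZE];
    if (taken && *c < MAX_COUNT) (*c)++;
    if (!taken && *c > 0)        (*c)--;
}

int main(void) {
    uint32_t pc = 0x4000;
    for (int i = 0; i < 4; i++) update(pc, true); /* branch trains as taken */
    printf("prediction: %s\n", predict(pc) ? "taken" : "not taken");
    return 0;
}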


Figure 6

2. Global branch predictor

The global branch predictor considers the past behavior of the current branch, as well as that of other branches, to predict the behavior of the current branch.

2.2.2 Confidence Estimation
The relatively new concept of confidence estimation has been introduced to keep track of the quality of branch prediction. Confidence estimators are hardware mechanisms that are accessed in parallel with the branch predictors when a branch is decoded, and they are modified when the branch is resolved. They characterize a branch prediction as high confidence or low confidence, depending upon the past accuracy of the branch predictor for that particular branch. If the branch predictor has predicted a branch correctly most of the time, the confidence estimator designates the prediction as high confidence; otherwise, as low confidence.
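One common hardware realization is a table of resetting counters that count consecutive correct predictions and are cleared on a misprediction; this particular design, and the sizes and threshold below, are assumptions for illustration rather than details from the report.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CE_SIZE   1024
#define CE_MAX    15   /* four-bit saturating counter */
#define CE_THRESH 8    /* counter >= CE_THRESH: high confidence */

static uint8_t ce[CE_SIZE];

/* Accessed in parallel with the branch predictor when the branch is decoded. */
bool high_confidence(uint32_t pc) {
    return ce[pc % CE_SIZE] >= CE_THRESH;
}

/* Modified when the branch is resolved. */
void ce_update(uint32_t pc, bool predicted_correctly) {
    uint8_t *c = &ce[pc % CE_SIZE];
    if (predicted_correctly) { if (*c < CE_MAX) (*c)++; }
    else                     *c = 0;   /* reset on a misprediction */
}

int main(void) {
    uint32_t pc = 0x8000;
    for (int i = 0; i < 10; i++) ce_update(pc, true);
    printf("%s confidence\n", high_confidence(pc) ? "high" : "low");
    return 0;
}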


CHAPTER THREE
3.0 WHAT IS DYNAMIC CACHE MANAGEMENT TECHNIQUE?

The memory hierarchy of a high-performance processor is one of the major consumers of chip energy and, extrapolating from current trends, this portion is likely to increase in the near future. The mechanism that provides the instruction stream to the pipeline accounts for a large fraction of a chip's transistors and is very heavily utilized. A dynamic cache management technique is, in essence, a run-time strategy for managing the cache memory: it seeks to reduce the energy cost of this high utilization by deciding, while the program executes, which level of the cache hierarchy should be accessed.

3.1 BASIC IDEA OF THE DYNAMIC MANAGEMENT SCHEME
The dynamic scheme for the L0-cache should be able to select the most frequently executed basic blocks for placement in the L0-cache. It should also rely on existing mechanisms, without much hardware investment, if it is to be attractive for energy reduction. Branch prediction, in conjunction with confidence estimation, is a reliable solution to this problem.


Unusual Behavior of the Branches
A branch that was predicted taken with high confidence will be expected to be taken during program execution. If it is not taken, it is assumed to be behaving unusually. The basic idea is that, if a branch behaves unusually, the dynamic scheme disables L0-cache access for the subsequent basic blocks. Under this scheme, only basic blocks that are executed frequently tend to make it into the L0-cache, hence avoiding cache pollution in the L0-cache.

3.2 DYNAMIC TECHNIQUES FOR L0-CACHE MANAGEMENT
The dynamic techniques discussed in the subsequent sections select the basic blocks to be placed in the L0-cache. There are five techniques for the management of the L0-cache:
1. Simple Method.
2. Static Method.
3. Dynamic Confidence Estimation Method.
4. Restrictive Dynamic Confidence Estimation Method.
5. Dynamic Distance Estimation Method.
The different dynamic techniques trade off energy reduction against performance degradation.

3.3 SIMPLE METHOD
The confidence estimation mechanism is not used in the simple method. The branch predictor can be used as a stand-alone mechanism to provide insight into which portions of the code are frequently executed and which are not. A mispredicted branch is assumed to drive the thread of execution to an infrequently executed part of the program. The strategy used for selecting the basic blocks is as follows: if a branch is mispredicted, the machine accesses the I-cache to fetch the instructions; if a branch is predicted correctly, the machine accesses the L0-cache. On a misprediction, the pipeline is flushed and the machine starts fetching instructions from the correct address by accessing the I-cache. The energy dissipation and the execution time of the original configuration that uses no L0-cache are taken as unity, and everything is normalized with respect to them. A sketch of this selection rule is given below.
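A minimal C sketch of this rule; use_l0 is the flag passed to the fetch() routine sketched in Chapter Two, and simple_method() is a hypothetical helper name, not one from the source.

#include <stdbool.h>

/* Simple method: no confidence estimator. The L0-cache is used while
 * branches are predicted correctly; a misprediction (which also flushes
 * the pipeline) redirects fetch to the I-cache for the following blocks. */
static bool use_l0 = true;

void simple_method(bool mispredicted) {
    use_l0 = !mispredicted;
}

Each resolved branch calls simple_method(), and use_l0 then gates the next fetches.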

3.4 STATIC METHOD
The selection criterion adopted for the basic blocks is as follows: if a high-confidence branch was predicted incorrectly, the I-cache is accessed for the subsequent basic blocks; and if more than n low-confidence branches have been decoded in a row, the I-cache is accessed. The L0-cache is thus bypassed when either of the two conditions is satisfied; in any other case the machine accesses the L0-cache. The first rule reflects the fact that a mispredicted high-confidence branch behaves unusually and drives the program to an infrequently executed portion of the code. The second rule reflects the fact that a series of low-confidence branches suffers from the same problem, since the probability that they are all predicted correctly is low. A sketch of this rule is given after this paragraph.
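A C sketch of the static rule; the threshold n (N_LOW below) is a design parameter whose value here is only illustrative. The same rule, with high_conf supplied at run time by the confidence estimator sketched in Chapter Two rather than by a fixed assignment, also implements the dynamic confidence estimation method of Section 3.5.

#include <stdbool.h>

#define N_LOW 2   /* illustrative threshold for successive low-confidence branches */

static bool use_l0  = true;
static int  low_run = 0;  /* low-confidence branches decoded in a row */

/* Called for each branch once its confidence and outcome are known. */
void static_method(bool high_conf, bool mispredicted) {
    low_run = high_conf ? 0 : low_run + 1;
    if ((high_conf && mispredicted) || low_run > N_LOW)
        use_l0 = false;   /* bypass the L0-cache for the subsequent blocks */
    else
        use_l0 = true;    /* in any other case, access the L0-cache */
}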

3.5 DYNAMIC CONFIDENCE ESTIMATION METHOD
The dynamic confidence estimation method is a dynamic version of the static method: the confidence of each branch is tracked at run time by the hardware confidence estimator. The I-cache is accessed if a high-confidence branch is mispredicted, or if more than n successive low-confidence branches are encountered. The dynamic confidence estimation mechanism is slightly better in terms of energy reduction than the simple or static methods. Since the confidence estimator can adapt dynamically to the temporal behavior of the code, it is more accurate in characterizing a branch and, hence, in regulating access to the L0-cache.

3.6 RESTRICTIVE DYNAMIC CONFIDENCE ESTIMATION METHOD
The methods described in the previous sections tend to place a large number of basic blocks in the L0-cache, thus degrading performance. The restrictive dynamic scheme is more selective, so that only the really important basic blocks are placed in the L0-cache. The selection mechanism is slightly modified: the L0-cache is accessed only if a high-confidence branch is predicted correctly; the I-cache is accessed in any other case. This method selects some of the most frequently executed basic blocks, yet it misses some others. It has much lower performance degradation, at the expense of lower energy reduction; it is probably preferable in a system where performance is more important than energy. A sketch follows.
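A C sketch of the restrictive rule, under the same conventions (and the same hypothetical helper naming) as the previous sketches.

#include <stdbool.h>

static bool use_l0 = false;

/* Restrictive method: access the L0-cache only for the blocks after a
 * correctly predicted high-confidence branch; otherwise use the I-cache. */
void restrictive_method(bool high_conf, bool mispredicted) {
    use_l0 = high_conf && !mispredicted;
}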

3.7 DYNAMIC DISTANCE ESTIMATION METHOD
The dynamic distance estimation method is based on the fact that a mispredicted branch tends to trigger a series of successive mispredicted branches. The method works as follows: all branches up to a given distance after a mispredicted branch are tagged as low confidence, and all others as high confidence. The basic blocks after a low-confidence branch are fetched from the I-cache, so the net effect is that a branch misprediction causes a series of fetches from the I-cache. A counter is used to measure the distance of a branch from the previous mispredicted branch. This scheme is even more selective in storing instructions in the L0-cache than the restrictive dynamic confidence estimation method. A sketch is given below.
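A C sketch of the distance rule; the distance threshold DIST_N is an illustrative design parameter, and distance_method() is a hypothetical helper name.

#include <stdbool.h>

#define DIST_N 4   /* branches within this distance of a mispredict are low confidence */

static bool use_l0 = true;
static int  dist   = DIST_N;  /* distance from the previous mispredicted branch */

/* Called as each branch resolves: a misprediction resets the counter, so the
 * next DIST_N branches are treated as low confidence and the blocks after
 * them are fetched from the I-cache. */
void distance_method(bool mispredicted) {
    if (mispredicted) dist = 0;
    else if (dist < DIST_N) dist++;   /* saturate: only the distance matters */
    use_l0 = (dist >= DIST_N);
}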

3.8 COMPARISON OF DYNAMIC TECHNIQUES
The energy reduction and the delay increase are a function of the algorithm used to regulate L0-cache access, the size of the L0-cache, its block size and its associativity. For example, a larger block size yields a higher hit ratio in the L0-cache, which results in a smaller performance overhead and greater energy efficiency, since the I-cache does not need to be accessed as often. On the other hand, if the block size increase does not have a large impact on the hit ratio, the energy dissipation may go up, since a cache with a larger block size is less energy efficient than a cache of the same size with a smaller block size. The static method and the dynamic confidence estimation method make the assumption that the less frequently executed basic blocks usually follow less predictable branches, which are mispredicted. The simple method and the restrictive dynamic confidence estimation method address the problem from another angle: they assume that the most frequently executed basic blocks usually follow highly predictable branches. The dynamic distance estimation method is the most successful in reducing the performance overhead, but the least successful in energy reduction; it imposes a stricter requirement for a basic block to be selected for the L0-cache than the original dynamic confidence estimation method. A larger block size and higher associativity have a beneficial effect on both energy and performance; the hit rate of a small cache is more sensitive to variations in block size and associativity.


CHAPTER FOUR
4.0 CONCLUSION
This report presents methods, proposed by Bellas, Hajj and Polychronopoulos [7], for the dynamic selection of basic blocks for placement in the L0-cache. It explains the functionality of the branch prediction and confidence estimation mechanisms in high-performance processors. Finally, five different dynamic techniques were discussed for the selection of the basic blocks. These techniques try to capture the execution profile of the basic blocks by using the branch statistics gathered by the branch predictor. The experimental evaluation demonstrates the applicability of the dynamic techniques for the management of the L0-cache. Different techniques can trade off energy with delay by regulating the way the L0-cache is accessed.


REFERENCES
[1] J. Diguet, S. Wuytack, F. Catthoor, and H. De Man, "Formalized methodology for data reuse exploration in hierarchical memory mappings," in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 30-35, Aug. 1997.
[2] J. Kin, M. Gupta, and W. Mangione-Smith, "The filter cache: An energy efficient memory structure," in Proceedings of the International Symposium on Microarchitecture, pp. 184-193, Dec. 1997.
[3] N. Bellas, I. Hajj, C. Polychronopoulos, and G. Stamoulis, "Architectural and compiler support for energy reduction in the memory hierarchy of high performance microprocessors," in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 70-75, Aug. 1998.
[4] S. Manne, D. Grunwald, and A. Klauser, "Pipeline gating: Speculation control for energy reduction," in Proceedings of the International Symposium on Computer Architecture, pp. 132-141, 1998.
[5] SpeedShop User's Guide. Silicon Graphics, Inc., 1996.
[6] S. Wilson and N. Jouppi, "An enhanced access and cycle time model for on-chip caches," Tech. Rep. DEC WRL 93/5, July 1994.
[7] N. Bellas, I. Hajj, and C. Polychronopoulos, "Using dynamic cache management techniques to reduce energy in a high-performance processor," Department of Electrical & Computer Engineering and the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, 1308 West Main Street, Urbana, IL 61801, USA.