
CSCE 614: HW4 REPORT (Implementation and Evaluation of the SRRIP Policy over LRU and LFU)


Rohan Seth - 927001590
CEEN
Texas A&M University

Abstract—The commonly used LRU replacement policy always predicts a near-immediate re-reference interval on cache hits and misses. Applications that exhibit either a distant re-reference interval or a near re-reference interval perform badly under LRU. Such applications usually have a working set larger than the cache or have frequent bursts of references to non-temporal data (called scans). To improve the performance of such workloads, we emulate and evaluate cache replacement using Re-reference Interval Prediction (RRIP) [2]. We give a quantitative measure of the improvement of SRRIP over LRU and LFU, with detailed analysis over benchmarks.

I. INTRODUCTION

In the LEAST RECENTLY USED (LRU) replacement policy, the LRU chain represents the recency of the referenced cache blocks, with the MRU position representing the cache block that was most recently used and the LRU position representing the cache block that was least recently used.

In the LEAST FREQUENTLY USED (LFU) replacement policy, a per-block counter represents the frequency with which cache blocks are referenced. When the cache is full and requires more room, the system purges the item with the lowest reference frequency.

RE-REFERENCE INTERVAL PREDICTION (RRIP) [2] uses M bits per cache block to store one of 2^M possible Re-reference Prediction Values (RRPV). RRIP dynamically learns re-reference information for each block in the cache access pattern. Like NRU, an RRPV of zero implies that a cache block is predicted to be re-referenced in the near-immediate future, while a saturated RRPV (i.e., 2^M - 1) implies that a cache block is predicted to be re-referenced in the distant future. Since the re-reference predictions made by RRIP are statically determined on cache hits and misses, we refer to this replacement policy as STATIC RE-REFERENCE INTERVAL PREDICTION (SRRIP).

With only one bit of information, LRU or LFU can predict either a near-immediate re-reference interval or a distant re-reference interval for all blocks filled into the cache. Always predicting a near-immediate re-reference interval on all cache insertions limits cache performance for mixed access patterns, because scan blocks unnecessarily occupy cache space without receiving any cache hits. On the other hand, always predicting a distant re-reference interval significantly degrades cache performance for access patterns that predominantly have a near-immediate re-reference interval. Consequently, without any external information on the re-reference interval of every missing cache block, LRU and LFU cannot identify and preserve non-scan blocks in a mixed access pattern. Scan resistance using RRIP, meanwhile, requires the RRPV register to be wide enough to avoid introducing its own sources of performance degradation.

II. SRRIP TECHNIQUE

A. Short description of SRRIP technique

The primary goal of RRIP is to prevent blocks with a distant re-reference interval from polluting the cache. In the absence of any external re-reference information, RRIP statically predicts the block's re-reference interval. Since always predicting a near-immediate or a distant re-reference interval at cache insertion time is not robust across all access patterns, RRIP always inserts new blocks with a long re-reference interval. A long re-reference interval is defined as an intermediate re-reference interval that is skewed towards a distant re-reference interval. We use an RRPV of 2^M - 2 to represent a long re-reference interval. The intuition behind always predicting a long re-reference interval on cache insertion is to prevent cache blocks with re-references in the distant future from polluting the cache. Additionally, always predicting a long re-reference interval instead of a distant re-reference interval allows RRIP more time to learn and improve the re-reference prediction. If the newly inserted cache block has a near-immediate re-reference interval, RRIP can then update the re-reference prediction to be shorter than the previous prediction. In effect, RRIP learns the block's re-reference interval.

B. Implementation details

On a cache miss, the RRIP victim selection policy selects the victim block by finding the first block that is predicted to be re-referenced in the distant future (i.e., the block whose RRPV is 2^M - 1). Like NRU, the victim selection policy breaks ties by always starting the victim search from a fixed location (the left in our studies). In the event that RRIP is unable to find a block with a distant re-reference interval, RRIP updates the re-reference predictions by incrementing the RRPVs of all blocks in the cache set and repeats the search until a block with a distant re-reference interval is found. Updating RRPVs at victim selection time allows RRIP to adapt to changes in the application working set by removing stale blocks from the cache. A natural opportunity to change the re-reference prediction of a block occurs on a hit to the block.
The algorithm for this update of the RRPV register is called the RRIP hit promotion policy. The primary purpose of the hit promotion policy is to dynamically improve the accuracy of the predicted re-reference interval of cache blocks. We propose two policies to update the re-reference prediction: Hit Priority (HP) and Frequency Priority (FP). The RRIP-HP policy predicts that the block receiving a hit will be re-referenced in the near-immediate future and updates the RRPV of the associated block to zero.

Fig. 1. An example of the algorithm of SRRIP (2 bit)
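To make the 2-bit example of Fig. 1 concrete, below is a minimal single-set sketch of SRRIP-HP. It is our own illustration under the definitions above; the names SrripSet, onHit, victim and insert are hypothetical and are not the zsim hooks shown later.

#include <cstdint>
#include <vector>

// Minimal single-set SRRIP-HP sketch for M = 2 (RRPVs range over 0..3).
class SrripSet {
    static const uint32_t rpvMax = 3;   // 2^M - 1: distant re-reference
    std::vector<uint32_t> rrpv;         // one RRPV per way
public:
    explicit SrripSet(uint32_t ways) : rrpv(ways, rpvMax) {}

    void onHit(uint32_t way) { rrpv[way] = 0; }  // HP: promote to near-immediate

    uint32_t victim() {                 // leftmost way predicted distant
        while (true) {
            for (uint32_t w = 0; w < rrpv.size(); w++)
                if (rrpv[w] == rpvMax) return w;
            for (uint32_t w = 0; w < rrpv.size(); w++)
                rrpv[w]++;              // no distant block: age all and retry
        }
    }

    void insert(uint32_t way) { rrpv[way] = rpvMax - 1; }  // long interval: 2^M - 2
};

On a miss, victim() runs first, the chosen way is refilled, and insert() marks it with a long re-reference interval; a later hit promotes it to zero. These are exactly the transitions that Fig. 1 steps through.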

III. METHODOLOGY

We use zsim [1], a full-featured memory-system simulator, to conduct our performance studies. Our baseline processor is a single-core Westmere system (or an 8-core processor for the multi-threaded workloads) with a 64-bit word length and a three-level cache hierarchy. The L1 instruction cache is 4-way set-associative 32KB and the L1 data cache is 8-way set-associative 32KB; the L2 cache is 256KB 8-way set-associative and the L3 is 2MB 16-way set-associative. Only demand references to the cache update the replacement state, while non-demand references (e.g., writeback references) leave it unchanged. The load-to-use latencies for the L1, L2, and L3 caches are 1, 10, and 24 cycles respectively. Cache access happens as described in Fig. 2.

Fig. 2. An example of the algorithm of SRRIP (2 bit)

// Cache access implementation
Look Up Cache (array->lookup) {
    if the line is found {
        update repl_policy and return the line ID;
    } else {
        return -1;
    }
}

And this is how a miss is handled inside the zsim implementation:

// Cache miss handling (including block replacement / writeback)
{
    array->preinsert();    // consult repl_policy to find a victim
    cc->processEviction(); // write back if needed
    array->postinsert();   // finish and update the replacement state
}

Fig. 3. Internal breakup of array access functions

Thus we implement the update(), replaced() and rank() functions in our SRRIP implementation.

void update(uint32_t id, const MemReq* req) {
    if(!miss)          // internal variable differentiating hits from misses;
    {                  // its default value is 0, so this branch runs on a hit
        array[id] = 0; // SRRIP-HP: set the RRPV to zero on a hit
    }
    miss = 0;          // reset for future iterations
}

Underneath is the code snippet of the replaced() function.

void replaced(uint32_t id) {
    array[id] = rpvMax - 1; // insert with a long re-reference interval
                            // (RRPV = 2^M - 2), not the distant maximum
    miss = 1;               // tell update() this entry arrived via a miss
}
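The only tuning knob in these snippets is the RRPV width M. A minimal sketch of how the shared state (array, miss, rpvMax) could be initialized is shown below; the struct and constructor are illustrative only, with SRRIP(2) and SRRIP(3) in the evaluation corresponding to M = 2 and M = 3.

#include <cstdint>
#include <vector>

// Illustrative only: state used by update()/replaced()/rank().
struct SrripState {
    std::vector<uint32_t> array;  // one RRPV per cache line
    uint32_t rpvMax;              // distant re-reference: 2^M - 1
    bool miss;                    // hit/miss flag shared by the hooks

    SrripState(uint32_t numLines, uint32_t M)
        : array(numLines, (1u << M) - 1), // start lines as victim candidates
          rpvMax((1u << M) - 1),
          miss(false) {}
};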


Underneath is the code snippet of the rank() function, which identifies the victim for replacement.

template <typename C> inline uint32_t
rank(const MemReq* req, C cands) {
    uint32_t bestCand = 0;
    uint32_t flag = 0;   // set once a distant candidate is found
    uint32_t flag2 = 0;  // set once an aging pass saturates some RRPV
    a: for (auto ci = cands.begin(); ci != cands.end(); ci.inc()) {
        uint32_t s = array[*ci];
        if(s == rpvMax && flag != 1) // identify the first candidate,
        {                            // traversed from the left, whose
            bestCand = *ci;          // score marks it distant
            flag = 1;
        }
    }
    if(flag == 0 && flag2 == 0)
    {
        b: for (auto ci = cands.begin(); ci != cands.end(); ci.inc())
        {
            array[*ci] = array[*ci] + 1; // age every candidate
            uint32_t s = array[*ci];
            if(s == rpvMax)  // a score has reached the replacement limit
            {
                flag2 = 1;   // no further aging pass is required
            }
        }
        if(flag2 == 0)       // no candidate saturated yet: age again
        {
            goto b;
        }
        else if (flag2 == 1) // aging done: rerun the search to finally
        {                    // select the best candidate
            goto a;
        }
    }
    if(flag == 1)
    {
        return bestCand;
    }
    else
    {
        return 0; // never reached, but included for syntactic completeness
    }
}
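Putting the three hooks together, the comment trace below shows how they are driven on a hit and on a miss. The hook-to-call mapping follows the snippets above; the exact sequencing is our simplified reading, not an authoritative description of zsim internals.

// Hit:  array->lookup() finds the line and calls update(id, req);
//       miss == 0, so array[id] = 0 (HP promotion), and miss stays 0.
//
// Miss: array->preinsert()  -> rank(req, cands)  picks the victim,
//                              aging RRPVs until one reaches rpvMax;
//       cc->processEviction()                    writes back if dirty;
//       array->postinsert() -> replaced(id)      sets array[id] = rpvMax - 1
//                                                and miss = 1;
//       the subsequent update(id, req) sees miss == 1, skips the
//       promotion, and just clears the flag for the next access.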
IV. EVALUATION

We conduct the evaluation based on the following parameters, amongst LRU, LFU, SRRIP(2) and SRRIP(3).

• Number of Cycles

    total cycles = cycles + cCycles    (1)

  – Here we observe that, across the benchmarks overall, the SRRIP policy performs better than LRU and LFU in terms of the total number of cycles used (Fig. 4 and Fig. 5).
  – Overall, taking SPEC and PARSEC into consideration together, SRRIP demonstrates 2.1% fewer cycles than LRU, vs. 0.84% for LFU.
  – Moreover, this increases only insignificantly with increasing M, reaching 2.2% for SRRIP with M=3.
  – The percentage improvement is more significant in PARSEC (5.57%) than in SPEC (-0.17%).
  – Individually, SPEC shows an inverse trend, with a decrement of 0.17% with SRRIP(2) vs. -0.4% with LFU. Thus in SPEC both LFU and SRRIP take more cycles than LRU, with SRRIP the slightly better of the two. This value does start to show improvement, reaching 0.25% with M=3.
  – Individually, PARSEC shows a 5.57% improvement over LRU with SRRIP(2), vs. 2.75% with LFU. This value reduces to 5.21% with M=3.
  – With the exception of hmmer in INT, and x264 and BODYTRACK in PARSEC, SRRIP seems to always perform better than the other two.
  – Floating-point benchmarks do not exhibit any improvement over LFU. On average, SRRIP shows a 0.12% improvement over LRU, vs. 0.14% with LFU. This value improves to 0.19% with M=3, but the increment is fairly insignificant.
  – Integer benchmarks exhibit an opposite trend. On average (GM), SRRIP shows a 0.47% decrement over LRU, vs. a 0.94% decrement with LFU.
  – Amongst SPEC, INT and FLOAT both show insignificant changes with SRRIP.

Fig. 4. Number of cycles comparison across benchmarks

Fig. 5. Percentage fewer cycles comparison over LRU

• IPC or Speedup/Performance

    IPC = total instructions / total cycles    (2)

  – Here we observe that, across all benchmarks overall, the SRRIP policy performs better than LRU and LFU in terms of IPC (Fig. 6 and Fig. 7).
  – Overall, taking SPEC and PARSEC into consideration together, SRRIP demonstrates 2.16% higher IPC than LRU.
  – Moreover, this increases only insignificantly with increasing M, reaching 2.27% for SRRIP with M=3.
  – The percentage improvement is more significant in PARSEC (5.8%) than in SPEC.
  – Individually, SPEC shows an inverse trend, with a decrement of 0.16% with SRRIP(2) vs. -0.4% with LFU. Thus in SPEC both LFU and SRRIP have lower IPC than LRU, with SRRIP the slightly better of the two. This value does start to show improvement, reaching 0.21% with M=3.
  – Individually, PARSEC shows a 5.89% improvement over LRU with SRRIP(2), vs. 2.83% with LFU. This value reduces to 5.5% with M=3.
  – With the exception of hmmer in INT, and x264 and BODYTRACK in PARSEC, SRRIP seems to always perform better than the other two.
  – Amongst SPEC, INT and FLOAT both show insignificant changes with SRRIP.

Fig. 6. IPC comparison over benchmarks

Fig. 7. Percentage higher IPC than LRU
• MPKI

    total misses = mGETS + mGETXIM + mGETXSM    (3)

    MPKI = (total misses / total instructions) * 1000    (4)

  (A small post-processing sketch for these metrics follows the list below.)

  – Here we observe that, across the benchmarks overall, the SRRIP policy performs better than LRU and LFU in terms of MPKI (Fig. 8 and Fig. 9).
  – Overall, taking SPEC and PARSEC into consideration together, SRRIP demonstrates 8% lower MPKI than LRU, vs. 4% for LFU.
  – Moreover, this increases with increasing M, reaching 8.8% for SRRIP with M=3.
  – The percentage improvement is more significant in PARSEC (17.17%) than in SPEC (1.6%).
  – Individually, SPEC shows an improvement of 1.6% with SRRIP(2), vs. 0.8% with LFU. This value improves to 2.1% with M=3.
  – Individually, PARSEC shows 17.17% with SRRIP(2), vs. 8.9% with LFU. This value improves to 18.33% with M=3.
  – Floating-point benchmarks exhibit an opposite trend, with MPKI generally increasing slightly, CactusADM and SOPLEX being exceptions. On average, SRRIP shows a 0.6% improvement, vs. 0.3% with LFU.
  – Integer benchmarks exhibit an increasing trend, though not as high as PARSEC. On average (GM), SRRIP shows a 2.5% improvement, vs. 1.22% with LFU.

Fig. 8. MPKI comparison across benchmarks

Fig. 9. Percentage lower MPKI comparison over LRU
Analysis: Here we observe that the percentage improvement of the SRRIP policy is very minimal for the SPEC benchmarks, whereas it improves by a considerable amount for PARSEC. The SRRIP policy is in general a broad hybrid of LRU and MRU behavior, and proves significant when the working set of the data structure being operated on is larger than the cache. With the PARSEC benchmarks this seems to be the case, hence the greater improvement, while it is not so for the integer and floating-point benchmarks. Suppose a frequently referenced data structure of k entries is followed by updates to a different data structure of m entries. For such access patterns, when m + k is less than the available cache, the total working set fits into the cache and LRU works well. However, when m + k is greater than the available cache, LRU discards the frequently referenced working set from the cache. Consequently, accesses to the frequently referenced working set always miss after the scan. In the absence of scans, mixed access patterns prefer LRU. However, in the presence of scans, the optimal policy preserves the active working set in the cache after the scan completes. This is where SRRIP shows its dominance over LRU/LFU; a toy simulation of this pattern follows below. Highlighting the outliers: the x264 benchmark in PARSEC seems to follow a recency-friendly access pattern, performing best under LRU, worst under LFU, and moderately under SRRIP. Similarly, from the data, the integer benchmark hmmer seems closer to a mixed access pattern with a dominant near-immediate re-reference interval, which reduces the speedup for both LFU and SRRIP. Interestingly, the MPKI for this benchmark is almost the same as under LRU, indicating a strongly accurate prediction by SRRIP.
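The following toy simulation illustrates the m + k argument. It is entirely our own construction, not part of the report's zsim experiments: a single 16-way set is driven, each iteration, by a twice-referenced working set of k blocks followed by a scan of m never-reused blocks, and the hit rates of LRU and SRRIP-HP are compared.

#include <cstdint>
#include <cstdio>
#include <vector>
#include <algorithm>

const uint64_t rpvMax = 3; // M = 2

struct Set {
    std::vector<int64_t> tag;   // -1 = empty
    std::vector<uint64_t> meta; // LRU timestamp or RRPV
    uint64_t hits = 0, accesses = 0;
    Set() : tag(16, -1), meta(16, 0) {}
};

void accessLRU(Set& s, int64_t t, uint64_t now) {
    s.accesses++;
    for (size_t w = 0; w < s.tag.size(); w++)
        if (s.tag[w] == t) { s.hits++; s.meta[w] = now; return; }
    size_t v = std::min_element(s.meta.begin(), s.meta.end()) - s.meta.begin();
    s.tag[v] = t; s.meta[v] = now;      // evict the least recently used way
}

void accessSRRIP(Set& s, int64_t t) {
    s.accesses++;
    for (size_t w = 0; w < s.tag.size(); w++)
        if (s.tag[w] == t) { s.hits++; s.meta[w] = 0; return; } // HP promotion
    for (;;) {
        for (size_t w = 0; w < s.tag.size(); w++)
            if (s.tag[w] == -1 || s.meta[w] == rpvMax) {
                s.tag[w] = t; s.meta[w] = rpvMax - 1;           // long interval
                return;
            }
        for (size_t w = 0; w < s.tag.size(); w++) s.meta[w]++;  // age and retry
    }
}

int main() {
    const int k = 12, m = 8, iters = 1000; // working set k, scan length m
    Set lru, srrip;
    uint64_t now = 0;
    for (int i = 0; i < iters; i++) {
        for (int r = 0; r < 2; r++)        // re-referenced working set
            for (int b = 0; b < k; b++) {
                accessLRU(lru, b, ++now);
                accessSRRIP(srrip, b);
            }
        for (int b = 0; b < m; b++) {      // scan of never-reused blocks
            accessLRU(lru, 1000 * (i + 1) + b, ++now);
            accessSRRIP(srrip, 1000 * (i + 1) + b);
        }
    }
    printf("LRU   hit rate: %.3f\n", (double)lru.hits / (double)lru.accesses);
    printf("SRRIP hit rate: %.3f\n", (double)srrip.hits / (double)srrip.accesses);
    return 0;
}

With k = 12 and m = 8, the combined footprint exceeds the 16 ways, so LRU repeatedly discards parts of the working set during the scan, while SRRIP keeps the re-referenced blocks at low RRPVs and preferentially evicts the scan blocks.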
CONCLUSIONS

The SRRIP-HP technique demonstrates better performance than LRU and LFU in speedup as well as MPKI. However, this replacement policy predominantly outperforms on the multi-threaded workloads, and it gives an almost insignificant gain on the integer and floating-point workloads. The replacement policy's success seems to depend on the access pattern it encounters. With mixed access patterns, where neither LRU nor LFU performs well, SRRIP provides a good middle ground as a replacement policy. SRRIP-HP and SRRIP-FP weight recency and frequency differently within the set and will thus vary in performance depending on the underlying pattern.

ACKNOWLEDGMENT

The data and patterns observed were discussed amongst my group of friends (Saloni, Paras and Vikas), with whom I arrived at the above-mentioned inferences. I am also thankful to Akhilesh for his help in setting up the local Ubuntu environment on my system, which significantly helped in reducing the running time by allowing jobs to be run in parallel.

REFERENCES

[1] D. Sanchez and C. Kozyrakis, "ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems," ISCA, pp. 529-551, April 2013.
[2] A. Jaleel, K. Theobald, S. Steely and J. Emer, "High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP)," ISCA, pp. 68-73, June 2010.
