
Managing Shared Last-Level Cache in a Heterogeneous Multicore Processor


Vineeth Mekkat, Anup Holey, Pen-Chung Yew, Antonia Zhai
Department of Computer Science & Engineering
University of Minnesota
Minneapolis, MN 55455, USA
{mekkat, aholey, yew, zhai}@cs.umn.edu

Abstract—Heterogeneous multicore processors that integrate CPU cores and data-parallel accelerators such as GPU cores onto the same die raise several new issues for sharing various on-chip resources. The shared last-level cache (LLC) is one of the most important shared resources due to its impact on performance. Accesses to the shared LLC in heterogeneous multicore processors can be dominated by the GPU due to the significantly higher number of threads supported. Under current cache management policies, the CPU applications' share of the LLC can be significantly reduced in the presence of competing GPU applications. For cache sensitive CPU applications, a reduced share of the LLC could lead to significant performance degradation. On the contrary, GPU applications can often tolerate increased memory access latency in the presence of LLC misses when there is sufficient thread-level parallelism.

In this work, we propose Heterogeneous LLC Management (HeLM), a novel shared LLC management policy that takes advantage of the GPU's tolerance for memory access latency. HeLM is able to throttle GPU LLC accesses and yield LLC space to cache sensitive CPU applications. GPU LLC access throttling is achieved by allowing GPU threads that can tolerate longer memory access latencies to bypass the LLC. The latency tolerance of a GPU application is determined by the availability of thread-level parallelism, which can be measured at runtime as the average number of threads that are available for issuing. Our heterogeneous LLC management scheme outperforms the LRU policy by 12.5% and TAP-RRIP by 5.6% for a processor with 4 CPU and 4 GPU cores.

Index Terms—heterogeneous multicores, shared last-level cache, cache management policy.

I. INTRODUCTION

Advances in semiconductor technology and the urgent need for energy efficient computation have facilitated the integration of computational cores that are heterogeneous in nature onto the same die. Data-parallel accelerators such as Graphics Processing Units (GPUs) are the most popular among the accelerator cores used in such designs. With easy-to-adopt programming models, such as Nvidia CUDA [17] and OpenCL [10], these data-parallel cores are now being employed to accelerate diverse workloads. The availability of heterogeneous multicore systems such as AMD Fusion [3], Intel Sandy Bridge [5], and Nvidia Denver [18] suggests that multicore designs with heterogeneous processing elements are becoming the mainstream. Diversity in the performance characteristics of these computational cores presents a unique set of challenges in designing these heterogeneous multicore processors.

One of the key challenges in designing heterogeneous multicore systems is the sharing of on-chip resources such as the last-level cache (LLC), since integrating CPU and GPU cores onto the same die leads to competition in the LLC that does not exist in homogeneous systems. First, the difference in cache sensitivity among diverse cores implies a difference in the performance benefit obtained from owning the same amount of cache space. Second, GPU cores with a large number of threads can potentially dominate accesses to the LLC, and consequently, skew existing cache sharing policies in favor of the GPU cores. As a result, GPU cores occupy an unfair share of the LLC under existing policies. Prior work has shown that judicious sharing of the LLC can improve the overall performance of diverse workloads in homogeneous multicore systems [6, 11, 16, 19, 21, 24, 26, 27]. However, it is unclear whether these techniques can be adopted by heterogeneous multicore processors.

Dynamic Re-Reference Interval Prediction (DRRIP) [6] is a cache management policy developed primarily for homogeneous multicore processors. DRRIP predicts the re-reference (reuse) interval of cache lines to be either intermediate or distant, and inserts lines at a non-MRU (Most Recently Used) position based on the re-reference prediction. If a line is reused after insertion into the LLC, it is promoted by increasing its age to improve its lifetime in the cache. Non-MRU insertion of cache lines performs better than MRU insertion because most lines do not observe immediate re-reference.

Figure 1 shows the performance of the Least Recently Used (LRU) and DRRIP policies in a heterogeneous execution environment where 401.bzip2 (from the SPEC CPU2006 benchmark suite) executing on a single CPU core shares a 2MB LLC with a GPU benchmark (listed in Table IV) executing on four GPU cores. Details of the experiment are provided in Section IV. Figure 1(a) shows the average LLC occupancy and Figure 1(b) shows the average LLC misses per kilo-instruction (MPKI) as well as the normalized IPC of the CPU application across all GPU benchmarks. Since 401.bzip2 is cache sensitive, while most of the GPU applications are not, it is desirable to allocate a larger share of the LLC to the CPU application. However, we observe that a major chunk of the LLC is occupied by the GPU application. This leads to significant performance degradation for the CPU application under the LRU policy. DRRIP provides little performance improvement as it is overwhelmed by an order of magnitude difference between the memory access rates of the CPU and the GPU cores.
Fig. 1: The performance impact on a CPU core of sharing the LLC with four GPU cores. 401.bzip2 executes on the CPU core. The performance impact is measured across the set of GPU benchmarks shown in Table IV. (a) Cache occupancy of the CPU and the GPU cores. (b) Misses per kilo-instruction (MPKI) and normalized IPC for 401.bzip2. IPC is normalized to the IPC of 401.bzip2 executing on the heterogeneous processor without interference from the GPU cores.

We are aware of only one existing work, TLP-Aware Cache Management Policy (TAP) [13], that addresses the diversity of on-chip cores while designing the LLC sharing policy. TAP identifies the cache sensitivity of the GPU application, and the difference in LLC access rate between the CPU and GPU cores. This information is then used to influence the decisions made by the underlying cache management policy. When these metrics indicate a cache sensitive GPU application, both cores are given equal priority; when the GPU application is found to be cache insensitive, the GPU core is given lower priority in the underlying policy.

However, Figure 1(a) indicates that a large portion of the cache is still allocated to the cache-insensitive GPU application under TAP, and consequently, the performance degradation due to LLC sharing is still significant for the CPU application.

Several reasons prohibit TAP from achieving the desired performance. First, the core sampling technique used in TAP to measure the cache sensitivity of the GPU application leaves a significant amount of GPU dead blocks in the LLC. Second, TAP takes the same decision for all GPU memory accesses in a sampling period, and is slow to adapt to the runtime variations in the application's behavior. A more fine-grained control over the GPU LLC share could potentially improve the utilization of the shared LLC. We discuss TAP in detail in Section V-B.

The GPU core can support thousands of active threads simultaneously. Thus, the thread-level parallelism (TLP) available with the GPU core is orders of magnitude higher than that with the CPU core. This higher level of TLP aids the GPU core in tolerating longer memory access latency by scheduling threads that are ready to execute. Our experiments show that the majority of the GPU applications we study have a high level of memory access latency tolerance. Taking into consideration the latency tolerance of the GPU, we propose Heterogeneous LLC Management (HeLM), a mechanism for managing the shared LLC in heterogeneous multicore processors.

HeLM is a cache replacement policy that improves the effectiveness of the shared LLC in a heterogeneous environment by utilizing the available TLP in GPU applications. Under the HeLM policy, GPU LLC accesses are throttled by allowing memory accesses to selectively bypass the LLC; consequently, the cache sensitive CPU application is able to utilize a larger portion of the cache.

Overall, the contributions of this work are as follows:

• We analyze GPU application characteristics in Section II, and identify available TLP and LLC bypassing as two significant factors in managing the shared LLC in a heterogeneous multicore processor.
• We propose a runtime mechanism, in Section III, that dynamically determines the cache sensitivity of both CPU and GPU applications, and adapts the cache management policy based on this information.
• We evaluate the proposed mechanism in Section V, and demonstrate performance improvement over previously proposed cache management policies.

II. CHALLENGES & OPPORTUNITIES

The heterogeneous multicore architecture we address in this work is depicted in Figure 2. This design is modelled after Intel Sandy Bridge [5]. The processor consists of several CPU and GPU cores, each with its own private cache. These cores share the LLC and DRAM controllers, and the modules communicate through an on-chip interconnection network. Efficient sharing of on-chip resources is critical to the performance of a multicore processor. The last-level cache is one of the most important among these resources.

Fig. 2: Architecture of a heterogeneous multicore processor with CPU and GPU cores sharing the LLC.

Existing cache management policies, developed for homogeneous multicore processors, face difficulty in adapting to heterogeneous multicores. These mechanisms [6, 11, 16, 19, 21, 24, 26, 27] do not consider the diversity of core characteristics in their design. While many mechanisms [6, 19, 21, 26, 27] consider the cache sensitivity of the application, they do not consider the difference in LLC access rate between the diverse cores in a heterogeneous multicore. An order of magnitude higher access rate from the GPU core, compared to the CPU core, tends to skew their judgement in favor of the GPU.
A. Challenges in LLC sharing

Sharing of the LLC among cores in a heterogeneous multicore processor introduces several challenges. This section presents the challenges in determining the cache sensitivity of the various applications executing on individual cores, and in devising cache management mechanisms that can cope with cores having widely divergent memory access patterns.

(i) Cache Sensitivity: Cache sensitivity indicates how much the performance of an application can benefit from an increase in cache capacity. Cache management policies can utilize cache sensitivity as a metric to determine how to best share cache capacity between cores. In CPU-based homogeneous multicore systems, this issue has been extensively studied. Techniques such as set dueling [19] have been demonstrated to be effective in improving cache utilization. It is worth pointing out that, in such CPU-based systems [6, 19, 21, 26, 27], cache sensitivity is often measured in terms of variations in cache miss rates across different cores as the cache capacity allocation varies [21, 27] or as the cache replacement policy changes [6, 19, 26]. While for CPU cores a change in cache miss rate is a direct indicator of cache sensitivity, in GPU cores an increase in cache miss rate does not necessarily lead to performance degradation. This is because GPU cores can tolerate memory access latency by context switching between a large number of concurrently active threads. Thus, cache miss rate is not a good indicator of the cache sensitivity of the GPU. GPU-specific techniques must be developed to determine the cache sensitivity of GPU workloads.

(ii) Cache Management Mechanisms: There are two classes of techniques for managing shared caches: i) partitioning the cache ways among applications; and ii) prioritizing the insertion/eviction of blocks from different programs. When workloads with differing cache sensitivities share a cache, one of these techniques could be employed to enhance cache utilization and maximize the overall performance. However, when a GPU workload shares the cache with a CPU, previously proposed mechanisms are often unable to make judicious decisions because GPU workloads often have memory access rates that are an order of magnitude higher than those of CPU workloads. In particular, when both CPU and GPU workloads are identified as cache sensitive, the memory accesses from the GPU will pollute the shared cache and wipe out cache blocks needed by the CPU. In such cases, it is desirable to give cache sensitive CPU workloads higher priority over cache sensitive GPU workloads to improve the overall performance.

B. Improving LLC sharing

In this work, we propose to address the challenges faced by existing cache management techniques in a heterogeneous multicore environment. First, we propose to use TLP as a runtime metric to correctly identify the cache sensitivity of GPU applications. Second, we propose to use LLC bypassing to improve cache management in the heterogeneous environment.

TLP as a Runtime Metric: The general cache insensitivity of GPU applications stems from two main reasons: i) streaming memory access behavior; and ii) high levels of available TLP. Even when the memory access behavior is not streaming, GPU applications are able to tolerate higher memory access latency using the available TLP. Figure 3 shows the impact of bypassing¹ the shared LLC for 100%, 75%, 50%, and 25% of GPU memory access requests. We observe that, on average, GPU applications can sustain up to 75% of LLC access bypassing without significant performance degradation. The GPU is able to do so by utilizing its high degree of TLP.

Fig. 3: Impact of random LLC bypassing on GPU benchmarks. The graph shows the IPC of each configuration relative to the native run (without LLC bypassing).

¹For 75%, 50%, and 25% bypassing, we randomly choose the GPU accesses for bypassing. The values shown are averages for the GPU benchmarks in Table IV, executing on 4 GPU cores with specifications as mentioned in Table II.
(ii) Cache Management Mechanisms: There are two classes GPU applications, TLP forms an accurate metric for the same.
of techniques for managing the shared caches: i) partitioning LLC Bypassing: To improve the flexibility of cache man-
the cache ways among applications; and ii) prioritizing the agement mechanisms in a heterogeneous multicore processor,
insertion/eviction of blocks from different programs. When we explore LLC bypassing techniques. LLC bypassing allows
workloads with differing cache sensitivities share a cache, potentially different decisions for each incoming GPU access.
one of these techniques could be employed to enhance cache When both CPU and GPU applications are identified as cache
utilization and maximize the overall performance. However, sensitive, the mechanism can consider various application
when GPU workload is sharing the cache with CPU, previ- characteristics while making the bypass decisions. Such char-
ously proposed mechanisms are often unable to make judicious acteristics include the difference in cache sensitivities of CPU
decisions because GPU workloads often have memory access and GPU applications, and the amount of TLP available in the
rates that are an order of magnitude higher than those of CPU GPU application. Such fine-grained throttling of each LLC
workloads. In particular, when both CPU and GPU workloads access can bring significant performance improvement as a
are identified as cache sensitive, the memory accesses from the result of better LLC utilization.
GPU will pollute the shared cache, and wipe out cache blocks Additionally, bypassing of unnecessary blocks improves the
needed by the CPU. In such cases, it is desirable to give cache dynamic energy performance of the LLC. For cache insensitive
sensitive CPU workloads higher priority over cache sensitive applications, the blocks in LLC are mostly dead (zero reuse
GPU workloads to improve the overall performance. before eviction). Hence, they end up accessing off-chip DRAM
for these blocks whether they are stored in LLC or not.
B. Improving LLC sharing Thus, bypassing technique can improve the LLC dynamic
energy performance without significant increase in the off-chip
In this work, we propose to address the challenges faced
DRAM access.
by existing cache management techniques in a heterogeneous
multicore environment. First, we propose to use TLP as a run-
time metric to correctly identify the cache sensitivity of GPU C. Workload Characteristics
applications. Second, we propose to use LLC bypassing to A rethinking of the optimal sharing of on-chip resources in a
improve cache management in the heterogeneous environment. heterogeneous multicore processor necessitates a reevaluation
TLP as a Runtime Metric: The general cache insensitivity of of the characteristics of the target workload. We analyze the
GPU applications stems from two main reasons: i) streaming 1 For 75%, 50%, and 25% bypassing, we randomly choose the GPU accesses
memory access behavior; and ii) high levels of available TLP. for bypassing. The values shown are average for GPU benchmarks in Table IV,
Even when the memory access behavior is not streaming, executing on 4 GPU cores with specifications as mentioned in Table II.
We analyze the performance characteristics of general purpose GPU applications for varying LLC sizes. Based on instructions per cycle (IPC) and misses per kilo-instruction (MPKI) characteristics, we broadly classify these applications as either cache sensitive or cache insensitive. Cache insensitive applications can further be classified into three types: compute intensive, performance insensitive, or streaming. The first type puts very little pressure on the shared LLC. The last two types have a high LLC access rate; however, cache size has hardly any influence on their performance. This characteristic is due either to their streaming access behavior or to their TLP availability.

Figure 4 shows the IPC and MPKI characteristics for these categories. This characterization can be used to evaluate the effectiveness of LLC space on the application, and it shows that not all GPGPU applications have a uniform streaming access behavior. There are several cache sensitive applications that need special attention while managing the LLC. These characteristics assume increased importance when GPU applications share the LLC with CPU applications in a heterogeneous multicore processor. Benchmarks belonging to each class are shown in Table IV. We utilize this benchmark characterization to form the workload mix in Section IV-B. Section V-D evaluates HeLM based on these application classes.

Fig. 4: LLC sensitivity of GPU benchmarks. Four different classes of GPU benchmarks (compute intensive, cache sensitive, performance insensitive, and streaming) are shown with their IPC and MPKI for different LLC sizes (in KB).

III. HETEROGENEOUS LLC MANAGEMENT

In this section, we describe our heterogeneous LLC management mechanism, which mitigates the performance impact of LLC sharing by throttling LLC accesses initiated by the GPU cores. HeLM exploits the memory access latency tolerance of the GPU cores and allows the GPU cores to yield LLC space to the cache sensitive CPU cores without significantly degrading their own performance. In HeLM, we manage the LLC occupancy of the GPU cores by allowing the GPU memory traffic to selectively bypass the LLC when: i) the GPU cores exhibit sufficient TLP to tolerate memory access latency; or ii) the GPU application is not sensitive to LLC performance.

For each GPU memory access, the decision for bypassing the LLC is made at the shared LLC. On an L1 cache miss, the TLP information of the GPU core is attached to the LLC access request. The available TLP at runtime is measured using hardware performance monitors that count the number of wavefronts² ready to be scheduled at any given time. A higher number of ready wavefronts indicates higher TLP, which in turn suggests that the GPU can tolerate higher memory access latency.

²Work is allocated to the GPU cores as kernels that contain a large number of threads. A kernel is further partitioned and mapped to different GPU cores as thread-blocks or workgroups. Scalar threads within each GPU core are scheduled simultaneously as warps [17] or wavefronts [10] onto the SIMD computing engine.
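As an illustration of this per-access decision, a minimal sketch is given below; the structure LLCRequest and the field names (fromGpu, readyWavefronts) are assumptions about how the attached TLP information could be represented, and tlpThreshold stands for the TLP threshold in force for the current sampling period (LowThr or HighThr, selected as described in the following subsections).

// Per-request metadata attached by the GPU core on an L1 miss.
struct LLCRequest {
    bool     fromGpu;          // originating core type
    unsigned readyWavefronts;  // TLP sample from the hardware monitor
};

// Bypass decision taken at the shared LLC: a GPU request skips the LLC
// when the originating core currently has enough ready wavefronts to
// hide the additional off-chip latency.
bool bypassLLC(const LLCRequest& req, unsigned tlpThreshold) {
    if (!req.fromGpu)
        return false;                        // CPU requests are never bypassed
    return req.readyWavefronts >= tlpThreshold;
}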
Cache sensitivity of the CPU and GPU applications plays a critical role in making bypassing decisions. A cache insensitive CPU application does not benefit from the increased LLC space made available by GPU LLC bypassing. Bypassing the LLC for a cache sensitive GPU application executing along with such cache insensitive CPU applications could degrade GPU performance without improving the overall performance.

In the following subsections, we discuss in detail the techniques employed to identify: i) the cache sensitivities of the CPU and GPU applications; and ii) an effective TLP threshold to measure the memory access latency tolerance of the GPU application. We combine these metrics into a Threshold Selection Algorithm (TSA) that makes GPU LLC bypassing decisions.

A. Measuring the LLC Sensitivity

We employ a mechanism based on the set dueling [19] technique to measure the cache sensitivity of the CPU and GPU applications. Set dueling applies two opposing techniques to two distinct groups of sets, and identifies the characteristic of the application from the performance difference between them. We apply set dueling in two different ways to the CPU core and the GPU core.

1) GPU LLC Sensitivity: To measure the GPU LLC sensitivity, we utilize two GPU sampling cores and two TLP thresholds: LowThr and HighThr. In every sampling period, one of the GPU cores (LowGPU) performs LLC bypassing at LowThr, while the other core (HighGPU) uses HighThr. LowThr is always smaller than HighThr and indicates a higher rate of bypassing. Hence, LowGPU bypasses more memory accesses than HighGPU. A significant performance (IPC) difference (∆IPC_GPU), greater than pThreshold³, between these two cores indicates that LLC bypassing is having an adverse impact on GPU performance and hence that the GPU application is cache sensitive. If the performance difference is within the limit, the GPU application is considered cache insensitive.

³Based on empirical analysis, we set pThreshold to 5%.
2) CPU LLC Sensitivity: We evaluate the cache sensitivity of the CPU application by monitoring the impact of GPU LLC bypassing on the performance of the CPU application. Since CPU applications are more cache sensitive than GPU applications, a change in cache misses directly affects the performance of CPU applications. We measure two CPU LLC miss counts, MissLow and MissHigh, corresponding to GPU bypassing at LowThr and HighThr respectively. Two set dueling monitors (SDMs) are used at the LLC to obtain the MissLow and MissHigh numbers, each bypassing GPU accesses at LowThr and HighThr respectively. For these SDMs, the bypassing decision is made using each SDM's own TLP threshold, irrespective of which GPU core initiated the access.

Since the GPU takes more LLC space with HighThr than with LowThr, MissHigh is always greater than MissLow. If the difference between MissHigh and MissLow (∆MISS_CPU) is greater than mThreshold⁴, GPU bypassing is affecting the CPU LLC behavior, and hence its performance. This criterion can identify compute intensive as well as streaming CPU workloads⁵. Dynamic Set Sampling (DSS) [20] has shown that sampling a small number of sets in the LLC can indicate the cache access behavior with high accuracy. We use this technique by sampling 32 sets (out of 4096) to measure the cache sensitivity.

⁴Based on empirical analysis, we set mThreshold to 10%.

⁵For compute intensive workloads, MissHigh ≈ MissLow ≈ 0, while for streaming workloads, MissHigh ≈ MissLow ≈ K (a positive number).
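A minimal sketch of how the sampled sets and the CPU miss counters could be organized is shown below; the 1-in-256 spacing (16 sets per SDM, 32 sampled sets in total), the even split between the two SDMs, and the type names are illustrative assumptions, since the exact set-selection function is an implementation detail.

#include <cstdint>

// Role of an LLC set under set dueling: most sets follow the threshold chosen
// by TSA, while two small groups of sampled sets (the SDMs) always bypass GPU
// accesses at LowThr and HighThr respectively.
enum class SetRole { Follower, SdmLow, SdmHigh };

// Illustrative set selection: 16 sets per SDM out of 4096 (32 sampled sets in
// total). Any sparse, evenly spread selection serves the same purpose.
SetRole classifySet(uint32_t setIndex) {
    if (setIndex % 256 == 0)   return SetRole::SdmLow;
    if (setIndex % 256 == 128) return SetRole::SdmHigh;
    return SetRole::Follower;
}

// CPU miss counters accumulated over one sampling period. Their difference
// (MissHigh - MissLow = deltaMissCpu) is compared against mThreshold by TSA.
struct CpuSensitivityMonitor {
    uint32_t missLow  = 0;   // CPU misses in sets that bypass GPU at LowThr
    uint32_t missHigh = 0;   // CPU misses in sets that bypass GPU at HighThr

    void recordCpuMiss(SetRole role) {
        if (role == SetRole::SdmLow)  ++missLow;
        if (role == SetRole::SdmHigh) ++missHigh;
    }
};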
B. Determining Effective TLP Threshold

Determining the effective TLP threshold at which to initiate GPU LLC bypassing is critical. To adapt to the diversity among GPU applications, and to the runtime variations within an application itself, we propose an algorithm to dynamically determine LowThr and HighThr. Our heuristic is inspired by the binary chop algorithm that is commonly used for searching for an element in a sorted list bounded by the limits MaxLimit and MinLimit. The algorithm maintains two parameters U and L such that U ≥ L, and calculates a decision element E as the average of U and L (AVG(U, L)). At the beginning of the algorithm, U and L are initialized to MaxLimit and MinLimit, respectively, and a prediction is made. If the prediction is lower than expected, the search window is moved up (GO UP) by updating U and L as shown in Table I. If the prediction is higher than expected, the search window is moved down (GO DOWN). At each step, E is recalculated, and the process continues until E matches the searched element.

  Action    U                  L
  INIT      MaxLimit           MinLimit
  GO UP     AVG(MaxLimit, U)   E
  GO DOWN   E                  AVG(L, MinLimit)

TABLE I: Binary chop algorithm for adapting sampling thresholds at runtime.

Our adaptation of the binary chop algorithm recomputes the higher and lower bypass thresholds at runtime depending upon the application behavior. We start by initializing HighThr = U = (3/4) × MAX_wavefronts and LowThr = L = (1/4) × MAX_wavefronts. Here, MaxLimit = MAX_wavefronts and MinLimit = MIN_wavefronts. After every sampling period, the HighThr and LowThr values are updated with the new values of U and L, respectively.
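A compact sketch of this adaptation is shown below; the value of MAX_WAVEFRONTS follows our GPU configuration (64 wavefronts per core, Table II), while MIN_WAVEFRONTS = 0 and the integer averaging are assumptions of the sketch.

// Runtime adaptation of the two bypass thresholds (HighThr = U, LowThr = L)
// following the binary-chop actions of Table I.
struct ThresholdAdapter {
    static constexpr int MAX_WAVEFRONTS = 64;  // MaxLimit (wavefronts per core)
    static constexpr int MIN_WAVEFRONTS = 0;   // MinLimit (assumed)

    int U = 3 * MAX_WAVEFRONTS / 4;            // HighThr, initial value
    int L = MAX_WAVEFRONTS / 4;                // LowThr, initial value

    static int avg(int a, int b) { return (a + b) / 2; }
    int decisionElement() const { return avg(U, L); }   // E = AVG(U, L)

    // GO UP: raise both thresholds, reducing bypassing aggressiveness
    // (used when the GPU application is cache sensitive).
    void goUp() {
        int e = decisionElement();
        U = avg(MAX_WAVEFRONTS, U);
        L = e;
    }

    // GO DOWN: lower both thresholds, increasing bypassing aggressiveness.
    void goDown() {
        int e = decisionElement();
        U = e;
        L = avg(L, MIN_WAVEFRONTS);
    }
};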
C. Putting It All Together

TSA monitors the workload characteristics continuously and re-evaluates the TLP threshold at the end of every sampling period (1M execution cycles in our study). Once the threshold is chosen, it is enforced on all the follower GPU cores.

  Data: ∆IPC_GPU, ∆MISS_CPU
  Result: Bypass TLP threshold
  if ∆MISS_CPU ≥ mThreshold then
      Set LowThr as TLP threshold;
  else if ∆IPC_GPU > pThreshold then
      Set HighThr as TLP threshold;
  else
      if delta(∆IPC_GPU, pThreshold) ≥ delta(∆MISS_CPU, mThreshold) then
          Set LowThr as TLP threshold;
      else
          Set HighThr as TLP threshold;
      end
  end

Algorithm 1: Pseudocode for the Threshold Selection Algorithm (TSA).

The pseudocode for TSA is shown in Algorithm 1. If the CPU application is cache sensitive, LowThr is selected for bypassing the LLC for GPU memory accesses. Otherwise, the choice of threshold depends on the characteristics of the GPU application. If the GPU application is cache sensitive, HighThr is selected for bypassing the LLC for GPU memory accesses. Although the threshold selected when both CPU and GPU applications are identified as cache insensitive does not impact performance significantly, it could have a significant impact on the off-chip DRAM access rate. In such a case, we take a conservative approach by evaluating which metric is nearer to its limit and selecting the TLP threshold accordingly.

Based on the threshold selected by TSA, HighThr and LowThr are re-calculated using the binary chop heuristic discussed in Section III-B. For cache sensitive GPU applications, the LLC bypassing aggressiveness is reduced by the action GO UP; otherwise, the aggressiveness is increased by the action GO DOWN. If the actions toggle between GO UP and GO DOWN for consecutive sampling periods, we maintain the existing HighThr and LowThr values for the next five sampling periods.

CPU cache management policies have employed thread awareness to prevent one application from dominating the sharing policy. Mechanisms such as thread-aware DRRIP [6] (referred to as DRRIP in this paper) utilize a separate set of SDMs to isolate the influence of applications on each other. Similarly, HeLM is made thread aware by assigning individual MissLow and MissHigh counters to calculate ∆MISS_CPU for each thread. For thread awareness, TSA selects LowThr as the TLP threshold if any of the ∆MISS_CPU values is ≥ mThreshold.

D. Other Design Considerations

Impact on On-Chip Energy and Off-Chip Access: Allowing memory accesses that are unlikely to be reused in the cache to bypass the LLC can potentially improve cache utilization and reduce the dynamic energy of LLC accesses. However, LLC bypassing could also increase the number of off-chip DRAM accesses. Due to the streaming nature of GPU applications, the blocks in the LLC are mostly dead, and the accesses go off-chip to fetch
data blocks from DRAM anyway. Thus, we observe that LLC bypassing does not increase off-chip DRAM accesses significantly. We evaluate the impact of bypassing on LLC energy consumption and off-chip accesses in Section V-E.

Handling Coherence: The contemporary GPU does not support a coherent memory hierarchy. However, if coherence is supported in future GPUs, bypassing can also be easily supported. The additional support for maintaining coherence with GPU bypassing may or may not be required depending upon the inclusion property of the GPU cache hierarchy. Inclusion ensures that cache blocks present in higher level caches are also present in the LLC, while non-inclusion or exclusion relaxes such a constraint. Bypassing essentially turns an inclusive LLC into a non-inclusive cache. Therefore, the support necessary for maintaining coherence in a non-inclusive LLC can also be used to support bypassing in an inclusive LLC. Coherence in non-inclusive caches is maintained by employing mechanisms such as a snoop filter [22], which is essentially a replica of the higher level cache tags at the LLC. Therefore, bypassing for a non-inclusive LLC will not require any modifications for handling coherence, while support similar to a snoop filter will be necessary for an inclusive LLC.

While this work evaluates workloads where the CPU and GPU applications have disjoint address spaces, we expect the proposed technique to be equally effective when these applications share the same address space. When data is shared between the CPU and the GPU, the underlying cache coherence mechanism will ensure the correctness of data accesses, and the proposed bypass mechanism can be deployed with a snoop filter as discussed above. In this work, we model a GPU cache hierarchy that is write-through; however, the mechanism works equally well with a write-back cache.

IV. EXPERIMENTAL METHODOLOGY

A. Simulator

We evaluate HeLM on a cycle accurate simulator, Multi2Sim [25], that simulates both CPU and GPU cores as depicted in Figure 2. The CPU cores are modelled on a 4-wide out-of-order x86 processor, while the GPU cores are based on the AMD Evergreen [2] architecture. We extended the memory subsystem in Multi2Sim to support a shared LLC between the CPU and GPU cores. Table II shows the parameters of the cores we simulate.

  CPU
    Core           1-4 cores, 2.6GHz, 4-wide out-of-order, 64-entry RoB
    L1 Cache       4-way, 32KB, 64B line, private I/D (2 cycles)
    L2 Cache       8-way, 256KB, 64B line, unified (8 cycles)
  GPU
    Core           4 cores, 1.3GHz, 8-wide SIMD, 16K register file,
                   64 wavefronts, round-robin scheduling
    L1 Cache       4-way, 8KB, 64B line, private I/D (2 cycles)
    Shared Memory  32KB, 256B block (2 cycles)
  Shared Components
    LLC            32-way, 2-8MB, 64B line, 4 tiles (20 cycles)
    DRAM           4GB, 4 controllers (200 cycles)
    NoC            Mesh topology, 32B flit-size

TABLE II: Configuration parameters for the heterogeneous evaluation infrastructure.

We evaluate 500 million instructions for each CPU benchmark and 150 million instructions⁶ for each GPU benchmark. The 500 million instruction representative interval for each CPU benchmark, with the ref input, is obtained through SimPoint [4] analysis. As in previous works [6, 13, 21], early finishing benchmarks continue to execute until all the benchmarks execute the specified number of instructions. We utilize McPAT [14] for studying on-chip LLC energy consumption.

⁶An instruction executed by all threads in a wavefront is counted as one instruction.

B. Benchmarks

The CPU benchmarks evaluated belong to the SPEC CPU2006 benchmark suite [23]. The GPU benchmarks evaluated are OpenCL programs from the AMD APP (Accelerated Parallel Processing) v2.5 software development kit [1]. We classify these benchmarks into different categories, depending upon their cache performance, as shown in Tables III and IV.

  Category           Benchmarks
  Cache sensitive    bzip2, gcc, mcf, perlbench, dealII, omnetpp, astar,
                     soplex, povray, h264
  Cache insensitive  Streaming: libquantum, bwaves, milc, zeusmp,
                     cactusadm, gemsfdtd, lbm, leslie3d
                     Compute intensive: gobmk, hmmer, gromacs, sjeng,
                     gamess, calculix, tonto, namd

TABLE III: Classification of CPU benchmarks.

  Category           Benchmarks
  Cache sensitive    matrix-multiplication, matrix-transpose, gaussian,
                     fast-walsh transform, floyd-warshal
  Cache insensitive  Streaming: histogram, radix sort, blackscholes, reduction
                     Performance insensitive: sobel, dwthaar1D, scanarray,
                     dct, box filter
                     Compute intensive: binomial option, eigen, bitonic sort

TABLE IV: Classification of GPU benchmarks.

  Core Configuration     Workloads
  1CPU + 4GPU (1C4G)     100
  2CPU + 4GPU (2C4G)     40
  4CPU + 4GPU (4C4G)     30

TABLE V: Heterogeneous workloads evaluated.

We evaluate multiprogrammed workloads on heterogeneous processors with the core configurations shown in Table V. We consider three types of processors: processors with one, two, and four CPU cores, respectively. All of them contain four GPU cores as well. Each CPU core executes a CPU benchmark while all four GPU cores execute the same GPU benchmark. We refer to the workloads executing on these processors as 1C4G, 2C4G, and 4C4G, respectively, where C stands for CPU core and G stands for GPU core. Equal numbers of CPU and GPU benchmarks are selected from the cache sensitive and insensitive categories. The workloads in Table V are formed by randomly selecting benchmarks from each category. The processor with one CPU core shares a 2MB LLC with the four GPU cores, while the processors with two and four CPU cores share a 4MB LLC and an 8MB LLC, respectively.
C. Cache Management Policies

HeLM is decoupled from the underlying cache management policy, which brings flexibility to the mechanism as it can be adapted to work with any cache management policy. In this paper, we implement HeLM over the DRRIP policy; however, it could equally be implemented over other cache management policies such as Utility-based Cache Partitioning (UCP) [21]. TAP was proposed with two variants, TAP-UCP and TAP-RRIP, built on top of UCP and DRRIP respectively. Since TAP-RRIP outperforms TAP-UCP in all evaluations, we choose to compare HeLM against TAP-RRIP. In the DRRIP policy, incoming cache blocks are inserted at a non-MRU position and are promoted later on cache hits. Similar to TAP, we do not promote GPU blocks on LLC hits. Also, when both CPU and GPU blocks are available for replacement, we replace the GPU block first.
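A minimal sketch of these two GPU-specific overrides is shown below; the 2-bit RRPV encoding follows the original RRIP proposal [6], and the data-structure names are illustrative assumptions rather than our exact implementation.

#include <cstdint>
#include <vector>

// Per-block state in one LLC set. RRPV is the 2-bit re-reference prediction
// value of RRIP [6]; 'fromGpu' marks blocks inserted by GPU cores.
struct Block {
    uint8_t rrpv    = 3;      // 3 = distant re-reference (eviction candidate)
    bool    fromGpu = false;
};

// On an LLC hit: CPU blocks are promoted as in DRRIP, GPU blocks are not.
void onHit(Block& blk) {
    if (!blk.fromGpu)
        blk.rrpv = 0;         // promote CPU block toward MRU
}

// Victim selection within a set (assumed non-empty): among blocks at the
// maximum RRPV, prefer a GPU block over a CPU block; age the set if no
// candidate exists, as in RRIP.
int selectVictim(std::vector<Block>& set) {
    for (;;) {
        int cpuCandidate = -1;
        for (int i = 0; i < static_cast<int>(set.size()); ++i) {
            if (set[i].rrpv == 3) {
                if (set[i].fromGpu)
                    return i;                // GPU block replaced first
                if (cpuCandidate < 0)
                    cpuCandidate = i;
            }
        }
        if (cpuCandidate >= 0)
            return cpuCandidate;
        for (auto& b : set)                  // no candidate: age all blocks
            ++b.rrpv;
    }
}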
Reuse-based Cache Management: Reuse-based mechanisms [7–9, 12, 15] have been studied extensively in the CPU domain to improve cache utilization. Also known as dead-block predictors, they predict whether a cache block is dead or live, and improve cache performance by replacing dead blocks first, by bypassing dead blocks, or by prefetching data into dead blocks. In this paper, we compare the performance of HeLM with two reuse-based bypass mechanisms, MAT [7] and the Sampling Dead-Block Predictor (SDBP) [8]. Other reuse-based mechanisms are discussed in Section VI-B.

MAT is an address-based reuse mechanism in which cache block reuse is determined using a Memory Address Table (MAT) at a macroblock granularity. MAT bypasses a cache block if the victim block has higher reuse than the one being inserted. SDBP, on the other hand, is a PC-based reuse mechanism in which the reuse pattern is learned from accesses to the cache blocks in a few sampled sets in the LLC. SDBP updates its prediction table using the PC of the last instruction that accesses a cache block in the sampled sets. On an access to the LLC, SDBP predicts the cache block as dead or live by referring to the prediction table, indexed by the PC of the instruction initiating the access. If the block being accessed is dead, it can be bypassed from the LLC.
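For concreteness, a simplified sketch of MAT-style reuse tracking is given below; the 1KB macroblock size, the unbounded map standing in for the 4K-entry table of Table VI, and the class name are assumptions of this sketch, not details of the original proposal [7].

#include <cstdint>
#include <unordered_map>

// Illustrative MAT-style reuse tracking at macroblock granularity.
class MacroblockReuseTable {
public:
    static constexpr uint64_t kMacroblockBytes = 1024;  // assumed macroblock size

    void recordAccess(uint64_t addr) {
        uint8_t& c = counters_[addr / kMacroblockBytes];
        if (c < 255) ++c;                   // saturating reuse counter
    }

    uint8_t reuse(uint64_t addr) const {
        auto it = counters_.find(addr / kMacroblockBytes);
        return it == counters_.end() ? 0 : it->second;
    }

    // Bypass the incoming block if the would-be victim shows higher reuse.
    bool shouldBypass(uint64_t incomingAddr, uint64_t victimAddr) const {
        return reuse(victimAddr) > reuse(incomingAddr);
    }

private:
    std::unordered_map<uint64_t, uint8_t> counters_;
};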
V. EVALUATION

In this section, we evaluate the impact of HeLM on the performance of the shared LLC in a heterogeneous multicore processor. We compare the performance of HeLM with DRRIP, MAT, SDBP, and TAP-RRIP, all normalized to LRU. MAT and SDBP were originally proposed for bypassing CPU memory accesses; however, to compare with HeLM, we employ these techniques to bypass only the GPU memory accesses.

A. Performance

We start our evaluation with the impact of these policies on the cache performance of the CPU and GPU cores. Figure 5(a) shows the reduction in CPU and GPU LLC misses for these policies (normalized to LRU) for the 1C4G, 2C4G, and 4C4G workloads. In the case of the CPU, although all of them perform better than LRU, the improvement for DRRIP and the reuse-based mechanisms (MAT and SDBP) is lower than that for TAP and HeLM.

Fig. 5: Impact of different cache management policies on cache performance and speedup of the CPU and GPU benchmarks. Graphs show results for the 1C4G, 2C4G, and 4C4G workloads. Results are relative to the LRU policy. (a) Cache performance of CPU and GPU benchmarks. (b) Speedup for CPU and GPU benchmarks.

Additionally, HeLM outperforms TAP. Overall, HeLM reduces CPU LLC misses by 29.5%, 39.1%, and 33% over LRU for the 1C4G, 2C4G, and 4C4G workloads, while the corresponding reductions for TAP are 13.9%, 18.1%, and 18.8%. On the other hand, TAP and HeLM increase the GPU LLC misses as they give lower priority to the GPU over the CPU. While DRRIP does better than TAP and HeLM, MAT and SDBP are the most favorable towards the GPU.

For the CPU, cache performance directly translates to speedup, as shown in Figure 5(b). Here, HeLM outperforms all other policies convincingly. The lower priority given to the GPU is evident in the speedup of TAP and HeLM. However, the impact of increased cache misses does not translate linearly into performance degradation for the GPU. This is due to the difference in cache sensitivity among the CPU and GPU cores. This shows that since the CPU benchmarks are more cache sensitive than the GPU benchmarks, it is preferable to prioritize CPU benchmarks while managing the shared LLC space. In our experiments, the GPU benchmarks in the 2C4G workloads show slightly higher performance degradation overall when compared to the 1C4G and 4C4G workloads. This is potentially due to the random selection of the workload mix, which contains more cache sensitive GPU benchmarks than 1C4G and 4C4G.

Combined speedup for the CPU and GPU for all workloads is shown in Figure 6. Speedup is calculated as the geometric mean of individual benchmark speedups. The figure shows that HeLM outperforms all other replacement policies consistently in overall system performance. As the number of CPU benchmarks increases, the speedup from HeLM also increases, indicating that HeLM is able to scale with the number of CPU cores in a heterogeneous multicore.
Overall, HeLM performs 7.7%, 10.4%, and 12.5% better than LRU for the 1C4G, 2C4G, and 4C4G workloads, respectively. The corresponding improvements for TAP over LRU are 2.6%, 4.4%, and 6.5%, respectively.

Fig. 6: Combined speedup for various workloads under different policies. Combined speedup is calculated as the geometric mean of individual benchmark speedups. Results are relative to the LRU policy.

B. Comparison with Other Policies

Here we discuss the reasons for the performance improvement of HeLM over the various cache management policies we evaluate.

DRRIP: The effectiveness of DRRIP in a multicore environment is visible in Figures 5(a) and 5(b), as DRRIP outperforms LRU. However, DRRIP faces difficulty in adapting to the heterogeneous characteristics of the cores. The DRRIP policy, similar to LRU, does not consider the diversity among the on-chip cores and gives equal priority to both. Therefore, the higher LLC access rate from the GPU cores tends to skew the cache management policy in their favor. Thus, the performance improvement for the CPU core is limited, while the GPU cores do not benefit much from the additional LLC space. Hence, the overall speedup for DRRIP in Figure 6 is low.

MAT/SDBP: The performance of the reuse-analysis based policies, MAT and SDBP, is very similar to DRRIP, although for different reasons. These mechanisms, although capable of bypassing memory accesses around the LLC, are overly conservative in their approach towards the GPU applications. They detect reuse patterns in GPU memory access behavior and preserve the GPU blocks in the LLC. This improves the cache performance of the GPU, as shown in Figure 5(a). However, the GPU does not benefit significantly from the increased LLC space due to its TLP and the resulting ability to tolerate higher memory access latency. This additional LLC space would have been better utilized had it been provided to the CPU application. Hence, these mechanisms also observe low overall speedup, as shown in Figure 6.

TAP: TAP considers the diversity of on-chip cores in optimizing the cache management policy, and improves performance over existing policies by prioritizing the CPU over the GPU. However, HeLM still outperforms TAP, for two reasons:

(i) The core sampling technique used by TAP leaves a significant portion of the shared LLC occupied by the GPU cores. The majority of these blocks originate from the GPU core that inserts at the MRU position in the core sampling technique, and they end up being dead blocks. In our experiments with the 1C4G workloads, we observe that nearly 40% of the shared LLC space is occupied by GPU dead blocks that were inserted at the MRU position. This leads to the eviction of useful CPU blocks, leaving significant room for improvement. Since HeLM does not suffer from this side effect, it performs better than TAP.

(ii) TAP takes a binary decision on whether the GPU application is cache sensitive or not. This decision is then used to override the underlying policy for all the accesses in the sampling period. However, such a binary decision is at a coarse granularity, while a more fine-grained ability to control the LLC share between the CPU and GPU could potentially improve the performance of both cores. HeLM is able to control the cache occupancy of the GPU core at a finer granularity, by taking a bypass decision for each GPU access, which also helps in outperforming TAP.

C. Sensitivity to Cache Size

Figure 7 presents the sensitivity of HeLM to varying LLC sizes for the 4C4G workloads. To configure different LLC sizes, we vary the LLC associativity. As shown in the figure, HeLM outperforms the other policies for all cache configurations. Although the performance benefit of HeLM is more evident with smaller LLC sizes, it is able to preserve its benefits with increasing LLC sizes. This shows that HeLM can adapt well to variations in cache configurations.

Fig. 7: Performance of 4C4G workloads with varying cache sizes. Results are relative to the LRU policy.

D. Workload Types

Mixing CPU and GPU applications in a heterogeneous multicore processor creates workloads with unique characteristics. To evaluate the potential opportunities in these workloads, we broadly classify them based on their cache sensitivity, resulting in four different categories:

• CPU cache Sensitive, GPU cache Insensitive (CSGI): CSGI is perhaps the most common category, in which the GPU occupies the majority of the LLC space, leading to poor performance of the CPU application under existing cache policies.
• CPU cache Sensitive, GPU cache Sensitive (CSGS): Although the GPU is cache sensitive in this combination, it has the advantage of high TLP. Hence, any additional LLC space given to the CPU could bring a larger overall performance improvement.
• CPU cache Insensitive, GPU cache Insensitive (CIGI):
• CPU cache Insensitive, GPU cache Sensitive (CIGS): These two categories do not leave significant room for performance improvement, as the CPU applications are cache insensitive.

Figure 8 presents the speedup for the evaluated policies, over LRU, for the four categories. As expected, all the policies show performance improvement over LRU in the first two
categories (CSGI, CSGS), where the CPU is cache sensitive. We can also observe that HeLM outperforms all the other policies in these categories, particularly by a significant margin for CSGI, which is the most common category. In the last two categories (CIGI, CIGS), there is hardly any performance improvement over LRU for any of the policies.

Fig. 8: Speedup for 1C4G workloads, category-wise. Results are relative to the LRU policy.

E. Energy Consumption & Off-Chip Bandwidth

In this section, we discuss the impact of HeLM on the energy consumption of the memory modules, both on-chip and off-chip. On one hand, HeLM reduces the on-chip dynamic energy consumption of the shared LLC by allowing GPU memory accesses to bypass the shared LLC. On the other hand, this could lead to an increase in off-chip main memory (DRAM) accesses, resulting in a potential increase in DRAM dynamic energy consumption and off-chip memory bandwidth.

There are two scenarios contributing to LLC dynamic energy: (i) on an LLC hit, the tag array and the cache block data are accessed; and (ii) on an LLC miss, the tag array is accessed, a cache block is written back to memory if it is dirty, and the data for the missed access is written into the cache block. When an LLC access is bypassed on a miss, only the tag array is accessed, eliminating the energy consumption of the data block accesses. We evaluate the LLC energy consumption using the McPAT simulator (32nm technology). Overall, HeLM saves 65% of the LLC dynamic energy compared to LRU on average. Speedup combined with the dynamic energy savings in HeLM results in total energy (static+dynamic) savings of 10.1% over LRU for the LLC.
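The following minimal sketch summarizes this per-access accounting; the field names and the lumping of a dirty writeback into a single energy term are simplifications of the sketch, and the actual per-access energies come from McPAT [14].

// Per-access LLC dynamic energy accounting used to compare policies.
struct LLCEnergyModel {
    double eTag;        // one tag-array lookup
    double eData;       // one data-array read or write
    double eWriteback;  // writing a dirty victim back to memory (lumped)

    double hit() const { return eTag + eData; }               // (i) tag + data
    double miss(bool victimDirty) const {                      // (ii) fill path
        return eTag + eData + (victimDirty ? eWriteback : 0.0);
    }
    double bypassedMiss() const { return eTag; }                // tag lookup only
};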
Allowing certain memory accesses to bypass the LLC can potentially increase off-chip DRAM accesses. In general, GPU applications show streaming data access behavior and experience high LLC miss rates. Hence, LLC bypassing does not significantly increase the off-chip accesses, as the accesses that miss the LLC would go off-chip to find the data in DRAM anyway. Overall, we observe a 5% increase in off-chip DRAM accesses due to LLC bypassing in HeLM. However, this does not increase DRAM energy consumption significantly, as the increase in off-chip accesses affects only the dynamic energy consumption of the DRAM module.

We observe that the change in energy consumption with LLC bypassing is a characteristic of the workload category. For example, categories with cache sensitive CPU applications (CSGS) whose performance is improved by HeLM show a significant reduction in their total energy consumption. On the other hand, categories with cache insensitive CPU applications (CIGS) that do not gain significant performance improvement show a slight increase in total energy consumption due to the increased off-chip DRAM accesses from the GPU.

In systems which utilize a significant amount of DRAM memory, such as data center systems, the increase in off-chip accesses due to bypassing could have a significant impact due to the increase in DRAM energy consumption and off-chip memory bandwidth. In such situations, HeLM should consider DRAM energy consumption and bandwidth utilization while making bypassing decisions.

F. Hardware Overhead

Table VI presents the hardware overhead for various cache management policies, including HeLM. While MAT and SDBP require a significant amount of storage to track the reuse of GPU LLC blocks, TAP and HeLM can be implemented using simple hardware counters. HeLM utilizes the MissHigh and MissLow counters to track GPU and CPU LLC access behavior. Also, instruction counters and core IDs are required to identify the cache sensitivity of the GPU application every sampling period. The TLP threshold register holds the TLP threshold selected by TSA. In summary, the hardware overhead for HeLM is comparable to that for TAP, and both of these mechanisms use significantly less additional hardware than MAT or SDBP.

  Policy     Hardware Overhead                                           Storage
  MAT [7]    4K-entry memory address table (each entry: 20-bit tag,      14.5 KB
             8-bit counter, 1 valid bit)
  SDBP [8]   4K-entry x 3 prediction tables (each entry: 2-bit           13.7 KB
             counter), sampler sets
  TAP [13]   Instruction counters (20-bit x 4 GPU cores),                120 bits
             core IDs (10-bit x 4 LLC tiles)
  HeLM       MissHigh and MissLow counters (20-bit each),                166 bits
             TLP threshold register (6-bit), instruction counters,
             core IDs

TABLE VI: Hardware overhead.
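For reference, and assuming the same per-core instruction counters (20 bits × 4 GPU cores) and per-tile core IDs (10 bits × 4 LLC tiles) as TAP, the HeLM total in Table VI follows as 2 × 20 + 6 + 4 × 20 + 4 × 10 = 166 bits.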
VI. RELATED WORK

The importance of the cache subsystem to application performance has produced a large body of research on cache management techniques. Due to space limitations, we discuss only the works that are closely related to ours.

A. Cache Management

Existing work on cache management for homogeneous multicore processors can be divided into two general categories: (i) cache partitioning techniques; and (ii) cache replacement policies.

1) Partitioning Techniques: Dynamic cache partitioning mechanisms aim to achieve their performance goal by dividing the cache ways among the applications at runtime. Suh et al. [24] introduce dynamic cache partitioning among threads executing on the same chip by utilizing hardware performance counters to maximize cache hits among threads. Moreto et al. [16] propose a dynamic cache partitioning mechanism that considers the memory-level parallelism of an application and the impact of cache misses on its performance. Utility-based cache partitioning (UCP) [21] tries to find the optimal cache partitioning by prioritizing applications on the basis of their benefit from the cache over their demand for the cache. Thrasher caging [27] re-evaluates cache partitioning mechanisms in the presence of one or more thrashing applications.
2) Replacement Policies: Cache replacement policies aim to identify the appropriate position at which to insert a new cache block, and to identify the right victim for replacement, in order to achieve their performance goal. Qureshi et al. propose the dynamic insertion policy (DIP) [19], which overcomes the impact of the thrashing behavior of certain applications on other applications in the workload. Jaleel et al. [6], on the other hand, utilize re-reference interval prediction (RRIP) to develop replacement policies for multicore processors that are both thrashing and scan resistant. PIPP [26] is a cache management technique that combines insertion and promotion policies to utilize the benefits of cache partitioning and adaptive insertion.

However, these mechanisms face a significant challenge in the presence of diverse cores sharing the LLC, and hence, they cannot be directly adopted in heterogeneous multicore processors. A recent work, TAP [13], adapts UCP and RRIP for heterogeneous multicore processors. However, HeLM outperforms these policies, as discussed in Section V-B.

B. Dead Block Predictors

Reuse-based cache management mechanisms, often referred to as dead-block predictors, have been proposed in prior work [7–9, 12, 15]. Lai et al. propose a dead-block predictor to prefetch data into the L1 data cache [12]. While Kharbutli et al. propose counting-based dead-block predictors [9] that consider the number of accesses to a cache block, Cache Bursts [15] observes references to a cache block at the MRU position to make dead-block predictions. All prior work has only addressed the dead-block prediction issue for CPU workloads. These mechanisms cannot be directly adopted for heterogeneous workloads as they prove to be overly conservative towards the GPU.

VII. CONCLUSIONS

The growing importance of data-parallel accelerator cores, such as the GPU, has led to their integration with CPU cores on the same die. Such architectures with heterogeneous processing cores present a significant challenge to the optimal sharing of on-chip resources such as the LLC. Our heterogeneous LLC management mechanism, HeLM, monitors the TLP available in the GPU application, and uses this information to throttle GPU LLC accesses when the application has enough TLP to sustain longer memory access latency. This in turn provides an increased share of the LLC to the CPU application, thus improving its performance. HeLM monitors the cache sensitivity of both CPU and GPU applications in heterogeneous workloads, and achieves LLC sharing that improves overall system performance.

We evaluate HeLM against: (i) existing shared LLC management techniques (LRU, DRRIP); (ii) reuse-based bypassing mechanisms (MAT, SDBP); and (iii) the only technique proposed for heterogeneous multicores (TAP). HeLM outperforms all these mechanisms in overall system performance. HeLM improves over the LRU policy by 7.7% and outperforms TAP by 4.9% for 1C4G workloads. HeLM scales well with increasing processor count and outperforms TAP by 5.7% and 5.6% for 2C4G and 4C4G workloads, respectively.

ACKNOWLEDGMENT

This work is supported in part by National Science Foundation grants CCF-0916583 and CPS-0931931. We would like to thank all anonymous reviewers for their constructive comments that helped to improve the quality of this paper. We would also like to thank Ragavendra Natarajan, Jieming Yin, and Carl Sturtivant for their suggestions to improve the paper.

REFERENCES

[1] AMD Accelerated Parallel Processing (APP) Software Development Kit (SDK). http://developer.amd.com/sdks/amdappsdk/.
[2] Advanced Micro Devices Incorporated. Evergreen Family Instruction Set Architecture. http://goo.gl/WQ5lE.
[3] Brookwood, N. AMD Fusion Family of APUs: Enabling a Superior, Immersive PC Experience. AMD White Paper (2010).
[4] Hamerly, G., Perelman, E., Lau, J., and Calder, B. SimPoint 3.0: Faster and more flexible program analysis. In Journal of Instruction Level Parallelism (2005).
[5] Intel Corporation. Intel Sandy Bridge Microarchitecture. http://www.intel.com.
[6] Jaleel, A., Theobald, K. B., Steely, Jr., S. C., and Emer, J. High performance cache replacement using re-reference interval prediction (RRIP). In ISCA (2010).
[7] Johnson, T., Connors, D., Merten, M., and Hwu, W.-M. Run-time cache bypassing. IEEE Transactions on Computers (1999).
[8] Khan, S. M., Tian, Y., and Jimenez, D. A. Sampling dead block prediction for last-level caches. In MICRO (2010).
[9] Kharbutli, M., and Solihin, Y. Counter-based cache replacement and bypassing algorithms. IEEE Transactions on Computers (2008).
[10] Khronos Group. OpenCL - The open standard for parallel programming of heterogeneous systems. http://www.khronos.org/opencl/.
[11] Kim, S., Chandra, D., and Solihin, Y. Fair cache sharing and partitioning in a chip multiprocessor architecture. In PACT (2004).
[12] Lai, A.-C., Fide, C., and Falsafi, B. Dead-block prediction & dead-block correlating prefetchers. In ISCA (2001).
[13] Lee, J., and Kim, H. TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture. In HPCA (2012).
[14] Li, S., Ahn, J. H., Strong, R. D., Brockman, J. B., Tullsen, D. M., and Jouppi, N. P. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In MICRO (2009).
[15] Liu, H., Ferdman, M., Huh, J., and Burger, D. Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency. In MICRO (2008).
[16] Moreto, M., Cazorla, F., Ramirez, A., and Valero, M. MLP-aware dynamic cache partitioning. In HiPEAC (2008).
[17] Nvidia Corporation. NVIDIA CUDA C Programming Guide. http://www.nvidia.com.
[18] Nvidia Corporation. Nvidia Project Denver. http://goo.gl/Sbjbb.
[19] Qureshi, M. K., Jaleel, A., Patt, Y. N., Steely, S. C., and Emer, J. Adaptive insertion policies for high performance caching. In ISCA (2007).
[20] Qureshi, M. K., Lynch, D. N., Mutlu, O., and Patt, Y. N. A case for MLP-aware cache replacement. In ISCA (2006).
[21] Qureshi, M. K., and Patt, Y. N. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In MICRO (2006).
[22] Salapura, V., Blumrich, M., and Gara, A. Design and implementation of the Blue Gene/P snoop filter. In HPCA (2008).
[23] Spradling, C. D. SPEC CPU2006 benchmark tools. SIGARCH Computer Architecture News (2007).
[24] Suh, G., Rudolph, L., and Devadas, S. Dynamic partitioning of shared cache memory. The Journal of Supercomputing (2004).
[25] Ubal, R., Jang, B., Mistry, P., Schaa, D., and Kaeli, D. Multi2Sim: A simulation framework for CPU-GPU computing. In PACT (2012).
[26] Xie, Y., and Loh, G. H. PIPP: Promotion/insertion pseudo-partitioning of multi-core shared caches. In ISCA (2009).
[27] Xie, Y., and Loh, G. H. Scalable shared-cache management by containing thrashing workloads. In HiPEAC (2010).
