Abstract—Heterogeneous multicore processors that integrate CPU cores and data-parallel accelerators such as GPU cores onto the same die raise several new issues for sharing various on-chip resources. The shared last-level cache (LLC) is one of the most important shared resources due to its impact on performance. Accesses to the shared LLC in heterogeneous multicore processors can be dominated by the GPU due to the significantly higher number of threads supported. Under current cache management policies, the CPU applications' share of the LLC can be significantly reduced in the presence of competing GPU applications. For cache sensitive CPU applications, a reduced share of the LLC could lead to significant performance degradation. On the contrary, GPU applications can often tolerate increased memory access latency in the presence of LLC misses when there is sufficient thread-level parallelism.

In this work, we propose Heterogeneous LLC Management (HeLM), a novel shared LLC management policy that takes advantage of the GPU's tolerance for memory access latency. HeLM is able to throttle GPU LLC accesses and yield LLC space to cache sensitive CPU applications. GPU LLC access throttling is achieved by allowing GPU threads that can tolerate longer memory access latencies to bypass the LLC. The latency tolerance of a GPU application is determined by the availability of thread-level parallelism, which can be measured at runtime as the average number of threads that are available for issuing. Our heterogeneous LLC management scheme outperforms the LRU policy by 12.5% and TAP-RRIP by 5.6% for a processor with 4 CPU and 4 GPU cores.

Index Terms—heterogeneous multicores, shared last-level cache, cache management policy.
I. INTRODUCTION

Advances in semiconductor technology and the urgent need for energy efficient computation have facilitated the integration of computational cores that are heterogeneous in nature onto the same die. Data-parallel accelerators such as Graphics Processing Units (GPUs) are the most popular among the accelerator cores used in such designs. With easy-to-adopt programming models such as Nvidia CUDA [17] and OpenCL [10], these data-parallel cores are now being employed to accelerate diverse workloads. The availability of heterogeneous multicore systems such as AMD Fusion [3], Intel Sandy Bridge [5], and Nvidia Denver [18] suggests that multicore designs with heterogeneous processing elements are becoming the mainstream. Diversity in the performance characteristics of these computational cores presents a unique set of challenges in designing such heterogeneous multicore processors.

One of the key challenges in designing heterogeneous multicore systems is the sharing of on-chip resources such as the last-level cache (LLC), since integrating CPU and GPU cores onto the same die leads to competition in the LLC that does not exist in homogeneous systems. First, the difference in cache sensitivity among diverse cores implies a difference in the performance benefit obtained from owning the same amount of cache space. Second, GPU cores with a large number of threads can potentially dominate accesses to the LLC and, consequently, skew existing cache sharing policies in favor of the GPU cores. As a result, GPU cores occupy an unfair share of the LLC under existing policies. Prior work has shown that judicious sharing of the LLC can improve the overall performance of diverse workloads in homogeneous multicore systems [6, 11, 16, 19, 21, 24, 26, 27]. However, it is unclear whether these techniques can be adopted by heterogeneous multicore processors.

Dynamic Re-Reference Interval Prediction (DRRIP) [6] is a cache management policy developed primarily for homogeneous multicore processors. DRRIP predicts the re-reference (reuse) interval of cache lines to be either intermediate or distant, and inserts lines at a non-MRU (Most Recently Used) position based on the re-reference prediction. If a line is reused after insertion into the LLC, it is promoted by increasing its age to improve its lifetime in the cache. Non-MRU insertion of cache lines performs better than MRU insertion because most lines do not observe immediate re-reference.
Figure 1 shows the performance of the Least Recently Used (LRU) and DRRIP policies in a heterogeneous execution environment where 401.bzip2 (from the SPEC CPU2006 benchmark suite) executing on a single CPU core shares a 2MB LLC with a GPU benchmark (listed in Table IV) executing on four GPU cores. Details of the experiment are provided in Section IV. Figure 1(a) shows the average LLC occupancy, and Figure 1(b) shows the average LLC misses per kilo-instruction (MPKI) as well as the normalized IPC of the CPU application across all GPU benchmarks. Since 401.bzip2 is cache sensitive, while most of the GPU applications are not, it is desirable to allocate a larger share of the LLC to the CPU application. However, we observe that a major chunk of the LLC is occupied by the GPU application. This leads to significant performance degradation for the CPU application under the LRU policy. DRRIP provides little performance improvement as it is overwhelmed by an order of magnitude difference between the memory access rates of the CPU and the GPU cores.

Fig. 1: The performance impact on a CPU core of sharing the LLC with four GPU cores. 401.bzip2 executes on the CPU core. The performance impact is measured across the set of GPU benchmarks shown in Table IV. (a) Cache occupancy of the CPU and the GPU cores. (b) Misses per kilo-instruction (MPKI) and normalized IPC for 401.bzip2. IPC is normalized to the IPC of 401.bzip2 executing on the heterogeneous processor without interference from the GPU cores.

We are aware of only one existing work, TLP-Aware Cache Management Policy (TAP) [13], that addresses the diversity of on-chip cores while designing the LLC sharing policy. TAP identifies the cache sensitivity of the GPU application and the difference in LLC access rate between the CPU and GPU cores. This information is then used to influence the decisions made by the underlying cache management policy. When these metrics indicate a cache sensitive GPU application, both cores are given equal priority; whereas, when the GPU application is found to be cache insensitive, the GPU core is given lower priority in the underlying policy.

However, Figure 1(a) indicates that a large portion of the cache is still allocated to the cache-insensitive GPU application under TAP, and consequently, the performance degradation due to LLC sharing is still significant for the CPU application. Several reasons prevent TAP from achieving the desired performance. First, the core sampling technique used in TAP to measure the cache sensitivity of the GPU application leaves a significant number of GPU dead blocks in the LLC. Second, TAP takes the same decision for all GPU memory accesses in a sampling period, and is slow to adapt to runtime variations in the application's behavior. A more fine-grained control over the GPU LLC share could potentially improve the utilization of the shared LLC. We discuss TAP in detail in Section V-B.

The GPU core can support thousands of active threads simultaneously. Thus, the thread-level parallelism (TLP) available to the GPU core is orders of magnitude higher than that available to the CPU core. This higher level of TLP aids the GPU core in tolerating longer memory access latency by scheduling threads that are ready to execute. Our experiments show that the majority of the GPU applications we study have a high level of memory access latency tolerance. Taking this latency tolerance into consideration, we propose Heterogeneous LLC Management (HeLM), a mechanism for managing the shared LLC in heterogeneous multicore processors.

HeLM is a cache replacement policy that improves the effectiveness of the shared LLC in a heterogeneous environment by utilizing the available TLP in GPU applications. Under the HeLM policy, GPU LLC accesses are throttled by allowing memory accesses to selectively bypass the LLC; consequently, the cache sensitive CPU application is able to utilize a larger portion of the cache.

Overall, the contributions of this work are as follows:
• We analyze GPU application characteristics in Section II, and identify available TLP and LLC bypassing as two significant factors in managing the shared LLC in a heterogeneous multicore processor.
• We propose a runtime mechanism, in Section III, that dynamically determines the cache sensitivity of both CPU and GPU applications, and adapts the cache management policy based on this information.
• We evaluate the proposed mechanism in Section V, and demonstrate performance improvement over previously proposed cache management policies.
II. CHALLENGES & OPPORTUNITIES

The heterogeneous multicore architecture we address in this work is depicted in Figure 2. This design is modelled after Intel Sandy Bridge [5]. The processor consists of several CPU and GPU cores, each with its own private cache. These cores share the LLC and the DRAM controllers, and the modules communicate through an on-chip interconnection network. Efficient sharing of on-chip resources is critical to the performance of a multicore processor. The last-level cache is one of the most important among these resources.

Fig. 2: Architecture of a heterogeneous multicore processor with CPU and GPU cores sharing the LLC.

Existing cache management policies, developed for homogeneous multicore processors, face difficulty in adapting to heterogeneous multicores. These mechanisms [6, 11, 16, 19, 21, 24, 26, 27] do not consider the diversity of core characteristics in their design. While many mechanisms [6, 19, 21, 26, 27] consider the cache sensitivity of the application, they do not consider the difference in LLC access rate between the diverse cores in a heterogeneous multicore. An order of magnitude higher access rate from the GPU core, compared to the CPU core, tends to skew their judgement in favor of the GPU.

A. Challenges in LLC sharing

Sharing of the LLC among cores in a heterogeneous multicore processor introduces several challenges. This section presents the challenges in determining the cache sensitivity of the various applications executing on individual cores, and in devising cache management mechanisms that can cope with cores having widely divergent memory access patterns.

(i) Cache Sensitivity: Cache sensitivity indicates how much the performance of an application can benefit from an increase in cache capacity. Cache management policies can utilize cache sensitivity as a metric to determine how to best share cache capacity between cores. In CPU-based homogeneous multicore systems, this issue has been extensively studied. Techniques such as set dueling [19] have been demonstrated to be effective in improving cache utilization. It is worth pointing out that, in such CPU-based systems [6, 19, 21, 26, 27], cache sensitivity is often measured in terms of variations in cache miss rates across different cores as the cache capacity allocation varies [21, 27] or as the cache replacement policy changes [6, 19, 26]. While for CPU cores a change in cache miss rate is a direct indicator of cache sensitivity, for GPU cores an increase in cache miss rate does not necessarily lead to performance degradation. This is because GPU cores can tolerate memory access latency by context switching between a large number of concurrently active threads. Thus, cache miss rate is not a good indicator of the cache sensitivity of the GPU. GPU-specific techniques must be developed to determine the cache sensitivity of GPU workloads.

(ii) Cache Management Mechanisms: There are two classes of techniques for managing shared caches: i) partitioning the cache ways among applications; and ii) prioritizing the insertion/eviction of blocks from different programs. When workloads with differing cache sensitivities share a cache, one of these techniques could be employed to enhance cache utilization and maximize overall performance. However, when a GPU workload shares the cache with a CPU, previously proposed mechanisms are often unable to make judicious decisions because GPU workloads often have memory access rates that are an order of magnitude higher than those of CPU workloads. In particular, when both CPU and GPU workloads are identified as cache sensitive, the memory accesses from the GPU will pollute the shared cache and wipe out cache blocks needed by the CPU. In such cases, it is desirable to give cache sensitive CPU workloads higher priority over cache sensitive GPU workloads to improve the overall performance.

B. Improving LLC sharing

In this work, we propose to address the challenges faced by existing cache management techniques in a heterogeneous multicore environment. First, we propose to use TLP as a runtime metric to correctly identify the cache sensitivity of GPU applications. Second, we propose to use LLC bypassing to improve cache management in the heterogeneous environment.

TLP as a Runtime Metric: The general cache insensitivity of GPU applications stems from two main reasons: i) streaming memory access behavior; and ii) high levels of available TLP. Even when the memory access behavior is not streaming, GPU applications are able to tolerate higher memory access latency using the available TLP. Figure 3 shows the impact of bypassing the shared LLC for 100%, 75%, 50%, and 25% of GPU memory access requests. (For 75%, 50%, and 25% bypassing, we randomly choose the GPU accesses to bypass. The values shown are averages for the GPU benchmarks in Table IV, executing on 4 GPU cores with the specifications listed in Table II.) We observe that, on average, GPU applications can sustain up to 75% LLC access bypassing without significant performance degradation. The GPU is able to do so by utilizing its high degree of TLP.
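The random-bypass sensitivity study above needs only a small amount of logic at the LLC controller. The sketch below illustrates one possible form of it; the class name, the random-number-generator choice, and the idea of tagging requests as GPU-originated are illustrative assumptions rather than details taken from the paper or from any particular simulator.

// Sketch of the random LLC-bypass experiment: a fixed fraction of
// GPU-originated requests skips the shared LLC and goes straight to DRAM.
// Names and structure are illustrative, not an actual simulator API.
#include <random>

class RandomBypassFilter {
public:
    explicit RandomBypassFilter(double bypass_fraction, unsigned seed = 1)
        : fraction_(bypass_fraction), rng_(seed), dist_(0.0, 1.0) {}

    // Returns true if this access should bypass the shared LLC.
    bool shouldBypass(bool is_gpu_request) {
        if (!is_gpu_request) return false;   // CPU accesses always use the LLC
        return dist_(rng_) < fraction_;      // e.g. 0.25, 0.50, 0.75, 1.00
    }

private:
    double fraction_;
    std::mt19937 rng_;
    std::uniform_real_distribution<double> dist_;
};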
Fig. 3: Impact of random LLC bypassing on GPU benchmarks. The graph shows the IPC of each configuration relative to the native run (without LLC bypassing).

These characteristics point to the fact that the TLP available in a GPU application is a good indicator of its cache sensitivity, and hence could aid in promoting an effective sharing of the LLC among cores. Moreover, TLP is a true runtime metric that adapts to the dynamic behavior of the GPU application. To the best of our knowledge, no other work has directly utilized TLP as a metric to manage the shared LLC in heterogeneous multicore processors. We observe that while mechanisms such as set dueling are not able to identify the true cache sensitivity of GPU applications, TLP forms an accurate metric for the same.

LLC Bypassing: To improve the flexibility of cache management mechanisms in a heterogeneous multicore processor, we explore LLC bypassing techniques. LLC bypassing allows potentially different decisions for each incoming GPU access. When both the CPU and GPU applications are identified as cache sensitive, the mechanism can consider various application characteristics while making the bypass decisions. Such characteristics include the difference in cache sensitivities of the CPU and GPU applications, and the amount of TLP available in the GPU application. Such fine-grained throttling of each LLC access can bring significant performance improvement as a result of better LLC utilization.

Additionally, bypassing of unnecessary blocks improves the dynamic energy performance of the LLC. For cache insensitive applications, the blocks in the LLC are mostly dead (zero reuse before eviction). Hence, these applications end up accessing off-chip DRAM for such blocks whether or not the blocks are stored in the LLC. Thus, the bypassing technique can improve the LLC dynamic energy performance without a significant increase in off-chip DRAM accesses.

C. Workload Characteristics

A rethinking of the optimal sharing of on-chip resources in a heterogeneous multicore processor necessitates a reevaluation of the characteristics of the target workload. We analyze the performance characteristics of general purpose GPU applications for varying LLC sizes. Based on instructions per cycle (IPC) and misses per kilo-instruction (MPKI) characteristics, we broadly classify these applications as either cache sensitive or cache insensitive. Cache insensitive applications can further be classified into three types: compute intensive, performance insensitive, or streaming. The first type puts very little pressure on the shared LLC. The last two types have a high LLC access rate; however, cache size has hardly any influence on their performance. This characteristic is due either to their streaming access behavior or to their TLP availability.

Fig. 4: LLC sensitivity of GPU benchmarks. Four different classes of GPU benchmarks (Compute Intensive, Cache Sensitive, Performance Insensitive, and Streaming) are shown with their IPC and MPKI for different LLC sizes (in KB).

Figure 4 shows the IPC and MPKI characteristics for these categories. This characterization can be used to evaluate the effectiveness of LLC space on the application, and it shows that not all GPGPU applications have a uniform streaming access behavior. There are several cache sensitive applications that need special attention while managing the LLC. These characteristics assume increased importance when GPU applications share the LLC with CPU applications in a heterogeneous multicore processor. Benchmarks belonging to each class are shown in Table IV. We utilize this benchmark characterization to form the workload mix in Section IV-B. Section V-D evaluates HeLM based on these application classes.
III. HETEROGENEOUS LLC MANAGEMENT

In this section, we describe our heterogeneous LLC management mechanism, which mitigates the performance impact of LLC sharing by throttling LLC accesses initiated by the GPU cores. HeLM exploits the memory access latency tolerance of the GPU cores and allows them to yield LLC space to the cache sensitive CPU cores without significantly degrading their own performance. In HeLM, we manage the LLC occupancy of the GPU cores by allowing the GPU memory traffic to selectively bypass the LLC when: i) the GPU cores exhibit sufficient TLP to tolerate memory access latency; or ii) the GPU application is not sensitive to LLC performance.

For each GPU memory access, the bypassing decision is made at the shared LLC. On an L1 cache miss, the TLP information of the GPU core is attached to the LLC access request. The available TLP at runtime is measured using hardware performance monitors that count the number of wavefronts ready to be scheduled at any given time. (Work is allocated to the GPU cores as kernels that contain a large number of threads. A kernel is further partitioned and mapped to different GPU cores as thread-blocks or workgroups. Scalar threads within each GPU core are scheduled simultaneously as warps [17] or wavefronts [10] onto the SIMD computing engine.) A higher number of ready wavefronts indicates higher TLP, which in turn suggests that the GPU can tolerate higher memory access latency.
The cache sensitivity of the CPU and GPU applications plays a critical role in making bypassing decisions. A cache insensitive CPU application does not benefit from the increased LLC space made available by GPU LLC bypassing. Bypassing the LLC for a cache sensitive GPU application executing alongside such cache insensitive CPU applications could degrade GPU performance without improving the overall performance.

In the following subsections, we discuss in detail the techniques employed to identify: i) the cache sensitivities of the CPU and GPU applications; and ii) an effective TLP threshold to measure the memory access latency tolerance of the GPU application. We combine these metrics into a Threshold Selection Algorithm (TSA) that makes the GPU LLC bypassing decisions.

A. Measuring the LLC Sensitivity

We employ a mechanism based on the set dueling [19] technique to measure the cache sensitivity of the CPU and GPU applications. Set dueling applies two opposing techniques to two distinct groups of sets, and identifies the characteristic of the application from the performance difference between them. We apply set dueling in two different ways to the CPU cores and the GPU cores.

1) GPU LLC Sensitivity: To measure the GPU LLC sensitivity, we utilize two GPU sampling cores and two TLP thresholds: LowThr and HighThr. In every sampling period, one of the GPU cores (LowGPU) performs LLC bypassing at LowThr, while the other core (HighGPU) uses HighThr. LowThr is always smaller than HighThr and therefore induces a higher rate of bypassing; hence, LowGPU bypasses more memory accesses than HighGPU. A significant performance (IPC) difference between these two cores (∆IPC_GPU), greater than pThreshold (set to 5% based on empirical analysis), indicates that LLC bypassing is having an adverse impact on GPU performance and hence that the GPU application is cache sensitive. If the performance difference is within the limit, the GPU application is considered cache insensitive.
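The end-of-period classification from the two sampling cores can be expressed compactly as below. The pThreshold value of 5% follows the text; the IPC bookkeeping and the relative-gap computation are illustrative assumptions.

// Sketch: classifying GPU cache sensitivity from the two sampling cores.
// LowGPU bypasses aggressively (LowThr), HighGPU conservatively (HighThr).
struct GpuSensitivityMonitor {
    double p_threshold = 0.05;   // relative IPC gap treated as significant

    // ipc_low / ipc_high: IPC of LowGPU and HighGPU over the sampling period.
    bool isCacheSensitive(double ipc_low, double ipc_high) const {
        // More bypassing should only hurt if the application needs the LLC.
        double delta_ipc = (ipc_high - ipc_low) / ipc_high;
        return delta_ipc > p_threshold;
    }
};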
2) CPU LLC Sensitivity: We evaluate the cache sensitivity of the CPU application by monitoring the impact of GPU LLC bypassing on the performance of the CPU application. Since CPU applications are more cache sensitive than GPU applications, a change in cache misses directly affects the performance of a CPU application. We measure two CPU LLC miss counts, MissLow and MissHigh, corresponding to GPU bypassing at LowThr and HighThr respectively. Two set dueling monitors (SDM) are used at the LLC to obtain the MissLow and MissHigh numbers, each bypassing GPU accesses at LowThr and HighThr respectively. For these SDMs, the bypassing decision is made using the SDM's own TLP threshold, irrespective of which GPU core initiated the access.

Since the GPU takes more LLC space with HighThr than with LowThr, MissHigh is always greater than MissLow. If the difference between MissHigh and MissLow (∆MISS_CPU) is greater than mThreshold, GPU bypassing is affecting the CPU LLC behavior, and hence the CPU's performance. This criterion can also identify compute intensive as well as streaming CPU workloads. Dynamic Set Sampling (DSS) [20] has shown that sampling a small number of sets in the LLC can indicate the cache access behavior with high accuracy. We use this technique by sampling 32 sets (out of 4096) to measure the cache sensitivity.
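The sketch below illustrates the CPU-side check, assuming each SDM owns a small group of sampled sets and a pair of miss counters. The choice of 32 sampled sets out of 4096 follows the text; the specific set-selection rule and the mThreshold value are assumptions made only for illustration.

// Sketch: CPU-side sensitivity check using two set dueling monitors (SDMs)
// over sampled LLC sets.
#include <cstdint>

struct CpuSensitivityMonitor {
    uint64_t miss_low = 0;    // CPU misses in sets where GPU bypasses at LowThr
    uint64_t miss_high = 0;   // CPU misses in sets where GPU bypasses at HighThr
    uint64_t m_threshold;     // allowed growth in CPU misses (assumed value)

    // One simple way to dedicate 32 of 4096 sets to each SDM: pick sets by
    // the low-order bits of the set index (every 128th set).
    static bool inLowSdm(uint32_t set_index)  { return (set_index & 0x7F) == 0; }
    static bool inHighSdm(uint32_t set_index) { return (set_index & 0x7F) == 1; }

    void onCpuMiss(uint32_t set_index) {
        if (inLowSdm(set_index))  ++miss_low;
        if (inHighSdm(set_index)) ++miss_high;
    }

    // The CPU is considered cache sensitive when keeping more GPU data in
    // the LLC (HighThr) costs it noticeably more misses than LowThr does.
    bool isCacheSensitive() const { return miss_high > miss_low + m_threshold; }
};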
B. Determining Effective TLP Threshold

Determining the effective TLP threshold at which to initiate GPU LLC bypass is critical. To adapt to the diversity among GPU applications, and to the runtime variations within an application itself, we propose an algorithm to dynamically determine LowThr and HighThr. Our heuristic is inspired by the binary chop algorithm commonly used for searching for an element in a sorted list bounded by the limits MaxLimit and MinLimit. The algorithm starts with two parameters U and L such that U ≥ L, and calculates a decision element E as the average of U and L (AVG(U, L)). At the beginning of the algorithm, U and L are initialized to MaxLimit and MinLimit, respectively, and a prediction is made. If the prediction is lower than expected, the search window is moved up (GO UP) by updating U and L as shown in Table I. If the prediction is higher than expected, the search window is moved down (GO DOWN). At each step, E is recalculated, and the process continues until E matches the searched element.

TABLE I: Binary chop algorithm for adapting the sampling thresholds at runtime.

  Action   | U                 | L
  INIT     | MaxLimit          | MinLimit
  GO UP    | AVG(MaxLimit, U)  | E
  GO DOWN  | E                 | AVG(L, MinLimit)

Our adaptation of the binary chop algorithm recomputes the higher and lower bypass thresholds at runtime depending upon the application behavior. We start by initializing HighThr = U = (3/4) × MAX_wavefronts and LowThr = L = (1/4) × MAX_wavefronts. Here, MaxLimit = MAX_wavefronts and MinLimit = MIN_wavefronts. After every sampling period, the HighThr and LowThr values are updated with the new values of U and L, respectively.
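The transitions in Table I map directly onto a small amount of state. The sketch below encodes them; the integer averaging and the initial 3/4 and 1/4 splits follow the text above, while the type and method names are illustrative.

// Sketch of the binary-chop threshold adaptation from Table I.
// U and L track HighThr and LowThr; E = AVG(U, L).
#include <cstdint>

struct ThresholdChop {
    uint32_t max_limit, min_limit;   // MAX_wavefronts, MIN_wavefronts
    uint32_t U, L;

    ThresholdChop(uint32_t max_wf, uint32_t min_wf)
        : max_limit(max_wf), min_limit(min_wf),
          U(3 * max_wf / 4), L(max_wf / 4) {}

    uint32_t E() const { return (U + L) / 2; }

    // Less aggressive bypassing (the GPU turned out to be cache sensitive).
    void goUp()   { uint32_t e = E(); U = (max_limit + U) / 2; L = e; }

    // More aggressive bypassing (the GPU tolerates the extra latency).
    void goDown() { uint32_t e = E(); U = e; L = (L + min_limit) / 2; }

    uint32_t highThr() const { return U; }
    uint32_t lowThr()  const { return L; }
};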
Algorithm 1: Pseudocode for the Threshold Selection Algorithm (TSA).

  Data: ∆IPC_GPU, ∆MISS_CPU
  Result: Bypass TLP threshold
  if ∆MISS_CPU ≥ mThreshold then
      Set LowThr as the TLP threshold;
  else if ∆IPC_GPU > pThreshold then
      Set HighThr as the TLP threshold;
  else
      if delta(∆IPC_GPU, pThreshold) ≥ delta(∆MISS_CPU, mThreshold) then
          Set LowThr as the TLP threshold;
      else
          Set HighThr as the TLP threshold;
      end
  end

The pseudocode for TSA is shown in Algorithm 1. If the CPU application is cache sensitive, LowThr is selected as the TLP threshold for bypassing GPU memory accesses around the LLC. Otherwise, the choice of threshold depends on the characteristics of the GPU application: if the GPU application is cache sensitive, HighThr is selected. Although the threshold selected when both the CPU and GPU applications are identified as cache insensitive does not impact performance significantly, it could have a significant impact on the off-chip DRAM access rate. In such a case, we take a conservative approach by evaluating which metric is nearer to its limit and selecting the TLP threshold accordingly.

Based on the threshold selected by TSA, HighThr and LowThr are recalculated using the binary chop heuristic discussed in Section III-B. For cache sensitive GPU applications, the LLC bypassing aggressiveness is reduced by action GO UP; otherwise, the aggressiveness is increased by action GO DOWN. If the actions toggle between GO UP and GO DOWN in consecutive sampling periods, we maintain the existing HighThr and LowThr values for the next five sampling periods.
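A compact rendering of TSA is sketched below. The first two cases mirror Algorithm 1 directly; the tie-break interprets delta(x, limit) as the normalized distance of a metric from its limit, which is our reading of the pseudocode rather than a detail spelled out in the text, and the parameter packaging is illustrative.

// Sketch of the Threshold Selection Algorithm (TSA).
enum class TlpThreshold { Low, High };

struct TsaInputs {
    double delta_ipc_gpu;    // relative IPC gap between HighGPU and LowGPU
    double p_threshold;      // e.g. 0.05
    double delta_miss_cpu;   // MissHigh - MissLow over the sampling period
    double m_threshold;      // assumed limit on extra CPU misses
};

TlpThreshold selectThreshold(const TsaInputs& in) {
    if (in.delta_miss_cpu >= in.m_threshold)   // CPU is cache sensitive
        return TlpThreshold::Low;              // bypass more GPU traffic
    if (in.delta_ipc_gpu > in.p_threshold)     // GPU is cache sensitive
        return TlpThreshold::High;             // bypass less GPU traffic

    // Both insensitive: pick the threshold suggested by whichever metric is
    // nearer to its limit, to keep off-chip DRAM traffic in check.
    double ipc_gap  = (in.p_threshold - in.delta_ipc_gpu) / in.p_threshold;
    double miss_gap = (in.m_threshold - in.delta_miss_cpu) / in.m_threshold;
    return (ipc_gap >= miss_gap) ? TlpThreshold::Low : TlpThreshold::High;
}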
CPU cache management policies have employed thread awareness to avoid the domination of the sharing policy by a single application. Mechanisms such as thread-aware DRRIP [6] (referred to simply as DRRIP in this paper) utilize a separate set of SDMs to isolate the influence of applications on each other. Similarly, HeLM is made thread aware by assigning individual MissLow and MissHigh counters to calculate ∆MISS_CPU for each thread. For thread awareness, TSA selects LowThr as the TLP threshold if the ∆MISS_CPU of any thread is ≥ mThreshold.
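A thread-aware variant of the CPU-side monitor can be sketched as below, with one MissLow/MissHigh pair per CPU thread and LowThr selected if any single thread appears cache sensitive. The container choice and sizing are illustrative assumptions.

// Sketch: thread-aware CPU sensitivity check.
#include <cstdint>
#include <vector>

struct PerThreadMisses { uint64_t miss_low = 0, miss_high = 0; };

class ThreadAwareCpuMonitor {
public:
    ThreadAwareCpuMonitor(size_t num_cpu_threads, uint64_t m_threshold)
        : counters_(num_cpu_threads), m_threshold_(m_threshold) {}

    void onCpuMiss(size_t thread_id, bool in_low_sdm, bool in_high_sdm) {
        if (in_low_sdm)  ++counters_[thread_id].miss_low;
        if (in_high_sdm) ++counters_[thread_id].miss_high;
    }

    // LowThr is selected if any thread's miss growth exceeds mThreshold.
    bool anyThreadCacheSensitive() const {
        for (const auto& c : counters_)
            if (c.miss_high > c.miss_low + m_threshold_) return true;
        return false;
    }

private:
    std::vector<PerThreadMisses> counters_;
    uint64_t m_threshold_;
};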