This article has been accepted for publication in a future issue of the IEEE Journal of Selected Topics in Signal Processing, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JSTSP.2017.2787979.
Abstract—Small basestations (SBs) equipped with caching units have the potential to handle the unprecedented demand growth in heterogeneous networks. Through low-rate backhaul connections with the backbone, SBs can prefetch popular files during off-peak traffic hours, and serve them to the edge at peak periods. To prefetch intelligently, each SB must learn what and when to cache, while taking into account SB memory limitations, the massive number of available contents, the unknown popularity profiles, as well as the space-time popularity dynamics of user file requests. In this work, local and global Markov processes model user requests, and a reinforcement learning (RL) framework is put forth for finding the optimal caching policy when the transition probabilities involved are unknown. Joint consideration of global and local popularity demands along with cache-refreshing costs allows for a simple yet practical asynchronous caching approach. The novel RL-based caching relies on a Q-learning algorithm to implement the optimal policy in an online fashion, thus enabling the cache control unit at the SB to learn, track, and possibly adapt to the underlying dynamics. To endow the algorithm with scalability, a linear function approximation of the proposed Q-learning scheme is introduced, offering faster convergence as well as reduced complexity and memory requirements. Numerical tests corroborate the merits of the proposed approach in various realistic settings.

Index Terms—Caching, dynamic popularity profile, reinforcement learning, Markov decision process (MDP), Q-learning.

I. INTRODUCTION

The advent of smart phones, tablets, mobile routers, and a massive number of devices connected through the Internet of Things (IoT) has led to an unprecedented growth in data traffic. The increased number of users trending towards video streams, web browsing, social networking, and online gaming has urged providers to pursue new service technologies that offer acceptable quality of experience (QoE). One such technology entails network densification by deploying small pico- and femto-cells, each serviced by a low-power, low-coverage, small basestation (SB). In this infrastructure, referred to as a heterogeneous network (HetNet), SBs are connected to the backbone by a cheap "backhaul" link. While boosting the network density by substantial reuse of scarce resources, e.g., frequency, the HetNet architecture is restrained by its low-rate, unreliable, and relatively slow backhaul links [1].

During peak traffic periods, especially when electricity prices are also high, weak backhaul links can easily become congested, an effect lowering the QoE for end users. One approach to mitigate this limitation is to shift the excess load from peak periods to off-peak periods. Caching realizes this shift by fetching the "anticipated" popular contents, e.g., reusable video streams, during off-peak periods, storing this data in SBs equipped with memory units, and reusing them during peak traffic hours [2]-[4]. To utilize the caching capacity intelligently, a content-agnostic SB must rely on available observations to learn what and when to cache. To this end, machine learning tools can provide 5G cellular networks with efficient caching, in which a "smart" caching control unit (CCU) can learn, track, and possibly adapt to the space-time popularities of reusable contents [2], [5].

Prior work. Existing efforts in 5G caching have focused on enabling SBs to learn unknown time-invariant content popularity profiles, and cache the most popular ones accordingly. A multi-armed bandit approach is reported in [6], where a reward is received when user requests are served via the cache; see also [7] for a distributed, coded, and convexified reformulation. A belief propagation-based approach for distributed and collaborative caching is investigated in [8]. Beyond [6]-[8], which deal with deterministic caching, [9] and [10] introduce probabilistic alternatives. Caching, routing, and video encoding are jointly pursued in [11] for users having different QoE requirements. However, a limiting assumption in [6]-[11] pertains to the space-time invariant modeling of popularities, which can only serve as a crude approximation of real-world requests. Indeed, temporal dynamics of local requests are prevalent due to user mobility, as well as the emergence of new contents and the aging of older ones. To accommodate dynamics, Ornstein-Uhlenbeck processes and Poisson shot noise models are utilized in [12] and [13], respectively, while context- and trend-aware caching approaches are investigated in [14] and [15].

Another practical consideration for 5G caching is driven by the fact that a relatively small number of users request contents during a caching period. This, along with the small size of cells, can prevent SBs from estimating accurately the underlying content popularities. To address this issue, a transfer-learning approach is advocated in [13], [16], and [17] to improve the time-invariant popularity profile estimates by leveraging prior information obtained from a surrogate (source) domain, such as social networks.

Finally, recent studies have investigated the role of coding for enhancing performance in cache-enabled networks [18]-[20]; see also [21]-[23], where device-to-device "structureless" caching approaches are envisioned.

This work was supported by NSF grants 1423316, 1508993, 1514056, and 1711471. The authors are with the Digital Technology Center and the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA (E-mail: {sadeghi, sheik081, georgios}@umn.edu).
1932-4553 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
The objective of this paper is to find the optimal policy π* such that the average cost of any state s is minimized (cf. (5))

    π* = arg min_{π ∈ Π} V_π(s),  ∀s ∈ S    (6)

where Π denotes the set of all feasible policies.

The optimization in (6) is a sequential decision making problem. In the ensuing section, we present optimality conditions (known as Bellman equations) for our problem, and introduce a Q-learning approach for solving (6).

III. OPTIMALITY CONDITIONS

Bellman equations, also known as dynamic programming equations, provide necessary conditions for the optimality of a policy in a sequential decision making problem. Being at the (t−1)st slot, let [P^a]_{ss'} denote the transition probability of going from the current state s to the next state s' under action a; that is,

    [P^a]_{ss'} := Pr{ s(t) = s' | s(t−1) = s, π(s(t−1)) = a }.

Bellman equations express the state value function in (5) in a recursive fashion as [24, pg. 47]

    V_π(s) = C̄(s, π(s)) + γ Σ_{s'∈S} [P^{π(s)}]_{ss'} V_π(s'),  ∀s ∈ S    (7)

which amounts to the superposition of C̄ plus a discounted version of the future state value functions under a given policy π. Specifically, after dropping the current slot index t−1 and indicating with primes quantities of the next slot t, C̄ in (4) can be written as

    C̄(s, π(s)) = Σ_{s':=[p'_G, p'_L, a']∈S} [P^{π(s)}]_{ss'} C(s, π(s) | p'_G, p'_L)

where C(s, π(s) | p'_G, p'_L) is found as in (3). It turns out that, with [P^a]_{ss'} given ∀s, s', one can readily obtain {V_π(s), ∀s} by solving (7), and eventually the optimal policy π* in (6), using the so-termed policy iteration algorithm [24, pg. 79]. To outline how this algorithm works in our context, define the state-action value function under policy π that we will rely on [24, pg. 62]

    Q_π(s, a') := C̄(s, a') + γ Σ_{s'∈S} [P^{a'}]_{ss'} V_π(s').    (8)

Commonly referred to as the "Q-function," Q_π(s, α) captures the expected current cost of taking action α when the system is in state s, followed by the discounted value of the future states, provided that the future actions are taken according to policy π.

In our setting, the policy iteration algorithm, initialized with π_0, proceeds with the following updates at the ith iteration.
• Policy evaluation: Determine V_{π_i}(s) for all states s ∈ S under the current (fixed) policy π_i, by solving the system of linear equations in (7) ∀s.
• Policy update: Update the policy using

    π_{i+1}(s) := arg min_α Q_{π_i}(s, α),  ∀s ∈ S.

The policy evaluation step is of complexity O(|S|³), since it requires a matrix inversion for solving the linear system of equations in (7). Furthermore, given V_{π_i}(s) ∀s, the complexity of the policy update step is O(|A||S|²), since the Q-values must be updated per state-action pair, each subject to |S| operations; see also (8). Thus, the per-iteration complexity of the policy iteration algorithm is O(|S|³ + |A||S|²). Iterations proceed until convergence, i.e., π_{i+1}(s) = π_i(s), ∀s ∈ S.

Clearly, the policy iteration algorithm relies on knowing [P^a]_{ss'}, which is typically not available in practice. This motivates the use of adaptive dynamic programming (ADP) schemes that learn [P^a]_{ss'} for all s, s' ∈ S and a ∈ A as iterations proceed [25, pg. 834]. Unfortunately, ADP algorithms are often very slow and impractical, as they must estimate |S|² × |A| probabilities. In contrast, the Q-learning algorithm elaborated next finds the optimal π* as well as V_π(s), while circumventing the need to estimate [P^a]_{ss'}, ∀s, s'; see e.g., [24, pg. 140].

A. Optimal caching via Q-learning

Q-learning is an online RL scheme to jointly infer the optimal policy π* and estimate the optimal state-action value function Q*(s, a') := Q_{π*}(s, a'), ∀s, a'. Utilizing (7) for the optimal policy π*, it can be shown that [24, pg. 67]

    π*(s) = arg min_α Q*(s, α),  ∀s ∈ S.    (9)

The Q-function and V(·) under π* are related by

    V*(s) := V_{π*}(s) = min_α Q*(s, α)    (10)

which in turn yields

    Q*(s, a') = C̄(s, a') + γ Σ_{s'∈S} [P^{a'}]_{ss'} min_{α∈A} Q*(s', α).    (11)

Capitalizing on the optimality conditions (9)-(11), an online Q-learning scheme for caching is listed under Alg. 1. In this algorithm, the agent updates its estimate Q̂(s(t−1), a(t)) as C(s(t−1), a(t) | p_G(t), p_L(t)) is observed. That is, given s(t−1), Q-learning takes action a(t), and upon observing s(t), it incurs the cost C(s(t−1), a(t) | p_G(t), p_L(t)). Based on the instantaneous error

    ε(s(t−1), a(t)) := (1/2) [ C(s(t−1), a(t)) + γ min_α Q̂(s(t), α) − Q̂(s(t−1), a(t)) ]²    (12)

the Q-function is updated using stochastic gradient descent as

    Q̂_t(s(t−1), a(t)) = (1 − β_t) Q̂_{t−1}(s(t−1), a(t)) + β_t [ C(s(t−1), a(t) | p_G(t), p_L(t)) + γ min_α Q̂_{t−1}(s(t), α) ]

while keeping the rest of the entries in Q̂_t(·,·) unchanged.

Regarding convergence of the Q-learning algorithm, a necessary condition ensuring Q̂_t(·,·) → Q*(·,·) is that all state-action pairs be continuously updated [26]. Under this and the usual stochastic approximation conditions that will be specified later, Q̂_t(·,·) converges to Q*(·,·) with probability 1; see [27] for a detailed description.
Algorithm 1 Caching via Q-learning at the CCU
  Initialize s(0) randomly and Q̂_0(s, a) = 0, ∀s, a
  for t = 1, 2, ... do
    Take action a(t) chosen probabilistically by
      a(t) = arg min_a Q̂_{t−1}(s(t−1), a)   w.p. 1 − ε_t
             random a ∈ A                    w.p. ε_t
    p_L(t) and p_G(t) are revealed based on user requests
    Set s(t) = [p_G^T(t), p_L^T(t), a^T(t)]^T
    Incur cost C(s(t−1), a(t) | p_G(t), p_L(t))
    Update
      Q̂_t(s(t−1), a(t)) = (1 − β_t) Q̂_{t−1}(s(t−1), a(t)) + β_t [ C(s(t−1), a(t) | p_G(t), p_L(t)) + γ min_α Q̂_{t−1}(s(t), α) ]
  end for

From a stochastic approximation perspective, by defining t_i(s, a) as the index of the ith time that the state-action pair (s, a) is visited, convergence Q̂_t(·,·) → Q*(·,·) can be guaranteed if the stepsize sequence {β_{t_i(s,a)}}_{i=1}^∞ satisfies

    Σ_{i=1}^∞ β_{t_i(s,a)} = ∞  and  Σ_{i=1}^∞ β²_{t_i(s,a)} < ∞,  ∀s, a

[26]. In addition, for continuously updating state-action pairs, various exploration-exploitation algorithms have been proposed within the scope of multi-armed bandit problems [25], along with reasonable schemes that eventually lead the agent to optimal actions. Technically, for a constant stepsize β_t = β, any such scheme needs to be greedy in the limit of infinite exploration, or GLIE [25, p. 840]. Several GLIE schemes have been proposed, including the ε_t-greedy algorithm [28] with ε_t = 1/t, which converges to an optimal policy, although at a very slow rate. Instead, selecting a constant value for ε_t approaches the optimal Q*(·,·) faster; however, since this choice is not GLIE, its exact convergence cannot be guaranteed. Additionally, with constant ε as well as stepsize β_t = β, the mean-square error (MSE) of Q̂_{t+1}(·,·) is bounded as (cf. [27])

    E‖Q̂_{t+1} − Q*‖² ≤ φ₁(β) + φ₂(Q̂₀) exp(−c₂ β t)    (13)

where φ₁(β) is a positive and increasing function of β, while the second term captures the initialization error, which decays exponentially as the iterations proceed.

The present work utilizes an ε_t-greedy exploration-exploitation approach for selecting actions. To this end, during initial iterations, or when the CCU observes considerable shifts in content popularities, setting ε_t high promotes exploration in order to learn the underlying dynamics. On the other hand, in stationary settings, and once "enough" observations are made, small values of ε_t are desirable, as they drive the agent's actions toward the optimal policy.

Although the selection of a constant stepsize prevents the algorithm from converging exactly to Q* in stationary settings, it enables CCU adaptation to the underlying non-stationary Markov processes in dynamic scenaria. Furthermore, the optimal policy in practice can be obtained from the Q-function values before convergence is achieved [24, pg. 79].

However, the main practical limitation of the Q-learning algorithm is its slow convergence, which is a consequence of the independent updates of the Q-function values. Indeed, Q-function values are related, and leveraging these relationships can lead to multiple updates per observation as well as faster convergence. In the ensuing section, the structure of the problem at hand is exploited to develop a linear function approximation of the Q-function, which in turn endows our algorithm not only with fast convergence, but also with scalability.

IV. SCALABLE CACHING

Despite the simplicity of its updates and its optimality guarantees, the applicability of the Q-learning algorithm over real networks faces practical challenges. Specifically, the Q-table is of size |P_G||P_L||A|², where |A| = (F choose M) encompasses all possible selections of M out of F files. Thus, the Q-table size grows prohibitively with F, rendering convergence of the table entries, as well as of the policy iterates, unacceptably slow. Furthermore, action selection via min_{a∈A} Q(s, a) entails an expensive exhaustive search over the feasible action set A.

Linear function approximation is a popular scheme for rendering Q-learning applicable to real-world settings [25], [29], [30]. A linear approximation for Q(s, a) in our setup is inspired by the additive form of the instantaneous costs in (3). Specifically, we propose to approximate Q(s, a') as

    Q(s, a') ≈ Q_G(s, a') + Q_L(s, a') + Q_R(s, a')    (14)

where Q_G, Q_L, and Q_R correspond to the global popularity mismatch, local popularity mismatch, and cache-refreshing costs, respectively. Recall that the state vector s consists of three subvectors, namely s := [p_G^T, p_L^T, a^T]^T. Corresponding to the global popularity subvector, the first term of the approximation in (14) is

    Q_G(s, a') := Σ_{i=1}^{|P_G|} Σ_{f=1}^{F} θ^G_{i,f} 1{p_G = p^i_G} 1{[a']_f = 0}    (15)

where the sums are over all possible global popularity profiles as well as files, and the indicator function 1{·} takes the value 1 if its argument holds, and 0 otherwise; θ^G_{i,f} captures the average "overall" cost if the system is in global state p^i_G and the CCU decides not to cache the fth content. By defining the |P_G| × F matrix Θ^G with (i, f)th entry [Θ^G]_{i,f} := θ^G_{i,f}, one can rewrite (15) as

    Q_G(s, a') = φ_G^T(p_G) Θ^G (1 − a')    (16)

where

    φ_G(p_G) := [ 1{p_G = p^1_G}, ..., 1{p_G = p^{|P_G|}_G} ]^T.
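As a quick sanity check on the structure of (15)-(16), the following sketch (with hypothetical sizes and Θ^G entries) confirms that the matrix-vector form φ_G^T(p_G) Θ^G (1 − a') reproduces the double indicator sum:

```python
import numpy as np

# Hypothetical dimensions: F files, cache size M, |P_G| global states.
F, M, nPG = 6, 2, 3
# Placeholder theta^G_{i,f} values arranged as the |P_G| x F matrix Theta^G.
Theta_G = np.arange(nPG * F, dtype=float).reshape(nPG, F)

def phi_G(i_state: int) -> np.ndarray:
    """Indicator feature vector: 1 at the index of the active global state."""
    phi = np.zeros(nPG)
    phi[i_state] = 1.0
    return phi

def Q_G(i_state: int, a_next: np.ndarray) -> float:
    """phi_G^T Theta^G (1 - a'): sums theta^G_{i,f} over files NOT cached by a'."""
    return float(phi_G(i_state) @ Theta_G @ (1.0 - a_next))

a_next = np.zeros(F)
a_next[[0, 3]] = 1.0                 # a' caches files 0 and 3 (M = 2)
i = 1                                # suppose the chain is in global state 2
qval = Q_G(i, a_next)
# Direct double-sum form of (15) for comparison
direct = sum(Theta_G[i, f] for f in range(F) if a_next[f] == 0)
```

Both forms charge the state-dependent mismatch cost only for the F − M files left out of the cache, which is exactly what makes the compact form (16) convenient for the gradient updates derived next.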
Similarly, we advocate the second summand in the approximation (14) to be

    Q_L(s, a') := Σ_{i=1}^{|P_L|} Σ_{f=1}^{F} θ^L_{i,f} 1{p_L = p^i_L} 1{[a']_f = 0}
               = φ_L^T(p_L) Θ^L (1 − a')    (17)

where [Θ^L]_{i,f} := θ^L_{i,f}, and

    φ_L(p_L) := [ 1{p_L = p^1_L}, ..., 1{p_L = p^{|P_L|}_L} ]^T

with θ^L_{i,f} modeling the average overall cost for not caching file f when the local popularity is in state p^i_L.

Finally, the third summand in (14) corresponds to the cache-refreshing cost

    Q_R(s, a') := θ^R Σ_{f=1}^{F} 1{[a']_f = 1} 1{[a]_f = 0}    (18)
               = θ^R a'^T (1 − a)
               = θ^R [ a'^T (1 − a) + a^T 1 − a'^T 1 ]
               = θ^R a^T (1 − a')

where θ^R models the average cache-refreshing cost per content. The constraint a^T 1 = a'^T 1 = M is utilized to factor out the term 1 − a', which will become useful later.

Upon defining the set of parameters Λ := {Θ^G, Θ^L, θ^R}, the Q-function is readily approximated as (cf. (14))

    Q̂_Λ(s, a') := [ φ_G^T(p_G) Θ^G + φ_L^T(p_L) Θ^L + θ^R a^T ] (1 − a') =: φ^T(s) (1 − a').    (19)

Thus, the original task of learning |P_G||P_L||A|² parameters in Alg. 1 is now reduced to learning Λ, containing (|P_G| + |P_L|)F + 1 parameters.

A. Learning Λ

Given the current parameter estimates {Θ̂^G_{t−1}, Θ̂^L_{t−1}, θ̂^R_{t−1}} at the end of the information exchange phase of slot t, the instantaneous error is given by

    ê(s(t−1), a(t)) := C(s(t−1), a(t)) + γ min_{a'} Q̂_{Λ_{t−1}}(s(t), a') − Q̂_{Λ_{t−1}}(s(t−1), a(t)).

Let us define

    ε̂(s(t−1), a(t)) := (1/2) [ ê(s(t−1), a(t)) ]² ;    (20)

then, the parameter update rules are obtained via stochastic gradient descent iterations as (cf. [25, p. 847])

    Θ̂^G_t = Θ̂^G_{t−1} − α_G ∇_{Θ^G} ε̂(s(t−1), a(t))    (21)
          = Θ̂^G_{t−1} + α_G ê(s(t−1), a(t)) ∇_{Θ^G} Q̂_{Λ_{t−1}}(s(t−1), a(t))
          = Θ̂^G_{t−1} + α_G ê(s(t−1), a(t)) φ_G(p_G(t−1)) (1 − a(t))^T

    Θ̂^L_t = Θ̂^L_{t−1} − α_L ∇_{Θ^L} ε̂(s(t−1), a(t))    (22)
          = Θ̂^L_{t−1} + α_L ê(s(t−1), a(t)) ∇_{Θ^L} Q̂_{Λ_{t−1}}(s(t−1), a(t))
          = Θ̂^L_{t−1} + α_L ê(s(t−1), a(t)) φ_L(p_L(t−1)) (1 − a(t))^T

and

    θ̂^R_t = θ̂^R_{t−1} − α_R ∇_{θ^R} ε̂(s(t−1), a(t))    (23)
          = θ̂^R_{t−1} + α_R ê(s(t−1), a(t)) ∇_{θ^R} Q̂_{Λ_{t−1}}(s(t−1), a(t))
          = θ̂^R_{t−1} + α_R ê(s(t−1), a(t)) a^T(t−1) (1 − a(t)).

Algorithm 2 Scalable Q-learning
  Initialize s(0) randomly, Θ̂^G_0 = 0, Θ̂^L_0 = 0, θ̂^R_0 = 0, and thus φ̂_0(s) = 0
  for t = 1, 2, ... do
    Take action a(t) chosen probabilistically by
      a(t) = the M best files via φ̂(s(t−1))   w.p. 1 − ε_t
             random a ∈ A                      w.p. ε_t
    where φ̂(s) := [ φ_G^T(p_G) Θ̂^G + φ_L^T(p_L) Θ̂^L + θ̂^R a^T ]^T
    p_G(t) and p_L(t) are revealed based on user requests
    Set s(t) = [p_G^T(t), p_L^T(t), a^T(t)]^T
    Incur cost C(s(t−1), a(t) | p_G(t), p_L(t))
    Find ê(s(t−1), a(t))
    Update Θ̂^G_t, Θ̂^L_t, and θ̂^R_t based on (21)-(23)
  end for

The pseudocode for this scalable approximation of the Q-learning scheme is tabulated in Alg. 2. The upshot of this scalable scheme is three-fold.
• The large state-action space of the Q-learning algorithm is handled by reducing the number of parameters from |P_G||P_L||A|² to (|P_G| + |P_L|)F + 1.
• In contrast to the single-entry updates of the exact Q-learning in Alg. 1, F − M entries in Θ̂^G and Θ̂^L, as well as θ̂^R, are updated per observation using (21)-(23), which leads to much faster convergence.
• The exhaustive search in min_{a∈A} Q(s, a), required in exploitation as well as in the error evaluation (20), is circumvented. Specifically, it holds that (cf. (19))

    min_{a'∈A} Q(s, a') ≈ min_{a'∈A} φ^T(s) (1 − a') = max_{a'∈A} φ^T(s) a'    (24)

where

    φ(s) := [ φ_G^T(p_G) Θ^G + φ_L^T(p_L) Θ^L + θ^R a^T ]^T.

The solution of (24) is readily given by [a']_{ν_i} = 1 for i = 1, ..., M, and [a']_{ν_i} = 0 for i > M, where [φ(s)]_{ν_1} ≥ ... ≥ [φ(s)]_{ν_F} are the sorted entries of φ(s).

Remark 2. In the model of Section II-B, the state-space cardinality of the popularity vectors is finite. These vectors can be viewed as centroids of quantization regions partitioning a state space of infinite cardinality. Clearly, such a partitioning inherently bears a complexity-accuracy trade-off, motivating optimal designs that achieve a desirable accuracy for a given affordable complexity. This is one of our future research directions for the problem at hand.
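The maximization in (24) thus reduces to a top-M selection over the entries of φ(s). The sketch below (with a hypothetical φ(s)) implements this via sorting and verifies it against brute-force enumeration of the feasible actions:

```python
from itertools import combinations

import numpy as np

def best_action(phi_s: np.ndarray, M: int) -> np.ndarray:
    """Return a' with [a']_f = 1 at the M largest entries of phi(s), cf. (24)."""
    a = np.zeros_like(phi_s)
    a[np.argsort(phi_s)[-M:]] = 1.0   # indices of the M largest entries
    return a

phi_s = np.array([0.2, 1.5, 0.1, 0.9, 2.0])   # hypothetical phi(s), F = 5
a_star = best_action(phi_s, M=2)

# Exhaustive check over all C(5, 2) = 10 feasible actions for this toy size
brute = max(phi_s[list(c)].sum() for c in combinations(range(5), 2))
```

Sorting costs O(F log F) versus the C(F, M) candidates of the exhaustive search, which is what makes exploitation affordable at the scales considered in Section V.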
[Fig. 5: Popularity profile Markov chains: (a) the global and (b) the local popularity profile, each modeled as a two-state Markov chain whose states are drawn from Zipf distributions.]

Simulation-based evaluation of the proposed algorithms for RL-based caching is now in order.

V. NUMERICAL TESTS

In this section, the performance of the proposed Q-learning algorithm and its scalable approximation is tested. To compare the proposed algorithms with the optimal caching policy, which is the best policy under known transition probabilities for the global and local popularity Markov chains, we first simulated a small network with F = 10 contents and caching capacity M = 2 at the local SB. The global popularity profile is modeled by a two-state Markov chain with states p^1_G and p^2_G, drawn from Zipf distributions having parameters η^G_1 = 1 and η^G_2 = 1.5, respectively [31]; see also Fig. 5. That is, for state i ∈ {1, 2}, the F contents are assigned a random ordering of popularities, and are then sorted accordingly in descending order. Given this ordering and the Zipf distribution parameter η^G_i, the popularity of the fth content is set to

    [p^i_G]_f = f^{−η^G_i} / Σ_{l=1}^{F} l^{−η^G_i},  i = 1, 2

where the summation normalizes the components to follow a valid probability mass function, while η^G_i ≥ 0 controls the skewness of the popularities. Specifically, η^G_i = 0 yields a uniform spread of popularity among contents, while a large value of η^G_i generates more skewed popularities. Furthermore, the state transition probabilities of the Markov chain modeling the global popularity profiles are

    P^G := [ p^G_11  p^G_12 ; p^G_21  p^G_22 ] = [ 0.8  0.2 ; 0.75  0.25 ].

Similarly, local popularities are modeled by a two-state Markov chain with states p^1_L and p^2_L, whose entries are drawn from Zipf distributions with parameters η^L_1 = 0.7 and η^L_2 = 2.5, respectively. The transition probabilities of the local popularity Markov chain are

    P^L := [ p^L_11  p^L_12 ; p^L_21  p^L_22 ] = [ 0.6  0.4 ; 0.2  0.8 ].

Caching performance is assessed under three cost-parameter settings: (s1) λ₁ = 10, λ₂ = 600, λ₃ = 1000; (s2) λ₁ = 600, λ₂ = 10, λ₃ = 1000; and (s3) λ₁ = 10, λ₂ = 10, λ₃ = 1000. In all numerical tests, the optimal caching policy is found by utilizing the policy iteration algorithm with known transition probabilities. In addition, Q-learning in Alg. 1 and its scalable approximation in Alg. 2 are run with γ = 0.8, α_G = α_L = α_R = 0.005, and ε_t = 0.05.

Fig. 6 depicts the observed cost versus the iteration (time) index, averaged over 1000 realizations. It is seen that the caching cost via Q-learning, and through its scalable approximation, converges to that of the optimal policy. As anticipated, even for the small size of this network, namely |P_G| = |P_L| = 2 and |A| = 45, the Q-learning algorithm converges slowly to the optimal policy, especially under (s1), while its scalable approximation exhibits faster convergence. The reason for the slower convergence under (s1) is that the corresponding cost parameters of the local and global popularity mismatch are set high; thus, the convergence of the Q-learning algorithm, as well as of the caching policy, essentially relies on learning both the global and local popularity Markov chains. In contrast, under (s2), the parameter λ₂ corresponding to the local popularity mismatch is low; thus, the impact of the local popularity Markov chain on the optimal policy is reduced, giving rise to a simpler policy, and hence faster convergence. To further elaborate on this issue, simulations are carried out under a simpler scenario (s3). In this setting, having λ₁ = 10 further reduces the effect of the cache-refreshing cost, and thus more importance falls on learning the Markov chain of the global popularities. Indeed, the simulations present a slightly faster convergence for (s3) compared to (s2), while both demonstrate much faster convergence than (s1).

In order to highlight the trade-off between global and local popularity mismatches, the percentage of requests accommodated via the cache is depicted in Fig. 7 for the settings (s4) λ₁ = λ₃ = 0, λ₂ = 1,000, and (s5) λ₁ = λ₂ = 0, λ₃ = 1,000. Observe that penalizing the local popularity mismatch in (s4) forces the caching policy to adapt to the local request dynamics, thus accommodating a higher percentage of requests via the cache, while (s5) prioritizes tracking the global popularities, leading to a lower cache-hit ratio in this setting. Due to the slow convergence of the exact Q-learning under (s4) and (s5), only the performance of the scalable solver is presented here.

Furthermore, the convergence rate of Algs. 1 and 2 is illustrated under (s6) λ₁ = 60, λ₂ = 10, λ₃ = 10 in Fig. 8, where the average normalized error is evaluated in terms of the "exploration index." Specifically, pure exploration is carried out for the first T_explore iterations of the algorithms, i.e., ε_t = 1 for t = 1, 2, ..., T_explore, and pure exploitation with ε_t = 0 is adopted afterwards. We have set α = 0.005 and selected γ = 0.7. As the plot demonstrates, the exact Q-learning Alg. 1 exhibits slower convergence, whereas just a few iterations suffice for the scalable Alg. 2 to converge to the optimal solution, thanks to the reduced dimension of the problem as well as the multiple updates that can be afforded per iteration.

Having established the accuracy and efficiency of Alg. 2, we next simulated a larger network with F = 1,000 available files and a cache capacity of M = 10, giving rise to a total of (1000 choose 10) ≈ 2 × 10²³ feasible caching actions.
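To put these numbers in perspective, the short sketch below (the helper names are ours, not the paper's) computes the action-space size for F = 1,000 and M = 10, contrasts the exact Q-table parameter count with the linear parameterization of Section IV, and builds the normalized Zipf pmf used for the popularity states:

```python
import math

# Sizes for the larger simulated network: F = 1,000 files, M = 10 cached.
F, M = 1000, 10
nA = math.comb(F, M)                      # feasible caching actions
print(f"|A| = C({F},{M}) = {nA:.2e}")     # on the order of 10^23

# Exact Q-table vs. linear-approximation parameter counts (|PG|=50, |PL|=40)
nPG, nPL = 50, 40
q_table = nPG * nPL * nA ** 2             # |PG||PL||A|^2 entries
linear = (nPG + nPL) * F + 1              # (|PG| + |PL|)F + 1 = 90,001

# Zipf popularity pmf for a profile state: [p]_f proportional to f^(-eta)
def zipf_pmf(F: int, eta: float) -> list:
    weights = [f ** (-eta) for f in range(1, F + 1)]
    Z = sum(weights)                      # normalization constant
    return [w / Z for w in weights]

p = zipf_pmf(F, eta=2.0)                  # skewed profile; p[0] is most popular
```

The gap between the two counts (roughly 10⁵⁰ table entries versus about 9 × 10⁴ linear parameters) is what makes the exact Q-table infeasible here while the approximated scheme remains tractable.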
[Fig. 6: Performance of the proposed algorithms: cost versus iteration index for Q-learning, approximated Q-learning, and the optimal policy under scenarios 1-3.]

[Fig. 7: Percentage of requests accommodated via the cache under scenarios 4 and 5.]

[Fig. 8: Convergence rate of the exact and scalable Q-learning: normalized error versus iteration index.]

[Fig. 9: Performance in large state-action space scenaria: cost versus iteration index under scenarios 7-9.]
we set the local and global popularity Markov chains to have approach with scalability and light-weight updates.
|PL | = 40 and |PG | = 50 states, for which the underlying Remark 3. In this section, it is assumed that the file pop-
state transition probabilities are drawn randomly, and Zipf ularities demonstrate correlation across time, and thus do
parameters are drawn uniformly over the interval (2, 4). not change dramatically from one slot to the other. Thus, it
Fig. 9 plots the performance of Alg. 2 under (s7) 1 = 100, is assumed that the popularity profile does not dramatically
2 = 20, 3 = 20, (s8) 1 = 0, 2 = 0, 3 = 1, 000, and change within the interval of interest, and the realization of a
(s9) 1 = 0, 2 = 1, 000, 3 = 600. Exploration-exploitation large portion of possible states is considered extremely rare.
parameter is set to ✏t = 1 for t = 1, 2, . . . , 7 ⇥ 105 , in Therefore, relatively small number of states, say 50 in our
order to greedily explore the entire state-action space in initial setting, is assumed to practically cover the most likely states.
iterations, and ✏t = 1 / (iteration index) for t > 7 ⇥ 105 . In a broader scenario, where a larger number of states are to
Finding the optimal policy in (s8) and (s9) requires pro- be considered, continuous function approximation techniques
hibitively sizable memory as well as extremely high com- such as kernel-based or deep learning approaches [32], [33]
putational complexity, and it is thus unaffordable for this can be utilized to enable the algorithm with further scalability.
network. However, having large cache-refreshing cost with Finally, numerical tests are carried to elaborate the impact of
1 2 , 3 in (s7) forces the optimal caching policy to dynamic costs dictated to CCUs according to a cost parameter
freeze its cache contents, making the optimal caching policy profile. The preselected profiles are reported in Fig.10(a),
predictable in this setting. Despite the very limited storage and Fig.10(b) shows corresponding cost and percentage of
capacity, of 10 / 1, 000 = 0.01 of available files, utilization of accommodated requests via cache. The two-state Markov
RL-enabled caching offers a considerable reduction in incurred chain for global and local popularity profiles are considered
costs, while the proposed approximated Q-learning endows the the same as described earlier in this section. As the percent-
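The experimental setup described above can be sketched in a few lines: Zipf popularity profiles with randomly drawn parameters, row-stochastic transition matrices drawn at random, and the two-phase ε-greedy exploration schedule. This is a minimal illustration, not the authors' code; the function names and the convention of counting the iteration index from the start in the decaying phase are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def zipf_popularity(num_files, s):
    # Zipf profile: popularity of the f-th ranked file is proportional to 1/f^s.
    ranks = np.arange(1, num_files + 1)
    weights = ranks ** (-float(s))
    return weights / weights.sum()

def random_transition_matrix(num_states):
    # Row-stochastic matrix with transition probabilities drawn at random,
    # as done for the |P_L| = 40 and |P_G| = 50 popularity Markov chains.
    P = rng.random((num_states, num_states))
    return P / P.sum(axis=1, keepdims=True)

def epsilon(t, t_explore=7 * 10**5):
    # Pure exploration (epsilon_t = 1) for the first t_explore iterations,
    # then a 1/(iteration index) decay (counting t from the start is assumed).
    return 1.0 if t <= t_explore else 1.0 / t

def eps_greedy_action(q_row, t):
    # epsilon-greedy selection over the caching actions for the current state.
    if rng.random() < epsilon(t):
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))
```

For instance, a popularity profile per state would be drawn as `zipf_popularity(1000, rng.uniform(2, 4))`, matching Zipf parameters drawn uniformly over (2, 4).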
1932-4553 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JSTSP.2017.2787979, IEEE Journal
of Selected Topics in Signal Processing
Fig. 10(a): Cost profiles (cost vs. iteration index).

[7] A. Sengupta, S. Amuru, R. Tandon, R. M. Buehrer, and T. C. Clancy, "Learning distributed caching strategies in small cell networks," in Proc. Intl. Symp. on Wireless Communications Systems, Barcelona, Spain, Aug. 2014, pp. 917–921.
[8] J. Liu, B. Bai, J. Zhang, and K. B. Letaief, "Content caching at the wireless network edge: A distributed algorithm via belief propagation," in Intl. Conf. on Communications, Kuala Lumpur, Malaysia, May 2016, pp. 1–6.
[9] B. Chen, C. Yang, and Z. Xiong, "Optimal caching and scheduling for
[27] V. S. Borkar and S. P. Meyn, "The ODE method for convergence of stochastic approximation and reinforcement learning," SIAM J. Control Optim., vol. 38, no. 2, pp. 447–469, Jan. 2000.
[28] J. Wyatt, "Exploration and inference in learning from reinforcement," Ph.D. dissertation, School of Informatics, University of Edinburgh, Edinburgh, Scotland, 1998.
[29] A. Geramifard, T. J. Walsh, S. Tellex, G. Chowdhary, N. Roy, and J. P. How, "A tutorial on linear function approximators for dynamic programming and reinforcement learning," Foundations and Trends in Machine Learning, vol. 6, no. 4, pp. 375–451, Dec. 2013.
[30] S. Mahadevan, "Learning representation and control in Markov decision processes: New frontiers," Foundations and Trends in Machine Learning, vol. 1, no. 4, pp. 403–565, June 2009.
[31] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker, "Web caching and Zipf-like distributions: Evidence and implications," in Intl. Conf. on Computer Communications, New York, USA, March 1999, pp. 126–134.
[32] D. Ormoneit and Ś. Sen, "Kernel-based reinforcement learning," Machine Learning, vol. 49, no. 2, pp. 161–178, Nov. 2002.
[33] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.

Georgios B. Giannakis (F'97) received his Diploma in Electrical Engr. from the Ntl. Tech. Univ. of Athens, Greece, 1981. From 1982 to 1986 he was with the Univ. of Southern California (USC), where he received his MSc. in Electrical Engineering, 1983, MSc. in Mathematics, 1986, and Ph.D. in Electrical Engr., 1986. He was with the University of Virginia from 1987 to 1998, and since 1999 he has been a professor with the Univ. of Minnesota, where he holds an Endowed Chair in Wireless Telecommunications, a University of Minnesota McKnight Presidential Chair in ECE, and serves as director of the Digital Technology Center.

His general interests span the areas of communications, networking and statistical signal processing - subjects on which he has published more than 400 journal papers, 700 conference papers, 25 book chapters, two edited books and two research monographs (h-index 128). Current research focuses on learning from Big Data, wireless cognitive radios, and network science with applications to social, brain, and power networks with renewables. He is the (co-)inventor of 30 patents issued, and the (co-)recipient of 9 best paper awards from the IEEE Signal Processing (SP) and Communications Societies, including the G. Marconi Prize Paper Award in Wireless Communications. He also received Technical Achievement Awards from the SP Society (2000), from EURASIP (2005), a Young Faculty Teaching Award, the G. W. Taylor Award for Distinguished Research from the University of Minnesota, and the IEEE Fourier Technical Field Award (2015). He is a Fellow of EURASIP, and has served the IEEE in a number of posts, including that of a Distinguished Lecturer for the IEEE-SP Society.