Abstract. Learning Automata (LA) can be reckoned to be the founding algorithms on which the field of Reinforcement Learning has been built. Among the families of LA, Estimator Algorithms (EAs) are certainly the fastest, and of these, the family of Pursuit Algorithms (PAs) constitutes the pioneering work. It has recently been reported that the previous proofs of ε-optimality for all the reported algorithms in the family of PAs are flawed¹. We applaud the researchers who discovered this flaw, and who further proceeded to rectify the proof for the Continuous Pursuit Algorithm (CPA). The latter proof, though it requires the learning parameter to be continuously changing, is, to the best of our knowledge, currently the best and only way to prove the CPA's ε-optimality. However, for all the algorithms with absorbing states, for example, the Absorbing Continuous Pursuit Algorithm (ACPA) and the Discretized Pursuit Algorithm (DPA), the constraint of a continuously changing learning parameter can be removed. In this paper, we provide a new method to prove the ε-optimality of the Discretized Pursuit Algorithm which does not require this constraint. We believe that our proof is both unique and pioneering. It can also form the basis for formally showing the ε-optimality of the other EAs with absorbing states.
Keywords: Pursuit Algorithms, Discretized Pursuit Algorithm, ε-optimality.
1 Introduction
Learning automata (LA) have been studied as a typical model of reinforcement learning
for decades. An LA is an adaptive decision-making unit that learns the optimal action
from among a set of actions offered by the Environment it operates in. At each iteration, the LA selects one action, which triggers either a stochastic reward or a penalty
as a response from the Environment. Based on the response and the knowledge acquired in the past iterations, the LA adjusts its action selection strategy in order to
make a wiser decision in the next iteration. In such a way, the LA, even though it
⋆ Chancellor's Professor; Fellow: IEEE and Fellow: IAPR. The author also holds an Adjunct Professorship with the Dept. of ICT, University of Agder, Norway.
¹ This flaw also renders the finite time analysis of these algorithms [12] incorrect. This is because the latter analysis relied on the same condition used in the flawed proofs, i.e., they considered the monotonicity property of the probability of selecting the optimal action.
A. Moonis et al. (Eds.): IEA/AIE 2014, Part I, LNAI 8481, pp. 379–388, 2014.
© Springer International Publishing Switzerland 2014
380
X. Zhang et al.
lacks a complete knowledge about the Environment, is able to learn through repeated
interactions with the Environment, and adapts itself to the optimal decision. Hence,
LA have been applied to a variety of fields where complete knowledge of the Environment cannot be obtained. These applications include game playing [1], parameter optimization [2], solving knapsack-like problems and utilizing the solution in web polling
and sampling [3], vehicle path control [4], resource allocation [5], service selection in
stochastic environments [6], and numerical optimization [7].
One of the most important parts of the design and analysis of LA consists of the formal
proofs of their convergence accuracies. Among all the different types of LA (FSSA,
VSSA, Discretized etc.), the most difficult proofs involve the family of EAs. This is
because the convergence involves two intertwined phenomena, i.e., the convergence
of the reward estimates and the convergence of the action probabilities themselves.
Ironically, the combination of these in the updating rule is what renders the EA fast.
Prior Proofs: As the pioneering work in the study of EAs, the ε-optimality of the families of PAs has been studied and presented in [9], [10], [11], [12] and [13]. The
basic result stated in these papers is that by utilizing a sufficiently small value for the
learning parameter (or resolution), PAs will converge to the optimal action with an
arbitrarily large probability.
Flaws in the Existing Proofs: The premise for this paper is that the proofs reported for almost three decades for PAs have a common flaw, which involves a very fine argument. In fact, the proofs reported in these papers deduced the ε-optimality based on the conclusion that after a sufficiently large but finite time instant, t0, the probability of selecting the optimal action is monotonically increasing, which, in turn, is based on the condition that the reward probability estimates are ordered properly forever after t0. This ordering is, indeed, true by the law of large numbers if all the actions are chosen infinitely often and if the time instant, t0, is defined to be infinite. But if such an infinite selection does not occur, the proper ordering happens only in probability. In other words, the authors of these papers misinterpreted the concept of "ordering forever" with that of the "ordering most of the time after t0". As a consequence of this misinterpretation, the condition supporting the monotonicity property is false, which leads to an incorrect proof for the PAs being ε-optimal.
Discovery of the Flaw: Even though this has been the accepted argument for almost three decades, we credit the authors of [14] for discovering this flaw, and for further rectifying the proof for the CPA. The latter proof is also based on the monotonicity property of the probability of selecting the optimal action, and requires that the learning parameter be decreasing over time. This methodology, to the best of our knowledge, is the current best way of proving the ε-optimality of the CPA.
Problem Statement: This paper aims at correcting the above-mentioned flaw by presenting a new proof for the DPA, which is a type of EA with absorbing states. As opposed to all the previous EA proofs, we will show that while the monotonicity property is sufficient for convergence, it is not really necessary for proving that the DPA is ε-optimal. Rather, we will present a completely new proof methodology which is based on the convergence theory of submartingales and the theory of Regular functions [15]. The new proof is distinct in principle and argument from the proof reported
in [14], and does not require the learning parameter to be continuously decreasing.
Besides, the new proof can be extended to formally demonstrate the ε-optimality of
other EAs with absorbing states.
The DPA maintains an action probability vector P(t) = [p1(t), ..., pr(t)], with Σ_{j=1...r} pj(t) = 1. An action is selected according to P(t) at every iteration.
If the Environment's response is a reward, the probabilities of all the actions other than the one possessing the currently greatest reward estimate are decreased by the step size Δ = 1/(rN) (bounded below by zero), where N is the resolution parameter, and the probability of the best estimated action receives the remaining mass;
Else
P(t + 1) = P(t).
EndIf
We now visit the proofs of the DPA's convergence.
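As a concrete illustration of the rule above, the following Python sketch implements one DPA iteration. It is our own reconstruction for illustration only: the function names, the environment interface, and the maximum-likelihood estimate bookkeeping are assumptions, while the step size Δ = 1/(rN) and the pursuit of the best estimated action follow the standard DPA of [9].

```python
import random

def dpa_step(p, d_hat, counts, rewards, env_reward_probs, N):
    """One iteration of the Discretized Pursuit Algorithm (illustrative sketch)."""
    r = len(p)
    delta = 1.0 / (r * N)          # discretized step size, Delta = 1/(rN)

    # Select an action according to the probability vector P(t).
    i = random.choices(range(r), weights=p)[0]
    reward = random.random() < env_reward_probs[i]

    # Update the maximum-likelihood reward estimates d_hat (assumed bookkeeping).
    counts[i] += 1
    rewards[i] += reward
    d_hat[i] = rewards[i] / counts[i]

    if reward:
        # Pursue the currently best estimated action: decrease all other
        # probabilities by Delta (bounded below by zero) ...
        m = max(range(r), key=lambda j: d_hat[j])
        for j in range(r):
            if j != m:
                p[j] = max(p[j] - delta, 0.0)
        # ... and give the freed mass to the best estimated action.
        p[m] = 1.0 - sum(p[j] for j in range(r) if j != m)
    # On a penalty, P(t + 1) = P(t): no change.
    return p
```

With a sufficiently large resolution N, repeated calls drive P(t) toward an absorbing unit vector, which is exactly the absorbing behavior the analysis below exploits.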
The earlier reported proofs of the ε-optimality of the DPA follow a strategy that consists of four steps. Firstly, given a sufficiently large value for the resolution parameter N, all actions will be selected a large enough number of times before a finite time instant, t0. Secondly, for all t > t0, d̂m will remain the maximum element of the reward probability estimates vector, D̂. Thirdly, supposing that d̂m has been ranked as the largest element in D̂ since t0, the action probability sequence {pm(t)}, with t > t0, will be monotonically increasing, whence one concludes that pm(t) converges to 1 with probability 1. Finally, given that the probability of d̂m being the largest element in D̂ is arbitrarily close to unity, and that pm(t) → 1 w.p. 1, the ε-optimality follows from the axiom of total probability.
The formal assertions of these steps are listed below.
1. The first step of the proof can be described mathematically by Theorem 2.
Theorem 2. For any given constants δ > 0 and M < ∞, there exist an N0 > 0 and a t0 < ∞ such that under the DPA algorithm, for all positive N > N0,
Pr{All actions are selected at least M times each before time t0} > 1 − δ.  (1)
2. The second step asserts that the estimates are thereafter ordered correctly forever, i.e.,
Pr{d̂m(t) > d̂j(t), ∀ j ≠ m, ∀ t > t0} > 1 − δ.  (2)
In the reported proofs for the EAs, Eq. (2) is reckoned true if the following assumption is true.
Let w be the difference between the two highest reward probabilities. If all actions are selected a large number of times, each of the estimates d̂i will be within a w/2 neighborhood of di with an arbitrarily large probability. In other words, the probability of d̂m(t) being greater than d̂j(t), ∀ j ≠ m, will be arbitrarily close to unity. This assumption can be easily proven by the weak law of large numbers as per [9].²
² The error in the proofs lies precisely at this juncture, as we shall show presently.
Flaw in the Argument: There is a flaw in the above argument. In fact, the above assumption does not guarantee Pr{K̂(t0)} > 1 − δ. To be specific, let us define K(t) = {d̂m(t) is the largest element in D̂(t)}. Then the result that can be deduced from the assumption when t > t0 is that Pr{K(t)} > 1 − δ. But, indeed, the condition that the subsequent steps are based on is:
K̂(t0) = ∩_{t>t0} K(t),
which means that for every single time instant after t0, the estimate d̂m(t) must remain the largest element of D̂(t). More formally, let
q(t) = Pr{d̂m(t) > d̂j(t), ∀ j ≠ m},  (3)
and, for γ ∈ (0, 1), define the events
G(t) = {q(t) > 1 − γ},
Ĝ(t0) = ∩_{t>t0} {q(t) > 1 − γ}.  (4)
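The gap between the per-instant statement Pr{K(t)} > 1 − δ and the joint statement Pr{K̂(t0)} > 1 − δ can be made tangible by a small Monte Carlo sketch. The scenario below (two hypothetical reward probabilities, with both arms sampled at every step) is an illustrative assumption and not the DPA itself; it merely contrasts the two probabilities for sample-mean estimates.

```python
import random

def ordering_stats(trials, t0, T, dm, dj):
    """Monte Carlo contrast between the ordering of two sample-mean reward
    estimates holding at the single instant T, versus at EVERY t in (t0, T]."""
    ok_at_T = 0      # estimate of Pr{K(T)}: correct ordering at the instant T
    ok_always = 0    # estimate of Pr{K-hat(t0)}: correct ordering for all t > t0
    for _ in range(trials):
        sum_m = sum_j = 0.0
        always = True
        for t in range(1, T + 1):
            sum_m += random.random() < dm   # Bernoulli(dm) reward, best action
            sum_j += random.random() < dj   # Bernoulli(dj) reward, other action
            if t > t0 and sum_m / t <= sum_j / t:
                always = False              # ordering violated at this instant
        if sum_m / T > sum_j / T:
            ok_at_T += 1
        if always:
            ok_always += 1
    return ok_at_T / trials, ok_always / trials
```

By construction the joint event implies the per-instant one, so the second estimate can never exceed the first; when the reward probabilities are close together the gap becomes substantial, which is precisely the gap the flawed proofs overlooked.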
Given the updating rules of the DPA (summarized in Table 1), the conditional expectation of pm(t + 1) can be written as:
E[pm(t + 1)] = Σ_{j=1...r} pj dj q (pm + ct Δ) + Σ_{j=1...r} pj dj (1 − q)(pm − Δ) + Σ_{j=1...r} pj (1 − dj) pm
= pm + Σ_{j=1...r} pj dj (q(ct + 1)Δ − Δ).
In the above, pm(t) and q(t) are respectively written as pm and q in the interest of conciseness. The difference between E[pm(t + 1)] and pm(t) can thus be expressed as:
Diffpm(t) = Σ_{j=1...r} pj(t) dj (q(t)(ct + 1)Δ − Δ).
Denoting Zt = 1/(ct + 1), we see that if t > t0 and q(t) > Zt, then Diffpm(t) > 0, and so {pm(t)}t>t0 is a submartingale.
Table 1. The various possibilities for updating pm for the next iteration under the DPA

Greatest element in D̂      | Response              | Updating pm
d̂m (w.p. q(t))             | Reward (w.p. dj)      | pm(t + 1) = pm(t) + ct Δ
d̂j, j ≠ m (w.p. 1 − q(t))  | Reward (w.p. dj)      | pm(t + 1) = pm(t) − Δ
d̂j, j = 1...r              | Penalty (w.p. 1 − dj) | pm(t + 1) = pm(t)
As per the action probability updating rules of the DPA, ct ∈ {1, 2, ..., r − 1}, implying that Zt ∈ [1/r, 1/2]. Let the quantity 1 − γ = max{Zt} = 1/2. Then, according to Theorem 3, there exists a time instant t0 such that ∀ t > t0, q(t) > 1/2 = max{Zt}. Consequently, {pm(t)}t>t0 is a submartingale, and the theorem is proven.
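The sign condition Diffpm(t) > 0 whenever q(t) > Zt = 1/(ct + 1) can be sanity-checked numerically. The sketch below evaluates E[pm(t + 1)] − pm(t) case-by-case from the three rows of Table 1 and compares it with the closed form Σj pj dj (q(ct + 1) − 1)Δ; the particular values of pj, dj, q, ct and Δ are assumptions chosen only for illustration.

```python
def expected_diff_pm(p, d, q, ct, delta, m=0):
    """E[pm(t+1)] - pm(t), computed case-by-case as in Table 1:
       - best estimate is d_hat_m (w.p. q), reward (w.p. dj): pm + ct*delta
       - best estimate is d_hat_j, j != m (w.p. 1-q), reward: pm - delta
       - penalty (w.p. 1 - dj): pm unchanged."""
    pm = p[m]
    exp_next = 0.0
    for pj, dj in zip(p, d):
        exp_next += pj * dj * q * (pm + ct * delta)     # reward, m ranked best
        exp_next += pj * dj * (1 - q) * (pm - delta)    # reward, other ranked best
        exp_next += pj * (1 - dj) * pm                  # penalty: no change
    return exp_next - pm

def closed_form(p, d, q, ct, delta):
    """The closed form derived in the text: sum_j pj dj (q(ct+1) - 1) * delta."""
    return sum(pj * dj for pj, dj in zip(p, d)) * (q * (ct + 1) - 1) * delta
```

The two computations agree term-by-term, and the closed form is positive exactly when q > 1/(ct + 1), which is the submartingale condition invoked above.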
that for all resolution parameters N > N0 and for all t > t0, the quantities q(t) > 1 − γ, and Pr{pm(∞) = 1} → 1.
Proof: According to the submartingale convergence theorem [15],
pm(∞) = 0 or 1 w.p. 1.
If we denote ej as the unit vector with the jth element being 1, then our task is to prove:
Γm(P) ≜ Pr{pm(∞) = 1 | P(0) = P} = Pr{P(∞) = em | P(0) = P} → 1.  (5)
To prove Eq. (5), we shall use the theory of Regular functions [15]. Let φ(P) be a function of P. If we define an operator U as
Uφ(P) = E[φ(P(n + 1)) | P(n) = P],  (6)
we refer to the function φ(P) as being Superregular if Uφ(P) ≤ φ(P), Subregular if Uφ(P) ≥ φ(P), and Regular if Uφ(P) = φ(P). Applying U repeatedly to a subregular function yields:
φ(P) ≤ Uφ(P) ≤ U²φ(P) ≤ ... ≤ U^∞φ(P),  (7)
with all the inequalities reversed for a superregular function. The result of the n-step invocation of U is:
U^n φ(P) = E[φ(P(n)) | P(0) = P].  (8)
The boundary conditions of interest are:
φ(em) = 1, and φ(ej) = 0 for all j ≠ m.  (9)
If, further, φ(P) satisfies the boundary conditions of Eq. (9), then, as per the definition of Regular functions and the submartingale convergence theory, we have
U^∞ φ(P) = E[φ(P(∞)) | P(0) = P]
= Σ_{j=1...r} φ(ej) Pr{P(∞) = ej | P(0) = P}
= Pr{P(∞) = em | P(0) = P}
= Γm(P).  (10)
Comparing Eq. (10) with Eq. (8), we see that Γm(P) is exactly the function φ(P) upon which, if U is applied an infinite number of times, the sequence of operations will lead to a function that equals the Regular function φ(P). Consequently, Γm(P) can be indirectly obtained by investigating a Regular function of P. However, such a Regular function is not easily found, although its existence is guaranteed. Fortunately, Eq. (7) tells us that Γm(P), i.e., the Regular function of P, can be bounded from below by a subregular function of P. As we are most interested in the lower bound of Γm(P), our goal is to find such a subregular function of P which also satisfies the boundary conditions given by Eq. (9), which will then guarantee that it bounds Γm(P) from below.
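This bounding mechanism can be illustrated on a toy absorbing chain. The birth-death chain below is an assumed stand-in for the DPA's Markov process, used only to show that repeatedly applying U to a subregular function satisfying boundary conditions of the kind in Eq. (9) produces a nondecreasing sequence of lower bounds on the absorption probability.

```python
import math

# Toy absorbing birth-death chain on the states {0, Δ, 2Δ, ..., 1}: from an
# interior state the walk moves up w.p. a, down w.p. b, and stays otherwise;
# the states 0 and 1 are absorbing. This is an assumed stand-in, not the DPA.
a, b, delta = 0.6, 0.3, 0.1
n_states = int(round(1 / delta)) + 1

def apply_U(phi):
    """The operator U of Eq. (6): U phi(p) = E[phi(next state) | state p]."""
    out = phi[:]
    for s in range(1, n_states - 1):
        out[s] = a * phi[s + 1] + b * phi[s - 1] + (1 - a - b) * phi[s]
    return out

# Candidate function phi(p) = (1 - e^{-xp}) / (1 - e^{-x}); it satisfies the
# boundary conditions phi(0) = 0 and phi(1) = 1, and a direct computation
# shows it is subregular for this chain whenever 0 < x <= ln(a/b)/delta.
x = 5.0
phi0 = [(1 - math.exp(-x * s * delta)) / (1 - math.exp(-x)) for s in range(n_states)]

iterates = [phi0]
for _ in range(200):
    iterates.append(apply_U(iterates[-1]))
# Each application of U can only increase a subregular function pointwise,
# so the iterates climb monotonically toward the absorption probability.
```

The initial function is thus already a valid lower bound on the probability of absorbing at the favorable barrier, which is the exact role the subregular function plays for Γm(P) in the proof.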
To find a proper subregular function of P, we need to first find the corresponding superregular function. Consider φm(P) = e^{−xm pm} as a specific instantiation of φ, where xm is a positive constant. Then, under the DPA,
Uφm(P) − φm(P)
= E[φm(P(n + 1)) | P(n) = P] − φm(P)
= Σ_{j=1...r} pj dj q e^{−xm(pm + ct Δ)} + Σ_{j=1...r} pj dj (1 − q) e^{−xm(pm − Δ)} + Σ_{j=1...r} pj (1 − dj) e^{−xm pm} − e^{−xm pm}
= Σ_{j=1...r} pj dj e^{−xm pm} ( q(e^{−xm ct Δ} − e^{xm Δ}) + (e^{xm Δ} − 1) ).  (11)
We must determine a proper value for xm such that φm(P) is superregular, i.e., Uφm(P) − φm(P) ≤ 0. This is equivalent to solving the following inequality:
q(e^{−xm ct Δ} − e^{xm Δ}) + (e^{xm Δ} − 1) ≤ 0.  (12)
Writing b = e^{xm Δ} and expanding b^{−ct} and b up to the second-order terms in ln b, Eq. (12) becomes:
−q(ln b)(ct + 1) + q(ct² − 1)(ln b)²/2 + ln b + (ln b)²/2 ≤ 0.
Substituting b back with e^{xm Δ}, we have xm Δ ( xm Δ (q(ct² − 1) + 1)/2 − (q(ct + 1) − 1) ) ≤ 0. As xm is defined as a positive constant, we have
0 < xm Δ ≤ 2(q(ct + 1) − 1) / (q(ct² − 1) + 1).  (13)
If we denote xm0 Δ = 2(q(ct + 1) − 1) / (q(ct² − 1) + 1), then, since ct ≥ 1 and q(t)(t>t0) > 1/2, the right-hand side of Eq. (13) is positive, and φm(P) is superregular for every xm with 0 < xm ≤ xm0. Consequently, the function
ψm(P) = (1 − e^{−xm pm}) / (1 − e^{−xm})  (14)
is subregular, satisfies the boundary conditions of Eq. (9), and thus bounds Γm(P) from below:
Γm(P) ≥ (1 − e^{−xm pm}) / (1 − e^{−xm}).
As Eq. (14) holds for every xm bounded by Eq. (13), we can choose the largest value, xm0. As the resolution parameter N increases, Δ → 0, and so xm0 → ∞, whence Γm(P) → 1. We have thus proved that for the DPA, Pr{pm(∞) = 1} → 1, implying its ε-optimality.
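Numerically, the final limiting step behaves as claimed. The sketch below evaluates the lower bound of Eq. (14) at the largest admissible value xm0 from Eq. (13), under assumed values of q > 1/2, ct, and the initial probability pm, and shows the bound approaching unity as Δ shrinks (i.e., as the resolution N grows).

```python
import math

def xm0(q, ct, delta):
    """Largest admissible x_m from Eq. (13): x_m * Delta = 2(q(ct+1)-1)/(q(ct^2-1)+1)."""
    return 2 * (q * (ct + 1) - 1) / ((q * (ct * ct - 1) + 1) * delta)

def lower_bound(x, pm):
    """Eq. (14): Gamma_m(P) >= (1 - exp(-x pm)) / (1 - exp(-x))."""
    return (1 - math.exp(-x * pm)) / (1 - math.exp(-x))

# Assumed values: q > 1/2, as is guaranteed after t0; ct in {1, ..., r-1};
# and a modest probability pm for the optimal action.
q, ct, pm = 0.75, 2, 0.2
for delta in (1e-1, 1e-2, 1e-3):
    print(delta, lower_bound(xm0(q, ct, delta), pm))
```

As Δ decreases, xm0 grows without bound and the printed lower bounds increase toward 1, mirroring the conclusion Γm(P) → 1.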
5 Conclusions
Estimator Algorithms (EAs) are acclaimed to be the fastest Learning Automata (LA), and within this family, the set of Pursuit Algorithms (PAs) has been considered to be the pioneering scheme. The ε-optimality of PAs is of great importance and has been studied for decades. However, the convergence proofs for the PAs in all the reported papers have a common flaw, which was discovered and rectified for the CPA by the authors of [14]. This paper corrects the flaw and provides a new proof for the DPA.
Rather than examining the monotonicity property of the sequence {pm(t)}t>t0, as was done in the previous papers and in [14], the new proof presented in this paper studies the submartingale property of {pm(t)}t>t0 in the DPA. Thereafter, by virtue of the submartingale property and this weaker condition, the new proof invokes the theory of Regular functions, and does not require the learning parameter to decrease/increase gradually in proving the ε-optimality of the DPA.
We believe that our submitted proof is both novel and pioneering. Further, as opposed
to the proof in [14], the new method does not require the parameter to change continuously. Also, the new proof can be extended to formally demonstrate the ε-optimality of
other EAs with absorbing states.
References
1. Oommen, B.J., Granmo, O.C., Pedersen, A.: Using stochastic AI techniques to achieve unbounded resolution in finite player Goore Games and its applications. In: IEEE Symposium on Computational Intelligence and Games, Honolulu, HI (2007)
2. Beigy, H., Meybodi, M.R.: Adaptation of parameters of BP algorithm using learning automata. In: Sixth Brazilian Symposium on Neural Networks, JR, Brazil (2000)
3. Granmo, O.C., Oommen, B.J., Myrer, S.A., Olsen, M.G.: Learning automata-based solutions to the nonlinear fractional knapsack problem with applications to optimal resource allocation. IEEE Trans. Syst., Man, Cybern. B 37(1), 166–175 (2007)
4. Unsal, C., Kachroo, P., Bay, J.S.: Multiple stochastic learning automata for vehicle path control in an automated highway system. IEEE Trans. Syst., Man, Cybern. A 29, 120–128 (1999)
5. Granmo, O.C.: Solving stochastic nonlinear resource allocation problems using a hierarchy of twofold resource allocation automata. IEEE Trans. Computers 59(4), 545–560 (2010)
6. Yazidi, A., Granmo, O.C., Oommen, B.J.: Service selection in stochastic environments: A learning-automaton based solution. Applied Intelligence 36, 617–637 (2012)
7. Vafashoar, R., Meybodi, M.R., Momeni, A.A.H.: CLA-DE: A hybrid model based on cellular learning automata for numerical optimization. Applied Intelligence 36, 735–748 (2012)
8. Thathachar, M.A.L., Sastry, P.S.: Estimator algorithms for learning automata. In: The Platinum Jubilee Conference on Systems and Signal Processing, Bangalore, India, pp. 29–32 (1986)
9. Oommen, B.J., Lanctot, J.K.: Discretized pursuit learning automata. IEEE Trans. Syst., Man, Cybern. 20, 931–938 (1990)
10. Lanctot, J.K., Oommen, B.J.: On discretizing estimator-based learning algorithms. IEEE Trans. Syst., Man, Cybern. B 2, 1417–1422 (1991)
11. Lanctot, J.K., Oommen, B.J.: Discretized estimator learning automata. IEEE Trans. Syst., Man, Cybern. B 22(6), 1473–1483 (1992)
12. Rajaraman, K., Sastry, P.S.: Finite time analysis of the pursuit algorithm for learning automata. IEEE Trans. Syst., Man, Cybern. B 26, 590–598 (1996)
13. Oommen, B.J., Agache, M.: Continuous and discretized pursuit learning schemes: Various algorithms and their comparison. IEEE Trans. Syst., Man, Cybern. B 31(3), 277–287 (2001)
14. Ryan, M., Omkar, T.: On ε-optimality of the pursuit learning algorithm. Journal of Applied Probability 49(3), 795–805 (2012)
15. Narendra, K.S., Thathachar, M.A.L.: Learning Automata: An Introduction. Prentice Hall (1989)