
Using the Theory of Regular Functions to Formally Prove the ε-Optimality of Discretized Pursuit Learning Algorithms

Xuan Zhang1, B. John Oommen2,1,*, Ole-Christoffer Granmo1, and Lei Jiao1

1 Dept. of ICT, University of Agder, Grimstad, Norway
2 School of Computer Science, Carleton University, Ottawa, Canada
Abstract. Learning Automata (LA) can be reckoned to be the founding algorithms on which the field of Reinforcement Learning has been built. Among the families of LA, Estimator Algorithms (EAs) are certainly the fastest, and of these, the family of Pursuit Algorithms (PAs) constitutes the pioneering work. It has recently been reported that the previous proofs of ε-optimality for all the reported algorithms in the family of PAs are flawed1. We applaud the researchers who discovered this flaw, and who further proceeded to rectify the proof for the Continuous Pursuit Algorithm (CPA). The latter proof, though it requires the learning parameter to be continuously changing, is, to the best of our knowledge, currently the best and only way to prove the CPA's ε-optimality. However, for all the algorithms with absorbing states, for example, the Absorbing Continuous Pursuit Algorithm (ACPA) and the Discretized Pursuit Algorithm (DPA), the constraint of a continuously changing learning parameter can be removed. In this paper, we provide a new method to prove the ε-optimality of the Discretized Pursuit Algorithm which does not require this constraint. We believe that our proof is both unique and pioneering. It can also form the basis for formally showing the ε-optimality of the other EAs with absorbing states.

Keywords: Pursuit Algorithms, Discretized Pursuit Algorithm, ε-optimality.

* Chancellor's Professor; Fellow: IEEE and Fellow: IAPR. The author also holds an Adjunct Professorship with the Dept. of ICT, University of Agder, Norway.
1 This flaw also renders the finite time analysis of these algorithms [12] incorrect. This is because the latter analysis relied on the same condition used in the flawed proofs, i.e., it invoked the monotonicity property of the probability of selecting the optimal action.

1 Introduction
Learning automata (LA) have been studied as a typical model of reinforcement learning for decades. An LA is an adaptive decision-making unit that learns the optimal action from among a set of actions offered by the Environment it operates in. At each iteration, the LA selects one action, which triggers either a stochastic reward or a penalty as a response from the Environment. Based on the response and the knowledge acquired in the past iterations, the LA adjusts its action selection strategy so as to make a wiser decision in the next iteration. In this way, the LA, even though it lacks complete knowledge about the Environment, is able to learn through repeated interactions with the Environment, and adapts itself to the optimal decision. Hence, LA have been applied to a variety of fields where complete knowledge of the Environment cannot be obtained. These applications include game playing [1], parameter optimization [2], solving knapsack-like problems and utilizing the solution in web polling and sampling [3], vehicle path control [4], resource allocation [5], service selection in stochastic environments [6], and numerical optimization [7].

One of the most important parts of the design and analysis of LA consists of the formal proofs of their convergence accuracies. Among all the different types of LA (FSSA, VSSA, Discretized, etc.), the most difficult proofs involve the family of EAs. This is because the convergence involves two intertwined phenomena, i.e., the convergence of the reward estimates and the convergence of the action probabilities themselves. Ironically, the combination of these in the updating rule is what renders the EA fast.
Prior Proofs: As the pioneering work in the study of EAs, the ε-optimality of the family of PAs has been studied and presented in [9], [10], [11], [12] and [13]. The basic result stated in these papers is that by utilizing a sufficiently small value for the learning parameter (or resolution), PAs will converge to the optimal action with an arbitrarily large probability.

Flaws in the Existing Proofs: The premise for this paper is that the proofs reported for almost three decades for PAs have a common flaw, which involves a very fine argument. In fact, the proofs reported in these papers deduced the ε-optimality based on the conclusion that after a sufficiently large but finite time instant, t_0, the probability of selecting the optimal action is monotonically increasing, which, in turn, is based on the condition that the reward probability estimates are ordered properly forever after t_0. This ordering is, indeed, true by the law of large numbers if all the actions are chosen infinitely often and if the time instant, t_0, is defined to be infinite. But if such an infinite selection does not occur, the proper ordering holds only in probability. In other words, the authors of these papers misinterpreted the concept of being "ordered forever" with being "ordered most of the time after t_0". As a consequence of this misinterpretation, the condition supporting the monotonicity property is false, which leads to an incorrect proof of the PAs being ε-optimal.
Discovery of the Flaw: Even though this has been the accepted argument for almost three decades, we credit the authors of [14] for discovering this flaw, and for further rectifying the proof for the CPA. The latter proof is also based on the monotonicity property of the probability of selecting the optimal action, and requires that the learning parameter be decreasing over time. This methodology is, to the best of our knowledge, the current best way of proving the ε-optimality of the CPA.

Problem Statement: This paper aims at correcting the above-mentioned flaw by presenting a new proof for the DPA, which is an EA with absorbing states. As opposed to all the previous EA proofs, we will show that while the monotonicity property is sufficient for convergence, it is not really necessary for proving that the DPA is ε-optimal. Rather, we will present a completely new proof methodology which is based on the convergence theory of submartingales and the theory of Regular functions [15]. The new proof is distinct in principle and argument from the proof reported in [14], and does not require the learning parameter to be continuously decreasing. Besides, the new proof can be extended to formally demonstrate the ε-optimality of other EAs with absorbing states.

2 Overview of the DPA


We first present the notation used for the DPA:
r: The number of actions.
α_j: The jth action that can be selected by the LA.
P: The action probability vector, P = [p_1, p_2, ..., p_r], with Σ_{j=1...r} p_j = 1. An action is selected by randomly sampling from P.
D: The reward probability vector, D = [d_1, d_2, ..., d_r], with each d_j being the probability that the corresponding action α_j will be rewarded.
v_j: The number of times α_j has been selected.
u_j: The number of times α_j has been rewarded.
d̂_j: The jth element of the reward probability estimates vector D̂, where d̂_j = u_j / v_j.
m: The index of the optimal action.
h: The index of the largest element of D̂.
R: The response from the Environment, where R = 0 corresponds to a Reward, and R = 1 to a Penalty.
Δ: The discretized step size, where Δ = 1/(rN), with N being a positive integer.
The DPA can be described as a sequence of iterations. In iteration t:
1. The LA selects an action by sampling from P(t); suppose the selected action is α_i.
2. The selected action triggers either a reward or a penalty from the Environment. The LA updates D̂(t) = [d̂_1(t), d̂_2(t), ..., d̂_r(t)] based on the Environment's response as:
u_i(t) = u_i(t−1) + (1 − R(t));  v_i(t) = v_i(t−1) + 1;  d̂_i(t) = u_i(t) / v_i(t).
3. Suppose d̂_h(t) is the largest element in D̂(t); then α_h is considered as the current best action. The LA updates its action probabilities as:
If R(t) = 0 Then
p_j(t+1) = max{p_j(t) − Δ, 0}, ∀ j ≠ h
p_h(t+1) = 1 − Σ_{j≠h} p_j(t+1)
Else
P(t+1) = P(t)
EndIf
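For concreteness, the following is a minimal Python sketch of this iteration interacting with a simulated Bernoulli-reward Environment. It merely illustrates the updating rules listed above: the function name dpa_step, the zero-initialized estimates and the particular reward probabilities are our own illustrative choices and are not prescribed by [9].

import random

def dpa_step(P, D_hat, u, v, Delta, d):
    """One DPA iteration against an Environment with reward probabilities d."""
    r = len(P)
    # Step 1: select an action alpha_i by sampling from P(t).
    i = random.choices(range(r), weights=P)[0]
    # Step 2: observe the response (R = 0 is a Reward, R = 1 a Penalty) and update the estimate.
    R = 0 if random.random() < d[i] else 1
    u[i] += 1 - R
    v[i] += 1
    D_hat[i] = u[i] / v[i]
    # Step 3: on a reward, pursue the currently best-estimated action alpha_h.
    if R == 0:
        h = max(range(r), key=lambda j: D_hat[j])
        for j in range(r):
            if j != h:
                P[j] = max(P[j] - Delta, 0.0)
        P[h] = 1.0 - sum(P[j] for j in range(r) if j != h)
    # On a penalty, P(t+1) = P(t), so nothing remains to be done.
    return i, R

# Illustrative run: r = 4 actions, resolution N = 100, arbitrary reward probabilities.
r, N = 4, 100
Delta = 1.0 / (r * N)
P = [1.0 / r] * r
D_hat, u, v = [0.0] * r, [0] * r, [0] * r
d = [0.8, 0.6, 0.4, 0.2]
for t in range(50000):
    dpa_step(P, D_hat, u, v, Delta, d)
print(P)  # P typically absorbs at the unit vector of the best action

In such a run, the probability of the action with the highest reward probability is typically driven to unity, which is precisely the behaviour whose formal guarantee occupies the remainder of the paper.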
We now visit the proofs of the DPA's convergence.

3 Prior Proof for the DPA's ε-Optimality

The formal assertion of the ε-optimality of the DPA [9] is stated in Theorem 1.

Theorem 1. Given any small ε > 0 and δ > 0, there exist an N_0 > 0 and a t_0 < ∞ such that for all time t ≥ t_0 and for any positive learning parameter N > N_0, Pr{p_m(t) > 1 − ε} > 1 − δ.


The earlier reported proofs of the ε-optimality of the DPA follow a strategy that consists of four steps. Firstly, given a sufficiently large value for the learning parameter N, all actions will be selected a sufficiently large number of times before a finite time instant, t_0. Secondly, for all t > t_0, d̂_m will remain the maximum element of the reward probability estimates vector, D̂. Thirdly, supposing that d̂_m has been ranked as the largest element in D̂ since t_0, the action probability sequence {p_m(t)}, with t > t_0, will be monotonically increasing, whence one concludes that p_m(t) converges to 1 with probability 1. Finally, given that the probability of d̂_m being the largest element in D̂ is arbitrarily close to unity, and that p_m(t) → 1 w.p. 1, ε-optimality follows from the axiom of total probability.

The formal assertions of these steps are listed below.
1. The first step of the proof can be described mathematically by Theorem 2.

Theorem 2. For any given constants δ > 0 and M < ∞, there exist an N_0 > 0 and a t_0 < ∞ such that under the DPA algorithm, for all positive N > N_0,

Pr{All actions are selected at least M times each before time t_0} > 1 − δ.

The detailed proof of this result can be found in [9].


2. The sequence of probabilities, {p_m(t)}_{(t>t_0)}, is stated to be monotonically increasing. The previous proofs attempted to establish this by showing that:

|p_m(t)| ≤ 1, and
Δp_m(t) = E[p_m(t+1) − p_m(t) | K̂(t_0)] = Σ_{j=1...r} p_j(t) d_j c_t Δ ≥ 0, ∀ t > t_0,    (1)

where c_t = 1, 2, ..., r − 1, and K̂(t_0) is the condition that d̂_m remains the largest element in D̂ after time t_0. If this step of the proof were flawless2, p_m(t) could be shown to converge to 1 w.p. 1.

3. Since p_m(t) → 1 w.p. 1, if it can, indeed, be proven that Pr{K̂(t_0)} > 1 − δ, then, by the axiom of total probability, one can see that:

Pr{p_m(t) > 1 − ε} ≥ Pr{p_m(t) → 1} · Pr{K̂(t_0)} > 1 · (1 − δ) = 1 − δ,

and ε-optimality is proved.


According to the sketch of the proof above, the key is to prove that Pr{K̂(t_0)} > 1 − δ, i.e.,

Pr{d̂_m(t) > d̂_j(t), ∀ j ≠ m, ∀ t > t_0} > 1 − δ.    (2)

In the reported proofs for the EAs, Eq. (2) is reckoned true if the following assumption holds. Let w be the difference between the two highest reward probabilities. If all actions are selected a large number of times, each of the d̂_i will lie within a w/2 neighborhood of d_i with an arbitrarily large probability. In other words, the probability of d̂_m(t) being greater than d̂_j(t), ∀ j ≠ m, will be arbitrarily close to unity. This assumption can be easily proven by the weak law of large numbers, as per [9].
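To indicate how this weak-law step can be quantified (the following bound is merely our own illustration, not the derivation given in [9]), observe that d̂_i is the average of v_i independent Bernoulli(d_i) observations, so that Chebyshev's inequality yields

Pr{ |d̂_i − d_i| ≥ w/2 } ≤ Var(d̂_i) / (w/2)² = d_i(1 − d_i) / (v_i (w/2)²) ≤ 1 / (v_i w²).

Hence, if every action has been selected at least M ≥ r/(δ w²) times, a union bound over the r actions keeps all the estimates inside their w/2 neighborhoods, and thus keeps d̂_m(t) as the largest estimate at that time instant, with probability exceeding 1 − δ.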
2 The error in the proofs lies precisely at this juncture, as we shall show presently.

Flaw in the Argument: There is a flaw in the above argument. In fact, the above assumption does not guarantee Pr{K̂(t_0)} > 1 − δ. To be specific, let us define K(t) = {d̂_m(t) is the largest element in D̂(t)}. Then the result that can be deduced from the assumption when t > t_0 is that Pr{K(t)} > 1 − δ. But, indeed, the condition that Eq. (1) is based on is K̂(t_0) = ∩_{t>t_0} K(t), which means that for every single time instant in the future, i.e., ∀ t > t_0, d̂_m(t) needs to be the largest element in D̂(t). The previous flawed proofs mistakenly reckoned that K(t) is equivalent to K̂(t_0).

The flaw is documented in [14], which focused on the CPA, and which further provided a way of correcting the flaw, i.e., by proving Pr{K̂(t_0)} > 1 − δ instead of proving Pr{K(t)} > 1 − δ. However, that proof requires a sequence of decreasing values of the learning rate (for the CPA), which is not necessary for proving the ε-optimality of the DPA. We thus present a new proof for the DPA that is quite distinct from (and uses completely different techniques than) that reported in [14]. The new proof also follows a four-step sketch, but instead of examining the monotonicity property, it is rather based on the convergence theory of submartingales, and on the theory of Regular functions.

4 The DPA's ε-Optimality: A New Proof

4.1 The Moderation Property of the DPA

The property of moderation can be described precisely by Theorem 2, which has been proven in [9]. It implies that under the DPA, by utilizing a sufficiently large value for the learning parameter, N, each action will be selected an arbitrarily large number of times.
4.2 The Key Condition Ĝ(t_0) for {p_m(t)}_{t>t_0} Being a Submartingale
In our proof strategy, instead of examining the condition for {p_m(t)}_{t>t_0} being monotonically increasing, we will investigate the condition for {p_m(t)}_{t>t_0} being a submartingale. This is based on Ĝ(t_0), defined as follows:

q_j(t) = Pr{d̂_m(t) > d̂_j(t)}, j ≠ m,
q(t) = Pr{d̂_m(t) > d̂_j(t), ∀ j ≠ m} = Π_{j≠m} q_j(t),    (3)

G(t) = {q(t) > 1 − δ}, δ ∈ (0, 1),
Ĝ(t_0) = ∩_{t>t_0} {q(t) > 1 − δ}, δ ∈ (0, 1).    (4)

Our goal is to prove the result given in Theorem 3.

Theorem 3. Given a δ ∈ (0, 1), there exists a time instant t_0 < ∞ such that Ĝ(t_0) holds. In other words, for this given δ, there exists a t_0 < ∞ such that ∀ t > t_0: q(t) > 1 − δ.

Proof: To prove Theorem 3, we are to prove Eq. (2). As mentioned in Section 3, this can be proven by showing that each of the d̂_j lies within a w/2 neighborhood of d_j with a large enough probability. The latter, in turn, can be directly proven by invoking the weak law of large numbers.


4.3 {p_m(t)}_{t>t_0} is a Submartingale under the DPA

We now prove the submartingale property of {p_m(t)}_{t>t_0} for the DPA.

Theorem 4. Under the DPA, the quantity {p_m(t)}_{t>t_0} is a submartingale.

Proof: Firstly, as p_m(t) is a probability, we have

E[p_m(t)] ≤ 1 < ∞.

Secondly, we explicitly calculate E[p_m(t)]. Using the DPA's updating rule, we describe the update of p_m(t) as per Table 1. Thus, we have:

E[p_m(t+1) | P(t)]
= Σ_{j=1...r} p_j (d_j (q(p_m + c_t Δ) + (1 − q)(p_m − Δ)) + (1 − d_j) p_m)
= Σ_{j=1...r} p_j p_m + Σ_{j=1...r} p_j d_j q c_t Δ − Σ_{j=1...r} p_j d_j Δ + Σ_{j=1...r} p_j d_j q Δ
= p_m + Σ_{j=1...r} p_j d_j (q(c_t Δ + Δ) − Δ).

In the above, p_m(t) and q(t) are respectively written as p_m and q in the interest of conciseness. The difference between E[p_m(t+1)] and p_m(t) can be expressed as:

Diff_{p_m}(t) = E[p_m(t+1) | P(t)] − p_m(t) = Σ_{j=1...r} p_j(t) d_j (q(t)(c_t Δ + Δ) − Δ).

Given that p_j(t) > 0 and d_j > 0, if we denote

Z_t = Δ / (c_t Δ + Δ) = 1 / (c_t + 1),

we see that if, ∀ t > t_0, q(t) > Z_t, then Diff_{p_m}(t) > 0, and so {p_m(t)}_{t>t_0} is a submartingale.
Table 1. The various possibilities for updating p_m for the next iteration under the DPA

Response                 Greatest element in D̂            Updated p_m(t+1)
Reward (w.p. d_j)        d̂_m (w.p. q(t))                  p_m(t) + c_t Δ
Reward (w.p. d_j)        d̂_j, j ≠ m (w.p. 1 − q(t))       p_m(t) − Δ
Penalty (w.p. 1 − d_j)   d̂_j, j = 1...r (w.p. 1)          p_m(t)

As per the action probability updating rules of the DPA, c_t = 1, 2, ..., r − 1, implying that Z_t ∈ [1/r, 1/2]. Let the quantity 1 − δ_1 = max{Z_t} = 1/2. Then, according to Theorem 3, there exists a time instant t_0 such that ∀ t > t_0, q(t) > 1/2 = max{Z_t}. Consequently, {p_m(t)}_{t>t_0} is a submartingale, and the theorem is proven.
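As a purely numerical illustration of this threshold (the probability vectors below are arbitrary choices of ours, not quantities taken from the analysis), the expected increment Diff_{p_m}(t) = Σ_j p_j d_j (q(c_t + 1)Δ − Δ) changes sign exactly at q = Z_t = 1/(c_t + 1):

def expected_increment(P, D, q, c_t, Delta):
    # Diff_{p_m}(t) = sum_j p_j * d_j * (q * (c_t + 1) * Delta - Delta), as derived above.
    return sum(p * d for p, d in zip(P, D)) * (q * (c_t + 1) * Delta - Delta)

P = [0.4, 0.3, 0.2, 0.1]        # arbitrary action probabilities
D = [0.8, 0.6, 0.4, 0.2]        # arbitrary reward probabilities
Delta, c_t = 1.0 / (4 * 50), 3  # r = 4, N = 50, and c_t = r - 1
for q in (0.20, 0.25, 0.30, 0.60, 0.99):
    print(q, expected_increment(P, D, q, c_t, Delta))  # negative, zero, then positive

The sign becomes non-negative as soon as q(t) exceeds 1/(c_t + 1) ≤ 1/2, which is precisely the condition secured by Theorem 3.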

4.4 Pr{p_m(∞) = 1} → 1 under the DPA

We now prove the ε-optimality of the DPA.

Theorem 5. The DPA is ε-optimal in all random Environments. More formally, given any δ_1 ≤ 1/2, there exist a positive integer N_0 < ∞ and a time instant t_0 < ∞ such that for all resolution parameters N > N_0 and for all t > t_0, the quantities q(t) > 1 − δ_1, and Pr{p_m(∞) = 1} → 1.
Proof: According to the submartingale convergence theory [15],

p_m(∞) = 0 or 1.

If we denote e_j as the unit vector with the jth element being 1, then our task is to prove:

Γ_m(P) ≡ Pr{p_m(∞) = 1 | P(0) = P} = Pr{P(∞) = e_m | P(0) = P} → 1.    (5)

To prove Eq. (5), we shall use the theory of Regular functions [15]. Let φ(P) be a function of P. If we define an operator U as

Uφ(P) = E[φ(P(n+1)) | P(n) = P],

then the result of the n-step invocation of U is:

U^n φ(P) = E[φ(P(n)) | P(0) = P].

We refer to the function φ(P) as being:

Superregular: If Uφ(P) ≤ φ(P). Then applying U repeatedly yields:
φ(P) ≥ Uφ(P) ≥ U²φ(P) ≥ ... ≥ U^∞φ(P).    (6)

Subregular: If Uφ(P) ≥ φ(P). In this case, if we apply U repeatedly, we have:
φ(P) ≤ Uφ(P) ≤ U²φ(P) ≤ ... ≤ U^∞φ(P).    (7)

Regular: If Uφ(P) = φ(P). In such a case, it follows that:
φ(P) = Uφ(P) = U²φ(P) = ... = U^∞φ(P).    (8)

Moreover, if φ(P) satisfies the boundary conditions

φ(e_m) = 1 and φ(e_j) = 0 (for j ≠ m),    (9)

then, as per the definition of Regular functions and the submartingale convergence theory, we have

U^∞φ(P) = E[φ(P(∞)) | P(0) = P]
= Σ_{j=1...r} φ(e_j) Pr{P(∞) = e_j | P(0) = P}
= Pr{P(∞) = e_m | P(0) = P}
= Γ_m(P).    (10)

Comparing Eq. (10) with Eq. (8), we see that Γ_m(P) is exactly the function φ(P) upon which, if U is applied an infinite number of times, the sequence of operations will lead to a function that equals the Regular function φ(P). Consequently, Γ_m(P) can be indirectly obtained by investigating a Regular function of P. However, such a Regular function is not easily found, although its existence is guaranteed. Fortunately, Eq. (7) tells us that Γ_m(P), i.e., the Regular function of P, can be bounded from below by a subregular function of P. As we are most interested in the lower bound of Γ_m(P), our goal is to find such a subregular function of P which also satisfies the boundary conditions given by Eq. (9), and which will then be guaranteed to bound Γ_m(P) from below.
To find a proper subregular function of P, we first need to find the corresponding superregular function. Consider φ_m(P) = e^{−x_m p_m} as a specific instantiation of φ, where x_m is a positive constant. Then, under the DPA,

Uφ_m(P) − φ_m(P)
= E[φ_m(P(n+1)) | P(n) = P] − φ_m(P)    (11)
= E[e^{−x_m p_m(n+1)} | P(n) = P] − e^{−x_m p_m}
= Σ_{j=1...r} e^{−x_m(p_m + c_t Δ)} p_j d_j q + Σ_{j=1...r} e^{−x_m(p_m − Δ)} p_j d_j (1 − q) + Σ_{j=1...r} e^{−x_m p_m} p_j (1 − d_j) − e^{−x_m p_m}
= Σ_{j=1...r} p_j d_j e^{−x_m p_m} (q(e^{−x_m c_t Δ} − e^{x_m Δ}) + (e^{x_m Δ} − 1)).

We must determine a proper value for x_m such that φ_m(P) is superregular, i.e., Uφ_m(P) − φ_m(P) ≤ 0. This is equivalent to solving the following inequality:

q(e^{−x_m c_t Δ} − e^{x_m Δ}) + (e^{x_m Δ} − 1) ≤ 0.    (12)

We know that when b > 0 and x → 0, b^x ≈ 1 + (ln b)x + ((ln b)²/2)x². If we set b = e^{x_m}, then, when Δ → 0, Eq. (12) can be re-written as

q(−(ln b)(c_t + 1)Δ + ((ln b)²Δ²/2)(c_t² − 1)) + (ln b)Δ + ((ln b)²Δ²/2) ≤ 0.
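To spell out the algebra that leads from this expansion to Eq. (13) below (a routine step which we include only for readability), collect the terms on the left-hand side and factor out (ln b)Δ > 0:

q(−(ln b)(c_t + 1)Δ + ((ln b)²Δ²/2)(c_t² − 1)) + (ln b)Δ + ((ln b)²Δ²/2)
= (ln b)Δ ( ((ln b)Δ/2)(q(c_t² − 1) + 1) − (q(c_t + 1) − 1) ),

so the inequality holds precisely when the bracketed factor is non-positive. Note that q(c_t + 1) − 1 > 0 because q > 1/2 and c_t ≥ 1, which is what makes the resulting upper bound on x_m positive.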


Substituting b with e^{x_m}, we have x_m Δ (x_m Δ (q(c_t² − 1) + 1)/2 − (q(c_t + 1) − 1)) ≤ 0. As x_m is defined as a positive constant, we have

0 < x_m ≤ 2(q(c_t + 1) − 1) / (Δ(q(c_t² − 1) + 1)).    (13)

If we denote x_{m0} = 2(q(c_t + 1) − 1) / (Δ(q(c_t² − 1) + 1)), we see that when Δ → 0, x_{m0} → ∞, since c_t = 1, 2, ..., r − 1 and q(t)_{(t>t_0)} > 1/2.


We now introduce another function, ψ_m(P) = (1 − e^{−x_m p_m}) / (1 − e^{−x_m}), where x_m is the same as defined in φ_m(P). Moreover, we observe the property that if φ_m(P) = e^{−x_m p_m} is superregular (subregular), then ψ_m(P) = (1 − e^{−x_m p_m}) / (1 − e^{−x_m}) is subregular (superregular) [15]. Therefore, the x_m defined by Eq. (13), which renders φ_m(P) superregular, makes ψ_m(P) subregular.
Obviously, ψ_m(P) meets the boundary conditions, i.e.,

ψ_m(P) = (1 − e^{−x_m p_m}) / (1 − e^{−x_m}) = 1 when P = e_m, and = 0 when P = e_j, j ≠ m.

Therefore, according to Eq. (7),

Γ_m(P) ≥ ψ_m(P) = (1 − e^{−x_m p_m}) / (1 − e^{−x_m}).    (14)

As Eq. (14) holds for every x_m bounded as per Eq. (13), we can choose the largest such value, x_{m0}; and when x_{m0} → ∞, ψ_m(P) → 1. We have thus proved that, for the DPA, Pr{p_m(∞) = 1} → 1, implying its ε-optimality.
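The closing limit argument can also be illustrated numerically (the specific values of q, c_t, r and p_m below are arbitrary, and the function names are ours). Choosing x_m at the bound x_{m0} of Eq. (13), in this example the left-hand side of inequality (12) stays non-positive, while the lower bound of Eq. (14) approaches unity as N grows, i.e., as Δ → 0:

import math

def x_m0(q, c_t, Delta):
    # The largest admissible x_m, as per Eq. (13).
    return 2.0 * (q * (c_t + 1) - 1.0) / (Delta * (q * (c_t * c_t - 1) + 1.0))

def lhs_12(q, c_t, Delta, x):
    # Left-hand side of inequality (12); superregularity of phi_m requires this <= 0.
    return q * (math.exp(-x * c_t * Delta) - math.exp(x * Delta)) + (math.exp(x * Delta) - 1.0)

def bound_14(x, p_m):
    # The subregular lower bound on Gamma_m(P) from Eq. (14).
    return (1.0 - math.exp(-x * p_m)) / (1.0 - math.exp(-x))

q, c_t, r, p_m = 0.6, 3, 4, 0.05
for N in (10, 100, 1000):
    Delta = 1.0 / (r * N)
    x = x_m0(q, c_t, Delta)
    print(N, lhs_12(q, c_t, Delta, x), bound_14(x, p_m))

The printed left-hand side of (12) remains negative while the printed bound climbs towards 1, mirroring the statement that Pr{p_m(∞) = 1} → 1 as the resolution parameter N is increased.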

5 Conclusions

Estimator Algorithms (EAs) are acclaimed to be the fastest Learning Automata (LA), and within this family, the set of Pursuit Algorithms (PAs) has been considered to be the pioneering scheme. The ε-optimality of PAs is of great importance and has been studied for decades. The convergence proofs for the PAs in all the previously reported papers have a common flaw, which was discovered by the authors of [14]; the latter paper corrected the flaw and provided a new proof for the CPA.

Rather than examining the monotonicity property of the {p_m(t)}_{(t>t_0)} sequence, as done in the previous papers and in [14], the new proof presented in this paper studies the submartingale property of {p_m(t)}_{(t>t_0)} in the DPA. Thereafter, by virtue of the submartingale property and this weaker condition, the new proof invokes the theory of Regular functions, and does not require the learning parameter to decrease or increase gradually in proving the ε-optimality of the DPA.

We believe that our submitted proof is both novel and pioneering. Further, as opposed to the proof in [14], the new method does not require the parameter to change continuously. Also, the new proof can be extended to formally demonstrate the ε-optimality of other EAs with absorbing states.


References
1. Oommen, B.J., Granmo, O.C., Pedersen, A.: Using stochastic AI techniques to achieve unbounded resolution in finite player Goore games and its applications. In: IEEE Symposium on Computational Intelligence and Games, Honolulu, HI (2007)
2. Beigy, H., Meybodi, M.R.: Adaptation of parameters of BP algorithm using learning automata. In: Sixth Brazilian Symposium on Neural Networks, JR, Brazil (2000)
3. Granmo, O.C., Oommen, B.J., Myrer, S.A., Olsen, M.G.: Learning automata-based solutions to the nonlinear fractional knapsack problem with applications to optimal resource allocation. IEEE Trans. Syst., Man, Cybern. B 37(1), 166–175 (2007)
4. Unsal, C., Kachroo, P., Bay, J.S.: Multiple stochastic learning automata for vehicle path control in an automated highway system. IEEE Trans. Syst., Man, Cybern. A 29, 120–128 (1999)
5. Granmo, O.C.: Solving stochastic nonlinear resource allocation problems using a hierarchy of twofold resource allocation automata. IEEE Trans. Computers 59(4), 545–560 (2010)
6. Yazidi, A., Granmo, O.C., Oommen, B.J.: Service selection in stochastic environments: A learning-automaton based solution. Applied Intelligence 36, 617–637 (2012)
7. Vafashoar, R., Meybodi, M.R., Momeni, A.A.H.: CLA-DE: A hybrid model based on cellular learning automata for numerical optimization. Applied Intelligence 36, 735–748 (2012)
8. Thathachar, M.A.L., Sastry, P.S.: Estimator algorithms for learning automata. In: The Platinum Jubilee Conference on Systems and Signal Processing, Bangalore, India, pp. 29–32 (1986)
9. Oommen, B.J., Lanctot, J.K.: Discretized pursuit learning automata. IEEE Trans. Syst., Man, Cybern. 20, 931–938 (1990)
10. Lanctot, J.K., Oommen, B.J.: On discretizing estimator-based learning algorithms. IEEE Trans. Syst., Man, Cybern. B 2, 1417–1422 (1991)
11. Lanctot, J.K., Oommen, B.J.: Discretized estimator learning automata. IEEE Trans. Syst., Man, Cybern. B 22(6), 1473–1483 (1992)
12. Rajaraman, K., Sastry, P.S.: Finite time analysis of the pursuit algorithm for learning automata. IEEE Trans. Syst., Man, Cybern. B 26, 590–598 (1996)
13. Oommen, B.J., Agache, M.: Continuous and discretized pursuit learning schemes: Various algorithms and their comparison. IEEE Trans. Syst., Man, Cybern. B 31(3), 277–287 (2001)
14. Ryan, M., Omkar, T.: On ε-optimality of the pursuit learning algorithm. Journal of Applied Probability 49(3), 795–805 (2012)
15. Narendra, K.S., Thathachar, M.A.L.: Learning Automata: An Introduction. Prentice Hall (1989)
