Lecture Notes in Control and Information Sciences
Edited by A. V. Balakrishnan and M. Thoma
Yousri M. El-Fattah
Claude Foulard
Learning Systems:
Decision, Simulation, and
Control
Springer-Verlag
Berlin Heidelberg New York 1978
Series Editors
A. V. Balakrishnan · M. Thoma
Advisory Board
A. G. J. MacFarlane • H. Kwakernaak • Ya. Z. Tsypkin
Authors
Dr. Y. M. El-Fattah
Electronics Laboratory
Faculty of Sciences
Rabat, Morocco
Professor C. Foulard
Automatic Control Laboratory
Polytechnic Institute of Grenoble
Grenoble, France
This work is subject to copyright. All rights are reserved, whether the whole
or part of the material is concerned, specifically those of translation, re-
printing, re-use of illustrations, broadcasting, reproduction by photocopying
machine or similar means, and storage in data banks.
Under § 54 of the German Copyright Law where copies are made for other
than private use, a fee is payable to the publisher, the amount of the fee to
be determined by agreement with the publisher.
© by Springer-Verlag Berlin Heidelberg 1978
Printed in Germany
Printing and binding: Beltz Offsetdruck, Hemsbach/Bergstr.
2061/3020-543210
FOREWORD
This monograph studies topics in using learning systems for decision, simulation, and
control. Chapter I discusses what is meant by learning systems, and comments on their
cybernetic modeling. Chapter II, concerning decision, is devoted to the problem of pattern
recognition. Chapter III, concerning simulation, is devoted to the study of a certain
class of problems of collective behavior. Chapter IV, concerning control, is devoted to a
simple model of finite Markov chains. For each of the last three chapters, numerical
examples are worked out entirely using computer simulations. This monograph has developed
over a number of years during which the first author has profited from a number of
research fellowships in France, Norway, and Belgium. He is grateful to a number of friends
and co-workers who influenced his views and collaborated with him. Particular thanks are
due to W. Brodey, R. Henriksen, S. Aidarous, M. Ribbens-Pavella, and M. Duflo.
Y. M. El-Fattah
C. Foulard
CONTENTS
ABSTRACT
CHAPTER I. CYBERNETICS OF LEARNING
1.1. System Concept ............................................. 1
1.2. Environment ................................................ 2
1.3. Control .................................................... 3
1.4. Learning Conditions ........................................ 4
1.5. Learning and Entropy ....................................... 5
1.6. Types of Learning Systems .................................. 5
1.7. Mathematical Modeling ...................................... 8
1.8. Conclusions ................................................ 8
Comments ........................................................ 9
References ...................................................... 10
CHAPTER II. DECISION - PATTERN RECOGNITION
2.1. Problem Statement .......................................... 11
2.2. Feature Extraction ......................................... 12
2.3. Karhunen-Loève Expansion ................................... 12
2.4. Intraset Feature Extraction ................................ 13
2.5. Interset Feature Extraction ................................ 16
2.6. Optimal Classification ..................................... 21
2.7. Statistical Decision Algorithms ............................ 23
2.8. Sequential Methods ......................................... 27
2.9. Supervised Bayes Learning .................................. 33
2.10. Non-Supervised Bayes Learning ............................. 35
2.11. Identifiability of Finite Mixtures ........................ 36
2.12. Probabilistic Iterative Methods - Supervised Learning ..... 37
2.13. Probabilistic Iterative Methods - Unsupervised Learning ... 42
2.14. Self-Learning with Unknown Number of Pattern Classes ...... 46
2.15. Application - Measurement Strategy for Systems
      Identification ............................................ 49
2.16. Conclusions ............................................... 60
Comments ........................................................ 62
References ...................................................... 62
CHAPTER III. SIMULATION - MODELS OF COLLECTIVE BEHAVIOR
3.1. Introduction ............................................... 64
3.2. Automata Model I - Sufficient a priori Information ......... 65
3.3. Automata Model II - Lack of a priori Information ........... 70
ABSTRACT
This monograph presents some fundamental and new approaches to the use of learning
systems in certain classes of decision, simulation, and control problems.
CHAPTER I. CYBERNETICS OF LEARNING
1.1. SYSTEM CONCEPT.
(a) Deterministic Systems. All the relations are represented by mappings (either
one-to-one or many-to-one). In other words, the output variables are functions of the
input variables. No probabilities have to be assigned to elements of the relations.
Deterministic systems can be subdivided into :
ii - Sequential systems. In this case there exists at least one input which is associated
with more than one output. The different outputs of the system to the same input belong
to different, but accurately defined, sequences of inputs which preceded the given input.
(b) Probabilistic (Stochastic) Systems. At least one of the input-output relations is not
represented by a mapping (it is represented by a one-to-many relation). Each element
(a, b) of the relation is then associated with a conditional probability P(b/a) of
occurrence of b when a occurs. Probabilistic systems can be subdivided into :
In a sequential system the output (response) depends not only on the instantaneous input
(stimulus) but also on the preceding inputs (stimuli). This means, however, that the
required stimuli must be remembered by the system in the form of values of some internal
quantities. Let us term them memory quantities, and the aggregate of their instantaneous
values the internal state of the system.
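As an illustration of this terminology (not part of the original treatment), the following
Python sketch contrasts a purely functional, memoryless system with a sequential one whose
response to the same stimulus depends on an internal state; the particular mappings used
are arbitrary illustrative choices.

    # Illustrative sketch: a memoryless mapping versus a sequential system with memory.
    def combinational_system(u):
        # the output is a function of the present input only
        return 2 * u

    class SequentialSystem:
        def __init__(self):
            self.state = 0          # memory quantity (internal state)

        def step(self, u):
            # the response depends on the present input and on the stored state,
            # i.e. on the sequence of inputs that preceded the given input
            y = self.state + u
            self.state = u          # remember the last stimulus
            return y

    if __name__ == "__main__":
        s = SequentialSystem()
        print(combinational_system(3), combinational_system(3))  # same output twice
        print(s.step(3), s.step(3))  # different outputs to the same input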
1.2 ENVIRONMENT.
Every system has its environment. With physical systems the environment is theoretically
everything that is not included in the given system. However, since we confine ourselves
mostly to a finite number of defined relations between the system and its environment, it
is usually of advantage to restrict oneself to the substantial environment, i.e. to a
limited set of elements in the environment which interest us. The same applies to abstract
systems. The physical system and its environment act on each other - they interact. The
manner in which a system influences its environment depends, in general, on the properties
of the system itself as well as on the manner in which the environment acts on the system.
Conversely, the same applies to the environment.
There is no "hard" boundary between the system and the environment. The environment is
indeed another system which surrounds it. The interaction process between the system and
the environment can only continue when the environment, defined by its behavior [2], and
the system likewise form two abstract sets which neither include nor exclude each other.
The intersection of the two sets represents the boundary between the system and the
environment, see Fig. 1. This represents the part of the system which is relevant to the
environment and, conversely, the part of the environment which is relevant to the system :
relevance with regard to the environment's or the system's purposes or goals, or the means
of their realization in work. So the boundary represents the interaction context between
the system and the environment. This interaction is maintained both by interdependence of
purpose - complementarity - and by interaction which maintains separation and thus
contradiction. Such tendencies towards both conflict and union are present in any real
interacting process. The context is being metabolized as the system and its environment
work at changing each other according to the dynamics of the interaction process and
change into each other.
1.3 CONTROL.
An obvious prerequisite for learning is that the system should have several courses of
action open to it. Since only a single course can be selected at any given time, it must
be decided which of the possible courses is to be taken. This is equivalent to ordering or
organizing the different courses of action according to their preference in view of a
certain goal linked to the environment response. The greater the disorder of those courses
of action, the greater the need for learning. Entropy is defined as a measure of that
disorder. Let p_i denote the probability that the system selects the i-th of its n courses
of action, with

    0 ≤ p_i ≤ 1 ,    Σ_{i=1}^{n} p_i = 1                                   (1)

The entropy of the system's actions is

    H = - k Σ_{i=1}^{n} p_i ln p_i                                          (2)

and an initial disorder is required,

    H_0 > 0                                                                 (3)

Besides the necessity for initial disorder of the system's actions, it is also important
for the learning process to take place that the system be sensitive to the environment
response. A receptive code is necessary. Obviously, if the system is insensitive or
indifferent to the environment's response there will be no sense in talking about
learning, for the environment would have no influence whatsoever on the system.

    lim_{t→∞} H = 0                                                         (4)

(Fig. 2 : decay of the entropy H from its initial value H_0 as t increases from t_0.)
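Relations (1)-(4) can be illustrated with a minimal numerical sketch: the entropy of the
action probabilities is largest when the courses of action are equiprobable and tends to
zero as learning concentrates the probability on one action. The constant k and the
probability values below are arbitrary illustrative choices.

    import math

    def action_entropy(p, k=1.0):
        """Entropy H = -k * sum(p_i ln p_i) of an action-probability vector, eq. (2)."""
        assert abs(sum(p) - 1.0) < 1e-9 and all(0.0 <= x <= 1.0 for x in p)  # eq. (1)
        return -k * sum(x * math.log(x) for x in p if x > 0.0)

    # initial disorder: four equiprobable courses of action, H0 > 0   (eq. 3)
    print(action_entropy([0.25, 0.25, 0.25, 0.25]))   # = ln 4
    # after learning: one course of action dominates, H tends to 0    (eq. 4)
    print(action_entropy([0.97, 0.01, 0.01, 0.01]))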
a. Unsupervised Learning (Self-Learning, or Learning without a Teacher).
This is the case when the system does not receive any outside information except the
ordinary signals from the environment. The system would then be learning by experimenting
with its behavior. Such learning systems are usually called self-organizing systems, see
Fig. 3. The study of such systems finds application, for example, in problems of
simulation of behavior and automatic clustering of input data.

(Fig. 3 : a self-organizing system in closed loop with its environment.)

b. Supervised Learning (Learning with a Teacher).
This is the case when the system receives additional information from the outside during
the learning process. Here the teacher is the source of additional external information
input to the system. Depending on what information the teacher inputs to the system
undergoing training it is possible to distinguish two situations :
i. training by showing : the teacher inputs realizations of the output signal y
corresponding to the given realizations of the input signal x to the system being
trained, see Fig. 4.
ii. training by assessment : the teacher observes the operation of the system being
trained and inputs to it its appraisal z of the quality of its operation (in the simplest
case the teacher gives a reward z = +1 or a punishment z = -1), see Fig. 4.
One may further classify learning by a teacher into two categories : learning by an ideal
(or perfect) teacher, and learning by a real teacher (or teacher who makes mistakes).

(Fig. 4 : supervised learning ; the teacher supplies y or z to the system interacting
with its environment.)
The problem of learning thus considered may be viewed as the problem of estimation or
successive approximation of the unknown quantities of a functional which is chosen by the
designer of the learning system to represent the process under study. The basic ideas of
measuring the accuracy of approximations will be related to the problem of learning in an
unknown environment, i.e., where the function to be learned (approximated, estimated) is
known only by its form over the observation space. Any further specification of such a
functional form can be performed only on the basis of experiments which offer the values
of the approximated function in the domain of its definition. This implies that any
desired solution which needs the knowledge of the approximated function is reached
gradually by methods relying on experimentation and observation.
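The successive-approximation viewpoint described above can be sketched with a
Robbins-Monro type iteration; the target function, the noise model and the gain sequence
below are illustrative assumptions, not taken from the text.

    import random

    def noisy_observation(theta):
        """Experiment: a noisy value of the unknown regression function at theta."""
        true_value = 2.0 * theta - 1.0          # unknown to the learner
        return true_value + random.gauss(0.0, 0.1)

    # Robbins-Monro iteration: drive the observed values of the function to zero
    theta = 0.0
    for n in range(1, 2001):
        gamma_n = 1.0 / n                       # decreasing gain sequence
        theta -= gamma_n * noisy_observation(theta)

    print(theta)   # approaches the root 0.5 of the unknown function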
1.8 CONCLUSIONS.
COMMENTS.
1.6 The models of unsupervised learning are important in behavioral science. Some models
    were introduced in the literature on behavioral psychology [5] and lately in
    engineering science [6]. A discussion of supervised learning, or training by
    assessment and by showing, is given in Pugachev [7]. A discussion of learning from a
    teacher who makes mistakes is given in Tebbe [8].
REFERENCES.
CHAPTER II
DECISION - PATTERN RECOGNITION
2.1. PATTERN RECOGNITION PROBLEM.
The problem of pattern recognition is concerned with the analysis and the decision rules
governing the identification or classification of observed situations, objects, or inputs
in general.
ii. The abstraction or classification problem.
The abstraction problem is concerned with the decision rules for labeling or classifying
feature measurements into pattern classes. The classification rule is such that the
features in each class share more or less common properties.
Due to the distorted and noisy nature of feature measurements, each pattern class could be
characterized by certain statistical properties. Such properties may be fully known,
partially known, or completely missing a priori. In the case of lacking a priori
information it is required that the classifier undergo training or be built according to
learning theorems.
2.2. FEATURE EXTRACTION.
The basic feature extraction problem can be classified into two general categories :
Intraset feature extraction is concerned with those attributes which are common to each
pattern class.
On the other hand, for interset feature extraction the interest is to emphasize the
differences between the patterns. This can be attained if some clustering of the samples
of the same pattern is achieved in the feature space. This amounts to contracting the
distance between samples of the same pattern in the feature space, thus enhancing the
organization or minimizing the entropy.
2.3. KARHUNEN-LOÈVE EXPANSION.
Assume there are K pattern classes (K ≥ 2), w_1, w_2, ..., w_K. The pattern vector x is
assumed to be N-dimensional with probability density function

    f(x) = Σ_{k=1}^{K} π_k f_k(x)                                            (1)

where π_k is the probability that a pattern belongs to class w_k and f_k(x) is the
conditional density of x given w_k. We assume without loss of generality that E(x) = 0,
since a random vector with nonzero mean can be transformed into one with zero mean by
translation, which is a linear operation. Then the covariance matrix R is the N × N matrix

    R = E{x x^T} = Σ_{k=1}^{K} π_k E_k{x x^T}                                (2)

where E_k denotes the expectation over the pattern vectors of class w_k. The
Karhunen-Loève expansion is an expansion of the random vector x in terms of the
eigenvectors of R. Let λ_j and u_j be the j-th eigenvalue and eigenvector of R, i.e.

    R u_j = λ_j u_j                                                          (3)

    λ_j ≥ 0                                                                  (4)

The representation

    x = Σ_{j=1}^{N} c_j u_j ,    c_j = u_j^T x                               (6)

is called the Karhunen-Loève expansion. Note that c_j is a random variable due to the
randomness of x. Since we assume E(x) = 0, E(c_j) = 0, and by (2), (3), and the
orthonormality of the u_j,

    E{ c_j c_ℓ } = λ_j δ_{jℓ}                                                (7)

In other words, the random variables c_j and c_ℓ are uncorrelated if j ≠ ℓ, and E(c_j²)
equals the eigenvalue λ_j. This property of zero correlation is an important and unique
property of the Karhunen-Loève expansion.
2.4. INTRASET FEATURE EXTRACTION.
The intraset feature extraction reflects the pattern properties common to the same class.
Intraset feature extraction may be studied from various points of view. This extraction
problem may be analyzed as an estimation problem, or considered as a problem of maximizing
the population entropy (as noted before).

    z = T x                                                                  (8)

where T is the M × N matrix

    T = (v_1, ..., v_M)^T                                                    (9)

and {v_j} is a set of orthonormal basis vectors in Ω_x. Notice that the feature space Ω_z
is a subspace of Ω_x whose basis is {v_1, ..., v_M}. If we expand x in terms of {v_j} we
have

    x = Σ_{j=1}^{N} c_j v_j                                                  (10)

with the coefficients

    c_j = v_j^T x                                                            (11)

Note that c_j is a random variable due to the randomness of x. The feature vector z
becomes

    z^T = x^T T^T ,   i.e.   (z_1, ..., z_M) = (c_1, ..., c_M)               (12)
The estimation problem consists in determining the matrix T, see eqn. (9), such that the
error between the pattern vector x and its projection x̂ onto the feature space,

    E ||x - x̂||² = E {(x - x̂)^T (x - x̂)} = E {(Σ_{j=M+1}^{N} c_j v_j)^T (Σ_{k=M+1}^{N} c_k v_k)}
                 = E Σ_{j=M+1}^{N} c_j² = Σ_{j=M+1}^{N} E(c_j²)              (13)

be minimum.
If one uses the Karhunen-Loève expansion for the representation (10), then it follows from
eqn. (7) that the required vectors v_1, ..., v_M are given by the eigenvectors
u_1, ..., u_M, see eqn. (3), corresponding to the M largest eigenvalues of the covariance
matrix R, see eqn. (2).
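The eigenvector construction of eqs. (2), (3), (6) and the M-term truncation minimizing
(13) can be checked numerically with the following sketch; the sample data and the choice
M = 2 are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])  # zero-mean patterns
    X -= X.mean(axis=0)                                   # enforce E(x) = 0

    R = (X.T @ X) / len(X)                                # covariance matrix, eq. (2)
    eigval, eigvec = np.linalg.eigh(R)                    # R u_j = lambda_j u_j, eq. (3)
    order = np.argsort(eigval)[::-1]                      # sort eigenvalues decreasingly
    eigval, eigvec = eigval[order], eigvec[:, order]

    M = 2
    T = eigvec[:, :M].T                                   # rows: eigenvectors of the M largest eigenvalues
    Z = X @ T.T                                           # features z = T x, eq. (8)

    X_hat = Z @ T                                         # projection back into the pattern space
    print(np.mean(np.sum((X - X_hat) ** 2, axis=1)))      # mean error, eq. (13)
    print(eigval[M:].sum())                               # = sum of the N-M smallest eigenvalues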
we obtain
where

    R_z = T R T^T                                                            (19)

    H(z) = ½ Σ_{j=1}^{M} ln θ_j + ½ M ln 2π + ½ M                            (20)

with θ_j being the eigenvalues of the covariance matrix R_z. Hence we obtain the following
result,
Theorem

    ln Λ(x) = ln [ f_1(x) / f_2(x) ]                                         (23)
    J_1(x) = E_1{ ln [f_1(x)/f_2(x)] } = ∫_{Ω_x} f_1(x) ln [f_1(x)/f_2(x)] dx
                                                                             (24)
    J_2(x) = E_2{ ln [f_2(x)/f_1(x)] } = ∫_{Ω_x} f_2(x) ln [f_2(x)/f_1(x)] dx

where E_1{.} and E_2{.} indicate the expectation over the densities f_1(x) and f_2(x),
respectively. J_1(x) may be interpreted as the average information for discrimination in
favor of w_1 against w_2, and J_2(x) may be interpreted in a similar manner. The
divergence is defined as

    J(x) = J_1(x) + J_2(x)                                                   (25)

The measure (25) stated for the two-class case can be converted to the K-class case by
optimizing the sum of all pairwise measures of quality or by maximizing the minimum of the
pairwise measures of quality.
(Figure : two candidate feature transformations, (a) z = T_1(x) and (b) z = T_2(x), and
the resulting class regions in the feature space.)
Example (a).

    E_1(x) = 0 ,   E_2(x) = m
                                                                             (26)
    E_1(x x^T) = R_1 ,   E_2(x x^T) = R_2

    E_1(z) = 0 ,   E_2(z) = m_z
                                                                             (27)
    E_1(z z^T) = R_z1 ,   E_2(z z^T) = R_z2

where

    m_z = T m ,   R_z1 = T R_1 T^T ,   R_z2 = T R_2 T^T                      (28)

    J(z) = ½ tr [(R_z1 - R_z2)(R_z2^{-1} - R_z1^{-1})] + ½ m_z^T (R_z1^{-1} + R_z2^{-1}) m_z   (29)

J(x) is similar to J(z) in (29) with m_z, R_z1 and R_z2 substituted by m, R_1 and R_2.
Let us consider two special cases.
a) Equal covariance matrices.

    R_1 = R_2 = R ,   R_z1 = R_z2 = R_z

and obviously

    J(x) = m^T R^{-1} m                                                      (30)

Choosing the one-dimensional feature defined by T = m^T R^{-1}, we get

    J(z) = m_z^T R_z^{-1} m_z = (m^T R^{-1} m)(m^T R^{-1} R R^{-1} m)^{-1}(m^T R^{-1} m)
         = m^T R^{-1} m = J(x)                                               (31)

The result suggests that the other directions do not contribute to the discrimination of
the two classes. In other words, the optimum classification is based on the statistic
m^T R^{-1} x.
b) Equal means. In this case the mean of the second class is also zero, m = 0. If both
R_1 and R_2 are positive definite then there exists a real and non-singular N × N
matrix U,

    U R_1 U^T = Λ ,   U R_2 U^T = I                                          (32)

where Λ is a diagonal matrix with real and positive elements λ_1, λ_2, ..., λ_N and I is
the identity matrix. In fact, the row vectors of U are the solutions of the equation,

    R_1 u_j = λ_j R_2 u_j                                                    (33)

    u_j^T R_2 u_j = 1 ,   u_j^T R_2 u_ℓ = 0 ,   j ≠ ℓ                        (34)

    J(x) = ½ tr [(Λ - I)(I - Λ^{-1})] = ½ Σ_{j=1}^{N} (λ_j + 1/λ_j - 2)       (35)

If the eigenvalues are ordered so that

    λ_1 + 1/λ_1 ≥ λ_2 + 1/λ_2 ≥ ... ≥ λ_N + 1/λ_N                            (37)

and the rows of T are taken as the first M row vectors of U, then

    J(z) = ½ Σ_{j=1}^{M} (λ_j + 1/λ_j - 2)                                   (38)

Note that the row vectors of T are orthonormal in the sense of (34) instead of
u_j^T u_ℓ = 0, which is a property of the optimal T considered in the previous sections.
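The divergence expressions above can be illustrated numerically; the sketch below uses the
standard divergence between two Gaussian densities (of which (30) and (31) are the
equal-covariance special case), with arbitrary illustrative means and covariances.

    import numpy as np

    def divergence(m, R1, R2):
        """Divergence J between N(0, R1) and N(m, R2), cf. eq. (29)."""
        R1i, R2i = np.linalg.inv(R1), np.linalg.inv(R2)
        trace_term = 0.5 * np.trace((R1 - R2) @ (R2i - R1i))
        mean_term = 0.5 * m @ (R1i + R2i) @ m
        return trace_term + mean_term

    m = np.array([1.0, 0.5])
    R = np.array([[2.0, 0.3], [0.3, 1.0]])

    # equal covariances: J(x) = m' R^-1 m, eq. (30), preserved by T = m' R^-1, eq. (31)
    print(divergence(m, R, R), m @ np.linalg.inv(R) @ m)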
2.6. OPTIMAL CLASSIFICATION.
Let the pattern classes be w_1, w_2, ..., w_K. For each pattern class w_j, j = 1, ..., K,
assume that the conditional multivariate (M-dimensional) probability density of the
feature vector z (M-dimensional), f_j(z), as well as the probability π_j of occurrence of
w_j (j = 1, ..., K), are known.
The problem of classification consists in partitioning the feature space Ω_z into K
subspaces Γ_1, Γ_2, ..., Γ_K, see Fig. 2, such that if z ∈ Γ_i we classify the pattern to
the class w_i.

    r(w_i, γ) = ∫_{Ω_z} F(w_i, γ(z)) f_i(z) dz                               (40)
For a given set of a priori probabilities π = (π_1, ..., π_K)^T, the average loss is

    R(π, γ) = Σ_{i=1}^{K} π_i r(w_i, γ)                                      (41)

    R(π, γ) = ∫_{Ω_z} Σ_{i=1}^{K} F(w_i, γ(z)) f_i(z) π_i dz                 (42)

    α = ∫_{Γ_2} f_1(z) dz                                                    (45)

    β = ∫_{Γ_1} f_2(z) dz                                                    (46)
for any observation or feature vector z. Since the decision γ takes only two values γ_1
and γ_2, say +1 and -1, respectively, the minimization of (49) can be obtained by simply
comparing the corresponding values.
The decision rule (51), called the Bayes rule, can be rewritten in the form,

    γ = +1   if   Λ(z) > h
                                                                             (52)
    γ = -1   if   Λ(z) < h

where

    Λ(z) = f_1(z) / f_2(z)                                                   (53)

    h = [(v_21 - v_22) / (v_12 - v_11)] (π_2 / π_1)                          (54)
    R = α π_1 + β π_2                                                        (56)

which amounts to the total probability of error. The decision rule can then be expressed
by (52) with the threshold value

    h = π_2 / π_1                                                            (57)
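A short sketch of the two-class Bayes rule (52)-(53) with the threshold (57); the class
densities below are illustrative one-dimensional Gaussians and the priors are arbitrary
choices.

    import math

    def gaussian(z, mean, var):
        return math.exp(-(z - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    def bayes_decision(z, pi1=0.6, pi2=0.4):
        """Return +1 (class w1) or -1 (class w2) according to eqs. (52), (53), (57)."""
        likelihood_ratio = gaussian(z, 0.0, 1.0) / gaussian(z, 2.0, 1.0)   # f1(z)/f2(z)
        h = pi2 / pi1                                                      # threshold, eq. (57)
        return +1 if likelihood_ratio > h else -1

    print([bayes_decision(z) for z in (-1.0, 0.5, 1.5, 3.0)])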
and

    h = μ π_2 / π_1                                                          (60)

This actually corresponds to the Neyman-Pearson rule, for which the threshold h is given
by (60), where μ is determined by solving the implicit relation

    ∫ f_2(z) dΛ(z) = A   over the region Λ(z) ≥ h = μ π_2 / π_1              (62)

The min-max approach consists of choosing π_1 so as to minimize the maximum deviation.
Hence π_1 is determined by the condition,
Equations (65) and (66) completely specify the min-max rule ; computational difficulties,
however, can arise when solving equation (65).
and
where

    h = π_2 / π_1                                                            (70)

    π_1 = π_2 = 1/2                                                          (71)
2.8. SEQUENTIAL METHODS.

    λ_i = ln [ f_1(z_i) / f_2(z_i) ] ,   i = 1, 2, ...                       (72)

    λ_n = Σ_{i=1}^{n} λ_i = ln Π_{i=1}^{n} [ f_1(z_i) / f_2(z_i) ]           (73)

Stop taking observations and decide to accept the hypothesis H_1 as soon as

    λ_n ≥ A                                                                  (75)

and stop taking observations and decide to accept the hypothesis H_2 as soon as

    λ_n ≤ B                                                                  (76)

The constants A and B are called the upper and lower stopping boundaries, respectively.
They can be chosen to obtain approximately the prescribed probabilities of error α and β.
Suppose that at the n-th stage of the measuring process it is found that

    λ_n = A                                                                  (77)

which is equivalent to

    (1 - α) = A β                                                            (80)

Similarly, when

    λ_n = B                                                                  (81)

then

    α = B (1 - β)                                                            (82)

    A = (1 - α) / β                                                          (83)

    B = α / (1 - β)                                                          (84)

It is noted that the choice of stopping boundaries A and B results in error probabilities
α and β if continuous observations are made and the exact equalities (77) and (81) can be
respected.
From (77) and (81), again by neglecting the excess over the boundaries, we have
Define

    η_i = 1   if no decision is made up to the (i-1)-th stage
        = 0   if a decision is made at an earlier stage

    λ_n = Σ_{i=1}^{n} λ_i η_i = λ_1 η_1 + λ_2 η_2 + ... + λ_n η_n            (87)

Taking expectations,

    E(λ_n) = E(λ) Σ_{i=1}^{∞} E(η_i)

Therefore, from (85) the average number of observations when H_1 is true can be expressed
as

    E_1^{(W)}(n) = [ (1 - α) ln A + α ln B ] / E_1(λ)                        (89)

Similarly, from (86),

    E_2^{(W)}(n) = [ β ln A + (1 - β) ln B ] / E_2(λ)                        (90)
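Wald's sequential test admits a direct sketch; the two hypothesized Gaussian densities and
the error levels α, β below are illustrative choices.

    import math, random

    def sprt(alpha=0.05, beta=0.05, true_mean=0.0):
        """Sequential probability ratio test between H1: m = 0 and H2: m = 1 (unit variance)."""
        ln_A = math.log((1 - alpha) / beta)     # upper stopping boundary, cf. eq. (83)
        ln_B = math.log(alpha / (1 - beta))     # lower stopping boundary, cf. eq. (84)
        lam = 0.0                               # cumulative log-likelihood ratio, eq. (73)
        n = 0
        while ln_B < lam < ln_A:
            z = random.gauss(true_mean, 1.0)
            n += 1
            lam += 0.5 * ((z - 1.0) ** 2 - z ** 2)   # ln f1(z)/f2(z) for the two hypotheses
        return ("H1" if lam >= ln_A else "H2"), n

    print(sprt(true_mean=0.0))   # usually accepts H1 after a handful of observations
    print(sprt(true_mean=1.0))   # usually accepts H2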
2.8.2. Finite Automata.

    λ = ln [ f_1(z) / f_2(z) ]                                               (91)

We shall consider the symmetric case, where the thresholds a and b are chosen so that
P(λ > a | H_1) = P(λ < b | H_2) = p, P(λ > a | H_2) = P(λ < b | H_1) = q and
r = 1 - p - q = P(b ≤ λ ≤ a).
Let us define

    L_j ≜ probability of the automaton attaining the state s if it begins its motion
          from the state j                                                   (92)
(Figure : state transition diagram of the automaton with linear tactic ; states
0, 1, ..., s connected by transition probabilities p (up), q (down) and r (stay), the
upper absorbing state corresponding to "H_1 is true" and the lower one to "H_2 is true".)
    L_j = (1 - λ^{-j}) / (1 - λ^{-s})                                        (94)

where

    λ = p/q > 1                                                              (95)

    α = (λ^{s-i} - 1) / (λ^s - 1)                                            (96)

If the hypothesis H_2 is true, then we obtain for L_j, with the same boundary conditions,
the equation

    L_j = (λ^j - 1) / (λ^s - 1)                                              (98)

    β = (λ^i - 1) / (λ^s - 1)                                                (99)

Hence, if the error probabilities α and β are fixed, the parameters s and i of the
automaton will be given by

    i = ln [(1 - β)/α] / ln λ ,   s - i = ln [(1 - α)/β] / ln λ              (100)

Let T_j denote the mean number of trials between the start of the experiment and its end,
if the automaton begins its motion from the state j.
with boundary conditions T_0 = T_s = 0. The solution of this equation has the form
Since, by hypothesis, the automaton begins its motion from the state i, then, taking into
account eqs. (96), (100), we obtain
If the hypothesis H_2 is true, then for T_j the equation takes the form
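The random-walk behavior of the automaton with linear tactic can be simulated directly; in
the sketch below the state count s, the starting state i and the probabilities p, q are
illustrative values, and the empirical absorption frequency is compared with the error
probability of eq. (96).

    import random

    def run_automaton(s=10, i=5, p=0.4, q=0.3):
        """One run under H1: move up w.p. p, down w.p. q, stay w.p. r = 1 - p - q.
        Absorption in state s means 'decide H1'; absorption in state 0 means 'decide H2'."""
        state = i
        while 0 < state < s:
            u = random.random()
            if u < p:
                state += 1
            elif u < p + q:
                state -= 1
        return state == s

    runs = 20000
    decided_h1 = sum(run_automaton() for _ in range(runs)) / runs
    lam = 0.4 / 0.3                                        # lambda = p/q, eq. (95)
    alpha_theory = (lam ** (10 - 5) - 1) / (lam ** 10 - 1) # error probability under H1, eq. (96)
    print(decided_h1, 1 - alpha_theory)                    # the two values should be close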
    p(θ | z_1, ..., z_n) =
        [ Σ_{i=1}^{K} π_i f_i(z_n/θ_i) ] p(θ | z_1, ..., z_{n-1})
        / ∫ [ Σ_{i=1}^{K} π_i f_i(z_n/θ_i) ] p(θ | z_1, ..., z_{n-1}) dθ      (114)

where

    f_i(z/θ_i) = g(z ; m_i, Γ_i) ,   θ_i = (m_i, Γ_i)

    θ = (θ_1, ..., θ_K, p_1, ..., p_K)^T

    f(z/θ) = Σ_{i=1}^{K} f_i(z/θ_i) π_i = Σ_{i=1}^{K'} f_i(z/θ_i') π_i'       (116)

    y = Σ_{i=1}^{N} c_i φ_i(z)                                                (117)

    c = (c_1, ..., c_N)^T
                                                                              (119)
    φ^T(z) = (φ_1(z), ..., φ_N(z))

    y = c^T φ(z)                                                              (120)

and
In the two-class (w_1 and w_2) pattern recognition problem, the output y takes on either
the value +1 or -1, such that y = +1 ⇔ z ∈ Γ_1 and y = -1 ⇔ z ∈ Γ_2. This means that
y = +1 corresponds to classifying z in w_1 and y = -1 to classifying z in w_2.
Let us now consider the learning of different decision rules.
It follows from eqn. (51) that the optimal discriminant function is given by :
such that :
The goal of learning can thus be stated as to minimize some convex function of the error
between the optimal discriminant function (122) and its approximation (120). Let us
consider the quadratic error function,
Taking the orthonormality condition (121) into consideration, eqn. (125) can be rewritten
thus

    E_z { Φ(z) } = 0                                                          (127)

where
The block diagram of the learning system that realizes this algorithm is shown in Fig. 5.
The block diagram of the learning system that realizes this algorithm is shown in Fig. 6.
Fig. 6 - Supervised learning of the Siegert-Kotelnikov rule.
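The supervised learning systems depicted in Figs. 5 and 6 realize stochastic-approximation
updates of the coefficient vector c of the expansion (117), (120). The sketch below is a
generic stand-in for such an update (an LMS-type correction toward the teacher's
classification y); the feature functions, gains and data model are illustrative
assumptions and not the book's exact recursion.

    import numpy as np

    rng = np.random.default_rng(1)

    def phi(z):
        """Illustrative feature system phi(z), cf. eq. (119)."""
        return np.array([1.0, z, z * z - 1.0])

    c = np.zeros(3)
    for n in range(1, 5001):
        label = 1 if rng.random() < 0.5 else -1            # teacher's class, y = +/-1
        z = rng.normal(1.0 if label > 0 else -1.0, 1.0)    # observation from class w1 or w2
        gamma = 1.0 / n                                     # decreasing gain
        # quadratic-error gradient step: move c^T phi(z) toward the teacher's label
        c += gamma * (label - c @ phi(z)) * phi(z)

    # learned decision: sign of the discriminant c^T phi(z), cf. eq. (120)
    print(np.sign(c @ phi(1.5)), np.sign(c @ phi(-1.5)))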
    ∫_{Γ_1} (1 - Λ) f_2(z) dz - ∫_{Γ_2} Λ f_2(z) dz = 0                       (136)

where

    e(z) = (1 - Λ) / π_2 ,   if z ∈ w_2 ,  c^T φ(z) > 0
         = - Λ / π_2 ,       if z ∈ w_2 ,  c^T φ(z) < 0                       (138)

The block diagram of the learning system that realizes the algorithms (139) is shown in
Fig. 8.
Now consider the case when the teacher does not give the correct classification y of the
observed situations. This corresponds to learning without supervision, or to
self-learning.
Let the goal of the self-learning system be to learn the Siegert-Kotelnikov maximum a
posteriori probability rule.
Let us assume now that the products of a priori probabilities and conditional density
functions π_1 f_1(z) and π_2 f_2(z) can be approximated by finite series whose terms are
known vector functions. For simplicity, their component functions are assumed to form an
orthonormal system.
The decision rule (140) can then be written in the form

    γ(z ; a, b) = a^T φ(z) - b^T ψ(z)                                         (142)

and the decision rule is determined by finding the vectors a and b. These vectors can be
found in the following manner. Noticing that due to (141) the probability density function
is approximately equal to

    ∇_a J(a, b) = E{ φ(z) } - a - G b = 0
                                                                              (146)
    ∇_b J(a, b) = E{ ψ(z) } - G^T a - b = 0

    a = E { U (φ(z) - G ψ(z)) } ,                                             (148)

    b = E { U (ψ(z) - G^T φ(z)) } ,                                           (149)

where U = (I - G G^T)^{-1}.
The simplest optimal stochastic approximation algorithms for solving the regression
equations (148) and (149) are,
The block diagram of the self-learning system that uses these algorithms is shown in
Fig. 9.
2.14. SELF-LEARNING WITH UNKNOWN NUMBER OF PATTERN CLASSES.
In the algorithms of self-learning given above, it was assumed that the number of regions
K into which the observed situations have to be clustered is given in advance (for
simplicity and clarity, it was assumed to equal 2). Although this does not look like a
significant limitation, since for K > 2 we can repeatedly use the binary case (frequently
called "dichotomy"), it is still needed to remove the necessity of specifying a fixed
number of regions. In other words, it is desired not only to relate observed situations to
proper regions but also to determine the correct number of these regions.
    f(z) = Σ_{k=1}^{K} π_k f_k(z)                                             (152)

We can assume that the peaks of the estimated mixture probability density function
correspond to the "centers" of the regions, and the lines passing along the valleys of its
relief are the boundaries of the regions ; the number of existing peaks in f(z) defines
the number of regions, see Fig. 10.
for which the necessary condition for optimality leads to the regression equation

    a = E { φ(z) }                                                            (155)

According to (153),

Fig. 11 - Learning the mixture density function.
    I_A(z) = 1   if z ∈ A
           = 0   otherwise                                                    (159)

Rosenblatt [20] demonstrated the convergence (in mean square) of f_n(z) towards f(z) on
the condition that h is a function of n such that h_n → 0 as n → ∞, with h_n converging to
zero more slowly than 1/n.
We note that choosing the indicator function in (158) yields a contribution of the
"needle type" following each observation.
This algorithm of learning, like the algorithms of learning (156) and (157), can be used
in the estimation of the mixture probability density function, and thus also in finding
the number of regions or classes and their corresponding situations.
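A sketch of a Rosenblatt-type recursive estimate of the mixture density, built from window
functions of shrinking width h_n; the mixture being estimated and the smooth window shape
(standing in for the needle-type indicator) are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(2)
    # observations from an (unknown) two-component mixture of situations z
    samples = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(2.0, 0.5, 500)])
    rng.shuffle(samples)

    grid = np.linspace(-5.0, 5.0, 201)
    f = np.zeros_like(grid)
    for n, z_n in enumerate(samples, start=1):
        h_n = 1.0 / n ** 0.3                      # h_n -> 0 more slowly than 1/n
        window = np.exp(-0.5 * ((grid - z_n) / h_n) ** 2) / (h_n * np.sqrt(2 * np.pi))
        f += (window - f) / n                     # recursive averaging of the window contributions

    print(grid[np.argmax(f)], np.trapz(f, grid))  # a peak near a region center; total mass near 1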
where h(n) is a certain decreasing sequence of positive numbers. Eqn. (162) has the
meaning that the distributions get "sharpened" around the centers as n increases. So their
effects become secondary ; they merely contribute "needle" changes (corresponding to
δ-functions) as n → ∞.
It should be noticed that the algorithms of learning (156) and (157) are special cases of
the algorithm of learning (161). Actually, by setting
in (161), and by introducing f_k(z) from (157), we obtain the algorithm of learning (156)
after a division by φ(z).
We have described above the way toward the restoration (estimation) of the mixture
probability density functions. For multidimensional vectors of the situation z, this
restoration is very difficult when smoothness has to be maintained. It is even more
difficult to extract the desired regions.
2.15. APPLICATION - MEASUREMENT STRATEGY FOR SYSTEMS IDENTIFICATION.
i - fixed measurement interval,
ii - constrained set of admissible measurement structures, where the measurement system
has a variable structure, namely the number and spatial configuration of the sensors can
be altered [14]. It is assumed that the set of admissible measurement structures is
finite, and the system is identifiable within that set.
where f_n(.) is a known nonlinear vector function ; x(n) and w(n) denote, respectively,
the state and disturbance vectors at the time step n = 0, 1, ..., N-1. The vectors f_n, x
and w are all p-dimensional.
The vector c(n) specifies the measurement structure, which characterizes the relationship
between the system state-parameter vector and the measurement vector at time step n. Such
a measurement structure has to be a member of the set of admissible measurement
structures,

    C = {c_1, c_2, ..., c_M}                                                  (169)

    Q(0, N) = Φ{V_x̂(N)} + Σ_{n=0}^{N-1} C_n[c(n)]                             (171)

    F(n) = ∂f_n(x̂(n)) / ∂x̂(n) ,   G(n+1) = ∂g_{n+1}(x̂((n+1)/n), c(n+1)) / ∂x̂((n+1)/n)   (172)

the a priori variance matrix
where
    j = k = i                for i = 1, 2, ..., p
    j = k - 1 = i - p        for i = p+1, ..., 2p-1                           (180)
    j = k - 2 = i - (2p-1)   for i = 2p, ..., 3p-3

Thus the upper-right diagonals of the V_x̂ matrix are arranged componentwise in the α
vector. Henceforth, the vector α will be referred to as the variance vector.
Let us note that the variance vector α(n) at time n is a function of α(0) and the
measurement structures c(m) for all m = 0, 1, ..., n-1 through the sequential
representation
Here h_{n-1} denotes the algorithm given by (172), (173), (174) and (177).
In terms of the variance vector α(n) the overall cost (171) can be written as

    Q(0, N) = Φ(α(N)) + λ Σ_{n=0}^{N-1} C_n[c(n)] ,   λ ≥ 0                   (182)
Assume that c*(0), ..., c*(N-2) are determined ; then the optimization of (182) will be
reduced to minimizing
over the set of all admissible measurement structures C. Fixing numerical values for
α(N-1) will then make (184) a function of c(N-1), which can be minimized over the set C to
yield a value c*(N-1) corresponding to the numerical value assigned to α(N-1). That
procedure is repeated for a large number of different possible realizations of α(N-1),
which can be produced by suitable random generation. The numerical results are then
tabulated according to patterns or classes of elements. In that respect one may define M
patterns : A_1, ..., A_i, ..., A_M, the i-th of which can be represented as
Here c*(α^j) denotes the optimal structure corresponding to a value α^j ; c_i is an
element of the set C. Notice that there are M classes of α (at any time step n), which
equals the cardinal number of the set C.
Let us denote by γ_n the decision vector function needed to recognize the appropriate
pattern of a sample α(n) at time n. By means of that decision function it is possible to
assign the optimal structure c* corresponding to the variance vector α(n) at time n. Let
us represent that decision procedure formally by the equation
Notice that the time step is considered to be n rather than N-1 in (186) in order to save
writing when the procedure is again mentioned in the following arguments.

    Q(N-1, N) = Φ[ h_{N-1}(α(N-1), γ_{N-1}(S_{N-1}, α(N-1))) ] + ...          (187)

Let us now consider the two-stage decision process (N-2, N), for which the cost can be
written as

    = Ψ_{N-1}[ h_{N-1}(α(N-2), c(N-2)) ] + λ C_{N-2}(c(N-2))                  (188)
If we begin, for example, with four classes, we could solve the following two problems in
the order given :
(1) Find a decision boundary separating classes 1 and 2 from classes 3 and 4.
(2) Using samples of classes 1 and 2 alone, find a decision boundary separating class 1
from class 2. Using samples of classes 3 and 4, find a decision boundary separating class
3 from class 4.
Notice that the decision procedure needs to be carried out a maximum of M-1 times for an
M-class problem. This can easily be seen in Fig. 13, where the decision procedure is
depicted as a tree structure, each of the nodes corresponding to a decision function. The
parallel structure (Fig. 13.a) has the advantage of quick decisions since the maximum
number of decision functions in a decision procedure is less than that of the series
structure (Fig. 13.b).
    p_i(α, S) = S^T φ(α)                                                      (189)

    y = -1   if α is classified into χ_1
      = +1   if α is classified into χ_2                                      (191)

The superscript o denotes the correct class. Obviously the decision will be correct if

    y° p_i(α, S) > 0                                                          (192)

    D( y° - p_i(α, S) )                                                       (194)

    R = ∫ D( y° - p_i(α, S) ) p(α) dα = E { D( y° - p_i(α, S) ) }             (195)
Now, on the basis of the results of [9], we can easily obtain the learning algorithms for
the pattern recognition system
where D'(x) = dD(x)/dx
then the algorithm for optimal learning will be
Illustrative example.
Consider the slab-type nuclear reactor which is represented by the following four-point
model [17], based on space discretization, where x(n) is a four-dimensional vector
representing the state of the system at the four mesh points. The state transition matrix
F is given by

Fig. 13 - The decision procedure depicted as a tree : a. Parallel structure ;
b. Series structure.
Fig. 14 - Variable-structure identification scheme : the process, the measuring system
(with variable structure), the estimator, and the pattern recognition block.
The off-line learning procedure can be summarized as follows (flowchart) :
(1) Specify c_i, i = 1, ..., M ; read V_w, V_v ; choose the form of the decision rule,
    γ = S^T φ(α).
(2) Classification : given α(n), try the different structures i = 1, ..., M to calculate
    V_x̂(n+1) and compute Q(n, N).
(3) Generate a sample of α(n) ; find the optimal structure c* which minimizes Q(n, N) ;
    repeat this for different α(n).
(4) Cluster the sample of α(n) into M classes according to their pertaining to the same
    optimal structure.
(5) Compute the decision rule ; set n = n - 1 and repeat until all stages have been
    processed.
    z_n = c_n^T x_n + v_n

where c_n is assumed to have one of the following two measurement structures (M = 2) :

    c_1 = (1 0 0 0)^T ,   c_2 = (0 0 1 0)^T

    Φ{V_x̂(N)} = tr V_x̂(N)

The α vector will be taken as the diagonal elements of V_x̂, thus comprising four
components. Each component α_i, i = 1, ..., 4, is generated independently using a uniform
random distribution over the range of values between 0 and 1. For each specific value of
α, the optimum structure is determined for the last-stage decision process. Then the
decision rule is calculated using the learning algorithm (200). The form considered for
the decision rule is

    γ = S^T α = s_1 α_1 + s_2 α_2 + s_3 α_3 + s_4 α_4 .

In computing the vector S, fifty samples of the vector α are used. The vector S is also
calculated for the two-stage decision process (N-2, N), using again fifty generations of α
and the decision rule calculated from the single-stage decision process (N-1, N) as
explained before.
The learning scheme converged to the following values :

    S_{N-1} = ( 1.5630, -0.3637, -1.1492, 0.0065 )^T
    S_{N-2} = ( 0.4294, -0.8950, 0.9488, -0.8022 )^T

The above decision rules are tested on the same samples of α, and the misclassification
ratio with respect to the optimum structure is 4 % for the single-stage and 6 % for the
two-stage process, which are considered to be quite satisfactory.
2.16. CONCLUSIONS.
and classification.
On the other hand, in order to extract the attributes that emphasize the differences
between or among pattern classes it is necessary to perform the utmost organization in the
feature space (i.e. minimize the entropy) in order to cluster the different populations
and hence facilitate separability. This is the criterion for interset feature extraction.
In that respect Wald's test has been presented. That test becomes practical when
considered in its discrete version. To that end finite automata can prove to be a useful
tool. A certain form of such automata (automata with linear tactic) is given in some
detail.
The Bayes and probabilistic iterative techniques have been presented with a view to
solving the pattern recognition problem. Using those techniques, different learning
algorithms with or without supervision can be obtained. Learning without supervision
demands more complex algorithms and takes a longer time than learning with supervision
under the same conditions. This agrees with the fact that ignorance must be paid for.
COMMENTS
2.1. The division of the pattern recognition problem into an extraction and a
     classification problem is rather artificial. Essentially there is no "hard" boundary
     between the two problems, for "perception" aids "decision" inasmuch as "decision"
     structures what to perceive. An adaptation scheme should be envisaged to adapt the
     extraction and classification algorithms simultaneously with a view to better
     recognition.
2.2. - 2.5. The terminology of "intraset" and "interset" features is due to Tou and
     Heydorn [1]. They seem to be erroneous in indicating that the intraset extraction
     criterion corresponds to minimizing the entropy, which leads them to an ambiguous
     result. Also see Tou [2], Young and Calvert [3].
2.8. For Wald's SPRT, see Fu [5]. The automata model is taken from Radyuk and
     Terpugov [6].
2.10. An interesting discussion about the complexity of the non-supervised Bayes algorithm
     is also given in Young and Calvert [3], p. 83. Approximations of unsupervised Bayes
     learning are the subject of many researches, see e.g. [18], [19].
REFERENCES
1. J.T. Tou and R.P. Heydorn, "Some Approaches to Optimum Feature Extraction", in Computer
   and Information Sciences - II (J.T. Tou, ed.), New York : Academic Press, 1967.
2. J.T. Tou, "Feature Selection for Pattern Recognition Systems", in Methodologies of
   Pattern Recognition (S. Watanabe, ed.), New York : Academic Press, 1972.
3. T.Y. Young and T.W. Calvert, Classification, Estimation, and Pattern Recognition,
   New York : Elsevier, 1974.
4. W.S. Meisel, Computer-Oriented Approaches to Pattern Recognition, New York : Academic
   Press, 1972.
5. K.S. Fu, Sequential Methods in Pattern Recognition and Machine Learning, New York :
   Academic Press, 1968.
6. L.E. Radyuk and A.F. Terpugov, "Effectiveness of Applying Automata with Linear Tactic
   in Signal Detection Systems", Automation and Remote Control, No. 4, 1971, pp. 609-617.
7. H. Teicher, "Identifiability of Mixtures", Ann. Math. Stat., 32, 1961, pp. 244-248.
8. H. Teicher, "Identifiability of Finite Mixtures", Ann. Math. Stat., 34, 1963,
   pp. 1265-1269.
9. Ya.Z. Tsypkin, Foundations of the Theory of Learning Systems, New York : Academic
   Press, 1973.
10. A. Albert, Stochastic Approximation and Nonlinear Regression, MIT Press, 1967.
11. N.V. Loginov, "Methods of Stochastic Approximation", Automation and Remote Control,
    27, No. 4, 1966, pp. 706-728.
12. D.J. Sakrison, "Stochastic Approximation : A Recursive Method for Solving Regression
    Problems", in Advances in Communication Systems, 2, 1966.
13. Y.M. El-Fattah and S.E. Aidarous, "A Pattern Recognition Approach for Optimal
    Measurement Strategies in Dynamic Systems Identification", IFAC Symp. on
    Identification, Tbilisi (USSR), 1976.
14. S.E. Aidarous, M.R. Gevers and M.J. Installe, Int. J. Control, 1975, Vol. 22,
    pp. 197-213.
15. M. Athans, Automatica, 1972, Vol. 8, pp. 397-412.
16. A.P. Sage, "Estimation and Identification", Proc. 5th IFAC Congress, 1972, Paris
    (France).
17. M.A. Hassan, M.A.R. Ghonaimy and M.A. Abd El-Shaheed, "A Computer Algorithm for
    Optimal Discrete-Time State Estimation of Linear Distributed Systems", Proc. IFAC
    Symp. on Control of Distributed Systems, 1971, Banff (Canada).
18. E.A. Patrick, J.P. Costello and F.C. Monds, "Decision Directed Estimation of a Two
    Class Decision Boundary", IEEE Trans. on Computers, Vol. C-19, No. 3, pp. 197-205,
    1970.
19. U.E. Makov and A.F.M. Smith, "Quasi-Bayes Procedures for Unsupervised Learning",
    Proc. of the 1976 IEEE Conf. on Decision and Control, Paper WP 4.
20. M. Rosenblatt, "Remarks on Some Nonparametric Estimates of a Density Function",
    Ann. Math. Statist., 27, pp. 832-837, 1956.
21. S. Kullback, Information Theory and Statistics, New York : J. Wiley and Sons, 1958.
CHAPTER III
SIMULATION - MODELS OF COLLECTIVE BEHAVIOR
3.1. INTRODUCTION.
(*) A unique deterministic mapping between the automaton states and outputs is assumed.

    p^i = (p_1^i, ..., p_{m_i}^i)^T    (i = 1, ..., N)

    0 ≤ p_j^i ≤ 1 ,   Σ_{j=1}^{m_i} p_j^i = 1

where p_j^i is the probability that the automaton uses its pure strategy y_j^i. The
probability vector p^i specifies the mixed strategy of the i-th automaton.
(Figure : N automata A_1, ..., A_N connected to a common environment ; each automaton A_i
sends its strategy y^i to the environment and receives the response s^i.)

Variable-structure stochastic automaton is the name given when the probability vector p^i
is modified at every step according to some reinforcement scheme. This may be effected
through changing the elements of the automaton transition probability matrices
corresponding to the automaton input u^i.
The objective of each automaton in the game is to seek the mixed strategy p^i that
minimizes its average penalty, taken as the expected absolute value of a function F^i that
depends on the automaton's strategy y^i and the environment response s^i,

(Figure : a learning automaton composed of a stochastic automaton, a performance
evaluation block, and an adaptive device.)
It is assumed that for each i-th automaton, for arbitrary fixed values of the foreign
strategies y^1, ..., y^{i-1}, y^{i+1}, ..., y^N, there exists one value of the own
strategy y^i which is the "best". Let us denote its value by
y^{i*}(y) = y^{i*}(y^1, ..., y^{i-1}, y^{i+1}, ..., y^N). Such y^{i*} is given by
minimizing |δ^i(y)| with respect to y^i.
We shall call δ^i(y) an indicator function for the i-th automaton if it gives an
indication of the "distance" from the best situation, i.e.

    if y^i(t) = y_j^i   then   y^i(t+1) = y_{j+u^i}^i

where

    u^i = sign ( δ^i(y(t)) )                                                  (8)

In the limit, when the set Y^i tends to the continuous interval [y_1^i, y_{m_i}^i], and
the time increment between successive steps tends to zero, the above rule of behavior will
approximate the following continuous-time model of collective behavior,

    sign (dy^i/dt) = sign ( δ^i(y) )    (i = 1, ..., N)                       (9)
    Pr [ y^i(t+1) = y_{j+u^i}^i ] > Pr [ y^i(t+1) = y_k^i ]   for all k ≠ j + u^i    (11)

If y^i(t) = y_{k_i}^i then

    p_{k_i+u^i}^i(t+1) = p_{k_i+u^i}^i(t) + γ^i(t+1) |δ^i(y(t))|

    p_j^i(t+1) = p_j^i(t) - [ γ^i(t+1) / (m_i - 1) ] |δ^i(y(t))| ,
                 j = 1, ..., m_i ,  j ≠ k_i + u^i                             (12)

    γ^i(t+1) < min ( [1 - p_{k_i+u^i}^i(t)] / |δ^i| ,
                     (m_i - 1) min_{j ≠ k_i+u^i} p_j^i(t) / |δ^i| ) ,   δ^i ≠ 0    (14)

to guarantee that the probability vector p^i satisfies the condition 0 < p_j^i < 1 for
all j.
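The reinforcement scheme (11), (12) with the gain bound (14) can be sketched for a single
automaton as follows; the indicator function, the strategy set and the gain sequence are
illustrative assumptions (in the book the scheme operates in a game of N automata).

    import random

    m = 4                                   # number of pure strategies y_1 < ... < y_m
    p = [1.0 / m] * m                       # mixed strategy vector
    strategies = [1.0, 2.0, 3.0, 4.0]

    def delta(y):
        """Illustrative indicator function: distance from the best strategy y* = 3."""
        return 3.0 - y

    for t in range(1, 2001):
        k = random.choices(range(m), weights=p)[0]          # action drawn from p
        d = delta(strategies[k])
        if d == 0.0:
            continue                                        # status quo
        u = 1 if d > 0 else -1                              # eq. (8)
        target = min(max(k + u, 0), m - 1)                  # next supremal / infimal action
        gamma = min(0.5 / t,                                # decreasing gain
                    0.9 * (1.0 - p[target]) / abs(d),       # bound (14), first part
                    0.9 * (m - 1) * min(p[j] for j in range(m) if j != target) / abs(d))
        p[target] += gamma * abs(d)                         # eq. (12), reinforced component
        for j in range(m):
            if j != target:
                p[j] -= gamma * abs(d) / (m - 1)            # eq. (12), remaining components

    print([round(x, 3) for x in p])   # probability concentrates on the strategy nearest y* = 3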
If y^i(t) = y_{k_i}^i then

    ...   j = 1, ..., m_i ,   j ≠ k_i + u^i                                   (15)

where

    u^i = sign ( σ^i(y_{k_i}^i, s^i(t+1)) )                                   (16)

The sequence γ^i(t+1) should also satisfy the condition (12) besides the upper-bound
condition

    γ^i(t+1) < min ( [1 - p_{k_i+u^i}^i(t)] / |F^i(e^i)| ,
                     (m_i - 1) min_{j ≠ k_i+u^i} p_j^i(t) / |F^i(e^i)| ) ,   e^i ≠ 0    (17)

    σ^i(y^i, s^i) = ...   for y^i = y_1^i ,  e^i > 0 ,
                  = e^i(y^i, s^i) ,   otherwise                               (18)
The idea underlying the functioning of a learning automaton in the present model can be
stated as follows. At any time step, if the automaton action has elicited an environment
response for which the penalty index e^i is greater than zero, then at the next time step
the probability of the next supremal action is increased. On the other hand, if the
penalty index is less than zero, then the probability of the next infimal action is
increased. If, in the case of positive penalty, y_j^i = y_{m_i}^i, or, in the case of
negative penalty, y_j^i = y_1^i, then the probability of y_j^i itself is increased.
Finally, if the penalty index is zero, the automaton remains in the status quo.
Lemma 1.
For contramonotonic indicator functions δ^i(y)
Proof.
i) Let Δy^j ≥ 0, j = 1, ..., N, j ≠ i, and Δy^j > 0 at least for one j. Let y^{i*}({y^j})
denote the best strategy of the i-th automaton for arbitrary fixed values of the foreign
strategies {y^j} = {y_k^j}. Let y^{i*}({y_k^j}) = y_v^i, 1 < v < m_i, and
δ^i(y_v^i, {y_k^j}) ≥ 0.
Accordingly the optimum y^{i*} for the new {y^j} = {y_k^j + Δy^j} cannot be found in the
subset of strategies (y_1^i, ..., y_{v-1}^i) and can only be found in the subset
(y_{v+1}^i, ..., y_{m_i}^i). Hence Δy^{i*} ≥ 0. On the other hand, if
δ^i(y_v^i, {y_k^j}) < 0 then v = 1, or else δ^i(y_{v-1}^i, {y_k^j}) > 0. If v = 1 then (19)
is automatically satisfied. If v ≠ 1, then δ^i(y_ℓ^i, {y_k^j}) > 0 for all ℓ ≤ v-1, whence
Δy^{i*} ≥ 0, and (19) holds.
Hence the optimum y^{i*} for {y_k^j + Δy^j} can only be found in the subset
(y_1^i, ..., y_{v-1}^i). Accordingly Δy^{i*} ≤ 0 and again (19) is verified. On the other
hand, if δ^i(y_v^i, {y_k^j}) > 0 then either δ^i(y_{v+1}^i, {y_k^j}) < 0 or else v = m_i.
If v = m_i then (19) is automatically satisfied. If v ≠ m_i, then
δ^i(y_ℓ^i, {y_k^j}) < δ^i(y_{v+1}^i, {y_k^j}) < 0 for all ℓ ≥ v+1. Hence Δy^{i*} ≤ 0, and
(19) holds.
Theorem 1 (Existence).
Proof.
Let v_0^i = 1 (i = 1, ..., N), and suppose that y_{v_0} is not a Nash play. Consider the
set of automata strategies given by

    y_{v_1}^i = y^{i*}( {y_{v_0}^j} )    (i = 1, ..., N)

For all j ≠ i, (i = 1, ..., N),
i.e. the mappings y^{i*}(.) map the subset of plays

    Y^(1) = { y : y_{v_1}^i ≤ y^i ≤ y_{m_i}^i   (i = 1, ..., N) }

into itself. Thus a Nash play must exist in Y^(1). Let us then consider the strategies,

    y_{v_2}^i = y^{i*}( {y_{v_1}^j} )    (i = 1, ..., N)

If {y_{v_1}^i} again is not a Nash play, then a Nash play must be in the subset,

    Y^(2) = { y : y_{v_2}^i ≤ y^i ≤ y_{m_i}^i   (i = 1, ..., N) }

and the mappings y^{i*}(.) map Y^(2) into itself. By successive application of the above
procedure it follows that if {y_{v_s}^i} is not a Nash play then the candidate for that
play must be in the subset of plays

    Y^(s) = { y : y_{v_s}^i ≤ y^i ≤ y_{m_i}^i   (i = 1, ..., N) }

By Lemma 1 it is clear that
for all i. Since the y^{i*}(.) map the subset Y^(s) into itself, it is obvious that in the
limit - unless y_{v_s} is a Nash play for some s - the subset of candidate Nash plays
CONDITION 1 :
Let Δy ≠ 0. We partition the set of subscripts I = {1, 2, ..., N} into three subsets
I_>, I_=, I_<, such that i ∈ I_> ⟺ Δy^i > 0, i ∈ I_= ⟺ Δy^i = 0, i ∈ I_< ⟺ Δy^i < 0.
The following inequality is then assumed to hold :
One can also see the appearance of the boundedness of the inter-automata influences if one
provides the following interpretation : the interaction among several goal-oriented
automata constitutes competition among the automata users for some resource which is
necessary to them. Let y^i be the magnitude of the effort of the i-th automaton to acquire
the resource, while δ^i(y) is the magnitude of the deficit of the resource (if δ^i < 0
then |δ^i| is the magnitude of the excess) in terms of the i-th automaton in the play y.
Then the monotonicity of δ^i(y) with respect to y^i, assumed by condition 1, means that
there is the possibility of self-regulation by each automaton individually, since there is
guaranteed a decrease of the resource available as its own efforts increase (and
conversely).
CONDITION 2 :
For any i, and for any arbitrary fixed foreign strategies, δ^i(y) as a function of the own
strategy y^i does not assume two consecutive values which are equal in magnitude, i.e.

    y^{i*}(y) ,   (i = 1, ..., N)
Lemma 2.
Let condition 1 hold. Let δ^i(y) Δy^i ≥ 0 (i = 1, ..., N) and Δy ≠ 0. Then ΔΦ(y) > 0.
Proof.

    ΔΦ(y) = Σ_{i=1}^{N} Δ|δ^i(y)|
          = Σ_{i : Δy^i ≠ 0} Δ|δ^i(y)| + Σ_{i : Δy^i = 0} Δ|δ^i(y)|
          ≥ Σ_{i : Δy^i ≠ 0} Δδ^i(y) sign Δy^i + Σ_{i : Δy^i = 0} |Δδ^i(y)|   (27)
Theorem 2 (Uniqueness).
For y* to be a Nash play it is necessary and sufficient that y* be a minimum point of the
function Φ(y) ; y* is unique.
Proof.
It follows then that in every case Δy^i δ^i(y*) ≥ 0 for all i, and consequently
ΔΦ(y) > 0. Hence y* is the minimum point of the function Φ on Y, and is unique as well, by
virtue of condition 2.
Necessity.
Let y* = {y_{v_i}^i} be the minimum point of Φ. Assume that y* is not a Nash play. That
means, with due regard to the monotonicity property, that

    sign δ^i(y_{v_i - 1}^i) = sign δ^i(y_{v_i}^i)   if δ^i(y*) is negative, or
    sign δ^i(y_{v_i + 1}^i) = sign δ^i(y_{v_i}^i)   if δ^i(y*) is positive.

Let us then consider the point y obtained by replacing the component y_{v_i}^i of y* by
either y_{v_i - 1}^i or y_{v_i + 1}^i, depending on whether δ^i is negative or positive,
respectively.
Then we get the inequality δ^i(y)(y^{i*} - y^i) ≥ 0 for all i. Hence Φ(y*) > Φ(y), which
is contrary to the assumption that y* is the minimum point of Φ. Hence y* must be the Nash
play. The proof is complete.
where u^i is defined as in (16). Π_{S_C} denotes the projection operator onto the simplex

    S_C = { p^i : p_j^i ≥ 0 ,  Σ_{j=1}^{m_i} p_j^i = 1 }
Theorem 3.
The automata play y(t) converges in probability to the Nash play, i.e. y(t) →^P y*, if the
conditions,
    (i = 1, ..., N)
Theorem 4.
Then, on the set { Σ_t β_t < ∞ , Σ_t ζ_t < ∞ }, U_t converges a.s. to a random variable,
and Σ_t η_t < ∞ a.s.
Proof of Theorem 3.
Let

    V(p(t)) = Σ_{i=1}^{N} Σ_{j=1}^{m_i} ( p_j^i(t) - p_j^{i*} )²              (30)

    V(p(t)) = ... + Σ_{i=1}^{N} Σ_{j≠k_i+u^i}
              [ Π_{S_C}( p_j^i(t-1) - [γ^i(t)/(m_i-1)] |F^i(e^i(y_{k_i}^i, s^i))| ) - p_j^{i*} ]²

Using the contraction property of the projection operator, we get,

    V(p(t)) ≤ Σ_{i=1}^{N} Σ_{j=1}^{m_i} ( p_j^i(t-1) - p_j^{i*} )²
            + 2N Σ_{i=1}^{N} γ^i(t)² |F^i(e^i(y_{k_i}^i, s^i))|²
            + N Σ_{i=1}^{N} Σ_{j≠k_i+u^i} [ γ^i(t)² / (m_i-1)² ] |F^i(e^i(y_{k_i}^i, s^i))|²
            + Σ_{i=1}^{N} 2 γ^i(t) ( p_{k_i+u^i}^i(t-1) - p_{k_i+u^i}^{i*} ) |F^i(e^i(y_{k_i}^i, s^i))|
            - Σ_{i=1}^{N} Σ_{j≠k_i+u^i} [ 2 γ^i(t) / (m_i-1) ] ( p_j^i(t-1) - p_j^{i*} ) |F^i(e^i(y_{k_i}^i, s^i))|   (31)

where

    [x]^+ = x ,   if x > 0
          = 0 ,   if x ≤ 0                                                    (32)

    - Σ_{i=1}^{N} γ^i(t) [ 2 m_i / (m_i - 1) ] Δ|δ^i(y)|

which is negative for any Δy ≠ 0 and all y ∈ Y, by virtue of condition 1 (cf. Sec. 3.4).
Note that E{Δ_t} = 0 if and only if p^i(t-1) = p^{i*} for all i. Therefore, we can assign
positive constants B_1, B_2, ..., B_N such that the third term in (31) is expressed as,

    Σ_{i=1}^{N} γ^i(t) B_i || p^i(t-1) - p^{i*} ||                            (34)

    k_1, k_2, ..., k_N

on the condition
Noting that
According to (37) and (39) the sequence {ε^i(t)} must guarantee the convergence of the
series γ^i(t) ε^i(t), i.e.

    Σ_{t=1}^{∞} γ^i(t) ε^i(t) < ∞                                             (40)

With due regard to (37), (39), (40), as well as the fact that {γ^i(t)} is divergent,

    Σ_{t=1}^{∞} γ^i(t) = ∞
3.6. ENVIRONMENT MODEL.
In the following we present two different models of the environment, namely the "pairwise
comparison" and the "proportional utility" models.
The considered form for the probability of a response from an environment element to an
automaton agrees with the natural assumption that the probability increases as the utility
of the element increases, and vice versa. The dependence of the elements' utility on the
automata strategies y^i, y^k (i, k = 1, ..., N) sets up the competition and consequently
stimulates certain objectives for the automata. For this model of pairwise comparison the
competition comes from the fact that for any automaton, say the i-th, another automaton,
e.g. the k-th, while seeking its own goal, may minimize the utility of an environment
element response to the i-th automaton. It is therefore conspicuous that the utility ρ_j
has to be a function of the difference between y^i and y^k. The sign of that difference
determines on which side the strategy y^i is to be manipulated by the i-th automaton.
Since the probability of choosing two automata out of the N equals 2/N(N-1), it is clear
that the total probability of agreement between a j-th element and an i-th automaton is
equal to

    p_j^i(y) = [ 2 / N(N-1) ] Σ_{k=1, k≠i}^{N} ψ( ρ_j(y^i, y^k) )             (42)

Notice that

    s^i = ... m(y^i) ... ,   otherwise                                        (44)

where

    sg(x) = 1 ,   x ≥ 0
          = 0 ,   x < 0                                                       (46)

Also, the probability of two responses (i.e. 2 m(y^i) ≤ s^i < 3 m(y^i)) by two elements
out of ν is equal to :

    p^i( 2 m(y^i) ≤ s^i < 3 m(y^i) / {y^k} ) = ...

    Σ_{s^i} p^i( s^i / y ) = 1                                                (49)
3.6.2. Proportional utility.
In this model each element of the environment responds to the automata with probabilities
proportional to the utilities of their strategies. The probability of a response from an
element increases as the utility of an automaton strategy increases and becomes maximum
for maximum utility. Hence the probability that the j-th element responds to the i-th
automaton can be expressed thus,

    p_j^i(y) = ψ( ρ_j(y^i) ) / Σ_{k=1}^{N} ψ( ρ_j(y^k) ) ,
               j = 1, ..., ν ;  i = 1, ..., N                                 (50)

Eqs. (44) - (48) again complete the mathematical description of this environment model.

    e^i = s^i - q^i y^i ,   (i = 1, ..., N)                                   (51)

    (i = 1, ..., N)                                                           (52)
The nonlinearity of the weighting function F^i for a cautious or a hazardous seller
indicates the lack of objectivity of such psychological types. Thus a hazardous type
overestimates the importance of an excess of buyers' demand (e^i > 0) and underestimates
the importance of a shortage of buyers' demand (e^i < 0). A cautious type overestimates
the importance of the shortage and underestimates the importance of the excess.
The objective of each seller automaton is to find a price strategy which ensures on the
average the least harmful situation (according to its psychology) created by the mismatch
between commodity supply and demand in monetary units. Hence, each seller attempts to
minimize the function (4), where the indicator function δ^i is given by eqn. (5).
The automata scheme (15) is considered to simulate the behavior of the sellers.
The buyers, representing the sellers' environment, may be simulated by the "pairwise
comparison" model of section 3.6.1. In this case the utility of the j-th buyer making his
purchase from the i-th seller is given by

    ψ(x) = 1 ,              x > Δ
         = (x + Δ) / 2Δ ,   -Δ < x < Δ                                        (55)
         = 0 ,              x < -Δ

Here (-Δ, Δ) represents the "active zone" of the function. The function m(.) in eqn. (44)
is taken as

    m(y^i) = y^i                                                              (56)

    γ^i(t) = γ_0 / t ,   γ_0 = const. ,   t = 1, 2, ...   (i = 1, 2, 3)       (57)
The automata scheme (15) always converged to certain equilibrium price probabilities,
independently of any initial assumptions.
For objective sellers and the following set of prices y_k^i, the equilibrium price
probabilities were :

    E{p_k^i(100)}        1         2         3
         1             .0528     .0010     .6708
         3             .5018      0
The effect that γ_0 has on the convergence rate can be deduced from the form of the
scheme (15). Note that the coefficients of p_{k_i+u^i}(t) and p_j^i(t) are all unity. Any
convergence will come from the "forcing terms", all of which are multiplied by γ^i(t+1).
    E{p_k^i(100)}        1         2         3
         1             .0457      0        .3265
         3             .8361      0
For the case of cautious sellers the equilibrium price probabilities show the tendency of
increasing the probability of the lower prices.
(Figure : convergence of the price probabilities for γ_0 = 1 and γ_0 = 0.01 over the
first 70 steps.)

    E{p_k^i(100)}        1         2         3
                         -        .2072      0
- number of sellers = 2
- number of buyers = 3
- buyer's utility ρ_j(y^i) = h - y^i, see eqn. (50), where h is a "reference price" for
  the buyer
- available amount of money of each buyer μ = 3
- rates of commodity supply q^1 = 1 , q^2 = 2
- active zone Δ = 1 , see eqn. (55)
- set of prices Y^1 = {1, 2} , Y^2 = {1, 2, 3}
- buyer's response function m(y^i) = y^i
- all sellers and buyers are objective.
yl y2 i = 1 i = 2 ~l(y) ~2(y) i = 1 i = 2
1 1 0,5 0,5 3 3 1 2
1 2 2/3 I/3 4 2 1 2
1 3 1 0 6 0 1 0
2 1 I/3 2/3 2 4 2 2
2 2 0,5 0,5 3 3 2 3
2 3 l 0 6 0 2 0
89
It follows from the table above that the optimal prices to be adopted by the sellers are y^{1*} = 2, y^{2*} = 2. At these prices the profit of each seller is maximum.
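The buyer-share columns of the table above can be reproduced with the proportional-utility rule; the short sketch below is ours, and the reference price h = 3 is our assumption (it is the value consistent with the tabulated shares).

# Sketch (our own code): buyers' proportional choice shares, eqn (50) with
# Phi the identity and utility p_j(y) = h - y.  h = 3 is an assumption of ours.
def shares(prices, h=3.0):
    """Fraction of buyers choosing each seller, given the joint prices."""
    utilities = [h - y for y in prices]
    total = sum(utilities)
    return [u / total for u in utilities]

for y1 in (1, 2):
    for y2 in (1, 2, 3):
        print((y1, y2), [round(s, 3) for s in shares([y1, y2])])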
We simulated the stochastic automata model (12)* starting from the initial state of absolute randomness, i.e. with equal initial price probabilities.
[Figure: price probabilities versus time over the first 60 steps (γ_0 = 0.1).]
* Notice that the indicator function φ^i(y) is given by the difference between the average demand φ̄^i and the supply q^i y^i, i.e. φ^i(z) = φ̄^i(z) - q^i y^i.
Fig. 5. Price Probabilities - No a priori information (γ_0 = 0.1).
3.8. RESOURCE ALLOCATION.
$$\max_{s^1, s^2, \ldots, s^N} \sum_{i=1}^{N} \phi^i(s^i) \quad \text{subject to} \quad \sum_{i=1}^{N} s^i \le R, \quad s^i \ge 0 \; (i = 1, \ldots, N) \qquad (58)$$
The known computational procedures for solving that problem are based on either dynamic programming16 or gradient methods17. Those methods assume prior knowledge of the functions φ^i(s^i). They lead to computational algorithms in the form of iterative procedures, where the constraint on the available resource may be violated at intermediate computation steps. This makes direct on-line application of the computation results infeasible.
The above methods are not suitable for real application for several reasons, among them:
1 - The functions φ^i(s^i) are often not known a priori, neither to the users nor to the allocation center. Moreover, the effects attained by different users can vary unpredictably during the relevant time period due to random factors like machine failures, varying market prices, etc.
2 - The users are active systems which use the information about their productiveness to promote their own goals18.
$$e^i = \phi^{i\prime}(s^i) - y^i \qquad (60)$$
or
$$F^i(e^i) = \begin{cases} \eta_0\, e^i, & e^i < 0 \\ \eta\, e^i, & e^i > 0 \end{cases} \qquad \eta_0, \eta > 0 \qquad (62)$$
i-th user:
$$\frac{dy^i}{dt} = \begin{cases} 0, & \big(y^i = \phi^{i\prime}(0),\; F^i < 0\big) \text{ or } \big(y^i = \phi^{i\prime}(R),\; F^i > 0\big) \\ F^i\big(\phi^{i\prime}(s^i) - y^i\big), & \text{otherwise} \end{cases} \qquad (63)$$
Center:
$$s^i = R\, \frac{g^i(y^i)}{\sum_{j=1}^{N} g^j(y^j)}, \qquad (i = 1, \ldots, N) \qquad (64)$$
It has been shown that the system (63), (64) is stable and all trajectories y^i(t) converge to the point
$$\hat{y}^i = \left.\frac{d\phi^i}{ds^i}\right|_{s^i = \hat{s}^i} \qquad (i = 1, \ldots, N) \qquad (65)$$
provided that the φ^i are strictly concave functions for all i = 1, ..., N and for all 0 ≤ s^i ≤ R. The above organization yields a suboptimal solution of the resource allocation problem.
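To make the interplay of (63) and (64) concrete, here is a small simulation sketch of ours. The production functions φ^i(s) = a_i √s and the choice g^i(y) = y are illustrative assumptions of our own, not taken from the text; the boundary (clamped) cases of (63) are omitted.

import math

# Illustrative assumptions: phi_i(s) = a_i * sqrt(s), so phi_i'(s) = a_i / (2*sqrt(s)).
a = [1.0, 2.0, 3.0]          # productivity coefficients (hypothetical)
R = 1.0                      # total resource
eta0, eta = 1.0, 1.0         # slopes of the weighting function F, eqn (62)
dt = 0.01

def dphi(a_i, s):
    return a_i / (2.0 * math.sqrt(max(s, 1e-9)))

def F(e):
    return eta0 * e if e < 0 else eta * e

y = [1.0, 1.0, 1.0]          # start from equal effectiveness estimates
for _ in range(5000):
    total = sum(y)
    s = [R * yi / total for yi in y]          # center's allocation, eqn (64) with g(y) = y
    y = [yi + dt * F(dphi(ai, si) - yi)       # users' adjustment, eqn (63), interior case
         for yi, ai, si in zip(y, a, s)]

print([round(v, 3) for v in y])   # estimates approach the marginal productivities, cf. (65)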
Optimality is approached when
$$y^1 = y^2 = \cdots = y^N = \text{const} = \lambda$$
where λ is the multiplier of the Lagrangian
$$\Phi(s, \lambda) = \sum_{i=1}^{N} \phi^i(s^i) + \lambda\Big(R - \sum_{j=1}^{N} s^j\Big)$$
Here we consider the stochastic automata analog of Malishevskii's model. The variables φ^i, φ^{i'} for any s^i are considered to be stochastic variables with unknown distributions. The estimate y^i of the production effectiveness φ^{i'} is considered to be the automaton's action or strategy. The learning of each i-th producer-automaton is directed towards decreasing its own penalty,
where s^i is given by (64). Notice that the above formulation corresponds to a Nash game: any realization of φ^i satisfies the contramonotonicity property.
To show this, let y^i(1) > y^i(2); then
$$\phi^{i\prime}\!\left(\frac{R\, g^i(y^i(1))}{g^i(y^i(1)) + \sum_{j\neq i} g^j(y^j)}\right) - y^i(1) \;<\; \phi^{i\prime}\!\left(\frac{R\, g^i(y^i(2))}{g^i(y^i(2)) + \sum_{j\neq i} g^j(y^j)}\right) - y^i(2)$$
and consequently
$$F^i\big(\phi^{i\prime}(\cdot) - y^i(1)\big) < F^i\big(\phi^{i\prime}(\cdot) - y^i(2)\big)$$
Computer simulations were carried out for ten producers with the following production functions φ^{i'}.
The set of strategies for each automaton was taken as the values of y^i at ten points between 0 and 1; R is taken equal to one. The functions F^i(·) were taken as in (62) with η_0 = η = 1. The functions g^i(y^i), see (64), were taken as g^i(y^i) = (y^i)^r (67).
The sequences γ(t), c(t), see (28), were taken as γ_0/t and c_0/t, respectively.
Convergence was always observed after a short time interval. The algorithm (28) demonstrated low sensitivity to the choice of the parameters γ_0 and c_0. Fig. 6 depicts the total production versus time. The results improve when r gets larger, see Fig. 6. This means that if the rule of distribution of the resource is close to the rule "provide to him who gives the maximum estimate of effectiveness", the solution approaches the optimal one8. The organization in the latter case, however, seems to be more sensitive to element failures, see Fig. 6.
At time t*, see Fig. 6, the first producer was considered to break down and start to emit the estimate y^1 = 0. That instantly caused the total production to drop off drastically. Immediately afterwards the system reorganized itself in such a way that the resource was redistributed among the remaining producers, so that the increase in effects of the remaining producers partially compensated the drop in total production. This result is interesting as it demonstrates the reliability of the system.
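A minimal sketch (our own, with hypothetical numbers) of how the allocation rule (64) with g(y) = y^r redistributes the resource when one producer's estimate drops to zero:

# Allocation rule (64) with g(y) = y**r; all numbers below are illustrative.
def allocate(y, R=1.0, r=2):
    weights = [yi ** r for yi in y]
    total = sum(weights)
    return [R * w / total for w in weights]

estimates = [0.9, 0.6, 0.4, 0.3]
print(allocate(estimates))            # allocation before the failure

estimates[0] = 0.0                    # producer 1 breaks down and emits y1 = 0
print(allocate(estimates))            # its share is redistributed to the others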
[Figure: total production versus time, curves for r = 2 and r = 4, see eqn. (67).]
Fig. 6. Production versus time.
3.9. CONCLUSIONS.
REFERENCES.
13. Robbins H., and Siegmund D., "A convergence theorem for nonnegative almost supermartingales and some applications", in Optimizing Methods in Statistics, pp. 233-257. Academic Press, New York, 1971.
14. Viswanathan R., and Narendra K.S., "Comparison of Expedient and Optimal Reinforcement Schemes for Learning Systems", Journal of Cybernetics, vol. 2, no. 1, 1972, pp. 21-37.
15. Krylatykh L.P., "On a Model of Collective Behavior", Engineering Cybernetics, 1972, pp. 803-808.
16. Bellman R., and Dreyfus S., Applied Dynamic Programming, Princeton: Princeton University Press, 1962.
17. Arrow K., Hurwicz L., and Uzawa H., Studies in Linear and Nonlinear Programming, Stanford, California: Stanford University Press, 1958.
18. Ivanovskii A.G., "Problems of stimulation and obtaining objective estimates in active systems", Automation and Remote Control, no. 8, pp. 1298-1303, 1970.
19. El Fattah Y.M., "Learning automata as models of behavior", Simulation 75 Proceedings, Zurich (Switzerland), 1975.
20. El Fattah Y.M., "A model of many goal-oriented stochastic automata with application to a marketing problem", 7th IFIP Conf. on Optimization Techniques Proceedings, Nice (France), 1975.
21. El Fattah Y.M., "Analysis of collective behavior in large systems using a model of many goal-oriented stochastic automata with applications", IFAC Symp. on Large-Scale Systems Proceedings, Udine (Italy), 1976.
22. El Fattah Y.M., and Henriksen R., "Simulation of market price formation as a game between stochastic automata", J. of Dynamic Systems, Measurement and Control (special issue), March 1976.
23. El Fattah Y.M., "Use of a learning automata model in resource allocation problems", IFAC Symp., Cairo (Egypt), 1977.
APPENDIX
Projection operator
We stipulate that
(A.2)
$$T_j = (0, 0, \ldots, 0, \underbrace{L}_{j}, 0, \ldots, 0) \qquad (A.3)$$
$$T^m_k = \{\,\underline{x} : \underline{x} \in D^m,\; x_i \ge 0\ \forall i\,\} \qquad (A.4)$$
In order to see that π(D^m) lies on D^m, perform the summation of both sides of (A.7), after premultiplying by a_j, to get
$$\sum_{j=1}^{N} a_j\big(\pi(D^m)\big)_j = \sum_{j=1}^{N} a_j x_j + \sum_{j=1}^{m} \frac{a_j}{m}\Big(L - \sum_{i} a_i x_i\Big) = \sum_{j=1}^{N} a_j x_j + \Big(L - \sum_{i} a_i x_i\Big) = L$$
Lemma. The face T^{m-1} closest to the point π(D^m) has an orthogonal vector a^{m-1} with the components
$$(a^{m-1})_k = \begin{cases} (a^m)_k, & k \neq j \\ 0, & k = j \end{cases} \qquad (A.8)$$
where the index j corresponds to the minimal component of the point π(D^m). Then
$$V^2(\underline{x}, T^m_j) - V^2(\underline{x}, T^m_k) = 2(z_k - z_j)L \qquad (A.10)$$
i.e. the most distant vertex corresponds to the least component $z_j = \min_k z_k$. Consequently the face T^{m-1} which lies opposite to that vertex, and is closest to the point, has the orthogonal vector a^{m-1} obtained by nullifying the j-th component of a^m, see (A.8). The lemma is proved.
By definition, the projection π(x°) of a point x° ∈ R^N into S is called the point
The property (A.11) of the projection operator, as well as the Lemma, suggest the following sequential procedure for determining the projection (a sketch in code follows the list):
a) check the condition x° ∈ S;
b) if x° ∉ S then find the projection x°(D^N);
c) if x°(D^N) ∉ S then find the face T^{N-1}_k closest to the point x°(D^N) from the Lemma;
d) project x°(D^N) into D^{N-1} ≡ T^{N-1}_k. If x°(D^{N-1}) ∉ S then find the face T^{N-2}_k closest to the point x°(D^{N-1}) from the Lemma, and again project x°(D^{N-1}) into D^{N-2} ≡ T^{N-2}_k, and so forth, till x°(D^m) ∈ S for a certain m;
e) if x°(D^2) ∉ S then the projection will be one of the vertices T_k.
(1) Notice that <x° - y, y> = 0 if y is the projection of x°. Also <x°(D^m) - x°, x°(D^m)> = 0. Hence if y is the projection of x°(D^m) then <x°(D^m) - y, y> = 0 and consequently <x°(D^m) - x°, y - x°(D^m)> = 0.
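The following Python sketch (ours) implements the spirit of steps a)-e) for the common special case a_i = 1, L = 1, i.e. projection onto the probability simplex: project onto the affine hull, and whenever a component of the result is negative, fix it at zero (drop that coordinate, as in the Lemma) and re-project onto the remaining face.

def project_to_simplex(x0):
    """Euclidean projection of x0 onto {x : x_i >= 0, sum x_i = 1}.
    Iteratively projects onto the affine hull and zeroes negative coordinates."""
    n = len(x0)
    active = list(range(n))          # coordinates not yet fixed at zero
    y = list(x0)
    while True:
        m = len(active)
        # project onto the hyperplane sum_{i in active} y_i = 1
        shift = (1.0 - sum(y[i] for i in active)) / m
        for i in active:
            y[i] += shift
        negatives = [i for i in active if y[i] < 0.0]
        if not negatives:
            return y
        for i in negatives:          # drop the offending coordinates, cf. (A.8)
            y[i] = 0.0
            active.remove(i)

print(project_to_simplex([0.8, 0.7, -0.4]))   # -> [0.55, 0.45, 0.0]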
CHAPTER IV. CONTROL - FINITE MARKOV CHAINS.
$$\lambda(i,k,j) \ge 0 \quad \text{and} \quad \sum_{j=1}^{N} \lambda(i,k,j) = 1, \qquad i = 1, \ldots, N,\; k \in K_i \qquad (1)$$
$$S^{M_i} = \Big\{\,\underline{x} \in R^{M_i} : x_j \ge 0,\; \sum_{j=1}^{M_i} x_j = 1\,\Big\}, \qquad i = 1, \ldots, N \qquad (2)$$
The subclass of policies for which d_k^{(i)} is zero except for exactly one k ∈ K_i, for every i, is the class of deterministic policies.
We note that for any fixed D the observed states {x_n}_{n≥0} constitute a homogeneous Markov chain whose transition matrix P(D) = (p_ij(D)) is given by
$$p_{ij}(D) = \sum_{k \in K_i} d_k^{(i)}\, \lambda(i,k,j) \qquad (3)$$
For any fixed D the chain is assumed to be ergodic. That means that the chain is characterized by one irreducible closed set of persistent aperiodic states. Transient states are allowed and can vary with the policy.
Definition. Let P be a stochastic matrix. The ergodic coefficient of P, denoted by α(P), is defined by
$$\alpha(P) = 1 - \sup_{i,k} \sum_{j=1}^{N} (p_{ij} - p_{kj})^{+} \qquad (4)$$
where $(p_{ij} - p_{kj})^{+} = \max(0,\, p_{ij} - p_{kj})$.
A homogeneous Markov chain is ergodic if and only if α(P^k) > 0 for some k, cf. Isaacson and Madsen1.
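A direct transcription of (4) in Python (our own sketch):

def ergodic_coefficient(P):
    """alpha(P) = 1 - sup_{i,k} sum_j max(0, P[i][j] - P[k][j]), eqn (4)."""
    n = len(P)
    worst = max(
        sum(max(0.0, P[i][j] - P[k][j]) for j in range(n))
        for i in range(n) for k in range(n)
    )
    return 1.0 - worst

P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
print(ergodic_coefficient(P))   # positive, so this chain is ergodic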
Under the ergodicity assumption there exists exactly one long-run state distribution p_i(D), i = 1, ..., N, satisfying the conditions
$$\text{i. } p_i(D) \ge 0 \qquad \text{ii. } p_j(D) = \sum_{i=1}^{N} p_i(D)\, p_{ij}(D) \qquad \text{iii. } \sum_{j=1}^{N} p_j(D) = 1 \qquad (5)$$
Let p^T(D) be the 1 × N vector (p_1(D), ..., p_N(D)) and 1 be the N × 1 vector of 1's. Then p^T(D) is the solution of
$$\underline{p}^T(D)\,\big(I - P(D)\big) = \underline{0}^T \qquad (6)$$
So any of the columns of (I - P(D)) can be eliminated. Let B(D) be the N × N matrix obtained by replacing the first column of (I - P(D)) by 1. Then eqs. (5) and (6) combine to assure that p(D) is the unique solution of
$$\underline{p}^T(D)\, B(D) = \underline{e}_1^T \qquad (7)$$
where e_1^T is the 1 × N vector (1, 0, ..., 0). B is invertible due to the fact that it is of rank N. Let Q(D) denote the inverse of B(D). It follows from (7) that
$$\underline{p}^T(D) = \underline{e}_1^T\, Q(D) \qquad (8)$$
The assumption that the chain is ergodic for all D implies that the expected reward does not depend on the initial state i after a sufficiently long interval of time. The long-run expected average reward can be expressed as
$$\phi(D) = \sum_{i=1}^{N} \sum_{k \in K_i} d_k^{(i)}\, \eta_{ik}\, p_i(D) \qquad (9)$$
where
$$\eta_{ik} = \sum_{j=1}^{N} \lambda(i,k,j)\, r_{ikj}, \qquad i = 1, \ldots, N;\; k \in K_i \qquad (10)$$
is the expected average reward per stage when the state i is observed and the k-th decision in K_i is taken. Without loss of generality we assume that all the η's are nonnegative. They must be subject to a finite upper bound,
The control problem is to find the optimal policy D which maximizes the expected average reward φ(D) subject to the system equation (7).
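As a sketch (ours) of eqs. (3), (7)-(10): given the transition probabilities λ, rewards r and a randomized policy D, the long-run distribution can be obtained by solving p^T B = e_1^T and the average reward by (9). NumPy is used here for the linear solve, an assumption of convenience rather than anything prescribed by the text.

import numpy as np

def average_reward(lam, r, d):
    """lam[i][k][j]: transition probabilities; r[i][k][j]: rewards;
    d[i][k]: randomized policy.  Returns (p, phi) per eqs. (3), (7)-(10)."""
    N = len(lam)
    # transition matrix under policy D, eqn (3)
    P = np.array([[sum(d[i][k] * lam[i][k][j] for k in range(len(d[i])))
                   for j in range(N)] for i in range(N)])
    B = np.eye(N) - P
    B[:, 0] = 1.0                                   # replace the first column by 1's
    e1 = np.zeros(N); e1[0] = 1.0
    p = np.linalg.solve(B.T, e1)                    # p^T B = e1^T  <=>  B^T p = e1
    eta = [[sum(lam[i][k][j] * r[i][k][j] for j in range(N))
            for k in range(len(d[i]))] for i in range(N)]    # eqn (10)
    phi = sum(d[i][k] * eta[i][k] * p[i]
              for i in range(N) for k in range(len(d[i])))   # eqn (9)
    return p, phi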
A variation $\delta_{\alpha\beta}(\Delta)$ of the policy D is defined by
$$\delta_{\alpha\beta}(\Delta) = \begin{cases} \underline{\delta}^{(\alpha)}_{\beta}(\Delta), & i = \alpha \\ \underline{0}_i, & \text{otherwise} \end{cases} \qquad \alpha = 1, \ldots, N;\; \beta = 1, \ldots, M_\alpha \qquad (12)$$
where
$$\underline{\delta}^{(\alpha)}_{\beta}(\Delta)^T = \Big( -\tfrac{\Delta}{M_\alpha - 1}, \ldots, -\tfrac{\Delta}{M_\alpha - 1}, \underbrace{\Delta}_{\beta}, -\tfrac{\Delta}{M_\alpha - 1}, \ldots, -\tfrac{\Delta}{M_\alpha - 1} \Big) \qquad (13)$$
and $\underline{0}_i$ denotes the $M_i$-dimensional null vector. Here $\delta_{\alpha\beta}(\Delta)$ denotes the perturbation of the policy vector $\underline{d}^{(\alpha)}$.
The variation "step-length" Δ, eqn. (12), must satisfy the condition that $\underline{d}^{(\alpha)} + \underline{\delta}^{(\alpha)}_{\beta}(\Delta)$ lie in the simplex $S^{M_\alpha}$. This amounts to
(14)
The corresponding variation of the matrix B is
$$(\delta_{\alpha\beta} B)_{ij} = \begin{cases} 0, & i \neq \alpha,\ \text{all } j \\ \theta_j^{\alpha\beta}\, \Delta, & i = \alpha,\ j = 1, \ldots, N \end{cases} \qquad (15)$$
where
$$\theta_j^{\alpha\beta} = \begin{cases} 0, & j = 1 \\ -\lambda(\alpha,\beta,j) + \frac{1}{M_\alpha - 1} \sum_{k \neq \beta} \lambda(\alpha,k,j), & j = 2, \ldots, N \end{cases} \qquad (16)$$
Using a matrix inversion lemma, cf. Durand2, we can write the inverse of the matrix B + δ_αβ B as
$$(B + \delta_{\alpha\beta} B)^{-1} = Q + \delta_{\alpha\beta} Q \qquad (17)$$
with
$$(\delta_{\alpha\beta} Q)_{ij} = -\,\frac{\Delta\, Q_{i\alpha}\, \varepsilon_j^{\alpha\beta}}{1 + \Delta\, \varepsilon_\alpha^{\alpha\beta}}, \qquad i, j = 1, \ldots, N \qquad (18)$$
and
$$\varepsilon_j^{\alpha\beta} = \sum_{i=1}^{N} \theta_i^{\alpha\beta}\, Q_{ij}, \qquad j = 1, \ldots, N \qquad (19)$$
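The rank-one update (17)-(19) can be checked numerically; the sketch below (ours, using NumPy) compares the update formula with a direct inverse for a random perturbation of row α.

import numpy as np

rng = np.random.default_rng(0)
N, alpha, delta = 4, 2, 0.05
B = np.eye(N) + 0.1 * rng.standard_normal((N, N))
Q = np.linalg.inv(B)
theta = rng.standard_normal(N); theta[0] = 0.0       # theta_1 = 0, cf. eqn (16)

dB = np.zeros((N, N)); dB[alpha, :] = delta * theta  # row-alpha perturbation, eqn (15)
eps = theta @ Q                                       # eqn (19)
dQ = -delta * np.outer(Q[:, alpha], eps) / (1.0 + delta * eps[alpha])  # eqn (18)

print(np.allclose(np.linalg.inv(B + dB), Q + dQ))     # True: eqn (17) holds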
Let us adopt the following definitions of the norm of a vector a and of a matrix A,
Since Q is the inverse matrix of B, it follows from the Schwarz inequality that
The constants c_0 and c(α) can always be chosen large enough for inequality (22) to hold for all policies D.
Eqs. (8), (18) and (19) combine to assure that the variations δ_αβ p_i(D) can be expressed as
$$\delta_{\alpha\beta}\, p_i(D) = -\,\frac{\Delta\, \varepsilon_i^{\alpha\beta}\, p_\alpha}{1 + \Delta\, \varepsilon_\alpha^{\alpha\beta}}, \qquad i = 1, \ldots, N \qquad (24)$$
Hence the variation in the expected reward, eqn. (9), resulting from a variation δ_αβ(Δ) of a policy D, eqn. (12), can be written as
$$\delta_{\alpha\beta}\,\phi(D) = \sum_{i=1}^{N} \sum_{k=1}^{M_i} \Big( \delta_{\alpha\beta} d_k^{(i)}\, \eta_{ik}\, p_i + d_k^{(i)}\, \eta_{ik}\, \delta_{\alpha\beta}\, p_i \Big) = \Delta \Big( \eta_{\alpha\beta} - \frac{1}{M_\alpha - 1} \sum_{k \neq \beta} \eta_{\alpha k} - \sum_{i=1}^{N} \sum_{k=1}^{M_i} d_k^{(i)}\, \eta_{ik}\, \frac{\varepsilon_i^{\alpha\beta}}{1 + \Delta\, \varepsilon_\alpha^{\alpha\beta}} \Big)\, p_\alpha \qquad (25)$$
The first-order derivative of φ with respect to $d_\beta^{(\alpha)}$ is accordingly
$$\frac{\partial \phi}{\partial d_\beta^{(\alpha)}}(D) = \lim_{\Delta \to 0} \frac{\delta_{\alpha\beta}\,\phi(D)}{\Delta} = \Big( \eta_{\alpha\beta} - \frac{1}{M_\alpha - 1} \sum_{k \neq \beta} \eta_{\alpha k} - \sum_{i=1}^{N} \sum_{k=1}^{M_i} d_k^{(i)}\, \eta_{ik}\, \varepsilon_i^{\alpha\beta} \Big)\, p_\alpha \qquad (26)$$
so that
$$\delta_{\alpha\beta}\,\phi(D) = \frac{\partial \phi}{\partial d_\beta^{(\alpha)}}(D)\cdot \Delta + \psi_{\alpha\beta}(D)\cdot \Delta^2 + o(\Delta^3) \qquad (27)$$
where
$$\psi_{\alpha\beta}(D) = \sum_{i=1}^{N} \sum_{k=1}^{M_i} d_k^{(i)}\, \eta_{ik}\, \varepsilon_i^{\alpha\beta}\, \varepsilon_\alpha^{\alpha\beta}\, p_\alpha \qquad (28)$$
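The derivative (26) is straightforward to evaluate once Q and p are available; a sketch of ours (NumPy, building on the quantities computed in the earlier sketch):

import numpy as np

def policy_gradient(lam, eta, d, p, Q, alpha, beta):
    """d phi / d d_beta^(alpha), eqn (26).  lam[i][k][j], eta[i][k], d[i][k]
    are lists; p and Q come from the stationary-distribution computation."""
    N = len(lam)
    M_a = len(d[alpha])
    # theta vector of eqn (16): first component is zero
    theta = np.zeros(N)
    for j in range(1, N):
        theta[j] = -lam[alpha][beta][j] + sum(
            lam[alpha][k][j] for k in range(M_a) if k != beta) / (M_a - 1)
    eps = theta @ Q                                       # eqn (19)
    direct = eta[alpha][beta] - sum(
        eta[alpha][k] for k in range(M_a) if k != beta) / (M_a - 1)
    indirect = sum(d[i][k] * eta[i][k] * eps[i]
                   for i in range(N) for k in range(len(d[i])))
    return (direct - indirect) * p[alpha]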
Let us now examine the effect of a variation δ_γδ of a policy D on the first-order derivatives ∂φ/∂d_β^{(α)} given by eqn. (26). That variation can be written as
(29)
Considering the variations δ_γδ of both sides of eqs. (8) and (19), and making use of eqn. (18), we get
(30)
(31)
Substituting from (30) and (31) into (29) and then taking the limit of the division by Δ as Δ → 0, we get what we call the second-order derivatives,
$$\frac{\partial^2 \phi}{\partial d_\delta^{(\gamma)}\, \partial d_\beta^{(\alpha)}} = \lim_{\Delta \to 0} \frac{1}{\Delta}\, \delta_{\gamma\delta}\!\left(\frac{\partial \phi}{\partial d_\beta^{(\alpha)}}\right) \qquad (32)$$
where,
(34)
Equation (33) means that one can construct a policy D' from another policy D by means of successive admissible variations δ_αβ(Δ_β^{(α)}), α = 1, ..., N; β ∈ K_α. The variation in the expected average reward can be written as
(35)
Since,
(36)
(37)
Proof. Suppose that condition i of the theorem is not satisfied for some α : d^{(α)*} ∈ int S^{M_α}, say for α = ᾱ. Consider a variation δ_ᾱβ(Δ_β^{(ᾱ)}) of the policy D* such that Δ_β^{(ᾱ)} is admissible. Since d^{(ᾱ)*} is an interior point of the simplex S^{M_ᾱ}, Δ_β^{(ᾱ)} can assume both positive and negative signs. Choose,
which contradicts the assumption that D* is optimal. Hence condition i must hold. Now if condition ii does not hold for some α : d^{(α)*} ∈ ∂S^{M_α}, say for ᾱ, then again an admissible variation δ_ᾱβ(Δ_β^{(ᾱ)}) will yield (40), which is in contradiction with the assumption that D* is optimal.
Proof. If the conditions of the theorem hold then it follows from the expansion formula (37), as well as the definitions (26), (28), and (32), that in the neighborhood of D*,
4.3 AUTOMATON CONTROL MODEL.
[Block diagram: Adaptive Device (Control Policy Adaptation).]
Remark. If it happens that, for some state i at epoch n, the component d_{k_0}^{(i)} of the policy d^{(i)} equals zero, and at the next recurrence of the state i the gradient corresponding to a decision k ≠ k_0 is positive, then the reinforcement scheme (43), or (44), will be applied on the simplex S^{M_i - 1} ⊂ S^{M_i} obtained by dropping off the k_0-th decision alternative. This is necessary to avoid premature convergence to a non-optimal policy.
4.4 CONVERGENCE.
Let the present epoch be n. Let ξ(n) denote the realization of the random variables: the decision probability vectors d^{(i)}(n'), i = 1, ..., N, and the observed states x_{n'}, for n' = 0, 1, ..., n. Consider the error criterion I(n) = (φ(n) - φ*)², where φ(n) is the expected average reward for the policy D(n) = D_n and φ* is the optimal expected average reward. The expected value of I(n+1) conditioned on ξ(n) can be written as,
where x_n, a denote the observed state and the control decision, respectively, at epoch n. It follows from (25) and (26) that
where the function $f^{x_n a}(D_n, \Delta_a^{(x_n)})$ is defined by
(47)
and $\Delta_a^{(x_n)}$ is chosen according to the reinforcement scheme (44) as
(50)
Here [·] means the same term between brackets as in the second term of the same equation. We impose the condition that the estimated derivatives ∂φ̂/∂d(n), following an appropriate estimation of the transition probabilities λ(i,k,j), satisfy the condition that
where,
(54)
(57)
Hence I(n) is a non-negative almost supermartingale and we can apply the convergence theorem of Robbins and Siegmund3. This yields the following result: lim_n I(n) exists and is finite, and
(60)
If, moreover,
(62)
then (60), combined with the fact that (∂φ/∂d)² is a uniformly bounded positive sequence, cf. (55) and (56), implies that
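For the reader's convenience, the Robbins-Siegmund theorem invoked above can be stated as follows (standard formulation, in our notation; see reference 3):

\noindent\textbf{Theorem (Robbins--Siegmund).}
Let $(V_n)$, $(a_n)$, $(b_n)$, $(c_n)$ be nonnegative random variables adapted
to an increasing family of $\sigma$-fields $(\mathcal{F}_n)$ such that
\[
  \mathbb{E}\!\left[V_{n+1} \mid \mathcal{F}_n\right]
    \;\le\; (1 + a_n)\,V_n + b_n - c_n .
\]
Then, on the event $\{\sum_n a_n < \infty,\ \sum_n b_n < \infty\}$, the sequence
$V_n$ converges almost surely to a finite limit and $\sum_n c_n < \infty$
almost surely.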
THEOREM 3. The reinforcement scheme (44) subject to the conditions (51), (53), as well as
$$0 \le \gamma^{x_n}(n) \le \frac{\bar{c} - 1}{c_2 + c_1 \bar{c}}, \qquad \sum_{n \ge 1} \gamma^{x_n}(n) = \infty, \qquad \sum_{n \ge 1} \big(\gamma^{x_n}(n)\big)^2 < \infty, \qquad \sum_{n \ge 1} \gamma^{x_n}(n)\, r_n < \infty$$
4.5 ACCELERATION.
The following idea, originally due to Kesten4, may be employed to accelerate the convergence of the stochastic algorithm (44). When the policy D is far from the optimal, there will be few changes of sign of successive values of the gradient ∂φ/∂d_β^{(α)}, α = 1, ..., N, β = 1, ..., M_α. Near the optimal, we would expect oscillation from one side of d_β^{(α)*} to the other. This suggests using the number of sign changes of successive values of ∂φ/∂d_β^{(α)} to indicate whether the policy estimate d_β^{(α)} is near or far from d_β^{(α)*}. To accelerate convergence, the quantity γ_β^α(n), see (44), is not decreased if ∂φ/∂d_β^{(α)} has the same sign as the respective preceding value (i.e. for the same α, β). To formalize this, we introduce the set of N vectors Z^{(1)}, ..., Z^{(N)}. If at epoch n the event "state α, decision β" takes place, then
We also introduce the set of N count vectors L^{(1)}, ..., L^{(N)}, which are initialized as
$$L_j^{(i)}(0) = 0, \qquad j = 1, \ldots, M_i;\; i = 1, \ldots, N \qquad (65)$$
If at epoch n the event "state α, decision β" takes place, then the β-th component of the vector L^{(α)} will be updated thus,
$$\gamma_j^i(0) = \gamma_0 = \text{const.} > 0, \qquad j = 1, \ldots, M_i,\; i = 1, \ldots, N \qquad (67)$$
If at epoch n the event "state α, decision β" takes place, then the step-length element γ_β^α(n) is defined as
$$\gamma_\beta^\alpha(n) = \gamma_\beta^\alpha(n-1), \quad \text{if } \big(L_\beta^{(\alpha)}(n) < 2\big) \cup \Big(\big(L_\beta^{(\alpha)}(n) \ge 2\big) \cap \big(Z_\beta^{(\alpha)}(n)\, Z_\beta^{(\alpha)}(n-1) > 0\big)\Big)$$
The sequence γ_β^α(n) must satisfy the condition that the respective policy increments Δ_β^{(i)}(n) satisfy the constraint (14), in order to ensure that d^{(i)}(n+1) belongs to the simplex S^{M_i}. If Δ_β^{(i)}(n) is such that d^{(i)}(n+1) does not belong to the simplex, then γ_β^α(n) is divided by two. The process of division is repeated, if necessary, until d^{(i)}(n+1) is found to be in the simplex S^{M_i}.
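A sketch (ours) of the acceleration logic described above: the step length for a given (state, decision) pair is kept while successive gradient values keep the same sign, is reduced when the sign flips, and is halved as often as needed to keep the updated policy inside the simplex. The variable names and the exact decrease rule are our own, since the text's formula is not fully recoverable.

def kesten_gamma(gamma0, sign_changes):
    """Kesten-style step length: stays at gamma0 until the gradient starts
    changing sign, then decreases with the number of sign changes observed.
    (Simplified rule of ours, not the text's exact formula.)"""
    return gamma0 / (1.0 + sign_changes)

def update_pair(state, gamma0, grad):
    """state = (last_grad, sign_changes); returns (new_state, step length)."""
    last_grad, sign_changes = state
    if last_grad is not None and last_grad * grad < 0.0:
        sign_changes += 1
    return (grad, sign_changes), kesten_gamma(gamma0, sign_changes)

def shrink_until_feasible(d, delta, gamma, max_halvings=50):
    """Halve gamma until every component of d + gamma*delta is nonnegative
    (delta is assumed to sum to zero, so the simplex constraint then holds)."""
    for _ in range(max_halvings):
        if all(di + gamma * de >= 0.0 for di, de in zip(d, delta)):
            return gamma
        gamma /= 2.0
    return 0.0   # give up: cf. the Remark of section 4.3 (drop the zero-probability alternative)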
We consider the example of the "taxicab operation" given by Howard5. The problem concerns a taxicab driver whose territory encompasses three towns A, B and C. If he is in town A, he has three alternatives:
1. He can cruise in the hope of picking up a passenger by being hailed.
2. He can drive to the nearest cab stand and wait in line.
3. He can pull over and wait for a radio call.
If he is in town C, he has the same three alternatives, but if he is in town B, the last alternative is not present because there is no radio cab service in that town. For a given town and given alternative, there is a probability that the next trip will go to each of the towns A, B and C, and a corresponding reward in monetary units associated with each such trip. This reward represents the income from the trip after all necessary expenses have been deducted. For example, in the case of alternatives 1 and 2, the cost of cruising and of driving to the nearest stand must be included in calculating the rewards. The transition probabilities and the rewards depend upon the alternative because a different customer population will be encountered under each alternative.
[Table 1: transition probabilities and rewards for the taxicab problem; only fragments of the original table are recoverable.]
Table 2. Expected average reward versus iteration number.
  n     φ(n)
  0     9.179964
  1     9.680186
  2    12.465918
  3    12.650611
  4    12.660767
  5    12.817069
 10    13.177875
 20    13.241918
 30    13.318928
 40    13.342069
 50    13.344534
[Figure: decision probabilities d_k^{(i)} versus iteration number (0 to 40).]
4.7 CONCLUSIONS.
COMMENTS.
4.2 The idea of using a variational approach is inspired by the work of Lyubchik and Poznyak6. They provided a sketchy formulation of the conditions of optimality for the optimal control problem with inequality constraints. They did not, however, indicate any concrete meaning of the "derivatives", nor did they evaluate the nature of their conditions (sufficient, necessary, or both). Further work remains to be done for the problem with constraints, which may be formulated as a stochastic programming problem7. Theorems 1 and 2 presented here are believed to be new. It is interesting to examine their relationship with Howard's conditions5 based on the dynamic programming approach.
4.3 The presented convergence proof is new. Condition (51) for the case of lack of a priori information is believed to be less stringent than the condition of minimum contrast estimate, cf. Mandl8.
REFERENCES
1. D.L. Isaacson, and R.W. Madsen, Markov Chains. New York: John Wiley and Sons, 1976.
2. E. Durand, Solutions Numériques des Équations Algébriques, Tome II. Paris: Masson, 1961.
3. H. Robbins, and D. Siegmund, "A Convergence Theorem for Non-negative Almost Supermartingales and Some Applications", in Optimization Methods in Statistics, ed. by J.S. Rustagi. New York: Academic, 1971.
4. H. Kesten, "Accelerated Stochastic Approximation", Ann. Math. Statistics, 29, 1, pp. 41-59, 1958.
5. R.A. Howard, Dynamic Programming and Markov Processes, New York: John Wiley and Sons, 1962.
6. L.M. Lyubchik, and A.G. Poznyak, "Learning Automata in Stochastic Plant Control Problems", Automation and Remote Control, no. 6, pp. 777-789, 1974.
7. A.S. Poznyak, "Learning Automata in Stochastic Programming Problems", Automation and Remote Control, no. 10, pp. 1608-1619, 1973.
8. P. Mandl, "Estimation and Control in Markov Chains", Adv. Appl. Prob., 6, pp. 40-60, 1971.
EPILOGUE