Vous êtes sur la page 1sur 41

Annals of Operations Research (7997)477-512 29

477

ON THE COMPUTATION OF THE OPTIMAL COST FUNCTION FOR DISCRETE TIME MARKOV MODELS WITH PARTIAL OBSERVATIONS * EnriqueL. SERNIK and Steven MARCUS I.
Department of Electrical and Computer Engineering, The Uniuersity of Texas at Austin, Austin, Texas 78712-1084,USA

Wc consider several applications of two state, finite action, infinite horizon, discrete-time Markov decision processeswith partial observations, for two special cases of observation quality, and show that in each of these cases the optimal cost function is piecewise linear. This in turn allows us to obtain either explicit formulas or simplified algorithms to compute the optimal cost function and the associated optimal control policy. Several examples are presented. Keywords: Markov chains (finite state and action spaces, partial observations), dynamic programming (infinite horizon, value iteration algorithm).

,i
I

T
I

l. Introduction Finding structural characteristicsof the optimal cost and optimal policies where only partial observationsof the with stochasticcontrol systems, associated states are available, has been a problem that has interested researchersin the different disciplines where these models occur. This is clear since such knowledge greatly facilitates the design of controls to improve the performance of the system. The determination of structural properties is important, both becauseit drastically reducescomputation, and becauseoften the discretizationsassociated with the numerical procedure make it difficult to obtain certain information from the system, such as sensitivity of the optimal policy with respect to small changes in the parametersof the modelFor the kind of problems in which we are interested, namely control of finite state, finite action, infinite horizon, partially observed Markov chains, several important structural results have been obtained, and for the sake of brevity we
* Research supported in part by the Air Force Office of Scientific Research under Grant AFOSR-86-0029, in part by the National Science Foundation under Grant ECS-8617860,in part by the Advanced Technology Program of the State of Texas, and in part by the DoD Joint Services Electronics Program through the Air Force Office of Scientific Research (AFSC) Contract F49620-86-C-0045. o J,C. Baltzer A.G. Scientific Publishing Company

472

E.1,. Sernik, S.I. Marcus / Discrete time Markou models

cannot list them all here. As examples though, we mention some of these structural results, and refer the reader to the other papers cited in this work, and to the references therein. White [32] in his work on production-replacement models, and for the discounted cost case, showed that among the stationary optimal policies there is a smaller class of optimal policies, called structured policies, such that one need only look among thesewhen searching for an optimal replacement policy. The extension of these results to thc case in which the performance index is the averagecost, has recently been obtained by Fernandez, Arapostathis and Marcus in [8,9]. Albright [1] showed for a two dimensional problem that both the optimal cost and the optimal poiicy are monotone functions of the current information state. Kumar and Seidman [16], in their work on the one-armed bandit problem, showed the existenceof a function whose graph, called the " boundary between decision regions", divides the state space into regions in such a way that in each region only one decision is optimal. Sondik [26,27), showed for the strictly partially observed (PO) case that the infinite horizon, optimal cost function is piecewiselinear whenever the associated optimal policy is finitely transient, and provided an algorithm to compute the optimal cost and policy. Recently, Sernik and Marcus 124,251 showed for a particular replacement model, that the infinite horizon, optimal discounted cost is piecewisclinear even when the associated optimal policy is not finitely transient, and provided explicit formulas for the computation of the optimal cost and policy. As is mentioned in [16], for the type of problems in which we are interested (and more generallyfor Bayesianadaptive control problems, see[16]), writing the Dynamic Programming (DP) equation for the optimal expected return is relatively simple, but obtaining its explicit solution is extremely difficult. In this paper, our objective is to show that the same procedure applied by the authors in [24,25] to a two dimensional replacement modcl, can be used to obtain similar results to those of [24,251 for other applications. There are several interesting decision problems that are naturally modeled using two states. For example, see [1] (advertising model), [12] (internal audit timing), U6l (one-armed bandit problem), [18] (optimal stopping times), [19] (equipment checking and target search),[28] (inspection of standby units) and [13,20,25,29,371(quality control, replacement models). The results to be presented below provide exact solutions for most of these problems. In addition, these results can be used to obtain additional insight into the structure of each particular application (cf. [25, examples 1-4]), and to develop theoretical insights in more complex models. This work is organized as follows. In section 2 we analyze the replacement model studied, among others, by Ross [20], White [31], Hughes [13] and Wang [29]. Section 3 is devoted to the analysis of the inspection model of Thomas, Jacobs and Gaver [28]. In section 4 we present other applications for which the approach employed in sections2 and 3 can be used. In section 5 we provide some examples.Section 6 consistsof conclusions.

E.L. Sernik, S.I. Marcus / Discrete time Markou models 2. Markovian replacement models

473

Consider a machine that produces items and which can be in one of two states, "good" or "bad". Accordingly, if { x,, t:0, 1,...} represents the state of the machine,we have x, X: {good, bad} = {0, 1} with X the state space.Suppose that the machine deteriorates with use and that there are three actions available to the decision maker: keep the machine producing, inspect it, or replace it. We with z, e U = {produce,inspect, denoteby {u,, /:0, 1,... } the control process, replace) : {0, 1,2}. Here "produce" stands for "produce without inspection", and "inspect" refers to "produce with inspection". The observationprocess,{1, t:7,2,...), takesvaluesin Y= {0, 1}. Since the state of the machine is only partially observed(PO), this problem is converted into an equivalent completely observed (CO) Markov Decision (MD) problem (see, e.g., [5, chapter 3], and [8,9,31,32]),in which the conditional probability distribution (also referred to as the information vector) n(t): (1 p(t), p(l)) provides the necessaryinformation to select the control at time /. Here p(l) is the probability that the machine is in the bad state given previous (up to time l) and actions(up to time I - 1); see,e.g.,[5, p. 100].We observations will often denote n(l) and p(t) by n and p respectively,omitting explicit dependence /. on We will consider two different casesof observation quality of this PO problem: the completely unobserved (CU) case,where only two actions are available to the decision maker, namely U: {0,2}; and the closed loop (CL) case, with U: {0, 1,2}, in which there are no observationsduring production, but costly perfect observations are available during inspection. These casesare also of interest since they provide upper and lower bounds for the optimal cost function in the PO case (see[3,33]). Of particular importance for the contents of this section are the results concerning the structure of the optimal policy associatedwith the models considered here. Ross ([20, theorem 3.3]) and White ([31, theorem 6.1], andl32, theorem p.2a0l gave sufficient conditions for the optimal policy to be characterized,as a function of p, by three numbers pi, i : 7, 2, 3, 0 < pr ( pz ( pr ( 1, such that " it is optimal to produce for 0<p(pr and pr<p< pz,to inspectfor p, <p4pz, and to replace for p. <p < 1". That is, a structured optimal policy may consist of one, two, three or four intervals such that for all values of p in each of these intervals, only one action is optimal. We refer to these intervals as regions (i.e., inspection region, etc.). This characterization implies that one need only look within the class of structured policies when searching for an optimal policy, simplifying the computation considerably. We first consider the CU case. 2.1.CU CASE Ross [20], White [31], Hughes [13] and Wang [29] all considered slightly different versions of the problem described above. Namely, in Ross' model, if the

474

E.L. Sernik, S.I. Marcus / Discrete time Markou models

machine is replaced at the beginning of period /, then it is assumed that the machine will be in the good state at the end of that period (see[20, p. 587]), and therefore the state of the machine at the beginning of period I * 1 is determined by the transition probabilitics that govern the evolution of the process.In White's [31]model, replacementof the machine at period I placesthe production process in the good state at the next decision epoch (see [31, p. 236]). Hughes' model requires that thc replacement decision be made following an inspection, and made strictly dependent upon the outcome of the inspection (so that only two actions {produce, inspect-replace} are being considerecl);also, the replacement action at period I implies that the processis in the good state at time / * l. Wang [29] studied severalvariations of the two-action, CU problem, including that of Ross and that of Hughes but with the replace action placing the machine in the good state at thc end of the sameperiod in which the action is taken. The results to be presentedapply with minor modifications to all these models and hence, without loss of generality,we will work with Ross' model in [20]. We now complete the description of the model. Each control action has a cost associatedwith it, as follows: the cost associatedwith replacing the machine is given by R independentlyof the underlying state; if the machine is in the good statc, the cost of production is 0, and it is C if the machine is in the bad state.We assume0 < C < R (see,e.g.,[20,31]). Also, the processevolvesaccording to transition probabilities p,r(2,) defined p(u,), by p,i(u): Pr{xtt r: jlx,: i, u,: D).Thetransitionprobabilitymatrices u, U, with entriespir(u,) are:

p(o):[r-e
t o

o]
ll'

p( 2 \:ltr tzt:j- "e o

ol "o]'

I:o' l' 2""'

( l)

where d is the probability of machine failure in one time step. To avoid trivialities we assume0 < p < 1. Assume that the initial probabilities z'(0): (Pr{xo:0}, Pr{xu:1)) are given.We are interested the infinite horizon case.Let D*=(Xx in U)* be the space of infinite sequences with elements in x x U, and let D be the Borel o-algebraobtained by endowing D* with the discretetopology. Then, for x,e X, u,cU, /Nu{0}, a {xs. us,x1, 111,...}e D- represents realizationof the state and action sequences. The problem is to find an optimal admissible control policy that minimizes the expecteddiscountedcost .{(z), given by:

rui): t:liol',{,,, ,))

( 2)

where Ef ['] is the expectationwith respectto the unique probability measureon D induced by the initial probability distribution n and the strategy g (see [6, pp. 140-1a4l); B is the discountfactor,0 < B <1; c(x,, u,), a(measurable) function c: X x [./-- R, is the cost accrued when the machine is in state x, and action a, is selected; and g: {g,}8, is an admissible policy, that is, a deterministic se-

E.L. Sernik, S.I. Marcus / Discrete time Markou models

475

quence, since its computation does not involve any observations, and thus it can be computed off line. In the two-action, CU case, g, { }2, can also be written as a sequence Borel measurable of maps g,: [0, 1l X [0, 11-- U, such that u, : g,(n(t)), u, e U, for / : 0, 7,2,.. . . For the remainder of section 2 we take advantage of the fact that the conditional probability vector zr is characterized by the scalar conditional probability p, so that the optimization problem(2) is reduced to a scalar one (see,e.g., 120, p. 5901' or [31, p- 846])- Let vp(p) : infuJr(il. Then (see, ".g., vp(p) is the expectedcost accruedrnihen optimal policy is 15,8,13,20,25,29,371) an selected,given that the machine starts in state l with probability p, and future costs are discounredat rate B. Also, it is well known (see[20]) that vB(p) is the unique solution of the functional equation:

vu ( r):

mi n {cp + B vp (r (p) ) , R + pvr ) @),}

( 3)

where 7(p) is the updated probability that the machine is in the bad state.and is given by T( p): p(l - 0) + 0. Remark 2.1 In White's [31] model, Va(D is the unique solution of:

u u (t):-i " {

cp + B vB (l:( p) ) R+ pv.@) } . ,
satisfies:

( 4)

In Hughes' [13] model, VB( il

v o(n): mi n{c r(p)+ B U B ( r ( p ) ) , p ) ( c + n ) + r + F v B @ ) } , r(

(5)

with 1 the cost of inspection.For one of wang's models (see [29]), vil p) satisfies:

v o ( r ) : m i n t c p * p v . e ( il) , p ( c+ n ) + r + Bvp ( o ) )

(6)

Hence, the results to be presented next for Ross' model are obtained for the other replacement models described above in a straightforward manner. Remark 2.2 ( p) has the same form for all the replacement models mentioned above, since in each of them the state of the machine is not observed during production. As mentioned in the introduction, the two-action, cU problem was presented in [25]. To make this work self contained, we list below the properties of the model that result in the piecewiselinearity of the optimal cost function: (P1) Z(p) satisfies T(p)>p for,all pe[0, 1).Similarly, T-'(p):(p-0)/(1 - d), 0 < 0 <I,satisfies T-,(p) < p for all p e.[0, 1). In addition, p : 1 is the unique fixed point of both maps (p) and f t(p).

476

E.L. Sernik. S.I. Marcus / Discrete time Markou models (b - a)/

(P2) For 0 < a < b < 1 and 0 <0 < 1, we have that Z-t(U) - T-t(o):

(r*0 )>(b -a ).
(P3) As mentioned above, it follows from [20, theorem 3.3], and [31, theorem 6.1], that when only two actions are considered, the structured optimal policies associated with the model of this section are o'produce for all peI},1]", and "produce for pe[0, p*] and replacefor pe(p",1]" (the latter is referred to also as a control-limit policy). Also, Ross ([20, theorem 3.4a]) gave necessary and sufficient conditions for the policy "produce for all p e [0, 1]" to be optimal (this is the casewheneverR >- C/(1 - P(1 - 0))). Thus, as far as the optimal policy is concerned, the problem reduces to finding the control-limit p*. (P4) Following Sawaki'snotation ([23, p. 116]), define the subintervalsof [0, 1]: E: {pe [0, p*]: T'(p)e(p*,1]], lN, (1)

where r'(p)=T(Ti'(p)), i>-7, To(p)=p, and 71( il:f(il:pQ0) + 0. Observe that the subintervals E, satisfy the recursion E,: { p e i>-2. In addition, from (P1) and the continuity of [0, l): T(p)eE,_r), Z(.), the Ei, i> 1, are convex disjoint intervals, and in Sawaki's terminology, they constitutea partition, call it E, of [0,7)([23, p. 113]). Now, from (3) it is clear that VB(p) l1o*.,1, restriction of the optimal cost the function to pe (p*, 1] is constant. Based on properties (P1) through (P4), the piecewiselinearity of the infinite horizon, optimal cost function Vn(p) is proved in [25] by showing that the restriction of VB(p) to each interval E, is an affine function of p, and that there is a lower bound, greater than zero, for the length of the E,'s. These observations imply an upper bound on the number of line segments describing Vilp) lro,o.l,givingthe desired result- We refer the reader to [25] for details. In the sequelwe will make use of properties (P1) through (P4). Once piecewise linearity is established, formulas to compute the optimal cost and the control-limit p* can relatively easily be derived by following (3), and using inductive arguments. These formulas are given in [25], and for the sake of brevity will not be repeatedhere. Remark 2.3 Wang's work [29] was aimed at showing that control-limit policies were optimal for the two-action, CU model. In addition, Wang also provided analytical expressionsfor computing the optimal cost and the optimal policy for this problem. Although his results could be used to show the piecewiselinearity of the optimal cost function, Wang did not do so. Wang studied a more general model for the two-action, CU problem than the one treated here, but he did not consider the three-action, CL model (cf. next section). In later work [30] Wang extended his results to higher dimensional (greater than two) models. These results could

E.L. Sernik, S.I. Marcus / Discrete time Markou models

477

be taken as the starting point for extending the results stated below for the three-action, CL model to higher dimensions. Remark 2.4 Lel J * 1 be the number of line segments describingVp(p) lru,o.r. Wheneverp* does not belong to the set of points {0, T(0), T2(0),...,Tt{0;i. the optimal policies "produce for pe [0, p"] and replace for pe (p*,1]" arc finitcly transient, as defined by Sondik in [26] and [27] (see also [22]), and consequently the associated optimal cost function is piecewise linear- In those cases, the expressions for the optimal cost and policy derived in [25] provide the same results as those that can be obtained by using Sondik's algorithm. When p* {0, T(0), T'(0),...,T1(0)}, the optimal policy is not finitely transient.The reason, in Sondik's terminology, is that the intersection between the set of points where the optimal policy is discontinuous,and the set of "next values" for the information vector, is never empty (cf-125, example 3], or example 7 in section 5 below). However, the optimal cost function is still piecewise linear; we refer the reader to [25] for details. The optimal cost function can still be computed using the equations derived in [25]. If Sondik's algorithm is used in this case, the optimal cost and policy will be found if the initial guess for the degree of the approximatton ([27, p. 2921)is smaller than the actual number of line segmentsin the optimal cost function. The expressionsderived in [25] to compute the optimal cost and the control-limit p* are particularly attractive for sensitivity analyses with respect to the parameters of the model, since they do not involve any iterative procedure. Remark 2.5 The piecewise linearity of Vp(p) can also be obtained by analyzing the value iteration algorithm ([20, eq. (4)]) used to solve eq. (3), namely:

v|( r):

v i (p):
,ht \

-i " { t .

C p ,R } ,
(8)

-i "t

p c p + B v ,{- ' ( r ( ) ) , R + B v ; - 1 ( 0 ) } ,
^ nrrn-ltm/ \\

sincc from the theory of contraction mappings it is guaranteed that algorithm (8) convergesuniformly to the unique solution VB(p) of (3) as t --) oo (see, e.9., f77, theorem 3.4.1]). Let p'denote the control-limit at iteration n, arrd for each n for E'= {pe[0, p']: T'Lp)(p",11], tN. which pn<l define the subintervals Then, for each such n the E! constitute a partition, say E', of [0, 1], and the piecewise linearity of the cost function follows from, e.g., [23, lemma 1 anc corollaryl, since the partition E, as defined in (?, is the limiting partition of (p" p* as n m) of the sequence partitions {.E"}; seel24ltot details. -

478
Remark 2.6

E.L. Sernik, S.L Marcus / Discrete time Markou models

Note that in Hughes' model, and in Wang's model (cf. eqs. (5) and (6) respectively), VB(p) | <p.,rtis not a constant; however, since the optimal policy in each of these models is also a control-limit policy, and since Z(p) is the same as that in Ross'model, properties (Pl)-(P ) hold with minor modifications for each of these models. The final effect of these differences is that, while for Ross' [20] and White's [31] model VB(p)l<o*,t is constant, it is an affine function of p in Hughes' and Wang's models. 2.2.CL CASE The model in this case is that of section 2.1, with inspection belonging to the set of admissibleactions, i.e., U = {0, 1,2} = {produce, inspect, replace}. Also, we have P(1) : P(0), and the cost associatedwith inspecting the machine is given by 1, independently of the underlying state (another possibility is to specify the inspection cost as the production cost plus a fixed fee, see e.g., 132,p. 2361; the results presented below can be obtained in a straightforward manner for this model as well). Now we assume0 < C < .I < R (see, e.g., 120,p.5901, or [31, p. 846]). Observe that now D- : ( X X U x Y)* is the space of infinite sequences with elements {0,7,2}, and that for x, e X, u,e U, l,e Y and / N U {0}, in uo, !t'-.-,ut-7, !t, x,, 111, a !t+t, xt+.rr..'} e D- represents realizationof {xg, the state, action and observation sequences.Also note that g: {g,}Lr, an admissiblepolicy (with g,, t :1,2,..., Borel measurablemaps as described in section 2.1), is not a deterministic sequenceanymore since now the computation of n(t) involves the observations available up to time l. The model is the same as that considered by Ross in [20]. Following the notation of section 2.7, the functional equation associatedwith the cost function in this case is given (see [20]) by:

v u( : n)

c p + B v p ( r ( p ) ) I + B p v p ( t+ B ( t - p ) v u @ ) , , ) -'n{ R + pv p(o))
R} ,

(e)

value iteration algorithm can be written (see[20]) as follows: and the associated vi(p):mir{Cp,I,

v f* '(p) : * i n{ c p + B\ { ( r ( p ) ) . I + P p v ; 0 ) + B ( 1- p ) v { ( 0 ) , R + pv {(0) }
Remark 2.7

(1 0 )

White [31] also studied the three-action, CL problem described above, and as in the two-action, CU case,his model is slightly different from that of Ross, the

E.L. Sernik, S.I. Marcus / Discrete time Mqrkou models

479

differencebeing the sequence which the eventsoccur, as explainedabove. For in White's three-action, model the optimal cost function satisfies functional CL the equation:

v u(n):

C p+ B v B ( r ( p ) )r, + B r ( p ) v p ( l) B ( l - r ( p ) ) v u ( o ) , + -i n{ n + B z B (o)). (1 1 )

Hence, the results to be stated below can be obtained in a straiehtforward manner for White's [31] model. The analysis of (9) gives, in this case also, the piecewiselinearity of the infinite horizon, optimal cost functionVB(l)- The problem is more complex now than in the two-action, CU case since there are more admissible actions available, thus more structured policies to consider, and p now depends on the observations. However, from equation (9), and the results of Ross [20] and white [31,32] on structured policies mentioned at the beginning of section 2, it is clear than an analogous set of properties (P1)-(P4) can be associatedwith the model of this section as well. Unfortunately, necessaryand sufficient conditions (expressedonly in terms of the basic data of the problem) for a policy structure with two, three or four regions to be optimal do not exist (necessaryand sufficient conditions to have the optimal policy " produce for all p e lO, 1]" in this case are exactly as those for the two-action, cU problem; cf. (P3)). Hence, each of these policy structures has to be analyzed individually. 2.2.1. Three-regionstationary optimal policy structure In this section we assume that the optimal policy structure is of the form "produce for p e [0, p,], inspect for p e (pv pzl and replace f.or p (pr, 1]". Also, following the notation of the previous section, define the intervals of [0, 1]:

4: { p e l o , p rl : T '(p ) e ( pr , pr 7} , i N.

( 12)

Note that the {'s satisfy properties analogous to those satisfied by the E,'s in (P4). Observefrom eq. (9) that VB()l(o,,p,l iS affine and VB(.)l1o,,r1 ZB(1))is (: constant. Once the optimal policy is assumed to have three regions, properties (P1) through (P4) are enough to characteize Vp(p) lro,o,t.We state this in the following proposition.
PROPOSITION 2.1

Assume that there is a stationary optimal policy structure with three regions for the CL model described above (three actions, i.e., U: {produce, inspect, replace), with the state of the system CU during production but CO during inspection).Then, Vp(p) lro,o,t piecewiselinear. is

480 Proof

E.L. Sernik, S.I. Marcus / Discrete time Markou models

From [20], lemma 3.2, it is known that every optimal policy "produces for p eI0,01". Therefore,we need to consider only two cases: (i) The case where I < h can be treated as in the two-action, CU model in [25], because once it is assumed that the stationary optimal policy structure has an inspectionregion, we have that for pe F1, vr(p):Cp+ BvBQ(p)) since it is optimal to produce. But for p a F1, T( p) e (pr, p.r), and VB(f( .)) is affine. Since the properties of the {'s are the same as those of the {'s in (P ), the remainder of the proof proceeds exactly as that for the two-action, CU problem (see [25] for details), to conclude that there is a uniform lower bound, greater than zero, for the length of the line segments.This in turn implies that there exists an upper bound in the number of line segments describing Vp(p) | p,n,1. (ii) When pt:0, the proof differs slightly from that in [25] for the two action, CU problcm, since now there are two different cases to consider, depending on the relationship between p' and pz, &Sfollows: (a) If pr:0 and Z(pr) 1pz,we have that the optimal cost associated with the stationary optimal policy "produce f.or p = [0, pr], inspect for p e (p,, pr], and rcplace for p e (pr, 7f", is given by:

n (ro * B tzu () | ( o,, p c [0, pr ], o,r pe( pr , pr ,f ttu (i l :\ru f p )|1 0 ,.r,1 ,

(1 3 )

I nu(r)

p e ( p r ,tl.

To seethis, note that the situation is similar to that in (i) since for p e [0, p,]it is optimal to produce and z( p) e (pt, pzl,hence vBQ-()) is affine. Thus, vp(-) | ro.u,t is affine, and Vp(') is piecewiselinear. (b) If pr: 0 but 7n( >- pz, then the optimal cost associated pr) with the optimal policy described in (a) above, is given by:

-., vo(p):(-.,\, 'F'r'l

t) (r, * B vu ( | ( o,, p c [0, y] , o,r pe( r ,p1) , l cp +p vp (l ) I tzu (t)

lvp(p)l(o,,o,r

_, r pe(h,pz7,

(14)

p e ( pz,tl,

where 'y = T- t(pr) < pr. In this case,for p e. (y, pr], it is optimal to produce and T(p)e(pr, 11, so that VBQ(.)) is constant.Thus, V^p) l(y,p,t.:Cp+ BVp(\). The result follows now by observing that for p[0, y] we have the situation described in (a). Finally, note that since 7(0):0 is continuous, and (.) Vp(p) | 10.o,1 cannot be of the form Cp + BVBQ) f.orp e [0, pr] becausethat would imply that there is no inspection region in the policy structure, contradicting our hypothesis. tr

E.L. Sernik, S.I. Marcus / Discrete time Markou models Remark 2.8

481

As in the two-action, CU case, the piecewise linearity of the optimal cost function for the model of this section could have also been obtained using the value iteration algorithm (10). However, the analysis becomes cumbersome because now there are severalpolicy structures to consideroand since necessaryand sufficient conditions for their existence are unknown, policy structures that are not optimal for some finite horizon cannot be eliminated as suboptimal for the infinite horizon problem. We will return to this point in section 5, example 3. Before considering the case in which there are four regions in the stationary optimal policy structure, let us show how the piecewiselinearity of VB(p) can be used to find analytic expressions(depending explicitly and only on the basic data of the problem) to compute the optimal cost and policy. From proposition 2.1 there is a finite number, J + 7, of line segmentsdescribing vB(P) lro.o,l. From (9), Vp(p) l(o,,o,tcan be written as VB(p)|t10,,0,1: + I + B(1 - p)vBQ).Also from eq. (9), vp(I): n+ BvuQ). Thus, we obBpvBQ) tain:

vu ( i l l ( o , , : pr[ ( p- 1 ) vB0 ) R] + lr + vu \) - Rl. + o,


From eq. (9), and for p e 4, we obtain from (15) that: vu( n) : Cp + Bvp(T( p)) | <o,,o,t, p e Ft

( 1 s)

: p tc + B (1- d)[(B - t ) z B ( 1+ R] ] ) * * {B I r + v B \)- n ] * B ( 1 - ( 1- d ) ) [ t B - 1 ) v p ( 1 ) R] ] ,


p e Ft. Next, for p e F2, since T( p) e Ft, one obtains from (9) and (16) that: vu( r) : Cp + Pvp(r( p)) | to,, o,t, p e Fz (16)

: , { r I B , ( r- q i + p ' ( r e ) ' [( B L ) vp ( r ) . ^ ])

{.i

-(1 p,F - d) ' ) F' [ + v BG )R] * r


], p e Fz.
(1 7 )

+ p '(r- (1 - d )') [( B- 1)zp( l)+ R]

Using induction it is straightforward to obtain a recursive equation for all the J * 1 line segments in [0, pr]. Denoting by F'( p) the ith line segment, one

482 obtains:

E.L. Sernik, S.I. Marcus / Disuete time Markou models

P '(p): p

- o {''i,u'u) i + B ' ( 1- q ' [ ( p - l) v B ( L+l n ] ) - {'i u'('( 1 o ) ,)+ B,[r+ vo \)- a ]

+ B ' ( 1 ( 1- a ) ' ) [tB t ) v . \ ) .


peF,,i:1,...,J+7,

^l) ,
(1 8 )

w h e re f, : ( a y p r), F t* t: [0, a" r], F,: (a,, a,_r], i : 2,. .., J, and a,, i : 1 ,..., J , g i v e s th e i n te rs e c tion betw een adj acent l i ne segments P ' (.) and P ' * t(.). It can be easily shown, using (18), that:

c (1- (1- o ) ' )* ( B - r ) [ / + v B \ ) - R]


Qi:

(1 - 0 )'{ -c+

+ [( F - 1) vp( 1,)R][1 - B( 1- d) ] ]

- (r - 0)'{-c + ' [(B- 1)vB\)+n][r -B(1 -d)]]


i:7,...,J.

+ [(B- r)ru(r) n][(B-r) + (r -B)'(r-B(r -a))]

( 1 e)

Note that we also have a,:T-'(p.), i:1,...,,I. Now, from eq. (9) and the definition of the {'s, ZB(1) is computed by VBQ):R+ BPr@), since the ("/ + 1)st line segmentis specifiedin [0, ar], and qt < 0.Using (18), we obtain: t-r I

v u 1 ) : l n + B c eL p ' 0 - 0 ) ' + p ' *' ( r - ( r - 0 ) ' *' ) ^ * B' *' ( 1 - R)


I i: o
J- l

+c I
j :r

B th- n

- 0)')l/17 - p'*'(r (r- d)'*')(B - r)

\20) Observe that the only unknownin (20)is J. SinceUu{n)l(o,,o,t givenby (15),it is is only left to find "/, p, and pr. The control-limitsp', pz are found by comparing vu{d l(o,.o,l with Pt(p) and ZB(1), respectively, obtain: to

-P tttf,

pe (p r, 1 ] .

(2r)
and

R-1
Pz :

(B -1 )r/p (l )+R'

(22)

E.L. sernik, s-1. Marcus / Discrete time Markou moders

4g3

In order to obtain the number of line segmentsdescribing vB(il|10,p,1, note that from the definition of the {'s, and the propertiesof the map (p)'a;f. (p1)), J+ 1 is the smallestinteger k such that T-k(p,) <0. Define:

A(k ) = R + pc l i
1- O k _1

pj 0 - 0 ) i + p o * ' (- ( r - d ) o * ' ) R r

+ c p L p,(1- (1 - o ) i) + B k * , ( r_ n )
J|

and

B (k) = 7 - B o*'(t - (t - 0)r* ' )(B - t ) - p o * ' . We stateour resultsformally in the following proposition.
PROPOSITION 2.2 Assume that there is a stationary optimal policy structure with three regions for the cL model consideredin this section (i.e, three actions, u: {proluce, inspect, replace), with perfect observations during inspection and no observations during production). Then, the number of line segments necessaryto completely describe VB(il1p.n,1, J + 1, is the smallestinteger k that satisfiesthe inequality:

1----l _ * (1 - 0 l *

1 (t - o \o (B - 1 )(r+ B 0 )A (k)+ ( p0R+ ( p - 1) ( 1_ R) ) B( k)

< o.

(23)

Furthermore, the optimal cost function and the values of pr and pz that characterizethe optimal policy structure can be computed as follows: (1) Find ,r from (23); (2) comput3 Vp(t).usinge+ (20); (3) compute VB(p)l(o,,n,twith eq. (15); (a) obtain pr and p, using eqs. (21) and(22) respectiv;lt; (s)"iffih the line segmentsthat describe vp(p) l1o,o,1 using eq. (18), and their associated intervals with eq. (19). observe that eq. (23) depends only (and explicitly) on the basic data of the problem (and it is not a function of the optimal cost evaluated at some particular value of p), hence the computation of "/ is straightforward. once -/ has been obtained, one only substitutes J, c, R,0 and B into formulas (15), and (1g) through (22); that is, no iterative procedure is required. We now analyze the casewhere there are four regions in the stationary optimal policy structure. 2.2.2- Four-region stationary optimal poticy structure As described above, a four-region policy structure consists of two intervals, p[0, pr] and pe(pz, p:1,d{ pr<pz<p:<1, forwhichisoptimal toproduce.

484

E.L. sernik, s-1. Marcus / Discrete time Markou moders

we will refer to the interval [0, pr] as the "first produce region,,, and to the interval (pr, pl as the "second produce region',. Since the analysis of this case is similar to that of the previous section, it will not be included here for the sake of brevity. We note thai the formulas obtained for the three-region, stationary policy structure case are not affected by the existence of a second produce region (note, e.g., that to compute vBe) and allthatisneededistoevaluatevB(r)at p:g,and,from[25, lemma l.,o,.o,r Yo^(-o) 3.2l,it is shown that 0 always belongs to the'?irst produce region). Furthermore, by following arguments similar to those in the pr"uiom section, we can prove that the restriction of the optimal cost function to p e (pr, prl, ,/o(pfh.is ^-,, piecewise linear, and that the line segments that descriuetrrJ opii'#ii"".or, function in this interval are computed in the same way as those describing the optimal cost function in the two-action, CU model. We summarize theseresults in thc following proposition. PROPOSITION 2.3 Assume that there is a stationary optimal policy with four regions for the three-action,cL problem under study (that is, u: replace), {produce, insp-ect, no observations during production, perfect observations during inspectionj. Then the infinite horizon, optimal cost function VB(p) is piecewiselinear. In addition, vp(p) and the control-limitsp,, i:7,2,3, which characterize optimal policy the structure can be computed following the procedure: (1) Find the number of line segments the first produce region, ,r + 1, with eq. (23); (2) compute zr(1) in usine eq. (20); (3) compute p, with eq. (27); (4) find VB(p)l(p,,o,tusing eql'ifSl; tSl next, compute the line segments describinE p) lro,o,r vB( (1g) and (19); '-'-'' "sitg "qs. (6) compute pr: rhis is given by (see [25, eq. (tZl}

p:: (1- B )v oQ)/c ;

(7) find K, the number of line segments in the second produce region, as the smallest integer k for which the folrowing inequality is nit satisfied: 1+ P'-1, t (7 - 0).

e4)

r + (t - B k ) v o \ ) n - c z j: i7 i( t - ( t - o ) i) c L j: ] B i( t - 0 ) ' - ( B - 1 ) v B 0-) R

\2s)

(note that e<aQ5) depends only on the basic parameters of the problem, since VB(I) was compured in step (2)); (8) compute pz; given by:

r + (r - pK )v p(l)R - c x f : llo ( r - ( r - 0 ) o ) ^ _ , c z f:i pk (t-o ) o _ ( p _ 1 ) v B 0 ) _ R

\26)

(9) compute the line segmentsdescribing VB(p) from [25, eq. (13)], these l(o,,o,1, line segmentsare given by:
i_1

F ,(p ) : cp L B o (r- 0) o+ Bkvp( l) +c I


k:o

i _7

Bo( t_ ( 1 _ 0) r ) ,
(21)

k:l

p e (d,, d,_,],

E.L. Sernik, S.I. Marcus / Discrete time Markou models

485

i:1,..., K, with the intersections between adjacent line segmentsgiven by: pt, dx= pz, and ([25, eq. (15)]) do=

a' , : 1 - - - - 1 - *

p) vp (1 ) ( 1-

(1 -0)'

c \-i l t

' r : 1,..., K 1.

(2 8 )

Observe that we have formulas for thc infinite horizon, optimal cost function and the stationary optimal policy structure for all the structured optimal policies of the type considered in this work, and associatedwith the replacement model under study (as mentioned above; see [20,31,32]). However, we have not found necessary and sufficient conditions to state,given the data of the model, which is the optimal policy structure.This does not representa problem though, since we can compute the optimal costs and control-limits p;, i:7,2,3, for all the structured policies, and selectthe optimal one as the one that gives the smallest cost. This comparison makes sensesince for each (fixed) set of paramcter values only one of the costs computed is the optimal one (recall that the functional equation has a unique solution), while the others are costs associated with particular (possibly nonstationary)policies,computed assuminga policy structure different from the optimal one. In addition, computationally the comparison of the different costs calculated here is reasonable,since obtaining VB(p) with the equations presented above represents a minimal computational effort compared to that required for solving the problem via, say, the value iteration algorithm (in simulations performed, the computer time has been approximately four orders of magnitude smaller, for the same computer and computer load). Finally, note that the formulas presentedhere give the exact solution to the problem, and not just an approximation, and so they representan easy way to obtain insight about the behavior of the systemwith respectto (say) uncertainties of the parameters of the model.

3. An inspection model for standby units In this section we analyze the model for inspection of standby units described in detail in [28]. The idea here is that a standby unit (maybe more than one) is installed to improve the reliability of the system.But the standby unit has to be inspected, and repaired if necessary, since it can go down even when not in operation, and this will causeit to fail to operate the next time it is needed.Thus, if inspection reveals the unit to be in an unsatisfactory state, repairs are made. The time when there is a need for the standby unit are called initiating events (see [28]), and if the unit is in a failed state when an initiating event occurs, then a catastrophic event is said to occur. Suppose that the standby unit can either be "up" or "down" when it is not in operation (i.e., the state spaceX= {.rp, down}). Let sn be the probability that the

486

E.L. Sernik. S.I. Marcus / Discrete time Markou models

unit will be up at the next time period given that it is up in this period, that n time periods have elapsedsince the unit was installed, and that the unit is not in operation. For the remainder of this section we assume that s,: s for all n, ,s> 0, meaning that the unit has a constant failure rate (see [28, p.265)). In each time period the decision maker can inspect the unit, repair it, or do nothing (i.c., the available set of actions is U: {do nothing, inspect, repair}). If the inspection finds the unit up, then no repairs are made. However, there is a probability (1 - i ) that the inspection is damaging given that the unit is up, and so with probability (1 - r) the unit is down immediately after inspection. If the unit is found down, a repair is attempted, which has probability r of returning the unit to the up state, and a probability (1 - r) of leaving it in the down state. An inspectionwhich finds the unit up takes M periods of time, while inspection plus repair takes N (N > M) periods. During any of these periods the unit cannot respond to an initiating event. If the decision maker decides to repair without inspecting first, the unit has probability r of being up immediately after repair, irrespectively of its state before repair, and again it is out of action for N periods of time. Also, an initiating event (i.e., one that requires the standby unit to come into action) occurs each period with probability b (that is, times between initiating eventsare independentrandom variableshaving common geometricdistribution; see [28]). Finally, there is a probability (1 - c) that the use of the machine will causeit to go down (cs is the probability that the unit will be up in the next time period given that it was used in the present period). As mentioned above, the model is the one consideredin [28]. Note that it allows for nonzero inspection-repair and repair times, and that it takes into account possible mistakes during inspection or repair. For a complete description of the model we refer the reader to [28]. 1'he objective considered is that of maximizing the expected number of periods until a catastrophic event occurs. The problem can be modeled (see [28]) as a pe [0,71J, p being the probability POMD process,with state space X={p: that the unit is up in the next period given all the past observationsand actions (observe that in this model there are no observations available to the decision maker when the action taken is to do nothing, but perfect observations are obtained during inspection and repair; hence, the model considered is a closed loop model). We follow the notation of [28]. Let V( p) be the maximum expected number of periods until a catastrophic event occurs, given that p is the present probability, based on the history of the system,that the unit is up. Then, Z(p) satisfiesthe functional equation (see[28]); v( p) : max{W,( p), W,( p), Wr( p)} , where Wr( p), Wr( p) (29)

and Wr(p) respectively correspond to the actions do

E.L. Sernik, S.I. Marcus / Discrete time Markou models nothing, inspect and repair, and they are given by:

487

,vr( ) : 1 + (1 - b )v('p ) + bpv( c) , p

( 30)

w , ( p ) : p [ ( 1 ( 1- u ) ' ) /r + ( 1- u 1 ' v1 i1 ]+1- r ) [( r - e - n ) N) 7 tt ( + ( 1- t ) ' v 1 , 7 1.


w,( p ): (r - (r - b )')/t * ( 1 - b) *v( ,) ,

(31)
(32)

where (1 - (1 - b)')/b is the expected number of periods to pass during an inspection of M periods before there is an initiating event. In order to simplify the comparison between the models of the previous sections and the one described hcre,let q:7 - p,i.e., q is the presentprobability that the unit is down given all the past observationsand actions. Then T(q) = t - sp:1-s(l - q): sq+(1 -s). Clearly,T-t(q):(q(1 -s))/s, s> 0. Also, define V(q) as the maximum expected number of periods until a catastrophic event occurs given that q is the present probability, based on all the previous observations and actions, that the unit is down. V(q) satisfies a functional equation similar to (29), with p replacedby q, tp replacedby T(q), and i, r and c replacedby 1 - i,7 - r and 1 - c respectively. addition, we make the change In of variable suggested [28], namely i(q): in V(q)-7/b, that is, t@1 is the expectedextra time until a catastrophicevent occurs becausethere is a standby unit ((1/b) is the expected number of periods until the first initiating event). Then, Z(q) satisfies:

/(q): with

mux{titr(q),wr(q) , f rr(q )} ,

(3 3 )

w,(q): (r - q)(1 + bv(l - c )) + (r - b )i(r(q )),

(3 4 ) {1s) (3 6 )

wJ i l :

(r - q)(t - D Mf0 - i) + q ( t - n ) ' t ( ' t - r t ,

w ,(q): (1- u)* t(t -,).

We point out the similarity betweeneq. (33) and the functional equation (9) for the three-action, CL replacement problem: - rtr(q) dependson T(q) and q just as VB(p) lro,o,l d-oes 7( p) and p in the on replacement model; Wlil is affine in 4 since V(l - l) and V(l - r) are constantsfor given i, r; W3(q) = Wz is constant. T(q) and ?"-r(4) satisfy T(q).q for 0<4<1. Clearly, and f t(q)<q the unique fixed point of (q) and T-t(q). Also, for 0<a<b<1, 4:1is .s> 0, T-t(b) - T-1(a) : (b - a)/s > b - a.

488

E.L. Sernik, S.I. Marcus / Discrete time Markou models

* It is possible to write a value iteration algorithm in order to compute I7(q; since the expression in (33) satisfies all the standard results of DP (see [28]):

/'(q ):

ma x{7 q,0,0} , - qXi + bt( t + ( 1 - b) v( r ( q) ) , ')) '" u ^((r (r - q) ( r - t) * v( t - i) + q( I - u) * v( t - r ) , (1 - b) 'v1t - 11),

tn +1 (n ;:

= max wf til, fr; (q), w{ (q)}, {

\ 3 7)

in whcre we have used Zo(q;:0 as suggested [28, p.263]. policy for this problem is a structured policy, and is given (see [28, - The optimal theorem 2.11)bV a control-limit q*, such that: one does nothing for q< q*; andrepair for q>q* if inspect for q>q* it t1l,-i)>(I-67n-u/0-r); -b)"t1t-11. v(t-i)<(1 Therefore, for the model described in this section one has a set of properties equivalentto properties(P1) through (P4) for the replacementmodels of sections 2.7 and 2.2. ln order to make even more clear the analogy between the replacement models of section 2 and the one being analyzedhere, we can Llsethe functional equation (33) to prove for the inspection model described above two results in the same spirit of [20, lemma 3.2 and theorem 3.4b], as follows.
LEM M A 3. I

(a) If the stationary optimal policy is "do nothing for 4 5 q* and inspect for q > q* ", 0 ( 4* < 1, then it is alwaysoptimal to "do nothing for all q ( min{l r ,7 - i)". (b) Similarly, if the stationary optimal policy is "do nothing for 4 ( q* and repair for q> q*", O < q* < 1, then it is always optimal to "do nothing for all q<1-r". Proof (a) Let z:min{7-r,1-r}. Then, from (33): Assume that it is optimal to inspect fot q<z'

i (q ) : (1 - q )(1- b )' v( t - t) + 4( 1- t) *v( t - r ) . N1 t M , -, * [(r - q )(1* r) ' + q( l - t) .lv( z)

. [ ( r - q ) ( t- u ) ' + s0 - r \*lv1 q 1 , q < zI

(3s)

Both inequalities follow because V1il i" nonincreasing in q ([28,.lemma ]_11). From (3U, t(q)(\ _ (1 - qX1 - b), - q(7 - b)") < 0, which implies that v(q)

E.L. Sernik, S.I. Marcus / Discrete time Markou models

489

< 0 since the quantity in parenthesesis less than 1 (it is equal to 1 only for b: 0, but in that case the standby unit is never used). On the other hand, from (33):

v ( 1 ) Q + n v Q - . ) X 1- r ) + ( 1- b ) v( 1:) ( t - u ) v( t) . >
That is, Ut\) ) 0, which implies (again for b * 0) that t1t1r,- 0. Putting these results together,we have V(7) >- V(q), wlnch is a contradiction since V(q) is nonincreasing q. The proof of (b) is similar, and is omitted- n in
LEMMA 3.2 It is never optimal to "do nothing" for q :1. Proof Let rl.,Q)be the cost associated with the policy "do nothing for all qe.10,7f". (37) one readily obtains that: Then, using
l"- 1 n-l

*(q):(t-q)

tim lf n- al* : o

b k c k + I ro (t
k:t

-b )o I

.r-l l ,.r

j:k\Kl

l' . lb rt -* t c t L -* t 1 . (3 e )
)

We recall that the limit in (39) exists since the expression in eq. (33) satisfies the axioms of the contraction operator approach to DP (see [28, p. 262]), and the value iteration algorithm (37) convergesuniformly to the unique solution of the functional eq. (33); see,e.9.,[17, theorem 3.4.11. Now, assumethat it is optimal to do nothing for q:1. Then, from (33) we have:

o>(1-u)',t(r-'),
a contradiction since ,1.,@):0 only for q:1(this can be easily verified: taking the limit in (39) for the first term in brackets gives 1/(1 - bc) > 0, while the limit of the second term in brackets is greater than or equal to zero since it is the sum of nonnegative terms). Hence, one either inspects or replaces for q: 1 (we exclude the case b :7, since in that case there is an initiating event every time period, i.e., the standby unit is used every period, and the problem becomes a replacement problem since there is no standby unit as such). n It is now clear that following the approach of the previous section, we have the following result. PROPOSITION 3.1 For the inspection model for standby units described in this section (i.e., U: {do nothing, inspect,replace},and the unit has a constant failure rate s > 0), the performance index V(q), the expected extra time until a catastrophic event occurs, is a piecewiselinear function of q, the present probability that the unit is down, given all the past observations and actions. Next, in order to develop equations for computingViil the followins: and qx, we point out

490 Remark 3.1

E.L. sernik, s.L Marcus / Discrete time Markou models

As mentioned above, from [28, theorem ].11. we know that the optimal policy inspectsfor q > q* it t1t - i) > (7 - 61<N-utte -r). since v(q) ononintreasing in q, and since N )- M, we have that if (from the data of the problemT 1-r>1-i,then:

w ,(q ): (1 - q )(t - b ) Mt( 1- i) + q( t - t) , t( t - r ) ,- (7 - q )(1 - b )Mv( t - r ) + q( r - u) ,t( t _ r y >-(1 - i l Q - b )' t( r - r ) + q( t - u) ,t( t _ r ) : (1 - o )N t(t - r )

:n ,

which implies that whenever 1 - r > 7 - i, one knows from the data of the problem that the optimal policy is "do nothing for q { q* and inspect for q> q* ", 0( q* <7. Therefore, in this case the equations for computing t1q7 and q* are obtained by following exactly the same procedure illustrated in section 2.2.1, where instead of eq. (9), one uses:

v (q) : max {(r bv (t +

on the other hand, if 1 - r < 1 - i, nothing can be said about which is the optimal policy since in that case: wr(q): (1 - q)(1 - b)M v(li) + q(l - o)N v(t - r)

- q ) + ( r - a )w, ( r ( q ) ) ,r f u , ( q ) } . ( + o ) ")X1

so no conclusion is possible. This means that in the case I - r < 1 - j one has to develop two sets of equations (again, following the procedure illustrated in section 2.2.1), one using eq. (40), and the other one using:

< (1- t)'v (t - r ) lV" "

t(q ) : ma *{(r+ b t(t -.) ) ( 1 - q) + ( 1 - b) A, wr \.

(41)

The optimal cost (and the associatedpolicy) is obtained by comparison (the same comments made at the end of section 2.2.2 apply here). Remark 3.2 One major difference between the model of this section and those of section 2 is that, while in the replacement models all the probabilities appearing in the functional equation (e.g., 0,1 in (9)) are known to belong to a specific interval (region) for each possible structured policy (e.g., it is always optimal to produce for p: d), in the inspection model of this section that is not the case. Further-

E.L. Sernik, S.I. Marcus / Discrete time Markou models

497

more, there is no explicit relationship between c, i, r and s; this means that several cases(see below remark 3.3) will have to be considered in order to find formulas tor V1q\ and q*. In addition, since tlqS is piecewise linear and is evaluated at q: c, Q:i and Q: r in the functional equation (33), blt T(q) dependsonly on s (and not on 7 - c, 1 - I or 7 - r), there is no way to know a priori which interval (and hence which line segment) has to be used to evaluate V(q) at Q: C, q: i and q: r (compare this to the replacementproblem, where it is known that 0 belongs to the interval where the "/th line segment is specified). Therefore, some extra work (described below) will be required to find the number of line segments describing the optimal cost function (as opposed to what happens in the replacement problem of section 2.2.7, where only inequality (23) has to be considered). For the remainder of this scction we assume(as in [28, example2.71)that c: t, meaning that the occurrence of an initiating event that finds the unit up is equivalent to an inspection that takes zero time. Remark 3.3 When i:c and 7 - r> 1-i, the optimal policy is to "do nothing for q(q* and inspect for q > 4* ". In this case,there are sevenpossible situations that have to be studied (for each assumptionon s, e.9., r<J< i, or r <i <s, etc.), as follows: (i) {1-i, 1-r,i,r}[0,q*]; (ii) {1 -i, 1-r}e [0,r1*l; {i,r;e (q*,7l; (iii) {1 - i,7 - r, r} e.10,q* l; {i } (q*,1l; (iv) {1 - i, r,l} [0, q*]; ll; (v) (1 - i, r} e [0, q*]; {i,7 - r} e(q*,11; (vi) {1- t} {1 - r).(q*, - r, r, i} e (q *, 1l; and (vii) {1 - r,7 - i, t, i} e (q *, 11. When l: c [0, q" ]; {1 but 1 - r <7 - i, similar possibilities (for each assumption on s) have to be (q",71being the inspectregion,and (4*, 1] being studied for each of the cases: the repair region (seeremark 3.1). It might be the case that some of these cases are nct possible at all (either becausethey contradict the properties of V(q), or becausethey make no sense with respect to this specific application (for example, in [28, example 2.1], the authors consider always the case s> max{ i, ,\), in which casenot all the above mentioned possibilities need to be considered. In any event, note that even if one obtains severalsets of formulas to test in order to find the optimal pair t1Q), Q*, this still represents a considerable improvement compared with having to solve the functional equation (33) using DP, not only in terms of computational effort, but also in terms of providing an easy and accurate method for investigating the sensitivity of the optimal policy with respect to the parameters of the model (a question often addressedin optimal and adaptive control problems; see e.g. [8,9,25],[28, examples2.1 and 2-21,129]). As an illustration of the process involved in order to obtain formulas for this problem, consider the following case. Suppose s > / > i and that the optimal

492

E.I-. Sernik. S.I. Marcus / Discrete time Markou models

policy is "do nothing for q < q* and replace for q ) q* " . Let T'(q) = T(T'-tQD. Then, T'(q): snq+ (1 - s'). Assuming that {1 - t} e [0, q*], - i, i} e(q*, 1l and that (1 - r) falls in the /th interval(the intervalwhere {r, 1 thc /th line segment is specified), one finds the following equations by following exactly the same procedureused in section 2.2.1:
k- 1

A o ( q ) :( r * b % ) L 0 - r i1 n 1 1-0u ) ' + ( 1- b ) r fr ,,
, 7: 0

k:7,...,

L+7,

(42)
describingWr(q); also, let 1, line segments

where Q-@) is the kth, out of L*

A 1 t<1 I1 :-Js/(1 b )'. T h e n: =


w3
q* :7

(t - t)' ,,[1t1
t - (1 - b)" '- (r - b )* b r. 4 (t )'
-------i-

\43) \44)

'

bIi/. 1+bw3

and one finds the number of line segmentsdescribing Wr(q), L + I, as the minimum integer k that satisfies: ro-(t-q*).0. (45) What makes eq. (45) different from its analog in the replacementproblem (e.g., (23)) is that there are two unknowns in (45): k and /. If we could express/ in terms of k, then we could find t * 1 from (45), replace it to find W., q" and treated.Instead,we have to Lf 1, just as in the previouscases Ai(q), j:7,..., follow an iterative procedure,as follows: (1) Assumel: k - 7. (2) Find L * 1 using (a5); if there is no (positive) integer satisfying (45), go to step (4) (i.e., reject this case);otherwise,continue. (3) ComputeW., with the value found for L* 1; store this case; continue. (4) Decrease/ by 1; go to step (2). Several remarks are in order: Remark 3.4 How does the previous procedure stop? We do not know L + 7, but from lemma 3.2 we know that q* < I. In addition, for typical valuesof the parameters of the model, q* is never close to 1. Thus, a heuristic method to stop the above mentioned procedureconsistsof taking a value of q closeto 1, (say) Q : 0.99, and repeatedlyapply f-1(.) to it. The first (positive) integer / such that 7n '(q) <G - s) gives an upper bound for L, as long as q* < {. Then, the previous procedure has (for the case analyzed here) r iterations. As will be seen in examples 4 and 5 in section 5 (see also remark 3.5 below), once / has been found, some extra analysis on the parameters of the problem may reduce the required computation.

E.L. Sernik, S.I. Marcw / Discrete time Markou models Remark 3.5

493

If one wants to find the formulaswhen s > r> i, {1- r,7 -i } c [0, q*], and {i, r) e(q*,11, then eqs.(42), (43) and (44) are replacedby:

-,)) I 0 - r'(q))(t u),+(r - b).fr,, Q-(q):(r * bv40


J :\)

k-l

k:1,..., li/. :

L+7

(4 6 \

( 1 - b) ' ,A( t) r- h i A (m)- b (r- b 1 * *^r A( /) ( 1 - b) ' *^ + bi( l * b) ' ' ^.4( *) '
\47) b(l - biA (m))fi,
Q ": \ -

i +r(1 -b)^fr r '

(+s)

respectively,where we have assumed that 1 - r falls in the /th interval and 1 - r falls in the zth interval. We make two observations; (i) The procedure to find Z + 1 in this case is: (1) Assumel: k - 1, m: k - 1. (2) Find 1-* I using (45), with q* given by (48); if there is no (positive) integer satisfying({5), go to step (4) (i.e., reject this case); otherwise,continue. (3) Compute W3 (in ( 7)) with the value found for L + 1; store this case; continue. (4) Decrease / or m by I, keeping in mind that from the assumptions we have 7-r< 1-l,or l>m; gotostep(2). (ii) In this case, with I the upper bound for Z computed as before, there are /2 iter ations (i.e., (/: k-7, m:k-I),...,(l:k-I, m:k - t+ 1), (l:k-2, m: k-2),...,(l: k- t, m: k - l)). The readeris aware that this is the worst casescenario,and if t such that tz is absurdly large, then the procedure suggested above may not be an alternative to the value iteration algorithm (37) in terms of the computer time required to obtain the optimal cost and the optimal policy. As mentioned in remark 3.4, for typical values of the parameters of the model, the simple procedure suggestedhere is several orders of magnitude faster than the value iteration algorithm. In addition, some analysis reduces the number of iterations in the procedure considerably: in this case for example, since 1 - r < I -i, and {7-r,1-r} e [0,4*]. if f'(1 -r)>7-i,i <r. oneknows that the intervals I and m are such that / - m < 7, which implies that the above procedure has (7 + 1)1(< r2) irerations. Examples 4 and 5 in section 5 illustrate these ideas.

494 Remark 3.6

E.L. Sernik, S.I. Marcrc / Discrete time Markou models

The iterative procedures suggested above for the two cases considered (and similar ones for the casesmentioned in remark 3.3) give the (optimal) number of line segmentsdescribingrTrG) by comparing all the values obtained for lit: the number of line segmentsassociatedwith the largest \il, equals L + 7. Recall that V(q) is the unique solution of the functional equation (33) (see [5]), and so the number of line segmentsdescribing WJq> is also unique. This means that if as a result of the computation one obtains two (or more) times the same value of \il, for different valuesof L'l l, then one has to compute q* and Q", k: 1, . . . , L + 1, for each of these cases: only one would give consistent results (the others will have e.g., negative length intervals, or line segmentsspecified for q < 0, etc.). Remark 3.7 Obtaining formulas for the cases specified in remark 3.3 is not complicated. e[0, q*], Noteforexample,thatif ,s</,<i (ors <i<r) and(say) {7-r,1-t} then eqs. (45) through (48) still hold: what changesis that the iterative procedure will have at most 4 iterationssincein this case1 - i < 7 - r <1 -s (or 1 -r< 1 - i < 1 - s) so that 1 - r and 1 - i fall either in the (Z + 1)st or the Lth intervals (if one considers {1-, } c [0, q*], then there are at most 2 iterations,etc.). Therefore, once formulas for the cases mentioned in remark 3.3 have been obtained, one finds the optimal cost /(q) and the optimal value q* (for a given data set) by comparing all possible solutions. The same observations made at the end of section 2 apply here. Also, as we said before, these results represent a considerable reduction in computational effort, and in addition they permit one to easily obtain information about the behavior of the system by e.9., performing sensitivity analyseswith respect to the parameters of the problem. In the next section, we consider other decision problems to which the ideas presented in sections 2 and 3 could be applied.

4. Other applications

In this section we briefly consider other decision problems to which the main ideas presented in the previous two sections could be applied. In particular, observe that several of the models considered below do not involve discounted costs; nevertheless, the techniques presented in sections 2 and 3 can still be applied. The objective is to show that piecewise linearity of the cost function (and perhaps formulas to compute it) could be obtained in some cases of each of the following applications.



4.1. MAINTENANCE MODEL UNDER MARKOVIAN DETERIORATION


This problem is treated in [11] by Hopp and Wu. The idea here is to study models in which maintenance actions may not return the system to a "good as new" state (as opposed to a replacement action), and where the underlying state of the system is not directly observable. The authors in [11] study two cases: state-dependent maintenance (the action taken at each time period depends on the underlying state, which becomes known after maintenance is performed; see [11, p. 448]) and state-independent maintenance, and for both cases the authors present structural results concerning the optimal maintenance policy. We briefly introduce some notation used in [11]. S = {1, ..., n} denotes the state space, where 1 represents the "best" state and n represents the "worst" state. When the system is in state i and no maintenance is performed, it deteriorates to state j, j ≥ i, with probability p_ij. The Markovian deterioration matrix P = [p_ij] is assumed to have p_ii ≠ 1, i = 1, ..., n - 1, and to be such that Σ_{j=k}^n p_ij is nondecreasing in i for all k = 1, ..., n. Since the underlying state is not CO, the probability distribution vector π will denote the information state, i.e., the information available to the decision maker at each time period. A = {0, 1, ..., n} is the set of actions, where 0 represents "do nothing", and maintenance action a > 0 moves the system to state a with probability one. In addition, action a costs c(a), and for each period the system spends in state i, it produces a return r(i). The one period discount factor is β, β ∈ [0, 1). We refer the reader to [11] for more details on this model.

First take n = 2 (below we refer to the case n > 2), and consider the state-independent maintenance case. In this case, if f(π) is the optimal return over the infinite horizon given knowledge state π at time 0, then f(π) satisfies the functional equation ([11, p. 458]) given by:

f(π) = max_{a ∈ A} f(π, a),    (49)

where

f(π, a) = { Σ_{i=1}^n π(i) r(i) + β f(πP),   a = 0,
          { -c(a) + f(e_a),                  a > 0,

and e_a is the a-th row of the identity matrix. The similarity between eqs. (49) and (3) (or (4)) is apparent (if π = (1 - p, p), p being the probability that the system is in state 2, then the expression corresponding to a = 0 is affine in p, while the expressions for a > 0 are all constants). Furthermore, since the authors in [11] prove that the optimal policy is structured ([11, lemma 7]), we need only check the properties of πP, the updated information state. Since it is assumed that p_ii ≠ 1, i = 1, ..., n - 1, when n = 2 (and p_22 = 1) we get that the updated probability, also denoted here by T(p), is given by T(p) = (1 - p)p_12 + p, which satisfies T(p) ≥ p and T^{-1}(p) ≤ p as in the models considered in previous sections. Therefore, the fact that f(p) ≡ f(π) is a piecewise linear function of p (when p_22 = 1) can be proved following the approach of the previous sections.
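A direct way to see this structure numerically is to run value iteration on eq. (49) for n = 2 on a grid over p. The sketch below is ours, not Hopp and Wu's algorithm; the argument names (returns r, maintenance costs c, deterioration probability p12) are illustrative, and convergence of the iteration is taken for granted (in particular, c(a) > 0 is assumed so that repeated maintenance is never free).

```python
import numpy as np

def maintenance_value(r=(1.0, 0.2), c=(0.5, 0.3), p12=0.2, beta=0.9,
                      npts=1001, tol=1e-9):
    """Value iteration for eq. (49) with n = 2 and p22 = 1.
    p = probability of being in state 2 (the worst state)."""
    p = np.linspace(0.0, 1.0, npts)
    Tp = (1.0 - p) * p12 + p                  # updated state when a = 0
    f = np.zeros(npts)
    while True:
        do_nothing = (1 - p) * r[0] + p * r[1] + beta * np.interp(Tp, p, f)
        maintain = max(-c[0] + f[0],          # a = 1: jump to state 1
                       -c[1] + f[-1])         # a = 2: jump to state 2
        f_new = np.maximum(do_nothing, maintain)
        if np.max(np.abs(f_new - f)) < tol:
            return p, f_new
        f = f_new
```

Starting from f = 0 the iterates are monotone and bounded, so the loop terminates; plotting the fixed point exhibits the piecewise linear shape discussed above.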

In the state-dependent maintenance case the model is more elaborate (see [11]), and we do not intend to get into its details. Let us just observe that: (i) for n = 2, again one can obtain the piecewise linearity of the cost function by following the approach of sections 2 and 3; (ii) for general n, and assuming that it is optimal to perform maintenance when the system is known to be in state n (the worst state), the authors in [11] prove that the functional equation satisfied by the cost function represents a finite system of equations ([11, theorem 3]). Whether this could be used to find formulas to compute the cost and the optimal policy remains to be investigated. In this regard, we recall that for the replacement problem of section 2.1, Wang [30] provides formulas to compute the costs and the optimal policy (a structured policy) for an n-state (n > 2) model. Although in [30] the two actions considered can be applied in any of the states, these results could be combined with those in [11, theorem 3] to obtain specific formulas in the maintenance model described above.

4.2. OPTIMAL STOPPING IN A BINARY-VALUED MARKOV CHAIN

The optimal stopping problem for a PO binary-valued Markov chain with costly perfect information was considered by Monahan in [18]. At each time t the decision maker can either stop and accept the current reward, or he can reject this reward, pay a fixed fee and move to the next time period, when another reward will be considered. Since it is assumed that the true state of the Markov chain is not known, the information regarding the current state is summarized by the probability distribution (1 - p, p), 0 ≤ p ≤ 1, defined over the two states, with p the probability of being in the "good" state. The decision maker may also purchase information that indicates the true value of the current state. Therefore, three actions are available at each time period: stop (accept the reward), continue (reject the reward), and purchase information (test before proceeding). These will be denoted by Γ = {A, R, T}, respectively. The reward process evolves according to the (known) transition probabilities:

P = [ 1 - λ_0    λ_0 ]
    [ 1 - λ_1    λ_1 ],     t = 0, 1, 2, ...,     (50)

where it is assumed that λ_1 > λ_0 > 0.5. This simply implies that with higher probability the process will remain in (or make a transition to) the good state. The problem is to find a rule which will indicate the action to take based on the information available, so as to maximize the expected infinite horizon reward

V_δ(p) = E_δ[ Σ_{n=0}^∞ U(s_n, a_n) ]    (51)

over all nonrandomized, stationary strategies δ (i.e., V(·), the optimal cost function, is defined as V(p) = sup_δ V_δ(p), p ∈ [0, 1]), with p the initial state. Note that V_δ(·) is not discounted; from [18, p. 74], the existence of V(·) and an optimal strategy follow from standard arguments. U(s_n, a_n), s_n ∈ [0, 1], a_n ∈ Γ, are the single period rewards, given by U(p, R) = -c_R, U(p, T) = -c_T, and U(p, A) = p. We refer the reader to [18] for further details in the description of the model.

It can be shown (see [5,18]) that the infinite horizon optimal value function V(p), associated with the POMD problem just described, satisfies the following functional equation:

V(p) = max{ V(t(p)) - c_R,  -c_T + pV(1) + (1 - p)V(0),  p }.    (52)

Here t(p) = λ_0 + (λ_1 - λ_0)p is the probability of being in the good state in the next period, provided that action R is taken (we follow the notation in [18]). The similarity between this problem and the replacement problem in section 2.2 is apparent. Furthermore, Monahan [18] proves that with this stopping problem there are associated structured optimal policies, which can have 1, 2, 3 or 4 regions, just as in the replacement problem. He also shows the piecewise linearity of the optimal cost function (the proof is, however, not clear), and provides an algorithm to compute both the cost and the optimal policy. Although the transition probability matrix and the criterion for the stopping problem differ from those in the replacement problem, the approach of sections 2 and 3 can be used here to prove the piecewise linearity of the optimal cost function, and to develop formulas to compute the optimal cost and the optimal policy. These formulas (which we state below only for the three region optimal policy) provide the same results as those that can be obtained by using the algorithm developed by Monahan in [18].

Remark 4.1

It is important to note that, contrary to T(p) in the replacement problem, t(p) has a fixed point in [0, 1), given by p_F = λ_0/[1 - (λ_1 - λ_0)]. However, it is easy to show that it always falls in the accept region whenever the optimal policy has 2 or 4 regions (reject-accept and reject-test-reject-accept, respectively), and so the approach of sections 2 and 3 applies here. For the case in which the optimal policy has 3 regions (namely, reject-test-accept), whether the fixed point p_F belongs to the accept region depends on the parameters of the model and on V(0). However, we have not found an example in which p_F does not fall in the accept region, and the formulas shown below were developed under the assumption that p_F is in the accept region. Example 6 in section 5 illustrates the case of an optimal policy with three regions.
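The fixed point of remark 4.1 is trivial to inspect numerically; for instance, with the data of example 6 below (variable names are ours):

```python
lam1, lam0 = 0.95, 0.5                 # data of example 6 in section 5
p_F = lam0 / (1.0 - (lam1 - lam0))     # fixed point of t(p), remark 4.1
print(p_F)                             # ~0.9091, above alpha_TA ~0.8619,
                                       # i.e., inside the accept region
```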


The same observations made regarding the selection of the optimal policy in the replacement problem apply here. That is, for a given data set, one finds the costs associated with the policies having 1, 2, 3 and 4 regions (this is computationally inexpensive since it is done using formulas and not an iterative procedure), and compares them to select the optimal one. Also, as mentioned before, the savings in computation time plus the accuracy gained allow the performance of sensitivity analyses of the cost and of the policy with respect to the parameters of the model. For the case in which the infinite horizon, optimal policy structure has three regions (namely reject-test-accept), and following the same procedure illustrated in sections 2 and 3, one obtains that the formulas to compute the optimal cost and the optimal policy are given by:

V(p) = { W_1(p),   p ∈ [0, α_RT],
         W_2(p),   p ∈ (α_RT, α_TA],
         W_3(p),   p ∈ (α_TA, 1],          (53)

where W_1(p) is described by the line segments given by:

R^i(p) = -i c_R - c_T + R^N(0) + (1 - R^N(0)) t^i(p),   i = 1, ..., N,    (54)

with

R^N(0) = 1 - (N c_R + c_T) / [ λ_0 Σ_{j=0}^{N-1} (λ_1 - λ_0)^j ],    (55)

and

t^i(p) = λ_0 Σ_{j=0}^{i-1} (λ_1 - λ_0)^j + (λ_1 - λ_0)^i p,   i = 1, ..., N.    (56)

The intersections between adjacent line segments are given by:

u_i = [ λ_0 (1 - R^N(0)) (λ_1 - λ_0)^i - c_R ] / [ (1 - R^N(0)) (λ_1 - λ_0)^i (1 - (λ_1 - λ_0)) ],   i = 1, ..., N - 1.    (57)

W_2(p) and W_3(p) are given by:

W_2(p) = -c_T + R^N(0) + (1 - R^N(0)) p    (58)

and

W_3(p) = p,    (59)

respectively. The control limits between the reject and test regions, and between the test and accept regions, are given by:

α_RT = [ λ_0 (1 - R^N(0)) - c_R ] / [ (1 - R^N(0)) (1 - (λ_1 - λ_0)) ]    (60)

and

α_TA = ( R^N(0) - c_T ) / R^N(0),    (61)

respectively. Finally, N, the number of line segments in the reject region, is the smallest integer n that satisfies:

[ λ_0 z (λ_1 - λ_0)^n - c_R ] / [ z (λ_1 - λ_0)^n (1 - (λ_1 - λ_0)) ] < 0,    (62)

where z ≡ (n c_R + c_T)/t^n(0). Example 6 in section 5 illustrates the problem described in this section.
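The formulas above can be cross-checked against a direct value iteration of eq. (52) on a grid over p. The sketch below is illustrative only (grid size, initial guess and tolerance are arbitrary choices, and convergence of the undiscounted iteration is assumed rather than proved here):

```python
import numpy as np

def stopping_value(lam1=0.95, lam0=0.5, cR=0.05, cT=0.1,
                   npts=2001, tol=1e-10):
    """Value iteration for eq. (52); data as in example 6 of section 5."""
    p = np.linspace(0.0, 1.0, npts)
    tp = lam0 + (lam1 - lam0) * p                    # belief update under R
    V = p.copy()                                     # accepting at once gives p
    while True:
        reject = np.interp(tp, p, V) - cR            # continue one period
        test = -cT + p * V[-1] + (1.0 - p) * V[0]    # buy perfect information
        V_new = np.maximum.reduce([reject, test, p])  # eq. (52)
        if np.max(np.abs(V_new - V)) < tol:
            return p, V_new
        V = V_new
```

Up to discretization error, the fixed point should agree with the piecewise linear cost (84) of example 6 and with the control limits (60) and (61).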

4.3. RISK SENSITIVE MARKOV DECISION PROCESS

Now we consider the risk sensitive MD process studied by Gheorghe [10]. This paper formulates a model for a Markovian decision process with PO states in which the decision maker bases the decisions on risky propositions, and so his risk preference can be represented by a utility function that assigns a real value to each possible outcome. Again we consider the case where the (core) process {x_t, t = 0, 1, ...} can take two possible values. Let P(u_t) = [p_ij(u_t)] be the transition probability matrix, with u_t the control action at time t. Also, let {y_t, t = 0, 1, ...} be the observation process, taking two possible values. The observation and the core processes are related by q_jk(u) = Pr{y_{t+1} = k | x_{t+1} = j, u_t = u}, j = 1, 2, with Σ_{k=1}^2 q_jk(u) = 1 for all u. If π = (π_1, π_2) is the information state, and π_i is the probability of being in state i, i = 1, 2, given past observations and actions, then (see [5,10]) the updated probability T(k, π, u) that the state of the system will be the second one, given that outcome k was observed and control u was applied, is given by:

T(k, π, u) = q_{2k}(u) Σ_{i=1}^2 π_i p_{i2}(u) / [ Σ_{j=1}^2 q_{jk}(u) Σ_{i=1}^2 π_i p_{ij}(u) ].    (63)

Let c_ijk be the reward obtained when the system makes a transition from state i to state j and produces an output k after the transition (see [10]). Then, defining V(π) as the utility functional (for the lifetime of the process) if its current information state is π, we have (see [10]) that V(π) satisfies the functional equation:

V(π) = max_u { Σ_{i=1}^2 Σ_{j=1}^2 Σ_{k=1}^2 π_i p_ij(u) q_jk(u) e^{-γ c_ijk} V(T(k, π, u)) },    (64)

where γ is the risk-aversion coefficient, such that γ < 0 implies risk-preference, γ > 0 risk-aversion and γ = 0 risk-indifference. The problem is to find the optimal control policy to maximize the utility function V(π) over the infinite horizon. We refer the reader to [10] for a more detailed description of the model. Gheorghe [10] and Satia and Lave [21] propose branch and bound algorithms to solve eq. (64) (for γ > 0, γ < 0, γ = 0, and for γ = 0, respectively). In order to apply the branch and bound algorithm, upper and lower bounds of V(π) are required.


The upper bound is obtained by assuming that the states of the system are CO [10,21]. The lower bound is obtained by computing the reward associated with any reasonable (see [10]) policy. Specifically, the authors in [10,21] compute the reward associated with the control policy that makes the same decision for every period. Presumably this control policy is chosen because it is relatively simple to compute the utility function associated with it. One possible application of the risk-sensitive Markovian decision process model (the authors provide several in [10] and [21]) is the two-action, CU replacement problem considered in section 2.1 (see the examples in [21] and [10]). In this case, if q_jk(u) = 0.5 for all u (so that the states are CU), and assuming that θ is the probability of machine failure in one time step, then T(k, π, u) = T(p) (T(p) as defined in section 2.1), k = 1, 2, if the action taken is produce, and T(k, π, u) = 0, k = 1, 2, if the action taken is replace. Hence, properties (P1) through (P4) are satisfied in the risk sensitive model as well, and so we can use the approach presented above to find formulas to compute V(π) (for the CU case). The point we want to make here is that the cost function computed in the CU case can be used as the lower bound required in the branch and bound algorithm, without introducing additional burden to the computational procedure. Similarly, if one considers the replacement problem of section 2.2, and lets q_jk(u) = 1 for j = k, and q_jk(u) = 0 for j ≠ k, when the action taken is to inspect (u = 1), then again one can obtain formulas to easily compute the associated risk utility function, and again it can be used as the lower bound required in the branch and bound algorithm. Note that whether the bounds proposed here are better than those used in [10] and [21] remains to be established, and we will not address this point here. We only want to mention the problem treated in [10] and [21] as another one where the ideas presented in previous sections could be applied.
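The update (63) is ordinary Bayes' rule and is compactly expressed in matrix form; a minimal sketch (argument names are ours):

```python
import numpy as np

def belief_update(pi, P, Q, k):
    """Eq. (63): posterior probability of the second state.
    pi: information state (pi_1, pi_2); P: 2x2 transition matrix for the
    chosen action; Q[j, k] = Pr(y = k | next state j)."""
    pred = pi @ P                  # predicted distribution of the next state
    joint = pred * Q[:, k]         # unnormalized posterior given outcome k
    return joint[1] / joint.sum()  # T(k, pi, u)
```

For the CU case q_jk(u) = 0.5 the observation term cancels, and the update reduces to the T(p) of section 2.1, which is precisely the fact exploited above for the lower bound.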
4.4. INPUT OPTIMIZATION FOR INFINITE HORIZON PROGRAMS

We now turn our attention to the input optimization problem considered by the authors in [4]. The problem in this case is that if V(x) is the infinite horizon, maximal discounted reward given the initial state x, one is interested in finding the minimal input required to achieve a target reward u; namely:

I(u) ≡ inf{ x : V(x) ≥ u }.    (65)

In this case, x is a scalar variable taking values in an interval of the real line. We refer the reader to [4] for conditions (cf. [4, assumptions 2.1-2.3]) that guarantee the existence of V(·) and I(·), as defined above. The authors in [4] show that I(u), as defined in (65), has certain properties (e.g., it is monotone nondecreasing and lower semicontinuous), and with these results they show that I(u) also satisfies a functional equation.

This in turn allows them to propose a value iteration algorithm to compute I(u). We refer the interested reader to [4] for details on the model and the results obtained. We note that the problem posed in [4] is deterministic. However, the same problem can be formulated for probabilistic models: for example, consider the replacement model described in section 2.1, and let I(u) ≡ sup{p : V_β(p) ≤ u}. Then, I(u) can be interpreted as the maximal value of the initial probability p of being in the bad state for which one obtains at most a cost u, with V_β(p) the minimal expected cost obtained when an optimal policy is selected given that the machine starts in the bad state with probability p. From the results obtained in section 2.1 it is clear that we can give an explicit solution for I(u), since we have a formula to compute V_β(p) (that is, given a target cost u, we can give the maximal initial probability p that would allow one to remain below the specified cost u). Similarly, for some of the other models considered in this work, the "dual problem" (i.e., given an output level, find the optimal input commensurable with the corresponding optimal decisions) can also be solved explicitly. These results could be used as a first step in the design of (value iteration) algorithms to find optimal inputs corresponding to costs that cannot be computed explicitly by formulas, but have to be found using the DP algorithm.
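For the probabilistic version just described, I(u) is read directly off the piecewise linear cost. A sketch, assuming V_β is nondecreasing and given as a list of (slope, intercept, lo, hi) segments such as those of (80) below (the representation and helper name are ours, not from [4]):

```python
def max_initial_probability(segments, u):
    """I(u) = sup{p : V(p) <= u} for a nondecreasing piecewise linear V."""
    best = None
    for a, b, lo, hi in segments:          # segments ordered in p
        v_lo, v_hi = a * lo + b, a * hi + b
        if u >= v_hi:                      # whole segment stays below u
            best = hi
        elif u >= v_lo and a > 0:          # target crossed inside segment
            best = (u - b) / a
    return best                            # None when u < V(0)
```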

4.5. OPTIMAL-COST CHECKING WITH NO INFORMATION

We now consider the problem described in [19] by Pollock. The general statement of the problem considered is the following ([19, p. 455]). An event E occurs at time t, t = 1, 2, 3, ..., with known probability p(t). An observation is made at time r, r = 1, 2, 3, ..., the result being a random variable x(r) which has probability density function p_0(x) if t > r, and p_1(x) if t ≤ r. Immediately after each observation, one of the following decisions is made: "decide that the event has occurred", or "wait for another observation". Hence, the action space will be denoted by U ≡ {D, W}. Action D, at time r, may or may not be a terminal action. If action D is picked and t > r, then a false cost F is incurred, the knowledge that t > r is gained, and the process continues. If action D is selected and t ≤ r, then the process is terminated with cost c(t, r). The objective is to minimize the total expected cost of the process. The following assumptions are made: (i) the terminal cost is of the form c(t, r) = (r - t)w, with w being interpreted as a "late cost" ([19, p. 456]) (it is mentioned in [19] that this assumption is not necessary to obtain a solution, but it is convenient since it reduces the algebraic manipulations); (ii) the occurrence time t of the event E follows a geometric distribution, i.e.,

p(t) = a(1 - a)^{t-1},   t = 1, 2, 3, ...,    (66)

where a is the (constant) probability of occurrence per unit time, given that the event has not yet occurred. We assume 0 < a < 1. Let P(r) be the present value of the probability that the event has occurred previous to, or at, time r (P(r) will be denoted by P, following the notation in [19]). If one assumes that p_0(x) = p_1(x), then ([19]) an observation x does not affect the a posteriori evaluation of P, since no information is gained between decision times. Therefore, the checker's optimal strategy simply consists of either waiting one time unit, or deciding that the event has occurred. Furthermore, denoting by T(P) the updated probability that the event has occurred, we have that T(P) is given by ([19, p. 455]):

T(P) = P(1 - a) + a.    (67)

Let V(P) be defined as the minimum expected cost obtained using an optimal strategy at time r. Then ([19, eq. (5)]), V(P) satisfies:

V(P) = min{ wP + V(T(P)),  (1 - P)(F + V(a)) },    (68)

where the first choice in the minimization is associated with the "wait" action, and the second one with the "decide" action. Observe that from (68) one obtains that V(0) = V(a), and that V(1) = 0, provided that F > 0 and w > 0. Also from (68) (and [19, p. 462]), V(a) ≤ (1 - a)(F + V(a)), or V(a) ≤ (1 - a)F/a, so V(·) is bounded from above. We refer the reader to [19] for conditions that guarantee the existence of V(·) as defined above. We have stated the problem in such a way that the similarity between the problem described in this section and that of section 2.1 is evident (the only differences are: (i) here there is no discount factor; and (ii) the second choice in the minimization in (68) is not constant, but affine in P). In addition, Pollock shows in [19] that the optimal strategy is of the form:

wait     if P < P*,
decide   if P ≥ P*,    (69)

where P* is a control limit to be determined as part of the solution of the problem. Furthermore, Pollock shows that the control limit is not "degenerate", i.e., 0 < P* < 1.

Remark 4.2

This last result can also be obtained (in a simpler way) following lemma 3.2, as follows. Let ψ(P) be the cost associated with the policy that "waits for all P ∈ [0, 1]". Then, using (68) recursively, one obtains that:
ψ(P) = lim_{n→∞} [ wP Σ_{j=0}^{n-1} (1 - a)^j + w Σ_{j=1}^{n-1} (1 - (1 - a)^j) ].    (70)

Since ψ(1) = lim_{n→∞} nw, which clearly diverges, we get a contradiction, since by (68) V(1) = 0. This implies that it cannot be optimal to wait for P = 1.


Similarly, let ψ'(P) be now the cost associated with the policy that "decides for all P ∈ [0, 1]"; applying (68) recursively, one obtains that:

ψ'(P) = lim_{n→∞} [ (1 - P) F Σ_{j=0}^{n-1} (1 - a)^j ] = (1 - P) F / a.    (71)

But from (71), ψ'(0) = F/a = F + (1 - a)F/a = F + ψ'(a); that is, deciding at P = 0 incurs the sure false cost F before the process restarts from P = a. Hence it cannot be optimal to decide for P = 0.
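Before stating the closed-form solution, note that eq. (68) is also easy to iterate numerically, which provides an independent check on the formulas below; a rough sketch (grid size and iteration count are arbitrary choices):

```python
import numpy as np

def checking_value(a=0.1, w=0.1, F=1.0, npts=4001, iters=5000):
    """Fixed-point iteration of eq. (68) on a grid over P."""
    P = np.linspace(0.0, 1.0, npts)
    TP = P * (1.0 - a) + a                          # eq. (67)
    V = np.zeros(npts)
    for _ in range(iters):
        wait = w * P + np.interp(TP, P, V)
        decide = (1.0 - P) * (F + np.interp(a, P, V))
        V = np.minimum(wait, decide)
    return P, V
```

With the data of example 7 the result should approach the piecewise linear cost (85), with the switch from wait to decide near P* ≈ 0.67324.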

With these results at hand, it is clear that the procedure of sections 2 and 3 can be used to show the piecewise linearity of V(P) (cf. remark 2.6). Next, we state the equations to compute the optimal cost V(P) and the control limit P*. Let I be the number of line segments describing V(P)|_{[0,P*]}, the restriction of the optimal cost function to P ∈ [0, P*]. Then, I is the minimum integer n that satisfies:

nw + F - w(1 - (1 - a)^n) / [ a(1 - a)^n ] ≤ 0.    (72)

Observe that eq. (72) depends only on the parameters of the model. The optimal cost evaluated at P = a, V(a) (= V(0)), is given by:

V(a) = { w[(I - 1) - (1 - a - (1 - a)^I)/a] + (1 - a)^I F } / [ 1 - (1 - a)^I ].    (73)

The control limit P* can be obtained from:

P* = a(F + V(a)) / [ w + a(F + V(a)) ].    (74)

From (68), it is clear that V(P)|_{(P*,1]}, the restriction of V(P) to P ∈ (P*, 1], is given by:

V(P)|_{(P*,1]} = (1 - P)(F + V(a)).    (75)

Finally, if we let Q^i(P) denote the ith line segment describing V(P)|_{[0,P*]}, i = 1, ..., I, one obtains:

Q^i(P) = [ w(1 - (1 - a)^i)/a - (1 - a)^i (F + V(a)) ] P
         + (1 - a)^i (F + V(a)) + w[ i - (1 - (1 - a)^i)/a ],   i = 1, ..., I,    (76)


where the intersections between adjacent line segments Q^i(·) and Q^{i+1}(·) are given by:

Q_i = [ a(F + V(a))(1 - a)^i - w(1 - (1 - a)^i) ] / [ a(F + V(a))(1 - a)^i + w(1 - a)^i ],   i = 1, ..., I - 1.    (77)
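Eqs. (72)-(77) translate directly into a short program; the sketch below is ours and only illustrates the arithmetic (the segment list is returned ordered from P = 0 toward P*, followed by the decide segment (75)):

```python
def pollock_solution(a, w, F):
    """Closed-form solution of the checking problem via eqs. (72)-(77)."""
    n = 1
    while n * w + F > (w / a) * ((1.0 - a) ** (-n) - 1.0):
        n += 1                                      # eq. (72) gives I
    I = n
    qI = (1.0 - a) ** I
    V_a = (w * ((I - 1) - (1.0 - a - qI) / a) + qI * F) / (1.0 - qI)  # (73)
    P_star = a * (F + V_a) / (w + a * (F + V_a))                      # (74)
    segments = []
    for i in range(I, 0, -1):                       # Q^I near P = 0, Q^1 last
        qi = (1.0 - a) ** i
        slope = w * (1.0 - qi) / a - qi * (F + V_a)                   # (76)
        segments.append((slope, w * (i - (1.0 - qi) / a) + qi * (F + V_a)))
    segments.append((-(F + V_a), F + V_a))          # eq. (75), decide region
    return I, V_a, P_star, segments
```

With a = 0.1, w = 0.1 and F = 1 this reproduces I = 11 and P* ≈ 0.67324 of example 7 below.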

Remark 4.3

The expression for P* (eq. (74)) is given in [19, eq. (12)]. Also, Pollock gives an expression to compute the optimal cost whenever P = a (cf. [19, eq. (18)]). From Pollock's results, the piecewise linearity of V(P), and hence eqs. (73), (76) and (77), are a direct consequence. Our intention here was to include Pollock's problem as another example of the models for which an explicit solution of the DP equation can be obtained. We illustrate the use of eqs. (72)-(77) below in section 5, example 7.

4.6. MARKOV DECISION PROCESS WITH LAGGED INFORMATION

Consider the problem addressed by Kim in [14], Kim and Jeong in [15], and White in [34]. In these papers, the authors are interested in POMD processes with lagged information. That is, in addition to the partial observations of the current states, there are some delayed observations of previous states available to the decision maker. This lagged information might also not be perfect. In [14] the author studied the case in which both current and lagged observations are perfect (so that the current information vector has finite dimension), and presents sufficient conditions for a control-limit policy to be optimal. In [15], the authors analyze the problem in which the current and lagged observations are not perfect, for the finite horizon problem. As stated in [15], lagged information can also be considered in the infinite horizon discounted case, and applied to, e.g., maintenance or replacement problems. As in previous sections, we consider the case in which the (core) process {x(t), t = 0, 1, ...} can take two values. The model then becomes similar to that presented in section 4.3, where P(u(t)) = [p_ij(u(t))] is the transition probability matrix describing the evolution of the process, with u(t) the control action at time t. In this case, however, there are two observation processes: {y_c(t), t = 0, 1, ...}, the observation process related to the current states by the probabilities q^c_jk = Pr{y_c(t + 1) = k | x(t + 1) = j}, with Σ_{k=1}^2 q^c_jk = 1, j = 1, 2, and {y_l(t), t = 0, 1, ...}, the observation process associated with the previous states by the probabilities q^l_jm = Pr{y_l(t + 1) = m | x(t) = j}, such that Σ_{m=1}^2 q^l_jm = 1, j = 1, 2 (as in [5], the number of observations need not be the same as the number of values in the state space; we let {y_c(t)} and {y_l(t)} take two values). Since the states are PO, π = (π_1, π_2) is the information state, with π_i the probability of being in state i, i = 1, 2, given past observations and actions.


The authors find a rule for updating the information vector given current and lagged observations. This rule is given by (see [15, eq. (6)]):

T((k, m), π, u) = [ q^c_{2k} Σ_{i=1}^2 π_i q^l_{im} p_{i2}(u) ] / [ Σ_{j=1}^2 q^c_{jk} Σ_{i=1}^2 π_i q^l_{im} p_{ij}(u) ],    (78)

where T((k, m), π, u) is the updated probability that the state of the process is the second one, given that outcome k was observed, delayed observation m was available, and control u was applied. Compare this expression with the updated probability of the model considered in section 4.3. The problem here is to minimize the cost V(π), which satisfies the functional equation:

V(π) = min_{u ∈ U} [ Σ_{i=1}^2 π_i c(i, u) + β Σ_{i,j,k,m} π_i p_ij(u) q^c_jk q^l_im V(T((k, m), π, u)) ],    (79)

with U the set of admissible actions, β the discount factor, 0 < β < 1, and c(i, u) the cost accrued when the system is in state i and action u is applied. Consider now the replacement problems of sections 2.1 and 2.2. The lagged observations could be introduced in these problems as (any) extra information available (e.g., tests which are carried out while the machine is working, but such that the results are not available immediately), and which is taken into account with a delay of one time period. In that case, whether q^c_jk = q^l_jm = 0.5 for j, k, m = 1, 2 (no observations yield information), or q^c_jk = 0.5 for j, k = 1, 2, and q^l_jm = 1.0 if j = m and q^l_jm = 0.0 if j ≠ m, j, m = 1, 2 (lagged observations give perfect state information), the optimal cost function is piecewise linear. In addition, in the former case, the expression for the optimal cost is exactly that obtained in section 2.1. The previous observations imply that expressions for upper and lower bounds of the optimal cost function in the lagged-observations model can be easily obtained. Thus, these bounds could be used either in numerical procedures aimed at solving the general lagged-observations problem, or as an easy way to find out if it is at all beneficial to use the lagged observations, since, as is pointed out in [34], there are cases in which the lagged information does not improve the optimal cost function. We conclude this section by observing that here, as in the model of section 4.3, more interesting questions can be addressed when higher dimensional models are considered, and so the work of Wang [30] can be taken as a first step in that direction. The purpose here was to bring attention to the MD process with lagged information as a possible application of the ideas presented in previous sections.
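As with (63), the rule (78) is a one-line Bayes update once the lagged observation has reweighted the prior; a minimal sketch (names are ours):

```python
import numpy as np

def lagged_belief_update(pi, P, Qc, Ql, k, m):
    """Eq. (78): posterior probability of the second state given the
    current observation k and the lagged observation m.
    Qc[j, k] = Pr(y_c = k | new state j); Ql[i, m] = Pr(y_l = m | old state i)."""
    prior = pi * Ql[:, m]            # lagged evidence about the old state
    pred = prior @ P                 # propagate through the chain
    joint = pred * Qc[:, k]          # current-observation likelihood
    return joint[1] / joint.sum()    # T((k, m), pi, u)
```

Setting Ql to the all-0.5 matrix recovers (63); setting it to the identity gives the perfect-lagged-information case discussed above.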

5. Examples

In this section we solve some examples using the formulas presented in the previous sections.

Example 1

Consider the Markovian replacement model of section 2, and let β = 0.9, θ = 0.7, C = 4, I = 5 and R = 10. Using the formulas developed in that section, one finds that the optimal policy structure is "produce for p ∈ [0.0, 0.5826], inspect for p ∈ (0.5826, 0.6829] and replace for p ∈ (0.6829, 1]", and the associated cost function is given by:

V_β(p) = { 18.99p + 16.79,   p ∈ [0.0000, 0.0304]
           18.51p + 16.80,   p ∈ (0.0304, 0.1273]
           17.97p + 16.88,   p ∈ (0.1273, 0.2146]
           17.17p + 17.04,   p ∈ (0.2146, 0.2931]
           16.26p + 17.30,   p ∈ (0.2931, 0.3638]
           15.15p + 17.71,   p ∈ (0.3638, 0.4274]
           13.76p + 18.31,   p ∈ (0.4274, 0.4847]
           12.04p + 19.14,   p ∈ (0.4847, 0.5362]
            9.93p + 20.21,   p ∈ (0.5362, 0.5826]
            7.32p + 21.79,   p ∈ (0.5826, 0.6829]
           26.79,             p ∈ (0.6829, 1.0000]    (80)
Example 2

For the same model of example 1, take θ = 0.12 and leave the rest of the data unchanged. Now the optimal policy structure is "produce for p ∈ [0.0, 0.6353] and p ∈ (0.6806, 0.7172], inspect for p ∈ (0.6353, 0.6806] and replace for p ∈ (0.7172, 1.0]", with the optimal cost function given by:

V_β(p) = { 17.36p + 18.69,   p ∈ [0.0000, 0.1077]
           16.87p + 18.74,   p ∈ (0.1077, 0.2148]
           16.24p + 18.88,   p ∈ (0.2148, 0.3090]
           15.46p + 19.12,   p ∈ (0.3090, 0.3919]
           14.47p + 19.51,   p ∈ (0.3919, 0.4649]
           13.22p + 20.09,   p ∈ (0.4649, 0.5291]
           11.04p + 20.92,   p ∈ (0.5291, 0.5856]
            9.64p + 22.04,   p ∈ (0.5856, 0.6353]
            7.73p + 23.69,   p ∈ (0.6353, 0.6806]
            4.00p + 25.82,   p ∈ (0.6806, 0.7172]
           28.69,             p ∈ (0.7172, 1.0000]    (81)

Example 3

Consider the example presented by Ross in [20]. The data are those of example 1 but with I = 6. Although for iteration n = 7 of algorithm (10) there are four regions in the policy structure (see [20]), the stationary optimal policy structure has only two regions; the cost function is given in [25]. Similarly, for example 1, at iterations n = 14 and n = 15 of algorithm (10) there are four regions in the policy structure, but the stationary optimal policy structure has only three regions. These examples illustrate the observation in remark 2.8, and (to some extent) motivate the research presented here. Unless one knows in advance the structure of the infinite horizon, optimal policy, it is very difficult to decide when an optimal policy has been reached when using the dynamic programming iterative procedure. Furthermore, as mentioned in [7], a particular structured policy may occur at any time during the iterative procedure, yet fail to be optimal for the infinite horizon problem. Thus, policy structures which are not optimal for some finite horizon cannot be eliminated as suboptimal. In fact, estimation of the minimum number of iterations (for example in algorithm (10)) required to guarantee that a finite horizon, optimal policy is also optimal for the infinite horizon case remains an outstanding problem; see [7, p. 29].

Example 4

Consider the inspection model for standby units analyzed in section 3. Let i = 0.85, r = 0.955, c = 0.85, s = 0.96, b = 0.7, N = 0.07 and M = 0.035. Using eqs. (42) through (45) one finds that the optimal policy is "do nothing for q ∈ [0.0, 0.0982], and replace for q ∈ (0.0982, 1.0]", with the optimal cost function given by:

V(q) = { 91.89 - 24.96q,   q ∈ [0.0000, 0.0215]
         92.41 - 18.77q,   q ∈ (0.0215, 0.0606]
         92.80 - 10.18q,   q ∈ (0.0606, 0.0982]
         91.80,             q ∈ (0.0982, 1.0000]    (82)

For this case, we used β = 0.99, and as explained in remark 3.4, only t = 113 iterations of the simple procedure suggested were considered.

Example 5

For the inspection model of section 3, let now i = 0.9, r = 0.95, c = 0.9 and leave the other parameters as those in example 4. Using eqs. (45) through (48), we have that for this example the optimal policy is "do nothing for q ∈ [0.0, 0.1029], and replace for q ∈ (0.1029, 1.0]", and the associated optimal cost function is given by:


V(q) = { ...,   q ∈ [0.0000, 0.0266]
         ...,   q ∈ (0.0266, 0.0656]
         ...,   q ∈ (0.0656, 0.1029]
         ...,   q ∈ (0.1029, 1.0000]    (83)

In this example, T(1 - r) > 1 - i, and therefore only 3t = 339 iterations are required to find L + 1 (compare this with 113²). Also, note that: (a) 1 - s < 1 - r < 1 - i, so l ≠ L + 1, m ≠ L + 1, and (b) 1 - r < 1 - i, so that l ≥ m, which means that further reductions in the number of iterations are possible (the example above was carried out with 222 iterations).

Example 6

Consider the stopping problem described in section 4.2. Let λ_1 = 0.95, λ_0 = 0.5, c_R = 0.05 and c_T = 0.1. Then, using eqs. (54)-(62) one obtains that N = 2, and that the optimal policy is: "reject the reward for p ∈ [0.0, 0.57954], test for p ∈ (0.57954, 0.86190] and accept the reward for p ∈ (0.86190, 1.0]". The optimal cost is given by:

V(p) = { 0.05586p + 0.72414,   p ∈ [0.00000, 0.17676]
         0.12414p + 0.71207,   p ∈ (0.17676, 0.57954]
         0.27586p + 0.62414,   p ∈ (0.57954, 0.86190]
         p,                     p ∈ (0.86190, 1.00000]    (84)

Example 7

Consider the optimal-cost checking problem described in section 4.5. Let w = 0.1, a = 0.1 and F = 1. Then, the optimal policy calls for "wait for P ∈ [0.0, 0.67324] and decide for P ∈ (0.67324, 1.0]". There are I = 11 line segments describing V(P)|_{[0.0, 0.67324]}. The optimal cost function is given by:

V(P) = {  0.03962P + 1.06038,   P ∈ [0.00000, 0.06287]
         -0.06709P + 1.06709,   P ∈ (0.06287, 0.15658]
         -0.18566P + 1.08566,   P ∈ (0.15658, 0.24093]
         -0.31740P + 1.11740,   P ∈ (0.24093, 0.31683]
         -0.46377P + 1.16377,   P ∈ (0.31683, 0.38515]
         -0.62641P + 1.22641,   P ∈ (0.38515, 0.44603]
         -0.80712P + 1.30712,   P ∈ (0.44603, 0.50197]
         -1.00792P + 1.40792,   P ∈ (0.50197, 0.55177]
         -1.23102P + 1.53102,   P ∈ (0.55177, 0.59660]
         -1.47891P + 1.67891,   P ∈ (0.59660, 0.63694]
         -1.75434P + 1.85434,   P ∈ (0.63694, 0.67324]
         -2.06038P + 2.06038,   P ∈ (0.67324, 1.00000]    (85)
E.L. Sernik, S.I. Marcus / Discrete time Markou models

s09

Z(.) is concave(this can be proved using [20, lemma 2.1]), but from (85) we can see that V(.), as opposed to the cost functions of the models of sections2,3 and 4.2, is not necessarilymonotone (nonincreasing,in this case).Since for this model it must be true that V(a):V(0) (cf. eq. (68)), f(.) will be nonincreasingif the ,Ith line segmenthas zero slope. For illustrative purposes,let a:0.1, w : 1.0 and F = 0.777421.Then, using eqs. (72) through (77), we find that the optimal policy is to "wait for Pe [0.0,0.2710]and decide for P e(0.2710,1.01",and that the optimal cost is given by:

r e [o.ooo0, 0.1000]
P e (0.1000, 0.19001 P (0 . 1 9 0 0 , 0 . 2 7 1 0 1 3.7 7 7 4 P e \0.2770, 1.00001
(86)

Note that P*:0.277:fL(a)e. {a, T(a), Tz(a)}. and so (cf. remark 2.4) the optimal policy is not finitely transient, but the optimal cost function is piecewise linear.

6. Conclusions We have considered several applications of two state, finite action, infinite horizon, discrete-time, Markov decision processeswith partial observations, for two special cases of the observation quaLity, and shown that the procedure followed in [25] can be used to prove the piecewise linearity of the cost function in each of these cases. This result is important in its own right because it helps in the overall understanding of this class of problems. In addition, in most of the cases considered,it allows one to obtain either explicit formulas or simplified computational algorithms to find the optimal cost function and the optimal control policies, making the kind of models considered in this work more appealing and suitable for obtaining insight about the behavior of the system under study. In order to make the results presented here useful for decision making in most practical applications though, these results should be extended to the closed loop case in which the state space has dimension greater than two. A first step in this direction could be the work of Wang [30] for the replacement, CU problem described in section 2.1. Also, information patterns that do not necessarily involve extremal situations (i.e., CU, CO) should be considered. The hope is that the knowledge gained will enable the solution of several open questions associated with the kind of models considered here, like the one posed in [2, p. 559], for the replacement problem of section 2.2, concerning whether in the case of partial observations during production and during inspection, the set of structured policies remains the same considered above.

510 References

E.L. Sernik. S.I. Marcus

/ Discrete time Markou models

processes, Markov decision Oper. Res. [1] S.C.Albright, Structuralresultsfor partiallyobservable 27 (7979\1041-1053. [2] V.A. Andriyanov, I.A. Kogan and G.A. Umnov, Optimal control of a partially observable discreteMarkov process, Autom. RemoteContr. 4 (1980)555-561. with stateinformation,J. Math. [3] K.J. Astrom,Optimal control of Markov processes incomplcte Anal. Appl. 10 (1965)774-205. programs, Optim. J. and S.D. Flam, Input optimizationfor infinite discounted [4] A. Ben-Israel Theory Appl. 61 (7989)347-357.
[5] D.P. Bertsekas, Dynamic Programming (Prcnticc Hall, E,nglewoodCliffs, NJ, 1987). [6] D.P. Bertsekasand S.E. Shreve, StochasticOptimal Control: The Discrete Time Case (Academic Prcss,New York, 1978). [7] A. Federgruen and P.J. Schweitzer, Discounted and undiscounted value iteration in Markov decision problems: A survey, in Dynamic Programming and its Applications, ed. M. Puterman (Academic Press,1979) pp.23-52. [8] E. Fernandez-Gaucherand,A- Araposthatis and S.I. Marcus, On partially observable Markov decision processes with an average cost criterion, Proc. 28th IEEE Conf. on Decision and Control, Tampa, Florida (1989) 1267-1272. [9] E. Fernandez-Gaucherand,A. Arapostathis and S.I. Marcus, On the average cost optimality equation and the structure of optimal policies for partially observable Markov decision processes,this volume. [0] A. Gheorghe, Partially observable Markov processeswith a risk sensitivity decision maker, Rev. Roumaine Math. Pures Appl- 22 (19'77) 461-482. [11] W.J. Hopp and S.C. Wu, Multiaction maintenance under Markovian deterioration and incomplete state information, Naval Res. Log. Quart. 35 (1988) 447-462. p2j J.S. Hughes, Optimal internal audit timing, The Accounting Review 52 (7977\ 56-68. [13] J.S. Hughes, A note on quality control under Markovian deterioration, Oper. Res. 28 (1980) 42t-424. [14] S.H. Kim, State information lag Markov process with control limit rule, Naval Res. Log. Quart. 32 (1985) 491-496. [15] S.H. Kim and B.H. Jeong, A partially observable Markov decision process with lagged information, J. Oper. Res. Soc. 38 (1987) 439-446. [16] P.R. Kumar and T.I. Seidman, On the optimal solution of the one armed bandit adaptive control problem, IEEE Trans. Automatic Control, 26 (1981) 1176-1184. [17] J.J. Martin, Bayesian Decision Problems and Markou Chains (Wiley, New York, 1967). [18] G. Monahan, Optimal stopping in a partially observable binary-valued Markov chain with coslly perfect information, J. Appl. Prob. 19 (1982) 72-8I. [19] S.M. Pollock, Minimum-cost checking using imperfect information, Management Sci. 13 (1967) 454-465. [20] S.M. Ross, Quality control under Markovian deterioration, Management Sci. 17 (1977) 587- 596. [21] J.K. Satia and R.E. Lave, Markovian decision processes with probabilistic observation of states,Management Sci. 20 (1973) i-13. l22l K. Sawaki and A. Ichikawa, Optimal control for partially observable Markov decision processesover an infinite horizon, J. Oper. Res. Soc. Japan2T (1978) i*15. [23] K. Sawaki, Transformation of partially observable Markov decision processesinto piecewise linear ones, J. Math. Anal. Appl. 91 (1983) 112-118. I24l E.L. Sernik and S.l. Marcus, Comments on the sensitivity of the optimal cost and the optimal policy for a discrete Markov decision process, Proc. 27th Annual Allerton Conf. on Communication, Control and Computing, Monticello, Illinois (1989) pp. 935-9M.

E.L. Sernik, S.I. Marcus / Discrete time Markou models

511

[25] E.L. Sernik and S.L Marcus, On the optimal cost and policy for a Markovian replacement problem (1990), to appear in J. Optim. Theory Appl. t26l E.J. Sondik, The optimal control of partially observable Markov processes,Ph. D. Thesis, Department of Electrical Engineering Systems, Stanford University (1971). l27l E.J. Sondik, The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs, Oper. Res. 26 (1978) 282-304. [28] L.C. Thomas, P.A. Jacobs and D.P. Gaver, Optimal inspection policies for standby systems,

Comm. Models (19E7) 259-213. Stat.- Stochastic 3


[29] R.C. Wang, Computing optimal quality control policies - Two actions, J. Appl. Prob. 13 (1976) 826-832. [30] R.C. Wang, Optimal replacement policy with unobservable states, J. Appl. Prob. 14 (1977) 340-348. [31] C.C. White, A Markov quality control process subject to partial observation, Management Sci. 23 (1977\ 843-852. C.C. White, Optimal inspection and repair of a production process subject to deterioration, J. [32] Oper. Res. Soc. 29 (1978) 235-243. [33] C.C. White, Bounds on the optimal cost for a replacement problem with partial observations, Naval. Res. Log. Quart. 26 (1979) 415-422. C.C. White, Note on "A partially observableMarkov decision processwith lagged information", [34] J. Oper. Res- Soc. 39 (1988) 217-278.