
Int. J. Industrial and Systems Engineering, Vol. 3, No. 4, 2008, p.474

Use of machine learning for continuous improvement

of the real time heterarchical manufacturing control
system performances

Nassima Aissani* and Bouziane Beldjilali

Department of Computer Science,
University of Oran,
ES-Sénia, El M’naouer,
BP 1524, Wilaya d’Oran, Algérie
E-mail: aissani.nassima@yahoo.com
E-mail: bouzianebeldjilali@yahoo.fr
*Corresponding author

Damien Trentesaux
Laboratory of Industrial and Human Automation,
Mechanics and Computer Science,
Department of Production Systems,
University of Valenciennes,
Le mont Houy,
Valenciennes cedex 09, F-59313 France
E-mail: damien.trentesaux@univ-valenciennes.fr

Abstract: Heterarchical manufacturing control systems offer a significant
potential in terms of capacity, adaptation, self-organisation and real-time
control for dynamic manufacturing systems. In this paper, we present our
approach to working out a manufacturing control system in which the decisions
taken by the system are the result of the collective work of a group of agents.
These agents ensure a continuous improvement of the system's performance,
thanks to the reinforcement learning technique with which they are equipped.
This learning technique makes it possible for the agents to learn the best
behaviour in their various roles (responding to requests and disturbances,
self-organisation, planning, etc.) without weakening the real-time quality of
the system. We also introduce a new type of agent, called the 'observer
agent', which is responsible for supervising the evolution of the system's
global performance. A computer implementation and experimentation of this
model are provided in this paper to demonstrate the contribution of our
approach.

Keywords: manufacturing control system; heterarchical system; global
performance; multiagent systems; reinforcement learning; real-time control.

Reference to this paper should be made as follows: Aissani, N., Beldjilali, B.
and Trentesaux, D. (2008) 'Use of machine learning for continuous
improvement of the real time heterarchical manufacturing control system
performances', Int. J. Industrial and Systems Engineering, Vol. 3, No. 4.

Biographical notes: Nassima Aissani is a Researcher at the Department of
Computer Sciences at Oran University, Algeria. She received a Master's
Degree in Computer Sciences and Automatics from Oran University, Algeria,
in 2006, and an Engineering Degree in Industrial Informatics from the
Department of Computer Sciences at Oran University, Algeria, in 2003.
Her research interests include manufacturing system design, machine learning,
artificial intelligence, multiagent systems and manufacturing control systems.

Copyright © 2008 Inderscience Enterprises Ltd.

Bouziane Beldjilali is a Professor at Oran University. He teaches Databases at
the undergraduate level and Knowledge Management at the postgraduate level.
His areas of research include IDSS, neuro-fuzzy systems, machine learning,
techniques used in regional planning, and manufacturing control systems. He is
the author of several publications in the following journals: Journal of
Computer Science and Technology, Computer Science Journal of Moldova,
Informatica, and Knowledge and Engineering Systems (to appear).

Damien Trentesaux is a Professor at LAMIH-CNRS (Automatic Control,
Computer Science and Mechanical Engineering laboratory) at the University of
Valenciennes, France. His research interests include multicriteria decision
making and autonomous control of production systems. He teaches Production
Management, Linear Control and Simulation at the ENSIAME Engineering
School. He is a member of the French CNRS research group MACS and is the
author of several publications in the manufacturing domain.

1 Introduction

Industrialists must face a double challenge:

• on the one hand, the increasing need for flexibility and reactivity in
manufacturing systems, in order to face simultaneously the competition
and the market needs while adjusting their objectives in terms of cost,
time, quality, etc.
• on the other hand, an ever-increasing complexity of the production process
(from design to product manufacture), in spite of the need to reduce and
control the life-cycles of manufacturing systems and products.
Current production control methods struggle to cope with this increasing
dynamic evolution, in particular real-time control (piloting) of manufacturing
systems, which constitutes a critical and complex level. This competitive
context pushes industrialists to design systems that are not only able to react
effectively, but also, increasingly, able to adapt to the fast evolution of
demand at fixed cost, using the available resources as well as possible to
optimise this adaptation. The concept of efficiency thus becomes significant,
in a context where continuous improvement constitutes a major stake for
industrialists and researchers in sectors with strong innovation and
technological development (automotive, mobile telephony, computing, etc.).
The objective here is to offer systems which are not only effective (optimal)
with respect to a target performance, but also efficient, in a context of
continuous improvement of these performances. In Bousbia and Trentesaux
(2002), an analysis of all these points was carried out, highlighting in
particular the interest of adopting a self-organised and heterarchical control
approach. The term heterarchy describes a relation between entities at the
same hierarchical level (Duffie and Prabhu, 1996). It was initially proposed in
the field of medical biology (McCulloch, 1945) and then experimented with in
several domains (Haruno and Kawato, 2006; Maione and Naso, 2003;
Prabhu, 2003). This term is relatively close to the concept of
‘distribution’ in the multiagent field (distributed systems), which is largely
used in the design of manufacturing control systems (Albadawi et al., 2006;
Chen et al., 2003; Ouelhadj et al., 2005; Reaidy et al., 2006; Zhou et al.,
2002). However, from our point of view, distributing the decision capacities
does not mean that the multiagent system is organised in a heterarchical way
(even if it is often the case). This is nevertheless the assumption that we wish
to make in this paper. From our point of view, this assumption is justified by
the dynamics and volatility of information, which make purely hierarchical
approaches partially unsuited to the objective previously announced (Bousbia
and Trentesaux, 2002). Let us also note that the current technological offer
allows its implementation (processing capacity, communication networks,
RFID technology, etc.) (Wang and Shen, 2003). The idea is thus to work out a
control system based on entities (agents) which is dynamic and reactive,
which is self-organised, and which is able to process in quasi-real time the
data concerning the resources of the system to be controlled: the control
structure is composed of a set of agents which cooperate to make decisions
and which are also able to learn from their experience using reinforcement
learning (Watkins, 1989), a technique suited to dynamic environments that are
difficult or even impossible to model (Bhattacharya et al., 2002; Charton
et al., 2003; Marthi et al., 2005). The objective is thus to obtain an operation
that is not only effective (reactive) but also efficient, in the sense that the real
potentialities of the production resources are taken into account in order to
continuously improve the already effective solutions.
We will start by introducing the problem. Following the state-of-the-art
analysis, we will present the adopted modelling approach and the learning
technique that we used. We will then propose our system model. Lastly, we
will present our model's implementation and an experimentation from which an
analysis is carried out. A conclusion then summarises our contribution and
presents a number of perspectives.

2 State-of-the-art

In this section we propose an analysis of the state of the art in the field of
heterarchical or distributed control, with the objective of proposing control
that is not only effective (having a minimal level of performance) but also
efficient (able to improve this level with constant means).
Let us note that a manufacturing system is usually decomposed into two
complementary parts (see Figure 1):
• the operative system (operative part), which treats the material flow
• the control system, which treats the informational and decisional flows.

2.1 Multiagents approach

Many researchers base their work on a multiagent approach. Multiagent
systems are indeed used in the resolution of complex problems in distributed
or heterarchical environments where cooperation and interaction are necessary.
The use of multiagent systems in manufacturing control is illustrated by
Querrec et al. (1997), who presented a generic model of reactive systems in
which each entity of the system is represented by an autonomous agent able to
make decisions locally. The agent consists of two parts: one operative, the
other for control (see Figure 2).

Figure 1 Fundamental model of the manufacturing control system environment

Figure 2 Conceptual model of a generic element

The suggested architecture supports modularity of design and realisation by
locating the decision-making in the agents themselves. It also makes it
possible to reconfigure the system and to increase its adaptivity. Several
implementations of this method were proposed, with more or fewer
modifications.
In the work of Zhou et al. (2002), various functional agents were built to
represent all system resources, together with a manager agent to improve the
scheduling agility in a hybrid hierarchical organisation; scheduling here is
supported by a negotiation protocol. This generates all the disadvantages of
hierarchical systems (difficulty in making modifications, and time wasted on
communication between entities at the same level), so we cannot speak about
real-time control. Furthermore, the scheduling quality and its improvement
cannot be assured, because the scheduling algorithm does not change; this is
why it is interesting to use learning techniques.
Chen et al. (2003) presented a multiagent system, the Mobile Enterprise
Information Portal (MEIP), used to establish real-time data capture for
real-time production control across the intra-/inter-organisational supply
system. This system helps the production controller make effective decisions
in a just-in-time manufacturing system. This experiment enables us to affirm
that the multiagent approach can be used within a real-time framework.
Because communication and cooperation are the key in multiagent systems,
some researchers based their work on negotiation protocols, like Ouelhadj
et al. (2005), who proposed a negotiation protocol based on the Contract Net
Protocol to allow the agents to cooperate and coordinate their local schedules
in order to find globally near-optimal robust schedules, whilst minimising the
disruption caused by the occurrence of unexpected real-time events.
More and more experiments have been made using agent-based architectures
for distributed system control to improve performance, adaptiveness and
productivity. Consider, for example, the work of Albadawi et al. (2006), in
which a linear, tunable model for the plastic thermoforming process and a
non-linear, mathematical and rule-based model for the metal powder grinding
process were proposed.
On the other hand, Reaidy et al. (2006) presented a system model for
heterarchical and complex manufacturing control based on multiagents, and
especially on the cooperation and competition paradigm. Agents represent the
products and resources of the system. Local scheduling and control in
dynamic environments are addressed by a negotiation protocol between
agents based on the 'request session'. The disadvantage is that long-term
planning is thus difficult, and it is difficult to ensure that the global
objectives of the system will be met. This led us, in our approach, to elect an
agent which directs the system towards its global objectives.
In conclusion, the control systems built until now were based on multiagents
whose knowledge was static: they did not learn anything from their
environment, and their knowledge was limited to the base introduced at
system design time, even though they had to solve the same problem on
several occasions.

2.2 Machine learning

We also notice a lack of intelligence in problem resolution: the same problem
always requires the same resolution procedures, and one is always brought to
rebuild them. Some researchers, like Maione and Naso (2003), used genetic
algorithms to adapt the decision strategies of autonomous controllers in a
heterarchical manufacturing control system. The control agents use
preassigned decision rules only for a limited amount of time, and obey a
rule-replacement policy propagating the most successful rules to the
subsequent populations of concurrently operating agents. The resolution time
in this approach is likely to be long; therefore real time cannot be assured.
So, we think of equipping our system agents with learning techniques able to
ensure an improvement of the performances, while taking care of the
real-time constraint. The HUYET project (Huyet and Limos, 2004) is an
interesting example of the use of machine learning about what improves or
degrades system performance, for a better design: by detecting the variables
which influence the performance and by studying their relations, that is, by
acquiring knowledge on what supports or degrades the system design in terms
of performance.
This project could generate a good manufacturing system design, but it could
not improve over time. On another side, we can highlight work which, in a
multiagent context or not, proposed interesting mechanisms making it
possible to take the efficiency objective into account in a context of
continuous improvement. Machine learning is very much used for
performance improvement; some researchers used logic flow to develop
models for complex manufacturing control systems: Gopalakrishnan et al.
(2007) presented a simulation model for the energy supply chain in the
integrated steel industry to develop scenarios that aim to save energy in the
manufacturing system. But we need a reactive learning technique which
allows us to control the system in real time. Choy et al. (2003) used
reinforcement learning for real-time coordinated signal control in an urban
traffic network; they used a multiagent approach and, to handle the changing
dynamics of the complex traffic processes within the network, an online
reinforcement learning module to update the knowledge base and inference
rules of the agents. Test results showed that the multiagent system improved
the average delay and total vehicle stoppage time, compared with the effects
of fixed-time traffic signal control. Katalinic and Kordic (2004) used
reinforcement learning
in the optimisation of resource use in a very expensive production of electric
motors. These systems are characterised by a variety of products (produced
upon request), which requires great flexibility and adaptability.
Consequently, the assembly units must be autonomous and modular, whence
the difficulty of controlling and developing the performances. Katalinic and
Kordic (2004) considered these units as a colony of insects able to organise
themselves to carry out a task, which can reduce the number of resources
used (the number of machines, for example) and more easily solve the
problems caused by production disturbances. Evolutionist algorithms were
also used in scheduling problems by Bousbia and Trentesaux (2004), in a
heterarchical control system consisting of active entities (agents) with a
genetic learning capacity, in order to minimise product latency, improve
makespan quality and manage breakdowns, hence an improvement of the
system performances.

3 Suggested specification

Let us recall that the objective of our work is to propose a manufacturing
control system model able to continuously improve its potential and global
performances according to a heterarchical structure. In a control system, we
are often brought to solve the same problem in the same way on several
occasions; this is likely to create a waste of time. This is why AI techniques,
and in particular machine learning, have for a long time been a source of
inspiration for researchers in the field of manufacturing systems, because
machine learning allows the acquisition of knowledge so that the system can
adopt the best approach to solve a problem associated with manufacturing
control.
To this end, we thought of introducing machine learning, because up to now,
the knowledge of agents in a control system was static: they did not learn
anything from their environment, and their knowledge was limited to the base
introduced at the time of system design. Our approach aims at working out a
system which is more powerful insofar as the globality of the performance is
respected: the decisions taken by the system are the result of the work of a
group of agents whose organisation is under continuous improvement, which
ensures a continuous improvement of the system performance (see Figure 3).
Our approach consists in equipping the agents of our control system,
'Système de Pilotage par Apprentissage par Renforcement en Temps réel
(SPART)/Manufacturing Control System using Reinforcement learning in
Real time (MCSRR)', with a reactive learning module.

Figure 3 Integration of machine learning at the level of a multiagent system,
seen as a continuous improvement of performance

We want to work out a method which learns from the evolution of the system
at the same time as it solves its problems, and which takes the time aspect into
great consideration. With this intention, we chose reinforcement learning, a
very reactive technique which consists in searching for the best action to
execute in each system state.

3.1 Reinforcement learning

A manufacturing system is regarded as an unforeseeable, uncertain and
dynamic environment, because it is subjected to internal disturbances
(production hazards, etc.) and external constraints (market constraints,
unforeseeable orders, etc.).

3.1.1 The Markov decision process

The decision in such an environment is a Markov Decision Process (MDP)
because, according to Russell and Norvig (1995), a system which contains an
explicitly represented agent function cannot anticipate the unexpected
evolutions of the environment. All it must do is execute the action
recommended by its function for the state in which it is, and perceive the
result. Universal Plans were developed as a framework for reactive planning,
but were found to be a rediscovery of the idea of policies in MDPs. This is
represented by Figure 4.

Figure 4 Agent perception–action in an uncertain environment

The MDP model is composed of two distinct elements: on the one hand, a
Markov process modelling the environment, which progresses in time like a
probabilistic automaton; on the other hand, a controller element which
examines the current system state to choose an action to execute in order to
maximise the reward it can obtain.
Reasoning with an MDP can be put in the following terms: 'I know that I am
in such a situation, and if I perform such an action there are so many chances
that I find myself in this new situation, obtaining such a profit'.
In a Markov decision (see Figure 5), in a perceived system state (or
environment state) Si, the agent executes the action Ai which, according to the
transition function T, leads the system to the state Si+1, and the agent receives
a reward Ri, where i is a given moment.
Use of machine learning for continuous improvement 481

Figure 5 Markov decisional model

The action policy π in an MDP is the probability of choosing an action a ∈ A,
where A is the set of actions, in a state s ∈ S, where S is the set of states:

π : S × A → [0, 1]

π(s, a) = P(a | s)

A policy in an MDP is a kind of probabilistic plan. Bellman (1957)
established a value function to estimate the utility of a state according to the
reward which can be obtained by executing an action:

V^π(s) = ∑_{a∈A} π(s, a) ∑_{s′∈S} T(s, a, s′) [R(s, a, s′) + γ V^π(s′)]

where T is the transition function and R the reward. V^π is the utility value of
the state according to the policy π. γ is the discount rate, a positive constant
lower than 1, used to give more weight to the current reward than to future
ones. This equation also shows the relation between the value of a state and
those of its successors. It is this equation which will later give rise to the
Q function of Q-Learning.
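As an illustration, Bellman's value function can be evaluated by iterating the equation above until a fixed point is reached. The two-state MDP, its rewards and the uniform policy below are hypothetical:

```python
# Iterative policy evaluation of V^pi on a toy two-state MDP (illustrative data).
S = ["s0", "s1"]
A = ["a0", "a1"]
gamma = 0.8

# T[(s, a)] -> list of (next_state, probability); R[(s, a, s_next)] -> reward
T = {("s0", "a0"): [("s0", 0.5), ("s1", 0.5)],
     ("s0", "a1"): [("s1", 1.0)],
     ("s1", "a0"): [("s0", 1.0)],
     ("s1", "a1"): [("s1", 1.0)]}
R = {("s0", "a0", "s0"): 0.0, ("s0", "a0", "s1"): 1.0,
     ("s0", "a1", "s1"): 2.0, ("s1", "a0", "s0"): 0.0,
     ("s1", "a1", "s1"): -1.0}

pi = {(s, a): 0.5 for s in S for a in A}  # uniform policy pi(s, a) = 0.5

V = {s: 0.0 for s in S}
for _ in range(200):
    # One sweep of V(s) = sum_a pi(s,a) sum_s' T(s,a,s') [R + gamma V(s')]
    V = {s: sum(pi[s, a] * sum(p * (R[s, a, sn] + gamma * V[sn])
                               for sn, p in T[s, a])
                for a in A)
         for s in S}
```

Since γ < 1, each sweep is a contraction, so the iteration converges to V^π; here the fixed point is V(s0) = 1.875 and V(s1) = 5/12.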
To learn the optimal policy in a Markov environment where the goal is known
(to produce as soon as possible at lower cost) but the learning parameters are
not, because we do not have a base of examples conceived beforehand, and
when we cannot model the environment, that is, when we do not have the
transition function which models the passage of the system from state to
state, we proceed by trial and error, and we then speak about reinforcement
learning (Russell and Norvig, 1995).

3.1.2 Reinforcement learning

Reinforcement learning is learning how to act by trial and error
(Watkins, 1989).
In this paradigm, an agent can perceive its state and execute actions. After
each action, a numerical reward is given (see Figure 6). The agent's goal in
solving an MDP is to maximise its expected long-term payoff, known as its
return; by this means, an agent can learn an optimal policy.

Figure 6 Reinforcement learning loop

Source: Sutton and Barto (1998).

The reinforcement learning method has been used for learning problems in
dynamic stochastic processes coming from experimentation (Isbell et al.,
2000; Singh et al., 2000). Such problems are normally MDPs.
Reinforcement learning theory has been applied successfully to a large
variety of control problems; let us quote the work of Bhattacharya et al.
(2002) on water barge control, that of Charton et al. (2003) on learning
mediation strategies in e-marketing, and the work of Marthi et al. (2005),
which applies hierarchical reinforcement learning techniques to a complex
computer game domain.
Reinforcement learning theory is used for learning in dynamic environments
which are difficult or even impossible to model.
All this led us to use reinforcement learning for our agents. This technique
enables them to make decisions constantly, respecting real time. It enables
them to improve their performances, since it enables them to learn the best
policy, which will improve continuously.

4 Propositions

The agents which we modelled are inspired by the Aalaadin model worked
out by Ferber and Gutknecht (1998), from which we take again the concepts
of group, role, attributes and functions, as well as the perception and action
modules. This choice is justified by the various modelling means offered by
this model and the ease of implementation thanks to the MadKit platform
(Madkit, 2002) for multiagent system development.

4.1 SPART/MCSRR agent architecture

The system agents represent the manufacturing system resources: machines,
storage surfaces, conveyors, etc. (see Figure 7). They must initially have
knowledge of their capacities: the name of the resource, the names of their
realisable operations and their properties, and the list of their own tools
(Properties), as well as the vicinity, which represents the resources directly
connected to this resource (Group, according to the cell layout); this
knowledge defines their Role. Moreover, we equipped them with a Learning
module. In a system state (St) coded by the Perception module, the agent
chooses an action (At) to execute through the Action module and, according
to what this action generates, the agent perceives a numerical reward (Rt),
positive or negative, to reward or punish the executed action; t is a given
instant.

4.2 Q-Learning for the performance improvement

Manufacturing systems are regarded as uncertain, unforeseeable and thus
indeterministic environments. So we chose the Q-Learning algorithm for our
agents' learning.
Q-Learning is used for indeterministic MDPs (for which the transition from
the state St to the state St+1 is unknown). The principle is to learn Q-values,
which represent the quality of each (state, action) pair. This algorithm has the
advantage of constantly exploiting what has already been learnt while
continuing to learn: making each decision requires the knowledge of the
Q-values, and each decision updates these Q-values.
In fact, reinforcement learning algorithms fall into two classes. First,
Dynamic Programming (DP), which is intended for the resolution of
deterministic MDPs, where a complete modelling of the environment is
necessary (e.g. value iteration, policy iteration). Then, the Temporal
Difference (TD) algorithms, which do not need a model of the environment:
only perceptions are used. The agent must improve its policy continuously by
trying new actions for a better understanding of the consequences of its
actions on the environment (online).

Figure 7 SPART/MCSRR agent architecture

Q-Learning is the best-known algorithm of this class, and all the others are its
variants (TD-Learning, QII-Learning, QIII-Learning, etc.); for this first
attempt we chose to use the basic algorithm, which has been the subject of
several experiments.
This algorithm was introduced by Watkins (1989); it is an extension of
traditional DP (value iteration (Bellman, 1957), used to solve deterministic
MDP problems).
Rather than working on the utility V(s) of the various states, Q-Learning
learns the Q(s, a) values of the (state, action) pairs. Let us take again the
Bellman equation allowing Q* to be calculated, considering that we are
learning a deterministic optimal policy

π* : S → A

Q*(s, a) = ∑_{s′∈S} T(s, a, s′) [R(s, a, s′) + γ max_{a′∈A} Q*(s′, a′)]

where s is the current state, s′ the generated state, a the executed action and
a′ the action to be chosen in the future; T is the transition function and R the
reward. Q* corresponds to the Q-values of the optimal policy.
Since T is unknown, but in a given run we passed from the state s = st to the
state s′ = st+1 by selecting the action a = at, we consider that at this moment:

T(s, a, s′) = 1 if s = st, a = at and s′ = st+1; 0 otherwise

Thus, we obtain the equation to calculate a new value Qt+1(st, at):

Qt+1(st, at) = R(st, at, st+1) + γ max_{a′∈A} Qt(st+1, a′)

In practice, a learning rate α is added in order to allow this new value to be
taken more or less into account while keeping the past knowledge Qt, from
which we obtain the final version of the Q-function:

Qt+1(st, at) = (1 − α) Qt(st, at) + α [Rt(st, at) + γ max_{a′∈A} Qt(st+1, a′)]

where Qt+1(st, at) is the new Q-value for the action at in the state st;
Qt(st, at) the previous Q-value for the action at in the state st; Rt(st, at) the
reward received immediately after carrying out the action at in the state st;
and max_{a′} Qt(st+1, a′) the maximum Q-value for the next state st+1, taken
as a future reward.
The Q-function (as Table 1 shows) is a Cartesian product between the set of
states and the set of actions. This table is initialised to 0. This function will be
used each time the agent has to decide.
s1, s2, s3, …, sn ∈ S: the set of states.
a1, a2, a3, …, an ∈ A: the set of actions.
Q(si, ai) is the Q-value given to si when ai is carried out.
After each action carried out in a perceived state, with a received reward, the
Q-value is updated by using the Q-function equation. At any moment, for
each state, the optimal action to choose is the one which has the greatest
Q-value.
The Q-table is arbitrarily initialised, and the Q-values are estimated on the
basis of experiment, in an infinite loop (choose an action – update the
Q-value) (see Figure 8).
In our control system, each time the agent has to decide, it uses the Q-table to
choose the action to execute, then updates the Q-value corresponding to its
state st and the selected action at (Qt(st, at)) according to its perception of the
consequences of its action: the new state st+1 and the reward r.

Table 1 Q-function

a1 a2 a3 … an
S1 Q(s1,a1) Q(s1,a2) … … …
S2 Q(s2,a1) … … … …
S3 … … … … …
… … … … …
Sn … … … … Q(sn,an)

Figure 8 Reinforcement learning progressions

Illustration: let us take a machine agent which has the following data.

At the instant t, we have the Q-value table of Figure 9.

Figure 9 Q-table before making the decision

The current state: 'Stopped Machine and Launch a task' (S6).
The action to choose: A4, 'Starting' (the maximum Q, knowing that the
Q-value represents the utility of carrying out an action in a state).
Perceived reward: R2, '+4'.
The new state: 'Working and Launch a task' (S4).
Update:

Qt+1(s6, a4) = (1 − α) Qt(s6, a4) + α [Rt(s6, a4) + γ max_{ai∈A} Q(s4, ai)]
= (1 − 0.256) × 2.191 + 0.256 × [4 + 0.8 × 0.95] = 2.85

The new Q-table is given in Figure 10.

Figure 10 The Q-table fragment after making the decision

Knowing that α is equal to 1 at the beginning, it will decrease thereafter
according to the formula 1/(number of visits of the state/action pair), because
this gives more weight to the experiments already carried out: the more often
we visit a state/action pair, the more we estimate that we have acquired the
experience for this state, that is, we know what we must do in this state. In the
preceding example, the Q-value was increased, that is, the selected action
generated a correct state supporting our system performance, which makes it
possible to reinforce the undertaken policy (the selected action).
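The numerical update of the example above can be checked directly, with the paper's values α = 0.256, γ = 0.8, Qt(s6, a4) = 2.191, Rt = 4 and max Q(s4, ai) = 0.95:

```python
alpha, gamma = 0.256, 0.8
q_old, reward, q_max_next = 2.191, 4.0, 0.95

# Q_{t+1}(s6, a4) = (1 - alpha) * Qt(s6, a4) + alpha * (Rt + gamma * max Q(s4, ai))
q_new = (1 - alpha) * q_old + alpha * (reward + gamma * q_max_next)
print(round(q_new, 2))  # 2.85, as in Figure 10
```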

4.3 The decisional learning algorithm

The flow charts of our system agents' decisional cycles are given in
Figures 11 and 12.
The learning algorithm of a SPART/MCSRR agent proceeds in the following
way (see Figure 14):
1 The agent perceives its environment and digitises the perceived state.
2 Select the action to be carried out whose Q-value is the highest, or select
according to a probability (because, in the first stage of the algorithm, it is
necessary to visit each (state, action) pair; for this reason we choose an
action randomly, respecting a Boltzmann probability distribution
(see Figure 13)):

P(a | s) = e^{Q(s,a)/T} / ∑_{ai∈A} e^{Q(s,ai)/T}

The Boltzmann function gives, for a given action, the probability P(a | s)
of being chosen. T is the Boltzmann temperature. When it is high, the
temperature ensures a sufficient exploration; then, as it gradually
decreases towards 0, it reduces the room for manoeuvre, because the
probabilities end up converging towards 1 for the best action and 0 for
the others.

3 Execute the selected action.

4 Perceive the new state, digitise it, and obtain the reward associated with
this new state.
5 Update the Q-value of the preceding state with the selected action:

Qt+1(st, at) = (1 − α) Qt(st, at) + α [Rt(st, at) + γ max_{ai∈A} Qt(st+1, ai)]

6 Return to step 2.
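Steps 1–6, including the Boltzmann selection of step 2 and the decaying learning rate α = 1/(number of visits), can be put together in a small agent loop. The two-state environment below (a machine that is stopped or working) and its rewards are hypothetical, not the paper's SPART/MCSRR model:

```python
import math
import random

# Hypothetical environment: a machine that is either "stopped" or "working".
# "start" from "stopped" earns +4, "produce" while "working" earns +2,
# anything else earns -1 and stops the machine (illustrative rewards).
def env_step(state, action):
    if state == "stopped" and action == "start":
        return "working", 4.0
    if state == "working" and action == "produce":
        return "working", 2.0
    return "stopped", -1.0

actions = ["start", "produce", "stop"]
Q = {}                    # Q-table, entries default to 0 as in Table 1
visits = {}               # visit counts for the decaying alpha = 1/n rule
gamma = 0.8
rng = random.Random(42)   # fixed seed for reproducibility

def select(state, temperature):
    """Step 2: Boltzmann exploration over the Q-values of this state."""
    weights = [math.exp(Q.get((state, a), 0.0) / temperature) for a in actions]
    r = rng.random() * sum(weights)
    cumulative = 0.0
    for a, w in zip(actions, weights):
        cumulative += w
        if r <= cumulative:
            return a
    return actions[-1]    # numerical safety fallback

state, temperature = "stopped", 5.0
for _ in range(500):
    a = select(state, temperature)                       # step 2
    new_state, reward = env_step(state, a)               # steps 3 and 4
    visits[state, a] = visits.get((state, a), 0) + 1
    alpha = 1.0 / visits[state, a]                       # alpha = 1/(nbr of visits)
    future = max(Q.get((new_state, a2), 0.0) for a2 in actions)
    Q[state, a] = ((1 - alpha) * Q.get((state, a), 0.0)
                   + alpha * (reward + gamma * future))  # step 5
    state = new_state                                    # step 6: loop
    temperature = max(0.1, temperature * 0.99)           # gradual cooling of T
```

With the gradual cooling of the temperature, the agent ends up almost always choosing the action with the greatest Q-value, as described in step 2; here "start" becomes the best action in the "stopped" state.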

Figure 11 A ‘resource’ agent behaviour flow chart

Figure 12 Product agent behaviour flow chart


Figure 13 The Boltzmann algorithm

Figure 14 The learning algorithm

Let us note that the convergence of this algorithm is established in Mitchell (1997).
Let Q̂_n(s, a) denote the value of Q(s, a) after the nth update (the nth visit of the (s, a)
pair). If each (state, action) pair is visited infinitely often, Q̂(s, a) converges towards
the exact value Q(s, a) as n → ∞, for every s and a.
The convergence of the Q-learning algorithm means that after a sufficiently large
number of experiments the Q-values remain stable, that is, an optimal policy has been
determined; it consists in always choosing the action with the maximum
Q-value.

4.4 The total performance

It is true that control and the reaction to risks are immediate (real time), but globality
is not completely ensured, because the system behaviour results from the convergent
decisions of its agents. We were therefore inspired by Drogoul et al. (1998): this
approach uses a particular type of agent called the 'observer agent'. In new fields of
research, experts are invited to interactively define their behaviours by a set of roles
and to formalise their behaviour vis-à-vis a problem, that is, to represent their
knowledge. Agents equipped with learning capacities and certain forms of decisional
autonomy are then used to automate this extraction process.

4.4.1 The observer agent

In our system, this agent has a global view of the system; the state variables it
observes are the total performance indicators. This agent observes every
decision-making centre in order to learn from its behaviour (agents, human expert
and IDSS). It is a learning-centred interaction. The observer agent plays a triple role:
• first, the interpretation and digitisation of the perceived state or
received information
• second, the recording of the received messages
• third, the calculation of the criteria used to evaluate the system's learning level.
Figure 15 shows this agent's behaviour model.
Communication and interaction are strongly favoured within this architecture,
thanks to the Alaadin organisational structure, which offers a wide choice of
communication protocols (message sending, broadcast, point-to-point, etc.) and
message types (string messages, act messages, etc.).

Figure 15 An Observer agent behaviour flow chart

The observer agent will have the following data:


According to this specification, SPART/MCSRR has the following architecture
(Figure 16):

Figure 16 SPART architecture

5 Implementation and experiments

This model was implemented with Borland JBuilder because it offers very great
communication possibilities, facilitates thread programming and, especially, because
we used J. Ferber's MadKit platform for the multiagent system development
(MadKit, 2002). The application comprises several panels:
• the workshop configuration panel: to introduce the various resources and their
• the simulation launching panel: which starts the multiagent platform and, of
course, the learning process
• a disturbance provocation panel: to break down a machine, to introduce a
new order, etc.
• an interaction panel for the human expert, to allow him to act where
• a results visualisation panel: graphs, etc.
• a panel for visual simulation using Java 3D (see Figure 17).

Figure 17 3D Simulation of a manufacture workshop

The advantage of Q-learning is that it allows an evaluation during the learning itself.
To carry out this evaluation, we chose the following criteria, inspired by Charton et al. (2003):
The percentage PSucc of tasks carried out successfully (within their deadlines):
PSucc = (NTSucc / NT) × 100
knowing that NTSucc is the number of tasks carried out successfully and NT is the total
number of tasks.
The average number of iterations per task, lMoy. This criterion represents the number
of attempts needed before a task is carried out successfully: lMoy = M / NT,
knowing that M is the iteration count.
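A minimal sketch of these two criteria over a hypothetical task log (illustrative Python; the record layout and names are ours):

```python
def evaluate(tasks):
    """tasks: list of (within_deadline, attempts) records, one per task."""
    nt = len(tasks)                            # NT: total number of tasks
    nt_succ = sum(1 for ok, _ in tasks if ok)  # NTSucc: tasks within deadline
    m = sum(att for _, att in tasks)           # M: total iteration count
    p_succ = 100.0 * nt_succ / nt              # PSucc = (NTSucc / NT) x 100
    l_moy = m / nt                             # lMoy = M / NT
    return p_succ, l_moy
```

Both indicators can be recomputed after each task, which is what allows the evaluation to run during the learning rather than only at its end.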
We took two cases of a workshop configuration example (see Figures 18 and 19).

Figure 18 Workshop configuration 1


Figure 19 Workshop configuration 2

In the first experiment, we chose the database Production 03 with the following
configuration: one batch of products, two types of machines and two stocks.
To carry out this experiment over a longer duration, and in order to better observe
the system evolution, we assumed an infinite stream of incoming orders for the P1
product type. We caused only one type of disturbance, the breakdown of machine M1,
and we did not involve the human expert.
The obtained results are represented in the following graphs (see Figures 20 and 21).

Figure 20 The first experiment results (PSucc)

In this second experiment, we chose the database Production 01 with two batches of
products, three machine types and two stocks.
We caused disturbances by breaking down machines M1, M2 and M3 at random,
we involved the human expert and, still considering an infinite stream of orders for
products P1 and P2, we obtained the following results (see Figures 22 and 23).

Figure 21 The first experiment results (lMoy)

Figure 22 The second experiment results (PSucc)

Figure 23 The second experiment results (lMoy)


Comparing the graphs of the two experiments, one would expect the first experiment
to reach an optimal policy more quickly than the second, because of the complexity
of the tasks to be managed in the second experiment.
Yet the second experiment shows a faster convergence (around 60 iterations against
100 for the first), thanks to the diversity of the learning sources (the various resources
of the system and the expert interventions) and especially to the fact that these
resources share a collective learning memory (one Q-table for all the machines, one
Q-table for all the products, etc.).
As explained before, reinforcement learning goes through two stages: an exploration
stage, in which all state/action pairs are tested without a well-defined policy, and an
exploitation stage, which makes it possible to choose the actions to be carried out
according to the built policy.
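These two stages can be illustrated by a cooling schedule on the Boltzmann temperature (illustrative Python; the geometric decay constant and floor value are assumptions, not the paper's settings):

```python
import math

def boltzmann(qs, T):
    # softmax over a list of Q-values at temperature T
    m = max(qs)
    exps = [math.exp((q - m) / T) for q in qs]
    z = sum(exps)
    return [e / z for e in exps]

def temperature(t, T0=10.0, decay=0.99, T_min=1e-3):
    # geometric cooling: high T (exploration) early, T near 0 (exploitation) later
    return max(T0 * decay ** t, T_min)

qs = [1.0, 0.5, 0.0]
early = boltzmann(qs, temperature(0))     # exploration stage: near-uniform draw
late = boltzmann(qs, temperature(2000))   # exploitation stage: best action dominates
```

Early on, every action keeps a comparable probability of being tried; once the temperature has decayed, the action with the highest Q-value is selected almost deterministically.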
We carried out an experiment covering these two stages: we caused the same
disturbance twice in succession, with an interval of 10 iterations, by breaking down
all the workshop machines (see Figures 24 and 25).

Figure 24 Broken down setting of the machines

Figure 25 Obtained results with caused disturbances (PSucc)


We noticed that, in the exploitation stage, the disturbance is quickly absorbed and the
system is brought back to its best performance level.
Lastly, these results show that our system is able to learn an optimal control policy,
continuously improved, while reducing the number of attempts needed to carry out
tasks correctly and within their deadlines.

6 Conclusion

In this paper, we first clarified the complex problems we tackled: control functions
and activities, manufacturing control system structures, and the motivations for
choosing a heterarchic structure of the control system in order to guarantee total
performance (efficiency and effectiveness). We then presented a state of the art of the
work completed in the control systems field, which encouraged us to adopt the
multiagent approach for our system modelling, introducing a learning technique in
order to ensure an improvement of the agents' behaviour.

With this intention, we studied the learning techniques and, in this paper, presented
reinforcement learning as a very reactive technique, which consists in searching for
the optimal action to execute in each system state. We then showed how we used this
technique in our agents by presenting their architecture and that of the SPART system
we designed. Lastly, we carried out an experimentation of our model to prove the
positive contribution of integrating reinforcement learning into a heterarchic control
system for the continuous improvement of the system performance (efficiency and
effectiveness).

Following the experiments carried out, we could evaluate the capacity of this system
to learn an optimal policy within a short time. The results were satisfying, although
we noted some limits when facing the complexity of the manufacturing system.
We therefore plan various evolutions of our prototype to allow the treatment of more
complex systems. To this end, we believe that the use of distributed systems, which
offer powerful calculation techniques such as computing grids, can be beneficial.

We also consider using other reinforcement learning algorithms (such as the
Q III-Learning of Aydin and Özteme (1998)), which may converge more quickly
towards the best policy, in order to optimise the learning time and perhaps its quality
too, as we must also work on the representation of the perceived system states in
order to optimise the Q-table entries.

References

Albadawi, Z., Boulet, B., DiRaddo, R., Girard, P., Rail, A. and Thomson, V. (2006) 'Agent-based
control of manufacturing processes', International Journal of Manufacturing Research
(IJMR), Vol. 1, No. 4, pp.466–481.
Aydin, M.E. and Özteme, E. (1998) ‘Generalization of experiences for reinforcement learning
agents’, in I. Cicekli (Ed). Proceedings of 7th Turkish Artificial Intelligence and Neural
Networks Symposium (TAINN.98), 26–28 June 1997, Bilkent University, Ankara, Turkey.
Bellman, R. (1957) Dynamic Programming, Princeton University Press, p.83.
Bhattacharya, B. and Lobbrecht Ah, S. (2002) ‘Control of water levels of regional water systems
using reinforcement learning’, Proceedings of 5th International Conference on
Hydroinformatics, Cardiff, UK, 1–5 July, pp.952–957.

Bousbia, S. and Trentesaux, D. (2002) 'Self-organization in distributed manufacturing control:
state-of-the-art and future trends', in A. El Kamel, K. Mellouli and P. Borne (Eds). IEEE
International Conference on Systems, Man and Cybernetics, Vol. 5, CD-Rom, ISBN:
2-9512309-4-X, Hammamet, Tunisia, October 2002, paper #WA1L1, p.6.
Bousbia, S. and Trentesaux, D. (2004) ‘Towards an intelligent heterarchical manufacturing control
system for continuous improvement of manufacturing systems performances’, in
A.N. Bramley, S.J. Culley, E. Dekoninck, C.A. McMahon, A.J. Medland, A.R. Mileham,
L.B. Newnes and G.W. Owen (Eds). 5th International Conference on Integrated Design and
Manufacturing in Mechanical Engineering – IDMME’2004, CD-Rom, ISBN: 1-85790-129-0,
University of Bath, UK: Hadleys Ltd – Essex, April, p.125 (abstract), paper #123, p.10.
Charton, R., Boyer, A. and Charpillet, F. (2003) ‘Learning of mediation strategies for
heterogeneous agents cooperation’, in B. Werner (Ed). 15th IEEE International Conference on
Tools with Artificial Intelligence – ICTAI'2003, Sacramento, Californie, USA,
IEEE Computer Society, November, pp.330–337.
Chen, R., Lu, K. and Chang, C. (2003) ‘Application of the multi-agent approach in just-in-time
production control system’, International Journal of Computer Applications in Technology
(IJCAT), Vol. 17, No. 2, pp.90–100.
Choy, M.C., Cheu, R.L., Srinivasan, D. and Logi, F. (2003) ‘Real-time coordinated signal control
through use of agents with online reinforcement learning’, Publisher: Transportation Research
Board, No. 1836, pp.64–75.
Drogoul, A., Vanbergue, D. and Meurisse, T. (1998) ‘Simulation orientée agent: où sont les
agents ?’ LIP6 – Université Paris 6 Les Cahiers des Enseignements Francophones en
Roumanie, pp.110–130.
Duffie, N.A. and Prabhu, V.V. (1996) ‘Heterarchical control of highly distributed manufacturing
Systems’, International Journal of Computer Integrated Manufacturing, Vol. 9, No. 4,
Ferber, J. and Gutknecht, O. (1998) 'Un méta-modèle organisationnel pour l'analyse, la conception
et l'exécution de systèmes multi-agents', Laboratoire d'Informatique, Robotique et
Micro-électronique de Montpellier, Proceedings of 3rd International Conference on
Multi-Agent Systems (ICMAS), pp.128–135.
Gopalakrishnan, B., Banta, L. and Bhave, G. (2007) ‘Modelling steam-based energy supply chain
in an integrated steel manufacturing facility: a simulation-based approach’, International
Journal of Industrial and Systems Engineering (IJISE), Vol. 2, No. 1, pp.1–29.
Haruno, M. and Kawato, M. (2006) ‘Heterarchical reinforcement-learning model for integration of
multiple cortico-striatal loops: fMRI examination in stimulus-action-reward association
learning’, Neural Networks, Vol. 19, Special Issue, pp.1242–1254.
Huyet, A. and Limos, J.P. (2004) 'Extraction de connaissances pertinentes sur le comportement des
systèmes de production: une approche conjointe par optimisation évolutionniste via simulation
et apprentissage', PhD thesis, Université Blaise Pascal – Clermont II, October.
Isbell, C.L., Shelton, C.R., Kearns, M., Singh, S. and Stone, P. (2000) ‘A social reinforcement
learning agent’, Proceedings of the Fifth International Conference on Autonomous Agents
(Agents-2001), pp.377–384.
Katalinic, B. and Kordic, V. (2004) ‘Bionic assembly system: concept, structure and function’,
Proceeding of the 5th International Conference on Integrated Design and Manufacturing in
Mechanical Engineering, IDMME, Bath, UK, 5–7 April.
MadKit (2002) ‘Multi agents development kit’, Available at: http://www.madkit.org/
Maione, G. and Naso, D. (2003) ‘Discrete-event modeling of heterarchical manufacturing control
systems’, Systems, Man and Cybernetics, 2004 IEEE International Conference, Vol. 2,
10–13 October, pp.1783–1788.
Marthi, B., Russell, S., Latham, D. and Guestrin, C. (2005) ‘Concurrent hierarchical reinforcement
learning’, Proceedings of International Joint Conference on Artificial Intelligence, IJCAI-05,

McCulloch, W.S. (1945) ‘A heterarchy of values determined by the topology of nervous nets’,
Bull. math. biophys. Vol. 7, pp.89–93.
Mitchell, T.M. (1997) Machine Learning, McGraw-Hill Science/Engineering/Math, March.
Ouelhadj, D., Petrovic, S., Cowling, P. and Meisels, A. (2005) ‘Inter-agent cooperation and
communication for agent-based robust dynamic scheduling in steel production’, Advanced
Engineering Informatics, Vol. 18, No. 3, pp.161–172.
Prabhu, V.V. (2003) ‘Stability and fault adaptation in distributed control of heterarchical
manufacturing job shops’, IEEE Transactions on Robotics and Automation, Vol. 19, No. 1,
Querrec, R., Tarot, S., Chevaillier, P. and Tisseau, J. (1997) 'Simulation d'une cellule de
production. Utilisation d'un modèle à base d'agents contrôlés par réseaux de Petri', Colloque
de recherche Doctorale AGIS'97, pp.209–214.
Reaidy, J., Massotte, P. and Diep, D. (2006) ‘Comparison of negotiation protocols in dynamic
agent-based manufacturing systems’, International Journal of Production Economics,
Vol. 99, pp.117–130.
Russell, S. and Norvig, P. (1995) ‘Artificial intelligence: a modern approach’, The Intelligent Agent
Book, Prentice Hall Series in Artificial Intelligence.
Singh, S., Kearns, M., Littman, D. and Walker, M. (2000) ‘Empirical evaluation of a reinforcement
learning spoken dialogue system’, Proceedings of 17th National Conference on Artificial
Intelligence and 12th Conference on Innovative Applications of Artificial Intelligence,
Sutton, R.S. and Barto, A.G. (1998) Reinforcement Learning: An Introduction, Cambridge, MA:
MIT Press.
Wang, L. and Shen, W. (2003) ‘DPP: an agent-based approach for distributed process planning’,
Journal of Intelligent Manufacturing, October, Vol. 14, No. 5, pp.429–439.
Watkins, C.J.C.H. (1989) ‘Learning from delayed rewards’, PhD thesis, Cambridge University,
Cambridge, England.
Zhou, Z.D., Wang, H.H., Chen, Y.P., Liu, Q., Ong, S.K., Fuh, J.Y.H. and Nee, A.Y.C. (2002)
‘A multi-agent-based agile scheduling model for a virtual manufacturing environment’,
Proceedings of Artificial Intelligence and Applications AIA, p.46.