Abstract
Reinforcement learning (RL) addresses sequential decision-making under uncertainty, where an agent interacts with an unknown environment with the aim of maximising the cumulative long-term reward. Many real-world problems can benefit from RL, e.g., industrial robotics, medical treatment, and trade execution. As a representative model-free RL algorithm, the deep Q-network (DQN) has recently achieved great success on RL problems, even exceeding human performance, by introducing deep neural networks. However, such classical deep neural network-based models cannot handle the uncertainty in sequential decision-making well, which limits their learning performance. In this paper, we propose a new model-free RL algorithm based on a Bayesian deep model. Specifically, deep kernel learning (i.e., a Gaussian process with a deep kernel) is adopted to learn the hidden, complex action-value function in place of classical deep learning models; it can encode more uncertainty and fully exploit the replay memory. Comparative experiments on a standard RL testing platform, OpenAI Gym, show that the proposed algorithm outperforms DQN. Further investigations will be directed at applying RL to support dynamic decision-making in complex environments.
Keywords: Reinforcement learning, Uncertainty, Bayesian deep model, Gaussian process
1. Introduction

Based on whether a computational model of the environment is available, existing RL algorithms can be categorised as either model-based or model-free. Model-based RL algorithms improve policy learning by modelling the environment, e.g., simulating interactions with the environment model at hand or planning ahead using the model. Lacking such an environment model, model-free RL cannot infer the effects of environment dynamics and so cannot explore the space as efficiently. However, for some complex environments with large-scale state spaces, it is too difficult to build accurate models; model-free RL algorithms learn the policy directly without requiring an explicit environment model. Recently, a seminal model-free RL algorithm, the deep Q-network (DQN)13, achieved great success on Atari games and substantially promoted the development of RL. The strength of DQN comes from two aspects: 1) deep learning models: DQN adopts deep neural networks in place of the traditional Q-table for learning the hidden action-value function; for complex environments with a large or even near-infinite number of states, a Q-table is inappropriate because of its large storage and high computation requirements; 2) experience replay: historical interactions are saved as a replay memory and sampled in DQN to stochastically update the deep neural networks, which reduces the dependency between sequential interactions and thereby increases the efficiency of the algorithm. However, the classical deep neural networks used in DQN cannot handle uncertainty well due to their deterministic nature. By definition, RL aims to handle sequential decision-making under uncertainty, so capturing and dealing with uncertainty is essential for RL. Besides, all saved historical interactions are treated equally in DQN, even though they in fact contribute differently to model training and convergence.
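To make the experience-replay mechanism concrete, the following is a minimal sketch of a uniform replay memory of the kind used in DQN; the class and parameter names are illustrative rather than taken from the paper.

import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of past interactions (s, a, r, s', done)."""

    def __init__(self, capacity=10000):
        # Oldest interactions are evicted first once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling: every stored interaction is treated equally,
        # which is exactly the limitation discussed above.
        return random.sample(list(self.buffer), batch_size)

Breaking the temporal correlation between consecutive interactions in this way is what stabilises the stochastic network updates; the weighted sampling strategy proposed in this paper changes only how the sample() step draws from the buffer.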
In this paper, we propose a new model-free RL algorithm based on a Bayesian deep model, deep kernel learning28, which is able to encode more uncertainty and fully exploit the saved historical interactions. Specifically, deep kernel learning is a Gaussian process (GP)16 with a deep kernel modelled by a deep neural network, and thus has the advantages of both deep neural networks and GPs. We adopt deep kernel learning to replace the original deep neural network, which allows more uncertainty to be encoded: for example, the algorithm predicts not only the action value but also the variance of that prediction. Furthermore, a new weighted sampling strategy is developed: the prediction variance from deep kernel learning is used as the weight of each historical interaction, and the model is then updated using samples drawn according to these weights. The experimental evaluation on the standard RL testing platform OpenAI Gym (https://gym.openai.com/) demonstrates that the new algorithm reaches higher scores. In summary, the two main contributions of this article are as follows:

• we propose a new model-free RL algorithm based on deep kernel learning, with better performance than the classical deep neural network-based one;

• a new weighted sampling strategy is developed to fully exploit the replay memory and further improve the algorithm's performance.

The remainder of this article is organised as follows. Section 2 discusses related work. The new model-free RL algorithm is introduced in Section 3. Section 4 evaluates the proposed RL algorithm on a standard RL testing platform by comparing it with the classical algorithm. Section 5 concludes this study and discusses possible future work.
2. Related Work

In this section, we briefly review the work related to this study. The first part summarises the literature on model-free reinforcement learning and the second part summarises the literature on Bayesian deep models.

2.1. Model-free RL

Model-free RL aims to learn an optimal policy for an agent without explicitly modelling the environment. Existing works in this direction can be mainly grouped into three categories: value-based, policy-based, and combined22. Policy-based RL directly optimises the parameters of the policy to maximise the expected reward; policy gradient is a typical algorithm in this category. Value-based RL first learns the action-value function by minimising the temporal difference error and then derives the policy from this function. According to how the temporal difference error is computed, value-based RL comes in two kinds: on-policy methods, which compute the temporal difference error based on the current policy, e.g., SARSA; and off-policy methods, which compute it based on a greedy policy, e.g., Q-learning.
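To make the on-policy/off-policy distinction concrete, the standard one-step updates can be written as follows (textbook forms21,22, added here for completeness rather than reproduced from this paper):

\[ \text{SARSA:}\quad Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \eta\left[ r_t + \gamma\, Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) \right], \]

\[ \text{Q-learning:}\quad Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \eta\left[ r_t + \gamma \max_{a} Q(s_{t+1},a) - Q(s_t,a_t) \right], \]

where η is a learning rate and the bracketed term is the temporal difference error: SARSA bootstraps from the action a_{t+1} actually chosen by the current policy, whereas Q-learning bootstraps from the greedy action.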
Deep Q-Network (DQN)13 is a seminal Q-learning work that adopts a deep neural network to approximate the Q-function. The algorithm has proved successful on Atari games and even exceeds human performance. Based on DQN, Double-DQN24 was proposed to separately model action selection and the Q-function with two deep neural networks, which resolves the possible overestimation problem in DQN. Performance can be improved further by splitting the Q-function into a state-value function and an advantage function, as in Dueling DQN26. There are also works that try to combine the two categories, e.g., Deep Deterministic Policy Gradient10.
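For reference, Double-DQN's fix for overestimation amounts to decoupling action selection from action evaluation in the bootstrap target (standard forms13,24, added for completeness):

\[ y^{\text{DQN}} = r + \gamma \max_{a} Q_{\theta^-}(s', a), \qquad y^{\text{Double}} = r + \gamma\, Q_{\theta^-}\big(s',\, \arg\max_{a} Q_{\theta}(s', a)\big), \]

where θ denotes the online network parameters and θ⁻ those of the target network.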
This paper targets only DQN, replacing its traditional deep neural network with a Bayesian deep model; other advanced extensions can easily be adopted later in the same way.

2.2. Bayesian deep model

The Bayesian paradigm for machine learning, also known as Bayesian (machine) learning, applies probability theories and techniques to represent and learn knowledge from data. Some renowned examples are Bayesian networks, Gaussian mixture models, hidden Markov models15, Markov random fields, conditional random fields, and latent Dirichlet allocation1. Compared to other learning paradigms, Bayesian learning has distinctive advantages: 1) representing, manipulating, and mitigating uncertainty based on a solid theoretical foundation, probability; 2) encoding prior knowledge about a problem; 3) good interpretability thanks to its clear and meaningful probabilistic structure. Inspired by the recent success of deep learning, researchers have started to build Bayesian deep models25 that aim to combine the advantages of the two areas. One straightforward strategy is to place probabilistic priors on the parameters of traditional deep neural networks, e.g., Gaussian priors6,4; such Bayesian methods can avoid some of the pitfalls of stochastic optimisation, and Bayes by Backprop2, a variational inference method, was proposed to solve these models efficiently. Another strategy is to stack probabilistic models deeply. Deep belief networks8 may be the earliest example: a deep belief network has undirected connections between its top two layers (which actually form an RBM7) and downward directed connections between all its lower layers; it needs less time to train, but there is no feedback from top to bottom. In contrast, the Deep Boltzmann Machine (DBM)18, a multi-layered RBM, requires more training time, but its top-to-bottom feedback makes it more robust. The hidden units in DBNs and DBMs are restricted to be binary. To remove this constraint, Gamma belief networks29 and deep Poisson factor analysis5 were proposed using Gamma and Negative-binomial distributions, which can build deep structures with nonnegative real-valued hidden units. The deep latent Gaussian model17 is another deep Bayesian model built from Gaussian distributions. Beyond Gaussian distributions, the Gaussian process (GP) has also been adopted for constructing Bayesian deep models: a deep neural network can be used to construct a deep kernel as the covariance function of a GP, as in deep kernel learning28,27, or Gaussian processes can be stacked on one another, as in the deep Gaussian process3.

3. The Proposed Model

RL is often modelled as a Markov decision process (https://en.wikipedia.org/wiki/Markov_decision_process) <S, A, P, R, γ>, where S is a set of environment states, A is a set of agent actions, P defines the state transitions of the environment, R defines the reward from the environment, and γ is the discount factor.
The goal of RL is to learn an agent policy π(s), a mapping from environment states to actions, that maximises the (discounted) expected long-term reward E[∑_{t=1}^{∞} γ^t r_t], where r_t is the reward from the t-th interaction.
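In these terms, the action-value function that DQN and the proposed model approximate is the expected return after taking action a in state s and following policy π thereafter (a standard definition22, added for completeness):

\[ Q^{\pi}(s, a) = \mathbb{E}\Big[ \sum_{t=1}^{\infty} \gamma^{t} r_t \,\Big|\, s_1 = s,\; a_1 = a,\; \pi \Big]. \]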
GP is often used as a prior over functions and can approximate complex, nonlinear functional forms given a number of observations. To further improve this ability at function approximation, deep kernel learning (DKL) incorporates the deep learning idea into kernel learning…
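The deep kernel referred to here follows Wilson et al.28: a base kernel is applied to inputs first transformed by a deep neural network g(·; w) (this equation is quoted from ref. 28 rather than recovered from the present text):

\[ k_{\text{deep}}(x_i, x_j) = k\big( g(x_i; w),\, g(x_j; w) \,\big|\, \theta \big), \]

where k(·, · | θ) is a base kernel (e.g., an RBF kernel) with hyperparameters θ, and the network weights w are learned jointly with θ by maximising the GP marginal likelihood. Because the model remains a GP, its posterior yields both a predictive mean and a predictive variance; the latter is the uncertainty signal that the proposed weighted sampling strategy exploits.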
…fluctuation than the curve for DQN, which means the results from BDRL are less stable than those from DQN.

(Figure: testing scores per episode for DQN and BDRL; y-axis: Score.)

…weighted (according to Eq. (6)) with α = 0.1. Note that during the exploring stage the action is chosen randomly and the weights are likewise assigned randomly. The result is shown in Fig. 3, where the scores of all episodes in the testing stage are plotted. From this figure we can observe that: 1) the number of episodes for BDRL with weighted sampling (73) is smaller than that for BDRL with uniform sampling (88) within the same number of steps (i.e., 10,000); 2) the average score of BDRL with weighted sampling (136.75)…
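Eq. (6) itself is not shown in this excerpt, so the following is only a hedged sketch of what variance-weighted replay sampling consistent with the description above could look like: each stored interaction is weighted by its predictive variance from the deep kernel GP plus the smoothing constant α, and mini-batches are drawn in proportion to these weights. All names are illustrative, and the weighting formula is an assumption standing in for Eq. (6).

import numpy as np

def sample_weighted(memory, variances, batch_size, alpha=0.1):
    """Draw a mini-batch with probability proportional to predictive variance.

    memory    -- list of stored interactions (s, a, r, s', done)
    variances -- per-interaction predictive variances from the deep kernel GP
    alpha     -- smoothing constant (assumed stand-in for Eq. (6)); keeps
                 low-variance interactions reachable
    """
    weights = np.asarray(variances, dtype=float) + alpha  # assumed weighting
    probs = weights / weights.sum()                       # normalise to a distribution
    indices = np.random.choice(len(memory), size=batch_size, p=probs)
    return [memory[i] for i in indices]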
…in which the goals and constraints are fuzzy in nature. These future investigations are expected to provide backbones for applying RL techniques to support dynamic decision-making in complex situations that involve massive data domains, agents, and environments.

Acknowledgments

This work is supported by the Australian Research Council (ARC) under Discovery Grant DP170101632.
References

1. David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993-1022, 2003.
2. Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
3. Andreas Damianou and Neil Lawrence. Deep Gaussian processes. In Artificial Intelligence and Statistics, pages 207-215, 2013.
4. Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158, 2015.
5. Ricardo Henao, Zhe Gan, James Lu, and Lawrence Carin. Deep Poisson factor modeling. In Advances in Neural Information Processing Systems 28, pages 2800-2808. Curran Associates, Inc., 2015.
6. José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861-1869, 2015.
7. Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.
8. Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.
9. Abhishek Kumar and Rajneesh Sharma. Linguistic Lyapunov reinforcement learning control for robotic manipulators. Neurocomputing, 272:84-95, 2018.
10. Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
11. Ying Liu, Brent Logan, Ning Liu, Zhiyuan Xu, Jian Tang, and Yangzhi Wang. Deep reinforcement learning for dynamic treatment regimes on medical registry data. In 2017 IEEE International Conference on Healthcare Informatics (ICHI), pages 380-385. IEEE, 2017.
12. Francisco S. Melo. Convergence of Q-learning: A simple proof. Institute of Systems and Robotics, Tech. Rep., pages 1-4, 2001.
13. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
14. Yuriy Nevmyvaka, Yi Feng, and Michael Kearns. Reinforcement learning for optimized trade execution. In Proceedings of the 23rd International Conference on Machine Learning, pages 673-680. ACM, 2006.
15. Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.
16. Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
17. Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pages II-1278. JMLR.org, 2014.
18. Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann machines. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 448-455. PMLR, 2009.
19. Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
20. David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
21. Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.
22. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
23. Yee Whye Teh and Michael I. Jordan. Hierarchical Bayesian nonparametric models with applications. Bayesian Nonparametrics, 1:158-207, 2010.
24. Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI, volume 2, page 5, 2016.
25. Hao Wang and Dit-Yan Yeung. Towards Bayesian deep learning: A framework and some existing methods. IEEE Transactions on Knowledge and Data Engineering, 28(12):3395-3408, 2016.
26. Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pages 1995-2003, 2016.
27. Andrew G. Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pages 2586-2594, 2016.
28. Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370-378, 2016.
29. Mingyuan Zhou, Yulai Cong, and Bo Chen. Augmentable gamma belief networks. Journal of Machine Learning Research, 17(1):5656-5699, 2016.