Bayesian Reinforcement Learning

This chapter surveys recent lines of work that use Bayesian techniques for reinforcement learning. In Bayesian learning, uncertainty is expressed by a prior distribution over unknown parameters and learning is achieved by computing a posterior distribution based on the data observed. Hence, Bayesian reinforcement learning distinguishes itself from other forms of reinforcement learning by explicitly maintaining a distribution over various quantities such as the parameters of the model, the value function, the policy or its gradient. This yields several benefits: a) domain knowledge can be naturally encoded in the prior distribution to speed up learning; b) the exploration/exploitation tradeoff can be naturally optimized; and c) notions of risk can be naturally taken into account to obtain robust policies.

Research center :

Luxembourg Centre for Systems Biomedicine (LCSB): Machine Learning (Vlassis Group)

Disciplines :

Computer science

Identifiers :

UNILU:UL-CHAPTER-2012-428

Author, co-author :

Vlassis, Nikos ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB)

Ghavamzadeh, Mohammad

Mannor, Shie

Poupart, Pascal

External co-authors :

yes

Language :

English

Title :

Bayesian Reinforcement Learning

Publication date :

2012

Main work title :

Reinforcement Learning: State of the Art

Editor :

Wiering, Marco

van Otterlo, Martijn

Publisher :

Springer

ISBN/EAN :

978-3-642-27645-3

Pages :

359-386

Peer reviewed :

Peer reviewed

Additional URL :

http://link.springer.com/chapter/10.1007/978-3-642-27645-3_11

Available on ORBilu :

since 04 July 2013
