With regard to future service robots, unsafe exceptional situations can arise in complex systems and are hard to foresee. In this paper, the assumption of having no prior knowledge about the environment is investigated, using reinforcement learning as an option for learning behavior by trial and error. In such a scenario, action-selection decisions are made on the basis of future reward predictions so as to minimize the cost of reaching a goal. It is shown that the selection of safety-critical actions, which incur highly negative costs from the environment, is directly related to the exploration/exploitation dilemma in temporal-difference learning. To this end, several exploration policies are investigated with regard to their worst- and best-case performance in a dynamic environment. Our results show that, in contrast to established exploration policies such as epsilon-greedy and Softmax, the recently proposed VDBE-Softmax policy appears more appropriate for such applications due to the robustness of its exploration parameter in unexpected situations.
Voos, Holger ; University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Engineering Research Unit ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT)
Language :
English
Title :
Robust Exploration/Exploitation Trade-Offs in Safety-Critical Applications
Publication date :
2012
Event name :
8th IFAC Int. Symposium on Fault Detection, Supervision and Safety for Technical Processes
Event place :
Mexico City, Mexico
Event date :
29-31 August 2012
Audience :
International
Main work title :
8th IFAC Int. Symposium on Fault Detection, Supervision and Safety for Technical Processes, Mexico City, 29-31 August 2012