With regard to future service robots, unsafe exceptional situations can arise in complex systems and are hard to foresee. In this paper, the assumption of having no prior knowledge about the environment is investigated, using reinforcement learning as an option for learning behavior by trial and error. In such a scenario, action-selection decisions are made on the basis of future reward predictions so as to minimize the cost of reaching a goal. It is shown that the selection of safety-critical actions, which incur highly negative costs from the environment, is directly related to the exploration/exploitation dilemma in temporal-difference learning. To this end, several exploration policies are investigated with regard to their worst- and best-case performance in a dynamic environment. Our results show that, in contrast to established exploration policies such as epsilon-greedy and Softmax, the recently proposed VDBE-Softmax policy appears more appropriate for such applications due to the robustness of its exploration parameter in unexpected situations.
Voos, Holger ; University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Engineering Research Unit ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT)
Language :
English
Title :
Robust Exploration/Exploitation Trade-Offs in Safety-Critical Applications
Publication date :
2012
Event name :
8th IFAC Int. Symposium on Fault Detection, Supervision and Safety for Technical Processes
Event place :
Mexico City, Mexico
Event date :
29-31 August 2012
Audience :
International
Main work title :
8th IFAC Int. Symposium on Fault Detection, Supervision and Safety for Technical Processes, Mexico City, 29-31 August 2012