[en] The fields of Reinforcement Learning (RL) and Optimization aim at finding an optimal solution to a problem, characterized by an objective function. The exploration-exploitation dilemma (EED) is a well-known subject in these fields: a substantial body of literature has been devoted to it and has shown that handling it properly is essential to achieving good performance. Yet, many real-life problems involve the optimization of multiple objectives. Multi-Policy Multi-Objective Reinforcement Learning (MPMORL) offers a way to learn a variety of optimised behaviours for the agent in such problems. This work introduces a modular framework for the learning phase of such algorithms, which eases the study of the EED in Inner-Loop MPMORL algorithms. We present three new exploration strategies inspired by the metaheuristics domain. To assess the performance of our methods on various environments, we use a classical benchmark, the Deep Sea Treasure (DST), and also propose a harder version of it. Our experiments show that all of the proposed strategies outperform the current state-of-the-art ε-greedy-based methods on the studied benchmarks.
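For context, the ε-greedy baseline referred to above selects a uniformly random action with probability ε and otherwise acts greedily with respect to the current value estimates. The sketch below is only a minimal illustration of that baseline under a linear scalarisation of the objective vector; it does not reproduce the paper's proposed metaheuristics-based strategies, and the Q-table layout, weights and sizes are hypothetical.

    import numpy as np

    def epsilon_greedy_action(q_values, state, weights, epsilon, rng):
        # q_values: hypothetical multi-objective Q-table of shape
        #           (n_states, n_actions, n_objectives).
        # weights:  linear scalarisation weights over the objectives.
        n_actions = q_values.shape[1]
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))    # explore: random action
        scalarised = q_values[state] @ weights     # exploit: scalarise the vector estimates
        return int(np.argmax(scalarised))

    # Example usage with made-up sizes: 10 states, 4 actions, 2 objectives.
    rng = np.random.default_rng(0)
    q = np.zeros((10, 4, 2))
    action = epsilon_greedy_action(q, state=0, weights=np.array([0.5, 0.5]), epsilon=0.1, rng=rng)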
Research center:
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > Parallel Computing & Optimization Group (PCOG)
Disciplines:
Computer science
Author, co-author:
Felten, Florian; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > PCOG
Danoy, Grégoire; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS); University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > PCOG
Talbi, El-Ghazali; University of Lille, CNRS/CRIStAL, Inria Lille, France
Bouvry, Pascal; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS); University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > PCOG
External co-authors:
Yes
Document language:
English
Title:
Metaheuristics-based Exploration Strategies for Multi-Objective Reinforcement Learning
Publication date:
2022
Event name:
14th International Conference on Agents and Artificial Intelligence
Event dates:
from 03-02-2022 to 05-02-2022
Event scope:
International
Title of the main publication:
Proceedings of the 14th International Conference on Agents and Artificial Intelligence
Publisher:
SCITEPRESS - Science and Technology Publications, Online Streaming, unknown/not specified