High Performance Computing (HPC) is nowadays a strategic asset required to sustain the surging demand for massive processing and data-analytics capabilities. In practice, the effective management of such large-scale and distributed computing infrastructures is left to a Resource and Job Management System (RJMS). This essential middleware component is responsible for managing the computing resources and handling user requests to allocate resources, while providing an optimized framework for starting, executing and monitoring jobs on the allocated resources. The University of Luxembourg has been operating a large academic HPC facility for 15 years, which has relied since 2017 on the Slurm RJMS introduced on top of the flagship cluster Iris. The acquisition of a new liquid-cooled supercomputer named Aion, released in 2021, was the occasion to thoroughly review and optimize the initial Slurm configuration, the defined resource limits and the underlying fair-share algorithm.
This paper presents the outcomes of this study and details the implemented RJMS policy. The impact of the decisions made on the supercomputers' workloads is also described. In particular, the performance evaluation conducted highlights that, when compared to the initial configuration, the described and implemented environment brought concrete and measurable improvements with regard to platform utilization (+12.64%), job efficiency (as measured by the average Wall-time Request Accuracy, improved by 110.81%) and the management and funding (increased by 10%). The systems demonstrated sustainable and scalable HPC performance, and this effort incurred a negligible penalty on the average slowdown metric (response time normalized by runtime), which increased by 0.59% for job workloads covering a complete year of operation. Overall, this new setup has been in production for 18 months on both supercomputers, and the updated model proves to bring a fairer and more satisfying experience to the end users. The proposed configurations and policies may help other HPC centres when designing or improving the RJMS sustaining their job scheduling strategy in the context of computing capacity expansions.
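For clarity, the short sketch below illustrates how the two job-level metrics quoted above could be computed from a Slurm-style accounting trace. The slowdown follows the definition given in the abstract (response time normalized by runtime), while the Wall-time Request Accuracy is assumed here to be the ratio of actual runtime to requested wall-time, a common convention that the abstract does not spell out; the Job container and field names are illustrative, not the paper's implementation.

# Minimal sketch (not the paper's code): job-level metrics from an accounting trace.
# Assumption: Wall-time Request Accuracy = runtime / requested wall-time.
from dataclasses import dataclass

@dataclass
class Job:
    submit: float     # submission time (s)
    start: float      # start time (s)
    end: float        # end time (s)
    requested: float  # requested wall-time (s)

def slowdown(job: Job) -> float:
    """Response time (wait + run) normalized by runtime, as in the abstract."""
    wait = job.start - job.submit
    run = job.end - job.start
    return (wait + run) / run

def walltime_request_accuracy(job: Job) -> float:
    """Fraction of the requested wall-time actually used (assumed definition)."""
    run = job.end - job.start
    return run / job.requested

# Example: a job that waited 10 minutes, ran 1 hour, and requested 4 hours.
job = Job(submit=0, start=600, end=4200, requested=14400)
print(f"slowdown = {slowdown(job):.2f}")                                      # 1.17
print(f"wall-time request accuracy = {walltime_request_accuracy(job):.2%}")   # 25.00%

Averaging these per-job values over a full year of workload traces yields the aggregate figures reported in the evaluation.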
Research center:
ULHPC - University of Luxembourg: High Performance Computing
Disciplines:
Computer science
Author, co-author:
VARRETTE, Sébastien ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
KIEFFER, Emmanuel ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
PINEL, Frederic
External co-authors:
yes
Document language:
English
Title:
Optimizing the Resource and Job Management System of an Academic HPC and Research Computing Facility
Publication date:
July 2022
Event name:
21st IEEE Intl. Symp. on Parallel and Distributed Computing (ISPDC'22)
Event location:
Basel, Switzerland
Event date:
July 11-13, 2022
Event scope:
International
Title of the main publication:
21st IEEE Intl. Symp. on Parallel and Distributed Computing (ISPDC'22)