Abstract :
With the advent of the technological revolution and the digital transformation that has made all scientific disciplines computational, High Performance Computing (HPC) has become a strategic and critical asset for leveraging new research and business in all domains requiring computing and storage performance. Since 2007, the University of Luxembourg has operated a large academic HPC facility which remains the reference implementation within the country. This paper provides a general description of the current platform implementation as well as of its operational management choices, which have been adapted to the integration of a new liquid-cooled supercomputer, named Aion, released in 2021. The administration of an HPC facility providing state-of-the-art computing systems, storage and software is indeed a complex and dynamic enterprise whose sole purpose is to offer an enhanced user experience for intensive research computing and large-scale analytic workflows. Most design choices and feedback described in this work have been motivated by several years of experience in addressing, in a flexible and convenient way, the heterogeneous needs inherent to an academic environment aiming at research excellence. The different layers and stacks used within the operated facilities are reviewed, in particular with regard to user software management and the adaptation of the Slurm Resource and Job Management System (RJMS) configuration with novel incentive mechanisms. In practice, the described and implemented environment brought concrete and measurable improvements with regard to platform utilization (+12.64%), job efficiency (average Wall-time Request Accuracy improved by 110.81%), and management and funding (increased by 10%). A thorough performance evaluation of the facility is also presented through reference benchmarks such as HPL, HPCG, Graph500, IOR and IO500. It reveals sustainable and scalable performance comparable to the most powerful supercomputers in the world, including on energy-efficiency metrics (for instance, 5.19 GFlops/W (resp. 6.14 MTEPS/W) was demonstrated for full HPL (resp. Graph500) runs across all Aion nodes).
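Illustrative sketch (not part of the record): the two efficiency figures quoted above can be read with the assumed definitions below, namely a Green500-style GFlops/W ratio (sustained HPL performance divided by average power draw) and Wall-time Request Accuracy taken as the ratio of consumed to requested wall-time. All names and numbers in this Python example are hypothetical.

    def walltime_request_accuracy(used_seconds: float, requested_seconds: float) -> float:
        # Assumed definition: fraction of the requested wall-time actually consumed by a job.
        return used_seconds / requested_seconds

    def hpl_energy_efficiency(rmax_gflops: float, avg_power_watts: float) -> float:
        # Green500-style metric: sustained HPL performance per watt, in GFlops/W.
        return rmax_gflops / avg_power_watts

    if __name__ == "__main__":
        # Hypothetical job: 3 hours consumed out of 12 hours requested.
        print(f"Wall-time Request Accuracy: {walltime_request_accuracy(3 * 3600, 12 * 3600):.2f}")
        # Hypothetical full-system HPL run: ~1 PFlops sustained at ~192.7 kW average
        # power draw would yield roughly the 5.19 GFlops/W figure reported for Aion.
        print(f"HPL energy efficiency: {hpl_energy_efficiency(1.0e6, 192_678):.2f} GFlops/W")

Averaging the per-job accuracy over all jobs scheduled on the platform is one way such an aggregate improvement figure could be derived.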
Research center :
ULHPC - University of Luxembourg: High Performance Computing
Disciplines :
Computer science
Author, co-author :
Varrette, Sébastien ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
Cartiaux, Hyacinthe ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
Peter, Sarah ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Bioinformatics Core
Kieffer, Emmanuel ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
Valette, Teddy ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
Olloh, Abatcha ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
External co-authors :
no
Language :
English
Title :
Management of an Academic HPC Research Computing Facility: The ULHPC Experience 2.0
Publication date :
July 2022
Event name :
6th ACM High Performance Computing and Cluster Technologies Conf. (HPCCT 2022)
Event place :
Fuzhou, China
Event date :
July 8-10, 2022
Audience :
International
Main work title :
6th High Performance Computing and Cluster Technologies Conference (HPCCT 2022)
Publisher :
Association for Computing Machinery (ACM), Fuzhou, China
[n.d.]. OSU Micro-Benchmarks (OMB). https://mvapich.cse.ohio-state.edu/benchmarks/.
[n.d.]. Puppet: Powerful infrastructure automation and delivery. [Online]. https://puppet.com/.
[n.d.]. The Graph 500 Benchmarks BFS (Search) and SSSP (Shortest Path) Specifications. https://graph500.org/.
[n.d.]. The Unified European Application Benchmark Suite (UEABS). https://repository.prace-ri.eu/git/UEABS/ueabs/.
2022. NVIDIA Data Center GPU Manager (DCGM). https://developer.nvidia.com/dcgm.
2022. R: A Language and Environment for Statistical Computing. https://www.R-project.org/.
A. Bhatele, N. Jain, M. Mubarak, and T. Gamblin. 2019. Analyzing Cost-Performance Tradeoffs of HPC Network Designs under Different Constraints Using Simulations. In Proc. of the ACM SIGSIM Conf. on Principles of Advanced Discrete Simulation (SIGSIM-PADS '19) (Chicago, IL, USA). ACM, 1-12. https://doi.org/10.1145/3316480.3325516
Peter Braam. 2019. The Lustre Storage Architecture. arXiv:1903.01955 [cs.OS]
European Parliament and Council. 2016. Regulation (EU) 2016/679. http://data.europa.eu/eli/reg/2016/679/oj
M. Geimer, K. Hoste, and R. McLay. 2014. Modern Scientific Software Management Using EasyBuild and Lmod. In 2014 First International Workshop on HPC User Support Tools (HUST). 41-51. https://doi.org/10.1109/HUST.2014.8
ISO. 2013. ISO/IEC 27002:2013: Information technology, Security techniques, Code of practice for information security controls (2nd ed.). ISO. 80 pages. https://www.iso.org/standard/54533.html
S. A. Jyothi, A. Singla, P. B. Godfrey, and A. Kolla. 2016. Measuring and Understanding Throughput of Network Topologies. In SC '16: Proceedings of the Intl. Conf. for HPC, Networking, Storage and Analysis. 761-772. https://doi.org/10.1109/SC.2016.64
V. Karakasis et al. 2019. Enabling Continuous Testing of HPC Systems Using ReFrame. In Tools and Techniques for High Performance Computing (HUST Workshop, part of SC 2019) (CCIS, Vol. 1190). Springer, 49-68. https://doi.org/10.1007/978-3-030-44728-1_3
G. M. Kurtzer, V. Sochat, and M.W. Bauer. 2017. Singularity: Scientific containers for mobility of compute. PloS one 12, 5 (2017).
John D. McCalpin. [n.d.]. STREAM: Sustainable Memory Bandwidth in High Performance Computers. https://www.cs.virginia.edu/stream/.
Robert McLay. 2015. Lmod: Environmental modules system.
L. Paseri, S. Varrette, and P. Bouvry. 2021. Protection of Personal Data in High Performance Computing Platform for Scientific Research Purposes. In Proc. of the EU Annual Privacy Forum (APF) 2021 (LNCS, Vol. 12703). Springer International Publishing, 123-142.
A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary. [n.d.]. HPL-A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers. https://www.netlib.org/benchmark/hpl/.
Matthew Rocklin. 2015. Dask: Parallel Computation with Blocked algorithms and Task Scheduling. In Proceedings of the 14th Python in Science Conference, Kathryn Huff and James Bergstra (Eds.). 130-136.
Ronald Ross. 2012. Guide for Conducting Risk Assessments. Special Publication (NIST SP), National Institute of Standards and Technology, Gaithersburg, MD. https://doi.org/10.6028/NIST.SP.800-30r1
F. Schmuck and R. Haskin. 2002. GPFS: A Shared-Disk File System for Large Computing Clusters. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (Monterey, CA) (FAST '02). USENIX Association, USA, 19-es.
S. Varrette, P. Bouvry, H. Cartiaux, and F. Georgatos. 2014. Management of an Academic HPC Cluster: The UL Experience. In Proc. of the 2014 Intl. Conf. on High Performance Computing & Simulation (HPCS 2014). IEEE, Bologna, Italy, 959-967.
S. Varrette, H. Cartiaux, T. Valette, and A. Olloh. 2022. Aggregating and Consolidating two High Performant Network Topologies: The ULHPC Experience. In ACM Practice and Experience in Advanced Research Computing (PEARC '22). Association for Computing Machinery (ACM), Boston, USA. https://doi.org/10.1145/3491418.3535159
S. Varrette, E. Kieffer, and F. Pinel. 2022. Optimizing the Resource and Job Management System of an Academic HPC and Research Computing Facility. In 21st IEEE Intl. Symp. on Parallel and Distributed Computing (ISPDC '22). IEEE Computer Society, Basel, Switzerland.
S. Varrette, E. Kieffer, F. Pinel, E. Krishnasamy, S. Peter, H. Cartiaux, and X. Besseron. 2021. RESIF 3.0: Toward a Flexible & Automated Management of User Software Environment on HPC facility. In ACM Practice and Experience in Advanced Research Computing (PEARC '21). Association for Computing Machinery (ACM), Virtual Event. https://doi.org/10.1145/3437359.3465600
Hadley Wickham. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org
M. D. Wilkinson, M. Dumontier, I.J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J-W. Boiten, L.B. da Silva Santos, P.E. Bourne, et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific data 3, 160018 (2016). https://www.nature.com/articles/sdata201618
A. B. Yoo, M. A. Jette, and M. Grondona. 2003. SLURM: Simple Linux Utility for Resource Management. In Proc. of Job Scheduling Strategies for Parallel Processing (JSSPP 2003) (LNCS, Vol. 2862). Springer Verlag, 44-60.