References of "Peter, Sarah 50001697"
     in
Bookmark and Share    
Full Text
Peer Reviewed
See detailManagement of an Academic HPC Research Computing Facility: The ULHPC Experience 2.0
Varrette, Sébastien UL; Cartiaux, Hyacinthe UL; Peter, Sarah UL et al

in 6th High Performance Computing and Cluster Technologies Conference (HPCCT 2022) (2022, July)

With the advent of the technological revolution and the digital transformation that made all scientific disciplines becoming computational, the need for High Performance Computing (HPC) has become and a ... [more ▼]

With the advent of the technological revolution and the digital transformation that made all scientific disciplines becoming computational, the need for High Performance Computing (HPC) has become and a strategic and critical asset to leverage new research and business in all domains requiring computing and storage performance. Since 2007, the University of Luxembourg operates a large academic HPC facility which remains the reference implementation within the country. This paper provides a general description of the current platform implementation as well as its operational management choices which have been adapted to the integration of a new liquid-cooled supercomputer, named Aion, released in 2021. The administration of a HPC facility to provide state-of-art computing systems, storage and software is indeed a complex and dynamic enterprise with the soul purpose to offer an enhanced user experience for intensive research computing and large-scale analytic workflows. Most design choices and feedback described in this work have been motivated by several years of experience in addressing in a flexible and convenient way the heterogeneous needs inherent to an academic environment towards research excellence. The different layers and stacks used within the operated facilities are reviewed, in particular with regards the user software management, or the adaptation of the Slurm Resource and Job Management System (RJMS) configuration with novel incentives mechanisms. In practice, the described and implemented environment brought concrete and measurable improvements with regards the platform utilization (+12,64%), jobs efficiency (average Wall-time Request Accuracy improved by 110,81%), the management and funding (increased by 10%). Thorough performance evaluation of the facility is also presented in this paper through reference benchmarks such as HPL, HPCG, Graph500, IOR or IO500. It reveals sustainable and scalable performance comparable to the most powerful supercomputers in the world, including for energy-efficient metrics (for instance, 5,19 GFlops/W (resp. 6,14 MTEPS/W) were demonstrated for full HPL (resp. Graph500) runs across all Aion nodes). [less ▲]

Detailed reference viewed: 93 (35 UL)
Full Text
Peer Reviewed
See detailRESIF 3.0: Toward a Flexible & Automated Management of User Software Environment on HPC facility
Varrette, Sébastien UL; Kieffer, Emmanuel UL; Pinel, Frederic UL et al

in ACM Practice and Experience in Advanced Research Computing (PEARC'21) (2021, July)

High Performance Computing (HPC) is increasingly identified as a strategic asset and enabler to accelerate the research and the business performed in all areas requiring intensive computing and large ... [more ▼]

High Performance Computing (HPC) is increasingly identified as a strategic asset and enabler to accelerate the research and the business performed in all areas requiring intensive computing and large-scale Big Data analytic capabilities. The efficient exploitation of heterogeneous computing resources featuring different processor architectures and generations, coupled with the eventual presence of GPU accelerators, remains a challenge. The University of Luxembourg operates since 2007 a large academic HPC facility which remains one of the reference implementation within the country and offers a cutting-edge research infrastructure to Luxembourg public research. The HPC support team invests a significant amount of time (i.e., several months of effort per year) in providing a software environment optimised for hundreds of users, but the complexity of HPC software was quickly outpacing the capabilities of classical software management tools. Since 2014, our scientific software stack is generated and deployed in an automated and consistent way through the RESIF framework, a wrapper on top of Easybuild and Lmod [5] meant to efficiently handle user software generation. A large code refactoring was performed in 2017 to better handle different software sets and roles across multiple clusters, all piloted through a dedicated control repository. With the advent in 2020 of a new supercomputer featuring a different CPU architecture, and to mitigate the identified limitations of the existing framework, we report in this state-of-practice article RESIF 3.0, the latest iteration of our scientific software management suit now relying on streamline Easybuild. It permitted to reduce by around 90% the number of custom configurations previously enforced by specific Slurm and MPI settings, while sustaining optimised builds coexisting for different dimensions of CPU and GPU architectures. The workflow for contributing back to the Easybuild community was also automated and a current work in progress aims at drastically decrease the building time of a complete software set generation. Overall, most design choices for our wrapper have been motivated by several years of experience in addressing in a flexible and convenient way the heterogeneous needs inherent to an academic environment aiming for research excellence. As the code base is available publicly, and as we wish to transparently report also the pitfalls and difficulties met, this tool may thus help other HPC centres to consolidate their own software management stack. [less ▲]

Detailed reference viewed: 330 (38 UL)