Aggregating and Consolidating two High Performant Network Topologies: The ULHPC Experience

VARRETTE, Sébastien; CARTIAUX, Hyacinthe; VALETTE, Teddy; OLLOH, Abatcha

doi:10.1145/3491418.3535159

Paper published in a book (Scientific congresses, symposiums and conference proceedings)

2022 • In ACM Practice and Experience in Advanced Research Computing (PEARC'22)

Peer reviewed

Permalink
https://hdl.handle.net/10993/51828

Files

Full Text

final_pearc22-64.pdf

Author preprint (1.28 MB)

Annexes

slides_pearc22.pdf

(2.61 MB)

Slides conference

Details

Keywords :

HPC Management; Network; Performance Evaluation

Abstract :

[en] High Performance Computing (HPC) encompasses advanced computation over parallel processing. The execution time of a given simulation depends upon many factors, such as the number of CPU/GPU cores, their utilisation factor and, of course, the interconnect performance, efficiency, and scalability. In practice, this last component and the associated topology remain the most significant differentiators between HPC systems and less performant systems. The University of Luxembourg has operated since 2007 a large academic HPC facility which remains one of the reference implementations within the country and offers a cutting-edge research infrastructure to Luxembourg public research. The main high-bandwidth low-latency network of the operated facility relies on the dominant interconnect technology in the HPC market, i.e., Infiniband (IB) over a Fat-tree topology. It is complemented by an Ethernet-based network defined for management tasks, external access and interactions with user applications that do not support Infiniband natively. The recent acquisition of a new cutting-edge supercomputer, Aion, which was federated with the previous flagship cluster Iris, was the occasion to aggregate and consolidate the two types of networks. This article depicts the architecture and the solutions designed to expand and consolidate the existing networks beyond their seminal capacity limits while preserving their bisection bandwidth as far as possible. At the IB level, and despite moving from a non-blocking configuration, the proposed approach defines a blocking topology maintaining the previous Fat-Tree height. The leaf connection capacity is more than tripled (moving from 216 to 672 end-points) while exhibiting very marginal penalties, i.e. less than 3% (resp. 0.3%) Read (resp. Write) bandwidth degradation against reference parallel I/O benchmarks, and a stable and sustainable point-to-point bandwidth efficiency among all possible pairs of nodes (measured above 95.45% for bi-directional streams). With regard to the Ethernet network, a novel 2-layer topology aiming to improve the availability, maintainability and scalability of the interconnect is described. It was deployed together with consistent network VLANs and subnets enforcing strict security policies via ACLs defined on layer 3, offering isolated and secure network environments. The implemented approaches are applicable to a broad range of HPC infrastructures and thus may help other HPC centres to consolidate their own interconnect stacks when designing or expanding their network infrastructures.
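The capacity figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the end-point counts (216 and 672) come from the abstract, whereas the switch counts and the down/up port splits are hypothetical values chosen to reproduce those totals, not the configuration reported in the paper. The helper names are likewise assumptions made for this example.

```python
from fractions import Fraction


def leaf_capacity(num_leaf_switches: int, down_ports: int) -> int:
    """End-points attachable to the leaf layer of a two-level fat-tree."""
    return num_leaf_switches * down_ports


def blocking_ratio(down_ports: int, up_ports: int) -> Fraction:
    """Down:up port ratio of one leaf switch (1 means non-blocking)."""
    return Fraction(down_ports, up_ports)


# Figures taken from the abstract.
OLD_ENDPOINTS = 216
NEW_ENDPOINTS = 672
growth = Fraction(NEW_ENDPOINTS, OLD_ENDPOINTS)

# HYPOTHETICAL layouts that happen to match the totals above; the real
# switch counts and port splits are not given in the abstract.
before = leaf_capacity(num_leaf_switches=12, down_ports=18)
after = leaf_capacity(num_leaf_switches=28, down_ports=24)
ratio_before = blocking_ratio(down_ports=18, up_ports=18)
ratio_after = blocking_ratio(down_ports=24, up_ports=12)

if __name__ == "__main__":
    print(f"capacity growth: {float(growth):.2f}x")
    print(f"before: {before} end-points, blocking ratio {ratio_before}:1")
    print(f"after:  {after} end-points, blocking ratio {ratio_after}:1")
```

The growth factor is 672/216 = 28/9, roughly 3.11, which is consistent with the "more than tripled" claim. Under the assumed port splits, giving more leaf ports to end-points and fewer to uplinks is what turns a non-blocking layout into a blocking one without adding a tree level.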

Research center :

ULHPC - University of Luxembourg: High Performance Computing

Disciplines :

Computer science

Author, co-author :

VARRETTE, Sébastien ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)

CARTIAUX, Hyacinthe ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)

VALETTE, Teddy ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)

OLLOH, Abatcha ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)

Language :

English

Title :

Aggregating and Consolidating two High Performant Network Topologies: The ULHPC Experience

Publication date :

July 2022

Event name :

Practice and Experience in Advanced Research Computing (PEARC ’22)

Event place :

Boston, United States

Event date :

July 8-14, 2022

Audience :

International

Main work title :

ACM Practice and Experience in Advanced Research Computing (PEARC'22)

Publisher :

Association for Computing Machinery (ACM), Boston, United States

Peer reviewed :

Peer reviewed

Focus Area :

Computational Sciences

Additional URL :

https://pearc.acm.org/pearc22/

Available on ORBilu :

since 01 August 2022

Statistics

Number of views

390 (53 by Unilu)

Number of downloads

165 (11 by Unilu)

