ORBilu: Detailed Reference

Article (Scientific journals)

What artificial intelligence can do for high-performance computing systems?

POCHELU, Pierrick; CARTIAUX, Hyacinthe; SCHLEICH, Julien

2026 • In Engineering Applications of Artificial Intelligence, 164, p. 113248

Peer Reviewed verified by ORBi

Permalink
https://hdl.handle.net/10993/66863

DOI
10.1016/j.engappai.2025.113248

Files

Full Text

1-s2.0-S0952197625032798-main.pdf

Author postprint (2.87 MB)

All documents in ORBilu are protected by a user license.

Details

Keywords :

Artificial intelligence; High-performance computing; Software performance; Computing center; High performance computing systems; Language model; Machine-learning; Optimisations; Performance computing; Performance estimation; Power; Control and Systems Engineering; Electrical and Electronic Engineering; Artificial Intelligence

Abstract :

[en] High-performance computing (HPC) centers consume substantial power, incurring environmental and operational costs. This review assesses how artificial intelligence (AI), including machine learning (ML) and optimization, improves the efficiency of operational HPC systems. Approximately 1,800 publications from 2019 to 2025 were manually screened using predefined inclusion/exclusion criteria; 74 “AI for HPC” papers were retained and grouped into six application areas: performance estimation, performance optimization, scheduling, surrogate modeling, fault detection, and language-model-based automation. Scheduling is the most active area, spanning research-oriented reinforcement-learning schedulers to production-friendly hybrids that combine ML with heuristics. Supervised performance estimation is foundational for both scheduling and optimization. Graph neural networks and time-series models strengthen anomaly detection by capturing spatio-temporal dependencies in production telemetry. Domain-specialized language models for HPC can outperform general-purpose LLMs on targeted coding and automation tasks. Together, these findings highlight integration opportunities such as LLM-based operating-system concepts and underscore the need for advances in MLOps, standardization of AI components, and benchmarking methodology.

Disciplines :

Computer science

Author, co-author :

POCHELU, Pierrick ; University of Luxembourg > Faculty of Science, Technology and Medicine > HPC Platform > High Level Support Team ; LuxProvide S.A., Bertrange, Luxembourg

CARTIAUX, Hyacinthe ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > HPC Platform

SCHLEICH, Julien ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > HPC Platform

External co-authors :

no

Language :

English

Title :

What artificial intelligence can do for high-performance computing systems?

Publication date :

15 January 2026

Journal title :

Engineering Applications of Artificial Intelligence

ISSN :

0952-1976

Publisher :

Elsevier Ltd

Volume :

164

Pages :

113248

Peer reviewed :

Peer Reviewed verified by ORBi

Additional URL :

https://api.elsevier.com/content/article/PII:S0952197625032798?httpAccept=text/xml

Available on ORBilu :

since 16 December 2025

Statistics

Number of views

41 (0 by Unilu)

Number of downloads

28 (0 by Unilu)

Scopus citations®

0

Scopus citations® without self-citations

0

OpenCitations

0

OpenAlex citations

0

F. Lubrano, C. Vercellino, G. Vitali, P. Viviani, A. Scionti, O. Terzo, https://www.scopus.com/inward/record.uri?eid=2-s2.0-85191989496doi=10.1145%2f3642978.3652835partnerID=40md5=defe58544aea315498139c785aa1609e[Advanced resource allocation in the context of heterogeneous workflows management], 2024, p. 14 – 20. doi:10.1145/3642978.3652835. https://www.scopus.com/inward/record.uri?eid=2-s2.0-85191989496doi=10.1145%2f3642978.3652835partnerID=40md5=defe58544aea315498139c785aa1609e
S. M. Lundberg, S.-I. Lee, http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf[A unified approach to interpreting model predictions] (2017) 4765–4774. http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
P. Luszczek, W. M. Sid-Lakhdar, J. Dongarra, https://doi.org/10.1177/10943420231166365[Combining multitask and transfer learning with deep gaussian processes for autotuning-based performance engineering], The International Journal of High Performance Computing Applications 37 (3-4) (2023) 229–244. doi:10.1177/10943420231166365. https://doi.org/10.1177/10943420231166365
L. van der Maaten, G. Hinton, http://jmlr.org/papers/v9/vandermaaten08a.html[Visualizing data using t-sne], Journal of Machine Learning Research 9 (86) (2008) 2579–2605. http://jmlr.org/papers/v9/vandermaaten08a.html
K. Mahajan, C.-H. Chu, S. Sridharan, A. Akella, https://www.usenix.org/conference/nsdi23/presentation/mahajan[Better together: Jointly optimizing ML collective scheduling and execution planning using SYNDICATE], in: 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), USENIX Association, Boston, MA, 2023, pp. 809–824. https://www.usenix.org/conference/nsdi23/presentation/mahajan
M. Malms, L. Cargemel, E. Suarez, N. Mittenzwey, M. Duranton, S. Sezer, C. Prunty, P. Rosse-Laurent, M. Pérez-Harnandez, M. Marazakis, G. Lonsdale, P. Carpenter, G. Antoniu, S. Narasimharmurthy, A. Brinkman, D. Pleiter, U.-U. Haus, J. Krueger, H.-C. Hoppe, R. Haas, Etp4hpc’s sra 5 - strategic research agenda for high-performance computing in europe - 2022 (Jan. 2022).
H. Mao, M. Alizadeh, I. Menache, S. Kandula, https://doi.org/10.1145/3005745.3005750[Resource management with deep reinforcement learning], in: Proceedings of the 15th ACM Workshop on Hot Topics in Networks, HotNets ’16, Association for Computing Machinery, New York, NY, USA, 2016, p. 50–56. doi:10.1145/3005745.3005750. https://doi.org/10.1145/3005745.3005750
K. Mei, X. Zhu, W. Xu, W. Hua, M. Jin, Z. Li, S. Xu, R. Ye, Y. Ge, Y. Zhang, https://arxiv.org/abs/2403.16971[Aios: Llm agent operating system] (2024). arXiv:2403.16971. https://arxiv.org/abs/2403.16971
K. Menear, C. Scully-Allison, D. Duplyakin, https://doi.org/10.1145/3626203.3670627[Quantifying uncertainty in hpc job queue time predictions], in: Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, PEARC ’24, Association for Computing Machinery, New York, NY, USA, 2024. doi:10.1145/3626203.3670627. https://doi.org/10.1145/3626203.3670627
H. Menon, A. Bhatele, T. Gamblin, Auto-tuning parameter choices in hpc applications using bayesian optimization, in: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020, pp. 831–840. doi:10.1109/IPDPS47924.2020.00090.
A. Merzky, M. Titov, M. Turilli, O. Kilic, T. Wang, S. Jha, https://arxiv.org/abs/2503.13343[Scalable runtime architecture for data-driven, hybrid hpc and ml workflow applications] (2025). arXiv:2503.13343. https://arxiv.org/abs/2503.13343
P. Messina, The exascale computing project, Computing in Science & Engineering 19 (3) (2017) 63–67. doi:10.1109/MCSE.2017.57.
X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang, Z. Zhang, R. Y. Y. Wong, A. Zhu, L. Yang, X. Shi, C. Shi, Z. Chen, D. Arfeen, R. Abhyankar, Z. Jia, https://doi.org/10.1145/3620666.3651335[Specinfer: Accelerating large language model serving with tree-based speculative inference and verification], in: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ASPLOS ’24, Association for Computing Machinery, New York, NY, USA, 2024, p. 932–949. doi:10.1145/3620666.3651335. https://doi.org/10.1145/3620666.3651335
B. Mohammed, I. Awan, H. Ugail, M. Younas, https://doi.org/10.1007/s10586-019-02917-1[Failure prediction using machine learning in a virtualised hpc system and application], Cluster Computing 22 (2) (2019) 471–485. doi:10.1007/s10586-019-02917-1. https://doi.org/10.1007/s10586-019-02917-1
M. Molan, J. Ahmed Khan, A. Borghesi, A. Bartolini, https://doi.org/10.1145/3578245.3585335[Graph neural networks for anomaly anticipation in hpc systems], in: Companion of the 2023 ACM/SPEC International Conference on Performance Engineering, ICPE ’23 Companion, Association for Computing Machinery, New York, NY, USA, 2023, p. 239–244. doi:10.1145/3578245.3585335. https://doi.org/10.1145/3578245.3585335
M. Molan, M. S. Ardebili, J. A. Khan, F. Beneventi, D. Cesarini, A. Borghesi, A. Bartolini, https://www.sciencedirect.com/science/article/pii/S0167739X24003327[Graafe: Graph anomaly anticipation framework for exascale hpc systems], Future Generation Computer Systems 160 (2024) 644–653. https://doi.org/10.1016/j.future.2024.06.032. https://www.sciencedirect.com/science/article/pii/S0167739X24003327
M. Molan, A. Borghesi, D. Cesarini, L. Benini, A. Bartolini, https://doi.org/10.1016/j.future.2022.12.001[Ruad: Unsupervised anomaly detection in hpc systems], Future Gener. Comput. Syst. 141 (C) (2023) 542–554. doi:10.1016/j.future.2022.12.001. https://doi.org/10.1016/j.future.2022.12.001
C. Munley, A. Jarmusch, S. Chandrasekaran, https://www.sciencedirect.com/science/article/pii/S0167739X24002449[Llm4vv: Developing llm-driven testsuite for compiler validation], Future Generation Computer Systems 160 (2024) 1–13. https://doi.org/10.1016/j.future.2024.05.034. https://www.sciencedirect.com/science/article/pii/S0167739X24002449
S. U. Mushtaq, S. Sheikh, S. M. Idrees, P. A. Malla, https://doi.org/10.1007/s12083-024-01798-5[In-depth analysis of fault tolerant approaches integrated with load balancing and task scheduling], Peer-to-Peer Networking and Applications 17 (6) (2024) 4303–4337. doi:10.1007/s12083-024-01798-5. https://doi.org/10.1007/s12083-024-01798-5
Z. Nan, M. Dave, X. Shen, C. Liao, T. Vanderbruggen, P.-H. Lin, M. Emani, Interactive nlu-powered ontology-based workflow synthesis for fair support of hpc, in: 2022 IEEE/ACM International Workshop on HPC User Support Tools (HUST), 2022, pp. 29–40. doi:10.1109/HUST56722.2022.00009.
Z. Nan, H. Guan, X. Shen, https://doi.org/10.1145/3368089.3409673[Hisyn: human learning-inspired natural language programming], in: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2020, Association for Computing Machinery, New York, NY, USA, 2020, p. 75–86. doi:10.1145/3368089.3409673. https://doi.org/10.1145/3368089.3409673
NAS, https://www.nas.nasa.gov/publications/npb.html[Nas parallel benchmarks], accessed: 2024-11-18 (1994). https://www.nas.nasa.gov/publications/npb.html
NERSC, https://docs.nersc.gov/systems/[Nersc systems], accessed: 2024-11-19 (2024). https://docs.nersc.gov/systems/
M. Nowak, G. Frankowski, N. Meyer, E. Yilmaz, O. Erdogan, J.-P. Nominé, F. Robin, https://prace-ri.eu/wp-content/uploads/wp79.pdf[Security in hpc centres], Partnership for Advanced Computing in Europe (PRACE) (2024). https://prace-ri.eu/wp-content/uploads/wp79.pdf
H. Oh, K. Kim, J. Kim, S. Kim, J. Lee, D.-s. Chang, J. Seo, https://doi.org/10.1145/3620665.3640383[Exegpt: Constraint-aware resource scheduling for llm inference], in: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS ’24, Association for Computing Machinery, New York, NY, USA, 2024, p. 369–384. doi:10.1145/3620665.3640383. https://doi.org/10.1145/3620665.3640383
OpenAI, https://openai.com/blog/openai-codex/[Openai codex], accessed: 2024-08-10. https://openai.com/blog/openai-codex/
C. Packer, S. Wooders, K. Lin, V. Fang, S. G. Patil, I. Stoica, J. E. Gonzalez, https://arxiv.org/abs/2310.08560[Memgpt: Towards llms as operating systems] (2024). arXiv:2310.08560. https://arxiv.org/abs/2310.08560
A. Palla, Chatbot instruction prompts, https://huggingface.co/datasets/alespalla/chatbotinstructionprompts (2023).
O. Pearce, J. Burmark, R. Hornung, B. Bogale, I. Lumsden, M. McKinsey, D. Yokelson, D. Boehme, S. Brink, M. Taufer, T. Scogland, Raja performance suite: Performance portability analysis with caliper and thicket, in: SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2024, pp. 1206–1218. doi:10.1109/SCW63240.2024.00162.
E. Peixoto, D. Torres, D. Carneiro, B. Silva, R. Marques, https://www.mdpi.com/2504-2289/9/2/47[Reusing ml models in dynamic data environments: Data similarity-based approach for efficient mlops], Big Data and Cognitive Computing 9 (2) (2025). doi:10.3390/bdcc9020047. https://www.mdpi.com/2504-2289/9/2/47
J. L. Peterson, B. Bay, J. Koning, P. Robinson, J. Semler, J. White, R. Anirudh, K. Athey, P.-T. Bremer, F. Di Natale, D. Fox, J. A. Gaffney, S. A. Jacobs, B. Kailkhura, B. Kustowski, S. Langer, B. Spears, J. Thiagarajan, B. Van Essen, J.-S. Yeom, https://www.sciencedirect.com/science/article/pii/S0167739X22000322[Enabling machine learning-ready hpc ensembles with merlin], Future Generation Computer Systems 131 (2022) 255–268. https://doi.org/10.1016/j.future.2022.01.024. https://www.sciencedirect.com/science/article/pii/S0167739X22000322
T.-T. Pham, M. Pister, P. Couvée, Recurrent neural network for classifying of hpc applications, in: 2019 Spring Simulation Conference (SpringSim), 2019, pp. 1–12. doi:10.23919/SpringSim.2019.8732923.
P. Pochelu, S. G. Petiton, B. Conche, https://doi.org/10.1145/3492805.3492819[A deep neural networks ensemble workflow from hyperparameter search to inference leveraging gpu clusters], in: International Conference on High Performance Computing in Asia-Pacific Region, HPCAsia ’22, Association for Computing Machinery, New York, NY, USA, 2022, p. 61–71. doi:10.1145/3492805.3492819. https://doi.org/10.1145/3492805.3492819
M. I. Radaideh, T. Kozlowski, https://www.sciencedirect.com/science/article/pii/S0951832019301711[Surrogate modeling of advanced computer simulations using deep gaussian processes], Reliability Engineering & System Safety 195 (2020) 106731. https://doi.org/10.1016/j.ress.2019.106731. https://www.sciencedirect.com/science/article/pii/S0951832019301711
A. Rahman, V. Cvetkovic, K. Reece, A. Walters, Y. Hassan, A. Tummeti, B. Torres, D. Cooney, M. Ellis, D. S. Nikolopoulos, https://arxiv.org/abs/2505.03906[Marco: Multi-agent code optimization with real-time knowledge integration for high-performance computing] (2025). arXiv:2505.03906. https://arxiv.org/abs/2505.03906
M. Rashad, Chatgpt prompts, https://huggingface.co/datasets/MohamedRashad/ChatGPT-prompts (2023).
J. Rasley, S. Rajbhandari, O. Ruwase, Y. He, https://doi.org/10.1145/3394486.3406703[Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters], in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, Association for Computing Machinery, New York, NY, USA, 2020, p. 3505–3506. doi:10.1145/3394486.3406703. https://doi.org/10.1145/3394486.3406703
J. Reed, Z. DeVito, H. He, A. Ussery, J. Ansel, torch.fx: Practical program capture and transformation for deep learning in python, Proceedings of Machine Learning and Systems 4 (2022) 638–651.
J. Ren, D. Xu, S. Yang, J. Zhao, Z. Li, C. Navasca, C. Wang, H. Xu, D. Li, https://ieeexplore-ieee-org.proxy.bnl.lu/stamp/stamp.jsp?tp=arnumber=10476398[Enabling large dynamic neural network training with learning-based memory management], in: 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2024, pp. 788–802. doi:10.1109/HPCA57654.2024.00066. https://ieeexplore-ieee-org.proxy.bnl.lu/stamp/stamp.jsp?tp=arnumber=10476398
P. J. Rousseeuw, https://www.sciencedirect.com/science/article/pii/0377042787901257[Silhouettes: A graphical aid to the interpretation and validation of cluster analysis], Journal of Computational and Applied Mathematics 20 (1987) 53–65. https://doi.org/10.1016/0377-0427(87)90125-7. https://www.sciencedirect.com/science/article/pii/0377042787901257
R. Sarma, E. Inanc, M. Aach, A. Lintermann, https://juser.fz-juelich.de/record/1031520[Parallel and scalable AI in HPC systems for CFD applications and beyond], Frontiers in High Performance Computing 2 (2024) 1444337. doi:10.3389/fhpcp.2024.1444337. https://juser.fz-juelich.de/record/1031520
R. Saxena, A. Baskar, S. Haroon, S. Hayat, O. Shcherbakov, K. Kayabay, D. Hoppe, Cybersecurity concerns of artificial intelligence applications on high-performance computing systems, in: AISyS 2024, Venice, Italy, 2024.
SchedMD, https://slurm.schedmd.com/publications.html[Slurm publications], accessed: 2024-11-21 (2024). https://slurm.schedmd.com/publications.html
B. Schroeder, G. Gibson, https://doi.org/10.1145/1188455.1188615[The computer failure data repository (cfdr): collecting, sharing and analyzing failure data], in: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC ’06, Association for Computing Machinery, New York, NY, USA, 2006, p. 154–es. doi:10.1145/1188455.1188615. https://doi.org/10.1145/1188455.1188615
E. Sencan, Y.-C. Lee, C. Casey, B. Schwaller, V. J. Leung, J. Brandt, B. Kulis, M. Egele, A. K. Coskun, https://ieeexplore.ieee.org/document/11018307[Refine: A robust approach to unsupervised anomaly detection for production hpc systems], in: ISC High Performance 2025 Research Paper Proceedings (40th International Conference), 2025, pp. 1–12. https://ieeexplore.ieee.org/document/11018307
D. Shu, Z. Li, A. Barati Farimani, https://www.sciencedirect.com/science/article/pii/S0021999123000670[A physics-informed diffusion model for high-fidelity flow field reconstruction], Journal of Computational Physics 478 (2023) 111972. https://doi.org/10.1016/j.jcp.2023.111972. https://www.sciencedirect.com/science/article/pii/S0021999123000670
N. A. Skvortsov, S. A. Stupnikov, https://doi.org/10.1162/dinta00142[A semantic approach to workflow management and reuse for research problem solving], Data Intelligence 4 (2) (2022) 439–454. doi:10.1162/dinta00142. https://doi.org/10.1162/dinta00142
J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, S. Ganguli, Deep unsupervised learning using nonequilibrium thermodynamics, in: Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, JMLR.org, 2015, p. 2256–2265.
I. Syrigos, D. Kefalas, N. Makris, T. Korakis, Eelas: Energy efficient and latency aware scheduling of cloud-native ml workloads, in: 2023 15th International Conference on COMmunication Systems & NETworkS (COMSNETS), 2023, pp. 819–824. doi:10.1109/COMSNETS56262.2023.10041344.
R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, T. B. Hashimoto, Stanford alpaca: An instruction-following llama model, https://github.com/tatsu-lab/stanfordalpaca (2023).
K. Teranishi, H. Menon, W. F. Godoy, P. Balaprakash, D. Bau, T. Ben-Nun, A. Bhatele, F. Franchetti, M. Franusich, T. Gamblin, G. Georgakoudis, T. Goldstein, A. Guha, S. Hahn, C. Iancu, Z. Jin, T. Jones, T. M. Low, H. Mankad, N. R. Miniskar, M. A. H. Monil, D. Nichols, K. Parasyris, S. Pophale, P. Valero-Lara, J. S. Vetter, S. Williams, A. Young, https://arxiv.org/abs/2505.08135[Leveraging ai for productive and trustworthy hpc software: Challenges and research directions] (2025). arXiv:2505.08135. https://arxiv.org/abs/2505.08135
P. Thanapol, K. Lavangnananda, F. Leprévost, J. Schleich, P. Bouvry, Scheduling deep learning training in gpu cluster using the model-similarity-based policy, in: N. T. Nguyen, S. Boonsang, H. Fujita, B. Hnatkowska, T.-P. Hong, K. Pasupa, A. Selamat (Eds.), Intelligent Information and Database Systems, Springer Nature Singapore, Singapore, 2023, pp. 363–374.
TOP500 Project, Top500 list, https://top500.org, accessed: 2024-08-12 (2024).
A. M. Turing, http://www.jstor.org/stable/2251299[Computing machinery and intelligence], Mind 59 (236) (1950) 433–460. http://www.jstor.org/stable/2251299
A. Turner, https://doi.org/10.5281/zenodo.2616549[Single node performance comparison report] (Mar. 2019). doi:10.5281/zenodo.2616549. https://doi.org/10.5281/zenodo.2616549
B. Vacchetti, T. Cerquitelli, V. Nosenzo, E. Capitelli, L. Chiosso, M. Trocano, Jem: An ai-based engine workflow to predict simulation’s execution time on hpc cluster, in: 2024 International Conference on Control, Automation and Diagnosis (ICCAD), 2024, pp. 1–5. doi:10.1109/ICCAD60883.2024.10553971.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Curran Associates Inc., Red Hook, NY, USA, 2017, p. 6000–6010.
C. Vercellino, A. Scionti, G. Varavallo, P. Viviani, G. Vitali, O. Terzo, https://www.sciencedirect.com/science/article/pii/S0167739X23000274[A machine learning approach for an hpc use case: the jobs queuing time prediction], Future Generation Computer Systems 143 (2023) 215–230. https://doi.org/10.1016/j.future.2023.01.020. https://www.sciencedirect.com/science/article/pii/S0167739X23000274
G. Verma, M. Emani, C. Liao, P.-H. Lin, T. Vanderbruggen, X. Shen, B. Chapman, Hpcfair: Enabling fair ai for hpc applications, in: 2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC), 2021, pp. 58–68. doi:10.1109/MLHPC54614.2021.00011.
VI-HPS, https://www.vi-hps.org/projects/score-p[Scalable performance measurement infrastructure for parallel codes], accessed: 2024-11-18 (2022). https://www.vi-hps.org/projects/score-p
L. Wang, M. A. Rodriguez, N. Lipovetzky, https://doi.org/10.1007/s11227-025-07396-3[Optimizing HPC scheduling: a hierarchical reinforcement learning approach for intelligent job selection and allocation], The Journal of Supercomputing 81 (8) (2025) 918. doi:10.1007/s11227-025-07396-3. https://doi.org/10.1007/s11227-025-07396-3
Y. Wang, Z. Xie, K. Xu, Y. Dou, Y. Lei, https://www.sciencedirect.com/science/article/pii/S0925231215014940[An efficient and effective convolutional auto-encoder extreme learning machine network for 3d feature learning], Neurocomputing 174 (2016) 988–998. https://doi.org/10.1016/j.neucom.2015.10.035. https://www.sciencedirect.com/science/article/pii/S0925231215014940
Q. Wang, H. Zhang, C. Qu, Y. Shen, X. Liu, J. Li, https://www.mdpi.com/2076-3417/11/20/9448[Rlschert: An hpc job scheduler using deep reinforcement learning and remaining time prediction], Applied Sciences 11 (20) (2021). doi:10.3390/app11209448. https://www.mdpi.com/2076-3417/11/20/9448
L. Ward, G. Sivaraman, J. G. Pauloski, Y. Babuji, R. Chard, N. Dandu, P. C. Redfern, R. S. Assary, K. Chard, L. A. Curtiss, R. Thakur, I. Foster, Colmena: Scalable machine-learning-based steering of ensemble simulations for high performance computing, in: 2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC), 2021, pp. 9–20. doi:10.1109/MLHPC54614.2021.00007.
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Curran Associates Inc., Red Hook, NY, USA, 2022.
B. S. Weihua Liu, Erh-Wen Hu, J. Wang, https://doi.org/10.1080/09540091.2020.1762542[Using machine learning techniques for dsp software performance prediction at source code level], Connection Science 33 (1) (2021) 26–41. doi:10.1080/09540091.2020.1762542. https://doi.org/10.1080/09540091.2020.1762542
J. Welbl, A. Glaese, J. Uesato, S. Dathathri, J. Mellor, L. A. Hendricks, K. Anderson, P. Kohli, B. Coppin, P.-S. Huang, https://aclanthology.org/2021.findings-emnlp.210/[Challenges in detoxifying language models], in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 2447–2469. doi:10.18653/v1/2021.findings-emnlp.210. https://aclanthology.org/2021.findings-emnlp.210/
F.-J. Willemsen, R. Schoonhoven, J. Filipovič, J. O. Tørring, R. van Nieuwpoort, B. van Werkhoven, https://www.sciencedirect.com/science/article/pii/S0167739X24002498[A methodology for comparing optimization algorithms for auto-tuning], Future Generation Computer Systems 159 (2024) 489–504. https://doi.org/10.1016/j.future.2024.05.021. https://www.sciencedirect.com/science/article/pii/S0167739X24002498
S. Wu, J. Guan, https://www.proquest.com/scholarly-journals/prediction-disk-failure-based-on-classification/docview/3072350799/se-2[Prediction of disk failure based on classification intensity resampling], Information 15 (6) (2024) 322. https://www.proquest.com/scholarly-journals/prediction-disk-failure-based-on-classification/docview/3072350799/se-2
A. B. V. Wyzykowski, G. M. C. D. Sousa, B. S. Coelho, L. B. Santos, D. Anschau, https://www.scopus.com/inward/record.uri?eid=2-s2.0-85187778131doi=10.1109%2fDCHPC60845.2024.10454084partnerID=40md5=e297a54e70ad8df38fc9ef40f2adc7ad[Optimizing geophysical workloads in high-performance computing: Leveraging machine learning and transformer models for enhanced parallelism and processor allocation], 2024. doi:10.1109/DCHPC60845.2024.10454084. https://www.scopus.com/inward/record.uri?eid=2-s2.0-85187778131doi=10.1109%2fDCHPC60845.2024.10454084partnerID=40md5=e297a54e70ad8df38fc9ef40f2adc7ad
S. Xia, Y. Sun, X. Pan, Y. Yuan, S. Zhang, S. Hu, L. Tao, Y. Li, J. Feng, Effective node-level anomaly detection in hpc systems via coarse-grained clustering and fine-grained model sharing, in: The International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’25, St. Louis, MO, USA, 2025.
Y. Xie, S. Huang, T. Chen, F. Wei, https://doi.org/10.1609/aaai.v37i11.26617[Moec: mixture of expert clusters], in: Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, AAAI’23/IAAI’23/EAAI’23, AAAI Press, 2023. doi:10.1609/aaai.v37i11.26617. https://doi.org/10.1609/aaai.v37i11.26617
H.-J. Xue, X.-Y. Dai, J. Zhang, S. Huang, J. Chen, Deep matrix factorization models for recommender systems, in: Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI’17, AAAI Press, 2017, p. 3203–3209.
G. Yang, E. Hu, I. Babuschkin, S. Sidor, X. Liu, D. Farhi, N. Ryder, J. Pachocki, W. Chen, J. Gao, https://proceedings.neurips.cc/paperfiles/paper/2021/file/8df7c2e3c3c3be098ef7b382bd2c37ba-Paper.pdf[Tuning large neural networks via zero-shot hyperparameter transfer], in: M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems, Vol. 34, Curran Associates, Inc., 2021, pp. 17084–17097. https://proceedings.neurips.cc/paperfiles/paper/2021/file/8df7c2e3c3c3be098ef7b382bd2c37ba-Paper.pdf
Y. Yang, H. Shen, Deep reinforcement learning enhanced greedy optimization for online scheduling of batched tasks in cloud hpc systems, IEEE Transactions on Parallel and Distributed Systems 33 (11) (2022) 3003–3014. doi:10.1109/TPDS.2021.3138459.
Q. Yang, T. Yang, M. Xiang, L. Zhang, H. Wang, M. Serafini, H. Guan, https://doi.org/10.1145/3627703.3650074[Gmorph: Accelerating multi-dnn inference via model fusion], in: Proceedings of the Nineteenth European Conference on Computer Systems, EuroSys ’24, Association for Computing Machinery, New York, NY, USA, 2024, p. 505–523. doi:10.1145/3627703.3650074. https://doi.org/10.1145/3627703.3650074
X. Yang, Z. Zhou, S. Wallace, Z. Lan, W. Tang, S. Coghlan, M. E. Papka, Integrating dynamic pricing of electricity into energy aware scheduling for hpc systems, in: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2013, pp. 1–11.
D. Zhang, D. Dai, Y. He, F. S. Bao, B. Xie, Rlscheduler: An automated hpc batch job scheduler using reinforcement learning, in: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020, pp. 1–15. doi:10.1109/SC41405.2020.00035.
S. Zhang, D. Li, Z. Zhong, J. Zhu, M. Liang, J. Luo, Y. Sun, Y. Su, S. Xia, Z. Hu, Y. Zhang, D. Pei, J. Sun, Y. Liu, https://doi.org/10.1145/3485447.3511983[Robust system instance clustering for large-scale web services], in: Proceedings of the ACM Web Conference 2022, WWW ’22, Association for Computing Machinery, New York, NY, USA, 2022, p. 1785–1796. doi:10.1145/3485447.3511983. https://doi.org/10.1145/3485447.3511983
Y. Zhang, X. Zhao, J. Yin, L. Zhang, Z. Chen, Operating system and artificial intelligence: A systematic review, arXiv preprint arXiv:2407.14567 (2024).
L. Zheng, Z. Li, H. Zhang, Y. Zhuang, Z. Chen, Y. Huang, Y. Wang, Y. Xu, D. Zhuo, E. P. Xing, J. E. Gonzalez, I. Stoica, https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin[Alpa: Automating inter- and Intra-Operator parallelism for distributed deep learning], in: 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), USENIX Association, Carlsbad, CA, 2022, pp. 559–578. https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin
S. Zheng, Y. Liang, S. Wang, R. Chen, K. Sheng, https://doi.org/10.1145/3373376.3378508[Flextensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system], in: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’20, Association for Computing Machinery, New York, NY, USA, 2020, p. 859–873. doi:10.1145/3373376.3378508. https://doi.org/10.1145/3373376.3378508
L. Zheng, R. Liu, J. Shao, T. Chen, J. Gonzalez, I. Stoica, A. Haj-Ali, https://datasets-benchmarks-proceedings.neurips.cc/paperfiles/paper/2021/file/a684eceee76fc522773286a895bc8436-Paper-round1.pdf[Tenset: A large-scale program performance dataset for learned tensor compilers], in: J. Vanschoren, S. Yeung (Eds.), Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, Vol. 1, 2021. https://datasets-benchmarks-proceedings.neurips.cc/paperfiles/paper/2021/file/a684eceee76fc522773286a895bc8436-Paper-round1.pdf
H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, W. Zhang, Informer: Beyond efficient transformer for long sequence time-series forecasting, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 2021, pp. 11106–11115.

Similar publications