Branch-and-Bound; Chapel; Depth first tree search; Distributed programming; Exascale; Performance; Productivity-awareness; Tree-search; Software
Abstract :
[en] With the recent arrival of the exascale era, modern supercomputers are increasingly big making their programming much more complex. In addition to performance, software productivity is a major concern to choose a programming language, such as Chapel, designed for exascale computing. In this paper, we investigate the design of a parallel distributed tree-search algorithm, namely P3D-DFS, and its implementation using Chapel. The design is based on the Chapel's DistBag data structure, revisited by: (1) redefining the data structure for Depth-First tree-Search (DFS), henceforth renamed DistBag-DFS; (2) redesigning the underlying load balancing mechanism. In addition, we propose two instantiations of P3D-DFS considering the Branch-and-Bound (B&B) and Unbalanced Tree Search (UTS) algorithms. In order to evaluate how much performance is traded for productivity, we compare the Chapel-based implementations of B&B and UTS to their best-known counterparts based on traditional OpenMP (intra-node) and MPI+X (inter-node). For experimental validation using 4096 processing cores, we consider the permutation flow-shop scheduling problem for B&B and synthetic literature benchmarks for UTS. The reported results show that P3D-DFS competes with its OpenMP baselines for coarser-grained shared-memory scenarios, and with its MPI+X counterparts for distributed-memory settings, considering both performance and productivity-awareness. In the context of this work, this makes Chapel an alternative to OpenMP/MPI+X for exascale programming.
Disciplines :
Computer science
Author, co-author :
HELBECQUE, Guillaume ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > PCOG ; Université de Lille, CNRS/CRIStAL UMR 9189, Centre Inria de l'Université de Lille, France
GMYS, Jan; Université de Lille, CNRS/CRIStAL UMR 9189, Centre Inria de l'Université de Lille, France
MELAB, Nouredine; Université de Lille, CNRS/CRIStAL UMR 9189, Centre Inria de l'Université de Lille, France
CARNEIRO PESSOA, Tiago ; University of Luxembourg > Faculty of Science, Technology and Medicine > Department of Computer Science > Team Pascal BOUVRY
BOUVRY, Pascal ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
External co-authors :
yes
Language :
English
Title :
Parallel distributed productivity-aware tree-search using Chapel
Publication date :
19 July 2023
Journal title :
Concurrency and Computation: Practice and Experience
Asanovic K, Bodik R, Demmel J, et al. A view of the parallel computing landscape. Commun ACM. 2009;52(10):56-67. doi:10.1145/1562764.1562783
The Chapel Parallel Programming Language. The DistributedBag module. https://chapel-lang.org/docs/modules/packages/DistributedBag.html, version 1.29.0.
Gmys J. Exactly Solving Hard Permutation Flowshop Scheduling Problems on Peta-Scale GPU-Accelerated Supercomputers. INFORMS J Comput. 2022;34(5):2502-2522. doi:10.1287/ijoc.2022.1193
Trienekens HW, de Bruin A. Towards a taxonomy of parallel branch and bound algorithms. Technical report. 1992. http://hdl.handle.net/1765/1491
Blumofe RD, Leiserson CE. Scheduling multithreaded computations by work stealing. J ACM. 1999;46(5):720-748. doi:10.1145/324133.324234
Lusk E, Yelick K. Languages for high-productivity computing: the DARPA HPCS language project. Parallel Process Lett. 2007;17:89-102. doi:10.1142/S0129626407002892
Chamberlain B, Ronaghan E, Albrecht B, et al. Chapel comes of age: Making scalable programming productive. Cray User Group; 2018. https://chapel-lang.org/publications/cug2018-chapel.pdf
Almasi G. PGAS (partitioned global address space) Languages. Encyclopedia of Parallel Computing. Springer; 2011:1539-1547. doi:10.1007/978-0-387-09766-4_210
Johnson RB, Hollingsworth J. Optimizing Chapel for Single-node Environments. IEEE Intl. Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE; 2016:1558-1567. doi:10.1109/IPDPSW.2016.181
Kayraklioglu E, Chang W, El-Ghazawi TA. Comparative performance and optimization of chapel in modern Manycore architectures. IEEE Intl. Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE; 2017:1105-1114. doi:10.1109/IPDPSW.2017.126
Rolinger TB, Craft J, Krieger CD, Sussman A. Towards high productivity and performance for irregular applications in chapel. SC Workshops Supplementary Proceedings (SCWS). IEEE; 2021:1-11. doi:10.1109/SCWS55283.2021.00012
Helbecque G, Gmys J, Carneiro T, Melab N, Bouvry P. A performance-oriented comparative study of the Chapel high-productivity language to conventional programming environments. 13th International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM). Association for Computing Machinery; 2022:21-29. doi:10.1145/3528425.3529104
Carneiro T, Melab N. An Incremental Parallel PGAS-based Tree Search Algorithm. 17th International Conference on High Performance Computing & Simulation (HPCS). IEEE; 2019:19-26. doi:10.1109/HPCS48598.2019.9188106
Carneiro T, Gmys J, Melab N, Tuyttens D. Towards ultra-scale Branch-and-Bound using a High-productivity Language. Futur Gener Comput Syst. 2020;105:196-209. doi:10.1016/j.future.2019.11.011
UPC Consortium. UPC language specifications, v1.3. Technical Report LBNL-6623E. Lawrence Berkeley National Lab; 2005. https://upc.lbl.gov/publications/upc-spec-1.3.pdf
Olivier S, Prins J. Scalable dynamic load balancing using UPC. 37th International Conference on Parallel Processing. IEEE; 2008:123-131. doi:10.1109/ICPP.2008.19
Machado R, Lojewski C, Abreu S, Pfreundt F-J. Unbalanced tree search on a manycore system using the GPI programming model. Comput Sci. 2011;26(3–4):229-236. doi:10.1007/s00450-011-0163-3
Grünewald D, Simmendinger C. The GASPI API specification and its implementation GPI 2.0. 7th International Conference on PGAS Programming Models. Vol 243. University of Edinburgh; 2013:52.
Helbecque G, Gmys J, Carneiro T, Melab N, Bouvry P. Productivity- and Performance-aware Parallel Distributed Depth-First Search repository. https://github.com/Guillaume-Helbecque/P3D-DFS; 2022. doi:10.5281/zenodo.7674860
van Dijk T, van de Pol JC. Lace: Non-blocking split deque for work-stealing. Euro-Par 2014: Parallel Processing Workshops. Lecture Notes in Computer Science. Vol 8806. Springer; 2014. doi:10.1007/978-3-319-14313-2_18
Dinan J, Larkins DB, Sadayappan P, Krishnamoorthy S, Nieplocha J. Scalable work stealing. Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09). Vol 53. Association for Computing Machinery; 2009:1-11. doi:10.1145/1654059.1654113
Bolosky WJ, Scott ML. False sharing and its effect on shared memory performance. USENIX Systems on USENIX Experiences with Distributed and Multiprocessor Systems–Volume 4 (Sedms'93). USENIX Association; 1993:3. doi:10.5555/1295480.1295483
Gmys J. Parallel Branch-and-Bound for permutation-based optimization repository. https://github.com/jangmys/pbb; 2022. doi:10.5281/zenodo.7674826
Dinan J, Olivier S. The unbalanced tree-search benchmark repository. https://sourceforge.net/projects/uts-benchmark/; 2012. doi:10.5281/zenodo.7328332
Gmys J, Mezmaz M, Melab N, Tuyttens D. IVM-based parallel branch-and-bound using hierarchical work stealing on multi-GPU systems. Concurr Comput Practice Exp. 2017;29:9. doi:10.1002/cpe.4019
Dinan J, Olivier S, Prins J, Sabin G, Sadayappan P, Tseng C-W. Dynamic load balancing of unbalanced computations using message passing. 6th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS 2007). IEEE; 2007:1-8. doi:10.1109/IPDPS.2007.370581
Johnson SM. Optimal two- and three-stage production schedules with setup times included. Naval Res Log Quarterly. 1954;1:61-68. doi:10.1002/nav.3800010110
Garey MR, Johnson DS, Sethi R. The complexity of Flowshop and Jobshop scheduling. Math Oper Res. 1976;1:117-129. doi:10.1287/moor.1.2.117
Brown APG, Lomnicki ZA. Some applications of the "branch-and-bound" algorithm to the machine scheduling problem. J Oper Res Soc. 1966;17:173-186. doi:10.1057/jors.1966.25
Lageweg BJ, Lenstra JK, Rinnooy Kan AHG. A general bounding scheme for the permutation flow-shop problem. Oper Res. 1978;26(1):53-67. doi:10.1287/opre.26.1.53
Olivier S, Huan J, Liu J, et al. UTS: An unbalanced tree search benchmark. 19th International Workshop on Languages and Compilers for Parallel Computing (LCPC). Springer; 2007. doi:10.1007/978-3-540-72521-3_18
Eastlake D 3rd, Jones P. US Secure Hash Algorithm 1 (SHA1). 2001. doi:10.17487/RFC3174
Taillard E. Benchmarks for basic scheduling problems. Eur J Oper Res. 1993;64(2):278-285. doi:10.1016/0377-2217(93)90182-M
Gmys J, Mezmaz M, Melab N, Tuyttens D. A computationally efficient Branch-and-Bound algorithm for the permutation flow-shop scheduling problem. Eur J Oper Res. 2020;284(3):814-833. doi:10.1016/j.ejor.2020.01.039
Eric Taillard's page. Summary of best known lower and upper bounds of Taillard's instances. Accessed June 2022. http://mistic.heig-vd.ch/taillard/problemes.dir/ordonnancement.dir/flowshop.dir/best_lb_up.txt
Kennedy K, Koelbel C, Schreiber R. Defining and measuring the productivity of programming languages. Int J High Performance Comput Appl. 2004;18(4):441-448. doi:10.1177/1094342004048537