[en] With renewed global interest in Artificial Intelligence (AI) methods, the past decade has seen a myriad of new programming models and tools that enable better and faster Machine Learning (ML). More recently, a subset of ML known as Deep Learning (DL) has attracted increasing interest due to its inherent ability to tackle novel cognitive computing applications efficiently. DL allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction in an automated way, and can deliver higher predictive accuracy when trained on larger data sets. Based on Artificial Neural Networks (ANN), DL is now at the core of state-of-the-art voice recognition systems (which enable easy control over, e.g., Internet-of-Things (IoT) smart home appliances), self-driving car engines, and online recommendation systems. The ecosystem of DL frameworks is evolving quickly, as are the DL architectures that are shown to perform well on specialized tasks and to exploit GPU accelerators. For this reason, frequent performance evaluation of the DL ecosystem is required, especially since the advent of novel distributed training frameworks such as Horovod, which allow for scalable training across multiple computing resources.
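As a purely illustrative sketch (not taken from the paper), the snippet below shows how an existing Keras training script can be distributed with Horovod; the tiny CNN, the learning-rate scaling factor and the one-process-per-GPU launch (e.g. via horovodrun) are assumptions made here for brevity.

import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod and pin each process to a single GPU
# (assumption: one process per GPU, launched e.g. with horovodrun).
hvd.init()
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# A deliberately tiny CNN on CIFAR-10, standing in for the real benchmark models.
(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
x_train = x_train / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(32, 32, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Scale the learning rate with the number of workers and wrap the optimizer
# so that gradients are averaged across workers via ring all-reduce.
opt = hvd.DistributedOptimizer(
    tf.keras.optimizers.SGD(learning_rate=0.1 * hvd.size(), momentum=0.9))
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt,
              metrics=['accuracy'])

# Broadcast the initial weights from rank 0 so that all workers start identically.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x_train, y_train, batch_size=128, epochs=1,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)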
In this paper, the scalability evaluation of the reference DL frameworks (TensorFlow, Keras, MXNet, and PyTorch) is performed over up-to-date High Performance Computing (HPC) resources to compare the efficiency of different implementations across several hardware architectures (CPU and GPU). Experimental results demonstrate that the DistributedDataParallel feature of the PyTorch library appears to be the most efficient approach for distributing the training process across many devices, reaching a throughput speedup of 10.11 when using 12 NVIDIA Tesla V100 GPUs to train ResNet-44 on the CIFAR-10 dataset.
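For reference, a minimal sketch of a PyTorch DistributedDataParallel set-up is given below; it is not the paper's code, it assumes one process per GPU launched with torchrun (which exports LOCAL_RANK), and it uses torchvision's resnet18 merely as a stand-in for the ResNet-44 model actually benchmarked.

import os
import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumption: one process per GPU, started with torchrun, which exports
# RANK, WORLD_SIZE and LOCAL_RANK for each process.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# torchvision's resnet18 is only a stand-in for the ResNet-44 of the paper.
model = torchvision.models.resnet18(num_classes=10).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

# DistributedSampler gives each process a disjoint shard of CIFAR-10.
dataset = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True,
    transform=torchvision.transforms.ToTensor())
sampler = torch.utils.data.distributed.DistributedSampler(dataset)
loader = torch.utils.data.DataLoader(dataset, batch_size=128, sampler=sampler)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(1):
    sampler.set_epoch(epoch)                      # reshuffle shards each epoch
    for images, labels in loader:
        images = images.cuda(local_rank)
        labels = labels.cuda(local_rank)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()                           # gradients are all-reduced across GPUs here
        optimizer.step()

dist.destroy_process_group()

Launched as, e.g., torchrun --nproc_per_node=4 train.py, DistributedDataParallel overlaps the NCCL all-reduce of gradients with the backward pass, the same data-parallel pattern the Horovod sketch above relies on.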
Research center:
ULHPC - University of Luxembourg: High Performance Computing
Disciplines:
Computer science
Author, co-author:
Mahon, S.
VARRETTE, Sébastien ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > Computer Science and Communications Research Unit (CSC)
PLUGARU, Valentin ; University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Computer Science and Communications Research Unit (CSC)
PINEL, Frederic ; University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Computer Science and Communications Research Unit (CSC)
BOUVRY, Pascal ; University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Computer Science and Communications Research Unit (CSC)
External co-authors:
yes
Document language:
English
Title:
Performance Analysis of Distributed and Scalable Deep Learning
Publication date:
May 2020
Event name:
20th IEEE/ACM Intl. Symp. on Cluster, Cloud and Internet Computing (CCGrid'20)
Event location:
Melbourne, Australia
Event date:
May 11-14, 2020
Event scope:
International
Title of the main work:
20th IEEE/ACM Intl. Symp. on Cluster, Cloud and Internet Computing (CCGrid'20)
References:
S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 3rd ed. USA: Prentice Hall Press, 2009.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
M. Banko and E. Brill, "Scaling to very very large corpora for natural language disambiguation," in Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, ser. ACL '01. Stroudsburg, PA, USA: Association for Computational Linguistics, 2001, pp. 26-33.
J. Hestness, S. Narang, N. Ardalani, G. F. Diamos, H. Jun, H. Kianinejad, M. M. A. Patwary, Y. Yang, and Y. Zhou, "Deep learning scaling is predictable, empirically," ArXiv, vol. abs/1712.00409, 2017.
P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, "Accurate, large minibatch SGD: training imagenet in 1 hour," CoRR, vol. abs/1706.02677, 2017. [Online]. Available: http://arxiv.org/abs/1706.02677
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, "Tensorflow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265-283.
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "Pytorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 8024-8035.
R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: A matlab-like environment for machine learning," in BigLearn, NIPS Workshop, 2011.
A. Sergeev and M. D. Balso, "Horovod: fast and easy distributed deep learning in tensorflow," CoRR, vol. abs/1802.05799, 2018.
F. Chollet et al., "Keras," https://keras.io, 2015.
P. Patarasuk and X. Yuan, "Bandwidth optimal all-reduce algorithms for clusters of workstations," Journal of Parallel and Distributed Computing, vol. 69, no. 2, pp. 117-124, 2009.
M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
M. Poess, R. Nambiar, and K. Kulkarni, "A benchmark proposal for massive scale inference systems: (work-in-progress paper)," in Companion of the 2019 ACM/SPEC International Conference on Performance Engineering, ser. ICPE '19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 17-20. [Online]. Available: https://doi.org/10.1145/3302541.3313098
A. Krizhevsky, "Learning multiple layers of features from tiny images," 2009.
K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.
C. A. Coleman, D. Narayanan, D. Kang, T. J. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and M. Zaharia, "Dawnbench: An end-to-end deep learning benchmark and competition," 2017.
S. Verma, Q. Wu, B. Hanindhito, G. Jha, E. B. John, R. Radhakrishnan, and L. K. John, "Demystifying the MLPerf Benchmark Suite," arXiv e-prints, arXiv:1908.09207, Aug. 2019.
T. Ben-Nun, M. Besta, S. Huber, A. N. Ziogas, D. Peter, and T. Hoefler, "A modular benchmarking infrastructure for high-performance and reproducible deep learning," 2019.
S. Varrette, P. Bouvry, H. Cartiaux, and F. Georgatos, "Management of an Academic HPC Cluster: The UL Experience," in Proc. of the 2014 Intl. Conf. on High Performance Computing & Simulation (HPCS 2014). Bologna, Italy: IEEE, July 2014, pp. 959-967.