Paper published in a book (Scientific congresses, symposiums and conference proceedings)
Performance Analysis of Distributed and Scalable Deep Learning
Mahon, S.; Varrette, Sébastien; Plugaru, Valentin et al.
2020. In: 20th IEEE/ACM Intl. Symp. on Cluster, Cloud and Internet Computing (CCGrid'20)
Peer reviewed
 

Files


Full Text
Performance-Analysis-of-Distributed-and-Scalable-Deep-Learning_CCGrid20_609500a760.pdf
Publisher postprint (513.93 kB)



Details



Keywords :
Deep Learning; Performance Evaluation; GPU
Abstract :
[en] With renewed global interest in Artificial Intelligence (AI) methods, the past decade has seen a myriad of new programming models and tools that enable better and faster Machine Learning (ML). More recently, a subset of ML known as Deep Learning (DL) has attracted increased interest due to its inherent ability to tackle novel cognitive computing applications efficiently. DL allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction in an automated way, and can deliver higher predictive accuracy when trained on larger data sets. Based on Artificial Neural Networks (ANN), DL is now at the core of state-of-the-art voice recognition systems (which enable easy control over, e.g., Internet-of-Things (IoT) smart home appliances), self-driving car engines, and online recommendation systems. The ecosystem of DL frameworks is fast evolving, as are the DL architectures shown to perform well on specialized tasks and to exploit GPU accelerators. For this reason, frequent performance evaluation of the DL ecosystem is required, especially since the advent of novel distributed training frameworks such as Horovod, which allow for scalable training across multiple computing resources. In this paper, a scalability evaluation of the reference DL frameworks (TensorFlow, Keras, MXNet, and PyTorch) is performed on up-to-date High Performance Computing (HPC) resources to compare the efficiency of different implementations across several hardware architectures (CPU and GPU). Experimental results demonstrate that the DistributedDataParallel features of the PyTorch library appear to be the most efficient approach for distributing the training process across many devices, reaching a throughput speedup of 10.11 when training ResNet-44 on the CIFAR-10 dataset using 12 NVIDIA Tesla V100 GPUs.
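Illustrative sketch (not taken from the paper): a minimal PyTorch DistributedDataParallel training loop on CIFAR-10, launched with torchrun, with one process per GPU. ResNet-18, the batch size, and the optimizer settings are placeholder assumptions standing in for the ResNet-44 configuration evaluated in the paper.

# Minimal DistributedDataParallel (DDP) sketch; hyper-parameters are illustrative only.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
import torchvision
import torchvision.transforms as T

def main():
    # One process per GPU; torchrun sets RANK, LOCAL_RANK and WORLD_SIZE.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # CIFAR-10 with ResNet-18 as a stand-in for the ResNet-44 used in the paper.
    train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                             download=True, transform=T.ToTensor())
    sampler = DistributedSampler(train_set)          # shards the data across ranks
    loader = DataLoader(train_set, batch_size=128, sampler=sampler, num_workers=4)

    model = torchvision.models.resnet18(num_classes=10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])      # gradient all-reduce across GPUs

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    for epoch in range(10):
        sampler.set_epoch(epoch)                     # reshuffle shards each epoch
        for images, labels in loader:
            images, labels = images.cuda(local_rank), labels.cuda(local_rank)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()                          # gradients synchronized here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

A launch command such as "torchrun --nproc_per_node=4 train_ddp.py" (file name assumed) starts one process per local GPU; the DistributedSampler keeps the per-rank data shards disjoint, which is what allows the near-linear throughput scaling reported in the paper.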
Research center :
ULHPC - University of Luxembourg: High Performance Computing
Disciplines :
Computer science
Author, co-author :
Mahon, S.
Varrette, Sébastien ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > Computer Science and Communications Research Unit (CSC)
Plugaru, Valentin ;  University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Computer Science and Communications Research Unit (CSC)
Pinel, Frederic ;  University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Computer Science and Communications Research Unit (CSC)
Bouvry, Pascal ;  University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Computer Science and Communications Research Unit (CSC)
External co-authors :
yes
Language :
English
Title :
Performance Analysis of Distributed and Scalable Deep Learning
Publication date :
May 2020
Event name :
20th IEEE/ACM Intl. Symp. on Cluster, Cloud and Internet Computing (CCGrid'20)
Event place :
Melbourne, Australia
Event date :
May 11-14, 2020
Audience :
International
Main work title :
20th IEEE/ACM Intl. Symp. on Cluster, Cloud and Internet Computing (CCGrid'20)
Publisher :
IEEE/ACM, Melbourne, Australia
ISBN/EAN :
978-1-7281-6095-5
Pages :
760–766
Peer reviewed :
Peer reviewed
Focus Area :
Computational Sciences
Available on ORBilu :
since 04 June 2020
