Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 440-445, 2017.
Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnović. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems (NeurIPS), pages 1709-1720, 2017.
Quentin Anthony, Benjamin Michalowicz, Jacob Hatef, Lang Xu, Mustafa Abduljabbar, Aamir Shafi, Hari Subramoni, and Dhabaleswar K Panda. Demystifying the communication characteristics for distributed transformer models. In 2024 IEEE Symposium on High-Performance Interconnects (HOTI), pages 57-65. IEEE, 2024.
Yang Chen, Min Chen, Yanzhi Zhang, Liang Yang, and Victor CM Leung. Communication-efficient distributed learning: A comprehensive survey. IEEE Transactions on Parallel and Distributed Systems, 2023.
Jonas Geiping, Hartmut Bauermeister, Hannah Dröge, and Michael Moeller. Inverting gradients: How easy is it to break privacy in federated learning? In Advances in Neural Information Processing Systems (NeurIPS), 33:16937-16947, 2020.
Priya Goyal, Piotr Dollár, Ross Girshick, et al. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016.
Liam Hodgkinson, Zhichao Wang, and Michael W Mahoney. Models of heavy-tailed mechanistic universality. arXiv preprint arXiv:2506.03470, 2025.
Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U Stich, and Martin Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In International Conference on Machine Learning (ICML), pages 3252-3261, 2019.
Hongyang Li, Caesar Wu, Mohammed Chadli, Said Mammar, and Pascal Bouvry. Trustworthiness of stochastic gradient descent in distributed learning. arXiv preprint arXiv:2410.21491, 2024.
Hongyang Li, Caesar Wu, Mohammed Chadli, Said Mammar, and Pascal Bouvry. Lightweight trustworthy distributed clustering. arXiv preprint arXiv:2504.10109, 2025.
Qiongxiu Li, Jaron Skovsted Gundersen, Milan Lopuhaä-Zwakenberg, and Richard Heusdens. Adaptive differentially quantized subspace perturbation (ADQSP): A unified framework for privacy-preserving distributed average consensus. IEEE Transactions on Information Forensics and Security, 19:1780-1793, 2023.
Qiongxiu Li, Richard Heusdens, and Mads Græsbøll Christensen. Communication efficient privacy-preserving distributed optimization using adaptive differential quantization. Signal Processing, 194:108456, 2022.
Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations (ICLR), 2018.
Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan Alistarh, and Daniel M Roy. NUQSGD: Provably communication-efficient data-parallel SGD via nonuniform quantization. Journal of Machine Learning Research, 22(114):1-43, 2021.
Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In INTERSPEECH, pages 1058-1062, 2014.
Christopher J Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E Dahl. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600, 2018.
Shaohuai Shi, Qiang Wang, Kaiyong Zhao, Zhenheng Tang, Yuxin Wang, Xiang Huang, and Xiaowen Chu. A distributed synchronous SGD algorithm with global top-k sparsification for low bandwidth networks. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pages 2238-2247. IEEE, 2019.
Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Advances in Neural Information Processing Systems (NeurIPS), pages 4447-4458, 2018.
Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. PowerSGD: Practical low-rank gradient compression for distributed optimization. In Advances in Neural Information Processing Systems (NeurIPS), pages 14236-14246, 2019.