Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Optimization with Sparsity-Inducing Penalties. Foundations and Trends in Machine Learning, 4(1):1-106, 2011. doi: 10.1561/2200000015. URL http://arxiv.org/abs/1108.0775.
Yu Bai, Yu-Xiang Wang, and Edo Liberty. ProxQuant: Quantized neural networks via proximal operators. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HyzMyhCcK7.
Dimitri P. Bertsekas and John N. Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627-642, 2000. doi: 10.1137/S1052623497331063. URL https://doi.org/10.1137/S1052623497331063.
Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2010. doi: 10.1561/2200000016. URL http://www.nowpublishers.com/product.aspx?product=MAL&doi=2200000016.
Xiangyi Chen, Sijia Liu, Ruoyu Sun, and Mingyi Hong. On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization. In International Conference on Learning Representations, 2019. URL http://arxiv.org/abs/1808.02941.
Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems 28, pp. 3123-3131, 2015.
Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized Neural Networks. In Advances in Neural Information Processing Systems 29, 2016. URL https://papers.nips.cc/paper/6573-binarized-neural-networks.pdf.
Trevor Gale, Erich Elsen, and Sara Hooker. The State of Sparsity in Deep Neural Networks. arXiv preprint arXiv:1902.09574, 2019. URL http://arxiv.org/abs/1902.09574.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
Lu Hou, Quanming Yao, and James T. Kwok. Loss-aware binarization of deep networks. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=S1oWlN9ll.
Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017. URL https://arxiv.org/abs/1608.06993.
Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, 2015. URL http://arxiv.org/abs/1412.6980.
Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
Yunwen Lei, Ting Hu, and Ke Tang. Stochastic Gradient Descent for Nonconvex Learning without Bounded Gradient Assumptions. To appear in IEEE Transactions on Neural Networks and Learning Systems, 2019. URL https://ieeexplore.ieee.org/document/8930994.
Christos Louizos, Max Welling, and Diederik P. Kingma. Learning Sparse Neural Networks through L0 Regularization. In International Conference on Learning Representations, 2018. URL http://arxiv.org/abs/1712.01312.
Liangchen Luo, Yuanhao Xiong, Yan Liu, and Xu Sun. Adaptive Gradient Method with Dynamic Bound of Learning Rate. In International Conference on Learning Representations, 2019. URL http://arxiv.org/abs/1902.09843v1.
Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127-239, 2014. doi: 10.1561/2400000003. URL http://www.nowpublishers.com/articles/foundations-and-trends-in-optimization/OPT-003.
Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018. URL http://arxiv.org/abs/1904.09237.
Andrzej Ruszczyński. Feasible direction methods for stochastic programming problems. Mathematical Programming, 19(1):220-229, December 1980. doi: 10.1007/BF01581643. URL https://doi.org/10.1007/BF01581643.
Yang Yang, Gesualdo Scutari, Daniel P. Palomar, and Marius Pesavento. A parallel decomposition method for nonconvex stochastic multi-agent optimization problems. IEEE Transactions on Signal Processing, 64(11):2949-2964, June 2016. doi: 10.1109/TSP.2016.2531627. URL http://ieeexplore.ieee.org/document/7412752/.
Penghang Yin, Shuai Zhang, Jiancheng Lyu, Stanley Osher, Yingyong Qi, and Jack Xin. BinaryRelax: A Relaxation Approach for Training Deep Neural Networks with Quantized Weights. SIAM Journal on Imaging Sciences, 11(4):2205-2223, January 2018. doi: 10.1137/18M1166134. URL https://epubs.siam.org/doi/10.1137/18M1166134.