Transformers have achieved state-of-the-art performance in most common tasks to which they have been applied. These achievements are attributed to the Self-Attention mechanism at their core. Self-Attention maps the relationships between all pairs of tokens in a given sequence. This exhaustive mapping incurs massive costs in memory and inference time, as Self-Attention scales quadratically with sequence length. Because of this memory and time bottleneck, standard Self-Attention demands increasingly large compute and memory budgets when applied to long input sequences. Efficient Transformers emerged as performant alternatives, demonstrating good scalability and occasionally better tracking of long-range dependencies. Their efficiency gains are obtained through various techniques, most of which reduce the attention computation to linear scaling through sparsification, approximation, or related schemes. Among existing approaches, those based on low-rank approximation are particularly attractive because of their compatibility with standard Self-Attention-based models, which allows for weight transfer and other time-saving schemes. More recently, hardware-aware implementations of Self-Attention (e.g., FlashAttention) have mitigated the memory bottleneck and alleviated the compute burden of exact attention. However, these hardware-aware implementations have stricter hardware compatibility requirements, which keeps Efficient Transformers relevant on older or less powerful hardware. Furthermore, some Efficient Transformers can themselves be implemented in a hardware-aware manner to further improve training and inference speed. In this paper, we propose a novel linear approximation method for Self-Attention inspired by the CUR approximation method. The method, proposed in two versions (one leveraging FlashAttention), is conceived as a drop-in replacement for standard Self-Attention with weight compatibility. Our method compares favorably with standard Transformers and Efficient Transformers on varied tasks, demonstrates a significant decrease in memory footprint, and achieves competitive training speed, even compared with similar methods.
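For context, the sketch below illustrates the generic CUR idea from Drineas et al. (2008) on a dense score matrix: the matrix is reconstructed from a subset of its columns (C), a subset of its rows (R), and the Moore-Penrose pseudoinverse (Penrose) of their intersection (W). This is a minimal illustration assuming uniform landmark sampling and a hypothetical cur_approximation helper; it is not the attention mechanism proposed in the paper.

import numpy as np

def cur_approximation(S: np.ndarray, c: int, r: int, seed: int = 0) -> np.ndarray:
    """CUR-style reconstruction of S from c column landmarks and r row landmarks."""
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    col_idx = rng.choice(n, size=c, replace=False)  # sampled column landmarks
    row_idx = rng.choice(n, size=r, replace=False)  # sampled row landmarks
    C = S[:, col_idx]                               # (n, c) selected columns
    R = S[row_idx, :]                               # (r, n) selected rows
    W = S[np.ix_(row_idx, col_idx)]                 # (r, c) intersection block
    # S is approximated as C @ pinv(W) @ R (Moore-Penrose pseudoinverse of W)
    return C @ np.linalg.pinv(W) @ R

# Toy usage: a low-rank "score" matrix is recovered almost exactly.
rng = np.random.default_rng(1)
S = rng.random((256, 16)) @ rng.random((16, 256))     # rank-16 matrix
S_hat = cur_approximation(S, c=32, r=32)
print(np.linalg.norm(S - S_hat) / np.linalg.norm(S))  # small relative error

Low-rank attention approximations exploit the same principle: when the score matrix is (approximately) low-rank, a small set of row and column landmarks suffices to reconstruct it, avoiding the quadratic cost of forming it in full.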
Research center:
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > Other
Luxembourg Centre for Systems Biomedicine (LCSB): Integrative Cell Signalling (Skupin Group)
Disciplines:
Computer science
Author, co-author:
FRANCOIS, Damien ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
Saillot, Mathis ; LGIPM, Université de Lorraine, Metz, France
KLEIN, Jacques ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
Bissyandé, Tegawendé F. ; SnT, University of Luxembourg, Luxembourg, Luxembourg
SKUPIN, Alexander ; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Integrative Cell Signalling ; Department of Neurosciences, University of California San Diego, United States
Institute for Advanced Studies of the University of Luxembourg
Funding (details):
Author Damien François acknowledges financial support from the Institute for Advanced Studies of the University of Luxembourg through the IDAE Audacity Grant (AUDACITY-2021)
Beltagy, I, Peters, M.E, & Cohan, A. (2020). Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150
Child, R, Gray, S, Radford, A, & Sutskever, I. (2019). Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509
Choromanski, K, Likhosherstov, V, Dohan, D, Song, X, Gane, A, Sarlos, T, Hawkins, P, Davis, J, Mohiuddin, A, & Kaiser, L. (2020). Rethinking attention with performers. arXiv preprint arXiv:2009.14794
Dai, Z., Yang, Z., Yang, Y., Carbonell, J.G., Le, Q.V., & Salakhutdinov, R. (2019). Transformer-xl: Attentive language models beyond a fixed-length context. In: Annual Meeting of the Association for Computational Linguistics. https://api.semanticscholar.org/CorpusID:57759363
Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. ArXiv abs/2307.08691
Dao, T, Fu, D.Y, Ermon, S, Rudra, A, & Ré, C. (2022). FlashAttention: Fast and memory-efficient exact attention with IO-awareness. ArXiv abs/2205.14135
Dosovitskiy, A, Beyer, L, Kolesnikov, A, Weissenborn, D, Zhai, X, Unterthiner, T, Dehghani, M, Minderer, M, Heigold, G, Gelly, S, Uszkoreit, J, & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv abs/2010.11929
Drineas, P., Mahoney, M.W., & Muthukrishnan, S. (2008). Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30, 844–881. https://doi.org/10.1137/07070471X
Ho, J, Kalchbrenner, N, Weissenborn, D, & Salimans, T. (2019). Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180
Hu, E.J, Shen, Y, Wallis, P, Allen-Zhu, Z, Li, Y, Wang, S, Wang, L, & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. https://arxiv.org/abs/2106.09685
Katharopoulos, A, Vyas, A, Pappas, N, & Fleuret, F. (2020). Transformers are rnns: Fast autoregressive transformers with linear attention. In: International Conference on Machine Learning, pp. 5156–5165. PMLR
Kitaev, N, Kaiser, L, & Levskaya, A. (2020). Reformer: The efficient transformer. ArXiv abs/2001.04451
Krizhevsky, A. (2009). Learning multiple layers of features from tiny images.
Linsley, D.A., Kim, J., Veerabadran, V., & Serre, T. (2018). Learning long-range spatial dependencies with horizontal gated-recurrent units. ArXiv abs/1805.08315
Ma, X, Kong, X, Wang, S, Zhou, C, May, J, Ma, H, & Zettlemoyer, L. (2021). Luna: Linear unified nested attention. ArXiv abs/2106.01540
Ma, X, Zhou, C, Kong, X, He, J, Gui, L, Neubig, G, May, J, & Zettlemoyer, L. (2022). Mega: Moving average equipped gated attention. ArXiv abs/2209.10655
Maas, A.L, Daly, R.E, Pham, P.T, Huang, D, Ng, A, & Potts, C. (2011). Learning word vectors for sentiment analysis. In: Annual Meeting of the Association for Computational Linguistics
Nangia, N., & Bowman, S.R. (2018). Listops: A diagnostic dataset for latent tree learning. ArXiv abs/1804.06028
Parmar, N, Vaswani, A, Uszkoreit, J, Kaiser, L, Shazeer, N, Ku, A, & Tran, D. (2018). Image transformer. In: International Conference on Machine Learning, pp. 4055–4064. PMLR
Peng, H, Pappas, N, Yogatama, D, Schwartz, R, Smith, N.A, & Kong, L. (2021). Random feature attention. arXiv preprint arXiv:2103.02143
Peng, B, Quesnelle, J, Fan, H, & Shippole, E. (2023). YaRN: Efficient Context Window Extension of Large Language Models. https://arxiv.org/abs/2309.00071
Penrose, R. (1955). A generalized inverse for matrices. Mathematical Proceedings of the Cambridge Philosophical Society, 51, 406–413. https://doi.org/10.1017/S0305004100030401
Qin, B., Li, J., Tang, S., & Zhuang, Y. (2022). Dba: Efficient transformer with dynamic bilinear low-rank attention. ArXiv abs/2211.16368
Qiu, J., Ma, H., Levy, O., Yih, S., Wang, S., & Tang, J. (2019). Blockwise self-attention for long document understanding. ArXiv abs/1911.02972
Radev, D.R., Muthukrishnan, P., Qazvinian, V., & Abu-Jbara, A. (2013). The ACL Anthology Network corpus. Language Resources and Evaluation, 47, 919–944. https://doi.org/10.1007/s10579-012-9211-2
Roy, A., Saffar, M., Vaswani, A., & Grangier, D. (2021). Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9, 53–68. https://doi.org/10.1162/tacl_a_00353
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., & Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3), 211–252. https://doi.org/10.1007/s11263-015-0816-y
Tay, Y, Dehghani, M, Abnar, S, Shen, Y, Bahri, D, Pham, P, Rao, J, Yang, L, Ruder, S, & Metzler, D. (2020). Long range arena: A benchmark for efficient transformers. ArXiv abs/2011.04006
Tillet, P., Kung, H.-T., & Cox, D.D. (2019). Triton: an intermediate language and compiler for tiled neural network computations. Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages
Vaswani, A, Shazeer, N, Parmar, N, Uszkoreit, J, Jones, L, Gomez, A. N, Kaiser, Ł, & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Wang, S, Li, B.Z, Khabsa, M, Fang, H, & Ma, H. (2020). Linformer: Self-attention with linear complexity. ArXiv abs/2006.04768
Wu, Y, Kan, S, Zeng, M, & Li, M. (2023). Singularformer: Learning to decompose self-attention to linearize the complexity of transformer. In: Elkind, E. (ed.) Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, pp. 4433–4441. International Joint Conferences on Artificial Intelligence Organization. https://doi.org/10.24963/ijcai.2023/493. Main Track.
Xiong, Y., Zeng, Z., Chakraborty, R., Tan, M., Fung, G. M., Li, Y., & Singh, V. (2021). Nyströmformer: A Nyström-based algorithm for approximating self-attention. Proceedings of the AAAI Conference on Artificial Intelligence, 35(16), 14138–14148.