Article (Scientific journals)
Drop-in efficient self-attention approximation method
FRANCOIS, Damien; Saillot, Mathis; KLEIN, Jacques et al.
2025, in Machine Learning, 114 (6)
Peer reviewed, verified by ORBi
 

Documents


Full text
s10994-025-06768-3.pdf
Publisher postprint (1.69 MB), Creative Commons Attribution licence



Details



Keywords:
Attention approximation; Deep learning; Machine learning; Self-attention; Transformers; Approximation methods; Attention mechanisms; Machine-learning; Memory usage; Sequence lengths; State-of-the-art performance; Transformer; Software; Artificial Intelligence
Abstract:
[en] Transformers have achieved state-of-the-art performance in most common tasks to which they have been applied. These achievements are attributed to the Self-Attention mechanism at their core. Self-Attention is understood to map the relationships between the tokens of any given sequence. This exhaustive mapping incurs massive costs in memory and inference time, as Self-Attention scales quadratically with sequence length. Because of this memory and time bottleneck, standard Self-Attention requires increasingly large compute and memory budgets when applied to long input sequences. Efficient Transformers emerged as performant alternatives that demonstrate good scalability and, occasionally, better tracking of long-range dependencies. Their efficiency gains are obtained through different methods, usually aiming at linear scaling of the attention matrix through sparsification, approximation, or other techniques. Among existing approaches, those using low-rank approximation are particularly advantageous because of their compatibility with standard Self-Attention-based models, which allows for weight transfers and other time-saving schemes. More recently, hardware-aware versions of Self-Attention (e.g., FlashAttention) have mitigated its memory bottleneck and alleviated its compute burden. Unfortunately, hardware-aware Self-Attention has stricter hardware compatibility requirements, which keeps Efficient Transformers relevant on older or less powerful hardware. Furthermore, some Efficient Transformers can even be applied in a hardware-aware manner to further improve training and inference speed. In this paper, we propose a novel linear approximation method for Self-Attention inspired by the CUR approximation method. The method, proposed in two versions (one leveraging FlashAttention), is conceived as a drop-in replacement for standard Self-Attention with weight compatibility. Our method compares favorably to standard Transformers and Efficient Transformers on varied tasks, and it demonstrates a significant decrease in memory footprint as well as competitive training speed, even compared to similar methods.
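
As a rough illustration of the kind of low-rank, CUR-inspired approximation the abstract describes, the following minimal Python/NumPy sketch contrasts exact scaled dot-product attention, which materialises the full n x n score matrix, with a generic Nyström/CUR-style variant built from m sampled landmark positions. The uniform landmark sampling, the pseudo-inverse core, and all function names below are illustrative assumptions, not the method proposed in the paper (which, among other things, also includes a FlashAttention-based version):

# Illustrative sketch only: a generic Nystrom/CUR-style low-rank attention
# approximation, NOT the paper's actual algorithm.
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def standard_attention(Q, K, V):
    """Exact scaled dot-product attention: materialises the full n x n score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)             # (n, n) -> quadratic in sequence length
    return softmax(scores, axis=-1) @ V       # (n, d_v)


def cur_style_attention(Q, K, V, m=32, seed=0):
    """CUR/Nystrom-style approximation using m sampled landmark positions (assumption).

    Approximates softmax(Q K^T / sqrt(d)) V as C @ pinv(W) @ (R @ V), where C, W, R
    are built only against the landmarks, so no n x n matrix is ever formed
    (memory is O(n * m) instead of O(n^2)).
    """
    rng = np.random.default_rng(seed)
    n, d = Q.shape
    idx = rng.choice(n, size=min(m, n), replace=False)    # uniform landmark sampling (assumption)
    Qm, Km = Q[idx], K[idx]
    scale = np.sqrt(d)
    C = softmax(Q @ Km.T / scale, axis=-1)     # (n, m)  query-to-landmark block
    W = softmax(Qm @ Km.T / scale, axis=-1)    # (m, m)  small core block
    R = softmax(Qm @ K.T / scale, axis=-1)     # (m, n)  landmark-to-key block
    return C @ np.linalg.pinv(W) @ (R @ V)     # (n, d_v)


if __name__ == "__main__":
    n, d = 1024, 64
    rng = np.random.default_rng(1)
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    exact = standard_attention(Q, K, V)
    approx = cur_style_attention(Q, K, V, m=64)
    print("mean absolute error:", np.abs(exact - approx).mean())

For n = 1024 and m = 64, the approximate path only forms matrices of shapes (n, m), (m, m), and (m, n) rather than the 1024 x 1024 attention matrix, which is where the memory savings mentioned in the abstract come from.
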
Research centre:
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > Other
Luxembourg Centre for Systems Biomedicine (LCSB): Integrative Cell Signalling (Skupin Group)
Disciplines:
Computer science
Author, co-author:
FRANCOIS, Damien; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
Saillot, Mathis; LGIPM, Université de Lorraine, Metz, France
KLEIN, Jacques; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
Bissyandé, Tegawendé F.; SnT, University of Luxembourg, Luxembourg, Luxembourg
SKUPIN, Alexander; University of Luxembourg > Luxembourg Centre for Systems Biomedicine (LCSB) > Integrative Cell Signalling; Department of Neurosciences, University of California San Diego, United States
External co-authors:
yes
Document language:
English
Title:
Drop-in efficient self-attention approximation method
Publication date:
25 April 2025
Journal title:
Machine Learning
ISSN:
0885-6125
eISSN:
1573-0565
Publisher:
Springer
Volume:
114
Issue:
6
Peer reviewed:
Peer reviewed, verified by ORBi
Focus Area:
Computational Sciences
Research project title:
U-AGR-6008 - IAS AUDACITY IDAE - SKUPIN Alexander
Funding body:
Institute for Advanced Studies of the University of Luxembourg
Funding (details):
Author Damien François acknowledges financial support of the Institute for Advanced Studies of the University of Luxembourg through the IDAE Audacity Grant (AUDACITY-2021).
Available on ORBilu:
since 21 May 2025

Statistics

Number of views: 98 (of which 6 Unilu)
Number of downloads: 47 (of which 1 Unilu)

Citations: Scopus® 0; Scopus® excluding self-citations 0; OpenCitations 0; OpenAlex 0; WoS 0