Reference: LuxemBERT: Simple and Practical Data Augmentation in Language Model Pre-Training for Luxembourgish
Scientific congresses, symposiums and conference proceedings: Paper published in a book
Engineering, computing & technology: Computer science
Computational Sciences
http://hdl.handle.net/10993/51815
LuxemBERT: Simple and Practical Data Augmentation in Language Model Pre-Training for Luxembourgish
English
Lothritz, Cedric [University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX]
Lebichot, Bertrand [University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX]
Allix, Kevin [University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX]
Veiber, Lisa [University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX]
Bissyandé, Tegawendé François D'Assise [University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX]
Klein, Jacques [University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX]
Boytsov, Andrey [Banque BGL BNP Paribas]
Goujon, Anne [Banque BGL BNP Paribas]
Lefebvre, Clément [Banque BGL BNP Paribas]
Jun-2022
Proceedings of the Language Resources and Evaluation Conference, 2022
5080-5089
International
13th Language Resources and Evaluation Conference (LREC 2022)
20.06.2022-25.06.2022
European Language Resources Association
Marseille
France
[en] language modelling ; natural language processing ; data augmentation
[en] Pre-trained Language Models such as BERT have become ubiquitous in NLP where they have achieved state-of-the-art performance in most NLP tasks. While these models are readily available for English and other widely spoken languages, they remain scarce for low-resource languages such as Luxembourgish. In this paper, we present LuxemBERT, a BERT model for the Luxembourgish language that we create using the following approach: we augment the pre-training dataset by considering text data from a closely related language that we partially translate using a simple and straightforward method. We are then able to produce the LuxemBERT model, which we show to be effective for various NLP tasks: it outperforms a simple baseline built with the available Luxembourgish text data as well as the multilingual mBERT model, which is currently the only option for transformer-based language models in Luxembourgish. Furthermore, we present datasets for various downstream NLP tasks that we created for this study and will make available to researchers on request.
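The abstract describes the augmentation method only at a high level. Below is a minimal sketch of one plausible reading of "partial translation": word-level dictionary substitution from a closely related language (here German), where words without a dictionary entry are left untranslated, yielding mixed-language sentences. The dictionary entries, corpus, and function names are illustrative assumptions, not the authors' actual pipeline.

    # Hypothetical sketch of partial word-level translation for data
    # augmentation. The German -> Luxembourgish dictionary below is an
    # assumed, illustrative resource, not the one used in the paper.
    de_lb_dict = {
        "ich": "ech",
        "nicht": "net",
        "und": "an",
        "sprache": "sprooch",
    }

    def partially_translate(sentence: str, dictionary: dict) -> str:
        """Replace each word that has a dictionary entry with its
        Luxembourgish counterpart; keep unknown words as-is, producing
        a partially translated sentence. (Case is folded for lookup,
        which is a simplification in this sketch.)"""
        words = sentence.split()
        return " ".join(dictionary.get(w.lower(), w) for w in words)

    # Augment a tiny illustrative German corpus; the output would be
    # appended to the Luxembourgish pre-training data.
    german_corpus = ["ich verstehe die sprache nicht"]
    augmented = [partially_translate(s, de_lb_dict) for s in german_corpus]
    print(augmented)  # ['ech verstehe die sprooch net']

Under this reading, "simple and straightforward" would mean no trained translation model is required, only a bilingual lexicon and tokenisation, which is what makes the approach practical for a low-resource language.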

File(s) associated with this reference

Fulltext file(s):

File: LuxemBERT_LREC.pdf
Version: Author postprint
Size: 627.75 kB
Access: Open access


All documents in ORBilu are protected by a user license.