Keywords: language modelling; natural language processing; data augmentation
Abstract:
Pre-trained language models such as BERT have become ubiquitous in NLP, where they achieve state-of-the-art performance on most tasks. While these models are readily available for English and other widely spoken languages, they remain scarce for low-resource languages such as Luxembourgish. In this paper, we present LuxemBERT, a BERT model for the Luxembourgish language that we create using the following approach: we augment the pre-training dataset with text data from a closely related language that we partially translate using a simple and straightforward method. We are then able to produce the LuxemBERT model, which we show to be effective for various NLP tasks: it outperforms both a simple baseline built with the available Luxembourgish text data and the multilingual mBERT model, which is currently the only option for transformer-based language models in Luxembourgish. Furthermore, we present datasets for various downstream NLP tasks that we created for this study and will make available to researchers on request.
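The abstract describes the partial-translation augmentation only at a high level. As an illustration, the following minimal Python sketch shows one plausible reading of "partially translate": tokens from a closely related language (here German) are replaced with Luxembourgish equivalents wherever a bilingual lexicon has an entry, and kept unchanged otherwise. The lexicon GERMAN_TO_LUXEMBOURGISH and the function partially_translate are hypothetical names introduced for this example, not the authors' actual implementation.

# A minimal sketch of dictionary-based partial translation for data
# augmentation. The tiny lexicon below is an assumption made for the
# example; the paper's actual method and resources may differ.
GERMAN_TO_LUXEMBOURGISH = {
    "ich": "ech",
    "nicht": "net",
    "und": "an",
    "auch": "och",
}

def partially_translate(sentence: str) -> str:
    """Replace each token found in the lexicon; leave unknown tokens as-is."""
    translated = [
        GERMAN_TO_LUXEMBOURGISH.get(token.lower(), token)
        for token in sentence.split()
    ]
    return " ".join(translated)

# Example: only tokens covered by the lexicon are translated.
print(partially_translate("ich verstehe das nicht"))
# -> "ech verstehe das net"

The design choice illustrated here is that sentences need not be fully translatable to be useful for pre-training: replacing only the lexicon-covered tokens still shifts the augmented text toward the target language's distribution.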
Disciplines:
Computer science
Author, co-author:
LOTHRITZ, Cedric ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
LEBICHOT, Bertrand ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
ALLIX, Kevin ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
VEIBER, Lisa ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX