Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-fledged standard variety. Additionally, developing NLP tools for Luxembourgish is a difficult task given the lack of annotated and parallel data, which is exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures with training data obtained from word-level real-life variation data. We perform a fine-grained, linguistically-motivated evaluation to test byte-based, word-based and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.
Disciplines :
Languages & linguistics; Computer science
Author, co-author :
LUTGEN, Anne-Marie ; University of Luxembourg > Faculty of Humanities, Education and Social Sciences (FHSE) > Department of Humanities (DHUM) > Luxembourg Studies
PLUM, Alistair ; University of Luxembourg > Faculty of Humanities, Education and Social Sciences (FHSE) > Department of Humanities (DHUM) > Luxembourg Studies
PURSCHKE, Christoph ; University of Luxembourg > Faculty of Humanities, Education and Social Sciences (FHSE) > Department of Humanities (DHUM) > Luxembourg Studies
Plank, Barbara
External co-authors :
yes
Language :
English
Title :
Neural Text Normalization for Luxembourgish Using Real-Life Variation Data
Publication date :
January 2025
Event name :
VarDial @ COLING
Event date :
2025
Audience :
International
Journal title :
International Conference on Computational Linguistics
Publisher :
Association for Computational Linguistics, Abu Dhabi, UAE