Paper published in a journal (Scientific congresses, symposiums and conference proceedings)
Neural Text Normalization for Luxembourgish Using Real-Life Variation Data
LUTGEN, Anne-Marie; PLUM, Alistair; PURSCHKE, Christoph et al.
2025. In International Conference on Computational Linguistics, p. 115–127
Peer reviewed
 

Files


Full Text
2025.vardial-1.9.pdf
Publisher postprint (330.3 kB) Creative Commons License - Attribution
Details



Keywords :
CuCo Lab
Abstract :
[en] Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-fledged standard variety. Additionally, developing NLP tools for Luxembourgish is a difficult task given the lack of annotated and parallel data, which is exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures with training data obtained from word-level real-life variation data. We perform a fine-grained, linguistically-motivated evaluation to test byte-based, word-based and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.
Disciplines :
Languages & linguistics
Computer science
Author, co-author :
LUTGEN, Anne-Marie; University of Luxembourg > Faculty of Humanities, Education and Social Sciences (FHSE) > Department of Humanities (DHUM) > Luxembourg Studies
PLUM, Alistair; University of Luxembourg > Faculty of Humanities, Education and Social Sciences (FHSE) > Department of Humanities (DHUM) > Luxembourg Studies
PURSCHKE, Christoph; University of Luxembourg > Faculty of Humanities, Education and Social Sciences (FHSE) > Department of Humanities (DHUM) > Luxembourg Studies
Plank, Barbara
External co-authors :
yes
Language :
English
Title :
Neural Text Normalization for Luxembourgish Using Real-Life Variation Data
Publication date :
January 2025
Event name :
VarDial @ COLING
Event date :
2025
Audience :
International
Journal title :
International Conference on Computational Linguistics
Publisher :
Association for Computational Linguistics, Abu Dhabi, UAE
Pages :
115–127
Peer reviewed :
Peer reviewed
Available on ORBilu :
since 29 January 2025

