Doctoral thesis (Dissertations and theses)
NLP De Luxe - Challenges for Natural Language Processing in Luxembourg
Lothritz, Cedric
2023
 

Files


Full Text
PhD_Thesis_Lothritz.pdf
Author postprint (10.3 MB)


Details



Keywords :
NLP; natural language processing; linguistics; luxembourg; luxembourgish; multilingualism; fintech; language modeling; bert; named entity recognition; de-identification; anonymisation; chatbot; conversational ai; luxembert; low-resource; data augmentation; pre-training
Abstract :
[en] The Grand Duchy of Luxembourg is a small country in Western Europe which, despite its size, is an important global financial centre. Due to its highly multilingual population, and the fact that one of its national languages - Luxembourgish - is regarded as a low-resource language, the country lends itself naturally to a wide variety of interesting research opportunities in the domain of Natural Language Processing (NLP). This thesis discusses and addresses challenges of domain-specific and language-specific NLP, using the unique linguistic situation in Luxembourg as an elaborate case study. We focus on three main topics: (I) NLP challenges in the financial domain, specifically the handling of personal names in sensitive documents, (II) NLP challenges related to multilingualism, and (III) NLP challenges for low-resource languages, with Luxembourgish as the language of interest. With regard to NLP challenges in the financial domain, we address the challenge of finding and anonymising names in documents. First, we present an empirical study on the usefulness of Transformer-based deep learning models for the task of Fine-Grained Named Entity Recognition. This study was conducted for a wide array of domains, including the financial domain. We show that Transformer-based models, and in particular BERT models, yield the best performance for this task. We furthermore show that performance is strongly dependent on the domain itself, regardless of the choice of model. The automatic detection of names in text documents in turn facilitates the anonymisation of these documents. However, anonymisation can distort data and have a negative effect on models built on that data. We investigate the impact of anonymising personal names on the performance of deep learning models trained on a large number of NLP tasks. Based on our experiments, we establish which anonymisation strategy should be used to guarantee accurate NLP models.
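To illustrate what is meant by competing anonymisation strategies, here is a minimal sketch of two common options: suppressing detected names with a fixed placeholder, versus pseudonymising them with a consistent substitute so repeated mentions stay linkable. All function names, the token example, and the substitute-name pool are illustrative assumptions, not taken from the thesis; the thesis's own strategies and findings are in the full text.

```python
import random

def suppress(tokens, name_spans, placeholder="[NAME]"):
    """Replace every detected name token with one fixed placeholder."""
    out = list(tokens)
    for i in name_spans:
        out[i] = placeholder
    return out

def pseudonymise(tokens, name_spans, pool, seed=0):
    """Replace each distinct name with a consistent substitute drawn from
    a pool, so two mentions of the same person get the same substitute."""
    rng = random.Random(seed)
    mapping = {}
    out = list(tokens)
    for i in name_spans:
        name = tokens[i]
        if name not in mapping:
            mapping[name] = rng.choice(pool)
        out[i] = mapping[name]
    return out

tokens = ["Mr", "Weber", "met", "Weber", "at", "the", "bank"]
spans = [1, 3]  # token indices flagged by an upstream NER model
print(suppress(tokens, spans))
# prints: ['Mr', '[NAME]', 'met', '[NAME]', 'at', 'the', 'bank']
print(pseudonymise(tokens, spans, pool=["Schmit", "Muller"]))
```

The design trade-off the abstract alludes to is visible even in this toy: suppression removes more signal (all names collapse to one token), while pseudonymisation preserves co-reference structure that downstream models may rely on.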
Regarding NLP challenges related to multilingualism, we address the need for polyglot conversational AI in a multilingual environment such as Luxembourg. In an empirical study, we evaluate the trade-off between a single multilingual chatbot and multiple monolingual chatbots trained on Intent Classification and Slot Filling for the banking domain. Furthermore, we publish a quadrilingual, parallel dataset that we built specifically for this study and that can be used to train a client support assistant for the banking domain. With regard to NLP challenges for the Luxembourgish language, we predominantly address the lack of a suitable language model and of datasets for NLP tasks in Luxembourgish. First, we present the most impactful contribution of this thesis: the first BERT model for the Luxembourgish language, which we name LuxemBERT. We explore a novel data augmentation technique based on partially and systematically translating texts to Luxembourgish from a closely related language in order to artificially enlarge the training data for LuxemBERT. Furthermore, we create datasets for a variety of downstream NLP tasks in Luxembourgish to evaluate the performance of LuxemBERT. We use these datasets to show that LuxemBERT outperforms mBERT, the de facto state-of-the-art model for Luxembourgish. Finally, we compare different approaches to pre-training BERT models for Luxembourgish. Specifically, we investigate whether it is preferable to pre-train a BERT model from scratch or to continue pre-training an existing model on new data. To this end, we further pre-train the multilingual mBERT model and the German GottBERT model on the Luxembourgish dataset used to pre-train LuxemBERT, and compare all models in terms of performance and robustness. We make all our language models as well as the datasets available to the NLP community.
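The partial-translation augmentation idea can be sketched as a dictionary-based token substitution: tokens with a known equivalent in the target language are swapped, the rest are kept verbatim, yielding text that mixes source and target vocabulary. This is only a toy illustration of the general technique, assuming German as the closely related source language; the mini German-to-Luxembourgish lexicon below is a hypothetical stand-in, not the thesis's actual translation resource or method.

```python
# Illustrative mini-lexicon (German -> Luxembourgish); entries are examples
# chosen for this sketch, not the resource used in the thesis.
LEXICON = {
    "ich": "ech",
    "nicht": "net",
    "und": "an",
    "ist": "ass",
}

def partial_translate(sentence, lexicon=LEXICON):
    """Systematically replace every token that has a lexicon entry,
    leaving unknown tokens untouched (hence 'partial' translation)."""
    out = []
    for token in sentence.split():
        out.append(lexicon.get(token.lower(), token))
    return " ".join(out)

print(partial_translate("ich bin nicht hier und das ist gut"))
# prints: ech bin net hier an das ass gut
```

The appeal of this kind of augmentation for a low-resource setting is that it is cheap and systematic: even a small, high-precision lexicon moves large amounts of related-language text closer to the target language's surface forms, enlarging the pre-training corpus.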
Research center :
- Interdisciplinary Centre for Security, Reliability and Trust (SnT) > TruX - Trustworthy Software Engineering
Disciplines :
Computer science
Author, co-author :
Lothritz, Cedric  ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
Language :
English
Title :
NLP De Luxe - Challenges for Natural Language Processing in Luxembourg
Defense date :
29 March 2023
Number of pages :
xvi, 132
Institution :
Unilu - University of Luxembourg, Luxembourg
Degree :
Docteur en Informatique
Promotor :
Jury member :
Purschke, Christoph  
Savoy, Jacques
Doğruöz, Seza
Boytsov, Andrey
Focus Area :
Computational Sciences
Available on ORBilu :
since 25 April 2023

