Doctoral thesis (Dissertations and theses)
NLP De Luxe - Challenges for Natural Language Processing in Luxembourg
Lothritz, Cedric
2023
 

Files


Full Text
PhD_Thesis_Lothritz.pdf
Author postprint (10.3 MB)


Details



Keywords :
NLP; natural language processing; linguistics; luxembourg; luxembourgish; multilingualism; fintech; language modeling; bert; named entity recognition; de-identification; anonymisation; chatbot; conversational ai; luxembert; low-resource; data augmentation; pre-training
Abstract :
[en] The Grand Duchy of Luxembourg is a small country in Western Europe which, despite its size, is an important global financial centre. Due to its highly multilingual population, and the fact that one of its national languages - Luxembourgish - is regarded as a low-resource language, the country lends itself naturally to a wide variety of interesting research opportunities in the domain of Natural Language Processing (NLP). This thesis discusses and addresses challenges of domain-specific and language-specific NLP, using the unique linguistic situation in Luxembourg as an elaborate case study. We focus on three main topics: (I) NLP challenges in the financial domain, specifically the handling of personal names in sensitive documents, (II) NLP challenges related to multilingualism, and (III) NLP challenges for low-resource languages, with Luxembourgish as the language of interest. With regard to NLP challenges in the financial domain, we address the challenge of finding and anonymising names in documents. First, we present an empirical study on the usefulness of Transformer-based deep learning models for the task of Fine-Grained Named Entity Recognition. This study was conducted for a wide array of domains, including the financial domain. We show that Transformer-based models, and in particular BERT models, yield the best performance for this task. We furthermore show that performance is strongly dependent on the domain itself, regardless of the choice of model. The automatic detection of names in text documents in turn facilitates the anonymisation of these documents. However, anonymisation can distort data and have a negative effect on models built on that data. We investigate the impact of anonymising personal names on the performance of deep learning models trained on a large number of NLP tasks. Based on our experiments, we establish which anonymisation strategy should be used to guarantee accurate NLP models.
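To illustrate what is meant by competing anonymisation strategies, here is a minimal sketch of two common options: suppressing detected names with a fixed placeholder, versus pseudonymising them with a consistent substitute so repeated mentions stay linkable. All function names, the token example, and the substitute-name pool are illustrative assumptions, not taken from the thesis; the thesis's own strategies and findings are in the full text.

```python
import random

def suppress(tokens, name_spans, placeholder="[NAME]"):
    """Replace every detected name token with one fixed placeholder."""
    out = list(tokens)
    for i in name_spans:
        out[i] = placeholder
    return out

def pseudonymise(tokens, name_spans, pool, seed=0):
    """Replace each distinct name with a consistent substitute drawn from
    a pool, so two mentions of the same person get the same substitute."""
    rng = random.Random(seed)
    mapping = {}
    out = list(tokens)
    for i in name_spans:
        name = tokens[i]
        if name not in mapping:
            mapping[name] = rng.choice(pool)
        out[i] = mapping[name]
    return out

tokens = ["Mr", "Weber", "met", "Weber", "at", "the", "bank"]
spans = [1, 3]  # token indices flagged by an upstream NER model
print(suppress(tokens, spans))
# prints: ['Mr', '[NAME]', 'met', '[NAME]', 'at', 'the', 'bank']
print(pseudonymise(tokens, spans, pool=["Schmit", "Muller"]))
```

The design trade-off the abstract alludes to is visible even in this toy: suppression removes more signal (all names collapse to one token), while pseudonymisation preserves co-reference structure that downstream models may rely on.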
Regarding NLP challenges related to multilingualism, we address the need for polyglot conversational AI in a multilingual environment such as Luxembourg. In an empirical study, we evaluate the trade-off between a single multilingual chatbot and multiple monolingual chatbots trained on Intent Classification and Slot Filling for the banking domain. Furthermore, we publish a quadrilingual, parallel dataset that we built specifically for this study and that can be used to train a client support assistant for the banking domain. With regard to NLP challenges for the Luxembourgish language, we predominantly address the lack of a suitable language model and of datasets for NLP tasks in Luxembourgish. First, we present the most impactful contribution of this thesis: the first BERT model for the Luxembourgish language, which we name LuxemBERT. We explore a novel data augmentation technique based on partially and systematically translating texts to Luxembourgish from a closely related language in order to artificially enlarge the training data for LuxemBERT. Furthermore, we create datasets for a variety of downstream NLP tasks in Luxembourgish to evaluate the performance of LuxemBERT. We use these datasets to show that LuxemBERT outperforms mBERT, the de facto state-of-the-art model for Luxembourgish. Finally, we compare different approaches to pre-training BERT models for Luxembourgish. Specifically, we investigate whether it is preferable to pre-train a BERT model from scratch or to continue pre-training an existing model on new data. To this end, we further pre-train the multilingual mBERT model and the German GottBERT model on the Luxembourgish dataset used to pre-train LuxemBERT, and compare all models in terms of performance and robustness. We make all our language models as well as the datasets available to the NLP community.
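The partial-translation augmentation idea can be sketched as a dictionary-based token substitution: tokens with a known equivalent in the target language are swapped, the rest are kept verbatim, yielding text that mixes source and target vocabulary. This is only a toy illustration of the general technique, assuming German as the closely related source language; the mini German-to-Luxembourgish lexicon below is a hypothetical stand-in, not the thesis's actual translation resource or method.

```python
# Illustrative mini-lexicon (German -> Luxembourgish); entries are examples
# chosen for this sketch, not the resource used in the thesis.
LEXICON = {
    "ich": "ech",
    "nicht": "net",
    "und": "an",
    "ist": "ass",
}

def partial_translate(sentence, lexicon=LEXICON):
    """Systematically replace every token that has a lexicon entry,
    leaving unknown tokens untouched (hence 'partial' translation)."""
    out = []
    for token in sentence.split():
        out.append(lexicon.get(token.lower(), token))
    return " ".join(out)

print(partial_translate("ich bin nicht hier und das ist gut"))
# prints: ech bin net hier an das ass gut
```

The appeal of this kind of augmentation for a low-resource setting is that it is cheap and systematic: even a small, high-precision lexicon moves large amounts of related-language text closer to the target language's surface forms, enlarging the pre-training corpus.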
Research center :
- Interdisciplinary Centre for Security, Reliability and Trust (SnT) > TruX - Trustworthy Software Engineering
Disciplines :
Computer science
Author, co-author :
Lothritz, Cedric  ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
Language :
English
Title :
NLP De Luxe - Challenges for Natural Language Processing in Luxembourg
Defense date :
29 March 2023
Number of pages :
xvi, 132
Institution :
Unilu - University of Luxembourg, Luxembourg
Degree :
Docteur en Informatique
Promotor :
Jury member :
Purschke, Christoph  
Savoy, Jacques
Doğruöz, Seza
Boytsov, Andrey
Focus Area :
Computational Sciences
Available on ORBilu :
since 25 April 2023

