Results 1-5 of 5.
Lothritz, Cedric. Doctoral thesis (2023).

The Grand Duchy of Luxembourg is a small country in Western Europe which, despite its size, is an important global financial centre. Due to its highly multilingual population, and the fact that one of its national languages, Luxembourgish, is regarded as a low-resource language, the country lends itself naturally to a wide variety of interesting research opportunities in the domain of Natural Language Processing (NLP). This thesis discusses and addresses challenges of domain-specific and language-specific NLP, using the unique linguistic situation in Luxembourg as an elaborate case study. We focus on three main topics: (I) NLP challenges present in the financial domain, specifically handling personal names in sensitive documents, (II) NLP challenges related to multilingualism, and (III) NLP challenges for low-resource languages, with Luxembourgish as the language of interest. With regard to NLP challenges in the financial domain, we address the challenge of finding and anonymising names in documents. First, we present an empirical study on the usefulness of Transformer-based deep learning models for the task of Fine-Grained Named Entity Recognition, conducted across a wide array of domains, including the financial domain. We show that Transformer-based models, and in particular BERT models, yield the best performance on this task. We furthermore show that performance also depends strongly on the domain itself, regardless of the choice of model. The automatic detection of names in text documents in turn facilitates the anonymisation of these documents.
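The detect-then-anonymise step described above can be sketched in a few lines. The sketch below assumes the personal names have already been extracted by a NER model (the model itself is out of scope here); the `pseudonymise` helper and its placeholder scheme are illustrative, not the thesis's actual anonymisation strategy.

```python
def pseudonymise(text, names):
    """Replace each detected personal name with a consistent numbered
    placeholder, so the anonymised text stays internally coherent."""
    mapping = {}
    for name in names:
        if name not in mapping:
            mapping[name] = f"[PERSON_{len(mapping) + 1}]"
    # Replace longer names first so a long name such as "Alice Weber"
    # is not clobbered by a shorter overlapping one.
    for name in sorted(mapping, key=len, reverse=True):
        text = text.replace(name, mapping[name])
    return text, mapping

text = "Alice Weber met Bob. Later, Alice Weber signed the contract."
anonymised, mapping = pseudonymise(text, ["Alice Weber", "Bob"])
print(anonymised)
# [PERSON_1] met [PERSON_2]. Later, [PERSON_1] signed the contract.
```

Keeping one placeholder per distinct name (rather than a single generic mask) preserves coreference in the document, which matters for models later trained on the anonymised data.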
However, anonymisation can distort data and have a negative effect on models built on that data. We investigate the impact of anonymising personal names on the performance of deep learning models trained on a large number of NLP tasks. Based on our experiments, we establish which anonymisation strategy should be used to guarantee accurate NLP models. Regarding NLP challenges related to multilingualism, we address the need for polyglot conversational AI in a multilingual environment such as Luxembourg. In an empirical study, we evaluate the trade-off between a single multilingual chatbot and multiple monolingual chatbots trained on Intent Classification and Slot Filling for the banking domain. Furthermore, we publish a quadrilingual, parallel dataset built specifically for this study, which can be used to train a client support assistant for the banking domain. With regard to NLP challenges for the Luxembourgish language, we predominantly address the lack of a suitable language model and of datasets for NLP tasks in Luxembourgish. First, we present the most impactful contribution of this thesis: the first BERT model for the Luxembourgish language, which we name LuxemBERT. We explore a novel data augmentation technique based on partially and systematically translating texts to Luxembourgish from a closely related language in order to artificially increase the training data for LuxemBERT. Furthermore, we create datasets for a variety of downstream NLP tasks in Luxembourgish to evaluate the performance of LuxemBERT, and use them to show that LuxemBERT outperforms mBERT, the de facto state-of-the-art model for Luxembourgish. Finally, we compare different approaches to pre-training BERT models for Luxembourgish: specifically, we investigate whether it is preferable to pre-train a BERT model from scratch or to continue pre-training an already existing model on new data.
To this end, we further pre-train the multilingual mBERT model and the German GottBERT model on the Luxembourgish dataset used to pre-train LuxemBERT, and compare all models in terms of performance and robustness. We make all our language models, as well as the datasets, available to the NLP community.

Lothritz, Cedric. In Proceedings of the Language Resources and Evaluation Conference, 2022 (June 2022).

Pre-trained language models such as BERT have become ubiquitous in NLP, where they achieve state-of-the-art performance on most NLP tasks. While these models are readily available for English and other widely spoken languages, they remain scarce for low-resource languages such as Luxembourgish. In this paper, we present LuxemBERT, a BERT model for the Luxembourgish language that we create using the following approach: we augment the pre-training dataset with text data from a closely related language, which we partially translate using a simple and straightforward method. The resulting LuxemBERT model proves effective on various NLP tasks: it outperforms both a simple baseline built with the available Luxembourgish text data and the multilingual mBERT model, currently the only option for transformer-based language models in Luxembourgish. Furthermore, we present datasets for various downstream NLP tasks that we created for this study and will make available to researchers on request.
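The partial-translation augmentation described in the abstract above can be illustrated with a toy sketch. The mini German-to-Luxembourgish lexicon and the word-by-word substitution below are assumptions for illustration only; the actual LuxemBERT augmentation is more systematic than this.

```python
# Toy German-to-Luxembourgish lexicon; a real augmentation pipeline
# would rely on a much larger bilingual resource.
DE_TO_LB = {"ich": "ech", "habe": "hunn", "und": "an", "nicht": "net"}

def partially_translate(sentence, lexicon):
    """Replace every word the lexicon covers and keep the rest as-is,
    yielding mixed text that is lexically closer to the target language."""
    return " ".join(lexicon.get(tok.lower(), tok) for tok in sentence.split())

print(partially_translate("ich habe Hunger und ich habe Durst", DE_TO_LB))
# ech hunn Hunger an ech hunn Durst
```

Even this crude substitution shows the intuition: sentences from a closely related high-resource language can be pushed toward the low-resource language word by word, cheaply enlarging the pre-training corpus.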
Lothritz, Cedric. In 26th International Conference on Applications of Natural Language to Information Systems (25 June 2021).

With the momentum of conversational AI for enhancing client-to-business interactions, chatbots are sought in various domains, including FinTech, where they can automatically handle requests such as opening/closing bank accounts or issuing/terminating credit cards. Since they are expected to replace emails and phone calls, chatbots must be capable of dealing with diverse client populations. In this work, we focus on the variety of languages, in particular in multilingual countries. Specifically, we investigate strategies for training deep learning models for chatbots on multilingual data. We perform experiments on the specific tasks of Intent Classification and Slot Filling in financial-domain chatbots and assess the performance of the multilingual mBERT model versus multiple monolingual models.

Arslan, Yusuf. In Companion Proceedings of the Web Conference 2021 (WWW '21 Companion), April 19-23, 2021, Ljubljana, Slovenia.

Lothritz, Cedric. In Proceedings of the 28th International Conference on Computational Linguistics (December 2020).

Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task and has remained an active research field.
In recent years, transformer models, and more specifically the BERT model developed at Google, revolutionised the field of NLP. While the performance of transformer-based approaches such as BERT has been studied for NER, there has not yet been a study of the fine-grained Named Entity Recognition (FG-NER) task. In this paper, we compare three transformer-based models (BERT, RoBERTa, and XLNet) to two non-transformer-based models (CRF and BiLSTM-CNN-CRF). Furthermore, we apply each model to a multitude of distinct domains. We find that transformer-based models incrementally outperform the studied non-transformer-based models in most domains with respect to the F1 score. We also find that the choice of domain significantly influences performance, regardless of the respective data size or the model chosen.
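The F1 score used for the model comparison above is typically computed at the entity level. A minimal sketch of that metric, assuming gold and predicted entities are given as (start, end, type) tuples (the helper name and input format are illustrative), could look like this:

```python
def entity_f1(gold, pred):
    """Micro-averaged entity-level precision, recall, and F1 over
    (start, end, type) tuples; an entity counts as correct only on
    an exact span-and-type match."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 11, "PER"), (16, 19, "ORG")]
pred = [(0, 11, "PER"), (25, 31, "LOC")]
print(entity_f1(gold, pred))
# (0.5, 0.5, 0.5)
```

Exact-match scoring is deliberately strict: a prediction with the right type but a slightly wrong span counts as both a false positive and a false negative, which is what makes fine-grained NER across many entity types a demanding benchmark.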