[en] Most domain-specific BERT models are designed to work with short sentences and do not address the 512-token limit of the default BERT tokenizer. This limitation is exacerbated when the tokenizer has a high tokens-per-word ratio (fertility) and thus splits words into several tokens. A term-based multilingual financial (T-MuFin) BERT tokenizer has been proposed to reduce the fertility of the default BERT tokenizer by extending the base dictionary with the most common financial terms instead of word pieces. A key aspect of this proposal is introducing multiword domain-specific terms without affecting the performance of the BERT models. The T-MuFin BERT tokenizer reduces the fertility of long text sequences by at least 40%. T-MuFin BERT improves the fine-tuning of a downstream task by approximately 4% compared to a default fine-tuned model. Hence, reducing the tokenizer’s fertility makes the results of explainability methods more user-friendly.
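To make the notion of fertility concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a toy WordPiece-style vocabulary and a hypothetical list of multiword terms; the names `wordpiece_tokenize`, `tokenize`, and `fertility`, as well as the sample sentence, are illustrative only. It measures fertility as tokens divided by whitespace-separated words, and shows how treating domain terms as single tokens lowers that ratio.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first split of one word, in the style of WordPiece."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            # Non-initial pieces carry the "##" continuation prefix.
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


def tokenize(text, vocab, terms=()):
    """Tokenize text; multiword `terms` are matched first and kept as one token each."""
    words = text.lower().split()
    # Try longer terms first so the longest available term wins at each position.
    term_tuples = sorted(
        (tuple(t.lower().split()) for t in terms if t.strip()),
        key=len,
        reverse=True,
    )
    tokens, i = [], 0
    while i < len(words):
        for term in term_tuples:
            if tuple(words[i:i + len(term)]) == term:
                tokens.append(" ".join(term))
                i += len(term)
                break
        else:
            tokens.extend(wordpiece_tokenize(words[i], vocab))
            i += 1
    return tokens


def fertility(text, vocab, terms=()):
    """Tokens-per-word ratio: number of tokens divided by number of words."""
    return len(tokenize(text, vocab, terms)) / len(text.split())


# Toy vocabulary and terms, assumed for illustration only.
vocab = {"the", "bank", "re", "##fin", "##anc", "##ed", "its",
         "mort", "##gage", "port", "##folio"}
terms = ["mortgage portfolio", "refinanced"]
text = "the bank refinanced its mortgage portfolio"

base = fertility(text, vocab)             # 11 tokens over 6 words
extended = fertility(text, vocab, terms)  # 5 tokens over 6 words
print(f"base={base:.2f} extended={extended:.2f}")
```

In this toy setting, a multiword term counts as one token for two words, so fertility can fall below 1. The size of the reduction depends entirely on the assumed vocabulary and term list, and says nothing about the figures reported in the abstract.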
Research centre:
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SEDAN - Service and Data Management in Distributed Systems NCER-FT - FinTech National Centre of Excellence in Research
Disciplines:
Computer science
Author, co-author:
BLANCO, Braulio ✱; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SEDAN
Becerra-Sanchez, Patricia; Unilu - University of Luxembourg [LU] > SNT
BRORSSON, Mats Håkan ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SEDAN
✱ These authors contributed equally to this publication.
Other collaborator:
ZURAD, Maciej; Yoba S.A.
External co-authors:
yes
Document language:
English
Title:
Reducing tokenizer’s tokens per word ratio in Financial domain with T-MuFin BERT Tokenizer
Publication date:
20 August 2023
Journal title:
Financial Technology and Natural Language Processing and the Second Multimodal AI For Financial Forecasting