[en] Most domain-specific BERT models are designed to work with short sentences and do not address the 512-token input limit of the default BERT tokenizer. This limitation is further exacerbated when the tokenizer has a high tokens-per-word ratio (fertility) and therefore splits words into several tokens. A term-based multilingual Financial (T-MuFin) BERT tokenizer has been proposed to reduce the fertility of the default BERT tokenizer by extending the base vocabulary with the most common financial terms instead of word pieces. A key aspect of this proposal is the introduction of multiword domain-specific terms without affecting the performance of the BERT models. The T-MuFin BERT tokenizer reduces the fertility of long text sequences by at least 40%. T-MuFin BERT also improves the result of fine-tuning a downstream task by approximately 4% compared to a default fine-tuned model. Consequently, by reducing the tokenizer's fertility, the outputs of explainability methods become more user-friendly.
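The mechanism described above, extending the base vocabulary with frequent (possibly multiword) financial terms and checking the resulting drop in fertility, can be sketched roughly as follows. This is a minimal illustration assuming the Hugging Face transformers API, not the authors' implementation; the model checkpoint, the fertility helper, and the financial_terms list are placeholders.

```python
# Sketch only: hypothetical term list, not the T-MuFin vocabulary.
from transformers import BertModel, BertTokenizerFast

def fertility(tokenizer, text: str) -> float:
    """Tokens-per-word ratio: sub-word tokens produced per whitespace-separated word."""
    words = text.split()
    return len(tokenizer.tokenize(text)) / len(words)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
sample = "The counterparty hedged its exposure with credit default swaps."
print("fertility before:", round(fertility(tokenizer, sample), 2))

# Hypothetical domain terms; T-MuFin derives its list from the most common
# financial terms, including multiword expressions, rather than word pieces.
financial_terms = ["counterparty", "hedged", "credit default swaps"]
tokenizer.add_tokens(financial_terms)

# The token embedding matrix must grow to cover the extended vocabulary.
model = BertModel.from_pretrained("bert-base-multilingual-cased")
model.resize_token_embeddings(len(tokenizer))

print("fertility after:", round(fertility(tokenizer, sample), 2))
```

After extending the vocabulary, the resized embeddings for the added terms still have to be learned during fine-tuning before they benefit a downstream task.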
Research center :
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SEDAN - Service and Data Management in Distributed Systems
NCER-FT - FinTech National Centre of Excellence in Research
Disciplines :
Computer science
Author, co-author :
BLANCO, Braulio ✱; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SEDAN
Becerra-Sanchez, Patricia; Unilu - University of Luxembourg [LU] > SNT
BRORSSON, Mats Håkan; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SEDAN
✱ These authors have contributed equally to this work.
Other collaborator :
ZURAD, Maciej; Yoba S.A.
External co-authors :
yes
Language :
English
Title :
Reducing tokenizer’s tokens per word ratio in Financial domain with T-MuFin BERT Tokenizer
Publication date :
20 August 2023
Journal title :
Financial Technology and Natural Language Processing and the Second Multimodal AI For Financial Forecasting