Reducing tokenizer’s tokens per word ratio in Financial domain with T-MuFin BERT Tokenizer

BLANCO, Braulio; Becerra-Sanchez, Patricia; BRORSSON, Mats Håkan

Article (Scientific journals)

2023 • In Financial Technology and Natural Language Processing and the Second Multimodal AI For Financial Forecasting, p. 94–103

Peer reviewed

Permalink
https://hdl.handle.net/10993/62208

Files

Full Text

2023.finnlp-1.9.pdf

Publisher postprint (3.57 MB)

Details

Keywords :

BERT; Tokenizer; Finance

Abstract :

[en] Most domain-specific BERT models are designed to work with short sentences and do not address the 512-token limit of the default BERT tokenizer. This limitation is exacerbated when the tokenizer has a high tokens-per-word ratio (fertility) and thus splits words into several tokens. A term-based multilingual financial (T-MuFin) BERT tokenizer is proposed to reduce the fertility of the default BERT tokenizer by extending the base dictionary with the most common financial terms instead of word pieces. One key factor of this proposal is to introduce multiword domain-specific terms without affecting the performance of the BERT models. The T-MuFin BERT tokenizer reduces fertility on long text sequences by at least 40%. T-MuFin BERT improves the fine-tuning of a downstream task by approximately 4% compared to a default fine-tuned model. Reducing the tokenizer's fertility also makes the results of explainability methods more user-friendly.
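To make the notion of fertility concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy greedy longest-match splitter in the style of WordPiece and two hypothetical vocabularies (`base_vocab`, `extended_vocab`) invented for illustration. It handles only single-word terms, whereas the abstract also describes multiword terms.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first split of one word (toy WordPiece-style)."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


def fertility(text, vocab):
    """Average number of tokens produced per whitespace-separated word."""
    words = text.lower().split()
    n_tokens = sum(len(wordpiece_tokenize(w, vocab)) for w in words)
    return n_tokens / len(words)


# Hypothetical vocabularies, for illustration only.
base_vocab = {"the", "fund", "pays", "a", "div", "##ide", "##nd"}
extended_vocab = base_vocab | {"dividend"}  # whole term added to the dictionary

text = "the fund pays a dividend"
print(fertility(text, base_vocab))      # "dividend" splits into 3 pieces
print(fertility(text, extended_vocab))  # "dividend" is a single token
```

With the base vocabulary the five words yield seven tokens (fertility 1.4); once the whole term is in the dictionary they yield five (fertility 1.0), so more words fit within a fixed token budget.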

Research center :

Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SEDAN - Service and Data Management in Distributed Systems
NCER-FT - FinTech National Centre of Excellence in Research

Disciplines :

Computer science

Author, co-author :

BLANCO, Braulio ^✱; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SEDAN

Becerra-Sanchez, Patricia; Unilu - University of Luxembourg [LU] > SNT

BRORSSON, Mats Håkan ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SEDAN

^✱ These authors have contributed equally to this work.

Other collaborator :

ZURAD, Maciej; Yoba S.A.

External co-authors :

yes

Language :

English

Title :

Reducing tokenizer’s tokens per word ratio in Financial domain with T-MuFin BERT Tokenizer

Publication date :

20 August 2023

Journal title :

Financial Technology and Natural Language Processing and the Second Multimodal AI For Financial Forecasting

Publisher :

ACL Anthology, Macao, Macao SAR China

Pages :

94–103

Peer reviewed :

Peer reviewed

Focus Area :

Computational Sciences

Additional URL :

https://aclanthology.org/2023.finnlp-1.9.pdf

FnR Project :

15403349

Name of the research project :

U-AGR-7012 - BRIDGES2020/IS/15403349/SCRiPT_Yoba Cont - BRORSSON Mats Hakan

Available on ORBilu :

since 14 October 2024

Statistics

Number of views

179 (19 by Unilu)

Number of downloads

59 (3 by Unilu)
