Keywords :
BERT; Clinical natural language processing; Mixture of experts; Transformers; Cardiac failure; Humans; France; Algorithms; Natural language processing; Electronic health records (classification); Computational resources; Small scale; Computational modeling; Biological system modeling; Data models; Brain modeling; Text categorization; Adaptation models; Accuracy; Predictive models; Medicine (all); Biomedical Engineering; Computer Science - Computation and Language; eess.SP
Abstract :
[en] Transformer-based models have shown outstanding results in natural language processing but face challenges in applications such as classifying small-scale clinical texts, especially under constrained computational resources. This study presents a customized Mixture of Experts (MoE) Transformer model for classifying small-scale French clinical texts at CHU Sainte-Justine Hospital. The MoE-Transformer addresses the dual challenges of effective training with limited data and low-resource computation suitable for in-house hospital use. Despite the success of biomedical pre-trained models such as CamemBERT-bio, DrBERT, and AliBERT, their high computational demands make them impractical for many clinical settings. Our MoE-Transformer not only outperforms DistilBERT, CamemBERT, FlauBERT, and Transformer models on the same dataset but also achieves an accuracy of 87%, a precision of 87%, a recall of 85%, and an F1-score of 86%. While the MoE-Transformer does not surpass the performance of biomedical pre-trained BERT models, it can be trained at least 190 times faster, offering a viable alternative for settings with limited data and computational resources. Although the MoE-Transformer still faces challenges related to generalization gaps and sharp minima, which limit fully efficient and accurate clinical text classification, it nonetheless represents a significant advancement in the field. It is particularly valuable for classifying small French clinical narratives within the privacy and computational constraints of a hospital setting.

Clinical and Translational Impact Statement: This study highlights the potential of customized MoE-Transformers for enhancing clinical text classification, particularly on small-scale datasets such as French clinical narratives. The MoE-Transformer's ability to outperform several pre-trained BERT models marks a stride in applying NLP techniques to clinical data and in integrating the model into a Clinical Decision Support System in a Pediatric Intensive Care Unit. The study underscores the importance of model selection and customization in achieving optimal performance for specific clinical applications, especially with limited data availability and within the constraints of hospital-based computational resources.
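To make the architecture concrete, the sketch below shows one way a Mixture-of-Experts feed-forward layer can replace the dense feed-forward sub-layer of a Transformer encoder block used for text classification. This is a minimal illustrative PyTorch sketch, not the authors' implementation: the vocabulary size, model width, number of experts, dense (rather than sparse top-k) routing, and binary output are all assumptions made for brevity.

```python
# Minimal sketch of an MoE-Transformer text classifier (illustrative
# assumptions throughout; not the configuration reported in the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Mixture-of-Experts layer: a pool of small feed-forward experts
    combined by a learned gating (routing) network."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff),
                          nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.gate = nn.Linear(d_model, n_experts)  # router

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        weights = F.softmax(self.gate(x), dim=-1)                # (B, S, E)
        outputs = torch.stack([e(x) for e in self.experts], -1)  # (B, S, D, E)
        # Dense mixture for clarity; MoE Transformers typically route each
        # token to only the top-k experts for sparsity and efficiency.
        return torch.einsum("bsde,bse->bsd", outputs, weights)


class MoETransformerClassifier(nn.Module):
    """One encoder block whose feed-forward sub-layer is an MoE layer,
    followed by mean pooling and a linear classification head."""

    def __init__(self, vocab_size=8000, d_model=128, n_heads=4, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.moe = MoEFeedForward(d_model, d_ff=4 * d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) of token ids from any tokenizer
        x = self.embed(tokens)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)      # residual + norm around attention
        x = self.norm2(x + self.moe(x))   # residual + norm around MoE FFN
        return self.head(x.mean(dim=1))   # pooled logits, shape (B, n_classes)


if __name__ == "__main__":
    model = MoETransformerClassifier()
    dummy = torch.randint(0, 8000, (2, 32))  # two dummy 32-token notes
    print(model(dummy).shape)                # torch.Size([2, 2])
```

As a quick consistency check on the reported metrics, the F1-score follows from the stated precision and recall: 2 × (0.87 × 0.85) / (0.87 + 0.85) ≈ 0.86, matching the reported 86%.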
Disciplines :
Computer science
Author, co-author :
LE, Thanh-Dung ; University of Luxembourg ; Biomedical Information Processing Laboratory, École de Technologie Supérieure, University of Quebec, Quebec City, QC G1K 9H6, Canada
Jouvet, Philippe ; Research Center at CHU Sainte-Justine, University of Montreal, Montreal, QC H3T 1J4, Canada
Noumeir, Rita ; Biomedical Information Processing Laboratory, École de Technologie Supérieure, University of Quebec, Quebec City, QC G1K 9H6, Canada
External co-authors :
yes
Language :
English
Title :
Improving Transformer Performance for French Clinical Notes Classification Using Mixture of Experts on a Limited Dataset.
Publication date :
2025
Journal title :
IEEE Journal of Translational Engineering in Health and Medicine
ISSN :
2168-2372
Publisher :
Institute of Electrical and Electronics Engineers Inc., United States
Funders :
Natural Sciences and Engineering Research Council (NSERC); Institut de valorisation des données de l'Université de Montréal (IVADO); Fonds de la Recherche du Québec–Santé (FRQS); Fonds de Recherche du Québec–Nature et Technologies (FRQNT); Scholarship from FRQNT
Funding text :
This work was supported in part by the Natural Sciences and Engineering Research Council (NSERC), in part by the Institut de valorisation des données de l'Université de Montréal (IVADO), in part by the Fonds de la Recherche du Québec–Santé (FRQS), and in part by the Fonds de Recherche du Québec–Nature et Technologies (FRQNT). The work of Thanh-Dung Le was supported by a scholarship from FRQNT. Data and reproducible codes are available upon request from Prof. Philippe Jouvet, M.D., Ph.D.
Commentary :
Accepted for publication in the IEEE Journal of Translational Engineering in Health and Medicine