Paper published in a book (Scientific congresses, symposiums and conference proceedings)
LaFiCMIL: Rethinking Large File Classification from the Perspective of Correlated Multiple Instance Learning
SUN, Tiezhu; PIAN, Weiguo; DAOUDI, Nadia et al.
2024 • In Rapp, Amon; Di Caro, Luigi (Eds.) Natural Language Processing and Information Systems - 29th International Conference on Applications of Natural Language to Information Systems, NLDB 2024, Proceedings
Large file classification; Multiple instance learning; Text classification; Natural language processing; Classification tasks; Computational costs; Input constraints; Large files
Abstract :
[en] Transformer-based models have significantly advanced natural language processing, particularly performance on text classification tasks. Nevertheless, these models struggle to process large files because of their input constraints, which are generally limited to hundreds or thousands of tokens. Existing attempts to address this issue usually extract only a fraction of the essential information from lengthy inputs, and often incur high computational costs due to their complex architectures. In this work, we address the challenge of classifying large files from the perspective of correlated multiple instance learning. We introduce LaFiCMIL, a method specifically designed for large file classification. It is optimized for efficient training on a single GPU, making it a versatile solution for binary, multi-class, and multi-label classification tasks. We conducted extensive experiments on seven diverse and comprehensive benchmark datasets to assess LaFiCMIL's effectiveness. By integrating BERT for feature extraction, LaFiCMIL demonstrates exceptional performance, setting new benchmarks across all datasets. A notable achievement of our approach is its ability to scale BERT to handle nearly 20,000 tokens while training on a single GPU with 32 GB of memory. This efficiency, coupled with its state-of-the-art performance, highlights LaFiCMIL's potential as a groundbreaking approach to large file classification.
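As a rough illustration of the chunk-then-aggregate idea the abstract describes, the minimal PyTorch sketch below splits a long token sequence into fixed-size BERT chunks (the instances of a bag) and lets a self-attention layer model correlations between instances before bag-level classification. This is a hedged sketch, not the authors' implementation: the class name, the bert-base-uncased checkpoint, the 512-token chunk size, and the TransformerEncoderLayer aggregator are all assumptions chosen to make the correlated-MIL idea concrete.

```python
# Hypothetical sketch of correlated MIL for large file classification.
# Assumptions (not from the paper): names, checkpoint, aggregator choice.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class CorrelatedMILClassifier(nn.Module):
    """Encode fixed-size chunks of a long input with BERT (one instance
    embedding per chunk), then aggregate the bag with self-attention so
    instances can attend to each other, unlike independent-instance
    attention pooling."""

    def __init__(self, num_classes: int, chunk_len: int = 512):
        super().__init__()
        self.chunk_len = chunk_len
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        hidden = self.encoder.config.hidden_size
        # Correlated-MIL aggregator: models inter-instance dependencies.
        self.aggregator = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=8, batch_first=True)
        self.bag_token = nn.Parameter(torch.zeros(1, 1, hidden))
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        # Input is assumed padded to at least one full chunk; trailing
        # tokens beyond a multiple of chunk_len are truncated here.
        b, total = input_ids.shape
        n = total // self.chunk_len
        ids = input_ids[:, : n * self.chunk_len].reshape(b * n, self.chunk_len)
        mask = attention_mask[:, : n * self.chunk_len].reshape(b * n, self.chunk_len)
        # One [CLS] embedding per chunk = one instance embedding.
        inst = self.encoder(input_ids=ids, attention_mask=mask
                            ).last_hidden_state[:, 0].reshape(b, n, -1)
        # Prepend a learnable bag token, let instances interact, then
        # read the bag token out as the file-level representation.
        bag = torch.cat([self.bag_token.expand(b, -1, -1), inst], dim=1)
        pooled = self.aggregator(bag)[:, 0]
        return self.head(pooled)  # one logit vector per large file


if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = CorrelatedMILClassifier(num_classes=2)
    batch = tok(["a very long document ..."], return_tensors="pt",
                padding="max_length", truncation=True, max_length=2048)
    logits = model(batch["input_ids"], batch["attention_mask"])  # (1, 2)
```

Because each chunk passes through BERT independently, GPU memory grows with the number of instances rather than quadratically with total sequence length, which is consistent with the abstract's claim of handling roughly 20,000 tokens on a single 32 GB GPU; in practice one would likely also need gradient checkpointing or chunk-wise gradient accumulation at that scale.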
Disciplines :
Computer science
Author, co-author :
SUN, Tiezhu ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
PIAN, Weiguo ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
DAOUDI, Nadia ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust > TruX > Team Jacques KLEIN ; Luxembourg Institute of Science and Technology, Esch-sur-Alzette, Luxembourg
ALLIX, Kevin ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust > TruX > Team Jacques KLEIN ; Rennes, France
BISSYANDÉ, Tegawendé F. ; University of Luxembourg, Kirchberg, Luxembourg
KLEIN, Jacques ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
External co-authors :
no
Language :
English
Title :
LaFiCMIL: Rethinking Large File Classification from the Perspective of Correlated Multiple Instance Learning
Publication date :
20 September 2024
Event name :
The 29th International Conference on Applications of Natural Language to Information Systems (NLDB 2024)
Event place :
Turin, Italy
Event date :
25-06-2024 to 27-06-2024
Main work title :
Natural Language Processing and Information Systems - 29th International Conference on Applications of Natural Language to Information Systems, NLDB 2024, Proceedings
Editor :
Rapp, Amon
Di Caro, Luigi
Publisher :
Springer Science and Business Media Deutschland GmbH
Funding :
This research was funded in whole, or in part, by the Luxembourg National Research Fund (FNR), grant references 16344458 (REPROCESS), 18154263 (UNLOCK), and 17046335 (AFR PhD grant).