Paper published in a book (Scientific congresses, symposiums and conference proceedings)
LaFiCMIL: Rethinking Large File Classification from the Perspective of Correlated Multiple Instance Learning
SUN, Tiezhu; PIAN, Weiguo; DAOUDI, Nadia et al.
2024. In Rapp, Amon; Di Caro, Luigi (Eds.), Natural Language Processing and Information Systems - 29th International Conference on Applications of Natural Language to Information Systems, NLDB 2024, Proceedings
Peer reviewed
 

Files


Full Text
LaFiCMIL.pdf
Author postprint (793.03 kB)


Details



Keywords :
Large file classification; Multiple instance learning; Classification tasks; Computational costs; Input constraints; Language processing; Large files; Multiple-instance learning; Natural languages; Text classification
Abstract :
[en] Transformer-based models have significantly advanced natural language processing, particularly performance on text classification tasks. Nevertheless, these models struggle to process large files, primarily because their inputs are generally restricted to hundreds or thousands of tokens. Existing attempts to address this issue usually extract only a fraction of the essential information from lengthy inputs, while often incurring high computational costs due to complex architectures. In this work, we address the challenge of classifying large files from the perspective of correlated multiple instance learning. We introduce LaFiCMIL, a method specifically designed for large file classification. It is optimized for efficient training on a single GPU, making it a versatile solution for binary, multi-class, and multi-label classification tasks. We conducted extensive experiments on seven diverse and comprehensive benchmark datasets to assess LaFiCMIL's effectiveness. By integrating BERT for feature extraction, LaFiCMIL demonstrates exceptional performance, setting new benchmarks across all datasets. A notable achievement of our approach is its ability to scale BERT to handle nearly 20,000 tokens while training on a single GPU with 32 GB of memory. This efficiency, coupled with its state-of-the-art performance, highlights LaFiCMIL's potential as a groundbreaking approach to large file classification.
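Although the record itself contains no code, the abstract's chunk-then-aggregate design can be illustrated with a short PyTorch sketch. Everything below is a hedged illustration, not the authors' implementation: the class name LargeFileMILClassifier, the use of per-chunk [CLS] embeddings as MIL instances, the single self-attention layer standing in for the correlated-MIL aggregator, and the mean pooling over the attended bag are all assumptions made for clarity.

```python
# Minimal sketch of chunk-then-aggregate large-file classification in the
# spirit of correlated multiple instance learning. Names and layer choices
# are illustrative assumptions, not the paper's released code.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

CHUNK_LEN = 512  # BERT's per-input token limit; one chunk = one MIL instance
HIDDEN = 768     # bert-base hidden size

class LargeFileMILClassifier(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        # Self-attention across chunk embeddings lets instances exchange
        # information, i.e. it models the correlations between instances
        # that plain (independent) MIL pooling ignores.
        self.inter_chunk_attn = nn.MultiheadAttention(
            HIDDEN, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(HIDDEN, num_classes)

    def forward(self, input_ids, attention_mask):
        # input_ids: (num_chunks, CHUNK_LEN) -- one large file, pre-split.
        # ~40 chunks x 512 tokens covers the ~20,000-token files mentioned
        # in the abstract. In practice the chunks could be encoded in
        # mini-batches (or with gradient checkpointing) to fit a single
        # GPU; here they are encoded in one batch for brevity.
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        chunk_embs = out.last_hidden_state[:, 0]        # [CLS] per chunk
        bag = chunk_embs.unsqueeze(0)                   # (1, num_chunks, H)
        corr, _ = self.inter_chunk_attn(bag, bag, bag)  # correlate instances
        doc_emb = corr.mean(dim=1)                      # pool bag -> document
        return self.classifier(doc_emb)                 # (1, num_classes)

# Usage: the fast tokenizer's overflow mode splits one long file into
# fixed-size chunks in a single call.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
enc = tokenizer("some very long file content " * 3000,
                return_overflowing_tokens=True, truncation=True,
                max_length=CHUNK_LEN, padding="max_length",
                return_tensors="pt")
model = LargeFileMILClassifier(num_classes=2).eval()
with torch.no_grad():
    logits = model(enc["input_ids"], enc["attention_mask"])
print(logits.shape)  # torch.Size([1, 2])
```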
Disciplines :
Computer science
Author, co-author :
SUN, Tiezhu  ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
PIAN, Weiguo ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
DAOUDI, Nadia ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX > Team Jacques KLEIN ; Luxembourg Institute of Science and Technology, Esch-sur-Alzette, Luxembourg
ALLIX, Kevin ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX > Team Jacques KLEIN ; Rennes, France
BISSYANDÉ, Tegawendé F. ;  University of Luxembourg, Kirchberg, Luxembourg
KLEIN, Jacques  ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
External co-authors :
no
Language :
English
Title :
LaFiCMIL: Rethinking Large File Classification from the Perspective of Correlated Multiple Instance Learning
Publication date :
20 September 2024
Event name :
The 29th International Conference on Natural Language & Information Systems
Event place :
Turin, Italy
Event date :
25-06-2024 to 27-06-2024
Main work title :
Natural Language Processing and Information Systems - 29th International Conference on Applications of Natural Language to Information Systems, NLDB 2024, Proceedings
Editor :
Rapp, Amon
Di Caro, Luigi
Publisher :
Springer Science and Business Media Deutschland GmbH
ISBN/EAN :
978-3-031-70238-9
Peer reviewed :
Peer reviewed
Funding text :
This research was funded in whole, or in part, by the Luxembourg National Research Fund (FNR), grant references 16344458 (REPROCESS), 18154263 (UNLOCK), and 17046335 (AFR PhD grant).
Available on ORBilu :
since 05 December 2024

