Malware classification is a specific and refined task within the broader malware detection problem. Effective classification aids in understanding attack techniques and developing robust defenses, ensuring application security and timely mitigation of software vulnerabilities. The dynamic nature of malware demands adaptive classification techniques that can handle the continuous emergence of new families. Traditionally, this is done by retraining models on all historical samples, which requires significant resources in terms of time and storage. An alternative approach is Class-Incremental Learning (CIL), which focuses on progressively learning new classes (malware families) while preserving knowledge from previous training steps. However, CIL assumes that each class appears only once in training and is not revisited, an assumption that does not hold for malware families, which often persist across multiple time intervals. This leads to shifts in the data distribution for the same family over time, a challenge that is not addressed by traditional CIL methods. We formulate this problem as Temporal-Incremental Malware Learning (TIML), which adapts to these shifts and effectively classifies new variants. To support this, we organize the MalNet dataset, consisting of over a million entries of Android malware data collected over a decade, in chronological order. We first adapt state-of-the-art CIL approaches to meet TIML's requirements, serving as baseline methods. Then, we propose a novel multimodal TIML approach that leverages multiple malware modalities for improved performance. Extensive evaluations show that our TIML approaches outperform traditional CIL methods and demonstrate the feasibility of periodically updating malware classifiers at a low cost. This process is efficient and requires minimal storage and computational resources, with only a slight dip in performance compared to full retraining with historical data.
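The difference between the CIL assumption and the temporal setting described above can be illustrated with a small sketch. It is not the authors' pipeline or the actual MalNet organization: the sample records, family names, years, and function names below are all hypothetical and serve only to show that a class-based split places each family in exactly one task, whereas a chronological split lets the same family recur across tasks.

```python
from collections import defaultdict

# Hypothetical toy samples as (year, family); values are illustrative only.
samples = [
    (2012, "fam_a"), (2012, "fam_b"),
    (2013, "fam_a"), (2013, "fam_c"),
    (2014, "fam_b"), (2014, "fam_c"), (2014, "fam_d"),
]

def class_incremental_tasks(samples, classes_per_task=2):
    """CIL-style split: every family is assigned to exactly one task."""
    families = sorted({fam for _, fam in samples})
    tasks = []
    for i in range(0, len(families), classes_per_task):
        chunk = set(families[i:i + classes_per_task])
        tasks.append([s for s in samples if s[1] in chunk])
    return tasks

def temporal_incremental_tasks(samples):
    """Temporal split: one task per time interval, in chronological order."""
    by_year = defaultdict(list)
    for year, fam in samples:
        by_year[year].append((year, fam))
    return [by_year[y] for y in sorted(by_year)]

def recurring_families(tasks):
    """Return the families that appear in more than one task."""
    seen = defaultdict(set)
    for idx, task in enumerate(tasks):
        for _, fam in task:
            seen[fam].add(idx)
    return sorted(f for f, idxs in seen.items() if len(idxs) > 1)

cil_tasks = class_incremental_tasks(samples)
timl_tasks = temporal_incremental_tasks(samples)

print(recurring_families(cil_tasks))   # no family spans two tasks
print(recurring_families(timl_tasks))  # families seen in several intervals
```

On this toy data the class-based split reports no recurring families, while the chronological split reports three. Those recurring families are the ones whose data distribution may shift between intervals, which is the case the temporal formulation is meant to handle.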
DAOUDI, Nadia ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust > TruX > Team Jacques KLEIN ; Luxembourg Institute of Science and Technology, Luxembourg
PIAN, Weiguo ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
KIM, Kisub ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust > TruX > Team Tegawendé François d A BISSYANDE ; Singapore Management University, Singapore
ALLIX, Kevin ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust > TruX > Team Jacques KLEIN ; Independent Researcher, France