The popularity of the Android OS has made it an appealing target for malware developers. To evade detection, including by ML-based techniques, attackers invest in creating malware that closely resembles legitimate apps, challenging the state of the art with difficult-to-detect samples. In this paper, we propose Guided Retraining, a supervised representation learning-based method for boosting the performance of malware detectors. To that end, we first split the experimental dataset into subsets of “easy” and “difficult” samples, where difficulty is associated with the prediction probabilities yielded by a malware detector. For the subset of “easy” samples, the base malware detector is used to make the final predictions, since the error rate on that subset is low by construction. Our work targets the second subset, containing “difficult” samples, for which the probabilities are such that the classifier is not confident in its predictions, which have high error rates. We apply our Guided Retraining method to these difficult samples to improve their classification. Guided Retraining leverages the correct predictions and the errors made by the base malware detector to guide the retraining process. It learns new embeddings of the difficult samples using Supervised Contrastive Learning and trains an auxiliary classifier for the final predictions. We validate our method on four state-of-the-art Android malware detection approaches using over 265k malware and benign apps. Experimental results show that Guided Retraining can boost state-of-the-art detectors by eliminating up to 45.19% of the prediction errors that they make on difficult samples. We note furthermore that our method is generic and designed to enhance the performance of binary classifiers for other tasks beyond Android malware detection.
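The easy/difficult split and the routing of final predictions described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the thresholds `low` and `high`, and the `auxiliary_predict` callable (a stand-in for the classifier trained on the re-learned embeddings) are all hypothetical, and the contrastive embedding step itself is not shown.

```python
from typing import Callable, List, Sequence, Tuple


def split_by_confidence(
    probs: Sequence[float], low: float = 0.1, high: float = 0.9
) -> Tuple[List[int], List[int]]:
    """Return (easy, difficult) index lists.

    A sample counts as "easy" when the base detector's malware
    probability is at most `low` or at least `high`; anything in
    between is "difficult". Thresholds are assumed, not from the paper.
    """
    easy: List[int] = []
    difficult: List[int] = []
    for i, p in enumerate(probs):
        (easy if p <= low or p >= high else difficult).append(i)
    return easy, difficult


def route_predictions(
    probs: Sequence[float],
    auxiliary_predict: Callable[[int], int],
    low: float = 0.1,
    high: float = 0.9,
) -> List[int]:
    """Base detector decides easy samples; the auxiliary model decides the rest."""
    easy, difficult = split_by_confidence(probs, low, high)
    labels = [0] * len(probs)
    for i in easy:
        labels[i] = int(probs[i] >= 0.5)  # base detector's own decision
    for i in difficult:
        labels[i] = auxiliary_predict(i)  # hypothetical retrained classifier
    return labels


# Toy malware probabilities from a hypothetical base detector.
base_probs = [0.02, 0.97, 0.55, 0.40, 0.91]
easy_idx, difficult_idx = split_by_confidence(base_probs)
# The lambda stands in for an auxiliary classifier that labels sample i.
final = route_predictions(base_probs, auxiliary_predict=lambda i: 1)
```

In this toy run, samples with probabilities 0.55 and 0.40 fall between the assumed thresholds and are handed to the auxiliary model, while the other three keep the base detector's label.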
Disciplines:
Computer science
Author, co-author:
Daoudi, Nadia; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
Allix, Kevin; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust