Machine Learning (ML) is making it possible to automate an increasingly large number of software engineering tasks. One foundational building block in applying ML to software artifacts is representing these artifacts (e.g., source code or executable code) in a form that is suitable for learning. Traditionally, researchers and practitioners have relied on features selected manually, based on expert knowledge, for the task at hand. Such knowledge is sometimes imprecise and generally incomplete. To overcome this limitation, many studies have leveraged representation learning, delegating to ML itself the job of automatically devising suitable representations and selecting the most relevant features. Yet, in the context of Android problems, existing models are either limited to a coarse-grained, whole-app level (e.g., apk2vec) or designed for one specific downstream task (e.g., smali2vec). Thus, the resulting representations may be unsuitable for fine-grained tasks or may not generalize beyond the task they were trained on. Our work is part of a new line of research that investigates effective, task-agnostic, and fine-grained universal representations of bytecode to mitigate both limitations. Such representations aim to capture information relevant to various low-level downstream tasks (e.g., at the class level). We are inspired by the field of Natural Language Processing, where the problem of universal representation was addressed by building Universal Language Models, such as BERT, whose goal is to capture abstract semantic information about sentences in a way that is reusable for a variety of tasks. We propose DexBERT, a BERT-like Language Model dedicated to representing chunks of DEX bytecode, the main binary format used in Android applications.
We empirically assess whether DexBERT is able to model the DEX language and evaluate the suitability of our model for three distinct class-level software engineering tasks: Malicious Code Localization, Defect Prediction, and Component Type Classification. We also experiment with strategies for handling apps of vastly different sizes, and we demonstrate one example of using our technique to investigate what information is relevant to a given task.
Disciplines :
Computer science
Author, co-author :
SUN, Tiezhu ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
ALLIX, Kevin ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust > TruX > Team Jacques KLEIN
KIM, Kisub ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust > TruX > Team Tegawendé François d A BISSYANDE ; Singapore Management University, Singapore
KIM, Dongsun ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust > SerVal ; Kyungpook National University, Daegu, South Korea
Lo, David ; Singapore Management University, Singapore
Bissyande, Tegawende F. ; University of Luxembourg, Kirchberg, Luxembourg
KLEIN, Jacques ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
External co-authors :
yes
Language :
English
Title :
DexBERT: Effective, Task-Agnostic and Fine-Grained Representation Learning of Android Bytecode
Publication date :
01 September 2023
Journal title :
IEEE Transactions on Software Engineering
ISSN :
0098-5589
eISSN :
1939-3520
Publisher :
Institute of Electrical and Electronics Engineers Inc.
AFR PhD Project number 17046335 REPROCESS C21/IS/16344458
Funders :
Fonds National de la Recherche (FNR), Luxembourg; NRF - National Research Foundation of Korea; National Research Foundation, Singapore; Cyber Security Agency of Singapore