Article (Scientific journals)
DexBERT: Effective, Task-Agnostic and Fine-Grained Representation Learning of Android Bytecode
SUN, Tiezhu; ALLIX, Kevin; KIM, Kisub et al.
2023. In IEEE Transactions on Software Engineering, 49 (10), p. 4691 - 4706
Peer Reviewed verified by ORBi
 

Files


Full Text
DexBERT_Effective_Task-Agnostic_and_Fine-Grained_Representation_Learning_of_Android_Bytecode.pdf
Publisher postprint (1.45 MB) Creative Commons License - Attribution

Details



Keywords :
Android app analysis; code representation; defect prediction; malicious code localization; Representation learning; Android apps; Code; Localisation; Location awareness; Malicious codes; Malwares; Predictive models; Task analysis; Software
Abstract :
The automation of an increasingly large number of software engineering tasks is becoming possible thanks to Machine Learning (ML). One foundational building block in the application of ML to software artifacts is the representation of these artifacts (e.g., source code or executable code) in a form that is suitable for learning. Traditionally, researchers and practitioners have relied on manually selected features, based on expert knowledge, for the task at hand. Such knowledge is sometimes imprecise and generally incomplete. To overcome this limitation, many studies have leveraged representation learning, delegating to ML itself the job of automatically devising suitable representations and selecting the most relevant features. Yet, in the context of Android problems, existing models are either limited to a coarse-grained whole-app level (e.g., apk2vec) or designed for one specific downstream task (e.g., smali2vec). Thus, the produced representations may turn out to be unsuitable for fine-grained tasks or may fail to generalize beyond the task on which they were trained. Our work is part of a new line of research that investigates effective, task-agnostic, and fine-grained universal representations of bytecode to mitigate both of these limitations. Such representations aim to capture information relevant to various low-level downstream tasks (e.g., at the class level). We are inspired by the field of Natural Language Processing, where the problem of universal representation was addressed by building Universal Language Models, such as BERT, whose goal is to capture abstract semantic information about sentences in a way that is reusable for a variety of tasks. We propose DexBERT, a BERT-like Language Model dedicated to representing chunks of DEX bytecode, the main binary format used in Android applications.
We empirically assess whether DexBERT is able to model the DEX language and evaluate the suitability of our model in three distinct class-level software engineering tasks: Malicious Code Localization, Defect Prediction, and Component Type Classification. We also experiment with strategies to deal with the problem of catering to apps having vastly different sizes, and we demonstrate one example of using our technique to investigate what information is relevant to a given task.
Disciplines :
Computer science
Author, co-author :
SUN, Tiezhu  ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
ALLIX, Kevin  ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust > TruX > Team Jacques KLEIN
KIM, Kisub  ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust > TruX > Team Tegawendé François d A BISSYANDE ; Singapore Management University, Singapore
Zhou, Xin ;  Singapore Management University, Singapore
KIM, Dongsun  ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust > SerVal ; Kyungpook National University, Daegu, South Korea
Lo, David ;  Singapore Management University, Singapore
Bissyande, Tegawende F. ;  University of Luxembourg, Kirchberg, Luxembourg
KLEIN, Jacques  ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
External co-authors :
yes
Language :
English
Title :
DexBERT: Effective, Task-Agnostic and Fine-Grained Representation Learning of Android Bytecode
Publication date :
01 September 2023
Journal title :
IEEE Transactions on Software Engineering
ISSN :
0098-5589
eISSN :
1939-3520
Publisher :
Institute of Electrical and Electronics Engineers Inc.
Volume :
49
Issue :
10
Pages :
4691 - 4706
Peer reviewed :
Peer Reviewed verified by ORBi
FnR Project :
AFR PhD Project number 17046335
REPROCESS C21/IS/16344458
Funders :
Fonds National de la Recherche (FNR), Luxembourg
NRF - National Research Foundation of Korea
National Research Foundation, Singapore
Cyber Security Agency of Singapore
Funding number :
C21/IS/16344458; 17046335; 2021R1A5A1021944; 2021R1I1A3048013; NCRP25-P03-NCR-TAU
Available on ORBilu :
since 22 November 2023

Statistics


Number of views
137 (2 by Unilu)
Number of downloads
49 (0 by Unilu)

Scopus citations®
15
Scopus citations® without self-citations
8
OpenAlex citations
16
