Reference : CODEBERT-NT: code naturalness via CodeBERT
Scientific congresses, symposiums and conference proceedings : Paper published in a book
Engineering, computing & technology : Computer science
Security, Reliability and Trust
http://hdl.handle.net/10993/53506
CODEBERT-NT: code naturalness via CodeBERT
English
Khanfir, Ahmed [University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal]
Jimenez, Matthieu [University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)]
Papadakis, Mike [University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > Computer Science and Communications Research Unit (CSC)]
Le Traon, Yves [University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > Computer Science and Communications Research Unit (CSC)]
5-Dec-2022
22nd IEEE International Conference on Software Quality, Reliability and Security (QRS'22)
Yes
No
International
22nd IEEE International Conference on Software Quality, Reliability and Security (QRS'22)
from 05-12-2022 to 09-12-2022
Guangzhou, China
[en] Code Naturalness ; CodeBERT ; Pre-trained models
[en] Much recent software-engineering research has investigated the naturalness of code: the fact that code, in small snippets, is repetitive and can be predicted by statistical language models such as n-grams. Although powerful, training such models on a large code corpus can be tedious, time-consuming and sensitive to the code patterns (and practices) encountered during training. Consequently, these models are often trained on a small corpus and thus only estimate naturalness relative to a specific style of programming or type of project. To overcome these issues, we investigate the use of pre-trained generative language models to infer code naturalness. Pre-trained models are often built on big data, are easy to use out of the box and include powerful association-learning mechanisms. Our key idea is to quantify code naturalness through its predictability, using state-of-the-art generative pre-trained language models. We therefore infer naturalness by masking (omitting) the tokens of a code sequence, one at a time, and checking the model's ability to predict them. We explore three predictability metrics: a) the number of exact matches between the predictions and the original tokens, b) the embedding similarity between the original and predicted code, i.e., similarity in the vector space, and c) the confidence of the model when performing the token-completion task, regardless of the outcome. We implement this workflow, named CODEBERT-NT, and evaluate its ability to prioritize buggy lines over non-buggy ones when ranking code by its naturalness. Our results, on 2,510 buggy versions of 40 projects from the SmartShark dataset, show that CODEBERT-NT outperforms both random-uniform and complexity-based ranking techniques, and yields results comparable to those of n-gram models.
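The mask-one-token-at-a-time workflow described in the abstract can be sketched in a few lines of Python. The snippet below is a minimal illustration, not the authors' implementation: it assumes the publicly available "microsoft/codebert-base-mlm" masked-language-model checkpoint and the HuggingFace transformers library, and it computes two of the three metrics, exact matches (a) and model confidence (c); the embedding-similarity variant (b) is omitted for brevity. Averaging per-token scores into a line-level score is also an assumption; the paper's exact aggregation may differ.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    # Assumed checkpoint: CodeBERT fine-tuned for masked-token prediction.
    tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base-mlm")
    model = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")
    model.eval()

    def line_predictability(code_line):
        """Mask each token of `code_line` in turn; return the fraction of
        exact-match predictions and the mean confidence of the top prediction."""
        ids = tokenizer(code_line, return_tensors="pt")["input_ids"][0]
        exact, confidence = [], []
        for pos in range(1, len(ids) - 1):         # skip <s> and </s> specials
            masked = ids.clone()
            masked[pos] = tokenizer.mask_token_id  # hide one token
            with torch.no_grad():
                logits = model(masked.unsqueeze(0)).logits[0, pos]
            probs = torch.softmax(logits, dim=-1)
            top_prob, top_id = probs.max(dim=-1)
            exact.append(int(top_id.item() == ids[pos].item()))
            confidence.append(top_prob.item())
        n = max(len(exact), 1)
        return sum(exact) / n, sum(confidence) / n

    # Rank lines by predictability: less predictable lines are treated as
    # less natural, hence more suspicious (prioritized for inspection).
    lines = ["int i = 0;", "return x + + y ;;"]
    scores = {ln: line_predictability(ln) for ln in lines}
    for ln, (em, conf) in sorted(scores.items(), key=lambda kv: kv[1][0]):
        print(f"exact-match={em:.2f}  confidence={conf:.2f}  {ln}")

In this sketch, a low exact-match rate or low confidence marks a line as less "natural", which is how buggy lines would bubble to the top of a ranking.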
Researchers
FnR ; FNR12630949 > Yves Le Traon > TESTFAST > Software Testing In A Fast, Clever And Effective Way > 01/01/2019 > 30/09/2022 > 2018

File(s) associated to this reference

Fulltext file(s):

File: CodeBERT_nt___QRS_2022.pdf (Author preprint, 929.18 kB, Open access)

