Paper published in conference proceedings (scientific congresses, symposiums and conference proceedings)
CODEBERT-NT: code naturalness via CodeBERT
Khanfir, Ahmed; Jimenez, Matthieu; Papadakis, Mike et al.
2022, in: 22nd IEEE International Conference on Software Quality, Reliability and Security (QRS'22)
Peer reviewed
 

Files


Full Text : CodeBERT_nt___QRS_2022.pdf, author preprint (951.49 kB)



Details



Keywords :
Code Naturalness; CodeBERT; Pre-trained models
Abstract :
[en] Much recent software-engineering research has investigated the naturalness of code: the fact that code, in small snippets, is repetitive and can be predicted using statistical language models such as n-grams. Although powerful, training such models on a large code corpus can be tedious, time-consuming and sensitive to the code patterns (and practices) encountered during training. Consequently, these models are often trained on a small corpus and thus only estimate language naturalness relative to a specific programming style or type of project. To overcome these issues, we investigate the use of pre-trained generative language models to infer code naturalness. Pre-trained models are often built on big data, are easy to use out of the box and include powerful association-learning mechanisms. Our key idea is to quantify code naturalness through its predictability, using state-of-the-art generative pre-trained language models. We therefore propose to infer naturalness by masking (omitting) the tokens of code sequences, one at a time, and checking the model's ability to predict them. We explore three predictability metrics: a) measuring the number of exact matches of the predictions, b) computing the embedding similarity between the original and predicted code, i.e., similarity in the vector space, and c) computing the confidence of the model when performing the token-completion task, regardless of the outcome. We implement this workflow, named CODEBERT-NT, and evaluate its capability to prioritize buggy lines over non-buggy ones when ranking code by its naturalness. Our results, on 2,510 buggy versions of 40 projects from the SmartShark dataset, show that CODEBERT-NT outperforms both random-uniform and complexity-based ranking techniques, and yields results comparable to those of n-gram models.
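To illustrate the masking workflow summarized above, the sketch below masks each token of a code line in turn and derives two of the predictability metrics: the exact-match rate (a) and the model's confidence in the original token (c). This is only a minimal approximation, not the authors' implementation: it assumes the HuggingFace transformers library and the publicly available microsoft/codebert-base-mlm checkpoint, and the tokenization and scoring details of CODEBERT-NT may differ.

# Minimal sketch of the token-masking idea (assumed setup, not the paper's code):
# mask each token of a code sequence one at a time and score how predictable it is.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "microsoft/codebert-base-mlm"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

def naturalness_scores(code_line: str) -> dict:
    """Mask each token of `code_line` in turn and score its predictability."""
    ids = tokenizer(code_line, return_tensors="pt")["input_ids"][0]
    exact, confidence = [], []
    # Positions 0 and len-1 hold the special <s> and </s> tokens; skip them.
    for pos in range(1, len(ids) - 1):
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        probs = logits.softmax(dim=-1)
        # Metric (a): did the top prediction recover the original token?
        exact.append(float(probs.argmax().item() == ids[pos].item()))
        # Metric (c): model confidence in the original token, outcome aside.
        confidence.append(probs[ids[pos]].item())
    n = max(len(exact), 1)
    return {"exact_match": sum(exact) / n, "confidence": sum(confidence) / n}

# Example: a common, "natural" line should score higher than an unusual one.
print(naturalness_scores("for (int i = 0; i < n; i++) {"))

Ranking lines from least to most natural by such scores corresponds to the use case evaluated in the paper, i.e., prioritizing buggy lines over non-buggy ones.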
Disciplines :
Computer science
Author, co-author :
Khanfir, Ahmed ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal
Jimenez, Matthieu ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
Papadakis, Mike ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > Computer Science and Communications Research Unit (CSC)
Le Traon, Yves ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > Computer Science and Communications Research Unit (CSC)
External co-authors :
yes
Language :
English
Title :
CODEBERT-NT: code naturalness via CodeBERT
Publication date :
05 December 2022
Event name :
22nd IEEE International Conference on Software Quality, Reliability and Security (QRS'22)
Event date :
from 05-12-2022 to 09-12-2022
Audience :
International
Proceedings title :
22nd IEEE International Conference on Software Quality, Reliability and Security (QRS'22)
Peer reviewed :
Peer reviewed
Focus Area :
Security, Reliability and Trust
FnR Project :
FNR12630949 - Software Testing In A Fast, Clever And Effective Way, 2018 (01/01/2019-30/09/2022) - Yves Le Traon
Available on ORBilu :
since 06 January 2023
