Reference : CODEBERT-NT: code naturalness via CodeBERT
Scientific congresses, symposiums and conference proceedings : Paper published in a book
Engineering, computing & technology : Computer science
Security, Reliability and Trust
http://hdl.handle.net/10993/53506
CODEBERT-NT: code naturalness via CodeBERT
English
Khanfir, Ahmed [University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal]
Jimenez, Matthieu [University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)]
Papadakis, Mike [University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > Computer Science and Communications Research Unit (CSC)]
Le Traon, Yves [University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > Computer Science and Communications Research Unit (CSC)]
5-Dec-2022
22nd IEEE International Conference on Software Quality, Reliability and Security (QRS'22)
Yes
No
International
22nd IEEE International Conference on Software Quality, Reliability and Security (QRS'22)
from 05-12-2022 to 09-12-2022
Guangzhou, China
[en] Code Naturalness ; CodeBERT ; Pre-trained models
[en] Much recent software-engineering research has investigated the naturalness of code: the fact that code, in small snippets, is repetitive and can be predicted by statistical language models such as n-grams. Although powerful, training such models on a large code corpus can be tedious, time-consuming and sensitive to the code patterns (and practices) encountered during training. Consequently, these models are often trained on a small corpus and thus only estimate naturalness relative to a specific style of programming or type of project. To overcome these issues, we investigate the use of pre-trained generative language models to infer code naturalness. Pre-trained models are often built on big data, are easy to use out of the box and include powerful association-learning mechanisms. Our key idea is to quantify code naturalness through its predictability, using state-of-the-art generative pre-trained language models. We therefore infer naturalness by masking (omitting) the tokens of a code sequence, one at a time, and checking the model's ability to predict them. We explore three predictability metrics: a) the number of exact matches between the predictions and the original tokens, b) the embedding similarity between the original and predicted code, i.e., similarity in the vector space, and c) the confidence of the model when performing the token-completion task, regardless of the outcome. We implement this workflow, named CODEBERT-NT, and evaluate its ability to prioritize buggy lines over non-buggy ones when ranking code by its naturalness. Our results, on 2,510 buggy versions of 40 projects from the SmartShark dataset, show that CODEBERT-NT outperforms both random-uniform and complexity-based ranking techniques, and yields results comparable to those of n-gram models.
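The mask-one-token-at-a-time workflow described in the abstract can be sketched in a few lines of Python. The snippet below is a minimal illustration, not the authors' implementation: it assumes the publicly available "microsoft/codebert-base-mlm" masked-language-model checkpoint and the HuggingFace transformers library, and it computes two of the three metrics, exact matches (a) and model confidence (c); the embedding-similarity variant (b) is omitted for brevity. Averaging per-token scores into a line-level score is also an assumption; the paper's exact aggregation may differ.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    # Assumed checkpoint: CodeBERT fine-tuned for masked-token prediction.
    tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base-mlm")
    model = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")
    model.eval()

    def line_predictability(code_line):
        """Mask each token of `code_line` in turn; return the fraction of
        exact-match predictions and the mean confidence of the top prediction."""
        ids = tokenizer(code_line, return_tensors="pt")["input_ids"][0]
        exact, confidence = [], []
        for pos in range(1, len(ids) - 1):         # skip <s> and </s> specials
            masked = ids.clone()
            masked[pos] = tokenizer.mask_token_id  # hide one token
            with torch.no_grad():
                logits = model(masked.unsqueeze(0)).logits[0, pos]
            probs = torch.softmax(logits, dim=-1)
            top_prob, top_id = probs.max(dim=-1)
            exact.append(int(top_id.item() == ids[pos].item()))
            confidence.append(top_prob.item())
        n = max(len(exact), 1)
        return sum(exact) / n, sum(confidence) / n

    # Rank lines by predictability: less predictable lines are treated as
    # less natural, hence more suspicious (prioritized for inspection).
    lines = ["int i = 0;", "return x + + y ;;"]
    scores = {ln: line_predictability(ln) for ln in lines}
    for ln, (em, conf) in sorted(scores.items(), key=lambda kv: kv[1][0]):
        print(f"exact-match={em:.2f}  confidence={conf:.2f}  {ln}")

In this sketch, a low exact-match rate or low confidence marks a line as less "natural", which is how buggy lines would bubble to the top of a ranking.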
Researchers
FnR ; FNR12630949 > Yves Le Traon > TESTFAST > Software Testing In A Fast, Clever And Effective Way > 01/01/2019 > 30/09/2022 > 2018

File(s) associated to this reference

Fulltext file(s):

File: CodeBERT_nt___QRS_2022.pdf (Author preprint, 929.18 kB, Open access)

