Allamanis M, Barr ET, Devanbu P, Sutton C (2018) A survey of machine learning for big code and naturalness. ACM Comput Surv (CSUR) 51(4). https://doi.org/10.1145/3212695
Allamanis M, Brockschmidt M, Khademi M (2018) Learning to represent programs with graphs. In: International Conference on Learning Representations (ICLR)
Allamanis M, Jackson-Flux HR, Brockschmidt M (2021) Self-supervised bug detection and repair. In: Advances in neural information processing systems
Alon U, Brody S, Levy O, Yahav E (2019) code2seq: generating sequences from structured representations of code. In: International Conference on Learning Representations (ICLR)
Alon U, Zilberstein M, Levy O, Yahav E (2019) code2vec: learning distributed representations of code. Proc ACM Program Lang 3(POPL):1–29. https://doi.org/10.1145/3290353
Armstrong RA (2014) When to use the Bonferroni correction. Ophthalmic Physiol Opt 34(5):502–508. https://doi.org/10.1111/opo.12131
Ben-Nun T, Jakobovits AS, Hoefler T (2018) Neural code comprehension: a learnable representation of code semantics. Adv Neural Inf Process Syst 31
Bielik P, Vechev M (2020) Adversarial robustness for code. In: Proceedings of the 37th international conference on machine learning, ser. Proceedings of Machine Learning Research, vol 119. PMLR, pp 896–907
Bui ND, Yu Y, Jiang L (2021) Self-supervised contrastive learning for code retrieval and summarization via semantic-preserving transformations. In: Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval, ser. SIGIR ’21. Association for Computing Machinery, New York, NY, USA, pp 511–521. https://dl.acm.org/doi/abs/10.1145/3404835.3462840
Buratti L, Pujar S, Bornea M, McCarley S, Zheng Y, Rossiello G, Morari A, Laredo J, Thost V, Zhuang Y et al (2020) Exploring software naturalness through neural language models. arXiv:2006.12641
Chen Z, Monperrus M (2018) The CodRep machine learning on source code competition. arXiv:1807.03200
Chirkova N, Troshin S (2021) Empirical study of transformers for source code. In: Proceedings of the 29th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering, ser. ESEC/FSE 2021. Association for Computing Machinery, New York, NY, USA, pp 703–715. https://dl.acm.org/doi/10.1145/3468264.3468611
Church KW (2017) Word2vec. Nat Lang Eng 23(1):155–162. https://doi.org/10.1017/S1351324916000334
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the ACL: Human Language Technologies (NAACL-HLT), vol 1. ACL, pp 4171–4186
Dinella E, Dai H, Li Z, Naik M, Song L, Wang K (2020) Hoppity: learning graph transformations to detect and fix bugs in programs. In: International conference on learning representations
Dong Z, Hu Q, Zhang Z, Zhao J (2024) On the effectiveness of graph data augmentation for source code learning. Knowl-Based Syst 285:111328. https://doi.org/10.1016/j.knosys.2023.111328
Dong Z, Hu Q, Guo Y, Cordy M, Papadakis M, Zhang Z, Le Traon Y, Zhao J (2023) MixCode: enhancing code classification by mixup-based data augmentation. In: 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, pp 379–390
Dong Z, Hu Q, Guo Y, Zhang Z, Zhao J (2023) Boosting source code learning with text-oriented data augmentation: an empirical study. In: 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security Companion (QRS-C). IEEE, pp 383–392
Dong Z, Hu Q, Zhang Z, Guo Y, Cordy M, Papadakis M, Le Traon Y, Zhao J (2024) On the effectiveness of hybrid pooling in mixup-based graph learning for language processing. J Syst Softw, p 112139
Fabbri A, Han S, Li H, Li H, Ghazvininejad M, Joty S, Radev D, Mehdad Y (2021) Improving zero and few-shot abstractive summarization with intermediate fine-tuning and data augmentation. In: Proceedings of the 2021 Conference of the North American Chapter of the ACL: Human Language Technologies. ACL, pp 704–717
Feng SY, Gangal V, Wei J, Chandar S, Vosoughi S, Mitamura T, Hovy E (2021) A survey of data augmentation approaches for NLP. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, pp 968–988
Feng Z, Guo D, Tang D, Duan N, Feng X, Gong M, Shou L, Qin B, Liu T, Jiang D, Zhou M (2020) CodeBERT: a pre-trained model for programming and natural languages. In: Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, pp 1536–1547
Goodfellow IJ, Shlens J, Szegedy C (2015) Explaining and harnessing adversarial examples. In: 3rd International Conference on Learning Representations (ICLR)
Le Goues C, Pradel M, Roychoudhury A (2019) Automated program repair. Commun ACM 62(12):56–65. https://doi.org/10.1145/3318162
Guo H, Mao Y, Zhang R (2019) Augmenting data with mixup for sentence classification: an empirical study. arXiv:1905.08941
Guo D, Ren S, Lu S, Feng Z, Tang D, Liu S, Zhou L, Duan N, Svyatkovskiy A, Fu S et al (2020) GraphCodeBERT: pre-training code representations with data flow. arXiv:2009.08366
Hendrycks D, Dietterich T (2019) Benchmarking neural network robustness to common corruptions and perturbations. In: International Conference on Learning Representations (ICLR)
Hindle A, Barr ET, Su Z, Gabel M, Devanbu P (2012) On the naturalness of software. In: Proceedings of the 34th International Conference on Software Engineering (ICSE), ser. ICSE ’12. IEEE Press, pp 837–847
Hu Y, Ahmed UZ, Mechtaev S, Leong B, Roychoudhury A (2019) Re-factoring based program repair applied to programming assignments. In: 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp 388–398. https://ieeexplore.ieee.org/abstract/document/8952522
Hu Q, Guo Y, Xie X, Cordy M, Ma L, Papadakis M, Traon YL (2023) CodeS: towards code model generalization under distribution shift. In: ICSE: New Ideas and Emerging Results (NIER)
Hu X, Li G, Xia X, Lo D, Jin Z (2018) Deep code comment generation. In: Proceedings of the 26th conference on program comprehension, ser. ICPC ’18. Association for Computing Machinery, New York, NY, USA, pp 200–210. https://doi.org/10.1145/3196321.3196334
Jebnoun H, Ben Braiek H, Rahman MM, Khomh F (2020) The scent of deep learning code: an empirical study. In: Proceedings of the 17th international conference on mining software repositories, ser. MSR ’20. Association for Computing Machinery, pp 420–430. https://doi.org/10.1145/3379597.3387479
Kanade A, Maniatis P, Balakrishnan G, Shi K (2020) Learning and evaluating contextual embedding of source code. In: Proceedings of the 37th international conference on machine learning, ser. ICML’20. JMLR.org, pp 5110–5121
Kaur A, Kaur M (2016) Analysis of code refactoring impact on software quality. In: MATEC Web of Conferences, vol 57. EDP Sciences, p 02012
Kimura M (2021) Why mixup improves the model performance. In: Artificial Neural Networks and Machine Learning - ICANN 2021: 30th International Conference on Artificial Neural Networks, Bratislava, Slovakia, September 14-17, 2021, Proceedings, Part II. Springer-Verlag, Berlin, Heidelberg, pp 275–286. https://doi.org/10.1007/978-3-030-86340-1_22
Li B, Hou Y, Che W (2022) Data augmentation approaches in natural language processing: a survey. AI Open 3:71–90. https://doi.org/10.1016/j.aiopen.2022.03.001
Maćkiewicz A, Ratajczak W (1993) Principal components analysis (PCA). Comput Geosci 19(3):303–342. https://doi.org/10.1016/0098-3004(93)90090-R
Marivate V, Sefara T (2020) Improving short text classification through global augmentation methods. In: Machine learning and knowledge extraction, pp 385–399
Mastropaolo A, Pascarella L, Guglielmi E, Ciniselli M, Scalabrino S, Oliveto R, Bavota G (2023) On the robustness of code generation techniques: an empirical study on github copilot. In: Proceedings of the 45th international conference on software engineering, ser. ICSE ’23. IEEE Press, pp 2149–2160. https://doi.org/10.1109/ICSE48619.2023.00181
Ma W, Zhao M, Soremekun E, Hu Q, Zhang JM, Papadakis M, Cordy M, Xie X, Traon YL (2022) GraphCode2Vec: generic code embedding via lexical and program dependence analyses. In: Proceedings of the 19th international conference on mining software repositories, pp 524–536
Mi Q, Xiao Y, Cai Z, Jia X (2021) The effectiveness of data augmentation in code readability classification. Inf Softw Technol 129:106378. https://www.sciencedirect.com/science/article/abs/pii/S0950584920301464
Niu C, Li C, Ng V, Chen D, Ge J, Luo B (2023) An empirical comparison of pre-trained models of source code. In: Proceedings of the 45th international conference on software engineering, ser. ICSE ’23. IEEE Press, pp 2136–2148. https://doi.org/10.1109/ICSE48619.2023.00180
Pour MV, Li Z, Ma L, Hemmati H (2021) A search-based testing framework for deep neural networks of source code embedding. In: 14th IEEE Conference on Software Testing, Verification and Validation (ICST), pp 36–46
Puri R, Kung DS, Janssen G, Zhang W, Domeniconi G, Zolotov V, Dolby J, Chen J, Choudhury M, Decker L et al (2021) CodeNet: a large-scale AI for code dataset for learning a diversity of coding tasks. arXiv:2105.12655
Raychev V, Vechev M, Yahav E (2014) Code completion with statistical language models. SIGPLAN Not 49(6):419–428. https://doi.org/10.1145/2666356.2594321
Rebuffi S-A, Gowal S, Calian DA, Stimberg F, Wiles O, Mann TA (2021) Data augmentation can improve robustness. Adv Neural Inf Process Syst 34:29935–29948
Ren K, Zheng T, Qin Z, Liu X (2020) Adversarial attacks and defenses in deep learning. Engineering 6(3):346–360. https://doi.org/10.1016/j.eng.2019.12.012
Roziere B, Gehring J, Gloeckle F, Sootla S, Gat I, Tan XE, Adi Y, Liu J, Remez T, Rapin J et al (2023) Code Llama: open foundation models for code. arXiv:2308.12950
Shorten C, Khoshgoftaar TM (2019) A survey on image data augmentation for deep learning. J Big Data 6(1):1–48. https://doi.org/10.1186/s40537-019-0197-0
Siow JK, Liu S, Xie X, Meng G, Liu Y (2022) Learning program semantics with code representations: an empirical study. In: 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE Computer Society, Los Alamitos, CA, USA, pp 554–565. https://doi.ieeecomputersociety.org/10.1109/SANER53432.2022.00073
Steenhoek B, Rahman MM, Jiles R, Le W (2023) An empirical study of deep learning models for vulnerability detection. In: Proceedings of the 45th international conference on software engineering, ser. ICSE ’23. IEEE Press, pp 2237–2248. https://doi.org/10.1109/ICSE48619.2023.00188
Svajlenko J, Islam JF, Keivanloo I, Roy CK, Mia MM (2014) Towards a big data curated benchmark of inter-project code clones. In: Proceedings of the 2014 IEEE international conference on software maintenance and evolution, ser. ICSME ’14. IEEE Computer Society, USA, pp 476–480. https://ieeexplore.ieee.org/document/6976121
Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, Bashlykov N, Batra S, Bhargava P, Bhosale S et al (2023) Llama 2: open foundation and fine-tuned chat models. arXiv:2307.09288
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30
Verma V, Lamb A, Beckham C, Najafi A, Mitliagkas I, Lopez-Paz D, Bengio Y (2019) Manifold mixup: better representations by interpolating hidden states. In: International conference on machine learning. PMLR, pp 6438–6447
Wang J, Chen H-C, Radach R, Inhoff A (1999) Reading Chinese script: a cognitive analysis. Psychology Press
Wang D, Jia Z, Li S, Yu Y, Xiong Y, Dong W, Liao X (2022) Bridging pre-trained models and downstream tasks for source code understanding. In: Proceedings of the 44th international conference on software engineering, ser. ICSE ’22. Association for Computing Machinery, New York, NY, USA, pp 287–298. https://dl.acm.org/doi/abs/10.1145/3510003.3510062
Wang W, Li G, Ma B, Xia X, Jin Z (2020) Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In: 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, pp 261–271
Wan Y, Zhao Z, Yang M, Xu G, Ying H, Wu J, Yu PS (2018) Improving automatic source code summarization via deep reinforcement learning. In: Proceedings of the 33rd ACM/IEEE international conference on automated software engineering, pp 397–407
Wei M, Huang Y, Yang J, Wang J, Wang S (2022) CoCoFuzzing: testing neural code models with coverage-guided fuzzing. IEEE Trans Reliab
Wei J, Zou K (2019) EDA: easy data augmentation techniques for boosting performance on text classification tasks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp 6382–6388
White M, Vendome C, Linares-Vásquez M, Poshyvanyk D (2015) Toward deep learning software repositories. In: 2015 IEEE/ACM 12th working conference on mining software repositories. IEEE, pp 334–345
Xia M, Kong X, Anastasopoulos A, Neubig G (2019) Generalized data augmentation for low-resource translation. In: Proceedings of the 57th annual meeting of the association for computational linguistics, pp 5786–5796
Xie Q, Dai Z, Hovy E, Luong M-T, Le QV (2020) Unsupervised data augmentation for consistency training. In: Proceedings of the 34th international conference on neural information processing systems, ser. NIPS’20
Yang Z, Shi J, He J, Lo D (2022) Natural attack for pre-trained models of code. In: Proceedings of the 44th international conference on software engineering, ser. ICSE ’22. Association for Computing Machinery, pp 1482–1493. https://dl.acm.org/doi/abs/10.1145/3510003.3510146
Yan S, Yu H, Chen Y, Shen B, Jiang L (2020) Are the code snippets what we are searching for? a benchmark and an empirical study on code search with natural-language queries. In: 2020 IEEE 27th international conference on Software Analysis, Evolution and Reengineering (SANER), pp 344–354
Yefet N, Alon U, Yahav E (2020) Adversarial examples for models of code. Proc ACM Program Lang 4(OOPSLA):1–30. https://doi.org/10.1145/3428230
Yu AW, Dohan D, Luong T, Zhao R, Chen K, Le Q (2018) QANet: combining local convolution with global self-attention for reading comprehension. In: International conference on learning representations. https://openreview.net/forum?id=B14TlG-RW
Yu S, Wang T, Wang J (2022) Data augmentation by program transformation. J Syst Softw 190:111304. https://doi.org/10.1016/j.jss.2022.111304
Yun S, Han D, Oh SJ, Chun S, Choe J, Yoo Y (2019) CutMix: regularization strategy to train strong classifiers with localizable features. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6023–6032
Zhang H, Li Z, Li G, Ma L, Liu Y, Jin Z (2020) Generating adversarial examples for holding robustness of source code processing models. Proc AAAI Conf Artif Intell 34(01):1169–1176
Zhang H, Cisse M, Dauphin YN, Lopez-Paz D (2018) mixup: beyond empirical risk minimization. In: International Conference on Learning Representations (ICLR)
Zhang L, Deng Z, Kawaguchi K, Ghorbani A, Zou J (2021) How does mixup help with robustness and generalization? In: International conference on learning representations
Zhang R, Xiao W, Zhang H, Liu Y, Lin H, Yang M (2020) An empirical study on program failures of deep learning jobs. In: Proceedings of the ACM/IEEE 42nd international conference on software engineering, ser. ICSE ’20. Association for Computing Machinery, New York, NY, USA, pp 1159–1170. https://dl.acm.org/doi/10.1145/3377811.3380362
Zhang X, Zhou Y, Han T, Chen T (2021) Training deep code comment generation models via data augmentation. In: Proceedings of the 12th Asia-Pacific symposium on internetware, ser. Internetware ’20. Association for Computing Machinery, New York, NY, USA, pp 185–188. https://doi.org/10.1145/3457913.3457937
Zhong H, Su Z (2015) An empirical study on real bug fixes. In: IEEE/ACM 37th IEEE International Conference on Software Engineering (ICSE), vol 1, pp 913–923
Zhou Y, Liu S, Siow J, Du X, Liu Y (2019) Devign: effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In: Proceedings of the 33rd international conference on neural information processing systems