INTRODUCTION: Artificial Intelligence (AI) is increasingly used as an assistant in developing computer programs. While it can speed up software development and improve coding proficiency, this practice offers no guarantee of security. On the contrary, recent research shows that some AI models produce software containing vulnerabilities. This situation raises the question: how serious and widespread are the security flaws in code generated using AI models?
METHODS: Through a systematic literature review, this work surveys the state of the art on how AI models impact software security, systematizing knowledge about the risks of using AI to write security-critical software.
RESULTS: The review identifies which well-known security weaknesses (e.g., those in the MITRE CWE Top 25 Most Dangerous Software Weaknesses) are commonly hidden in AI-generated code. It also reviews works discussing how vulnerabilities in AI-generated code can be exploited to compromise security, and lists attempts to improve the security of such code.
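For illustration only (this snippet is ours, not drawn from a reviewed study): CWE-89, SQL injection, is one of the MITRE CWE Top 25 weaknesses that such studies commonly test AI-generated code against. A minimal Python sketch of the flawed pattern and its parameterized fix:

```python
import sqlite3

def find_user_vulnerable(conn: sqlite3.Connection, username: str):
    # CWE-89: attacker-controlled input is interpolated into the SQL text,
    # so a value such as "x' OR '1'='1" rewrites the query's meaning.
    query = f"SELECT id, name FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Parameterized query: the driver transmits the value separately from
    # the SQL text, so the input cannot alter the query structure.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    ).fetchall()
```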
DISCUSSION: Overall, this work provides a comprehensive and systematic overview of the impact of AI on secure coding, a topic that has sparked both interest and concern within the software security engineering community. It highlights the importance of establishing security measures and processes, such as code verification, and notes that such practices could be tailored to AI-aided code production.
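As a hedged illustration of what such tailored verification could look like (Bandit is an example tool of our choosing, not one prescribed by the reviewed works, and the scanned path is hypothetical), AI-generated Python could be gated through a static analyzer before being merged:

```python
import subprocess
import sys

def verify_generated_code(path: str) -> bool:
    # Run the Bandit static analyzer recursively over `path`.
    # Bandit exits with status 0 when no findings are reported.
    result = subprocess.run(
        ["bandit", "-r", path],
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    return result.returncode == 0

if __name__ == "__main__":
    # Fail the pipeline when the analyzer flags the AI-generated code.
    sys.exit(0 if verify_generated_code("generated_src/") else 1)
```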
Research centre:
NCER-FT - FinTech National Centre of Excellence in Research
Disciplines:
Computer science
Author, co-author:
NEGRI RIBALTA, Claudia Sofia ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > IRiSC
GERAUD-STEWART, Rémi ; École Normale Supérieure, Paris, France
SERGEEVA, Anastasia ; University of Luxembourg > Faculty of Humanities, Education and Social Sciences (FHSE) > Department of Behavioural and Cognitive Sciences (DBCS) > Lifespan Development, Family and Culture
LENZINI, Gabriele ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > IRiSC
External co-authors:
Yes
Document language:
English
Title:
A systematic literature review on the impact of AI models on the security of code generation.
The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This research was funded in whole, or in part, by the Luxembourg National Research Fund (FNR), grant: NCER22/IS/16570468/NCER-FT.