automatic test generation; empirical evaluation; large language models; prompt engineering; unit testing; unit tests; time constraints; Artificial Intelligence; Software; Safety, Risk, Reliability and Quality
Abstract:
[en] Unit testing is essential for identifying bugs, yet it is often neglected due to time constraints. Automated test generation tools exist, but the tests they produce typically lack readability and require developer intervention. Large Language Models (LLMs) such as GPT and Mistral show potential for test generation, but their effectiveness remains unclear. This study evaluates four LLMs and five prompt engineering techniques, analyzing 216,300 generated tests for 690 Java classes drawn from diverse datasets. We assess correctness, readability, coverage, and bug detection, comparing LLM-generated tests to those produced by EvoSuite. While LLMs show promise, improvements in correctness are needed. The study highlights both the strengths and limitations of LLMs and offers insights for future research.
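To make the evaluated setup concrete, below is a minimal sketch of a zero-shot prompt template for JUnit test generation, written in Java. Zero-shot prompting is assumed here as a plausible baseline among the five techniques (the abstract does not name them), and the names PromptBuilder and buildZeroShot are hypothetical, not the authors' tooling.

    // Illustrative zero-shot prompt template for LLM-based JUnit test generation.
    // PromptBuilder and buildZeroShot are hypothetical names, not the study's API.
    public final class PromptBuilder {

        /** Builds a zero-shot prompt asking an LLM for a JUnit 5 test class. */
        public static String buildZeroShot(String classUnderTestSource) {
            return """
                You are a Java unit testing assistant.
                Write a JUnit 5 test class for the Java class below.
                Cover normal and edge cases; the tests must compile and pass.

                %s

                Return only the test class source code.
                """.formatted(classUnderTestSource);
        }
    }

A chain-of-thought variant (Wei et al. 2022) would extend the instruction to elicit intermediate reasoning before the test code, while a tree-of-thoughts variant (Yao et al. 2024) would explore and rank several candidate test suites.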
KABORE, Abdoul Kader ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SNT Office > Project Coordination
TIAN, Haoye ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX > Team Tegawendé François d'A. BISSYANDE ; University of Melbourne, Australia
SONG, Yewei ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
KOYUNCU, Anil ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX > Team Tegawendé François d'A. BISSYANDE ; Bilkent University, Turkey
KLEIN, Jacques ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
LO, David ; Singapore Management University, Singapore
ACM; ACM SIGAI; Google; IEEE; Special Interest Group on Software Engineering (SIGSOFT); University of California, Davis (UC Davis)
Grant No.:
17185670
Funding (details):
This work is supported by funding from the Fonds National de la Recherche Luxembourg (FNR) under the Aides à la Formation-Recherche (AFR) programme (grant agreement No. 17185670).
Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: Automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE '11). 416–419.
Gordon Fraser and Andrea Arcuri. 2014. A large-scale evaluation of automated unit test generation using EvoSuite. ACM Transactions on Software Engineering and Methodology (TOSEM) 24, 2 (2014), 1–42.
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023).
Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024).
René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis (ISSTA 2014). 437–440.
Annibale Panichella, Fitsum Meshesha Kifetew, and Paolo Tonella. 2017. Automated test case generation as a many-objective optimisation problem with dynamic selection of the targets. IEEE Transactions on Software Engineering 44, 2 (2017), 122–158.
Laria Reynolds and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems. 1–7.
Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. 2024. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. arXiv preprint arXiv:2402.07927 (2024).
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems 36 (2024).