automatic test generation; empirical evaluation; large language models; prompt engineering; unit testing; unit tests; time constraints; Artificial Intelligence; Software; Safety, Risk, Reliability and Quality
Abstract:
[en] Unit testing is essential for identifying bugs, yet it is often neglected due to time constraints. Automated test generation tools exist, but the tests they produce typically lack readability and require developer intervention. Large Language Models (LLMs) such as GPT and Mistral show potential for test generation, but their effectiveness remains unclear. This study evaluates four LLMs and five prompt engineering techniques, analyzing 216,300 generated tests for 690 Java classes drawn from diverse datasets. We assess correctness, readability, coverage, and bug detection, comparing LLM-generated tests to those produced by EvoSuite. While LLMs show promise, improvements in correctness are needed. The study highlights both the strengths and limitations of LLMs and offers insights for future research.
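To make the evaluated setup concrete, below is a minimal sketch of a zero-shot prompt template for JUnit test generation, written in Java. Zero-shot prompting is assumed here as a plausible baseline among the five techniques (the abstract does not name them), and the names PromptBuilder and buildZeroShot are hypothetical, not the authors' tooling.

    // Illustrative zero-shot prompt template for LLM-based JUnit test generation.
    // PromptBuilder and buildZeroShot are hypothetical names, not the study's API.
    public final class PromptBuilder {

        /** Builds a zero-shot prompt asking an LLM for a JUnit 5 test class. */
        public static String buildZeroShot(String classUnderTestSource) {
            return """
                You are a Java unit testing assistant.
                Write a JUnit 5 test class for the Java class below.
                Cover normal and edge cases; the tests must compile and pass.

                %s

                Return only the test class source code.
                """.formatted(classUnderTestSource);
        }
    }

A chain-of-thought variant (Wei et al. 2022) would extend the instruction to elicit intermediate reasoning before the test code, while a tree-of-thoughts variant (Yao et al. 2024) would explore and rank several candidate test suites.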
KABORE, Abdoul Kader ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SNT Office > Project Coordination
TIAN, Haoye ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX > Team Tegawendé François d'A. BISSYANDE ; University of Melbourne, Australia
SONG, Yewei ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
KOYUNCU, Anil ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX > Team Tegawendé François d'A. BISSYANDE ; Bilkent University, Turkey
KLEIN, Jacques ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
LO, David ; Singapore Management University, Singapore
ACM; ACM SIGAI; Google; IEEE; Special Interest Group on Software Engineering (SIGSOFT); University of California, Davis (UC Davis)
Grant No.:
17185670
Funding (details):
This work is supported by funding from the Fonds National de la Recherche Luxembourg (FNR) under the Aides à la Formation-Recherche (AFR) programme (grant agreement No. 17185670).
Gordon Fraser and Andrea Arcuri. 2011. EvoSuite: Automatic test suite generation for object-oriented software. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE '11). 416–419.
Gordon Fraser and Andrea Arcuri. 2014. A large-scale evaluation of automated unit test generation using EvoSuite. ACM Transactions on Software Engineering and Methodology (TOSEM) 24, 2 (2014), 1–42.
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023).
Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024).
René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis (ISSTA 2014). 437–440.
Annibale Panichella, Fitsum Meshesha Kifetew, and Paolo Tonella. 2017. Automated test case generation as a many-objective optimisation problem with dynamic selection of the targets. IEEE Transactions on Software Engineering 44, 2 (2017), 122–158.
Laria Reynolds and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems. 1–7.
Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. 2024. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. arXiv preprint arXiv:2402.07927 (2024).
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems 36 (2024).