Article (Scientific journals)
Prompt engineering in LLMs for automated unit test generation: A large-scale study
OUEDRAOGO, Wendkûuni Arzouma Marc Christian; KABORE, Abdoul Kader; LI, Yinghua et al.
2026, in Empirical Software Engineering, 31 (4)
Peer Reviewed verified by ORBi
 

Files


Full Text
s10664-026-10840-4.pdf
Publisher postprint (5.5 MB) Creative Commons License - Attribution


Details



Keywords :
Automatic Test Generation; Unit Tests; Large Language Models; Prompt Engineering; Empirical Evaluation
Abstract :
[en] Unit testing is essential for software reliability, yet manual test creation is time-consuming and often neglected. While search-based software testing improves efficiency, it produces tests with poor readability and maintainability. LLMs show promise for test generation, but existing research lacks comprehensive evaluation across execution-driven assessment, reasoning-based prompting, and real-world testing scenarios. This study presents the first large-scale empirical evaluation of LLM-generated unit tests at the full class level, systematically analyzing four state-of-the-art models (GPT-3.5, GPT-4, Mistral 7B, and Mixtral 8x7B) against EvoSuite across 216,300 generated test cases targeting Defects4J, SF110, and CMD (a dataset mitigating LLM training-data leakage). We evaluate five prompting techniques (Zero-Shot Learning (ZSL), Few-Shot Learning (FSL), Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Guided Tree-of-Thought (GToT)), assessing syntactic correctness, compilability, hallucination-driven failures, readability, code coverage metrics, and test smells. Reasoning-based prompting, particularly GToT, significantly enhances test reliability, compilability, and structural adherence in general-purpose LLMs. However, hallucination-driven failures remain a persistent challenge, manifesting as references to non-existent symbols, incorrect API calls, and fabricated dependencies, and resulting in compilation failure rates of up to 86%. Moreover, test smell analysis reveals that while LLM-generated tests are generally more readable than those produced by traditional tools, they still suffer from recurring design issues, such as Magic Number Tests and Assertion Roulette, that hinder maintainability.
Overall, our findings indicate that LLMs can serve as effective assistive tools for generating readable and maintainable test suites, but hybrid approaches that combine LLM-based generation with automated validation and search-based refinement are required to achieve reliable, production-ready test generation.
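The abstract enumerates five prompting techniques (ZSL, FSL, CoT, ToT, GToT). A minimal, hypothetical sketch of how such prompt variants might be constructed for Java unit test generation is shown below; the template wording and the `build_prompt` helper are illustrative assumptions, not the paper's actual prompts, which are given in the full text.

```python
# Hypothetical sketch of the five prompting strategies named in the abstract.
# The templates below are illustrative only; they do NOT reproduce the
# prompts used in the study.

def build_prompt(technique: str, class_source: str) -> str:
    """Build an LLM prompt asking for a JUnit test class for `class_source`."""
    base = ("Generate a JUnit test class for the following Java class:\n"
            f"{class_source}\n")
    if technique == "ZSL":   # Zero-Shot: task description only, no examples
        return base
    if technique == "FSL":   # Few-Shot: prepend one or more worked examples
        example = ("Example:\n"
                   "Class: class Adder { int add(int a, int b) { return a + b; } }\n"
                   "Test: @Test void testAdd() { assertEquals(3, new Adder().add(1, 2)); }\n")
        return example + base
    if technique == "CoT":   # Chain-of-Thought: request explicit reasoning first
        return base + ("Reason step by step about each method's behavior "
                       "before writing the tests.")
    if technique == "ToT":   # Tree-of-Thought: explore alternative test plans
        return base + ("Propose several candidate test plans, evaluate each, "
                       "then expand the most promising plan into tests.")
    if technique == "GToT":  # Guided ToT: constrain the exploration with checks
        return base + ("Propose candidate test plans, but before expanding a "
                       "plan, verify that every referenced symbol exists in the "
                       "class and that the resulting test compiles.")
    raise ValueError(f"unknown technique: {technique}")

# Usage: build the GToT variant for a toy class under test.
prompt = build_prompt("GToT", "class Counter { int n; void inc() { n++; } }")
```

The guided variant differs from plain Tree-of-Thought only in the added validity constraints, which mirrors the abstract's finding that constraining exploration (GToT) reduces hallucination-driven compilation failures.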
Disciplines :
Computer science
Author, co-author :
OUEDRAOGO, Wendkûuni Arzouma Marc Christian  ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
KABORE, Abdoul Kader  ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SNT Office > Project Coordination
LI, Yinghua  ;  University of Luxembourg
TIAN, Haoye ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust > TruX > Team Tegawendé François d A BISSYANDE
KOYUNCU, Anil ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust > TruX > Team Tegawendé François d A BISSYANDE
KLEIN, Jacques  ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
Lo, David
BISSYANDE, Tegawendé  ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
External co-authors :
yes
Language :
English
Title :
Prompt engineering in LLMs for automated unit test generation: A large-scale study
Publication date :
28 March 2026
Journal title :
Empirical Software Engineering
ISSN :
1382-3256
eISSN :
1573-7616
Publisher :
Springer Science and Business Media LLC
Volume :
31
Issue :
4
Peer reviewed :
Peer Reviewed verified by ORBi
Funders :
FNR - Fonds National de la Recherche Luxembourg
H2020 European Research Council
Funding number :
17185670; 949014
Funding text :
This research was funded in whole, or in part, by the Luxembourg National Research Fund (FNR), grant reference AFR PhD bilateral, project reference 17185670. This work was also supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No. 949014). For the purpose of open access, and in fulfilment of the obligations arising from the grant agreement, the author has applied a Creative Commons Attribution 4.0 International (CC BY 4.0) license to any Author Accepted Manuscript version arising from this submission.
Available on ORBilu :
since 28 March 2026

