Article (Scientific journals)
Prompt engineering in LLMs for automated unit test generation: A large-scale study
OUEDRAOGO, Wendkûuni Arzouma Marc Christian; KABORE, Abdoul Kader; LI, Yinghua et al.
2026, in Empirical Software Engineering, 31 (4)
Peer Reviewed verified by ORBi
 

Files


Full Text
s10664-026-10840-4.pdf
Publisher postprint (5.5 MB) Creative Commons License - Attribution


Details



Keywords :
Automatic Test Generation; Unit Tests; Large Language Models; Prompt Engineering; Empirical Evaluation
Abstract :
[en] Unit testing is essential for software reliability, yet manual test creation is time-consuming and often neglected. While search-based software testing improves efficiency, it produces tests with poor readability and maintainability. LLMs show promise for test generation, but existing research lacks comprehensive evaluation across execution-driven assessment, reasoning-based prompting, and real-world testing scenarios. This study presents the first large-scale empirical evaluation of LLM-generated unit tests at the full class level, systematically analyzing four state-of-the-art models (GPT-3.5, GPT-4, Mistral 7B, and Mixtral 8x7B) against EvoSuite across 216,300 generated test cases targeting Defects4J, SF110, and CMD (a dataset mitigating LLM training-data leakage). We evaluate five prompting techniques (Zero-Shot Learning (ZSL), Few-Shot Learning (FSL), Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Guided Tree-of-Thought (GToT)), assessing syntactic correctness, compilability, hallucination-driven failures, readability, code coverage metrics, and test smells. Reasoning-based prompting, particularly GToT, significantly enhances test reliability, compilability, and structural adherence in general-purpose LLMs. However, hallucination-driven failures remain a persistent challenge, manifesting as references to non-existent symbols, incorrect API calls, and fabricated dependencies, and resulting in compilation failure rates of up to 86%. Moreover, test smell analysis reveals that while LLM-generated tests are generally more readable than those produced by traditional tools, they still suffer from recurring design issues, such as Magic Number Tests and Assertion Roulette, that hinder maintainability.
Overall, our findings indicate that LLMs can serve as effective assistive tools for generating readable and maintainable test suites, but hybrid approaches that combine LLM-based generation with automated validation and search-based refinement are required to achieve reliable, production-ready test generation.
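The abstract enumerates five prompting techniques (ZSL, FSL, CoT, ToT, GToT). A minimal, hypothetical sketch of how such prompt variants might be constructed for Java unit test generation is shown below; the template wording and the `build_prompt` helper are illustrative assumptions, not the paper's actual prompts, which are given in the full text.

```python
# Hypothetical sketch of the five prompting strategies named in the abstract.
# The templates below are illustrative only; they do NOT reproduce the
# prompts used in the study.

def build_prompt(technique: str, class_source: str) -> str:
    """Build an LLM prompt asking for a JUnit test class for `class_source`."""
    base = ("Generate a JUnit test class for the following Java class:\n"
            f"{class_source}\n")
    if technique == "ZSL":   # Zero-Shot: task description only, no examples
        return base
    if technique == "FSL":   # Few-Shot: prepend one or more worked examples
        example = ("Example:\n"
                   "Class: class Adder { int add(int a, int b) { return a + b; } }\n"
                   "Test: @Test void testAdd() { assertEquals(3, new Adder().add(1, 2)); }\n")
        return example + base
    if technique == "CoT":   # Chain-of-Thought: request explicit reasoning first
        return base + ("Reason step by step about each method's behavior "
                       "before writing the tests.")
    if technique == "ToT":   # Tree-of-Thought: explore alternative test plans
        return base + ("Propose several candidate test plans, evaluate each, "
                       "then expand the most promising plan into tests.")
    if technique == "GToT":  # Guided ToT: constrain the exploration with checks
        return base + ("Propose candidate test plans, but before expanding a "
                       "plan, verify that every referenced symbol exists in the "
                       "class and that the resulting test compiles.")
    raise ValueError(f"unknown technique: {technique}")

# Usage: build the GToT variant for a toy class under test.
prompt = build_prompt("GToT", "class Counter { int n; void inc() { n++; } }")
```

The guided variant differs from plain Tree-of-Thought only in the added validity constraints, which mirrors the abstract's finding that constraining exploration (GToT) reduces hallucination-driven compilation failures.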
Disciplines :
Computer science
Author, co-author :
OUEDRAOGO, Wendkûuni Arzouma Marc Christian  ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
KABORE, Abdoul Kader  ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SNT Office > Project Coordination
LI, Yinghua  ;  University of Luxembourg
TIAN, Haoye ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust > TruX > Team Tegawendé François d A BISSYANDE
KOYUNCU, Anil ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust > TruX > Team Tegawendé François d A BISSYANDE
KLEIN, Jacques  ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
Lo, David
BISSYANDE, Tegawendé  ;  University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
External co-authors :
yes
Language :
English
Title :
Prompt engineering in LLMs for automated unit test generation: A large-scale study
Publication date :
28 March 2026
Journal title :
Empirical Software Engineering
ISSN :
1382-3256
eISSN :
1573-7616
Publisher :
Springer Science and Business Media LLC
Volume :
31
Issue :
4
Peer reviewed :
Peer Reviewed verified by ORBi
Funders :
FNR - Fonds National de la Recherche Luxembourg
H2020 European Research Council
Funding number :
17185670; 949014
Funding text :
This research was funded in whole, or in part, by the Luxembourg National Research Fund (FNR), grant reference AFR PhD bilateral, project reference 17185670. This work was also supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No. 949014). For the purpose of open access, and in fulfilment of the obligations arising from the grant agreement, the author has applied a Creative Commons Attribution 4.0 International (CC BY 4.0) license to any Author Accepted Manuscript version arising from this submission.
Available on ORBilu :
since 28 March 2026

