Code Generation; LLMs; Non-Determinism; Tree of Thoughts; Deterministic behavior; High degree of variability; Language model; Large language model; Software engineering research; Thought process; Hardware and Architecture; Software; Safety, Risk, Reliability and Quality
Abstract :
[en] Despite recent advancements in Large Language Models (LLMs) for code generation, their inherent non-determinism remains a significant obstacle to reliable and reproducible software engineering research. Prior work has highlighted the high degree of variability in LLM-generated code, even when prompted with identical inputs. This non-deterministic behavior can undermine the validity of scientific conclusions drawn from LLM-based experiments. This paper showcases the Tree of Thoughts (ToT) prompting strategy as a promising alternative for improving the predictability and quality of code generation results. By guiding the LLM through a structured thought process, ToT aims to reduce the randomness inherent in the generation process and improve the consistency of the output. Our experiments on the GPT-3.5 Turbo model, using 829 code generation problems from the CodeContests, APPS (Automated Programming Progress Standard), and HumanEval benchmarks, demonstrate a substantial reduction in non-determinism compared to previous findings. Specifically, we observed a significant decrease in the number of coding tasks that produced inconsistent outputs across multiple requests. Nevertheless, we show that the reduction in semantic variability was less pronounced for HumanEval (69%), indicating that this dataset presents challenges that ToT does not fully mitigate.
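For illustration only, below is a minimal Python sketch of the two mechanisms the abstract describes: a breadth-first Tree of Thoughts loop and a repeated-request consistency check. This is not the authors' implementation; the generate() function is a hypothetical stand-in for a GPT-3.5 Turbo API call, and the scoring and bucketing logic are simplifying assumptions.

from collections import Counter

def generate(prompt: str) -> str:
    """Hypothetical placeholder for one LLM request (e.g., GPT-3.5 Turbo).
    Swap in a real API client to experiment."""
    raise NotImplementedError

def tot_codegen(task: str, breadth: int = 3, depth: int = 2) -> str:
    """Toy Tree of Thoughts loop (after Yao et al., 2023): at each level,
    propose several candidate solution plans, let the model score them,
    and keep only the best chain before emitting the final code."""
    best_chain = ""
    for _ in range(depth):
        candidates = [
            best_chain + "\n" + generate(
                f"Task: {task}\nPlan so far:{best_chain}\n"
                "Propose the next step of a solution plan."
            )
            for _ in range(breadth)
        ]
        # Self-evaluation step; a robust version would parse scores defensively.
        scores = [generate(f"Rate this plan from 0 to 10, digits only:\n{c}")
                  for c in candidates]
        best_chain = max(zip(scores, candidates), key=lambda s: float(s[0]))[1]
    return generate(f"Task: {task}\nPlan:{best_chain}\nReturn only the final code.")

def consistency(task: str, runs: int = 5) -> Counter:
    """Issue the same task several times and bucket the raw outputs;
    a single bucket means the pipeline behaved deterministically.
    (The paper compares semantic equivalence, e.g., via test execution,
    which this purely textual check only approximates.)"""
    return Counter(tot_codegen(task) for _ in range(runs))

A study of the kind reported here would run consistency() over each benchmark task and count how many tasks yield more than one bucket, comparing a plain-prompting baseline against the ToT pipeline.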
Disciplines :
Computer science
Author, co-author :
Sawadogo, Salimata; Université Joseph Ki-Zerbo, Centre d'Excellence en IA (CITADEL), Ouagadougou, Burkina Faso
Sabane, Aminata; Université Joseph Ki-Zerbo, Centre d'Excellence en IA (CITADEL), Ouagadougou, Burkina Faso
Kafando, Rodrique; Université Virtuelle du Burkina Faso, Centre d'Excellence en IA (CITADEL), Ouagadougou, Burkina Faso
Kabore, Abdoul Kader; Université du Luxembourg, Centre d'Excellence en IA (CITADEL), Ouagadougou, Burkina Faso
Bissyandé, Tegawendé; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
External co-authors :
yes
Language :
English
Title :
Revisiting the Non-Determinism of Code Generation by the GPT-3.5 Large Language Model
Publication date :
04 March 2025
Event name :
2025 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)
Event organizer :
IEEE
Event date :
04-03-2025 - 07-03-2025
Audience :
International
Main work title :
Proceedings - 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2025
Publisher :
Institute of Electrical and Electronics Engineers Inc.