Code Generation; LLMs; Non-Determinism; Tree of Thoughts; Deterministic behavior; High degree of variability; Language model; Large language model; Software engineering research; Thought process; Hardware and Architecture; Software; Safety, Risk, Reliability and Quality
Abstract :
[en] Despite recent advancements in Large Language Models (LLMs) for code generation, their inherent non-determinism remains a significant obstacle to reliable and reproducible software engineering research. Prior work has highlighted the high degree of variability in LLM-generated code, even when prompted with identical inputs. This non-deterministic behavior can undermine the validity of scientific conclusions drawn from LLM-based experiments. This paper showcases the Tree of Thoughts (ToT) prompting strategy as a promising alternative for improving the predictability and quality of code generation results. By guiding the LLM through a structured thought process, ToT aims to reduce the randomness inherent in the generation process and improve the consistency of the output. Our experiments on the GPT-3.5 Turbo model, using 829 code generation problems from the CodeContests, APPS (Automated Programming Progress Standard), and HumanEval benchmarks, demonstrate a substantial reduction in non-determinism compared to previous findings. Specifically, we observed a significant decrease in the number of coding tasks that produced inconsistent outputs across multiple requests. Nevertheless, we show that the reduction in semantic variability was less pronounced for HumanEval (69%), indicating that this dataset presents challenges that ToT does not fully mitigate.
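For illustration only, below is a minimal Python sketch of the two mechanisms the abstract describes: a breadth-first Tree of Thoughts loop and a repeated-request consistency check. This is not the authors' implementation; the generate() function is a hypothetical stand-in for a GPT-3.5 Turbo API call, and the scoring and bucketing logic are simplifying assumptions.

from collections import Counter

def generate(prompt: str) -> str:
    """Hypothetical placeholder for one LLM request (e.g., GPT-3.5 Turbo).
    Swap in a real API client to experiment."""
    raise NotImplementedError

def tot_codegen(task: str, breadth: int = 3, depth: int = 2) -> str:
    """Toy Tree of Thoughts loop (after Yao et al., 2023): at each level,
    propose several candidate solution plans, let the model score them,
    and keep only the best chain before emitting the final code."""
    best_chain = ""
    for _ in range(depth):
        candidates = [
            best_chain + "\n" + generate(
                f"Task: {task}\nPlan so far:{best_chain}\n"
                "Propose the next step of a solution plan."
            )
            for _ in range(breadth)
        ]
        # Self-evaluation step; a robust version would parse scores defensively.
        scores = [generate(f"Rate this plan from 0 to 10, digits only:\n{c}")
                  for c in candidates]
        best_chain = max(zip(scores, candidates), key=lambda s: float(s[0]))[1]
    return generate(f"Task: {task}\nPlan:{best_chain}\nReturn only the final code.")

def consistency(task: str, runs: int = 5) -> Counter:
    """Issue the same task several times and bucket the raw outputs;
    a single bucket means the pipeline behaved deterministically.
    (The paper compares semantic equivalence, e.g., via test execution,
    which this purely textual check only approximates.)"""
    return Counter(tot_codegen(task) for _ in range(runs))

A study of the kind reported here would run consistency() over each benchmark task and count how many tasks yield more than one bucket, comparing a plain-prompting baseline against the ToT pipeline.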
Disciplines :
Computer science
Author, co-author :
Sawadogo, Salimata; Université Joseph Ki-Zerbo, Centre d'Excellence en IA (CITADEL), Ouagadougou, Burkina Faso
Sabane, Aminata; Université Joseph Ki-Zerbo, Centre d'Excellence en IA (CITADEL), Ouagadougou, Burkina Faso
Kafando, Rodrique; Université Virtuelle du Burkina Faso, Centre d'Excellence en IA (CITADEL), Ouagadougou, Burkina Faso
Kabore, Abdoul Kader; Université du Luxembourg, Centre d'Excellence en IA (CITADEL), Ouagadougou, Burkina Faso
Bissyandé, Tegawendé; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
External co-authors :
yes
Language :
English
Title :
Revisiting the Non-Determinism of Code Generation by the GPT-3.5 Large Language Model
Publication date :
04 March 2025
Event name :
2025 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)
Event organizer :
IEEE
Event date :
04-03-2025 - 07-03-2025
Audience :
International
Main work title :
Proceedings - 2025 IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2025
Publisher :
Institute of Electrical and Electronics Engineers Inc.