Keywords :
Mutation testing; Program behavior; Program specification; Software testing; Large language models
Abstract :
[en] This paper presents intent-based mutation testing, a testing approach that generates mutations by changing the programming intents implemented in the programs under test. In contrast to traditional mutation testing, which mutates the way programs are written, intent mutation changes the behavior of the programs by producing mutations that implement slightly different intents than those of the original program. These mutated intents represent possible corner cases and misunderstandings of the program behavior, i.e., of the program specifications, and can therefore capture different classes of faults than traditional (syntax-based) mutation. Moreover, since programming intents can be implemented in different ways, intent-based mutation testing can generate diverse and complex mutations that remain close to the original programming intents (specifications) and thus direct testing towards intent-level variants of the program behavior. We implement intent-based mutation testing using Large Language Models (LLMs) that mutate programming intents and transform them into mutants. We evaluate intent-based mutation on 29 programs and show that it generates mutations that are syntactically complex, semantically diverse, and semantically quite different from traditional ones. We also show that 55% of the intent-based mutations are not subsumed by traditional mutations. Overall, our analysis shows that intent-based mutation testing can be a powerful complement to traditional (syntax-based) mutation testing.
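To make the approach concrete, below is a minimal Python sketch of the two-step pipeline the abstract describes (mutate the natural-language intent, then generate a mutant from the mutated intent). The call_llm hook, the prompts, and the toy_llm stand-in are illustrative assumptions, not the paper's actual tooling; toy_llm returns canned strings so the sketch runs offline.

from typing import Callable


def mutate_intent(call_llm: Callable[[str], str], intent: str) -> str:
    # Step 1: ask the LLM for a slightly different version of the intent,
    # e.g. an altered boundary condition or input assumption.
    return call_llm(
        "Rewrite the following programming intent so that it describes a "
        "slightly different behavior (for example, change an edge case):\n"
        + intent
    )


def intent_to_mutant(call_llm: Callable[[str], str], mutated_intent: str) -> str:
    # Step 2: ask the LLM to implement the mutated intent; the result is a
    # mutant whose behavior, not just its syntax, diverges from the original.
    return call_llm("Implement this intent as a Python function:\n" + mutated_intent)


def toy_llm(prompt: str) -> str:
    # Offline stand-in for a real LLM client (hypothetical); returns canned
    # answers so the example executes without network access.
    if prompt.startswith("Rewrite"):
        return ("Return the largest element among all but the last item "
                "of a non-empty list.")
    return "def largest(xs):\n    return max(xs[:-1])"


if __name__ == "__main__":
    intent = "Return the largest element of a non-empty list."
    mutated = mutate_intent(toy_llm, intent)
    print("Mutated intent:", mutated)
    print("Mutant:\n" + intent_to_mutant(toy_llm, mutated))

As in classical mutation analysis, the generated mutant is then executed against the test suite: a test that passes on the original program but fails on the mutant kills it.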
Research center :
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SerVal - Security, Reasoning & Validation
Disciplines :
Computer science
Author, co-author :
HAMIDI, Asma Sadjida ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SerVal
KHANFIR, Ahmed ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SerVal > Team Yves LE TRAON
PAPADAKIS, Michail ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SerVal
External co-authors :
yes
Language :
English
Title :
Intent-Based Mutation Testing: From Naturally Written Programming Intents to Mutants
Publication date :
31 March 2025
Event name :
2025 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW)
Event place :
Naples, Italy
Event date :
from 31-03-2025 to 04-04-2025
Main work title :
2025 IEEE International Conference on Software Testing, Verification and Validation Workshops, ICSTW 2025
Editor :
Fasolino, Anna Rita
Publisher :
Institute of Electrical and Electronics Engineers Inc.