What Made This Test Flake? Pinpointing Classes Responsible for Test Flakiness

Flaky tests are tests that behave non-deterministically, passing and failing intermittently for the same version of the code. These tests cripple continuous integration with false alerts that waste developers' time and break their trust in regression testing. To mitigate the effects of flakiness, both researchers and industrial experts have proposed strategies and tools to detect and isolate flaky tests. However, flaky tests are rarely fixed, as developers struggle to localise and understand their causes. Additionally, developers working with large codebases often need to know the sources of non-determinism to preserve code quality, i.e. to avoid introducing technical debt linked with non-deterministic behaviour, and to avoid introducing new flaky tests. To aid with these tasks, we propose re-targeting Fault Localisation techniques to the flaky component localisation problem, i.e. pinpointing program classes that cause the non-deterministic behaviour of flaky tests. In particular, we employ Spectrum-Based Fault Localisation (SBFL), a coverage-based fault localisation technique commonly adopted for its simplicity and effectiveness. We also utilise other data sources, such as change history and static code metrics, to further improve the localisation. Our results show that augmenting SBFL with change and code metrics ranks flaky classes in the top-1 and top-5 suggestions in 26% and 47% of the cases, respectively. Overall, we successfully reduced the average number of classes inspected to locate the first flaky class to 19% of the total number of classes covered by flaky tests. Our results also show that localisation methods are effective in major flakiness categories, such as concurrency and asynchronous waits, indicating their general ability to identify flaky components.
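To make the idea of re-targeting SBFL concrete, the following is a minimal sketch, not the authors' implementation. It assumes that flaky tests take the role that failing tests play in classic SBFL, and it uses the Ochiai formula, a standard SBFL risk formula chosen here for illustration only (the abstract does not say which formulas the study uses). The function names `ochiai` and `rank_classes` and the toy coverage data are hypothetical.

```python
import math


def ochiai(ef, ep, nf):
    """Ochiai suspiciousness: ef / sqrt((ef + nf) * (ef + ep)).

    ef: flaky tests that cover the class
    ep: stable tests that cover the class
    nf: flaky tests that do not cover the class
    """
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0


def rank_classes(coverage, flaky_tests):
    """Rank classes by suspiciousness, most suspicious first.

    coverage: dict mapping test name -> set of covered class names
    flaky_tests: set of test names treated as the "failing" side
                 (an assumption of this sketch)
    """
    classes = set().union(*coverage.values())
    total_flaky = sum(1 for t in coverage if t in flaky_tests)
    scores = {}
    for cls in classes:
        ef = sum(1 for t, cov in coverage.items()
                 if cls in cov and t in flaky_tests)
        ep = sum(1 for t, cov in coverage.items()
                 if cls in cov and t not in flaky_tests)
        nf = total_flaky - ef
        scores[cls] = ochiai(ef, ep, nf)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy data: two flaky tests and two stable tests.
coverage = {
    "test_a": {"Scheduler", "Cache"},
    "test_b": {"Scheduler"},
    "test_c": {"Cache", "Parser"},
    "test_d": {"Parser"},
}
ranking = rank_classes(coverage, flaky_tests={"test_a", "test_b"})
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy example, the class covered only by flaky tests ranks first, which mirrors the intuition behind coverage-based localisation. The change-history and static code metrics mentioned in the abstract are not modelled here.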

Research center :

Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SerVal - Security, Reasoning & Validation

Disciplines :

Computer science

Author, co-author :

Habchi, Sarra

HABEN, Guillaume ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal

SOHN, Jeongju ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal

Franci, Adriano

PAPADAKIS, Mike ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)

CORDY, Maxime ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal

LE TRAON, Yves ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal

External co-authors :

yes

Language :

English

Title :

What Made This Test Flake? Pinpointing Classes Responsible for Test Flakiness

Publication date :

October 2022

Event name :

38th International Conference on Software Maintenance and Evolution

Event date :

from 02-10-2022 to 07-10-2022

Main work title :

What Made This Test Flake? Pinpointing Classes Responsible for Test Flakiness

Peer reviewed :

Peer reviewed

Available on ORBilu :

since 26 August 2023
