Paper published in a book (Scientific congresses, symposiums and conference proceedings)
FlakyCat: Predicting Flaky Tests Categories using Few-Shot Learning
AKLI, Amal; HABEN, Guillaume; Habchi, Sarra et al.
2023, in: FlakyCat: Predicting Flaky Tests Categories using Few-Shot Learning
Peer reviewed
 

Files

Full Text: FlakyCat.pdf (Author preprint, 488.31 kB)

Details



Keywords :
Software Testing; Flaky Tests; CodeBERT; Few-Shot learning; Siamese Networks
Abstract :
[en] Flaky tests are tests that yield different outcomes when run on the same version of a program. This non-deterministic behaviour plagues continuous integration with false signals, wasting developers' time and reducing their trust in test suites. Studies have highlighted the importance of keeping tests free of flakiness. Recently, the research community has been pushing towards the detection of flaky tests by proposing many static and dynamic approaches. While promising, those approaches mainly focus on classifying tests as flaky or not and, even when high performance is reported, it remains challenging to understand the cause of flakiness. This information is crucial for researchers and developers who aim to fix flakiness. To help with the comprehension of a given flaky test, we propose FlakyCat, the first approach for classifying flaky tests based on their root cause category. FlakyCat relies on CodeBERT for code representation and leverages Siamese networks to train a multi-class classifier. We train and evaluate FlakyCat on a set of 451 flaky tests collected from open-source Java projects. Our evaluation shows that FlakyCat categorises flaky tests accurately, with an F1 score of 73%. Furthermore, we investigate the performance of our approach for each category, revealing that Async waits, Unordered collections and Time-related flaky tests are accurately classified, while Concurrency-related flaky tests are more challenging to predict. Finally, to facilitate the comprehension of FlakyCat's predictions, we present a new technique for CodeBERT-based model interpretability that highlights the code statements influencing the categorisation.
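The abstract describes a pipeline that combines CodeBERT embeddings with a Siamese network trained for few-shot, multi-class categorisation. The following is a minimal illustrative sketch in Python (Hugging Face Transformers and PyTorch) of how such a pipeline could be wired together; it is not the authors' released implementation, and the projection sizes, triplet margin, and helper names (embed, SiameseHead, train_step, predict) are assumptions made for illustration.

# Sketch: CodeBERT embeddings + Siamese projection head trained with a triplet
# loss, followed by nearest-centroid classification over per-category support sets.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
codebert = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(test_code: str) -> torch.Tensor:
    """Return the CodeBERT [CLS] embedding (768-d) of a test method's source code."""
    inputs = tokenizer(test_code, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = codebert(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)

class SiameseHead(nn.Module):
    """Projection shared by all branches of the Siamese network (sizes are assumptions)."""
    def __init__(self, in_dim: int = 768, out_dim: int = 128):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(x), dim=-1)

head = SiameseHead()
triplet_loss = nn.TripletMarginLoss(margin=0.5)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

def train_step(anchor: torch.Tensor, positive: torch.Tensor, negative: torch.Tensor) -> float:
    """One step on batched (N, 768) CodeBERT embeddings: anchors and positives share a
    flakiness category, negatives come from a different category."""
    optimizer.zero_grad()
    loss = triplet_loss(head(anchor), head(positive), head(negative))
    loss.backward()
    optimizer.step()
    return loss.item()

def predict(test_code: str, centroids: dict) -> str:
    """Assign the category whose support-set centroid (a mean 768-d embedding) lies
    closest to the query in the learned Siamese space."""
    with torch.no_grad():
        query = head(embed(test_code))
        return min(centroids, key=lambda cat: torch.dist(query, head(centroids[cat])).item())

In this few-shot setting, the centroids dictionary would map each flakiness category (e.g. Async waits, Unordered collections, Time, Concurrency) to the mean CodeBERT embedding of its small support set; only the lightweight projection head is trained, which keeps the approach practical with a few hundred labelled flaky tests.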
Research center :
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SerVal - Security, Reasoning & Validation
Disciplines :
Computer science
Author, co-author :
AKLI, Amal ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal
HABEN, Guillaume ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal
Habchi, Sarra
PAPADAKIS, Mike ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Computer Science (DCS)
LE TRAON, Yves ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > SerVal
External co-authors :
yes
Language :
English
Title :
FlakyCat: Predicting Flaky Tests Categories using Few-Shot Learning
Publication date :
May 2023
Event name :
4th International Conference on Automation of Software Test
Event date :
from 15-05-2023 to 16-05-2023
Main work title :
FlakyCat: Predicting Flaky Tests Categories using Few-Shot Learning
Peer reviewed :
Peer reviewed
Focus Area :
Security, Reliability and Trust
Available on ORBilu :
since 26 August 2023

Statistics

Number of views : 68 (6 by Unilu)
Number of downloads : 108 (5 by Unilu)
Scopus citations® : 3
Scopus citations® without self-citations : 3
