B. Zolfaghari, R. M. Parizi, G. Srivastava, and Y. Hailemariam, "Root causing, detecting, and fixing flaky tests: State of the art and future roadmap," Softw.: Pract. Exp., vol. 51, no. 5, pp. 851-867, 2021.
Q. Luo, F. Hariri, L. Eloussi, and D. Marinov, "An empirical analysis of flaky tests," in Proc. 22nd ACM SIGSOFT Int. Symp. Found. Softw. Eng., 2014, pp. 643-653.
M. Eck, F. Palomba, M. Castelluccio, and A. Bacchelli, "Understanding flaky tests: The developer's perspective," in Proc. 27th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2019, pp. 830-840.
J. Bell, O. Legunsen, M. Hilton, L. Eloussi, T. Yung, and D. Marinov, "DeFlaker: Automatically detecting flaky tests," in Proc. IEEE/ACM 40th Int. Conf. Softw. Eng., 2018, pp. 433-444.
W. Lam, R. Oei, A. Shi, D. Marinov, and T. Xie, "iDFlakies: A framework for detecting and partially classifying flaky tests," in Proc. 12th IEEE Conf. Softw. Testing, Validation Verification, 2019, pp. 312-322.
J. Micco, "Advances in continuous integration testing at Google," 2018. [Online]. Available: https://research.google/pubs/pub46593
A. Alshammari, C. Morris, M. Hilton, and J. Bell, "FlakeFlagger: Predicting flakiness without rerunning tests," in Proc. IEEE/ACM 43rd Int. Conf. Softw. Eng., 2021, pp. 1572-1584.
G. Pinto, B. Miranda, S. Dissanayake, M. d'Amorim, C. Treude, and A. Bertolino, "What is the vocabulary of flaky tests?," in Proc. 17th Int. Conf. Mining Softw. Repositories, 2020, pp. 492-502.
B. Camara, M. Silva, A. Endo, and S. Vergilio, "On the use of test smells for prediction of flaky tests," in Proc. Braz. Symp. Systematic Autom. Softw. Testing, 2021, pp. 46-54.
Z. Feng et al., "CodeBERT: A pre-trained model for programming and natural languages," in Proc. Findings Assoc. Comput. Linguistics: Empir. Methods Natural Lang. Process., 2020, pp. 1536-1547.
V. Pontillo, F. Palomba, and F. Ferrucci, "Toward static test flakiness prediction: A feasibility study," in Proc. 5th Int. Workshop Mach. Learn. Techn. Softw. Qual. Evol., 2021, pp. 19-24.
C. Ziftci and D. Cavalcanti, "De-flake your tests: Automatically locating root causes of flaky tests in code at Google," in Proc. IEEE Int. Conf. Softw. Maintenance Evol., 2020, pp. 736-745.
W. Lam, P. Godefroid, S. Nath, A. Santhiar, and S. Thummalapenta, "Root causing flaky tests in a large-scale industrial setting," in Proc. 28th ACM SIGSOFT Int. Symp. Softw. Testing Anal., 2019, pp. 101-111.
T. Bach, A. Andrzejak, and R. Pannemans, "Coverage-based reduction of test execution time: Lessons from a very large industrial project," in Proc. IEEE Int. Conf. Softw. Testing, Verification Validation Workshops, 2017, pp. 3-12.
G. Haben, S. Habchi, M. Papadakis, M. Cordy, and Y. L. Traon, "A replication study on the usability of code vocabulary in predicting flaky tests," in Proc. IEEE/ACM 18th Int. Conf. Mining Softw. Repositories, 2021, pp. 219-229.
O. Parry, G. M. Kapfhammer, M. Hilton, and P. McMinn, "A survey of flaky tests," ACM Trans. Softw. Eng. Methodol., vol. 31, no. 1, pp. 1-74, 2021.
O. Parry, G. M. Kapfhammer, M. Hilton, and P. McMinn, "Surveying the developer experience of flaky tests," in Proc. IEEE/ACM Int. Conf. Softw. Eng.: Softw. Eng. Pract., 2022, pp. 253-262.
A. Van Deursen, L. Moonen, A. Van Den Bergh, and G. Kok, "Refactoring test code," in Proc. 2nd Int. Conf. Extreme Program. Flexible Processes Softw. Eng., 2001, pp. 92-95.
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., 2019, pp. 4171-4186.
M. E. Peters et al., "Deep contextualized word representations," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., 2018, pp. 2227-2237.
Z. Yang et al., "XLNet: Generalized autoregressive pretraining for language understanding," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 5753-5763.
Y. Liu et al., "RoBERTa: A robustly optimized BERT pretraining approach," 2019, arXiv:1907.11692.
C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, "VideoBERT: A joint model for video and language representation learning," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 7464-7473.
D. Nadeau and S. Sekine, "A survey of named entity recognition and classification," Lingvisticae Investigationes, vol. 30, no. 1, pp. 3-26, 2007.
N. Bach and S. Badaskar, "A review of relation extraction," Literature Rev. Lang. Statist., vol. II, no. 2, pp. 1-15, 2007.
H. Xu, B. Liu, L. Shu, and P. S. Yu, "BERT post-training for review reading comprehension and aspect-based sentiment analysis," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., 2019, pp. 2324-2335.
C. Sun, X. Qiu, Y. Xu, and X. Huang, "How to fine-tune BERT for text classification?," in Proc. China Nat. Conf. Chin. Comput. Linguistics, 2019, pp. 194-206.
A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998-6008.
D. Mandic and J. Chambers, Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability. Hoboken, NJ, USA: Wiley, 2001.
S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735-1780, 1997.
J. Keim, A. Kaplan, A. Koziolek, and M. Mirakhorli, "Does BERT understand code? An exploratory study on the detection of architectural tactics in code," in Proc. Eur. Conf. Softw. Archit., 2020, pp. 220-228.
D. Guo et al., "GraphCodeBERT: Pre-training code representations with data flow," in Proc. 9th Int. Conf. Learn. Representations, 2021, pp. 1-18.
A. Kanade, P. Maniatis, G. Balakrishnan, and K. Shi, "Learning and evaluating contextual embedding of source code," in Proc. Int. Conf. Mach. Learn., 2020, pp. 5110-5121.
X. Jiang, Z. Zheng, C. Lyu, L. Li, and L. Lyu, "TreeBERT: A tree-based pre-trained model for programming language," in Proc. 37th Conf. Uncertainty Artif. Intell., Proc. Mach. Learn. Res., vol. 161, pp. 54-63, 2021.
H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, "CodeSearchNet challenge: Evaluating the state of semantic code search," 2019, arXiv:1909.09436.
Y. Wu et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," 2016, arXiv:1609.08144.
K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, "ELECTRA: Pre-training text encoders as discriminators rather than generators," in Proc. 8th Int. Conf. Learn. Representations, 2020, pp. 1-18.
C. Pan, M. Lu, and B. Xu, "An empirical study on software defect prediction using CodeBERT model," Appl. Sci., vol. 11, no. 11, 2021, Art. no. 4793.
J. Wu, "Literature review on vulnerability detection using NLP technology," 2021, arXiv:2104.11230.
J. Howard and S. Ruder, "Universal language model fine-tuning for text classification," in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, 2018, pp. 328-339.
A. F. Agarap, "Deep learning using rectified linear units (ReLU)," 2018, arXiv:1803.08375.
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929-1958, 2014.
S. El Anigri, M. M. Himmi, and A. Mahmoudi, "How BERT's dropout fine-tuning affects text classification?," in Proc. Int. Conf. Bus. Intell., 2021, pp. 130-139.
Z. Yao, A. Gholami, S. Shen, M. Mustafa, K. Keutzer, and M. Mahoney, "ADAHESSIAN: An adaptive second order optimizer for machine learning," Proc. AAAI Conf. Artif. Intell., vol. 35, no. 12, pp. 10665-10673, 2021.
W. Aljedaani et al., "Test smell detection tools: A systematic mapping study," in Proc. Eval. Assessment Softw. Eng., 2021, pp. 170-180.
A. Peruma, K. Almalki, C. D. Newman, M. W. Mkaouer, A. Ouni, and F. Palomba, "TsDetect: An open source test smells detection tool," in Proc. 28th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2020, pp. 1650-1654.
T. Virgínio et al., "JNose: Java test smell detector," in Proc. 34th Braz. Symp. Softw. Eng., 2020, pp. 564-569.
R. E. Noonan, "An algorithm for generating abstract syntax trees," Comput. Lang., vol. 10, no. 3/4, pp. 225-236, 1985.
A. Panichella, S. Panichella, G. Fraser, A. A. Sawant, and V. J. Hellendoorn, "Revisiting test smells in automatically generated tests: Limitations, pitfalls, and opportunities," in Proc. IEEE Int. Conf. Softw. Maintenance Evol., 2020, pp. 523-533.
A. Wei, P. Yi, T. Xie, D. Marinov, and W. Lam, "Probabilistic and systematic coverage of consecutive test-method pairs for detecting order-dependent flaky tests," in Proc. Int. Conf. Tools Algorithms Construction Anal. Syst., 2021, pp. 270-287.
W. Lam, S. Winter, A. Wei, T. Xie, D. Marinov, and J. Bell, "A large-scale longitudinal study of flaky tests," Proc. ACM Program. Lang., vol. 4, no. OOPSLA, pp. 1-29, 2020.
W. Lam, S. Winter, A. Astorga, V. Stodden, and D. Marinov, "Understanding reproducibility and characteristics of flaky tests through test reruns in Java projects," in Proc. IEEE 31st Int. Symp. Softw. Rel. Eng., 2020, pp. 403-413.
W. Lam, A. Shi, R. Oei, S. Zhang, M. D. Ernst, and T. Xie, "Dependent-test-aware regression testing techniques," in Proc. 29th ACM SIGSOFT Int. Symp. Softw. Testing Anal., 2020, pp. 298-311.
A. Shi, W. Lam, R. Oei, T. Xie, and D. Marinov, "iFixFlakies: A framework for automatically fixing order-dependent flaky tests," in Proc. 27th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2019, pp. 545-555.
Flakify: A Black-Box, Language Model-based Predictor for Flaky Tests-Replication Package, 2022. [Online]. Available: https://doi.org/10.5281/zenodo.6994692
N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, pp. 321-357, 2002.
P. Branco, L. Torgo, and R. P. Ribeiro, "A survey of predictive modeling on imbalanced domains," ACM Comput. Surv., vol. 49, no. 2, pp. 1-50, 2016.
C. Goutte and E. Gaussier, "A probabilistic interpretation of precision, recall and F-score, with implication for evaluation," in Proc. Eur. Conf. Inf. Retrieval, 2005, pp. 345-359.
M. Raymond and F. Rousset, "An exact test for population differentiation," Evolution, vol. 49, pp. 1280-1283, 1995.
J. Micco, "Flaky tests at Google and how we mitigate them," 2016. [Online]. Available: https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html
A. Memon et al., "Taming Google-scale continuous testing," in Proc. IEEE/ACM 39th Int. Conf. Softw. Eng.: Softw. Eng. Pract. Track, 2017, pp. 233-242.
G. Melotti, C. Premebida, J. J. Bird, D. R. Faria, and N. Gonçalves, "Probabilistic object classification using CNN ML-MAP layers," 2020, arXiv:2005.14565.
C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, "On calibration of modern neural networks," in Proc. Int. Conf. Mach. Learn., 2017, pp. 1321-1330.
B. H. P. Camara, M. A. G. Silva, A. T. Endo, and S. R. Vergilio, "What is the vocabulary of flaky tests? An extended replication," in Proc. IEEE/ACM 29th Int. Conf. Program Comprehension, 2021, pp. 444-454.
A. Memon and J. Micco, "How flaky tests in continuous integration," 2016. [Online]. Available: https://www.youtube.com/watch?v=CrzpkF1-VsA
E. Kowalczyk, K. Nair, Z. Gao, L. Silberstein, T. Long, and A. Memon, "Modeling and ranking flaky tests at Apple," in Proc. IEEE/ACM 42nd Int. Conf. Softw. Eng.: Softw. Eng. Pract., 2020, pp. 110-119.
Identifying and analyzing flaky tests in Maven and Gradle builds, 2019. Accessed: Jan. 11, 2021. [Online]. Available: https://gradle.com/blog/flaky-tests
T. A. Ghaleb, D. A. da Costa, Y. Zou, and A. E. Hassan, "Studying the impact of noises in build breakage data," IEEE Trans. Softw. Eng., vol. 47, no. 9, pp. 1998-2011, Sep. 2021.
J. Lampel, S. Just, S. Apel, and A. Zeller, "When life gives you oranges: Detecting and diagnosing intermittent job failures at Mozilla," in Proc. 29th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2021, pp. 1381-1392.