Patch generation; Program repair; Efficiency; Empirical assessment
Abstract :
[en] Test-based automated program repair has been a prolific field of research in software engineering in the last decade. Many approaches have indeed been proposed, which leverage test suites as a weak, but affordable, approximation to program specifications. Although the literature regularly sets new records on the number of benchmark bugs that can be fixed, several studies increasingly raise concerns about the limitations and biases of state-of-the-art approaches. For example, the correctness of generated patches has been questioned in a number of studies, while other researchers pointed out that evaluation schemes may be misleading with respect to the processing of fault localization results. Nevertheless, there is little work addressing the efficiency of patch generation, with regard to the practicality of program repair. In this paper, we fill this gap in the literature, by providing an extensive review on the efficiency of test suite based program repair. Our objective is to assess the number of generated patch candidates, since this information is correlated to (1) the strategy to traverse the search space efficiently in order to select sensical repair attempts, (2) the strategy to minimize the test effort for identifying a plausible patch, (3) as well as the strategy to prioritize the generation of a correct patch. To that end, we perform a large-scale empirical study on the efficiency, in terms of quantity of generated patch candidates of the 16 open-source repair tools for Java programs. The experiments are carefully conducted under the same fault localization configurations to limit biases. Eventually, among other findings, we note that: (1) many irrelevant patch candidates are generated by changing wrong code locations; (2) however, if the search space is carefully triaged, fault localization noise has little impact on patch generation efficiency; (3) yet, current template-based repair systems, which are known to be most effective in fixing a large number of bugs, are actually least efficient as they tend to generate majoritarily irrelevant patch candidates.
Research center :
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > Security Design and Validation Research Group (SerVal)
Disciplines :
Computer science
Author, co-author :
Liu, Kui ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT)
Wang, Shangwen
Koyuncu, Anil ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT)
Kim, Kisub ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT)
Klein, Jacques ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > Computer Science and Communications Research Unit (CSC)
Mao, Xiaoguang
Le Traon, Yves ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > Computer Science and Communications Research Unit (CSC)
External co-authors :
yes
Language :
English
Title :
On the Efficiency of Test Suite based Program Repair: A Systematic Assessment of 16 Automated Repair Systems for Java Programs
Publication date :
May 2020
Event name :
42nd ACM/IEEE International Conference on Software Engineering
Event date :
from 5-23-2020 to 5-29-2020
Audience :
International
Main work title :
42nd ACM/IEEE International Conference on Software Engineering (ICSE)
Peer reviewed :
Peer reviewed
Focus Area :
Security, Reliability and Trust
FnR Project :
FNR10449467 - Automatic Bug Fix Recommendation: Improving Software Repair And Reducing Time-to-fix Delays In Software Development Projects, 2015 (01/02/2016-31/01/2019) - Tegawendé François D'assise Bissyandé
Rui Abreu, Peter Zoeteweij, and Arjan JC Van Gemund. 2007. On the accuracy of spectrum-based fault localization. In Testing: Academic and Industrial Conference Practice and Research Techniques-MUTATION (TAICPART-MUTATION). IEEE, 89-98.
Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018. A Survey of Machine Learning for Big Code and Naturalness. Comput. Surveys 51, 4 (2018), 81:1-81:37. https://doi. org/10. 1145/3212695
Tien-Duy B. Le, David Lo, Claire Le Goues, and Lars Grunske. 2016. A Learning-to-Rank Based Fault Localization Approach Using Likely Invariants. In Proceedings of the 25th International Symposium on Software Testing and Analysis. ACM, 177-188. https://doi. org/10. 1145/2931037. 2931049
Liushan Chen, Yu Pei, and Carlo A Furia. 2017. Contract-based program repair without the contracts. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE, 637-647.
Vidroha Debroy andWEricWong. 2010. Using mutation to automatically suggest fixes for faulty programs. In Proceedings of the 3rd International Conference on Software Testing, Verification and Validation. IEEE, 65-74.
Thomas Durieux, Benoit Cornu, Lionel Seinturier, and Martin Monperrus. 2017. Dynamic patch generation for null pointer exceptions using metaprogramming. In Proceedings of the 24th International Conference on Software Analysis, Evolution and Reengineering. IEEE, 349-358.
Thomas Durieux, Fernanda Madeiral, Matias Martinez, and Rui Abreu. 2019. Empirical Review of Java Program Repair Tools: A Large-Scale Experiment on 2, 141 Bugs and 23, 551 Repair Attempts. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 302-313. https://doi. org/10. 1145/ 3338906. 3338911
Thomas Durieux and Martin Monperrus. 2016. Dynamoth: dynamic code synthesis for automatic program repair. In Proceedings of the 11th IEEE/ACM International Workshop in Automation of Software Test. IEEE, 85-91.
Michael Frigge, David C. Hoaglin, and Boris Iglewicz. 1989. Some implementations of the boxplot. The American Statistician 43, 1 (1989), 50-54.
Zachary P. Fry, Bryan Landau, and Westley Weimer. 2012. A Human Study of Patch Maintainability. In Proceedings of the 21st International Symposium on Software Testing and Analysis. ACM, 177-187. https://doi. org/10. 1145/04000800. 2336775
Luca Gazzola, Daniela Micucci, and Leonardo Mariani. 2017. Automatic software repair: A survey. IEEE Transactions on Software Engineering 45, 1 (2017), 34-67.
Ali Ghanbari, Samuel Benton, and Lingming Zhang. 2019. Practical program repair via bytecode mutation. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, 19-30.
Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. 2017. Deepfix: Fixing common c language errors by deep learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. AAAI, 1345-1351.
Jinru Hua, Mengshi Zhang, Kaiyuan Wang, and Sarfraz Khurshid. 2018. Towards practical program repair with on-demand candidate generation. In Proceedings of the 40th International Conference on Software Engineering. ACM, 12-23.
Jiajun Jiang, Yingfei Xiong, Hongyu Zhang, Qing Gao, and Xiangqun Chen. 2018. Shaping program repair space with existing patches and similar code. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, 298-309.
René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 23rd International Symposium on Software Testing and Analysis. ACM, 437-440.
Dongsun Kim, Jaechang Nam, Jaewoo Song, and Sunghun Kim. 2013. Automatic patch generation learned from human-written patches. In Proceedings of the 35th International Conference on Software Engineering. IEEE, 802-811.
Kisub Kim, Dongsun Kim, Tegawendé F. Bissyandé, Eunjong Choi, Li Li, Jacques Klein, and Yves Le Traon. 2018. FaCoY: a code-to-code search engine. In Proceedings of the 40th International Conference on Software Engineering. ACM, 946-957.
Anil Koyuncu, Kui Liu, Tegawendé F. Bissyandé, Dongsun Kim, Jacques Klein, Martin Monperrus, and Yves Le Traon. 2018. Fixminer: Mining relevant fix patterns for automated program repair. arXiv preprint arXiv:1810. 01791 (2018).
Xuan-Bach D. Le, Lingfeng Bao, David Lo, Xin Xia, Shanping Li, and Corina Pasareanu. 2019. On reliability of patch correctness assessment. In Proceedings of the 41st International Conference on Software Engineering. IEEE, 524-535.
Xuan-Bach D. Le, Duc-Hiep Chu, David Lo, Claire Le Goues, and Willem Visser. 2017. S3: syntax-and semantic-guided repair synthesis via programming by examples. In Proceedings of the 11th Joint Meeting on Foundations of Software Engineering. ACM, 593-604.
Xuan Bach D. Le, David Lo, and Claire Le Goues. 2016. History driven program repair. In Proceedings of the 23rd IEEE International Conference on Software Analysis, Evolution, and Reengineering. IEEE, 213-224.
Xuan Bach D. Le, Ferdian Thung, David Lo, and Claire Le Goues. 2018. Overfitting in semantics-based automated program repair. Empirical Software Engineering 23, 5 (2018), 3007-3033.
Claire Le Goues, Michael Dewey-Vogt, Stephanie Forrest, and Westley Weimer. 2012. A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each. In Proceedings of the 34th International Conference on Software Engineering. IEEE, 3-13.
Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, andWestleyWeimer. 2012. GenProg: A generic method for automatic software repair. IEEE Transactions on Software Engineering 38, 1 (2012), 54-72.
Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. 2019. Automated Program Repair. Commun. ACM 62, 12 (2019), 56-65. https://doi. org/10. 1145/ 3318162
Derrick Lin, James Koppel, Angela Chen, and Armando Solar-Lezama. 2017. QuixBugs: A multi-lingual program repair benchmark set based on the Quixey Challenge. In Proceedings Companion of the 32nd ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity. ACM, 55-56.
Kui Liu, Dongsun Kim, Tegawendé F Bissyandé, Shin Yoo, and Yves Le Traon. 2018. Mining fix patterns for findbugs violations. IEEE Transactions on Software Engineering (2018).
Kui Liu, Anil Koyuncu, Tegawendé F. Bissyandé, Dongsun Kim, Jacques. Klein, and Yves Le Traon. 2019. You Cannot Fix What You Cannot Find! An Investigation of Fault Localization Bias in Benchmarking Automated Program Repair Systems. In Proceedings of the 12th IEEE Conference on Software Testing, Validation and Verification. 102-113. https://doi. org/10. 1109/ICST. 2019. 00020
Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F Bissyandé. 2019. Avatar: Fixing semantic bugs with fix patterns of static analysis violations. In Proceedings of the 26th IEEE International Conference on Software Analysis, Evolution and Reengineering. IEEE, 456-467.
Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F. Bissyandé. 2019. TBar: Revisiting Template-based Automated Program Repair. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, 31-42.
Kui Liu, Anil Koyuncu, Kisub Kim, Dongsun Kim, and Tegawendé F. Bissyandé. 2018. LSRepair: Live Search of Fix Ingredients for Automated Program Repair. In Proceedings of the 25th Asia-Pacific Software Engineering Conference. 658-662. https://doi. org/10. 1109/APSEC. 2018. 00085
Xuliang Liu and Hao Zhong. 2018. Mining stackoverflow for program repair. In Proceedings of the 25th IEEE International Conference on Software Analysis, Evolution and Reengineering. IEEE, 118-129.
Fan Long and Martin Rinard. 2016. An Analysis of the Search Spaces for Generate and Validate Patch Generation Systems. In Proceedings of the 38th International Conference on Software Engineering. ACM, 702-713. https://doi. org/10. 1145/ 2884781. 2884872
Fan Long and Martin Rinard. 2016. Automatic Patch Generation by Learning Correct Code. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. ACM, 298-312. https: //doi. org/10. 1145/2837614. 2837617
Fernanda Madeiral, Simon Urli, Marcelo Maia, and Martin Monperrus. 2019. Bears: An Extensible Java Bug Benchmark for Automatic Program Repair Studies. In Proceedings of the 26th IEEE International Conference on Software Analysis, Evolution and Reengineering. IEEE, 468-478.
Henry B Mann and Donald R. Whitney. 1947. On a Test of Whether One of Two Random Variables Is Stochastically Larger than the Other. The Annals of Mathematical Statistics 18, 1 (1947), 50-60. https://doi. org/10. 1214/aoms/ 1177730491
Matias Martinez and Martin Monperrus. 2016. Astor: A program repair library for Java. In Proceedings of the 25th International Symposium on Software Testing and Analysis. ACM, 441-444.
Matias Martinez and Martin Monperrus. 2018. Ultra-Large Repair Search Space with Automatically Mined Templates: the Cardumen Mode of Astor. In Proceedings of the 10th International Symposium on Search Based Software Engineering. Springer, 65-86.
Martin Monperrus. 2014. A critical review of automatic patch generation learned from human-written patches: essay on the problem statement and the evaluation of automatic software repair. In Proceedings of the 36th International Conference on Software Engineering. ACM, 234-242.
Martin Monperrus. 2018. Automatic software repair: A bibliography. Comput. Surveys 51, 1 (2018), 17:1-17:24.
Martin Monperrus. 2018. The Living Review on Automated Program Repair. Technical Report. Technical Report hal-01956501. HAL/archives-ouvertes. fr, HAL/archives....
Manish Motwani, Sandhya Sankaranarayanan, René Just, and Yuriy Brun. 2018. Do automated program repair techniques repair hard and important bugs? Empirical Software Engineering 23, 5 (2018), 2901-2947.
Hoang Duong Thien Nguyen, Dawei Qi, Abhik Roychoudhury, and Satish Chandra. 2013. Semfix: Program repair via semantic analysis. In Proceedings of the 35th International Conference on Software Engineering. IEEE, 772-781.
Spencer Pearson, José Campos, René Just, Gordon Fraser, Rui Abreu, Michael D. Ernst, Deric Pang, and Benjamin Keller. 2017. Evaluating and Improving Fault Localization. In Proceedings of the 39th International Conference on Software Engineering. IEEE, 609-620. https://doi. org/10. 1109/ICSE. 2017. 62
Yuhua Qi, Xiaoguang Mao, Yan Lei, Ziying Dai, and Chengsong Wang. 2014. The strength of random search on automated program repair. In Proceedings of the 36th International Conference on Software Engineering. ACM, 254-265.
Yuhua Qi, Xiaoguang Mao, Yan Lei, and ChengsongWang. 2013. Using automated program repair for evaluating the effectiveness of fault localization techniques. In Proceedings of the 22nd International Symposium on Software Testing and Analysis. ACM, 191-201.
Zichao Qi, Fan Long, Sara Achour, and Martin Rinard. 2015. An analysis of patch plausibility and correctness for generate-and-validate patch generation systems. In Proceedings of the 24th International Symposium on Software Testing and Analysis. ACM, 24-36.
Ripon Saha, Yingjun Lyu, Wing Lam, Hiroaki Yoshida, and Mukul Prasad. 2018. Bugs. jar: A large-scale, diverse dataset of real-world Java bugs. In Proceedings of the 15th IEEE/ACM International Conference on Mining Software Repositories. IEEE, 10-13.
Ripon K. Saha, Yingjun Lyu, Hiroaki Yoshida, and Mukul R. Prasad. 2017. Elixir: Effective object-oriented program repair. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE, 648-659.
Seemanta Saha, Ripon K Saha, and Mukul R Prasad. 2019. Harnessing evolution for multi-hunk program repair. In Proceedings of the 41st International Conference on Software Engineering. IEEE, 13-24.
Edward K Smith, Earl T. Barr, Claire Le Goues, and Yuriy Brun. 2015. Is the cure worse than the disease? overfitting in automated program repair. In Proceedings of the 10th Joint Meeting on Foundations of Software Engineering. ACM, 532-543.
Victor Sobreira, Thomas Durieux, Fernanda Madeiral, Martin Monperrus, and Marcelo de Almeida Maia. 2018. Dissection of a bug dataset: Anatomy of 395 patches from Defects4J. In Proceedings of the 25th International Conference on Software Analysis, Evolution and Reengineering. IEEE, 130-140.
ShangwenWang, MingWen, Liqian Chen, Xin Yi, and Xiaoguang Mao. 2019. How Different Is It Between Machine-Generated and Developer-Provided Patches?: An Empirical Study on the Correct Patches Generated by Automated Program Repair Techniques. In Proceedings of the 13rd ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. IEEE, 1-12.
ShangwenWang, MingWen, Xiaoguang Mao, and Deheng Yang. 2019. Attention please: Consider Mockito when evaluating newly proposed automated program repair techniques. In Proceedings of the 23rd Evaluation and Assessment on Software Engineering. ACM, 260-266.
WestleyWeimer, ThanhVu Nguyen, Claire Le Goues, and Stephanie Forrest. 2009. Automatically finding patches using genetic programming. In Proceedings of the 31st International Conference on Software Engineering. IEEE, 364-374.
Ming Wen, Junjie Chen, Rongxin Wu, Dan Hao, and Shing-Chi Cheung. 2018. Context-aware patch generation for better automated program repair. In Proceedings of the 40th International Conference on Software Engineering. ACM, 1-11.
Ming Wen, Rongxin Wu, Yepang Liu, Yongqiang Tian, Xuan Xie, Shing-Chi Cheung, and Zhendong Su. 2019. Exploring and Exploiting the Correlations Between Bug-Inducing and Bug-Fixing Commits. In Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 326-337. https://doi. org/10. 1145/ 3338906. 3338962
Martin White, Michele Tufano, Matias Martinez, Martin Monperrus, and Denys Poshyvanyk. 2019. Sorting and transforming program repair ingredients via deep learning code similarities. In Proceedings of the IEEE 26th International Conference on Software Analysis, Evolution and Reengineering. IEEE, 479-490.
F. Wilcoxon. 1945. Individual Comparisons by Ranking Methods. Biometrics Bulletin 1, 6 (1945), 80-83.
Qi Xin and Steven P. Reiss. 2017. Leveraging syntax-related code for automated program repair. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. IEEE, 660-670.
Yingfei Xiong, Xinyuan Liu, Muhan Zeng, Lu Zhang, and Gang Huang. 2018. Identifying patch correctness in test-based program repair. In Proceedings of the 40th International Conference on Software Engineering. ACM, 789-799.
Yingfei Xiong, Jie Wang, Runfa Yan, Jiachen Zhang, Shi Han, Gang Huang, and Lu Zhang. 2017. Precise condition synthesis for program repair. In Proceedings of the 39th IEEE/ACM International Conference on Software Engineering. IEEE, 416-426.
Jifeng Xuan, Matias Martinez, Favio Demarco, Maxime Clement, Sebastian Lamelas Marcote, Thomas Durieux, Daniel Le Berre, and Martin Monperrus. 2017. Nopol: Automatic repair of conditional statement bugs in Java programs. IEEE Transactions on Software Engineering 43, 1 (2017), 34-55.
Jinqiu Yang, Alexey Zhikhartsev, Yuefei Liu, and Lin Tan. 2017. Better test cases for better automated program repair. In Proceedings of the 11th Joint Meeting on Foundations of Software Engineering. ACM, 831-841.
Jooyong Yi, Shin Hwei Tan, Sergey Mechtaev, Marcel Böhme, and Abhik Roychoudhury. 2018. A correlation study between automated program repair and test-suite metrics. Empirical Software Engineering 23, 5 (2018), 2948-2979.
Yuan Yuan and Wolfgang Banzhaf. 2018. ARJA: Automated Repair of Java Programs via Multi-Objective Genetic Programming. IEEE Transactions on Software Engineering (2018).