Code search is an unavoidable activity in software development. Various approaches and techniques have been explored in the literature to support code search tasks. Most of these approaches focus on serving user queries provided as natural-language free-form input. However, there is a wide range of use-case scenarios where a code-to-code approach would be most beneficial. For example, research directions in code transplantation, code diversity, and patch recommendation can leverage a code-to-code search engine to find essential ingredients for their techniques. In this paper, we propose FaCoY, a novel approach for statically finding code fragments that may be semantically similar to user input code. FaCoY implements a query alternation strategy: instead of directly matching code query tokens with code in the search space, FaCoY first attempts to identify other tokens that may also be relevant in implementing the functional behavior of the input code. With various experiments, we show that (1) FaCoY is more effective than online code-to-code search engines; (2) FaCoY can detect more semantic code clones (i.e., Type-4) in BigCloneBench than the state of the art; (3) FaCoY, while static, can detect code fragments that are indeed similar with respect to runtime execution behavior; and (4) FaCoY can be useful in code/patch recommendation.
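The query alternation idea described in the abstract can be illustrated with a toy sketch. This is not FaCoY's implementation: the corpus, the alternation table, the function names, and the Jaccard scoring are all assumptions made for illustration only. The sketch shows how expanding a code query with related tokens can surface a fragment that shares no token with the original query.

```python
# Hypothetical toy corpus: fragment id -> set of code tokens found in it.
CORPUS = {
    "read_lines_bufferedreader": {"BufferedReader", "FileReader", "readLine", "close"},
    "read_lines_scanner": {"Scanner", "File", "hasNextLine", "nextLine", "close"},
    "sort_list": {"Collections", "sort", "List", "ArrayList"},
}

# Hypothetical alternation table: token -> other tokens that may implement
# similar behavior. How such a table is built is outside this sketch.
ALTERNATES = {
    "BufferedReader": {"Scanner", "File"},
    "readLine": {"nextLine", "hasNextLine"},
}


def alternate_query(tokens):
    """Return the query tokens plus any alternates listed for them."""
    expanded = set(tokens)
    for token in tokens:
        expanded |= ALTERNATES.get(token, set())
    return expanded


def search(tokens, corpus=CORPUS, use_alternation=True):
    """Rank fragments by Jaccard similarity to the (optionally expanded) query."""
    query = alternate_query(tokens) if use_alternation else set(tokens)
    scores = {
        frag_id: len(query & frag_tokens) / len(query | frag_tokens)
        for frag_id, frag_tokens in corpus.items()
    }
    # Highest score first; ties broken by fragment id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [frag_id for frag_id, score in ranked if score > 0]


if __name__ == "__main__":
    query = {"BufferedReader", "readLine"}
    print(search(query, use_alternation=False))
    print(search(query, use_alternation=True))
```

With direct matching, only the fragment that literally contains the query tokens is returned. With alternation, the `Scanner`-based fragment is also retrieved even though it shares no token with the original query.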
Disciplines:
Computer science
Author, co-author:
KIM, Kisub ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT)
KIM, Dongsun ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT)
KLEIN, Jacques ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > Computer Science and Communications Research Unit (CSC)
LE TRAON, Yves ; University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Computer Science and Communications Research Unit (CSC)
External co-authors:
yes
Document language:
English
Title:
FaCoY - A Code-to-Code Search Engine
Publication/release date:
27 May 2018
Event name:
40th International Conference on Software Engineering
Event location:
Gothenburg, Sweden
Event date:
27-05-2018 to 03-06-2018
Main work title:
International Conference on Software Engineering (ICSE 2018)