Keywords:
Code search; data fusion; information retrieval; Basic hypothesis; Code; Different mechanisms; Intrinsic differences; Pre-training; Query provides; Relevance score; Search technique; State of the art; Software; Codes
Abstract:
Code search, the task of retrieving relevant code snippets from a codebase for a given query, provides developers with useful references during software development. Over the years, techniques have been proposed that compute the relevance score between a query and a code snippet through different mechanisms, including information retrieval, supervised learning, and pre-training, each advancing the state of the art. Nevertheless, the usefulness of existing techniques remains limited because they cannot effectively handle the full diversity of queries and code encountered in practice. To tackle this challenge, we present Dancer, a data-fusion-based code searcher. Our intuition, which is also the basic hypothesis of this study, is that existing techniques may complement each other because of the intrinsic differences in their working mechanisms. We validated this hypothesis in an exploratory study. Based on that, we propose fusing the results generated by different code search techniques so that the advantages of each standalone technique can be fully leveraged. Specifically, we treat each technique as a retrieval system and use well-known data fusion approaches to aggregate the results from the different systems. We evaluate six existing code search techniques on two large-scale datasets and apply eight classic data fusion approaches to combine their results. Our experiments show that the best fusion approach outperforms the standalone techniques by 35%-550% and 65%-825% in terms of MRR (mean reciprocal rank) on the two datasets, respectively.
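To make the fusion idea concrete, the following is a minimal sketch, not the Dancer implementation. The abstract does not name the eight fusion approaches, so the sketch assumes CombSUM with min-max score normalization, one classic data fusion method, and pairs it with an MRR computation. All function names, snippet ids, and scores are hypothetical.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Rescale one system's relevance scores to [0, 1] (assumes a non-empty dict)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates tie; give each the same weight.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(system_scores):
    """Fuse several systems by summing their normalized scores (CombSUM).

    system_scores: list of dicts mapping snippet id -> raw relevance score.
    Returns snippet ids ranked from most to least relevant.
    """
    fused = defaultdict(float)
    for scores in system_scores:
        for doc, s in min_max_normalize(scores).items():
            fused[doc] += s
    # Sort by fused score (descending); break ties by id for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))


def mean_reciprocal_rank(rankings, relevant):
    """MRR over queries; a query whose relevant snippet is missing contributes 0."""
    total = 0.0
    for query, ranked in rankings.items():
        if relevant[query] in ranked:
            total += 1.0 / (ranked.index(relevant[query]) + 1)
    return total / len(rankings)


# Hypothetical scores from two search techniques for a single query.
ir_scores = {"snippet_a": 12.0, "snippet_b": 7.0, "snippet_c": 2.0}
nn_scores = {"snippet_b": 0.9, "snippet_c": 0.8, "snippet_a": 0.1}

fused_ranking = comb_sum([ir_scores, nn_scores])
```

In this toy case the first technique ranks the relevant `snippet_b` second and the other ranks it first; after fusion it is ranked first, which illustrates how techniques with different scoring mechanisms can complement each other.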
Research center:
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > TruX - Trustworthy Software Engineering
Disciplines:
Computer science
Author, co-author:
Wang, Shangwen ; National University of Defense Technology, Changsha, China
Geng, Mingyang ; National University of Defense Technology, Changsha, China
Lin, Bo ; National University of Defense Technology, Changsha, China
Sun, Zhensu ; ShanghaiTech University, Shanghai, China
Wen, Ming ; Huazhong University of Science and Technology, Wuhan, China
Liu, Yepang ; Southern University of Science and Technology, Shenzhen, China