[en] Code search is a vital activity in software engineering: given a natural-language query, it aims to identify and retrieve the code snippets that best match it. Deep learning approaches have been increasingly adopted for this task, enriching the representations of both code and its natural-language descriptions. Despite this progress, ensuring consistency between the representation spaces of code and its descriptions remains an unexplored gap. Furthermore, existing methods have not fully exploited the relevance between code snippets and their descriptions, which makes it difficult to discern fine-grained semantic distinctions among similar code snippets. To address these challenges, we introduce HedgeCode, a multi-task hedging contrastive learning framework for code search. HedgeCode is trained in two phases. The first, the representation alignment stage, proposes a hedging contrastive learning approach that detects subtle differences between code and natural-language text, aligning their representation spaces by identifying relevance. The second phase performs multi-task joint learning, with the previously trained model serving as the encoder; it optimizes the model through a combination of supervised and self-supervised contrastive learning tasks. Experiments on the CodeSearchNet benchmark demonstrate HedgeCode's effectiveness and its ability to address the above limitations in code search tasks.
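The record gives no implementation details for HedgeCode itself, but the alignment stage described above belongs to the general family of bi-encoder contrastive objectives over paired code/description embeddings. The sketch below illustrates only that generic idea, a symmetric in-batch InfoNCE loss in PyTorch; the function name, temperature value, and bi-encoder setup are our assumptions, not HedgeCode's actual hedging loss.

```python
import torch
import torch.nn.functional as F


def info_nce_alignment_loss(code_emb: torch.Tensor,
                            desc_emb: torch.Tensor,
                            temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss that pulls each code snippet toward its
    paired description and pushes it away from the other descriptions.

    code_emb, desc_emb: (batch, dim) outputs of a bi-encoder for paired
    code snippets and natural-language descriptions.
    """
    # Cosine similarities between every code/description pair in the batch.
    code_emb = F.normalize(code_emb, dim=-1)
    desc_emb = F.normalize(desc_emb, dim=-1)
    logits = code_emb @ desc_emb.t() / temperature  # (batch, batch)

    # The i-th code matches the i-th description, so diagonal entries
    # are the positives; all off-diagonal pairs act as in-batch negatives.
    targets = torch.arange(code_emb.size(0), device=code_emb.device)

    # Symmetric loss: code-to-description and description-to-code retrieval.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

HedgeCode's hedging variant, and the supervised and self-supervised contrastive tasks of its second stage, would refine or extend this vanilla objective; see the paper for the actual formulation.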
Disciplines :
Computer science
Author, co-author :
Chen, Gong; Wuhan University, School of Computer Science, China
Xie, Xiaoyuan; Wuhan University, School of Computer Science, China
Tang, Xunzhu; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SNT) > TruX
Xin, Qi; Wuhan University, School of Computer Science, China
Liu, Wenjie; Wuhan University, School of Computer Science, China
External co-authors :
yes
Language :
English
Title :
HedgeCode: A Multi-Task Hedging Contrastive Learning Framework for Code Search
Publication date :
2025
Event name :
2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE)
Event place :
Ottawa, Canada
Event date :
27-04-2025 to 03-05-2025
By request :
Yes
Main work title :
Proceedings - 2025 IEEE/ACM 47th International Conference on Software Engineering, ICSE 2025
This work was supported by the National Natural Science Foundation of China (Grant No. 62250610224) and by the NATURAL project, which has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Grant No. 949014).
References :
K. Kim, S. Ghatpande, D. Kim, X. Zhou, K. Liu, T. F. Bissyandé, J. Klein, and Y. Le Traon, "Big code search: A bibliography," ACM Computing Surveys, vol. 56, no. 1, August 2023.
Y. Xie, J. Lin, H. Dong, L. Zhang, and Z. Wu, "Survey of code search based on deep learning," ACM Transactions on Software Engineering and Methodology, vol. 33, no. 2, pp. 1-42, December 2023.
J. Shuai, L. Xu, C. Liu, M. Yan, X. Xia, and Y. Lei, "Improving code search with co-attentive representation learning," in Proceedings of the 28th International Conference on Program Comprehension. New York, NY, USA: Association for Computing Machinery, 2020, pp. 196-207.
J. Li, F. Liu, J. Li, Y. Zhao, G. Li, and Z. Jin, "MCodeSearcher: Multi-view contrastive learning for code search," in Proceedings of the 14th Asia-Pacific Symposium on Internetware. New York, NY, USA: Association for Computing Machinery, October 2023, pp. 270-280.
Z. Li, G. Yin, T. Wang, Y. Zhang, Y. Yu, and H. Wang, "Correlation-based software search by leveraging software term database," Frontiers of Computer Science, vol. 12, no. 5, pp. 923-938, October 2018.
X. Gu, H. Zhang, and S. Kim, "Deep code search," in Proceedings of the 40th International Conference on Software Engineering. New York, NY, USA: Association for Computing Machinery, 2018, pp. 933-944.
C. Liu, X. Xia, D. Lo, C. Gao, X. Yang, and J. Grundy, "Opportunities and challenges in code search tools," ACM Computing Surveys, vol. 54, no. 9, October 2021.
L. Di Grazia and M. Pradel, "Code search: A survey of techniques for finding code," ACM Computing Surveys, vol. 55, no. 11, February 2023.
Y. Hu, H. Jiang, and Z. Hu, "Measuring code maintainability with deep neural networks," Frontiers of Computer Science, vol. 17, no. 6, p. 176214, January 2023.
C. McMillan, M. Grechanik, D. Poshyvanyk, Q. Xie, and C. Fu, "Portfolio: Finding relevant functions and their usage," in 2011 33rd International Conference on Software Engineering, 2011, pp. 111-120.
F. Lv, H. Zhang, J.-g. Lou, S. Wang, D. Zhang, and J. Zhao, "CodeHow: Effective code search based on API understanding and extended Boolean model," in 2015 30th IEEE/ACM International Conference on Automated Software Engineering, 2015, pp. 260-270.
E. Linstead, S. Bajracharya, T. Ngo, P. Rigor, C. Lopes, and P. Baldi, "Sourcerer: Mining and searching internet-scale software repositories," Data Mining and Knowledge Discovery, vol. 18, no. 2, pp. 300-336, April 2009.
J. Cambronero, H. Li, S. Kim, K. Sen, and S. Chandra, "When deep learning met code search," in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York, NY, USA: Association for Computing Machinery, 2019, pp. 964-974.
L. Xu, H. Yang, C. Liu, J. Shuai, M. Yan, Y. Lei, and Z. Xu, "Two-stage attention-based model for code search with textual and structural features," in 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering, 2021, pp. 342-353.
Y. Chai, H. Zhang, B. Shen, and X. Gu, "Cross-domain deep code search with meta learning," in Proceedings of the 44th International Conference on Software Engineering. New York, NY, USA: Association for Computing Machinery, 2022, pp. 487-498.
Y. Cheng and L. Kuang, "CSRS: Code search with relevance matching and semantic matching," in 2022 IEEE/ACM 30th International Conference on Program Comprehension, 2022, pp. 533-542.
W. Sun, C. Fang, Y. Chen, G. Tao, T. Han, and Q. Zhang, "Code search based on context-aware code translation," in Proceedings of the 44th International Conference on Software Engineering. New York, NY, USA: Association for Computing Machinery, July 2022, pp. 388-400.
Y. Shi, Y. Yin, Z. Wang, D. Lo, T. Zhang, X. Xia, Y. Zhao, and B. Xu, "How to better utilize code graphs in semantic code search?" in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York, NY, USA: Association for Computing Machinery, 2022, pp. 722-733.
S. Fang, Y.-S. Tan, T. Zhang, and Y. Liu, "Self-attention networks for code search," Information and Software Technology, vol. 134, p. 106542, 2021.
Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou, "CodeBERT: A pre-trained model for programming and natural languages," in Findings of the Association for Computational Linguistics. Association for Computational Linguistics, November 2020, pp. 1536-1547.
D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou, N. Duan, A. Svyatkovskiy, S. Fu, M. Tufano, S. K. Deng, C. Clement, D. Drain, N. Sundaresan, J. Yin, D. Jiang, and M. Zhou, "GraphCodeBERT: Pre-training code representations with data flow," in International Conference on Learning Representations, 2021.
Y. Wang, W. Wang, S. Joty, and S. C. Hoi, "CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation," in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, November 2021, pp. 8696-8708.
D. Guo, S. Lu, N. Duan, Y. Wang, M. Zhou, and J. Yin, "UniXcoder: Unified cross-modal pre-training for code representation," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, vol. 1. Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 7212-7225.
E. Shi, Y. Wang, W. Gu, L. Du, H. Zhang, S. Han, D. Zhang, and H. Sun, "CoCoSoDa: Effective contrastive learning for code search," in Proceedings of the 45th International Conference on Software Engineering. IEEE Press, 2023, pp. 2198-2210.
T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, "A simple framework for contrastive learning of visual representations," in Proceedings of the 37th International Conference on Machine Learning, vol. 119. PMLR, July 2020, pp. 1597-1607.
K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9726-9735.
Y. Yan, R. Li, S. Wang, F. Zhang, W. Wu, and W. Xu, "ConSERT: A contrastive framework for self-supervised sentence representation transfer," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, vol. 1. Association for Computational Linguistics, August 2021, pp. 5065-5075.
X. Li, Y. Gong, Y. Shen, X. Qiu, H. Zhang, B. Yao, W. Qi, D. Jiang, W. Chen, and N. Duan, "CodeRetriever: A large scale contrastive pre-training method for code search," in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, December 2022, pp. 2898-2910.
Y. Ding, L. Buratti, S. Pujar, A. Morari, B. Ray, and S. Chakraborty, "Contrastive learning for source code with structural and functional properties," CoRR, vol. abs/2110.03868, 2021.
N. D. Q. Bui, Y. Yu, and L. Jiang, "Self-supervised contrastive learning for code retrieval and summarization via semantic-preserving transformations," in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: Association for Computing Machinery, 2021, pp. 511-521.
X. Wang, Y. Wang, F. Mi, P. Zhou, Y. Wan, X. Liu, L. Li, H. Wu, J. Liu, and X. Jiang, "SynCoBERT: Syntax-guided multi-modal contrastive pre-training for code representation," CoRR, vol. abs/2108.04556, 2021.
Y. Wang, H. Le, A. Gotmare, N. Bui, J. Li, and S. Hoi, "CodeT5+: Open code large language models for code understanding and generation," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, December 2023, pp. 1069-1088.
P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, "Supervised contrastive learning," in Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc., December 2020, pp. 18661-18673.
L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. 11, 2008.
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1. Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 4171-4186.
Q. Chen, R. Zhang, Y. Zheng, and Y. Mao, "Dual contrastive learning: Text classification via label-aware data augmentation," CoRR, vol. abs/2201.08702, 2022.
R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio, "Learning deep representations by mutual information estimation and maximization," in International Conference on Learning Representations, 2019.
H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, "CodeSearchNet challenge: Evaluating the state of semantic code search," CoRR, vol. abs/1909.09436, 2019.
Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," CoRR, vol. abs/1907.11692, 2019.
T. Liu, A. W. Moore, and A. Gray, "New algorithms for efficient high-dimensional nonparametric classification," Journal of Machine Learning Research, vol. 7, pp. 1135-1158, December 2006.
S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
M. Gardner and S. Dorling, "Artificial neural networks (the multilayer perceptron) - a review of applications in the atmospheric sciences," Atmospheric Environment, vol. 32, no. 14, pp. 2627-2636, 1998.
A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, "Bag of tricks for efficient text classification," in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, vol. 2. Valencia, Spain: Association for Computational Linguistics, April 2017, pp. 427-431.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc., 2017, pp. 6000-6010.
Q. Zhu, Z. Sun, X. Liang, Y. Xiong, and L. Zhang, "OCoR: An overlapping-aware code retriever," in Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. New York, NY, USA: Association for Computing Machinery, 2021, pp. 883-894.
C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in Proceedings of the 34th International Conference on Machine Learning, vol. 70. PMLR, August 2017, pp. 1126-1135.
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," 2019.
H. Fang, S. Wang, M. Zhou, J. Ding, and P. Xie, "CERT: Contrastive self-supervised learning for language understanding," CoRR, vol. abs/2005.12766, 2020.
T. Gao, X. Yao, and D. Chen, "SimCSE: Simple contrastive learning of sentence embeddings," in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, November 2021, pp. 6894-6910.
J. Giorgi, O. Nitski, B. Wang, and G. Bader, "DeCLUTR: Deep contrastive learning for unsupervised textual representations," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, vol. 1. Association for Computational Linguistics, August 2021, pp. 879-895.
B. Gunel, J. Du, A. Conneau, and V. Stoyanov, "Supervised contrastive learning for pre-trained language model fine-tuning," in International Conference on Learning Representations, 2021.