![]() Kim, Kisub ![]() Doctoral thesis (2021) Code search can be a core activity in software development for enhancing productivity. Developers commonly reuse existing source code fragments by searching for codebases available in local or global ... [more ▼] Code search can be a core activity in software development for enhancing productivity. Developers commonly reuse existing source code fragments by searching for codebases available in local or global repositories. Code search helps developers ease the implementation by supplying code snippets to reuse or understand specific concepts deeper during software development by providing various code snippets for the same tasks. In addition, reading real-world examples (the results of code search) is helpful for developers to make programs more reliable, faster, or secure as the examples have been tested and reused by many other developers. However, it is getting more challenging as the codebases are becoming larger since the large codebase can derive too many code candidates. Thus, the research community has invested substantial efforts in developing new techniques, combining methods, and applying more extensive data to improve the performance and efficiency of code search. Despite the significant efforts made by researchers in the field, code search still has many open problems that the community needs to address, such as lack of benchmarks, vocabulary mismatch (between natural language and source code), and low extensibility on programming languages. Our work focuses on the open issues and the momentum of the domain on semantic code search, which considers the meaning of the user query rather than concerning the syntactic similarity that most other studies have approached. The thesis begins with exploring general issues on code search by conducting a systematic literature review. The survey organizes and classifies the code search approaches with various directions such as learning-based, feedback-driven, dynamic techniques. It reveals insights and new research directions. Given the research directions by the survey, we concentrate on alleviating the vocabulary mismatch problem between free-form text query and source code to improve the overall performance of code search first. To understand the free-form text query, we leverage crowd knowledge. The survey also discovered that there are only a few code-to-code approaches and investigation on crowd-knowledge indicated there exists demand, especially on finding semantically similar source code, i.e., source code that is syntactically different but performs the same functionality. Therefore, we go further, reformulating the user code query with real-world code snippets. This allows catching the semantics from the source code. Given the semantic information, a user can search for desired source code by using their code fragments. In this context, the present dissertation aims to explore semantic code search by contributing to the following three building blocks: Review of state-of-the-art: Despite the growing interest in code search, a comprehensive survey or systematic literature review on the field of code search remains limited. We conducted a large-scale systematic literature review on the internet-scale code search. Our objective in this study was to devise a grounded approach to understand the procedure for the code search approach. We built an operational taxonomy on top of each procedure to categorize the approaches and provide insights on the selection of various approaches. Our investigation on the open issues from the literature guide researchers and practitioners to future research directions. CoCaBu: Source code terms such as method names and variable types are often different from conceptual words mentioned in a search query. This vocabulary mismatch problem can make code search inefficient. We presented COde voCABUlary (CoCaBu), an approach to resolving the vocabulary mismatch problem when dealing with free-form code search queries. Our approach leverages common developer questions and the associated expert answers to augment user queries with the relevant but missing structural code entities to improve matching relevant code examples within large code repositories. To instantiate this approach, we built GitSearch, a code search engine, on top of GitHub and Stack Overflow Q&A data. Experimental results, collected via several comparisons against the state-of-the-art code search and existing online search engines such as Google, show that CoCaBu provides qualitatively better results. Furthermore, our live study on the developer community indicates that it can retrieve acceptable or attractive answers for their questions. FaCoY: Most existing approaches focus on serving user queries provided as natural language free-form input. However, there exists a wide range of use-case scenarios where a code-to-code approach would be most beneficial. For example, research directions in code transplantation, code diversity, patch recommendation can leverage a code-to-code search engine to find essential ingredients for their techniques. Given the wide range of use-case for code-to-code search, we propose FaCoY, a novel approach for statically finding code snippets that may be semantically similar to user input code. FaCoY implements a query alternation strategy: instead of directly matching code query tokens with code in the search space, FaCoY first attempts to identify other tokens, which may also be relevant in implementing the functional behavior of the input code. The experimental results show that FaCoY is more effective than all the existing online code-to-code search engines, and it can also be used to find semantic code clones (i.e., Type-4). Moreover, the results proved that FaCoY could be helpful in code/patch recommendation. [less ▲] Detailed reference viewed: 446 (35 UL)![]() ; ; et al in Empirical Software Engineering (2021), 26(6), 1--33 Detailed reference viewed: 44 (7 UL)![]() Liu, Kui ![]() ![]() in 42nd ACM/IEEE International Conference on Software Engineering (ICSE) (2020, May) Test-based automated program repair has been a prolific field of research in software engineering in the last decade. Many approaches have indeed been proposed, which leverage test suites as a weak, but ... [more ▼] Test-based automated program repair has been a prolific field of research in software engineering in the last decade. Many approaches have indeed been proposed, which leverage test suites as a weak, but affordable, approximation to program specifications. Although the literature regularly sets new records on the number of benchmark bugs that can be fixed, several studies increasingly raise concerns about the limitations and biases of state-of-the-art approaches. For example, the correctness of generated patches has been questioned in a number of studies, while other researchers pointed out that evaluation schemes may be misleading with respect to the processing of fault localization results. Nevertheless, there is little work addressing the efficiency of patch generation, with regard to the practicality of program repair. In this paper, we fill this gap in the literature, by providing an extensive review on the efficiency of test suite based program repair. Our objective is to assess the number of generated patch candidates, since this information is correlated to (1) the strategy to traverse the search space efficiently in order to select sensical repair attempts, (2) the strategy to minimize the test effort for identifying a plausible patch, (3) as well as the strategy to prioritize the generation of a correct patch. To that end, we perform a large-scale empirical study on the efficiency, in terms of quantity of generated patch candidates of the 16 open-source repair tools for Java programs. The experiments are carefully conducted under the same fault localization configurations to limit biases. Eventually, among other findings, we note that: (1) many irrelevant patch candidates are generated by changing wrong code locations; (2) however, if the search space is carefully triaged, fault localization noise has little impact on patch generation efficiency; (3) yet, current template-based repair systems, which are known to be most effective in fixing a large number of bugs, are actually least efficient as they tend to generate majoritarily irrelevant patch candidates. [less ▲] Detailed reference viewed: 261 (18 UL)![]() Liu, Kui ![]() ![]() in 41st ACM/IEEE International Conference on Software Engineering (ICSE) (2019, May) To ensure code readability and facilitate software maintenance, program methods must be named properly. In particular, method names must be consistent with the corresponding method implementations ... [more ▼] To ensure code readability and facilitate software maintenance, program methods must be named properly. In particular, method names must be consistent with the corresponding method implementations. Debugging method names remains an important topic in the literature, where various approaches analyze commonalities among method names in a large dataset to detect inconsistent method names and suggest better ones. We note that the state-of-the-art does not analyze the implemented code itself to assess consistency. We thus propose a novel automated approach to debugging method names based on the analysis of consistency between method names and method code. The approach leverages deep feature representation techniques adapted to the nature of each artifact. Experimental results on over 2.1 million Java methods show that we can achieve up to 15 percentage points improvement over the state-of-the-art, establishing a record performance of 67.9% F1-measure in identifying inconsistent method names. We further demonstrate that our approach yields up to 25% accuracy in suggesting full names, while the state-of-the-art lags far behind at 1.1% accuracy. Finally, we report on our success in fixing 66 inconsistent method names in a live study on projects in the wild. [less ▲] Detailed reference viewed: 361 (21 UL)![]() Liu, Kui ![]() ![]() ![]() in 25th Asia-Pacific Software Engineering Conference (APSEC) (2018, December 07) Automated program repair (APR) has extensively been developed by leveraging search-based techniques, in which fix ingredients are explored and identified in different granularities from a specific search ... [more ▼] Automated program repair (APR) has extensively been developed by leveraging search-based techniques, in which fix ingredients are explored and identified in different granularities from a specific search space. State-of-the approaches often find fix ingredients by using mutation operators or leveraging manually-crafted templates. We argue that the fix ingredients can be searched in an online mode, leveraging code search techniques to find potentially-fixed versions of buggy code fragments from which repair actions can be extracted. In this study, we present an APR tool, LSRepair, that automatically explores code repositories to search for fix ingredients at the method-level granularity with three strategies of similar code search. Our preliminary evaluation shows that code search can drive a faster fix process (some bugs are fixed in a few seconds). LSRepair helps repair 19 bugs from the Defects4J benchmark successfully. We expect our approach to open new directions for fixing multiple-lines bugs. [less ▲] Detailed reference viewed: 305 (27 UL)![]() Kim, Kisub ![]() ![]() ![]() in International Conference on Software Engineering (ICSE 2018) (2018, May 27) Code search is an unavoidable activity in software development. Various approaches and techniques have been explored in the literature to support code search tasks. Most of these approaches focus on ... [more ▼] Code search is an unavoidable activity in software development. Various approaches and techniques have been explored in the literature to support code search tasks. Most of these approaches focus on serving user queries provided as natural language free-form input. However, there exists a wide range of use-case scenarios where a code-to-code approach would be most beneficial. For example, research directions in code transplantation, code diversity, patch recommendation can leverage a code-to-code search engine to find essential ingredients for their techniques. In this paper, we propose FaCoY, a novel approach for statically finding code fragments which may be semantically similar to user input code. FaCoY implements a query alternation strategy: instead of directly matching code query tokens with code in the search space, FaCoY first attempts to identify other tokens which may also be relevant in implementing the functional behavior of the input code. With various experiments, we show that (1) FaCoY is more effective than online code-to-code search engines; (2) FaCoY can detect more semantic code clones (i.e., Type-4) in BigCloneBench than the state-of-theart; (3) FaCoY, while static, can detect code fragments which are indeed similar with respect to runtime execution behavior; and (4) FaCoY can be useful in code/patch recommendation. [less ▲] Detailed reference viewed: 252 (32 UL) |
||