Abstract:
[en] Test suites are a key ingredient in various software automation tasks. Recently, several studies [4] have demonstrated that they are paramount to the adoption of the latest innovations in software engineering, such as automated program repair (APR) [3]. Unfortunately, test suites are often too scarce in software development projects. Generally, they are provided for regression testing, while new bugs are discovered by users who then describe them informally in bug reports. In the recent literature, a new line of APR research has attempted to leverage bug reports in generate-and-validate pipelines for program repair. Even in such cases, when an APR tool generates a patch candidate and test cases are unavailable, developers must manually validate the patch, which constitutes a threat to validity. On the one hand, automatic test generation approaches in the literature [2] unfortunately either target unit test cases, and thus do not cater to the need of revealing the complex bugs that users face when executing software, or require formally-defined inputs such as function signatures, or even a test oracle. On the other hand, bug reports are pervasive but remain under-explored. There is thus a need to investigate the feasibility of generating test cases by leveraging bug reports. Our ultimate objective is to address a key challenge in the adoption of program repair by practitioners: ensuring that patches can be automatically generated and validated for bugs that are reported by users. Concretely, we observe that, while bug reports can quickly become overwhelming for developers (in terms of high quantity and/or low quality), they are still recognized to contain a wealth of information. Unfortunately, such information, hidden behind the informality of natural language, can be difficult to extract, contextualize, and leverage for specifying program executions. Nevertheless, recent advances in Natural Language Processing (NLP) have opened up new possibilities in software engineering. In particular, with the advent of large language models (LLMs), a wide range of tasks have seen machine learning achieve, or even exceed, human performance. In this work, we propose to study the feasibility of exploiting LLMs to produce executable test cases from informal bug reports. Our experiments build on ChatGPT [1], a general-purpose LLM, and CodeGPT [5], a code-specific LLM. The performance of LLM-based test case generation is assessed on the Defects4J benchmark, which includes real-world faults from various Java software development projects.
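To make the described pipeline concrete, the sketch below shows one way an informal bug report could be turned into a prompt for an LLM and the response saved as a JUnit test candidate. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, model name, and `generate_test_candidate` helper are hypothetical, and the example bug report is invented. It relies on the OpenAI chat-completions client; a generated candidate would still have to be compiled and executed against the corresponding buggy project (e.g., checked out with the Defects4J command-line tools) before it can serve to validate a patch.

```python
# Minimal sketch (assumptions: OpenAI chat-completions API, model name,
# prompt wording). Turns an informal bug report into a JUnit test candidate.
import re
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = """You are given a bug report for a Java project.
Write a single JUnit 4 test method that reproduces the described failure.
Return only compilable Java code.

Bug report title: {title}

Bug report description:
{description}
"""

def generate_test_candidate(title: str, description: str,
                            model: str = "gpt-3.5-turbo") -> str:
    """Queries the LLM and returns the raw generated test code."""
    prompt = PROMPT_TEMPLATE.format(title=title, description=description)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # low temperature for more deterministic code
    )
    code = response.choices[0].message.content
    # Strip a Markdown code fence if the model wrapped its answer in one.
    match = re.search(r"```(?:java)?\n(.*?)```", code, re.DOTALL)
    return match.group(1) if match else code

if __name__ == "__main__":
    # Invented bug report text, for illustration only.
    test_code = generate_test_candidate(
        title="StringUtils.abbreviate throws StringIndexOutOfBoundsException",
        description='Calling abbreviate("abc", 4) crashes instead of '
                    "returning the original string.",
    )
    with open("GeneratedReproducerTest.java", "w") as f:
        f.write(test_code)
```

Validation of such a candidate would then follow the usual Defects4J workflow: check out the buggy version of the subject project, add the generated test class to its test sources, and run `defects4j compile` and `defects4j test` to see whether the test compiles and indeed fails on the buggy program.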