Abstract:
[en] Test suites are a key ingredient in various software automation tasks. Recently, several studies [4] have demonstrated that they are paramount to the adoption of the latest innovations in software engineering, such as automated program repair (APR) [3]. Unfortunately, test suites are often too scarce in software development projects. Generally, they are provided for regression testing, while new bugs are discovered by users who then describe them informally in bug reports. In the recent literature, a new line of APR research has attempted to leverage bug reports in generate-and-validate pipelines for program repair. Even in such cases, when an APR tool generates a patch candidate and test cases are unavailable, developers must manually validate the patch, which constitutes a threat to validity. On the one hand, automatic test generation approaches in the literature [2] unfortunately either target unit test cases, and thus do not cater to the need of revealing the complex bugs that users face when executing software, or require formally-defined inputs such as function signatures, or even a test oracle. On the other hand, bug reports are pervasive but remain under-explored. There is thus a need to investigate the feasibility of generating test cases by leveraging bug reports. Our ultimate objective is to address a key challenge in the adoption of program repair by practitioners: ensuring that patches can be automatically generated and validated for bugs that are reported by users. Concretely, we observe that, while bug reports can quickly become overwhelming for developers (in terms of high quantity and/or low quality), they are still recognized to contain a wealth of information. Unfortunately, such information, hidden behind the informality of natural language, can be difficult to extract, contextualize, and leverage for specifying program executions. Nevertheless, recent advances in Natural Language Processing (NLP) have opened up new possibilities in software engineering. In particular, with the advent of large language models (LLMs), a wide range of tasks have seen machine learning achieve, or even exceed, human performance. In this work, we propose to study the feasibility of exploiting LLMs to produce executable test cases from informal bug reports. Our experiments build on ChatGPT [1], a general-purpose LLM, and CodeGPT [5], a code-specific LLM. The performance of LLM-based test case generation is assessed on the Defects4J benchmark, which includes real-world faults from various Java software development projects.
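To make the described pipeline concrete, the sketch below shows one way an informal bug report could be turned into a prompt for an LLM and the response saved as a JUnit test candidate. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, model name, and `generate_test_candidate` helper are hypothetical, and the example bug report is invented. It relies on the OpenAI chat-completions client; a generated candidate would still have to be compiled and executed against the corresponding buggy project (e.g., checked out with the Defects4J command-line tools) before it can serve to validate a patch.

```python
# Minimal sketch (assumptions: OpenAI chat-completions API, model name,
# prompt wording). Turns an informal bug report into a JUnit test candidate.
import re
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = """You are given a bug report for a Java project.
Write a single JUnit 4 test method that reproduces the described failure.
Return only compilable Java code.

Bug report title: {title}

Bug report description:
{description}
"""

def generate_test_candidate(title: str, description: str,
                            model: str = "gpt-3.5-turbo") -> str:
    """Queries the LLM and returns the raw generated test code."""
    prompt = PROMPT_TEMPLATE.format(title=title, description=description)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # low temperature for more deterministic code
    )
    code = response.choices[0].message.content
    # Strip a Markdown code fence if the model wrapped its answer in one.
    match = re.search(r"```(?:java)?\n(.*?)```", code, re.DOTALL)
    return match.group(1) if match else code

if __name__ == "__main__":
    # Invented bug report text, for illustration only.
    test_code = generate_test_candidate(
        title="StringUtils.abbreviate throws StringIndexOutOfBoundsException",
        description='Calling abbreviate("abc", 4) crashes instead of '
                    "returning the original string.",
    )
    with open("GeneratedReproducerTest.java", "w") as f:
        f.write(test_code)
```

Validation of such a candidate would then follow the usual Defects4J workflow: check out the buggy version of the subject project, add the generated test class to its test sources, and run `defects4j compile` and `defects4j test` to see whether the test compiles and indeed fails on the buggy program.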