LLMs and Stack Overflow discussions: Reliability, impact, and challenges

Da Silva, Leuson; SAMHI, Jordan; Khomh, Foutse

doi:10.1016/j.jss.2025.112541

Article (Scientific journals)

2025 • In Journal of Systems and Software, p. 112541

Peer Reviewed verified by ORBi Dataset

Permalink
https://hdl.handle.net/10993/65324

DOI
10.1016/j.jss.2025.112541

Files

Full Text

2402.08801v1.pdf

Author preprint (963.6 kB)

Details

Disciplines :

Computer science

Author, co-author :

Da Silva, Leuson

SAMHI, Jordan ; University of Luxembourg

Khomh, Foutse

External co-authors :

yes

Language :

English

Title :

LLMs and Stack Overflow discussions: Reliability, impact, and challenges

Publication date :

03 July 2025

Journal title :

Journal of Systems and Software

ISSN :

0164-1212

eISSN :

1873-1228

Publisher :

Elsevier BV

Pages :

112541

Peer reviewed :

Peer Reviewed verified by ORBi

Additional URL :

https://api.elsevier.com/content/article/PII:S0164121225002109?httpAccept=text/xml

Funders :

Canada Research Chairs Program
Fonds de recherche du Québec
Natural Sciences and Engineering Research Council of Canada
Canadian Institute for Advanced Research

Data Set :

https://zenodo.org/records/15086542

Available on ORBilu :

since 04 July 2025

Statistics

Number of views

84 (2 by Unilu)

Number of downloads

21 (1 by Unilu)
