LLM-based automatic short answer grading in undergraduate medical education.

BACKGROUND: Multiple-choice questions are heavily used in medical education assessments, but they rely on recognition rather than knowledge recall. Grading open questions, however, is a time-intensive task for teachers. Automatic short answer grading (ASAG) has tried to fill this gap, and the recent advent of large language models (LLMs) has given the field new momentum.

METHODS: We graded 2288 student answers from 12 undergraduate medical education courses in 3 languages using GPT-4 and Gemini 1.0 Pro.

RESULTS: GPT-4 proposed significantly lower grades than the human evaluator but reached low rates of false positives. The grades of Gemini 1.0 Pro were not significantly different from the teachers'. Both LLMs reached moderate agreement with human grades, and GPT-4 achieved high precision among answers considered fully correct. Consistent grading behavior could be determined for high-quality keys. A weak correlation was found with respect to the length or language of student answers. There is a risk of bias if the LLM knows the human grade a priori.

CONCLUSIONS: LLM-based ASAG applied to medical education still requires human oversight, but time can be saved on the edge cases, allowing teachers to focus on the middle ones. For Bachelor-level medical education questions, the training knowledge of LLMs seems to be sufficient; fine-tuning is thus not necessary.
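To make the reported quantities concrete, the following minimal sketch shows one way the precision among answers graded as fully correct and a chance-corrected agreement score could be computed from paired human and LLM grades. The abstract does not name the agreement statistic used, so Cohen's kappa is an assumption here. The function names and the toy grades are hypothetical and are not data from the study.

```python
from collections import Counter


def precision_fully_correct(human, llm, full_mark):
    """Share of answers the LLM graded as fully correct that the human
    grader also graded as fully correct. Returns None if the LLM never
    awarded the full mark."""
    flagged = [h for h, m in zip(human, llm) if m == full_mark]
    if not flagged:
        return None
    return sum(h == full_mark for h in flagged) / len(flagged)


def cohen_kappa(human, llm):
    """Unweighted Cohen's kappa between two lists of categorical grades
    (assumed agreement measure, not confirmed by the abstract)."""
    n = len(human)
    observed = sum(h == m for h, m in zip(human, llm)) / n
    h_counts, m_counts = Counter(human), Counter(llm)
    # Agreement expected by chance, from each rater's grade distribution.
    expected = sum(h_counts[c] * m_counts[c] for c in h_counts) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented toy grades on a 0-2 scale, for illustration only.
human_grades = [2, 2, 1, 0, 2, 1, 0, 2]
llm_grades = [2, 1, 1, 0, 2, 0, 0, 1]

print(precision_fully_correct(human_grades, llm_grades, full_mark=2))
print(round(cohen_kappa(human_grades, llm_grades), 3))
```

In this toy data the LLM grades more strictly than the human, so every answer it marks as fully correct is confirmed by the human grader while overall agreement stays moderate. This is the same qualitative pattern the abstract describes for GPT-4.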

Disciplines :

Human health sciences: Multidisciplinary, general & others
Computer science
Education & instruction

Author, co-author :

GREVISSE, Christian ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Life Sciences and Medicine (DLSM) > Medical Education

Language :

English

Title :

LLM-based automatic short answer grading in undergraduate medical education.

Publication date :

27 September 2024

Journal title :

BMC Medical Education

eISSN :

1472-6920

Publisher :

Springer Science and Business Media LLC, England

Pages :

1060

Peer reviewed :

Peer Reviewed verified by ORBi

Available on ORBilu :

since 30 September 2024
