AI; ML; digitalization; healthcare; performance metrics; Orthopedics and Sports Medicine
Abstract :
[en] The accelerating progress of artificial intelligence (AI) demands rigorous evaluation standards to ensure safe, effective integration into healthcare's high-stakes decisions. As AI increasingly enables prediction, analysis and judgement capabilities relevant to medicine, proper evaluation and interpretation are indispensable. Erroneous AI could endanger patients; thus, developing, validating and deploying medical AI demands adherence to strict, transparent standards centred on safety, ethics and responsible oversight. Core considerations include assessing performance on diverse real-world data, collaborating with domain experts, confirming model reliability and limitations, and advancing interpretability. Thoughtful selection of evaluation metrics suited to the clinical context, along with testing on diverse data sets representing different populations, improves generalisability. Partnerships between software engineers, data scientists and medical practitioners ground assessment in real needs. Journals must uphold reporting standards that match AI's societal impact. With rigorous, holistic evaluation frameworks, AI can progress towards expanding healthcare access and quality.
LEVEL OF EVIDENCE: Level V.
Disciplines :
Physical, chemical, mathematical & earth Sciences: Multidisciplinary, general & others
Author, co-author :
Oettl, Felix C ; Hospital for Special Surgery New York New York USA ; Schulthess Klinik Zurich Switzerland
Pareek, Ayoosh ; Sports Medicine and Shoulder Institute, Hospital for Special Surgery New York New York USA
Winkler, Philipp W ; Department for Orthopaedics and Traumatology, Kepler University Hospital GmbH Johannes Kepler University Linz Linz Austria ; Department of Orthopaedics, Institute of Clinical Sciences, Sahlgrenska Academy University of Gothenburg Gothenburg Sweden ; Sahlgrenska Sports Medicine Center Göteborg Sweden
Zsidai, Bálint ; Department of Orthopaedics, Institute of Clinical Sciences, Sahlgrenska Academy University of Gothenburg Gothenburg Sweden ; Sahlgrenska Sports Medicine Center Göteborg Sweden
Pruneski, James A ; Department of Orthopaedic Surgery Tripler Army Medical Center Honolulu Hawaii USA
Senorski, Eric Hamrin ; Sahlgrenska Sports Medicine Center Göteborg Sweden ; Department of Health and Rehabilitation, Institute of Neuroscience and Physiology, Sahlgrenska Academy University of Gothenburg Gothenburg Sweden
Kopf, Sebastian ; Center of Orthopaedics and Traumatology, University Hospital Brandenburg an der Havel, Brandenburg Medical School Theodor Fontane Germany
LEY, Christophe ; University of Luxembourg > Faculty of Science, Technology and Medicine (FSTM) > Department of Mathematics (DMATH)
Herbst, Elmar ; Department of Trauma, Hand and Reconstructive Surgery University Hospital Muenster Muenster Germany
Oeding, Jacob F ; Department of Orthopaedics, Institute of Clinical Sciences, Sahlgrenska Academy University of Gothenburg Gothenburg Sweden ; Mayo Clinic Alix School of Medicine, Mayo Clinic Rochester Minnesota USA
Hirschmann, Michael T ; Department of Orthopaedic Surgery and Traumatology Kantonsspital Baselland Bruderholz Switzerland ; University of Basel Basel Switzerland
Musahl, Volker ; Department of Orthopaedic Surgery, UPMC Freddie Fu Sports Medicine Center University of Pittsburgh Pittsburgh Pennsylvania USA
Samuelsson, Kristian ; Department of Orthopaedics, Institute of Clinical Sciences, Sahlgrenska Academy University of Gothenburg Gothenburg Sweden ; Sahlgrenska Sports Medicine Center Göteborg Sweden ; Department of Orthopaedics Sahlgrenska University Hospital Mölndal Sweden
Tischer, Thomas ; Department of Orthopaedic Surgery Universitymedicine Rostock Rostock Germany ; Department of Orthopaedic and Trauma Surgery Malteser Waldkrankenhaus Erlangen Erlangen Germany
Feldt, Robert ; Department of Computer Science and Engineering Chalmers University of Technology Gothenburg Sweden