TY - JOUR
T1 - Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making
T2 - Comparative Analysis of GPT-3.5 and GPT-4
AU - Lahat, Adi
AU - Sharif, Kassem
AU - Zoabi, Narmin
AU - Patt, Yonatan Shneor
AU - Sharif, Yousra
AU - Fisher, Lior
AU - Shani, Uria
AU - Arow, Mohamad
AU - Levin, Roni
AU - Klang, Eyal
N1 - Publisher Copyright:
©Adi Lahat, Kassem Sharif, Narmin Zoabi, Yonatan Shneor Patt, Yousra Sharif, Lior Fisher, Uria Shani, Mohamad Arow, Roni Levin, Eyal Klang.
PY - 2024
Y1 - 2024
AB - Background: Artificial intelligence, particularly chatbot systems, is becoming an instrumental tool in health care, aiding clinical decision-making and patient engagement. Objective: This study aims to analyze the performance of ChatGPT-3.5 and ChatGPT-4 in addressing complex clinical and ethical dilemmas and to illustrate their potential role in health care decision-making, comparing ratings from senior physicians and residents as well as across question types. Methods: A total of 4 specialized physicians formulated 176 real-world clinical questions. Overall, 8 senior physicians and residents assessed the responses of GPT-3.5 and GPT-4 on a 1-5 scale across 5 categories: accuracy, relevance, clarity, utility, and comprehensiveness. Evaluations covered internal medicine, emergency medicine, and ethics. Comparisons were made globally, between seniors and residents, and across question categories. Results: Both GPT models received high mean scores (4.4, SD 0.8 for GPT-4 and 4.1, SD 1.0 for GPT-3.5). GPT-4 outperformed GPT-3.5 across all rating dimensions, and seniors consistently rated responses higher than residents for both models. Specifically, seniors rated GPT-4 higher than residents did for utility and comprehensiveness (mean 4.6 vs 4.0 and 4.6 vs 4.1, respectively; P<.001), with a similar pattern for GPT-3.5 (mean 4.1 vs 3.7 and 3.9 vs 3.5, respectively; P<.001). Ethics questions received the highest ratings for both models, with consistent mean scores across the accuracy and comprehensiveness criteria. Differences among question types were significant, particularly for GPT-4’s mean comprehensiveness scores across emergency medicine, internal medicine, and ethics questions (4.2, SD 1.0; 4.3, SD 0.8; and 4.5, SD 0.7, respectively; P<.001) and for GPT-3.5’s accuracy, utility, and comprehensiveness dimensions. Conclusions: ChatGPT shows promise in assisting physicians with clinical and ethical questions, with the potential to enhance diagnostics, treatment decisions, and ethical deliberation. While integration into clinical workflows may be valuable, it must complement, not replace, human expertise. Continued research is essential to ensure safe and effective implementation in clinical settings.
KW - AI
KW - ChatGPT
KW - ED physician
KW - EM medicine
KW - ML
KW - NLP
KW - algorithm
KW - algorithms
KW - artificial intelligence
KW - bioethics
KW - chat-GPT
KW - chat-bot
KW - chat-bots
KW - chatbot
KW - chatbots
KW - emergency doctor
KW - emergency medicine
KW - emergency physician
KW - ethical
KW - ethical dilemma
KW - ethical dilemmas
KW - ethics
KW - internal medicine
KW - machine learning
KW - natural language processing
KW - practical model
KW - practical models
KW - predictive analytics
KW - predictive model
KW - predictive models
KW - predictive system
UR - http://www.scopus.com/inward/record.url?scp=85197167878&partnerID=8YFLogxK
U2 - 10.2196/54571
DO - 10.2196/54571
M3 - Article
C2 - 38935937
AN - SCOPUS:85197167878
SN - 1439-4456
VL - 26
JO - Journal of Medical Internet Research
JF - Journal of Medical Internet Research
IS - 1
M1 - e54571
ER -