TY - JOUR
T1 - ChatGPT’s adherence to otolaryngology clinical practice guidelines
AU - Tessler, Idit
AU - Wolfovitz, Amit
AU - Alon, Eran E.
AU - Gecel, Nir A.
AU - Livneh, Nir
AU - Zimlichman, Eyal
AU - Klang, Eyal
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.
PY - 2024/7
Y1 - 2024/7
N2 - Objectives: Large language models, including ChatGPT, has the potential to transform the way we approach medical knowledge, yet accuracy in clinical topics is critical. Here we assessed ChatGPT’s performance in adhering to the American Academy of Otolaryngology-Head and Neck Surgery guidelines. Methods: We presented ChatGPT with 24 clinical otolaryngology questions based on the guidelines of the American Academy of Otolaryngology. This was done three times (N = 72) to test the model’s consistency. Two otolaryngologists evaluated the responses for accuracy and relevance to the guidelines. Cohen’s Kappa was used to measure evaluator agreement, and Cronbach’s alpha assessed the consistency of ChatGPT’s responses. Results: The study revealed mixed results; 59.7% (43/72) of ChatGPT’s responses were highly accurate, while only 2.8% (2/72) directly contradicted the guidelines. The model showed 100% accuracy in Head and Neck, but lower accuracy in Rhinology and Otology/Neurotology (66%), Laryngology (50%), and Pediatrics (8%). The model’s responses were consistent in 17/24 (70.8%), with a Cronbach’s alpha value of 0.87, indicating a reasonable consistency across tests. Conclusions: Using a guideline-based set of structured questions, ChatGPT demonstrates consistency but variable accuracy in otolaryngology. Its lower performance in some areas, especially Pediatrics, suggests that further rigorous evaluation is needed before considering real-world clinical use.
AB - Objectives: Large language models, including ChatGPT, has the potential to transform the way we approach medical knowledge, yet accuracy in clinical topics is critical. Here we assessed ChatGPT’s performance in adhering to the American Academy of Otolaryngology-Head and Neck Surgery guidelines. Methods: We presented ChatGPT with 24 clinical otolaryngology questions based on the guidelines of the American Academy of Otolaryngology. This was done three times (N = 72) to test the model’s consistency. Two otolaryngologists evaluated the responses for accuracy and relevance to the guidelines. Cohen’s Kappa was used to measure evaluator agreement, and Cronbach’s alpha assessed the consistency of ChatGPT’s responses. Results: The study revealed mixed results; 59.7% (43/72) of ChatGPT’s responses were highly accurate, while only 2.8% (2/72) directly contradicted the guidelines. The model showed 100% accuracy in Head and Neck, but lower accuracy in Rhinology and Otology/Neurotology (66%), Laryngology (50%), and Pediatrics (8%). The model’s responses were consistent in 17/24 (70.8%), with a Cronbach’s alpha value of 0.87, indicating a reasonable consistency across tests. Conclusions: Using a guideline-based set of structured questions, ChatGPT demonstrates consistency but variable accuracy in otolaryngology. Its lower performance in some areas, especially Pediatrics, suggests that further rigorous evaluation is needed before considering real-world clinical use.
KW - Artificial intelligence
KW - ChatGPT
KW - Clinical practice guidelines
KW - Medical knowledge
KW - Otolaryngology
UR - http://www.scopus.com/inward/record.url?scp=85191095679&partnerID=8YFLogxK
U2 - 10.1007/s00405-024-08634-9
DO - 10.1007/s00405-024-08634-9
M3 - ???researchoutput.researchoutputtypes.contributiontojournal.article???
C2 - 38647684
AN - SCOPUS:85191095679
SN - 0937-4477
VL - 281
SP - 3829
EP - 3834
JO - European Archives of Oto-Rhino-Laryngology
JF - European Archives of Oto-Rhino-Laryngology
IS - 7
ER -