TY - JOUR
T1 - Performance of gastroenterologists and multimodal LLMs in endoscopic EREFS scoring of Eosinophilic Esophagitis
AU - Levartovsky, Asaf
AU - Dar, Lior
AU - Fay, Shmuel
AU - Ukashi, Offir
AU - Shachar, Eyal
AU - Engel, Tal
AU - Ben-Horin, Shomron
AU - Savarino, Edoardo V.
AU - Ungar, Bella
N1 - Publisher Copyright:
© 2025 Editrice Gastroenterologica Italiana S.r.l.
PY - 2025/12
Y1 - 2025/12
N2 - Background: The comprehensive evaluation of multimodal large language models (LLMs) in gastroenterological image analysis remains limited. We aimed to assess the accuracy of Eosinophilic Esophagitis Endoscopic Reference Score (EREFS) grading for Eosinophilic Esophagitis (EoE) by gastroenterology (GI) clinicians with varying levels of experience and by multimodal LLMs. Methods: Fifty real-world endoscopic images of EoE patients were graded according to the original EREFS score by a gold-standard anchor rater. EREFS gradings by GI specialists, fellows and three multimodal LLMs (ChatGPT-4o, Claude Sonnet 3.5, Perplexity Sonar) were compared with the anchor rater's scoring. LLMs were evaluated with both single-shot and few-shot prompting strategies to optimize performance. Results: Overall assessment accuracy was significantly higher among GI fellows (72.4 %) than among specialists (65.3 %, p = 0.004) and LLMs (58.9 %, p < 0.001). In the detection of edema, LLMs outperformed specialists (83.3 % vs 49.3 %, p < 0.001). However, LLMs showed significantly poorer performance in rings assessment (30 %) than specialists (58 %) and fellows (58.7 %, both p < 0.001). After implementation of few-shot prompting, the overall performance of LLMs was comparable to that of GI specialists (62.7 % vs 65.3 %, p = 0.3). Conclusions: This study uncovers variability in EREFS scoring across human raters with different levels of expertise and multimodal LLMs, and demonstrates that few-shot prompting can optimize LLM accuracy.
KW - EREFS
KW - Endoscopy
KW - Eosinophilic Esophagitis
KW - Large-language models (LLMs)
KW - Prompt engineering
UR - https://www.scopus.com/pages/publications/105024211231
U2 - 10.1016/j.dld.2025.11.009
DO - 10.1016/j.dld.2025.11.009
M3 - Article
C2 - 41365739
AN - SCOPUS:105024211231
SN - 1590-8658
VL - 57
SP - 2449
EP - 2456
JO - Digestive and Liver Disease
JF - Digestive and Liver Disease
IS - 12
ER -