Performance of gastroenterologists and multimodal LLMs in endoscopic EREFS scoring of Eosinophilic Esophagitis

Research output: Contribution to journal › Article › peer-review

Abstract

Background: The comprehensive evaluation of multimodal large language models (LLMs) in gastroenterological image analysis remains limited. We aimed to assess the accuracy of the Eosinophilic Esophagitis Endoscopic Reference Score (EREFS) scoring system for Eosinophilic Esophagitis (EoE) across gastroenterology (GI) clinicians with varying levels of experience.

Methods: Fifty real-world endoscopic images of EoE patients were graded based on the original EREFS score by a gold standard anchor-rater. EREFS gradings of GI specialists, fellows and three multimodal LLMs (ChatGPT-4o, Claude Sonnet 3.5, Perplexity Sonar) were compared with the anchor-rater's scoring. LLMs were prompted using both single-shot and few-shot strategies to optimize performance.

Results: Overall assessment accuracy was significantly higher among GI fellows (72.4 %) compared to specialists (65.3 %, p = 0.004) and LLMs (58.9 %, p < 0.001). In the detection of edema, LLMs outperformed specialists (83.3 % vs 49.3 %, p < 0.001). However, LLMs performed significantly worse in the assessment of rings (30 %) than specialists (58 %) and fellows (58.7 %, both p < 0.001). After implementation of few-shot prompting, the overall performance of LLMs was comparable to that of GI specialists (62.7 % vs 65.3 %, p = 0.3).

Conclusions: This study uncovers variability in EREFS scoring across human raters with different expertise and multimodal LLMs, and demonstrates that few-shot prompting can optimize LLM accuracy.
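
The comparison against the anchor-rater described in the Methods can be illustrated with a minimal sketch. The abstract does not state how accuracy was computed, so this assumes exact-match agreement with the anchor-rater's grade for each of the five EREFS features (edema, rings, exudates, furrows, strictures), pooled across features for the overall figure. The function name, data layout, and toy gradings are hypothetical and are not taken from the study.

```python
from typing import Dict, List

# The five endoscopic features that make up the EREFS score.
FEATURES = ["edema", "rings", "exudates", "furrows", "strictures"]

# One grading = feature name -> grade assigned to a single image.
Grading = Dict[str, int]


def accuracy_vs_anchor(anchor: List[Grading], rater: List[Grading]) -> Dict[str, float]:
    """Fraction of images where the rater's grade equals the anchor's grade.

    Assumption (not from the source): accuracy is exact-match agreement per
    feature, and "overall" pools all feature-level decisions.
    """
    if len(anchor) != len(rater):
        raise ValueError("anchor and rater must grade the same images")
    if not anchor:
        raise ValueError("at least one graded image is required")

    scores: Dict[str, float] = {}
    total_hits = 0
    for feature in FEATURES:
        hits = sum(1 for a, r in zip(anchor, rater) if a[feature] == r[feature])
        scores[feature] = hits / len(anchor)
        total_hits += hits
    scores["overall"] = total_hits / (len(anchor) * len(FEATURES))
    return scores


# Hypothetical toy gradings for two images (illustrative only, not study data).
anchor = [
    {"edema": 1, "rings": 2, "exudates": 0, "furrows": 1, "strictures": 0},
    {"edema": 0, "rings": 1, "exudates": 1, "furrows": 1, "strictures": 0},
]
rater = [
    {"edema": 1, "rings": 1, "exudates": 0, "furrows": 1, "strictures": 0},
    {"edema": 0, "rings": 1, "exudates": 1, "furrows": 0, "strictures": 0},
]

scores = accuracy_vs_anchor(anchor, rater)
print(scores)
```

Keeping per-feature and pooled accuracy separate mirrors how the Results report both an overall figure and feature-level figures such as edema and rings.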

Original language: English
Pages (from-to): 2449-2456
Number of pages: 8
Journal: Digestive and Liver Disease
Volume: 57
Issue number: 12
State: Published - Dec 2025

Keywords

  • EREFS
  • Endoscopy
  • Eosinophilic Esophagitis
  • Large-language models (LLMs)
  • Prompt engineering
