TY - JOUR
T1 - Large language models in radiology reporting - A systematic review of performance, limitations, and clinical implications
AU - Artsi, Yaara
AU - Klang, Eyal
AU - Collins, Jeremy D.
AU - Glicksberg, Benjamin S.
AU - Nadkarni, Girish N.
AU - Korfiatis, Panagiotis
AU - Sorin, Vera
N1 - Publisher Copyright:
© 2025 The Authors
PY - 2025/1
Y1 - 2025/1
N2 - Rationale and objectives: Large language models (LLMs) and vision-language models (VLMs) have emerged as potential tools for automated radiology reporting. However, concerns regarding their fidelity, reliability, and clinical applicability remain. This systematic review examines the current literature on LLM-generated radiology reports, assessing their fidelity, clinical reliability, and effectiveness. The review aims to identify benefits, limitations, and key factors influencing AI-generated report quality. Materials and methods: We conducted a systematic search of MEDLINE, Google Scholar, Scopus, and Web of Science to identify studies published between January 2015 and July 2025. Studies evaluating radiology reports generated by transformer-based generative VLMs/LLMs were included. The review follows PRISMA guidelines. Risk of bias was assessed using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool. Results: Fifteen studies met the inclusion criteria. Four assessed VLMs that generate full radiology reports directly from images, whereas eleven examined LLMs that summarize textual findings into radiology impressions. Six studies evaluated out-of-the-box (base) models, and nine analyzed fine-tuned models. Twelve investigations paired automated natural-language metrics with radiologist review, while three relied on automated metrics alone. Fine-tuned models aligned better with expert evaluations and achieved higher scores on natural language processing metrics than base models. All LLMs produced hallucinations, misdiagnoses, and inconsistencies. Conclusion: LLMs show promise in radiology reporting, but limitations in diagnostic accuracy and hallucinations necessitate human oversight. Future research should focus on improving evaluation frameworks, incorporating diverse datasets, and prospectively validating AI-generated reports in clinical workflows.
AB - Rationale and objectives: Large language models (LLMs) and vision-language models (VLMs) have emerged as potential tools for automated radiology reporting. However, concerns regarding their fidelity, reliability, and clinical applicability remain. This systematic review examines the current literature on LLM-generated radiology reports, assessing their fidelity, clinical reliability, and effectiveness. The review aims to identify benefits, limitations, and key factors influencing AI-generated report quality. Materials and methods: We conducted a systematic search of MEDLINE, Google Scholar, Scopus, and Web of Science to identify studies published between January 2015 and July 2025. Studies evaluating radiology reports generated by transformer-based generative VLMs/LLMs were included. The review follows PRISMA guidelines. Risk of bias was assessed using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool. Results: Fifteen studies met the inclusion criteria. Four assessed VLMs that generate full radiology reports directly from images, whereas eleven examined LLMs that summarize textual findings into radiology impressions. Six studies evaluated out-of-the-box (base) models, and nine analyzed fine-tuned models. Twelve investigations paired automated natural-language metrics with radiologist review, while three relied on automated metrics alone. Fine-tuned models aligned better with expert evaluations and achieved higher scores on natural language processing metrics than base models. All LLMs produced hallucinations, misdiagnoses, and inconsistencies. Conclusion: LLMs show promise in radiology reporting, but limitations in diagnostic accuracy and hallucinations necessitate human oversight. Future research should focus on improving evaluation frameworks, incorporating diverse datasets, and prospectively validating AI-generated reports in clinical workflows.
KW - AI alignment
KW - Artificial Intelligence
KW - Automated reporting
KW - Clinical evaluation
KW - Generative AI
KW - Large language models
KW - Natural language processing
KW - Radiology reports
UR - https://www.scopus.com/pages/publications/105013626188
U2 - 10.1016/j.ibmed.2025.100287
DO - 10.1016/j.ibmed.2025.100287
M3 - Systematic review
AN - SCOPUS:105013626188
SN - 2666-5212
VL - 12
JO - Intelligence-Based Medicine
JF - Intelligence-Based Medicine
M1 - 100287
ER -