TY - JOUR
T1 - Toward expert-level medical question answering with large language models
AU - Singhal, Karan
AU - Tu, Tao
AU - Gottweis, Juraj
AU - Sayres, Rory
AU - Wulczyn, Ellery
AU - Amin, Mohamed
AU - Hou, Le
AU - Clark, Kevin
AU - Pfohl, Stephen R.
AU - Cole-Lewis, Heather
AU - Neal, Darlene
AU - Rashid, Qazi Mamunur
AU - Schaekermann, Mike
AU - Wang, Amy
AU - Dash, Dev
AU - Chen, Jonathan H.
AU - Shah, Nigam H.
AU - Lachgar, Sami
AU - Mansfield, Philip Andrew
AU - Prakash, Sushant
AU - Green, Bradley
AU - Dominowska, Ewa
AU - Agüera y Arcas, Blaise
AU - Tomašev, Nenad
AU - Liu, Yun
AU - Wong, Renee
AU - Semturs, Christopher
AU - Mahdavi, S. Sara
AU - Barral, Joelle K.
AU - Webster, Dale R.
AU - Corrado, Greg S.
AU - Matias, Yossi
AU - Azizi, Shekoofeh
AU - Karthikesalingam, Alan
AU - Natarajan, Vivek
N1 - Publisher Copyright:
© The Author(s) 2025.
PY - 2025/3
Y1 - 2025/3
AB - Large language models (LLMs) have shown promise in medical question answering, with Med-PaLM being the first to exceed a ‘passing’ score in United States Medical Licensing Examination style questions. However, challenges remain in long-form medical question answering and handling real-world workflows. Here, we present Med-PaLM 2, which bridges these gaps with a combination of base LLM improvements, medical domain fine-tuning and new strategies for improving reasoning and grounding through ensemble refinement and chain of retrieval. Med-PaLM 2 scores up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19%, and demonstrates dramatic performance increases across MedMCQA, PubMedQA and MMLU clinical topics datasets. Our detailed human evaluation framework shows that physicians prefer Med-PaLM 2 answers to those from other physicians on eight of nine clinical axes. Med-PaLM 2 also demonstrates significant improvements over its predecessor across all evaluation metrics, particularly on new adversarial datasets designed to probe LLM limitations (P < 0.001). In a pilot study using real-world medical questions, specialists preferred Med-PaLM 2 answers to generalist physician answers 65% of the time. While specialist answers were still preferred overall, both specialists and generalists rated Med-PaLM 2 to be as safe as physician answers, demonstrating its growing potential in real-world medical applications.
UR - http://www.scopus.com/inward/record.url?scp=85215627683&partnerID=8YFLogxK
U2 - 10.1038/s41591-024-03423-7
DO - 10.1038/s41591-024-03423-7
M3 - Article
C2 - 39779926
AN - SCOPUS:85215627683
SN - 1078-8956
VL - 31
SP - 943
EP - 950
JO - Nature Medicine
JF - Nature Medicine
IS - 3
M1 - 16
ER -