Toward expert-level medical question answering with large language models

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R. Pfohl, Heather Cole-Lewis, Darlene Neal, Qazi Mamunur Rashid, Mike Schaekermann, Amy Wang, Dev Dash, Jonathan H. Chen, Nigam H. Shah, Sami Lachgar, Philip Andrew Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Agüera y Arcas, Nenad Tomašev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara Mahdavi, Joelle K. Barral, Dale R. Webster, Greg S. Corrado, Yossi Matias, Shekoofeh Azizi*, Alan Karthikesalingam*, Vivek Natarajan*

*Corresponding author for this work

Research output: Contribution to journal › Article › Peer-review


Abstract

Large language models (LLMs) have shown promise in medical question answering, with Med-PaLM being the first to exceed a ‘passing’ score on United States Medical Licensing Examination-style questions. However, challenges remain in long-form medical question answering and handling real-world workflows. Here, we present Med-PaLM 2, which bridges these gaps with a combination of base LLM improvements, medical domain fine-tuning and new strategies for improving reasoning and grounding through ensemble refinement and chain of retrieval. Med-PaLM 2 scores up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19%, and demonstrates dramatic performance increases across the MedMCQA, PubMedQA and MMLU clinical topics datasets. Our detailed human evaluation framework shows that physicians prefer Med-PaLM 2 answers to those from other physicians on eight of nine clinical axes. Med-PaLM 2 also demonstrates significant improvements over its predecessor across all evaluation metrics, particularly on new adversarial datasets designed to probe LLM limitations (P < 0.001). In a pilot study using real-world medical questions, specialists preferred Med-PaLM 2 answers to generalist physician answers 65% of the time. While specialist answers were still preferred overall, both specialists and generalists rated Med-PaLM 2 to be as safe as physician answers, demonstrating its growing potential in real-world medical applications.

Original language: English
Article number: 16
Pages (from-to): 943-950
Number of pages: 8
Journal: Nature Medicine
Volume: 31
Issue number: 3
DOIs
State: Published - Mar 2025
Externally published: Yes

Funding

Funders (with funder numbers where available):
Debra and Mark Leslie endowment
American Heart Association
Google Research
National Institutes of Health
National Institute of Allergy and Infectious Diseases: 1R01AI17812101
NIH-NCATS Clinical & Translational Science Award: UM1TR004921
National Institute on Drug Abuse Clinical Trials Network: UG1DA015815 (CTN-0136)
Gordon and Betty Moore Foundation: 12409
