
Towards Expert-Level Medical Question Answering with Large Language Models

May 16, 2023
作者: Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Nenad Tomasev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara Mahdavi, Joelle Barral, Dale Webster, Greg S. Corrado, Yossi Matias, Shekoofeh Azizi, Alan Karthikesalingam, Vivek Natarajan
cs.AI

Abstract

Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score on US Medical Licensing Examination (USMLE)-style questions, with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across the MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions designed to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.
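The abstract names a novel ensemble refinement prompting strategy without spelling it out. As a rough illustration only, the sketch below shows the general sample-then-refine pattern: draw several chain-of-thought generations at nonzero temperature, feed them back as context for refined answers, and aggregate by plurality vote. The `model` callable, the prompt wording, and the `extract_answer` helper are illustrative assumptions, not the authors' implementation.

```python
import collections

def extract_answer(generation: str) -> str:
    # Hypothetical helper: treat the last non-empty line of a
    # chain-of-thought generation as the final answer string.
    lines = [ln.strip() for ln in generation.splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def ensemble_refinement(model, question: str,
                        n_samples: int = 8, n_refinements: int = 4) -> str:
    """Minimal sketch of a sample-then-refine prompting loop.

    `model` is assumed to be a callable taking a prompt and a
    sampling temperature and returning generated text.
    """
    # Stage 1: sample several reasoning chains stochastically so the
    # reasoning paths differ across samples.
    chains = [model(question, temperature=0.7) for _ in range(n_samples)]

    # Stage 2: condition the model on the question plus its own sampled
    # reasoning and ask it to produce refined answers.
    context = question + "\n\nCandidate reasoning:\n" + "\n---\n".join(chains)
    refined = [model(context, temperature=0.7) for _ in range(n_refinements)]

    # Final answer: plurality vote over the refined generations.
    votes = collections.Counter(extract_answer(r) for r in refined)
    return votes.most_common(1)[0][0]
```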