Towards Expert-Level Medical Question Answering with Large Language Models

May 16, 2023
作者: Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Nenad Tomasev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara Mahdavi, Joelle Barral, Dale Webster, Greg S. Corrado, Yossi Matias, Shekoofeh Azizi, Alan Karthikesalingam, Vivek Natarajan
cs.AI

Abstract

Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein-folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score in US Medical Licensing Examination (USMLE) style questions with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state-of-the-art. We also observed performance approaching or exceeding state-of-the-art across MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.
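The abstract names "ensemble refinement" among the prompting strategies but gives no details. Below is a minimal Python sketch of the two-stage idea the full paper describes: sample several chain-of-thought drafts at non-zero temperature, then condition the model on its own drafts and take a plurality vote over refined answers. The `generate` callable, the prompt wording, and the sample counts here are placeholder assumptions, not the paper's exact setup.

```python
import collections
from typing import Callable, List


def ensemble_refinement(
    question: str,
    generate: Callable[[str, float], str],  # hypothetical LLM client: (prompt, temperature) -> text
    num_drafts: int = 11,       # placeholder count, not necessarily the paper's value
    num_refinements: int = 33,  # placeholder count, not necessarily the paper's value
) -> str:
    """Two-stage ensemble refinement, sketched from the paper's description."""
    # Stage 1: sample several chain-of-thought answers at non-zero temperature
    # so the reasoning paths differ across drafts.
    drafts: List[str] = [
        generate(f"Question: {question}\nExplain your reasoning, then answer.", 0.7)
        for _ in range(num_drafts)
    ]

    # Stage 2: condition the model on the question plus its own drafts and ask
    # for a refined answer; repeat and take a plurality vote over the outputs.
    context = "\n\n".join(f"Draft {i + 1}: {d}" for i, d in enumerate(drafts))
    refine_prompt = (
        f"Question: {question}\n\n"
        f"Candidate answers with reasoning:\n{context}\n\n"
        "Considering the drafts above, give a single refined final answer."
    )
    refined = [generate(refine_prompt, 0.7) for _ in range(num_refinements)]
    answer, _ = collections.Counter(refined).most_common(1)[0]
    return answer


if __name__ == "__main__":
    # Stub model for demonstration; a real deployment would call an LLM API here.
    def fake_generate(prompt: str, temperature: float) -> str:
        return "B"  # constant output, so the plurality vote trivially returns "B"

    print(ensemble_refinement("Sample USMLE-style question?", fake_generate))
```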
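The p < 0.001 results come from pairwise preference rankings over 1066 consumer medical questions. The abstract does not state which statistical test was used; one plausible, commonly used choice for paired preference data is a two-sided binomial sign test, sketched below. The counts in the example are illustrative only, not the paper's data.

```python
from scipy.stats import binomtest


def preference_significance(prefer_model: int, prefer_physician: int) -> float:
    """Two-sided sign test on pairwise preferences, ignoring ties: under the
    null hypothesis, either answer is preferred with probability 0.5."""
    n = prefer_model + prefer_physician
    return binomtest(prefer_model, n, p=0.5, alternative="two-sided").pvalue


# Illustrative split (hypothetical): of 1066 rankings, suppose 640 favored the
# model's answer and 426 the physician's. This split yields p far below 0.001.
p = preference_significance(640, 426)
print(f"p = {p:.2e}")
```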