Towards Expert-Level Medical Question Answering with Large Language Models

Abstract

Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein folding. The capability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score on US Medical Licensing Examination (USMLE)-style questions, with a score of 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when models' answers were compared to clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by leveraging a combination of base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19 percentage points and setting a new state of the art. We also observed performance approaching or exceeding the state of the art across the MedMCQA, PubMedQA, and MMLU clinical topics datasets. We performed detailed human evaluations on long-form questions along multiple axes relevant to clinical applications. In pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements compared to Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions designed to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.
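
The MedQA improvement quoted above can be read either as an absolute or as a relative gain. The following is a minimal sketch of that arithmetic, using only the two scores reported in the abstract; the variable names are illustrative and do not come from the paper.

```python
# Scores reported in the abstract (percent accuracy on MedQA).
med_palm_score = 67.2
med_palm_2_score = 86.5

# Absolute gain, measured in percentage points.
absolute_gain = med_palm_2_score - med_palm_score

# Relative gain, measured as a percentage of the Med-PaLM score.
relative_gain = absolute_gain / med_palm_score * 100

print(f"absolute gain: {absolute_gain:.1f} percentage points")
print(f"relative gain: {relative_gain:.1f}%")
```

The "over 19%" in the abstract matches the absolute gain of about 19.3 points; the relative gain over the Med-PaLM score is larger, roughly 28.7%.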
