MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports

Abstract

Doctors and patients alike increasingly use Large Language Models (LLMs) to diagnose clinical cases. However, unlike domains such as math or coding, where correctness can be objectively defined by the final answer, medical diagnosis requires both the outcome and the reasoning process to be accurate. Widely used medical benchmarks such as MedQA and MMLU currently assess only the accuracy of the final answer, overlooking the quality and faithfulness of the clinical reasoning process. To address this limitation, we introduce MedCaseReasoning, the first open-access dataset for evaluating LLMs on their ability to align with clinician-authored diagnostic reasoning. The dataset includes 14,489 diagnostic question-and-answer cases, each paired with detailed reasoning statements derived from open-access medical case reports. We evaluate state-of-the-art reasoning LLMs on MedCaseReasoning and find significant shortcomings in their diagnoses and reasoning: for instance, the top-performing open-source model, DeepSeek-R1, achieves only 48% 10-shot diagnostic accuracy and mentions only 64% of the clinician reasoning statements (recall). However, we demonstrate that fine-tuning LLMs on the reasoning traces derived from MedCaseReasoning significantly improves diagnostic accuracy and clinical reasoning recall, with average relative gains of 29% and 41%, respectively. The open-source dataset, code, and models are available at https://github.com/kevinwu23/Stanford-MedCaseReasoning.
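The abstract quotes two kinds of numbers: a reasoning recall (the share of clinician reasoning statements that a model's reasoning mentions) and a relative gain after fine-tuning. The sketch below shows one plausible way to compute each. It assumes that a judge has already marked every clinician statement as mentioned or not, and it does not reproduce how the authors make that judgment. The function names are hypothetical and are not taken from the released code, and the example values are illustrative rather than results from the paper.

```python
from typing import Sequence


def reasoning_recall(mentioned: Sequence[bool]) -> float:
    """Fraction of clinician reasoning statements marked as mentioned.

    `mentioned[i]` is True when statement i was judged to appear in the
    model's reasoning. How that judgment is made is outside this sketch.
    """
    if not mentioned:
        raise ValueError("at least one reasoning statement is required")
    return sum(mentioned) / len(mentioned)


def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a fraction."""
    if before == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (after - before) / before


# Illustrative values only: 3 of 4 statements mentioned,
# and a metric that rises from 0.5 to 0.75.
example_recall = reasoning_recall([True, True, False, True])
example_gain = relative_gain(0.5, 0.75)
```

Note that a relative gain differs from an absolute one: going from 0.5 to 0.75 is a 25-point absolute increase but a 50% relative gain.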
