MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports
May 16, 2025
Authors: Kevin Wu, Eric Wu, Rahul Thapa, Kevin Wei, Angela Zhang, Arvind Suresh, Jacqueline J. Tao, Min Woo Sun, Alejandro Lozano, James Zou
cs.AI
Abstract
Doctors and patients alike increasingly use Large Language Models (LLMs) to
diagnose clinical cases. However, unlike domains such as math or coding, where
correctness can be objectively defined by the final answer, medical diagnosis
requires both the outcome and the reasoning process to be accurate. Currently,
widely used medical benchmarks like MedQA and MMLU assess only accuracy in the
final answer, overlooking the quality and faithfulness of the clinical
reasoning process. To address this limitation, we introduce MedCaseReasoning,
the first open-access dataset for evaluating LLMs on their ability to align
with clinician-authored diagnostic reasoning. The dataset includes 14,489
diagnostic question-and-answer cases, each paired with detailed reasoning
statements derived from open-access medical case reports. We evaluate
state-of-the-art reasoning LLMs on MedCaseReasoning and find significant
shortcomings in their diagnoses and reasoning: for instance, the top-performing
open-source model, DeepSeek-R1, achieves only 48% 10-shot diagnostic accuracy
and mentions only 64% of the clinician reasoning statements (recall). However,
we demonstrate that fine-tuning LLMs on the reasoning traces derived from
MedCaseReasoning significantly improves diagnostic accuracy and clinical
reasoning recall, with average relative gains of 29% and 41%, respectively. The
open-source dataset, code, and models are available at
https://github.com/kevinwu23/Stanford-MedCaseReasoning.
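The reasoning recall reported above measures the fraction of clinician-authored reasoning statements that a model's reasoning trace covers. Below is a minimal sketch of that metric, not the authors' implementation: the function name `reasoning_recall` and the `naive_judge` matcher are illustrative, and the paper scores coverage with an LLM judge rather than substring matching.

```python
# Sketch of reasoning-statement recall: the fraction of clinician-authored
# statements that a model's reasoning trace mentions. The coverage check is
# pluggable; the paper uses an LLM judge, and the naive substring matcher
# below is only a hypothetical stand-in.
from typing import Callable, Sequence


def reasoning_recall(
    statements: Sequence[str],
    trace: str,
    is_covered: Callable[[str, str], bool],
) -> float:
    """Return the fraction of clinician statements covered by the trace."""
    if not statements:
        return 0.0
    return sum(is_covered(s, trace) for s in statements) / len(statements)


def naive_judge(statement: str, trace: str) -> bool:
    # Case-insensitive substring match; a real evaluation would ask an
    # LLM judge whether the trace entails the statement.
    return statement.lower() in trace.lower()


# Hypothetical usage on a single case:
recall = reasoning_recall(
    ["elevated troponin suggests myocardial injury"],
    "The elevated troponin suggests myocardial injury rather than a "
    "pulmonary cause.",
    naive_judge,
)
print(recall)  # 1.0
```

Averaging this per-case recall over the dataset yields the aggregate figure quoted for DeepSeek-R1 (64%).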