ChatPaper.ai

MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports

May 16, 2025
作者: Kevin Wu, Eric Wu, Rahul Thapa, Kevin Wei, Angela Zhang, Arvind Suresh, Jacqueline J. Tao, Min Woo Sun, Alejandro Lozano, James Zou
cs.AI

Abstract

Doctors and patients alike increasingly use Large Language Models (LLMs) to diagnose clinical cases. However, unlike domains such as math or coding, where correctness can be objectively defined by the final answer, medical diagnosis requires both the outcome and the reasoning process to be accurate. Currently, widely used medical benchmarks like MedQA and MMLU assess only accuracy in the final answer, overlooking the quality and faithfulness of the clinical reasoning process. To address this limitation, we introduce MedCaseReasoning, the first open-access dataset for evaluating LLMs on their ability to align with clinician-authored diagnostic reasoning. The dataset includes 14,489 diagnostic question-and-answer cases, each paired with detailed reasoning statements derived from open-access medical case reports. We evaluate state-of-the-art reasoning LLMs on MedCaseReasoning and find significant shortcomings in their diagnoses and reasoning: for instance, the top-performing open-source model, DeepSeek-R1, achieves only 48% 10-shot diagnostic accuracy and mentions only 64% of the clinician reasoning statements (recall). However, we demonstrate that fine-tuning LLMs on the reasoning traces derived from MedCaseReasoning significantly improves diagnostic accuracy and clinical reasoning recall by an average relative gain of 29% and 41%, respectively. The open-source dataset, code, and models are available at https://github.com/kevinwu23/Stanford-MedCaseReasoning.
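The abstract reports "recall" as the fraction of clinician-authored reasoning statements that a model's reasoning trace mentions. As a rough sketch only (the paper's actual grading pipeline is not described here and presumably uses a stronger matching method, such as an LLM judge), the metric's shape can be illustrated with a hypothetical token-overlap check:

```python
def reasoning_recall(clinician_statements, model_trace, is_covered):
    """Fraction of clinician reasoning statements covered by the model's trace."""
    if not clinician_statements:
        return 0.0
    covered = sum(1 for s in clinician_statements if is_covered(s, model_trace))
    return covered / len(clinician_statements)

def overlap_covered(statement, trace, threshold=0.6):
    """Naive stand-in for the coverage check: a statement counts as mentioned
    if enough of its tokens appear in the model's reasoning trace."""
    tokens = set(statement.lower().split())
    trace_tokens = set(trace.lower().split())
    return len(tokens & trace_tokens) / len(tokens) >= threshold

statements = ["fever suggests infection", "rash on palms"]
trace = "the fever suggests infection is likely"
print(reasoning_recall(statements, trace, overlap_covered))  # 0.5
```

Both `reasoning_recall` and `overlap_covered` are illustrative names, not functions from the released codebase; the point is only that recall is computed per clinician statement and averaged over the case's statement set.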


PDF · May 20, 2025