
MedCaseReasoning: Evaluating and Learning Diagnostic Reasoning from Clinical Case Reports

May 16, 2025
作者: Kevin Wu, Eric Wu, Rahul Thapa, Kevin Wei, Angela Zhang, Arvind Suresh, Jacqueline J. Tao, Min Woo Sun, Alejandro Lozano, James Zou
cs.AI

摘要

Abstract
Doctors and patients alike increasingly use Large Language Models (LLMs) to diagnose clinical cases. However, unlike domains such as math or coding, where correctness can be objectively defined by the final answer, medical diagnosis requires both the outcome and the reasoning process to be accurate. Currently, widely used medical benchmarks like MedQA and MMLU assess only accuracy in the final answer, overlooking the quality and faithfulness of the clinical reasoning process. To address this limitation, we introduce MedCaseReasoning, the first open-access dataset for evaluating LLMs on their ability to align with clinician-authored diagnostic reasoning. The dataset includes 14,489 diagnostic question-and-answer cases, each paired with detailed reasoning statements derived from open-access medical case reports. We evaluate state-of-the-art reasoning LLMs on MedCaseReasoning and find significant shortcomings in their diagnoses and reasoning: for instance, the top-performing open-source model, DeepSeek-R1, achieves only 48% 10-shot diagnostic accuracy and mentions only 64% of the clinician reasoning statements (recall). However, we demonstrate that fine-tuning LLMs on the reasoning traces derived from MedCaseReasoning significantly improves diagnostic accuracy and clinical reasoning recall by an average relative gain of 29% and 41%, respectively. The open-source dataset, code, and models are available at https://github.com/kevinwu23/Stanford-MedCaseReasoning.
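The two metrics the abstract reports can be sketched concretely. The following is a minimal, hypothetical illustration: reasoning recall is taken as the fraction of clinician-authored reasoning statements mentioned in a model's output, and relative gain as the improvement divided by the baseline. The naive substring matching used here is an assumption for illustration only; the paper's actual judging procedure may differ.

```python
def reasoning_recall(clinician_statements, model_output):
    """Fraction of clinician reasoning statements mentioned in the model output.

    Matching here is naive case-insensitive substring containment,
    a simplifying assumption for this sketch.
    """
    if not clinician_statements:
        return 0.0
    text = model_output.lower()
    mentioned = sum(1 for s in clinician_statements if s.lower() in text)
    return mentioned / len(clinician_statements)


def relative_gain(before, after):
    """Relative improvement of a metric over its baseline value."""
    return (after - before) / before


# Toy example with invented statements:
statements = [
    "elevated troponin suggests cardiac injury",
    "chest pain radiating to the left arm",
]
output = "The elevated troponin suggests cardiac injury in this patient."
print(reasoning_recall(statements, output))  # 0.5 (1 of 2 statements mentioned)
print(round(relative_gain(0.48, 0.62), 2))   # 0.29, i.e. ~29% relative gain
```

Under this reading, a move from 48% to roughly 62% diagnostic accuracy corresponds to the ~29% average relative gain reported above.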
