ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
June 11, 2025
Authors: Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Yu Rong, Wenbing Huang, Qifeng Bai, Tingyang Xu
cs.AI
Abstract
Though reasoning-based large language models (LLMs) have excelled in
mathematics and programming, their capabilities in knowledge-intensive medical
question answering remain underexplored. To address this, we introduce
ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality
examples distilled from 1.7 million initial reasoning paths generated by
various LLMs. ReasonMed is constructed through a multi-agent
verification and refinement process, where we design an Error Refiner
to enhance the reasoning paths by identifying and correcting error-prone steps
flagged by a verifier. Leveraging ReasonMed, we systematically investigate best
practices for training medical reasoning models and find that combining
detailed Chain-of-Thought (CoT) reasoning with concise answer summaries yields
the most effective fine-tuning strategy. Based on this strategy, we train
ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the
prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.
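
To make the multi-agent verification and refinement process concrete, below is a minimal Python sketch of the distillation loop: a verifier flags error-prone steps in a candidate reasoning path, and an Error Refiner rewrites only the flagged steps before re-verification. Every class and function name here (ReasoningPath, Verifier, ErrorRefiner, distill) is a hypothetical stand-in; the abstract does not specify the paper's actual agent interfaces or prompts.

```python
# Minimal sketch of a verify-then-refine distillation loop.
# All names are hypothetical; in the paper, the verifier and refiner
# would be LLM agents rather than the stubs shown here.
from dataclasses import dataclass, field


@dataclass
class ReasoningPath:
    question: str
    steps: list[str]
    answer: str
    flagged: list[int] = field(default_factory=list)  # indices of error-prone steps


class Verifier:
    """Flags steps whose logic or medical facts look wrong (stub)."""

    def check(self, path: ReasoningPath) -> list[int]:
        # In practice an LLM judge would return flagged step indices.
        return []


class ErrorRefiner:
    """Rewrites only the steps the verifier flagged (stub)."""

    def refine(self, path: ReasoningPath) -> ReasoningPath:
        for i in path.flagged:
            path.steps[i] = f"[revised] {path.steps[i]}"
        path.flagged = []
        return path


def distill(paths: list[ReasoningPath], verifier: Verifier,
            refiner: ErrorRefiner, max_rounds: int = 2) -> list[ReasoningPath]:
    """Keep paths that pass verification, refining flagged ones up to max_rounds."""
    kept = []
    for path in paths:
        for _ in range(max_rounds):
            path.flagged = verifier.check(path)
            if not path.flagged:          # clean path: keep it
                kept.append(path)
                break
            path = refiner.refine(path)   # fix flagged steps, then re-verify
    return kept                           # paths still flagged are discarded
```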
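
The reported best fine-tuning strategy pairs detailed Chain-of-Thought reasoning with a concise answer summary in each training example. The sketch below shows one hypothetical way to lay out such an example; the field names and formatting are assumptions for illustration, not the paper's actual data schema.

```python
# Hypothetical formatting of one fine-tuning example: detailed CoT steps
# followed by a concise answer summary, per the strategy the abstract
# reports as most effective.
def format_example(question: str, cot_steps: list[str], summary: str) -> dict:
    reasoning = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(cot_steps))
    return {
        "prompt": question,
        "response": f"{reasoning}\n\nAnswer: {summary}",
    }


example = format_example(
    question="Which vitamin deficiency causes scurvy?",
    cot_steps=[
        "Scurvy presents with bleeding gums, petechiae, and poor wound healing.",
        "These signs reflect defective collagen synthesis.",
        "Collagen hydroxylation requires ascorbic acid as a cofactor.",
    ],
    summary="Vitamin C (ascorbic acid) deficiency.",
)
print(example["response"])
```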