ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

Abstract

Although reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is constructed through a multi-agent verification and refinement process, in which a verifier flags error-prone steps in each reasoning path and an Error Refiner that we design identifies and corrects those steps. Leveraging ReasonMed, we systematically investigate best practices for training medical reasoning models and find that combining detailed Chain-of-Thought (CoT) reasoning with concise answer summaries is the most effective fine-tuning strategy. Based on this strategy, we train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.
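The verify-then-refine idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `ReasoningPath` structure, the `curate` function, the `max_flags` threshold, and the toy verifier and refiner are hypothetical stand-ins, not the authors' pipeline or prompts. In practice the two agents would be LLM calls rather than string checks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReasoningPath:
    question: str
    steps: List[str]
    answer: str


# Hypothetical agent signatures (assumptions, not the paper's API).
Verifier = Callable[[ReasoningPath], List[int]]  # indices of flagged steps
Refiner = Callable[[ReasoningPath, List[int]], ReasoningPath]


def curate(paths: List[ReasoningPath], verifier: Verifier, refiner: Refiner,
           max_flags: int = 2) -> List[ReasoningPath]:
    """Keep clean paths, refine lightly flawed ones, drop the rest.

    max_flags is an illustrative threshold, not a value from the source.
    """
    kept = []
    for path in paths:
        flagged = verifier(path)
        if not flagged:
            kept.append(path)  # passes verification as-is
        elif len(flagged) <= max_flags:
            refined = refiner(path, flagged)
            if not verifier(refined):  # re-check after refinement
                kept.append(refined)
        # paths with too many flagged steps are discarded
    return kept


# Toy stand-ins for the LLM agents: a step containing "???" counts as an error.
def toy_verifier(path: ReasoningPath) -> List[int]:
    return [i for i, step in enumerate(path.steps) if "???" in step]


def toy_refiner(path: ReasoningPath, flagged: List[int]) -> ReasoningPath:
    steps = [s.replace("???", "[revised]") if i in flagged else s
             for i, s in enumerate(path.steps)]
    return ReasoningPath(path.question, steps, path.answer)
```

Re-running the verifier on the refined path is a design choice of this sketch: it keeps the refiner from silently passing through a path it failed to fix.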
