ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

Abstract

Although reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370K high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is constructed through a multi-agent verification and refinement process, in which we design an Error Refiner that improves reasoning paths by identifying and correcting the error-prone steps flagged by a verifier. Leveraging ReasonMed, we systematically investigate best practices for training medical reasoning models and find that combining detailed Chain-of-Thought (CoT) reasoning with concise answer summaries is the most effective fine-tuning strategy. Based on this strategy, we train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60%.
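As a rough illustration of the verify-then-refine idea and of a training target that pairs detailed CoT with a concise summary, the following minimal sketch may help. It is not the authors' implementation: the `ReasoningPath` structure, the `Verifier` and `Refiner` interfaces, the toy stand-ins, and the target template are all hypothetical, and a real pipeline would call LLM agents where the toy functions appear.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class ReasoningPath:
    """Hypothetical container for one candidate reasoning path."""
    question: str
    steps: List[str]
    answer: str


# Assumed interfaces: a verifier returns the indices of steps it considers
# error-prone; a refiner returns a corrected version of one flagged step.
Verifier = Callable[[ReasoningPath], List[int]]
Refiner = Callable[[ReasoningPath, int], str]


def refine_path(path: ReasoningPath, verifier: Verifier, refiner: Refiner) -> ReasoningPath:
    """Rewrite only the steps the verifier flags; leave the rest untouched."""
    steps = list(path.steps)  # copy so the input path is not mutated
    for i in verifier(path):
        steps[i] = refiner(path, i)
    return replace(path, steps=steps)


def to_training_target(path: ReasoningPath, summary: str) -> str:
    """Join detailed CoT steps with a concise summary (template is an assumption)."""
    cot = "\n".join(f"Step {n}: {s}" for n, s in enumerate(path.steps, start=1))
    return f"{cot}\nSummary: {summary}\nAnswer: {path.answer}"


# Toy stand-ins for the LLM agents, used only to make the sketch runnable.
def toy_verifier(path: ReasoningPath) -> List[int]:
    return [i for i, s in enumerate(path.steps) if "[unsure]" in s]


def toy_refiner(path: ReasoningPath, i: int) -> str:
    return path.steps[i].replace("[unsure]", "[revised]")
```

In this sketch only flagged steps are rewritten, so an already clean path passes through unchanged, and the summary is supplied separately from the step-by-step reasoning.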
