Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL

Abstract

Improving performance on complex tasks and enabling interpretable decision making in large language models (LLMs), especially for clinical applications, requires effective reasoning. Yet this remains challenging without supervised fine-tuning (SFT) on costly chain-of-thought (CoT) data distilled from closed-source models (e.g., GPT-4o). In this work, we present AlphaMed, the first medical LLM to show that reasoning capability can emerge purely through reinforcement learning (RL), using minimalist rule-based rewards on public multiple-choice QA datasets, without relying on SFT or distilled CoT data. AlphaMed achieves state-of-the-art results on six medical QA benchmarks, outperforming models trained with conventional SFT+RL pipelines. On challenging benchmarks (e.g., MedXpert), AlphaMed even surpasses larger or closed-source models such as DeepSeek-V3-671B and Claude-3.5-Sonnet. To understand the factors behind this success, we conduct a comprehensive data-centric analysis guided by three questions: (i) Can minimalist rule-based RL incentivize reasoning without distilled CoT supervision? (ii) How do dataset quantity and diversity impact reasoning? (iii) How does question difficulty shape the emergence and generalization of reasoning? Our findings show that dataset informativeness is a key driver of reasoning performance, and that minimalist RL on informative, multiple-choice QA data is effective at inducing reasoning without CoT supervision. We also observe divergent trends across benchmarks, underscoring limitations in current evaluation and the need for more challenging, reasoning-oriented medical QA benchmarks.
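To make the phrase "minimalist rule-based rewards" concrete, the following is a minimal sketch of what such a reward could look like for multiple-choice QA. It is an illustration, not the authors' implementation: the `<answer>...</answer>` output convention, the binary 0/1 scoring, and the names `ANSWER_PATTERN` and `rule_based_reward` are all assumptions introduced here.

```python
import re

# Assumed (hypothetical) output convention: the model wraps its final choice
# in <answer>...</answer>, optionally with parentheses around the letter.
ANSWER_PATTERN = re.compile(r"<answer>\s*\(?([A-Za-z])\)?\s*</answer>")


def rule_based_reward(completion: str, gold_choice: str) -> float:
    """Return 1.0 if the last tagged answer letter matches the gold choice, else 0.0.

    No learned reward model and no chain-of-thought supervision is involved:
    the only signal is whether the final choice is correct.
    """
    matches = ANSWER_PATTERN.findall(completion)
    if not matches:
        # A missing or malformed answer tag earns no reward.
        return 0.0
    predicted = matches[-1].upper()
    return 1.0 if predicted == gold_choice.strip().upper() else 0.0
```

Under these assumptions, a completion that reasons freely and ends with `<answer>B</answer>` receives a reward of 1.0 when the gold label is `B`, and any completion without a parseable answer tag receives 0.0. The reasoning text itself is never scored, which is what allows training on plain multiple-choice data without distilled CoT annotations.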
