Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL

Abstract

Improving performance on complex tasks and enabling interpretable decision making in large language models (LLMs), especially for clinical applications, requires effective reasoning. Yet this remains challenging without supervised fine-tuning (SFT) on costly chain-of-thought (CoT) data distilled from closed-source models (e.g., GPT-4o). In this work, we present AlphaMed, the first medical LLM to show that reasoning capability can emerge purely through reinforcement learning (RL), using minimalist rule-based rewards on public multiple-choice QA datasets, without relying on SFT or distilled CoT data. AlphaMed achieves state-of-the-art results on six medical QA benchmarks, outperforming models trained with conventional SFT+RL pipelines. On challenging benchmarks (e.g., MedXpert), AlphaMed even surpasses larger or closed-source models such as DeepSeek-V3-671B and Claude-3.5-Sonnet. To understand the factors behind this success, we conduct a comprehensive data-centric analysis guided by three questions: (i) Can minimalist rule-based RL incentivize reasoning without distilled CoT supervision? (ii) How do dataset quantity and diversity impact reasoning? (iii) How does question difficulty shape the emergence and generalization of reasoning? Our findings show that dataset informativeness is a key driver of reasoning performance, and that minimalist RL on informative, multiple-choice QA data is effective at inducing reasoning without CoT supervision. We also observe divergent trends across benchmarks, underscoring limitations in current evaluation and the need for more challenging, reasoning-oriented medical QA benchmarks.
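As a rough illustration of what a minimalist rule-based reward on multiple-choice QA could look like, the sketch below scores a model completion by comparing its final stated option letter with the gold label. The abstract does not specify the reward's exact form, so the `Answer: <letter>` output convention, the binary 1.0/0.0 values, and the names `extract_choice` and `rule_based_reward` are all assumptions made for illustration, not the authors' implementation.

```python
import re
from typing import Optional

# Assumed output convention (not taken from the paper): the completion
# states its final choice as "Answer: <letter>", optionally in parentheses.
# The negative lookahead rejects words such as "Answer: The ...".
ANSWER_RE = re.compile(r"Answer:\s*\(?([A-Z])\)?(?![A-Za-z])", re.IGNORECASE)


def extract_choice(completion: str) -> Optional[str]:
    """Return the last option letter stated in the completion, or None."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1].upper() if matches else None


def rule_based_reward(completion: str, gold: str) -> float:
    """Hypothetical binary reward: 1.0 for a matching letter, else 0.0."""
    choice = extract_choice(completion)
    if choice is None:
        return 0.0  # no parsable answer earns no reward
    return 1.0 if choice == gold.strip().upper() else 0.0
```

Because a reward of this kind checks only the final choice, no chain-of-thought annotation is needed to compute it; any reasoning text that precedes the answer is left unconstrained.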
