Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL
May 23, 2025
作者: Che Liu, Haozhe Wang, Jiazhen Pan, Zhongwei Wan, Yong Dai, Fangzhen Lin, Wenjia Bai, Daniel Rueckert, Rossella Arcucci
cs.AI
Abstract
Improving performance on complex tasks and enabling interpretable decision
making in large language models (LLMs), especially for clinical applications,
requires effective reasoning. Yet this remains challenging without supervised
fine-tuning (SFT) on costly chain-of-thought (CoT) data distilled from
closed-source models (e.g., GPT-4o). In this work, we present AlphaMed, the
first medical LLM to show that reasoning capability can emerge purely through
reinforcement learning (RL), using minimalist rule-based rewards on public
multiple-choice QA datasets, without relying on SFT or distilled CoT data.
AlphaMed achieves state-of-the-art results on six medical QA benchmarks,
outperforming models trained with conventional SFT+RL pipelines. On challenging
benchmarks (e.g., MedXpert), AlphaMed even surpasses larger or closed-source
models such as DeepSeek-V3-671B and Claude-3.5-Sonnet. To understand the
factors behind this success, we conduct a comprehensive data-centric analysis
guided by three questions: (i) Can minimalist rule-based RL incentivize
reasoning without distilled CoT supervision? (ii) How do dataset quantity and
diversity impact reasoning? (iii) How does question difficulty shape the
emergence and generalization of reasoning? Our findings show that dataset
informativeness is a key driver of reasoning performance, and that minimalist
RL on informative, multiple-choice QA data is effective at inducing reasoning
without CoT supervision. We also observe divergent trends across benchmarks,
underscoring limitations in current evaluation and the need for more
challenging, reasoning-oriented medical QA benchmarks.
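The abstract describes "minimalist rule-based rewards on public multiple-choice QA datasets" but does not spell out the reward function. A minimal sketch of what such a rule-based reward could look like is shown below; the function name, the answer-extraction regex, and the A–E choice format are illustrative assumptions, not the paper's actual implementation:

```python
import re

def rule_based_reward(completion: str, gold_choice: str) -> float:
    """Illustrative minimalist rule-based reward for multiple-choice QA.

    Returns 1.0 if the answer letter extracted from the model's completion
    matches the gold choice, else 0.0. No learned reward model and no
    chain-of-thought supervision is involved; correctness of the final
    choice is the only signal.
    """
    # Prefer an explicit final-answer pattern, e.g. "Answer: B".
    match = re.search(r"[Aa]nswer\s*:?\s*\(?([A-E])\)?", completion)
    if match is not None:
        predicted = match.group(1)
    else:
        # Fall back to the last standalone choice letter in the text.
        letters = re.findall(r"\b([A-E])\b", completion)
        predicted = letters[-1] if letters else None
    return 1.0 if predicted == gold_choice.upper() else 0.0
```

Under RL (e.g. a policy-gradient method), this sparse binary signal rewards completions whose final choice is correct, so any intermediate reasoning the model produces is shaped only indirectly, which is the sense in which reasoning "emerges" rather than being supervised.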