

Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL

May 23, 2025
作者: Che Liu, Haozhe Wang, Jiazhen Pan, Zhongwei Wan, Yong Dai, Fangzhen Lin, Wenjia Bai, Daniel Rueckert, Rossella Arcucci
cs.AI

Abstract

Improving performance on complex tasks and enabling interpretable decision making in large language models (LLMs), especially for clinical applications, requires effective reasoning. Yet this remains challenging without supervised fine-tuning (SFT) on costly chain-of-thought (CoT) data distilled from closed-source models (e.g., GPT-4o). In this work, we present AlphaMed, the first medical LLM to show that reasoning capability can emerge purely through reinforcement learning (RL), using minimalist rule-based rewards on public multiple-choice QA datasets, without relying on SFT or distilled CoT data. AlphaMed achieves state-of-the-art results on six medical QA benchmarks, outperforming models trained with conventional SFT+RL pipelines. On challenging benchmarks (e.g., MedXpert), AlphaMed even surpasses larger or closed-source models such as DeepSeek-V3-671B and Claude-3.5-Sonnet. To understand the factors behind this success, we conduct a comprehensive data-centric analysis guided by three questions: (i) Can minimalist rule-based RL incentivize reasoning without distilled CoT supervision? (ii) How do dataset quantity and diversity impact reasoning? (iii) How does question difficulty shape the emergence and generalization of reasoning? Our findings show that dataset informativeness is a key driver of reasoning performance, and that minimalist RL on informative, multiple-choice QA data is effective at inducing reasoning without CoT supervision. We also observe divergent trends across benchmarks, underscoring limitations in current evaluation and the need for more challenging, reasoning-oriented medical QA benchmarks.
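The abstract does not specify the exact form of the "minimalist rule-based rewards," so the following is only a hedged sketch of how such a reward is commonly defined for multiple-choice QA: a binary signal that is 1 when the answer letter extracted from the model's output matches the gold choice and 0 otherwise. The function name, extraction regex, and answer format are illustrative assumptions, not the authors' implementation.

```python
import re


def rule_based_reward(model_output: str, gold_choice: str) -> float:
    """Hypothetical minimalist rule-based reward for multiple-choice QA.

    Returns 1.0 if the answer letter extracted from the model output
    matches the gold choice, else 0.0. This is an illustrative sketch,
    not the reward used in the AlphaMed paper.
    """
    # Prefer an explicit "Answer: X" pattern (X in A-E).
    match = re.search(r"[Aa]nswer\s*[:\-]?\s*\(?([A-E])\)?", model_output)
    if match is not None:
        predicted = match.group(1)
    else:
        # Fall back to the last standalone option letter in the output.
        letters = re.findall(r"\b([A-E])\b", model_output)
        if not letters:
            return 0.0
        predicted = letters[-1]
    return 1.0 if predicted.upper() == gold_choice.upper() else 0.0


# Usage: a correct prediction earns reward 1.0, anything else earns 0.0.
print(rule_based_reward("The findings suggest sarcoidosis. Answer: C", "C"))  # 1.0
print(rule_based_reward("Most consistent with option B.", "C"))               # 0.0
```

A reward of this kind requires no reference chain-of-thought: it scores only the final answer, which is why it can be applied directly to public multiple-choice QA datasets without distilled CoT supervision.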
