Beyond Correctness: Learning Robust Reasoning via Transfer
February 9, 2026
Authors: Hyunseok Lee, Soheil Abbasloo, Jihoon Tack, Jinwoo Shin
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently strengthened LLM reasoning, but its focus on final-answer correctness leaves a critical gap: it does not ensure the robustness of the reasoning process itself. We adopt a simple philosophical view: robust reasoning should remain useful beyond the mind that produced it. Accordingly, we treat reasoning as a form of meaning transfer that must survive truncation, reinterpretation, and continuation. Building on this principle, we introduce Reinforcement Learning with Transferable Reward (RLTR), which operationalizes robustness via a transfer reward that tests whether a partial reasoning prefix produced by one model can guide a separate model to the correct answer. This encourages LLMs to produce reasoning that is stable, interpretable, and genuinely generalizable. Our approach improves both sampling consistency and final-answer accuracy, and it reaches comparable performance in substantially fewer training steps. For example, on MATH500, RLTR achieves a +3.6%p gain in Maj@64 over RLVR and matches RLVR's average accuracy with roughly 2.5x fewer training steps, providing both more reliable reasoning and significantly better sample efficiency.
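To make the transfer-reward idea described above concrete, here is a minimal Python sketch of one plausible reward computation: a reasoning prefix sampled from the policy model is handed to a separate continuation model, and the reward checks whether that model can still reach the correct answer. The function names, the fixed-fraction truncation scheme, and the answer-matching step are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed, not the authors' code) of a transfer reward:
# reward = 1 if a truncated reasoning prefix from the policy model lets a
# *separate* model finish with the correct final answer, else 0.

from typing import Callable


def transfer_reward(
    reasoning: str,
    question: str,
    gold_answer: str,
    continue_with_other_model: Callable[[str], str],  # e.g. greedy decoding by a frozen reference LLM
    extract_answer: Callable[[str], str],             # e.g. parse the final boxed answer from a completion
    prefix_fraction: float = 0.5,                     # assumed truncation scheme for illustration
) -> float:
    """Score a reasoning trace by whether its prefix transfers to another model."""
    # Keep only the first part of the policy model's chain of thought (truncation).
    cut = int(len(reasoning) * prefix_fraction)
    prefix = reasoning[:cut]

    # Ask the separate model to continue from the question plus the prefix (continuation).
    prompt = f"{question}\n{prefix}"
    continuation = continue_with_other_model(prompt)

    # Verifiable check against the ground-truth answer, as in RLVR.
    return 1.0 if extract_answer(continuation) == gold_answer else 0.0
```

In an RLTR-style training loop, this scalar would replace or augment the usual RLVR correctness reward for each sampled trace; how the prefix length is chosen and which model serves as the continuer are design choices the abstract does not specify.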