
Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models

October 24, 2025
Authors: Hoang Phan, Xianjun Yang, Kevin Yao, Jingyu Zhang, Shengjie Bi, Xiaocheng Tang, Madian Khabsa, Lijuan Liu, Deren Lei
cs.AI

Abstract

Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning and has become a standard post-training paradigm for contemporary language and vision-language models. However, the RLVR recipe carries a significant risk of capability regression: without regularization strategies, models forget foundational skills after prolonged training. We empirically confirm this concern, observing that open-source reasoning models suffer performance degradation on core capabilities such as perception and faithfulness. While regularization terms such as a KL divergence penalty can help prevent deviation from the base model, they are computed only on the current task and therefore do not guarantee the retention of broader knowledge. Meanwhile, with commonly used experience replay across heterogeneous domains, it is nontrivial to decide how much training focus each objective should receive. To address this, we propose RECAP, a replay strategy with dynamic objective reweighting for general knowledge preservation. The reweighting mechanism adapts online using short-horizon signals of convergence and instability, shifting the post-training focus away from saturated objectives and toward underperforming or volatile ones. Our method is end-to-end and readily applicable to existing RLVR pipelines without training additional models or heavy hyperparameter tuning. Extensive experiments with Qwen2.5-VL-3B and Qwen2.5-VL-7B demonstrate the effectiveness of our method, which not only preserves general capabilities but also improves reasoning by enabling more flexible trade-offs among in-task rewards.
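The abstract describes the dynamic reweighting mechanism only at a high level. The sketch below is one plausible reading of it, assuming per-objective rewards normalized to [0, 1]: a short window of recent rewards per objective yields a saturation signal (high recent mean) and an instability signal (high recent variance), and a softmax over these signals shifts weight toward underperforming or volatile objectives. The names (`ObjectiveTracker`, `reweight`), the window size, and the softmax scoring are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of dynamic objective reweighting from short-horizon
# reward statistics. All names and hyperparameters are hypothetical.
from collections import deque
import math


class ObjectiveTracker:
    """Tracks a short window of recent rewards for one training objective."""

    def __init__(self, window: int = 50):
        self.rewards = deque(maxlen=window)

    def update(self, reward: float) -> None:
        self.rewards.append(reward)

    def saturation(self) -> float:
        """Mean recent reward; a high value suggests the objective is saturated."""
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0

    def instability(self) -> float:
        """Standard deviation of recent rewards; a high value suggests volatility."""
        if len(self.rewards) < 2:
            return 0.0
        mean = self.saturation()
        var = sum((r - mean) ** 2 for r in self.rewards) / (len(self.rewards) - 1)
        return math.sqrt(var)


def reweight(trackers: dict[str, ObjectiveTracker],
             temperature: float = 1.0) -> dict[str, float]:
    """Softmax weights that favour underperforming or volatile objectives."""
    scores = {
        # Low recent mean (underperforming) or high variance (volatile)
        # both increase an objective's score and hence its weight.
        name: (1.0 - t.saturation()) + t.instability()
        for name, t in trackers.items()
    }
    exps = {name: math.exp(s / temperature) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}
```

In a training loop, one would update each tracker with the rewards observed for the corresponding replay objective and periodically recompute the weights to scale that objective's contribution to the mixed batch. Whether RECAP applies its weights to rewards, losses, or sampling probabilities is not specified in the abstract, so this sketch should be read as a conceptual illustration only.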