Beyond Reasoning Gains: Mitigating General Capabilities Forgetting in Large Reasoning Models
October 24, 2025
Authors: Hoang Phan, Xianjun Yang, Kevin Yao, Jingyu Zhang, Shengjie Bi, Xiaocheng Tang, Madian Khabsa, Lijuan Liu, Deren Lei
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) has delivered
impressive gains in mathematical and multimodal reasoning and has become a
standard post-training paradigm for contemporary language and vision-language
models. However, the RLVR recipe introduces a significant risk of capability
regression, where models forget foundational skills after prolonged training
without employing regularization strategies. We empirically confirm this
concern, observing that open-source reasoning models suffer performance
degradation on core capabilities such as perception and faithfulness. While
imposing regularization terms such as a KL divergence penalty can help prevent
deviation from the base model, these terms are computed only on the current
task and therefore do not guarantee retention of broader knowledge. Meanwhile,
with commonly used experience replay across heterogeneous domains, it is
nontrivial to decide how much training focus each objective should receive. To
address this, we propose RECAP, a replay strategy with dynamic objective
reweighting for general knowledge preservation.
Our reweighting mechanism adapts in an online manner using short-horizon
signals of convergence and instability, shifting the post-training focus away
from saturated objectives and toward underperforming or volatile ones. Our
method is end-to-end and readily applicable to existing RLVR pipelines without
training additional models or heavy tuning. Extensive benchmark experiments
with Qwen2.5-VL-3B and Qwen2.5-VL-7B demonstrate the effectiveness of our
method, which not only preserves general capabilities but also improves
reasoning by enabling more flexible trade-offs among in-task rewards.
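To make the dynamic reweighting described above more concrete, below is a minimal sketch of one way short-horizon reward statistics could drive per-objective replay weights. The abstract does not specify RECAP's actual signals or update rule, so the ObjectiveReweighter class and every concrete choice in it (the rolling window, the convergence and instability definitions, and the softmax temperature) are illustrative assumptions, not the authors' implementation.

```python
"""Illustrative-only sketch: window size, signal definitions, and the
softmax temperature are assumptions, not RECAP's published recipe."""
from collections import deque
import math


class ObjectiveReweighter:
    """Tracks short-horizon rewards per objective and emits replay weights."""

    def __init__(self, objectives, window=64, temperature=1.0):
        self.objectives = list(objectives)
        self.temperature = temperature
        # Rolling reward history per objective (short horizon only).
        self.history = {name: deque(maxlen=window) for name in self.objectives}

    def update(self, name, reward):
        """Record one scalar reward (assumed normalized to [0, 1]) for an objective."""
        self.history[name].append(float(reward))

    @staticmethod
    def _signals(rewards):
        """Short-horizon convergence (recent progress) and instability (std) signals."""
        if len(rewards) < 2:
            return 0.0, 1.0  # no evidence of saturation yet; treat as volatile
        half = len(rewards) // 2
        early = sum(rewards[:half]) / half
        late = sum(rewards[half:]) / (len(rewards) - half)
        progress = late - early  # near zero => the objective looks converged
        mean = sum(rewards) / len(rewards)
        std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
        return progress, std

    def weights(self):
        """Softmax weights that favor underperforming or volatile objectives."""
        scores = {}
        for name in self.objectives:
            rewards = list(self.history[name])
            progress, instability = self._signals(rewards)
            mean = sum(rewards) / len(rewards) if rewards else 0.0
            # An objective looks saturated when its recent rewards are high and
            # flat (small |progress|, small std); it then receives a low score.
            # Underperforming (low mean), volatile (high std), or still-moving
            # objectives receive high scores and thus more training focus.
            scores[name] = (1.0 - mean) + instability + abs(progress)
        logits = [scores[n] / self.temperature for n in self.objectives]
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        return {n: e / total for n, e in zip(self.objectives, exps)}
```

In an RLVR loop, weights() could be queried every few updates to decide how many replayed samples from each objective (e.g., perception, faithfulness, and the in-task reasoning reward) enter the next batch, or to scale each objective's loss term; the paper's actual schedule and signal definitions may differ.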