Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning
January 28, 2026
Authors: Minwu Kim, Safal Shrestha, Keith Ross
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning abilities of large language models (LLMs), yet training often stalls as problems become saturated. We identify the core challenge as the poor accessibility of informative failures: learning signals exist but are rarely encountered during standard rollouts. To address this, we propose failure-prefix conditioning, a simple and effective method for learning from saturated problems. Rather than starting from the original question, our approach reallocates exploration by conditioning training on prefixes derived from rare incorrect reasoning trajectories, thereby exposing the model to failure-prone states. We observe that failure-prefix conditioning yields performance gains matching those of training on medium-difficulty problems, while preserving token efficiency. Furthermore, we analyze the model's robustness, finding that our method reduces performance degradation under misleading failure prefixes, albeit with a mild trade-off in adherence to correct early reasoning. Finally, we demonstrate that an iterative approach, which refreshes failure prefixes during training, unlocks additional gains after performance plateaus. Overall, our results suggest that failure-prefix conditioning offers an effective pathway to extend RLVR training on saturated problems.
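The sketch below illustrates the core idea described in the abstract, not the authors' actual implementation: on a saturated problem, sample many rollouts, keep only the rare incorrect traces, truncate each to an early prefix, and use question-plus-prefix prompts to condition subsequent RLVR rollouts on failure-prone states. The helper names (`sample_rollout`, `is_correct`), the rollout count, and the prefix fraction are all illustrative assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class ConditionedPrompt:
    """A training prompt that starts the rollout from a failure-prone state."""
    question: str
    failure_prefix: str  # early portion of a rare incorrect reasoning trace

    def to_prompt(self) -> str:
        # The policy continues reasoning from the failure prefix rather than
        # from the original question alone.
        return f"{self.question}\n{self.failure_prefix}"


def build_failure_prefix_prompts(
    question: str,
    sample_rollout,                 # hypothetical: policy sampler returning a reasoning trace (str)
    is_correct,                     # hypothetical: verifier returning True if the trace's answer is correct
    num_rollouts: int = 256,        # saturated problems need many samples to surface rare failures
    prefix_fraction: float = 0.3,   # assumed: keep roughly the first 30% of each incorrect trace
) -> list[ConditionedPrompt]:
    """Collect rare incorrect rollouts on a saturated problem and turn their
    early reasoning into failure prefixes for conditioned training."""
    prompts = []
    for _ in range(num_rollouts):
        trace = sample_rollout(question)
        if is_correct(question, trace):
            continue  # saturated problem: most rollouts succeed and are skipped
        tokens = trace.split()
        cut = max(1, int(len(tokens) * prefix_fraction))
        prompts.append(ConditionedPrompt(question, " ".join(tokens[:cut])))
    return prompts


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    def sample_rollout(q):
        return random.choice(
            ["Let x = 4, so the answer is 8."] * 19      # frequent correct trace
            + ["Assume x = 5, then the answer is 10."]   # rare incorrect trace
        )

    def is_correct(q, trace):
        return trace.endswith("8.")

    conditioned = build_failure_prefix_prompts("What is 2 * 4?", sample_rollout, is_correct)
    for p in conditioned[:3]:
        print(p.to_prompt())
```

An iterative variant, as the abstract suggests, would periodically rerun this collection step with the current policy so that the failure prefixes track the model's remaining weaknesses rather than stale early-training errors.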